From Probabilistic to Differentiable Programming: a Programming Language Approach to AI

In his popular-scientific book The Master Algorithm, Pedro Domingos identifies five main schools of thought in Machine Learning (ML, for short), which he calls tribes. Each tribe has its own definition of learning as well as a family of algorithms capturing the learning process. In this post I start from Domingos’ classification and look at ML (and its schools of thought) from a different perspective, one which I think is becoming quite popular both in the ML and in the Programming Language Theory (PLT, for short) communities: the programming language perspective.

Before entering the details, let me make a couple of clarifications about this post.

  1. First of all, this is a work-in-progress post. This means that I will publish partial versions of it (the reason for that is in point 2), with the goal of having a complete post in a few weeks. Any comment, suggestion, or correction is obviously very welcome.
  2. Second, I do not know much about ML. I am writing this post trying to systematise the things I am studying as well as to make connections between ML and PLT explicit. Some of these connections are well-known and I will try to give references for them. The others are based on personal thoughts and considerations. They could be trivial, meaningless, or well-known in some specific fields. In the latter case any pointer to references will be very much appreciated.

The Five Tribes of ML

Let us start by briefly recalling Domingos’ five tribes of ML. These are summarised in the following table (taken from Domingos’ slides).

Tribe          | Origins              | Master Algorithm
Symbolists     | Logic, Philosophy    | Inverse deduction
Connectionists | Neuroscience         | Back propagation
Evolutionaries | Evolutionary biology | Genetic programming
Bayesians      | Statistics           | Probabilistic inference
Analogisers    | Psychology           | Kernel machines

A nice explanation of this classification is given by Domingos himself in this video (and, of course, in his book). Intuitively, each school of thought proposes a notion of learning, and a concrete way to achieve learning according to that notion. For instance, Bayesians identify learning with reducing uncertainty, and achieve such reduction through statistical/probabilistic inference. Connectionists identify learning with the strengthening (and weakening) of synaptic connections between (artificial) neurons, and achieve such strengthening through back propagation. Concerning connectionists, the above table is a bit outdated, as it does not take into account advances in Deep Learning and the more recent trend of Differentiable Programming. The latter, together with probabilistic programming, is probably the best example of the advantages of a linguistic approach to machine learning. For this reason, in this post I will focus on probabilistic and differentiable programming only (both will be formally introduced later), leaving the analysis of the other approaches to ML as future work.

Before looking at the programming language approach to ML—by the way, suggestions for a better name are appreciated—let me briefly recall some crucial points made in Colah’s blog post (an enlightening read) about deep learning and functional programming. These observations are probably the best way to introduce some guiding intuitions behind the idea of a linguistic approach to machine learning, and they will later constitute the starting point for introducing differentiable programming.

From Deep Learning to Functional Programming (and Back?)

Deep Learning (DL, for short) is a connectionist approach to machine learning that focuses on the use of deep neural networks (although other kinds of networks are used as well, and the distinction between deep and shallow neural networks is not that clear). The basic component of a neural network, either deep or shallow, is the artificial neuron. For us, the latter is just a function F_{\mathbf{w}}: \mathbb{R}^n \to \mathbb{R} of a special form (for simplicity, we restrict our attention to real numbers).

An n-ary neuron takes as input n real numbers, i.e. an element \mathbf{x} \in \mathbb{R}^n, and gives to each input x_i a weight w_i. That is, the neuron is parametric with respect to a weight vector \mathbf{w} \in \mathbb{R}^n (if you are familiar with this topic you may wonder why we are not considering the bias b of the neuron; for simplicity we encode it in \mathbf{w}, e.g. by means of a trivial extra input). The neuron then takes the linear combination of the input vector with the weight vector, \mathbf{w}^\top \mathbf{x} = \sum_{i =1}^n w_i x_i, and passes the result to the activation function f (a better notation for the neuron would be F_{\mathbf{w},f}, so as to emphasise its dependency on both the weights and the activation function).
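
To fix ideas, here is a minimal sketch of a neuron as a plain (Haskell) function. The names (Vec, dot, sigmoid, neuron) are mine, and the choice of sigmoid as activation is arbitrary; the point is only that a neuron is an ordinary function of the shape just described.

    type Vec = [Double]

    -- Linear combination of weights and inputs: w^T x.
    dot :: Vec -> Vec -> Double
    dot w x = sum (zipWith (*) w x)

    -- One possible activation function.
    sigmoid :: Double -> Double
    sigmoid z = 1 / (1 + exp (negate z))

    -- F_{w,f}: fixing an activation f and a weight vector w yields a
    -- function from R^n to R.
    neuron :: (Double -> Double) -> Vec -> (Vec -> Double)
    neuron f w x = f (dot w x)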

A neuron alone cannot do much. We use network layers to combine neurons together. A k-ary network layer is a finite vector of neurons \begin{bmatrix} F_{\mathbf{w}_1}^{1} \\ \vdots \\ F_{\mathbf{w}_k}^{k} \end{bmatrix}. For simplicity, we assume that each neuron F_{\mathbf{w}_i}^{i} receives as input the same vector \mathbf{x} \in \mathbb{R}^n, so that we can group the neurons in the layer together, obtaining a function F_{\mathbf{w}}: \mathbb{R}^n \to \mathbb{R}^k, defined by:

    \[F_{\mathbf{w}}(\mathbf{x}) = \begin{bmatrix} F_{\mathbf{w}_1}^{1}(\mathbf{x}) \\ \vdots \\ F_{\mathbf{w}_k}^{k}(\mathbf{x}) \end{bmatrix}.\]

Notice that F_{\mathbf{w}} is parametric in the weight vector \mathbf{w}= \begin{bmatrix} \mathbf{w}_1 \\ \vdots \\ \mathbf{w}_k \end{bmatrix}.
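
Continuing the Haskell sketch above, a layer is just a list of neurons, all fed the same input vector:

    -- A k-ary layer: k neurons applied to the same input vector, i.e. a
    -- function from R^n to R^k.
    layer :: [Vec -> Double] -> (Vec -> Vec)
    layer neurons x = map ($ x) neurons
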
The basic idea behind deep neural networks is that F_{\mathbf{w}}(\mathbf{x}) can itself be used as an input vector for an m-ary layer \begin{bmatrix} G_{\mathbf{v}_1}^{1} \\ \vdots \\ G_{\mathbf{v}_m}^{m} \end{bmatrix} of k-ary neurons. Again, this further layer can be formalised as a function G_{\mathbf{v}}: \mathbb{R}^k \to \mathbb{R}^m parametric in the weight vector \mathbf{v}. In particular, we have the composed function G_{\mathbf{v}} \circ F_{\mathbf{w}}: \mathbb{R}^n \to \mathbb{R}^m:

    \[(G_{\mathbf{v}} \circ F_{\mathbf{w}})(\mathbf{x}) = \begin{bmatrix} G_{\mathbf{v}_1}^{1}(F_{\mathbf{w}}(\mathbf{x})) \\ \vdots \\ G_{\mathbf{v}_m}^{m}(F_{\mathbf{w}}(\mathbf{x})) \end{bmatrix}.\]

The function G_{\mathbf{v}} \circ F_{\mathbf{w}} is parametric with respect to the weight vector \mathbf{u} = \begin{bmatrix} \mathbf{w} \\ \mathbf{v} \end{bmatrix}, and, as before, (G_{\mathbf{v}} \circ F_{\mathbf{w}})(\mathbf{x}) can be passed as input to yet another layer. We can repeat this game as many times as we wish. A neural network is nothing but the function obtained by composing these layers (the last layer, called the output layer, usually collects the results obtained and organises them in the desired output format).
In the end, a neural network is de facto a function F_{\mathbf{w}}: \mathbb{R}^n \to \mathbb{R}^m of the form:

    \[F_{\mathbf{w}_k}^k \circ \cdots \circ F_{\mathbf{w}_1}^1,\]

where

    \[\mathbb{R}^n = \xymatrix{ \mathbb{R}^{n_1} \ar[r]^{F_{\mathbf{w}_1}^1} & \mathbb{R}^{n_2} \ar[r]^{F_{\mathbf{w}_2}^2} & \cdots \ar[r]^{F_{\mathbf{w}_k}^k} & \mathbb{R}^{n_{k+1}} } = \mathbb{R}^m.\]
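
In the Haskell sketch, building such a network is literally function composition; the sizes and weights below are made up purely for illustration.

    -- A toy network R^2 -> R^1 as the composition of three layers, reusing
    -- 'layer', 'neuron' and 'sigmoid' from the sketches above.
    network :: Vec -> Vec
    network = layer3 . layer2 . layer1
      where
        layer1 = layer [ neuron sigmoid w | w <- [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]] ]  -- R^2 -> R^3
        layer2 = layer [ neuron sigmoid w | w <- [[0.2, -0.3, 0.1], [0.4, 0.0, -0.2]] ]   -- R^3 -> R^2
        layer3 = layer [ neuron id [1.0, -1.0] ]                                          -- R^2 -> R^1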

Summing up, neurons are functions of a specific form, and neural networks are themselves functions, the latter being essentially obtained by composing neurons. The structure of a neural network can thus be seen as a higher-order function: it takes neurons (as well as other neural networks) as input, composes them, passes them around, and duplicates them (in the ML community the technique of using the same neuron in different ‘places’ of a network is called weight tying). In the previously mentioned blog post, Colah shows that several neural network patterns (such as recurrent and convolutional neural networks) can be seen as instances of well-known (functional) combinators, such as map, fold, unfold, and zip. This suggests some kind of connection between functional programming (FP, for short) and DL. We could think of DL as a way of doing functional programming, the only difference with ordinary FP being the manipulation of neurons and layers (i.e. special kinds of functions) rather than ‘plain’ functions.
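
In miniature, and still within the toy Haskell sketch (the names sharedMap and recurrent are mine, and cell stands for a hypothetical recurrent cell), Colah’s observation looks as follows: applying one shared neuron at every position of a sequence is a map, while threading a hidden state through a sequence with the same cell at every step (weight tying again) is a fold.

    -- A convolution-like pattern: one shared (tied) neuron mapped over a
    -- list of input windows.
    sharedMap :: (Vec -> Double) -> [Vec] -> [Double]
    sharedMap tiedNeuron = map tiedNeuron

    -- A recurrent pattern: folding a cell (state -> input -> new state)
    -- over an input sequence, starting from an initial hidden state.
    recurrent :: (Vec -> Vec -> Vec) -> Vec -> [Vec] -> Vec
    recurrent cell initialState = foldl cell initialState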

The question we should ask is now obvious: can we rethink DL as a very specific form of (functional) programming? The positive answer is given by Differentiable Programming which, as we will see, shows that we can also forget about neurons and neural networks—in a slogan, it is enough to do functional programming with differentiable functions. I will write a lot about that, but first I want to spend some words on a methodological point. The above analysis is, in a way, completely independent of DL, neural networks, etc. It is just a linguistic (actually a programming language) approach to DL. Instead of working directly with the objects one is interested in (neural networks in this case), a specific language is designed, capable of expressing those objects and of manipulating them in such a way that some good properties are preserved (and the good properties we are interested in essentially amount to requiring functions to be learnable in a certain sense). This approach can be used to rethink the other tribes of ML as well. Actually, for some of them this has already been done, probabilistic programming being one of the best examples.

In the next paragraphs I will expand on this ‘programming language approach to ML’ at a very general level. I will try to trace an informal correspondence between learning and computability, arguing that as ordinary programming languages approach computability linguistically (designing languages describing all and only computable functions), programming languages for learning (in the spirit of probabilistic and differentiable programming) should also approach learning linguistically (thus allowing us to express all and only learnable functions). Moreover, just as we do not have a general, universal definition of notions like those of a program or of a computation—rather, those definitions are given with respect to a paradigm (e.g. computation as evaluation, programs as functions)—the very same thing happens when studying learning. Each tribe has its own definition of learning model and learning engine, just as each programming paradigm has its own definition of program and computation engine.

Again, the correspondence I will try to outline is not (and thus should not be taken as) formal. Rather, it is methodological. It helps in suggesting new (?) ways to approach problems, using tools from e.g. PLT. As an example, one of the main long-term goals of ML is the integration of different learning techniques (e.g. symbolic and non-symbolic methods). Combining different approaches to ML is not only a hard problem, it is also an unclearly stated one. However, if we look at those approaches from a programming language perspective, the problem can be clearly stated as the addition of specific features to a language, pretty much the same way we add imperative features to functional languages. Of course, that does not mean that we can easily do it, but at least we formally know what our goal is. Once we approach schools of thought in ML linguistically, integrating them is morally close to designing multi-paradigm programming languages.

A Programming Language Perspective on Computation

After having read the previous paragraphs, a question naturally arises: how does PLT relate to ML? To answer this question, we look back at the very birth of computer science (I know, this sounds scary, but trust me, it is not). Computer Science was born to answer the question of which functions (and thus problems) can be computed in a purely mechanical way, meaning that the value of the function of interest can be obtained by an effective procedure. More formally, we say that a function f: \mathbb{N}^k \to \mathbb{N} is computable if and only if there exists a mechanical procedure computing f(\mathbf{x}), for any \mathbf{x} \in \mathbb{N}^k.

Of course, to make the above definition meaningful we have to say what a mechanical procedure is. Historically, such a definition has been provided via the notion of Turing Machine (other definitions have been proposed, but none of them was, at first, as universally accepted as that of the Turing Machine). We can then say that a function f: \mathbb{N}^k \to \mathbb{N} is computable if and only if there exists a Turing Machine M computing f.

What I would like to stress about this definition is that it is intrinsically mathematical. A (mathematical) universe of functions and numbers is assumed (the one given by Zermelo-Fraenkel set theory, for instance), and a region of this universe is identified as the one containing computable functions. Calling \mathcal{F} the universe of all functions (I do not want to be formal talking about sets, classes etc) and \mathcal{U} the universe of all computable functions, Turing proved that \mathcal{U} is strictly smaller than \mathcal{F}. Equivalently, there exist non-computable functions.

The situation is now as follows. We use the language of mathematics (set theory) to talk about functions. Among the functions expressible in such a language (which are all functions, since the very definition of function is given inside the language of set theory) some are computable. Every time we talk about a function we should check whether it is computable. Every time we modify, compose, or make two or more computable functions interact, we have to prove that the resulting function is still computable. The language of set theory is just too expressive to allow us to talk about the entities in \mathcal{U} in a truly compositional way.

Here comes the ‘linguistic approach’. Mathematics can be roughly classified into synthetic and analytic. As a matter of fact, almost all of modern (mainstream) mathematics is analytic, since it is all done inside set theory. Every concept (e.g. topological spaces, geometrical figures, lines, numbers, etc.) is ultimately a set, and thus it is analysed as a set. A different perspective is offered by some concepts and ideas in category theory (such as the notion of topos and the related field of synthetic differential geometry). Instead of finding the ‘ultimate components’ of which entities are made, the existence of basic entities is postulated (e.g. the existence of points and lines) and a universe of objects is built from them according to given laws and constructions. In the end we do not know what those entities are (we cannot state that a line is a set of points satisfying certain properties), but we know how they behave and what we can do with them. Most importantly, we are ensured that everything we do in our synthetic universe will remain in that universe.

Logic and Computer Science improve the synthetic approach by stressing its linguistic components. The focus is not directly on the synthetic universe we want to build, but on the language used to describe (and to talk about) objects in that universe. The goal is to obtain a language that fully captures the target universe: everything that can be expressed in the language corresponds to an object in the universe, and any object in the universe can be described linguistically. This is nothing but another way to phrase the usual soundness and completeness theorems we prove in logic and semantics.

Church’s \lambda-calculus can be seen as the linguistic counterpart of Turing’s analysis of computability. In fact, the \lambda-calculus provides a language that completely describes \mathcal{U}, the universe of all computable functions. Every phrase we can write in the \lambda-calculus (i.e. every \lambda-term) corresponds to a computable function, and every computable function can be written as a \lambda-term. Not only do we have a synthetic universe of computable functions, but we also have a complete, linguistic description of it. This is the essence of programming languages: the reduction of computation to a linguistic act.

This is a very nice philosophical—romantic if you wish—though historically inaccurate view on computability and PLT. But, again, how does that relate to ML?

A Programming Language Perspective on Learning

Shifting from computation to learning, the connection between PLT and ML can be roughly thought of as a linguistic account of learning. As an ordinary programming language is a medium to write computable functions, a programming language for learning is a medium to write ‘learnable functions’. And as standard programming languages are based on different programming paradigms (imperative, functional, logic, etc.) which ultimately give a notion of computation (computation as evaluation, state dynamics, or logical inference), so programming languages for learning are based on different learning paradigms which encode different points of view on learning (learning as logical induction, or as uncertainty reduction). It should come as no surprise that those learning paradigms are the ones of the five tribes previously described.

Programming paradigm | Program                   | Computation model
Functional           | Function                  | Reduction/substitution
Logic                | Facts and inference rules | Logical inference
Imperative           | Sequence of statements    | State transformation
Object-oriented      | Objects                   | Message passing

The above table is surprisingly (?) similar to the one of the five tribes of ML. In fact, roughly speaking (we will refine this correspondence later) programming paradigms correspond to tribes, programs to (learning) models, and models of computation to master algorithms.

Let me clarify the terminology a bit (again, suggestions for improvements are very welcome), as at this point it is somewhat confusing (and misleading). Different authors use different names for the same concepts, and this makes informal (i.e. non-symbolic) communication hard. Concerning PLT, we follow [CTMCP] and define a computation model as a formal system that defines how computations are done. Therefore, a programming paradigm defines a notion of program and a computation model, the former being the object to compute and the latter specifying how computation is performed (and thus, ultimately, what computation is). For instance, the functional paradigm defines programs to be (functional) expressions and uses reduction/substitution (e.g. \beta-reduction) as its computation model (this model is called the substitution model of computation in e.g. [SICP]). For simplicity, we refer to the act of computing a program as the evaluation or execution of the program. According to this convention, a model of computation defines a notion of evaluation.
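
As a tiny example of the substitution model at work, evaluating the functional program (\lambda x.\, x + x)\, 3 amounts to substituting the argument for the bound variable and then simplifying:

    \[(\lambda x.\, x + x)\, 3 \;\to_{\beta}\; 3 + 3 \;\to\; 6.\]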

Shifting from the realm of computation to the one of learning, we find a similar classification, although with a different terminology. Each tribe proposes a learning paradigm. As each programming paradigm defines the objects to compute (programs) and a notion of computation for them (the computation model), so each learning paradigm has to define the objects to learn and a notion of learning for them. Unfortunately, such objects are called learning models, which I think is a bad name that creates a lot of confusion. For instance, the Bayesian paradigm defines learning models as probabilistic models (such as Bayesian networks), whereas the connectionist paradigm defines learning models as neural networks (either deep or shallow). Note that the learning model is what should be learnt, and not the way something is learnt (i.e. the learning procedure). The latter we call the fitting model, as learning consists in fitting models to data. Again, for Bayesians fitting means performing statistical inference (according to data/observations), whereas for connectionists fitting means strengthening synaptic paths between neurons, i.e. modifying the weights of a neural network according to backpropagation. Summing up, we have the following table.

Computation                    | Learning
Programming paradigm           | Learning paradigm
Program                        | (Learning) model
Computation model (evaluation) | Fitting model (fitting)

We now start focusing on the first two paradigms we are interested in: Bayesians and connectionists.

  1. For Bayesians, a (learning) model is a probabilistic model (e.g. a Bayesian network). The model is a representation of (the knowledge the agent has of) the world. Learning means fitting the model to data (which are observations), i.e. reducing uncertainty according to the observations performed. This is achieved through statistical inference, which gives the laws describing how uncertainty evolves with respect to observations (Bayes’ rule, recalled right after this list, being the prototypical such law).
  2. For connectionists a (learning) model is a neural network. Learning means strengthening synaptic connections/paths according to data. The fitting model describing this conception of learning is given by backpropagation, which is nothing but a form of automatic differentiation.
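
For the record, the update Bayesians perform when fitting a model to observed data D is Bayes’ rule, turning a prior belief P(h) over hypotheses into a posterior:

    \[P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}.\]

Reducing uncertainty then amounts to repeatedly applying (or, in practice, approximating) this update as observations come in.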

Now we have two paradigms, each consisting of a notion of (learning) model and of a fitting model (statistical inference for Bayesians, and automatic differentiation for connectionists). Pushing the analogy between programming and learning one step further, what we have to do is to define languages for writing learning models and engines to ‘perform’ fitting. In fact, a programming language roughly comes with an appropriate syntax for writing programs, together with an engine (which for simplicity we call the interpreter) to evaluate such programs (according to the specific programming paradigm). For instance, we expect a functional programming language to come with enough syntactic primitives to write functional expressions (such as \lambda-abstractions and higher-order combinators), as well as an interpreter which evaluates such expressions according to the substitution model of computation. When we look at languages for learning, expressions should define learning models (e.g. probabilistic models or neural networks), whereas the interpreter should fit the model defined by an expression to data (e.g. by performing statistical inference on the probabilistic model or by differentiating the neural network—we will see in the next paragraph what that means).

Let us clarify the above point with an example. A probabilistic programming language (thus a language for the Bayesian learning paradigm) should have enough primitives to write probabilistic models. Those include primitives for probabilistic choices, sampling, and forms of conditioning. Once we have written a (probabilistic) model, the interpreter should perform statistical inference on it (with respect to given data), thus fitting it (to data). What about deep learning? The answer is differentiable programming, to which we dedicate the next paragraph. Before doing so, let me state what its essence is. I do not expect you to see why yet (that will be explained later), but if you accept it, then you can immediately see some advantages of rethinking DL from a programming language perspective. The main observation behind differentiable programming is that, after all, deep learning [BPR]

[…] essentially amounts to writing program templates of potential solutions to a problem, which are constructed as differentiable directed graphs assembled from functional blocks whose parameters are learned from examples using gradient-based optimisation.

That is, in DL (and more generally in the connectionist approach to ML) the learning model is some kind of differentiable function, whereas fitting is performed using gradient-based optimisation (obtained via automatic differentiation). Therefore, a programming language for deep learning needs a syntax for writing differentiable functions (as well as primitives for differentiation) and an interpreter performing gradient-based optimisation. Moreover, in light of Colah’s observations, we expect such a language to be higher-order, allowing us to write the functional combinators encoding the solution patterns of deep neural networks seen previously.
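
To get a feeling for what such an interpreter has to do, here is a minimal, self-contained Haskell sketch of gradient-based fitting (all names are mine). For simplicity the gradient is approximated numerically; a real differentiable-programming engine would obtain it by automatic differentiation instead.

    type Params = [Double]

    -- Central-difference approximation of the gradient of a loss function.
    numericalGrad :: (Params -> Double) -> Params -> Params
    numericalGrad loss ps =
      [ (loss (bump i eps) - loss (bump i (negate eps))) / (2 * eps)
      | i <- [0 .. length ps - 1] ]
      where
        eps = 1e-6
        bump i d = [ if j == i then p + d else p | (j, p) <- zip [0 ..] ps ]

    -- Fitting: repeatedly step the parameters against the gradient.
    gradientDescent :: Double -> Int -> (Params -> Double) -> Params -> Params
    gradientDescent rate steps loss = go steps
      where
        go 0 ps = ps
        go n ps = go (n - 1) (zipWith (\p g -> p - rate * g) ps (numericalGrad loss ps))

    -- Example: fit y = a*x + b to two data points by least squares.
    fitted :: Params
    fitted = gradientDescent 0.1 1000 loss [0, 0]
      where
        loss [a, b] = sum [ (a * x + b - y) ^ (2 :: Int) | (x, y) <- [(1, 3), (2, 5)] ]
        loss _      = error "two parameters expected"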

We can now see the advantages of such a linguistic approach to DL. Writing learning models in an ordinary language does not guarantee those models to be differentiable (and thus learnable). Moreover, every time we define a new operation on learning models (such as composition) we have no guarantee that the result will be differentiable as well, and thus indeed a learning model. Just as the \lambda-calculus guarantees that every program/term written in its language is computable, a differentiable \lambda-calculus guarantees that every model/term written in its language is differentiable, and thus learnable (meaning that we can fit it to data).
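
The guarantee we need for composition, for instance, is nothing but the chain rule: if F and G are differentiable, then so is G \circ F, with derivative (Jacobian)

    \[D(G \circ F)(\mathbf{x}) = DG(F(\mathbf{x})) \cdot DF(\mathbf{x}).\]

A differentiable language is designed so that this fact (and its analogues for the other constructions of the language) holds by construction.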

Learning paradigm | Model                    | Fitting model
Bayesian          | Probabilistic model      | Statistical inference
Connectionist     | Differentiable function  | Gradient-based automatic differentiation
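
Before moving on, here is a minimal Haskell sketch of the Bayesian row of the table: a toy ‘language’ of discrete probabilistic models as a monad of weighted outcomes, with primitives for probabilistic choice and conditioning, and exact inference by enumeration as its (very naive) fitting engine. All names (Dist, bernoulli, condition, infer) are mine; real probabilistic programming languages offer the same kind of primitives, but with far more serious inference engines.

    import Data.Function (on)
    import Data.List (groupBy, sortBy)

    -- A discrete distribution: a list of outcomes with their weights.
    newtype Dist a = Dist { runDist :: [(a, Double)] }

    instance Functor Dist where
      fmap f (Dist xs) = Dist [ (f x, p) | (x, p) <- xs ]

    instance Applicative Dist where
      pure x = Dist [(x, 1)]
      Dist fs <*> Dist xs = Dist [ (f x, p * q) | (f, p) <- fs, (x, q) <- xs ]

    instance Monad Dist where
      Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

    -- Primitive probabilistic choice.
    bernoulli :: Double -> Dist Bool
    bernoulli p = Dist [(True, p), (False, 1 - p)]

    -- Conditioning: discard the weight of outcomes violating an observation.
    condition :: Bool -> Dist ()
    condition True  = Dist [((), 1)]
    condition False = Dist []

    -- Exact, normalised inference by enumeration: the 'fitting engine'.
    infer :: Ord a => Dist a -> [(a, Double)]
    infer (Dist xs) = [ (x, p / total) | (x, p) <- grouped ]
      where
        grouped = map (\g -> (fst (head g), sum (map snd g)))
                      (groupBy ((==) `on` fst) (sortBy (compare `on` fst) xs))
        total   = sum (map snd grouped)

    -- A toy model: flip two fair coins, observe that at least one is heads,
    -- and ask for the posterior of the first flip.
    model :: Dist Bool
    model = do
      first  <- bernoulli 0.5
      second <- bernoulli 0.5
      condition (first || second)
      return first
    -- infer model evaluates to (approximately) [(False, 1/3), (True, 2/3)]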

It is now time to start looking at our running examples seriously (i.e. formally), namely probabilistic and differentiable programming. The plan is to review the mathematical principles behind them, and to design (and study) core languages according to the previous considerations (notably probabilistic and differentiable \lambda-calculi). Let us start with differentiable programming (DP, for short).

Differentiable Programming

[under construction]

Probabilistic Programming

[under construction]

References
