The Likelihood of Programs

There is a neat way to frame unsupervised learning as likelihood maximization, but not in the usual way where you just compute the likelihood of the data using a model and ignore the likelihood of the model itself. Rather, this is the combined likelihood of model and data, where the model is treated as a sequence of code symbols. Imagine concatenating the model's code symbols with the data to form one long sequence. If every symbol in that sequence can be assigned a probability from the preceding symbols, then the total likelihood would be the product of those probabilities. The data is explained as a sequence of probabilistic 'events', the symbols in the sequence. In this context, the goal of unsupervised learning would be to find the most likely explanation of the data by finding the most likely sequence of events that includes the data.

But how would you compute the probability of the model's code symbols? You can use the model to compute the probabilities of the data, but what do you use for the model? There are at least two different ways to do it:

First, this is the realm of Algorithmic Information Theory, and the usual approach in this field is to use the length of the model's code, measured in bits. Suppose the model's binary code were randomly generated using a uniform distribution over the possible values ("0", "1", and "END"). As soon as we sample "END" the program is done. The $\log_3$ likelihood of a single code symbol is $\log_3(1/3) = -1$, so the $\log_3$ likelihood of a model $N$ bits long is $-(N+1)$, counting the terminating "END". So if you wanted the combined $\log_3$ likelihood of model and data, you would compute the $\log_3$ likelihood of the data from the model and subtract $N+1$. (Technical side note: there are many ways to get a normalized distribution over models. The "END" symbol can be omitted if the program specifies its own length, giving a uniform distribution over just "0" and "1" and a log likelihood of exactly $-N$, but I think it's easier to think in terms of an "END" symbol.)
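
As a minimal sketch of this bookkeeping, here are a few lines of Python. The function name and the example numbers are my own, and how you obtain the data's log likelihood from the model is left as a stand-in argument.

```python
def combined_log3_likelihood(model_bits: str, data_log3_likelihood: float) -> float:
    """Combined log_3 likelihood of model + data under the uniform
    ("0", "1", "END") prior described above.

    model_bits: the model's binary code as a string of "0"/"1" characters.
    data_log3_likelihood: the log_3 likelihood the model assigns to the data
        (how you compute this depends entirely on your model).
    """
    # Each code bit and the terminating "END" symbol contribute log_3(1/3) = -1 each.
    model_log3_likelihood = -(len(model_bits) + 1)
    return model_log3_likelihood + data_log3_likelihood

# Example with made-up numbers: a 10-bit model that assigns the data a log_3 likelihood of -4.5.
print(combined_log3_likelihood("0110101001", -4.5))  # -> -15.5
```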

Before jumping into the second method for computing model probability, there are some interesting observations we can make. The fact that the log likelihood of a model is essentially its negative length means that longer programs are always less likely. That seems like a good thing, since there are infinitely many possible programs and most of them are very long, but the probability distribution over all of them must sum to 1. It's reminiscent of Occam's razor. You can come up with a big, complicated model to explain some data, but the act of making your model bigger makes it less likely. Smaller models are inherently more likely, but when you account for the fact that larger models can explain the data better, the optimal model hits a sweet spot that isn't too large (overfit) or too small (underfit). A benefit of considering the combined model+data likelihood in machine learning is that you wouldn't need to worry about any explicit regularization. There is only "more likely" or "less likely" as you search over different models. Unfortunately, it's an endless search, since the only way to know for certain that we found the most likely model is to evaluate every possible model, which means searching the space of all possible programs. Some programs never halt (finish their computation). We can't even know which ones will or won't halt: this is the well-known halting problem, and it's undecidable. So, we'll have to be content with "the most likely model found so far".
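
To make the trade-off concrete, here is a toy comparison in Python. The candidate models and all of the numbers are invented purely for illustration; the point is only that the combined likelihood picks out a model that is neither the smallest nor the best-fitting.

```python
# Each candidate is (model code length in bits, log_2 likelihood the model assigns to the data).
# All numbers are invented for illustration.
candidates = {
    "tiny model (underfits)": (50, -2000.0),
    "medium model": (400, -1200.0),
    "huge model (overfits)": (5000, -1150.0),
}

def combined_log2_likelihood(model_bits: int, data_log2: float) -> float:
    # Under a uniform prior over bit strings, the model's log_2 likelihood is
    # roughly minus its length in bits.
    return -model_bits + data_log2

for name, (bits, data_log2) in candidates.items():
    print(f"{name}: {combined_log2_likelihood(bits, data_log2):.1f}")
best = max(candidates, key=lambda name: combined_log2_likelihood(*candidates[name]))
print("most likely:", best)  # -> "medium model"
```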

Self-Modeling Programs

Ok, so a second method for computing the probability of a model is to use the earlier code symbols of the model to compute the probabilities of the later ones, similar to how we use the model to compute the probabilities of later data symbols from the preceding data symbols. First, to be clear on the terminology, let's define the program to be the concatenation of model code and data:

In a self-modeling program, both model and data are considered code symbols in a special programming language. The probability distribution for the Nth program symbol is computed by executing the partial program formed by the preceding symbols 0 to N-1.

This sounds weird at first, but it has a kind of elegance. Most large language models work by executing the model on the past inputs to compute a probability distribution over the possible next symbols. We usually think of the model as the program and the past data symbols as the program inputs, but it's also valid to think of the past data as the program code and the model as a kind of interpreter for that code. So let's create a new programming language that accepts as valid code symbols both the regular code symbols of any Turing complete language and all of the possible symbols in the data sequence. We want to be able to always generate an output probability distribution from the past symbols of the combined model+data sequence, so let's add a restriction that the execution of any program in the language must always return a probability distribution over symbols, even for partial or invalid programs. Let's call these programs self-modeling programs.
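
Here is one way the core loop might look as a sketch in Python. The `execute_partial` interpreter is entirely hypothetical; building it is exactly the hard part of defining the language, and this sketch only shows how its output would be turned into a program likelihood.

```python
import math
from typing import Callable, Mapping, Sequence

def program_log_likelihood(
    program: Sequence[str],
    execute_partial: Callable[[Sequence[str]], Mapping[str, float]],
) -> float:
    """Log likelihood of a self-modeling program.

    execute_partial(prefix) stands in for the interpreter of the self-modeling
    language: it executes the partial program `prefix` and must always return a
    probability distribution (symbol -> probability) over the next code symbol,
    even for an empty, partial, or invalid prefix.
    """
    total = 0.0
    for n, symbol in enumerate(program):
        distribution = execute_partial(program[:n])
        p = distribution.get(symbol, 0.0)
        if p == 0.0:
            return float("-inf")  # the partial program rules this symbol out entirely
        total += math.log(p)
    return total
```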

There's a lot to unpack here. The program can contain complex expressions defining how to compute the probabilities of future symbols from past ones, but it also includes regular data symbols. Consider this example in natural language:

I will count to ten. 1 2 3 4 5 6 7 8 9 10.

In this example, 1 2 3 4 5 6 7 8 9 10 is the data sequence and I will count to ten is the model code we are prepending to the data to form the entire program. To be clear, natural language isn't a well-defined self-modeling programming language, but it works well as an illustration. If this were a self-modeling program, we would compute the probability of every symbol (character) by executing the program formed by the preceding characters. The programming language would define how to do this, but it could start with a uniform distribution over all characters until we finish parsing a full expression. In this case I will count to ten might be a valid expression in our programming language, which has the effect of making the symbols in the sequence 1 2 3 4 5 6 7 8 9 10 have very high probabilities, possibly even at or near 1.
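
To make this a bit more concrete, here is a toy stand-in interpreter in Python. It is nothing like a real Turing complete self-modeling language: the only "expression" it understands is the literal directive I will count to ten., after which it predicts the counting sequence with probability 0.99 per character. The alphabet, probabilities, and structure are all invented for illustration.

```python
import math

ALPHABET = [chr(c) for c in range(32, 127)]  # printable ASCII as the symbol set
DIRECTIVE = "I will count to ten."
CONTINUATION = " 1 2 3 4 5 6 7 8 9 10."

def execute_partial(prefix: str) -> dict:
    """Return a probability distribution over the next character."""
    if prefix.startswith(DIRECTIVE) and len(prefix) < len(DIRECTIVE) + len(CONTINUATION):
        predicted = CONTINUATION[len(prefix) - len(DIRECTIVE)]
        # Put almost all of the probability mass on the predicted character.
        dist = {ch: 0.01 / (len(ALPHABET) - 1) for ch in ALPHABET if ch != predicted}
        dist[predicted] = 0.99
        return dist
    # Until a full expression has been parsed: uniform over the alphabet.
    return {ch: 1.0 / len(ALPHABET) for ch in ALPHABET}

program = DIRECTIVE + CONTINUATION
log_likelihood = sum(math.log(execute_partial(program[:n])[ch])
                     for n, ch in enumerate(program))
print(log_likelihood)  # the directive is "expensive", the predicted digits are nearly free
```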

In self-modeling programs, the data sequence by itself is always a valid program, but it may not be the most likely one. It's often possible to increase the likelihood by making the program longer. This is in contrast to the method of using program length, where shorter programs are always more likely. To summarize:

  • The self-modeling programming language must be Turing complete so it can define any possible computation.
  • The language must include the data symbols as valid code symbols.
  • The execution of any expression (even an empty, partial, or invalid one) must always result in a valid probability distribution over code symbols.
  • The probability of each code symbol is computed by executing the partial program formed from all preceding symbols.
  • The program likelihood is the product of the probabilities of all code symbols in the program (written out as a formula below).
  • The most likely program for some data is the program with the highest likelihood among all possible self-modeling programs that end with that data.
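
Written out, if the program is the symbol sequence $s_0, s_1, \ldots, s_{N-1}$ and executing the partial program up to (but not including) symbol $n$ returns the distribution $P(\cdot \mid s_0, \ldots, s_{n-1})$, then the program likelihood is

$$L(s_0, \ldots, s_{N-1}) = \prod_{n=0}^{N-1} P(s_n \mid s_0, \ldots, s_{n-1}), \qquad \log L = \sum_{n=0}^{N-1} \log P(s_n \mid s_0, \ldots, s_{n-1}).$$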

The upshot of all of this is that it lets us explain the data as a sequence of probabilistic events, where each event can influence the probabilities of those that follow in any arbitrary way that a program can describe. Essentially, we have some observable data and are hypothesizing the events that led to its creation. We can evaluate different hypotheses by computing their likelihoods and pick the most likely one. A fun fact is that the most likely self-modeling program for some given data could be compressed to the shortest possible program, if perfect compression is achieved. This length is called the Kolmogorov complexity of the data.

Doesn't it matter which programming language we use? Yes, but Kolmogorov made the observation that in any Turing complete language you can construct an emulator for any other language. The emulator has a constant length, which means that converting a program from one language to another changes its length by at most a constant.
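
This is usually stated as the invariance theorem: if $K_{L_1}(x)$ and $K_{L_2}(x)$ denote the lengths of the shortest programs producing data $x$ in languages $L_1$ and $L_2$, then there is a constant $c_{L_1 L_2}$ (roughly, the length of the emulator) that depends only on the two languages and not on $x$, with

$$K_{L_1}(x) \le K_{L_2}(x) + c_{L_1 L_2}.$$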

So which method for measuring model likelihood is better? Measuring the length of a program is easy, but it's an indirect measure of model likelihood, and you need to compress the model as much as possible first to get the best measure. If you are searching for the most likely program, it is more straightforward to iterate through all short programs with the first method, but it's computationally impractical to get very far doing this. You could iterate through the most likely self-modeling programs by sampling code symbols, but this is also computationally impractical. A downside of self-modeling programs is that they require the construction of a special programming language, but they provide explicit code symbol probabilities and a very intuitive way to reason about program likelihood, as shown in the observations below.

Insights From Self-Modeling Programs

Here is a fun observation: if you use some hidden program to generate observable data, then given enough of this data, the hidden program can be reverse engineered by searching for the most likely program for that data. To see why, let's use the self-modeling program paradigm described above. We have some hidden program that generates observable code one symbol at a time by computing the next symbol's probability distribution and randomly sampling a new symbol. If the hidden program generates a deterministic sequence of data, then the data symbols will have been generated with probabilities of 1. If the data sequence is long enough, then the most likely program we could find must also assign a probability of 1 to every data symbol. Otherwise, it would not be the most likely program: we could construct an even more likely program by adding some extra code to our model that happens to match the hidden program, so that the probabilities of the data symbols are 1. This extra code will decrease the likelihood of the program by some constant amount, but if the data sequence is long enough the gains from adding the extra code will outweigh its cost.
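
A toy version of this argument in Python, with invented numbers: the hidden program deterministically emits a repeating three-symbol pattern, and we compare a "uniform" explanation against a "pattern" explanation that pays an assumed fixed code cost but then predicts every data symbol with probability 1.

```python
import math

EXTRA_CODE_COST = 30.0  # assumed log-likelihood cost of the extra pattern-matching code

def log_likelihood_uniform(n_symbols: int) -> float:
    # No extra model code; every symbol from the 3-symbol alphabet gets probability 1/3.
    return n_symbols * math.log(1.0 / 3.0)

def log_likelihood_pattern(n_symbols: int) -> float:
    # Pay the fixed code cost up front; after that every data symbol has probability 1.
    return -EXTRA_CODE_COST + n_symbols * math.log(1.0)

for n in (10, 30, 100):
    print(n, round(log_likelihood_uniform(n), 1), round(log_likelihood_pattern(n), 1))
# For short sequences the uniform explanation wins; once the sequence is long
# enough, the fixed cost of the extra code is outweighed by the perfectly
# predicted data and the pattern explanation becomes the more likely program.
```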

This result holds even if the data sequence isn't generated deterministically. Think of quantum mechanics, where we may assume that observable data like position and momentum comes from randomly sampling a probability distribution. In this case, the most likely program will still assign probabilities to the observations that match those computed by the hidden program. Otherwise, we could construct a more likely program that does. If the probability of a recurring event is $p$, then the most likely program is the one that also predicts probability $p$ for that event. Likelihood is maximized when the predicted probability matches the empirical probability of the event.
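
The last claim is just the usual maximum likelihood calculation. If an event occurs $k$ times out of $n$ opportunities and the program predicts probability $q$ each time, the log likelihood contribution is $k \log q + (n-k)\log(1-q)$, and setting the derivative to zero gives

$$\frac{d}{dq}\Big[k \log q + (n-k)\log(1-q)\Big] = \frac{k}{q} - \frac{n-k}{1-q} = 0 \quad\Rightarrow\quad q = \frac{k}{n},$$

which converges to the hidden program's $p$ as $n$ grows.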

So the most likely program will be functionally equivalent to the hidden one, in that they both will compute the same probabilities for observable symbols. But what if the hidden program randomly sampled unlikely symbols? This could make the most likely program functionally different from the hidden program, but the effect diminishes the more data you have. As more data is generated, the empirical probabilities converge to the true hidden ones and the most likely program's symbol probabilities also converge to the empirical probabilities.

So, it shouldn't be surprising that we can reverse engineer the laws of nature. Nature generates a lot of observable data, and the goal of physics researchers is essentially to find the most likely program for the observed data. Maybe learning itself could even be defined as the process of finding ever more likely programs to explain observed data. It's plausible that in visual processing, for example, concepts like 'cat', 'dog', or 'human' emerge naturally as code symbols in the program that explains our observations. Having these code symbols allows you to assign higher probabilities to the data you are observing.

What about practical implications for machine learning? One useful idea is that programs are fundamentally more likely when a few events occur often rather than many events occurring rarely. In other words, programs with more repetition are more likely. Thus, if you construct a machine learning model where each layer is the same function, that is inherently more likely than a model of the same code size where each layer is a different function. They may have the same program lengths, but the one with more repetition is more compressible. Of course, that's just the probability of the model. To get the program likelihood you would have to evaluate the model on the data, but when evaluating models is so computationally expensive it makes sense to try the models with higher probabilities first. Any kind of symmetry is more likely for the same reason. We may see so much symmetry in biology, for example, because genetic code containing symmetry is more likely to form than any near-equivalent code without symmetric repetitions.
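
A quick way to see the compressibility point, with toy strings standing in for real model code (the layer definitions below are made up and not meant to run as model code themselves):

```python
import zlib

# Two pretend model definitions of similar raw length: one reuses a single
# layer function twelve times, the other defines a different function per layer.
repeated = "\n".join("layer = relu(W @ x + b)" for _ in range(12))
distinct = "\n".join(f"layer{i} = op{i}(W{i} @ x + b{i})" for i in range(12))

print(len(repeated), len(zlib.compress(repeated.encode())))
print(len(distinct), len(zlib.compress(distinct.encode())))
# The repeated definition compresses to far fewer bytes, i.e. under the
# length-based prior it corresponds to a shorter, more likely program.
```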

Program likelihood is the best metric to use when comparing different models on the same data. Other metrics may still provide useful information, but for unsupervised/self-supervised learning or even supervised learning where the model output is a discrete probability distribution, the best model can be selected based on program likelihood.

So what about these large language models that ignore the likelihood of the model itself? Clearly, when the amount of data is small, it's important to consider the likelihood of your model. The more data you have, the less the model likelihood matters to the total program likelihood, assuming the model stays constant. In that case, you may be able to approximately ignore the model likelihood. However, longer data sequences naturally support bigger models, so program likelihood is still useful for comparing models of different sizes.

Ultimately, finding the most likely program is a search problem, which is computationally expensive. But even if you only search over a small number of models, program likelihood is the metric you want to use to select the best one. Computing program likelihood does take a little extra work, though. For a first approximation you could use the length of the model code, but this ignores the fact that some models are more compressible than others. Another option is to use a self-modeling programming language that can explicitly calculate program likelihood. In any case, I think this is a fascinating subject area and a useful tool that seems a little underappreciated.

April 11, 2024
Contact: email