Understanding Convolutions in Probability: A Mad-Science Perspective
After watching the recent 3Blue1Brown video on convolutions I realized that there is a surprising lack of articles on convolutions as they apply to probability. If you search online for "convolution" you will discover about a billion posts on "convolutional neural networks", which is a bit unfortunate since CNNs don't technically use convolutions (they use correlations, though there's a reasonable argument it doesn't matter) and convolution plays a very important and practical role in probability. Not to mention that viewing convolution from the perspective of probability might be one of the easiest ways of understanding it!
So in this post we're going to take a look at how to use convolutions, how to compute them and how they are defined mathematically... and we'll also throw in a bit of mad-science!
Callout: Before even beginning with the rest of this post I need to give special mention to Steven W. Smith's absolutely phenomenal Guide to Digital Signal Processing. This is the book that really made convolutions click for me and was a huge inspiration in writing this post. As an homage I have tried to make all my charts in a similar style to the ones he uses in his book. Now let's continue!
At a high level, in probability, a convolution is the way to determine the distribution of the sum of two random variables. That is, we can see it as a way of combining two probability distributions to create a third distribution, in much the same way we might use multiplication to combine two integers to make a third. This is very useful because, in all but a few special cases, the distribution of the sum of two random variables is not the same as the distribution of either individual random variable.
Summing random variables
A notable and fortunate exception regarding summing two random variables is the case of normally distributed random variables. For example suppose we have these two random variables that are normally distributed:
$$X \sim \mathcal{N}(\mu_x, \sigma_x)$$
$$Y \sim \mathcal{N}(\mu_y, \sigma_y)$$
That is, we have two random values sampled from two different normal distributions. It's easy to imagine a scenario where we want to know what the distribution of the sum of these might be:
$$Z = X + Y$$
$$Z \sim ?$$
It turns out that there is a very pleasant analytic solution to this:
$$Z \sim \mathcal{N}(\mu_x + \mu_y, \sqrt{\sigma_x^2 + \sigma_y^2})$$
If you aren't familiar with this wonderful fact then this post has already been worth your time, as this is a very useful tool to have for many probability problems.
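If you'd like to convince yourself of this fact numerically, here's a minimal simulation sketch (the means, standard deviations, and sample size are arbitrary choices for illustration, not values from anywhere in this post):

```python
import numpy as np

rng = np.random.default_rng(42)

# arbitrary example parameters
mu_x, sigma_x = 1.0, 2.0
mu_y, sigma_y = -3.0, 0.5

# sample each variable independently and add the samples together
x = rng.normal(mu_x, sigma_x, size=1_000_000)
y = rng.normal(mu_y, sigma_y, size=1_000_000)
z = x + y

print(z.mean())  # should be close to mu_x + mu_y = -2.0
print(z.std())   # should be close to sqrt(sigma_x**2 + sigma_y**2) ≈ 2.06
```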
Notes on convolution notation
The trouble is that not all real world probability distributions are well defined and most well defined probability distributions don't have this pleasant property. Convolutions are important because they allow us to define this operation (at least mathematically) on any probability distribution. Convolutions are particularly useful when we want to solve this problem for arbitrary discrete distributions.
We're going to adopt some notation that we'll be using throughout the post. We'll consider our two probability distributions \(f\) and \(g\), each will take argument \(t\) representing the value we want to know the probability of. Much of the work of convolutions comes out of the world of signal processing so \(t\) typically means "time", but we'll think of it more as an index.
Because we're going to make heavy use of discrete convolutions in this post we'll alternate between using \(f(t)\) when we want to think of our distribution as a function and \(f[t]\) when we want to think of it as an array. In either case this will mean "the probability of \(t\)".
The symbol used for convolutions is \(\ast\) and we typically denote the convolution of two functions as:
$$f\ast g = h$$
We'll stick to denoting the result of our convolution with \(h\). And since the result of the convolution is itself another probability distribution we can pass \(t\) into the result of convolution as well:
$$(f \ast g)(t)$$
With this we can write a generalized version of the sum of normal random variables we performed before:
$$X \sim f$$
$$P(X = t) = f(t)$$
$$Y \sim g$$
$$P(Y = t) = g(t)$$
$$Z = X + Y$$
$$Z \sim f \ast g$$
$$h = (f \ast g)$$
$$P(Z = t) = h(t) = (f \ast g)(t)$$
This latter notation works for summing random variables from any probability distributions, not just normals. Actually computing the result, however, is another issue. Let's start with our horrific example!
Building a Crab-Human Monster!
In this post we're going to explore the very common problem of constructing a horrific monstrosity: the fusion of a crab and a human wielding a chainsaw.
If you've ever done much mad science you'll know that a major problem is selectively determining the best crab and best chainsaw wielding human to use as candidates. After all, some crabs are just tasty snacks and some men wielding chainsaws are just lumberjacks... that would yield a lumber snack! Hardly the monster you're hoping for.
The common way to solve this is to have tournaments where crabs fight crabs and humans fight humans. This yields the strongest, toughest and meanest of each species as inputs into our final creation.
This leads to another big problem: missing limbs. The only downside of our tournaments is that fights between crabs and fights between chainsaw wielding dudes tend to leave the winner shy a few limbs. This is important because the total limbs of the final monster equal the combined limbs of the crab and human.
For the record: humans have up to 4 limbs and crabs have up to 10.
Now it would be awesome to have a 14 armed chainsaw crabman, but we'll settle for just 4, heck 3 works so long as it can chase and chainsaw! What we want to do is take the distribution of human limbs after the fight, the distribution of crab limbs after their fight and use convolution to determine the distribution of monster limbs!
Distribution of Human and Crab limbs
Here we can see the distribution of human limbs after many rounds of chainsaw combat:
For notation we'll consider the random variable representing human limbs remaining to be \(X\), and sampled from a distribution function \(f\).
$$X \sim f$$
We'll treat \(f\) as an array for most of this post since we have discrete values here. Note that there are 5 possible values since no limbs at all is a possibility.
Here is the distribution of crab limbs:
The number of crab limbs remaining will be denoted by \(Y\) and sampled from a distribution \(g\).
$$Y \sim g$$
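The actual probabilities for \(f\) and \(g\) live in the charts above. If you want to follow along in code, here's a pair of purely hypothetical stand-in arrays (made-up values that sum to 1, so they will not reproduce the exact output shown later in the post):

```python
# hypothetical placeholder distributions (NOT the values from the charts)
# human_limb_dist[i] = P(human has i limbs), i = 0..4
human_limb_dist = [0.05, 0.15, 0.30, 0.30, 0.20]

# crab_limb_dist[j] = P(crab has j limbs), j = 0..10
crab_limb_dist = [0.02, 0.03, 0.05, 0.08, 0.10, 0.12,
                  0.15, 0.15, 0.12, 0.10, 0.08]
```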
What we want to know now is the final monster distribution \(Z\). Notationally our solution is going to be:
$$h = (f \ast g)$$
$$Z \sim h$$
But what does \(h\) look like and how can we compute it?
First let's make sure we're thinking of \(f\), \(g\) and \(h\) as 0 indexed arrays where each index holds the value of the probability of that index. That is \(h[5]\) is the probability that the monster has 5 limbs and \(f[0]\) is the probability that the chainsaw wielding man has no limbs.
One important thing to realize is that, as an array (or as a signal), \(h\) is larger than \(f\) or \(g\). In fact the length of \(h\) is always going to be the length of \(f\) plus the length of \(g\), minus 1. This makes logical sense since our monster will have between 0 and 14 limbs, so there are 15 possible values.
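With the placeholder arrays from above we can sanity check that length relationship directly:

```python
print(len(human_limb_dist))                             # 5  (0-4 human limbs)
print(len(crab_limb_dist))                              # 11 (0-10 crab limbs)
print(len(human_limb_dist) + len(crab_limb_dist) - 1)   # 15 (0-14 monster limbs)
```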
Computing our result starting from the inputs
It's quite possible you have solved this problem before in code. I know I have without even knowing what a convolution was!
One way that might come naturally is to think about this from the perspective of the inputs. We'll start by building an empty array (well technically a List in Python) for our final distribution:
MAX_LIMBS = 14
# This is 'h' in our math notation
monster_dist = [0]*(MAX_LIMBS+1)
Next we'll iterate through each human limb count. For each human limb count we'll look at each crab limb count, sum the two counts to get our index into the final result, multiply the two probabilities together, and add that product to the existing value in monster_dist.
Here is this in code:
# human_limb_dist is 'f' in our math notation
# crab_limb_dist is 'g' in our math notation

# iterate through all possible human limb counts
for human_limb_count in range(len(human_limb_dist)):
    # look up the probability for each limb count
    p_human_count = human_limb_dist[human_limb_count]
    # then multiply that probability with the corresponding crab limb count
    for crab_limb_count in range(len(crab_limb_dist)):
        p_crab_count = crab_limb_dist[crab_limb_count]
        # this is the index in the monster_dist
        crab_human_sum = human_limb_count + crab_limb_count
        p_crab_human_sum = p_human_count * p_crab_count
        monster_dist[crab_human_sum] += p_crab_human_sum
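Before moving on, one quick sanity check (assuming you're running this with the placeholder arrays from earlier): the output should itself be a valid probability distribution, so its values should sum to 1.

```python
# the monster distribution is a probability distribution, so it sums to ~1
assert abs(sum(monster_dist) - 1.0) < 1e-9
```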
That's it! That is how you compute the convolution! However it's much more useful to understand what this convolution is really doing. Here we can visualize what's happening.
What's very important to visualize here is that at each step we are computing a sliding window. That is, for each of the 5 human limb possibilities we're computing probabilities for all 11 crab possibilities. At no point do we directly compute all 15 of the final probabilities for the final monster, these are calculated as a consequence of this window of results that happens as we step through.
We're not quite ready to show the final formula for convolutions, for that we need to consider this problem thinking from the perspective of the final result.
Starting with the output
In our first example we broke down the convolution in terms of the effect of each of the inputs, that is, we iterated through each value in each distribution to ultimately compute the output distribution.
But say we work backwards, starting from the final monster distribution. Let's try to compute the probability for there being 8 limbs in the monster. How can we begin?
In the previous visualization we kept all of our individual distributions centered on the final distribution but let's look at these again and only consider the indexes of the individual results. That is for each of the 5 possible human limb counts we'll consider the 11 combinations of that value with a corresponding value in the crab distribution.
Focusing on just the 8 limbs value let's consider which value from each of the 5 arrays we created feeds into the final result:
It's well worth looking at this visualization carefully. Consider the first chart below the result. The value 8 is highlighted because we have initially 0 human limbs, and 0 + 8 = 8. When we go next to 1 human limb, it is the value at index 7 we use because 1 + 7 = 8.
Let's go ahead and write out which index contributes to the value for 8 for each of these:
- 0 human limbs: index 8
- 1 human limb: index 7
- 2 human limbs: index 6
- 3 human limbs: index 5
- 4 human limbs: index 4
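Writing out the contribution of each of those index pairs explicitly, the probability of an 8 limbed monster is just the sum of the five products:

$$(f \ast g)[8] = f[0]g[8] + f[1]g[7] + f[2]g[6] + f[3]g[5] + f[4]g[4]$$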
It should be pretty clear we have a pattern. For any given value \(t\) in our final distribution, it is equal to the sum of \(f[i] \cdot g[t-i]\) over each index \(i\) in the array.
We can flesh this out mathematically as:
$$(f \ast g)[t] = \sum_{i} f[i] \cdot g[t-i]$$
Here we have the formal definition of a discrete convolution!
We can also write this in code:
def conv_t(f, g, t):
    return sum([
        f[i]*g[t-i]
        for i in range(len(f))
        # we need to do some bounds checking in our code
        if t-i < len(g) and (t-i) >= 0
    ])
And now we can use a list comprehension to compute the entire monster distribution:
> [conv_t(human_limb_dist, crab_limb_dist, t) for t in range(MAX_LIMBS+1)]
[0.0021141649048625794, 0.009513742071881607, 0.023255813953488372, 0.0507399577167019, 0.07399577167019028, 0.10359408033826639, 0.13530655391120505, 0.1416490486257928, 0.16279069767441862, 0.14164904862579283, 0.08879492600422832, 0.03699788583509514, 0.016913319238900635, 0.011627906976744186, 0.0010570824524312897]
As one final sanity check, let's go ahead and compare this with the scipy.signal convolve function:
> from scipy.signal import convolve
> convolve(human_limb_dist, crab_limb_dist)
[0.0021141649048625794, 0.009513742071881607, 0.023255813953488372, 0.0507399577167019, 0.07399577167019027, 0.10359408033826639, 0.13530655391120505, 0.14164904862579283, 0.1627906976744186, 0.1416490486257928, 0.08879492600422832, 0.03699788583509513, 0.016913319238900635, 0.011627906976744186, 0.0010570824524312897]
Looks good to me!
Continuous Convolution
Now that we've derived the discrete convolution, it's fairly straightforward to translate this into the continuous case (note: we'll use \(\tau\) instead of \(i\) here since it's no longer an index):
$$(f \ast g)(t) = \int_{-\infty}^{\infty}f(\tau)g(t-\tau)d\tau$$
I tend to spend most of my time computing things numerically, so the discrete case is more useful for most of my problems. However the really cool thing about the continuous case is that it can be used to solve convolutions analytically. The Wikipedia article and this Stack Overflow comment both demonstrate how we can prove our earlier statement about the sum of two normally distributed random variables using a convolution. In the cases where we can solve the convolution analytically, we can potentially get much faster solutions to our problems!
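As a rough numerical sketch of that idea (not the analytic proof itself), we can evaluate the continuous convolution of two normal densities with scipy's quad and check it against the normal density the earlier rule predicts for the sum; the parameters here are arbitrary illustrative choices:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# arbitrary example parameters
mu_x, sigma_x = 1.0, 2.0
mu_y, sigma_y = -3.0, 0.5

def f(tau):
    return norm.pdf(tau, loc=mu_x, scale=sigma_x)

def g(tau):
    return norm.pdf(tau, loc=mu_y, scale=sigma_y)

def convolved(t):
    # numerically evaluate (f * g)(t) = integral of f(tau) * g(t - tau) d tau
    value, _ = quad(lambda tau: f(tau) * g(t - tau), -np.inf, np.inf)
    return value

t = 0.5
analytic = norm.pdf(t, loc=mu_x + mu_y, scale=np.sqrt(sigma_x**2 + sigma_y**2))
print(convolved(t), analytic)  # the two values should agree closely
```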
Conclusion
If you either watched the 3Blue1Brown video or read through Smith's section of Digital Signal Processing on convolutions you'll notice that there is a view of convolutions that I didn't include here. That is the idea of "reversing" the filter (or in this case the crab distribution). While visually this reversal comes up and makes sense in some cases, I personally found that viewing this algorithmically and computationally from the perspective of probability theory doesn't require us to really think about reversing anything and, at least for me, makes things more clear.
That said I highly recommend reading up more on convolutions from different perspectives as it shows up in many different areas, each one taking a different view of this really amazing operation.
Bonus: If you read r y x, r's most recent post Goodbye Data Science (highly recommended), there is a hilarious comment regarding people reading advanced machine learning papers without understanding the basics:
Like bro, you want to do stuff with 'diffusion models'? You don’t even know how to add two normal distributions together! You ain’t diffusing shit!
I want to point out that in this post all the images were created with Stable Diffusion, and we covered not only adding normal distributions but adding arbitrary distributions! Of course that itself might be the very real distinction between mad science and data science!
Support on Patreon
Support my writing on Patreon and gain access to the source code and video commentary for this article as well as access to much more of my writing!