Dive into Bayesian statistics (2): Solve the nasty denominator!
In the last post, we tried to use a Bayesian framework to model the number of visitors per hour. After combining a Poisson likelihood with a Gamma prior, we got something like: $$ P(\lambda | \text{data}) = c \cdot \lambda^{19} e^{-6\lambda} $$
Since we are interested in finding the $\lambda_0$ that gives the maximum value of $P(\lambda | \text{data})$ (a.k.a. the Maximum A Posteriori, or MAP, estimate), we don't need to worry too much about the constant $c$. But in this post, we are going to solve for $c$ and consolidate our understanding of Bayesian inference.
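To see why the constant doesn't matter for the MAP estimate, here is a minimal sketch (my own illustration, not part of the original derivation) that maximizes the unnormalized posterior for a few arbitrary values of $c$ and finds the same $\lambda_0$ every time:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Unnormalized posterior from the last post: c * lambda^19 * exp(-6 * lambda).
# We work with the negative log so the optimizer is well behaved.
def neg_log_posterior(lam, c=1.0):
    return -(np.log(c) + 19 * np.log(lam) - 6 * lam)

# The argmax is the same no matter which (arbitrary) value of c we pick.
for c in (1.0, 0.03, 1e6):
    result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 20), args=(c,), method="bounded")
    print(c, result.x)  # lambda_0 ≈ 19/6 ≈ 3.17 every time
```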
Interpreting the denominator
Recall our Bayesian framework states: $$ P(\lambda | \text{data}) = \frac{P(\text{data} | \lambda) \cdot P(\lambda)}{P(\text{data})} $$
I wrote detailed explanations of $P(\text{data} | \text{params})$ and $P(\text{params})$ in the last post, but the denominator $P(\text{data})$ is still unclear. How can we interpret the probability of the data? What does that even mean?
Well, let's first take a look at a pair of discrete random variables with the following joint distribution:
$$ \begin{array}{c|lcr}& X_1 & X_2 & X_3 \\ \hline Y_1 & 0.3 & 0.2 & 0.1 \\ Y_2 & 0.1 & 0.1 & 0.2 \end{array} $$
The question is: what is $P(Y_1)$?
The answer is simple: $P(Y_1) = 0.3 + 0.2 + 0.1 = 0.6$.
But the point is, we can write
$$ \begin{aligned} P(Y_1) & = P(X_1 \cap Y_1) + P(X_2 \cap Y_1) + P(X_3 \cap Y_1) \\ & = P(Y_1 | X_1)\cdot P(X_1) + P(Y_1 | X_2)\cdot P(X_2) + P(Y_1 | X_3)\cdot P(X_3) \end{aligned} $$
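As a quick sanity check, here is a small sketch of my own that computes $P(Y_1)$ from the joint table both ways: by summing the first row directly, and via the law of total probability:

```python
import numpy as np

# Joint distribution P(X, Y) from the table above: rows are Y1, Y2; columns are X1, X2, X3.
joint = np.array([[0.3, 0.2, 0.1],
                  [0.1, 0.1, 0.2]])

# Marginal P(Y1): sum the first row over all values of X.
print(joint[0].sum())  # 0.6

# Same answer via the law of total probability: sum_i P(Y1 | X_i) * P(X_i).
p_x = joint.sum(axis=0)            # P(X_i), obtained by summing out Y
p_y1_given_x = joint[0] / p_x      # P(Y1 | X_i)
print((p_y1_given_x * p_x).sum())  # 0.6
```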
Now, we can apply this idea of calculating the marginal probability to our Bayesian framework, where we have
$$ \begin{aligned} P( \text{data}) & = P( \text{data} \cap \lambda_1) + P( \text{data} \cap \lambda_2) + \dots \\ & = P( \text{data} | \lambda_1) \cdot P(\lambda_1) + P( \text{data} | \lambda_2) \cdot P(\lambda_2) + \dots \end{aligned} $$
Since $\lambda$ is a continuous variable, it’s more appropriate to write the above formula as:
$$ P( \text{data}) = \int P( \text{data} | \lambda) P(\lambda) \cdot d \lambda $$
Now, let's plug the denominator into Bayes' theorem:
$$ P(\lambda | \text{data}) = \frac{P(\text{data} | \lambda) P(\lambda)}{\int P( \text{data} | \lambda) P(\lambda) \cdot d \lambda} $$
Notice that the denominator is just the integral of the numerator.
Back to our problem
In the appendix of my last post, we solved the numerator, which is
$$ P(\text{data} | \lambda) \cdot P(\lambda) = \frac{4}{5! \cdot 3! \cdot 4! \cdot 6! } \cdot \lambda^{19} e^{- 6 \lambda} $$
As we discussed, the denominator is just the integral of the numerator. We have
$$ \begin{aligned} P( \text{data}) & = \int P( \text{data} | \lambda) P(\lambda) \cdot d \lambda \\ & = \int_0^{\infty} \frac{4}{5! \cdot 3! \cdot 4! \cdot 6! } \cdot \lambda^{19} e^{- 6 \lambda} d \lambda \\ & \approx 1.07 \times 10^{-5} \end{aligned} $$
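If you'd rather not do this integral by hand, note that $\int_0^{\infty} \lambda^{19} e^{-6\lambda} \, d\lambda = 19! / 6^{20}$ (the standard Gamma-function identity), so a few lines of plain Python reproduce the number above. This is just my own verification sketch:

```python
from math import factorial

# Constant in front of the numerator: 4 / (5! * 3! * 4! * 6!)
const = 4 / (factorial(5) * factorial(3) * factorial(4) * factorial(6))

# Gamma-function identity: integral of lambda^19 * exp(-6 * lambda) over [0, inf) equals 19! / 6^20
integral = factorial(19) / 6**20

print(const * integral)  # ≈ 1.07e-05
```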
Now we can finalize our calculation by plugging in the numerator and the denominator:
$$ \begin{aligned} P( \lambda |\text{data}) & = \frac{P(\text{data} | \lambda) \cdot P(\lambda)}{P(\text{data})}\\ & \approx 0.03 \cdot \lambda^{19} e^{- 6 \lambda} \end{aligned} $$
We finally solved our posterior distribution!!!
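As a final sanity check (my own, using SciPy's `quad`), the density we just derived should integrate to 1 over $\lambda \in [0, \infty)$:

```python
import numpy as np
from scipy.integrate import quad

# Posterior density we just derived: 0.03 * lambda^19 * exp(-6 * lambda)
posterior = lambda lam: 0.03 * lam**19 * np.exp(-6 * lam)

area, _ = quad(posterior, 0, np.inf)
print(area)  # ≈ 1.0 (slightly off because 0.03 is a rounded constant)
```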
Conjugate prior
Now, let me introduce a concept called the conjugate prior. If you choose your prior distribution carefully, you can end up with a posterior distribution that belongs to the same family as the prior.
In our case, the Gamma distribution is indeed a conjugate prior for the Poisson likelihood. From Wikipedia, the posterior distribution is $$\text{Gamma}(\alpha + \sum_i x_i , \beta + n)$$
In our case, we have $$ \begin{aligned} P(\lambda |\text{data}) & \sim \text{Gamma}(2+ 5 + 3 + 4 + 6 , 2 + 4) \\ & =\text{Gamma}(20, 6) \end{aligned} $$
Let's plug $\alpha' = 20, \beta' = 6$ into the density of the Gamma distribution: $$ \begin{aligned} f(\lambda)_{\alpha' = 20, \beta' = 6} & = \frac{6^{20}}{19!}\lambda^{19}e^{-6\lambda} \\ & \approx 0.03 \cdot \lambda^{19} e^{- 6 \lambda} \end{aligned} $$
which matches the result we obtained by integration.
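Here is a short verification sketch of my own, assuming `scipy.stats.gamma` (which takes a shape `a` and a `scale` equal to $1/\beta$), showing that the conjugate-prior result and the density we derived by integration agree:

```python
import math
import numpy as np
from scipy.stats import gamma

# Conjugate update: Gamma(alpha + sum(x_i), beta + n) with prior Gamma(2, 2) and data [5, 3, 4, 6]
alpha_post = 2 + 5 + 3 + 4 + 6   # 20
beta_post = 2 + 4                # 6 (this is a rate)

posterior = gamma(a=alpha_post, scale=1 / beta_post)

# Density derived by integration: (6^20 / 19!) * lambda^19 * exp(-6 * lambda)
lam = np.linspace(0.5, 8.0, 5)
by_hand = 6**20 / math.factorial(19) * lam**19 * np.exp(-6 * lam)

print(np.allclose(posterior.pdf(lam), by_hand))  # True
```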
Summary
In summary, we showed two ways of calculating the posterior distribution (i.e., solving for the denominator) in this post.
Method 1: Integrate the numerator
This method can, in principle, solve any problem. But as you saw in this post, even a one-dimensional problem isn't easy to solve by hand. In real-world applications, you can easily run into problems with more than 5 dimensions, and the integration becomes a nightmare for any mathematician. Imagine needing to solve something like $$\int \int \int \int \int … \, dx \, dy \, dz \, dw \, dq$$ Solving this integral might be the worst thing in the universe :(
Method 2: Conjugate prior
The calculation is easy enough: in our case, you just follow the formula, plug in the numbers, and you get a beautiful posterior distribution. The drawback is that your prior is not guaranteed to be conjugate with your likelihood function. If you want to take advantage of this nice property, you need to choose your prior distribution carefully according to this table.
All right, thanks for reading this post. In the next few posts, I will introduce Markov chain Monte Carlo (MCMC), which addresses the drawbacks we just discussed. Stay tuned!