In our previous post, we manually implemented the Markov Chain Monte Carlo algorithm (specifically, Metropolis-Hastings) and drew samples from the posterior distribution. The code isn’t particularly difficult to understand, but it isn’t intuitive to read or write either. Beyond the difficulty of implementation, performance (i.e., speed) is also a major consideration in more realistic applications. Luckily, well-built tools are available to remove these obstacles, namely Stan and PyMC3.
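To give a flavor of what these tools look like, here is a minimal PyMC3 sketch of the Poisson–Gamma model from this series. The data and prior parameters below are illustrative assumptions, not the exact values from the earlier posts:

```python
import pymc3 as pm

# Hypothetical hourly visitor counts, just for illustration
data = [4, 3, 5, 2, 3, 2]

with pm.Model() as model:
    # Gamma prior on the Poisson rate (alpha/beta here are assumptions)
    lam = pm.Gamma("lam", alpha=1, beta=1)
    # Poisson likelihood for the observed counts
    obs = pm.Poisson("obs", mu=lam, observed=data)
    # Draw posterior samples; PyMC3 uses NUTS by default
    trace = pm.sample(2000, tune=1000)
```

Compare this with the dozens of lines a hand-rolled Metropolis-Hastings implementation takes: the model reads almost like the math.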
In the last few posts, we tried three methods (Integration, Conjugate Prior and MCMC) to infer the posterior distribution $P(\lambda | \text{data})$, which gave us
$$\lambda \sim \text{Gamma}(\alpha = 20, \beta = 6)$$
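As a quick numerical check, this posterior can be queried with SciPy; note that SciPy parameterizes the Gamma distribution by scale $= 1/\beta$, so $\beta = 6$ becomes `scale=1/6`:

```python
from scipy import stats

# Posterior Gamma(alpha=20, beta=6); SciPy's gamma uses scale = 1/beta
posterior = stats.gamma(a=20, scale=1 / 6)

print(posterior.mean())          # alpha / beta = 20 / 6 ≈ 3.33
print(posterior.interval(0.95))  # central 95% credible interval
```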
Now you may ask: Cool, we have our posterior distribution, but SO WHAT?
In this post, we are going to see why studying the posterior distribution is advantageous and interesting, and show the internal relationship between the Poisson, Gamma and Negative Binomial distributions.
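As a small preview of that relationship, here is a simulation sketch: if $\lambda \sim \text{Gamma}(\alpha, \beta)$ (rate parameterization) and $X \mid \lambda \sim \text{Poisson}(\lambda)$, then marginally $X$ follows a Negative Binomial distribution with $n = \alpha$ and $p = \beta/(\beta+1)$. The sample size and seed below are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, beta = 20, 6  # the posterior parameters from above

# Gamma-Poisson mixture: draw a rate, then a count at that rate
lam = rng.gamma(shape=alpha, scale=1 / beta, size=100_000)
counts = rng.poisson(lam)

# The marginal distribution of the counts is Negative Binomial
nb = stats.nbinom(n=alpha, p=beta / (beta + 1))
print(counts.mean(), nb.mean())  # both ≈ 20 / 6 ≈ 3.33
print(counts.var(), nb.var())    # both ≈ alpha * (beta + 1) / beta**2
```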
In this post, I will continue to use the same example that I used before (Bayesian: MAP and Bayesian: solve denominator). Also, it will be very helpful to first understand accept-reject sampling, which I discussed in this post.
Now let’s get started!
As we discussed at the end of this post, solving the denominator is non-trivial work, especially when you have many parameters to estimate. One way to overcome this obstacle is a method called Markov Chain Monte Carlo (MCMC).
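To recap the idea, here is a minimal Metropolis-Hastings sketch targeting the unnormalized posterior $\lambda^{19} e^{-6\lambda}$ from our running example (recapped just below). The proposal scale and iteration count are arbitrary assumptions; the key point is that the unknown denominator cancels in the acceptance ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_unnorm_posterior(lam):
    # log of lambda^19 * exp(-6 * lambda); the constant c is omitted
    return 19 * np.log(lam) - 6 * lam if lam > 0 else -np.inf

samples = []
current = 1.0  # arbitrary starting point
for _ in range(10_000):
    proposal = current + rng.normal(scale=0.5)  # symmetric random walk
    # Acceptance ratio: the normalizing denominator cancels here
    log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(current)
    if np.log(rng.uniform()) < log_ratio:
        current = proposal
    samples.append(current)
```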
In the last post, we tried to use a Bayesian framework to model the number of visitors per hour. After combining a Poisson likelihood with a Gamma prior, we got something like: $$ P(\lambda | \text{data}) = c \cdot \lambda^{19} e^{-6\lambda} $$
Since we are interested in finding the $\lambda_0$ that gives the maximum value of $P(\lambda | \text{data})$ (a.k.a. Maximum A Posteriori), we don’t need to worry too much about the constant $c$.
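To see why, set the derivative of the log of the posterior to zero; the constant $c$ drops out entirely:

$$ \frac{d}{d\lambda}\Big[\ln c + 19\ln\lambda - 6\lambda\Big] = \frac{19}{\lambda} - 6 = 0 \quad\Longrightarrow\quad \lambda_0 = \frac{19}{6} \approx 3.17 $$

This agrees with the mode of $\text{Gamma}(\alpha=20, \beta=6)$, which is $(\alpha - 1)/\beta$.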
I know, this is an exciting & scary topic. It literally took me months to understand, and I hope this post will make your life easier.
Before you read this post, I assume you are already familiar with basic probability theory, maximum likelihood estimation and Bayes’ theorem. I encourage you to read my previous post that discussed MLE, since we are going to use the same dataset in this post.
In this post, I am going to show two methods for drawing samples from a generic distribution.
But before we get started, we should define what I mean by a generic distribution. Here is one example:
$$ f(x)= \begin{cases} c \cdot \sqrt{x} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} $$
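One detail the definition leaves open is the value of $c$; it is pinned down by requiring the density to integrate to 1:

$$ \int_0^1 c\,\sqrt{x}\,dx = c \cdot \frac{2}{3} = 1 \quad\Longrightarrow\quad c = \frac{3}{2} $$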
First, let’s take a look at the probability density function (pdf):
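Here is a minimal matplotlib sketch to plot it, using $c = 3/2$ from the normalization above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 200)
plt.plot(x, 1.5 * np.sqrt(x))  # f(x) = (3/2) * sqrt(x) on [0, 1]
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("pdf: f(x) = 1.5 * sqrt(x)")
plt.show()
```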