
Maximum likelihood estimation

In this post, I will show you THE most important technique in inferential statistics: Maximum Likelihood Estimation (MLE).


1. Some data to work with

Before we get started, let’s see what type of problem could be solved using MLE.

For example, I recorded the number of visitors to this website each hour from 8:00 am to 12:00 pm (p.s. of course this is fake data, and I am probably too optimistic), and I hope to have a model that can accurately describe my data, as well as make predictions. Here is my data:

Time                  Number of visitors
8:00 am - 9:00 am     5
9:00 am - 10:00 am    3
10:00 am - 11:00 am   4
11:00 am - 12:00 pm   6

The first thing you need to do is choose an appropriate probability distribution to describe your data. From your experience, you are happy with a Poisson distribution, which has the probability mass function:

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

where x is our observed data and λ is the parameter we want to estimate.

(p.s. You can also use other distributions, such as the negative binomial, to model count data.)


2. Let’s experiment with different λ

Try λ=2.

The core question here is:
Assuming λ=2, what is the probability of observing our data?

That shouldn’t be difficult to answer. We first plug λ=2 into the probability mass function and get:

$$f(x) = \frac{2^x e^{-2}}{x!}$$

Now we can calculate the probability of having exactly 5 visitors in a one-hour window:

$$f(x=5) = \frac{2^5 e^{-2}}{5!} = 0.036$$

Similarly, we have:

$$f(x=3) = \frac{2^3 e^{-2}}{3!} = 0.18$$

$$f(x=4) = \frac{2^4 e^{-2}}{4!} = 0.09$$

$$f(x=6) = \frac{2^6 e^{-2}}{6!} = 0.012$$

Since our sample is independent and identically distributed (a.k.a. i.i.d.), the probability of observing our data ([5, 3, 4, 6]) given λ=2 is simply the product of the probabilities we have calculated:

$$P(\text{data} \mid \lambda=2) = 0.036 \times 0.18 \times 0.09 \times 0.012 = 7 \times 10^{-6}$$


Try λ=4.

The calculation is exactly the same, so I am not going to repeat it. But here are the results:

$$f(x=5) = 0.156,\quad f(x=3) = 0.195,\quad f(x=4) = 0.195,\quad f(x=6) = 0.104$$
Again, we want to calculate the probability of observing those data, given λ=4:

$$P(\text{data} \mid \lambda=4) = 0.156 \times 0.195 \times 0.195 \times 0.104 = 6.2 \times 10^{-4}$$

Notice that the conditional probability in our example, P(data | λ=λ₀), is also called the likelihood function: L(λ=λ₀ | data).

Also notice that when we change λ, the value of ∏ᵢ p(xᵢ | λ) (a.k.a. the likelihood) also changes. In our example, when λ changed from 2 to 4, the likelihood increased. This means λ=4 is the better model of the two, as it aligns more closely with our observed data.

The goal of MLE is to find the λ₀ that maximizes the likelihood of the observed data given the model.


3. Find λ₀

Now the question becomes clear. We have:

$$L(\lambda \mid \text{data}) = P(\text{data} \mid \lambda) = f_\lambda(5) \times f_\lambda(3) \times f_\lambda(4) \times f_\lambda(6) = \frac{\lambda^5 e^{-\lambda}}{5!} \times \frac{\lambda^3 e^{-\lambda}}{3!} \times \frac{\lambda^4 e^{-\lambda}}{4!} \times \frac{\lambda^6 e^{-\lambda}}{6!}$$

And we are trying to find the λ₀ that makes L(λ₀ | data) the greatest. In mathematical notation:

$$\arg\max_{\lambda} \, L(\lambda \mid \text{data})$$

Obviously, you can take the derivative of the likelihood function to find its maximum, but that is beyond the scope of this post; if you are interested, please check the appendix. Instead, I drew a graph to show you the relationship between λ and L(λ | data):

[Figure: the likelihood L(λ | data) plotted against λ, peaking near λ = 4.5]

You can visually tell that when λ=4.5, our likelihood function reaches its maximum.

In fact, λ=4.5 is the average of our data points, and this is not a coincidence: for a Poisson model, the maximum likelihood estimate of λ is always the sample mean (the appendix shows why).


Summary

In summary, when we observe a set of data [5, 3, 4, 6] and we believe the data come from a Poisson distribution, the “best guess” (under the frequentist framework) for that Poisson distribution is λ̂ = 4.5, which is the average of our data points.

That’s the end of this post. Thanks for reading.



Appendix

We have the likelihood function:

$$L(\lambda \mid \text{data}) = \frac{\lambda^5 e^{-\lambda}}{5!} \times \frac{\lambda^3 e^{-\lambda}}{3!} \times \frac{\lambda^4 e^{-\lambda}}{4!} \times \frac{\lambda^6 e^{-\lambda}}{6!}$$

Let g(λ) = log(L(λ | data)).

Taking the logarithm doesn’t change where the function reaches its peak, because the logarithm is monotonically increasing. Therefore, we can instead find the λ that maximizes g(λ).

After applying the logarithm, we have:

$$g(\lambda) = \log\left( \frac{\lambda^5 e^{-\lambda}}{5!} \times \frac{\lambda^3 e^{-\lambda}}{3!} \times \frac{\lambda^4 e^{-\lambda}}{4!} \times \frac{\lambda^6 e^{-\lambda}}{6!} \right)$$

Since the logarithm converts products into sums, we have:

$$g(\lambda) = \left[ \log(\lambda^5) - \lambda - \log(5!) \right] + \left[ \log(\lambda^3) - \lambda - \log(3!) \right] + \left[ \log(\lambda^4) - \lambda - \log(4!) \right] + \left[ \log(\lambda^6) - \lambda - \log(6!) \right]$$

We then take the derivative of g(λ):

$$g'(\lambda) = \frac{5}{\lambda} - 1 + \frac{3}{\lambda} - 1 + \frac{4}{\lambda} - 1 + \frac{6}{\lambda} - 1 = \frac{18}{\lambda} - 4$$

Setting g'(λ) = 0, we have:

$$\frac{18}{\lambda} - 4 = 0$$

Solving for λ, we get:

$$\lambda = \frac{18}{4} = 4.5$$

Note that 18 = 5 + 3 + 4 + 6 and 4 is the number of observations, so this is exactly the sample mean.