Contents

Model the Gene Expression (2): Likelihood Ratio Test

In the last post, we used a GLM framework to model gene expression: $$ y \sim NB(\mu, r) \\ \log( \mu )= b_1 x + b_0 $$

Using maximum likelihood estimation, we were able to find a set of parameters $\hat b_0, \hat b_1, \hat r$, that maximizes the likelihood function.

But if you send this model (the estimated parameters) to biologists, they wouldn’t be happy. And we all know what is lacking: the p-value!

 
 
 
 

Introduction

Obviously, it’s inappropriate to use a t-test/ANOVA to evaluate a generalized linear model, since the assumptions of linearity and normality fail. Now the question is: do we have a generic method to construct a hypothesis test, under the GLM framework?

The answer is YES, and even better, we have more than one method to get the job done, namely:

  • Likelihood ratio test
  • Wald test
  • Score test

The Wald test and the score test are quite involved, and I am still trying to wrap my head around them; hopefully I will get a chance to discuss them in future blog posts. In this post, we will focus on the likelihood ratio test: I will offer some intuition, compare it to ANOVA, and walk through the calculation step by step.

 
 
 
 

ANOVA

I think it might be helpful to go over ANOVA before we dive into the likelihood ratio test, since the two share a lot of similarities.

Recall that in ANOVA, we build two models: a simple model and a fancy model. By adding an additional variable, we want to see if the sum of squared residuals decreases significantly compared to the inherent noise.

 

Take our type 2 diabetes medication as an example. The simple model could be “the medication doesn’t affect gene 1 expression”; the fancy model could be “the medication changes gene 1 expression”.

 

We fit the simple model using lm(data ~ 1), and the fancy model using lm(data ~ expgroup). After fitting the two models, we calculate the sum of squared residuals (SSR) for each: $SSR_{\text{simple}}, SSR_{\text{fancy}}$. Notice that adding a variable can never hurt the model, so $SSR_{\text{simple}} \ge SSR_{\text{fancy}}$ for any added variable.
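As a quick sanity check, we can compute both SSRs in R. This is just an illustration, using the toy expression values from the Calculation section later in this post:

```r
# toy gene expression values (the same ones used in the Calculation section)
data = c(1000, 400, 700, 2100, 3200, 1000)
expgroup = as.factor(c("control","control","control","treatment","treatment","treatment"))

model_simple = lm(data ~ 1)         # intercept only
model_fancy  = lm(data ~ expgroup)  # intercept + group effect

SSR_simple = sum(residuals(model_simple)^2)
SSR_fancy  = sum(residuals(model_fancy)^2)

SSR_simple               # 5540000
SSR_fancy                # 2600000
SSR_simple >= SSR_fancy  # TRUE: the extra variable never increases the SSR
```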

 

The next step of ANOVA is to plug $SSR_{\text{simple}}, SSR_{\text{fancy}}$ into the formula

$$ F = \frac{ (SSR_{\text{simple}} - SSR_{\text{fancy}}) /(p_{\text{fancy}} - p_{\text{simple}}) }{SSR_{\text{fancy}}/(n - p_{\text{fancy}} ) } $$

where $p_{\text{fancy}}$ and $p_{\text{simple}}$ denote the number of parameters in each model, and $n$ is the number of observations.

 

The statistic $F$ follows an F-distribution with $df_1 = p_{\text{fancy}} - p_{\text{simple}}$ and $df_2 = n - p_{\text{fancy}}$. To find the p-value, we calculate the probability of events at least as extreme as what we observed, i.e. $F$ values greater than our calculated one: the area of the upper tail of the F-distribution.
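The whole procedure can be sketched in R (again using the toy data from the Calculation section; the built-in anova() performs exactly this comparison):

```r
data = c(1000, 400, 700, 2100, 3200, 1000)
expgroup = as.factor(c("control","control","control","treatment","treatment","treatment"))

model_simple = lm(data ~ 1)        # p_simple = 1 parameter
model_fancy  = lm(data ~ expgroup) # p_fancy  = 2 parameters

SSR_simple = sum(residuals(model_simple)^2)
SSR_fancy  = sum(residuals(model_fancy)^2)

n   = length(data)
df1 = 2 - 1  # p_fancy - p_simple
df2 = n - 2  # n - p_fancy

Fstat = ((SSR_simple - SSR_fancy) / df1) / (SSR_fancy / df2)
p = pf(Fstat, df1, df2, lower.tail = FALSE)  # area of the upper tail

# anova() does the same arithmetic for us
anova(model_simple, model_fancy)
```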

 

The take-home message is that we are comparing a simple model to a fancy model: we want to determine whether adding another variable significantly improves the fit (measured by SSR).

 
 
 
 

Likelihood Ratio Test

Similarly, in the likelihood ratio test, we also fit a simple model and a fancy model. In our medication example, the simple model can be expressed as glm.nb(data ~ 1), and the fancy model as glm.nb(data ~ expgroup).

 

To evaluate the fit of each model, we use maximum likelihood estimation again. For the simple model, we find the maximum of its likelihood function, $\mathcal{L}_m(\mu, r | data)$; for the fancy model, we do the same with $\mathcal{L}_m(b_0, b_1, r | data)$.

Then we take the logarithm of the likelihood ratio:

 
$$ \begin{aligned} \lambda & = -2 \cdot \ln \Bigg[ \frac{\mathcal{L}_{\text{m}}(\mu, r | data)}{\mathcal{L}_{\text{m}}(b_0, b_1, r | data)} \Bigg] \\ & = -2 \cdot \Bigg[ \ln (\mathcal{L}_m (\mu, r | data)) - \ln (\mathcal{L}_m (b_0, b_1, r | data)) \Bigg] \end{aligned} $$

It has been proven that $\lambda$ is asymptotically* chi-square distributed with one degree of freedom (in the general case, with degrees of freedom equal to the difference in the number of parameters between the fancy model and the simple model).

 

To find the p value, we simply calculate the area of the upper tail of the $\chi^2$ distribution.

 

I hope you have found some similarities between the likelihood ratio test and ANOVA. The procedures are almost identical:

  • build a simple model and a fancy model
  • measure the fit of each model (SSR vs. likelihood)
  • determine whether adding another variable significantly improves the fit

 

(*Asymptotically: as the sample size grows, the distribution of $\lambda$ gets closer to the theoretical chi-square distribution. That said, with a small sample size, the p-value calculated from the chi-square distribution will be a little bit off.)

(I should also mention that, internally, ANOVA is itself a likelihood ratio test, but I will skip that discussion for conciseness.)
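To see the asymptotic behavior for yourself, here is a small simulation sketch (my own illustration, separate from the gene expression example): we test $H_0$: rate $= 1$ on exponential samples with a likelihood ratio test, and count how often $\lambda$ exceeds the $\chi^2_1$ critical value when $H_0$ is actually true. With a nominal $\alpha = 0.05$, the false-positive rate should approach 0.05 as the sample size grows.

```r
set.seed(1)

# for Exp(rate) data with H0: rate = 1, the LRT statistic simplifies to
# lambda = 2 * n * (xbar - log(xbar) - 1), where xbar is the sample mean
lrt_lambda = function(x) {
  n = length(x); xbar = mean(x)
  2 * n * (xbar - log(xbar) - 1)
}

false_positive_rate = function(n, reps = 20000) {
  lambda = replicate(reps, lrt_lambda(rexp(n, rate = 1)))
  mean(lambda > qchisq(0.95, df = 1))  # reject when lambda beats the critical value
}

false_positive_rate(5)    # small sample: the chi-square approximation is less accurate
false_positive_rate(200)  # large sample: close to the nominal 0.05
```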

 
 
 
 

Calculation

Now let’s do the actual calculation.

Patient     Treatment        Gene1 Expression
patient1    placebo (0)      1000
patient2    placebo (0)      400
patient3    placebo (0)      700
patient4    medication (1)   2100
patient5    medication (1)   3200
patient6    medication (1)   1000

 
 

Step 1: Build a simple model

This is equivalent to saying that all the patients come from the same distribution. We construct the likelihood function as we have done before:

$$ \begin{aligned} \mathcal{L}(\mu, r | data) = & \prod_{i = 1}^6 P(y_i |\mu, r) \end{aligned} $$

 
 

Step 2: Build a fancy model

$$ \begin{aligned} \mathcal{L}(b_0, b_1, r | data) = & \prod_{i = 1}^6 P(y_i |b_0, b_1, r) \end{aligned} $$

 
 

Step 3: Find the maximum log-likelihood for the simple model

Again, we are going to use the optim() function to do the calculation:

data = c(1000, 400, 700, 2100, 3200, 1000)
expgroup = as.factor(c("control","control","control","treatment","treatment","treatment" )) 
 

neg_loglikelihood_simple = function(params){
  mu = params[1] # mean
  r = params[2] # dispersion
  
  # define the log likelihood of each patient.
  # under the simple model, all 6 patients share the same mu,
  # regardless of treatment group
  loglikelihood1 = log(dnbinom(data[1], size = r, mu = mu))
  loglikelihood2 = log(dnbinom(data[2], size = r, mu = mu))
  loglikelihood3 = log(dnbinom(data[3], size = r, mu = mu))
  loglikelihood4 = log(dnbinom(data[4], size = r, mu = mu))
  loglikelihood5 = log(dnbinom(data[5], size = r, mu = mu))
  loglikelihood6 = log(dnbinom(data[6], size = r, mu = mu))

  # the sum of logs is equivalent to the log of the product.
  # optim() finds the minimum; maximizing the log-likelihood is
  # equivalent to minimizing the negative log-likelihood
  return(- (loglikelihood1 + loglikelihood2 + loglikelihood3 + loglikelihood4 + loglikelihood5 + loglikelihood6))
}

optim(c(20,3), neg_loglikelihood_simple)

We found $$ \hat \mu = 1400, \hat r = 2.31 \\ \ln (\mathcal{L}_m (\mu, r | data)) = -48.5 $$

 
 

Step 4: Find the maximum log-likelihood for the fancy model

neg_loglikelihood_fancy = function(params){
  b0 = params[1] # intercept
  b1 = params[2] # coefficient
  r = params[3] # dispersion
  
  # define the log likelihood of each patient.
  # patient 1-3 received placebo, therefore log(mu) = b0, i.e. mu = exp(b0)
  loglikelihood1 = log(dnbinom(data[1], size = r, mu = exp(b0)))
  loglikelihood2 = log(dnbinom(data[2], size = r, mu = exp(b0)))
  loglikelihood3 = log(dnbinom(data[3], size = r, mu = exp(b0)))
  
  # patient 4-6 received medication, therefore log(mu) = b0 + b1, i.e. mu = exp(b0 + b1)
  loglikelihood4 = log(dnbinom(data[4], size = r, mu = exp(b0 + b1)))
  loglikelihood5 = log(dnbinom(data[5], size = r, mu = exp(b0 + b1)))
  loglikelihood6 = log(dnbinom(data[6], size = r, mu = exp(b0 + b1)))

  # the sum of logs is equivalent to the log of the product.
  # optim() finds the minimum; maximizing the log-likelihood is
  # equivalent to minimizing the negative log-likelihood
  return(- (loglikelihood1 + loglikelihood2 + loglikelihood3 + loglikelihood4 + loglikelihood5 + loglikelihood6))
}


optim(c(5,2,3), neg_loglikelihood_fancy)

We found $$ \hat b_0 = 6.55, \hat b_1 = 1.09, \hat r = 5.92 \\ \ln (\mathcal{L}_m (b_0, b_1, r | data)) = -45.4 $$
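As a sanity check (assuming the MASS package is installed), the maximized log-likelihoods from our hand-rolled optim() calls should closely match what glm.nb() reports, since it maximizes the same negative binomial likelihood:

```r
library(MASS)  # provides glm.nb()

data = c(1000, 400, 700, 2100, 3200, 1000)
expgroup = as.factor(c("control","control","control","treatment","treatment","treatment"))

model_simple = glm.nb(data ~ 1)
model_fancy  = glm.nb(data ~ expgroup)

logLik(model_simple)  # should be close to -48.5
logLik(model_fancy)   # should be close to -45.4
```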

 
 

Step 5: Calculate $\lambda$ and p

$$ \begin{aligned} \lambda & = -2 \cdot (\ln (\mathcal{L}_m (\mu, r | data)) - \ln (\mathcal{L}_m (b_0, b_1, r | data))) \\ & = -2 \cdot (-48.5 - (-45.4)) \\ & = 6.2 \end{aligned} $$

Plugging $\lambda$ into the $\chi^2$ distribution with $df = 1$, we calculate the p-value:

pchisq(6.2, df = 1, lower.tail = F)

We have $$ p = 0.013 $$

 
 
 
 

You may use the lmtest::lrtest() function to confirm the calculation:

library(MASS)   # for glm.nb()
library(lmtest) # for lrtest()
data = c(1000, 400, 700, 2100, 3200, 1000)
expgroup = as.factor(c("control","control","control","treatment","treatment","treatment" ))

# build a simple model and a fancy model
model_simple = glm.nb(data ~ 1)
model_fancy = glm.nb(data ~ expgroup)

# perform test
lrtest(model_simple, model_fancy)

You should find the results to be the same!

 
 
 
 

Summary

In this post, we explored the likelihood ratio test and found $p = 0.013$, which is small enough to reject the null hypothesis at $\alpha = 0.05$.

I should reinforce the correct interpretation of the p-value, regardless of which statistical test we use:

CORRECT INTERPRETATION: the p-value is the probability of observing data as extreme as, or more extreme than, what we observed, given that the null hypothesis is true.

WRONG INTERPRETATION: the p-value is the probability of making a mistake.

 
 
 
 
 
 

References:

Wikipedia, Likelihood-ratio test: https://en.wikipedia.org/wiki/Likelihood-ratio_test#cite_note-2

StatQuest: https://www.youtube.com/watch?v=NF5_btOaCig