Transcriptome-wide association studies

Instrument variable & TWAS

Transcriptome-wide association studies (TWAS) aim to identify associations between gene expression and traits of interest. In an ideal world where we have both RNA-seq and trait data for tens of thousands of individuals, performing a TWAS analysis would be straightforward: simply regress the trait on gene expression. However, GTEx, the largest collection of expression data, has only ~700 RNA-seq samples, and it does not include trait values. This limitation precludes a direct association test between expression and traits.

Despite this constraint, GTEx has collected genetic data from all participants and trained models to predict gene expression levels from genetic variants across multiple tissues. Variants that are strongly associated with gene expression are referred to as expression quantitative trait loci (eQTLs).

For a given cohort with only genetic and phenotype data, we can first imputed/predicted RNA expression from genetics data, then regress the phenotype on the predicted RNA expressions. Let’s define some additional notations:

$$ \begin{split} \hat x_k: &\text{ standardized imputed RNA expression for gene k} \\ \hat w_k: &\text{ pre-trained weights from GTEx to predict gene k} \end{split} $$

For a new dataset with genotype and phenotype information, we can first impute gene expression using $\hat x_k = G \hat w_k$. When predicting expression levels, we typically restrict the analysis to a small genomic region known as cis-eQTLs. We then regress the phenotype on the predicted expression to estimate the effect size and p-value for each gene.

TWAS employs a two-stage least squares regression and is theoretically robust to confounding between expression and trait. The model is conceptually identical to Mendelian randomization, where gene expression is treated as an exposure. With individual-level genotype and phenotype data, we can perform TWAS as follows:

$$ \begin{split} \hat \beta^{TWAS}_k &= \frac{\mathbb{C}ov[\hat x_k, y]}{\mathbb{V}ar[\hat x_k]} \\ &= \frac{\hat x_k^T y}{\hat x_k^T \hat x_k} \\ &= \frac{\hat w_k^T G^T y}{\hat w_k^T G^T G \hat w_k} \end{split} $$

Notice we have $\hat \beta = G^T y/N$, and $R = G^TG/N$, this allows us to further compress the expression to: $$ \hat \beta^{TWAS}_k = \frac{w_k^T \hat \beta}{\hat w_k^TR w_k} = \frac{w_k^T z}{\sqrt{N} \hat w_k^TR w_k} $$

As the phenotypical variance explained by a single locus is so small, the GWAS marginal effect size is distributed as $\hat \beta \sim MVN(R\lambda, \frac{1}{N}R)$. According to linear transformation of a multi-variate Gaussian random variable, we have: $$ \begin{split} \hat \beta^{TWAS}_k &\sim MVN(\frac{w_k^T R \lambda}{\hat w_k^TR w_k}, \frac{1}{N \hat w_k^TR w_k})\\ s.e &= \frac{1}{\sqrt{N \hat w_k^TR w_k}} \\ z &= \frac{ \beta^{TWAS}_k}{s.e} = \frac{\sqrt{N} \hat w^T \hat \beta}{(\hat w_k^TR w_k)^\frac{1}{2}} \end{split} $$

The above expression is convenient, as it allows for TWAS analysis with only GWAS summary statistics and a matched LD panel. The expression seems consistent with Sasha Gusev’s presentation. I also attached some simulation to demonstrate this quantity.

Simulation

path = "https://www.mv.helsinki.fi/home/mjxpirin/GWAS_course/material/APOE_1000G_FIN_74SNPS."
haps = read.table(paste0(path,"txt"))
info = read.table(paste0(path,"legend.txt"),header = T, as.is = T)


### Consider this is GTEx data
n1 = 1000
G1 = as.matrix(haps[sample(1:nrow(haps), size = n1, repl = T),1:10] + 
                haps[sample(1:nrow(haps), size = n1, repl = T),1:10])
row.names(G1) = NULL
colnames(G1) = NULL

G1_ = scale(G1)
M = ncol(G1_)
lambda = c(0, 0, 0, 0.3, 0, 0, 0, 0, 0, 0) # a very strong eQTL 

x1 = G1_ %*% lambda + rnorm(n1, mean = 0, sd = sqrt(1 - var(G1_ %*% lambda)))
x1_ = scale(x1)
# the expression model uses all the SNPs, might need to add regularization term in realistic analysis
w = solve(t(G1_) %*% G1_) %*% t(G1_) %*% x1_

### Consider this is my own data with genotype and trait
n2 = 6000
G2 = as.matrix(haps[sample(1:nrow(haps), size = n2, repl = T),1:10] + 
                haps[sample(1:nrow(haps), size = n2, repl = T),1:10])

row.names(G2) = NULL
colnames(G2) = NULL
G2_ = scale(G2)
R = cor(G2_)
beta_twas = 0.6

# x2 is not unknown, but we generate the true expression data 
x2 = G2_ %*% lambda + rnorm(n2, mean = 0, sd = sqrt(1 - var(G2_ %*% lambda)))
x2_ = scale(x2)

y2 = x2_ %*% beta_twas + rnorm(n2, mean = 0, sd = sqrt(1 - var(x2_ %*% beta_twas)))
y2_ = scale(y2)

# with individual level data
beta_twas_hat_method1 = cov(y2_, G2_ %*% w)/var(G2_ %*% w)
# a GWAS effect size is pre-computed
beta_hat = 1/n2 * (t(G2_) %*% y2_)

# with sumstats level data and LD
beta_twas_hat_method2 = t(w) %*% beta_hat / (t(w) %*% R %*% w)
# z score
z_twas_method2 = sqrt(n2) *  t(w) %*% beta_hat / sqrt(t(w) %*% R %*% w)

A few remarks about TWAS:

TWAS can be interpreted as a special usage of Mendelian randomization. One might think TWAS can identify gene expressions that causally affect the phenotype, as MR does. But due to complications such as co-expression, this is generally not true.
Standard TWAS analysis ignored the variability of the imputed expression data, which might induce inflated False positives. Xue et al. examnined this query, and concludes that it is mostly fine.
The weights for predicting gene expression are often obtained from GTEx, but would these weights as predictive across different ancestries? Chen et al. presented a method that integrate multiple GWAS results to perform TWAS. But I think there are rooms for more methodology development.

Reference:

FUSION paper, Gusev et al. 2016 Nat Gen link