The R packages limma and voom have been developed for analysis of RNA-seq read count data, whereby RNA-seq artifacts are modeled successfully, as it was demonstrated. Here I consider employing limma and voom for our study.
Background
Ifat’s initial analysis (stage 1) used Ri, the rank transformed Sig statistic averaged over 8 known imprinted genes g∈{g1,...,g8} as the response variable of a normal linear model. My subsequent analysis extended this framework in several ways, fitting 88 various multiple regression models (and 88 additional simple regression models). These extensions include
- both separate and aggregated
- a few more known imprinted genes
My analysis (referred to here as stage 2) confirmed Ifat’s result (stage 1) suggesting loss of the 8 genes’ imprinting with age when they are averaged and added some new results:
- separate analysis of 16 known imprinted genes showed for some genes significant loss of imprinting while for others no change with age or even significant gain
- statistical power was too small for 3 of 16 genes (scarce data)
- age and institution had the largest effect, yet the age effect was quite mild
- the logistic model
logi.S
appeared the best of those tested based on model fit and interpretability with regards to data sets from 16 known imprinted genes - the normal linear model
nlm.S
has simpler interpretation but was found to be inadequate for untransformed S (due to the heteroscedasticity)
Some of the remaining questions:
- how well does/would the logistic model fit data from
- the 16 genes of stage 2?
- those classified by Ifat as novel imprinted genes?
- genes with possibly small parental bias?
- are there data transformations that support the normal linear model for all genes?
- how to increase power by borrowing strength across genes but still allow gene-gene variability?
The limma
package with its recent voom
extension appears a powerful tool to answer the above questions.
limma and voom
The figure, taken from the 2015 limma paper, showcases the features of limma
:
Its voom
extension takes care of two challenging features of RNA-seq data: strong heteroscedasticity and overdispersion (relative to the Poisson/binomial distributions). voom
“unlocks” the versatile and powerful toolbox of limma
by estimating the precision – the inverse error variance – for each observation (i.e. for each pair of tissue sample i and gene g) and using that precision as weight in the normal linear model. Here is the illustration of how precision for two observations is obtained by voom
.
Adoption to the present study
To make the data accessible to limma
and voom
, allele-specific read counts should be used instead of the S statistic derived from those. This means that for each individual i and gene g we would have at least one pair of read counts: one for parental allele A and another for allele B. limma
could then be used to detect differential expression between the two alleles.
Let {s:s∈(i,g)} be all k heterozygous SNP variant sites that occur in individual i and gene g and let Is indicate the event that the reference variant is maternal. Because that event is random (unless we have prior knowledge that g is imprinted maternally or paternally), there are 2k allele configurations for (i,g). A rule – used in the definition of S – that could reduce the number of configurations to 1 would be summing the higher read counts (and the lower read counts) over all s.
This table compares limma
(potential stage 3 analysis) with two modeling approaches nlm.R
and logi.S
from stage 1 and 2, respectively.
nlm.R | logi.S | limma + voom | |
---|---|---|---|
response | R: rank transformed S | S | logcpmA,logcpmB |
precision weights / overdispersion | no / no | yes / no | yes / yes |
gene-gene variability | w/o borrowing of strength | w/o borrowing of strength | yes |
borrowing of strength (genes) | w/o variability | w/o variability | yes |
age dependence H_0 | βage=0 | βage=0 | βage,A=βage,B |
effect size / interpretation | no | yes / difficult | yes / simple |
overall parental bias H_0 | no | β0=1/2 | β0,A=1/2 and β0,B=1/2 |
genome-wise applicability | yes | ? | yes |
proportion of H_0 | no | ? | yes |
gene set testing | no | no | yes |