The R packages limma and voom have been developed for analysis of RNA-seq read count data, whereby RNA-seq artifacts are modeled successfully, as it was demonstrated. Here I consider employing limma and voom for our study.
Background
Ifat’s initial analysis (stage 1) used , the rank transformed statistic averaged over 8 known imprinted genes as the response variable of a normal linear model. My subsequent analysis extended this framework in several ways, fitting 88 various multiple regression models (and 88 additional simple regression models). These extensions include
- both separate and aggregated
- a few more known imprinted genes
My analysis (referred to here as stage 2) confirmed Ifat’s result (stage 1) suggesting loss of the 8 genes’ imprinting with age when they are averaged and added some new results:
- separate analysis of 16 known imprinted genes showed for some genes significant loss of imprinting while for others no change with age or even significant gain
- statistical power was too small for 3 of 16 genes (scarce data)
- age and institution had the largest effect, yet the age effect was quite mild
- the logistic model
logi.S
appeared the best of those tested based on model fit and interpretability with regards to data sets from 16 known imprinted genes - the normal linear model
nlm.S
has simpler interpretation but was found to be inadequate for untransformed (due to the heteroscedasticity)
Some of the remaining questions:
- how well does/would the logistic model fit data from
- the 16 genes of stage 2?
- those classified by Ifat as novel imprinted genes?
- genes with possibly small parental bias?
- are there data transformations that support the normal linear model for all genes?
- how to increase power by borrowing strength across genes but still allow gene-gene variability?
The limma
package with its recent voom
extension appears a powerful tool to answer the above questions.
limma and voom
The figure, taken from the 2015 limma paper, showcases the features of limma
:
Its voom
extension takes care of two challenging features of RNA-seq data: strong heteroscedasticity and overdispersion (relative to the Poisson/binomial distributions). voom
“unlocks” the versatile and powerful toolbox of limma
by estimating the precision – the inverse error variance – for each observation (i.e. for each pair of tissue sample and gene ) and using that precision as weight in the normal linear model. Here is the illustration of how precision for two observations is obtained by voom
.
Adoption to the present study
To make the data accessible to limma
and voom
, allele-specific read counts should be used instead of the statistic derived from those. This means that for each individual and gene we would have at least one pair of read counts: one for parental allele and another for allele . limma
could then be used to detect differential expression between the two alleles.
Let be all heterozygous SNP variant sites that occur in individual and gene and let indicate the event that the reference variant is maternal. Because that event is random (unless we have prior knowledge that is imprinted maternally or paternally), there are allele configurations for . A rule – used in the definition of – that could reduce the number of configurations to 1 would be summing the higher read counts (and the lower read counts) over all .
This table compares limma
(potential stage 3 analysis) with two modeling approaches nlm.R
and logi.S
from stage 1 and 2, respectively.
nlm.R | logi.S | limma + voom | |
---|---|---|---|
response | R: rank transformed S | S | |
precision weights / overdispersion | no / no | yes / no | yes / yes |
gene-gene variability | w/o borrowing of strength | w/o borrowing of strength | yes |
borrowing of strength (genes) | w/o variability | w/o variability | yes |
age dependence H_0 | |||
effect size / interpretation | no | yes / difficult | yes / simple |
overall parental bias H_0 | no | and | |
genome-wise applicability | yes | ? | yes |
proportion of H_0 | no | ? | yes |
gene set testing | no | no | yes |