The R packages limma and voom have been developed for analysis of RNA-seq read count data, whereby RNA-seq artifacts are modeled successfully, as it was demonstrated. Here I consider employing limma and voom for our study.

Background

Ifat’s initial analysis (stage 1) used Ri, the rank transformed Sig statistic averaged over 8 known imprinted genes g{g1,...,g8} as the response variable of a normal linear model. My subsequent analysis extended this framework in several ways, fitting 88 various multiple regression models (and 88 additional simple regression models). These extensions include

My analysis (referred to here as stage 2) confirmed Ifat’s result (stage 1) suggesting loss of the 8 genes’ imprinting with age when they are averaged and added some new results:

Some of the remaining questions:

The limma package with its recent voom extension appears a powerful tool to answer the above questions.

limma and voom

The figure, taken from the 2015 limma paper, showcases the features of limma: limma F1

Its voom extension takes care of two challenging features of RNA-seq data: strong heteroscedasticity and overdispersion (relative to the Poisson/binomial distributions). voom “unlocks” the versatile and powerful toolbox of limma by estimating the precision – the inverse error variance – for each observation (i.e. for each pair of tissue sample i and gene g) and using that precision as weight in the normal linear model. Here is the illustration of how precision for two observations is obtained by voom.

Adoption to the present study

To make the data accessible to limma and voom, allele-specific read counts should be used instead of the S statistic derived from those. This means that for each individual i and gene g we would have at least one pair of read counts: one for parental allele A and another for allele B. limma could then be used to detect differential expression between the two alleles.

Let {s:s(i,g)} be all k heterozygous SNP variant sites that occur in individual i and gene g and let Is indicate the event that the reference variant is maternal. Because that event is random (unless we have prior knowledge that g is imprinted maternally or paternally), there are 2k allele configurations for (i,g). A rule – used in the definition of S – that could reduce the number of configurations to 1 would be summing the higher read counts (and the lower read counts) over all s.

This table compares limma (potential stage 3 analysis) with two modeling approaches nlm.R and logi.S from stage 1 and 2, respectively.

  nlm.R logi.S limma + voom
response R: rank transformed S S logcpmA,logcpmB
precision weights / overdispersion no / no yes / no yes / yes
gene-gene variability w/o borrowing of strength w/o borrowing of strength yes
borrowing of strength (genes) w/o variability w/o variability yes
age dependence H_0 βage=0 βage=0 βage,A=βage,B
effect size / interpretation no yes / difficult yes / simple
overall parental bias H_0 no β0=1/2 β0,A=1/2 and β0,B=1/2
genome-wise applicability yes ? yes
proportion of H_0 no ? yes
gene set testing no no yes