Summary of our personal communication with Gabriel Hoffman.
- Extend regression analysis (both fixed and mixed effects models) to many more (perhaps all?) assessable genes.
- Consolidate and employ both fixed and mixed effects models
- compare results from mixed models to those from the corresponding fixed models in order to resolve present discrepancies
- use varPartConfInf for confidence intervals (see the sketch after this list)
- Include HPCC data set; Gabriel didn’t think it was as crucial as the improvements in statistical methodology he suggested.
- Adapt filtering/correction procedures for cleaner read counts
- consult the papers from Lappalainen and colleagues: Lappalainen 2013 (Nature), Castel 2015 (tools and best practices), and Baran 2015 (the landscape of genomic imprinting in human tissues), as well as Gabriel’s methodological description (Hoffman 2016) based on those studies
- reference allele bias
- imputed dosage between 0.99 and 1.01
- Extend predictors (covariates) with additional ones present in CMC
- Fit simple regression models akin to Figure 5 “Fitting four families of generalized linear models…” in order to qualitatively demonstrate the results from the multiple regression models. In other words, extend Fig 5 to more genes and predictors.
- Consider other tools and statistical approaches
- the GATK walker ASEReadCounter
- the Bayesian model comparison framework by Pirinen 2015 and the MAMBA software implementing it
- Knowles 2015
- Somehow use available knowledge on DNA methylation and age. A name (Steve Hoggarth?) was mentioned.
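To make the fixed versus mixed model consolidation and the confidence interval point concrete, here is a minimal R sketch. The response Q, the covariate names, and the objects expr and info are placeholders for illustration, and varPartConfInf is called with the usual variancePartition convention (expression matrix, formula, per-sample data), which is an assumption about the exact call rather than our actual code.

```r
## Minimal sketch; `d` is assumed to hold, for one gene, a response Q (an allelic
## bias statistic) and per-sample covariates -- all names are placeholders.
library(lme4)
library(variancePartition)

fixed.fit <- lm(Q ~ Age + Dx + Institution, data = d)          # fixed effects model
mixed.fit <- lmer(Q ~ Age + Dx + (1 | Institution), data = d)  # mixed effects model
## compare the two fits (e.g. the Age and Dx coefficients) to resolve discrepancies

## confidence intervals across genes, as suggested by Gabriel; `expr` (genes x
## samples matrix) and `info` (per-sample covariate data.frame) are assumed objects
ci <- varPartConfInf(expr, ~ Age + (1 | Dx) + (1 | Institution), info)
```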
Already done
- Create table explaining the meaning of model names such as wnlm.Q
- Carefully word the causal interpretation: allelic bias -> SCZ
Possible improvements
Here I assume we want to keep our earlier two-step approach (first obtaining read counts, then modeling them):
- obtain higher and lower read counts
- correct for various errors
- aggregate over all het-SNP sites within gene
- model the aggregated counts in order to call imprinted genes and study their dependence on predictors
Note that a few one-step approaches have been published, in which a single statistical model explicitly incorporates all sources of technical variation (errors), biological variation, and performs aggregation. These one-step approaches are more powerful and robust but are not readily applicable to our specific dataset and aims.
Obtaining higher and lower read counts
- reference allele bias (mapping)
- genotyping/imputation error
- previously used by us: a qualitative post-hoc test (ref/non-ref test) that is subjective and therefore not implemented computationally
- the fact that such post-hoc correction was necessary suggests that genotyping/imputation error was not sufficiently taken into account
- phasing error (aggregation)
- the assignment of each SNP variant at site $s$ to the higher (lower) expressed allele, based on the higher and lower read counts at $s$: $H_{is} = \max(r_{is}, a_{is})$ and $L_{is} = \min(r_{is}, a_{is})$, where $r_{is}$ and $a_{is}$ are the reference and non-reference read counts. Our current simple rule is that if $r_{is} \ge a_{is}$ then the reference variant is assigned to the higher expressed allele, otherwise to the lower one (sketched in code after this subsection).
For some of these points there exist recommended tools and best practices (see Castel 2015 and references therein). For the phasing error various approximations can be envisioned based on probabilistic sampling from the space of haplotypes given some estimate of the expected allele ratio for transcripts.
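A minimal R sketch of the current simple rule under assumed column names: `snp` is taken to hold one row per het-SNP site with the gene symbol and the reference and non-reference read counts.

```r
## Sketch of the current simple rule: at each het-SNP site the more highly
## covered variant defines the "higher" expressed allele; counts are then
## summed over all het-SNP sites of a gene.  Column names are placeholders.
aggregate.counts <- function(snp) {
  snp$H <- pmax(snp$ref.count, snp$alt.count)   # per-site higher read count
  snp$L <- pmin(snp$ref.count, snp$alt.count)   # per-site lower read count
  ## note: without phasing, sites of the same gene may end up assigned to
  ## different physical alleles -- the phasing error discussed above
  aggregate(cbind(H, L) ~ gene, data = snp, FUN = sum)
}
```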
Modeling
Previously we considered the ratio $S_{ig} = H_{ig} / N_{ig}$, where $N_{ig} = H_{ig} + L_{ig}$ is the total read count for individual $i$ and gene $g$. It was challenging to find a generalized linear regression model (GLM) that fits the data $\{S_{ig} : i = 1, \dots, I\}$ for all genes $g$, where $I$ is the number of individuals (samples). No single link function (e.g. linear, sigmoidal or scaled sigmoidal) appeared good enough and no error distribution could capture the overdispersion: the binomial distribution of the logi.S and logi2.S models was of course better than the homoscedastic normal model, but not good enough in general. Eventually, a quasi-log transformation was applied to $S_{ig}$ and a normal linear model was fitted (wnlm.Q) based on model checking diagnostics.
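For illustration, a minimal R sketch of the two previous model families for a single gene; here the logit is used as a stand-in for the quasi-log transformation and the total read count as a stand-in for the weights, both of which are assumptions rather than the exact definitions behind wnlm.Q.

```r
## Single-gene illustration; `d` is assumed to hold H, L and a covariate (Age).
## The logit transform and total-count weights below are illustrative stand-ins.
d$N <- d$H + d$L
d$S <- d$H / d$N
logi.S <- glm(cbind(H, L) ~ Age, family = binomial, data = d)  # binomial GLM with logit link
d$Q <- log(d$S / (1 - d$S))    # stand-in transform; S = 1 would need a continuity correction
wnlm.Q.like <- lm(Q ~ Age, data = d, weights = N)              # weighted normal linear model
```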
Two alternatives are sketched below, both of which model $H_{ig}$ given $L_{ig}$ instead of $S_{ig}$ given $N_{ig}$. Both alternatives are inspired by published and well-tested statistical approaches and therefore
- both approaches might afford more power and robustness than our previous approach using the wnlm.Q and logi.S models
- they might be general enough to fit the data for all assessable genes and not only the 30 genes selected earlier
- therefore, they may allow both genome-wide ranking of genes and genome-wide investigation of how allelic bias depends on biological predictors
Approach 1: negative binomial regression
The defining features are
- log link function: $\log \mathrm{E}[H_{ig}] = \log \mu_{ig} = x_i^\top \beta_g$, where $x_i$ is the vector of predictor values for individual $i$ and $\beta_g$ the coefficient vector for gene $g$
- negative binomial distribution of $H_{ig}$ given its mean $\mu_{ig}$ as well as the parameter $\theta_g$, which controls dispersion.
Unfortunately this model cannot be readily applied to our data because the lower read counts $L_{ig}$ also depend on the individual $i$ (or equivalently on the predictors $x_i$). Perhaps $H_{ig}$ could be normalized using $L_{ig}$ or $N_{ig}$ separately for each gene $g$.
Note that the parameter $\theta_g$ may be interpreted as the number of failures whereas the negative binomial count variable as the number of successes. But $H_{ig}$ might also be given the same interpretation (number of successes). What, then, is the relationship between $\theta_g$ and $L_{ig}$? This remains to be worked out.
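For concreteness, here is how such a per-gene negative binomial regression might look in R with MASS::glm.nb, using $\log L_{ig}$ as an offset; as noted above, normalizing by the lower read count is itself questionable, so the offset is only one possible choice, and the data frame and covariate names are placeholders.

```r
## Sketch of negative binomial regression for a single gene g; `d` is assumed to
## hold the higher (H) and lower (L) read counts and covariates per individual.
## offset(log(L)) is one possible normalization -- see the caveat above.
library(MASS)

nb.fit <- glm.nb(H ~ Age + Dx + offset(log(L)), data = d)
summary(nb.fit)   # coefficient estimates (beta_g) on the log scale
nb.fit$theta      # the dispersion-controlling parameter theta_g
```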
Approach 2: transformation and linear model
- linear (identity) link function: $\mathrm{E}[Y_{ig}] = x_i^\top \beta_g$, where $Y_{ig} = \log(H_{ig} / L_{ig})$ is the log odds
- normal distribution of the log odds $Y_{ig}$ given its mean $x_i^\top \beta_g$ and variance $\sigma^2_g / w_{ig}$, where $w_{ig}$ is a precision weight.
Challenge: estimation of the weights $w_{ig}$; (how) could this be done using voom/limma?
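One possible route, sketched below under loud assumptions rather than as the answer to the voom question: a delta-method approximation gives $\mathrm{Var}[\log(H_{ig}/L_{ig})] \approx 1/H_{ig} + 1/L_{ig}$, so $w_{ig} = (1/H_{ig} + 1/L_{ig})^{-1}$ can serve as first-pass precision weights passed to limma's lmFit via its weights argument; whether voom's mean-variance trend estimation can itself be adapted from log counts per million to log odds remains open.

```r
## Minimal sketch, assuming `H` and `L` are genes x samples matrices of higher and
## lower read counts and `design` is the model matrix of predictors (all names
## are placeholders).  The delta-method weights are an assumption, not a
## voom-derived mean-variance trend.
library(limma)

## zero counts would need a pseudocount or filtering before this step
Y <- log(H / L)              # log odds of the higher over the lower allele
W <- 1 / (1 / H + 1 / L)     # delta-method approximation of precision

fit <- eBayes(lmFit(Y, design, weights = W))
topTable(fit, coef = "Age")  # e.g. genes ranked by dependence on Age (assumed design column)
```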