Summary of our personal communication with Gabriel Hoffman.

  1. Extend regression analysis (both fixed and mixed effects models) to many more (perhaps all?) assessable genes.
  2. Consolidate and employ both fixed and mixed effects models
    • compare results from mixed models to those from the corresponding fixed models in order to resolve the present discrepancies
    • use varPartConfInf for confidence intervals (see the variancePartition sketch after this list)
  3. Include HPCC data set; Gabriel didn’t think it was as crucial as the improvements in statistical methodology he suggested.
  4. Adapt filtering/correction procedures for cleaner read counts
    • consult the papers from the Lappalainen group: Lappalainen 2013 Nature, Castel 2015 (tools and best practices), and Baran 2015 (the landscape of genomic imprinting in human tissues), as well as Gabriel’s methodological description (Hoffman 2016), which builds on the above-mentioned studies
    • reference allele bias
    • imputed dosage between 0.99 and 1.01 (see the filtering sketch after this list)
  5. Extend predictors (covariates) with additional ones present in CMC
  6. Fit simple regression models akin to Figure 5 “Fitting four families of generalized linear models…” in order to qualitatively corroborate the results from the multiple regression models. In other words, extend Fig 5 to more genes and predictors.
  7. Consider other tools and statistical approaches
  8. Somehow use available knowledge on DNA methylation and age. A name (Steve Horvath?) was mentioned.
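
A minimal sketch of item 2 is given below, using the toy data shipped with the variancePartition package rather than our CMC data; the formula mirrors the package vignette, and the call pattern of varPartConfInf is assumed to follow that of fitExtractVarPartModel.

```r
## Variance partitioning with a mixed model and confidence intervals (item 2).
library(variancePartition)
data(varPartData)   # toy objects: geneExpr (genes x samples) and info (covariates)

## fixed effect: Age; random effects: Individual, Tissue, Batch (vignette formula)
form <- ~ Age + (1 | Individual) + (1 | Tissue) + (1 | Batch)

## per-gene variance fractions (point estimates) and the usual summary plot
vp <- fitExtractVarPartModel(geneExpr, form, info)
plotVarPart(sortCols(vp))

## confidence intervals of the variance fractions, as suggested by Gabriel
## (restricted to a few genes here; arguments assumed to match the call above)
ci <- varPartConfInf(geneExpr[1:10, ], form, info)
```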
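
The het-SNP filtering in item 4 could start with something like the sketch below; only the 0.99–1.01 dosage window comes from the discussion above, while the table and column names (snp_counts, dosage, ref_count, alt_count) are assumptions for illustration.

```r
## Keep only confidently heterozygous SNPs (imputed dosage ~ 1) and inspect
## the reference read fraction as a first look at reference allele bias.
library(dplyr)

## toy stand-in for the real per-SNP table
snp_counts <- data.frame(
  snp       = c("rs1", "rs2", "rs3"),
  dosage    = c(1.00, 0.97, 1.005),
  ref_count = c(18, 25, 40),
  alt_count = c(15, 2, 31)
)

het_snps <- snp_counts %>%
  filter(between(dosage, 0.99, 1.01)) %>%
  mutate(ref_fraction = ref_count / (ref_count + alt_count))

summary(het_snps$ref_fraction)   # a genome-wide shift above 0.5 would suggest reference bias
```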

Already done

Possible improvements

Here I assume we want to keep our earlier two-step approach:

  1. obtain higher and lower read counts
    • correct for various errors
    • aggregate over all het-SNP sites within gene
  2. modeling in order to call imprinted genes and study dependence on predictors

Note that a few one-step approaches have been published, in which a single statistical model explicitly incorporates all sources of technical variation (errors), biological variation, and performs aggregation. These one-step approaches are more powerful and robust but are not readily applicable to our specific dataset and aims.

Obtaining higher and lower read counts

For some of these correction steps there exist recommended tools and best practices (see Castel 2015 and references therein). For the phasing error, various approximations can be envisioned based on probabilistic sampling from the space of haplotypes, given some estimate of the expected allele ratio for transcripts.
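
The sampling idea could be prototyped along the lines below; this is a toy illustration under strong simplifying assumptions (independent per-SNP phase draws guided by a binomial likelihood at an assumed expected allele ratio p_hat), not a validated correction for phasing error.

```r
## For one gene in one individual: sample phase configurations of the het-SNPs
## and aggregate higher (H) and lower (L) read counts over the draws.
sample_phase_aggregate <- function(ref, alt, p_hat = 0.5, n_draws = 1000) {
  stopifnot(length(ref) == length(alt))
  ## probability that the reference allele of each SNP sits on the higher-expressed
  ## haplotype, from a binomial likelihood at the expected allele ratio p_hat
  w_hi <- dbinom(ref, ref + alt, p_hat)
  w_lo <- dbinom(ref, ref + alt, 1 - p_hat)
  p_ref_on_higher <- w_hi / (w_hi + w_lo)
  draws <- replicate(n_draws, {
    on_higher <- rbinom(length(ref), size = 1, prob = p_ref_on_higher) == 1
    c(H = sum(ifelse(on_higher, ref, alt)),
      L = sum(ifelse(on_higher, alt, ref)))
  })
  rowMeans(draws)   # read counts averaged over the sampled phase configurations
}

## toy usage: three het-SNPs of one gene in one individual
sample_phase_aggregate(ref = c(12, 30, 7), alt = c(3, 9, 18), p_hat = 0.8)
```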

Modeling

Previously we considered the ratio $S_{ig} = H_{ig} / T_{ig}$, where $H_{ig}$ and $L_{ig}$ denote the higher and lower read counts for individual $i$ and gene $g$ (aggregated over the gene's het-SNP sites) and $T_{ig} = H_{ig} + L_{ig}$ is the total read count. It was challenging to find a generalized linear regression model (GLM) that fits the data $\{S_{ig}\}_{i=1}^{I}$ for all genes $g$, where $I$ is the number of individuals (samples). No single link function (e.g. linear, sigmoidal or scaled sigmoidal) appeared good enough, and no error distribution could capture the overdispersion: the binomial distribution of the logi.S and logi2.S models was of course better than the homoscedastic normal model, but not good enough in general. Eventually, a quasi-log transformation was applied to $S_{ig}$ and a normal linear model was fitted (wnlm.Q) based on model checking diagnostics.
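
For reference, the kind of per-gene comparison described above can be reproduced on toy data as in the sketch below; this is only in the spirit of the logi.S/wnlm exercise (toy counts, a single Age predictor) and does not reproduce the actual quasi-log transformation.

```r
## Toy comparison of a binomial (logistic) model on the counts with a
## homoscedastic normal model on the ratio S, plus residual diagnostics.
set.seed(1)
d <- data.frame(Age = sample(20:90, 50, replace = TRUE))
a <- rnbinom(50, mu = 60, size = 5)
b <- rnbinom(50, mu = 60, size = 5)
d$H <- pmax(a, b)                 # higher read count (toy, overdispersed)
d$L <- pmin(a, b)                 # lower read count
d$S <- d$H / (d$H + d$L)          # higher read fraction

fit.binom  <- glm(cbind(H, L) ~ Age, family = binomial, data = d)   # binomial / logistic
fit.normal <- lm(S ~ Age, data = d)                                  # homoscedastic normal

## model checking: residuals vs fitted for both models
par(mfrow = c(1, 2)); plot(fit.binom, which = 1); plot(fit.normal, which = 1)
```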

Two alternatives are sketched below, both of which model $L_{ig}$ given $H_{ig}$ instead of $S_{ig}$ given $T_{ig}$. Both alternatives are inspired by published and well-tested statistical approaches and are therefore expected to be more robust and better supported by existing software.

Approach 1: negative binomial regression

The defining features are

  1. log link function: $\log \mu_{ig} = x_i^\top \beta_g$, where $\mu_{ig}$ is the expected value of $L_{ig}$ and $x_i$ is the vector of predictors for individual $i$
  2. negative binomial distribution of $L_{ig}$ given its mean $\mu_{ig}$ as well as parameter $\theta$, which controls dispersion.

Unfortunately this model cannot readily be applied to our data because the lower read counts $L_{ig}$ also depend on the total read counts $T_{ig}$ (or equivalently on $H_{ig}$). Perhaps $L_{ig}$ could be normalized using $T_{ig}$ or $H_{ig}$ separately for each gene $g$.

Note that parameter $\theta$ may be interpreted as a number of failures whereas the response $L_{ig}$ as a number of successes. But $H_{ig}$ might also be given the same interpretation (number of successes). What is the relationship between $\theta$ and $H_{ig}$? This remains to be worked out.
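
A concrete starting point for Approach 1 is sketched below with MASS::glm.nb, implementing the normalization idea above as an offset on the logged total read count; the toy data, the single Age predictor and the offset choice are assumptions rather than a settled specification.

```r
## Negative binomial regression of the lower read count for one gene,
## normalized by the total read count via an offset (log link).
library(MASS)

set.seed(1)
d <- data.frame(Age = sample(20:90, 100, replace = TRUE))
d$Tot <- pmax(rnbinom(100, mu = 200, size = 10), 1)   # total read count per individual
d$L   <- rbinom(100, size = d$Tot, prob = 0.25)       # lower read count (toy allelic bias)

## log link with offset: log mu_i = log T_i + beta_0 + beta_Age * Age_i
fit <- glm.nb(L ~ Age + offset(log(Tot)), data = d)
summary(fit)   # coefficients on the log scale
fit$theta      # estimated dispersion parameter theta
```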

Approach 2: transformation and linear model

  1. identity (linear) link function: $\nu_{ig} = x_i^\top \beta_g$
  2. normal distribution of the log odds $y_{ig} = \log(H_{ig} / L_{ig})$ given its mean $\nu_{ig}$ and variance $\sigma^2_g / w_{ig}$, where $w_{ig}$ is a precision weight.

Challenge: estimation of the weights $w_{ig}$; could this be done using voom/limma, and how?
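
One way the question could be explored is sketched below on toy data: estimate observation-level precision weights from the total read counts with limma's voom and reuse them in a weighted linear fit of the log odds with lmFit. Whether weights estimated from the totals transfer to the log odds is exactly the open point, and the toy data and Age predictor are assumptions.

```r
## voom weights from the totals, then a weighted linear model of log(H/L).
library(limma)
set.seed(1)

n_gene <- 200; n_sample <- 40
covariates <- data.frame(Age = sample(20:90, n_sample, replace = TRUE))
design <- model.matrix(~ Age, data = covariates)

## toy genes x samples allele-specific read counts (H and L loosely play the
## roles of the higher and lower counts)
Tot <- matrix(rnbinom(n_gene * n_sample, mu = 100, size = 5), n_gene, n_sample) + 2
H   <- matrix(rbinom(length(Tot), size = Tot, prob = 0.7), n_gene, n_sample)
L   <- Tot - H
log_odds <- log((H + 0.5) / (L + 0.5))   # pseudo-counts guard against zeros

## observation-level precision weights from the mean-variance trend of the totals
v <- voom(Tot, design, plot = FALSE)

## weighted linear model of the log odds, reusing voom's weights
fit <- eBayes(lmFit(log_odds, design, weights = v$weights))
topTable(fit, coef = "Age")
```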