The present article extends previous inference of regression parameters (see extending ANOVA) with a conditional analysis. Operationally, conditioning takes a subset of all observations (i.e. individuals/RNA preparations) defined by a condition. In the present, rather special, case a condition is defined by a single level of a single factor-type (i.e. categorical) predictor/explanatory variable. However, more general conditioning setups could also be explored. In that case multiple predictors might be selected, of which some might be continuous (i.e. a covariate) while others categorical, and for each predictor a possibly compound event (e.g. multiple levels of a factor) might be taken.

The choice of Institution and Gender as selected predictors for conditioning is motivated by the fact that they are categorical with 3 and 2 levels, respectively, and that they are clearly associated with Age (see “scatter plot matrices” in this article).

Preparations

Load libraries and my custom functions.

library(lattice)
library(latticeExtra)
## Loading required package: RColorBrewer
source("~/projects/monoallelic-brain/src/import-data.R")
source("~/projects/monoallelic-brain/src/fit-glms.R")

Next:

  1. import data
  2. create subsets (conditioning)
  3. fit models (logi.S, wnlm.Q)
  4. calculate confidence intervals (CI)
E <- get.predictors() # default arguments
# updated gene set
gene.ids <- unlist(read.csv("../../data/genes.regression.new", as.is = TRUE))
names(gene.ids) <- gene.ids
Y <- get.readcounts(gene.ids = gene.ids, count.thrs = 0)
# predictors for conditioning
s.p <- c("Institution", "Gender")
# exclude unweighed aggregates UA.8 and UA from fitting
Z <- Y[grep("^UA(.8)?$", names(Y), value = TRUE, invert = TRUE)]
M <- list(logi.S = l.l.do.all.fits.sset(s.p, Z, E, e.vars, "logi.S"),
          wnlm.Q = l.l.do.all.fits.sset(s.p, Z, E, e.vars, "wnlm.Q"))
# regression coefficients: estimates and CI at 99 % level
Betas <- lapply(M, function(l3m) {
    df <- do.call(make.groups,
                  lapply(l3m,
                         function(l2m)
                             do.call(make.groups,
                                     lapply(l2m,
                                            get.estimate.CI, conf.lev = 0.99))))
    names(df)[-(1:5)] <- c("Level", "Cond.Var")
    return(df)
})

Results

In the following plots the effect of 5 conditions (MSSM, …, Female) is explored on the inferred . All denotes the full set of observations (no conditioning), as done before. Note the following

plot of chunk beta-age-cond-wnlm-Q

plot of chunk beta-age-cond-logi-S

The same results plotted with different style

plot of chunk beta-age-cond-wnlm-Q-2

For presentations

plot of chunk beta-age-cond-wnlm-Q-2-present

plot of chunk beta-age-cond-logi-S-2

Exclude those genes for which the fit was deemed insufficient. Requires the file model-checking.csv, which was generated in a subsequent analysis.

plot of chunk beta-age-cond-logi-S-2-skip

Conclusion

The main results and their interpretation can be summarized as follows:

How to deal with the present limitations?

Suggested alternatives, roughly in increasing model complexity

  1. under the present models interpret the results more carefully
    • pro: no further analysis required
    • con: the conclusion that age has effect on allelic imbalance remains weakly established
  2. linear modeling with voom/limma
    • pro: (1) allows fitting normal linear models on suitably transformed with the advantage of more power than wnlm.Q and (expectedly) more robustness to nonlinearity than logi.S; also (2), voom/limma uses a more detailed form of data, at the level of SNPs instead of genes, potentially using available information more efficiently; (3) mature R packages are available
    • con: substantial additional analysis, including the recovery of the full (i.e. SNP-based) read count data
  3. extend present models with interaction terms such as that between Age and Institution
    • pro: can be performed on the readily available gene-based read count data
    • con: including too few interactions may not suffice correcting for nonlinearity whereas even a few interaction terms greatly increase the number of parameters, resulting in overfitting, therefore this approach might need to be combined with formal model selection implemented in for instance the stats package of R (see add1 and step functions) to find the best balance between model fit and parsimony
  4. use Bayesian networks (BN)
    • pro: even the most complex dependency structure can be captured by a BN
    • con: many details of statistics and software implementation must be worked out