For certain genes g allelic bias was found to depend on biological predictors (Age, Dx, Ancestry) in our previous analysis. That analysis fitted various fixed effects multiple regression models (GLM family) using as response variable the read count ratio Sig or its quasi-log transformed version Qig. After checking model fit, we primarily based our results to a weighted, normal linear model using Qig, which we called wnlm.Q (as opposed to its unweighted variation unlm.Q). Significant dependence of the allelic bias of gene g on (some level of) the predictor p was assessed in terms of the estimated regression coefficient ˆβpg.

But decomposition of variance using variancePartition showed that a predictor that significantly affected allelic bias for a certain gene did not necessarily explain substantial fraction of variance. This was the case in spite of using the same response and predictors in the mixed effects model for variancePartition as for fixed effects models like wnlm.Q. A notable difference is that the mixed effects model was based on the unweighted unlm.Q rather than the weighted wnlm.Q.

What is the reason for this discrepancy? This question can be approached based on two cases giving altogether four explanations:

  1. ˆβpg under the fixed effects model largely deviates from that under the mixed effects model because of
    • treating certain predictors’ effect random instead of fixed impacts ˆβpg for other predictors
    • weighting, which was used for the fixed effects model wnlm.Q but not for the mixed effects model
    • programming bug
  2. ˆβpg under the fixed effects agrees with that under the mixed effects model
    • in this case the discrepancy arises from the related but different quantities of the two analyses: significance of βpg0 is established using a Tpg-statistic whereas variance partitioning operates with the fractional variance explained ˆσ2pg/ˆσ2total (more details below)

The analysis below excludes programming bug and appears to suggest that the discrepancy is due to a mixture of the remaining three explanations.

Calculations

library(variancePartition)
library(doParallel)
library(lattice)
library(latticeExtra)
lattice.options(default.args = list(as.table = TRUE))
lattice.options(default.theme = "standard.theme")
opts_chunk$set(dpi = 144)
opts_chunk$set(out.width = "700px")
opts_chunk$set(dev = c("png", "pdf"))
source("../../src/import-data.R")
source("../../src/graphics.R")
source("../../src/fit-glms.R")

Import data

# selected set of genes
gene.ids <- unlist(read.csv("../../data/genes.regression.new", as.is = TRUE))
names(gene.ids) <- gene.ids
# predictors
names(e.vars) <- e.vars
E <- get.predictors()[e.vars]
# response: Q, quasi-log transformed read count ratio
Y <- get.readcounts(gene.ids = gene.ids, count.thrs = 0)
Q <- data.frame(sapply(Y[gene.ids], getElement, "Q"), check.names = FALSE)
# response: S, untransformed
S <- data.frame(sapply(Y[gene.ids], getElement, "S"), check.names = FALSE)
# total read count, used as weights
N <- data.frame(sapply(Y[gene.ids], getElement, "N"), check.names = FALSE)

Fit various fixed effect models to data for 30 selected genes

M <- do.all.fits(Z = Y[gene.ids], G = E)

Define model formula for fitting using the fitVarPartModel

(fchar <- as.character(formula(M$unlm.Q$MEST)))
## [1] "~"                                                                                                                       
## [2] "Y"                                                                                                                       
## [3] "Age + Institution + Gender + PMI + Dx + RIN + RNA_batch + Ancestry.1 + Ancestry.2 + Ancestry.3 + Ancestry.4 + Ancestry.5"
fm <- list()
fm$fixed.1 <- as.formula(paste(fchar[1], fchar[3]))
fm$mixed.1 <- ~ Age + (1|Institution) + (1|Gender) + PMI + (1|Dx) + RIN +(1|RNA_batch) + Ancestry.1 + Ancestry.2 + Ancestry.3 + Ancestry.4 + Ancestry.5
fm$mixed.2 <- ~ Age + (1|Institution) + Gender + PMI + Dx + RIN +(1|RNA_batch) + Ancestry.1 + Ancestry.2 + Ancestry.3 + Ancestry.4 + Ancestry.5

Fit fixed and mixed effects models using variancePartition

M$fixed.1 <- fitVarPartModel(t(Q), fm$fixed.1, E)
M$mixed.1 <- fitVarPartModel(t(Q), fm$mixed.1, E)
## Projected memory usage: > 3.2 Mb 
## Projected run time: ~ 0.05 min
# the next expressions give errors
M$mixed.1.w <- fitVarPartModel(t(Q), fm$mixed.1, E, weightsMatrix = t(N))
## Projected memory usage: > 3.2 Mb 
## Projected run time: ~ 0.09 min
## Error in {: task 1 failed - "Invalid grouping factor specification, Institution"
M$mixed.2 <- fitVarPartModel(t(Q), fm$mixed.2, E)
## Projected memory usage: > 3 Mb 
## Projected run time: ~ 0.02 min
## Error in checkModelStatus(fitInit, showWarnings = showWarnings, colinearityCutoff): Categorical variables modeled as fixed effect: Gender 
## The results will not behave as expected and may be very wrong!!

Weights for fitting are expected from voom; using total read counts as weights failed with an error for some reason. Also, there’s an error, when some factors are treated to convey fixed while the others as random effects.

# extract variance partitions from model objects
vp <- lapply(M[c("fixed.1", "mixed.1")], function(m) extractVarPart(m))
# reshape into long format
vpl <- data.frame(lapply(vp, function(v) stack(v[ , c(e.vars, "Residuals")])$values))
vpl$predictor <- factor(rep(c(e.vars, "Residuals"), each = length(gene.ids)), ordered = TRUE, levels = c(e.vars, "Residuals"))
vpl$gene <- factor(rep(gene.ids, length(c(e.vars, "Residuals"))), ordered = TRUE, levels = gene.ids)
vpl <- cbind(vpl[c("predictor", "gene")], vpl[- grep("predictor|gene", names(vpl))])
vpll <- reshape(vpl, varying = c("fixed.1", "mixed.1"), v.names = "fractional.variance",
                timevar = "model.type", times = c("fixed.1", " mixed.1"), direction = "long")

Extract ˆβpg for each gene g under each model:

coef.names <- names(coef(M$unlm.Q[[1]]))
cf <- data.frame(lapply(M[- length(M)],
                        function(l.m)
                            stack(lapply(l.m,
                                         function(m)
                                             coef(m)[ coef.names ]))$values))
cf$mixed.1 <-
    stack(lapply(M$mixed.1,
                 function(m)
                     unlist(coef(m)[[1]][1, , drop = TRUE])[ coef.names ]))$values
#cf$coefficient <- rep(coef.names, length(gene.ids))
cf$coefficient <- factor(rep(coef.names, length(gene.ids)), ordered = TRUE, levels = coef.names)
cf$gene <- factor(rep(gene.ids, each = length(coef.names)), ordered = TRUE, levels = gene.ids)
cf <- cbind(cf[c("coefficient", "gene")], cf[- grep("coefficient|gene", names(cf))])

Calculate 99% confidence intervals for each βpg (also extracts ˆβpg):

beta.hat.CI <- lapply(M[c("unlm.Q", "wnlm.Q")], get.estimate.CI)

Finally, load objects for plotting:

source("2017-02-14-beta-from-mixed-model.R")

Results

Recapitulation: regression coefficients βpg

The plot below shows for each coefficient p and gene g the estimated regression coefficients {ˆβpg} and 99% confidence intervals (CI) for a βpg under the wnlm.Q model. Note that these same results are presented in the current manuscript. Recall the close relationship between the CI and the p-value for rejecting the null hypothesis βpg=0.

plot of chunk wnlm-Q-CI

In the preceding calculations variance partitioning was based on not the wnlm.Q but the unlm.Q model, however. Therefore, a similar plot for the unlm.Q model comes next showing that weighting slightly changes {ˆβpg} and the CIs.

my.segplot(beta.hat.CI$unlm.Q, main = expression(paste("99 % CI for ", beta, " under unlm.Q")))

plot of chunk unlm-Q-CI

Variance partitioning

The next plot shows the fractional variance ˆσ2vg/ˆσ2total explained by each predictor v for each gene g under the fixed effects model unlm.Q.

tp <- trellis.par.get()
my.col <- c(rainbow(length(e.vars))[ c(seq(1, length(e.vars), by = 3), seq(2, length(e.vars), by = 3), seq(3, length(e.vars), by = 3)) ], "gray")
trellis.par.set(superpose.polygon = list(alpha = 1, col = my.col, border = "black", lty = 1, lwd = 0.5))
my.main <- "Variance partitioning: fixed effects model (unlm.Q)"
my.xlab <- "fractional variance"
barchart(gene ~ fractional.variance, data = vpll, groups = predictor, stack = TRUE, auto.key = list(columns = 3), xlim = c(0, 0.6), subset = model.type == "fixed.1", main = my.main)

plot of chunk fixed-varpart

Comparing coefficients to variance explained

Let v be a predictor and g a gene; the goal is to compare the set of regression the coefficient(s) {βpgp corresponds to v} to the fractional variance ˆσ2vg/ˆσ2total explained by v. Note that the set of coefficients has only one member if predictor v is a continuous variable or a factor with only one treatment level, in which case σ2vg=Var(X^βpg). In contrast, the set of coefficients has multiple members for factors with multiple treatment levels. Depending on the nature of predictor, the corresponding estimated coefficient(s) ˆβpg are on a particular scale. For this reason I use the t-statistic Tpg=ˆβpg/Var(ˆβpg), which normalizes the coefficients across all predictors not only in the sense of placing them onto the same scale but also in the sense that the if Tp1g=Tp2g then the corresponding p-values are also equal.

The plot indicates an expected positive correlation between Tpg and ˆσ2vg/ˆσ2total but the correlation is weakened by certain coefficients p. Such coefficients are the 7 treatment levels (RNA_batchB,…, RNA_batch0) of the factor RNA_batch; each level has in itself relatively weak impact (small Tpg and hence weak significance) but together they explain relatively much variation. Another example for an outlier coefficient is that for Age, which again tends to have relatively small TAgeg but large ˆσ2Ageg/ˆσ2total.

my.main <- "Quantities of effect size: fixed model (unlm.Q)"
my.xlab <- expression(paste(hat(sigma)[vg], " / ", hat(sigma)[tot], ", fractional variance explained"))
my.ylab <- expression(paste("|", t[pg], "|, normalized coefficient ", hat(beta)[pg]))
tval.vp.plot(tval.vp.data <- tval.vp(m.type = "fixed.1", llm = M), main = my.main, xlab = my.xlab, ylab = my.ylab)

plot of chunk tval-varpart-fixedplot of chunk tval-varpart-fixed

For presentation

update(tval.vp.plot(tval.vp.data <- tval.vp(m.type = "fixed.1", llm = M), xlab = my.xlab, ylab = my.ylab)[c(5:8, 21:24)], layout = c(4, 2))

plot of chunk tval-varpart-fixed-present

update(tval.vp.plot(tval.vp.data <- tval.vp(m.type = "fixed.1", llm = M), xlab = my.xlab, ylab = my.ylab)[c(3, 6, 14, 15, 24, 30)], layout = c(3, 2), key = my.key$coefficient.2)

plot of chunk tval-varpart-fixed-present-b

Introducing random effects

The above results were obtained under unlm.Q, a fixed effects model. Turning all categorical (factor-type) predictors from fixed into random effects results in the mixed.1 mixed effects model. The next plot compares t-statistics to fractional variance under this model. Similar trend emerges as for the fixed effects model above but the positive correlation between Tpg and ˆσ2vg/ˆσ2total is stronger. This is partly due to removing multilevel factors, like RNA_batch, from the comparison (since their effects are modeled here as random) and partly to Age following more closely the trend displayed by other covariates. Subsequent plots will explore this in more detail.

my.main <- "Quantities of effect size: mixed model (mixed.1)"
tval.vp.plot(tval.vp.mixed.1 <- tval.vp(m.type = "mixed.1", llm = M), main = my.main, xlab = my.xlab, ylab = my.ylab)

plot of chunk tval-varpart-mixedplot of chunk tval-varpart-mixed

Impact on variance partitioning

The next plot illustrates how introducing random effects changes variance partitioning in terms of fractional variance explained. Some predictors, like Age, PMI, Dx, or RNA_batch, explain less variation in the mixed effects model, while the residuals increase. This phenomenologically explains the stronger correlation between Tpg and ˆσ2vg/ˆσ2total seen above.

my.main <- "Variance partitioning: mixed vs fixed effects model"
my.xlab <- "fractional variance: fixed effects (unlm.Q)"
my.ylab <- "fractional variance: mixed effects (mixed.1)"
my.plot(x = "fixed.1", y = "mixed.1", group.by = "predictor", lbl.type = "gene", dt = vpl, main = my.main, xlab = my.xlab, ylab = my.ylab)

plot of chunk fixed-mixed-varpart

Another view on the same results: grouping by genes

update(my.plot(x = "fixed.1", y = "mixed.1", group.by = "gene", lbl.type = "predictor", dt = vpl, subset = vpl$predictor != "Residuals", main = my.main, xlab = my.xlab, ylab = my.ylab), scales = list(relation = "free"), layout = c(4, 4))

plot of chunk fixed-mixed-varpart-genesplot of chunk fixed-mixed-varpart-genes

Impact on regression coefficients

The impact of introducing random effects on Tpg statistics (normalized ˆβpg) is rather small:

plot of chunk mixed-1-fixed-tval

The result is similar for unnormalized ˆβpg:

plot of chunk mixed-1-fixed

plot of chunk mixed-1-fixed-coef

These differences really reflect the introduction of random effects and are not due to software bug because the two different implementations of the fitting of the same fixed effects model (unlm.Q) gives identical results:

with(cf, all.equal(unlm.Q, fixed.1))
## [1] TRUE

Introducing weighting

Because fitting weighted models using variancePartition failed (see Calculations above), the impact of weighting is only studied in terms of regression coefficients:

plot of chunk fixed-unweighted-weighted plot of chunk fixed-unweighted-weighted-coef