The present analysis investigates more thoroughly the significance of effect on read count ratio of various predictors for various genes. This is done by random permutations of the observed values of any given predictor and fitting the same models to both observed (unpermuted) and permuted data. This approach is thought to provide more accurate p-values and confidence intervals (CIs) than deriving those from only the observed data using the standard way based on normal distribution theory (see ch8.3, p371 in A.C Davison: Statistical Models), which is very sensitive to the quality of model fit. Indeed, a general finding from this analysis is that some genes show more unexpected distribution of CI than others and that variability is explained by diagnostics of model fit for the corresponding genes. See model checking for details.

Introduction

There are several limitations of the (generalized) linear models (e.g. wnlm.R, logi.S) of read count ratio regressed on predictors such as Age, Institution, etc. These limitations might lead to bias in the estimation of regression coefficients , on which some of the main conclusions from this study are based. In particular, a nonzero associated with some predictor indicates either significant effect, or bias. These are only the basic cases; typically we expect to deal with some combination of the two.

The analysis herein aims to distinguish between the two basic cases. This done by randomly re-permuting observations for each predictor (in other words individuals are randomly relabeled for each predictor). Any significant result associated with the permuted predictor must either come from bias or occur “by chance” with probability 1 less the chosen confidence level for .

Calculations

Preparations

## Loading required package: RColorBrewer
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer

Load functions

source("~/projects/monoallelic-brain/src/import-data.R")
source("~/projects/monoallelic-brain/src/fit-glms.R")
source("~/projects/monoallelic-brain/src/graphics.R")
source("2016-09-20-permuted-observations.R")
gene.ids <- unlist(read.csv("../../data/genes.regression.new", as.is = TRUE))
names(gene.ids) <- gene.ids

Get data: observations on predictors (explanatory variables) and on the higher and lower read count from selected genes (more details in a previous post)

E <- get.predictors() # default arguments
Y <- get.readcounts(gene.ids = gene.ids, count.thrs = 0)

Permutations

set.seed(1976)
perm.obs <- sample.int(nrow(E))
names(e.vars) <- e.vars
EP <- lapply(e.vars, function(v) { E1 <- E; E1[[v]] <- E[[v]][perm.obs]; return(E1) })
EP$Unpermuted <- E
set.seed(1976)
perms <- data.frame(cbind(seq_len(nrow(E)), replicate(20, sample.int(nrow(E)))))
names(perms) <- c("U", paste0("P", seq_len(length(perms) - 1)))

Model fitting and CI calculation

# exclude unweighed aggregates UA.8 and UA from fitting
to.fit.ids <- grep("^UA(.8)?$", names(Y), value = TRUE, invert = TRUE)
sel.models <- c("logi.S", "wnlm.Q", "wnlm.R", "wnlm.S"); names(sel.models) <- sel.models
M <- lapply(EP, function(e) do.all.fits(Y[to.fit.ids], G = e, preds = e.vars, sel.models = sel.models))
Betas.Unpermuted <- lapply(M$Unpermuted,
                           function(m) { x <- get.estimate.CI(m); x <- x[ ! x$Coefficient %in% "(Intercept)", ] })
Betas.Unpermuted.95 <- lapply(M$Unpermuted,
                           function(m) { x <- get.estimate.CI(m, conf.lev = 0.95); x <- x[ ! x$Coefficient %in% "(Intercept)", ] })
Betas.Permuted <- lapply(sel.models, get.estimate.CI.permut, M = M, e.v = e.vars, conf.lev = 0.99)
Betas.Permuted.95 <- lapply(sel.models, get.estimate.CI.permut, M = M, e.v = e.vars, conf.lev = 0.95)
Betas <- aggregate.CI.permut2(perms = perms, gene.ids = gene.ids, e.vars = e.vars,
                              sel.models = sel.models, E = E, Y = Y[gene.ids], conf.lev=0.99)

Results

Single permutation

Both the unpermuted and permuted data was fit by the models logi.S, wnlm.Q, wnlm.R, wnlm.S. Given the poor fit for wnlm.S and the relatively low power of wnlm.R only wnlm.Q and logi.S associated results are presented in the following four plots. These results are the estimates and CI for at confidence level of . The following observations and conclusions may be noted:

Under wnlm.Q there are 59 significant coefficients before and 37 after permutation. In contrast, under logi.S there are 185 and 191 (before and after). But taking only those coefficients that are significant under both models yields 44 and 27 (before and after, respectively).

Under logi.S the following pattern is observed. For some genes (e.g. GRB10, ZDBF2) permutation tends to abolish significance—if observed—, whereas for other genes (e.g. SNRPN, SNURF) there are many significant coefficients after permutation. This suggests systematic differences between genes: better fit of logi.S to the data for the former set of genes and poorer for the latter gene set. The poor fit explains the many significant coefficients after permutation.

my.segplot(data = Betas.Unpermuted$logi.S, xlim = my.xlim, main = "Unpermuted under logi.S")

plot of chunk unpermuted-logi-S

plot of chunk permuted-logi-S

plot of chunk unpermuted-wnlm-Q

plot of chunk permuted-wnlm-Q

Repeated permutations

The above analysis is extended now in two ways: presenting CI for several, though not all, with

  1. 20 random permutations
  2. under all selected models logi.S, wnlm.Q, wnlm.R, wnlm.S

Note that genes (panels) are skipped for which the fit did not converge. In the present analysis this applies only to logistic models, since those employ iterative least squares algorithm. Normal linear models, on the other hand, use a one-step least square algorithm and thus are guaranteed to “converge”.

The following general results emerge from the following plots

Age

# skip genes for which the fit did not converge
# applies to logistic models, since those employ iterative least squares algorithm
# normal linear models use a one-step least square algorithm and thus are guaranteed to "converge"
skip <- ! sapply(M$Unpermuted$logi.S, `[[`, "converged")[1:30]
my.segplot2("Age", "logi.S", skip = skip) # fit has not converged for gene ranked 2 (TMEM261P1)

plot of chunk permuted-age-logi-S

plot of chunk permuted-age-wnlm-S

plot of chunk permuted-age-wnlm-Q

plot of chunk permuted-age-wnlm-R

Gender

plot of chunk permuted-gender-logi-S

plot of chunk permuted-gender-wnlm-S

plot of chunk permuted-gender-wnlm-Q

plot of chunk permuted-gender-wnlm-R

Dx

plot of chunk permuted-dx-control-logi-S

plot of chunk permuted-dx-control-wnlm-S

plot of chunk permuted-dx-control-wnlm-Q

plot of chunk permuted-dx-control-wnlm-R

Ancestry

plot of chunk permuted-ancestry-1-logi-S

plot of chunk permuted-ancestry-1-wnlm-S

plot of chunk permuted-ancestry-1-wnlm-Q

plot of chunk permuted-ancestry-1-wnlm-R

Institution

plot of chunk permuted-institution-penn-logi-S

plot of chunk permuted-institution-penn-wnlm-S

plot of chunk permuted-institution-penn-wnlm-Q

plot of chunk permuted-institution-penn-wnlm-R

PMI

plot of chunk permuted-pmi-logi-S

plot of chunk permuted-pmi-wnlm-S

plot of chunk permuted-pmi-wnlm-Q

plot of chunk permuted-pmi-wnlm-R