Model Checking

Two characteristics of fit are checked for a set of models: (1) normality of residuals and (2) homogeneity of error variance. These are evaluated with diagnostic plots. Among the checked models the best fitting one is unlm.Q and wnlm.Q regardless of the terms in the linear predictor. However, the fit of wnlm.Q converges very slowly or fails to converge especially when more terms are present in the linear predictor.

## Loading required package: Matrix

## Loading required package: methods

gene.ids <- unlist(read.csv("../../data/genes.regression.new", as.is = TRUE))
names(gene.ids) <- gene.ids

Prepare data frame including all 30 selected genes and read formulas (linear predictors) for models like the relatively simple $M1$ and the more complex $M3$ :

dat <- merge.data(gene.ids = gene.ids)
fm <- read.csv(file = "../../results/M-formulas.csv", stringsAsFactors = FALSE)

Results under $M3$

Fit models; note messages warning problems with convergence (slow convergence) for wnlm.Q and logi.S

M3 <- list()
M3$unlm.Q <- lmer(reformulate(fm$M3[3], response = "Q"), data = dat)
M3$wnlm.Q <- lmer(reformulate(fm$M3[3], response = "Q"), data = dat, weights = N)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

M3$unlm.S <- lmer(reformulate(fm$M3[3], response = "S"), data = dat)
M3$wnlm.S <- lmer(reformulate(fm$M3[3], response = "S"), data = dat, weights = N)

## Warning in optwrap(optimizer, devfun, getStart(start, rho$lower, rho$pp), :
## convergence code 1 from bobyqa: bobyqa -- maximum number of function
## evaluations exceeded

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : unable to evaluate scaled gradient

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge: degenerate Hessian with 1 negative
## eigenvalues

M3$logi.S <- glmer(reformulate(fm$M3[3], response = "H.N"), data = dat, family = binomial)

## Warning in optwrap(optimizer, devfun, start, rho$lower, control =
## control, : convergence code 1 from bobyqa: bobyqa -- maximum number of
## function evaluations exceeded

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : unable to evaluate scaled gradient

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge: degenerate Hessian with 1 negative
## eigenvalues

# do fit of simpler models to cache results right here
M1 <- list()
M1$unlm.Q <- lmer(reformulate(fm$M1[3], response = "Q"), data = dat)
M1$wnlm.Q <- lmer(reformulate(fm$M1[3], response = "Q"), data = dat, weights = N)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

M1$unlm.S <- lmer(reformulate(fm$M1[3], response = "S"), data = dat)
M1$wnlm.S <- lmer(reformulate(fm$M1[3], response = "S"), data = dat, weights = N)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

M1$logi.S <- glmer(reformulate(fm$M1[3], response = "H.N"), data = dat, family = binomial)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control
## $checkConv, : Model failed to converge with max|grad| = 0.00437736 (tol =
## 0.001, component 1)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?

get.diagnostic.data <- function(lM, sel.col = 2:7) {
    helper <- function(x)
        cbind(data.frame(Residual = residuals(m <- lM[[x]]), Fitted.value = predict(m), Family = x),
              model.frame(m)[sel.col])
    l <- lapply(names(lM), helper)
    long <- do.call(rbind, l)
    lv <- c("unlm.S", "unlm.Q", "wnlm.S", "wnlm.Q", "logi.S")
    #lv <- c("logi.S", "unlm.S", "unlm.Q", "wnlm.S", "wnlm.Q")
    long$Family <- factor(long$Family, levels = lv, ordered = TRUE)
    return(long)
}

Normality of residuals

The distribution of residuals under all four model families shows smaller or larger departure from normality. unlm.Q is the closest to standard normal distribution and wnlm.Q is equally normal (or perhaps slightly less so). The distribution under logi.S is strikingly far from normal. So is it under unlm.S, for which additionally the scale of distribution is diminished (compare slope of black diagonal to the overall slope of the blue curve) indicating that the variance estimation—which is based on the residuals—is strongly biased downwards.

diag.M3 <- get.diagnostic.data(M3)
arg <- list(main = "Normality of residuals.
Mixed multiple regression model, all genes", xlab = "normal quantiles", ylab = "empirical quantiles", pch = "+")
qqmath(~ Residual | Family, data = diag.M3, scales = list(y = list(relation = "free")),
       xlab = arg$xlab, ylab = arg$ylab, main = arg$main, pch = arg$pch, abline = c(0, 1))

plot of chunk qqplot-families-M3

qqmath(~ Residual | Gene, data = diag.M3, subset = Family == "unlm.Q", xlab = arg$xlab, ylab = arg$ylab, main = paste(arg$main, ": unlm.Q"), pch = arg$pch, abline = c(0, 1))

plot of chunk qqplot-genes-unlm-Q-M3

Homogeneity of error

Inspecting the homogeneity of error variance (or equivalently standard deviation) leads to the same conclusion as above regarding the relative goodness of fit of the various model families: unlm.Q and wnlm.Q are the best (the most homoscedastic) while unlm.S and logi.S show systematic relationships between error and fitted value indicating poorer fit.

arg <- list(main = "Homogeneity of error variance.
Mixed multiple regression model, all genes", xlab = "fitted value", ylab = expression(sqrt(residual)), pch = "+")
xyplot(sqrt(abs(Residual)) ~ Fitted.value | Family, data = diag.M3, scales = list(relation = "free"),
       xlab = arg$xlab, ylab = arg$ylab, main = arg$main, pch = arg$pch,
       panel = function(...) { panel.xyplot(...); panel.loess(..., col = "black") })

plot of chunk scedasticity-families-M3

xyplot(sqrt(abs(Residual)) ~ Fitted.value | Gene, data = diag.M3, subset = Family == "unlm.Q", scales = list(x = list(relation = "free", draw = FALSE)), xlab = arg$xlab, ylab = arg$ylab, main = paste(arg$main, ": unlm.Q"), pch = arg$pch)

plot of chunk scedasticity-genes-unlm-Q-M3

Results under a simpler model, $M1$

Very similar patterns are seen under $M1$ to those under $M3$ . The fitting of the respective models from the wnlm.Q and logi.S families again converges with problems (see above).

diag.M1 <- get.diagnostic.data(M1)
arg <- list(main = "mixed multiple regression model, all genes", xlab = "normal quantiles", ylab = "empirical quantiles", pch = "+")
qqmath(~ Residual | Family, data = diag.M1, scales = list(y = list(relation = "free")),
       xlab = arg$xlab, ylab = arg$ylab, main = arg$main, pch = arg$pch, abline = c(0, 1))

plot of chunk qqplot-families-M1

arg <- list(main = "mixed multiple regression model, all genes", xlab = "fitted value", ylab = expression(sqrt(residual)), pch = "+")
xyplot(sqrt(abs(Residual)) ~ Fitted.value | Family, data = diag.M1, scales = list(relation = "free"),
       xlab = arg$xlab, ylab = arg$ylab, main = arg$main, pch = arg$pch)

plot of chunk scedasticity-families-M1

Lab Notebook

Author of notebook: Attila Gulyás-Kovács, Chess Lab, Mount Sinai School of Medicine. Related publication: Nat Commun. 2018 Jul 25;9(1):2914

Results under $M3$

Normality of residuals

Homogeneity of error

Results under a simpler model, $M1$

Lab Notebook

Author of notebook: Attila Gulyás-Kovács, Chess Lab, Mount Sinai School of Medicine. Related publication: Nat Commun. 2018 Jul 25;9(1):2914

Results under M3

Normality of residuals

Homogeneity of error

Results under a simpler model, M1

Results under $M3$

Results under a simpler model, $M1$