[med-svn] [r-bioc-edger] 03/05: Imported Upstream version 3.6.2+dfsg
Andreas Tille
tille at debian.org
Fri Jun 27 14:42:39 UTC 2014
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository r-bioc-edger.
commit 2285a83663901e91e2fca58f9e1e76276b425610
Author: Andreas Tille <tille at debian.org>
Date: Fri Jun 27 16:38:45 2014 +0200
Imported Upstream version 3.6.2+dfsg
---
DESCRIPTION | 15 +-
NAMESPACE | 4 +-
R/DGEList.R | 3 +
R/adjustedProfileLik.R | 25 +-
R/aveLogCPM.R | 84 ++++--
R/calcNormFactors.R | 14 +-
R/camera.DGEList.R | 25 +-
R/cpm.R | 16 +-
R/dglmStdResid.R | 2 +-
R/dispBinTrend.R | 13 +-
R/dispCoxReid.R | 18 +-
R/dispCoxReidInterpolateTagwise.R | 6 +-
R/estimateCommonDisp.R | 14 +-
R/estimateDisp.R | 19 +-
R/estimateGLMCommonDisp.R | 27 +-
R/estimateGLMRobustDisp.R | 99 ++++++
R/estimateGLMTagwiseDisp.R | 28 +-
R/estimateGLMTrendedDisp.R | 30 +-
R/estimateTagwiseDisp.R | 4 +-
R/exactTest.R | 4 +-
R/exactTestByDeviance.R | 56 +---
R/exactTestBySmallP.R | 8 +-
R/exactTestDoubleTail.R | 2 +-
R/glmQLFTest.R | 11 +-
R/glmfit.R | 107 +++----
R/mglmLS.R | 190 ------------
R/mglmLevenberg.R | 43 +--
R/mglmOneGroup.R | 48 +--
R/mglmOneWay.R | 7 +-
R/mglmSimple.R | 83 -----
R/nbinomDeviance.R | 43 +++
R/plotBCV.R | 2 +-
R/predFC.R | 14 +-
R/processHairpinReads.R | 124 ++++++++
R/residDF.R | 29 ++
R/roast.DGEList.R | 56 ++--
R/rpkm.R | 45 +++
R/subsetting.R | 152 +++-------
R/sumTechReps.R | 44 +++
R/validDGEList.R | 13 +
build/vignette.rds | Bin 227 -> 228 bytes
inst/NEWS.Rd | 97 ++++--
inst/doc/edgeR.Rnw | 5 +-
inst/doc/edgeR.pdf | Bin 49354 -> 48051 bytes
inst/doc/index.html | 2 +-
man/DGELRT-class.Rd | 2 +-
man/adjustedProfileLik.Rd | 3 +-
man/asdataframe.Rd | 2 +-
man/asmatrix.Rd | 2 +-
man/aveLogCPM.Rd | 12 +-
man/calcNormFactors.Rd | 2 +-
man/camera.DGEList.Rd | 47 ++-
man/condLogLikDerSize.Rd | 4 +-
man/cpm.Rd | 18 +-
man/dglmStdResid.Rd | 2 +-
man/dispBinTrend.Rd | 3 +-
man/dispCoxReid.Rd | 14 +-
man/dispCoxReidInterpolateTagwise.Rd | 3 +-
man/estimateDisp.Rd | 4 +-
man/estimateGLMCommonDisp.Rd | 8 +-
man/estimateGLMRobustDisp.Rd | 57 ++++
man/estimateGLMTagwiseDisp.Rd | 10 +-
man/estimateGLMTrendedDisp.Rd | 5 +-
man/estimateTagwiseDisp.Rd | 2 +-
man/estimateTrendedDisp.Rd | 18 +-
man/exactTest.Rd | 4 +-
man/expandAsMatrix.Rd | 3 -
man/glmfit.Rd | 19 +-
man/gof.Rd | 2 +-
man/maPlot.Rd | 2 +-
man/meanvar.Rd | 2 +-
man/mglm.Rd | 62 ++--
man/nbinomDeviance.Rd | 58 ++++
man/normalizeChIPtoInput.Rd | 4 +-
man/plotBCV.Rd | 4 +-
man/plotExonUsage.Rd | 4 +-
man/plotMDS.DGEList.Rd | 4 +-
man/plotSmear.Rd | 2 +-
man/predFC.Rd | 16 +-
man/processHairpinReads.Rd | 55 ++++
man/readDGE.Rd | 4 +-
man/roast.DGEList.Rd | 82 +----
man/subsetting.Rd | 20 +-
man/sumTechReps.Rd | 34 +++
man/topTags.Rd | 1 -
man/validDGEList.Rd | 37 +++
src/Makevars | 12 +-
src/R_compute_nbdev.cpp | 45 +++
src/R_exact_test_by_deviance.cpp | 154 +++-------
src/R_levenberg.cpp | 58 +++-
src/R_one_group.cpp | 38 ++-
src/R_process_hairpin_reads.c | 571 +++++++++++++++++++++++++++++++++++
src/{core => }/adj_coxreid.cpp | 0
src/{core => }/fmm_spline.c | 0
src/{core => }/glm.h | 21 +-
src/{core => }/glm_levenberg.cpp | 66 ++--
src/{core => }/glm_one_group.cpp | 32 +-
src/{core => }/interpolator.cpp | 0
src/{core => }/interpolator.h | 0
src/matvec_check.cpp | 80 +++++
src/matvec_check.h | 19 ++
src/nbdev.cpp | 35 +++
src/{core => }/utils.h | 6 +-
tests/edgeR-Tests.R | 3 +-
tests/edgeR-Tests.Rout.save | 449 +++++++++------------------
105 files changed, 2385 insertions(+), 1472 deletions(-)
diff --git a/DESCRIPTION b/DESCRIPTION
index 272f2be..c9dd90c 100755
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,15 +1,14 @@
Package: edgeR
-Version: 3.4.2
-Date: 2013/08/31
+Version: 3.6.2
+Date: 2014/04/26
Title: Empirical analysis of digital gene expression data in R
-Author: Mark Robinson <mark.robinson at imls.uzh.ch>, Davis McCarthy <dmccarthy at wehi.edu.au>, Yunshun Chen <yuchen at wehi.edu.au>, Aaron Lun <alun at wehi.edu.au>, Gordon Smyth <smyth at wehi.edu.au>
-Maintainer: Mark Robinson <mark.robinson at imls.uzh.ch>, Davis McCarthy
- <dmccarthy at wehi.edu.au>, Yunshun Chen <yuchen at wehi.edu.au>,
- Gordon Smyth <smyth at wehi.edu.au>
-Depends: R (>= 2.15.0), methods, limma
+Author: Yunshun Chen <yuchen at wehi.edu.au>, Davis McCarthy <dmccarthy at wehi.edu.au>, Aaron Lun <alun at wehi.edu.au>, Mark Robinson <mark.robinson at imls.uzh.ch>, Xiaobei Zhou <xiaobei.zhou at uzh.ch>, Gordon Smyth <smyth at wehi.edu.au>
+Maintainer: Yunshun Chen <yuchen at wehi.edu.au>, Mark Robinson <mark.robinson at imls.uzh.ch>, Davis McCarthy <dmccarthy at wehi.edu.au>, Gordon Smyth <smyth at wehi.edu.au>
+Depends: R (>= 2.15.0), limma
+Imports: methods
Suggests: MASS, statmod, splines, locfit, KernSmooth
biocViews: Bioinformatics, DifferentialExpression, SAGE,
HighThroughputSequencing, RNAseq, ChIPseq
Description: Differential expression analysis of RNA-seq and digital gene expression profiles with biological replication. Uses empirical Bayes estimation and exact tests based on the negative binomial distribution. Also useful for differential signal analysis with other types of genome-scale count data.
License: GPL (>=2)
-Packaged: 2013-12-06 04:48:35 UTC; biocbuild
+Packaged: 2014-05-10 05:35:45 UTC; biocbuild
diff --git a/NAMESPACE b/NAMESPACE
index c3de40b..3e830dd 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -7,7 +7,8 @@ exportPattern("^[^\\.]")
exportClasses("DGEList","DGEExact","DGEGLM","DGELRT","TopTags")
exportMethods("show")
-import(limma)
+import(methods)
+importFrom("limma",camera,is.fullrank,mroast,nonEstimable,plotMDS,roast,subsetListOfArrays)
S3method(as.matrix,DGEList)
S3method(dim,DGEList)
@@ -32,3 +33,4 @@ S3method(as.data.frame,TopTags)
S3method(roast,DGEList)
S3method(mroast,DGEList)
S3method(camera,DGEList)
+S3method(sumTechReps,DGEList)
diff --git a/R/DGEList.R b/R/DGEList.R
index 42702ff..25aa43e 100644
--- a/R/DGEList.R
+++ b/R/DGEList.R
@@ -40,5 +40,8 @@ DGEList <- function(counts=matrix(0,0,0), lib.size=colSums(counts), norm.factors
}
}
+# x$offset <- expandAsMatrix(getOffset(x),dim(counts))
+# x$weights <- matrix(1,ntags,nlib)
+
x
}
diff --git a/R/adjustedProfileLik.R b/R/adjustedProfileLik.R
index bf78ca6..200fcb2 100644
--- a/R/adjustedProfileLik.R
+++ b/R/adjustedProfileLik.R
@@ -1,26 +1,35 @@
-adjustedProfileLik <- function(dispersion, y, design, offset, adjust=TRUE)
+adjustedProfileLik <- function(dispersion, y, design, offset, weights=NULL, adjust=TRUE)
# tagwise Cox-Reid adjusted profile likelihoods for the dispersion
# dispersion can be scalar or tagwise vector
# y is matrix: rows are genes/tags/transcripts, columns are samples/libraries
# offset is matrix of the same dimensions as y
# Yunshun Chen, Gordon Smyth, Aaron Lun
-# Created June 2010. Last modified 21 Aug 2012.
+# Created June 2010. Last modified 17 Feb 2014.
{
if(any(dim(y)!=dim(offset))) offset <- expandAsMatrix(offset,dim(y))
ntags <- nrow(y)
nlibs <- ncol(y)
if(length(dispersion)==1) dispersion <- rep(dispersion,ntags)
-
+
# Fit tagwise linear models. This is actually the most time-consuming
# operation that I can see for this function.
- fit <- glmFit(y,design=design,dispersion=dispersion,offset=offset,prior.count=0)
+ fit <- glmFit(y,design=design,dispersion=dispersion,offset=offset,prior.count=0,weights=weights)
# Compute log-likelihood.
mu <- fit$fitted
+ if(is.null(weights)) weights <- 1
if(dispersion[1] == 0){
- loglik <- rowSums(dpois(y,lambda=mu,log = TRUE))
+# loglik <- rowSums(weights*dpois(y,lambda=mu,log = TRUE))
+ ll <- y*log(mu)-mu-lgamma(y+1)
+ ll[mu==0] <- 0
+ loglik <- rowSums(weights*ll)
+
} else {
- loglik <- rowSums(dnbinom(y,size=1/dispersion,mu=mu,log = TRUE))
+# loglik <- rowSums(weights*dnbinom(y,size=1/dispersion,mu=mu,log = TRUE))
+ r <- 1/dispersion
+ ll <- y*log(mu)-y*log(mu+r)+r*log(r)-r*log(mu+r)+lgamma(y+r)-lgamma(y+1)-lgamma(r)
+ ll[mu==0] <- 0
+ loglik <- rowSums(weights*ll)
}
if (!adjust) {
return(loglik)
@@ -28,10 +37,10 @@ adjustedProfileLik <- function(dispersion, y, design, offset, adjust=TRUE)
# Calculating the Cox-Reid adjustment.
if(ncol(design)==1) {
- D <- rowSums(mu/(1+mu*dispersion))
+ D <- rowSums(weights*mu/(1+mu*dispersion))
cr <- 0.5*log(abs(D))
} else {
- W <- mu/(1+dispersion*mu)
+ W <- weights*mu/(1+dispersion*mu)
# Checking type, otherwise the C++ code will complain.
# Note the use of a transposed matrix for easy row access in column-major format.
diff --git a/R/aveLogCPM.R b/R/aveLogCPM.R
index 68727de..99ec3a3 100644
--- a/R/aveLogCPM.R
+++ b/R/aveLogCPM.R
@@ -1,36 +1,84 @@
aveLogCPM <- function(y, ...)
UseMethod("aveLogCPM")
-aveLogCPM.DGEList <- function(y, normalized.lib.sizes=TRUE, prior.count=2, dispersion=0.05, ...)
+aveLogCPM.DGEList <- function(y, normalized.lib.sizes=TRUE, prior.count=2, dispersion=NULL, ...)
# log2(AveCPM)
-# Gordon Smyth
-# 11 March 2013.
+# Gordon Smyth
+# Created 11 March 2013. Last modified 24 November 2013.
{
+# Library sizes should be stored in y but are sometimes missing
lib.size <- y$samples$lib.size
- if(normalized.lib.sizes) lib.size <- lib.size*y$samples$norm.factors
- aveLogCPM(y$counts,lib.size=lib.size,prior.count=prior.count,dispersion=dispersion)
+ if(is.null(lib.size)) lib.size <- colSums(y$counts)
+
+# Normalization factors should be stored in y but are sometimes missing
+ if(normalized.lib.sizes) {
+ nf <- y$samples$norm.factors
+ if(!is.null(y$samples$norm.factors)) lib.size <- lib.size*nf
+ }
+
+# Dispersion supplied as argument over-rules value in object
+# Should trended.dispersion or tagwise.dispersion be used instead of common.dispersion if available?
+ if(is.null(dispersion)) dispersion <- y$common.dispersion
+
+ aveLogCPM(y$counts,lib.size=lib.size,prior.count=prior.count,dispersion=dispersion,weights=y$weights)
}
-aveLogCPM.DGEGLM <- function(y, prior.count=2, dispersion=0.05, ...)
+aveLogCPM.DGEGLM <- function(y, prior.count=2, dispersion=NULL, ...)
# log2(AveCPM)
-# Gordon Smyth
-# 11 March 2013.
+# Gordon Smyth
+# Created 11 March 2013. Last modified 24 Nov 2013.
{
- offset <- y$offset
- if(is.matrix(offset)) offset <- colMeans(offset)
- lib.size <- exp(offset)
- aveLogCPM(y$counts,lib.size=lib.size,prior.count=prior.count,dispersion=dispersion)
+# Dispersion supplied as argument over-rules value in object
+ if(is.null(dispersion)) dispersion <- y$dispersion
+
+ aveLogCPM(y$counts,offset=y$offset,prior.count=prior.count,dispersion=dispersion,weights=y$weights)
}
-aveLogCPM.default <- function(y,lib.size=NULL,prior.count=2,dispersion=0.05, ...)
+aveLogCPM.default <- function(y,lib.size=NULL,offset=NULL,prior.count=2,dispersion=NULL,weights=NULL, ...)
# log2(AveCPM)
-# Gordon Smyth
-# 25 Aug 2012. Last modified 11 March 2012.
+# Gordon Smyth
+# Created 25 Aug 2012. Last modified 4 Nov 2013.
{
+# Check y
y <- as.matrix(y)
- if(is.null(lib.size)) lib.size <- colSums(y)
- prior.count.scaled <- lib.size/mean(lib.size) * prior.count
+ if(any(y<0)) stop("y must be non-negative")
+
+# Check prior.count
+ if(prior.count<0) prior.count <- 0
+
+# Check dispersion
+ if(is.null(dispersion)) dispersion <- 0.05
+ isna <- is.na(dispersion)
+ if(all(isna)) dispersion <- 0.05
+ if(any(isna)) dispersion[isna] <- mean(dispersion,na.rm=TRUE)
+
+# Check lib.size and offset.
+# If offset is provided, it takes precedence over lib.size.
+# However it must have a similar average to log(lib.size)
+# for the results to be meaningful as logCPM values
+ if(is.null(offset)) {
+ if(is.null(lib.size)) lib.size <- colSums(y)
+ } else {
+ lib.size <- exp(offset)
+ }
+ mean.lib.size <- mean(lib.size)
+
+# Special case when all counts are zero
+ if(mean.lib.size==0) {
+ abundance <- rep(-log(nrow(y)),nrow(y))
+ return( (abundance+log(1e6)) / log(2) )
+ }
+
+# Scale prior counts to preserve fold changes
+ prior.count.scaled <- lib.size/mean.lib.size * prior.count
+
+# Add double prior counts to library sizes
offset <- log(lib.size+2*prior.count.scaled)
- abundance <- mglmOneGroup(t(t(y)+prior.count.scaled),dispersion=dispersion,offset=offset)
+
+# Add prior counts to y
+ if(is.null(dim(prior.count.scaled))) prior.count.scale <- matrix(1,nrow(y),1) %*% prior.count.scaled
+ y <- y+prior.count.scaled
+
+ abundance <- mglmOneGroup(y,dispersion=dispersion,offset=offset,weights=weights)
(abundance+log(1e6)) / log(2)
}
diff --git a/R/calcNormFactors.R b/R/calcNormFactors.R
index 2c90dd4..9890a1c 100644
--- a/R/calcNormFactors.R
+++ b/R/calcNormFactors.R
@@ -26,16 +26,12 @@ calcNormFactors <- function(object, method=c("TMM","RLE","upperquartile","none")
f <- switch(method,
TMM = {
f75 <- .calcFactorQuantile(data=x, lib.size=lib.size, p=0.75)
- if( is.null(refColumn) )
- refColumn <- which.min(abs(f75-mean(f75)))
+ if( is.null(refColumn) ) refColumn <- which.min(abs(f75-mean(f75)))
if(length(refColumn)==0 | refColumn < 1 | refColumn > ncol(x)) refColumn <- 1
- f <- rep(NA,ncol(x))
- for(i in 1:ncol(x))
- f[i] <- .calcFactorWeighted(obs=x[,i],ref=x[,refColumn], libsize.obs=lib.size[i],
- libsize.ref=lib.size[refColumn],
- logratioTrim=logratioTrim, sumTrim=sumTrim,
- doWeighting=doWeighting, Acutoff=Acutoff)
- f
+ f <- rep(NA,ncol(x))
+ for(i in 1:ncol(x))
+ f[i] <- .calcFactorWeighted(obs=x[,i],ref=x[,refColumn], libsize.obs=lib.size[i], libsize.ref=lib.size[refColumn], logratioTrim=logratioTrim, sumTrim=sumTrim, doWeighting=doWeighting, Acutoff=Acutoff)
+ f
},
RLE = .calcFactorRLE(x)/lib.size,
upperquartile = .calcFactorQuantile(x,lib.size,p=p),
diff --git a/R/camera.DGEList.R b/R/camera.DGEList.R
index 0f5ef2e..a6385b3 100644
--- a/R/camera.DGEList.R
+++ b/R/camera.DGEList.R
@@ -1,21 +1,21 @@
-camera.DGEList <- function(y, index, design=NULL, contrast=ncol(design), weights=NULL, use.ranks=FALSE, allow.neg.cor=TRUE, trend.var=FALSE, sort=TRUE)
+camera.DGEList <- function(y, index, design=NULL, contrast=ncol(design), ...)
# Rotation gene set testing for RNA-Seq data accounting for inter-gene correlation
# Yunshun Chen, Gordon Smyth
-# Created 07 Jan 2013. Last modified 4 Feb 2013.
+# Created 07 Jan 2013. Last modified 28 Feb 2014.
{
-# Check design matrix
+# Check dispersion estimates in y
+ dispersion <- getDispersion(y)
+ if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
+
+# Make default design matrix from group factor
if(is.null(design)) {
- if(nlevels(y$samples$group)<2) stop("Samples all belong to the same group")
+ if(nlevels(y$samples$group)<2) stop("design not supplied and samples all belong to the same group")
design <- model.matrix(~y$samples$group)
rownames(design) <- colnames(y)
}
nbeta <- ncol(design)
if(nbeta < 2) stop("design matrix must have at least two columns")
-# Check dispersion estimates
- dispersion <- getDispersion(y)
- if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
-
# Check contrast
if(length(contrast) == 1) {
u <- rep.int(0, nbeta)
@@ -25,14 +25,15 @@ camera.DGEList <- function(y, index, design=NULL, contrast=ncol(design), weights
if(length(contrast) != nbeta) stop("length of contrast must match column dimension of design")
if(all(contrast==0)) stop("contrast all zero")
-# Null design matrix
+# Construct null hypothesis design matrix
QR <- qr(contrast)
design0 <- t(qr.qty(QR, t(design))[-1, , drop=FALSE])
-# Null fit
+# Null hypothesis fit
fit.null <- glmFit(y, design0, prior.count=0)
- z <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
+# Quantile residuals from null fit
+ y <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
- camera(y=z, index=index, design=design, contrast=contrast, weights=weights, use.ranks=use.ranks, allow.neg.cor=allow.neg.cor, trend.var=trend.var, sort=sort)
+ NextMethod("camera")
}
diff --git a/R/cpm.R b/R/cpm.R
index 114e2ca..8aa28c7 100644
--- a/R/cpm.R
+++ b/R/cpm.R
@@ -8,7 +8,7 @@ cpm.DGEList <- function(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.2
{
lib.size <- x$samples$lib.size
if(normalized.lib.sizes) lib.size <- lib.size*x$samples$norm.factors
- cpm(x$counts,lib.size=lib.size,log=log,prior.count=prior.count)
+ cpm.default(x$counts,lib.size=lib.size,log=log,prior.count=prior.count)
}
cpm.default <- function(x, lib.size=NULL, log=FALSE, prior.count=0.25, ...)
@@ -28,17 +28,3 @@ cpm.default <- function(x, lib.size=NULL, log=FALSE, prior.count=0.25, ...)
else
t(t(x)/lib.size)
}
-
-rpkm <- function(x, gene.length, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25)
-# Reads per kilobase of gene length per million reads of sequencing
-# Gordon Smyth
-# Created 1 November 2012. Last modified 11 March 2012.
-{
- y <- cpm(x=x,normalized.lib.sizes=normalized.lib.sizes,log=log,prior.count=prior.count)
- gene.length.kb <- gene.length/1000
- if(log)
- y-log2(gene.length.kb)
- else
- y/gene.length.kb
-}
-
diff --git a/R/dglmStdResid.R b/R/dglmStdResid.R
index bd00066..417d086 100644
--- a/R/dglmStdResid.R
+++ b/R/dglmStdResid.R
@@ -10,7 +10,7 @@ dglmStdResid <- function(y, design, dispersion=0, offset=0, nbins=100, make.plot
stop("Number of entries in argument 'offset' incompatible with 'y'. Must have length equal to 1 or to the number of entries in the matrix of counts or to the number of columns in the matrix of counts.\n")
else
offset <- matrix(offset, nrow=ngenes, ncol=nlibs, byrow=TRUE)
- fit <- mglmLS(y, design=design, dispersion=0, offset=offset)
+ fit <- mglmLevenberg(y, design=design, dispersion=0, offset=offset)
means <- as.vector(fit$fitted)
std.resid <- nlibs * ( as.vector(y) - means )^2 / ( nlibs - ncol(design) ) # Obtain an approximate value for the standardized residual: denominator is (n - p) / n instead of the usual (1 - leverage)
n <- length(means)
diff --git a/R/dispBinTrend.R b/R/dispBinTrend.R
index 7406a4d..edf461c 100644
--- a/R/dispBinTrend.R
+++ b/R/dispBinTrend.R
@@ -1,13 +1,15 @@
-dispBinTrend <- function(y, design=NULL, offset=NULL, df=5, span=0.3, min.n=400, method.bin="CoxReid", method.trend="spline", AveLogCPM=NULL, ...)
+dispBinTrend <- function(y, design=NULL, offset=NULL, df=5, span=0.3, min.n=400, method.bin="CoxReid", method.trend="spline", AveLogCPM=NULL, weights=NULL, ...)
# Estimate common dispersion in bins based on AveLogCPM,
# then fit a curve through the dispersions
# Davis McCarthy, Gordon Smyth
-# Created 10 Feb 2011. Last modified 17 April 2013.
+# Created 10 Feb 2011. Last modified 25 Nov 2013.
{
# Check y
y <- as.matrix(y)
nlibs <- ncol(y)
ntags <- nrow(y)
+
+# Check for all zero rows
pos <- rowSums(y)>0
if(!any(pos)) return(AveLogCPM=AveLogCPM, dispersion=rep(0,ntags))
npostags <- sum(pos)
@@ -20,14 +22,15 @@ dispBinTrend <- function(y, design=NULL, offset=NULL, df=5, span=0.3, min.n=400,
}
# Check offset
- if(is.null(offset)) offset <- expandAsMatrix(log(colSums(y)),dim(y))
+ if(is.null(offset)) offset <- log(colSums(y))
+ offset <- expandAsMatrix(offset,dim(y))
# Check methods
method.bin <- match.arg(method.bin, c("CoxReid", "Pearson", "deviance"))
method.trend <- match.arg(method.trend, c("spline", "loess"))
# Check AveLogCPM
- if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y)
+ if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,weights=weights)
# Define bins of genes based on min.n in each bin
# All zero rows are marked as group==0
@@ -54,7 +57,7 @@ dispBinTrend <- function(y, design=NULL, offset=NULL, df=5, span=0.3, min.n=400,
bin.d <- bin.A <- rep(0,nbins)
for(i in 1:nbins) {
bin <- group==i
- bin.d[i] <- estimateGLMCommonDisp(y[bin,], design, method=method.bin, offset[bin,], min.row.sum=0, ...)
+ bin.d[i] <- estimateGLMCommonDisp(y[bin,], design, method=method.bin, offset[bin,], min.row.sum=0, weights=weights[bin,] ,...)
bin.A[i] <- mean(AveLogCPM[bin])
}
diff --git a/R/dispCoxReid.R b/R/dispCoxReid.R
index 87aa84a..c28e5e0 100644
--- a/R/dispCoxReid.R
+++ b/R/dispCoxReid.R
@@ -1,7 +1,7 @@
-dispCoxReid <- function(y, design=NULL, offset=NULL, interval=c(0,4), tol=1e-5, min.row.sum=5, subset=10000, AveLogCPM=NULL)
+dispCoxReid <- function(y, design=NULL, offset=NULL, weights=NULL, AveLogCPM=NULL, interval=c(0,4), tol=1e-5, min.row.sum=5, subset=10000)
# Cox-Reid APL estimator of common dispersion
# Gordon Smyth, Davis McCarthy
-# 26 Jan 2011. Last modified 4 Feb 2013.
+# 26 Jan 2011. Last modified 9 Dec 2013.
{
# Check y
y <- as.matrix(y)
@@ -15,8 +15,8 @@ dispCoxReid <- function(y, design=NULL, offset=NULL, interval=c(0,4), tol=1e-5,
design <- as.matrix(design)
}
-# Check offseet
- if(is.null(offset)) offset <- 0
+# Check offset
+ if(is.null(offset)) offset <- log(colSums(y))
offset <- expandAsMatrix(offset,dim(y))
if(min(interval)<0) stop("please give a non-negative interval for the dispersion")
@@ -25,23 +25,25 @@ dispCoxReid <- function(y, design=NULL, offset=NULL, interval=c(0,4), tol=1e-5,
if(any(small.row.sum)) {
y <- y[!small.row.sum,,drop=FALSE]
offset <- offset[!small.row.sum,,drop=FALSE]
+ weights <- weights[!small.row.sum,,drop=FALSE]
if(!is.null(AveLogCPM)) AveLogCPM <- AveLogCPM[!small.row.sum]
}
if(nrow(y)<1) stop("no data rows with required number of counts")
# Subsetting
if(!is.null(subset) && subset<=nrow(y)/2) {
- if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,offset=offset)
+ if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,offset=offset,weights=weights)
i <- systematicSubset(subset,AveLogCPM)
y <- y[i,,drop=FALSE]
offset <- offset[i,,drop=FALSE]
+ weights <- weights[i,,drop=FALSE]
}
# Function for optimizing
- fun <- function(par,y,design,offset) {
- sum(adjustedProfileLik(par^4,y,design,offset))
+ fun <- function(par,y,design,offset,weights) {
+ sum(adjustedProfileLik(par^4,y,design,offset,weights=weights))
}
- out <- optimize(f=fun,interval=interval^0.25,y=y,design=design,offset=offset,maximum=TRUE,tol=tol)
+ out <- optimize(f=fun,interval=interval^0.25,y=y,design=design,offset=offset,weights=weights,maximum=TRUE,tol=tol)
out$maximum^4
}
diff --git a/R/dispCoxReidInterpolateTagwise.R b/R/dispCoxReidInterpolateTagwise.R
index be35908..acd222f 100644
--- a/R/dispCoxReidInterpolateTagwise.R
+++ b/R/dispCoxReidInterpolateTagwise.R
@@ -1,4 +1,4 @@
-dispCoxReidInterpolateTagwise <- function(y, design, offset=NULL, dispersion, trend=TRUE, AveLogCPM=NULL, min.row.sum=5, prior.df=10, span=0.3, grid.npts=11, grid.range=c(-6,6))
+dispCoxReidInterpolateTagwise <- function(y, design, offset=NULL, dispersion, trend=TRUE, AveLogCPM=NULL, min.row.sum=5, prior.df=10, span=0.3, grid.npts=11, grid.range=c(-6,6),weights=NULL)
# Estimate tagwise NB dispersions
# using weighted Cox-Reid Adjusted Profile-likelihood
# and cubic spline interpolation over a tagwise grid.
@@ -38,7 +38,7 @@ dispCoxReidInterpolateTagwise <- function(y, design, offset=NULL, dispersion, tr
# Apply min.row.sum and use input dispersion for small count tags
i <- (rowSums(y) >= min.row.sum)
if(any(!i)) {
- if(any(i)) dispersion[i] <- Recall(y=y[i,],design=design,offset=offset[i,],dispersion=dispersion[i],AveLogCPM=AveLogCPM[i],grid.npts=grid.npts,min.row.sum=0,prior.df=prior.df,span=span,trend=trend)
+ if(any(i)) dispersion[i] <- Recall(y=y[i,],design=design,offset=offset[i,],dispersion=dispersion[i],AveLogCPM=AveLogCPM[i],grid.npts=grid.npts,min.row.sum=0,prior.df=prior.df,span=span,trend=trend,weights=weights[i,])
return(dispersion)
}
@@ -48,7 +48,7 @@ dispCoxReidInterpolateTagwise <- function(y, design, offset=NULL, dispersion, tr
apl <- matrix(0, nrow=ntags, ncol=grid.npts)
for(i in 1:grid.npts){
spline.disp <- dispersion * 2^spline.pts[i]
- apl[,i] <- adjustedProfileLik(spline.disp, y=y, design=design, offset=offset)
+ apl[,i] <- adjustedProfileLik(spline.disp, y=y, design=design, offset=offset, weights=weights)
}
if(trend) {
o <- order(AveLogCPM)
diff --git a/R/estimateCommonDisp.R b/R/estimateCommonDisp.R
index 367cdbd..2f44a1f 100644
--- a/R/estimateCommonDisp.R
+++ b/R/estimateCommonDisp.R
@@ -1,9 +1,10 @@
estimateCommonDisp <- function(object,tol=1e-06,rowsum.filter=5,verbose=FALSE)
-# Do two iterations of calculating pseudodata and estimating common dispersion
-# Davis McCarthy, Mark Robinson, Gordon Smyth.
-# Created 2009. Last modified 2 Aug 2012.
+# Estimate common dispersion using exact conditional likelihood
+# Davis McCarthy, Mark Robinson, Gordon Smyth.
+# Created 2009. Last modified 20 Nov 2013.
{
if(!is(object,"DGEList")) stop("Currently supports DGEList objects")
+ object <- validDGEList(object)
group <- object$samples$group <- as.factor(object$samples$group)
if( all(tabulate(group)<=1) ) {
@@ -28,10 +29,15 @@ estimateCommonDisp <- function(object,tol=1e-06,rowsum.filter=5,verbose=FALSE)
if(verbose) cat("Disp =",round(disp,5),", BCV =",round(sqrt(disp),4),"\n")
object$common.dispersion <- disp
object$pseudo.counts <- out$pseudo.counts
+ object$pseudo.lib.size <- out$common.lib.size
# Average logCPM
+# Note different behaviour to estimateGLMCommonDisp:
+# Here the estimated common.dispersion is used to compute AveLogCPM whereas
+# estimateGLMCommonDisp calculates AveLogCPM prior to estimating the common
+# dispersion using a pre-set dispersion of 0.05
object$AveLogCPM <- aveLogCPM(object)
- object$pseudo.lib.size <- out$common.lib.size
+
object
}
diff --git a/R/estimateDisp.R b/R/estimateDisp.R
index 9509e2d..92e688b 100644
--- a/R/estimateDisp.R
+++ b/R/estimateDisp.R
@@ -2,11 +2,11 @@
########### Weighted Likelihood Empirical Bayes ##############
##############################################################
-estimateDisp <- function(y, design=NULL, offset=NULL, prior.df=NULL, trend.method="locfit", span=NULL, grid.length=21, grid.range=c(-10,10), robust=FALSE, winsor.tail.p=c(0.05,0.1), tol=1e-06)
+estimateDisp <- function(y, design=NULL, prior.df=NULL, trend.method="locfit", span=NULL, grid.length=21, grid.range=c(-10,10), robust=FALSE, winsor.tail.p=c(0.05,0.1), tol=1e-06)
# Estimating dispersion using weighted conditional likelihood empirical Bayes.
# Use GLM approach if a design matrix is given, and classic approach otherwise.
# It calculates a matrix of likelihoods for each gene at a set of dispersion grid points, and then calls WLEB() to do the shrinkage.
-# Yunshun Chen, Gordon Smyth. Created July 2012. Last modified 4 Feb 2013.
+# Yunshun Chen, Gordon Smyth. Created July 2012. Last modified 17 Feb 2014.
{
if( !is(y,"DGEList") ) stop("y must be a DGEList")
group <- y$samples$group <- as.factor(y$samples$group)
@@ -21,7 +21,7 @@ estimateDisp <- function(y, design=NULL, offset=NULL, prior.df=NULL, trend.metho
grid.vals <- spline.disp/(1+spline.disp)
l0 <- matrix(0, ntags, grid.length)
- if(is.null(offset)) offset <- getOffset(y)
+ offset <- getOffset(y)
AveLogCPM <- aveLogCPM(y)
offset <- expandAsMatrix(offset, dim(y))
@@ -77,17 +77,14 @@ estimateDisp <- function(y, design=NULL, offset=NULL, prior.df=NULL, trend.metho
# Calculate prior.df
if(is.null(prior.df)){
- glmfit <- glmFit(y, design, dispersion=y$trended.dispersion, prior.count=0)
+ glmfit <- glmFit(y$counts, design, offset=offset, dispersion=y$trended.dispersion, prior.count=0)
# Residual deviances
df.residual <- glmfit$df.residual
# Adjust df.residual for fitted values at zero
- zerofit <- (glmfit$fitted.values < 1e-14)
- Q <- qr.Q(qr(glmfit$design))
- h <- rowSums(Q^2)
- dffromzeros <- zerofit %*% (1-h)
- df.residual <- drop(round(df.residual-dffromzeros))
+ zerofit <- (glmfit$fitted.values < 1e-4) & (glmfit$counts < 1e-4)
+ df.residual <- .residDF(zerofit, design)
# Empirical Bayes squeezing of the quasi-likelihood variance factors
s2 <- glmfit$deviance / df.residual
@@ -124,8 +121,8 @@ estimateDisp <- function(y, design=NULL, offset=NULL, prior.df=NULL, trend.metho
}
if(sum(!i)!=0){
# Make sure that there are still some genes with finite prior.df
- out.2 <- WLEB(theta=spline.pts, loglik=l0[!i,], prior.n=prior.n[!i], covariate=AveLogCPM[!i],
- trend.method=trend.method, span=span, overall=FALSE, trend=FALSE, m0=out.1$shared.loglik[!i,])
+ out.2 <- WLEB(theta=spline.pts, loglik=l0[!i,,drop=FALSE], prior.n=prior.n[!i], covariate=AveLogCPM[!i],
+ trend.method=trend.method, span=span, overall=FALSE, trend=FALSE, m0=out.1$shared.loglik[!i,,drop=FALSE])
y$tagwise.dispersion[!i] <- 0.1 * 2^out.2$individual
}
}
diff --git a/R/estimateGLMCommonDisp.R b/R/estimateGLMCommonDisp.R
index 33be444..7e60718 100644
--- a/R/estimateGLMCommonDisp.R
+++ b/R/estimateGLMCommonDisp.R
@@ -1,22 +1,19 @@
-# Last modified 13 March 2013
-
-estimateGLMCommonDisp <- function(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, ...)
+estimateGLMCommonDisp <- function(y, ...)
UseMethod("estimateGLMCommonDisp")
-estimateGLMCommonDisp.DGEList <- function(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, ...)
+estimateGLMCommonDisp.DGEList <- function(y, design=NULL, method="CoxReid", subset=10000, verbose=FALSE, ...)
{
-# If provided as arguments, offset and AveLogCPM over-rule the values stored in y
- if(!is.null(AveLogCPM)) y$AveLogCPM <- AveLogCPM
- if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y)
- if(!is.null(offset)) y$offset <- expandAsMatrix(offset,dim(y))
+# Check y
+ y <- validDGEList(y)
+ if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y,dispersion=NULL)
+
+ disp <- estimateGLMCommonDisp(y=y$counts, design=design, offset=getOffset(y), method=method, subset=subset, AveLogCPM=y$AveLogCPM, verbose=verbose, weights=y$weights, ...)
- disp <- estimateGLMCommonDisp(y=y$counts, design=design, offset=getOffset(y), method=method, subset=subset, AveLogCPM=y$AveLogCPM, verbose=verbose, ...)
y$common.dispersion <- disp
- y$design <- design
y
}
-estimateGLMCommonDisp.default <- function(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, ...)
+estimateGLMCommonDisp.default <- function(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, weights=NULL, ...)
{
# Check y
y <- as.matrix(y)
@@ -35,20 +32,22 @@ estimateGLMCommonDisp.default <- function(y, design=NULL, offset=NULL, method="C
}
# Check method
- method <- match.arg(method, c("CoxReid","Pearson","Pearson2","deviance"))
+ method <- match.arg(method, c("CoxReid","Pearson","deviance"))
+ if(!method == "CoxReid" && !is.null(weights)) warning("weights only supported by CoxReid method")
# Check offset
if(is.null(offset)) offset <- log(colSums(y))
# Check AveLogCPM
- if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y)
+ if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,weights=weights)
# Call lower-level function
disp <- switch(method,
- CoxReid=dispCoxReid(y, design=design, offset=offset, subset=subset, AveLogCPM=AveLogCPM, ...),
+ CoxReid=dispCoxReid(y, design=design, offset=offset, subset=subset, AveLogCPM=AveLogCPM, weights=weights, ...),
Pearson=dispPearson(y, design=design, offset=offset, subset=subset, AveLogCPM=AveLogCPM, ...),
deviance=dispDeviance(y, design=design, offset=offset, subset=subset, AveLogCPM=AveLogCPM, ...)
)
+
if(verbose) cat("Disp =",round(disp,5),", BCV =",round(sqrt(disp),4),"\n")
disp
}
diff --git a/R/estimateGLMRobustDisp.R b/R/estimateGLMRobustDisp.R
new file mode 100644
index 0000000..7afb855
--- /dev/null
+++ b/R/estimateGLMRobustDisp.R
@@ -0,0 +1,99 @@
+estimateGLMRobustDisp <-
+function (y, design = NULL, prior.df = 10, update.trend = TRUE, trend.method = "bin.loess", maxit = 6, k = 1.345, residual.type = "pearson", verbose = FALSE, record = FALSE)
+{
+
+ if (!is(y, "DGEList"))
+ stop("Input must be a DGEList.")
+ y$weights <- array(1, dim(y))
+ if(is.null(y$trended.dispersion)) y <- estimateGLMTrendedDisp(y,design = design, method = trend.method)
+ if(is.null(y$tagwise.dispersion)) y <- estimateGLMTagwiseDisp(y,design = design, prior.df = prior.df)
+ if(record) y <- .getRecord(y, i = 0, weights = y$weights)
+ for (i in seq_len(maxit)){
+ if (verbose)
+ message(paste0("Iteration ", i, ": Re-fitting GLM. "),
+ appendLF = FALSE)
+ fit <- glmFit(y, design = design)
+ res <- .calcResid(fit, residual.type = residual.type)
+ y$weights <- .psi.huber.matrix(res, k = k)
+ y$AveLogCPM <- aveLogCPM(y,dispersion=y$trended.dispersion)
+ if(update.trend){
+ if (verbose)
+ message("Re-estimating trended dispersion.")
+ y <- estimateGLMTrendedDisp(y, design = design, method = trend.method)
+ }
+ if (verbose)
+ message("Re-estimating tagwise dispersion.")
+ y <- estimateGLMTagwiseDisp(y, design = design, prior.df=prior.df)
+ if(record) y <- .getRecord(y, i = i, res = res, weights = y$weights, fit = fit)
+
+ }
+ y
+
+}
+
+
+
+
+.calcResid <- function(f,residual.type=c("pearson", "anscombe","deviance"))
+{ residual.type <- match.arg(residual.type)
+ resAns <- function(y,mu,disper)
+ {
+ res <- function(y,mu,disper)
+ {
+ f <- function(x,disper) {(x*(1+disper*x))^(-1/3)}
+ const <- (f(mu,disper=disper))^(1/2)
+ if(mu==0)
+ out <- 0
+ else
+ out <- const*integrate(f,mu,y,disper=disper)$value
+ out
+ }
+ resV<- Vectorize(res,vectorize.args=c("y","mu","disper"))
+ out <- matrix(resV(y,mu,disper),nrow=nrow(y))
+ }
+
+ resDev <- function(y,mu,disper)
+ {
+ y <- y+1e-5
+ r <- 2*(y*log(y/mu)+(y+1/disper)*log((mu+1/disper)/(y+1/disper)))
+ #r[y==0] <- 0
+ r[mu==0] <- 0
+ sign(y-mu)*sqrt(r)
+ }
+
+ mu <- f$fitted.values
+ disp <- expandAsMatrix(f$dispersion,dim(mu))
+ yi <- f$counts
+ res <- switch(residual.type,anscombe=resAns(yi,mu,disp),
+ pearson = {(yi - mu)/sqrt((mu * (1+(disp)*mu)))},
+ deviance = resDev(yi,mu,disp))
+ res[mu==0] <- 0
+ res
+}
+
+
+
+.psi.huber.matrix <- function (u,k=1.345)
+{
+ z <- k/abs(u)
+ z[abs(u) <= k] <- 1
+ z
+}
+
+
+
+.getRecord <-
+function(y, i, res = NULL, weights = NULL, fit = NULL)
+{
+ iteration <- paste0("iteration_", i)
+ if(is.null(y$record)) record <- list()
+ else record <- y$record
+ if(!is.null(y$AveLogCPM)) record$AveLogCPM[[iteration]] <- y$AveLogCPM
+ if(!is.null(y$trended.dispersion)) record$trended.dispersion[[iteration]] <- y$trended.dispersion
+ if(!is.null(y$tagwise.dispersion)) record$tagwise.dispersion[[iteration]] <- y$tagwise.dispersion
+ if(!is.null(weights)) record$weights[[iteration]] <- weights
+ if(!is.null(res)) record$res[[iteration]] <- res
+ if(!is.null(fit)) record$mu[[iteration]] <- fit$fitted.values
+ y$record <- record
+ y
+}
\ No newline at end of file
diff --git a/R/estimateGLMTagwiseDisp.R b/R/estimateGLMTagwiseDisp.R
index ec677cd..2706305 100644
--- a/R/estimateGLMTagwiseDisp.R
+++ b/R/estimateGLMTagwiseDisp.R
@@ -1,32 +1,28 @@
-# Created March 2011. Last modified 13 March 2013.
-
estimateGLMTagwiseDisp <- function(y, ...)
UseMethod("estimateGLMTagwiseDisp")
-estimateGLMTagwiseDisp.DGEList <- function(y, design=NULL, offset=NULL, dispersion=NULL, prior.df=10, trend=!is.null(y$trended.dispersion), span=NULL, AveLogCPM=NULL, ...)
+estimateGLMTagwiseDisp.DGEList <- function(y, design=NULL, prior.df=10, trend=!is.null(y$trended.dispersion), span=NULL, ...)
{
-# If provided as arguments, offset and AveLogCPM over-rule the values stored in y
- if(!is.null(AveLogCPM)) y$AveLogCPM <- AveLogCPM
- if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y)
- if(!is.null(offset)) y$offset <- expandAsMatrix(offset,dim(y))
-
# Find appropriate dispersion
if(trend) {
- if(is.null(dispersion)) dispersion <- y$trended.dispersion
+ dispersion <- y$trended.dispersion
if(is.null(dispersion)) stop("No trended.dispersion found in data object. Run estimateGLMTrendedDisp first.")
} else {
- if(is.null(dispersion)) dispersion <- y$common.dispersion
+ dispersion <- y$common.dispersion
if(is.null(dispersion)) stop("No common.dispersion found in data object. Run estimateGLMCommonDisp first.")
}
- d <- estimateGLMTagwiseDisp(y=y$counts, design=design, offset=getOffset(y), dispersion=dispersion, trend=trend, prior.df=prior.df, AveLogCPM=y$AveLogCPM, ...)
+ if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y,dispersion=dispersion)
+
+ d <- estimateGLMTagwiseDisp(y=y$counts, design=design, offset=getOffset(y), dispersion=dispersion, trend=trend, prior.df=prior.df, AveLogCPM=y$AveLogCPM, weights=y$weights, ...)
+
y$prior.df <- prior.df
y$span <- d$span
y$tagwise.dispersion <- d$tagwise.dispersion
y
}
-estimateGLMTagwiseDisp.default <- function(y, design=NULL, offset=NULL, dispersion, prior.df=10, trend=TRUE, span=NULL, AveLogCPM=NULL, ...)
+estimateGLMTagwiseDisp.default <- function(y, design=NULL, offset=NULL, dispersion, prior.df=10, trend=TRUE, span=NULL, AveLogCPM=NULL, weights=NULL, ...)
{
# Check y
y <- as.matrix(y)
@@ -47,14 +43,18 @@ estimateGLMTagwiseDisp.default <- function(y, design=NULL, offset=NULL, dispersi
return(rep(NA,ntags))
}
+# Check offset
+ if(is.null(offset)) offset <- log(colSums(y))
+
# Check span
+# span can be chosen smaller when ntags is large
if(is.null(span)) if(ntags>10) span <- (10/ntags)^0.23 else span <- 1
# Check AveLogCPM
- if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,lib.size=exp(offset))
+ if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,offset=offset,dispersion=dispersion,weights=weights)
# Call Cox-Reid grid method
- tagwise.dispersion <- dispCoxReidInterpolateTagwise(y, design, offset=offset, dispersion, trend=trend, prior.df=prior.df, span=span, AveLogCPM=AveLogCPM, ...)
+ tagwise.dispersion <- dispCoxReidInterpolateTagwise(y, design, offset=offset, dispersion, trend=trend, prior.df=prior.df, span=span, AveLogCPM=AveLogCPM, weights=weights,...)
list(tagwise.dispersion=tagwise.dispersion,span=span)
}
diff --git a/R/estimateGLMTrendedDisp.R b/R/estimateGLMTrendedDisp.R
index e06a873..d0830f9 100644
--- a/R/estimateGLMTrendedDisp.R
+++ b/R/estimateGLMTrendedDisp.R
@@ -1,29 +1,26 @@
-# Last modified 11 March 2013
-
-estimateGLMTrendedDisp <- function(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", ...)
+estimateGLMTrendedDisp <- function(y, ...)
UseMethod("estimateGLMTrendedDisp")
-estimateGLMTrendedDisp.DGEList <- function(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", ...)
+estimateGLMTrendedDisp.DGEList <- function(y, design=NULL, method="auto", ...)
{
-# If provided as arguments, offset and AveLogCPM over-rule the values stored in y
- if(!is.null(AveLogCPM)) y$AveLogCPM <- AveLogCPM
if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y)
- if(!is.null(offset)) y$offset <- expandAsMatrix(offset,dim(y))
- d <- estimateGLMTrendedDisp(y=y$counts, design=design, offset=getOffset(y), AveLogCPM=y$AveLogCPM, method=method, ...)
+ d <- estimateGLMTrendedDisp(y=y$counts, design=design, offset=getOffset(y), AveLogCPM=y$AveLogCPM, method=method, weights=y$weights, ...)
+
y$trended.dispersion <- d$dispersion
y$trend.method <- d$trend.method
y$bin.dispersion <- d$bin.dispersion
y$bin.AveLogCPM <- d$bin.AveLogCPM
- y$design <- d$design
y
}
-estimateGLMTrendedDisp.default <- function(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", ...)
+estimateGLMTrendedDisp.default <- function(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", weights=NULL, ...)
{
# Check y
y <- as.matrix(y)
ntags <- nrow(y)
+ if(ntags==0) return(numeric(0))
+ nlibs <- ncol(y)
# Check design
if(is.null(design)) {
@@ -39,14 +36,10 @@ estimateGLMTrendedDisp.default <- function(y, design=NULL, offset=NULL, AveLogCP
}
# Check offset
- if(is.null(offset)) {
- lib.size <- colSums(y)
- offset <- log(lib.size)
- }
- offset <- expandAsMatrix(offset,dim(y))
+ if(is.null(offset)) offset <- log(colSums(y))
# Check AveLogCPM
- if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,lib.size=exp(offset))
+ if(is.null(AveLogCPM)) AveLogCPM <- aveLogCPM(y,offset=offset,weights=weights)
# Check method
method <- match.arg(method,c("auto","bin.spline","bin.loess","power","spline"))
@@ -60,13 +53,12 @@ estimateGLMTrendedDisp.default <- function(y, design=NULL, offset=NULL, AveLogCP
# Call lower-level function
trend <- switch(method,
- bin.spline=dispBinTrend(y, design, offset=offset, method.trend="spline", AveLogCPM=AveLogCPM, ...),
- bin.loess=dispBinTrend(y, design, offset=offset, method.trend="loess", AveLogCPM=AveLogCPM, ...),
+ bin.spline=dispBinTrend(y, design, offset=offset, method.trend="spline", AveLogCPM=AveLogCPM, weights=weights, ...),
+ bin.loess=dispBinTrend(y, design, offset=offset, method.trend="loess", AveLogCPM=AveLogCPM, weights=weights, ...),
power=dispCoxReidPowerTrend(y, design, offset=offset, AveLogCPM=AveLogCPM, ...),
spline=dispCoxReidSplineTrend(y, design, offset=offset, AveLogCPM=AveLogCPM, ...)
)
- trend$design <- design
trend$AveLogCPM <- AveLogCPM
trend$trend.method <- method
trend
diff --git a/R/estimateTagwiseDisp.R b/R/estimateTagwiseDisp.R
index a3688ce..49d1890 100644
--- a/R/estimateTagwiseDisp.R
+++ b/R/estimateTagwiseDisp.R
@@ -31,7 +31,7 @@ estimateTagwiseDisp <- function(object, prior.df=10, trend="movingave", span=NUL
grid.vals <- spline.disp/(1+spline.disp)
l0 <- matrix(0,ntags,grid.length)
- for(j in 1:grid.length) for(i in 1:length(y)) l0[,j] <- condLogLikDerDelta(y[[i]],grid.vals[j],der=0)+l0[,j]
+ for(j in 1:grid.length) for(i in 1:length(y)) l0[,j] <- condLogLikDerDelta(y[[i]],grid.vals[j],der=0L)+l0[,j]
if(is.null(span)) if(trend=="movingave") span <- 0.3 else span <- 0.5
m0 <- switch(trend,
@@ -50,7 +50,7 @@ estimateTagwiseDisp <- function(object, prior.df=10, trend="movingave", span=NUL
if(trend != "none") stop("optimize method doesn't allow for abundance-dispersion trend")
if(verbose) message("Tagwise dispersion optimization begun, may be slow, progress reported every 100 tags")
for(tag in seq_len(ntags)) {
- delta.this <- optimize(weightedCondLogLikDerDelta, interval=c(1e-4,100/(100+1)), tol=tol, maximum=TRUE, y=y, tag=tag, ntags=ntags, prior.n=prior.n, der=0)
+ delta.this <- optimize(weightedCondLogLikDerDelta, interval=c(1e-4,100/(100+1)), tol=tol, maximum=TRUE, y=y, tag=tag, ntags=ntags, prior.n=prior.n, der=0L)
delta[tag] <- delta.this$maximum
if(verbose) if(tag%%100==0) message("tag ",tag)
}
diff --git a/R/exactTest.R b/R/exactTest.R
index 826d997..7263c47 100644
--- a/R/exactTest.R
+++ b/R/exactTest.R
@@ -78,8 +78,8 @@ exactTest <- function(object, pair=1:2, dispersion="auto", rejection.region="dou
exact.pvals <- switch(rejection.region,
doubletail=exactTestDoubleTail(y1,y2,dispersion=dispersion,big.count=big.count),
- deviance=exactTestByDeviance(y1,y2,dispersion=dispersion,big.count=big.count),
- smallp=exactTestBySmallP(y1,y2,dispersion=dispersion,big.count=big.count)
+ deviance=exactTestByDeviance(y1,y2,dispersion=dispersion),
+ smallp=exactTestBySmallP(y1,y2,dispersion=dispersion)
)
AveLogCPM <- object$AveLogCPM
diff --git a/R/exactTestByDeviance.R b/R/exactTestByDeviance.R
index 7a0cf55..688564e 100644
--- a/R/exactTestByDeviance.R
+++ b/R/exactTestByDeviance.R
@@ -1,13 +1,14 @@
-exactTestByDeviance <- function(y1,y2,dispersion=0,big.count=900)
+exactTestByDeviance <- function(y1,y2,dispersion=0)
# Test for differences in means between two groups of
# negative binomial or Poisson random variables,
# using exact enumeration conditional on total sum.
# Rejection region is defined by large deviance statistics,
-# making this a conditional likelihood ratio test.
+# so this is a conditional likelihood ratio test.
-# Davis McCarthy, Gordon Smyth.
-# Created 8 August 2011. Last modified 2 October 2011.
+# R version by Davis McCarthy, Gordon Smyth, created 8 August 2011.
+# C++ version by Aaron Lun, created 26 July 2012.
+# Last modified 9 Dec 2013.
{
y1 <- as.matrix(y1)
y2 <- as.matrix(y2)
@@ -17,54 +18,27 @@ exactTestByDeviance <- function(y1,y2,dispersion=0,big.count=900)
n1 <- ncol(y1)
n2 <- ncol(y2)
- if(n1==n2) return(exactTestDoubleTail(y1=y1,y2=y2,dispersion=dispersion,big.count=big.count))
+ if(n1==n2) return(exactTestDoubleTail(y1=y1,y2=y2,dispersion=dispersion))
- sum1 <- round(rowSums(y1))
- sum2 <- round(rowSums(y2))
- N <- sum1+sum2
- mu <- N/(n1+n2)
+ sum1 <- as.integer(round(rowSums(y1)))
+ sum2 <- as.integer(round(rowSums(y2)))
if(all(dispersion==0)) return(binomTest(sum1,sum2,p=n1/(n1+n2)))
if(any(dispersion==0)) stop("dispersion must be either all zero or all positive")
if(length(dispersion)==1) dispersion <- rep(dispersion,ntags)
- r <- 1/dispersion
- all.zeros <- N==0
pvals <- rep(1,ntags)
if(ntags==0) return(pvals)
# Eliminate all zero rows
+ all.zeros <- sum1==0 & sum2==0
if(any(all.zeros)) {
- pvals[!all.zeros] <- Recall(y1=y1[!all.zeros,,drop=FALSE],y2=y2[!all.zeros,,drop=FALSE],dispersion=dispersion[!all.zeros],big.count=big.count)
+ pvals[!all.zeros] <- Recall(y1=y1[!all.zeros,,drop=FALSE],y2=y2[!all.zeros,,drop=FALSE],dispersion=dispersion[!all.zeros])
return(pvals)
}
- for (i in 1:ntags) {
- ind <- 0:N[i]
- p.top <- dnbinom(ind,size=n1*r[i],mu=n1*mu[i])*dnbinom(N[i]-ind,size=n2*r[i],mu=n2*mu[i])
- p.obs <- dnbinom(sum1[i],size=n1*r[i],mu=n1*mu[i]) * dnbinom(sum2[i],size=n2*r[i],mu=n2*mu[i])
- ## We need a robust way to choose the appropriate prob masses to sum over
- ## We use the LR stat to choose---choose all values of support with LR stat >= obs LR stat
- ## Then sum prob masses over these chosen values of support
- ind.mat <- cbind(ind, N[i]-ind)
- rmat.i <- matrix(cbind(n1*r[i], n2*r[i]), nrow=length(ind), ncol=2, byrow=TRUE)
- mumat.i <- matrix(cbind(n1*mu[i], n2*mu[i]), nrow=length(ind), ncol=2, byrow=TRUE)
- dev.obs <- .nbdev(cbind(sum1[i],sum2[i]), rmat.i[1,,drop=FALSE], mumat.i[1,,drop=FALSE])
- dev.all <- .nbdev(ind.mat, rmat.i, mumat.i)
- keep <- dev.all>=dev.obs
- p.bot <- dnbinom(N[i],size=(n1+n2)*r[i],mu=(n1+n2)*mu[i])
- pvals[i] <- sum(p.top[keep]/p.bot)
- }
- pmin(pvals,1)
-}
-.nbdev <- function(y, r, mu)
- ## Compute NB deviances for observations, with potentially different size (r = 1/dispersion) and mean (mu)
- ## Arguments can be vectors or matrices
- ## Created by Davis McCarthy, 5 August 2011.
- ## Last modified 8 August 2011.
-{
- if(!identical(dim(y),dim(r)) | !identical(dim(y),dim(mu)))
- stop("Matrices y, r and m are not all the same size.\n")
- y[y==0] <- 1e-08
- dev <- 2*rowSums( y*log( y / mu ) - (y + r)*log( (1+y/r) / (1+mu/r) ) )
- dev
+# Checking the dispersion type.
+ dispersion<-as.double(dispersion)
+ pvals<-.Call("R_exact_test_by_deviance", sum1, sum2, n1, n2, dispersion, PACKAGE="edgeR")
+ if (is.character(pvals)) { stop(pvals) }
+ pmin(pvals, 1)
}
diff --git a/R/exactTestBySmallP.R b/R/exactTestBySmallP.R
index 374719b..b65f6a5 100644
--- a/R/exactTestBySmallP.R
+++ b/R/exactTestBySmallP.R
@@ -1,4 +1,4 @@
-exactTestBySmallP <- function(y1,y2,dispersion=0,big.count=900)
+exactTestBySmallP <- function(y1,y2,dispersion=0)
# Test for differences in means between two groups of
# negative binomial or Poisson random variables,
# using exact enumeration conditional on total sum.
@@ -7,7 +7,7 @@ exactTestBySmallP <- function(y1,y2,dispersion=0,big.count=900)
# all values with probability equal or less than that observed.
# Mark Robinson, Davis McCarthy, Gordon Smyth.
-# Created 17 June 2009. Last modified 10 Jan 2012.
+# Created 17 June 2009. Last modified 9 Dec 2013.
{
y1 <- as.matrix(y1)
y2 <- as.matrix(y2)
@@ -17,7 +17,7 @@ exactTestBySmallP <- function(y1,y2,dispersion=0,big.count=900)
n1 <- ncol(y1)
n2 <- ncol(y2)
- if(n1==n2) return(exactTestDoubleTail(y1=y1,y2=y2,dispersion=dispersion,big.count=big.count))
+ if(n1==n2) return(exactTestDoubleTail(y1=y1,y2=y2,dispersion=dispersion))
sum1 <- round(rowSums(y1))
sum2 <- round(rowSums(y2))
@@ -32,7 +32,7 @@ exactTestBySmallP <- function(y1,y2,dispersion=0,big.count=900)
pvals <- rep(1,ntags)
if(ntags==0) return(pvals)
if(any(all.zeros)) {
- pvals[!all.zeros] <- Recall(y1=y1[!all.zeros,,drop=FALSE],y2=y2[!all.zeros,,drop=FALSE],dispersion=dispersion[!all.zeros],big.count=big.count)
+ pvals[!all.zeros] <- Recall(y1=y1[!all.zeros,,drop=FALSE],y2=y2[!all.zeros,,drop=FALSE],dispersion=dispersion[!all.zeros])
return(pvals)
}
for (i in 1:ntags) {
diff --git a/R/exactTestDoubleTail.R b/R/exactTestDoubleTail.R
index 5c77d3f..c81b326 100644
--- a/R/exactTestDoubleTail.R
+++ b/R/exactTestDoubleTail.R
@@ -4,7 +4,7 @@ exactTestDoubleTail <- function(y1,y2,dispersion=0,big.count=900)
# using exact enumeration conditional on total sum.
# Smaller tail probability is doubled to get p-value.
-# QUESTION: should we use sign(logFC) to choose which tail to evaluate
+# QUESTION: should we use sign(logFC) to choose which tail to evaluate
# instead of trying to find smaller of tail probabilities?
# Gordon Smyth
diff --git a/R/glmQLFTest.R b/R/glmQLFTest.R
index 372e2f0..dc5d851 100644
--- a/R/glmQLFTest.R
+++ b/R/glmQLFTest.R
@@ -1,7 +1,7 @@
glmQLFTest <- function(y, design=NULL, dispersion=NULL, coef=ncol(glmfit$design), contrast=NULL, abundance.trend=TRUE, robust=FALSE, winsor.tail.p=c(0.05,0.1), plot=FALSE)
# Quasi-likelihood F-tests for DGE glms.
# Davis McCarthy and Gordon Smyth.
-# Created 18 Feb 2011. Last modified 21 July 2013.
+# Created 18 Feb 2011. Last modified 18 Jan 2014.
{
# Initial fit with trended dispersion
if(is(y,"DGEList")) {
@@ -25,11 +25,8 @@ glmQLFTest <- function(y, design=NULL, dispersion=NULL, coef=ncol(glmfit$design)
df.residual <- glmfit$df.residual
# Adjust df.residual for fitted values at zero
- zerofit <- (glmfit$fitted.values < 1e-14)
- Q <- qr.Q(qr(glmfit$design))
- h <- rowSums(Q^2)
- dffromzeros <- zerofit %*% (1-h)
- df.residual <- drop(round(df.residual-dffromzeros))
+ zerofit <- (glmfit$fitted.values < 1e-4) & (glmfit$counts < 1e-4)
+ df.residual <- .residDF(zerofit, design)
# Empirical Bayes squeezing of the quasi-likelihood variance factors
s2 <- glmfit$deviance / df.residual
@@ -72,7 +69,7 @@ glmQLFTest <- function(y, design=NULL, dispersion=NULL, coef=ncol(glmfit$design)
out$table$F <- F
out$table$PValue <- F.pvalue
- out$df.residual.corrected <- df.residual
+ out$df.residual <- df.residual
out$s2.fit <- s2.fit
out$df.prior <- s2.fit$df.prior
out$df.total <- df.total
diff --git a/R/glmfit.R b/R/glmfit.R
index a09492f..741254f 100644
--- a/R/glmfit.R
+++ b/R/glmfit.R
@@ -1,16 +1,17 @@
# FIT GENERALIZED LINEAR MODELS
-glmFit <- function(y, design, dispersion=NULL, offset=NULL, weights=NULL, lib.size=NULL, prior.count=0.125, start=NULL, method="auto", ...)
+glmFit <- function(y, ...)
UseMethod("glmFit")
-glmFit.DGEList <- function(y, design=NULL, dispersion=NULL, offset=NULL, weights=NULL, lib.size=NULL, prior.count=0.125, start=NULL, method="auto", ...)
+glmFit.DGEList <- function(y, design=NULL, dispersion=NULL, prior.count=0.125, start=NULL, ...)
# Created 11 May 2011. Last modified 11 March 2013.
{
if(is.null(dispersion)) dispersion <- getDispersion(y)
if(is.null(dispersion)) stop("No dispersion values found in DGEList object.")
- if(is.null(offset) && is.null(lib.size)) offset <- getOffset(y)
+
if(is.null(y$AveLogCPM)) y$AveLogCPM <- aveLogCPM(y)
- fit <- glmFit(y=y$counts,design=design,dispersion=dispersion,offset=offset,weights=weights,lib.size=lib.size,prior.count=prior.count,start=start,method=method,...)
+
+ fit <- glmFit(y=y$counts,design=design,dispersion=dispersion,offset=getOffset(y),lib.size=NULL,weights=y$weights,prior.count=prior.count,start=start,...)
fit$samples <- y$samples
fit$genes <- y$genes
fit$prior.df <- y$prior.df
@@ -18,14 +19,18 @@ glmFit.DGEList <- function(y, design=NULL, dispersion=NULL, offset=NULL, weights
new("DGEGLM",fit)
}
-glmFit.default <- function(y, design=NULL, dispersion=NULL, offset=NULL, weights=NULL, lib.size=NULL, prior.count=0.125, start=NULL, method="auto", ...)
+glmFit.default <- function(y, design=NULL, dispersion=NULL, offset=NULL, lib.size=NULL, weights=NULL, prior.count=0.125, start=NULL, ...)
# Fit negative binomial generalized linear model for each transcript
# to a series of digital expression libraries
# Davis McCarthy and Gordon Smyth
-# Created 17 August 2010. Last modified 13 Nov 2012.
+# Created 17 August 2010. Last modified 26 Nov 2013.
{
-# Check input
+# Check y
y <- as.matrix(y)
+ ntag <- nrow(y)
+ nlib <- ncol(y)
+
+# Check design
if(is.null(design)) {
design <- matrix(1,ncol(y),1)
rownames(design) <- colnames(y)
@@ -35,75 +40,43 @@ glmFit.default <- function(y, design=NULL, dispersion=NULL, offset=NULL, weights
ne <- nonEstimable(design)
if(!is.null(ne)) stop(paste("Design matrix not of full rank. The following coefficients not estimable:\n", paste(ne, collapse = " ")))
}
- if(is.null(dispersion)) {
- stop("No dispersion values provided.")
- } else {
- if(!( length(dispersion)==1 | length(dispersion)==nrow(y) ))
- stop("Length of dispersion vector incompatible with count matrix. Dispersion argument must be either of length 1 (i.e. common dispersion) or length equal to the number of rows of y (i.e. individual dispersion value for each tag/gene).")
+
+# Check dispersion
+ if(is.null(dispersion)) stop("No dispersion values provided.")
+
+# Check offset and lib.size
+ if(is.null(offset)) {
+ if(is.null(lib.size)) lib.size <- colSums(y)
+ offset <- log(lib.size)
}
- if(!is.null(offset) && !is.null(lib.size)) warning("offset and lib.size both supplied: offset takes precedence, lib.size ignored.")
- if(is.null(lib.size)) lib.size <- colSums(y)
- if(is.null(offset)) offset <- log(lib.size)
offset <- expandAsMatrix(offset,dim(y))
- iswt <- !is.null(weights)
- if(iswt) {
- weights <- expandAsMatrix(weights,dim(y))
- weights[weights <= 0] <- NA
- y[!is.finite(weights)] <- NA
- }
- method <- match.arg(method,c("auto","linesearch","levenberg","simple"))
-# End of input checking
- ngenes <- nrow(y)
- nlibs <- ncol(y)
- isna <- any(is.na(y))
+# weights are checked in lower-level functions
-# Choose fitting algorithm
- if(method=="auto") {
- if(isna || iswt) {
- method <- "simple"
- } else {
- group <- designAsFactor(design)
- if(nlevels(group)==ncol(design)) {
- method <- "oneway"
- } else {
- method <- "levenberg"
- }
- }
- }
- if(method!="simple") {
- if(iswt) stop("weights only supported by simple fitting method")
- if(isna) stop("NAs only supported by simple fitting method")
+# Fit the tagwise glms
+# If the design is equivalent to a oneway layout, use a shortcut algorithm
+ group <- designAsFactor(design)
+ if(nlevels(group)==ncol(design)) {
+ fit <- mglmOneWay(y,design=design,dispersion=dispersion,offset=offset,weights=weights)
+ fit$deviance <- nbinomDeviance(y=y,mean=fit$fitted.values,dispersion=dispersion,weights=weights)
+ fit$method <- "oneway"
+ } else {
+ fit <- mglmLevenberg(y,design=design,dispersion=dispersion,offset=offset,weights=weights,coef.start=start,maxit=250)
+ fit$method <- "levenberg"
}
-# Fit a glm to each gene
- fit <- switch(method,
- linesearch=mglmLS(y,design=design,dispersion=dispersion,coef.start=start,offset=offset,...),
- oneway=mglmOneWay(y,design=design,dispersion=dispersion,offset=offset),
- levenberg=mglmLevenberg(y,design=design,dispersion=dispersion,offset=offset,coef.start=start,maxit=250,...),
- simple=mglmSimple(y,design=design,dispersion=dispersion,offset=offset,weights=weights)
- )
-
# Prepare output
fit$counts <- y
- if(prior.count>0)
- fit$coefficients <- predFC(y,design,offset=offset,dispersion=dispersion,prior.count=prior.count)*log(2)
- else
- fit$coefficients <- as.matrix(fit$coefficients)
+ if(prior.count>0) fit$coefficients <- predFC(y,design,offset=offset,dispersion=dispersion,prior.count=prior.count,weights=weights,...)*log(2)
colnames(fit$coefficients) <- colnames(design)
rownames(fit$coefficients) <- rownames(y)
- fit$fitted.values <- as.matrix(fit$fitted.values)
dimnames(fit$fitted.values) <- dimnames(y)
- if(is.null(fit$deviance)) {
- deviances <- deviances.function(dispersion)
- fit$deviance <- deviances(y,fit$fitted.values,dispersion)
- }
- if(is.null(fit$df.residual)) fit$df.residual <- rep(nlibs-ncol(design),ngenes)
-# if(is.null(fit$abundance)) fit$abundance <- mglmOneGroup(y, offset=offset, dispersion=dispersion)
- if(is.null(fit$design)) fit$design <- design
- if(is.null(fit$offset)) fit$offset <- offset
- if(is.null(fit$dispersion)) fit$dispersion <- dispersion
- fit$method <- method
+# FIXME: we are not allowing missing values, so df.residual must be same for all tags
+ fit$df.residual <- rep(nlib-ncol(design),ntag)
+ fit$design <- design
+ fit$offset <- offset
+ fit$dispersion <- dispersion
+ fit$weights <- weights
new("DGEGLM",fit)
}
@@ -111,7 +84,7 @@ glmFit.default <- function(y, design=NULL, dispersion=NULL, offset=NULL, weights
glmLRT <- function(glmfit,coef=ncol(glmfit$design),contrast=NULL,test="chisq")
# Tagwise likelihood ratio tests for DGEGLM
# Gordon Smyth, Davis McCarthy and Yunshun Chen.
-# Created 1 July 2010. Last modified 14 Dec 2012.
+# Created 1 July 2010. Last modified 22 Nov 2013.
{
# Check glmfit
if(!is(glmfit,"DGEGLM")) {
@@ -120,6 +93,7 @@ glmLRT <- function(glmfit,coef=ncol(glmfit$design),contrast=NULL,test="chisq")
}
stop("glmfit must be an DGEGLM object (usually produced by glmFit).")
}
+ if(is.null(glmfit$AveLogCPM)) glmfit$AveLogCPM <- aveLogCPM(glmfit)
nlibs <- ncol(glmfit)
# Check test
@@ -197,7 +171,6 @@ glmLRT <- function(glmfit,coef=ncol(glmfit$design),contrast=NULL,test="chisq")
rn <- 1:nrow(glmfit)
else
rn <- make.unique(rn)
- if(is.null(glmfit$AveLogCPM)) glmfit$AveLogCPM <- aveLogCPM(glmfit)
tab <- data.frame(
logFC=logFC,
logCPM=glmfit$AveLogCPM,
diff --git a/R/mglmLS.R b/R/mglmLS.R
deleted file mode 100644
index 68ebab9..0000000
--- a/R/mglmLS.R
+++ /dev/null
@@ -1,190 +0,0 @@
-deviances.function <- function(dispersion)
-# Deviance function for multiple GLMs
-# Gordon Smyth
-# 23 November 2010. Last modified 26 Nov 2010.
-{
- i <- dispersion>0
- if(all(i)) {
-# All Negative binomial
- deviances <- function(y,mu,dispersion) {
- logymu <- log(y/mu)
- logymu[y<1e-14] <- 0
- 2*rowSums(y*logymu + (y+1/dispersion)*log((mu+1/dispersion)/(y+1/dispersion)))
- }
- } else {
- if(any(i)) {
-# Some Poisson, some negative binomial
- deviances <- function(y,mu,dispersion) {
- i <- dispersion>0
- f0 <- deviances.function(0)
- f1 <- deviances.function(1)
- dev <- dispersion
- dev[!i] <- f0(y[!i,,drop=FALSE],mu[!i,,drop=FALSE],0)
- dev[i] <- f1(y[i,,drop=FALSE],mu[i,,drop=FALSE],dispersion[i])
- dev
- }
- } else {
-# All Poisson
- deviances <- function(y,mu,dispersion) {
- logymu <- log(y/mu)
- logymu[y<1e-14] <- 0
- 2*rowSums(y*logymu-(y-mu))
- }
- }
- }
- deviances
-}
-
-######################################################
-######### Simple Line Search glm (Multiple) ##########
-######################################################
-
-mglmLS <- function(y,design,dispersion=0,offset=0,coef.start=NULL,tol=1e-5,maxit=50,trace=FALSE)
-# Fit the same negative binomial generalized linear model with log link
-# to multipe response vectors
-# by approximate Fisher scoring with simple line search
-# Yunshun Chen and Gordon Smyth
-# 12 November 2010. Revised 27 July 2012.
-{
-# Check input
- X <- as.matrix(design)
- ncoef <- ncol(X)
- if(any(y<0)) stop("y must be non-negative")
- if(is.vector(y)) y <- matrix(y, nrow=1)
- ntags <- nrow(y)
- nlibs <- ncol(y)
- phi <- dispersion
- if(any(phi<0)) stop("dispersions must be non-negative")
- if(all(phi==0)) {
- ispoisson <- TRUE
- } else {
- if(any(phi==0)) stop("Cannot mix zero and positive dispersions")
- ispoisson <- FALSE
- }
- phi <- rep(phi,length=ntags)
- offset <- expandAsMatrix(offset,dim(y))
-
-# Define deviance functions
- deviances <- deviances.function(dispersion)
-
-# Transform to orthonormal design matrix
- qrX <- qr(X)
- X <- qr.Q(qrX)
-
- beta <- matrix(0,ntags,ncoef)
- rownames(beta) <- rownames(y)
- colnames(beta) <- colnames(X)
- stepsize <- meanw <- 1/rowMeans(y)+phi
-
-# Non-iterative solution for all zero case
- nypos <- rowSums(y>0)
- if(any(nypos<1)) {
-# yi <- y[nypos<1,,drop=FALSE]
-# logyi <- log(yi)
-# logyi[yi==0] <- -30
- z <- -30-offset[nypos<1,,drop=FALSE]
- beta[nypos<1,] <- z %*% X
- }
-
-# Index tags still iterating
- i <- nypos >= 1
- ls.fail <- rep(FALSE,ntags)
-
-# Starting values
- if(any(i))
- if(is.null(coef.start)) {
- z <- log(pmax(y[i,,drop=FALSE],1/6))-offset[i,,drop=FALSE]
-# beta[i,] <- t(qr.coef(qrX,t(z)))
- beta[i,] <- z %*% X
- } else {
- beta[i,] <- coef.start[i,,drop=FALSE]
- }
- mu <- exp(beta %*% t(X) + offset)
- dimnames(mu) <- dimnames(y)
-
-# Approximate Fisher scoring iteration
- iter <- 0
- if(trace) {
- cat("Iter",iter,"\n")
- cat("Scoring for",sum(i),"tag(s)\n")
-# print(summary(beta))
- }
- while(any(i)) {
- iter <- iter + 1
-
-# Tagwise test for convergence
- yi <- y[i,,drop=FALSE]
- mui <- mu[i,,drop=FALSE]
- phii <- phi[i]
- z <- (yi-mui)/(1+phii*mui)
-# dbeta <- t(qr.coef(qrX,t(z)))
- dbeta <- z %*% X
- derivbig <- rowMeans(abs(dbeta)) > tol
-# cat("derivbig",derivbig,"\n")
- i[i] <- derivbig
- i[ls.fail] <- FALSE
-# cat("i",i,"\n")
-
- if(iter > maxit) break
-
- if(trace) {
- cat("Iter",iter,"\n")
- cat("Scoring for",sum(i),"tag(s)\n")
-# print(summary(beta))
- }
-
-# Subset to data not yet converged
- if(any(!derivbig)) {
- yi <- yi[derivbig,,drop=FALSE]
- mui <- mui[derivbig,,drop=FALSE]
- phii <- phii[derivbig]
- dbeta <- dbeta[derivbig,,drop=FALSE]
- }
-
-# Current deviance, and prepare for line search
- devi <- deviances(yi,mui,phii)
- betai <- beta[i,,drop=FALSE]
- offseti <- offset[i,,drop=FALSE]
- stepsizei <- stepsize[i]
-
-# Index tags active in line search
- j <- i[i]
-# cat("j",j,"\n")
-
-# Line search until deviance is decreased
- iter.ls <- 0
- while(any(j)) {
- iter.ls <- iter.ls + 1
- if(iter.ls > 50) {
-# cat("Line search iteration limit exceeded at iteration ",iter,"\n")
- k <- which(i)[j]
- ls.fail[k] <- TRUE
- i[k] <- FALSE
- break
- }
- betaj <- betai[j,,drop=FALSE] + stepsizei[j]*dbeta[j,,drop=FALSE]
- muj <- exp(betaj %*% t(X) + offseti[j,,drop=FALSE])
- devj <- deviances(yi[j,,drop=FALSE],muj,phii[j])
- decr <- devj < devi[j]
- if(any(decr)) {
- k <- which(i)[j][decr]
- beta[k,] <- betaj[decr,]
- mu[k,] <- exp(beta[k,,drop=FALSE]%*%t(X)+offset[k,,drop=FALSE])
- if(iter.ls==1) stepsize[k] <- 1.2*stepsize[k]
-# print(betaj[decr,])
- j[j] <- !decr
-# cat("j",j,"\n")
- }
- if(trace) {
- if(iter.ls==1 & any(j)) cat("Step halving:",sum(j),"tag(s), ")
- }
- stepsizei[j] <- stepsizei[j]/3
- }
- if(trace)
- if(iter.ls>1) cat(iter.ls-1,"iteration(s)\n")
- }
-
- R <- qr.R(qrX)
- beta <- t(solve(R,t(beta)))
- list(coefficients=beta,fitted.values=mu,fail=which(ls.fail),not.converged=which(i))
-}
diff --git a/R/mglmLevenberg.R b/R/mglmLevenberg.R
index c35d7a7..9885a72 100644
--- a/R/mglmLevenberg.R
+++ b/R/mglmLevenberg.R
@@ -1,6 +1,6 @@
-mglmLevenberg <- function(y, design, dispersion=0, offset=0, coef.start=NULL, start.method="null", tol=1e-06, maxit=200)
+mglmLevenberg <- function(y, design, dispersion=0, offset=0, weights=NULL, coef.start=NULL, start.method="null", maxit=200, tol=1e-06)
# Fit genewise negative binomial glms with log-link
-# using Levenberg damping to ensure secure convergence
+# using Levenberg damping to ensure convergence
# R version by Gordon Smyth and Yunshun Chen
# C++ version by Aaron Lun
@@ -8,33 +8,42 @@ mglmLevenberg <- function(y, design, dispersion=0, offset=0, coef.start=NULL, st
{
# Check arguments
y <- as.matrix(y)
+ if(!is.numeric(y)) stop("y is non-numeric")
if(any(y<0)) stop("y must be non-negative")
nlibs <- ncol(y)
ngenes <- nrow(y)
if(nlibs==0 || ngenes==0) stop("no data")
+
if(!( all(is.finite(y)) || all(is.finite(design)) )) stop("All values must be finite and non-missing")
design <- as.matrix(design)
if(length(dispersion)<ngenes) dispersion <- rep(dispersion,length.out=ngenes)
+
if(is.null(coef.start)) {
start.method <- match.arg(start.method, c("null","y"))
if(start.method=="null") N <- exp(offset)
} else {
coef.start <- as.matrix(coef.start)
}
+
offset <- t(expandAsMatrix(offset,dim(y)))
+# Check weights
+ if(is.null(weights)) weights <- 1
+ weights <- t(expandAsMatrix(weights,dim(y)))
+
# Initializing if desired. Note that lm.fit can fit in a vectorised manner,
# where each column of the input matrix is a separate set of observations.
if(is.null(coef.start)) {
if(start.method=="y") {
delta <- min(max(y), 1/6)
y1 <- pmax(y, delta)
- fit <- lm.fit(design, t(log(y1)) - offset)
+ fit <- lm.wfit(design, t(log(y1)) - offset, weights)
beta <- fit$coefficients
mu <- exp(beta + offset)
} else {
N <- expandAsMatrix(N,dim(y))
- beta.mean <- log(.rowMeans(y/N,ngenes,nlibs))
+ w <- t(weights) * N/(1+dispersion*N)
+ beta.mean <- log(rowSums(y*w/N)/rowSums(w))
beta <- qr.coef(qr(design), matrix(beta.mean,nrow=nlibs,ncol=ngenes,byrow=TRUE))
mu <- exp(design %*% beta + offset)
}
@@ -43,22 +52,22 @@ mglmLevenberg <- function(y, design, dispersion=0, offset=0, coef.start=NULL, st
mu <- exp(design %*% beta + offset)
}
-# Checking arguments and calling the C++ method. Warning; this WILL overwrite 'beta' and 'mu' in memory.
-# Also note that we use transposed matrices so that each row of the original can be accessed from column-major
-# storage in C++.
- if (!is.double(design)) storage.mode(design)<-"double"
- if (!is.double(y)) storage.mode(y)<-"double"
- if (!is.double(dispersion)) storage.mode(dispersion)<-"double"
- if (!is.double(offset)) storage.mode(offset)<-"double"
- if (!is.double(beta)) storage.mode(beta)<-"double"
- if (!is.double(mu)) storage.mode(mu)<-"double"
- output <- .Call("R_levenberg", nlibs, ngenes, design, t(y), dispersion, offset, beta, mu, tol, maxit, PACKAGE="edgeR")
+# Checking arguments and calling the C++ method. We use transposed matrices so that each can be accessed from column-major storage in C++.
+ if (!is.double(design)) storage.mode(design) <- "double"
+ if (!is.double(dispersion)) storage.mode(dispersion) <- "double"
+ if (!is.double(offset)) storage.mode(offset) <- "double"
+ if (!is.double(weights)) storage.mode(weights) <- "double"
+ if (!is.double(beta)) storage.mode(beta) <- "double"
+ if (!is.double(mu)) storage.mode(mu) <- "double"
+ output <- .Call("R_levenberg", nlibs, ngenes, design, t(y), dispersion, offset, weights, beta, mu, tol, maxit, PACKAGE="edgeR")
+
+# Check for error condition
if (is.character(output)) { stop(output) }
- # Naming the output and returning it.
+# Naming the output and returning it.
names(output) <- c("coefficients", "fitted.values", "deviance", "iter", "failed")
- output$coefficients<-t(output$coefficients)
- output$fitted.values<-t(output$fitted.values)
+ output$coefficients <- t(output$coefficients)
+ output$fitted.values <- t(output$fitted.values)
colnames(output$coefficients) <- colnames(design)
rownames(output$coefficients) <- rownames(y)
dimnames(output$fitted.values) <- dimnames(y)
diff --git a/R/mglmOneGroup.R b/R/mglmOneGroup.R
index fea29ee..7e18837 100644
--- a/R/mglmOneGroup.R
+++ b/R/mglmOneGroup.R
@@ -1,42 +1,52 @@
-mglmOneGroup <- function(y,dispersion=0,offset=0,maxit=50,tol=1e-10)
-# Fit null (single-group) negative-binomial glm with log-link to DGE data
-# Aaron Lun and Gordon Smyth
-# 18 Aug 2010. Last modified 19 October 2012.
+mglmOneGroup <- function(y,dispersion=0,offset=0,weights=NULL,maxit=50,tol=1e-10,verbose=FALSE)
+# Fit single-group negative-binomial glm
+# Aaron Lun and Gordon Smyth
+# 18 Aug 2010. Last modified 22 Nov 2013.
{
-# Check input values for y
+# Check y
y <- as.matrix(y)
+ if(!is.numeric(y)) stop("y is non-numeric")
if(any(y<0)) stop("y must be non-negative")
ntags <- nrow(y)
nlibs <- ncol(y)
-# Check input values for dispersion
+# Check dispersion
+ dispersion <- as.vector(dispersion)
+ if(typeof(dispersion) != "double") stop("dispersion not floating point number")
if(any(dispersion<0)) stop("dispersion must be non-negative")
+# Check offset
+ if(typeof(offset) != "double") stop("offset not floating point number")
# All-Poisson special case
- N <- exp(offset)
- if(all(dispersion==0)) {
+ if(all(dispersion==0) && is.null(weights)) {
+ N <- exp(offset)
if(is.null(dim(N)))
m <- mean(N)
else
m <- .rowMeans(N,ntags,nlibs)
- return(log(.rowMeans(y/m,ntags,nlibs)))
+ return(log(.rowMeans(y/m, ntags, nlibs)))
}
-# Expanding the offset and dispersion values.
+# Check weights
+ if(is.null(weights)) weights=1
+ if(typeof(weights) == "integer") storage.mode(weights) <- "double"
+ if(typeof(weights) != "double") stop("weights is non-numeric")
+
+# Expansions to full dimensions
dispersion <- rep(dispersion,length=ntags)
offset <- expandAsMatrix(offset,dim(y))
+ weights <- expandAsMatrix(weights,dim(y))
+
+# Fisher scoring iteration.
+# Matrices are transposed so that values for each tag are in consecutive memory locations in C
+ output <- .Call("R_one_group", ntags, nlibs, y, dispersion, offset, weights, maxit, tol, PACKAGE="edgeR")
-# Checking type for entry into C++ code.
- if (!is.double(dispersion)) storage.mode(dispersion)<-"double"
- if (!is.double(offset)) storage.mode(offset)<-"double"
- stopifnot(is.numeric(y));
+# Check error condition
+ if(is.character(output)) stop(output)
-# Fisher scoring iteration. Matrices are transposed due to column major storage - thus, each column
-# of the transposed matrix maps to a row of the original for easy access.
- output<-.Call("R_one_group", ntags, nlibs, t(y), dispersion, t(offset), maxit, tol, PACKAGE="edgeR")
- if (is.character(output) ) { stop(output) }
- if (any(!output[[2]])) warning(paste("max iteractions exceeded for", sum(!output[[2]]), "tags", sep=" "))
+# Convergence achieved for all tags?
+ if(verbose) if (any(!output[[2]])) warning(paste("max iteractions exceeded for", sum(!output[[2]]), "tags", sep=" "))
output[[1]]
}
diff --git a/R/mglmOneWay.R b/R/mglmOneWay.R
index e7ca293..c83879d 100644
--- a/R/mglmOneWay.R
+++ b/R/mglmOneWay.R
@@ -10,12 +10,12 @@ designAsFactor <- function(design)
g
}
-mglmOneWay <- function(y,design=NULL,dispersion=0,offset=0,maxit=50)
+mglmOneWay <- function(y,design=NULL,dispersion=0,offset=0,weights=NULL,maxit=50,tol=1e-10)
# Fit multiple negative binomial glms with log link
# by Fisher scoring with
# only a single explanatory factor in the model
# Gordon Smyth
-# 11 March 2011. Last modified 19 October 2012.
+# 11 March 2011. Last modified 21 Nov 2013.
{
y <- as.matrix(y)
ntags <- nrow(y)
@@ -31,11 +31,12 @@ mglmOneWay <- function(y,design=NULL,dispersion=0,offset=0,maxit=50)
stopifnot(ncol(design)==ngroups)
mu <- matrix(0,ntags,ngroups)
offset <- expandAsMatrix(offset,dim(y))
+ if(!is.null(weights)) weights <- expandAsMatrix(weights,dim(y))
firstjofgroup <- rep(0,ngroups)
for (g in 1:ngroups) {
j <- which(group==(levels(group)[g]))
firstjofgroup[g] <- j[1]
- mu[,g] <- mglmOneGroup(y[,j,drop=FALSE],dispersion=dispersion,offset=offset[,j,drop=FALSE],maxit=maxit)
+ mu[,g] <- mglmOneGroup(y[,j,drop=FALSE],dispersion=dispersion,offset=offset[,j,drop=FALSE],weights=weights[,j,drop=FALSE],maxit=maxit,tol=tol)
}
designunique <- design[firstjofgroup,,drop=FALSE]
mu1 <- pmax(mu,-1e8)
diff --git a/R/mglmSimple.R b/R/mglmSimple.R
deleted file mode 100644
index c13292b..0000000
--- a/R/mglmSimple.R
+++ /dev/null
@@ -1,83 +0,0 @@
-### NB GLM fitting genewise using glm.fit()
-
-mglmSimple <- function(y, design, dispersion=0, offset=0, weights=NULL)
-## Fit negative binomial generalized linear model for each transcript
-## to a series of digital expression libraries,
-## using genewise calls to stats:::glm.fit().
-## Requires the {MASS} package for the negative.binomial() family
-## Lower-level function. Takes a matrix of counts (y)
-
-## Davis McCarthy and Gordon Smyth
-## Created 17 August 2010. Last modified 10 Apr 2012.
-{
-# Check arguments
- require(MASS)
- y <- as.matrix(y)
- nlibs <- ncol(y)
- ngenes <- nrow(y)
- design <- as.matrix(design)
- offset <- expandAsMatrix(offset,dim(y))
- if(!is.null(weights)) {
- weights <- expandAsMatrix(weights,dim(y))
- weights[weights <= 0] <- NA
- y[!is.finite(weights)] <- NA
- } else {
- weights <- array(1,dim(y))
- }
-
-# Define objects in which to store various results from the glm fits
- coefficients <- matrix(NA,nrow=ngenes,ncol=ncol(design))
- fitted.values <- matrix(NA,nrow=ngenes,ncol=nlibs)
- colnames(coefficients) <- colnames(design)
- rownames(coefficients) <- rownames(y)
- dimnames(fitted.values) <- dimnames(y)
- df.residual <- rep(0,ngenes)
- dev <- rep(NA,ngenes)
- error <- converged <- rep(FALSE,ngenes)
-
-# If common dispersion, then set glm family here
- if(length(dispersion)>1) {
- common.family <- FALSE
- if(length(dispersion)!=ngenes) stop("length(dispersion) should agree with nrow(y)")
- } else {
- common.family <- TRUE
- if(dispersion > 1e-10)
- f <- negative.binomial(link="log",theta=1/dispersion)
- else
- f <- poisson(link="log")
- }
-
-# Fit a glm to each gene sequentially
-
- for (i in 1:ngenes) {
- if(!common.family) {
- if(dispersion[i] > 1e-10)
- f <- negative.binomial(link="log",theta=1/dispersion[i])
- else
- f <- poisson(link="log")
- f$aic <- function(y,n,mu,wt,dev) NA
- }
-
- z <- as.vector(y[i,])
- obs <- is.finite(z)
- if(sum(obs) > 0) {
- X <- design[obs,,drop=FALSE]
- z <- z[obs]
- w <- as.vector(weights[i,obs])
- out <- tryCatch(glm.fit(X,z,w,offset=offset[i,obs],family=f),error=function(e) e)
- if(class(out)[1]=="simpleError") {
- error[i] <- TRUE
- } else {
- coefficients[i,] <- out$coefficients
- fitted.values[i,] <- fitted(out)
- dev[i] <- out$deviance
- df.residual[i] <- out$df.residual
- converged[i] <- out$converged
- }
- }
- }
- list(coefficients=coefficients, df.residual=df.residual, deviance=dev, design=design,
- offset=offset, dispersion=dispersion, weights=weights, fitted.values=fitted.values,
- converged=converged, error=error)
-}
-
diff --git a/R/nbinomDeviance.R b/R/nbinomDeviance.R
new file mode 100644
index 0000000..8ea3054
--- /dev/null
+++ b/R/nbinomDeviance.R
@@ -0,0 +1,43 @@
+nbinomDeviance <- function(y,mean,dispersion=0,weights=NULL)
+# Negative binomial residual deviance
+# y is a matrix and a deviance is computed for each row
+# A vector y is taken to be a matrix with one row.
+# Original version 23 November 2010.
+# Last modified 9 Dec 2013.
+{
+# Convert vector to row matrix
+ if(!is.matrix(y)) y <- array(y, c(1L,length(y)), if(!is.null(names(y))) list(NULL,names(y)))
+
+ d <- nbinomUnitDeviance(y=y,mean=mean,dispersion=dispersion)
+ if(!is.null(weights)) d <- weights*d
+ rowSums(d)
+}
+
+
+nbinomUnitDeviance <- function(y,mean,dispersion=0)
+# Unit deviance for the nbinom distribution.
+{
+# Check y. May be matrix or vector.
+ if (!is.double(y)) storage.mode(y) <- "double"
+ ntags <- NROW(y)
+ nobs <- length(y)
+
+# Check mean
+ if (!is.double(mean)) storage.mode(mean) <- "double"
+ if(length(mean)<nobs) stop("mean should have same dimensions as y")
+
+# Check dispersion.
+# Can be tagwise (rowwise) or observation-wise.
+ if (!is.double(dispersion)) dispersion <- "double"
+ lend <- length(dispersion)
+ if(lend < ntags) dispersion <- rep_len(dispersion, length.out=ntags)
+ if(lend > ntags && lend < nobs) dispersion <- rep_len(dispersion, length.out=nobs)
+
+ out <- .Call("R_compute_nbdev", y=y, mu=mean, phi=dispersion, PACKAGE="edgeR")
+
+# Check error status
+ if (is.character(out)) stop(out)
+
+ y[] <- out
+ return(y)
+}
diff --git a/R/plotBCV.R b/R/plotBCV.R
index d64765f..7a42442 100644
--- a/R/plotBCV.R
+++ b/R/plotBCV.R
@@ -34,7 +34,7 @@ plotBCV <- function(y, xlab="Average log CPM", ylab="Biological coefficient of v
labels <- c(labels,"Trend")
cols <- c(cols,col.trend)
}
- legend("topright",legend=labels,lwd=2,col=cols)
+ legend("topright",legend=labels,lty=c(-1,1,1),pch=c(pch,-1,-1),pt.cex=cex,lwd=2,col=cols)
# Add binned dispersions if appropriate
# if(!is.null(y$trend.method)) if(y$trend.method %in% c("bin.spline","bin.loess")) if(!is.null(y$bin.dispersion)) if(!is.null(y$bin.AveLogCPM))
diff --git a/R/predFC.R b/R/predFC.R
index df06e76..4d7f2ec 100644
--- a/R/predFC.R
+++ b/R/predFC.R
@@ -1,7 +1,7 @@
-predFC <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=NULL)
+predFC <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=NULL,weights=NULL,...)
UseMethod("predFC")
-predFC.DGEList <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=NULL)
+predFC.DGEList <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=NULL,weights=NULL,...)
{
if(is.null(offset)) offset <- getOffset(y)
if(is.null(dispersion)) dispersion <- getDispersion(y)
@@ -9,10 +9,10 @@ predFC.DGEList <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersio
dispersion <- 0
message("dispersion set to zero")
}
- predFC.default(y=y$counts,design=design,prior.count=prior.count,offset=offset,dispersion=dispersion)
+ predFC.default(y=y$counts,design=design,prior.count=prior.count,offset=offset,dispersion=dispersion,weights=weights,...)
}
-predFC.default <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=0)
+predFC.default <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersion=0,weights=NULL,...)
# Shrink log-fold-changes towards zero by augmenting data counts
# Gordon Smyth and Belinda Phipson
# 17 Aug 2011. Last modified 4 Nov 2012.
@@ -22,6 +22,7 @@ predFC.default <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersio
ngenes <- nrow(y)
nsamples <- ncol(y)
+
# Check prior.count
if(prior.count<0) stop("prior.count should be non-negative")
@@ -31,7 +32,6 @@ predFC.default <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersio
offset <- log(lib.size)
} else
lib.size <- exp(offset)
-
# Check design
if(is.null(design)) {
warning("Behaviour of predFC with design=NULL is scheduled to be deprecated April 2014. Use cpm() instead.",call.=FALSE)
@@ -50,7 +50,7 @@ predFC.default <- function(y,design=NULL,prior.count=0.125,offset=NULL,dispersio
y <- y+prior.count
# Return matrix of coefficients on log2 scale
- g <- glmFit(y,design,offset=log(lib.size),dispersion=dispersion,prior.count=0)
- g$coefficients/log(2)
+ g <- glmFit(y,design,offset=log(lib.size),dispersion=dispersion,prior.count=0,weights=weights,...)
+ g$coefficients/log(2)
}
diff --git a/R/processHairpinReads.R b/R/processHairpinReads.R
new file mode 100644
index 0000000..aa19562
--- /dev/null
+++ b/R/processHairpinReads.R
@@ -0,0 +1,124 @@
+# Code to process hairpin reads from Illumina sequencer
+# Assume fixed structure of read:
+# Barcode + Common sequence + Hairpin sequence
+
+processHairpinReads = function(readfile, barcodefile, hairpinfile,
+ barcodeStart=1, barcodeEnd=5, hairpinStart=37, hairpinEnd=57,
+ allowShifting=FALSE, shiftingBase = 3,
+ allowMismatch=FALSE, barcodeMismatchBase = 1, hairpinMismatchBase = 2,
+ allowShiftedMismatch = FALSE,
+ verbose = FALSE) {
+
+ # Check file existence
+ if ((length(readfile) == 1) && (!file.exists(readfile)))
+ stop("Read file doesn't exist.\n")
+ if (length(readfile) > 1){
+ for(i in 1:length(readfile)){
+ if (!file.exists(readfile[i]))
+ stop(paste("Read file ", readfile[i], " doesn't exist. \n", sep=""))
+ }
+ }
+ if (!file.exists(barcodefile))
+ stop("Barcode file doesn't exist.\n")
+ if (!file.exists(hairpinfile))
+ stop("Hairpin file doesn't exist.\n")
+
+ # Validating params
+ reads <- file(readfile[1], "rt");
+ first_read <- readLines(reads, 2)
+ readlength <- nchar(first_read[2])
+
+ if (barcodeStart > barcodeEnd)
+ stop("Barcode start position is greater than barcode end position.\n")
+ if ((barcodeStart < 1) || (barcodeStart > readlength))
+ stop("Invalid barcode start position!\n")
+ if ((barcodeEnd < 1) || (barcodeEnd > readlength))
+ stop("Invalid barcode end position!\n")
+ if (barcodeEnd <= barcodeStart)
+ stop("Barcode end position should be greater than barcode start position. \n")
+ if ((hairpinStart < 1) || (hairpinStart > readlength))
+ stop("Invalid hairpin start position!")
+ if ((hairpinEnd < 1) || (hairpinEnd > readlength))
+ stop("Invalid hairpin end position!")
+ if (hairpinEnd <= hairpinStart)
+ stop("Hairpin end position should be greater than hairpin start position. \n")
+
+ # check that barcodes and hairpins provided have no duplicates, are in specified length.
+ barcodelength <- barcodeEnd - barcodeStart + 1;
+ barcodes <- read.table(barcodefile, header=TRUE, sep="\t");
+ numbc <- nrow(barcodes)
+ barcodeseqs <- as.character(barcodes$Sequences) #[,2])
+ barcodenames <- as.character(barcodes$ID) #[,1])
+ if ((min(nchar(barcodeseqs)) != barcodelength) || (max(nchar(barcodeseqs)) != barcodelength))
+ stop(paste("Barcode sequence length is set to ", barcodelength, ", there are barcode sequence not in specified length.\n", sep=""))
+ if (length(unique(barcodeseqs)) != numbc)
+ stop("There are duplicate barcode sequences.\n")
+ if (length(unique(barcodenames)) != numbc)
+ stop("There are duplicate barcode names.\n")
+ tempbarcodefile <- paste("Barcode", as.character(Sys.getpid()), "temp.txt", sep = "_")
+ # passing only barcode sequences to C function
+ write.table(barcodeseqs, file=tempbarcodefile, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE);
+
+ hairpinlength <- hairpinEnd - hairpinStart + 1;
+ hairpins <- read.table(hairpinfile, header=TRUE, sep="\t");
+ numhp <- nrow(hairpins)
+ hairpinseqs <- as.character(hairpins$Sequences) #[,2])
+ hairpinnames <- as.character(hairpins$ID) #[,1])
+ if ((min(nchar(hairpinseqs)) != hairpinlength) || (max(nchar(hairpinseqs)) != hairpinlength))
+ stop(paste("Hairpin sequence length is set to ", hairpinlength, ", there are hairpin sequences not in specified length.\n", sep=""))
+ if (length(unique(hairpinseqs)) != numhp)
+ stop("There are duplicate hairpin sequences.\n")
+ if (length(unique(hairpinnames)) != numhp)
+ stop("There are duplicate hairpin names.\n")
+
+ # passing only hairpin sequences to C function
+ temphairpinfile <- paste("Hairpin", as.character(Sys.getpid()), "temp.txt", sep = "_")
+ write.table(hairpinseqs, file=temphairpinfile, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE);
+
+ if (allowShifting) {
+ if ((shiftingBase <= 0) || (shiftingBase > 5))
+ stop("To allow hairpin matching at a shifted position, please input a positive shiftingBase no greater than 5. ")
+ }
+
+ if (allowMismatch) {
+ if ((barcodeMismatchBase <= 0) || (barcodeMismatchBase > 2))
+ stop("To allow mismatch in barcode sequence, please input a positive barcodeMismatchBase no greater than than 2. ")
+ if ((hairpinMismatchBase <= 0) || (hairpinMismatchBase > 4))
+ stop("To allow mismatch in hairpin sequence, please input a positive hairpinMismatchBase no greater than than 4. ")
+ }
+
+ if (allowShiftedMismatch) {
+ if ((!allowShifting) || (!allowMismatch)){
+ stop("allowShiftedMismatch option can only be turned on when allowShiting and allowMismatch are both TRUE. ")
+ }
+ }
+ tempoutfile <- paste("ReadcountSummary", as.character(Sys.getpid()), "output.txt", sep = "_")
+
+ .C("processHairpinReads", readfile, as.integer(length(readfile)), as.character(tempbarcodefile), as.character(temphairpinfile),
+ as.integer(barcodeStart), as.integer(barcodeEnd), as.integer(hairpinStart), as.integer(hairpinEnd),
+ as.integer(allowShifting), as.integer(shiftingBase),
+ as.integer(allowMismatch), as.integer(barcodeMismatchBase), as.integer(hairpinMismatchBase),
+ as.integer(allowShiftedMismatch),
+ as.character(tempoutfile), as.integer(verbose))
+
+ summary <- read.table(tempoutfile, sep="\t", header=FALSE)
+ file.remove(tempoutfile)
+ file.remove(tempbarcodefile)
+ file.remove(temphairpinfile)
+ close(reads)
+ if (nrow(summary) != length(hairpinnames))
+ stop("Number of hairpins from result count matrix doesn't agree with given hairpin list. ")
+ if (ncol(summary) != length(barcodenames))
+ stop("Number of barcodes from result count matrix doesn't agree with given barcode list. ")
+ colnames(summary) = barcodenames
+ rownames(summary) = hairpinnames
+ x <- DGEList(counts = summary, genes = hairpins)
+ if(!is.null(barcodes$group)) {
+ x$samples = cbind("ID"=barcodes$ID, "lib.size"=x$samples$lib.size,
+ "norm.factors"=x$samples$norm.factors,
+ barcodes[,-match(c("ID","Sequences"), colnames(barcodes))])
+ } else {
+ x$samples = cbind("ID"=barcodes$ID, x$samples, barcodes[,-match(c("ID","Sequences"), colnames(barcodes))])
+ }
+ x
+}
diff --git a/R/residDF.R b/R/residDF.R
new file mode 100644
index 0000000..8e7c062
--- /dev/null
+++ b/R/residDF.R
@@ -0,0 +1,29 @@
+.residDF <- function(zero, design)
+# 6 Jan 2014
+{
+ nlib <- ncol(zero)
+ ncoef <- ncol(design)
+ nzero <- rowSums(zero)
+
+# Default is no zero
+ DF <- rep(nlib-ncoef,length(nzero))
+
+# All zero case
+ DF[nzero==nlib] <- 0
+
+# Anything in between?
+ somezero <- nzero>0 & nzero<nlib
+ if(any(somezero)) {
+ zero2 <- zero[somezero,,drop=FALSE]
+ key <- rowSums( 2L^(col(zero2)-1L) * zero2 )
+ DF2 <- nlib-nzero[somezero]
+ for (u in unique(key)) {
+ i <- which(key==u)
+ zeroi <- zero2[i[1],]
+ DF2[i] <- DF2[i]-qr(design[!zeroi,,drop=FALSE])$rank
+ }
+ DF2 <- pmax(DF2,0)
+ DF[somezero] <- DF2
+ }
+ DF
+}
diff --git a/R/roast.DGEList.R b/R/roast.DGEList.R
index 3d9e08b..c7e2828 100644
--- a/R/roast.DGEList.R
+++ b/R/roast.DGEList.R
@@ -1,20 +1,20 @@
-roast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), set.statistic="mean", gene.weights=NULL, array.weights=NULL, weights=NULL, block=NULL, correlation, var.prior=NULL, df.prior=NULL, trend.var=FALSE, nrot=999)
+roast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), ...)
# Rotation gene set testing for RNA-Seq data
# Yunshun Chen, Gordon Smyth
-# Created 19 Dec 2012. Last revised on 4 Feb 2013
+# Created 19 Dec 2012. Last revised on 28 Feb 2014
{
-# Check design matrix
+# Check dispersion estimates in y
+ dispersion <- getDispersion(y)
+ if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
+
+# Make default design matrix from group factor
if(is.null(design)) {
- if(nlevels(y$samples$group)<2) stop("Need at least two groups, or at least two columns for design matrix")
+ if(nlevels(y$samples$group)<2) stop("design not supplied and samples all belong to the same group")
design <- model.matrix(~y$samples$group)
rownames(design) <- colnames(y)
}
nbeta <- ncol(design)
- if(nbeta < 2) stop("Need at least two columns for design")
-
-# Check dispersion estimates
- dispersion <- getDispersion(y)
- if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
+ if(nbeta < 2) stop("design matrix must have at least two columns")
# Check contrast
if(length(contrast) == 1) {
@@ -25,37 +25,38 @@ roast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), set
if(length(contrast) != nbeta) stop("length of contrast must match column dimension of design")
if(all(contrast==0)) stop("contrast all zero")
-# Null design matrix
+# Construct null hypothesis design matrix
QR <- qr(contrast)
design0 <- t(qr.qty(QR, t(design))[-1, , drop=FALSE])
-# Null fit
+# Null hypothesis fit
fit.null <- glmFit(y, design0, prior.count=0)
- z <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
+# Quantile residuals from null fit
+ y <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
- roast(y=z, index=index, design=design, contrast=contrast, set.statistic=set.statistic, gene.weights=gene.weights, array.weights=array.weights, weights=weights, block=block, correlation=correlation, var.prior=var.prior, df.prior=df.prior, trend.var=trend.var, nrot=nrot)
+ NextMethod("roast")
}
-mroast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), set.statistic="mean", gene.weights=NULL, array.weights=NULL, weights=NULL, block=NULL, correlation, var.prior=NULL, df.prior=NULL, trend.var=FALSE, nrot=999, adjust.method="BH", midp=TRUE, sort="directional")
+mroast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), ...)
# Rotation gene set testing for RNA-Seq data with multiple sets
# Yunshun Chen, Gordon Smyth
-# Created 8 Jan 2013
+# Created 8 Jan 2013. Last revised 28 Feb 2014.
{
-# Check design matrix
+# Check dispersion estimates in y
+ dispersion <- getDispersion(y)
+ if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
+
+# Make default design matrix from group factor
if(is.null(design)) {
- if(nlevels(y$samples$group)<2) stop("Need at least two groups, or at least two columns for design matrix")
+ if(nlevels(y$samples$group)<2) stop("design not supplied and samples all belong to the same group")
design <- model.matrix(~y$samples$group)
rownames(design) <- colnames(y)
}
nbeta <- ncol(design)
- if(nbeta < 2) stop("Need at least two columns for design")
-
-# Check dispersion estimates
- dispersion <- getDispersion(y)
- if(is.null(dispersion)) stop("Dispersion estimate not found. Please estimate the dispersion(s) before you proceed.")
+ if(nbeta < 2) stop("design matrix must have at least two columns")
# Check contrast
if(length(contrast) == 1) {
@@ -66,15 +67,16 @@ mroast.DGEList <- function(y, index=NULL, design=NULL, contrast=ncol(design), se
if(length(contrast) != nbeta) stop("length of contrast must match column dimension of design")
if(all(contrast==0)) stop("contrast all zero")
-# Null design matrix
+# Construct null hypothesis design matrix
QR <- qr(contrast)
design0 <- t(qr.qty(QR, t(design))[-1, , drop=FALSE])
-# Null fit
+# Null hypothesis fit
fit.null <- glmFit(y, design0, prior.count=0)
- z <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
-
- mroast(y=z, index=index, design=design, contrast=contrast, set.statistic=set.statistic, gene.weights=gene.weights, array.weights=array.weights, weights=weights, block=block, correlation=correlation, var.prior=var.prior, df.prior=df.prior, trend.var=trend.var, nrot=nrot, adjust.method=adjust.method, midp=midp, sort=sort)
+# Quantile residuals from null fit
+ y <- zscoreNBinom(y$counts, mu=fit.null$fitted.values, size=1/dispersion)
+
+ NextMethod("mroast")
}
diff --git a/R/rpkm.R b/R/rpkm.R
new file mode 100644
index 0000000..2eb3081
--- /dev/null
+++ b/R/rpkm.R
@@ -0,0 +1,45 @@
+rpkm <- function(x, ...)
+UseMethod("rpkm")
+
+rpkm.DGEList <- function(x, gene.length=NULL, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
+# Counts per million for a DGEList
+# Gordon Smyth.
+# Created 18 March 2013. Last modified 1 November 2012
+{
+# Try to find gene lengths
+# If column name containing gene lengths isn't specified,
+# then will try "Length" or "length" or any column name containing "length"
+ if(is.character(gene.length)) {
+ gene.length <- x$genes[[gene.length[1]]]
+ if(is.null(gene.length)) stop("gene.length column not found")
+ } else {
+ if(is.null(gene.length)) gene.length <- x$genes$Length
+ if(is.null(gene.length)) gene.length <- x$genes$length
+ if(is.null(gene.length)) {
+ j <- grep("length",tolower(names(x$genes)))
+ if(length(j)==1)
+ gene.length <- x$genes[,j]
+ else
+ stop("Gene lengths not found")
+ }
+ }
+
+ lib.size <- x$samples$lib.size
+ if(normalized.lib.sizes) lib.size <- lib.size*x$samples$norm.factors
+
+ rpkm.default(x=x$counts,gene.length=gene.length,lib.size=lib.size,log=log,prior.count=prior.count,...)
+}
+
+rpkm.default <- function(x, gene.length, lib.size=NULL, log=FALSE, prior.count=0.25, ...)
+# Reads per kilobase of gene length per million reads of sequencing
+# Gordon Smyth
+# Created 1 November 2012. Last modified 18 March 2014.
+{
+ y <- cpm.default(x=x,lib.size=lib.size,log=log,prior.count=prior.count)
+ gene.length.kb <- gene.length/1000
+ if(log)
+ y-log2(gene.length.kb)
+ else
+ y/gene.length.kb
+}
+
diff --git a/R/subsetting.R b/R/subsetting.R
index 298a020..1c7b720 100644
--- a/R/subsetting.R
+++ b/R/subsetting.R
@@ -1,95 +1,46 @@
# SUBSET DATA SETS
assign("[.DGEList",
-function(object, i, j, ...) {
+function(object, i, j, keep.lib.sizes=TRUE)
# Subsetting for DGEList objects
-# Davis McCarthy, Gordon Smyth
-# 24 September 2009. Last modified 7 May 2012.
+# 24 September 2009. Last modified 17 March 2014.
+{
+# Recognized components
+ IJ <- c("counts","pseudo.counts","offset","weights")
+ IX <- c("genes")
+ JX <- c("samples")
+ I <- c("AveLogCPM","trended.dispersion","tagwise.dispersion","prior.n","prior.df")
+# Obsolete <- c("conc","infos","all.zeros")
- if(nargs() != 3) stop("Two subscripts required",call.=FALSE)
- if(missing(i))
- if(missing(j))
- return(object)
- else {
- object$counts <- object$counts[,j,drop=FALSE]
- object$samples <- droplevels(object$samples[j,,drop=FALSE])
- object$pseudo.counts <- object$pseudo.counts[,j,drop=FALSE]
- object$offset <- object$offset[,j,drop=FALSE]
- }
- else {
- if(is.character(i)) {
- i <- match(i,rownames(object$counts))
- i <- i[!is.na(i)]
- }
- if(missing(j)) {
- object$counts <- object$counts[i,,drop=FALSE]
- object$conc$conc.common <- object$conc$conc.common[i,drop=FALSE]
- object$conc$conc.group <- object$conc$conc.group[i,,drop=FALSE]
- object$abundance <- object$abundance[i,drop=FALSE]
- object$trended.dispersion <- object$trended.dispersion[i,drop=FALSE]
- object$tagwise.dispersion <- object$tagwise.dispersion[i,drop=FALSE]
- object$infos <- object$infos[i,drop=FALSE]
- object$pseudo.counts <- object$pseudo.counts[i,,drop=FALSE]
- object$genes <- object$genes[i,,drop=FALSE]
- object$all.zeros <- object$all.zeros[i,drop=FALSE]
- object$offset <- object$offset[i,,drop=FALSE]
- !is.null(object$AveLogCPM)
- object$AveLogCPM <- object$AveLogCPM[i,drop=FALSE]
- } else {
- object$counts <- object$counts[i,j,drop=FALSE]
- object$samples <- droplevels(object$samples[j,,drop=FALSE])
- object$pseudo.counts <- object$pseudo.counts[i,j,drop=FALSE]
- object$conc$conc.common <- object$conc$conc.common[i,drop=FALSE]
- object$conc$conc.group <- object$conc$conc.group[i,,drop=FALSE]
- object$trended.dispersion <- object$trended.dispersion[i,drop=FALSE]
- object$tagwise.dispersion <- object$tagwise.dispersion[i,drop=FALSE]
- object$infos <- object$infos[i,drop=FALSE]
- object$genes <- object$genes[i,,drop=FALSE]
- object$all.zeros <- object$all.zeros[i,drop=FALSE]
- object$offset <- object$offset[i,,drop=FALSE]
- !is.null(object$AveLogCPM)
- object$AveLogCPM <- object$AveLogCPM[i,drop=FALSE]
- }
- }
- object
+ out <- subsetListOfArrays(object,i,j,IJ=IJ,IX=IX,I=I,JX=JX)
+ if(!(missing(i) || keep.lib.sizes)) out$samples$lib.size <- colSums(out$counts)
+ out
})
-
assign("[.DGEGLM",
-function(object, i, j, ...)
+function(object, i, j)
# Subsetting for DGEGLM objects
-# Davis McCarthy, Gordon Smyth
-# 11 May 2011. Last modified 8 April 2013.
+# 11 May 2011. Last modified 11 Dec 2013.
{
- if(nargs() != 3) stop("Two subscripts required",call.=FALSE)
- if(!missing(j))
- stop("Subsetting columns not allowed for DGEGLM object. Try subsetting elements of DGEGLM object instead.",call.=FALSE)
- if(!missing(i)) {
- object$coefficients <- object$coefficients[i,,drop=FALSE]
- object$df.residual <- object$df.residual[i,drop=FALSE]
- object$deviance <- object$deviance[i,drop=FALSE]
- object$offset <- object$offset[i,,drop=FALSE]
- object$genes <- object$genes[i,,drop=FALSE]
- object$trended.dispersion <- object$trended.dispersion[i,drop=FALSE]
- object$tagwise.dispersion <- object$tagwise.dispersion[i,drop=FALSE]
- if(length(object$dispersion)>1) object$dispersion <- object$dispersion[i,drop=FALSE]
- object$weights <- object$weights[i,,drop=FALSE]
- object$fitted.values <- object$fitted.values[i,,drop=FALSE]
- object$abundance <- object$abundance[i,drop=FALSE]
- }
- object
+ if(!missing(j)) stop("Subsetting columns not allowed for DGEGLM object.",call.=FALSE)
+
+# Recognized components
+ IJ <- character(0)
+ IX <- c("counts","offset","weights","genes","coefficients","fitted.values")
+ I <- c("AveLogCPM","trended.dispersion","tagwise.dispersion","prior.n","prior.df","dispersion","df.residual","deviance","iter","failed")
+ JX <- c("samples","design")
+
+ subsetListOfArrays(object,i,j,IJ=IJ,IX=IX,I=I,JX=JX)
})
assign("[.DGEExact",
-function(object, i, j, ...)
+function(object, i, j)
# Subsetting for DGEExact objects
# Davis McCarthy, Gordon Smyth
-# 6 October 2010. Last modified 8 April 2013.
+# 6 October 2010. Last modified 11 Dec 2013.
{
- if(nargs() != 3) stop("Two subscripts required",call.=FALSE)
- if(!missing(j))
- stop("Subsetting columns not allowed for DGEExact object. Try subsetting object$table instead.",call.=FALSE)
+ if(!missing(j)) stop("Subsetting columns not allowed for DGEExact objects.",call.=FALSE)
if(!missing(i)) {
object$table <- object$table[i,,drop=FALSE]
object$genes <- object$genes[i,,drop=FALSE]
@@ -99,50 +50,27 @@ function(object, i, j, ...)
assign("[.DGELRT",
-function(object, i, j, ...)
+function(object, i, j)
# Subsetting for DGELRT objects
-# Davis McCarthy, Gordon Smyth
-# 6 April 2011. Last modified 8 April 2013.
+# 6 April 2011. Last modified 11 Dec 2013.
{
- if(nargs() != 3) stop("Two subscripts required",call.=FALSE)
- if(!missing(j))
- stop("Subsetting columns not allowed for DGELRT object. Try subsetting object$table instead.",call.=FALSE)
- if(!missing(i)) {
- object$table <- object$table[i,,drop=FALSE]
- object$genes <- object$genes[i,,drop=FALSE]
- object$abundance <- object$abundance[i,drop=FALSE]
- object$trended.dispersion <- object$trended.dispersion[i,drop=FALSE]
- object$tagwise.dispersion <- object$tagwise.dispersion[i,drop=FALSE]
- object$dispersion <- object$dispersion[i,drop=FALSE]
- object$coefficients <- object$coefficients[i,,drop=FALSE]
- }
- object
+ if(!missing(j)) stop("Subsetting columns not allowed for DGELRT object.",call.=FALSE)
+
+# Recognized components
+ IJ <- character(0)
+ IX <- c("counts","offset","weights","genes","coefficients","fitted.values","table")
+ I <- c("AveLogCPM","trended.dispersion","tagwise.dispersion","prior.n","prior.df","dispersion","df.residual","deviance","iter","failed","df.test")
+ JX <- character(0)
+
+ subsetListOfArrays(object,i,j,IJ=IJ,IX=IX,I=I,JX=JX)
})
assign("[.TopTags",
-function(object, i, j, ...)
+function(object, i, j)
# Subsetting for TopTags objects
-# Gordon Smyth
-# 7 October 2009. Last modified 8 April 2013.
+# 7 October 2009. Last modified 11 Dec 2013.
{
- if(nargs() != 3) stop("Two subscripts required",call.=FALSE)
- if(missing(i))
- if(missing(j))
- return(object)
- else {
- object$table <- object$table[,j,drop=FALSE]
- }
- else {
- if(is.character(i)) {
- i <- match(i,rownames(object$counts))
- i <- i[!is.na(i)]
- }
- if(missing(j)) {
- object$table <- object$table[i,,drop=FALSE]
- } else {
- object$table <- object$table[i,j,drop=FALSE]
- }
- }
+ if(!missing(i) || !missing(j)) object$table <- object$table[i,j,drop=FALSE]
object
})
diff --git a/R/sumTechReps.R b/R/sumTechReps.R
new file mode 100644
index 0000000..1089a2f
--- /dev/null
+++ b/R/sumTechReps.R
@@ -0,0 +1,44 @@
+# sumTechReps.R
+
+sumTechReps <- function(x,ID=colnames(x),...) UseMethod("sumTechReps")
+# 24 Sept 2010
+
+sumTechReps.default <- function(x,ID=colnames(x),...)
+# Sum over replicate columns, for matrices
+# Yifang Hu and Gordon Smyth
+# Created 14 March 2014
+{
+ if(is.null(x)) return(NULL)
+ x <- as.matrix(x)
+ if(is.null(ID)) stop("No sample IDs")
+ t(rowsum(t(x),group=ID,reorder=FALSE,na.rm=FALSE))
+}
+
+sumTechReps.DGEList <- function(x,ID=colnames(x),...)
+# Sum over replicate columns, for matrices
+# Yifang Hu and Gordon Smyth
+# Created 14 March 2014. Last modified 17 March 2014.
+{
+ d <- duplicated(ID)
+ if(!any(d)) return(x)
+
+ x$common.dispersion <- x$trended.dispersion <- x$tagwise.dispersion <- NULL
+ x$weights <- NULL
+
+ y <- x[,!d]
+
+# Sum counts
+ y$counts <- sumTechReps.default(x$counts,ID=ID,...)
+
+# Sum library sizes
+ y$samples$lib.size <- drop(rowsum(x$samples$lib.size,group=ID,reorder=FALSE,na.rm=FALSE))
+
+# Average normalization factors
+ y$samples$norm.factors <- drop(rowsum(x$samples$norm.factors,group=ID,reorder=FALSE,na.rm=FALSE))
+ n <- rep(1L,nrow(x$samples))
+ n <- drop(rowsum(n,group=ID,reorder=FALSE,na.rm=FALSE))
+ y$samples$norm.factors <- y$samples$norm.factors/n
+
+ rownames(y$samples) <- colnames(y$counts)
+ y
+}
diff --git a/R/validDGEList.R b/R/validDGEList.R
new file mode 100644
index 0000000..c9d23b6
--- /dev/null
+++ b/R/validDGEList.R
@@ -0,0 +1,13 @@
+validDGEList <- function(y)
+# Check for standard components of DGEList object
+# Gordon Smyth
+# 20 Nov 2013
+{
+ if(is.null(y$counts)) stop("No count matrix")
+ y$counts <- as.matrix(y$counts)
+ nlib <- ncol(y$counts)
+ if(is.null(y$samples$group)) y$samples$group <- gl(1,nlib)
+ if(is.null(y$samples$lib.size)) y$samples$lib.size <- colSums(y$counts)
+ if(is.null(y$samples$norm.factors)) y$samples$norm.factors <- rep.int(1,nlib)
+ y
+}
diff --git a/build/vignette.rds b/build/vignette.rds
index 139a0ca..5c206c1 100644
Binary files a/build/vignette.rds and b/build/vignette.rds differ
diff --git a/inst/NEWS.Rd b/inst/NEWS.Rd
index 4228c74..d6e6913 100644
--- a/inst/NEWS.Rd
+++ b/inst/NEWS.Rd
@@ -2,52 +2,106 @@
\title{edgeR News}
\encoding{UTF-8}
-\section{Version 3.3.8}{\itemize{
+\section{Version 3.6.0}{\itemize{
+
\item
-predFC() with design=NULL now uses normalization factors correctly.
-However this use of predFC() to compute counts per million is being phased out in favour of cpm().
-}}
+Improved treatment of fractional counts.
+Previously the classic edgeR pipeline permitted fractional counts but the glm pipeline did not.
+edgeR now permits fractional counts throughout.
-\section{Version 3.3.5}{\itemize{
\item
-Refinement to cutWithMinN() to make the bin numbers more equal in the worst case.
+All glm-based functions in edgeR now accept quantitative observation-level weights.
+The glm fitting function mglmLS() and mglmSimple() are retired, and all glm fitting is now done by either mglmLevenberg() or mglmOneWay().
\item
-estimateDisp() now creates the design matrix correctly when the design matrix is not given as an argument and there is only one group. Previously this case gave an error.
+New capabilities for robust estimation allowing for observation-level outliers.
+In particular, the new function estimateGLMRobustDisp() computes a robust dispersion estimate for each gene.
+
+\item
+More careful calculation of residual df in the presence of exactly zero fitted values for glmQLFTest() and estimateDisp().
+The new code allows for deflation of residual df for more complex experimental designs.
+
+\item
+New function processHairpinReads() for analyzing data from shRNA-seq screens.
+
+\item
+New function sumTechReps() to collapse counts over technical replicate libraries.
+
+\item
+New functions nbinomDeviance() and nbinomUnitDeviance.
+Old function deviances.function() removed.
+
+\item
+New function validDGEList().
+
+\item
+rpkm() is now a generic function, and it now tries to find the gene lengths automatically if available from the annotation information in a DGEList object.
+
+\item
+Subsetting a DGEList object now has the option of resetting to the library sizes to the new column sums.
+Internally, the subsetting code for DGEList, DGEExact, DGEGLM, DGELRT and TopTags data objects has been simplified using the new utility function subsetListOfArrays in the limma package.
+
+\item
+To strengthen the interface and to strengthen the object-orientated nature of the functions, the DGEList methods for estimateDisp(), estimateGLMCommonDisp(), estimateGLMTrendedDisp() and estimateGLMTagwiseDisp no longer accept offset, weights or AveLogCPM as arguments.
+These quantities are now always taken from the DGEList object.
+
+\item
+The User's Guide has new sections on read alignment, producing a table of counts, and on how to translate scientific questions into contrasts when using a glm.
+
+\item
+camera.DGEList(), roast.DGEList() and mroast.DGEList() now include ... argument.
\item
-Minor edit to glm.h code.
+The main computation of exactTestByDeviance() now implemented in C++ code.
+
+\item
+The big.count argument has been removed from functions exactTestByDeviance() and exactTestBySmallP().
+
+\item
+New default value for offset in dispCoxReid.
+
+\item
+More tolerant error checking for dispersion value when computing aveLogCPM().
+
+\item
+aveLogCPM() now returns a value even when all the counts are zero.
+
+\item
+The functions is.fullrank and nonEstimable are now imported from limma.
}}
-\section{Version 3.3.4}{\itemize{
+
+\section{Version 3.4.0}{\itemize{
+
+\item
+estimateDisp() now creates the design matrix correctly when the design matrix is not given as an argument and there is only one group. Previously this case gave an error.
+
\item
plotMDS.DGEList now gives a friendly error message when there are fewer than 3 data columns.
-}}
-\section{Version 3.3.3}{\itemize{
\item
-DGEList() accepts NULL as a possible value again for the group, lib.size and norm.factors arguments.
-It is treated the same way as a missing argument.
-}}
+Updates to DGEList() so that arguments lib.size, group and norm.factors are now set to their defaults in the function definition rather than set to NULL.
+However NULL is still accepted as a possible value for these arguments in the function call, in which case the default value is used as if the argument was missing.
-\section{Version 3.3.2}{\itemize{
\item
-Update to cutWithMinN() so that it does not fail even when there are many repeated x values.
+Refinement to cutWithMinN() to make the bin numbers more equal in the worst case.
+Also a bug fix so that cutWithMinN() does not fail even when there are many repeated x values.
\item
Refinement to computation for nbins in dispBinTrend. Now changes more smoothly with the number of genes. trace argument is retired.
\item
+Updates to help pages for the data classes.
+
+\item
Fixes to calcNormFactors with method="TMM" so that it takes account of lib.size and refCol if these are preset.
\item
-Updates to help pages for the data classes.
-}}
+Bug fix to glmQLFTest when plot=TRUE but abundance.trend=FALSE.
-\section{Version 3.3.1}{\itemize{
\item
-Updates to DGEList() and DGEList-class documentation.
-Arguments lib.size, group and norm.factors are now set to their defaults in the function definition rather than set to NULL.
+predFC() with design=NULL now uses normalization factors correctly.
+However this use of predFC() to compute counts per million is being phased out in favour of cpm().
}}
\section{Version 3.2.0}{\itemize{
@@ -271,7 +325,6 @@ Function pooledVar() removed as no longer necessary.
\item
Minor fixes to various functions to ensure correct results in special cases.
-
}}
diff --git a/inst/doc/edgeR.Rnw b/inst/doc/edgeR.Rnw
index f3e2b75..34e205e 100755
--- a/inst/doc/edgeR.Rnw
+++ b/inst/doc/edgeR.Rnw
@@ -13,10 +13,11 @@
\begin{document}
-\title{edgeR: differential expression analysis \\ of digital gene expression data}
+\title{edgeR Package Introduction}
\author{Mark Robinson, Davis McCarthy, Yunshun Chen,\\
Aaron Lun, Gordon K.\ Smyth}
-\date{18 October 2012}
+\date{10 October 2012\\
+Revised 10 November 2013}
\maketitle
diff --git a/inst/doc/edgeR.pdf b/inst/doc/edgeR.pdf
index 93aabfb..4c96e55 100644
Binary files a/inst/doc/edgeR.pdf and b/inst/doc/edgeR.pdf differ
diff --git a/inst/doc/index.html b/inst/doc/index.html
index d3205f0..719cd3a 100644
--- a/inst/doc/index.html
+++ b/inst/doc/index.html
@@ -14,7 +14,7 @@
<ul>
<li><a href="edgeRUsersGuide.pdf">edgeR User's Guide (pdf)</a>. This is the main documentation for the package.</li>
-<li><a href="edgeR.pdf">edgeR Vignette (pdf)</a>. A brief Sweave vignette referring to the User's Guide.</li>
+<li><a href="edgeR.pdf">edgeR Introduction (pdf)</a>. One page introduction.</li>
</ul>
<hr>
diff --git a/man/DGELRT-class.Rd b/man/DGELRT-class.Rd
index 792eabe..42a6bfe 100644
--- a/man/DGELRT-class.Rd
+++ b/man/DGELRT-class.Rd
@@ -29,7 +29,7 @@ Objects of this class contain the following list components:
\code{design } \tab design matrix for the full model from the likelihood
ratio test.\cr
- \code{... } \tab if the argument \code{y} to \code{glmLRT} (which
+ \code{\dots } \tab if the argument \code{y} to \code{glmLRT} (which
produces the \code{DGELRT} object) was itself a \code{DGEList} object, then
the \code{DGELRT} will contain all of the elements of \code{y},
except for the table of counts and the table of pseudocounts.\cr
diff --git a/man/adjustedProfileLik.Rd b/man/adjustedProfileLik.Rd
index f079657..6f81865 100644
--- a/man/adjustedProfileLik.Rd
+++ b/man/adjustedProfileLik.Rd
@@ -8,7 +8,7 @@ Compute adjusted profile-likelihoods for estimating the dispersion parameters of
}
\usage{
-adjustedProfileLik(dispersion, y, design, offset, adjust=TRUE)
+adjustedProfileLik(dispersion, y, design, offset, weights=NULL, adjust=TRUE)
}
\arguments{
@@ -16,6 +16,7 @@ adjustedProfileLik(dispersion, y, design, offset, adjust=TRUE)
\item{y}{numeric matrix of counts.}
\item{design}{numeric matrix giving the design matrix.}
\item{offset}{numeric matrix of same size as \code{y} giving offsets for the log-linear models. Can be a scalor or a vector of length \code{ncol(y)}, in which case it is expanded out to a matrix.}
+\item{weights}{optional numeric matrix giving observation weights}
\item{adjust}{logical, if \code{TRUE} then Cox-Reid adjustment is made to the log-likelihood, if \code{FALSE} then the log-likelihood is returned without adjustment.}
}
diff --git a/man/asdataframe.Rd b/man/asdataframe.Rd
index 5b3bd09..19f83db 100644
--- a/man/asdataframe.Rd
+++ b/man/asdataframe.Rd
@@ -5,7 +5,7 @@
Turn a \code{TopTags} object into a \code{data.frame}.
}
\usage{
-\method{as.data.frame}{TopTags}(x, row.names = NULL, optional = FALSE, ...)
+\method{as.data.frame}{TopTags}(x, row.names = NULL, optional = FALSE, \dots)
}
\arguments{
\item{x}{an object of class \code{TopTags}}
diff --git a/man/asmatrix.Rd b/man/asmatrix.Rd
index 21b9b9f..d77baef 100644
--- a/man/asmatrix.Rd
+++ b/man/asmatrix.Rd
@@ -2,7 +2,7 @@
\alias{as.matrix.DGEList}
\title{Turn a DGEList Object into a Matrix}
\description{
-Turn a digital gene expression object into a numeric matrix by extracting the count values.
+Coerce a digital gene expression object into a numeric matrix by extracting the count values.
}
\usage{
\method{as.matrix}{DGEList}(x,\dots)
diff --git a/man/aveLogCPM.Rd b/man/aveLogCPM.Rd
index 0b07533..62371f4 100644
--- a/man/aveLogCPM.Rd
+++ b/man/aveLogCPM.Rd
@@ -11,17 +11,19 @@ Compute average log2 counts-per-million for each row of counts.
}
\usage{
-\method{aveLogCPM}{DGEList}(y, normalized.lib.sizes=TRUE, prior.count=2, dispersion=0.05, \dots)
-\method{aveLogCPM}{default}(y, lib.size=NULL, prior.count=2, dispersion=0.05, \dots)
+\method{aveLogCPM}{DGEList}(y, normalized.lib.sizes=TRUE, prior.count=2, dispersion=NULL, \dots)
+\method{aveLogCPM}{default}(y, lib.size=NULL, offset=NULL, prior.count=2, dispersion=NULL, weights=NULL, \dots)
}
\arguments{
\item{y}{numeric matrix containing counts. Rows for tags and columns for libraries.}
\item{normalized.lib.sizes}{logical, use normalized library sizes?}
-\item{lib.size}{numeric vector of library sizes. Defaults to \code{colSums(y)}.}
\item{prior.count}{average value to be added to each count, to avoid infinite values on the log-scale.}
-\item{dispersion}{numeric scalar or vector of negative-binomial dispersions.}
-\item{\dots}{other arguments are not currently used}
+\item{dispersion}{numeric scalar or vector of negative-binomial dispersions. Defaults to 0.05.}
+\item{lib.size}{numeric vector of library sizes. Defaults to \code{colSums(y)}. Ignored if \code{offset} is not \code{NULL}.}
+\item{offset}{numeric matrix of offsets for the log-linear models.}
+\item{weights}{optional numeric matrix of observation weights.}
+\item{\dots}{other arguments are not currently used.}
}
\details{
diff --git a/man/calcNormFactors.Rd b/man/calcNormFactors.Rd
index 0694cd5..b0cd255 100644
--- a/man/calcNormFactors.Rd
+++ b/man/calcNormFactors.Rd
@@ -54,9 +54,9 @@ Differential expression analysis for sequence count data
Bullard JH, Purdom E, Hansen KD, Dudoit S. (2010)
Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments.
\emph{BMC Bioinformatics} 11, 94.
-A scaling normalization method for differential expression analysis of RNA-seq data.
Robinson MD, Oshlack A (2010).
+A scaling normalization method for differential expression analysis of RNA-seq data.
\emph{Genome Biology} 11, R25.
}
diff --git a/man/camera.DGEList.Rd b/man/camera.DGEList.Rd
index 5cfc8d2..08d7d5f 100644
--- a/man/camera.DGEList.Rd
+++ b/man/camera.DGEList.Rd
@@ -5,34 +5,24 @@
Test whether a set of genes is highly ranked relative to other genes in terms of differential expression, accounting for inter-gene correlation.
}
\usage{
-\method{camera}{DGEList}(y, index, design, contrast=ncol(design), weights=NULL, use.ranks=FALSE, allow.neg.cor=TRUE, trend.var=FALSE, sort=TRUE)
+\method{camera}{DGEList}(y, index, design=NULL, contrast=ncol(design), \dots)
}
\arguments{
- \item{y}{\code{DGEList} object.}
+ \item{y}{a \code{DGEList} object containing dispersion estimates.}
\item{index}{an index vector or a list of index vectors. Can be any vector such that \code{y[indices,]} selects the rows corresponding to the test set.}
- \item{design}{design matrix.}
- \item{contrast}{contrast of the linear model coefficients for which the test is required. Can be an integer specifying a column of \code{design}, or else a numeric vector of same length as the number of columns of \code{design}.}
- \item{weights}{can be a numeric matrix of individual weights, of same size as \code{y}, or a numeric vector of array weights with length equal to \code{ncol(y)}.}
- \item{use.ranks}{do a rank-based test (\code{TRUE}) or a parametric test (\code{FALSE}?}
- \item{allow.neg.cor}{should reduced variance inflation factors be allowed for negative correlations?}
- \item{trend.var}{logical, should an empirical Bayes trend be estimated? See \code{\link{eBayes}} for details.}
- \item{sort}{logical, should the results be sorted by p-value?}
+ \item{design}{the design matrix.}
+ \item{contrast}{the contrast of the linear model coefficients for which the test is required. Can be an integer specifying a column of \code{design}, or else a numeric vector of same length as the number of columns of \code{design}.}
+ \item{\dots}{other arguments are passed to \code{\link{camera.default}}.}
}
\details{
-This function implements a method proposed by Wu and Smyth (2012) for the digital gene expression data, eg. RNA-Seq data.
-\code{camera} performs a \emph{competitive} test in the sense defined by Goeman and Buhlmann (2007).
-It tests whether the genes in the set are highly ranked in terms of differential expression relative to genes not in the set.
-It has similar aims to \code{geneSetTest} but accounts for inter-gene correlation.
-See \code{\link{roast.DGEList}} for an analogous \emph{self-contained} gene set test.
-
-The function can be used for any sequencing experiment which can be represented by a Negative Binomial generalized linear model.
-The design matrix for the experiment is specified as for the \code{\link{glmFit}} function, and the contrast of interest is specified as for the \code{\link{glmLRT}} function.
-This allows users to focus on differential expression for any coefficient or contrast in a model by giving the vector of test statistic values.
-
-\code{camera} estimates p-values after adjusting the variance of test statistics by an estimated variance inflation factor.
-The inflation factor depends on estimated genewise correlation and the number of genes in the gene set.
+The camera gene set test was proposed by Wu and Smyth (2012) for microarray data.
+This function makes the camera test available for digital gene expression data.
+The negative binomial count data is converted to approximate normal deviates by computing mid-p quantile residuals (Dunn and Smyth, 1996; Routledge, 1994) under the null hypothesis that the contrast is zero.
+See \code{\link{camera}} for more description of the test and for a complete list of possible arguments.
+
+The design matrix defaults to the \code{model.matrix(~y$samples$group)}.
}
\value{
@@ -42,13 +32,18 @@ A data.frame. See \code{\link{camera}} for details.
\author{Yunshun Chen, Gordon Smyth}
\references{
+Dunn, K. P., and Smyth, G. K. (1996).
+Randomized quantile residuals.
+\emph{J. Comput. Graph. Statist.}, 5, 236-244.
+\url{http://www.statsci.org/smyth/pubs/residual.html}
+
+Routledge, RD (1994).
+Practicing safe statistics with the mid-p.
+\emph{Canadian Journal of Statistics} 22, 103-110.
+
Wu, D, and Smyth, GK (2012). Camera: a competitive gene set test accounting for inter-gene correlation.
\emph{Nucleic Acids Research} 40, e133.
\url{http://nar.oxfordjournals.org/content/40/17/e133}
-
-Goeman, JJ, and Buhlmann, P (2007).
-Analyzing gene expression data in terms of gene sets: methodological issues.
-\emph{Bioinformatics} 23, 980-987.
}
\seealso{
@@ -80,5 +75,3 @@ camera(y, iset2, design)
camera(y, list(set1=iset1,set2=iset2), design)
}
-\keyword{htest}
-
diff --git a/man/condLogLikDerSize.Rd b/man/condLogLikDerSize.Rd
index ff462c7..0fe8a9f 100755
--- a/man/condLogLikDerSize.Rd
+++ b/man/condLogLikDerSize.Rd
@@ -18,10 +18,10 @@ condLogLikDerDelta(y, delta, der=1L)
\item{der}{integer specifying derivative required, either 0 (the function), 1 (first derivative) or 2 (second derivative)}
}
-\value{vector of function/derivative evaluations, one for each transcript,with respect to }
+\value{vector of row-wise derivatives with respect to \code{r} or \code{delta}}
\details{The library sizes must be equalized before running this function. This function carries out the actual mathematical computations for the conditional log-likelihood and its derivatives, calculating the conditional log-likelihood for each tag/transcript.
-Derivatives are with respect to either the size or the delta parametrization of the dispersion.
+Derivatives are with respect to either the size (\code{r}) or the delta parametrization (\code{delta}) of the dispersion.
}
diff --git a/man/cpm.Rd b/man/cpm.Rd
index 62c312c..80f606d 100644
--- a/man/cpm.Rd
+++ b/man/cpm.Rd
@@ -3,15 +3,18 @@
\alias{cpm.DGEList}
\alias{cpm.default}
\alias{rpkm}
+\alias{rpkm.DGEList}
+\alias{rpkm.default}
\title{Counts per Million or Reads per Kilobase per Million}
\description{Computes counts per million (CPM) or reads per kilobase per million (RPKM) values.}
\usage{
-\method{cpm}{DGEList}(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, ...)
-\method{cpm}{default}(x, lib.size=NULL, log=FALSE, prior.count=0.25, ...)
-rpkm(x, gene.length, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25)
+\method{cpm}{DGEList}(x, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, \dots)
+\method{cpm}{default}(x, lib.size=NULL, log=FALSE, prior.count=0.25, \dots)
+\method{rpkm}{DGEList}(x, gene.length=NULL, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25, \dots)
+\method{rpkm}{default}(x, gene.length, lib.size=NULL, log=FALSE, prior.count=0.25, \dots)
}
\arguments{
\item{x}{matrix of counts or a \code{DGEList} object}
@@ -19,7 +22,7 @@ rpkm(x, gene.length, normalized.lib.sizes=TRUE, log=FALSE, prior.count=0.25)
\item{lib.size}{library size, defaults to \code{colSums(x)}.}
\item{log}{logical, if \code{TRUE} then \code{log2} values are returned.}
\item{prior.count}{average count to be added to each observation to avoid taking log of zero. Used only if \code{log=TRUE}.}
-\item{gene.length}{vector of length \code{nrow(x)} giving gene length in bases.}
+\item{gene.length}{vector of length \code{nrow(x)} giving gene length in bases, or the name of the column \code{x$genes} containing the gene lengths.}
\item{\ldots}{other arguments are not currently used}
}
@@ -30,6 +33,9 @@ CPM or RPKM values are useful descriptive measures for the expression level of a
By default, the normalized library sizes are used in the computation for \code{DGEList} objects but simple column sums for matrices.
If log-values are computed, then a small count, given by \code{prior.count} but scaled to be proportional to the library size, is added to \code{x} to avoid taking the log of zero.
+
+The \code{rpkm} method for \code{DGEList} objects will try to find the gene lengths in a column of \code{x$genes} called \code{Length} or \code{length}.
+Failing that, it will look for any column name containing \code{"length"} in any capitalization.
}
\note{
@@ -41,9 +47,13 @@ If log-values are computed, then a small count, given by \code{prior.count} but
\examples{
y <- matrix(rnbinom(20,size=1,mu=10),5,4)
cpm(y)
+
d <- DGEList(counts=y, lib.size=1001:1004)
cpm(d)
cpm(d,log=TRUE)
+
+d$genes$Length <- c(1000,2000,500,1500,3000)
+rpkm(d)
}
\seealso{
diff --git a/man/dglmStdResid.Rd b/man/dglmStdResid.Rd
index b0d0523..c520f8e 100644
--- a/man/dglmStdResid.Rd
+++ b/man/dglmStdResid.Rd
@@ -9,7 +9,7 @@
\usage{
dglmStdResid(y, design, dispersion=0, offset=0, nbins=100, make.plot=TRUE,
- xlab="Mean", ylab="Ave. binned standardized residual", ...)
+ xlab="Mean", ylab="Ave. binned standardized residual", \dots)
getDispersions(binned.object)
}
diff --git a/man/dispBinTrend.Rd b/man/dispBinTrend.Rd
index 499554f..83fb990 100644
--- a/man/dispBinTrend.Rd
+++ b/man/dispBinTrend.Rd
@@ -9,7 +9,7 @@ Estimate the abundance-dispersion trend by computing the common dispersion for b
\usage{
dispBinTrend(y, design=NULL, offset=NULL, df = 5, span=0.3, min.n=400, method.bin="CoxReid",
- method.trend="spline", AveLogCPM=NULL, \dots)
+ method.trend="spline", AveLogCPM=NULL, weights=NULL, \dots)
}
\arguments{
@@ -22,6 +22,7 @@ dispBinTrend(y, design=NULL, offset=NULL, df = 5, span=0.3, min.n=400, method.bi
\item{method.bin}{method used to estimate the dispersion in each bin. Possible values are \code{"CoxReid"}, \code{"Pearson"} or \code{"deviance"}.}
\item{method.trend}{type of curve to smooth the bins. Possible values are \code{"spline"} for a natural cubic regression spline or \code{"loess"} for a linear lowess curve.}
\item{AveLogCPM}{numeric vector giving average log2 counts per million for each gene}
+\item{weights}{optional numeric matrix giving observation weights}
\item{\dots}{other arguments are passed to \code{estimateGLMCommonDisp}}
}
diff --git a/man/dispCoxReid.Rd b/man/dispCoxReid.Rd
index 374e869..289ead3 100644
--- a/man/dispCoxReid.Rd
+++ b/man/dispCoxReid.Rd
@@ -10,8 +10,8 @@ Estimate a common dispersion parameter across multiple negative binomial general
}
\usage{
-dispCoxReid(y, design=NULL, offset=NULL, interval=c(0,4), tol=1e-5, min.row.sum=5,
- subset=10000, AveLogCPM=NULL)
+dispCoxReid(y, design=NULL, offset=NULL, weights=NULL, AveLogCPM=NULL, interval=c(0,4),
+ tol=1e-5, min.row.sum=5, subset=10000)
dispDeviance(y, design=NULL, offset=NULL, interval=c(0,4), tol=1e-5, min.row.sum=5,
subset=10000, AveLogCPM=NULL, robust=FALSE, trace=FALSE)
dispPearson(y, design=NULL, offset=NULL, min.row.sum=5, subset=10000,
@@ -23,9 +23,13 @@ dispPearson(y, design=NULL, offset=NULL, min.row.sum=5, subset=10000,
\item{design}{numeric design matrix, as for \code{\link{glmFit}}.}
-\item{offset}{numeric vector or matrix of offsets for the log-linear models, as for \code{\link{glmFit}}.}
+\item{offset}{numeric vector or matrix of offsets for the log-linear models, as for \code{\link{glmFit}}. Defaults to \code{log(colSums(y))}.}
-\item{interval}{numeric vector of length 2 giving allowable values for the dispersion, passed to \code{optimize}.}
+\item{weights}{optional numeric matrix giving observation weights}
+
+\item{AveLogCPM}{numeric vector giving average log2 counts per million.}
+
+\item{interval}{numeric vector of length 2 giving minimum and maximum allowable values for the dispersion, passed to \code{optimize}.}
\item{tol}{the desired accuracy, see \code{optimize} or \code{uniroot}.}
@@ -33,8 +37,6 @@ dispPearson(y, design=NULL, offset=NULL, min.row.sum=5, subset=10000,
\item{subset}{integer, number of rows to use in the calculation. Rows used are chosen evenly spaced by AveLogCPM.}
-\item{AveLogCPM}{numeric vector giving average log2 counts per million.}
-
\item{trace}{logical, should iteration information be output?}
\item{robust}{logical, should a robust estimator be used?}
diff --git a/man/dispCoxReidInterpolateTagwise.Rd b/man/dispCoxReidInterpolateTagwise.Rd
index 8d2b435..70aa4f7 100644
--- a/man/dispCoxReidInterpolateTagwise.Rd
+++ b/man/dispCoxReidInterpolateTagwise.Rd
@@ -9,7 +9,7 @@ Estimate tagwise dispersion parameters across multiple negative binomial general
\usage{
dispCoxReidInterpolateTagwise(y, design, offset=NULL, dispersion, trend=TRUE, AveLogCPM=NULL,
- min.row.sum=5, prior.df=10, span=0.3, grid.npts=11, grid.range=c(-6,6))
+ min.row.sum=5, prior.df=10, span=0.3, grid.npts=11, grid.range=c(-6,6), weights=NULL)
}
\arguments{
@@ -34,6 +34,7 @@ dispCoxReidInterpolateTagwise(y, design, offset=NULL, dispersion, trend=TRUE, Av
\item{grid.npts}{numeric scalar, the number of points at which to place knots for the spline-based estimation of the tagwise dispersion estimates.}
\item{grid.range}{numeric vector of length 2, giving relative range, in terms of \code{log2(dispersion)}, on either side of trendline for each tag for spline grid points.}
+\item{weights}{optional numeric matrix giving observation weights}
}
\value{\code{dispCoxReidInterpolateTagwise} produces a vector of tagwise dispersions having the same length as the number of genes in the count data.
diff --git a/man/estimateDisp.Rd b/man/estimateDisp.Rd
index 8996e41..0fd4613 100644
--- a/man/estimateDisp.Rd
+++ b/man/estimateDisp.Rd
@@ -8,7 +8,7 @@ Maximizes the negative binomial likelihood to give the estimate of the common, t
}
\usage{
-estimateDisp(y, design=NULL, offset=NULL, prior.df=NULL, trend.method="locfit", span=NULL, grid.length=21, grid.range=c(-10,10), robust=FALSE, winsor.tail.p=c(0.05,0.1), tol=1e-06)
+estimateDisp(y, design=NULL, prior.df=NULL, trend.method="locfit", span=NULL, grid.length=21, grid.range=c(-10,10), robust=FALSE, winsor.tail.p=c(0.05,0.1), tol=1e-06)
}
\arguments{
@@ -16,8 +16,6 @@ estimateDisp(y, design=NULL, offset=NULL, prior.df=NULL, trend.method="locfit",
\item{design}{numeric design matrix}
-\item{offset}{numeric vector or matrix of offsets for the log-linear models}
-
\item{prior.df}{prior degrees of freedom. It is used in calculating \code{prior.n}.}
\item{trend.method}{method for estimating dispersion trend. Possible values are \code{"none"}, \code{"movingave"}, \code{"loess"} and \code{"locfit"}.}
diff --git a/man/estimateGLMCommonDisp.Rd b/man/estimateGLMCommonDisp.Rd
index 0c11916..ad186d2 100644
--- a/man/estimateGLMCommonDisp.Rd
+++ b/man/estimateGLMCommonDisp.Rd
@@ -10,8 +10,8 @@ Estimates a common negative binomial dispersion parameter for a DGE dataset with
}
\usage{
-\S3method{estimateGLMCommonDisp}{DGEList}(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, ...)
-\S3method{estimateGLMCommonDisp}{default}(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, ...)
+\S3method{estimateGLMCommonDisp}{DGEList}(y, design=NULL, method="CoxReid", subset=10000, verbose=FALSE, \dots)
+\S3method{estimateGLMCommonDisp}{default}(y, design=NULL, offset=NULL, method="CoxReid", subset=10000, AveLogCPM=NULL, verbose=FALSE, weights=NULL,\dots)
}
\arguments{
@@ -29,15 +29,17 @@ Possible values are \code{"CoxReid"}, \code{"Pearson"} or \code{"deviance"}.}
\item{AveLogCPM}{numeric vector giving average log2 counts per million for each gene}
\item{verbose}{logical, if \code{TRUE} estimated dispersion and BCV will be printed to standard output.}
+\item{weights}{optional numeric matrix giving observation weights}
\item{\ldots}{other arguments are passed to lower-level functions.
See \code{\link{dispCoxReid}}, \code{\link{dispPearson}} and \code{\link{dispDeviance}} for details.}
}
\value{
-The default method returns a numeric vector of length 1 containing the estimated dispersion.
+The default method returns a numeric vector of length 1 containing the estimated common dispersion.
The \code{DGEList} method returns the same \code{DGEList} \code{y} as input but with \code{common.dispersion} as an added component.
+The output object will also contain a component \code{AveLogCPM} if it was not already present in \code{y}.
}
\details{
diff --git a/man/estimateGLMRobustDisp.Rd b/man/estimateGLMRobustDisp.Rd
new file mode 100644
index 0000000..5e46d4b
--- /dev/null
+++ b/man/estimateGLMRobustDisp.Rd
@@ -0,0 +1,57 @@
+\name{estimateGLMRobustDisp}
+\alias{estimateGLMRobustDisp}
+
+\title{Empirical Robust Bayes Tagwise Dispersions for Negative Binomial GLMs using Observation Weights}
+
+\description{
+Compute a robust estimate of the negative binomial dispersion parameter for each tag or transcript, with expression levels specified by a log-linear model, using observation weights. These observation weights will be stored and used later for estimating regression parameters.
+}
+
+\usage{
+estimateGLMRobustDisp(y, design = NULL, prior.df = 10, update.trend = TRUE, trend.method = "bin.loess", maxit = 6, k = 1.345, residual.type = "pearson", verbose = FALSE, record = FALSE)
+}
+
+\arguments{
+\item{y}{a \code{DGEList} object.}
+\item{design}{numeric design matrix, as for \code{\link{glmFit}}.}
+\item{prior.df}{prior degrees of freedom.}
+\item{update.trend}{logical. Should the trended dispersion be re-estimated at each iteration?}
+\item{trend.method}{method (low-level function) used to estimated the trended dispersions. \code{\link{estimateGLMTrendedDisp}}}
+\item{maxit}{maximum number of iterations for weighted \code{\link{estimateGLMTagwiseDisp}}.}
+\item{k}{the tuning constant for Huber estimator. If the absolute value of residual (r) is less than k, its observation weight is 1, otherwise \code{k/abs(r)}.}
+\item{residual.type}{type of residual (r) used for estimation observation weight}
+\item{verbose}{logical. Should verbose comments be printed?}
+\item{record}{logical. Should information for each iteration be recorded (and returned as a list)?}
+}
+
+\value{
+\code{estimateGLMRobustDisp} produces a \code{DGEList} object, which contains the (robust) tagwise dispersion parameter estimate for each tag for the negative binomial model that maximizes the weighted Cox-Reid adjusted profile likelihood, as well as the observation weights. The observation weights are calculated using residuals and the Huber function.
+
+Note that when \code{record=TRUE}, a simple list of \code{DGEList} objects is returned, one for each iteration (this is for debugging or tracking purposes).
+}
+
+\details{
+At times, because of the moderation of dispersion estimates towards a trended values, features (typically, genes) can be sensitive to outliers and causing false positives. That is, since the dispersion estimates are moderated downwards toward the trend and because the regression parameter estimates may be affected by the outliers, genes are deemed significantly differential expressed. The function uses an iterative procedure where weights are calculated from residuals and estimates are [...]
+
+Note: it is not necessary to first calculate the common, trended and tagwise dispersion estimates. If these are not available, the function will first calculate this (in an unweighted) fashion.
+}
+
+\references{
+Zhou X, Lindsay H and Robinson MD (2013) Robustly detecting differential expression in RNA sequencing data using observation weights \emph{in preparation}.
+}
+
+\author{Xiaobei Zhou, Mark D. Robinson}
+\examples{
+y <- matrix(rnbinom(100*6,mu=10,size=1/0.1),ncol=6)
+d <- DGEList(counts=y,group=c(1,1,1,2,2,2),lib.size=c(1000:1005))
+design <- model.matrix(~group, data=d$samples) # Define the design matrix for the full model
+d <- estimateGLMRobustDisp(d, design)
+summary(d$tagwise.dispersion)
+}
+
+\seealso{
+This function calls
+\code{\link{estimateGLMTrendedDisp}}
+and
+\code{\link{estimateGLMTagwiseDisp}}.
+}
diff --git a/man/estimateGLMTagwiseDisp.Rd b/man/estimateGLMTagwiseDisp.Rd
index 241c204..5918245 100644
--- a/man/estimateGLMTagwiseDisp.Rd
+++ b/man/estimateGLMTagwiseDisp.Rd
@@ -3,6 +3,7 @@
\alias{estimateGLMTagwiseDisp.DGEList}
\alias{estimateGLMTagwiseDisp.default}
+
\title{Empirical Bayes Tagwise Dispersions for Negative Binomial GLMs}
\description{
@@ -10,10 +11,10 @@ Compute an empirical Bayes estimate of the negative binomial dispersion paramete
}
\usage{
-\S3method{estimateGLMTagwiseDisp}{DGEList}(y, design=NULL, offset=NULL, dispersion=NULL, prior.df=10,
- trend=!is.null(y$trended.dispersion), span=NULL, AveLogCPM=NULL, ...)
+\S3method{estimateGLMTagwiseDisp}{DGEList}(y, design=NULL, prior.df=10,
+ trend=!is.null(y$trended.dispersion), span=NULL, \dots)
\S3method{estimateGLMTagwiseDisp}{default}(y, design=NULL, offset=NULL, dispersion, prior.df=10,
- trend=TRUE, span=NULL, AveLogCPM=NULL, ...)
+ trend=TRUE, span=NULL, AveLogCPM=NULL, weights=NULL, \dots)
}
\arguments{
@@ -25,6 +26,7 @@ Compute an empirical Bayes estimate of the negative binomial dispersion paramete
\item{prior.df}{prior degrees of freedom.}
\item{span}{width of the smoothing window, in terms of proportion of the data set. Default value decreases with the number of tags.}
\item{AveLogCPM}{numeric vector giving average log2 counts per million for each gene}
+\item{weights}{optional numeric matrix giving observation weights}
\item{\ldots}{other arguments are passed to \code{\link{dispCoxReidInterpolateTagwise}}.}
}
@@ -71,5 +73,3 @@ summary(d$tagwise.dispersion)
\code{\link{estimateCommonDisp}} for common dispersion or \code{\link{estimateTagwiseDisp}} for tagwise dispersion in the context of a multiple group experiment (one-way layout).
}
-
-\keyword{algebra}
diff --git a/man/estimateGLMTrendedDisp.Rd b/man/estimateGLMTrendedDisp.Rd
index 7da6ab4..8c951a2 100644
--- a/man/estimateGLMTrendedDisp.Rd
+++ b/man/estimateGLMTrendedDisp.Rd
@@ -10,8 +10,8 @@ Estimates the abundance-dispersion trend by Cox-Reid approximate profile likelih
}
\usage{
-\S3method{estimateGLMTrendedDisp}{DGEList}(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", ...)
-\S3method{estimateGLMTrendedDisp}{default}(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", ...)
+\S3method{estimateGLMTrendedDisp}{DGEList}(y, design=NULL, method="auto", \dots)
+\S3method{estimateGLMTrendedDisp}{default}(y, design=NULL, offset=NULL, AveLogCPM=NULL, method="auto", weights=NULL, \dots)
}
\arguments{
@@ -21,6 +21,7 @@ Estimates the abundance-dispersion trend by Cox-Reid approximate profile likelih
Possible values are \code{"auto"} (default, switch to \code{"bin.spline"} method if the number of tags is great than 200 and \code{"power"} method otherwise),\code{"bin.spline"}, \code{"bin.loess"} (which both result in a call to \code{dispBinTrend}), \code{"power"} (call to \code{dispCoxReidPowerTrend}), or \code{"spline"} (call to \code{dispCoxReidSplineTrend}).}
\item{offset}{numeric scalar, vector or matrix giving the linear model offsets, as for \code{\link{glmFit}}.}
\item{AveLogCPM}{numeric vector giving average log2 counts per million for each gene.}
+\item{weights}{optional numeric matrix giving observation weights}
\item{\ldots}{other arguments are passed to lower-level functions \code{\link{dispBinTrend}}, \code{\link{dispCoxReidPowerTrend}} or \code{\link{dispCoxReidSplineTrend}}.}
}
diff --git a/man/estimateTagwiseDisp.Rd b/man/estimateTagwiseDisp.Rd
index a5ff94b..c6b7957 100644
--- a/man/estimateTagwiseDisp.Rd
+++ b/man/estimateTagwiseDisp.Rd
@@ -68,7 +68,7 @@ assessing differences in tag abundance. \emph{Bioinformatics} 23, 2881-2887.
\author{Mark Robinson, Davis McCarthy, Yunshun Chen and Gordon Smyth}
\examples{
-# See exactTest
+# See ?exactTest or ?estimateTrendedDisp for examples
}
\seealso{
diff --git a/man/estimateTrendedDisp.Rd b/man/estimateTrendedDisp.Rd
index 38889b6..d6e88ed 100644
--- a/man/estimateTrendedDisp.Rd
+++ b/man/estimateTrendedDisp.Rd
@@ -32,11 +32,19 @@ An object of class \code{DGEList} with the same components as for \code{\link{es
\author{Yunshun Chen and Gordon Smyth}
\examples{
-y <- matrix(rnbinom(6000, mu=100, size=10), 1000, 6)
-group <- c(0,0,0,1,1,1)
-d <- DGEList(y, group=group)
-d <- estimateCommonDisp(d)
-d <- estimateTrendedDisp(d)
+ngenes <- 1000
+nlib <- 4
+log2cpm <- seq(from=0,to=16,length=ngenes)
+lib.size <- 1e7
+mu <- 2^log2cpm * lib.size * 1e-6
+dispersion <- 1/sqrt(mu) + 0.1
+counts <- rnbinom(ngenes*nlib, mu=mu, size=1/dispersion)
+counts <- matrix(counts,ngenes,nlib)
+y <- DGEList(counts,lib.size=rep(lib.size,nlib))
+y <- estimateCommonDisp(y)
+y <- estimateTrendedDisp(y)
+y <- estimateTagwiseDisp(y)
+plotBCV(y)
}
\seealso{
diff --git a/man/exactTest.Rd b/man/exactTest.Rd
index 124e625..f579394 100644
--- a/man/exactTest.Rd
+++ b/man/exactTest.Rd
@@ -13,8 +13,8 @@
exactTest(object, pair=1:2, dispersion="auto", rejection.region="doubletail",
big.count=900, prior.count=0.125)
exactTestDoubleTail(y1, y2, dispersion=0, big.count=900)
-exactTestBySmallP(y1, y2, dispersion=0, big.count=900)
-exactTestByDeviance(y1, y2, dispersion=0, big.count=900)
+exactTestBySmallP(y1, y2, dispersion=0)
+exactTestByDeviance(y1, y2, dispersion=0)
exactTestBetaApprox(y1, y2, dispersion=0)
}
diff --git a/man/expandAsMatrix.Rd b/man/expandAsMatrix.Rd
index 7a0bf45..6137f82 100644
--- a/man/expandAsMatrix.Rd
+++ b/man/expandAsMatrix.Rd
@@ -25,7 +25,4 @@ Numeric matrix of dimension \code{dim}.
expandAsMatrix(1:3,c(4,3))
expandAsMatrix(1:4,c(4,3))
}
-\seealso{
-\code{\link{mglmLS}}.
-}
\keyword{hplot}
diff --git a/man/glmfit.Rd b/man/glmfit.Rd
index f283623..4483ad3 100644
--- a/man/glmfit.Rd
+++ b/man/glmfit.Rd
@@ -11,10 +11,12 @@
Conduct genewise statistical tests for a given coefficient or coefficient contrast.}
\usage{
-\method{glmFit}{DGEList}(y, design=NULL, dispersion=NULL, offset=NULL, weights=NULL, lib.size=NULL,
- prior.count=0.125, start=NULL, method="auto", ...)
+\method{glmFit}{DGEList}(y, design=NULL, dispersion=NULL, prior.count=0.125, start=NULL, \dots)
+\method{glmFit}{default}(y, design=NULL, dispersion=NULL, offset=NULL, lib.size=NULL, weights=NULL,
+ prior.count=0.125, start=NULL, \dots)
glmLRT(glmfit, coef=ncol(glmfit$design), contrast=NULL, test="chisq")
-glmQLFTest(y, design=NULL, dispersion=NULL, coef=ncol(glmfit$design), contrast=NULL, abundance.trend=TRUE, robust=FALSE, winsor.tail.p=c(0.05,0.1), plot=FALSE)
+glmQLFTest(y, design=NULL, dispersion=NULL, coef=ncol(glmfit$design), contrast=NULL,
+ abundance.trend=TRUE, robust=FALSE, winsor.tail.p=c(0.05,0.1), plot=FALSE)
}
\arguments{
@@ -36,9 +38,7 @@ Defaults to a single column of ones, equivalent to treating the columns as repli
\item{start}{optional numeric matrix of initial estimates for the linear model coefficients.}
-\item{method}{which fitting algorithm to use. Possible values are \code{"auto"}, \code{"linesearch"}, \code{"levenberg"} or \code{"simple"}.}
-
-\item{...}{other arguments are passed to lower-level functions, for example to \code{mglmLS}.}
+\item{\dots}{other arguments are passed to lower level fitting functions.}
\item{glmfit}{a \code{DGEGLM} object, usually output from \code{glmFit}.}
@@ -85,8 +85,7 @@ The data frame \code{table} contains the following columns:
\code{glmFit} fits genewise negative binomial glms, all with the same design matrix but possibly different dispersions, offsets and weights.
When the design matrix defines a one-way layout, or can be re-parametrized to a one-way layout, the glms are fitting very quickly using \code{\link{mglmOneGroup}}.
-Otherwise the default fitting method, implemented in \code{\link{mglmLS}}, is a parallelized line search algorithm described by McCarthy et al (2012).
-Other possible fitting methods are \code{\link{mglmLevenberg}} and \code{\link{mglmSimple}}.
+Otherwise the default fitting method, implemented in \code{\link{mglmLevenberg}} a Fisher scoring algorithm with Levenberg-style damping.
Positive \code{prior.count} cause the returned coefficients to be shrunk in such a way that fold-changes between the treatment conditions are decreased.
In particular, infinite fold-changes are avoided.
@@ -95,7 +94,7 @@ The returned coefficients are affected but not the likelihood ratio tests or p-v
\code{glmLRT} conducts likelihood ratio tests for one or more coefficients in the linear model.
If \code{coef} is used, the null hypothesis is that all the coefficients indicated by \code{coef} are equal to zero.
-If \code{contrast} is non-null, then the null hypothesis is that the specified contrast of the coefficients is equal to zero.
+If \code{contrast} is non-null, then the null hypothesis is that the specified contrasts of the coefficients are equal to zero.
For example, a contrast of \code{c(0,1,-1)}, assuming there are three coefficients, would test the hypothesis that the second and third coefficients are equal.
\code{glmQLFTest} implements the quasi-likelihood method of Lund et al (2012).
@@ -152,7 +151,7 @@ d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
}
\seealso{
-Low-level computations are done by \code{\link{mglmOneGroup}}, \code{\link{mglmLS}}, \code{\link{mglmLevenberg}} or \code{\link{mglmSimple}}.
+Low-level computations are done by \code{\link{mglmOneGroup}} or \code{\link{mglmLevenberg}}.
\code{\link{topTags}} displays results from \code{glmLRT} or \code{glmQLFTest}.
diff --git a/man/gof.Rd b/man/gof.Rd
index 4d97278..bcd92f1 100644
--- a/man/gof.Rd
+++ b/man/gof.Rd
@@ -6,7 +6,7 @@
\description{Conducts deviance goodness of fit tests for each fit in a \code{DGEGLM} object}
\usage{
-gof(glmfit, pcutoff=0.1, adjust="holm", plot=FALSE, main="qq-plot of genewise goodness of fit", ...)
+gof(glmfit, pcutoff=0.1, adjust="holm", plot=FALSE, main="qq-plot of genewise goodness of fit", \dots)
}
\arguments{
diff --git a/man/maPlot.Rd b/man/maPlot.Rd
index 52777e1..8576edc 100644
--- a/man/maPlot.Rd
+++ b/man/maPlot.Rd
@@ -11,7 +11,7 @@ To represent counts that were low (e.g. zero in 1 library and non-zero in the ot
\usage{
maPlot(x, y, logAbundance=NULL, logFC=NULL, normalize=FALSE, plot.it=TRUE,
smearWidth=1, col=NULL, allCol="black", lowCol="orange", deCol="red",
- de.tags=NULL, smooth.scatter=FALSE, lowess=FALSE, ...)
+ de.tags=NULL, smooth.scatter=FALSE, lowess=FALSE, \dots)
}
%- maybe also 'usage' for other objects documented here.
\arguments{
diff --git a/man/meanvar.Rd b/man/meanvar.Rd
index 9b92e92..83f8459 100644
--- a/man/meanvar.Rd
+++ b/man/meanvar.Rd
@@ -10,7 +10,7 @@
\usage{
plotMeanVar(object, meanvar=NULL, show.raw.vars=FALSE, show.tagwise.vars=FALSE,
show.binned.common.disp.vars=FALSE, show.ave.raw.vars=TRUE, scalar=NULL,
- NBline=FALSE, nbins=100, log.axes="xy", xlab=NULL, ylab=NULL, ...)
+ NBline=FALSE, nbins=100, log.axes="xy", xlab=NULL, ylab=NULL, \dots)
binMeanVar(x, group, nbins=100, common.dispersion=FALSE, object=NULL)
}
diff --git a/man/mglm.Rd b/man/mglm.Rd
index 446fe26..512058e 100644
--- a/man/mglm.Rd
+++ b/man/mglm.Rd
@@ -1,27 +1,21 @@
\name{mglm}
\alias{mglm}
-\alias{mglmSimple}
-\alias{mglmLS}
\alias{mglmOneGroup}
\alias{mglmOneWay}
\alias{mglmLevenberg}
-\alias{deviances.function}
\alias{designAsFactor}
-\title{Fit Negative Binomial Generalized Linear Model to Multiple Response Vectors}
+\title{Fit Negative Binomial Generalized Linear Model to Multiple Response Vectors: Low Level Functions}
\description{
Fit the same log-link negative binomial or Poisson generalized linear model (GLM) to each row of a matrix of counts.
}
\usage{
-mglmLS(y, design, dispersion=0, offset=0, coef.start=NULL, tol=1e-5, maxit=50, trace=FALSE)
-mglmOneGroup(y, dispersion=0, offset=0, maxit=50, tol=1e-10)
-mglmOneWay(y, design=NULL, dispersion=0, offset=0, maxit=50)
-mglmSimple(y, design, dispersion=0, offset=0, weights=NULL)
-mglmLevenberg(y, design, dispersion=0, offset=0, coef.start=NULL, start.method="null",
- tol=1e-06, maxit=200)
-deviances.function(dispersion)
+mglmOneGroup(y, dispersion=0, offset=0, weights=NULL, maxit=50, tol=1e-10, verbose=FALSE)
+mglmOneWay(y, design=NULL, dispersion=0, offset=0, weights=NULL, maxit=50, tol=1e-10)
+mglmLevenberg(y, design, dispersion=0, offset=0, weights=NULL,
+ coef.start=NULL, start.method="null", maxit=200, tol=1e-06)
designAsFactor(design)
}
@@ -38,7 +32,6 @@ Can be a scalar giving one value for all tags, or a vector of length equal to th
\item{weights}{numeric vector or matrix of non-negative quantitative weights.
Can be a vector of length equal to the number of libraries, or a matrix of the same size as \code{y}.}
-
\item{coef.start}{numeric matrix of starting values for the linear model coefficients.
Number of rows should agree with \code{y} and number of columns should agree with \code{design}.}
@@ -48,12 +41,11 @@ Number of rows should agree with \code{y} and number of columns should agree wit
\item{maxit}{scalar giving the maximum number of iterations for the Fisher scoring algorithm.}
-\item{trace}{logical, whether or not to information should be output at each iteration.}
+\item{verbose}{logical. If \code{TRUE}, warnings will be issued when \code{maxit} iterations are exceeded before convergence is achieved.}
}
\details{
-The functions \code{mglmLS}, \code{mglmOneGroup} and \code{mglmSimple} all fit negative binomial generalized linear models, with the same design matrix but possibly different dispersions, offsets and weights, to a series of response vectors.
-\code{mglmLS} and \code{mglmOneGroup} are vectorized in R for fast execution, while \code{mglmSimple} simply makes tagwise calls to \code{glm.fit} in the stats package.
+The functions \code{mglmOneGroup}, \code{mglmOneWay} and \code{mglmLevenberg} all fit negative binomial generalized linear models, with the same design matrix but possibly different dispersions, offsets and weights, to a series of response vectors.
The functions are all low-level functions in that they operate on atomic objects such as matrices.
They are used as work-horses by higher-level functions in the edgeR package, especially by \code{glmFit}.
@@ -63,43 +55,25 @@ It implements Fisher scoring with a score-statistic stopping criterion for each
Excellent starting values are available for the null model, so this function seldom has any problems with convergence.
It is used by other edgeR functions to compute the overall abundance for each tag.
-\code{mglmLS} fits an arbitrary log-linear model to each response vector.
-It implements a vectorized approximate scoring algorithm with a likelihood derivative stopping criterion for each tag.
-A simple line search strategy is used to ensure that the residual deviance is reduced at each iteration.
-This function is the work-horse of other edgeR functions such as \code{glmFit} and \code{glmLRT}.
-
-\code{mglmSimple} is not vectorized, and simply makes tag-wise calls to \code{glm.fit}.
-This has the advantage that it accesses all the usual information generated by \code{glm.fit}.
-Unfortunately, \code{glm.fit} does not always converge, and the tag-wise fitting is relatively slow.
-
-\code{mglmLevenberg} implements a Levenberg-Marquardt modification of the glm scoring algorithm to prevent divergence,
-and is implemented in C++.
+\code{mglmLevenberg} fits an arbitrary log-linear model to each response vector.
+It implements a Levenberg-Marquardt modification of the glm scoring algorithm to prevent divergence.
+The main computation is implemented in C++.
All these functions treat the dispersion parameter of the negative binomial distribution as a known input.
-\code{deviances.function} simply chooses the appropriate deviance function to use given a scalar or vector of dispersion parameters. If the dispersion values are zero, then the Poisson deviance function is returned; if the dispersion values are positive, then the negative binomial deviance function is returned.
+\code{deviances.function} chooses the appropriate deviance function to use given a scalar or vector of dispersion parameters.
+If the dispersion values are zero, then the Poisson deviance function is returned; if the dispersion values are positive, then the negative binomial deviance function is returned.
}
\value{
\code{mglmOneGroup} produces a vector of length equal to the number of tags/genes (number of rows of \code{y}) providing the single coefficent from the GLM fit for each tag/gene. This can be interpreted as a measure of the 'average expression' level of the tag/gene.
-\code{mglmLS} produces a list with the following components:
- \item{coefficients}{matrix of estimated coefficients for the linear models}
- \item{fitted.values}{matrix of fitted values}
- \item{fail}{vector of indices of tags that fail the line search, in that the maximum number of step-halvings in exceeded}
- \item{not.converged}{vector of indices of tags that exceed the iteration limit before satisying the convergence criterion}
-
-\code{mglmSimple} produces a list with the following components:
+\code{mglmLevenberg} produces a list with the following components:
\item{coefficients}{matrix of estimated coefficients for the linear models}
- \item{df.residual}{vector of residual degrees of freedom for the linear models}
- \item{deviance}{vector of deviances for the linear models}
- \item{design}{matrix giving the experimental design that was used for each of the linear models}
- \item{offset}{scalar, vector or matrix of offset values used for the linear models}
- \item{dispersion}{scalar or vector of the dispersion values used for the linear model fits}
- \item{weights}{matrix of final weights for the observations from the linear model fits}
\item{fitted.values}{matrix of fitted values}
- \item{error}{logical vector, did the fit fail?}
- \item{converged}{local vector, did the fit converge?}
+ \item{deviance}{residual deviances}
+ \item{iter}{number of iterations used}
+ \item{fail}{logical vector indicating tags for which the maximum damping was exceeded before convergence was achieved}
\code{deviances.function} returns a function to calculate the deviance as appropriate for the given values of the dispersion.
@@ -112,7 +86,7 @@ McCarthy, DJ, Chen, Y, Smyth, GK (2012). Differential expression analysis of mul
\url{http://nar.oxfordjournals.org/content/40/10/4288}
}
-\author{Yunshun Chen, Davis McCarthy, Aaron Lun, Gordon Smyth. C++ code by Aaron Lun.}
+\author{Gordon Smyth, Yunshun Chen, Davis McCarthy, Aaron Lun. C++ code by Aaron Lun.}
\examples{
y <- matrix(rnbinom(1000,mu=10,size=2),ncol=4)
@@ -130,7 +104,7 @@ AveLogCPM <- aveLogCPM(y, dispersion, offset=log(lib.size))
f1 <- factor(c(1,1,2,2))
f2 <- factor(c(1,2,1,2))
x <- model.matrix(~f1+f2)
-fit <- mglmLS(y, x, dispersion=dispersion, offset=log(lib.size))
+fit <- mglmLevenberg(y, x, dispersion=dispersion, offset=log(lib.size))
head(fit$coefficients)
}
diff --git a/man/nbinomDeviance.Rd b/man/nbinomDeviance.Rd
new file mode 100644
index 0000000..58b29c0
--- /dev/null
+++ b/man/nbinomDeviance.Rd
@@ -0,0 +1,58 @@
+\name{nbinomDeviance}
+\alias{nbinomDeviance}
+\alias{nbinomUnitDeviance}
+
+\title{Negative Binomial Deviance}
+
+\description{
+Fit the same log-link negative binomial or Poisson generalized linear model (GLM) to each row of a matrix of counts.
+}
+
+\usage{
+nbinomUnitDeviance(y, mean, dispersion=0)
+nbinomDeviance(y, mean, dispersion=0, weights=NULL)
+}
+
+\arguments{
+\item{y}{numeric vector or matrix containing the negative binomial counts. If a matrix, then rows for tags and columns for libraries. \code{nbinomDeviance} treats a vector as a matrix with one row.}
+
+\item{mean}{numeric vector matrix of expected values, of same dimension as \code{y}.}
+
+\item{dispersion}{numeric vector or matrix of negative binomial dispersions.
+Can be a scalar, or a vector of length equal to the number of tags, or a matrix of same dimensions as \code{y}.}
+
+\item{weights}{numeric vector or matrix of non-negative weights, as for \code{glmFit}.}
+}
+
+\details{
+\code{nbinomUnitDeviance} computes the unit deviance for each \code{y} observation.
+\code{nbinomDeviance} computes the total residual deviance for each row of \code{y} observation, i.e., weighted row sums of the unit deviances.
+
+Care is taken to ensure accurate computation for small dispersion values.
+}
+
+\value{
+\code{nbinomUnitDeviance} returns a numeric vector or matrix of the same size as \code{y}.
+
+\code{nbinomDeviance} returns a numeric vector of length equal to the number of rows of \code{y}.
+}
+
+\references{
+Jorgensen, B. (2006).
+Generalized linear models. Encyclopedia of Environmetrics, Wiley.
+\url{http://onlinelibrary.wiley.com/doi/10.1002/9780470057339.vag010/full}.
+
+McCarthy, DJ, Chen, Y, Smyth, GK (2012).
+Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.
+\emph{Nucleic Acids Research} 40, 4288-4297.
+\url{http://nar.oxfordjournals.org/content/40/10/4288}
+}
+
+\author{Gordon Smyth, Yunshun Chen, Aaron Lun. C++ code by Aaron Lun.}
+
+\examples{
+y <- matrix(1:6,3,2)
+mu <- matrix(3,3,2)
+nbinomUnitDeviance(y,mu,dispersion=0.2)
+nbinomDeviance(y,mu,dispersion=0.2)
+}
diff --git a/man/normalizeChIPtoInput.Rd b/man/normalizeChIPtoInput.Rd
index a0b6d7a..453e93e 100644
--- a/man/normalizeChIPtoInput.Rd
+++ b/man/normalizeChIPtoInput.Rd
@@ -9,8 +9,8 @@ Normalize ChIP-Seq read counts to input control values, then test for significan
}
\usage{
-normalizeChIPtoInput(input, response, dispersion=0.01, niter=6, loss="p", plot=FALSE, verbose=FALSE, ...)
-calcNormOffsetsforChIP(input, response, dispersion=0.01, niter=6, loss="p", plot=FALSE, verbose=FALSE, ...)
+normalizeChIPtoInput(input, response, dispersion=0.01, niter=6, loss="p", plot=FALSE, verbose=FALSE, \dots)
+calcNormOffsetsforChIP(input, response, dispersion=0.01, niter=6, loss="p", plot=FALSE, verbose=FALSE, \dots)
}
\arguments{
diff --git a/man/plotBCV.Rd b/man/plotBCV.Rd
index 56fb9a7..e4baf3f 100644
--- a/man/plotBCV.Rd
+++ b/man/plotBCV.Rd
@@ -6,7 +6,7 @@ Plot genewise biological coefficient of variation (BCV) against gene abundance (
}
\usage{
plotBCV(y, xlab="Average log CPM", ylab="Biological coefficient of variation",
- pch=16, cex=0.2, col.common="red", col.trend="blue", col.tagwise="black", ...)
+ pch=16, cex=0.2, col.common="red", col.trend="blue", col.tagwise="black", \dots)
}
\arguments{
\item{y}{a \code{DGEList} object.}
@@ -17,7 +17,7 @@ plotBCV(y, xlab="Average log CPM", ylab="Biological coefficient of variation",
\item{col.common}{color of line showing common dispersion}
\item{col.trend}{color of line showing dispersion trend}
\item{col.tagwise}{color of points showing tagwise dispersions}
- \item{...}{any other arguments are passed to \code{plot}.}
+ \item{\dots}{any other arguments are passed to \code{plot}.}
}
\details{
diff --git a/man/plotExonUsage.Rd b/man/plotExonUsage.Rd
index 0a9dfc4..944a4f5 100644
--- a/man/plotExonUsage.Rd
+++ b/man/plotExonUsage.Rd
@@ -6,7 +6,7 @@
\description{Create a plot of exon usage for a given gene by plotting the (un)transformed counts for each exon, coloured by experimental group.}
\usage{
-plotExonUsage(y, geneID, group=NULL, transform="none", counts.per.million=TRUE, legend.coords=NULL, ...)
+plotExonUsage(y, geneID, group=NULL, transform="none", counts.per.million=TRUE, legend.coords=NULL, \dots)
}
\arguments{
@@ -22,7 +22,7 @@ plotExonUsage(y, geneID, group=NULL, transform="none", counts.per.million=TRUE,
\item{legend.coords}{optional vector of length 2 giving the x- and y-coordinates of the legend on the plot. If \code{NULL} (default), the legend will be automatically placed near the top right corner of the plot.}
-\item{...}{optional further arguments to be passed on to \code{plot}.}
+\item{\dots}{optional further arguments to be passed on to \code{plot}.}
}
\value{\code{plotExonUsage} (invisibly) returns the transformed matrix of counts for the gene being plotted and produces a plot to the current device.}
diff --git a/man/plotMDS.DGEList.Rd b/man/plotMDS.DGEList.Rd
index 2604e8a..d604788 100644
--- a/man/plotMDS.DGEList.Rd
+++ b/man/plotMDS.DGEList.Rd
@@ -8,7 +8,7 @@ for the top genes that best distinguish the samples.
}
\usage{
\method{plotMDS}{DGEList}(x, top=500, labels=colnames(x), col=NULL, cex=1, dim.plot=c(1,2),
- ndim=max(dim.plot), xlab=NULL, ylab=NULL, method="logFC", prior.count=2, gene.selection="pairwise", ...)
+ ndim=max(dim.plot), xlab=NULL, ylab=NULL, method="logFC", prior.count=2, gene.selection="pairwise", \dots)
}
\arguments{
\item{x}{an \code{DGEList} object.}
@@ -23,7 +23,7 @@ for the top genes that best distinguish the samples.
\item{method}{how to compute distances. Possible values are "logFC" or \code{"bcv"}.}
\item{prior.count}{average prior count to be added to observation to shrink the estimated log-fold-changes towards zero. Only used when \code{method="logFC"}.}
\item{gene.selection}{character, \code{"pairwise"} to choose the top genes separately for each pairwise comparison between the samples or \code{"common"} to select the same genes for all comparisons. Only used when \code{method="logFC"}.}
- \item{...}{any other arguments are passed to \code{plot}.}
+ \item{\dots}{any other arguments are passed to \code{plot}.}
}
\details{
diff --git a/man/plotSmear.Rd b/man/plotSmear.Rd
index 19bb385..e806c38 100644
--- a/man/plotSmear.Rd
+++ b/man/plotSmear.Rd
@@ -8,7 +8,7 @@ Both of these functions plot the log-fold change (i.e. the log of the ratio of e
}
\usage{
plotSmear(object, pair=NULL, de.tags=NULL, xlab="Average logCPM", ylab="logFC", pch=19,
- cex=0.2, smearWidth=0.5, panel.first=grid(), smooth.scatter=FALSE, lowess=FALSE, ...)
+ cex=0.2, smearWidth=0.5, panel.first=grid(), smooth.scatter=FALSE, lowess=FALSE, \dots)
}
%- maybe also 'usage' for other objects documented here.
\arguments{
diff --git a/man/predFC.Rd b/man/predFC.Rd
index 7aed193..482ee59 100644
--- a/man/predFC.Rd
+++ b/man/predFC.Rd
@@ -8,8 +8,8 @@
\description{Computes estimated coefficients for a NB glm in such a way that the log-fold-changes are shrunk towards zero.}
\usage{
-\S3method{predFC}{DGEList}(y, design=NULL, prior.count=0.125, offset=NULL, dispersion=NULL)
-\S3method{predFC}{default}(y, design=NULL, prior.count=0.125, offset=NULL, dispersion=0)
+\S3method{predFC}{DGEList}(y, design=NULL, prior.count=0.125, offset=NULL, dispersion=NULL, weights=NULL, \dots)
+\S3method{predFC}{default}(y, design=NULL, prior.count=0.125, offset=NULL, dispersion=0, weights=NULL, \dots)
}
\arguments{
@@ -21,7 +21,11 @@
\item{offset}{numeric vector or matrix giving the offset in the log-linear model predictor, as for \code{\link{glmFit}}. Usually equal to log library sizes.}
-\item{dispersion}{the negative binomial dispersion}
+\item{dispersion}{numeric vector of negative binomial dispersions.}
+
+\item{weights}{optional numeric matrix giving observation weights}
+
+\item{\ldots}{other arguments are passed to \code{glmFit}.}
}
\details{
@@ -50,6 +54,12 @@ Numeric matrix of linear model coefficients (if \code{design} is given) or logCP
\author{Belinda Phipson and Gordon Smyth}
+\references{
+Phipson, B. (2013).
+\emph{Empirical Bayes modelling of expression profiles and their associations}.
+PhD Thesis. University of Melbourne, Australia.
+\url{http://repository.unimelb.edu.au/10187/17614}
+}
\examples{
# generate counts for a two group experiment with n=2 in each group and 100 genes
dispersion <- 0.1
diff --git a/man/processHairpinReads.Rd b/man/processHairpinReads.Rd
new file mode 100644
index 0000000..c41c27c
--- /dev/null
+++ b/man/processHairpinReads.Rd
@@ -0,0 +1,55 @@
+\name{processHairpinReads}
+\alias{processHairpinReads}
+
+\title{Process raw data from shRNA-seq screens}
+
+\description{
+Given a list of barcode sequences and hairpin sequences from a shRNA-seq screen, generate a DGEList of counts from the raw fastq file/(s) containing the sequence reads.
+}
+
+\usage{
+processHairpinReads(readfile, barcodefile, hairpinfile,
+ barcodeStart=1, barcodeEnd=5, hairpinStart=37, hairpinEnd=57,
+ allowShifting=FALSE, shiftingBase = 3,
+ allowMismatch=FALSE, barcodeMismatchBase = 1, hairpinMismatchBase = 2,
+ allowShiftedMismatch=FALSE, verbose = FALSE)
+}
+
+\arguments{
+\item{readfile}{character vector giving one or more fastq filenames}
+\item{barcodefile}{filename containing barcode ids and sequences}
+\item{hairpinfile}{filename containing hairpin ids and sequences}
+\item{barcodeStart}{numeric value, starting position (inclusive) of barcode sequence in reads}
+\item{barcodeEnd}{numeric value, ending position (inclusive) of barcode sequence in reads}
+\item{hairpinStart}{numeric value, starting position (inclusive) of hairpin sequence in reads}
+\item{hairpinEnd}{numeric value, ending position (inclusive) of hairpin sequence in reads}
+\item{allowShifting}{logical, indicates whether a given hairpin can be matched to a neighbouring position}
+\item{shiftingBase}{numeric value of maximum number of shifted bases from input \code{hairpinStart} and \code{hairpinEnd} should the program check for a hairpin match when \code{allowShifting} is \code{TRUE}}
+\item{allowMismatch}{logical, indicates whether sequence mismatch is allowed}
+\item{barcodeMismatchBase}{numeric value of maximum number of base sequence mismatch allowed in barcode when \code{allowShifting} is \code{TRUE}}
+\item{hairpinMismatchBase}{numeric value of maximum number of base sequence mismatch allowed in hairpin when \code{allowShifting} is \code{TRUE}}
+\item{allowShiftedMismatch}{logical, effective when \code{allowShifting} and \code{allowMismatch} are both \code{TRUE}. It indicates whether we check for sequence mismatch at a shifted position.}
+\item{verbose}{if \code{TRUE}, output program progess}
+}
+
+\value{Returns a \code{\link[edgeR:DGEList-class]{DGEList}} object with following components:
+ \item{counts}{read count matrix tallying up the number of reads with particular barcode and hairpin matches. Each row is a hairpin and each column is a sample}
+ \item{genes}{In this case, hairpin information (ID, sequences, corresponding target gene) may be recorded in this data.frame}
+ \item{lib.size}{auto-calculated column sum of the counts matrix}
+}
+
+\details{
+The input barcode file and hairpin files are tab-separated text files with at least two columns (named 'ID' and 'Sequences') containing the sample or hairpin ids and a second column indicating the sample index or hairpin sequences to be matched. The barcode file may also contain a 'group' column that indicates which experimental group a sample belongs to. Additional columns in each file will be included in the respective \code{$samples} or \code{$genes} data.frames of the final code{\l [...]
+
+To compute the count matrix, the matching to given barcodes and hairpins is conducted in two rounds. The first round looks for an exact sequence match. The program checks for a match from given barcode sequences and hairpin sequences at specified location. If \code{allowShifting} is set to \code{TRUE}, the program also checks if a given hairpin sequence can be found at a neighbouring position in read. For hairpins without a match, the program performs a second round of mapping which allo [...]
+
+The program outputs a \code{\link[edgeR:DGEList-class]{DGEList}} object, with a count matrix indicating the number of times each barcode and hairpin combination could be matched in reads from input fastq file/(s).
+
+For further examples and data, refer to the Case studies available from http://bioinf.wehi.edu.au/shRNAseq/.
+}
+
+\author{Zhiyin Dai, Matthew Ritchie}
+
+\references{
+Dai Z, Sheridan JM, et al. (submitted, 2014). shRNA-seq data analysis with edgeR. \emph{submitted}.
+}
diff --git a/man/readDGE.Rd b/man/readDGE.Rd
index 1fe02cb..74ce789 100755
--- a/man/readDGE.Rd
+++ b/man/readDGE.Rd
@@ -5,7 +5,7 @@
\description{Reads and merges a set of text files containing digital gene expression data.}
-\usage{readDGE(files, path=NULL, columns=c(1,2), group=NULL, labels=NULL, ...)}
+\usage{readDGE(files, path=NULL, columns=c(1,2), group=NULL, labels=NULL, \dots)}
\arguments{
\item{files}{character vector of filenames, or alternatively a data.frame with a column containing the file names of the files containing the libraries of counts and, optionally, columns containing the \code{group} to which each library belongs, descriptions of the other samples and other information.}
@@ -14,7 +14,7 @@ The default is the current working directory.}
\item{columns}{numeric vector stating which two columns contain the tag names and counts, respectively}
\item{group}{vector, or preferably a factor, indicating the experimental group to which each library belongs. If \code{group} is not \code{NULL}, then this argument overrides any group information included in the \code{files} argument.}
\item{labels}{character vector giving short names to associate with the libraries. Defaults to the file names.}
-\item{...}{other are passed to \code{read.delim}}
+\item{\dots}{other are passed to \code{read.delim}}
}
\details{
diff --git a/man/roast.DGEList.Rd b/man/roast.DGEList.Rd
index 42d7c1e..ed9e026 100644
--- a/man/roast.DGEList.Rd
+++ b/man/roast.DGEList.Rd
@@ -7,32 +7,16 @@ Rotation gene set testing for Negative Binomial generalized linear models.
}
\usage{
-\method{roast}{DGEList}(y, index=NULL, design=NULL, contrast=ncol(design), set.statistic="mean",
- gene.weights=NULL, array.weights=NULL, weights=NULL, block=NULL, correlation,
- var.prior=NULL, df.prior=NULL, trend.var=FALSE, nrot=999)
-\method{mroast}{DGEList}(y, index=NULL, design=NULL, contrast=ncol(design), set.statistic="mean",
- gene.weights=NULL, array.weights=NULL, weights=NULL, block=NULL, correlation,
- var.prior=NULL, df.prior=NULL, trend.var=FALSE, nrot=999, adjust.method="BH", midp=TRUE, sort="directional")
+\method{roast}{DGEList}(y, index=NULL, design=NULL, contrast=ncol(design), \dots)
+\method{mroast}{DGEList}(y, index=NULL, design=NULL, contrast=ncol(design), \dots)
}
\arguments{
\item{y}{\code{DGEList} object.}
- \item{index}{index vector specifying which rows (genes) of \code{y} are in the test set. This can be a vector of indices, or a logical vector of the same length as \code{statistics}, or any vector such as \code{y[iset,]} contains the values for the gene set to be tested.}
+ \item{index}{index vector specifying which rows (genes) of \code{y} are in the test set. This can be a vector of indices, or a logical vector of the same length as \code{statistics}, or any vector such as \code{y[iset,]} contains the values for the gene set to be tested. Defaults to all genes. For \code{mroast} a list of index vectors.}
\item{design}{design matrix}
\item{contrast}{contrast for which the test is required. Can be an integer specifying a column of \code{design}, or else a contrast vector of length equal to the number of columns of \code{design}.}
- \item{set.statistic}{summary set statistic. Possibilities are \code{"mean"},\code{"floormean"},\code{"mean50"} or \code{"msq"}.}
- \item{gene.weights}{optional numeric vector of weights for genes in the set. Can be positive or negative. For \code{mroast.DGEList} this vector must have length equal to \code{nrow(y)}. For \code{roast.DGEList}, can be of length \code{nrow(y)} or of length equal to the number of genes in the test set.}
- \item{array.weights}{optional numeric vector of array weights.}
- \item{weights}{optional matrix of observation weights. If supplied, should be of same dimensions as \code{y} and all values should be positive.}
- \item{block}{optional vector of blocks.}
- \item{correlation}{correlation between blocks.}
- \item{var.prior}{prior value for residual variances. If not provided, this is estimated from all the data using \code{squeezeVar}.}
- \item{df.prior}{prior degrees of freedom for residual variances. If not provided, this is estimated using \code{squeezeVar}.}
- \item{trend.var}{logical, should a trend be estimated for \code{var.prior}? See \code{eBayes} for details. Only used if \code{var.prior} or \code{df.prior} are \code{NULL}.}
- \item{nrot}{number of rotations used to estimate the p-values.}
- \item{adjust.method}{method used to adjust the p-values for multiple testing. See \code{\link{p.adjust}} for possible values.}
- \item{midp}{logical, should mid-p-values be used in instead of ordinary p-values when adjusting for multiple testing?}
- \item{sort}{character, whether to sort output table by directional p-values (\code{"directional"}), non-directional p-value (\code{"mixed"}), or not at all (\code{"none"}).}
+ \item{\dots}{other arguments are passed to \code{link{roast.default}} or \code{link{mroast.default}}.}
}
\value{
@@ -42,40 +26,14 @@ Rotation gene set testing for Negative Binomial generalized linear models.
}
\details{
-This function implements a method for the ROAST gene set test from Wu et al (2010) for the digital gene expression data, eg. RNA-Seq data.
-Basically, the Negative Binomial generalized linear models are fitted for count data. The fitted values are converted into z-scores, and then it calls the \code{roast} function in \code{limma} package to conduct the gene set test.
-It tests whether any of the genes in the set are differentially expressed.
-This allows users to focus on differential expression for any coefficient or contrast in a generalized linear model.
-If \code{contrast} is not specified, the last coefficient in the model will be tested.
-The arguments \code{array.weights}, \code{block} and \code{correlation} have the same meaning as they for for the \code{\link{lmFit}} function.
-
-The arguments \code{df.prior} and \code{var.prior} have the same meaning as in the output of the \code{\link{eBayes}} function.
-If these arguments are not supplied, they are estimated exactly as is done by \code{eBayes}.
-
-The argument \code{gene.weights} allows directions or weights to be set for individual genes in the set.
-
-The gene set statistics \code{"mean"}, \code{"floormean"}, \code{"mean50"} and \code{msq} are defined by Wu et al (2010).
-The different gene set statistics have different sensitivities to small number of genes.
-If \code{set.statistic="mean"} then the set will be statistically significantly only when the majority of the genes are differentially expressed.
-\code{"floormean"} and \code{"mean50"} will detect as few as 25\% differentially expressed.
-\code{"msq"} is sensitive to even smaller proportions of differentially expressed genes, if the effects are reasonably large.
-
-The output gives p-values three possible alternative hypotheses,
-\code{"Up"} to test whether the genes in the set tend to be up-regulated, with positive t-statistics,
-\code{"Down"} to test whether the genes in the set tend to be down-regulated, with negative t-statistics,
-and \code{"Mixed"} to test whether the genes in the set tend to be differentially expressed, without regard for direction.
-
-\code{roast} estimates p-values by simulation, specifically by random rotations of the orthogonalized residuals (Langsrud, 2005), so p-values will vary slightly from run to run.
-To get more precise p-values, increase the number of rotations \code{nrot}.
-The p-value is computed as \code{(b+1)/(nrot+1)} where \code{b} is the number of rotations giving a more extreme statistic than that observed (Phipson and Smyth, 2010).
-This means that the smallest possible p-value is \code{1/(nrot+1)}.
-
-\code{mroast} does roast tests for multiple sets, including adjustment for multiple testing.
-By default, \code{mroast} reports ordinary p-values but uses mid-p-values (Routledge, 1994) at the multiple testing stage.
-Mid-p-values are probably a good choice when using false discovery rates (\code{adjust.method="BH"}) but not when controlling the family-wise type I error rate (\code{adjust.method="holm"}).
-
-\code{roast} performs a \emph{self-contained} test in the sense defined by Goeman and Buhlmann (2007).
-For a \emph{competitive} gene set test, see \code{\link{camera.DGEList}}.
+The roast gene set test was proposed by Wu et al (2010) for microarray data.
+This function makes the roast test available for digital gene expression data.
+The negative binomial count data is converted to approximate normal deviates by computing mid-p quantile residuals (Dunn and Smyth, 1996; Routledge, 1994) under the null hypothesis that the contrast is zero.
+See \code{\link{roast}} for more description of the test and for a complete list of possible arguments.
+
+The design matrix defaults to the \code{model.matrix(~y$samples$group)}.
+
+\code{mroast} performs \code{roast} tests for a multiple of gene sets.
}
\seealso{
@@ -85,17 +43,10 @@ For a \emph{competitive} gene set test, see \code{\link{camera.DGEList}}.
\author{Yunshun Chen and Gordon Smyth}
\references{
-Goeman, JJ, and Buhlmann, P (2007).
-Analyzing gene expression data in terms of gene sets: methodological issues.
-\emph{Bioinformatics} 23, 980-987.
-
-Langsrud, O (2005).
-Rotation tests.
-\emph{Statistics and Computing} 15, 53-60.
-
-Phipson B, and Smyth GK (2010).
-Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn.
-\emph{Statistical Applications in Genetics and Molecular Biology}, Volume 9, Article 39.
+Dunn, K. P., and Smyth, G. K. (1996).
+Randomized quantile residuals.
+\emph{J. Comput. Graph. Statist.}, 5, 236-244.
+\url{http://www.statsci.org/smyth/pubs/residual.html}
Routledge, RD (1994).
Practicing safe statistics with the mid-p.
@@ -129,4 +80,3 @@ roast(y, iset1, design, contrast=2)
mroast(y, iset1, design, contrast=2)
mroast(y, list(set1=iset1, set2=iset2), design, contrast=2)
}
-\keyword{htest}
diff --git a/man/subsetting.Rd b/man/subsetting.Rd
index 945d4f5..afc2411 100644
--- a/man/subsetting.Rd
+++ b/man/subsetting.Rd
@@ -4,27 +4,30 @@
\alias{[.DGEExact}
\alias{[.DGELRT}
\alias{[.DGEGLM}
+\alias{[.TopTags}
\title{Subset DGEList, DGEGLM, DGEExact and DGELRT Objects}
\description{
Extract a subset of a \code{DGEList}, \code{DGEGLM}, \code{DGEExact} or \code{DGELRT} object.
}
\usage{
-\method{[}{DGEList}(object, i, j, \ldots)
-\method{[}{DGEGLM}(object, i, j, \ldots)
-\method{[}{DGEExact}(object, i, j, \ldots)
-\method{[}{DGELRT}(object, i, j, \ldots)
+\method{[}{DGEList}(object, i, j, keep.lib.sizes=TRUE)
+\method{[}{DGEGLM}(object, i, j)
+\method{[}{DGEExact}(object, i, j)
+\method{[}{DGELRT}(object, i, j)
+\method{[}{TopTags}(object, i, j)
}
\arguments{
- \item{object}{object of class \code{DGEList}, \code{DGEGLM}, \code{DGEExact} or \code{DGELRT}, respectively}
- \item{i,j}{elements to extract. \code{i} subsets the tags or genes while \code{j} subsets the libraries. Note, columns of \code{DGEGLM}, \code{DGEExact} and \code{DGELRT} objects cannot be subsetted.}
- \item{\ldots}{not used}
+ \item{object}{object of class \code{DGEList}, \code{DGEGLM}, \code{DGEExact} or \code{DGELRT}. For \code{subsetListOfArrays}, any list of conformal matrices and vectors.}
+ \item{i,j}{elements to extract. \code{i} subsets the tags or genes while \code{j} subsets the libraries.
+ Note that columns of \code{DGEGLM}, \code{DGEExact} and \code{DGELRT} objects cannot be subsetted.}
+ \item{keep.lib.sizes}{logical, if \code{TRUE} the lib.sizes will be kept unchanged on output, otherwise they will be recomputed as the column sums of the counts of the remaining rows.}
}
\details{
\code{i,j} may take any values acceptable for the matrix components of \code{object} of class \code{DGEList}.
See the \link{Extract} help entry for more details on subsetting matrices. For \code{DGEGLM}, \code{DGEExact} and \code{DGELRT} objects, only rows (i.e. \code{i}) may be subsetted.
}
\value{
-An object of class \code{DGEList}, \code{DGEGLM}, \code{DGEExact} or \code{DGELRT} as appropriate, holding data from the specified subset of tags/genes and libraries.
+An object of the same class as \code{object} holding data from the specified subset of rows and columns.
}
\author{Davis McCarthy, Gordon Smyth}
\seealso{
@@ -43,4 +46,3 @@ results <- exactTest(d)
results[1:2,]
# NB: cannot subset columns for DGEExact objects
}
-\keyword{manip}
diff --git a/man/sumTechReps.Rd b/man/sumTechReps.Rd
new file mode 100644
index 0000000..f4e5b21
--- /dev/null
+++ b/man/sumTechReps.Rd
@@ -0,0 +1,34 @@
+\name{sumTechReps}
+\alias{sumTechReps}
+\alias{sumTechReps.default}
+\alias{sumTechReps.DGEList}
+\title{Sum Over Replicate Samples}
+\description{
+Condense the columns of a matrix or DGEList object so that counts are summed over technical replicate samples.
+}
+\usage{
+\method{sumTechReps}{default}(x, ID=colnames(x), \dots)
+\method{sumTechReps}{DGEList}(x, ID=colnames(x), \dots)
+}
+\arguments{
+ \item{x}{a numeric matrix or \code{DGEList} object.}
+ \item{ID}{sample identifier.}
+ \item{\dots}{other arguments are not currently used.}
+}
+\details{
+A new matrix or DGEList object is computed in which the counts for technical replicate samples are replaced by their sums.
+}
+\value{
+A data object of the same class as \code{x} with a column for each unique value of \code{ID}.
+Columns are in the same order as the ID values first occur in the ID vector.
+}
+\author{Gordon Smyth and Yifang Hu}
+\seealso{
+ \code{\link{rowsum}}.
+}
+
+\examples{
+x <- matrix(rpois(8*3,lambda=5),8,3)
+colnames(x) <- c("a","a","b")
+sumTechReps(x)
+}
diff --git a/man/topTags.Rd b/man/topTags.Rd
index f9bb409..5593f59 100755
--- a/man/topTags.Rd
+++ b/man/topTags.Rd
@@ -2,7 +2,6 @@
\alias{topTags}
\alias{TopTags-class}
\alias{show,TopTags-method}
-\alias{[.TopTags}
\title{Table of the Top Differentially Expressed Tags}
diff --git a/man/validDGEList.Rd b/man/validDGEList.Rd
new file mode 100644
index 0000000..84c758e
--- /dev/null
+++ b/man/validDGEList.Rd
@@ -0,0 +1,37 @@
+\name{validDGEList}
+\alias{validDGEList}
+\alias{validDGEList}
+
+\title{Check for Valid DGEList object}
+
+\description{
+Check for existence of standard components of DGEList object.
+}
+
+\usage{
+validDGEList(y)
+}
+
+\arguments{
+\item{y}{\code{DGEList} object.}
+}
+
+\details{
+This function checks that the standard \code{counts} and \code{samples} components of a \code{DGEList} object are present.
+}
+
+\value{
+\code{DGEList} with missing components added.
+}
+
+\author{Gordon Smyth}
+
+\examples{
+counts <- matrix(rpois(4*2,lambda=5),4,2)
+dge <- new("DGEList", list(counts=counts))
+validDGEList(dge)
+}
+
+\seealso{
+\code{\link{DGEList}}
+}
diff --git a/src/Makevars b/src/Makevars
index 786849b..5569601 100644
--- a/src/Makevars
+++ b/src/Makevars
@@ -1,14 +1,4 @@
CHECK=#-Wall -pedantic
PKG_LIBS=$(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)
-PKG_CPPFLAGS=-I./core -I. $(CHECK)
+PKG_CPPFLAGS=$(CHECK)
PKG_CFLAGS=$(CHECK)
-CPP_SOURCES=R_exact_test_by_deviance.cpp R_loess_by_col.cpp R_cr_adjust.cpp R_levenberg.cpp R_maximize_interpolant.cpp R_one_group.cpp R_simple_good_turing.cpp \
- core/adj_coxreid.cpp core/glm_levenberg.cpp core/glm_one_group.cpp core/interpolator.cpp
-C_SOURCES=core/fmm_spline.c
-OBJECTS=$(CPP_SOURCES:.cpp=.o) $(C_SOURCES:.c=.o)
-
-all: $(SHLIB)
-
-clean:
- @rm -fv $(OBJECTS)
-
diff --git a/src/R_compute_nbdev.cpp b/src/R_compute_nbdev.cpp
new file mode 100644
index 0000000..89e26ed
--- /dev/null
+++ b/src/R_compute_nbdev.cpp
@@ -0,0 +1,45 @@
+#include "utils.h"
+#include "glm.h"
+#include <iostream>
+
+extern "C" {
+
+SEXP R_compute_nbdev (SEXP y, SEXP mu, SEXP phi) try {
+ if (!IS_NUMERIC(phi)) { throw std::runtime_error("dispersion vector should be double-precision"); }
+ const int ntags=LENGTH(phi);
+ if (!IS_NUMERIC(y)) { throw std::runtime_error("count matrix should be double-precision"); }
+ if (!IS_NUMERIC(mu)) { throw std::runtime_error("matrix of means should be double-precision"); }
+ const int nlib=LENGTH(mu)/ntags;
+ if (nlib*ntags !=LENGTH(mu)) { throw std::runtime_error("mean matrix has inconsistent dimensions"); }
+ if (LENGTH(mu)!=LENGTH(y)) { throw std::runtime_error("count and mean matrices should have same dimensions"); }
+
+ const double* yptr=NUMERIC_POINTER(y);
+ const double* mptr=NUMERIC_POINTER(mu);
+ const double* dptr=NUMERIC_POINTER(phi);
+
+ // Running through each row and computing the unit deviance, and then that sum.
+ SEXP output=PROTECT(allocMatrix(REALSXP, ntags, nlib));
+ try {
+ double* optr=NUMERIC_POINTER(output);
+ int counter;
+ for (int i=0; i<ntags; ++i) {
+ counter=0;
+ for (int j=0; j<nlib; ++j, counter+=ntags) {
+ optr[counter]=compute_unit_nb_deviance(yptr[counter], mptr[counter], dptr[i]);
+ }
+ ++optr;
+ ++yptr;
+ ++mptr;
+ }
+ } catch (std::exception& e) {
+ UNPROTECT(1);
+ throw;
+ }
+
+ UNPROTECT(1);
+ return output;
+} catch(std::exception& e) {
+ return mkString(e.what());
+}
+
+}
diff --git a/src/R_exact_test_by_deviance.cpp b/src/R_exact_test_by_deviance.cpp
index d08c19b..5743e05 100644
--- a/src/R_exact_test_by_deviance.cpp
+++ b/src/R_exact_test_by_deviance.cpp
@@ -1,135 +1,71 @@
#include "utils.h"
+#include "glm.h"
extern "C" {
#include "Rmath.h"
}
-
-double nbdev (const double& sum, const double& mu, const double& size, const bool& deriv=false) {
- const double& use_sum=(sum > low_value ? sum : low_value);
- if (!deriv) {
- return use_sum*std::log(use_sum/mu) - (use_sum+size)*std::log((use_sum+size)/(mu+size));
- } else {
- return std::log(use_sum/mu) - std::log((use_sum+size)/(mu+size));
- }
-}
+#ifdef DEBUG
+#include <iostream>
+#endif
extern "C" {
-SEXP R_exact_test_by_deviance(SEXP sums_1, SEXP sums_2, SEXP n_1, SEXP n_2, SEXP disp, SEXP big, SEXP tol) try {
- // Setting up the inputs.
- if (!IS_INTEGER(sums_1) || !IS_INTEGER(sums_2)) { throw std::runtime_error("sums must be integer vectors"); }
+SEXP R_exact_test_by_deviance(SEXP sums_1, SEXP sums_2, SEXP n_1, SEXP n_2, SEXP disp) try {
+ if (!IS_INTEGER(n_1) || LENGTH(n_1)!=1 || !IS_INTEGER(n_2) || LENGTH(n_2)!=1) {
+ throw std::runtime_error("number of libraries must be integer scalars"); }
if (!IS_NUMERIC(disp)) { throw std::runtime_error("dispersion must be a double precision vector"); }
-
+ if (!IS_INTEGER(sums_1) || !IS_INTEGER(sums_2)) { throw std::runtime_error("sums must be integer vectors"); }
+
const int n1=INTEGER_VALUE(n_1), n2=INTEGER_VALUE(n_2);
+ const int nlibs = n1+n2;
const int ntags=LENGTH(sums_1);
if (ntags!=LENGTH(sums_2) || ntags!=LENGTH(disp)) {
throw std::runtime_error("lengths of input vectors do not match");
} else if (n1<=0 || n2 <=0) {
throw std::runtime_error("number of libraries must be positive for each condition");
}
- int* s1_ptr=INTEGER_POINTER(sums_1), *s2_ptr=INTEGER_POINTER(sums_2);
- double *d_ptr=NUMERIC_POINTER(disp);
- const double nr_tolerance=NUMERIC_VALUE(tol);
- const double big_count=NUMERIC_VALUE(big);
+ const int* s1_ptr=INTEGER_POINTER(sums_1), *s2_ptr=INTEGER_POINTER(sums_2);
+ const double *d_ptr=NUMERIC_POINTER(disp);
-
- // Setting up the outputs.
- SEXP output;
- PROTECT(output=NEW_NUMERIC(ntags));
- double* p_ptr=NUMERIC_POINTER(output);
+ SEXP output=PROTECT(NEW_NUMERIC(ntags));
try{
- // Iterating through the tags.
- const double prop1=n1/double(n1+n2), prop2=n2/double(n1+n2);
- for (long i=0; i<ntags; ++i) {
- const double size1=n1/d_ptr[i], size2=n2/d_ptr[i];
- const int& s1=s1_ptr[i], s2=s2_ptr[i];
- const int total=s1+s2;
- const double mu1=total*prop1, mu2=total*prop2;
-
- if (std::abs(s1-mu1)/s1 < low_value) {
- /* If our count is equal to our mean, then we can just bail and set
- * the p-value at 1. It's not going to get any smaller if the deviance is zero.
- */
- p_ptr[i]=1;
- continue;
- }
+ double* p_ptr=NUMERIC_POINTER(output);
+ for (int i=0; i<ntags; ++i) {
+ const int& s1=s1_ptr[i];
+ const int& s2=s2_ptr[i];
+ const int stotal=s1+s2;
- // Sorting out which direction we want to go in.
- const double threshold_dev=nbdev(s1, mu1, size1) + nbdev(s2, mu2, size2);
- const bool other_is_up=(s1<mu1);
- const double& right_size=(other_is_up ? size2 : size1);
- const double& left_size=(other_is_up ? size1 : size2);
- const double& right_mu=(other_is_up ? mu2 : mu1);
- const double& left_mu=(other_is_up ? mu1 : mu2);
+ // Computing current means and sizes for each library (probability is the same).
+ const double mu = stotal/nlibs;
+ const double mu1=mu*n1, mu2=mu*n2, r1=n1/d_ptr[i], r2=n2/d_ptr[i];
+ const double p = r1/(r1+mu1);
- /* The deviance is a function with one minimum which is monotonic
- * decreasing upon approach and increasing upon departure. We use
- * a Newton-Raphson search to identify the point on the ``other''
- * side which has deviance closest to our observed deviance.
- */
- double x=0;
- double step=100;
-
- /* Note the 'minus' in the gradient for the second term; this is because the
- * differentiation of 'total-x' needs to be reversed to account for the
- * negativeness of the 'x'. Also note that the NR search is safe
- * because the only solution which has a gradient of zero is when the
- * means are equal to the counts, and that is considered (above).
- */
- while (std::abs(step) > nr_tolerance) {
- step=(nbdev(x, right_mu, right_size)+nbdev(total-x, left_mu, left_size)-threshold_dev)/
- (nbdev(x, right_mu, right_size, true)-nbdev(total-x, left_mu, left_size, true));
- x-=step;
- if (x > total || x < 0) { throw std::runtime_error("failure during Newton-Raphson procedure"); }
+ /* The aim is to sum conditional probabilities for all partitions of the total sum with deviances
+ * greater than that observed for the current partition. We start computing from the extremes
+ * in both cases.
+ */
+ const double phi1=1/r1, phi2=1/r2;
+ const double obsdev=compute_unit_nb_deviance(s1, mu1, phi1)+compute_unit_nb_deviance(s2, mu2, phi2);
+ double& currentp=(p_ptr[i]=0);
+
+ // Going from the left.
+ int j=0;
+ while (j <= stotal) {
+ if (obsdev <= compute_unit_nb_deviance(j, mu1, phi1)+compute_unit_nb_deviance(stotal-j, mu2, phi2)) {
+ currentp+=dnbinom(j, r1, p, 0) * dnbinom(stotal-j, r2, p, 0);
+ } else { break; }
+ ++j;
}
-
- double& p_out=(p_ptr[i]=0);
- const int& including=(other_is_up ? s1 : s2);
- /* We check if the mu*disp product is large enough for us to use the fact that
- * the NB distribution is well approximated by the Gamma. This means that the
- * conditional NB distribution can then be approximated by the Beta distribution.
- * Note that we only have to check one of them, because mu1*disp2=mu2*disp2=total*disp.
- */
- if (mu1/size1 > big_count) {
- const double alpha1=mu1/(1+mu1/size1), alpha2=n2/n1*alpha1;
- const double& left_alpha=(other_is_up ? alpha1 : alpha2);
- const double& right_alpha=(other_is_up ? alpha2 : alpha1);
- p_out=pbeta(including/total, left_alpha, right_alpha, 1, 0)
- +pbeta((x+0.5)/total, right_alpha, left_alpha, 1, 0);
- continue;
+ // Going from the right, or what's left of it.
+ for (int k=0; k<=stotal-j; ++k) {
+ if (obsdev <= compute_unit_nb_deviance(k, mu2, phi2)+compute_unit_nb_deviance(stotal-k, mu1, phi1)) {
+ currentp+=dnbinom(k, r2, p, 0) * dnbinom(stotal-k, r1, p, 0);
+ } else { break; }
}
- /* We use lbeta to avoid over/underflow problems resulting from beta.
- * These go away with lbeta because we end up subtracting by the divisor.
- * The price is some loss of precision as the exponent is moved around.
- * However, the number of digits lost is usually small (~3 for
- * the limit of the double datatype). When it gets large, the non-logged
- * version wouldn't be able to handle it anyway, so it's an okay price to pay.
- */
- const double divisor=lbeta(size1, size2);
-
- /* If the counts are small enough, we iterate. We bascially go through and include
- * our lower partitions on the side of the observed partition and including the observed
- * partition (hence the +0.5 in the 'including' definition). We do the same for the
- * 'other' side, but we ignore the closest integer for now.
- */
- double mult=1;
- for (int j=0; j<=including; ++j) {
- p_out+=std::exp(lbeta(j+left_size, total+right_size-j)-divisor)*mult;
- mult*=(total-j)/(j+1.0);
- }
- mult=1;
- for (int j=0; j<x-0.5; ++j) {
- p_out+=std::exp(lbeta(j+right_size, total+left_size-j)-divisor)*mult;
- mult*=(total-j)/(j+1.0);
- }
- // We now examine the closest integer. We hold off until this point just to check
- // that it indeed has higher deviance (to protect against NR inaccuracy).
- const double new_x=std::floor(x+0.5);
- if (nbdev(new_x, right_mu, right_size)+nbdev(total-new_x, left_mu, left_size) > threshold_dev) {
- p_out+=std::exp(lbeta(new_x+right_size, total+left_size-new_x)-divisor)*mult;
- }
- }
+ const double totalr=r1+r2;
+ currentp /= dnbinom(stotal, totalr, totalr/(totalr+mu1+mu2), 0);
+ }
} catch (std::exception& e) {
UNPROTECT(1);
throw;
diff --git a/src/R_levenberg.cpp b/src/R_levenberg.cpp
index 927cfa3..9185a45 100644
--- a/src/R_levenberg.cpp
+++ b/src/R_levenberg.cpp
@@ -1,18 +1,30 @@
#include "glm.h"
+#include "matvec_check.h"
extern "C" {
-SEXP R_levenberg (SEXP nlib, SEXP ntag, SEXP design, SEXP counts, SEXP disp, SEXP offset, SEXP beta, SEXP fitted, SEXP tol, SEXP maxit) try {
+SEXP R_levenberg (SEXP nlib, SEXP ntag, SEXP design, SEXP counts, SEXP disp, SEXP offset, SEXP weights,
+ SEXP beta, SEXP fitted, SEXP tol, SEXP maxit) try {
if (!IS_NUMERIC(design)) { throw std::runtime_error("design matrix should be double precision"); }
- if (!IS_NUMERIC(counts)) { throw std::runtime_error("count matrix should be double precision"); }
if (!IS_NUMERIC(disp)) { throw std::runtime_error("dispersion vector should be double precision"); }
- if (!IS_NUMERIC(offset)) { throw std::runtime_error("offset matrix should be double precision"); }
if (!IS_NUMERIC(beta)) { throw std::runtime_error("matrix of start values for coefficients should be double precision"); }
if (!IS_NUMERIC(fitted)) { throw std::runtime_error("matrix of starting fitted values should be double precision"); }
-
- // Getting and checking the dimensions of the arguments.
const int num_tags=INTEGER_VALUE(ntag);
const int num_libs=INTEGER_VALUE(nlib);
+
+ // Checking the count matrix.
+ const double *cdptr=NULL;
+ const int* ciptr=NULL;
+ double* count_ptr=(double*)R_alloc(num_libs, sizeof(double));
+ bool is_integer=IS_INTEGER(counts);
+ if (is_integer) {
+ ciptr=INTEGER_POINTER(counts);
+ } else {
+ if (!IS_NUMERIC(counts)) { throw std::runtime_error("count matrix must be integer or double-precision"); }
+ cdptr=NUMERIC_POINTER(counts);
+ }
+
+ // Getting and checking the dimensions of the arguments.
const int dlen=LENGTH(design);
const int clen=LENGTH(counts);
if (dlen%num_libs!=0) { throw std::runtime_error("size of design matrix is incompatible with number of libraries"); }
@@ -25,13 +37,17 @@ SEXP R_levenberg (SEXP nlib, SEXP ntag, SEXP design, SEXP counts, SEXP disp, SEX
throw std::runtime_error("dimensions of the fitted matrix do not match those of the count matrix");
} else if (LENGTH(disp)!=num_tags) {
throw std::runtime_error("length of dispersion vector must be equal to the number of tags");
- } else if (LENGTH(offset)!=clen) {
- throw std::runtime_error("dimensions of offset matrix must match that of the count matrix");
- }
+ }
// Initializing pointers to the assorted features.
- double* beta_ptr=NUMERIC_POINTER(beta), *design_ptr=NUMERIC_POINTER(design), *count_ptr=NUMERIC_POINTER(counts),
- *fitted_ptr=NUMERIC_POINTER(fitted), *offset_ptr=NUMERIC_POINTER(offset), *disp_ptr=NUMERIC_POINTER(disp);
+ const double* beta_ptr=NUMERIC_POINTER(beta),
+ *design_ptr=NUMERIC_POINTER(design),
+ *fitted_ptr=NUMERIC_POINTER(fitted),
+ *disp_ptr=NUMERIC_POINTER(disp);
+ matvec_check allo(num_libs, num_tags, offset, true, "offset", false);
+ const double* const* optr2=allo.access();
+ matvec_check allw(num_libs, num_tags, weights, true, "weight", true);
+ const double* const* wptr2=allw.access();
// Initializing output cages.
SEXP output=PROTECT(NEW_LIST(5));
@@ -49,21 +65,37 @@ SEXP R_levenberg (SEXP nlib, SEXP ntag, SEXP design, SEXP counts, SEXP disp, SEX
// Running through each tag and fitting the NB GLM.
glm_levenberg glbg(num_libs, num_coefs, design_ptr, INTEGER_VALUE(maxit), NUMERIC_VALUE(tol));
for (int tag=0; tag<num_tags; ++tag) {
+
+ // Copying integer/double counts to a new vector.
+ if (is_integer) {
+ for (int i=0; i<num_libs; ++i) { count_ptr[i]=double(ciptr[i]); }
+ ciptr+=num_libs;
+ } else {
+ for (int i=0; i<num_libs; ++i) { count_ptr[i]=cdptr[i]; }
+ cdptr+=num_libs;
+ }
+
// Copying elements to the new_beta and new_fitted, so output is automatically stored.
for (int i=0; i<num_libs; ++i) { new_fitted_ptr[i]=fitted_ptr[i]; }
for (int i=0; i<num_coefs; ++i) { new_beta_ptr[i]=beta_ptr[i]; }
- if (glbg.fit(offset_ptr, count_ptr, *disp_ptr, new_fitted_ptr, new_beta_ptr)) {
+ if (glbg.fit(*optr2, count_ptr,
+#ifdef WEIGHTED
+ *wptr2,
+#endif
+ *disp_ptr, new_fitted_ptr, new_beta_ptr)) {
std::stringstream errout;
errout<< "solution using Cholesky decomposition failed for tag " << tag+1;
throw std::runtime_error(errout.str());
}
- offset_ptr+=num_libs;
- count_ptr+=num_libs;
+ allo.advance();
+ allw.advance();
+
++disp_ptr;
fitted_ptr+=num_libs;
new_fitted_ptr+=num_libs;
beta_ptr+=num_coefs;
new_beta_ptr+=num_coefs;
+
*(dev_ptr++)=glbg.get_deviance();
*(iter_ptr++)=glbg.get_iterations();
*(fail_ptr++)=glbg.is_failure();
diff --git a/src/R_one_group.cpp b/src/R_one_group.cpp
index 5f61ce0..a1bffd5 100644
--- a/src/R_one_group.cpp
+++ b/src/R_one_group.cpp
@@ -1,8 +1,9 @@
#include "glm.h"
+#include "matvec_check.h"
extern "C" {
-SEXP R_one_group (SEXP nt, SEXP nl, SEXP y, SEXP disp, SEXP offsets, SEXP max_iterations, SEXP tolerance) try {
+SEXP R_one_group (SEXP nt, SEXP nl, SEXP y, SEXP disp, SEXP offsets, SEXP weights, SEXP max_iterations, SEXP tolerance) try {
const int num_tags=INTEGER_VALUE(nt);
const int num_libs=INTEGER_VALUE(nl);
if (num_tags*num_libs != LENGTH(y) ) { throw std::runtime_error("dimensions of the count table are not as specified"); } // Checking that it is an exact division.
@@ -10,22 +11,23 @@ SEXP R_one_group (SEXP nt, SEXP nl, SEXP y, SEXP disp, SEXP offsets, SEXP max_it
const int maxit=INTEGER_VALUE(max_iterations);
const double tol=NUMERIC_VALUE(tolerance);
if (!IS_NUMERIC(disp)) { throw std::runtime_error("dispersion vector must be double precision"); }
- if (!IS_NUMERIC(offsets)) { throw std::runtime_error("offset matrix/vector must be double precision"); }
-
if (LENGTH(disp)!=num_tags) { throw std::runtime_error("length of dispersion vector must be 1 or equal to the number of tags"); }
- if (LENGTH(offsets)!=num_tags*num_libs) { throw std::runtime_error("dimensions of offset matrix must match that of the count matrix"); }
// Setting up some iterators. We provide some flexibility to detecting numeric-ness.
+ double *ydptr=NULL;
+ int* yiptr=NULL;
+ double* yptr=(double*)R_alloc(num_libs, sizeof(double));
bool is_integer=IS_INTEGER(y);
- double *ydptr=0;
- int* yiptr=0;
if (is_integer) {
yiptr=INTEGER_POINTER(y);
- ydptr=(double*) R_alloc(num_libs, sizeof(double));
} else {
+ if (!IS_NUMERIC(y)) { throw std::runtime_error("count matrix must be integer or double-precision"); }
ydptr=NUMERIC_POINTER(y);
}
- double* optr=NUMERIC_POINTER(offsets);
+ matvec_check allo(num_libs, num_tags, offsets, false, "offset", false);
+ const double* const* optr2=allo.access();
+ matvec_check allw(num_libs, num_tags, weights, false, "weight", true);
+ const double* const* wptr2=allw.access();
double* dptr=NUMERIC_POINTER(disp);
// Setting up beta for output.
@@ -37,19 +39,29 @@ SEXP R_one_group (SEXP nt, SEXP nl, SEXP y, SEXP disp, SEXP offsets, SEXP max_it
try {
// Iterating through tags and fitting.
+ int counter=0;
for (int tag=0; tag<num_tags; ++tag) {
+ counter=0;
if (is_integer) {
- for (int i=0; i<num_libs; ++i) { ydptr[i]=yiptr[i]; }
- yiptr+=num_libs;
+ for (int i=0; i<num_libs; ++i, counter+=num_tags) { yptr[i]=double(yiptr[counter]); }
+ ++yiptr;
+ } else {
+ for (int i=0; i<num_libs; ++i, counter+=num_tags) { yptr[i]=ydptr[counter]; }
+ ++ydptr;
}
- std::pair<double, bool> out=glm_one_group(num_libs, maxit, tol, optr, ydptr, *dptr);
+ std::pair<double, bool> out=glm_one_group(num_libs, maxit, tol, *optr2,
+#ifdef WEIGHTED
+ *wptr2,
+#endif
+ yptr, *dptr);
+
(*bptr)=out.first;
(*cptr)=out.second;
- if (!is_integer) { ydptr+=num_libs; }
- optr+=num_libs;
++bptr;
++cptr;
++dptr;
+ allo.advance();
+ allw.advance();
}
} catch (std::exception& e) {
UNPROTECT(1);
diff --git a/src/R_process_hairpin_reads.c b/src/R_process_hairpin_reads.c
new file mode 100644
index 0000000..2afb799
--- /dev/null
+++ b/src/R_process_hairpin_reads.c
@@ -0,0 +1,571 @@
+#include <R.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <ctype.h>
+#include <time.h>
+
+#define MAX_BARCODE 1000
+#define MAX_HAIRPIN 10000
+#define SEQ_LEN 100
+#define BLOCKSIZE 10000000
+
+typedef struct {
+ char *sequence;
+ int original_pos;
+} a_barcode;
+
+typedef struct {
+ char *sequence;
+ int original_pos;
+ long count;
+} a_hairpin;
+
+a_barcode *barcodes[MAX_BARCODE];
+a_hairpin *hairpins[MAX_HAIRPIN];
+
+int num_barcode;
+int num_hairpin;
+long num_read;
+long summary[MAX_HAIRPIN][MAX_BARCODE];
+int barcode_start;
+int barcode_end;
+int barcode_length;
+int hairpin_start;
+int hairpin_end;
+int hairpin_length;
+int allow_shifting;
+int shifting_n_base;
+int allow_mismatch;
+int num_mismatch_hairpin;
+int barcode_n_mismatch;
+int hairpin_n_mismatch;
+int allow_shifted_mismatch;
+int isverbose;
+
+long barcodecount;
+long hairpincount;
+long bchpcount;
+
+a_hairpin *mismatch_hairpins[MAX_HAIRPIN];
+int *barcodeindex;
+int *hairpinindex;
+
+void
+Read_In_Barcodes(char* filename){
+ FILE *fin;
+ char * line = NULL;
+ size_t len = 1000;
+ char *readline;
+
+ fin = fopen(filename,"r");
+ line = (char *)malloc(len+1);
+ a_barcode *new_barcode;
+
+ int count = 0;
+ while ((readline = fgets(line, len, fin)) != NULL){
+ count++;
+ new_barcode = (a_barcode *)malloc(sizeof(a_barcode));
+ new_barcode->sequence = (char *)malloc(SEQ_LEN * sizeof(char));
+
+ new_barcode->original_pos = count;
+ strncpy(new_barcode->sequence, line, barcode_length);
+ barcodes[count] = new_barcode;
+ }
+ fclose(fin);
+ num_barcode = count;
+ free(line);
+ Rprintf(" -- Number of Barcodes : %d\n", num_barcode);
+}
+
+int
+locate_barcode(char *a_barcode){
+ int imin, imax, imid;
+ imin = 1;
+ imax = num_barcode;
+
+ while (imax >= imin) {
+ imid = (imax + imin) / 2;
+
+ if (strncmp(barcodes[imid]->sequence, a_barcode, barcode_length) < 0)
+ imin = imid + 1;
+ else if (strncmp(barcodes[imid]->sequence, a_barcode, barcode_length) > 0)
+ imax = imid - 1;
+ else
+ return barcodes[imid]->original_pos;
+ }
+ return -1;
+}
+
+int
+locate_hairpin(char *a_hairpin){
+ int imin, imax, imid;
+ imin = 1;
+ imax = num_hairpin;
+
+ while (imax >= imin) {
+ imid = (imax + imin) / 2;
+
+ if (strncmp(hairpins[imid]->sequence, a_hairpin, hairpin_length) < 0)
+ imin = imid + 1;
+ else if (strncmp(hairpins[imid]->sequence, a_hairpin, hairpin_length) > 0)
+ imax = imid - 1;
+ else
+ return hairpins[imid]->original_pos;
+ }
+ return -1;
+}
+
+int
+Valid_Match(char *sequence1, char *sequence2, int length, int threshold){
+ int i_base;
+ int mismatchbasecount = 0;
+ for (i_base = 0; i_base < length; i_base++) {
+ if (sequence1[i_base] != sequence2[i_base])
+ mismatchbasecount++;
+ }
+ if (mismatchbasecount <= threshold)
+ return 1;
+ else
+ return -1;
+}
+
+int
+locate_mismatch_barcode(char *a_barcode){
+ int i;
+ int match_index = -1;
+ for (i = 1; i <= num_barcode; i++){
+ if (Valid_Match(a_barcode, barcodes[i]->sequence, barcode_length, barcode_n_mismatch) > 0) {
+ match_index = barcodes[i]->original_pos;
+ break;
+ }
+ }
+ return match_index;
+}
+
+int
+locate_mismatch_hairpin(char *a_hairpin){
+ int i;
+ int match_index = -1;
+ for (i = 1; i <= num_mismatch_hairpin; i++){
+ if (Valid_Match(a_hairpin, mismatch_hairpins[i]->sequence, hairpin_length, hairpin_n_mismatch) > 0) {
+ match_index = mismatch_hairpins[i]->original_pos;
+ break;
+ }
+ }
+ return match_index;
+}
+
+
+void
+Sort_Barcodes(void){
+ int i, j;
+ a_barcode *temp;
+ for(i = 1; i < num_barcode; i++){
+ for(j = i+1; j <= num_barcode; j++){
+ if (strcmp(barcodes[i]->sequence, barcodes[j]->sequence) > 0){
+ temp = barcodes[i];
+ barcodes[i] = barcodes[j];
+ barcodes[j] = temp;
+ }
+ }
+ }
+}
+
+void
+Read_In_Hairpins(char *filename){
+ FILE *fin;
+ char * line = NULL;
+ size_t len = 1000;
+ char *readline;
+
+ fin = fopen(filename,"r");
+ line = (char *)malloc(len+1);
+ a_hairpin *new_hairpin;
+
+ int count = 0;
+ while ((readline = fgets(line, len, fin)) != NULL){
+ count++;
+ new_hairpin = (a_hairpin *)malloc(sizeof(a_hairpin));
+ new_hairpin->sequence = (char *)malloc(SEQ_LEN * sizeof(char));
+ new_hairpin->original_pos = count;
+ new_hairpin->count = 0;
+ strncpy(new_hairpin->sequence, line, hairpin_length);
+ hairpins[count] = new_hairpin;
+ }
+ fclose(fin);
+ num_hairpin = count;
+ free(line);
+ Rprintf(" -- Number of Hairpins : %d\n", num_hairpin);
+}
+
+void
+Sort_Hairpins(void){
+ int i, j;
+ a_hairpin *temp;
+ for(i = 1; i < num_hairpin; i++){
+ for(j = i+1; j <= num_hairpin; j++){
+ if (strcmp(hairpins[i]->sequence, hairpins[j]->sequence) > 0){
+ temp = hairpins[i];
+ hairpins[i] = hairpins[j];
+ hairpins[j] = temp;
+ }
+ }
+ }
+}
+
+long Count_Reads(char *filename) {
+ FILE *freads = fopen(filename, "r");
+ char * line = NULL;
+ char *readline;
+ size_t len = 1000;
+ line = (char *)malloc(sizeof(char) * (len + 1));
+ if(freads == NULL){
+ fclose(freads);
+ return 0;
+ }
+ long line_count = 0;
+ while ((readline = fgets(line, len, freads)) != NULL){
+ line_count++;
+ }
+ fclose(freads);
+ free(line);
+ return line_count / 4;
+}
+
+
+void
+Process_Hairpin_Reads(char *filename){
+ FILE *fin;
+ char * line = NULL;
+ size_t len = 1000;
+ char *readline;
+ long num_read_thisfile = 0;
+
+ fin = fopen(filename,"r");
+ line = (char *)malloc(sizeof(char) * (len+1));
+
+ if (isverbose)
+ Rprintf("Processing reads in %s.\n", filename);
+
+ char * this_barcode;
+ char * this_hairpin;
+ this_barcode = (char *)malloc(SEQ_LEN * sizeof(char));
+ this_hairpin = (char *)malloc(SEQ_LEN * sizeof(char));
+
+ long line_count = 0;
+
+ int barcode_index;
+ int hairpin_index;
+
+ while ((readline = fgets(line, len, fin)) != NULL){
+ line_count++;
+ if ((line_count % 4) != 2)
+ continue;
+
+ if ((isverbose) && (num_read_thisfile % BLOCKSIZE == 0))
+ Rprintf(" -- Processing %d million reads\n", (num_read_thisfile / BLOCKSIZE + 1) * 10);
+ num_read++;
+ num_read_thisfile++;
+ strncpy(this_barcode, line + barcode_start - 1, barcode_length);
+ barcode_index = locate_barcode(this_barcode);
+
+ strncpy(this_hairpin, line + hairpin_start - 1, hairpin_length);
+ hairpin_index = locate_hairpin(this_hairpin);
+
+ if ((hairpin_index <= 0) && (allow_shifting > 0)){
+ // Check if given hairpin can be mapped to a shifted location.
+ int index;
+ // check shifting leftwards
+ for (index = 1; index <= shifting_n_base; index++){
+ strncpy(this_hairpin, line + hairpin_start - 1 - index, hairpin_length);
+ hairpin_index = locate_hairpin(this_hairpin);
+ if (hairpin_index > 0)
+ break;
+ }
+ // check shifting rightwards
+ if (hairpin_index <= 0){
+ for (index = 1; index <= shifting_n_base; index++){
+ strncpy(this_hairpin, line + hairpin_start - 1 + index, hairpin_length);
+ hairpin_index = locate_hairpin(this_hairpin);
+ if (hairpin_index > 0)
+ break;
+ }
+ }
+ }
+
+ if (barcode_index > 0)
+ barcodecount++;
+
+ if (hairpin_index > 0){
+ hairpincount++;
+ hairpins[hairpin_index]->count++;
+ }
+
+ if ((barcode_index > 0) && (hairpin_index > 0)) {
+ summary[hairpin_index][barcode_index]++;
+ bchpcount++;
+ }
+
+ barcodeindex[num_read] = barcode_index;
+ hairpinindex[num_read] = hairpin_index;
+ }
+ if (isverbose)
+ Rprintf("Number of reads in file %s : %ld\n", filename, num_read_thisfile);
+ fclose(fin);
+ free(line);
+ free(this_barcode);
+ free(this_hairpin);
+}
+
+void
+Create_Mismatch_Hairpins_List(void){
+ int i;
+ num_mismatch_hairpin = 0;
+
+ for (i = 1; i <= num_hairpin; i++){
+ if (hairpins[i]->count == 0){
+ num_mismatch_hairpin++;
+ mismatch_hairpins[num_mismatch_hairpin] = hairpins[i];
+ }
+ }
+ Rprintf("\nThere are %d hairpins without exact sequence match.\n", num_mismatch_hairpin);
+}
+
+
+void
+Process_Mismatch(char *filename){
+
+ FILE *fin;
+ char * line = NULL;
+ size_t len = 1000;
+ char *readline;
+ long num_read_thisfile = 0;
+
+ fin = fopen(filename,"r");
+ line = (char *)malloc(len+1);
+
+ if (isverbose)
+ Rprintf("Processing reads in %s, considering sequence mismatch. \n", filename);
+
+ char * this_hairpin;
+ char * this_barcode;
+ this_hairpin = (char *)malloc(SEQ_LEN * sizeof(char));
+ this_barcode = (char *)malloc(SEQ_LEN * sizeof(char));
+
+ long line_count = 0;
+
+ int new_barcode_index;
+ int new_hairpin_index;
+ while ((readline = fgets(line, len, fin)) != NULL){
+ line_count++;
+ if ((line_count % 4) != 2)
+ continue;
+
+ if ((isverbose) && (num_read_thisfile % BLOCKSIZE == 0))
+ Rprintf(" -- Processing %d million reads\n", (num_read_thisfile / BLOCKSIZE + 1) * 10);
+ num_read++;
+ num_read_thisfile++;
+
+ // only re-process reads withougt perfect hairpin match or without perfect barcode match;
+ if ((hairpinindex[num_read] > 0) && (barcodeindex[num_read] > 0))
+ continue;
+
+ // re-match barcode:
+ if (barcodeindex[num_read] <= 0){
+ strncpy(this_barcode, line + barcode_start - 1, barcode_length);
+ new_barcode_index = locate_mismatch_barcode(this_barcode);
+ if (new_barcode_index > 0)
+ barcodecount++;
+ } else {
+ new_barcode_index = barcodeindex[num_read];
+ }
+
+ // re-match hairpin:
+ if (hairpinindex[num_read] <= 0){
+ strncpy(this_hairpin, line + hairpin_start - 1, hairpin_length);
+ new_hairpin_index = locate_mismatch_hairpin(this_hairpin);
+ if ((new_hairpin_index <= 0) && (allow_shifting > 0) && (allow_shifted_mismatch > 0)){
+ // Check if given hairpin can be mapped to a shifted location.
+ int index;
+ // check shifting leftwards
+ for (index = 1; index <= shifting_n_base; index++){
+ strncpy(this_hairpin, line + hairpin_start - 1 - index, hairpin_length);
+ new_hairpin_index = locate_mismatch_hairpin(this_hairpin);
+ if (new_hairpin_index > 0)
+ break;
+ }
+ // check shifting rightwards
+ if (new_hairpin_index <= 0){
+ for (index = 1; index <= shifting_n_base; index++){
+ strncpy(this_hairpin, line + hairpin_start - 1 + index, hairpin_length);
+ new_hairpin_index = locate_mismatch_hairpin(this_hairpin);
+ if (new_hairpin_index > 0)
+ break;
+ }
+ }
+ }
+ if (new_hairpin_index > 0)
+ hairpincount++;
+ } else {
+ new_hairpin_index = hairpinindex[num_read];
+ }
+
+ if ((new_barcode_index > 0) && (new_hairpin_index > 0)) {
+ summary[new_hairpin_index][new_barcode_index]++;
+ bchpcount++;
+ }
+
+ }
+ fclose(fin);
+ free(line);
+ free(this_barcode);
+ free(this_hairpin);
+}
+
+
+void
+Initialise(int barcodestart, int barcodeend, int hairpinstart, int hairpinend,
+ int allowshifting, int shiftingnbase,
+ int allowMismatch, int barcodemismatch, int hairpinmismatch,
+ int allowShiftedMismatch, int verbose){
+ int i, j;
+ for(i = 0; i < MAX_HAIRPIN; i++) {
+ for(j = 0; j < MAX_BARCODE; j++) {
+ summary[i][j] = 0;
+ }
+ }
+ num_barcode = 0;
+ num_hairpin = 0;
+
+ barcode_start = barcodestart;
+ barcode_end = barcodeend;
+ hairpin_start = hairpinstart;
+ hairpin_end = hairpinend;
+ barcode_length = barcode_end - barcode_start + 1;
+ hairpin_length = hairpin_end - hairpin_start + 1;
+
+ allow_shifting = allowshifting;
+ shifting_n_base = shiftingnbase;
+ allow_mismatch = allowMismatch;
+ barcode_n_mismatch = barcodemismatch;
+ hairpin_n_mismatch = hairpinmismatch;
+ allow_shifted_mismatch = allowShiftedMismatch;
+ isverbose = verbose;
+
+ num_read = 0;
+ barcodecount = 0;
+ hairpincount = 0;
+ bchpcount = 0;
+}
+
+void
+Output_Summary_Table(char *output){
+ int i, j;
+ FILE *fout;
+ fout = fopen(output, "w");
+ for(i = 1; i <= num_hairpin; i++) {
+ fprintf(fout, "%ld", summary[i][1]);
+ for(j = 2; j <= num_barcode; j++) {
+ fprintf(fout, "\t%ld", summary[i][j]);
+ }
+ fprintf(fout, "\n");
+ }
+ fclose(fout);
+}
+
+void
+Check_Hairpins(void){
+ int p, q;
+ char base;
+ for(p = 1; p <= num_hairpin; p++){
+ for(q = 0; q < hairpin_length; q++){
+ base = hairpins[p]->sequence[q];
+ if ((base != 'A') && (base != 'T') && (base != 'G') && (base != 'C')){
+ Rprintf("Hairpin no.%d: %s contains invalid base %c\n", p, hairpins[p]->sequence, base);
+ }
+ }
+ }
+}
+
+void
+Clean_Up(void){
+ int index;
+ for (index = 1; index <= num_barcode; index++){
+ free(barcodes[index]->sequence);
+ free(barcodes[index]);
+ }
+ for (index = 1; index <= num_hairpin; index++){
+ free(hairpins[index]->sequence);
+ free(hairpins[index]);
+ }
+}
+
+void
+processHairpinReads(char **file, int *filecount,
+ char**barcodeseqs, char**hairpinseqs,
+ int *barcodestart, int *barcodeend, int *hairpinstart, int *hairpinend,
+ int *allowShifting, int *shiftingnbase,
+ int *allowMismatch, int *barcodemismatch, int *hairpinmismatch,
+ int *allowShiftedMismatch,
+ char **output, int *verbose)
+{
+ Initialise(*barcodestart, *barcodeend, *hairpinstart, *hairpinend,
+ *allowShifting, *shiftingnbase,
+ *allowMismatch, *barcodemismatch, *hairpinmismatch,
+ *allowShiftedMismatch, *verbose);
+
+ Read_In_Barcodes(*barcodeseqs);
+ Sort_Barcodes();
+
+ Read_In_Hairpins(*hairpinseqs);
+ Check_Hairpins();
+ Sort_Hairpins();
+
+ long totalreads;
+ int i_file;
+ totalreads = 0;
+ for (i_file = 0; i_file < *filecount; i_file++){
+ totalreads = totalreads + Count_Reads(file[i_file]);
+ }
+ barcodeindex = (int *)malloc(sizeof(int) * totalreads);
+ hairpinindex = (int *)malloc(sizeof(int) * totalreads);
+ for (i_file = 0; i_file < *filecount; i_file++){
+ Process_Hairpin_Reads(file[i_file]);
+ }
+
+ if (allow_mismatch > 0){
+ num_read = 0;
+ // reset total number of read to 0, recheck initial barcode/hairpin index
+ Create_Mismatch_Hairpins_List();
+ if (num_mismatch_hairpin > 0){
+ for (i_file = 0; i_file < *filecount; i_file++){
+ Process_Mismatch(file[i_file]);
+ }
+ }
+ }
+
+ Rprintf("\nThe input run parameters are: \n");
+ Rprintf(" -- Barcode: start position %d\t end position %d\t length %d\n", barcode_start, barcode_end, barcode_length);
+ Rprintf(" -- Hairpin: start position %d\t end position %d\t length %d\n", hairpin_start, hairpin_end, hairpin_length);
+ if (allow_shifting) {
+ Rprintf(" -- Allow hairpin sequences to be matched to a shifted position, <= %d base left or right of the specified positions. \n", shifting_n_base);
+ } else {
+ Rprintf(" -- Hairpin sequences need to match at specified positions. \n");
+ }
+ if (allow_mismatch) {
+ Rprintf(" -- Allow sequence mismatch, <= %d base in barcode sequence and <= %d base in hairpin sequence. \n", barcode_n_mismatch, hairpin_n_mismatch );
+ } else {
+ Rprintf(" -- Mismatch in barcode/hairpin sequences not allowed. \n");
+ }
+
+ Rprintf("\nTotal number of read is %ld \n", num_read);
+ Rprintf("There are %ld reads (%.4f percent) with barcode matches\n", barcodecount, 100.0*barcodecount/num_read);
+ Rprintf("There are %ld reads (%.4f percent) with hairpin matches\n", hairpincount, 100.0*hairpincount/num_read);
+ Rprintf("There are %ld reads (%.4f percent) with both barcode and hairpin matches\n", bchpcount, 100.0*bchpcount/num_read);
+
+ Output_Summary_Table(*output);
+ free(barcodeindex);
+ free(hairpinindex);
+}
+
diff --git a/src/core/adj_coxreid.cpp b/src/adj_coxreid.cpp
similarity index 100%
rename from src/core/adj_coxreid.cpp
rename to src/adj_coxreid.cpp
diff --git a/src/core/fmm_spline.c b/src/fmm_spline.c
similarity index 100%
rename from src/core/fmm_spline.c
rename to src/fmm_spline.c
diff --git a/src/core/glm.h b/src/glm.h
similarity index 69%
rename from src/core/glm.h
rename to src/glm.h
index 7f8569d..4b59a72 100644
--- a/src/core/glm.h
+++ b/src/glm.h
@@ -1,15 +1,24 @@
#ifndef GLM_H
#define GLM_H
+#define WEIGHTED 1
#include "utils.h"
-std::pair<double,bool> glm_one_group(const int&, const int&, const double&, const double*, const double*, const double&);
+std::pair<double,bool> glm_one_group(const int&, const int&, const double&, const double*,
+#ifdef WEIGHTED
+ const double*,
+#endif
+ const double*, const double&);
class glm_levenberg {
public:
glm_levenberg(const int&, const int&, const double*, const int&, const double&);
~glm_levenberg();
- int fit(const double*, const double*, const double&, double*, double*);
+ int fit(const double*, const double*,
+#ifdef WEIGHTED
+ const double*,
+#endif
+ const double&, double*, double*);
const bool& is_failure() const;
const int& get_iterations() const;
@@ -33,10 +42,16 @@ private:
int iter;
bool failed;
- double nb_deviance(const double*, const double*, const double&) const;
+ double nb_deviance(const double*, const double*,
+#ifdef WEIGHTED
+ const double*,
+#endif
+ const double&) const;
void autofill(const double*, double*, const double*);
};
+double compute_unit_nb_deviance(double, double, const double&);
+
class adj_coxreid {
public:
adj_coxreid(const int&, const int&, const double*);
diff --git a/src/core/glm_levenberg.cpp b/src/glm_levenberg.cpp
similarity index 84%
rename from src/core/glm_levenberg.cpp
rename to src/glm_levenberg.cpp
index b4c2f3a..4a9788f 100644
--- a/src/core/glm_levenberg.cpp
+++ b/src/glm_levenberg.cpp
@@ -8,26 +8,23 @@
* very big or very small.
*/
-const double one_million=std::pow(10, 6.0), one_millionth=std::pow(10, -6.0);
-const double mildly_low_value=std::pow(10, -8.0), supremely_low_value=std::pow(10, -13.0), ridiculously_low_value=std::pow(10, -100.0);
-
-double glm_levenberg::nb_deviance (const double* y, const double* mu, const double& phi) const {
- double dev=0;
+const double one_millionth=std::pow(10, -6.0);
+const double supremely_low_value=std::pow(10, -13.0), ridiculously_low_value=std::pow(10, -100.0);
+
+double glm_levenberg::nb_deviance (const double* y, const double* mu,
+#ifdef WEIGHTED
+ const double* w,
+#endif
+ const double& phi) const {
+ double tempdev=0;
for (int i=0; i<nlibs; ++i) {
- // We add a small value to protect against zero during division and logging.
- const double& cur_y=(y[i] < mildly_low_value ? mildly_low_value : y[i]);
- const double& cur_mu=(mu[i] < mildly_low_value ? mildly_low_value : mu[i]);
- const double product=cur_mu*phi;
- // Calculating the deviance using either the Poisson (small phi*mu), the Gamma (large) or NB (everything else).
- if (product < one_millionth) {
- dev+=cur_y * std::log(cur_y/cur_mu) - (cur_y - cur_mu);
- } else if (product > one_million) {
- dev+=(cur_y - cur_mu)/cur_mu - std::log(cur_y/cur_mu); // * cur_mu/(1+product);
- } else {
- dev+=cur_y * std::log( cur_y/cur_mu ) + (cur_y + 1/phi) * std::log( (cur_mu + 1/phi)/(cur_y + 1/phi) );
- }
+#ifdef WEIGHTED
+ tempdev+=w[i]*compute_unit_nb_deviance(y[i], mu[i], phi);
+#else
+ tempdev+=compute_unit_nb_deviance(y[i], mu[i], phi);
+#endif
}
- return dev*2;
+ return tempdev;
}
void glm_levenberg::autofill(const double* offset, double* mu, const double* beta) {
@@ -76,7 +73,11 @@ const char normal='n', transposed='t', uplo='U';
const double a=1, b=0;
const int nrhs=1;
-int glm_levenberg::fit(const double* offset, const double* y, const double& disp, double* mu, double* beta) {
+int glm_levenberg::fit(const double* offset, const double* y,
+#ifdef WEIGHTED
+ const double* w,
+#endif
+ const double& disp, double* mu, double* beta) {
// We expect 'mu' and 'beta' to be supplied. We then check the maximum value of the counts.
double ymax=0;
for (int lib=0; lib<nlibs; ++lib) {
@@ -98,7 +99,11 @@ int glm_levenberg::fit(const double* offset, const double* y, const double& disp
* We then proceed to iterating using reweighted least squares.
*/
autofill(offset, mu, beta);
- dev=nb_deviance(y, mu, disp);
+ dev=nb_deviance(y, mu,
+#ifdef WEIGHTED
+ w,
+#endif
+ disp);
double max_info=-1, lambda=0;
while ((++iter) <= maxit) {
@@ -106,10 +111,10 @@ int glm_levenberg::fit(const double* offset, const double* y, const double& disp
/* Here we set up the matrix XtWX i.e. the Fisher information matrix. X is the design matrix and W is a diagonal matrix
* with the working weights for each observation (i.e. library). The working weights are part of the first derivative of
- * the log-likelihood for a given coefficient. When multiplied by two covariates in the design matrix, you get the Fisher
- * information (i.e. variance of the log-likelihood) for that pair. This takes the role of the second derivative of the
- * log-likelihood. The working weights are formed by taking the reciprocal of the product of the variance (in terms of the mean)
- * and the square of the derivative of the link function.
+ * the log-likelihood for a given coefficient, multiplied by any user-specified weights. When multiplied by two covariates
+ * in the design matrix, you get the Fisher information (i.e. variance of the log-likelihood) for that pair. This takes
+ * the role of the second derivative of the log-likelihood. The working weights are formed by taking the reciprocal of the
+ * product of the variance (in terms of the mean) and the square of the derivative of the link function.
*
* We also set up the actual derivative of the log likelihoods in 'dl'. This is done by multiplying each covariate by the
* difference between the mu and observation and dividing by the variance and derivative of the link function. This is
@@ -119,9 +124,14 @@ int glm_levenberg::fit(const double* offset, const double* y, const double& disp
*/
for (int row=0; row<nlibs; ++row) {
const double& cur_mu=mu[row];
- const double denom=1+cur_mu*disp;
+ const double denom=(1+cur_mu*disp);
+#ifdef WEIGHTED
+ const double weight=cur_mu/denom*w[row];
+ const double deriv=(y[row]-cur_mu)/denom*w[row];
+#else
const double weight=cur_mu/denom;
const double deriv=(y[row]-cur_mu)/denom;
+#endif
for (int col=0; col<ncoefs; ++col){
const int index=col*nlibs+row;
wx[index]=design[index]*weight;
@@ -192,7 +202,11 @@ int glm_levenberg::fit(const double* offset, const double* y, const double& disp
* lambda up so we want to retake the step from where we were before). This is why we don't modify the values
* in-place until we're sure we want to take the step.
*/
- const double dev_new=nb_deviance(y, mu_new, disp);
+ const double dev_new=nb_deviance(y, mu_new,
+#ifdef WEIGHTED
+ w,
+#endif
+ disp);
if (dev_new/ymax < supremely_low_value) { low_dev=true; }
if (dev_new <= dev || low_dev) {
for (int i=0; i<ncoefs; ++i) { beta[i]=beta_new[i]; }
diff --git a/src/core/glm_one_group.cpp b/src/glm_one_group.cpp
similarity index 63%
rename from src/core/glm_one_group.cpp
rename to src/glm_one_group.cpp
index eb4d834..02b7e28 100644
--- a/src/core/glm_one_group.cpp
+++ b/src/glm_one_group.cpp
@@ -1,31 +1,51 @@
#include "glm.h"
-std::pair<double,bool> glm_one_group(const int& nlibs, const int& maxit, const double& tolerance, const double* offset, const double* y, const double& disp) {
+std::pair<double,bool> glm_one_group(const int& nlibs, const int& maxit, const double& tolerance, const double* offset,
+#ifdef WEIGHTED
+ const double* weights,
+#endif
+ const double* y,
+ const double& disp) {
/* Setting up initial values for beta as the log of the mean of the ratio of counts to offsets.
* This is the exact solution for the gamma distribution (which is the limit of the NB as
* the dispersion goes to infinity.
*/
bool nonzero=false;
- double cur_beta=0;
+ double cur_beta=0, totweight=0;
for (int j=0; j<nlibs; ++j) {
const double& cur_val=y[j];
if (cur_val>low_value) {
+#ifdef WEIGHTED
+ cur_beta+=cur_val/std::exp(offset[j]) * weights[j];
+#else
cur_beta+=cur_val/std::exp(offset[j]);
+#endif
nonzero=true;
}
+#ifdef WEIGHTED
+ totweight+=weights[j];
+#else
+ ++totweight;
+#endif
}
if (!nonzero) { return std::make_pair(R_NegInf, true); }
// If we can't cop out of it, we'll do Newton-Raphson iterations instead.
bool has_converged=false;
- cur_beta=std::log(cur_beta/nlibs);
+ double dl, info;
+ cur_beta=std::log(cur_beta/totweight);
for (int i=0; i<maxit; ++i) {
- double dl=0, info=0;
+ dl=0;
+ info=0;
for (int j=0; j<nlibs; ++j) {
- const double& cur_val=y[j];
const double mu=std::exp(cur_beta+offset[j]), denominator=1+mu*disp;
- dl+=(cur_val-mu)/denominator;
+#ifdef WEIGHTED
+ dl+=(y[j]-mu)/denominator * weights[j];
+ info+=mu/denominator * weights[j];
+#else
+ dl+=(y[j]-mu)/denominator;
info+=mu/denominator;
+#endif
}
const double step=dl/info;
cur_beta+=step;
diff --git a/src/core/interpolator.cpp b/src/interpolator.cpp
similarity index 100%
rename from src/core/interpolator.cpp
rename to src/interpolator.cpp
diff --git a/src/core/interpolator.h b/src/interpolator.h
similarity index 100%
rename from src/core/interpolator.h
rename to src/interpolator.h
diff --git a/src/matvec_check.cpp b/src/matvec_check.cpp
new file mode 100644
index 0000000..4c2ea17
--- /dev/null
+++ b/src/matvec_check.cpp
@@ -0,0 +1,80 @@
+#include "matvec_check.h"
+
+matvec_check::matvec_check(const int nlib, const int nlen, SEXP incoming, const bool transposed,
+ const char* err, const bool nullok) : mycheck(NULL), temp(NULL), isvec(true), istran(transposed),
+ nl(nlib), nt(nlen), tagdex(0), libdex(0) {
+ // Checking if NULL (and whether it's allowed). If it is, it becomes a vector of 1's.
+ std::stringstream failed;
+ if (incoming==R_NilValue) {
+ if (!nullok) {
+ failed << err << " vector or matrix cannot be null";
+ throw std::runtime_error(failed.str());
+ }
+ temp=new double[nl];
+ for (int i=0; i<nl; ++i) { temp[i]=1; }
+ mycheck=temp;
+ return;
+ }
+
+ if (!IS_NUMERIC(incoming)) {
+ failed << err << " vector or matrix should be double precision";
+ throw std::runtime_error(failed.str());
+ }
+
+ // Checking if it is a vector, matrix or transposed matrix.
+ mycheck=NUMERIC_POINTER(incoming);
+ const int curlen=LENGTH(incoming);
+ if (curlen==0) {
+ // If it's empty, it's treated as a null.
+ if (!nullok) {
+ failed << err << " vector or matrix cannot be empty";
+ throw std::runtime_error(failed.str());
+ }
+ temp=new double[nl];
+ for (int i=0; i<nl; ++i) { temp[i]=1; }
+ mycheck=temp;
+ } else if (curlen!=nl) {
+ isvec=false;
+ if (LENGTH(incoming)!=nl*nlen) {
+ failed << "dimensions of " << err << " matrix are not consistent with number of libraries and tags";
+ throw std::runtime_error(failed.str());
+ }
+ if (!istran) {
+ temp=new double[nl];
+ libdex=0;
+ for (int i=0; i<nl; ++i, libdex+=nt) { temp[i]=mycheck[libdex]; }
+ }
+ } else {
+ // Otherwise, it's all good; we can use the pointer directly if it's a vector.
+ ;
+ }
+ return;
+}
+
+void matvec_check::advance() {
+ if (!isvec) {
+ if (!istran) {
+ // Copying elements to an array if it is not transposed, so each row can be accessed at a pointer.
+ ++mycheck;
+ if ((++tagdex) >= nt) { return; }
+ libdex=0;
+ for (int i=0; i<nl; ++i, libdex+=nt) { temp[i]=mycheck[libdex]; }
+ } else {
+ // Each (original) row is a (transposed) column, so rows can be accessed directly in column-major format.
+ mycheck+=nl;
+ }
+ }
+ return;
+}
+
+const double* const* matvec_check::access() const {
+ if (isvec || istran) {
+ return &mycheck;
+ } else {
+ return &temp;
+ }
+}
+
+matvec_check::~matvec_check() {
+ if (temp!=NULL) { delete [] temp; }
+}
diff --git a/src/matvec_check.h b/src/matvec_check.h
new file mode 100644
index 0000000..17d1b99
--- /dev/null
+++ b/src/matvec_check.h
@@ -0,0 +1,19 @@
+#include "utils.h"
+#ifndef MATVEC_CHECK_H
+#define MATVEC_CHECK_H
+
+class matvec_check {
+public:
+ matvec_check(const int, const int, SEXP, const bool, const char*, const bool);
+ ~matvec_check();
+ void advance();
+ const double* const* access() const;
+private:
+ const double* mycheck;
+ double* temp;
+ bool isvec, istran;
+ const int nl, nt;
+ int tagdex, libdex;
+};
+
+#endif
diff --git a/src/nbdev.cpp b/src/nbdev.cpp
new file mode 100644
index 0000000..b3f7837
--- /dev/null
+++ b/src/nbdev.cpp
@@ -0,0 +1,35 @@
+#include "glm.h"
+
+/* Function to calculate the deviance. Note the protection for very large mu*phi (where we
+ * use a gamma instead) or very small mu*phi (where we use the Poisson instead). This
+ * approximation protects against numerical instability introduced by subtrackting
+ * a very large log value in (log mu) with another very large logarithm (log mu+1/phi).
+ * We need to consider the 'phi' as the approximation is only good when the product is
+ * very big or very small.
+ */
+
+const double one_million=std::pow(10, 6.0), one_tenthousandth=std::pow(10, -4.0);
+const double mildly_low_value=std::pow(10, -8.0);
+
+double compute_unit_nb_deviance (double y, double mu, const double& phi) {
+ // We add a small value to protect against zero during division and logging.
+ y+=mildly_low_value;
+ mu+=mildly_low_value;
+
+ /* Calculating the deviance using either the Poisson (small phi*mu), the Gamma (large) or NB (everything else).
+ * Some additional work is put in to make the transitions between families smooth.
+ */
+ if (phi < one_tenthousandth) {
+ const double resid = y - mu;
+ return 2 * ( y * std::log(y/mu) - resid - 0.5*resid*resid*phi*(1+phi*(2/3*resid-y)) );
+ } else {
+ const double product=mu*phi;
+ if (product > one_million) {
+ return 2 * ( (y - mu)/mu - std::log(y/mu) ) * mu/(1+product);
+ } else {
+ const double invphi=1/phi;
+ return 2 * (y * std::log( y/mu ) + (y + invphi) * std::log( (mu + invphi)/(y + invphi) ) );
+ }
+ }
+}
+
diff --git a/src/core/utils.h b/src/utils.h
similarity index 89%
rename from src/core/utils.h
rename to src/utils.h
index ef16fe0..2d15018 100644
--- a/src/core/utils.h
+++ b/src/utils.h
@@ -1,8 +1,12 @@
#ifndef UTILS_H
#define UTILS_H
+//#define DEBUG
-#include <cmath>
+#ifdef DEBUG
#include <iostream>
+#endif
+
+#include <cmath>
#include <deque>
#include <stdexcept>
#include <sstream>
diff --git a/tests/edgeR-Tests.R b/tests/edgeR-Tests.R
index 6236d85..c6dbb3a 100644
--- a/tests/edgeR-Tests.R
+++ b/tests/edgeR-Tests.R
@@ -95,8 +95,7 @@ fit <- glmFit(d, design, dispersion=dispersion.true, prior.count=0.5/3)
results <- glmLRT(fit, coef=2)
topTags(results)
d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
-glmFit(d,design,dispersion=dispersion.true,method="simple", prior.count=0.5/3)
-glmFit(d,design,dispersion=dispersion.true,method="levenberg", prior.count=0.5/3)
+glmFit(d,design,dispersion=dispersion.true, prior.count=0.5/3)
# Exact tests
y <- matrix(rnbinom(20,mu=10,size=3/2),5,4)
diff --git a/tests/edgeR-Tests.Rout.save b/tests/edgeR-Tests.Rout.save
index 3c78133..bb3f88c 100644
--- a/tests/edgeR-Tests.Rout.save
+++ b/tests/edgeR-Tests.Rout.save
@@ -1,5 +1,5 @@
-R version 3.0.0 (2013-04-03) -- "Masked Marvel"
+R Under development (unstable) (2013-11-30 r64358) -- "Unsuffered Consequences"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: i386-w64-mingw32/i386 (32-bit)
@@ -42,16 +42,16 @@ Loading required package: limma
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450964 13.73750 0.01975954 0.4347099
-Tag.21 -1.7265870 13.38299 0.06131012 0.6744114
-Tag.6 -1.6329986 12.81458 0.12446044 0.8982100
-Tag.2 4.0861092 11.54134 0.16331090 0.8982100
-Tag.16 0.9324996 13.57093 0.29050785 0.9655885
-Tag.20 0.8543138 13.76371 0.31736609 0.9655885
-Tag.12 0.7081170 14.31393 0.37271028 0.9655885
-Tag.19 -0.7976602 13.31402 0.40166354 0.9655885
-Tag.3 -0.7300410 13.54148 0.42139935 0.9655885
-Tag.8 -0.7917906 12.86342 0.47117217 0.9655885
+Tag.17 2.0450964 13.73716 0.01975954 0.4347099
+Tag.21 -1.7265870 13.38314 0.06131012 0.6744114
+Tag.6 -1.6329986 12.81499 0.12446044 0.8982100
+Tag.2 4.0861092 11.54169 0.16331090 0.8982100
+Tag.16 0.9324996 13.57085 0.29050785 0.9655885
+Tag.20 0.8543138 13.76374 0.31736609 0.9655885
+Tag.12 0.7081170 14.31396 0.37271028 0.9655885
+Tag.19 -0.7976602 13.31391 0.40166354 0.9655885
+Tag.3 -0.7300410 13.54143 0.42139935 0.9655885
+Tag.8 -0.7917906 12.86373 0.47117217 0.9655885
>
> d2 <- estimateTagwiseDisp(d,trend="none",prior.df=20)
> summary(d2$tagwise.dispersion)
@@ -61,31 +61,31 @@ Tag.8 -0.7917906 12.86342 0.47117217 0.9655885
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450964 13.73750 0.01975954 0.4347099
-Tag.21 -1.7265870 13.38299 0.06131012 0.6744114
-Tag.6 -1.6329986 12.81458 0.12446044 0.8982100
-Tag.2 4.0861092 11.54134 0.16331090 0.8982100
-Tag.16 0.9324996 13.57093 0.29050785 0.9655885
-Tag.20 0.8543138 13.76371 0.31736609 0.9655885
-Tag.12 0.7081170 14.31393 0.37271028 0.9655885
-Tag.19 -0.7976602 13.31402 0.40166354 0.9655885
-Tag.3 -0.7300410 13.54148 0.42139935 0.9655885
-Tag.8 -0.7917906 12.86342 0.47117217 0.9655885
+Tag.17 2.0450964 13.73716 0.01975954 0.4347099
+Tag.21 -1.7265870 13.38314 0.06131012 0.6744114
+Tag.6 -1.6329986 12.81499 0.12446044 0.8982100
+Tag.2 4.0861092 11.54169 0.16331090 0.8982100
+Tag.16 0.9324996 13.57085 0.29050785 0.9655885
+Tag.20 0.8543138 13.76374 0.31736609 0.9655885
+Tag.12 0.7081170 14.31396 0.37271028 0.9655885
+Tag.19 -0.7976602 13.31391 0.40166354 0.9655885
+Tag.3 -0.7300410 13.54143 0.42139935 0.9655885
+Tag.8 -0.7917906 12.86373 0.47117217 0.9655885
>
> de <- exactTest(d2)
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450987 13.73750 0.01327001 0.2919403
-Tag.21 -1.7265897 13.38299 0.05683886 0.6252275
-Tag.6 -1.6329910 12.81458 0.11460208 0.8404152
-Tag.2 4.0861092 11.54134 0.16126207 0.8869414
-Tag.16 0.9324975 13.57093 0.28103256 0.9669238
-Tag.20 0.8543178 13.76371 0.30234789 0.9669238
-Tag.12 0.7081149 14.31393 0.37917895 0.9669238
-Tag.19 -0.7976633 13.31402 0.40762735 0.9669238
-Tag.3 -0.7300478 13.54148 0.40856822 0.9669238
-Tag.8 -0.7918243 12.86342 0.49005179 0.9669238
+Tag.17 2.0450987 13.73716 0.01327001 0.2919403
+Tag.21 -1.7265897 13.38314 0.05683886 0.6252275
+Tag.6 -1.6329910 12.81499 0.11460208 0.8404152
+Tag.2 4.0861092 11.54169 0.16126207 0.8869414
+Tag.16 0.9324975 13.57085 0.28103256 0.9669238
+Tag.20 0.8543178 13.76374 0.30234789 0.9669238
+Tag.12 0.7081149 14.31396 0.37917895 0.9669238
+Tag.19 -0.7976633 13.31391 0.40762735 0.9669238
+Tag.3 -0.7300478 13.54143 0.40856822 0.9669238
+Tag.8 -0.7918243 12.86373 0.49005179 0.9669238
>
> d2 <- estimateTagwiseDisp(d,trend="movingave",span=0.4,prior.df=20)
> summary(d2$tagwise.dispersion)
@@ -95,16 +95,16 @@ Tag.8 -0.7918243 12.86342 0.49005179 0.9669238
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450951 13.73750 0.02427872 0.5341319
-Tag.21 -1.7265927 13.38299 0.05234833 0.5758316
-Tag.6 -1.6330014 12.81458 0.12846308 0.8954397
-Tag.2 4.0861092 11.54134 0.16280722 0.8954397
-Tag.16 0.9324887 13.57093 0.24308201 0.9711975
-Tag.20 0.8543044 13.76371 0.35534649 0.9711975
-Tag.19 -0.7976535 13.31402 0.38873717 0.9711975
-Tag.3 -0.7300525 13.54148 0.40001438 0.9711975
-Tag.12 0.7080985 14.31393 0.43530228 0.9711975
-Tag.8 -0.7918376 12.86342 0.49782701 0.9711975
+Tag.17 2.0450951 13.73716 0.02427872 0.5341319
+Tag.21 -1.7265927 13.38314 0.05234833 0.5758316
+Tag.6 -1.6330014 12.81499 0.12846308 0.8954397
+Tag.2 4.0861092 11.54169 0.16280722 0.8954397
+Tag.16 0.9324887 13.57085 0.24308201 0.9711975
+Tag.20 0.8543044 13.76374 0.35534649 0.9711975
+Tag.19 -0.7976535 13.31391 0.38873717 0.9711975
+Tag.3 -0.7300525 13.54143 0.40001438 0.9711975
+Tag.12 0.7080985 14.31396 0.43530228 0.9711975
+Tag.8 -0.7918376 12.86373 0.49782701 0.9711975
>
> summary(exactTest(d2,rejection="smallp")$table$PValue)
Min. 1st Qu. Median Mean 3rd Qu. Max.
@@ -116,40 +116,40 @@ Tag.8 -0.7918376 12.86342 0.49782701 0.9711975
> d2 <- estimateTagwiseDisp(d,trend="loess",span=0.8,prior.df=20)
> summary(d2$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1165 0.1449 0.1833 0.1849 0.2116 0.2826
+ 0.1165 0.1448 0.1832 0.1848 0.2116 0.2826
> de <- exactTest(d2)
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450979 13.73750 0.01547929 0.3405443
-Tag.21 -1.7266049 13.38299 0.03544057 0.3898463
-Tag.6 -1.6329841 12.81458 0.10633495 0.7797896
-Tag.2 4.0861092 11.54134 0.16057929 0.8831861
-Tag.16 0.9324935 13.57093 0.26349447 0.9658370
-Tag.20 0.8543140 13.76371 0.31673704 0.9658370
-Tag.19 -0.7976354 13.31402 0.35562850 0.9658370
-Tag.3 -0.7300593 13.54148 0.38831288 0.9658370
-Tag.12 0.7081041 14.31393 0.41512829 0.9658370
-Tag.8 -0.7918152 12.86342 0.48483728 0.9658370
+Tag.17 2.0450979 13.73716 0.01545949 0.3401087
+Tag.21 -1.7266049 13.38314 0.03542770 0.3897047
+Tag.6 -1.6329841 12.81499 0.10631799 0.7796653
+Tag.2 4.0861092 11.54169 0.16058047 0.8831926
+Tag.16 0.9324935 13.57085 0.26348077 0.9658392
+Tag.20 0.8543140 13.76374 0.31676619 0.9658392
+Tag.19 -0.7976354 13.31391 0.35563266 0.9658392
+Tag.3 -0.7300593 13.54143 0.38829455 0.9658392
+Tag.12 0.7081040 14.31396 0.41516367 0.9658392
+Tag.8 -0.7918152 12.86373 0.48481833 0.9658392
>
> d2 <- estimateTagwiseDisp(d,trend="tricube",span=0.8,prior.df=20)
> summary(d2$tagwise.dispersion)
Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1165 0.1449 0.1833 0.1849 0.2116 0.2826
+ 0.1165 0.1448 0.1832 0.1848 0.2116 0.2826
> de <- exactTest(d2)
> topTags(de)
Comparison of groups: 2-1
logFC logCPM PValue FDR
-Tag.17 2.0450979 13.73750 0.01547929 0.3405443
-Tag.21 -1.7266049 13.38299 0.03544057 0.3898463
-Tag.6 -1.6329841 12.81458 0.10633495 0.7797896
-Tag.2 4.0861092 11.54134 0.16057929 0.8831861
-Tag.16 0.9324935 13.57093 0.26349447 0.9658370
-Tag.20 0.8543140 13.76371 0.31673704 0.9658370
-Tag.19 -0.7976354 13.31402 0.35562850 0.9658370
-Tag.3 -0.7300593 13.54148 0.38831288 0.9658370
-Tag.12 0.7081041 14.31393 0.41512829 0.9658370
-Tag.8 -0.7918152 12.86342 0.48483728 0.9658370
+Tag.17 2.0450979 13.73716 0.01545949 0.3401087
+Tag.21 -1.7266049 13.38314 0.03542770 0.3897047
+Tag.6 -1.6329841 12.81499 0.10631799 0.7796653
+Tag.2 4.0861092 11.54169 0.16058047 0.8831926
+Tag.16 0.9324935 13.57085 0.26348077 0.9658392
+Tag.20 0.8543140 13.76374 0.31676619 0.9658392
+Tag.19 -0.7976354 13.31391 0.35563266 0.9658392
+Tag.3 -0.7300593 13.54143 0.38829455 0.9658392
+Tag.12 0.7081040 14.31396 0.41516367 0.9658392
+Tag.8 -0.7918152 12.86373 0.48481833 0.9658392
>
> # mglmOneWay
> design <- model.matrix(~group,data=d$samples)
@@ -178,133 +178,7 @@ $fitted.values
[7,] 7.5 7.5 9.5 9.5
[8,] 7.0 7.0 4.0 4.0
[9,] 4.5 4.5 5.5 5.5
-[10,] 10.0 10.0 13.0 13.0
-
-> mglmOneWay(d[1:10,],design,dispersion=0)
-$coefficients
- (Intercept) group2
- [1,] -1.000000e+08 0.000000e+00
- [2,] -1.000000e+08 1.000000e+08
- [3,] 2.525729e+00 -5.108256e-01
- [4,] 2.525729e+00 1.484200e-01
- [5,] 2.140066e+00 -1.941560e-01
- [6,] 2.079442e+00 -1.163151e+00
- [7,] 2.014903e+00 2.363888e-01
- [8,] 1.945910e+00 -5.596158e-01
- [9,] 1.504077e+00 2.006707e-01
-[10,] 2.302585e+00 2.623643e-01
-
-$fitted.values
- [,1] [,2] [,3] [,4]
- [1,] 0.0 0.0 0.0 0.0
- [2,] 0.0 0.0 2.0 2.0
- [3,] 12.5 12.5 7.5 7.5
- [4,] 12.5 12.5 14.5 14.5
- [5,] 8.5 8.5 7.0 7.0
- [6,] 8.0 8.0 2.5 2.5
- [7,] 7.5 7.5 9.5 9.5
- [8,] 7.0 7.0 4.0 4.0
- [9,] 4.5 4.5 5.5 5.5
-[10,] 10.0 10.0 13.0 13.0
-
->
-> fit <- glmFit(d,design,dispersion=d$common.dispersion,prior.count=0.5/4)
-> lrt <- glmLRT(fit,coef=2)
-> topTags(lrt)
-Coefficient: group2
- logFC logCPM LR PValue FDR
-Tag.17 2.0450964 13.73750 6.0485417 0.01391779 0.3058697
-Tag.2 4.0861092 11.54134 4.8400348 0.02780633 0.3058697
-Tag.21 -1.7265870 13.38299 4.0777825 0.04345065 0.3186381
-Tag.6 -1.6329986 12.81458 3.0078205 0.08286364 0.4557500
-Tag.16 0.9324996 13.57093 1.3477682 0.24566867 0.8276702
-Tag.20 0.8543138 13.76371 1.1890032 0.27553071 0.8276702
-Tag.19 -0.7976602 13.31402 0.9279152 0.33540526 0.8276702
-Tag.12 0.7081170 14.31393 0.9095513 0.34023349 0.8276702
-Tag.3 -0.7300410 13.54148 0.8300307 0.36226364 0.8276702
-Tag.8 -0.7917906 12.86342 0.7830377 0.37621371 0.8276702
->
-> fit <- glmFit(d,design,dispersion=d$common.dispersion,prior.count=0.5)
-> summary(fit$coef)
- (Intercept) group2
- Min. :-7.604 Min. :-1.13681
- 1st Qu.:-4.895 1st Qu.:-0.32341
- Median :-4.713 Median : 0.15083
- Mean :-4.940 Mean : 0.07817
- 3rd Qu.:-4.524 3rd Qu.: 0.35163
- Max. :-4.107 Max. : 1.60864
->
-> fit <- glmFit(d,design,prior.count=0.5/4)
-> lrt <- glmLRT(fit,coef=2)
-> topTags(lrt)
-Coefficient: group2
- logFC logCPM LR PValue FDR
-Tag.17 2.0450964 13.73750 6.0485417 0.01391779 0.3058697
-Tag.2 4.0861092 11.54134 4.8400348 0.02780633 0.3058697
-Tag.21 -1.7265870 13.38299 4.0777825 0.04345065 0.3186381
-Tag.6 -1.6329986 12.81458 3.0078205 0.08286364 0.4557500
-Tag.16 0.9324996 13.57093 1.3477682 0.24566867 0.8276702
-Tag.20 0.8543138 13.76371 1.1890032 0.27553071 0.8276702
-Tag.19 -0.7976602 13.31402 0.9279152 0.33540526 0.8276702
-Tag.12 0.7081170 14.31393 0.9095513 0.34023349 0.8276702
-Tag.3 -0.7300410 13.54148 0.8300307 0.36226364 0.8276702
-Tag.8 -0.7917906 12.86342 0.7830377 0.37621371 0.8276702
->
-> dglm <- estimateGLMCommonDisp(d,design)
-> dglm$common.dispersion
-[1] 0.2033282
-> dglm <- estimateGLMTagwiseDisp(dglm,design,prior.df=20)
-> summary(dglm$tagwise.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1756 0.1879 0.1998 0.2031 0.2135 0.2578
-> fit <- glmFit(dglm,design,prior.count=0.5/4)
-> lrt <- glmLRT(fit,coef=2)
-> topTags(lrt)
-Coefficient: group2
- logFC logCPM LR PValue FDR
-Tag.17 2.0450988 13.73750 6.8001118 0.009115216 0.2005348
-Tag.2 4.0861092 11.54134 4.8594096 0.027495744 0.2872068
-Tag.21 -1.7265904 13.38299 4.2537154 0.039164558 0.2872068
-Tag.6 -1.6329904 12.81458 3.1763761 0.074710253 0.4109064
-Tag.16 0.9324970 13.57093 1.4126709 0.234613511 0.8499599
-Tag.20 0.8543183 13.76371 1.2721097 0.259371274 0.8499599
-Tag.19 -0.7976614 13.31402 0.9190392 0.337727380 0.8499599
-Tag.12 0.7081163 14.31393 0.9014515 0.342392806 0.8499599
-Tag.3 -0.7300488 13.54148 0.8817937 0.347710872 0.8499599
-Tag.8 -0.7918166 12.86342 0.7356185 0.391068048 0.8603497
-> dglm <- estimateGLMTrendedDisp(dglm,design)
-> summary(dglm$trended.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1522 0.1676 0.1740 0.1887 0.1999 0.3471
-> dglm <- estimateGLMTrendedDisp(dglm,design,method="power")
-> summary(dglm$trended.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1522 0.1676 0.1740 0.1887 0.1999 0.3471
-> dglm <- estimateGLMTrendedDisp(dglm,design,method="spline")
-Loading required package: splines
-> summary(dglm$trended.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.0206 0.1010 0.1687 0.1849 0.2445 0.4910
-> dglm <- estimateGLMTrendedDisp(dglm,design,method="bin.spline")
-> summary(dglm$trended.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1997 0.1997 0.1997 0.1997 0.1997 0.1997
-> dglm <- estimateGLMTagwiseDisp(dglm,design,prior.df=20)
-> summary(dglm$tagwise.dispersion)
- Min. 1st Qu. Median Mean 3rd Qu. Max.
- 0.1385 0.1792 0.1964 0.1935 0.2026 0.2709
->
-> # Continuous trend
-> nlibs <- 3
-> ntags <- 1000
-> dispersion.true <- 0.1
-> # Make first transcript respond to covariate x
-> x <- 0:2
-> design <- model.matrix(~x)
-> beta.true <- cbind(Beta1=2,Beta2=c(2,rep(0,ntags-1)))
-> mu.true <- 2^(beta.true %*% t(design))
-> # Generate count data
-> y <- rnbinom(ntags*nlibs,mu=mu.true,size=1/dispersion.true)
+[10,] 10.0 10/dispersion.true)
> y <- matrix(y,ntags,nlibs)
> colnames(y) <- c("x0","x1","x2")
> rownames(y) <- paste("Gene",1:ntags,sep="")
@@ -315,20 +189,19 @@ Loading required package: splines
> topTags(results)
Coefficient: x
logFC logCPM LR PValue FDR
-Gene1 2.907024 13.56183 38.738513 4.845535e-10 4.845535e-07
-Gene61 2.855317 10.27136 10.738307 1.049403e-03 5.247014e-01
-Gene62 -2.123902 10.53174 8.818704 2.981584e-03 8.334758e-01
-Gene134 -1.949073 10.53355 8.125889 4.363759e-03 8.334758e-01
-Gene740 -1.610046 10.94907 8.013408 4.643227e-03 8.334758e-01
-Gene354 2.022698 10.45066 7.826308 5.149116e-03 8.334758e-01
-Gene5 1.856816 10.45249 7.214238 7.232750e-03 8.334758e-01
-Gene746 -1.798331 10.53094 6.846262 8.882690e-03 8.334758e-01
-Gene110 1.623148 10.68607 6.737984 9.438120e-03 8.334758e-01
-Gene383 1.637140 10.75412 6.687530 9.708962e-03 8.334758e-01
+Gene1 2.907024 13.56183 38.738512 4.845536e-10 4.845536e-07
+Gene61 2.855317 10.27136 10.738307 1.049403e-03 5.247015e-01
+Gene62 -2.123902 10.53184 8.818703 2.981585e-03 8.334760e-01
+Gene134 -1.949073 10.53365 8.125889 4.363759e-03 8.334760e-01
+Gene740 -1.610046 10.94916 8.013408 4.643227e-03 8.334760e-01
+Gene354 2.022698 10.45075 7.826308 5.149118e-03 8.334760e-01
+Gene5 1.856816 10.45258 7.214238 7.232750e-03 8.334760e-01
+Gene746 -1.798331 10.53103 6.846262 8.882693e-03 8.334760e-01
+Gene110 1.623148 10.68616 6.737984 9.438120e-03 8.334760e-01
+Gene383 1.637140 10.75421 6.687530 9.708965e-03 8.334760e-01
> d <- estimateGLMCommonDisp(d, design, verbose=TRUE)
Disp = 0.10253 , BCV = 0.3202
-> glmFit(d,design,dispersion=dispersion.true,method="simple", prior.count=0.5/3)
-Loading required package: MASS
+> glmFit(d,design,dispersion=dispersion.true, prior.count=0.5/3)
An object of class "DGEGLM"
$coefficients
(Intercept) x
@@ -339,96 +212,9 @@ Gene4 -7.480255 0.5172002
Gene5 -8.747793 1.2870467
995 more rows ...
-$df.residual
-[1] 1 1 1 1 1
-995 more elements ...
-
-$deviance
-[1] 6.38037582 1.46644949 1.38532340 0.01593969 1.03894514
-995 more elements ...
-
-$design
- (Intercept) x
-1 1 0
-2 1 1
-3 1 2
-attr(,"assign")
-[1] 0 1
-
-$offset
- [,1] [,2] [,3]
-[1,] 8.295172 8.338525 8.284484
-[2,] 8.295172 8.338525 8.284484
-[3,] 8.295172 8.338525 8.284484
-[4,] 8.295172 8.338525 8.284484
-[5,] 8.295172 8.338525 8.284484
-995 more rows ...
-
-$dispersion
-[1] 0.1
-
-$weights
- [,1] [,2] [,3]
-[1,] 1 1 1
-[2,] 1 1 1
-[3,] 1 1 1
-[4,] 1 1 1
-[5,] 1 1 1
-995 more rows ...
-
$fitted.values
x0 x1 x2
-Gene1 2.3569790 18.954451 138.2830865
-Gene2 2.5138459 1.089294 0.4282075
-Gene3 4.1580678 3.750528 3.0689914
-Gene4 2.1012458 3.769592 6.1349943
-Gene5 0.5080376 2.136398 8.1502479
-995 more rows ...
-
-$converged
-[1] TRUE TRUE TRUE TRUE TRUE
-995 more elements ...
-
-$error
-[1] FALSE FALSE FALSE FALSE FALSE
-995 more elements ...
-
-$counts
- x0 x1 x2
-Gene1 0 30 110
-Gene2 2 2 0
-Gene3 3 6 2
-Gene4 2 4 6
-Gene5 1 1 9
-995 more rows ...
-
-$method
-[1] "simple"
-
-$samples
- group lib.size norm.factors
-x0 1 4001 1.0008730
-x1 1 4176 1.0014172
-x2 1 3971 0.9977138
-
-$AveLogCPM
-[1] 13.561832 9.682757 10.447014 10.532113 10.452489
-995 more elements ...
-
-> glmFit(d,design,dispersion=dispersion.true,method="levenberg", prior.count=0.5/3)
-An object of class "DGEGLM"
-$coefficients
- (Intercept) x
-Gene1 -7.391745 2.0149958
-Gene2 -7.318483 -0.7611895
-Gene3 -6.831702 -0.1399478
-Gene4 -7.480255 0.5172002
-Gene5 -8.747793 1.2870467
-995 more rows ...
-
-$fitted.values
- x0 x1 x2
-Gene1 2.3570471 18.954454 138.2791326
+Gene1 2.3570471 18.954454 138.2791328
Gene2 2.5138172 1.089292 0.4282107
Gene3 4.1580452 3.750528 3.0690081
Gene4 2.1012460 3.769592 6.1349937
@@ -436,7 +222,7 @@ Gene5 0.5080377 2.136398 8.1502486
995 more rows ...
$deviance
-[1] 6.38037543 1.46644912 1.38532340 0.01593969 1.03894514
+[1] 6.38037545 1.46644913 1.38532340 0.01593969 1.03894513
995 more elements ...
$iter
@@ -447,6 +233,9 @@ $failed
[1] FALSE FALSE FALSE FALSE FALSE
995 more elements ...
+$method
+[1] "levenberg"
+
$counts
x0 x1 x2
Gene1 0 30 110
@@ -480,9 +269,6 @@ $offset
$dispersion
[1] 0.1
-$method
-[1] "levenberg"
-
$samples
group lib.size norm.factors
x0 1 4001 1.0008730
@@ -490,7 +276,7 @@ x1 1 4176 1.0014172
x2 1 3971 0.9977138
$AveLogCPM
-[1] 13.561832 9.682757 10.447014 10.532113 10.452489
+[1] 13.561832 9.682859 10.447107 10.532113 10.452583
995 more elements ...
>
@@ -525,22 +311,22 @@ $AveLogCPM
> topTags(lrt)
Coefficient: LR test of 2 contrasts
logFC.1 logFC.2 logCPM LR PValue FDR
-1 -7.2381060 -0.0621100 17.20027 10.7712179 0.004582049 0.02291025
-5 -1.6684268 -0.9326507 17.34879 1.7309951 0.420842114 0.90967967
-2 1.2080938 1.0420198 18.24809 1.0496688 0.591653346 0.90967967
-4 0.5416704 -0.1506381 17.59977 0.3958596 0.820427427 0.90967967
-3 0.2876249 -0.2008143 18.02991 0.1893255 0.909679671 0.90967967
+1 -7.2381060 -0.0621100 17.19071 10.7712171 0.004582051 0.02291026
+5 -1.6684268 -0.9326507 17.33529 1.7309951 0.420842115 0.90967967
+2 1.2080938 1.0420198 18.24544 1.0496688 0.591653347 0.90967967
+4 0.5416704 -0.1506381 17.57744 0.3958596 0.820427427 0.90967967
+3 0.2876249 -0.2008143 18.06216 0.1893255 0.909679672 0.90967967
> design <- model.matrix(~0+group)
> fit <- glmFit(y,design,dispersion=2/3,prior.count=0.5/7)
> lrt <- glmLRT(fit,contrast=cbind(c(-1,1,0),c(0,-1,1),c(-1,0,1)))
> topTags(lrt)
Coefficient: LR test of 2 contrasts
logFC.1 logFC.2 logCPM LR PValue FDR
-1 -7.2381060 7.1759960 17.20027 10.7712179 0.004582049 0.02291025
-5 -1.6684268 0.7357761 17.34879 1.7309951 0.420842114 0.90967967
-2 1.2080938 -0.1660740 18.24809 1.0496688 0.591653346 0.90967967
-4 0.5416704 -0.6923084 17.59977 0.3958596 0.820427427 0.90967967
-3 0.2876249 -0.4884392 18.02991 0.1893255 0.909679671 0.90967967
+1 -7.2381060 7.1759960 17.19071 10.7712171 0.004582051 0.02291026
+5 -1.6684268 0.7357761 17.33529 1.7309951 0.420842115 0.90967967
+2 1.2080938 -0.1660740 18.24544 1.0496688 0.591653347 0.90967967
+4 0.5416704 -0.6923084 17.57744 0.3958596 0.820427427 0.90967967
+3 0.2876249 -0.4884392 18.06216 0.1893255 0.909679672 0.90967967
>
> # spliceVariants
> z = matrix(c(2,0,4,6,4,3,7,1,1,0,1,1,0,3,1,2,0,1,2,1,0,3,1,0), 8, 3)
@@ -550,7 +336,7 @@ An object of class "DGEExact"
$table
logFC logCPM LR PValue
1 NA 19.19460 0.00000 1.00000000
-2 NA 19.34082 11.47712 0.07470318
+2 NA 19.34082 11.47712 0.07470331
$comparison
NULL
@@ -608,4 +394,53 @@ $n0
>
> proc.time()
user system elapsed
- 4.30 0.04 4.35
+ 3.29 0.04 3.32
+ edgeR/vignettes/ 0000755 0001263 0001264 00000000000 12333335261 015174 5 ustar [...]
+%\VignetteKeyword{RNA-Seq}
+%\VignetteKeyword{differential expression}
+%\VignettePackage{edgeR}
+\documentclass[12pt]{article}
+
+\textwidth=6.2in
+\textheight=8.5in
+\oddsidemargin=0.2in
+\evensidemargin=0.2in
+\headheight=0in
+\headsep=0in
+
+\begin{document}
+
+\title{edgeR Package Introduction}
+\author{Mark Robinson, Davis McCarthy, Yunshun Chen,\\
+Aaron Lun, Gordon K.\ Smyth}
+\date{10 October 2012\\
+Revised 10 November 2013}
+\maketitle
+
+
+edgeR is a package for the differential expression analysis of digital gene expression data,
+that is, of count data arising from DNA sequencing technologies.
+It is especially designed for differential expression analyses of RNA-Seq or SAGE data,
+or differential marking analyses of ChIP-Seq data.
+
+edgeR implements novel statistical methods based on the negative binomial distribution
+as a model for count variability, including empirical Bayes methods, exact tests, and generalized linear models.
+The package is especially suitable for analysing designed experiments with multiple
+experimental factors but possibly small numbers of replicates.
+It has unique abilities to model transcript specific variation even in small samples,
+a capability essential for prioritizing genes or transcripts that have consistent effects across replicates.
+
+The full edgeR User's Guide is available as part of the online documentation.
+To reach the User's Guide, install the edgeR package and load it into an R session by \texttt{library(edgeR)}.
+In R for Windows, the User's Guide will then be available from the drop-down menu called ``Vignettes''.
+In other operating systems, type
+\begin{Schunk}
+\begin{Sinput}
+> library(edgeR)
+> edgeRUsersGuide()
+\end{Sinput}
+\end{Schunk}
+at the R prompt to open the User's Guide in a pdf viewer.
+
+\end{document}
+ [...]
\ No newline at end of file
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/r-bioc-edger.git
More information about the debian-med-commit
mailing list