[med-svn] r19631 - in trunk/packages/R/r-bioc-bsgenome/trunk/debian: . patches

Andreas Tille tille at moszumanska.debian.org
Fri Jul 17 21:14:55 UTC 2015


Author: tille
Date: 2015-07-17 21:14:54 +0000 (Fri, 17 Jul 2015)
New Revision: 19631

Added:
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series
Modified:
   trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog
Log:
New upstream version (+commit patches subdir)


Modified: trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog	2015-07-17 19:14:07 UTC (rev 19630)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/changelog	2015-07-17 21:14:54 UTC (rev 19631)
@@ -1,3 +1,9 @@
+r-bioc-bsgenome (1.36.2-1) unstable; urgency=medium
+
+  * New upstream version
+
+ -- Andreas Tille <tille at debian.org>  Fri, 17 Jul 2015 22:50:12 +0200
+
 r-bioc-bsgenome (1.36.1-1) unstable; urgency=medium
 
   * New upstream version

Added: trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch	                        (rev 0)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/remove_paragraphs_bound_to_fail_from_vignette.patch	2015-07-17 21:14:54 UTC (rev 19631)
@@ -0,0 +1,511 @@
+Author: Andreas Tille <tille at debian.org>
+Last-Update: Tue, 21 Oct 2014 05:22:51 +0200
+Description: In the autopkgtest we are trying to reproduce the creation of
+ the documentation but some paragraphs are failing due to not yet packaged
+ data packages.  These paragraphs are deleted for the moment to enable
+ successful testing.
+ .
+ The fact that also DNAStringSet() does not work should be furtherly
+ investigated by some BioConductor experts.
+ 
+
+--- a/vignettes/GenomeSearching.Rnw
++++ b/vignettes/GenomeSearching.Rnw
+@@ -107,135 +107,7 @@ The BSgenome data package for the ce2 ge
+ available in Bioconductor but they could be added if there is demand for
+ them.
+ 
+-See \Rfunction{?available.genomes} for how to install
+-\Rpackage{BSgenome.Celegans.UCSC.ce2}.
+-Then load the package and display the single object defined in it:
+-<<b1>>=
+-library(BSgenome.Celegans.UCSC.ce2)
+-ls("package:BSgenome.Celegans.UCSC.ce2")
+-genome <- BSgenome.Celegans.UCSC.ce2
+-genome
+-@
+-
+-\Robject{genome} is a \Rclass{BSgenome} object:
+-<<b2>>=
+-class(genome)
+-@
+-
+-When displayed, some basic information about the origin of the
+-genome is shown (organism, provider, provider version, etc...)
+-followed by the index of {\it single} sequences and eventually
+-an additional index of {\it multiple} sequences.
+-Methods (adequately called {\it accessor methods}) are defined
+-for individual access to this information:
+-<<b3>>=
+-organism(genome)
+-provider(genome)
+-providerVersion(genome)
+-seqnames(genome)
+-mseqnames(genome)
+-@
+-
+-See the man page for the \Rclass{BSgenome} class (\Rfunction{?BSgenome})
+-for a complete list of accessor methods and their descriptions.
+-
+-Now we are ready to display chromosome I:
+-<<b4>>=
+-genome$chrI
+-@
+-
+-Note that this chrI sequence corresponds to the {\it forward} strand
+-(aka {\it direct} or {\it sense} or {\it positive} or {\it plus} strand)
+-of chromosome I.
+-UCSC, and genome providers in general, don't provide files containing the
+-nucleotide sequence of the {\it reverse} strand (aka {\it indirect}
+-or {\it antisense} or {\it negative} or {\it minus} or {\it opposite} strand)
+-of the chromosomes because these sequences can be deduced from the {\it forward}
+-sequences by taking their reverse complements.
+-The BSgenome data packages are no exceptions: they only
+-provide the {\it forward} strand sequence of every chromosome.
+-See \Rfunction{?reverseComplement} for more details about the reverse
+-complement of a \Rclass{DNAString} object.
+-It is important to remember that, in practice, the {\it reverse} strand
+-sequence is almost never needed.
+-The reason is that, in fact, a {\it reverse} strand analysis
+-can (and should) always be transposed into a {\it forward} strand analysis.
+-Therefore trying to compute the {\it reverse} strand sequence of an entire
+-chromosome by applying \Rfunction{reverseComplement} to its {\it forward}
+-strand sequence is almost always a bad idea.
+-See the {\it Finding an arbitrary nucleotide pattern in an entire genome} section
+-of this document for how to find arbitrary patterns in the {\it reverse} strand
+-of a chromosome.
+-
+-% It seems like this page http://www.medterms.com/script/main/art.asp?articlekey=20468
+-% is lying about the noncoding (or coding, they are in fact contradicting themselves)
+-% nature of the sense and antisense strands.
+-
+-The number of bases in this sequence can be retrieved with:
+-<<b5>>=
+-chrI <- genome$chrI
+-length(chrI)
+-@
+-
+-Some basic stats:
+-<<b6>>=
+-afI <- alphabetFrequency(chrI)
+-afI
+-sum(afI) == length(chrI)
+-@
+-
+-Count all {\it exact} matches of pattern \Robject{"ACCCAGGGC"}:
+-<<b7>>=
+-p1 <- "ACCCAGGGC"
+-countPattern(p1, chrI)
+-@
+-
+-Like most pattern matching functions in \Rpackage{Biostrings},
+-the \Rfunction{countPattern} and \Rfunction{matchPattern} functions
+-support {\it inexact} matching. One form of inexact matching is to
+-allow a few mismatching letters per match. Here we allow at most one:
+-<<b8>>=
+-countPattern(p1, chrI, max.mismatch=1)
+-@
+-
+-With the \Rfunction{matchPattern} function, the locations of the matches are
+-stored in an \Rclass{XStringViews} object:
+-<<b9>>=
+-m1 <- matchPattern(p1, chrI, max.mismatch=1)
+-m1[4:6]
+-class(m1)
+-@
+-
+-The \Rfunction{mismatch} function (new in \Rpackage{Biostrings}~2)
+-returns the positions of the mismatching letters for each match:
+-<<b10>>=
+-mismatch(p1, m1[4:6])
+-@
+-
+-Note: The \Rfunction{mismatch} method is in fact a particular case
+-of a (vectorized) {\it alignment} function where only ``replacements''
+-are allowed. Current implementation is slow but this will be addressed.
+-
+-It may happen that a match is {\it out of limits} like in this example:
+-<<b11>>=
+-p2 <- DNAString("AAGCCTAAGCCTAAGCCTAA")
+-m2 <- matchPattern(p2, chrI, max.mismatch=2)
+-m2[1:4]
+-p2 == m2[1:4]
+-mismatch(p2, m2[1:4])
+-@
+-
+-The list of exact matches and the list of inexact matches
+-can both be obtained with:
+-<<b12,results=hide>>=
+-m2[p2 == m2]
+-m2[p2 != m2]
+-@
+-
+-Note that the length of \Robject{m2[p2 == m2]} should be
+-equal to \Robject{countPattern(p2, chrI, max.mismatch=0)}.
+-
+-
++% DELETED
+ 
+ % ---------------------------------------------------------------------------
+ 
+@@ -297,158 +169,7 @@ More precisely, here is the analysis we
+ 
+ \end{itemize}
+ 
+-Let's start by loading the input dictionary with:
+-<<c1>>=
+-ce2dict0_file <- system.file("extdata", "ce2dict0.fa", package="BSgenome")
+-ce2dict0 <- readDNAStringSet(ce2dict0_file, "fasta")
+-ce2dict0
+-@
+-
+-Here is how we can write the functions that will perform our analysis:
+-<<c2>>=
+-writeHits <- function(seqname, matches, strand, file="", append=FALSE)
+-{
+-    if (file.exists(file) && !append)
+-        warning("existing file ", file, " will be overwritten with 'append=FALSE'")
+-    if (!file.exists(file) && append)
+-        warning("new file ", file, " will have no header with 'append=TRUE'")
+-    hits <- data.frame(seqname=rep.int(seqname, length(matches)),
+-                       start=start(matches),
+-                       end=end(matches),
+-                       strand=rep.int(strand, length(matches)),
+-                       patternID=names(matches),
+-                       check.names=FALSE)
+-    write.table(hits, file=file, append=append, quote=FALSE, sep="\t",
+-                row.names=FALSE, col.names=!append)
+-}
+-
+-runAnalysis1 <- function(dict0, outfile="")
+-{
+-    library(BSgenome.Celegans.UCSC.ce2)
+-    genome <- BSgenome.Celegans.UCSC.ce2
+-    seqnames <- seqnames(genome)
+-    seqnames_in1string <- paste(seqnames, collapse=", ")
+-    cat("Target:", providerVersion(genome),
+-        "chromosomes", seqnames_in1string, "\n")
+-    append <- FALSE
+-    for (seqname in seqnames) {
+-        subject <- genome[[seqname]]
+-        cat(">>> Finding all hits in chromosome", seqname, "...\n")
+-        for (i in seq_len(length(dict0))) {
+-            patternID <- names(dict0)[i]
+-            pattern <- dict0[[i]]
+-            plus_matches <- matchPattern(pattern, subject)
+-            names(plus_matches) <- rep.int(patternID, length(plus_matches))
+-            writeHits(seqname, plus_matches, "+", file=outfile, append=append)
+-            append <- TRUE
+-            rcpattern <- reverseComplement(pattern)
+-            minus_matches <- matchPattern(rcpattern, subject)
+-            names(minus_matches) <- rep.int(patternID, length(minus_matches))
+-            writeHits(seqname, minus_matches, "-", file=outfile, append=append)
+-        }
+-        cat(">>> DONE\n")
+-    }
+-}
+-@
+-
+-Some important notes about the implementation of the \Rfunction{runAnalysis1}
+-function:
+-\begin{itemize}
+-  \item{}
+-  \Robject{subject <- genome[[seqname]]} is the code that actually loads a
+-  chromosome sequence into memory.
+-  Using only one sequence at a time is a good practice to avoid memory
+-  allocation problems on a machine with a limited amount of memory.
+-  For example, loading all the human chromosome sequences in memory would
+-  require more than 3GB of memory!
+-
+-  \item{}
+-  We have 2 nested \Robject{for} loops: the outer loop walks thru the
+-  target (7 chromosomes) and the inner loop walks thru the set of
+-  patterns. Doing the other way around would be very inefficient,
+-  especially with a bigger number of patterns because this would require
+-  to load each chromosome sequence into memory as many times as the
+-  number of patterns.
+-  \Rfunction{runAnalysis1} loads each sequence only once.
+-
+-  \item{}
+-  We find the matches in the minus strand (\Robject{minus_matches}) by
+-  first taking the reverse complement of the current pattern (with
+-  \Robject{rcpattern <- reverseComplement(pattern)}) and NOT by
+-  taking the reverse complement of the current subject.
+-\end{itemize}
+-
+-Now we are ready to run the analysis and put the results in the
+-\Robject{"ce2dict0_ana1.txt"} file:
+-<<c3>>=
+-runAnalysis1(ce2dict0, outfile="ce2dict0_ana1.txt")
+-@
+-
+-Here is some very simple example of post analysis:
+-\begin{itemize}
+-  \item{}
+-Get the total number of hits:
+-<<c4>>=
+-hits1 <- read.table("ce2dict0_ana1.txt", header=TRUE)
+-nrow(hits1)
+-@
+-  \item{}
+-Get the number of hits per chromosome:
+-<<c5>>=
+-table(hits1$seqname)
+-@
+-  \item{}
+-Get the number of hits per pattern:
+-<<c6>>=
+-hits1_table <- table(hits1$patternID)
+-hits1_table
+-@
+-  \item{}
+-Get the pattern(s) with the higher number of hits:
+-<<c7>>=
+-hits1_table[hits1_table == max(hits1_table)] # pattern(s) with more hits
+-@
+-  \item{}
+-Get the pattern(s) with no hits:
+-<<c8>>=
+-setdiff(names(ce2dict0), hits1$patternID) # pattern(s) with no hits
+-@
+-  \item{}
+-And finally a function that can be used to plot the hits:
+-<<c9>>=
+-plotGenomeHits <- function(bsgenome, seqnames, hits)
+-{
+-    chrlengths <- seqlengths(bsgenome)[seqnames]
+-    XMAX <- max(chrlengths)
+-    YMAX <- length(seqnames)
+-    plot.new()
+-    plot.window(c(1, XMAX), c(0, YMAX))
+-    axis(1)
+-    axis(2, at=seq_len(length(seqnames)), labels=rev(seqnames), tick=FALSE, las=1)
+-    ## Plot the chromosomes
+-    for (i in seq_len(length(seqnames)))
+-        lines(c(1, chrlengths[i]), c(YMAX + 1 - i, YMAX + 1 - i), type="l")
+-    ## Plot the hits
+-    for (i in seq_len(nrow(hits))) {
+-        seqname <- hits$seqname[i]
+-        y0 <- YMAX + 1 - match(seqname, seqnames)
+-        if (hits$strand[i] == "+") {
+-            y <- y0 + 0.05
+-            col <- "red"
+-        } else {
+-            y <- y0 - 0.05
+-            col <- "blue"
+-        }
+-        lines(c(hits$start[i], hits$end[i]), c(y, y), type="l", col=col, lwd=3)
+-    }
+-}
+-@
+-Plot the hits found by \Rfunction{runAnalysis1} with:
+-<<c10,eval=false>>=
+-plotGenomeHits(genome, seqnames(genome), hits1)
+-@
+-\end{itemize}
+-
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -466,25 +187,8 @@ that actually implements the fast search
+ So if you need to reuse the same pattern a high number of times,
+ it's a good idea to convert it {\it before} to pass it to the
+ \Rmethod{matchPattern} or \Rmethod{countPattern} method.
+-This way the conversion is done only once:
+-<<d1>>=
+-library(hgu95av2probe)
+-tmpseq <- DNAStringSet(hgu95av2probe$sequence)
+-someStats <- function(v)
+-{
+-    GC <- DNAString("GC")
+-    CG <- DNAString("CG")
+-    sapply(seq_len(length(v)),
+-           function(i) {
+-               y <- v[[i]]
+-               c(alphabetFrequency(y)[1:4],
+-                 GC=countPattern(GC, y),
+-                 CG=countPattern(CG, y))
+-           }
+-    )
+-}
+-someStats(tmpseq[1:10])
+-@
++
++% DELETED
+ 
+ % The above example is Raphael's use case discussed on BioC on Feb 2006.
+ % In Biostrings 1, the equivalent would be:
+@@ -528,14 +232,7 @@ repeats with period less than or equal t
+ For a given package, all the sequences will always have the same number of
+ masks.
+ 
+-<<f1>>=
+-library(BSgenome.Hsapiens.UCSC.hg38.masked)
+-genome <- BSgenome.Hsapiens.UCSC.hg38.masked
+-chrY <- genome$chrY
+-chrY
+-chrM <- genome$chrM
+-chrM
+-@
++% DELETED
+ 
+ The built-in masks are named consistenly across all the BSgenome data packages
+ available in Bioconductor:
+@@ -578,128 +275,7 @@ The {\it masked width} is the total numb
+ that are masked and the {\it masked ratio} is the {\it masked width}
+ divided by the length of the sequence.
+ 
+-To activate a mask, use the \Rmethod{active} replacement method
+-in conjonction with the \Rmethod{masks} method. For example, to
+-activate the RepeatMasker mask, do:
+-<<f2>>=
+-active(masks(chrY))["RM"] <- TRUE
+-chrY
+-@
+-
+-As you can see, the {\it masked width} for all the active masks
+-together (i.e. the total number of nucleotide positions that are
+-masked by at least one active mask) is now the same as for the
+-first mask. This represents a {\it masked ratio} of about 83\%.
+-
+-Now when we use a function that is {\it mask aware}, like
+-\Rfunction{alphabetFrequency}, the masked regions of the
+-input sequence are ignored:
+-<<f3>>=
+-active(masks(chrY)) <- FALSE
+-active(masks(chrY))["AGAPS"] <- TRUE
+-alphabetFrequency(unmasked(chrY))
+-alphabetFrequency(chrY)
+-@
+-
+-This output indicates that, for this chromosome, the assembly gaps
+-correspond exactly to the regions in the sequence that were filled
+-with the letter N. Note that this is not always the case: sometimes
+-Ns, and other IUPAC ambiguity letters, can be found inside the contigs.
+-
+-When coercing a \Rclass{MaskedXString} object to an \Rclass{XStringViews}
+-object, each non-masked region in the original sequence is converted into
+-a view on the sequence:
+-<<f4>>=
+-as(chrY, "XStringViews")
+-@
+-
+-This can be used in conjonction with the \Rmethod{gaps} method to
+-see the gaps between the views i.e. the masked regions themselves:
+-<<f5,results=hide>>=
+-gaps(as(chrY, "XStringViews"))
+-@
+-
+-To extract the sizes of the assembly gaps:
+-<<f6>>=
+-width(gaps(as(chrY, "XStringViews")))
+-@
+-
+-Note that, if applied directly to \Robject{chrY}, \Rmethod{gaps}
+-returns a \Rclass{MaskedDNAString} object with a single mask masking
+-the regions that are not masked in the original object:
+-<<f7>>=
+-gaps(chrY)
+-alphabetFrequency(gaps(chrY))
+-@
+-
+-In fact, for any \Rclass{MaskedDNAString} object, the following should
+-always be \Robject{TRUE}, whatever the masks are:
+-<<f8>>=
+-af0 <- alphabetFrequency(unmasked(chrY))
+-af1 <- alphabetFrequency(chrY)
+-af2 <- alphabetFrequency(gaps(chrY))
+-all(af0 == af1 + af2)
+-@
+-
+-With all chrY masks active:
+-<<f9>>=
+-active(masks(chrY)) <- TRUE
+-af1 <- alphabetFrequency(chrY)
+-af1
+-gaps(chrY)
+-af2 <- alphabetFrequency(gaps(chrY))
+-af2
+-all(af0 == af1 + af2)
+-@
+-
+-Now let's compare three different ways of finding all the occurences of
+-the \Robject{"CANNTG"} consensus sequence in chrY. The Ns in this
+-pattern need to be treated as wildcards i.e. they must match any
+-letter in the subject.
+-
+-Without the mask feature, the first way to do it would be to use
+-the \Robject{fixed=FALSE} option in the call to \Rfunction{matchPattern}
+-(or \Rfunction{countPattern}):
+-<<f10>>=
+-Ebox <- "CANNTG"
+-active(masks(chrY)) <- FALSE
+-countPattern(Ebox, chrY, fixed=FALSE)
+-@
+-
+-The problem with this method is that the Ns in the subject
+-are also treated as wildcards hence the abnormally high number of
+-matches.
+-A better method is to specify the {\it side} of the matching problem
+-(i.e. {\it pattern} or {\it subject}) where the Ns should be treated
+-as wildcards:
+-<<f11>>=
+-countPattern(Ebox, chrY, fixed=c(pattern=FALSE,subject=TRUE))
+-@
+-
+-Finally, \Rfunction{countPattern} being {\it mask aware}, this can be
+-achieved more efficiently by just masking the assembly gaps and
+-ambiguities:
+-<<f12>>=
+-active(masks(chrY))[c("AGAPS", "AMB")] <- TRUE
+-alphabetFrequency(chrY, baseOnly=TRUE)  # no ambiguities
+-countPattern(Ebox, chrY, fixed=FALSE)
+-@
+-
+-Note that some chromosomes can have Ns outside the assembly gaps:
+-<<f13>>=
+-chr2 <- genome$chr2
+-active(masks(chr2))[-2] <- FALSE
+-alphabetFrequency(gaps(chr2))
+-@
+-so it is recommended to always keep the AMB mask active (in addition to
+-the AGAPS mask) whatever the sequence is.
+-
+-Note that not all functions that work with an \Rclass{XString}
+-input are {\it mask aware} but more will be added in the near future.
+-However, most of the times there is a alternate way to exclude some
+-arbitrary regions from an analysis without having to use {\it mask aware}
+-functions. This is described below in the {\it Hard masking} section.
+-
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -766,16 +342,13 @@ runAnalysis2 <- function(dict0, outfile=
+ Remember that \Rfunction{matchPDict} only works if all the patterns in the
+ input dictionary have the same length so for this 2nd analysis, we will truncate
+ the patterns in \Robject{ce2dict0} to 15 nucleotides:
+-<<e2>>=
+-ce2dict0cw15 <- DNAStringSet(ce2dict0, end=15)
+-@
+ 
+-Now we can run this 2nd analysis and put the results in the
+-\Robject{"ce2dict0cw15_ana2.txt"} file:
+-<<e3>>=
+-runAnalysis2(ce2dict0cw15, outfile="ce2dict0cw15_ana2.txt")
+-@
++% STRANGE THAT ALSO THIS DOES NOT WORK ... ???
++%<<e2>>=
++%ce2dict0cw15 <- DNAStringSet(ce2dict0, end=15)
++%@
+ 
++% DELETED
+ 
+ 
+ % ---------------------------------------------------------------------------
+@@ -788,3 +361,4 @@ sessionInfo()
+ 
+ \end{document}
+ 
++

Added: trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series
===================================================================
--- trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series	                        (rev 0)
+++ trunk/packages/R/r-bioc-bsgenome/trunk/debian/patches/series	2015-07-17 21:14:54 UTC (rev 19631)
@@ -0,0 +1 @@
+remove_paragraphs_bound_to_fail_from_vignette.patch




More information about the debian-med-commit mailing list