[med-svn] [rsem] 01/03: Imported Upstream version 1.2.22+dfsg

Michael Crusoe misterc-guest at moszumanska.debian.org
Thu Aug 27 20:41:59 UTC 2015


This is an automated email from the git hooks/post-receive script.

misterc-guest pushed a commit to branch master
in repository rsem.

commit 3a9288dbae8176742283fc83a5d058f202641ab6
Author: Michael R. Crusoe <michael.crusoe at gmail.com>
Date:   Thu Aug 27 13:37:24 2015 -0700

    Imported Upstream version 1.2.22+dfsg
---
 EBSeq/rsem-for-ebseq-find-DE |   7 +-
 PairedEndHit.h               |   4 +-
 PairedEndQModel.h            |   1 +
 README.md                    |  20 +++---
 SamParser.h                  |  23 +++++--
 WHAT_IS_NEW                  |  18 ++++++
 rsem-calculate-expression    | 150 ++++++++++++++++++++++++++++++++++++++++++-
 rsem-prepare-reference       |  72 +++++++++++++++++++--
 8 files changed, 268 insertions(+), 27 deletions(-)

diff --git a/EBSeq/rsem-for-ebseq-find-DE b/EBSeq/rsem-for-ebseq-find-DE
index f228bab..a448843 100755
--- a/EBSeq/rsem-for-ebseq-find-DE
+++ b/EBSeq/rsem-for-ebseq-find-DE
@@ -10,6 +10,7 @@ path <- argv[1]
 ngvector_file <- argv[2]
 data_matrix_file <- argv[3]
 output_file <- argv[4]
+norm_out_file <- paste0("norm_", data_matrix_file)
 
 nc <- length(argv) - 4;
 num_reps <- as.numeric(argv[5:(5+nc-1)])
@@ -23,6 +24,7 @@ if (sum(num_reps) != n) stop("Total number of replicates given does not match th
 
 conditions <- as.factor(rep(paste("C", 1:nc, sep=""), times = num_reps))
 Sizes <- MedianNorm(DataMat)
+NormMat <- GetNormalizedMat(DataMat, Sizes)
 ngvector <- NULL
 if (ngvector_file != "#") {
   ngvector <- as.vector(data.matrix(read.table(ngvector_file)))
@@ -37,8 +39,8 @@ if (nc == 2) {
   PP <- as.data.frame(GetPPMat(EBOut))
   fc_res <- PostFC(EBOut)
 
-  results <- cbind(PP, fc_res$PostFC, fc_res$RealFC)
-  colnames(results) <- c("PPEE", "PPDE", "PostFC", "RealFC")
+  results <- cbind(PP, fc_res$PostFC, fc_res$RealFC,unlist(EBOut$C1Mean)[rownames(PP)], unlist(EBOut$C2Mean)[rownames(PP)])
+  colnames(results) <- c("PPEE", "PPDE", "PostFC", "RealFC","C1Mean","C2Mean")
   results <- results[order(results[,"PPDE"], decreasing = TRUE),]
   write.table(results, file = output_file, sep = "\t")
   
@@ -68,3 +70,4 @@ if (nc == 2) {
   MultiFC <- GetMultiFC(MultiOut)
   write.table(MultiFC$CondMeans[ord,], file = paste(output_file, ".condmeans", sep = ""), sep = "\t")
 }
+write.table(NormMat, file = norm_out_file, sep = "\t")
diff --git a/PairedEndHit.h b/PairedEndHit.h
index 1242cf7..be3ea54 100644
--- a/PairedEndHit.h
+++ b/PairedEndHit.h
@@ -15,13 +15,13 @@ public:
 		this->insertL = insertL;
 	}
 
-	short getInsertL() const { return insertL; }
+	int getInsertL() const { return insertL; }
 
 	bool read(std::istream&);
 	void write(std::ostream&);
 
 private:
-	short insertL; // insert length
+	int insertL; // insert length
 };
 
 bool PairedEndHit::read(std::istream& in) {
diff --git a/PairedEndQModel.h b/PairedEndQModel.h
index 7aebb49..3864c80 100644
--- a/PairedEndQModel.h
+++ b/PairedEndQModel.h
@@ -126,6 +126,7 @@ public:
 		const SingleReadQ& mate2 = read.getMate2();
 		int m2pos = totLen - pos - insertLen;
 		int m2dir = !dir;
+
 		prob *= mld->getAdjustedProb(mate2.getReadLength(), hit.getInsertL()) *
 		        qpro->getProb(mate2.getReadSeq(), mate2.getQScore(), ref, m2pos, m2dir);
 
diff --git a/README.md b/README.md
index dedffcd..4636079 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 README for RSEM
 ===============
 
-[Bo Li](http://pages.cs.wisc.edu/~bli) \(bli at cs dot wisc dot edu\)
+[Bo Li](http://bli25ucb.github.io/) \(bli at cs dot wisc dot edu\)
 
 * * *
 
@@ -89,7 +89,7 @@ To prepare the reference sequences, you should run the
     rsem-prepare-reference --help
 
 to get usage information or visit the [rsem-prepare-reference
-documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-prepare-reference.html).
+documentation page](rsem-prepare-reference.html).
 
 ### II. Calculating Expression Values
 
@@ -99,7 +99,7 @@ To calculate expression values, you should run the
     rsem-calculate-expression --help
 
 to get usage information or visit the [rsem-calculate-expression
-documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-calculate-expression.html).
+documentation page](rsem-calculate-expression.html).
 
 #### Calculating expression values from single-end data
 
@@ -130,7 +130,7 @@ using an alternative aligner, you may also want to provide the
 '--no-bowtie' option to 'rsem-prepare-reference' so that the Bowtie
 indices are not built.
 
-RSEM requires all alignments of the same read group together. For
+RSEM requires the alignments of a read to be adjacent. For
 paired-end reads, RSEM also requires the two mates of any alignment be
 adjacent. To check if your SAM/BAM file satisfy the requirements,
 please run
@@ -145,7 +145,7 @@ process. Please run
 
 to get usage information or visit the [convert-sam-for-rsem
 documentation
-page](http://deweylab.biostat.wisc.edu/rsem/convert-sam-for-rsem.html).
+page](convert-sam-for-rsem.html).
 
 However, please note that RSEM does ** not ** support gapped
 alignments. So make sure that your aligner does not produce alignments
@@ -228,7 +228,7 @@ To generate transcript wiggle plots, you should run the
     rsem-plot-transcript-wiggles --help
 
 to get usage information or visit the [rsem-plot-transcript-wiggles
-documentation page](http://deweylab.biostat.wisc.edu/rsem/rsem-plot-transcript-wiggles.html).
+documentation page](rsem-plot-transcript-wiggles.html).
 
 #### e) Visualize the model learned by RSEM
 
@@ -397,7 +397,7 @@ transcripts whose lengths are less than k are assigned to cluster
 
 to get usage information or visit the [rsem-generate-ngvector
 documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-generate-ngvector.html).
+page](rsem-generate-ngvector.html).
 
 If your reference is a de novo assembled transcript set, you should
 run 'rsem-generate-ngvector' first. Then load the resulting
@@ -430,7 +430,7 @@ for all genes/transcripts. Run
     rsem-run-ebseq --help
 
 to get usage information or visit the [rsem-run-ebseq documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-run-ebseq.html). Second,
+page](rsem-run-ebseq.html). Second,
 'rsem-control-fdr' takes 'rsem-run-ebseq' 's result and reports called
 differentially expressed genes/transcripts by controlling the false
 discovery rate. Run
@@ -438,7 +438,7 @@ discovery rate. Run
     rsem-control-fdr --help
 
 to get usage information or visit the [rsem-control-fdr documentation
-page](http://deweylab.biostat.wisc.edu/rsem/rsem-control-fdr.html). These
+page](rsem-control-fdr.html). These
 two scripts can perform DE analysis on either 2 conditions or multiple
 conditions.
 
@@ -452,7 +452,7 @@ be sent to <a href="mailto:nleng at wisc.edu">Ning Leng</a>.
 
 ## <a name="authors"></a> Authors
 
-The RSEM algorithm is developed by Bo Li and Colin Dewey. The RSEM software is mainly implemented by Bo Li.
+[Bo Li](http://bli25ucb.github.io/) and [Colin Dewey](https://www.biostat.wisc.edu/~cdewey/) designed the RSEM algorithm. [Bo Li](http://bli25ucb.github.io/) implemented the RSEM software. [Peng Liu](https://www.biostat.wisc.edu/~cdewey/group.html) contributed the STAR aligner options.
 
 ## <a name="acknowledgements"></a> Acknowledgements
 
diff --git a/SamParser.h b/SamParser.h
index 6e90765..9fa0342 100644
--- a/SamParser.h
+++ b/SamParser.h
@@ -23,6 +23,8 @@
 
 #include "Transcripts.h"
 
+const char* whitespace = " \t\n\r\f\v";
+
 class SamParser {
 public:
 	SamParser(char, const char*, const char*, Transcripts&, const char*);
@@ -62,7 +64,16 @@ private:
 	}
 
 	std::string getName(const bam1_t* b) {
-		return std::string(bam1_qname(b));
+        const char* raw_query_name = bam1_qname(b);
+        // Retain only the first whitespace-delimited word as the read name
+        // This prevents issues of mismatching names when aligners do not
+        // strip off extra words in read name strings
+        const char* whitespace_pos = std::strpbrk(raw_query_name, whitespace);
+        if (whitespace_pos == NULL) {
+            return std::string(raw_query_name);
+        } else {
+            return std::string(raw_query_name, whitespace_pos - raw_query_name);
+        }
 	}
 
 	std::string getReadSeq(const bam1_t*);
@@ -190,10 +201,11 @@ int SamParser::parseNext(PairedEndRead& read, PairedEndHit& hit) {
 		mp1 = b2; mp2 = b;
 	}
 
-	if (strcmp(bam1_qname(mp1), bam1_qname(mp2))) printf("Warning: Detected a read pair whose two mates have different names--%s and %s!\n", getName(mp1).c_str(), getName(mp2).c_str()); 
+	std::string name = getName(mp1);
+	std::string name2 = getName(mp2);
+	if (name != name2) printf("Warning: Detected a read pair whose two mates have different names--%s and %s!\n", name.c_str(), name2.c_str());
 
 	int readType = getReadType(mp1, mp2);
-	std::string name = getName(mp1);
 
 	if (readType != 1 || (readType == 1 && read.getName().compare(name) != 0)) {
 		val = readType;
@@ -244,10 +256,11 @@ int SamParser::parseNext(PairedEndReadQ& read, PairedEndHit& hit) {
 		mp1 = b2; mp2 = b;
 	}
 
-	if (strcmp(bam1_qname(mp1), bam1_qname(mp2))) printf("Warning: Detected a read pair whose two mates have different names--%s and %s!\n", getName(mp1).c_str(), getName(mp2).c_str()); 
+	std::string name = getName(mp1);
+	std::string name2 = getName(mp2);
+	if (name != name2) printf("Warning: Detected a read pair whose two mates have different names--%s and %s!\n", name.c_str(), name2.c_str());
 
 	int readType = getReadType(mp1, mp2);
-	std::string name = getName(mp1);
 
 	if (readType != 1 || (readType == 1 && read.getName().compare(name) != 0)) {
 		val = readType;
diff --git a/WHAT_IS_NEW b/WHAT_IS_NEW
index a508fac..68fdc30 100644
--- a/WHAT_IS_NEW
+++ b/WHAT_IS_NEW
@@ -1,3 +1,21 @@
+RSEM v1.2.22
+
+- Added options to run the STAR aligner
+
+--------------------------------------------------------------------------------------------
+
+RSEM v1.2.21
+
+- Strip read names of extra words to avoid mismatches of paired-end read names
+
+--------------------------------------------------------------------------------------------
+
+RSEM v1.2.20
+
+- Fixed a problem that can lead to assertion error if any paired-end read's insert size > 32767 (by changing the type of insertL in PairedEndHit.h from short to int)
+
+--------------------------------------------------------------------------------------------
+
 RSEM v1.2.19
 
 - Modified 'rsem-prepare-reference' such that by default it does not add any poly(A) tails. To add poly(A) tails, use '--polyA' option
diff --git a/rsem-calculate-expression b/rsem-calculate-expression
index 585732d..ea6279e 100755
--- a/rsem-calculate-expression
+++ b/rsem-calculate-expression
@@ -2,7 +2,7 @@
 
 use Getopt::Long;
 use Pod::Usage;
-
+use File::Basename;
 use FindBin;
 use lib $FindBin::RealBin;
 use rsem_perl_utils qw(runCommand collectResults showVersionInfo);
@@ -88,6 +88,14 @@ my $gap = 32;
 
 my $alleleS = 0;
 
+my $star = 0;
+my $star_path  = '';
+my $gzipped_read_file = 0;
+my $bzipped_read_file = 0;
+my $output_star_genome_bam = 0;
+my $sort_bam_by_read_name = 0;
+my $sort_bam_buffer_size = '60G';
+
 GetOptions("keep-intermediate-files" => \$keep_intermediate_files,
 	   "temporary-folder=s" => \$temp_dir,
 	   "no-qualities" => \$no_qual,
@@ -131,6 +139,13 @@ GetOptions("keep-intermediate-files" => \$keep_intermediate_files,
 	   "ci-memory=i" => \$NMB,
 	   "ci-number-of-samples-per-count-vector=i" => \$NSPC,
 	   "samtools-sort-mem=s" => \$SortMem,
+	   'star' => \$star,
+	   'star-path=s' => \$star_path,
+	   "gzipped-read-file" => \$gzipped_read_file,
+	   "bzipped-read-file" => \$bzipped_read_file,
+	   "output-star-genome-bam" => \$output_star_genome_bam,
+	   "sort-bam-by-read-name" => \$sort_bam_by_read_name,
+	   "sort-bam-buffer-size=s" => \$sort_bam_buffer_size,
 	   "seed=i" => \$seed,
 	   "time" => \$mTime,
 	   "version" => \$version,
@@ -168,6 +183,8 @@ pod2usage(-msg => "--sampling-for-bam cannot be specified if --no-bam-output is
 pod2usage(-msg => "--output-genome-bam cannot be specified if --no-bam-output is specified!\n", -exitval => 2, -verbose => 2) if ($genGenomeBamF && !$genBamF);
 pod2usage(-msg => "The seed for random number generator must be a non-negative 32bit integer!\n", -exitval => 2, -verbose => 2) if (($seed ne "NULL") && ($seed < 0 || $seed > 0xffffffff));
 pod2usage(-msg => "The credibility level should be within (0, 1)!\n", -exitval => 2, -verbose => 2) if ($CONFIDENCE <= 0.0 || $CONFIDENCE >= 1.0);
+pod2usage(-msg => "Gzipped read files can only be used for aligner STAR\n", -exitval=>2, -verbose =>2) if ( ( $star == 0 ) && ( $gzipped_read_file != 0));
+pod2usage(-msg => "Bzipped read files can only be used for aligner STAR\n", -exitval=>2, -verbose =>2) if ( ( $star == 0 ) && ( $bzipped_read_file != 0));
 
 if ($L < 25) { print "Warning: the seed length set is less than 25! This is only allowed if the references are not added poly(A) tails.\n"; }
 
@@ -229,11 +246,64 @@ my ($mate_minL, $mate_maxL) = (1, $maxL);
 
 if ($bowtie_path ne "") { $bowtie_path .= "/"; }
 if ($bowtie2_path ne "") { $bowtie2_path .= "/"; }
+if ($star_path ne '') { $star_path .= '/'; }
 
 my $command = "";
 
 if (!$is_sam && !$is_bam) {
-    if (!$bowtie2) {
+	if ( $star ) {
+	  ## align reads by STAR
+    my $star_genome_path = dirname($refName);
+	  $command = "$star_path/STAR " . 
+	               ## ENCODE3 pipeline parameters
+	               " --genomeDir $star_genome_path " .
+	               ' --outSAMunmapped Within ' .
+	               ' --outFilterType BySJout ' .
+	               ' --outSAMattributes NH HI AS NM MD ' .
+	               ' --outFilterMultimapNmax 20 ' .
+	               ' --outFilterMismatchNmax 999 ' .
+	               ' --outFilterMismatchNoverLmax 0.04 ' .
+	               ' --alignIntronMin 20 ' .
+	               ' --alignIntronMax 1000000 ' .
+	               ' --alignMatesGapMax 1000000 ' .
+	               ' --alignSJoverhangMin 8 ' .
+	               ' --alignSJDBoverhangMin 1 ' .
+	               ' --sjdbScore 1 ' .
+	               " --runThreadN $nThreads " .
+	               ##
+	
+	               ## different than ENCODE3 pipeline 
+	               ## do not allow using shared memory
+	               ' --genomeLoad NoSharedMemory ' .
+	               ##
+	
+	               ## different than ENCODE3 pipeline, which sorts output BAM
+	               ## no need to do it here to save time and memory 
+	               ' --outSAMtype BAM Unsorted ' .
+	               ##
+	
+	               ## unlike ENCODE3, we don't output bedGraph files
+	
+	               ' --quantMode TranscriptomeSAM '.
+	               ' --outSAMheaderHD \@HD VN:1.4 SO:unsorted '.
+	
+	               ## define output file prefix
+	               " --outFileNamePrefix $imdName ";
+	               ##
+	
+	  if ( $gzipped_read_file ) {
+	    $command .= ' --readFilesCommand zcat ';
+	  } elsif ( $bzipped_read_file ) {
+	    $command .= ' --readFilesCommand bzip2 -c ';
+	  }
+	
+	  if ( $read_type == 0 || $read_type == 1 ) {
+	    $command .= " --readFilesIn $mate1_list ";
+	  } else {
+	    $command .= " --readFilesIn $mate1_list $mate2_list";
+	  }
+	
+	} elsif (!$bowtie2) {
 	$command = $bowtie_path."bowtie";
 	if ($no_qual) { $command .= " -f"; }
 	else { $command .= " -q"; }
@@ -307,6 +377,40 @@ if (!$is_sam && !$is_bam) {
 
     $inpF = "$imdName.bam";
     $is_bam = 1; # alignments are outputed as a BAM file
+
+    if ( $star ) {
+      my $star_tr_bam = $imdName . 'Aligned.toTranscriptome.out.bam';
+      rename $star_tr_bam, $inpF
+        or die "can't rename $star_tr_bam to $inpF: $!\n";
+      rmdir $imdName . "_STARtmp/";
+      my $star_genome_bam = $imdName . "Aligned.out.bam";
+      my $rsem_star_genome_bam = $sampleName.'.STAR.genome.bam';
+      if ( $output_star_genome_bam ) {
+        rename $star_genome_bam, $rsem_star_genome_bam or die 
+          "can't move $star_genome_bam to $rsem_star_genome_bam: $!\n";
+      } else {
+        unlink $star_genome_bam or die "can't remove $star_genome_bam: $!\n";
+      }
+    }
+}
+
+if ( $sort_bam_by_read_name ) {
+  my $sorted_bam = "$imdName.mateSorted.bam";
+  my $sort_workdir = dirname($imdName);
+  $command = "cat <( samtools view -H $inpF ) " .
+                 "<( samtools view -@ $nThreads $inpF | " .
+                 q(awk '{printf "%s ", $0; getline; print}' | ) .
+                 "sort -S $sort_bam_buffer_size -T $sort_workdir | " .
+                 q(tr ' ' '\n' ) .
+                 ") | samtools view -@ $nThreads -bS - > $sorted_bam"; 
+
+  print "$command\n\n";
+
+  ## have to invoke BASH to use the <()<() substitutions
+  ## have to use LIST to invoke BASE
+  system( ('bash', '-c', $command) );
+
+  rename $sorted_bam, $inpF or die "can't rename $sorted_bam to $inpF: $!\n";
 }
 
 if ($mTime) { $time_start = time(); }
@@ -520,6 +624,14 @@ The RNA-Seq protocol used to generate the reads is strand specific, i.e., all (u
 
 Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additional [...]
 
+=item B<--star>
+
+Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's Output BAM file is unsorted. It is stored in RSEM's temporary directory with name as 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off) 
+
+=item B<--star-path> <path>
+
+The path to STAR's executable. (Default: the path to STAR executable is assumed to be in user's PATH environment variable)
+
 =item B<--sam>
 
 Input file is in SAM format. (Default: off)
@@ -634,6 +746,26 @@ Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3). (Default:
 
 (Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end mode. This option controls how hard Bowtie 2 tries to find alignments. <string> must be one of "very_fast", "fast", "sensitive" and "very_sensitive". The four candidates correspond to Bowtie 2's "--very-fast", "--fast", "--sensitive" and "--very-sensitive" options. (Default: "sensitive" - use Bowtie 2's default)
 
+=item B<--gzipped-read-file>
+
+Input read file(s) is compressed by gzip. This option can be only used when aligning reads by STAR, i.e. --star-genome-path <path> is defined (Default: off)
+
+=item B<--bzipped-read-file>
+
+Input read file(s) is compressed by bzip2. This option can be only used when aligning reads by STAR, i.e. --star-genome-path <path> is defined (Default: off)
+
+=item B<--output-star-genome-bam>
+
+Save the BAM file from STAR alignment under genomic coordinate to 'sample_name.STAR.genome.bam'. This file is NOT sorted by genomic coordinate. In this file, according to STAR's manual, 'paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well'. (Default: off)
+
+=item B<--sort-bam-by-read-name>
+
+Sort BAM file aligned under transcript coordidate by read name. Setting this option on will produce determinstic maximum likelihood estimations from independet runs. Note that sorting will take long time and lots of memory. (Default: off)
+
+=item B<--sort-bam-buffer-size> <string>
+
+Size for main memeory buffer when sorting BAM file. It can be any string acceptable to GNU sort's '-S' option. See "sort --help" for details. (Default: '60G')
+
 =item B<--forward-prob> <double>
 
 Probability of generating a read from the forward strand of a transcript. Set to 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand, 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand, or 0.5 for a non-strand-specific protocol. (Default: 0.5)
@@ -706,7 +838,7 @@ Output time consumed by each step of RSEM to 'sample_name.time'. (Default: off)
 
 =head1 DESCRIPTION
 
-In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments.  RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified.  Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the [...]
+In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments.  RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Alternatively, users can use STAR to align reads using the '--star' option. RSEM has provided options in 'rsem-prepare-reference' to prepare STAR's genome indices. Users may use an alternative aligner by  [...]
 
 One simple way to make the alignment file satisfying RSEM's requirements (assuming the aligner used put mates in a paired-end read adjacent) is to use 'convert-sam-for-rsem' script. This script only accept SAM format files as input. If a BAM format file is obtained, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command: 
 
@@ -962,4 +1094,16 @@ Assume the path to the bowtie executables is in the user's PATH environment vari
                            /ref/mouse_125 \
                            mmliver_paired_end_quals
 
+6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads with quality scores and read files are compressed by gzip. We want to use STAR to aligned reads and assume STAR executable is '/sw/STAR'. Suppose we want to use 8 threads and do not generate a genome BAM file:
+
+ rsem-calculate-expression --star \
+                           --star-path /sw/STAR \
+                           --gzipped-read-file \
+                           -p 8 \
+                           /data/mmliver_1.fq.gz \
+                           /data/mmliver_2.fq.gz \
+                           /ref/mouse_125 \
+                           mmliver_paired_end_quals
+
+
 =cut
diff --git a/rsem-prepare-reference b/rsem-prepare-reference
index 2b33f89..8e5a3d9 100755
--- a/rsem-prepare-reference
+++ b/rsem-prepare-reference
@@ -2,6 +2,7 @@
 
 use Getopt::Long;
 use Pod::Usage;	
+use File::Basename;
 use FindBin;
 use lib $FindBin::RealBin;
 use rsem_perl_utils;
@@ -10,6 +11,7 @@ use Env qw(@PATH);
 @PATH = ($FindBin::RealBin, @PATH);
 
 use strict;
+use warnings;
 
 my $status;
 
@@ -28,6 +30,11 @@ my $help = 0;
 
 my $alleleMappingF = "";
 
+my $star = 0;
+my $star_path = '';
+my $star_nthreads = 1;
+my $star_sjdboverhang = 100;
+
 GetOptions("gtf=s" => \$gtfF,
 	   "transcript-to-gene-map=s" => \$mappingF,
 	   "allele-to-gene-map=s" => \$alleleMappingF,
@@ -38,6 +45,10 @@ GetOptions("gtf=s" => \$gtfF,
 	   "bowtie-path=s" => \$bowtie_path,
 	   "bowtie2" => \$bowtie2,
 	   "bowtie2-path=s" => \$bowtie2_path,
+	   'star' => \$star,
+	   'star-path=s' => \$star_path,
+	   'star-sjdboverhang=i' => \$star_sjdboverhang,
+	   'p|num-threads=i' => \$star_nthreads,
 	   "q|quiet" => \$quiet,
 	   "h|help" => \$help) or pod2usage(-exitval => 2, -verbose => 2);
 
@@ -71,6 +82,7 @@ if ($polyA) {
 
 if ($bowtie_path ne "") { $bowtie_path .= "/"; }
 if ($bowtie2_path ne "") { $bowtie2_path .= "/"; }
+if ($star_path ne '') { $star_path .= '/'; }
 
 my $command = "";
 
@@ -114,6 +126,19 @@ if ($bowtie2) {
     &runCommand($command);
 }
 
+if ($star) {
+    my $out_star_genome_path = dirname($ARGV[1]);
+    $command = "$star_path/STAR " .
+                        " --runThreadN $star_nthreads " .
+                        ' --runMode genomeGenerate ' .
+                        " --genomeDir $out_star_genome_path " .
+                        " --genomeFastaFiles @list " .
+                        " --sjdbGTFfile $gtfF " .
+                        " --sjdbOverhang $star_sjdboverhang " .
+                        " --outFileNamePrefix $ARGV[1]";
+    &runCommand($command);
+}
+
 __END__
 
 =head1 NAME
@@ -206,6 +231,22 @@ Build Bowtie 2 indices. (Default: off)
 
 The path to the Bowtie 2 executables. (Default: the path to Bowtie 2 executables is assumed to be in the user's PATH environment variable)
 
+=item B<--star>
+
+Build STAR indices. (Default: off)
+
+=item B<--star-path> <path>
+
+The path to STAR's executable. (Default: the path to STAR executable is assumed to be in user's PATH environment varaible)
+
+=item B<--star-sjdboverhang> <int>
+
+Length of the genomic sequence around annotated junction. It is only used for STAT to build splice junctions database and not needed for Bowtie or Bowtie2. It will be passed as the --sjdbOverhang option to STAR. According to STAR's manual, its ideal value is max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is 101-1=100. In most cases, the default value of 100 will work as well as the ideal value. (Default: 100)
+
+=item B<-p/--num-threads> <int>
+
+Number of threads to use for building STAR's genome indices. (Default: 1)
+
 =item B<-q/--quiet>
 
 Suppress the output of logging information. (Default: off)
@@ -218,18 +259,18 @@ Show help information.
 
 =head1 DESCRIPTION
 
-This program extracts/preprocesses the reference sequences for RSEM. It can optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using their default parameters. If an alternative aligner is to be used, indices for that particular aligner can be built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for details). This program is used in conjunction with the 'rsem-calculate-expression' program.
+This program extracts/preprocesses the reference sequences for RSEM. It can optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using their default parameters. It can also optionally build STAR indices (with '--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. If an alternative aligner is to be used, indices for that particular aligner can be built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (se [...]
 
 =head1 OUTPUT
 
-This program will generate 'reference_name.grp', 'reference_name.ti', 'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa' and optional Bowtie/Bowtie 2 index files.
+This program will generate 'reference_name.grp', 'reference_name.ti', 'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and optional STAR index files.
 
 'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and 'reference_name.chrlist' are used by RSEM internally.
 
 B<'reference_name.transcripts.fa'> contains the extracted reference transcripts in Multi-FASTA format. Poly(A) tails are not added and it may contain lower case bases in its sequences if the corresponding genomic regions are soft-masked.
 
-B<'reference_name.idx.fa' and 'reference_name.n2g.idx.fa'> are used by aligners to build their own indices. In these two files, all sequence bases are converted into upper case. In addition, poly(A) tails are added if '--polyA' option is set. The only difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G' characters. This conversion is in particular desired for aligners (e.g. Bowtie) th [...]
- 
+B<'reference_name.idx.fa' and 'reference_name.n2g.idx.fa'> are used by aligners to build their own indices. In these two files, all sequence bases are converted into upper case. In addition, poly(A) tails are added if '--polyA' option is set. The only difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G' characters. This conversion is in particular desired for aligners (e.g. Bowtie) th [...]
+
 =head1 EXAMPLES
 
 1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'.  We want to put the generated reference files under '/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that GTF files generated fr [...]
@@ -263,11 +304,32 @@ OR
                         /data/mm9 \
                         /ref/mouse_0
 
-3) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and isoform-gene information stored in 'mapping.txt'. We want to add 125bp long poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an alternative aligner to align reads against either 'mouse_125.idx.fa' or 'mouse_125.idx.n2g.fa':
+3) Suppose we want to build STAR indices in the above example and save index files under '/ref' with name 'mouse_0'. Assuming STAR executable is '/sw/STAR', the command will be:
+
+ rsem-prepare-reference --gtf mm9.gtf \
+                        --star \
+                        --star-path /sw/STAR \
+                        -p 8 \
+                        /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
+                        /ref/mouse_0
+
+OR
+
+ rsem-prepare-reference --gtf mm9.gtf \
+                        --star \
+                        --star-path /sw/STAR \
+                        -p 8 \
+                        /data/mm9
+                        /ref/mouse_0
+
+STAR genome index files will be saved under '/ref/'. 
+
+4) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and isoform-gene information stored in 'mapping.txt'. We want to add 125bp long poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an alternative aligner to align reads against either 'mouse_125.idx.fa' or 'mouse_125.idx.n2g.fa':
 
  rsem-prepare-reference --transcript-to-gene-map mapping.txt \
                         --polyA
                         mm9.fasta \
                         mouse_125
 
+
 =cut

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/rsem.git



More information about the debian-med-commit mailing list