[med-svn] [Git][med-team/rna-star][upstream] New upstream version 2.7.5c+dfsg
Steffen Möller
gitlab at salsa.debian.org
Fri Aug 21 20:52:05 BST 2020
Steffen Möller pushed to branch upstream at Debian Med / rna-star
Commits:
63ea0026 by Steffen Moeller at 2020-08-21T21:46:35+02:00
New upstream version 2.7.5c+dfsg
- - - - -
24 changed files:
- CHANGES.md
- README.md
- doc/STARmanual.pdf
- extras/doc-latex/STARmanual.tex
- extras/doc-latex/parametersDefault.tex
- extras/docker/Dockerfile
- source/ErrorWarning.cpp
- source/GTF.cpp
- source/Genome_genomeGenerate.cpp
- source/IncludeDefine.h
- source/Makefile
- source/Parameters.cpp
- source/ParametersSolo.cpp
- source/Parameters_openReadsFiles.cpp
- source/Parameters_readSAMheader.cpp
- source/ReadAlignChunk_processChunks.cpp
- source/ReadAlign_alignBAM.cpp
- source/STAR.cpp
- source/SoloFeature_collapseUMI.cpp
- source/Transcriptome_classifyAlign.cpp
- source/VERSION
- source/genomeScanFastaFiles.cpp
- source/parametersDefault
- source/readLoad.cpp
Changes:
=====================================
CHANGES.md
=====================================
@@ -1,5 +1,26 @@
+STAR 2.7.5c --- 2020/08/16
+==========================
+Bug-fix release.
+----------------
+
+* Issue #988: proceed reading from GTF after a warning that exon end is past chromosome end.
+* Issue #978: fixed corrupted transcriptInfo.tab in genome generation for cases where GTF file contains extra chromosomes not present in FASTA files.
+* Issue #945: otuput GX/GN for --soloFeatures GeneFull .
+* Implemented removal of control characters from the ends of input read lines, for compatibility with files pre-processed on Windows.
+
+STAR 2.7.5b --- 2020/08/01
+==========================
+Bug-fix release.
+----------------
+
+* Issue #558: Fixed a bug that can cause a seg-fault in STARsolo run with paired-end reads that have protruding ends.
+* Issue #952: Increased the maximum allowed length of the SAM tags in the input SAM files.
+* Issue #955: fixed seg-fault-causing bug for --soloFeatures SJ option.
+* Issue #963: When reading GTF file, skip any exons that extend past the end of the chromosome, and give a warning.
+* Issue #965: output genome sizes with and without padding into Log.out.
* Docker build: switched to debian:stable-slim in the Dockerfile.
* --soloType CB_samTagOut now allows output of (uncorrected) UMI sequences and quality scores with SAM tags UR and UY.
+* Throw an error if FIFO file cannot be created on non-Linux partitions.
STAR 2.7.5a 2020/06/16
======================
=====================================
README.md
=====================================
@@ -1,7 +1,7 @@
STAR 2.7
========
Spliced Transcripts Alignment to a Reference
-© Alexander Dobin, 2009-2019
+© Alexander Dobin, 2009-2020
https://www.ncbi.nlm.nih.gov/pubmed/23104886
AUTHOR/SUPPORT
@@ -35,9 +35,9 @@ Download the latest [release from](https://github.com/alexdobin/STAR/releases) a
```bash
# Get latest STAR source from releases
-wget https://github.com/alexdobin/STAR/archive/2.7.5a.tar.gz
-tar -xzf 2.7.5a.tar.gz
-cd STAR-2.7.5a
+wget https://github.com/alexdobin/STAR/archive/2.7.5c.tar.gz
+tar -xzf 2.7.5c.tar.gz
+cd STAR-2.7.5c
# Alternatively, get STAR source using git
git clone https://github.com/alexdobin/STAR.git
=====================================
doc/STARmanual.pdf
=====================================
Binary files a/doc/STARmanual.pdf and b/doc/STARmanual.pdf differ
=====================================
extras/doc-latex/STARmanual.tex
=====================================
@@ -34,7 +34,7 @@
\newcommand{\sechyperref}[1]{\hyperref[#1]{Section \ref{#1}. \nameref{#1}}}
-\title{STAR manual 2.7.5a}
+\title{STAR manual 2.7.5c}
\author{Alexander Dobin\\
dobin at cshl.edu}
\maketitle
@@ -292,28 +292,83 @@ The \opt{outSAMmultNmax} parameter limits the number of output alignments (SAM l
\subsubsection{SAM attributes.}
-The SAM attributes can be specified by the user using \opt{outSAMattributes} \optvr{A1 A2 A3 ...} option which accept a list of 2-character SAM attributes. The implemented attributes are: \optv{NH HI NM MD AS nM jM jI XS}. By default, STAR outputs \optv{NH HI AS nM} attributes.
-\begin{itemize}
+The SAM attributes can be specified by the user using \opt{outSAMattributes} \optvr{A1 A2 A3 ...} option which accept a list of 2-character SAM attributes. The attributes can be listed in any order, and will be recorded in that order in the SAM file. By default, STAR outputs \optv{NH HI AS nM} attributes.
+
+\begin{itemize}[itemsep=1pt]
+\item[]
+\textbf{Presets:}
+%
+\item[]
+\optv{None} : No SAM attributes
+%
\item[]
-\optv{NH HI NM MD} : have standard meaning as defined in the SAM format specifications.
+\optv{Standard} : NH HI AS nM
+%
+\item[]
+\optv{All} : NH HI AS nM NM MD jM jI MC ch
+
+\textbf{Alignment:}
+%
\item[]
-\optv{AS} : id the local alignment score (paired for paired-end reads).
+\optv{NH} : number of loci the reads maps to: $=1$ for unique mappers, $>1$ for multimappers. Standard SAM tag.
+%
\item[]
-\optv{nM} : is the number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches in each mate..
+\optv{HI} : multiple alignment index, starts with --outSAMattrIHstart ($=1$ by default). Standard SAM tag.
+%
+\item[]
+\optv{AS} : local alignment score, $+1/-1$ for matches/mismateches, score* penalties for indels and gaps. For PE reads, total score for two mates. Stadnard SAM tag.
+%
+\item[]
+\optv{NM} : edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.
+%
+\item[]
+\optv{nM} : number of mismatches per (paired) alignment, not to be confused with \optv{NM}, which is the number of mismatches+indels in each mate..
+%
\item[]
\optv{jM:B:c,M1,M2,...} : intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
+%
+\item[]
+\optv{MD} : string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.
+%
\item[]
\optv{jI:B:I,Start1,End1,Start2,End2,...} : Start and End of introns for all junctions (1-based).
+%
\item[]
\optv{jM jI} : attributes require samtools 0.1.18 or later, and were reported to be incompatible with some downstream tools such as Cufflinks.
+
+\textbf{Variation:}
+%
+\item[]
+\optv{vA} : variant allele.
+%
+\item[]
+\optv{vG} : genomic coordinate of the variant overlapped by the read.
+%
+\item[]
+\optv{vW} : WASP filtering tag, see detailed description in Section \ref{section:WASP}. Requires \opt{waspOutputMode} \optv{SAMtag}.
+
+\textbf{STARsolo:}
+%
\item[]
-\optv{vA} : variant allele
+\optv{CR CY UR UY} : sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected.
+%
\item[]
-\optv{vG} : genomic coordiante of the variant overlapped by the read
+\optv{GX GN} : gene ID and name.
+%
\item[]
-\optv{vW} : WASP filtering tag, see detailed description in Section \ref{section:WASP}==. Requires --waspOutputMode SAMtag
+\optv{CB UB} : error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires \opt{outSAMtype} \optv{BAM SortedByCoordinate}.
+%
\item[]
-\optv{CR CY UR UY} : STARsolo: sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing, not error corrected
+\optv{sM} : assessment of CB and UMI.
+%
+\item[]
+\optv{sS} : sequence of the entire barcode (CB,UMI,adapter...).
+%
+\item[]
+\optv{sQ} : quality of the entire barcode.
+
+
+\textbf{Unmapped reads:}
\item[]
\optv{uT} : for unmapped reads, reason for not mapping:
\begin{itemize}[noitemsep,topsep=-3pt]
=====================================
extras/doc-latex/parametersDefault.tex
=====================================
@@ -291,35 +291,47 @@
\end{optOptTable}
\optName{outSAMattributes}
\optValue{Standard}
- \optLine{string: a string of desired SAM attributes, in the order desired for the output SAM}
+ \optLine{string: a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order.}
+ \optLine{***Presets:}
\begin{optOptTable}
- \optOpt{NH HI AS nM NM MD jM jI XS MC ch} \optOptLine{any combination in any order}
\optOpt{None} \optOptLine{no attributes}
\optOpt{Standard} \optOptLine{NH HI AS nM}
\optOpt{All} \optOptLine{NH HI AS nM NM MD jM jI MC ch}
\end{optOptTable}
- \optLine{variation:}
+ \optLine{***Alignment:}
+\begin{optOptTable}
+ \optOpt{NH} \optOptLine{number of loci the reads maps to: =1 for unique mappers, {\textgreater}1 for multimappers. Standard SAM tag.}
+ \optOpt{HI} \optOptLine{multiple alignment index, starts with --outSAMattrIHstart (=1 by default). Standard SAM tag.}
+ \optOpt{AS} \optOptLine{local alignment score, +1/-1 for matches/mismateches, score* penalties for indels and gaps. For PE reads, total score for two mates. Stadnard SAM tag.}
+ \optOpt{nM} \optOptLine{number of mismatches. For PE reads, sum over two mates.}
+ \optOpt{NM} \optOptLine{edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.}
+ \optOpt{MD} \optOptLine{string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.}
+ \optOpt{jM} \optOptLine{intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.}
+ \optOpt{jI} \optOptLine{start and end of introns for all junctions (1-based).}
+ \optOpt{XS} \optOptLine{alignment strand according to --outSAMstrandField.}
+ \optOpt{MC} \optOptLine{mate's CIGAR string. Standard SAM tag.}
+ \optOpt{ch} \optOptLine{marks all segment of all chimeric alingments for --chimOutType WithinBAM output.}
+\end{optOptTable}
+ \optLine{***Variation:}
\begin{optOptTable}
\optOpt{vA} \optOptLine{variant allele}
- \optOpt{ha} \optOptLine{haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid}
- \optOpt{vG} \optOptLine{genomic coordinate of the variant overlapped by the read}
+ \optOpt{vG} \optOptLine{genomic coordinate of the variant overlapped by the read.}
\optOpt{vW} \optOptLine{1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.}
\end{optOptTable}
- \optLine{STARsolo:}
+ \optLine{***STARsolo:}
\begin{optOptTable}
- \optOpt{CR CY UR UY} \optOptLine{sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing}
+ \optOpt{CR CY UR UY} \optOptLine{sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing.}
+ \optOpt{GX GN} \optOptLine{gene ID and gene name.}
\optOpt{CB UB} \optOptLine{error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.}
- \optOpt{sM} \optOptLine{assessment of CB and UMI}
-\end{optOptTable}
- \optLine{sS ... sequence of the entire barcode (CB,UMI,adapter...)}
-\begin{optOptTable}
- \optOpt{sQ} \optOptLine{quality of the entire barcode}
- \optOpt{GX GN} \optOptLine{gene ID and gene name}
+ \optOpt{sM} \optOptLine{assessment of CB and UMI.}
+ \optOpt{sS} \optOptLine{sequence of the entire barcode (CB,UMI,adapter).}
+ \optOpt{sQ} \optOptLine{quality of the entire barcode.}
\end{optOptTable}
- \optLine{Unsupported/undocumented:}
+ \optLine{***Unsupported/undocumented:}
\begin{optOptTable}
- \optOpt{rB} \optOptLine{alignment block read/genomic coordinates}
- \optOpt{vR} \optOptLine{read coordinate of the variant}
+ \optOpt{ha} \optOptLine{haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid .}
+ \optOpt{rB} \optOptLine{alignment block read/genomic coordinates.}
+ \optOpt{vR} \optOptLine{read coordinate of the variant.}
\end{optOptTable}
\optName{outSAMattrIHstart}
\optValue{1}
=====================================
extras/docker/Dockerfile
=====================================
@@ -2,7 +2,7 @@ FROM debian:stable-slim
MAINTAINER dobin at cshl.edu
-ARG STAR_VERSION=2.7.5a
+ARG STAR_VERSION=2.7.5c
ENV PACKAGES gcc g++ make wget zlib1g-dev unzip
=====================================
source/ErrorWarning.cpp
=====================================
@@ -5,10 +5,12 @@
#include "TimeFunctions.h"
#include "GlobalVariables.h"
-void exitWithError(string messageOut, ostream &streamOut1, ostream &streamOut2, int errorInt, Parameters &P) {
+void exitWithError(string messageOut, ostream &streamOut1, ostream &streamOut2, int errorInt, Parameters &P)
+{
+ if (P.runThreadN>1)
+ pthread_mutex_lock(&g_threadChunks.mutexError);
time_t timeCurrent;
time( &timeCurrent);
- if (P.runThreadN>1) pthread_mutex_lock(&g_threadChunks.mutexError);
if (streamOut1.good()) {
streamOut1 << "\n" << messageOut << endl << timeMonthDayTime(timeCurrent) <<" ...... FATAL ERROR, exiting\n" <<flush;
};
@@ -20,14 +22,16 @@ void exitWithError(string messageOut, ostream &streamOut1, ostream &streamOut2,
exit(errorInt);
};
-void warningMessage(string messageOut, ostream &streamOut1, ostream &streamOut2, Parameters &P) {
- time_t timeCurrent;
- time( &timeCurrent);
- if (P.runThreadN>1) pthread_mutex_lock(&g_threadChunks.mutexError);
+void warningMessage(string messageOut, ostream &streamOut1, ostream &streamOut2, Parameters &P)
+{
+ if (P.runThreadN>1)
+ pthread_mutex_lock(&g_threadChunks.mutexError);
if (streamOut1.good()) {
- streamOut1 << timeMonthDayTime(timeCurrent) <<"\tWARNING: " << messageOut <<endl;
+ streamOut1 << "!!!!! WARNING: " << messageOut <<endl;
};
if (streamOut2.good()) {
- streamOut2 << timeMonthDayTime(timeCurrent) << "\tWARNING: " << messageOut <<endl;
+ streamOut2 << "!!!!! WARNING: " << messageOut <<endl;
};
-};
\ No newline at end of file
+ if (P.runThreadN>1)
+ pthread_mutex_unlock(&g_threadChunks.mutexError);
+};
=====================================
source/GTF.cpp
=====================================
@@ -71,7 +71,7 @@ GTF::GTF(Genome &genome, Parameters &P, const string &dirOut, SjdbClass &sjdbLoc
if (genome.pGe.sjdbGTFchrPrefix!="-") chr1=genome.pGe.sjdbGTFchrPrefix + chr1;
if (genome.chrNameIndex.count(chr1)==0) {//chr not in Genome
- P.inOut->logMain << "WARNING: while processing pGe.sjdbGTFfile=" << genome.pGe.sjdbGTFfile <<": chromosome '"<<chr1<<"' not found in Genome fasta files for line:\n";
+ P.inOut->logMain << "WARNING: while processing sjdbGTFfile=" << genome.pGe.sjdbGTFfile <<": chromosome '"<<chr1<<"' not found in Genome fasta files for line:\n";
P.inOut->logMain << oneLine <<"\n"<<flush;
continue; //do not process exons/transcripts on missing chromosomes
};
@@ -79,6 +79,12 @@ GTF::GTF(Genome &genome, Parameters &P, const string &dirOut, SjdbClass &sjdbLoc
uint64 ex1,ex2;
char str1;
oneLineStream >> ex1 >> ex2 >> ddd2 >> str1 >> ddd2; //read all fields except the last
+ if ( ex2 > genome.chrLength[genome.chrNameIndex[chr1]] ) {
+ warningMessage("while processing sjdbGTFfile=" + genome.pGe.sjdbGTFfile + ", line:\n" + oneLine + "\n exon end = " + to_string(ex2) + \
+ " is larger than the chromosome " + chr1 + " length = " + to_string(genome.chrLength[genome.chrNameIndex[chr1]] ) + " , will skip this exon\n", \
+ std::cerr, P.inOut->logMain, P);
+ continue;
+ };
string oneLine1;
getline(oneLineStream, oneLine1);//get the last field
@@ -152,7 +158,11 @@ GTF::GTF(Genome &genome, Parameters &P, const string &dirOut, SjdbClass &sjdbLoc
if (exonN==0) {
ostringstream errOut;
errOut << "Fatal INPUT FILE error, no valid ""exon"" lines in the GTF file: " << genome.pGe.sjdbGTFfile <<"\n";
- errOut << "Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.\n";
+ errOut << "Solution: check the formatting of the GTF file. One likely cause is the difference in chromosome naming between GTF and FASTA file.\n";
exitWithError(errOut.str(),std::cerr, P.inOut->logMain, EXIT_CODE_INPUT_FILES, P);
};
+
+ exonLoci.resize(exonN); //previous exonLoci.size() was an estimate, need to reszie to the actual value
+
+ return;
};
=====================================
source/Genome_genomeGenerate.cpp
=====================================
@@ -142,7 +142,14 @@ void Genome::genomeGenerate() {
nGenome = genomeScanFastaFiles(P,NULL,false,*this);//first scan the fasta file to find all the sizes
genomeSequenceAllocate(nGenome, nG1alloc, G, G1);
genomeScanFastaFiles(P,G,true,*this); //load the genome sequence
-
+
+ uint64 nGenomeTrue=0;
+ for (auto &cl : chrLength)
+ nGenomeTrue += cl;
+
+ P.inOut->logMain <<"Genome sequence total length = " << nGenomeTrue << "\n";
+ P.inOut->logMain <<"Genome size with padding = "<< nGenome <<"\n";
+
//consensusSequence(); //replace with consensus allele
SjdbClass sjdbLoci; //will be filled in transcriptGeneSJ below
@@ -157,10 +164,10 @@ void Genome::genomeGenerate() {
sjdbLoadFromFiles(P,sjdbLoci);//this will not be transformed. TODO prevent this parameter combination
- if (pGe.gSAindexNbases > log2(nGenome)/2-1) {
+ if (pGe.gSAindexNbases > log2(nGenomeTrue)/2-1) {
ostringstream warnOut;
- warnOut << "--genomeSAindexNbases " << pGe.gSAindexNbases << " is too large for the genome size=" << nGenome;
- warnOut << ", which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases " << int(log2(nGenome)/2-1);
+ warnOut << "--genomeSAindexNbases " << pGe.gSAindexNbases << " is too large for the genome size=" << nGenomeTrue;
+ warnOut << ", which may cause seg-fault at the mapping step. Re-run genome generation with recommended --genomeSAindexNbases " << int(log2(nGenomeTrue)/2-1);
warningMessage(warnOut.str(),P.inOut->logMain,std::cerr,P);
};
@@ -183,7 +190,7 @@ void Genome::genomeGenerate() {
// GstrandBit
GstrandBit = (char) (uint) floor(log(nGenome+P.limitSjdbInsertNsj*sjdbLength)/log(2))+1; //GstrandBit uses P.limitSjdbInsertNsj even if no insertion requested, in case it will be requested at the mapping stage
if (GstrandBit<32) GstrandBit=32; //TODO: should not this be 31? Need to test for small genomes. TODO: use simple access function for SA
- P.inOut->logMain <<"Estimated genome size="<<nGenome+P.limitSjdbInsertNsj*sjdbLength<<" = "<<nGenome<<" + "<<P.limitSjdbInsertNsj*sjdbLength<<"\n";
+ P.inOut->logMain <<"Estimated genome size with padding and SJs: total=genome+SJ="<<nGenome+P.limitSjdbInsertNsj*sjdbLength<<" = "<<nGenome<<" + "<<P.limitSjdbInsertNsj*sjdbLength<<"\n";
P.inOut->logMain << "GstrandBit=" << int(GstrandBit) <<"\n";
GstrandMask = ~(1LLU<<GstrandBit);
SA.defineBits(GstrandBit+1,nSA);
=====================================
source/IncludeDefine.h
=====================================
@@ -117,12 +117,13 @@ typedef uint8_t uint8;
#define BAM_CIGAR_X 8
+
+#define BAM_ATTR_MaxSize 10000
+
#if defined COMPILE_FOR_LONG_READS
#define MAX_N_EXONS 1000
- #define BAM_ATTR_MaxSize 10000
#else
#define MAX_N_EXONS 20
- #define BAM_ATTR_MaxSize 1000
#endif
//input reads
@@ -151,6 +152,7 @@ typedef uint8_t uint8;
#define EXIT_CODE_FILE_OPEN 109
#define EXIT_CODE_FILE_WRITE 110
#define EXIT_CODE_INCONSISTENT_DATA 111
+#define EXIT_CODE_FIFO 112
//cleaned-end
=====================================
source/Makefile
=====================================
@@ -27,7 +27,7 @@ CFLAGS ?= -pipe -Wall -Wextra -O3
# Unconditionally set essential flags and optimization options
CXXFLAGS_common := -std=c++11 -fopenmp $(COMPTIMEPLACE)
CXXFLAGS_main := -O3 $(CXXFLAGS_common)
-CXXFLAGS_gdb := -O -g3 $(CXXFLAGS_common)
+CXXFLAGS_gdb := -O0 -g3 $(CXXFLAGS_common)
##########################################################################################################
OBJECTS = soloInputFeatureUMI.o SoloFeature_countSmartSeq.o SoloFeature_redistributeReadsByCB.o \
=====================================
source/Parameters.cpp
=====================================
@@ -1339,6 +1339,7 @@ void Parameters::inputParameters (int argInN, char* argIn[]) {//input parameters
};
alignEndsProtrude.nBasesMax=stoi(alignEndsProtrude.in.at(0),nullptr);
+ alignEndsProtrude.concordantPair=false;
if (alignEndsProtrude.nBasesMax>0) {//allow ends protrusion
if (alignEndsProtrude.in.at(1)=="ConcordantPair") {
alignEndsProtrude.concordantPair=true;
=====================================
source/ParametersSolo.cpp
=====================================
@@ -137,13 +137,21 @@ void ParametersSolo::initialize(Parameters *pPin)
exitWithError(errOut.str(),std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
- if (featureYes[SoloFeatureTypes::Gene]) {
- pP->quant.gene.yes=true;
- pP->quant.yes = true;
+ if (featureYes[SoloFeatureTypes::Velocyto] && type==SoloTypes::SmartSeq) {
+ string errOut = "EXITING because of fatal PARAMETERS error: --soloFeatures Velocyto is presently not compatible with --soloType SmartSeq .\n";
+ errOut += "SOLUTION: re-run without --soloFeatures Velocyto .";
+ exitWithError(errOut, std::cerr, pP->inOut->logMain, EXIT_CODE_PARAMETER, *pP);
};
- if (featureYes[SoloFeatureTypes::GeneFull]) {
- pP->quant.geneFull.yes=true;
- pP->quant.yes = true;
+
+ if (type != SoloTypes::CB_samTagOut) {//gene quantification is needed
+ if (featureYes[SoloFeatureTypes::Gene]) {
+ pP->quant.gene.yes=true;
+ pP->quant.yes = true;
+ };
+ if (featureYes[SoloFeatureTypes::GeneFull]) {
+ pP->quant.geneFull.yes=true;
+ pP->quant.yes = true;
+ };
};
////////////////////////////////////////////umiDedup
=====================================
source/Parameters_openReadsFiles.cpp
=====================================
@@ -27,7 +27,13 @@ void Parameters::openReadsFiles()
sysCom << outFileTmp <<"tmp.fifo.read"<<imate+1;
readFilesInTmp.push_back(sysCom.str());
remove(readFilesInTmp.at(imate).c_str());
- mkfifo(readFilesInTmp.at(imate).c_str(), S_IRUSR | S_IWUSR );
+ if (mkfifo(readFilesInTmp.at(imate).c_str(), S_IRUSR | S_IWUSR ) != 0) {
+ exitWithError("Exiting because of *FATAL ERROR*: could not create FIFO file " + readFilesInTmp.at(imate) + "\n"
+ + "SOLUTION: check the if run directory supports FIFO files.\n"
+ + "If run partition does not support FIFO (e.g. Windows partitions FAT, NTFS), "
+ + "re-run on a Linux partition, or point --outTmpDir to a Linux partition.\n"
+ , std::cerr, inOut->logMain, EXIT_CODE_FIFO, *this);
+ };
inOut->logMain << "\n Input read files for mate "<< imate+1 <<" :\n";
=====================================
source/Parameters_readSAMheader.cpp
=====================================
@@ -18,7 +18,13 @@ void Parameters::readSAMheader(const string readFilesCommandString, const vector
string tmpFifo=outFileTmp+"tmp.fifo.header";
remove(tmpFifo.c_str());
- mkfifo(tmpFifo.c_str(), S_IRUSR | S_IWUSR );
+ if (mkfifo(tmpFifo.c_str(), S_IRUSR | S_IWUSR ) != 0) {
+ exitWithError("Exiting because of *FATAL ERROR*: could not create FIFO file " + tmpFifo + "\n"
+ + "SOLUTION: check the if run directory supports FIFO files.\n"
+ + "If run partition does not support FIFO (e.g. Windows partitions FAT, NTFS), "
+ + "re-run on a Linux partition, or point --outTmpDir to a Linux partition.\n"
+ , std::cerr, inOut->logMain, EXIT_CODE_FIFO, *this);
+ };
ifstream tmpFifoIn;
for (uint32 ii=0; ii<readFilesNames.size(); ii++) {
=====================================
source/ReadAlignChunk_processChunks.cpp
=====================================
@@ -5,6 +5,12 @@
#include "SequenceFuns.h"
#include "GlobalVariables.h"
+inline void removeStringEndControl(string &str)
+{//removes control character (including space) from the end of the string
+ if (int(str.back())<33)
+ str.pop_back();
+};
+
void ReadAlignChunk::processChunks() {//read-map-write chunks
noReadsLeft=false; //true if there no more reads left in the file
bool newFile=false; //new file marker in the input stream
@@ -79,6 +85,7 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
if (P.outFilterBySJoutStage!=2) {//not the 2nd stage of the 2-stage mapping, read ID from the 1st read
string readID;
P.inOut->readIn[0] >> readID;
+ removeStringEndControl(readID);
if (P.outSAMreadIDnumber) {
readID="@"+to_string(P.iReadAll);
};
@@ -87,9 +94,11 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
if (P.inOut->readIn[0].peek()!='\n') {//2nd field exists
string field2;
P.inOut->readIn[0] >> field2;
- if (field2.length()>=3 && field2.at(1)==':' && field2.at(2)=='Y' && field2.at(3)==':' )
+ if (field2.length()>=3 && field2[1]==':' && field2[2]=='Y' && field2[3]==':' )
passFilterIllumina='Y';
};
+
+ //add extra information to readID line
readID += ' '+ to_string(P.iReadAll)+' '+passFilterIllumina+' '+to_string(P.readFilesIndex);
//ignore the rest of the read name for both mates
@@ -99,6 +108,7 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
if (P.pSolo.barcodeReadYes) {//record barcode sequence
string seq1;
getline(P.inOut->readIn[P.pSolo.barcodeRead],seq1);
+ removeStringEndControl(seq1);
if (seq1.size() != P.pSolo.bL) {
if (P.pSolo.bL > 0) {
ostringstream errOut;
@@ -114,6 +124,7 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
readID += ' ' + seq1;
P.inOut->readIn[P.pSolo.barcodeRead].ignore(DEF_readNameSeqLengthMax,'\n');//skip to the end of 3rd ("+") line
getline(P.inOut->readIn[P.pSolo.barcodeRead],seq1); //read qualities
+ removeStringEndControl(seq1);
readID += ' ' + seq1;
g_statsAll.qualHistCalc(1, seq1.c_str()+P.pSolo.barcodeStart, P.pSolo.barcodeEnd==0 ? seq1.size() : P.pSolo.barcodeEnd-P.pSolo.barcodeStart+1);
};
@@ -127,12 +138,19 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
//copy 3 (4 for stage 2) lines: sequence, dummy, quality
for (uint imate=0; imate<P.readNmates; imate++) {
for (uint iline=(P.outFilterBySJoutStage==2 ? 0:1);iline<4;iline++) {
+ uint64 origChunkStart=chunkInSizeBytesTotal[imate];
+
P.inOut->readIn[imate].getline(chunkIn[imate] + chunkInSizeBytesTotal[imate], DEF_readNameSeqLengthMax+1 );
- chunkInSizeBytesTotal[imate] += P.inOut->readIn[imate].gcount();
+ chunkInSizeBytesTotal[imate] += P.inOut->readIn[imate].gcount(); //this includes the final \n
+
+ if ( int(chunkIn[imate][chunkInSizeBytesTotal[imate]-2]) < 33 ) {//remove control char at the end if present
+ chunkInSizeBytesTotal[imate]--;
+ };
+
chunkIn[imate][chunkInSizeBytesTotal[imate]-1]='\n';
if (iline==3 && P.outFilterBySJoutStage!=2) {
- g_statsAll.qualHistCalc(imate, chunkIn[imate] + chunkInSizeBytesTotal[imate] - P.inOut->readIn[imate].gcount(), P.inOut->readIn[imate].gcount());
+ g_statsAll.qualHistCalc(imate, chunkIn[imate] + origChunkStart, chunkInSizeBytesTotal[imate] - origChunkStart);
};
};
};
@@ -151,18 +169,20 @@ void ReadAlignChunk::processChunks() {//read-map-write chunks
P.inOut->readIn[imate].ignore(DEF_readNameSeqLengthMax,'\n');
chunkInSizeBytesTotal[imate] += sprintf(chunkIn[imate] + chunkInSizeBytesTotal[imate], " %llu %c %i \n", P.iReadAll, 'N', P.readFilesIndex);
-
-
};
-// else {//2nd stage of 2-stage mapping
-// read index and file index are already recorded with the read name, simply copy it
-// P.inOut->readIn[imate].getline(chunkIn[imate] + chunkInSizeBytesTotal[imate], DEF_readNameSeqLengthMax+1 );
-// };
+
+ //read multi-line fasta
nextChar=P.inOut->readIn[imate].peek();
- while (nextChar!='@' && nextChar!='>' && nextChar!=' ' && nextChar!='\n' && P.inOut->readIn[imate].good()) {//read multi-line fasta
+ while (nextChar!='@' && nextChar!='>' && nextChar!=' ' && nextChar!='\n' && P.inOut->readIn[imate].good()) {
P.inOut->readIn[imate].getline(chunkIn[imate] + chunkInSizeBytesTotal[imate], DEF_readSeqLengthMax + 1 );
- if (P.inOut->readIn[imate].gcount()<2) break; //no more input
- chunkInSizeBytesTotal[imate] += P.inOut->readIn[imate].gcount()-1;
+ if (P.inOut->readIn[imate].gcount()<2)
+ break; //no more input
+
+ chunkInSizeBytesTotal[imate] += P.inOut->readIn[imate].gcount()-1; //-1 because \n was counted, bu wee need to remove it
+ if ( int(chunkIn[imate][chunkInSizeBytesTotal[imate]-1]) < 33 ) {//remove control char at the end if present
+ chunkInSizeBytesTotal[imate]--;
+ };
+
nextChar=P.inOut->readIn[imate].peek();
};
chunkIn[imate][chunkInSizeBytesTotal[imate]]='\n';
=====================================
source/ReadAlign_alignBAM.cpp
=====================================
@@ -374,13 +374,25 @@ int ReadAlign::alignBAM(Transcript const &trOut, uint nTrOut, uint iTrOut, uint
case ATTR_UY:
attrN+=bamAttrArrayWrite(soloRead->readBar->umiQual,"UY",attrOutArray+attrN);
break;
+
case ATTR_GX:
- if (readAnnot.geneConcordant.size()==1 && trOut.alignGenes.size()==1) //only output if read maps to a single gene, and this alignment maps to one gene
- attrN+=bamAttrArrayWrite(chunkTr->geID[*trOut.alignGenes.begin()],"GX",attrOutArray+attrN);
+ if ( P.quant.gene.yes ) {
+ if (readAnnot.geneConcordant.size()==1 && trOut.alignGenes.size()==1) //only output if read maps to a single gene, and this alignment maps to one gene
+ attrN+=bamAttrArrayWrite(chunkTr->geID[*trOut.alignGenes.begin()],"GX",attrOutArray+attrN);
+ } else if ( P.quant.geneFull.yes ) {
+ if ( readAnnot.geneFull.size()==1 ) //only output if read maps to a single gene
+ attrN+=bamAttrArrayWrite(chunkTr->geID[*readAnnot.geneFull.begin()],"GX",attrOutArray+attrN);
+ };
break;
+
case ATTR_GN:
- if (readAnnot.geneConcordant.size()==1 && trOut.alignGenes.size()==1) //only output if read maps to a single gene, and this alignment maps to one gene
- attrN+=bamAttrArrayWrite(chunkTr->geName[*trOut.alignGenes.begin()],"GN",attrOutArray+attrN);
+ if ( P.quant.gene.yes ) {
+ if (readAnnot.geneConcordant.size()==1 && trOut.alignGenes.size()==1) //only output if read maps to a single gene, and this alignment maps to one gene
+ attrN+=bamAttrArrayWrite(chunkTr->geName[*trOut.alignGenes.begin()],"GN",attrOutArray+attrN);
+ } else if ( P.quant.geneFull.yes ) {
+ if ( readAnnot.geneFull.size()==1 ) //only output if read maps to a single gene
+ attrN+=bamAttrArrayWrite(chunkTr->geName[*readAnnot.geneFull.begin()],"GN",attrOutArray+attrN);
+ };
break;
case ATTR_sM:
=====================================
source/STAR.cpp
=====================================
@@ -49,8 +49,7 @@ void usage(int usageType) {
parametersDefault_len);
};
exit(0);
-}
-
+};
int main(int argInN, char* argIn[]) {
// If no argument is given, or the first argument is either '-h' or '--help', run usage()
=====================================
source/SoloFeature_collapseUMI.cpp
=====================================
@@ -89,7 +89,7 @@ uint32 graphNumberOfConnectedComponents(uint32 N, vector<array<uint32,2>> V, vec
uint32 nConnComp=0;
for (uint32 ii=0; ii<N; ii++) {
- if (nodeEdges[ii].size()==0) {//this node is not connected, no need to check. Save time beacuse this happens often
+ if (nodeEdges[ii].size()==0) {//this node is not connected, no need to check. Save time because this happens often
++nConnComp;
} else if (nodeColor[ii]==(uint32)-1) {//node not visited
++nConnComp;
=====================================
source/Transcriptome_classifyAlign.cpp
=====================================
@@ -22,11 +22,13 @@ int alignToTranscript(Transcript &aG, uint64 trS1, uint16 exN1, uint32 *exSE1, u
//distTTS=trLen[tr1]-(gthaG.exons[iab][EX_G]-trS1-exSE1[2*ex1]+exLenCum1[ex1]+aG.exons[iab][EX_L]);
//if trStr1==2: TTS=trLen[tr1]-TTS, TSS=trLen[tr1]-TSS
- for (uint32 iab=0, ex1=0, bS=0, bE=0, eE=0, enS=0;
- iab<aG.nExons; iab++) {//scan through all blocks of the align
+ for (uint32 iab=0, ex1=0, bS=0, bE=0, eE=0, enS=0; iab<aG.nExons; iab++) {//scan through all blocks of the align
uint64 bEprev=bE;
+ if (aG.exons[iab][EX_G] < trS1) //this can happen for PE reads, when the 2nd mate protrudes to the left of the first
+ return -1;
+
bS=(uint32) (aG.exons[iab][EX_G]-trS1);//block start
bE=bS+aG.exons[iab][EX_L]-1;//block end
=====================================
source/VERSION
=====================================
@@ -1 +1 @@
-#define STAR_VERSION "2.7.5a_2020-06-29"
+#define STAR_VERSION "2.7.5c"
=====================================
source/genomeScanFastaFiles.cpp
=====================================
@@ -44,7 +44,8 @@ uint genomeScanFastaFiles (Parameters &P, char* G, bool flagRun, Genome &mapGen)
mapGen.chrName.push_back(chrName1);
};
- if (!flagRun && mapGen.chrStart.size()>0) mapGen.chrLength.push_back(N-mapGen.chrStart.at(mapGen.chrStart.size()-1)); //true length of the chr
+ if (!flagRun && mapGen.chrStart.size()>0) mapGen.chrLength.push_back(N-mapGen.chrStart.back()); //true length of the chr
+
if (N>0) {//pad the chromosomes to bins boudnaries
N = ( (N+1)/mapGen.genomeChrBinNbases+1 )*mapGen.genomeChrBinNbases;
@@ -52,7 +53,9 @@ uint genomeScanFastaFiles (Parameters &P, char* G, bool flagRun, Genome &mapGen)
if (!flagRun) {
mapGen.chrStart.push_back(N);
- P.inOut->logMain << mapGen.pGe.gFastaFiles.at(ii)<<" : chr # " << mapGen.chrStart.size()-1 << " \""<<mapGen.chrName.at(mapGen.chrStart.size()-1)<<"\" chrStart: "<<N<<"\n"<<flush;
+ P.inOut->logMain << mapGen.pGe.gFastaFiles.at(ii)<<" : chr # " << mapGen.chrStart.size()-1 \
+ << " \""<< mapGen.chrName.back() <<"\"" \
+ << " chrStart: "<<N<<"\n"<<flush;
};
} else {//char lines
if (flagRun) {
@@ -70,15 +73,17 @@ uint genomeScanFastaFiles (Parameters &P, char* G, bool flagRun, Genome &mapGen)
};
if (!flagRun)
- mapGen.chrLength.push_back(N-mapGen.chrStart.at(mapGen.chrStart.size()-1)); //true length of the last chr
+ mapGen.chrLength.push_back(N-mapGen.chrStart.back()); //true length of the last chr
N = ( (N+1)/mapGen.genomeChrBinNbases+1)*mapGen.genomeChrBinNbases;
if (!flagRun) {
mapGen.nChrReal=mapGen.chrStart.size();
mapGen.chrStart.push_back(N); //last chromosome end+1
+ P.inOut->logMain << "Chromosome sequence lengths: \n";
for (uint ii=0;ii<mapGen.nChrReal;ii++) {
mapGen.chrNameIndex[mapGen.chrName[ii]]=ii;
+ P.inOut->logMain << mapGen.chrName[ii] <<"\t"<< mapGen.chrLength[ii] << "\n";
};
};
=====================================
source/parametersDefault
=====================================
@@ -269,26 +269,38 @@ outSAMstrandField None
intronMotif ... strand derived from the intron motif. This option changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out.
outSAMattributes Standard
- string: a string of desired SAM attributes, in the order desired for the output SAM
- NH HI AS nM NM MD jM jI XS MC ch ... any combination in any order
+ string: a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order.
+ ***Presets:
None ... no attributes
Standard ... NH HI AS nM
- All ... NH HI AS nM NM MD jM jI MC ch
- variation:
+ All ... NH HI AS nM NM MD jM jI MC ch
+ ***Alignment:
+ NH ... number of loci the reads maps to: =1 for unique mappers, >1 for multimappers. Standard SAM tag.
+ HI ... multiple alignment index, starts with --outSAMattrIHstart (=1 by default). Standard SAM tag.
+ AS ... local alignment score, +1/-1 for matches/mismateches, score* penalties for indels and gaps. For PE reads, total score for two mates. Stadnard SAM tag.
+ nM ... number of mismatches. For PE reads, sum over two mates.
+ NM ... edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag.
+ MD ... string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag.
+ jM ... intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value.
+ jI ... start and end of introns for all junctions (1-based).
+ XS ... alignment strand according to --outSAMstrandField.
+ MC ... mate's CIGAR string. Standard SAM tag.
+ ch ... marks all segment of all chimeric alingments for --chimOutType WithinBAM output.
+ ***Variation:
vA ... variant allele
- ha ... haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid
- vG ... genomic coordinate of the variant overlapped by the read
+ vG ... genomic coordinate of the variant overlapped by the read.
vW ... 1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag.
- STARsolo:
- CR CY UR UY ... sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing
+ ***STARsolo:
+ CR CY UR UY ... sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing.
+ GX GN ... gene ID and gene name.
CB UB ... error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate.
- sM ... assessment of CB and UMI
- sS ... sequence of the entire barcode (CB,UMI,adapter...)
- sQ ... quality of the entire barcode
- GX GN ... gene ID and gene name
- Unsupported/undocumented:
- rB ... alignment block read/genomic coordinates
- vR ... read coordinate of the variant
+ sM ... assessment of CB and UMI..
+ sS ... sequence of the entire barcode (CB,UMI,adapter).
+ sQ ... quality of the entire barcode.
+ ***Unsupported/undocumented:
+ ha ... haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid .
+ rB ... alignment block read/genomic coordinates.
+ vR ... read coordinate of the variant.
outSAMattrIHstart 1
int>=0: start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie..
=====================================
source/readLoad.cpp
=====================================
@@ -84,7 +84,7 @@ int readLoad(istream& readInStream, Parameters& P, uint iMate, uint& Lread, uint
} else {
Lread=0;
};
- convertNucleotidesToNumbersRemoveControls(Seq+P.clip5pNbases[iMate],SeqNum,Lread);
+ convertNucleotidesToNumbers(Seq+P.clip5pNbases[iMate],SeqNum,Lread);
//clip the adapter
if (P.clip3pAdapterSeq.at(iMate).length()>0) {
View it on GitLab: https://salsa.debian.org/med-team/rna-star/-/commit/63ea00262b1c169acb719ad4e01b005ef625d0e5
--
View it on GitLab: https://salsa.debian.org/med-team/rna-star/-/commit/63ea00262b1c169acb719ad4e01b005ef625d0e5
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20200821/a84cd0f9/attachment-0001.html>
More information about the debian-med-commit
mailing list