[med-svn] [subread] 01/05: New upstream version 1.5.2+dfsg
Alex Mestiashvili
malex-guest at moszumanska.debian.org
Tue Mar 21 15:28:04 UTC 2017
This is an automated email from the git hooks/post-receive script.
malex-guest pushed a commit to branch master
in repository subread.
commit 5ca905c2dc7b6335c301c4c0d59dc336fc757a4b
Author: Alexandre Mestiashvili <alex at biotec.tu-dresden.de>
Date: Tue Mar 21 09:35:10 2017 +0100
New upstream version 1.5.2+dfsg
---
README.txt | 56 +-
doc/SubreadUsersGuide.tex | 166 +++---
src/HelperFunctions.c | 169 ++++++
src/HelperFunctions.h | 4 +
src/Makefile.Linux | 5 +-
src/SNPCalling.c | 62 +-
src/core-bigtable.c | 2 +-
src/core-bigtable.h | 2 +-
src/core-indel.c | 931 +++++++++++++++++++++++-------
src/core-indel.h | 19 +-
src/core-interface-aligner.c | 177 ++++--
src/core-interface-subjunc.c | 164 ++++--
src/core-junction.c | 332 +++++------
src/core-junction.h | 2 +
src/core.c | 824 +++++++++++++++++---------
src/core.h | 20 +-
src/coverage_calc.c | 67 ++-
src/gene-algorithms.c | 17 +-
src/gene-value-index.c | 32 +-
src/gene-value-index.h | 2 +-
src/hashtable.c | 45 +-
src/hashtable.h | 19 +-
src/index-builder.c | 57 +-
src/input-files.c | 239 +++++---
src/input-files.h | 12 +-
src/makefile.version | 2 +-
src/propmapped.c | 35 +-
src/readSummary.c | 402 +++++++------
src/sambam-file.c | 47 +-
src/sambam-file.h | 6 +-
src/sorted-hashtable.c | 57 +-
src/subread.h | 10 +-
test/featureCounts/data/test-chralias.GTF | 23 +
test/featureCounts/del4.FC | 9 +
test/featureCounts/del4.FC.summary | 12 +
35 files changed, 2809 insertions(+), 1219 deletions(-)
diff --git a/README.txt b/README.txt
index 10c9779..8478e65 100644
--- a/README.txt
+++ b/README.txt
@@ -14,55 +14,56 @@ For FreeBSD OS, use command:
gmake -f Makefile.FreeBSD
-For OpenSolaris/Oracle Solaris OS, use command:
-
-gmake -f Makefile.SunOS
-
If the build is successful, a new directory called 'bin' will be created under the home directory of the package (ie. one level up from 'src' directory). The 'bin' directory contains all the generated executables. To enable easy access to these executables, you may copy the executables to a system directory such as '/usr/bin' or add the path to the executables to your search path (add path to your environment variable `PATH').
Content
--------------
-annotation Directory including NCBI RefSeq gene annotations for genomes 'hg19', 'mm10' and 'mm9'.
- Each row is an exon. Entrez gene identifiers and chromosomal coordinates are provided for each exon.
-bin Directory including executables after compilation (or directly available from a binary release).
-doc Directory including the users manual.
-LICENSE The license agreement for using this package.
-README.txt This file.
-src Directory including source code (binary releases do not have this directory).
-test Directory including test data and scripts.
+annotation Directory including NCBI RefSeq gene annotations for genomes 'hg19', 'hg38', 'mm10' and 'mm9'.
+ Each row is an exon. Entrez gene identifiers and chromosomal coordinates are provided for each exon.
+bin Directory including executables after compilation (or directly available from a binary release).
+doc Directory including the users manual.
+LICENSE The license agreement for using this package.
+README.txt This file.
+src Directory including source code (binary releases do not have this directory).
+test Directory including test data and scripts.
A Quick Start
--------------
-An index should be built before carrying out read alignments:
+Build index for a reference genome:
+
+ subread-buildindex -o my_index chr1.fa chr2.fa ...
+ (You may provide a single FASTA file including all chromosomal sequences).
-subread-buildindex -o my_index chr1.fa chr2.fa ...
-(You may provide a single FASTA file including all chromosomal sequences).
+Align a single-end RNA-seq dataset to the reference genome:
-With built index, you can now align reads to the reference genome. Align single-end reads:
+ subread-align -i my_index -r reads.txt -t 0 -o subread_results.bam
-subread-align -i my_index -r reads.txt -o subread_results.sam
+Align a paired-end genomic DNA-seq dataset to the reference genome:
-Align paired-end reads:
+ subread-align -i my_index -r reads1.txt -R reads2.txt -t 1 -o subread_results_PE.bam
-subread-align -i my_index -r reads1.txt -R reads2.txt -o subread_results_PE.sam
+Detect exon-exon junctions from a paired-end RNA-seq dataset (read mapping results are also produced):
-Detect exon-exon junctions from RNA-seq data (read mapping results are also generated):
+ subjunc -i my_index -r reads1.txt -R reads2.txt -o subjunc_results.bam
-subjunc -i my_index -r reads1.txt -R reads2.txt -o subjunc_results.sam
+Assign mapped RNA-seq reads to mm10 genes using inbuilt annotation:
-Assign mapped reads to genomic features (eg. genes):
+ featureCounts -a ../annotation/mm10_RefSeq_exon.txt -F 'SAF' -o counts.txt subread_results.bam
-featureCounts -a annotation.gtf -o counts.txt subread_results.sam
+Assign mapped RNA-seq reads to hg38 genes using a public GTF annotation:
+
+ featureCounts -a hg38_annotation.gtf -o counts.txt subread_results.bam
Tutorials
-------------------
A short tutorial for Subread - http://bioinf.wehi.edu.au/subread
A short tutorial for Subjunc - http://bioinf.wehi.edu.au/subjunc
A short tutorial for featureCounts - http://bioinf.wehi.edu.au/featureCounts
+A short tutorial for exactSNP - http://bioinf.wehi.edu.au/exactSNP
Users Guide
--------------
-Users Guide can be found in the 'doc' subdirectory. It provides comprehensive descriptions to the programs included in this package.
+Users Guide can be found in the 'doc' subdirectory of this software package or via URL (http://bioinf.wehi.edu.au/subread-package/SubreadUsersGuide.pdf).
Citation
--------------
@@ -72,8 +73,9 @@ Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable rea
If you use the featureCounts program, please cite:
-Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 2013. doi: 10.1093/bioinformatics/btt656
+Liao Y, Smyth GK and Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30, 2014
-Get help
+Mailing lists
--------------
-You may subscribe to the SeqAnswer forum (http://www.seqanswers.com) or the Bioconductor mailing list (http://bioconductor.org/) to get help. Alternatively, you may directly contact Wei Shi (shi at wehi dot edu dot au) or Yang Liao (liao at wehi dot edu dot au) for help.
+Please post your questions/suggestions to Bioconductor support site(https://support.bioconductor.org/) or Subread google group (https://groups.google.com/forum/#!forum/subread).
+
diff --git a/doc/SubreadUsersGuide.tex b/doc/SubreadUsersGuide.tex
index 01f3e3e..1bf7b73 100644
--- a/doc/SubreadUsersGuide.tex
+++ b/doc/SubreadUsersGuide.tex
@@ -35,9 +35,9 @@
\begin{center}
{\Huge\bf Subread/Rsubread Users Guide}\\
\vspace{1 cm}
-{\centering\large Subread v1.5.1/Rsubread v1.23.4\\}
+{\centering\large Subread v1.5.2/Rsubread v1.24.2\\}
\vspace{1 cm}
-\centering 25 August 2016\\
+\centering 15 March 2017\\
\vspace{5 cm}
\Large Wei Shi and Yang Liao\\
\vspace{1 cm}
@@ -47,7 +47,7 @@ The Walter and Eliza Hall Institute of Medical Research\\
The University of Melbourne\\
Melbourne, Australia\\}
\vspace{7 cm}
-\centering Copyright \small{\copyright} 2011 - 2016\\
+\centering Copyright \small{\copyright} 2011 - 2017\\
\end{center}
\end{titlepage}
@@ -94,17 +94,17 @@ These software programs support a variety of sequencing platforms including Illu
\section{Citation}
-If you use {\Subread} or {\Subjunc} aligners, please cite:\\
+If you use {\Subread} or {\Subjunc} aligners, please cite:
\begin{quote}
-Liao Y, Smyth GK and Shi W. The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108, 2013
+Liao Y, Smyth GK and Shi W (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. \emph{Nucleic Acids Research}, 41(10):e108.
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/23558742}} }
\end{quote}
-If you use \featureCounts, please cite:\\
+{\noindent If you use \featureCounts, please cite:}
\begin{quote}
-Liao Y, Smyth GK and Shi W. featureCounts: an efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics, 2013 Nov 30. [Epub ahead of print]
+Liao Y, Smyth GK and Shi W (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. \emph{Bioinformatics}, 30(7):923-30.
\\
{\color{blue}{\url{http://www.ncbi.nlm.nih.gov/pubmed/24227677}}}
\end{quote}
@@ -166,8 +166,8 @@ Alternatively, you may download the {\Rsubread} source package directly from {\c
\section{How to get help}
-Bioconductor mailing list (\url{http://bioconductor.org/}) and SeqAnswer forum (\url{http://www.seqanswers.com}) are the best places to get help and to report bugs.
-Alternatively, you may contact Wei Shi (shi at wehi dot edu dot au) directly.
+Bioconductor support site (\url{https://support.bioconductor.org/}) or Google Subread group (\url{https://groups.google.com/forum/#!forum/subread}) are the best place to post questions or make suggestions.
+
\chapter{The seed-and-vote mapping paradigm}
@@ -215,19 +215,24 @@ Since no mismatches are allowed in the mapping of the subreads, the indels can b
\section{Detection of exon-exon junctions}
\label{sec:junction}
-The seed-and-vote paradigm is also very useful in detecting exon-exon junctions, because the short subreads extracted across the entire read can be used to detect short exons in a sensitive and accurate way.
-The figure below shows the schematic of detecting exon-exon junctions and mapping RNA-seq reads by \code{Subjunc}, which uses this paradigm.
-
+Figure below shows the schematic of exon-exon junction under seed-and-vote paradigm.
The first scan detects all possible exon-exon junctions using the mapping locations of the subreads extracted from each read.
-Matched donor (`GT') and receptor (`AG') sites are required for calling junctions.
Exons as short as 16bp can be detected in this step.
-The second scan verifies the putative exon-exon junctions discovered from the first scan by performing re-alignments for the junction reads.
-The output from \code{Subjunc} includes the list of verified junctions and also the mapping results for all the reads.
-Orientation of splicing sites is indicated by `XA' tag in section of optional fields in mapping output.
+The second scan verifies the putative exon-exon junctions discovered from the first scan by read re-alignment.
-By default, \code{Subjunc} only reports canonical exon-exon junctions it has discovered (ie. presence of donor (`GT') and receptor (`AG') sites is required).
-However, users may turn on `--allJunctions' option to instruct \code{Subjunc} to report all junctions including both canonical and non-canonical ones.
+This approach is implemented in the \code{Subjunc} program.
+The output of \code{Subjunc} includes a list of discovered junctions, in addition to the mapping results.
+By default, \code{Subjunc} only reports canonical exon-exon junctions that contain canonical donor and receptor sites (`GT' and `AG' respectively).
+It was reported that such exon-exon junctions account for $>$98\% of all junctions.
+Orientation of donor and receptor sites is indicated by `XA' tag in the SAM/BAM output.
+\code{Subjunc} will report both canonical and non-canonical junctions when `--allJunctions' option is turned on.
+Accuracy of junction detection generally improves when external gene annotation data is provided.
+The annotation data should include chromosomal coordinates of known exons of each gene.
+\code{Subjunc} infers exon-exon junctions from the provided annotation data by connecting each pair of neighboring exons from the same gene.
+This should cover majority of known exon-exon junctions and the other junctions are expected to be discovered by the program.
+Note that although \code{Subread} aligner does not report exon-exon junctions, providing this annotation is useful for it to map junction reads more accurately.
+See `-a' parameter in Table 2 for more details.
\begin{center}
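The junction inference rule added in the hunk above (connect each pair of neighboring exons from the same gene) can be illustrated with a short Python sketch. This is only a reading of the documentation text, not the Subjunc implementation: the function name `infer_junctions`, the tuple layout, and the choice to skip exons that touch or overlap are all assumptions made for illustration.

```python
from collections import defaultdict


def infer_junctions(exons):
    """Infer candidate exon-exon junctions from annotated exons.

    exons: iterable of (gene_id, chrom, start, end) with inclusive coordinates.
    Returns a list of (gene_id, chrom, left_exon_end, right_exon_start).
    """
    by_gene = defaultdict(list)
    for gene_id, chrom, start, end in exons:
        by_gene[(gene_id, chrom)].append((start, end))

    junctions = []
    for (gene_id, chrom), intervals in sorted(by_gene.items()):
        intervals.sort()
        # Walk over neighboring exons of the same gene in coordinate order.
        for (_, left_end), (right_start, _) in zip(intervals, intervals[1:]):
            # Assumption: exons that touch or overlap leave no intron between
            # them, so no junction is emitted for such a pair.
            if right_start > left_end + 1:
                junctions.append((gene_id, chrom, left_end, right_start))
    return junctions
```

A gene with three annotated exons yields two inferred junctions, and a single-exon gene yields none.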
@@ -362,12 +367,19 @@ output_file="rsubread.bam",minFragLength=50,maxFragLength=600)
\label{sec:index}
The \code{subread-buildindex} (\code{buildindex} function in \Rsubread) program builds an index for reference genome by creating a hash table in which keys are 16bp mers (subreads) extracted from the genome and values are their chromosomal locations.
-By default, subreads are extracted from the genome at a 2bp interval.
-The reference sequences should be in FASTA format (the header line for each chromosomal sequence starts with ``$>$'').\\
+
+By default, subreads are extracted from the reference genome at a 2bp interval and a gapped index is built.
+During mapping, three sets of subreads will be extracted from each read.
+The first set starts from the first base of the read, the second set starts from the second base and the third set starts from the third base.
+
+The reference sequences should be in FASTA format.
+The \code{subread-buildindex} function divides each reference sequence name (which can be found in the header lines) into multiple substrings by using separators including `\code{|}', ` '(space) and `\code{<tab>}', and it uses the first substring as the name for the reference sequence during its index building.
+The first substrings must be distinct for different reference sequences (otherwise the index cannot be built).
+Note that the starting `\code{>}' character in the header line is not included in the first substrings.
Table 1 describes the arguments used by the \code{subread-buildindex} program.
-\newpage
+%\newpage
\begin{table}[h]
\raggedright{Table 1: Arguments used by the \code{subread-buildindex} program (\code{buildindex} function in \Rsubread) in alphabetical order.
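The reference sequence naming rule described in the hunk above can be sketched in Python. This only mirrors the wording of the new documentation text (split the header on `|`, space or tab, keep the first substring, drop the leading `>`, require distinct names); it is not the C code of subread-buildindex, and the names `reference_name` and `check_distinct` are hypothetical.

```python
import re

# Separators listed in the documentation text: '|', space and tab.
_SEPARATORS = re.compile(r"[| \t]")


def reference_name(header_line):
    """Return the first substring of a FASTA header line, without the '>'."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header_line)
    body = header_line[1:].rstrip("\r\n")
    return _SEPARATORS.split(body, maxsplit=1)[0]


def check_distinct(header_lines):
    """Return the reference names in order, raising if any name repeats."""
    seen = {}
    for line in header_lines:
        name = reference_name(line)
        if name in seen:
            raise ValueError("duplicate reference name %r" % name)
        seen[name] = line
    return list(seen)
```

Running `check_distinct` over the header lines of a FASTA file before index building would flag headers that collapse to the same first substring.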
@@ -378,17 +390,17 @@ Arguments & Description \\
\hline
chr1.fa, chr2.fa, ... \newline (\code{reference}) & Give names of chromosome files. Note that in {\Rsubread}, only a single FASTA file including all reference sequences should be provided.\\
\hline
--B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. This makes the largest amount of memory be requested when running alignments, but it enables the maximum mapping speed to be achieved. This option overrides -M when it is provided as well.\\
+-B \newline (\code{indexSplit=FALSE}) & Create one block of index. The built index will not be split into multiple pieces. The more blocks an index has, the slower the mapping speed. This option will override `-M' option when it is also provided.\\
\hline
-c \newline (\code{colorspace}) & Build a color-space index.\\
\hline
-f $<int>$ \newline (\code{TH\_subread}) & Specify the threshold for removing uninformative subreads (highly repetitive 16bp mers). Subreads will be excluded from the index if they occur more than threshold number of times in the reference genome. Default value is 100.\\
\hline
--F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of the reference genome. Under default setting (`-F' is not specified), subreads are extracted in every three bases from the genome.\\
+-F \newline (\code{gappedIndex=FALSE}) & Build a full index for the reference genome. 16bp mers (subreads) will be extracted from every position of a reference genome. Size of the full index built for mouse genome is 14GB.\\
\hline
--M $<int>$ \newline (\code{memory}) & Specify the Size of requested memory(RAM) in megabytes, 8000MB by default. With the default value, the index built for a mammalian genome (eg. human or mouse genome) will be saved into one block, enabling the fastest mapping speed to be achieved. The amount of memory used is $\sim$ 7600MB for mouse or human genome (other species have a much smaller memory footprint), when performing read mapping. Using less memory will increase read mapping time.\\
+-M $<int>$ \newline (\code{memory}) & Specify the size of computer memory(RAM) in megabytes that will be used for alignment of reads, 8000MB by default. If the size of an index built for a reference genome is greater than the `-M' value, this index will be split into multiple blocks and then saved onto the disk. These blocks will be loaded into computer memory sequentially when performing read alignment. A gapped index generated for mouse genome has a size of 5300MB. Note that when gener [...]
\hline
--o $<basename>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
+-o $<string>$ \newline (\code{basename}) & Specify the base name of the index to be created.\\
\hline
-v & Output version of the program. \\
\hline
@@ -417,6 +429,10 @@ Arguments used by \code{subjunc} only are marked with $^{**}$.
\hline
Arguments & Description \\
\hline
+-a $<string>$\newline (\code{useAnnotation, annot.inbuilt, annot.ext}) & Name of a gene annotation file that includes chromosomal coordinates of exons from each gene. GTF/GFF format by default. See -F option for supported formats. Users may use the inbuilt annotations included in this package (SAF format) for human and mouse data. Exon-exon junctions are inferred by connecting each pair of neighboring exons from the same gene.\\
+\hline
+-A $<string>$\newline (\code{chrAliases}) & Name of a comma-delimited text file that includes aliases of chromosome names. This file should contain two columns. First column contains names of chromosomes included in the SAF or GTF annotation and second column contains corresponding names of chromosomes in the reference genome. No column headers should be provided. Also note that chromosome names are case sensitive. This file can be used to match chromosome names between the annotation an [...]
+\hline
-b \newline (\code{color2base=TRUE}) & Output base-space reads instead of color-space reads in mapping output for color space data (eg. LifTech SOLiD data). Note that the mapping itself will still be performed at color-space.\\
\hline
-B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations allowed to be reported for each read. 1 by default. `NH' tag is used to indicate how many alignments are reported for the read and `HI' tag is used for numbering the alignments reported for the same read, in the output. Note that \code{-u} option takes precedence over \code{-B}.\\
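The chromosome alias file described for the new `-A' option in the hunk above (comma-delimited, two columns, no header, case-sensitive names) can be read with a few lines of Python. This is a sketch of the file format as documented, not of how the aligners load it; the function names are hypothetical, and passing unknown names through unchanged is an assumption.

```python
import csv
import io


def load_chr_aliases(handle):
    """Map annotation chromosome names (column 1) to reference names (column 2)."""
    aliases = {}
    for row in csv.reader(handle):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError("expected two columns, got %r" % (row,))
        annotation_name, reference_name = row
        aliases[annotation_name] = reference_name
    return aliases


def to_reference_name(chrom, aliases):
    """Translate an annotation chromosome name; lookups are case sensitive."""
    # Assumption: names without an alias are returned unchanged.
    return aliases.get(chrom, chrom)


if __name__ == "__main__":
    example = io.StringIO("1,chr1\nMT,chrM\n")
    table = load_chr_aliases(example)
    print(to_reference_name("MT", table))
```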
@@ -425,7 +441,9 @@ Arguments & Description \\
\hline
-D $<int>$ \newline (\code{maxFragLength}) & Specify the maximum fragment/template length, 600 by default.\\
\hline
--i $<index> \newline (\code{index}) $ & Specify the base name of the index.\\
+-F $<string>$ \newline (\code{isGTF}) & Specify format of the provided annotation file. Acceptable formats include `GTF' (or compatible GFF format) and `SAF'. Default format in SourceForge {\Subread} is `GTF'. Default format in {\Rsubread} is `SAF'. \\
+\hline
+-i $<string> \newline (\code{index}) $ & Specify the base name of the index.\\
\hline
-I $<int>$ \newline (\code{indels}) & Specify the number of INDEL bases allowed in the mapping. 5 by default. Indels of up to 200bp long can be detected.\\
\hline
@@ -435,15 +453,15 @@ Arguments & Description \\
\hline
-n $<int>$ \newline (\code{nsubreads}) & Specify the number of subreads extracted from each read, 10 by default.\\
\hline
--o $<output>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
+-o $<string>$ \newline (\code{output\_file}) & Give the name of output file. The default output format is BAM. All reads are included in mapping output, including both mapped and unmapped reads, and they are in the same order as in the input file.\\
\hline
-p $<int>$ \newline (\code{TH2}) & Specify the minimum number of consensus subreads both reads from the same pair must have. This argument is only applicable for paired-end read data. The value of this argument should not be greater than that of `-m' option, so as to rescue those read pairs in which one read has a high mapping quality but the other does not. 1 by default.\\
\hline
-P $<3:6>$ \newline (\code{phredOffset}) & Specify the format of Phred scores used in the input data, '3' for phred+33 and '6' for phred+64. '3' by default. For \code{align} function in \Rsubread, the possible values are `33' (for phred+33) and `64' (for phred+64). `33' by default.\\
\hline
--r $<input>$ \newline (\code{readfile1}) & Give the name of input file(s) (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
+-r $<string>$ \newline (\code{readfile1}) & Give the name of an input file (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}). For paired-end read data, this gives the first read file and the other read file should be provided via the -R option. Supported input formats include FASTQ/FASTA (uncompressed or gzip compressed)(default), SAM and BAM.\\
\hline
--R $<input>$ \newline (\code{readfile2}) & Provide name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}).\\
+-R $<string>$ \newline (\code{readfile2}) & Provide name of the second read file from paired-end data. The program will switch to paired-end read mapping mode if this file is provided. (multiple files are allowed to be provided to \code{align} and \code{subjunc} functions in {\Rsubread}).\\
\hline
-S $<ff:fr:rf>$ \newline (\code{PE\_orientation}) & Specify the orientation of the two reads from the same pair. It has three possible values including `fr', `ff' and `rf'. Letter `f' denotes the forward strand and letter `r' the reverse strand. `fr' by default (ie. the first read in the pair is on the forward strand and the second read on the reverse strand).\\
\hline
@@ -453,19 +471,23 @@ $^*$ -t $<int>$ \newline (\code{type}) & Specify the type of input sequencing da
\hline
-u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
\hline
-$^{**}$$--$allJunctions \newline (\code{reportAllJunctions=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordinates of detected [...]
+$^{**}$$--$allJunctions \newline (\code{reportAllJunctions} \newline \code{=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordin [...]
\hline
$--$BAMinput \newline (\code{input\_format="BAM"}) & Specify that the input read data are in BAM format.\\
\hline
$--$complexIndels & Detect multiple short indels that occur concurrently in a small genomic region (these indels could be as close as 1bp apart).\\
\hline
-$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by defaut.\\
+$--$DPGapExt $<int>$ \newline (\code{DP\_GapExtPenalty}) & Specify the penalty for extending the gap when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
-$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by defaut.\\
+$--$DPGapOpen $<int>$ \newline (\code{DP\_GapOpenPenalty}) & Specify the penalty for opening a gap when applying the Smith-Waterman dynamic programming to detecting indels. -2 by default.\\
\hline
-$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming. 0 by defaut.\\
+$--$DPMismatch $<int>$ \newline (\code{DP\_MismatchPenalty}) & Specify the penalty for mismatches when performing the Smith-Waterman dynamic programming. 0 by default.\\
\hline
-$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming. 2 by defaut.\\
+$--$DPMatch $<int>$ \newline (\code{DP\_MatchScore}) & Specify the score for the matched base when performing the Smith-Waterman dynamic programming. 2 by default.\\
+\hline
+$--$gtfFeature $<string>$ \newline (\code{GTF.featureType}) & Specify the type of features that will be extracted from a GTF annotation. `exon' by default. Feature types can be found in the 3rd column of a GTF annotation.\\
+\hline
+$--$gtfAttr $<string>$ \newline (\code{GTF.attrType}) & Specify the type of attributes in a GTF annotation that will be used to group features. `gene\_id' by default. Attributes can be found in the 9th column of a GTF annotation.\\
\hline
$--$rg $<string>$ \newline (\code{readGroup}) & Add a $<tag:value>$ to the read group (RG) header in the mapping output. \\
\hline
@@ -489,7 +511,8 @@ $--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases fr
\section{Mapping quality scores}
-{\Subread} and {\Subjunc} aligners assign a mapping quality score (MQS) to each mapped read to indicate the confidence of the mapping:\\
+{\Subread} and {\Subjunc} aligners determine the final mapping location of each read by taking into account vote number, number of mismatched bases, number of matched bases and mapping distance between two reads from the same pair (for paired-end reads only).
+They then assign a mapping quality score (MQS) to each mapped read to indicate the confidence of mapping using the following formula:\\
\[ MQS = \left\{
\begin{array}{l l}
@@ -499,10 +522,8 @@ $--$trim3 $<int>$ \newline (\code{nTrim3}) & Trim off $<int>$ number of bases fr
0 & \quad \text{if $>1$ equally best locations were found}\\
\end{array} \right.\] \\
-\noindent where $N_c$ is the number of candidate locations that are considered for final full alignment (re-alignment step).
-Such locations must have a vote number within top three vote numbers counted from all locations considered in the subread-mapping step (first scan).
-Up to three candiate locations are considered for each read in the realignment step.
-$N_{mm}$ is the number of mismatches found in the alignment at each candidate location.
+\noindent where $N_c$ is the number of candidate locations considered at the re-alignment step (note that no more than three candidate locations are considered at this step).
+$N_{mm}$ is the number of mismatches present in the final reported alignment for the read.
@@ -626,6 +647,7 @@ Reads may be re-aligned if required.
Output of {\Subjunc} aligner includes a list of discovered exon-exon junction locations and also the complete alignment results for the reads.
Table 2 describes the arguments used by the {\Subjunc} program.\\
+
\section{Mapping output}
Read mapping results for each library will be saved to a BAM/SAM file.
@@ -731,6 +753,7 @@ The genomic features can be specified in either GTF/GFF or SAF format. The SAF f
The genomic features can be specified in either GTF/GFF or SAF format.
A definition of the GTF format can be found at UCSC website (\url{http://genome.ucsc.edu/FAQ/FAQformat.html#format4}).
The SAF format includes five required columns for each feature: feature identifier, chromosome name, start position, end position and strand.
+Both start and end positions are inclusive.
These five columns provide the minimal sufficient information for read quantification purposes.
Extra annotation data are allowed to be added from the sixth column.
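The SAF layout described in the hunk above (five required columns, inclusive start and end, optional extra columns from the sixth onward) can be sketched in Python. The tab delimiter is an assumption here, since this hunk does not state it, and `SafFeature`, `parse_saf_line` and `feature_length` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class SafFeature(NamedTuple):
    feature_id: str
    chrom: str
    start: int
    end: int
    strand: str
    extra: tuple


def parse_saf_line(line):
    """Parse one SAF data row (assumed tab-delimited) into a SafFeature."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError("SAF rows need at least five columns: %r" % line)
    feature_id, chrom, start, end, strand = fields[:5]
    return SafFeature(feature_id, chrom, int(start), int(end), strand,
                      tuple(fields[5:]))


def feature_length(feature):
    """Length in bases; start and end are both inclusive."""
    return feature.end - feature.start + 1
```

Because both ends are inclusive, a feature spanning positions 100 to 199 covers 100 bases, not 99.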
@@ -760,41 +783,46 @@ In this case, {\featureCounts} can be instructed to count fragments rather than
{\featureCounts} automatically sorts reads by name if paired reads are not in consecutive positions in the SAM or BAM file, with minimal cost.
Users do not need to sort their paired reads before providing them to {\featureCounts}.
-\subsection{Features and meta-features}
+\subsection{Assign reads to features or meta-features}
{\featureCounts} is a general-purpose read summarization function, which assigns mapped reads (RNA-seq reads or genomic DNA-seq reads) to genomic features or meta-features.
-Each feature is an interval (range of positions) on one of the reference sequences. We define a meta-feature to be a set of features representing a biological construct of interest. For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
+A feature is an interval (range of positions) on one of the reference sequences.
+A meta-feature is a set of features that represents a biological construct of interest.
+For example, features often correspond to exons and meta-features to genes. Features sharing the same feature identifier in the GTF or SAF annotation are taken to belong to the same meta-feature. {\featureCounts} can summarize reads at either the feature or meta-feature levels.
We recommend to use unique gene identifiers, such as NCBI Entrez gene identifiers, to cluster features into meta-features. Gene names are not recommended to use for this purpose because different genes may have the same names. Unique gene identifiers were often included in many publicly available GTF annotations which can be readily used for summarization. The Bioconductor {\Rsubread} package also includes NCBI RefSeq annotations for human and mice. Entrez gene identifiers are used in th [...]
-\subsection{Overlap of reads with features}
-
-{\featureCounts} performs precise read assignment by comparing mapping location of every base in the read or fragment with the genomic region spanned by each feature.
+{\featureCounts} performs precise read assignment by comparing mapping location of every base in the read with the genomic region spanned by each feature.
It takes account of any gaps (insertions, deletions, exon-exon junctions or structural variants) that are found in the read.
It calls a hit if any overlap is found between read and feature.
Users may use `--minOverlap (\code{minOverlap} in \R)' and `--fracOverlap (\code{fracOverlap} in \R)' options to specify the minimum number of overlapping bases and minimum fraction of overlapping bases required for assigning a read to a feature, respectively.
The `--fracOverlap' option might be particularly useful for counting reads with variable lengths.
-% A hit is called for a meta-feature if the read or fragment overlaps any component feature of the meta-feature.
-\subsection{Multi-mapping reads}
+When counting reads at meta-feature level, a hit is called for a meta-feature if the read overlaps any component feature of the meta-feature.
+Note that if a read hits a meta-feature, it is always counted once no matter how many features in the meta-feature this read overlaps with.
+For instance, an exon-spanning read overlapping with more than one exon within the same gene only contributes 1 count to the gene.
-A multi-mapping read is a read that can be equally best mapped to more than one location in the reference genome.
-Due to the mapping amguity, it is recommended that multi-mapping reads should be excluded from read counting (default behavior of {\featureCounts} program) to produce as accurate counts as possible.
-However we do provide users with different options to deal with the counting of such reads.
-Users can choose to discard multi-mapping reads, or fully count every alignment reported for a multi-mapping read (ie. each alignment carries 1 count) or count each alignment fractionally (ie. each alignment carries $1/n$ count where $n$ is the total number of alignments reported for the read).
-Relevent parameters for counting multi-mapping reads include `-M' (\code{countMultiMappingReads} in \R) and `--fraction' (\code{fraction} in \R).
+\subsection{Count multi-mapping reads and multi-overlapping reads}
+
+A multi-mapping read is a read that can be equally best mapped to more than one location in the reference genome.
+Due to the mapping ambiguity, it is recommended that multi-mapping reads should be excluded from read counting (default behavior of {\featureCounts} program) to produce as accurate counts as possible.
+However we do provide users with other counting options for such reads.
+Users can specify the `-M' option (set \code{countMultiMappingReads} to \code{TRUE} in \R) to fully count every alignment reported for a multi-mapping read (each alignment carries 1 count), or specify both `-M' and `--fraction' options (set both \code{countMultiMappingReads} and \code{fraction} to \code{TRUE} in \R) to count each alignment fractionally (each alignment carries $1/x$ count where $x$ is the total number of alignments reported for the read).
+Note that for multi-mapping reads the counting is performed at the level of individual alignments (not at read level).
-\subsection{Multi-overlap reads}
+A multi-overlapping read is a read that overlaps more than one meta-feature when counting reads at meta-feature level or overlaps more than one feature when counting reads at feature level.
+The decision of whether or not to count these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined.
+On the other hand, we recommend that multi-overlapping reads or fragments are counted for ChIP-seq experiments because for example epigenetic modifications inferred from these reads may regulate the biological functions of all their overlapping genes.
-A multi-overlap read or fragment is one that overlaps more than one feature, or more than one meta-feature when summarizing at the meta-feature level. {\featureCounts} provides users with the option to exclude multi-overlap reads, or fully count them for each overlapping feature (ie. each overlapping feature receives a count of 1 from the read) or assign a fractional count to each overlapping feature.
-Relevent parameters for counting multi-overlap reads include `-O' (\code{allowMultiOverlap} in \R) and `--fraction' (\code{fraction} in \R).
+By default, {\featureCounts} does not count multi-overlapping reads.
+Users can specify the `-O' option (set \code{allowMultiOverlap} to \code{TRUE} in \R) to fully count them for each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of 1 from a read), or specify both `-O' and `--fraction' options (set both \code{allowMultiOverlap} and \code{fraction} to \code{TRUE} in \R) to assign a fractional count to each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of $1/y$ from a read [...]
-The decision whether or not to counting these reads is often determined by the experiment type. We recommend that reads or fragments overlapping more than one gene are not counted for RNA-seq experiments, because any single fragment must originate from only one of the target genes but the identity of the true target gene cannot be confidently determined. On the other hand, we recommend that multi-overlap reads or fragments are counted for most ChIP-seq experiments because epigenetic modi [...]
+If a read is both multi-mapping and multi-overlapping, then each overlapping meta-feature/feature will receive a fractional count of $1/(x*y)$ when `-O', `-M', and `--fraction' are all specified.
+Note that each alignment reported for a multi-mapping read is assessed separately for overlapping with multiple meta-features/features.
-Note that, when counting at the meta-feature level, reads that overlap multiple features of the same meta-feature are always counted exactly once for that meta-feature, provided there is no overlap with any other meta-feature. For example, an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.
\subsection{In-built annotations}
@@ -849,9 +877,9 @@ Arguments included in parenthesis are the equivalent parameters used by {\featur
\hline
Arguments & Description \\
\hline
-input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time.\\
+input\_files \newline (\code{files}) & Give the names of input read files that include the read mapping results. The program automatically detects the file format (SAM or BAM). Multiple files can be provided at the same time. Files are allowed to be provided via $<stdin>$. \\
\hline
--a $<input> \newline (\code{annot.ext, annot.inbuilt}) $ & Give the name of an annotation file. \\
+-a $<string>$ \newline (\code{annot.ext, annot.inbuilt}) & Give the name of an annotation file. \\
\hline
-A \newline (\code{chrAliases}) & Provide a chromosome name alias file to match chr names in annotation with those in the reads. This should be a two-column comma-delimited text file. Its first column should include chr names in the annotation and its second column should include chr names in the reads. Chr names are case sensitive. No column header should be included in the file.\\
\hline
@@ -865,20 +893,20 @@ input\_files \newline (\code{files}) & Give the names of input read files that i
\hline
-f \newline (\code{useMetaFeatures}) & If specified, read summarization will be performed at feature level (eg. exon level). Otherwise, it is performed at meta-feature level (eg. gene level).\\
\hline
--F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). The {\C} version of {\featureCounts} program uses a GTF format annotation by default, but the R version uses a SAF format annotation by default. The R version also includes in-built annotations.\\
+-F \newline (\code{isGTFAnnotationFile}) & Specify the format of the annotation file. Acceptable formats include `GTF' and `SAF' (see Section~\ref{sec:annotation} for details). By default, {\C} version of {\featureCounts} program accepts a GTF format annotation and R version accepts a SAF format annotation. In-built annotations in SAF format are provided.\\
\hline
--g $<input>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
+-g $<string>$ \newline (\code{GTF.attrType}) & Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes) when GTF annotation is provided. `gene\_id' by default. This attribute type is usually the gene identifier. This argument is useful for the meta-feature level summarization.\\
\hline
--G $<input>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
+-G $<string>$ \newline (\code{genome}) & Provide the name of a FASTA-format file that contains the reference sequences used in
read mapping that produced the provided SAM/BAM files. This optional argument can be used with '-J' option to improve read counting for junctions.\\
\hline
--J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. For each junction, the reported data include number of supporting reads, genes that the junction belongs to, chromosomal coordinates of splice sites etc.\\
+-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. The output result includes names of primary and secondary genes that overlap at least one of the two splice sites of a junction. Only one primary gene is reported, but there might be more than one secondary gene reported. Secondary genes do not overlap more splice sites than the primary gene [...]
\hline
--M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. A multi-mapping read will be counted up to N times if it has N reported mapping locations. The program uses the `NH' tag to find multi-mapping reads.\\
+-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. The program uses the `NH' tag to find multi-mapping reads. Alignments reported for a multi-mapping read will be counted separately. Each alignment will have \code{1} count or a fractional count if \code{--fraction} is specified. See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
\hline
--o $<output>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
+-o $<string>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
\hline
--O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features belonging to the same meta-feature [...]
+-O \newline (\code{allowMultiOverlap}) & If specified, reads (or fragments if \code{-p} is specified) will be allowed to be assigned to more than one matched meta-feature (or feature if \code{-f} is specified). Reads/fragments overlapping with more than one meta-feature/feature will be counted more than once. Note that when performing meta-feature level summarization, a read (or fragment) will still be counted once if it overlaps with multiple features within the same meta-feature (as lo [...]
\hline
-p \newline (\code{isPairedEnd}) & If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads.\\
\hline
@@ -888,9 +916,9 @@ read mapping that produced the provided SAM/BAM files. This optional argument ca
\hline
-R \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). They are saved to a tab-delimited file that contains four columns including read name, status(assigned or the reason if not assigned), name of target feature/meta-feature and total number of hits if the read/fragment is counted multiple times. Names of output files are the same as input file names except a suffix string `.featureCounts' is added.\\
\hline
--s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment and FLAG field of the current read is used to tell if it is the first read in the fragment.\\
+-s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment. FLAG field is used to tell if a read is first or second read in a pair.\\
\hline
--t $<input>$ \newline (\code{GTF.featureType}) & Specify the feature type. Only rows which have the matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
+-t $<string>$ \newline (\code{GTF.featureType}) & Specify the feature type. Only rows which have the matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.\\
\hline
-T $<int>$ \newline (\code{nthreads}) & Number of the threads. The value should be between 1 and 32. 1 by default.\\
\hline
@@ -902,9 +930,9 @@ $--$countNonSplit \newline AlignmentsOnly \newline (\code{nonSplitOnly}) & If s
\hline
$--$donotsort \newline (\code{autosort}) & If specified, paired end reads will not be re-ordered even if reads from the same pair were found not to be next to each other in the input.\\
\hline
-$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with '-M' or '-O' or both. When '-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a fractional count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When '-O' is specified, each overlapping feature will receive a fractional count of 1/y, where y is the total number of fea [...]
+$--$fraction \newline (\code{fraction}) & Assign fractional counts to features. This option must be used together with `-M' or `-O' or both. When `-M' is specified, each reported alignment from a multi-mapping read (identified via `NH' tag) will carry a count of 1/x, instead of 1 (one), where x is the total number of alignments reported for the same read. When `-O' is specified, each overlapping feature will receive a count of 1/y, where y is the total number of features overlapping with [...]
\hline
-$--$fracOverlap $<value>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be within range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and `--minOverlap' option need to be satisfied for read assignment. \\
+$--$fracOverlap $<float>$ \newline (\code{fracOverlap}) & Minimum fraction of overlapping bases in a read that is required for read assignment. Value should be a float number in the range [0,1]. 0 by default. If paired end, number of overlapping bases is counted from both reads. Soft-clipped bases are counted when calculating total read length (but ignored when counting overlapping bases). Both this option and `--minOverlap' option need to be satisfied for read assignment. \\
\hline
$--$ignoreDup \newline (\code{ignoreDup}) & If specified, reads that were marked as duplicates will be ignored. Bit 0x400 in FLAG field of SAM/BAM file is used for identifying duplicate reads. In paired end data, the entire read pair will be ignored if at least one end is found to be a duplicate read.\\
\hline
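Note on the counting rules documented in doc/SubreadUsersGuide.tex above: the guide states that each alignment of a multi-mapping read carries 1/x under `-M --fraction', each overlapping feature receives 1/y under `-O --fraction', and 1/(x*y) applies when all three options are given. The arithmetic can be sketched as below. This is only an illustration of the documented rule, not code from featureCounts; the function name per_feature_count and its parameters are hypothetical.

```c
/* Hypothetical helper, not part of Subread: the count that ONE reported
 * alignment contributes to ONE overlapping feature (or meta-feature).
 *   n_alignments (x): alignments reported for the read (from the NH tag)
 *   n_overlaps   (y): features/meta-features overlapped by this alignment
 * Returns 0.0 when the read would not be counted under the given options. */
double per_feature_count(int n_alignments, int n_overlaps,
                         int allow_multi_mapping,  /* -M         */
                         int allow_multi_overlap,  /* -O         */
                         int use_fraction)         /* --fraction */
{
    if (n_alignments < 1 || n_overlaps < 1) return 0.0;

    /* Multi-mapping reads are skipped unless -M is given. */
    if (n_alignments > 1 && !allow_multi_mapping) return 0.0;

    /* Multi-overlapping reads are skipped unless -O is given. */
    if (n_overlaps > 1 && !allow_multi_overlap) return 0.0;

    /* Without --fraction every counted alignment/feature pair carries 1. */
    if (!use_fraction) return 1.0;

    /* With --fraction the count is 1/(x*y); x or y equal to 1 reduces
     * this to the single-option cases 1/y or 1/x. */
    return 1.0 / ((double)n_alignments * (double)n_overlaps);
}
```

For example, a read with two reported alignments whose alignment overlaps two genes would contribute 0.25 to each gene under `-M -O --fraction', and nothing at all under the default options.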
diff --git a/src/HelperFunctions.c b/src/HelperFunctions.c
index c976dc0..c08351a 100644
--- a/src/HelperFunctions.c
+++ b/src/HelperFunctions.c
@@ -44,6 +44,7 @@
#include "subread.h"
+#include "input-files.h"
#include "gene-algorithms.h"
#include "HelperFunctions.h"
@@ -979,3 +980,171 @@ double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c,
}
+int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
+ void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) ){
+ char * file_line = malloc(MAX_LINE_LENGTH+1);
+ int lineno = 0, is_GFF_warned = 0, loaded_features = 0;
+ FILE * fp = fopen(file_name, "r");
+
+ if(NULL == fp){
+ SUBREADprintf("Error: unable to open the annotation file : %s\n", file_name);
+ return -1;
+ }
+
+ while(1){
+ int is_gene_id_found = 0, is_negative_strand = -1;
+ char * token_temp = NULL, * feature_name, * chro_name = NULL;
+ char feature_name_tmp[FEATURE_NAME_LENGTH];
+ feature_name = feature_name_tmp;
+
+ unsigned int start = 0, end = 0;
+ char * getres = fgets(file_line, MAX_LINE_LENGTH, fp);
+ if(getres == NULL) break;
+
+ lineno++;
+ if(is_comment_line(file_line, file_type, lineno-1))continue;
+
+ if(file_type == FILE_TYPE_RSUBREAD)
+ {
+ feature_name = strtok_r(file_line,"\t",&token_temp);
+ int feature_name_len = strlen(feature_name);
+ if(feature_name_len > FEATURE_NAME_LENGTH) feature_name[FEATURE_NAME_LENGTH -1 ] = 0;
+
+ chro_name = strtok_r(NULL,"\t", &token_temp);
+ int chro_name_len = strlen(chro_name);
+ if(chro_name_len > MAX_CHROMOSOME_NAME_LEN) chro_name[MAX_CHROMOSOME_NAME_LEN -1 ] = 0;
+
+ char * start_ptr = strtok_r(NULL,"\t", &token_temp);
+ char * end_ptr = strtok_r(NULL,"\t", &token_temp);
+
+ if(start_ptr == NULL || end_ptr == NULL){
+ SUBREADprintf("\nWarning: the format on the %d-th line is wrong.\n", lineno);
+ }
+ long long int tv1 = atoll(start_ptr);
+ long long int tv2 = atoll(end_ptr);
+
+ if( isdigit(start_ptr[0]) && isdigit(end_ptr[0]) ){
+ if(strlen(start_ptr) > 10 || strlen(end_ptr) > 10 || tv1 > 0x7fffffff || tv2> 0x7fffffff){
+ SUBREADprintf("\nError: Line %d contains a coordinate greater than 2^31!\n", lineno);
+ return -2;
+ }
+ }else{
+ SUBREADprintf("\nError: Line %d contains a format error. The expected annotation format is SAF.\n", lineno);
+ return -2;
+ }
+
+ start = atoi(start_ptr);// start
+ end = atoi(end_ptr);//end
+
+ char * strand_str = strtok_r(NULL,"\t", &token_temp);
+ if(strand_str == NULL)
+ is_negative_strand = 0;
+ else
+ is_negative_strand = ('-' ==strand_str[0]);
+
+ is_gene_id_found = 1;
+
+ } else if(file_type == FILE_TYPE_GTF) {
+ chro_name = strtok_r(file_line,"\t",&token_temp);
+ strtok_r(NULL,"\t", &token_temp);// source
+ char * feature_type = strtok_r(NULL,"\t", &token_temp);// feature_type
+
+ if(strcmp(feature_type, feature_name_column)==0){
+ char * start_ptr = strtok_r(NULL,"\t", &token_temp);
+ char * end_ptr = strtok_r(NULL,"\t", &token_temp);
+
+
+ if(start_ptr == NULL || end_ptr == NULL){
+ SUBREADprintf("\nWarning: the format on the %d-th line is wrong.\n", lineno);
+ }
+ long long int tv1 = atoll(start_ptr);
+ long long int tv2 = atoll(end_ptr);
+
+
+ if( isdigit(start_ptr[0]) && isdigit(end_ptr[0]) ){
+ if(strlen(start_ptr) > 10 || strlen(end_ptr) > 10 || tv1 > 0x7fffffff || tv2> 0x7fffffff){
+ SUBREADprintf("\nError: Line %d contains a coordinate greater than 2^31!\n", lineno);
+ return -2;
+ }
+ }else{
+ SUBREADprintf("\nError: Line %d contains a format error. The expected annotation format is GTF/GFF.\n", lineno);
+ return -2;
+ }
+ start = atoi(start_ptr);// start
+ end = atoi(end_ptr);//end
+
+ if(start < 1 || end<1 || start > 0x7fffffff || end > 0x7fffffff || start > end)
+ SUBREADprintf("\nWarning: the feature on the %d-th line has zero coordinate or zero lengths\n\n", lineno);
+
+
+ strtok_r(NULL,"\t", &token_temp);// score
+ is_negative_strand = ('-' == (strtok_r(NULL,"\t", &token_temp)[0]));//strand
+ strtok_r(NULL,"\t",&token_temp); // "frame"
+ char * extra_attrs = strtok_r(NULL,"\t",&token_temp); // name_1 "val1"; name_2 "val2"; ...
+ if(extra_attrs && (strlen(extra_attrs)>2)){
+ int attr_val_len = GTF_extra_column_value(extra_attrs , gene_id_column , feature_name_tmp, FEATURE_NAME_LENGTH);
+ if(attr_val_len>0) is_gene_id_found=1;
+ }
+
+ if(!is_gene_id_found){
+ if(!is_GFF_warned)
+ {
+ int ext_att_len = strlen(extra_attrs);
+ if(extra_attrs[ext_att_len-1] == '\n') extra_attrs[ext_att_len-1] =0;
+ SUBREADprintf("\nWarning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.\nThe specified gene identifier attribute is '%s' \nThe attributes included in your GTF annotation are '%s' \n\n", gene_id_column, extra_attrs);
+ }
+ is_GFF_warned++;
+ }
+
+ }
+ }
+
+ if(is_gene_id_found){
+ do_add_feature(feature_name, chro_name, start, end, is_negative_strand, context);
+ loaded_features++;
+ }
+
+ }
+ fclose(fp);
+ free(file_line);
+ return loaded_features;
+}
+
+HashTable * load_alias_table(char * fname) {
+ FILE * fp = f_subr_open(fname, "r");
+ if(!fp)
+ {
+ print_in_box(80,0,0,"WARNING unable to open alias file '%s'", fname);
+ return NULL;
+ }
+
+ char * fl = malloc(2000);
+
+ HashTable * ret = HashTableCreate(1013);
+ HashTableSetDeallocationFunctions(ret, free, free);
+ HashTableSetKeyComparisonFunction(ret, fc_strcmp_chro);
+ HashTableSetHashFunction(ret, fc_chro_hash);
+
+ while (1)
+ {
+ char *ret_fl = fgets(fl, 1999, fp);
+ if(!ret_fl) break;
+ if(fl[0]=='#') continue;
+ char * sam_chr = NULL;
+ char * anno_chr = strtok_r(fl, ",", &sam_chr);
+ if((!sam_chr)||(!anno_chr)) continue;
+
+ sam_chr[strlen(sam_chr)-1]=0;
+ char * anno_chr_buf = malloc(strlen(anno_chr)+1);
+ strcpy(anno_chr_buf, anno_chr);
+ char * sam_chr_buf = malloc(strlen(sam_chr)+1);
+ strcpy(sam_chr_buf, sam_chr);
+ HashTablePut(ret, sam_chr_buf, anno_chr_buf);
+ }
+
+ fclose(fp);
+
+ free(fl);
+ return ret;
+}
+
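Note on load_features_annotation() above: as far as this hunk shows, a missing start or end token only triggers a warning, after which the same pointers are still passed to atoll() and isdigit(). A stricter parse could look like the sketch below. It is not part of the patch; parse_coordinate is a hypothetical helper, and rejecting values below 1 is an assumption (the patch only warns about them in the GTF branch).

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Subread. Parses a 1-based annotation
 * coordinate from one tab-separated token.
 * Returns 0 and stores the value in *out on success.
 * Returns -1 if the token is NULL, does not start with a digit, has
 * trailing non-whitespace characters, or lies outside [1, 0x7fffffff]. */
int parse_coordinate(const char *token, unsigned int *out)
{
    if (token == NULL || !isdigit((unsigned char)token[0])) return -1;

    char *end = NULL;
    errno = 0;
    long long value = strtoll(token, &end, 10);
    if (errno == ERANGE) return -1;

    /* Tolerate trailing whitespace such as the newline left by fgets(). */
    while (*end != '\0' && isspace((unsigned char)*end)) end++;
    if (*end != '\0') return -1;

    /* Same upper bound as the 0x7fffffff check in the patch. */
    if (value < 1 || value > 0x7fffffffLL) return -1;

    *out = (unsigned int)value;
    return 0;
}
```

A caller would check the return value for both the start and the end token and skip or reject the line on failure, instead of continuing with an unchecked pointer.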
diff --git a/src/HelperFunctions.h b/src/HelperFunctions.h
index 65beb1e..0325657 100644
--- a/src/HelperFunctions.h
+++ b/src/HelperFunctions.h
@@ -71,4 +71,8 @@ unsigned int find_left_end_cigar(unsigned int right_pos, char * cigar);
int mac_or_rand_str(char * char_14);
double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c, unsigned int d, long double * frac_buffer, int buffer_size);
+int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
+ void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) );
+
+HashTable * load_alias_table(char * fname) ;
#endif
diff --git a/src/Makefile.Linux b/src/Makefile.Linux
index fced4fc..fed83b5 100644
--- a/src/Makefile.Linux
+++ b/src/Makefile.Linux
@@ -2,10 +2,11 @@
include makefile.version
-OPT_LEVEL = 9
+OPT_LEVEL = 3
CCFLAGS = -mtune=core2 ${MACOS} -O${OPT_LEVEL} -Wall -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64
LDFLAGS = ${STATIC_MAKE} -lpthread -lz -lm ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D MAKE_STANDALONE # -DREPORT_ALL_THE_BEST
-CC = gcc ${CCFLAGS} -ggdb -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
+CC_EXEC = gcc
+CC = ${CC_EXEC} ${CCFLAGS} -fmessage-length=0 -ggdb # -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib
diff --git a/src/SNPCalling.c b/src/SNPCalling.c
index b1feeb9..c2712a6 100644
--- a/src/SNPCalling.c
+++ b/src/SNPCalling.c
@@ -79,6 +79,7 @@ struct SNP_Calling_Parameters{
char pile_file_name[300];
int delete_piles;
+ int disk_is_full;
char background_input_file[300];
char subread_index[300];
@@ -294,7 +295,11 @@ int read_tmp_block(struct SNP_Calling_Parameters * parameters, FILE * tmp_fp, ch
fread(&read_rec, sizeof(read_rec), 1, tmp_fp);
fread(&read_len, sizeof(short), 1, tmp_fp);
fread(read, sizeof(char), read_len, tmp_fp);
- fread(qual, sizeof(char), read_len, tmp_fp);
+ int rlen = fread(qual, sizeof(char), read_len, tmp_fp);
+ if(rlen < read_len){
+ SUBREADputs("ERROR: the temporary file is broken.");
+ return -1;
+ }
first_base_pos = read_rec.pos - block_no * BASE_BLOCK_LENGTH;
parameters->is_paired_end_data = read_rec.flags & 1;
@@ -581,7 +586,7 @@ void fishers_test_on_block(struct SNP_Calling_Parameters * parameters, float * s
int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference_len, char * referenced_genome, char * chro_name , char * temp_prefix, struct SNP_Calling_Parameters * parameters)
{
- int block_no = (offset -1) / BASE_BLOCK_LENGTH, i;
+ int block_no = (offset -1) / BASE_BLOCK_LENGTH, i, disk_is_full = 0;
char temp_file_name[300];
FILE *tmp_fp;
unsigned int * snp_voting_piles, *snp_BGC_piles = NULL; // offset * 4 + "A/C/G/T"[0,1,2,3]
@@ -632,7 +637,7 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
pcutoff_list[i]=-1.;
}
- read_tmp_block(parameters, tmp_fp,&SNP_bitmap_recorder,snp_voting_piles,block_no, reference_len, referenced_genome);
+ int read_is_error = read_tmp_block(parameters, tmp_fp,&SNP_bitmap_recorder,snp_voting_piles,block_no, reference_len, referenced_genome);
fclose(tmp_fp);
if (parameters -> delete_piles)
@@ -891,7 +896,12 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
snprintf(sprint_line,999, "%s\t%u\t.\t%c\t%s\t%.4f\t.\tDP=%d;MM=%s;BGTOTAL=%d;BGMM=%d%s\n", chro_name, BASE_BLOCK_LENGTH*block_no +1 + i, true_value,base_list, Qvalue, all_reads, supporting_list , snp_filter_background_matched[i]+snp_filter_background_unmatched[i], snp_filter_background_unmatched[i], BGC_Qvalue_str);
if(parameters->output_fp_lock)
subread_lock_occupy(parameters->output_fp_lock);
- fwrite(sprint_line, 1, strlen(sprint_line),out_fp);
+ int sprint_line_len = strlen(sprint_line);
+ int wlen = fwrite(sprint_line, 1, sprint_line_len,out_fp);
+ if(wlen < sprint_line_len){
+ disk_is_full=1;
+ break;
+ }
parameters->reported_SNPs++;
if(parameters->output_fp_lock)
subread_lock_release(parameters->output_fp_lock);
@@ -932,8 +942,12 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
fwrite(referenced_genome + i, 1, 1, out_fp);
fwrite(referenced_genome + 1 + i + max(0,indels), 1, 1, out_fp);
unsigned short * indel_sups = parameters -> cigar_event_table-> appendix2;
- fprintf(out_fp, "\t1.0\t.\tINDEL;DP=%d;SR=%d\n",all_reads,indel_sups[event_id]);
+ int wlen = fprintf(out_fp, "\t1.0\t.\tINDEL;DP=%d;SR=%d\n",all_reads,indel_sups[event_id]);
+ if(wlen < 10){
+ disk_is_full=1;
+ break;
+ }
parameters->reported_indels++;
if(parameters->output_fp_lock)
subread_lock_release(parameters->output_fp_lock);
@@ -956,7 +970,7 @@ int process_snp_votes(FILE *out_fp, unsigned int offset , unsigned int reference
free(pcutoff_list);
free(sprint_line);
//SUBREADprintf("OVERLAPPED=%llu; MISMA=%llu; ALL_BASES=%llu\n",OVERLAPPED_BASES, OVER_MISMA_BASES, ALL_BASES);
- return 0;
+ return read_is_error || disk_is_full;
}
@@ -1033,9 +1047,10 @@ int run_chromosome_search(FILE *in_fp, FILE * out_fp, char * chro_name , char *
//#warning "=== ONLY TEST ONE BLOCK , USE 'if(1)' IN RELEASE ==="
//if(strcmp(chro_name,"chr7")==0 && all_offset == 60000000){
if(1){
- process_snp_votes(out_fp, all_offset, offset, referenced_base, chro_name , temp_prefix, parameters);
+ parameters -> disk_is_full |= process_snp_votes(out_fp, all_offset, offset, referenced_base, chro_name , temp_prefix, parameters);
print_in_box(89,0,0,"processed block %c[36m%s@%d%c[0m by thread %d/%d [block number=%d/%d]", CHAR_ESC, chro_name, all_offset, CHAR_ESC , thread_no+1, all_threads, 1+(*task_no)-parameters->empty_blocks, parameters->all_blocks);
}
+ if(parameters -> disk_is_full)break;
}
else if((*task_no) % all_threads == thread_no)
{
@@ -1204,6 +1219,11 @@ int parse_read_lists_maybe_threads(char * in_FASTA_file, char * out_BED_file, ch
}
//fprintf(out_fp, "## Fisher_Test_Size=%u\n",fisher_test_size);
fclose(out_fp);
+ if(parameters -> disk_is_full){
+ unlink(out_BED_file);
+ SUBREADputs("ERROR: cannot write into the output VCF file. Please check the disk space in the output directory.");
+ ret = 1;
+ }
return ret;
}
@@ -1405,14 +1425,11 @@ int SNP_calling(char * in_SAM_file, char * out_BED_file, char * in_FASTA_file, c
HashTableSetKeyComparisonFunction(parameters-> cigar_event_table, my_strcmp);
memcpy(rand48_seed, &start_time, 6);
- if(temp_location)
- strcpy(temp_file_prefix, temp_location);
- else{
- char mac_rand[13];
- mac_or_rand_str(mac_rand);
+ char mac_rand[13];
+ mac_or_rand_str(mac_rand);
+
+ sprintf(temp_file_prefix, "%s/temp-snps-%06u-%s-", temp_location, getpid(), mac_rand);
- sprintf(temp_file_prefix, "./temp-snps-%06u-%s-", getpid(), mac_rand);
- }
_EXSNP_SNP_delete_temp_prefix = temp_file_prefix;
print_in_box(89,0,0,"Split %s file into %c[36m%s*%c[0m ..." , parameters -> is_BAM_file_input?"BAM":"SAM" , CHAR_ESC, temp_file_prefix, CHAR_ESC);
@@ -1578,7 +1595,7 @@ int main_snp_calling_test(int argc,char ** argv)
optopt = 63;
-
+ memset(&parameters, 0, sizeof(struct SNP_Calling_Parameters));
parameters.start_time = miltime();
parameters.empty_blocks = 0;
parameters.reported_SNPs = 0;
@@ -1688,10 +1705,6 @@ int main_snp_calling_test(int argc,char ** argv)
strncpy(out_BED_file, optarg,299);
break;
- case '9': // UNUSED
- strncpy(temp_path, optarg,299);
- break;
-
case 'T':
threads = atoi(optarg);
if(!threads)threads=1;
@@ -1814,7 +1827,16 @@ int main_snp_calling_test(int argc,char ** argv)
warning_file_type(in_SAM_file, parameters.is_BAM_file_input?FILE_TYPE_BAM:FILE_TYPE_SAM);
warning_file_type(in_FASTA_file, FILE_TYPE_FASTA);
warning_file_limit();
- ret = SNP_calling(in_SAM_file, out_BED_file, in_FASTA_file, temp_path[0]?temp_path:NULL, read_count, threads, &parameters);
+ int x1;
+ for(x1 = strlen(out_BED_file); x1 >= 0; x1--){
+ if(out_BED_file[x1]=='/'){
+ memcpy(temp_path, out_BED_file, x1);
+ temp_path[x1]=0;
+ break;
+ }
+ }
+ if(temp_path[0]==0)strcpy(temp_path, "./");
+ ret = SNP_calling(in_SAM_file, out_BED_file, in_FASTA_file, temp_path, read_count, threads, &parameters);
if(ret != -1)
{
print_in_box(80,0,1,"");
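Note on the temp_path change in main_snp_calling_test() above: the patch now places temporary files in the directory of the output file, found by scanning out_BED_file backwards for '/'. The same idea can be sketched with strrchr() as below. This is an illustration only, not code from the patch; output_directory is a hypothetical helper, and returning "." or "/" for the two edge cases is an assumption (the patch itself falls back to "./" whenever the copied prefix is empty).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Subread. Copies the directory part of
 * `path` into `dir`, which has room for `dir_size` bytes.
 * Returns 0 on success and -1 if the result does not fit. */
int output_directory(const char *path, char *dir, size_t dir_size)
{
    const char *slash = strrchr(path, '/');
    const char *src;
    size_t len;

    if (slash == NULL) {            /* no directory part: current directory */
        src = ".";
        len = 1;
    } else if (slash == path) {     /* file directly under the root */
        src = "/";
        len = 1;
    } else {                        /* everything before the last '/' */
        src = path;
        len = (size_t)(slash - path);
    }

    if (len + 1 > dir_size) return -1;   /* +1 for the terminating NUL */
    memcpy(dir, src, len);
    dir[len] = '\0';
    return 0;
}
```

The explicit size check matters here because temp_path is a fixed-size buffer in the caller, while the output file name comes from the command line.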
diff --git a/src/core-bigtable.c b/src/core-bigtable.c
index 52cd5a8..4054c9e 100644
--- a/src/core-bigtable.c
+++ b/src/core-bigtable.c
@@ -430,7 +430,7 @@ void bktable_append(bucketed_table_t * tab, char * chro, unsigned int pos, void
}
-void bktable_free_ptrs(void * buckv, HashTable * tab){
+void bktable_free_ptrs(void * bukey, void * buckv, HashTable * tab){
int x1;
bucketed_table_bucket_t * buck = buckv;
for(x1 = 0; x1 < buck -> items; x1++)
diff --git a/src/core-bigtable.h b/src/core-bigtable.h
index 518a75d..b7e27f4 100644
--- a/src/core-bigtable.h
+++ b/src/core-bigtable.h
@@ -36,7 +36,7 @@ void bktable_init(bucketed_table_t * tab, unsigned int maximum_interval_length,
void bktable_destroy(bucketed_table_t * tab);
-void bktable_free_ptrs(void * buckv, HashTable * tab);
+void bktable_free_ptrs(void * bkey, void * buckv, HashTable * tab);
void fraglist_init(fragment_list_t * list);
diff --git a/src/core-indel.c b/src/core-indel.c
index 09c2979..79174db 100644
--- a/src/core-indel.c
+++ b/src/core-indel.c
@@ -229,6 +229,8 @@ int anti_supporting_read_scan(global_context_t * global_context)
if(event_body -> event_small_side >= coverage_end - 5) continue;
event_body -> anti_supporting_reads ++;
+ //printf("OCT27-ANTISUP-READ @ %u has SMALL @ %u~%u , INDELS = %d, ASUP = %d\n", coverage_start - 1, event_body -> event_small_side, event_body -> event_large_side, event_body -> indel_length, event_body -> anti_supporting_reads);
+
cancelled_event_list[cancelled_events++] = small_side_ordered_event_ids[x1];
}
@@ -248,6 +250,7 @@ int anti_supporting_read_scan(global_context_t * global_context)
}
if(to_be_add){
+ // printf("OCT27-ANTISUP-READ @ %u has LARGE @ %u~%u, INDELS = %d, ASUP = %d\n", coverage_end, event_body -> event_small_side, event_body -> event_large_side, event_body -> indel_length, event_body -> anti_supporting_reads);
event_body -> anti_supporting_reads ++;
}
}
@@ -266,45 +269,24 @@ chromosome_event_t * reallocate_event_space( global_context_t* global_context,th
{
int max_event_no;
- if(thread_context)
- {
+ if(thread_context) {
max_event_no = ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> current_max_event_number;
if(max_event_no<=event_no)
{
- //printf("T REALLOCATD: %d\n", (int)(((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> current_max_event_number * 1.5));
((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> current_max_event_number *= 1.6;
- // chromosome_event_t * new_space =
((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic =
realloc(((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic, sizeof(chromosome_event_t) *
((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> current_max_event_number);
-
- // if(!new_space){
- // SUBREADprintf("Unable to reallocate local event space!\n");
- // return NULL;
- // }
-
- // ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic = new_space;
}
return ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
- }
- else
- {
+ } else {
max_event_no = ((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> current_max_event_number;
if(max_event_no<=event_no)
{
- //printf("G REALLOCATD: %d\n", (int)(((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> current_max_event_number * 1.5));
((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> current_max_event_number *= 1.6;
- //chromosome_event_t * new_space =
((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic =
realloc(((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic, sizeof(chromosome_event_t) *
((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> current_max_event_number);
-
- //if(!new_space){
- // SUBREADprintf("Unable to reallocate global event space!\n");
- // return NULL;
- //}
-
- //((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic = new_space;
}
return ((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
}
@@ -488,9 +470,6 @@ void remove_sorted_neighbours(global_context_t * global_context)
void remove_neighbour(global_context_t * global_context)
{
- //#warning "====================== MUST COMMENT THIS LINE!! NOT REMOVING NEIGHBOURS FOR DETECTING PAIRED INVERSION EVENTS ====================="
- //return;
-
indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
HashTable * event_table = indel_context -> event_entry_table;
@@ -560,7 +539,8 @@ void remove_neighbour(global_context_t * global_context)
{
int neighbour_range = 11;
int indel_range = 4 , delta_small = 0;// max_range = 0;
- if(event_body->event_type == CHRO_EVENT_TYPE_INDEL) continue;
+ if(event_body -> event_type == CHRO_EVENT_TYPE_INDEL) continue;
+ if(event_body -> is_donor_found_or_annotation & 64) continue; // do not remove known events
//if(! global_context->config.check_donor_at_junctions )max_range = 10;
//for(delta_small = - max_range; delta_small <= max_range ; delta_small ++)
@@ -584,10 +564,6 @@ void remove_neighbour(global_context_t * global_context)
}
}
- if(0&&to_be_removed_number<maxinum_removed_events)
- if(event_body -> supporting_reads<2 && event_body -> anti_supporting_reads > 5 * event_body -> supporting_reads)
- to_be_removed_ids[to_be_removed_number++] = event_body -> global_event_id;
-
if(1 && global_context->config.do_fusion_detection && event_body -> event_type == CHRO_EVENT_TYPE_FUSION)
{
for(xk2=-10 ; xk2 < 10 ; xk2++)
@@ -618,27 +594,7 @@ void remove_neighbour(global_context_t * global_context)
{
chromosome_event_t * deleted_event = &event_space[to_be_removed_ids[xk1]];
- int * id_list = HashTableGet(event_table, NULL+deleted_event-> event_small_side);
- if(NULL == id_list){
- SUBREADprintf("Missing entry : %u for %d\n", deleted_event-> event_small_side, to_be_removed_ids[xk1]);
- }else{
- int xk2, current_items = id_list[0]&0x0fffffff;
- for(xk2=1; xk2< current_items && id_list[xk2] >0 ; xk2++)
- if(to_be_removed_ids[xk1] == id_list[xk2] - 1)break;
- if(xk2< current_items && id_list[xk2] > 0)
- {
- int xk3;
- for(xk3 = xk2; xk3<current_items -1; xk3++)
- {
- if(0==id_list[xk3+1]) break;
-
- id_list[xk3] = id_list[xk3+1];
- }
- id_list[xk3] = 0;
- }
-
- }
- //printf("NBR_REMOVED=%u - %u\n", deleted_event -> event_small_side , deleted_event -> event_large_side);
+ //printf("NBR_REMOVED=%u - %u\n", deleted_event -> event_small_side , deleted_event -> event_large_side);
if(deleted_event -> event_type == CHRO_EVENT_TYPE_INDEL && deleted_event -> inserted_bases)
free(deleted_event -> inserted_bases);
deleted_event -> event_type = CHRO_EVENT_TYPE_REMOVED;
@@ -673,34 +629,38 @@ int init_indel_tables(global_context_t * context)
// each integer is the index of event in the event array + 1.
// unused values are set to 0.
- indel_context -> event_entry_table = HashTableCreate(399997);
+ indel_context -> event_entry_table = NULL;
+
+ context -> module_contexts[MODULE_INDEL_ID] = indel_context;
+
+ indel_context -> total_events = 0;
+ indel_context -> current_max_event_number = 0;
+ //indel_context -> current_max_event_number = context->config.init_max_event_number;
+ //indel_context -> event_space_dynamic = (chromosome_event_t *)malloc(sizeof(chromosome_event_t)*indel_context -> current_max_event_number);
+
+ indel_context -> event_space_dynamic = NULL;
+
+ if(context -> config.all_threads < 2) {
+ indel_context -> event_entry_table = HashTableCreate(399997);
- if(context -> config.use_bitmap_event_table)
- {
indel_context -> event_entry_table -> appendix1=malloc(1024 * 1024 * 64);
indel_context -> event_entry_table -> appendix2=malloc(1024 * 1024 * 64);
memset(indel_context -> event_entry_table -> appendix1, 0, 1024 * 1024 * 64);
memset(indel_context -> event_entry_table -> appendix2, 0, 1024 * 1024 * 64);
- }
- else
- {
- indel_context -> event_entry_table -> appendix1=NULL;
- indel_context -> event_entry_table -> appendix2=NULL;
- }
- // appendix1 is for smaller side and appendix2 is for larger side.
-
- HashTableSetKeyComparisonFunction(indel_context->event_entry_table, localPointerCmp_forEventEntry);
- HashTableSetHashFunction(indel_context->event_entry_table, localPointerHashFunction_forEventEntry);
+ HashTableSetKeyComparisonFunction(indel_context->event_entry_table, localPointerCmp_forEventEntry);
+ HashTableSetHashFunction(indel_context->event_entry_table, localPointerHashFunction_forEventEntry);
- context -> module_contexts[MODULE_INDEL_ID] = indel_context;
-
- indel_context -> total_events = 0;
- indel_context -> current_max_event_number = context->config.init_max_event_number;
- indel_context -> event_space_dynamic = (chromosome_event_t *)malloc(sizeof(chromosome_event_t)*indel_context -> current_max_event_number);
-
- if(context->config.is_third_iteration_running)
- {
+ indel_context -> total_events = 0;
+ indel_context -> current_max_event_number = context->config.init_max_event_number;
+ indel_context -> event_space_dynamic = malloc(sizeof(chromosome_event_t)*indel_context -> current_max_event_number);
+ if(!indel_context -> event_space_dynamic)
+ {
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_FATAL, "Cannot allocate memory for threads. Please try to reduce the thread number.");
+ return 1;
+ }
+ }
+ if(context->config.is_third_iteration_running) {
char * fns = malloc(200);
fns[0]=0;
exec_cmd("ulimit -n", fns, 200);
@@ -722,12 +682,14 @@ int init_indel_tables(global_context_t * context)
indel_context -> dynamic_align_table_mask = malloc(sizeof(char *)*MAX_READ_LENGTH);
int xk1;
- for(xk1=0;xk1<MAX_READ_LENGTH; xk1++)
- {
+ for(xk1=0;xk1<MAX_READ_LENGTH; xk1++) {
indel_context -> dynamic_align_table[xk1] = malloc(sizeof(short)*MAX_READ_LENGTH);
indel_context -> dynamic_align_table_mask[xk1] = malloc(sizeof(char)*MAX_READ_LENGTH);
}
+ for(xk1=0; xk1<EVENT_BODY_LOCK_BUCKETS; xk1++)
+ subread_init_lock(indel_context -> event_body_locks+xk1);
+
return 0;
}
@@ -736,11 +698,10 @@ int init_indel_thread_contexts(global_context_t * global_context, thread_context
indel_thread_context_t * indel_thread_context = (indel_thread_context_t*)malloc(sizeof(indel_thread_context_t));
indel_context_t * indel_context = (indel_context_t *) global_context -> module_contexts[MODULE_INDEL_ID];
- if(task == STEP_VOTING || task == STEP_ITERATION_ONE)
- {
+ if(task == STEP_VOTING) {
indel_thread_context -> event_entry_table = HashTableCreate(399997);
- indel_thread_context -> event_entry_table -> appendix1=indel_context -> event_entry_table-> appendix1;
- indel_thread_context -> event_entry_table -> appendix2=indel_context -> event_entry_table-> appendix2;
+ indel_thread_context -> event_entry_table -> appendix1=NULL;//indel_context -> event_entry_table-> appendix1;
+ indel_thread_context -> event_entry_table -> appendix2=NULL;//indel_context -> event_entry_table-> appendix2;
HashTableSetKeyComparisonFunction(indel_thread_context->event_entry_table, localPointerCmp_forEventEntry);
HashTableSetHashFunction(indel_thread_context->event_entry_table, localPointerHashFunction_forEventEntry);
@@ -807,79 +768,587 @@ void destory_event_entry_table(HashTable * old)
}
}
+typedef struct {
+ unsigned int scanning_positons;
+ unsigned int thread_bodytable_number;
+} scanning_events_record_t;
-int finalise_indel_thread(global_context_t * global_context, thread_context_t * thread_context, int task)
-{
+
+
+#define _get_global_body(iii) ( indel_context -> event_space_dynamic + records[iii].thread_bodytable_number)
+
+int scanning_events_compare(void * arr, int l, int r){
+ void ** arrr = (void **) arr;
+ indel_context_t * indel_context = arrr[0];
+ scanning_events_record_t * records = arrr[1];
+ chromosome_event_t * body_l = _get_global_body(l);
+ chromosome_event_t * body_r = _get_global_body(r);
+
+ if(records[l].scanning_positons > records[r].scanning_positons)return 1;
+ if(records[l].scanning_positons < records[r].scanning_positons)return -1;
+
+ if((body_l -> is_donor_found_or_annotation & 64)!=0 && (body_r -> is_donor_found_or_annotation & 64)==0) return 1; // prefer known truth.
+ if((body_l -> is_donor_found_or_annotation & 64)==0 && (body_r -> is_donor_found_or_annotation & 64)!=0) return -1;
+ if(body_l -> supporting_reads > body_r -> supporting_reads) return -1;
+ if(body_l -> supporting_reads < body_r -> supporting_reads) return 1;
+ if(abs(body_l -> indel_length) < abs(body_r -> indel_length)) return 1;
+ if(abs(body_l -> indel_length) > abs(body_r -> indel_length)) return -1;
+ if(body_l -> indel_length > body_r -> indel_length) return -1; // same length, but L is del and R is ins -- prefer del than ins
+ if(body_l -> indel_length < body_r -> indel_length) return 1;
+
+ if(body_l -> event_small_side > body_r -> event_small_side)return 1;
+ if(body_l -> event_small_side < body_r -> event_small_side)return -1;
+ if(body_l -> event_large_side > body_r -> event_large_side)return 1;
+ return -1;
+}
+
+void null_scanning_events_merge(void * arr, int start, int items, int items2){}
+void scanning_events_merge(void * arr, int start, int items, int items2){
+ void ** arrr = (void **) arr;
+ scanning_events_record_t * records = arrr[1];
+
+ int read_1_ptr = start, read_2_ptr = start+items, write_ptr;
+ scanning_events_record_t * merged_records = malloc(sizeof(scanning_events_record_t) * (items+items2));
+
+ for(write_ptr=0; write_ptr<items+items2; write_ptr++){
+ if((read_1_ptr >= start+items)||(read_2_ptr < start+items+items2 && scanning_events_compare(arr, read_1_ptr, read_2_ptr) > 0))
+ memcpy(merged_records+write_ptr, records+(read_2_ptr++), sizeof(scanning_events_record_t));
+ else
+ memcpy(merged_records+write_ptr, records+(read_1_ptr++), sizeof(scanning_events_record_t));
+ }
+ memcpy(records + start, merged_records, sizeof(scanning_events_record_t) * (items+items2));
+ free(merged_records);
+}
+
+void scanning_events_exchange(void * arr, int l, int r){
+ void ** arrr = (void **) arr;
+ scanning_events_record_t * records = arrr[1];
+
+ unsigned long tmpi;
+
+ tmpi = records[l].scanning_positons;
+ records[l].scanning_positons = records[r].scanning_positons;
+ records[r].scanning_positons = tmpi;
+
+ tmpi = records[l].thread_bodytable_number;
+ records[l].thread_bodytable_number = records[r].thread_bodytable_number;
+ records[r].thread_bodytable_number = tmpi;
+}
+
+
+
+#define _test_record_size if(current_record_number >= current_record_size - 2){\
+ current_record_size *= 1.5;\
+ records=realloc(records, sizeof(scanning_events_record_t)*current_record_size);\
+ if(NULL == records) return -1;\
+ }\
+
+#define _add_record records[current_record_number].scanning_positons = body -> event_small_side;\
+ records[current_record_number].thread_bodytable_number = xx1;\
+ current_record_number++;\
+ records[current_record_number].scanning_positons = body -> event_large_side;\
+ records[current_record_number].thread_bodytable_number = xx1;\
+ current_record_number++;\
+
+
+int sort_junction_entry_table(global_context_t * global_context){
indel_context_t * indel_context = (indel_context_t*)global_context -> module_contexts[MODULE_INDEL_ID];
- indel_thread_context_t * indel_thread_context = (indel_thread_context_t*)thread_context -> module_thread_contexts[MODULE_INDEL_ID];
- if(task == STEP_VOTING || task == STEP_ITERATION_ONE)
- {
- int xk1;
- for(xk1 = 0; xk1 < indel_thread_context -> total_events; xk1++)
+ chromosome_event_t * event_space = indel_context -> event_space_dynamic;
+
+ if(indel_context -> event_entry_table){
+ if(indel_context -> event_entry_table->appendix1)
{
- chromosome_event_t * event_body = &indel_thread_context -> event_space_dynamic[xk1];
- chromosome_event_t * matched_old_event = NULL;
- chromosome_event_t * search_result [MAX_EVENT_ENTRIES_PER_SITE];
- int found_events = search_event(global_context, indel_context -> event_entry_table, indel_context -> event_space_dynamic, event_body -> event_small_side, EVENT_SEARCH_BY_SMALL_SIDE, -1, search_result);
- int xk2;
- for(xk2 = 0; xk2 < found_events; xk2++)
- {
- chromosome_event_t * test_found_event = search_result [xk2];
- if(test_found_event->event_large_side == event_body -> event_large_side && test_found_event->event_type == event_body -> event_type && (event_body -> event_type!=CHRO_EVENT_TYPE_INDEL || event_body -> indel_length == test_found_event->indel_length))
- {
- matched_old_event = test_found_event;
- break;
+ free(indel_context -> event_entry_table -> appendix1);
+ free(indel_context -> event_entry_table -> appendix2);
+ }
+
+ // free the entry table: the pointers.
+ destory_event_entry_table(indel_context -> event_entry_table);
+ HashTableDestroy(indel_context -> event_entry_table);
+ }
+
+ indel_context -> event_entry_table = HashTableCreate(399997);
+ HashTableSetKeyComparisonFunction(indel_context->event_entry_table, localPointerCmp_forEventEntry);
+ HashTableSetHashFunction(indel_context->event_entry_table, localPointerHashFunction_forEventEntry);
+
+ if(global_context -> config.use_bitmap_event_table) {
+ indel_context -> event_entry_table -> appendix1=malloc(1024 * 1024 * 64);
+ indel_context -> event_entry_table -> appendix2=malloc(1024 * 1024 * 64);
+ memset(indel_context -> event_entry_table -> appendix1, 0, 1024 * 1024 * 64);
+ memset(indel_context -> event_entry_table -> appendix2, 0, 1024 * 1024 * 64);
+ } else {
+ indel_context -> event_entry_table -> appendix1=NULL;
+ indel_context -> event_entry_table -> appendix2=NULL;
+ }
+
+ int xx1, current_record_number=0, current_record_size=10000;
+
+ scanning_events_record_t * records = malloc(sizeof(scanning_events_record_t)*current_record_size);
+
+ for(xx1 = 0; xx1 < indel_context -> total_events; xx1++){
+ chromosome_event_t * body = event_space + xx1;
+ _test_record_size;
+ _add_record;
+ }
+
+ void * sort_arr[2];
+ sort_arr[0] = indel_context;
+ sort_arr[1] = records;
+
+ // many repeated elements
+ // do not use quick sort.
+ merge_sort(sort_arr, current_record_number, scanning_events_compare, scanning_events_exchange, scanning_events_merge);
+ unsigned int last_scannping_pos = records[0].scanning_positons;
+ int merge_start = 0;
+ HashTable * event_table = indel_context -> event_entry_table;
+
+ for(xx1 =0; xx1 <= current_record_number; xx1++){
+ scanning_events_record_t * this_record = NULL;
+ if(xx1<current_record_number) this_record = records+xx1;
+
+ if(xx1>0){
+ if(xx1 == current_record_number || last_scannping_pos!= this_record -> scanning_positons){
+ int merge_end = xx1, merge_i;
+ if( merge_end-merge_start > MAX_EVENT_ENTRIES_PER_SITE )
+ merge_end = merge_start + MAX_EVENT_ENTRIES_PER_SITE;
+ unsigned int * id_list = malloc(sizeof(int) * (1+merge_end-merge_start));
+ assert(id_list);
+ id_list[0] = merge_end-merge_start;
+ for(merge_i = merge_start; merge_i < merge_end; merge_i++){
+ chromosome_event_t * body = _get_global_body(merge_i);
+ id_list[merge_i - merge_start + 1] = records[merge_i].thread_bodytable_number + 1;
+
+ mark_event_bitmap(event_table->appendix1, body -> event_small_side);
+ mark_event_bitmap(event_table->appendix2, body -> event_large_side);
}
+ merge_start = xx1;
+
+ //#warning "======= UNCOMMENT NEXT ======="
+ HashTablePut(indel_context -> event_entry_table, NULL + last_scannping_pos, id_list);
}
+ }
+ if(xx1 == current_record_number) break;
+ last_scannping_pos = this_record -> scanning_positons;
+ }
- if(matched_old_event)
- {
- matched_old_event -> supporting_reads += event_body -> supporting_reads;
- if(event_body->inserted_bases && event_body -> event_type == CHRO_EVENT_TYPE_INDEL && event_body -> indel_length<0){
- //printf("FREED INDEL\n");
- free(event_body->inserted_bases);
+ free(records);
+ return 0;
+}
+
+#define _get_event_body(iii) (records[iii].thread_no<0?( indel_context -> event_space_dynamic + records[iii].thread_bodytable_number ):(((indel_thread_context_t*)(thread_contexts[records[iii].thread_no].module_thread_contexts[MODULE_INDEL_ID]))-> event_space_dynamic + records[iii].thread_bodytable_number ))
+
+typedef struct {
+ unsigned int thread_bodytable_number;
+ short thread_no;
+} concatinating_events_record_t;
+
+#define _test_conc_size if(conc_rec_items >= conc_rec_size - 1){\
+ conc_rec_size *= 1.5;\
+ records = realloc(records, conc_rec_size * sizeof(concatinating_events_record_t));\
+ if(NULL == records) return -1;\
+ }
+
+int conc_sort_compare(void * arr, int l, int r){
+ void ** arrr = (void **)arr;
+ concatinating_events_record_t * records = arrr[0];
+ indel_context_t * indel_context = arrr[1];
+ thread_context_t * thread_contexts = arrr[2];
+
+ chromosome_event_t * body_l = _get_event_body(l);
+ chromosome_event_t * body_r = _get_event_body(r);
+
+
+ if(body_l -> event_small_side > body_r -> event_small_side)return 3;
+ if(body_l -> event_small_side < body_r -> event_small_side)return -3;
+ if(body_l -> event_large_side > body_r -> event_large_side)return 3;
+ if(body_l -> event_large_side < body_r -> event_large_side)return -3;
+
+ if(abs(body_l -> indel_length) < abs(body_r -> indel_length)) return 2;
+ if(abs(body_l -> indel_length) > abs(body_r -> indel_length)) return -2;
+ if(body_l -> indel_length > body_r -> indel_length) return -2; // same length, but L is del and R is ins -- prefer del than ins
+ if(body_l -> indel_length < body_r -> indel_length) return 2;
+
+ if((body_l -> is_donor_found_or_annotation & 64)!=0 && (body_r -> is_donor_found_or_annotation & 64)==0) return 1; // prefer known truth.
+ if((body_l -> is_donor_found_or_annotation & 64)==0 && (body_r -> is_donor_found_or_annotation & 64)!=0) return -1;
+ if(body_l -> supporting_reads > body_r -> supporting_reads) return -1;
+ if(body_l -> supporting_reads < body_r -> supporting_reads) return 1;
+ return 0;
+
+}
+
+
+void conc_sort_merge(void * arr, int start, int items, int items2){
+ void ** arrr = (void **) arr;
+ concatinating_events_record_t * records = arrr[0];
+
+ int read_1_ptr = start, read_2_ptr = start+items, write_ptr;
+ concatinating_events_record_t * merged_records = malloc(sizeof(concatinating_events_record_t) * (items+items2));
+
+ for(write_ptr=0; write_ptr<items+items2; write_ptr++){
+ if((read_1_ptr >= start+items)||(read_2_ptr < start+items+items2 && conc_sort_compare(arr, read_1_ptr, read_2_ptr) > 0))
+ memcpy(merged_records+write_ptr, records+(read_2_ptr++), sizeof(concatinating_events_record_t));
+ else
+ memcpy(merged_records+write_ptr, records+(read_1_ptr++), sizeof(concatinating_events_record_t));
+ }
+ memcpy(records + start, merged_records, sizeof(concatinating_events_record_t) * (items+items2));
+ free(merged_records);
+}
+
+
+
+void conc_sort_exchange(void * arr, int l, int r){
+
+ void ** arrr = (void **)arr;
+ concatinating_events_record_t * records = arrr[0];
+
+ unsigned int tmpi;
+ tmpi = records[l].thread_bodytable_number;
+ records[l].thread_bodytable_number = records[r].thread_bodytable_number;
+ records[r].thread_bodytable_number = tmpi;
+
+ tmpi = records[l].thread_no;
+ records[l].thread_no = records[r].thread_no;
+ records[r].thread_no = tmpi;
+}
+
+// sort and merge events from all threads and the global event space.
+int sort_global_event_table(global_context_t * global_context){
+ return finalise_indel_and_junction_thread(global_context, NULL, STEP_VOTING);
+}
+
+int finalise_indel_and_junction_thread(global_context_t * global_context, thread_context_t * thread_contexts, int task)
+{
+ indel_context_t * indel_context = (indel_context_t*)global_context -> module_contexts[MODULE_INDEL_ID];
+ if(task == STEP_VOTING) {
+ concatinating_events_record_t * records;
+ int conc_rec_size = 10000, conc_rec_items = 0;
+ records = malloc(sizeof(concatinating_events_record_t) * conc_rec_size);
+
+ int xk1, thn;
+ if(thread_contexts)
+ for(thn =0; thn < global_context->config.all_threads; thn++){
+ thread_context_t * thread_context = thread_contexts+thn;
+ indel_thread_context_t * indel_thread_context = (indel_thread_context_t*)thread_context -> module_thread_contexts[MODULE_INDEL_ID];
+
+ for(xk1 = 0; xk1 < indel_thread_context -> total_events; xk1++){
+ chromosome_event_t * old_body = indel_thread_context -> event_space_dynamic + xk1;
+ if(old_body -> event_type == CHRO_EVENT_TYPE_REMOVED) continue;
+
+ _test_conc_size;
+ records[conc_rec_items].thread_no=thn;
+ records[conc_rec_items++].thread_bodytable_number=xk1;
}
}
- else
- {
- int new_event_no = indel_context -> total_events++;
- reallocate_event_space(global_context, NULL, new_event_no);
- event_body -> global_event_id = new_event_no;
- memcpy(indel_context -> event_space_dynamic + new_event_no , event_body , sizeof(chromosome_event_t));
- put_new_event(indel_context -> event_entry_table , event_body, new_event_no);
+
+ for(xk1 = 0; xk1 < indel_context -> total_events; xk1++){
+ chromosome_event_t * old_body = indel_context -> event_space_dynamic + xk1;
+ if(old_body -> event_type == CHRO_EVENT_TYPE_REMOVED) continue;
+ _test_conc_size;
+ records[conc_rec_items].thread_no=-1;
+ records[conc_rec_items++].thread_bodytable_number=xk1;
+ }
+
+ void * sort_arr[3];
+ sort_arr[0] = records;
+ sort_arr[1] = indel_context;
+ sort_arr[2] = thread_contexts;
+
+ // many repeated elements -- do not use quick sort.
+ merge_sort(sort_arr, conc_rec_items, conc_sort_compare, conc_sort_exchange, conc_sort_merge);
+
+ chromosome_event_t * prev_env = NULL;
+ int merge_target_size = 10000;
+ int merge_target_items = 0;
+ chromosome_event_t * merge_target = malloc(sizeof(chromosome_event_t) * merge_target_size);
+
+ int merge_start = 0;
+ for(xk1 = 0; xk1 <= conc_rec_items; xk1++){
+ chromosome_event_t * this_event = (xk1 == conc_rec_items)?NULL:_get_event_body(xk1);
+ if(xk1 > 0){
+ int compret = 0;
+ if(xk1 < conc_rec_items) compret = conc_sort_compare(sort_arr, xk1-1, xk1);
+ if(abs(compret)>1 || xk1 == conc_rec_items){// different events or last one -- merge [ prev_event_record_no , xk1 - 1 ]
+ int xk_merge;
+ // find a new slot in the target space
+ if(merge_target_items >= merge_target_size - 1){
+ merge_target_size *= 1.5;
+ merge_target = realloc(merge_target, sizeof(chromosome_event_t) * merge_target_size);
+ }
+
+ chromosome_event_t * merged_body = merge_target + merge_target_items;
+ memcpy( merged_body, prev_env, sizeof(chromosome_event_t) );
+ merged_body -> global_event_id = merge_target_items++;
+
+ for(xk_merge = merge_start; xk_merge < xk1 - 1; xk_merge++){
+ chromosome_event_t * old_body = _get_event_body(xk_merge);
+
+ assert(merged_body -> event_type == old_body -> event_type);
+ merged_body -> supporting_reads += old_body -> supporting_reads;
+ merged_body -> anti_supporting_reads += old_body -> anti_supporting_reads;
+ merged_body -> final_counted_reads += old_body -> final_counted_reads;
+ merged_body -> final_reads_mismatches += old_body -> final_reads_mismatches;
+ merged_body -> critical_supporting_reads += old_body -> critical_supporting_reads;
+ merged_body -> junction_flanking_left = max(merged_body -> junction_flanking_left, old_body -> junction_flanking_left);
+ merged_body -> junction_flanking_right = max(merged_body -> junction_flanking_right, old_body -> junction_flanking_right);
+ merged_body -> is_donor_found_or_annotation |= old_body -> is_donor_found_or_annotation;
+
+ if(merged_body -> connected_next_event_distance > 0 && old_body -> connected_next_event_distance > 0)
+ merged_body -> connected_next_event_distance = min(merged_body -> connected_next_event_distance, old_body -> connected_next_event_distance);
+ else merged_body -> connected_next_event_distance = max(merged_body -> connected_next_event_distance, old_body -> connected_next_event_distance);
+
+ if(merged_body -> connected_previous_event_distance > 0 && old_body -> connected_previous_event_distance > 0)
+ merged_body -> connected_previous_event_distance = min(merged_body -> connected_previous_event_distance, old_body -> connected_previous_event_distance);
+ else merged_body -> connected_previous_event_distance = max(merged_body -> connected_previous_event_distance, old_body -> connected_previous_event_distance);
+
+ if(old_body->inserted_bases && old_body -> event_type == CHRO_EVENT_TYPE_INDEL && old_body -> indel_length<0){
+ //printf("OCT27-FREEMEM-INS [%d] : thread-%d + %d %u-%p\n", xk_merge, records[xk_merge].thread_no, records[xk_merge].thread_bodytable_number, old_body->event_small_side, old_body->inserted_bases );
+ if(merged_body -> inserted_bases && merged_body -> inserted_bases != old_body -> inserted_bases){
+ // SUBREADprintf("FREE PTR=%p\n", old_body -> inserted_bases);
+ free(old_body -> inserted_bases);
+ }
+ else merged_body -> inserted_bases = old_body -> inserted_bases;
+ old_body -> inserted_bases = NULL;
+ }
+ }
+ merge_start = xk1;
+ }
}
- }
- destory_event_entry_table(indel_thread_context -> event_entry_table);
- HashTableDestroy(indel_thread_context -> event_entry_table);
- free(indel_thread_context -> event_space_dynamic);
+ if(xk1 == conc_rec_items) break;
+ prev_env = this_event;
+ }
- for(xk1=0;xk1<MAX_READ_LENGTH; xk1++)
- {
- free(indel_thread_context -> dynamic_align_table[xk1]);
- free(indel_thread_context -> dynamic_align_table_mask[xk1]);
+ if(0){
+ for(xk1 = 0; xk1 < merge_target_items; xk1++){
+ chromosome_event_t * pev = merge_target + xk1;
+ printf("OCT27-MERGERES: %u~%u, indel=%d, nsup=%d, TYPE=%d\n",pev->event_small_side, pev->event_large_side, pev->indel_length, pev->supporting_reads, pev->event_type);
+ }
}
+ free(records);
+
+ if(thread_contexts)
+ for(thn =0; thn < global_context->config.all_threads; thn++){
+ thread_context_t * thread_context = thread_contexts+thn;
+ indel_thread_context_t * indel_thread_context = (indel_thread_context_t*)thread_context -> module_thread_contexts[MODULE_INDEL_ID];
- free(indel_thread_context->dynamic_align_table);
- free(indel_thread_context->dynamic_align_table_mask);
+ destory_event_entry_table(indel_thread_context -> event_entry_table);
+ HashTableDestroy(indel_thread_context -> event_entry_table);
+
+ free(indel_thread_context -> event_space_dynamic);
+
+ for(xk1=0;xk1<MAX_READ_LENGTH; xk1++)
+ {
+ free(indel_thread_context -> dynamic_align_table[xk1]);
+ free(indel_thread_context -> dynamic_align_table_mask[xk1]);
+ }
+ free(indel_thread_context->dynamic_align_table);
+ free(indel_thread_context->dynamic_align_table_mask);
+ free(indel_thread_context);
+ }
+
+ if(indel_context -> event_space_dynamic ) free(indel_context -> event_space_dynamic);
+ indel_context -> event_space_dynamic = NULL;
+
+ indel_context -> event_space_dynamic = merge_target;
+ indel_context -> current_max_event_number = merge_target_size;
+ indel_context -> total_events = merge_target_items;
+ } else if(task == STEP_ITERATION_TWO) {
+ int xk1, thn;
+ if(thread_contexts)
+ for(thn =0; thn < global_context->config.all_threads; thn++){
+ thread_context_t * thread_context = thread_contexts+thn;
+ indel_thread_context_t * indel_thread_context = (indel_thread_context_t*)thread_context -> module_thread_contexts[MODULE_INDEL_ID];
+
+ for(xk1 = 0; xk1 < indel_thread_context -> total_events; xk1++)
+ {
+ indel_context -> event_space_dynamic [xk1] . final_counted_reads +=indel_thread_context -> final_counted_reads_array[xk1];
+ indel_context -> event_space_dynamic [xk1] . final_reads_mismatches +=indel_thread_context -> final_reads_mismatches_array[xk1];
+ }
+ free(indel_thread_context -> final_counted_reads_array);
+ free(indel_thread_context -> final_reads_mismatches_array);
+ free(thread_context -> output_buffer);
+ free(indel_thread_context);
+ }
}
- else if(task == STEP_ITERATION_TWO)
- {
- int xk1;
- for(xk1 = 0; xk1 < indel_thread_context -> total_events; xk1++)
- {
- indel_context -> event_space_dynamic [xk1] . final_counted_reads +=indel_thread_context -> final_counted_reads_array[xk1];
- indel_context -> event_space_dynamic [xk1] . final_reads_mismatches +=indel_thread_context -> final_reads_mismatches_array[xk1];
+ return 0;
+}
+
+
+void add_annotation_to_junctions(void * key, void * val, HashTable * tab){
+ global_context_t * global_context = tab->appendix1;
+ indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
+ char * gene_chr = key;
+ int * pos_strands = val;
+ int current_lagre_end = -1;
+ char chroname[MAX_CHROMOSOME_NAME_LEN];
+
+ int x1, ww=0, wi=0;
+ for(x1 =0 ; ; x1++){
+ if(ww){
+ chroname[wi++]=gene_chr[x1];
+ chroname[wi]=0;
+ }
+ if(0==gene_chr[x1]) break;
+ if(gene_chr[x1]==':')
+ ww=1;
+ }
+
+ //#warning ">>>>>>> COMMENt NEXT <<<<<<<<<<<<<<"
+ //if(1) SUBREADprintf("SEE LOCATION: %s has %d\n", gene_chr , pos_strands[0]+1 );
+
+ for(x1 = 1; x1 < pos_strands[0]+1; x1+=3){
+ if(pos_strands[x1]==0)break;
+ //#warning ">>>>>>> COMMENt NEXT <<<<<<<<<<<<<<"
+ //SUBREADprintf("SEE LOCATION ___ : %d >0 && %d < %d %d %d\n", current_lagre_end , current_lagre_end, pos_strands[x1], pos_strands[x1+1], pos_strands[x1+2]);
+ if(current_lagre_end > 0 && current_lagre_end < pos_strands[x1]){
+ // add a junction between current_lagre_end and pos_strands[x1]
+ if(indel_context -> total_events >= indel_context -> current_max_event_number -1){
+ indel_context -> current_max_event_number*=1.2;
+ indel_context -> event_space_dynamic = realloc(indel_context -> event_space_dynamic , sizeof( chromosome_event_t )*indel_context -> current_max_event_number);
+ }
+ chromosome_event_t * newbody = indel_context -> event_space_dynamic + indel_context -> total_events ;
+ memset(newbody, 0, sizeof(chromosome_event_t));
+ newbody -> global_event_id = indel_context -> total_events++;
+ newbody -> event_type = CHRO_EVENT_TYPE_JUNCTION;
+
+ newbody -> event_small_side = linear_gene_position(&global_context->chromosome_table , chroname, current_lagre_end);
+ newbody -> event_large_side = linear_gene_position(&global_context->chromosome_table , chroname, pos_strands[x1]);
+
+ //#warning ">>>>>>> COMMENt NEXT <<<<<<<<<<<<<<"
+ if(0){
+ if ( 1 || newbody -> event_small_side == 625683723){
+ SUBREADprintf("Loaded event :");
+ debug_show_event(global_context, newbody);
+ }
+ }
+ newbody -> is_negative_strand = pos_strands[x1+2];
+ newbody -> is_donor_found_or_annotation = 127;
+ // printf("09NOV: add_annotation_to_junctions: %u -> %u '%s', %d,%d\n", newbody -> event_small_side , newbody -> event_large_side, chroname, current_lagre_end, pos_strands[x1]);
}
- free(indel_thread_context -> final_counted_reads_array);
- free(indel_thread_context -> final_reads_mismatches_array);
- free(thread_context -> output_buffer);
+
+ if(current_lagre_end < pos_strands[x1+1])
+ current_lagre_end = pos_strands[x1+1];
}
- free(indel_thread_context);
+}
+
+
+typedef struct {
+ global_context_t * global_context;
+ HashTable * feature_sorting_table;
+} do_load_juncs_context_t;
+
+int do_juncs_add_feature(char * gene_name, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
+ //#warning ">>>>>>> COMMENt NEXT <<<<<<<<<<<<<<"
+ //SUBREADprintf("INJ LOCS: %s : %u, %u\n", chro_name, feature_start, feature_end);
+ do_load_juncs_context_t * do_load_juncs_context = context;
+ HashTable * feature_sorting_table = do_load_juncs_context -> feature_sorting_table;
+
+ global_context_t * global_context = do_load_juncs_context -> global_context;
+ char tmp_chro_name[MAX_CHROMOSOME_NAME_LEN];
+ if(global_context -> sam_chro_to_anno_chr_alias){
+ char * sam_chro = get_sam_chro_name_from_alias(global_context -> sam_chro_to_anno_chr_alias, chro_name);
+ if(sam_chro!=NULL) chro_name = sam_chro;
+ }
+ int access_n = HashTableGet( global_context -> chromosome_table.read_name_to_index, chro_name ) - NULL;
+ if(access_n < 1){
+ if(chro_name[0]=='c' && chro_name[1]=='h' && chro_name[2]=='r'){
+ chro_name += 3;
+ }else{
+ strcpy(tmp_chro_name, "chr");
+ strcat(tmp_chro_name, chro_name);
+ chro_name = tmp_chro_name;
+ }
+ }
+
+ char sort_key[ MAX_CHROMOSOME_NAME_LEN * 3 + 2 ];
+ sprintf(sort_key, "%s:%s", gene_name, chro_name);
+ int * old_features = HashTableGet(feature_sorting_table, sort_key), x1;
+ int written_space = -1, written_last_item = 0;
+ if(old_features){
+ int has_space = 0;
+ for(x1= 1; x1 < old_features[0] + 1 ; x1+=3){
+ if(old_features[x1]==0){
+ has_space = 1;
+ break;
+ }
+ }
+ if(!has_space){
+ int old_size = old_features[0];
+ old_features[0] *= 1.5;
+ old_features[0] -= old_features[0]%3;
+
+ // malloc is used here rather than realloc, because the old memory block will be freed by HashTablePutReplace.
+ int * old_features2 = malloc( sizeof(int) * (old_features[0] +1));
+ memcpy(old_features2, old_features, (old_size + 1) * sizeof(int));
+ old_features2[old_size + 1] = 0;
+
+ HashTablePutReplace(feature_sorting_table, sort_key, old_features2, 0);
+ old_features = old_features2;
+ }
+ for(x1= 1; x1 < old_features[0] + 1 ; x1+=3){
+ if(old_features[x1]> feature_start || old_features[x1]==0){
+ written_space = x1;
+ if(old_features[x1]> feature_start){
+ int x2;
+ for(x2 = x1+3; x2 < old_features[0] + 1;x2+=3){
+ if(old_features[x2]==0)break;
+ }
+ x2 += 2;
+ assert(x2 < old_features[0] + 1);
+ if(x2 + 1 < old_features[0] + 1) old_features[x2+1]=0;
+
+ for(; x2 >= x1 + 3; x2 --)
+ old_features[x2] = old_features[x2-3];
+ } else written_last_item = 1;
+ break;
+ }
+ }
+ }else{
+ int init_space_sort = 15;
+ old_features = malloc((init_space_sort + 1) * sizeof(int));
+ old_features[0]=init_space_sort;
+ char * mm_sort_key = malloc(strlen(sort_key)+1);
+ strcpy(mm_sort_key, sort_key);
+ HashTablePut(feature_sorting_table, mm_sort_key, old_features);
+ written_space = 1;
+ written_last_item = 1;
+ }
+ assert(written_space >0);
+ old_features[written_space]=feature_start - 1;
+ old_features[written_space+1]=feature_end - 1;
+ old_features[written_space+2]=is_negative_strand;
+ if(written_last_item && written_space + 3 < old_features[0]+1) old_features[written_space+3]=0;
+
return 0;
}
+int load_known_junctions(global_context_t * global_context){
+
+ HashTable * feature_sorting_table = HashTableCreate(90239);
+ HashTableSetKeyComparisonFunction(feature_sorting_table, my_strcmp);
+ HashTableSetHashFunction(feature_sorting_table, HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(feature_sorting_table, free, free);
+
+ do_load_juncs_context_t do_load_juncs_context;
+ memset(&do_load_juncs_context, 0, sizeof(do_load_juncs_context));
+
+ do_load_juncs_context.global_context = global_context;
+ do_load_juncs_context.feature_sorting_table = feature_sorting_table;
+
+ int features = load_features_annotation(global_context->config.exon_annotation_file , global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, global_context->config.exon_annotation_feature_name_column,
+ &do_load_juncs_context, do_juncs_add_feature);
+
+ feature_sorting_table -> appendix1 = global_context;
+ HashTableIteration(feature_sorting_table, add_annotation_to_junctions);
+ HashTableDestroy(feature_sorting_table);
+
+ if(features <0)return -1;
+ else{
+ return 0;
+ }
+}
+
int there_are_events_in_range(char * bitmap, unsigned int pos, int sec_len)
{
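The hunk above stores each gene's exons as a flat int array: slot 0 holds the capacity, then (start, end, strand) triples sorted by start, ending at a zero start. add_annotation_to_junctions walks that array and emits a junction wherever the largest exon end seen so far lies before the next exon start. The following is a minimal standalone sketch of that walk. The names junction_t and junctions_from_exons are hypothetical and are not part of subread, and the sketch writes plain coordinates to a caller-supplied buffer instead of creating chromosome_event_t entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output record; the real code fills a chromosome_event_t. */
typedef struct {
    int small_side;          /* end of the upstream exon */
    int large_side;          /* start of the downstream exon */
    int is_negative_strand;  /* copied from the downstream exon triple */
} junction_t;

/*
 * triples[0] is the number of int slots that follow it.
 * triples[1..] holds (start, end, strand) triples sorted by start;
 * a start of 0 marks the first unused triple.
 * Returns the number of junctions written to out (at most out_cap).
 */
size_t junctions_from_exons(const int *triples, junction_t *out, size_t out_cap)
{
    int largest_end = -1;
    size_t n = 0;

    assert(triples != NULL && out != NULL);

    for (int i = 1; i < triples[0] + 1; i += 3) {
        if (triples[i] == 0)
            break;
        /* A gap between the furthest exon end so far and this exon start
         * is reported as a junction. Overlapping exons produce none. */
        if (largest_end > 0 && largest_end < triples[i] && n < out_cap) {
            out[n].small_side = largest_end;
            out[n].large_side = triples[i];
            out[n].is_negative_strand = triples[i + 2];
            n++;
        }
        if (largest_end < triples[i + 1])
            largest_end = triples[i + 1];
    }
    return n;
}
```

With exons 100-200, 150-250 and 400-500, the first two overlap and only the gap between 250 and 400 is reported.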
@@ -944,35 +1413,18 @@ void put_new_event(HashTable * event_table, chromosome_event_t * new_event , int
{
//#warning "====== DO NOT NEED TO CLEAR THE MEMORY BUFFER! MALLOC IS GOOD ======"
//id_list = calloc(sizeof(unsigned int),EVENT_ENTRIES_INIT_SIZE);
- id_list = malloc(sizeof(unsigned int)*EVENT_ENTRIES_INIT_SIZE);
+ id_list = malloc(sizeof(unsigned int)*(1+EVENT_ENTRIES_INIT_SIZE));
id_list[0]=EVENT_ENTRIES_INIT_SIZE;
id_list[1]=0;
HashTablePut(event_table , NULL+sides[xk1] , id_list);
}
- unsigned int current_capacity = id_list[0] & 0x0fffffff;
- assert(current_capacity >= EVENT_ENTRIES_INIT_SIZE && current_capacity <= MAX_EVENT_ENTRIES_PER_SITE);
-
- for(xk2=1;xk2<MAX_EVENT_ENTRIES_PER_SITE; xk2++)
+ for(xk2=1;xk2< id_list[0] ; xk2++)
if(id_list[xk2]==0) break;
- if(xk2 >= current_capacity - 1 && current_capacity < MAX_EVENT_ENTRIES_PER_SITE)
- {
- while(1){
- id_list = HashTableGet(event_table, NULL+sides[xk1]);
- if((id_list[0] & 0xf0000000)==0)break;
- }
-
- id_list[0] |= 0x80000000;
- current_capacity = min(MAX_EVENT_ENTRIES_PER_SITE, current_capacity*2);
- id_list = realloc(id_list, sizeof(unsigned int) * current_capacity);
- id_list[0] = current_capacity;
- HashTablePut(event_table, NULL+sides[xk1] , id_list);
- }
-
- if(xk2 < MAX_EVENT_ENTRIES_PER_SITE)
+ if(xk2 < id_list[0])
id_list[xk2] = event_no+1;
- if(xk2 < MAX_EVENT_ENTRIES_PER_SITE -1) id_list[xk2+1] = 0;
+ if(xk2 < id_list[0]) id_list[xk2+1] = 0;
}
if(event_table->appendix1)
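The put_new_event hunk above replaces the growable, flag-tagged id list with a fixed layout: slot 0 holds the capacity, the following slots hold event numbers stored as event_no+1, and a zero terminates the list. The allocation gains one extra slot so the terminator always fits. The sketch below reproduces that layout in isolation. The function names and the INIT_SIZE macro are hypothetical stand-ins (INIT_SIZE mirrors EVENT_ENTRIES_INIT_SIZE), and the hash table that owns the list in subread is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define INIT_SIZE 9  /* stands in for EVENT_ENTRIES_INIT_SIZE */

/* Allocate a list: slot 0 = capacity, slot 1 = terminator. */
unsigned int *id_list_new(void)
{
    unsigned int *list = malloc(sizeof(unsigned int) * (1 + INIT_SIZE));
    if (list == NULL)
        return NULL;
    list[0] = INIT_SIZE;
    list[1] = 0;
    return list;
}

/* Append event_no (stored as event_no + 1 so that 0 can terminate).
 * Returns 1 on success, 0 when the list is full. */
int id_list_add(unsigned int *list, unsigned int event_no)
{
    unsigned int k;

    assert(list != NULL);
    for (k = 1; k < list[0]; k++)
        if (list[k] == 0)
            break;
    if (k >= list[0])
        return 0;             /* no free slot: the event is dropped */
    list[k] = event_no + 1;
    list[k + 1] = 0;          /* k + 1 <= capacity, inside the allocation */
    return 1;
}

/* Count stored entries by scanning up to and including slot 'capacity'. */
unsigned int id_list_count(const unsigned int *list)
{
    unsigned int k;

    assert(list != NULL);
    for (k = 1; k < list[0] + 1; k++)
        if (list[k] == 0)
            break;
    return k - 1;
}
```

Under this layout a list with capacity 9 holds at most 8 events, because the last slot is reserved for the terminator.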
@@ -1001,14 +1453,11 @@ int search_event(global_context_t * global_context, HashTable * event_table, chr
}
unsigned int * res = HashTableGet(event_table, NULL+pos);
- if(0 && 42399326 == pos){
- SUBREADprintf("EVENT_HIT=%p for %u\n", res, pos);
- }
if(res)
{
int xk2;
int current_size = res[0]&0x0fffffff;
- for(xk2=1; xk2< current_size ; xk2++)
+ for(xk2=1; xk2< current_size+1 ; xk2++)
{
if(0 && res[xk2] > 520000){
SUBREADprintf("TOO LARGE EVENT : %u ; POS=%d/%u\n", res[xk2] , xk2, res[0]);
@@ -1025,6 +1474,24 @@ int search_event(global_context_t * global_context, HashTable * event_table, chr
return_buffer[ret++] = event_body;
}
+
+ //#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+ if(0){
+ indel_context_t * indel_context = (indel_context_t *) global_context -> module_contexts[MODULE_INDEL_ID];
+ chromosome_event_t * est = indel_context -> event_space_dynamic;
+ if(est == event_space)
+ printf("OCT27-STEPRS-EVENT_HIT= %u ; HIT=%d\n", pos, xk2);
+
+ }
+ }else{
+ //#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+ if(0){
+ indel_context_t * indel_context = (indel_context_t *) global_context -> module_contexts[MODULE_INDEL_ID];
+ chromosome_event_t * est = indel_context -> event_space_dynamic;
+ if(est == event_space)
+ printf("OCT27-STEPRS-EVENT_HIT= %u ; HIT=0000\n", pos);
+
+ }
}
return ret;
@@ -1051,6 +1518,7 @@ void set_insertion_sequence(global_context_t * global_context, thread_context_t
int xk1;
(*binary_bases) = malloc((1+insertions)/4+2);
+ //SUBREADprintf("ALLOC PTR=%p\n", (*binary_bases) );
assert(insertions <= MAX_INSERTION_LENGTH);
memset((*binary_bases),0, (1+insertions)/4+2);
@@ -1064,7 +1532,7 @@ void set_insertion_sequence(global_context_t * global_context, thread_context_t
}
}
-chromosome_event_t * local_add_indel_event(global_context_t * global_context, thread_context_t * thread_context, HashTable * event_table, char * read_text, unsigned int left_edge, int indels, int score_supporting_read_added, int is_ambiguous, int mismatched_bases)
+chromosome_event_t * local_add_indel_event(global_context_t * global_context, thread_context_t * thread_context, HashTable * event_table, char * read_text, unsigned int left_edge, int indels, int score_supporting_read_added, int is_ambiguous, int mismatched_bases,int * old_event_id)
{
chromosome_event_t * found = NULL;
chromosome_event_t * search_return [MAX_EVENT_ENTRIES_PER_SITE];
@@ -1076,9 +1544,12 @@ chromosome_event_t * local_add_indel_event(global_context_t * global_context, th
event_space = ((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
-
int found_events = search_event(global_context, event_table, event_space, left_edge, EVENT_SEARCH_BY_SMALL_SIDE, CHRO_EVENT_TYPE_INDEL|CHRO_EVENT_TYPE_LONG_INDEL, search_return);
+ //#warning ">>>>>>>>>>>>>>>>>> COMMENt THIS <<<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPRR-TST LEFT=%u, INDEL=%d; FOUND=%d\n", left_edge , indels , found_events);
+
+
if(found_events)
{
int kx1;
@@ -1093,6 +1564,7 @@ chromosome_event_t * local_add_indel_event(global_context_t * global_context, th
}
if(found){
+ if(old_event_id) (*old_event_id)=found->global_event_id;
found -> supporting_reads += score_supporting_read_added;
//found -> is_ambiguous = max(is_ambiguous , found -> is_ambiguous );
return NULL;
@@ -1124,7 +1596,8 @@ chromosome_event_t * local_add_indel_event(global_context_t * global_context, th
new_event -> event_quality = 1;//pow(0.5 , 3*mismatched_bases);
//new_event -> is_ambiguous = is_ambiguous;
- //SUBREADprintf("NEW INDEL:%d LEFT=%u RIGHT=%u\n", new_event -> indel_length, new_event -> event_small_side , new_event -> event_large_side);
+ //#warning ">>>>>>>>>>>>>>>>>> COMMENt THIS <<<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPRR-NEW LEFT=%u RIGHT=%u, INDEL=%d\n", new_event -> event_small_side , new_event -> event_large_side, new_event -> indel_length );
put_new_event(event_table, new_event , event_no);
return new_event;
@@ -1448,10 +1921,7 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
if(first_correct_base < last_correct_base || first_correct_base > last_correct_base + 3000)
SUBREADprintf("WRONG ORDER: F=%u, L=%d\n", first_correct_base , last_correct_base);
- //#warning ">>>>>>> COMMENT NEXT BLOCK IN RELEASE <<<<<<<<"
- if(0 && FIXLENstrcmp("D00491:277:C89FUANXX:7:1110:20418:31541", read_name) == 0)
- //if(current_result->selected_position > 433897 - 100 && current_result->selected_position < 433897)
- SUBREADprintf("INDEL_P03: I=%d; INDELS=%d; POS=%u; COVER=%d -- %d (vote_no : %d - %d)\n", i, indels, current_result->selected_position, last_correct_base, first_correct_base, last_correct_subread, next_correct_subread);
+ //printf("OCT27-STEPRR %s POS=%u, I=%d; INDELS=%d; COVER=%d -- %d (vote_no : %d - %d)\n", read_name, current_result->selected_position, i, indels, last_correct_base, first_correct_base, last_correct_subread, next_correct_subread);
if(global_context -> config.use_dynamic_programming_indel || read_len > EXON_LONG_READ_LENGTH)
{
@@ -1469,9 +1939,9 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
//#warning ">>>>>>> COMMENT NEXT BLOCK IN RELEASE <<<<<<<<"
- if(0 && FIXLENstrcmp("D00491:277:C89FUANXX:7:1110:20418:31541", read_name) == 0)
- {
- SUBREADprintf("IR= %d %d~%d\n", dyna_steps, last_correct_base, first_correct_base);
+ if(0) {
+ char outstr[1000];
+ sprintf(outstr, "OCT27-STEPDD-IR %s %d %d~%d ", read_name, dyna_steps, last_correct_base, first_correct_base);
for(x1=0; x1<dyna_steps;x1++)
{
@@ -1480,9 +1950,9 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
else if(mv==1)mc='D';
else if(mv==2)mc='I';
else mc='X';
- SUBREADprintf("%c",mc);
+ sprintf(outstr+strlen(outstr),"%c",mc);
}
- SUBREADputs("");
+ puts(outstr);
}
unsigned int cursor_on_chromosome = voting_position + last_correct_base + last_indel, cursor_on_read = last_correct_base;
int last_mv = 0;
@@ -1542,26 +2012,39 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
if(abs(current_indel_len)<=global_context -> config.max_indel_length)
{
- chromosome_event_t * new_event = local_add_indel_event(global_context, thread_context, event_table, read_text + cursor_on_read + min(0,current_indel_len), indel_left_boundary - 1, current_indel_len, 1, ambiguous_count, 0);
+ int old_event_id = -1;
+
+ chromosome_event_t * new_event = local_add_indel_event(global_context, thread_context, event_table, read_text + cursor_on_read + min(0,current_indel_len), indel_left_boundary - 1, current_indel_len, 1, ambiguous_count, 0, &old_event_id);
mark_gapped_read(current_result);
- if(last_event_id >=0 && new_event){
- // the event space can be changed when the new event is added. the location is updated everytime.
- chromosome_event_t * event_space = NULL;
- if(thread_context)
- event_space = ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
- else
- event_space = ((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
+ chromosome_event_t * event_space = NULL;
+ if(thread_context)
+ event_space = ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
+ else
+ event_space = ((indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID]) -> event_space_dynamic;
+
+ if(last_event_id>=0){
chromosome_event_t * last_event = event_space + last_event_id;
+ int dist = indel_left_boundary - last_event -> event_large_side ;
+
+ if(last_event -> connected_next_event_distance<1)
+ last_event -> connected_next_event_distance = dist;
+ else last_event -> connected_next_event_distance = min(dist, last_event -> connected_next_event_distance);
- int dist = new_event -> event_small_side - last_event -> event_large_side +1;
- new_event -> connected_previous_event_distance = dist;
- last_event -> connected_next_event_distance = dist;
+ chromosome_event_t * old_new_event = new_event;
+ if(old_event_id>=0){
+ assert(!new_event);
+ old_new_event = event_space + old_event_id;
+ }
+
+ if(old_new_event -> connected_previous_event_distance < 1)
+ old_new_event -> connected_previous_event_distance = dist;
+ else old_new_event -> connected_previous_event_distance = min(dist, old_new_event -> connected_previous_event_distance);
}
if (new_event)
last_event_id = new_event -> global_event_id;
- else last_event_id = -1;
+ else last_event_id = old_event_id;
}
}
@@ -1639,9 +2122,10 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
*/
- if(best_score >0)
- {
- local_add_indel_event(global_context, thread_context, event_table, read_text + last_correct_base + last_indel, best_pos - 1, indels, 1, is_ambiguous_indel, mismatched_bases);
+ if(best_score >0) {
+ //#warning ">>>>>>>>>>>>>>>>>> COMMEN THIS <<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPAD-%s POS=%u INDEL=%d\n", read_name, best_pos , indels);
+ local_add_indel_event(global_context, thread_context, event_table, read_text + last_correct_base + last_indel, best_pos - 1, indels, 1, is_ambiguous_indel, mismatched_bases, NULL);
mark_gapped_read(current_result);
}
}
@@ -1762,7 +2246,7 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
//head_indel_left_edge -= max(0, head_indel_movement);
if(abs(head_indel_movement)<=global_context -> config.max_indel_length)
{
- local_add_indel_event(global_context, thread_context, event_table, read_text + head_indel_pos, head_indel_left_edge, head_indel_movement, 1, 1, 0);
+ local_add_indel_event(global_context, thread_context, event_table, read_text + head_indel_pos, head_indel_left_edge, head_indel_movement, 1, 1, 0,NULL);
mark_gapped_read(current_result);
}
}
@@ -1772,7 +2256,7 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
if(abs(tail_indel_movement)<=global_context -> config.max_indel_length)
{
- local_add_indel_event(global_context, thread_context, event_table, read_text + tail_indel_pos, tail_indel_left_edge, tail_indel_movement, 1, 1, 0);
+ local_add_indel_event(global_context, thread_context, event_table, read_text + tail_indel_pos, tail_indel_left_edge, tail_indel_movement, 1, 1, 0,NULL);
mark_gapped_read(current_result);
}
@@ -1783,10 +2267,39 @@ int find_new_indels(global_context_t * global_context, thread_context_t * thread
return 0;
}
+void print_indel_table(global_context_t * global_context){
+
+ int xk1;
+ indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
+ HashTable * entry_table = indel_context -> event_entry_table;
+ for(xk1 = 0; xk1 < indel_context -> total_events ; xk1++){
+ chromosome_event_t * event_body = indel_context -> event_space_dynamic +xk1;
+
+ printf("OCT27-STEP-INTAB-TYPE-%d POS %u~%u GID=%u PV %d %d SUP %d / %d\n", event_body -> event_type, event_body -> event_small_side, event_body -> event_large_side, event_body -> global_event_id, event_body -> connected_next_event_distance, event_body -> connected_previous_event_distance , event_body -> supporting_reads , event_body -> anti_supporting_reads);
+ }
+
+ int bucket;
+ KeyValuePair * cursor;
+ for(bucket=0; bucket< entry_table -> numOfBuckets; bucket++){
+ cursor = entry_table -> bucketArray[bucket];
+ while(cursor){
+ unsigned int entry_pos = cursor->key - NULL;
+ int * env_array = cursor->value;
+ int env_i;
+ for(env_i = 1; env_array[env_i]; env_i++){
+ chromosome_event_t * event_body = indel_context -> event_space_dynamic + (env_array[env_i]-1);
+ printf("OCT27-STEPQ-ENTAB-%u [%d] to %u ~ %u len=%d VAL=%d PTR=%p\n",entry_pos, env_i, event_body -> event_small_side, event_body -> event_large_side, event_body -> indel_length, env_array[env_i], env_array);
+ }
+
+ cursor = cursor->next;
+ }
+ }
+}
+
int write_indel_final_results(global_context_t * global_context)
{
- int xk1;
+ int xk1, disk_is_full = 0;
char * inserted_bases = NULL;
indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
char * fn2, *ref_bases, *alt_bases;
@@ -1799,7 +2312,6 @@ int write_indel_final_results(global_context_t * global_context)
//if(!ofp)
// printf("HOW??? %s\n", fn2);
- free(fn2);
inserted_bases = malloc(MAX_INSERTION_LENGTH + 2);
ref_bases = malloc(1000);
alt_bases = malloc(1000);
@@ -1865,7 +2377,8 @@ int write_indel_final_results(global_context_t * global_context)
else event_body -> event_quality = 1;
}
- fprintf(ofp, "%s\t%u\t.\t%s\t%s\t%d\t.\tINDEL;DP=%d;SR=%d\n", chro_name, chro_pos, ref_bases, alt_bases, (int)(max(1, 250 + 10*log(event_body -> event_quality)/log(10))), event_body -> final_counted_reads + event_body -> anti_supporting_reads, event_body -> final_counted_reads);
+ int write_len = fprintf(ofp, "%s\t%u\t.\t%s\t%s\t%d\t.\tINDEL;DP=%d;SR=%d\n", chro_name, chro_pos, ref_bases, alt_bases, (int)(max(1, 250 + 10*log(event_body -> event_quality)/log(10))), event_body -> final_counted_reads + event_body -> anti_supporting_reads, event_body -> final_counted_reads);
+ if(write_len < 10) disk_is_full = 1;
}
global_context->all_indels++;
@@ -1875,6 +2388,11 @@ int write_indel_final_results(global_context_t * global_context)
free(ref_bases);
free(alt_bases);
free(inserted_bases);
+ if(disk_is_full){
+ unlink(fn2);
+ SUBREADprintf("ERROR: disk is full. Unable to write into the indel list.\n");
+ }
+ free(fn2);
return 0;
}
@@ -1901,6 +2419,9 @@ int destroy_indel_module(global_context_t * global_context)
free(indel_context -> dynamic_align_table_mask[xk1]);
}
+ for(xk1=0; xk1<EVENT_BODY_LOCK_BUCKETS; xk1++)
+ subread_destroy_lock(indel_context -> event_body_locks+xk1);
+
free(indel_context -> dynamic_align_table);
free(indel_context -> dynamic_align_table_mask);
return 0;
@@ -4000,6 +4521,7 @@ void init_global_context(global_context_t * context)
context->config.is_first_iteration_running = 1;
context->config.is_second_iteration_running = 1;
context->config.reads_per_chunk = 1024*1024*1024;
+
context->config.reads_per_chunk = 20*1024*1024;
context->config.use_memory_buffer = 1;
context->config.is_methylation_reads = 0;
@@ -4019,8 +4541,11 @@ void init_global_context(global_context_t * context)
context->config.second_read_file[0] = 0;
context->config.index_prefix[0] = 0;
context->config.output_prefix[0] = 0;
- context->config.medium_result_prefix[0] = 0;
+ context->config.exon_annotation_file[0] = 0;
context->config.SAM_extra_columns = 0;
+ context->config.exon_annotation_file_type = FILE_TYPE_GTF;
+ strcpy(context->config.exon_annotation_gene_id_column, "gene_id");
+ strcpy(context->config.exon_annotation_feature_name_column, "exon");
context->config.DP_penalty_create_gap = -1;
context->config.DP_penalty_extend_gap = 0;
@@ -4040,13 +4565,6 @@ void init_global_context(global_context_t * context)
memcpy(seed_rand, &double_time, 2*sizeof(int));
srand(seed_rand[0]^seed_rand[1]);
- char mac_rand[13];
- mac_or_rand_str(mac_rand);
-
- sprintf(context->config.temp_file_prefix, "./core-temp-sum-%06u-%s", getpid(), mac_rand );
- _COREMAIN_delete_temp_prefix = context->config.temp_file_prefix;
-
-
context->config.max_indel_length = 5;
context->config.phred_score_format = FASTQ_PHRED33;
context->start_time = miltime();
@@ -4057,7 +4575,26 @@ void init_global_context(global_context_t * context)
context->timecost_for_realign = 0;
}
+void init_core_temp_path(global_context_t * context){
+ int x1;
+ char mac_rand[13];
+ mac_or_rand_str(mac_rand);
+
+ context->config.temp_file_prefix[0] = 0;
+ if(context->config.output_prefix[0]){
+ for(x1 = strlen(context->config.output_prefix); x1 >=0; x1--){
+ if(context->config.output_prefix[x1]=='/'){
+ memcpy(context->config.temp_file_prefix, context->config.output_prefix, x1);
+ context->config.temp_file_prefix[x1]=0;
+ break;
+ }
+ }
+ }
+ if(context->config.temp_file_prefix[0] == 0)strcpy(context->config.temp_file_prefix, "./");
+ sprintf(context->config.temp_file_prefix+strlen(context->config.temp_file_prefix), "/core-temp-sum-%06u-%s", getpid(), mac_rand );
+ _COREMAIN_delete_temp_prefix = context->config.temp_file_prefix;
+}
#define INDEL_MASK_BY_INSERTION 1
#define INDEL_MASK_BY_DELETION 2
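init_core_temp_path, added near the end of the core-indel.c diff above, places temporary files next to the output file: it copies the directory part of output_prefix and falls back to the current directory when there is none. The sketch below shows the same derivation with strrchr and a bounded snprintf. The function name build_temp_prefix and its parameters are hypothetical; the process id and random tag are passed in rather than taken from getpid() and mac_or_rand_str(). Unlike the diff, the fallback path is written without a doubled slash.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "<dir>/core-temp-sum-<pid>-<tag>" where <dir> is the directory
 * part of output_prefix. As in the diff, a prefix without a directory
 * part (or with only a leading '/') falls back to the current directory.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int build_temp_prefix(const char *output_prefix, unsigned int pid,
                      const char *tag, char *out, size_t cap)
{
    const char *slash;
    size_t dir_len;
    int n;

    assert(output_prefix != NULL && tag != NULL && out != NULL && cap > 0);

    slash = strrchr(output_prefix, '/');
    dir_len = slash ? (size_t)(slash - output_prefix) : 0;

    if (dir_len == 0)
        n = snprintf(out, cap, "./core-temp-sum-%06u-%s", pid, tag);
    else
        n = snprintf(out, cap, "%.*s/core-temp-sum-%06u-%s",
                     (int)dir_len, output_prefix, pid, tag);

    /* snprintf reports the length it needed; treat truncation as failure. */
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Returning an error on truncation is a design choice of this sketch; the code in the diff uses unbounded sprintf into a fixed-size buffer.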
diff --git a/src/core-indel.h b/src/core-indel.h
index 6eea940..75d210a 100644
--- a/src/core-indel.h
+++ b/src/core-indel.h
@@ -30,8 +30,8 @@
//#define MAX_EVENT_ENTRIES_PER_SITE 5
//#define MAX_EVENT_ENTRIES_PER_SITE 12
//
-#define EVENT_ENTRIES_INIT_SIZE 9
-#define MAX_EVENT_ENTRIES_PER_SITE 9
+#define EVENT_ENTRIES_INIT_SIZE (9)
+#define MAX_EVENT_ENTRIES_PER_SITE (9)
#define CHRO_EVENT_TYPE_REMOVED 0
#define CHRO_EVENT_TYPE_INDEL 8
#define CHRO_EVENT_TYPE_LONG_INDEL 16
@@ -132,6 +132,8 @@ typedef struct{
unsigned int block_start_linear_pos;
} reassembly_block_context_t;
+#define EVENT_BODY_LOCK_BUCKETS 14929
+
typedef struct{
HashTable * event_entry_table;
@@ -139,6 +141,7 @@ typedef struct{
unsigned int current_max_event_number;
chromosome_event_t * event_space_dynamic;
HashTable * local_reassembly_pileup_files;
+ subread_lock_t event_body_locks[EVENT_BODY_LOCK_BUCKETS];
short ** dynamic_align_table;
char ** dynamic_align_table_mask;
@@ -159,7 +162,9 @@ typedef struct{
int init_indel_tables(global_context_t * context);
int destroy_indel_module(global_context_t * context);
int init_indel_thread_contexts(global_context_t * global_context, thread_context_t * thread_context, int task);
-int finalise_indel_thread(global_context_t * global_context, thread_context_t * thread_context, int task);
+int sort_global_event_table(global_context_t * global_context);
+int load_known_junctions(global_context_t * global_context);
+int finalise_indel_and_junction_thread(global_context_t * global_context, thread_context_t * thread_contexts, int task);
int find_new_indels(global_context_t * global_context, thread_context_t * thread_context, int pair_number, char * read_name, char * read_text, char * qual_text, int read_len, int is_second_read, int best_read_id);
int write_indel_final_results(global_context_t * context);
int search_event(global_context_t * global_context,HashTable * event_table, chromosome_event_t * event_space, unsigned int pos, int search_type, char event_type, chromosome_event_t ** return_buffer);
@@ -188,5 +193,11 @@ int anti_supporting_read_scan(global_context_t * global_context);
int core_dynamic_align(global_context_t * global_context, thread_context_t * thread_context, char * read, int read_len, unsigned int begin_position, char * movement_buffer, int expected_offset, char * read_name);
-chromosome_event_t * local_add_indel_event(global_context_t * global_context, thread_context_t * thread_context, HashTable * event_table, char * read_text, unsigned int left_edge, int indels, int score_supporting_read_added, int is_ambiguous, int mismatched_bases);
+void init_core_temp_path(global_context_t * context);
+
+chromosome_event_t * local_add_indel_event(global_context_t * global_context, thread_context_t * thread_context, HashTable * event_table, char * read_text, unsigned int left_edge, int indels, int score_supporting_read_added, int is_ambiguous, int mismatched_bases,int * old_event_id);
+
+void print_indel_table(global_context_t * global_context);
+int sort_junction_entry_table(global_context_t * global_context);
+void mark_event_bitmap(unsigned char * bitmap, unsigned int pos);
#endif
diff --git a/src/core-interface-aligner.c b/src/core-interface-aligner.c
index 5145414..43dd5aa 100644
--- a/src/core-interface-aligner.c
+++ b/src/core-interface-aligner.c
@@ -4,7 +4,6 @@
#include <getopt.h>
#include <unistd.h>
-
#include "subread.h"
#include "input-files.h"
#include "core.h"
@@ -36,6 +35,9 @@ static struct option long_options[] =
{"quality", no_argument, 0, 'Q'},
{"trim5", required_argument, 0, '5'},
{"trim3", required_argument, 0, '3'},
+ {"exonAnnotation", required_argument, 0, 'a'},
+ {"exonAlias", required_argument, 0, 'A'},
+ {"exonFormat", required_argument, 0, 'F'},
{"memoryMultiplex", required_argument, 0, 0},
{"rg", required_argument, 0, 0},
{"gzFASTQinput", no_argument, 0, 0},
@@ -66,65 +68,80 @@ void print_usage_core_aligner()
SUBREADprintf("\nVersion %s\n\n", SUBREAD_VERSION);
SUBREADputs("Usage:");
SUBREADputs("");
- SUBREADputs(" ./subread-align [options] -i <index_name> -r <input> -o <output> -t <type>");
+ SUBREADputs("./subread-align [options] -i <index_name> -r <input> -t <type> -o <output>");
SUBREADputs("");
- SUBREADputs("Required arguments:");
+ SUBREADputs("## Mandatory arguments:");
SUBREADputs(" ");
SUBREADputs(" -i <string> Base name of the index.");
SUBREADputs("");
- SUBREADputs(" -r <string> Name of the input file. Input formats including gzipped");
- SUBREADputs(" fastq, fastq, and fasta can be automatically detected. If");
- SUBREADputs(" paired-end, this should give the name of file including");
- SUBREADputs(" first reads.");
+ SUBREADputs(" -r <string> Name of an input read file. If paired-end, this should be");
+ SUBREADputs(" the first read file (typically containing ‘R1’ in the file");
+ SUBREADputs(" name) and the second should be provided via ‘-R’.");
+ SUBREADputs(" Acceptable formats include gzipped FASTQ, FASTQ and FASTA.");
+ SUBREADputs(" These formats are identified automatically.");
SUBREADputs(" ");
SUBREADputs(" -t <int> Type of input sequencing data. Its values include");
SUBREADputs(" 0: RNA-seq data");
SUBREADputs(" 1: genomic DNA-seq data.");
SUBREADputs(" ");
- SUBREADputs("Optional arguments:");
+ SUBREADputs("## Optional arguments:");
+ SUBREADputs("# input reads and output");
SUBREADputs(" ");
- SUBREADputs(" -o <string> Name of the output file. By default, the output is in BAM");
- SUBREADputs(" format.");
+ SUBREADputs(" -o <string> Name of an output file. By default, the output is in BAM");
+ SUBREADputs(" format. Omitting this option makes the output be written to");
+ SUBREADputs(" STDOUT.");
+ SUBREADputs("");
+ SUBREADputs(" -R <string> Name of the second read file in paired-end data (typically");
+ SUBREADputs(" containing ‘R2’ in the file name).");
+ SUBREADputs("");
+ SUBREADputs(" --SAMinput Input reads are in SAM format.");
+ SUBREADputs("");
+ SUBREADputs(" --BAMinput Input reads are in BAM format.");
+ SUBREADputs("");
+ SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
+ SUBREADputs("");
+ SUBREADputs("# offset value added to Phred quality scores of read bases");
+ SUBREADputs("");
+ SUBREADputs(" -P <3:6> Offset value added to the Phred quality score of each read");
+ SUBREADputs(" base. '3' for phred+33 and '6' for phred+64. '3' by default.");
+ SUBREADputs("");
+ SUBREADputs("# thresholds for mapping");
SUBREADputs("");
SUBREADputs(" -n <int> Number of selected subreads, 10 by default.");
SUBREADputs("");
SUBREADputs(" -m <int> Consensus threshold for reporting a hit (minimal number of");
SUBREADputs(" subreads that map in consensus) . If paired-end, this gives");
- SUBREADputs(" the consensus threshold for the anchor read. 3 by default");
+ SUBREADputs(" the consensus threshold for the anchor read (anchor read");
+ SUBREADputs(" receives more votes than the other read in the same pair).");
+ SUBREADputs(" 3 by default");
+ SUBREADputs("");
+ SUBREADputs(" -p <int> Consensus threshold for the non-anchor read in a pair. 1 by");
+ SUBREADputs(" default.");
SUBREADputs("");
- SUBREADputs(" -M <int> Specify the maximum number of mis-matched bases allowed in");
- SUBREADputs(" the alignment. 3 by default. Mis-matches found in soft-");
+ SUBREADputs(" -M <int> Maximum number of mis-matched bases allowed in each reported");
+ SUBREADputs(" alignment. 3 by default. Mis-matched bases found in soft-");
SUBREADputs(" clipped bases are not counted.");
SUBREADputs("");
- SUBREADputs(" -T <int> Number of CPU threads used, 1 by default.");
+ SUBREADputs("# unique mapping and multi-mapping");
SUBREADputs("");
- SUBREADputs(" -I <int> Maximum length (in bp) of indels that can be detected. 5 by");
- SUBREADputs(" default. The program can detect indels of up to 200bp long.");
+ SUBREADputs(" -u Report uniquely mapped reads only. Number of matched bases");
+ SUBREADputs(" (for RNA-seq) or mis-matched bases (for genomic DNA-seq) is");
+ SUBREADputs(" used to break the tie when multiple mapping locations are");
+ SUBREADputs(" found.");
SUBREADputs("");
SUBREADputs(" -B <int> Maximal number of equally-best mapping locations to be");
SUBREADputs(" reported. 1 by default. Note that -u option takes precedence");
SUBREADputs(" over -B.");
SUBREADputs("");
- SUBREADputs(" -P <3:6> Format of Phred scores in input files, '3' for phred+33 and");
- SUBREADputs(" '6' for phred+64. '3' by default.");
- SUBREADputs("");
- SUBREADputs(" -u Report uniquely mapped reads only. Number of matched bases (");
- SUBREADputs(" for RNA-seq) or mis-matched bases(for genomic DNA-seq) is");
- SUBREADputs(" used to break the tie.");
- SUBREADputs("");
- SUBREADputs(" -b Convert color-space read bases to base-space read bases in");
- SUBREADputs(" the mapping output. Note that read mapping is performed at");
- SUBREADputs(" color-space.");
- SUBREADputs("");
- SUBREADputs(" --sv Detect structural variants (eg. long indel, inversion,");
- SUBREADputs(" duplication and translocation) and report breakpoints. Refer");
- SUBREADputs(" to Users Guide for breakpoint reporting.");
+ SUBREADputs("# indel detection");
SUBREADputs("");
- SUBREADputs(" --SAMinput Input reads are in SAM format.");
+ SUBREADputs(" -I <int> Maximum length (in bp) of indels that can be detected. 5 by");
+ SUBREADputs(" default. Indels of up to 200bp long can be detected.");
SUBREADputs("");
- SUBREADputs(" --BAMinput Input reads are in BAM format.");
+ SUBREADputs(" --complexIndels Detect multiple short indels that are in close proximity");
+ SUBREADputs(" (they can be as close as 1bp apart from each other).");
SUBREADputs("");
- SUBREADputs(" --SAMoutput Save mapping result in SAM format.");
+ SUBREADputs("# read trimming");
SUBREADputs("");
SUBREADputs(" --trim5 <int> Trim off <int> number of bases from 5' end of each read. 0");
SUBREADputs(" by default.");
@@ -132,10 +149,33 @@ void print_usage_core_aligner()
SUBREADputs(" --trim3 <int> Trim off <int> number of bases from 3' end of each read. 0");
SUBREADputs(" by default.");
SUBREADputs("");
+ SUBREADputs("# distance and orientation of paired end reads");
+ SUBREADputs("");
+ SUBREADputs(" -d <int> Minimum fragment/insert length, 50bp by default.");
+ SUBREADputs("");
+ SUBREADputs(" -D <int> Maximum fragment/insert length, 600bp by default.");
+ SUBREADputs("");
+ SUBREADputs(" -S <ff:fr:rf> Orientation of first and second reads, 'fr' by default (");
+ SUBREADputs(" forward/reverse).");
+ SUBREADputs("");
+ SUBREADputs("# number of CPU threads");
+ SUBREADputs("");
+ SUBREADputs(" -T <int> Number of CPU threads used, 1 by default.");
+ SUBREADputs("");
+ SUBREADputs("# read group");
+ SUBREADputs("");
SUBREADputs(" --rg-id <string> Add read group ID to the output.");
SUBREADputs("");
SUBREADputs(" --rg <string> Add <tag:value> to the read group (RG) header in the output.");
SUBREADputs("");
+ SUBREADputs("# color space reads");
+ SUBREADputs("");
+ SUBREADputs(" -b Convert color-space read bases to base-space read bases in");
+ SUBREADputs(" the mapping output. Note that read mapping is performed at");
+ SUBREADputs(" color-space.");
+ SUBREADputs("");
+ SUBREADputs("# dynamic programming");
+ SUBREADputs("");
SUBREADputs(" --DPGapOpen <int> Penalty for gap opening in short indel detection. -1 by");
SUBREADputs(" default.");
SUBREADputs("");
@@ -148,30 +188,44 @@ void print_usage_core_aligner()
SUBREADputs(" --DPMatch <int> Score for matched bases in short indel detection. 2 by");
SUBREADputs(" default.");
SUBREADputs("");
- SUBREADputs(" --complexIndels Detect multiple short indels that occur concurrently in a");
- SUBREADputs(" small genomic region (these indels could be as close as 1bp");
- SUBREADputs(" apart).");
+ SUBREADputs("# detect structural variants");
SUBREADputs("");
- SUBREADputs(" -v Output version of the program.");
+ SUBREADputs(" --sv Detect structural variants (eg. long indel, inversion,");
+ SUBREADputs(" duplication and translocation) and report breakpoints. Refer");
+ SUBREADputs(" to Users Guide for breakpoint reporting.");
SUBREADputs("");
- SUBREADputs("Optional arguments for paired-end reads:");
+ SUBREADputs("# gene annotation");
SUBREADputs("");
- SUBREADputs(" -R <string> Name of the file including second reads.");
+ SUBREADputs(" -a Name of an annotation file. GTF/GFF format by default. See");
+ SUBREADputs(" -F option for more format information.");
SUBREADputs("");
- SUBREADputs(" -p <int> Consensus threshold for the non-anchor read (receiving less");
- SUBREADputs(" votes than the anchor read from the same pair). 1 by");
- SUBREADputs(" default.");
+ SUBREADputs(" -F Specify format of the provided annotation file. Acceptable");
+ SUBREADputs(" formats include 'GTF' (or compatible GFF format) and");
+ SUBREADputs(" 'SAF'. 'GTF' by default. For SAF format, please refer to");
+ SUBREADputs(" Users Guide.");
SUBREADputs("");
- SUBREADputs(" -d <int> Minimum fragment/insert length, 50bp by default.");
+ SUBREADputs(" -A Provide a chromosome name alias file to match chr names in");
+ SUBREADputs(" annotation with those in the reads. This should be a two-");
+ SUBREADputs(" column comma-delimited text file. Its first column should");
+ SUBREADputs(" include chr names in the annotation and its second column");
+ SUBREADputs(" should include chr names in the index. Chr names are case");
+ SUBREADputs(" sensitive. No column header should be included in the");
+ SUBREADputs(" file.");
SUBREADputs("");
- SUBREADputs(" -D <int> Maximum fragment/insert length, 600bp by default.");
+ SUBREADputs(" --gtfFeature <string> Specify feature type in GTF annotation. 'exon'");
+ SUBREADputs(" by default. Features used for read counting will be ");
+ SUBREADputs(" extracted from annotation using the provided value.");
SUBREADputs("");
- SUBREADputs(" -S <ff:fr:rf> Orientation of first and second reads, 'fr' by default (");
- SUBREADputs(" forward/reverse).");
+ SUBREADputs(" --gtfAttr <string> Specify attribute type in GTF annotation. 'gene_id'");
+ SUBREADputs(" by default. Meta-features used for read counting will be ");
+ SUBREADputs(" extracted from annotation using the provided value.");
+ SUBREADputs("");
+ SUBREADputs("# others");
+ SUBREADputs("");
+ SUBREADputs(" -v Output version of the program.");
SUBREADputs("");
SUBREADputs("Refer to Users Manual for detailed description to the arguments. ");
SUBREADputs("");
-
}
@@ -211,7 +265,7 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
*/
- while ((c = getopt_long (argc, argv, "xsvJS:L:AHd:D:n:m:p:G:E:X:Y:P:R:r:i:l:o:T:I:t:B:bFcuUfM:Q1:2:3:5:?", long_options, &option_index)) != -1)
+ while ((c = getopt_long (argc, argv, "xsvJS:L:A:a:Hd:D:n:m:p:G:E:X:Y:P:R:r:i:l:o:T:I:t:B:bF:cuUfM:Q1:2:3:5:?", long_options, &option_index)) != -1)
{
switch(c)
{
@@ -270,8 +324,21 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
case 's':
global_context->config.downscale_mapping_quality = 1;
break;
+ case 'a':
+ strncpy(global_context->config.exon_annotation_file, optarg, MAX_FILE_NAME_LENGTH-1);
+ break;
case 'A':
- global_context->config.report_sam_file = 0;
+ strncpy(global_context->config.exon_annotation_alias_file, optarg, MAX_FILE_NAME_LENGTH-1);
+ break;
+ case 'F':
+ if(strcmp(optarg,"GTF")==0){
+ global_context->config.exon_annotation_file_type = FILE_TYPE_GTF;
+ } else if(strcmp(optarg,"SAF")==0){
+ global_context->config.exon_annotation_file_type = FILE_TYPE_RSUBREAD;
+ } else {
+ SUBREADprintf("Unknown annotation file format: %s.\nThe accepted formats are GTF and SAF only.\n", optarg);
+ STANDALONE_exit(-1);
+ }
break;
case 'S':
global_context->config.is_first_read_reversed = optarg[0]=='r'?1:0;
@@ -391,10 +458,6 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
}
break;
- case 'F':
- global_context->config.is_second_iteration_running = 0;
- global_context->config.report_sam_file = 0;
- break;
case 'c':
global_context->config.space_type = GENE_SPACE_COLOR;
break;
@@ -470,6 +533,14 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
global_context -> config.use_memory_buffer = 1;
global_context -> config.reads_per_chunk = 600llu*1024*1024;
}
+ else if(strcmp("gtfFeature", long_options[option_index].name) == 0)
+ {
+ strncpy(global_context->config.exon_annotation_feature_name_column, optarg, MAX_READ_NAME_LEN - 1);
+ }
+ else if(strcmp("gtfAttr", long_options[option_index].name) == 0)
+ {
+ strncpy(global_context->config.exon_annotation_gene_id_column, optarg, MAX_READ_NAME_LEN - 1);
+ }
else if(strcmp("maxRealignLocations", long_options[option_index].name)==0)
{
global_context->config.max_vote_combinations = atoi(optarg);
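
Note on the option changes in core-interface-aligner.c above: the optstring now reads "A:", "a:" and "F:", so -A, -a and -F each take an argument, and -F accepts only the exact strings 'GTF' or 'SAF'. The sketch below illustrates that mapping together with a bounded copy that always terminates the destination. It is not Subread code: `parse_annotation_format`, `copy_option_value` and the `ANNOT_FORMAT_*` constants are hypothetical stand-ins, and the explicit terminator assumes the destination buffer might not be zero-filled beforehand, which this diff does not show either way.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the FILE_TYPE_* constants used in the diff;
 * the real values live in the Subread headers and may differ. */
enum annot_format {
    ANNOT_FORMAT_UNKNOWN = -1,
    ANNOT_FORMAT_GTF = 0,
    ANNOT_FORMAT_SAF = 1
};

/* Map the argument of -F to a format code. The comparison is an exact,
 * case-sensitive match, mirroring the strcmp() calls in the diff. */
static int parse_annotation_format(const char *name)
{
    if (name == NULL) return ANNOT_FORMAT_UNKNOWN;
    if (strcmp(name, "GTF") == 0) return ANNOT_FORMAT_GTF;
    if (strcmp(name, "SAF") == 0) return ANNOT_FORMAT_SAF;
    return ANNOT_FORMAT_UNKNOWN;
}

/* Copy an option value into a fixed-size buffer. strncpy() does not write
 * a terminator when the source fills the limit, so one is added here. */
static void copy_option_value(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

A caller would reject `ANNOT_FORMAT_UNKNOWN` with an error message, as the new 'F' case does, and would pass `sizeof` of the destination field as `dst_size`.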
diff --git a/src/core-interface-subjunc.c b/src/core-interface-subjunc.c
index f89330c..d64a104 100644
--- a/src/core-interface-subjunc.c
+++ b/src/core-interface-subjunc.c
@@ -31,6 +31,11 @@ static struct option long_options[] =
{"color-convert", no_argument, 0, 'b'},
{"junctionIns", required_argument, 0, 0},
{"multi", required_argument, 0, 'B'},
+ {"exonAnnotation", required_argument, 0, 'a'},
+ {"exonAlias", required_argument, 0, 'A'},
+ {"exonFormat", required_argument, 0, 'F'},
+ {"gtfAttr", required_argument, 0, 0},
+ {"gtfFeature", required_argument, 0, 0},
{"rg", required_argument, 0, 0},
{"rg-id", required_argument, 0, 0},
{"gzFASTQinput", no_argument, 0, 0},
@@ -72,7 +77,7 @@ void print_usage_core_subjunc()
SUBREADputs("");
SUBREADputs(" ./subjunc [options] -i <index_name> -r <input> -o <output>");
SUBREADputs("");
- SUBREADputs("Required arguments:");
+ SUBREADputs("## Mandatory arguments:");
SUBREADputs("");
SUBREADputs(" -i <index> Base name of the index.");
SUBREADputs("");
@@ -81,45 +86,64 @@ void print_usage_core_subjunc()
SUBREADputs(" paired-end, this should give the name of file including");
SUBREADputs(" first reads.");
SUBREADputs("");
- SUBREADputs("Optional arguments:");
+ SUBREADputs("## Optional arguments:");
+ SUBREADputs("# input reads and output");
+ SUBREADputs(" ");
+ SUBREADputs(" -o <string> Name of an output file. By default, the output is in BAM");
+ SUBREADputs(" format. Omitting this option makes the output be written to");
+ SUBREADputs(" STDOUT.");
SUBREADputs("");
- SUBREADputs(" -o <string> Name of the output file. By default, the output is in BAM");
- SUBREADputs(" format.");
+ SUBREADputs(" -R <string> Name of the second read file in paired-end data (typically");
+ SUBREADputs(" containing ‘R2’ in the file name).");
SUBREADputs("");
- SUBREADputs(" -n <int> Number of selected subreads, 14 by default.");
+ SUBREADputs(" --SAMinput Input reads are in SAM format.");
+ SUBREADputs("");
+ SUBREADputs(" --BAMinput Input reads are in BAM format.");
+ SUBREADputs("");
+ SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
+ SUBREADputs("");
+ SUBREADputs("# offset value added to Phred quality scores of read bases");
+ SUBREADputs("");
+ SUBREADputs(" -P <3:6> Offset value added to the Phred quality score of each read");
+ SUBREADputs(" base. '3' for phred+33 and '6' for phred+64. '3' by default.");
+ SUBREADputs("");
+ SUBREADputs("# thresholds for mapping");
+ SUBREADputs("");
+ SUBREADputs(" -n <int> Number of selected subreads, 10 by default.");
SUBREADputs("");
SUBREADputs(" -m <int> Consensus threshold for reporting a hit (minimal number of");
SUBREADputs(" subreads that map in consensus) . If paired-end, this gives");
- SUBREADputs(" the consensus threshold for the anchor read. 1 by default");
+ SUBREADputs(" the consensus threshold for the anchor read (anchor read");
+ SUBREADputs(" receives more votes than the other read in the same pair).");
+ SUBREADputs(" 3 by default");
SUBREADputs("");
- SUBREADputs(" -M <int> Specify the maximum number of mis-matched bases allowed in");
- SUBREADputs(" the alignment. 3 by default. Mis-matches found in soft-");
+ SUBREADputs(" -p <int> Consensus threshold for the non- anchor read in a pair. 1 by");
+ SUBREADputs(" default.");
+ SUBREADputs("");
+ SUBREADputs(" -M <int> Maximum number of mis-matched bases allowed in each reported");
+ SUBREADputs(" alignment. 3 by default. Mis-matched bases found in soft-");
SUBREADputs(" clipped bases are not counted.");
SUBREADputs("");
- SUBREADputs(" -T <int> Number of CPU threads used, 1 by default.");
+ SUBREADputs("# unique mapping and multi-mapping");
SUBREADputs("");
- SUBREADputs(" -I <int> Maximum length (in bp) of indels that can be detected. 5 by");
- SUBREADputs(" default. The program can detect indels of up to 200bp long.");
+ SUBREADputs(" -u Report uniquely mapped reads only. Number of matched bases (");
+ SUBREADputs(" for RNA-seq) or mis-matched bases(for genomic DNA-seq) is");
+ SUBREADputs(" used to break the tie when multiple mapping locations are");
+ SUBREADputs(" found.");
SUBREADputs("");
SUBREADputs(" -B <int> Maximal number of equally-best mapping locations to be");
SUBREADputs(" reported. 1 by default. Note that -u option takes precedence");
SUBREADputs(" over -B.");
SUBREADputs("");
- SUBREADputs(" -P <3:6> Format of Phred scores used in input files, '3' for phred+33");
- SUBREADputs(" and '6' for phred+64. '3' by default.");
- SUBREADputs("");
- SUBREADputs(" -u Report uniquely mapped reads only. Number of mis-matched");
- SUBREADputs(" bases is used to break the tie.");
- SUBREADputs("");
- SUBREADputs(" -b Convert color-space read bases to base-space read bases in");
- SUBREADputs(" the mapping output. Note that read mapping is performed at");
- SUBREADputs(" color-space.");
+ SUBREADputs("# indel detection");
SUBREADputs("");
- SUBREADputs(" --SAMinput Input reads are in SAM format.");
+ SUBREADputs(" -I <int> Maximum length (in bp) of indels that can be detected. 5 by");
+ SUBREADputs(" default. Indels of up to 200bp long can be detected.");
SUBREADputs("");
- SUBREADputs(" --BAMinput Input reads are in BAM format.");
+ SUBREADputs(" --complexIndels Detect multiple short indels that are in close proximity");
+ SUBREADputs(" (they can be as close as 1bp apart from each other).");
SUBREADputs("");
- SUBREADputs(" --SAMoutput Save mapping result in SAM format.");
+ SUBREADputs("# read trimming");
SUBREADputs("");
SUBREADputs(" --trim5 <int> Trim off <int> number of bases from 5' end of each read. 0");
SUBREADputs(" by default.");
@@ -127,10 +151,33 @@ void print_usage_core_subjunc()
SUBREADputs(" --trim3 <int> Trim off <int> number of bases from 3' end of each read. 0");
SUBREADputs(" by default.");
SUBREADputs("");
+ SUBREADputs("# distance and orientation of paired end reads");
+ SUBREADputs("");
+ SUBREADputs(" -d <int> Minimum fragment/insert length, 50bp by default.");
+ SUBREADputs("");
+ SUBREADputs(" -D <int> Maximum fragment/insert length, 600bp by default.");
+ SUBREADputs("");
+ SUBREADputs(" -S <ff:fr:rf> Orientation of first and second reads, 'fr' by default (");
+ SUBREADputs(" forward/reverse).");
+ SUBREADputs("");
+ SUBREADputs("# number of CPU threads");
+ SUBREADputs("");
+ SUBREADputs(" -T <int> Number of CPU threads used, 1 by default.");
+ SUBREADputs("");
+ SUBREADputs("# read group");
+ SUBREADputs("");
SUBREADputs(" --rg-id <string> Add read group ID to the output.");
SUBREADputs("");
SUBREADputs(" --rg <string> Add <tag:value> to the read group (RG) header in the output.");
SUBREADputs("");
+ SUBREADputs("# color space reads");
+ SUBREADputs("");
+ SUBREADputs(" -b Convert color-space read bases to base-space read bases in");
+ SUBREADputs(" the mapping output. Note that read mapping is performed at");
+ SUBREADputs(" color-space.");
+ SUBREADputs("");
+ SUBREADputs("# dynamic programming");
+ SUBREADputs("");
SUBREADputs(" --DPGapOpen <int> Penalty for gap opening in short indel detection. -1 by");
SUBREADputs(" default.");
SUBREADputs("");
@@ -143,35 +190,44 @@ void print_usage_core_subjunc()
SUBREADputs(" --DPMatch <int> Score for matched bases in short indel detection. 2 by");
SUBREADputs(" default.");
SUBREADputs("");
+ SUBREADputs("# detect all junctions including gene fusions");
+ SUBREADputs("");
SUBREADputs(" --allJunctions Detect exon-exon junctions (both canonical and non-canonical");
SUBREADputs(" junctions) and structural variants in RNA-seq data. Refer to");
SUBREADputs(" Users Guide for reporting of junctions and fusions.");
SUBREADputs("");
- SUBREADputs(" --complexIndels Detect multiple short indels that occur concurrently in a");
- SUBREADputs(" small genomic region (these indels could be as close as 1bp");
- SUBREADputs(" apart).");
+ SUBREADputs("# gene annotation");
SUBREADputs("");
- SUBREADputs(" -v Output version of the program.");
+ SUBREADputs(" -a Name of an annotation file. GTF/GFF format by default. See");
+ SUBREADputs(" -F option for more format information.");
SUBREADputs("");
- SUBREADputs("Optional arguments for paired-end reads:");
+ SUBREADputs(" -F Specify format of the provided annotation file. Acceptable");
+ SUBREADputs(" formats include 'GTF' (or compatible GFF format) and");
+ SUBREADputs(" 'SAF'. 'GTF' by default. For SAF format, please refer to");
+ SUBREADputs(" Users Guide.");
SUBREADputs("");
- SUBREADputs(" -R <string> Name of the file including second reads.");
+ SUBREADputs(" -A Provide a chromosome name alias file to match chr names in");
+ SUBREADputs(" annotation with those in the reads. This should be a two-");
+ SUBREADputs(" column comma-delimited text file. Its first column should");
+ SUBREADputs(" include chr names in the annotation and its second column");
+ SUBREADputs(" should include chr names in the index. Chr names are case");
+ SUBREADputs(" sensitive. No column header should be included in the");
+ SUBREADputs(" file.");
SUBREADputs("");
- SUBREADputs(" -p <int> Consensus threshold for the non-anchor read (receiving less");
- SUBREADputs(" votes than the anchor read from the same pair). 1 by");
- SUBREADputs(" default.");
+ SUBREADputs(" --gtfFeature <string> Specify feature type in GTF annotation. 'exon'");
+ SUBREADputs(" by default. Features used for read counting will be ");
+ SUBREADputs(" extracted from annotation using the provided value.");
SUBREADputs("");
- SUBREADputs(" -d <int> Minimum fragment/insert length, 50bp by default.");
+ SUBREADputs(" --gtfAttr <string> Specify attribute type in GTF annotation. 'gene_id'");
+ SUBREADputs(" by default. Meta-features used for read counting will be ");
+ SUBREADputs(" extracted from annotation using the provided value.");
SUBREADputs("");
- SUBREADputs(" -D <int> Maximum fragment/insert length, 600bp by default.");
+ SUBREADputs("# others");
SUBREADputs("");
- SUBREADputs(" -S <ff:fr:rf> Orientation of first and second reads, 'fr' by default (");
- SUBREADputs(" forward/reverse).");
+ SUBREADputs(" -v Output version of the program.");
SUBREADputs("");
- SUBREADputs("Refer to Users Manual for detailed description to the arguments.");
+ SUBREADputs("Refer to Users Manual for detailed description to the arguments. ");
SUBREADputs("");
-
-
}
int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_context)
@@ -212,7 +268,7 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
print_usage_core_subjunc();
return -1;
}
- while ((c = getopt_long (argc, argv, "vxsJ1:2:S:L:AHd:D:n:m:p:P:R:r:i:l:o:G:Y:E:X:T:I:B:bQFcuUfM:3:5:9:?", long_options, &option_index)) != -1)
+ while ((c = getopt_long (argc, argv, "vxsJ1:2:S:L:A:a:Hd:D:n:m:p:P:R:r:i:l:o:G:Y:E:X:T:I:B:bQF:cuUfM:3:5:9:?", long_options, &option_index)) != -1)
{
switch(c)
{
@@ -273,8 +329,22 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
case 's':
global_context->config.downscale_mapping_quality = 1;
break;
+ case 'a':
+ strncpy(global_context->config.exon_annotation_file, optarg, MAX_FILE_NAME_LENGTH-1);
+ break;
case 'A':
- global_context->config.report_sam_file = 0;
+ strncpy(global_context->config.exon_annotation_alias_file, optarg, MAX_FILE_NAME_LENGTH-1);
+ break;
+ case 'F':
+ if(strcmp(optarg,"GTF")==0){
+ global_context->config.exon_annotation_file_type = FILE_TYPE_GTF;
+ } else if(strcmp(optarg,"SAF")==0){
+ global_context->config.exon_annotation_file_type = FILE_TYPE_RSUBREAD;
+ } else {
+ SUBREADprintf("Unknown annotation file format: %s.\nThe accepted formats are GTF and SAF only.\n", optarg);
+ STANDALONE_exit(-1);
+ }
+
break;
case 'S':
global_context->config.is_first_read_reversed = optarg[0]=='r'?1:0;
@@ -385,10 +455,6 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
global_context->config.minimum_subread_for_second_read = atoi(optarg);
break;
- case 'F':
- global_context->config.is_second_iteration_running = 0;
- global_context->config.report_sam_file = 0;
- break;
case 'B':
if(!is_valid_digit(optarg, "B"))
STANDALONE_exit(-1);
@@ -496,6 +562,14 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
global_context->config.do_big_margin_filtering_for_junctions = 0;
global_context->config.limited_tree_scan = 0;
}
+ else if(strcmp("gtfFeature", long_options[option_index].name) == 0)
+ {
+ strncpy(global_context->config.exon_annotation_feature_name_column, optarg, MAX_READ_NAME_LEN - 1);
+ }
+ else if(strcmp("gtfAttr", long_options[option_index].name) == 0)
+ {
+ strncpy(global_context->config.exon_annotation_gene_id_column, optarg, MAX_READ_NAME_LEN - 1);
+ }
else if(strcmp("maxVoteSimples", long_options[option_index].name)==0)
{
global_context->config.max_vote_simples = atoi(optarg);
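
Note on the -A help text added in core-interface-subjunc.c above: the alias file is described as a two-column, comma-delimited text file without a header, with the annotation chromosome name first and the index chromosome name second, both case sensitive. The sketch below shows one way a single line of such a file could be split. `split_alias_line` is a hypothetical helper written for illustration; how Subread itself reads the file is not shown in this diff.

```c
#include <assert.h>
#include <string.h>

/* Split one alias line of the form "<annotation name>,<index name>" in place.
 * Returns 0 on success and -1 if the line does not have two non-empty fields.
 * Names are left untouched otherwise, so case is preserved. */
static int split_alias_line(char *line, char **annot_name, char **index_name)
{
    assert(line != NULL && annot_name != NULL && index_name != NULL);

    /* Drop a trailing newline or carriage return, if any. */
    line[strcspn(line, "\r\n")] = '\0';

    char *comma = strchr(line, ',');
    if (comma == NULL || comma == line || comma[1] == '\0')
        return -1;

    *comma = '\0';              /* terminate the first field */
    *annot_name = line;         /* column 1: name used in the annotation */
    *index_name = comma + 1;    /* column 2: name used in the index */
    return 0;
}
```

For example, the line "chr1,1" would yield the annotation name "chr1" and the index name "1", while a line with no comma would be rejected.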
diff --git a/src/core-junction.c b/src/core-junction.c
index d2fb9a4..7825dbf 100644
--- a/src/core-junction.c
+++ b/src/core-junction.c
@@ -94,7 +94,7 @@ typedef struct{
short split_point;
char inserted_bases;
char is_GT_AG_donors;
- char is_donor_found;
+ char is_donor_found_or_annotation;
char is_strand_jumped;
char is_break_even;
@@ -103,6 +103,13 @@ typedef struct{
} select_junction_record_t;
+void debug_show_event(global_context_t* global_context, chromosome_event_t * event){
+ char outpos1[100], outpos2[100];
+ absoffset_to_posstr(global_context, event -> event_small_side, outpos1);
+ absoffset_to_posstr(global_context, event -> event_large_side, outpos2);
+ SUBREADprintf("Event between %s and %s\n", outpos1, outpos2);
+}
+
// read_head_abs_pos is the offset of the FIRST WANTED base.
void search_events_to_front(global_context_t * global_context, thread_context_t * thread_context, explain_context_t * explain_context, char * read_text , char * qual_text, unsigned int read_head_abs_offset, short remainder_len, short sofar_matched, int suggested_movement, int do_not_jump)
@@ -142,9 +149,9 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
if(suggested_movement) move_start = suggested_movement-1;
int is_junction_scanned = 0;
- if(0 && FIXLENstrcmp("DB7DT8Q1:236:C2NGTACXX:2:1213:17842:64278", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("D00491:294:C8E5GANXX:2:2302:8191:22433", explain_context -> read_name) == 0)
{
- SUBREADprintf("EVENT MAY HAVE FRONT=%d\t%d > %d\tPAIR_NO=%llu\n\nSCAN_START=%d\n", there_are_events_in_range(event_table->appendix1, read_head_abs_offset , remainder_len ), MAX_EVENTS_IN_READ-1, explain_context -> tmp_search_sections, explain_context -> pair_number, move_start);
+ SUBREADprintf("FF REM LEN = %d , EVENT MAY HAVE FRONT=%d\t%d > %d\tPAIR_NO=%llu\n\nSCAN_START=%d\n", remainder_len, there_are_events_in_range(event_table->appendix1, read_head_abs_offset , remainder_len ), MAX_EVENTS_IN_READ-1, explain_context -> tmp_search_sections, explain_context -> pair_number, move_start);
}
if((global_context -> config.do_fusion_detection|| there_are_events_in_range(event_table->appendix1, read_head_abs_offset, remainder_len)) &&
@@ -166,7 +173,7 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
int site_events_no = search_event(global_context, event_table , event_space , potential_event_pos, event_search_method , search_types , site_events);
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0)
{
SUBREADprintf("FOUND THE EVENT FRONT:%d at %u\n", site_events_no, potential_event_pos);
if(site_events_no)
@@ -191,7 +198,7 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
int this_round_junction_scanned = 0;
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0)
SUBREADprintf("F_JUMP? match=%d / tested=%d\n", matched_bases_to_site , tested_read_pos);
//#warning "========= remove - 2000 from next line ============="
@@ -204,8 +211,11 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
continue;
}
//if(explain_context -> pair_number == 23)
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
- SUBREADprintf("F_JUMP?%d > %d %s (%u) ; SEARCH_TAG=%u , EVENT=%u,%u\n", (1+matched_bases_to_site)*10000 / tested_read_pos , 9000, read_text, tested_chro_begin, potential_event_pos , tested_event -> event_small_side, tested_event -> event_large_side);
+ if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0){
+ SUBREADprintf("F_JUMP?%d > %d %s (%u) ; SEARCH_TAG=%u\n", (1+matched_bases_to_site)*10000 / tested_read_pos , 9000, read_text, tested_chro_begin, potential_event_pos);
+ debug_show_event(global_context, tested_event);
+
+ }
// note that these two values are the index of the first wanted base.
unsigned int new_read_head_abs_offset;
@@ -219,7 +229,6 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
}
}
if( tested_event -> event_type != CHRO_EVENT_TYPE_INDEL){
- if(is_junction_scanned) continue;
this_round_junction_scanned = 1;
}
@@ -254,9 +263,9 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
int current_pure_donor_found = explain_context -> tmp_is_pure_donor_found_explain;
explain_context -> tmp_support_as_simple = tested_event -> supporting_reads;
- explain_context -> tmp_min_support_as_complex = min(tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
+ explain_context -> tmp_min_support_as_complex = min((tested_event -> is_donor_found_or_annotation & 64)?0x7fffffff:tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
explain_context -> tmp_min_unsupport = min(tested_event -> anti_supporting_reads,explain_context -> tmp_min_unsupport);
- explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found;
+ explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found_or_annotation;
if(tested_event -> event_type == CHRO_EVENT_TYPE_FUSION && tested_event -> is_strand_jumped)
explain_context -> current_is_strand_jumped = !explain_context -> current_is_strand_jumped;
@@ -266,7 +275,7 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
explain_context -> tmp_search_sections ++;
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
SUBREADprintf("FRONT_ADD_EVENT : %s , %u ~ %u , INDELLEN=%d, TEST_READ_POS=%u, RPED=%u, ABSSTART=%u\n", explain_context -> read_name, tested_event -> event_small_side, tested_event -> event_large_side, tested_event -> indel_length, tested_read_pos, explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end, new_read_head_abs_offset);
//if(explain_context -> pair_number == 23){
@@ -305,9 +314,6 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
int is_better_result = 0, is_same_best = 0;
- if(0 && FIXLENstrcmp("R_chr901_166222_12M1D88M", explain_context -> read_name) == 0)
- SUBREADprintf("TRY_REPLACE : MATCHED: BEST=%d, THIS=%d, IS_TO_BACK=%d, SECTIONS=%d, NEXT_EVENT[0]=%p, READ_LEN[0]=%d ~ %d\n", explain_context -> best_matching_bases , explain_context-> tmp_total_matched_bases, search_to_back, explain_context -> tmp_search_sections, explain_context -> tmp_search_junctions[0].event_after_section, explain_context -> tmp_search_junctions[0].read_pos_start, explain_context -> tmp_search_junctions[0].read_pos_end);
-
if(explain_context -> best_matching_bases < explain_context-> tmp_total_matched_bases)
{
is_better_result = 1;
@@ -327,6 +333,9 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
explain_context -> best_is_complex += explain_context -> tmp_search_sections;
explain_context -> second_best_matching_bases = explain_context -> best_matching_bases;
+ if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0){
+ SUBREADprintf("complexity: curr=%d, new=%d ; sections=%d\n", explain_context->best_min_support_as_complex, explain_context -> tmp_min_support_as_complex, explain_context -> tmp_search_sections );
+ }
if(explain_context -> best_is_complex > 1)
{
// is complex now!
@@ -347,7 +356,7 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
else{
if(explain_context -> tmp_min_support_as_complex >explain_context->best_min_support_as_complex){
is_better_result = 1;
- explain_context->best_min_support_as_complex =explain_context -> tmp_min_support_as_complex;
+ explain_context -> best_min_support_as_complex =explain_context -> tmp_min_support_as_complex;
explain_context -> best_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain;
explain_context -> is_currently_tie = 0;
}
@@ -393,8 +402,17 @@ void new_explain_try_replace(global_context_t* global_context, thread_context_t
}
}
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
- SUBREADprintf("TRY_REPLACE_DESICION: BETTER=%d, SAME=%d\n", is_better_result, is_same_best);
+ if(0 && FIXLENstrcmp("R010442852", explain_context -> read_name) == 0){
+ SUBREADprintf("TRY_REPLACE_DESICION TO %s: BETTER=%d, SAME=%d ; CURRENT : %d secs ; NEWBEST : %d secs\n", search_to_back?"BACK":"FRONT", is_better_result, is_same_best, search_to_back? explain_context -> result_back_junction_numbers[0]:explain_context -> result_front_junction_numbers[0] ,explain_context -> tmp_search_sections);
+ int xx1;
+ for(xx1 = 0; xx1 < explain_context -> tmp_search_sections;xx1++){
+ SUBREADprintf(" Event : %d ~ %d in read\n", explain_context -> tmp_search_junctions[xx1].read_pos_start, explain_context -> tmp_search_junctions[xx1].read_pos_end);
+ if(explain_context -> tmp_search_junctions[xx1].event_after_section){
+ SUBREADprintf(" ");
+ debug_show_event(global_context, explain_context -> tmp_search_junctions[xx1].event_after_section);
+ }
+ }
+ }
if(is_better_result)
{
@@ -465,11 +483,14 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
if(suggested_movement) move_start = read_tail_pos - suggested_movement + 1;
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("D00491:294:C8E5GANXX:2:2302:8191:22433", explain_context -> read_name) == 0)
{
- SUBREADprintf("EVENT MAY HAVE BETWEEN (%u, %u) BACK=%d\t%d > %d\tPAIR_NO=%llu\nMOVE_START=%d\n", read_tail_abs_offset - read_tail_pos, read_tail_pos , there_are_events_in_range(event_table -> appendix2, read_tail_abs_offset - read_tail_pos, read_tail_pos), MAX_EVENTS_IN_READ-1, explain_context -> tmp_search_sections, explain_context -> pair_number, move_start);
+ SUBREADprintf("BF EVENT MAY HAVE BETWEEN (%u, %u) BACK=%d\t%d > %d\tPAIR_NO=%llu\nMOVE_START=%d\n", read_tail_abs_offset - read_tail_pos, read_tail_pos , there_are_events_in_range(event_table -> appendix2, read_tail_abs_offset - read_tail_pos, read_tail_pos), MAX_EVENTS_IN_READ-1, explain_context -> tmp_search_sections, explain_context -> pair_number, move_start);
}
+ //#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEP-BKIN : %s , STT=%d, %u, %d\n", explain_context -> read_name, move_start, read_tail_abs_offset, read_tail_pos);
+
if(MAX_EVENTS_IN_READ - 1> explain_context -> tmp_search_sections && ( there_are_events_in_range(event_table -> appendix2, read_tail_abs_offset - read_tail_pos, read_tail_pos)||global_context -> config.do_fusion_detection))
for(tested_read_pos = move_start; tested_read_pos >=0;tested_read_pos --)
{
@@ -487,11 +508,10 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
int search_types = CHRO_EVENT_TYPE_INDEL | CHRO_EVENT_TYPE_JUNCTION | CHRO_EVENT_TYPE_FUSION;
int site_events_no = search_event(global_context, event_table , event_space , potential_event_pos, event_search_method , search_types, site_events);
+ //#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEP-BKIN-SR: %s at %u, FOUND=%d\n" , explain_context -> read_name,potential_event_pos,site_events_no);
- //if(explain_context -> pair_number==999999)
- //printf("BF OFFSET=%d; READ_TAIL=%d; REDGE=%u; FOUND=%d\n", tested_read_pos, read_tail_pos, potential_event_pos, site_events_no);
-
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
{
if(site_events_no) {
SUBREADprintf("FOUND THE EVENT BACK:%d at %u\t", site_events_no, potential_event_pos);
@@ -513,6 +533,8 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
int this_round_junction_scanned = 0;
+ //#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPSB-JB-%s: test %u = %d events; TEST=%d > 7000 : MA=%d; %s ; %u = %u - (%d - %d) ; LEV=%d\n", explain_context -> read_name, potential_event_pos, site_events_no, (read_tail_pos<=tested_read_pos)?(-1234):( matched_bases_to_site*10000/(read_tail_pos - tested_read_pos)) , matched_bases_to_site, read_text + tested_read_pos, potential_event_pos, read_tail_abs_offset, read_tail_pos, tested_read_pos, explain_context -> tmp_search_sections);
//#warning "========= remove - 2000 from next line ============="
if((read_tail_pos>tested_read_pos) && ( matched_bases_to_site*10000/(read_tail_pos - tested_read_pos) > 9000 - 2000 || global_context->config.maximise_sensitivity_indel) )
for(xk1 = 0; xk1 < site_events_no ; xk1++)
@@ -532,14 +554,12 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
}
}
if( tested_event -> event_type != CHRO_EVENT_TYPE_INDEL){
- if(is_junction_scanned) continue;
this_round_junction_scanned = 1;
}
if(0 && strcmp("S_chr901_565784_72M8D28M", explain_context -> read_name) == 0)
SUBREADprintf("B_JUMP?%d > %d TLEN=%d \n", (1+matched_bases_to_site)*10000 / (read_tail_pos - tested_read_pos) , 9000, read_tail_pos - tested_read_pos);
-
// note that read_tail_pos is the first unwanted base.
int new_read_tail_pos = tested_read_pos;
if(tested_event->event_type == CHRO_EVENT_TYPE_INDEL) new_read_tail_pos += min(0, tested_event -> indel_length);
@@ -567,7 +587,7 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end = tested_read_pos + min(0, tested_event->indel_length) - tested_event -> indel_at_junction;
explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].abs_offset_for_start = new_read_tail_abs_offset;
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) == 0)
+ if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
SUBREADprintf("BACK_ADD_EVENT : %s , %u ~ %u , INDELLEN=%d, TEST_READ_POS=%u, RPED=%u, ABSSTART=%u\n", explain_context -> read_name, tested_event -> event_small_side, tested_event -> event_large_side, tested_event -> indel_length, tested_read_pos, explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end, new_read_tail_abs_offset);
if(tested_event->event_type == CHRO_EVENT_TYPE_FUSION) jump_penalty = 2;
@@ -580,9 +600,10 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
int current_pure_donor_found = explain_context -> tmp_is_pure_donor_found_explain;
explain_context -> tmp_support_as_simple = tested_event -> supporting_reads;
- explain_context -> tmp_min_support_as_complex = min(tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
+ //explain_context -> tmp_min_support_as_complex = min(tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
+ explain_context -> tmp_min_support_as_complex = min((tested_event -> is_donor_found_or_annotation & 64)?0x7fffffff:tested_event -> supporting_reads,explain_context -> tmp_min_support_as_complex);
explain_context -> tmp_min_unsupport = min(tested_event -> anti_supporting_reads,explain_context -> tmp_min_unsupport);
- explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found;
+ explain_context -> tmp_is_pure_donor_found_explain = explain_context -> tmp_is_pure_donor_found_explain && tested_event -> is_donor_found_or_annotation;
if(tested_event -> event_type == CHRO_EVENT_TYPE_FUSION && tested_event -> is_strand_jumped)
explain_context -> current_is_strand_jumped = !explain_context -> current_is_strand_jumped;
@@ -593,7 +614,12 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
explain_context -> tmp_search_sections ++;
//printf("SUGGEST_PREV at %u = %d (! %d)\n", tested_event -> event_small_side, tested_event -> connected_previous_event_distance, tested_event -> connected_next_event_distance);
+ //#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPSB-JB-%s: %u IN -> %u; NEW_TAIL=%d; ENV_CONN=%d; LEV=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, new_read_tail_pos, tested_event -> connected_previous_event_distance, explain_context -> tmp_search_sections);
+
search_events_to_back(global_context, thread_context, explain_context, read_text , qual_text, new_read_tail_abs_offset , new_read_tail_pos, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_previous_event_distance, 0);
+ //#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPSB-JB-%s: %u OUT <- %u; LEN=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, explain_context -> tmp_search_sections);
explain_context -> tmp_search_sections --;
//if(explain_context->pair_number == 999999)
@@ -658,12 +684,6 @@ int init_junction_thread_contexts(global_context_t * global_context, thread_cont
{
return 0;
}
-int finalise_junction_thread(global_context_t * global_context, thread_context_t * thread_context, int task)
-{
-
- return 0;
-}
-
void insert_big_margin_record(global_context_t * global_context , unsigned short * big_margin_record, unsigned char votes, short read_pos_start, short read_pos_end, int read_len, int is_negative)
{
@@ -988,7 +1008,7 @@ void copy_vote_to_alignment_res(global_context_t * global_context, thread_contex
junc_res, abs32uint(current_vote -> pos[vote_i][vote_j] - junc_res -> minor_position), current_vote -> votes[i][j], current_vote -> coverage_end[i][j] - current_vote -> coverage_start[i][j],
abs32uint(current_vote -> pos[vote_i][vote_j] - current_vote -> pos[i][j]));
- int replace_minor = 0, minor_indel_offset = 0, inserted_bases = 0, is_GT_AG_donors = 0, is_donor_found = 0, final_split_point = 0, major_indels = 0, small_side_increasing_coordinate = 0, large_side_increasing_coordinate = 0;
+ int replace_minor = 0, minor_indel_offset = 0, inserted_bases = 0, is_GT_AG_donors = 0, is_donor_found_or_annotation = 0, final_split_point = 0, major_indels = 0, small_side_increasing_coordinate = 0, large_side_increasing_coordinate = 0;
if(0 && FIXLENstrcmp("R002403247", read_name) == 0)
{
@@ -1038,7 +1058,7 @@ void copy_vote_to_alignment_res(global_context_t * global_context, thread_contex
replace_minor = donor_jumped_score(global_context, thread_context, small_half_abs_offset, large_half_abs_offset,
max(0, guess_start_as_reversed) , min( guess_end_as_reversed, curr_read_len), curr_read_text,
curr_read_len, is_small_half_negative, is_large_half_negative, is_small_half_on_left_as_reversed,
- & final_split_point, & is_GT_AG_donors, & is_donor_found, &small_side_increasing_coordinate, &large_side_increasing_coordinate);
+ & final_split_point, & is_GT_AG_donors, & is_donor_found_or_annotation, &small_side_increasing_coordinate, &large_side_increasing_coordinate);
if( 0 && 1018082 == pair_number)
{
@@ -1061,7 +1081,7 @@ void copy_vote_to_alignment_res(global_context_t * global_context, thread_contex
else
overlapped = current_vote -> coverage_end[vote_i][vote_j] - current_vote -> coverage_start[i][j];
- if(0 && FIXLENstrcmp("R000002444", read_name) == 0)
+ if(0 && FIXLENstrcmp("R000404427", read_name) == 0)
{
SUBREADprintf("OVL=%d, DIST=%llu\n", overlapped, abs(dist));
}
@@ -1122,7 +1142,7 @@ void copy_vote_to_alignment_res(global_context_t * global_context, thread_contex
current_vote -> pos[i][j]),max(current_vote -> pos[vote_i][vote_j] ,
current_vote -> pos[i][j]), left_indel_offset, right_indel_offset, normally_arranged,
max(0, guess_start), min( guess_end, curr_read_len), curr_read_text, curr_read_len,
- & final_split_point, & is_GT_AG_donors, & is_donor_found, & inserted_bases, &small_side_increasing_coordinate, &large_side_increasing_coordinate, read_name);
+ & final_split_point, & is_GT_AG_donors, & is_donor_found_or_annotation, & inserted_bases, &small_side_increasing_coordinate, &large_side_increasing_coordinate, read_name);
// Now "final_split_point" is the read offset on the 'reversed' form of the read (I.e., the reversed FASTQ form for read_A and the raw FASTQ form for read_B.) if do_fusion_detection AND if the main half is on negative strand.
// However, because the final_split_point is ALWAYS on the form where the major half can be mapped, final_split_point will never be changed.
@@ -1167,7 +1187,7 @@ void copy_vote_to_alignment_res(global_context_t * global_context, thread_contex
junc_res -> indel_at_junction = inserted_bases;
align_res -> result_flags &=~0x3;
- if( (!is_donor_found) || is_GT_AG_donors > 2) align_res -> result_flags |= 3;
+ if( (!is_donor_found_or_annotation) || is_GT_AG_donors > 2) align_res -> result_flags |= 3;
else align_res -> result_flags = is_GT_AG_donors? (align_res -> result_flags|CORE_IS_GT_AG_DONORS):(align_res -> result_flags &~CORE_IS_GT_AG_DONORS);
align_res -> result_flags = is_strand_jumpped? (align_res -> result_flags|CORE_IS_STRAND_JUMPED):(align_res -> result_flags &~CORE_IS_STRAND_JUMPED);
@@ -1644,7 +1664,7 @@ void simple_add_junction( global_context_t * global_context, thread_context_t *
new_event -> supporting_reads = 1;
new_event -> indel_length = 0;
new_event -> indel_at_junction = indel_at_junction;
- new_event -> is_donor_found = 1;
+ new_event -> is_donor_found_or_annotation = 1;
new_event -> small_side_increasing_coordinate = 0;
new_event -> large_side_increasing_coordinate = 1;
put_new_event(event_table, new_event , event_no);
@@ -2143,13 +2163,8 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
if(third_k == 0 && current_vote->votes[i][j] >= third_highest_votes [is_second_read][global_context -> config.top_scores - 1])
{
- if(0 && memcmp("V0112_0155:7:1101:2293:2015", read_name_1, 26) == 0)
- {
- char posout[100];
- absoffset_to_posstr(global_context, current_vote -> pos[i][j], posout);
-
- SUBREADprintf("[%s] INSERT BIG_MARGIN AT %s: COV=%d ~ %d ; V = %d\n", read_name_1, posout, current_vote -> coverage_start[i][j], current_vote -> coverage_end[i][j] , current_vote -> votes[i][j]);
- }
+ //#warning ">>>>>> COMMENT THIS <<<<<<<<"
+ //SUBREADprintf("OCT27-STEP0-%s-POS%u-COV=%d.%d-V%d\n", read_name_1, current_vote -> pos[i][j], current_vote -> coverage_start[i][j], current_vote -> coverage_end[i][j] , current_vote -> votes[i][j]);
insert_big_margin_record(global_context , _global_retrieve_big_margin_ptr(global_context,pair_number, is_second_read), current_vote -> votes[i][j], current_vote -> coverage_start[i][j], current_vote -> coverage_end[i][j] , current_read_len, (current_vote -> masks[i][j] & IS_NEGATIVE_STRAND)?1:0);
@@ -2159,6 +2174,7 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
current_simple[current_simple_number].is_vote_t_item = 1;
current_simple[current_simple_number].item_index_i = i;
current_simple[current_simple_number].item_index_j = j;
+ current_simple[current_simple_number].read_start_base = current_vote -> coverage_start[i][j];
current_simple[current_simple_number].mapping_position = current_vote -> pos[i][j];
current_simple[current_simple_number].major_half_votes = current_vote -> votes[i][j];
@@ -2180,6 +2196,7 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
current_simple[current_simple_number].item_index_i = i;
current_simple[current_simple_number].mapping_position = old_result -> selected_position;
current_simple[current_simple_number].major_half_votes = old_result -> selected_votes;
+ current_simple[current_simple_number].read_start_base = old_result -> confident_coverage_start;
current_simple_number ++;
}
@@ -2198,21 +2215,33 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
for(i = 0; i < simple_record_numbers[0]; i++){
for(j = 0; j < simple_record_numbers[1]; j++){
int target_index;
- int is_PE_distance = 0, is_same_chromosome = 0;
-
- if(0 && FIXLENstrcmp("R006633992", read_name_1)==0)
- SUBREADprintf("TOPK #%d-%d : %d, %d < %d, PE=%d %u ~ %u\n", i,j, vote_simple_1_buffer[i].major_half_votes, vote_simple_2_buffer[j].major_half_votes, global_context->config.minimum_subread_for_first_read, is_PE_distance, ( vote_simple_1_buffer+i )->mapping_position , (vote_simple_2_buffer+j) ->mapping_position);
+ int is_PE_distance = 0, is_same_chromosome = 0, is_both_exonic_regions = 0;
if(max(vote_simple_1_buffer[i].major_half_votes, vote_simple_2_buffer[j].major_half_votes) < global_context->config.minimum_subread_for_first_read)continue;
simple_PE_and_same_chro(global_context , vote_simple_1_buffer+i, vote_simple_2_buffer+j , &is_PE_distance, &is_same_chromosome , read_len_1, read_len_2);
if((!is_PE_distance) && min(vote_simple_1_buffer[i].major_half_votes, vote_simple_2_buffer[j].major_half_votes) < global_context->config.minimum_subread_for_first_read)continue;
+ if( global_context -> exonic_region_bitmap && is_same_chromosome)is_both_exonic_regions = is_pos_in_annotated_exon_regions(global_context, vote_simple_1_buffer[i].mapping_position + vote_simple_1_buffer[i].read_start_base ) && is_pos_in_annotated_exon_regions(global_context, vote_simple_2_buffer[j].mapping_position + vote_simple_2_buffer[j].read_start_base ) ;
+ if(0 && strcmp(read_name_1, "R010442852")==0)
+ SUBREADprintf("KNOWN_EXONS: %s: %u + %u ~ %u + %u : %s\n", read_name_1, vote_simple_1_buffer[i].mapping_position , vote_simple_1_buffer[i].read_start_base , vote_simple_2_buffer[j].mapping_position , vote_simple_2_buffer[j].read_start_base, is_both_exonic_regions?"YES":"NO");
//#warning " ============== USE THE FIRST WEIGHT FORMULA IN RELEASE ================ "
//#warning " ============== USE THE SECOND WEIGHT FORMULA FOR SVs GRANT APP ======== "
- int adjusted_weight = is_PE_distance?1300:(is_same_chromosome?1000:800);
- if(global_context -> config.PE_predominant_weight) adjusted_weight = is_PE_distance?13000:(is_same_chromosome?100:80);
- //int adjusted_weight = is_PE_distance?1600:(is_same_chromosome?1000:500);
+ int adjusted_weight;
+
+ if(1){
+ if (is_both_exonic_regions && is_PE_distance) adjusted_weight = 1800;
+ else if(is_both_exonic_regions) adjusted_weight = 1300;
+ else if(is_PE_distance) adjusted_weight = 1300;
+ else if(is_same_chromosome) adjusted_weight = 1000;
+ else adjusted_weight = 800;
+ }else{
+ if (is_both_exonic_regions) adjusted_weight = 1300;
+ else if(is_PE_distance) adjusted_weight = 1300;
+ else if(is_same_chromosome) adjusted_weight = 1000;
+ else adjusted_weight = 800;
+ }
+ //int adjusted_weight = is_PE_distance?1600:(is_same_chromosome?1000:500);
int adjusted_votes = (vote_simple_1_buffer[i].major_half_votes + vote_simple_2_buffer[j].major_half_votes) * adjusted_weight;
for(target_index=0; target_index<used_comb_buffer; target_index++){
@@ -2233,10 +2262,6 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
if(used_comb_buffer < global_context -> config.max_vote_combinations)
used_comb_buffer ++;
-
- if(0 && FIXLENstrcmp("R005717815", read_name_1)==0)
- SUBREADprintf("Vadj [%d][%d] = %d (raw = %d + %d), PE=%d, Target=%d/%d\n", i,j , adjusted_votes, vote_simple_1_buffer[i].major_half_votes, vote_simple_2_buffer[j].major_half_votes, is_PE_distance, target_index, used_comb_buffer);
-
}
}
@@ -2260,17 +2285,7 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
int alignment_res_r1_cursor = 0, alignment_res_r2_cursor = 0;
if(used_comb_buffer > 0){
- //sort the comb buffers.
-
- //quick_sort(comb_buffer, used_comb_buffer, comb_sort_compare, comb_sort_exchange);
merge_sort(comb_buffer, used_comb_buffer, comb_sort_compare, comb_sort_exchange, comb_sort_merge);
-
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:19612:13380", read_name_1)==0)
- for(i = 0; i < used_comb_buffer; i++)
- {
- SUBREADprintf("C[%d], SCORE = %llu ; VOTES = %d + %d\n", i, comb_buffer[i].score_adj, comb_buffer[i].r1_loc -> major_half_votes, comb_buffer[i].r2_loc -> major_half_votes);
- }
-
for(is_second_read = 0; is_second_read < 1 + global_context -> input_reads.is_paired_end_reads; is_second_read++){
int current_read_len = is_second_read ? read_len_2:read_len_1;
char * current_read_text = is_second_read ? read_text_2:read_text_1;
@@ -2300,12 +2315,7 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
}
}
- if(0 && memcmp("HWI-ST212:219:C0C1TACXX:1:1107:20025:113054", read_name_1, 41)==0){
- SUBREADprintf("%s %s : Read_%d ; BEST=%d / %d, %u\n", is_exist?" ":"NEW", read_name_1 , is_second_read + 1 , *current_r_cursor , global_context->config.multi_best_reads, current_loc->mapping_position);
- }
-
if(!is_exist){
- //SUBREADprintf("%u\tC_i=%d, C_j=%d, IS_VOTE=%d, Vadj=%llu\n", pair_number, current_loc -> item_index_i, current_loc -> item_index_j, current_loc -> is_vote_t_item, comb_buffer[i].score_adj);
if(current_loc -> is_vote_t_item)
copy_vote_to_alignment_res(global_context, thread_context, current_alignment_tmp + (*current_r_cursor), current_junction_tmp ? current_junction_tmp + (*current_r_cursor) : NULL, current_vote, current_loc -> item_index_i, current_loc -> item_index_j, current_read_len, read_name_1, current_read_text, current_all_subreads , current_vote -> noninformative_subreads, pair_number, is_second_read, is_fully_covered);
else{
@@ -2372,12 +2382,6 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
memcpy(current_junction_tmp + (*current_r_cursor), _global_retrieve_subjunc_ptr(global_context, pair_number, is_second_read, current_loc -> item_index_i), sizeof(subjunc_result_t));
}
- if(0)
- {
- char posout[100];
- absoffset_to_posstr(global_context, current_alignment_tmp[*current_r_cursor] . selected_position, posout);
- SUBREADprintf("The %d-th %s is at %s; vote=%d, minor=%d\n", *current_r_cursor, read_name_1, posout, current_alignment_tmp[*current_r_cursor].selected_votes, current_junction_tmp[*current_r_cursor].minor_votes);
- }
(*current_r_cursor)++;
}
}
@@ -2385,8 +2389,6 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
}
}
- //SUBREADprintf("TOPK : CANDIDATES = %d , %d\n", alignment_res_r1_cursor, alignment_res_r2_cursor);
-
for(is_second_read = 0; is_second_read < 1 + global_context -> input_reads.is_paired_end_reads; is_second_read++){
int * current_r_cursor = is_second_read ? &alignment_res_r2_cursor:&alignment_res_r1_cursor;
if((*current_r_cursor) > global_context->config.multi_best_reads){
@@ -2406,23 +2408,13 @@ int process_voting_junction_PE_topK(global_context_t * global_context, thread_co
for(i = 0; i < global_context->config.multi_best_reads ; i++){
mapping_result_t * cur_res = _global_retrieve_alignment_ptr(global_context, pair_number, is_second_read, i);
if( i < (*current_r_cursor))
- {
- // checked:boundary
memcpy(cur_res, current_alignment_tmp + i, sizeof(mapping_result_t));
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:19612:13380", read_name_1)==0)
- SUBREADprintf("COPIED READ_%d\t\t%llu [%d] , V=%d, MASK=%d, POS=%u, PTR=%p\n", is_second_read + 1, pair_number, *current_r_cursor, cur_res -> selected_votes, cur_res -> result_flags, current_alignment_tmp[i].selected_position, cur_res);
- }
else cur_res -> selected_votes = 0;
if(global_context -> config.do_breakpoint_detection) {
subjunc_result_t * cur_junc = _global_retrieve_subjunc_ptr(global_context, pair_number, is_second_read, i);
if(i < (*current_r_cursor))
- {
- // checked:boundary
memcpy(cur_junc, current_junction_tmp + i , sizeof(subjunc_result_t));
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:19612:13380", read_name_1)==0)
- SUBREADprintf("COPIED SUBJUNC: MINOR=%u, MINORVOTES=%d\n", (current_junction_tmp + i) -> minor_position, (current_junction_tmp + i) -> minor_votes);
- }
else cur_junc -> minor_votes = 0;
}
@@ -2811,7 +2803,7 @@ int find_soft_clipping(global_context_t * global_context, thread_context_t * th
// read_head_abs_offset is the first WANTED base in read.
// If the first section in read is reversed, read_head_abs_offset is the LAST WANTED bases in this section. (the abs offset of the first base in the section is actually larger than read_head_abs_offset)
-int final_CIGAR_quality(global_context_t * global_context, thread_context_t * thread_context, char * read_text, char * qual_text, int read_len, char * cigar_string, unsigned long read_head_abs_offset, int is_read_head_reversed, int * mismatched_bases, int covered_start, int covered_end, char * read_name, int * non_clipped_length, int *total_indel_length, int * matched_bases, int * chromosomal_length)
+int final_CIGAR_quality(global_context_t * global_context, thread_context_t * thread_context, char * read_text, char * qual_text, int read_len, char * cigar_string, unsigned long read_head_abs_offset, int is_read_head_reversed, int * mismatched_bases, int covered_start, int covered_end, char * read_name, int * non_clipped_length, int *total_indel_length, int * matched_bases, int * chromosomal_length, int * full_section_clipped)
{
int cigar_cursor = 0;
int read_cursor = 0;
@@ -2866,13 +2858,17 @@ int final_CIGAR_quality(global_context_t * global_context, thread_context_t * th
}
else
head_soft_clipped = find_soft_clipping(global_context, thread_context, current_value_index, read_text, current_perfect_section_abs, tmp_int, 0, adj_coverage_start);
- if(0&& memcmp(read_name, TTTSNAME, 26)==0)
- debug_clipping(global_context, thread_context, current_value_index, debug_ptr, current_perfect_section_abs, tmp_int, 0, adj_coverage_start, head_soft_clipped, read_name);
-
- if(head_soft_clipped == tmp_int) head_soft_clipped = 0;
+ if(head_soft_clipped == tmp_int){
+ (*full_section_clipped) = 1;
+ head_soft_clipped = 0;
+ }
else has_clipping_this_section_head = 1;
+ if(has_clipping_this_section_head){
+ if( tmp_int - head_soft_clipped < 3 && head_soft_clipped > 1 ) (*full_section_clipped) = 1;
+ }
+
if(reversed_first_section_text)
free(reversed_first_section_text);
reversed_first_section_text = NULL;
@@ -2894,11 +2890,15 @@ int final_CIGAR_quality(global_context_t * global_context, thread_context_t * th
else
tail_soft_clipped = find_soft_clipping(global_context, thread_context, current_value_index, read_text + read_cursor, current_perfect_section_abs, tmp_int, 1, adj_coverage_end);
- if(0 && memcmp(read_name, TTTSNAME, 26)==0)
- debug_clipping(global_context, thread_context, current_value_index, debug_ptr, current_perfect_section_abs, tmp_int, !current_reversed, adj_coverage_end , tail_soft_clipped, read_name);
+ if(tail_soft_clipped == tmp_int){
+ tail_soft_clipped = 0;
+ if(full_section_clipped)(*full_section_clipped) = 1;
+ } else has_clipping_this_section_tail = 1;
+
+ if( has_clipping_this_section_tail ){
+ if(tmp_int - tail_soft_clipped < 3 && tail_soft_clipped > 1) (*full_section_clipped) = 1;
+ }
- if(tail_soft_clipped == tmp_int) tail_soft_clipped = 0;
- else has_clipping_this_section_tail = 1;
if(reversed_first_section_text)
free(reversed_first_section_text);
}
@@ -3058,10 +3058,9 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
result -> result_flags &= ~CORE_IS_PAIRED_END;
//SUBREADprintf("FINAL_CIGAR R1 %d[%d] = %p, FLAGS=%d\n", explain_context -> pair_number , explain_context-> best_read_id , result , result -> result_flags);
- tmp_cigar[0]=0;
- tmp_cigar_exonic[0]=0;
// reverse the back_search result for every equally best alignment
//
+
for(back_i = 0; back_i < explain_context -> all_back_alignments; back_i++){
if( explain_context -> result_back_junction_numbers[back_i] > MAX_EVENTS_IN_READ ){
SUBREADprintf("ERROR: Too many cigar sections: %d > %d\n", explain_context -> result_back_junction_numbers[back_i] , MAX_EVENTS_IN_READ);
@@ -3109,23 +3108,10 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
for(front_i = 0; front_i < explain_context -> all_front_alignments; front_i++){
if(final_alignment_number >= MAX_ALIGNMENT_PER_ANCHOR)break;
-
- if(0 && FIXLENstrcmp("DB7DT8Q1:236:C2NGTACXX:2:1213:17842:64278",explain_context->read_name ) == 0){
- SUBREADprintf("For the %d-th front search result set and the %d-th back search result set, there are %d + %d - 1 = %d sections in the read\nmapped location = %u\n", front_i, back_i, explain_context -> result_back_junction_numbers[back_i] , explain_context -> result_front_junction_numbers[front_i] , explain_context -> result_back_junction_numbers[back_i] + explain_context -> result_front_junction_numbers[front_i] -1, result -> selected_position);
-
- for(xk1 = 0; xk1 < explain_context -> result_back_junction_numbers[back_i] + explain_context -> result_front_junction_numbers[front_i]; xk1++)
- {
- perfect_section_in_read_t * current_section;
- int is_front_search = 0;
- if(xk1 >= explain_context -> result_back_junction_numbers[back_i]) {
- current_section = &explain_context -> result_front_junctions[front_i][xk1 - explain_context -> result_back_junction_numbers[back_i]];
- is_front_search = 1;
- } else {
- current_section = &explain_context -> result_back_junctions[back_i][xk1];
- }
- SUBREADprintf(" The %d-th section ( %d long ) has next event being %p\n", xk1, current_section -> read_pos_end - current_section -> read_pos_start , current_section -> event_after_section);
- }
- }
+ to_be_supported_count = 0;
+ tmp_cigar[0]=0;
+ tmp_cigar_exonic[0]=0;
+ int known_junction_supp = 0;
for(xk1 = 0; xk1 < explain_context -> result_back_junction_numbers[back_i] + explain_context -> result_front_junction_numbers[front_i] -1; xk1++)
{
@@ -3166,12 +3152,8 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
if(event_after)
{
if(event_after -> event_type == CHRO_EVENT_TYPE_INDEL)
- {
- if(0 && FIXLENstrcmp("R000002444", explain_context -> read_name) ==0){
- SUBREADprintf("Get INDEL from the %d-th mapped section (back=%d, front=%d) ; event_pntr=%p, section_mapped_len=%d (start=%d, end=%d)\n", xk1, explain_context -> result_back_junction_numbers[back_i] , explain_context -> result_front_junction_numbers[front_i] , event_after, read_pos_end - read_pos_start, read_pos_start, read_pos_end);
- }
sprintf(piece_cigar+strlen(piece_cigar), "%d%c", abs(event_after->indel_length), event_after->indel_length>0?'D':'I');
- } else if(event_after -> event_type == CHRO_EVENT_TYPE_JUNCTION||event_after -> event_type == CHRO_EVENT_TYPE_FUSION) {
+ else if(event_after -> event_type == CHRO_EVENT_TYPE_JUNCTION||event_after -> event_type == CHRO_EVENT_TYPE_FUSION) {
// the distance in CIGAR is the NEXT UNWANTED BASE of piece#1 to the FIRST WANTED BASE in piece#2
int delta_one ;
if(current_section -> is_strand_jumped + current_section -> is_connected_to_large_side == 1) delta_one = 1;
@@ -3211,17 +3193,11 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
if(event_after -> is_strand_jumped) jump_mode = tolower(jump_mode);
fusions_in_read += (event_after -> event_type == CHRO_EVENT_TYPE_FUSION);
-
- //if(event_after -> event_large_side + delta_one < event_after -> event_small_side)
- // SUBREADprintf("%s CONNECT_TO_LARGE : %d REV ENV: %u ~ %u: %s, DELTA=%d, MOVE_LEN=%d, READ=%s JUMP: CUR=%d, AFT=%d\n", is_front_search?"FRONT_SEARCH":"BACK_SEARCH", current_section -> is_connected_to_large_side, event_after -> event_small_side , event_after -> event_large_side, explain_context -> read_name, delta_one, event_after -> event_large_side - event_after -> event_small_side + delta_one, explain_context -> read_name, current_section -> is_strand_jumped, event_after -> [...]
-
sprintf(piece_cigar+strlen(piece_cigar), "%u%c", (int)movement, jump_mode);
-
- //if(event_after -> event_large_side + delta_one < event_after -> event_small_side)
- // SUBREADprintf("PART CIGAR=%s\n" , piece_cigar);
if(event_after -> indel_at_junction) sprintf(piece_cigar+strlen(piece_cigar), "%dI", event_after -> indel_at_junction);
is_junction_read ++;
+ if(event_after -> is_donor_found_or_annotation & 64 ) known_junction_supp ++;
}
to_be_supported[to_be_supported_count++] = event_after;
}
@@ -3234,6 +3210,8 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
int mismatch_bases = 0, isCigarOK = 0;
+ //#warning ">>>>>>>>>>>>>>>> COMMENT NEXT LINE <<<<<<<<<<<<<<<<<<<<<<<"
+ //SUBREADprintf("ReadDebug:%s\t%s\n", explain_context -> read_name , tmp_cigar);
if(is_cigar_overflow) sprintf(tmp_cigar, "%dM", explain_context -> full_read_len);
unsigned int final_position;
@@ -3252,13 +3230,14 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
- int final_qual = 0, applied_mismatch = 0, non_clipped_length = 0, total_indel_length = 0, total_coverage_length = 0, final_MATCH = 0, chromosomal_length = 0;
+ int final_qual = 0, applied_mismatch = 0, non_clipped_length = 0, total_indel_length = 0, total_coverage_length = 0, final_MATCH = 0, chromosomal_length = 0, full_section_clipped = 0;
if(is_exonic_read_fraction_OK)
{
total_coverage_length = result -> confident_coverage_end - result -> confident_coverage_start;
- final_qual = final_CIGAR_quality(global_context, thread_context, explain_context -> full_read_text, explain_context -> full_qual_text, explain_context -> full_read_len , tmp_cigar, final_position, is_first_section_negative != ((result->result_flags & CORE_IS_NEGATIVE_STRAND)?1:0), &mismatch_bases, result -> confident_coverage_start, result -> confident_coverage_end, explain_context -> read_name, &non_clipped_length, &total_indel_length, & final_MATCH, & chromosomal_length);
- //if(mismatch_bases<99) SUBREADprintf("CIGAR=%s, CHRLEN=%d\n", tmp_cigar, chromosomal_length);
+ final_qual = final_CIGAR_quality(global_context, thread_context, explain_context -> full_read_text, explain_context -> full_qual_text, explain_context -> full_read_len , tmp_cigar, final_position, is_first_section_negative != ((result->result_flags & CORE_IS_NEGATIVE_STRAND)?1:0), &mismatch_bases, result -> confident_coverage_start, result -> confident_coverage_end, explain_context -> read_name, &non_clipped_length, &total_indel_length, & final_MATCH, & chromosomal_length, & full_s [...]
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-STEP2-%s:%d-POS%u-VOT%d-CIG-%s [ %d ]-INDELs=%llu; M/MM=%d,%d\n", explain_context -> read_name, explain_context -> is_second_read + 1, result -> selected_position, result -> selected_votes, tmp_cigar, is_cigar_overflow, ((indel_thread_context_t *)thread_context -> module_thread_contexts[MODULE_INDEL_ID]) -> event_entry_table -> numOfElements, final_MATCH, mismatch_bases);
applied_mismatch = is_junction_read? global_context->config.max_mismatch_junction_reads:global_context->config.max_mismatch_exonic_reads ;
@@ -3269,16 +3248,18 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
}
- //#warning " ========== COMMENT THIS LINE !! ========="
//if(explain_context -> pair_number == 999999)
// ACDB PVDB TTTS
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:19274:15465", explain_context -> read_name) ==0)
- SUBREADprintf("FINALQUAL %s : FINAL_POS=%u\tCIGAR=%s\tMM=%d > %d?\tVOTE=%d > %0.2f x %d ? MASK=%d\tQUAL=%d\tBRNO=%d\n\n", explain_context -> read_name, final_position , tmp_cigar, mismatch_bases, applied_mismatch, result -> selected_votes, global_context -> config.minimum_exonic_subread_fraction,result-> used_subreads_in_vote, result->result_flags, final_qual, explain_context -> best_read_id);
+ //#warning " ========== COMMENT THIS LINE !! ========="
+ if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) ==0){
+ char outpos1[100];
+ absoffset_to_posstr(global_context, final_position, outpos1);
+ SUBREADprintf("FINALQUAL %s : FINAL_POS=%s ( %u )\tCIGAR=%s\tMM=%d / MAPLEN=%d > %d?\tVOTE=%d > %0.2f x %d ? MASK=%d\tQUAL=%d\tBRNO=%d\nKNOWN_JUNCS=%d\n\n", explain_context -> read_name, outpos1 , final_position , tmp_cigar, mismatch_bases, non_clipped_length, applied_mismatch, result -> selected_votes, global_context -> config.minimum_exonic_subread_fraction,result-> used_subreads_in_vote, result->result_flags, final_qual, explain_context -> best_read_id, known_junction_supp);
+ }
- if( mismatch_bases <= applied_mismatch && is_exonic_read_fraction_OK && fusions_in_read < 2)
- {
+ if(mismatch_bases <= applied_mismatch && is_exonic_read_fraction_OK && fusions_in_read < 2 ){// && (0 == full_section_clipped || 0 == global_context -> config.do_breakpoint_detection)) {
realignment_result_t * realign_res = final_realignments+final_alignment_number;
final_alignment_number ++;
@@ -3286,6 +3267,7 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
realign_res -> first_base_is_jumpped = 0;
realign_res -> mapping_result = result;
realign_res -> chromosomal_length = chromosomal_length;
+ realign_res -> known_junction_supp = known_junction_supp;
if(mismatch_bases > applied_mismatch ) realign_res -> realign_flags |= CORE_TOO_MANY_MISMATCHES;
else realign_res -> realign_flags &= ~CORE_TOO_MANY_MISMATCHES;
@@ -3299,28 +3281,24 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
if(1)
{
- // commit the change to the chromosome_events
-
int is_RNA_from_positive = -1;
-
unsigned long long read_id = 2llu * explain_context -> pair_number + explain_context->is_second_read;
for(xk1= 0; xk1 < to_be_supported_count; xk1++)
{
if(xk1 >= MAX_EVENTS_IN_READ) break;
- if(0 && strcmp( explain_context -> read_name, "ERR161544.68584")==0)
- SUBREADprintf("%s RELATED_EVENT= EVENT_NO_%d\n", explain_context -> read_name , to_be_supported[xk1] -> global_event_id);
+
if(to_be_supported [xk1] -> event_type !=CHRO_EVENT_TYPE_INDEL && is_junction_read){
- if(to_be_supported [xk1] -> event_type == CHRO_EVENT_TYPE_JUNCTION && to_be_supported [xk1] -> is_donor_found && is_RNA_from_positive == -1)
+ if(to_be_supported [xk1] -> event_type == CHRO_EVENT_TYPE_JUNCTION && to_be_supported [xk1] -> is_donor_found_or_annotation && is_RNA_from_positive == -1)
is_RNA_from_positive = !(to_be_supported [xk1] -> is_negative_strand);
}
+
+ //final counts are added in function "add_realignment_event_support" in core.c
+
realign_res -> supporting_chromosome_events[xk1] = to_be_supported[xk1];
realign_res -> flanking_size_left[xk1] = flanking_size_left[xk1];
realign_res -> flanking_size_right[xk1] = flanking_size_right[xk1];
realign_res -> crirical_support[xk1] += (read_id == to_be_supported [xk1] -> critical_read_id);
- //if(flanking_size_left[xk1]>=16 && flanking_size_right[xk1]>=16) realign_res -> crirical_support[xk1]++;
- //SUBREADprintf("CRITICAL=%llu, THIS=%llu\n", read_id, to_be_supported [xk1] -> critical_read_id);
- //if(read_id == to_be_supported [xk1] -> critical_read_id) realign_res -> crirical_support[] = // to_be_supported [xk1] -> critical_supporting_reads ++;
}
if(to_be_supported_count < MAX_EVENTS_IN_READ )
realign_res -> supporting_chromosome_events[to_be_supported_count] = NULL;
@@ -3328,10 +3306,6 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
result -> result_flags |= CORE_IS_FULLY_EXPLAINED;
result -> read_length = explain_context->full_read_len;
- //if(explain_context -> pair_number < 20)
- // SUBREADprintf("RESULT %d at %p : FLAGS=%d\n", explain_context -> pair_number, result, result -> result_flags);
-
-
if(is_RNA_from_positive == -1)
{
realign_res -> realign_flags |= CORE_NOTFOUND_DONORS ;
@@ -3348,10 +3322,6 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
isCigarOK=1;
}
- //final_MATCH = non_clipped_length - mismatch_bases;
- //if(final_MATCH > 0);
- //else printf("CIGAR COMPRESSION ERROR : %s by %s\n", tmp_cigar, explain_context -> read_name);
-
realign_res -> first_base_position = final_position;
realign_res -> final_quality = final_qual;
realign_res -> final_mismatched_bases = mismatch_bases;
@@ -3362,8 +3332,6 @@ unsigned int finalise_explain_CIGAR(global_context_t * global_context, thread_co
}
}
- //SUBREADprintf("L2MM = %d\n", final_MATCH);
- //return final_MATCH * 10000 - total_indel_length;
return final_alignment_number;
}
@@ -3492,7 +3460,7 @@ int is_ambiguous_voting(global_context_t * global_context, subread_read_number_t
// Note that the read_text is on reversed mode. The guess points are on reversed mode too.
// "Left" and "Right" means the left/right half in the "reversed" read.
-int donor_jumped_score(global_context_t * global_context, thread_context_t * thread_context, unsigned int small_virtualHead_abs_offset, unsigned int large_virtualHead_abs_offset, int guess_start, int guess_end, char * read_text, int read_len, int is_small_half_negative, int is_large_half_negative, int small_half_on_left_reversed, int * final_split_point, int * is_GT_AG_strand, int * is_donor_found, int * small_side_increasing_coordinate, int * large_side_increasing_coordinate)
+int donor_jumped_score(global_context_t * global_context, thread_context_t * thread_context, unsigned int small_virtualHead_abs_offset, unsigned int large_virtualHead_abs_offset, int guess_start, int guess_end, char * read_text, int read_len, int is_small_half_negative, int is_large_half_negative, int small_half_on_left_reversed, int * final_split_point, int * is_GT_AG_strand, int * is_donor_found_or_annotation, int * small_side_increasing_coordinate, int * large_side_increasing_coordinate)
{
gene_value_index_t * value_index = thread_context?thread_context->current_value_index:global_context->current_value_index ;
// guess_end is the index of the first UNWANTED BASE.
@@ -3583,7 +3551,7 @@ int donor_jumped_score(global_context_t * global_context, thread_context_t * thr
{
//printf("TEST_JUMPED: BSCORE=%d SPLT=%d\n", best_score , selected_real_split_point);
*final_split_point = selected_real_split_point;
- *is_donor_found = best_score>=500;
+ *is_donor_found_or_annotation = best_score>=500;
*is_GT_AG_strand = selected_junction_strand;
return best_score;
}
@@ -3591,7 +3559,7 @@ int donor_jumped_score(global_context_t * global_context, thread_context_t * thr
}
-int donor_score(global_context_t * global_context, thread_context_t * thread_context, unsigned int left_virtualHead_abs_offset, unsigned int right_virtualHead_abs_offset, int left_indel_offset, int right_indel_offset, int normally_arranged, int guess_start, int guess_end, char * read_text, int read_len, int * final_split_point, int * is_GT_AG_strand, int * is_donor_found, int * final_inserted_bases, int * small_side_increasing_coordinate, int * large_side_increasing_coordinate, char * r [...]
+int donor_score(global_context_t * global_context, thread_context_t * thread_context, unsigned int left_virtualHead_abs_offset, unsigned int right_virtualHead_abs_offset, int left_indel_offset, int right_indel_offset, int normally_arranged, int guess_start, int guess_end, char * read_text, int read_len, int * final_split_point, int * is_GT_AG_strand, int * is_donor_found_or_annotation, int * final_inserted_bases, int * small_side_increasing_coordinate, int * large_side_increasing_coordi [...]
{
@@ -3740,7 +3708,7 @@ int donor_score(global_context_t * global_context, thread_context_t * thread_con
if(best_score>0 && (0==non_insertion_preferred || 0==selected_inserted_bases))
{
*final_split_point = selected_real_split_point;
- *is_donor_found = best_score>=290000;
+ *is_donor_found_or_annotation = best_score>=290000;
*is_GT_AG_strand = selected_junction_strand;
*final_inserted_bases = selected_inserted_bases;
@@ -3818,7 +3786,7 @@ void find_new_junctions(global_context_t * global_context, thread_context_t * th
unsigned int right_virtualHead_abs_offset = max(result -> selected_position, subjunc_result -> minor_position);
int is_GT_AG_donors = result->result_flags & 0x3;
- int is_donor_found = is_GT_AG_donors<3;
+ int is_donor_found_or_annotation = is_GT_AG_donors<3;
int is_strand_jumped = (result->result_flags & CORE_IS_STRAND_JUMPED)?1:0;
if(selected_real_split_point>0)
@@ -4041,7 +4009,7 @@ void find_new_junctions(global_context_t * global_context, thread_context_t * th
//printf("MMMMX %d %u -- %u : TYPE %d\n" , event_no, left_edge_wanted, right_edge_wanted, new_event_type);
-// if((is_donor_found || !global_context -> config.check_donor_at_junctions) &&(!is_strand_jumped) && right_edge_wanted - left_edge_wanted <= global_context -> config.maximum_intron_length
+// if((is_donor_found_or_annotation || !global_context -> config.check_donor_at_junctions) &&(!is_strand_jumped) && right_edge_wanted - left_edge_wanted <= global_context -> config.maximum_intron_length
// && (subjunc_result->minor_coverage_start > result->confident_coverage_start) + (subjunc_result -> minor_position > result -> selected_position) !=1)
if(new_event_type == CHRO_EVENT_TYPE_JUNCTION)
@@ -4052,7 +4020,7 @@ void find_new_junctions(global_context_t * global_context, thread_context_t * th
new_event -> supporting_reads = 1;
new_event -> indel_length = 0;
new_event -> indel_at_junction = subjunc_result->indel_at_junction;
- new_event -> is_donor_found = is_donor_found;
+ new_event -> is_donor_found_or_annotation = is_donor_found_or_annotation;
new_event -> small_side_increasing_coordinate = subjunc_result -> small_side_increasing_coordinate;
new_event -> large_side_increasing_coordinate = subjunc_result -> large_side_increasing_coordinate;
@@ -4085,8 +4053,8 @@ void find_new_junctions(global_context_t * global_context, thread_context_t * th
}
}
-void write_translocation_results_final(void * buckv, HashTable * tab);
-void write_inversion_results_final(void * buckv, HashTable * tab);
+void write_translocation_results_final(void * key, void * buckv, HashTable * tab);
+void write_inversion_results_final(void * key, void * buckv, HashTable * tab);
int write_fusion_final_results(global_context_t * global_context)
{
@@ -4098,7 +4066,7 @@ int write_fusion_final_results(global_context_t * global_context)
fprintf(ofp,"#Chr Location Chr Location SameStrand nSupport\n");
//fprintf(ofp,"#Chr Location Chr Location SameStrand nSupport BreakPoint1_GoUp BreakPoint2_GoUp\n");
- int xk1;
+ int xk1, disk_is_full = 0;
unsigned int all_junctions = 0;
int no_sup_juncs = 0;
int all_juncs = 0;
@@ -4127,7 +4095,8 @@ int write_fusion_final_results(global_context_t * global_context)
//#warning "SUBREAD_151 ================ COMMENT the 'UNPAIRED' line and UNCOMMENT the next line ======================"
//fprintf(ofp, "UNPAIRED\t%s\t%u\t%s\t%u\t%s\t%d\n", chro_name_left, chro_pos_left, chro_name_right, chro_pos_right, event_body -> is_strand_jumped?"No":"Yes", event_body -> final_counted_reads);
- fprintf(ofp, "%s\t%u\t%s\t%u\t%s\t%d\t%s\t%s\n", chro_name_left, chro_pos_left, chro_name_right, chro_pos_right+1, event_body -> is_strand_jumped?"No":"Yes", event_body -> final_counted_reads, event_body -> small_side_increasing_coordinate?"Yes":"No", event_body -> large_side_increasing_coordinate?"Yes":"No");
+ int wlen = fprintf(ofp, "%s\t%u\t%s\t%u\t%s\t%d\t%s\t%s\n", chro_name_left, chro_pos_left, chro_name_right, chro_pos_right+1, event_body -> is_strand_jumped?"No":"Yes", event_body -> final_counted_reads, event_body -> small_side_increasing_coordinate?"Yes":"No", event_body -> large_side_increasing_coordinate?"Yes":"No");
+ if(wlen < 8) disk_is_full = 1;
}
global_context -> all_fusions = all_junctions;
@@ -4142,10 +4111,15 @@ int write_fusion_final_results(global_context_t * global_context)
}
fclose(ofp);
+
+ if(disk_is_full){
+ unlink(fn2);
+ SUBREADprintf("ERROR: disk is full. No fusion table is generated.\n");
+ }
return 0;
}
-void write_inversion_results_final(void * buckv, HashTable * tab){
+void write_inversion_results_final(void * bukey, void * buckv, HashTable * tab){
int x1;
bucketed_table_bucket_t * buck = buckv;
@@ -4170,7 +4144,7 @@ void write_inversion_results_final(void * buckv, HashTable * tab){
}
-void write_translocation_results_final(void * buckv, HashTable * tab){
+void write_translocation_results_final(void * bukey, void * buckv, HashTable * tab){
int x1;
bucketed_table_bucket_t * buck = buckv;
@@ -4205,7 +4179,7 @@ void write_translocation_results_final(void * buckv, HashTable * tab){
int write_junction_final_results(global_context_t * global_context)
{
- int no_sup_juncs = 0;
+ int no_sup_juncs = 0, disk_is_full = 0;
indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
char fn2 [MAX_FILE_NAME_LENGTH];
@@ -4251,18 +4225,23 @@ int write_junction_final_results(global_context_t * global_context)
if(event_body->indel_at_junction)
sprintf(indel_sect,"INS%d", event_body->indel_at_junction);
+ if(event_body-> is_donor_found_or_annotation &64)strcpy(indel_sect,"ANNO");
//else if(event_body->critical_supporting_reads < 1)
// strcpy(indel_sect, "NOCRT");
else indel_sect[0]=0;
- fprintf(ofp,"%s\t%u\t%u\tJUNC%08u%s\t%d\t%c\t%u\t%u\t%d,%d,%d\t2\t%d,%d\t0,%u\n", chro_name_left, feature_start, feature_end,
+ int wlen = fprintf(ofp,"%s\t%u\t%u\tJUNC%08u%s\t%d\t%c\t%u\t%u\t%d,%d,%d\t2\t%d,%d\t0,%u\n", chro_name_left, feature_start, feature_end,
all_junctions, indel_sect, event_body -> final_counted_reads, event_body->is_negative_strand?'-':'+',
feature_start, feature_end, event_body->is_negative_strand?0:255, /*event_body -> anti_supporting_reads*/ event_body->is_negative_strand?255:0, event_body->is_negative_strand?255:0,
event_body -> junction_flanking_left, event_body -> junction_flanking_right, feature_end-feature_start-event_body -> junction_flanking_right);
-
+ if(wlen < 10) disk_is_full = 1;
}
fclose(ofp);
+ if(disk_is_full){
+ unlink(fn2);
+ SUBREADprintf("ERROR: disk is full; no junction table is created.\n");
+ }
global_context -> all_junctions = all_junctions;
//printf("Non-support juncs=%d; Final juncs = %d\n", no_sup_juncs, all_junctions);
return 0;
@@ -5058,7 +5037,7 @@ int core13_test_donor(char *read, int read_len, unsigned int pos1, unsigned int
-#define EXON_LARGE_WINDOW 60
+#define EXON_LARGE_WINDOW 60
#define ACCEPTED_SUPPORT_RATE 0.3
void core_fragile_junction_voting(global_context_t * global_context, thread_context_t * thread_context, char * rname, char * read, char * qual, unsigned int full_rl, int negative_strand, int color_space, unsigned int low_border, unsigned int high_border, gene_vote_t *vote_p1)
@@ -5208,7 +5187,7 @@ void core_fragile_junction_voting(global_context_t * global_context, thread_cont
SUBREADprintf("INDEL_DDADD: abs(I=%d); INDELS=%d; LOC=%u\n",i, current_indel_len, indel_left_boundary-1);
if(abs(current_indel_len)<=global_context -> config.max_indel_length)
{
- chromosome_event_t * new_event = local_add_indel_event(global_context, thread_context, event_table, InBuff + cursor_on_read + min(0,current_indel_len), indel_left_boundary - 1, current_indel_len, 1, ambiguous_count, 0);
+ chromosome_event_t * new_event = local_add_indel_event(global_context, thread_context, event_table, InBuff + cursor_on_read + min(0,current_indel_len), indel_left_boundary - 1, current_indel_len, 1, ambiguous_count, 0, NULL);
if(last_event_id >=0 && new_event){
// the event space can be changed when the new event is added. the location is updated everytime.
chromosome_event_t * event_space = NULL;
@@ -5219,6 +5198,7 @@ void core_fragile_junction_voting(global_context_t * global_context, thread_cont
chromosome_event_t * last_event = event_space + last_event_id;
int dist = new_event -> event_small_side - last_event -> event_large_side +1;
+
new_event -> connected_previous_event_distance = dist;
last_event -> connected_next_event_distance = dist;
}
diff --git a/src/core-junction.h b/src/core-junction.h
index 49ea697..9bae3df 100644
--- a/src/core-junction.h
+++ b/src/core-junction.h
@@ -66,6 +66,7 @@ typedef struct{
int result_front_junction_numbers[MAX_ALIGNMENT_PER_ANCHOR];
int all_back_alignments;
int all_front_alignments;
+ int known_junctions;
// unsigned int tmp_jump_length;
// unsigned int best_jump_length;
@@ -153,5 +154,6 @@ int is_funky_fragment(global_context_t * global_context, char * rname1, char * c
void finalise_structural_variances(global_context_t * global_context);
+void debug_show_event(global_context_t* global_context, chromosome_event_t * event);
void get_event_two_coordinates(global_context_t * global_context, unsigned int event_no, char ** small_chro, int * small_pos, unsigned int * small_abs, char ** large_chro, int * large_pos, unsigned int * large_abs);
#endif
diff --git a/src/core.c b/src/core.c
index 591802a..7d272fe 100644
--- a/src/core.c
+++ b/src/core.c
@@ -45,6 +45,7 @@
#include "core.h"
#include "input-files.h"
#include "sorted-hashtable.h"
+#include "HelperFunctions.h"
#include "core-bigtable.h"
#include "core-indel.h"
@@ -152,173 +153,186 @@ int exec_cmd(char * cmd, char * outstr, int out_limit){
void print_in_box(int line_width, int is_boundary, int options, char * pattern,...)
{
- int put_color_for_colon, is_center;
+ int put_color_for_colon, is_center, is_wrapped;
va_list args;
va_start(args , pattern);
char is_R_linebreak=0, * content, *out_line_buff;
put_color_for_colon = (options & PRINT_BOX_NOCOLOR_FOR_COLON)?0:1;
is_center = (options & PRINT_BOX_CENTER)?1:0;
- content= malloc(1000);
- out_line_buff= malloc(1000);
- out_line_buff[0]=0;
- vsprintf(content, pattern, args);
- int is_R_code,x1,content_len = strlen(content), state, txt_len, is_cut = 0, real_lenwidth;
-
- is_R_code = 1;
- #ifdef MAKE_STANDALONE
- is_R_code = 0;
- #endif
-
- if(content_len>0&&content[content_len-1]=='\r'){
- content_len--;
- content[content_len] = 0;
- is_R_linebreak = 1;
- }
+ is_wrapped = (options & PRINT_BOX_WRAPPED)?1:0;
+
+ content= malloc(1200);
+ int content_len = vsprintf(content, pattern, args);
+ out_line_buff= malloc(1200);
+ out_line_buff[0]=0;;
+
+ if(is_wrapped){
+ int seg_i = 0;
+ for(seg_i=0; seg_i < content_len; seg_i += line_width-7){
+ strcpy(out_line_buff, content + seg_i);
+ out_line_buff[line_width-7] = 0;
+
+ print_in_box(line_width, is_boundary, options & (~PRINT_BOX_WRAPPED), out_line_buff);
+ }
+ }else{
+ int is_R_code,x1,content_len = strlen(content), state, txt_len, is_cut = 0, real_lenwidth;
- if(content_len>0&&content[content_len-1]=='\n'){
- content_len--;
- content[content_len] = 0;
- }
+ is_R_code = 1;
+ #ifdef MAKE_STANDALONE
+ is_R_code = 0;
+ #endif
- state = 0;
- txt_len = 0;
- real_lenwidth = line_width;
- for(x1 = 0; content [x1]; x1++)
- {
- char nch = content [x1];
- if(nch == CHAR_ESC)
- state = 1;
- if(state){
- real_lenwidth --;
- }else{
- txt_len++;
-
- if(txt_len == 80 - 6)
- {
- is_cut = 1;
- }
+ if(content_len>0&&content[content_len-1]=='\r'){
+ content_len--;
+ content[content_len] = 0;
+ is_R_linebreak = 1;
}
- if(nch == 'm' && state)
- state = 0;
- }
+ if(content_len>0&&content[content_len-1]=='\n'){
+ content_len--;
+ content[content_len] = 0;
+ }
- if(is_cut)
- {
state = 0;
txt_len = 0;
+ real_lenwidth = line_width;
for(x1 = 0; content [x1]; x1++)
{
char nch = content [x1];
if(nch == CHAR_ESC)
state = 1;
- if(!state){
+ if(state){
+ real_lenwidth --;
+ }else{
txt_len++;
- if(txt_len == 80 - 9)
+
+ if(txt_len == 80 - 6)
{
- strcpy(content+x1, "\x1b[0m ...");
- content_len = line_width - 4;
- content_len = 80 - 4;
- line_width = 80;
- break;
+ is_cut = 1;
}
}
+
if(nch == 'm' && state)
state = 0;
}
- }
- if(content_len==0 && is_boundary)
- {
- strcat(out_line_buff,is_boundary==1?"//":"\\\\");
- for(x1=0;x1<line_width-4;x1++)
- strcat(out_line_buff,"=");
- strcat(out_line_buff,is_boundary==1?"\\\\":"//");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "%s", out_line_buff);
-
- free(content);
- free(out_line_buff);
- return;
- }
- else if(is_boundary)
- {
- int left_stars = (line_width - content_len)/2 - 1;
- int right_stars = line_width - content_len - 2 - left_stars;
- strcat(out_line_buff,is_boundary==1?"//":"\\\\");
- for(x1=0;x1<left_stars-2;x1++) strcat(out_line_buff,"=");
- sprintf(out_line_buff+strlen(out_line_buff),"%c[36m", CHAR_ESC);
- sprintf(out_line_buff+strlen(out_line_buff)," %s ", content);
- sprintf(out_line_buff+strlen(out_line_buff),"%c[0m", CHAR_ESC);
- for(x1=0;x1<right_stars-2;x1++) strcat(out_line_buff,"=");
- strcat(out_line_buff,is_boundary==1?"\\\\":"//");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "%s", out_line_buff);
-
- free(content);
- free(out_line_buff);
- return;
- }
-
- int right_spaces, left_spaces;
- if(is_center)
- left_spaces = (line_width - content_len)/2-2;
- else
- left_spaces = 1;
-
- right_spaces = line_width - 4 - content_len- left_spaces;
-
- char spaces[81];
- memset(spaces , ' ', 80);
- spaces[0]='|';
- spaces[1]='|';
- spaces[80]=0;
-
- //sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO,"||");
-
- //for(x1=0;x1<left_spaces;x1++) sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO," ");
-
- spaces[left_spaces+2] = 0;
- strcat(out_line_buff,spaces);
-
- if(is_R_code)
- {
- strcat(out_line_buff,content);
- }
- else
- {
- int col1w=-1;
- for(x1=0; content[x1]; x1++)
+ if(is_cut)
{
- if(content[x1]==':')
+ state = 0;
+ txt_len = 0;
+ for(x1 = 0; content [x1]; x1++)
{
- col1w=x1;
- break;
+ char nch = content [x1];
+ if(nch == CHAR_ESC)
+ state = 1;
+ if(!state){
+ txt_len++;
+ if(txt_len == 80 - 9)
+ {
+ strcpy(content+x1, "\x1b[0m ...");
+ content_len = line_width - 4;
+ content_len = 80 - 4;
+ line_width = 80;
+ break;
+ }
+ }
+ if(nch == 'm' && state)
+ state = 0;
}
}
- if(col1w>0 && col1w < content_len-1 && put_color_for_colon)
+
+ if(content_len==0 && is_boundary)
+ {
+ strcat(out_line_buff,is_boundary==1?"//":"\\\\");
+ for(x1=0;x1<line_width-4;x1++)
+ strcat(out_line_buff,"=");
+ strcat(out_line_buff,is_boundary==1?"\\\\":"//");
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "%s", out_line_buff);
+
+ free(content);
+ free(out_line_buff);
+ return;
+ }
+ else if(is_boundary)
{
- content[col1w+1]=0;
- strcat(out_line_buff,content);
- strcat(out_line_buff," ");
+ int left_stars = (line_width - content_len)/2 - 1;
+ int right_stars = line_width - content_len - 2 - left_stars;
+ strcat(out_line_buff,is_boundary==1?"//":"\\\\");
+ for(x1=0;x1<left_stars-2;x1++) strcat(out_line_buff,"=");
sprintf(out_line_buff+strlen(out_line_buff),"%c[36m", CHAR_ESC);
- strcat(out_line_buff,content+col1w+2);
+ sprintf(out_line_buff+strlen(out_line_buff)," %s ", content);
sprintf(out_line_buff+strlen(out_line_buff),"%c[0m", CHAR_ESC);
+ for(x1=0;x1<right_stars-2;x1++) strcat(out_line_buff,"=");
+ strcat(out_line_buff,is_boundary==1?"\\\\":"//");
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "%s", out_line_buff);
+
+ free(content);
+ free(out_line_buff);
+ return;
}
+
+ int right_spaces, left_spaces;
+ if(is_center)
+ left_spaces = (line_width - content_len)/2-2;
else
+ left_spaces = 1;
+
+ right_spaces = line_width - 4 - content_len- left_spaces;
+
+ char spaces[81];
+ memset(spaces , ' ', 80);
+ spaces[0]='|';
+ spaces[1]='|';
+ spaces[80]=0;
+
+ //sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO,"||");
+
+ //for(x1=0;x1<left_spaces;x1++) sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO," ");
+
+ spaces[left_spaces+2] = 0;
+ strcat(out_line_buff,spaces);
+
+ if(is_R_code)
+ {
strcat(out_line_buff,content);
+ }
+ else
+ {
+ int col1w=-1;
+ for(x1=0; content[x1]; x1++)
+ {
+ if(content[x1]==':')
+ {
+ col1w=x1;
+ break;
+ }
+ }
+ if(col1w>0 && col1w < content_len-1 && put_color_for_colon)
+ {
+ content[col1w+1]=0;
+ strcat(out_line_buff,content);
+ strcat(out_line_buff," ");
+ sprintf(out_line_buff+strlen(out_line_buff),"%c[36m", CHAR_ESC);
+ strcat(out_line_buff,content+col1w+2);
+ sprintf(out_line_buff+strlen(out_line_buff),"%c[0m", CHAR_ESC);
+ }
+ else
+ strcat(out_line_buff,content);
+ }
+ // for(x1=0;x1<right_spaces - 1;x1++) sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO," ");
+
+ memset(spaces , ' ', 80);
+ spaces[79]='|';
+ spaces[78]='|';
+
+ right_spaces = max(1,right_spaces);
+ //if(is_R_linebreak)
+ // sprintf(out_line_buff+strlen(out_line_buff)," %c[0m%s%c", CHAR_ESC, spaces + (78 - right_spaces + 1) ,CORE_SOFT_BR_CHAR);
+ //else
+ sprintf(out_line_buff+strlen(out_line_buff)," %c[0m%s", CHAR_ESC , spaces + (78 - right_spaces + 1));
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, out_line_buff);
}
-// for(x1=0;x1<right_spaces - 1;x1++) sublog_fwrite(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO," ");
-
- memset(spaces , ' ', 80);
- spaces[79]='|';
- spaces[78]='|';
-
- right_spaces = max(1,right_spaces);
- //if(is_R_linebreak)
- // sprintf(out_line_buff+strlen(out_line_buff)," %c[0m%s%c", CHAR_ESC, spaces + (78 - right_spaces + 1) ,CORE_SOFT_BR_CHAR);
- //else
- sprintf(out_line_buff+strlen(out_line_buff)," %c[0m%s", CHAR_ESC , spaces + (78 - right_spaces + 1));
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, out_line_buff);
free(out_line_buff);
free(content);
}
@@ -627,8 +641,7 @@ int parse_opts_core(int argc , char ** argv, global_context_t * global_context)
global_context->config.report_sam_file = 0;
break;
case 'B':
- global_context->config.is_first_iteration_running = 0;
- strcpy(global_context->config.medium_result_prefix, optarg);
+ strcpy(global_context->config.exon_annotation_file, optarg);
break;
case 'c':
global_context->config.space_type = GENE_SPACE_COLOR;
@@ -705,6 +718,7 @@ int core_main(int argc , char ** argv, int (parse_opts (int , char **, global_co
int ret = parse_opts(argc , argv, global_context);
+ init_core_temp_path(global_context);
//global_context->config.reads_per_chunk = 200*1024;
if(global_context->config.max_indel_length > 20 && !global_context->input_reads.is_paired_end_reads)
@@ -764,6 +778,10 @@ int convert_BAM_to_SAM(global_context_t * global_context, char * fname, int is_b
{
char * is_ret = SamBam_fgets(sambam_reader, fline, 2999, 1);
if(!is_ret) break;
+ if(sambam_reader ->is_bam_broken ) {
+ SUBREADputs("ERROR: the BAM format is broken.");
+ return -1;
+ }
if(fline[0]=='@')continue;
if(is_SAM_unsorted(fline, tmp_readname, &tmp_flags , read_no)){
if(tmp_flags & 1) global_context->input_reads.is_paired_end_reads = 1;
@@ -791,6 +809,7 @@ int convert_BAM_to_SAM(global_context_t * global_context, char * fname, int is_b
else if(is_bam)
print_in_box(80,0,0,"Convert the input BAM file...");
+ int disk_is_full = 0;
if(is_bam || (global_context->input_reads.is_paired_end_reads && !is_file_sorted))
{
sprintf(temp_file_name, "%s.sam", global_context->config.temp_file_prefix);
@@ -804,7 +823,7 @@ int convert_BAM_to_SAM(global_context_t * global_context, char * fname, int is_b
int writer_opened = 0;
if(is_file_sorted) sam_fp = f_subr_open(temp_file_name,"w");
- else writer_opened = sort_SAM_create(&writer, temp_file_name, ".");
+ else writer_opened = sort_SAM_create(&writer, temp_file_name, NULL);
if((is_file_sorted && !sam_fp) || (writer_opened && !is_file_sorted)){
SUBREADprintf("Failed to write to the directory. You may not have permission to write to this directory or the disk is full.\n");
@@ -815,23 +834,37 @@ int convert_BAM_to_SAM(global_context_t * global_context, char * fname, int is_b
{
char * is_ret = SamBam_fgets(sambam_reader, fline, 2999, 1);
if(!is_ret) break;
- if(is_file_sorted)
- fputs(fline, sam_fp);
- else{
+ if(is_file_sorted){
+ int wlen = fputs(fline, sam_fp);
+ if(wlen <0){
+ SUBREADprintf("ERROR: unable to write into the temporary SAM file. Please check the disk space in the output directory.\n");
+ disk_is_full = 1;
+ break;
+ }
+ }else{
int ret = sort_SAM_add_line(&writer, fline, strlen(fline));
- if(ret<0) {
+ if(ret== -1) {
print_in_box(80,0,0,"ERROR: read name is too long; check input format.");
break;
}
+ if(ret== -2) {
+ SUBREADprintf("ERROR: unable to write into the temporary SAM file. Please check the disk space in the output directory.\n");
+ disk_is_full = 1;
+ break;
+ }
}
}
if(is_file_sorted)
fclose(sam_fp);
else{
- sort_SAM_finalise(&writer);
+ int ret = sort_SAM_finalise(&writer);
if(writer.unpaired_reads)
print_in_box(80,0,0,"%llu single-end mapped reads in reordering.", writer.unpaired_reads);
+ if(ret) {
+ disk_is_full = 1;
+ SUBREADprintf("ERROR: unable to create the temporary file. Please check the disk space in the output directory.\n");
+ }
}
SamBam_fclose(sambam_reader);
@@ -839,10 +872,8 @@ int convert_BAM_to_SAM(global_context_t * global_context, char * fname, int is_b
global_context -> will_remove_input_file = 1;
}
-
-
free(fline);
- return 0;
+ return disk_is_full;
}
int convert_GZ_to_FQ(global_context_t * global_context, char * fname, int half_n)
@@ -1650,27 +1681,40 @@ void write_buffered_output_file(global_context_t *global_context, output_fragme
}
else
{
- sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", rec->r1.read_name, rec->r1.flags, rec->r1.chro_name, rec->r1.location, rec->r1.map_quality, rec->r1.cigar, rec->r1.other_chro_name, rec->r1.other_location, rec->r1.tlen, rec->r1.read_text, rec->r1.qual_text, rec->r1.additional_columns[0]?"\t":"", rec->r1.additional_columns);
+ int write_len_2 = 100, write_len = sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", rec->r1.read_name, rec->r1.flags, rec->r1.chro_name, rec->r1.location, rec->r1.map_quality, rec->r1.cigar, rec->r1.other_chro_name, rec->r1.other_location, rec->r1.tlen, rec->r1.read_text, rec->r1.qual_text, rec->r1.additional_columns[0]?"\t":"", rec->r1.additional_columns);
if(global_context->input_reads.is_paired_end_reads)
- sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", rec->r2.read_name, rec->r2.flags, rec->r2.chro_name, rec->r2.location, rec->r2.map_quality, rec->r2.cigar, rec->r2.other_chro_name, rec->r2.other_location, rec->r2.tlen, rec->r2.read_text, rec->r2.qual_text, rec->r2.additional_columns[0]?"\t":"", rec->r2.additional_columns);
+ write_len_2 = sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", rec->r2.read_name, rec->r2.flags, rec->r2.chro_name, rec->r2.location, rec->r2.map_quality, rec->r2.cigar, rec->r2.other_chro_name, rec->r2.other_location, rec->r2.tlen, rec->r2.read_text, rec->r2.qual_text, rec->r2.additional_columns[0]?"\t":"", rec->r2.additional_columns);
+
+ if( write_len < 10 || write_len_2 < 10 ){
+ global_context -> output_sam_is_full = 1;
+ }
}
}
+double last_t1 = 0;
+
void merge_buffered_output_file(global_context_t *global_context, int need_lock, int my_thread_no, int * all_threads_finished){
thread_context_t * thread_contexts = global_context -> all_thread_contexts;
//SUBREADprintf("merge_start: lock=%d, thread=%d, remain_item=%d\n", need_lock, my_thread_no, thread_contexts[my_thread_no].output_buffer_item);
int current_thread_no;
+ unsigned int all_reads=0;
+
+ //double tb = miltime();
if(need_lock){
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++){
- if(my_thread_no != current_thread_no) {
- thread_context_t * current_thread = thread_contexts + current_thread_no;
+ thread_context_t * current_thread = thread_contexts + current_thread_no;
+ //SUBREADprintf(" THREAD %d : %d / %d used\n", current_thread_no, current_thread -> output_buffer_item, MULTI_THREAD_OUTPUT_ITEMS* global_context -> config.reported_multi_best_reads);
+ all_reads += current_thread -> output_buffer_item;
+
+ if(my_thread_no != current_thread_no)
subread_lock_occupy(&current_thread -> output_lock);
- }
}
}
+ //double t0 = miltime();
+
while(1){
int has_found = 0;
@@ -1689,10 +1733,12 @@ void merge_buffered_output_file(global_context_t *global_context, int need_lock,
//if(161430 <= earliest_frag_number) SUBREADprintf("The %d-th thread has earlist = %u ; want %u\n", current_thread_no, earliest_frag_number , global_context -> last_written_fragment_number);
+ //SUBREADprintf("30NOV2016-EARLIST %d <= %d\n", earliest_frag_number , global_context -> last_written_fragment_number);
if(earliest_frag_number <= global_context -> last_written_fragment_number){
int need_more = max(1, src -> multi_mapping_locations);
need_more -= src -> this_mapping_location;
+ //#warning ">>>>> THE NEXT BLOCK SHOULD BE COMMENTED OUT <<<<<<"
if(0 && src -> this_mapping_location < src -> multi_mapping_locations)
{
int next_frag_index = (earliest_frag_index == MULTI_THREAD_OUTPUT_ITEMS * global_context -> config.reported_multi_best_reads - 1)?0:earliest_frag_index + 1;
@@ -1704,6 +1750,7 @@ void merge_buffered_output_file(global_context_t *global_context, int need_lock,
assert(nextsrc -> fragment_number_in_chunk == src -> fragment_number_in_chunk);
}
//if(161430 <= earliest_frag_number) SUBREADprintf("WRITTEN [%u]? %d/%d MORE=%d\n", earliest_frag_number, src -> this_mapping_location , src -> multi_mapping_locations,need_more);
+ //SUBREADprintf("30NOV2016-NEED-MORE %d >= %d BY THRE %d\n", current_thread -> output_buffer_item , need_more, current_thread_no);
if(current_thread -> output_buffer_item >= need_more){
//SUBREADprintf("merge: %d >= 1 ? this=%d, all=%d \n", src -> this_mapping_location, need_more, src -> this_mapping_location , src -> multi_mapping_locations);
if(need_more <= 1)
@@ -1720,6 +1767,11 @@ void merge_buffered_output_file(global_context_t *global_context, int need_lock,
break;
}
+ //double t1 = miltime();
+
+ //if(last_t1 > 100)
+ // SUBREADprintf("WRITE RATE: RUNNING=%.6f, TB->T0=%.6f, T0->T1=%.6f %.12f sec / read\n", tb - last_t1 , t0-tb, t1-t0 , (t1 - t0) / all_reads );
+
if(need_lock){
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++){
if(my_thread_no != current_thread_no) {
@@ -1730,9 +1782,13 @@ void merge_buffered_output_file(global_context_t *global_context, int need_lock,
}
global_context -> need_merge_buffer_now = 0;
+ //last_t1 = t1;
+
//SUBREADprintf("merge_finished: lock=%d, thread=%d, remain_item=%d, last_id=%u\n", need_lock, my_thread_no, thread_contexts[my_thread_no].output_buffer_item, global_context -> last_written_fragment_number);
}
+int for_other_thread = 0;
+int for_one_threads = 0;
#define BUFFER_TICK_SLEEP_TIME 1000
@@ -1748,12 +1804,19 @@ void add_buffered_fragment(global_context_t * global_context, thread_context_t *
while(1){
int done = 0, all_finished=0;
+ //SUBREADprintf("30NOV2016-ADDBUFF-LOCK BY THRE %d\n", thread_context -> thread_id);
subread_lock_occupy(&thread_context -> output_lock);
- if(thread_context -> thread_id == 0 && ( thread_context -> output_buffer_item > MULTI_THREAD_OUTPUT_ITEMS* global_context -> config.reported_multi_best_reads /4 || global_context -> need_merge_buffer_now)){
+ if(thread_context -> thread_id == 0 && ( thread_context -> output_buffer_item > MULTI_THREAD_OUTPUT_ITEMS* global_context -> config.reported_multi_best_reads *1/4 || global_context -> need_merge_buffer_now)){
+ if(global_context -> need_merge_buffer_now)
+ for_other_thread ++;
+ else
+ for_one_threads ++;
+ //SUBREADprintf("ADD BLOCK: FOR ONE:%d , FOR OTHER:%d\n", for_one_threads, for_other_thread);
merge_buffered_output_file(global_context, 1, thread_context -> thread_id, &all_finished);
}
+ //SUBREADprintf("30NOV2016-TRY-WRITE MY ITEMS=%d < %d BY THRE %d\n", thread_context -> output_buffer_item , MULTI_THREAD_OUTPUT_ITEMS * global_context -> config.reported_multi_best_reads , thread_context -> thread_id );
+
if(thread_context -> output_buffer_item < MULTI_THREAD_OUTPUT_ITEMS * global_context -> config.reported_multi_best_reads) {
- //SUBREADprintf("BUFFER PNTR=%d\n", thread_context -> output_buffer_pointer);
output_fragment_buffer_t * target = thread_context -> output_buffer+thread_context -> output_buffer_pointer;
target -> multi_mapping_locations = all_locations;
target -> this_mapping_location = this_location;
@@ -1797,6 +1860,7 @@ void add_buffered_fragment(global_context_t * global_context, thread_context_t *
}
subread_lock_release(&thread_context -> output_lock);
+ //SUBREADprintf("30NOV2016-ADDBUFF-FREE BY THRE %d\n", thread_context -> thread_id);
if(done)break;
usleep(BUFFER_TICK_SLEEP_TIME);
}
@@ -1845,7 +1909,10 @@ void write_single_fragment(global_context_t * global_context, thread_context_t *
if(global_context->input_reads.is_paired_end_reads)
{
flag2 = calc_flags( global_context , rec1, rec2, 1, read_len_1, read_len_2, current_location, tlen, is_R2_OK, is_R1_OK);
- if((0 == current_location) && (flag2 & SAM_FLAG_MATCHED_IN_PAIR)) global_context->all_correct_PE_reads ++;
+ if((0 == current_location) && (flag2 & SAM_FLAG_MATCHED_IN_PAIR)){
+ if(thread_context)thread_context->all_correct_PE_reads ++;
+ else global_context->all_correct_PE_reads ++;
+ }
}
@@ -2082,9 +2149,13 @@ void write_single_fragment(global_context_t * global_context, thread_context_t *
}
else
{
- sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", read_name_1, flag1, out_chro1, out_offset1, out_mapping_quality1, out_cigar1, mate_chro_for_1, out_offset2, out_tlen1, read_text_1 + display_offset1, qual_text_1, extra_additional_1[0]?"\t":"", extra_additional_1);
+ int write_len_2 = 100, write_len = sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", read_name_1, flag1, out_chro1, out_offset1, out_mapping_quality1, out_cigar1, mate_chro_for_1, out_offset2, out_tlen1, read_text_1 + display_offset1, qual_text_1, extra_additional_1[0]?"\t":"", extra_additional_1);
if(global_context->input_reads.is_paired_end_reads)
- sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", read_name_2, flag2, out_chro2, out_offset2, out_mapping_quality2, out_cigar2, mate_chro_for_2, out_offset1, out_tlen2, read_text_2 + display_offset2, qual_text_2, extra_additional_2[0]?"\t":"", extra_additional_2);
+ write_len_2 = sambamout_fprintf(global_context -> output_sam_fp , "%s\t%d\t%s\t%u\t%d\t%s\t%s\t%u\t%lld\t%s\t%s%s%s\n", read_name_2, flag2, out_chro2, out_offset2, out_mapping_quality2, out_cigar2, mate_chro_for_2, out_offset1, out_tlen2, read_text_2 + display_offset2, qual_text_2, extra_additional_2[0]?"\t":"", extra_additional_2);
+
+ if( write_len < 10 || write_len_2 < 10 ){
+ global_context -> output_sam_is_full = 1;
+ }
}
//subread_lock_release(&global_context -> output_lock);
}
@@ -2190,8 +2261,6 @@ int do_iteration_one(global_context_t * global_context, thread_context_t * threa
sqr_read_number++;
ret = fetch_next_read_pair(global_context, thread_context, ginp1, ginp2, &read_len_1, &read_len_2, read_name_1, read_name_2, read_text_1, read_text_2, qual_text_1, qual_text_2, 1, &current_read_number);
- if(current_read_number < 0) break;
- // if no more reads
for (is_second_read = 0; is_second_read < 1 + global_context -> input_reads.is_paired_end_reads; is_second_read ++)
{
@@ -2355,22 +2424,27 @@ int do_iteration_three(global_context_t * global_context, thread_context_t * thr
void add_realignment_event_support(global_context_t * global_context , realignment_result_t * res){
int xk1;
+ indel_context_t * indel_context = (indel_context_t *)global_context -> module_contexts[MODULE_INDEL_ID];
for(xk1 = 0; xk1 < MAX_EVENTS_IN_READ ; xk1++){
chromosome_event_t *sup = res -> supporting_chromosome_events[xk1];
if(!sup)break;
+
+ int lock_hash = sup -> global_event_id % EVENT_BODY_LOCK_BUCKETS;
+ subread_lock_occupy(indel_context -> event_body_locks+lock_hash);
sup -> final_counted_reads ++;
sup -> junction_flanking_left = max(sup -> junction_flanking_left, res -> flanking_size_left[xk1]);
sup -> junction_flanking_right = max(sup -> junction_flanking_right, res -> flanking_size_right[xk1]);
+ subread_lock_release(indel_context -> event_body_locks+lock_hash);
}
}
-void test_PE_and_same_chro_align(global_context_t * global_context , realignment_result_t * res1, realignment_result_t * res2, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * rname);
+unsigned int calc_end_pos(unsigned int p, char * cigar, unsigned int * all_skipped_len, int * is_exonic_regions, global_context_t * global_context);
+void test_PE_and_same_chro_align(global_context_t * global_context , realignment_result_t * res1, realignment_result_t * res2, int * is_exonic_regions, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * rname);
void write_realignments_for_fragment(global_context_t * global_context, thread_context_t * thread_context, subread_output_context_t * out_context, unsigned int read_number, realignment_result_t * res1, realignment_result_t * res2, char * read_name_1, char * read_name_2, char * read_text_1, char * read_text_2, char * qual_text_1, char * qual_text_2 , int rlen1 , int rlen2, int multi_mapping_number, int this_multi_mapping_i, int non_informative_subreads_r1, int non_informative_subreads_r2){
int is_2_OK = 0, is_1_OK = 0;
-
if(res1){
is_1_OK = convert_read_to_tmp(global_context , out_context, read_number, 0, rlen1, read_text_1, qual_text_1, res1, out_context -> r1, read_name_1);
if(is_1_OK) add_realignment_event_support(global_context, res1);
@@ -2549,7 +2623,10 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
fastq_64_to_33(raw_qual_text_2);
}
+ if((global_context -> config.is_BAM_output && global_context -> output_bam_writer -> is_internal_error) ||
+ (global_context -> output_sam_is_full))break;
if(current_read_number < 0) break;
+
// if no more reads
if( global_context -> input_reads.is_paired_end_reads)
max_votes = max(_global_retrieve_alignment_ptr(global_context, current_read_number, 0, 0)->selected_votes, _global_retrieve_alignment_ptr(global_context, current_read_number, 1, 0)->selected_votes);
@@ -2575,9 +2652,11 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
int * current_MISMATCH_buffer = is_second_read?final_MISMATCH_buffer2:final_MISMATCH_buffer1;
int * current_realignment_index = is_second_read?final_realignment_index2:final_realignment_index1;
+
for(best_read_id = 0; best_read_id < global_context -> config.multi_best_reads; best_read_id++)
{
mapping_result_t *current_result = _global_retrieve_alignment_ptr(global_context, current_read_number, is_second_read, best_read_id);
+
if(best_read_id == 0){
if(is_second_read)
non_informative_subreads_r2 = current_result -> noninformative_subreads_in_vote;
@@ -2603,6 +2682,9 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
continue;
}
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-STEP21-%s:%d-ALN%02d-THRE %d\n", current_read_name, is_second_read + 1, best_read_id, thread_context -> thread_id);
+
int is_negative_strand = (current_result -> result_flags & CORE_IS_NEGATIVE_STRAND)?1:0;
if(is_negative_strand + is_reversed_already == 1)
@@ -2618,15 +2700,11 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
if(is_second_read) r2_step2_locations = best_read_id + 1;
else r1_step2_locations = best_read_id + 1;
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:7921:2517#ACTTGA", read_name_1)==0){
- SUBREADprintf("R1 N2=%d, R2 N2=%d, R%d : pos=%u\n", r1_step2_locations, r2_step2_locations, is_second_read+1, current_result->selected_position);
- }
-
unsigned int final_alignments = explain_read(global_context, thread_context , final_realignments + (is_second_read + 2 * best_read_id) * MAX_ALIGNMENT_PER_ANCHOR,
current_read_number, current_rlen, current_read_name, current_read, current_qual, is_second_read, best_read_id, is_negative_strand);
- if(0 && FIXLENstrcmp("R000000359", read_name_1)==0)
- SUBREADprintf("Final alignments=%d, cand = %d , FLAG=%d, final_MATCH=%d\n", final_alignments , (* current_candidate_locations), current_result -> result_flags & CORE_IS_FULLY_EXPLAINED , final_realignments[0].final_matched_bases);
+ if(0 && FIXLENstrcmp("R010442852", read_name_1)==0)
+ SUBREADprintf("Final alignments of %s [R%d] =%d, cand = %d , FLAG=%d, final_MATCH=%d\n", read_name_1, is_second_read +1, final_alignments , (* current_candidate_locations), current_result -> result_flags & CORE_IS_FULLY_EXPLAINED , final_realignments[0].final_matched_bases);
final_realignment_number[ best_read_id * 2 + is_second_read ] = final_alignments;
@@ -2638,6 +2716,8 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
if((current_result -> result_flags & CORE_IS_FULLY_EXPLAINED) && final_MATCH>0) {
int final_MISMATCH = final_realignments[realign_index].final_mismatched_bases;
+ //#warning ">>>>>>>>>>>>>>>>>>>>>>>>>>>>> REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPMSM-MMM %s M=%d MM=%d\n", read_name_1, final_MATCH ,final_MISMATCH);
current_MATCH_buffer[*current_candidate_locations] = final_MATCH;
current_MISMATCH_buffer[*current_candidate_locations] = final_MISMATCH;
@@ -2645,11 +2725,23 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
(*current_candidate_locations) ++;
}
}
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-FI-%s:%d-ALN%02d-THRE %d\n", current_read_name, is_second_read + 1, best_read_id, thread_context -> thread_id);
}
}
- if(0 && FIXLENstrcmp("R000000359", read_name_1)==0)
- SUBREADprintf("Candidate locations = %d (%d), %d (%d)\n", r1_candidate_locations, final_MATCH_buffer1[0] , r2_candidate_locations, final_MATCH_buffer2[0]);
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-FIND-%s-THRE %d ; cand locs = %d , %d\n", read_name_1, thread_context -> thread_id, r1_candidate_locations, r2_candidate_locations );
+ //if(0 && FIXLENstrcmp("R000000359", read_name_1)==0)
+ //#warning ">>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<"
+ if(0){
+ printf("OCT27-STEPMSM %s %d (%d), %d (%d)\n", read_name_1, r1_candidate_locations, r1_candidate_locations? final_MATCH_buffer1[0]:-9999 , r2_candidate_locations, r2_candidate_locations?final_MATCH_buffer2[0]:-9999);
+ int xxx1;
+ for(xxx1=0; xxx1<r1_candidate_locations; xxx1++)
+ printf("OCT27-STEPMSM-R1 %s [%02d] %d\n", read_name_1, xxx1, final_MATCH_buffer1[xxx1]);
+ for(xxx1=0; xxx1<r2_candidate_locations; xxx1++)
+ printf("OCT27-STEPMSM-R2 %s [%02d] %d\n", read_name_1, xxx1, final_MATCH_buffer2[xxx1]);
+ }
//if(161430 <= current_read_number) SUBREADprintf("LOC1=%d, LOC2=%d\n", r1_candidate_locations, r2_candidate_locations);
@@ -2665,27 +2757,36 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
int * current_MISMATCH_buffer = is_second_read?final_MISMATCH_buffer2:final_MISMATCH_buffer1;
int * current_realignment_index = is_second_read?final_realignment_index2:final_realignment_index1;
- unsigned int best_score_highest = 0, read_record_i;
- unsigned int scores_array [global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR];
+ int read_record_i;
+ unsigned long long int best_score_highest = 0;
+ unsigned long long int scores_array [global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR];
for(read_record_i = 0; read_record_i < current_candidate_locations; read_record_i++){
- //realignment_result_t * current_realignment_result = final_realignments + current_realignment_index[read_record_i];
+ realignment_result_t * current_realignment_result = final_realignments + current_realignment_index[read_record_i];
//mapping_result_t *current_result = current_realignment_result -> mapping_result;
//assert(current_result -> result_flags & CORE_IS_FULLY_EXPLAINED);
unsigned int this_MATCH = current_MATCH_buffer[read_record_i];
unsigned int this_MISMATCH = current_MISMATCH_buffer[read_record_i];
- unsigned int this_SCORE;
+ unsigned long long this_SCORE;
if(global_context -> config.experiment_type == CORE_EXPERIMENT_DNASEQ){
- this_SCORE = this_MATCH * 100000 + (10000 - this_MISMATCH);
+ this_SCORE = this_MATCH * 100000llu + (10000 - this_MISMATCH);
}else{
- this_SCORE = 100000 * (10000 - this_MISMATCH) + this_MATCH;
+ unsigned int skip = 0; int is_exonic_regions = 1;
+ if(global_context -> exonic_region_bitmap)calc_end_pos(current_realignment_result -> first_base_position, current_realignment_result -> cigar_string, &skip, &is_exonic_regions, global_context );
+
+ int weight = is_exonic_regions?3000:1000;
+
+ if(1) weight = 1;
+
+ this_SCORE = (100000llu * (10000 - this_MISMATCH) + this_MATCH)*50llu + current_realignment_result -> known_junction_supp;
+ this_SCORE *= weight;
}
- if(0 && FIXLENstrcmp("R:chrX:52790377:100M:J0", read_name_1)==0)
- SUBREADprintf("%s, %d-th read [%d] : MAT=%d, MISMA=%d, SCORE=%u, BEST_SCORE=%u\n", read_name_1, is_second_read+1, read_record_i , this_MATCH, this_MISMATCH, this_SCORE, best_score_highest );
+ // if(0 && FIXLENstrcmp("R:chrX:52790377:100M:J0", read_name_1)==0)
+ // SUBREADprintf("%s, %d-th read [%d] : MAT=%d, MISMA=%d, SCORE=%u, BEST_SCORE=%u\n", read_name_1, is_second_read+1, read_record_i , this_MATCH, this_MISMATCH, this_SCORE, best_score_highest );
best_score_highest = max(best_score_highest, this_SCORE);
@@ -2694,12 +2795,10 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
for(read_record_i = 0; read_record_i < current_candidate_locations ; read_record_i++){
realignment_result_t * current_realignment_result = final_realignments + current_realignment_index[read_record_i];
- if( scores_array[read_record_i] >= best_score_highest && (current_realignment_result -> realign_flags & CORE_TOO_MANY_MISMATCHES)==0)
- {
+ if( scores_array[read_record_i] >= best_score_highest && (current_realignment_result -> realign_flags & CORE_TOO_MANY_MISMATCHES)==0) {
int is_repeated = 0;
is_repeated = add_repeated_buffer(global_context, repeated_buffer_pos, repeated_buffer_cigars, &repeated_count, is_second_read?NULL:current_realignment_result , is_second_read?current_realignment_result:NULL);
-
if(is_repeated)
scores_array[read_record_i] = 0;
else{
@@ -2709,6 +2808,9 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
}
}
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-WRITESINGLEMAP-%s-THRE %d ; occu=%d ; end=%d\n", read_name_1, thread_context -> thread_id, highest_score_occurence , is_second_read+1);
+
if(highest_score_occurence<2 || global_context -> config.report_multi_mapping_reads){
int is_break_even = 0;
if(highest_score_occurence>1) is_break_even = 1;
@@ -2735,18 +2837,20 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
output_cursor ++;
}
}
- assert(output_cursor >= highest_score_occurence - 1);
+
+ assert(output_cursor == highest_score_occurence);
}
}
}
} else {
int r1_best_id, r2_best_id, highest_score_occurence = 0;
unsigned long long highest_score = 0;
+ memset(final_SCORE_buffer, 0 , sizeof(long long) * global_context -> config.multi_best_reads * global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR*MAX_ALIGNMENT_PER_ANCHOR );
for(r1_best_id = 0; r1_best_id < r1_candidate_locations; r1_best_id ++) {
int r1_matched = final_MATCH_buffer1[r1_best_id];
if(r1_matched < 1) continue;
realignment_result_t * realignment_result_R1 = final_realignments + final_realignment_index1[r1_best_id];
- if(0 && FIXLENstrcmp("FINALQUAL R:chrX:52790377:100M:J0", read_name_1)==0)
+ if(0 && FIXLENstrcmp("R000404427", read_name_1) ==0)
SUBREADprintf("R1 MA=%d, MISMA=%d %u %s \n", realignment_result_R1->final_matched_bases, realignment_result_R1 -> final_mismatched_bases, realignment_result_R1 -> first_base_position, realignment_result_R1 -> cigar_string);
for(r2_best_id = 0; r2_best_id < r2_candidate_locations; r2_best_id ++) {
@@ -2754,13 +2858,23 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
if(r2_matched < 1) continue;
realignment_result_t * realignment_result_R2 = final_realignments + final_realignment_index2[r2_best_id];
- if(0 && FIXLENstrcmp("FINALQUAL R:chrX:52790377:100M:J0", read_name_1)==0)
+ if(0 && FIXLENstrcmp("R000404427", read_name_1) ==0)
SUBREADprintf("R2 MA=%d, MISMA=%d %u %s\n", realignment_result_R2->final_matched_bases, realignment_result_R2 -> final_mismatched_bases, realignment_result_R2 -> first_base_position, realignment_result_R2 -> cigar_string);
int is_PE = 0;
- int is_same_chro = 0;
+ int is_same_chro = 0, is_exonic_regions = 0;
unsigned long long final_SCORE = 0;
- test_PE_and_same_chro_align(global_context , realignment_result_R1 , realignment_result_R2, &is_PE, &is_same_chro , read_len_1, read_len_2, read_name_1);
+ test_PE_and_same_chro_align(global_context , realignment_result_R1 , realignment_result_R2, &is_exonic_regions, &is_PE, &is_same_chro , read_len_1, read_len_2, read_name_1);
+ if(0 && FIXLENstrcmp("R000404427", read_name_1) ==0) {
+ char outpos1[100];
+ char outpos2[100];
+
+ absoffset_to_posstr(global_context, realignment_result_R1 -> first_base_position, outpos1);
+ absoffset_to_posstr(global_context, realignment_result_R2 -> first_base_position, outpos2);
+
+ SUBREADprintf("READ %s : %s %s %s %s : %s\n", read_name_1, outpos1, realignment_result_R1 -> cigar_string, outpos2, realignment_result_R2 -> cigar_string, is_exonic_regions? "YES":"NO");
+ }
+
if(global_context -> config.experiment_type == CORE_EXPERIMENT_DNASEQ){
int weight;
@@ -2768,11 +2882,11 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
//#warning " ============ USE THE FIRST THREE WEIGHTS! ======== "
if(is_PE)
- weight = global_context -> config.PE_predominant_weight?300:120;
+ weight = 120;
//weight = 300;
else if(is_same_chro)
- weight = global_context -> config.PE_predominant_weight?5:100;
- else weight = global_context -> config.PE_predominant_weight?3:80;
+ weight = 100;
+ else weight = 80;
//weight = 30;
final_SCORE = weight * (final_MATCH_buffer1[r1_best_id] + final_MATCH_buffer2[r2_best_id]);
//#warning "=========== ADD BY YANG LIAO FOR MORE MAPPED READS WITH '-u' OPTION ================"
@@ -2781,29 +2895,51 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
} else if (global_context -> config.experiment_type == CORE_EXPERIMENT_RNASEQ) {
int weight;
- if(is_PE)
- weight = 3000;
- else if(is_same_chro)
- weight = global_context -> config.PE_predominant_weight?10:1000;
- else weight = global_context -> config.PE_predominant_weight?3:300;
+
+
+ if(1){
+
+ if(is_exonic_regions && is_PE)
+ weight = 5000;
+ else if(is_PE || is_exonic_regions)
+ weight = 3000;
+ else if(is_same_chro)
+ weight = 1000;
+ else weight = 300;
+ }else{
+ if(is_PE)
+ weight = 3000;
+ else if(is_same_chro)
+ weight = 1000;
+ else weight = 300;
+
+ }
//#warning "=========== ADD BY YANG LIAO ' + 2' ===================="
final_SCORE = 100000llu * weight / (final_MISMATCH_buffer1[r1_best_id] + final_MISMATCH_buffer2[r2_best_id] + 1 + 2);
//#warning "=========== ADD BY YANG LIAO FOR MORE MAPPED READS WITH '-u' OPTION ================"
- final_SCORE = final_SCORE * 1000llu + (final_MATCH_buffer1[r1_best_id] + final_MATCH_buffer2[r2_best_id]);
+ final_SCORE = final_SCORE * 100llu + (final_MATCH_buffer1[r1_best_id] + final_MATCH_buffer2[r2_best_id]);
+ final_SCORE = final_SCORE * 20 + realignment_result_R2 -> known_junction_supp + realignment_result_R1 -> known_junction_supp;
+
+ //#warning ">>>>>>>>>>>>>>>>>>>>>>>>>>>>> REMOVE THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPMSM-FNSC %s M=%d,%d MM=%d,%d CHROSAME=%d FSCR=%llu\n", read_name_1, final_MATCH_buffer1[r1_best_id], final_MATCH_buffer2[r2_best_id], final_MISMATCH_buffer1[r1_best_id] , final_MISMATCH_buffer2[r2_best_id], is_same_chro, final_SCORE );
+
} else assert(0);
assert(final_SCORE > 0);
- final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads + r2_best_id] = final_SCORE;
+ final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR + r2_best_id] = final_SCORE;
- if(0 && FIXLENstrcmp("FINALQUAL R:chrX:52790377:100M:J0", read_name_1)==0){
+ if(0 && FIXLENstrcmp("R000404427", read_name_1) ==0){
SUBREADprintf("Highest=%llu, This=%llu, Occurance=%d\n", highest_score , final_SCORE , highest_score_occurence);
}
if(final_SCORE > highest_score) {
+ //#warning ">>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPMSM-REPL %s : SCORE %llu -> %llu, pos=%u,%u\n", read_name_1, final_SCORE, highest_score , realignment_result_R1->mapping_result->selected_position, realignment_result_R2->mapping_result->selected_position );
highest_score_occurence = 1;
highest_score = final_SCORE;
+ //SUBREADprintf("29NOV2016-FIRSTBEST %s [%d : score %llu] LOC = %d,%d\n", read_name_1, highest_score_occurence, final_SCORE, r1_best_id, r2_best_id);
clear_repeated_buffer(global_context, repeated_buffer_pos, repeated_buffer_cigars, &repeated_count);
add_repeated_buffer(global_context, repeated_buffer_pos, repeated_buffer_cigars, &repeated_count, realignment_result_R1, realignment_result_R2);
} else if(final_SCORE == highest_score) {
@@ -2813,17 +2949,21 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
is_repeat = add_repeated_buffer(global_context, repeated_buffer_pos, repeated_buffer_cigars, &repeated_count, realignment_result_R1, realignment_result_R2);
if(is_repeat)
- final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads + r2_best_id] = 0;
- else
+ final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR + r2_best_id] = 0;
+ else{
highest_score_occurence ++;
- if(0 && FIXLENstrcmp("R000001161", read_name_1)==0)
- SUBREADprintf("REPEAT OF %s: %d OCCURANCE AFT=%d\n", read_name_1, is_repeat, highest_score_occurence);
+ // SUBREADprintf("29NOV2016-BEST %s [%d : score %llu] LOC = %d,%d\n", read_name_1, highest_score_occurence, final_SCORE, r1_best_id, r2_best_id);
+ }
}
}
}
//SUBREADprintf("Highest score = %llu, Occurance = %d\n", highest_score , highest_score_occurence);
// Then, copy the (R1, R2) that have the highest score into the align_res buffer.
+
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-WRITE-BOTHMAOOED-%s-THRE %d ; occu=%d\n", read_name_1, thread_context -> thread_id, highest_score_occurence);
+
if(highest_score_occurence <= 1 || global_context -> config.report_multi_mapping_reads){
int is_break_even = 0;
@@ -2840,7 +2980,7 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
int r2_matched = final_MATCH_buffer2[r2_best_id];
if(r2_matched < 1) continue;
- if(final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads + r2_best_id] == highest_score &&
+ if(final_SCORE_buffer[r1_best_id * global_context -> config.multi_best_reads * MAX_ALIGNMENT_PER_ANCHOR + r2_best_id] == highest_score &&
output_cursor < global_context -> config.reported_multi_best_reads){
realignment_result_t * r1_realign = final_realignments + final_realignment_index1[r1_best_id];
realignment_result_t * r2_realign = final_realignments + final_realignment_index2[r2_best_id];
@@ -2858,16 +2998,18 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
r1_realign -> MAPQ_adjustment = r1_step2_locations + final_MISMATCH_buffer1[r1_best_id];
r2_realign -> MAPQ_adjustment = r2_step2_locations + final_MISMATCH_buffer2[r2_best_id];
- //SUBREADprintf("R1R2_Rep = %d,%d\n", r1_step2_locations,r2_step2_locations);
+ //SUBREADprintf("29NOV2016-WOUT %s [%d / %d : score %llu] LOC = %d,%d\n", read_name_1, output_cursor , highest_score_occurence, highest_score, r1_best_id, r2_best_id);
write_realignments_for_fragment(global_context, thread_context, &out_context, current_read_number, r1_realign, r2_realign, read_name_1, read_name_2, read_text_1, read_text_2, qual_text_1, qual_text_2, read_len_1, read_len_2, highest_score_occurence, output_cursor, non_informative_subreads_r1, non_informative_subreads_r2);
output_cursor ++;
}
}
}
- assert(output_cursor >= highest_score_occurence - 1);
+ assert(output_cursor == highest_score_occurence );
}
}
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-WRITE-UNMAP?-%s-THRE %d\n", read_name_1, thread_context -> thread_id);
if(output_cursor<1) {
strcpy(read_text_1, raw_read_text_1);
strcpy(read_text_2, raw_read_text_2);
@@ -2884,6 +3026,8 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
sqr_read_number=0;
}
}
+ //#warning ">>>>>>> COMMENT THIS <<<<<<<"
+ //printf("OCT27-FIN-%s-THRE %d\n", read_name_1, thread_context -> thread_id);
//bigtable_release_result(global_context, thread_context, current_read_number, 1);
}
@@ -2917,6 +3061,8 @@ int do_iteration_two(global_context_t * global_context, thread_context_t * threa
usleep(100);
}
}
+
+ //SUBREADprintf("29NOV2016-QUIT TH-%d\n", thread_context->thread_id );
return 0;
}
@@ -2985,8 +3131,17 @@ int do_voting(global_context_t * global_context, thread_context_t * thread_conte
int is_reversed, applied_subreads = 0, v1_all_subreads=0, v2_all_subreads=0;
ret = fetch_next_read_pair(global_context, thread_context, ginp1, ginp2, &read_len_1, &read_len_2, read_name_1, read_name_2, read_text_1, read_text_2, qual_text_1, qual_text_2,1, &current_read_number);
- //SUBREADprintf("DO_VOTE:%llu BY THREAD %d\n", current_read_number, thread_context -> thread_id);
- if(current_read_number < 0) break;
+
+ if(current_read_number < 0){
+ // #warning ">>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<"
+ // printf("OCT27-STEPB-QUIT :T%d\n", thread_context -> thread_id);
+ break;
+ }
+
+ //#warning ">>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<"
+ //printf("OCT27-STEPB - %s :T%d\n", read_name_1, thread_context -> thread_id);
+
+
//SUBREADprintf("RL=%d,%d\n", read_len_1, read_len_2);
@@ -3113,7 +3268,13 @@ int do_voting(global_context_t * global_context, thread_context_t * thread_conte
//SUBREADprintf("P%d %llu %s\n", is_reversed, current_read_number, read_name_1);
//#warning ">>>>>>> DISABLE THE FOLLOING BLOCK <<<<<<"
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:7921:2517#ACTTGA", read_name_1) ==0 ) {
+ if(0) { int dx1, vtab_items1=0, vtab_items2=0;
+ for(dx1 = 0; dx1 < GENE_VOTE_TABLE_SIZE; dx1++) vtab_items1+=vote_1->items[dx1];
+ for(dx1 = 0; dx1 < GENE_VOTE_TABLE_SIZE; dx1++) vtab_items2+=vote_2->items[dx1];
+ printf("OCT27-STEPA-%s-P%d, V1 %d, V2 %d\n", read_name_1, is_reversed, vtab_items1, vtab_items2);
+ }
+ //#warning ">>>>>>> DISABLE THE FOLLOING BLOCK <<<<<<"
+ if(0 && FIXLENstrcmp("R010442852", read_name_1) ==0 ) {
SUBREADprintf(">>>%llu<<<\n%s [%d] %s\n%s [%d] %s\n", current_read_number, read_name_1, read_len_1, read_text_1, read_name_2, read_len_2, read_text_2);
SUBREADprintf(" ======= PAIR %s = %llu ; NON_INFORMATIVE = %d, %d =======\n", read_name_1, current_read_number, vote_1 -> noninformative_subreads, vote_2 -> noninformative_subreads);
print_votes(vote_1, global_context -> config.index_prefix);
@@ -3290,17 +3451,7 @@ int run_maybe_threads(global_context_t *global_context, int task)
void * thr_parameters [5];
int ret_value =0;
- /*
- if(task==STEP_VOTING)
- print_in_box(80,0,0, "Map %s...", global_context->input_reads.is_paired_end_reads?"fragments":"reads");
- else if(task == STEP_ITERATION_ONE)
- print_in_box(80,0,0, "Detect indels%s...", global_context->config.do_breakpoint_detection?" and junctions":"");
- else if(task == STEP_ITERATION_TWO)
- print_in_box(80,0,0, "Finish the %'llu %s...", global_context -> processed_reads_in_chunk, global_context->input_reads.is_paired_end_reads?"fragments":"reads");
- */
-
- if(global_context->config.all_threads<2)
- {
+ if(global_context->config.all_threads<2) {
thr_parameters[0] = global_context;
thr_parameters[1] = NULL;
thr_parameters[2] = &task;
@@ -3308,9 +3459,14 @@ int run_maybe_threads(global_context_t *global_context, int task)
thr_parameters[4] = &ret_value;
run_in_thread(thr_parameters);
- }
- else
- {
+ if(STEP_VOTING == task){
+ // sort and merge events from all threads and the global event space.
+ sort_global_event_table(global_context);
+ // sort the event entry table at each location.
+ //#warning "==== UNCOMMENT NEXT LINE ======"
+ sort_junction_entry_table(global_context);
+ }
+ } else {
int current_thread_no ;
thread_context_t thread_contexts[64];
int ret_values[64];
@@ -3321,6 +3477,7 @@ int run_maybe_threads(global_context_t *global_context, int task)
global_context -> last_written_fragment_number = 0;
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
thread_contexts[current_thread_no].all_mapped_reads = 0;
+ thread_contexts[current_thread_no].all_correct_PE_reads = 0;
}
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
@@ -3344,22 +3501,28 @@ int run_maybe_threads(global_context_t *global_context, int task)
{
pthread_join(thread_contexts[current_thread_no].thread, NULL);
+ if(STEP_ITERATION_TWO == task) global_context -> all_correct_PE_reads += thread_contexts[current_thread_no].all_correct_PE_reads;
ret_value += *(ret_values + current_thread_no);
if(ret_value)break;
}
- if(STEP_ITERATION_TWO == task)
+ if(STEP_ITERATION_TWO == task){
finalise_buffered_output_file(global_context);
+ }
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
{
if(thread_contexts[current_thread_no].output_buffer_item > 0)
SUBREADprintf("ERROR: UNFINISHED OUTPUT!\n");
thread_context_t * thread_context = thread_contexts+current_thread_no;
-
- finalise_indel_thread(global_context, thread_context, task);
- finalise_junction_thread(global_context, thread_context, task);
global_context -> all_mapped_reads += thread_context -> all_mapped_reads;
}
+
+ // sort and merge events from all threads and the global event space.
+ finalise_indel_and_junction_thread(global_context, thread_contexts, task);
+ if(STEP_VOTING == task){
+ // sort the event entry table at each location.
+ sort_junction_entry_table(global_context);
+ }
}
if(CORE_SOFT_BR_CHAR == '\r')
@@ -3488,7 +3651,6 @@ int read_chunk_circles(global_context_t *global_context)
reward_read_files(global_context, SEEK_SET);
int is_last_chunk = global_context -> processed_reads_in_chunk < global_context->config.reads_per_chunk;
- //SUBREADprintf("LAST_CHUNK=%d, INDEX_BLOCKS=%d\n", is_last_chunk, global_context->index_block_number );
if(global_context->index_block_number > 1 || is_last_chunk)
gehash_destory_fast(global_context -> current_index);
@@ -3498,17 +3660,22 @@ int read_chunk_circles(global_context_t *global_context)
}
- //sublog_printf(SUBLOG_STAGE_DEV1, SUBLOG_LEVEL_DEBUG, "%d reads have been processed in this chunk.", global_context -> processed_reads_in_chunk);
-
-
// after the voting step, all subread index blocks are released and all base index blocks are loaded at once.
+ double time_before_realign = miltime();
+ if(0 == chunk_no && global_context->config.exon_annotation_file[0] && global_context->config.do_breakpoint_detection){
+ ret = ret || load_known_junctions(global_context);
+ if(!ret){
+ // sort and merge events from all threads and the global event space.
+ sort_global_event_table(global_context);
+ // sort the event entry table at each location.
+ sort_junction_entry_table(global_context);
+ }
+ }
- //reward_read_files(global_context, SEEK_SET);
- //ret = run_maybe_threads(global_context, STEP_ITERATION_ONE);
- double time_before_realign = miltime();
- ret = anti_supporting_read_scan(global_context);
+ ret = ret || anti_supporting_read_scan(global_context);
remove_neighbour(global_context);
+
reward_read_files(global_context, SEEK_SET);
double period_before_realign = miltime() - time_before_realign;
global_context -> timecost_before_realign += period_before_realign;
@@ -3531,10 +3698,12 @@ int read_chunk_circles(global_context_t *global_context)
if(ret) return ret;
- if(global_context -> processed_reads_in_chunk < global_context->config.reads_per_chunk)
+ if(global_context -> processed_reads_in_chunk < global_context->config.reads_per_chunk ||
+ (global_context -> config.is_BAM_output && global_context -> output_bam_writer -> is_internal_error) ||
+ (global_context -> output_sam_is_full))
// base value indexes loaded in the last circle are not destroyed and are used in writting the indel VCF.
// the indexes will be destroyed in destroy_global_context
- break;
+ break;
clean_context_after_chunk(global_context);
chunk_no++;
@@ -3542,25 +3711,9 @@ int read_chunk_circles(global_context_t *global_context)
free(global_context -> current_index);
- // load all array index blocks at once.
if(global_context -> config.is_third_iteration_running)
- {
- /*
- for(block_no = 0; block_no< global_context->index_block_number; block_no++)
- {
- char tmp_fname[MAX_FILE_NAME_LENGTH];
- sprintf(tmp_fname, "%s.%02d.%c.array", global_context->config.index_prefix, block_no, global_context->config.space_type == GENE_SPACE_COLOR?'c':'b');
- if(gvindex_load(&global_context -> all_value_indexes[block_no], tmp_fname)) return -1;
- }
- */
-
finalise_long_insertions(global_context);
- /*
- for(block_no = 0; block_no< global_context->index_block_number; block_no++)
- gvindex_destory(&global_context -> all_value_indexes[block_no]);
- */
- }
return 0;
}
@@ -3628,6 +3781,10 @@ int print_configuration(global_context_t * context)
print_in_box(80, 0, 0, "Output method : STDOUT (%s)" , context->config.is_BAM_output?"BAM":"SAM");
print_in_box(80, 0, 0, "Index name : %s", context->config.index_prefix);
+ if(context->config.exon_annotation_file[0])
+ print_in_box(80, 0, 0, "Annotations : %s (%s)", context->config.exon_annotation_file, context->config.exon_annotation_file_type==FILE_TYPE_GTF?"GTF":"SAF");
+ print_in_box(80, 0, 0, "");
+ print_in_box(80, 0, 1, "------------------------------------");
print_in_box(80, 0, 0, "");
print_in_box(80, 0, 0, " Threads : %d", context->config.all_threads);
print_in_box(80, 0, 0, " Phred offset : %d", (context->config.phred_score_format == FASTQ_PHRED33)?33:64);
@@ -3683,7 +3840,7 @@ int print_configuration(global_context_t * context)
char tbuf[90];
char_strftime(tbuf);
- print_in_box(80,1,1,"Running (%s)", tbuf);
+ print_in_box(80,1,1,"Running (%s, pid=%d)", tbuf, getpid());
print_in_box(80,0,1,"");
@@ -3753,6 +3910,82 @@ void write_sam_headers(global_context_t * context)
}
}
+#define EXONIC_REGION_RESOLUTION 16
+
+int is_pos_in_annotated_exon_regions(global_context_t * global_context, unsigned int pos){
+ int exonic_map_byte = pos / EXONIC_REGION_RESOLUTION / 8;
+ int exonic_map_bit = (pos / EXONIC_REGION_RESOLUTION) % 8;
+
+ return (global_context ->exonic_region_bitmap [exonic_map_byte] & (1<<exonic_map_bit))?1:0;
+
+}
+
+char * get_sam_chro_name_from_alias(HashTable * tab, char * anno_chro){
+ KeyValuePair * cursor = NULL;
+ long x1;
+ for(x1 = 0; x1 < tab -> numOfBuckets; x1 ++){
+ cursor = tab -> bucketArray[x1];
+ while(1){
+ if(NULL == cursor)break;
+ char * tab_anno_chro = cursor -> value;
+ if(strcmp(tab_anno_chro, anno_chro) == 0) return (char *)cursor -> key;
+ cursor = cursor -> next;
+ }
+ }
+ return NULL;
+}
+
+int do_anno_bitmap_add_feature(char * gene_name, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
+ global_context_t * global_context = context;
+
+ char tmp_chro_name[MAX_CHROMOSOME_NAME_LEN];
+ if(global_context -> sam_chro_to_anno_chr_alias){
+ char * sam_chro = get_sam_chro_name_from_alias(global_context -> sam_chro_to_anno_chr_alias, chro_name);
+ if(sam_chro!=NULL) chro_name = sam_chro;
+ }
+
+ int access_n = HashTableGet( global_context -> chromosome_table.read_name_to_index, chro_name ) - NULL;
+
+ if(access_n < 1){
+ if(chro_name[0]=='c' && chro_name[1]=='h' && chro_name[2]=='r'){
+ chro_name += 3;
+ }else{
+ strcpy(tmp_chro_name, "chr");
+ strcat(tmp_chro_name, chro_name);
+ chro_name = tmp_chro_name;
+ }
+ }
+
+ unsigned int exonic_map_start = linear_gene_position(&global_context->chromosome_table , chro_name, feature_start);
+ unsigned int exonic_map_stop = linear_gene_position(&global_context->chromosome_table , chro_name, feature_end);
+
+ if(exonic_map_start > 0xffffff00 || exonic_map_stop > 0xffffff00) return -1;
+
+ exonic_map_start -= exonic_map_start%EXONIC_REGION_RESOLUTION;
+ exonic_map_stop -= exonic_map_stop%EXONIC_REGION_RESOLUTION;
+
+ for(; exonic_map_start <= exonic_map_stop; exonic_map_start+=EXONIC_REGION_RESOLUTION){
+ int exonic_map_byte = exonic_map_start / EXONIC_REGION_RESOLUTION / 8;
+ int exonic_map_bit = (exonic_map_start / EXONIC_REGION_RESOLUTION) % 8;
+
+ global_context ->exonic_region_bitmap[exonic_map_byte] |= (1<<exonic_map_bit);
+ }
+ return 0;
+}
+
+int load_annotated_exon_regions(global_context_t * global_context){
+ int bitmap_size = (4096 / EXONIC_REGION_RESOLUTION / 8)*1024*1024;
+ global_context ->exonic_region_bitmap = malloc(bitmap_size);
+ memset( global_context ->exonic_region_bitmap , 0, bitmap_size );
+
+ int loaded_features = load_features_annotation(global_context->config.exon_annotation_file, global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, global_context->config.exon_annotation_feature_name_column, global_context, do_anno_bitmap_add_feature);
+ if(loaded_features < 0)return -1;
+ else{
+ print_in_box(80,0,0,"%d annotation records were loaded.\n", loaded_features);
+ return 0;
+ }
+}
+
int load_global_context(global_context_t * context)
{
char tmp_fname [MAX_FILE_NAME_LENGTH];
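The hunk above adds a one-bit-per-16-bases map of annotated exon positions. Its fixed allocation of (4096 / 16 / 8) * 1024 * 1024 bytes is 32 MiB, enough for every 32-bit linear position. A minimal stand-alone sketch of the same indexing scheme follows. The names region_bitmap_t, region_bitmap_init, region_bitmap_mark, region_bitmap_test and region_bitmap_free are hypothetical and do not appear in Subread. Sizing the map from a caller-supplied maximum position and the bounds checks are assumptions added here.

```c
#include <assert.h>
#include <stdlib.h>

#define REGION_RESOLUTION 16   /* bases per bit, mirroring EXONIC_REGION_RESOLUTION */

/* Hypothetical stand-alone bitmap; the patch keeps its bitmap in global_context_t. */
typedef struct {
    unsigned char *bits;
    size_t n_bytes;
} region_bitmap_t;

/* Allocate a zeroed bitmap covering positions 0..max_pos. Returns 0 on success. */
int region_bitmap_init(region_bitmap_t *bm, unsigned int max_pos) {
    bm->n_bytes = (size_t)max_pos / REGION_RESOLUTION / 8 + 1;
    bm->bits = calloc(bm->n_bytes, 1);
    return bm->bits ? 0 : -1;
}

/* Set the bit of every 16-base block that overlaps [start, end] (inclusive). */
void region_bitmap_mark(region_bitmap_t *bm, unsigned int start, unsigned int end) {
    assert(start <= end);
    unsigned int first = start / REGION_RESOLUTION;
    unsigned int last = end / REGION_RESOLUTION;
    unsigned int block;
    for (block = first; block <= last; block++) {
        if (block / 8 < bm->n_bytes)   /* ignore blocks beyond the allocation */
            bm->bits[block / 8] |= (unsigned char)(1u << (block % 8));
    }
}

/* Return 1 if the block containing pos was marked, 0 otherwise. */
int region_bitmap_test(const region_bitmap_t *bm, unsigned int pos) {
    unsigned int block = pos / REGION_RESOLUTION;
    if (block / 8 >= bm->n_bytes) return 0;
    return (bm->bits[block / 8] >> (block % 8)) & 1;
}

void region_bitmap_free(region_bitmap_t *bm) {
    free(bm->bits);
    bm->bits = NULL;
    bm->n_bytes = 0;
}
```

Because one bit stands for a whole 16-base block, marking 100..130 also reports 96..99 and 131..143 as covered. The patch makes the same trade of precision for memory.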
@@ -3789,6 +4022,7 @@ int load_global_context(global_context_t * context)
//sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"Unable to open '%s' as input. Please check if it exists, you have the permission to read it, and it is in the correct format.\n", context->config.second_read_file);
return -1;
}
+
context -> config.max_vote_combinations = 3;
context -> config.multi_best_reads = 3;
context -> config.max_vote_simples = 64;
@@ -3881,6 +4115,12 @@ int load_global_context(global_context_t * context)
sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"Unable top open index '%s'. Please make sure that the correct prefix is specified and you have the permission to read these files. For example, if there are files '/opt/my_index.reads', '/opt/my_index.files' and etc, the index prefix should be specified as '/opt/my_index' without any suffix. \n", context->config.index_prefix);
return -1;
}
+
+ context->current_index_block_number = 0;
+ if(load_offsets(&context->chromosome_table, context->config.index_prefix)){
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"\nThe index was built by using an old version of Subread; its format is no longer supported. Please use the current version of the index builder to rebuild it.\n");
+ return 1;
+ }
if(context->config.space_type == GENE_SPACE_COLOR)
sprintf(tmp_fname, "%s.00.c.tab", context->config.index_prefix);
@@ -3907,10 +4147,6 @@ int load_global_context(global_context_t * context)
}
}
- context->current_index_block_number = 0;
- load_offsets(&context->chromosome_table, context->config.index_prefix);
-
-
if(context->config.report_sam_file)
write_sam_headers(context);
@@ -3933,9 +4169,19 @@ int load_global_context(global_context_t * context)
memset( context->all_value_indexes , 0 , 100 * sizeof(gene_value_index_t));
+ context -> sam_chro_to_anno_chr_alias = NULL;
+ if(context->config.exon_annotation_file[0]){
+ if( load_annotated_exon_regions( context ) ) return -1;
+ if(context->config.exon_annotation_alias_file[0])
+ context -> sam_chro_to_anno_chr_alias = load_alias_table(context->config.exon_annotation_alias_file);
+ } else context -> exonic_region_bitmap = NULL;
return 0;
}
+
+
+
+
int init_modules(global_context_t * context)
{
sublog_printf(SUBLOG_STAGE_DEV1, SUBLOG_LEVEL_DEBUG, "init_modules: begin");
@@ -3957,19 +4203,30 @@ int destroy_modules(global_context_t * context)
int destroy_global_context(global_context_t * context)
{
- int xk1, block_no;
+ int xk1, block_no, ret = 0;
+
+ if(context -> exonic_region_bitmap) free(context -> exonic_region_bitmap);
for(block_no = 0; block_no< context->index_block_number; block_no++)
gvindex_destory(&context -> all_value_indexes[block_no]);
if(context->output_sam_fp)
{
+ if(context -> output_sam_is_full){
+ unlink(context->config.output_prefix);
+ SUBREADprintf("\nERROR: cannot finish the SAM file! Please check the disk space in the output directory.\nNo output file was generated.\n");
+ ret = 1;
+ }
fclose(context -> output_sam_fp);
- // free(context -> output_sam_inner_buffer);
}
if(context->output_bam_writer)
{
SamBam_writer_close(context->output_bam_writer);
+ if(context->output_bam_writer -> is_internal_error){
+ unlink(context->config.output_prefix);
+ SUBREADprintf("\nERROR: cannot finish the BAM file! Please check the disk space in the output directory.\nNo output file was generated.\n");
+ ret = 1;
+ }
free(context->output_bam_writer);
context->output_bam_writer=NULL;
}
@@ -3982,10 +4239,10 @@ int destroy_global_context(global_context_t * context)
destroy_offsets(&context->chromosome_table);
finalise_bigtable_results(context);
- if((context -> will_remove_input_file & 1) && (memcmp(context ->config.first_read_file, "./core-temp", 11) == 0)) unlink(context ->config.first_read_file);
- if((context -> will_remove_input_file & 2) && (memcmp(context ->config.second_read_file, "./core-temp", 11) == 0)) unlink(context ->config.second_read_file);
+ if((context -> will_remove_input_file & 1) && strstr(context ->config.first_read_file, "/core-temp")) unlink(context ->config.first_read_file);
+ if((context -> will_remove_input_file & 2) && strstr(context ->config.second_read_file, "/core-temp")) unlink(context ->config.second_read_file);
- return 0;
+ return ret;
}
@@ -4372,35 +4629,26 @@ void quick_sort(void * arr, int arr_size, int compare (void * arr, int l, int r)
void quick_sort_run(void * arr, int spot_low,int spot_high, int compare (void * arr, int l, int r), void exchange(void * arr, int l, int r))
{
+ // https://en.wikipedia.org/wiki/Quicksort
+ // Lomuto partition scheme
+
int pivot,j,i;
- if(spot_high-spot_low<1) return;
- pivot = (spot_low + spot_high)/2;
+ if(spot_high <= spot_low) return;
+ pivot = spot_high;
i = spot_low;
- j = spot_high;
- while(i<=j)
- {
- if(compare(arr, i, pivot) <0)
+ for(j = spot_low; j < spot_high; j++)
+ if(compare(arr, j, pivot)<=0)
{
+ exchange(arr,i,j);
i++;
- continue;
}
- if(compare(arr, j, pivot)>0)
- {
- j--;
- continue;
- }
+ exchange(arr, i, spot_high);
- if(i!=j)
- exchange(arr,i,j);
- i++;
- j--;
- }
-
- quick_sort_run(arr, spot_low, j, compare, exchange);
- quick_sort_run(arr, i, spot_high, compare, exchange);
+ quick_sort_run(arr, spot_low, i-1, compare, exchange);
+ quick_sort_run(arr, i+1, spot_high, compare, exchange);
}
void basic_sort_run(void * arr, int start, int items, int compare (void * arr, int l, int r), void exchange(void * arr, int l, int r)){
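The hunk above replaces a middle-pivot partition with the Lomuto scheme. The compare callback takes array indices, not values, so the pivot element has to stay at a fixed index while partitioning. With the pivot at spot_high, the loop only exchanges indices below spot_high, so that holds. In the removed code an exchange could move the element stored at the middle index, which may be what motivated the rewrite. The sketch below shows the same scheme on an int array. lomuto_sort, cmp_int and swap_int are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Index-based comparison: <0, 0 or >0, in the style of the callbacks in the patch. */
int cmp_int(void *arr, int l, int r) {
    int *a = arr;
    return (a[l] > a[r]) - (a[l] < a[r]);
}

/* Index-based exchange of two elements. */
void swap_int(void *arr, int l, int r) {
    int *a = arr;
    int tmp = a[l];
    a[l] = a[r];
    a[r] = tmp;
}

/* Sort arr[low..high] (inclusive bounds) with the Lomuto partition scheme. */
void lomuto_sort(void *arr, int low, int high,
                 int compare(void *arr, int l, int r),
                 void exchange(void *arr, int l, int r)) {
    assert(arr != 0);
    if (high <= low) return;
    int i = low, j;
    /* The pivot stays at index high for the whole loop: only indices < high are exchanged. */
    for (j = low; j < high; j++) {
        if (compare(arr, j, high) <= 0) {
            exchange(arr, i, j);
            i++;
        }
    }
    exchange(arr, i, high);          /* move the pivot to its final position */
    lomuto_sort(arr, low, i - 1, compare, exchange);
    lomuto_sort(arr, i + 1, high, compare, exchange);
}
```

As with any last-element pivot, already sorted input drives the recursion depth to the array length, which callers with very large sorted tables would need to keep in mind.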
@@ -4442,16 +4690,24 @@ void merge_sort(void * arr, int arr_size, int compare (void * arr, int l, int r)
merge_sort_run(arr, 0, arr_size, compare, exchange, merge);
}
-unsigned int calc_end_pos(unsigned int p, char * cigar, unsigned int * all_skipped_len){
+unsigned int calc_end_pos(unsigned int p, char * cigar, unsigned int * all_skipped_len, int * is_exonic_regions, global_context_t * global_context){
unsigned int cursor = p, tmpi=0;
int nch, cigar_cursor;
for(cigar_cursor = 0; 0!=(nch = cigar[cigar_cursor]); cigar_cursor++){
if(isdigit(nch)){
tmpi = tmpi * 10 + (nch - '0');
}else{
- if(nch == 'M' || nch == 'N' || nch == 'D'){
+ if((nch == 'S' && cursor == p) || nch == 'M' || nch == 'N' || nch == 'D'){
+ if(nch == 'M' && global_context -> exonic_region_bitmap){
+ if(global_context -> config.do_breakpoint_detection){
+ if(is_pos_in_annotated_exon_regions(global_context, cursor) == 0 || is_pos_in_annotated_exon_regions(global_context, cursor + tmpi - 1) == 0) ( * is_exonic_regions) = 0;
+ } else {
+ if(is_pos_in_annotated_exon_regions(global_context, cursor + tmpi/2) == 0) ( * is_exonic_regions) = 0;
+ }
+ }
+
cursor += tmpi;
- if(nch == 'N' || nch == 'D') *all_skipped_len += tmpi;
+ if(nch == 'N' || nch == 'D')(*all_skipped_len) += tmpi;
}
tmpi = 0;
}
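The CIGAR walk in the hunk above can be followed without the exon-bitmap checks. The sketch below advances a reference cursor on 'M', 'N' and 'D', and on an 'S' seen while the cursor is still at the start position, as the new condition does. It adds 'N' and 'D' lengths to the skipped total. cigar_end_pos is a hypothetical name. Returning the cursor is an assumption, since the end of calc_end_pos is not visible in this hunk.

```c
#include <assert.h>
#include <ctype.h>

/* Walk a CIGAR string from reference position `start` and return the cursor
 * after the last reference-consuming operation. Lengths of 'N' and 'D'
 * operations are added to *skipped_len. */
unsigned int cigar_end_pos(unsigned int start, const char *cigar, unsigned int *skipped_len) {
    assert(cigar != 0 && skipped_len != 0);
    unsigned int cursor = start, len = 0;
    const char *c;
    for (c = cigar; *c; c++) {
        if (isdigit((unsigned char)*c)) {
            len = len * 10 + (unsigned int)(*c - '0');   /* accumulate the operation length */
        } else {
            /* 'S' counts only while the cursor has not moved, i.e. a leading soft clip. */
            if ((*c == 'S' && cursor == start) || *c == 'M' || *c == 'N' || *c == 'D') {
                cursor += len;
                if (*c == 'N' || *c == 'D') *skipped_len += len;
            }
            len = 0;
        }
    }
    return cursor;
}
```

For "3S7M2I4M1D6M4S" starting at 100, the cursor moves through 103, 110, 114, 115 and 121. The insertion and the trailing soft clip leave it unchanged, and the skipped total is 1.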
@@ -4460,22 +4716,22 @@ unsigned int calc_end_pos(unsigned int p, char * cigar, unsigned int * all_skipp
}
-void test_PE_and_same_chro_cigars(global_context_t * global_context , unsigned int pos1, unsigned int pos2, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * cigar1, char * cigar2, char *read_name){
- char * r1_chr, * r2_chr;
+void test_PE_and_same_chro_cigars(global_context_t * global_context , unsigned int pos1, unsigned int pos2, int * is_exonic_regions, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * cigar1, char * cigar2, char *read_name){
+ char * r1_chr = NULL, * r2_chr = NULL;
int r1_pos, r2_pos;
(*is_same_chromosome) = 0;
(*is_PE_distance) = 0;
+ (*is_exonic_regions) = 1;
locate_gene_position(pos1, &global_context -> chromosome_table, & r1_chr, & r1_pos);
locate_gene_position(pos2, &global_context -> chromosome_table, & r2_chr, & r2_pos);
-
if(r1_chr == r2_chr){
unsigned int skip_1 = 0;
unsigned int skip_2 = 0;
- unsigned int r1_end_pos = calc_end_pos(pos1, cigar1, &skip_1);
- unsigned int r2_end_pos = calc_end_pos(pos2, cigar2, &skip_2);
+ unsigned int r1_end_pos = calc_end_pos(pos1, cigar1, &skip_1, is_exonic_regions, global_context );
+ unsigned int r2_end_pos = calc_end_pos(pos2, cigar2, &skip_2, is_exonic_regions, global_context );
unsigned int tlen = max(r1_end_pos, r2_end_pos) - min(pos1, pos2);
if(tlen > skip_1) tlen -= skip_1;
@@ -4488,8 +4744,8 @@ void test_PE_and_same_chro_cigars(global_context_t * global_context , unsigned i
}
}
-void test_PE_and_same_chro_align(global_context_t * global_context , realignment_result_t * res1, realignment_result_t * res2, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * read_name){
- return test_PE_and_same_chro_cigars(global_context, res1 -> first_base_position, res2 -> first_base_position, is_PE_distance, is_same_chromosome , read_len_1 , read_len_2, res1 -> cigar_string, res2 -> cigar_string, read_name);
+void test_PE_and_same_chro_align(global_context_t * global_context , realignment_result_t * res1, realignment_result_t * res2, int * is_exonic_regions, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * read_name){
+ return test_PE_and_same_chro_cigars(global_context, res1 -> first_base_position, res2 -> first_base_position, is_exonic_regions, is_PE_distance, is_same_chromosome , read_len_1 , read_len_2, res1 -> cigar_string, res2 -> cigar_string, read_name);
}
diff --git a/src/core.h b/src/core.h
index 90cfa2b..acd0f3a 100644
--- a/src/core.h
+++ b/src/core.h
@@ -139,7 +139,11 @@ typedef struct{
// input_scheme
char first_read_file[MAX_FILE_NAME_LENGTH];
char second_read_file[MAX_FILE_NAME_LENGTH];
- char medium_result_prefix[MAX_FILE_NAME_LENGTH];
+ char exon_annotation_file[MAX_FILE_NAME_LENGTH];
+ char exon_annotation_alias_file[MAX_FILE_NAME_LENGTH];
+ int exon_annotation_file_type;
+ char exon_annotation_gene_id_column[MAX_READ_NAME_LEN];
+ char exon_annotation_feature_name_column[MAX_READ_NAME_LEN];
short read_trim_5;
@@ -266,6 +270,7 @@ typedef struct{
#define CORE_EXPERIMENT_DNASEQ 1000
#define CORE_EXPERIMENT_RNASEQ 2000
+#define PRINT_BOX_WRAPPED 4
#define PRINT_BOX_NOCOLOR_FOR_COLON 2
#define PRINT_BOX_CENTER 1
@@ -305,11 +310,10 @@ typedef struct{
short junction_flanking_left;
short junction_flanking_right;
- unsigned char event_type;
char indel_at_junction;
char is_negative_strand; // this only works to junction detection, according to 'GT/AG' or 'CT/AC' donors. This only applys to junctions.
char is_strand_jumped; // "strand jumped" means that the left and right sides are on different strands. This only applys to fusions.
- char is_donor_found; // only for junctions: GT/AG is found at the location.
+ char is_donor_found_or_annotation; // only for junctions: GT/AG is found at the location. 1: found, 0:not found: 64: from annotation (thus unknown)
// Also, if "is_strand_jumped" is true, all coordinates (e.g., splicing points, cover_start, cover_end, etc) are on "reversed read" view.
char small_side_increasing_coordinate;
@@ -325,6 +329,7 @@ typedef struct{
unsigned short anti_supporting_reads;
unsigned short final_counted_reads;
unsigned short final_reads_mismatches;
+ unsigned char event_type;
unsigned int global_event_id;
float event_quality;
@@ -373,6 +378,7 @@ typedef struct{
short final_quality;
short chromosomal_length;
short MAPQ_adjustment;
+ int known_junction_supp;
} realignment_result_t;
#define BUCKETED_TABLE_INIT_ITEMS 3
@@ -402,6 +408,7 @@ typedef struct {
int item_index_j;
unsigned int mapping_position;
int major_half_votes;
+ unsigned short read_start_base;
}simple_mapping_t;
typedef struct{
@@ -439,6 +446,7 @@ typedef struct {
typedef struct{
+ unsigned long long all_correct_PE_reads;
int thread_id;
pthread_t thread;
@@ -478,6 +486,7 @@ typedef struct{
FILE * output_sam_fp;
FILE * long_insertion_FASTA_fp;
char * output_sam_inner_buffer;
+ int output_sam_is_full;
// running contexts
void * module_contexts[5];
@@ -536,7 +545,8 @@ typedef struct{
// per chunk parameters
subread_read_number_t read_block_start;
-
+ char * exonic_region_bitmap;
+ HashTable * sam_chro_to_anno_chr_alias;
} global_context_t;
@@ -639,4 +649,6 @@ int is_valid_digit(char * optarg, char * optname);
int is_valid_digit_range(char * optarg, char * optname, int min, int max_inc);
int is_valid_float(char * optarg, char * optname);
int exec_cmd(char * cmd, char * outstr, int out_limit);
+int is_pos_in_annotated_exon_regions(global_context_t * global_context, unsigned int pos);
+char * get_sam_chro_name_from_alias(HashTable * tab, char * anno_chro);
#endif
diff --git a/src/coverage_calc.c b/src/coverage_calc.c
index eb84d2e..79eb304 100644
--- a/src/coverage_calc.c
+++ b/src/coverage_calc.c
@@ -2,6 +2,7 @@
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
+#include <assert.h>
#include "subread.h"
#include "core.h"
#include "HelperFunctions.h"
@@ -11,10 +12,12 @@
#include "hashtable.h"
#define COVERAGE_MAX_INT 0x7ffffff0
+#define MAX_FRAGMENT_LENGTH 3000
unsigned long long all_counted;
typedef unsigned int coverage_bin_entry_t;
int is_BAM_input = 0;
int max_M = 10;
+int paired_end = 0;
char input_file_name[300];
char output_file_name[300];
HashTable * cov_bin_table;
@@ -48,6 +51,8 @@ void calcCount_usage()
SUBREADputs("");
SUBREADputs("Optional arguments:");
SUBREADputs("");
+ SUBREADputs(" -p The input file contains paired-end reads.");
+ SUBREADputs("");
SUBREADputs(" --maxMOp <int> Maximum number of 'M' operations allowed in a CIGAR string.");
SUBREADputs(" 10 by default. Both 'X' and '=' are treated as 'M' and adjacent");
SUBREADputs(" 'M' operations are merged in the CIGAR string.");
@@ -133,7 +138,9 @@ int covCalc()
HashTableSetKeyComparisonFunction(cov_bin_table , fc_strcmp_chro);
HashTableSetDeallocationFunctions(cov_bin_table , free, free);
-
+ unsigned int hit_locs[2][MAX_FRAGMENT_LENGTH], reads = 0;
+ coverage_bin_entry_t * chrbin12[2];
+ int hits1,hits2;
SamBam_FILE * in_fp = SamBam_fopen(input_file_name, is_BAM_input?SAMBAM_FILE_BAM:SAMBAM_FILE_SAM);
char * line_buffer = malloc(3000);
@@ -147,10 +154,10 @@ int covCalc()
}
else
{
- char * Chros[FC_CIGAR_PARSER_ITEMS];
- unsigned int Staring_Points[FC_CIGAR_PARSER_ITEMS];
- unsigned short Staring_Read_Points[FC_CIGAR_PARSER_ITEMS];
- unsigned short Section_Lengths[FC_CIGAR_PARSER_ITEMS];
+ char * Chros[ max_M ];
+ unsigned int Staring_Points[max_M];
+ unsigned short Staring_Read_Points[max_M];
+ unsigned short Section_Lengths[max_M];
int flags=0, x1, is_junc = 0;
char cigar_str[200];
@@ -159,6 +166,7 @@ int covCalc()
cigar_str[0]=0;
chro[0]=0;
+ if(reads % 2 == 0) hits1= hits2 = 0;
get_read_info(line_buffer, chro, &pos, cigar_str, &flags);
if(flags & 4) continue;
@@ -172,8 +180,35 @@ int covCalc()
coverage_bin_entry_t * chrbin = (coverage_bin_entry_t*) bin_entry[0];
unsigned int chrlen = (void *)( bin_entry[1]) - NULL;
int cigar_sections = RSubread_parse_CIGAR_string(chro, pos, cigar_str, max_M, Chros, Staring_Points, Staring_Read_Points, Section_Lengths, &is_junc);
- for(x1 = 0; x1 < cigar_sections; x1++)
- {
+ if(paired_end) {
+ int * this_hits = (reads%2)?&hits2:&hits1;
+ unsigned int * this_hit_locs = &(hit_locs[reads%2][0]);
+ for(x1 = 0; x1 < cigar_sections; x1++){
+ unsigned int x2,x3;
+
+ //if(strcmp( cigar_str, "8S2M1D14M4D5M1D9M1I14M1D9M4D10M1D23M1I4M1D9M1I9M4D5M4D2M" )==0){
+ // SUBREADprintf("Cigar [%d] = %u ~ + %u ; this_hits=%d\n", x1, Staring_Points[x1], Section_Lengths[x1], *this_hits);
+ //}
+
+ for(x2 = Staring_Points[x1]; x2<Staring_Points[x1]+Section_Lengths[x1]; x2++){
+ int found = 0;
+ for(x3=0; x3<*this_hits; x3++){
+ if(this_hit_locs[x3] == x2){
+ found =1;
+ break;
+ }
+ }
+ if(!found) this_hit_locs[(*this_hits)++] = x2;
+ if( *this_hits >= MAX_FRAGMENT_LENGTH){
+ SUBREADprintf("ERROR: read is too long : %s!\n", cigar_str);
+ return -1;
+ }
+ }
+ }
+
+ if(* this_hits > 0) assert(chrbin);
+ chrbin12[reads%2] = chrbin;
+ } else for(x1 = 0; x1 < cigar_sections; x1++) {
unsigned int x2;
for(x2 = Staring_Points[x1]; x2<Staring_Points[x1]+Section_Lengths[x1]; x2++)
{
@@ -189,6 +224,19 @@ int covCalc()
}
}
}
+ if(reads % 2 == 1){
+ int r,x1;
+ for(r = 0; r < 2; r++){
+ int hits = r?hits2:hits1;
+ unsigned int * this_hit_locs = &hit_locs[r][0];
+ if(hits>0)assert(chrbin12[r]);
+ for(x1 = 0; x1<hits;x1++){
+ // SUBREADprintf("pos[%d : %d,%d]=%u %p\n",x1,hits1,hits2, this_hit_locs[x1], this_hit_locs);
+ chrbin12[r][this_hit_locs[x1]]++;
+ }
+ }
+ }
+ reads ++;
}
}
@@ -250,7 +298,7 @@ int cov_calc_main(int argc, char ** argv)
opterr=1;
optopt=63;
- while ((c = getopt_long (argc, argv, "BM:i:o:?", cov_calc_long_options, &option_index)) != -1)
+ while ((c = getopt_long (argc, argv, "BpM:i:o:?", cov_calc_long_options, &option_index)) != -1)
switch(c)
{
case 'M':
@@ -258,6 +306,9 @@ int cov_calc_main(int argc, char ** argv)
exit(-1);
max_M = atoi(optarg);
break;
+ case 'p':
+ paired_end = 1;
+ break;
case 'i':
strcpy(input_file_name, optarg);
break;
diff --git a/src/gene-algorithms.c b/src/gene-algorithms.c
index 2e0ca36..cabadf9 100644
--- a/src/gene-algorithms.c
+++ b/src/gene-algorithms.c
@@ -422,11 +422,11 @@ unsigned int linear_gene_position(const gene_offset_t* offsets , char *chro_name
n = HashTableGet(offsets -> read_name_to_index, chro_name)-NULL;
- //printf("NAME INDEL : %ld ; n=%d\n", offsets -> read_name_to_index -> numOfElements, n );
+ //printf("09NOV: GET '%s' = %d\n", chro_name, n);
if(n<1) return 0xffffffff;
-
if(n>1)
ret = offsets->read_offsets[n-2];
+ ret += offsets -> padding ;
return ret + chro_pos;
}
@@ -487,10 +487,13 @@ int locate_gene_position_max(unsigned int linear, const gene_offset_t* offsets ,
*chro_name = (char *)offsets->read_names+n*MAX_CHROMOSOME_NAME_LEN;
+ //#warning ">>>>>>>>>>>>>>NOT <<<<<<<<<<<<<<<<<"
+ //printf("OCT27-NULL_LOCATION: %u > %u; N=%d\n", linear, offsets->read_offsets[n-1], n);
+
return 0;
}
}
- return 1;
+ return -1;
}
@@ -1477,7 +1480,9 @@ int load_offsets(gene_offset_t* offsets , const char index_prefix [])
int n=0;
int padding = 0;
- gehash_load_option(index_prefix, SUBREAD_INDEX_OPTION_INDEX_PADDING , &padding);
+ int is_V3_index = gehash_load_option(index_prefix, SUBREAD_INDEX_OPTION_INDEX_PADDING , &padding);
+ if(!is_V3_index) return 1;
+
memset(offsets, 0, sizeof( gene_offset_t));
sprintf(fn, "%s.reads", index_prefix);
@@ -1529,10 +1534,8 @@ int load_offsets(gene_offset_t* offsets , const char index_prefix [])
char * read_name_mem = malloc(MAX_CHROMOSOME_NAME_LEN);
strcpy(read_name_mem, offsets->read_names + n*MAX_CHROMOSOME_NAME_LEN);
-
- //printf("PUT OFFSET: %s,%u\n", read_name_mem , n);
+ //printf("09NOV: add_annotation_to_junctions: '%s',%u\n", read_name_mem , n);
HashTablePut(offsets->read_name_to_index, read_name_mem , NULL + 1 + n);
-
n++;
if(n >= current_max_n)
diff --git a/src/gene-value-index.c b/src/gene-value-index.c
index d14a64c..7c51bf7 100644
--- a/src/gene-value-index.c
+++ b/src/gene-value-index.c
@@ -127,19 +127,25 @@ void gvindex_set(gene_value_index_t * index, gehash_data_t offset, gehash_key_t
index -> length = offset + 16 - index -> start_point + padding;
}
-void gvindex_dump(gene_value_index_t * index, const char filename [])
+int gvindex_dump(gene_value_index_t * index, const char filename [])
{
FILE * fp = f_subr_open(filename, "wb");
+ int is_full = 0;
+ int write_len = fwrite(&index->start_point,4,1, fp);
- fwrite(&index->start_point,4,1, fp);
- fwrite(&index->length, 4, 1, fp);
+ write_len += fwrite(&index->length, 4, 1, fp);
+ if(write_len < 2){
+ is_full = 1;
+ }
unsigned int useful_bytes, useful_bits;
gvindex_baseno2offset (index -> length+ index -> start_point, index,&useful_bytes,&useful_bits);
- fwrite(index->values, 1, useful_bytes+1, fp);
-
+ write_len = fwrite(index->values, 1, useful_bytes+1, fp);
+ if(write_len <= useful_bytes) is_full = 1;
fclose(fp);
+ if(is_full) SUBREADprintf("ERROR: unable to writeinto the output file. Please check the disk space in the output directory.\n");
+ return is_full;
}
@@ -148,10 +154,15 @@ int gvindex_load(gene_value_index_t * index, const char filename [])
FILE * fp = f_subr_open(filename, "rb");
int read_length;
read_length = fread(&index->start_point,4,1, fp);
- assert(read_length>0);
+ if(read_length<1){
+ SUBREADprintf("ERROR: the array index is incomplete : %d", read_length );
+ return 1;
+ }
read_length = fread(&index->length,4,1, fp);
- assert(read_length>0);
-
+ if(read_length<1){
+ SUBREADputs("ERROR: the index is incomplete.");
+ return 1;
+ }
//SUBREADprintf ("\nBINDEX %s : %u ~ +%u\n",filename, index->start_point, index->length );
unsigned int useful_bytes, useful_bits;
@@ -167,7 +178,10 @@ int gvindex_load(gene_value_index_t * index, const char filename [])
read_length =fread(index->values, 1, useful_bytes+1, fp);
- assert(read_length>0);
+ if(read_length < useful_bytes){
+ SUBREADprintf("ERROR: the array index is incomplete : %d < %d.", read_length, useful_bytes+1 );
+ return 1;
+ }
fclose(fp);
return 0;
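The gvindex_dump and gvindex_load changes above replace assertions with checks on the fwrite and fread return values. Two details of the hunk as shown are worth noting. The final test in gvindex_load is read_length < useful_bytes although useful_bytes+1 bytes were requested, so a read that is one byte short passes. The early return 1 paths also leave fp open. The sketch below shows the same pattern with exact-count comparisons, under the assumption that the caller owns the FILE handle and closes it. block_write and block_read are hypothetical names, not Subread functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write a two-field header and a byte array.
 * Returns 0 on success, 1 if any write was short (for example, disk full). */
int block_write(FILE *fp, uint32_t start, uint32_t length,
                const unsigned char *values, size_t n_values) {
    assert(fp != NULL);
    if (fwrite(&start, sizeof start, 1, fp) != 1) return 1;
    if (fwrite(&length, sizeof length, 1, fp) != 1) return 1;
    if (fwrite(values, 1, n_values, fp) != n_values) return 1;
    return fflush(fp) == 0 ? 0 : 1;   /* buffered data may only fail when flushed */
}

/* Read the header and exactly n_values bytes.
 * Returns 0 on success, 1 if the file is shorter than expected. */
int block_read(FILE *fp, uint32_t *start, uint32_t *length,
               unsigned char *values, size_t n_values) {
    assert(fp != NULL);
    if (fread(start, sizeof *start, 1, fp) != 1) return 1;
    if (fread(length, sizeof *length, 1, fp) != 1) return 1;
    if (fread(values, 1, n_values, fp) != n_values) return 1;
    return 0;
}
```

Keeping fopen and fclose in the caller means every error path can return immediately without leaking the handle.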
diff --git a/src/gene-value-index.h b/src/gene-value-index.h
index 59eb16f..fa86f5e 100644
--- a/src/gene-value-index.h
+++ b/src/gene-value-index.h
@@ -28,7 +28,7 @@ int gvindex_init(gene_value_index_t * index, unsigned int start_point);
void gvindex_set (gene_value_index_t * index, gehash_data_t offset, gehash_key_t base_value, int padding);
-void gvindex_dump(gene_value_index_t * index, const char filename []);
+int gvindex_dump(gene_value_index_t * index, const char filename []);
int gvindex_load(gene_value_index_t * index, const char filename []);
diff --git a/src/hashtable.c b/src/hashtable.c
index d1573d6..8f7ab59 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -6,15 +6,54 @@
* Released to the public domain.
*
*--------------------------------------------------------------------------
- * $Id: hashtable.c,v 9999.11 2015/01/25 21:32:56 cvs Exp $
+ * $Id: hashtable.c,v 9999.14 2017/03/10 00:01:40 cvs Exp $
\*--------------------------------------------------------------------------*/
#include <stdio.h>
#include <stdlib.h>
+#include <string.h>
#include <assert.h>
#include <pthread.h>
#include "hashtable.h"
+
+
+ArrayList * ArrayListCreate(int init_capacity){
+ ArrayList * ret = malloc(sizeof(ArrayList));
+ memset(ret,0,sizeof(ArrayList));
+ ret -> capacityOfElements = init_capacity;
+ ret -> elementList = malloc(sizeof(void *)*init_capacity);
+ return ret;
+}
+
+void ArrayListDestroy(ArrayList * list){
+ long x1;
+ if(list -> elemDeallocator)
+ for(x1 = 0;x1 < list->numOfElements; x1++)
+ list -> elemDeallocator(list -> elementList[x1]);
+
+ free(list -> elementList);
+ free(list);
+}
+
+void * ArrayListGet(ArrayList * list, long n){
+ if(n<0 || n >= list->numOfElements)return NULL;
+ return list -> elementList[n];
+}
+
+int ArrayListPush(ArrayList * list, void * new_elem){
+ if(list -> capacityOfElements <= list->numOfElements){
+ list -> capacityOfElements *=1.3;
+ list -> elementList=realloc(list -> elementList, list -> capacityOfElements);
+ }
+ list->elementList[list->numOfElements++] = new_elem;
+ return list->numOfElements;
+}
+void ArrayListSetDeallocationFunction(ArrayList * list, void (*elem_deallocator)(void *elem)){
+ list -> elemDeallocator = elem_deallocator;
+}
+
+
static int pointercmp(const void *pointer1, const void *pointer2);
static unsigned long pointerHashFunction(const void *pointer);
static int isProbablePrime(long number);
@@ -743,13 +782,13 @@ void free_values_destroy(HashTable * tab)
HashTableDestroy(tab);
}
-void HashTableIteration(HashTable * tab, void process_item(void * hashed_obj, HashTable * tab) )
+void HashTableIteration(HashTable * tab, void process_item(void * key, void * hashed_obj, HashTable * tab) )
{
int i;
for (i=0; i< tab ->numOfBuckets; i++) {
KeyValuePair *pair = tab ->bucketArray[i];
while (pair != NULL) {
- process_item(pair -> value, tab);
+ process_item(( void * )pair -> key, pair -> value, tab);
KeyValuePair *nextPair = pair->next;
pair = nextPair;
}
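Note on the `ArrayList` functions added above: a growable pointer array has to pass `realloc` a size in bytes (element count times `sizeof(void *)`) and has to make sure the capacity really increases on every growth step, including for very small initial capacities. The sketch below illustrates those two points under assumed names (`PtrList`, `ptrlist_*`). It is an independent example, not the upstream code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type for illustration; not the ArrayList from hashtable.h. */
typedef struct {
    void **items;
    long count;
    long capacity;
} PtrList;

static PtrList *ptrlist_create(long init_capacity)
{
    if (init_capacity < 1) init_capacity = 1;
    PtrList *l = calloc(1, sizeof *l);
    if (!l) return NULL;
    l->items = malloc(sizeof(void *) * (size_t)init_capacity);
    if (!l->items) { free(l); return NULL; }
    l->capacity = init_capacity;
    return l;
}

/* Appends elem; returns the new element count, or -1 if growing failed. */
static long ptrlist_push(PtrList *l, void *elem)
{
    if (l->count >= l->capacity) {
        /* Grow by about 1.5x; the "+ 1" guarantees progress when capacity is 1. */
        long new_cap = l->capacity + l->capacity / 2 + 1;
        /* realloc expects bytes, so scale the element count by the pointer size. */
        void **grown = realloc(l->items, sizeof(void *) * (size_t)new_cap);
        if (!grown) return -1;
        l->items = grown;
        l->capacity = new_cap;
    }
    l->items[l->count++] = elem;
    return l->count;
}

/* Bounds-checked access; returns NULL for an out-of-range index. */
static void *ptrlist_get(const PtrList *l, long n)
{
    if (n < 0 || n >= l->count) return NULL;
    return l->items[n];
}

static void ptrlist_destroy(PtrList *l)
{
    free(l->items);
    free(l);
}
```

Assigning the result of `realloc` to a temporary first keeps the original buffer reachable if the allocation fails.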
diff --git a/src/hashtable.h b/src/hashtable.h
index 7bb475d..9ab1608 100644
--- a/src/hashtable.h
+++ b/src/hashtable.h
@@ -6,7 +6,7 @@
* Released to the public domain.
*
*--------------------------------------------------------------------------
- * $Id: hashtable.h,v 9999.7 2015/01/25 21:32:56 cvs Exp $
+ * $Id: hashtable.h,v 9999.9 2017/03/10 00:01:40 cvs Exp $
\*--------------------------------------------------------------------------*/
#ifndef _HASHTABLE_H
@@ -40,7 +40,22 @@ typedef struct {
long long int counter3;
} HashTable;
-void HashTableIteration(HashTable * tab, void process_item(void * hashed_obj, HashTable * tab) );
+
+
+typedef struct {
+ void ** elementList;
+ long numOfElements;
+ long capacityOfElements;
+ void (*elemDeallocator)(void *elem);
+} ArrayList;
+
+ArrayList * ArrayListCreate(int init_capacity);
+void ArrayListDestroy(ArrayList * list);
+void * ArrayListGet(ArrayList * list, long n);
+int ArrayListPush(ArrayList * list, void * new_elem);
+void ArrayListSetDeallocationFunction(ArrayList * list, void (*elem_deallocator)(void *elem));
+
+void HashTableIteration(HashTable * tab, void process_item(void * key, void * hashed_obj, HashTable * tab) );
/*--------------------------------------------------------------------------*\
* NAME:
diff --git a/src/index-builder.c b/src/index-builder.c
index a932bbe..64e89ba 100644
--- a/src/index-builder.c
+++ b/src/index-builder.c
@@ -86,7 +86,7 @@ int build_gene_index(const char index_prefix [], char ** chro_files, int chro_fi
long long int all_bases = guess_gene_bases(chro_files,chro_file_number);
double local_begin_ftime = 0.;
- int chro_table_maxsize=100;
+ int chro_table_maxsize=100, dump_res = 0;
unsigned int * read_offsets = malloc(sizeof(unsigned int) * chro_table_maxsize);
char * read_names = malloc(MAX_READ_NAME_LEN * chro_table_maxsize);
gehash_t table;
@@ -163,17 +163,15 @@ int build_gene_index(const char index_prefix [], char ** chro_files, int chro_fi
sprintf (fn, "%s.%02d.%c.tab", index_prefix, table_no, IS_COLOR_SPACE?'c':'b');
SUBREADfflush(stdout);
- gehash_dump(&table, fn);
+ if(!dump_res)dump_res |= gehash_dump(&table, fn);
if(VALUE_ARRAY_INDEX)
{
sprintf (fn, "%s.%02d.%c.array", index_prefix, table_no, IS_COLOR_SPACE?'c':'b');
- gvindex_dump(&value_array_index, fn);
+ if(!dump_res)dump_res |= gvindex_dump(&value_array_index, fn);
gvindex_destory(&value_array_index) ;
}
-
-
gehash_destory(&table);
read_offsets[read_no-1] = offset + table.padding;
@@ -280,11 +278,11 @@ int build_gene_index(const char index_prefix [], char ** chro_files, int chro_fi
sprintf(fn, "%s.%02d.%c.tab", index_prefix, table_no, IS_COLOR_SPACE?'c':'b');
SUBREADfflush(stdout);
- gehash_dump(&table, fn);
+ if(!dump_res)dump_res |= gehash_dump(&table, fn);
if(VALUE_ARRAY_INDEX)
{
sprintf(fn, "%s.%02d.%c.array", index_prefix, table_no, IS_COLOR_SPACE?'c':'b');
- gvindex_dump(&value_array_index, fn);
+ if(!dump_res)dump_res |= gvindex_dump(&value_array_index, fn);
}
table_no ++;
@@ -420,9 +418,27 @@ int build_gene_index(const char index_prefix [], char ** chro_files, int chro_fi
}
}
- free(fn);
free(read_names);
free(read_offsets);
+ if(dump_res){
+ SUBREADprintf("No index was built.\n");
+ sprintf(fn, "%s.files", index_prefix);
+ unlink(fn);
+ sprintf(fn, "%s.reads", index_prefix);
+ unlink(fn);
+ int index_i;
+ for(index_i = 0; index_i <= 99; index_i++){
+ sprintf(fn, "%s.%02d.b.tab", index_prefix, index_i);
+ unlink(fn);
+ sprintf(fn, "%s.%02d.c.tab", index_prefix, index_i);
+ unlink(fn);
+ sprintf(fn, "%s.%02d.b.array", index_prefix, index_i);
+ unlink(fn);
+ sprintf(fn, "%s.%02d.c.array", index_prefix, index_i);
+ unlink(fn);
+ }
+ }
+ free(fn);
return 0;
}
@@ -778,7 +794,7 @@ int check_and_convert_FastA(char ** input_fas, int fa_number, char * out_fa, uns
char * line_buf = malloc(MAX_READ_LENGTH);
char * read_head_buf = malloc(MAX_READ_LENGTH * 3);
unsigned int inp_file_no, line_no;
- int written_chrs = 0;
+ int written_chrs = 0, is_disk_full = 0;
int chrom_lens_max_len = 100;
int chrom_lens_len = 0;
ERROR_FOUND_IN_FASTA = 0;
@@ -786,7 +802,7 @@ int check_and_convert_FastA(char ** input_fas, int fa_number, char * out_fa, uns
if(!out_fp)
{
- SUBREADprintf("ERROR: The current directory is not writable, but the index builder needs to create temporary files in the current directory. Please change the working directory and rerun the index builder.\n");
+ SUBREADprintf("ERROR: the output directory is not writable, but the index builder needs to create temporary files in the current directory. Please change the working directory and rerun the index builder.\n");
return -1;
}
@@ -921,7 +937,13 @@ int check_and_convert_FastA(char ** input_fas, int fa_number, char * out_fa, uns
if(is_head_written)
{
- fprintf(out_fp,"%s\n", line_buf);
+ int line_buf_len = strlen(line_buf);
+ int writen_len = fprintf(out_fp,"%s\n", line_buf);
+ if(writen_len < line_buf_len){
+ SUBREADprintf("ERROR: unable to write into the temporary file. Please check the free space of the output directory.\n");
+ is_disk_full = 1;
+ break;
+ }
(*chrom_lens)[chrom_lens_len-1] = read_len;
(*chrom_lens)[chrom_lens_len] = 0;
}
@@ -935,6 +957,7 @@ int check_and_convert_FastA(char ** input_fas, int fa_number, char * out_fa, uns
}
fclose(in_fp);
+ if(is_disk_full) break;
}
@@ -948,7 +971,7 @@ int check_and_convert_FastA(char ** input_fas, int fa_number, char * out_fa, uns
return 1;
}
- if(is_repeated_chro)
+ if(is_repeated_chro|| is_disk_full)
return 1;
if(ERROR_FOUND_IN_FASTA)
@@ -1174,8 +1197,16 @@ int main_buildindex(int argc,char ** argv)
begin_ftime = miltime();
+ for(x1 = strlen(output_file); x1 >=0; x1--){
+ if(output_file[x1]=='/'){
+ memcpy(tmp_fa_file, output_file, x1);
+ tmp_fa_file[x1]=0;
+ break;
+ }
+ }
+ if(tmp_fa_file[0]==0)strcpy(tmp_fa_file, "./");
- sprintf(tmp_fa_file, "./subread-index-sam-%06u-XXXXXX", getpid());
+ sprintf(tmp_fa_file+strlen(tmp_fa_file), "/subread-index-sam-%06u-XXXXXX", getpid());
mkstemp(tmp_fa_file);
sprintf(log_file_name, "%s.log", output_file);
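Note on the `main_buildindex` hunk above: the temporary file is now created in the directory of the output prefix rather than in the current directory, by scanning `output_file` backwards for the last `/`. A minimal sketch of that directory extraction follows. The helper name `dir_of_path` is hypothetical, and `strrchr` is used here as a stand-in for the hand-written loop in the diff.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not in subread): copy the directory part of `path`
 * into `dir`. A path without '/' maps to ".", and a path whose only slash is
 * the leading one maps to "/". Returns 0 on success, -1 if `dir` is too small. */
static int dir_of_path(const char *path, char *dir, size_t dir_size)
{
    const char *slash = strrchr(path, '/');
    if (!slash) {
        if (dir_size < 2) return -1;
        strcpy(dir, ".");
        return 0;
    }
    size_t len = (size_t)(slash - path);
    if (len == 0) len = 1;              /* "/index" lives in the root directory */
    if (len + 1 > dir_size) return -1;  /* need room for the terminating NUL */
    memcpy(dir, path, len);
    dir[len] = '\0';
    return 0;
}
```

For example, an output prefix of `out/my_index` yields `out`, so a temporary file name can be built as `out/<template>`. Passing the buffer size explicitly avoids overrunning a fixed-length file name array.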
diff --git a/src/input-files.c b/src/input-files.c
index 2839eac..b820fa2 100644
--- a/src/input-files.c
+++ b/src/input-files.c
@@ -1453,7 +1453,7 @@ int my_strcmp(const void * s1, const void * s2)
return strcmp((char*)s1, (char*)s2);
}
-void write_read_block_file(FILE *temp_fp , unsigned int read_number, char *read_name, int flags, char * chro, unsigned int pos, char *cigar, int mapping_quality, char *sequence , char *quality_string, int rl , int is_sequence_needed, char strand, unsigned short read_pos, unsigned short read_len)
+int write_read_block_file(FILE *temp_fp , unsigned int read_number, char *read_name, int flags, char * chro, unsigned int pos, char *cigar, int mapping_quality, char *sequence , char *quality_string, int rl , int is_sequence_needed, char strand, unsigned short read_pos, unsigned short read_len)
{
base_block_temp_read_t datum;
datum.record_type = 100;
@@ -1469,17 +1469,21 @@ void write_read_block_file(FILE *temp_fp , unsigned int read_number, char *read_
{
SUBREADprintf("READ IS TOO LONG:%d\n", rl);
- return;
+ return -1;
}
fwrite(&datum, sizeof(datum), 1, temp_fp);
if(is_sequence_needed)
{
unsigned short srl = rl&0xffff;
- fwrite(&srl, sizeof(short),1, temp_fp);
- fwrite(sequence , 1, rl,temp_fp );
- fwrite(quality_string , 1, rl,temp_fp );
+ int wlen = fwrite(&srl, sizeof(short),1, temp_fp);
+ if(wlen != 1) return -1;
+ wlen = fwrite(sequence , 1, rl,temp_fp );
+ if(wlen != rl) return -1;
+ wlen = fwrite(quality_string , 1, rl,temp_fp );
+ if(wlen != rl) return -1;
}
+ return 0;
}
@@ -1724,7 +1728,7 @@ void break_VCF_file(char * vcf_file, HashTable * fp_table, char * temp_file_pref
int break_SAM_file(char * in_SAM_file, int is_BAM_file, char * temp_file_prefix, unsigned int * real_read_count, int * block_count, chromosome_t * known_chromosomes, int is_sequence_needed, int base_ignored_head_tail, gene_value_index_t *array_index, gene_offset_t * offsets, unsigned long long int * all_mapped_bases, HashTable * event_table, char * VCF_file)
{
- int i, is_first_read=1;
+ int i, is_first_read=1, is_error = 0;
HashTable * fp_table;
unsigned int read_number = 0;
char line_buffer [3000];
@@ -1931,7 +1935,7 @@ int break_SAM_file(char * in_SAM_file, int is_BAM_file, char * temp_file_prefix,
if(!temp_fp) return -1;
if(all_mapped_bases)
(*all_mapped_bases) += insert_length;
- write_read_block_file(temp_fp , read_number, read_name, flags, chro, insertion_cursor, cigar, mapping_quality, sequence + read_cursor , quality_string + read_cursor, insert_length , 1, is_negative_strand, read_cursor, rl);
+ is_error |= write_read_block_file(temp_fp , read_number, read_name, flags, chro, insertion_cursor, cigar, mapping_quality, sequence + read_cursor , quality_string + read_cursor, insert_length , 1, is_negative_strand, read_cursor, rl);
if(close_now) fclose(temp_fp);
}
insertion_cursor += insert_length;
@@ -1980,7 +1984,7 @@ int break_SAM_file(char * in_SAM_file, int is_BAM_file, char * temp_file_prefix,
sprintf(temp_file_name, "%s%s", temp_file_prefix , temp_file_suffix);
temp_fp = get_temp_file_pointer(temp_file_name, fp_table, &close_now);
- write_read_block_file(temp_fp , read_number, read_name, flags, chro, pos, cigar, mapping_quality, sequence , quality_string, rl , is_sequence_needed, is_negative_strand, 0,rl);
+ is_error |= write_read_block_file(temp_fp , read_number, read_name, flags, chro, pos, cigar, mapping_quality, sequence , quality_string, rl , is_sequence_needed, is_negative_strand, 0,rl);
if(close_now)fclose(temp_fp);
}
read_number ++;
@@ -1993,7 +1997,10 @@ int break_SAM_file(char * in_SAM_file, int is_BAM_file, char * temp_file_prefix,
SamBam_fclose(sambam_reader);
if(real_read_count)
(*real_read_count) = read_number;
- return 0;
+ if(is_error){
+ SUBREADprintf("ERROR: cannot write into the temporary files. Please check the disk space in the output directory.\n");
+ }
+ return is_error;
}
int is_in_exon_annotations(gene_t *output_genes, unsigned int offset, int is_start)
@@ -2386,6 +2393,13 @@ int SAM_pairer_warning_file_open_limit(){
int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_buff_size_per_thread, int BAM_input, int is_Tiny_Mode, int is_single_end_mode, int force_do_not_sort, int display_progress, char * in_file, void (* reset_output_function) (void * pairer), int (* output_header_function) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len), int (* output_function) (void * pairer, int thread_no, char * readname, char * bin1, char * bin2 [...]
memset(pairer, 0, sizeof(SAM_pairer_context_t));
+
+ if(in_file[0]=='<'){
+ in_file++;
+ strncpy(pairer -> in_file_name, "<STDIN>", MAX_FILE_NAME_LENGTH);
+ }else
+ strncpy(pairer -> in_file_name, in_file, MAX_FILE_NAME_LENGTH);
+
pairer -> input_fp = f_subr_open(in_file, "rb");
if(NULL == pairer -> input_fp) return 1;
@@ -2654,11 +2668,12 @@ int SAM_pairer_fetch_BAM_block(SAM_pairer_context_t * pairer , SAM_pairer_thread
thread_context -> input_buff_BIN_ptr = 0;
- thread_context -> strm.zalloc = Z_NULL;
+ /*thread_context -> strm.zalloc = Z_NULL;
thread_context -> strm.zfree = Z_NULL;
thread_context -> strm.opaque = Z_NULL;
thread_context -> strm.avail_in = 0;
thread_context -> strm.next_in = Z_NULL;
+ */
inflateReset(&thread_context -> strm);
@@ -2667,6 +2682,9 @@ int SAM_pairer_fetch_BAM_block(SAM_pairer_context_t * pairer , SAM_pairer_thread
thread_context -> strm.avail_out = pairer -> input_buff_BIN_size - thread_context -> input_buff_BIN_used;
thread_context -> strm.next_out = (unsigned char *)thread_context -> input_buff_BIN + thread_context -> input_buff_BIN_used;
+ //#warning "=========== COMMENT NEXT LINE IN RELEASE ================"
+ //SUBREADprintf("GZIP INFLATE INPUT : %u chars\n", thread_context -> strm.avail_in);
+
int ret = inflate(&thread_context ->strm, Z_FINISH);
if(ret == Z_OK || ret == Z_STREAM_END)
{
@@ -3660,9 +3678,9 @@ int SAM_pairer_is_matched_chunks(char * c1, char * c2){
-void merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, int fps_no){
+int merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, int fps_no){
char * bin_tmp1 , * bin_tmp2;
- int max_name_len = MAX_READ_NAME_LEN*2 +80, x1;
+ int max_name_len = MAX_READ_NAME_LEN*2 +80, x1, is_disk_full = 0;
char tmp_fname[MAX_FILE_NAME_LENGTH];
sprintf(tmp_fname, "%s-MERGE-TMP.tmp", pairer->tmp_file_prefix);
@@ -3706,7 +3724,7 @@ void merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, i
}
- if(min_name_fileno >= 0){
+ if(min_name_fileno >= 0 && !is_disk_full){
SAM_pairer_osr_next_bin( fps[ min_name_fileno ] , bin_tmp1);
if(min2_name_fileno>=0){
@@ -3741,7 +3759,8 @@ void merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, i
memcpy( &rbinlen, bin_tmp1 , 2);
rbinlen += 4;
fwrite( bin_tmp1, 2, 1, out_fp );
- fwrite( bin_tmp1, 1, rbinlen, out_fp );
+ int write_len = fwrite( bin_tmp1, 1, rbinlen, out_fp );
+ if(write_len < rbinlen)is_disk_full = 1;
}
int read_has = SAM_pairer_osr_next_name( fps[min_name_fileno], names + max_name_len*min_name_fileno, -1, -1);
if(!read_has) *(names + max_name_len*min_name_fileno)=0;
@@ -3752,7 +3771,7 @@ void merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, i
unlink(fname);
rename(tmp_fname, fname);
free(names);
-
+ return is_disk_full;
}
#define PAIRER_WAIT_TICK_TIME 10000
@@ -3766,8 +3785,8 @@ void SAM_pairer_set_merge_max_fp(SAM_pairer_context_t * pairer, int fon){
}
-void SAM_pairer_probe_maxfp( SAM_pairer_context_t * pairer){
- int orphant_fp_no=0;
+int SAM_pairer_probe_maxfp( SAM_pairer_context_t * pairer){
+ int orphant_fp_no=0, is_disk_full = 0;
int thno, bkno, x1;
int thread_fps [ pairer -> total_threads ];
char tmp_fname[MAX_FILE_NAME_LENGTH];
@@ -3843,20 +3862,21 @@ void SAM_pairer_probe_maxfp( SAM_pairer_context_t * pairer){
// #warning ">>>> COMMENT DEBUG OUTPUT <<<<"
// SUBREADprintf("Merging temp files\n");
- merge_level_fps(pairer , tmp_fname, level_merge_fps, current_opened_fp_no);
+ is_disk_full |= merge_level_fps(pairer , tmp_fname, level_merge_fps, current_opened_fp_no);
for(x1 = 0; x1 < current_opened_fp_no; x1++) fclose(level_merge_fps[x1]);
if(processed_orphant < orphant_fp_no){
level_merge_fps[0] = fopen(tmp_fname, "rb");
current_opened_fp_no = 1;
}
+ if(is_disk_full) break;
}
}
}
pairer -> merge_level_finished = 1;
}
free(orphant_fps);
-
+ return is_disk_full;
}
void * SAM_pairer_rescure_orphants_max_FP(void * params){
@@ -3993,14 +4013,14 @@ void * SAM_pairer_rescure_orphants_max_FP(void * params){
}
-void SAM_pairer_update_orphant_table(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thread_context){
+int SAM_pairer_update_orphant_table(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thread_context){
unsigned int x2 = 0;
unsigned char ** name_list, ** bin_list;
//SUBREADprintf("ELES=%lu\n", thread_context->orphant_table->numOfElements);
name_list = malloc(sizeof(char*) * thread_context->orphant_table->numOfElements);
bin_list = malloc(sizeof(char*) * thread_context->orphant_table->numOfElements);
- int x1;
+ int x1, is_error = 0;
for(x1 = 0; x1 < thread_context->orphant_table->numOfBuckets; x1 ++){
KeyValuePair *pair = thread_context->orphant_table->bucketArray[x1];
while (pair != NULL) {
@@ -4028,10 +4048,14 @@ void SAM_pairer_update_orphant_table(SAM_pairer_context_t * pairer , SAM_pairer_
memcpy(&bin_len, bin_list[x1] , 4);
int namelen = strlen((char *)name_list[x1]);
- fwrite(&namelen,1,2,tmp_fp);
- fwrite(name_list[x1], 1, namelen, tmp_fp);
- fwrite(&bin_len,1,2,tmp_fp);
- fwrite(bin_list[x1], 1, bin_len + 4, tmp_fp);
+ int write_len = fwrite(&namelen,2,1,tmp_fp);
+ is_error = (write_len <1);
+ write_len = fwrite(name_list[x1], 1, namelen, tmp_fp);
+ is_error |= (write_len <namelen);
+ write_len = fwrite(&bin_len,2,1,tmp_fp);
+ is_error |= (write_len <1);
+ write_len = fwrite(bin_list[x1], 1, bin_len + 4, tmp_fp);
+ is_error |= (write_len < bin_len + 4);
HashTableRemove(thread_context->orphant_table , name_list[x1]);
}
@@ -4040,6 +4064,8 @@ void SAM_pairer_update_orphant_table(SAM_pairer_context_t * pairer , SAM_pairer_
free(name_list);
free(bin_list);
thread_context -> orphant_space = 0;
+ if(is_error) SUBREADprintf("ERROR: unable to write into the temporary file. Please check the disk space in the output directory.\n");
+ return is_error;
}
@@ -4141,7 +4167,7 @@ int SAM_pairer_find_start(SAM_pairer_context_t * pairer , SAM_pairer_thread_t *
void * SAM_pairer_thread_run( void * params ){
void ** param_ptr = (void **) params;
SAM_pairer_context_t * pairer = param_ptr[0];
- int thread_no = (int)(param_ptr[1]-NULL);
+ int thread_no = (int)(param_ptr[1]-NULL), is_disk_full = 0;
free(params);
SAM_pairer_thread_t * thread_context = pairer -> threads + thread_no;
@@ -4177,7 +4203,7 @@ void * SAM_pairer_thread_run( void * params ){
}
if(thread_context -> orphant_space > pairer -> input_buff_SBAM_size)
- SAM_pairer_update_orphant_table(pairer, thread_context);
+ if(!is_disk_full)is_disk_full |= SAM_pairer_update_orphant_table(pairer, thread_context);
if(is_finished){
pairer -> BAM_header_parsed = 1;
@@ -4186,7 +4212,9 @@ void * SAM_pairer_thread_run( void * params ){
}
if(thread_context -> orphant_table -> numOfElements > 0)
- SAM_pairer_update_orphant_table(pairer, thread_context);
+ if(!is_disk_full)is_disk_full |= SAM_pairer_update_orphant_table(pairer, thread_context);
+
+ pairer -> is_internal_error |= is_disk_full;
return NULL;
}
@@ -4211,20 +4239,25 @@ int SAM_pairer_run_once( SAM_pairer_context_t * pairer){
if(0 == pairer -> is_bad_format){
- SAM_pairer_probe_maxfp( pairer );
+ int is_disk_full = SAM_pairer_probe_maxfp( pairer );
- for(x1 = 0; x1 < pairer -> total_threads ; x1++){
- // this 16-byte memory block is freed in the thread worker.
+ if(is_disk_full){
+ SUBREADprintf("ERROR: cannot write into the temporary file. Please check the disk space in the output directory.\n");
+ pairer -> is_internal_error = 1;
+ }else{
+ for(x1 = 0; x1 < pairer -> total_threads ; x1++){
+ // this 16-byte memory block is freed in the thread worker.
- void ** init_params = malloc(sizeof(void *) * 2);
+ void ** init_params = malloc(sizeof(void *) * 2);
- init_params[0] = pairer;
- init_params[1] = (void *)(NULL+x1);
- pthread_create(&(pairer -> threads[x1].thread_stab), NULL, SAM_pairer_rescure_orphants_max_FP, init_params);
- }
+ init_params[0] = pairer;
+ init_params[1] = (void *)(NULL+x1);
+ pthread_create(&(pairer -> threads[x1].thread_stab), NULL, SAM_pairer_rescure_orphants_max_FP, init_params);
+ }
- for(x1 = 0; x1 < pairer -> total_threads ; x1++){
- pthread_join(pairer -> threads[x1].thread_stab, NULL);
+ for(x1 = 0; x1 < pairer -> total_threads ; x1++){
+ pthread_join(pairer -> threads[x1].thread_stab, NULL);
+ }
}
}
@@ -4279,20 +4312,17 @@ int fix_load_next_block(FILE * in, char * binbuf, z_stream * strm){
strm -> avail_out = 70000;
strm -> next_out = (unsigned char*)binbuf;
int ret_inf = inflate(strm, Z_FINISH);
- if(ret_inf == Z_STREAM_END){
+ if(ret_inf == Z_STREAM_END)
ret = 70000 - strm -> avail_out;
- // SUBREADprintf("FIX_DECOM: %d -> %d\n", bsize - xlen - 19, ret);
- }else{
- SUBREADprintf("FIX_DECOM_ERR:%d\n" , ret_inf);
- ret = -1;
- }
+ else
+ ret = -2;
inflateReset(strm);
}
free(bam_buf);
return ret;
}
-void fix_write_block(FILE * out, char * bin, int binlen, z_stream * strm){
+int fix_write_block(FILE * out, char * bin, int binlen, z_stream * strm){
char * bam_buf = malloc(70000);
int x1, bam_len = 0, retbam;
@@ -4349,26 +4379,29 @@ void fix_write_block(FILE * out, char * bin, int binlen, z_stream * strm){
fwrite( &x1, 2, 1 , out );
x1 = bam_len + 19 + 6;
fwrite( &x1, 2, 1 , out );
- fwrite( bam_buf , 1,bam_len, out );
+ int write_len = fwrite( bam_buf , 1,bam_len, out );
fwrite( &crc, 4, 1, out );
fwrite( &binlen, 4, 1, out );
free(bam_buf);
+
+ if(write_len < bam_len)return 1;
+ return 0;
}
#define FIX_GET_NEXT_NCH { while(in_bin_ptr == in_bin_size){ \
in_bin_ptr = 0; in_bin_size = 0;\
int newsize = fix_load_next_block(old_fp, in_bin, &in_strm);\
- if(newsize < 0){ break;}else{in_bin_size = newsize;}\
+ if(newsize < 0){ in_bin_size = -1; if(newsize<-1)SUBREADprintf("ERROR: failed to decompress the BAM file %s\n", pairer -> in_file_name) ;break;}else{in_bin_size = newsize;}\
} if(in_bin_size>0){nch = in_bin[in_bin_ptr++]; if(nch < 0)nch += 256; } else nch = -1; }
-#define FIX_FLASH_OUT { if(out_bin_ptr > 0) fix_write_block(new_fp, out_bin, out_bin_ptr, &out_strm); out_bin_ptr = 0; }
+#define FIX_FLASH_OUT { if(out_bin_ptr > 0)disk_is_full |= fix_write_block(new_fp, out_bin, out_bin_ptr, &out_strm); out_bin_ptr = 0; }
#define FIX_APPEND_OUT(p, c) { if(out_bin_ptr > 60000){FIX_FLASH_OUT} ; memcpy(out_bin + out_bin_ptr, p, c); out_bin_ptr +=c ; }
#define FIX_APPEND_READ(p, c){ memcpy(out_bin + out_bin_ptr, p, c); out_bin_ptr +=c ; }
-void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
+int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
FILE * old_fp = pairer -> input_fp;
fseek(old_fp, 0, SEEK_SET);
char tmpfname [300];
@@ -4398,6 +4431,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
deflateInit2(&out_strm, Z_NO_COMPRESSION, Z_DEFLATED,
PAIRER_GZIP_WINDOW_BITS, PAIRER_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
+ int disk_is_full = 0;
int in_bin_ptr = 0;
int out_bin_ptr = 0;
int in_bin_size = 0;
@@ -4407,6 +4441,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
for(x1 = 0; x1 < 4; x1++){
FIX_GET_NEXT_NCH; // BAM1
+ if(nch < 0) return -1;
FIX_APPEND_OUT(&nch, 1);
}
@@ -4415,6 +4450,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
content_size = 0;
for(x1 = 0; x1 < 4; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
// SUBREADprintf("FIX: TLEN: %d\n", nch);
content_size += (nch << (8 * x1));
}
@@ -4422,6 +4458,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
//SUBREADprintf("FIX: TXTLEN=%d\n", content_size);
for(content_count = 0; content_count < content_size; content_count++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
FIX_APPEND_OUT(&nch, 1);
// fputc(nch, stderr);
}
@@ -4431,6 +4468,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
content_size = 0;
for(x1 = 0; x1 < 4; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
content_size += (nch << (8 * x1));
}
FIX_APPEND_OUT(&content_size, 4);
@@ -4439,11 +4477,13 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
int namelen = 0;
for(x1 = 0; x1 < 4; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
namelen+= (nch << (8 * x1));
}
FIX_APPEND_READ(&namelen, 4);
for(x1 = 0; x1 < namelen + 4; x1++){ // inc. length
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
FIX_APPEND_READ(&nch, 1);
}
@@ -4468,6 +4508,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
block_size = nch;
for(x1 = 1; x1 < 4; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
block_size += (nch << (8 * x1));
}
@@ -4490,6 +4531,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
int extag_new_len = 0;
for(x1 = 0; x1 < block_size; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
if(x1 == 8) name_len = nch;
else if(x1 >= 16 && x1 < 20){
seq_len += ( nch << (8 * (x1 - 16)));
@@ -4509,11 +4551,14 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
int this_tag_output = 0;
if(etag_name0 > 0){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
}
etag_name0 = nch;
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
etag_name1 = nch;
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
etag_type = nch;
x1 += 3;
@@ -4533,24 +4578,31 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
if(etag_type == 'Z'||etag_type =='H'){
while(1){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
x1++;
if(nch == 0)break;
}
}else if(etag_type == 'A'){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
x1++;
}else if(etag_type =='B'){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
char array_type = nch;
int x2, adlen = 1, aditems = 0;
if(array_type == 's'||array_type == 'S')adlen = 2;
if(array_type == 'i'||array_type == 'I'||array_type == 'f')adlen = 4;
for(x2=0;x2<4; x2++) {
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
aditems += nch << (8*x2);
}
x1 += 5 + aditems * adlen;
- for(x2 = 0; x2 < aditems * adlen; x2++) FIX_GET_NEXT_NCH;
+ for(x2 = 0; x2 < aditems * adlen; x2++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ }
}else{
int dlen = 1;
if(etag_type == 's'||etag_type == 'S') dlen = 2;
@@ -4559,6 +4611,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
x1 += dlen;
while(dlen > 0){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
if(this_tag_output)
FIX_APPEND_READ(&nch, 1);
dlen--;
@@ -4579,6 +4632,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
}else{
for(x1 = 0; x1 < block_size; x1++){
FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
FIX_APPEND_READ(&nch, 1);
}
}
@@ -4590,7 +4644,7 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
}
FIX_FLASH_OUT;
//SUBREADprintf("FIX READS=%llu\n", reads);
- fix_write_block(new_fp, out_bin, 0, &out_strm);
+ disk_is_full |= fix_write_block(new_fp, out_bin, 0, &out_strm);
deflateEnd(&out_strm);
inflateEnd(&in_strm);
@@ -4600,6 +4654,10 @@ void SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
pairer -> input_fp = f_subr_open(tmpfname, "rb");
free(in_bin);
free(out_bin);
+
+ if(disk_is_full)SUBREADprintf("ERROR: cannot write into the temporary file. Please check the empty space in the output directory.\n");
+
+ return disk_is_full;
}
@@ -4680,7 +4738,13 @@ int SAM_nosort_decompress_next_block(SAM_pairer_context_t * pairer){
int * BIN_buff_ptr = pairer -> appendix5;
SBAM_used = PBam_get_next_zchunk(pairer -> input_fp, SBAM_buff, NOSORT_SBAM_BUFF_SIZE, &decompressed_len);
- if(SBAM_used<0) return -1;
+ if(SBAM_used<0){
+ if(SBAM_used == -2){
+ SUBREADputs("ERROR: the BAM format is broken.");
+ pairer->is_internal_error = 1;
+ }
+ return -1;
+ }
//SUBREADprintf("PRE-LOAD BAM: USED %d, PTR %d\n", * BIN_buff_used , * BIN_buff_ptr);
if((* BIN_buff_ptr) < (* BIN_buff_used)){
@@ -4958,21 +5022,21 @@ int SAM_pairer_run( SAM_pairer_context_t * pairer){
}else for(corrected_run = 0; corrected_run < 2 ; corrected_run ++){
SAM_pairer_run_once(pairer);
- if(pairer -> is_bad_format && pairer->input_is_BAM && ! pairer -> is_incomplete_BAM){
+ if(pairer -> is_bad_format && pairer->input_is_BAM && ( ! pairer -> is_internal_error ) && ( ! pairer -> is_incomplete_BAM )){
//#warning ">>>>>> REMOVE '+ 1' FROM NEXT LINE IN RELEASE <<<<<<"
assert(1 != corrected_run);
//#warning ">>>>>> COMMENT NEXT LINE IN RELEASE <<<<<<"
//SUBREADprintf("Retrying with the corrected format...\n");
delete_with_prefix(pairer -> tmp_file_prefix);
- SAM_pairer_fix_format(pairer);
- if(pairer -> is_bad_format)
+ pairer -> is_internal_error |= SAM_pairer_fix_format(pairer);
+ if(pairer -> is_bad_format || pairer -> is_internal_error)
return -1;
SAM_pairer_reset(pairer);
pairer -> reset_output_function(pairer);
}else break;
}
- return pairer -> is_bad_format;
+ return pairer -> is_bad_format || pairer -> is_internal_error;
}
int sort_SAM_create(SAM_sort_writer * writer, char * output_file, char * tmp_path)
@@ -4984,7 +5048,21 @@ int sort_SAM_create(SAM_sort_writer * writer, char * output_file, char * tmp_pat
old_sig_INT = signal (SIGINT, SAM_SORT_SIGINT_hook);
mac_or_rand_str(mac_rand);
- sprintf(writer -> tmp_path, "%s/temp-sort-%06u-%s-", tmp_path, getpid(), mac_rand);
+ if(tmp_path == NULL){
+ int slash_pos = 0;
+ for(slash_pos = strlen(output_file); slash_pos >=0; slash_pos--){
+ if(output_file[slash_pos]=='/')break;
+ }
+ if(slash_pos >= 0){
+ memcpy(writer -> tmp_path, output_file, slash_pos+1);
+ sprintf(writer -> tmp_path + slash_pos+1, "temp-sort-%06u-%s-", getpid(), mac_rand);
+ }else sprintf(writer -> tmp_path, "./temp-sort-%06u-%s-", getpid(), mac_rand);
+
+ }else sprintf(writer -> tmp_path, "%s/temp-sort-%06u-%s-", tmp_path, getpid(), mac_rand);
+
+ //#warning " >>>>>>>>>>>>>>>> REMOVE THE NEXT LINE <<<<<<<<<<<<<<<<<<<< "
+ //SUBREADprintf("TMP_SORT=%s FROM %s\n", writer -> tmp_path, output_file);
+
_SAMSORT_SNP_delete_temp_prefix = writer -> tmp_path;
sprintf(tmp_fname, "%s%s", writer -> tmp_path, "headers.txt");
@@ -5028,9 +5106,9 @@ void find_tag_out(char * read_line_buf, char * tag, char * hi_tag_out)
}
-void sort_SAM_finalise(SAM_sort_writer * writer)
+int sort_SAM_finalise(SAM_sort_writer * writer)
{
- int x1_chunk, x1_block;
+ int x1_chunk, x1_block, is_disk_full = 0;
int xk1;
for(xk1=0;xk1<SAM_SORT_BLOCKS;xk1++)
{
@@ -5165,7 +5243,8 @@ void sort_SAM_finalise(SAM_sort_writer * writer)
fputs(read_name_buf, writer->out_fp);
putc('\t', writer->out_fp);
- fputs(read_line_buf, writer->out_fp);
+ int write_len = fputs(read_line_buf, writer->out_fp);
+ if(write_len < 0) is_disk_full = 1;
read_name_buf[strlen(read_name_buf)]='\t';
HashTableRemove(first_read_name_table, read_name_buf);
@@ -5224,7 +5303,8 @@ void sort_SAM_finalise(SAM_sort_writer * writer)
fprintf(writer->out_fp, "%s\t%d\t*\t0\t0\t*\t%s\t%d\t0\tN\tI%s%s\n", read_name_buf, dummy_flags, dummy_mate_chr_buf, dummy_mate_pos, nh_tag_out, hi_tag_out);
fputs(read_name_buf, writer->out_fp);
putc('\t', writer->out_fp);
- fputs(read_line_buf, writer->out_fp);
+ int write_len = fputs(read_line_buf, writer->out_fp);
+ if(write_len < 0) is_disk_full = 1;
writer -> unpaired_reads +=1;
}
@@ -5324,6 +5404,7 @@ void sort_SAM_finalise(SAM_sort_writer * writer)
fclose(writer -> out_fp);
signal (SIGTERM, old_sig_TERM);
signal (SIGINT, old_sig_INT);
+ return is_disk_full;
}
void sort_SAM_check_chunk(SAM_sort_writer * writer)
@@ -5346,10 +5427,15 @@ void sort_SAM_check_chunk(SAM_sort_writer * writer)
// line_len = strlen(SAM_line)
int sort_SAM_add_line(SAM_sort_writer * writer, char * SAM_line, int line_len)
{
+ int is_disk_full = 0;
assert(writer -> all_chunks_header_fp);
if(line_len<3) return 0;
- if(SAM_line[0]=='@')
- fputs(SAM_line, writer -> out_fp);
+ if(SAM_line[0]=='@'){
+ int wlen = fputs(SAM_line, writer -> out_fp);
+ if(wlen < 0){
+ return -2;
+ }
+ }
else
{
char read_name[MAX_READ_NAME_LEN + MAX_CHROMOSOME_NAME_LEN * 2 + 26];
@@ -5493,14 +5579,15 @@ int sort_SAM_add_line(SAM_sort_writer * writer, char * SAM_line, int line_len)
fwrite(&read_name_len, 2, 1, writer -> current_block_fp_array[block_id]);
fwrite(read_name, 1, read_name_len, writer -> current_block_fp_array[block_id]);
fwrite(&line_len, 2, 1, writer -> current_block_fp_array[block_id]);
- fwrite(second_col_pos, 1, line_len, writer -> current_block_fp_array[block_id]);
+ int write_len = fwrite(second_col_pos, 1, line_len, writer -> current_block_fp_array[block_id]);
+ if(write_len < line_len)is_disk_full = -2;
writer -> output_file_size += line_len;
writer -> current_chunk_size += line_len;
writer -> written_reads ++;
}
- return 0;
+ return is_disk_full;
}
int is_SAM_unsorted(char * SAM_line, char * tmp_read_name, short * tmp_flag, unsigned long long int read_no)
@@ -5652,6 +5739,28 @@ char * fgets_noempty(char * buf, int maxlen, FILE * fp)
}
}
+int is_comment_line(const char * l, int file_type, unsigned int lineno)
+{
+ int tabs = 0, xk1 = 0;
+ if(l[0]=='#') return 1;
+
+ if(isalpha(l[0]) && file_type == FILE_TYPE_RSUBREAD)
+ {
+ char target_chr[16];
+ memcpy(target_chr, l, 16);
+ for(xk1=0; xk1<16; xk1++)
+ target_chr[xk1] = tolower(target_chr[xk1]);
+
+ if(memcmp(target_chr, "geneid\tchr\tstart",16)==0) return 1;
+ }
+
+ xk1=0;
+ while(l[xk1]) tabs += (l[xk1++] == '\t');
+
+ return tabs < ((file_type == FILE_TYPE_GTF)?8:4);
+}
+
+
int probe_file_type_fast(char * fname){
FILE * fp = f_subr_open(fname, "rb");
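Note on the input-files.c hunks above: they start checking the return values of fputs and fwrite so that a full disk is reported rather than silently ignored. The sketch below illustrates that pattern in isolation. The helper names (checked_fputs, checked_fwrite, checked_fclose) are hypothetical and do not exist in Subread. The close-time check is an added assumption, on the grounds that a buffered stream may only surface a write error when it is flushed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Subread): write a string and report
 * failure. fputs returns EOF on error and a non-negative value otherwise. */
static int checked_fputs(const char *s, FILE *fp)
{
    assert(s != NULL && fp != NULL);
    return fputs(s, fp) == EOF ? -1 : 0;
}

/* Hypothetical helper: write len bytes; a short write counts as failure.
 * fwrite returns the number of items written, here items of size 1. */
static int checked_fwrite(const void *buf, size_t len, FILE *fp)
{
    assert(buf != NULL && fp != NULL);
    return fwrite(buf, 1, len, fp) == len ? 0 : -1;
}

/* Hypothetical helper: close the stream and report any error seen on it,
 * including one raised while flushing buffered data during fclose. */
static int checked_fclose(FILE *fp)
{
    assert(fp != NULL);
    int failed = ferror(fp);          /* error recorded by earlier writes */
    if (fclose(fp) != 0) failed = 1;  /* error while flushing or closing */
    return failed ? -1 : 0;
}
```

A caller would typically accumulate these results in a flag, much as the patch does with is_disk_full, and report the problem once after the output file has been closed.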
diff --git a/src/input-files.h b/src/input-files.h
index 8e8de89..46cedaa 100644
--- a/src/input-files.h
+++ b/src/input-files.h
@@ -36,6 +36,7 @@
#define GENE_INPUT_SAM_PAIR_2 95
+#define MAX_LINE_LENGTH 3000
#define MIN_FILE_POINTERS_ALLOWED 50
#define FILE_TYPE_SAM 50
@@ -49,6 +50,8 @@
#define FILE_TYPE_UNKNOWN 999
#define FILE_TYPE_EMPTY 999990
#define FILE_TYPE_NONEXIST 999999
+#define FILE_TYPE_RSUBREAD 10
+#define FILE_TYPE_GTF 100
#define FEATURECOUNTS_BUFFER_SIZE (1024*1024*12)
@@ -138,7 +141,8 @@ typedef struct {
int input_chunk_no;
int input_buff_SBAM_size;
int input_buff_BIN_size;
- char tmp_file_prefix[MAX_FILE_NAME_LENGTH];
+ char tmp_file_prefix[MAX_FILE_NAME_LENGTH+1];
+ char in_file_name[MAX_FILE_NAME_LENGTH+1];
SAM_pairer_thread_t * threads;
int BAM_header_parsed;
@@ -146,6 +150,7 @@ typedef struct {
unsigned int BAM_n_ref;
int is_unsorted_notified;
int is_incomplete_BAM;
+ int is_internal_error;
void (* reset_output_function) (void * pairer);
int (* output_function) (void * pairer, int thread_no, char * rname, char * bin1, char * bin2);
@@ -263,7 +268,7 @@ double guess_reads_density_format(char * fname, int is_sam, int * min_phred, int
FILE * get_temp_file_pointer(char *temp_file_name, HashTable* fp_table, int * close_immediately);
-void write_read_block_file(FILE *temp_fp , unsigned int read_number, char *read_name, int flags, char * chro, unsigned int pos, char *cigar, int mapping_quality, char *sequence , char *quality_string, int rl , int is_sequence_needed, char strand, unsigned short read_pos, unsigned short read_len);
+int write_read_block_file(FILE *temp_fp , unsigned int read_number, char *read_name, int flags, char * chro, unsigned int pos, char *cigar, int mapping_quality, char *sequence , char *quality_string, int rl , int is_sequence_needed, char strand, unsigned short read_pos, unsigned short read_len);
int get_read_block(char *chro, unsigned int pos, char *temp_file_suffix, chromosome_t *known_chromosomes, unsigned int * max_base_position);
int my_strcmp(const void * s1, const void * s2);
@@ -273,7 +278,7 @@ void destroy_cigar_event_table(HashTable * event_table);
int is_SAM_unsorted(char * SAM_line, char * tmp_read_name, short * tmp_flag, unsigned long long int read_no);
int sort_SAM_add_line(SAM_sort_writer * writer, char * SAM_line, int line_len);
-void sort_SAM_finalise(SAM_sort_writer * writer);
+int sort_SAM_finalise(SAM_sort_writer * writer);
int sort_SAM_create(SAM_sort_writer * writer, char * output_file, char * tmp_path);
void colorread2base(char * read_buffer, int read_len);
@@ -309,4 +314,5 @@ void SAM_pairer_writer_destroy( SAM_pairer_writer_main_t * bam_main ) ;
int SAM_pairer_iterate_int_tags(unsigned char * bin, int bin_len, char * tag_name, int * saved_value);
int SAM_pairer_warning_file_open_limit();
void *delay_realloc(void * old_pntr, size_t old_size, size_t new_size);
+int is_comment_line(const char * l, int file_type, unsigned int lineno);
#endif
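Note on is_comment_line, which this commit moves from readSummary.c into input-files.c and declares above: it treats a line as a comment when it starts with '#', when it is the SAF header row, or when it has too few tab-separated columns. The sketch below restates that rule under a few assumptions. The names looks_like_comment, starts_with_nocase and the SKETCH_TYPE_* constants are hypothetical, with values chosen to mirror FILE_TYPE_RSUBREAD and FILE_TYPE_GTF. The header comparison stops at the end of the line instead of copying a fixed 16 bytes. It is not the upstream implementation.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical constants mirroring FILE_TYPE_RSUBREAD and FILE_TYPE_GTF. */
enum { SKETCH_TYPE_SAF = 10, SKETCH_TYPE_GTF = 100 };

/* Case-insensitive prefix test; prefix must already be lower case.
 * Stops at the end of line, so short lines are never over-read. */
static int starts_with_nocase(const char *line, const char *prefix)
{
    for (; *prefix; line++, prefix++) {
        if (*line == '\0') return 0;
        if (tolower((unsigned char)*line) != *prefix) return 0;
    }
    return 1;
}

/* Returns 1 if the annotation line should be skipped, 0 otherwise. */
static int looks_like_comment(const char *line, int file_type)
{
    assert(line != NULL);
    if (line[0] == '#') return 1;

    /* SAF files may begin with a "GeneID<TAB>Chr<TAB>Start..." header row. */
    if (file_type == SKETCH_TYPE_SAF &&
        starts_with_nocase(line, "geneid\tchr\tstart"))
        return 1;

    /* GTF needs at least 8 tabs (9 columns); SAF at least 4 (5 columns). */
    int tabs = 0;
    for (const char *p = line; *p; p++) tabs += (*p == '\t');
    return tabs < (file_type == SKETCH_TYPE_GTF ? 8 : 4);
}
```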
diff --git a/src/makefile.version b/src/makefile.version
index f0e22ca..8b17cdd 100644
--- a/src/makefile.version
+++ b/src/makefile.version
@@ -1,4 +1,4 @@
-SUBREAD_VERSION_BASE=1.5.1
+SUBREAD_VERSION_BASE=1.5.2
SUBREAD_VERSION_DATE=$(SUBREAD_VERSION_BASE)-$(shell date +"%d%b%Y")
SUBREAD_VERSION="$(SUBREAD_VERSION_DATE)"
SUBREAD_VERSION="$(SUBREAD_VERSION_BASE)"
diff --git a/src/propmapped.c b/src/propmapped.c
index 064dae7..6d87cf8 100644
--- a/src/propmapped.c
+++ b/src/propmapped.c
@@ -248,7 +248,7 @@ FILE * get_FP_by_read_name(propMapped_context * context, char * read_name)
}
-void add_read_flags(propMapped_context * context, FILE * fp, char * read_name, unsigned short flags)
+int add_read_flags(propMapped_context * context, FILE * fp, char * read_name, unsigned short flags)
{
int x1;
int rname_len = strlen(read_name);
@@ -269,13 +269,17 @@ void add_read_flags(propMapped_context * context, FILE * fp, char * read_name, u
rname_len = strlen(read_name);
if(rname_len>250)
- return;
+ return -1;
unsigned char rname_len_char = (unsigned char)rname_len;
- fwrite(&rname_len_char,1,1,fp);
- fwrite(read_name,rname_len,1,fp);
- fwrite(&flags, 1,sizeof(short), fp);
+ int wlen = fwrite(&rname_len_char,1,1,fp);
+ if(wlen < 1) return -1;
+ wlen = fwrite(read_name,rname_len,1,fp);
+ if(wlen < 1) return -1;
+ wlen = fwrite(&flags, sizeof(short), 1, fp);
+ if(wlen < 1) return -1;
+ return 0;
}
int init_PE_sambam(propMapped_context * context)
@@ -283,7 +287,19 @@ int init_PE_sambam(propMapped_context * context)
char mac_rand[13];
mac_or_rand_str(mac_rand);
srand(time(NULL));
- sprintf(context->temp_file_prefix, "prpm-temp-sum-%06u-%s", getpid(), mac_rand);
+
+ int x1;
+ context->temp_file_prefix[0]=0;
+ for(x1 = strlen(context->output_file_name) ; x1 >=0; x1--){
+ if(context->output_file_name[x1]=='/'){
+ memcpy(context->temp_file_prefix, context->output_file_name, x1);
+ context->temp_file_prefix[x1] = 0;
+ }
+ }
+
+ if(context->temp_file_prefix[0]==0) strcpy(context->temp_file_prefix, "./");
+
+ sprintf(context->temp_file_prefix+strlen(context->temp_file_prefix), "/prpm-temp-sum-%06u-%s", getpid(), mac_rand);
_PROPMAPPED_delete_tmp_prefix = context -> temp_file_prefix;
signal (SIGTERM, PROPMAPPED_SIGINT_hook);
@@ -302,6 +318,7 @@ int split_PE_sambam(propMapped_context * context)
}
char * line_buffer = malloc(3000);
+ int is_error = 0;
while(1)
{
@@ -315,7 +332,8 @@ int split_PE_sambam(propMapped_context * context)
unsigned flags = atoi(flags_str);
FILE * fp = get_FP_by_read_name(context , read_name);
- add_read_flags(context, fp, read_name, flags);
+ is_error = add_read_flags(context, fp, read_name, flags);
+ if(is_error) break;
context -> all_records++;
}
@@ -323,7 +341,8 @@ int split_PE_sambam(propMapped_context * context)
free(line_buffer);
SamBam_fclose(in_fp);
- return 0;
+ if(is_error) SUBREADprintf("ERROR: Unable to write into the temporary file. Please check the disk space in the output directory.");
+ return is_error;
}
int prop_PE(propMapped_context * context)
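Note on the init_PE_sambam hunk above: the temporary-file prefix is now placed in the directory of the output file rather than in the current working directory. One way to express that idea is sketched below. It uses strrchr to find the last '/' and snprintf to bound the result, which differs from the loop in the patch. The function names dir_of_output and make_temp_prefix and the tag parameter (standing in for the PID and MAC/random suffix) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the directory part of out_path into dir.
 * Yields "." when out_path has no slash. Returns 0 on success, -1 if
 * dir is too small. */
static int dir_of_output(const char *out_path, char *dir, size_t dir_size)
{
    assert(out_path != NULL && dir != NULL);
    const char *slash = strrchr(out_path, '/');
    if (slash == NULL) {
        if (dir_size < 2) return -1;
        strcpy(dir, ".");
        return 0;
    }
    size_t len = (size_t)(slash - out_path);
    if (len == 0) len = 1;              /* "/file" lives in the root "/" */
    if (len + 1 > dir_size) return -1;
    memcpy(dir, out_path, len);
    dir[len] = '\0';
    return 0;
}

/* Hypothetical helper: build "<dir>/prpm-temp-sum-<tag>" next to the
 * output file. Returns 0 on success, -1 if the prefix does not fit. */
static int make_temp_prefix(const char *out_path, const char *tag,
                            char *prefix, size_t prefix_size)
{
    char dir[1024];
    if (dir_of_output(out_path, dir, sizeof dir) != 0) return -1;
    /* Avoid a doubled slash when the directory is the root. */
    const char *base = strcmp(dir, "/") == 0 ? "" : dir;
    int n = snprintf(prefix, prefix_size, "%s/prpm-temp-sum-%s", base, tag);
    return (n < 0 || (size_t)n >= prefix_size) ? -1 : 0;
}
```

Bounding the write with snprintf means an over-long output path is reported as an error instead of overflowing the prefix buffer.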
diff --git a/src/readSummary.c b/src/readSummary.c
index c759674..b5a0acc 100644
--- a/src/readSummary.c
+++ b/src/readSummary.c
@@ -22,12 +22,14 @@
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
+#include <stdarg.h>
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <ctype.h>
+
#ifndef MAKE_STANDALONE
#include <R.h>
#endif
@@ -52,11 +54,7 @@
/********************************************************************/
/********************************************************************/
/********************************************************************/
-#define FEATURE_NAME_LENGTH 256
#define CHROMOSOME_NAME_LENGTH 256
-#define MAX_LINE_LENGTH 3000
-#define FILE_TYPE_RSUBREAD 10
-#define FILE_TYPE_GTF 100
#define ALLOW_ALL_MULTI_MAPPING 1
#define ALLOW_PRIMARY_MAPPING 2
@@ -191,6 +189,8 @@ typedef struct {
int is_duplicate_ignored;
int is_first_read_reversed;
int is_second_read_straight;
+ int use_stdin_file;
+ int disk_is_full;
int do_not_sort;
int reduce_5_3_ends_to_one;
int isCVersion;
@@ -545,8 +545,10 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
{
char * tmp_ptr1 = NULL , * next_fn, *sam_used = malloc(strlen(sam)+300), sam_ntxt[30],bam_ntxt[30], next_ntxt[50];
int nfiles=1, nBAMfiles = 0, nNonExistFiles = 0;
+ char MAC_or_random[13];
+ mac_or_rand_str(MAC_or_random);
- sprintf(sam_used, "%s/featureCounts_test_file_writable.tmp", global_context -> temp_file_dir);
+ sprintf(sam_used, "%s/featureCounts_test_file_writable-%06d-%s.tmp", global_context -> temp_file_dir, getpid(), MAC_or_random);
FILE * fp = fopen(sam_used,"w");
if(fp){
fclose(fp);
@@ -570,7 +572,11 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
if(BAM_header_size>0) global_context -> max_BAM_header_size = max( global_context -> max_BAM_header_size , BAM_header_size + 180000);
if(file_probe==-1){
nNonExistFiles++;
- SUBREADprintf("\nERROR: invalid parameter: '%s'\n\n", next_fn);
+ if(global_context -> use_stdin_file){
+ SUBREADprintf("\nERROR: no valid SAM or BAM file is received from <STDIN>\n\n");
+ }else{
+ SUBREADprintf("\nERROR: invalid parameter: '%s'\n\n", next_fn);
+ }
return 1;
}
if(file_probe == 1) nBAMfiles++;
@@ -610,12 +616,14 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
else if(is_first_read_PE == 1) file_chr = 'P';
//file_chr = 'o';
- print_in_box(94,0,0," %c[32m%c%c[36m %s%c[0m",CHAR_ESC, file_chr,CHAR_ESC, next_fn,CHAR_ESC);
+ print_in_box(94,0,0," %c[32m%c%c[36m %s%c[0m",CHAR_ESC, file_chr,CHAR_ESC, global_context -> use_stdin_file?"<STDIN>":next_fn,CHAR_ESC);
nfiles++;
}
(*n_input_files) = nfiles;
print_in_box(80,0,0,"");
+
+ #ifdef MAKE_STANDALONE
print_in_box(80,0,0," Output file : %s", out);
print_in_box(80,0,0," Summary : %s.summary", out);
print_in_box(80,0,0," Annotation : %s (%s)", annot, is_GTF?"GTF":"SAF");
@@ -624,14 +632,19 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
print_in_box(80,0,0," (Note that files are saved to the output directory)");
print_in_box(80,0,0,"");
}
+
if(global_context -> do_junction_counting)
print_in_box(80,0,0," Junction Counting : <output_file>.jcounts");
+ #endif
if(global_context -> alias_file_name[0])
print_in_box(80,0,0," Chromosome alias file : %s", global_context -> alias_file_name);
+
print_in_box(80,0,0," Dir for temp files : %s", global_context->temp_file_dir);
+ #ifdef MAKE_STANDALONE
print_in_box(80,0,0,"");
+ #endif
print_in_box(80,0,0," Threads : %d", global_context->thread_number);
print_in_box(80,0,0," Level : %s level", global_context->is_gene_level?"meta-feature":"feature");
print_in_box(80,0,0," Paired-end : %s", global_context->is_paired_end_mode_assign?"yes":"no");
@@ -646,16 +659,15 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
if(global_context->is_multi_mapping_allowed == ALLOW_PRIMARY_MAPPING)
multi_mapping_allow_mode = "primary only";
else if(global_context->is_multi_mapping_allowed == ALLOW_ALL_MULTI_MAPPING)
- multi_mapping_allow_mode = global_context -> use_fraction_multi_mapping?"counted (as fractions)": "counted (as integer)";
+ multi_mapping_allow_mode = global_context -> use_fraction_multi_mapping?"counted": "counted";
print_in_box(80,0,0," Multimapping reads : %s", multi_mapping_allow_mode);
print_in_box(80,0,0,"Multi-overlapping reads : %s", global_context->is_multi_overlap_allowed?"counted":"not counted");
if(global_context -> is_split_or_exonic_only)
print_in_box(80,0,0," Split alignments : %s", (1 == global_context -> is_split_or_exonic_only)?"only split alignments":"only exonic alignments");
- if(global_context -> fragment_minimum_overlapping !=1)
- print_in_box(80,0,0," Overlapping bases : %d", global_context -> fragment_minimum_overlapping);
- if(global_context -> fractional_minimum_overlapping !=1)
- print_in_box(81,0,0," Overlapping bases : %0.1f%%%%", global_context -> fractional_minimum_overlapping*100);
+ print_in_box(80,0,0," Min overlapping bases : %d", global_context -> fragment_minimum_overlapping);
+ if(global_context -> fractional_minimum_overlapping > 0.000001)
+ print_in_box(81,0,0," Min overlapping frac. : %0.1f%%%%", global_context -> fractional_minimum_overlapping*100);
if(global_context -> five_end_extension || global_context -> three_end_extension)
print_in_box(80,0,0," Read extensions : %d on 5' and %d on 3' ends", global_context -> five_end_extension , global_context -> three_end_extension);
if(global_context -> reduce_5_3_ends_to_one)
@@ -688,10 +700,14 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
return 0;
}
-void print_FC_results(fc_thread_global_context_t * global_context)
+void print_FC_results(fc_thread_global_context_t * global_context, char * out)
{
print_in_box(89,0,1,"%c[36mRead assignment finished.%c[0m", CHAR_ESC, CHAR_ESC);
print_in_box(80,0,0,"");
+ #ifdef MAKE_STANDALONE
+ print_in_box(80,0,PRINT_BOX_WRAPPED,"Summary of counting results can be found in file \"%s\"", out);
+ print_in_box(80,0,0,"");
+ #endif
print_in_box(80,2,1,"http://subread.sourceforge.net/");
SUBREADputs("");
return;
@@ -727,28 +743,6 @@ int fc_strcmp(const void * s1, const void * s2)
return strcmp((char*)s1, (char*)s2);
}
-
-int is_comment_line(const char * l, int file_type, unsigned int lineno)
-{
- int tabs = 0, xk1 = 0;
- if(l[0]=='#') return 1;
-
- if(isalpha(l[0]) && file_type == FILE_TYPE_RSUBREAD)
- {
- char target_chr[16];
- memcpy(target_chr, l, 16);
- for(xk1=0; xk1<16; xk1++)
- target_chr[xk1] = tolower(target_chr[xk1]);
-
- if(memcmp(target_chr, "geneid\tchr\tstart",16)==0) return 1;
- }
-
- xk1=0;
- while(l[xk1]) tabs += (l[xk1++] == '\t');
-
- return tabs < ((file_type == FILE_TYPE_GTF)?8:4);
-}
-
void register_junc_feature(fc_thread_global_context_t *global_context, char * feature_name, char * chro, unsigned int start, unsigned int stop){
HashTable * gene_table = HashTableGet(global_context -> junction_features_table, chro);
//SUBREADprintf("REG %s : %p\n", chro, gene_table);
@@ -862,7 +856,7 @@ int load_feature_info(fc_thread_global_context_t *global_context, const char * a
while(1)
{
char * fgets_ret = fgets(file_line, MAX_LINE_LENGTH, fp);
- char * token_temp, *chro_name;
+ char * token_temp = NULL, *chro_name;
fc_chromosome_index_info * chro_stab;
unsigned int feature_pos = 0;
if(!fgets_ret) break;
@@ -926,7 +920,7 @@ int load_feature_info(fc_thread_global_context_t *global_context, const char * a
int is_gene_id_found = 0;
fgets(file_line, MAX_LINE_LENGTH, fp);
lineno++;
- char * token_temp;
+ char * token_temp = NULL;
if(is_comment_line(file_line, file_type, lineno-1))continue;
if(file_type == FILE_TYPE_RSUBREAD)
@@ -2199,10 +2193,10 @@ void parse_bin(SamBam_Reference_Info * sambam_chro_table, char * bin, char * bin
/*
typedef struct {
- char chromosome_name_left[CHROMOSOME_NAME_LENGTH + 1];
- char chromosome_name_right[CHROMOSOME_NAME_LENGTH + 1];
- unsigned int last_exon_base_left;
- unsigned int first_exon_base_right;
+ char chromosome_name_left[CHROMOSOME_NAME_LENGTH + 1];
+ char chromosome_name_right[CHROMOSOME_NAME_LENGTH + 1];
+ unsigned int last_exon_base_left;
+ unsigned int first_exon_base_right;
} fc_junction_info_t;
*/
@@ -2283,6 +2277,19 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
char ** hits_chro1, char ** hits_chro2, unsigned int * hits_start_pos1, unsigned int * hits_start_pos2, unsigned short * hits_length1, unsigned short * hits_length2,
int fixed_fractional_count, char * read_name);
+int writesize_fprint(fc_thread_global_context_t * global_context, FILE * fp, const char * pattern, ...){
+ int ret;
+ va_list args;
+ va_start(args , pattern);
+ assert(fp);
+
+ ret = vfprintf(fp, pattern , args);
+ va_end(args);
+ if(ret < 1) global_context -> disk_is_full = 1;
+ return ret;
+
+}
+
void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * bin1, char * bin2)
{
@@ -2340,7 +2347,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
thread_context->read_counters.unassigned_unmapped ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_Unmapped\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Unmapped\t*\t*\n", read_name);
return; // do nothing if a read is unmapped, or the first read in a pair of reads is unmapped.
}
@@ -2363,7 +2370,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
if(global_context -> SAM_output_fp)
{
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_MappingQuality\t*\tMapping_Quality=%d,%d\n", read_name, first_read_quality_score, mapping_qual);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_MappingQuality\t*\tMapping_Quality=%d,%d\n", read_name, first_read_quality_score, mapping_qual);
}
return;
}
@@ -2393,7 +2400,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
thread_context->read_counters.unassigned_fragmentlength ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_FragmentLength\t*\tLength=%ld\n", read_name, fragment_length);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_FragmentLength\t*\tLength=%ld\n", read_name, fragment_length);
return;
}
} else {
@@ -2401,7 +2408,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
thread_context->read_counters.unassigned_chimericreads ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_Chimera\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Chimera\t*\t*\n", read_name);
return;
}
}
@@ -2416,7 +2423,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
{
thread_context->read_counters.unassigned_duplicate ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_Duplicate\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Duplicate\t*\t*\n", read_name);
return;
}
@@ -2433,7 +2440,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
thread_context->read_counters.unassigned_multimapping ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_MultiMapping\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_MultiMapping\t*\t*\n", read_name);
return;
}
@@ -2447,7 +2454,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
thread_context->read_counters.unassigned_secondary ++;
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_Secondary\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Secondary\t*\t*\n", read_name);
return;
}
@@ -2476,7 +2483,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
if(skipped_for_exonic == 1 + global_context -> is_paired_end_mode_assign){
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
thread_context->read_counters.unassigned_junction_condition ++;
return;
@@ -2486,7 +2493,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
if(global_context->is_split_or_exonic_only == 2 && is_junction_read) {
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
thread_context->read_counters.unassigned_junction_condition ++;
return;
}
@@ -2817,7 +2824,7 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
{
int final_gene_number = global_context -> exontable_geneid[hit_exon_id];
unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- fprintf(global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
}
thread_context->read_counters.assigned_reads ++;
} else if(global_context -> need_calculate_overlap_len == 0 && nhits2 == 1 && nhits1 == 1 && hits_indices2[0]==hits_indices1[0]) {
@@ -2828,7 +2835,7 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
{
int final_gene_number = global_context -> exontable_geneid[hit_exon_id];
unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- fprintf(global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
}
thread_context->read_counters.assigned_reads ++;
} else {
@@ -2951,7 +2958,7 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
score_x1_key = global_context -> exontable_geneid[ scoring_exon_ids[score_x1] ];
else score_x1_key = scoring_exon_ids[score_x1] ;
- //fprintf(stderr, "Q222KEY: exon=%ld, gene=%ld\n", scoring_exon_ids[score_x1] , score_x1_key );
+ //writesize_fprint(global_context,stderr, "Q222KEY: exon=%ld, gene=%ld\n", scoring_exon_ids[score_x1] , score_x1_key );
if( score_x1_key == score_merge_key ){
if((scoring_flags[score_x1] & ( ends?2:1 )) == 0) {
scoring_flags[score_x1] |= (ends?2:1);
@@ -2979,6 +2986,7 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
int maximum_total_count = 0;
int maximum_score_x1 = 0;
int applied_fragment_minimum_overlapping = 1;
+ int overlapping_total_count = 0;
if( global_context -> fragment_minimum_overlapping > 1 || global_context -> need_calculate_fragment_len){
applied_fragment_minimum_overlapping = max( global_context -> fragment_minimum_overlapping, global_context -> fractional_minimum_overlapping * ( total_frag_len) );
@@ -3001,11 +3009,12 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
maximum_score_x1 = score_x1;
}else if( maximum_score == scoring_numbers[score_x1] )
maximum_total_count++;
+ overlapping_total_count ++;
}
if(maximum_total_count == 0){
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_NoFeatures\t*\t*\n", read_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_NoFeatures\t*\t*\n", read_name);
thread_context->read_counters.unassigned_nofeatures ++;
}else{
@@ -3021,9 +3030,9 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
int final_gene_number = global_context -> exontable_geneid[max_exon_id];
unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
if(scoring_count>1)
- fprintf(global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1;%s/Targets=%d/%d\n", read_name, final_feture_name, global_context -> use_overlapping_break_tie? "MaximumOverlapping":"Votes", maximum_score, scoring_count);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1;%s/Targets=%d/%d\n", read_name, final_feture_name, global_context -> use_overlapping_break_tie? "MaximumOverlapping":"Votes", maximum_score, scoring_count);
else
- fprintf(global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
}
thread_context->read_counters.assigned_reads ++;
}else if(global_context -> is_multi_overlap_allowed) {
@@ -3034,10 +3043,14 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
{
// This change was made on 31/MAR/2016
- if( scoring_numbers[xk1] < maximum_score ) continue ;
+ if( scoring_numbers[xk1] < 1 ) continue ;
+ if( scoring_numbers[xk1] < maximum_score && global_context -> use_overlapping_break_tie ) continue ;
long tmp_voter_id = scoring_exon_ids[xk1];
- thread_context->count_table[tmp_voter_id] += calculate_multi_overlap_fraction(global_context, fixed_fractional_count, maximum_total_count);
+ //if(1 && FIXLENstrcmp( read_name , "V0112_0155:7:1101:5467:23779#ATCACG" )==0)
+ // SUBREADprintf("CountsFrac = %d ; add=%d\n", overlapping_total_count, calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count) );
+
+ thread_context->count_table[tmp_voter_id] += calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count);
if(global_context -> SAM_output_fp)
{
@@ -3058,12 +3071,12 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
int ffnn = strlen(final_feture_names);
if(ffnn>0) final_feture_names[ffnn-1]=0;
// overlapped but still assigned
- fprintf(global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=%d\n", read_name, final_feture_names, assigned_no);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=%d\n", read_name, final_feture_names, assigned_no);
}
thread_context->read_counters.assigned_reads ++;
} else {
if(global_context -> SAM_output_fp)
- fprintf(global_context -> SAM_output_fp,"%s\tUnassigned_Ambiguity\t*\tNumber_Of_Overlapped_Genes=%d\n", read_name, maximum_total_count);
+ writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Ambiguity\t*\tNumber_Of_Overlapped_Genes=%d\n", read_name, maximum_total_count);
thread_context->read_counters.unassigned_ambiguous ++;
}
@@ -3169,47 +3182,6 @@ void fc_thread_merge_results(fc_thread_global_context_t * global_context, read_c
print_in_box(80,0,0,"");
}
-HashTable * load_alias_table(char * fname)
-{
- FILE * fp = f_subr_open(fname, "r");
- if(!fp)
- {
- print_in_box(80,0,0,"WARNING unable to open alias file '%s'", fname);
- return NULL;
- }
-
- char * fl = malloc(2000);
-
- HashTable * ret = HashTableCreate(1013);
- HashTableSetDeallocationFunctions(ret, free, free);
- HashTableSetKeyComparisonFunction(ret, fc_strcmp);
- HashTableSetHashFunction(ret, fc_chro_hash);
-
- while (1)
- {
- char *ret_fl = fgets(fl, 1999, fp);
- if(!ret_fl) break;
- if(fl[0]=='#') continue;
- char * sam_chr = NULL;
- char * anno_chr = strtok_r(fl, ",", &sam_chr);
- if((!sam_chr)||(!anno_chr)) continue;
-
- sam_chr[strlen(sam_chr)-1]=0;
- char * anno_chr_buf = malloc(strlen(anno_chr)+1);
- strcpy(anno_chr_buf, anno_chr);
- char * sam_chr_buf = malloc(strlen(sam_chr)+1);
- strcpy(sam_chr_buf, sam_chr);
-
- //printf("ALIAS: %s -> %s\n", sam_chr, anno_chr);
- HashTablePut(ret, sam_chr_buf, anno_chr_buf);
- }
-
- fclose(fp);
-
- free(fl);
- return ret;
-}
-
void get_temp_dir_from_out(char * tmp, char * out){
char * slash = strrchr(out,'/');
if(NULL == slash){
@@ -3220,7 +3192,37 @@ void get_temp_dir_from_out(char * tmp, char * out){
}
}
-void fc_thread_init_global_context(fc_thread_global_context_t * global_context, unsigned int buffer_size, unsigned short threads, int line_length , int is_PE_data, int min_pe_dist, int max_pe_dist, int is_gene_level, int is_overlap_allowed, int is_strand_checked, char * output_fname, int is_sam_out, int is_both_end_required, int is_chimertc_disallowed, int is_PE_distance_checked, char *feature_name_column, char * gene_id_column, int min_map_qual_score, int is_multi_mapping_allowed, int i [...]
+void fc_thread_init_input_files(fc_thread_global_context_t * global_context, char * in_fnames, char ** out_ptr ){
+ if(global_context -> use_stdin_file){
+ #ifdef MAKE_STANDALONE
+
+ char MAC_or_random[13];
+
+ (*out_ptr) = malloc(300);
+ mac_or_rand_str(MAC_or_random);
+ sprintf(*out_ptr, "%s/temp-core-%06u-%s.sam", global_context -> temp_file_dir, getpid(), MAC_or_random);
+
+ SUBREADprintf("\nReading data from <STDIN> for featureCounts ...\n\n");
+
+ FILE * ifp = fopen(*out_ptr,"w");
+ while(1){
+ char nchar[100];
+ int rlen = fread(nchar, 1, 100, stdin);
+ if(rlen > 0) fwrite(nchar, 1, rlen, ifp);
+ else break;
+ //if(rlen < 100)break;
+ }
+ fclose(ifp);
+
+ #endif
+ }else{
+ (*out_ptr) = malloc(strlen(in_fnames)+1);
+ strcpy((*out_ptr), in_fnames);
+ }
+
+}
+
+void fc_thread_init_global_context(fc_thread_global_context_t * global_context, unsigned int buffer_size, unsigned short threads, int line_length , int is_PE_data, int min_pe_dist, int max_pe_dist, int is_gene_level, int is_overlap_allowed, int is_strand_checked, char * output_fname, int is_sam_out, int is_both_end_required, int is_chimertc_disallowed, int is_PE_distance_checked, char *feature_name_column, char * gene_id_column, int min_map_qual_score, int is_multi_mapping_allowed, int i [...]
{
int x1;
@@ -3243,6 +3245,7 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
global_context -> is_multi_mapping_allowed = is_multi_mapping_allowed;
global_context -> is_split_or_exonic_only = is_split_or_exonic_only;
global_context -> is_duplicate_ignored = is_duplicate_ignored;
+ global_context -> use_stdin_file = use_stdin_file;
//global_context -> is_first_read_reversed = (pair_orientations[0]=='r');
//global_context -> is_second_read_straight = (pair_orientations[1]=='f');
@@ -3434,13 +3437,16 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
thread_args[1] = & global_context -> thread_contexts[xk1];
}
- char rand_prefix[500];
+ char rand_prefix[300];
+ char new_fn[300];
char MAC_or_random[13];
mac_or_rand_str(MAC_or_random);
sprintf(rand_prefix, "%s/temp-core-%06u-%s.sam", global_context -> temp_file_dir, getpid(), MAC_or_random);
+ if(global_context -> use_stdin_file) sprintf(new_fn, "<%s", global_context -> input_file_name );
+ else sprintf(new_fn, "%s", global_context -> input_file_name );
//#warning "REMOVE ' * 2 ' FROM NEXT LINE !!!!!!"
- SAM_pairer_create(&global_context -> read_pairer, global_context -> thread_number , global_context -> max_BAM_header_size/1024/1024+2, !global_context-> is_SAM_file, 1, !global_context -> is_paired_end_mode_assign, global_context ->is_paired_end_mode_assign && global_context -> do_not_sort ,0, global_context -> input_file_name, process_pairer_reset, process_pairer_header, process_pairer_output, rand_prefix, global_context);
+ SAM_pairer_create(&global_context -> read_pairer, global_context -> thread_number , global_context -> max_BAM_header_size/1024/1024+2, !global_context-> is_SAM_file, 1, !global_context -> is_paired_end_mode_assign, global_context ->is_paired_end_mode_assign && global_context -> do_not_sort ,0, new_fn, process_pairer_reset, process_pairer_header, process_pairer_output, rand_prefix, global_context);
SAM_pairer_set_unsorted_notification(&global_context -> read_pairer, pairer_unsorted_notification);
return 0;
@@ -3518,7 +3524,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
if(!next_fn||strlen(next_fn)<1) break;
if(column_numbers[i_files])
{
- fprintf(fp_out,"\t%s", next_fn);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
non_empty_files ++;
}
next_fn = strtok_r(NULL, ";", &tmp_ptr);
@@ -3578,6 +3584,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
char *is_occupied = malloc(longest_gene_exons);
unsigned int * input_start_stop_list = malloc(longest_gene_exons * sizeof(int) * 2);
unsigned int * output_start_stop_list = malloc(longest_gene_exons * sizeof(int) * 2);
+ int disk_is_full = 0;
char * out_chr_list = malloc(longest_gene_exons * (1+global_context -> longest_chro_name) + 1), * tmp_chr_list = NULL;
char * out_start_list = malloc(11 * longest_gene_exons + 1), * tmp_start_list = NULL;
@@ -3655,7 +3662,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
_cut_tail(out_end_list);
_cut_tail(out_strand_list);
- fprintf(fp_out, "%s\t%s\t%s\t%s\t%s\t%d" , gene_symbol, out_chr_list, out_start_list, out_end_list, out_strand_list, gene_nonoverlap_len);
+ int wlen = fprintf(fp_out, "%s\t%s\t%s\t%s\t%s\t%d" , gene_symbol, out_chr_list, out_start_list, out_end_list, out_strand_list, gene_nonoverlap_len);
// all exons: gene_exons_number[xk1] : gene_exons_pointer[xk1]
int non_empty_file_index = 0;
@@ -3675,7 +3682,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
}
}
fprintf(fp_out,"\n");
-
+ if(wlen < 6)disk_is_full = 1;
}
free(is_occupied);
free(input_start_stop_list);
@@ -3693,12 +3700,17 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
free(gene_exons_end);
free(gene_exons_strand);
fclose(fp_out);
+
+ if(disk_is_full){
+ SUBREADprintf("ERROR: disk is full; the count file cannot be generated.\n");
+ unlink(out_file);
+ }
}
void fc_write_final_counts(fc_thread_global_context_t * global_context, const char * out_file, int nfiles, char * file_list, read_count_type_t ** column_numbers, fc_read_counters *read_counters, int isCVersion)
{
char fname[300];
- int i_files, xk1;
+ int i_files, xk1, disk_is_full = 0;
sprintf(fname, "%s.summary", out_file);
FILE * fp_out = f_subr_open(fname,"w");
@@ -3715,7 +3727,7 @@ void fc_write_final_counts(fc_thread_global_context_t * global_context, const ch
{
if(!next_fn||strlen(next_fn)<1) break;
if(column_numbers[i_files])
- fprintf(fp_out,"\t%s", next_fn);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
next_fn += strlen(next_fn)+1;
}
@@ -3733,17 +3745,24 @@ void fc_write_final_counts(fc_thread_global_context_t * global_context, const ch
if(column_numbers[i_files])
fprintf(fp_out,"\t%llu", *cntr);
}
- fprintf(fp_out,"\n");
+ int wlen = fprintf(fp_out,"\n");
+ if(wlen < 1)disk_is_full = 1;
}
fclose(fp_out);
+
+ if(disk_is_full){
+ SUBREADprintf("ERROR: disk is full; the count file cannot be generated.\n");
+ unlink(out_file);
+ }
+
}
void fc_write_final_results(fc_thread_global_context_t * global_context, const char * out_file, int features, read_count_type_t ** column_numbers, char * file_list, int n_input_files, fc_feature_info_t * loaded_features, int header_out)
{
/* save the results */
FILE * fp_out;
- int i, i_files = 0;
+ int i, i_files = 0, disk_is_full =0;
fp_out = f_subr_open(out_file,"w");
if(!fp_out){
SUBREADprintf("Failed to create file %s\n", out_file);
@@ -3766,7 +3785,7 @@ void fc_write_final_results(fc_thread_global_context_t * global_context, const c
while(1){
if(!next_fn||strlen(next_fn)<1) break;
if(column_numbers[i_files])
- fprintf(fp_out,"\t%s", next_fn);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
next_fn = strtok_r(NULL, ";", &tmp_ptr);
i_files++;
}
@@ -3794,10 +3813,15 @@ void fc_write_final_results(fc_thread_global_context_t * global_context, const c
}
}
- fprintf(fp_out,"\n");
+ int wlen = fprintf(fp_out,"\n");
+ if(wlen < 1)disk_is_full = 1;
}
fclose(fp_out);
+ if(disk_is_full){
+ SUBREADprintf("ERROR: disk is full; unable to write into the output file.\n");
+ unlink(out_file);
+ }
}
static struct option long_options[] =
@@ -3828,29 +3852,35 @@ void print_usage()
SUBREADprintf("\nVersion %s\n\n", SUBREAD_VERSION);
SUBREADputs("Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... \n");
- SUBREADputs("## Required arguments:");
+ SUBREADputs("## Mandatory arguments:");
SUBREADputs("");
SUBREADputs(" -a <string> Name of an annotation file. GTF/GFF format by default.");
- SUBREADputs(" See -F option for more formats.");
+ SUBREADputs(" See -F option for more format information. Inbuilt");
+ SUBREADputs(" annotations (SAF format) is available in 'annotation'");
+ SUBREADputs(" directory of the package.");
SUBREADputs("");
SUBREADputs(" -o <string> Name of the output file including read counts. A separate");
SUBREADputs(" file including summary statistics of counting results is");
- SUBREADputs(" also included in the output (`<string>.summary')");
+ SUBREADputs(" also included in the output ('<string>.summary')");
SUBREADputs("");
- SUBREADputs(" input_file1 [input_file2] ... A list of SAM or BAM format files.");
+ SUBREADputs(" input_file1 [input_file2] ... A list of SAM or BAM format files. They can be");
+ SUBREADputs(" either name or location sorted. If not files provided,");
+ SUBREADputs(" <stdin> input is expected.");
SUBREADputs("");
- SUBREADputs("## Options:");
+
+ SUBREADputs("## Optional arguments:");
SUBREADputs("# Annotation");
SUBREADputs("");
- SUBREADputs(" -F <string> Specify format of provided annotation file. Acceptable");
- SUBREADputs(" formats include `GTF/GFF' and `SAF'. `GTF/GFF' by default.");
- SUBREADputs(" See Users Guide for description of SAF format.");
+ SUBREADputs(" -F <string> Specify format of the provided annotation file. Acceptable");
+ SUBREADputs(" formats include 'GTF' (or compatible GFF format) and");
+ SUBREADputs(" 'SAF'. 'GTF' by default. For SAF format, please refer to");
+ SUBREADputs(" Users Guide.");
SUBREADputs("");
- SUBREADputs(" -t <string> Specify feature type in GTF annotation. `exon' by ");
+ SUBREADputs(" -t <string> Specify feature type in GTF annotation. 'exon' by ");
SUBREADputs(" default. Features used for read counting will be ");
SUBREADputs(" extracted from annotation using the provided value.");
SUBREADputs("");
- SUBREADputs(" -g <string> Specify attribute type in GTF annotation. `gene_id' by ");
+ SUBREADputs(" -g <string> Specify attribute type in GTF annotation. 'gene_id' by ");
SUBREADputs(" default. Meta-features used for read counting will be ");
SUBREADputs(" extracted from annotation using the provided value.");
SUBREADputs("");
@@ -3880,8 +3910,8 @@ void print_usage()
SUBREADputs(" end. If a negative value is provided, then a gap of up");
SUBREADputs(" to specified size will be allowed between read and the");
SUBREADputs(" feature that the read is assigned to.");
- SUBREADputs("");
- SUBREADputs(" --fracOverlap <value> Minimum fraction of overlapping bases in a read that is");
+ SUBREADputs("");
+ SUBREADputs(" --fracOverlap <float> Minimum fraction of overlapping bases in a read that is");
SUBREADputs(" required for read assignment. Value should be within range");
SUBREADputs(" [0,1]. 0 by default. Number of overlapping bases is");
SUBREADputs(" counted from both reads if paired end. Both this option");
@@ -3906,7 +3936,7 @@ void print_usage()
SUBREADputs("");
SUBREADputs(" -M Multi-mapping reads will also be counted. For a multi-");
SUBREADputs(" mapping read, all its reported alignments will be ");
- SUBREADputs(" counted. The `NH' tag in BAM/SAM input is used to detect ");
+ SUBREADputs(" counted. The 'NH' tag in BAM/SAM input is used to detect ");
SUBREADputs(" multi-mapping reads.");
SUBREADputs("");
SUBREADputs("# Fractional counting");
@@ -3920,7 +3950,7 @@ void print_usage()
SUBREADputs(" is specified, each overlapping feature will receive a");
SUBREADputs(" fractional count of 1/y, where y is the total number of");
SUBREADputs(" features overlapping with the read. When both '-M' and");
- SUBREADputs(" '-O' are specified, each alignment will carry a fraction");
+ SUBREADputs(" '-O' are specified, each alignment will carry a fractional");
SUBREADputs(" count of 1/(x*y).");
SUBREADputs("");
@@ -3974,8 +4004,7 @@ void print_usage()
SUBREADputs(" instead of reads. This option is only applicable for");
SUBREADputs(" paired-end reads.");
SUBREADputs("");
- SUBREADputs(" -B Count read pairs that have both ends successfully aligned ");
- SUBREADputs(" only.");
+ SUBREADputs(" -B Only count read pairs that have both ends aligned.");
SUBREADputs("");
SUBREADputs(" -P Check validity of paired-end distance when counting read ");
SUBREADputs(" pairs. Use -d and -D to set thresholds.");
@@ -4120,7 +4149,7 @@ int junccmp(fc_junction_gene_t * j1, fc_junction_gene_t * j2){
void fc_write_final_junctions(fc_thread_global_context_t * global_context, char * output_file_name, read_count_type_t ** table_columns, char * input_file_names, int n_input_files, HashTable ** junction_global_table_list, HashTable ** splicing_global_table_list){
- int infile_i;
+ int infile_i, disk_is_full = 0;
HashTable * merged_junction_table = HashTableCreate(156679);
@@ -4210,7 +4239,7 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
{
if(!next_fn||strlen(next_fn)<1) break;
if(table_columns[infile_i])
- fprintf(ofp,"\t%s", next_fn);
+ fprintf(ofp,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
next_fn += strlen(next_fn)+1;
}
@@ -4243,7 +4272,7 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
else if(donor[0]=='C' && donor[1]=='T' && receptor[0]=='A' && receptor[1]=='C') strand = "-";
}else if(!global_context ->is_junction_no_chro_shown){
global_context ->is_junction_no_chro_shown = 1;
- print_in_box(80,0,0, " WARNING contig `%s' is not found in the", chro_small);
+ print_in_box(80,0,0, " WARNING contig '%s' is not found in the", chro_small);
print_in_box(80,0,0, " provided genome file!");
print_in_box(80,0,0,"");
@@ -4269,6 +4298,8 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
junction_support_list[ ky_i2 ] ++;
junction_source_list[ky_i2] |= ( (ky_i1 < found_features_small)? 1 : 2 );
is_duplicate = 1;
+
+ max_supp = max(junction_support_list[ky_i2], max_supp);
break;
}
}
@@ -4304,7 +4335,7 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
for(ky_i2 = 0; ky_i2 < unique_junctions; ky_i2 ++){
fc_junction_gene_t * tested_key = junction_key_list[ky_i2];
- if(tested_key != NULL && tested_key -> pos_first_base < smallest_coordinate_gene){
+ if(tested_key != NULL && junction_support_list[ky_i2] == max_supp && tested_key -> pos_first_base < smallest_coordinate_gene){
primary_gene = tested_key;
smallest_coordinate_gene = tested_key -> pos_first_base;
}
@@ -4341,7 +4372,8 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
unsigned long count = HashTableGet(junction_global_table_list[infile_i] , key_list[ky_i]) - NULL;
fprintf(ofp,"\t%lu", count);
}
- fprintf(ofp, "\n");
+ int wlen = fprintf(ofp, "\n");
+ if(wlen < 1) disk_is_full = 1;
}
fclose(ofp);
free(junction_key_list);
@@ -4357,6 +4389,10 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
HashTableDestroy(merged_junction_table);
HashTableDestroy(merged_splicing_table);
+ if(disk_is_full){
+ unlink(outfname);
+ SUBREADprintf("ERROR: disk is full; no junction counting table is generated.\n");
+ }
}
char * get_short_fname(char * lname){
@@ -4424,15 +4460,16 @@ int readSummary(int argc,char *argv[]){
39: as.numeric(is_Restrictly_No_Overlapping) # when "1", disable the voting-based tie breaking (e.g., when the reads are paired-end and one gene receives two votes but the other gene only has one.). "0" by default.
40: as.numeric(min_Fractional_Overlap) # A fractioal number. 0.00 : at least 1 bp overlapping
41: temp_directory # the directory to put temp files. "<use output directory>" by default, namely find it from the output file dir.
+ 42: as.numeric(use_stdin_stdout) # only for CfeatureCounts. When use_stdin_stdout & 0x01 > 0, the input file is from stdin (stored in a temporary file); when use_stdin_stdout & 0x02 > 0, the output should be written to STDOUT instead of a file.
*/
- int isStrandChecked, isCVersion, isChimericDisallowed, isPEDistChecked, minMappingQualityScore=0, isInputFileResortNeeded, feature_block_size = 20, reduce_5_3_ends_to_one;
+ int isStrandChecked, isCVersion, isChimericDisallowed, isPEDistChecked, minMappingQualityScore=0, isInputFileResortNeeded, feature_block_size = 20, reduce_5_3_ends_to_one, useStdinFile;
float fracOverlap;
char **chr;
long *start, *stop;
int *geneid;
- char *nameFeatureTypeColumn, *nameGeneIDColumn,*debug_command, *pair_orientations="fr", *temp_dir;
+ char *nameFeatureTypeColumn, *nameGeneIDColumn,*debug_command, *pair_orientations="fr", *temp_dir, *file_name_ptr ;
long nexons;
@@ -4603,20 +4640,25 @@ int readSummary(int argc,char *argv[]){
}
else temp_dir = NULL;// get_temp_dir_from_out(temp_dir, (char *)argv[3]);
+ if(argc>42){
+ useStdinFile = (atoi(argv[42]) & 1)!=0;
+ }else useStdinFile = 0;
+
if(SAM_pairer_warning_file_open_limit()) return -1;
fc_thread_global_context_t global_context;
- fc_thread_init_global_context(& global_context, FEATURECOUNTS_BUFFER_SIZE, thread_number, MAX_LINE_LENGTH, isPE, minPEDistance, maxPEDistance,isGeneLevel, isMultiOverlapAllowed, isStrandChecked, (char *)argv[3] , isReadSummaryReport, isBothEndRequired, isChimericDisallowed, isPEDistChecked, nameFeatureTypeColumn, nameGeneIDColumn, minMappingQualityScore,isMultiMappingAllowed, 0, alias_file_name, cmd_rebuilt, isInputFileResortNeeded, feature_block_size, isCVersion, fiveEndExtension, thre [...]
+ fc_thread_init_global_context(& global_context, FEATURECOUNTS_BUFFER_SIZE, thread_number, MAX_LINE_LENGTH, isPE, minPEDistance, maxPEDistance,isGeneLevel, isMultiOverlapAllowed, isStrandChecked, (char *)argv[3] , isReadSummaryReport, isBothEndRequired, isChimericDisallowed, isPEDistChecked, nameFeatureTypeColumn, nameGeneIDColumn, minMappingQualityScore,isMultiMappingAllowed, 0, alias_file_name, cmd_rebuilt, isInputFileResortNeeded, feature_block_size, isCVersion, fiveEndExtension, thre [...]
+ fc_thread_init_input_files( & global_context, argv[2], &file_name_ptr );
if( global_context.is_multi_mapping_allowed != ALLOW_ALL_MULTI_MAPPING && (!isMultiOverlapAllowed) && global_context.use_fraction_multi_mapping)
{
SUBREADprintf("ERROR: '--fraction' option should be used together with '-M' or '-O'. Please change the parameters to allow multi-mapping reads and/or multi-overlapping features.\n");
return -1;
}
- if( print_FC_configuration(&global_context, argv[1], argv[2], argv[3], global_context.is_SAM_file, isGTF, & n_input_files, isReadSummaryReport) )
+ if( print_FC_configuration(&global_context, argv[1], file_name_ptr, argv[3], global_context.is_SAM_file, isGTF, & n_input_files, isReadSummaryReport) )
return -1;
@@ -4671,15 +4713,15 @@ int readSummary(int argc,char *argv[]){
char * tmp_pntr = NULL;
- char * file_list_used = malloc(strlen(argv[2])+1);
- char * file_list_used2 = malloc(strlen(argv[2])+1);
- char * is_unique = malloc(strlen(argv[2])+1);
- strcpy(file_list_used, argv[2]);
+ char * file_list_used = malloc(strlen(file_name_ptr)+1);
+ char * file_list_used2 = malloc(strlen(file_name_ptr)+1);
+ char * is_unique = malloc(strlen(file_name_ptr)+1);
+ strcpy(file_list_used, file_name_ptr);
for(x1 = 0;;x1++){
char * test_fn = strtok_r(x1?NULL:file_list_used,";", &tmp_pntr);
if(NULL == test_fn) break;
char * short_fname = get_short_fname(test_fn);
- strcpy(file_list_used2, argv[2]);
+ strcpy(file_list_used2, file_name_ptr);
is_unique[x1]=1;
char * loop_ptr = NULL;
@@ -4700,7 +4742,7 @@ int readSummary(int argc,char *argv[]){
free(file_list_used2);
tmp_pntr = NULL;
- strcpy(file_list_used, argv[2]);
+ strcpy(file_list_used, file_name_ptr);
char * next_fn = strtok_r(file_list_used,";", &tmp_pntr);
read_count_type_t ** table_columns = calloc( n_input_files , sizeof(read_count_type_t *)), i_files=0;
fc_read_counters * read_counters = calloc(n_input_files , sizeof(fc_read_counters));
@@ -4714,7 +4756,7 @@ int readSummary(int argc,char *argv[]){
for(x1 = 0;;x1++){
int orininal_isPE = global_context.is_paired_end_mode_assign;
- if(next_fn==NULL || strlen(next_fn)<1) break;
+ if(next_fn==NULL || strlen(next_fn)<1 || global_context.disk_is_full) break;
read_count_type_t * column_numbers = calloc(nexons, sizeof(read_count_type_t));
HashTable * junction_global_table = NULL;
@@ -4745,6 +4787,9 @@ int readSummary(int argc,char *argv[]){
memset(my_read_counter, 0, sizeof(fc_read_counters));
int ret_int = readSummary_single_file(& global_context, column_numbers, nexons, geneid, chr, start, stop, sorted_strand, anno_chr_2ch, anno_chrs, anno_chr_head, block_end_index, block_min_start, block_max_end, my_read_counter, junction_global_table, splicing_global_table);
+ if(global_context.disk_is_full){
+ SUBREADprintf("ERROR: disk is full. Please check the free space in the output directory.\n");
+ }
if(ret_int!=0){
// give up this file.
@@ -4771,17 +4816,18 @@ int readSummary(int argc,char *argv[]){
free(file_list_used);
if(global_context.is_input_bad_format){
- SUBREADprintf("\nFATAL Error: an input file has wrong format! The program has to terminate and no counting file is generated.\n\n");
- }else{
+ SUBREADprintf("\nFATAL Error: The program has to terminate and no counting file is generated.\n\n");
+ }else if(!global_context.disk_is_full){
if(isGeneLevel)
- fc_write_final_gene_results(&global_context, geneid, chr, start, stop, sorted_strand, argv[3], nexons, table_columns, argv[2], n_input_files , loaded_features, isCVersion);
+ fc_write_final_gene_results(&global_context, geneid, chr, start, stop, sorted_strand, argv[3], nexons, table_columns, file_name_ptr, n_input_files , loaded_features, isCVersion);
else
- fc_write_final_results(&global_context, argv[3], nexons, table_columns, argv[2], n_input_files ,loaded_features, isCVersion);
+ fc_write_final_results(&global_context, argv[3], nexons, table_columns, file_name_ptr, n_input_files ,loaded_features, isCVersion);
}
- if(global_context.do_junction_counting)
- fc_write_final_junctions(&global_context, argv[3], table_columns, argv[2], n_input_files , junction_global_table_list, splicing_global_table_list);
+ if(global_context.do_junction_counting && !global_context.disk_is_full)
+ fc_write_final_junctions(&global_context, argv[3], table_columns, file_name_ptr, n_input_files , junction_global_table_list, splicing_global_table_list);
- fc_write_final_counts(&global_context, argv[3], n_input_files, argv[2], table_columns, read_counters, isCVersion);
+ if(!global_context.disk_is_full)
+ fc_write_final_counts(&global_context, argv[3], n_input_files, file_name_ptr, table_columns, read_counters, isCVersion);
int total_written_coulmns = 0;
for(i_files=0; i_files<n_input_files; i_files++)
@@ -4796,9 +4842,10 @@ int readSummary(int argc,char *argv[]){
}
free(table_columns);
+ free(file_name_ptr);
- if(global_context.is_input_bad_format == 0) print_FC_results(&global_context);
+ if(global_context.is_input_bad_format == 0) print_FC_results(&global_context, (char *)argv[3]/*out file name*/);
KeyValuePair * cursor;
int bucket;
for(bucket=0; bucket < global_context.exontable_chro_table -> numOfBuckets; bucket++)
@@ -4908,14 +4955,11 @@ void sort_bucket_table(fc_thread_global_context_t * global_context){
int readSummary_single_file(fc_thread_global_context_t * global_context, read_count_type_t * column_numbers, int nexons, int * geneid, char ** chr, long * start, long * stop, unsigned char * sorted_strand, char * anno_chr_2ch, char ** anno_chrs, long * anno_chr_head, long * block_end_index, long * block_min_start , long * block_max_end, fc_read_counters * my_read_counter, HashTable * junction_global_table, HashTable * splicing_global_table)
{
- FILE *fp_in = NULL;
int read_length = 0;
int is_first_read_PE=0;
char * line = (char*)calloc(MAX_LINE_LENGTH, 1);
char * file_str = "";
- if(strcmp( global_context->input_file_name,"STDIN")!=0)
- {
int file_probe = is_certainly_bam_file(global_context->input_file_name, &is_first_read_PE, NULL);
global_context -> is_paired_end_input_file = is_first_read_PE;
@@ -4935,17 +4979,13 @@ int readSummary_single_file(fc_thread_global_context_t * global_context, read_co
if(!global_context->redo)
{
- print_in_box(80,0,0,"Process %s file %s...", file_str, global_context->input_file_name);
+ print_in_box(80,0,0,"Process %s file %s...", file_str, global_context -> use_stdin_file? "<STDIN>":global_context->input_file_name);
if(is_first_read_PE)
print_in_box(80,0,0," Paired-end reads are included.");
else
print_in_box(80,0,0," Single-end reads are included.");
}
- }
-
- if(strcmp( global_context->input_file_name,"STDIN")!=0)
- {
FILE * exist_fp = f_subr_open( global_context->input_file_name,"r");
if(!exist_fp)
{
@@ -4955,26 +4995,9 @@ int readSummary_single_file(fc_thread_global_context_t * global_context, read_co
return -1;
}
fclose(exist_fp);
- }
-
- /*
- if(strcmp(global_context->input_file_name,"STDIN")!=0)
- if(warning_file_type(global_context->input_file_name, global_context->is_SAM_file?FILE_TYPE_SAM:FILE_TYPE_BAM))
- global_context->is_unpaired_warning_shown=1;
- */
// Open the SAM/BAM file
// Nothing is done if the file does not exist.
- #ifdef MAKE_STANDALONE
- if(strcmp("STDIN",global_context->input_file_name)==0)
- fp_in = stdin;
- else
- fp_in = f_subr_open(global_context->input_file_name,"r");
- #else
- fp_in = f_subr_open(global_context->input_file_name,"r");
- #endif
-
-
// begin to load-in the data.
if(!global_context->redo)
{
@@ -4996,13 +5019,6 @@ int readSummary_single_file(fc_thread_global_context_t * global_context, read_co
fc_thread_merge_results(global_context, column_numbers , &nreads_mapped_to_exon, my_read_counter, junction_global_table, splicing_global_table);
fc_thread_destroy_thread_context(global_context);
- //global_context .read_counters.assigned_reads = nreads_mapped_to_exon;
-
- #ifdef MAKE_STANDALONE
- if(strcmp("STDIN",global_context->input_file_name)!=0)
- #endif
- fclose(fp_in);
-
if(global_context -> sambam_chro_table) free(global_context -> sambam_chro_table);
global_context -> sambam_chro_table = NULL;
@@ -5018,7 +5034,7 @@ int main(int argc, char ** argv)
int feature_count_main(int argc, char ** argv)
#endif
{
- char * Rargv[42];
+ char * Rargv[43];
char annot_name[300];
char temp_dir[300];
char * out_name = malloc(300);
@@ -5068,7 +5084,8 @@ int feature_count_main(int argc, char ** argv)
int very_long_file_names_size = 200;
int fiveEndExtension = 0, threeEndExtension = 0, minFragmentOverlap = 1;
float fracOverlap = 0.0;
- char strFiveEndExtension[11], strThreeEndExtension[11], strMinFragmentOverlap[11], fracOverlapStr[20];
+ int std_input_output_mode = 0;
+ char strFiveEndExtension[11], strThreeEndExtension[11], strMinFragmentOverlap[11], fracOverlapStr[20], std_input_output_mode_str[11];
very_long_file_names = malloc(very_long_file_names_size);
very_long_file_names [0] = 0;
fasta_contigs_name[0]=0;
@@ -5321,7 +5338,7 @@ int feature_count_main(int argc, char ** argv)
minFragmentOverlap = 1;
}
- if(out_name[0]==0 || annot_name[0]==0||argc == optind)
+ if(out_name[0]==0 || annot_name[0]==0)
{
print_usage();
return -1;
@@ -5342,6 +5359,7 @@ int feature_count_main(int argc, char ** argv)
}
very_long_file_names[strlen(very_long_file_names)-1]=0;
+ std_input_output_mode = (strcmp(very_long_file_names, "") == 0?1:0);
sprintf(strFiveEndExtension, "%d", fiveEndExtension);
sprintf(strThreeEndExtension, "%d", threeEndExtension);
@@ -5353,6 +5371,8 @@ int feature_count_main(int argc, char ** argv)
sprintf(feature_block_size_str,"%d", feature_block_size);
sprintf(Strand_Sensitive_Str,"%d", Strand_Sensitive_Mode);
sprintf(fracOverlapStr, "%g", fracOverlap);
+ sprintf(std_input_output_mode_str,"%d",std_input_output_mode);
+
Rargv[0] = "CreadSummary";
Rargv[1] = annot_name;
Rargv[2] = very_long_file_names;
@@ -5395,7 +5415,11 @@ int feature_count_main(int argc, char ** argv)
Rargv[39] = is_Restrictedly_No_Overlap?"1":"0";
Rargv[40] = fracOverlapStr;
Rargv[41] = temp_dir;
- int retvalue = readSummary(42, Rargv);
+ Rargv[42] = std_input_output_mode_str;
+
+ int retvalue = -1;
+ if(is_ReadSummary_Report && (std_input_output_mode & 1)==1) SUBREADprintf("ERROR: no detailed assignment results can be written when the input is from STDIN. Please remove the '-R' option.\n");
+ else retvalue = readSummary(43, Rargv);
free(very_long_file_names);
free(out_name);
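
Note on the readSummary.c hunks above: the new disk_is_full logic inspects the return value of fprintf and removes the output file when a write comes up short. The following is a minimal standalone sketch of that idea, not the upstream code. The names write_count_row and write_table_or_remove are hypothetical. The sketch also checks ferror and fclose, on the assumption that buffered output may only report a full disk when it is flushed.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of Subread): write one row of
 * tab-separated counts. Returns 0 on success, -1 if any write failed. */
static int write_count_row(FILE *fp, const char *name,
                           const unsigned long long *counts, int n)
{
    int i;
    /* fprintf returns a negative value when the write fails. */
    if (fprintf(fp, "%s", name) < 0) return -1;
    for (i = 0; i < n; i++)
        if (fprintf(fp, "\t%llu", counts[i]) < 0) return -1;
    if (fprintf(fp, "\n") < 0) return -1;
    return 0;
}

/* Hypothetical helper: write a rows x cols table to `path` and remove the
 * partial file if anything went wrong. Returns 0 on success, -1 on failure. */
int write_table_or_remove(const char *path, const char **names,
                          const unsigned long long *counts, int rows, int cols)
{
    FILE *fp = fopen(path, "w");
    int failed = 0, r;

    if (!fp) return -1;
    for (r = 0; r < rows && !failed; r++)
        failed = write_count_row(fp, names[r], counts + (size_t)r * cols, cols) != 0;

    /* Buffered data may only hit the disk here, so check both calls. */
    if (ferror(fp)) failed = 1;
    if (fclose(fp) != 0) failed = 1;

    if (failed) {
        unlink(path);   /* do not leave a truncated table behind */
        return -1;
    }
    return 0;
}
```

A caller would pass the output path and the count matrix in row-major order, and treat a return value of -1 as "no usable output file exists".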
diff --git a/src/sambam-file.c b/src/sambam-file.c
index c38250c..da988a6 100644
--- a/src/sambam-file.c
+++ b/src/sambam-file.c
@@ -58,6 +58,10 @@ int SamBam_fetch_next_chunk(SamBam_FILE *fp)
ret = PBam_get_next_zchunk(fp -> os_file, in_buff, 65536, & real_len);
if(ret > 0)
nchunk = SamBam_unzip(fp -> input_binary_stream_buffer + fp->input_binary_stream_write_ptr - fp -> input_binary_stream_read_ptr + have , in_buff , ret);
+ else if(ret == -2){
+ SUBREADputs("ERROR: BAM format is broken!");
+ return -2;
+ }
//printf("RET=%d; CHK=%d\n", ret, nchunk);
@@ -345,7 +349,7 @@ int PBam_get_next_zchunk(FILE * bam_fp, char * buffer, int buffer_length, unsign
{
unsigned char ID1, ID2, CM, FLG;
unsigned short XLEN;
- int BSIZE=-1;
+ int BSIZE=-1, rlen, is_file_broken = 0;
if(feof(bam_fp)) return -1;
@@ -371,7 +375,8 @@ int PBam_get_next_zchunk(FILE * bam_fp, char * buffer, int buffer_length, unsign
fread(&SI1, 1, 1, bam_fp);
fread(&SI2, 1, 1, bam_fp);
- fread(&SLEN, 1, 2, bam_fp);
+ rlen = fread(&SLEN, 2, 1, bam_fp);
+ if(rlen < 1) is_file_broken = 1;
if(SI1==66 && SI2== 67 && SLEN == 2)
{
@@ -391,10 +396,14 @@ int PBam_get_next_zchunk(FILE * bam_fp, char * buffer, int buffer_length, unsign
if(CDATA_READING<CDATA_LEN)
fseeko(bam_fp, CDATA_LEN-CDATA_READING, SEEK_CUR);
fseeko(bam_fp, 4, SEEK_CUR);
- fread(&real_len, 4, 1, bam_fp);
+ rlen = fread(&real_len, 4, 1, bam_fp);
+ if(rlen < 1) is_file_broken = 1;
// SUBREADprintf("read_data=%u\n", CDATA_LEN);
- return CDATA_READING;
+ if(is_file_broken){
+ SUBREADputs("ERROR: the input BAM file is broken.");
+ }
+ return is_file_broken?-2:CDATA_READING;
}
else
return -1;
@@ -748,13 +757,22 @@ int PBum_load_header(FILE * bam_fp, SamBam_Reference_Info** chro_tab, char * rem
char * CDATA = malloc(80010);
char * PDATA = malloc(1000000);
- int chro_tab_size = 0, chro_tab_items = 0, chro_tab_state = 0, header_remainder = 0, remainder_byte_len = 0;
+ int chro_tab_size = 0, chro_tab_items = 0, chro_tab_state = 0, header_remainder = 0, remainder_byte_len = 0, bam_is_broken = 0;
z_stream strm;
while(1)
{
unsigned int real_len = 0;
int rlen = PBam_get_next_zchunk(bam_fp,CDATA,80000, & real_len);
- if(rlen<0) break;
+ if(rlen<0){
+ bam_is_broken = (rlen == -2);
+ if(bam_is_broken){
+ SUBREADprintf("BAM file format error!\n");
+ free(CDATA);
+ free(PDATA);
+ return -1;
+ }
+ break;
+ }
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
@@ -980,16 +998,22 @@ void SamBam_writer_add_chunk(SamBam_Writer * writer)
compressed_size = 70000 - writer -> output_stream.avail_out;
//printf("ADDED BLOCK=%d; LEN=%d; S=%s\n", compressed_size, writer ->chunk_buffer_used, writer ->chunk_buffer);
SamBam_writer_chunk_header(writer, compressed_size);
- fwrite(writer -> compressed_chunk_buffer, 1, compressed_size, writer -> bam_fp);
+ int chunk_write_size = fwrite(writer -> compressed_chunk_buffer, 1, compressed_size, writer -> bam_fp);
fwrite(&CRC32 , 4, 1, writer -> bam_fp);
fwrite(&writer ->chunk_buffer_used , 4, 1, writer -> bam_fp);
+ if(chunk_write_size < compressed_size){
+ if(!writer -> is_internal_error)SUBREADputs("ERROR: no space left in the output directory.");
+ writer -> is_internal_error = 1;
+ }
writer ->chunk_buffer_used = 0;
}
+double sambam_t1 = 0;
+
void SamBam_writer_write_header(SamBam_Writer * writer)
{
int header_ptr=0, header_block_start = 0;
@@ -1391,7 +1415,16 @@ int SamBam_writer_add_read(SamBam_Writer * writer, char * read_name, unsigned in
if(writer -> chunk_buffer_used>55000)
{
+ // double t0 = miltime();
SamBam_writer_add_chunk(writer);
+ // double t1 = miltime();
+ // if(sambam_t1 > 100)
+ // SUBREADprintf("Running = %.6f , Compress Time = %.6f\n", t0 - sambam_t1, t1 - t0);
+ // sambam_t1 = t1;
+
+
+
+
writer -> chunk_buffer_used = 0;
}
return 0;
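
Note on the sambam-file.c hunks above: PBam_get_next_zchunk now returns -2 for a broken file, in addition to -1 for end of input, based on the item count that fread reports. The sketch below illustrates the same distinction for a single 2-byte field. It is an illustration under assumptions rather than the upstream code: read_u16_le and the FIELD_* constants are hypothetical names, and the little-endian decoding is done by hand so the result does not depend on host byte order.

```c
#include <stdio.h>

#define FIELD_OK         0
#define FIELD_EOF       (-1)  /* clean end of input: no bytes were left */
#define FIELD_TRUNCATED (-2)  /* input ended (or failed) inside the field */

/* Hypothetical helper: read a 2-byte little-endian unsigned integer. */
int read_u16_le(FILE *fp, unsigned int *out)
{
    unsigned char b[2];
    /* With an item size of 1, fread's return value is a byte count. */
    size_t got = fread(b, 1, 2, fp);

    if (got == 0)
        return ferror(fp) ? FIELD_TRUNCATED : FIELD_EOF;
    if (got < 2)
        return FIELD_TRUNCATED;

    *out = (unsigned int)b[0] | ((unsigned int)b[1] << 8);
    return FIELD_OK;
}
```

A caller can stop quietly on FIELD_EOF and report a corrupt file on FIELD_TRUNCATED, which mirrors the -1 versus -2 convention used in the diff.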
diff --git a/src/sambam-file.h b/src/sambam-file.h
index d8193cb..a23f0f6 100644
--- a/src/sambam-file.h
+++ b/src/sambam-file.h
@@ -38,7 +38,7 @@ typedef unsigned int BS_uint_32;
#define BAM_FILE_STAGE_ALIGNMENT 20
-#define SAMBAM_COMPRESS_LEVEL 5
+#define SAMBAM_COMPRESS_LEVEL Z_BEST_SPEED
#define SAMBAM_GZIP_WINDOW_BITS -15
#define SAMBAM_INPUT_STREAM_SIZE 140000
@@ -72,7 +72,7 @@ typedef struct
} SamBam_Alignment;
-#define SB_FETCH(a) if((a) -> input_binary_stream_write_ptr - (a) -> input_binary_stream_read_ptr < 3000){SamBam_fetch_next_chunk(a);}
+#define SB_FETCH(a) if((a) -> input_binary_stream_write_ptr - (a) -> input_binary_stream_read_ptr < 3000){int test_rlen_2 = SamBam_fetch_next_chunk(a); if(test_rlen_2 == -2){(a)->is_bam_broken = 1;}}
#define SB_EOF(a) ((a)-> is_eof && ( (a) -> input_binary_stream_write_ptr <= (a) -> input_binary_stream_read_ptr ))
#define SB_READ(a) ((a) -> input_binary_stream_buffer + (a) -> input_binary_stream_read_ptr - (a) -> input_binary_stream_buffer_start_ptr)
#define SB_RINC(a, len) ((a) -> input_binary_stream_read_ptr) += len
@@ -97,6 +97,7 @@ typedef struct
char * input_binary_stream_buffer;
int is_eof;
int is_paired_end;
+ int is_bam_broken;
} SamBam_FILE;
@@ -111,6 +112,7 @@ typedef struct
int header_plain_text_buffer_max;
int chunk_buffer_used;
int writer_state;
+ int is_internal_error;
unsigned int crc0;
HashTable * chromosome_name_table;
diff --git a/src/sorted-hashtable.c b/src/sorted-hashtable.c
index c8bc072..5e1b3ce 100644
--- a/src/sorted-hashtable.c
+++ b/src/sorted-hashtable.c
@@ -1296,7 +1296,7 @@ int gehash_load(gehash_t * the_table, const char fname [])
FILE * fp = f_subr_open(fname, "rb");
if (!fp)
{
- SUBREADprintf ("Table file `%s' is not found.\n", fname);
+ SUBREADprintf ("Table file '%s' is not found.\n", fname);
return 1;
}
@@ -1341,6 +1341,10 @@ int gehash_load(gehash_t * the_table, const char fname [])
the_table -> index_gap = 3;
the_table -> current_items = load_int64(fp);
+ if(the_table -> current_items < 1 || the_table -> current_items > 0xffffffffllu){
+ SUBREADputs("ERROR: the index format is unrecognizable.");
+ return 1;
+ }
the_table -> buckets_number = load_int32(fp);
the_table -> buckets = (struct gehash_bucket * )malloc(sizeof(struct gehash_bucket) * the_table -> buckets_number);
if(!the_table -> buckets)
@@ -1368,9 +1372,15 @@ int gehash_load(gehash_t * the_table, const char fname [])
if(current_bucket -> current_items > 0)
{
read_length = fread(current_bucket -> new_item_keys, sizeof(short), current_bucket -> current_items, fp);
- assert(read_length>0);
+ if(read_length < current_bucket -> current_items){
+ SUBREADprintf("ERROR: the index is incomplete : %d < %u.\n",read_length, current_bucket -> current_items);
+ return 1;
+ }
read_length = fread(current_bucket -> item_values, sizeof(gehash_data_t), current_bucket -> current_items, fp);
- assert(read_length>0);
+ if(read_length < current_bucket -> current_items){
+ SUBREADprintf("ERROR: the index value is incomplete : %d < %u.\n",read_length, current_bucket -> current_items);
+ return 1;
+ }
}
}
@@ -1415,9 +1425,15 @@ int gehash_load(gehash_t * the_table, const char fname [])
if(current_bucket -> current_items > 0)
{
read_length = fread(current_bucket -> item_keys, sizeof(gehash_key_t), current_bucket -> current_items, fp);
- assert(read_length>0);
+ if(read_length < current_bucket -> current_items){
+ SUBREADprintf("ERROR: the index is incomplete.\n");
+ return 1;
+ }
read_length = fread(current_bucket -> item_values, sizeof(gehash_data_t), current_bucket -> current_items, fp);
- assert(read_length>0);
+ if(read_length < current_bucket -> current_items){
+ SUBREADprintf("ERROR: the index is incomplete.\n");
+ return 1;
+ }
}
}
@@ -1499,7 +1515,7 @@ int gehash_dump(gehash_t * the_table, const char fname [])
int maximum_bucket_size = 0;
if (!fp)
{
- SUBREADprintf ("Table file `%s' is not able to open.\n", fname);
+ SUBREADprintf ("Table file '%s' is not able to open.\n", fname);
return -1;
}
@@ -1648,13 +1664,25 @@ int gehash_dump(gehash_t * the_table, const char fname [])
}
}
- fwrite(& (current_bucket -> current_items), sizeof(int), 1, fp);
- fwrite(& (current_bucket -> space_size), sizeof(int), 1, fp);
+ int is_full = 0;
+ int write_len = fwrite(& (current_bucket -> current_items), sizeof(int), 1, fp);
+ if(write_len<1) is_full = 1;
+ write_len = fwrite(& (current_bucket -> space_size), sizeof(int), 1, fp);
+ if(write_len<1) is_full = 1;
+
if(the_table->version_number == SUBINDEX_VER0)
fwrite(current_bucket -> item_keys, sizeof(gehash_key_t), current_bucket -> current_items, fp);
- else
- fwrite(current_bucket -> new_item_keys, sizeof(short), current_bucket -> current_items, fp);
- fwrite(current_bucket -> item_values, sizeof(gehash_data_t), current_bucket -> current_items, fp);
+ else{
+ write_len = fwrite(current_bucket -> new_item_keys, sizeof(short), current_bucket -> current_items, fp);
+ if(write_len < current_bucket -> current_items) is_full = 1;
+ }
+ write_len = fwrite(current_bucket -> item_values, sizeof(gehash_data_t), current_bucket -> current_items, fp);
+ if(write_len < current_bucket -> current_items) is_full = 1;
+ if(is_full){
+ fclose(fp);
+ SUBREADprintf("ERROR: Unable to write into the output file. Please check the disk space in the output directory.\n");
+ return 1;
+ }
}
if(the_table->version_number > SUBINDEX_VER0)
@@ -1667,8 +1695,13 @@ int gehash_dump(gehash_t * the_table, const char fname [])
}
- fwrite(&(the_table -> is_small_table), sizeof(char), 1, fp);
+ int write_len = fwrite(&(the_table -> is_small_table), sizeof(char), 1, fp);
fclose(fp);
+
+ if(write_len < 1){
+ SUBREADprintf("ERROR: Unable to write into the output file. Please check the disk space in the output directory.\n");
+ return 1;
+ }
print_in_box(80,0,0,"");
return 0;
}
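Note on the hashtable.c hunks above: they replace `assert(read_length>0)` with explicit comparisons against the expected item count, and start checking the return values of `fwrite`. The same pattern can be sketched in isolation as below. `read_exact` and `write_exact` are hypothetical helper names used only for illustration and are not part of subread; the sketch also stores the counts in `size_t` (the actual return type of `fread`/`fwrite`) rather than `int` as in the patch.

```c
#include <stdio.h>

/* Hypothetical helper (not in subread): read exactly `count` items of
 * `size` bytes. Returns 0 on success, 1 if fewer items were read. */
int read_exact(void *dst, size_t size, size_t count, FILE *fp)
{
    size_t got = fread(dst, size, count, fp);
    if (got < count) {
        fprintf(stderr, "ERROR: the index is incomplete : %zu < %zu.\n",
                got, count);
        return 1;
    }
    return 0;
}

/* Hypothetical helper (not in subread): write exactly `count` items.
 * Returns 0 on success, 1 on a short write (for example, a full disk). */
int write_exact(const void *src, size_t size, size_t count, FILE *fp)
{
    size_t put = fwrite(src, size, count, fp);
    return put < count ? 1 : 0;
}
```

Unlike `assert`, these checks survive builds compiled with `NDEBUG` and also catch partial reads, where some but not all items were transferred.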
diff --git a/src/subread.h b/src/subread.h
index 8b3d959..b68c7e7 100644
--- a/src/subread.h
+++ b/src/subread.h
@@ -33,9 +33,6 @@
#include "hashtable.h"
-#define INPUT_BUFFER_SIZE (8*1024*1024)
-#define OUTPUT_BUFFER_SIZE (32*1024*1024)
-
#define SAM_FLAG_PAIRED_TASK 0x01
#define SAM_FLAG_FIRST_READ_IN_PAIR 0x40
#define SAM_FLAG_SECOND_READ_IN_PAIR 0x80
@@ -64,11 +61,12 @@
#define MAX_READ_NAME_LEN 100
#define MAX_CHROMOSOME_NAME_LEN 100
#define MAX_FILE_NAME_LENGTH 300
+#define FEATURE_NAME_LENGTH 256
-#define MULTI_THREAD_OUTPUT_ITEMS 4096
+//#warning "============== REMOVE '*1.2' FROM THE NEXT LINE ================"
+#define MULTI_THREAD_OUTPUT_ITEMS (4096 * 3/5 *3)
-//#warning "============ CHANGE THE NEXT LINE TO 120 ========"
-#define EXON_LONG_READ_LENGTH 120
+#define EXON_LONG_READ_LENGTH 160
#define EXON_MAX_CIGAR_LEN 256
#define FC_CIGAR_PARSER_ITEMS 11
diff --git a/test/featureCounts/data/test-chralias.GTF b/test/featureCounts/data/test-chralias.GTF
new file mode 100644
index 0000000..d482f1a
--- /dev/null
+++ b/test/featureCounts/data/test-chralias.GTF
@@ -0,0 +1,23 @@
+chr3 SAF exon 100 10000 . + 0 gene_id "simu_gene1"; transcript_id "TRN_simu_gene1"; exon_id "EXON_simu_gene1.1";
+chr3 SAF exon 20000 30000 . + 0 gene_id "simu_gene1"; transcript_id "TRN_simu_gene1"; exon_id "EXON_simu_gene1.2";
+chr3 SAF exon 40000 89000 . + 0 gene_id "simu_gene1"; transcript_id "TRN_simu_gene1"; exon_id "EXON_simu_gene1.3";
+chr3 SAF exon 100010 101000 . + 0 gene_id "simu_gene2"; transcript_id "TRN_simu_gene2"; exon_id "EXON_simu_gene2.1";
+chr3 SAF exon 102000 103000 . + 0 gene_id "simu_gene2"; transcript_id "TRN_simu_gene2"; exon_id "EXON_simu_gene2.2";
+chr3 SAF exon 104000 129000 . + 0 gene_id "simu_gene2"; transcript_id "TRN_simu_gene2"; exon_id "EXON_simu_gene2.3";
+chr3 SAF exon 102000 131000 . + 0 gene_id "simu_gene2"; transcript_id "TRN_simu_gene2"; exon_id "EXON_simu_gene2.4";
+chr3 SAF exon 500010 501000 . - 0 gene_id "simu_gene3"; transcript_id "TRN_simu_gene3"; exon_id "EXON_simu_gene3.1";
+chr3 SAF exon 502000 503000 . - 0 gene_id "simu_gene3"; transcript_id "TRN_simu_gene3"; exon_id "EXON_simu_gene3.2";
+chr3 SAF exon 504000 529000 . - 0 gene_id "simu_gene3"; transcript_id "TRN_simu_gene3"; exon_id "EXON_simu_gene3.3";
+chr3 SAF exon 600000 669000 . - 0 gene_id "simu_gene3"; transcript_id "TRN_simu_gene3"; exon_id "EXON_simu_gene3.4";
+chr3 SAF exon 602000 631000 . + 0 gene_id "simu_gene4"; transcript_id "TRN_simu_gene4"; exon_id "EXON_simu_gene4.1";
+chr3 SAF exon 672000 699000 . + 0 gene_id "simu_gene4"; transcript_id "TRN_simu_gene4"; exon_id "EXON_simu_gene4.2";
+chr3 SAF exon 702000 719000 . + 0 gene_id "simu_gene4"; transcript_id "TRN_simu_gene4"; exon_id "EXON_simu_gene4.3";
+chr4 SAF exon 20000 100000 . - 0 gene_id "simu_gene5"; transcript_id "TRN_simu_gene5"; exon_id "EXON_simu_gene5.1";
+chr4 SAF exon 120000 190000 . - 0 gene_id "simu_gene5"; transcript_id "TRN_simu_gene5"; exon_id "EXON_simu_gene5.2";
+chr4 SAF exon 200000 210000 . - 0 gene_id "simu_gene5"; transcript_id "TRN_simu_gene5"; exon_id "EXON_simu_gene5.3";
+chr4 SAF exon 220000 300000 . - 0 gene_id "simu_gene5"; transcript_id "TRN_simu_gene5"; exon_id "EXON_simu_gene5.4";
+chr4 SAF exon 420000 490000 . - 0 gene_id "simu_gene6"; transcript_id "TRN_simu_gene6"; exon_id "EXON_simu_gene6.1";
+chr4 SAF exon 500000 560000 . - 0 gene_id "simu_gene6"; transcript_id "TRN_simu_gene6"; exon_id "EXON_simu_gene6.2";
+chr5 SAF exon 120000 490000 . - 0 gene_id "simu_gene7"; transcript_id "TRN_simu_gene7"; exon_id "EXON_simu_gene7.1";
+chr5 SAF exon 500000 960000 . - 0 gene_id "simu_gene7"; transcript_id "TRN_simu_gene7"; exon_id "EXON_simu_gene7.2";
+chr5 SAF exon 970000 1000000 . - 0 gene_id "simu_gene7"; transcript_id "TRN_simu_gene7"; exon_id "EXON_simu_gene7.3";
diff --git a/test/featureCounts/del4.FC b/test/featureCounts/del4.FC
new file mode 100644
index 0000000..c1b0eac
--- /dev/null
+++ b/test/featureCounts/del4.FC
@@ -0,0 +1,9 @@
+# Program:featureCounts v1.5.2-alpha1; Command:"../../bin/featureCounts" "-a" "data/test-chralias.GTF" "-o" "del4.FC" "-A" "data/test-chralias.txt" "data/test-chralias.sam"
+Geneid Chr Start End Strand Length data/test-chralias.sam
+simu_gene1 chr3;chr3;chr3 100;20000;40000 10000;30000;89000 +;+;+ 68903 31
+simu_gene2 chr3;chr3;chr3;chr3 100010;102000;102000;104000 101000;103000;131000;129000 +;+;+;+ 29992 10
+simu_gene3 chr3;chr3;chr3;chr3 500010;502000;504000;600000 501000;503000;529000;669000 -;-;-;- 95994 16
+simu_gene4 chr3;chr3;chr3 602000;672000;702000 631000;699000;719000 +;+;+ 73003 12
+simu_gene5 chr4;chr4;chr4;chr4 20000;120000;200000;220000 100000;190000;210000;300000 -;-;-;- 240004 95
+simu_gene6 chr4;chr4 420000;500000 490000;560000 -;- 130002 44
+simu_gene7 chr5;chr5;chr5 120000;500000;970000 490000;960000;1000000 -;-;- 860003 338
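Note on the expected output above: the `Length` column appears consistent with counting every base covered by a gene's exons exactly once, using 1-based inclusive coordinates. For simu_gene2 the exon 102000-131000 contains two of the other exons, so the total is 991 + 29001 = 29992. This reading is inferred from the numbers in the test file, not from the featureCounts source. A minimal sketch of that arithmetic, with the hypothetical names `exon_t` and `merged_length`:

```c
#include <stdlib.h>

/* Hypothetical type: one exon with 1-based, inclusive coordinates. */
typedef struct { long start; long end; } exon_t;

static int cmp_start(const void *a, const void *b)
{
    const exon_t *x = a, *y = b;
    return (x->start > y->start) - (x->start < y->start);
}

/* Number of bases covered by at least one exon.
 * Sorts `exons` in place by start coordinate. */
long merged_length(exon_t *exons, size_t n)
{
    if (n == 0) return 0;
    qsort(exons, n, sizeof *exons, cmp_start);

    long total = 0;
    long cur_start = exons[0].start, cur_end = exons[0].end;
    for (size_t i = 1; i < n; i++) {
        if (exons[i].start <= cur_end) {
            /* Overlaps the current run: extend it if needed. */
            if (exons[i].end > cur_end) cur_end = exons[i].end;
        } else {
            /* Gap: close the current run and start a new one. */
            total += cur_end - cur_start + 1;
            cur_start = exons[i].start;
            cur_end = exons[i].end;
        }
    }
    return total + (cur_end - cur_start + 1);
}
```

Applied to the four simu_gene2 exons from test-chralias.GTF, this yields 29992, matching the table.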
diff --git a/test/featureCounts/del4.FC.summary b/test/featureCounts/del4.FC.summary
new file mode 100644
index 0000000..40b1f9f
--- /dev/null
+++ b/test/featureCounts/del4.FC.summary
@@ -0,0 +1,12 @@
+Status data/test-chralias.sam
+Assigned 546
+Unassigned_Ambiguity 8
+Unassigned_MultiMapping 0
+Unassigned_NoFeatures 646
+Unassigned_Unmapped 0
+Unassigned_MappingQuality 0
+Unassigned_FragmentLength 0
+Unassigned_Chimera 0
+Unassigned_Secondary 0
+Unassigned_Nonjunction 0
+Unassigned_Duplicate 0
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/subread.git