[med-svn] [subread] 02/06: New upstream version 1.5.3+dfsg
Alex Mestiashvili
malex-guest at moszumanska.debian.org
Fri Jul 28 14:38:11 UTC 2017
This is an automated email from the git hooks/post-receive script.
malex-guest pushed a commit to branch master
in repository subread.
commit 9d897ea2471b594ef66a1ac4ba861fdc2a4fc50d
Author: Alexandre Mestiashvili <alex at biotec.tu-dresden.de>
Date: Fri Jul 28 11:03:52 2017 +0200
New upstream version 1.5.3+dfsg
---
doc/SubreadUsersGuide.tex | 159 +--
src/HelperFunctions.c | 42 +-
src/HelperFunctions.h | 4 +-
src/Makefile.FreeBSD | 11 +-
src/Makefile.Linux | 29 +-
src/Makefile.MacOS | 6 +-
src/core-indel.c | 37 +-
src/core-interface-aligner.c | 37 +-
src/core-interface-subjunc.c | 39 +-
src/core-junction.c | 29 +-
src/core-junction.h | 3 +
src/core.c | 373 ++++---
src/core.h | 23 +-
src/hashtable.c | 70 +-
src/hashtable.h | 5 +-
src/input-files.c | 574 ++++++++---
src/input-files.h | 15 +-
src/makefile.version | 2 +-
src/read-repair.c | 2 +-
src/readSummary.c | 1620 +++++++++++++++++++++---------
src/sambam-file.c | 184 +++-
src/sambam-file.h | 4 +-
src/subread.h | 6 +-
src/tx-unique.c | 443 ++++++++
src/tx-unique.h | 37 +
test/subread-align/subread-align-test.sh | 4 +-
26 files changed, 2757 insertions(+), 1001 deletions(-)
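
Note on the load_features_annotation() change in src/HelperFunctions.c and
src/HelperFunctions.h: the do_add_feature callback now takes a transcript
name as its second argument, and the hunk below passes NULL for it when no
transcript_id_column is given or when the attribute is missing from the GTF
line. The sketch below is only an illustration of how a caller's callback
might be adapted to that seven-argument shape. The names demo_context_t and
demo_add_feature are hypothetical and do not appear anywhere in the commit,
and the return-value convention (0 for accepted, -1 for rejected) is an
assumption, since the loader shown in this diff ignores the return value.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical context object; not part of the subread sources. */
typedef struct {
    int n_features;          /* features accepted so far */
    int n_with_transcript;   /* accepted features that carried a transcript name */
    char last_gene[64];      /* gene name of the most recently accepted feature */
} demo_context_t;

/* Callback with the argument order introduced by this commit:
 * gene, transcript, chromosome, start, end, strand flag, context.
 * transcript_name may be NULL, so it is checked before use. */
static int demo_add_feature(char *gene_name, char *transcript_name, char *chro_name,
                            unsigned int start, unsigned int end,
                            int is_negative_strand, void *context)
{
    demo_context_t *ctx = (demo_context_t *)context;
    assert(ctx != NULL && gene_name != NULL);

    /* Unused in this sketch; a real callback would store them. */
    (void)chro_name;
    (void)is_negative_strand;

    /* Assumed convention: reject features whose end precedes their start. */
    if (end < start) return -1;

    ctx->n_features++;
    if (transcript_name != NULL) ctx->n_with_transcript++;

    /* Bounded copy with explicit termination. */
    strncpy(ctx->last_gene, gene_name, sizeof(ctx->last_gene) - 1);
    ctx->last_gene[sizeof(ctx->last_gene) - 1] = '\0';
    return 0;
}
```

Under the new prototype such a callback would be handed over as the last
argument, after file_name, file_type, gene_id_column, transcript_id_column,
used_feature_type and context. Callers that do not need transcript
identifiers can presumably pass NULL for transcript_id_column, in which case
the callback always receives a NULL transcript name.
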
diff --git a/doc/SubreadUsersGuide.tex b/doc/SubreadUsersGuide.tex
index 1bf7b73..f4698e3 100644
--- a/doc/SubreadUsersGuide.tex
+++ b/doc/SubreadUsersGuide.tex
@@ -35,9 +35,9 @@
\begin{center}
{\Huge\bf Subread/Rsubread Users Guide}\\
\vspace{1 cm}
-{\centering\large Subread v1.5.2/Rsubread v1.24.2\\}
+{\centering\large Subread v1.5.3/Rsubread v1.26.1\\}
\vspace{1 cm}
-\centering 15 March 2017\\
+\centering 11 July 2017\\
\vspace{5 cm}
\Large Wei Shi and Yang Liao\\
\vspace{1 cm}
@@ -260,7 +260,7 @@ In the first scan, the aligners use seed-and-vote method to identify candidate m
In the second scan, they carry out final alignment for each read using the variant and junction information.
Variant and junction data (including chromosomal coordinates and number of supporting reads) will be output along with the read mapping results.
To the best of our knowledge, \code{Subread} and \code{Subjunc} are the first to employ a two-scan mapping strategy to achieve a superior mapping accuracy.
-This strategy was later adopted by other aligners as well (called `two-pass').
+This strategy was later seen in other aligners as well (called `two-pass').
\section{Multi-mapping reads}
@@ -291,9 +291,7 @@ Total number of matched bases (for genomic DNA-seq data) or mis-matched bases (f
\section{Recommended aligner setting}
It is recommended to report uniquely mapped reads only when running \code{Subread} and \code{Subjunc} aligners since this will give the most accurate mapping result.
-By default, only uniquely mapped reads are reported when running aligners in Bioconductor {\Rsubread} package.
-This however needs to be explicitly specified when running aligners in SourceForge {\Subread} package (\code{-u}).
-
+This is the default setting for the two aligners in both SourceForge {\Subread} and {\Rsubread} packages.
\chapter{Mapping reads generated by genomic DNA sequencing technologies}
\label{chapter:subread-dnaseq}
@@ -310,20 +308,17 @@ An index must be built for the reference first and then the read mapping can be
{\noindent\bf Step 2: Align reads}\\
-\noindent Map single-end reads from a gzipped file using 5 threads and save mapping results to a BAM file:\\
+\noindent Map single-end genomic DNA sequencing reads using 5 threads (only uniquely mapped reads are reported):\\
\code{subread-align -t 1 -T 5 -i my\_index -r reads.txt.gz -o subread\_results.bam}\\
+\noindent Map paired-end reads:\\
+\code{subread-align -t 1 -d 50 -D 600 -i my\_index -r reads1.txt -R reads2.txt \newline -o subread\_results.sam}\\
+
\noindent Detect indels of up to 16bp:\\
\code{subread-align -t 1 -I 16 -i my\_index -r reads.txt -o subread\_results.sam}\\
\noindent Report up to three best mapping locations:\\
-\code{subread-align -t 1 -B 3 -i my\_index -r reads.txt -o subread\_results.sam}\\
-
-\noindent Report uniquely mapped reads only:\\
-\code{subread-align -t 1 -u -i my\_index -r reads.txt -o subread\_results.sam}\\
-
-\noindent Map paired-end reads:\\
-\code{subread-align -t 1 -d 50 -D 600 -i my\_index -r reads1.txt -R reads2.txt \newline -o subread\_results.sam}\\
+\code{subread-align -t 1 --multiMapping -B 3 -i my\_index -r reads.txt -o subread\_results.sam}\\
\section{A quick start for using Bioconductor {\Rsubread} package}
@@ -353,7 +348,8 @@ align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread
\noindent Report up to three best mapping locations:
\begin{Rcode}
-align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",nBestLocations=3)
+align(index="my_index",readfile1="reads.txt.gz",type="dna",output_file="rsubread.bam",
+unique=FALSE,nBestLocations=3)
\end{Rcode}
\noindent Map paired-end reads:
@@ -379,9 +375,9 @@ Note that the starting `\code{>}' character in the header line is not included i
Table 1 describes the arguments used by the \code{subread-buildindex} program.
-%\newpage
+\newpage
-\begin{table}[h]
+\begin{table}[!tpb]
\raggedright{Table 1: Arguments used by the \code{subread-buildindex} program (\code{buildindex} function in \Rsubread) in alphabetical order.
Arguments in parenthesis in the first column are used by \code{buildindex}.\newline\\}
\begin{tabular}{|p{4cm}|p{12cm}|}
@@ -435,7 +431,7 @@ Arguments & Description \\
\hline
-b \newline (\code{color2base=TRUE}) & Output base-space reads instead of color-space reads in mapping output for color space data (eg. LifTech SOLiD data). Note that the mapping itself will still be performed at color-space.\\
\hline
--B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations allowed to be reported for each read. 1 by default. `NH' tag is used to indicate how many alignments are reported for the read and `HI' tag is used for numbering the alignments reported for the same read, in the output. Note that \code{-u} option takes precedence over \code{-B}.\\
+-B $<int>$ \newline (\code{nBestLocations}) & Specify the maximal number of equally-best mapping locations to be reported for a read. 1 by default. In the mapping output, the `NH' tag is used to indicate how many alignments are reported for the read and the `HI' tag is used for numbering the alignments reported for the same read. This option should be used together with the `$--$multiMapping' option. \\
\hline
-d $<int>$ \newline (\code{minFragLength}) & Specify the minimum fragment/template length, 50 by default. Note that if the two reads from the same pair do not satisfy the fragment length criteria, they will be mapped individually as if they were single-end reads.\\
\hline
@@ -468,8 +464,8 @@ Arguments & Description \\
$^*$ -t $<int>$ \newline (\code{type}) & Specify the type of input sequencing data. Possible values include \code{0}, denoting RNA-seq data, or \code{1}, denoting genomic DNA-seq data. User must specify the value. Character values including `rna' and `dna' can also be used in the {\R} function. For genomic DNA-seq data, the aligner takes into account both the number of matched bases and the number of mis-matched bases to determine the the best mapping location after applying the `seed-a [...]
\hline
-T $<int>$ \newline (\code{nthreads}) & Specify the number of threads/CPUs used for mapping. The value should be between 1 and 32. 1 by default.\\
-\hline
--u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
+%\hline
+%-u \newline (\code{unique=TRUE}) & Output uniquely mapped reads only. Reads that were found to have more than one best mapping location will not be reported.\\
\hline
$^{**}$$--$allJunctions \newline (\code{reportAllJunctions} \newline \code{=TRUE}) & This option should be used with \code{subjunc} for detecting canonical exon-exon junctions (with `GT/AG' donor/receptor sites), non-canonical exon-exon junctions and structural variants (SVs) in RNA-seq data. detected junctions will be saved to a file with suffix name ``.junction.bed". Detected SV breakpoints will be saved to a file with suffix name ``.breakpoints.txt", which includes chromosomal coordin [...]
\hline
@@ -489,6 +485,8 @@ $--$gtfFeature $<string>$ \newline (\code{GTF.featureType}) & Specify the type o
\hline
$--$gtfAttr $<string>$ \newline (\code{GTF.attrType}) & Specify the type of attributes in a GTF annotation that will be used to group features. `gene\_id' by default. Attributes can be found in the 9th column of a GTF annotation.\\
\hline
+$--$multiMapping \newline (\code{unique=FALSE}) & Multi-mapping reads will also be reported in the mapping output. Number of alignments reported for each multi-mapping read is determined by the `-B' option. If the total number of equally best mapping locations found for a read is greater than the number specified by `-B', then random mapping locations (total number of these locations is the same as `-B' value) will be selected. For example, if value of `-B' is 1, then one random locatio [...]
+\hline
$--$rg $<string>$ \newline (\code{readGroup}) & Add a $<tag:value>$ to the read group (RG) header in the mapping output. \\
\hline
$--$rg-id $<string>$ \newline (\code{readGroupID}) & Specify the read group ID. If specified, the read group ID will be added to the read group header field and also to each read in the mapping output. \\
@@ -545,7 +543,7 @@ $N_{mm}$ is the number of mismatches present in the final reported alignment for
Read mapping results for each library will be saved to a BAM or SAM format file.
Short indels detected from the read data will be saved to a text file (`.indel').
-If `--sv' is specified when running \code{subread-align}, breakpoints detected from structural variant events will be output to a text file for each library as well (`.breakpoints.txt').
+If `$--$sv' is specified when running \code{subread-align}, breakpoints detected from structural variant events will be output to a text file for each library as well (`.breakpoints.txt').
\newpage
@@ -691,8 +689,8 @@ Below is an example of mapping 50bp long reads (adaptor sequences were included
Note that `-t' option should have a value of 1 since miRNA-seq reads are more similar to gDNA-seq reads than mRNA-seq reads from the read mapping point of vew.
\code{\\
-subread-align -t 1 -i mm10\_full\_index -n 35 -m 4 -M 3 -T 10 -I 0 -P 3 -B 10 \\
--r miRNA\_reads.fastq -o result.sam\\
+subread-align -t 1 -i mm10\_full\_index -n 35 -m 4 -M 3 -T 10 -I 0
+--multiMapping -B 10 -r miRNA\_reads.fastq -o result.sam\\
}
The `-B 10' parameter instructs {\Subread} aligner to report up to 10 best mapping locations (equally best) in the mapping results.
@@ -775,6 +773,18 @@ GeneID Chr Start End Strand\\
\code{GeneID} column includes gene identifiers that can be numbers or character strings.
Chromosomal names included in the \code{Chr} column must match the chromosomal names of reference sequences to which the reads were aligned.
+\subsection{In-built annotations}
+
+In-built gene annotations for genomes \emph{hg38}, \emph{hg19}, \emph{mm10} and \emph{mm9} are included in both Bioconductor {\Rsubread} package and SourceForge {\Subread} package.
+These annotations were downloaded from NCBI RefSeq database and then adapted by merging overlapping exons from the same gene to form a set of disjoint exons for each gene.
+Genes with the same Entrez gene identifiers were also merged into one gene.
+
+Each row in the annotation represents an exon of a gene. There are five columns in the annotation data including Entrez gene identifier (\emph{GeneID}), chromosomal name (\emph{Chr}), chromosomal start position(\emph{Start}), chromosomal end position (\emph{End}) and strand (\emph{Strand}).
+
+In {\Rsubread}, users can access these annotations via the {\textsf{getInBuiltAnnotation}} function.
+In {\Subread}, these annotations are stored in directory `annotation' under home directory of the package.
+
+
\subsection{Single and paired-end reads}
Reads may be paired or unpaired.
@@ -824,44 +834,59 @@ If a read is both multi-mapping and multi-overlapping, then each overlapping met
Note that each alignment reported for a multi-mapping read is assessed separately for overlapping with multiple meta-features/features.
-\subsection{In-built annotations}
+\subsection{Read filtering}
+\label{sec:read_filtering}
-In-built gene annotations for genomes \emph{hg38}, \emph{hg19}, \emph{mm10} and \emph{mm9} are included in both Bioconductor {\Rsubread} package and SourceForge {\Subread} package.
-These annotations were downloaded from NCBI RefSeq database and then adapted by merging overlapping exons from the same gene to form a set of disjoint exons for each gene.
-Genes with the same Entrez gene identifiers were also merged into one gene.
+{\featureCounts} implements a variety of read filters to facilitate flexible read counting, which should satisfy the requirement of most downstream analyses.
+The order of these filters being applied is as follows (from highest to lowest):
+unmapped
+$>$ mapping quality
+$>$ chimeric fragment
+$>$ fragment length
+$>$ duplication
+$>$ multi-mapping
+$>$ secondary alignment
+$>$ split reads
+$>$ no overlapping features
+$>$ overlapping length
+$>$ assignment ambiguity.
-Each row in the annotation represents an exon of a gene. There are five columns in the annotation data including Entrez gene identifier (\emph{GeneID}), chromosomal name (\emph{Chr}), chromosomal start position(\emph{Start}), chromosomal end position (\emph{End}) and strand (\emph{Strand}).
+Number of reads that were excluded from counting by each filter is reported in the program output, in addition to the reported read counts (see Section~\ref{sec:program_output}).
-In {\Rsubread}, users can access these annotations via the {\textsf{getInBuiltAnnotation}} function.
-In {\Subread}, these annotations are stored in directory `annotation' under home directory of the package.
\subsection{Program output}
+\label{sec:program_output}
-Output of {\featureCounts} program in SourceForge {\Subread} package is saved into a tab-delimited file, which includes annotation columns (`Geneid', `Chr', `Start', `End', `Strand' and `Length') and data columns (read counts for each gene in each library).
-Annotation column `Length' contains total number of non-overlapping bases of each feature or meta-feature.
-When for example summarizing RNA-seq reads to genes, this column will give total number of non-overlapping bases included in all exons belonging to the same gene, for each gene.
+The output of {\featureCounts} program includes a count table and a summary of counting results.
+For SourceForge {\Subread}, the output data are saved to two tab-delimited files: one file contains read counts (file name is specified by the user) and the other file includes summary of counting results (file name is the name of read count file added with `.summary').
+For {\Rsubread}, all the output data are saved to an {\R} `List' object (for more details see the help page for {\featureCounts} function in {\Rsubread} package).
-When performing summarization at meta-feature level, annotation columns including `Chr', `Start', `End', `Strand' and `Length' give the annotation information for every feature included each meta-features.
-Therefore, each of these columns may include more than one value (semi-colon separated).
+The read count table includes annotation columns (`Geneid', `Chr', `Start', `End', `Strand' and `Length') and data columns (eg. read counts for genes for each library).
+When counting reads to meta-features (eg. genes) columns `Chr', `Start', `End' and `Strand' may each contain multiple values (separated by semi-colons), which correspond to individual features included in the same meta-feature.
+Column `Length' always contains one single value which is the total number of non-overlapping bases included in a meta-feature (or a feature), regardless of counting at meta-feature level or feature level.
+When counting RNA-seq reads to genes, the `Length' column typically contains the total number of non-overlapping bases in exons belonging to the same gene for each gene.
+
+The counting summary includes the total number of reads that are assigned and also the number of reads that are not assigned due to filtering.
+Below lists all the filters supported by {\featureCounts}:
-Output of {\featureCounts} program in SourceForge {\Subread} package also includes stat info of summarization results, which is saved to a tab-delimited file as well (a separate file).
-This file gives the total number of reads that are successfully assigned and also numbers of reads that are not assigned due to various reasons.
-Below lists the reasons why reads may not be assigned:
\begin{itemize}
-\item Unassigned\_Ambiguity: overlapping with two or more features (feature-level summarization) or meta-features (meta-feature-level) summarization.
-\item Unassigned\_MultiMapping: reads marked as multi-mapping in SAM/BAM input (the `NH' tag is checked by the program).
-\item Unassigned\_NoFeatures: not overlapping with any features included in the annotation.
\item Unassigned\_Unmapped: reads are reported as unmapped in SAM/BAM input. Note that if the `--primary' option of featureCounts program is specified, the read marked as a primary alignment will be considered for assigning to features.
\item Unassigned\_MappingQuality: mapping quality scores lower than the specified threshold.
-\item Unassigned\_FragementLength: length of fragment does not satisfy the criteria.
\item Unassigned\_Chimera: two reads from the same pair are mapped to different chromosomes or have incorrect orientation.
-\item Unassigned\_Secondary: reads marked as second alignment in the FLAG field in SAM/BAM input.
-\item Unassigned\_Nonjunction: reads do not span two or more exons. Such reads will not be assigned if the `--countSplitAlignmentsOnly' option is specified.
+\item Unassigned\_FragementLength: length of fragment does not satisfy the criteria.
\item Unassigned\_Duplicate: reads marked as duplicate in the FLAG field in SAM/BAM input.
+\item Unassigned\_MultiMapping: reads marked as multi-mapping in SAM/BAM input (the `NH' tag is checked by the program).
+\item Unassigned\_Secondary: reads marked as second alignment in the FLAG field in SAM/BAM input.
+\item Unassigned\_Nonjunction: reads that do not span exons will not be assigned if the `--countSplitAlignmentsOnly' option is specified.
+\item Unassigned\_NoFeatures: not overlapping with any features included in the annotation.
+\item Unassigned\_Overlapping\_Length: no features/meta-features were found to have the minimum required overlap length.
+\item Unassigned\_Ambiguity: overlapping with two or more features (feature-level summarization) or meta-features (meta-feature-level) summarization.
\end{itemize}
-All these output were also provided by the {\featureCounts} function included in Bioconductor {\Rsubread} package, except that read summarization results are saved into an {\R} `List' object.
-For more details, see the help page for {\featureCounts} function in {\Rsubread}.
+These filters are listed in the order that they are applied (same order with that shown in Section~\ref{sec:read_filtering}).
+Which of the filters are applied during read counting depend on the parameter setting.
+Usually only a subset of filters are applied in the counting.
+Unassigned reads are counted for each filter and each unassigned read will only be counted for one filter (the first filter that filters the read out).
\subsection{Program usage}
@@ -902,6 +927,8 @@ read mapping that produced the provided SAM/BAM files. This optional argument ca
\hline
-J \newline (\code{juncCounts}) & Count the number of reads supporting each exon-exon junction. Junctions are identified from those exon-spanning reads (containing `N' in CIGAR string) in input data. The output result includes names of primary and secondary genes that overlap at least one of the two splice sites of a junction. Only one primary gene is reported, but there might be more than one secondary gene reported. Secondary genes do not overlap more splice sites than the primary gene [...]
\hline
+-L \newline (\code{isLongRead}) & Turn on long-read counting mode. This option should be used when counting long reads such as Nanopore or PacBio reads.\\
+\hline
-M \newline (\code{countMultiMappingReads}) & If specified, multi-mapping reads/fragments will be counted. The program uses the `NH' tag to find multi-mapping reads. Alignments reported for a multi-mapping read will be counted separately. Each alignment will have \code{1} count or a fractional count if \code{--fraction} is specified. See section ``Count multi-mapping reads and multi-overlapping reads'' for more details.\\
\hline
-o $<string>$ & Give the name of the output file. The output file contains the number of reads assigned to each meta-feature (or each feature if \code{-f} is specified). Note that the {\featureCounts} function in {\Rsubread} does not use this parameter. It returns a \code{list} object including read summarization results and other data. \\
@@ -914,7 +941,7 @@ read mapping that produced the provided SAM/BAM files. This optional argument ca
\hline
-Q $<int>$ \newline (\code{minMQS}) & The minimum mapping quality score a read must satisfy in order to be counted. For paired-end reads, at least one end should satisfy this criteria. 0 by default.\\
\hline
--R \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). They are saved to a tab-delimited file that contains four columns including read name, status(assigned or the reason if not assigned), name of target feature/meta-feature and total number of hits if the read/fragment is counted multiple times. Names of output files are the same as input file names except a suffix string `.featureCounts' is added.\\
+-R $<string>$ \newline (\code{reportReads}) & Output detailed read assignment results for each read (or fragment if paired end). The detailed assignment results can be saved in three different formats including \code{CORE}, \code{SAM} and \code{BAM} (note that these values are case sensitive). \newline When \code{CORE} format is specified, a tab-delimited file will be generated for each input file. Name of each generated file is the input file name added with `.featureCounts'. Each gene [...]
\hline
-s $<int>$ \newline (\code{isStrandSpecific}) & Indicate if strand-specific read counting should be performed. Acceptable values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default. For paired-end reads, strand of the first read is taken as the strand of the whole fragment. FLAG field is used to tell if a read is first or second read in a pair.\\
\hline
@@ -924,6 +951,8 @@ read mapping that produced the provided SAM/BAM files. This optional argument ca
\hline
-v & Output version of the program. \\
\hline
+$--$byReadGroup \newline (\code{byReadGroup}) & Count reads by read group. Read group information is identified from the header of BAM/SAM input files and the generated count table will include counts for each group in each library.\\
+\hline
$--$countSplit \newline AlignmentsOnly \newline (\code{splitOnly}) & If specified, only split alignments (CIGAR strings contain letter `N') will be counted. All the other alignments will be ignored. An example of split alignments is the exon-spanning reads in RNA-seq data. If exon-spanning reads need to be assigned to all their overlapping exons, `-f' and `-O' options should be provided as well.\\
\hline
$--$countNonSplit \newline AlignmentsOnly \newline (\code{nonSplitOnly}) & If specified, only non-split alignments (CIGAR strings do not contain letter `N') will be counted. All the other alignments will be ignored.\\
@@ -952,6 +981,8 @@ $--$readExtension3 $<int>$ \newline (\code{readExtension3}) & Reads are extended
\hline
$--$tmpDir $<string>$ \newline (\code{tmpDir}) & Directory under which intermediate files are saved (later removed). By default, intermediate files will be saved to the directory specified in `-o' argument (In \R, intermediate files are saved to the current working directory by default).\\
\hline
+$--$verbose \newline (\code{verbose}) & Output verbose information for debugging such as unmatched chromosomes/contigs between reads and annotation.\\
+\hline
\end{longtable}
\pagebreak
@@ -1008,9 +1039,9 @@ The example commands below assume your annotation file is in GTF format.\\
library(Rsubread)
\end{Rcode}
-\noindent Summarize single-end reads using built-in RefSeq annotation for mouse genome mm9:
+\noindent Summarize single-end reads using built-in RefSeq annotation for mouse genome `mm10' (`mm10' is the default inbuilt genome annotation):
\begin{Rcode}
-featureCounts(files="mapping_results_SE.sam",annot.inbuilt="mm9")
+featureCounts(files="mapping_results_SE.sam")
\end{Rcode}
\noindent Summarize single-end reads using a user-provided GTF annotation file:
@@ -1161,12 +1192,13 @@ Retrieve Phred scores for read bases from a Fastq/BAM/SAM file.
Remove duplicated reads from a SAM file.
-
\section{subread-fullscan}
Get all chromosomal locations that contain a genomic sequence sharing high homology with a given input sequence.
+\section{txUnique}
+This function is only implemented in {\Rsubread} and it counts the number of bases unique to each transcript.
\chapter{Case studies}
@@ -1214,8 +1246,7 @@ buildindex(basename="chr1",reference="hg19_chr1.fa")
\begin{Rcode}
targets <- readTargets()
-align(index="chr1",readfile1=targets$InputFile,input_format="gzFASTQ",output_format="BAM",
-output_file=targets$OutputFile,unique=TRUE,indels=5)
+align(index="chr1",readfile1=targets$InputFile,output_file=targets$OutputFile)
\end{Rcode}
{\noindent\bf Read summarization.} Summarize mapped reads to NCBI RefSeq genes.
@@ -1248,18 +1279,18 @@ Create a {\DGEList} object.
x <- DGEList(counts=fc$counts, genes=fc$annotation[,c("GeneID","Length")])
\end{Rcode}
-Calculate RPKM (reads per kilobases of exon per million reads mapped) values for genes:
-\begin{Rcode}
-x_rpkm <- rpkm(x,x$genes$Length,prior.count=0)
-
-x_rpkm[1:5,]
- A_1.bam A_2.bam B_1.bam B_2.bam
-653635 939 905.0 709 736
-100422834 19 0.0 0 0
-645520 11 8.1 0 0
-79501 0 0.0 0 0
-729737 62 64.9 19 16
-\end{Rcode}
+%Calculate RPKM (reads per kilobases of exon per million reads mapped) values for genes:
+%\begin{Rcode}
+%x_rpkm <- rpkm(x,x$genes$Length,prior.count=0)
+%
+%x_rpkm[1:5,]
+% A_1.bam A_2.bam B_1.bam B_2.bam
+%653635 939 905.0 709 736
+%100422834 19 0.0 0 0
+%645520 11 8.1 0 0
+%79501 0 0.0 0 0
+%729737 62 64.9 19 16
+%\end{Rcode}
{\noindent\bf Filtering.} Only keep in the analysis those genes which had $>$10 reads per million mapped reads in at least two libraries.
diff --git a/src/HelperFunctions.c b/src/HelperFunctions.c
index c08351a..ce372c1 100644
--- a/src/HelperFunctions.c
+++ b/src/HelperFunctions.c
@@ -36,10 +36,12 @@
#else
+#ifndef __MINGW32__
#include <sys/ioctl.h>
+#include <netinet/in.h>
#include <net/if.h>
+#endif
#include <unistd.h>
-#include <netinet/in.h>
#endif
@@ -768,7 +770,7 @@ int strcmp_number(char * s1, char * s2)
int mac_str(char * str_buff)
{
-#ifdef FREEBSD
+#if defined(FREEBSD) || defined(__MINGW32__)
return 1;
#else
#ifdef MACOS
@@ -980,10 +982,10 @@ double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c,
}
-int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
- void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) ){
+int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * transcript_id_column, char * used_feature_type,
+ void * context, int do_add_feature(char * gene_name, char * transcript_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) ){
char * file_line = malloc(MAX_LINE_LENGTH+1);
- int lineno = 0, is_GFF_warned = 0, loaded_features = 0;
+ int lineno = 0, is_GFF_txid_warned = 0, is_GFF_geneid_warned = 0, loaded_features = 0;
FILE * fp = fopen(file_name, "r");
if(NULL == fp){
@@ -992,9 +994,9 @@ int load_features_annotation(char * file_name, int file_type, char * gene_id_col
}
while(1){
- int is_gene_id_found = 0, is_negative_strand = -1;
- char * token_temp = NULL, * feature_name, * chro_name = NULL;
- char feature_name_tmp[FEATURE_NAME_LENGTH];
+ int is_tx_id_found = 0, is_gene_id_found = 0, is_negative_strand = -1;
+ char * token_temp = NULL, * feature_name, *transcript_id = NULL, * chro_name = NULL;
+ char feature_name_tmp[FEATURE_NAME_LENGTH], txid_tmp[FEATURE_NAME_LENGTH];
feature_name = feature_name_tmp;
unsigned int start = 0, end = 0;
@@ -1049,7 +1051,7 @@ int load_features_annotation(char * file_name, int file_type, char * gene_id_col
strtok_r(NULL,"\t", &token_temp);// source
char * feature_type = strtok_r(NULL,"\t", &token_temp);// feature_type
- if(strcmp(feature_type, feature_name_column)==0){
+ if(strcmp(feature_type, used_feature_type)==0){
char * start_ptr = strtok_r(NULL,"\t", &token_temp);
char * end_ptr = strtok_r(NULL,"\t", &token_temp);
@@ -1084,23 +1086,37 @@ int load_features_annotation(char * file_name, int file_type, char * gene_id_col
if(extra_attrs && (strlen(extra_attrs)>2)){
int attr_val_len = GTF_extra_column_value(extra_attrs , gene_id_column , feature_name_tmp, FEATURE_NAME_LENGTH);
if(attr_val_len>0) is_gene_id_found=1;
+
+ if(transcript_id_column){
+ transcript_id = txid_tmp;
+ attr_val_len = GTF_extra_column_value(extra_attrs , transcript_id_column , txid_tmp, FEATURE_NAME_LENGTH);
+ if(attr_val_len>0) is_tx_id_found=1;
+ else transcript_id = NULL;
+ }
}
if(!is_gene_id_found){
- if(!is_GFF_warned)
- {
+ if(!is_GFF_geneid_warned){
int ext_att_len = strlen(extra_attrs);
if(extra_attrs[ext_att_len-1] == '\n') extra_attrs[ext_att_len-1] =0;
SUBREADprintf("\nWarning: failed to find the gene identifier attribute in the 9th column of the provided GTF file.\nThe specified gene identifier attribute is '%s' \nThe attributes included in your GTF annotation are '%s' \n\n", gene_id_column, extra_attrs);
}
- is_GFF_warned++;
+ is_GFF_geneid_warned++;
}
+ if(transcript_id_column && !is_tx_id_found){
+ if(!is_GFF_txid_warned){
+ int ext_att_len = strlen(extra_attrs);
+ if(extra_attrs[ext_att_len-1] == '\n') extra_attrs[ext_att_len-1] =0;
+ SUBREADprintf("\nWarning: failed to find the transcript identifier attribute in the 9th column of the provided GTF file.\nThe specified gene identifier attribute is '%s' \nThe attributes included in your GTF annotation are '%s' \n\n", transcript_id_column, extra_attrs);
+ }
+ is_GFF_txid_warned++;
+ }
}
}
if(is_gene_id_found){
- do_add_feature(feature_name, chro_name, start, end, is_negative_strand, context);
+ do_add_feature(feature_name, transcript_id, chro_name, start, end, is_negative_strand, context);
loaded_features++;
}
diff --git a/src/HelperFunctions.h b/src/HelperFunctions.h
index 0325657..ae15a55 100644
--- a/src/HelperFunctions.h
+++ b/src/HelperFunctions.h
@@ -71,8 +71,8 @@ unsigned int find_left_end_cigar(unsigned int right_pos, char * cigar);
int mac_or_rand_str(char * char_14);
double fast_fisher_test_one_side(unsigned int a, unsigned int b, unsigned int c, unsigned int d, long double * frac_buffer, int buffer_size);
-int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * feature_name_column,
- void * context, int do_add_feature(char * gene_name, char * chro_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) );
+int load_features_annotation(char * file_name, int file_type, char * gene_id_column, char * transcript_id_column, char * used_feature_type,
+ void * context, int do_add_feature(char * gene_name, char * transcript_id, char * chrome_name, unsigned int start, unsigned int end, int is_negative_strand, void * context) );
HashTable * load_alias_table(char * fname) ;
#endif
diff --git a/src/Makefile.FreeBSD b/src/Makefile.FreeBSD
index f588afc..b901a3f 100644
--- a/src/Makefile.FreeBSD
+++ b/src/Makefile.FreeBSD
@@ -3,7 +3,7 @@ include makefile.version
MACOS = -D FREEBSD
-CCFLAGS = -march=native -mtune=core2 ${MACOS} -O9 -Wall -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\"
+CCFLAGS = -march=native -mtune=core2 ${MACOS} -O9 -Wall -Wno-maybe-uninitialized -Wno-incompatible-pointer-types -Wno-array-bounds -Wno-unused-but-set-variable -Wno-unused-variable -Wno-unused-result -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\"
LDFLAGS = -pthread -lz -lm ${MACOS} -DMAKE_FOR_EXON -D MAKE_STANDALONE -l compat # -DREPORT_ALL_THE_BEST
CC = gcc ${CCFLAGS} -ggdb -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
@@ -14,20 +14,23 @@ ALL_OBJECTS=$(addsuffix .o, ${ALL_LIBS})
ALL_H=$(addsuffix .h, ${ALL_LIBS})
ALL_C=$(addsuffix .c, ${ALL_LIBS})
-all: featureCounts removeDup exactSNP subread-buildindex subindel subread-align subjunc subtools qualityScores subread-fullscan propmapped coverageCount
+all: repair featureCounts removeDup exactSNP subread-buildindex subindel subread-align subjunc subtools qualityScores subread-fullscan propmapped coverageCount
mkdir -p ../bin/utilities
mv subread-align subjunc featureCounts subindel exactSNP subread-buildindex ../bin/
- mv coverageCount subtools qualityScores propmapped subread-fullscan removeDup ../bin/utilities
+ mv repair coverageCount subtools qualityScores propmapped subread-fullscan removeDup ../bin/utilities
@echo
@echo "###########################################################"
@echo "# #"
- @echo "# Installation complete. #"
+ @echo "# Installation successfully complete. #"
@echo "# #"
@echo "# Generated executables were copied to directory ../bin/ #"
@echo "# #"
@echo "###########################################################"
@echo
+repair: read-repair.c ${ALL_OBJECTS}
+ ${CC} -o repair read-repair.c ${ALL_OBJECTS} ${LDFLAGS}
+
propmapped: propmapped.c ${ALL_OBJECTS}
${CC} -o propmapped propmapped.c ${ALL_OBJECTS} ${LDFLAGS}
diff --git a/src/Makefile.Linux b/src/Makefile.Linux
index fed83b5..4a251f7 100644
--- a/src/Makefile.Linux
+++ b/src/Makefile.Linux
@@ -3,10 +3,10 @@
include makefile.version
OPT_LEVEL = 3
-CCFLAGS = -mtune=core2 ${MACOS} -O${OPT_LEVEL} -Wall -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64
-LDFLAGS = ${STATIC_MAKE} -lpthread -lz -lm ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D MAKE_STANDALONE # -DREPORT_ALL_THE_BEST
+CCFLAGS = -mtune=core2 ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64 # -w
+LDFLAGS = ${STATIC_MAKE} -pthread -lz -lm ${MACOS} -O${OPT_LEVEL} -DMAKE_FOR_EXON -D MAKE_STANDALONE
CC_EXEC = gcc
-CC = ${CC_EXEC} ${CCFLAGS} -fmessage-length=0 -ggdb # -fomit-frame-pointer -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
+CC = ${CC_EXEC} ${CCFLAGS} -fmessage-length=0 -ggdb
ALL_LIBS= core core-junction core-indel sambam-file sublog gene-algorithms hashtable input-files sorted-hashtable gene-value-index exon-algorithms HelperFunctions interval_merge long-hashtable core-bigtable seek-zlib
@@ -14,20 +14,26 @@ ALL_OBJECTS=$(addsuffix .o, ${ALL_LIBS})
ALL_H=$(addsuffix .h, ${ALL_LIBS})
ALL_C=$(addsuffix .c, ${ALL_LIBS})
-all: repair featureCounts removeDup exactSNP subread-buildindex subindel subread-align subjunc qualityScores subread-fullscan propmapped coverageCount # samMappedBases mergeVCF testZlib
+all: repair txUnique featureCounts removeDup exactSNP subread-buildindex subindel subread-align subjunc qualityScores subread-fullscan propmapped coverageCount # samMappedBases mergeVCF testZlib
mkdir -p ../bin/utilities
mv subread-align subjunc featureCounts subindel exactSNP subread-buildindex ../bin/
- mv repair coverageCount propmapped qualityScores removeDup subread-fullscan ../bin/utilities
+ mv repair coverageCount propmapped qualityScores removeDup subread-fullscan txUnique ../bin/utilities
@echo
@echo "###########################################################"
@echo "# #"
- @echo "# Installation complete. #"
+ @echo "# Installation successfully complete. #"
@echo "# #"
@echo "# Generated executables were copied to directory ../bin/ #"
@echo "# #"
@echo "###########################################################"
@echo
+repair: read-repair.c ${ALL_OBJECTS}
+ ${CC} -o repair read-repair.c ${ALL_OBJECTS} ${LDFLAGS}
+
+txUnique: tx-unique.c tx-unique.h ${ALL_OBJECTS}
+ ${CC} -o txUnique tx-unique.c ${ALL_OBJECTS} ${LDFLAGS}
+
globalReassembly: global-reassembly.c ${ALL_OBJECTS}
${CC} -o globalReassembly global-reassembly.c ${ALL_OBJECTS} ${LDFLAGS}
@@ -67,16 +73,5 @@ subread-fullscan: fullscan.c ${ALL_OBJECTS}
coverageCount: coverage_calc.c ${ALL_OBJECTS}
${CC} -o coverageCount coverage_calc.c ${ALL_OBJECTS} ${LDFLAGS}
-#testZlib: test-seek-zlib.c ${ALL_OBJECTS}
-# ${CC} -o testZlib test-seek-zlib.c ${ALL_OBJECTS} ${LDFLAGS}
-
-repair: read-repair.c ${ALL_OBJECTS}
- ${CC} -o repair read-repair.c ${ALL_OBJECTS} ${LDFLAGS}
-
-#samMappedBases: samMappedBases.c ${ALL_OBJECTS}
-# ${CC} -o samMappedBases samMappedBases.c ${ALL_OBJECTS} ${LDFLAGS}
-#mergeVCF: mergeVCF.c ${ALL_OBJECTS}
-# ${CC} -o mergeVCF mergeVCF.c ${ALL_OBJECTS} ${LDFLAGS}
-
clean:
rm -f core featureCounts exactSNP removeDup subread-buildindex ${ALL_OBJECTS}
diff --git a/src/Makefile.MacOS b/src/Makefile.MacOS
index 7c8e2bf..122b4d8 100644
--- a/src/Makefile.MacOS
+++ b/src/Makefile.MacOS
@@ -1,7 +1,7 @@
MACOS = -D MACOS
include makefile.version
-CCFLAGS = -mtune=core2 ${MACOS} -O9 -Wall -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64
+CCFLAGS = -mtune=core2 ${MACOS} -O9 -w -DMAKE_FOR_EXON -D MAKE_STANDALONE -D SUBREAD_VERSION=\"${SUBREAD_VERSION}\" -D_FILE_OFFSET_BITS=64
LDFLAGS = -pthread -lz -lm ${MACOS} -DMAKE_FOR_EXON -D MAKE_STANDALONE # -DREPORT_ALL_THE_BEST
CC = gcc ${CCFLAGS} ${STATIC_MAKE} -ggdb -fomit-frame-pointer -O3 -ffast-math -funroll-loops -mmmx -msse -msse2 -msse3 -fmessage-length=0
@@ -18,7 +18,7 @@ all: repair featureCounts removeDup exactSNP subread-buildindex subindel subrea
@echo
@echo "###########################################################"
@echo "# #"
- @echo "# Installation complete. #"
+ @echo "# Installation successfully complete. #"
@echo "# #"
@echo "# Generated executables were copied to directory ../bin/ #"
@echo "# #"
@@ -27,7 +27,7 @@ all: repair featureCounts removeDup exactSNP subread-buildindex subindel subrea
repair: read-repair.c ${ALL_OBJECTS}
- ${CC} -o repair read-repair.c ${ALL_OBJECTS} ${LDFLAGS}
+ ${CC} -o repair read-repair.c ${ALL_OBJECTS} ${LDFLAGS}
propmapped: propmapped.c ${ALL_OBJECTS}
${CC} -o propmapped propmapped.c ${ALL_OBJECTS} ${LDFLAGS}
diff --git a/src/core-indel.c b/src/core-indel.c
index 79174db..437767d 100644
--- a/src/core-indel.c
+++ b/src/core-indel.c
@@ -1117,13 +1117,6 @@ int finalise_indel_and_junction_thread(global_context_t * global_context, thread
prev_env = this_event;
}
- if(0){
- for(xk1 = 0; xk1 < merge_target_items; xk1++){
- chromosome_event_t * pev = merge_target + xk1;
- printf("OCT27-MERGERES: %u~%u, indel=%d, nsup=%d, TYPE=%d\n",pev->event_small_side, pev->event_large_side, pev->indel_length, pev->supporting_reads, pev->event_type);
- }
- }
-
free(records);
if(thread_contexts)
@@ -1238,7 +1231,7 @@ typedef struct {
HashTable * feature_sorting_table;
} do_load_juncs_context_t;
-int do_juncs_add_feature(char * gene_name, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
+int do_juncs_add_feature(char * gene_name, char * transcript_id, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
//#warning ">>>>>>> COMMENt NEXT <<<<<<<<<<<<<<"
//SUBREADprintf("INJ LOCS: %s : %u, %u\n", chro_name, feature_start, feature_end);
do_load_juncs_context_t * do_load_juncs_context = context;
@@ -1336,8 +1329,7 @@ int load_known_junctions(global_context_t * global_context){
do_load_juncs_context.global_context = global_context;
do_load_juncs_context.feature_sorting_table = feature_sorting_table;
- int features = load_features_annotation(global_context->config.exon_annotation_file , global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, global_context->config.exon_annotation_feature_name_column,
- &do_load_juncs_context, do_juncs_add_feature);
+ int features = load_features_annotation(global_context->config.exon_annotation_file , global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, NULL, global_context->config.exon_annotation_feature_name_column, &do_load_juncs_context, do_juncs_add_feature);
feature_sorting_table -> appendix1 = global_context;
HashTableIteration(feature_sorting_table, add_annotation_to_junctions);
@@ -1476,22 +1468,9 @@ int search_event(global_context_t * global_context, HashTable * event_table, chr
}
//#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<"
- if(0){
- indel_context_t * indel_context = (indel_context_t *) global_context -> module_contexts[MODULE_INDEL_ID];
- chromosome_event_t * est = indel_context -> event_space_dynamic;
- if(est == event_space)
- printf("OCT27-STEPRS-EVENT_HIT= %u ; HIT=%d\n", pos, xk2);
-
- }
}else{
//#warning ">>>>>>>>>>>>>> COMMENT THIS <<<<<<<<<<<<<<<<<<<<<<<<<<<<"
- if(0){
- indel_context_t * indel_context = (indel_context_t *) global_context -> module_contexts[MODULE_INDEL_ID];
- chromosome_event_t * est = indel_context -> event_space_dynamic;
- if(est == event_space)
- printf("OCT27-STEPRS-EVENT_HIT= %u ; HIT=0000\n", pos);
-
- }
+
}
return ret;
@@ -2275,7 +2254,7 @@ void print_indel_table(global_context_t * global_context){
for(xk1 = 0; xk1 < indel_context -> total_events ; xk1++){
chromosome_event_t * event_body = indel_context -> event_space_dynamic +xk1;
- printf("OCT27-STEP-INTAB-TYPE-%d POS %u~%u GID=%u PV %d %d SUP %d / %d\n", event_body -> event_type, event_body -> event_small_side, event_body -> event_large_side, event_body -> global_event_id, event_body -> connected_next_event_distance, event_body -> connected_previous_event_distance , event_body -> supporting_reads , event_body -> anti_supporting_reads);
+ //printf("OCT27-STEP-INTAB-TYPE-%d POS %u~%u GID=%u PV %d %d SUP %d / %d\n", event_body -> event_type, event_body -> event_small_side, event_body -> event_large_side, event_body -> global_event_id, event_body -> connected_next_event_distance, event_body -> connected_previous_event_distance , event_body -> supporting_reads , event_body -> anti_supporting_reads);
}
int bucket;
@@ -2288,7 +2267,7 @@ void print_indel_table(global_context_t * global_context){
int env_i;
for(env_i = 1; env_array[env_i]; env_i++){
chromosome_event_t * event_body = indel_context -> event_space_dynamic + (env_array[env_i]-1);
- printf("OCT27-STEPQ-ENTAB-%u [%d] to %u ~ %u len=%d VAL=%d PTR=%p\n",entry_pos, env_i, event_body -> event_small_side, event_body -> event_large_side, event_body -> indel_length, env_array[env_i], env_array);
+ // printf("OCT27-STEPQ-ENTAB-%u [%d] to %u ~ %u len=%d VAL=%d PTR=%p\n",entry_pos, env_i, event_body -> event_small_side, event_body -> event_large_side, event_body -> indel_length, env_array[env_i], env_array);
}
cursor = cursor->next;
@@ -2327,8 +2306,8 @@ int write_indel_final_results(global_context_t * global_context)
chromosome_event_t * event_body = indel_context -> event_space_dynamic +xk1;
- //#warning " ================= REMOVE '- 1' from the next LINE!!! ========================="
- if((event_body -> event_type != CHRO_EVENT_TYPE_INDEL && event_body->event_type != CHRO_EVENT_TYPE_LONG_INDEL && event_body -> event_type != CHRO_EVENT_TYPE_POTENTIAL_INDEL)|| (event_body -> final_counted_reads < 1 && event_body -> event_type == CHRO_EVENT_TYPE_INDEL) )
+ //#warning " ================= REMOVE '- 1' from the next LINE!!! ========================="
+ if((event_body -> event_type != CHRO_EVENT_TYPE_INDEL && event_body->event_type != CHRO_EVENT_TYPE_LONG_INDEL && event_body -> event_type != CHRO_EVENT_TYPE_POTENTIAL_INDEL)|| (event_body -> final_counted_reads < 1 && event_body -> event_type == CHRO_EVENT_TYPE_INDEL) )
continue;
//assert((-event_body -> indel_length) < MAX_INSERTION_LENGTH);
@@ -4451,6 +4430,7 @@ void init_global_context(global_context_t * context)
context->config.do_fusion_detection = 0;
context->config.do_structural_variance_detection = 0;
context->config.more_accurate_fusions = 1;
+ context->config.report_multi_mapping_reads = 0;
//#warning "============= best values for the SVs application: 8; 5; 32 ==============="
context->config.top_scores = 8 - 5;
@@ -4471,7 +4451,6 @@ void init_global_context(global_context_t * context)
context->will_remove_input_file = 0;
context->config.ignore_unmapped_reads = 0;
context->config.report_unmapped_using_mate_pos = 1;
- context->config.report_multi_mapping_reads = 1;
context->config.downscale_mapping_quality=0;
context->config.ambiguous_mapping_tolerance = 39;
context->config.use_hamming_distance_break_ties = 0;
diff --git a/src/core-interface-aligner.c b/src/core-interface-aligner.c
index 43dd5aa..8488535 100644
--- a/src/core-interface-aligner.c
+++ b/src/core-interface-aligner.c
@@ -59,6 +59,7 @@ static struct option long_options[] =
{"complexIndels", no_argument, 0, 0},
{"minVoteCutoff", required_argument, 0, 0},
{"maxRealignLocations", required_argument, 0, 0},
+ {"multiMapping", no_argument, 0, 0},
{0, 0, 0, 0}
};
@@ -68,15 +69,15 @@ void print_usage_core_aligner()
SUBREADprintf("\nVersion %s\n\n", SUBREAD_VERSION);
SUBREADputs("Usage:");
SUBREADputs("");
- SUBREADputs("./subread-align [options] -i <index_name> -r <input> -t <type> -o <output>");
+ SUBREADputs("./subread-align [options] -i <index_name> -r <input> -t <type> -o <output>");
SUBREADputs("");
SUBREADputs("## Mandatory arguments:");
SUBREADputs(" ");
SUBREADputs(" -i <string> Base name of the index.");
SUBREADputs("");
SUBREADputs(" -r <string> Name of an input read file. If paired-end, this should be");
- SUBREADputs(" the first read file (typically containing ‘R1’ in the file");
- SUBREADputs(" name) and the second should be provided via ‘-R’.");
+ SUBREADputs(" the first read file (typically containing \"R1\" in the file");
+ SUBREADputs(" name) and the second should be provided via \"-R\".");
SUBREADputs(" Acceptable formats include gzipped FASTQ, FASTQ and FASTA.");
SUBREADputs(" These formats are identified automatically.");
SUBREADputs(" ");
@@ -92,15 +93,15 @@ void print_usage_core_aligner()
SUBREADputs(" STDOUT.");
SUBREADputs("");
SUBREADputs(" -R <string> Name of the second read file in paired-end data (typically");
- SUBREADputs(" containing ‘R2’ in the file name).");
+ SUBREADputs(" containing \"R2\" in the file name).");
SUBREADputs("");
SUBREADputs(" --SAMinput Input reads are in SAM format.");
SUBREADputs("");
SUBREADputs(" --BAMinput Input reads are in BAM format.");
SUBREADputs("");
- SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
+ SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
SUBREADputs("");
- SUBREADputs("# offset value added to Phred quality scores of read bases");
+ SUBREADputs("# Phred offset");
SUBREADputs("");
SUBREADputs(" -P <3:6> Offset value added to the Phred quality score of each read");
SUBREADputs(" base. '3' for phred+33 and '6' for phred+64. '3' by default.");
@@ -124,14 +125,13 @@ void print_usage_core_aligner()
SUBREADputs("");
SUBREADputs("# unique mapping and multi-mapping");
SUBREADputs("");
- SUBREADputs(" -u Report uniquely mapped reads only. Number of matched bases (");
- SUBREADputs(" for RNA-seq) or mis-matched bases(for genomic DNA-seq) is");
- SUBREADputs(" used to break the tie when multiple mapping locations are");
- SUBREADputs(" found.");
+ SUBREADputs(" --multiMapping Report multi-mapping reads in addition to uniquely mapped");
+ SUBREADputs(" reads. Use \"-B\" to set the maximum number of equally-best");
+ SUBREADputs(" alignments to be reported.");
SUBREADputs("");
- SUBREADputs(" -B <int> Maximal number of equally-best mapping locations to be");
- SUBREADputs(" reported. 1 by default. Note that -u option takes precedence");
- SUBREADputs(" over -B.");
+ SUBREADputs(" -B <int> Maximum number of equally-best alignments to be reported for");
+ SUBREADputs(" a multi-mapping read. Equally-best alignments have the same");
+ SUBREADputs(" number of mis-matched bases. 1 by default.");
SUBREADputs("");
SUBREADputs("# indel detection");
SUBREADputs("");
@@ -317,6 +317,7 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
global_context->config.max_vote_combinations = max(global_context->config.max_vote_combinations, global_context->config.reported_multi_best_reads + 1);
global_context->config.max_vote_simples = max(global_context->config.max_vote_simples, global_context->config.reported_multi_best_reads + 1);
+ global_context->config.report_multi_mapping_reads = 1;
break;
case 'H':
global_context->config.use_hamming_distance_break_ties = 1;
@@ -347,10 +348,6 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
case 'U':
global_context->config.report_no_unpaired_reads = 1;
break;
- case 'u':
- global_context->config.report_multi_mapping_reads = 0;
- global_context->config.use_hamming_distance_break_ties = 1;
- break;
case 'b':
global_context->config.convert_color_to_base = 1;
break;
@@ -466,7 +463,11 @@ int parse_opts_aligner(int argc , char ** argv, global_context_t * global_contex
break;
case 0:
- if(strcmp("memoryMultiplex", long_options[option_index].name)==0)
+ if(strcmp("multiMapping", long_options[option_index].name)==0)
+ {
+ global_context->config.report_multi_mapping_reads = 1;
+ }
+ else if(strcmp("memoryMultiplex", long_options[option_index].name)==0)
{
global_context->config.memory_use_multiplex = atof(optarg);
}
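
The hunks above drop `-u`, make unique-only reporting the default (`report_multi_mapping_reads = 0` in `init_global_context`), and add a `--multiMapping` long option that is matched by name under `case 0`. A minimal sketch of that `getopt_long` pattern follows. The `demo_*` names are hypothetical. Treating `-B` as also switching multi-mapping on is an assumption read from the hunk at `@@ -317`, which appears to sit in the `-B` handler.

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for global_context->config. */
typedef struct {
    int report_multi_mapping_reads;
    int reported_multi_best_reads;
} demo_config_t;

/* Parse "-B <int>" and "--multiMapping"; returns 0 on success, -1 on error. */
static int parse_demo_opts(int argc, char **argv, demo_config_t *cfg)
{
    static struct option demo_long_options[] = {
        {"multiMapping", no_argument, 0, 0},
        {0, 0, 0, 0}
    };
    int c, option_index = 0;

    assert(cfg != NULL);
    cfg->report_multi_mapping_reads = 0; /* new default: unique reads only */
    cfg->reported_multi_best_reads = 1;

    while ((c = getopt_long(argc, argv, "B:", demo_long_options,
                            &option_index)) != -1) {
        switch (c) {
        case 'B':
            cfg->reported_multi_best_reads = atoi(optarg);
            cfg->report_multi_mapping_reads = 1; /* assumption, see lead-in */
            break;
        case 0: /* a long option with val == 0: identify it by name */
            if (strcmp("multiMapping",
                       demo_long_options[option_index].name) == 0)
                cfg->report_multi_mapping_reads = 1;
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

Because every long option in the table shares `val == 0`, the name comparison is what distinguishes them, which is why the patch adds `multiMapping` at the head of the existing `strcmp` chain.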
diff --git a/src/core-interface-subjunc.c b/src/core-interface-subjunc.c
index d64a104..c4b66e1 100644
--- a/src/core-interface-subjunc.c
+++ b/src/core-interface-subjunc.c
@@ -66,6 +66,7 @@ static struct option long_options[] =
{"minVoteCutoff", required_argument, 0, 0},
{"minMappedFraction", required_argument, 0, 0},
{"complexIndels", no_argument, 0, 0},
+ {"multiMapping", no_argument, 0, 0},
{0, 0, 0, 0}
};
@@ -81,10 +82,11 @@ void print_usage_core_subjunc()
SUBREADputs("");
SUBREADputs(" -i <index> Base name of the index.");
SUBREADputs("");
- SUBREADputs(" -r <string> Name of the input file. Input formats including gzipped");
- SUBREADputs(" fastq, fastq, and fasta can be automatically detected. If");
- SUBREADputs(" paired-end, this should give the name of file including");
- SUBREADputs(" first reads.");
+ SUBREADputs(" -r <string> Name of an input read file. If paired-end, this should be");
+ SUBREADputs(" the first read file (typically containing \"R1\" in the file");
+ SUBREADputs(" name) and the second should be provided via \"-R\".");
+ SUBREADputs(" Acceptable formats include gzipped FASTQ, FASTQ and FASTA.");
+ SUBREADputs(" These formats are identified automatically.");
SUBREADputs("");
SUBREADputs("## Optional arguments:");
SUBREADputs("# input reads and output");
@@ -94,15 +96,15 @@ void print_usage_core_subjunc()
SUBREADputs(" STDOUT.");
SUBREADputs("");
SUBREADputs(" -R <string> Name of the second read file in paired-end data (typically");
- SUBREADputs(" containing ‘R2’ in the file name).");
+ SUBREADputs(" containing \"R2\" in the file name).");
SUBREADputs("");
SUBREADputs(" --SAMinput Input reads are in SAM format.");
SUBREADputs("");
SUBREADputs(" --BAMinput Input reads are in BAM format.");
SUBREADputs("");
- SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
+ SUBREADputs(" --SAMoutput Save mapping results in SAM format.");
SUBREADputs("");
- SUBREADputs("# offset value added to Phred quality scores of read bases");
+ SUBREADputs("# Phred offset");
SUBREADputs("");
SUBREADputs(" -P <3:6> Offset value added to the Phred quality score of each read");
SUBREADputs(" base. '3' for phred+33 and '6' for phred+64. '3' by default.");
@@ -126,14 +128,13 @@ void print_usage_core_subjunc()
SUBREADputs("");
SUBREADputs("# unique mapping and multi-mapping");
SUBREADputs("");
- SUBREADputs(" -u Report uniquely mapped reads only. Number of matched bases (");
- SUBREADputs(" for RNA-seq) or mis-matched bases(for genomic DNA-seq) is");
- SUBREADputs(" used to break the tie when multiple mapping locations are");
- SUBREADputs(" found.");
+ SUBREADputs(" --multiMapping Report multi-mapping reads in addition to uniquely mapped");
+ SUBREADputs(" reads. Use \"-B\" to set the maximum number of equally-best");
+ SUBREADputs(" alignments to be reported.");
SUBREADputs("");
- SUBREADputs(" -B <int> Maximal number of equally-best mapping locations to be");
- SUBREADputs(" reported. 1 by default. Note that -u option takes precedence");
- SUBREADputs(" over -B.");
+ SUBREADputs(" -B <int> Maximum number of equally-best alignments to be reported for");
+ SUBREADputs(" a multi-mapping read. Equally-best alignments have the same");
+ SUBREADputs(" number of mis-matched bases. 1 by default.");
SUBREADputs("");
SUBREADputs("# indel detection");
SUBREADputs("");
@@ -353,10 +354,6 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
case 'U':
global_context->config.report_no_unpaired_reads = 1;
break;
- case 'u':
- global_context->config.report_multi_mapping_reads = 0;
- global_context->config.use_hamming_distance_break_ties = 1;
- break;
case 'b':
global_context->config.convert_color_to_base = 1;
break;
@@ -475,7 +472,11 @@ int parse_opts_subjunc(int argc , char ** argv, global_context_t * global_contex
break;
case 0:
- if(strcmp("memoryMultiplex", long_options[option_index].name)==0)
+ if(strcmp("multiMapping", long_options[option_index].name)==0)
+ {
+ global_context->config.report_multi_mapping_reads = 1;
+ }
+ else if(strcmp("memoryMultiplex", long_options[option_index].name)==0)
{
global_context->config.memory_use_multiplex = atof(optarg);
}
diff --git a/src/core-junction.c b/src/core-junction.c
index 7825dbf..ff6a3d0 100644
--- a/src/core-junction.c
+++ b/src/core-junction.c
@@ -202,7 +202,7 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
SUBREADprintf("F_JUMP? match=%d / tested=%d\n", matched_bases_to_site , tested_read_pos);
//#warning "========= remove - 2000 from next line ============="
- if(tested_read_pos >0 && ( matched_bases_to_site*10000/tested_read_pos > 9000 - 2000 || global_context->config.maximise_sensitivity_indel) )
+ if(explain_context -> total_tries < REALIGN_TOTAL_TRIES && tested_read_pos >0 && ( matched_bases_to_site*10000/tested_read_pos > 9000 - 2000 || global_context->config.maximise_sensitivity_indel) )
for(xk1 = 0; xk1 < site_events_no ; xk1++)
{
chromosome_event_t * tested_event = site_events[xk1];
@@ -275,16 +275,10 @@ void search_events_to_front(global_context_t * global_context, thread_context_t
explain_context -> tmp_search_sections ++;
- if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
- SUBREADprintf("FRONT_ADD_EVENT : %s , %u ~ %u , INDELLEN=%d, TEST_READ_POS=%u, RPED=%u, ABSSTART=%u\n", explain_context -> read_name, tested_event -> event_small_side, tested_event -> event_large_side, tested_event -> indel_length, tested_read_pos, explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end, new_read_head_abs_offset);
-
- //if(explain_context -> pair_number == 23){
- //printf("JUMP_IN: %u ; STRAND=%c ; REMENDER=%d ; 0=%d 0=%d\n", new_read_head_abs_offset, tested_event -> is_strand_jumped?'X':'=', new_remainder_len, tested_event -> indel_length, tested_event -> indel_at_junction);
- //}
+ if(0 && FIXLENstrcmp("R000404427", explain_context -> read_name) == 0)
+ SUBREADprintf("FRONT_ADD_EVENT : %s , %u ~ %u , INDELLEN=%d, TEST_READ_POS=%u, RPED=%u, ABSSTART=%u\n", explain_context -> read_name, tested_event -> event_small_side, tested_event -> event_large_side, tested_event -> indel_length, tested_read_pos, explain_context -> tmp_search_junctions[explain_context -> tmp_search_sections + 1].read_pos_end, new_read_head_abs_offset);
- // #warning "SUBREAD_151 REMOVE THIS ASSERTION! "
- // assert(new_remainder_len < 102);
- //printf("SUGGEST_NEXT = %d (! %d)\n", tested_event -> connected_next_event_distance, tested_event -> connected_previous_event_distance);
+ explain_context -> total_tries ++;
search_events_to_front(global_context, thread_context, explain_context, read_text + tested_event -> indel_at_junction + tested_read_pos - min(0, tested_event->indel_length), qual_text + tested_read_pos - min(0, tested_event->indel_length), new_read_head_abs_offset, new_remainder_len, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_next_event_distance, 0);
explain_context -> tmp_search_sections --;
@@ -536,7 +530,7 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
//#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
//printf("OCT27-STEPSB-JB-%s: test %u = %d events; TEST=%d > 7000 : MA=%d; %s ; %u = %u - (%d - %d) ; LEV=%d\n", explain_context -> read_name, potential_event_pos, site_events_no, (read_tail_pos<=tested_read_pos)?(-1234):( matched_bases_to_site*10000/(read_tail_pos - tested_read_pos)) , matched_bases_to_site, read_text + tested_read_pos, potential_event_pos, read_tail_abs_offset, read_tail_pos, tested_read_pos, explain_context -> tmp_search_sections);
//#warning "========= remove - 2000 from next line ============="
- if((read_tail_pos>tested_read_pos) && ( matched_bases_to_site*10000/(read_tail_pos - tested_read_pos) > 9000 - 2000 || global_context->config.maximise_sensitivity_indel) )
+ if(explain_context -> total_tries < REALIGN_TOTAL_TRIES && (read_tail_pos>tested_read_pos) && ( matched_bases_to_site*10000/(read_tail_pos - tested_read_pos) > 9000 - 2000 || global_context->config.maximise_sensitivity_indel) )
for(xk1 = 0; xk1 < site_events_no ; xk1++)
{
chromosome_event_t * tested_event = site_events[xk1];
@@ -617,6 +611,7 @@ void search_events_to_back(global_context_t * global_context, thread_context_t *
//#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
//printf("OCT27-STEPSB-JB-%s: %u IN -> %u; NEW_TAIL=%d; ENV_CONN=%d; LEV=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, new_read_tail_pos, tested_event -> connected_previous_event_distance, explain_context -> tmp_search_sections);
+ explain_context -> total_tries ++;
search_events_to_back(global_context, thread_context, explain_context, read_text , qual_text, new_read_tail_abs_offset , new_read_tail_pos, sofar_matched + matched_bases_to_site - jump_penalty, tested_event -> connected_previous_event_distance, 0);
//#warning ">>>>>>>>>>>>>>>> REMOVE IT <<<<<<<<<<<<<<<<<<<<<<"
//printf("OCT27-STEPSB-JB-%s: %u OUT <- %u; LEN=%d\n", explain_context -> read_name, potential_event_pos, new_read_tail_abs_offset, explain_context -> tmp_search_sections);
@@ -2542,6 +2537,7 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
explain_context.pair_number = pair_number;
explain_context.is_second_read = is_second_read ;
explain_context.best_read_id = best_read_id;
+ explain_context.total_tries = 0;
unsigned int back_search_tail_position,front_search_start_position;
unsigned short back_search_read_tail, front_search_read_start;
@@ -2582,9 +2578,11 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
front_search_start_position = current_result -> selected_position + front_search_read_start;
}
- if(0 && FIXLENstrcmp( explain_context.read_name, "R000002689")==0)
- {
- SUBREADprintf("EXPLAIN_READ_%d %s [%d]: POS=%u ;; BACK SEARCH TAILPOS=%u, READTAIL=%d ; INDEL_IN_CONF=%d ; READ_COV=%d~%d\n", 1+is_second_read, explain_context.read_name, best_read_id, current_result -> selected_position, back_search_tail_position, back_search_read_tail, current_result -> indels_in_confident_coverage, front_search_read_start, back_search_read_tail);
+ if(0 && FIXLENstrcmp("SRR3439488.572382", explain_context.read_name)==0) {
+ char * q_res_chro=NULL;
+ int q_res_offset=0;
+ locate_gene_position(current_result -> selected_position,&global_context -> chromosome_table, &q_res_chro, &q_res_offset);
+ SUBREADprintf("EXPLAIN_READ_%d %s [%d]: POS=%u (%s:%d);; BACK SEARCH TAILPOS=%u, READTAIL=%d ; INDEL_IN_CONF=%d ; READ_COV=%d~%d\n", 1+is_second_read, explain_context.read_name, best_read_id, current_result -> selected_position,q_res_chro,q_res_offset, back_search_tail_position, back_search_read_tail, current_result -> indels_in_confident_coverage, front_search_read_start, back_search_read_tail);
}
search_events_to_back(global_context, thread_context, &explain_context, read_text , qual_text, back_search_tail_position , back_search_read_tail, 0, 0, 1);
@@ -2683,6 +2681,9 @@ unsigned int explain_read(global_context_t * global_context, thread_context_t *
else*/
int realignment_number = finalise_explain_CIGAR(global_context, thread_context, &explain_context, final_realignments);
+ if(0 && FIXLENstrcmp("SRR3439488.572382", explain_context.read_name)==0)
+ SUBREADprintf("TRYING_REALIGN:%s:%u\n", explain_context.read_name, explain_context.total_tries);
+
return realignment_number;
}
diff --git a/src/core-junction.h b/src/core-junction.h
index 9bae3df..7a73937 100644
--- a/src/core-junction.h
+++ b/src/core-junction.h
@@ -23,6 +23,8 @@
#include "hashtable.h"
#include "core.h"
+#define REALIGN_TOTAL_TRIES 50
+
#define FUNKY_FRAGMENT_A 1 // same strand and gapped (0<gap<tra_len)
#define FUNKY_FRAGMENT_BC 2 // very far far away (>=tra_len) or chimeric.
#define FUNKY_FRAGMENT_DE 4 // tlen < tra_len and strand jumpped
@@ -67,6 +69,7 @@ typedef struct{
int all_back_alignments;
int all_front_alignments;
int known_junctions;
+ unsigned int total_tries;
// unsigned int tmp_jump_length;
// unsigned int best_jump_length;
diff --git a/src/core.c b/src/core.c
index 7d272fe..eeff7f9 100644
--- a/src/core.c
+++ b/src/core.c
@@ -343,56 +343,70 @@ void print_in_box(int line_width, int is_boundary, int options, char * pattern,.
int show_summary(global_context_t * global_context)
{
- if(progress_report_callback)
- {
- long long int all_reads_K = global_context -> all_processed_reads / 1000;
- float mapped_reads_percentage = global_context -> all_mapped_reads * 1./global_context -> all_processed_reads;
- if(global_context->input_reads.is_paired_end_reads) mapped_reads_percentage/=2;
- progress_report_callback(10, 900000, (int) (miltime()-global_context->start_time));
- progress_report_callback(10, 900010, (int) all_reads_K);
- progress_report_callback(10, 900011, (int) (10000.*mapped_reads_percentage));
- }
-
- print_in_box(80,0,1,"");
- print_in_box(89,0,1,"%c[36mCompleted successfully.%c[0m", CHAR_ESC, CHAR_ESC);
- print_in_box(80,0,1,"");
- print_in_box(80,2,1,"");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "");
- print_in_box(80, 1,1,"Summary");
- print_in_box(80, 0,1,"");
- print_in_box(80, 0,0," Processed : %'llu %s" , global_context -> all_processed_reads, global_context->input_reads.is_paired_end_reads?"fragments":"reads");
- print_in_box(81, 0,0," Mapped : %'llu %s (%.1f%%%%)", global_context -> all_mapped_reads, global_context->input_reads.is_paired_end_reads?"fragments":"reads" , global_context -> all_mapped_reads*100.0 / global_context -> all_processed_reads);
- if(global_context->input_reads.is_paired_end_reads)
- print_in_box(80, 0,0," Correctly paired : %'llu fragments", global_context -> all_correct_PE_reads);
-
- if(global_context->config.output_prefix[0])
- {
- if(global_context->config.entry_program_name == CORE_PROGRAM_SUBJUNC && ( global_context -> config.prefer_donor_receptor_junctions || !global_context -> config.do_fusion_detection))
- print_in_box(80, 0,0," Junctions : %'u", global_context -> all_junctions);
- if(global_context->config.do_fusion_detection)
- print_in_box(80, 0,0," Fusions : %'u", global_context -> all_fusions);
- print_in_box(80, 0,0," Indels : %'u", global_context -> all_indels);
- }
-
+ if(progress_report_callback)
+ {
+ long long int all_reads_K = global_context -> all_processed_reads / 1000;
+ float mapped_reads_percentage = global_context -> all_mapped_reads * 1./global_context -> all_processed_reads;
+ if(global_context->input_reads.is_paired_end_reads) mapped_reads_percentage/=2;
+ progress_report_callback(10, 900000, (int) (miltime()-global_context->start_time));
+ progress_report_callback(10, 900010, (int) all_reads_K);
+ progress_report_callback(10, 900011, (int) (10000.*mapped_reads_percentage));
+ }
+
+ print_in_box(80,0,1,"");
+ print_in_box(89,0,1,"%c[36mCompleted successfully.%c[0m", CHAR_ESC, CHAR_ESC);
+ print_in_box(80,0,1,"");
+ print_in_box(80,2,1,"");
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "");
+ print_in_box(80, 1,1,"Summary");
+ print_in_box(80, 0,1,"");
+ print_in_box(80, 0,0," Processed : %'llu %s" , global_context -> all_processed_reads, global_context->input_reads.is_paired_end_reads?"fragments":"reads");
+ print_in_box(81, 0,0," Mapped : %'u %s (%.1f%%%%), wherein", global_context -> all_mapped_reads, global_context->input_reads.is_paired_end_reads?"fragments":"reads" , global_context -> all_mapped_reads*100.0 / global_context -> all_processed_reads);
+ print_in_box(80, 0,0," Uniquely mapped : %'u", global_context -> all_uniquely_mapped_reads);
+ print_in_box(80, 0,0," Multi-mapping : %'u", global_context -> all_multimapping_reads);
+ print_in_box(80, 0,1,"");
+ print_in_box(80, 0,0," Not mapped : %'u", global_context -> all_unmapped_reads);
+ if(global_context->input_reads.is_paired_end_reads){
+ print_in_box(80, 0,1,"");
+ print_in_box(80, 0,0," Correctly paired : %'llu fragments", global_context -> all_correct_PE_reads);
+ print_in_box(80, 0,0,"Not mapped in pairs : %'llu fragments, wherein", global_context -> all_mapped_reads - global_context -> all_correct_PE_reads);
+ print_in_box(80, 0,0,"Only one end mapped : %'u fragments", global_context -> not_properly_pairs_only_one_end_mapped);
+ print_in_box(80, 0,0," Multi-chromosomes : %'u fragments", global_context -> not_properly_pairs_different_chro);
+ print_in_box(80, 0,0," Different strands : %'u fragments", global_context -> not_properly_different_strands);
+ print_in_box(80, 0,0," Not in PE distance : %'u fragments", global_context -> not_properly_pairs_TLEN_wrong);
+ print_in_box(80, 0,0," Abnormal order : %'u fragments", global_context -> not_properly_pairs_wrong_arrangement);
+ }
+
+ print_in_box(80, 0,1,"");
+
+ if(global_context->config.output_prefix[0])
+ {
+ if(global_context->config.entry_program_name == CORE_PROGRAM_SUBJUNC && ( global_context -> config.prefer_donor_receptor_junctions || !global_context -> config.do_fusion_detection))
+ print_in_box(80, 0,0," Junctions : %'u", global_context -> all_junctions);
+ if(global_context->config.do_fusion_detection)
+ print_in_box(80, 0,0," Fusions : %'u", global_context -> all_fusions);
+ print_in_box(80, 0,0," Indels : %'u", global_context -> all_indels);
+ }
+
if(global_context -> is_phred_warning)
{
print_in_box(80, 0,1,"");
- print_in_box(80,0,0, " WARNING : Phred offset (%d) incorrect?", global_context->config.phred_score_format == FASTQ_PHRED33?33:64);
+ print_in_box(80,0,0, " WARNING : Phred offset (%d) incorrect?", global_context->config.phred_score_format == FASTQ_PHRED33?33:64);
}
- print_in_box(80, 0,1,"");
- print_in_box(80, 0,0," Running time : %.1f minutes", (miltime()-global_context->start_time)*1./60);
+ print_in_box(80, 0,1,"");
+ print_in_box(80, 0,0," Running time : %.1f minutes", (miltime()-global_context->start_time)*1./60);
/*
- print_in_box(80, 0,0," Running time 0 : %.2f minutes", global_context->timecost_load_index/60);
- print_in_box(80, 0,0," Running time 1 : %.2f minutes", global_context->timecost_voting/60);
- print_in_box(80, 0,0," Running time 2 : %.2f minutes", global_context->timecost_before_realign/60);
- print_in_box(80, 0,0," Running time 3 : %.2f minutes", global_context->timecost_for_realign/60);
+ print_in_box(80, 0,0," Running time 0 : %.2f minutes", global_context->timecost_load_index/60);
+ print_in_box(80, 0,0," Running time 1 : %.2f minutes", global_context->timecost_voting/60);
+ print_in_box(80, 0,0," Running time 2 : %.2f minutes", global_context->timecost_before_realign/60);
+ print_in_box(80, 0,0," Running time 3 : %.2f minutes", global_context->timecost_for_realign/60);
*/
- print_in_box(80, 0,1,"");
- print_in_box(80, 2,1,"http://subread.sourceforge.net/");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "");
+ print_in_box(80, 0,1,"");
+ print_in_box(80, 2,1,"http://subread.sourceforge.net/");
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_INFO, "");
- return 0;
+ return 0;
}
@@ -1489,9 +1503,9 @@ int getFirstM(char * cig){
int calc_tlen(global_context_t * global_context, subread_output_tmp_t * rec1 , subread_output_tmp_t * rec2, int read_len_1, int read_len_2);
-int calc_flags(global_context_t * global_context, subread_output_tmp_t * rec1 , subread_output_tmp_t * rec2, int is_second_read, int read_len_1, int read_len_2, int current_location_no , int tlen, int this_OK, int mate_OK)
+int calc_flags(global_context_t * global_context, thread_context_t * thread_context, subread_output_tmp_t * rec1 , subread_output_tmp_t * rec2, int is_second_read, int read_len_1, int read_len_2, int current_location_no , int tlen, int this_OK, int mate_OK)
{
- int ret;
+ int ret, is_TLEN_wrong = 0;
if(global_context->input_reads.is_paired_end_reads)
{
@@ -1513,29 +1527,45 @@ int calc_flags(global_context_t * global_context, subread_output_tmp_t * rec1 ,
{
int TLEN = tlen;//calc_tlen(global_context , rec1, rec2, read_len_1, read_len_2);
int is_PEM = 0;
- if(TLEN >= global_context->config. minimum_pair_distance && TLEN <= global_context-> config.maximum_pair_distance && this_rec->strand == mate_rec->strand)
+ if( rec1 -> chro == rec2 -> chro /* two pointers can be directly compared. */ && TLEN >= global_context->config.minimum_pair_distance &&
+ TLEN <= global_context-> config.maximum_pair_distance && this_rec->strand == mate_rec->strand )
{
if(global_context -> config.is_first_read_reversed && !(global_context -> config.is_second_read_reversed))
{
- if(this_rec -> strand == 0)
- {
+ if(this_rec -> strand == 0) {
if((is_second_read + (mate_rec-> offset > this_rec -> offset) == 1) || mate_rec-> offset == this_rec -> offset)
is_PEM = 1;
+ else is_TLEN_wrong = 1;
}
}
else
{
- if(this_rec -> strand)
- {
+ if(this_rec -> strand) {
if((is_second_read + (mate_rec-> offset < this_rec -> offset) == 1) || mate_rec-> offset == this_rec -> offset) is_PEM = 1;
- }else
- {
+ else is_TLEN_wrong = 1;
+ }else {
if((is_second_read + (mate_rec-> offset > this_rec -> offset) == 1) || mate_rec-> offset == this_rec -> offset) is_PEM = 1;
-
+ else is_TLEN_wrong = 1;
}
}
}
if(is_PEM) ret |= SAM_FLAG_MATCHED_IN_PAIR;
+ else if(is_second_read){
+ if(rec1 -> chro != rec2 -> chro){
+ if(thread_context) thread_context -> not_properly_pairs_different_chro ++;
+ else global_context -> not_properly_pairs_different_chro++;
+ }else if(this_rec->strand != mate_rec->strand){
+ if(thread_context) thread_context -> not_properly_different_strands ++;
+ else global_context -> not_properly_different_strands ++;
+ }else if(TLEN < global_context->config.minimum_pair_distance || TLEN > global_context-> config.maximum_pair_distance){
+ if(thread_context) thread_context -> not_properly_pairs_TLEN_wrong ++;
+ else global_context -> not_properly_pairs_TLEN_wrong ++;
+ }else if(is_TLEN_wrong){
+ if(thread_context) thread_context -> not_properly_pairs_wrong_arrangement ++;
+ else global_context -> not_properly_pairs_wrong_arrangement ++;
+ }
+
+ }
}
}
else
@@ -1902,13 +1932,13 @@ void write_single_fragment(global_context_t * global_context, thread_context_t *
}
}
- int flag1 = calc_flags( global_context , rec1, rec2, 0, read_len_1, read_len_2, current_location, tlen, is_R1_OK, is_R2_OK);
+ int flag1 = calc_flags( global_context, thread_context , rec1, rec2, 0, read_len_1, read_len_2, current_location, tlen, is_R1_OK, is_R2_OK);
int flag2 = -1;
if(global_context->input_reads.is_paired_end_reads)
{
- flag2 = calc_flags( global_context , rec1, rec2, 1, read_len_1, read_len_2, current_location, tlen, is_R2_OK, is_R1_OK);
+ flag2 = calc_flags( global_context , thread_context, rec1, rec2, 1, read_len_1, read_len_2, current_location, tlen, is_R2_OK, is_R1_OK);
if((0 == current_location) && (flag2 & SAM_FLAG_MATCHED_IN_PAIR)){
if(thread_context)thread_context->all_correct_PE_reads ++;
else global_context->all_correct_PE_reads ++;
@@ -2443,6 +2473,7 @@ unsigned int calc_end_pos(unsigned int p, char * cigar, unsigned int * all_skipp
void test_PE_and_same_chro_align(global_context_t * global_context , realignment_result_t * res1, realignment_result_t * res2, int * is_exonic_regions, int * is_PE_distance, int * is_same_chromosome, int read_len_1, int read_len_2, char * rname);
void write_realignments_for_fragment(global_context_t * global_context, thread_context_t * thread_context, subread_output_context_t * out_context, unsigned int read_number, realignment_result_t * res1, realignment_result_t * res2, char * read_name_1, char * read_name_2, char * read_text_1, char * read_text_2, char * qual_text_1, char * qual_text_2 , int rlen1 , int rlen2, int multi_mapping_number, int this_multi_mapping_i, int non_informative_subreads_r1, int non_informative_subreads_r2){
+
int is_2_OK = 0, is_1_OK = 0;
if(res1){
@@ -2470,6 +2501,34 @@ void write_realignments_for_fragment(global_context_t * global_context, thread_c
r2_output = out_context -> r2;
}
+ if(this_multi_mapping_i < 1){
+ if( is_1_OK == 0 && is_2_OK == 0)
+ if(thread_context) thread_context -> all_unmapped_reads ++;
+ else global_context -> all_unmapped_reads ++;
+ else if( is_1_OK == 0 || is_2_OK == 0){
+ if(thread_context) thread_context -> not_properly_pairs_only_one_end_mapped ++;
+ else global_context -> not_properly_pairs_only_one_end_mapped ++;
+
+ if((is_1_OK && (res1 -> realign_flags & CORE_IS_BREAKEVEN )) ||
+ (is_2_OK && (res2 -> realign_flags & CORE_IS_BREAKEVEN))) {
+ if(thread_context) thread_context -> all_multimapping_reads ++;
+ else global_context -> all_multimapping_reads ++;
+ }else{
+ if(thread_context) thread_context -> all_uniquely_mapped_reads ++;
+ else global_context -> all_uniquely_mapped_reads ++;
+ }
+ }else{
+ if(res1 -> realign_flags & CORE_IS_BREAKEVEN){
+ if(thread_context) thread_context -> all_multimapping_reads ++;
+ else global_context -> all_multimapping_reads ++;
+ }else{
+ if(thread_context) thread_context -> all_uniquely_mapped_reads ++;
+ else global_context -> all_uniquely_mapped_reads ++;
+ }
+ }
+ }
+
+
if((!global_context->config.ignore_unmapped_reads) || (is_2_OK || is_1_OK))
write_single_fragment(global_context, thread_context, r1_output, res1, r2_output, res2, multi_mapping_number , this_multi_mapping_i , read_name_1, read_name_2, rlen1, rlen2, read_text_1, read_text_2, qual_text_1, qual_text_2, read_number, non_informative_subreads_r1, non_informative_subreads_r2, is_1_OK, is_2_OK);
}
@@ -3442,10 +3501,6 @@ void * run_in_thread(void * pthread_param)
return NULL;
}
-void finalise_buffered_output_file(global_context_t *global_context){
-// merge_buffered_output_file(global_context, 0 , 0);
-}
-
int run_maybe_threads(global_context_t *global_context, int task)
{
void * thr_parameters [5];
@@ -3473,12 +3528,8 @@ int run_maybe_threads(global_context_t *global_context, int task)
memset(thread_contexts, 0, sizeof(thread_context_t)*64);
global_context -> all_thread_contexts = thread_contexts;
- if(task == STEP_ITERATION_TWO){
+ if(task == STEP_ITERATION_TWO)
global_context -> last_written_fragment_number = 0;
- for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
- thread_contexts[current_thread_no].all_mapped_reads = 0;
- thread_contexts[current_thread_no].all_correct_PE_reads = 0;
- }
for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
{
@@ -3501,20 +3552,25 @@ int run_maybe_threads(global_context_t *global_context, int task)
{
pthread_join(thread_contexts[current_thread_no].thread, NULL);
- if(STEP_ITERATION_TWO == task) global_context -> all_correct_PE_reads += thread_contexts[current_thread_no].all_correct_PE_reads;
+ if(STEP_ITERATION_TWO == task){
+ global_context -> all_mapped_reads += thread_contexts[current_thread_no].all_mapped_reads;
+ global_context -> all_correct_PE_reads += thread_contexts[current_thread_no].all_correct_PE_reads;
+ global_context -> not_properly_pairs_wrong_arrangement += thread_contexts[current_thread_no].not_properly_pairs_wrong_arrangement;
+ global_context -> not_properly_pairs_different_chro += thread_contexts[current_thread_no].not_properly_pairs_different_chro;
+ global_context -> not_properly_different_strands += thread_contexts[current_thread_no].not_properly_different_strands;
+ global_context -> not_properly_pairs_TLEN_wrong += thread_contexts[current_thread_no].not_properly_pairs_TLEN_wrong;
+ global_context -> all_unmapped_reads += thread_contexts[current_thread_no].all_unmapped_reads;
+ global_context -> not_properly_pairs_only_one_end_mapped += thread_contexts[current_thread_no].not_properly_pairs_only_one_end_mapped;
+ global_context -> all_multimapping_reads += thread_contexts[current_thread_no].all_multimapping_reads;
+ global_context -> all_uniquely_mapped_reads += thread_contexts[current_thread_no].all_uniquely_mapped_reads;
+ }
ret_value += *(ret_values + current_thread_no);
if(ret_value)break;
}
- if(STEP_ITERATION_TWO == task){
- finalise_buffered_output_file(global_context);
- }
- for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++)
- {
+ for(current_thread_no = 0 ; current_thread_no < global_context->config.all_threads ; current_thread_no ++){
if(thread_contexts[current_thread_no].output_buffer_item > 0)
SUBREADprintf("ERROR: UNFINISHED OUTPUT!\n");
- thread_context_t * thread_context = thread_contexts+current_thread_no;
- global_context -> all_mapped_reads += thread_context -> all_mapped_reads;
}
// sort and merge events from all threads and the global event space.
@@ -3750,69 +3806,69 @@ void print_subread_logo()
int print_configuration(global_context_t * context)
{
setlocale(LC_NUMERIC, "");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
- print_subread_logo();
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
- print_in_box(80, 1, 1, context->config.entry_program_name == CORE_PROGRAM_SUBJUNC?"subjunc setting":"subread-align setting");
- print_in_box(80, 0, 1, "");
-
- if(context->config.do_breakpoint_detection)
- {
- if(context->config.do_fusion_detection)
- {
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
+ print_subread_logo();
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
+ print_in_box(80, 1, 1, context->config.entry_program_name == CORE_PROGRAM_SUBJUNC?"subjunc setting":"subread-align setting");
+ print_in_box(80, 0, 1, "");
+
+ if(context->config.do_breakpoint_detection)
+ {
+ if(context->config.do_fusion_detection)
+ {
print_in_box(80, 0, 0, "Function : Read alignment + Junction/Fusion detection%s", context->config.experiment_type == CORE_EXPERIMENT_DNASEQ?" (DNA-Seq)":" (RNA-Seq)");
- }
- else
+ }
+ else
print_in_box(80, 0, 0, "Function : Read alignment + Junction detection (%s)", context->config.experiment_type == CORE_EXPERIMENT_DNASEQ?"DNA-Seq":"RNA-Seq");
- }
- else
- print_in_box(80, 0, 0, "Function : Read alignment%s", context->config.experiment_type == CORE_EXPERIMENT_DNASEQ?" (DNA-Seq)":" (RNA-Seq)");
- if( context->config.second_read_file[0])
- {
- print_in_box(80, 0, 0, "Input file 1 : %s", context->config.first_read_file);
- print_in_box(80, 0, 0, "Input file 2 : %s", context->config.second_read_file);
- }
- else
- print_in_box(80, 0, 0, "Input file : %s%s", context->config.first_read_file, context->config.is_SAM_file_input?(context->config.is_BAM_input?" (BAM)":" (SAM)"):"");
-
- if(context->config.output_prefix [0])
- print_in_box(80, 0, 0, "Output file : %s (%s)", context->config.output_prefix, context->config.is_BAM_output?"BAM":"SAM");
- else
- print_in_box(80, 0, 0, "Output method : STDOUT (%s)" , context->config.is_BAM_output?"BAM":"SAM");
-
- print_in_box(80, 0, 0, "Index name : %s", context->config.index_prefix);
+ }
+ else
+ print_in_box(80, 0, 0, "Function : Read alignment%s", context->config.experiment_type == CORE_EXPERIMENT_DNASEQ?" (DNA-Seq)":" (RNA-Seq)");
+ if( context->config.second_read_file[0])
+ {
+ print_in_box(80, 0, 0, "Input file 1 : %s", context->config.first_read_file);
+ print_in_box(80, 0, 0, "Input file 2 : %s", context->config.second_read_file);
+ }
+ else
+ print_in_box(80, 0, 0, "Input file : %s%s", context->config.first_read_file, context->config.is_SAM_file_input?(context->config.is_BAM_input?" (BAM)":" (SAM)"):"");
+
+ if(context->config.output_prefix [0])
+ print_in_box(80, 0, 0, "Output file : %s (%s)", context->config.output_prefix, context->config.is_BAM_output?"BAM":"SAM");
+ else
+ print_in_box(80, 0, 0, "Output method : STDOUT (%s)" , context->config.is_BAM_output?"BAM":"SAM");
+
+ print_in_box(80, 0, 0, "Index name : %s", context->config.index_prefix);
if(context->config.exon_annotation_file[0])
print_in_box(80, 0, 0, "Annotations : %s (%s)", context->config.exon_annotation_file, context->config.exon_annotation_file_type==FILE_TYPE_GTF?"GTF":"SAF");
print_in_box(80, 0, 0, "");
print_in_box(80, 0, 1, "------------------------------------");
print_in_box(80, 0, 0, "");
- print_in_box(80, 0, 0, " Threads : %d", context->config.all_threads);
- print_in_box(80, 0, 0, " Phred offset : %d", (context->config.phred_score_format == FASTQ_PHRED33)?33:64);
- if( context->config.second_read_file[0])
- {
+ print_in_box(80, 0, 0, " Threads : %d", context->config.all_threads);
+ print_in_box(80, 0, 0, " Phred offset : %d", (context->config.phred_score_format == FASTQ_PHRED33)?33:64);
+ if( context->config.second_read_file[0])
+ {
print_in_box(80, 0, 0, " # of extracted subreads : %d", context->config.total_subreads);
print_in_box(80, 0, 0, " Min read1 vote : %d", context->config.minimum_subread_for_first_read);
print_in_box(80, 0, 0, " Min read2 vote : %d", context->config.minimum_subread_for_second_read);
print_in_box(80, 0, 0, " Max fragment size : %d", context->config.maximum_pair_distance);
print_in_box(80, 0, 0, " Min fragment size : %d", context->config.minimum_pair_distance);
- }
- else
- print_in_box(80, 0, 0, " Min votes : %d / %d", context->config.minimum_subread_for_first_read, context->config.total_subreads);
+ }
+ else
+ print_in_box(80, 0, 0, " Min votes : %d / %d", context->config.minimum_subread_for_first_read, context->config.total_subreads);
print_in_box(80, 0, 0, " Maximum allowed mismatches : %d", context->config.max_mismatch_exonic_reads);
- print_in_box(80, 0, 0, " Maximum allowed indel bases : %d", context->config.max_indel_length);
- print_in_box(80, 0, 0, "# of best alignments reported : %d", context->config.multi_best_reads);
- print_in_box(80, 0, 0, " Unique mapping : %s", context->config.report_multi_mapping_reads?"no":"yes");
+ print_in_box(80, 0, 0, " Maximum allowed indel bases : %d", context->config.max_indel_length);
+ print_in_box(80, 0, 0, "# of best alignments reported : %d", context->config.multi_best_reads);
+ print_in_box(80, 0, 0, " Unique mapping : %s", context->config.report_multi_mapping_reads?"no":"yes");
- if(context->config.max_insertion_at_junctions)
- print_in_box(80, 0, 0, " Insertions at junc : %d", context->config.max_insertion_at_junctions);
+ if(context->config.max_insertion_at_junctions)
+ print_in_box(80, 0, 0, " Insertions at junc : %d", context->config.max_insertion_at_junctions);
- if(context->config.read_group_id[0])
+ if(context->config.read_group_id[0])
print_in_box(80, 0, 0, " Read group name : %s", context->config.read_group_id);
- print_in_box(80, 0, 1, "");
- print_in_box(80, 2, 1, "http://subread.sourceforge.net/");
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
+ print_in_box(80, 0, 1, "");
+ print_in_box(80, 2, 1, "http://subread.sourceforge.net/");
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"");
if(!context->config.experiment_type){
@@ -3820,31 +3876,31 @@ int print_configuration(global_context_t * context)
return -1;
}
- if(!context->config.first_read_file[0])
- {
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify at least one input file in the FASTQ/FASTA/PLAIN format using the '-r' option.\n");
- return -1;
- }
+ if(!context->config.first_read_file[0])
+ {
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify at least one input file in the FASTQ/FASTA/PLAIN format using the '-r' option.\n");
+ return -1;
+ }
- if(0 && !context->config.output_prefix[0])
- {
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify the path of output using the '-o' option.\n");
- return -1;
- }
+ if(0 && !context->config.output_prefix[0])
+ {
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify the path of output using the '-o' option.\n");
+ return -1;
+ }
- if(!context->config.index_prefix[0])
- {
- sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify the prefix of the index files using the '-i' option.\n");
- return -1;
- }
- char tbuf[90];
- char_strftime(tbuf);
+ if(!context->config.index_prefix[0])
+ {
+ sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"You have to specify the prefix of the index files using the '-i' option.\n");
+ return -1;
+ }
+ char tbuf[90];
+ char_strftime(tbuf);
- print_in_box(80,1,1,"Running (%s, pid=%d)", tbuf, getpid());
- print_in_box(80,0,1,"");
+ print_in_box(80,1,1,"Running (%s, pid=%d)", tbuf, getpid());
+ print_in_box(80,0,1,"");
- return 0;
+ return 0;
}
@@ -3935,17 +3991,18 @@ char * get_sam_chro_name_from_alias(HashTable * tab, char * anno_chro){
return NULL;
}
-int do_anno_bitmap_add_feature(char * gene_name, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
+int do_anno_bitmap_add_feature(char * gene_name, char * tracnscript_id, char * chro_name, unsigned int feature_start, unsigned int feature_end, int is_negative_strand, void * context){
global_context_t * global_context = context;
-
- char tmp_chro_name[MAX_CHROMOSOME_NAME_LEN];
if(global_context -> sam_chro_to_anno_chr_alias){
char * sam_chro = get_sam_chro_name_from_alias(global_context -> sam_chro_to_anno_chr_alias, chro_name);
if(sam_chro!=NULL) chro_name = sam_chro;
}
- int access_n = HashTableGet( global_context -> chromosome_table.read_name_to_index, chro_name ) - NULL;
+ if(!HashTableGet(global_context -> annotation_chro_table,chro_name))
+ HashTablePut(global_context -> annotation_chro_table, memstrcpy( chro_name ), NULL+1);
+ char tmp_chro_name[MAX_CHROMOSOME_NAME_LEN];
+ int access_n = HashTableGet( global_context -> chromosome_table.read_name_to_index, chro_name ) - NULL;
if(access_n < 1){
if(chro_name[0]=='c' && chro_name[1]=='h' && chro_name[2]=='r'){
chro_name += 3;
@@ -3958,8 +4015,9 @@ int do_anno_bitmap_add_feature(char * gene_name, char * chro_name, unsigned int
unsigned int exonic_map_start = linear_gene_position(&global_context->chromosome_table , chro_name, feature_start);
unsigned int exonic_map_stop = linear_gene_position(&global_context->chromosome_table , chro_name, feature_end);
-
- if(exonic_map_start > 0xffffff00 || exonic_map_stop > 0xffffff00) return -1;
+ if(exonic_map_start > 0xffffff00 || exonic_map_stop > 0xffffff00){
+ return -1;
+ }
exonic_map_start -= exonic_map_start%EXONIC_REGION_RESOLUTION;
exonic_map_stop -= exonic_map_stop%EXONIC_REGION_RESOLUTION;
@@ -3973,17 +4031,28 @@ int do_anno_bitmap_add_feature(char * gene_name, char * chro_name, unsigned int
return 0;
}
+void warning_anno_vs_index(HashTable * anno_chros_tab, gene_offset_t * index_chros_offset){
+ HashTable * index_chros_tab = index_chros_offset -> read_name_to_index;
+ warning_hash_hash(anno_chros_tab, index_chros_tab, "Chromosomes/contigs in annotation but not in index :");
+ warning_hash_hash(index_chros_tab, anno_chros_tab, "Chromosomes/contigs in index but not in annotation :");
+}
+
+
int load_annotated_exon_regions(global_context_t * global_context){
int bitmap_size = (4096 / EXONIC_REGION_RESOLUTION / 8)*1024*1024;
global_context ->exonic_region_bitmap = malloc(bitmap_size);
memset( global_context ->exonic_region_bitmap , 0, bitmap_size );
+ global_context -> annotation_chro_table = HashTableCreate(1003);
+ HashTableSetDeallocationFunctions( global_context -> annotation_chro_table, free, NULL);
+ HashTableSetKeyComparisonFunction( global_context -> annotation_chro_table, my_strcmp);
+ HashTableSetHashFunction( global_context -> annotation_chro_table, fc_chro_hash);
- int loaded_features = load_features_annotation(global_context->config.exon_annotation_file, global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, global_context->config.exon_annotation_feature_name_column, global_context, do_anno_bitmap_add_feature);
+ int loaded_features = load_features_annotation(global_context->config.exon_annotation_file, global_context->config.exon_annotation_file_type, global_context->config.exon_annotation_gene_id_column, NULL, global_context->config.exon_annotation_feature_name_column, global_context, do_anno_bitmap_add_feature);
if(loaded_features < 0)return -1;
- else{
- print_in_box(80,0,0,"%d annotation records were loaded.\n", loaded_features);
- return 0;
- }
+ else print_in_box(80,0,0,"%d annotation records were loaded.\n", loaded_features);
+ warning_anno_vs_index(global_context -> annotation_chro_table, &global_context -> chromosome_table);
+ HashTableDestroy(global_context -> annotation_chro_table);
+ return 0;
}
int load_global_context(global_context_t * context)
@@ -3992,6 +4061,10 @@ int load_global_context(global_context_t * context)
int min_phred_score = -1 , max_phred_score = -1;
context -> is_phred_warning = 0;
+ if(context->config.multi_best_reads>1 && ! context->config.report_multi_mapping_reads){
+ print_in_box(80,0,0,"WARNING: Multi-mapping reads are reported.");
+ context->config.report_multi_mapping_reads = 1;
+ }
subread_init_lock(&context->input_reads.input_lock);
if(core_geinput_open(context, &context->input_reads.first_read_file, 1,1))
@@ -4159,12 +4232,6 @@ int load_global_context(global_context_t * context)
sublog_printf(SUBLOG_STAGE_RELEASED, SUBLOG_LEVEL_ERROR,"Cannot initialise the voting space. You need at least 2GB of empty physical memory to run this program.\n");
return 1;
}
- context->all_processed_reads = 0;
- context->all_mapped_reads = 0;
- context->all_correct_PE_reads = 0;
- context->all_junctions = 0;
- context->all_fusions = 0;
- context->all_indels = 0;
sublog_printf(SUBLOG_STAGE_DEV1, SUBLOG_LEVEL_DEBUG, "load_global_context: finished");
memset( context->all_value_indexes , 0 , 100 * sizeof(gene_value_index_t));
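
Note on the calc_flags() change in src/core.c above: a fragment that is not flagged as properly paired is assigned to exactly one "not properly paired" counter, and the tests run in a fixed order (different chromosome, then different strands, then TLEN outside the configured range, then wrong arrangement). The patch counts only while handling the second read, so each fragment is counted once. The following is a minimal standalone sketch of that ordering, not the code from the patch: pair_class_t and classify_pair are hypothetical names, chromosomes are plain integer ids instead of the pointer comparison used in the patch, and arrangement_ok stands in for the orientation/offset test that sets is_TLEN_wrong.

```c
#include <assert.h>

/* Hypothetical categories mirroring the counters added in the patch. */
typedef enum {
    PAIR_PROPER,
    PAIR_DIFFERENT_CHRO,     /* not_properly_pairs_different_chro */
    PAIR_DIFFERENT_STRANDS,  /* not_properly_different_strands */
    PAIR_TLEN_WRONG,         /* not_properly_pairs_TLEN_wrong */
    PAIR_WRONG_ARRANGEMENT   /* not_properly_pairs_wrong_arrangement */
} pair_class_t;

/*
 * Classify one fragment whose two ends are both mapped.
 * chro1/chro2 are illustrative integer ids (assumption); arrangement_ok is
 * a simplified stand-in for the read-order test done in calc_flags().
 */
pair_class_t classify_pair(int chro1, int chro2, int strand1, int strand2,
                           int tlen, int min_dist, int max_dist,
                           int arrangement_ok)
{
    assert(min_dist <= max_dist);

    /* The first failing test decides the category, as in the patch. */
    if (chro1 != chro2)
        return PAIR_DIFFERENT_CHRO;
    if (strand1 != strand2)
        return PAIR_DIFFERENT_STRANDS;
    if (tlen < min_dist || tlen > max_dist)
        return PAIR_TLEN_WRONG;
    if (!arrangement_ok)
        return PAIR_WRONG_ARRANGEMENT;
    return PAIR_PROPER;
}
```

Because the first failing test wins, a fragment whose ends lie on different chromosomes is never also counted under "Different strands" or "Not in PE distance", which keeps the sub-counts in the summary box disjoint.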
diff --git a/src/core.h b/src/core.h
index acd0f3a..ed13fe1 100644
--- a/src/core.h
+++ b/src/core.h
@@ -459,8 +459,17 @@ typedef struct{
int output_buffer_item;
int output_buffer_pointer;
int is_finished;
- unsigned int all_mapped_reads;
subread_lock_t output_lock;
+
+ unsigned int all_mapped_reads;
+ unsigned int not_properly_pairs_wrong_arrangement;
+ unsigned int not_properly_pairs_different_chro;
+ unsigned int not_properly_different_strands;
+ unsigned int not_properly_pairs_TLEN_wrong;
+ unsigned int all_unmapped_reads;
+ unsigned int not_properly_pairs_only_one_end_mapped;
+ unsigned int all_multimapping_reads;
+ unsigned int all_uniquely_mapped_reads;
} thread_context_t;
@@ -516,12 +525,21 @@ typedef struct{
double timecost_for_realign;
unsigned long long all_processed_reads;
- unsigned long long all_mapped_reads;
unsigned long long all_correct_PE_reads;
unsigned int all_junctions;
unsigned int all_fusions;
unsigned int all_indels;
+ unsigned int all_mapped_reads;
+ unsigned int not_properly_pairs_wrong_arrangement;
+ unsigned int not_properly_pairs_different_chro;
+ unsigned int not_properly_different_strands;
+ unsigned int not_properly_pairs_TLEN_wrong;
+ unsigned int all_unmapped_reads;
+ unsigned int not_properly_pairs_only_one_end_mapped;
+ unsigned int all_multimapping_reads;
+ unsigned int all_uniquely_mapped_reads;
+
unsigned long long current_circle_start_abs_offset_file1;
gene_inputfile_position_t current_circle_start_position_file1;
gene_inputfile_position_t current_circle_start_position_file2;
@@ -547,6 +565,7 @@ typedef struct{
subread_read_number_t read_block_start;
char * exonic_region_bitmap;
HashTable * sam_chro_to_anno_chr_alias;
+ HashTable * annotation_chro_table;
} global_context_t;
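
Note on the src/core.h change above: the same set of counters is added to both thread_context_t and global_context_t. Each worker increments its own copy, and run_maybe_threads() adds the per-thread values into the global context after pthread_join(), so no lock is needed on the counters themselves. The sketch below illustrates only that merge step under simplifying assumptions: read_counters_t and merge_read_counters are hypothetical names, only three of the counters are shown, and thread creation and joining are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of the counters the patch adds to both contexts. */
typedef struct {
    unsigned int all_unmapped_reads;
    unsigned int all_multimapping_reads;
    unsigned int all_uniquely_mapped_reads;
} read_counters_t;

/*
 * Add every worker's private counters into the global totals.
 * Intended to run on the main thread after all workers have been joined,
 * so the per-thread values are final and no synchronisation is required.
 */
void merge_read_counters(read_counters_t *global,
                         const read_counters_t *per_thread,
                         size_t n_threads)
{
    assert(global != NULL);
    assert(per_thread != NULL || n_threads == 0);

    for (size_t i = 0; i < n_threads; i++) {
        global->all_unmapped_reads        += per_thread[i].all_unmapped_reads;
        global->all_multimapping_reads    += per_thread[i].all_multimapping_reads;
        global->all_uniquely_mapped_reads += per_thread[i].all_uniquely_mapped_reads;
    }
}
```

The merge adds to the existing global values rather than overwriting them, so the global totals are assumed to start from zero before the counting pass.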
diff --git a/src/hashtable.c b/src/hashtable.c
index 8f7ab59..10478ee 100644
--- a/src/hashtable.c
+++ b/src/hashtable.c
@@ -6,7 +6,7 @@
* Released to the public domain.
*
*--------------------------------------------------------------------------
- * $Id: hashtable.c,v 9999.14 2017/03/10 00:01:40 cvs Exp $
+ * $Id: hashtable.c,v 9999.17 2017/04/28 06:30:27 cvs Exp $
\*--------------------------------------------------------------------------*/
#include <stdio.h>
@@ -15,7 +15,12 @@
#include <assert.h>
#include <pthread.h>
#include "hashtable.h"
+#include "core.h"
+static int pointercmp(const void *pointer1, const void *pointer2);
+static unsigned long pointerHashFunction(const void *pointer);
+static int isProbablePrime(long number);
+static long calculateIdealNumOfBuckets(HashTable *hashTable);
ArrayList * ArrayListCreate(int init_capacity){
@@ -43,8 +48,12 @@ void * ArrayListGet(ArrayList * list, long n){
int ArrayListPush(ArrayList * list, void * new_elem){
if(list -> capacityOfElements <= list->numOfElements){
- list -> capacityOfElements *=1.3;
- list -> elementList=realloc(list -> elementList, list -> capacityOfElements);
+ if(list -> capacityOfElements *1.3 > list -> capacityOfElements + 10)
+ list -> capacityOfElements = list -> capacityOfElements *1.3;
+ else list -> capacityOfElements = list -> capacityOfElements + 10;
+
+ list -> elementList=realloc(list -> elementList, sizeof(void *) * list -> capacityOfElements);
+ assert(list -> elementList);
}
list->elementList[list->numOfElements++] = new_elem;
return list->numOfElements;
@@ -53,12 +62,53 @@ void ArrayListSetDeallocationFunction(ArrayList * list, void (*elem_deallocator
list -> elemDeallocator = elem_deallocator;
}
+int ArrayListSort_compare(void * sortdata0, int L, int R){
+ void ** sortdata = sortdata0;
+ ArrayList * list = sortdata[0];
+ int (*comp_elems)(void * L_elem, void * R_elem) = sortdata[1];
-static int pointercmp(const void *pointer1, const void *pointer2);
-static unsigned long pointerHashFunction(const void *pointer);
-static int isProbablePrime(long number);
-static long calculateIdealNumOfBuckets(HashTable *hashTable);
+ void * L_elem = list -> elementList[L];
+ void * R_elem = list -> elementList[R];
+ return comp_elems(L_elem, R_elem);
+}
+
+void ArrayListSort_exchange(void * sortdata0, int L, int R){
+ void ** sortdata = sortdata0;
+ ArrayList * list = sortdata[0];
+
+ void * tmpp = list -> elementList[L];
+ list -> elementList[L] = list -> elementList[R];
+ list -> elementList[R] = tmpp;
+}
+void ArrayListSort_merge(void * sortdata0, int start, int items, int items2){
+ void ** sortdata = sortdata0;
+ ArrayList * list = sortdata[0];
+ int (*comp_elems)(void * L_elem, void * R_elem) = sortdata[1];
+
+ void ** merged = malloc(sizeof(void *)*(items + items2));
+ int write_cursor, read1=start, read2=start+items;
+
+ for(write_cursor = 0; write_cursor < items + items2; write_cursor++){
+ void * Elm1 = list -> elementList[read1];
+ void * Elm2 = list -> elementList[read2];
+
+ int select_1 = (read1 == start + items)?0:( read2 == start + items + items2 || comp_elems(Elm1, Elm2) < 0 );
+ if(select_1) merged[write_cursor] = list -> elementList[read1++];
+ else merged[write_cursor] = list -> elementList[read2++];
+ }
+
+ memcpy(list -> elementList + start, merged, sizeof(void *) * (items + items2));
+ free(merged);
+}
+
+void ArrayListSort(ArrayList * list, int compare_L_minus_R(void * L_elem, void * R_elem)){
+ void * sortdata[2];
+ sortdata[0] = list;
+ sortdata[1] = compare_L_minus_R;
+
+ merge_sort(sortdata, list -> numOfElements, ArrayListSort_compare, ArrayListSort_exchange, ArrayListSort_merge);
+}
/*--------------------------------------------------------------------------*\
* NAME:
@@ -157,8 +207,10 @@ void HashTableDestroy(HashTable *hashTable) {
if (hashTable->keyDeallocator != NULL)
hashTable->keyDeallocator((void *) pair->key);
- if (hashTable->valueDeallocator != NULL)
- hashTable->valueDeallocator(pair->value);
+ if (hashTable->valueDeallocator != NULL){
+// fprintf(stderr,"FREE %p\n", pair->value);
+ hashTable->valueDeallocator(pair->value);
+ }
free(pair);
pair = nextPair;
}
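Note on the ArrayListPush hunk above: the old code passed an element count to realloc where a byte count is needed, and multiplying by 1.3 does not grow a very small capacity if the capacity field is an integer type (assumed here). A minimal standalone sketch of the corrected growth rule follows. `DemoList`, `demo_next_capacity` and `demo_push` are hypothetical names used only for illustration and are not part of Subread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the ArrayList fields touched by the hunk. */
typedef struct {
    void **elems;
    long count;
    long capacity;
} DemoList;

/* Grow by 30%, but always by at least 10 slots. */
static long demo_next_capacity(long capacity) {
    long scaled = (long)(capacity * 1.3);
    return scaled > capacity + 10 ? scaled : capacity + 10;
}

/* Append one element; returns the new count, or -1 if allocation fails. */
static long demo_push(DemoList *list, void *elem) {
    assert(list != NULL);
    if (list->capacity <= list->count) {
        long new_capacity = demo_next_capacity(list->capacity);
        /* realloc takes a size in bytes, so scale the slot count by the pointer size. */
        void **grown = realloc(list->elems, sizeof(void *) * (size_t)new_capacity);
        if (grown == NULL) return -1;
        list->elems = grown;
        list->capacity = new_capacity;
    }
    list->elems[list->count++] = elem;
    return list->count;
}
```

Unlike the patch, which asserts on the realloc result, this sketch reports allocation failure to the caller; either choice avoids writing through a null pointer.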
diff --git a/src/hashtable.h b/src/hashtable.h
index 9ab1608..9ac7cc6 100644
--- a/src/hashtable.h
+++ b/src/hashtable.h
@@ -6,7 +6,7 @@
* Released to the public domain.
*
*--------------------------------------------------------------------------
- * $Id: hashtable.h,v 9999.9 2017/03/10 00:01:40 cvs Exp $
+ * $Id: hashtable.h,v 9999.11 2017/04/28 01:40:20 cvs Exp $
\*--------------------------------------------------------------------------*/
#ifndef _HASHTABLE_H
@@ -41,7 +41,6 @@ typedef struct {
} HashTable;
-
typedef struct {
void ** elementList;
long numOfElements;
@@ -54,6 +53,8 @@ void ArrayListDestroy(ArrayList * list);
void * ArrayListGet(ArrayList * list, long n);
int ArrayListPush(ArrayList * list, void * new_elem);
void ArrayListSetDeallocationFunction(ArrayList * list, void (*elem_deallocator)(void *elem));
+void ArrayListSort(ArrayList * list, int compare_L_minus_R(void * L_elem, void * R_elem));
+
void HashTableIteration(HashTable * tab, void process_item(void * key, void * hashed_obj, HashTable * tab) );
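Note on the new ArrayListSort declaration: the comparator follows the "L minus R" convention, so a negative result means the left element sorts first. The merge step that ArrayListSort_merge performs on two adjacent sorted runs can be sketched in isolation as below. `demo_merge_runs` and `demo_compare_ints` are hypothetical names, and this version checks the run boundaries before dereferencing either cursor, which is a small deviation from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as the compare_L_minus_R callback: negative when L sorts first. */
typedef int (*demo_compare_fn)(void *L_elem, void *R_elem);

/* Merge the sorted runs [start, start+items) and
 * [start+items, start+items+items2) in place. Returns 0 on success. */
static int demo_merge_runs(void **elems, int start, int items, int items2,
                           demo_compare_fn compare) {
    int total = items + items2;
    int read1 = start, end1 = start + items;
    int read2 = end1, end2 = end1 + items2;
    if (total <= 0) return 0;

    void **merged = malloc(sizeof(void *) * (size_t)total);
    if (merged == NULL) return -1;

    for (int w = 0; w < total; w++) {
        int take_first;
        if (read1 == end1) take_first = 0;        /* first run exhausted */
        else if (read2 == end2) take_first = 1;   /* second run exhausted */
        else take_first = compare(elems[read1], elems[read2]) < 0;
        merged[w] = take_first ? elems[read1++] : elems[read2++];
    }

    memcpy(elems + start, merged, sizeof(void *) * (size_t)total);
    free(merged);
    return 0;
}

/* Example comparator for elements that point at ints. */
static int demo_compare_ints(void *L_elem, void *R_elem) {
    int l = *(int *)L_elem, r = *(int *)R_elem;
    return (l > r) - (l < r);
}
```

As in the patch, ties are taken from the second run, so the merge is not stable.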
diff --git a/src/input-files.c b/src/input-files.c
index b820fa2..0a35d1b 100644
--- a/src/input-files.c
+++ b/src/input-files.c
@@ -75,6 +75,15 @@ void * delay_realloc(void * old_pntr, size_t old_size, size_t new_size){
return new_ret;
}
+// the caller is in charge of deallocation
+char * memstrcpy(char * in){
+ int ilen = strlen(in);
+ char * ret = malloc(ilen+1);
+ memcpy(ret, in, ilen);
+ ret[ilen]=0;
+ return ret;
+}
+
double guess_reads_density(char * fname, int is_sam)
{
return guess_reads_density_format(fname, is_sam, NULL, NULL, NULL);
@@ -2250,7 +2259,7 @@ int SAM_pairer_writer_create( SAM_pairer_writer_main_t * bam_main , int all_thre
bam_main -> threads[x1].strm.next_in = Z_NULL;
deflateInit2(&bam_main -> threads[x1].strm, bam_main -> compression_level, Z_DEFLATED,
- PAIRER_GZIP_WINDOW_BITS, PAIRER_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
+ PAIRER_GZIP_WINDOW_BITS, PAIRER_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
}
return 0;
}
@@ -2390,7 +2399,7 @@ int SAM_pairer_warning_file_open_limit(){
// in_format can be either
// bin_buff_size_per_thread is in Mega-Bytes.
// It returns 0 if no error
-int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_buff_size_per_thread, int BAM_input, int is_Tiny_Mode, int is_single_end_mode, int force_do_not_sort, int display_progress, char * in_file, void (* reset_output_function) (void * pairer), int (* output_header_function) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len), int (* output_function) (void * pairer, int thread_no, char * readname, char * bin1, char * bin2 [...]
+int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_buff_size_per_thread, int BAM_input, int is_Tiny_Mode, int is_single_end_mode, int force_do_not_sort, int display_progress, char * in_file, void (* reset_output_function) (void * pairer), int (* output_header_function) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len), int (* output_function) (void * pairer, int thread_no, char * bin1, char * bin2), char * tmp_pat [...]
memset(pairer, 0, sizeof(SAM_pairer_context_t));
@@ -2413,14 +2422,15 @@ int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_bu
pairer -> display_progress = display_progress;
pairer -> is_single_end_mode = is_single_end_mode;
pairer -> force_do_not_sort = force_do_not_sort;
+ pairer -> long_read_minimum_length = long_read_minimum_length;
subread_init_lock(&pairer -> unsorted_notification_lock);
subread_init_lock(&pairer -> input_fp_lock);
subread_init_lock(&pairer -> output_header_lock);
pairer -> total_threads = all_threads;
- pairer -> input_buff_SBAM_size = bin_buff_size_per_thread * 1024 * 1024;
- pairer -> input_buff_BIN_size = 1024*1024;
+ pairer -> input_buff_SBAM_size = max(bin_buff_size_per_thread * 1024 * 1024, 100+FC_LONG_READ_RECORD_HARDLIMIT);
+ pairer -> input_buff_BIN_size = max(1024*1024, pairer -> input_buff_SBAM_size );
pairer -> appendix1 = appendix1;
@@ -2559,7 +2569,7 @@ int SAM_pairer_read_BAM_block(FILE * fp, int max_read_len, char * inbuff) {
}
//#define MIN_BAM_BLOCK_SIZE 66000
-#define MIN_BAM_BLOCK_SIZE (1024*1024)
+#define MIN_BAM_BLOCK_SIZE FC_LONG_READ_RECORD_HARDLIMIT
int SAM_pairer_read_SAM_MB( FILE * fp, int max_read_len, char * inbuff ){
int ret = 0;
@@ -2698,7 +2708,7 @@ int SAM_pairer_fetch_BAM_block(SAM_pairer_context_t * pairer , SAM_pairer_thread
int test_read_bin = SAM_pairer_find_start(pairer, thread_context);
if(test_read_bin<1 && thread_context -> input_buff_BIN_used >= 32 ){
pairer -> is_bad_format = 1;
- SUBREADprintf("BIN REMAIN=%d, BAM USED=%d, BIN GENERATED=%d, BAM REMAIN=%d, TEST_READ_BIN=%d\n", remained_BIN, used_BAM, have, thread_context -> input_buff_SBAM_used - thread_context -> input_buff_SBAM_ptr, test_read_bin);
+ //SUBREADprintf("BIN REMAIN=%d, BAM USED=%d, BIN GENERATED=%d, BAM REMAIN=%d, TEST_READ_BIN=%d\n", remained_BIN, used_BAM, have, thread_context -> input_buff_SBAM_used - thread_context -> input_buff_SBAM_ptr, test_read_bin);
}
}
} else {
@@ -2763,7 +2773,7 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
BAM_next_nch;
header_txt [x1] = nch;
}
- pairer -> output_header(pairer, thread_context -> thread_id, 1, pairer -> BAM_l_text , header_txt , pairer -> BAM_l_text );
+ int is_OK = pairer -> output_header(pairer, thread_context -> thread_id, 1, pairer -> BAM_l_text , header_txt , pairer -> BAM_l_text );
BAM_next_u32(pairer -> BAM_n_ref);
unsigned int ref_bin_len = 0;
@@ -2797,10 +2807,13 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
//SUBREADprintf("%d-th ref : %s [len=%u], bin_len=%d < %d\n", x1, ref_name, l_ref, ref_bin_len, pairer -> BAM_l_text);
}
- //exit(0);
- pairer -> output_header(pairer, thread_context -> thread_id, 0, pairer -> BAM_n_ref , header_txt , ref_bin_len );
+ is_OK = is_OK || pairer -> output_header(pairer, thread_context -> thread_id, 0, pairer -> BAM_n_ref , header_txt , ref_bin_len );
if(header_txt) free(header_txt);
+ if(is_OK){
+ pairer -> is_incomplete_BAM = 1;
+ return 0;
+ }
pairer -> BAM_header_parsed = 1;
//if(pairer -> display_progress)
@@ -2818,25 +2831,17 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
return 0;
}
- unsigned int record_len=0;
+ unsigned int record_len = 0, seq_len = 0;
memcpy(&record_len, thread_context -> input_buff_BIN + thread_context -> input_buff_BIN_ptr, 4);
+ memcpy(&seq_len, thread_context -> input_buff_BIN + thread_context -> input_buff_BIN_ptr + 20, 4);
thread_context -> input_buff_BIN_ptr += 4;
- //SUBREADprintf("RECLEN=%d, MAX=%d\n", record_len, MAX_BIN_RECORD_LENGTH);
-
- if(record_len < 32 || record_len > MAX_BIN_RECORD_LENGTH || thread_context -> input_buff_BIN_used < thread_context -> input_buff_BIN_ptr + record_len ){
- //SUBREADprintf("BAD FORMAT:%u\n", record_len);
+ if(record_len < 32 || record_len > min(MAX_BIN_RECORD_LENGTH,60000) || seq_len >= pairer -> long_read_minimum_length || thread_context -> input_buff_BIN_used < thread_context -> input_buff_BIN_ptr + record_len ){
+ if(seq_len >= pairer -> long_read_minimum_length) pairer -> is_single_end_mode = 1;
pairer -> is_bad_format = 1;
return 0;
}
- /*
- while(thread_context -> input_buff_BIN_used <= thread_context -> input_buff_BIN_ptr + record_len){
- int ret_fetch = SAM_pairer_fetch_BAM_block(pairer, thread_context);
- if(ret_fetch)
- return 0;
- }*/
-
(* bin_where) = thread_context -> input_buff_BIN + thread_context -> input_buff_BIN_ptr - 4;
(* bin_len) = record_len + 4;
thread_context -> input_buff_BIN_ptr += record_len;
@@ -2870,7 +2875,7 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
}
}
- pairer -> output_header(pairer, thread_context -> thread_id, 1, header_len , header_start , header_len);
+ int is_OK = pairer -> output_header(pairer, thread_context -> thread_id, 1, header_len , header_start , header_len);
thread_context -> input_buff_SBAM_ptr = 0;
int header_bin_ptr = 0, header_contigs = 0;
while(1){
@@ -2919,8 +2924,12 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
}
}
- pairer -> output_header(pairer, thread_context -> thread_id, 0, header_contigs , header_start , header_bin_ptr);
+ is_OK = is_OK || pairer -> output_header(pairer, thread_context -> thread_id, 0, header_contigs , header_start , header_bin_ptr);
pairer -> BAM_header_parsed = 1;
+ if(is_OK){
+ pairer -> is_incomplete_BAM = 1;
+ return 0;
+ }
}
if(passed_read_SBAM_ptr >=0)
@@ -2928,6 +2937,7 @@ int SAM_pairer_get_next_read_BIN( SAM_pairer_context_t * pairer , SAM_pairer_thr
if( thread_context -> input_buff_SBAM_ptr < thread_context -> input_buff_SBAM_used ){
thread_context -> input_buff_BIN_ptr = 0;
+ //SUBREADprintf("reduce_SAM_to_BAM_0 \n");
*bin_len = reduce_SAM_to_BAM(pairer, thread_context,!pairer -> tiny_mode);
*bin_where = (unsigned char *)thread_context -> input_buff_BIN;
@@ -2964,8 +2974,6 @@ int online_register_contig(SAM_pairer_context_t * pairer , SAM_pairer_thread_t *
#define set_memory_int(ptr, iii) { *(ptr) = (iii)&0xff; *(ptr+1) = (iii>>8)&0xff; *(ptr+2) = (iii>>16)&0xff;*(ptr+3) = (iii>>24); }
int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thread_context, int include_sequence){
-
-
int column_no = 0, in_ptr = 0;
char * in_str = thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr;
char * read_name = NULL, * ref = NULL, * mate_ref = NULL, * cigar = NULL, * seq = NULL, * qual = NULL;
@@ -3021,10 +3029,21 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
set_memory_int(bin_tmp + 12, mapq_nl);
int coverage;
- int cigar_ops = SamBam_compress_cigar(cigar, (int *)(bin_tmp + 36 + l_read_name), &coverage, 10000);
+ int cigar_ops = SamBam_compress_cigar(cigar, (int *)(bin_tmp + 36 + l_read_name), &coverage, 65535);
int flag_nc = flag << 16 | cigar_ops;
set_memory_int(bin_tmp + 16, flag_nc);
+
+
+ int seq_len = qual - seq - 1;
+
+ if(seq_len >=pairer -> long_read_minimum_length ){
+ pairer -> is_single_end_mode = 1;
+ include_sequence = 0;
+ pairer -> tiny_mode = 1;
+ pairer -> long_cigar_mode = 1;
+ }
+
if(include_sequence){
set_memory_int(bin_tmp + 20, l_seq); // SEQ_LEN
}else set_memory_int(bin_tmp + 20, 1);
@@ -3045,7 +3064,6 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
set_memory_int(bin_tmp + 32, tlen);
memcpy(bin_tmp + 36, read_name, l_read_name);
-
int bin_ptr = 36 + l_read_name + 4 * cigar_ops;
if(include_sequence){
@@ -3074,25 +3092,27 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
int is_important_tag = (in_str[in_ptr+0] == 'N' && in_str[in_ptr+1] == 'H') ||
(in_str[in_ptr+0] == 'H' && in_str[in_ptr+1] == 'I') ||
+ (in_str[in_ptr+0] == 'R' && in_str[in_ptr+1] == 'G') ||
(in_str[in_ptr+0] == 'N' && in_str[in_ptr+1] == 'M') ;
int xxnch;
- if(in_str[in_ptr + 3] == 'Z'){
- if(!pairer -> tiny_mode){
+ if(in_str[in_ptr + 3] == 'Z' || in_str[in_ptr + 3] == 'H'){
+ if(is_important_tag||!pairer -> tiny_mode){
bin_tmp[bin_ptr+0] = in_str[in_ptr+0];
bin_tmp[bin_ptr+1] = in_str[in_ptr+1];
- bin_tmp[bin_ptr+2] = 'Z';
+ bin_tmp[bin_ptr+2] = in_str[in_ptr + 3];
bin_ptr += 3;
}
in_ptr += 5;
while(1){
xxnch = *(in_str + in_ptr);
- if(xxnch == '\n' || xxnch == '\t') break;
- if(!pairer -> tiny_mode)
+ if(xxnch == '\n' || xxnch == '\t' || xxnch == 0) break;
+ if(is_important_tag||!pairer -> tiny_mode)
*(bin_tmp + (bin_ptr++)) = xxnch;
in_ptr ++;
}
- if(!pairer -> tiny_mode)
+ if(is_important_tag||!pairer -> tiny_mode){
*(bin_tmp + (bin_ptr++)) = 0;
+ }
}else if(in_str[in_ptr + 3] == 'i'){
int tmpi = 0, tmpi_sign = 1;
if(is_important_tag || !pairer -> tiny_mode){
@@ -3106,7 +3126,7 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
while(1){
xxnch = *(in_str + in_ptr);
- if(xxnch == '\n' || xxnch == '\t') break;
+ if(xxnch == '\n' || xxnch == '\t' || xxnch == 0) break;
else if(xxnch == '-') tmpi_sign = -1;
else tmpi = tmpi * 10 + xxnch - '0';
in_ptr ++;
@@ -3116,6 +3136,64 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
set_memory_int(bin_tmp+bin_ptr, tmpi);
bin_ptr += 4;
}
+ }else if(in_str[in_ptr + 3] == 'f'){
+ char ftxt[30];
+ int fi=0;
+ while(1){
+ xxnch = *(in_str + in_ptr + 5 + fi);
+ if(xxnch== '\n' || xxnch == '\t'|| xxnch == 0) break;
+ ftxt[fi++]=xxnch;
+ ftxt[fi]=0;
+ }
+ if(!pairer -> tiny_mode){
+ float fv = atof(ftxt);
+ bin_tmp[bin_ptr+0] = in_str[in_ptr+0];
+ bin_tmp[bin_ptr+1] = in_str[in_ptr+1];
+ bin_tmp[bin_ptr+2] = 'f';
+ memcpy( bin_tmp + bin_ptr + 3, &fv, 4);
+ bin_ptr += 7;
+ }
+ in_ptr += 5 + fi;
+ }else if(in_str[in_ptr + 3] == 'B'){
+ char elemtype = in_str[in_ptr + 5];
+ int txi=0, eles=0;
+ char ttxt[30], *elen_ptr = NULL;;
+ if(!pairer -> tiny_mode){
+ bin_tmp[bin_ptr+0] = in_str[in_ptr+0];
+ bin_tmp[bin_ptr+1] = in_str[in_ptr+1];
+ bin_tmp[bin_ptr+2] = 'B';
+ bin_tmp[bin_ptr+3] = elemtype;
+ elen_ptr = bin_tmp+4 + bin_ptr;
+ bin_ptr += 8;
+ }
+ in_ptr += 6;
+ while(1){
+ xxnch = *(in_str + in_ptr);
+ if((!pairer -> tiny_mode)){
+ if((xxnch ==',' || xxnch =='\n' || xxnch == '\t' || xxnch == 0) && txi > 0){
+ //SUBREADprintf("ADD VAL : `%s`\n", ttxt);
+ if(elemtype == 'f'){
+ float fv = atof(ttxt);
+ memcpy( bin_tmp + bin_ptr, &fv, 4);
+ }else{
+ int iv = atoi(ttxt);
+ memcpy( bin_tmp + bin_ptr, &iv, 4);
+ }
+ bin_ptr+=4;
+ txi=0;
+ eles++;
+ }else{
+ if(xxnch!=','){
+ ttxt[txi++] = xxnch;
+ ttxt[txi] = 0;
+ }
+ }
+ }
+ if(xxnch =='\n' || xxnch == '\t' || xxnch == 0)break;
+ in_ptr ++;
+ }
+ if((!pairer -> tiny_mode)) memcpy(elen_ptr, & eles, 4);
+
}else if(in_str[in_ptr + 3] == 'A'){
if(!pairer -> tiny_mode){
bin_tmp[bin_ptr+0] = in_str[in_ptr+0];
@@ -3129,19 +3207,17 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
in_ptr += 5;
while(1){
xxnch = *(in_str + in_ptr);
- if(xxnch == '\n' || xxnch == '\t') break;
+ if(xxnch == '\n' || xxnch == '\t' || xxnch == 0) break;
in_ptr++;
}
}
+ // #warning "=============== COMMENT NEXT ====================="
+ // SUBREADprintf("Z_len PTR = %d + %d\n", bin_ptr, thread_context -> input_buff_BIN_ptr);
}
}
thread_context -> input_buff_SBAM_ptr += in_ptr + 1;
- if(bin_ptr > 60000){
- SUBREADprintf("ERROR: the read record length (%d) is longer than the limit. The program has to terminate. \n", bin_ptr);
- pairer -> is_bad_format = 1;
- }
bin_ptr -= 4;
set_memory_int(bin_tmp, bin_ptr);
@@ -3151,7 +3227,7 @@ int reduce_SAM_to_BAM(SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thre
return bin_ptr;
}
-int SAM_pairer_iterate_int_tags(unsigned char * bin, int bin_len, char * tag_name, int * saved_value){
+int SAM_pairer_iterate_tags(unsigned char * bin, int bin_len, char * tag_name, char * data_type, char ** saved_value){
int found = 0;
int bin_cursor = 0;
while(bin_cursor < bin_len){
@@ -3164,58 +3240,69 @@ int SAM_pairer_iterate_int_tags(unsigned char * bin, int bin_len, char * tag_nam
SUBREADprintf("TAG=%s, TYP=%c %d %c\n", outc, bin[bin_cursor+2], bin[bin_cursor+3], bin[bin_cursor+4]);
}
- if(bin[bin_cursor] == tag_name[0] && bin[bin_cursor+1] == tag_name[1]){
- int tag_int_val = 0;
- if(bin[bin_cursor+2]=='i' || bin[bin_cursor+2]=='I'){
- memcpy(&tag_int_val, bin+bin_cursor+3, 4);
- found = 1;
- } else if(bin[bin_cursor+2]=='s' || bin[bin_cursor+2]=='S'){
- memcpy(&tag_int_val, bin+bin_cursor+3, 2);
- found = 1;
- } else if(bin[bin_cursor+2]=='c' || bin[bin_cursor+2]=='C'){
- memcpy(&tag_int_val, bin+bin_cursor+3, 1);
- found = 1;
- }
- if(found){
- (* saved_value) = tag_int_val;
- break;
- }
- }
- int skip_content = 0;
+ if(bin[bin_cursor] == tag_name[0] && bin[bin_cursor+1] == tag_name[1]){
+ (* data_type) = bin[bin_cursor+2];
+ (* saved_value) = (char *)bin+bin_cursor+3;
+ found = 1;
+ break;
+ }
+ int skip_content = 0;
//SUBREADprintf("NextTag=%c; ", bin[bin_cursor+2]);
- if(bin[bin_cursor+2]=='i' || bin[bin_cursor+2]=='I' || bin[bin_cursor+2]=='f')
- skip_content = 4;
- else if(bin[bin_cursor+2]=='s' || bin[bin_cursor+2]=='S')
- skip_content = 2;
- else if(bin[bin_cursor+2]=='c' || bin[bin_cursor+2]=='C' || bin[bin_cursor+2]=='A')
- skip_content = 1;
+ if(bin[bin_cursor+2]=='i' || bin[bin_cursor+2]=='I' || bin[bin_cursor+2]=='f')
+ skip_content = 4;
+ else if(bin[bin_cursor+2]=='s' || bin[bin_cursor+2]=='S')
+ skip_content = 2;
+ else if(bin[bin_cursor+2]=='c' || bin[bin_cursor+2]=='C' || bin[bin_cursor+2]=='A')
+ skip_content = 1;
else if(bin[bin_cursor+2]=='Z' || bin[bin_cursor+2]=='H'){
- while(bin[bin_cursor+skip_content + 3]){
+ while(bin[bin_cursor+skip_content + 3]){
//SUBREADprintf("ACHAR=%c\n", (bin[skip_content + 3]));
skip_content++;
}
skip_content ++;
- } else if(bin[bin_cursor+2]=='B'){
- char cell_type = tolower(bin[bin_cursor+3]);
+ } else if(bin[bin_cursor+2]=='B'){
+ char cell_type = tolower(bin[bin_cursor+3]);
- memcpy(&skip_content, bin + bin_cursor + 4, 4);
+ memcpy(&skip_content, bin + bin_cursor + 4, 4);
// SUBREADprintf("Array Type=%c, cells=%d\n", cell_type, skip_content);
- if(cell_type == 's')skip_content *=2;
- else if(cell_type == 'i' || cell_type == 'f')skip_content *= 4;
+ if(cell_type == 's')skip_content *=2;
+ else if(cell_type == 'i' || cell_type == 'f')skip_content *= 4;
skip_content += 4 + 1;
- }else{
+ }else{
SUBREADprintf("UnknownTag=%c\n", bin[bin_cursor+2]);
assert(0);
}
//SUBREADprintf("SKIP=%d\n", skip_content);
- bin_cursor += skip_content + 3;
- }
- return found;
+ bin_cursor += skip_content + 3;
+ }
+ return found;
+}
+
+int SAM_pairer_iterate_int_tags(unsigned char * bin, int bin_len, char * tag_name, int * saved_value){
+ char * data_ptr = NULL;
+ char data_type = 0;
+
+ (*saved_value) = 0;
+ int ret = SAM_pairer_iterate_tags(bin, bin_len, tag_name, &data_type, &data_ptr);
+ //SUBREADprintf(" NEED %s , FOUND %d, TYPE %c\n", tag_name, ret, data_type);
+ if(ret){
+ if(data_type == 'i' || data_type == 'I')
+ memcpy(saved_value, data_ptr, 4);
+ else if(data_type == 's' || data_type == 'S')
+ memcpy(saved_value, data_ptr, 2);
+ else if(data_type == 'c' || data_type == 'C')
+ memcpy(saved_value, data_ptr, 1);
+ else return 0;
+ }
+
+ return ret;
}
+
+
int SAM_pairer_get_read_full_name( SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thread_context , unsigned char * bin, int bin_len , char * full_name, int * this_flag){
- full_name[0]=0;
- int rlen = 0;
+ full_name[0]=0;
+ int rlen = 0;
unsigned int l_read_name = 0;
unsigned int refID = 0;
unsigned int next_refID = 0;
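Note on the SAM_pairer_iterate_tags refactoring: the generic walker now returns the tag's type byte and a pointer to its payload, and the integer wrapper decodes 1-, 2- or 4-byte values from that pointer. A reduced sketch of the same split is shown below. It handles fixed-width tag types only, uses hypothetical names (`demo_find_tag`, `demo_get_int_tag`), and assembles the value byte by byte instead of using memcpy into a zeroed int, so it does not depend on host byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Locate a two-letter tag in a BAM-style auxiliary block that holds only
 * fixed-width types (c/C, s/S, i/I, f, A). Returns 1 on a hit. */
static int demo_find_tag(const unsigned char *bin, int bin_len, const char *tag,
                         char *type, const unsigned char **value) {
    int cursor = 0;
    while (cursor + 3 <= bin_len) {
        char t = (char)bin[cursor + 2];
        int width;
        if (bin[cursor] == (unsigned char)tag[0] &&
            bin[cursor + 1] == (unsigned char)tag[1]) {
            *type = t;
            *value = bin + cursor + 3;
            return 1;
        }
        if (t == 'i' || t == 'I' || t == 'f') width = 4;
        else if (t == 's' || t == 'S') width = 2;
        else if (t == 'c' || t == 'C' || t == 'A') width = 1;
        else return 0; /* Z, H and B are out of scope for this sketch */
        cursor += 3 + width;
    }
    return 0;
}

/* Decode an integer tag; *out is 0 and the result is 0 when it is absent
 * or not an integer type. */
static int demo_get_int_tag(const unsigned char *bin, int bin_len,
                            const char *tag, int *out) {
    char type = 0;
    const unsigned char *value = NULL;
    int width, i;
    unsigned int acc = 0;

    *out = 0;
    if (!demo_find_tag(bin, bin_len, tag, &type, &value)) return 0;
    if (type == 'i' || type == 'I') width = 4;
    else if (type == 's' || type == 'S') width = 2;
    else if (type == 'c' || type == 'C') width = 1;
    else return 0;

    /* BAM stores integers little-endian: lowest byte first. */
    for (i = 0; i < width; i++) acc |= (unsigned int)value[i] << (8 * i);
    *out = (int)acc;
    return 1;
}
```

A caller can then tell "tag absent" from "tag present with value 0", which is what the HI lookup later in this diff relies on when it sets HItag to -1.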
@@ -3269,7 +3356,8 @@ int SAM_pairer_get_read_full_name( SAM_pairer_context_t * pairer , SAM_pairer_th
unsigned int tags_len = bin_len - tags_start;
if(tags_len > 2){
- SAM_pairer_iterate_int_tags(bin + tags_start, tags_len, "HI", &HItag);
+ int found = SAM_pairer_iterate_int_tags(bin + tags_start, tags_len, "HI", &HItag);
+ if(!found) HItag = -1;
}
int slash_pos = 0;
@@ -3433,14 +3521,14 @@ void SAM_pairer_writer_reset( void * pairer_vp ) {
}
-int SAM_pairer_multi_thread_output(void * pairer_vp, int thread_no, char * rname, char * bin1, char * bin2 ){
+int SAM_pairer_multi_thread_output(void * pairer_vp, int thread_no, char * bin1, char * bin2 ){
SAM_pairer_context_t * pairer = (SAM_pairer_context_t *) pairer_vp;
SAM_pairer_writer_main_t * bam_main = (SAM_pairer_writer_main_t * )pairer -> appendix1;
SAM_pairer_writer_thread_t * bam_thread = bam_main -> threads + thread_no;
char dummy_bin2 [MAX_READ_NAME_LEN*2 + 180 ];
- if(bin2==NULL && rname != NULL && bam_main -> has_dummy){
- SAM_pairer_make_dummy( rname, bin1, dummy_bin2 );
+ if(bin2==NULL && bam_main -> has_dummy){
+ SAM_pairer_make_dummy( "DUMMY", bin1, dummy_bin2 );
bin2 = dummy_bin2;
}
@@ -3470,10 +3558,11 @@ int SAM_pairer_multi_thread_output(void * pairer_vp, int thread_no, char * rname
}
void SAM_pairer_do_read_test( SAM_pairer_context_t * pairer , SAM_pairer_thread_t * thread_context , int read_name_len, char * read_full_name, int bin_len, char * bin , int flags){
+
unsigned char * mate_bin = HashTableGet(thread_context -> orphant_table, read_full_name);
if(mate_bin){
if(pairer -> output_function)
- pairer -> output_function(pairer, thread_context -> thread_id, read_full_name, bin, (char*)mate_bin);
+ pairer -> output_function(pairer, thread_context -> thread_id, bin, (char*)mate_bin);
HashTableRemove(thread_context -> orphant_table, read_full_name);
if(thread_context -> orphant_space > bin_len)
thread_context -> orphant_space -= bin_len;
@@ -3489,7 +3578,8 @@ void SAM_pairer_do_read_test( SAM_pairer_context_t * pairer , SAM_pairer_thread_
HashTablePut(thread_context -> orphant_table, mem_name, mem_bin);
thread_context -> orphant_space += bin_len;
- //SUBREADprintf("Orphant_created [%d]: %s\n", thread_context -> thread_id, read_full_name);
+ //#warning "============= COMMENT NEXT =================="
+ //SUBREADprintf("Orphant_created [%d]: %s ; BINLEN=%d, OPSIZE=%d\n", thread_context -> thread_id, read_full_name, bin_len, thread_context -> orphant_space);
}
}
@@ -3518,14 +3608,15 @@ int SAM_pairer_do_next_read( SAM_pairer_context_t * pairer , SAM_pairer_thread_t
int bin_len = 0, this_flags = 0;
int has_next_read = SAM_pairer_get_next_read_BIN(pairer, thread_context, &bin, &bin_len);
+ //#warning "============COMMENT NEXT =================="
+ //SUBREADprintf("GOT READ: BINLEN=%d\n", bin_len);
if(has_next_read){
int name_len = SAM_pairer_get_read_full_name(pairer, thread_context, bin, bin_len, read_full_name, & this_flags);
if(pairer -> is_single_end_mode == 0 && ( this_flags & 1 ) == 1){ // if the reads are PE
-
if(strcmp(read_full_name , thread_context -> immediate_last_read_full_name) == 0){
if(pairer -> output_function)
- pairer -> output_function(pairer, thread_context -> thread_id, read_full_name, (char*) bin, (char*)thread_context -> immediate_last_read_bin);
+ pairer -> output_function(pairer, thread_context -> thread_id, (char*) bin, (char*)thread_context -> immediate_last_read_bin);
thread_context -> immediate_last_read_full_name[0] = 0;
}else{
@@ -3554,7 +3645,7 @@ int SAM_pairer_do_next_read( SAM_pairer_context_t * pairer , SAM_pairer_thread_t
}
}else{ // else just write.
if(pairer -> output_function)
- pairer -> output_function(pairer, thread_context -> thread_id, NULL, (char*) bin, NULL);
+ pairer -> output_function(pairer, thread_context -> thread_id, (char*) bin, NULL);
}
thread_context -> readno_in_chunk ++;
return 0;
@@ -3640,8 +3731,7 @@ int SAM_pairer_osr_next_name(FILE * fp , char * name, int thread_no, int all_thr
fseek(fp, -2-rlen, SEEK_CUR);
return 1;
}
- fread(&rlen, 1, 2, fp);
- assert(rlen < 65535);
+ fread(&rlen, 1, 4, fp);
rlen +=4;
fseek(fp, rlen, SEEK_CUR);
}
@@ -3654,8 +3744,7 @@ void SAM_pairer_osr_next_bin(FILE * fp, char * bin){
assert(rlen < 1024);
fseek(fp, rlen, SEEK_CUR);
rlen =0;
- fread(&rlen, 1, 2, fp);
- assert(rlen < 65535);
+ fread(&rlen, 1, 4, fp);
rlen +=4;
fread(bin, 1, rlen, fp);
}
@@ -3687,8 +3776,8 @@ int merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, in
char * names = malloc( fps_no * max_name_len );
- bin_tmp1 = malloc(66000);
- bin_tmp2 = malloc(66000);
+ bin_tmp1 = malloc(FC_LONG_READ_RECORD_HARDLIMIT);
+ bin_tmp2 = malloc(FC_LONG_READ_RECORD_HARDLIMIT);
FILE * out_fp = fopen(tmp_fname, "wb");
@@ -3729,7 +3818,7 @@ int merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, in
if(min2_name_fileno>=0){
SAM_pairer_osr_next_bin( fps[ min2_name_fileno ] , bin_tmp2);
- pairer -> output_function(pairer, 0, names + max_name_len*min_name_fileno , (char*) bin_tmp1, (char*)bin_tmp2);
+ pairer -> output_function(pairer, 0, (char*) bin_tmp1, (char*)bin_tmp2);
if(0 == pairer -> is_unsorted_notified){
char * name_tmp_1 = malloc(strlen(names+(min_name_fileno * max_name_len))+5), *name_tmp_2 = malloc(strlen(names+(min_name_fileno * max_name_len))+5);
@@ -3756,9 +3845,9 @@ int merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, in
wlen = strlen( names+(min_name_fileno * max_name_len) );
fwrite( &wlen, 2, 1,out_fp );
fwrite( names+(min_name_fileno * max_name_len), 1, wlen, out_fp );
- memcpy( &rbinlen, bin_tmp1 , 2);
+ memcpy( &rbinlen, bin_tmp1 , 4);
rbinlen += 4;
- fwrite( bin_tmp1, 2, 1, out_fp );
+ fwrite( bin_tmp1, 4, 1, out_fp );
int write_len = fwrite( bin_tmp1, 1, rbinlen, out_fp );
if(write_len < rbinlen)is_disk_full = 1;
}
@@ -3771,6 +3860,8 @@ int merge_level_fps(SAM_pairer_context_t * pairer, char * fname, FILE ** fps, in
unlink(fname);
rename(tmp_fname, fname);
free(names);
+ free(bin_tmp1);
+ free(bin_tmp2);
return is_disk_full;
}
#define PAIRER_WAIT_TICK_TIME 10000
@@ -3957,7 +4048,7 @@ void * SAM_pairer_rescure_orphants_max_FP(void * params){
if( min2_name_fileno >=0){
SAM_pairer_osr_next_bin( orphant_fps[ min2_name_fileno ] , bin_tmp2);
- pairer -> output_function(pairer, thread_no, names + max_name_len*min_name_fileno , (char*) bin_tmp1, (char*)bin_tmp2);
+ pairer -> output_function(pairer, thread_no, (char*) bin_tmp1, (char*)bin_tmp2);
if(0 == pairer -> is_unsorted_notified){
char *name_tmp_1 = malloc(strlen(names+(min_name_fileno * max_name_len))+5), *name_tmp_2 = malloc(strlen(names+(min_name_fileno * max_name_len))+5);
@@ -3985,7 +4076,7 @@ void * SAM_pairer_rescure_orphants_max_FP(void * params){
}else{
//#warning ">>>>>>> COMMENT NEXT LINE <<<<<<<<"
//SUBREADprintf("FINAL_ORPHAN:%s\n" , names + max_name_len*min_name_fileno);
- pairer -> output_function(pairer, thread_no, names + max_name_len*min_name_fileno, (char*) bin_tmp1, NULL);
+ pairer -> output_function(pairer, thread_no, (char*) bin_tmp1, NULL);
died++;
}
@@ -4008,6 +4099,7 @@ void * SAM_pairer_rescure_orphants_max_FP(void * params){
}
free( bin_tmp1 );
free( bin_tmp2 );
+ free(orphant_fps);
pairer -> total_orphan_reads += died;
return NULL;
}
@@ -4052,7 +4144,7 @@ int SAM_pairer_update_orphant_table(SAM_pairer_context_t * pairer , SAM_pairer_t
is_error = (write_len <1);
write_len = fwrite(name_list[x1], 1, namelen, tmp_fp);
is_error |= (write_len <namelen);
- write_len = fwrite(&bin_len,2,1,tmp_fp);
+ write_len = fwrite(&bin_len,4, 1,tmp_fp);
is_error |= (write_len <1);
write_len = fwrite(bin_list[x1], 1, bin_len + 4, tmp_fp);
is_error |= (write_len < bin_len + 4);
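Note on the temporary-file hunks: the record length written ahead of each alignment changes from 2 bytes to 4. A 2-byte prefix cannot represent a record longer than 65535 bytes, which matters once long-read records are allowed. The effect can be shown with a small sketch; `demo_put_len` and `demo_get_len` are hypothetical helpers, not Subread functions, and they write the bytes explicitly in little-endian order.

```c
#include <assert.h>
#include <stdint.h>

/* Store len in `width` little-endian bytes; higher-order bytes are dropped. */
static void demo_put_len(unsigned char *out, uint32_t len, int width) {
    for (int i = 0; i < width; i++)
        out[i] = (unsigned char)((len >> (8 * i)) & 0xff);
}

/* Read back a `width`-byte little-endian length. */
static uint32_t demo_get_len(const unsigned char *in, int width) {
    uint32_t len = 0;
    for (int i = 0; i < width; i++)
        len |= (uint32_t)in[i] << (8 * i);
    return len;
}
```

With these helpers, a 70000-byte record length survives a 4-byte prefix but wraps to 4464 (70000 - 65536) when only 2 bytes are stored.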
@@ -4197,6 +4289,7 @@ void * SAM_pairer_thread_run( void * params ){
if(pairer -> is_bad_format) break;
if(thread_context -> immediate_last_read_full_name[0]){
+
SAM_pairer_register_matcher(pairer, thread_context -> chunk_number, thread_context -> readno_in_chunk - 1, thread_context -> immediate_last_read_full_name, thread_context -> immediate_last_read_bin, thread_context -> immediate_last_read_bin_len , thread_context -> immediate_last_read_flags);
SAM_pairer_do_read_test(pairer , thread_context , thread_context -> immediate_last_read_name_len , thread_context -> immediate_last_read_full_name , thread_context -> immediate_last_read_bin_len , thread_context -> immediate_last_read_bin, thread_context -> immediate_last_read_flags);
thread_context -> immediate_last_read_full_name[0] = 0;
@@ -4238,9 +4331,7 @@ int SAM_pairer_run_once( SAM_pairer_context_t * pairer){
}
if(0 == pairer -> is_bad_format){
-
int is_disk_full = SAM_pairer_probe_maxfp( pairer );
-
if(is_disk_full){
SUBREADprintf("ERROR: cannot write into the temporary file. Please check the disk space in the output directory.\n");
pairer -> is_internal_error = 1;
@@ -4404,7 +4495,7 @@ int fix_write_block(FILE * out, char * bin, int binlen, z_stream * strm){
int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
FILE * old_fp = pairer -> input_fp;
fseek(old_fp, 0, SEEK_SET);
- char tmpfname [300];
+ char tmpfname [300], readname[256];
sprintf(tmpfname, "%s.fixbam", pairer -> tmp_file_prefix);
@@ -4429,7 +4520,7 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
out_strm.next_in = Z_NULL;
deflateInit2(&out_strm, Z_NO_COMPRESSION, Z_DEFLATED,
- PAIRER_GZIP_WINDOW_BITS, PAIRER_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
+ PAIRER_GZIP_WINDOW_BITS, PAIRER_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
int disk_is_full = 0;
int in_bin_ptr = 0;
@@ -4437,7 +4528,7 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
int in_bin_size = 0;
int content_count = 0;
int content_size = 0;
- int x1, nch = 0;
+ int x1, nch = 0, is_longcigar = 0;
for(x1 = 0; x1 < 4; x1++){
FIX_GET_NEXT_NCH; // BAM1
@@ -4494,13 +4585,14 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
FIX_FLASH_OUT;
// ===== The reads
+ int seq_len = 0, name_len = 0, cigar_opts = 0;
unsigned long long reads =0;
pairer -> is_bad_format = 0;
- while(1){
+ while(! is_longcigar){
int block_size = 0, new_block_size;
- int seq_len = 0, name_len = 0, cigar_opts = 0;
char * block_size_ptr = out_bin + out_bin_ptr;
char * sqlen_ptr = NULL;
+ seq_len = 0, name_len = 0, cigar_opts = 0;
// block_length
FIX_GET_NEXT_NCH;
@@ -4512,16 +4604,7 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
block_size += (nch << (8 * x1));
}
- //#warning ">>>>>> COMMENT NEXT BLOCK <<<<<<"
- if(0){
- if(block_size > 65000)
- SUBREADprintf("Bsize=%d\n", block_size);
- }
- if(block_size > 60000 && !pairer -> tiny_mode){
- pairer -> is_bad_format = 1;
- SUBREADprintf("ERROR: the read record length (%d) is longer than the limit. The program has to terminate. \n", block_size);
- break;
- }else if(block_size + out_bin_ptr > 60000 && !pairer -> tiny_mode)
+ if(block_size + out_bin_ptr > 60000 && !pairer -> tiny_mode)
FIX_FLASH_OUT;
FIX_APPEND_READ(&block_size, 4);
@@ -4545,6 +4628,25 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
continue;
}
}
+
+ //#warning "+===================== REMOVE -59999 IN NEXT LINE ================"
+ //if(x1==32)SUBREADprintf("SEQ_LEN=%d, REC_LEN=%d\n", seq_len, block_size);
+ if( x1 == 32 && seq_len >= pairer -> long_read_minimum_length){
+ is_longcigar = 1;
+ int x2;
+ for(x2 = 0; x2 < name_len; x2++){
+ FIX_GET_NEXT_NCH;
+ readname[x2] = nch;
+ }
+ break;
+ }
+ if( x1 == 32 && block_size > 60000 ){
+ print_in_box(80,0,0,"");
+ print_in_box(80,0,0," ERROR: Alignment record is too long.");
+ print_in_box(80,0,0," Please use the long read mode.");
+ return -1;
+ }
+
char etag_name0 = -1, etag_name1, etag_type;
if(x1 == 32 + name_len + 4 * cigar_opts + seq_len + (seq_len+1)/2){
while(x1 < block_size){
@@ -4624,19 +4726,42 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
//SUBREADprintf("WR[%d]: %d = %c, SL=%d, RNL=%d, COP=%d\n", out_bin_ptr, nch, nch, seq_len, name_len, cigar_opts);
}
- seq_len = min(1, seq_len);
- sqlen_ptr[0]=seq_len; sqlen_ptr[1]=0, sqlen_ptr[2]=0; sqlen_ptr[3]=0;
- new_block_size = 32 + name_len + 4 * cigar_opts + seq_len + (seq_len+1)/2 + extag_new_len;
- //SUBREADprintf("ETAG_NLEN=%d, ETAGS=%d\n", new_block_size, extag_new_len);
- memcpy(block_size_ptr, &new_block_size, 4);
+ if(!is_longcigar){
+ seq_len = min(1, seq_len);
+ sqlen_ptr[0]=seq_len; sqlen_ptr[1]=0, sqlen_ptr[2]=0; sqlen_ptr[3]=0;
+ new_block_size = 32 + name_len + 4 * cigar_opts + seq_len + (seq_len+1)/2 + extag_new_len;
+ //SUBREADprintf("ETAG_NLEN=%d, ETAGS=%d\n", new_block_size, extag_new_len);
+ memcpy(block_size_ptr, &new_block_size, 4);
+ }
}else{
for(x1 = 0; x1 < block_size; x1++){
FIX_GET_NEXT_NCH;
if(nch < 0) return -1;
+
+ if(x1 == 8) name_len = nch;
+ else if(x1 >= 16 && x1 < 20){
+ seq_len += ( nch << (8 * (x1 - 16)));
+ if(x1 == 16) sqlen_ptr = out_bin + out_bin_ptr;
+ }else if(x1 == 12 || x1 == 13){
+ cigar_opts += ( nch << (8 * (x1 - 12)));
+ }
+
+ if(x1 == 32 && seq_len >= pairer -> long_read_minimum_length){
+ is_longcigar = 1;
+ int x2;
+ for(x2 = 0; x2 < name_len; x2++){
+ FIX_GET_NEXT_NCH;
+ readname[x2] = nch;
+ }
+ break;
+ }
+
FIX_APPEND_READ(&nch, 1);
}
}
+ //#warning "========= COMMENT NEXT ============="
+ //SUBREADprintf("OUTBIN_PTR=%d\n", out_bin_ptr);
reads ++;
if(out_bin_ptr > 60000){
FIX_FLASH_OUT;
@@ -4648,23 +4773,34 @@ int SAM_pairer_fix_format(SAM_pairer_context_t * pairer){
deflateEnd(&out_strm);
inflateEnd(&in_strm);
- fclose(old_fp);
fclose(new_fp);
- pairer -> input_fp = f_subr_open(tmpfname, "rb");
free(in_bin);
free(out_bin);
- if(disk_is_full)SUBREADprintf("ERROR: cannot write into the temporary file. Please check the empty space in the output directory.\n");
+ if(is_longcigar){
+ unlink(tmpfname);
+ pairer -> long_cigar_mode = 1;
+ pairer -> tiny_mode = 1;
+ if(0 && ! pairer -> is_single_end_mode){
+ print_in_box(80,0,0," Switch to long-read mode; reads, not read-pairs, will be counted.");
+ print_in_box(80,0,0," Read name: %s", readname);
+ print_in_box(80,0,0," It had %d cigar opts and %d bases, more than %d.", cigar_opts, seq_len, pairer -> long_read_minimum_length);
+ }
+ }else{
+ fclose(old_fp);
+ pairer -> input_fp = f_subr_open(tmpfname, "rb");
+ }
+ if(disk_is_full)SUBREADprintf("ERROR: cannot write into the temporary file. Please check the empty space in the output directory.\n");
return disk_is_full;
}
unsigned int nosort_tick_time = 100;
-#define NOSORT_SBAM_BUFF_SIZE 500000
-#define NOSORT_BIN_BUFF_SIZE (2*500100)
+#define NOSORT_SBAM_BUFF_SIZE 5000000
+#define NOSORT_BIN_BUFF_SIZE (2*5010000)
void * SAM_nosort_thread_run( void * params ){
@@ -4686,20 +4822,26 @@ void * SAM_nosort_thread_run( void * params ){
if(thread_context -> reads_in_SBAM > 1){
if(pairer -> input_is_BAM){
- int record_len;
+ int record_len, seq_len1 = 0, seq_len2 = 0;
// SUBREADprintf("LOAD BY THREAD %d:", thread_no);
memcpy(&record_len, thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr, 4);
// SUBREADprintf("RLEN=%d\n", record_len);
- assert(record_len > 32 &&record_len < 500000);
+ assert(record_len > 32 &&record_len < NOSORT_SBAM_BUFF_SIZE);
memcpy(read_ptr_1 , thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr, 4 + record_len);
+ memcpy(&seq_len1, thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr + 20, 4);
thread_context -> input_buff_SBAM_ptr += record_len + 4;
memcpy(&record_len, thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr, 4);
- assert(record_len > 32 &&record_len < 500000);
+ assert(record_len > 32 &&record_len < NOSORT_SBAM_BUFF_SIZE);
memcpy(read_ptr_2 , thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr, 4 + record_len);
+ memcpy(&seq_len2, thread_context -> input_buff_SBAM + thread_context -> input_buff_SBAM_ptr + 20, 4);
thread_context -> input_buff_SBAM_ptr += record_len + 4;
has_found = 1;
thread_context -> reads_in_SBAM -= 2;
+
+ if(seq_len1 >= pairer -> long_read_minimum_length || seq_len2 >= pairer -> long_read_minimum_length)
+ pairer -> long_cigar_mode = 1;
+
}else{
thread_context -> input_buff_BIN_ptr = 0;
int rret = reduce_SAM_to_BAM(pairer, thread_context , 0);
@@ -4718,7 +4860,7 @@ void * SAM_nosort_thread_run( void * params ){
subread_lock_release(&thread_context -> SBAM_lock);
if(has_found)
- pairer -> output_function(pairer, thread_no, NULL, (char*) read_ptr_1,(char*) read_ptr_2);
+ pairer -> output_function(pairer, thread_no, (char*) read_ptr_1,(char*) read_ptr_2);
else{
if(to_quit) break;
usleep(nosort_tick_time);
@@ -4811,7 +4953,7 @@ void SAM_nosort_run_once(SAM_pairer_context_t * pairer){
header_txt [x1] = nch;
}
- pairer -> output_header(pairer, 0, 1, pairer -> BAM_l_text , header_txt , pairer -> BAM_l_text );
+ int is_OK = pairer -> output_header(pairer, 0, 1, pairer -> BAM_l_text , header_txt , pairer -> BAM_l_text );
NOSORT_BAM_next_u32(pairer -> BAM_n_ref);
unsigned int ref_bin_len = 0;
for(x1 = 0; x1 < pairer -> BAM_n_ref; x1++) {
@@ -4833,9 +4975,14 @@ void SAM_nosort_run_once(SAM_pairer_context_t * pairer){
assert(ref_bin_len < pairer -> BAM_l_text);
}
- pairer -> output_header(pairer, 0, 0, pairer -> BAM_n_ref , header_txt , ref_bin_len );
+ is_OK = is_OK || pairer -> output_header(pairer, 0, 0, pairer -> BAM_n_ref , header_txt , ref_bin_len );
free(header_txt);
+ if(is_OK){
+ pairer -> is_incomplete_BAM = 1;
+ return;
+ }
+
while(1){
if(pairer -> is_finished) break;
int need_sleep = 1;
@@ -4956,9 +5103,16 @@ void SAM_nosort_run_once(SAM_pairer_context_t * pairer){
}
pairer -> BAM_header_parsed = 1;
- pairer -> output_header(pairer, 0, 0, header_contigs , header_bin , header_bin_ptr);
+ int is_OK = pairer -> output_header(pairer, 0, 0, header_contigs , header_bin , header_bin_ptr);
free(header_bin);
+ if(is_OK){
+ pairer -> is_incomplete_BAM = 1;
+ return;
+ }
+
+
+
fseek(pairer -> input_fp, passed_read_SBAM_ptr, SEEK_SET);
line_ptr = SBAM_buff;
@@ -5014,6 +5168,126 @@ void SAM_nosort_run_once(SAM_pairer_context_t * pairer){
}
}
+#define BINADD_NCHAR { if(binptr >= bin_buff_capacity - 10){\
+ bin_buff_capacity = bin_buff_capacity * 14 / 10;\
+ bin_buffer = realloc(bin_buffer, bin_buff_capacity);\
+ } bin_buffer[binptr++] = nch;}
+
+
+
+// only one thread; very large buffer size.
+int SAM_pairer_long_cigar_run(SAM_pairer_context_t * pairer){
+ char *bin_buffer, *bam_buffer;
+ FILE * old_fp = pairer -> input_fp;
+ int bin_buff_capacity = 1000000, block_size = 0;
+ char * in_bin = malloc(140000);
+ bin_buffer = malloc(bin_buff_capacity);
+ bam_buffer = malloc(70000);
+
+ z_stream in_strm;
+ in_strm.zalloc = Z_NULL;
+ in_strm.zfree = Z_NULL;
+ in_strm.opaque = Z_NULL;
+ in_strm.avail_in = 0;
+ in_strm.next_in = Z_NULL;
+
+ inflateInit2(&in_strm, PAIRER_GZIP_WINDOW_BITS);
+
+ fseek(old_fp, 0, SEEK_SET);
+
+ if(1){
+ int disk_is_full = 0;
+ int in_bin_ptr = 0;
+ int out_bin_ptr = 0;
+ int in_bin_size = 0;
+ int content_count = 0;
+ int content_size = 0;
+ int is_finished = 0;
+ int x1, nch = 0, is_longcigar = 0, binptr = 0;
+
+ for(x1 = 0; x1 < 4; x1++){
+ FIX_GET_NEXT_NCH; // BAM1
+ if(nch < 0) return -1;
+ }
+
+ // ====== The header texts
+ content_size = 0;
+ binptr = 0;
+ for(x1 = 0; x1 < 4; x1++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ content_size += (nch << (8 * x1));
+ }
+ for(content_count = 0; content_count < content_size; content_count++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ BINADD_NCHAR;
+ }
+
+ pairer -> output_header (pairer , 0, 1, binptr, bin_buffer, binptr);
+
+ // ====== The chromosome table
+ binptr = 0;
+ content_size = 0;
+ for(x1 = 0; x1 < 4; x1++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ content_size += (nch << (8 * x1));
+ }
+
+ for(content_count = 0; content_count < content_size; content_count++){
+ block_size = 0;
+ for(x1 = 0; x1 < 4; x1++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ BINADD_NCHAR;
+ block_size += (nch << (8 * x1));
+ }
+
+ for(x1 = 0; x1 < block_size + 4; x1++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ BINADD_NCHAR;
+ }
+ }
+ pairer -> output_header (pairer , 0, 0, content_size, bin_buffer, binptr);
+
+ // go through the reads
+ int reads = 0;
+ while(1){
+ binptr = 0;
+ block_size = 0;
+ for(x1 = 0; x1 < 4; x1++){
+ FIX_GET_NEXT_NCH;
+ if(x1 == 0 && nch < 0){
+ is_finished=1;
+ break;
+ }
+ if(nch < 0) return -1;
+
+ BINADD_NCHAR;
+ block_size += (nch << (8 * x1));
+ }
+ if(is_finished)break;
+
+ for(x1 = 0; x1 < block_size; x1 ++){
+ FIX_GET_NEXT_NCH;
+ if(nch < 0) return -1;
+ BINADD_NCHAR;
+ }
+
+ pairer -> output_function(pairer, 0, bin_buffer, NULL);
+ reads++;
+ }
+ }
+
+ free(bam_buffer);
+ free(bin_buffer);
+ free(in_bin);
+
+ return 0;
+}
+
int SAM_pairer_run( SAM_pairer_context_t * pairer){
int corrected_run;
@@ -5025,18 +5299,21 @@ int SAM_pairer_run( SAM_pairer_context_t * pairer){
if(pairer -> is_bad_format && pairer->input_is_BAM && ( ! pairer -> is_internal_error ) && ( ! pairer -> is_incomplete_BAM )){
//#warning ">>>>>> REMOVE '+ 1' FROM NEXT LINE IN RELEASE <<<<<<"
assert(1 != corrected_run);
- //#warning ">>>>>> COMMENT NEXT LINE IN RELEASE <<<<<<"
- //SUBREADprintf("Retrying with the corrected format...\n");
delete_with_prefix(pairer -> tmp_file_prefix);
pairer -> is_internal_error |= SAM_pairer_fix_format(pairer);
+ //#warning ">>>>>> COMMENT NEXT LINE IN RELEASE <<<<<<"
+ //SUBREADprintf("Retrying with the corrected format... (%d)\n", pairer -> is_bad_format);
+
if(pairer -> is_bad_format || pairer -> is_internal_error)
return -1;
SAM_pairer_reset(pairer);
pairer -> reset_output_function(pairer);
+
+ if(pairer -> long_cigar_mode) return SAM_pairer_long_cigar_run(pairer);
}else break;
}
- return pairer -> is_bad_format || pairer -> is_internal_error;
+ return pairer -> is_bad_format || pairer -> is_internal_error || pairer -> is_incomplete_BAM;
}
int sort_SAM_create(SAM_sort_writer * writer, char * output_file, char * tmp_path)
@@ -6027,6 +6304,35 @@ int probe_file_type_EX(char * fname, int * is_first_read_PE, long long * SAMBAM_
return ret;
}
+void warning_hash_hash(HashTable * t1, HashTable * t2, char * msg){
+ int buck_i, shown = 0;
+ for(buck_i = 0; buck_i < t1 -> numOfBuckets; buck_i++){
+ KeyValuePair * cursor = t1 -> bucketArray[buck_i];
+ while(cursor){
+ char * t1chro = (char *) cursor -> key;
+ int found = HashTableGet(t2, t1chro) != NULL;
+ if(!found) if(strlen(t1chro)>3 && t1chro[0]=='c'&&t1chro[1]=='h'&&t1chro[2]=='r' ) found = HashTableGet(t2, t1chro+3) != NULL;
+ if(!found) {
+ char tmp_t1chro [MAX_CHROMOSOME_NAME_LEN+1];
+ snprintf(tmp_t1chro, sizeof(tmp_t1chro), "chr%s", t1chro);
+ found = HashTableGet(t2, tmp_t1chro) != NULL;
+ }
+
+ if(!found){
+ if(!shown){
+ print_in_box(80,0,0,"");
+ print_in_box(80,0,0,"%s", msg);
+ shown = 1;
+ }
+ print_in_box(80,0,0," %s", t1chro);
+ }
+ cursor = cursor -> next;
+ }
+ }
+ if(shown) print_in_box(80,0,0,"");
+}
+
+
#ifdef MAKE_INPUTTEST
int main(int argc, char ** argv)
{
diff --git a/src/input-files.h b/src/input-files.h
index 46cedaa..fbbad50 100644
--- a/src/input-files.h
+++ b/src/input-files.h
@@ -105,7 +105,7 @@ typedef struct {
unsigned long long orphant_space;
z_stream strm;
- char immediate_last_read_bin[66000];
+ char immediate_last_read_bin[FC_LONG_READ_RECORD_HARDLIMIT];
char immediate_last_read_full_name[MAX_READ_NAME_LEN*2 +80 ];
int immediate_last_read_flags;
int immediate_last_read_bin_len;
@@ -123,6 +123,8 @@ typedef struct {
int is_bad_format;
int is_single_end_mode;
int force_do_not_sort;
+ int long_cigar_mode;
+ int long_read_minimum_length;
int is_finished;
int merge_level_finished;
int max_file_open_number;
@@ -153,7 +155,7 @@ typedef struct {
int is_internal_error;
void (* reset_output_function) (void * pairer);
- int (* output_function) (void * pairer, int thread_no, char * rname, char * bin1, char * bin2);
+ int (* output_function) (void * pairer, int thread_no, char * bin1, char * bin2);
int (* output_header) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len);
void (* unsorted_notification) (void * pairer, char * bin1, char * bin2); // it is called only once
// reserved for the application passing its own data to the output function.
@@ -187,6 +189,9 @@ typedef struct {
void fastq_64_to_33(char * qs);
+// the caller is in charge of deallocation
+char * memstrcpy(char * in);
+
int chars2color(char c1, char c2);
int genekey2color(char last_base,char * key);
@@ -300,19 +305,21 @@ unsigned long long geinput_file_offset( gene_input_t * input);
int probe_file_type_EX(char * fname, int * is_first_read_PE, long long * SAMBAM_header_length);
-int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_buff_size_per_thread, int BAM_input, int is_Tiny_Mode, int is_single_end_mode, int force_do_not_sort, int display_progress, char * in_file, void (* reset_output_function) (void * pairer), int (* output_header_function) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len), int (* output_function) (void * pairer, int thread_no, char * rname, char * bin1, char * bin2), [...]
+int SAM_pairer_create(SAM_pairer_context_t * pairer, int all_threads, int bin_buff_size_per_thread, int BAM_input, int is_Tiny_Mode, int is_single_end_mode, int force_do_not_sort, int display_progress, char * in_file, void (* reset_output_function) (void * pairer), int (* output_header_function) (void * pairer, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len), int (* output_function) (void * pairer, int thread_no, char * bin1, char * bin2), char * tmp_pat [...]
int SAM_pairer_run( SAM_pairer_context_t * pairer);
void SAM_pairer_destroy(SAM_pairer_context_t * pairer);
void SAM_pairer_writer_reset(void * pairer);
void SAM_pairer_set_unsorted_notification(SAM_pairer_context_t * pairer, void (* unsorted_notification) (void * pairer, char * bin1, char * bin2));
-int SAM_pairer_multi_thread_output( void * pairer, int thread_no, char * rname, char * bin1, char * bin2 );
+int SAM_pairer_multi_thread_output( void * pairer, int thread_no, char * bin1, char * bin2 );
int SAM_pairer_multi_thread_header (void * pairer_vp, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len);
int SAM_pairer_writer_create( SAM_pairer_writer_main_t * bam_main , int all_threads, int has_dummy , int BAM_output, int BAM_compression_level, char * out_file);
void SAM_pairer_writer_destroy( SAM_pairer_writer_main_t * bam_main ) ;
int SAM_pairer_iterate_int_tags(unsigned char * bin, int bin_len, char * tag_name, int * saved_value);
+int SAM_pairer_iterate_tags(unsigned char * bin, int bin_len, char * tag_name, char * data_type, char ** saved_value);
int SAM_pairer_warning_file_open_limit();
void *delay_realloc(void * old_pntr, size_t old_size, size_t new_size);
int is_comment_line(const char * l, int file_type, unsigned int lineno);
+void warning_hash_hash(HashTable * t1, HashTable * t2, char * msg);
#endif
diff --git a/src/makefile.version b/src/makefile.version
index 8b17cdd..dd7ddb4 100644
--- a/src/makefile.version
+++ b/src/makefile.version
@@ -1,4 +1,4 @@
-SUBREAD_VERSION_BASE=1.5.2
+SUBREAD_VERSION_BASE=1.5.3
SUBREAD_VERSION_DATE=$(SUBREAD_VERSION_BASE)-$(shell date +"%d%b%Y")
SUBREAD_VERSION="$(SUBREAD_VERSION_DATE)"
SUBREAD_VERSION="$(SUBREAD_VERSION_BASE)"
diff --git a/src/read-repair.c b/src/read-repair.c
index 8b70d82..ed877b0 100644
--- a/src/read-repair.c
+++ b/src/read-repair.c
@@ -120,7 +120,7 @@ int main_read_repair(int argc, char ** argv)
SUBREADprintf("Unable to open the output file. Program terminated.\n");
return -1;
}else{
- ret = SAM_pairer_create(&pairer, threads, memory, is_BAM, tiny_mode,0,0 , 1, in_BAM_file, SAM_pairer_writer_reset, SAM_pairer_multi_thread_header, SAM_pairer_multi_thread_output, rand_prefix, &writer_main);
+ ret = SAM_pairer_create(&pairer, threads, memory, is_BAM, tiny_mode,0,0 , 1, in_BAM_file, SAM_pairer_writer_reset, SAM_pairer_multi_thread_header, SAM_pairer_multi_thread_output, rand_prefix, &writer_main, 99999999);
if(ret){
SUBREADprintf("Unable to open the input file. Program terminated.\n");
return -1;
diff --git a/src/readSummary.c b/src/readSummary.c
index b5a0acc..13432d5 100644
--- a/src/readSummary.c
+++ b/src/readSummary.c
@@ -69,6 +69,18 @@ typedef struct{
} fc_junction_gene_t;
+#define MAXIMUM_INSERTION_IN_SECTION 8
+
+typedef struct {
+ char * chro;
+ unsigned int start_pos;
+ unsigned int chromosomal_length;
+ short insertions;
+ unsigned int insertion_start_pos[ MAXIMUM_INSERTION_IN_SECTION ];
+ unsigned short insertion_lengths[ MAXIMUM_INSERTION_IN_SECTION ];
+} CIGAR_interval_t;
+
+
typedef struct {
int space;
@@ -95,16 +107,18 @@ typedef struct {
typedef struct {
unsigned long long assigned_reads;
- unsigned long long unassigned_ambiguous;
- unsigned long long unassigned_multimapping;
- unsigned long long unassigned_nofeatures;
+
unsigned long long unassigned_unmapped;
unsigned long long unassigned_mappingquality;
- unsigned long long unassigned_fragmentlength;
unsigned long long unassigned_chimericreads;
+ unsigned long long unassigned_fragmentlength;
+ unsigned long long unassigned_duplicate;
+ unsigned long long unassigned_multimapping;
unsigned long long unassigned_secondary;
unsigned long long unassigned_junction_condition;
- unsigned long long unassigned_duplicate;
+ unsigned long long unassigned_nofeatures;
+ unsigned long long unassigned_overlapping_length;
+ unsigned long long unassigned_ambiguous;
} fc_read_counters;
typedef unsigned long long read_count_type_t;
@@ -115,6 +129,7 @@ typedef struct {
unsigned long long int all_reads;
//unsigned short current_read_length1;
//unsigned short current_read_length2;
+ unsigned int count_table_size;
read_count_type_t * count_table;
read_count_type_t unpaired_fragment_no;
unsigned int chunk_read_ptr;
@@ -132,9 +147,19 @@ typedef struct {
long hits_indices1 [MAX_HIT_NUMBER];
long hits_indices2 [MAX_HIT_NUMBER];
+ unsigned int proc_Starting_Chro_Points[65536];
+ unsigned short proc_Starting_Read_Points[65536];
+ unsigned short proc_Section_Read_Lengths[65536];
+ char * proc_ChroNames[65536];
+ char proc_Event_After_Section[65536];
+ CIGAR_interval_t proc_CIGAR_intervals_R1[65536], proc_CIGAR_intervals_R2[65536];
+
char ** scoring_buff_gap_chros;
unsigned int * scoring_buff_gap_starts;
unsigned short * scoring_buff_gap_lengths;
+ char * read_details_buff;
+ char * bam_compressed_buff;
+ int read_details_buff_used;
unsigned int scoring_buff_numbers[MAX_HIT_NUMBER * 2];
unsigned int scoring_buff_flags[MAX_HIT_NUMBER * 2];
@@ -142,13 +167,13 @@ typedef struct {
long scoring_buff_exon_ids[MAX_HIT_NUMBER * 2];
char * chro_name_buff;
- z_stream * strm_buffer;
+ z_stream bam_file_output_stream;
HashTable * junction_counting_table; // key: string chro_name \t last_base_previous_exont \t first_base_next_exon
HashTable * splicing_point_table;
+ HashTable * RG_table; // rg_name -> [ count_table, sum_fc_read_counters, junction_counting_table, splicing_point_table]
+ // NOTE: some reads have no RG tag. These reads are put into the tables in this object but not in the RG_table -> tables.
fc_read_counters read_counters;
-
- SamBam_Alignment aln_buffer;
} fc_thread_thread_context_t;
#define REVERSE_TABLE_BUCKET_LENGTH 131072
@@ -185,10 +210,14 @@ typedef struct {
int is_SEPEmix_warning_shown;
int is_unpaired_warning_shown;
int is_stake_warning_shown;
+ int is_read_too_long_to_SAM_BAM_shown;
int is_split_or_exonic_only;
int is_duplicate_ignored;
int is_first_read_reversed;
int is_second_read_straight;
+ int is_verbose;
+ int long_read_minimum_length;
+ int assign_reads_to_RG;
int use_stdin_file;
int disk_is_full;
int do_not_sort;
@@ -223,6 +252,7 @@ typedef struct {
int is_input_bad_format;
SamBam_Reference_Info * sambam_chro_table;
pthread_spinlock_t sambam_chro_table_lock;
+ pthread_spinlock_t read_details_out_lock;
SAM_pairer_context_t read_pairer;
@@ -235,7 +265,7 @@ typedef struct {
HashTable * junction_bucket_table;
fasta_contigs_t * fasta_contigs;
HashTable * gene_name_table; // gene_name -> gene_number
- HashTable * annot_chro_name_alias_table; // name in annotation file -> alias name
+ HashTable * BAM_chros_to_anno_table; // chromosome name in SAM/BAM input -> name in annotation file
char alias_file_name[300];
char input_file_name[300];
char * input_file_short_name;
@@ -265,7 +295,7 @@ typedef struct {
char * exontable_anno_chr_2ch;
long * exontable_anno_chr_heads;
- FILE * SAM_output_fp;
+ FILE * read_details_out_FP;
double start_time;
char * cmd_rebuilt;
@@ -628,7 +658,7 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
print_in_box(80,0,0," Summary : %s.summary", out);
print_in_box(80,0,0," Annotation : %s (%s)", annot, is_GTF?"GTF":"SAF");
if(isReadSummaryReport){
- print_in_box(80,0,0," Assignment details : <input_file>.featureCounts");
+ print_in_box(80,0,0," Assignment details : <input_file>.featureCounts%s", isReadSummaryReport == FILE_TYPE_BAM?".bam":(isReadSummaryReport == FILE_TYPE_SAM?".sam":""));
print_in_box(80,0,0," (Note that files are saved to the output directory)");
print_in_box(80,0,0,"");
}
@@ -674,6 +704,8 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
print_in_box(80,0,0," Read reduction to : %d' end" , global_context -> reduce_5_3_ends_to_one == REDUCE_TO_5_PRIME_END ?5:3);
if(global_context -> is_duplicate_ignored)
print_in_box(80,0,0," Duplicated Reads : ignored");
+ if(global_context -> long_read_minimum_length < 5000)
+ print_in_box(80,0,0," Long read mode : yes");
//print_in_box(80,0,0," Read orientations : %c%c", global_context->is_first_read_reversed?'r':'f', global_context->is_second_read_straight?'f':'r' );
if(global_context->is_paired_end_mode_assign)
@@ -693,8 +725,8 @@ int print_FC_configuration(fc_thread_global_context_t * global_context, char * a
print_in_box(80,0,0,"");
if( global_context -> max_BAM_header_size > 32 * 1024 * 1024 ){
}
- if(global_context->annot_chro_name_alias_table)
- print_in_box(80,0,0,"%ld chromosome name aliases are loaded.", global_context -> annot_chro_name_alias_table ->numOfElements);
+ if(global_context->BAM_chros_to_anno_table)
+ print_in_box(80,0,0,"%ld chromosome name aliases are loaded.", global_context -> BAM_chros_to_anno_table ->numOfElements);
free(sam_used);
return 0;
@@ -782,8 +814,8 @@ int locate_junc_features(fc_thread_global_context_t *global_context, char * chro
gene_info_list_t * list = NULL;
char bucket_key[CHROMOSOME_NAME_LENGTH + 20];
- if(global_context -> annot_chro_name_alias_table) {
- char * anno_chro_name = HashTableGet( global_context -> annot_chro_name_alias_table , chro);
+ if(global_context -> BAM_chros_to_anno_table) {
+ char * anno_chro_name = HashTableGet( global_context -> BAM_chros_to_anno_table , chro);
if(anno_chro_name){
sprintf(bucket_key, "%s:%u", anno_chro_name, pos - pos % JUNCTION_BUCKET_STEP);
list = HashTableGet(global_context -> junction_bucket_table, bucket_key);
@@ -978,7 +1010,7 @@ int load_feature_info(fc_thread_global_context_t *global_context, const char * a
if(strand_str == NULL)
ret_features[xk1].is_negative_strand = 0;
else
- ret_features[xk1].is_negative_strand = ('-' ==strand_str[0]);
+ ret_features[xk1].is_negative_strand = ('+' ==strand_str[0])?0:(('-' ==strand_str[0])?1:-1);
ret_features[xk1].sorted_order = xk1;
int bin_location = ret_features[xk1].start / REVERSE_TABLE_BUCKET_LENGTH;
@@ -1034,7 +1066,8 @@ int load_feature_info(fc_thread_global_context_t *global_context, const char * a
strtok_r(NULL,"\t", &token_temp);// score
- ret_features[xk1].is_negative_strand = ('-' == (strtok_r(NULL,"\t", &token_temp)[0]));//strand
+ char * strand_str = strtok_r(NULL,"\t", &token_temp);
+ ret_features[xk1].is_negative_strand = ('-' == strand_str[0])?1:( ('+' == strand_str[0])?0:-1 );//strand
ret_features[xk1].sorted_order = xk1;
strtok_r(NULL,"\t",&token_temp); // "frame"
char * extra_attrs = strtok_r(NULL,"\t",&token_temp); // name_1 "val1"; name_2 "val2"; ...
@@ -1511,6 +1544,18 @@ void print_read_wrapping(char * rl, int is_second){
}
+void disallocate_RG_tables(void * pt){
+ void ** t4 = pt;
+ free(t4[0]);
+ free(t4[1]);
+ if(t4[2]){
+ HashTableDestroy(t4[2]);
+ HashTableDestroy(t4[3]);
+ }
+ free(pt);
+}
+
+
void process_pairer_reset(void * pairer_vp){
SAM_pairer_context_t * pairer = (SAM_pairer_context_t *) pairer_vp;
fc_thread_global_context_t * global_context = (fc_thread_global_context_t * )pairer -> appendix1;
@@ -1530,8 +1575,6 @@ void process_pairer_reset(void * pairer_vp){
global_context -> thread_contexts[xk1].nreads_mapped_to_exon = 0;
global_context -> thread_contexts[xk1].unpaired_fragment_no = 0;
-
-
global_context -> thread_contexts[xk1].read_counters.unassigned_ambiguous = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_nofeatures = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_unmapped = 0;
@@ -1542,25 +1585,82 @@ void process_pairer_reset(void * pairer_vp){
global_context -> thread_contexts[xk1].read_counters.unassigned_secondary = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_junction_condition = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_duplicate = 0;
+ global_context -> thread_contexts[xk1].read_counters.unassigned_overlapping_length = 0;
global_context -> thread_contexts[xk1].read_counters.assigned_reads = 0;
+ global_context -> thread_contexts[xk1].read_details_buff_used = 0;
+
+ if(global_context -> do_junction_counting)
+ {
+ HashTableDestroy(global_context -> thread_contexts[xk1].junction_counting_table);
+ global_context -> thread_contexts[xk1].junction_counting_table = HashTableCreate(131317);
+ HashTableSetHashFunction(global_context -> thread_contexts[xk1].junction_counting_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(global_context -> thread_contexts[xk1].junction_counting_table, free, NULL);
+ HashTableSetKeyComparisonFunction(global_context -> thread_contexts[xk1].junction_counting_table, fc_strcmp_chro);
+
+ HashTableDestroy(global_context -> thread_contexts[xk1].splicing_point_table);
+ global_context -> thread_contexts[xk1].splicing_point_table = HashTableCreate(131317);
+ HashTableSetHashFunction(global_context -> thread_contexts[xk1].splicing_point_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(global_context -> thread_contexts[xk1].splicing_point_table, free, NULL);
+ HashTableSetKeyComparisonFunction(global_context -> thread_contexts[xk1].splicing_point_table, fc_strcmp_chro);
+ }
+
+ if(global_context -> assign_reads_to_RG){
+ HashTableDestroy(global_context -> thread_contexts[xk1].RG_table);
+ global_context -> thread_contexts[xk1].RG_table = HashTableCreate(97);
+ HashTableSetHashFunction(global_context -> thread_contexts[xk1].RG_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(global_context -> thread_contexts[xk1].RG_table, free, disallocate_RG_tables);
+ HashTableSetKeyComparisonFunction(global_context -> thread_contexts[xk1].RG_table, fc_strcmp_chro);
+ }
+
+
}
- if(global_context -> SAM_output_fp){
- ftruncate(fileno(global_context -> SAM_output_fp), 0);
- fseek(global_context -> SAM_output_fp, 0 , SEEK_SET);
+ if(global_context -> read_details_out_FP){
+ ftruncate(fileno(global_context -> read_details_out_FP), 0);
+ fseek(global_context -> read_details_out_FP, 0 , SEEK_SET);
}
}
-int process_pairer_header (void * pairer_vp, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len){
+int is_value_contig_name(char * n, int l){
+ int x;
+ for(x=0; x<l; x++){
+ if(n[x]==0)continue;
+ if(n[x]>'~' || n[x]<'!') return 0;
+ }
+ return 1;
+
+}
+int compress_read_detail_BAM(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, int write_start, int write_end, char * bam_buf);
+int process_pairer_header (void * pairer_vp, int thread_no, int is_text, unsigned int items, char * bin, unsigned int bin_len){
SAM_pairer_context_t * pairer = (SAM_pairer_context_t *) pairer_vp;
fc_thread_global_context_t * global_context = (fc_thread_global_context_t * )pairer -> appendix1;
+ fc_thread_thread_context_t * thread_context = global_context -> thread_contexts;
//SUBREADprintf("ENTER PROCESS (THRD %d): IS_TXT=%d, ITEMS = %d, CURRENT_ITEMS=%d\n", thread_no, is_text, items, global_context -> sambam_chro_table_items);
pthread_spin_lock(&global_context -> sambam_chro_table_lock);
- if( !is_text ){
+ if(global_context -> is_read_details_out == FILE_TYPE_BAM){
+ int write_cursor;
+ int first_block = 1;
+ for(write_cursor = 0; write_cursor < bin_len; write_cursor += 55000){
+ int wlen = min(55000, bin_len - write_cursor);
+
+ if( first_block ){
+ if(is_text)memcpy(thread_context -> read_details_buff, "BAM\1", 4);
+ memcpy(thread_context -> read_details_buff + (is_text?4:0), is_text?(&bin_len):(&items), 4);
+ }
+
+ memcpy(thread_context -> read_details_buff + (first_block?4*(1+is_text):0), bin + write_cursor, wlen);
+ int blen = compress_read_detail_BAM(global_context, thread_context, 0, wlen + (first_block?4*(1+is_text):0), thread_context -> bam_compressed_buff);
+ fwrite( thread_context -> bam_compressed_buff, 1, blen, global_context -> read_details_out_FP);
+ first_block = 0;
+ }
+ }else if( global_context -> is_read_details_out == FILE_TYPE_SAM && is_text ){
+ fwrite( bin, 1, bin_len, global_context -> read_details_out_FP);
+ }
+ if(!is_text ){
if(global_context -> sambam_chro_table)
global_context -> sambam_chro_table = delay_realloc(global_context -> sambam_chro_table, global_context -> sambam_chro_table_items * sizeof(SamBam_Reference_Info), (items + global_context -> sambam_chro_table_items) * sizeof(SamBam_Reference_Info));
else global_context -> sambam_chro_table = malloc(items * sizeof(SamBam_Reference_Info));
@@ -1569,8 +1669,16 @@ int process_pairer_header (void * pairer_vp, int thread_no, int is_text, unsigne
for(x1 = global_context -> sambam_chro_table_items; x1 < global_context -> sambam_chro_table_items+items; x1++){
int l_name;
memcpy(&l_name, bin + bin_ptr, 4);
- assert(l_name < MAX_CHROMOSOME_NAME_LEN);
bin_ptr += 4;
+
+ if( !is_value_contig_name(bin + bin_ptr, l_name)){
+ SUBREADprintf("The chromosome name contains unexpected characters: \"%s\" (%d chars)\nfeatureCounts has to stop running\n", bin + bin_ptr, l_name);
+ pthread_spin_unlock(&global_context -> sambam_chro_table_lock); return -1;
+ }
+ if(l_name >= MAX_CHROMOSOME_NAME_LEN){
+ SUBREADprintf("The chromosome name of \"%s\" contains %d characters, longer than the upper limit of %d\nfeatureCounts has to stop running\n", bin + bin_ptr , l_name, MAX_CHROMOSOME_NAME_LEN - 1);
+ pthread_spin_unlock(&global_context -> sambam_chro_table_lock); return -1;
+ }
memcpy(global_context -> sambam_chro_table[x1].chro_name , bin + bin_ptr, l_name);
//SUBREADprintf("The %d-th is '%s'\n", x1, global_context -> sambam_chro_table[x1].chro_name);
bin_ptr += l_name;
@@ -1635,106 +1743,6 @@ void make_dummy(char * rname, char * bin1, char * out_txt2, SamBam_Reference_In
mate_chro_str, max(0,mate_pos), HItagStr);
}
-
-void convert_bin_to_read(char * bin, char * txt, SamBam_Reference_Info * sambam_chro_table){
- unsigned int block_len;
- memcpy(&block_len, bin, 4);
- int ref_id;
- memcpy(&ref_id, bin + 4, 4);
- int pos;
- memcpy(&pos, bin + 8, 4);
- unsigned int bin_mq_nl;
- memcpy(&bin_mq_nl, bin + 12, 4);
- unsigned int flag_nc;
- memcpy(&flag_nc, bin + 16, 4);
- int l_seq;
- memcpy(&l_seq, bin + 20, 4);
- int next_refID;
- memcpy(&next_refID, bin + 24, 4);
- int next_pos;
- memcpy(&next_pos, bin + 28, 4);
- int tlen;
- memcpy(&tlen, bin + 32, 4);
-
- int txt_ptr = 0;
- int l_read_name = bin_mq_nl & 0xff;
- memcpy(txt , bin + 36, l_read_name);
- txt_ptr += l_read_name - 1;
- txt_ptr += sprintf(txt+txt_ptr, "\t%d", flag_nc >> 16);
- if(ref_id < 0){
- strcpy(txt+txt_ptr, "\t*\t0\t0");
- txt_ptr += 6;
- }else txt_ptr += sprintf(txt+txt_ptr, "\t%s\t%d\t%d", sambam_chro_table[ref_id].chro_name, pos + 1, (bin_mq_nl >> 8 & 0xff));
-
- int cigar_ops = flag_nc & 0xffff;
- if(cigar_ops < 1){
- strcpy(txt+txt_ptr, "\t*");
- txt_ptr += 2;
- }else{
- int x1;
- strcpy(txt+txt_ptr, "\t");
- txt_ptr++;
- for(x1=0; x1 < cigar_ops; x1++){
- unsigned int cigar_sec;
- memcpy(&cigar_sec, bin + 36 + l_read_name + 4 * x1 , 4);
- txt_ptr += sprintf(txt+txt_ptr, "%u%c", cigar_sec >> 4 , cigar_op_char( cigar_sec & 15 ));
- }
- }
-
- if(next_refID < 0)
- txt_ptr += sprintf(txt+txt_ptr, "\t*\t0\t%d", tlen);
- else txt_ptr += sprintf(txt+txt_ptr, "\t%s\t%d\t%d", sambam_chro_table[next_refID].chro_name, next_pos + 1, tlen);
- strcpy(txt+txt_ptr, "\tN\tI");
- txt_ptr += 4;
-
- int bin_ptr = 36 + l_read_name + 4 * cigar_ops + l_seq + (l_seq+1)/2;
-
- while(bin_ptr < block_len + 4){
- char tag_name[3];
- tag_name[0]=bin[bin_ptr];
- tag_name[1]=bin[bin_ptr+1];
- tag_name[2]=0;
-
- char tagtype = bin[bin_ptr+2];
- int delta = 0;
- int tmpi = 0;
- if(tagtype == 'i' || tagtype == 'I'){
- delta = 4;
- memcpy(&tmpi, bin + bin_ptr + 3, 4);
- txt_ptr += sprintf(txt+txt_ptr, "\t%s:i:%d", tag_name,tmpi);
- }else if(tagtype == 's' || tagtype == 'S'){
- delta = 2;
- memcpy(&tmpi, bin + bin_ptr + 3, 2);
- txt_ptr += sprintf(txt+txt_ptr, "\t%s:i:%d", tag_name,tmpi);
- }else if(tagtype == 'c' || tagtype == 'C'){
- delta = 1;
- memcpy(&tmpi, bin + bin_ptr + 3, 1);
- txt_ptr += sprintf(txt+txt_ptr, "\t%s:i:%d", tag_name,tmpi);
- }else if(tagtype == 'A'){
- delta = 1;
- txt_ptr += sprintf(txt+txt_ptr, "\t%s:%c:%c", tag_name, tagtype, *(bin + bin_ptr + 3));
- }else if(tagtype == 'f')
- delta = 4;
- else if(tagtype == 'Z' ||tagtype == 'H'){
- txt_ptr += sprintf(txt+txt_ptr, "\t%s:%c", tag_name, tagtype);
- while(bin[bin_ptr + 3+delta]){
- *(txt+txt_ptr) = bin[bin_ptr + 3+delta];
- txt_ptr ++;
- delta ++;
- }
- *(txt+txt_ptr) = 0;
- }else if(tagtype == 'B'){
- char celltype = bin[bin_ptr + 4];
- int cellitems ;
- memcpy(&cellitems, bin + bin_ptr + 5, 4);
- int celldelta = 1;
- if(celltype == 's' || celltype == 'S') celldelta = 2;
- else if(celltype == 'i' || celltype == 'I' || celltype == 'f') celldelta = 4;
- delta = cellitems * celldelta;
- }
- bin_ptr += 3 + delta;
- }
-}
int reverse_flag(int mf){
int ret = mf & 3;
if(mf & 4) ret |= 8;
@@ -1748,18 +1756,6 @@ int reverse_flag(int mf){
return ret;
}
-#define MAXIMUM_INSERTION_IN_SECTION 8
-
-typedef struct {
- char * chro;
- unsigned int start_pos;
- unsigned int chromosomal_length;
- short insertions;
- unsigned int insertion_start_pos[ MAXIMUM_INSERTION_IN_SECTION ];
- unsigned short insertion_lengths[ MAXIMUM_INSERTION_IN_SECTION ];
-} CIGAR_interval_t;
-
-
int calc_total_frag_one_len(CIGAR_interval_t * intvs, int intvn){
int ret = 0, x1;
for(x1 = 0; x1 < intvn; x1++){
@@ -1788,6 +1784,13 @@ int calc_total_frag_len( fc_thread_global_context_t * global_context, fc_thread_
// two reads are from different chromosomes
return calc_total_frag_one_len( CIGAR_intervals_R2,CIGAR_intervals_R2_sections ) + calc_total_frag_one_len( CIGAR_intervals_R1,CIGAR_intervals_R1_sections );
+ if(0&& FIXLENstrcmp("V0112_0155:7:1101:11874:24723", read_name)==0){
+ int xx;
+ for(xx = 0; xx < CIGAR_intervals_R1_sections; xx++)
+ SUBREADprintf("R1 SEC %d: %u + %d\n", xx, CIGAR_intervals_R1[xx].start_pos, CIGAR_intervals_R1[xx].chromosomal_length );
+ for(xx = 0; xx < CIGAR_intervals_R2_sections; xx++)
+ SUBREADprintf("R2 SEC %d: %u + %d\n", xx, CIGAR_intervals_R2[xx].start_pos, CIGAR_intervals_R2[xx].chromosomal_length );
+ }
unsigned int merged_section_count = 0;
unsigned int merged_section_start_positions[ MAXIMUM_INSERTION_IN_SECTION * 3 ];
@@ -2016,7 +2019,11 @@ int calc_total_frag_len( fc_thread_global_context_t * global_context, fc_thread_
return ret;
}
-void parse_bin(SamBam_Reference_Info * sambam_chro_table, char * bin, char * bin2, char ** read_name, int * flag, char ** chro, long * pos, int * mapq, char ** mate_chro, long * mate_pos, long * tlen, int * is_junction_read, int * cigar_sect, unsigned int * Starting_Chro_Points, unsigned short * Starting_Read_Points, unsigned short * Section_Read_Lengths, char ** ChroNames, char * Event_After_Section, int * NH_value, int max_M, CIGAR_interval_t * intervals_buffer, int * intervals_i){
+void get_readname_from_bin(char * bin, char ** read_name){
+ (*read_name) = bin + 36;
+}
+
+void parse_bin(SamBam_Reference_Info * sambam_chro_table, char * bin, char * bin2, char ** read_name, int * flag, char ** chro, long * pos, int * mapq, char ** mate_chro, long * mate_pos, long * tlen, int * is_junction_read, int * cigar_sect, unsigned int * Starting_Chro_Points, unsigned short * Starting_Read_Points, unsigned short * Section_Read_Lengths, char ** ChroNames, char * Event_After_Section, int * NH_value, int max_M, CIGAR_interval_t * intervals_buffer, int * intervals_i, int [...]
int x1, len_of_S1 = 0;
*cigar_sect = 0;
*NH_value = 1;
@@ -2160,6 +2167,12 @@ void parse_bin(SamBam_Reference_Info * sambam_chro_table, char * bin, char * bin
memcpy(&block_len, bin, 4);
int found_NH = SAM_pairer_iterate_int_tags((unsigned char *)bin+bin_ptr, block_len + 4 - bin_ptr, "NH", NH_value);
if(!found_NH) *(NH_value) = 1;
+
+ if(assign_reads_to_RG){
+ char RG_type = 0;
+ SAM_pairer_iterate_tags((unsigned char *)bin+bin_ptr, block_len + 4 - bin_ptr, "RG", &RG_type, RG_ptr);
+ if(RG_type != 'Z') (*RG_ptr) = NULL;
+ }
//SUBREADprintf("FOUND=%d, NH=%d, TAG=%.*s\n", found_NH, *(NH_value), 3 , bin+bin_ptr);
}else{
(*read_name) = bin2 + 36;
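Note on the hunk above: when assign_reads_to_RG is set, parse_bin looks for an "RG" tag in the optional fields of the BAM record and keeps it only if its type is 'Z'. The following is a minimal sketch of such a scan, assuming the optional-field layout visible in the removed convert_bin_to_read code (two tag characters, one type character, then the value). The names find_string_tag, fixed_width and le32 are hypothetical and are not part of Subread; this is not how SAM_pairer_iterate_tags is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte width of a fixed-size optional-field value, 0 if unknown. */
static size_t fixed_width(char type){
	switch(type){
		case 'A': case 'c': case 'C': return 1;
		case 's': case 'S': return 2;
		case 'i': case 'I': case 'f': return 4;
		default: return 0;
	}
}

/* Read a little-endian 32-bit value without relying on host byte order. */
static uint32_t le32(const unsigned char * p){
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical: return a pointer to the NUL-terminated value of a 'Z' tag,
 * or NULL if the tag is absent, has another type, or the data is malformed. */
const char * find_string_tag(const unsigned char * aux, size_t aux_len, const char * tag){
	assert(aux != NULL && tag != NULL && strlen(tag) == 2);
	size_t i = 0;
	while(i + 3 <= aux_len){
		char type = (char)aux[i + 2];
		const unsigned char * val = aux + i + 3;
		size_t remaining = aux_len - (i + 3);
		size_t vlen;
		if(type == 'Z' || type == 'H'){
			const unsigned char * nul = memchr(val, 0, remaining);
			if(nul == NULL) return NULL; /* unterminated string */
			if(type == 'Z' && aux[i] == (unsigned char)tag[0] && aux[i + 1] == (unsigned char)tag[1])
				return (const char *)val;
			vlen = (size_t)(nul - val) + 1;
		}else if(type == 'B'){
			/* array: one subtype byte, a 32-bit element count, then the elements */
			if(remaining < 5) return NULL;
			size_t w = fixed_width((char)val[0]);
			if(w == 0) return NULL;
			vlen = 5 + w * le32(val + 1);
		}else{
			vlen = fixed_width(type);
			if(vlen == 0) return NULL; /* unknown type: stop rather than guess */
		}
		if(vlen > remaining) return NULL;
		i += 3 + vlen;
	}
	return NULL;
}
```

A caller would pass the start of the optional fields and their length, much as the hunk passes bin+bin_ptr and block_len + 4 - bin_ptr, and treat a NULL result as "no usable read group".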
@@ -2218,13 +2231,13 @@ int calc_junctions_from_cigarInts(fc_thread_global_context_t * global_context, i
return ret;
}
-void add_fragment_supported_junction( fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, fc_junction_info_t * supported_junctions1,
- int njunc1, fc_junction_info_t * supported_junctions2, int njunc2);
+void add_fragment_supported_junction( fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, fc_junction_info_t * supported_junctions1, int njunc1, fc_junction_info_t * supported_junctions2, int njunc2, char * RG_name);
void process_line_junctions(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * bin1, char * bin2) {
fc_junction_info_t supported_junctions1[global_context -> max_M], supported_junctions2[global_context -> max_M];
int is_second_read, njunc1=0, njunc2=0, is_junction_read, cigar_sections;
int alignment_masks, mapping_qual, NH_value;
+ char *RG_ptr;
for(is_second_read = 0 ; is_second_read < 2; is_second_read++){
char * read_chr, *read_name, *mate_chr;
@@ -2235,8 +2248,9 @@ void process_line_junctions(fc_thread_global_context_t * global_context, fc_thre
char * ChroNames[global_context -> max_M];
char Event_After_Section[global_context -> max_M];
if(is_second_read && !global_context -> is_paired_end_mode_assign) break;
+ RG_ptr = NULL;
- parse_bin(global_context -> sambam_chro_table, is_second_read?bin2:bin1, is_second_read?bin1:bin2 , &read_name, &alignment_masks , &read_chr, &read_pos, &mapping_qual, &mate_chr, &mate_pos, &fragment_length, &is_junction_read, &cigar_sections, Starting_Chro_Points, Starting_Read_Points, Section_Read_Lengths, ChroNames, Event_After_Section, &NH_value, global_context -> max_M, NULL, NULL);
+ parse_bin(global_context -> sambam_chro_table, is_second_read?bin2:bin1, is_second_read?bin1:bin2 , &read_name, &alignment_masks , &read_chr, &read_pos, &mapping_qual, &mate_chr, &mate_pos, &fragment_length, &is_junction_read, &cigar_sections, Starting_Chro_Points, Starting_Read_Points, Section_Read_Lengths, ChroNames, Event_After_Section, &NH_value, global_context -> max_M, NULL, NULL, global_context -> assign_reads_to_RG, &RG_ptr);
assert(cigar_sections <= global_context -> max_M);
int * njunc_current = is_second_read?&njunc2:&njunc1;
@@ -2248,15 +2262,60 @@ void process_line_junctions(fc_thread_global_context_t * global_context, fc_thre
//}
}
if(njunc1 >0 || njunc2>0)
- add_fragment_supported_junction(global_context, thread_context, supported_junctions1, njunc1, supported_junctions2, njunc2);
+ add_fragment_supported_junction(global_context, thread_context, supported_junctions1, njunc1, supported_junctions2, njunc2, RG_ptr);
+
+}
+
+void ** get_RG_tables(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * rg_name){
+ void ** ret = HashTableGet(thread_context->RG_table, rg_name);
+ if(ret) return ret;
+
+ ret = malloc(sizeof(void *)*4);
+
+ ret[0] = malloc(thread_context -> count_table_size * sizeof(read_count_type_t));
+ ret[1] = malloc(sizeof(fc_read_counters));
+ memset(ret[0], 0, thread_context -> count_table_size * sizeof(read_count_type_t));
+ memset(ret[1], 0, sizeof(fc_read_counters));
+
+ if(global_context -> do_junction_counting){
+ HashTable * junction_counting_table = HashTableCreate(131317);
+ HashTableSetHashFunction(junction_counting_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(junction_counting_table, free, NULL);
+ HashTableSetKeyComparisonFunction(junction_counting_table, fc_strcmp_chro);
+
+ HashTable * splicing_point_table = HashTableCreate(131317);
+ HashTableSetHashFunction(splicing_point_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(splicing_point_table, free, NULL);
+ HashTableSetKeyComparisonFunction(splicing_point_table, fc_strcmp_chro);
+
+ ret [2] = junction_counting_table;
+ ret [3] = splicing_point_table;
+ }else ret[2] = NULL;
+
+ char * rg_name_mem = malloc(strlen(rg_name)+1);
+ strcpy(rg_name_mem, rg_name);
+ HashTablePut(thread_context->RG_table, rg_name_mem, ret);
+ return ret;
}
-int process_pairer_output(void * pairer_vp, int thread_no, char * rname, char * bin1, char * bin2){
+int process_pairer_output(void * pairer_vp, int thread_no, char * bin1, char * bin2){
SAM_pairer_context_t * pairer = (SAM_pairer_context_t *) pairer_vp;
fc_thread_global_context_t * global_context = (fc_thread_global_context_t * )pairer -> appendix1;
fc_thread_thread_context_t * thread_context = global_context -> thread_contexts + thread_no;
+
+ if(pairer -> long_cigar_mode){
+ if(global_context -> max_M < 65536){
+ //SUBREADprintf("SWITCHED INTO LONG-READ MODE\n");
+ global_context -> max_M = 65536;
+ }
+ if(!global_context->is_read_too_long_to_SAM_BAM_shown &&(global_context -> is_read_details_out == FILE_TYPE_SAM || global_context -> is_read_details_out == FILE_TYPE_BAM)){
+ global_context -> is_read_details_out = 0;
+ SUBREADprintf("ERROR: The read is too long to the SAM or BAM output.\nPlease use the 'CORE' mode for the assignment detail output.\n");
+ global_context->is_read_too_long_to_SAM_BAM_shown = 1;
+ }
+ }
//#warning "++++++ REMOVE THIS RETURN ++++++"
//return 0;
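Note on get_RG_tables in the hunk above: it returns the per-read-group tables for a name, creating zeroed tables on the first request and storing them under a private copy of the name. When do_junction_counting is off, only ret[2] is set to NULL; ret[3] keeps whatever malloc returned. A minimal sketch of the same get-or-create pattern follows, assuming a linked list in place of the Subread HashTable and a cut-down counter struct. rg_entry_t, rg_counters_t, get_rg_counters and free_rg_counters are hypothetical names. The sketch uses calloc so that every field starts out zero.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-in for fc_read_counters. */
typedef struct {
	unsigned long long assigned;
	unsigned long long unassigned_unmapped;
} rg_counters_t;

/* Hypothetical list node; the patch uses a hash table keyed by the RG name. */
typedef struct rg_entry {
	char * name;
	rg_counters_t counters;
	struct rg_entry * next;
} rg_entry_t;

/* Return the counters for rg_name, creating a zeroed entry on first use.
 * Returns NULL only if memory allocation fails. */
rg_counters_t * get_rg_counters(rg_entry_t ** head, const char * rg_name){
	assert(head != NULL && rg_name != NULL);
	for(rg_entry_t * e = *head; e != NULL; e = e -> next)
		if(strcmp(e -> name, rg_name) == 0) return &e -> counters;

	rg_entry_t * fresh = calloc(1, sizeof *fresh); /* all counters start at 0 */
	if(fresh == NULL) return NULL;
	size_t name_bytes = strlen(rg_name) + 1;
	fresh -> name = malloc(name_bytes);
	if(fresh -> name == NULL){ free(fresh); return NULL; }
	memcpy(fresh -> name, rg_name, name_bytes); /* own a copy: the caller's buffer may be reused */
	fresh -> next = *head;
	*head = fresh;
	return &fresh -> counters;
}

/* Release every entry and its copied name. */
void free_rg_counters(rg_entry_t * head){
	while(head != NULL){
		rg_entry_t * next = head -> next;
		free(head -> name);
		free(head);
		head = next;
	}
}
```

Copying the name matters here because the RG pointer handed out by parse_bin points into the record buffer, which is only valid while that record is being processed.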
@@ -2274,20 +2333,241 @@ int process_pairer_output(void * pairer_vp, int thread_no, char * rname, char *
void sort_bucket_table(fc_thread_global_context_t * global_context);
void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context,
long * hits_indices1, int nhits1, long * hits_indices2, int nhits2, unsigned int total_frag_len,
- char ** hits_chro1, char ** hits_chro2, unsigned int * hits_start_pos1, unsigned int * hits_start_pos2, unsigned short * hits_length1, unsigned short * hits_length2,
- int fixed_fractional_count, char * read_name);
+ char ** hits_chro1, char ** hits_chro2, unsigned int * hits_start_pos1, unsigned int * hits_start_pos2, unsigned short * hits_length1, unsigned short * hits_length2, int fixed_fractional_count, char * read_name, char * RG_name, char * bin1, char * bin2);
+
+void add_bin_new_tags(char * oldbin, char **newbin, char ** tags, char * types, void ** vals){
+ int new_tags_length = 0;
+ int tagi;
+ for(tagi = 0; tags[tagi]; tagi++){
+ char type = types[tagi];
+ if(type == 'i') new_tags_length += 7;
+ else new_tags_length += 4 + strlen((char *)vals[tagi]);
+ }
-int writesize_fprint(fc_thread_global_context_t * global_context, FILE * fp, const char * pattern, ...){
- int ret;
- va_list args;
- va_start(args , pattern);
- assert(fp);
+ int oldbin_len;
+ memcpy(&oldbin_len, oldbin, 4);
+ oldbin_len += 4;
- ret = vfprintf(fp, pattern , args);
- va_end(args);
+ int newbin_len = oldbin_len + new_tags_length;
+ (*newbin) = malloc(newbin_len);
+ memcpy(*newbin, oldbin, oldbin_len);
+ newbin_len -= 4;
+ memcpy(*newbin, &newbin_len, 4);
+ newbin_len += 4;
+
+ for(tagi = 0; tags[tagi]; tagi++){
+ memcpy( (*newbin) + oldbin_len, tags[tagi] ,2);
+ (*newbin)[oldbin_len+2] = types[tagi];
+ if(types[tagi] == 'i'){
+ int intv = vals[tagi] - NULL;
+ memcpy((*newbin) + oldbin_len + 3, &intv, 4);
+ oldbin_len += 7;
+ }else{
+ int vlen = strlen((char *)(vals[tagi]))+1;
+ memcpy((*newbin) + oldbin_len + 3, vals[tagi], vlen);
+ oldbin_len += 3 + vlen;
+ }
+ }
+}
+
+unsigned int FC_CRC32(char * dat, int len){
+ unsigned int crc0 = crc32(0, NULL, 0);
+ unsigned int ret = crc32(crc0, (unsigned char *)dat, len);
+ return ret;
+}
+
+
+
+int compress_read_detail_BAM(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, int write_start, int write_end, char * bam_buf){
+ if(global_context -> is_read_details_out == FILE_TYPE_SAM){
+ // there MUST be only one read in the buffer.
+ int write_ptr = write_start;
+ int tmplen = 0 ;
+ int sam_ptr = 0;
+ while(1){
+ if(write_ptr >= write_end) break;
+ memcpy(&tmplen, thread_context -> read_details_buff + write_ptr, 4);
+ tmplen +=4;
+ int txtlen = convert_BAM_binary_to_SAM(global_context -> sambam_chro_table, thread_context -> read_details_buff + write_ptr, bam_buf + sam_ptr);
+ bam_buf[sam_ptr + txtlen] = '\n';
+ bam_buf[sam_ptr + txtlen + 1] = 0;
+ sam_ptr += txtlen + 1;
+ write_ptr += tmplen;
+ }
+ return sam_ptr;
+
+ }else{
+ // there may be multiple reads in the buffer.
+ int write_ptr, bin_len = write_end - write_start;
+ char * compressed_buff = bam_buf + 18;
+
+ int compressed_size ;
+ unsigned int CRC32;
+ thread_context -> bam_file_output_stream.avail_out = 66600;
+ thread_context -> bam_file_output_stream.avail_in = bin_len;
+ //SUBREADprintf("COMPRESS PTR=%p , LEN=%d\n", thread_context -> read_details_buff + write_start , bin_len);
+ CRC32 = FC_CRC32(thread_context -> read_details_buff + write_start , bin_len);
+
+ int Z_DEFAULT_MEM_LEVEL = 8;
+ thread_context -> bam_file_output_stream.zalloc = Z_NULL;
+ thread_context -> bam_file_output_stream.zfree = Z_NULL;
+ thread_context -> bam_file_output_stream.opaque = Z_NULL;
+
+ deflateInit2(&thread_context -> bam_file_output_stream, bin_len?Z_BEST_SPEED:Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, Z_DEFAULT_MEM_LEVEL, Z_DEFAULT_STRATEGY);
+
+ thread_context -> bam_file_output_stream.next_in = (unsigned char*) thread_context -> read_details_buff + write_start;
+ thread_context -> bam_file_output_stream.next_out = (unsigned char*) compressed_buff;
+
+ deflate(&thread_context -> bam_file_output_stream, Z_FINISH);
+ deflateEnd(&thread_context -> bam_file_output_stream);
+
+ compressed_size = 66600 -thread_context -> bam_file_output_stream.avail_out;
+
+ bam_buf[0]=31;
+ bam_buf[1]=139;
+ bam_buf[2]=8;
+ bam_buf[3]=4;
+ memset(bam_buf+4, 0, 5);
+ bam_buf[9] = 0xff; // OS
+
+ int tmpi = 6;
+ memcpy(bam_buf+10, &tmpi, 2); //XLSN
+ bam_buf[12]=66; // SI1
+ bam_buf[13]=67; // SI2
+ tmpi = 2;
+ memcpy(bam_buf+14, &tmpi, 2); //BSIZE
+ tmpi = compressed_size + 19 + 6;
+ memcpy(bam_buf+16, &tmpi, 2); //BSIZE
+
+ memcpy(bam_buf+18+compressed_size, &CRC32, 4);
+ memcpy(bam_buf+18+compressed_size+4, &bin_len, 4);
+ return 18+compressed_size+8;
+ }
+}
+
+void write_read_detailed_remainder(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context){
+ int write_bin_ptr = 0;
+ int last_written_ptr = 0;
+ int bam_compressed_buff_ptr = 0;
+
+ if(thread_context -> read_details_buff_used <1)return;
+
+ if(global_context -> is_read_details_out == FILE_TYPE_BAM && thread_context -> read_details_buff_used < 64000){
+ bam_compressed_buff_ptr = compress_read_detail_BAM(global_context, thread_context, 0, thread_context -> read_details_buff_used, thread_context -> bam_compressed_buff);
+ }else while(1){
+ if(write_bin_ptr >= thread_context -> read_details_buff_used ) break;
+ int tmplen = 0;
+ memcpy(&tmplen, thread_context -> read_details_buff + write_bin_ptr, 4);
+ if(tmplen < 9 || tmplen > 3*MAX_FC_READ_LENGTH){
+ SUBREADprintf("ERROR: Format error : len = %d\n", tmplen);
+ //oexit(-1);
+ return ;
+ }
+ tmplen +=4;
+ write_bin_ptr += tmplen;
+ if(write_bin_ptr - last_written_ptr > 64000 || write_bin_ptr >= thread_context -> read_details_buff_used || global_context -> is_read_details_out == FILE_TYPE_SAM){
+ bam_compressed_buff_ptr += compress_read_detail_BAM(global_context, thread_context, last_written_ptr, write_bin_ptr, thread_context -> bam_compressed_buff + bam_compressed_buff_ptr);
+ last_written_ptr = write_bin_ptr;
+ }
+ }
+ pthread_spin_lock(&global_context -> read_details_out_lock);
+ fwrite(thread_context -> bam_compressed_buff, 1, bam_compressed_buff_ptr , global_context -> read_details_out_FP);
+ pthread_spin_unlock(&global_context -> read_details_out_lock);
+ thread_context -> read_details_buff_used =0;
+}
+
+
+int add_read_detail_bin_buff(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * bin, int do_write){
+ int binlen=0;
+
+ memcpy(&binlen, bin, 4);
+ binlen += 4;
+ if(binlen > MAX_FC_READ_LENGTH * 3){
+ if(!global_context->is_read_too_long_to_SAM_BAM_shown){
+ SUBREADprintf("ERROR: The read is too long to the SAM or BAM output.\nPlease use the 'CORE' mode for the assignment detail output.\n");
+ global_context->is_read_too_long_to_SAM_BAM_shown = 1;
+ }
+ return -1;
+ }
+
+ memcpy(thread_context -> read_details_buff + thread_context -> read_details_buff_used, bin, binlen);
+ thread_context -> read_details_buff_used += binlen;
+
+ if(do_write){
+ if(global_context -> is_read_details_out == FILE_TYPE_SAM || thread_context -> read_details_buff_used >= 55000) write_read_detailed_remainder(global_context, thread_context);
+ }
+ return 0;
+}
+
+int write_read_details_FP(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * status, int feature_count, char * features, char * bin1, char * bin2){
+ int ret = 1;
+
+ char * read_name;
+
+ if(global_context -> is_read_details_out == FILE_TYPE_RSUBREAD){
+ get_readname_from_bin(bin1?bin1:bin2, &read_name);
+ fprintf(global_context -> read_details_out_FP, "%s\t%s\t%d\t%s\n", read_name, status, feature_count, features?features:"NA");
+ }else{
+ char * out_bin1 = NULL, *out_bin2 = NULL;
+ char * tags[4];
+ char types[4];
+ void * vals[4];
+
+ tags[0]="XS";
+ tags[1]=feature_count >0?"XN":NULL;
+ tags[2]=feature_count >0?"XT":NULL;
+ tags[3]=NULL;
+ types[0]='Z';
+ types[1]='i';
+ types[2]='Z';
+ vals[0]=status;
+ vals[1]=NULL+feature_count;
+ vals[2]=features;
+
+ if(bin1){
+ add_bin_new_tags(bin1, &out_bin1, tags, types, vals);
+ add_read_detail_bin_buff(global_context, thread_context, out_bin1, bin2 == NULL);
+ free(out_bin1);
+ }
+
+ if(bin2){
+ add_bin_new_tags(bin2, &out_bin2, tags, types, vals);
+ add_read_detail_bin_buff(global_context, thread_context, out_bin2, 1);
+ free(out_bin2);
+ }
+ }
if(ret < 1) global_context -> disk_is_full = 1;
return ret;
+}
+
+void warning_anno_BAM_chromosomes(fc_thread_global_context_t * global_context){
+ int x1;
+ HashTable * BAM_chro_tab = HashTableCreate(1117);
+ HashTableSetHashFunction(BAM_chro_tab,HashTableStringHashFunction);
+ HashTableSetKeyComparisonFunction(BAM_chro_tab,fc_strcmp_chro );
+
+ for(x1 = 0; x1 < global_context -> sambam_chro_table_items; x1++){
+ char * BAM_chro = global_context -> sambam_chro_table[x1].chro_name;
+ if( global_context -> BAM_chros_to_anno_table){
+ char * tmp_chro = HashTableGet(global_context -> BAM_chros_to_anno_table, global_context -> sambam_chro_table[x1].chro_name);
+ if(tmp_chro) BAM_chro = tmp_chro;
+ }
+ HashTablePut(BAM_chro_tab, BAM_chro, NULL+1);
+ }
+
+ HashTable * ANNO_chro_tab = HashTableCreate(1117);
+ HashTableSetHashFunction(ANNO_chro_tab,HashTableStringHashFunction);
+ HashTableSetKeyComparisonFunction(ANNO_chro_tab,fc_strcmp_chro );
+ for(x1 = 0 ; x1 < global_context -> exontable_exons ; x1++)
+ HashTablePut(ANNO_chro_tab, global_context -> exontable_chr[x1], NULL+1);
+
+ if(global_context -> is_verbose){
+ warning_hash_hash(ANNO_chro_tab, BAM_chro_tab, "Chromosomes/contigs in annotation but not in input file");
+ warning_hash_hash(BAM_chro_tab, ANNO_chro_tab, "Chromosomes/contigs in input file but not in annotation");
+ }
+ HashTableDestroy(BAM_chro_tab);
+ HashTableDestroy(ANNO_chro_tab);
}
void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, char * bin1, char * bin2)
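Note on compress_read_detail_BAM in the hunk above: for BAM output it wraps each deflated chunk in a BGZF block, that is, an 18-byte gzip header with a "BC" extra subfield, the compressed data, then the CRC32 and the uncompressed length. The two 16-bit fields written at offsets 14 and 16 are both commented as BSIZE in the patch; in the BGZF layout the first is the subfield length (always 2) and the second is the total block size minus 1, which is what compressed_size + 19 + 6 evaluates to. A minimal sketch of the framing follows, assuming the compressed bytes already sit at offset 18. bgzf_wrap, put_le16 and put_le32 are hypothetical names, and the sketch writes each byte explicitly instead of copying the low bytes of an int as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BGZF_HEADER_LEN = 18, BGZF_FOOTER_LEN = 8 };

/* Store integers as little-endian bytes regardless of host byte order. */
static void put_le16(unsigned char * p, uint16_t v){
	p[0] = (unsigned char)(v & 0xff);
	p[1] = (unsigned char)(v >> 8);
}

static void put_le32(unsigned char * p, uint32_t v){
	p[0] = (unsigned char)(v & 0xff);
	p[1] = (unsigned char)((v >> 8) & 0xff);
	p[2] = (unsigned char)((v >> 16) & 0xff);
	p[3] = (unsigned char)(v >> 24);
}

/* Hypothetical: frame cdata_len bytes of raw-deflate output, already stored at
 * block + 18, as one BGZF block. Returns the total block length in bytes. */
size_t bgzf_wrap(unsigned char * block, size_t cdata_len, uint32_t crc32_of_input, uint32_t input_len){
	size_t total = BGZF_HEADER_LEN + cdata_len + BGZF_FOOTER_LEN;
	assert(block != NULL && total <= 65536); /* BSIZE must fit in 16 bits */

	/* ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS, XLEN = 6 */
	static const unsigned char fixed[12] = { 31, 139, 8, 4, 0, 0, 0, 0, 0, 255, 6, 0 };
	memcpy(block, fixed, sizeof fixed);
	block[12] = 'B';                            /* SI1 */
	block[13] = 'C';                            /* SI2 */
	put_le16(block + 14, 2);                    /* SLEN */
	put_le16(block + 16, (uint16_t)(total - 1)); /* BSIZE */

	put_le32(block + BGZF_HEADER_LEN + cdata_len, crc32_of_input);
	put_le32(block + BGZF_HEADER_LEN + cdata_len + 4, input_len);
	return total;
}
```

With the two-byte empty deflate stream 03 00 as payload and zero CRC and length, this yields the 28-byte block that BGZF uses as its end-of-file marker.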
@@ -2304,19 +2584,23 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
unsigned int total_frag_len =0;
int cigar_sections, is_junction_read;
- unsigned int Starting_Chro_Points[global_context -> max_M];
- unsigned short Starting_Read_Points[global_context -> max_M];
- unsigned short Section_Read_Lengths[global_context -> max_M];
- char * ChroNames[global_context -> max_M];
- char Event_After_Section[global_context -> max_M];
+ unsigned int * Starting_Chro_Points = thread_context -> proc_Starting_Chro_Points;
+ unsigned short * Starting_Read_Points = thread_context -> proc_Starting_Read_Points;
+ unsigned short * Section_Read_Lengths = thread_context -> proc_Section_Read_Lengths;
+ char ** ChroNames = thread_context -> proc_ChroNames;
+ char * Event_After_Section = thread_context -> proc_Event_After_Section;
+
+ CIGAR_interval_t * CIGAR_intervals_R1 = thread_context -> proc_CIGAR_intervals_R1;
+ CIGAR_interval_t * CIGAR_intervals_R2 = thread_context -> proc_CIGAR_intervals_R2;
int is_second_read;
int maximum_NH_value = 1, NH_value;
int skipped_for_exonic = 0;
int first_read_quality_score = 0, CIGAR_intervals_R1_sections = 0, CIGAR_intervals_R2_sections = 0;
- CIGAR_interval_t CIGAR_intervals_R1 [ global_context -> max_M ];
- CIGAR_interval_t CIGAR_intervals_R2 [ global_context -> max_M ];
+ if(thread_context -> thread_id == 0 && thread_context -> all_reads < 1){
+ warning_anno_BAM_chromosomes(global_context);
+ }
if(global_context -> need_calculate_overlap_len ){
memset( CIGAR_intervals_R1, 0, sizeof(CIGAR_interval_t) * global_context -> max_M );
@@ -2327,16 +2611,20 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
//if(thread_context->all_reads>1000000) printf("TA=%llu\n%s\n",thread_context->all_reads, thread_context -> line_buffer1);
+ char * RG_ptr;
for(is_second_read = 0 ; is_second_read < 2; is_second_read++)
{
if(is_second_read && !global_context -> is_paired_end_mode_assign) break;
- parse_bin(global_context -> sambam_chro_table, is_second_read?bin2:bin1, is_second_read?bin1:bin2 , &read_name, &alignment_masks , &read_chr, &read_pos, &mapping_qual, &mate_chr, &mate_pos, &fragment_length, &is_junction_read, &cigar_sections, Starting_Chro_Points, Starting_Read_Points, Section_Read_Lengths, ChroNames, Event_After_Section, &NH_value, global_context -> max_M , global_context -> need_calculate_overlap_len?(is_second_read?CIGAR_intervals_R2:CIGAR_intervals_R1):NULL, is_s [...]
+ RG_ptr = NULL;
+ parse_bin(global_context -> sambam_chro_table, is_second_read?bin2:bin1, is_second_read?bin1:bin2 , &read_name, &alignment_masks , &read_chr, &read_pos, &mapping_qual, &mate_chr, &mate_pos, &fragment_length, &is_junction_read, &cigar_sections, Starting_Chro_Points, Starting_Read_Points, Section_Read_Lengths, ChroNames, Event_After_Section, &NH_value, global_context -> max_M , global_context -> need_calculate_overlap_len?(is_second_read?CIGAR_intervals_R2:CIGAR_intervals_R1):NULL, is_s [...]
+
+ if(global_context -> assign_reads_to_RG && NULL == RG_ptr)return;
// SUBREADprintf(" RNAME=%s\n", read_name);
//#warning "==================== REMOVE WHEN RELEASE ========================"
- //if(global_context -> SAM_output_fp)
- // fprintf(global_context -> SAM_output_fp, "SAMDEBUG: %s\t\t%s, %ld\n", read_name, read_chr, read_pos);
+ //if(global_context -> read_details_out_FP)
+ // fprintf(global_context -> read_details_out_FP, "SAMDEBUG: %s\t\t%s, %ld\n", read_name, read_chr, read_pos);
if(is_second_read == 0)
{
//skip the read if unmapped (its mate will be skipped as well if paired-end)
@@ -2344,10 +2632,16 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
((alignment_masks & SAM_FLAG_UNMAPPED) && (alignment_masks & SAM_FLAG_MATE_UNMATCHED) && global_context -> is_paired_end_mode_assign) ||
(((alignment_masks & SAM_FLAG_UNMAPPED) || (alignment_masks & SAM_FLAG_MATE_UNMATCHED)) && global_context -> is_paired_end_mode_assign && global_context -> is_both_end_required)
){
- thread_context->read_counters.unassigned_unmapped ++;
+
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_unmapped++;
+ }else
+ thread_context->read_counters.unassigned_unmapped ++;
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Unmapped\t*\t*\n", read_name);
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context , thread_context ,"Unassigned_Unmapped",0, NULL, bin1, bin2);
return; // do nothing if a read is unmapped, or the first read in a pair of reads is unmapped.
}
@@ -2368,9 +2662,9 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
{
thread_context->read_counters.unassigned_mappingquality ++;
- if(global_context -> SAM_output_fp)
+ if(global_context -> read_details_out_FP)
{
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_MappingQuality\t*\tMapping_Quality=%d,%d\n", read_name, first_read_quality_score, mapping_qual);
+ write_read_details_FP(global_context, thread_context, "Unassigned_MappingQuality", 0, NULL, bin1, bin2);
}
return;
}
@@ -2397,18 +2691,28 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
//^^^^^^^^^^^^^^^^^^^^ They are directly compared because they are both pointers in the same contig name table.
//
if(global_context -> is_PE_distance_checked && ((fragment_length > global_context -> max_paired_end_distance) || (fragment_length < global_context -> min_paired_end_distance))) {
- thread_context->read_counters.unassigned_fragmentlength ++;
-
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_FragmentLength\t*\tLength=%ld\n", read_name, fragment_length);
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_fragmentlength++;
+ }else
+ thread_context->read_counters.unassigned_fragmentlength ++;
+
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, "Unassigned_FragmentLength", -1, NULL, bin1, bin2);
return;
}
} else {
if(global_context -> is_chimertc_disallowed) {
- thread_context->read_counters.unassigned_chimericreads ++;
-
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Chimera\t*\t*\n", read_name);
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_chimericreads++;
+ }else
+ thread_context->read_counters.unassigned_chimericreads ++;
+
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, "Unassigned_Chimera", -1, NULL, bin1, bin2);
return;
}
}
@@ -2421,9 +2725,13 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
{
if(alignment_masks & SAM_FLAG_DUPLICATE)
{
- thread_context->read_counters.unassigned_duplicate ++;
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Duplicate\t*\t*\n", read_name);
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_duplicate++;
+ }else thread_context->read_counters.unassigned_duplicate ++;
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, "Unassigned_Duplicate", -1, NULL, bin1, bin2);
return;
}
@@ -2437,10 +2745,14 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
{
// now it is a NH>1 read!
// not allow multimapping -> discard!
- thread_context->read_counters.unassigned_multimapping ++;
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_multimapping++;
+ }else thread_context->read_counters.unassigned_multimapping ++;
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_MultiMapping\t*\t*\n", read_name);
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, "Unassigned_MultiMapping", -1, NULL, bin1, bin2);
return;
}
@@ -2451,10 +2763,14 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
// if a pair of reads have one secondary, the entire fragment is seen as secondary.
if((alignment_masks & SAM_FLAG_SECONDARY_MAPPING) && (global_context -> is_multi_mapping_allowed == ALLOW_PRIMARY_MAPPING))
{
- thread_context->read_counters.unassigned_secondary ++;
-
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Secondary\t*\t*\n", read_name);
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_secondary++;
+ }else thread_context->read_counters.unassigned_secondary ++;
+
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, "Unassigned_Secondary", -1, NULL, bin1, bin2);
return;
}
@@ -2482,19 +2798,27 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
skipped_for_exonic ++;
if(skipped_for_exonic == 1 + global_context -> is_paired_end_mode_assign){
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
-
- thread_context->read_counters.unassigned_junction_condition ++;
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context, (global_context->is_split_or_exonic_only == 2)?"Unassigned_Hasjunction":"Unassigned_Nonjunction", -1, NULL, bin1, bin2);
+
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_junction_condition++;
+ }else thread_context->read_counters.unassigned_junction_condition ++;
return;
}
}
if(global_context->is_split_or_exonic_only == 2 && is_junction_read) {
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_%s\t*\t*\n", read_name, (global_context->is_split_or_exonic_only == 2)?"Hasjunction":"Nonjunction");
- thread_context->read_counters.unassigned_junction_condition ++;
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context,(global_context->is_split_or_exonic_only == 2)?"Unassigned_Hasjunction":"Unassigned_Nonjunction", -1, NULL, bin1, bin2);
+ if(RG_ptr){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_ptr);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_junction_condition++;
+ }else thread_context->read_counters.unassigned_junction_condition ++;
return;
}
@@ -2503,21 +2827,6 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
//for(cigar_section_id = 0; cigar_section_id<cigar_sections; cigar_section_id++)
// SUBREADprintf("BCCC: %llu , sec[%d] %s: %u ~ %u ; secs=%d ; flags=%d ; second=%d\n", read_pos, cigar_section_id , ChroNames[cigar_section_id] , Starting_Chro_Points[cigar_section_id], Section_Lengths[cigar_section_id], cigar_sections, alignment_masks, is_second_read);
- if(global_context -> reduce_5_3_ends_to_one)
- {
- if((REDUCE_TO_5_PRIME_END == global_context -> reduce_5_3_ends_to_one) + is_this_negative_strand == 1) // reduce to 5' end (small coordinate if positive strand / large coordinate if negative strand)
- {
- Section_Read_Lengths[0]=1;
- }
- else
- {
- Starting_Chro_Points[0] = Starting_Chro_Points[cigar_sections-1] + Section_Read_Lengths[cigar_sections-1] - 1;
- Section_Read_Lengths[0]=1;
- }
-
- cigar_sections = 1;
- }
-
// Extending the reads to the 3' and 5' ends. (from the read point of view)
if(global_context -> five_end_extension)
{
@@ -2557,6 +2866,24 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
}
+ if(global_context -> reduce_5_3_ends_to_one)
+ {
+ if((REDUCE_TO_5_PRIME_END == global_context -> reduce_5_3_ends_to_one) + is_this_negative_strand == 1) // reduce to 5' end (small coordinate if positive strand / large coordinate if negative strand)
+ {
+ Section_Read_Lengths[0]=1;
+ }
+ else
+ {
+ Starting_Chro_Points[0] = Starting_Chro_Points[cigar_sections-1] + Section_Read_Lengths[cigar_sections-1] - 1;
+ Section_Read_Lengths[0]=1;
+ }
+
+ cigar_sections = 1;
+ }
+
+
+
+
for(cigar_section_id = 0; cigar_section_id<cigar_sections; cigar_section_id++)
{
long section_begin_pos = Starting_Chro_Points[cigar_section_id];
@@ -2572,9 +2899,9 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
fc_chromosome_index_info * this_chro_info = HashTableGet(global_context -> exontable_chro_table, ChroNames[cigar_section_id]);
if(this_chro_info == NULL)
{
- if(global_context -> annot_chro_name_alias_table)
+ if(global_context -> BAM_chros_to_anno_table)
{
- char * anno_chro_name = HashTableGet( global_context -> annot_chro_name_alias_table , ChroNames[cigar_section_id]);
+ char * anno_chro_name = HashTableGet( global_context -> BAM_chros_to_anno_table , ChroNames[cigar_section_id]);
if(anno_chro_name)
this_chro_info = HashTableGet(global_context -> exontable_chro_table, anno_chro_name);
}
@@ -2671,7 +2998,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
if(global_context -> need_calculate_fragment_len )
total_frag_len = calc_total_frag_len( global_context, thread_context, CIGAR_intervals_R1, CIGAR_intervals_R1_sections, CIGAR_intervals_R2, CIGAR_intervals_R2_sections , read_name);
- //SUBREADprintf("FRAGLEN: %s %d\n", read_name, total_frag_len);
+ //SUBREADprintf("FRAGLEN: %s %d; CIGARS=%d,%d\n", read_name, total_frag_len, CIGAR_intervals_R1_sections,CIGAR_intervals_R2_sections);
int fixed_fractional_count = global_context -> use_fraction_multi_mapping ?calc_fixed_fraction(maximum_NH_value): NH_FRACTION_INT;
@@ -2681,7 +3008,7 @@ void process_line_buffer(fc_thread_global_context_t * global_context, fc_thread_
vote_and_add_count(global_context, thread_context,
hits_indices1, nhits1, hits_indices2, nhits2, total_frag_len,
hits_chro1, hits_chro2, hits_start_pos1, hits_start_pos2, hits_length1, hits_length2,
- fixed_fractional_count, read_name);
+ fixed_fractional_count, read_name, RG_ptr, bin1, bin2);
return;
}
@@ -2717,11 +3044,22 @@ int count_bitmap_overlapping(char * x1_bitmap, unsigned short rl){
return ret;
}
-void add_fragment_supported_junction( fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, fc_junction_info_t * supported_junctions1,
- int njunc1, fc_junction_info_t * supported_junctions2, int njunc2){
+void add_fragment_supported_junction( fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context, fc_junction_info_t * supported_junctions1, int njunc1, fc_junction_info_t * supported_junctions2, int njunc2, char * RG_name){
assert(njunc1 >= 0 && njunc1 <= global_context -> max_M -1 );
assert(njunc2 >= 0 && njunc2 <= global_context -> max_M -1 );
int x1,x2, in_total_junctions = njunc2 + njunc1;
+
+ HashTable * junction_counting_table, *splicing_point_table;
+
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ junction_counting_table = tab4s[2];
+ splicing_point_table = tab4s[3];
+ }else{
+ junction_counting_table = thread_context -> junction_counting_table;
+ splicing_point_table = thread_context -> splicing_point_table;
+ }
+
for(x1 = 0; x1 < in_total_junctions; x1 ++){
fc_junction_info_t * j_one = (x1 >= njunc1)?supported_junctions2+(x1-njunc1):(supported_junctions1+x1);
if(j_one->chromosome_name_left[0]==0) continue;
@@ -2739,9 +3077,9 @@ void add_fragment_supported_junction( fc_thread_global_context_t * global_contex
char * this_key = malloc(strlen(j_one->chromosome_name_left) + strlen(j_one->chromosome_name_right) + 36);
sprintf(this_key, "%s\t%u\t%s\t%u", j_one->chromosome_name_left, j_one -> last_exon_base_left, j_one->chromosome_name_right, j_one -> first_exon_base_right);
- void * count_ptr = HashTableGet(thread_context -> junction_counting_table, this_key);
+ void * count_ptr = HashTableGet(junction_counting_table, this_key);
unsigned long long count_junc = count_ptr - NULL;
- HashTablePut(thread_context -> junction_counting_table, this_key, NULL+count_junc + 1);
+ HashTablePut(junction_counting_table, this_key, NULL+count_junc + 1);
// #warning "CONTINUE SHOULD BE REMOVED!!!!"
// continue;
@@ -2753,9 +3091,9 @@ void add_fragment_supported_junction( fc_thread_global_context_t * global_contex
for( x2 = 0 ; x2 < 2 ; x2++ ){
char * lr_key = x2?right_key:left_key;
- count_ptr = HashTableGet(thread_context -> splicing_point_table, lr_key);
+ count_ptr = HashTableGet(splicing_point_table, lr_key);
count_junc = count_ptr - NULL;
- HashTablePut(thread_context -> splicing_point_table, lr_key, NULL + count_junc + 1);
+ HashTablePut(splicing_point_table, lr_key, NULL + count_junc + 1);
}
}
}
@@ -2814,30 +3152,49 @@ unsigned short calc_score_overlaps(fc_thread_global_context_t * global_context,
void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_thread_context_t * thread_context,
long * hits_indices1, int nhits1, long * hits_indices2, int nhits2, unsigned int total_frag_len,
- char ** hits_chro1, char ** hits_chro2, unsigned int * hits_start_pos1, unsigned int * hits_start_pos2, unsigned short * hits_length1, unsigned short * hits_length2,
- int fixed_fractional_count, char * read_name){
+ char ** hits_chro1, char ** hits_chro2, unsigned int * hits_start_pos1, unsigned int * hits_start_pos2, unsigned short * hits_length1, unsigned short * hits_length2, int fixed_fractional_count, char * read_name, char * RG_name, char * bin1, char * bin2){
if(global_context -> need_calculate_overlap_len == 0 && nhits2+nhits1==1) {
long hit_exon_id = nhits2?hits_indices2[0]:hits_indices1[0];
- thread_context->count_table[hit_exon_id] += fixed_fractional_count;
+
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> assigned_reads++;
+
+ read_count_type_t * count_table = tab4s[0];
+ count_table[hit_exon_id] += fixed_fractional_count;
+ }else{
+ thread_context->count_table[hit_exon_id] += fixed_fractional_count;
+ thread_context->read_counters.assigned_reads ++;
+ }
thread_context->nreads_mapped_to_exon++;
- if(global_context -> SAM_output_fp)
+ if(global_context -> read_details_out_FP)
{
int final_gene_number = global_context -> exontable_geneid[hit_exon_id];
- unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ char * final_feture_name = (char *)global_context -> gene_name_array[final_gene_number];
+ write_read_details_FP(global_context, thread_context, "Assigned", 1, final_feture_name, bin1, bin2);
}
- thread_context->read_counters.assigned_reads ++;
} else if(global_context -> need_calculate_overlap_len == 0 && nhits2 == 1 && nhits1 == 1 && hits_indices2[0]==hits_indices1[0]) {
long hit_exon_id = hits_indices1[0];
- thread_context->count_table[hit_exon_id] += fixed_fractional_count;
+
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> assigned_reads++;
+
+ read_count_type_t * count_table = tab4s[0];
+ count_table[hit_exon_id] += fixed_fractional_count;
+ }else{
+ thread_context->count_table[hit_exon_id] += fixed_fractional_count;
+ thread_context->read_counters.assigned_reads ++;
+ }
thread_context->nreads_mapped_to_exon++;
- if(global_context -> SAM_output_fp)
+ if(global_context -> read_details_out_FP)
{
int final_gene_number = global_context -> exontable_geneid[hit_exon_id];
- unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ char * final_feture_name = (char *)global_context -> gene_name_array[final_gene_number];
+ write_read_details_FP(global_context, thread_context, "Assigned", 1, final_feture_name, bin1, bin2);
}
- thread_context->read_counters.assigned_reads ++;
} else {
// Build a voting table.
// The voting table should be:
@@ -2958,7 +3315,7 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
score_x1_key = global_context -> exontable_geneid[ scoring_exon_ids[score_x1] ];
else score_x1_key = scoring_exon_ids[score_x1] ;
- //writesize_fprint(global_context,stderr, "Q222KEY: exon=%ld, gene=%ld\n", scoring_exon_ids[score_x1] , score_x1_key );
+ //write_read_details_FP(global_context,stderr, "Q222KEY: exon=%ld, gene=%ld\n", scoring_exon_ids[score_x1] , score_x1_key );
if( score_x1_key == score_merge_key ){
if((scoring_flags[score_x1] & ( ends?2:1 )) == 0) {
scoring_flags[score_x1] |= (ends?2:1);
@@ -2992,101 +3349,138 @@ void vote_and_add_count(fc_thread_global_context_t * global_context, fc_thread_t
applied_fragment_minimum_overlapping = max( global_context -> fragment_minimum_overlapping, global_context -> fractional_minimum_overlapping * ( total_frag_len) );
}
-
- for(score_x1 = 0; score_x1 < scoring_count ; score_x1++){
-
- if(0 && FIXLENstrcmp("V0112_0155:7:1101:5387:6362", read_name)==0) SUBREADprintf("Scoring Overlap %s = %d >=%d, score=%d, exonid=%ld\n", read_name, scoring_overlappings[score_x1], applied_fragment_minimum_overlapping, scoring_numbers[score_x1], scoring_exon_ids[score_x1]);
- //SUBREADprintf("RLTEST: %s %d\n", read_name, scoring_overlappings[score_x1]);
- if( applied_fragment_minimum_overlapping > 1 )
- if( applied_fragment_minimum_overlapping > scoring_overlappings[score_x1] ){
- scoring_numbers[score_x1] = 0;
- continue;
- }
-
- if( maximum_score < scoring_numbers[score_x1] ){
- maximum_total_count = 1;
- maximum_score = scoring_numbers[score_x1];
- maximum_score_x1 = score_x1;
- }else if( maximum_score == scoring_numbers[score_x1] )
- maximum_total_count++;
- overlapping_total_count ++;
- }
-
- if(maximum_total_count == 0){
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_NoFeatures\t*\t*\n", read_name);
-
- thread_context->read_counters.unassigned_nofeatures ++;
+ if(scoring_count == 0){
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context,"Unassigned_NoFeatures",-1, NULL, bin1, bin2);
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_nofeatures++;
+ }else thread_context->read_counters.unassigned_nofeatures ++;
}else{
-
- // final adding votes.
- if(1 == maximum_total_count && !global_context -> is_multi_overlap_allowed) {
- // simple add to the exon ( EXON_ID = decision_table_exon_ids[maximum_decision_no])
- long max_exon_id = scoring_exon_ids[maximum_score_x1];
- thread_context->count_table[max_exon_id] += fixed_fractional_count;
- thread_context->nreads_mapped_to_exon++;
- if(global_context -> SAM_output_fp)
- {
- int final_gene_number = global_context -> exontable_geneid[max_exon_id];
- unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- if(scoring_count>1)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1;%s/Targets=%d/%d\n", read_name, final_feture_name, global_context -> use_overlapping_break_tie? "MaximumOverlapping":"Votes", maximum_score, scoring_count);
- else
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=1\n", read_name, final_feture_name);
+ for(score_x1 = 0; score_x1 < scoring_count ; score_x1++){
+ if(0 && FIXLENstrcmp("V0112_0155:7:1101:5387:6362", read_name)==0) SUBREADprintf("Scoring Overlap %s = %d >=%d, score=%d, exonid=%ld\n", read_name, scoring_overlappings[score_x1], applied_fragment_minimum_overlapping, scoring_numbers[score_x1], scoring_exon_ids[score_x1]);
+ //SUBREADprintf("RLTEST: %s %d\n", read_name, scoring_overlappings[score_x1]);
+ if( applied_fragment_minimum_overlapping > 1 )
+ if( applied_fragment_minimum_overlapping > scoring_overlappings[score_x1] ){
+ scoring_numbers[score_x1] = 0;
+ continue;
+ }
+
+ if( maximum_score < scoring_numbers[score_x1] ){
+ maximum_total_count = 1;
+ maximum_score = scoring_numbers[score_x1];
+ maximum_score_x1 = score_x1;
+ }else if( maximum_score == scoring_numbers[score_x1] )
+ maximum_total_count++;
+ overlapping_total_count ++;
}
- thread_context->read_counters.assigned_reads ++;
- }else if(global_context -> is_multi_overlap_allowed) {
- char final_feture_names[1000];
- int assigned_no = 0, xk1;
- final_feture_names[0]=0;
- for(xk1 = 0; xk1 < scoring_count; xk1++)
- {
- // This change was made on 31/MAR/2016
- if( scoring_numbers[xk1] < 1 ) continue ;
- if( scoring_numbers[xk1] < maximum_score && global_context -> use_overlapping_break_tie ) continue ;
+ if(maximum_total_count == 0){
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context,"Unassigned_Overlapping_Length", -1, NULL, bin1, bin2);
+
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> unassigned_overlapping_length++;
+ }else thread_context->read_counters.unassigned_overlapping_length ++;
+ }else{
- long tmp_voter_id = scoring_exon_ids[xk1];
- //if(1 && FIXLENstrcmp( read_name , "V0112_0155:7:1101:5467:23779#ATCACG" )==0)
- // SUBREADprintf("CountsFrac = %d ; add=%d\n", overlapping_total_count, calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count) );
+ // final adding votes.
+ if(1 == maximum_total_count && !global_context -> is_multi_overlap_allowed) {
+ // simple add to the exon ( EXON_ID = decision_table_exon_ids[maximum_decision_no])
+ long max_exon_id = scoring_exon_ids[maximum_score_x1];
+
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> assigned_reads++;
+
+ read_count_type_t * count_table = tab4s[0];
+ count_table[max_exon_id] += fixed_fractional_count;
+ }else{
+ thread_context->count_table[max_exon_id] += fixed_fractional_count;
+ thread_context->read_counters.assigned_reads ++;
+ }
+ thread_context->nreads_mapped_to_exon++;
+ if(global_context -> read_details_out_FP)
+ {
+ int final_gene_number = global_context -> exontable_geneid[max_exon_id];
+ char * final_feture_name = (char *)global_context -> gene_name_array[final_gene_number];
+ write_read_details_FP(global_context, thread_context,"Assigned", 1, final_feture_name, bin1, bin2);
+ }
+ }else if(global_context -> is_multi_overlap_allowed) {
+ char final_feture_names[1000];
+ int assigned_no = 0, xk1;
+ final_feture_names[0]=0;
+ int is_etc = 0;
- thread_context->count_table[tmp_voter_id] += calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count);
+ for(xk1 = 0; xk1 < scoring_count; xk1++)
+ {
- if(global_context -> SAM_output_fp)
- {
- if(strlen(final_feture_names)<700)
+ // This change was made on 31/MAR/2016
+ if( scoring_numbers[xk1] < 1 ) continue ;
+ if( scoring_numbers[xk1] < maximum_score && global_context -> use_overlapping_break_tie ) continue ;
+
+ long tmp_voter_id = scoring_exon_ids[xk1];
+ //if(1 && FIXLENstrcmp( read_name , "V0112_0155:7:1101:5467:23779#ATCACG" )==0)
+ // SUBREADprintf("CountsFrac = %d ; add=%d\n", overlapping_total_count, calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count) );
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ read_count_type_t * count_table = tab4s[0];
+ count_table[tmp_voter_id] += calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count);
+ }else thread_context->count_table[tmp_voter_id] += calculate_multi_overlap_fraction(global_context, fixed_fractional_count, overlapping_total_count);
+
+ if(global_context -> read_details_out_FP) {
+ if(strlen(final_feture_names)<700) {
+ int final_gene_number = global_context -> exontable_geneid[tmp_voter_id];
+ unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
+ strncat(final_feture_names, (char *)final_feture_name, 999);
+ strncat(final_feture_names, ",", 999);
+ }else if(!is_etc){
+ is_etc = 1;
+ strncat(final_feture_names, ",...", 999);
+ }
+ assigned_no++;
+ }
+ }
+ final_feture_names[999]=0;
+ if(RG_name){
+ void ** tab4s = get_RG_tables(global_context, thread_context, RG_name);
+ fc_read_counters * sumtab = tab4s[1];
+ sumtab -> assigned_reads++;
+ }else{
+ thread_context->read_counters.assigned_reads ++;
+ }
+ thread_context->nreads_mapped_to_exon++;
+
+ if(global_context -> read_details_out_FP)
{
- int final_gene_number = global_context -> exontable_geneid[tmp_voter_id];
- unsigned char * final_feture_name = global_context -> gene_name_array[final_gene_number];
- strncat(final_feture_names, (char *)final_feture_name, 999);
- strncat(final_feture_names, ",", 999);
- assigned_no++;
+ int ffnn = strlen(final_feture_names);
+ if(ffnn>0) final_feture_names[ffnn-1]=0;
+ // overlapped but still assigned
+ write_read_details_FP(global_context, thread_context, "Assigned", assigned_no, final_feture_names, bin1, bin2);
+ }
+ } else {
+ if(global_context -> read_details_out_FP)
+ write_read_details_FP(global_context, thread_context,"Unassigned_Ambiguity", -1, NULL, bin1, bin2);
+ if(RG_name){
+ fc_read_counters * sumtab = get_RG_tables(global_context, thread_context, RG_name)[1];
+ sumtab -> unassigned_ambiguous++;
+ }else{
+ thread_context->read_counters.unassigned_ambiguous ++;
}
}
}
- final_feture_names[999]=0;
- thread_context->nreads_mapped_to_exon++;
- if(global_context -> SAM_output_fp)
- {
- int ffnn = strlen(final_feture_names);
- if(ffnn>0) final_feture_names[ffnn-1]=0;
- // overlapped but still assigned
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tAssigned\t%s\tTotal=%d\n", read_name, final_feture_names, assigned_no);
- }
- thread_context->read_counters.assigned_reads ++;
- } else {
- if(global_context -> SAM_output_fp)
- writesize_fprint(global_context,global_context -> SAM_output_fp,"%s\tUnassigned_Ambiguity\t*\tNumber_Of_Overlapped_Genes=%d\n", read_name, maximum_total_count);
-
- thread_context->read_counters.unassigned_ambiguous ++;
- }
}
}
}
-void fc_thread_merge_results(fc_thread_global_context_t * global_context, read_count_type_t * nreads , unsigned long long int *nreads_mapped_to_exon, fc_read_counters * my_read_counter, HashTable * junction_global_table, HashTable * splicing_global_table)
+// return the number of RG result sets
+int fc_thread_merge_results(fc_thread_global_context_t * global_context, read_count_type_t * nreads , unsigned long long int *nreads_mapped_to_exon, fc_read_counters * my_read_counter, HashTable * junction_global_table, HashTable * splicing_global_table, HashTable * RGmerged_table)
{
- int xk1, xk2;
+ int xk1, xk2, ret = 0;
long long int total_input_reads = 0 ;
read_count_type_t unpaired_fragment_no = 0;
@@ -3097,16 +3491,112 @@ void fc_thread_merge_results(fc_thread_global_context_t * global_context, read_c
for(xk1=0; xk1<global_context-> thread_number; xk1++)
{
+ if(global_context -> assign_reads_to_RG){
+ HashTable * thread_rg_tab = global_context -> thread_contexts[xk1].RG_table;
+ int buck_i;
+ for(buck_i = 0; buck_i < thread_rg_tab -> numOfBuckets; buck_i++){
+ KeyValuePair *cursor = thread_rg_tab -> bucketArray[buck_i];
+ while(cursor){
+ char * rg_name = (char *)cursor -> key;
+ void ** rg_thread_tabs = cursor -> value;
+ void ** rg_old_tabs = HashTableGet(RGmerged_table, rg_name);
+ if(!rg_old_tabs){
+ rg_old_tabs = malloc(sizeof(char *)*4); // all_counts, sum_counts , junc_table, split_table
+ rg_old_tabs[0] = calloc(global_context -> thread_contexts[xk1].count_table_size, sizeof(long long));
+ rg_old_tabs[1] = calloc(1, sizeof(fc_read_counters));
+ if(global_context -> do_junction_counting){
+ HashTable * junction_counting_table = HashTableCreate(131317);
+ HashTableSetHashFunction(junction_counting_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(junction_counting_table, free, NULL);
+ HashTableSetKeyComparisonFunction(junction_counting_table, fc_strcmp_chro);
+
+ HashTable * splicing_point_table = HashTableCreate(131317);
+ HashTableSetHashFunction(splicing_point_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(splicing_point_table, free, NULL);
+ HashTableSetKeyComparisonFunction(splicing_point_table, fc_strcmp_chro);
+
+ rg_old_tabs[2] = junction_counting_table;
+ rg_old_tabs[3] = splicing_point_table;
+ }else rg_old_tabs[2] = NULL;
+
+ HashTablePut(RGmerged_table, memstrcpy(rg_name), rg_old_tabs);
+ }
+ long long * rg_counts = rg_old_tabs[0];
+ fc_read_counters * rg_sum_reads = rg_old_tabs[1];
+ HashTable * rg_junc_tab = rg_old_tabs[2];
+ HashTable * rg_split_tab = rg_old_tabs[3];
+
+ long long * rg_thread_counts = rg_thread_tabs[0];
+ fc_read_counters * rg_thread_sum_reads = rg_thread_tabs[1];
+ HashTable * rg_thread_junc_table = rg_thread_tabs[2];
+ HashTable * rg_thread_split_table = rg_thread_tabs[3];
+
+ for(xk2=0; xk2<global_context -> exontable_exons; xk2++)
+ rg_counts[xk2] += rg_thread_counts[xk2];
+
+ rg_sum_reads->unassigned_ambiguous += rg_thread_sum_reads->unassigned_ambiguous;
+ rg_sum_reads->unassigned_nofeatures += rg_thread_sum_reads->unassigned_nofeatures;
+ rg_sum_reads->unassigned_overlapping_length += rg_thread_sum_reads->unassigned_overlapping_length;
+ rg_sum_reads->unassigned_unmapped += rg_thread_sum_reads->unassigned_unmapped;
+ rg_sum_reads->unassigned_mappingquality += rg_thread_sum_reads->unassigned_mappingquality;
+ rg_sum_reads->unassigned_fragmentlength += rg_thread_sum_reads->unassigned_fragmentlength;
+ rg_sum_reads->unassigned_chimericreads += rg_thread_sum_reads->unassigned_chimericreads;
+ rg_sum_reads->unassigned_multimapping += rg_thread_sum_reads->unassigned_multimapping;
+ rg_sum_reads->unassigned_secondary += rg_thread_sum_reads->unassigned_secondary;
+ rg_sum_reads->unassigned_junction_condition += rg_thread_sum_reads->unassigned_junction_condition;
+ rg_sum_reads->unassigned_duplicate += rg_thread_sum_reads->unassigned_duplicate;
+ rg_sum_reads->assigned_reads += rg_thread_sum_reads->assigned_reads;
+
+ if(global_context -> do_junction_counting){
+ int bucket_i;
+ for(bucket_i = 0 ; bucket_i < rg_thread_junc_table -> numOfBuckets; bucket_i++){
+ KeyValuePair * cursor;
+ cursor = rg_thread_junc_table -> bucketArray[bucket_i];
+ while(cursor){
+ char * junckey = (char *) cursor -> key;
+ void * globval = HashTableGet(rg_junc_tab, junckey);
+ char * new_key = memstrcpy(junckey);
+
+ globval += (cursor -> value - NULL);
+ HashTablePut(rg_junc_tab, new_key, globval);
+ // new_key will be freed when it is replaced next time or when the global table is destroyed.
+
+ cursor = cursor->next;
+ }
+ }
+
+ for(bucket_i = 0 ; bucket_i < rg_thread_split_table -> numOfBuckets; bucket_i++){
+ KeyValuePair * cursor;
+ cursor = rg_thread_split_table -> bucketArray[bucket_i];
+ while(cursor){
+ char * junckey = (char *) cursor -> key;
+ void * globval = HashTableGet(rg_split_tab, junckey);
+ char * new_key = memstrcpy(junckey);
+
+ //if(xk1>0)
+ //SUBREADprintf("MERGE THREAD-%d : %s VAL=%u, ADD=%u\n", xk1, junckey, globval - NULL, cursor -> value - NULL);
+ globval += (cursor -> value - NULL);
+ HashTablePut(rg_split_tab, new_key, globval);
+ cursor = cursor->next;
+ }
+ }
+ } // end : merge junc tables
+ ret++;
+ cursor = cursor -> next;
+ }
+ }
+ }
+
for(xk2=0; xk2<global_context -> exontable_exons; xk2++)
- {
nreads[xk2]+=global_context -> thread_contexts[xk1].count_table[xk2];
- }
+
total_input_reads += global_context -> thread_contexts[xk1].all_reads;
(*nreads_mapped_to_exon) += global_context -> thread_contexts[xk1].nreads_mapped_to_exon;
unpaired_fragment_no += global_context -> thread_contexts[xk1].unpaired_fragment_no;
global_context -> read_counters.unassigned_ambiguous += global_context -> thread_contexts[xk1].read_counters.unassigned_ambiguous;
global_context -> read_counters.unassigned_nofeatures += global_context -> thread_contexts[xk1].read_counters.unassigned_nofeatures;
+ global_context -> read_counters.unassigned_overlapping_length += global_context -> thread_contexts[xk1].read_counters.unassigned_overlapping_length;
global_context -> read_counters.unassigned_unmapped += global_context -> thread_contexts[xk1].read_counters.unassigned_unmapped;
global_context -> read_counters.unassigned_mappingquality += global_context -> thread_contexts[xk1].read_counters.unassigned_mappingquality;
global_context -> read_counters.unassigned_fragmentlength += global_context -> thread_contexts[xk1].read_counters.unassigned_fragmentlength;
@@ -3119,6 +3609,7 @@ void fc_thread_merge_results(fc_thread_global_context_t * global_context, read_c
my_read_counter->unassigned_ambiguous += global_context -> thread_contexts[xk1].read_counters.unassigned_ambiguous;
my_read_counter->unassigned_nofeatures += global_context -> thread_contexts[xk1].read_counters.unassigned_nofeatures;
+ my_read_counter->unassigned_overlapping_length += global_context -> thread_contexts[xk1].read_counters.unassigned_overlapping_length;
my_read_counter->unassigned_unmapped += global_context -> thread_contexts[xk1].read_counters.unassigned_unmapped;
my_read_counter->unassigned_mappingquality += global_context -> thread_contexts[xk1].read_counters.unassigned_mappingquality;
my_read_counter->unassigned_fragmentlength += global_context -> thread_contexts[xk1].read_counters.unassigned_fragmentlength;
@@ -3168,18 +3659,36 @@ void fc_thread_merge_results(fc_thread_global_context_t * global_context, read_c
}
}
- char pct_str[10];
- if(total_input_reads>0)
- sprintf(pct_str,"(%.1f%%%%)", (*nreads_mapped_to_exon)*100./total_input_reads);
- else pct_str[0]=0;
- if(unpaired_fragment_no){
- print_in_box(80,0,0," Not properly paired fragments : %llu", unpaired_fragment_no);
+
+ if(0 == global_context -> is_input_bad_format){
+ char pct_str[10];
+ if(total_input_reads>0)
+ sprintf(pct_str,"(%.1f%%%%)", (*nreads_mapped_to_exon)*100./total_input_reads);
+ else pct_str[0]=0;
+
+
+ if(unpaired_fragment_no){
+ print_in_box(80,0,0," Not properly paired fragments : %llu", unpaired_fragment_no);
+ }
+
+ int show_summary = 1;
+ if(global_context -> assign_reads_to_RG){
+ if(RGmerged_table -> numOfElements)
+ print_in_box(80,0,0," Total read groups : %ld", RGmerged_table -> numOfElements);
+ else{
+ print_in_box(80,0,0," No read groups are found; no output is generated.");
+ show_summary = 0;
+ }
+ }
+ if(show_summary){
+ print_in_box(80,0,0," Total %s : %llu", global_context -> is_paired_end_mode_assign?"fragments":"reads", total_input_reads);
+ print_in_box(pct_str[0]?81:80,0,0," Successfully assigned %s : %llu %s", global_context -> is_paired_end_mode_assign?"fragments":"reads", *nreads_mapped_to_exon,pct_str);
+ }
+ print_in_box(80,0,0," Running time : %.2f minutes", (miltime() - global_context -> start_time)/60);
+ print_in_box(80,0,0,"");
}
- print_in_box(80,0,0," Total %s : %llu", global_context -> is_paired_end_mode_assign?"fragments":"reads", total_input_reads);
- print_in_box(pct_str[0]?81:80,0,0," Successfully assigned %s : %llu %s", global_context -> is_paired_end_mode_assign?"fragments":"reads", *nreads_mapped_to_exon,pct_str);
- print_in_box(80,0,0," Running time : %.2f minutes", (miltime() - global_context -> start_time)/60);
- print_in_box(80,0,0,"");
+ return ret;
}
void get_temp_dir_from_out(char * tmp, char * out){
@@ -3222,7 +3731,7 @@ void fc_thread_init_input_files(fc_thread_global_context_t * global_context, cha
}
-void fc_thread_init_global_context(fc_thread_global_context_t * global_context, unsigned int buffer_size, unsigned short threads, int line_length , int is_PE_data, int min_pe_dist, int max_pe_dist, int is_gene_level, int is_overlap_allowed, int is_strand_checked, char * output_fname, int is_sam_out, int is_both_end_required, int is_chimertc_disallowed, int is_PE_distance_checked, char *feature_name_column, char * gene_id_column, int min_map_qual_score, int is_multi_mapping_allowed, int i [...]
+void fc_thread_init_global_context(fc_thread_global_context_t * global_context, unsigned int buffer_size, unsigned short threads, int line_length , int is_PE_data, int min_pe_dist, int max_pe_dist, int is_gene_level, int is_overlap_allowed, int is_strand_checked, char * output_fname, int is_sam_out, int is_both_end_required, int is_chimertc_disallowed, int is_PE_distance_checked, char *feature_name_column, char * gene_id_column, int min_map_qual_score, int is_multi_mapping_allowed, int i [...]
{
int x1;
@@ -3230,7 +3739,7 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
global_context -> max_BAM_header_size = buffer_size;
global_context -> all_reads = 0;
global_context -> redo = 0;
- global_context -> SAM_output_fp = NULL;
+ global_context -> read_details_out_FP = NULL;
global_context -> isCVersion = isCVersion;
global_context -> is_read_details_out = is_sam_out;
@@ -3246,6 +3755,9 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
global_context -> is_split_or_exonic_only = is_split_or_exonic_only;
global_context -> is_duplicate_ignored = is_duplicate_ignored;
global_context -> use_stdin_file = use_stdin_file;
+ global_context -> assign_reads_to_RG = assign_reads_to_RG;
+ global_context -> long_read_minimum_length = long_read_minimum_length;
+ global_context -> is_verbose = is_verbose;
//global_context -> is_first_read_reversed = (pair_orientations[0]=='r');
//global_context -> is_second_read_straight = (pair_orientations[1]=='f');
@@ -3260,7 +3772,7 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
global_context -> unistr_buffer_size = 1024*1024*2;
global_context -> unistr_buffer_used = 0;
global_context -> unistr_buffer_space = malloc(global_context -> unistr_buffer_size);
- global_context -> annot_chro_name_alias_table = NULL;
+ global_context -> BAM_chros_to_anno_table = NULL;
global_context -> cmd_rebuilt = cmd_rebuilt;
global_context -> feature_block_size = feature_block_size;
global_context -> five_end_extension = fiveEndExtension;
@@ -3276,6 +3788,7 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
global_context -> read_counters.unassigned_ambiguous=0;
global_context -> read_counters.unassigned_nofeatures=0;
+ global_context -> read_counters.unassigned_overlapping_length=0;
global_context -> read_counters.unassigned_unmapped=0;
global_context -> read_counters.unassigned_mappingquality=0;
global_context -> read_counters.unassigned_fragmentlength=0;
@@ -3289,7 +3802,7 @@ void fc_thread_init_global_context(fc_thread_global_context_t * global_context,
if(alias_file_name && alias_file_name[0])
{
strcpy(global_context -> alias_file_name,alias_file_name);
- global_context -> annot_chro_name_alias_table = load_alias_table(alias_file_name);
+ global_context -> BAM_chros_to_anno_table = load_alias_table(alias_file_name);
}
else global_context -> alias_file_name[0]=0;
@@ -3338,17 +3851,18 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
global_context -> read_length = read_length;
global_context -> is_unpaired_warning_shown = 0;
global_context -> is_stake_warning_shown = 0;
+ global_context -> is_read_too_long_to_SAM_BAM_shown = 0;
if(global_context -> is_read_details_out)
{
char tmp_fname[350], *modified_fname;
int i=0;
if( global_context -> input_file_unique ){
- sprintf(tmp_fname, "%s/%s.featureCounts", global_context -> output_file_path, global_context -> input_file_short_name);
- global_context -> SAM_output_fp = f_subr_open(tmp_fname, "w");
+ sprintf(tmp_fname, "%s/%s.featureCounts%s", global_context -> output_file_path, global_context -> input_file_short_name, global_context -> is_read_details_out == FILE_TYPE_BAM?".bam":(global_context -> is_read_details_out == FILE_TYPE_SAM?".sam":""));
+ global_context -> read_details_out_FP = f_subr_open(tmp_fname, "w");
//SUBREADprintf("FCSSF=%s\n", tmp_fname);
} else {
- sprintf(tmp_fname, "%s.featureCounts", global_context -> raw_input_file_name);
+ sprintf(tmp_fname, "%s.featureCounts%s", global_context -> raw_input_file_name, global_context -> is_read_details_out == FILE_TYPE_BAM?".bam":(global_context -> is_read_details_out == FILE_TYPE_SAM?".sam":""));
modified_fname = tmp_fname;
while(modified_fname[0]=='/' || modified_fname[0]=='.' || modified_fname[0]=='\\'){
modified_fname ++;
@@ -3359,16 +3873,17 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
}
char tmp_fname2[350];
sprintf(tmp_fname2, "%s/%s", global_context -> output_file_path, modified_fname);
- global_context -> SAM_output_fp = f_subr_open(tmp_fname2, "w");
+ global_context -> read_details_out_FP = f_subr_open(tmp_fname2, "w");
//SUBREADprintf("FCSSF=%s\n", tmp_fname2);
}
- if(!global_context -> SAM_output_fp)
- {
+ if(global_context -> read_details_out_FP){
+ pthread_spin_init(&global_context -> read_details_out_lock, 1);
+ }else{
SUBREADprintf("Unable to create file '%s'; the read assignment details are not written.\n", tmp_fname);
}
}
else
- global_context -> SAM_output_fp = NULL;
+ global_context -> read_details_out_FP = NULL;
global_context -> redo = 0;
global_context -> exontable_geneid = et_geneid;
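The hunk above appends a format-dependent suffix to the per-input read-details file name. A minimal sketch of that naming step follows, assuming hypothetical `DETAILS_*` constants in place of the real `FILE_TYPE_*` values (which are defined in headers not shown here) and using `snprintf` so that an over-long name is reported rather than overflowing the buffer. The helper names are illustrative and are not part of Subread.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the FILE_TYPE_* constants; the real values
 * live in the Subread headers and are not shown in this diff. */
enum details_format { DETAILS_CORE = 1, DETAILS_SAM = 2, DETAILS_BAM = 3 };

/* Map the requested read-details format to a file-name suffix. */
static const char *details_suffix(int format)
{
    if (format == DETAILS_BAM) return ".bam";
    if (format == DETAILS_SAM) return ".sam";
    return "";   /* any other format keeps the bare ".featureCounts" name */
}

/* Build "<dir>/<short_name>.featureCounts<suffix>" into out.
 * Returns 0 on success, -1 if the name would not fit in out_size bytes. */
static int build_details_name(char *out, size_t out_size, const char *dir,
                              const char *short_name, int format)
{
    int n = snprintf(out, out_size, "%s/%s.featureCounts%s",
                     dir, short_name, details_suffix(format));
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Keeping the suffix choice in one helper means both branches (unique and non-unique input names) would pass the same number of arguments to the format string.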
@@ -3394,10 +3909,10 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
global_context -> thread_contexts[xk1].thread_id = xk1;
global_context -> thread_contexts[xk1].chunk_read_ptr = 0;
global_context -> thread_contexts[xk1].count_table = calloc(sizeof(read_count_type_t), et_exons);
+ global_context -> thread_contexts[xk1].count_table_size = et_exons;
global_context -> thread_contexts[xk1].nreads_mapped_to_exon = 0;
global_context -> thread_contexts[xk1].all_reads = 0;
global_context -> thread_contexts[xk1].chro_name_buff = malloc(CHROMOSOME_NAME_LENGTH);
- global_context -> thread_contexts[xk1].strm_buffer = malloc(sizeof(z_stream));
global_context -> thread_contexts[xk1].unpaired_fragment_no = 0;
global_context -> thread_contexts[xk1].read_counters.assigned_reads = 0;
@@ -3410,7 +3925,14 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
global_context -> thread_contexts[xk1].read_counters.unassigned_multimapping = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_secondary = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_junction_condition = 0;
+ global_context -> thread_contexts[xk1].read_counters.unassigned_overlapping_length = 0;
global_context -> thread_contexts[xk1].read_counters.unassigned_duplicate = 0;
+ global_context -> thread_contexts[xk1].read_details_buff_used = 0;
+
+ if(global_context -> read_details_out_FP){
+ global_context -> thread_contexts[xk1].read_details_buff = malloc(70000 + 2 * MAX_FC_READ_LENGTH * 3);
+ global_context -> thread_contexts[xk1].bam_compressed_buff = malloc(70000 + 2 * MAX_FC_READ_LENGTH * 3);
+ }
if(global_context -> need_calculate_overlap_len){
global_context -> thread_contexts[xk1].scoring_buff_gap_chros = malloc( sizeof(char *) * MAX_HIT_NUMBER * 2 * global_context -> max_M *2);
@@ -3430,11 +3952,15 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
HashTableSetDeallocationFunctions(global_context -> thread_contexts[xk1].splicing_point_table, free, NULL);
HashTableSetKeyComparisonFunction(global_context -> thread_contexts[xk1].splicing_point_table, fc_strcmp_chro);
}
+
+ if(global_context -> assign_reads_to_RG){
+ global_context -> thread_contexts[xk1].RG_table = HashTableCreate(97);
+ HashTableSetHashFunction(global_context -> thread_contexts[xk1].RG_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(global_context -> thread_contexts[xk1].RG_table, free, disallocate_RG_tables);
+ HashTableSetKeyComparisonFunction(global_context -> thread_contexts[xk1].RG_table, fc_strcmp_chro);
+ }
if(!global_context -> thread_contexts[xk1].count_table) return 1;
- void ** thread_args = malloc(sizeof(void *)*2);
- thread_args[0] = global_context;
- thread_args[1] = & global_context -> thread_contexts[xk1];
}
char rand_prefix[300];
@@ -3445,8 +3971,8 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
if(global_context -> use_stdin_file) sprintf(new_fn, "<%s", global_context -> input_file_name );
else sprintf(new_fn, "%s", global_context -> input_file_name );
- //#warning "REMOVE ' * 2 ' FROM NEXT LINE !!!!!!"
- SAM_pairer_create(&global_context -> read_pairer, global_context -> thread_number , global_context -> max_BAM_header_size/1024/1024+2, !global_context-> is_SAM_file, 1, !global_context -> is_paired_end_mode_assign, global_context ->is_paired_end_mode_assign && global_context -> do_not_sort ,0, new_fn, process_pairer_reset, process_pairer_header, process_pairer_output, rand_prefix, global_context);
+ //#warning " ===================== REMOVE ' 0 && ' FROM NEXT LINE !!!!!! =================="
+ SAM_pairer_create(&global_context -> read_pairer, global_context -> thread_number , global_context -> max_BAM_header_size/1024/1024+2, !global_context-> is_SAM_file, !( global_context -> is_read_details_out == FILE_TYPE_BAM ||global_context -> is_read_details_out == FILE_TYPE_SAM ) , !global_context -> is_paired_end_mode_assign, global_context ->is_paired_end_mode_assign && global_context -> do_not_sort ,0, new_fn, process_pairer_reset, process_pairer_header, process_pairer_output, rand [...]
SAM_pairer_set_unsorted_notification(&global_context -> read_pairer, pairer_unsorted_notification);
return 0;
@@ -3455,18 +3981,27 @@ int fc_thread_start_threads(fc_thread_global_context_t * global_context, int et_
void fc_thread_destroy_thread_context(fc_thread_global_context_t * global_context)
{
int xk1;
- if(global_context -> is_read_details_out)
- {
- fclose(global_context -> SAM_output_fp);
- global_context -> SAM_output_fp = NULL;
+
+ if(global_context -> is_read_details_out)for(xk1=0; xk1<global_context-> thread_number; xk1++)
+ write_read_detailed_remainder(global_context, global_context -> thread_contexts+xk1);
+
+ if(global_context -> is_read_details_out) {
+ if( global_context -> is_read_details_out == FILE_TYPE_BAM ){
+ char bam_tail_block[1000];
+ int tail_size = compress_read_detail_BAM( global_context, global_context -> thread_contexts, 0,0,bam_tail_block);
+ assert(tail_size > 0);
+ //SUBREADprintf("TAIL SIZE=%d\n", tail_size);
+ fwrite(bam_tail_block, 1, tail_size, global_context -> read_details_out_FP);
+ }
+ fclose(global_context -> read_details_out_FP);
+ global_context -> read_details_out_FP = NULL;
+ pthread_spin_destroy(&global_context -> read_details_out_lock);
}
- for(xk1=0; xk1<global_context-> thread_number; xk1++)
- {
+ for(xk1=0; xk1<global_context-> thread_number; xk1++) {
//printf("CHRR_FREE\n");
free(global_context -> thread_contexts[xk1].count_table);
free(global_context -> thread_contexts[xk1].chro_name_buff);
- free(global_context -> thread_contexts[xk1].strm_buffer);
if(global_context -> thread_contexts[xk1].scoring_buff_gap_chros){
free(global_context -> thread_contexts[xk1].scoring_buff_gap_chros);
free(global_context -> thread_contexts[xk1].scoring_buff_gap_starts);
@@ -3476,6 +4011,12 @@ void fc_thread_destroy_thread_context(fc_thread_global_context_t * global_contex
HashTableDestroy(global_context -> thread_contexts[xk1].junction_counting_table);
HashTableDestroy(global_context -> thread_contexts[xk1].splicing_point_table);
}
+ if(global_context -> assign_reads_to_RG)
+ HashTableDestroy(global_context -> thread_contexts[xk1].RG_table);
+ if(global_context -> is_read_details_out ){
+ free(global_context -> thread_contexts[xk1].read_details_buff);
+ free(global_context -> thread_contexts[xk1].bam_compressed_buff);
+ }
}
pthread_spin_destroy(&global_context->sambam_chro_table_lock);
@@ -3483,7 +4024,12 @@ void fc_thread_destroy_thread_context(fc_thread_global_context_t * global_contex
}
void fc_thread_wait_threads(fc_thread_global_context_t * global_context)
{
- global_context -> is_input_bad_format |= SAM_pairer_run(&global_context -> read_pairer);
+ int assign_ret = SAM_pairer_run(&global_context -> read_pairer);
+ if(assign_ret){
+ print_in_box(80,0,0,"");
+ print_in_box(80,0,0," format error found in this file!");
+ }
+ global_context -> is_input_bad_format |= assign_ret;
}
void BUFstrcat(char * targ, char * src, char ** buf){
@@ -3496,7 +4042,7 @@ void BUFstrcat(char * targ, char * src, char ** buf){
(**buf) = 0;
}
-void fc_write_final_gene_results(fc_thread_global_context_t * global_context, int * et_geneid, char ** et_chr, long * et_start, long * et_stop, unsigned char * et_strand, const char * out_file, int features, read_count_type_t ** column_numbers, char * file_list, int n_input_files, fc_feature_info_t * loaded_features, int header_out)
+void fc_write_final_gene_results(fc_thread_global_context_t * global_context, int * et_geneid, char ** et_chr, long * et_start, long * et_stop, unsigned char * et_strand, const char * out_file, int features, ArrayList * column_numbers, ArrayList * column_names, fc_feature_info_t * loaded_features, int header_out)
{
int xk1;
int genes = global_context -> gene_name_table -> numOfElements;
@@ -3516,23 +4062,17 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
fprintf(fp_out, "\n");
}
- char * tmp_ptr = NULL, * next_fn;
- int non_empty_files = 0, i_files=0;
+ int i_files;
fprintf(fp_out,"Geneid\tChr\tStart\tEnd\tStrand\tLength");
- next_fn = strtok_r(file_list, ";", &tmp_ptr);
- while(1){
- if(!next_fn||strlen(next_fn)<1) break;
- if(column_numbers[i_files])
- {
- fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
- non_empty_files ++;
- }
- next_fn = strtok_r(NULL, ";", &tmp_ptr);
- i_files++;
+ for(i_files=0; i_files<column_names->numOfElements; i_files++)
+ {
+ char * next_fn = ArrayListGet(column_names, i_files);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
}
+
fprintf(fp_out,"\n");
- gene_columns = calloc(sizeof(read_count_type_t) , genes * non_empty_files);
+ gene_columns = calloc(sizeof(read_count_type_t) , genes * column_names->numOfElements);
unsigned int * gene_exons_number = calloc(sizeof(unsigned int) , genes);
unsigned int * gene_exons_pointer = calloc(sizeof(unsigned int) , genes);
unsigned int * gene_exons_start = malloc(sizeof(unsigned int) * features);
@@ -3572,10 +4112,10 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
for(xk1 = 0; xk1 < features; xk1++)
{
int gene_id = et_geneid[xk1], k_noempty = 0;
- for(i_files=0;i_files < n_input_files; i_files++)
+ for(i_files=0;i_files < column_names->numOfElements; i_files++)
{
- if(column_numbers[i_files]==NULL) continue;
- gene_columns[gene_id * non_empty_files + k_noempty ] += column_numbers[i_files][xk1];
+ unsigned long long * this_col = ArrayListGet(column_numbers, i_files);
+ gene_columns[gene_id * column_names->numOfElements + k_noempty ] += this_col[xk1];
k_noempty++;
}
}
@@ -3643,7 +4183,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
BUFstrcat(out_start_list, numbbuf, &tmp_start_list);
sprintf(numbbuf,"%u;", input_start_stop_list[xk3 * 2 + 1] - 1);
BUFstrcat(out_end_list, numbbuf, &tmp_end_list);
- sprintf(numbbuf,"%c;", matched_strand?'-':'+');
+ sprintf(numbbuf,"%c;", (matched_strand==1)?'-':( ( matched_strand==0 )? '+':'.'));
BUFstrcat(out_strand_list, numbbuf, &tmp_strand_list);
}
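The strand column now distinguishes three cases instead of two. A small sketch of that mapping as a helper, assuming (as the changed line suggests) that 0 means forward, 1 means reverse, and any other code means the strand is unknown; `strand_to_char` is a hypothetical name, not a Subread function.

```c
/* Translate a strand code into the character written to the output table.
 * Assumption taken from the diff: 0 is '+', 1 is '-', and any other value
 * is treated as "strand unknown" and printed as '.'. */
static char strand_to_char(int strand_code)
{
    switch (strand_code) {
    case 0:  return '+';
    case 1:  return '-';
    default: return '.';
    }
}
```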
@@ -3664,21 +4204,15 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
int wlen = fprintf(fp_out, "%s\t%s\t%s\t%s\t%s\t%d" , gene_symbol, out_chr_list, out_start_list, out_end_list, out_strand_list, gene_nonoverlap_len);
- // all exons: gene_exons_number[xk1] : gene_exons_pointer[xk1]
- int non_empty_file_index = 0;
- for(i_files=0; i_files< n_input_files; i_files++)
+ for(i_files=0; i_files< column_names->numOfElements; i_files++)
{
- if(column_numbers[i_files])
- {
- read_count_type_t longlong_res = 0;
- double double_res = 0;
- int is_double_number = calc_float_fraction(gene_columns[non_empty_file_index+non_empty_files*xk1], &longlong_res, &double_res);
- if(is_double_number){
- fprintf(fp_out,"\t%.2f", double_res);
- }else{
- fprintf(fp_out,"\t%llu", longlong_res);
- }
- non_empty_file_index ++;
+ read_count_type_t longlong_res = 0;
+ double double_res = 0;
+ int is_double_number = calc_float_fraction(gene_columns[i_files + column_names->numOfElements*xk1], &longlong_res, &double_res);
+ if(is_double_number){
+ fprintf(fp_out,"\t%.2f", double_res);
+ }else{
+ fprintf(fp_out,"\t%llu", longlong_res);
}
}
fprintf(fp_out,"\n");
@@ -3707,7 +4241,7 @@ void fc_write_final_gene_results(fc_thread_global_context_t * global_context, in
}
}
-void fc_write_final_counts(fc_thread_global_context_t * global_context, const char * out_file, int nfiles, char * file_list, read_count_type_t ** column_numbers, fc_read_counters *read_counters, int isCVersion)
+void fc_write_final_counts(fc_thread_global_context_t * global_context, const char * out_file, ArrayList * column_names, ArrayList * read_counters, int isCVersion)
{
char fname[300];
int i_files, xk1, disk_is_full = 0;
@@ -3721,29 +4255,24 @@ void fc_write_final_counts(fc_thread_global_context_t * global_context, const ch
}
fprintf(fp_out,"Status");
- char * next_fn = file_list;
- for(i_files=0; i_files<nfiles; i_files++)
+ for(i_files=0; i_files<column_names->numOfElements; i_files++)
{
- if(!next_fn||strlen(next_fn)<1) break;
- if(column_numbers[i_files])
- fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
-
- next_fn += strlen(next_fn)+1;
+ char * next_fn = ArrayListGet(column_names, i_files);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
}
fprintf(fp_out,"\n");
- char * keys [] ={ "Assigned" , "Unassigned_Ambiguity", "Unassigned_MultiMapping" ,"Unassigned_NoFeatures", "Unassigned_Unmapped", "Unassigned_MappingQuality", "Unassigned_FragmentLength", "Unassigned_Chimera", "Unassigned_Secondary", (global_context->is_split_or_exonic_only == 2)?"Unassigned_Hasjunction":"Unassigned_Nonjunction", "Unassigned_Duplicate"};
+ char * keys [] ={ "Assigned" , "Unassigned_Unmapped", "Unassigned_MappingQuality", "Unassigned_Chimera", "Unassigned_FragmentLength", "Unassigned_Duplicate", "Unassigned_MultiMapping" , "Unassigned_Secondary", (global_context->is_split_or_exonic_only == 2)?"Unassigned_Hasjunction":"Unassigned_Nonjunction", "Unassigned_NoFeatures", "Unassigned_Overlapping_Length", "Unassigned_Ambiguity"};
- for(xk1=0; xk1<11; xk1++)
+ for(xk1=0; xk1<12; xk1++)
{
fprintf(fp_out,"%s", keys[xk1]);
- for(i_files = 0; i_files < nfiles; i_files ++)
+ for(i_files = 0; i_files < column_names->numOfElements; i_files ++)
{
- unsigned long long * array_0 = (unsigned long long *)&(read_counters[i_files]);
+ unsigned long long * array_0 = ArrayListGet(read_counters,i_files);
unsigned long long * cntr = array_0 + xk1;
- if(column_numbers[i_files])
- fprintf(fp_out,"\t%llu", *cntr);
+ fprintf(fp_out,"\t%llu", *cntr);
}
int wlen = fprintf(fp_out,"\n");
if(wlen < 1)disk_is_full = 1;
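The summary writer above reads counter number `xk1` by offsetting from the start of each counter record, so the order of the `keys` labels has to match the order of the fields in that record. The sketch below illustrates the constraint with a hypothetical three-field struct (the real `fc_read_counters` is declared in a header not shown here) and a compile-time check that the struct holds exactly one counter per label. It reads each value through `memcpy` rather than a pointer cast; that is a choice made for this sketch, not what the patch does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, shortened counter record for illustration only. */
typedef struct {
    unsigned long long assigned;
    unsigned long long unassigned_unmapped;
    unsigned long long unassigned_ambiguity;
} demo_counters;

/* Labels listed in exactly the same order as the struct fields. */
static const char *const demo_labels[] = {
    "Assigned", "Unassigned_Unmapped", "Unassigned_Ambiguity"
};

#define DEMO_N_LABELS (sizeof demo_labels / sizeof demo_labels[0])

/* Fails to compile if a field is added without a label (or vice versa),
 * or if the compiler inserts padding between the counters. */
_Static_assert(sizeof(demo_counters) ==
               DEMO_N_LABELS * sizeof(unsigned long long),
               "one label per counter field, no padding");

/* Fetch the i-th counter by position; the caller must keep i < DEMO_N_LABELS. */
static unsigned long long demo_counter_at(const demo_counters *c, size_t i)
{
    unsigned long long v;
    memcpy(&v, (const unsigned char *)c + i * sizeof v, sizeof v);
    return v;
}
```

A check of this kind would catch a mismatch such as raising the loop bound from 11 to 12 without adding the matching field.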
@@ -3758,7 +4287,7 @@ void fc_write_final_counts(fc_thread_global_context_t * global_context, const ch
}
}
-void fc_write_final_results(fc_thread_global_context_t * global_context, const char * out_file, int features, read_count_type_t ** column_numbers, char * file_list, int n_input_files, fc_feature_info_t * loaded_features, int header_out)
+void fc_write_final_results(fc_thread_global_context_t * global_context, const char * out_file, int features, ArrayList* column_numbers, ArrayList * column_names,fc_feature_info_t * loaded_features, int header_out)
{
/* save the results */
FILE * fp_out;
@@ -3779,38 +4308,31 @@ void fc_write_final_results(fc_thread_global_context_t * global_context, const c
- char * tmp_ptr = NULL, * next_fn;
+ char * next_fn;
fprintf(fp_out,"Geneid\tChr\tStart\tEnd\tStrand\tLength");
- next_fn = strtok_r(file_list, ";", &tmp_ptr);
- while(1){
- if(!next_fn||strlen(next_fn)<1) break;
- if(column_numbers[i_files])
- fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
- next_fn = strtok_r(NULL, ";", &tmp_ptr);
- i_files++;
+
+ for(i_files = 0; i_files < column_names -> numOfElements; i_files++){
+ next_fn = ArrayListGet(column_names, i_files);
+ fprintf(fp_out,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
}
fprintf(fp_out,"\n");
for(i=0;i<features;i++)
{
fprintf(fp_out,"%s\t%s\t%u\t%u\t%c\t%d", global_context -> unistr_buffer_space + loaded_features[i].feature_name_pos,
global_context -> unistr_buffer_space + loaded_features[i].feature_name_pos + loaded_features[i].chro_name_pos_delta,
- loaded_features[i].start, loaded_features[i].end, loaded_features[i].is_negative_strand?'-':'+',loaded_features[i].end-loaded_features[i].start+1);
- for(i_files=0; i_files<n_input_files; i_files++)
+ loaded_features[i].start, loaded_features[i].end, loaded_features[i].is_negative_strand == 1?'-':( loaded_features[i].is_negative_strand == 0? '+':'.'),loaded_features[i].end-loaded_features[i].start+1);
+ for(i_files=0; i_files < column_names -> numOfElements; i_files++)
{
- if(column_numbers[i_files])
- {
- int sorted_exon_no = loaded_features[i].sorted_order;
- unsigned long long count_frac_raw = column_numbers[i_files][sorted_exon_no], longlong_res = 0;
-
- double double_res = 0;
- int is_double_number = calc_float_fraction(count_frac_raw, &longlong_res, &double_res);
- if(is_double_number){
- fprintf(fp_out,"\t%.2f", double_res);
- }else{
- fprintf(fp_out,"\t%llu", longlong_res);
- }
-
-
+ unsigned long long * this_list = ArrayListGet(column_numbers, i_files);
+ int sorted_exon_no = loaded_features[i].sorted_order;
+ unsigned long long count_frac_raw = this_list[sorted_exon_no], longlong_res = 0;
+
+ double double_res = 0;
+ int is_double_number = calc_float_fraction(count_frac_raw, &longlong_res, &double_res);
+ if(is_double_number){
+ fprintf(fp_out,"\t%.2f", double_res);
+ }else{
+ fprintf(fp_out,"\t%llu", longlong_res);
}
}
int wlen = fprintf(fp_out,"\n");
@@ -3844,6 +4366,8 @@ static struct option long_options[] =
{"maxMOp", required_argument, 0, 0},
{"tmpDir", required_argument, 0, 0},
{"largestOverlap", no_argument, 0,0},
+ {"byReadGroup", no_argument, 0,0},
+ {"verbose", no_argument, 0,0},
{0, 0, 0, 0}
};
@@ -3864,8 +4388,9 @@ void print_usage()
SUBREADputs(" also included in the output ('<string>.summary')");
SUBREADputs("");
SUBREADputs(" input_file1 [input_file2] ... A list of SAM or BAM format files. They can be");
- SUBREADputs(" either name or location sorted. If not files provided,");
- SUBREADputs(" <stdin> input is expected.");
+ SUBREADputs(" either name or location sorted. If no files provided,");
+ SUBREADputs(" <stdin> input is expected. Location-sorted paired-end reads");
+ SUBREADputs(" are automatically sorted by read names.");
SUBREADputs("");
SUBREADputs("## Optional arguments:");
@@ -4026,12 +4551,29 @@ void print_usage()
SUBREADputs("");
SUBREADputs(" -T <int> Number of the threads. 1 by default.");
SUBREADputs("");
+
+ SUBREADputs("# Read groups");
+ SUBREADputs("");
+ SUBREADputs(" --byReadGroup Assign reads by read group. \"RG\" tag is required to be");
+ SUBREADputs(" present in the input BAM/SAM files.");
+ SUBREADputs(" ");
+ SUBREADputs("");
+
+ SUBREADputs("# Long reads");
+ SUBREADputs("");
+ SUBREADputs(" -L Count long reads such as Nanopore and PacBio reads. Long");
+ SUBREADputs(" read counting can only run in one thread and only reads");
+ SUBREADputs(" (not read-pairs) can be counted. There is no limitation on");
+ SUBREADputs(" the number of 'M' operations allowed in a CIGAR string in");
+ SUBREADputs(" long read counting.");
+ SUBREADputs("");
+
SUBREADputs("# Miscellaneous");
SUBREADputs("");
- SUBREADputs(" -R Output detailed assignment result for each read. A text ");
- SUBREADputs(" file will be generated for each input file, including ");
- SUBREADputs(" names of reads and meta-features/features reads were ");
- SUBREADputs(" assigned to. See Users Guide for more details.");
+ SUBREADputs(" -R <format> Output detailed assignment results for each read or read-");
+ SUBREADputs(" pair. Results are saved to a file that is in one of the");
+ SUBREADputs(" following formats: CORE, SAM and BAM. See Users Guide for");
+ SUBREADputs(" more info about these formats.");
SUBREADputs("");
SUBREADputs(" --tmpDir <string> Directory under which intermediate files are saved (later");
SUBREADputs(" removed). By default, intermediate files will be saved to");
@@ -4042,6 +4584,9 @@ void print_usage()
SUBREADputs(" and adjacent 'M' operations are merged in the CIGAR");
SUBREADputs(" string.");
SUBREADputs("");
+ SUBREADputs(" --verbose Output verbose information for debugging, such as un-");
+ SUBREADputs(" matched chromosome/contig names.");
+ SUBREADputs("");
SUBREADputs(" -v Output version of the program.");
SUBREADputs("");
@@ -4148,7 +4693,7 @@ int junccmp(fc_junction_gene_t * j1, fc_junction_gene_t * j2){
}
-void fc_write_final_junctions(fc_thread_global_context_t * global_context, char * output_file_name, read_count_type_t ** table_columns, char * input_file_names, int n_input_files, HashTable ** junction_global_table_list, HashTable ** splicing_global_table_list){
+void fc_write_final_junctions(fc_thread_global_context_t * global_context, char * output_file_name, ArrayList * column_names, ArrayList * junction_global_table_list, ArrayList * splicing_global_table_list){
int infile_i, disk_is_full = 0;
HashTable * merged_junction_table = HashTableCreate(156679);
@@ -4164,13 +4709,13 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
HashTableSetKeyComparisonFunction(merged_splicing_table, fc_strcmp_chro);
- for(infile_i = 0 ; infile_i < n_input_files ; infile_i ++){
- if(!table_columns[infile_i]) continue; // bad input file
+ for(infile_i = 0 ; infile_i < column_names -> numOfElements ; infile_i ++){
KeyValuePair * cursor;
int bucket;
- for(bucket=0; bucket < splicing_global_table_list[infile_i] -> numOfBuckets; bucket++)
+ HashTable * spl_table = ArrayListGet(splicing_global_table_list, infile_i);
+ for(bucket=0; bucket < spl_table -> numOfBuckets; bucket++)
{
- cursor = splicing_global_table_list[infile_i] -> bucketArray[bucket];
+ cursor = spl_table -> bucketArray[bucket];
while (cursor)
{
char * ky = (char *)cursor -> key;
@@ -4182,13 +4727,13 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
}
}
- for(infile_i = 0 ; infile_i < n_input_files ; infile_i ++){
- if(!table_columns[infile_i]) continue; // bad input file
+ for(infile_i = 0 ; infile_i < column_names -> numOfElements ; infile_i ++){
KeyValuePair * cursor;
int bucket;
- for(bucket=0; bucket < junction_global_table_list[infile_i] -> numOfBuckets; bucket++)
+ HashTable * junc_table = ArrayListGet(junction_global_table_list, infile_i);
+ for(bucket=0; bucket < junc_table -> numOfBuckets; bucket++)
{
- cursor = junction_global_table_list[infile_i] -> bucketArray[bucket];
+ cursor = junc_table -> bucketArray[bucket];
while (cursor)
{
char * ky = (char *)cursor -> key;
@@ -4231,17 +4776,13 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
int ky_i1, ky_i2;
FILE * ofp = fopen(outfname, "w");
char * tmpp = NULL;
- char * next_fn = input_file_names;
fprintf(ofp, "PrimaryGene\tSecondaryGenes\tSite1_chr\tSite1_location\tSite1_strand\tSite2_chr\tSite2_location\tSite2_strand");
- for(infile_i=0; infile_i < n_input_files; infile_i++)
+ for(infile_i=0; infile_i < column_names -> numOfElements; infile_i++)
{
- if(!next_fn||strlen(next_fn)<1) break;
- if(table_columns[infile_i])
- fprintf(ofp,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
-
- next_fn += strlen(next_fn)+1;
+ char * next_fn = ArrayListGet(column_names, infile_i);
+ fprintf(ofp,"\t%s", global_context -> use_stdin_file?"STDIN":next_fn);
}
fprintf(ofp, "\n");
@@ -4367,9 +4908,9 @@ void fc_write_final_junctions(fc_thread_global_context_t * global_context, char
chro_large[-1]='\t';
- for(infile_i = 0 ; infile_i < n_input_files ; infile_i ++){
- if(!table_columns[infile_i]) continue;
- unsigned long count = HashTableGet(junction_global_table_list[infile_i] , key_list[ky_i]) - NULL;
+ for(infile_i = 0 ; infile_i < column_names -> numOfElements ; infile_i ++){
+ HashTable * junc_table = ArrayListGet(junction_global_table_list, infile_i);
+ unsigned long count = HashTableGet(junc_table, key_list[ky_i]) - NULL;
fprintf(ofp,"\t%lu", count);
}
int wlen = fprintf(ofp, "\n");
@@ -4408,7 +4949,7 @@ char * get_short_fname(char * lname){
return ret;
}
-int readSummary_single_file(fc_thread_global_context_t * global_context, read_count_type_t * column_numbers, int nexons, int * geneid, char ** chr, long * start, long * stop, unsigned char * sorted_strand, char * anno_chr_2ch, char ** anno_chrs, long * anno_chr_head, long * block_end_index, long * block_min_start , long * block_max_end, fc_read_counters * my_read_counter, HashTable * junc_glob_tab, HashTable * splicing_glob_tab);
+int readSummary_single_file(fc_thread_global_context_t * global_context, read_count_type_t * column_numbers, int nexons, int * geneid, char ** chr, long * start, long * stop, unsigned char * sorted_strand, char * anno_chr_2ch, char ** anno_chrs, long * anno_chr_head, long * block_end_index, long * block_min_start , long * block_max_end, fc_read_counters * my_read_counter, HashTable * junc_glob_tab, HashTable * splicing_glob_tab, HashTable * merged_RG_table);
int readSummary(int argc,char *argv[]){
@@ -4461,9 +5002,12 @@ int readSummary(int argc,char *argv[]){
40: as.numeric(min_Fractional_Overlap) # A fractional number. 0.00 : at least 1 bp overlapping
41: temp_directory # the directory to put temp files. "<use output directory>" by default, namely find it from the output file dir.
42: as.numeric(use_stdin_stdout) # only for CfeatureCounts. When use_stdin_stdout & 0x01 > 0, the input file is from stdin (stored in a temporary file); when use_stdin_stdout & 0x02 > 0, the output should be written to STDOUT instead of a file.
+ 43: as.numeric(assign_reads_to_RG) # 1: reads with "RG" tags will be assigned to read groups; 0: default setting
+ 44: as.numeric(long_read_minimum_length) # Reads longer than this will be assigned as long reads (no multi-threading)
+ 45: as.numeric(is_verbose) # 1: show the mismatched chromosome names on screen; 0: don't do so
*/
- int isStrandChecked, isCVersion, isChimericDisallowed, isPEDistChecked, minMappingQualityScore=0, isInputFileResortNeeded, feature_block_size = 20, reduce_5_3_ends_to_one, useStdinFile;
+ int isStrandChecked, isCVersion, isChimericDisallowed, isPEDistChecked, minMappingQualityScore=0, isInputFileResortNeeded, feature_block_size = 20, reduce_5_3_ends_to_one, useStdinFile, assignReadsToRG, long_read_minimum_length, is_verbose;
float fracOverlap;
char **chr;
long *start, *stop;
@@ -4643,13 +5187,24 @@ int readSummary(int argc,char *argv[]){
if(argc>42){
useStdinFile = (atoi(argv[42]) & 1)!=0;
}else useStdinFile = 0;
+
+ if(argc>43)
+ assignReadsToRG = (argv[43][0]=='1');
+ else assignReadsToRG = 0;
+ if(argc>44)
+ long_read_minimum_length = atoi(argv[44])?1:1999999999;
+ else long_read_minimum_length = 1999999999;
+
+ if(argc>45)
+ is_verbose = (argv[45][0]=='1');
+ else is_verbose = 0;
if(SAM_pairer_warning_file_open_limit()) return -1;
fc_thread_global_context_t global_context;
- fc_thread_init_global_context(& global_context, FEATURECOUNTS_BUFFER_SIZE, thread_number, MAX_LINE_LENGTH, isPE, minPEDistance, maxPEDistance,isGeneLevel, isMultiOverlapAllowed, isStrandChecked, (char *)argv[3] , isReadSummaryReport, isBothEndRequired, isChimericDisallowed, isPEDistChecked, nameFeatureTypeColumn, nameGeneIDColumn, minMappingQualityScore,isMultiMappingAllowed, 0, alias_file_name, cmd_rebuilt, isInputFileResortNeeded, feature_block_size, isCVersion, fiveEndExtension, thre [...]
+ fc_thread_init_global_context(& global_context, FEATURECOUNTS_BUFFER_SIZE, thread_number, MAX_LINE_LENGTH, isPE, minPEDistance, maxPEDistance,isGeneLevel, isMultiOverlapAllowed, isStrandChecked, (char *)argv[3] , isReadSummaryReport, isBothEndRequired, isChimericDisallowed, isPEDistChecked, nameFeatureTypeColumn, nameGeneIDColumn, minMappingQualityScore,isMultiMappingAllowed, 0, alias_file_name, cmd_rebuilt, isInputFileResortNeeded, feature_block_size, isCVersion, fiveEndExtension, thre [...]
fc_thread_init_input_files( & global_context, argv[2], &file_name_ptr );
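The three new positional arguments (43 to 45) are optional, and each falls back to a default when the caller passes a shorter argument vector. A minimal sketch of the same parsing, gathered into one helper, is shown below. The struct and function names are hypothetical; the sentinel 1999999999 and the "first character is '1'" test are taken from the hunk above.

```c
#include <stdlib.h>

/* Sentinel used in the diff: a minimum length so large that no read
 * is ever treated as a long read. */
#define NO_LONG_READS 1999999999

/* Hypothetical container for the three optional settings. */
typedef struct {
    int assign_reads_to_RG;
    int long_read_minimum_length;
    int is_verbose;
} demo_optional_args;

/* Return 1 when argv[idx] exists and starts with '1', otherwise 0. */
static int flag_at(int argc, char *argv[], int idx)
{
    return (argc > idx && argv[idx] != NULL && argv[idx][0] == '1');
}

/* Parse arguments 43..45, applying the defaults when they are absent. */
static demo_optional_args parse_optional_args(int argc, char *argv[])
{
    demo_optional_args o;
    o.assign_reads_to_RG = flag_at(argc, argv, 43);
    /* A non-zero value switches long-read mode on (minimum length 1). */
    o.long_read_minimum_length =
        (argc > 44 && atoi(argv[44]) != 0) ? 1 : NO_LONG_READS;
    o.is_verbose = flag_at(argc, argv, 45);
    return o;
}
```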
@@ -4707,7 +5262,7 @@ int readSummary(int argc,char *argv[]){
global_context.exontable_exons = nexons;
- unsigned int x1, * nreads = (unsigned int *) calloc(nexons,sizeof(int));
+ unsigned int x1, * nreads = (unsigned int *) calloc(nexons,sizeof(int)), total_written_coulmns=0;
@@ -4744,16 +5299,25 @@ int readSummary(int argc,char *argv[]){
tmp_pntr = NULL;
strcpy(file_list_used, file_name_ptr);
char * next_fn = strtok_r(file_list_used,";", &tmp_pntr);
- read_count_type_t ** table_columns = calloc( n_input_files , sizeof(read_count_type_t *)), i_files=0;
- fc_read_counters * read_counters = calloc(n_input_files , sizeof(fc_read_counters));
- HashTable ** junction_global_table_list = NULL;
- HashTable ** splicing_global_table_list = NULL;
+ ArrayList * table_columns = ArrayListCreate(n_input_files+1);
+ ArrayList * table_column_names = ArrayListCreate(n_input_files+1);
+ ArrayList * read_counters = ArrayListCreate(n_input_files+1);
+ ArrayListSetDeallocationFunction(table_columns, free);
+ ArrayListSetDeallocationFunction(table_column_names, free);
+ ArrayListSetDeallocationFunction(read_counters, free);
+
+ ArrayList * junction_global_table_list = NULL;
+ ArrayList * splicing_global_table_list = NULL;
if(global_context.do_junction_counting){
- junction_global_table_list = calloc(n_input_files, sizeof(HashTable *));
- splicing_global_table_list = calloc(n_input_files, sizeof(HashTable *));
+ junction_global_table_list = ArrayListCreate(n_input_files+1);
+ splicing_global_table_list = ArrayListCreate(n_input_files+1);
+ ArrayListSetDeallocationFunction(junction_global_table_list, (void (*)(void *))HashTableDestroy);
+ ArrayListSetDeallocationFunction(splicing_global_table_list, (void (*)(void *))HashTableDestroy);
}
+ int ret_int = 0;
+
for(x1 = 0;;x1++){
int orininal_isPE = global_context.is_paired_end_mode_assign;
if(next_fn==NULL || strlen(next_fn)<1 || global_context.disk_is_full) break;
@@ -4783,17 +5347,24 @@ int readSummary(int argc,char *argv[]){
HashTableSetKeyComparisonFunction(splicing_global_table, fc_strcmp_chro);
}
- fc_read_counters * my_read_counter = &(read_counters[i_files]);
- memset(my_read_counter, 0, sizeof(fc_read_counters));
+ HashTable * merged_RG_table = NULL;
+ if(global_context.assign_reads_to_RG){
+ merged_RG_table = HashTableCreate(97);
+ HashTableSetHashFunction(merged_RG_table,HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(merged_RG_table, NULL, free); // the names are put into the column_names table, but the 4-pointer arrays are not used anymore.
+ HashTableSetKeyComparisonFunction(merged_RG_table, fc_strcmp_chro);
+ }
+
+ fc_read_counters * my_read_counter = calloc(1, sizeof(fc_read_counters));
+ global_context.is_read_details_out = isReadSummaryReport;
+ global_context.max_M = max_M;
- int ret_int = readSummary_single_file(& global_context, column_numbers, nexons, geneid, chr, start, stop, sorted_strand, anno_chr_2ch, anno_chrs, anno_chr_head, block_end_index, block_min_start, block_max_end, my_read_counter, junction_global_table, splicing_global_table);
+ ret_int = ret_int || readSummary_single_file(& global_context, column_numbers, nexons, geneid, chr, start, stop, sorted_strand, anno_chr_2ch, anno_chrs, anno_chr_head, block_end_index, block_min_start, block_max_end, my_read_counter, junction_global_table, splicing_global_table, merged_RG_table);
if(global_context.disk_is_full){
SUBREADprintf("ERROR: disk is full. Please check the free space in the output directory.\n");
}
if(ret_int!=0){
// give up this file.
-
- table_columns[i_files] = NULL;
if(global_context.do_junction_counting){
HashTableDestroy(junction_global_table);
HashTableDestroy(splicing_global_table);
@@ -4801,47 +5372,76 @@ int readSummary(int argc,char *argv[]){
free(column_numbers);
} else {
// finished
- table_columns[i_files] = column_numbers;
- if(global_context.do_junction_counting){
- junction_global_table_list[ i_files ] = junction_global_table;
- splicing_global_table_list[ i_files ] = splicing_global_table;
+
+ char * mem_file_name = memstrcpy(next_fn);
+ if(!global_context.assign_reads_to_RG){
+ ArrayListPush(table_columns, column_numbers);
+ ArrayListPush(table_column_names, mem_file_name);
+ ArrayListPush(read_counters, my_read_counter);
+ if(global_context.do_junction_counting){
+ ArrayListPush(junction_global_table_list,junction_global_table);
+ ArrayListPush(splicing_global_table_list,splicing_global_table);
+ }
+ }
+
+ if(global_context.assign_reads_to_RG){
+ int buck_i;
+ for(buck_i = 0; buck_i < merged_RG_table -> numOfBuckets; buck_i++){
+ KeyValuePair * cursor = merged_RG_table -> bucketArray[buck_i];
+ while(cursor){
+ char * rg_name = (char*) cursor -> key;
+ void ** tab4 = cursor -> value;
+ int rg_name_len = strlen(rg_name);
+ int file_len = strlen(mem_file_name);
+
+ char * rg_file_name = malloc(rg_name_len + 3 + file_len);
+ sprintf(rg_file_name, "%s:%s", mem_file_name, rg_name);
+ free(rg_name);
+
+ ArrayListPush(table_column_names, rg_file_name);
+ ArrayListPush(table_columns, tab4[0]);
+ ArrayListPush(read_counters, tab4[1]);
+ if(global_context.do_junction_counting){
+ ArrayListPush(junction_global_table_list,tab4[2]);
+ ArrayListPush(splicing_global_table_list,tab4[3]);
+ }
+ cursor = cursor->next;
+ }
+ }
+
+ free(mem_file_name);
}
+ total_written_coulmns ++;
}
global_context.is_paired_end_mode_assign = orininal_isPE;
-
- i_files++;
next_fn = strtok_r(NULL, ";", &tmp_pntr);
+ if(merged_RG_table) HashTableDestroy(merged_RG_table);
}
free(file_list_used);
+ free(is_unique);
if(global_context.is_input_bad_format){
SUBREADprintf("\nFATAL Error: The program has to terminate and no counting file is generated.\n\n");
}else if(!global_context.disk_is_full){
if(isGeneLevel)
- fc_write_final_gene_results(&global_context, geneid, chr, start, stop, sorted_strand, argv[3], nexons, table_columns, file_name_ptr, n_input_files , loaded_features, isCVersion);
+ fc_write_final_gene_results(&global_context, geneid, chr, start, stop, sorted_strand, argv[3], nexons, table_columns, table_column_names, loaded_features, isCVersion);
else
- fc_write_final_results(&global_context, argv[3], nexons, table_columns, file_name_ptr, n_input_files ,loaded_features, isCVersion);
+ fc_write_final_results(&global_context, argv[3], nexons, table_columns, table_column_names, loaded_features, isCVersion);
}
if(global_context.do_junction_counting && !global_context.disk_is_full)
- fc_write_final_junctions(&global_context, argv[3], table_columns, file_name_ptr, n_input_files , junction_global_table_list, splicing_global_table_list);
+ fc_write_final_junctions(&global_context, argv[3], table_column_names, junction_global_table_list, splicing_global_table_list);
if(!global_context.disk_is_full)
- fc_write_final_counts(&global_context, argv[3], n_input_files, file_name_ptr, table_columns, read_counters, isCVersion);
+ fc_write_final_counts(&global_context, argv[3], table_column_names, read_counters, isCVersion);
- int total_written_coulmns = 0;
- for(i_files=0; i_files<n_input_files; i_files++)
- if(table_columns[i_files]){
- free(table_columns[i_files]);
- if(global_context.do_junction_counting){
- HashTableDestroy(junction_global_table_list[i_files]);
- HashTableDestroy(splicing_global_table_list[i_files]);
- }
-
- total_written_coulmns++;
-
- }
- free(table_columns);
+ ArrayListDestroy(table_columns);
+ ArrayListDestroy(table_column_names);
+ ArrayListDestroy(read_counters);
+ if(global_context.do_junction_counting){
+ ArrayListDestroy(junction_global_table_list);
+ ArrayListDestroy(splicing_global_table_list);
+ }
free(file_name_ptr);
@@ -4863,7 +5463,7 @@ int readSummary(int argc,char *argv[]){
}
}
- if(global_context.SAM_output_fp) fclose(global_context. SAM_output_fp);
+ if(global_context.read_details_out_FP) fclose(global_context. read_details_out_FP);
HashTableDestroy(global_context.gene_name_table);
free(global_context.gene_name_array);
@@ -4872,13 +5472,11 @@ int readSummary(int argc,char *argv[]){
destroy_contig_fasta(global_context.fasta_contigs);
free(global_context.fasta_contigs);
}
- if(global_context.annot_chro_name_alias_table)
- HashTableDestroy(global_context.annot_chro_name_alias_table);
+ if(global_context.BAM_chros_to_anno_table)
+ HashTableDestroy(global_context.BAM_chros_to_anno_table);
if(global_context.do_junction_counting){
HashTableDestroy(global_context.junction_bucket_table);
HashTableDestroy(global_context.junction_features_table);
- free(junction_global_table_list);
- free(splicing_global_table_list);
}
free(global_context.unistr_buffer_space);
@@ -4953,7 +5551,7 @@ void sort_bucket_table(fc_thread_global_context_t * global_context){
-int readSummary_single_file(fc_thread_global_context_t * global_context, read_count_type_t * column_numbers, int nexons, int * geneid, char ** chr, long * start, long * stop, unsigned char * sorted_strand, char * anno_chr_2ch, char ** anno_chrs, long * anno_chr_head, long * block_end_index, long * block_min_start , long * block_max_end, fc_read_counters * my_read_counter, HashTable * junction_global_table, HashTable * splicing_global_table)
+int readSummary_single_file(fc_thread_global_context_t * global_context, read_count_type_t * column_numbers, int nexons, int * geneid, char ** chr, long * start, long * stop, unsigned char * sorted_strand, char * anno_chr_2ch, char ** anno_chrs, long * anno_chr_head, long * block_end_index, long * block_min_start , long * block_max_end, fc_read_counters * my_read_counter, HashTable * junction_global_table, HashTable * splicing_global_table, HashTable * merged_RG_table)
{
int read_length = 0;
int is_first_read_PE=0;
@@ -5016,7 +5614,7 @@ int readSummary_single_file(fc_thread_global_context_t * global_context, read_co
fc_thread_wait_threads(global_context);
unsigned long long int nreads_mapped_to_exon = 0;
- fc_thread_merge_results(global_context, column_numbers , &nreads_mapped_to_exon, my_read_counter, junction_global_table, splicing_global_table);
+ fc_thread_merge_results(global_context, column_numbers , &nreads_mapped_to_exon, my_read_counter, junction_global_table, splicing_global_table, merged_RG_table);
fc_thread_destroy_thread_context(global_context);
if(global_context -> sambam_chro_table) free(global_context -> sambam_chro_table);
@@ -5034,7 +5632,7 @@ int main(int argc, char ** argv)
int feature_count_main(int argc, char ** argv)
#endif
{
- char * Rargv[43];
+ char * Rargv[46];
char annot_name[300];
char temp_dir[300];
char * out_name = malloc(300);
@@ -5071,6 +5669,7 @@ int feature_count_main(int argc, char ** argv)
int is_Multi_Mapping_Allowed = 0;
int is_Split_or_Exonic_Only = 0;
int is_duplicate_ignored = 0;
+ int assign_reads_to_RG = 0;
int do_not_sort = 0;
int do_junction_cnt = 0;
int reduce_5_3_ends_to_one = 0;
@@ -5084,8 +5683,8 @@ int feature_count_main(int argc, char ** argv)
int very_long_file_names_size = 200;
int fiveEndExtension = 0, threeEndExtension = 0, minFragmentOverlap = 1;
float fracOverlap = 0.0;
- int std_input_output_mode = 0;
- char strFiveEndExtension[11], strThreeEndExtension[11], strMinFragmentOverlap[11], fracOverlapStr[20], std_input_output_mode_str[11];
+ int std_input_output_mode = 0, long_read_mode = 0, is_verbose = 0;
+ char strFiveEndExtension[11], strThreeEndExtension[11], strMinFragmentOverlap[11], fracOverlapStr[20], std_input_output_mode_str[11], long_read_mode_str[11];
very_long_file_names = malloc(very_long_file_names_size);
very_long_file_names [0] = 0;
fasta_contigs_name[0]=0;
@@ -5116,7 +5715,7 @@ int feature_count_main(int argc, char ** argv)
strcpy(max_M_str, "10");
strcpy(Pair_Orientations,"fr");
- while ((c = getopt_long (argc, argv, "G:A:g:t:T:o:a:d:D:L:Q:pbF:fs:S:CBJPMORv?", long_options, &option_index)) != -1)
+ while ((c = getopt_long (argc, argv, "G:A:g:t:T:o:a:d:D:LQ:pbF:fs:S:CBJPMOR:v?", long_options, &option_index)) != -1)
switch(c)
{
case 'S':
@@ -5209,7 +5808,13 @@ int feature_count_main(int argc, char ** argv)
is_Overlap = 1;
break;
case 'R':
- is_ReadSummary_Report = 1;
+ if(strcmp(optarg, "SAM")==0) is_ReadSummary_Report = FILE_TYPE_SAM;
+ else if(strcmp(optarg, "BAM")==0) is_ReadSummary_Report = FILE_TYPE_BAM;
+ else if(strcmp(optarg, "CORE")==0) is_ReadSummary_Report = FILE_TYPE_RSUBREAD;
+ else{
+ SUBREADprintf("\nERROR: unknown output format: '%s'\n\n", optarg);
+ STANDALONE_exit(-1);
+ }
break;
case 's':
if(!is_valid_digit_range(optarg, "s", 0 , 2))
@@ -5228,7 +5833,7 @@ int feature_count_main(int argc, char ** argv)
term_strncpy(annot_name, optarg,299);
break;
case 'L':
- feature_block_size = atoi(optarg);
+ long_read_mode = 1;
break;
case 0 : // long options
@@ -5239,7 +5844,7 @@ int feature_count_main(int argc, char ** argv)
if(strcmp("readExtension5", long_options[option_index].name)==0)
{
- if(!is_valid_digit(optarg, "readExtension5"))
+ if(!is_valid_digit_range(optarg, "readExtension5", 0, 0x7fffffff))
STANDALONE_exit(-1);
fiveEndExtension = atoi(optarg);
fiveEndExtension = max(0, fiveEndExtension);
@@ -5247,7 +5852,7 @@ int feature_count_main(int argc, char ** argv)
if(strcmp("readExtension3", long_options[option_index].name)==0)
{
- if(!is_valid_digit(optarg, "readExtension3"))
+ if(!is_valid_digit_range(optarg, "readExtension3", 0, 0x7fffffff))
STANDALONE_exit(-1);
threeEndExtension = atoi(optarg);
threeEndExtension = max(0, threeEndExtension);
@@ -5286,7 +5891,7 @@ int feature_count_main(int argc, char ** argv)
strcpy(temp_dir, optarg);
}
if(strcmp("maxMOp", long_options[option_index].name)==0){
- if(!is_valid_digit_range(optarg, "maxMOp", 1 , 64))
+ if(!is_valid_digit_range(optarg, "maxMOp", 1 , 65555))
STANDALONE_exit(-1);
strcpy(max_M_str, optarg);
}
@@ -5296,7 +5901,10 @@ int feature_count_main(int argc, char ** argv)
reduce_5_3_ends_to_one = REDUCE_TO_3_PRIME_END;
else if(optarg[0]=='5')
reduce_5_3_ends_to_one = REDUCE_TO_5_PRIME_END;
-
+ else{
+ SUBREADprintf("Invalid parameter to the --read2pos option: %s\n", optarg);
+ STANDALONE_exit(-1);
+ }
}
if(strcmp("largestOverlap", long_options[option_index].name)==0)
@@ -5322,6 +5930,14 @@ int feature_count_main(int argc, char ** argv)
{
is_Split_or_Exonic_Only = 2;
}
+
+ if(strcmp("verbose", long_options[option_index].name)==0){
+ is_verbose = 1;
+ }
+
+ if(strcmp("byReadGroup", long_options[option_index].name)==0){
+ assign_reads_to_RG = 1;
+ }
break;
case '?':
default :
@@ -5372,6 +5988,7 @@ int feature_count_main(int argc, char ** argv)
sprintf(Strand_Sensitive_Str,"%d", Strand_Sensitive_Mode);
sprintf(fracOverlapStr, "%g", fracOverlap);
sprintf(std_input_output_mode_str,"%d",std_input_output_mode);
+ sprintf(long_read_mode_str, "%d", long_read_mode);
Rargv[0] = "CreadSummary";
Rargv[1] = annot_name;
@@ -5386,7 +6003,7 @@ int feature_count_main(int argc, char ** argv)
Rargv[10] = nthread_str;
Rargv[11] = isGTF?"1":"0";
Rargv[12] = Strand_Sensitive_Str;
- Rargv[13] = is_ReadSummary_Report?"1":"0";
+ Rargv[13] = is_ReadSummary_Report == 0 ? "0":(is_ReadSummary_Report == FILE_TYPE_RSUBREAD?"10":(is_ReadSummary_Report == FILE_TYPE_BAM?"500":"50"));
Rargv[14] = is_Both_End_Mapped?"1":"0";
Rargv[15] = is_Chimeric_Disallowed?"1":"0";
Rargv[16] = is_PE_Dist_Checked?"1":"0";
@@ -5416,10 +6033,13 @@ int feature_count_main(int argc, char ** argv)
Rargv[40] = fracOverlapStr;
Rargv[41] = temp_dir;
Rargv[42] = std_input_output_mode_str;
+ Rargv[43] = assign_reads_to_RG?"1":"0";
+ Rargv[44] = long_read_mode_str;
+ Rargv[45] = is_verbose?"1":"0";
int retvalue = -1;
if(is_ReadSummary_Report && (std_input_output_mode & 1)==1) SUBREADprintf("ERROR: no detailed assignment results can be written when the input is from STDIN. Please remove the '-R' option.\n");
- else retvalue = readSummary(43, Rargv);
+ else retvalue = readSummary(46, Rargv);
free(very_long_file_names);
free(out_name);
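Note on the -R change above: the option now takes a format name (SAM, BAM or CORE) instead of acting as a plain switch, and the chosen format is passed to readSummary as a string code in Rargv[13] ("0", "10", "500" or "50"). The sketch below restates that mapping in isolation. The enum and both function names are hypothetical stand-ins, since the actual FILE_TYPE_* values are not shown in this diff; only the option strings and the string codes are taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the FILE_TYPE_* constants; the real values
 * are defined in the Subread headers and are not visible in this diff. */
enum detail_format {
    DETAIL_NONE = 0,
    DETAIL_CORE,
    DETAIL_SAM,
    DETAIL_BAM,
    DETAIL_INVALID
};

/* Map the argument of -R to a format, mirroring the new 'R' case. */
static enum detail_format parse_detail_format(const char *arg)
{
    assert(arg != NULL);
    if (strcmp(arg, "SAM") == 0)  return DETAIL_SAM;
    if (strcmp(arg, "BAM") == 0)  return DETAIL_BAM;
    if (strcmp(arg, "CORE") == 0) return DETAIL_CORE;
    return DETAIL_INVALID;   /* the patch prints an error and exits here */
}

/* String code that the patch stores in Rargv[13] for each format. */
static const char *detail_format_code(enum detail_format f)
{
    switch (f) {
    case DETAIL_NONE: return "0";
    case DETAIL_CORE: return "10";
    case DETAIL_BAM:  return "500";
    case DETAIL_SAM:  return "50";
    default:          return NULL;   /* unknown format: nothing to pass on */
    }
}
```

The comparison is case sensitive, as in the patch, so "sam" would be rejected. Note also that the getopt string changes from "R" to "R:", so a bare -R without an argument is no longer accepted.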
diff --git a/src/sambam-file.c b/src/sambam-file.c
index da988a6..2acc7cf 100644
--- a/src/sambam-file.c
+++ b/src/sambam-file.c
@@ -529,6 +529,167 @@ int PBam_chunk_headers(char * chunk, int *chunk_ptr, int chunk_len, SamBam_Refer
return -1;
}
+int convert_BAM_binary_to_SAM( SamBam_Reference_Info * chro_table, char * bam_bin, char * sam_txt ){
+ int bin_len = 0;
+ memcpy(&bin_len, bam_bin, 4);
+ bin_len += 4;
+
+ int sam_ptr = 0, tmpint = 0;
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%s\t", bam_bin+36);
+
+ memcpy(&tmpint, bam_bin + 16 ,4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%d\t", (tmpint >> 16) & 0xffff);
+ int cigar_opts = tmpint & 0xffff;
+
+ memcpy(&tmpint, bam_bin + 4 ,4);
+ int r1chro = tmpint;
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%s\t", tmpint<0?"*":chro_table[tmpint].chro_name);
+ memcpy(&tmpint, bam_bin + 8 ,4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%d\t", tmpint+1);
+ memcpy(&tmpint, bam_bin + 12 ,4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%d\t", (tmpint >> 8) & 0xff);
+ int name_len = tmpint & 0xff;
+ int cigar_i;
+ for(cigar_i = 0; cigar_i < cigar_opts; cigar_i ++){
+ unsigned int cigarint=0;
+ memcpy(&cigarint, bam_bin + name_len + 36 + cigar_i * 4,4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%u%c", cigarint >> 4, "MIDNSHP=X"[cigarint&0xf]);
+ }
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%s\t", cigar_i<1?"*":"");
+
+ memcpy(&tmpint, bam_bin + 24, 4);
+ //SUBREADprintf("CHRO_IDX=%d\n", tmpint);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%s\t", tmpint<0?"*":((tmpint == r1chro)?"=":chro_table[tmpint].chro_name));
+
+ memcpy(&tmpint, bam_bin + 28, 4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%d\t", tmpint+1);
+
+ memcpy(&tmpint, bam_bin + 32, 4);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%d\t", tmpint);
+
+ int seq_len;
+ memcpy(&seq_len, bam_bin + 20,4);
+ int seqi, flex_ptr=name_len + 36 + cigar_opts * 4;
+ for(seqi=0; seqi<seq_len; seqi++){
+ sam_txt[sam_ptr++]="=ACMGRSVTWYHKDBN"[ (bam_bin[flex_ptr] >> ( 4*!(seqi %2) )) & 15 ];
+ if(seqi %2) flex_ptr++;
+ }
+ sam_txt[sam_ptr++]='\t';
+ if(seqi %2) flex_ptr++;
+ for(seqi=0; seqi<seq_len; seqi++){
+ unsigned char nch = (unsigned char) bam_bin[flex_ptr++];
+ if(nch!=0xff||seqi == 0)
+ sam_txt[sam_ptr++]=nch==0xff?'*':(nch+33);
+ }
+ sam_txt[sam_ptr++]='\t';
+
+ while(flex_ptr < bin_len){
+ sam_txt[sam_ptr++]=bam_bin[flex_ptr++];
+ sam_txt[sam_ptr++]=bam_bin[flex_ptr++];
+ sam_txt[sam_ptr++]=':';
+ char tagtype = bam_bin[flex_ptr++];
+
+ if(tagtype == 'B'){
+ char elemtype = bam_bin[flex_ptr++];
+ int elem_no=0, is_int_type = 0, type_bytes = 0, is_signed = 0;
+ memcpy(&elem_no, bam_bin + flex_ptr, 4);
+ flex_ptr += 4;
+ sam_txt[sam_ptr++]='B';
+ sam_txt[sam_ptr++]=':';
+ sam_txt[sam_ptr++]=elemtype;
+ sam_txt[sam_ptr++]=',';
+
+ if(elemtype == 'i' || elemtype == 'I'){
+ is_int_type = 1;
+ type_bytes = 4;
+ is_signed = elemtype == 'i' ;
+ }else if(elemtype == 's' || elemtype == 'S'){
+ is_int_type = 1;
+ type_bytes = 2;
+ is_signed = elemtype == 's' ;
+ }else if(elemtype == 'c' || elemtype == 'C'){
+ is_int_type = 1;
+ type_bytes = 1;
+ is_signed = elemtype == 'c' ;
+ }else if(elemtype == 'f'){
+ type_bytes = 4;
+ }
+
+ int elemi;
+ for(elemi =0; elemi < elem_no; elemi++){
+ if(is_int_type){
+ int tagval = 0;
+ memcpy(&tagval, bam_bin + flex_ptr, type_bytes);
+ long long printv = is_signed?tagval:( (unsigned int) tagval );
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%lld,", printv);
+ }else{
+ float tagval = 0;
+ memcpy(&tagval, bam_bin + flex_ptr, type_bytes);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%f,", tagval);
+ }
+ flex_ptr += type_bytes;
+ }
+
+ sam_txt[sam_ptr-1] = '\t';
+ sam_txt[sam_ptr] = 0;
+ }else{
+ int is_int_type = 0, is_float_type = 0, type_bytes = 0, is_string_type = 0, is_char_type = 0, is_signed = 0;
+ if(tagtype == 'i' || tagtype == 'I'){
+ is_int_type = 1;
+ type_bytes = 4;
+ is_signed = tagtype == 'i' ;
+ }else if(tagtype == 's' || tagtype == 'S'){
+ is_int_type = 1;
+ type_bytes = 2;
+ is_signed = tagtype == 's' ;
+ }else if(tagtype == 'c' || tagtype == 'C'){
+ is_int_type = 1;
+ type_bytes = 1;
+ is_signed = tagtype == 'c' ;
+ }else if(tagtype == 'f'){
+ is_float_type = 1;
+ type_bytes = 4;
+ }else if(tagtype == 'Z' || tagtype == 'H'){
+ is_string_type = 1;
+ while(bam_bin[flex_ptr+(type_bytes ++)]);
+ }else if(tagtype == 'A'){
+ is_char_type = 1;
+ type_bytes = 1;
+ }
+
+
+ sam_txt[sam_ptr++]=is_int_type?'i':tagtype;
+ sam_txt[sam_ptr++]=':';
+
+ if(is_int_type){
+ int tagval = 0;
+ memcpy(&tagval, bam_bin + flex_ptr, type_bytes);
+ long long printv = is_signed?tagval:( (unsigned int) tagval );
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%lld\t", printv);
+ }else if(is_string_type){
+ // type_bytes includes \0
+ memcpy(sam_txt + sam_ptr, bam_bin + flex_ptr, type_bytes -1);
+ sam_txt[ sam_ptr + type_bytes -1 ] = '\t';
+
+ //sam_txt[ sam_ptr + type_bytes +1]=0;
+ //SUBREADprintf("STR_LEN=%d\tSTR=%s\n", type_bytes-1, sam_txt + sam_ptr);
+ sam_ptr += type_bytes;
+ }else if(is_float_type){
+ float tagval = 0;
+ memcpy(&tagval, bam_bin + flex_ptr, type_bytes);
+ sam_ptr += sprintf(sam_txt + sam_ptr, "%f\t", tagval);
+ }else if(is_char_type){
+ sam_txt[ sam_ptr++ ] = bam_bin[flex_ptr];
+ sam_txt[ sam_ptr++ ] = '\t';
+ }
+ flex_ptr += type_bytes;
+ }
+ }
+
+ sam_txt[sam_ptr-1]=0; //last '\t'
+ return sam_ptr-1;
+}
+
int PBam_chunk_gets(char * chunk, int *chunk_ptr, int chunk_limit, SamBam_Reference_Info * bam_chro_table, char * buff , int buff_len, SamBam_Alignment*aln, int seq_needed)
{
int xk1;
@@ -620,6 +781,7 @@ int PBam_chunk_gets(char * chunk, int *chunk_ptr, int chunk_limit, SamBam_Refere
char extra_tags [CORE_ADDITIONAL_INFO_LENGTH];
extra_tags[0]=0;
+ int extra_len = 0;
while( (*chunk_ptr) < next_start)
{
char extag[2];
@@ -666,14 +828,23 @@ int PBam_chunk_gets(char * chunk, int *chunk_ptr, int chunk_limit, SamBam_Refere
if(extype == 'c' || extype=='C' || extype == 'i' || extype=='I' || extype == 's' || extype=='S'){
int tmpi = 0;
memcpy(&tmpi, chunk+(*chunk_ptr),delta);
- if(tmpi >= 0)
- sprintf(extra_tags + strlen(extra_tags), "\t%c%c:i:%d", extag[0], extag[1], tmpi);
+ if(tmpi >= 0 && extra_len < CORE_ADDITIONAL_INFO_LENGTH - 18){
+ int sret = sprintf(extra_tags + strlen(extra_tags), "\t%c%c:i:%d", extag[0], extag[1], tmpi);
+ extra_len += sret;
+ }
}else if(extype == 'Z'){
- sprintf(extra_tags + strlen(extra_tags), "\t%c%c:Z:", extag[0], extag[1]);
- *(extra_tags + strlen(extra_tags)+delta-1) = 0;
- memcpy(extra_tags + strlen(extra_tags), chunk + (*chunk_ptr), delta - 1);
+ if(extra_len < CORE_ADDITIONAL_INFO_LENGTH - 7 - delta){
+ sprintf(extra_tags + strlen(extra_tags), "\t%c%c:Z:", extag[0], extag[1]);
+ extra_len += 6;
+ *(extra_tags + strlen(extra_tags)+delta-1) = 0;
+ memcpy(extra_tags + strlen(extra_tags), chunk + (*chunk_ptr), delta - 1);
+ extra_len += delta - 1;
+ }
}else if(extype == 'A'){
- sprintf(extra_tags + strlen(extra_tags), "\t%c%c:A:%c", extag[0], extag[1], *(chunk + *chunk_ptr) );
+ if(extra_len < CORE_ADDITIONAL_INFO_LENGTH - 8){
+ int sret = sprintf(extra_tags + strlen(extra_tags), "\t%c%c:A:%c", extag[0], extag[1], *(chunk + *chunk_ptr) );
+ extra_len += sret;
+ }
}
}
@@ -1196,6 +1367,7 @@ int SamBam_compress_cigar(char * cigar, int * cigar_int, int * ret_coverage, int
for(; int_opt<8; int_opt++) if("MIDNSHP=X"[int_opt] == nch)break;
cigar_int[num_opt ++] = (tmp_int << 4) | int_opt;
tmp_int = 0;
+ //SUBREADprintf("CIGARCOM: %d-th is %c\n", num_opt, nch);
if(num_opt>=max_secs)break;
}
}
diff --git a/src/sambam-file.h b/src/sambam-file.h
index a23f0f6..c6b0bfe 100644
--- a/src/sambam-file.h
+++ b/src/sambam-file.h
@@ -27,7 +27,7 @@ typedef unsigned short BS_uint_16;
typedef unsigned int BS_uint_32;
#define BAM_MAX_CHROMOSOME_NAME_LEN 100
-#define BAM_MAX_CIGAR_LEN 256
+#define BAM_MAX_CIGAR_LEN (30000)
#define BAM_MAX_READ_NAME_LEN 256
#define BAM_MAX_READ_LEN 3000
@@ -185,4 +185,6 @@ int SamBam_fetch_next_chunk(SamBam_FILE *fp);
int SamBam_compress_cigar(char * cigar, int * cigar_int, int * ret_coverage, int max_secs);
char cigar_op_char(int ch);
void SamBam_read2bin(char * read_txt, char * read_bin);
+
+int convert_BAM_binary_to_SAM(SamBam_Reference_Info * chro_table, char * bam_bin, char * sam_txt);
#endif
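Note on convert_BAM_binary_to_SAM, declared above: two of its steps are the unpacking of CIGAR operations (each a 32-bit word holding length << 4 | op, with op indexing "MIDNSHP=X") and of the read sequence (two bases per byte, high nibble first, indexing "=ACMGRSVTWYHKDBN"). The following is a minimal standalone sketch of those two steps only. The function names, the bounds checks and the output-length parameters are assumptions added for illustration; the function in the patch writes directly into a caller-supplied buffer with sprintf and performs no such checks.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Decode n_ops packed CIGAR words into text such as "10M2I5S".
 * Returns the number of characters written, or -1 on a bad op code
 * or if the output buffer is too small. */
static int decode_cigar(const uint32_t *ops, int n_ops, char *out, size_t out_len)
{
    static const char op_chars[] = "MIDNSHP=X";
    size_t used = 0;

    if (out_len == 0) return -1;
    out[0] = '\0';
    for (int i = 0; i < n_ops; i++) {
        unsigned op = ops[i] & 0xf;
        if (op > 8) return -1;                 /* only 9 op codes are defined */
        int n = snprintf(out + used, out_len - used, "%u%c",
                         (unsigned)(ops[i] >> 4), op_chars[op]);
        if (n < 0 || (size_t)n >= out_len - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}

/* Unpack seq_len bases stored two per byte, high nibble first.
 * Returns seq_len, or -1 if the output buffer is too small. */
static int decode_seq(const unsigned char *packed, int seq_len, char *out, size_t out_len)
{
    static const char base_chars[] = "=ACMGRSVTWYHKDBN";

    if (seq_len < 0 || (size_t)seq_len >= out_len) return -1;
    for (int i = 0; i < seq_len; i++) {
        unsigned char byte = packed[i / 2];
        unsigned nibble = (i % 2 == 0) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0xf);
        out[i] = base_chars[nibble];
    }
    out[seq_len] = '\0';
    return seq_len;
}
```

Reading the packed bytes as unsigned char avoids sign extension when shifting, which is why the sketch does not reuse the plain char pointer from the patch.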
diff --git a/src/subread.h b/src/subread.h
index b68c7e7..e024432 100644
--- a/src/subread.h
+++ b/src/subread.h
@@ -61,14 +61,14 @@
#define MAX_READ_NAME_LEN 100
#define MAX_CHROMOSOME_NAME_LEN 100
#define MAX_FILE_NAME_LENGTH 300
-#define FEATURE_NAME_LENGTH 256
+#define FEATURE_NAME_LENGTH 256
//#warning "============== REMOVE '*1.2' FROM THE NEXT LINE ================"
#define MULTI_THREAD_OUTPUT_ITEMS (4096 * 3/5 *3)
-
#define EXON_LONG_READ_LENGTH 160
#define EXON_MAX_CIGAR_LEN 256
#define FC_CIGAR_PARSER_ITEMS 11
+#define FC_LONG_READ_RECORD_HARDLIMIT (8*1024*1024)
#define MAX_INDEL_SECTIONS 7
//#define XBIG_MARGIN_RECORD_SIZE 24
@@ -188,7 +188,7 @@ typedef short gene_vote_number_t;
#define MAX_EXONS_PER_GENE 400
#define MAX_EXON_CONNECTIONS 10
-#define MAX_GENE_NAME_LEN 12
+#define MAX_GENE_NAME_LEN 128
#define MAX_INDEL_TOLERANCE 7
#define SUBINDEX_VER0 100
diff --git a/src/tx-unique.c b/src/tx-unique.c
new file mode 100644
index 0000000..fd5d26c
--- /dev/null
+++ b/src/tx-unique.c
@@ -0,0 +1,443 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <getopt.h>
+#include <assert.h>
+#include "tx-unique.h"
+
+void txunique_gene_free(void * v_gene){
+ txunique_gene_t * gene = v_gene;
+
+ ArrayListDestroy(gene -> transcript_list);
+ free(gene);
+}
+
+int txunique_init_context(txunique_context_t * context){
+ memset(context, 0 , sizeof(txunique_context_t));
+ strcpy(context -> gene_name_column_name, "gene_id");
+ strcpy(context -> transcript_id_column_name, "transcript_id");
+ strcpy(context -> used_feature_type, "exon");
+
+ context -> gene_table = HashTableCreate(62333);
+ HashTableSetKeyComparisonFunction(context -> gene_table, my_strcmp);
+ HashTableSetHashFunction(context -> gene_table, HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(context -> gene_table, NULL, txunique_gene_free);
+
+ context -> result_table = HashTableCreate(62333);
+ HashTableSetKeyComparisonFunction(context -> result_table, my_strcmp);
+ HashTableSetHashFunction(context -> result_table, HashTableStringHashFunction);
+ HashTableSetDeallocationFunctions(context -> result_table, free, NULL);
+ return 0;
+}
+
+void txunique_free_trans(void * v_trans){
+ txunique_transcript_t * trans = v_trans;
+
+ ArrayListDestroy(trans -> exon_list);
+ free(trans);
+}
+
+int txunique_do_add_exon(char * gene_name, char * transcript_id, char * chrome_name, unsigned int start, unsigned int end, int is_negative_strand, void * v_context){
+ int tx_i;
+
+ txunique_context_t * context = v_context;
+
+ txunique_gene_t * tag_gene = HashTableGet(context -> gene_table, gene_name);
+ if(!tag_gene){
+ tag_gene = malloc(sizeof(txunique_gene_t));
+ strncpy(tag_gene -> gene_name, gene_name, FEATURE_NAME_LENGTH-1);
+ tag_gene -> transcript_list = ArrayListCreate(6);
+ ArrayListSetDeallocationFunction(tag_gene -> transcript_list, txunique_free_trans);
+ HashTablePut(context -> gene_table, tag_gene -> gene_name, tag_gene);
+ }
+
+ txunique_transcript_t * tag_tx = NULL;
+ //SUBREADprintf("NEW EXON: %s of %s : %s %u ~ %u ; txs = %lu in %lu\n", gene_name, transcript_id, chrome_name, start, end, tag_gene -> transcript_list -> numOfElements, tag_gene -> transcript_list -> capacityOfElements);
+ for(tx_i = 0; tx_i < tag_gene -> transcript_list -> numOfElements; tx_i++){
+ txunique_transcript_t * try_tx = ArrayListGet(tag_gene -> transcript_list , tx_i);
+ if(strcmp(try_tx -> transcript_id, transcript_id) == 0){
+ tag_tx = try_tx;
+ break;
+ }
+ }
+
+ if(!tag_tx){
+ tag_tx = malloc(sizeof(txunique_transcript_t));
+ strncpy(tag_tx -> transcript_id, transcript_id, FEATURE_NAME_LENGTH-1);
+ tag_tx -> exon_list = ArrayListCreate(6);
+ ArrayListSetDeallocationFunction(tag_tx -> exon_list, free);
+ ArrayListPush(tag_gene -> transcript_list, tag_tx);
+ }
+
+ txunique_exon_t * tag_exon = malloc(sizeof(txunique_exon_t));
+ strncpy(tag_exon -> chro_name, chrome_name, MAX_CHROMOSOME_NAME_LEN -1);
+ tag_exon -> exon_start = start;
+ tag_exon -> exon_stop = end;
+ tag_exon -> is_negative_strand = is_negative_strand;
+ ArrayListPush(tag_tx -> exon_list, tag_exon);
+
+ return 0;
+}
+
+int txunique_load_annotation(txunique_context_t * context){
+ int loaded_features = load_features_annotation(context -> input_GTF_file_name, FILE_TYPE_GTF, context -> gene_name_column_name, context -> transcript_id_column_name, context -> used_feature_type, context, txunique_do_add_exon);
+ if(loaded_features<1) return -1;
+ return 0;
+}
+
+int txunique_process_flat_comp( void * ex1p, void * ex2p ){
+ txunique_exon_t * ex1 = ex1p;
+ txunique_exon_t * ex2 = ex2p;
+
+ if(ex1 -> exon_start < ex2 -> exon_start)return -1;
+ if(ex1 -> exon_start > ex2 -> exon_start)return 1;
+ return 0;
+}
+
+void debug_print_exs(ArrayList * exs){
+ int ex_i;
+ for(ex_i=0; ex_i<exs -> numOfElements; ex_i++){
+ txunique_exon_t *ex = ArrayListGet(exs,ex_i);
+ SUBREADprintf(" %s (%s) : %u ~ %u\n", ex->chro_name, ex->is_negative_strand?"NEG":"POS", ex->exon_start, ex->exon_stop);
+ }
+}
+
+ArrayList * txunique_process_flat_exons(ArrayList * exs){
+ ArrayList * ret = ArrayListCreate(5);
+ ArrayListSetDeallocationFunction(ret, free);
+ if(exs->numOfElements < 1) return ret;
+
+ //SUBREADputs("Before Sorting");
+ //debug_print_exs(exs);
+ int ex_i;
+ ArrayListSort(exs , txunique_process_flat_comp);
+
+ //SUBREADputs("Before Flatten");
+ //debug_print_exs(exs);
+
+ txunique_exon_t * memex = malloc(sizeof(txunique_exon_t));
+ memcpy(memex, ArrayListGet(exs,0), sizeof(txunique_exon_t));
+ ArrayListPush(ret, memex);
+
+ for(ex_i=1; ex_i<exs -> numOfElements; ex_i++){
+ txunique_exon_t * lastex = ArrayListGet(ret, ret -> numOfElements - 1);
+ txunique_exon_t * tryex = ArrayListGet(exs, ex_i);
+ if(tryex -> exon_start > lastex -> exon_stop + 1){
+ txunique_exon_t * memex = malloc(sizeof(txunique_exon_t));
+ memcpy(memex, tryex, sizeof(txunique_exon_t));
+ ArrayListPush(ret, memex);
+ }
+ else
+ lastex -> exon_stop = max(tryex -> exon_stop, lastex -> exon_stop);
+ }
+
+ //SUBREADputs("After Flatten");
+ //debug_print_exs(ret);
+ return ret;
+}
+
+struct _txunique_tmp_edges{
+ int is_exon_start;
+ int nsupp;
+ unsigned int base_open_end;
+};
+
+void debug_print_edges(ArrayList * exs){
+ int ex_i;
+ for(ex_i=0; ex_i<exs -> numOfElements; ex_i++){
+ struct _txunique_tmp_edges *ex = ArrayListGet(exs,ex_i);
+ SUBREADprintf(" %u : %s - nsup=%d\n", ex->base_open_end, ex->is_exon_start?"START":"END ", ex->nsupp);
+ }
+}
+
+int txunique_process_gene_edge_comp(void * e1p, void * e2p){
+ struct _txunique_tmp_edges * e1 = e1p;
+ struct _txunique_tmp_edges * e2 = e2p;
+ if(e1 -> base_open_end < e2 -> base_open_end) return -1;
+ if(e1 -> base_open_end > e2 -> base_open_end) return 1;
+
+ if(e1 -> is_exon_start && !e2 -> is_exon_start) return -1;
+ if(e2 -> is_exon_start && !e1 -> is_exon_start) return 1;
+ return 0;
+}
+
+void txunique_process_gene_chro(txunique_context_t * context, char * chro, int strand_mode, txunique_gene_t * gene){
+ int tx_i, ex_i;
+
+ ArrayList **flatten_trans = malloc(gene -> transcript_list -> numOfElements*sizeof(void *));
+ assert(flatten_trans);
+ ArrayList * edge_list = ArrayListCreate(6);
+ ArrayListSetDeallocationFunction(edge_list, free);
+
+ for(tx_i = 0; tx_i < gene -> transcript_list -> numOfElements; tx_i ++){
+ txunique_transcript_t * try_tx = ArrayListGet(gene -> transcript_list , tx_i);
+ ArrayList * used_exons = ArrayListCreate(6);
+ ArrayListSetDeallocationFunction(used_exons, free);
+
+ for(ex_i = 0; ex_i < try_tx -> exon_list -> numOfElements; ex_i ++){
+ txunique_exon_t * try_ex = ArrayListGet(try_tx -> exon_list, ex_i);
+ if(strand_mode != try_ex -> is_negative_strand|| strcmp(try_ex -> chro_name, chro)!=0)continue;
+ txunique_exon_t * memex = malloc(sizeof(txunique_exon_t));
+ memcpy(memex, try_ex, sizeof(txunique_exon_t));
+ ArrayListPush(used_exons, memex);
+ }
+
+ //SUBREADprintf("===== Process : %s : %s : %s : %s\n", gene -> gene_name, try_tx -> transcript_id, chro, strand_mode?"NEG":"POS");
+ flatten_trans[tx_i] = txunique_process_flat_exons(used_exons);
+ ArrayListDestroy(used_exons);
+
+ for(ex_i = 0; ex_i < flatten_trans[tx_i] -> numOfElements; ex_i ++){
+ txunique_exon_t * try_ex = ArrayListGet(flatten_trans[tx_i], ex_i);
+ struct _txunique_tmp_edges * edge_start = malloc(sizeof(struct _txunique_tmp_edges)), *edge_end = malloc(sizeof(struct _txunique_tmp_edges));
+ edge_start -> is_exon_start = 1;
+ edge_start -> base_open_end = try_ex -> exon_start;
+ edge_start -> nsupp = 0;
+ edge_end -> is_exon_start = 0;
+ edge_end -> base_open_end = try_ex -> exon_stop + 1;
+ edge_end -> nsupp = 0;
+
+ ArrayListPush(edge_list, edge_start);
+ ArrayListPush(edge_list, edge_end);
+ }
+ }
+
+ if(edge_list -> numOfElements >0){
+ ArrayListSort(edge_list, txunique_process_gene_edge_comp);
+ ArrayList * combined_edge_list = ArrayListCreate(6);
+ ArrayListSetDeallocationFunction(combined_edge_list, free);
+
+ struct _txunique_tmp_edges * merged_edge = ArrayListGet(edge_list, 0);
+ merged_edge -> nsupp = 1;
+ for(ex_i = 1; ex_i <= edge_list -> numOfElements; ex_i ++){
+ struct _txunique_tmp_edges * tmpedge = NULL;
+ if(ex_i < edge_list -> numOfElements) tmpedge = ArrayListGet(edge_list, ex_i);
+
+ if(NULL == tmpedge || merged_edge -> is_exon_start != tmpedge -> is_exon_start || merged_edge -> base_open_end != tmpedge -> base_open_end){
+ struct _txunique_tmp_edges * memedge = malloc(sizeof(struct _txunique_tmp_edges));
+ memcpy(memedge, merged_edge, sizeof(struct _txunique_tmp_edges));
+ ArrayListPush(combined_edge_list, memedge);
+ if(tmpedge){
+ merged_edge = tmpedge;
+ merged_edge -> nsupp = 1;
+ }
+ }else merged_edge -> nsupp++;
+ }
+
+ //debug_print_edges(combined_edge_list);
+ for(tx_i = 0; tx_i < gene -> transcript_list -> numOfElements; tx_i ++){
+ txunique_transcript_t * try_tx = ArrayListGet(gene -> transcript_list , tx_i);
+ ArrayList * flatten_exons = flatten_trans[tx_i];
+
+ unsigned int total_bases = 0, unique_bases = 0, unique_start = 0, total_start = 0;
+ int overlapping_count = 0, txex_i = 0, is_on = 0;
+ for(ex_i = 0; ex_i < combined_edge_list -> numOfElements; ex_i ++){
+ txunique_exon_t * txex = NULL;
+ if(txex_i < flatten_exons -> numOfElements) txex = ArrayListGet(flatten_exons, txex_i);
+ struct _txunique_tmp_edges *edge = ArrayListGet(combined_edge_list, ex_i);
+
+ //SUBREADprintf(" TXN %s, Edge: %u (%s), Exon [%d]: %u ~ %u, %s, depth=%d, uniq_base=%u, all_base=%u\n", try_tx->transcript_id, edge -> base_open_end, edge -> is_exon_start?"START":"END ", txex_i, txex?txex->exon_start:0, txex?txex->exon_stop:0 , is_on?"ON":"OFF", overlapping_count, unique_bases, total_bases);
+ if(total_start<1) assert(!is_on);
+ if(!is_on) assert(total_start < 1);
+ if(edge -> is_exon_start){
+ overlapping_count += edge -> nsupp;
+
+ if(txex && edge -> base_open_end == txex -> exon_start)
+ is_on = 1;
+
+ if(unique_start >0){
+ unique_bases += edge -> base_open_end - unique_start;
+ unique_start = 0;
+ assert(overlapping_count > 1);
+ }else if(overlapping_count == 1&& is_on) unique_start = edge -> base_open_end;
+
+ if(total_start<1 && is_on)total_start = edge -> base_open_end;
+ }
+
+ if(is_on) assert(overlapping_count>0);
+
+ if(!edge -> is_exon_start){
+ if(txex && edge -> base_open_end == txex -> exon_stop+1) is_on = 0;
+ overlapping_count -= edge -> nsupp;
+
+ if(unique_start >0){
+ unique_bases += edge -> base_open_end - unique_start;
+ unique_start =0;
+ assert(overlapping_count ==0);
+ } else if(overlapping_count == 1 && is_on) unique_start = edge -> base_open_end;
+
+ if(total_start && !is_on){
+ total_bases += edge -> base_open_end - total_start;
+ total_start = 0;
+ }
+ }
+
+ if(overlapping_count<1) assert(!is_on);
+ if(!is_on) assert(total_start < 1);
+ if(total_start<1) assert(!is_on);
+
+ if(txex && edge-> is_exon_start==0){
+ if(edge-> base_open_end > txex -> exon_stop) txex_i++;
+ }
+
+ //SUBREADprintf(" %s, Edge: %u (%s), Exon [%d]: %u ~ %u, %s, depth=%d, uniq_base=%u, all_base=%u\n\n", try_tx->transcript_id, edge -> base_open_end, edge -> is_exon_start?"START":"END ", txex_i, txex?txex->exon_start:0, txex?txex->exon_stop:0 , is_on?"ON":"OFF", overlapping_count, unique_bases, total_bases);
+ }
+ assert(overlapping_count == 0);
+ assert(total_start == 0);
+ assert(unique_start == 0);
+ assert(is_on == 0);
+
+ char * hash_key = malloc(strlen(try_tx->transcript_id) + strlen(gene -> gene_name)+20);
+ sprintf(hash_key, "%s\t%s\nALL", gene -> gene_name, try_tx->transcript_id);
+ int old_all_bases = HashTableGet(context -> result_table, hash_key) - NULL;
+ if(old_all_bases < 1)old_all_bases = 1;
+ HashTablePut(context -> result_table, hash_key, NULL + old_all_bases + total_bases);
+
+ hash_key = malloc(strlen(try_tx->transcript_id) + strlen(gene -> gene_name)+20);
+ sprintf(hash_key, "%s\t%s\nUNIQUE", gene -> gene_name, try_tx->transcript_id);
+ old_all_bases = HashTableGet(context -> result_table, hash_key) - NULL;
+ if(old_all_bases < 1)old_all_bases = 1;
+ HashTablePut(context -> result_table, hash_key, NULL + old_all_bases + unique_bases);
+ }
+
+ ArrayListDestroy(combined_edge_list);
+
+ }
+ ArrayListDestroy(edge_list);
+ free(flatten_trans);
+}
+
+void txunique_process_write_gene(void * key, void * hashed_obj, HashTable * tab){
+ txunique_context_t * context = tab -> appendix1;
+ FILE * out_fp = tab -> appendix2;
+ int tx_i;
+
+ txunique_gene_t * gene = hashed_obj;
+ for(tx_i = 0; tx_i < gene -> transcript_list -> numOfElements; tx_i ++){
+ txunique_transcript_t * try_tx = ArrayListGet(gene -> transcript_list , tx_i);
+ char hash_key [ FEATURE_NAME_LENGTH * 2 + 20];
+ sprintf(hash_key, "%s\t%s\nALL", gene->gene_name, try_tx -> transcript_id);
+ int all_bases = HashTableGet(context -> result_table, hash_key)-NULL-1;
+ sprintf(hash_key, "%s\t%s\nUNIQUE", gene->gene_name, try_tx -> transcript_id);
+ int unique_bases = HashTableGet(context -> result_table, hash_key)-NULL-1;
+ fprintf(out_fp, "%s\t%s\t%d\t%d\n", gene->gene_name, try_tx -> transcript_id, unique_bases, all_bases);
+ }
+
+}
+
+void txunique_process_gene(void * key, void * hashed_obj, HashTable * tab){
+ txunique_context_t * context = tab -> appendix1;
+ txunique_gene_t * gene = hashed_obj;
+ int tx_i, ex_i, ch_i;
+
+ ArrayList * chro_list = ArrayListCreate(5);
+ for(tx_i = 0; tx_i < gene -> transcript_list -> numOfElements; tx_i ++){
+ txunique_transcript_t * try_tx = ArrayListGet(gene -> transcript_list , tx_i);
+ for(ex_i = 0; ex_i < try_tx -> exon_list -> numOfElements; ex_i ++){
+ txunique_exon_t * try_ex = ArrayListGet(try_tx -> exon_list, ex_i);
+ int found_chro = 0;
+ for(ch_i = 0; ch_i < chro_list->numOfElements; ch_i++){
+ char * t_chro = ArrayListGet(chro_list, ch_i);
+ if(strcmp(t_chro, try_ex -> chro_name)==0){
+ found_chro = 1;
+ break;
+ }
+ }
+ if(!found_chro) ArrayListPush(chro_list, try_ex -> chro_name);
+ }
+ }
+
+ for(ch_i = 0; ch_i < chro_list -> numOfElements; ch_i ++){
+ int is_negative_strand;
+ char * chro = ArrayListGet(chro_list, ch_i);
+ for(is_negative_strand = 0; is_negative_strand<2; is_negative_strand++)
+ txunique_process_gene_chro(context, chro, is_negative_strand, gene);
+ }
+}
+
+int txunique_find_unique_bases(txunique_context_t * context){
+ context -> gene_table -> appendix1 = context;
+ HashTableIteration(context -> gene_table, txunique_process_gene);
+ return 0;
+}
+
+int txunique_write_output_file(txunique_context_t * context){
+ FILE * out_fp = fopen(context -> output_file_name, "w" );
+ if(!out_fp){
+ SUBREADprintf("ERROR: unable to write output file : '%s'\n", context -> output_file_name);
+ return 1;
+ }
+ fprintf(out_fp, "Gene_ID\tTranscript_ID\tUnique_Bases\tAll_Bases\n");
+
+ context -> gene_table -> appendix1 = context;
+ context -> gene_table -> appendix2 = out_fp;
+ HashTableIteration(context -> gene_table, txunique_process_write_gene);
+ fclose(out_fp);
+ return 0;
+}
+
+int txunique_destroy_context(txunique_context_t * context){
+ HashTableDestroy(context -> gene_table);
+ HashTableDestroy(context -> result_table);
+ return 0;
+}
+
+int txunique_parse_options(txunique_context_t * context, int argc, char ** argv){
+ int c;
+
+ optind = 1;
+ opterr = 1;
+ optopt = 63;
+
+ int sort_needed = 0;
+ while ((c = getopt (argc, argv, "a:o:g:t:f:h"))!=-1){
+ switch(c){
+ case 'a':
+ strcpy(context -> input_GTF_file_name, optarg);
+ break;
+
+ case 'o':
+ strcpy(context -> output_file_name, optarg);
+ break;
+
+ case 'g':
+ strcpy(context -> gene_name_column_name, optarg);
+ break;
+
+ case 't':
+ strcpy(context -> transcript_id_column_name, optarg);
+ break;
+
+ case 'f':
+ strcpy(context -> used_feature_type, optarg);
+ break;
+
+ default:
+ SUBREADputs("./txUnique -a <GTF_file> -o <output_text> { -g <gene_id_column> -t <transcript_id_column> -f <feature_type> }");
+ break;
+ }
+ }
+
+ if( context -> input_GTF_file_name[0]==0 || context -> output_file_name[0] == 0 ){
+ SUBREADputs("The GTF file name and the output file name must be specified.");
+ return 1;
+ }
+ return 0;
+}
+
+#ifdef MAKE_STANDALONE
+int main(int argc, char ** argv){
+#else
+int TxUniqueMain(int argc, char ** argv){
+#endif
+ txunique_context_t context;
+ int ret = 0;
+
+ ret = ret || txunique_init_context(&context);
+ ret = ret || txunique_parse_options(&context, argc, argv);
+ ret = ret || txunique_load_annotation(&context);
+ ret = ret || txunique_find_unique_bases(&context);
+ ret = ret || txunique_write_output_file(&context);
+ ret = ret || txunique_destroy_context(&context);
+ if(!ret) SUBREADputs("All finished.");
+ return ret;
+}
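
Editor's note on the edge sweep in txunique_process_gene_chro: the loop over combined_edge_list walks sorted exon start and open-end edges while tracking how many exons cover the current position, and credits a base as unique only when the transcript's own exon is the sole cover. The block below is a minimal standalone sketch of that idea, not the code from tx-unique.c. The names interval_t, edge_t, edge_cmp, add_edges and count_bases are hypothetical, and it assumes the transcript's own exons are already flattened (non-overlapping) and that intervals are closed.

```c
#include <stdlib.h>

/* Hypothetical types for this sketch; tx-unique.c uses its own structs. */
typedef struct { unsigned int start, stop; } interval_t;   /* closed interval */
typedef struct { unsigned int pos; int depth_delta, own_delta; } edge_t;

/* Order edges by position only; ties need no ordering because the
 * segment between two edges at the same position has length zero. */
static int edge_cmp(const void *a, const void *b){
	unsigned int x = ((const edge_t *)a)->pos, y = ((const edge_t *)b)->pos;
	return (x > y) - (x < y);
}

/* Append a start edge and an open end edge (stop + 1) for every interval. */
static int add_edges(edge_t *edges, int n, const interval_t *iv, int n_iv, int is_own){
	int i;
	for(i = 0; i < n_iv; i++){
		edge_t s = { iv[i].start, 1, is_own };
		edge_t e = { iv[i].stop + 1, -1, -is_own };
		edges[n++] = s;
		edges[n++] = e;
	}
	return n;
}

/* Count all bases of the target transcript (own) and those covered by
 * no exon of any other transcript (others). Returns 0 on success. */
int count_bases(const interval_t *own, int n_own,
                const interval_t *others, int n_others,
                unsigned int *total, unsigned int *unique){
	int n = 0, i, depth = 0, own_depth = 0;
	unsigned int prev = 0;
	edge_t *edges = malloc(sizeof(edge_t) * (2 * (size_t)(n_own + n_others) + 1));

	*total = *unique = 0;
	if(!edges) return -1;
	n = add_edges(edges, n, own, n_own, 1);
	n = add_edges(edges, n, others, n_others, 0);
	qsort(edges, (size_t)n, sizeof(edge_t), edge_cmp);

	for(i = 0; i < n; i++){
		/* Account for the segment [prev, pos) using the state before this edge. */
		unsigned int len = edges[i].pos - prev;
		if(own_depth > 0){
			*total += len;
			if(depth == 1) *unique += len;   /* only the own exon covers it */
		}
		depth += edges[i].depth_delta;
		own_depth += edges[i].own_delta;
		prev = edges[i].pos;
	}
	free(edges);
	return 0;
}
```

With own exons 10..19 and 30..39 and other-transcript exons 15..24 and 35..36, the sketch reports 20 total bases and 13 unique ones (10..14, 30..34 and 37..39). Unlike the committed code it does not merge identical edges into a single entry with a support count; it trades that optimisation for a shorter loop.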
diff --git a/src/tx-unique.h b/src/tx-unique.h
new file mode 100644
index 0000000..123e29d
--- /dev/null
+++ b/src/tx-unique.h
@@ -0,0 +1,37 @@
+#ifndef __TX_UNIQUE_H_
+#define __TX_UNIQUE_H_
+
+#include "subread.h"
+#include "hashtable.h"
+#include "input-files.h"
+#include "HelperFunctions.h"
+
+typedef struct{
+ char chro_name[MAX_CHROMOSOME_NAME_LEN];
+ unsigned int exon_start;
+ unsigned int exon_stop;
+ int is_negative_strand;
+} txunique_exon_t;
+
+typedef struct{
+ char transcript_id[FEATURE_NAME_LENGTH];
+ ArrayList * exon_list;
+} txunique_transcript_t;
+
+typedef struct{
+ char gene_name[FEATURE_NAME_LENGTH];
+ ArrayList * transcript_list;
+} txunique_gene_t;
+
+typedef struct{
+ char input_GTF_file_name[MAX_FILE_NAME_LENGTH];
+ char output_file_name[MAX_FILE_NAME_LENGTH];
+ char gene_name_column_name[FEATURE_NAME_LENGTH];
+ char transcript_id_column_name[FEATURE_NAME_LENGTH];
+ char used_feature_type[FEATURE_NAME_LENGTH];
+
+ HashTable * gene_table; // gene_id => array of transcripts [ (transcript_id, list_of_exons) ]
+ HashTable * result_table; // "$gene_id\t$transcript_id\nALL|UNIQUE" => NULL + number + 1
+} txunique_context_t;
+
+#endif
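
Editor's note on result_table: the comment in txunique_context_t says each value is stored as "NULL + number + 1", so that a stored count of zero can be told apart from a missing key (for which the reader in this commit expects NULL). The committed code does this with arithmetic on NULL, which relies on a compiler extension. The block below is a hedged sketch of the same bias-by-one encoding using uintptr_t; counter_encode, counter_decode and counter_add are hypothetical helpers, not part of subread, and integer-to-pointer conversion remains implementation-defined even though it behaves as expected on mainstream platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Store count + 1 in the pointer slot so that a count of 0 is not NULL. */
static void *counter_encode(unsigned int count){
	return (void *)((uintptr_t)count + 1);
}

/* A NULL slot means "key absent" and decodes to 0. */
static unsigned int counter_decode(const void *slot){
	return slot ? (unsigned int)((uintptr_t)slot - 1) : 0;
}

/* Return the new slot value after adding inc to whatever was stored. */
static void *counter_add(const void *old_slot, unsigned int inc){
	return counter_encode(counter_decode(old_slot) + inc);
}
```

A caller would pass the result of the table lookup to counter_add and store the returned value back under the same key, then use counter_decode when writing the output.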
diff --git a/test/subread-align/subread-align-test.sh b/test/subread-align/subread-align-test.sh
index f611a85..f57f8c4 100644
--- a/test/subread-align/subread-align-test.sh
+++ b/test/subread-align/subread-align-test.sh
@@ -22,7 +22,7 @@ echo "*** SINGLE-END READS NO ERROR NO DUP ******" >> test-tmp.log
echo "*************************************************" >> test-tmp.log
echo >>test-tmp.log
-$SUBREAD_HOME/subread-align --SAMoutput -t0 -u -i ../small1 -r data/test-noerror-r1.fq -o result/test-tmp.sam -H -J
+$SUBREAD_HOME/subread-align --SAMoutput -t0 -i ../small1 -r data/test-noerror-r1.fq -o result/test-tmp.sam -H -J
cat result/test-tmp.sam | $PYTHON_EXEC readname_ora_match.py >>test-tmp.log
@@ -42,7 +42,7 @@ echo "*** READS NO ERROR, NO DUPLICATED REPORT ******" >> test-tmp.log
echo "*************************************************" >> test-tmp.log
echo >>test-tmp.log
-$SUBREAD_HOME/subread-align --SAMoutput -t0 -u -i ../small1 -r data/test-noerror-r1.fq -R data/test-noerror-r2.fq -o result/test-tmp.sam -Q -J
+$SUBREAD_HOME/subread-align --SAMoutput -t0 -i ../small1 -r data/test-noerror-r1.fq -R data/test-noerror-r2.fq -o result/test-tmp.sam -Q -J
cat result/test-tmp.sam | $PYTHON_EXEC readname_ora_match.py >>test-tmp.log
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/subread.git