[med-svn] [Git][med-team/flexbar][upstream] New upstream version 3.4.0
Andreas Tille
gitlab at salsa.debian.org
Sun Jul 29 08:52:21 BST 2018
Andreas Tille pushed to branch upstream at Debian Med / flexbar
Commits:
5d9caef2 by Andreas Tille at 2018-07-29T07:40:49Z
New upstream version 3.4.0
- - - - -
19 changed files:
- .gitignore
- CMakeLists.txt
- README.md
- src/Flexbar.cpp
- src/Flexbar.h
- src/FlexbarIO.h
- src/FlexbarTypes.h
- + src/LoadAdapters.h
- src/LoadFasta.h
- src/Options.h
- src/PairedAlign.h
- src/PairedInput.h
- src/PairedOutput.h
- src/QualTrimming.h
- src/SeqAlign.h
- src/SeqAlignAlgo.h
- + src/SeqAlignPair.h
- src/SeqInput.h
- src/SeqOutput.h
Changes:
=====================================
.gitignore
=====================================
--- a/.gitignore
+++ b/.gitignore
@@ -7,5 +7,6 @@ cmake_install.cmake
# misc
flexbar
.DS_Store
+wget-log
include
local
=====================================
CMakeLists.txt
=====================================
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -2,7 +2,7 @@ cmake_minimum_required( VERSION 2.8.2 )
project( FLEXBAR )
-set( SEQAN_APP_VERSION "3.2.0" )
+set( SEQAN_APP_VERSION "3.4.0" )
include_directories( ${FLEXBAR_SOURCE_DIR}/include )
# link_directories( ${FLEXBAR_SOURCE_DIR}/lib )
=====================================
README.md
=====================================
--- a/README.md
+++ b/README.md
@@ -4,48 +4,50 @@ The program Flexbar preprocesses high-throughput sequencing data efficiently. It
Refer to the [manual](https://github.com/seqan/flexbar/wiki) or contact [Johannes Roehr](https://github.com/jtroehr) for support with this application.
-![Flexbar logo](https://github.com/seqan/flexbar/wiki/images/flexbar-logo.png)
-
### References
-Johannes T. Roehr, Christoph Dieterich, Knut Reinert: Flexbar 3.0 – SIMD and multicore parallelization. Bioinformatics 2017.
+Johannes T. Roehr, Christoph Dieterich, Knut Reinert:
+Flexbar 3.0 – SIMD and multicore parallelization. Bioinformatics 2017.
See article on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/28541403)
-Matthias Dodt, Johannes T. Roehr, Rina Ahmed, Christoph Dieterich: Flexbar – flexible barcode and adapter processing for next-generation sequencing platforms. Biology 2012.
+Matthias Dodt, Johannes T. Roehr, Rina Ahmed, Christoph Dieterich:
+Flexbar – flexible barcode and adapter processing for next-generation sequencing platforms. Biology 2012.
See article on [PubMed](https://www.ncbi.nlm.nih.gov/pubmed/24832523)
+![Flexbar logo](https://github.com/seqan/flexbar/wiki/images/flexbar-logo.png)
+
### Download
-Flexbar source code as well as binaries for Linux and Mac OS can be downloaded on the [release](https://github.com/seqan/flexbar/releases) page. Please follow instructions for building or setup of binaries below. Additionally, Flexbar is available via package manager on Debian systems. Versions before 2.4 can be found on the [old](https://sourceforge.net/projects/flexbar) page.
+Flexbar source code as well as binaries for Linux and Mac OS can be downloaded on the [release](https://github.com/seqan/flexbar/releases) page. Please follow instructions for building or setup of binaries below. Additionally, Flexbar is available via package manager on Debian systems and in Conda. Versions before 2.4 can be found on the [old](https://sourceforge.net/projects/flexbar) page.
### Building from source
Make sure that `cmake` is available, as well as development and runtime files of the TBB library 4.0 or later (Intel Threading Building Blocks). For example on Debian systems, install the packages `libtbb-dev` and `libtbb2`. Furthermore, the SeqAn library and a compiler that supports C++14 is required:
-* Get SeqAn library version 2.2.0 [here](https://github.com/seqan/seqan/releases/download/seqan-v2.2.0/seqan-library-2.2.0.tar.xz)
-* Download Flexbar 3.2 source code [release](https://github.com/seqan/flexbar/releases)
+* Get SeqAn library version 2.4.0 [here](https://github.com/seqan/seqan/releases/download/seqan-v2.4.0/seqan-library-2.4.0.tar.xz)
+* Download Flexbar 3.4 source code [release](https://github.com/seqan/flexbar/releases)
Decompress both files:
- tar xzf flexbar-3.2.0.tar.gz
- tar xJf seqan-library-2.2.0.tar.xz
+ tar xzf flexbar-3.4.0.tar.gz
+ tar xJf seqan-library-2.4.0.tar.xz
Move SeqAn include folder to Flexbar:
- mv seqan-library-2.2.0/include flexbar-3.2.0
+ mv seqan-library-2.4.0/include flexbar-3.4.0
Use these commands for building:
- cd flexbar-3.2.0
+ cd flexbar-3.4.0
cmake .
make
-Flexbar version 2.7 requires SeqAn 2.1.1 instead. Releases prior to 2.7 use the SeqAn 1.4.2 library.
+Flexbar versions from 3.0 up to 3.2 require SeqAn 2.2.0 instead. Flexbar version 2.7 uses SeqAn 2.1.1 and releases prior to 2.7 use the SeqAn 1.4.2 library.
### Binaries
@@ -69,17 +71,37 @@ Flexbar needs at least one file with sequencing reads in fasta or fastq format a
flexbar -r reads [-b barcodes] [-a adapters] [options]
-Refer to the help screen `flexbar -h` or [manual](https://github.com/seqan/flexbar/wiki) for more information. Although default parameters of Flexbar are optimized to deliver good results in many scenarios, the adjustment of parameters might improve results, e.g. `--adapter-min-overlap`. To run tests, make sure `flexbar` is reachable via the path variable and run `flexbar_test.sh` within the test folder.
+Refer to the help screen `flexbar -h` or [manual](https://github.com/seqan/flexbar/wiki) for more information. Although default parameters of Flexbar are optimized to deliver good results in many scenarios, the adjustment of parameters like `--adapter-min-overlap` might improve results. For tests, run `flexbar_test.sh` within the test folder if `flexbar` is reachable via the path variable.
+
+#### Quality-based trimming
+
+In this example, reads in fastq format are trimmed based on their quality scores in Illumina version 1.8 format. The TAIL method trims the right end of reads until a quality score equal or higher than the threshold is reached, default 20. Trimmed reads are written to `target.fastq` in same format as the input.
+
+ flexbar -r reads.fq -t target -q TAIL -qf i1.8
+
+#### Demultiplexing with barcodes
+
+Reads that are barcoded on the left end are demultiplexed by specifying a file with barcodes in fasta format. Reads that can be assigned are written to separate files using file names that are based on the names of barcodes in the fasta file.
+
+ flexbar -r reads.fq -b barcodes.fa -bt LTAIL
+
+#### Adapter removal single-end
+
+To remove adapter sequences from single-end reads, specify a file with adapters in fasta format. These are removed from the right side of reads per default, if they do not align before the read start. The left side of reads is kept if long enough. The overlap of an adapter and read must have at least length 3 with at most 10% errors in default settings.
+
+ flexbar -r reads.fq -a adapters.fa -ao 3 -ae 0.1
+
+#### Adapter removal paired-end
-#### Examples
+For paired-end libraries, specify both files with paired reads and a fasta file with adapters for removal. Given adapters are trimmed in right mode per default. It is recommended to activate the pair overlap detection in case of standard paired reads. This increases the sensitivity by removing very short parts of adapters if an overlap is detected for a pair.
-In this example, reads that are barcoded on left side are demultiplexed by specifying a file with barcodes in fasta format. After separation of reads, given adapters are removed from the right side if they do not align before read start. The left side of reads is kept if long enough. Remaining reads are written to the file `target.fastq` in same format as the input.
+ flexbar -r r1.fq -p r2.fq -a adapters.fa -ap ON
- flexbar -r reads.fq -t target -b brc.fa -be LTAIL -a adp.fa
+#### Adapter removal presets
-The second example shows how to trim compressed reads based on their quality scores in illumina version 1.8 format. Afterwards, provided adapters are removed in right trim-end mode, only if the overlap of adapter and read has at least length 5 with at most 20% errors.
+Several adapter presets for Illumina libraries are included in Flexbar. For example, select the `TruSeq` preset for standard TruSeq adapters and specify two read files for paired reads. If a preset is chosen, a separate file with adapters is not needed for removal. It is recommended to turn on the pair overlap detection for standard paired-end libraries.
- flexbar -r reads.fq.gz -q TAIL -qf i1.8 -a adp.fa -ao 5 -at 0.2
+ flexbar -r r1.fq -p r2.fq -aa TruSeq -ap ON
For further examples visit the [manual](https://github.com/seqan/flexbar/wiki) page.
=====================================
src/Flexbar.cpp
=====================================
--- a/src/Flexbar.cpp
+++ b/src/Flexbar.cpp
@@ -2,11 +2,11 @@
Flexbar - flexible barcode and adapter removal
- Version 3.2.0
+ Version 3.4.0
BSD 3-Clause License
- uses SeqAn library release 2.2.0
+ uses SeqAn library release 2.4.0
and TBB library 4.0 or later
@@ -29,8 +29,8 @@ int main(int argc, const char* argv[]){
using namespace std;
using namespace seqan;
- const string version = "3.2";
- const string date = "May 2018";
+ const string version = "3.4.0";
+ const string date = "June 2018";
ArgumentParser parser("flexbar");
=====================================
src/Flexbar.h
=====================================
--- a/src/Flexbar.h
+++ b/src/Flexbar.h
@@ -27,6 +27,7 @@
#include "Options.h"
#include "FlexbarIO.h"
#include "LoadFasta.h"
+#include "LoadAdapters.h"
#include "SeqInput.h"
#include "PairedInput.h"
#include "PairedOutput.h"
@@ -68,56 +69,69 @@ void loadBarcodes(Options &o, const bool secondSet){
template <typename TSeqStr, typename TString>
void loadAdapters(Options &o, const bool secondSet, const bool useAdapterFile){
-
+
using namespace std;
using namespace flexbar;
- LoadFasta<TSeqStr, TString> lf(o, true);
-
- if(useAdapterFile){
+ if(o.aPreset != APOFF){
+ LoadAdapters<TSeqStr, TString> la(o);
- string adapFile = secondSet ? o.adapter2File : o.adapterFile;
+ la.loadSequences(secondSet);
- lf.loadSequences(adapFile);
+ if(secondSet) o.adapters2 = la.getAdapters();
+ else o.adapters = la.getAdapters();
+
+ if(secondSet) la.printAdapters("Adapter2");
+ else la.printAdapters("Adapter");
+ }
+ else{
+ LoadFasta<TSeqStr, TString> lf(o, true);
- if(secondSet){
- o.adapters2 = lf.getBars();
+ if(useAdapterFile){
+
+ string adapFile = secondSet ? o.adapter2File : o.adapterFile;
+
+ lf.loadSequences(adapFile);
- if(o.adapters2.size() == 0){
- cerr << "\nERROR: No adapters found in file.\n" << endl;
- exit(1);
+ if(secondSet){
+ o.adapters2 = lf.getBars();
+
+ if(o.adapters2.size() == 0){
+ cerr << "\nERROR: No adapters found in file.\n" << endl;
+ exit(1);
+ }
+ }
+ else{
+ o.adapters = lf.getBars();
+
+ if(o.adapters.size() == 0){
+ cerr << "\nERROR: No adapters found in file.\n" << endl;
+ exit(1);
+ }
}
}
else{
- o.adapters = lf.getBars();
-
- if(o.adapters.size() == 0){
- cerr << "\nERROR: No adapters found in file.\n" << endl;
- exit(1);
+ if(o.rcMode == RCOFF || o.rcMode == RCON){
+ TBar bar;
+ bar.id = "cmdline";
+ bar.seq = o.adapterSeq;
+ o.adapters.push_back(bar);
}
+ if(o.rcMode == RCON || o.rcMode == RCONLY){
+ TSeqStr adapterSeqRC = o.adapterSeq;
+ seqan::reverseComplement(adapterSeqRC);
+
+ TBar barRC;
+ barRC.id = "cmdline_rc";
+ barRC.seq = adapterSeqRC;
+ o.adapters.push_back(barRC);
+ }
+ lf.setBars(o.adapters);
}
+
+ if(secondSet) lf.printBars("Adapter2");
+ else lf.printBars("Adapter");
}
- else{
- if(o.rcMode == RCOFF || o.rcMode == RCON){
- TBar bar;
- bar.id = "cmdline";
- bar.seq = o.adapterSeq;
- o.adapters.push_back(bar);
- }
- if(o.rcMode == RCON || o.rcMode == RCONLY){
- TSeqStr adapterSeqRC = o.adapterSeq;
- seqan::reverseComplement(adapterSeqRC);
-
- TBar barRC;
- barRC.id = "cmdline_rc";
- barRC.seq = adapterSeqRC;
- o.adapters.push_back(barRC);
- }
- lf.setBars(o.adapters);
- }
-
- if(secondSet) lf.printBars("Adapter2");
- else lf.printBars("Adapter");
}
@@ -192,13 +206,13 @@ void printMessage(Options &o){
string s = "Flexbar completed ";
- if(o.barDetect != BOFF) s += "barcode";
- if(o.barDetect == WITHIN_READ_REMOVAL) s += " removal within reads";
- if(o.barDetect == WITHIN_READ) s += " detection within reads";
- if(o.barDetect == BARCODE_READ) s += " detection with separate reads";
- if(o.barDetect != BOFF && o.adapRm != AOFF) s += " and ";
- if(o.barDetect == BOFF && o.adapRm == AOFF) s += "basic processing";
- if(o.adapRm != AOFF) s += "adapter removal";
+ if(o.barDetect != BOFF) s += "barcode";
+ if(o.barDetect == WITHIN_READ_REMOVAL) s += " removal within reads";
+ if(o.barDetect == WITHIN_READ) s += " detection within reads";
+ if(o.barDetect == BARCODE_READ) s += " detection with separate reads";
+ if(o.barDetect != BOFF && (o.adapRm != AOFF || o.poMode != POFF)) s += " and ";
+ if(o.barDetect == BOFF && o.adapRm == AOFF && o.poMode == POFF) s += "basic processing";
+ if(o.adapRm != AOFF || o.poMode != POFF) s += "adapter removal";
*o.out << s << ".\n" << endl;
@@ -246,6 +260,8 @@ void startProcessing(Options &o){
if(o.writeLengthDist) outputFilter.writeLengthDist();
+ if(o.poMode != POFF) alignFilter.printPairOverlapStats();
+
if(o.adapRm != AOFF){
outputFilter.printAdapterRemovalStats();
alignFilter.printAdapterOverlapStats();
@@ -296,7 +312,7 @@ void startProcessing(Options &o){
if(o.barDetect != BOFF && ! o.writeUnassigned)
*out << " skipped unassigned reads " << alignValue(len, alignFilter.getNrUnassignedReads()) << endl;
- if(o.adapRm != AOFF)
+ if(o.adapRm != AOFF || o.poMode != POFF)
*out << " short prior to adapter removal " << alignValue(len, alignFilter.getNrPreShortReads()) << endl;
if(o.qTrim != QOFF && o.qtrimPostRm)
=====================================
src/FlexbarIO.h
=====================================
--- a/src/FlexbarIO.h
+++ b/src/FlexbarIO.h
@@ -9,6 +9,7 @@
#include <fstream>
#include <seqan/seq_io.h>
+#include <seqan/stream.h>
#if SEQAN_HAS_ZLIB
#include <zlib.h>
@@ -46,6 +47,176 @@ void closeFile(std::fstream &strm){
}
+namespace seqan{
+
+ // Extension for input fasta file with dat ending
+
+ struct DatFastaAdaptor_;
+ using DatFastaAdaptor = Tag<DatFastaAdaptor_>;
+
+ // Specilaize sequence input file with custom tag
+ using DatFastaSeqFileIn = FormattedFile<Fastq, Input, DatFastaAdaptor>;
+
+ // Your custom format tag
+ struct DatFastaSeqFormat_;
+ using DatFastaSeqFormat = Tag<DatFastaSeqFormat_>;
+
+ // The extended TagList containing our custom format
+ using DatFastaSeqInFormats = TagList<DatFastaSeqFormat, SeqInFormats>;
+
+ // Overloaded file format metafunction
+ template <>
+ struct FileFormat<FormattedFile<Fastq, Input, DatFastaAdaptor> >{
+ using Type = TagSelector<DatFastaSeqInFormats>;
+ };
+
+ // Set magic header
+ template <typename T>
+ struct MagicHeader<DatFastaSeqFormat, T> : public MagicHeader<Fasta, T>{};
+
+ // Specify the valid ending for your fasta adaptor
+ template <typename T>
+ struct FileExtensions<DatFastaSeqFormat, T>{
+ static char const * VALUE[1];
+ };
+
+ template <typename T>
+ char const * FileExtensions<DatFastaSeqFormat, T>::VALUE[1] = { ".dat" };
+
+ // Overload inner readRecord function
+
+ template <typename TIdString, typename TSeqString, typename TSpec>
+ inline void
+ readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, DatFastaSeqFormat){
+ readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
+ }
+
+
+ // Extension for input fastq file with dat ending
+
+ struct DatFastqAdaptor_;
+ using DatFastqAdaptor = Tag<DatFastqAdaptor_>;
+
+ // Specilaize sequence input file with custom tag
+ using DatFastqSeqFileIn = FormattedFile<Fastq, Input, DatFastqAdaptor>;
+
+ // Your custom format tag
+ struct DatFastqSeqFormat_;
+ using DatFastqSeqFormat = Tag<DatFastqSeqFormat_>;
+
+ // The extended TagList containing our custom format
+ using DatFastqSeqInFormats = TagList<DatFastqSeqFormat, SeqInFormats>;
+
+ // Overloaded file format metafunction
+ template <>
+ struct FileFormat<FormattedFile<Fastq, Input, DatFastqAdaptor> >{
+ using Type = TagSelector<DatFastqSeqInFormats>;
+ };
+
+ // Set magic header
+ template <typename T>
+ struct MagicHeader<DatFastqSeqFormat, T> : public MagicHeader<Fastq, T>{};
+
+ // Specify the valid ending for your fastq adaptor
+ template <typename T>
+ struct FileExtensions<DatFastqSeqFormat, T>{
+ static char const * VALUE[1];
+ };
+
+ template <typename T>
+ char const * FileExtensions<DatFastqSeqFormat, T>::VALUE[1] = { ".dat" };
+
+ // Overload inner readRecord function
+
+ template <typename TIdString, typename TSeqString, typename TSpec>
+ inline void
+ readRecord(TIdString & id, TSeqString & seq, TIdString & qual, FormattedFile<Fastq, Input, TSpec> & file, DatFastqSeqFormat){
+ readRecord(id, seq, qual, file.iter, Fastq()); // Delegate to Fastq parser
+ }
+
+ template <typename TIdString, typename TSeqString, typename TSpec>
+ inline void
+ readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, DatFastqSeqFormat){
+ readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
+ }
+
+
+ // Extension for input fastq file with txt ending
+
+ struct FlexbarReadsAdaptor_;
+ using FlexbarReadsAdaptor = Tag<FlexbarReadsAdaptor_>;
+
+ // Specilaize sequence input file with custom tag
+ using FlexbarReadsSeqFileIn = FormattedFile<Fastq, Input, FlexbarReadsAdaptor>;
+
+ // Your custom format tag
+ struct FlexbarReadsSeqFormat_;
+ using FlexbarReadsSeqFormat = Tag<FlexbarReadsSeqFormat_>;
+
+ // The extended TagList containing our custom format
+ using FlexbarReadsSeqInFormats = TagList<FlexbarReadsSeqFormat, DatFastqSeqInFormats>;
+
+ // Overloaded file format metafunction
+ template <>
+ struct FileFormat<FormattedFile<Fastq, Input, FlexbarReadsAdaptor> >{
+ using Type = TagSelector<FlexbarReadsSeqInFormats>;
+ };
+
+ // Set magic header
+ template <typename T>
+ struct MagicHeader<FlexbarReadsSeqFormat, T> : public MagicHeader<Fastq, T>{};
+
+ // Specify the valid ending for your fastq adaptor
+ template <typename T>
+ struct FileExtensions<FlexbarReadsSeqFormat, T>{
+ static char const * VALUE[1];
+ };
+
+ template <typename T>
+ char const * FileExtensions<FlexbarReadsSeqFormat, T>::VALUE[1] = { ".txt" };
+
+ // Overload inner readRecord function
+
+ template <typename TIdString, typename TSeqString, typename TSpec>
+ inline void
+ readRecord(TIdString & id, TSeqString & seq, TIdString & qual, FormattedFile<Fastq, Input, TSpec> & file, FlexbarReadsSeqFormat){
+ readRecord(id, seq, qual, file.iter, Fastq()); // Delegate to Fastq parser
+ }
+
+ template <typename TIdString, typename TSeqString, typename TSpec>
+ inline void
+ readRecord(TIdString & id, TSeqString & seq, FormattedFile<Fastq, Input, TSpec> & file, FlexbarReadsSeqFormat){
+ readRecord(id, seq, file.iter, Fasta()); // Delegate to Fasta parser
+ }
+
+
+ // Extension for output reads file with dat ending
+
+ using FlexbarReadsSeqFileOut = FormattedFile<Fastq, Output, DatFastqAdaptor>;
+
+ using FlexbarReadsSeqOutFormats = TagList<DatFastqSeqFormat, SeqOutFormats>;
+
+ template <>
+ struct FileFormat<FormattedFile<Fastq, Output, DatFastqAdaptor> >{
+ using Type = TagSelector<FlexbarReadsSeqOutFormats>;
+ };
+
+ // Inner writeRecord function
+
+ template <typename TSpec, typename TIdString, typename TSeqString>
+ inline void
+ writeRecord(FormattedFile<Fastq, Output, TSpec> & file, TIdString & id, TSeqString & seq, TIdString & qual){
+ writeRecord(file.iter, id, seq, qual, Fastq()); // Delegate to Fastq parser
+ }
+
+ template <typename TSpec, typename TIdString, typename TSeqString>
+ inline void
+ writeRecord(FormattedFile<Fastq, Output, TSpec> & file, TIdString & id, TSeqString & seq){
+ writeRecord(file.iter, id, seq, Fasta()); // Delegate to Fasta parser
+ }
+}
+
+
void checkFileCompression(const std::string path){
using namespace std;
@@ -108,39 +279,38 @@ void checkInputType(const std::string path, flexbar::FileFormat &format, const b
if(c == '>') format = FASTA;
else if(c == '@') format = FASTQ;
else{
- cerr << "\nERROR: Reads file type not conform.\n";
- cerr << "Uncompressed fasta or fastq for stdin.\n" << endl;
+ cerr << "\nERROR: Format of reads from standard input not conform.\n";
+ cerr << "Use uncompressed fasta or fastq for stdin.\n" << endl;
exit(1);
}
}
else{
+ seqan::FlexbarReadsSeqFileIn seqFileIn;
- seqan::SeqFileIn seqFileIn;
-
- if(!open(seqFileIn, path.c_str())){
+ if(! open(seqFileIn, path.c_str())){
cerr << "\nERROR: Could not open file " << path << "\n" << endl;
exit(1);
}
- if(! atEnd(seqFileIn)){
-
- try{
- FSeqStr rseq;
- FString tag, qual;
+ try{
+ if(! atEnd(seqFileIn)){
+
+ FSeqStr seq;
+ FString id, qual;
- readRecord(tag, rseq, qual, seqFileIn);
+ readRecord(id, seq, qual, seqFileIn);
if(qual == "") format = FASTA;
else format = FASTQ;
}
- catch(seqan::Exception const &e){
- cerr << "\nERROR: " << e.what() << "\nProgram execution aborted.\n" << endl;
+ else{
+ cerr << "\nReads file seems to be empty.\n\n" << endl;
close(seqFileIn);
exit(1);
}
}
- else{
- cerr << "\nReads file seems to be empty.\n\n" << endl;
+ catch(seqan::Exception const &e){
+ cerr << "\nERROR: " << e.what() << "\nProgram execution aborted.\n" << endl;
close(seqFileIn);
exit(1);
}
=====================================
src/FlexbarTypes.h
=====================================
--- a/src/FlexbarTypes.h
+++ b/src/FlexbarTypes.h
@@ -11,13 +11,14 @@ class SeqRead {
TSeqStr seq;
TString id, qual, umi;
- bool rmAdapter, rmAdapterRC;
+ bool rmAdapter, rmAdapterRC, pairOverlap;
SeqRead(TSeqStr& sequence, TString& seqID) :
seq(sequence),
id(seqID),
rmAdapter(false),
- rmAdapterRC(false){
+ rmAdapterRC(false),
+ pairOverlap(false){
}
SeqRead(TSeqStr& sequence, TString& seqID, TString& quality) :
@@ -25,7 +26,8 @@ class SeqRead {
id(seqID),
qual(quality),
rmAdapter(false),
- rmAdapterRC(false){
+ rmAdapterRC(false),
+ pairOverlap(false){
}
};
@@ -132,6 +134,30 @@ namespace flexbar{
};
+ struct Adapters {
+ FString id, info;
+ FSeqStr seq1, seq2, seqc;
+ };
+
+
+ enum AdapterPreset {
+ APOFF,
+ TRUSEQ,
+ SMALLRNA,
+ METHYL,
+ RIBO,
+ NEXTERA,
+ NEXTERAMP
+ };
+
+
+ enum PairOverlap {
+ POFF,
+ PON,
+ PSHORT,
+ PONLY
+ };
+
enum RevCompMode {
RCOFF,
RCON,
=====================================
src/LoadAdapters.h
=====================================
--- /dev/null
+++ b/src/LoadAdapters.h
@@ -0,0 +1,170 @@
+// LoadAdapters.h
+
+#ifndef FLEXBAR_LOADADAPTERS_H
+#define FLEXBAR_LOADADAPTERS_H
+
+
+template <typename TSeqStr, typename TString>
+class LoadAdapters {
+
+private:
+
+ std::ostream *out;
+ tbb::concurrent_vector<flexbar::TBar> adapters;
+
+ flexbar::Adapters a;
+
+ const flexbar::AdapterPreset m_aPreset;
+ const flexbar::RevCompMode m_rcMode;
+
+public:
+
+ LoadAdapters(const Options &o) :
+
+ out(o.out),
+ m_aPreset(o.aPreset),
+ m_rcMode(o.rcMode){
+
+ using namespace flexbar;
+
+ // Illumina sequencing adapters
+ // Oligonucleotide sequences © 2018 Illumina, Inc. All rights reserved.
+ // Obtained from https://support.illumina.com/bulletins/2016/12/what-sequences-do-i-use-for-adapter-trimming.html
+
+ if(m_aPreset == TRUSEQ){
+ a.id = "TruSeq";
+ a.seq1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA";
+ a.seq2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
+ a.info = "TruSeq LT and TruSeq HT-based kits";
+ }
+ else if(m_aPreset == METHYL){
+ a.id = "TrueSeq-Methyl";
+ a.seq1 = "AGATCGGAAGAGCACACGTCTGAAC";
+ a.seq2 = "AGATCGGAAGAGCGTCGTGTAGGGA";
+ a.info = "ScriptSeq and TruSeq DNA Methylation";
+ }
+ else if(m_aPreset == SMALLRNA){
+ a.id = "TrueSeq-smallRNA";
+ a.seq1 = "TGGAATTCTCGGGTGCCAAGG";
+ a.info = "TruSeq Small RNA";
+ }
+ else if(m_aPreset == RIBO){
+ a.id = "TrueSeq-Ribo";
+ a.seq1 = "AGATCGGAAGAGCACACGTCT";
+ a.info = "TruSeq Ribo Profile";
+ }
+ else if(m_aPreset == NEXTERA){
+ a.id = "Nextera-TruSight";
+ a.seq1 = "CTGTCTCTTATACACATCT";
+ a.info = "AmpliSeq, Nextera, Nextera DNA Flex, Nextera DNA, Nextera XT, Nextera Enrichment, Nextera Rapid Capture Enrichment, TruSight Enrichment, TruSight Rapid Capture Enrichment, TruSight HLA";
+ }
+ else if(m_aPreset == NEXTERAMP){
+ a.id = "Nextera-Matepair";
+ a.seq1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC";
+ a.seq2 = "GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT";
+ a.seqc = "CTGTCTCTTATACACATCT";
+ a.info = "Nextera Mate Pair";
+ }
+
+ // IonTorrent sequencing adapters
+
+ Adapters IonTorrent;
+ IonTorrent.id = "IonTorrent";
+ IonTorrent.seq1 = "ATCACCGACTGCCCATAGAGAGGCTGAGAC";
+ IonTorrent.seq1 = "CCATCTCATCCCTGCGTGTCTCCGACTCAG";
+ IonTorrent.seq2 = "CCTCTCTATGGGCAGTCGGTGAT";
+ IonTorrent.info = "IonTorrent";
+ };
+
+
+ virtual ~LoadAdapters(){};
+
+
+ void loadSequences(const bool secondSet){
+
+ using namespace std;
+ using namespace flexbar;
+
+ TString id = a.id;
+ TSeqStr seq;
+
+ if(! secondSet) seq = a.seq1;
+ else seq = a.seq2;
+
+ if(m_rcMode == RCOFF || m_rcMode == RCON){
+ TBar adapter;
+ adapter.id = id;
+ adapter.seq = seq;
+ adapters.push_back(adapter);
+ }
+
+ if(m_rcMode == RCON || m_rcMode == RCONLY){
+ TString idRC = id;
+ TSeqStr seqRC = seq;
+
+ append(idRC, "_rc");
+ seqan::reverseComplement(seqRC);
+
+ TBar adapterRC;
+ adapterRC.id = idRC;
+ adapterRC.seq = seqRC;
+ adapterRC.rcAdapter = true;
+ adapters.push_back(adapterRC);
+ }
+
+ if(m_aPreset == NEXTERAMP){
+ TString idc = id;
+ TSeqStr seqc = a.seqc;
+
+ append(idc, "_circ");
+
+ TBar adapter;
+ adapter.id = idc;
+ adapter.seq = seqc;
+ adapters.push_back(adapter);
+
+ append(idc, "_rc");
+ seqan::reverseComplement(seqc);
+
+ TBar adapterRC;
+ adapterRC.id = idc;
+ adapterRC.seq = seqc;
+ adapters.push_back(adapterRC);
+ }
+ };
+
+
+ tbb::concurrent_vector<flexbar::TBar> getAdapters(){
+ return adapters;
+ }
+
+
+ void printAdapters(std::string adapterName) const {
+
+ using namespace std;
+
+ const unsigned int maxSpaceLen = 23;
+
+ stringstream s; s << adapterName;
+ int len = s.str().length() + 1;
+
+ if(len + 2 > maxSpaceLen) len = maxSpaceLen - 2;
+
+ *out << adapterName << ":" << string(maxSpaceLen - len, ' ') << "Sequence:" << "\n";
+
+ for(unsigned int i=0; i < adapters.size(); ++i){
+ TString seqTag = adapters.at(i).id;
+
+ int whiteSpaceLen = maxSpaceLen - length(seqTag);
+ if(whiteSpaceLen < 2) whiteSpaceLen = 2;
+
+ string whiteSpace = string(whiteSpaceLen, ' ');
+
+ *out << seqTag << whiteSpace << adapters.at(i).seq << "\n";
+ }
+ *out << endl;
+ }
+
+};
+
+#endif
=====================================
src/LoadFasta.h
=====================================
--- a/src/LoadFasta.h
+++ b/src/LoadFasta.h
@@ -33,19 +33,17 @@ public:
using namespace std;
using namespace flexbar;
- seqan::SeqFileIn seqFileIn;
-
- setFormat(seqFileIn, seqan::Fasta());
+ seqan::DatFastaSeqFileIn seqFileIn;
if(! open(seqFileIn, filePath.c_str())){
cerr << "\nERROR: Could not open file " << filePath << "\n" << endl;
exit(1);
}
- TSeqStrs seqs;
- TStrings ids;
-
try{
+ TSeqStrs seqs;
+ TStrings ids;
+
readRecords(ids, seqs, seqFileIn);
map<TString, short> idMap;
=====================================
src/Options.h
=====================================
--- a/src/Options.h
+++ b/src/Options.h
@@ -15,6 +15,7 @@
struct Options{
std::string readsFile, readsFile2, barReadsFile;
+ std::string outReadsFile, outReadsFile2, outLogFile;
std::string barcodeFile, adapterFile, barcode2File, adapter2File;
std::string adapterSeq, targetName, logAlignStr, outCompression;
std::string htrimLeft, htrimRight;
@@ -24,10 +25,10 @@ struct Options{
bool useStdin, useStdout, relaxRegion, useRcTrimEnd, qtrimPostRm;
bool interleavedInput, htrimAdapterRm, htrimMaxFirstOnly;
- int cutLen_begin, cutLen_end, cutLen_read, a_tail_len, b_tail_len;
- int qtrimThresh, qtrimWinSize, a_overhang, htrimMinLength, htrimMaxLength, a_cycles;
- int maxUncalled, min_readLen, a_min_overlap, b_min_overlap, nThreads, bundleSize;
- int match, mismatch, gapCost, b_match, b_mismatch, b_gapCost;
+ int cutLen_begin, cutLen_end, cutLen_read, a_tail_len, b_tail_len, p_min_overlap;
+ int qtrimThresh, qtrimWinSize, a_overhang, htrimMinLength, htrimMinLength2, htrimMaxLength;
+ int maxUncalled, min_readLen, a_min_overlap, b_min_overlap, nThreads, bundleSize, nBundles;
+ int a_match, a_mismatch, a_gapCost, b_match, b_mismatch, b_gapCost, a_cycles;
float a_errorRate, b_errorRate, h_errorRate;
@@ -41,6 +42,8 @@ struct Options{
flexbar::BarcodeDetect barDetect;
flexbar::AdapterRemoval adapRm;
flexbar::RevCompMode rcMode;
+ flexbar::PairOverlap poMode;
+ flexbar::AdapterPreset aPreset;
tbb::concurrent_vector<flexbar::TBar> barcodes, adapters, barcodes2, adapters2;
@@ -48,6 +51,9 @@ struct Options{
std::fstream fstrmOut;
Options(){
+
+ using namespace flexbar;
+
readsFile = "";
readsFile2 = "";
barReadsFile = "";
@@ -55,6 +61,9 @@ struct Options{
adapterFile = "";
barcode2File = "";
adapter2File = "";
+ outReadsFile = "";
+ outReadsFile2 = "";
+ outLogFile = "";
outCompression = "";
htrimLeft = "";
htrimRight = "";
@@ -79,24 +88,32 @@ struct Options{
htrimAdapterRm = false;
htrimMaxFirstOnly = false;
- cutLen_begin = 0;
- cutLen_end = 0;
- cutLen_read = 0;
- qtrimThresh = 0;
- qtrimWinSize = 0;
- a_tail_len = 0;
- b_tail_len = 0;
- b_min_overlap = 0;
- htrimMaxLength = 0;
-
- format = flexbar::FASTA;
- qual = flexbar::SANGER;
- qTrim = flexbar::QOFF;
- logAlign = flexbar::NONE;
- cmprsType = flexbar::UNCOMPRESSED;
- barDetect = flexbar::BOFF;
- adapRm = flexbar::AOFF;
- rcMode = flexbar::RCOFF;
+ cutLen_begin = 0;
+ cutLen_end = 0;
+ cutLen_read = 0;
+ qtrimThresh = 0;
+ qtrimWinSize = 0;
+ a_tail_len = 0;
+ b_tail_len = 0;
+ a_min_overlap = 3;
+ b_min_overlap = 0;
+ htrimMinLength2 = 0;
+ htrimMaxLength = 0;
+ nBundles = 0;
+
+ format = FASTA;
+ qual = SANGER;
+ qTrim = QOFF;
+ logAlign = NONE;
+ cmprsType = UNCOMPRESSED;
+ barDetect = BOFF;
+ adapRm = AOFF;
+ rcMode = RCOFF;
+ poMode = POFF;
+ a_end = RIGHT;
+ arc_end = RIGHT;
+ b_end = LTAIL;
+ aPreset = APOFF;
}
};
@@ -131,6 +148,11 @@ const std::string getFlexbarURL(){
}
+const std::string getFlexbarDescription(){
+ return "The program Flexbar preprocesses high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies. Unique molecular identifiers can be extracted in a flexible way. The program supports sequencing data in fasta and fastq format, e.g. from the Illumina platform. Refer to the manual on github.com/seqan/flexbar/wiki or contact Johannes Roehr on github.com/jtroehr for support with this application.";
+}
+
+
void defineOptions(seqan::ArgumentParser &parser, const std::string version, const std::string date){
using namespace seqan;
@@ -140,12 +162,12 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
setVersion(parser, version);
setDate(parser, date);
- // setCitation(parser, "\n\n" + getFlexbarCitation());
-
// setAppName(parser, "Flexbar");
+ // setCitation(parser, "\n\n" + getFlexbarCitation());
// setShortCopyright(parser, "BSD 3-Clause License");
// setLongCopyright(parser, "");
+ addDescription(parser, getFlexbarDescription());
setShortDescription(parser, "flexible barcode and adapter removal");
addUsageLine(parser, "\\fB-r\\fP reads [\\fB-b\\fP barcodes] [\\fB-a\\fP adapters] [options]");
@@ -156,7 +178,8 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
addSection(parser, "Basic options");
addOption(parser, ArgParseOption("n", "threads", "Number of threads to employ.", ARG::INTEGER));
- addOption(parser, ArgParseOption("N", "bundle", "Number of read pairs per thread.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("N", "bundle", "Number of (paired) reads per thread.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("M", "bundles", "Process only certain number of bundles for testing.", ARG::INTEGER));
addOption(parser, ArgParseOption("t", "target", "Prefix for output file names or paths.", ARG::OUTPUT_PREFIX));
addOption(parser, ArgParseOption("r", "reads", "Fasta/q file or stdin (-) with reads that may contain barcodes.", ARG::INPUT_FILE));
addOption(parser, ArgParseOption("p", "reads2", "Second input file of paired reads, gz and bz2 files supported.", ARG::INPUT_FILE));
@@ -167,8 +190,8 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
addOption(parser, ArgParseOption("b2", "barcodes2", "Additional barcodes file for second read set in paired mode.", ARG::INPUT_FILE));
addOption(parser, ArgParseOption("br", "barcode-reads", "Fasta/q file containing separate barcode reads for detection.", ARG::INPUT_FILE));
addOption(parser, ArgParseOption("bo", "barcode-min-overlap", "Minimum overlap of barcode and read. Default: barcode length.", ARG::INTEGER));
- addOption(parser, ArgParseOption("bt", "barcode-error-rate", "Error rate threshold for mismatches and gaps.", ARG::DOUBLE));
- addOption(parser, ArgParseOption("be", "barcode-trim-end", "Type of detection, see section trim-end modes.", ARG::STRING));
+ addOption(parser, ArgParseOption("be", "barcode-error-rate", "Error rate threshold for mismatches and gaps.", ARG::DOUBLE));
+ addOption(parser, ArgParseOption("bt", "barcode-trim-end", "Type of detection, see section trim-end modes.", ARG::STRING));
addOption(parser, ArgParseOption("bn", "barcode-tail-length", "Region size in tail trim-end modes. Default: barcode length.", ARG::INTEGER));
addOption(parser, ArgParseOption("bk", "barcode-keep", "Keep barcodes within reads instead of removal."));
addOption(parser, ArgParseOption("bu", "barcode-unassigned", "Include unassigned reads in output generation."));
@@ -180,21 +203,31 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
addOption(parser, ArgParseOption("a", "adapters", "Fasta file with adapters for removal that may contain N.", ARG::INPUT_FILE));
addOption(parser, ArgParseOption("a2", "adapters2", "File with extra adapters for second read set in paired mode.", ARG::INPUT_FILE));
addOption(parser, ArgParseOption("as", "adapter-seq", "Single adapter sequence as alternative to adapters option.", ARG::STRING));
- // addOption(parser, ArgParseOption("ap", "adapter-pair-overlap", "Overlap detection for paired reads.", ARG::STRING));
- addOption(parser, ArgParseOption("ao", "adapter-min-overlap", "Minimum overlap of adapter and read for removal.", ARG::INTEGER));
- addOption(parser, ArgParseOption("at", "adapter-error-rate", "Error rate threshold for mismatches and gaps.", ARG::DOUBLE));
- addOption(parser, ArgParseOption("ae", "adapter-trim-end", "Type of removal, see section trim-end modes.", ARG::STRING));
+ addOption(parser, ArgParseOption("aa", "adapter-preset", "", ARG::STRING));
+ addOption(parser, ArgParseOption("ao", "adapter-min-overlap", "Minimum overlap for removal without pair overlap.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("ae", "adapter-error-rate", "Error rate threshold for mismatches and gaps.", ARG::DOUBLE));
+ addOption(parser, ArgParseOption("at", "adapter-trim-end", "Type of removal, see section trim-end modes.", ARG::STRING));
addOption(parser, ArgParseOption("an", "adapter-tail-length", "Region size for tail trim-end modes. Default: adapter length.", ARG::INTEGER));
// addOption(parser, ArgParseOption("ah", "adapter-overhang", "Overhang at read ends in right and left modes.", ARG::INTEGER));
addOption(parser, ArgParseOption("ax", "adapter-relaxed", "Skip restriction to pass read ends in right and left modes."));
+ addOption(parser, ArgParseOption("ap", "adapter-pair-overlap", "Overlap detection of paired reads.", ARG::STRING));
+ addOption(parser, ArgParseOption("av", "adapter-min-poverlap", "Minimum overlap of paired reads for detection.", ARG::INTEGER));
addOption(parser, ArgParseOption("ac", "adapter-revcomp", "Include reverse complements of adapters.", ARG::STRING));
- addOption(parser, ArgParseOption("ad", "adapter-revcomp-end", "Use different trim-end for reverse complement of adapters.", ARG::STRING));
+ addOption(parser, ArgParseOption("ad", "adapter-revcomp-end", "Use different trim-end for reverse complements of adapters.", ARG::STRING));
addOption(parser, ArgParseOption("ar", "adapter-read-set", "Consider only single read set for adapters.", ARG::STRING));
addOption(parser, ArgParseOption("ay", "adapter-cycles", "Number of adapter removal cycles.", ARG::INTEGER));
addOption(parser, ArgParseOption("am", "adapter-match", "Alignment match score.", ARG::INTEGER));
addOption(parser, ArgParseOption("ai", "adapter-mismatch", "Alignment mismatch score.", ARG::INTEGER));
addOption(parser, ArgParseOption("ag", "adapter-gap", "Alignment gap score.", ARG::INTEGER));
+ // addSection(parser, "Joining paired reads");
+ // addOption(parser, ArgParseOption("j", "join", "Align paired reads and join them in case of sufficient overlap."));
+ // addOption(parser, ArgParseOption("jo", "join-min-overlap", "Minimum overlap of adapter and read sequence.", ARG::INTEGER));
+ // addOption(parser, ArgParseOption("jt", "join-error-rate", "Error rate threshold for mismatches and gaps.", ARG::DOUBLE));
+ // addOption(parser, ArgParseOption("jm", "join-match", "Alignment match score.", ARG::INTEGER));
+ // addOption(parser, ArgParseOption("ji", "join-mismatch", "Alignment mismatch score.", ARG::INTEGER));
+ // addOption(parser, ArgParseOption("jg", "join-gap", "Alignment gap score.", ARG::INTEGER));
+
addSection(parser, "Filtering and trimming");
addOption(parser, ArgParseOption("u", "max-uncalled", "Allowed uncalled bases N for each read.", ARG::INTEGER));
addOption(parser, ArgParseOption("x", "pre-trim-left", "Trim given number of bases on 5' read end before detection.", ARG::INTEGER));
@@ -210,25 +243,29 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
addOption(parser, ArgParseOption("qa", "qtrim-post-removal", "Perform quality-based trimming after removal steps."));
addSection(parser, "Trimming of homopolymers");
- addOption(parser, ArgParseOption("X", "htrim-left", "Trim certain homopolymers on left read end after removal.", ARG::STRING));
- addOption(parser, ArgParseOption("Y", "htrim-right", "Trim certain homopolymers on right read end after removal.", ARG::STRING));
- addOption(parser, ArgParseOption("M", "htrim-min-length", "Minimum length of homopolymers at read ends.", ARG::INTEGER));
- addOption(parser, ArgParseOption("L", "htrim-max-length", "Maximum length of homopolymers at read ends.", ARG::INTEGER));
- addOption(parser, ArgParseOption("F", "htrim-max-first", "Max length of homopolymers only for first type."));
- addOption(parser, ArgParseOption("T", "htrim-error-rate", "Error rate threshold for mismatches.", ARG::DOUBLE));
- addOption(parser, ArgParseOption("A", "htrim-adapter", "Trim only in case of adapter removal on same side."));
+ addOption(parser, ArgParseOption("hl", "htrim-left", "Trim specific homopolymers on left read end after removal.", ARG::STRING));
+ addOption(parser, ArgParseOption("hr", "htrim-right", "Trim certain homopolymers on right read end after removal.", ARG::STRING));
+ addOption(parser, ArgParseOption("hi", "htrim-min-length", "Minimum length of homopolymers at read ends.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("h2", "htrim-min-length2", "Minimum length for homopolymers specified after first one.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("hx", "htrim-max-length", "Maximum length of homopolymers on left and right read end.", ARG::INTEGER));
+ addOption(parser, ArgParseOption("hf", "htrim-max-first", "Apply maximum length of homopolymers only for first one."));
+ addOption(parser, ArgParseOption("he", "htrim-error-rate", "Error rate threshold for mismatches.", ARG::DOUBLE));
+ addOption(parser, ArgParseOption("ha", "htrim-adapter", "Trim only in case of adapter removal on same side."));
addSection(parser, "Output selection");
addOption(parser, ArgParseOption("f", "fasta-output", "Prefer non-quality format fasta for output."));
addOption(parser, ArgParseOption("z", "zip-output", "Direct compression of output files.", ARG::STRING));
addOption(parser, ArgParseOption("1", "stdout-reads", "Write reads to stdout, tagged and interleaved if needed."));
+ addOption(parser, ArgParseOption("R", "output-reads", "Output file for reads instead of target prefix usage.", ARG::OUTPUT_FILE));
+ addOption(parser, ArgParseOption("D", "output-reads2", "Output file for reads2 instead of target prefix usage.", ARG::OUTPUT_FILE));
addOption(parser, ArgParseOption("j", "length-dist", "Generate length distribution for read output files."));
addOption(parser, ArgParseOption("s", "single-reads", "Write single reads for too short counterparts in pairs."));
addOption(parser, ArgParseOption("S", "single-reads-paired", "Write paired single reads with N for short counterparts."));
addSection(parser, "Logging and tagging");
addOption(parser, ArgParseOption("l", "align-log", "Print chosen read alignments.", ARG::STRING));
- addOption(parser, ArgParseOption("o", "stdout-log", "Write statistics to console instead of target log file."));
+ addOption(parser, ArgParseOption("o", "stdout-log", "Write statistics to stdout instead of target log file."));
+ addOption(parser, ArgParseOption("O", "output-log", "Output file for logging instead of target prefix usage.", ARG::OUTPUT_FILE));
addOption(parser, ArgParseOption("g", "removal-tags", "Tag reads that are subject to adapter or barcode removal."));
addOption(parser, ArgParseOption("e", "number-tags", "Replace read tags by ascending number to save space."));
addOption(parser, ArgParseOption("d", "umi-tags", "Capture UMIs in reads at barcode or adapter N positions."));
@@ -244,36 +281,44 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
setAdvanced(parser, "barcode-mismatch");
setAdvanced(parser, "barcode-gap");
- setAdvanced(parser, "adapter-revcomp");
- setAdvanced(parser, "adapter-revcomp-end");
+ setAdvanced(parser, "adapter-seq");
setAdvanced(parser, "adapter-tail-length");
- // setAdvanced(parser, "adapter-overhang");
setAdvanced(parser, "adapter-relaxed");
+ setAdvanced(parser, "adapter-min-poverlap");
+ setAdvanced(parser, "adapter-revcomp");
+ setAdvanced(parser, "adapter-revcomp-end");
setAdvanced(parser, "adapter-read-set");
setAdvanced(parser, "adapter-cycles");
setAdvanced(parser, "adapter-match");
setAdvanced(parser, "adapter-mismatch");
setAdvanced(parser, "adapter-gap");
+ // setAdvanced(parser, "adapter-overhang");
setAdvanced(parser, "post-trim-length");
setAdvanced(parser, "qtrim-win-size");
setAdvanced(parser, "qtrim-post-removal");
setAdvanced(parser, "htrim-left");
+ setAdvanced(parser, "htrim-min-length2");
setAdvanced(parser, "htrim-max-length");
setAdvanced(parser, "htrim-max-first");
setAdvanced(parser, "htrim-adapter");
+ setAdvanced(parser, "version-check");
setAdvanced(parser, "man-help");
setAdvanced(parser, "bundle");
+ setAdvanced(parser, "bundles");
setAdvanced(parser, "interleaved");
- setAdvanced(parser, "stdout-reads");
setAdvanced(parser, "length-dist");
+ setAdvanced(parser, "single-reads");
setAdvanced(parser, "single-reads-paired");
+ setAdvanced(parser, "output-reads");
+ setAdvanced(parser, "output-reads2");
+ setAdvanced(parser, "output-log");
setAdvanced(parser, "number-tags");
setAdvanced(parser, "umi-tags");
- setCategory(parser, "Trimming");
+ setCategory(parser, "Barcode and adapter removal");
// setRequired(parser, "reads");
// setMinValue(parser, "threads", "1");
@@ -311,10 +356,13 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
setValidValues(parser, "qtrim-format", "sanger solexa i1.3 i1.5 i1.8");
setValidValues(parser, "align-log", "ALL MOD TAB");
setValidValues(parser, "zip-output", "GZ BZ2");
- // setValidValues(parser, "adapter-pair-overlap", "ON ONLY");
+
setValidValues(parser, "adapter-read-set", "1 2");
setValidValues(parser, "adapter-revcomp", "ON ONLY");
+ setValidValues(parser, "adapter-pair-overlap", "ON SHORT ONLY");
+ setValidValues(parser, "adapter-preset", "TruSeq SmallRNA Methyl Ribo Nextera NexteraMP");
+ // setDefaultValue(parser, "version-check", "OFF");
setDefaultValue(parser, "target", "flexbarOut");
setDefaultValue(parser, "threads", "1");
setDefaultValue(parser, "bundle", "256");
@@ -328,14 +376,15 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
setDefaultValue(parser, "barcode-mismatch", "-1");
setDefaultValue(parser, "barcode-gap", "-9");
- setDefaultValue(parser, "adapter-trim-end", "RIGHT");
- setDefaultValue(parser, "adapter-min-overlap", "3");
- setDefaultValue(parser, "adapter-error-rate", "0.1");
- setDefaultValue(parser, "adapter-cycles", "1");
- // setDefaultValue(parser, "adapter-overhang", "0");
- setDefaultValue(parser, "adapter-match", "1");
- setDefaultValue(parser, "adapter-mismatch", "-1");
- setDefaultValue(parser, "adapter-gap", "-6");
+ setDefaultValue(parser, "adapter-trim-end", "RIGHT");
+ setDefaultValue(parser, "adapter-min-overlap", "3");
+ setDefaultValue(parser, "adapter-error-rate", "0.1");
+ setDefaultValue(parser, "adapter-min-poverlap", "40");
+ setDefaultValue(parser, "adapter-cycles", "1");
+ setDefaultValue(parser, "adapter-match", "1");
+ setDefaultValue(parser, "adapter-mismatch", "-1");
+ setDefaultValue(parser, "adapter-gap", "-6");
+ // setDefaultValue(parser, "adapter-overhang", "0");
setDefaultValue(parser, "qtrim-threshold", "20");
setDefaultValue(parser, "qtrim-win-size", "5");
@@ -351,8 +400,11 @@ void defineOptions(seqan::ArgumentParser &parser, const std::string version, con
addText(parser._toolDoc, "\\fBRTAIL:\\fP use only last n bases, see tail-length options", false);
addTextSection(parser, "EXAMPLES");
- addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP reads.fq \\fB-t\\fP target \\fB-b\\fP brc.fa \\fB-be\\fP LTAIL \\fB-a\\fP adp.fa", false);
- addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP reads.fq.gz \\fB-q\\fP TAIL \\fB-qf\\fP i1.8 \\fB-a\\fP adp.fa \\fB-ao\\fP 5 \\fB-at\\fP 0.2");
+ addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP reads.fq \\fB-t\\fP target \\fB-q\\fP TAIL \\fB-qf\\fP i1.8", false);
+ addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP reads.fq \\fB-b\\fP barcodes.fa \\fB-bt\\fP LTAIL", false);
+ addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP reads.fq \\fB-a\\fP adapters.fa \\fB-ao\\fP 3 \\fB-ae\\fP 0.1", false);
+ addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP r1.fq \\fB-p\\fP r2.fq \\fB-a\\fP adapters.fa \\fB-ap\\fP ON", false);
+ addText(parser._toolDoc, "\\fBflexbar\\fP \\fB-r\\fP r1.fq \\fB-p\\fP r2.fq \\fB-aa\\fP TruSeq \\fB-ap\\fP ON");
}
@@ -437,6 +489,7 @@ void parseCmdLine(seqan::ArgumentParser &parser, std::string version, int argc,
void initOptions(Options &o, seqan::ArgumentParser &parser){
using namespace std;
+ using namespace flexbar;
bool stdOutReads = isSet(parser, "stdout-reads");
bool stdOutLog = isSet(parser, "stdout-log");
@@ -450,7 +503,15 @@ void initOptions(Options &o, seqan::ArgumentParser &parser){
else{
string s;
getOptionValue(s, parser, "target");
- openOutputFile(o.fstrmOut, s + ".log");
+
+ s = s + ".log";
+
+ if(isSet(parser, "output-log") && ! o.logStdout){
+ getOptionValue(o.outLogFile, parser, "output-log");
+ s = o.outLogFile;
+ }
+
+ openOutputFile(o.fstrmOut, s);
o.out = &o.fstrmOut;
*o.out << endl;
@@ -485,13 +546,24 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
}
getOptionValue(o.bundleSize, parser, "bundle");
- *out << "Bundled fragments: " << o.bundleSize << endl << endl;
+ *out << "Bundled fragments: " << o.bundleSize << endl;
if(o.bundleSize < 1){
cerr << "\n" << "Bundle size should be 1 at least.\n" << endl;
exit(1);
}
+ if(isSet(parser, "bundles")){
+ getOptionValue(o.nBundles, parser, "bundles");
+ *out << "Number of bundles: " << o.nBundles << endl << endl;
+
+ if(o.nBundles < 1){
+ cerr << "\n" << "Number of bundles should be 1 at least.\n" << endl;
+ exit(1);
+ }
+ }
+ else *out << endl;
+
getOptionValue(o.targetName, parser, "target");
*out << "Target name: " << o.targetName << endl;
@@ -528,10 +600,11 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
}
getOptionValue(o.readsFile2, parser, "reads2");
*out << "Reads file 2: " << o.readsFile2 << " (paired run)" << endl;
+
o.runType = PAIRED;
o.isPaired = true;
- flexbar::FileFormat fformat;
+ FileFormat fformat;
checkInputType(o.readsFile2, fformat, false);
if(o.format != fformat){
@@ -549,7 +622,7 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
getOptionValue(o.barReadsFile, parser, "barcode-reads");
*out << "Barcode reads file: " << o.barReadsFile << endl;
- flexbar::FileFormat fformat;
+ FileFormat fformat;
checkInputType(o.barReadsFile, fformat, false);
if(o.format != fformat){
@@ -572,6 +645,7 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
}
if(isSet(parser, "barcodes2") && o.barDetect != BARCODE_READ && o.isPaired){
+
getOptionValue(o.barcode2File, parser, "barcodes2");
*out << "Barcode file 2: " << o.barcode2File << endl;
@@ -596,6 +670,34 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
*out << "Adapter file 2: " << o.adapter2File << endl;
o.adapRm = NORMAL2;
}
+
+ if(isSet(parser, "adapter-preset")){
+
+ if(o.adapRm == NORMAL || o.adapRm == NORMAL2){
+ cerr << "\n" << "Please specify either adapter preset or custom adapters.\n" << endl;
+ exit(1);
+ }
+ string aa;
+ getOptionValue(aa, parser, "adapter-preset");
+
+ if(aa == "TruSeq") o.aPreset = TRUSEQ;
+ else if(aa == "SmallRNA") o.aPreset = SMALLRNA;
+ else if(aa == "Methyl") o.aPreset = METHYL;
+ else if(aa == "Ribo") o.aPreset = RIBO;
+ else if(aa == "Nextera") o.aPreset = NEXTERA;
+ else if(aa == "NexteraMP") o.aPreset = NEXTERAMP;
+
+ *out << "Adapter preset: " << aa << endl;
+
+ o.adapRm = NORMAL;
+
+ if(o.isPaired && (o.aPreset == TRUSEQ || o.aPreset == METHYL || o.aPreset == NEXTERAMP)) o.adapRm = NORMAL2;
+
+ if(! o.isPaired && o.aPreset == NEXTERAMP){
+ cerr << "\n" << "Please provide paired reads for preset NexteraMP.\n" << endl;
+ exit(1);
+ }
+ }
*out << endl;
@@ -708,6 +810,14 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
*out << "htrim-right: " << o.htrimRight << endl;
}
+ getOptionValue(o.htrimMinLength, parser, "htrim-min-length");
+ *out << "htrim-min-length: " << o.htrimMinLength << endl;
+
+ if(isSet(parser, "htrim-min-length2")){
+ getOptionValue(o.htrimMinLength2, parser, "htrim-min-length2");
+ *out << "htrim-min-length2: " << o.htrimMinLength2 << endl;
+ }
+
if(isSet(parser, "htrim-max-length")){
getOptionValue(o.htrimMaxLength, parser, "htrim-max-length");
*out << "htrim-max-length: " << o.htrimMaxLength << endl;
@@ -718,9 +828,6 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
}
}
- getOptionValue(o.htrimMinLength, parser, "htrim-min-length");
- *out << "htrim-min-length: " << o.htrimMinLength << endl;
-
getOptionValue(o.h_errorRate, parser, "htrim-error-rate");
*out << "htrim-error-rate: " << o.h_errorRate << endl;
@@ -760,6 +867,29 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
o.writeSingleReads = false;
}
+ if(! o.useStdout && ! o.writeSingleReads && (o.runType == SINGLE || o.runType == PAIRED)){
+
+ if(isSet(parser, "output-reads") && (isSet(parser, "output-reads2") || o.runType == SINGLE)){
+ getOptionValue(o.outReadsFile, parser, "output-reads");
+ *out << "output-reads: " << o.outReadsFile << endl;
+ }
+
+ if(isSet(parser, "output-reads2") && isSet(parser, "output-reads") && o.runType == PAIRED){
+ getOptionValue(o.outReadsFile2, parser, "output-reads2");
+ *out << "output-reads2: " << o.outReadsFile2 << endl;
+
+ if(o.outReadsFile == o.outReadsFile2){
+ cerr << "\n" << "Output reads and reads2 file should not be the same.\n" << endl;
+ exit(1);
+ }
+ }
+
+ if(o.outLogFile != "" && (o.outLogFile == o.outReadsFile || o.outLogFile == o.outReadsFile2)){
+ cerr << "\n" << "Output log file should not be the same as output reads or reads2 file.\n" << endl;
+ exit(1);
+ }
+ }
+
if(isSet(parser, "fasta-output")) o.switch2Fasta = true;
if(isSet(parser, "length-dist")) o.writeLengthDist = true;
if(isSet(parser, "number-tags")) o.useNumberTag = true;
@@ -827,87 +957,136 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
*out << "barcode-gap: ";
if(o.b_gapCost >= 0) *out << " ";
- *out << o.b_gapCost << endl;
-
- *out << endl;
+ *out << o.b_gapCost << "\n" << endl;
}
// adapter options
- if(o.adapRm != AOFF){
+ if(o.isPaired && isSet(parser, "adapter-pair-overlap")){
- string a_trim_end;
- getOptionValue(a_trim_end, parser, "adapter-trim-end");
+ string pOverlap;
+ getOptionValue(pOverlap, parser, "adapter-pair-overlap");
- if (a_trim_end == "LEFT") o.a_end = LEFT;
- else if(a_trim_end == "RIGHT") o.a_end = RIGHT;
- else if(a_trim_end == "ANY") o.a_end = ANY;
- else if(a_trim_end == "LTAIL") o.a_end = LTAIL;
- else if(a_trim_end == "RTAIL") o.a_end = RTAIL;
+ if (pOverlap == "ON") o.poMode = PON;
+ else if(pOverlap == "SHORT") o.poMode = PSHORT;
+ else if(pOverlap == "ONLY") o.poMode = PONLY;
else {
- cerr << "\nSpecified adapter trim-end is unknown.\n" << endl;
+ cerr << "\nSpecified pair overlap mode is unknown.\n" << endl;
exit(1);
}
- *out << "adapter-trim-end: " << a_trim_end << endl;
-
-
- if(isSet(parser, "adapter-tail-length")){
- getOptionValue(o.a_tail_len, parser, "adapter-tail-length");
- *out << "adapter-tail-length: " << o.a_tail_len << endl;
- }
+ if(o.adapRm == AOFF && (o.poMode == PON || o.poMode == PSHORT)) o.poMode = POFF;
+ else *out << "adapter-pair-overlap: " << pOverlap << endl;
+ }
+
+ if(o.adapRm != AOFF || o.poMode == PONLY){
- if(isSet(parser, "adapter-revcomp")){
+ if(o.adapRm != AOFF){
- string rcModeStr;
- getOptionValue(rcModeStr, parser, "adapter-revcomp");
- *out << "adapter-revcomp: " << rcModeStr << endl;
+ string a_trim_end;
+ getOptionValue(a_trim_end, parser, "adapter-trim-end");
- if (rcModeStr == "ON") o.rcMode = RCON;
- else if(rcModeStr == "ONLY") o.rcMode = RCONLY;
+ if (a_trim_end == "LEFT") o.a_end = LEFT;
+ else if(a_trim_end == "RIGHT") o.a_end = RIGHT;
+ else if(a_trim_end == "ANY") o.a_end = ANY;
+ else if(a_trim_end == "LTAIL") o.a_end = LTAIL;
+ else if(a_trim_end == "RTAIL") o.a_end = RTAIL;
+ else {
+ cerr << "\nSpecified adapter trim-end is unknown.\n" << endl;
+ exit(1);
+ }
+ *out << "adapter-trim-end: " << a_trim_end << endl;
- if(isSet(parser, "adapter-revcomp-end") && o.rcMode == RCON){
+ if(o.aPreset != APOFF && o.a_end != RIGHT){
+ cerr << "\nAdapter trim-end should be RIGHT for adapter presets.\n" << endl;
+ exit(1);
+ }
+
+ if(isSet(parser, "adapter-tail-length")){
+ getOptionValue(o.a_tail_len, parser, "adapter-tail-length");
+ *out << "adapter-tail-length: " << o.a_tail_len << endl;
+ }
+
+ if(isSet(parser, "adapter-revcomp")){
- string arc_trim_end;
- getOptionValue(arc_trim_end, parser, "adapter-revcomp-end");
+ string rcModeStr;
+ getOptionValue(rcModeStr, parser, "adapter-revcomp");
+ *out << "adapter-revcomp: " << rcModeStr << endl;
- if (arc_trim_end == "LEFT") o.arc_end = LEFT;
- else if(arc_trim_end == "RIGHT") o.arc_end = RIGHT;
- else if(arc_trim_end == "ANY") o.arc_end = ANY;
- else if(arc_trim_end == "LTAIL") o.arc_end = LTAIL;
- else if(arc_trim_end == "RTAIL") o.arc_end = RTAIL;
- else {
- cerr << "\nSpecified reverse complement adapter trim-end is unknown.\n" << endl;
- exit(1);
- }
+ if (rcModeStr == "ON") o.rcMode = RCON;
+ else if(rcModeStr == "ONLY") o.rcMode = RCONLY;
- if(o.arc_end != o.a_end){
- *out << "adapter-revcomp-end: " << arc_trim_end << endl;
- o.useRcTrimEnd = true;
+ if(isSet(parser, "adapter-revcomp-end") && o.rcMode == RCON){
+
+ string arc_trim_end;
+ getOptionValue(arc_trim_end, parser, "adapter-revcomp-end");
+
+ if (arc_trim_end == "LEFT") o.arc_end = LEFT;
+ else if(arc_trim_end == "RIGHT") o.arc_end = RIGHT;
+ else if(arc_trim_end == "ANY") o.arc_end = ANY;
+ else if(arc_trim_end == "LTAIL") o.arc_end = LTAIL;
+ else if(arc_trim_end == "RTAIL") o.arc_end = RTAIL;
+ else {
+ cerr << "\nSpecified reverse complement adapter trim-end is unknown.\n" << endl;
+ exit(1);
+ }
+
+ if(o.arc_end != o.a_end){
+ *out << "adapter-revcomp-end: " << arc_trim_end << endl;
+ o.useRcTrimEnd = true;
+ }
}
}
- }
-
- if(isSet(parser, "adapter-relaxed")){
- *out << "adapter-relaxed: on" << endl;
- o.relaxRegion = true;
- }
-
- if(isSet(parser, "adapter-read-set") && o.isPaired && o.adapRm != NORMAL2){
- string a_read_set;
- getOptionValue(a_read_set, parser, "adapter-read-set");
- *out << "adapter-read-set: " << a_read_set << endl;
- if(a_read_set == "1") o.adapRm = AONE;
- else if(a_read_set == "2") o.adapRm = ATWO;
+ if(isSet(parser, "adapter-relaxed")){
+ *out << "adapter-relaxed: on" << endl;
+ o.relaxRegion = true;
+ }
+
+ if(isSet(parser, "adapter-read-set") && o.isPaired && o.adapRm != NORMAL2){
+ string a_read_set;
+ getOptionValue(a_read_set, parser, "adapter-read-set");
+ *out << "adapter-read-set: " << a_read_set << endl;
+
+ if(a_read_set == "1") o.adapRm = AONE;
+ else if(a_read_set == "2") o.adapRm = ATWO;
+ }
+
+ getOptionValue(o.a_cycles, parser, "adapter-cycles");
+
+ if(o.a_cycles < 1){
+ cerr << "\nNumber of adapter removal cycles should be 1 at least.\n" << endl;
+ exit(1);
+ }
+ if(o.aPreset == NEXTERAMP && o.a_cycles < 3) o.a_cycles = 3;
+ if(o.a_cycles > 1) *out << "adapter-cycles: " << o.a_cycles << endl;
+
+ getOptionValue(o.a_min_overlap, parser, "adapter-min-overlap");
+ *out << "adapter-min-overlap: " << o.a_min_overlap << endl;
+
+ if(o.a_min_overlap < 1){
+ cerr << "\nAdapter min-overlap should be 1 at least.\n" << endl;
+ exit(1);
+ }
+
+ if((o.poMode == PON || o.poMode == PSHORT) && o.a_end != RIGHT && o.a_end != RTAIL &&
+ (! o.useRcTrimEnd || (o.arc_end != RIGHT && o.arc_end != RTAIL))){
+ cerr << "\nOne adapter trim-end should be RIGHT or RTAIL if pair overlap is ON or SHORT.\n" << endl;
+ exit(1);
+ }
+
+ // getOptionValue(o.a_overhang, parser, "adapter-overhang");
+ // *out << "adapter-overhang: " << o.a_overhang << endl;
}
- getOptionValue(o.a_min_overlap, parser, "adapter-min-overlap");
- *out << "adapter-min-overlap: " << o.a_min_overlap << endl;
-
- if(o.a_min_overlap < 1){
- cerr << "\nAdapter min-overlap should be 1 at least.\n" << endl;
- exit(1);
+ if(o.poMode != POFF){
+ getOptionValue(o.p_min_overlap, parser, "adapter-min-poverlap");
+ *out << "adapter-min-poverlap: " << o.p_min_overlap << endl;
+
+ if(o.p_min_overlap < 20){
+ cerr << "\nMinimum overlap of paired reads should be 20 at least.\n" << endl;
+ exit(1);
+ }
}
getOptionValue(o.a_errorRate, parser, "adapter-error-rate");
@@ -918,32 +1097,21 @@ void loadOptions(Options &o, seqan::ArgumentParser &parser){
exit(1);
}
- getOptionValue(o.a_cycles, parser, "adapter-cycles");
- if(o.a_cycles > 1) *out << "adapter-cycles: " << o.a_cycles << endl;
-
- if(o.a_cycles < 1){
- cerr << "\nNumber of adapter removal cycles should be 1 at least.\n" << endl;
- exit(1);
- }
-
- // getOptionValue(o.a_overhang, parser, "adapter-overhang");
- // *out << "adapter-overhang: " << o.a_overhang << endl;
-
- getOptionValue(o.match, parser, "adapter-match");
- getOptionValue(o.mismatch, parser, "adapter-mismatch");
- getOptionValue(o.gapCost, parser, "adapter-gap");
+ getOptionValue(o.a_match, parser, "adapter-match");
+ getOptionValue(o.a_mismatch, parser, "adapter-mismatch");
+ getOptionValue(o.a_gapCost, parser, "adapter-gap");
*out << "adapter-match: ";
- if(o.match >= 0) *out << " ";
- *out << o.match << endl;
+ if(o.a_match >= 0) *out << " ";
+ *out << o.a_match << endl;
*out << "adapter-mismatch: ";
- if(o.mismatch >= 0) *out << " ";
- *out << o.mismatch << endl;
+ if(o.a_mismatch >= 0) *out << " ";
+ *out << o.a_mismatch << endl;
*out << "adapter-gap: ";
- if(o.gapCost >= 0) *out << " ";
- *out << o.gapCost << "\n" << endl;
+ if(o.a_gapCost >= 0) *out << " ";
+ *out << o.a_gapCost << "\n" << endl;
}
}
=====================================
src/PairedAlign.h
=====================================
--- a/src/PairedAlign.h
+++ b/src/PairedAlign.h
@@ -4,6 +4,7 @@
#define FLEXBAR_PAIREDALIGN_H
#include "SeqAlign.h"
+#include "SeqAlignPair.h"
#include "SeqAlignAlgo.h"
@@ -17,17 +18,18 @@ private:
const std::string m_htrimLeft, m_htrimRight;
- const int m_htrimMinLength, m_htrimMaxLength;
- const float m_htrimErrorRate;
-
+ const unsigned int m_htrimMinLength, m_htrimMinLength2, m_htrimMaxLength;
const unsigned int m_arTimes;
+ const float m_htrimErrorRate;
+
const flexbar::FileFormat m_format;
const flexbar::LogAlign m_log;
const flexbar::RunType m_runType;
const flexbar::BarcodeDetect m_barType;
const flexbar::AdapterRemoval m_adapRem;
const flexbar::TrimEnd m_aTrimEnd, m_arcTrimEnd, m_bTrimEnd;
+ const flexbar::PairOverlap m_poMode;
tbb::atomic<unsigned long> m_unassigned;
tbb::concurrent_vector<flexbar::TBar> *m_adapters, *m_adapters2;
@@ -36,6 +38,9 @@ private:
typedef SeqAlign<TSeqStr, TString, SeqAlignAlgo<TSeqStr> > TSeqAlign;
TSeqAlign *m_a1, *m_b1, *m_a2, *m_b2;
+ typedef SeqAlignPair<TSeqStr, TString, SeqAlignAlgo<TSeqStr> > TSeqAlignPair;
+ TSeqAlignPair *m_p;
+
std::ostream *out;
public:
@@ -48,6 +53,7 @@ public:
m_runType(o.runType),
m_barType(o.barDetect),
m_adapRem(o.adapRm),
+ m_poMode(o.poMode),
m_aTrimEnd(o.a_end),
m_arcTrimEnd(o.arc_end),
m_bTrimEnd(o.b_end),
@@ -58,6 +64,7 @@ public:
m_htrimLeft(o.htrimLeft),
m_htrimRight(o.htrimRight),
m_htrimMinLength(o.htrimMinLength),
+ m_htrimMinLength2(o.htrimMinLength2),
m_htrimMaxLength(o.htrimMaxLength),
m_htrimMaxFirstOnly(o.htrimMaxFirstOnly),
m_htrimErrorRate(o.h_errorRate),
@@ -72,11 +79,13 @@ public:
m_barcodes2 = &o.barcodes2;
m_adapters2 = &o.adapters2;
- m_b1 = new TSeqAlign(m_barcodes, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
- m_a1 = new TSeqAlign(m_adapters, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.match, o.mismatch, o.gapCost, false);
-
+ m_b1 = new TSeqAlign(m_barcodes, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
m_b2 = new TSeqAlign(m_barcodes2, o, o.b_min_overlap, o.b_errorRate, o.b_tail_len, o.b_match, o.b_mismatch, o.b_gapCost, true);
- m_a2 = new TSeqAlign(m_adapters2, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.match, o.mismatch, o.gapCost, false);
+
+ m_a1 = new TSeqAlign(m_adapters, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.a_match, o.a_mismatch, o.a_gapCost, false);
+ m_a2 = new TSeqAlign(m_adapters2, o, o.a_min_overlap, o.a_errorRate, o.a_tail_len, o.a_match, o.a_mismatch, o.a_gapCost, false);
+
+ m_p = new TSeqAlignPair(o, o.p_min_overlap, o.a_errorRate, o.a_match, o.a_mismatch, o.a_gapCost);
if(m_log == flexbar::TAB)
*out << "ReadTag\tQueryTag\tQueryStart\tQueryEnd\tOverlapLength\tMismatches\tIndels\tAllowedErrors" << std::endl;
@@ -85,9 +94,10 @@ public:
virtual ~PairedAlign(){
delete m_b1;
- delete m_a1;
delete m_b2;
+ delete m_a1;
delete m_a2;
+ delete m_p;
};
@@ -95,21 +105,18 @@ public:
using namespace flexbar;
- if(m_barType != BOFF){
-
- switch(m_barType){
- case BARCODE_READ: pRead->barID = m_b1->alignSeqRead(pRead->b, false, alBundle[0], cycle[0], idxAl[0], alMode, m_bTrimEnd); break;
- case WITHIN_READ_REMOVAL2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, true, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
- case WITHIN_READ_REMOVAL: pRead->barID = m_b1->alignSeqRead(pRead->r1, true, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
- case WITHIN_READ2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, false, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
- case WITHIN_READ: pRead->barID = m_b1->alignSeqRead(pRead->r1, false, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
- case BOFF: break;
- }
+ switch(m_barType){
+ case BARCODE_READ: pRead->barID = m_b1->alignSeqRead(pRead->b, false, alBundle[0], cycle[0], idxAl[0], alMode, m_bTrimEnd); break;
+ case WITHIN_READ_REMOVAL2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, true, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
+ case WITHIN_READ_REMOVAL: pRead->barID = m_b1->alignSeqRead(pRead->r1, true, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
+ case WITHIN_READ2: pRead->barID2 = m_b2->alignSeqRead(pRead->r2, false, alBundle[2], cycle[2], idxAl[2], alMode, m_bTrimEnd);
+ case WITHIN_READ: pRead->barID = m_b1->alignSeqRead(pRead->r1, false, alBundle[1], cycle[1], idxAl[1], alMode, m_bTrimEnd); break;
+ case BOFF: break;
+ }
+
+ if(pRead->barID == 0 || (m_twoBarcodes && pRead->barID2 == 0)){
- if(pRead->barID == 0 || (m_twoBarcodes && pRead->barID2 == 0)){
-
- if(cycle[0] != PRELOAD) m_unassigned++;
- }
+ if(cycle[0] != PRELOAD) m_unassigned++;
}
}
@@ -118,15 +125,12 @@ public:
using namespace flexbar;
- if(m_adapRem != AOFF){
-
- if(m_adapRem != ATWO)
- m_a1->alignSeqRead(pRead->r1, true, alBundle[0], cycle[0], idxAl[0], alMode, trimEnd);
-
- if(pRead->r2 != NULL && m_adapRem != AONE){
- if(m_adapRem != NORMAL2) m_a1->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
- else m_a2->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
- }
+ if(m_adapRem != ATWO)
+ m_a1->alignSeqRead(pRead->r1, true, alBundle[0], cycle[0], idxAl[0], alMode, trimEnd);
+
+ if(pRead->r2 != NULL && m_adapRem != AONE){
+ if(m_adapRem != NORMAL2) m_a1->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
+ else m_a2->alignSeqRead(pRead->r2, true, alBundle[1], cycle[1], idxAl[1], alMode, trimEnd);
}
}
@@ -166,7 +170,10 @@ public:
}
}
- if(cutPos > 0 && cutPos >= m_htrimMinLength){
+ unsigned int htrimMinLength = m_htrimMinLength;
+ if(m_htrimMinLength2 > 0 && s > 0) htrimMinLength = m_htrimMinLength2;
+
+ if(cutPos > 0 && cutPos >= htrimMinLength){
erase(seqRead->seq, 0, cutPos);
if(m_format == FASTQ){
@@ -214,7 +221,10 @@ public:
}
}
- if(cutPos < seqLen && cutPos <= seqLen - m_htrimMinLength){
+ unsigned int htrimMinLength = m_htrimMinLength;
+ if(m_htrimMinLength2 > 0 && s > 0) htrimMinLength = m_htrimMinLength2;
+
+ if(cutPos < seqLen && cutPos <= seqLen - htrimMinLength){
erase(seqRead->seq, cutPos, length(seqRead->seq));
if(m_format == FASTQ){
@@ -264,7 +274,6 @@ public:
idxAl.push_back(0);
cycle.push_back(PRELOAD);
}
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToBarcodes(prBundle->at(i), alBundle, cycle, idxAl, alMode);
}
@@ -273,7 +282,6 @@ public:
idxAl[i] = 0;
cycle[i] = COMPUTE;
}
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToBarcodes(prBundle->at(i), alBundle, cycle, idxAl, alMode);
}
@@ -281,12 +289,29 @@ public:
// adapter removal
+ if(m_poMode != POFF){
+
+ Alignments alignments;
+ unsigned int idxAl = 0;
+ ComputeCycle cycle = PRELOAD;
+
+ for(unsigned int i = 0; i < prBundle->size(); ++i){
+ m_p->alignSeqReadPair(prBundle->at(i)->r1, prBundle->at(i)->r2, alignments, cycle, idxAl);
+ }
+
+ idxAl = 0;
+ cycle = COMPUTE;
+
+ for(unsigned int i = 0; i < prBundle->size(); ++i){
+ m_p->alignSeqReadPair(prBundle->at(i)->r1, prBundle->at(i)->r2, alignments, cycle, idxAl);
+ }
+ }
+
if(m_adapRem != AOFF){
for(unsigned int c = 0; c < m_arTimes; ++c){
flexbar::TrimEnd trimEnd = m_aTrimEnd;
-
unsigned int rc = 1;
if(m_useRcTrimEnd){
@@ -314,7 +339,6 @@ public:
idxAl.push_back(0);
cycle.push_back(PRELOAD);
}
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToAdapters(prBundle->at(i), alBundle, cycle, idxAl, alMode, trimEnd);
}
@@ -323,7 +347,6 @@ public:
idxAl[i] = 0;
cycle[i] = COMPUTE;
}
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
alignPairedReadToAdapters(prBundle->at(i), alBundle, cycle, idxAl, alMode, trimEnd);
}
@@ -333,12 +356,10 @@ public:
if(m_umiTags){
for(unsigned int i = 0; i < prBundle->size(); ++i){
-
append(prBundle->at(i)->r1->id, prBundle->at(i)->r1->umi);
if(prBundle->at(i)->r2 != NULL){
append(prBundle->at(i)->r1->id, prBundle->at(i)->r2->umi);
-
append(prBundle->at(i)->r2->id, prBundle->at(i)->r1->umi);
append(prBundle->at(i)->r2->id, prBundle->at(i)->r2->umi);
}
@@ -347,7 +368,6 @@ public:
if(m_htrim){
if(m_htrimLeft != ""){
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
trimLeftHPS(prBundle->at(i)->r1);
@@ -356,7 +376,6 @@ public:
}
}
if(m_htrimRight != ""){
-
for(unsigned int i = 0; i < prBundle->size(); ++i){
trimRightHPS(prBundle->at(i)->r1);
@@ -385,8 +404,20 @@ public:
using namespace flexbar;
- if(m_adapRem != NORMAL2) return m_a1->getNrPreShortReads();
- else return m_a1->getNrPreShortReads() + m_a2->getNrPreShortReads();
+ if (m_poMode != POFF) return m_p->getNrPreShortReads();
+ else if(m_adapRem != NORMAL2) return m_a1->getNrPreShortReads();
+ else return m_a1->getNrPreShortReads() + m_a2->getNrPreShortReads();
+ }
+
+
+ void printPairOverlapStats(){
+
+ using namespace flexbar;
+
+ if(m_p->getNrOverlappingReads() > 0)
+ *out << m_p->getOverlapStatsString() << "\n\n";
+
+ if(m_adapRem == AOFF) *out << std::endl;
}
=====================================
src/PairedInput.h
=====================================
--- a/src/PairedInput.h
+++ b/src/PairedInput.h
@@ -15,7 +15,7 @@ private:
const bool m_isPaired, m_useBarRead, m_useNumberTag, m_interleaved;
const unsigned int m_bundleSize;
- tbb::atomic<unsigned long> m_uncalled, m_uncalledPairs, m_tagCounter;
+ tbb::atomic<unsigned long> m_uncalled, m_uncalledPairs, m_tagCounter, m_nBundles;
SeqInput<TSeqStr, TString> *m_f1, *m_f2, *m_b;
public:
@@ -29,6 +29,7 @@ public:
m_isPaired(o.isPaired),
m_useBarRead(o.barDetect == flexbar::BARCODE_READ),
m_bundleSize(o.bundleSize),
+ m_nBundles(o.nBundles),
m_tagCounter(0),
m_uncalled(0),
m_uncalledPairs(0){
@@ -43,6 +44,8 @@ public:
if(m_useBarRead)
m_b = new SeqInput<TSeqStr, TString>(o, o.barReadsFile, false, false);
+
+ if(m_nBundles > 0) ++m_nBundles;
}
virtual ~PairedInput(){
@@ -62,6 +65,10 @@ public:
TStrings quals, quals2, qualsBR;
TBools uncalled, uncalled2, uncalledBR;
+ if(m_nBundles > 0){
+ if(m_nBundles-- == 1) return NULL;
+ }
+
unsigned int bundleSize = m_bundleSize;
if(m_interleaved) bundleSize = m_bundleSize * 2;
=====================================
src/PairedOutput.h
=====================================
--- a/src/PairedOutput.h
+++ b/src/PairedOutput.h
@@ -102,10 +102,10 @@ public:
b2 << barcode2;
string s = m_target + "_barcode_" + b1.str();
- TSeqOutput *of1 = new TSeqOutput(s, barcode1, false, o);
+ TSeqOutput *of1 = new TSeqOutput(s, barcode, false, o);
s = m_target + "_barcode_" + b2.str();
- TSeqOutput *of2 = new TSeqOutput(s, barcode2, false, o);
+ TSeqOutput *of2 = new TSeqOutput(s, barcode, false, o);
TOutFiles& f = m_outMap[i + 1];
f.f1 = of1;
@@ -125,10 +125,10 @@ public:
if(m_writeUnassigned){
string s = m_target + "_barcode_unassigned_1";
- TSeqOutput *of1 = new TSeqOutput(s, "unassigned_1", false, o);
+ TSeqOutput *of1 = new TSeqOutput(s, "unassigned", false, o);
s = m_target + "_barcode_unassigned_2";
- TSeqOutput *of2 = new TSeqOutput(s, "unassigned_2", false, o);
+ TSeqOutput *of2 = new TSeqOutput(s, "unassigned", false, o);
TOutFiles& f = m_outMap[0];
f.f1 = of1;
@@ -154,10 +154,16 @@ public:
m_outMap = new TOutFiles[m_mapsize];
string s = m_target + "_1";
- TSeqOutput *of1 = new TSeqOutput(s, "1", false, o);
+
+ if(o.outReadsFile != "") s = o.outReadsFile;
+
+ TSeqOutput *of1 = new TSeqOutput(s, "", false, o);
s = m_target + "_2";
- TSeqOutput *of2 = new TSeqOutput(s, "2", false, o);
+
+ if(o.outReadsFile2 != "") s = o.outReadsFile2;
+
+ TSeqOutput *of2 = new TSeqOutput(s, "", false, o);
TOutFiles& f = m_outMap[0];
f.f1 = of1;
@@ -182,6 +188,9 @@ public:
m_outMap = new TOutFiles[m_mapsize];
string s = m_target;
+
+ if(o.outReadsFile != "") s = o.outReadsFile;
+
TSeqOutput *of1 = new TSeqOutput(s, "", false, o);
TOutFiles& f = m_outMap[0];
=====================================
src/QualTrimming.h
=====================================
--- a/src/QualTrimming.h
+++ b/src/QualTrimming.h
@@ -1,45 +1,13 @@
-// ==========================================================================
-// QualTrimming.h
-// ==========================================================================
-// Copyright (c) 2006-2015, Knut Reinert, FU Berlin
-// All rights reserved.
-//
-// Redistribution and use in source and binary forms, with or without
-// modification, are permitted provided that the following conditions are met:
-//
-// * Redistributions of source code must retain the above copyright
-// notice, this list of conditions and the following disclaimer.
-// * Redistributions in binary form must reproduce the above copyright
-// notice, this list of conditions and the following disclaimer in the
-// documentation and/or other materials provided with the distribution.
-// * Neither the name of Knut Reinert or the FU Berlin nor the names of
-// its contributors may be used to endorse or promote products derived
-// from this software without specific prior written permission.
-//
-// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
-// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
-// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
-// ARE DISCLAIMED. IN NO EVENT SHALL KNUT REINERT OR THE FU BERLIN BE LIABLE
-// FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
-// DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
-// SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
-// CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
-// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
-// DAMAGE.
-//
-// ==========================================================================
+// QualTrimming.h
+
// Authors: Sebastian Roskosch
// Benjamin Menkuec
// Johannes Roehr
-// ==========================================================================
#ifndef FLEXBAR_QUALTRIMMING_H
#define FLEXBAR_QUALTRIMMING_H
-// Tags for choosing quality-based trimming method
-
struct Tail {};
struct BWA {};
@@ -50,13 +18,9 @@ struct Window {
};
-// ============================================================================
-// Functions
-// ============================================================================
-
-
template <typename TString>
inline unsigned getQuality(const TString& qual, unsigned i){
+
return static_cast<int>(qual[i]);
}
@@ -67,37 +31,35 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, Tail const &){
for (int i = length(qual) - 1; i >= 0; --i){
- if(getQuality(qual, i) >= cutoff){
- return i + 1;
- }
+ if(getQuality(qual, i) >= cutoff) return i + 1;
}
return 0;
}
-// Trim by shifting a window over the seq and cut where avg qual in window turns bad first.
+// Trim by shifting a window over the seq and cut where avg qual in window turns bad
template <typename TString>
unsigned qualTrimming(const TString& qual, unsigned const _cutoff, Window const & spec){
unsigned window = spec.size;
unsigned avg = 0, i = 0;
- // Work with absolute cutoff in window to avoid divisions.
+ // Work with absolute cutoff in window to avoid divisions
unsigned cutoff = _cutoff * window;
- // Calculate average quality of initial window.
+ // Calculate average quality of initial window
for (i = 0; i < window; ++i){
avg += getQuality(qual, i);
}
- // Shift window over read and keep mean quality, update in constant time.
+ // Shift window over read and keep mean quality, update in constant time
for (i = 0; i < length(qual) && avg >= cutoff; ++i){
- // Take care only not to go over the end of the sequence. Shorten window near the end.
+ // Take care only not to go over the end of the sequence. Shorten window near the end
avg -= getQuality(qual, i);
avg += i + window < length(qual) ? getQuality(qual, i + window) : 0;
}
- return i; // i holds start of first window that turned bad.
+ return i; // i holds start of first window that turned bad
}
@@ -114,7 +76,6 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, BWA const &){
if(sum < 0){
break;
}
-
if(sum > max){
max = sum;
max_arg = i;
@@ -127,6 +88,8 @@ unsigned qualTrimming(const TString& qual, unsigned const cutoff, BWA const &){
template <typename TSeqStr, typename TString>
bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, const int cutoff, const int wSize){
+ using namespace seqan;
+
unsigned cutPos;
if(qtrim == flexbar::TAIL){
@@ -139,8 +102,6 @@ bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, co
cutPos = qualTrimming(qual, cutoff, BWA());
}
- using namespace seqan;
-
if(cutPos < length(qual)){
seq = prefix(seq, cutPos);
@@ -148,9 +109,7 @@ bool qualTrim(TSeqStr &seq, TString &qual, const flexbar::QualTrimType qtrim, co
return true;
}
- else{
- return false;
- }
+ else return false;
}
@@ -171,25 +130,4 @@ bool qualTrim(SeqRead<TSeqStr, TString> *seqRead, const flexbar::QualTrimType qt
}
-// inline unsigned getQuality(const seqan::String<seqan::Dna5Q>& seq, unsigned i)
-// {
-// return seqan::getQualityValue(seq[i]);
-// }
-
-// template<bool tag>
-// struct TagTrimming
-// {
-// static const bool value = tag;
-// };
-
-// template <typename TSeq, typename TSpec>
-// unsigned trimRead(TSeq& seq, unsigned const cutoff, TSpec const & spec) noexcept
-// {
-// unsigned ret, cut_pos;
-// cut_pos = _trimRead(seqan::Dna5QString(seq), cutoff, spec);
-// ret = length(seq) - cut_pos;
-// erase(seq, cut_pos, length(seq));
-// return ret;
-// }
-
#endif
=====================================
src/SeqAlign.h
=====================================
--- a/src/SeqAlign.h
+++ b/src/SeqAlign.h
@@ -11,8 +11,9 @@ private:
typedef AlignResults<TSeqStr> TAlignResults;
- const flexbar::LogAlign m_log;
- const flexbar::FileFormat m_format;
+ const flexbar::LogAlign m_log;
+ const flexbar::FileFormat m_format;
+ const flexbar::PairOverlap m_poMode;
const bool m_isBarcoding, m_writeTag, m_umiTags, m_strictRegion;
const int m_minLength, m_minOverlap, m_tailLength;
@@ -24,7 +25,7 @@ private:
tbb::concurrent_vector<unsigned long> m_rmOverlaps;
std::ostream *m_out;
- TAlgorithm algo;
+ TAlgorithm m_algo;
public:
@@ -36,6 +37,7 @@ public:
m_isBarcoding(isBarcoding),
m_umiTags(o.umiTags),
m_minLength(o.min_readLen),
+ m_poMode(o.poMode),
m_log(o.logAlign),
m_format(o.format),
m_writeTag(o.useRemovalTag),
@@ -44,7 +46,7 @@ public:
m_out(o.out),
m_nPreShortReads(0),
m_modified(0),
- algo(TAlgorithm(o, match, mismatch, gapCost)){
+ m_algo(TAlgorithm(o, match, mismatch, gapCost, ! isBarcoding)){
m_queries = queries;
m_rmOverlaps = tbb::concurrent_vector<unsigned long>(flexbar::MAX_READLENGTH + 1, 0);
@@ -119,7 +121,7 @@ public:
TAlignResults a;
// global sequence alignment
- algo.alignGlobal(a, alignments, cycle, idxAl++, trimEnd);
+ m_algo.alignGlobal(a, alignments, cycle, idxAl++, trimEnd);
a.queryLength = length(m_queries->at(i).seq);
a.tailLength = (m_tailLength > 0) ? m_tailLength : a.queryLength;
@@ -130,6 +132,9 @@ public:
float madeErrors = static_cast<float>(a.mismatches + a.gapsR + a.gapsA);
int minOverlap = (m_isBarcoding && m_minOverlap == 0) ? a.queryLength : m_minOverlap;
+ if(! m_isBarcoding && m_poMode == PON && seqRead.pairOverlap &&
+ (trimEnd == RIGHT || trimEnd == RTAIL)) minOverlap = 1;
+
bool validAl = true;
if(((trimEnd == RTAIL || trimEnd == RIGHT) && a.startPosA < a.startPosS && m_strictRegion) ||
=====================================
src/SeqAlignAlgo.h
=====================================
--- a/src/SeqAlignAlgo.h
+++ b/src/SeqAlignAlgo.h
@@ -21,13 +21,14 @@ private:
// TScoreSimple m_score;
TScoreMatrix m_scoreMatrix;
- const bool m_umiTags;
+ const bool m_umiTags, m_isAdapterRm;
const flexbar::LogAlign m_log;
public:
- SeqAlignAlgo(const Options &o, const int match, const int mismatch, const int gapCost):
+ SeqAlignAlgo(const Options &o, const int match, const int mismatch, const int gapCost, const bool isAdapterRm):
m_umiTags(o.umiTags),
+ m_isAdapterRm(isAdapterRm),
m_log(o.logAlign){
using namespace seqan;
@@ -38,7 +39,7 @@ public:
for(unsigned i = 0; i < ValueSize<TChar>::VALUE; ++i){
for(unsigned j = 0; j < ValueSize<TChar>::VALUE; ++j){
- if(i == j || TChar(j) == 'N')
+ if(i == j || TChar(j) == 'N' || (TChar(i) == 'N' && isAdapterRm))
setScore(m_scoreMatrix, TChar(i), TChar(j), match);
else setScore(m_scoreMatrix, TChar(i), TChar(j), mismatch);
}
@@ -125,10 +126,10 @@ public:
for(; it1 != end(row1); ++it1){
if(a.startPos <= alPos && alPos < a.endPos){
- if(isGap(it1)) ++a.gapsR;
- else if(isGap(it2)) ++a.gapsA;
- else if(*it1 != *it2 && *it2 != 'N') ++a.mismatches;
- else if(m_umiTags && *it2 == 'N') append(a.umiTag, (TChar) *it1);
+ if(isGap(it1)) ++a.gapsR;
+ else if(isGap(it2)) ++a.gapsA;
+ else if(*it1 != *it2 && *it2 != 'N' && (*it1 != 'N' || ! m_isAdapterRm)) ++a.mismatches;
+ else if(m_umiTags && *it2 == 'N') append(a.umiTag, (TChar) *it1);
}
++alPos;
++it2;
=====================================
src/SeqAlignPair.h
=====================================
--- /dev/null
+++ b/src/SeqAlignPair.h
@@ -0,0 +1,246 @@
+// SeqAlignPair.h
+
+#ifndef FLEXBAR_SEQALIGNPAIR_H
+#define FLEXBAR_SEQALIGNPAIR_H
+
+
+template <typename TSeqStr, typename TString, class TAlgorithm>
+class SeqAlignPair {
+
+private:
+
+ typedef AlignResults<TSeqStr> TAlignResults;
+
+ const flexbar::LogAlign m_log;
+ const flexbar::FileFormat m_format;
+ const flexbar::PairOverlap m_poMode;
+
+ const bool m_writeTag;
+ const int m_minLength, m_minOverlap, m_aMinOverlap;
+ const float m_errorRate;
+ const unsigned int m_bundleSize;
+
+ tbb::atomic<unsigned long> m_nPreShortReads, m_overlaps, m_modified;
+ tbb::concurrent_vector<unsigned long> m_overlapLengths;
+
+ std::ostream *m_out;
+ TAlgorithm m_algo;
+
+public:
+
+ SeqAlignPair(const Options &o, const int minOverlap, const float errorRate, const int match, const int mismatch, const int gapCost):
+
+ m_minOverlap(minOverlap),
+ m_aMinOverlap(o.a_min_overlap),
+ m_errorRate(errorRate),
+ m_minLength(o.min_readLen),
+ m_poMode(o.poMode),
+ m_log(o.logAlign),
+ m_format(o.format),
+ m_writeTag(o.useRemovalTag),
+ m_bundleSize(o.bundleSize),
+ m_out(o.out),
+ m_nPreShortReads(0),
+ m_overlaps(0),
+ m_modified(0),
+ m_algo(TAlgorithm(o, match, mismatch, gapCost, true)){
+
+ m_overlapLengths = tbb::concurrent_vector<unsigned long>(flexbar::MAX_READLENGTH + 1, 0);
+ };
+
+
+ void alignSeqReadPair(flexbar::TSeqRead* sr, flexbar::TSeqRead* sr2, flexbar::Alignments &alignments, flexbar::ComputeCycle &cycle, unsigned int &idxAl){
+
+ using namespace std;
+ using namespace flexbar;
+
+ TSeqRead &seqRead = *sr;
+ TSeqRead &seqRead2 = *sr2;
+
+ int readLength = length(seqRead.seq);
+ int readLength2 = length(seqRead2.seq);
+
+ if(cycle != PRELOAD){
+ if(readLength < m_minLength) ++m_nPreShortReads;
+ if(readLength2 < m_minLength) ++m_nPreShortReads;
+ }
+
+ if(readLength < 1 || readLength2 < 1) return;
+
+
+ if(cycle == PRELOAD){
+
+ if(idxAl == 0) reserve(alignments.aset, m_bundleSize);
+
+ TSeqStr rcSeq2 = seqRead2.seq;
+ seqan::reverseComplement(rcSeq2);
+
+ TAlign align;
+ appendValue(alignments.aset, align);
+ resize(rows(alignments.aset[idxAl]), 2);
+
+ assignSource(row(alignments.aset[idxAl], 0), seqRead.seq);
+ assignSource(row(alignments.aset[idxAl], 1), rcSeq2);
+
+ ++idxAl;
+ return;
+ }
+
+ TAlignResults a;
+
+ m_algo.alignGlobal(a, alignments, cycle, idxAl++, ANY);
+
+ a.overlapLength = a.endPos - a.startPos;
+ a.allowedErrors = m_errorRate * a.overlapLength;
+
+ float madeErrors = static_cast<float>(a.mismatches + a.gapsR + a.gapsA);
+
+ stringstream s;
+
+ // check if alignment is valid, number of errors and overlap length
+ if((a.startPosA < a.startPosS || a.endPosA < a.endPosS) && madeErrors <= a.allowedErrors && a.overlapLength >= m_minOverlap){
+
+ if(a.startPosA < a.startPosS){
+
+ seqRead2.pairOverlap = true;
+
+ if(m_poMode == PONLY || (m_poMode == PSHORT && a.startPosS < m_aMinOverlap)){
+
+ unsigned int rCutPos = readLength2 - a.startPosS;
+ erase(seqRead2.seq, rCutPos, readLength2);
+
+ if(m_format == FASTQ)
+ erase(seqRead2.qual, rCutPos, readLength2);
+
+ ++m_modified;
+
+ if(m_writeTag) append(seqRead2.id, "_Flexbar_removal_PO");
+ }
+ }
+
+ if(a.endPosA < a.endPosS){
+
+ seqRead.pairOverlap = true;
+
+ if(m_poMode == PONLY || (m_poMode == PSHORT && (a.endPosS - a.endPosA) < m_aMinOverlap)){
+
+ unsigned int rCutPos = readLength - (a.endPosS - a.endPosA);
+ erase(seqRead.seq, rCutPos, readLength);
+
+ if(m_format == FASTQ)
+ erase(seqRead.qual, rCutPos, readLength);
+
+ ++m_modified;
+
+ if(m_writeTag) append(seqRead.id, "_Flexbar_removal_PO");
+ }
+ }
+
+ ++m_overlaps;
+
+ // store overlap occurrences
+ if(a.overlapLength <= MAX_READLENGTH) m_overlapLengths.at(a.overlapLength)++;
+ else cerr << "\nCompile Flexbar with larger max read length for correct overlap stats.\n" << endl;
+
+ // alignment stats
+ if(m_log == ALL || m_log == MOD){
+
+ s << "Sequence removal:\n";
+
+ s << " read id " << seqRead.id << "\n"
+ << " read pos " << a.startPosS << "-" << a.endPosS << "\n"
+ << " read2 id " << seqRead2.id << "\n"
+ << " read2 pos " << a.startPosA << "-" << a.endPosA << "\n"
+ << " score " << a.score << "\n"
+ << " overlap " << a.overlapLength << "\n"
+ << " errors " << a.gapsR + a.gapsA + a.mismatches << "\n"
+ << " error threshold " << a.allowedErrors << "\n"
+ << " remaining read " << seqRead.seq << "\n";
+
+ if(m_format == FASTQ)
+ s << " remaining qual " << seqRead.qual << "\n";
+
+ s << " remaining read2 " << seqRead2.seq << "\n";
+
+ if(m_format == FASTQ)
+ s << " remaining qual2 " << seqRead2.qual << "\n";
+
+ s << "\n Alignment:\n" << endl << a.alString;
+ }
+ else if(m_log == TAB){
+ s << seqRead.id << "\t" << seqRead2.id << "\t"
+ << a.startPosA << "\t" << a.endPosA << "\t" << a.overlapLength << "\t"
+ << a.mismatches << "\t" << a.gapsR + a.gapsA << "\t" << a.allowedErrors << endl;
+ }
+ }
+ else if(m_log == ALL){
+ s << "Unvalid alignment:" << "\n"
+ << "read id " << seqRead.id << "\n"
+ << "read2 id " << seqRead2.id << "\n\n" << endl;
+ }
+
+ *m_out << s.str();
+
+ return;
+ }
+
+
+ std::string getOverlapStatsString(){
+
+ using namespace std;
+ using namespace flexbar;
+
+ unsigned long nValues = 0, halfValues = 0, cumValues = 0, lenSum = 0;
+ unsigned int max = 0, median = 0, mean = 0;
+
+ unsigned int min = numeric_limits<unsigned int>::max();
+
+ for(unsigned int i = 0; i <= MAX_READLENGTH; ++i){
+ unsigned long lenCount = m_overlapLengths.at(i);
+
+ if(lenCount > 0 && i < min) min = i;
+ if(lenCount > 0 && i > max) max = i;
+
+ nValues += lenCount;
+ lenSum += lenCount * i;
+ }
+
+ halfValues = nValues / 2;
+
+ for(unsigned int i = 0; i <= MAX_READLENGTH; ++i){
+ cumValues += m_overlapLengths.at(i);
+
+ if(cumValues >= halfValues){
+ median = i;
+ break;
+ }
+ }
+
+ if(m_overlaps > 0) mean = lenSum / m_overlaps;
+
+ stringstream s;
+
+ if(m_modified > 0){
+ s << "Number of trimmed reads based on pair overlap: ";
+ s << m_modified << "\n";
+ }
+
+ s << "Min, max, mean and median overlap of paired reads: ";
+ s << min << " / " << max << " / " << mean << " / " << median;
+
+ return s.str();
+ }
+
+
+ unsigned long getNrPreShortReads() const {
+ return m_nPreShortReads;
+ }
+
+
+ unsigned long getNrOverlappingReads() const {
+ return m_overlaps;
+ }
+
+};
+
+#endif
=====================================
src/SeqInput.h
=====================================
--- a/src/SeqInput.h
+++ b/src/SeqInput.h
@@ -12,10 +12,9 @@ class SeqInput {
private:
- seqan::SeqFileIn seqFileIn;
+ seqan::FlexbarReadsSeqFileIn seqFileIn;
const flexbar::QualTrimType m_qtrim;
const flexbar::FileFormat m_format;
- // typedef seqan::String<char, seqan::MMap<> > TMMapString;
const bool m_preProcess, m_useStdin, m_qtrimPostRm;
const int m_maxUncalled, m_preTrimBegin, m_preTrimEnd, m_qtrimThresh, m_qtrimWinSize;
=====================================
src/SeqOutput.h
=====================================
--- a/src/SeqOutput.h
+++ b/src/SeqOutput.h
@@ -9,7 +9,7 @@ class SeqOutput {
private:
- seqan::SeqFileOut seqFileOut;
+ seqan::FlexbarReadsSeqFileOut seqFileOut;
std::string m_filePath;
const TString m_tagStr;
@@ -40,9 +40,13 @@ public:
m_filePath = filePath;
- if(m_format == FASTA || m_switch2Fasta)
- m_filePath += getExtension(FASTA) + o.outCompression;
- else m_filePath += getExtension(FASTQ) + o.outCompression;
+ if(filePath != o.outReadsFile && filePath != o.outReadsFile2){
+
+ if(m_format == FASTA || m_switch2Fasta)
+ m_filePath += getExtension(FASTA);
+ else m_filePath += getExtension(FASTQ);
+ }
+ m_filePath += o.outCompression;
m_lengthDist = tbb::concurrent_vector<unsigned long>(MAX_READLENGTH + 1, 0);
@@ -121,7 +125,6 @@ public:
}
catch(seqan::Exception const &e){
cerr << "\nERROR: " << e.what() << "\nProgram execution aborted.\n" << endl;
-
close(seqFileOut);
exit(1);
}
View it on GitLab: https://salsa.debian.org/med-team/flexbar/commit/5d9caef2649a7d87dfc7b79dbc3a863c20b659a0
--
View it on GitLab: https://salsa.debian.org/med-team/flexbar/commit/5d9caef2649a7d87dfc7b79dbc3a863c20b659a0
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180729/7d115988/attachment-0001.html>
More information about the debian-med-commit
mailing list