[med-svn] [Git][med-team/fastp][master] 8 commits: New upstream version 0.22.0+dfsg

Nilesh Patra (@nilesh) gitlab at salsa.debian.org
Mon Aug 30 19:33:29 BST 2021



Nilesh Patra pushed to branch master at Debian Med / fastp


Commits:
3dd4452f by Nilesh Patra at 2021-08-30T23:56:21+05:30
New upstream version 0.22.0+dfsg
- - - - -
f5223423 by Nilesh Patra at 2021-08-30T23:56:21+05:30
routine-update: New upstream version

- - - - -
b8cfe6be by Nilesh Patra at 2021-08-30T23:56:22+05:30
Update upstream source from tag 'upstream/0.22.0+dfsg'

Update to upstream version '0.22.0+dfsg'
with Debian dir 37a9cc455e73cc5cbb5b2fbe43ef416c58e2ccdd
- - - - -
f43b4f2b by Nilesh Patra at 2021-08-30T23:56:22+05:30
routine-update: Standards-Version: 4.6.0

- - - - -
55edcf36 by Nilesh Patra at 2021-08-30T23:56:23+05:30
routine-update: debhelper-compat 13

- - - - -
6b1ccb88 by Nilesh Patra at 2021-08-30T23:56:49+05:30
Set upstream metadata fields: Bug-Database, Bug-Submit.

Changes-By: lintian-brush
Fixes: lintian: upstream-metadata-missing-bug-tracking
See-also: https://lintian.debian.org/tags/upstream-metadata-missing-bug-tracking.html

- - - - -
c9e6e7d5 by Nilesh Patra at 2021-08-30T23:56:49+05:30
Remove obsolete fields Contact, Name from debian/upstream/metadata (already present in machine-readable debian/copyright).

Changes-By: lintian-brush

- - - - -
424e5ca9 by Nilesh Patra at 2021-08-30T23:56:50+05:30
routine-update: Ready to upload to unstable

- - - - -


24 changed files:

- .gitignore
- .travis.yml
- LICENSE
- Makefile
- README.md
- debian/changelog
- debian/control
- debian/upstream/metadata
- src/common.h
- src/duplicate.cpp
- src/duplicate.h
- src/evaluator.cpp
- src/htmlreporter.cpp
- src/htmlreporter.h
- src/jsonreporter.cpp
- src/jsonreporter.h
- src/knownadapters.h
- src/main.cpp
- src/options.cpp
- src/options.h
- src/peprocessor.cpp
- src/peprocessor.h
- src/seprocessor.cpp
- src/util.h


Changes:

=====================================
.gitignore
=====================================
@@ -33,3 +33,10 @@ fastp
 *.exe
 *.out
 *.app
+
+# Editor Config
+.vscode
+
+# Test Output
+*.json
+*.html
\ No newline at end of file


=====================================
.travis.yml
=====================================
@@ -1,3 +1,6 @@
+arch:
+  - amd64
+  - ppc64le
 language: cpp
 compiler:
     - gcc


=====================================
LICENSE
=====================================
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2017 OpenGene - Open Source Genetics Toolbox
+Copyright (c) 2016 OpenGene - Open Source Genetics Toolbox
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal


=====================================
Makefile
=====================================
@@ -28,8 +28,14 @@ ${DIR_OBJ}/%.o:${DIR_SRC}/%.cpp make_obj_dir
 
 .PHONY:clean
 clean:
-	rm obj/*.o
-	rm $(TARGET)
+	@if test -d $(DIR_OBJ) ; \
+	then \
+		find $(DIR_OBJ) -name *.o -delete; \
+	fi
+	@if test -e $(TARGET) ; \
+	then \
+		rm $(TARGET) ; \
+	fi
 
 make_obj_dir:
 	@if test ! -d $(DIR_OBJ) ; \


=====================================
README.md
=====================================
@@ -7,23 +7,47 @@ https://badges.debian.net/badges/debian/unstable/fastp/version.svg)](https://pac
 [![Build Status](https://travis-ci.org/OpenGene/fastp.svg?branch=master)](https://travis-ci.org/OpenGene/fastp)
 # fastp
 A tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance. 
-* [features](#features)
-* [simple usage](#simple-usage)
-* [examples of report](#examples-of-report)
-* [download, compile and install](#get-fastp)
-* [input and output](#input-and-output)
-* [filtering by quality, length, complexity, etc.](#filtering)
-* [adapter trimming](#adapters)
-* [per read cutting by quality score](#per-read-cutting-by-quality-score)
-* [base correction for paired end (PE) data](#base-correction-for-pe-data)
-* [globa trimming](#global-trimming)
-* [polyG tail trimming](#polyg-tail-trimming) and [polyX tail trimming](#polyx-tail-trimming)
-* [unique molecular identifier (UMI) processing](#unique-molecular-identifier-umi-processing)
-* [output splitting](#output-splitting)
-* [overrepresented sequence analysis](#overrepresented-sequence-analysis)
-* [merge paired-end reads](#merge-paired-end-reads)
-* [all options](#all-options)
-* [citation](#citation)
+- [fastp](#fastp)
+- [features](#features)
+- [simple usage](#simple-usage)
+- [examples of report](#examples-of-report)
+- [get fastp](#get-fastp)
+  - [install with Bioconda](#install-with-bioconda)
+  - [or download binary (only for Linux systems, http://opengene.org/fastp/fastp)](#or-download-binary-only-for-linux-systems-httpopengeneorgfastpfastp)
+  - [or compile from source](#or-compile-from-source)
+  - [compile from source for windows user with MinGW64-distro](#compile-from-source-for-windows-user-with-mingw64-distro)
+- [input and output](#input-and-output)
+  - [output to STDOUT](#output-to-stdout)
+  - [input from STDIN](#input-from-stdin)
+  - [store the unpaired reads for PE data](#store-the-unpaired-reads-for-pe-data)
+  - [store the reads that fail the filters](#store-the-reads-that-fail-the-filters)
+  - [process only part of the data](#process-only-part-of-the-data)
+  - [do not overwrite existing files](#do-not-overwrite-exiting-files)
+  - [split the output to multiple files for parallel processing](#split-the-output-to-multiple-files-for-parallel-processing)
+  - [merge PE reads](#merge-pe-reads)
+- [duplication rate and deduplication](#duplication-rate-and-deduplication)
+  - [duplication rate evaluation](#duplication-rate-evaluation)
+  - [deduplication](#deduplication)
+- [filtering](#filtering)
+  - [quality filter](#quality-filter)
+  - [length filter](#length-filter)
+  - [low complexity filter](#low-complexity-filter)
+  - [Other filter](#other-filter)
+- [adapters](#adapters)
+- [per read cutting by quality score](#per-read-cutting-by-quality-score)
+- [base correction for PE data](#base-correction-for-pe-data)
+- [global trimming](#global-trimming)
+- [polyG tail trimming](#polyg-tail-trimming)
+- [polyX tail trimming](#polyx-tail-trimming)
+- [unique molecular identifier (UMI) processing](#unique-molecular-identifier-umi-processing)
+  - [UMI example](#umi-example)
+- [output splitting](#output-splitting)
+  - [splitting by limiting file number](#splitting-by-limiting-file-number)
+  - [splitting by limiting the lines of each file](#splitting-by-limiting-the-lines-of-each-file)
+- [overrepresented sequence analysis](#overrepresented-sequence-analysis)
+- [merge paired-end reads](#merge-paired-end-reads)
+- [all options](#all-options)
+- [citation](#citation)
 
 # features
 0. comprehensive quality profiling for both before and after filtering data (quality curves, base contents, KMER, Q20/Q30, GC Ratio, duplication, adapter contents...)
@@ -86,6 +110,18 @@ make
 # Install
 sudo make install
 ```
+
+## compile from source for windows user with MinGW64-distro
+
+Get the compiler from https://nuwen.net/mingw.html and install it as described in its `How to install` section.
+
+```shell
+git clone -b master --depth=1 https://github.com/OpenGene/fastp.git
+cd fastp
+make
+## add fastp to your PATH
+```
+
 `fastp` only relies on `zlib`, which is already available on most Linux-like systems. If you get an error like `undefined reference to gzbuffer` when compiling `fastp`, you may have to update the `zlib` (http://zlib.net)
 
 # input and output
@@ -221,7 +257,7 @@ For Illumina NextSeq/NovaSeq data, `polyG` can happen in read tails since `G` me
 A minimum length can be set with `<poly_g_min_len>` for `fastp` to detect polyG. This value is 10 by default.
 
 # polyX tail trimming
-This feature is similar as polyG tail trimming, but is disabled by default. Use `-x` or `--polyX` to enable it. A minimum length can be set with `<poly_x_min_len>` for `fastp` to detect polyX. This value is 10 by default.   
+This feature is similar as polyG tail trimming, but is disabled by default. Use `-x` or `--trim_poly_x` to enable it. A minimum length can be set with `<poly_x_min_len>` for `fastp` to detect polyX. This value is 10 by default.   
 
 When `polyG tail trimming` and `polyX tail trimming` are both enabled, fastp will perform `polyG trimming` first, then perform `polyX trimming`. This setting is useful for trimming the tails having `polyX (i.e. polyA) ` before `polyG`. `polyG` is usually caused by sequencing artifacts, while `polyA` can be commonly found from the tails of mRNA-Seq reads.
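
As a usage sketch of the two trimmers together (file names are placeholders, and this assumes fastp's standard `-g`/`--trim_poly_g` switch for forcing polyG trimming):

```shell
# polyG trimming runs first, then polyX trimming
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz -g -x
```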
 
@@ -294,6 +330,18 @@ means that 150bp are from read1, and 15bp are from read2. `fastp` prefers the ba
 
 Same as the [base correction feature](#base-correction-for-pe-data), this function is also based on overlapping detection, which has adjustable parameters `overlap_len_require (default 30)`, `overlap_diff_limit (default 5)` and `overlap_diff_limit_percent (default 20%)`. Please note that the reads should meet these three conditions simultaneously.
 
+# duplication rate and deduplication
+For both SE and PE data, fastp supports evaluating the duplication rate and removing duplicated reads/pairs. fastp considers one read as duplicated only if all of its base pairs are identical to another one's. This means that if there is a sequencing error or an N base, the read will not be treated as duplicated.
+
+## duplication rate evaluation
+By default, fastp evaluates the duplication rate, and this module may use 1G of memory and take 10% ~ 20% more running time. If you don't need the duplication rate information, you can set `--dont_eval_duplication` to disable the duplication evaluation. But please note that if the deduplication (`--dedup`) option is enabled, the `--dont_eval_duplication` option is ignored.
+
+For performance reasons, fastp doesn't apply thread synchronization to the duplication calculation, so the evaluated duplication rate can vary slightly (~0.1%) when you run fastp on the same data multiple times. If you want a stably reproducible result, please use only one worker thread (`-w 1`).
+
+## deduplication
+Since `v0.22.0`, fastp supports deduplication for FASTQ data. Specify `-D` or `--dedup` to enable this option. The duplication evaluation module has been improved to produce more accurate results. Please also note that when deduplication is enabled, the evaluated duplication rate might differ slightly from the rate reported when deduplication is disabled, because the evaluation module applies more calculation (and uses more memory) in deduplication mode.
+
+
 # all options
 ```shell
 usage: fastp -i <in1> -o <out1> [-I <in1> -O <out2>] [options...]
@@ -303,9 +351,11 @@ options:
   -o, --out1                         read1 output file name (string [=])
   -I, --in2                          read2 input file name (string [=])
   -O, --out2                           read2 output file name (string [=])
+  -D, --dedup                          enable deduplication to drop the duplicated reads/pairs
       --unpaired1                      for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it. (string [=])
       --unpaired2                      for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file. (string [=])
       --failed_out                     specify the file to store reads that cannot pass the filters. (string [=])
+      --overlapped_out                 for each read pair, output the overlapped region if it has no mismatched bases. (string [=])
   -m, --merge                          for paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.
       --merged_out                     in the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output (string [=])
       --include_unmerged               in the merging mode, write the unmerged or unpaired reads to the file specified by --merge. Disabled by default.
@@ -316,6 +366,7 @@ options:
       --interleaved_in                 indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.
       --reads_to_process             specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
       --dont_overwrite               don't overwrite existing files. Overwriting is allowed by default.
+      --dont_eval_duplication          don't evaluate duplication rate to save time and use less memory.
       --fix_mgi_id                     the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.
   
   # adapter trimming options
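
A usage sketch for the new duplication options documented above (file names are placeholders):

```shell
# remove duplicated reads/pairs while filtering
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz --dedup

# or skip duplication evaluation entirely to save time and memory
fastp -i in.fq.gz -o out.fq.gz --dont_eval_duplication
```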


=====================================
debian/changelog
=====================================
@@ -1,3 +1,15 @@
+fastp (0.22.0+dfsg-1) unstable; urgency=medium
+
+  * Team upload.
+  * New upstream version
+  * Standards-Version: 4.6.0 (routine-update)
+  * debhelper-compat 13 (routine-update)
+  * Set upstream metadata fields: Bug-Database, Bug-Submit.
+  * Remove obsolete fields Contact, Name from debian/upstream/metadata (already
+    present in machine-readable debian/copyright).
+
+ -- Nilesh Patra <nilesh at debian.org>  Mon, 30 Aug 2021 23:56:50 +0530
+
 fastp (0.20.1+dfsg-1) unstable; urgency=medium
 
   * New upstream version


=====================================
debian/control
=====================================
@@ -3,10 +3,10 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.
 Uploaders: Dylan Aïssi <daissi at debian.org>
 Section: science
 Priority: optional
-Build-Depends: debhelper-compat (= 12),
+Build-Depends: debhelper-compat (= 13),
                help2man,
                zlib1g-dev
-Standards-Version: 4.5.0
+Standards-Version: 4.6.0
 Vcs-Browser: https://salsa.debian.org/med-team/fastp
 Vcs-Git: https://salsa.debian.org/med-team/fastp.git
 Homepage: https://github.com/OpenGene/fastp


=====================================
debian/upstream/metadata
=====================================
@@ -1,5 +1,5 @@
-Contact: Shifu Chen <chen at haplox.com>
-Name: fastp
+Bug-Database: https://github.com/OpenGene/fastp/issues
+Bug-Submit: https://github.com/OpenGene/fastp/issues/new
 Reference:
   - Author: Shifu Chen and Yanqing Zhou and Yaru Chen and Jia Gu
     Title: "fastp: an ultra-fast all-in-one FASTQ preprocessor"


=====================================
src/common.h
=====================================
@@ -1,12 +1,17 @@
 #ifndef COMMON_H
 #define COMMON_H
 
-#define FASTP_VER "0.20.1"
+#define FASTP_VER "0.22.0"
 
 #define _DEBUG false
 
-typedef long int64;
-typedef unsigned long uint64;
+#ifndef _WIN32
+	typedef long int64;
+	typedef unsigned long uint64;
+#else
+	typedef long long int64;
+	typedef unsigned long long uint64;
+#endif
 
 typedef int int32;
 typedef unsigned int uint32;
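
The typedef split above matters because 64-bit Windows uses the LLP64 data model, where `long` stays 32 bits wide while `long long` is 64 bits; on typical 64-bit Linux (LP64) both are 64 bits. A compile-time check along these lines (my sketch, not part of the commit) would make the requirement explicit:

```cpp
// sketch: could follow the typedefs in common.h;
// fails to compile wherever the types are not 64-bit
static_assert(sizeof(int64) == 8, "int64 must be 8 bytes");
static_assert(sizeof(uint64) == 8, "uint64 must be 8 bytes");
```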


=====================================
src/duplicate.cpp
=====================================
@@ -3,172 +3,140 @@
 #include <memory.h>
 #include <math.h>
 
+const int PRIME_ARRAY_LEN = 1<<9;
+
 Duplicate::Duplicate(Options* opt) {
     mOptions = opt;
-    mKeyLenInBase = mOptions->duplicate.keylen;
-    mKeyLenInBit = 1<<(2*mKeyLenInBase);
-    mDups = new uint64[mKeyLenInBit];
-    memset(mDups, 0, sizeof(uint64)*mKeyLenInBit);
-    mCounts = new uint16[mKeyLenInBit];
-    memset(mCounts, 0, sizeof(uint16)*mKeyLenInBit);
-    mGC = new uint8[mKeyLenInBit];
-    memset(mGC, 0, sizeof(uint8)*mKeyLenInBit);
+
+    // 1G mem required
+    mBufLenInBytes = 1L <<29;
+    mBufNum = 2;
+
+    // if deduplication is enabled, we double the buffer size and buffer number to make more accurate deduplication
+    // this will take 4x memory (4G)
+    if(mOptions->duplicate.dedup) {
+        mBufLenInBytes *= 2;
+        mBufNum *= 2;
+    }
+
+    mOffsetMask = PRIME_ARRAY_LEN * mBufNum - 1;
+
+    mBufLenInBits = mBufLenInBytes << 3;
+    mDupBuf = new uint8[mBufLenInBytes * mBufNum];
+    memset(mDupBuf, 0, sizeof(uint8) * mBufLenInBytes * mBufNum);
+
+    mPrimeArrays = new uint64[mBufNum * PRIME_ARRAY_LEN];
+    memset(mPrimeArrays, 0, sizeof(uint64) * mBufNum * PRIME_ARRAY_LEN);
+    initPrimeArrays();
+
+    mTotalReads = 0;
+    mDupReads = 0;
+}
+
+void Duplicate::initPrimeArrays() {
+    uint64 number = 10000;
+    uint64 count = 0;
+    while(count < mBufNum * PRIME_ARRAY_LEN) {
+        number++;
+        bool isPrime = true;
+        for(uint64 i=2; i<=sqrt(number); i++) {
+            if(number%i == 0) {
+                isPrime = false;
+                break;
+            }
+        }
+        if(isPrime) {
+            mPrimeArrays[count] = number;
+            count++;
+            number += 10000;
+        }
+    }
 }
 
 Duplicate::~Duplicate(){
-    delete[] mDups;
-    delete[] mCounts;
+    delete[] mDupBuf;
+    delete[] mPrimeArrays;
 }
 
-uint64 Duplicate::seq2int(const char* data, int start, int keylen, bool& valid) {
-    uint64 ret = 0;
-    for(int i=0; i<keylen; i++) {
-        switch(data[start + i]) {
+void Duplicate::seq2intvector(const char* data, int len, uint64* output, int posOffset) {
+    for(int p=0; p<len; p++) {
+        uint64 base = 0;
+        switch(data[p]) {
             case 'A':
-                ret += 0;
+                base = 7;
                 break;
             case 'T':
-                ret += 1;
+                base = 222;
                 break;
             case 'C':
-                ret += 2;
+                base = 74;
                 break;
             case 'G':
-                ret += 3;
+                base = 31;
                 break;
             default:
-                valid = false;
-                return 0;
+                base = 13;
         }
-        // if it's not the last one, shift it by 2 bits
-        if(i != keylen-1)
-            ret <<= 2;
-    }
-    return ret;
-}
-
-void Duplicate::addRecord(uint32 key, uint64 kmer32, uint8 gc) {
-    if(mCounts[key] == 0) {
-        mCounts[key] = 1;
-        mDups[key] = kmer32;
-        mGC[key] = gc;
-    } else {
-        if(mDups[key] == kmer32)
-            mCounts[key]++;
-        else if(mDups[key] > kmer32) {
-            mDups[key] = kmer32;
-            mCounts[key] = 1;
-            mGC[key] = gc;
+        for(int i=0; i<mBufNum; i++) {
+            int offset = (p+posOffset)*mBufNum + i;
+            offset &= mOffsetMask;
+            output[i] += mPrimeArrays[offset] * (base + (p+posOffset));
         }
     }
 }
 
-void Duplicate::statRead(Read* r) {
-    if(r->length() < 32)
-        return;
-
-    int start1 = 0;
-    int start2 = max(0, r->length() - 32 - 5);
-
-    const char* data = r->mSeq.mStr.c_str();
-    bool valid = true;
-
-    uint64 ret = seq2int(data, start1, mKeyLenInBase, valid);
-    uint32 key = (uint32)ret;
-    if(!valid)
-        return;
+bool Duplicate::checkRead(Read* r) {
+    uint64* positions = new uint64[mBufNum];
 
-    uint64 kmer32 = seq2int(data, start2, 32, valid);
-    if(!valid)
-        return;
+    // init
+    for(int i=0; i<mBufNum; i++)
+        positions[i] = 0;
+    int len = r->length();
+    seq2intvector(r->mSeq.mStr.c_str(), len, positions);
+    bool isDup = applyBloomFilter(positions);
+    delete[] positions;
 
-    int gc = 0;
+    mTotalReads++;
+    if(isDup)
+        mDupReads++;
 
-    // not calculated
-    if(mCounts[key] == 0) {
-        for(int i=0; i<r->length(); i++) {
-            if(data[i] == 'C' || data[i] == 'T')
-                gc++;
-        }
-    }
-
-    gc = round(255.0 * (double) gc / (double) r->length());
-
-    addRecord(key, kmer32, (uint8)gc);
+    return isDup;
 }
 
-void Duplicate::statPair(Read* r1, Read* r2) {
-    if(r1->length() < 32 || r2->length() < 32)
-        return;
-
-    const char* data1 = r1->mSeq.mStr.c_str();
-    const char* data2 = r2->mSeq.mStr.c_str();
-    bool valid = true;
-
-    uint64 ret = seq2int(data1, 0, mKeyLenInBase, valid);
-    uint32 key = (uint32)ret;
-    if(!valid)
-        return;
-
-    uint64 kmer32 = seq2int(data2, 0, 32, valid);
-    if(!valid)
-        return;
-
-    int gc = 0;
-
-    // not calculated
-    if(mCounts[key] == 0) {
-        for(int i=0; i<r1->length(); i++) {
-            if(data1[i] == 'G' || data1[i] == 'C')
-                gc++;
-        }
-        for(int i=0; i<r2->length(); i++) {
-            if(data2[i] == 'G' || data2[i] == 'C')
-                gc++;
-        }
-    }
-
-    gc = round(255.0 * (double) gc / (double)( r1->length() + r2->length()));
-
-    addRecord(key, kmer32, gc);
+bool Duplicate::checkPair(Read* r1, Read* r2) {
+    uint64* positions = new uint64[mBufNum];
+    
+    // init
+    for(int i=0; i<mBufNum; i++)
+        positions[i] = 0;
+    seq2intvector(r1->mSeq.mStr.c_str(), r1->length(), positions);
+    seq2intvector(r2->mSeq.mStr.c_str(), r2->length(), positions, r1->length());
+    bool isDup = applyBloomFilter(positions);
+    delete[] positions;
+
+    mTotalReads++;
+    if(isDup)
+        mDupReads++;
+
+    return isDup;
 }
 
-double Duplicate::statAll(int* hist, double* meanGC, int histSize) {
-    long totalNum = 0;
-    long dupNum = 0;
-    int* gcStatNum = new int[histSize];
-    memset(gcStatNum, 0, sizeof(int)*histSize);
-    for(int key=0; key<mKeyLenInBit; key++) {
-        int count = mCounts[key];
-        double gc = mGC[key];
-
-        if(count > 0) {
-            totalNum += count;
-            dupNum += count - 1;
-
-            if(count >= histSize){
-                hist[histSize-1]++;
-                meanGC[histSize-1] += gc;
-                gcStatNum[histSize-1]++;
-            }
-            else{
-                hist[count]++;
-                meanGC[count] += gc;
-                gcStatNum[count]++;
-            }
-        }
-    }
+bool Duplicate::applyBloomFilter(uint64* positions) {
+    bool isDup = true;
+    for(int i=0; i<mBufNum; i++) {
+        uint64 pos = positions[i] % mBufLenInBits;
+        uint64 bytePos = pos >> 3;
+        uint32 bitOffset = pos & 0x07;
+        uint8 byte = (0x01) << bitOffset;
 
-    for(int i=0; i<histSize; i++) {
-        if(gcStatNum[i] > 0) {
-            meanGC[i] = meanGC[i] / 255.0 / gcStatNum[i];
-        }
+        isDup = isDup && (mDupBuf[i * mBufLenInBytes + bytePos] & byte);
+        mDupBuf[i * mBufLenInBytes + bytePos] |= byte;
     }
+    return isDup;
+}
 
-    delete[] gcStatNum;
-
-    if(totalNum == 0)
+double Duplicate::getDupRate() {
+    if(mTotalReads == 0)
         return 0.0;
-    else
-        return (double)dupNum / (double)totalNum;
+    return (double)mDupReads/(double)mTotalReads;
 }
\ No newline at end of file
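
The rewrite above replaces the old exact key/count tables with what is effectively a Bloom filter: `seq2intvector()` hashes each read into `mBufNum` independent positions (one per bit buffer, via the prime multipliers), and `applyBloomFilter()` reports a duplicate only when every addressed bit was already set, setting the bits as a side effect. With the default 2 buffers of 2^29 bytes this costs about 1 GiB, and 4 buffers of 2^30 bytes (about 4 GiB) when `--dedup` is on. A minimal standalone sketch of the same idea, with made-up sizes and a simple stand-in hash:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy Bloom-filter deduplicator: k hash functions over one bit array.
// Sizes and hashing are illustrative, not the values fastp uses.
class BloomDedup {
public:
    explicit BloomDedup(size_t bits = 1 << 20, int k = 4) : mBits(bits, false), mK(k) {}

    // Returns true if the sequence was (probably) seen before, and
    // records it either way -- mirroring Duplicate::checkRead().
    bool checkAndAdd(const std::string& seq) {
        bool dup = true;
        for (int i = 0; i < mK; i++) {
            size_t pos = hash(seq, i) % mBits.size();
            dup = dup && mBits[pos]; // a duplicate must hit set bits everywhere
            mBits[pos] = true;
        }
        return dup;
    }

private:
    // seeded FNV-1a, standing in for seq2intvector()
    static size_t hash(const std::string& s, int seed) {
        uint64_t h = 1469598103934665603ULL ^ (uint64_t)seed;
        for (char c : s) { h = (h ^ (uint8_t)c) * 1099511628211ULL; }
        return (size_t)h;
    }
    std::vector<bool> mBits;
    int mK;
};
```

Like any Bloom filter, this trades memory for a small false-positive rate: distinct reads can occasionally be flagged as duplicates, but a true duplicate is never missed, which is why the commit enlarges the buffers when `--dedup` is enabled.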


=====================================
src/duplicate.h
=====================================
@@ -15,21 +15,26 @@ public:
     Duplicate(Options* opt);
     ~Duplicate();
 
-    void statRead(Read* r1);
-    void statPair(Read* r1, Read* r2);
-    uint64 seq2int(const char* data, int start, int keylen, bool& valid);
-    void addRecord(uint32 key, uint64 kmer32, uint8 gc);
+    bool checkRead(Read* r1);
+    bool checkPair(Read* r1, Read* r2);
+    void seq2intvector(const char* data, int len, uint64* output, int posOffset = 0);
 
-    // make histogram and get duplication rate
-    double statAll(int* hist, double* meanGC, int histSize);
+    double getDupRate();
+
+private:
+    void initPrimeArrays();
+    bool applyBloomFilter(uint64* positions);
 
 private:
     Options* mOptions;
-    int mKeyLenInBase;
-    int mKeyLenInBit;
-    uint64* mDups;
-    uint16* mCounts;
-    uint8* mGC;
+    uint64 mBufLenInBits;
+    uint64 mBufLenInBytes;
+    uint32 mBufNum;
+    uint8* mDupBuf;
+    uint64* mPrimeArrays;
+    uint64 mTotalReads;
+    uint64 mDupReads;
+    uint64 mOffsetMask;
     
 };
 


=====================================
src/evaluator.cpp
=====================================
@@ -21,8 +21,8 @@ bool Evaluator::isTwoColorSystem() {
     if(!r)
         return false;
 
-    // NEXTSEQ500, NEXTSEQ 550, NOVASEQ
-    if(starts_with(r->mName, "@NS") || starts_with(r->mName, "@NB") || starts_with(r->mName, "@A0")) {
+    // NEXTSEQ500, NEXTSEQ 550/550DX, NOVASEQ
+    if(starts_with(r->mName, "@NS") || starts_with(r->mName, "@NB") || starts_with(r->mName, "@NDX") || starts_with(r->mName, "@A0")) {
         delete r;
         return true;
     }
@@ -717,4 +717,4 @@ bool Evaluator::test() {
     string s = "ATCGATCGAT";
     cerr << eval.int2seq(eval.seq2int(s, 0, 10, -1), 10) << endl;
     return eval.int2seq(eval.seq2int(s, 0, 10, -1), 10) == s;
-}
\ No newline at end of file
+}


=====================================
src/htmlreporter.cpp
=====================================
@@ -13,13 +13,11 @@ HtmlReporter::HtmlReporter(Options* opt){
 HtmlReporter::~HtmlReporter(){
 }
 
-void HtmlReporter::setDupHist(int* dupHist, double* dupMeanGC, double dupRate) {
-    mDupHist = dupHist;
-    mDupMeanGC = dupMeanGC;
+void HtmlReporter::setDup(double dupRate) {
     mDupRate = dupRate;
 }
 
-void HtmlReporter::setInsertHist(long* insertHist, int insertSizePeak) {
+void HtmlReporter::setInsertHist(atomic_long* insertHist, int insertSizePeak) {
     mInsertHist = insertHist;
     mInsertSizePeak = insertSizePeak;
 }
@@ -185,7 +183,7 @@ void HtmlReporter::printSummary(ofstream& ofs, FilterResult* result, Stats* preS
         ofs << "</div>\n";
     }
 
-    if(mOptions->duplicate.enabled) {
+    /*if(mOptions->duplicate.enabled) {
         ofs << "<div class='section_div'>\n";
         ofs << "<div class='section_title' onclick=showOrHide('duplication')><a name='summary'>Duplication</a></div>\n";
         ofs << "<div id='duplication'>\n";
@@ -194,7 +192,7 @@ void HtmlReporter::printSummary(ofstream& ofs, FilterResult* result, Stats* preS
 
         ofs << "</div>\n";
         ofs << "</div>\n";
-    }
+    }*/
 
     if(mOptions->isPaired()) {
         ofs << "<div class='section_div'>\n";
@@ -239,7 +237,7 @@ void HtmlReporter::reportInsertSize(ofstream& ofs, int isizeLimit) {
     ofs << " or >" << isizeLimit;
     ofs << ", or contain too much sequencing errors to be detected as overlapped.";
     ofs <<"</div>\n";
-    
+
     ofs << "\n<script type=\"text/javascript\">" << endl;
     string json_str = "var data=[";
 
@@ -268,7 +266,7 @@ void HtmlReporter::reportDuplication(ofstream& ofs) {
     ofs << "<div id='duplication_figure'>\n";
     ofs << "<div class='figure' id='plot_duplication' style='height:400px;'></div>\n";
     ofs << "</div>\n";
-    
+
     ofs << "\n<script type=\"text/javascript\">" << endl;
     string json_str = "var data=[";
 
@@ -351,7 +349,7 @@ void HtmlReporter::report(FilterResult* result, Stats* preStats1, Stats* postSta
     ofs << "<div class='section_title' onclick=showOrHide('after_filtering')><a name='summary'>After filtering</a></div>\n";
     ofs << "<div id='after_filtering'>\n";
 
-    if(postStats1) {  
+    if(postStats1) {
         string name = "read1";
         if(mOptions->merge.enabled)
             name = "merged";
@@ -407,6 +405,9 @@ void HtmlReporter::printCSS(ofstream& ofs){
 
 void HtmlReporter::printJS(ofstream& ofs){
     ofs << "<script src='http://opengene.org/plotly-1.2.0.min.js'></script>" << endl;
+    ofs << "\n<script type='text/javascript'>" << endl;
+    ofs << "    window.Plotly || document.write('<script src=\"https://cdn.plot.ly/plotly-1.2.0.min.js\"><\\/script>')" << endl;
+    ofs << "</script>" << endl;
     ofs << "\n<script type=\"text/javascript\">" << endl;
     ofs << "    function showOrHide(divname) {" << endl;
     ofs << "        div = document.getElementById(divname);" << endl;


=====================================
src/htmlreporter.h
=====================================
@@ -9,6 +9,9 @@
 #include "stats.h"
 #include "filterresult.h"
 #include <fstream>
+#include <atomic>
+#include "common.h"
+#include "util.h"
 
 using namespace std;
 
@@ -16,8 +19,8 @@ class HtmlReporter{
 public:
     HtmlReporter(Options* opt);
     ~HtmlReporter();
-    void setDupHist(int* dupHist, double* dupMeanGC, double dupRate);
-    void setInsertHist(long* insertHist, int insertSizePeak);
+    void setDup(double dupRate);
+    void setInsertHist(atomic_long* insertHist, int insertSizePeak);
     void report(FilterResult* result, Stats* preStats1, Stats* postStats1, Stats* preStats2 = NULL, Stats* postStats2 = NULL);
 
     static void outputRow(ofstream& ofs, string key, long value);
@@ -39,9 +42,9 @@ private:
     int* mDupHist;
     double* mDupMeanGC;
     double mDupRate;
-    long* mInsertHist;
+    atomic_long* mInsertHist;
     int mInsertSizePeak;
 };
 
 
-#endif
\ No newline at end of file
+#endif


=====================================
src/jsonreporter.cpp
=====================================
@@ -9,13 +9,11 @@ JsonReporter::JsonReporter(Options* opt){
 JsonReporter::~JsonReporter(){
 }
 
-void JsonReporter::setDupHist(int* dupHist, double* dupMeanGC, double dupRate) {
-    mDupHist = dupHist;
-    mDupMeanGC = dupMeanGC;
+void JsonReporter::setDup(double dupRate) {
     mDupRate = dupRate;
 }
 
-void JsonReporter::setInsertHist(long* insertHist, int insertSizePeak) {
+void JsonReporter::setInsertHist(atomic_long* insertHist, int insertSizePeak) {
     mInsertHist = insertHist;
     mInsertSizePeak = insertSizePeak;
 }
@@ -26,6 +24,14 @@ void JsonReporter::report(FilterResult* result, Stats* preStats1, Stats* postSta
     ofs.open(mOptions->jsonFile, ifstream::out);
     ofs << "{" << endl;
 
+    // sequencing info
+    string sequencingInfo  = mOptions->isPaired()?"paired end":"single end";
+    if(mOptions->isPaired()) {
+        sequencingInfo += " (" + to_string(preStats1->getCycles()) + " cycles + " + to_string(preStats2->getCycles()) + " cycles)";
+    } else {
+        sequencingInfo += " (" + to_string(preStats1->getCycles()) + " cycles)";
+    }
+
     long pre_total_reads = preStats1->getReads();
     if(preStats2)
         pre_total_reads += preStats2->getReads();
@@ -68,7 +74,8 @@ void JsonReporter::report(FilterResult* result, Stats* preStats1, Stats* postSta
 
     // summary
     ofs << "\t" << "\"summary\": {" << endl;
-
+    ofs << "\t\t" << "\"fastp_version\": \""<< FASTP_VER << "\"," << endl;
+    ofs << "\t\t" << "\"sequencing\": \""<< sequencingInfo << "\"," << endl;
     ofs << "\t\t" << "\"before_filtering\": {" << endl;
     ofs << "\t\t\t" << "\"total_reads\":" << pre_total_reads << "," << endl; 
     ofs << "\t\t\t" << "\"total_bases\":" << pre_total_bases << "," << endl; 
@@ -106,21 +113,7 @@ void JsonReporter::report(FilterResult* result, Stats* preStats1, Stats* postSta
 
     if(mOptions->duplicate.enabled) {
         ofs << "\t" << "\"duplication\": {" << endl;
-        ofs << "\t\t\"rate\": " << mDupRate << "," << endl;
-        ofs << "\t\t\"histogram\": [";
-        for(int d=1; d<mOptions->duplicate.histSize; d++) {
-            ofs << mDupHist[d];
-            if(d!=mOptions->duplicate.histSize-1)
-                ofs << ",";
-        }
-        ofs << "]," << endl;
-        ofs << "\t\t\"mean_gc\": [";
-        for(int d=1; d<mOptions->duplicate.histSize; d++) {
-            ofs << mDupMeanGC[d];
-            if(d!=mOptions->duplicate.histSize-1)
-                ofs << ",";
-        }
-        ofs << "]" << endl;
+        ofs << "\t\t\"rate\": " << mDupRate << endl;
         ofs << "\t" << "}";
         ofs << "," << endl;
     }


=====================================
src/jsonreporter.h
=====================================
@@ -9,6 +9,9 @@
 #include "stats.h"
 #include "filterresult.h"
 #include <fstream>
+#include <atomic>
+#include "common.h"
+#include "util.h"
 
 using namespace std;
 
@@ -17,8 +20,8 @@ public:
     JsonReporter(Options* opt);
     ~JsonReporter();
 
-    void setDupHist(int* dupHist, double* dupMeanGC, double dupRate);
-    void setInsertHist(long* insertHist, int insertSizePeak);
+    void setDup(double dupRate);
+    void setInsertHist(atomic_long* insertHist, int insertSizePeak);
     void report(FilterResult* result, Stats* preStats1, Stats* postStats1, Stats* preStats2 = NULL, Stats* postStats2 = NULL);
 
 private:
@@ -26,9 +29,9 @@ private:
     int* mDupHist;
     double* mDupMeanGC;
     double mDupRate;
-    long* mInsertHist;
+    atomic_long* mInsertHist;
     int mInsertSizePeak;
 };
 
 
-#endif
\ No newline at end of file
+#endif


=====================================
src/knownadapters.h
=====================================
@@ -242,6 +242,8 @@ inline map<string, string> getKnownAdapter() {
     knownAdapters["TGGAATTCTCGGGTGCCAAGGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG"] = ">RNA_PCR_Primer_Index_3_(RPI3)";
     knownAdapters["TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC"] = ">FlowCell1";
     knownAdapters["TTTTTTTTTTCAAGCAGAAGACGGCATACGA"] = ">FlowCell2";
+    knownAdapters["AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA"] = ">MGI/BGI adapter (forward)";
+    knownAdapters["AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG"] = ">MGI/BGI adapter (reverse)";
 	
     return knownAdapters;
 }


=====================================
src/main.cpp
=====================================
@@ -35,8 +35,10 @@ int main(int argc, char* argv[]){
     cmd.add<string>("out1", 'o', "read1 output file name", false, "");
     cmd.add<string>("in2", 'I', "read2 input file name", false, "");
     cmd.add<string>("out2", 'O', "read2 output file name", false, "");
+    cmd.add("dedup", 'D', "enable deduplication to drop the duplicated reads/pairs");
     cmd.add<string>("unpaired1", 0, "for PE input, if read1 passed QC but read2 not, it will be written to unpaired1. Default is to discard it.", false, "");
     cmd.add<string>("unpaired2", 0, "for PE input, if read2 passed QC but read1 not, it will be written to unpaired2. If --unpaired2 is same as --unpaired1 (default mode), both unpaired reads will be written to this same file.", false, "");
+    cmd.add<string>("overlapped_out", 0, "for each read pair, output the overlapped region if it has no any mismatched base.", false, "");
     cmd.add<string>("failed_out", 0, "specify the file to store reads that cannot pass the filters.", false, "");
     cmd.add("merge", 'm', "for paired-end input, merge each pair of reads into a single read if they are overlapped. The merged reads will be written to the file given by --merged_out, the unmerged reads will be written to the files specified by --out1 and --out2. The merging mode is disabled by default.");
     cmd.add<string>("merged_out", 0, "in the merging mode, specify the file name to store merged output, or specify --stdout to stream the merged output", false, "");
@@ -48,6 +50,7 @@ int main(int argc, char* argv[]){
     cmd.add("interleaved_in", 0, "indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.");
     cmd.add<int>("reads_to_process", 0, "specify how many reads/pairs to be processed. Default 0 means process all reads.", false, 0);
     cmd.add("dont_overwrite", 0, "don't overwrite existing files. Overwritting is allowed by default.");
+    cmd.add("dont_eval_duplication", 0, "don't evaluate duplication rate to save time and use less memory.");
     cmd.add("fix_mgi_id", 0, "the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.");
     cmd.add("verbose", 'V', "output verbose log information (i.e. when every 1M reads are processed).");
 
@@ -164,9 +167,11 @@ int main(int argc, char* argv[]){
     opt.in2 = cmd.get<string>("in2");
     opt.out1 = cmd.get<string>("out1");
     opt.out2 = cmd.get<string>("out2");
+    opt.duplicate.dedup = cmd.exist("dedup");
     opt.unpaired1 = cmd.get<string>("unpaired1");
     opt.unpaired2 = cmd.get<string>("unpaired2");
     opt.failedOut = cmd.get<string>("failed_out");
+    opt.overlappedOut = cmd.get<string>("overlapped_out");
     // write to the same file
     if(opt.unpaired2.empty())
         opt.unpaired2 = opt.unpaired1;
@@ -179,6 +184,7 @@ int main(int argc, char* argv[]){
     opt.interleavedInput = cmd.exist("interleaved_in");
     opt.verbose = cmd.exist("verbose");
     opt.fixMGI = cmd.exist("fix_mgi_id");
+    opt.duplicate.enabled = !cmd.exist("dont_eval_duplication") || cmd.exist("dedup") ;
 
     // merge PE
     opt.merge.enabled = cmd.exist("merge");
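
With the new option wired up above, a paired-end invocation could look like this (a sketch; file names are placeholders):

```shell
# write the mismatch-free overlapped region of each pair to a separate file
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz \
    --overlapped_out overlap.fq.gz
```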


=====================================
src/options.cpp
=====================================
@@ -196,6 +196,12 @@ bool Options::validate() {
             error_exit(out2 + " already exists and you have set to not rewrite output files by --dont_overwrite");
         }
     }
+    if(!overlappedOut.empty()) {
+        //check_file_writable(out2);
+        if(dontOverwrite && file_exists(overlappedOut)) {
+            error_exit(overlappedOut + " already exists and you have set to not rewrite output files by --dont_overwrite");
+        }
+    }
     if(!isPaired()) {
         if(!unpaired1.empty()) {
             cerr << "Not paired-end mode. Ignoring argument --unpaired1 = " << unpaired1 << endl;
@@ -205,6 +211,10 @@ bool Options::validate() {
             cerr << "Not paired-end mode. Ignoring argument --unpaired2 = " << unpaired2 << endl;
             unpaired2 = "";
         }
+        if(!overlappedOut.empty()) {
+            cerr << "Not paired-end mode. Ignoring argument --overlapped_out = " << overlappedOut << endl;
+            overlappedOut = "";
+        }
     }
     if(split.enabled) {
         if(!unpaired1.empty()) {


=====================================
src/options.h
=====================================
@@ -35,11 +35,13 @@ public:
         enabled = true;
         keylen = 12;
         histSize = 32;
+        dedup = false;
     }
 public:
     bool enabled;
     int keylen;
     int histSize;
+    bool dedup;
 };
 
 class IndexFilterOptions {
@@ -306,6 +308,8 @@ public:
     // file name of failed reads output
     string failedOut;
     // file name of overlapped reads output
+    string overlappedOut;
+    // json file
     string jsonFile;
     // html file
     string htmlFile;


=====================================
src/peprocessor.cpp
=====================================
@@ -24,14 +24,15 @@ PairEndProcessor::PairEndProcessor(Options* opt){
     mUmiProcessor = new UmiProcessor(opt);
 
     int isizeBufLen = mOptions->insertSizeMax + 1;
-    mInsertSizeHist = new long[isizeBufLen];
-    memset(mInsertSizeHist, 0, sizeof(long)*isizeBufLen);
+    mInsertSizeHist = new atomic_long[isizeBufLen];
+    memset(mInsertSizeHist, 0, sizeof(atomic_long)*isizeBufLen);
     mLeftWriter =  NULL;
     mRightWriter = NULL;
     mUnpairedLeftWriter =  NULL;
     mUnpairedRightWriter = NULL;
     mMergedWriter = NULL;
     mFailedWriter = NULL;
+    mOverlappedWriter = NULL;
 
     mDuplicate = NULL;
     if(mOptions->duplicate.enabled) {
@@ -62,6 +63,9 @@ void PairEndProcessor::initOutput() {
     if(!mOptions->failedOut.empty())
         mFailedWriter = new WriterThread(mOptions, mOptions->failedOut);
 
+    if(!mOptions->overlappedOut.empty())
+        mOverlappedWriter = new WriterThread(mOptions, mOptions->overlappedOut);
+
     if(mOptions->out1.empty())
         return;
     
@@ -87,6 +91,10 @@ void PairEndProcessor::closeOutput() {
         delete mFailedWriter;
         mFailedWriter = NULL;
     }
+    if(mOverlappedWriter) {
+        delete mOverlappedWriter;
+        mOverlappedWriter = NULL;
+    }
     if(mUnpairedLeftWriter) {
         delete mUnpairedLeftWriter;
         mLeftWriter = NULL;
@@ -132,6 +140,7 @@ bool PairEndProcessor::process(){
     std::thread* unpairedRightWriterThread = NULL;
     std::thread* mergedWriterThread = NULL;
     std::thread* failedWriterThread = NULL;
+    std::thread* overlappedWriterThread = NULL;
     if(mLeftWriter)
         leftWriterThread = new std::thread(std::bind(&PairEndProcessor::writeTask, this, mLeftWriter));
     if(mRightWriter)
@@ -144,6 +153,8 @@ bool PairEndProcessor::process(){
         mergedWriterThread = new std::thread(std::bind(&PairEndProcessor::writeTask, this, mMergedWriter));
     if(mFailedWriter)
         failedWriterThread = new std::thread(std::bind(&PairEndProcessor::writeTask, this, mFailedWriter));
+    if(mOverlappedWriter)
+        overlappedWriterThread = new std::thread(std::bind(&PairEndProcessor::writeTask, this, mOverlappedWriter));
 
     producer.join();
     for(int t=0; t<mOptions->thread; t++){
@@ -163,6 +174,8 @@ bool PairEndProcessor::process(){
             mergedWriterThread->join();
         if(failedWriterThread)
             failedWriterThread->join();
+        if(overlappedWriterThread)
+            overlappedWriterThread->join();
     }
 
     if(mOptions->verbose)
@@ -197,7 +210,7 @@ bool PairEndProcessor::process(){
         cerr << "Read1 after filtering:"<<endl;
         finalPostStats1->print();
         cerr << endl;
-        cerr << "Read2 aftering filtering:"<<endl;
+        cerr << "Read2 after filtering:"<<endl;
         finalPostStats2->print();
     } else {
         cerr << "Merged and filtered:"<<endl;
@@ -208,16 +221,9 @@ bool PairEndProcessor::process(){
     cerr << "Filtering result:"<<endl;
     finalFilterResult->print();
 
-    int* dupHist = NULL;
-    double* dupMeanTlen = NULL;
-    double* dupMeanGC = NULL;
     double dupRate = 0.0;
     if(mOptions->duplicate.enabled) {
-        dupHist = new int[mOptions->duplicate.histSize];
-        memset(dupHist, 0, sizeof(int) * mOptions->duplicate.histSize);
-        dupMeanGC = new double[mOptions->duplicate.histSize];
-        memset(dupMeanGC, 0, sizeof(double) * mOptions->duplicate.histSize);
-        dupRate = mDuplicate->statAll(dupHist, dupMeanGC, mOptions->duplicate.histSize);
+        dupRate = mDuplicate->getDupRate();
         cerr << endl;
         cerr << "Duplication rate: " << dupRate * 100.0 << "%" << endl;
     }
@@ -241,13 +247,13 @@ bool PairEndProcessor::process(){
 
     // make JSON report
     JsonReporter jr(mOptions);
-    jr.setDupHist(dupHist, dupMeanGC, dupRate);
+    jr.setDup(dupRate);
     jr.setInsertHist(mInsertSizeHist, peakInsertSize);
     jr.report(finalFilterResult, finalPreStats1, finalPostStats1, finalPreStats2, finalPostStats2);
 
     // make HTML report
     HtmlReporter hr(mOptions);
-    hr.setDupHist(dupHist, dupMeanGC, dupRate);
+    hr.setDup(dupRate);
     hr.setInsertHist(mInsertSizeHist, peakInsertSize);
     hr.report(finalFilterResult, finalPreStats1, finalPostStats1, finalPreStats2, finalPostStats2);
 
@@ -265,11 +271,6 @@ bool PairEndProcessor::process(){
     delete finalPostStats2;
     delete finalFilterResult;
 
-    if(mOptions->duplicate.enabled) {
-        delete[] dupHist;
-        delete[] dupMeanGC;
-    }
-
     delete[] threads;
     delete[] configs;
 
@@ -285,6 +286,8 @@ bool PairEndProcessor::process(){
         delete mergedWriterThread;
     if(failedWriterThread)
         delete failedWriterThread;
+    if(overlappedWriterThread)
+        delete overlappedWriterThread;
 
     if(!mOptions->split.enabled)
         closeOutput();
@@ -312,6 +315,7 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
     string singleOutput;
     string mergedOutput;
     string failedOut;
+    string overlappedOut;
     int readPassed = 0;
     int mergedCount = 0;
     for(int p=0;p<pack->count;p++){
@@ -329,8 +333,12 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
         config->getPreStats2()->statRead(or2);
 
         // handling the duplication profiling
-        if(mDuplicate)
-            mDuplicate->statPair(or1, or2);
+        bool dedupOut = false;
+        if(mDuplicate) {
+            bool isDup = mDuplicate->checkPair(or1, or2);
+            if(mOptions->duplicate.dedup && isDup)
+                dedupOut = true;
+        }
 
         // filter by index
         if(mOptions->indexFilter.enabled && mFilter->filterByIndex(or1, or2)) {
@@ -385,6 +393,15 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
             }
         }
 
+        if(r1 != NULL && r2!=NULL && mOverlappedWriter) {
+            OverlapResult ov = OverlapAnalysis::analyze(r1, r2, mOptions->overlapDiffLimit, mOptions->overlapRequire, 0);
+            if(ov.overlapped) {
+                Read* overlappedRead = new Read(r1->mName, r1->mSeq.mStr.substr(max(0,ov.offset), ov.overlap_len), r1->mStrand, r1->mQuality.substr(max(0,ov.offset), ov.overlap_len));
+                overlappedOut += overlappedRead->toString();
+                delete overlappedRead;
+            }
+        }
+
         if(config->getThreadId() == 0 && !isizeEvaluated && r1 != NULL && r2!=NULL) {
             OverlapResult ov = OverlapAnalysis::analyze(r1, r2, mOptions->overlapDiffLimit, mOptions->overlapRequire, mOptions->overlapDiffPercentLimit/100.0);
             statInsertSize(r1, r2, ov, frontTrimmed1, frontTrimmed2);
@@ -423,14 +440,14 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
             } else if(mOptions->merge.includeUnmerged){
                 int result1 = mFilter->passFilter(r1);
                 config->addFilterResult(result1, 1);
-                if(result1 == PASS_FILTER) {
+                if(result1 == PASS_FILTER && !dedupOut) {
                     mergedOutput += r1->toString();
                     config->getPostStats1()->statRead(r1);
                 }
 
                 int result2 = mFilter->passFilter(r2);
                 config->addFilterResult(result2, 1);
-                if(result2 == PASS_FILTER) {
+                if(result2 == PASS_FILTER && !dedupOut) {
                     mergedOutput += r2->toString();
                     config->getPostStats1()->statRead(r2);
                 }
@@ -447,42 +464,45 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
 
             config->addFilterResult(max(result1, result2), 2);
 
-            if( r1 != NULL &&  result1 == PASS_FILTER && r2 != NULL && result2 == PASS_FILTER ) {
-                
-                if(mOptions->outputToSTDOUT && !mOptions->merge.enabled) {
-                    singleOutput += r1->toString() + r2->toString();
-                } else {
-                    outstr1 += r1->toString();
-                    outstr2 += r2->toString();
-                }
+            if(!dedupOut) {
 
-                // stats the read after filtering
-                if(!mOptions->merge.enabled) {
-                    config->getPostStats1()->statRead(r1);
-                    config->getPostStats2()->statRead(r2);
-                }
+                if( r1 != NULL &&  result1 == PASS_FILTER && r2 != NULL && result2 == PASS_FILTER ) {
+                    
+                    if(mOptions->outputToSTDOUT && !mOptions->merge.enabled) {
+                        singleOutput += r1->toString() + r2->toString();
+                    } else {
+                        outstr1 += r1->toString();
+                        outstr2 += r2->toString();
+                    }
 
-                readPassed++;
-            } else if( r1 != NULL &&  result1 == PASS_FILTER) {
-                if(mUnpairedLeftWriter) {
-                    unpairedOut1 += r1->toString();
-                    if(mFailedWriter)
-                        failedOut += or2->toStringWithTag(FAILED_TYPES[result2]);
-                } else {
-                    if(mFailedWriter) {
-                        failedOut += or1->toStringWithTag("paired_read_is_failing");
-                        failedOut += or2->toStringWithTag(FAILED_TYPES[result2]);
+                    // stats the read after filtering
+                    if(!mOptions->merge.enabled) {
+                        config->getPostStats1()->statRead(r1);
+                        config->getPostStats2()->statRead(r2);
                     }
-                }
-            } else if( r2 != NULL && result2 == PASS_FILTER) {
-                if(mUnpairedLeftWriter || mUnpairedRightWriter) {
-                    unpairedOut2 += r2->toString();
-                    if(mFailedWriter)
-                        failedOut += or1->toStringWithTag(FAILED_TYPES[result1]);
-                } else {
-                    if(mFailedWriter) {
-                        failedOut += or1->toStringWithTag(FAILED_TYPES[result1]);
-                        failedOut += or2->toStringWithTag("paired_read_is_failing");
+
+                    readPassed++;
+                } else if( r1 != NULL &&  result1 == PASS_FILTER) {
+                    if(mUnpairedLeftWriter) {
+                        unpairedOut1 += r1->toString();
+                        if(mFailedWriter)
+                            failedOut += or2->toStringWithTag(FAILED_TYPES[result2]);
+                    } else {
+                        if(mFailedWriter) {
+                            failedOut += or1->toStringWithTag("paired_read_is_failing");
+                            failedOut += or2->toStringWithTag(FAILED_TYPES[result2]);
+                        }
+                    }
+                } else if( r2 != NULL && result2 == PASS_FILTER) {
+                    if(mUnpairedLeftWriter || mUnpairedRightWriter) {
+                        unpairedOut2 += r2->toString();
+                        if(mFailedWriter)
+                            failedOut += or1->toStringWithTag(FAILED_TYPES[result1]);
+                    } else {
+                        if(mFailedWriter) {
+                            failedOut += or1->toStringWithTag(FAILED_TYPES[result1]);
+                            failedOut += or2->toStringWithTag("paired_read_is_failing");
+                        }
                     }
                 }
             }
@@ -529,6 +549,13 @@ bool PairEndProcessor::processPairEnd(ReadPairPack* pack, ThreadConfig* config){
         mFailedWriter->input(fdata, failedOut.size());
     }
 
+    if(mOverlappedWriter && !overlappedOut.empty()) {
+        // write overlapped data
+        char* odata = new char[overlappedOut.size()];
+        memcpy(odata, overlappedOut.c_str(), overlappedOut.size());
+        mOverlappedWriter->input(odata, overlappedOut.size());
+    }
+
     // normal output by left/right writer thread
     if(mRightWriter && mLeftWriter && (!outstr1.empty() || !outstr2.empty())) {
         // write PE
@@ -808,6 +835,8 @@ void PairEndProcessor::consumerTask(ThreadConfig* config)
             mMergedWriter->setInputCompleted();
         if(mFailedWriter)
             mFailedWriter->setInputCompleted();
+        if(mOverlappedWriter)
+            mOverlappedWriter->setInputCompleted();
     }
     
     if(mOptions->verbose) {


=====================================
src/peprocessor.h
=====================================
@@ -74,13 +74,14 @@ private:
     ofstream* mOutStream1;
     ofstream* mOutStream2;
     UmiProcessor* mUmiProcessor;
-    long* mInsertSizeHist;
+    atomic_long* mInsertSizeHist;
     WriterThread* mLeftWriter;
     WriterThread* mRightWriter;
     WriterThread* mUnpairedLeftWriter;
     WriterThread* mUnpairedRightWriter;
     WriterThread* mMergedWriter;
     WriterThread* mFailedWriter;
+    WriterThread* mOverlappedWriter;
     Duplicate* mDuplicate;
 };
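
`mInsertSizeHist` changing from `long*` to `atomic_long*` lets multiple consumer threads bump insert-size bins without a mutex; concurrent increments of a plain `long` are a data race. The pattern in isolation (bin count is illustrative; the caller is assumed to clamp the index):

```cpp
#include <atomic>

std::atomic<long> hist[512]; // zero-initialized (static storage duration)

void record(int isize) {
    // relaxed ordering suffices for independent counters
    hist[isize].fetch_add(1, std::memory_order_relaxed);
}
```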
 


=====================================
src/seprocessor.cpp
=====================================
@@ -135,28 +135,21 @@ bool SingleEndProcessor::process(){
     cerr << "Filtering result:"<<endl;
     finalFilterResult->print();
 
-    int* dupHist = NULL;
-    double* dupMeanTlen = NULL;
-    double* dupMeanGC = NULL;
     double dupRate = 0.0;
     if(mOptions->duplicate.enabled) {
-        dupHist = new int[mOptions->duplicate.histSize];
-        memset(dupHist, 0, sizeof(int) * mOptions->duplicate.histSize);
-        dupMeanGC = new double[mOptions->duplicate.histSize];
-        memset(dupMeanGC, 0, sizeof(double) * mOptions->duplicate.histSize);
-        dupRate = mDuplicate->statAll(dupHist, dupMeanGC, mOptions->duplicate.histSize);
+        dupRate = mDuplicate->getDupRate();
         cerr << endl;
         cerr << "Duplication rate (may be overestimated since this is SE data): " << dupRate * 100.0 << "%" << endl;
     }
 
     // make JSON report
     JsonReporter jr(mOptions);
-    jr.setDupHist(dupHist, dupMeanGC, dupRate);
+    jr.setDup(dupRate);
     jr.report(finalFilterResult, finalPreStats, finalPostStats);
 
     // make HTML report
     HtmlReporter hr(mOptions);
-    hr.setDupHist(dupHist, dupMeanGC, dupRate);
+    hr.setDup(dupRate);
     hr.report(finalFilterResult, finalPreStats, finalPostStats);
 
     // clean up
@@ -171,11 +164,6 @@ bool SingleEndProcessor::process(){
     delete finalPostStats;
     delete finalFilterResult;
 
-    if(mOptions->duplicate.enabled) {
-        delete[] dupHist;
-        delete[] dupMeanGC;
-    }
-
     delete[] threads;
     delete[] configs;
 
@@ -203,8 +191,12 @@ bool SingleEndProcessor::processSingleEnd(ReadPack* pack, ThreadConfig* config){
         config->getPreStats1()->statRead(or1);
 
         // handling the duplication profiling
-        if(mDuplicate)
-            mDuplicate->statRead(or1);
+        bool dedupOut = false;
+        if(mDuplicate) {
+            bool isDup = mDuplicate->checkRead(or1);
+            if(mOptions->duplicate.dedup && isDup)
+                dedupOut = true;
+        }
 
         // filter by index
         if(mOptions->indexFilter.enabled && mFilter->filterByIndex(or1)) {
@@ -254,14 +246,16 @@ bool SingleEndProcessor::processSingleEnd(ReadPack* pack, ThreadConfig* config){
 
         config->addFilterResult(result, 1);
 
-        if( r1 != NULL &&  result == PASS_FILTER) {
-            outstr += r1->toString();
+        if(!dedupOut) {
+            if( r1 != NULL &&  result == PASS_FILTER) {
+                outstr += r1->toString();
 
-            // stats the read after filtering
-            config->getPostStats1()->statRead(r1);
-            readPassed++;
-        } else if(mFailedWriter) {
-            failedOut += or1->toStringWithTag(FAILED_TYPES[result]);
+                // stats the read after filtering
+                config->getPostStats1()->statRead(r1);
+                readPassed++;
+            } else if(mFailedWriter) {
+                failedOut += or1->toStringWithTag(FAILED_TYPES[result]);
+            }
         }
 
         delete or1;


=====================================
src/util.h
=====================================
@@ -9,6 +9,7 @@
 #include <algorithm>
 #include <time.h>
 #include <mutex>
+#include <regex>
 
 using namespace std;
 
@@ -187,6 +188,10 @@ inline void check_file_valid(const  string& s) {
     }
 }
 
+inline bool check_filename_valid(const string& s){
+    return 0 < trim(s).length() && trim(s).length() <= 255 && regex_match(s, regex("^[A-Za-z0-9_\\.\\-]+$"));
+}
+
 inline void check_file_writable(const  string& s) {
     string dir = dirname(s);
     if(!file_exists(dir)) {
@@ -267,7 +272,7 @@ inline void loginfo(const string s){
     logmtx.lock();
     time_t tt = time(NULL);
     tm* t= localtime(&tt);
-    cerr<<"["<<t->tm_hour<<":"<<t->tm_min<<":"<<t->tm_sec<<"] "<<s<<endl;
+    fprintf(stderr, "[%02d:%02d:%02d] %s\n", t->tm_hour, t->tm_min, t->tm_sec, s.c_str());
     logmtx.unlock();
 }
 



View it on GitLab: https://salsa.debian.org/med-team/fastp/-/compare/07e4806299fc4a25e2e3e0ea5d95f7f436510efb...424e5ca91edb3ec999722793de1ad6b5baeec727


