[med-svn] [Git][med-team/porechop][upstream] New upstream version 0.2.4+dfsg

Steffen Möller gitlab at salsa.debian.org
Tue Oct 23 12:58:45 BST 2018


Steffen Möller pushed to branch upstream at Debian Med / porechop


Commits:
9904cd8b by Steffen Moeller at 2018-10-23T11:53:12Z
New upstream version 0.2.4+dfsg
- - - - -


5 changed files:

- README.md
- porechop/adapters.py
- porechop/nanopore_read.py
- porechop/porechop.py
- porechop/version.py


Changes:

=====================================
README.md
=====================================
@@ -5,6 +5,13 @@ Porechop is a tool for finding and removing adapters from [Oxford Nanopore](http
 Porechop also supports demultiplexing of Nanopore reads that were barcoded with the [Native Barcoding Kit](https://store.nanoporetech.com/native-barcoding-kit-1d.html), [PCR Barcoding Kit](https://store.nanoporetech.com/pcr-barcoding-kit-96.html) or [Rapid Barcoding Kit](https://store.nanoporetech.com/rapid-barcoding-sequencing-kit.html).
 
 
+### Oct 2018 update: Porechop is officially unsupported
+
+While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered [issues](https://github.com/rrwick/Porechop/issues) and [pull requests](https://github.com/rrwick/Porechop/pulls) reveal that it already has been for some time). I've added a [known issues](#known-issues) section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though :stuck_out_tongue:) try to rewrite it from a blank canvas to address its faults.
+
+
+
+
 
 # Table of contents
 
@@ -24,6 +31,7 @@ Porechop also supports demultiplexing of Nanopore reads that were barcoded with
     * [Verbose output](#verbose-output)
 * [Known adapters](#known-adapters)
 * [Full usage](#full-usage)
+* [Known issues](#known-issues)
 * [Acknowledgements](#acknowledgements)
 * [License](#license)
 
@@ -91,9 +99,6 @@ __Demultiplex barcoded reads:__<br>
 __Demultiplex barcoded reads, straight from Albacore output directory:__<br>
 `porechop -i albacore_dir -b output_dir`
 
-__Demultiplex barcoded reads without trimming (appropriate if Nanopolish will be used):__<br>
-`porechop -i input_reads.fastq.gz -b output_dir --untrimmed`
-
 __Also works with FASTA:__<br>
 `porechop -i input_reads.fasta -o output_reads.fasta`
 
@@ -143,11 +148,9 @@ TGTTGTTGTTGTTATTGTTGTTATTGTTGTTGTATTGTTGTTATTGTTGTTGTTGTACATTGTTATTGTTGTATTGTTGT
 
 If you run Porechop with `--discard_middle`, the reads with internal adapters will be thrown out instead of split.
 
-This approach might make sense if you are trimming reads from a barcoded run, as chimeric reads may combine sequences from different bins. For example, consider this read:
-```
-BC01_rev - SEQUENCE_1 - SQK-NSK007_Y_Top - BC02_rev -  SEQUENCE_2
-```
-SEQUENCE_1 belongs in the barcode 1 bin and SEQUENCE_2 belongs in the barcode 2 bin, so while we could split the read, we would end up with contamination from another bin. Throwing the read out with `--discard_middle` might be the better option.
+If you plan on using your reads with [Nanopolish](https://github.com/jts/nanopolish), then the `--discard_middle` option is required. This is because Nanopolish first runs [`nanopolish index`](https://github.com/jts/nanopolish#data-preprocessing) to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop _without_ `--discard_middle`, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.
+
+This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination.
 
 
 ### Barcode demultiplexing
@@ -189,10 +192,10 @@ Alternately, you can run Porechop with `-b` which specifies a directory for barc
 
 If Porechop is run without `-o` or `-b`, then it will output the trimmed reads to stdout and print its progress info to stderr. The output format of the reads will be FASTA/FASTQ based on the input reads, or else can be specified using `--format`.
 
-The `--verbosity` option will change the output of progress info:
+The `--verbosity` option will change the amount of progress info:
 * `--verbosity 0` gives no progress output.
-* `--verbosity 1` (the default) gives summary info about end adapter trimming and shows all instances of middle adapter splitting.
-* `--verbosity 2` shows sequences and is described below.
+* `--verbosity 1` (default) gives summary info about end adapter trimming and middle adapter splitting.
+* `--verbosity 2` shows the actual trimmed/split sequences for each read (described more below).
 * `--verbosity 3` shows tons of data (mainly for debugging).
 
 
@@ -233,103 +236,143 @@ If you want to add your own adapter sequences to Porechop, you can do so by edit
 # Full usage
 
 ```
-usage: porechop [-h] -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
-                [-t THREADS] [--version] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
+usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
+                [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
                 [--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed]
-                [--adapter_threshold ADAPTER_THRESHOLD] [--check_reads CHECK_READS]
-                [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE] [--min_trim_size MIN_TRIM_SIZE]
-                [--extra_end_trim EXTRA_END_TRIM] [--end_threshold END_THRESHOLD] [--discard_middle]
+                [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD]
+                [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE]
+                [--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM]
+                [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle]
                 [--middle_threshold MIDDLE_THRESHOLD]
                 [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]
                 [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE]
-                [--min_split_read_size MIN_SPLIT_READ_SIZE]
-
-Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and splitting
-reads with internal adapters
+                [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]
 
-optional arguments:
-  -h, --help                        show this help message and exit
+Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and
+splitting reads with internal adapters
 
 Main options:
-  -i INPUT, --input INPUT           FASTA/FASTQ of input reads or a directory which will be recursively
-                                    searched for FASTQ files (required)
-  -o OUTPUT, --output OUTPUT        Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
-                                    reads will be printed to stdout)
+  -i INPUT, --input INPUT        FASTA/FASTQ of input reads or a directory which will be
+                                 recursively searched for FASTQ files (required)
+  -o OUTPUT, --output OUTPUT     Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
+                                 reads will be printed to stdout)
   --format {auto,fasta,fastq,fasta.gz,fastq.gz}
-                                    Output format for the reads - if auto, the format will be chosen based
-                                    on the output filename or the input read format (default: auto)
+                                 Output format for the reads - if auto, the format will be chosen
+                                 based on the output filename or the input read format (default:
+                                 auto)
   -v VERBOSITY, --verbosity VERBOSITY
-                                    Level of progress information: 0 = none, 1 = some, 2 = lots, 3 = full
-                                    - output will go to stdout if reads are saved to a file and stderr if
-                                    reads are printed to stdout (default: 1)
-  -t THREADS, --threads THREADS     Number of threads to use for adapter alignment (default: 8)
-  --version                         show program's version number and exit
+                                 Level of progress information: 0 = none, 1 = some, 2 = lots, 3 =
+                                 full - output will go to stdout if reads are saved to a file and
+                                 stderr if reads are printed to stdout (default: 1)
+  -t THREADS, --threads THREADS  Number of threads to use for adapter alignment (default: 8)
 
 Barcode binning settings:
   Control the binning of reads based on barcodes (i.e. barcode demultiplexing)
 
   -b BARCODE_DIR, --barcode_dir BARCODE_DIR
-                                    Reads will be binned based on their barcode and saved to separate
-                                    files in this directory (incompatible with --output)
+                                 Reads will be binned based on their barcode and saved to separate
+                                 files in this directory (incompatible with --output)
   --barcode_threshold BARCODE_THRESHOLD
-                                    A read must have at least this percent identity to a barcode to be
-                                    binned (default: 75.0)
-  --barcode_diff BARCODE_DIFF       If the difference between a read's best barcode identity and its
-                                    second-best barcode identity is less than this value, it will not be
-                                    put in a barcode bin (to exclude cases which are too close to call)
-                                    (default: 5.0)
-  --require_two_barcodes            Reads will only be put in barcode bins if they have a strong match for
-                                    the barcode on both their start and end (default: a read can be binned
-                                    with a match at its start or end)
-  --untrimmed                       Bin reads but do not trim the ends (appropriate if reads are to be
-                                    used with Nanopolish) (default: False)
+                                 A read must have at least this percent identity to a barcode to be
+                                 binned (default: 75.0)
+  --barcode_diff BARCODE_DIFF    If the difference between a read's best barcode identity and its
+                                 second-best barcode identity is less than this value, it will not
+                                 be put in a barcode bin (to exclude cases which are too close to
+                                 call) (default: 5.0)
+  --require_two_barcodes         Reads will only be put in barcode bins if they have a strong match
+                                 for the barcode on both their start and end (default: a read can
+                                 be binned with a match at its start or end)
+  --untrimmed                    Bin reads but do not trim them (default: trim the reads)
+  --discard_unassigned           Discard unassigned reads (instead of creating a "none" bin)
+                                 (default: False)
 
 Adapter search settings:
   Control how the program determines which adapter sets are present
 
   --adapter_threshold ADAPTER_THRESHOLD
-                                    An adapter set has to have at least this percent identity to be
-                                    labelled as present and trimmed off (0 to 100) (default: 90.0)
-  --check_reads CHECK_READS         This many reads will be aligned to all possible adapters to determine
-                                    which adapter sets are present (default: 10000)
-  --scoring_scheme SCORING_SCHEME   Comma-delimited string of alignment scores: match, mismatch, gap open,
-                                    gap extend (default: 3,-6,-5,-2)
+                                 An adapter set has to have at least this percent identity to be
+                                 labelled as present and trimmed off (0 to 100) (default: 90.0)
+  --check_reads CHECK_READS      This many reads will be aligned to all possible adapters to
+                                 determine which adapter sets are present (default: 10000)
+  --scoring_scheme SCORING_SCHEME
+                                 Comma-delimited string of alignment scores: match, mismatch, gap
+                                 open, gap extend (default: 3,-6,-5,-2)
 
 End adapter settings:
   Control the trimming of adapters from read ends
 
-  --end_size END_SIZE               The number of base pairs at each end of the read which will be
-                                    searched for adapter sequences (default: 150)
-  --min_trim_size MIN_TRIM_SIZE     Adapter alignments smaller than this will be ignored (default: 4)
-  --extra_end_trim EXTRA_END_TRIM   This many additional bases will be removed next to adapters found at
-                                    the ends of reads (default: 2)
-  --end_threshold END_THRESHOLD     Adapters at the ends of reads must have at least this percent identity
-                                    to be removed (0 to 100) (default: 75.0)
+  --end_size END_SIZE            The number of base pairs at each end of the read which will be
+                                 searched for adapter sequences (default: 150)
+  --min_trim_size MIN_TRIM_SIZE  Adapter alignments smaller than this will be ignored (default: 4)
+  --extra_end_trim EXTRA_END_TRIM
+                                 This many additional bases will be removed next to adapters found
+                                 at the ends of reads (default: 2)
+  --end_threshold END_THRESHOLD  Adapters at the ends of reads must have at least this percent
+                                 identity to be removed (0 to 100) (default: 75.0)
 
 Middle adapter settings:
   Control the splitting of read from middle adapters
 
-  --no_split                        Skip splitting reads based on middle adapters (default: split reads
-                                    when an adapter is found in the middle)
-  --discard_middle                  Reads with middle adapters will be discarded (default: reads with
-                                    middle adapters are split) (this option is on by default when
-                                    outputting reads into barcode bins)
+  --no_split                     Skip splitting reads based on middle adapters (default: split
+                                 reads when an adapter is found in the middle)
+  --discard_middle               Reads with middle adapters will be discarded (default: reads with
+                                 middle adapters are split) (required for reads to be used with
+                                 Nanopolish, this option is on by default when outputting reads
+                                 into barcode bins)
   --middle_threshold MIDDLE_THRESHOLD
-                                    Adapters in the middle of reads must have at least this percent
-                                    identity to be found (0 to 100) (default: 85.0)
+                                 Adapters in the middle of reads must have at least this percent
+                                 identity to be found (0 to 100) (default: 85.0)
   --extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE
-                                    This many additional bases will be removed next to middle adapters on
-                                    their "good" side (default: 10)
+                                 This many additional bases will be removed next to middle adapters
+                                 on their "good" side (default: 10)
   --extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE
-                                    This many additional bases will be removed next to middle adapters on
-                                    their "bad" side (default: 100)
+                                 This many additional bases will be removed next to middle adapters
+                                 on their "bad" side (default: 100)
   --min_split_read_size MIN_SPLIT_READ_SIZE
-                                    Post-split read pieces smaller than this many base pairs will not be
-                                    outputted (default: 1000)
+                                 Post-split read pieces smaller than this many base pairs will not
+                                 be outputted (default: 1000)
+
+Help:
+  -h, --help                     Show this help message and exit
+  --version                      Show program's version number and exit
 ```
 
 
 
+# Known issues
+
+### Adapter search
+
+Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:
+
+* As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow.
+* Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
+* Many ONT adapters share common sequence with each other, making false positive adapter finds possible.
+
+A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user.  E.g. `porechop --sqk-lsk109` or `porechop --start_adapt ACGCTAGCATACGT`.
+
+
+### Performance
+
+Porechop uses [SeqAn](https://github.com/seqan/seqan) to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as [Edlib](https://github.com/Martinsos/edlib).
+
+Another performance issue is that Porechop uses [ctypes](https://docs.python.org/3/library/ctypes.html) to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel).
+
+
+### Barcode demultiplexing
+
+I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools.
+
+
+### Base-space problems
+
+I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See [issue #40](https://github.com/rrwick/Porechop/issues/40) for an example.
+
+These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers.
+
+
+
+
 # Acknowledgements
 
 Porechop was inspired by (and largely coded during) [Porecamp Australia 2017](https://porecamp-au.github.io/). Thanks to the organisers and attendees who helped me realise that a Nanopore adapter trimmer might be a useful tool! I later met David Stoddart from Oxford Nanopore at London Calling 2017, and he helped me get many of the adapter sequences right.


=====================================
porechop/adapters.py
=====================================
@@ -43,7 +43,9 @@ class Adapter(object):
         Gets the barcode name for the output files. We want a concise name, so it looks at all
         options and chooses the shortest.
         """
-        possible_names = [self.name, self.start_sequence[0]]
+        possible_names = [self.name]
+        if self.start_sequence:
+            possible_names.append(self.start_sequence[0])
         if self.end_sequence:
             possible_names.append(self.end_sequence[0])
         barcode_name = sorted(possible_names, key=lambda x: len(x))[0]
@@ -76,28 +78,36 @@ ADAPTERS = [Adapter('SQK-NSK007',
                     start_sequence=('SQK-NSK007_Y_Top', 'AATGTACTTCGTTCAGTTACGTATTGCT'),
                     end_sequence=('SQK-NSK007_Y_Bottom', 'GCAATACGTAACTGAACGAAGT')),
 
+
             Adapter('Rapid',
-                    start_sequence=('Rapid_adapter', 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
+                    start_sequence=('Rapid_adapter',
+                                    'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA')),
+
+            Adapter('RBK004_upstream',
+                    start_sequence=('RBK004_upstream', 'AATGTACTTCGTTCAGTTACGGCTTGGGTGTTTAACC')),
+
 
             Adapter('SQK-MAP006',
-                            start_sequence=('SQK-MAP006_Y_Top_SK63',    'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
-                            end_sequence=  ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
+                    start_sequence=('SQK-MAP006_Y_Top_SK63',    'GGTTGTTTCTGTTGGTGCTGATATTGCT'),
+                    end_sequence=  ('SQK-MAP006_Y_Bottom_SK64', 'GCAATATCAGCACCAACAGAAA')),
 
-            Adapter('SQK-MAP006 Short',
+            Adapter('SQK-MAP006 short',
                     start_sequence=('SQK-MAP006_Short_Y_Top_LI32',    'CGGCGTCTGCTTGGGTGTTTAACCT'),
                     end_sequence=  ('SQK-MAP006_Short_Y_Bottom_LI33', 'GGTTAAACACCCAAGCAGACGCCG')),
 
+
+            # The PCR adapters are used both in PCR DNA kits and some cDNA kits.
             Adapter('PCR adapters 1',
                     start_sequence=('PCR_1_start', 'ACTTGCCTGTCGCTCTATCTTC'),
-                    end_sequence=  ('PCR_1_end', 'GAAGATAGAGCGACAGGCAAGT')),
+                    end_sequence=  ('PCR_1_end',   'GAAGATAGAGCGACAGGCAAGT')),
 
-            Adapter('PCR tail 1',
-                    start_sequence=('PCR_tail_1_start', 'TTAACCTTTCTGTTGGTGCTGATATTGC'),
-                    end_sequence=  ('PCR_tail_1_end',   'GCAATATCAGCACCAACAGAAAGGTTAA')),
+            Adapter('PCR adapters 2',
+                    start_sequence=('PCR_2_start', 'TTTCTGTTGGTGCTGATATTGC'),
+                    end_sequence=  ('PCR_2_end',   'GCAATATCAGCACCAACAGAAA')),
 
-            Adapter('PCR tail 2',
-                    start_sequence=('PCR_tail_2_start', 'TTAACCTACTTGCCTGTCGCTCTATCTTC'),
-                    end_sequence=  ('PCR_tail_2_end',   'GAAGATAGAGCGACAGGCAAGTAGGTTAA')),
+            Adapter('PCR adapters 3',
+                    start_sequence=('PCR_3_start', 'TACTTGCCTGTCGCTCTATCTTC'),
+                    end_sequence=  ('PCR_3_end',   'GAAGATAGAGCGACAGGCAAGTA')),
 
 
             # 1D^2 kit adapters are interesting. ONT provided the following sequences on their site:
@@ -117,6 +127,11 @@ ADAPTERS = [Adapter('SQK-NSK007',
             # adapter sequences here.
 
 
+            Adapter('cDNA SSP',
+                    start_sequence=('cDNA_SSP',     'TTTCTGTTGGTGCTGATATTGCTGCCATTACGGCCGGG'),
+                    end_sequence=  ('cDNA_SSP_rev', 'CCCGGCCGTAATGGCAGCAATATCAGCACCAACAGAAA')),
+
+
             # Some barcoding kits (like the native barcodes) use the rev comp barcode at the start
             # of the read and the forward barcode at the end of the read.
             Adapter('Barcode 1 (reverse)',
@@ -461,11 +476,23 @@ def make_full_native_barcode_adapter(barcode_num):
                    end_sequence=('NB' + '%02d' % barcode_num + '_end', end_full_seq))
 
 
-def make_full_rapid_barcode_adapter(barcode_num):
+def make_old_full_rapid_barcode_adapter(barcode_num):  # applies to SQK-RBK001
+    barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
+    start_barcode_seq = barcode.start_sequence[1]
+
+    start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'TATTGCT' + start_barcode_seq + \
+                     'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
+
+    return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, old)',
+                   start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))
+
+
+def make_new_full_rapid_barcode_adapter(barcode_num):  # applies to SQK-RBK004
     barcode = [x for x in ADAPTERS if x.name == 'Barcode ' + str(barcode_num) + ' (forward)'][0]
     start_barcode_seq = barcode.start_sequence[1]
 
-    start_full_seq = 'AATGTACTTCGTTCAGTTACGTATTGCT' + start_barcode_seq + 'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
+    start_full_seq = 'AATGTACTTCGTTCAGTTACG' + 'GCTTGGGTGTTTAACC' + start_barcode_seq + \
+                     'GTTTTCGCATTTATCGTGAAACGCTTTCGCGTTTTTCGTGCGCCGCTTCA'
 
-    return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence)',
+    return Adapter('Rapid barcoding ' + str(barcode_num) + ' (full sequence, new)',
                    start_sequence=('RB' + '%02d' % barcode_num + '_full', start_full_seq))


=====================================
porechop/nanopore_read.py
=====================================
@@ -23,7 +23,13 @@ class NanoporeRead(object):
     def __init__(self, name, seq, quals):
         self.name = name
 
-        self.seq = seq
+        self.seq = seq.upper()
+        if self.seq.count('U') > self.seq.count('T'):
+            self.rna = True
+            self.seq = self.seq.replace('U', 'T')
+        else:
+            self.rna = False
+
         self.quals = quals
         if len(quals) < len(seq):
             self.quals += '+' * (len(seq) - len(quals))
@@ -96,6 +102,8 @@ class NanoporeRead(object):
                 seq = self.get_seq_with_start_end_adapters_trimmed()
             if not seq:  # Don't return empty sequences
                 return ''
+            if self.rna:
+                seq = seq.replace('T', 'U')
             return ''.join(['>', self.name, '\n', add_line_breaks_to_sequence(seq, 70)])
         elif discard_middle:
             return ''
@@ -106,6 +114,8 @@ class NanoporeRead(object):
                 if not split_read_part[0]:  # Don't return empty sequences
                     return ''
                 seq = add_line_breaks_to_sequence(split_read_part[0], 70)
+                if self.rna:
+                    seq = seq.replace('T', 'U')
                 fasta_str += ''.join(['>', read_name, '\n', seq])
             return fasta_str
 
@@ -119,6 +129,8 @@ class NanoporeRead(object):
                 quals = self.get_quals_with_start_end_adapters_trimmed()
             if not seq:  # Don't return empty sequences
                 return ''
+            if self.rna:
+                seq = seq.replace('T', 'U')
             return ''.join(['@', self.name, '\n', seq, '\n+\n', quals, '\n'])
         elif discard_middle:
             return ''
@@ -126,10 +138,12 @@ class NanoporeRead(object):
             fastq_str = ''
             for i, split_read_part in enumerate(self.get_split_read_parts(min_split_read_size)):
                 read_name = add_number_to_read_name(self.name, i + 1)
-                if not split_read_part[0]:  # Don't return empty sequences
+                seq, qual = split_read_part[0], split_read_part[1]
+                if not seq:  # Don't return empty sequences
                     return ''
-                fastq_str += ''.join(['@', read_name, '\n', split_read_part[0], '\n+\n',
-                                      split_read_part[1], '\n'])
+                if self.rna:
+                    seq = seq.replace('T', 'U')
+                fastq_str += ''.join(['@', read_name, '\n', seq, '\n+\n', qual, '\n'])
             return fastq_str
 
     def align_adapter_set(self, adapter_set, end_size, scoring_scheme_vals):
@@ -138,10 +152,11 @@ class NanoporeRead(object):
         This is not to determine where to trim the reads, but rather to figure out which adapter
         sets are present in the data.
         """
-        read_seq_start = self.seq[:end_size]
-        score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
-                                       scoring_scheme_vals)
-        adapter_set.best_start_score = max(adapter_set.best_start_score, score)
+        if adapter_set.start_sequence:
+            read_seq_start = self.seq[:end_size]
+            score, _, _, _ = align_adapter(read_seq_start, adapter_set.start_sequence[1],
+                                           scoring_scheme_vals)
+            adapter_set.best_start_score = max(adapter_set.best_start_score, score)
         if adapter_set.end_sequence:
             read_seq_end = self.seq[-end_size:]
             score, _, _, _ = align_adapter(read_seq_end, adapter_set.end_sequence[1],
@@ -156,6 +171,8 @@ class NanoporeRead(object):
         """
         read_seq_start = self.seq[:end_size]
         for adapter in adapters:
+            if not adapter.start_sequence:
+                continue
             full_score, partial_score, read_start, read_end = \
                 align_adapter(read_seq_start, adapter.start_sequence[1], scoring_scheme_vals)
             if partial_score > end_threshold and read_end != end_size and \


=====================================
porechop/porechop.py
=====================================
@@ -24,7 +24,8 @@ import re
 from multiprocessing.dummy import Pool as ThreadPool
 from collections import defaultdict
 from .misc import load_fasta_or_fastq, print_table, red, bold_underline, MyHelpFormatter, int_to_str
-from .adapters import ADAPTERS, make_full_native_barcode_adapter, make_full_rapid_barcode_adapter
+from .adapters import ADAPTERS, make_full_native_barcode_adapter,\
+    make_old_full_rapid_barcode_adapter, make_new_full_rapid_barcode_adapter
 from .nanopore_read import NanoporeRead
 from .version import __version__
 
@@ -37,16 +38,17 @@ def main():
     matching_sets = find_matching_adapter_sets(check_reads, args.verbosity, args.end_size,
                                                args.scoring_scheme_vals, args.print_dest,
                                                args.adapter_threshold, args.threads)
-    matching_sets = exclude_end_adapters_for_rapid(matching_sets)
     matching_sets = fix_up_1d2_sets(matching_sets)
-    display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
-    matching_sets = add_full_barcode_adapter_sets(matching_sets)
 
     if args.barcode_dir:
         forward_or_reverse_barcodes = choose_barcoding_kit(matching_sets, args.verbosity,
                                                            args.print_dest)
     else:
         forward_or_reverse_barcodes = None
+
+    display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
+    matching_sets = add_full_barcode_adapter_sets(matching_sets)
+
     if args.verbosity > 0:
         print('\n', file=args.print_dest)
 
@@ -86,7 +88,7 @@ def get_arguments():
     parser = argparse.ArgumentParser(description='Porechop: a tool for finding adapters in Oxford '
                                                  'Nanopore reads, trimming them from the ends and '
                                                  'splitting reads with internal adapters',
-                                     formatter_class=MyHelpFormatter)
+                                     formatter_class=MyHelpFormatter, add_help=False)
     main_group = parser.add_argument_group('Main options')
     main_group.add_argument('-i', '--input', required=True,
                             help='FASTA/FASTQ of input reads or a directory which will be '
@@ -105,7 +107,6 @@ def get_arguments():
                                  'a file and stderr if reads are printed to stdout')
     main_group.add_argument('-t', '--threads', type=int, default=default_threads,
                             help='Number of threads to use for adapter alignment')
-    main_group.add_argument('--version', action='version', version=__version__)
 
     barcode_group = parser.add_argument_group('Barcode binning settings',
                                               'Control the binning of reads based on barcodes '
@@ -127,8 +128,7 @@ def get_arguments():
                                     'match for the barcode on both their start and end (default: '
                                     'a read can be binned with a match at its start or end)')
     barcode_group.add_argument('--untrimmed', action='store_true',
-                               help='Bin reads but do not trim them (appropriate if reads are to '
-                                    'be used with Nanopolish) (default: trim the reads)')
+                               help='Bin reads but do not trim them (default: trim the reads)')
     barcode_group.add_argument('--discard_unassigned', action='store_true',
                                help='Discard unassigned reads (instead of creating a "none" bin)')
 
@@ -169,9 +169,10 @@ def get_arguments():
                                         'middle)')
     middle_trim_group.add_argument('--discard_middle', action='store_true',
                                    help='Reads with middle adapters will be discarded (default: '
-                                        'reads with middle adapters are split) (this option is '
-                                        'on by default when outputting reads into barcode bins)')
-    middle_trim_group.add_argument('--middle_threshold', type=float, default=85.0,
+                                        'reads with middle adapters are split) (required for '
+                                        'reads to be used with Nanopolish, this option is on by '
+                                        'default when outputting reads into barcode bins)')
+    middle_trim_group.add_argument('--middle_threshold', type=float, default=90.0,
                                    help='Adapters in the middle of reads must have at least this '
                                         'percent identity to be found (0 to 100)')
     middle_trim_group.add_argument('--extra_middle_trim_good_side', type=int, default=10,
@@ -184,6 +185,12 @@ def get_arguments():
                                    help='Post-split read pieces smaller than this many base pairs '
                                         'will not be outputted')
 
+    help_args = parser.add_argument_group('Help')
+    help_args.add_argument('-h', '--help', action='help', default=argparse.SUPPRESS,
+                           help='Show this help message and exit')
+    help_args.add_argument('--version', action='version', version=__version__,
+                           help="Show program's version number and exit")
+
     args = parser.parse_args()
 
     try:
@@ -325,35 +332,43 @@ def choose_barcoding_kit(adapter_sets, verbosity, print_dest):
     If the user is sorting reads by barcode bin, choose one barcode configuration (rev comp
     barcodes at the start of the read or at the end of the read) and ignore the other.
     """
-    forward_barcodes = 0
-    reverse_barcodes = 0
+    # Tally up scores for forward and reverse barcodes.
+    forward_start_or_end, reverse_start_or_end = 0, 0
+    forward_start_and_end, reverse_start_and_end = 0, 0
     for adapter_set in adapter_sets:
-        score = adapter_set.best_start_or_end_score()
-        if 'Barcode' in adapter_set.name and '(forward)' in adapter_set.name:
-            forward_barcodes += score
-        elif 'Barcode' in adapter_set.name and '(reverse)' in adapter_set.name:
-            reverse_barcodes += score
-    if forward_barcodes > reverse_barcodes:
-        if verbosity > 0:
-            print('\nBarcodes determined to be in forward orientation', file=print_dest)
-        return 'forward'
-    elif reverse_barcodes > forward_barcodes:
-        if verbosity > 0:
-            print('\nBarcodes determined to be in reverse orientation', file=print_dest)
-        return 'reverse'
-    else:
-        return None
+        if 'barcode' in adapter_set.name.lower():
+            if '(forward)' in adapter_set.name.lower():
+                forward_start_or_end += adapter_set.best_start_or_end_score()
+                forward_start_and_end += adapter_set.best_start_score
+                forward_start_and_end += adapter_set.best_end_score
+            elif '(reverse)' in adapter_set.name.lower():
+                reverse_start_or_end += adapter_set.best_start_or_end_score()
+                reverse_start_and_end += adapter_set.best_start_score
+                reverse_start_and_end += adapter_set.best_end_score
+
+    if forward_start_or_end == 0 and reverse_start_or_end == 0:
+        sys.exit('Error: no barcodes were found, so Porechop cannot perform barcode demultiplexing')
+
+    # If possible, make a decision using each barcode's best start OR end score.
+    orientation = None
+    if forward_start_or_end > reverse_start_or_end:
+        orientation = 'forward'
+    elif reverse_start_or_end > forward_start_or_end:
+        orientation = 'reverse'
+
+    # If that didn't work (i.e. it's a tie between forward and reverse), then choose based on the
+    # sum of both start AND end scores.
+    elif forward_start_and_end > reverse_start_and_end:
+        orientation = 'forward'
+    elif reverse_start_and_end > forward_start_and_end:
+        orientation = 'reverse'
+
+    if orientation is None:
+        sys.exit('Error: Porechop could not determine barcode orientation')
 
-
-def exclude_end_adapters_for_rapid(matching_sets):
-    """
-    Rapid reads shouldn't have end adapters, so we don't want to look for them if this seems to be
-    a rapid read set.
-    """
-    if 'Rapid adapter' in [x.name for x in matching_sets]:
-        for s in matching_sets:
-            s.end_sequence = None
-    return matching_sets
+    if verbosity > 0:
+        print('\nBarcodes determined to be in ' + orientation + ' orientation', file=print_dest)
+    return orientation
 
 
 def fix_up_1d2_sets(matching_sets):
@@ -411,8 +426,11 @@ def add_full_barcode_adapter_sets(matching_sets):
 
         # Rapid barcode full sequences
         if all(x in matching_set_names
-               for x in ['SQK-NSK007', 'Rapid', 'Barcode ' + str(i) + ' (forward)']):
-            matching_sets.append(make_full_rapid_barcode_adapter(i))
+               for x in ['Rapid', 'Barcode ' + str(i) + ' (forward)']):
+            if 'RBK004_upstream' in matching_set_names:
+                matching_sets.append(make_new_full_rapid_barcode_adapter(i))
+            elif 'SQK-NSK007' in matching_set_names:
+                matching_sets.append(make_old_full_rapid_barcode_adapter(i))
 
     return matching_sets
 
@@ -424,11 +442,14 @@ def find_adapters_at_read_ends(reads, matching_sets, verbosity, end_size, extra_
     if verbosity > 0:
         print(bold_underline('Trimming adapters from read ends'),
               file=print_dest)
-        name_len = max(max(len(x.start_sequence[0]) for x in matching_sets),
-                       max(len(x.end_sequence[0]) if x.end_sequence else 0 for x in matching_sets))
+        name_len = max(max(len(x.start_sequence[0])
+                           if x.start_sequence else 0 for x in matching_sets),
+                       max(len(x.end_sequence[0])
+                           if x.end_sequence else 0 for x in matching_sets))
         for matching_set in matching_sets:
-            print('  ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
-                  red(matching_set.start_sequence[1]), file=print_dest)
+            if matching_set.start_sequence:
+                print('  ' + matching_set.start_sequence[0].rjust(name_len) + ': ' +
+                      red(matching_set.start_sequence[1]), file=print_dest)
             if matching_set.end_sequence:
                 print('  ' + matching_set.end_sequence[0].rjust(name_len) + ': ' +
                       red(matching_set.end_sequence[1]), file=print_dest)
@@ -519,15 +540,18 @@ def find_adapters_in_read_middles(reads, matching_sets, verbosity, middle_thresh
 
     adapters = []
     for matching_set in matching_sets:
-        adapters.append(matching_set.start_sequence)
-        if matching_set.end_sequence and \
-                matching_set.end_sequence[1] != matching_set.start_sequence[1]:
-            adapters.append(matching_set.end_sequence)
+        if matching_set.start_sequence:
+            adapters.append(matching_set.start_sequence)
+        if matching_set.end_sequence:
+            if (not matching_set.start_sequence) or \
+                    matching_set.end_sequence[1] != matching_set.start_sequence[1]:
+                adapters.append(matching_set.end_sequence)
 
     start_sequence_names = set()
     end_sequence_names = set()
     for matching_set in matching_sets:
-        start_sequence_names.add(matching_set.start_sequence[0])
+        if matching_set.start_sequence:
+            start_sequence_names.add(matching_set.start_sequence[0])
         if matching_set.end_sequence:
             end_sequence_names.add(matching_set.end_sequence[0])
 


=====================================
porechop/version.py
=====================================
@@ -2,4 +2,4 @@
 The version is stored here in a separate file so it can exist in only one place.
 http://stackoverflow.com/questions/458550
 """
-__version__ = '0.2.3'
+__version__ = '0.2.4'



View it on GitLab: https://salsa.debian.org/med-team/porechop/commit/9904cd8bf255135b069354758622ae612466aa0b

-- 
View it on GitLab: https://salsa.debian.org/med-team/porechop/commit/9904cd8bf255135b069354758622ae612466aa0b
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20181023/01b32d03/attachment-0001.html>


More information about the debian-med-commit mailing list