[med-svn] [Git][med-team/seqkit][upstream] New upstream version 0.14.0+ds

Nilesh Patra gitlab at salsa.debian.org
Fri Oct 30 20:42:29 GMT 2020



Nilesh Patra pushed to branch upstream at Debian Med / seqkit


Commits:
47b180fb by Nilesh Patra at 2020-10-31T00:56:06+05:30
New upstream version 0.14.0+ds
- - - - -


18 changed files:

- CHANGES.md
- README.md
- doc/docs/download.md
- doc/docs/faq.md
- doc/docs/usage.md
- seqkit/cmd/grep.go
- seqkit/cmd/locate.go
- seqkit/cmd/mutate.go
- + seqkit/cmd/pair.go
- seqkit/cmd/rename.go
- seqkit/cmd/replace.go
- seqkit/cmd/seq.go
- seqkit/cmd/sliding.go
- seqkit/cmd/split.go
- seqkit/cmd/split2.go
- seqkit/cmd/stat.go
- seqkit/cmd/translate.go
- seqkit/cmd/version.go


Changes:

=====================================
CHANGES.md
=====================================
@@ -1,3 +1,14 @@
+- [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+    - new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.
+    - `seqkit translate`: new flag `-F/--append-fram` for optional adding frame info to ID. [#159](https://github.com/shenwei356/seqkit/issues/159)
+    - `seqkit stats`: reduce memory usage when using `-a` for calculating N50. [#153](https://github.com/shenwei356/seqkit/issues/153)
+    - `seqkit mutate`: fix inserting sequence `-i/--insertion`, 
+       this bug occurs when `insert site` is big in some cases, don't worry if no error reported.
+    - `seqkit replace`:
+        - new flag `-U/--keep-untouched`: do not change anything when no value found for the key (only for sequence name).
+        - do no support editing FASTQ sequence.
+    - `seqkit grep/locate`: new flag `--circular` for supporting circular genome. [#158](https://github.com/shenwei356/seqkit/issues/158)
 - [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
     - `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)


=====================================
README.md
=====================================
@@ -56,6 +56,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 - [Acknowledgements](#acknowledgements)
 - [Contact](#contact)
 - [License](#license)
+- [Starchart](#startchart)
 
 <!-- END doctoc generated TOC please keep comment here to allow auto update -->
 
@@ -68,7 +69,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
   (see [download](http://bioinf.shenwei.me/seqkit/download/))
 - **UltraFast** (see [benchmark](#benchmark)),
   **multiple-CPUs supported**
-- **Practical functions supported by 28 subcommands** (see subcommands and
+- **Practical functions supported by 34 subcommands** (see subcommands and
   [usage](http://bioinf.shenwei.me/seqkit/usage/) )
 - **Supporting [Bash-completion](#bash-completion)**
 - **Well documented** (detailed [usage](http://bioinf.shenwei.me/seqkit/usage/)
@@ -117,7 +118,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 
 ## Subcommands
 
-33 functional subcommands in total.
+34 functional subcommands in total.
 
 **Sequence and subsequence**
 
@@ -159,6 +160,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 - [`common`](https://bioinf.shenwei.me/seqkit/usage/#common)        find common sequences of multiple files by id/name/sequence
 - [`split`](https://bioinf.shenwei.me/seqkit/usage/#split)          split sequences into files by id/seq region/size/parts (mainly for FASTA)
 - [`split2`](https://bioinf.shenwei.me/seqkit/usage/#split2)        split sequences into files by size/parts (FASTA, PE/SE FASTQ)
+- [`pair`](https://bioinf.shenwei.me/seqkit/usage/#pair)            match up paired-end reads from two fastq files
 
 **Edit**
 
@@ -418,3 +420,7 @@ propose new functions or ask for help.
 ## License
 
 [MIT License](https://github.com/shenwei356/seqkit/blob/master/LICENSE)
+
+## Starchart
+
+<img src="https://starchart.cc/shenwei356/seqkit.svg" alt="Stargazers over time" style="max-width: 100%">


=====================================
doc/docs/download.md
=====================================
@@ -6,9 +6,17 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
 
 ## Latest Version
 
-- [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
-[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
-    - `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)
+- [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+    - new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.
+    - `seqkit translate`: new flag `-F/--append-fram` for optional adding frame info to ID. [#159](https://github.com/shenwei356/seqkit/issues/159)
+    - `seqkit stats`: reduce memory usage when using `-a` for calculating N50. [#153](https://github.com/shenwei356/seqkit/issues/153)
+    - `seqkit mutate`: fix inserting sequence `-i/--insertion`, 
+       this bug occurs when `insert site` is big in some cases, don't worry if no error reported.
+    - `seqkit replace`:
+        - new flag `-U/--keep-untouched`: do not change anything when no value found for the key (only for sequence name).
+        - do no support editing FASTQ sequence.
+    - `seqkit grep/locate`: new flag `--circular` for supporting circular genome. [#158](https://github.com/shenwei356/seqkit/issues/158)
 
 ### Please cite
 
@@ -25,12 +33,12 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
 
 OS     |Arch      |File, 中国镜像                                                                                                                                                                                  |Download Count
 :------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux  |32-bit    |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz)                            |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_386.tar.gz)
-Linux  |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_amd64.tar.gz)
-OS X   |32-bit    |[seqkit_darwin_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_386.tar.gz)                         |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_386.tar.gz)
-OS X   |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_amd64.tar.gz)
-Windows|32-bit    |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz)          |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_386.exe.tar.gz)
-Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_amd64.exe.tar.gz)
+Linux  |32-bit    |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz)                            |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_386.tar.gz)
+Linux  |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_amd64.tar.gz)
+OS X   |32-bit    |[seqkit_darwin_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_386.tar.gz)                         |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_386.tar.gz)
+OS X   |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_amd64.tar.gz)
+Windows|32-bit    |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz)          |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_386.exe.tar.gz)
+Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_amd64.exe.tar.gz)
 
 
 ## Installation
@@ -100,6 +108,9 @@ Howto:
 
 ## Release History
 
+- [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
+    - `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)
 - [SeqKit v0.13.1](https://github.com/shenwei356/seqkit/releases/tag/v0.13.1)
 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.1/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.1)
     - `seqkit sana`: fix bug causing hanging on empty files. [#148](https://github.com/shenwei356/seqkit/pull/148)


=====================================
doc/docs/faq.md
=====================================
@@ -264,72 +264,7 @@ is the second captured match. Command:
 
 ## How to extract paired reads from two paired-end reads file?
 
-Let's create two unbanlanced PE reads file:
-
-    $ seqkit rmdup duplicated-reads.fq.gz \
-        | seqkit replace --pattern " .+" --replacement " 1" \
-        | seqkit sample --proportion 0.9 --rand-seed 1 --out-file read_1.fq.gz
-    $ seqkit rmdup duplicated-reads.fq.gz \
-        | seqkit replace --pattern " .+" --replacement " 2" \
-        | seqkit sample --proportion 0.9 --rand-seed 2 --out-file read_2.fq.gz
-
-Overview:
-
-    # number of records
-    $ seqkit stat read_1.fq.gz read_2.fq.gz
-    file          format  type  num_seqs  sum_len  min_len  avg_len  max_len
-    read_1.fq.gz  FASTQ   DNA      9,033  912,333      101      101      101
-    read_2.fq.gz  FASTQ   DNA      8,965  905,465      101      101      101
-
-    # sequence headers
-    $ seqkit head -n 3 read_1.fq.gz | seqkit seq --name
-    SRR1972739.1 1
-    SRR1972739.3 1
-    SRR1972739.4 1
-
-    $ seqkit head -n 3 read_2.fq.gz | seqkit seq --name
-    SRR1972739.1 2
-    SRR1972739.2 2
-    SRR1972739.3 2
-
-Firstly, extract sequence IDs of two file and compute the intersection:
-
-    $ seqkit seq --name --only-id read_1.fq.gz read_2.fq.gz \
-        | sort | uniq -d > id.txt
-
-    $ # number of IDs
-    wc -l id.txt
-    8090 id.txt
-
-Then extract reads using `id.txt`:
-
-    $ seqkit grep --pattern-file id.txt read_1.fq.gz -o read_1.f.fq.gz
-    $ seqkit grep --pattern-file id.txt read_2.fq.gz -o read_2.f.fq.gz
-
-Check if the IDs in two files are the same by `md5sum`:
-
-    $ seqkit seq --name --only-id read_1.f.fq.gz > read_1.f.fq.gz.id.txt
-    $ seqkit seq --name --only-id read_2.f.fq.gz > read_2.f.fq.gz.id.txt
-
-    $ md5sum read_*.f.fq.gz.id.txt
-    537c57cfdc3923bb94a3dc31a0c3b02a  read_1.f.fq.gz.id.txt
-    537c57cfdc3923bb94a3dc31a0c3b02a  read_2.f.fq.gz.id.txt
-
-Note that this example assumes that the IDs in the two reads file have
-same order. If not you can sort them after previous steps. Shell `sort`
-can sort large file using disk, so temporary directory is set as
-current directory by option `-T .`.
-
-    $ gzip -d -c read_1.f.fq.gz \
-        | seqkit fx2tab \
-        | sort -k1,1 -T . \
-        | seqkit tab2fx \
-        | gzip -c > read_1.f.sorted.fq.gz
-    $ gzip -d -c read_2.f.fq.gz \
-        | seqkit fx2tab \
-        | sort -k1,1 -T . \
-        | seqkit tab2fx \
-        | gzip -c > read_2.f.sorted.fq.gz
+Use [seqkit pair](https://bioinf.shenwei.me/seqkit/usage/#pair).
 
 ## How to concatenate two FASTA sequences in to one?
 


=====================================
doc/docs/usage.md
=====================================
@@ -46,6 +46,7 @@
 - [common](#common)
 - [split](#split)
 - [split2](#split2)
+- [pair](#pair)
 
 **Edit**
 
@@ -173,7 +174,7 @@ reproduced in different environments with same random seed.
 ``` text
 SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
 
-Version: 0.13.2
+Version: 0.14.0
 
 Author: Wei Shen <shenwei356 at gmail.com>
 
@@ -201,6 +202,7 @@ Available Commands:
   help            Help about any command
   locate          locate subsequences/motifs, mismatch allowed
   mutate          edit sequence (point mutation, insertion, deletion)
+  pair            match up paired-end reads from two fastq files
   range           print FASTA/Q records in a range (start:end)
   rename          rename duplicated IDs
   replace         replace name/sequence by regular expression
@@ -561,7 +563,8 @@ Usage:
   seqkit sliding [flags]
 
 Flags:
-  -C, --circular-genome   circular genome.
+  -c, --circular          circular genome (same to -C/--circular-genome)
+  -C, --circular-genome   circular genome (same to -c/--circular)
   -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
   -s, --step int          step size
   -W, --window int        window size
@@ -1224,6 +1227,7 @@ Usage:
 
 Flags:
   -x, --allow-unknown-codon                     translate unknown code to 'X'. And you may not use flag --trim which removes 'X'
+  -F, --append-frame                            append frame infomation to sequence ID
       --clean                                   change all STOP codon positions from the '*' character to 'X' (an unknown residue)
   -f, --frame strings                           frame(s) to translate, available value: 1, 2, 3, -1, -2, -3, and 6 for all six frames (default [1])
   -h, --help                                    help for translate
@@ -1363,6 +1367,8 @@ Usage
 search sequences by ID/name/sequence/sequence motifs, mismatch allowed
 
 Attentions:
+  0. By default, we match sequence ID with patterns, use "-n/--by-name"
+     for matching full name instead of just ID.
   1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
      (ID/full header) by default. Please switch "-r/--use-regexp" on
      for partly matching.
@@ -1398,8 +1404,9 @@ Usage:
   seqkit grep [flags]
 
 Flags:
-  -n, --by-name               match by full name instead of just id
+  -n, --by-name               match by full name instead of just ID
   -s, --by-seq                search subseq on seq, only positive strand is searched, and mismatch allowed using flag -m/--max-mismatch
+  -c  --circular              circular genome
   -d, --degenerate            pattern/motif contains degenerate base
       --delete-matched        delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
   -h, --help                  help for grep
@@ -1409,12 +1416,30 @@ Flags:
   -p, --pattern strings       search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
   -f, --pattern-file string   pattern file (one record per line)
   -R, --region string         specify sequence region for searching. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases
-  -r, --use-regexp            patterns are regular expression
+  -r, --use-regexp            patterns are regular expressio
 
 ```
 
 Examples
 
+
+1. Searching with list of sequence IDs (do not containing whitespace)
+
+        $ seqkit grep -f id.txt seqs.fq.gz -o result.fq.gz
+        
+        # ignore case
+        $ seqkit grep -i -f id.txt seqs.fq.gz -o result.fq.gz
+
+1. Serching non-canonical sequence IDs, Using `--id-regexp` to capture IDs. 
+   Refer to [section Sequence ID](#sequence-id) and [seqkit seq](#seq) for examples.
+
+1. Searching with list of sequence names (they may contain whitespace).
+
+        $ seqkit grep -n -f name.txt seqs.fa.gz -o result.fa.gz
+        
+1. Useq `-r/--use-regexp` for partly matching, but **this may produce "false positive" matches**.
+   For example, `seq_1` matches `seq_10` with `-nri`.
+
 1. Extract human hairpins (i.e. sequences with name starting with `hsa`)
 
         $ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa
@@ -1425,7 +1450,7 @@ Examples
         AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
         CCUAGCUUUCCU
 
-1. Remove human and mice hairpins.
+1. Remove human and mice hairpins (invert match with `-v`)
 
         $ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa -p ^mmu -v
 
@@ -1448,7 +1473,18 @@ Examples
 
         $ cat hairpin.fa.gz | seqkit grep -s -i -p aggcg
 
+1. Circular genome
 
+        $ echo -e ">seq\nACGTTGCA" 
+        >seq
+        ACGTTGCA
+        
+        $ echo -e ">seq\nACGTTGCA"  | seqkit grep -s -i -p AA
+        
+        $ echo -e ">seq\nACGTTGCA"  | seqkit grep -s -i -p AA -c
+        >seq
+        ACGTTGCA
+        
 1. Extract sequences containing AGGCG (allow mismatch)
 
         $ time cat hairpin.fa.gz | seqkit grep -s -i -p aggcg | seqkit stats
@@ -1502,11 +1538,15 @@ Mismatch is allowed using flag "-m/--max-mismatch",
 but it's not fast enough for large genome like human genome.
 Though, it's fast enough for microbial genomes.
 
+When using flag --circular, end position of matched subsequence that 
+crossing genome sequence end would be greater than sequence length.
+
 Usage:
   seqkit locate [flags]
 
 Flags:
       --bed                       output in BED6 format
+  -c  --circular                  circular genome
   -d, --degenerate                pattern/motif contains degenerate base
       --gtf                       output in GTF format
   -h, --help                      help for locate
@@ -1613,7 +1653,7 @@ Examples
         cel-mir-58a     SeqKit  location        81      88      0       +       .       gene_id "AUGGACUN";
         ath-MIR163      SeqKit  location        122     129     0       -       .       gene_id "AUGGACUN";
 
-1. greedy mode (default)
+1. Greedy mode (default)
 
          $ echo -e '>seq\nACGACGACGA' | seqkit locate -p ACGA | csvtk -t pretty
          seqID   patternName   pattern   strand   start   end   matched
@@ -1621,13 +1661,33 @@ Examples
          seq     ACGA          ACGA      +        4       7     ACGA
          seq     ACGA          ACGA      +        7       10    ACGA
 
-1. non-greedy mode (`-G`)
+1. Non-greedy mode (`-G`)
 
         $ echo -e '>seq\nACGACGACGA' | seqkit locate -p ACGA -G | csvtk -t pretty
         seqID   patternName   pattern   strand   start   end   matched
         seq     ACGA          ACGA      +        1       4     ACGA
         seq     ACGA          ACGA      +        7       10    ACGA
 
+
+1. Circular genome. Note that end position of matched subsequence that 
+crossing genome sequence end would be greater than sequence length.
+
+        $ echo -e ">seq\nACGTTGCA"
+        >seq
+        ACGTTGCA
+    
+        $ echo -e ">seq\nACGTTGCA" \
+            | seqkit locate -i -p aa
+        seqID   patternName     pattern strand  start   end     matched
+        seq     aa      aa      -       4       5       aa
+        
+        $ echo -e ">seq\nACGTTGCA" \
+            | seqkit locate -i -p aa -c \
+            | csvtk pretty -t
+        seqID   patternName   pattern   strand   start   end   matched
+        seq     aa            aa        +        8       9     aa
+        seq     aa            aa        -        4       5     aa
+
         
 ## fish
 
@@ -2127,8 +2187,8 @@ Flags:
   -f, --force              overwrite output directory
   -h, --help               help for split2
   -O, --out-dir string     output directory (default value is $infile.split)
-  -1, --read1 string       read1 file
-  -2, --read2 string       read2 file
+  -1, --read1 string       (gzipped) read1 file
+  -2, --read2 string       (gzipped) read2 file
 ```
 
 Examples
@@ -2177,6 +2237,65 @@ Examples
         [INFO] write 1250 sequences to file: out/reads_1.part_001.fq.gz
         [INFO] write 1250 sequences to file: out/reads_1.part_002.fq.gz
 
+## pair
+
+Usage
+
+```text
+match up paired-end reads from two fastq files
+
+Attensions:
+1. Orders of headers in the two files better be the same (not shuffled),
+   otherwise it consumes huge number of memory for buffering reads in memory.
+2. Unpaired reads are optional outputted with flag -u/--save-unpaired.
+3. If flag -O/--out-dir not given, output will be saved in the same directory
+   of input, with suffix "paired", e.g., read_1.paired.fq.gz.
+   Otherwise names are kept untouched in the given output directory.
+4. Paired gzipped files may be slightly larger than original files, because
+   of using different gzip package/library, don't worry.
+
+Usage:
+  seqkit pair [flags]
+
+Flags:
+  -f, --force            overwrite output directory
+  -h, --help             help for pair
+  -O, --out-dir string   output directory
+  -1, --read1 string     (gzipped) read1 file
+  -2, --read2 string     (gzipped) read2 file
+  -u, --save-unpaired    save unpaired reads if there are
+```
+
+Examples
+
+1. Simple one
+
+        $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz
+        
+        # output
+        reads_1.paired.fq.gz
+        reads_2.paired.fq.gz
+        
+2. Set output directory, file names are kept untouched.
+
+        $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz -O result
+        
+        $ tree result        
+        result/
+        ├── reads_1.fq.gz
+        └── reads_2.fq.gz
+        
+3. Save unpaired reads if there are.
+
+        $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz -O result -u
+        
+        $ tree result
+        result
+        ├── reads_1.fq.gz
+        ├── reads_1.unpaired.fq.gz
+        ├── reads_2.fq.gz
+        └── reads_2.unpaired.fq.gz
+
 
 ## sample
 


=====================================
seqkit/cmd/grep.go
=====================================
@@ -46,6 +46,8 @@ var grepCmd = &cobra.Command{
 	Long: fmt.Sprintf(`search sequences by ID/name/sequence/sequence motifs, mismatch allowed
 
 Attentions:
+  0. By default, we match sequence ID with patterns, use "-n/--by-name"
+     for matching full name instead of just ID.
   1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
      (ID/full header) by default. Please switch "-r/--use-regexp" on
      for partly matching.
@@ -91,6 +93,7 @@ Examples:
 		ignoreCase := getFlagBool(cmd, "ignore-case")
 		degenerate := getFlagBool(cmd, "degenerate")
 		region := getFlagString(cmd, "region")
+		circular := getFlagBool(cmd, "circular")
 
 		if len(pattern) == 0 && patternFile == "" {
 			checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
@@ -284,6 +287,11 @@ Examples:
 				} else if bySeq {
 					if limitRegion {
 						target = record.Seq.SubSeq(start, end).Seq
+					} else if circular {
+						// concat two copies of sequence, and do not change orginal sequence
+						target = make([]byte, len(record.Seq.Seq)*2)
+						copy(target[0:len(record.Seq.Seq)], record.Seq.Seq)
+						copy(target[len(record.Seq.Seq):], record.Seq.Seq)
 					} else {
 						target = record.Seq.Seq
 					}
@@ -375,12 +383,12 @@ func init() {
 	grepCmd.Flags().BoolP("use-regexp", "r", false, "patterns are regular expression")
 	grepCmd.Flags().BoolP("delete-matched", "", false, "delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions")
 	grepCmd.Flags().BoolP("invert-match", "v", false, "invert the sense of matching, to select non-matching records")
-	grepCmd.Flags().BoolP("by-name", "n", false, "match by full name instead of just id")
+	grepCmd.Flags().BoolP("by-name", "n", false, "match by full name instead of just ID")
 	grepCmd.Flags().BoolP("by-seq", "s", false, "search subseq on seq, only positive strand is searched, and mismatch allowed using flag -m/--max-mismatch")
 	grepCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
 	grepCmd.Flags().BoolP("ignore-case", "i", false, "ignore case")
 	grepCmd.Flags().BoolP("degenerate", "d", false, "pattern/motif contains degenerate base")
 	grepCmd.Flags().StringP("region", "R", "", "specify sequence region for searching. "+
 		"e.g 1:12 for first 12 bases, -12:-1 for last 12 bases")
-
+	grepCmd.Flags().BoolP("circular", "c", false, "circular genome")
 }


=====================================
seqkit/cmd/locate.go
=====================================
@@ -53,6 +53,9 @@ Mismatch is allowed using flag "-m/--max-mismatch",
 but it's not fast enough for large genome like human genome.
 Though, it's fast enough for microbial genomes.
 
+When using flag --circular, end position of matched subsequence that 
+crossing genome sequence end would be greater than sequence length.
+
 `,
 	Run: func(cmd *cobra.Command, args []string) {
 		config := getConfigs(cmd)
@@ -84,6 +87,7 @@ Though, it's fast enough for microbial genomes.
 		outFmtBED := getFlagBool(cmd, "bed")
 		mismatches := getFlagNonNegativeInt(cmd, "max-mismatch")
 		hideMatched := getFlagBool(cmd, "hide-matched")
+		circular := getFlagBool(cmd, "circular")
 
 		if len(pattern) == 0 && patternFile == "" {
 			checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
@@ -256,6 +260,11 @@ Though, it's fast enough for microbial genomes.
 				}
 
 				l = len(record.Seq.Seq)
+
+				if circular { // concat two copies of sequence
+					record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
+				}
+
 				if !onlyPositiveStrand {
 					seqRP = record.Seq.RevCom()
 				}
@@ -272,7 +281,12 @@ Though, it's fast enough for microbial genomes.
 							checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
 						}
 						for _, i = range loc {
+							if circular && i+1 > l { // 2nd clone of original part
+								continue
+							}
+
 							begin = i + 1
+
 							end = i + len(pSeq)
 							if i+len(pSeq) > len(record.Seq.Seq) {
 								continue
@@ -333,6 +347,10 @@ Though, it's fast enough for microbial genomes.
 							checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
 						}
 						for _, i = range loc {
+							if circular && i+1 > l { // 2nd clone of original part
+								continue
+							}
+
 							begin = l - i - len(pSeq) + 1
 							end = l - i
 							if i+len(pSeq) > len(record.Seq.Seq) {
@@ -407,6 +425,11 @@ Though, it's fast enough for microbial genomes.
 							loc = []int{i, i + lpatten}
 						}
 						begin = offset + loc[0] + 1
+
+						if circular && begin > l { // 2nd clone of original part
+							break
+						}
+
 						end = offset + loc[1]
 
 						flag = true // check "duplicated" region
@@ -495,8 +518,16 @@ Though, it's fast enough for microbial genomes.
 							loc = []int{i, i + lpatten}
 						}
 
+						if circular && offset+loc[0]+1 > l { // 2nd clone of original part
+							break
+						}
+
 						begin = l - offset - loc[1] + 1
 						end = l - offset - loc[0]
+						if offset+loc[1] > l {
+							begin += l
+							end += l
+						}
 
 						flag = true
 						if useRegexp || degenerate {
@@ -582,4 +613,5 @@ func init() {
 	locateCmd.Flags().BoolP("bed", "", false, "output in BED6 format")
 	locateCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
 	locateCmd.Flags().BoolP("hide-matched", "M", false, "do not show matched sequences")
+	locateCmd.Flags().BoolP("circular", "c", false, `circular genome. type "seqkit locate -h" for details`)
 }


=====================================
seqkit/cmd/mutate.go
=====================================
@@ -294,16 +294,16 @@ Examples:
 					if mIns.pos == 0 {
 						s = 0
 						record.Seq.Seq = append(record.Seq.Seq, mIns.seq...)
-						copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:s+seqLen])
+						copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:seqLen])
 						copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq)
 					} else {
 						s, _, ok = seq.SubLocation(seqLen, mIns.pos, mIns.pos)
 						if !ok {
 							log.Warningf("[%s]: insertion mutation: position (%d) out of sequence length (%d)", record.ID, mIns.pos, seqLen)
 						} else {
-							record.Seq.Seq = append(record.Seq.Seq, mIns.seq...)
-							copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:s+seqLen])
-							copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq)
+							record.Seq.Seq = append(record.Seq.Seq, mIns.seq...)             // append insert sequence to the end
+							copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:seqLen]) // move sequence behind the insert location at the end, so leave space for IS.
+							copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq)                // copy IS into desired region.
 						}
 					}
 


=====================================
seqkit/cmd/pair.go
=====================================
@@ -0,0 +1,372 @@
+// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in
+// all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+// THE SOFTWARE.
+
+package cmd
+
+import (
+	"bytes"
+	"fmt"
+	"io"
+	"os"
+	"path/filepath"
+	"runtime"
+
+	"github.com/cespare/xxhash"
+	"github.com/pkg/errors"
+
+	"github.com/shenwei356/bio/seq"
+	"github.com/shenwei356/bio/seqio/fastx"
+	"github.com/shenwei356/util/pathutil"
+	"github.com/shenwei356/xopen"
+	"github.com/spf13/cobra"
+)
+
+// pairCmd represents the pair command
+var pairCmd = &cobra.Command{
+	Use:   "pair",
+	Short: "match up paired-end reads from two fastq files",
+	Long: `match up paired-end reads from two fastq files
+
+Attensions:
+1. Orders of headers in the two files better be the same (not shuffled),
+   otherwise it consumes huge number of memory for buffering reads in memory.
+2. Unpaired reads are optional outputted with flag -u/--save-unpaired.
+3. If flag -O/--out-dir not given, output will be saved in the same directory
+   of input, with suffix "paired", e.g., read_1.paired.fq.gz.
+   Otherwise names are kept untouched in the given output directory.
+4. Paired gzipped files may be slightly larger than original files, because
+   of using different gzip package/library, don't worry.
+
+`,
+	Run: func(cmd *cobra.Command, args []string) {
+		config := getConfigs(cmd)
+		alphabet := config.Alphabet
+		idRegexp := config.IDRegexp
+		lineWidth := 0
+		seq.AlphabetGuessSeqLengthThreshold = config.AlphabetGuessSeqLength
+		seq.ValidateSeq = false
+		runtime.GOMAXPROCS(config.Threads)
+
+		if len(args) > 0 {
+			checkError(errors.New("no positional arugments are allowed"))
+		}
+
+		read1 := getFlagString(cmd, "read1")
+		read2 := getFlagString(cmd, "read2")
+		if read1 == "" || read2 == "" {
+			checkError(fmt.Errorf("flag -1/--read1 and -2/--read2 needed"))
+		}
+		if read1 == read2 {
+			checkError(fmt.Errorf("values of flag -1/--read1 and -2/--read2 can not be the same"))
+		}
+
+		outdir := getFlagString(cmd, "out-dir")
+		force := getFlagBool(cmd, "force")
+		saveUnpaired := getFlagBool(cmd, "save-unpaired")
+
+		if outdir == "" {
+			outdir = filepath.Dir(read1)
+		}
+
+		var addSuffix bool
+		if outdir != "./" && outdir != "." {
+			existed, err := pathutil.DirExists(outdir)
+			checkError(err)
+			if existed {
+				empty, err := pathutil.IsEmpty(outdir)
+				checkError(err)
+				if !empty {
+					if force {
+						checkError(os.RemoveAll(outdir))
+						checkError(os.MkdirAll(outdir, 0755))
+					} else {
+						log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
+					}
+				}
+			} else {
+				checkError(os.MkdirAll(outdir, 0755))
+			}
+		} else if filepath.Clean(filepath.Dir(read1)) == filepath.Clean(outdir) {
+			addSuffix = true
+		}
+
+		var reader1, reader2 *fastx.Reader
+		var record1, record2 *fastx.Record
+
+		// readers
+		var err error
+		reader1, err = fastx.NewReader(alphabet, read1, idRegexp)
+		checkError(errors.Wrap(err, read1))
+		reader2, err = fastx.NewReader(alphabet, read2, idRegexp)
+		checkError(errors.Wrap(err, read2))
+
+		// out file 1
+		var outFile1, base1, suffix1 string
+		if addSuffix {
+			base1, suffix1 = filepathTrimExtension(filepath.Base(read1))
+			outFile1 = filepath.Join(outdir, base1+".paired"+suffix1)
+		} else {
+			outFile1 = filepath.Join(outdir, filepath.Base(read1))
+		}
+		outfh1, err := xopen.Wopen(outFile1)
+		checkError(errors.Wrap(err, outFile1))
+		defer outfh1.Close()
+
+		// out file 2
+		var outFile2, base2, suffix2 string
+		if addSuffix {
+			base2, suffix2 = filepathTrimExtension(filepath.Base(read2))
+			outFile2 = filepath.Join(outdir, base2+".paired"+suffix2)
+		} else {
+			outFile2 = filepath.Join(outdir, filepath.Base(read2))
+		}
+		outfh2, err := xopen.Wopen(outFile2)
+		checkError(errors.Wrap(err, outFile2))
+		defer outfh2.Close()
+
+		// load first records
+		record1, err = reader1.Read()
+		checkError(errors.Wrap(err, read1))
+		record2, err = reader2.Read()
+		checkError(errors.Wrap(err, read2))
+
+		// require fastq
+		if !reader1.IsFastq || !reader2.IsFastq {
+			checkError(fmt.Errorf("fastq files needed"))
+		}
+
+		// buffer for saving unpaired reads
+		m1 := make(map[uint64]*fastx.Record, 1024)
+		m2 := make(map[uint64]*fastx.Record, 1024)
+
+		var h1, h2 uint64
+		var ok1, ok2 bool
+		var r1, r2 *fastx.Record
+		var eof1, eof2 bool
+		var n uint64
+
+		for {
+			// break when finishing reading both files.
+			if eof1 && eof2 {
+				break
+			}
+
+			// paired
+			if !eof1 && !eof2 && bytes.Compare(record1.ID, record2.ID) == 0 { // same ID
+				// output paired reads
+				record1.FormatToWriter(outfh1, lineWidth)
+				record2.FormatToWriter(outfh2, lineWidth)
+				n++
+
+				// new read1
+				if !eof1 {
+					record1, err = reader1.Read()
+					if err != nil {
+						if err == io.EOF {
+							eof1 = true
+							// break
+						} else {
+							checkError(errors.Wrap(err, read1))
+							break
+						}
+					}
+				}
+
+				// new read2
+				if !eof2 {
+					record2, err = reader2.Read()
+					if err != nil {
+						if err == io.EOF {
+							eof2 = true
+							// break
+						} else {
+							checkError(errors.Wrap(err, read2))
+							break
+						}
+					}
+				}
+
+				continue
+			}
+
+			if !eof1 {
+				h1 = xxhash.Sum64(record1.ID)
+				if r2, ok2 = m2[h1]; ok2 { // found pair of record1 in m2
+					// output paired reads
+					record1.FormatToWriter(outfh1, lineWidth)
+					r2.FormatToWriter(outfh2, lineWidth)
+					n++
+
+					delete(m2, h1)
+				} else {
+					m1[h1] = record1.Clone()
+				}
+
+				// new read1
+				record1, err = reader1.Read()
+				if err != nil {
+					if err == io.EOF {
+						eof1 = true
+						// break
+					} else {
+						checkError(errors.Wrap(err, read1))
+						break
+					}
+				}
+			}
+
+			// ---
+
+			if !eof2 {
+				h2 = xxhash.Sum64(record2.ID)
+				if r1, ok1 = m1[h2]; ok1 { // found pair of record2 in m1
+					// output paired reads
+					r1.FormatToWriter(outfh1, lineWidth)
+					record2.FormatToWriter(outfh2, lineWidth)
+					n++
+
+					delete(m1, h2)
+				} else {
+					m2[h2] = record2.Clone()
+				}
+
+				// new read2
+				record2, err = reader2.Read()
+				if err != nil {
+					if err == io.EOF {
+						eof2 = true
+						// break
+					} else {
+						checkError(errors.Wrap(err, read2))
+						break
+					}
+				}
+			}
+		}
+
+		var outFile1U, outFile2U string
+		var outfh1U, outfh2U *xopen.Writer
+		var n1U, n2U uint64
+		if saveUnpaired {
+			if !addSuffix {
+				base1, suffix1 = filepathTrimExtension(filepath.Base(read1))
+				base2, suffix2 = filepathTrimExtension(filepath.Base(read2))
+			}
+		}
+
+		// left reads
+		if len(m1) > 0 && len(m2) > 0 {
+			for h1, r1 = range m1 {
+
+				if string(r1.ID) == "A00582:209:HWWJCDSXX:1:1101:8314:2581" {
+					fmt.Println("shit----")
+				}
+
+				if r2, ok2 = m2[h1]; ok2 {
+					// output paired reads
+					r1.FormatToWriter(outfh1, lineWidth)
+					r2.FormatToWriter(outfh2, lineWidth)
+					n++
+
+					if saveUnpaired { // delete paired reads in m2
+						// delete(m1, h1) // no need
+						delete(m2, h2)
+					}
+				} else if saveUnpaired { // unpaired reads in m1
+					if outfh1U == nil {
+						outFile1U = filepath.Join(outdir, base1+".unpaired"+suffix1)
+						outfh1U, err = xopen.Wopen(outFile1U)
+						checkError(errors.Wrap(err, outFile1U))
+						defer outfh1U.Close()
+					}
+					r1.FormatToWriter(outfh1U, lineWidth)
+					n1U++
+				}
+			}
+			if saveUnpaired {
+				for _, r2 = range m2 { // left unpaired reads in m2
+					if outfh2U == nil {
+						outFile2U = filepath.Join(outdir, base2+".unpaired"+suffix2)
+						outfh2U, err = xopen.Wopen(outFile2U)
+						checkError(errors.Wrap(err, outFile2U))
+						defer outfh2U.Close()
+					}
+
+					r2.FormatToWriter(outfh2U, lineWidth)
+					n2U++
+				}
+			}
+		} else if len(m1) > 0 {
+			if saveUnpaired {
+				for _, r1 = range m1 { // all reads in m1 are unpaired
+					if outfh1U == nil {
+						outFile1U = filepath.Join(outdir, base1+".unpaired"+suffix1)
+						outfh1U, err = xopen.Wopen(outFile1U)
+						checkError(errors.Wrap(err, outFile1U))
+						defer outfh1U.Close()
+					}
+
+					r1.FormatToWriter(outfh1U, lineWidth)
+					n1U++
+				}
+			}
+		} else if saveUnpaired { // len(m2) > 0
+			for _, r2 = range m2 { // all reads in m2 are unpaired
+				if outfh2U == nil {
+					outFile2U = filepath.Join(outdir, base2+".unpaired"+suffix2)
+					outfh2U, err = xopen.Wopen(outFile2U)
+					checkError(errors.Wrap(err, outFile2U))
+					defer outfh2U.Close()
+				}
+
+				r2.FormatToWriter(outfh2U, lineWidth)
+				n2U++
+			}
+		}
+
+		if !config.Quiet {
+			log.Infof("%d paired-end reads saved to %s and %s", n, outFile1, outFile2)
+			if saveUnpaired {
+				if n1U > 0 {
+					log.Infof("%d unpaired reads saved to %s", n1U, outFile1U)
+				} else {
+					log.Infof("no unpaired reads in %s", read1)
+				}
+
+				if n2U > 0 {
+					log.Infof("%d unpaired reads saved to %s", n2U, outFile2U)
+				} else {
+					log.Infof("no unpaired reads in %s", read2)
+				}
+			}
+		}
+
+	},
+}
+
+func init() {
+	RootCmd.AddCommand(pairCmd)
+
+	pairCmd.Flags().StringP("read1", "1", "", "(gzipped) read1 file")
+	pairCmd.Flags().StringP("read2", "2", "", "(gzipped) read2 file")
+	pairCmd.Flags().StringP("out-dir", "O", "", "output directory")
+	pairCmd.Flags().BoolP("force", "f", false, "overwrite output directory")
+	pairCmd.Flags().BoolP("save-unpaired", "u", false, "save unpaired reads if there are")
+}


=====================================
seqkit/cmd/rename.go
=====================================
@@ -88,12 +88,13 @@ Attention:
 					if !empty {
 						if force {
 							checkError(os.RemoveAll(outdir))
+							checkError(os.MkdirAll(outdir, 0755))
 						} else {
 							log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
 						}
 					}
 				} else {
-					checkError(os.MkdirAll(outdir, 0777))
+					checkError(os.MkdirAll(outdir, 0755))
 				}
 			}
 		}


=====================================
seqkit/cmd/replace.go
=====================================
@@ -76,6 +76,7 @@ Special replacement symbols (only for replacing name not sequence):
 		nrWidth := getFlagPositiveInt(cmd, "nr-width")
 		kvFile := getFlagString(cmd, "kv-file")
 		keepKey := getFlagBool(cmd, "keep-key")
+		keepUntouch := getFlagBool(cmd, "keep-untouch")
 		keyCaptIdx := getFlagPositiveInt(cmd, "key-capt-idx")
 		keyMissRepl := getFlagString(cmd, "key-miss-repl")
 
@@ -145,6 +146,7 @@ Special replacement symbols (only for replacing name not sequence):
 		var found [][]byte
 		var k string
 		var ok bool
+		var doNotChange bool
 		var record *fastx.Record
 		var fastxReader *fastx.Reader
 		nrFormat := fmt.Sprintf("%%0%dd", nrWidth)
@@ -168,8 +170,13 @@ Special replacement symbols (only for replacing name not sequence):
 
 				nr++
 				if bySeq {
+					if fastxReader.IsFastq {
+						checkError(fmt.Errorf("editing FASTQ is not supported"))
+					}
 					record.Seq.Seq = patternRegexp.ReplaceAll(record.Seq.Seq, replacement)
 				} else {
+					doNotChange = false
+
 					r = replacement
 
 					if replaceWithNR {
@@ -181,6 +188,7 @@ Special replacement symbols (only for replacing name not sequence):
 						if len(founds) > 1 {
 							checkError(fmt.Errorf(`pattern "%s" matches multiple targets in "%s", this will cause chaos`, p, record.Name))
 						}
+
 						if len(founds) > 0 {
 							found = founds[0]
 							if keyCaptIdx > len(found)-1 {
@@ -192,15 +200,21 @@ Special replacement symbols (only for replacing name not sequence):
 							}
 							if _, ok = kvs[k]; ok {
 								r = reKV.ReplaceAll(r, []byte(kvs[k]))
+							} else if keepUntouch {
+								doNotChange = true
 							} else if keepKey {
 								r = reKV.ReplaceAll(r, found[keyCaptIdx])
 							} else {
 								r = reKV.ReplaceAll(r, []byte(keyMissRepl))
 							}
+						} else {
+							doNotChange = true
 						}
 					}
 
-					record.Name = patternRegexp.ReplaceAll(record.Name, r)
+					if !doNotChange {
+						record.Name = patternRegexp.ReplaceAll(record.Name, r)
+					}
 				}
 
 				record.FormatToWriter(outfh, config.LineWidth)
@@ -222,10 +236,11 @@ func init() {
 			`use ${1} instead of $1 when {kv} given!`)
 	replaceCmd.Flags().IntP("nr-width", "", 1, `minimum width for {nr} in flag -r/--replacement. e.g., formating "1" to "001" by --nr-width 3`)
 	// replaceCmd.Flags().BoolP("by-name", "n", false, "replace full name instead of just id")
-	replaceCmd.Flags().BoolP("by-seq", "s", false, "replace seq")
+	replaceCmd.Flags().BoolP("by-seq", "s", false, "replace seq (only FASTA)")
 	replaceCmd.Flags().BoolP("ignore-case", "i", false, "ignore case")
 	replaceCmd.Flags().StringP("kv-file", "k", "",
 		`tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)`)
+	replaceCmd.Flags().BoolP("keep-untouch", "U", false, "do not change anything when no value found for the key (only for sequence name)")
 	replaceCmd.Flags().BoolP("keep-key", "K", false, "keep the key as value when no value found for the key (only for sequence name)")
 	replaceCmd.Flags().IntP("key-capt-idx", "I", 1, "capture variable index of key (1-based)")
 	replaceCmd.Flags().StringP("key-miss-repl", "m", "", "replacement for key with no corresponding value")


=====================================
seqkit/cmd/seq.go
=====================================
@@ -139,7 +139,7 @@ var seqCmd = &cobra.Command{
 		if outFile == "-" {
 			outfh = os.Stdout
 		} else {
-			outfh, err = os.Open(outFile)
+			outfh, err = os.Create(outFile)
 			checkError(err)
 		}
 		defer outfh.Close()


=====================================
seqkit/cmd/sliding.go
=====================================
@@ -51,7 +51,7 @@ var slidingCmd = &cobra.Command{
 		files := getFileListFromArgsAndFile(cmd, args, true, "infile-list", true)
 
 		greedy := getFlagBool(cmd, "greedy")
-		circular := getFlagBool(cmd, "circular-genome")
+		circular := getFlagBool(cmd, "circular-genome") || getFlagBool(cmd, "circular")
 		step := getFlagInt(cmd, "step")
 		window := getFlagInt(cmd, "window")
 		if step == 0 || window == 0 {
@@ -147,5 +147,6 @@ func init() {
 	slidingCmd.Flags().IntP("step", "s", 0, "step size")
 	slidingCmd.Flags().IntP("window", "W", 0, "window size")
 	slidingCmd.Flags().BoolP("greedy", "g", false, "greedy mode, i.e., exporting last subsequences even shorter than windows size")
-	slidingCmd.Flags().BoolP("circular-genome", "C", false, "circular genome.")
+	slidingCmd.Flags().BoolP("circular-genome", "C", false, "circular genome (same to -c/--circular)")
+	slidingCmd.Flags().BoolP("circular", "c", false, "circular genome (same to -C/--circular-genome)")
 }


=====================================
seqkit/cmd/split.go
=====================================
@@ -122,12 +122,13 @@ Examples:
 					if !empty {
 						if force {
 							checkError(os.RemoveAll(outdir))
+							checkError(os.MkdirAll(outdir, 0755))
 						} else {
 							log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
 						}
 					}
 				} else {
-					checkError(os.MkdirAll(outdir, 0777))
+					checkError(os.MkdirAll(outdir, 0755))
 				}
 			}
 		}


=====================================
seqkit/cmd/split2.go
=====================================
@@ -169,12 +169,13 @@ according to the input files.
 					if !empty {
 						if force {
 							checkError(os.RemoveAll(outdir))
+							checkError(os.MkdirAll(outdir, 0755))
 						} else {
 							log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
 						}
 					}
 				} else {
-					checkError(os.MkdirAll(outdir, 0777))
+					checkError(os.MkdirAll(outdir, 0755))
 				}
 			}
 
@@ -325,8 +326,8 @@ according to the input files.
 func init() {
 	RootCmd.AddCommand(split2Cmd)
 
-	split2Cmd.Flags().StringP("read1", "1", "", "read1 file")
-	split2Cmd.Flags().StringP("read2", "2", "", "read2 file")
+	split2Cmd.Flags().StringP("read1", "1", "", "(gzipped) read1 file")
+	split2Cmd.Flags().StringP("read2", "2", "", "(gzipped) read2 file")
 	split2Cmd.Flags().IntP("by-size", "s", 0, "split sequences into multi parts with N sequences")
 	split2Cmd.Flags().IntP("by-part", "p", 0, "split sequences into N parts")
 	split2Cmd.Flags().StringP("by-length", "l", "", "split sequences into chunks of N bases, supports K/M/G suffix")


=====================================
seqkit/cmd/stat.go
=====================================
@@ -33,6 +33,7 @@ import (
 	"github.com/dustin/go-humanize"
 	"github.com/shenwei356/bio/seq"
 	"github.com/shenwei356/bio/seqio/fastx"
+	"github.com/shenwei356/bio/util"
 	"github.com/shenwei356/util/byteutil"
 	"github.com/shenwei356/util/math"
 	"github.com/shenwei356/xopen"
@@ -141,7 +142,7 @@ Tips:
 								info.lenAvg,
 								info.lenMax))
 						} else {
-							outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+							outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
 								info.file,
 								info.format,
 								info.t,
@@ -177,7 +178,7 @@ Tips:
 										info1.lenAvg,
 										info1.lenMax))
 								} else {
-									outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+									outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
 										info1.file,
 										info1.format,
 										info1.t,
@@ -230,7 +231,7 @@ Tips:
 								info.lenAvg,
 								info.lenMax))
 						} else {
-							outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+							outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
 								info.file,
 								info.format,
 								info.t,
@@ -288,13 +289,13 @@ Tips:
 					<-token
 				}()
 
-				var num, l, lenMin, lenMax, lenSum, gapSum int64
-				var n50, sum, l50 int64
-				var q1, q2, q3 int64
+				var gapSum uint64
+
+				lensStats := util.NewLengthStats()
+
 				var q20, q30 int64
 				var q byte
 				var encodeOffset int = fqEncoding.Offset()
-				var lens sortutil.Int64Slice
 				var seqFormat, t string
 				var record *fastx.Record
 				var fastxReader *fastx.Reader
@@ -315,12 +316,7 @@ Tips:
 				}
 
 				seqFormat = ""
-				num, lenMin, lenMax, lenSum, gapSum = 0, int64(^uint64(0)>>1), 0, 0, 0
-				n50, sum, l50 = 0, 0, 0
-				q20, q30 = 0, 0
-				if all {
-					lens = make(sortutil.Int64Slice, 0, 1000)
-				}
+
 				for {
 					record, err = fastxReader.Read()
 					if err != nil {
@@ -342,7 +338,6 @@ Tips:
 						break
 					}
 
-					num++
 					if seqFormat == "" {
 						if len(record.Seq.Qual) > 0 {
 							seqFormat = "FASTQ"
@@ -350,17 +345,9 @@ Tips:
 							seqFormat = "FASTA"
 						}
 					}
-					l = int64(len(record.Seq.Seq))
-					if all {
-						lens = append(lens, l)
-					}
-					lenSum += l
-					if l < lenMin {
-						lenMin = l
-					}
-					if l > lenMax {
-						lenMax = l
-					}
+
+					lensStats.Add(uint64(len(record.Seq.Seq)))
+
 					if all {
 						if fastxReader.IsFastq {
 							for _, q = range record.Seq.Qual {
@@ -373,7 +360,7 @@ Tips:
 							}
 						}
 
-						gapSum += int64(byteutil.CountBytes(record.Seq.Seq, gapLettersBytes))
+						gapSum += uint64(byteutil.CountBytes(record.Seq.Seq, gapLettersBytes))
 					}
 				}
 
@@ -387,17 +374,13 @@ Tips:
 					t = fastxReader.Alphabet().String()
 				}
 
+				var n50 uint64
+				var l50 int
+				var q1, q2, q3 float64
 				if all {
-					sort.Sort(lens)
-					for i := num - 1; i >= 0; i-- {
-						sum += lens[i]
-						if (sum << 1) >= lenSum {
-							n50 = lens[i]
-							l50 = num - i
-							break
-						}
-					}
-					q1, q2, q3 = quartile(lens)
+					n50 = lensStats.N50()
+					l50 = lensStats.L50()
+					q1, q2, q3 = lensStats.Q1(), lensStats.Q2(), lensStats.Q3()
 				}
 
 				select {
@@ -405,7 +388,7 @@ Tips:
 					return
 				default:
 				}
-				if num == 0 {
+				if lensStats.Count() == 0 {
 					if basename {
 						file = filepath.Base(file)
 					}
@@ -414,8 +397,8 @@ Tips:
 					}
 					ch <- statInfo{file, seqFormat, t,
 						0, 0, 0, 0,
-						0, lenMax, 0, 0,
-						q1, q2, q3,
+						0, 0, 0, 0,
+						0, 0, 0,
 						0, 0,
 						nil, id}
 				} else {
@@ -426,10 +409,10 @@ Tips:
 						file = stdinLabel
 					}
 					ch <- statInfo{file, seqFormat, t,
-						num, lenSum, gapSum, lenMin,
-						math.Round(float64(lenSum)/float64(num), 1), lenMax, n50, l50,
+						lensStats.Count(), lensStats.Sum(), gapSum, lensStats.Min(),
+						math.Round(lensStats.Mean(), 1), lensStats.Max(), n50, l50,
 						q1, q2, q3,
-						math.Round(float64(q20)/float64(lenSum)*100, 2), math.Round(float64(q30)/float64(lenSum)*100, 2),
+						math.Round(float64(q20)/float64(lensStats.Sum())*100, 2), math.Round(float64(q30)/float64(lensStats.Sum())*100, 2),
 						nil, id}
 				}
 			}(file, id)
@@ -485,26 +468,26 @@ Tips:
 					info.file,
 					info.format,
 					info.t,
-					humanize.Comma(info.num),
-					humanize.Comma(info.lenSum),
-					humanize.Comma(info.lenMin),
+					humanize.Comma(int64(info.num)),
+					humanize.Comma(int64(info.lenSum)),
+					humanize.Comma(int64(info.lenMin)),
 					humanize.Commaf(info.lenAvg),
-					humanize.Comma(info.lenMax))
+					humanize.Comma(int64(info.lenMax)))
 			} else {
 				tbl.AddRow(
 					info.file,
 					info.format,
 					info.t,
-					humanize.Comma(info.num),
-					humanize.Comma(info.lenSum),
-					humanize.Comma(info.lenMin),
+					humanize.Comma(int64(info.num)),
+					humanize.Comma(int64(info.lenSum)),
+					humanize.Comma(int64(info.lenMin)),
 					humanize.Commaf(info.lenAvg),
-					humanize.Comma(info.lenMax),
-					humanize.Comma(info.Q1),
-					humanize.Comma(info.Q2),
-					humanize.Comma(info.Q3),
-					humanize.Comma(info.gapSum),
-					humanize.Comma(info.N50),
+					humanize.Comma(int64(info.lenMax)),
+					humanize.Commaf(info.Q1),
+					humanize.Commaf(info.Q2),
+					humanize.Commaf(info.Q3),
+					humanize.Comma(int64(info.gapSum)),
+					humanize.Comma(int64(info.N50)),
 					humanize.Commaf(info.q20),
 					humanize.Commaf(info.q30),
 					// humanize.Comma(info.L50),
@@ -519,19 +502,23 @@ type statInfo struct {
 	file   string
 	format string
 	t      string
-	num    int64
-	lenSum int64
-	gapSum int64
-	lenMin int64
+
+	num    uint64
+	lenSum uint64
+	gapSum uint64
+	lenMin uint64
+
 	lenAvg float64
-	lenMax int64
-	N50    int64
-	L50    int64
-	Q1     int64
-	Q2     int64
-	Q3     int64
-	q20    float64
-	q30    float64
+	lenMax uint64
+	N50    uint64
+	L50    int
+
+	Q1 float64
+	Q2 float64
+	Q3 float64
+
+	q20 float64
+	q30 float64
 
 	err error
 	id  uint64


=====================================
seqkit/cmd/translate.go
=====================================
@@ -123,6 +123,7 @@ Translate Tables/Genetic Codes:
 		markInitCodonAsM := getFlagBool(cmd, "init-codon-as-M")
 		listTable := getFlagInt(cmd, "list-transl-table")
 		listTableAmb := getFlagInt(cmd, "list-transl-table-with-amb-codons")
+		appendFrame := getFlagBool(cmd, "append-frame")
 
 		outfh, err := xopen.Wopen(outFile)
 		checkError(err)
@@ -192,7 +193,11 @@ Translate Tables/Genetic Codes:
 					}
 					checkError(err)
 
-					outfh.WriteString(fmt.Sprintf(">%s_frame=%d %s\n", record.ID, frame, record.Desc))
+					if appendFrame {
+						outfh.WriteString(fmt.Sprintf(">%s_frame=%d %s\n", record.ID, frame, record.Desc))
+					} else {
+						outfh.WriteString(">" + string(record.Name) + "\n")
+					}
 					outfh.Write(byteutil.WrapByteSlice(_seq.Seq, config.LineWidth))
 					outfh.WriteString("\n")
 				}
@@ -213,4 +218,5 @@ func init() {
 	translateCmd.Flags().BoolP("init-codon-as-M", "M", false, "translate initial codon at beginning to 'M'")
 	translateCmd.Flags().IntP("list-transl-table", "l", -1, "show details of translate table N, 0 for all")
 	translateCmd.Flags().IntP("list-transl-table-with-amb-codons", "L", -1, "show details of translate table N (including ambigugous codons), 0 for all. ")
+	translateCmd.Flags().BoolP("append-frame", "F", false, "append frame information to sequence ID")
 }


=====================================
seqkit/cmd/version.go
=====================================
@@ -29,7 +29,7 @@ import (
 )
 
 // VERSION of seqkit
-const VERSION = "0.13.2"
+const VERSION = "0.14.0"
 
 // versionCmd represents the version command
 var versionCmd = &cobra.Command{



View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/47b180fb012f6d67509daf94dec938c2866be13a

-- 
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/47b180fb012f6d67509daf94dec938c2866be13a
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20201030/cf04a71f/attachment-0001.html>


More information about the debian-med-commit mailing list