[med-svn] [Git][med-team/seqkit][upstream] New upstream version 0.16.0+ds

Nilesh Patra gitlab at salsa.debian.org
Sat May 1 19:36:24 BST 2021



Nilesh Patra pushed to branch upstream at Debian Med / seqkit


Commits:
e77f29a7 by Nilesh Patra at 2021-05-01T22:40:17+05:30
New upstream version 0.16.0+ds
- - - - -


17 changed files:

- CHANGES.md → CHANGELOG.md
- README.md
- doc/docs/download.md
- doc/docs/usage.md
- doc/mkdocs.yml
- go.mod
- go.sum
- seqkit/cmd/amplicon.go
- seqkit/cmd/bam_toolbox.go
- seqkit/cmd/bed.go
- seqkit/cmd/fx2tab.go
- seqkit/cmd/genautocomplete.go
- seqkit/cmd/grep.go
- + seqkit/cmd/head-genome.go
- seqkit/cmd/locate.go
- seqkit/cmd/tab2fx.go
- seqkit/cmd/version.go


Changes:

=====================================
CHANGES.md → CHANGELOG.md
=====================================
@@ -1,3 +1,27 @@
+
+
+- [SeqKit v0.16.0](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.16.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+    - new command `seqkit head-genome`:
+        - print sequences of the first genome with common prefixes in name
+    - `seqkit grep/locate/amplicon -m`
+        - *much faster (300-400x) searching with mismatch allowed* by optimizing FM-indexing and parallelization.
+        - new flag `-I/--immediate-output`.
+    - `seqkit grep/locate`:
+        - fix bug of `-m` when querying contains letters not in alphabet, usually for protein sequences. [#178](https://github.com/shenwei356/seqkit/issues/178), [#179](https://github.com/shenwei356/seqkit/issues/179)
+        - onply search on positive strand when searching unlimited or protein sequences.
+    - `seqkit locate`:
+        - removing debug info for `-r` introduced in a0f6b6e. [#180](https://github.com/shenwei356/seqkit/issues/180)
+    - `seqkit amplicon`:
+        - fix bug of `-m`, when mismatch is allowed.
+    - `seqkit fx2tab`:
+        - new flag `-C/--base-count` for counting bases. [#183](https://github.com/shenwei356/seqkit/issues/183)
+    -  `seqkit tab2fx`:
+        -  fix a rare bug. [#197](https://github.com/shenwei356/seqkit/issues/197)
+    - `seqkit subseq`:
+        - fix bug for BED with empty columns. [#195](https://github.com/shenwei356/seqkit/issues/195)
+    - `seqkit genautocomplete`: 
+        - **support bash|zsh|fish|powershell**.
 - [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
     - `seqkit grep/locate`: update help message.


=====================================
README.md
=====================================
@@ -19,6 +19,7 @@ and
 - [![Gitter](https://badges.gitter.im/seqkit/community.svg)](https://gitter.im/seqkit/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
 
 
+
 ## Introduction
 
 FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
@@ -47,7 +48,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 - [Features](#features)
 - [Subcommands](#subcommands)
 - [Installation](#installation)
-- [Bash-completion](#bash-completion)
+- [Command-line completion](#command-line-completion)
 - [Technical details and guides for use](#technical-details-and-guides-for-use)
 - [Usage && Examples](#usage--examples)
 - [Benchmark](#benchmark)
@@ -118,7 +119,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 
 ## Subcommands
 
-34 functional subcommands in total.
+35 functional subcommands in total.
 
 **Sequence and subsequence**
 
@@ -153,6 +154,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 **Set operations**
 
 - [`head`](https://bioinf.shenwei.me/seqkit/usage/#head)            print first N FASTA/Q records
+- [`head-genome`](https://bioinf.shenwei.me/seqkit/usage/#head-genome) print sequences of the first genome with common prefixes in name
 - [`range`](https://bioinf.shenwei.me/seqkit/usage/#range)          print FASTA/Q records in a range (start:end)
 - [`sample`](https://bioinf.shenwei.me/seqkit/usage/#sample)        sample sequences by number or proportion
 - [`rmdup`](https://bioinf.shenwei.me/seqkit/usage/#rmdup)          remove duplicated sequences by id/name/sequence
@@ -178,7 +180,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
 **Misc**
 
 - `version`   print version information and check for update
-- `genautocomplete` generate shell autocompletion script
+- `genautocomplete` generate shell autocompletion script (bash|zsh|fish|powershell)
 
 ## Installation
 
@@ -234,24 +236,33 @@ Run the following commands:
     docker run -it shenwei356/seqkit:latest
 
 
-## Bash-completion
+## Command-line completion
+
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+    # generate completion shell
+    seqkit genautocomplete --shell bash
 
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+    # configure if never did.
+    # install bash-completion if the "complete" command is not found.
+    echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+    echo "source ~/.bash_completion" >> ~/.bashrc
 
-Howto:
+Zsh:
 
-1. run: `seqkit genautocomplete`
+    # generate completion shell
+    seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
 
-2. create and edit `~/.bash_completion` file if you don't have it.
+    # configure if never did
+    echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+    echo "autoload -U compinit; compinit" >> ~/.zshrc
 
-        nano ~/.bash_completion
+fish:
 
-    add the following:
+    seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
 
-        for bcfile in ~/.bash_completion.d/* ; do
-          . $bcfile
-        done
 
 ## Technical details and guides for use
 


=====================================
doc/docs/download.md
=====================================
@@ -6,14 +6,28 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
 
 ## Latest Version
 
-- [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
-[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
-    - `seqkit grep/locate`: update help message.
-    - `seqkit grep`: **search on both strand when searching by sequence**.
-    - `seqkit split2`: fix redundant log when using `-s`.
-    - `seqkit bam`: new field `RightSoftClipSeq`. [#172](https://github.com/shenwei356/seqkit/pull/172)
-    - `seqkit sample -2`: remove extra `\n`. [#173](https://github.com/shenwei356/seqkit/issues/173)
-    - `seqkit split2 -l`: fix bug for splitting by accumulative length, this bug occurs when the first record is longer than `-l`, no sequences are lost.
+- [SeqKit v0.16.0](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.16.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+    - new command `seqkit head-genome`:
+        - print sequences of the first genome with common prefixes in name
+    - `seqkit grep/locate/amplicon -m`
+        - *much faster (300-400x) searching with mismatch allowed* by optimizing FM-indexing and parallelization.
+        - new flag `-I/--immediate-output`.
+    - `seqkit grep/locate`:
+        - fix bug of `-m` when querying contains letters not in alphabet, usually for protein sequences. [#178](https://github.com/shenwei356/seqkit/issues/178), [#179](https://github.com/shenwei356/seqkit/issues/179)
+        - onply search on positive strand when searching unlimited or protein sequences.
+    - `seqkit locate`:
+        - removing debug info for `-r` introduced in a0f6b6e. [#180](https://github.com/shenwei356/seqkit/issues/180)
+    - `seqkit amplicon`:
+        - fix bug of `-m`, when mismatch is allowed.
+    - `seqkit fx2tab`:
+        - new flag `-C/--base-count` for counting bases. [#183](https://github.com/shenwei356/seqkit/issues/183)
+    -  `seqkit tab2fx`:
+        -  fix a rare bug. [#197](https://github.com/shenwei356/seqkit/issues/197)
+    - `seqkit subseq`:
+        - fix bug for BED with empty columns. [#195](https://github.com/shenwei356/seqkit/issues/195)
+    - `seqkit genautocomplete`: 
+        - **support bash|zsh|fish|powershell**.
 
 ### Please cite
 
@@ -30,13 +44,13 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
 
 OS     |Arch      |File, 中国镜像                                                                                                                                                                                  |Download Count
 :------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux  |32-bit    |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz)                            |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_386.tar.gz)
-Linux  |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_amd64.tar.gz)
-Linux  |**arm64** |[**seqkit_linux_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_arm64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_arm64.tar.gz)
-macOS  |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_amd64.tar.gz)
-macOS  |**arm64** |[**seqkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_arm64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_arm64.tar.gz)
-Windows|32-bit    |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz)          |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_386.exe.tar.gz)
-Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_amd64.exe.tar.gz)
+Linux  |32-bit    |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz)                            |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_386.tar.gz)
+Linux  |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_amd64.tar.gz)
+Linux  |**arm64** |[**seqkit_linux_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_arm64.tar.gz)                  |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_arm64.tar.gz)
+macOS  |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_amd64.tar.gz)
+macOS  |**arm64** |[**seqkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_arm64.tar.gz)               |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_arm64.tar.gz)
+Windows|32-bit    |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz)          |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_386.exe.tar.gz)
+Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_amd64.exe.tar.gz)
 
 
 ## Installation
@@ -87,25 +101,41 @@ Run the following commands:
 
 ## Bash-completion
 
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+    # generate completion shell
+    seqkit genautocomplete --shell bash
 
-Howto:
+    # configure if never did.
+    # install bash-completion if the "complete" command is not found.
+    echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+    echo "source ~/.bash_completion" >> ~/.bashrc
 
-1. run: `seqkit genautocomplete`
+Zsh:
 
-2. create and edit `~/.bash_completion` file if you don't have it.
+    # generate completion shell
+    seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
 
-        nano ~/.bash_completion
+    # configure if never did
+    echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+    echo "autoload -U compinit; compinit" >> ~/.zshrc
 
-    add the following:
+fish:
 
-        for bcfile in ~/.bash_completion.d/* ; do
-          . $bcfile
-        done
+    seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
 
 ## Release History
 
+- [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
+    - `seqkit grep/locate`: update help message.
+    - `seqkit grep`: **search on both strand when searching by sequence**.
+    - `seqkit split2`: fix redundant log when using `-s`.
+    - `seqkit bam`: new field `RightSoftClipSeq`. [#172](https://github.com/shenwei356/seqkit/pull/172)
+    - `seqkit sample -2`: remove extra `\n`. [#173](https://github.com/shenwei356/seqkit/issues/173)
+    - `seqkit split2 -l`: fix bug for splitting by accumulative length, this bug occurs when the first record is longer than `-l`, no sequences are lost.
 - [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
 [![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
     - new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.


=====================================
doc/docs/usage.md
=====================================
@@ -39,6 +39,7 @@
 **Set operations**
 
 - [head](#head)
+- [head-genome](#head-genome)
 - [range](#range)
 - [sample](#sample)
 - [rmdup](#rmdup)
@@ -968,6 +969,7 @@ Flags:
   -a, --alphabet               print alphabet letters
   -q, --avg-qual               print average quality of a read
   -B, --base-content strings   print base content. (case ignored, multiple values supported) e.g. -B AT -B N
+  -C, --base-count strings     print base count. (case ignored, multiple values supported) e.g. -C AT -C N
   -I, --case-sensitive         calculate case sensitive base content
   -g, --gc                     print GC content
   -G, --gc-skew                print GC-Skew
@@ -1379,9 +1381,8 @@ Attentions:
      for partly matching.
   2. When searching by sequences, it's partly matching, and both positive
      and negative strands are searched.
-     Mismatch is allowed using flag "-m/--max-mismatch",
-     but it's not fast enough for large genome like human genome.
-     Though, it's fast enough for microbial genomes.
+     Mismatch is allowed using flag "-m/--max-mismatch", you can increase
+     the value of "-j/--threads" to accelerate processing.
   3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
      But do not use degenerate bases/residues in regular expression, you need
      convert them to regular expression, e.g., change "N" or "X"  to "."..
@@ -1424,6 +1425,7 @@ Flags:
       --delete-matched         delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
   -h, --help                   help for grep
   -i, --ignore-case            ignore case
+  -I, --immediate-output       print output immediately, do not use write buffer
   -v, --invert-match           invert the sense of matching, to select non-matching records
   -m, --max-mismatch int       max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
   -P, --only-positive-strand   only search on positive strand
@@ -1509,13 +1511,14 @@ Examples
         user    0m0.100s
         sys     0m0.017s
 
-        $ time cat hairpin.fa.gz | seqkit grep -s -i -p aggcg -m 1 | seqkit stats
+        $ time zcat hairpin.fa.gz | seqkit grep -s -i -p aggcg -m 1 | seqkit stats
         file  format  type  num_seqs    sum_len  min_len  avg_len  max_len
-        -     FASTA   RNA     17,168  1,881,005       39    109.6    2,354
+        -     FASTA   RNA     22,290  2,375,819       39    106.6    2,354
+
+        real    0m1.081s
+        user    0m1.305s
+        sys     0m0.158s
 
-        real    0m0.864s
-        user    0m0.941s
-        sys     0m0.014s
 
 1. Extract sequences starting with AGGCG
 
@@ -1553,8 +1556,7 @@ Attentions:
      parser accepts comma-separated-values (CSV) for multiple values (motifs).
      Patterns in file do not follow this rule.     
   4. Mismatch is allowed using flag "-m/--max-mismatch",
-     but it's not fast enough for large genome like human genome.
-     Though, it's fast enough for microbial genomes.
+     you can increase the value of "-j/--threads" to accelerate processing.
   5. When using flag --circular, end position of matched subsequence that 
      crossing genome sequence end would be greater than sequence length.
 
@@ -1569,6 +1571,7 @@ Flags:
   -h, --help                      help for locate
   -M, --hide-matched              do not show matched sequences
   -i, --ignore-case               ignore case
+  -I, --immediate-output          print output immediately, do not use write buffer
   -m, --max-mismatch int          max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
   -G, --non-greedy                non-greedy mode, faster but may miss motifs overlapping with others
   -P, --only-positive-strand      only search on positive strand
@@ -1777,6 +1780,7 @@ retrieve amplicon (or specific region around it) via primer(s).
 Attentions:
   1. Only one (the longest) matching location is returned for every primer pair.
   2. Mismatch is allowed, but the mismatch location (5' or 3') is not controled. 
+     You can increase the value of "-j/--threads" to accelerate processing.
   3. Degenerate bases/residues like "RYMM.." are also supported.
      But do not use degenerate bases/residues in regular expression, you need
      convert them to regular expression, e.g., change "N" or "X"  to "."..
@@ -1833,6 +1837,7 @@ Flags:
   -f, --flanking-region        region is flanking region
   -F, --forward string         forward primer (5'-primer-3'), degenerate bases allowed
   -h, --help                   help for amplicon
+  -I, --immediate-output       print output immediately, do not use write buffer
   -m, --max-mismatch int       max mismatch when matching primers, no degenerate bases allowed
   -P, --only-positive-strand   only search on positive strand
   -p, --primer-file string     3- or 2-column tabular primer file, with first column as primer name
@@ -2406,12 +2411,45 @@ Examples
         +
         HIHIIIIIHIIHGHHIHHIIIIIIIIIIIIIIIHHIIIIIHHIHIIIIIGIHIIIIHHHHHHGHIHIIIIIIIII
 
+## head-genome
+
+Usage
+
+```text
+
+print sequences of the first genome with common prefixes in name
+
+For a FASTA file containing multiple contigs of strains (see example below),
+these's no list of IDs available for retrieving sequences of a certain strain,
+while descriptions of each strain share the same prefix.
+
+This command is used to restrieve sequences of the first strain,
+i.e., "Vibrio cholerae strain M29".
+
+>NZ_JFGR01000001.1 Vibrio cholerae strain M29 Contig_1, whole genome shotgun sequence
+>NZ_JFGR01000002.1 Vibrio cholerae strain M29 Contig_2, whole genome shotgun sequence
+>NZ_JFGR01000003.1 Vibrio cholerae strain M29 Contig_3, whole genome shotgun sequence
+>NZ_JSTP01000001.1 Vibrio cholerae strain 2012HC-12 NODE_79, whole genome shotgun sequence
+>NZ_JSTP01000002.1 Vibrio cholerae strain 2012HC-12 NODE_78, whole genome shotgun sequence
+
+Attention:
+
+  1. Sequences in file should be well organized.
+
+Usage:
+  seqkit head-genome [flags]
+
+Flags:
+  -h, --help                    help for head-genome
+  -m, --mini-common-words int   minimal shared prefix words (default 4)
 
+```
+        
 ## range
 
 Usage
 
-``` text
+```text
 print FASTA/Q records in a range (start:end)
 
 Usage:
@@ -3338,22 +3376,30 @@ Usage
 ``` text
 generate shell autocompletion script
 
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+    # generate completion shell
+    seqkit genautocomplete --shell bash
 
-Howto:
+    # configure if never did.
+    # install bash-completion if the "complete" command is not found.
+    echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+    echo "source ~/.bash_completion" >> ~/.bashrc
 
-1. run: seqkit genautocomplete
+Zsh:
 
-2. create and edit ~/.bash_completion file if you don't have it.
+    # generate completion shell
+    seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
 
-        nano ~/.bash_completion
+    # configure if never did
+    echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+    echo "autoload -U compinit; compinit" >> ~/.zshrc
 
-   add the following:
+fish:
 
-        for bcfile in ~/.bash_completion.d/* ; do
-          . $bcfile
-        done
+    seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
 
 Usage:
   seqkit genautocomplete [flags]


=====================================
doc/mkdocs.yml
=====================================
@@ -1,5 +1,5 @@
 site_name: SeqKit - Ultrafast FASTA/Q kit
-pages:
+nav:
 - Home: index.md
 - Download: download.md
 - Documents:
@@ -12,17 +12,17 @@ pages:
 # theme: cinder
 
 theme:
-    name: 'material'
+    name: material
     palette:
-        primary: 'teal'
-        accent: 'blue grey'
-    logo:
-        icon: 'school'
-    favicon: 'files/favicon.ico'
+        primary: teal
+        accent: blue grey
+    icon:
+        logo: material/school
+    favicon: files/favicon.ico
     feature:
-        tabs: false
+        navigation.tabs: false
 extra:
-  manifest: 'manifest.webmanifest'
+  manifest: manifest.webmanifest
 
 repo_url: https://github.com/shenwei356/seqkit
 site_description: SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation


=====================================
go.mod
=====================================
@@ -7,30 +7,25 @@ require (
 	github.com/biogo/hts v1.2.2
 	github.com/bsipos/thist v1.0.0
 	github.com/cespare/xxhash v1.1.0
-	github.com/cznic/mathutil v0.0.0-20181122101859-297441e03548 // indirect
 	github.com/cznic/sortutil v0.0.0-20181122101858-f5f958428db8
 	github.com/dustin/go-humanize v1.0.0
-	github.com/edsrzf/mmap-go v1.0.0 // indirect
 	github.com/fsnotify/fsnotify v1.4.9
 	github.com/iafan/cwalk v0.0.0-20191125092548-dd7f505d2f66
-	github.com/klauspost/compress v1.11.4 // indirect
-	github.com/klauspost/pgzip v1.2.5 // indirect
 	github.com/logrusorgru/aurora v2.0.3+incompatible
 	github.com/mattn/go-colorable v0.1.8
 	github.com/mattn/go-isatty v0.0.12
 	github.com/mattn/go-runewidth v0.0.9 // indirect
 	github.com/mitchellh/go-homedir v1.1.0
 	github.com/pkg/errors v0.9.1
-	github.com/remyoudompheng/bigfft v0.0.0-20200410134404-eec4a21b6bb0 // indirect
-	github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476
-	github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403 // indirect
-	github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179
-	github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90
+	github.com/shenwei356/bio v0.1.0
+	github.com/shenwei356/breader v0.1.0
+	github.com/shenwei356/bwt v0.5.1
 	github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba
 	github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d // indirect
-	github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd
+	github.com/shenwei356/util v0.3.0
 	github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992
 	github.com/smallfish/simpleyaml v0.0.0-20170911015856-a32031077861
-	github.com/spf13/cobra v1.1.1
+	github.com/spf13/cobra v1.1.3
 	github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939
+	github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553
 )


=====================================
go.sum
=====================================
@@ -139,8 +139,9 @@ github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxv
 github.com/kortschak/utter v0.0.0-20190412033250-50fe362e6560/go.mod h1:oDr41C7kH9wvAikWyFhr6UFr8R7nelpmCF5XR5rL7I8=
 github.com/kr/logfmt v0.0.0-20140226030751-b84e30acd515/go.mod h1:+0opPa2QZZtGFBFZlji/RkVcI2GknAs/DXo4wKdlNEc=
 github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
-github.com/kr/pretty v0.2.0 h1:s5hAObm+yFO5uHYt5dYjxi2rXrsnmRpJx4OYvIWUaQs=
 github.com/kr/pretty v0.2.0/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
+github.com/kr/pretty v0.2.1 h1:Fmg33tUaq4/8ym9TJN1x7sLJnHVwhP33CNkpYV/7rwI=
+github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
 github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
 github.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=
 github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
@@ -194,20 +195,20 @@ github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFR
 github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
 github.com/ryanuber/columnize v0.0.0-20160712163229-9b3edd62028f/go.mod h1:sm1tb6uqfes/u+d4ooFouqFdy9/2g9QGwK3SQygK0Ts=
 github.com/sean-/seed v0.0.0-20170313163322-e2103e2c3529/go.mod h1:DxrIzT+xaE7yg65j358z/aeFdxmN0P9QXhEzd20vsDc=
-github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476 h1:PE3sUBFeHWFNLsrQ7Blyk4duI+emVGKQ1+iYo4ZalKk=
-github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476/go.mod h1:tKIebGVfqYwgLIoXohkLiQO685B2d3l9vTf1cjRZaVk=
+github.com/shenwei356/bio v0.1.0 h1:VDnI28zcdybywdn6/tcZvplAJ1IxOAAYrTJhhTB1SLQ=
+github.com/shenwei356/bio v0.1.0/go.mod h1:NgFauYHlpmjCYEf2XP8foITht6ej6poggQkILpjraN4=
 github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403 h1:/3JklLnHXiWUBxWc3joQYavDQJpncRhRA909cUb7eOw=
 github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403/go.mod h1:YkgdTWfNnJgv5HVJbVSDmxQtkK3/jZWDoqcG26BVU8k=
-github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179 h1:MZiEG6aN8j1A0jXzCzyZM2BM6ZHbePYiwR4FXgUTCCk=
-github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179/go.mod h1:Yb9qYJ2qJolWeqIVS1LG5ylHTLI5ZfbVFEJJouF2DsY=
-github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90 h1:c8Er9GmVclwt002/ZxVx7WOThpmPKNR4rehAXxICTII=
-github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90/go.mod h1:q4esDgocpgU8CcnXuPNun+hGv+VDXGX+NnXHQaHGRCU=
+github.com/shenwei356/breader v0.1.0 h1:g7q33VI15Mroj2TTab06CF6prbQAez+6p8fAoK4qkjo=
+github.com/shenwei356/breader v0.1.0/go.mod h1:YXIrHIPtbJCP6Kv27qGp+cXQl7hyzD0iQrEVYCy/gqw=
+github.com/shenwei356/bwt v0.5.1 h1:pq9KxkJPwzUwPjwWqfTOkUmzwHm8JtMkZxBkqkzuIVo=
+github.com/shenwei356/bwt v0.5.1/go.mod h1:V2hX4adhr4WfFpy2ogKiV8A2WO9FeK0wkq1o3/R91mE=
 github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba h1:UvnrxFDPmz7agYX0eQ2JEorTKn1ORnZ9dT5OzbjPvK8=
 github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba/go.mod h1:LiqYp/K5yCEWOi7Ux/iOF/kjDxtsdYjOGcKHDbEOXFU=
 github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d h1:eeXLHcXyGEr72V1SOSEI7vSzUOTJvHutwF7Ykm+hscQ=
 github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d/go.mod h1:SiiGiRFyRtV7S9RamOrmQR5gpGIRhWJM1w0EtmuQ1io=
-github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd h1:11sa/TyR/91v63OA9VnyjRontqtsDFZN1wCLNaaAa38=
-github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd/go.mod h1:n3qhc3bQzlqJ2/5v79hgl0Gd3WzJOkI8XcUix25Brdg=
+github.com/shenwei356/util v0.3.0 h1:gTVa3sGwcyGEHgNpXTzdL3MaaJN/bGAypVKSCnT4QfU=
+github.com/shenwei356/util v0.3.0/go.mod h1:n3qhc3bQzlqJ2/5v79hgl0Gd3WzJOkI8XcUix25Brdg=
 github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992 h1:RXEEyKj0JL3SrRIYsWIEyy4AwjHbI3I8aDGK6CA4+YI=
 github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992/go.mod h1:6EQUa6I7Zsl2GQKqcL9qGLrTzVE+oZyly+uhzovQYSk=
 github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
@@ -221,8 +222,8 @@ github.com/spaolacci/murmur3 v0.0.0-20180118202830-f09979ecbc72 h1:qLC7fQah7D6K1
 github.com/spaolacci/murmur3 v0.0.0-20180118202830-f09979ecbc72/go.mod h1:JwIasOWyU6f++ZhiEuf87xNszmSA2myDM2Kzu9HwQUA=
 github.com/spf13/afero v1.1.2/go.mod h1:j4pytiNVoe2o6bmDsKpLACNPDBIoEAkihy7loJ1B0CQ=
 github.com/spf13/cast v1.3.0/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkUJE=
-github.com/spf13/cobra v1.1.1 h1:KfztREH0tPxJJ+geloSLaAkaPkr4ki2Er5quFV1TDo4=
-github.com/spf13/cobra v1.1.1/go.mod h1:WnodtKOvamDL/PwE2M4iKs8aMDBZ5Q5klgD3qfVJQMI=
+github.com/spf13/cobra v1.1.3 h1:xghbfqPkxzxP3C/f3n5DdpAbdKLj4ZE4BWQI362l53M=
+github.com/spf13/cobra v1.1.3/go.mod h1:pGADOWyqRD/YMrPZigI/zbliZ2wVD/23d+is3pSWzOo=
 github.com/spf13/jwalterweatherman v1.0.0/go.mod h1:cQK4TGJAtQXfYWX+Ddv3mKDzgVb68N+wFjFa4jdeBTo=
 github.com/spf13/pflag v1.0.3/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
 github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -236,6 +237,8 @@ github.com/subosito/gotenv v1.2.0/go.mod h1:N0PQaV/YGNqwC0u51sEeR/aUtSLEXKX9iv69
 github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939 h1:BhIUXV2ySTLrKgh/Hnts+QTQlIbWtomXt3LMdzME0A0=
 github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939/go.mod h1:omGxs4/6hNjxPKUTjmaNkPzehSnNJOJN6pMEbrlYIT4=
 github.com/tmc/grpc-websocket-proxy v0.0.0-20190109142713-0ad062ec5ee5/go.mod h1:ncp9v5uamzpCO7NfCPTXjqaC+bZgJeR0sMTm6dMHP7U=
+github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553 h1:DRC1ubdb3ZmyyIeCSTxjZIQAnpLPfKVgYrLETQuOPjo=
+github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553/go.mod h1:Rj7Csq/tZ/egz+Ltc2IVpsA5309AmSMEswjkTZmq2Xc=
 github.com/ulikunitz/xz v0.5.6/go.mod h1:2bypXElzHzzJZwzH67Y6wb67pO62Rzfn7BSiF4ABRW8=
 github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2/go.mod h1:UETIi67q53MR2AWcXfiuqkDkRtnGDLqkBTpCHuJHxtU=
 go.etcd.io/bbolt v1.3.2/go.mod h1:IbVyRI1SCnLcuJnV2u8VeU0CEYM7e686BmAb1XKL+uU=
@@ -312,8 +315,9 @@ golang.org/x/sys v0.0.0-20190606165138-5da285871e9c/go.mod h1:h1NjWce9XRLGQEsW7w
 golang.org/x/sys v0.0.0-20190624142023-c5567b49c5d0/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
 golang.org/x/sys v0.0.0-20191005200804-aed5e4c7ecf9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
 golang.org/x/sys v0.0.0-20200116001909-b77594299b42/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
-golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae h1:/WDfKMnPU+m5M4xB+6x4kaepxRw6jWvR5iDRdvjHgy8=
 golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210315160823-c6e025ad8005 h1:pDMpM2zh2MT0kHy037cKlSby2nEhD50SYqwQk76Nm40=
+golang.org/x/sys v0.0.0-20210315160823-c6e025ad8005/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
 golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
 golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
 golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
@@ -367,16 +371,17 @@ google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ij
 gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
 gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
 gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
-gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=
 gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
 gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
 gopkg.in/ini.v1 v1.51.0/go.mod h1:pNLf8WUiyNEtQjuu5G5vTm06TEv9tsIgeAvK8hOrP4k=
 gopkg.in/resty.v1 v1.12.0/go.mod h1:mDo4pnntr5jdWRML875a/NmxYqAlA73dVijT2AXvQQo=
 gopkg.in/yaml.v2 v2.0.0-20170812160011-eb3733d160e7/go.mod h1:JAlM8MvJe8wmxCU4Bli9HhUf9+ttbYbLASfIpnQbh74=
 gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
 gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
-gopkg.in/yaml.v2 v2.2.8 h1:obN1ZagJSUGI0Ek/LBmuj4SNLPfIny3KsKFopxRdj10=
-gopkg.in/yaml.v2 v2.2.8/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
+gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY=
+gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
 honnef.co/go/tools v0.0.0-20190102054323-c2f93a96b099/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
 honnef.co/go/tools v0.0.0-20190106161140-3f1c8253044a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
 honnef.co/go/tools v0.0.0-20190418001031-e561f6794a2a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=


=====================================
seqkit/cmd/amplicon.go
=====================================
@@ -31,6 +31,7 @@ import (
 	"sort"
 	"strconv"
 	"strings"
+	"sync"
 
 	"github.com/shenwei356/bio/featio/gtf"
 	"github.com/shenwei356/bio/seq"
@@ -40,6 +41,7 @@ import (
 	"github.com/shenwei356/bwt/fmi"
 	"github.com/shenwei356/xopen"
 	"github.com/spf13/cobra"
+	"github.com/twotwotwo/sorts/sortutil"
 )
 
 // ampliconCmd represents the amplicon command
@@ -51,6 +53,7 @@ var ampliconCmd = &cobra.Command{
 Attentions:
   1. Only one (the longest) matching location is returned for every primer pair.
   2. Mismatch is allowed, but the mismatch location (5' or 3') is not controled. 
+     You can increase the value of "-j/--threads" to accelerate processing.
   3. Degenerate bases/residues like "RYMM.." are also supported.
      But do not use degenerate bases/residues in regular expression, you need
      convert them to regular expression, e.g., change "N" or "X"  to "."..
@@ -131,6 +134,8 @@ Examples:
 		onlyPositiveStrand := getFlagBool(cmd, "only-positive-strand")
 		outFmtBED := getFlagBool(cmd, "bed")
 
+		immediateOutput := getFlagBool(cmd, "immediate-output")
+
 		var list [][3]string
 		var primers [][3][]byte
 
@@ -180,13 +185,194 @@ Examples:
 			usingRegion = true
 		}
 
+		strands := []string{"+", "-"}
+
+		// -------------------------------------------------------------------
+		// only for m > 0, where FMI is slow
+
+		if maxMismatch > 0 {
+			type Arecord struct {
+				id     uint64
+				ok     bool
+				record []string
+			}
+
+			var wg sync.WaitGroup
+			ch := make(chan *Arecord, config.Threads)
+			tokens := make(chan int, config.Threads)
+
+			done := make(chan int)
+			go func() {
+				m := make(map[uint64]*Arecord, config.Threads)
+				var id, _id uint64
+				var ok bool
+				var _r *Arecord
+				var row string
+
+				id = 1
+				for r := range ch {
+					_id = r.id
+
+					if _id == id { // right there
+						if r.ok {
+							for _, row = range r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						id++
+						continue
+					}
+
+					m[_id] = r // save for later check
+
+					if _r, ok = m[id]; ok { // check buffered
+						if _r.ok {
+							for _, row = range _r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						delete(m, id)
+						id++
+					}
+				}
+
+				if len(m) > 0 {
+					ids := make([]uint64, len(m))
+					i := 0
+					for _id = range m {
+						ids[i] = _id
+						i++
+					}
+					sortutil.Uint64s(ids)
+					for _, _id = range ids {
+						_r = m[_id]
+
+						if _r.ok {
+							for _, row = range _r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+					}
+				}
+				done <- 1
+			}()
+
+			var id uint64
+			for _, file := range files {
+				var record *fastx.Record
+				var fastxReader *fastx.Reader
+
+				fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+				checkError(err)
+
+				for {
+					record, err = fastxReader.Read()
+					if err != nil {
+						if err == io.EOF {
+							break
+						}
+						checkError(err)
+						break
+					}
+					if fastxReader.IsFastq {
+						config.LineWidth = 0
+						fastx.ForcelyOutputFastq = true
+					}
+
+					tokens <- 1
+					wg.Add(1)
+					id++
+					go func(record *fastx.Record, id uint64) {
+						defer func() {
+							wg.Done()
+							<-tokens
+						}()
+
+						var finder *AmpliconFinder
+						var loc []int
+						var strand string
+						var tmpSeq *seq.Seq
+						var primer [3][]byte
+
+						results := make([]string, 0, 2)
+
+						for _, strand = range strands {
+							if strand == "-" {
+								if onlyPositiveStrand {
+									continue
+								}
+								record.Seq.RevComInplace()
+							}
+
+							for _, primer = range primers {
+								finder, err = NewAmpliconFinder(record.Seq.Seq, primer[1], primer[2], maxMismatch)
+								checkError(err)
+
+								if usingRegion {
+									loc, err = finder.LocateRange(begin, end, fregion, strict)
+								} else {
+									loc, err = finder.Locate()
+								}
+								checkError(err)
+
+								if loc == nil {
+									continue
+								}
+
+								if outFmtBED {
+									results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\t%s\n",
+										record.ID,
+										loc[0]-1,
+										loc[1],
+										primer[0],
+										0,
+										strand,
+										record.Seq.SubSeq(loc[0], loc[1]).Seq))
+
+									continue
+								}
+								tmpSeq = record.Seq
+
+								record.Seq = record.Seq.SubSeq(loc[0], loc[1])
+
+								results = append(results, string(record.Format(config.LineWidth)))
+
+								record.Seq = tmpSeq
+							}
+						}
+
+						ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+					}(record.Clone(), id)
+				}
+
+				config.LineWidth = lineWidth
+			}
+
+			wg.Wait()
+			close(ch)
+			<-done
+
+			return
+		}
+
 		var record *fastx.Record
 		var fastxReader *fastx.Reader
 
 		var finder *AmpliconFinder
 		var loc []int
 
-		strands := []string{"+", "-"}
 		var strand string
 		var tmpSeq *seq.Seq
 		var primer [3][]byte
@@ -271,7 +457,8 @@ func init() {
 	ampliconCmd.Flags().BoolP("flanking-region", "f", false, "region is flanking region")
 	ampliconCmd.Flags().BoolP("strict-mode", "s", false, "strict mode, i.e., discarding seqs not fully matching (shorter) given region range")
 	ampliconCmd.Flags().BoolP("only-positive-strand", "P", false, "only search on positive strand")
-	ampliconCmd.Flags().BoolP("bed", "", false, "output in BED6+1 format with amplicon as 7th columns")
+	ampliconCmd.Flags().BoolP("bed", "", false, "output in BED6+1 format with amplicon as the 7th column")
+	ampliconCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
 }
 
 func loadPrimers(file string) ([][3]string, error) {
@@ -691,7 +878,7 @@ func (finder *AmpliconFinder) Locate() ([]int, error) {
 	sort.Ints(locsI) // to remain the FIRST location
 	sort.Ints(locsJ) // to remain the LAST location
 	finder.searched, finder.found = true, true
-	finder.iBegin, finder.iEnd = locsI[0], locsI[0]+len(finder.R)-1
+	finder.iBegin, finder.iEnd = locsI[0], locsJ[len(locsJ)-1]+len(finder.R)-1
 	return []int{locsI[0] + 1, locsJ[len(locsJ)-1] + len(finder.R)}, nil
 }
 


=====================================
seqkit/cmd/bam_toolbox.go
=====================================
@@ -58,7 +58,7 @@ type Toolshed map[string]BamTool
 
 func (s Toolshed) String() string {
 	tools := make([]string, 0, len(s))
-	for t, _ := range s {
+	for t := range s {
 		tools = append(tools, t)
 	}
 	sort.Strings(tools)


=====================================
seqkit/cmd/bed.go
=====================================
@@ -28,7 +28,6 @@ import (
 	"strings"
 
 	"github.com/shenwei356/breader"
-	"github.com/shenwei356/util/stringutil"
 )
 
 // BedFeature is the gff BedFeature struct
@@ -64,9 +63,8 @@ func ReadBedFilteredFeatures(file string, chrs []string) ([]BedFeature, error) {
 		if line == "" || line[0] == '#' || (len(line) > 7 && string(line[0:7]) == "browser") || (len(line) > 5 && string(line[0:5]) == "track") {
 			return nil, false, nil
 		}
-		//
-		// do not use regexp.Split(), it's very slow
-		items := stringutil.Split(line, "\t")
+
+		items := strings.Split(line, "\t")
 		n := len(items)
 		if n < 3 {
 			return nil, false, nil


=====================================
seqkit/cmd/fx2tab.go
=====================================
@@ -63,6 +63,7 @@ Attention:
 		printGC := getFlagBool(cmd, "gc")
 		printGCSkew := getFlagBool(cmd, "gc-skew")
 		baseContents := getFlagStringSlice(cmd, "base-content")
+		baseCounts := getFlagStringSlice(cmd, "base-count")
 		caseSensitive := getFlagBool(cmd, "case-sensitive")
 		onlyName := getFlagBool(cmd, "name")
 		printTitle := getFlagBool(cmd, "header-line")
@@ -98,6 +99,11 @@ Attention:
 			if printGCSkew {
 				outfh.WriteString("\tGC-Skew")
 			}
+			if len(baseCounts) > 0 {
+				for _, bc := range baseCounts {
+					outfh.WriteString(fmt.Sprintf("\t%s", bc))
+				}
+			}
 			if len(baseContents) > 0 {
 				for _, bc := range baseContents {
 					outfh.WriteString(fmt.Sprintf("\t%s", bc))
@@ -164,6 +170,16 @@ Attention:
 					outfh.WriteString(fmt.Sprintf("\t%.2f", (g-c)/(g+c)*100))
 				}
 
+				if len(baseCounts) > 0 {
+					for _, bc := range baseCounts {
+						if caseSensitive {
+							outfh.WriteString(fmt.Sprintf("\t%d", record.Seq.BaseCountCaseSensitive(bc)))
+						} else {
+							outfh.WriteString(fmt.Sprintf("\t%d", record.Seq.BaseCount(bc)))
+						}
+					}
+				}
+
 				if len(baseContents) > 0 {
 					for _, bc := range baseContents {
 						if caseSensitive {
@@ -199,6 +215,7 @@ func init() {
 	fx2tabCmd.Flags().BoolP("gc", "g", false, "print GC content")
 	fx2tabCmd.Flags().BoolP("gc-skew", "G", false, "print GC-Skew")
 	fx2tabCmd.Flags().StringSliceP("base-content", "B", []string{}, "print base content. (case ignored, multiple values supported) e.g. -B AT -B N")
+	fx2tabCmd.Flags().StringSliceP("base-count", "C", []string{}, "print base count. (case ignored, multiple values supported) e.g. -C AT -C N")
 	fx2tabCmd.Flags().BoolP("case-sensitive", "I", false, "calculate case sensitive base content")
 	fx2tabCmd.Flags().BoolP("only-id", "i", false, "print ID instead of full head")
 	fx2tabCmd.Flags().BoolP("name", "n", false, "only print names (no sequences and qualities)")


=====================================
seqkit/cmd/genautocomplete.go
=====================================
@@ -1,4 +1,4 @@
-// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+// Copyright © 2016-2021 Wei Shen <shenwei356 at gmail.com>
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -33,44 +33,60 @@ import (
 // genautocompleteCmd represents the fq2fa command
 var genautocompleteCmd = &cobra.Command{
 	Use:   "genautocomplete",
-	Short: "generate shell autocompletion script",
+	Short: "generate shell autocompletion script (bash|zsh|fish|powershell)",
 	Long: `generate shell autocompletion script
 
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
 
-Howto:
+Bash:
 
-1. run: seqkit genautocomplete
+    # generate completion shell
+    seqkit genautocomplete --shell bash
 
-2. create and edit ~/.bash_completion file if you don't have it.
+    # configure if never did.
+    # install bash-completion if the "complete" command is not found.
+    echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+    echo "source ~/.bash_completion" >> ~/.bashrc
 
-        nano ~/.bash_completion
+Zsh:
 
-    add the following:
+    # generate completion shell
+    seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
 
-        for bcfile in ~/.bash_completion.d/* ; do
-          . $bcfile
-        done
+    # configure if never did
+    echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+    echo "autoload -U compinit; compinit" >> ~/.zshrc
+
+fish:
+
+    seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
 
 `,
 	Run: func(cmd *cobra.Command, args []string) {
-		autocompleteTarget := getFlagString(cmd, "file")
-		autocompleteType := getFlagString(cmd, "type")
-
-		if autocompleteType != "bash" {
-			checkError(fmt.Errorf("only Bash is supported for now"))
-		}
+		outfile := getFlagString(cmd, "file")
+		shell := getFlagString(cmd, "shell")
 
-		dir := filepath.Dir(autocompleteTarget)
+		dir := filepath.Dir(outfile)
 		ok, err := pathutil.DirExists(dir)
 		checkError(err)
 		if !ok {
 			os.MkdirAll(dir, 0744)
 		}
-		checkError(cmd.Root().GenBashCompletionFile(autocompleteTarget))
 
-		log.Infof("bash completion file for SeqKit saved to %s", autocompleteTarget)
+		switch shell {
+		case "bash":
+			checkError(cmd.Root().GenBashCompletionFile(outfile))
+		case "zsh":
+			checkError(cmd.Root().GenZshCompletionFile(outfile))
+		case "fish":
+			checkError(cmd.Root().GenFishCompletionFile(outfile, true))
+		case "powershell":
+			checkError(cmd.Root().GenPowerShellCompletionFile(outfile))
+		default:
+			checkError(fmt.Errorf("unsupported shell: %s", shell))
+		}
+
+		log.Infof("%s completion file for seqkit saved to %s", shell, outfile)
 	},
 }
 
@@ -79,5 +95,5 @@ func init() {
 	defaultCompletionFile, err := homedir.Expand("~/.bash_completion.d/seqkit.sh")
 	checkError(err)
 	genautocompleteCmd.Flags().StringP("file", "", defaultCompletionFile, "autocompletion file")
-	genautocompleteCmd.Flags().StringP("type", "", "bash", "autocompletion type (currently only bash supported)")
+	genautocompleteCmd.Flags().StringP("shell", "", "bash", "autocompletion type (bash|zsh|fish|powershell)")
 }


=====================================
seqkit/cmd/grep.go
=====================================
@@ -28,8 +28,10 @@ import (
 	"runtime"
 	"strconv"
 	"strings"
+	"sync"
 
 	"github.com/shenwei356/bwt"
+	"github.com/twotwotwo/sorts/sortutil"
 
 	"github.com/shenwei356/bio/seq"
 	"github.com/shenwei356/bio/seqio/fastx"
@@ -54,9 +56,8 @@ Attentions:
      for partly matching.
   2. When searching by sequences, it's partly matching, and both positive
      and negative strands are searched.
-     Mismatch is allowed using flag "-m/--max-mismatch",
-     but it's not fast enough for large genome like human genome.
-     Though, it's fast enough for microbial genomes.
+     Mismatch is allowed using flag "-m/--max-mismatch", you can increase
+     the value of "-j/--threads" to accelerate processing.
   3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
      But do not use degenerate bases/residues in regular expression, you need
      convert them to regular expression, e.g., change "N" or "X"  to "."..
@@ -105,6 +106,8 @@ Examples:
 		region := getFlagString(cmd, "region")
 		circular := getFlagBool(cmd, "circular")
 
+		immediateOutput := getFlagBool(cmd, "immediate-output")
+
 		if len(pattern) == 0 && patternFile == "" {
 			checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
 		}
@@ -266,21 +269,209 @@ Examples:
 		checkError(err)
 		defer outfh.Close()
 
+		var fastxReader *fastx.Reader
+		var record *fastx.Record
+		strands := []byte{'+', '-'}
+
+		// -------------------------------------------------------------------
+		// only for searching with sequences and mismatch > 0, were FMI is very slow
+
+		if bySeq && mismatches > 1 {
+			type Arecord struct {
+				id     uint64
+				ok     bool
+				record *fastx.Record
+			}
+
+			var wg sync.WaitGroup
+			ch := make(chan *Arecord, config.Threads)
+			tokens := make(chan int, config.Threads)
+
+			done := make(chan int)
+			go func() {
+				m := make(map[uint64]*Arecord, config.Threads)
+				var id, _id uint64
+				var ok bool
+				var _r *Arecord
+
+				id = 1
+				for r := range ch {
+					_id = r.id
+
+					if _id == id { // right there
+						if r.ok {
+							r.record.FormatToWriter(outfh, config.LineWidth)
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						id++
+						continue
+					}
+
+					m[_id] = r // save for later check
+
+					if _r, ok = m[id]; ok { // check buffered
+						if _r.ok {
+							_r.record.FormatToWriter(outfh, config.LineWidth)
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						delete(m, id)
+						id++
+					}
+				}
+
+				if len(m) > 0 {
+					ids := make([]uint64, len(m))
+					i := 0
+					for _id = range m {
+						ids[i] = _id
+						i++
+					}
+					sortutil.Uint64s(ids)
+					for _, _id = range ids {
+						_r = m[_id]
+
+						if _r.ok {
+							_r.record.FormatToWriter(outfh, config.LineWidth)
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+					}
+				}
+				done <- 1
+			}()
+
+			var id uint64
+			for _, file := range files {
+				fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+				checkError(err)
+
+				checkAlphabet := true
+				for {
+					record, err = fastxReader.Read()
+					if err != nil {
+						if err == io.EOF {
+							break
+						}
+						checkError(err)
+						break
+					}
+
+					if checkAlphabet {
+						if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+							onlyPositiveStrand = true
+						}
+						checkAlphabet = false
+					}
+
+					if fastxReader.IsFastq {
+						config.LineWidth = 0
+						fastx.ForcelyOutputFastq = true
+					}
+
+					tokens <- 1
+					wg.Add(1)
+					id++
+					go func(record *fastx.Record, id uint64) {
+						defer func() {
+							wg.Done()
+							<-tokens
+						}()
+
+						var sequence *seq.Seq
+						var target []byte
+						var hit bool
+						var k string
+
+						sfmi := fmi.NewFMIndex()
+
+						for _, strand := range strands {
+							if hit {
+								break
+							}
+
+							if strand == '-' && onlyPositiveStrand {
+								break
+							}
+
+							sequence = record.Seq
+							if strand == '-' {
+								sequence = record.Seq.RevCom()
+							}
+							if limitRegion {
+								target = sequence.SubSeq(start, end).Seq
+							} else if circular {
+								// concat two copies of sequence, and do not change orginal sequence
+								target = make([]byte, len(sequence.Seq)*2)
+								copy(target[0:len(sequence.Seq)], sequence.Seq)
+								copy(target[len(sequence.Seq):], sequence.Seq)
+							} else {
+								target = sequence.Seq
+							}
+
+							if ignoreCase {
+								target = bytes.ToLower(target)
+							}
+
+							_, err = sfmi.Transform(target)
+							if err != nil {
+								checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
+							}
+							for k = range patterns {
+								hit, err = sfmi.Match([]byte(k), mismatches)
+								if err != nil {
+									checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", k, record.Name, err))
+								}
+								if hit {
+									break
+								}
+							}
+
+						}
+
+						if invertMatch {
+							if hit {
+								ch <- &Arecord{record: nil, ok: false, id: id}
+								return
+							}
+						} else {
+							if !hit {
+								ch <- &Arecord{record: nil, ok: false, id: id}
+								return
+							}
+						}
+
+						ch <- &Arecord{record: record, ok: true, id: id}
+
+					}(record.Clone(), id)
+				}
+			}
+
+			wg.Wait()
+			close(ch)
+			<-done
+
+			return
+		}
+
+		// -------------------------------------------------------------------
+
 		var sequence *seq.Seq
 		var target []byte
 		var ok, hit bool
-		var record *fastx.Record
-		var fastxReader *fastx.Reader
 		var k string
-		var locs []int
 		var re *regexp.Regexp
 		var p string
-		strands := []byte{'+', '-'}
 		var strand byte
 		for _, file := range files {
 			fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
 			checkError(err)
 
+			checkAlphabet := true
 			for {
 				record, err = fastxReader.Read()
 				if err != nil {
@@ -290,6 +481,14 @@ Examples:
 					checkError(err)
 					break
 				}
+
+				if checkAlphabet {
+					if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+						onlyPositiveStrand = true
+					}
+					checkAlphabet = false
+				}
+
 				if fastxReader.IsFastq {
 					config.LineWidth = 0
 					fastx.ForcelyOutputFastq = true
@@ -367,12 +566,11 @@ Examples:
 								checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
 							}
 							for k = range patterns {
-								locs, err = sfmi.Locate([]byte(k), mismatches)
+								hit, err = sfmi.Match([]byte(k), mismatches)
 								if err != nil {
 									checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", k, record.Name, err))
 								}
-								if len(locs) > 0 {
-									hit = true
+								if hit {
 									if deleteMatched && !invertMatch {
 										delete(patterns, k)
 									}
@@ -406,6 +604,10 @@ Examples:
 				}
 
 				record.FormatToWriter(outfh, config.LineWidth)
+
+				if immediateOutput {
+					outfh.Flush()
+				}
 			}
 
 			config.LineWidth = lineWidth
@@ -430,4 +632,5 @@ func init() {
 	grepCmd.Flags().StringP("region", "R", "", "specify sequence region for searching. "+
 		"e.g 1:12 for first 12 bases, -12:-1 for last 12 bases")
 	grepCmd.Flags().BoolP("circular", "c", false, "circular genome")
+	grepCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
 }


=====================================
seqkit/cmd/head-genome.go
=====================================
@@ -0,0 +1,151 @@
+// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in
+// all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+// THE SOFTWARE.
+
+package cmd
+
+import (
+	"fmt"
+	"io"
+	"runtime"
+
+	"github.com/shenwei356/bio/seq"
+	"github.com/shenwei356/bio/seqio/fastx"
+	"github.com/shenwei356/util/stringutil"
+	"github.com/shenwei356/xopen"
+	"github.com/spf13/cobra"
+)
+
+// headGenomeCmd represents the head command
+var headGenomeCmd = &cobra.Command{
+	Use:   "head-genome",
+	Short: "print sequences of the first genome with common prefixes in name",
+	Long: `print sequences of the first genome with common prefixes in name
+
+For a FASTA file containing multiple contigs of strains (see example below),
+these's no list of IDs available for retrieving sequences of a certain strain,
+while descriptions of each strain share the same prefix.
+
+This command is used to restrieve sequences of the first strain,
+i.e., "Vibrio cholerae strain M29".
+
+>NZ_JFGR01000001.1 Vibrio cholerae strain M29 Contig_1, whole genome shotgun sequence
+>NZ_JFGR01000002.1 Vibrio cholerae strain M29 Contig_2, whole genome shotgun sequence
+>NZ_JFGR01000003.1 Vibrio cholerae strain M29 Contig_3, whole genome shotgun sequence
+>NZ_JSTP01000001.1 Vibrio cholerae strain 2012HC-12 NODE_79, whole genome shotgun sequence
+>NZ_JSTP01000002.1 Vibrio cholerae strain 2012HC-12 NODE_78, whole genome shotgun sequence
+
+Attention:
+
+  1. Sequences in file should be well organized.
+
+`,
+	Run: func(cmd *cobra.Command, args []string) {
+		config := getConfigs(cmd)
+		alphabet := config.Alphabet
+		idRegexp := config.IDRegexp
+		outFile := config.OutFile
+		lineWidth := config.LineWidth
+		seq.AlphabetGuessSeqLengthThreshold = config.AlphabetGuessSeqLength
+		seq.ValidateSeq = false
+		runtime.GOMAXPROCS(config.Threads)
+
+		minWords := getFlagPositiveInt(cmd, "mini-common-words")
+
+		files := getFileListFromArgsAndFile(cmd, args, true, "infile-list", true)
+
+		outfh, err := xopen.Wopen(outFile)
+		checkError(err)
+		defer outfh.Close()
+
+		var record *fastx.Record
+		var fastxReader *fastx.Reader
+
+		var prefixes, words []string
+		var i, N int
+		var nSharedWords, pNSharedWords int
+
+		for _, file := range files {
+			fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+			checkError(err)
+
+			for {
+				record, err = fastxReader.Read()
+				if err != nil {
+					if err == io.EOF {
+						break
+					}
+					checkError(err)
+					break
+				}
+
+				if fastxReader.IsFastq {
+					config.LineWidth = 0
+					fastx.ForcelyOutputFastq = true
+				}
+
+				if len(record.Desc) == 0 {
+					checkError(fmt.Errorf("no description: %s", record.ID))
+				}
+
+				if prefixes == nil { // first record
+					record.FormatToWriter(outfh, config.LineWidth)
+
+					prefixes = stringutil.Split(string(record.Desc), "\t ")
+					continue
+				}
+
+				words = stringutil.Split(string(record.Desc), "\t ")
+				if len(words) < len(prefixes) {
+					N = len(words)
+				} else {
+					N = len(prefixes)
+				}
+
+				nSharedWords = 0
+				for i = 0; i < N; i++ {
+					if words[i] != prefixes[i] {
+						break
+					}
+					nSharedWords++
+				}
+
+				if nSharedWords < minWords {
+					return
+				}
+
+				if pNSharedWords == 0 { // 2nd sequence
+					pNSharedWords = nSharedWords
+				} else if nSharedWords != pNSharedWords { // number of shared words changed
+					return
+				}
+
+				record.FormatToWriter(outfh, config.LineWidth)
+			}
+
+			config.LineWidth = lineWidth
+		}
+	},
+}
+
+func init() {
+	RootCmd.AddCommand(headGenomeCmd)
+
+	headGenomeCmd.Flags().IntP("mini-common-words", "m", 1, "minimal shared prefix words")
+}


=====================================
seqkit/cmd/locate.go
=====================================
@@ -26,6 +26,7 @@ import (
 	"io"
 	"regexp"
 	"runtime"
+	"sync"
 
 	"github.com/shenwei356/bio/seq"
 	"github.com/shenwei356/bio/seqio/fastx"
@@ -33,6 +34,7 @@ import (
 	"github.com/shenwei356/bwt/fmi"
 	"github.com/shenwei356/xopen"
 	"github.com/spf13/cobra"
+	"github.com/twotwotwo/sorts/sortutil"
 )
 
 // locateCmd represents the locate command
@@ -54,8 +56,7 @@ Attentions:
      parser accepts comma-separated-values (CSV) for multiple values (motifs).
      Patterns in file do not follow this rule.     
   4. Mismatch is allowed using flag "-m/--max-mismatch",
-     but it's not fast enough for large genome like human genome.
-     Though, it's fast enough for microbial genomes.
+     you can increase the value of "-j/--threads" to accelerate processing.
   5. When using flag --circular, end position of matched subsequence that 
      crossing genome sequence end would be greater than sequence length.
 
@@ -92,6 +93,8 @@ Attentions:
 		hideMatched := getFlagBool(cmd, "hide-matched")
 		circular := getFlagBool(cmd, "circular")
 
+		immediateOutput := getFlagBool(cmd, "immediate-output")
+
 		if config.Alphabet == seq.Protein {
 			onlyPositiveStrand = true
 		}
@@ -100,7 +103,6 @@ Attentions:
 			checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
 		}
 
-		var sfmi *fmi.FMIndex
 		if mismatches > 0 {
 			if degenerate {
 				checkError(fmt.Errorf("flag -d (--degenerate) not allowed when giving flag -m (--max-mismatch)"))
@@ -111,7 +113,7 @@ Attentions:
 			if nonGreedy && !quiet {
 				log.Infof("flag -G (--non-greedy) ignored when giving flag -m (--max-mismatch)")
 			}
-			sfmi = fmi.NewFMIndex()
+
 		}
 		if useFMI {
 			if degenerate {
@@ -120,7 +122,6 @@ Attentions:
 			if useRegexp {
 				checkError(fmt.Errorf("flag -r (--use-regexp) ignored when giving flag -F (--use-fmi)"))
 			}
-			sfmi = fmi.NewFMIndex()
 		}
 
 		// prepare pattern
@@ -216,7 +217,6 @@ Attentions:
 						}
 						re, err := regexp.Compile(s)
 						checkError(err)
-						fmt.Println(s, re)
 						regexps[p] = re
 					} else if bytes.Index(patterns[p], []byte(".")) >= 0 ||
 						!(seq.DNAredundant.IsValid(patterns[p]) == nil ||
@@ -239,20 +239,309 @@ Attentions:
 				outfh.WriteString("seqID\tpatternName\tpattern\tstrand\tstart\tend\tmatched\n")
 			}
 		}
+
+		// -------------------------------------------------------------------
+		// only for m > 0, where FMI is slow
+
+		var record *fastx.Record
+		var fastxReader *fastx.Reader
+		_onlyPositiveStrand := onlyPositiveStrand
+
+		if mismatches > 0 || useFMI {
+			type Arecord struct {
+				id     uint64
+				ok     bool
+				record []string
+			}
+
+			var wg sync.WaitGroup
+			ch := make(chan *Arecord, config.Threads)
+			tokens := make(chan int, config.Threads)
+
+			done := make(chan int)
+			go func() {
+				m := make(map[uint64]*Arecord, config.Threads)
+				var id, _id uint64
+				var ok bool
+				var _r *Arecord
+				var row string
+
+				id = 1
+				for r := range ch {
+					_id = r.id
+
+					if _id == id { // right there
+						if r.ok {
+							for _, row = range r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						id++
+						continue
+					}
+
+					m[_id] = r // save for later check
+
+					if _r, ok = m[id]; ok { // check buffered
+						if _r.ok {
+							for _, row = range _r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+						delete(m, id)
+						id++
+					}
+				}
+
+				if len(m) > 0 {
+					ids := make([]uint64, len(m))
+					i := 0
+					for _id = range m {
+						ids[i] = _id
+						i++
+					}
+					sortutil.Uint64s(ids)
+					for _, _id = range ids {
+						_r = m[_id]
+
+						if _r.ok {
+							for _, row = range _r.record {
+								outfh.WriteString(row)
+							}
+
+							if immediateOutput {
+								outfh.Flush()
+							}
+						}
+					}
+				}
+				done <- 1
+			}()
+
+			var id uint64
+			for _, file := range files {
+				fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+				checkError(err)
+
+				checkAlphabet := true
+				for {
+					record, err = fastxReader.Read()
+					if err != nil {
+						if err == io.EOF {
+							break
+						}
+						checkError(err)
+						break
+					}
+
+					if checkAlphabet {
+						if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+							_onlyPositiveStrand = true
+						}
+						checkAlphabet = false
+					}
+
+					tokens <- 1
+					wg.Add(1)
+					id++
+					go func(record *fastx.Record, id uint64) {
+						defer func() {
+							wg.Done()
+							<-tokens
+						}()
+
+						var seqRP *seq.Seq
+						var l int
+						var loc []int
+						var i, begin, end int
+						var pSeq []byte
+						var pName string
+						var sfmi *fmi.FMIndex
+						sfmi = fmi.NewFMIndex()
+						results := make([]string, 0, 2)
+
+						if !(degenerate || useRegexp) && ignoreCase {
+							record.Seq.Seq = bytes.ToLower(record.Seq.Seq)
+						}
+
+						l = len(record.Seq.Seq)
+
+						if circular { // concat two copies of sequence
+							record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
+						}
+
+						_, err = sfmi.Transform(record.Seq.Seq)
+						if err != nil {
+							checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
+						}
+
+						for pName, pSeq = range patterns {
+							loc, err = sfmi.Locate(pSeq, mismatches)
+							if err != nil {
+								checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
+							}
+							for _, i = range loc {
+								if circular && i+1 > l { // 2nd clone of original part
+									continue
+								}
+
+								begin = i + 1
+
+								end = i + len(pSeq)
+								if i+len(pSeq) > len(record.Seq.Seq) {
+									continue
+								}
+								if outFmtGTF {
+									results = append(results, fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%s\t%s\tgene_id \"%s\"; \n",
+										record.ID,
+										"SeqKit",
+										"location",
+										begin,
+										end,
+										0,
+										"+",
+										".",
+										pName))
+								} else if outFmtBED {
+									results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\n",
+										record.ID,
+										begin-1,
+										end,
+										pName,
+										0,
+										"+"))
+								} else {
+									if hideMatched {
+										results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\n",
+											record.ID,
+											pName,
+											patterns[pName],
+											"+",
+											begin,
+											end))
+									} else {
+										results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\t%s\n",
+											record.ID,
+											pName,
+											patterns[pName],
+											"+",
+											begin,
+											end,
+											record.Seq.Seq[i:i+len(pSeq)]))
+									}
+								}
+							}
+						}
+
+						if _onlyPositiveStrand {
+							ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+							return
+						}
+
+						seqRP = record.Seq.RevCom()
+
+						_, err = sfmi.Transform(seqRP.Seq)
+						if err != nil {
+							checkError(fmt.Errorf("fail to build FMIndex for reverse complement sequence: %s", record.Name))
+						}
+						for pName, pSeq = range patterns {
+							loc, err = sfmi.Locate(pSeq, mismatches)
+							if err != nil {
+								checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
+							}
+							for _, i = range loc {
+								if circular && i+1 > l { // 2nd clone of original part
+									continue
+								}
+
+								begin = l - i - len(pSeq) + 1
+								end = l - i
+								if i+len(pSeq) > len(record.Seq.Seq) {
+									continue
+								}
+								if outFmtGTF {
+									results = append(results, fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%s\t%s\tgene_id \"%s\"; \n",
+										record.ID,
+										"SeqKit",
+										"location",
+										begin,
+										end,
+										0,
+										"-",
+										".",
+										pName))
+								} else if outFmtBED {
+									results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\n",
+										record.ID,
+										begin-1,
+										end,
+										pName,
+										0,
+										"-"))
+								} else {
+									if hideMatched {
+										results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\n",
+											record.ID,
+											pName,
+											patterns[pName],
+											"-",
+											begin,
+											end))
+									} else {
+										results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\t%s\n",
+											record.ID,
+											pName,
+											patterns[pName],
+											"-",
+											begin,
+											end,
+											seqRP.Seq[i:i+len(pSeq)]))
+									}
+								}
+							}
+						}
+
+						ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+					}(record.Clone(), id)
+				}
+			}
+
+			wg.Wait()
+			close(ch)
+			<-done
+
+			return
+		}
+
+		// -------------------------------------------------------------------
+
 		var seqRP *seq.Seq
 		var offset, l, lpatten int
 		var loc []int
 		var locs, locsNeg [][2]int
 		var i, begin, end int
 		var flag bool
-		var record *fastx.Record
-		var fastxReader *fastx.Reader
 		var pSeq, p []byte
 		var pName string
 		var re *regexp.Regexp
+		var sfmi *fmi.FMIndex
+		if mismatches > 0 || useFMI {
+			sfmi = fmi.NewFMIndex()
+		}
+
 		for _, file := range files {
 			fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
 			checkError(err)
+
+			checkAlphabet := true
 			for {
 				record, err = fastxReader.Read()
 				if err != nil {
@@ -263,6 +552,13 @@ Attentions:
 					break
 				}
 
+				if checkAlphabet {
+					if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+						_onlyPositiveStrand = true
+					}
+					checkAlphabet = false
+				}
+
 				if !(degenerate || useRegexp) && ignoreCase {
 					record.Seq.Seq = bytes.ToLower(record.Seq.Seq)
 				}
@@ -273,10 +569,6 @@ Attentions:
 					record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
 				}
 
-				if !onlyPositiveStrand {
-					seqRP = record.Seq.RevCom()
-				}
-
 				if mismatches > 0 || useFMI {
 					_, err = sfmi.Transform(record.Seq.Seq)
 					if err != nil {
@@ -341,10 +633,12 @@ Attentions:
 						}
 					}
 
-					if onlyPositiveStrand {
+					if _onlyPositiveStrand {
 						continue
 					}
 
+					seqRP = record.Seq.RevCom()
+
 					_, err = sfmi.Transform(seqRP.Seq)
 					if err != nil {
 						checkError(fmt.Errorf("fail to build FMIndex for reverse complement sequence: %s", record.Name))
@@ -406,6 +700,10 @@ Attentions:
 						}
 					}
 
+					if immediateOutput {
+						outfh.Flush()
+					}
+
 					continue
 				}
 
@@ -507,6 +805,8 @@ Attentions:
 						continue
 					}
 
+					seqRP = record.Seq.RevCom()
+
 					locsNeg = make([][2]int, 0, 1000)
 
 					offset = 0
@@ -600,6 +900,10 @@ Attentions:
 						}
 					}
 				}
+
+				if immediateOutput {
+					outfh.Flush()
+				}
 			}
 		}
 	},
@@ -622,4 +926,5 @@ func init() {
 	locateCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
 	locateCmd.Flags().BoolP("hide-matched", "M", false, "do not show matched sequences")
 	locateCmd.Flags().BoolP("circular", "c", false, `circular genome. type "seqkit locate -h" for details`)
+	locateCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
 }


=====================================
seqkit/cmd/tab2fx.go
=====================================
@@ -21,12 +21,11 @@
 package cmd
 
 import (
-	"bytes"
+	"bufio"
 	"fmt"
 	"runtime"
 	"strings"
 
-	"github.com/shenwei356/breader"
 	"github.com/shenwei356/util/byteutil"
 	"github.com/shenwei356/xopen"
 	"github.com/spf13/cobra"
@@ -53,92 +52,57 @@ var tab2faCmd = &cobra.Command{
 		checkError(err)
 		defer outfh.Close()
 
-		type Slice []string
-		fn := func(line string) (interface{}, bool, error) {
-			line = strings.TrimRight(line, "\r\n")
-
-			if line == "" {
-				return "", false, nil
-			}
-			// check comment line
-			isCommentLine := false
-			for _, p := range commentPrefixes {
-				if strings.HasPrefix(line, p) {
-					isCommentLine = true
-					break
-				}
-			}
-			if isCommentLine {
-				return "", false, nil
-			}
-
-			items := strings.Split(line, "\t")
-			if len(items) < 2 {
-				return Slice(items), false, fmt.Errorf("at least two columns needed: %s", line)
-			}
-			if len(items) > 2 {
-				return Slice(items[0:3]), true, nil
-			}
-			return Slice(items[0:2]), true, nil
-		}
-
+		var line, p string
+		var items []string
+		var isCommentLine, isFastq bool
+		var scanner *bufio.Scanner
+		var fh *xopen.Reader
 		for _, file := range files {
-			reader, err := breader.NewBufferedReader(file, config.Threads, 10, fn)
+			fh, err = xopen.Ropen(file)
 			checkError(err)
-			var text []byte
-			var b *bytes.Buffer
-			isFastq := false
-			for chunk := range reader.Ch {
-				if chunk.Err != nil {
-					checkError(chunk.Err)
-					break
-				}
-				for _, data := range chunk.Data {
-					items := data.(Slice)
-					if len(items) == 3 && (len(items[2]) > 0 || isFastq) { // fastq
-						isFastq = true
-						outfh.WriteString(fmt.Sprintf("@%s\n", items[0]))
-
-						// 	outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
 
-						// if bufferedByteSliceWrapper == nil {
-						// 	bufferedByteSliceWrapper = byteutil.NewBufferedByteSliceWrapper2(1, len(items[1]), lineWidth)
-						// }
-						// text, b = bufferedByteSliceWrapper.Wrap([]byte(items[1]), lineWidth)
-						// outfh.Write(text)
-						// outfh.Flush()
-						// bufferedByteSliceWrapper.Recycle(b)
+			scanner = bufio.NewScanner(fh)
 
-						outfh.WriteString(items[1]) // seq
+			isFastq = false
+			for scanner.Scan() {
+				line = strings.TrimRight(scanner.Text(), "\r\n")
 
-						outfh.WriteString("\n+\n")
-
-						// outfh.Write(byteutil.WrapByteSlice([]byte(items[2]), lineWidth))
-
-						// text, b = bufferedByteSliceWrapper.Wrap([]byte(items[2]), lineWidth)
-						// outfh.Write(text)
-						// outfh.Flush()
-						// bufferedByteSliceWrapper.Recycle(b)
-
-						outfh.WriteString(items[2]) // qual
-
-						outfh.WriteString("\n")
-					} else {
-						outfh.WriteString(fmt.Sprintf(">%s\n", items[0]))
+				if line == "" {
+					continue
+				}
+				// check comment line
+				isCommentLine = false
+				for _, p = range commentPrefixes {
+					if strings.HasPrefix(line, p) {
+						isCommentLine = true
+						break
+					}
+				}
+				if isCommentLine {
+					continue
+				}
 
-						// outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
-						if bufferedByteSliceWrapper == nil {
-							bufferedByteSliceWrapper = byteutil.NewBufferedByteSliceWrapper2(1, len(items[1]), lineWidth)
-						}
-						text, b = bufferedByteSliceWrapper.Wrap([]byte(items[1]), lineWidth)
-						outfh.Write(text)
-						outfh.Flush()
-						bufferedByteSliceWrapper.Recycle(b)
+				items = strings.Split(line, "\t")
+				if len(items) < 2 {
+					checkError(fmt.Errorf("at least two columns needed: %s", line))
+				}
 
-						outfh.WriteString("\n")
-					}
+				if len(items) == 3 && (len(items[2]) > 0 || isFastq) { // fastq
+					isFastq = true
+					outfh.WriteString(fmt.Sprintf("@%s\n", items[0]))
+					outfh.WriteString(items[1]) // seq
+					outfh.WriteString("\n+\n")
+					outfh.WriteString(items[2]) // qual
+
+					outfh.WriteString("\n")
+				} else {
+					outfh.WriteString(fmt.Sprintf(">%s\n", items[0]))
+					outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
+					outfh.WriteString("\n")
 				}
+
 			}
+			checkError(scanner.Err())
 		}
 	},
 }


=====================================
seqkit/cmd/version.go
=====================================
@@ -29,7 +29,7 @@ import (
 )
 
 // VERSION of seqkit
-const VERSION = "0.15.0"
+const VERSION = "0.16.0"
 
 // versionCmd represents the version command
 var versionCmd = &cobra.Command{



View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/e77f29a7b82e1daaca7ff0a0a0edc08ec577d880

-- 
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/e77f29a7b82e1daaca7ff0a0a0edc08ec577d880
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210501/e2c9da35/attachment-0001.htm>


More information about the debian-med-commit mailing list