[med-svn] [Git][med-team/seqkit][upstream] New upstream version 0.16.0+ds
Nilesh Patra
gitlab at salsa.debian.org
Sat May 1 19:36:24 BST 2021
Nilesh Patra pushed to branch upstream at Debian Med / seqkit
Commits:
e77f29a7 by Nilesh Patra at 2021-05-01T22:40:17+05:30
New upstream version 0.16.0+ds
- - - - -
17 changed files:
- CHANGES.md → CHANGELOG.md
- README.md
- doc/docs/download.md
- doc/docs/usage.md
- doc/mkdocs.yml
- go.mod
- go.sum
- seqkit/cmd/amplicon.go
- seqkit/cmd/bam_toolbox.go
- seqkit/cmd/bed.go
- seqkit/cmd/fx2tab.go
- seqkit/cmd/genautocomplete.go
- seqkit/cmd/grep.go
- + seqkit/cmd/head-genome.go
- seqkit/cmd/locate.go
- seqkit/cmd/tab2fx.go
- seqkit/cmd/version.go
Changes:
=====================================
CHANGES.md → CHANGELOG.md
=====================================
@@ -1,3 +1,27 @@
+
+
+- [SeqKit v0.16.0](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.16.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+ - new command `seqkit head-genome`:
+ - print sequences of the first genome with common prefixes in name
+ - `seqkit grep/locate/amplicon -m`
+ - *much faster (300-400x) searching with mismatch allowed* by optimizing FM-indexing and parallelization.
+ - new flag `-I/--immediate-output`.
+ - `seqkit grep/locate`:
+ - fix bug of `-m` when querying contains letters not in alphabet, usually for protein sequences. [#178](https://github.com/shenwei356/seqkit/issues/178), [#179](https://github.com/shenwei356/seqkit/issues/179)
+ - onply search on positive strand when searching unlimited or protein sequences.
+ - `seqkit locate`:
+ - removing debug info for `-r` introduced in a0f6b6e. [#180](https://github.com/shenwei356/seqkit/issues/180)
+ - `seqkit amplicon`:
+ - fix bug of `-m`, when mismatch is allowed.
+ - `seqkit fx2tab`:
+ - new flag `-C/--base-count` for counting bases. [#183](https://github.com/shenwei356/seqkit/issues/183)
+ - `seqkit tab2fx`:
+ - fix a rare bug. [#197](https://github.com/shenwei356/seqkit/issues/197)
+ - `seqkit subseq`:
+ - fix bug for BED with empty columns. [#195](https://github.com/shenwei356/seqkit/issues/195)
+ - `seqkit genautocomplete`:
+ - **support bash|zsh|fish|powershell**.
- [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
- `seqkit grep/locate`: update help message.
=====================================
README.md
=====================================
@@ -19,6 +19,7 @@ and
- [![Gitter](https://badges.gitter.im/seqkit/community.svg)](https://gitter.im/seqkit/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge)
+
## Introduction
FASTA and FASTQ are basic and ubiquitous formats for storing nucleotide and
@@ -47,7 +48,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
- [Features](#features)
- [Subcommands](#subcommands)
- [Installation](#installation)
-- [Bash-completion](#bash-completion)
+- [Command-line completion](#command-line-completion)
- [Technical details and guides for use](#technical-details-and-guides-for-use)
- [Usage && Examples](#usage--examples)
- [Benchmark](#benchmark)
@@ -118,7 +119,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
## Subcommands
-34 functional subcommands in total.
+35 functional subcommands in total.
**Sequence and subsequence**
@@ -153,6 +154,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
**Set operations**
- [`head`](https://bioinf.shenwei.me/seqkit/usage/#head) print first N FASTA/Q records
+- [`head-genome`](https://bioinf.shenwei.me/seqkit/usage/#head-genome) print sequences of the first genome with common prefixes in name
- [`range`](https://bioinf.shenwei.me/seqkit/usage/#range) print FASTA/Q records in a range (start:end)
- [`sample`](https://bioinf.shenwei.me/seqkit/usage/#sample) sample sequences by number or proportion
- [`rmdup`](https://bioinf.shenwei.me/seqkit/usage/#rmdup) remove duplicated sequences by id/name/sequence
@@ -178,7 +180,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
**Misc**
- `version` print version information and check for update
-- `genautocomplete` generate shell autocompletion script
+- `genautocomplete` generate shell autocompletion script (bash|zsh|fish|powershell)
## Installation
@@ -234,24 +236,33 @@ Run the following commands:
docker run -it shenwei356/seqkit:latest
-## Bash-completion
+## Command-line completion
+
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+ # generate completion shell
+ seqkit genautocomplete --shell bash
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+ # configure if never did.
+ # install bash-completion if the "complete" command is not found.
+ echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+ echo "source ~/.bash_completion" >> ~/.bashrc
-Howto:
+Zsh:
-1. run: `seqkit genautocomplete`
+ # generate completion shell
+ seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
-2. create and edit `~/.bash_completion` file if you don't have it.
+ # configure if never did
+ echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+ echo "autoload -U compinit; compinit" >> ~/.zshrc
- nano ~/.bash_completion
+fish:
- add the following:
+ seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
- for bcfile in ~/.bash_completion.d/* ; do
- . $bcfile
- done
## Technical details and guides for use
=====================================
doc/docs/download.md
=====================================
@@ -6,14 +6,28 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
## Latest Version
-- [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
-[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
- - `seqkit grep/locate`: update help message.
- - `seqkit grep`: **search on both strand when searching by sequence**.
- - `seqkit split2`: fix redundant log when using `-s`.
- - `seqkit bam`: new field `RightSoftClipSeq`. [#172](https://github.com/shenwei356/seqkit/pull/172)
- - `seqkit sample -2`: remove extra `\n`. [#173](https://github.com/shenwei356/seqkit/issues/173)
- - `seqkit split2 -l`: fix bug for splitting by accumulative length, this bug occurs when the first record is longer than `-l`, no sequences are lost.
+- [SeqKit v0.16.0](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.16.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.16.0)
+ - new command `seqkit head-genome`:
+ - print sequences of the first genome with common prefixes in name
+ - `seqkit grep/locate/amplicon -m`
+ - *much faster (300-400x) searching with mismatch allowed* by optimizing FM-indexing and parallelization.
+ - new flag `-I/--immediate-output`.
+ - `seqkit grep/locate`:
+ - fix bug of `-m` when querying contains letters not in alphabet, usually for protein sequences. [#178](https://github.com/shenwei356/seqkit/issues/178), [#179](https://github.com/shenwei356/seqkit/issues/179)
+ - onply search on positive strand when searching unlimited or protein sequences.
+ - `seqkit locate`:
+ - removing debug info for `-r` introduced in a0f6b6e. [#180](https://github.com/shenwei356/seqkit/issues/180)
+ - `seqkit amplicon`:
+ - fix bug of `-m`, when mismatch is allowed.
+ - `seqkit fx2tab`:
+ - new flag `-C/--base-count` for counting bases. [#183](https://github.com/shenwei356/seqkit/issues/183)
+ - `seqkit tab2fx`:
+ - fix a rare bug. [#197](https://github.com/shenwei356/seqkit/issues/197)
+ - `seqkit subseq`:
+ - fix bug for BED with empty columns. [#195](https://github.com/shenwei356/seqkit/issues/195)
+ - `seqkit genautocomplete`:
+ - **support bash|zsh|fish|powershell**.
### Please cite
@@ -30,13 +44,13 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
OS |Arch |File, 中国镜像 |Download Count
:------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux |32-bit |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_386.tar.gz)
-Linux |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_amd64.tar.gz)
-Linux |**arm64** |[**seqkit_linux_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_linux_arm64.tar.gz)
-macOS |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_amd64.tar.gz)
-macOS |**arm64** |[**seqkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_darwin_arm64.tar.gz)
-Windows|32-bit |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_386.exe.tar.gz)
-Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.15.0/seqkit_windows_amd64.exe.tar.gz)
+Linux |32-bit |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_386.tar.gz)
+Linux |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_amd64.tar.gz)
+Linux |**arm64** |[**seqkit_linux_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_linux_arm64.tar.gz)
+macOS |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_amd64.tar.gz)
+macOS |**arm64** |[**seqkit_darwin_arm64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_arm64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_arm64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_arm64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_darwin_arm64.tar.gz)
+Windows|32-bit |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_386.exe.tar.gz)
+Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.16.0/seqkit_windows_amd64.exe.tar.gz)
## Installation
@@ -87,25 +101,41 @@ Run the following commands:
## Bash-completion
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+ # generate completion shell
+ seqkit genautocomplete --shell bash
-Howto:
+ # configure if never did.
+ # install bash-completion if the "complete" command is not found.
+ echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+ echo "source ~/.bash_completion" >> ~/.bashrc
-1. run: `seqkit genautocomplete`
+Zsh:
-2. create and edit `~/.bash_completion` file if you don't have it.
+ # generate completion shell
+ seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
- nano ~/.bash_completion
+ # configure if never did
+ echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+ echo "autoload -U compinit; compinit" >> ~/.zshrc
- add the following:
+fish:
- for bcfile in ~/.bash_completion.d/* ; do
- . $bcfile
- done
+ seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
## Release History
+- [SeqKit v0.15.0](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.15.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.15.0)
+ - `seqkit grep/locate`: update help message.
+ - `seqkit grep`: **search on both strand when searching by sequence**.
+ - `seqkit split2`: fix redundant log when using `-s`.
+ - `seqkit bam`: new field `RightSoftClipSeq`. [#172](https://github.com/shenwei356/seqkit/pull/172)
+ - `seqkit sample -2`: remove extra `\n`. [#173](https://github.com/shenwei356/seqkit/issues/173)
+ - `seqkit split2 -l`: fix bug for splitting by accumulative length, this bug occurs when the first record is longer than `-l`, no sequences are lost.
- [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
- new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.
=====================================
doc/docs/usage.md
=====================================
@@ -39,6 +39,7 @@
**Set operations**
- [head](#head)
+- [head-genome](#head-genome)
- [range](#range)
- [sample](#sample)
- [rmdup](#rmdup)
@@ -968,6 +969,7 @@ Flags:
-a, --alphabet print alphabet letters
-q, --avg-qual print average quality of a read
-B, --base-content strings print base content. (case ignored, multiple values supported) e.g. -B AT -B N
+ -C, --base-count strings print base count. (case ignored, multiple values supported) e.g. -C AT -C N
-I, --case-sensitive calculate case sensitive base content
-g, --gc print GC content
-G, --gc-skew print GC-Skew
@@ -1379,9 +1381,8 @@ Attentions:
for partly matching.
2. When searching by sequences, it's partly matching, and both positive
and negative strands are searched.
- Mismatch is allowed using flag "-m/--max-mismatch",
- but it's not fast enough for large genome like human genome.
- Though, it's fast enough for microbial genomes.
+ Mismatch is allowed using flag "-m/--max-mismatch", you can increase
+ the value of "-j/--threads" to accelerate processing.
3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
But do not use degenerate bases/residues in regular expression, you need
convert them to regular expression, e.g., change "N" or "X" to "."..
@@ -1424,6 +1425,7 @@ Flags:
--delete-matched delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
-h, --help help for grep
-i, --ignore-case ignore case
+ -I, --immediate-output print output immediately, do not use write buffer
-v, --invert-match invert the sense of matching, to select non-matching records
-m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
-P, --only-positive-strand only search on positive strand
@@ -1509,13 +1511,14 @@ Examples
user 0m0.100s
sys 0m0.017s
- $ time cat hairpin.fa.gz | seqkit grep -s -i -p aggcg -m 1 | seqkit stats
+ $ time zcat hairpin.fa.gz | seqkit grep -s -i -p aggcg -m 1 | seqkit stats
file format type num_seqs sum_len min_len avg_len max_len
- - FASTA RNA 17,168 1,881,005 39 109.6 2,354
+ - FASTA RNA 22,290 2,375,819 39 106.6 2,354
+
+ real 0m1.081s
+ user 0m1.305s
+ sys 0m0.158s
- real 0m0.864s
- user 0m0.941s
- sys 0m0.014s
1. Extract sequences starting with AGGCG
@@ -1553,8 +1556,7 @@ Attentions:
parser accepts comma-separated-values (CSV) for multiple values (motifs).
Patterns in file do not follow this rule.
4. Mismatch is allowed using flag "-m/--max-mismatch",
- but it's not fast enough for large genome like human genome.
- Though, it's fast enough for microbial genomes.
+ you can increase the value of "-j/--threads" to accelerate processing.
5. When using flag --circular, end position of matched subsequence that
crossing genome sequence end would be greater than sequence length.
@@ -1569,6 +1571,7 @@ Flags:
-h, --help help for locate
-M, --hide-matched do not show matched sequences
-i, --ignore-case ignore case
+ -I, --immediate-output print output immediately, do not use write buffer
-m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
-G, --non-greedy non-greedy mode, faster but may miss motifs overlapping with others
-P, --only-positive-strand only search on positive strand
@@ -1777,6 +1780,7 @@ retrieve amplicon (or specific region around it) via primer(s).
Attentions:
1. Only one (the longest) matching location is returned for every primer pair.
2. Mismatch is allowed, but the mismatch location (5' or 3') is not controled.
+ You can increase the value of "-j/--threads" to accelerate processing.
3. Degenerate bases/residues like "RYMM.." are also supported.
But do not use degenerate bases/residues in regular expression, you need
convert them to regular expression, e.g., change "N" or "X" to "."..
@@ -1833,6 +1837,7 @@ Flags:
-f, --flanking-region region is flanking region
-F, --forward string forward primer (5'-primer-3'), degenerate bases allowed
-h, --help help for amplicon
+ -I, --immediate-output print output immediately, do not use write buffer
-m, --max-mismatch int max mismatch when matching primers, no degenerate bases allowed
-P, --only-positive-strand only search on positive strand
-p, --primer-file string 3- or 2-column tabular primer file, with first column as primer name
@@ -2406,12 +2411,45 @@ Examples
+
HIHIIIIIHIIHGHHIHHIIIIIIIIIIIIIIIHHIIIIIHHIHIIIIIGIHIIIIHHHHHHGHIHIIIIIIIII
+## head-genome
+
+Usage
+
+```text
+
+print sequences of the first genome with common prefixes in name
+
+For a FASTA file containing multiple contigs of strains (see example below),
+these's no list of IDs available for retrieving sequences of a certain strain,
+while descriptions of each strain share the same prefix.
+
+This command is used to restrieve sequences of the first strain,
+i.e., "Vibrio cholerae strain M29".
+
+>NZ_JFGR01000001.1 Vibrio cholerae strain M29 Contig_1, whole genome shotgun sequence
+>NZ_JFGR01000002.1 Vibrio cholerae strain M29 Contig_2, whole genome shotgun sequence
+>NZ_JFGR01000003.1 Vibrio cholerae strain M29 Contig_3, whole genome shotgun sequence
+>NZ_JSTP01000001.1 Vibrio cholerae strain 2012HC-12 NODE_79, whole genome shotgun sequence
+>NZ_JSTP01000002.1 Vibrio cholerae strain 2012HC-12 NODE_78, whole genome shotgun sequence
+
+Attention:
+
+ 1. Sequences in file should be well organized.
+
+Usage:
+ seqkit head-genome [flags]
+
+Flags:
+ -h, --help help for head-genome
+ -m, --mini-common-words int minimal shared prefix words (default 4)
+```
+
## range
Usage
-``` text
+```text
print FASTA/Q records in a range (start:end)
Usage:
@@ -3338,22 +3376,30 @@ Usage
``` text
generate shell autocompletion script
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
+
+Bash:
+
+ # generate completion shell
+ seqkit genautocomplete --shell bash
-Howto:
+ # configure if never did.
+ # install bash-completion if the "complete" command is not found.
+ echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+ echo "source ~/.bash_completion" >> ~/.bashrc
-1. run: seqkit genautocomplete
+Zsh:
-2. create and edit ~/.bash_completion file if you don't have it.
+ # generate completion shell
+ seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
- nano ~/.bash_completion
+ # configure if never did
+ echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+ echo "autoload -U compinit; compinit" >> ~/.zshrc
- add the following:
+fish:
- for bcfile in ~/.bash_completion.d/* ; do
- . $bcfile
- done
+ seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
Usage:
seqkit genautocomplete [flags]
=====================================
doc/mkdocs.yml
=====================================
@@ -1,5 +1,5 @@
site_name: SeqKit - Ultrafast FASTA/Q kit
-pages:
+nav:
- Home: index.md
- Download: download.md
- Documents:
@@ -12,17 +12,17 @@ pages:
# theme: cinder
theme:
- name: 'material'
+ name: material
palette:
- primary: 'teal'
- accent: 'blue grey'
- logo:
- icon: 'school'
- favicon: 'files/favicon.ico'
+ primary: teal
+ accent: blue grey
+ icon:
+ logo: material/school
+ favicon: files/favicon.ico
feature:
- tabs: false
+ navigation.tabs: false
extra:
- manifest: 'manifest.webmanifest'
+ manifest: manifest.webmanifest
repo_url: https://github.com/shenwei356/seqkit
site_description: SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
=====================================
go.mod
=====================================
@@ -7,30 +7,25 @@ require (
github.com/biogo/hts v1.2.2
github.com/bsipos/thist v1.0.0
github.com/cespare/xxhash v1.1.0
- github.com/cznic/mathutil v0.0.0-20181122101859-297441e03548 // indirect
github.com/cznic/sortutil v0.0.0-20181122101858-f5f958428db8
github.com/dustin/go-humanize v1.0.0
- github.com/edsrzf/mmap-go v1.0.0 // indirect
github.com/fsnotify/fsnotify v1.4.9
github.com/iafan/cwalk v0.0.0-20191125092548-dd7f505d2f66
- github.com/klauspost/compress v1.11.4 // indirect
- github.com/klauspost/pgzip v1.2.5 // indirect
github.com/logrusorgru/aurora v2.0.3+incompatible
github.com/mattn/go-colorable v0.1.8
github.com/mattn/go-isatty v0.0.12
github.com/mattn/go-runewidth v0.0.9 // indirect
github.com/mitchellh/go-homedir v1.1.0
github.com/pkg/errors v0.9.1
- github.com/remyoudompheng/bigfft v0.0.0-20200410134404-eec4a21b6bb0 // indirect
- github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476
- github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403 // indirect
- github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179
- github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90
+ github.com/shenwei356/bio v0.1.0
+ github.com/shenwei356/breader v0.1.0
+ github.com/shenwei356/bwt v0.5.1
github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba
github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d // indirect
- github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd
+ github.com/shenwei356/util v0.3.0
github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992
github.com/smallfish/simpleyaml v0.0.0-20170911015856-a32031077861
- github.com/spf13/cobra v1.1.1
+ github.com/spf13/cobra v1.1.3
github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939
+ github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553
)
=====================================
go.sum
=====================================
@@ -139,8 +139,9 @@ github.com/konsorten/go-windows-terminal-sequences v1.0.1/go.mod h1:T0+1ngSBFLxv
github.com/kortschak/utter v0.0.0-20190412033250-50fe362e6560/go.mod h1:oDr41C7kH9wvAikWyFhr6UFr8R7nelpmCF5XR5rL7I8=
github.com/kr/logfmt v0.0.0-20140226030751-b84e30acd515/go.mod h1:+0opPa2QZZtGFBFZlji/RkVcI2GknAs/DXo4wKdlNEc=
github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=
-github.com/kr/pretty v0.2.0 h1:s5hAObm+yFO5uHYt5dYjxi2rXrsnmRpJx4OYvIWUaQs=
github.com/kr/pretty v0.2.0/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
+github.com/kr/pretty v0.2.1 h1:Fmg33tUaq4/8ym9TJN1x7sLJnHVwhP33CNkpYV/7rwI=
+github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=
github.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=
github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=
@@ -194,20 +195,20 @@ github.com/rogpeppe/go-internal v1.3.0/go.mod h1:M8bDsm7K2OlrFYOpmOWEs/qY81heoFR
github.com/russross/blackfriday/v2 v2.0.1/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=
github.com/ryanuber/columnize v0.0.0-20160712163229-9b3edd62028f/go.mod h1:sm1tb6uqfes/u+d4ooFouqFdy9/2g9QGwK3SQygK0Ts=
github.com/sean-/seed v0.0.0-20170313163322-e2103e2c3529/go.mod h1:DxrIzT+xaE7yg65j358z/aeFdxmN0P9QXhEzd20vsDc=
-github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476 h1:PE3sUBFeHWFNLsrQ7Blyk4duI+emVGKQ1+iYo4ZalKk=
-github.com/shenwei356/bio v0.0.0-20201213090627-18e3e643a476/go.mod h1:tKIebGVfqYwgLIoXohkLiQO685B2d3l9vTf1cjRZaVk=
+github.com/shenwei356/bio v0.1.0 h1:VDnI28zcdybywdn6/tcZvplAJ1IxOAAYrTJhhTB1SLQ=
+github.com/shenwei356/bio v0.1.0/go.mod h1:NgFauYHlpmjCYEf2XP8foITht6ej6poggQkILpjraN4=
github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403 h1:/3JklLnHXiWUBxWc3joQYavDQJpncRhRA909cUb7eOw=
github.com/shenwei356/bpool v0.0.0-20160710042833-f9e0ee4d0403/go.mod h1:YkgdTWfNnJgv5HVJbVSDmxQtkK3/jZWDoqcG26BVU8k=
-github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179 h1:MZiEG6aN8j1A0jXzCzyZM2BM6ZHbePYiwR4FXgUTCCk=
-github.com/shenwei356/breader v0.0.0-20170924140440-21f0a70fe179/go.mod h1:Yb9qYJ2qJolWeqIVS1LG5ylHTLI5ZfbVFEJJouF2DsY=
-github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90 h1:c8Er9GmVclwt002/ZxVx7WOThpmPKNR4rehAXxICTII=
-github.com/shenwei356/bwt v0.0.0-20200418151221-ae79c9858c90/go.mod h1:q4esDgocpgU8CcnXuPNun+hGv+VDXGX+NnXHQaHGRCU=
+github.com/shenwei356/breader v0.1.0 h1:g7q33VI15Mroj2TTab06CF6prbQAez+6p8fAoK4qkjo=
+github.com/shenwei356/breader v0.1.0/go.mod h1:YXIrHIPtbJCP6Kv27qGp+cXQl7hyzD0iQrEVYCy/gqw=
+github.com/shenwei356/bwt v0.5.1 h1:pq9KxkJPwzUwPjwWqfTOkUmzwHm8JtMkZxBkqkzuIVo=
+github.com/shenwei356/bwt v0.5.1/go.mod h1:V2hX4adhr4WfFpy2ogKiV8A2WO9FeK0wkq1o3/R91mE=
github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba h1:UvnrxFDPmz7agYX0eQ2JEorTKn1ORnZ9dT5OzbjPvK8=
github.com/shenwei356/go-logging v0.0.0-20171012171522-c6b9702d88ba/go.mod h1:LiqYp/K5yCEWOi7Ux/iOF/kjDxtsdYjOGcKHDbEOXFU=
github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d h1:eeXLHcXyGEr72V1SOSEI7vSzUOTJvHutwF7Ykm+hscQ=
github.com/shenwei356/natsort v0.0.0-20190418160752-600d539c017d/go.mod h1:SiiGiRFyRtV7S9RamOrmQR5gpGIRhWJM1w0EtmuQ1io=
-github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd h1:11sa/TyR/91v63OA9VnyjRontqtsDFZN1wCLNaaAa38=
-github.com/shenwei356/util v0.0.0-20201214054755-2942125340cd/go.mod h1:n3qhc3bQzlqJ2/5v79hgl0Gd3WzJOkI8XcUix25Brdg=
+github.com/shenwei356/util v0.3.0 h1:gTVa3sGwcyGEHgNpXTzdL3MaaJN/bGAypVKSCnT4QfU=
+github.com/shenwei356/util v0.3.0/go.mod h1:n3qhc3bQzlqJ2/5v79hgl0Gd3WzJOkI8XcUix25Brdg=
github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992 h1:RXEEyKj0JL3SrRIYsWIEyy4AwjHbI3I8aDGK6CA4+YI=
github.com/shenwei356/xopen v0.0.0-20181203091311-f4f16ddd3992/go.mod h1:6EQUa6I7Zsl2GQKqcL9qGLrTzVE+oZyly+uhzovQYSk=
github.com/shurcooL/sanitized_anchor_name v1.0.0/go.mod h1:1NzhyTcUVG4SuEtjjoZeVRXNmyL/1OwPU0+IJeTBvfc=
@@ -221,8 +222,8 @@ github.com/spaolacci/murmur3 v0.0.0-20180118202830-f09979ecbc72 h1:qLC7fQah7D6K1
github.com/spaolacci/murmur3 v0.0.0-20180118202830-f09979ecbc72/go.mod h1:JwIasOWyU6f++ZhiEuf87xNszmSA2myDM2Kzu9HwQUA=
github.com/spf13/afero v1.1.2/go.mod h1:j4pytiNVoe2o6bmDsKpLACNPDBIoEAkihy7loJ1B0CQ=
github.com/spf13/cast v1.3.0/go.mod h1:Qx5cxh0v+4UWYiBimWS+eyWzqEqokIECu5etghLkUJE=
-github.com/spf13/cobra v1.1.1 h1:KfztREH0tPxJJ+geloSLaAkaPkr4ki2Er5quFV1TDo4=
-github.com/spf13/cobra v1.1.1/go.mod h1:WnodtKOvamDL/PwE2M4iKs8aMDBZ5Q5klgD3qfVJQMI=
+github.com/spf13/cobra v1.1.3 h1:xghbfqPkxzxP3C/f3n5DdpAbdKLj4ZE4BWQI362l53M=
+github.com/spf13/cobra v1.1.3/go.mod h1:pGADOWyqRD/YMrPZigI/zbliZ2wVD/23d+is3pSWzOo=
github.com/spf13/jwalterweatherman v1.0.0/go.mod h1:cQK4TGJAtQXfYWX+Ddv3mKDzgVb68N+wFjFa4jdeBTo=
github.com/spf13/pflag v1.0.3/go.mod h1:DYY7MBk1bdzusC3SYhjObp+wFpr4gzcvqqNjLnInEg4=
github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=
@@ -236,6 +237,8 @@ github.com/subosito/gotenv v1.2.0/go.mod h1:N0PQaV/YGNqwC0u51sEeR/aUtSLEXKX9iv69
github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939 h1:BhIUXV2ySTLrKgh/Hnts+QTQlIbWtomXt3LMdzME0A0=
github.com/tatsushid/go-prettytable v0.0.0-20141013043238-ed2d14c29939/go.mod h1:omGxs4/6hNjxPKUTjmaNkPzehSnNJOJN6pMEbrlYIT4=
github.com/tmc/grpc-websocket-proxy v0.0.0-20190109142713-0ad062ec5ee5/go.mod h1:ncp9v5uamzpCO7NfCPTXjqaC+bZgJeR0sMTm6dMHP7U=
+github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553 h1:DRC1ubdb3ZmyyIeCSTxjZIQAnpLPfKVgYrLETQuOPjo=
+github.com/twotwotwo/sorts v0.0.0-20160814051341-bf5c1f2b8553/go.mod h1:Rj7Csq/tZ/egz+Ltc2IVpsA5309AmSMEswjkTZmq2Xc=
github.com/ulikunitz/xz v0.5.6/go.mod h1:2bypXElzHzzJZwzH67Y6wb67pO62Rzfn7BSiF4ABRW8=
github.com/xiang90/probing v0.0.0-20190116061207-43a291ad63a2/go.mod h1:UETIi67q53MR2AWcXfiuqkDkRtnGDLqkBTpCHuJHxtU=
go.etcd.io/bbolt v1.3.2/go.mod h1:IbVyRI1SCnLcuJnV2u8VeU0CEYM7e686BmAb1XKL+uU=
@@ -312,8 +315,9 @@ golang.org/x/sys v0.0.0-20190606165138-5da285871e9c/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20190624142023-c5567b49c5d0/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20191005200804-aed5e4c7ecf9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200116001909-b77594299b42/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
-golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae h1:/WDfKMnPU+m5M4xB+6x4kaepxRw6jWvR5iDRdvjHgy8=
golang.org/x/sys v0.0.0-20200223170610-d5e6a3e2c0ae/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
+golang.org/x/sys v0.0.0-20210315160823-c6e025ad8005 h1:pDMpM2zh2MT0kHy037cKlSby2nEhD50SYqwQk76Nm40=
+golang.org/x/sys v0.0.0-20210315160823-c6e025ad8005/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.1-0.20180807135948-17ff2d5776d2/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.3.2/go.mod h1:bEr9sfX3Q8Zfm5fL9x+3itogRgK3+ptLWKqgva+5dAk=
@@ -367,16 +371,17 @@ google.golang.org/grpc v1.21.1/go.mod h1:oYelfM1adQP15Ek0mdvEgi9Df8B9CZIaU1084ij
gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
-gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=
gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk=
+gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q=
gopkg.in/errgo.v2 v2.1.0/go.mod h1:hNsd1EY+bozCKY1Ytp96fpM3vjJbqLJn88ws8XvfDNI=
gopkg.in/ini.v1 v1.51.0/go.mod h1:pNLf8WUiyNEtQjuu5G5vTm06TEv9tsIgeAvK8hOrP4k=
gopkg.in/resty.v1 v1.12.0/go.mod h1:mDo4pnntr5jdWRML875a/NmxYqAlA73dVijT2AXvQQo=
gopkg.in/yaml.v2 v2.0.0-20170812160011-eb3733d160e7/go.mod h1:JAlM8MvJe8wmxCU4Bli9HhUf9+ttbYbLASfIpnQbh74=
gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
-gopkg.in/yaml.v2 v2.2.8 h1:obN1ZagJSUGI0Ek/LBmuj4SNLPfIny3KsKFopxRdj10=
-gopkg.in/yaml.v2 v2.2.8/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=
+gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY=
+gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=
honnef.co/go/tools v0.0.0-20190102054323-c2f93a96b099/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190106161140-3f1c8253044a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
honnef.co/go/tools v0.0.0-20190418001031-e561f6794a2a/go.mod h1:rf3lG4BRIbNafJWhAfAdb/ePZxsR/4RtNHQocxwk9r4=
=====================================
seqkit/cmd/amplicon.go
=====================================
@@ -31,6 +31,7 @@ import (
"sort"
"strconv"
"strings"
+ "sync"
"github.com/shenwei356/bio/featio/gtf"
"github.com/shenwei356/bio/seq"
@@ -40,6 +41,7 @@ import (
"github.com/shenwei356/bwt/fmi"
"github.com/shenwei356/xopen"
"github.com/spf13/cobra"
+ "github.com/twotwotwo/sorts/sortutil"
)
// ampliconCmd represents the amplicon command
@@ -51,6 +53,7 @@ var ampliconCmd = &cobra.Command{
Attentions:
1. Only one (the longest) matching location is returned for every primer pair.
2. Mismatch is allowed, but the mismatch location (5' or 3') is not controled.
+ You can increase the value of "-j/--threads" to accelerate processing.
3. Degenerate bases/residues like "RYMM.." are also supported.
But do not use degenerate bases/residues in regular expression, you need
convert them to regular expression, e.g., change "N" or "X" to "."..
@@ -131,6 +134,8 @@ Examples:
onlyPositiveStrand := getFlagBool(cmd, "only-positive-strand")
outFmtBED := getFlagBool(cmd, "bed")
+ immediateOutput := getFlagBool(cmd, "immediate-output")
+
var list [][3]string
var primers [][3][]byte
@@ -180,13 +185,194 @@ Examples:
usingRegion = true
}
+ strands := []string{"+", "-"}
+
+ // -------------------------------------------------------------------
+ // only for m > 0, where FMI is slow
+
+ if maxMismatch > 0 {
+ type Arecord struct {
+ id uint64
+ ok bool
+ record []string
+ }
+
+ var wg sync.WaitGroup
+ ch := make(chan *Arecord, config.Threads)
+ tokens := make(chan int, config.Threads)
+
+ done := make(chan int)
+ go func() {
+ m := make(map[uint64]*Arecord, config.Threads)
+ var id, _id uint64
+ var ok bool
+ var _r *Arecord
+ var row string
+
+ id = 1
+ for r := range ch {
+ _id = r.id
+
+ if _id == id { // right there
+ if r.ok {
+ for _, row = range r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ id++
+ continue
+ }
+
+ m[_id] = r // save for later check
+
+ if _r, ok = m[id]; ok { // check buffered
+ if _r.ok {
+ for _, row = range _r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ delete(m, id)
+ id++
+ }
+ }
+
+ if len(m) > 0 {
+ ids := make([]uint64, len(m))
+ i := 0
+ for _id = range m {
+ ids[i] = _id
+ i++
+ }
+ sortutil.Uint64s(ids)
+ for _, _id = range ids {
+ _r = m[_id]
+
+ if _r.ok {
+ for _, row = range _r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ }
+ }
+ done <- 1
+ }()
+
+ var id uint64
+ for _, file := range files {
+ var record *fastx.Record
+ var fastxReader *fastx.Reader
+
+ fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+ checkError(err)
+
+ for {
+ record, err = fastxReader.Read()
+ if err != nil {
+ if err == io.EOF {
+ break
+ }
+ checkError(err)
+ break
+ }
+ if fastxReader.IsFastq {
+ config.LineWidth = 0
+ fastx.ForcelyOutputFastq = true
+ }
+
+ tokens <- 1
+ wg.Add(1)
+ id++
+ go func(record *fastx.Record, id uint64) {
+ defer func() {
+ wg.Done()
+ <-tokens
+ }()
+
+ var finder *AmpliconFinder
+ var loc []int
+ var strand string
+ var tmpSeq *seq.Seq
+ var primer [3][]byte
+
+ results := make([]string, 0, 2)
+
+ for _, strand = range strands {
+ if strand == "-" {
+ if onlyPositiveStrand {
+ continue
+ }
+ record.Seq.RevComInplace()
+ }
+
+ for _, primer = range primers {
+ finder, err = NewAmpliconFinder(record.Seq.Seq, primer[1], primer[2], maxMismatch)
+ checkError(err)
+
+ if usingRegion {
+ loc, err = finder.LocateRange(begin, end, fregion, strict)
+ } else {
+ loc, err = finder.Locate()
+ }
+ checkError(err)
+
+ if loc == nil {
+ continue
+ }
+
+ if outFmtBED {
+ results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\t%s\n",
+ record.ID,
+ loc[0]-1,
+ loc[1],
+ primer[0],
+ 0,
+ strand,
+ record.Seq.SubSeq(loc[0], loc[1]).Seq))
+
+ continue
+ }
+ tmpSeq = record.Seq
+
+ record.Seq = record.Seq.SubSeq(loc[0], loc[1])
+
+ results = append(results, string(record.Format(config.LineWidth)))
+
+ record.Seq = tmpSeq
+ }
+ }
+
+ ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+ }(record.Clone(), id)
+ }
+
+ config.LineWidth = lineWidth
+ }
+
+ wg.Wait()
+ close(ch)
+ <-done
+
+ return
+ }
+
var record *fastx.Record
var fastxReader *fastx.Reader
var finder *AmpliconFinder
var loc []int
- strands := []string{"+", "-"}
var strand string
var tmpSeq *seq.Seq
var primer [3][]byte
@@ -271,7 +457,8 @@ func init() {
ampliconCmd.Flags().BoolP("flanking-region", "f", false, "region is flanking region")
ampliconCmd.Flags().BoolP("strict-mode", "s", false, "strict mode, i.e., discarding seqs not fully matching (shorter) given region range")
ampliconCmd.Flags().BoolP("only-positive-strand", "P", false, "only search on positive strand")
- ampliconCmd.Flags().BoolP("bed", "", false, "output in BED6+1 format with amplicon as 7th columns")
+ ampliconCmd.Flags().BoolP("bed", "", false, "output in BED6+1 format with amplicon as the 7th column")
+ ampliconCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
}
func loadPrimers(file string) ([][3]string, error) {
@@ -691,7 +878,7 @@ func (finder *AmpliconFinder) Locate() ([]int, error) {
sort.Ints(locsI) // to remain the FIRST location
sort.Ints(locsJ) // to remain the LAST location
finder.searched, finder.found = true, true
- finder.iBegin, finder.iEnd = locsI[0], locsI[0]+len(finder.R)-1
+ finder.iBegin, finder.iEnd = locsI[0], locsJ[len(locsJ)-1]+len(finder.R)-1
return []int{locsI[0] + 1, locsJ[len(locsJ)-1] + len(finder.R)}, nil
}
=====================================
seqkit/cmd/bam_toolbox.go
=====================================
@@ -58,7 +58,7 @@ type Toolshed map[string]BamTool
func (s Toolshed) String() string {
tools := make([]string, 0, len(s))
- for t, _ := range s {
+ for t := range s {
tools = append(tools, t)
}
sort.Strings(tools)
=====================================
seqkit/cmd/bed.go
=====================================
@@ -28,7 +28,6 @@ import (
"strings"
"github.com/shenwei356/breader"
- "github.com/shenwei356/util/stringutil"
)
// BedFeature is the gff BedFeature struct
@@ -64,9 +63,8 @@ func ReadBedFilteredFeatures(file string, chrs []string) ([]BedFeature, error) {
if line == "" || line[0] == '#' || (len(line) > 7 && string(line[0:7]) == "browser") || (len(line) > 5 && string(line[0:5]) == "track") {
return nil, false, nil
}
- //
- // do not use regexp.Split(), it's very slow
- items := stringutil.Split(line, "\t")
+
+ items := strings.Split(line, "\t")
n := len(items)
if n < 3 {
return nil, false, nil
=====================================
seqkit/cmd/fx2tab.go
=====================================
@@ -63,6 +63,7 @@ Attention:
printGC := getFlagBool(cmd, "gc")
printGCSkew := getFlagBool(cmd, "gc-skew")
baseContents := getFlagStringSlice(cmd, "base-content")
+ baseCounts := getFlagStringSlice(cmd, "base-count")
caseSensitive := getFlagBool(cmd, "case-sensitive")
onlyName := getFlagBool(cmd, "name")
printTitle := getFlagBool(cmd, "header-line")
@@ -98,6 +99,11 @@ Attention:
if printGCSkew {
outfh.WriteString("\tGC-Skew")
}
+ if len(baseCounts) > 0 {
+ for _, bc := range baseCounts {
+ outfh.WriteString(fmt.Sprintf("\t%s", bc))
+ }
+ }
if len(baseContents) > 0 {
for _, bc := range baseContents {
outfh.WriteString(fmt.Sprintf("\t%s", bc))
@@ -164,6 +170,16 @@ Attention:
outfh.WriteString(fmt.Sprintf("\t%.2f", (g-c)/(g+c)*100))
}
+ if len(baseCounts) > 0 {
+ for _, bc := range baseCounts {
+ if caseSensitive {
+ outfh.WriteString(fmt.Sprintf("\t%d", record.Seq.BaseCountCaseSensitive(bc)))
+ } else {
+ outfh.WriteString(fmt.Sprintf("\t%d", record.Seq.BaseCount(bc)))
+ }
+ }
+ }
+
if len(baseContents) > 0 {
for _, bc := range baseContents {
if caseSensitive {
@@ -199,6 +215,7 @@ func init() {
fx2tabCmd.Flags().BoolP("gc", "g", false, "print GC content")
fx2tabCmd.Flags().BoolP("gc-skew", "G", false, "print GC-Skew")
fx2tabCmd.Flags().StringSliceP("base-content", "B", []string{}, "print base content. (case ignored, multiple values supported) e.g. -B AT -B N")
+ fx2tabCmd.Flags().StringSliceP("base-count", "C", []string{}, "print base count. (case ignored, multiple values supported) e.g. -C AT -C N")
fx2tabCmd.Flags().BoolP("case-sensitive", "I", false, "calculate case sensitive base content")
fx2tabCmd.Flags().BoolP("only-id", "i", false, "print ID instead of full head")
fx2tabCmd.Flags().BoolP("name", "n", false, "only print names (no sequences and qualities)")
=====================================
seqkit/cmd/genautocomplete.go
=====================================
@@ -1,4 +1,4 @@
-// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+// Copyright © 2016-2021 Wei Shen <shenwei356 at gmail.com>
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
@@ -33,44 +33,60 @@ import (
// genautocompleteCmd represents the fq2fa command
var genautocompleteCmd = &cobra.Command{
Use: "genautocomplete",
- Short: "generate shell autocompletion script",
+ Short: "generate shell autocompletion script (bash|zsh|fish|powershell)",
Long: `generate shell autocompletion script
-Note: The current version supports Bash only.
-This should work for *nix systems with Bash installed.
+Supported shell: bash|zsh|fish|powershell
-Howto:
+Bash:
-1. run: seqkit genautocomplete
+ # generate completion shell
+ seqkit genautocomplete --shell bash
-2. create and edit ~/.bash_completion file if you don't have it.
+ # configure if never did.
+ # install bash-completion if the "complete" command is not found.
+ echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
+ echo "source ~/.bash_completion" >> ~/.bashrc
- nano ~/.bash_completion
+Zsh:
- add the following:
+ # generate completion shell
+ seqkit genautocomplete --shell zsh --file ~/.zfunc/_seqkit
- for bcfile in ~/.bash_completion.d/* ; do
- . $bcfile
- done
+ # configure if never did
+ echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
+ echo "autoload -U compinit; compinit" >> ~/.zshrc
+
+fish:
+
+ seqkit genautocomplete --shell fish --file ~/.config/fish/completions/seqkit.fish
`,
Run: func(cmd *cobra.Command, args []string) {
- autocompleteTarget := getFlagString(cmd, "file")
- autocompleteType := getFlagString(cmd, "type")
-
- if autocompleteType != "bash" {
- checkError(fmt.Errorf("only Bash is supported for now"))
- }
+ outfile := getFlagString(cmd, "file")
+ shell := getFlagString(cmd, "shell")
- dir := filepath.Dir(autocompleteTarget)
+ dir := filepath.Dir(outfile)
ok, err := pathutil.DirExists(dir)
checkError(err)
if !ok {
os.MkdirAll(dir, 0744)
}
- checkError(cmd.Root().GenBashCompletionFile(autocompleteTarget))
- log.Infof("bash completion file for SeqKit saved to %s", autocompleteTarget)
+ switch shell {
+ case "bash":
+ checkError(cmd.Root().GenBashCompletionFile(outfile))
+ case "zsh":
+ checkError(cmd.Root().GenZshCompletionFile(outfile))
+ case "fish":
+ checkError(cmd.Root().GenFishCompletionFile(outfile, true))
+ case "powershell":
+ checkError(cmd.Root().GenPowerShellCompletionFile(outfile))
+ default:
+ checkError(fmt.Errorf("unsupported shell: %s", shell))
+ }
+
+ log.Infof("%s completion file for seqkit saved to %s", shell, outfile)
},
}
@@ -79,5 +95,5 @@ func init() {
defaultCompletionFile, err := homedir.Expand("~/.bash_completion.d/seqkit.sh")
checkError(err)
genautocompleteCmd.Flags().StringP("file", "", defaultCompletionFile, "autocompletion file")
- genautocompleteCmd.Flags().StringP("type", "", "bash", "autocompletion type (currently only bash supported)")
+ genautocompleteCmd.Flags().StringP("shell", "", "bash", "autocompletion type (bash|zsh|fish|powershell)")
}
=====================================
seqkit/cmd/grep.go
=====================================
@@ -28,8 +28,10 @@ import (
"runtime"
"strconv"
"strings"
+ "sync"
"github.com/shenwei356/bwt"
+ "github.com/twotwotwo/sorts/sortutil"
"github.com/shenwei356/bio/seq"
"github.com/shenwei356/bio/seqio/fastx"
@@ -54,9 +56,8 @@ Attentions:
for partly matching.
2. When searching by sequences, it's partly matching, and both positive
and negative strands are searched.
- Mismatch is allowed using flag "-m/--max-mismatch",
- but it's not fast enough for large genome like human genome.
- Though, it's fast enough for microbial genomes.
+ Mismatch is allowed using flag "-m/--max-mismatch", you can increase
+ the value of "-j/--threads" to accelerate processing.
3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
But do not use degenerate bases/residues in regular expression, you need
convert them to regular expression, e.g., change "N" or "X" to "."..
@@ -105,6 +106,8 @@ Examples:
region := getFlagString(cmd, "region")
circular := getFlagBool(cmd, "circular")
+ immediateOutput := getFlagBool(cmd, "immediate-output")
+
if len(pattern) == 0 && patternFile == "" {
checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
}
@@ -266,21 +269,209 @@ Examples:
checkError(err)
defer outfh.Close()
+ var fastxReader *fastx.Reader
+ var record *fastx.Record
+ strands := []byte{'+', '-'}
+
+ // -------------------------------------------------------------------
+ // only for searching with sequences and mismatch > 0, were FMI is very slow
+
+ if bySeq && mismatches > 1 {
+ type Arecord struct {
+ id uint64
+ ok bool
+ record *fastx.Record
+ }
+
+ var wg sync.WaitGroup
+ ch := make(chan *Arecord, config.Threads)
+ tokens := make(chan int, config.Threads)
+
+ done := make(chan int)
+ go func() {
+ m := make(map[uint64]*Arecord, config.Threads)
+ var id, _id uint64
+ var ok bool
+ var _r *Arecord
+
+ id = 1
+ for r := range ch {
+ _id = r.id
+
+ if _id == id { // right there
+ if r.ok {
+ r.record.FormatToWriter(outfh, config.LineWidth)
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ id++
+ continue
+ }
+
+ m[_id] = r // save for later check
+
+ if _r, ok = m[id]; ok { // check buffered
+ if _r.ok {
+ _r.record.FormatToWriter(outfh, config.LineWidth)
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ delete(m, id)
+ id++
+ }
+ }
+
+ if len(m) > 0 {
+ ids := make([]uint64, len(m))
+ i := 0
+ for _id = range m {
+ ids[i] = _id
+ i++
+ }
+ sortutil.Uint64s(ids)
+ for _, _id = range ids {
+ _r = m[_id]
+
+ if _r.ok {
+ _r.record.FormatToWriter(outfh, config.LineWidth)
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ }
+ }
+ done <- 1
+ }()
+
+ var id uint64
+ for _, file := range files {
+ fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+ checkError(err)
+
+ checkAlphabet := true
+ for {
+ record, err = fastxReader.Read()
+ if err != nil {
+ if err == io.EOF {
+ break
+ }
+ checkError(err)
+ break
+ }
+
+ if checkAlphabet {
+ if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+ onlyPositiveStrand = true
+ }
+ checkAlphabet = false
+ }
+
+ if fastxReader.IsFastq {
+ config.LineWidth = 0
+ fastx.ForcelyOutputFastq = true
+ }
+
+ tokens <- 1
+ wg.Add(1)
+ id++
+ go func(record *fastx.Record, id uint64) {
+ defer func() {
+ wg.Done()
+ <-tokens
+ }()
+
+ var sequence *seq.Seq
+ var target []byte
+ var hit bool
+ var k string
+
+ sfmi := fmi.NewFMIndex()
+
+ for _, strand := range strands {
+ if hit {
+ break
+ }
+
+ if strand == '-' && onlyPositiveStrand {
+ break
+ }
+
+ sequence = record.Seq
+ if strand == '-' {
+ sequence = record.Seq.RevCom()
+ }
+ if limitRegion {
+ target = sequence.SubSeq(start, end).Seq
+ } else if circular {
+ // concat two copies of sequence, and do not change orginal sequence
+ target = make([]byte, len(sequence.Seq)*2)
+ copy(target[0:len(sequence.Seq)], sequence.Seq)
+ copy(target[len(sequence.Seq):], sequence.Seq)
+ } else {
+ target = sequence.Seq
+ }
+
+ if ignoreCase {
+ target = bytes.ToLower(target)
+ }
+
+ _, err = sfmi.Transform(target)
+ if err != nil {
+ checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
+ }
+ for k = range patterns {
+ hit, err = sfmi.Match([]byte(k), mismatches)
+ if err != nil {
+ checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", k, record.Name, err))
+ }
+ if hit {
+ break
+ }
+ }
+
+ }
+
+ if invertMatch {
+ if hit {
+ ch <- &Arecord{record: nil, ok: false, id: id}
+ return
+ }
+ } else {
+ if !hit {
+ ch <- &Arecord{record: nil, ok: false, id: id}
+ return
+ }
+ }
+
+ ch <- &Arecord{record: record, ok: true, id: id}
+
+ }(record.Clone(), id)
+ }
+ }
+
+ wg.Wait()
+ close(ch)
+ <-done
+
+ return
+ }
+
+ // -------------------------------------------------------------------
+
var sequence *seq.Seq
var target []byte
var ok, hit bool
- var record *fastx.Record
- var fastxReader *fastx.Reader
var k string
- var locs []int
var re *regexp.Regexp
var p string
- strands := []byte{'+', '-'}
var strand byte
for _, file := range files {
fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
checkError(err)
+ checkAlphabet := true
for {
record, err = fastxReader.Read()
if err != nil {
@@ -290,6 +481,14 @@ Examples:
checkError(err)
break
}
+
+ if checkAlphabet {
+ if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+ onlyPositiveStrand = true
+ }
+ checkAlphabet = false
+ }
+
if fastxReader.IsFastq {
config.LineWidth = 0
fastx.ForcelyOutputFastq = true
@@ -367,12 +566,11 @@ Examples:
checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
}
for k = range patterns {
- locs, err = sfmi.Locate([]byte(k), mismatches)
+ hit, err = sfmi.Match([]byte(k), mismatches)
if err != nil {
checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", k, record.Name, err))
}
- if len(locs) > 0 {
- hit = true
+ if hit {
if deleteMatched && !invertMatch {
delete(patterns, k)
}
@@ -406,6 +604,10 @@ Examples:
}
record.FormatToWriter(outfh, config.LineWidth)
+
+ if immediateOutput {
+ outfh.Flush()
+ }
}
config.LineWidth = lineWidth
@@ -430,4 +632,5 @@ func init() {
grepCmd.Flags().StringP("region", "R", "", "specify sequence region for searching. "+
"e.g 1:12 for first 12 bases, -12:-1 for last 12 bases")
grepCmd.Flags().BoolP("circular", "c", false, "circular genome")
+ grepCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
}
=====================================
seqkit/cmd/head-genome.go
=====================================
@@ -0,0 +1,151 @@
+// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in
+// all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+// THE SOFTWARE.
+
+package cmd
+
+import (
+ "fmt"
+ "io"
+ "runtime"
+
+ "github.com/shenwei356/bio/seq"
+ "github.com/shenwei356/bio/seqio/fastx"
+ "github.com/shenwei356/util/stringutil"
+ "github.com/shenwei356/xopen"
+ "github.com/spf13/cobra"
+)
+
+// headGenomeCmd represents the head command
+var headGenomeCmd = &cobra.Command{
+ Use: "head-genome",
+ Short: "print sequences of the first genome with common prefixes in name",
+ Long: `print sequences of the first genome with common prefixes in name
+
+For a FASTA file containing multiple contigs of strains (see example below),
+these's no list of IDs available for retrieving sequences of a certain strain,
+while descriptions of each strain share the same prefix.
+
+This command is used to restrieve sequences of the first strain,
+i.e., "Vibrio cholerae strain M29".
+
+>NZ_JFGR01000001.1 Vibrio cholerae strain M29 Contig_1, whole genome shotgun sequence
+>NZ_JFGR01000002.1 Vibrio cholerae strain M29 Contig_2, whole genome shotgun sequence
+>NZ_JFGR01000003.1 Vibrio cholerae strain M29 Contig_3, whole genome shotgun sequence
+>NZ_JSTP01000001.1 Vibrio cholerae strain 2012HC-12 NODE_79, whole genome shotgun sequence
+>NZ_JSTP01000002.1 Vibrio cholerae strain 2012HC-12 NODE_78, whole genome shotgun sequence
+
+Attention:
+
+ 1. Sequences in file should be well organized.
+
+`,
+ Run: func(cmd *cobra.Command, args []string) {
+ config := getConfigs(cmd)
+ alphabet := config.Alphabet
+ idRegexp := config.IDRegexp
+ outFile := config.OutFile
+ lineWidth := config.LineWidth
+ seq.AlphabetGuessSeqLengthThreshold = config.AlphabetGuessSeqLength
+ seq.ValidateSeq = false
+ runtime.GOMAXPROCS(config.Threads)
+
+ minWords := getFlagPositiveInt(cmd, "mini-common-words")
+
+ files := getFileListFromArgsAndFile(cmd, args, true, "infile-list", true)
+
+ outfh, err := xopen.Wopen(outFile)
+ checkError(err)
+ defer outfh.Close()
+
+ var record *fastx.Record
+ var fastxReader *fastx.Reader
+
+ var prefixes, words []string
+ var i, N int
+ var nSharedWords, pNSharedWords int
+
+ for _, file := range files {
+ fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+ checkError(err)
+
+ for {
+ record, err = fastxReader.Read()
+ if err != nil {
+ if err == io.EOF {
+ break
+ }
+ checkError(err)
+ break
+ }
+
+ if fastxReader.IsFastq {
+ config.LineWidth = 0
+ fastx.ForcelyOutputFastq = true
+ }
+
+ if len(record.Desc) == 0 {
+ checkError(fmt.Errorf("no description: %s", record.ID))
+ }
+
+ if prefixes == nil { // first record
+ record.FormatToWriter(outfh, config.LineWidth)
+
+ prefixes = stringutil.Split(string(record.Desc), "\t ")
+ continue
+ }
+
+ words = stringutil.Split(string(record.Desc), "\t ")
+ if len(words) < len(prefixes) {
+ N = len(words)
+ } else {
+ N = len(prefixes)
+ }
+
+ nSharedWords = 0
+ for i = 0; i < N; i++ {
+ if words[i] != prefixes[i] {
+ break
+ }
+ nSharedWords++
+ }
+
+ if nSharedWords < minWords {
+ return
+ }
+
+ if pNSharedWords == 0 { // 2nd sequence
+ pNSharedWords = nSharedWords
+ } else if nSharedWords != pNSharedWords { // number of shared words changed
+ return
+ }
+
+ record.FormatToWriter(outfh, config.LineWidth)
+ }
+
+ config.LineWidth = lineWidth
+ }
+ },
+}
+
+func init() {
+ RootCmd.AddCommand(headGenomeCmd)
+
+ headGenomeCmd.Flags().IntP("mini-common-words", "m", 1, "minimal shared prefix words")
+}
=====================================
seqkit/cmd/locate.go
=====================================
@@ -26,6 +26,7 @@ import (
"io"
"regexp"
"runtime"
+ "sync"
"github.com/shenwei356/bio/seq"
"github.com/shenwei356/bio/seqio/fastx"
@@ -33,6 +34,7 @@ import (
"github.com/shenwei356/bwt/fmi"
"github.com/shenwei356/xopen"
"github.com/spf13/cobra"
+ "github.com/twotwotwo/sorts/sortutil"
)
// locateCmd represents the locate command
@@ -54,8 +56,7 @@ Attentions:
parser accepts comma-separated-values (CSV) for multiple values (motifs).
Patterns in file do not follow this rule.
4. Mismatch is allowed using flag "-m/--max-mismatch",
- but it's not fast enough for large genome like human genome.
- Though, it's fast enough for microbial genomes.
+ you can increase the value of "-j/--threads" to accelerate processing.
5. When using flag --circular, end position of matched subsequence that
crossing genome sequence end would be greater than sequence length.
@@ -92,6 +93,8 @@ Attentions:
hideMatched := getFlagBool(cmd, "hide-matched")
circular := getFlagBool(cmd, "circular")
+ immediateOutput := getFlagBool(cmd, "immediate-output")
+
if config.Alphabet == seq.Protein {
onlyPositiveStrand = true
}
@@ -100,7 +103,6 @@ Attentions:
checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
}
- var sfmi *fmi.FMIndex
if mismatches > 0 {
if degenerate {
checkError(fmt.Errorf("flag -d (--degenerate) not allowed when giving flag -m (--max-mismatch)"))
@@ -111,7 +113,7 @@ Attentions:
if nonGreedy && !quiet {
log.Infof("flag -G (--non-greedy) ignored when giving flag -m (--max-mismatch)")
}
- sfmi = fmi.NewFMIndex()
+
}
if useFMI {
if degenerate {
@@ -120,7 +122,6 @@ Attentions:
if useRegexp {
checkError(fmt.Errorf("flag -r (--use-regexp) ignored when giving flag -F (--use-fmi)"))
}
- sfmi = fmi.NewFMIndex()
}
// prepare pattern
@@ -216,7 +217,6 @@ Attentions:
}
re, err := regexp.Compile(s)
checkError(err)
- fmt.Println(s, re)
regexps[p] = re
} else if bytes.Index(patterns[p], []byte(".")) >= 0 ||
!(seq.DNAredundant.IsValid(patterns[p]) == nil ||
@@ -239,20 +239,309 @@ Attentions:
outfh.WriteString("seqID\tpatternName\tpattern\tstrand\tstart\tend\tmatched\n")
}
}
+
+ // -------------------------------------------------------------------
+ // only for m > 0, where FMI is slow
+
+ var record *fastx.Record
+ var fastxReader *fastx.Reader
+ _onlyPositiveStrand := onlyPositiveStrand
+
+ if mismatches > 0 || useFMI {
+ type Arecord struct {
+ id uint64
+ ok bool
+ record []string
+ }
+
+ var wg sync.WaitGroup
+ ch := make(chan *Arecord, config.Threads)
+ tokens := make(chan int, config.Threads)
+
+ done := make(chan int)
+ go func() {
+ m := make(map[uint64]*Arecord, config.Threads)
+ var id, _id uint64
+ var ok bool
+ var _r *Arecord
+ var row string
+
+ id = 1
+ for r := range ch {
+ _id = r.id
+
+ if _id == id { // right there
+ if r.ok {
+ for _, row = range r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ id++
+ continue
+ }
+
+ m[_id] = r // save for later check
+
+ if _r, ok = m[id]; ok { // check buffered
+ if _r.ok {
+ for _, row = range _r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ delete(m, id)
+ id++
+ }
+ }
+
+ if len(m) > 0 {
+ ids := make([]uint64, len(m))
+ i := 0
+ for _id = range m {
+ ids[i] = _id
+ i++
+ }
+ sortutil.Uint64s(ids)
+ for _, _id = range ids {
+ _r = m[_id]
+
+ if _r.ok {
+ for _, row = range _r.record {
+ outfh.WriteString(row)
+ }
+
+ if immediateOutput {
+ outfh.Flush()
+ }
+ }
+ }
+ }
+ done <- 1
+ }()
+
+ var id uint64
+ for _, file := range files {
+ fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
+ checkError(err)
+
+ checkAlphabet := true
+ for {
+ record, err = fastxReader.Read()
+ if err != nil {
+ if err == io.EOF {
+ break
+ }
+ checkError(err)
+ break
+ }
+
+ if checkAlphabet {
+ if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+ _onlyPositiveStrand = true
+ }
+ checkAlphabet = false
+ }
+
+ tokens <- 1
+ wg.Add(1)
+ id++
+ go func(record *fastx.Record, id uint64) {
+ defer func() {
+ wg.Done()
+ <-tokens
+ }()
+
+ var seqRP *seq.Seq
+ var l int
+ var loc []int
+ var i, begin, end int
+ var pSeq []byte
+ var pName string
+ var sfmi *fmi.FMIndex
+ sfmi = fmi.NewFMIndex()
+ results := make([]string, 0, 2)
+
+ if !(degenerate || useRegexp) && ignoreCase {
+ record.Seq.Seq = bytes.ToLower(record.Seq.Seq)
+ }
+
+ l = len(record.Seq.Seq)
+
+ if circular { // concat two copies of sequence
+ record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
+ }
+
+ _, err = sfmi.Transform(record.Seq.Seq)
+ if err != nil {
+ checkError(fmt.Errorf("fail to build FMIndex for sequence: %s", record.Name))
+ }
+
+ for pName, pSeq = range patterns {
+ loc, err = sfmi.Locate(pSeq, mismatches)
+ if err != nil {
+ checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
+ }
+ for _, i = range loc {
+ if circular && i+1 > l { // 2nd clone of original part
+ continue
+ }
+
+ begin = i + 1
+
+ end = i + len(pSeq)
+ if i+len(pSeq) > len(record.Seq.Seq) {
+ continue
+ }
+ if outFmtGTF {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%s\t%s\tgene_id \"%s\"; \n",
+ record.ID,
+ "SeqKit",
+ "location",
+ begin,
+ end,
+ 0,
+ "+",
+ ".",
+ pName))
+ } else if outFmtBED {
+ results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\n",
+ record.ID,
+ begin-1,
+ end,
+ pName,
+ 0,
+ "+"))
+ } else {
+ if hideMatched {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\n",
+ record.ID,
+ pName,
+ patterns[pName],
+ "+",
+ begin,
+ end))
+ } else {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\t%s\n",
+ record.ID,
+ pName,
+ patterns[pName],
+ "+",
+ begin,
+ end,
+ record.Seq.Seq[i:i+len(pSeq)]))
+ }
+ }
+ }
+ }
+
+ if _onlyPositiveStrand {
+ ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+ return
+ }
+
+ seqRP = record.Seq.RevCom()
+
+ _, err = sfmi.Transform(seqRP.Seq)
+ if err != nil {
+ checkError(fmt.Errorf("fail to build FMIndex for reverse complement sequence: %s", record.Name))
+ }
+ for pName, pSeq = range patterns {
+ loc, err = sfmi.Locate(pSeq, mismatches)
+ if err != nil {
+ checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
+ }
+ for _, i = range loc {
+ if circular && i+1 > l { // 2nd clone of original part
+ continue
+ }
+
+ begin = l - i - len(pSeq) + 1
+ end = l - i
+ if i+len(pSeq) > len(record.Seq.Seq) {
+ continue
+ }
+ if outFmtGTF {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%s\t%s\tgene_id \"%s\"; \n",
+ record.ID,
+ "SeqKit",
+ "location",
+ begin,
+ end,
+ 0,
+ "-",
+ ".",
+ pName))
+ } else if outFmtBED {
+ results = append(results, fmt.Sprintf("%s\t%d\t%d\t%s\t%d\t%s\n",
+ record.ID,
+ begin-1,
+ end,
+ pName,
+ 0,
+ "-"))
+ } else {
+ if hideMatched {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\n",
+ record.ID,
+ pName,
+ patterns[pName],
+ "-",
+ begin,
+ end))
+ } else {
+ results = append(results, fmt.Sprintf("%s\t%s\t%s\t%s\t%d\t%d\t%s\n",
+ record.ID,
+ pName,
+ patterns[pName],
+ "-",
+ begin,
+ end,
+ seqRP.Seq[i:i+len(pSeq)]))
+ }
+ }
+ }
+ }
+
+ ch <- &Arecord{record: results, id: id, ok: len(results) > 0}
+ }(record.Clone(), id)
+ }
+ }
+
+ wg.Wait()
+ close(ch)
+ <-done
+
+ return
+ }
+
+ // -------------------------------------------------------------------
+
var seqRP *seq.Seq
var offset, l, lpatten int
var loc []int
var locs, locsNeg [][2]int
var i, begin, end int
var flag bool
- var record *fastx.Record
- var fastxReader *fastx.Reader
var pSeq, p []byte
var pName string
var re *regexp.Regexp
+ var sfmi *fmi.FMIndex
+ if mismatches > 0 || useFMI {
+ sfmi = fmi.NewFMIndex()
+ }
+
for _, file := range files {
fastxReader, err = fastx.NewReader(alphabet, file, idRegexp)
checkError(err)
+
+ checkAlphabet := true
for {
record, err = fastxReader.Read()
if err != nil {
@@ -263,6 +552,13 @@ Attentions:
break
}
+ if checkAlphabet {
+ if fastxReader.Alphabet() == seq.Unlimit || fastxReader.Alphabet() == seq.Protein {
+ _onlyPositiveStrand = true
+ }
+ checkAlphabet = false
+ }
+
if !(degenerate || useRegexp) && ignoreCase {
record.Seq.Seq = bytes.ToLower(record.Seq.Seq)
}
@@ -273,10 +569,6 @@ Attentions:
record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
}
- if !onlyPositiveStrand {
- seqRP = record.Seq.RevCom()
- }
-
if mismatches > 0 || useFMI {
_, err = sfmi.Transform(record.Seq.Seq)
if err != nil {
@@ -341,10 +633,12 @@ Attentions:
}
}
- if onlyPositiveStrand {
+ if _onlyPositiveStrand {
continue
}
+ seqRP = record.Seq.RevCom()
+
_, err = sfmi.Transform(seqRP.Seq)
if err != nil {
checkError(fmt.Errorf("fail to build FMIndex for reverse complement sequence: %s", record.Name))
@@ -406,6 +700,10 @@ Attentions:
}
}
+ if immediateOutput {
+ outfh.Flush()
+ }
+
continue
}
@@ -507,6 +805,8 @@ Attentions:
continue
}
+ seqRP = record.Seq.RevCom()
+
locsNeg = make([][2]int, 0, 1000)
offset = 0
@@ -600,6 +900,10 @@ Attentions:
}
}
}
+
+ if immediateOutput {
+ outfh.Flush()
+ }
}
}
},
@@ -622,4 +926,5 @@ func init() {
locateCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
locateCmd.Flags().BoolP("hide-matched", "M", false, "do not show matched sequences")
locateCmd.Flags().BoolP("circular", "c", false, `circular genome. type "seqkit locate -h" for details`)
+ locateCmd.Flags().BoolP("immediate-output", "I", false, "print output immediately, do not use write buffer")
}
=====================================
seqkit/cmd/tab2fx.go
=====================================
@@ -21,12 +21,11 @@
package cmd
import (
- "bytes"
+ "bufio"
"fmt"
"runtime"
"strings"
- "github.com/shenwei356/breader"
"github.com/shenwei356/util/byteutil"
"github.com/shenwei356/xopen"
"github.com/spf13/cobra"
@@ -53,92 +52,57 @@ var tab2faCmd = &cobra.Command{
checkError(err)
defer outfh.Close()
- type Slice []string
- fn := func(line string) (interface{}, bool, error) {
- line = strings.TrimRight(line, "\r\n")
-
- if line == "" {
- return "", false, nil
- }
- // check comment line
- isCommentLine := false
- for _, p := range commentPrefixes {
- if strings.HasPrefix(line, p) {
- isCommentLine = true
- break
- }
- }
- if isCommentLine {
- return "", false, nil
- }
-
- items := strings.Split(line, "\t")
- if len(items) < 2 {
- return Slice(items), false, fmt.Errorf("at least two columns needed: %s", line)
- }
- if len(items) > 2 {
- return Slice(items[0:3]), true, nil
- }
- return Slice(items[0:2]), true, nil
- }
-
+ var line, p string
+ var items []string
+ var isCommentLine, isFastq bool
+ var scanner *bufio.Scanner
+ var fh *xopen.Reader
for _, file := range files {
- reader, err := breader.NewBufferedReader(file, config.Threads, 10, fn)
+ fh, err = xopen.Ropen(file)
checkError(err)
- var text []byte
- var b *bytes.Buffer
- isFastq := false
- for chunk := range reader.Ch {
- if chunk.Err != nil {
- checkError(chunk.Err)
- break
- }
- for _, data := range chunk.Data {
- items := data.(Slice)
- if len(items) == 3 && (len(items[2]) > 0 || isFastq) { // fastq
- isFastq = true
- outfh.WriteString(fmt.Sprintf("@%s\n", items[0]))
-
- // outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
- // if bufferedByteSliceWrapper == nil {
- // bufferedByteSliceWrapper = byteutil.NewBufferedByteSliceWrapper2(1, len(items[1]), lineWidth)
- // }
- // text, b = bufferedByteSliceWrapper.Wrap([]byte(items[1]), lineWidth)
- // outfh.Write(text)
- // outfh.Flush()
- // bufferedByteSliceWrapper.Recycle(b)
+ scanner = bufio.NewScanner(fh)
- outfh.WriteString(items[1]) // seq
+ isFastq = false
+ for scanner.Scan() {
+ line = strings.TrimRight(scanner.Text(), "\r\n")
- outfh.WriteString("\n+\n")
-
- // outfh.Write(byteutil.WrapByteSlice([]byte(items[2]), lineWidth))
-
- // text, b = bufferedByteSliceWrapper.Wrap([]byte(items[2]), lineWidth)
- // outfh.Write(text)
- // outfh.Flush()
- // bufferedByteSliceWrapper.Recycle(b)
-
- outfh.WriteString(items[2]) // qual
-
- outfh.WriteString("\n")
- } else {
- outfh.WriteString(fmt.Sprintf(">%s\n", items[0]))
+ if line == "" {
+ continue
+ }
+ // check comment line
+ isCommentLine = false
+ for _, p = range commentPrefixes {
+ if strings.HasPrefix(line, p) {
+ isCommentLine = true
+ break
+ }
+ }
+ if isCommentLine {
+ continue
+ }
- // outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
- if bufferedByteSliceWrapper == nil {
- bufferedByteSliceWrapper = byteutil.NewBufferedByteSliceWrapper2(1, len(items[1]), lineWidth)
- }
- text, b = bufferedByteSliceWrapper.Wrap([]byte(items[1]), lineWidth)
- outfh.Write(text)
- outfh.Flush()
- bufferedByteSliceWrapper.Recycle(b)
+ items = strings.Split(line, "\t")
+ if len(items) < 2 {
+ checkError(fmt.Errorf("at least two columns needed: %s", line))
+ }
- outfh.WriteString("\n")
- }
+ if len(items) == 3 && (len(items[2]) > 0 || isFastq) { // fastq
+ isFastq = true
+ outfh.WriteString(fmt.Sprintf("@%s\n", items[0]))
+ outfh.WriteString(items[1]) // seq
+ outfh.WriteString("\n+\n")
+ outfh.WriteString(items[2]) // qual
+
+ outfh.WriteString("\n")
+ } else {
+ outfh.WriteString(fmt.Sprintf(">%s\n", items[0]))
+ outfh.Write(byteutil.WrapByteSlice([]byte(items[1]), lineWidth))
+ outfh.WriteString("\n")
}
+
}
+ checkError(scanner.Err())
}
},
}
=====================================
seqkit/cmd/version.go
=====================================
@@ -29,7 +29,7 @@ import (
)
// VERSION of seqkit
-const VERSION = "0.15.0"
+const VERSION = "0.16.0"
// versionCmd represents the version command
var versionCmd = &cobra.Command{
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/e77f29a7b82e1daaca7ff0a0a0edc08ec577d880
--
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/e77f29a7b82e1daaca7ff0a0a0edc08ec577d880
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210501/e2c9da35/attachment-0001.htm>
More information about the debian-med-commit
mailing list