[med-svn] [Git][med-team/seqkit][upstream] New upstream version 0.14.0+ds
Nilesh Patra
gitlab at salsa.debian.org
Fri Oct 30 20:42:29 GMT 2020
Nilesh Patra pushed to branch upstream at Debian Med / seqkit
Commits:
47b180fb by Nilesh Patra at 2020-10-31T00:56:06+05:30
New upstream version 0.14.0+ds
- - - - -
18 changed files:
- CHANGES.md
- README.md
- doc/docs/download.md
- doc/docs/faq.md
- doc/docs/usage.md
- seqkit/cmd/grep.go
- seqkit/cmd/locate.go
- seqkit/cmd/mutate.go
- + seqkit/cmd/pair.go
- seqkit/cmd/rename.go
- seqkit/cmd/replace.go
- seqkit/cmd/seq.go
- seqkit/cmd/sliding.go
- seqkit/cmd/split.go
- seqkit/cmd/split2.go
- seqkit/cmd/stat.go
- seqkit/cmd/translate.go
- seqkit/cmd/version.go
Changes:
=====================================
CHANGES.md
=====================================
@@ -1,3 +1,14 @@
+- [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+ - new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.
+ - `seqkit translate`: new flag `-F/--append-fram` for optional adding frame info to ID. [#159](https://github.com/shenwei356/seqkit/issues/159)
+ - `seqkit stats`: reduce memory usage when using `-a` for calculating N50. [#153](https://github.com/shenwei356/seqkit/issues/153)
+ - `seqkit mutate`: fix inserting sequence `-i/--insertion`,
+ this bug occurs when `insert site` is big in some cases, don't worry if no error reported.
+ - `seqkit replace`:
+ - new flag `-U/--keep-untouched`: do not change anything when no value found for the key (only for sequence name).
+ - do no support editing FASTQ sequence.
+ - `seqkit grep/locate`: new flag `--circular` for supporting circular genome. [#158](https://github.com/shenwei356/seqkit/issues/158)
- [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
- `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)
=====================================
README.md
=====================================
@@ -56,6 +56,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
- [Acknowledgements](#acknowledgements)
- [Contact](#contact)
- [License](#license)
+- [Starchart](#startchart)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -68,7 +69,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
(see [download](http://bioinf.shenwei.me/seqkit/download/))
- **UltraFast** (see [benchmark](#benchmark)),
**multiple-CPUs supported**
-- **Practical functions supported by 28 subcommands** (see subcommands and
+- **Practical functions supported by 34 subcommands** (see subcommands and
[usage](http://bioinf.shenwei.me/seqkit/usage/) )
- **Supporting [Bash-completion](#bash-completion)**
- **Well documented** (detailed [usage](http://bioinf.shenwei.me/seqkit/usage/)
@@ -117,7 +118,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
## Subcommands
-33 functional subcommands in total.
+34 functional subcommands in total.
**Sequence and subsequence**
@@ -159,6 +160,7 @@ enable researchers to rapidly accomplish common FASTA/Q file manipulations.
- [`common`](https://bioinf.shenwei.me/seqkit/usage/#common) find common sequences of multiple files by id/name/sequence
- [`split`](https://bioinf.shenwei.me/seqkit/usage/#split) split sequences into files by id/seq region/size/parts (mainly for FASTA)
- [`split2`](https://bioinf.shenwei.me/seqkit/usage/#split2) split sequences into files by size/parts (FASTA, PE/SE FASTQ)
+- [`pair`](https://bioinf.shenwei.me/seqkit/usage/#pair) match up paired-end reads from two fastq files
**Edit**
@@ -418,3 +420,7 @@ propose new functions or ask for help.
## License
[MIT License](https://github.com/shenwei356/seqkit/blob/master/LICENSE)
+
+## Starchart
+
+<img src="https://starchart.cc/shenwei356/seqkit.svg" alt="Stargazers over time" style="max-width: 100%">
=====================================
doc/docs/download.md
=====================================
@@ -6,9 +6,17 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
## Latest Version
-- [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
-[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
- - `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)
+- [SeqKit v0.14.0](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.14.0/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.14.0)
+ - new command `seqkit pair`: match up paired-end reads from two fastq files, faster than fastq-pair.
+ - `seqkit translate`: new flag `-F/--append-fram` for optional adding frame info to ID. [#159](https://github.com/shenwei356/seqkit/issues/159)
+ - `seqkit stats`: reduce memory usage when using `-a` for calculating N50. [#153](https://github.com/shenwei356/seqkit/issues/153)
+ - `seqkit mutate`: fix inserting sequence `-i/--insertion`,
+ this bug occurs when `insert site` is big in some cases, don't worry if no error reported.
+ - `seqkit replace`:
+ - new flag `-U/--keep-untouched`: do not change anything when no value found for the key (only for sequence name).
+ - do no support editing FASTQ sequence.
+ - `seqkit grep/locate`: new flag `--circular` for supporting circular genome. [#158](https://github.com/shenwei356/seqkit/issues/158)
### Please cite
@@ -25,12 +33,12 @@ SeqKit is implemented in [Go](https://golang.org/) programming language,
OS |Arch |File, 中国镜像 |Download Count
:------|:---------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-Linux |32-bit |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_386.tar.gz)
-Linux |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_linux_amd64.tar.gz)
-OS X |32-bit |[seqkit_darwin_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_386.tar.gz)
-OS X |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_darwin_amd64.tar.gz)
-Windows|32-bit |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_386.exe.tar.gz)
-Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.13.2/seqkit_windows_amd64.exe.tar.gz)
+Linux |32-bit |[seqkit_linux_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_386.tar.gz)
+Linux |**64-bit**|[**seqkit_linux_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_linux_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_linux_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_linux_amd64.tar.gz)
+OS X |32-bit |[seqkit_darwin_386.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_386.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_386.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_386.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_386.tar.gz)
+OS X |**64-bit**|[**seqkit_darwin_amd64.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_amd64.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_darwin_amd64.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_darwin_amd64.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_darwin_amd64.tar.gz)
+Windows|32-bit |[seqkit_windows_386.exe.tar.gz](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_386.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_386.exe.tar.gz) |[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_386.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_386.exe.tar.gz)
+Windows|**64-bit**|[**seqkit_windows_amd64.exe.tar.gz**](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_amd64.exe.tar.gz), <br/> [中国镜像](http://app.shenwei.me/data/seqkit/seqkit_windows_amd64.exe.tar.gz)|[![Github Releases (by Asset)](https://img.shields.io/github/downloads/shenwei356/seqkit/latest/seqkit_windows_amd64.exe.tar.gz.svg?maxAge=3600)](https://github.com/shenwei356/seqkit/releases/download/v0.14.0/seqkit_windows_amd64.exe.tar.gz)
## Installation
@@ -100,6 +108,9 @@ Howto:
## Release History
+- [SeqKit v0.13.2](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
+[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.2/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.2)
+ - `seqkit sana`: fix bug causing hanging on empty files. [#149](https://github.com/shenwei356/seqkit/pull/149)
- [SeqKit v0.13.1](https://github.com/shenwei356/seqkit/releases/tag/v0.13.1)
[![Github Releases (by Release)](https://img.shields.io/github/downloads/shenwei356/seqkit/v0.13.1/total.svg)](https://github.com/shenwei356/seqkit/releases/tag/v0.13.1)
- `seqkit sana`: fix bug causing hanging on empty files. [#148](https://github.com/shenwei356/seqkit/pull/148)
=====================================
doc/docs/faq.md
=====================================
@@ -264,72 +264,7 @@ is the second captured match. Command:
## How to extract paired reads from two paired-end reads file?
-Let's create two unbanlanced PE reads file:
-
- $ seqkit rmdup duplicated-reads.fq.gz \
- | seqkit replace --pattern " .+" --replacement " 1" \
- | seqkit sample --proportion 0.9 --rand-seed 1 --out-file read_1.fq.gz
- $ seqkit rmdup duplicated-reads.fq.gz \
- | seqkit replace --pattern " .+" --replacement " 2" \
- | seqkit sample --proportion 0.9 --rand-seed 2 --out-file read_2.fq.gz
-
-Overview:
-
- # number of records
- $ seqkit stat read_1.fq.gz read_2.fq.gz
- file format type num_seqs sum_len min_len avg_len max_len
- read_1.fq.gz FASTQ DNA 9,033 912,333 101 101 101
- read_2.fq.gz FASTQ DNA 8,965 905,465 101 101 101
-
- # sequence headers
- $ seqkit head -n 3 read_1.fq.gz | seqkit seq --name
- SRR1972739.1 1
- SRR1972739.3 1
- SRR1972739.4 1
-
- $ seqkit head -n 3 read_2.fq.gz | seqkit seq --name
- SRR1972739.1 2
- SRR1972739.2 2
- SRR1972739.3 2
-
-Firstly, extract sequence IDs of two file and compute the intersection:
-
- $ seqkit seq --name --only-id read_1.fq.gz read_2.fq.gz \
- | sort | uniq -d > id.txt
-
- $ # number of IDs
- wc -l id.txt
- 8090 id.txt
-
-Then extract reads using `id.txt`:
-
- $ seqkit grep --pattern-file id.txt read_1.fq.gz -o read_1.f.fq.gz
- $ seqkit grep --pattern-file id.txt read_2.fq.gz -o read_2.f.fq.gz
-
-Check if the IDs in two files are the same by `md5sum`:
-
- $ seqkit seq --name --only-id read_1.f.fq.gz > read_1.f.fq.gz.id.txt
- $ seqkit seq --name --only-id read_2.f.fq.gz > read_2.f.fq.gz.id.txt
-
- $ md5sum read_*.f.fq.gz.id.txt
- 537c57cfdc3923bb94a3dc31a0c3b02a read_1.f.fq.gz.id.txt
- 537c57cfdc3923bb94a3dc31a0c3b02a read_2.f.fq.gz.id.txt
-
-Note that this example assumes that the IDs in the two reads file have
-same order. If not you can sort them after previous steps. Shell `sort`
-can sort large file using disk, so temporary directory is set as
-current directory by option `-T .`.
-
- $ gzip -d -c read_1.f.fq.gz \
- | seqkit fx2tab \
- | sort -k1,1 -T . \
- | seqkit tab2fx \
- | gzip -c > read_1.f.sorted.fq.gz
- $ gzip -d -c read_2.f.fq.gz \
- | seqkit fx2tab \
- | sort -k1,1 -T . \
- | seqkit tab2fx \
- | gzip -c > read_2.f.sorted.fq.gz
+Use [seqkit pair](https://bioinf.shenwei.me/seqkit/usage/#pair).
## How to concatenate two FASTA sequences in to one?
=====================================
doc/docs/usage.md
=====================================
@@ -46,6 +46,7 @@
- [common](#common)
- [split](#split)
- [split2](#split2)
+- [pair](#pair)
**Edit**
@@ -173,7 +174,7 @@ reproduced in different environments with same random seed.
``` text
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
-Version: 0.13.2
+Version: 0.14.0
Author: Wei Shen <shenwei356 at gmail.com>
@@ -201,6 +202,7 @@ Available Commands:
help Help about any command
locate locate subsequences/motifs, mismatch allowed
mutate edit sequence (point mutation, insertion, deletion)
+ pair match up paired-end reads from two fastq files
range print FASTA/Q records in a range (start:end)
rename rename duplicated IDs
replace replace name/sequence by regular expression
@@ -561,7 +563,8 @@ Usage:
seqkit sliding [flags]
Flags:
- -C, --circular-genome circular genome.
+ -c, --circular circular genome (same to -C/--circular-genome)
+ -C, --circular-genome circular genome (same to -c/--circular)
-g, --greedy greedy mode, i.e., exporting last subsequences even shorter than windows size
-s, --step int step size
-W, --window int window size
@@ -1224,6 +1227,7 @@ Usage:
Flags:
-x, --allow-unknown-codon translate unknown code to 'X'. And you may not use flag --trim which removes 'X'
+ -F, --append-frame append frame infomation to sequence ID
--clean change all STOP codon positions from the '*' character to 'X' (an unknown residue)
-f, --frame strings frame(s) to translate, available value: 1, 2, 3, -1, -2, -3, and 6 for all six frames (default [1])
-h, --help help for translate
@@ -1363,6 +1367,8 @@ Usage
search sequences by ID/name/sequence/sequence motifs, mismatch allowed
Attentions:
+ 0. By default, we match sequence ID with patterns, use "-n/--by-name"
+ for matching full name instead of just ID.
1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
(ID/full header) by default. Please switch "-r/--use-regexp" on
for partly matching.
@@ -1398,8 +1404,9 @@ Usage:
seqkit grep [flags]
Flags:
- -n, --by-name match by full name instead of just id
+ -n, --by-name match by full name instead of just ID
-s, --by-seq search subseq on seq, only positive strand is searched, and mismatch allowed using flag -m/--max-mismatch
+ -c --circular circular genome
-d, --degenerate pattern/motif contains degenerate base
--delete-matched delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
-h, --help help for grep
@@ -1409,12 +1416,30 @@ Flags:
-p, --pattern strings search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
-f, --pattern-file string pattern file (one record per line)
-R, --region string specify sequence region for searching. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases
- -r, --use-regexp patterns are regular expression
+ -r, --use-regexp patterns are regular expressio
```
Examples
+
+1. Searching with list of sequence IDs (do not containing whitespace)
+
+ $ seqkit grep -f id.txt seqs.fq.gz -o result.fq.gz
+
+ # ignore case
+ $ seqkit grep -i -f id.txt seqs.fq.gz -o result.fq.gz
+
+1. Serching non-canonical sequence IDs, Using `--id-regexp` to capture IDs.
+ Refer to [section Sequence ID](#sequence-id) and [seqkit seq](#seq) for examples.
+
+1. Searching with list of sequence names (they may contain whitespace).
+
+ $ seqkit grep -n -f name.txt seqs.fa.gz -o result.fa.gz
+
+1. Useq `-r/--use-regexp` for partly matching, but **this may produce "false positive" matches**.
+ For example, `seq_1` matches `seq_10` with `-nri`.
+
1. Extract human hairpins (i.e. sequences with name starting with `hsa`)
$ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa
@@ -1425,7 +1450,7 @@ Examples
AGGUUGAGGUAGUAGGUUGUAUAGUUUAGAAUUACAUCAAGGGAGAUAACUGUACAGCCU
CCUAGCUUUCCU
-1. Remove human and mice hairpins.
+1. Remove human and mice hairpins (invert match with `-v`)
$ zcat hairpin.fa.gz | seqkit grep -r -p ^hsa -p ^mmu -v
@@ -1448,7 +1473,18 @@ Examples
$ cat hairpin.fa.gz | seqkit grep -s -i -p aggcg
+1. Circular genome
+ $ echo -e ">seq\nACGTTGCA"
+ >seq
+ ACGTTGCA
+
+ $ echo -e ">seq\nACGTTGCA" | seqkit grep -s -i -p AA
+
+ $ echo -e ">seq\nACGTTGCA" | seqkit grep -s -i -p AA -c
+ >seq
+ ACGTTGCA
+
1. Extract sequences containing AGGCG (allow mismatch)
$ time cat hairpin.fa.gz | seqkit grep -s -i -p aggcg | seqkit stats
@@ -1502,11 +1538,15 @@ Mismatch is allowed using flag "-m/--max-mismatch",
but it's not fast enough for large genome like human genome.
Though, it's fast enough for microbial genomes.
+When using flag --circular, end position of matched subsequence that
+crossing genome sequence end would be greater than sequence length.
+
Usage:
seqkit locate [flags]
Flags:
--bed output in BED6 format
+ -c --circular circular genome
-d, --degenerate pattern/motif contains degenerate base
--gtf output in GTF format
-h, --help help for locate
@@ -1613,7 +1653,7 @@ Examples
cel-mir-58a SeqKit location 81 88 0 + . gene_id "AUGGACUN";
ath-MIR163 SeqKit location 122 129 0 - . gene_id "AUGGACUN";
-1. greedy mode (default)
+1. Greedy mode (default)
$ echo -e '>seq\nACGACGACGA' | seqkit locate -p ACGA | csvtk -t pretty
seqID patternName pattern strand start end matched
@@ -1621,13 +1661,33 @@ Examples
seq ACGA ACGA + 4 7 ACGA
seq ACGA ACGA + 7 10 ACGA
-1. non-greedy mode (`-G`)
+1. Non-greedy mode (`-G`)
$ echo -e '>seq\nACGACGACGA' | seqkit locate -p ACGA -G | csvtk -t pretty
seqID patternName pattern strand start end matched
seq ACGA ACGA + 1 4 ACGA
seq ACGA ACGA + 7 10 ACGA
+
+1. Circular genome. Note that end position of matched subsequence that
+crossing genome sequence end would be greater than sequence length.
+
+ $ echo -e ">seq\nACGTTGCA"
+ >seq
+ ACGTTGCA
+
+ $ echo -e ">seq\nACGTTGCA" \
+ | seqkit locate -i -p aa
+ seqID patternName pattern strand start end matched
+ seq aa aa - 4 5 aa
+
+ $ echo -e ">seq\nACGTTGCA" \
+ | seqkit locate -i -p aa -c \
+ | csvtk pretty -t
+ seqID patternName pattern strand start end matched
+ seq aa aa + 8 9 aa
+ seq aa aa - 4 5 aa
+
## fish
@@ -2127,8 +2187,8 @@ Flags:
-f, --force overwrite output directory
-h, --help help for split2
-O, --out-dir string output directory (default value is $infile.split)
- -1, --read1 string read1 file
- -2, --read2 string read2 file
+ -1, --read1 string (gzipped) read1 file
+ -2, --read2 string (gzipped) read2 file
```
Examples
@@ -2177,6 +2237,65 @@ Examples
[INFO] write 1250 sequences to file: out/reads_1.part_001.fq.gz
[INFO] write 1250 sequences to file: out/reads_1.part_002.fq.gz
+## pair
+
+Usage
+
+```text
+match up paired-end reads from two fastq files
+
+Attensions:
+1. Orders of headers in the two files better be the same (not shuffled),
+ otherwise it consumes huge number of memory for buffering reads in memory.
+2. Unpaired reads are optional outputted with flag -u/--save-unpaired.
+3. If flag -O/--out-dir not given, output will be saved in the same directory
+ of input, with suffix "paired", e.g., read_1.paired.fq.gz.
+ Otherwise names are kept untouched in the given output directory.
+4. Paired gzipped files may be slightly larger than original files, because
+ of using different gzip package/library, don't worry.
+
+Usage:
+ seqkit pair [flags]
+
+Flags:
+ -f, --force overwrite output directory
+ -h, --help help for pair
+ -O, --out-dir string output directory
+ -1, --read1 string (gzipped) read1 file
+ -2, --read2 string (gzipped) read2 file
+ -u, --save-unpaired save unpaired reads if there are
+```
+
+Examples
+
+1. Simple one
+
+ $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz
+
+ # output
+ reads_1.paired.fq.gz
+ reads_2.paired.fq.gz
+
+2. Set output directory, file names are kept untouched.
+
+ $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz -O result
+
+ $ tree result
+ result/
+ ├── reads_1.fq.gz
+ └── reads_2.fq.gz
+
+3. Save unpaired reads if there are.
+
+ $ seqkit pair -1 reads_1.fq.gz -2 reads_2.fq.gz -O result -u
+
+ $ tree result
+ result
+ ├── reads_1.fq.gz
+ ├── reads_1.unpaired.fq.gz
+ ├── reads_2.fq.gz
+ └── reads_2.unpaired.fq.gz
+
## sample
=====================================
seqkit/cmd/grep.go
=====================================
@@ -46,6 +46,8 @@ var grepCmd = &cobra.Command{
Long: fmt.Sprintf(`search sequences by ID/name/sequence/sequence motifs, mismatch allowed
Attentions:
+ 0. By default, we match sequence ID with patterns, use "-n/--by-name"
+ for matching full name instead of just ID.
1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
(ID/full header) by default. Please switch "-r/--use-regexp" on
for partly matching.
@@ -91,6 +93,7 @@ Examples:
ignoreCase := getFlagBool(cmd, "ignore-case")
degenerate := getFlagBool(cmd, "degenerate")
region := getFlagString(cmd, "region")
+ circular := getFlagBool(cmd, "circular")
if len(pattern) == 0 && patternFile == "" {
checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
@@ -284,6 +287,11 @@ Examples:
} else if bySeq {
if limitRegion {
target = record.Seq.SubSeq(start, end).Seq
+ } else if circular {
+ // concat two copies of sequence, and do not change orginal sequence
+ target = make([]byte, len(record.Seq.Seq)*2)
+ copy(target[0:len(record.Seq.Seq)], record.Seq.Seq)
+ copy(target[len(record.Seq.Seq):], record.Seq.Seq)
} else {
target = record.Seq.Seq
}
@@ -375,12 +383,12 @@ func init() {
grepCmd.Flags().BoolP("use-regexp", "r", false, "patterns are regular expression")
grepCmd.Flags().BoolP("delete-matched", "", false, "delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions")
grepCmd.Flags().BoolP("invert-match", "v", false, "invert the sense of matching, to select non-matching records")
- grepCmd.Flags().BoolP("by-name", "n", false, "match by full name instead of just id")
+ grepCmd.Flags().BoolP("by-name", "n", false, "match by full name instead of just ID")
grepCmd.Flags().BoolP("by-seq", "s", false, "search subseq on seq, only positive strand is searched, and mismatch allowed using flag -m/--max-mismatch")
grepCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
grepCmd.Flags().BoolP("ignore-case", "i", false, "ignore case")
grepCmd.Flags().BoolP("degenerate", "d", false, "pattern/motif contains degenerate base")
grepCmd.Flags().StringP("region", "R", "", "specify sequence region for searching. "+
"e.g 1:12 for first 12 bases, -12:-1 for last 12 bases")
-
+ grepCmd.Flags().BoolP("circular", "c", false, "circular genome")
}
=====================================
seqkit/cmd/locate.go
=====================================
@@ -53,6 +53,9 @@ Mismatch is allowed using flag "-m/--max-mismatch",
but it's not fast enough for large genome like human genome.
Though, it's fast enough for microbial genomes.
+When using flag --circular, end position of matched subsequence that
+crossing genome sequence end would be greater than sequence length.
+
`,
Run: func(cmd *cobra.Command, args []string) {
config := getConfigs(cmd)
@@ -84,6 +87,7 @@ Though, it's fast enough for microbial genomes.
outFmtBED := getFlagBool(cmd, "bed")
mismatches := getFlagNonNegativeInt(cmd, "max-mismatch")
hideMatched := getFlagBool(cmd, "hide-matched")
+ circular := getFlagBool(cmd, "circular")
if len(pattern) == 0 && patternFile == "" {
checkError(fmt.Errorf("one of flags -p (--pattern) and -f (--pattern-file) needed"))
@@ -256,6 +260,11 @@ Though, it's fast enough for microbial genomes.
}
l = len(record.Seq.Seq)
+
+ if circular { // concat two copies of sequence
+ record.Seq.Seq = append(record.Seq.Seq, record.Seq.Seq...)
+ }
+
if !onlyPositiveStrand {
seqRP = record.Seq.RevCom()
}
@@ -272,7 +281,12 @@ Though, it's fast enough for microbial genomes.
checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
}
for _, i = range loc {
+ if circular && i+1 > l { // 2nd clone of original part
+ continue
+ }
+
begin = i + 1
+
end = i + len(pSeq)
if i+len(pSeq) > len(record.Seq.Seq) {
continue
@@ -333,6 +347,10 @@ Though, it's fast enough for microbial genomes.
checkError(fmt.Errorf("fail to search pattern '%s' on seq '%s': %s", pName, record.Name, err))
}
for _, i = range loc {
+ if circular && i+1 > l { // 2nd clone of original part
+ continue
+ }
+
begin = l - i - len(pSeq) + 1
end = l - i
if i+len(pSeq) > len(record.Seq.Seq) {
@@ -407,6 +425,11 @@ Though, it's fast enough for microbial genomes.
loc = []int{i, i + lpatten}
}
begin = offset + loc[0] + 1
+
+ if circular && begin > l { // 2nd clone of original part
+ break
+ }
+
end = offset + loc[1]
flag = true // check "duplicated" region
@@ -495,8 +518,16 @@ Though, it's fast enough for microbial genomes.
loc = []int{i, i + lpatten}
}
+ if circular && offset+loc[0]+1 > l { // 2nd clone of original part
+ break
+ }
+
begin = l - offset - loc[1] + 1
end = l - offset - loc[0]
+ if offset+loc[1] > l {
+ begin += l
+ end += l
+ }
flag = true
if useRegexp || degenerate {
@@ -582,4 +613,5 @@ func init() {
locateCmd.Flags().BoolP("bed", "", false, "output in BED6 format")
locateCmd.Flags().IntP("max-mismatch", "m", 0, "max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster")
locateCmd.Flags().BoolP("hide-matched", "M", false, "do not show matched sequences")
+ locateCmd.Flags().BoolP("circular", "c", false, `circular genome. type "seqkit locate -h" for details`)
}
=====================================
seqkit/cmd/mutate.go
=====================================
@@ -294,16 +294,16 @@ Examples:
if mIns.pos == 0 {
s = 0
record.Seq.Seq = append(record.Seq.Seq, mIns.seq...)
- copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:s+seqLen])
+ copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:seqLen])
copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq)
} else {
s, _, ok = seq.SubLocation(seqLen, mIns.pos, mIns.pos)
if !ok {
log.Warningf("[%s]: insertion mutation: position (%d) out of sequence length (%d)", record.ID, mIns.pos, seqLen)
} else {
- record.Seq.Seq = append(record.Seq.Seq, mIns.seq...)
- copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:s+seqLen])
- copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq)
+ record.Seq.Seq = append(record.Seq.Seq, mIns.seq...) // append insert sequence to the end
+ copy(record.Seq.Seq[s+len(mIns.seq):], record.Seq.Seq[s:seqLen]) // move sequence behind the insert location at the end, so leave space for IS.
+ copy(record.Seq.Seq[s:s+len(mIns.seq)], mIns.seq) // copy IS into desired region.
}
}
=====================================
seqkit/cmd/pair.go
=====================================
@@ -0,0 +1,372 @@
+// Copyright © 2016-2019 Wei Shen <shenwei356 at gmail.com>
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy
+// of this software and associated documentation files (the "Software"), to deal
+// in the Software without restriction, including without limitation the rights
+// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+// copies of the Software, and to permit persons to whom the Software is
+// furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in
+// all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+// THE SOFTWARE.
+
+package cmd
+
+import (
+ "bytes"
+ "fmt"
+ "io"
+ "os"
+ "path/filepath"
+ "runtime"
+
+ "github.com/cespare/xxhash"
+ "github.com/pkg/errors"
+
+ "github.com/shenwei356/bio/seq"
+ "github.com/shenwei356/bio/seqio/fastx"
+ "github.com/shenwei356/util/pathutil"
+ "github.com/shenwei356/xopen"
+ "github.com/spf13/cobra"
+)
+
+// pairCmd represents the pair command
+var pairCmd = &cobra.Command{
+ Use: "pair",
+ Short: "match up paired-end reads from two fastq files",
+ Long: `match up paired-end reads from two fastq files
+
+Attensions:
+1. Orders of headers in the two files better be the same (not shuffled),
+ otherwise it consumes huge number of memory for buffering reads in memory.
+2. Unpaired reads are optional outputted with flag -u/--save-unpaired.
+3. If flag -O/--out-dir not given, output will be saved in the same directory
+ of input, with suffix "paired", e.g., read_1.paired.fq.gz.
+ Otherwise names are kept untouched in the given output directory.
+4. Paired gzipped files may be slightly larger than original files, because
+ of using different gzip package/library, don't worry.
+
+`,
+ Run: func(cmd *cobra.Command, args []string) {
+ config := getConfigs(cmd)
+ alphabet := config.Alphabet
+ idRegexp := config.IDRegexp
+ lineWidth := 0
+ seq.AlphabetGuessSeqLengthThreshold = config.AlphabetGuessSeqLength
+ seq.ValidateSeq = false
+ runtime.GOMAXPROCS(config.Threads)
+
+ if len(args) > 0 {
+ checkError(errors.New("no positional arugments are allowed"))
+ }
+
+ read1 := getFlagString(cmd, "read1")
+ read2 := getFlagString(cmd, "read2")
+ if read1 == "" || read2 == "" {
+ checkError(fmt.Errorf("flag -1/--read1 and -2/--read2 needed"))
+ }
+ if read1 == read2 {
+ checkError(fmt.Errorf("values of flag -1/--read1 and -2/--read2 can not be the same"))
+ }
+
+ outdir := getFlagString(cmd, "out-dir")
+ force := getFlagBool(cmd, "force")
+ saveUnpaired := getFlagBool(cmd, "save-unpaired")
+
+ if outdir == "" {
+ outdir = filepath.Dir(read1)
+ }
+
+ var addSuffix bool
+ if outdir != "./" && outdir != "." {
+ existed, err := pathutil.DirExists(outdir)
+ checkError(err)
+ if existed {
+ empty, err := pathutil.IsEmpty(outdir)
+ checkError(err)
+ if !empty {
+ if force {
+ checkError(os.RemoveAll(outdir))
+ checkError(os.MkdirAll(outdir, 0755))
+ } else {
+ log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
+ }
+ }
+ } else {
+ checkError(os.MkdirAll(outdir, 0755))
+ }
+ } else if filepath.Clean(filepath.Dir(read1)) == filepath.Clean(outdir) {
+ addSuffix = true
+ }
+
+ var reader1, reader2 *fastx.Reader
+ var record1, record2 *fastx.Record
+
+ // readers
+ var err error
+ reader1, err = fastx.NewReader(alphabet, read1, idRegexp)
+ checkError(errors.Wrap(err, read1))
+ reader2, err = fastx.NewReader(alphabet, read2, idRegexp)
+ checkError(errors.Wrap(err, read2))
+
+ // out file 1
+ var outFile1, base1, suffix1 string
+ if addSuffix {
+ base1, suffix1 = filepathTrimExtension(filepath.Base(read1))
+ outFile1 = filepath.Join(outdir, base1+".paired"+suffix1)
+ } else {
+ outFile1 = filepath.Join(outdir, filepath.Base(read1))
+ }
+ outfh1, err := xopen.Wopen(outFile1)
+ checkError(errors.Wrap(err, outFile1))
+ defer outfh1.Close()
+
+ // out file 2
+ var outFile2, base2, suffix2 string
+ if addSuffix {
+ base2, suffix2 = filepathTrimExtension(filepath.Base(read2))
+ outFile2 = filepath.Join(outdir, base2+".paired"+suffix2)
+ } else {
+ outFile2 = filepath.Join(outdir, filepath.Base(read2))
+ }
+ outfh2, err := xopen.Wopen(outFile2)
+ checkError(errors.Wrap(err, outFile2))
+ defer outfh2.Close()
+
+ // load first records
+ record1, err = reader1.Read()
+ checkError(errors.Wrap(err, read1))
+ record2, err = reader2.Read()
+ checkError(errors.Wrap(err, read2))
+
+ // require fastq
+ if !reader1.IsFastq || !reader2.IsFastq {
+ checkError(fmt.Errorf("fastq files needed"))
+ }
+
+ // buffer for saving unpaired reads
+ m1 := make(map[uint64]*fastx.Record, 1024)
+ m2 := make(map[uint64]*fastx.Record, 1024)
+
+ var h1, h2 uint64
+ var ok1, ok2 bool
+ var r1, r2 *fastx.Record
+ var eof1, eof2 bool
+ var n uint64
+
+ for {
+ // break when finishing reading both files.
+ if eof1 && eof2 {
+ break
+ }
+
+ // paired
+ if !eof1 && !eof2 && bytes.Compare(record1.ID, record2.ID) == 0 { // same ID
+ // output paired reads
+ record1.FormatToWriter(outfh1, lineWidth)
+ record2.FormatToWriter(outfh2, lineWidth)
+ n++
+
+ // new read1
+ if !eof1 {
+ record1, err = reader1.Read()
+ if err != nil {
+ if err == io.EOF {
+ eof1 = true
+ // break
+ } else {
+ checkError(errors.Wrap(err, read1))
+ break
+ }
+ }
+ }
+
+ // new read2
+ if !eof2 {
+ record2, err = reader2.Read()
+ if err != nil {
+ if err == io.EOF {
+ eof2 = true
+ // break
+ } else {
+ checkError(errors.Wrap(err, read2))
+ break
+ }
+ }
+ }
+
+ continue
+ }
+
+ if !eof1 {
+ h1 = xxhash.Sum64(record1.ID)
+ if r2, ok2 = m2[h1]; ok2 { // found pair of record1 in m2
+ // output paired reads
+ record1.FormatToWriter(outfh1, lineWidth)
+ r2.FormatToWriter(outfh2, lineWidth)
+ n++
+
+ delete(m2, h1)
+ } else {
+ m1[h1] = record1.Clone()
+ }
+
+ // new read1
+ record1, err = reader1.Read()
+ if err != nil {
+ if err == io.EOF {
+ eof1 = true
+ // break
+ } else {
+ checkError(errors.Wrap(err, read1))
+ break
+ }
+ }
+ }
+
+ // ---
+
+ if !eof2 {
+ h2 = xxhash.Sum64(record2.ID)
+ if r1, ok1 = m1[h2]; ok1 { // found pair of record2 in m1
+ // output paired reads
+ r1.FormatToWriter(outfh1, lineWidth)
+ record2.FormatToWriter(outfh2, lineWidth)
+ n++
+
+ delete(m1, h2)
+ } else {
+ m2[h2] = record2.Clone()
+ }
+
+ // new read2
+ record2, err = reader2.Read()
+ if err != nil {
+ if err == io.EOF {
+ eof2 = true
+ // break
+ } else {
+ checkError(errors.Wrap(err, read2))
+ break
+ }
+ }
+ }
+ }
+
+ var outFile1U, outFile2U string
+ var outfh1U, outfh2U *xopen.Writer
+ var n1U, n2U uint64
+ if saveUnpaired {
+ if !addSuffix {
+ base1, suffix1 = filepathTrimExtension(filepath.Base(read1))
+ base2, suffix2 = filepathTrimExtension(filepath.Base(read2))
+ }
+ }
+
+ // left reads
+ if len(m1) > 0 && len(m2) > 0 {
+ for h1, r1 = range m1 {
+
+ if string(r1.ID) == "A00582:209:HWWJCDSXX:1:1101:8314:2581" {
+ fmt.Println("shit----")
+ }
+
+ if r2, ok2 = m2[h1]; ok2 {
+ // output paired reads
+ r1.FormatToWriter(outfh1, lineWidth)
+ r2.FormatToWriter(outfh2, lineWidth)
+ n++
+
+ if saveUnpaired { // delete paired reads in m2
+ // delete(m1, h1) // no need
+ delete(m2, h2)
+ }
+ } else if saveUnpaired { // unpaired reads in m1
+ if outfh1U == nil {
+ outFile1U = filepath.Join(outdir, base1+".unpaired"+suffix1)
+ outfh1U, err = xopen.Wopen(outFile1U)
+ checkError(errors.Wrap(err, outFile1U))
+ defer outfh1U.Close()
+ }
+ r1.FormatToWriter(outfh1U, lineWidth)
+ n1U++
+ }
+ }
+ if saveUnpaired {
+ for _, r2 = range m2 { // left unpaired reads in m2
+ if outfh2U == nil {
+ outFile2U = filepath.Join(outdir, base2+".unpaired"+suffix2)
+ outfh2U, err = xopen.Wopen(outFile2U)
+ checkError(errors.Wrap(err, outFile2U))
+ defer outfh2U.Close()
+ }
+
+ r2.FormatToWriter(outfh2U, lineWidth)
+ n2U++
+ }
+ }
+ } else if len(m1) > 0 {
+ if saveUnpaired {
+ for _, r1 = range m1 { // all reads in m1 are unpaired
+ if outfh1U == nil {
+ outFile1U = filepath.Join(outdir, base1+".unpaired"+suffix1)
+ outfh1U, err = xopen.Wopen(outFile1U)
+ checkError(errors.Wrap(err, outFile1U))
+ defer outfh1U.Close()
+ }
+
+ r1.FormatToWriter(outfh1U, lineWidth)
+ n1U++
+ }
+ }
+ } else if saveUnpaired { // len(m2) > 0
+ for _, r2 = range m2 { // all reads in m2 are unpaired
+ if outfh2U == nil {
+ outFile2U = filepath.Join(outdir, base2+".unpaired"+suffix2)
+ outfh2U, err = xopen.Wopen(outFile2U)
+ checkError(errors.Wrap(err, outFile2U))
+ defer outfh2U.Close()
+ }
+
+ r2.FormatToWriter(outfh2U, lineWidth)
+ n2U++
+ }
+ }
+
+ if !config.Quiet {
+ log.Infof("%d paired-end reads saved to %s and %s", n, outFile1, outFile2)
+ if saveUnpaired {
+ if n1U > 0 {
+ log.Infof("%d unpaired reads saved to %s", n1U, outFile1U)
+ } else {
+ log.Infof("no unpaired reads in %s", read1)
+ }
+
+ if n2U > 0 {
+ log.Infof("%d unpaired reads saved to %s", n2U, outFile2U)
+ } else {
+ log.Infof("no unpaired reads in %s", read2)
+ }
+ }
+ }
+
+ },
+}
+
+func init() {
+ RootCmd.AddCommand(pairCmd)
+
+ pairCmd.Flags().StringP("read1", "1", "", "(gzipped) read1 file")
+ pairCmd.Flags().StringP("read2", "2", "", "(gzipped) read2 file")
+ pairCmd.Flags().StringP("out-dir", "O", "", "output directory")
+ pairCmd.Flags().BoolP("force", "f", false, "overwrite output directory")
+ pairCmd.Flags().BoolP("save-unpaired", "u", false, "save unpaired reads if there are")
+}
=====================================
seqkit/cmd/rename.go
=====================================
@@ -88,12 +88,13 @@ Attention:
if !empty {
if force {
checkError(os.RemoveAll(outdir))
+ checkError(os.MkdirAll(outdir, 0755))
} else {
log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
}
}
} else {
- checkError(os.MkdirAll(outdir, 0777))
+ checkError(os.MkdirAll(outdir, 0755))
}
}
}
=====================================
seqkit/cmd/replace.go
=====================================
@@ -76,6 +76,7 @@ Special replacement symbols (only for replacing name not sequence):
nrWidth := getFlagPositiveInt(cmd, "nr-width")
kvFile := getFlagString(cmd, "kv-file")
keepKey := getFlagBool(cmd, "keep-key")
+ keepUntouch := getFlagBool(cmd, "keep-untouch")
keyCaptIdx := getFlagPositiveInt(cmd, "key-capt-idx")
keyMissRepl := getFlagString(cmd, "key-miss-repl")
@@ -145,6 +146,7 @@ Special replacement symbols (only for replacing name not sequence):
var found [][]byte
var k string
var ok bool
+ var doNotChange bool
var record *fastx.Record
var fastxReader *fastx.Reader
nrFormat := fmt.Sprintf("%%0%dd", nrWidth)
@@ -168,8 +170,13 @@ Special replacement symbols (only for replacing name not sequence):
nr++
if bySeq {
+ if fastxReader.IsFastq {
+ checkError(fmt.Errorf("editing FASTQ is not supported"))
+ }
record.Seq.Seq = patternRegexp.ReplaceAll(record.Seq.Seq, replacement)
} else {
+ doNotChange = false
+
r = replacement
if replaceWithNR {
@@ -181,6 +188,7 @@ Special replacement symbols (only for replacing name not sequence):
if len(founds) > 1 {
checkError(fmt.Errorf(`pattern "%s" matches multiple targets in "%s", this will cause chaos`, p, record.Name))
}
+
if len(founds) > 0 {
found = founds[0]
if keyCaptIdx > len(found)-1 {
@@ -192,15 +200,21 @@ Special replacement symbols (only for replacing name not sequence):
}
if _, ok = kvs[k]; ok {
r = reKV.ReplaceAll(r, []byte(kvs[k]))
+ } else if keepUntouch {
+ doNotChange = true
} else if keepKey {
r = reKV.ReplaceAll(r, found[keyCaptIdx])
} else {
r = reKV.ReplaceAll(r, []byte(keyMissRepl))
}
+ } else {
+ doNotChange = true
}
}
- record.Name = patternRegexp.ReplaceAll(record.Name, r)
+ if !doNotChange {
+ record.Name = patternRegexp.ReplaceAll(record.Name, r)
+ }
}
record.FormatToWriter(outfh, config.LineWidth)
@@ -222,10 +236,11 @@ func init() {
`use ${1} instead of $1 when {kv} given!`)
replaceCmd.Flags().IntP("nr-width", "", 1, `minimum width for {nr} in flag -r/--replacement. e.g., formating "1" to "001" by --nr-width 3`)
// replaceCmd.Flags().BoolP("by-name", "n", false, "replace full name instead of just id")
- replaceCmd.Flags().BoolP("by-seq", "s", false, "replace seq")
+ replaceCmd.Flags().BoolP("by-seq", "s", false, "replace seq (only FASTA)")
replaceCmd.Flags().BoolP("ignore-case", "i", false, "ignore case")
replaceCmd.Flags().StringP("kv-file", "k", "",
`tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)`)
+ replaceCmd.Flags().BoolP("keep-untouch", "U", false, "do not change anything when no value found for the key (only for sequence name)")
replaceCmd.Flags().BoolP("keep-key", "K", false, "keep the key as value when no value found for the key (only for sequence name)")
replaceCmd.Flags().IntP("key-capt-idx", "I", 1, "capture variable index of key (1-based)")
replaceCmd.Flags().StringP("key-miss-repl", "m", "", "replacement for key with no corresponding value")
=====================================
seqkit/cmd/seq.go
=====================================
@@ -139,7 +139,7 @@ var seqCmd = &cobra.Command{
if outFile == "-" {
outfh = os.Stdout
} else {
- outfh, err = os.Open(outFile)
+ outfh, err = os.Create(outFile)
checkError(err)
}
defer outfh.Close()
=====================================
seqkit/cmd/sliding.go
=====================================
@@ -51,7 +51,7 @@ var slidingCmd = &cobra.Command{
files := getFileListFromArgsAndFile(cmd, args, true, "infile-list", true)
greedy := getFlagBool(cmd, "greedy")
- circular := getFlagBool(cmd, "circular-genome")
+ circular := getFlagBool(cmd, "circular-genome") || getFlagBool(cmd, "circular")
step := getFlagInt(cmd, "step")
window := getFlagInt(cmd, "window")
if step == 0 || window == 0 {
@@ -147,5 +147,6 @@ func init() {
slidingCmd.Flags().IntP("step", "s", 0, "step size")
slidingCmd.Flags().IntP("window", "W", 0, "window size")
slidingCmd.Flags().BoolP("greedy", "g", false, "greedy mode, i.e., exporting last subsequences even shorter than windows size")
- slidingCmd.Flags().BoolP("circular-genome", "C", false, "circular genome.")
+ slidingCmd.Flags().BoolP("circular-genome", "C", false, "circular genome (same to -c/--circular)")
+ slidingCmd.Flags().BoolP("circular", "c", false, "circular genome (same to -C/--circular-genome)")
}
=====================================
seqkit/cmd/split.go
=====================================
@@ -122,12 +122,13 @@ Examples:
if !empty {
if force {
checkError(os.RemoveAll(outdir))
+ checkError(os.MkdirAll(outdir, 0755))
} else {
log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
}
}
} else {
- checkError(os.MkdirAll(outdir, 0777))
+ checkError(os.MkdirAll(outdir, 0755))
}
}
}
=====================================
seqkit/cmd/split2.go
=====================================
@@ -169,12 +169,13 @@ according to the input files.
if !empty {
if force {
checkError(os.RemoveAll(outdir))
+ checkError(os.MkdirAll(outdir, 0755))
} else {
log.Warningf("outdir not empty: %s, you can use --force to overwrite", outdir)
}
}
} else {
- checkError(os.MkdirAll(outdir, 0777))
+ checkError(os.MkdirAll(outdir, 0755))
}
}
@@ -325,8 +326,8 @@ according to the input files.
func init() {
RootCmd.AddCommand(split2Cmd)
- split2Cmd.Flags().StringP("read1", "1", "", "read1 file")
- split2Cmd.Flags().StringP("read2", "2", "", "read2 file")
+ split2Cmd.Flags().StringP("read1", "1", "", "(gzipped) read1 file")
+ split2Cmd.Flags().StringP("read2", "2", "", "(gzipped) read2 file")
split2Cmd.Flags().IntP("by-size", "s", 0, "split sequences into multi parts with N sequences")
split2Cmd.Flags().IntP("by-part", "p", 0, "split sequences into N parts")
split2Cmd.Flags().StringP("by-length", "l", "", "split sequences into chunks of N bases, supports K/M/G suffix")
=====================================
seqkit/cmd/stat.go
=====================================
@@ -33,6 +33,7 @@ import (
"github.com/dustin/go-humanize"
"github.com/shenwei356/bio/seq"
"github.com/shenwei356/bio/seqio/fastx"
+ "github.com/shenwei356/bio/util"
"github.com/shenwei356/util/byteutil"
"github.com/shenwei356/util/math"
"github.com/shenwei356/xopen"
@@ -141,7 +142,7 @@ Tips:
info.lenAvg,
info.lenMax))
} else {
- outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+ outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
info.file,
info.format,
info.t,
@@ -177,7 +178,7 @@ Tips:
info1.lenAvg,
info1.lenMax))
} else {
- outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+ outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
info1.file,
info1.format,
info1.t,
@@ -230,7 +231,7 @@ Tips:
info.lenAvg,
info.lenMax))
} else {
- outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%d\t%d\t%d\t%d\t%d\t%.2f\t%.2f\n",
+ outfh.WriteString(fmt.Sprintf("%s\t%s\t%s\t%d\t%d\t%d\t%.1f\t%d\t%.1f\t%.1f\t%.1f\t%d\t%d\t%.2f\t%.2f\n",
info.file,
info.format,
info.t,
@@ -288,13 +289,13 @@ Tips:
<-token
}()
- var num, l, lenMin, lenMax, lenSum, gapSum int64
- var n50, sum, l50 int64
- var q1, q2, q3 int64
+ var gapSum uint64
+
+ lensStats := util.NewLengthStats()
+
var q20, q30 int64
var q byte
var encodeOffset int = fqEncoding.Offset()
- var lens sortutil.Int64Slice
var seqFormat, t string
var record *fastx.Record
var fastxReader *fastx.Reader
@@ -315,12 +316,7 @@ Tips:
}
seqFormat = ""
- num, lenMin, lenMax, lenSum, gapSum = 0, int64(^uint64(0)>>1), 0, 0, 0
- n50, sum, l50 = 0, 0, 0
- q20, q30 = 0, 0
- if all {
- lens = make(sortutil.Int64Slice, 0, 1000)
- }
+
for {
record, err = fastxReader.Read()
if err != nil {
@@ -342,7 +338,6 @@ Tips:
break
}
- num++
if seqFormat == "" {
if len(record.Seq.Qual) > 0 {
seqFormat = "FASTQ"
@@ -350,17 +345,9 @@ Tips:
seqFormat = "FASTA"
}
}
- l = int64(len(record.Seq.Seq))
- if all {
- lens = append(lens, l)
- }
- lenSum += l
- if l < lenMin {
- lenMin = l
- }
- if l > lenMax {
- lenMax = l
- }
+
+ lensStats.Add(uint64(len(record.Seq.Seq)))
+
if all {
if fastxReader.IsFastq {
for _, q = range record.Seq.Qual {
@@ -373,7 +360,7 @@ Tips:
}
}
- gapSum += int64(byteutil.CountBytes(record.Seq.Seq, gapLettersBytes))
+ gapSum += uint64(byteutil.CountBytes(record.Seq.Seq, gapLettersBytes))
}
}
@@ -387,17 +374,13 @@ Tips:
t = fastxReader.Alphabet().String()
}
+ var n50 uint64
+ var l50 int
+ var q1, q2, q3 float64
if all {
- sort.Sort(lens)
- for i := num - 1; i >= 0; i-- {
- sum += lens[i]
- if (sum << 1) >= lenSum {
- n50 = lens[i]
- l50 = num - i
- break
- }
- }
- q1, q2, q3 = quartile(lens)
+ n50 = lensStats.N50()
+ l50 = lensStats.L50()
+ q1, q2, q3 = lensStats.Q1(), lensStats.Q2(), lensStats.Q3()
}
select {
@@ -405,7 +388,7 @@ Tips:
return
default:
}
- if num == 0 {
+ if lensStats.Count() == 0 {
if basename {
file = filepath.Base(file)
}
@@ -414,8 +397,8 @@ Tips:
}
ch <- statInfo{file, seqFormat, t,
0, 0, 0, 0,
- 0, lenMax, 0, 0,
- q1, q2, q3,
+ 0, 0, 0, 0,
+ 0, 0, 0,
0, 0,
nil, id}
} else {
@@ -426,10 +409,10 @@ Tips:
file = stdinLabel
}
ch <- statInfo{file, seqFormat, t,
- num, lenSum, gapSum, lenMin,
- math.Round(float64(lenSum)/float64(num), 1), lenMax, n50, l50,
+ lensStats.Count(), lensStats.Sum(), gapSum, lensStats.Min(),
+ math.Round(lensStats.Mean(), 1), lensStats.Max(), n50, l50,
q1, q2, q3,
- math.Round(float64(q20)/float64(lenSum)*100, 2), math.Round(float64(q30)/float64(lenSum)*100, 2),
+ math.Round(float64(q20)/float64(lensStats.Sum())*100, 2), math.Round(float64(q30)/float64(lensStats.Sum())*100, 2),
nil, id}
}
}(file, id)
@@ -485,26 +468,26 @@ Tips:
info.file,
info.format,
info.t,
- humanize.Comma(info.num),
- humanize.Comma(info.lenSum),
- humanize.Comma(info.lenMin),
+ humanize.Comma(int64(info.num)),
+ humanize.Comma(int64(info.lenSum)),
+ humanize.Comma(int64(info.lenMin)),
humanize.Commaf(info.lenAvg),
- humanize.Comma(info.lenMax))
+ humanize.Comma(int64(info.lenMax)))
} else {
tbl.AddRow(
info.file,
info.format,
info.t,
- humanize.Comma(info.num),
- humanize.Comma(info.lenSum),
- humanize.Comma(info.lenMin),
+ humanize.Comma(int64(info.num)),
+ humanize.Comma(int64(info.lenSum)),
+ humanize.Comma(int64(info.lenMin)),
humanize.Commaf(info.lenAvg),
- humanize.Comma(info.lenMax),
- humanize.Comma(info.Q1),
- humanize.Comma(info.Q2),
- humanize.Comma(info.Q3),
- humanize.Comma(info.gapSum),
- humanize.Comma(info.N50),
+ humanize.Comma(int64(info.lenMax)),
+ humanize.Commaf(info.Q1),
+ humanize.Commaf(info.Q2),
+ humanize.Commaf(info.Q3),
+ humanize.Comma(int64(info.gapSum)),
+ humanize.Comma(int64(info.N50)),
humanize.Commaf(info.q20),
humanize.Commaf(info.q30),
// humanize.Comma(info.L50),
@@ -519,19 +502,23 @@ type statInfo struct {
file string
format string
t string
- num int64
- lenSum int64
- gapSum int64
- lenMin int64
+
+ num uint64
+ lenSum uint64
+ gapSum uint64
+ lenMin uint64
+
lenAvg float64
- lenMax int64
- N50 int64
- L50 int64
- Q1 int64
- Q2 int64
- Q3 int64
- q20 float64
- q30 float64
+ lenMax uint64
+ N50 uint64
+ L50 int
+
+ Q1 float64
+ Q2 float64
+ Q3 float64
+
+ q20 float64
+ q30 float64
err error
id uint64
=====================================
seqkit/cmd/translate.go
=====================================
@@ -123,6 +123,7 @@ Translate Tables/Genetic Codes:
markInitCodonAsM := getFlagBool(cmd, "init-codon-as-M")
listTable := getFlagInt(cmd, "list-transl-table")
listTableAmb := getFlagInt(cmd, "list-transl-table-with-amb-codons")
+ appendFrame := getFlagBool(cmd, "append-frame")
outfh, err := xopen.Wopen(outFile)
checkError(err)
@@ -192,7 +193,11 @@ Translate Tables/Genetic Codes:
}
checkError(err)
- outfh.WriteString(fmt.Sprintf(">%s_frame=%d %s\n", record.ID, frame, record.Desc))
+ if appendFrame {
+ outfh.WriteString(fmt.Sprintf(">%s_frame=%d %s\n", record.ID, frame, record.Desc))
+ } else {
+ outfh.WriteString(">" + string(record.Name) + "\n")
+ }
outfh.Write(byteutil.WrapByteSlice(_seq.Seq, config.LineWidth))
outfh.WriteString("\n")
}
@@ -213,4 +218,5 @@ func init() {
translateCmd.Flags().BoolP("init-codon-as-M", "M", false, "translate initial codon at beginning to 'M'")
translateCmd.Flags().IntP("list-transl-table", "l", -1, "show details of translate table N, 0 for all")
translateCmd.Flags().IntP("list-transl-table-with-amb-codons", "L", -1, "show details of translate table N (including ambigugous codons), 0 for all. ")
+ translateCmd.Flags().BoolP("append-frame", "F", false, "append frame information to sequence ID")
}
=====================================
seqkit/cmd/version.go
=====================================
@@ -29,7 +29,7 @@ import (
)
// VERSION of seqkit
-const VERSION = "0.13.2"
+const VERSION = "0.14.0"
// versionCmd represents the version command
var versionCmd = &cobra.Command{
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/47b180fb012f6d67509daf94dec938c2866be13a
--
View it on GitLab: https://salsa.debian.org/med-team/seqkit/-/commit/47b180fb012f6d67509daf94dec938c2866be13a
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20201030/cf04a71f/attachment-0001.html>
More information about the debian-med-commit
mailing list