[med-svn] [Git][med-team/any2fasta][upstream] New upstream version 0.8.1
Andreas Tille (@tille)
gitlab at salsa.debian.org
Fri Apr 10 21:25:07 BST 2026
Andreas Tille pushed to branch upstream at Debian Med / any2fasta
Commits:
d2181828 by Andreas Tille at 2026-04-10T21:06:50+02:00
New upstream version 0.8.1
- - - - -
25 changed files:
- + .github/workflows/CI.yml
- .gitignore
- − .travis.yml
- + CODE_OF_CONDUCT.md
- README.md
- any2fasta
- + environment.yml
- + paper/paper.bib
- + paper/paper.md
- test.clw → test/test.clw
- test.embl → test/test.embl
- test.fna → test/test.fna
- + test/test.fna.bz2
- + test/test.fna.gz
- + test/test.fna.xz
- + test/test.fna.zip
- + test/test.fna.zst
- test.fq → test/test.fq
- test.gbk → test/test.gbk
- test.gfa → test/test.gfa
- test.gff → test/test.gff
- test.noseq.gff → test/test.noseq.gff
- + test/test.pdb
- + test/test.sh
- test.sth → test/test.sth
Changes:
=====================================
.github/workflows/CI.yml
=====================================
@@ -0,0 +1,30 @@
+name: CI
+
+on:
+ push:
+ branches: [ master ]
+ pull_request:
+ branches: [ master ]
+
+jobs:
+ tests:
+ runs-on: ubuntu-latest
+ defaults:
+ run:
+ shell: bash -el {0}
+ steps:
+ - name: Pull code
+ uses: actions/checkout@v5
+
+ - name: Setup conda
+ uses: conda-incubator/setup-miniconda@v3
+ with:
+ activate-environment: any2fasta
+ environment-file: environment.yml
+ miniforge-version: latest
+ channels: conda-forge,bioconda
+ channel-priority: strict
+ auto-update-conda: true
+
+ - name: Run BATS test suite
+ run: bats test/test.sh
=====================================
.gitignore
=====================================
@@ -36,3 +36,5 @@ inc/
# backups
*~
+bug/
+dev/
=====================================
.travis.yml deleted
=====================================
@@ -1,50 +0,0 @@
-language: perl
-
-sudo: false
-
-perl:
- - "5.26"
-
-addons:
- apt:
- packages:
- - gzip
- - bzip2
- - zip
- - unzip
-
-install:
- - "export PATH=$PWD:$PATH"
-
-script:
- - "! any2fasta"
- - "any2fasta -v"
- - "any2fasta -h"
- - "any2fasta -h | grep github"
- - "! any2fasta -x"
- - "any2fasta /dev/null 2>&1 | grep 'ERROR'"
- - "any2fasta test.noseq.gff 2>&1 | grep 'ERROR'"
- - "any2fasta test.gbk | grep -m 3 '^>'"
- - "any2fasta test.gff | grep -m 3 '^>'"
- - "any2fasta test.fna | grep -m 3 '^>'"
- - "any2fasta test.gfa | grep -m 3 '^>'"
- - "any2fasta test.fq | grep -m 3 '^>'"
- - "any2fasta test.embl | grep -m 3 '^>'"
- - "any2fasta test.clw | grep -m 3 '^>'"
- - "any2fasta test.sth | grep -m 3 '^>'"
- - "any2fasta test.fna | grep 'CRYANT'"
- - "any2fasta -n test.fna | grep 'CNNANT'"
- - "any2fasta test.gfa | grep '^>24292$'"
- - "any2fasta test.fq | grep '^>ERR1163317.999'"
- - "any2fasta test.sth | grep '^>O83071'"
- - "any2fasta test.clw | grep '^>gene03'"
- - "any2fasta -l test.gbk | grep 'taagaatgagtagaaggttttga'"
- - "any2fasta -u test.gbk | grep 'TAAGAATGAGTAGAAGGTTTTGA'"
- - "any2fasta -u test.embl | grep 'K02675'"
- - "any2fasta -q -l -n test.fq | wc -l | grep '^2000$'"
- - "any2fasta - < test.gbk | grep -m 1 -F 'NZ_AHMY02000074'"
- - "gzip -c test.gbk | any2fasta - | grep -m 1 -F 'NZ_AHMY02000074'"
- - "bzip2 -c test.gbk | any2fasta - | grep -m 1 -F 'NZ_AHMY02000074'"
- - "zip test.gbk.zip test.gbk"
- - "any2fasta test.gbk.zip | grep -m 1 -F 'NZ_AHMY02000074'"
- - echo -e ">1\nA\n\n>2\nT\n" | ./any2fasta -q - | wc -l | grep '^4$'
=====================================
CODE_OF_CONDUCT.md
=====================================
@@ -0,0 +1,76 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at torsten.seemann+coc@gmail.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq
=====================================
README.md
=====================================
@@ -1,11 +1,21 @@
-[](https://travis-ci.org/tseemann/any2fasta)
+[](https://github.com/tseemann/any2fasta/actions/workflows/CI.yml)
+[](https://github.com/tseemann/any2fasta/releases)
[](https://www.gnu.org/licenses/gpl-3.0)

+[](https://anaconda.org/bioconda/any2fasta)
# any2fasta
Convert various sequence formats to FASTA
+## Quick start
+```
+% conda install -c bioconda any2fasta
+% any2fasta genome.gbk > genome.fasta
+% any2fasta seq.gbk.gz > seq.fasta
+% any2fasta protein.pdb.bz2 > protein.fasta
+```
+
## Motivation
You may wonder why this tool even exists. Well, I tried to do the right
@@ -29,6 +39,7 @@ It supports the following input formats:
6. CLUSTAL alignments, typically `.clw`, `.clu` (starts with `CLUSTAL` or `MUSCLE`)
7. STOCKHOLM alignments, typically `.sth` (starts with `# STOCKHOLM`)
8. GFA assembly graph, typically `.gfa` (starts with `^[A-Z]\t`)
+9. PDB protein data bank structure, typically `.pdb` (starts with `^HEADER`)
Files may be compressed with:
@@ -36,40 +47,42 @@ Files may be compressed with:
2. bzip2, typically `.bz2`
3. zip, typically `.zip`
+plus any other formats supported by
+your installed version of Perl's
+[`IO::Uncompress::AnyUncompress`](https://perldoc.perl.org/IO::Uncompress::AnyUncompress#DESCRIPTION)
+module.
+
## Installation
`any2fasta` has no dependencies except [Perl 5.10](https://www.perl.org/)
or higher. It only uses core modules, so no CPAN needed.
+### Conda
+```
+% conda install -c bioconda any2fasta
+```
### Direct script download
```
% cd /usr/local/bin # choose a folder in your $PATH
% wget https://raw.githubusercontent.com/tseemann/any2fasta/master/any2fasta
% chmod +x any2fasta
```
-### Homebrew
-```
-% brew install brewsci/bio/any2fasta # COMING SOON
-```
-### Conda
-```
-% conda install -c bioconda any2fasta # COMING SOON
-```
### Github
```
% git clone https://github.com/tseemann/any2fasta.git
-% cp any2fasta/any2fasta /usr/local/bin # choose a folder in your $PATH
+% cp any2fasta/any2fasta $HOME/.local/bin # choose a folder in your $PATH
```
## Test Installation
+### Simple check
```
% ./any2fasta -v
-any2fasta 0.2.2
+any2fasta 1.0.2
% ./any2fasta -h
NAME
- any2fasta 0.4.2
+ any2fasta 1.0.2
SYNOPSIS
Convert various sequence formats into FASTA
USAGE
@@ -78,11 +91,26 @@ OPTIONS
-h Print this help
-v Print version and exit
-q No output while running, only errors
- -n Replace ambiguous IUPAC letters with 'N'
+ -k Skip, don't die, on bad input files
+ -n Replace non-[AGTC] with 'N'
-l Lowercase the sequence
-u Uppercase the sequence
+ -g Include VERSION from GBK/EMBL files
+ -s Strip sequence descriptions (FASTA,FASTQ)
END
```
+### Extensive test
+```
+% bats $(dirname $(which any2fasta))/test/test.sh
+
+ ✓ Script syntax check
+ ✓ Version
+ ...
+ ✓ Multiple sequence with one bad one
+ ✓ Allow skipping over bad files
+
+29 tests, 0 failures, 2 skipped
+```
## Examples
```
@@ -111,6 +139,9 @@ END
* `-l` will lowercase all the letters
* `-u` will uppercase all the letters
* `-q` will prevent logging messages being printed
+* `-k` will warn about bad inputs and continue, rather than stopping with an error
+* `-g` will append the version to the sequence ID
+* `-s` removes `desc` from `>id desc` in FASTA,FASTQ,GFF
## Issues
@@ -123,4 +154,3 @@ Submit feedback to the [Issue Tracker](https://github.com/tseemann/any2fasta/iss
## Author
[Torsten Seemann](http://tseemann.github.io/)
-
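The "starts with" patterns in the README's format list describe a lookup on the first line of the input. A minimal shell sketch of that idea is given here; `sniff_format` is a hypothetical helper that is not part of any2fasta, and its `case` globs only approximate the Perl regexes in the script's `@formats` table (same order, same first-match-wins behaviour).

```shell
# Hypothetical helper, for illustration only: guess the format from line 1.
# Patterns are tried top to bottom, mirroring the order of @formats.
sniff_format() {
  tab=$(printf '\t')            # literal TAB for the GFA pattern
  case $1 in
    LOCUS[[:blank:]]*)          echo GENBANK ;;
    ID[[:blank:]]*)             echo EMBL ;;
    '>'[![:space:]]*)           echo FASTA ;;
    '@'[![:space:]]*)           echo FASTQ ;;
    '##gff'*)                   echo GFF ;;
    CLUST*|MUSCL*)              echo CLUSTAL ;;
    '# STOCKHOLM'[[:space:]]*)  echo STOCKHOLM ;;
    [A-Z]"$tab"*)               echo GFA ;;
    HEADER[[:blank:]]*)         echo PDB ;;
    *)                          return 1 ;;   # unfamiliar first line
  esac
}

sniff_format '>BAC_00002 Contig number 2'
```

The non-zero return for an unmatched line corresponds to the point where the script either dies or, with `-k`, warns and moves on to the next file.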
=====================================
any2fasta
=====================================
@@ -1,19 +1,34 @@
#!/usr/bin/env perl
+use 5.18.0;
use strict;
use warnings;
use Getopt::Std;
use File::Basename;
-use IO::Uncompress::AnyUncompress;
+use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);
+use IO::Handle;
#......................................................................................
-our $VERSION = "0.4.2";
+our $VERSION = "0.8.1";
our $EXE = basename($0);
our $STDIN = '/dev/stdin'; # '-' will be replaced with this
our $URL = 'https://github.com/tseemann/any2fasta';
#......................................................................................
+my %aa = (
+ 'ALA' => 'A', 'ARG' => 'R', 'ASN' => 'N', 'ASP' => 'D',
+ 'CYS' => 'C', 'GLU' => 'E', 'GLN' => 'Q', 'GLY' => 'G',
+ 'HIS' => 'H', 'ILE' => 'I', 'LEU' => 'L', 'LYS' => 'K',
+ 'MET' => 'M', 'PHE' => 'F', 'PRO' => 'P', 'SER' => 'S',
+ 'THR' => 'T', 'TRP' => 'W', 'TYR' => 'Y', 'VAL' => 'V',
+ # Optional: Inclusion of non-standard or ambiguous codes
+ 'SEC' => 'U', 'PYL' => 'O', 'ASX' => 'B', 'GLX' => 'Z',
+ 'XAA' => 'X', 'TER' => '*'
+);
+
+#......................................................................................
+
sub usage {
my($errcode) = @_;
$errcode ||= 0;
@@ -21,14 +36,17 @@ sub usage {
print $ofh
"NAME\n $EXE $VERSION\n",
"SYNOPSIS\n Convert various sequence formats into FASTA\n",
- "USAGE\n $EXE [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta\n",
+ "USAGE\n $EXE [opts] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip,zstd] > output.fasta\n",
"OPTIONS\n",
" -h Print this help\n",
" -v Print version and exit\n",
" -q No output while running, only errors\n",
- " -n Replace ambiguous IUPAC letters with 'N'\n",
+ " -k Skip, don't die, on bad input files\n",
+ " -n Replace non-[AGTC] with 'N'\n",
" -l Lowercase the sequence\n",
" -u Uppercase the sequence\n",
+ " -g Include VERSION (GENBANK,EMBL)\n",
+ " -s Strip sequence descriptions (FASTA,FASTQ)\n",
"HOMEPAGE\n $URL\n",
"END\n";
exit($errcode);
@@ -42,51 +60,78 @@ sub version {
#......................................................................................
my %opt;
-getopts('vhqnlu', \%opt) or exit(-1);
+getopts('vhqnlukgs', \%opt) or exit(-1);
$opt{v} and version();
$opt{h} and usage(0);
@ARGV or usage(1);
+my $skip = $opt{'k'};
-#......................................................................................
+# to debug broken pipe errors in CI
+#STDOUT->autoflush(1);
+#......................................................................................
sub msg {
print STDERR "@_\n" unless $opt{q};
}
-
+sub wrn {
+ msg("WARNING:", @_);
+}
sub err {
print STDERR "ERROR: @_\n";
exit(-1);
}
-
#......................................................................................
msg("This is $EXE $VERSION");
# regexp to function mapping
+# put in order of most-likely to be encountered
my @formats = (
- [ 'GENBANK', qr/^LOCUS\s/, \&parse_genbank ],
- [ 'EMBL', qr/^ID\s/, \&parse_embl ],
+ [ 'GENBANK', qr/^LOCUS\h/, \&parse_genbank ],
+ [ 'EMBL', qr/^ID\h/, \&parse_embl ],
+ [ 'FASTA', qr/^>\S/, \&parse_fasta ],
+ [ 'FASTQ', qr/^@\S/, \&parse_fastq ],
[ 'GFF', qr/^##gff/, \&parse_gff ],
[ 'CLUSTAL', qr/^(CLUST|MUSCL)/, \&parse_clustal ],
[ 'STOCKHOLM', qr/^# STOCKHOLM\s/, \&parse_stockholm ],
- [ 'FASTA', qr/^>/, \&parse_fasta ],
- [ 'FASTQ', qr/^@/, \&parse_fastq ],
[ 'GFA', qr/^[A-Z]\t/, \&parse_gfa ],
+ [ 'PDB', qr/^HEADER\h/, \&parse_pdb ],
);
# loop over all positional command line arguments
my $processed=0;
-for my $fname (@ARGV) {
- msg("Opening '$fname'");
- $fname = $STDIN if $fname eq '-';
- -d $fname and err("'$fname' is a directory not a file");
- -r $fname or err("'$fname' is not readable");
+# are we in '-k' skip mode?
+my $notify = $skip ? \&wrn :\&err;
+FILE:
+for my $infile (@ARGV) {
+ msg("Opening '$infile'");
+ $infile = $STDIN if $infile eq '-';
+ if (-d $infile) {
+ $notify->("'$infile' is a directory not a file");
+ next FILE;
+ }
+ if (! -r $infile) {
+ $notify->("'$infile' is not readable");
+ next FILE;
+ }
# read first line to see if we have any data
- my $unzip = IO::Uncompress::AnyUncompress->new($fname);
+ my $unzip = IO::Uncompress::AnyUncompress->new(
+ $infile,
+ 'MultiStream' => 1, # fix Issue #34
+ 'Transparent' => 1, # allow non-compressed data
+ 'BlockSize'=> (1 << 16), # in bytes
+ );
+ if (! $unzip) {
+ $notify->($AnyUncompressError);
+ next FILE;
+ }
my $header = scalar(<$unzip>); # read first line
- $header or err("The input appears to be empty");
+ if (not $header) {
+ $notify->("The input appears to be empty");
+ next FILE;
+ }
# detect format from first line
my $ok=0;
@@ -96,16 +141,16 @@ for my $fname (@ARGV) {
# read in the rest of the file now
my @line = ($header, <$unzip>);
my $lines = scalar(@line);
- msg("Read $lines lines from '$fname'");
+ msg("Read $lines lines from '$infile'");
# run the parser
my $nseq = $fmt->[2]->( \@line );
msg("Wrote $nseq sequences from", $fmt->[0], "file.");
- $nseq or err("No sequences found in '$fname'");
+ $nseq or $notify->("No sequences found in '$infile'");
$ok++;
last;
}
}
- $ok or err("Unfamilar format with first line: $header");
+ $ok or $notify->("Unfamiliar format with first line: $header");
$processed++;
}
@@ -116,7 +161,7 @@ exit(0);
#......................................................................................
-sub purify {
+sub purify_dna {
my($dna) = @_;
$dna =~ s/[^ATGCN\n\r-]/N/gi if $opt{n};
$dna = lc($dna) if $opt{l};
@@ -126,16 +171,28 @@ sub purify {
#......................................................................................
+sub purify_id {
+ my($id) = @_;
+ # \V is any non-vertical-space character (vertical space: e.g. CR, NL)
+ # \h is horizontal space, e.g. space, tab
+ # we don't use $ as that strips the newline
+ $id =~ s/\h\V*// if $opt{'s'};
+ return $id;
+}
+
+#......................................................................................
+
sub parse_fasta {
my($lines) = @_;
my $count=0;
for my $line (@$lines) {
next if $line =~ m/^\s*$/;
if ($line =~ m/^>/) {
+ $line = purify_id($line);
$count++;
}
else {
- $line = purify($line);
+ $line = purify_dna($line);
}
print $line;
}
@@ -149,8 +206,9 @@ sub parse_fastq {
my $count=0;
# jump 4 lines at a time
for ( my $i=0 ; $i < $#{$lines} ; $i+=4 ) {
+ $lines->[$i] = purify_id($lines->[$i]);
print ">", substr($lines->[$i], 1);
- print purify( $lines->[$i+1] );
+ print purify_dna( $lines->[$i+1] );
$count++;
}
return $count;
@@ -160,12 +218,15 @@ sub parse_fastq {
sub parse_gff {
my($lines) = @_;
- my $at_seq = 0;
- for my $line (@$lines) {
- $at_seq++ if $line =~ m/^>/;
- print purify($line) if $at_seq;;
+ # Skip past until we get to ##FASTA section
+ while (my $line = shift @$lines) {
+ if ($line =~ m/^>/) {
+ unshift @$lines, $line;
+ # let the FASTA parser do the work now
+ return parse_fasta(\@$lines);
+ }
}
- return $at_seq;
+ return 0;
}
#......................................................................................
@@ -182,7 +243,7 @@ sub parse_gfa {
# this is NOT the original contigs, rather the unitigs
# need to parse L (link) and P (path) to reconstruct the contigs
if ($x[0] eq 'S') {
- print ">", $x[1], "\n", purify($x[2]), "\n";
+ print ">", $x[1], "\n", purify_dna($x[2]), "\n";
$count++;
}
}
@@ -201,8 +262,7 @@ sub parse_genbank {
foreach (@$lines) {
chomp;
if (m{^//}) {
- $dna = purify($dna);
- print ">", $acc, "\n", purify($dna);
+ print ">", $acc, "\n", purify_dna($dna);
$count++;
$in_seq = 0;
$dna = $acc = '';
@@ -225,6 +285,11 @@ sub parse_genbank {
if (m/^LOCUS\s+(\S+)/) {
$acc = $1;
}
+ # VERSION NZ_AHMY02000075.1
+ # this is not elsif in case there is no VERSION
+ if ($opt{g} and m/^VERSION\s+(\S+)/) {
+ $acc = $1;
+ }
}
}
return $count;
@@ -243,7 +308,7 @@ sub parse_embl {
chomp;
if (m{^//}) {
# end of record
- print ">", $acc, "\n", purify($dna);
+ print ">", $acc, "\n", purify_dna($dna);
$count++;
$in_seq = 0;
$dna = $acc = '';
@@ -264,6 +329,9 @@ sub parse_embl {
# ID K02675; SV 1; linear; genomic DNA; STD; UNC; 569 BP.
if (m/^ID\s+([^;]+)/) {
$acc = $1;
+ if ($opt{g} and m/\bSV\s+(\d+)/) {
+ $acc .= ".$1";
+ }
}
}
}
@@ -275,12 +343,15 @@ sub parse_embl {
sub parse_clustal {
my($lines) = @_;
my %seq;
+ my @order; # keep track of sequence identifiers order
foreach (@$lines) {
- next unless m/^(\S+)\s+([A-Z-]+)$/i; # uses '-' for gap
+ # uses '-' for gap, ignore optional position number
+ next unless m'^(\S+)\s+([A-Z-]+)(?:\s+\d+)?$'i;
+ push @order, $1 unless exists $seq{$1};
$seq{$1} .= $2."\n";
}
- for my $id (sort keys %seq) {
- print ">", $id, "\n", purify($seq{$id});
+ for my $id (@order) {
+ print ">", $id, "\n", purify_dna($seq{$id});
}
return scalar(keys %seq);
}
@@ -290,16 +361,42 @@ sub parse_clustal {
sub parse_stockholm {
my($lines) = @_;
my %seq;
+ my @order; # keep track of sequence identifiers order
foreach (@$lines) {
next if m/^#/;
last if m{^//};
- next unless m/^(\S+)\s+([A-Z.]+)$/i; # uses '.' for gap
+ # uses '.' and also '-' for gap, optional position number
+ next unless m'^(\S+)\s+([A-Z.-]+)(?:\s+\d+)?$'i;
my($id,$sq) = ($1,$2);
$sq =~ s/\./-/g; # switch to standard FASTA '-' gap char
+ push @order, $id unless exists $seq{$id};
$seq{$id} .= $sq . "\n";
}
+ for my $id (@order) {
+ print ">", $id, "\n", purify_dna($seq{$id});
+ }
+ return scalar(keys %seq);
+}
+
+#......................................................................................
+
+sub parse_pdb {
+ my($lines) = @_;
+ my %seq;
+ # HEADER IMMUNE SYSTEM 06-MAR-00 1EK3
+ my @hdr = split ' ', $lines->[0];
+ my $prefix = $hdr[-1];
+ foreach (@$lines) {
+ #SEQRES 9 A 114 GLY GLY GLY THR LYS VAL GLU ILE LYS ARG
+ #SEQRES 1 B 114 ASP ILE VAL MET THR GLN SER PRO ASP SER LEU ALA VAL
+ next unless m/^SEQRES/;
+ my @x = split ' ', $_;
+ $seq{"$prefix-$x[2]"} .=
+ join( '', map { $aa{uc($_)} || 'X' }
+ splice(@x,4) );
+ }
for my $id (sort keys %seq) {
- print ">", $id, "\n", purify($seq{$id});
+ print ">", $id, "\n", purify_dna($seq{$id}), "\n";
}
return scalar(keys %seq);
}
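The new `parse_pdb` routine turns each `SEQRES` record into one-letter residues via the `%aa` table and appends them to a per-chain sequence. The shell sketch below illustrates only the per-record step. The function names `three_to_one` and `seqres_to_aa` are hypothetical, only the 20 standard residues are mapped (the script's extra codes such as `SEC` or `PYL` are omitted), and input is assumed to be upper case already, whereas the Perl code calls `uc`.

```shell
# Hypothetical helpers, for illustration only.
# Map one three-letter residue code to its one-letter code; unknown -> X.
three_to_one() {
  case $1 in
    ALA) echo A ;; ARG) echo R ;; ASN) echo N ;; ASP) echo D ;;
    CYS) echo C ;; GLU) echo E ;; GLN) echo Q ;; GLY) echo G ;;
    HIS) echo H ;; ILE) echo I ;; LEU) echo L ;; LYS) echo K ;;
    MET) echo M ;; PHE) echo F ;; PRO) echo P ;; SER) echo S ;;
    THR) echo T ;; TRP) echo W ;; TYR) echo Y ;; VAL) echo V ;;
    *)   echo X ;;
  esac
}

# Convert one SEQRES record to a one-letter string.
# Fields 1-4 (SEQRES, serial, chain, length) are skipped, as in splice(@x,4).
# The body runs in a subshell so 'set -f' does not leak to the caller.
seqres_to_aa() (
  set -f                  # no filename expansion while word-splitting
  set -- $1               # split the record on whitespace
  [ "$#" -gt 4 ] || exit 1
  shift 4
  out=
  for res in "$@"; do
    out=$out$(three_to_one "$res")
  done
  printf '%s\n' "$out"
)

seqres_to_aa 'SEQRES 9 A 114 GLY GLY GLY THR LYS VAL GLU ILE LYS ARG'
```

In the script itself the result is keyed by `"$prefix-$x[2]"`, that is, the PDB ID from the `HEADER` line plus the chain letter, which is why the test suite looks for `>1EK3-B`.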
=====================================
environment.yml
=====================================
@@ -0,0 +1,11 @@
+channels:
+ - conda-forge
+ - bioconda
+dependencies:
+ - perl >=5.18.0
+ - gzip
+ - bzip2
+ - unzip
+ - xz
+ - zstd
+ - bats-core
=====================================
paper/paper.bib
=====================================
@@ -0,0 +1,53 @@
+@article{rice2000emboss,
+ title={EMBOSS: the European molecular biology open software suite},
+ author={Rice, Peter and Longden, Ian and Bleasby, Alan},
+ journal={Trends in genetics},
+ volume={16},
+ number={6},
+ pages={276--277},
+ year={2000},
+ publisher={Elsevier}
+}
+
+@article{gilbert2003readseq,
+ title={Sequence File Format Conversion with Command-Line Readseq},
+ author={Gilbert, Don},
+ journal={Current protocols in bioinformatics},
+ number={1},
+ pages={A--1E},
+ year={2003},
+ publisher={Wiley Online Library}
+}
+
+@article{stajich2002bioperl,
+ title={The Bioperl toolkit: Perl modules for the life sciences},
+ author={Stajich, Jason E and Block, David and Boulez, Kris and Brenner, Steven E and Chervitz, Stephen A and Dagdigian, Chris and Fuellen, Georg and Gilbert, James GR and Korf, Ian and Lapp, Hilmar and others},
+ journal={Genome research},
+ volume={12},
+ number={10},
+ pages={1611--1618},
+ year={2002},
+ publisher={Cold Spring Harbor Lab}
+}
+
+@article{cock2009biopython,
+ title={Biopython: freely available Python tools for computational molecular biology and bioinformatics},
+ author={Cock, Peter JA and Antao, Tiago and Chang, Jeffrey T and Chapman, Brad A and Cox, Cymon J and Dalke, Andrew and Friedberg, Iddo and Hamelryck, Thomas and Kauff, Frank and Wilczynski, Bartek and others},
+ journal={Bioinformatics},
+ volume={25},
+ number={11},
+ pages={1422--1423},
+ year={2009},
+ publisher={Oxford University Press}
+}
+
+@article{pearson1988fasta,
+ title={Improved tools for biological sequence comparison},
+ author={Pearson, William R and Lipman, David J},
+ journal={Proceedings of the National Academy of Sciences},
+ volume={85},
+ number={8},
+ pages={2444--2448},
+ year={1988},
+ publisher={National Acad Sciences}
+}
=====================================
paper/paper.md
=====================================
@@ -0,0 +1,53 @@
+---
+title: 'any2fasta: convert various sequence and alignment formats to FASTA'
+tags:
+ - bioinformatics
+ - genomics
+ - file format conversion
+authors:
+ - name: Torsten Seemann
+ orcid: 0000-0001-6046-610X
+ affiliation: "1, 2"
+affiliations:
+ - name: Melbourne Bioinformatics, The University of Melbourne, Parkville, Australia.
+ index: 1
+ - name: Doherty Applied Microbial Genomics, Department of Microbiology and Immunology, The University of Melbourne, Parkville, Australia.
+ index: 2
+date: 18 October 2018
+bibliography: paper.bib
+---
+
+# Summary
+
+FASTA is a simple and pervasive plain text file format for storing
+genetic sequence data [@pearson1988fasta]. There exist many other
+richer formats for storing sequences and associated annotations
+and meta-data, such as the
+Genbank and EMBL flat files (http://www.insdc.org/documents/feature-table).
+These formats often need to be converted to FASTA for use in
+downstream software that only handles the FASTA format.
+Common tools for format conversion are
+EMBOSS `seqret` [@rice2000emboss] and `readseq` [@gilbert2003readseq].
+Unfortunately, these tools mangle sequence identifiers containing
+characters such as `|` and `.`. Furthermore, they offer no way to fix the behaviour
+and have not seen any development activity in years.
+Custom scripts using the Bioperl [@stajich2002bioperl]
+or Biopython [@cock2009biopython] libraries are available,
+but these are heavyweight solutions for a relatively simple problem.
+
+Here, I present a new software tool called `any2fasta` written
+as a single Perl script with no dependencies. It can read the
+Genbank, EMBL, GFF, FASTA, FASTQ and GFA sequence formats,
+as well as the CLUSTAL and STOCKHOLM sequence alignment formats.
+The input files can be of mixed type, and may be compressed with
+`gzip`, `bzip2` or `zip`. `any2fasta` is fast because it only
+parses those parts of the input files needed to extract the
+sequence and its identifier.
+
+# Acknowledgements
+
+This work was supported by a National Health and Medical
+Research Council of Australia Project Grant (ID 1149991).
+
+# References
+
=====================================
test.clw → test/test.clw
=====================================
=====================================
test.embl → test/test.embl
=====================================
=====================================
test.fna → test/test.fna
=====================================
=====================================
test/test.fna.bz2
=====================================
Binary files /dev/null and b/test/test.fna.bz2 differ
=====================================
test/test.fna.gz
=====================================
Binary files /dev/null and b/test/test.fna.gz differ
=====================================
test/test.fna.xz
=====================================
Binary files /dev/null and b/test/test.fna.xz differ
=====================================
test/test.fna.zip
=====================================
Binary files /dev/null and b/test/test.fna.zip differ
=====================================
test/test.fna.zst
=====================================
Binary files /dev/null and b/test/test.fna.zst differ
=====================================
test.fq → test/test.fq
=====================================
=====================================
test.gbk → test/test.gbk
=====================================
=====================================
test.gfa → test/test.gfa
=====================================
=====================================
test.gff → test/test.gff
=====================================
@@ -12770,7 +12770,7 @@ ATTATTCAGGACAGCACTATTAATATGCTGCGCCTCAGCGATCCCGACGCTGCACTCATT
ATCAGCCGGGGCCAGATGCAGGAAGGGGACGAATTAGCCTCCCAGATTGAACAGCAAATG
AAAAAACTGGAGAAACAGGTAAAGGATCTTCACTACACGCCAGTTCAGGTAACACGGGTA
GGGATTAATGACGGTGAA
->BAC_00002
+>BAC_00002 Contig number 2
AGATGCCAGGTATGTGGATATACAGCGAACGCCGATGTAAATGGCGCTCGTAACATTTTA
GCGGCGGGGCACGCCGTTCTTGCCTGTGGAGAGATGGTGCAGTCAGGCCGCCCGTTGAAG
CAGGAACCCACCGAAATGATTCAGGCGACAGCCTGAACGTAGCAGGGATCCTCGTCCTTC
@@ -41569,7 +41569,7 @@ GGCGATACGGTTATCCGGCCACATGCTGAGGGTGCTGTCCGGGTGCAGCTCCGGGTCGGG
CAGGCGGTTACCTGCCAGGGCAAGATTACGAAAGCCCGCTCCCCGCAAGGACTGACGCCA
GATAGTTTCTGTCCATGGCTGCTTTTCGCATCTTACGTCTTAACCCTGCCTTGAATACCT
TATCAT
->BAC_00007
+>BAC_00007 Contig Seven
TAATCCGGTAGCGGCGTAAAAATCGCGGAAGGGATGAAAAAAACAGCGCCTGACGGCGCT
GTGTCTGGCATGCCTGCAATCCGGGAAACCGGACCAGGAAAAAACTTGCAGCCCATAACA
GTATCTACGCAGTACCTGTAATATATTGAATCTGCAGGACTTTGTAGGCCAGATAAGCGT
@@ -87438,6 +87438,6 @@ ATTTTGCGTTCTGCCCAGGACAGGTGCGTCAGGCCGTGGCAGTGATGCCCCTTGCG
>BAC_00225
CACTACATCCGATACCTGTATGACGATGGAGCAGGCCAATGAGAAGGCCAAAAAACTGGA
GCAGTCCTCAGAAGCAAAGCCGGTTGCGGCATCACTGCCGCGCCTGGCTGAAGGG
->BAC_00226
+>BAC_00226 this is the last contig
TTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTA
CGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAG
=====================================
test.noseq.gff → test/test.noseq.gff
=====================================
=====================================
test/test.pdb
=====================================
The diff for this file was not included because it is too large.
=====================================
test/test.sh
=====================================
@@ -0,0 +1,163 @@
+#!/usr/bin/env bats
+
+setup() {
+ name="any2fasta"
+ bats_require_minimum_version 1.5.0
+ dir=$(dirname "$BATS_TEST_FILENAME")
+ cd "$dir"
+ exe="$dir/../$name -q"
+ tab=$'\t'
+ FASTA_ID=">NZ_CHER02000075"
+}
+
+@test "Script syntax check" {
+ run -0 perl -c "$dir/../$name"
+}
+@test "Version" {
+ run -0 $exe -v
+ [[ "${lines[0]}" =~ "$name" ]]
+}
+@test "Help" {
+ run -0 $exe -h
+ [[ "$output" =~ "USAGE" ]]
+}
+@test "No parameters" {
+ run ! $exe
+}
+@test "Bad option" {
+ run ! $exe -Y
+ [[ "$output" =~ "Unknown option" ]]
+ [[ ! "$output" =~ "USAGE" ]]
+}
+@test "Passing a folder" {
+ run ! $exe $dir
+ [[ "$output" =~ "directory" ]]
+}
+@test "Empty input" {
+ run ! $exe /dev/null
+ [[ "$output" =~ "ERROR" ]]
+}
+
+@test "Handle FASTA" {
+ run -0 $exe test.fna
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+ [[ "${lines[0]}" =~ "Leptospira" ]]
+}
+@test "Handle EMBL" {
+ run -0 $exe test.embl
+ [[ "${lines[0]}" == ">K02675" ]]
+}
+@test "Handle FASTQ" {
+ run -0 $exe test.fq
+ [[ "${lines[0]}" =~ ">ERR1163317.1" ]]
+ [[ "${lines[0]}" =~ "length=" ]]
+}
+@test "Handle GENBANK" {
+ run -0 $exe test.gbk
+ [[ "${lines[0]}" =~ ">NZ_AHMY02000075" ]]
+}
+@test "Handle GFF" {
+ run -0 $exe test.gff
+ [[ "$output" =~ ">BAC_00002" ]]
+}
+@test "Handle STOCKHOLM" {
+ run -0 $exe test.sth
+ [[ "$output" =~ ">O83071/259-312" ]]
+}
+@test "Handle CLUSTAL" {
+ run -0 $exe test.clw
+ [[ "$output" =~ ">gene03" ]]
+}
+@test "Handle GFA" {
+ run -0 $exe test.gfa
+ [[ "${lines[0]}" =~ ">225289" ]]
+}
+@test "Handle PDB" {
+ run -0 $exe test.pdb
+ [[ "$output" =~ ">1EK3-B" ]]
+}
+
+@test "GZIP compression" {
+ run -0 $exe test.fna.gz
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+@test "BZIP2 compression" {
+ run -0 $exe test.fna.bz2
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+@test "XZ compression" {
+ skip
+ run -0 $exe test.fna.xz
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+@test "ZSTD compression" {
+ skip
+ run -0 $exe test.fna.zst
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+@test "ZIP compression" {
+ run -0 $exe test.fna.zip
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+
+@test "STDIN input" {
+ run -0 $exe - < test.fna
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+@test "Compressed STDIN input" {
+ run -0 $exe - < test.fna.gz
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+}
+
+@test "Option -l lowercase" {
+ run -0 $exe -l test.fna
+ [[ "${lines[1]}" =~ "aacryantctc" ]]
+}
+@test "Option -u uppercase" {
+ run -0 $exe -u test.embl
+ [[ "${lines[1]}" =~ "AGTCGCTTTTAA" ]]
+}
+@test "Option -n deambiguate" {
+ run -0 $exe -n test.fna
+ [[ "${lines[1]}" =~ "AACNNANTCTC" ]]
+}
+
+@test "GFF with no sequence" {
+ run ! $exe -n test.noseq.gff
+ [[ "$output" =~ "No sequences found" ]]
+}
+@test "Multiple sequence inputs" {
+ run $exe test.fna test.embl test.pdb test.sth
+ [[ "$output" =~ ">1EK3-B" ]]
+}
+@test "Multiple sequence with one bad one" {
+ run ! $exe test.fna test.jpg test.pdb test.sth
+ [[ "$output" =~ "ERROR" ]]
+}
+@test "Allow skipping over bad files" {
+ run $exe -k test.fna test.jpg test.embl
+ [[ "$output" =~ ">K02675" ]]
+}
+
+@test "Handle EMBL -g" {
+ run -0 $exe -g test.embl
+ [[ "${lines[0]}" =~ ">K02675.1" ]]
+}
+@test "Handle GENBANK -g" {
+ run -0 $exe -g test.gbk
+ [[ "${lines[0]}" =~ ">NZ_AHMY02000075.1" ]]
+}
+
+@test "Handle FASTA -s" {
+ run -0 $exe -s test.fna
+ [[ "${lines[0]}" =~ "$FASTA_ID" ]]
+ [[ ! "${lines[0]}" =~ "Leptospira" ]]
+}
+@test "Handle FASTQ -s" {
+ run -0 $exe -s test.fq
+ [[ ! "${lines[0]}" =~ "length=" ]]
+}
+@test "Handle GFF -s" {
+ run -0 $exe -s test.gff
+ [[ ! "$output" =~ "contig" ]]
+}
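The three `-s` tests above check that the description is gone from `>id desc` headers while the ID survives. As a rough illustration of what `purify_id` does to a single header line, here is a shell sketch; `strip_desc` is a hypothetical name, and the parameter expansion stands in for the Perl substitution `s/\h\V*//`, assuming the header is passed without its trailing newline.

```shell
# Hypothetical helper, for illustration only.
# Drop everything from the first space or tab onward, keeping '>id'.
strip_desc() {
  printf '%s\n' "${1%%[[:blank:]]*}"
}

strip_desc '>BAC_00002 Contig number 2'
strip_desc '>BAC_00225'
```

A header with no description passes through unchanged, which matches the behaviour the FASTA test relies on when it still expects `$FASTA_ID` in the first output line.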
=====================================
test.sth → test/test.sth
=====================================
View it on GitLab: https://salsa.debian.org/med-team/any2fasta/-/commit/d21818287db414628a433651a702ccea172e96f2