[med-svn] [Git][med-team/any2fasta][upstream] New upstream version 0.8.1

Andreas Tille (@tille) gitlab at salsa.debian.org
Fri Apr 10 21:25:07 BST 2026



Andreas Tille pushed to branch upstream at Debian Med / any2fasta


Commits:
d2181828 by Andreas Tille at 2026-04-10T21:06:50+02:00
New upstream version 0.8.1
- - - - -


25 changed files:

- + .github/workflows/CI.yml
- .gitignore
- − .travis.yml
- + CODE_OF_CONDUCT.md
- README.md
- any2fasta
- + environment.yml
- + paper/paper.bib
- + paper/paper.md
- test.clw → test/test.clw
- test.embl → test/test.embl
- test.fna → test/test.fna
- + test/test.fna.bz2
- + test/test.fna.gz
- + test/test.fna.xz
- + test/test.fna.zip
- + test/test.fna.zst
- test.fq → test/test.fq
- test.gbk → test/test.gbk
- test.gfa → test/test.gfa
- test.gff → test/test.gff
- test.noseq.gff → test/test.noseq.gff
- + test/test.pdb
- + test/test.sh
- test.sth → test/test.sth


Changes:

=====================================
.github/workflows/CI.yml
=====================================
@@ -0,0 +1,30 @@
+name: CI
+
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+
+jobs:
+  tests:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        shell: bash -el {0}
+    steps:
+      - name: Pull code
+        uses: actions/checkout@v5
+
+      - name: Setup conda
+        uses: conda-incubator/setup-miniconda@v3
+        with:
+          activate-environment: any2fasta
+          environment-file: environment.yml
+          miniforge-version: latest
+          channels: conda-forge,bioconda
+          channel-priority: strict
+          auto-update-conda: true
+
+      - name: Run BATS test suite
+        run: bats test/test.sh


=====================================
.gitignore
=====================================
@@ -36,3 +36,5 @@ inc/
 
 # backups
 *~
+bug/
+dev/


=====================================
.travis.yml deleted
=====================================
@@ -1,50 +0,0 @@
-language: perl
-
-sudo: false
-
-perl:
-    - "5.26"
-
-addons:
-    apt:
-        packages:
-            - gzip
-            - bzip2
-            - zip
-            - unzip
-            
-install:
-    - "export PATH=$PWD:$PATH"
-
-script:
-    - "! any2fasta"
-    - "any2fasta -v"
-    - "any2fasta -h"
-    - "any2fasta -h | grep github"
-    - "! any2fasta -x"
-    - "any2fasta /dev/null 2>&1 | grep 'ERROR'"
-    - "any2fasta test.noseq.gff 2>&1 | grep 'ERROR'"
-    - "any2fasta test.gbk | grep -m 3 '^>'"
-    - "any2fasta test.gff | grep -m 3 '^>'"
-    - "any2fasta test.fna | grep -m 3 '^>'"
-    - "any2fasta test.gfa | grep -m 3 '^>'"
-    - "any2fasta test.fq  | grep -m 3 '^>'"
-    - "any2fasta test.embl | grep -m 3 '^>'"
-    - "any2fasta test.clw  | grep -m 3 '^>'"
-    - "any2fasta test.sth  | grep -m 3 '^>'"
-    - "any2fasta    test.fna | grep 'CRYANT'"
-    - "any2fasta -n test.fna | grep 'CNNANT'"
-    - "any2fasta test.gfa | grep '^>24292$'"
-    - "any2fasta test.fq | grep '^>ERR1163317.999'"
-    - "any2fasta test.sth | grep '^>O83071'"
-    - "any2fasta test.clw | grep '^>gene03'"
-    - "any2fasta -l test.gbk | grep 'taagaatgagtagaaggttttga'"
-    - "any2fasta -u test.gbk | grep 'TAAGAATGAGTAGAAGGTTTTGA'"
-    - "any2fasta -u test.embl | grep 'K02675'"
-    - "any2fasta -q -l -n test.fq | wc -l | grep '^2000$'"
-    - "any2fasta - < test.gbk  | grep -m 1 -F 'NZ_AHMY02000074'"
-    - "gzip -c test.gbk | any2fasta - | grep -m 1 -F 'NZ_AHMY02000074'"
-    - "bzip2 -c test.gbk | any2fasta - | grep -m 1 -F 'NZ_AHMY02000074'"
-    - "zip test.gbk.zip test.gbk"
-    - "any2fasta test.gbk.zip | grep -m 1 -F 'NZ_AHMY02000074'"
-    - echo -e ">1\nA\n\n>2\nT\n" | ./any2fasta -q - | wc -l | grep '^4$'


=====================================
CODE_OF_CONDUCT.md
=====================================
@@ -0,0 +1,76 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as
+contributors and maintainers pledge to making participation in our project and
+our community a harassment-free experience for everyone, regardless of age, body
+size, disability, ethnicity, sex characteristics, gender identity and expression,
+level of experience, education, socio-economic status, nationality, personal
+appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment
+include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic
+ address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable
+behavior and are expected to take appropriate and fair corrective action in
+response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or
+reject comments, commits, code, wiki edits, issues, and other contributions
+that are not aligned to this Code of Conduct, or to ban temporarily or
+permanently any contributor for other behaviors that they deem inappropriate,
+threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces
+when an individual is representing the project or its community. Examples of
+representing a project or community include using an official project e-mail
+address, posting via an official social media account, or acting as an appointed
+representative at an online or offline event. Representation of a project may be
+further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be
+reported by contacting the project team at torsten.seemann+coc at gmail.com. All
+complaints will be reviewed and investigated and will result in a response that
+is deemed necessary and appropriate to the circumstances. The project team is
+obligated to maintain confidentiality with regard to the reporter of an incident.
+Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good
+faith may face temporary or permanent repercussions as determined by other
+members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
+available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
+
+[homepage]: https://www.contributor-covenant.org
+
+For answers to common questions about this code of conduct, see
+https://www.contributor-covenant.org/faq


=====================================
README.md
=====================================
@@ -1,11 +1,21 @@
-[![Build Status](https://travis-ci.org/tseemann/any2fasta.svg?branch=master)](https://travis-ci.org/tseemann/any2fasta) 
+[![CI](https://github.com/tseemann/any2fasta/actions/workflows/CI.yml/badge.svg)](https://github.com/tseemann/any2fasta/actions/workflows/CI.yml)
+[![GitHub release (latest by date)](https://img.shields.io/github/v/release/tseemann/any2fasta)](https://github.com/tseemann/any2fasta/releases)
 [![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
 ![Don't judge me](https://img.shields.io/badge/Language-Perl_5-steelblue.svg)
+[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/any2fasta.svg)](https://anaconda.org/bioconda/any2fasta)
 
 # any2fasta
 
 Convert various sequence formats to FASTA
 
+## Quick start
+```
+% conda install -c bioconda any2fasta
+% any2fasta genome.gbk > genome.fasta
+% any2fasta seq.gbk.gz > seq.fasta
+% any2fasta protein.pdb.bz2 > protein.fasta
+```
+
 ## Motivation
 
 You may wonder why this tool even exists.  Well, I tried to do the right
@@ -29,6 +39,7 @@ It supports the following input formats:
 6. CLUSTAL alignments, typically `.clw`, `.clu` (starts with `CLUSTAL` or `MUSCLE`)
 7. STOCKHOLM alignments, typically `.sth` (starts with `# STOCKHOLM`)
 8. GFA assembly graph, typically `.gfa` (starts with `^[A-Z]\t`)
+9. PDB protein data bank structure, typically `.pdb` (starts with `^HEADER`)
 
 Files may be compressed with:
 
@@ -36,40 +47,42 @@ Files may be compressed with:
 2. bzip2, typically `.bz2`
 3. zip, typically `.zip`
 
+plus any other formats supported by
+your installed version of Perl's
+[`IO::Uncompress::AnyUncompress`](https://perldoc.perl.org/IO::Uncompress::AnyUncompress#DESCRIPTION)
+module.
+
 ## Installation
 
 `any2fasta` has no dependencies except [Perl 5.10](https://www.perl.org/)
 or higher. It only uses core modules, so no CPAN needed.
 
+### Conda
+```
+% conda install -c bioconda any2fasta
+```
 ### Direct script download
 ```
 % cd /usr/local/bin  # choose a folder in your $PATH
 % wget https://raw.githubusercontent.com/tseemann/any2fasta/master/any2fasta
 % chmod +x any2fasta
 ```
-### Homebrew
-```
-% brew install brewsci/bio/any2fasta # COMING SOON
-```
-### Conda
-```
-% conda install -c bioconda any2fasta # COMING SOON
-```
 ### Github
 ```
 % git clone https://github.com/tseemann/any2fasta.git
-% cp any2fasta/any2fasta /usr/local/bin # choose a folder in your $PATH
+% cp any2fasta/any2fasta $HOME/.local/bin # choose a folder in your $PATH
 ```
 
 ## Test Installation
 
+### Simple check
 ```
 % ./any2fasta -v
-any2fasta 0.2.2
+any2fasta 1.0.2
 
 % ./any2fasta -h
 NAME
-  any2fasta 0.4.2
+  any2fasta 1.0.2
 SYNOPSIS
   Convert various sequence formats into FASTA
 USAGE
@@ -78,11 +91,26 @@ OPTIONS
   -h       Print this help
   -v       Print version and exit
   -q       No output while running, only errors
-  -n       Replace ambiguous IUPAC letters with 'N'
+  -k       Skip, don't die, on bad input files
+  -n       Replace non-[AGTC] with 'N'
   -l       Lowercase the sequence
   -u       Uppercase the sequence
+  -g       Include VERSION from GBK/EMBL files
+  -s       Strip sequence descriptions (FASTA,FASTQ)
 END
 ```
+### Extensive test
+```
+% bats $(dirname $(which any2fasta))/test/test.sh
+
+ ✓ Script syntax check
+ ✓ Version
+ ...
+ ✓ Multiple sequence with one bad one
+ ✓ Allow skipping over bad files
+
+29 tests, 0 failures, 2 skipped
+```
 
 ## Examples
 ```
@@ -111,6 +139,9 @@ END
 * `-l` will lowercase all the letters
 * `-u` will uppercase all the letters
 * `-q` will prevent logging messages being printed
+* `-k` will warn about bad inputs and continue, rather than stopping with an error
+* `-g` will append the version to the sequence ID
+* `-s` removes `desc` from `>id desc` in FASTA,FASTQ,GFF
 
 ## Issues
 
@@ -123,4 +154,3 @@ Submit feedback to the [Issue Tracker](https://github.com/tseemann/any2fasta/iss
 ## Author
 
 [Torsten Seemann](http://tseemann.github.io/)
-
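As a rough illustration of the `-n`, `-l`, `-u` and `-s` options listed in the README changes above, the following Python sketch re-expresses the substitutions that `purify_dna` and `purify_id` perform in the Perl script. It is not upstream code: the keyword arguments are hypothetical stand-ins for the command-line flags, the handling of `-u` is assumed to mirror `-l`, and Perl's `\h` is approximated by space and tab only.

```python
import re


def purify_dna(seq, n=False, lower=False, upper=False):
    """Apply the -n, -l and -u transformations to one sequence line."""
    if n:
        # Anything that is not A/T/G/C/N (either case), a gap '-',
        # or a line ending is replaced by 'N'.
        seq = re.sub(r"[^ATGCNatgcn\n\r-]", "N", seq)
    if lower:
        seq = seq.lower()
    if upper:
        seq = seq.upper()
    return seq


def purify_id(header, strip=False):
    """Apply -s: drop the description but keep the trailing newline."""
    if strip:
        # First space or tab, then everything up to (not including) the line ending.
        header = re.sub(r"[ \t][^\n\r]*", "", header, count=1)
    return header
```

For instance, `purify_id(">BAC_00002 Contig number 2\n", strip=True)` keeps only `>BAC_00002` plus the newline, which is the behaviour the new `-s` tests check for.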


=====================================
any2fasta
=====================================
@@ -1,19 +1,34 @@
 #!/usr/bin/env perl
+use 5.18.0;
 use strict;
 use warnings;
 use Getopt::Std;
 use File::Basename;
-use IO::Uncompress::AnyUncompress;
+use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError);
+use IO::Handle;
 
 #......................................................................................
 
-our $VERSION = "0.4.2";
+our $VERSION = "0.8.1";
 our $EXE = basename($0);
 our $STDIN = '/dev/stdin'; # '-' will be replaced with this
 our $URL = 'https://github.com/tseemann/any2fasta';
 
 #......................................................................................
 
+my %aa = (
+  'ALA' => 'A', 'ARG' => 'R', 'ASN' => 'N', 'ASP' => 'D',
+  'CYS' => 'C', 'GLU' => 'E', 'GLN' => 'Q', 'GLY' => 'G',
+  'HIS' => 'H', 'ILE' => 'I', 'LEU' => 'L', 'LYS' => 'K',
+  'MET' => 'M', 'PHE' => 'F', 'PRO' => 'P', 'SER' => 'S',
+  'THR' => 'T', 'TRP' => 'W', 'TYR' => 'Y', 'VAL' => 'V',
+  # Optional: Inclusion of non-standard or ambiguous codes
+  'SEC' => 'U', 'PYL' => 'O', 'ASX' => 'B', 'GLX' => 'Z', 
+  'XAA' => 'X', 'TER' => '*'
+);
+
+#......................................................................................
+
 sub usage {
   my($errcode) = @_;
   $errcode ||= 0;
@@ -21,14 +36,17 @@ sub usage {
   print $ofh 
     "NAME\n  $EXE $VERSION\n",
     "SYNOPSIS\n  Convert various sequence formats into FASTA\n",
-    "USAGE\n  $EXE [options] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip] > output.fasta\n",
+    "USAGE\n  $EXE [opts] file.{gb,fa,fq,gff,gfa,clw,sth}[.gz,bz2,zip,zstd] > output.fasta\n",
     "OPTIONS\n",
     "  -h       Print this help\n",
     "  -v       Print version and exit\n",
     "  -q       No output while running, only errors\n",
-    "  -n       Replace ambiguous IUPAC letters with 'N'\n",
+    "  -k       Skip, don't die, on bad input files\n",
+    "  -n       Replace non-[AGTC] with 'N'\n",
     "  -l       Lowercase the sequence\n",
     "  -u       Uppercase the sequence\n",
+    "  -g       Include VERSION (GENBANK,EMBL)\n",
+    "  -s       Strip sequence descriptions (FASTA,FASTQ)\n",
     "HOMEPAGE\n  $URL\n",
     "END\n";
   exit($errcode);
@@ -42,51 +60,78 @@ sub version {
 #......................................................................................
 
 my %opt;
-getopts('vhqnlu', \%opt) or exit(-1);
+getopts('vhqnlukgs', \%opt) or exit(-1);
 $opt{v} and version();
 $opt{h} and usage(0);
 @ARGV or usage(1);
+my $skip = $opt{'k'};
 
-#......................................................................................
+# to debug broken pipe errors in CI
+#STDOUT->autoflush(1);
 
+#......................................................................................
 sub msg {
   print STDERR "@_\n" unless $opt{q};
 }
-
+sub wrn {
+  msg("WARNING:", @_);
+}
 sub err {
   print STDERR "ERROR: @_\n";
   exit(-1);
 }
-
 #......................................................................................
 
 msg("This is $EXE $VERSION");
 
 # regexp to function mapping
+# put in order of most-likely to be encountered
 my @formats = (
-  [ 'GENBANK',   qr/^LOCUS\s/,       \&parse_genbank   ],
-  [ 'EMBL',      qr/^ID\s/,          \&parse_embl      ],
+  [ 'GENBANK',   qr/^LOCUS\h/,       \&parse_genbank   ],
+  [ 'EMBL',      qr/^ID\h/,          \&parse_embl      ],
+  [ 'FASTA',     qr/^>\S/,           \&parse_fasta     ],
+  [ 'FASTQ',     qr/^@\S/,           \&parse_fastq     ],
   [ 'GFF',       qr/^##gff/,         \&parse_gff       ],
   [ 'CLUSTAL',   qr/^(CLUST|MUSCL)/, \&parse_clustal   ],
   [ 'STOCKHOLM', qr/^# STOCKHOLM\s/, \&parse_stockholm ],
-  [ 'FASTA',     qr/^>/,             \&parse_fasta     ],
-  [ 'FASTQ',     qr/^@/,             \&parse_fastq     ],
   [ 'GFA',       qr/^[A-Z]\t/,       \&parse_gfa       ],
+  [ 'PDB',       qr/^HEADER\h/,      \&parse_pdb       ],
 );
 
 # loop over all positional command line arguments
 my $processed=0;
 
-for my $fname (@ARGV) {
-  msg("Opening '$fname'");
-  $fname = $STDIN if $fname eq '-';
-  -d $fname and err("'$fname' is a directory not a file");
-  -r $fname or err("'$fname' is not readable");
+# are we in '-k' skip mode?
+my $notify = $skip ? \&wrn :\&err;
 
+FILE:
+for my $infile (@ARGV) {
+  msg("Opening '$infile'");
+  $infile = $STDIN if $infile eq '-';
+  if (-d $infile) {
+    $notify->("'$infile' is a directory not a file");
+    next FILE;
+  }
+  if (! -r $infile) {
+    $notify->("'$infile' is not readable");
+    next FILE;
+  }
   # read first line to see if we have any data
-  my $unzip = IO::Uncompress::AnyUncompress->new($fname);
+  my $unzip = IO::Uncompress::AnyUncompress->new(
+    $infile,
+    'MultiStream' => 1,  # fix Issue #34
+    'Transparent'  => 1, # allow non-compressed data 
+    'BlockSize'=> (1 << 16), # in bytes
+  );
+  if (! $unzip) {
+    $notify->($AnyUncompressError);
+    next FILE;
+  }
   my $header = scalar(<$unzip>); # read first line
-  $header or err("The input appears to be empty");
+  if (not $header) {
+    $notify->("The input appears to be empty");
+    next FILE;
+  }
 
   # detect format from first line
   my $ok=0;
@@ -96,16 +141,16 @@ for my $fname (@ARGV) {
       # read in the rest of the file now
       my @line = ($header, <$unzip>);
       my $lines = scalar(@line);
-      msg("Read $lines lines from '$fname'");
+      msg("Read $lines lines from '$infile'");
       # run the parser
       my $nseq = $fmt->[2]->( \@line );
       msg("Wrote $nseq sequences from", $fmt->[0], "file.");
-      $nseq or err("No sequences found in '$fname'");
+      $nseq or $notify->("No sequences found in '$infile'");
       $ok++;
       last;
     }
   }
-  $ok or err("Unfamilar format with first line: $header");
+  $ok or $notify->("Unfamilar format with first line: $header");
   $processed++;
 }
 
@@ -116,7 +161,7 @@ exit(0);
 
 #......................................................................................
 
-sub purify {
+sub purify_dna {
   my($dna) = @_;
   $dna =~ s/[^ATGCN\n\r-]/N/gi if $opt{n};
   $dna = lc($dna) if $opt{l};
@@ -126,16 +171,28 @@ sub purify {
 
 #......................................................................................
 
+sub purify_id {
+  my($id) = @_;
+  # \V is non-(vertical-space) eg. cr nl
+  # \h is     horizontal-space eg. spc tab
+  # we don't use $ as that strips the newline
+  $id =~ s/\h\V*// if $opt{'s'};
+  return $id;
+}
+
+#......................................................................................
+
 sub parse_fasta {
   my($lines) = @_;
   my $count=0;
   for my $line (@$lines) {
     next if $line =~ m/^\s*$/;
     if ($line =~ m/^>/) {
+      $line = purify_id($line);
       $count++;
     }
     else {
-      $line = purify($line);
+      $line = purify_dna($line);
     }
     print $line;
   }
@@ -149,8 +206,9 @@ sub parse_fastq {
   my $count=0;
   # jump 4 lines at a time
   for ( my $i=0 ; $i < $#{$lines} ; $i+=4 ) {
+    $lines->[$i] = purify_id($lines->[$i]);
     print ">", substr($lines->[$i], 1);
-    print purify( $lines->[$i+1] );
+    print purify_dna( $lines->[$i+1] );
     $count++;
   }
   return $count;
@@ -160,12 +218,15 @@ sub parse_fastq {
 
 sub parse_gff {
   my($lines) = @_;
-  my $at_seq = 0;
-  for my $line (@$lines) {
-    $at_seq++ if $line =~ m/^>/;
-    print purify($line) if $at_seq;;
+  # Skip past until we get to ##FASTA section
+  while (my $line = shift @$lines) {
+    if ($line =~ m/^>/) {
+      unshift @$lines, $line;
+      # let the FASTA parser do the work now
+      return parse_fasta(\@$lines);
+    }
   }
-  return $at_seq;
+  return 0;
 }
 
 #......................................................................................
@@ -182,7 +243,7 @@ sub parse_gfa {
     # this is NOT the original contigs, rather the unitigs
     # need to parse L (link) and P (path) to reconstruct the contigs
     if ($x[0] eq 'S') {
-      print ">", $x[1], "\n", purify($x[2]), "\n";
+      print ">", $x[1], "\n", purify_dna($x[2]), "\n";
       $count++;
     }
   }
@@ -201,8 +262,7 @@ sub parse_genbank {
   foreach (@$lines) {
     chomp;
     if (m{^//}) {
-      $dna = purify($dna);
-      print ">", $acc, "\n", purify($dna);
+      print ">", $acc, "\n", purify_dna($dna);
       $count++;
       $in_seq = 0;
       $dna = $acc = '';
@@ -225,6 +285,11 @@ sub parse_genbank {
       if (m/^LOCUS\s+(\S+)/) {
         $acc = $1;
       }
+      # VERSION     NZ_AHMY02000075.1
+      # this is not elsif in case there is no VERSION
+      if ($opt{g} and m/^VERSION\s+(\S+)/) {
+        $acc = $1;
+      }
     }
   }
   return $count;
@@ -243,7 +308,7 @@ sub parse_embl {
     chomp;
     if (m{^//}) {
       # end of record
-      print ">", $acc, "\n", purify($dna);
+      print ">", $acc, "\n", purify_dna($dna);
       $count++;
       $in_seq = 0;
       $dna = $acc = '';
@@ -264,6 +329,9 @@ sub parse_embl {
       # ID   K02675; SV 1; linear; genomic DNA; STD; UNC; 569 BP.
       if (m/^ID\s+([^;]+)/) {
         $acc = $1;
+        if ($opt{g} and m/\bSV\s+(\d+)/) {
+          $acc .= ".$1";
+        }
       }
     }
   }
@@ -275,12 +343,15 @@ sub parse_embl {
 sub parse_clustal {
   my($lines) = @_;
   my %seq;
+  my @order;  # keep track of sequence identifiers order
   foreach (@$lines) {
-    next unless m/^(\S+)\s+([A-Z-]+)$/i; # uses '-' for gap
+    # uses '-' for gap, ignore optional position number
+    next unless m'^(\S+)\s+([A-Z-]+)(?:\s+\d+)?$'i;
+    push @order, $1 unless exists $seq{$1};
     $seq{$1} .= $2."\n";
   }
-  for my $id (sort keys %seq) {
-    print ">", $id, "\n", purify($seq{$id});
+  for my $id (@order) {
+    print ">", $id, "\n", purify_dna($seq{$id});
   }
   return scalar(keys %seq);
 }
@@ -290,16 +361,42 @@ sub parse_clustal {
 sub parse_stockholm {
   my($lines) = @_;
   my %seq;
+  my @order;  # keep track of sequence identifiers order
   foreach (@$lines) {
     next if m/^#/;
     last if m{^//};
-    next unless m/^(\S+)\s+([A-Z.]+)$/i;  # uses '.' for gap
+    # uses '.' and also '-' for gap, optional position number
+    next unless m'^(\S+)\s+([A-Z.-]+)(?:\s+\d+)?$'i; 
     my($id,$sq) = ($1,$2);
     $sq =~ s/\./-/g;  # switch to standard FASTA '-' gap char
+    push @order, $id unless exists $seq{$id};
     $seq{$id} .= $sq . "\n";
   }
+  for my $id (@order) {
+    print ">", $id, "\n", purify_dna($seq{$id});
+  }
+  return scalar(keys %seq);
+}
+
+#......................................................................................
+
+sub parse_pdb {
+  my($lines) = @_;
+  my %seq;
+  # HEADER IMMUNE SYSTEM  06-MAR-00   1EK3
+  my @hdr = split ' ', $lines->[0];
+  my $prefix = $hdr[-1];
+  foreach (@$lines) {
+    #SEQRES   9 A  114  GLY GLY GLY THR LYS VAL GLU IL E LYS ARG
+    #SEQRES   1 B  114  ASP ILE VAL MET THR GLN SER PR O ASP SER LEU ALA VAL
+    next unless m/^SEQRES/;
+    my @x = split ' ', $_;
+    $seq{"$prefix-$x[2]"} .= 
+      join( '', map { $aa{uc($_)} || 'X'  } 
+        splice(@x,4) );
+  }
   for my $id (sort keys %seq) {
-    print ">", $id, "\n", purify($seq{$id});
+    print ">", $id, "\n", purify_dna($seq{$id}), "\n";
   }
   return scalar(keys %seq);
 }
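The first-line format detection that the `@formats` table drives can be sketched in Python as follows. This is only an illustration under assumptions: the patterns are copied from the new version of the table (with `\h` approximated as space or tab), the helper name `detect_format` is hypothetical, and the matching loop is assumed to take the first pattern that matches, in table order.

```python
import re

# (name, first-line pattern), in the same order as the updated Perl table.
FORMATS = [
    ("GENBANK",   re.compile(r"^LOCUS[ \t]")),
    ("EMBL",      re.compile(r"^ID[ \t]")),
    ("FASTA",     re.compile(r"^>\S")),
    ("FASTQ",     re.compile(r"^@\S")),
    ("GFF",       re.compile(r"^##gff")),
    ("CLUSTAL",   re.compile(r"^(CLUST|MUSCL)")),
    ("STOCKHOLM", re.compile(r"^# STOCKHOLM\s")),
    ("GFA",       re.compile(r"^[A-Z]\t")),
    ("PDB",       re.compile(r"^HEADER[ \t]")),
]


def detect_format(first_line):
    """Return the name of the first format whose pattern matches, else None."""
    for name, pattern in FORMATS:
        if pattern.search(first_line):
            return name
    return None
```

Note that the stricter `^>\S` and `^@\S` patterns mean a bare `>` or `@` followed by whitespace is no longer recognised as FASTA or FASTQ.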

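The new `parse_pdb` routine builds one sequence per chain from the `SEQRES` records. The Python sketch below mirrors that logic for illustration only: the function name `pdb_seqres_to_fasta` and its return type are assumptions, the lookup table is abbreviated to the entries shown in the diff, and unknown residue names fall back to `X` as in the Perl code.

```python
# Three-letter to one-letter residue codes, as in the %aa hash of the diff.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D",
    "CYS": "C", "GLU": "E", "GLN": "Q", "GLY": "G",
    "HIS": "H", "ILE": "I", "LEU": "L", "LYS": "K",
    "MET": "M", "PHE": "F", "PRO": "P", "SER": "S",
    "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
    "SEC": "U", "PYL": "O", "ASX": "B", "GLX": "Z",
    "XAA": "X", "TER": "*",
}


def pdb_seqres_to_fasta(lines):
    """Return a sorted list of (id, sequence) pairs, one per chain."""
    # The last whitespace-separated field of the HEADER line is the PDB ID.
    prefix = lines[0].split()[-1]
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        fields = line.split()
        # fields: SEQRES, serial number, chain ID, residue count, residues...
        chain_id = f"{prefix}-{fields[2]}"
        residues = "".join(THREE_TO_ONE.get(r.upper(), "X") for r in fields[4:])
        chains[chain_id] = chains.get(chain_id, "") + residues
    return [(cid, chains[cid]) for cid in sorted(chains)]
```

Splitting on whitespace, as the Perl code does, assumes every `SEQRES` record carries a non-blank chain identifier in its third field.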

=====================================
environment.yml
=====================================
@@ -0,0 +1,11 @@
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - perl >=5.18.0
+  - gzip
+  - bzip2
+  - unzip
+  - xz
+  - zstd
+  - bats-core


=====================================
paper/paper.bib
=====================================
@@ -0,0 +1,53 @@
+@article{rice2000emboss,
+  title={EMBOSS: the European molecular biology open software suite},
+  author={Rice, Peter and Longden, Ian and Bleasby, Alan},
+  journal={Trends in genetics},
+  volume={16},
+  number={6},
+  pages={276--277},
+  year={2000},
+  publisher={Elsevier}
+}
+
+@article{gilbert2003readseq,
+  title={Sequence File Format Conversion with Command-Line Readseq},
+  author={Gilbert, Don},
+  journal={Current protocols in bioinformatics},
+  number={1},
+  pages={A--1E},
+  year={2003},
+  publisher={Wiley Online Library}
+}
+
+@article{stajich2002bioperl,
+  title={The Bioperl toolkit: Perl modules for the life sciences},
+  author={Stajich, Jason E and Block, David and Boulez, Kris and Brenner, Steven E and Chervitz, Stephen A and Dagdigian, Chris and Fuellen, Georg and Gilbert, James GR and Korf, Ian and Lapp, Hilmar and others},
+  journal={Genome research},
+  volume={12},
+  number={10},
+  pages={1611--1618},
+  year={2002},
+  publisher={Cold Spring Harbor Lab}
+}
+
+@article{cock2009biopython,
+  title={Biopython: freely available Python tools for computational molecular biology and bioinformatics},
+  author={Cock, Peter JA and Antao, Tiago and Chang, Jeffrey T and Chapman, Brad A and Cox, Cymon J and Dalke, Andrew and Friedberg, Iddo and Hamelryck, Thomas and Kauff, Frank and Wilczynski, Bartek and others},
+  journal={Bioinformatics},
+  volume={25},
+  number={11},
+  pages={1422--1423},
+  year={2009},
+  publisher={Oxford University Press}
+}
+
+@article{pearson1988fasta,
+  title={Improved tools for biological sequence comparison},
+  author={Pearson, William R and Lipman, David J},
+  journal={Proceedings of the National Academy of Sciences},
+  volume={85},
+  number={8},
+  pages={2444--2448},
+  year={1988},
+  publisher={National Acad Sciences}
+}


=====================================
paper/paper.md
=====================================
@@ -0,0 +1,53 @@
+---
+title: 'any2fasta: convert various sequence and alignment formats to FASTA'
+tags:
+  - bioinformatics
+  - genomics
+  - file format conversion
+authors:
+ - name: Torsten Seemann
+   orcid: 0000-0001-6046-610X
+   affiliation: "1, 2"
+affiliations:
+ - name: Melbourne Bioinformatics, The University of Melbourne, Parkville, Australia.
+   index: 1
+ - name: Doherty Applied Microbial Genomics, Department of Microbiology and Immunology, The University of Melbourne, Parkville, Australia.
+   index: 2
+date: 18 October 2018
+bibliography: paper.bib
+---
+
+# Summary
+
+FASTA is a simple and pervasive plain text file format for storing 
+genetic sequence data [@pearson1988fasta]. There exist many other
+richer formats for storing sequences and associated annotations 
+and meta-data, such as the 
+Genbank and EMBL flat files (http://www.insdc.org/documents/feature-table).
+These formats often need to be converted to FASTA for use in 
+downstream software that only handles the FASTA format.
+Common tools for format conversion are
+EMBOSS `seqret` [@rice2000emboss] and `readseq` [@gilbert2003readseq].
+Unfortunately, these tools mangle sequence identifiers containing
+characters such as `|` and `.`. Furthermore, they offer no way to fix the behaviour 
+and have not seen any development activity in years.
+Custom scripts using the Bioperl [@stajich2002bioperl] 
+or Biopython [@cock2009biopython] libraries are available,
+but these are heavyweight solutions for a relatively simple problem.
+
+Here, I present a new software tool called `any2fasta` written
+as a single Perl script with no dependencies. It can read the
+Genbank, EMBL, GFF, FASTA, FASTQ and GFA sequence formats, 
+as well as the CLUSTAL and STOCKHOLM sequence alignment formats. 
+The input files can be of mixed type, and may be compressed with
+`gzip`, `bzip2` or `zip`. `any2fasta` is fast because it only
+parses those parts of the input files needed to extract the 
+sequence and its identifier.
+
+# Acknowledgements
+
+This work was supported by a National Health and Medical
+Research Council of Australia Project Grant (ID 1149991).
+
+# References
+


=====================================
test.clw → test/test.clw
=====================================


=====================================
test.embl → test/test.embl
=====================================


=====================================
test.fna → test/test.fna
=====================================


=====================================
test/test.fna.bz2
=====================================
Binary files /dev/null and b/test/test.fna.bz2 differ


=====================================
test/test.fna.gz
=====================================
Binary files /dev/null and b/test/test.fna.gz differ


=====================================
test/test.fna.xz
=====================================
Binary files /dev/null and b/test/test.fna.xz differ


=====================================
test/test.fna.zip
=====================================
Binary files /dev/null and b/test/test.fna.zip differ


=====================================
test/test.fna.zst
=====================================
Binary files /dev/null and b/test/test.fna.zst differ


=====================================
test.fq → test/test.fq
=====================================


=====================================
test.gbk → test/test.gbk
=====================================


=====================================
test.gfa → test/test.gfa
=====================================


=====================================
test.gff → test/test.gff
=====================================
@@ -12770,7 +12770,7 @@ ATTATTCAGGACAGCACTATTAATATGCTGCGCCTCAGCGATCCCGACGCTGCACTCATT
 ATCAGCCGGGGCCAGATGCAGGAAGGGGACGAATTAGCCTCCCAGATTGAACAGCAAATG
 AAAAAACTGGAGAAACAGGTAAAGGATCTTCACTACACGCCAGTTCAGGTAACACGGGTA
 GGGATTAATGACGGTGAA
->BAC_00002
+>BAC_00002 Contig number 2
 AGATGCCAGGTATGTGGATATACAGCGAACGCCGATGTAAATGGCGCTCGTAACATTTTA
 GCGGCGGGGCACGCCGTTCTTGCCTGTGGAGAGATGGTGCAGTCAGGCCGCCCGTTGAAG
 CAGGAACCCACCGAAATGATTCAGGCGACAGCCTGAACGTAGCAGGGATCCTCGTCCTTC
@@ -41569,7 +41569,7 @@ GGCGATACGGTTATCCGGCCACATGCTGAGGGTGCTGTCCGGGTGCAGCTCCGGGTCGGG
 CAGGCGGTTACCTGCCAGGGCAAGATTACGAAAGCCCGCTCCCCGCAAGGACTGACGCCA
 GATAGTTTCTGTCCATGGCTGCTTTTCGCATCTTACGTCTTAACCCTGCCTTGAATACCT
 TATCAT
->BAC_00007
+>BAC_00007 Contig Seven
 TAATCCGGTAGCGGCGTAAAAATCGCGGAAGGGATGAAAAAAACAGCGCCTGACGGCGCT
 GTGTCTGGCATGCCTGCAATCCGGGAAACCGGACCAGGAAAAAACTTGCAGCCCATAACA
 GTATCTACGCAGTACCTGTAATATATTGAATCTGCAGGACTTTGTAGGCCAGATAAGCGT
@@ -87438,6 +87438,6 @@ ATTTTGCGTTCTGCCCAGGACAGGTGCGTCAGGCCGTGGCAGTGATGCCCCTTGCG
 >BAC_00225
 CACTACATCCGATACCTGTATGACGATGGAGCAGGCCAATGAGAAGGCCAAAAAACTGGA
 GCAGTCCTCAGAAGCAAAGCCGGTTGCGGCATCACTGCCGCGCCTGGCTGAAGGG
->BAC_00226
+>BAC_00226 this is the last contig
 TTTGATGCCTGGCAGTTCCCTACTCTCGCATGGGGAGACCCCACACTACCATCGGCGCTA
 CGGCGTTTCACTTCTGAGTTCGGCATGGGGTCAGGTGGGACCACCGCGCTAAG


=====================================
test.noseq.gff → test/test.noseq.gff
=====================================


=====================================
test/test.pdb
=====================================
The diff for this file was not included because it is too large.

=====================================
test/test.sh
=====================================
@@ -0,0 +1,163 @@
+#!/usr/bin/env bats
+
+setup() {
+  name="any2fasta"
+  bats_require_minimum_version 1.5.0
+  dir=$(dirname "$BATS_TEST_FILENAME")
+  cd "$dir"
+  exe="$dir/../$name -q"
+  tab=$'\t'
+  FASTA_ID=">NZ_CHER02000075"
+}
+
+@test "Script syntax check" {
+  run -0 perl -c "$dir/../$name"
+}
+@test "Version" {
+  run -0 $exe -v
+  [[ "${lines[0]}" =~ "$name" ]]
+}
+@test "Help" {
+  run -0 $exe -h
+  [[ "$output" =~ "USAGE" ]]
+}
+@test "No parameters" {
+  run ! $exe
+}
+@test "Bad option" {
+  run ! $exe -Y
+  [[ "$output" =~ "Unknown option" ]]
+  [[ ! "$output" =~ "USAGE" ]]
+}
+@test "Passing a folder" {
+  run ! $exe $dir
+  [[ "$output" =~ "directory" ]]
+}
+@test "Empty input" {
+  run ! $exe /dev/null
+  [[ "$output" =~ "ERROR" ]]
+}
+
+@test "Handle FASTA" {
+  run -0 $exe test.fna
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+  [[ "${lines[0]}" =~ "Leptospira" ]]  
+}
+@test "Handle EMBL" {
+  run -0 $exe test.embl
+  [[ "${lines[0]}" == ">K02675" ]]  
+}
+@test "Handle FASTQ" {
+  run -0 $exe test.fq
+  [[ "${lines[0]}" =~ ">ERR1163317.1" ]]  
+  [[ "${lines[0]}" =~ "length=" ]]  
+}
+@test "Handle GENBANK" {
+  run -0 $exe test.gbk
+  [[ "${lines[0]}" =~ ">NZ_AHMY02000075" ]]  
+}
+@test "Handle GFF" {
+  run -0 $exe test.gff
+  [[ "$output" =~ ">BAC_00002" ]]  
+}
+@test "Handle STOCKHOLM" {
+  run -0 $exe test.sth
+  [[ "$output" =~ ">O83071/259-312" ]]  
+}
+@test "Handle CLUSTAL" {
+  run -0 $exe test.clw
+  [[ "$output" =~ ">gene03" ]]  
+}
+@test "Handle GFA" {
+  run -0 $exe test.gfa
+  [[ "${lines[0]}" =~ ">225289" ]]
+}
+@test "Handle PDB" {
+  run -0 $exe test.pdb
+  [[ "$output" =~ ">1EK3-B" ]]  
+}
+
+@test "GZIP compression" {
+  run -0 $exe test.fna.gz
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+@test "BZIP2 compression" {
+  run -0 $exe test.fna.bz2
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+@test "XZ compression" {
+  skip
+  run -0 $exe test.fna.xz
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+@test "ZSTD compression" {
+  skip
+  run -0 $exe test.fna.zst
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+@test "ZIP compression" {
+  run -0 $exe test.fna.zip
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+
+@test "STDIN input" {
+  run -0 $exe - < test.fna
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+@test "Compressed STDIN input" {
+  run -0 $exe - < test.fna.gz
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+}
+
+@test "Option -l lowercase" {
+  run -0 $exe -l test.fna
+  [[ "${lines[1]}" =~ "aacryantctc" ]]  
+}
+@test "Option -u uppercase" {
+  run -0 $exe -u test.embl
+  [[ "${lines[1]}" =~ "AGTCGCTTTTAA" ]]  
+}
+@test "Option -n deambiguate" {
+  run -0 $exe -n test.fna
+  [[ "${lines[1]}" =~ "AACNNANTCTC" ]]  
+}
+
+@test "GFF with no sequence" {
+  run ! $exe -n test.noseq.gff
+  [[ "$output" =~ "No sequences found" ]]  
+}
+@test "Multiple sequence inputs" {
+  run $exe test.fna test.embl test.pdb test.sth
+  [[ "$output" =~ ">1EK3-B" ]]  
+}
+@test "Multiple sequence with one bad one" {
+  run ! $exe test.fna test.jpg test.pdb test.sth
+  [[ "$output" =~ "ERROR" ]]  
+}
+@test "Allow skipping over bad files" {
+  run $exe -k test.fna test.jpg test.embl
+  [[ "$output" =~ ">K02675" ]]  
+}
+
+@test "Handle EMBL -g" {
+  run -0 $exe -g test.embl
+  [[ "${lines[0]}" =~ ">K02675.1" ]]  
+}
+@test "Handle GENBANK -g" {
+  run -0 $exe -g test.gbk
+  [[ "${lines[0]}" =~ ">NZ_AHMY02000075.1" ]]  
+}
+
+@test "Handle FASTA -s" {
+  run -0 $exe -s test.fna
+  [[ "${lines[0]}" =~ "$FASTA_ID" ]]  
+  [[ ! "${lines[0]}" =~ "Leptospira" ]]  
+}
+@test "Handle FASTQ -s" {
+  run -0 $exe -s test.fq
+  [[ ! "${lines[0]}" =~ "length=" ]]  
+}
+@test "Handle GFF -s" {
+  run -0 $exe -s test.gff
+  [[ ! "$output" =~ "contig" ]]  
+}


=====================================
test.sth → test/test.sth
=====================================



View it on GitLab: https://salsa.debian.org/med-team/any2fasta/-/commit/d21818287db414628a433651a702ccea172e96f2
