[med-svn] [Git][med-team/ariba][upstream] New upstream version 2.14.2+ds

Sascha Steinbiss gitlab at salsa.debian.org
Mon Jul 8 18:06:16 BST 2019



Sascha Steinbiss pushed to branch upstream at Debian Med / ariba


Commits:
d8ad6129 by Sascha Steinbiss at 2019-07-08T16:07:58Z
New upstream version 2.14.2+ds
- - - - -


20 changed files:

- .gitignore
- .travis.yml
- CHANGELOG.md
- Dockerfile
- MANIFEST.in
- README.md
- ariba/assembly.py
- ariba/cdhit.py
- ariba/clusters.py
- ariba/external_progs.py
- ariba/mapping.py
- ariba/ref_genes_getter.py
- ariba/ref_preparer.py
- ariba/reference_data.py
- ariba/tasks/prepareref.py
- + ariba/tests/__init__.py
- ariba/tests/cdhit_test.py
- + ariba/tests/ncbi_getter_test.py
- scripts/ariba
- setup.py


Changes:

=====================================
.gitignore
=====================================
@@ -14,6 +14,7 @@ __pycache__/
 # Distribution / packaging
 .Python
 env/
+venv/
 build/
 develop-eggs/
 dist/
@@ -26,7 +27,7 @@ sdist/
 var/
 *.egg-info/
 .installed.cfg
-*.egg
+*.egg*
 
 # PyInstaller
 #  Usually these files are written by a python script from a template
@@ -43,6 +44,7 @@ htmlcov/
 .tox/
 .coverage
 .cache
+out.card*
 nosetests.xml
 coverage.xml
 
@@ -55,3 +57,10 @@ docs/_build/
 
 # PyBuilder
 target/
+
+# PyCharm
+.idea
+
+# Mac files
+.DS_Store
+


=====================================
.travis.yml
=====================================
@@ -8,7 +8,7 @@ addons:
     - libgfortran3
     - libncurses5-dev
 python:
-- '3.5'
+- '3.6'
 sudo: false
 install:
 - source ./install_dependencies.sh


=====================================
CHANGELOG.md
=====================================
@@ -1,11 +1,86 @@
 # Change Log
 
-## [Unreleased](https://github.com/sanger-pathogens/ariba/tree/HEAD)
+[v2.14.2](https://github.com/sanger-pathogens/ariba/tree/v2.14.2) (2019-06-18)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.14.1...v2.14.2)
 
-[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.1...HEAD)
+**Fixed bugs:**
+
+- Added Spades assembler into Docker file - RT ticket 660940
+- Incremented release number
+- Added LICENSE file into release distribution - RT ticket 660890
+
+[v2.14.1](https://github.com/sanger-pathogens/ariba/tree/v2.14.1) (2019-06-13)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.14.0...v2.14.1)
+
+**Fixed bugs:**
+
+- Ariba fails to install from PyPI due to missing .h files in distribution. Related to MANIFEST.in change in [\#269](https://github.com/sanger-pathogens/ariba/pull/269)
+- Fix for Issue [\#263](https://github.com/sanger-pathogens/ariba/issues/263)
+
+[v2.14.0](https://github.com/sanger-pathogens/ariba/tree/v2.14.0) (2019-06-06)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.5...v2.14.0)
+
+**Closed issues:**
+
+- Reference dataset of ARG-ANNOT cannot be downloaded [\#265](https://github.com/sanger-pathogens/ariba/issues/265)
+- unable to download mlst schemes [\#264](https://github.com/sanger-pathogens/ariba/issues/264)
+- Several "At least one cluster failed!Stopping..." errors [\#261](https://github.com/sanger-pathogens/ariba/issues/261)
+- Allow increasing cd-hit-est memory allocation [\#255](https://github.com/sanger-pathogens/ariba/issues/255)
+- Ariba pubmlstget Error [\#240](https://github.com/sanger-pathogens/ariba/issues/240)
+- segmentation fault when I import ariba [\#230](https://github.com/sanger-pathogens/ariba/issues/230)
+- How to use spades and set minimum coverage [\#215](https://github.com/sanger-pathogens/ariba/issues/215)
+
+**Merged pull requests:**
+
+- Additional v2.14.0 updates [\#270](https://github.com/sanger-pathogens/ariba/pull/270) ([kpepper](https://github.com/kpepper))
+- Added getref feature for NCBI's Bacterial Antimicrobial Resistance Reference Gene Database [\#269](https://github.com/sanger-pathogens/ariba/pull/269) ([schultzm](https://github.com/schultzm))
+
+## [v2.13.5](https://github.com/sanger-pathogens/ariba/tree/v2.13.5) (2019-03-26)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.4...v2.13.5)
+
+**Fixed bugs:**
+
+- Ariba fails without --noclean depending on database [\#205](https://github.com/sanger-pathogens/ariba/issues/205)
+
+**Closed issues:**
+
+- Installation failed: clang: error: linker command failed with exit code 1 \(use -v to see invocation\) [\#245](https://github.com/sanger-pathogens/ariba/issues/245)
+-  virfinddb db not downloading properly [\#229](https://github.com/sanger-pathogens/ariba/issues/229)
+- getref error with resfinder db [\#225](https://github.com/sanger-pathogens/ariba/issues/225)
 
 **Merged pull requests:**
 
+- Updated CHANGELOG.md [\#260](https://github.com/sanger-pathogens/ariba/pull/260) ([kpepper](https://github.com/kpepper))
+- Bump version to 2.13.5 and fix Spades invocation issue [\#259](https://github.com/sanger-pathogens/ariba/pull/259) ([kpepper](https://github.com/kpepper))
+- Minor code change to mitigate issue \#245 \(installation failure\) [\#258](https://github.com/sanger-pathogens/ariba/pull/258) ([kpepper](https://github.com/kpepper))
+
+## [v2.13.4](https://github.com/sanger-pathogens/ariba/tree/v2.13.4) (2019-03-15)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.3...v2.13.4)
+
+**Closed issues:**
+
+- Ariba run signal 28 [\#252](https://github.com/sanger-pathogens/ariba/issues/252)
+- NCBI database. [\#241](https://github.com/sanger-pathogens/ariba/issues/241)
+
+**Merged pull requests:**
+
+- Rebuilt CHANGELOG for v2.13.4 [\#257](https://github.com/sanger-pathogens/ariba/pull/257) ([kpepper](https://github.com/kpepper))
+- Allow increasing cd-hit-est memory allocation \#255 [\#256](https://github.com/sanger-pathogens/ariba/pull/256) ([kpepper](https://github.com/kpepper))
+
+## [v2.13.3](https://github.com/sanger-pathogens/ariba/tree/v2.13.3) (2019-01-02)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.2...v2.13.3)
+
+**Merged pull requests:**
+
+- TB D94A fix [\#251](https://github.com/sanger-pathogens/ariba/pull/251) ([martinghunt](https://github.com/martinghunt))
+
+## [v2.13.2](https://github.com/sanger-pathogens/ariba/tree/v2.13.2) (2018-12-21)
+[Full Changelog](https://github.com/sanger-pathogens/ariba/compare/v2.13.1...v2.13.2)
+
+**Merged pull requests:**
+
+- Update tb panel [\#250](https://github.com/sanger-pathogens/ariba/pull/250) ([martinghunt](https://github.com/martinghunt))
+- Added changelog [\#248](https://github.com/sanger-pathogens/ariba/pull/248) ([ssjunnebo](https://github.com/ssjunnebo))
 - Update python min version [\#247](https://github.com/sanger-pathogens/ariba/pull/247) ([ssjunnebo](https://github.com/ssjunnebo))
 
 ## [v2.13.1](https://github.com/sanger-pathogens/ariba/tree/v2.13.1) (2018-11-16)
@@ -596,3 +671,6 @@
 
 - Initial working version [\#1](https://github.com/sanger-pathogens/ariba/pull/1) ([martinghunt](https://github.com/martinghunt))
 
+
+
+\* *This Change Log was automatically generated by [github_changelog_generator](https://github.com/skywinder/Github-Changelog-Generator)*
\ No newline at end of file


=====================================
Dockerfile
=====================================
@@ -1,7 +1,16 @@
-FROM ubuntu:17.10
+FROM ubuntu:18.04
 
-RUN apt-get update
-RUN apt-get install --no-install-recommends -y \
+ENV DEBIAN_FRONTEND=noninteractive
+
+MAINTAINER ariba-help at sanger.ac.uk
+
+# Software version numbers
+ARG BOWTIE2_VERSION=2.2.9
+ARG SPADES_VERSION=3.13.1
+ARG ARIBA_VERSION=2.14.2
+
+RUN apt-get -qq update && \
+    apt-get install --no-install-recommends -y \
   build-essential \
   cd-hit \
   curl \
@@ -9,7 +18,6 @@ RUN apt-get install --no-install-recommends -y \
   libbz2-dev \
   liblzma-dev \
   mummer \
-  python \
   python3-dev \
   python3-setuptools \
   python3-pip \
@@ -19,17 +27,25 @@ RUN apt-get install --no-install-recommends -y \
   wget \
   zlib1g-dev
 
-RUN wget -q http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.2.9/bowtie2-2.2.9-linux-x86_64.zip \
-  && unzip bowtie2-2.2.9-linux-x86_64.zip \
-  && rm bowtie2-2.2.9-linux-x86_64.zip
+RUN wget -q http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/${BOWTIE2_VERSION}/bowtie2-${BOWTIE2_VERSION}-linux-x86_64.zip \
+  && unzip bowtie2-${BOWTIE2_VERSION}-linux-x86_64.zip \
+  && rm -f bowtie2-${BOWTIE2_VERSION}-linux-x86_64.zip
+
+RUN wget -q https://github.com/ablab/spades/releases/download/v${SPADES_VERSION}/SPAdes-${SPADES_VERSION}-Linux.tar.gz \
+  && tar -zxf SPAdes-${SPADES_VERSION}-Linux.tar.gz \
+  && rm -f SPAdes-${SPADES_VERSION}-Linux.tar.gz
 
 # Need MPLBACKEND="agg" to make matplotlib work without X11, otherwise get the error
 # _tkinter.TclError: no display name and no $DISPLAY environment variable
-ENV ARIBA_BOWTIE2=$PWD/bowtie2-2.2.9/bowtie2 ARIBA_CDHIT=cdhit-est MPLBACKEND="agg"
+ENV ARIBA_BOWTIE2=$PWD/bowtie2-${BOWTIE2_VERSION}/bowtie2 ARIBA_CDHIT=cdhit-est MPLBACKEND="agg"
+ENV PATH=$PATH:$PWD/SPAdes-${SPADES_VERSION}-Linux/bin
+
+RUN cd /usr/local/bin && ln -s /usr/bin/python3 python && cd
 
 RUN git clone https://github.com/sanger-pathogens/ariba.git \
   && cd ariba \
-  && git checkout v2.12.0 \
+  && git checkout v${ARIBA_VERSION} \
+  && rm -rf .git \
   && python3 setup.py test \
   && python3 setup.py install
 


=====================================
MANIFEST.in
=====================================
@@ -1 +1,3 @@
 recursive-include third_party *.h
+include LICENSE
+include AUTHORS
\ No newline at end of file


=====================================
README.md
=====================================
@@ -39,15 +39,15 @@ The input is a FASTA file of reference sequences (can be a mix of genes and nonc
 ## Quick Start
 Get reference data, for instance from [CARD](https://card.mcmaster.ca/). See [getref](https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref) for a full list.
 
-    ariba getref card out.card
+    ariba getref ncbi out.ncbi
 
 Prepare reference data for ARIBA:
 
-    ariba prepareref -f out.card.fa -m out.card.tsv out.card.prepareref
+    ariba prepareref -f out.ncbi.fa -m out.ncbi.tsv out.ncbi.prepareref
 
 Run local assemblies and call variants:
 
-    ariba run out.card.prepareref reads1.fastq reads2.fastq out.run
+    ariba run out.ncbi.prepareref reads1.fastq reads2.fastq out.run
 
 Summarise data from several runs:
 
@@ -60,7 +60,7 @@ Please read the [ARIBA wiki page][ARIBA wiki] for full usage instructions.
 If you encounter an issue when installing ARIBA please contact your local system administrator. If you encounter a bug please log it [here](https://github.com/sanger-pathogens/ariba/issues) or email us at ariba-help at sanger.ac.uk.
 
 ### Required dependencies
-  * [Python3][python] version >= 3.4.0
+  * [Python3][python] version >= 3.6.0
   * [Bowtie2][bowtie2] version >= 2.1.0
   * [CD-HIT][cdhit] version >= 4.6
   * [MUMmer][mummer] version >= 3.23
@@ -69,7 +69,7 @@ ARIBA also depends on several Python packages, all of which are available
 via pip. Installing ARIBA with pip3 will get these automatically if they
 are not already installed:
   * dendropy >= 4.2.0
-  * matplotlib (no minimum version required, but only tested on 2.0.0)
+  * matplotlib>=3.1.0
   * pyfastaq >= 3.12.0
   * pysam >= 0.9.1
   * pymummer >= 0.10.1
@@ -85,10 +85,16 @@ Download the latest release from this github repository or clone it. Run the tes
 
     python3 setup.py test
 
+**Note for OS X:** The tests require gawk which will need to be installed separately, e.g. via Homebrew.
+
 If the tests all pass, install:
 
     python3 setup.py install
 
+Alternatively, install directly from github using:
+
+    pip3 install git+https://github.com/sanger-pathogens/ariba.git #--user
+
 ### Docker
 ARIBA can be run in a Docker container. First install Docker, then install ARIBA:
 
@@ -98,6 +104,8 @@ To use ARIBA use a command like this (substituting in your directories), where y
 
     docker run --rm -it -v /home/ubuntu/data:/data sangerpathogens/ariba ariba -h
 
+When calling Ariba via Docker (as above) you'll also need to add **/data/** in front of all the passed in file or directory names (e.g. /data/my_output_folder).
+
 
 ### Debian (testing)
 ARIBA is available in the latest version of Debian, and over time will progressively filter through to Ubuntu and other distributions which use Debian. To install it as root:
@@ -138,9 +146,9 @@ Note that ARIBA also runs `bowtie2-build`, for which it uses the
 it would try to use
 
     $HOME/bowtie2-2.1.0/bowtie2-build
- 
+
 ## Temporary files
- 
+
 ARIBA can temporarily make a large number of files whilst running, which
 are put in a temporary directory made by ARIBA.  The total size of these
 files is small, but there can be many of them. This can be a
@@ -219,5 +227,3 @@ Microbial Genomics 2017. doi: [110.1099/mgen.0.000131](http://mgen.microbiologyr
   [ARIBA wiki]: https://github.com/sanger-pathogens/ariba/wiki
   [mummer]: http://mummer.sourceforge.net/
   [python]: https://www.python.org/
-
-


=====================================
ariba/assembly.py
=====================================
@@ -140,7 +140,7 @@ class Assembly:
                 spades_out_seq_base = "contigs.fasta"
             else:
                 raise ValueError("Unknown spades_mode value: {}".format(self.spades_mode))
-            asm_cmd = [spades_exe, "-t", str(self.threads), "--pe1-1", self.reads1, "--pe1-2", self.reads2, "-o", self.assembler_dir] + \
+            asm_cmd = ['python3', spades_exe, "-t", str(self.threads), "--pe1-1", self.reads1, "--pe1-2", self.reads2, "-o", self.assembler_dir] + \
                 spades_options
             asm_ok,err = common.syscall(asm_cmd, verbose=True, verbose_filehandle=self.log_fh, shell=False, allow_fail=True)
             if not asm_ok:


=====================================
ariba/cdhit.py
=====================================
@@ -13,6 +13,7 @@ class Runner:
       seq_identity_threshold=0.9,
       threads=1,
       length_diff_cutoff=0.0,
+      memory_limit=None,
       verbose=False,
       min_cluster_number=0
     ):
@@ -20,10 +21,14 @@ class Runner:
         if not os.path.exists(infile):
             raise Error('File not found: "' + infile + '". Cannot continue')
 
+        if (memory_limit is not None) and (memory_limit < 0):
+            raise Error('Input parameter cdhit_max_memory is set to an invalid value. Cannot continue')
+
         self.infile = os.path.abspath(infile)
         self.seq_identity_threshold = seq_identity_threshold
         self.threads = threads
         self.length_diff_cutoff = length_diff_cutoff
+        self.memory_limit = memory_limit
         self.verbose = verbose
         self.min_cluster_number = min_cluster_number
         extern_progs = external_progs.ExternalProgs(fail_on_error=True, using_spades=False)
@@ -133,15 +138,11 @@ class Runner:
         return clusters
 
 
-    def run(self):
-        tmpdir = tempfile.mkdtemp(prefix='tmp.run_cd-hit.', dir=os.getcwd())
-        cdhit_fasta = os.path.join(tmpdir, 'cdhit')
-        cluster_info_outfile = cdhit_fasta + '.bak.clstr'
-
+    def get_run_cmd(self, output_file):
         cmd = ' '.join([
             self.cd_hit_est,
             '-i', self.infile,
-            '-o', cdhit_fasta,
+            '-o', output_file,
             '-c', str(self.seq_identity_threshold),
             '-T', str(self.threads),
             '-s', str(self.length_diff_cutoff),
@@ -149,8 +150,21 @@ class Runner:
             '-bak 1',
         ])
 
+        # Add in cdhit memory allocation if one has been specified
+        if self.memory_limit is not None:
+            cmd = ' '.join([cmd, '-M', str(self.memory_limit)])
+
+        return cmd
+
+
+    def run(self):
+        tmpdir = tempfile.mkdtemp(prefix='tmp.run_cd-hit.', dir=os.getcwd())
+        cdhit_fasta = os.path.join(tmpdir, 'cdhit')
+        cluster_info_outfile = cdhit_fasta + '.bak.clstr'
+        cmd = self.get_run_cmd(cdhit_fasta)
         common.syscall(cmd, verbose=self.verbose)
         clusters = self._get_clusters_from_bak_file(cluster_info_outfile, self.min_cluster_number)
         common.rmtree(tmpdir)
         return clusters
 
+
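
Note on the change above: moving command construction out of run() and into get_run_cmd() lets the tests check the command string without invoking cd-hit-est. A minimal sketch of the same idea follows. It assumes a hypothetical standalone helper named build_cdhit_cmd, which is not part of ARIBA, and uses only the flags visible in the diff and in the expected strings of cdhit_test.py.

```python
def build_cdhit_cmd(exe, infile, output_file, identity=0.9, threads=1,
                    length_diff_cutoff=0.0, memory_limit=None):
    """Hypothetical helper that mirrors Runner.get_run_cmd from the diff."""
    # Same validation as the new check in Runner.__init__: negative is invalid.
    if memory_limit is not None and memory_limit < 0:
        raise ValueError('memory_limit must be None or >= 0')

    parts = [
        exe,
        '-i', infile,
        '-o', output_file,
        '-c', str(identity),
        '-T', str(threads),
        '-s', str(length_diff_cutoff),
        '-d 0',
        '-bak 1',
    ]

    # Only pass -M when a limit was requested; otherwise cd-hit-est
    # keeps its own default.
    if memory_limit is not None:
        parts += ['-M', str(memory_limit)]

    return ' '.join(parts)


if __name__ == '__main__':
    print(build_cdhit_cmd('cd-hit-est', 'in.fa', 'out', memory_limit=900))
```

The test names in the diff suggest that a value of 0 is passed through as "-M 0" and treated as unlimited, which is why the check rejects only negative values.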


=====================================
ariba/clusters.py
=====================================
@@ -168,7 +168,7 @@ class Clusters:
         if self.verbose:
             print('Temporary directory:', self.tmp_dir)
 
-        for i in [x for x in dir(signal) if x.startswith("SIG") and x not in {'SIGCHLD', 'SIGCLD'}]:
+        for i in [x for x in dir(signal) if x.startswith("SIG") and x not in {'SIGCHLD', 'SIGCLD', 'SIGPIPE', 'SIGTSTP', 'SIGCONT'}]:
             try:
                 signum = getattr(signal, i)
                 signal.signal(signum, self._receive_signal)
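
Note on the change above: the list comprehension selects which signals get the _receive_signal handler, and this commit adds SIGPIPE, SIGTSTP and SIGCONT to the excluded set. The sketch below only computes the list of names and installs nothing. The function name catchable_signal_names is hypothetical. Unlike the ARIBA code, it also drops the SIG_* constants (SIG_DFL, SIG_IGN and so on), which are not signal numbers.

```python
import signal

# Names excluded by the new version of the loop in the diff.
SKIPPED = {'SIGCHLD', 'SIGCLD', 'SIGPIPE', 'SIGTSTP', 'SIGCONT'}


def catchable_signal_names(skipped=SKIPPED):
    """Return the signal names a handler would be registered for."""
    return [
        name for name in dir(signal)
        if name.startswith('SIG')
        and not name.startswith('SIG_')   # SIG_DFL, SIG_IGN, ... are not signals
        and name not in skipped
    ]


if __name__ == '__main__':
    print(catchable_signal_names())
```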


=====================================
ariba/external_progs.py
=====================================
@@ -144,9 +144,14 @@ class ExternalProgs:
            Returns tuple (bool, version). First element True iff found version ok.
            Second element is version string (if found), otherwise an error message'''
         assert prog in prog_to_version_cmd
-        cmd, regex = prog_to_version_cmd[prog]
-        cmd = path + ' ' + cmd
-        cmd_output = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
+        args, regex = prog_to_version_cmd[prog]
+        cmd = path + ' ' + args
+        if prog == 'spades':
+            cmd_output = subprocess.Popen(['python3', path, args], shell=False, stdout=subprocess.PIPE,
+                                          stderr=subprocess.PIPE).communicate()
+        else:
+            cmd_output = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()
+
         cmd_output = common.decode(cmd_output[0]).split('\n')[:-1] + common.decode(cmd_output[1]).split('\n')[:-1]
 
         for line in cmd_output:


=====================================
ariba/mapping.py
=====================================
@@ -86,8 +86,10 @@ def run_bowtie2(
     if LooseVersion(bowtie2_version) >= LooseVersion('2.3.1'):
         map_cmd.append('--score-min G,1,10')
 
+    # We use gawk instead of awk here as we need bitwise comparisons
+    # and these are not available via awk on Mac OSX.
     if remove_both_unmapped:
-        map_cmd.append(r''' | awk ' !(and($2,4)) || !(and($2,8)) ' ''')
+        map_cmd.append(r''' | gawk ' !(and($2,4)) || !(and($2,8)) ' ''')
 
     tmp_sam_file = out_prefix + '.unsorted.sam'
     map_cmd.append(' > ' + tmp_sam_file)
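
Note on the change above: the gawk expression tests the SAM FLAG column ($2) against bit 4 (read unmapped) and bit 8 (mate unmapped), and keeps a record unless both bits are set. The same predicate in Python, as a sketch for readers unfamiliar with the awk and() builtin; the function name keep_sam_record is hypothetical and is not part of ARIBA.

```python
READ_UNMAPPED = 0x4   # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM FLAG bit: the mate of this read is unmapped


def keep_sam_record(flag):
    """Return True unless both the read and its mate are unmapped.

    Equivalent to the gawk test  !(and($2,4)) || !(and($2,8)).
    """
    return not (flag & READ_UNMAPPED) or not (flag & MATE_UNMAPPED)


if __name__ == '__main__':
    for flag in (0, 4, 8, 12):
        print(flag, keep_sam_record(flag))
```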


=====================================
ariba/ref_genes_getter.py
=====================================
@@ -20,6 +20,7 @@ allowed_ref_dbs = {
     'vfdb_core',
     'vfdb_full',
     'virulencefinder',
+    'ncbi',#added by schultzm
 }
 
 argannot_ref = '"ARG-ANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes",\nGupta et al 2014, PMID: 24145532\n'
@@ -461,7 +462,7 @@ class RefGenesGetter:
     @classmethod
     def _fix_virulencefinder_fasta_file(cls, infile, outfile):
         '''Some line breaks are missing in the FASTA files from
-        viruslence finder. Which means there are lines like this:
+        virulence finder. Which means there are lines like this:
         AAGATCCAATAACTGAAGATGTTGAACAAACAATTCATAATATTTATGGTCAATATGCTATTTTCGTTGA
         AGGTGTTGCGCATTTACCTGGACATCTCTCTCCATTATTAAAAAAATTACTACTTAAATCTTTATAA>coa:1:BA000018.3
         ATGAAAAAGCAAATAATTTCGCTAGGCGCATTAGCAGTTGCATCTAGCTTATTTACATGGGATAACAAAG
@@ -541,5 +542,124 @@ class RefGenesGetter:
         print('If you use this downloaded data, please cite:')
         print('"Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli", Joensen al 2014, PMID: 24574290\n')
 
+
+
+    def _get_from_ncbi(self, outprefix, test=None): ## author github:schultzm
+        """
+        Download the NCBI-curated Bacterial Antimicrobial Resistance Reference Gene Database.
+        Uses BioPython to do the data collection and extraction.
+        Author: schultzm (github) May 31, 2019.
+
+        >>> from Bio import Entrez
+        >>> import getpass
+        >>> import socket
+        >>> BIOPROJECT = "PRJNA313047"
+        >>> RETMAX = 100
+        >>> import getpass
+        >>> import socket
+        >>> Entrez.email = getpass.getuser()+'@'+socket.getfqdn()
+        >>> search_results = Entrez.read(Entrez.esearch(db="nucleotide",
+        ...                                             term=BIOPROJECT,
+        ...                                             retmax=RETMAX,
+        ...                                             usehistory="y",
+        ...                                             idtype="acc"))
+        >>> webenv = search_results["WebEnv"]
+        >>> query_key = search_results["QueryKey"]
+        >>> target_accn = "    NG_061627.1"
+        >>> records = Entrez.efetch(db="nucleotide",
+        ...                         rettype="gbwithparts",
+        ...                         retmode="text",
+        ...                         retstart=0,
+        ...                         retmax=RETMAX,
+        ...                         webenv=webenv,
+        ...                         query_key=query_key,
+        ...                         idtype="acc")
+        >>> from Bio.Alphabet import generic_dna
+        >>> from Bio import SeqIO
+        >>> from Bio.Seq import Seq
+        >>> from Bio.SeqRecord import SeqRecord
+        >>> for gb_record in SeqIO.parse(records, "genbank"):
+        ...     if gb_record.id == 'NG_061627.1':
+        ...         gb_record
+        SeqRecord(seq=Seq('TAATCCTTGGAAACCTTAGAAATTGATGGAGGATCTTAACAAGATCCTGACATA...GGC', IUPACAmbiguousDNA()), id='NG_061627.1', name='NG_061627', description='Klebsiella pneumoniae SCKLB88 mcr-8 gene for phosphoethanolamine--lipid A transferase MCR-8.2, complete CDS', dbxrefs=['BioProject:PRJNA313047'])
+
+        """
+
+        outprefix = os.path.abspath(outprefix)
+        final_fasta = outprefix + '.fa'
+        final_tsv = outprefix + '.tsv'
+
+        # Download the database as genbank using Bio.Entrez
+        from Bio import Entrez
+        import getpass
+        import socket
+        import sys
+        BIOPROJECT = "PRJNA313047" ## https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA313047
+        RETMAX=100000000
+        Entrez.email = getpass.getuser()+'@'+socket.getfqdn()
+        # See section 9.15  Using the history and WebEnv in
+        # http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:entrez-webenv
+        search_results = Entrez.read(Entrez.esearch(db="nucleotide",
+                                                    term=BIOPROJECT,
+                                                    retmax=RETMAX,
+                                                    usehistory="y",
+                                                    idtype="acc"))
+        acc_list = search_results["IdList"]
+        webenv = search_results["WebEnv"]
+        query_key = search_results["QueryKey"]
+        if test:
+            return acc_list
+            #up to here
+        if len(acc_list) > 0:
+            print(f"E-fetching {len(acc_list)} genbank records from BioProject {BIOPROJECT} and writing to {final_fasta} and {final_tsv}. This may take a while.", file=sys.stderr)
+            records = Entrez.efetch(db="nucleotide",
+                                               rettype="gbwithparts", retmode="text",
+                                               retstart=0, retmax=RETMAX,
+                                               webenv=webenv, query_key=query_key,
+                                               idtype="acc")
+            #pull out the records as fasta from the genbank
+            from Bio.Alphabet import generic_dna
+            from Bio import SeqIO
+            from Bio.Seq import Seq
+            from Bio.SeqRecord import SeqRecord
+
+            print(f"Parsing genbank records.")
+            with open(final_fasta, "w") as f_out_fa, \
+                 open(final_tsv, "w") as f_out_tsv:
+                for idx, gb_record in enumerate(SeqIO.parse(records, "genbank")):
+                    print(f"'{gb_record.id}'")
+                    n=0
+                    record_new=[]
+                    for index, feature in enumerate(gb_record.features):
+                        if feature.type == 'CDS':
+                            n+=1
+                            gb_feature = gb_record.features[index]
+                            id = None
+                            try:
+                                id = gb_feature.qualifiers["allele"]
+                            except:
+                                try:
+                                    try:
+                                        id = gb_feature.qualifiers["gene"]
+                                    except:
+                                        id = gb_feature.qualifiers["locus_tag"]
+                                except KeyError:
+                                    print(f"gb_feature.qualifier not found", file=sys.stderr)
+                            accession = gb_record.id
+                            seq_out = Seq(str(gb_feature.extract(gb_record.seq)), generic_dna)
+                            record_new.append(SeqRecord(seq_out,
+                                         id=f"{id[0]}.{accession}",
+                                         description=""))
+                    if len(record_new) == 1:
+                        print(f"Processing record {idx+1} of {len(acc_list)} (accession {accession})", file=sys.stderr)
+                        f_out_fa.write(f"{record_new[0].format('fasta').rstrip()}\n")
+                        f_out_tsv.write(f"{id[0]}.{accession}\t1\t0\t.\t.\t{gb_feature.qualifiers['product'][0]}\n")
+                    if idx == len(acc_list)-1:
+                        print('Finished. Final files are:', final_fasta, final_tsv, sep='\n\t', end='\n\n')
+                        print('You can use them with ARIBA like this:')
+                        print('ariba prepareref -f', final_fasta, '-m', final_tsv, 'output_directory\n')
+
+        else:
+            print(f"Nothing to do. Exiting.")    
     def run(self, outprefix):
         exec('self._get_from_' + self.ref_db + '(outprefix)')
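
Note on _get_from_ncbi above: the nested try/except blocks look up the "allele" qualifier first, then "gene", then "locus_tag". If none is present, id stays None and the later id[0] indexing fails. The sketch below expresses the same fallback order as a loop. It assumes qualifiers behaves like a dict that maps names to lists of strings, as the id[0] indexing in the diff suggests. The name pick_feature_id is hypothetical.

```python
def pick_feature_id(qualifiers, keys=('allele', 'gene', 'locus_tag')):
    """Return the first value of the first qualifier found, else None."""
    for key in keys:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None


if __name__ == '__main__':
    print(pick_feature_id({'gene': ['mcr-8'], 'locus_tag': ['X_001']}))
```

Returning None lets the caller skip a feature explicitly instead of failing when it indexes the result.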


=====================================
ariba/ref_preparer.py
=====================================
@@ -19,6 +19,7 @@ class RefPreparer:
         genetic_code=11,
         cdhit_min_id=0.9,
         cdhit_min_length=0.0,
+        cdhit_max_memory=None,
         run_cdhit=True,
         clusters_file=None,
         threads=1,
@@ -40,6 +41,7 @@ class RefPreparer:
         self.genetic_code = genetic_code
         self.cdhit_min_id = cdhit_min_id
         self.cdhit_min_length = cdhit_min_length
+        self.cdhit_max_memory = cdhit_max_memory
         self.run_cdhit = run_cdhit
         self.clusters_file = clusters_file
         self.threads = threads
@@ -193,6 +195,7 @@ class RefPreparer:
             seq_identity_threshold=self.cdhit_min_id,
             threads=self.threads,
             length_diff_cutoff=self.cdhit_min_length,
+            memory_limit=self.cdhit_max_memory,
             nocluster=not self.run_cdhit,
             verbose=self.verbose,
             clusters_file=self.clusters_file,
@@ -214,4 +217,4 @@ class RefPreparer:
             print('    grep REMOVE', os.path.join(outdir, '01.filter.check_genes.log'), file=sys.stderr)
 
         if number_of_bad_variants_logged > 0:
-            print('WARNING. Problem with at least one variant. Problem variants are rmoved. Please see the file', os.path.join(outdir, '01.filter.check_metadata.log'), 'for details.', file=sys.stderr)
+            print('WARNING. Problem with at least one variant. Problem variants are removed. Please see the file', os.path.join(outdir, '01.filter.check_metadata.log'), 'for details.', file=sys.stderr)


=====================================
ariba/reference_data.py
=====================================
@@ -434,7 +434,7 @@ class ReferenceData:
         pyfastaq.utils.close(f_out)
 
 
-    def cluster_with_cdhit(self, outprefix, seq_identity_threshold=0.9, threads=1, length_diff_cutoff=0.0, nocluster=False, verbose=False, clusters_file=None):
+    def cluster_with_cdhit(self, outprefix, seq_identity_threshold=0.9, threads=1, length_diff_cutoff=0.0, memory_limit=None, nocluster=False, verbose=False, clusters_file=None):
         clusters = {}
         ReferenceData._write_sequences_to_files(self.sequences, self.metadata, outprefix)
         ref_types = ('noncoding', 'noncoding.varonly', 'gene', 'gene.varonly')
@@ -454,6 +454,7 @@ class ReferenceData:
               seq_identity_threshold=seq_identity_threshold,
               threads=threads,
               length_diff_cutoff=length_diff_cutoff,
+              memory_limit=memory_limit,
               verbose=verbose,
               min_cluster_number = min_cluster_number,
             )


=====================================
ariba/tasks/prepareref.py
=====================================
@@ -21,6 +21,7 @@ def run(options):
         genetic_code=options.genetic_code,
         cdhit_min_id=options.cdhit_min_id,
         cdhit_min_length=options.cdhit_min_length,
+        cdhit_max_memory=options.cdhit_max_memory,
         run_cdhit=not options.no_cdhit,
         clusters_file=options.cdhit_clusters,
         threads=options.threads,


=====================================
ariba/tests/__init__.py
=====================================


=====================================
ariba/tests/cdhit_test.py
=====================================
@@ -1,7 +1,9 @@
 import unittest
 import os
+import re
 from ariba import cdhit, external_progs
 
+
 modules_dir = os.path.dirname(os.path.abspath(cdhit.__file__))
 data_dir = os.path.join(modules_dir, 'tests', 'data')
 extern_progs = external_progs.ExternalProgs()
@@ -13,6 +15,13 @@ class TestCdhit(unittest.TestCase):
             cdhit.Runner('oopsnotafile', 'out')
 
 
+    def test_init_fail_invalid_memory(self):
+        '''test_init_fail_invalid_memory'''
+        infile = os.path.join(data_dir, 'cdhit_test_run.in.fa')
+        with self.assertRaises(cdhit.Error):
+            cdhit.Runner(infile, memory_limit=-10)
+
+
     def test_get_clusters_from_bak_file(self):
         '''test _get_clusters_from_bak_file'''
         infile = os.path.join(data_dir, 'cdhit_test_get_clusters_from_bak_file.in')
@@ -162,3 +171,30 @@ class TestCdhit(unittest.TestCase):
             '1': {'seq3'},
         }
         self.assertEqual(clusters, expected_clusters)
+
+
+    def test_get_run_cmd_with_default_memory(self):
+        '''test_get_run_cmd_with_default_memory'''
+        fa_infile = os.path.join(data_dir, 'cdhit_test_run_get_clusters_from_dict_rename.in.fa')
+        r = cdhit.Runner(fa_infile)
+        run_cmd = r.get_run_cmd('foo/bar/file.out')
+        match = re.search('^.+ -o foo/bar/file.out -c 0.9 -T 1 -s 0.0 -d 0 -bak 1$', run_cmd)
+        self.assertIsNotNone(match, msg="Command output was " + run_cmd)
+
+
+    def test_get_run_cmd_with_non_default_memory(self):
+        '''test_get_run_cmd_with_non_default_memory'''
+        fa_infile = os.path.join(data_dir, 'cdhit_test_run_get_clusters_from_dict_rename.in.fa')
+        r = cdhit.Runner(fa_infile, memory_limit=900)
+        run_cmd = r.get_run_cmd('foo/bar/file.out')
+        match = re.search('^.+ -o foo/bar/file.out -c 0.9 -T 1 -s 0.0 -d 0 -bak 1 -M 900$', run_cmd)
+        self.assertIsNotNone(match, msg="Command output was " + run_cmd)
+
+
+    def test_get_run_cmd_with_unlimited_memory(self):
+        '''test_get_run_cmd_with_unlimited_memory'''
+        fa_infile = os.path.join(data_dir, 'cdhit_test_run_get_clusters_from_dict_rename.in.fa')
+        r = cdhit.Runner(fa_infile, memory_limit=0)
+        run_cmd = r.get_run_cmd('foo/bar/file.out')
+        match = re.search('^.+ -o foo/bar/file.out -c 0.9 -T 1 -s 0.0 -d 0 -bak 1 -M 0$', run_cmd)
+        self.assertIsNotNone(match, msg="Command output was " + run_cmd)


=====================================
ariba/tests/ncbi_getter_test.py
=====================================
@@ -0,0 +1,18 @@
+#!/usr/bin/env python3
+import unittest
+import os
+from ariba.ref_genes_getter import RefGenesGetter
+
+class TestNcbiGetter(unittest.TestCase):
+    def setUp(self):
+        self.ncbi_db = RefGenesGetter('ncbi')._get_from_ncbi('ncbi.test', 'test')
+        # self.ncbi_db = RefGenesGetter.run('ncbi')
+        
+    def test_ncbi(self):
+        '''
+        Test that more than 4000 records have been found on NCBI AMR DB.
+        '''
+        self.assertTrue(len(self.ncbi_db) > 4000)
+
+if __name__ == '__main__':
+    unittest.main()
\ No newline at end of file


=====================================
scripts/ariba
=====================================
@@ -62,7 +62,7 @@ subparser_getref = subparsers.add_parser(
     description='Download reference data from one of a few supported public resources',
 )
 subparser_getref.add_argument('--debug', action='store_true', help='Do not delete temporary downloaded files')
-subparser_getref.add_argument('--version', help='Version of reference data to download. If not used, gets the latest version. Applies to: card, megares, plasmidfinder, resfinder, srst2_argannot, virulencefinder. For plasmid/res/virulencefinder: default is to get latest from bitbucket - supply git commit hash to get a specific version from bitbucket, or use "old " to get from old website. For srst2_argannot: default is latest version r2, use r1 to get the older version')
+subparser_getref.add_argument('--version', help='Version of reference data to download. If not used, gets the latest version. Applies to: card, megares, ncbi, plasmidfinder, resfinder, srst2_argannot, virulencefinder. For plasmid/res/virulencefinder: default is to get latest from bitbucket - supply git commit hash to get a specific version from bitbucket, or use "old " to get from old website. For srst2_argannot: default is latest version r2, use r1 to get the older version')
 subparser_getref.add_argument('db', help='Database to download. Must be one of: ' + ' '.join(allowed_dbs), choices=allowed_dbs, metavar="DB name")
 subparser_getref.add_argument('outprefix', help='Prefix of output filenames')
 subparser_getref.set_defaults(func=ariba.tasks.getref.run)
@@ -135,7 +135,8 @@ cdhit_group = subparser_prepareref.add_argument_group('cd-hit options')
 cdhit_group.add_argument('--no_cdhit', action='store_true', help='Do not run cd-hit. Each input sequence is put into its own "cluster". Incompatible with --cdhit_clusters.')
 cdhit_group.add_argument('--cdhit_clusters', help='File specifying how the sequences should be clustered. Will be used instead of running cdhit. Format is one cluster per line. Sequence names separated by whitespace. Incompatible with --no_cdhit', metavar='FILENAME')
 cdhit_group.add_argument('--cdhit_min_id', type=float, help='Sequence identity threshold (cd-hit option -c) [%(default)s]', default=0.9, metavar='FLOAT')
-cdhit_group.add_argument('--cdhit_min_length', type=float, help='length difference cutoff (cd-hit option -s) [%(default)s]', default=0.0, metavar='FLOAT')
+cdhit_group.add_argument('--cdhit_min_length', type=float, help='Length difference cutoff (cd-hit option -s) [%(default)s]', default=0.0, metavar='FLOAT')
+cdhit_group.add_argument('--cdhit_max_memory', type=int, help='Memory limit in MB (cd-hit option -M) [%(default)s]. Use 0 for unlimited.', metavar='INT')
 
 other_prep_group = subparser_prepareref.add_argument_group('other options')
 other_prep_group.add_argument('--min_gene_length', type=int, help='Minimum allowed length in nucleotides of reference genes [%(default)s]', metavar='INT', default=6)


=====================================
setup.py
=====================================
@@ -55,7 +55,7 @@ vcfcall_mod = Extension(
 setup(
     ext_modules=[minimap_mod, fermilite_mod, vcfcall_mod],
     name='ariba',
-    version='2.13.3',
+    version='2.14.2',
     description='ARIBA: Antibiotic Resistance Identification By Assembly',
     packages = find_packages(),
     package_data={'ariba': ['test_run_data/*', 'tb_data/*']},
@@ -72,7 +72,8 @@ setup(
         'matplotlib',
         'pyfastaq >= 3.12.0',
         'pysam >= 0.9.1',
-        'pymummer>=0.10.2',
+        'pymummer<=0.10.3',
+        'matplotlib>=3.1.0',
     ],
     license='GPLv3',
     classifiers=[



View it on GitLab: https://salsa.debian.org/med-team/ariba/commit/d8ad6129e682314047f3ef0fb1b73d5abe10a74b


