[med-svn] [iva] 01/04: Imported Upstream version 1.0.0
Sascha Steinbiss
sascha-guest at moszumanska.debian.org
Mon Aug 24 22:24:03 UTC 2015
This is an automated email from the git hooks/post-receive script.
sascha-guest pushed a commit to branch master
in repository iva.
commit 39b4dbf253ae8ce3554bec06bc38137d95c8a445
Author: Sascha Steinbiss <sascha at steinbiss.name>
Date: Mon Aug 24 21:54:33 2015 +0000
Imported Upstream version 1.0.0
---
README.md | 138 +----
iva/__init__.py | 1 +
iva/assembly.py | 53 +-
iva/common.py | 2 +-
iva/contig_trim.py | 10 +-
iva/edge.py | 2 +-
iva/egg_extract.py | 77 +++
iva/external_progs.py | 9 +-
iva/gage/getCorrectnessStats.sh | 2 +-
iva/gage/getMummerStats.sh | 30 +-
iva/graph.py | 2 +-
iva/kcount.py | 52 +-
iva/kraken.py | 40 +-
iva/mapping.py | 50 +-
iva/mummer.py | 14 +-
iva/qc.py | 188 +++---
iva/qc_external.py | 78 +--
iva/seed.py | 8 +-
iva/seed_processor.py | 16 +-
iva/tests/assembly_test.py | 2 +-
iva/tests/contig_test.py | 2 +-
iva/tests/data/contig_trim_test_contigs.fa | 2 +
iva/tests/data/egg_extract_test_dir.zip | Bin 0 -> 1230 bytes
iva/tests/data/egg_extract_test_dir/.zip | Bin 0 -> 1322 bytes
iva/tests/data/egg_extract_test_dir/file1 | 0
iva/tests/data/egg_extract_test_dir/iva/gage/gage1 | 2 +
iva/tests/data/egg_extract_test_dir/iva/gage/gage2 | 0
iva/tests/data/egg_extract_test_dir/iva/ratt/ratt1 | 0
iva/tests/data/egg_extract_test_dir/iva/ratt/ratt2 | 0
iva/tests/data/qc_external_test_run_gage.out | 55 ++
iva/tests/data/qc_external_test_run_gage.ref.fa | 51 ++
iva/tests/data/qc_external_test_run_gage.scaffs.fa | 53 ++
.../data/qc_external_test_run_ratt.assembly.fa | 155 +++++
.../data/qc_external_test_run_ratt.embl/hiv-1.embl | 645 +++++++++++++++++++++
iva/tests/edge_test.py | 2 +-
iva/tests/egg_extract_test.py | 105 ++++
iva/tests/graph_test.py | 2 +-
iva/tests/mapping_test.py | 6 +-
iva/tests/mummer_test.py | 6 +-
iva/tests/qc_external_test.py | 33 ++
iva/tests/qc_test.py | 148 ++---
iva/tests/seed_processor_test.py | 4 +-
iva/tests/seed_test.py | 26 +-
scripts/iva | 54 +-
scripts/iva_qc | 2 +-
setup.py | 48 +-
46 files changed, 1659 insertions(+), 516 deletions(-)
diff --git a/README.md b/README.md
index 9c45133..93f8bcc 100644
--- a/README.md
+++ b/README.md
@@ -5,143 +5,13 @@ IVA is a _de novo_ assembler designed to assemble virus genomes that
have no repeat sequences, using Illumina read pairs sequenced from
mixed populations at extremely high and variable depth.
-Installation instructions are below. For usage help and examples, see
-the [IVA wiki page] [IVA wiki page].
+Please see the [IVA website] [IVA website] for more information,
+including installation instructions.
-------------------------------------------------------------------------------
-Dependencies
-------------
+For usage help and examples, see the [IVA wiki page] [IVA wiki page].
-IVA has been developed for and tested on Linux and relies on
-some third-party tools that need to be installed first.
-For citations, see the References section at the bottom of this readme.
-You may also find it useful to read some
-[detailed installation instructions for Ubuntu] [ubuntu install].
-
-#### Assembly dependencies
-
-The following are required to run an assembly with IVA.
-
- * [Python 3] [python] (IVA is written in Python 3) and the following packages:
- * [Fastaq] [fastaq]
- * [networkx] [networkx]
- * [Pysam] [pysam]
- * [KMC] [kmc code] installed, so that `kmc` and `kmc_dump` are in your path.
- * [MUMmer] [mummer code] installed with its executables (ie `nucmer` etc)
- in your path.
- * [Samtools] [samtools code] installed, so that `samtools` is in your path.
- * [SMALT] [smalt] installed, so that `smalt` is in your path.
- * Optional: [Trimmomatic] [trimmo code] - although this is optional, it is
- highly recommended.
- It is used to trim adapter sequences from reads before assembling and
- significantly improves the results.
- You don't need to add anything to your path, but will
- need to tell IVA where the Java jar file is to use Trimmomatic (see
- [examples] [IVA wiki examples]).
-
-IVA is known to work with kmc version 2.0, MUMmer version 3.23, samtools
-version 0.1.19 and SMALT version 0.7.5.
-
-#### QC dependencies
-
-The QC scripts have the following dependencies, in addition to MUMmer,
-smalt and samtools:
-
- * [R] [r code] installed and in your path.
- * Optional: [kraken] [kraken code] installed, so that `kraken` and
- `kraken-build` are in your path. These are needed if you want to
- make your own reference database, or if you use a database to
- automatically choose the reference genome.
-
-The QC code is also bundled with the following (they do not need to be installed).
-
- * Analysis code from the [GAGE] [gage code] assembly evaluation
- project. We are grateful to the GAGE authors for permission to modify and
- redistribute this
- code.
- * [RATT] [ratt code] is used to transfer annotation from a reference
- onto the assembly.
-
-------------------------------------------------------------------------------
-
-Installation
-------------
-
-Before installing the dependencies, you may also find it useful to read some
-[detailed installation instructions for Ubuntu] [ubuntu install].
-Once the dependencies are installed, take a copy of the [latest IVA release] [IVA latest release].
-
-Then run the tests:
-
- python3 setup.py test
-
-If all the tests pass, then install with:
-
- python3 setup.py install
-
-Or if you don't have root access, then run:
-
- python3 setup.py install --prefix /install/in/here
-
-------------------------------------------------------------------------------
-
-References
-----------
-
-[**Adapter sequences:**] [adapters paper] Quail, M. a et al. _Optimal enzymes for amplifying sequencing libraries_. Nat. Methods 9, 10-1 (2012).
-
-[**GAGE:**] [gage paper] Salzberg, S. L. et al. _GAGE: A critical evaluation of genome
-assemblies and assembly algorithms_. Genome Res. 22, 557-67 (2012).
-
-[**KMC:**] [kmc paper] Deorowicz, S., Debudaj-Grabysz, A. & Grabowski, S. _Disk-based k-mer
-counting on a PC_. BMC Bioinformatics 14, 160 (2013).
-
-[**Kraken:**] [kraken paper] Wood, D. E. & Salzberg, S. L. _Kraken: ultrafast metagenomic
-sequence classification using exact alignments_.
-Genome Biol. 15, R46 (2014).
-
-[**MUMmer:**] [mummer paper] Kurtz, S. et al. _Versatile and open software for comparing large
-genomes_. Genome Biol. 5, R12 (2004).
-
-**R:** R Core Team (2013). _R: A language and environment for statistical
-computing_. R Foundation for Statistical Computing, Vienna, Austria.
-URL http://www.R-project.org/.
-
-[**RATT:**] [ratt paper] Otto, T. D., Dillon, G. P., Degrave, W. S. & Berriman, M.
-_RATT: Rapid Annotation Transfer Tool_. Nucleic Acids Res. 39, e57 (2011).
-
-[**SAMtools:**] [samtools paper] Li, H. et al. _The Sequence Alignment/Map format and SAMtools_.
-Bioinformatics 25, 2078-9 (2009).
-
-[**Trimmomatic:**] [trimmo paper] Bolger, A. M., Lohse, M. & Usadel, B. _Trimmomatic: A
-flexible trimmer for Illumina Sequence Data_. Bioinformatics 1-7 (2014).
-
-
- [adapters paper]: http://www.nature.com/nmeth/journal/v9/n1/full/nmeth.1814.html
[IVA wiki page]: https://github.com/sanger-pathogens/iva/wiki
- [IVA wiki examples]: https://github.com/sanger-pathogens/iva/wiki/iva-examples
- [IVA latest release]: https://github.com/sanger-pathogens/iva/releases/latest
- [fastaq]: https://github.com/sanger-pathogens/Fastaq
- [networkx]: https://pypi.python.org/pypi/networkx/
- [pysam]: https://code.google.com/p/pysam/
- [python]: http://www.python.org/
- [gage code]: http://gage.cbcb.umd.edu/index.html
- [gage paper]: http://genome.cshlp.org/content/early/2012/01/12/gr.131383.111
- [kmc paper]: http://www.biomedcentral.com/1471-2105/14/160
- [kmc code]: http://sun.aei.polsl.pl/kmc/download.html
- [kraken code]: http://ccb.jhu.edu/software/kraken/
- [kraken paper]: http://genomebiology.com/2014/15/3/R46
- [mummer code]: http://mummer.sourceforge.net/
- [mummer paper]: http://genomebiology.com/2004/5/2/r12
- [r code]: http://www.r-project.org/
- [ratt code]: http://ratt.sourceforge.net/
- [ratt paper]: http://nar.oxfordjournals.org/content/39/9/e57
- [samtools code]: http://samtools.sourceforge.net/
- [samtools paper]: http://bioinformatics.oxfordjournals.org/content/25/16/2078.abstract
- [smalt]: http://www.sanger.ac.uk/resources/software/smalt/
- [trimmo code]: http://www.usadellab.org/cms/?page=trimmomatic
- [trimmo paper]: http://bioinformatics.oxfordjournals.org/content/early/2014/04/12/bioinformatics.btu170
- [ubuntu install]: https://github.com/sanger-pathogens/iva/wiki/Installation-notes-for-Ubuntu
+ [IVA website]: http://sanger-pathogens.github.io/iva/
diff --git a/iva/__init__.py b/iva/__init__.py
index 2e0fe7c..825a828 100644
--- a/iva/__init__.py
+++ b/iva/__init__.py
@@ -3,6 +3,7 @@ __all__ = [
'common',
'contig',
'contig_trim',
+ 'egg_extract',
'edge',
'external_progs',
'graph',
diff --git a/iva/assembly.py b/iva/assembly.py
index 5c4d4ed..3a3a9ca 100644
--- a/iva/assembly.py
+++ b/iva/assembly.py
@@ -3,10 +3,10 @@ import pysam
import tempfile
import shutil
from iva import contig, mapping, seed, mummer, graph, edge, common
-import fastaq
+import pyfastaq
class Assembly:
- def __init__(self, contigs_file=None, map_index_k=15, map_index_s=3, threads=1, max_insert=1000, map_minid=0.5, min_clip=3, ext_min_cov=5, ext_min_ratio=2, ext_bases=100, verbose=0, seed_min_cov=5, seed_min_ratio=10, seed_min_kmer_count=200, seed_max_kmer_count=1000000000, seed_start_length=None, seed_stop_length=500, seed_overlap_length=None, make_new_seeds=False, contig_iter_trim=10, seed_ext_max_bases=50, max_contigs=50, clean=True, strand_bias=0.2):
+ def __init__(self, contigs_file=None, map_index_k=15, map_index_s=3, threads=1, max_insert=800, map_minid=0.5, min_clip=3, ext_min_cov=5, ext_min_ratio=2, ext_bases=100, verbose=0, seed_min_cov=5, seed_min_ratio=10, seed_min_kmer_count=200, seed_max_kmer_count=1000000000, seed_start_length=None, seed_stop_length=500, seed_overlap_length=None, make_new_seeds=False, contig_iter_trim=10, seed_ext_max_bases=50, max_contigs=50, clean=True, strand_bias=0):
self.contigs = {}
self.contig_lengths = {}
self.map_index_k = map_index_k
@@ -39,7 +39,7 @@ class Assembly:
self.make_new_seeds = True
else:
contigs = {}
- fastaq.tasks.file_to_dict(contigs_file, contigs)
+ pyfastaq.tasks.file_to_dict(contigs_file, contigs)
for ctg in contigs:
self._add_contig(contigs[ctg])
@@ -62,7 +62,7 @@ class Assembly:
printed = 0
if min_length is None:
min_length = self.map_index_k + 1
- f = fastaq.utils.open_file_write(filename)
+ f = pyfastaq.utils.open_file_write(filename)
if biggest_first:
contig_names = self._contig_names_size_order(biggest_first=True)
elif order_by_orfs:
@@ -90,7 +90,7 @@ class Assembly:
if order_by_orfs and contig_revcomp[i]:
self.contigs[name].fa.revcomp()
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _get_contig_order_by_orfs(self, min_length=300):
@@ -137,8 +137,8 @@ class Assembly:
def _extend_contigs_with_bam(self, bam_in, out_prefix=None, output_all_useful_reads=False):
if out_prefix is not None:
- fa_out1 = fastaq.utils.open_file_write(out_prefix + '_1.fa')
- fa_out2 = fastaq.utils.open_file_write(out_prefix + '_2.fa')
+ fa_out1 = pyfastaq.utils.open_file_write(out_prefix + '_1.fa')
+ fa_out2 = pyfastaq.utils.open_file_write(out_prefix + '_2.fa')
keep_read_types = set([mapping.CAN_EXTEND_LEFT, mapping.CAN_EXTEND_RIGHT, mapping.KEEP])
if output_all_useful_reads:
keep_read_types.add(mapping.BOTH_UNMAPPED)
@@ -172,8 +172,8 @@ class Assembly:
previous_sam = None
if out_prefix is not None:
- fastaq.utils.close(fa_out1)
- fastaq.utils.close(fa_out2)
+ pyfastaq.utils.close(fa_out1)
+ pyfastaq.utils.close(fa_out2)
total_bases_added = 0
for ctg in self.contigs:
@@ -187,6 +187,7 @@ class Assembly:
def _trim_contig_for_strand_bias(self, bam, ctg_name):
+ assert os.path.exists(bam)
if ctg_name in self.contigs_trimmed_for_strand_bias:
return
ctg_length = len(self.contigs[ctg_name])
@@ -246,7 +247,7 @@ class Assembly:
start = good_intervals[i][0]
end = good_intervals[i][1]
if end - start + 1 >= 100:
- new_contigs.append(fastaq.sequences.Fasta(ctg_name + '.' + str(i+1), self.contigs[ctg_name].fa[start:end+1]))
+ new_contigs.append(pyfastaq.sequences.Fasta(ctg_name + '.' + str(i+1), self.contigs[ctg_name].fa[start:end+1]))
return new_contigs
@@ -259,6 +260,7 @@ class Assembly:
original_map_minid = self.map_minid
self.map_minid = 0.9
self._map_reads(reads_prefix + '_1.fa', reads_prefix + '_2.fa', tmp_prefix, sort_reads=True)
+ assert os.path.exists(sorted_bam)
self.map_minid = original_map_minid
new_contigs = []
contigs_to_remove = set()
@@ -270,7 +272,11 @@ class Assembly:
contigs_to_remove.add(ctg)
elif ctg not in self.contigs_trimmed_for_strand_bias:
self._trim_contig_for_strand_bias(sorted_bam, ctg)
- if tag_as_trimmed:
+ # contig could get completely trimmed so nothing left, in which
+ # case, we need to remove it
+ if len(self.contigs[ctg]) == 0:
+ contigs_to_remove.add(ctg)
+ elif tag_as_trimmed:
self.contigs_trimmed_for_strand_bias.add(ctg)
for ctg in contigs_to_remove:
@@ -421,6 +427,19 @@ class Assembly:
print('{:_^79}'.format(' Try making new seed '), flush=True)
new_seed_name = self.add_new_seed_contig(current_reads_prefix + '_1.fa', current_reads_prefix + '_2.fa')
+ if new_seed_name is None:
+ if self.verbose:
+ print('Couldn\'t make new seed and extend it. Stopping assembly.')
+ if len(self.contigs) == 0:
+ print('No contigs made.')
+ print('Read coverage may be too low, in which case try reducing --seed_min_kmer_cov, --ext_min_cov and --seed_ext_min_cov.')
+ print('Alternatively --max_insert could be incorrect, which is currently set to:', self.max_insert)
+ if reads_prefix != current_reads_prefix:
+ os.unlink(current_reads_prefix + '_1.fa')
+ os.unlink(current_reads_prefix + '_2.fa')
+ break
+
+
def _run_nucmer(self, contigs_to_use=None):
if contigs_to_use is None:
@@ -452,11 +471,11 @@ class Assembly:
def _coords_to_new_contig(self, coords_list):
- new_contig = fastaq.sequences.Fasta(coords_list[0][0], '')
+ new_contig = pyfastaq.sequences.Fasta(coords_list[0][0], '')
for name, coords, reverse in coords_list:
assert name in self.contigs
if reverse:
- seq = fastaq.sequences.Fasta('ni', self.contigs[name].fa.seq[coords.start:coords.end+1])
+ seq = pyfastaq.sequences.Fasta('ni', self.contigs[name].fa.seq[coords.start:coords.end+1])
seq.revcomp()
new_contig.seq += seq.seq
else:
@@ -500,13 +519,13 @@ class Assembly:
for hit in [hit for hit in hits if contig in [hit.qry_name, hit.ref_name] and hit.qry_name != hit.ref_name]:
start = min(hit.qry_start, hit.qry_end)
end = max(hit.qry_start, hit.qry_end)
- coords.append(fastaq.intervals.Interval(start, end))
+ coords.append(pyfastaq.intervals.Interval(start, end))
if len(coords) == 0:
return False
- fastaq.intervals.merge_overlapping_in_list(coords)
- total_bases_matched = fastaq.intervals.length_sum_from_list(coords)
+ pyfastaq.intervals.merge_overlapping_in_list(coords)
+ total_bases_matched = pyfastaq.intervals.length_sum_from_list(coords)
return min_percent <= 100.0 * total_bases_matched / len(self.contigs[contig])
@@ -596,6 +615,6 @@ class Assembly:
i += 1
new_name = 'seeded.' + str(i).zfill(5)
- self._add_contig(fastaq.sequences.Fasta(new_name, s.seq))
+ self._add_contig(pyfastaq.sequences.Fasta(new_name, s.seq))
return new_name
diff --git a/iva/common.py b/iva/common.py
index e8deeed..5a8c804 100644
--- a/iva/common.py
+++ b/iva/common.py
@@ -2,7 +2,7 @@ import argparse
import os
import sys
import subprocess
-version = '0.10.1'
+version = '1.0.0'
class abspathAction(argparse.Action):
def __call__(self, parser, namespace, value, option_string):
diff --git a/iva/contig_trim.py b/iva/contig_trim.py
index b2014bd..0ed33c9 100644
--- a/iva/contig_trim.py
+++ b/iva/contig_trim.py
@@ -2,7 +2,7 @@ import os
import sys
import shutil
import tempfile
-import fastaq
+import pyfastaq
import pysam
from iva import mapping, common
@@ -64,8 +64,8 @@ def _trim_ends(fasta_in, fasta_out, to_trim, min_length=100, min_dist_to_end=25,
sorted_bam = tmp_prefix + '.bam'
mapping.map_reads(to_trim, None, fasta_in, tmp_prefix, index_k=9, index_s=1, threads=1, minid=0.75, sort=True, extra_smalt_map_ops='-d -1 -m 10')
- f_out = fastaq.utils.open_file_write(fasta_out)
- seq_reader = fastaq.sequences.file_reader(fasta_in)
+ f_out = pyfastaq.utils.open_file_write(fasta_out)
+ seq_reader = pyfastaq.sequences.file_reader(fasta_in)
for seq in seq_reader:
coverage = mapping.get_bam_region_coverage(sorted_bam, seq.id, len(seq), both_strands=True)
good_coords = _coverage_to_trimmed_coords(coverage, min_dist_to_end=min_dist_to_end, window_length=window_length, min_pc=min_pc)
@@ -73,10 +73,10 @@ def _trim_ends(fasta_in, fasta_out, to_trim, min_length=100, min_dist_to_end=25,
continue
seq.seq = seq.seq[good_coords[0]:good_coords[1]+1]
- if len(seq) > 0:
+ if len(seq) >= min_length:
print(seq, file=f_out)
- fastaq.utils.close(f_out)
+ pyfastaq.utils.close(f_out)
shutil.rmtree(tmpdir)
diff --git a/iva/edge.py b/iva/edge.py
index 2ced725..270e42d 100644
--- a/iva/edge.py
+++ b/iva/edge.py
@@ -1,5 +1,5 @@
import copy
-from fastaq import intervals
+from pyfastaq import intervals
class Error (Exception): pass
diff --git a/iva/egg_extract.py b/iva/egg_extract.py
new file mode 100644
index 0000000..b0dc057
--- /dev/null
+++ b/iva/egg_extract.py
@@ -0,0 +1,77 @@
+import zipfile
+import tempfile
+import shutil
+import os
+
+class Error (Exception): pass
+
+
+class Extractor:
+ def __init__(self, egg):
+ self.egg = os.path.abspath(egg)
+ if not os.path.exists(self.egg):
+ raise Error('Egg not found: ' + self.egg)
+
+ if os.path.isdir(self.egg):
+ self.zip_file = None
+ elif os.path.isfile(self.egg):
+ try:
+ self.zip_file = zipfile.ZipFile(self.egg)
+ except:
+ raise Error('Error opening zip egg file: ' + self.egg)
+ self.zip_filenames = set(self.zip_file.namelist())
+ else:
+ raise Error('Egg not recognised as dir or file: ' + self.egg)
+
+
+ def _copy_file_unzipped(self, infile, outfile):
+ to_copy = os.path.join(self.egg, infile)
+ try:
+ shutil.copyfile(to_copy, outfile)
+ except:
+ raise Error('Error copying file:' + to_copy + ' -> ' + outfile)
+
+
+ def _copy_file_zipped(self, infile, outfile):
+ if infile not in self.zip_filenames:
+ raise Error('Error copying file. Not found. ' + infile)
+ tmpdir = tempfile.mkdtemp(prefix='tmp.copy_file_zipped.', dir=os.getcwd())
+ self.zip_file.extract(infile, path=tmpdir)
+ os.rename(os.path.join(tmpdir, infile), outfile)
+ shutil.rmtree(tmpdir)
+
+
+ def copy_file(self, infile, outfile):
+ if self.zip_file is None:
+ self._copy_file_unzipped(infile, outfile)
+ else:
+ self._copy_file_zipped(infile, outfile)
+
+
+ def _copy_dir_unzipped(self, indir, outdir):
+ to_copy = os.path.join(self.egg, indir)
+ try:
+ shutil.copytree(to_copy, outdir)
+ except:
+ raise Error('Error copying directory ', indir)
+
+
+ def _copy_dir_zipped(self, indir, outdir):
+ tmpdir = tempfile.mkdtemp(prefix='tmp.copy_dir_zipped', dir=os.getcwd())
+ if not indir.endswith('/'):
+ indir += '/'
+
+ files = [x for x in self.zip_filenames if x.startswith(indir)]
+
+ for f in files:
+ self.zip_file.extract(f, path=tmpdir)
+
+ os.rename(os.path.join(tmpdir, indir), outdir)
+ shutil.rmtree(tmpdir)
+
+
+ def copy_dir(self, indir, outdir):
+ if self.zip_file is None:
+ self._copy_dir_unzipped(indir, outdir)
+ else:
+ self._copy_dir_zipped(indir, outdir)
diff --git a/iva/external_progs.py b/iva/external_progs.py
index 475b784..a727111 100644
--- a/iva/external_progs.py
+++ b/iva/external_progs.py
@@ -2,7 +2,7 @@ import shutil
import subprocess
import re
import sys
-import fastaq
+import pyfastaq
from iva import common
class Error (Exception): pass
@@ -36,7 +36,6 @@ assembly_progs = [
qc_progs = [
- 'kraken',
'nucmer',
'R',
'smalt',
@@ -62,7 +61,7 @@ def get_version(prog, must_be_in_path=True):
assert prog in prog_to_version_cmd
if not is_in_path(prog):
if must_be_in_path:
- raise Error('Error getting version of', prog, '- not found in path.')
+ raise Error('Error getting version of ' + prog + ' - not found in path.')
else:
return 'UNKNOWN - not in path'
@@ -97,11 +96,11 @@ def write_prog_info(script, filename):
else:
raise Error('Script ' + script + ' not recognised')
- f = fastaq.utils.open_file_write(filename)
+ f = pyfastaq.utils.open_file_write(filename)
print(' '.join(sys.argv), file=f)
print('IVA version', common.version, file=f)
print('\n'.join(get_all_versions(required)), file=f)
if optional is not None:
print('\n'.join(get_all_versions(optional, must_be_in_path=False)), file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
diff --git a/iva/gage/getCorrectnessStats.sh b/iva/gage/getCorrectnessStats.sh
old mode 100644
new mode 100755
index c0ef3ac..f6365f6
--- a/iva/gage/getCorrectnessStats.sh
+++ b/iva/gage/getCorrectnessStats.sh
@@ -30,7 +30,7 @@ $MUMMER/nucmer --maxmatch -p $CONTIG_FILE -l 30 -banded -D 5 $REF $CONTIGS
$MUMMER/delta-filter -o 95 -i $NUCMER_MINID $CONTIG_FILE.delta > $CONTIG_FILE.fdelta
$MUMMER/dnadiff -d $CONTIG_FILE.fdelta
-$SCRIPT_PATH/getMummerStats.sh $CONTIGS $SCRIPT_PATH
+bash $SCRIPT_PATH/getMummerStats.sh $CONTIGS $SCRIPT_PATH
cat out.1coords |awk '{print NR" "$5}' > $CONTIG_FILE.matches.lens
echo ""
diff --git a/iva/gage/getMummerStats.sh b/iva/gage/getMummerStats.sh
index b0b0c2b..de73d67 100755
--- a/iva/gage/getMummerStats.sh
+++ b/iva/gage/getMummerStats.sh
@@ -2,19 +2,19 @@ FILENAME=$1
SCRIPT_PATH=$2
JAVA_PATH=$2:.
-grep TotalBases `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Genome Size: "$2}'
-grep TotalBases `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Assembly Size: "$3}'
-echo -n "Chaff bases: "; for i in `/bin/ls $FILENAME`; do java -cp $JAVA_PATH SizeFasta $i 2>/dev/null | awk 'BEGIN{s = 0;}{if($NF<200)s+=$NF}END{print s}'; done
-grep UnalignedBases `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Missing Reference Bases: "$2}'
-grep UnalignedBases `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Missing Assembly Bases: "$3}'
-grep UnalignedSeqs `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Missing Assembly Contigs: "$3}'
-echo -n "Duplicated Reference Bases: "; for i in `/bin/ls out.qdiff`; do grep DUP $i | cut -f5 | $SCRIPT_PATH/colsum.pl; done
-echo -n "Compressed Reference Bases: "; for i in `/bin/ls out.rdiff`; do grep DUP $i | cut -f5 | $SCRIPT_PATH/colsum.pl; done
-echo -n "Bad Trim: "; for i in `/bin/ls out.qdiff`; do grep BRK $i | cut -f5 | $SCRIPT_PATH/colsum.pl; done
-grep '1-to-1' -A3 `/bin/ls out.report` | grep AvgIdentity | sed 's|.*/||' | awk '{print "Avg Idy: "$2}'
-grep TotalSNPs `/bin/ls out.report` | sed 's|.*/||' | awk '{print "SNPs: "$2}'
-grep TotalIndels `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Indels < 5bp: "$2}'
+grep TotalBases out.report | sed 's|.*/||' | awk '{print "Genome Size: "$2}'
+grep TotalBases out.report | sed 's|.*/||' | awk '{print "Assembly Size: "$3}'
+echo Chaff bases: $(java -cp $JAVA_PATH SizeFasta $FILENAME 2>/dev/null | awk 'BEGIN{s = 0;}{if($NF<200)s+=$NF}END{print s}')
+grep UnalignedBases out.report | sed 's|.*/||' | awk '{print "Missing Reference Bases: "$2}'
+grep UnalignedBases out.report | sed 's|.*/||' | awk '{print "Missing Assembly Bases: "$3}'
+grep UnalignedSeqs out.report | sed 's|.*/||' | awk '{print "Missing Assembly Contigs: "$3}'
+echo Duplicated Reference Bases: $(grep DUP out.qdiff | cut -f5 | $SCRIPT_PATH/colsum.pl)
+echo Compressed Reference Bases: $(grep DUP out.rdiff | cut -f5 | $SCRIPT_PATH/colsum.pl)
+echo Bad Trim: $(grep BRK out.qdiff | cut -f5 | $SCRIPT_PATH/colsum.pl)
+grep '1-to-1' -A3 out.report | grep AvgIdentity | sed 's|.*/||' | awk '{print "Avg Idy: "$2}'
+grep TotalSNPs out.report | sed 's|.*/||' | awk '{print "SNPs: "$2}'
+grep TotalIndels out.report | sed 's|.*/||' | awk '{print "Indels < 5bp: "$2}'
cat out.rdiff | grep -c GAP | sed 's|.*/||;s|:| |;' | awk '{print "Indels >= 5: "$1}'
-grep Inversions `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Inversions: "$3}'
-grep Relocations `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Relocation: "$3}'
-grep Translocations `/bin/ls out.report` | sed 's|.*/||' | awk '{print "Translocation: "$3}'
+grep Inversions out.report | sed 's|.*/||' | awk '{print "Inversions: "$3}'
+grep Relocations out.report | sed 's|.*/||' | awk '{print "Relocation: "$3}'
+grep Translocations out.report | sed 's|.*/||' | awk '{print "Translocation: "$3}'
diff --git a/iva/graph.py b/iva/graph.py
index e687d53..a741c0c 100644
--- a/iva/graph.py
+++ b/iva/graph.py
@@ -1,5 +1,5 @@
import networkx
-from fastaq import intervals
+from pyfastaq import intervals
from iva import edge
class Error (Exception): pass
diff --git a/iva/kcount.py b/iva/kcount.py
index d79c540..1b8a957 100644
--- a/iva/kcount.py
+++ b/iva/kcount.py
@@ -2,7 +2,7 @@ import os
import sys
import tempfile
import shutil
-import fastaq
+import pyfastaq
import pysam
from iva import common, mapping
@@ -10,13 +10,13 @@ class Error (Exception): pass
def _head_fastaq(reads1, reads2, outfile, count):
'''Takes first N sequences from a pair of interleaved fasta/q files. Output is in FASTA format. Returns hash of read length distribution (key=read length, value=count)'''
- seq_reader1 = fastaq.sequences.file_reader(reads1)
+ seq_reader1 = pyfastaq.sequences.file_reader(reads1)
if reads2 is not None:
- seq_reader2 = fastaq.sequences.file_reader(reads2)
- f = fastaq.utils.open_file_write(outfile)
+ seq_reader2 = pyfastaq.sequences.file_reader(reads2)
+ f = pyfastaq.utils.open_file_write(outfile)
lengths = {}
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
i = 0
for seq1 in seq_reader1:
@@ -28,16 +28,16 @@ def _head_fastaq(reads1, reads2, outfile, count):
if seq is None:
continue
lengths[len(seq)] = lengths.get(len(seq), 0) + 1
- if type(seq) == fastaq.sequences.Fastq:
- print(fastaq.sequences.Fasta(seq.id, seq.seq), file=f)
+ if type(seq) == pyfastaq.sequences.Fastq:
+ print(pyfastaq.sequences.Fasta(seq.id, seq.seq), file=f)
else:
print(seq, file=f)
i += 1
if i >= count:
break
- fastaq.utils.close(f)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.utils.close(f)
+ pyfastaq.sequences.Fasta.line_length = original_line_length
return lengths
@@ -53,7 +53,7 @@ def _median(d):
def _run_kmc_with_script(script, reads, outfile, kmer, min_count, max_count, m_option, verbose, allow_fail):
- f = fastaq.utils.open_file_write(script)
+ f = pyfastaq.utils.open_file_write(script)
print('set -e', file=f)
kmc_command = ''.join([
'kmc -fa',
@@ -76,7 +76,7 @@ def _run_kmc_with_script(script, reads, outfile, kmer, min_count, max_count, m_o
print('kmc_dump', 'kmc_out', 'kmc_out.dump', file=f)
print('sort -k2nr', 'kmc_out.dump >', outfile, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
return common.syscall('bash ' + script, allow_fail=allow_fail)
@@ -120,7 +120,7 @@ def _kmc_to_kmer_counts(infile, number, kmers_to_ignore=None, contigs_to_check=N
if not using_refs:
if verbose > 2:
print('No existing kmers or contigs to check against. Using most common kmer for seed', flush=True)
- f = fastaq.utils.open_file_read(infile)
+ f = pyfastaq.utils.open_file_read(infile)
for line in f:
if len(counts) >= number:
break
@@ -131,7 +131,7 @@ def _kmc_to_kmer_counts(infile, number, kmers_to_ignore=None, contigs_to_check=N
raise Error('Error getting kmer info from this line:\n' + line)
counts[kmer] = count
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
else:
if verbose > 2:
print('Existing kmers or contigs to check against. Running mapping', flush=True)
@@ -149,7 +149,11 @@ def _kmc_to_kmer_counts(infile, number, kmers_to_ignore=None, contigs_to_check=N
except:
raise Error('Error getting count from sequence name in bam:\n' + sam.qname)
- counts[common.decode(sam.seq)] = count
+ nucleotides = common.decode(sam.seq)
+ if nucleotides not in kmers_to_ignore:
+ counts[nucleotides] = count
+ elif verbose >= 4:
+ print('Skipping seed already found:', nucleotides)
sam_reader.close()
shutil.rmtree(tmpdir)
@@ -160,7 +164,7 @@ def _write_ref_seqs_to_be_checked(outfile, kmers_to_ignore=None, contigs_to_chec
if (kmers_to_ignore is None or len(kmers_to_ignore) == 0) and (contigs_to_check is None or len(contigs_to_check) == 0):
return False
- f = fastaq.utils.open_file_write(outfile)
+ f = pyfastaq.utils.open_file_write(outfile)
i = 1
if kmers_to_ignore is not None:
@@ -170,20 +174,20 @@ def _write_ref_seqs_to_be_checked(outfile, kmers_to_ignore=None, contigs_to_chec
i += 1
if contigs_to_check is not None and len(contigs_to_check) > 0:
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
for name in contigs_to_check:
if len(contigs_to_check[name].fa) > 20:
print(contigs_to_check[name].fa, file=f)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.sequences.Fasta.line_length = original_line_length
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
return True
def _counts_file_to_fasta(infile, outfile):
- fin = fastaq.utils.open_file_read(infile)
- fout = fastaq.utils.open_file_write(outfile)
+ fin = pyfastaq.utils.open_file_read(infile)
+ fout = pyfastaq.utils.open_file_write(outfile)
i = 1
for line in fin:
try:
@@ -196,8 +200,8 @@ def _counts_file_to_fasta(infile, outfile):
print(kmer, file=fout)
i += 1
- fastaq.utils.close(fin)
- fastaq.utils.close(fout)
+ pyfastaq.utils.close(fin)
+ pyfastaq.utils.close(fout)
def get_most_common_kmers(reads1, reads2, kmer_length=None, head=100000, min_count=10, max_count=100000000, most_common=100, method='kmc', verbose=0, ignore_seqs=None, contigs_to_check=None, threads=1):
diff --git a/iva/kraken.py b/iva/kraken.py
index 63f69f1..5b95b8b 100644
--- a/iva/kraken.py
+++ b/iva/kraken.py
@@ -7,7 +7,7 @@ import time
import re
import urllib.request
import iva
-import fastaq
+import pyfastaq
class Error (Exception): pass
@@ -66,19 +66,19 @@ class Database:
def _get_parent_taxons(self, taxons):
- f = fastaq.utils.open_file_read(self.kraken_nodes_dmp)
+ f = pyfastaq.utils.open_file_read(self.kraken_nodes_dmp)
for line in f:
a = line.split()
if a[0] in taxons:
self.taxon_to_parent[a[0]] = a[2]
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _load_extra_ref_info(self):
if self.extra_refs_file is None:
return
- f = fastaq.utils.open_file_read(self.extra_refs_file)
+ f = pyfastaq.utils.open_file_read(self.extra_refs_file)
for line in f:
genbank_ids = line.rstrip().split()
new_gis = list(range(self.current_gi, self.current_gi + len(genbank_ids)))
@@ -90,7 +90,7 @@ class Database:
}
self.current_taxon_id += 1
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _download_from_genbank(self, outfile, filetype, gi, max_tries=5, delay=3):
@@ -115,9 +115,9 @@ class Database:
if not file_ok:
raise Error('Error downloading gi ' + gi + ' using:\n' + url + '\nI got this:\n' + file_contents)
- f = fastaq.utils.open_file_write(outfile)
+ f = pyfastaq.utils.open_file_write(outfile)
print(file_contents, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _download_extra_refs(self):
@@ -132,7 +132,7 @@ class Database:
def _genbank_to_taxon_and_gi(self, filename):
- f = fastaq.utils.open_file_read(filename)
+ f = pyfastaq.utils.open_file_read(filename)
taxon_id = None
gi = None
for line in f:
@@ -142,19 +142,25 @@ class Database:
gi = line.rstrip().split()[-1].split(':')[-1]
if None not in [taxon_id, gi]:
break
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
if None in [taxon_id, gi]:
raise Error('Error getting taxon or GI from ' + filename)
return taxon_id, gi
def _genbank2embl(self, infile, outfile):
- this_module_dir =os.path.dirname(inspect.getfile(inspect.currentframe()))
- genbank2embl = os.path.abspath(os.path.join(this_module_dir, 'ratt', 'genbank2embl.pl'))
+ tmpdir = tempfile.mkdtemp(prefix='tmp.genbank2embl.', dir=os.getcwd())
+ extractor = iva.egg_extract.Extractor(os.path.abspath(os.path.join(os.path.dirname(iva.__file__), os.pardir)))
+ genbank2embl_egg = os.path.join('iva', 'ratt', 'genbank2embl.pl')
+ genbank2embl = os.path.join(tmpdir, 'genbank2embl.pl')
+ extractor.copy_file(genbank2embl_egg, genbank2embl)
+ os.chmod(genbank2embl, stat.S_IRWXU)
+
# some genbank files have a 'CONTIG' line, which breaks bioperl's
# conversion genbank --> embl and makes genbank2embl.pl hang
iva.common.syscall('grep -v CONTIG ' + infile + ' > tmp.gbk; mv tmp.gbk ' + infile)
iva.common.syscall(genbank2embl + ' ' + infile + ' ' + outfile, verbose=self.verbose)
+ shutil.rmtree(tmpdir)
def _append_to_file(self, filename, line):
@@ -198,9 +204,9 @@ class Database:
def _replace_fasta_header(self, filename, header):
- fin = fastaq.utils.open_file_read(filename)
+ fin = pyfastaq.utils.open_file_read(filename)
tmp_out = filename + '.tmp.fa'
- fout = fastaq.utils.open_file_write(tmp_out)
+ fout = pyfastaq.utils.open_file_write(tmp_out)
for line in fin:
if line.startswith('>'):
@@ -208,8 +214,8 @@ class Database:
else:
print(line.rstrip(), file=fout)
- fastaq.utils.close(fin)
- fastaq.utils.close(fout)
+ pyfastaq.utils.close(fin)
+ pyfastaq.utils.close(fout)
os.rename(tmp_out, filename)
@@ -380,7 +386,7 @@ class Database:
def _get_most_common_species_dir(self, kraken_report):
embl_dirs = set([os.path.basename(x[0]) for x in os.walk(self.embl_root) if x[0] != self.embl_root])
- f = fastaq.utils.open_file_read(kraken_report)
+ f = pyfastaq.utils.open_file_read(kraken_report)
max_count = -1
most_common_dir = None
for line in f:
@@ -391,7 +397,7 @@ class Database:
most_common_dir = this_dir
max_count = int(a[2])
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
if most_common_dir is None:
return None
else:
diff --git a/iva/mapping.py b/iva/mapping.py
index f291531..681b5be 100644
--- a/iva/mapping.py
+++ b/iva/mapping.py
@@ -2,7 +2,7 @@ import os
import re
import subprocess
import collections
-import fastaq
+import pyfastaq
import pysam
from iva import common
@@ -172,7 +172,7 @@ def find_incorrect_ref_bases(bam, ref_fasta):
reverse_keys = set(['a', 'c', 'g', 't', 'n'])
ref_seqs = {}
bad_bases = {}
- fastaq.tasks.file_to_dict(ref_fasta, ref_seqs)
+ pyfastaq.tasks.file_to_dict(ref_fasta, ref_seqs)
mpileup_cmd = 'samtools mpileup ' + bam + ' | cut -f 1,2,5'
mpileup_out = common.decode(subprocess.Popen(mpileup_cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL).communicate()[0]).split('\n')[:-1]
@@ -214,7 +214,7 @@ def sam_to_fasta(s):
else:
raise Error('Read', name, 'must be first of second of pair according to flag. Cannot continue')
- seq = fastaq.sequences.Fasta(name, common.decode(s.seq))
+ seq = pyfastaq.sequences.Fasta(name, common.decode(s.seq))
if s.is_reverse:
seq.revcomp()
@@ -281,10 +281,10 @@ def get_ref_name(sam, samfile):
def bam_file_to_fasta_pair_files(bam, out1, out2, remove_proper_pairs=False, chromosome=None, start=None, end=None):
'''Makes fwd and rev FASTA files from a BAM file. Order same in input and output'''
sam_reader = pysam.Samfile(bam, "rb")
- f1 = fastaq.utils.open_file_write(out1)
- f2 = fastaq.utils.open_file_write(out2)
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
+ f1 = pyfastaq.utils.open_file_write(out1)
+ f2 = pyfastaq.utils.open_file_write(out2)
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
found_reads = {}
assert (chromosome is None) or (None not in [chromosome, start, end])
@@ -309,35 +309,35 @@ def bam_file_to_fasta_pair_files(bam, out1, out2, remove_proper_pairs=False, chr
else:
print(sam_to_fasta(s), file=f2)
sam_reader.close()
- fastaq.utils.close(f1)
- fastaq.utils.close(f2)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.utils.close(f1)
+ pyfastaq.utils.close(f2)
+ pyfastaq.sequences.Fasta.line_length = original_line_length
def bam_to_fasta(bam, out):
sam_reader = pysam.Samfile(bam, "rb")
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
- f = fastaq.utils.open_file_write(out)
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
+ f = pyfastaq.utils.open_file_write(out)
for s in sam_reader.fetch(until_eof=True):
print(sam_to_fasta(s), file=f)
sam_reader.close()
- fastaq.utils.close(f)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.utils.close(f)
+ pyfastaq.sequences.Fasta.line_length = original_line_length
def bam_file_to_region_fasta(bam, out, chromosome, start=None, end=None):
'''Extracts all reads from a region of a BAM file. Writes a fasta file. Treats reads as unpaired'''
sam_reader = pysam.Samfile(bam, "rb")
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
- f = fastaq.utils.open_file_write(out)
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
+ f = pyfastaq.utils.open_file_write(out)
for s in sam_reader.fetch(reference=chromosome, start=start, end=end):
print(sam_to_fasta(s), file=f)
sam_reader.close()
- fastaq.utils.close(f)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.utils.close(f)
+ pyfastaq.sequences.Fasta.line_length = original_line_length
def _total_ref_length_from_bam(bam):
@@ -365,9 +365,9 @@ def subsample_bam(bam, outfile, coverage=10):
step = max(1, ref_length / reads_wanted)
position = step
ref_id = None
- f = fastaq.utils.open_file_write(outfile)
- original_line_length = fastaq.sequences.Fasta.line_length
- fastaq.sequences.Fasta.line_length = 0
+ f = pyfastaq.utils.open_file_write(outfile)
+ original_line_length = pyfastaq.sequences.Fasta.line_length
+ pyfastaq.sequences.Fasta.line_length = 0
sam_reader = pysam.Samfile(bam, "rb")
for s in sam_reader.fetch():
if s.tid != ref_id:
@@ -376,5 +376,5 @@ def subsample_bam(bam, outfile, coverage=10):
if s.pos >= position:
print(sam_to_fasta(s), file=f)
position += step
- fastaq.utils.close(f)
- fastaq.sequences.Fasta.line_length = original_line_length
+ pyfastaq.utils.close(f)
+ pyfastaq.sequences.Fasta.line_length = original_line_length
diff --git a/iva/mummer.py b/iva/mummer.py
index 7e63e38..9f29fb8 100644
--- a/iva/mummer.py
+++ b/iva/mummer.py
@@ -1,7 +1,7 @@
import os
import tempfile
import shutil
-import fastaq
+import pyfastaq
from iva import edge, common
class Error (Exception): pass
@@ -19,18 +19,18 @@ def run_nucmer(query, ref, outfile, min_id=95, min_length=100, breaklen=200):
original_dir = os.getcwd()
os.chdir(tmpdir)
script = 'run_nucmer.sh'
- f = fastaq.utils.open_file_write(script)
+ f = pyfastaq.utils.open_file_write(script)
print('nucmer --maxmatch -p p -b', breaklen, ref, query, file=f)
print('delta-filter -i', min_id, '-l', min_length, 'p.delta > p.delta.filter', file=f)
print('show-coords -dTlro p.delta.filter >', outfile, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
common.syscall('bash ' + script)
os.chdir(original_dir)
shutil.rmtree(tmpdir)
def file_reader(fname):
- f = fastaq.utils.open_file_read(fname)
+ f = pyfastaq.utils.open_file_read(fname)
in_header = True
for line in f:
@@ -40,7 +40,7 @@ def file_reader(fname):
continue
yield NucmerHit(line)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
class NucmerHit:
@@ -84,11 +84,11 @@ class NucmerHit:
def qry_coords(self):
- return fastaq.intervals.Interval(min(self.qry_start, self.qry_end), max(self.qry_start, self.qry_end))
+ return pyfastaq.intervals.Interval(min(self.qry_start, self.qry_end), max(self.qry_start, self.qry_end))
def ref_coords(self):
- return fastaq.intervals.Interval(min(self.ref_start, self.ref_end), max(self.ref_start, self.ref_end))
+ return pyfastaq.intervals.Interval(min(self.ref_start, self.ref_end), max(self.ref_start, self.ref_end))
def on_same_strand(self):
diff --git a/iva/qc.py b/iva/qc.py
index 301b5e9..b97eb1b 100644
--- a/iva/qc.py
+++ b/iva/qc.py
@@ -1,12 +1,13 @@
import os
+import stat
import inspect
import tempfile
import copy
-import fastaq
+import pyfastaq
import shutil
import multiprocessing
import collections
-from iva import mapping, mummer, qc_external, kraken, common
+import iva
class Error (Exception): pass
@@ -53,6 +54,7 @@ class Qc:
raise Error('Error in IVA QC. File not found: "' + filename + '"')
self.outprefix = output_prefix
+ self.assembly_fasta = output_prefix + '.assembly.fasta'
self.assembly_bam = output_prefix + '.reads_mapped_to_assembly.bam'
self.ref_bam = output_prefix + '.reads_mapped_to_ref.bam'
self.reads_fwd = reads_fwd
@@ -77,13 +79,15 @@ class Qc:
self.reads_rev = self.outprefix + '.reads_2'
self.files_to_clean.append(self.reads_fwd)
self.files_to_clean.append(self.reads_rev)
- fastaq.tasks.deinterleave(reads_fr, self.reads_fwd, self.reads_rev)
+ pyfastaq.tasks.deinterleave(reads_fr, self.reads_fwd, self.reads_rev)
if not (None not in [self.reads_fwd, self.reads_rev] or reads_fr is not None):
raise Error('IVA QC needs reads_fr or both reads_fwd and reads_rev')
+ self._set_assembly_fasta_data(assembly_fasta)
+
def unzip_file(infile, outfile):
- common.syscall('gunzip -c ' + infile + ' > ' + outfile)
+ iva.common.syscall('gunzip -c ' + infile + ' > ' + outfile)
processes = []
@@ -112,7 +116,6 @@ class Qc:
processes[0].join()
self.min_ref_cov = min_ref_cov
- self._set_assembly_fasta_data(assembly_fasta)
self.threads = threads
self.contig_layout_plot_title = contig_layout_plot_title
self.nucmer_min_cds_hit_length = nucmer_min_cds_hit_length
@@ -164,15 +167,18 @@ class Qc:
def _set_assembly_fasta_data(self, fasta_filename):
- self.assembly_fasta = fasta_filename
- self.assembly_lengths = {}
- self.assembly_is_empty = os.path.getsize(self.assembly_fasta) == 0
+ self.assembly_is_empty = os.path.getsize(fasta_filename) == 0
if self.assembly_is_empty:
return
+
+ pyfastaq.tasks.to_fasta(fasta_filename, self.assembly_fasta, strip_after_first_whitespace=True)
+
self.assembly_fasta_fai = self.assembly_fasta + '.fai'
if not os.path.exists(self.assembly_fasta_fai):
- common.syscall('samtools faidx ' + self.assembly_fasta)
- fastaq.tasks.lengths_from_fai(self.assembly_fasta_fai, self.assembly_lengths)
+ iva.common.syscall('samtools faidx ' + self.assembly_fasta)
+
+ self.assembly_lengths = {}
+ pyfastaq.tasks.lengths_from_fai(self.assembly_fasta_fai, self.assembly_lengths)
def _set_ref_seq_data(self):
@@ -181,28 +187,34 @@ class Qc:
self.ref_fasta_fai = self.ref_fasta + '.fai'
self.ref_gff = self.outprefix + '.reference.gff'
tmpdir = tempfile.mkdtemp(prefix='tmp.set_ref_seq_data.', dir=os.getcwd())
- this_module_dir =os.path.dirname(inspect.getfile(inspect.currentframe()))
- embl2gff = os.path.abspath(os.path.join(this_module_dir, 'ratt', 'embl2gff.pl'))
+ extractor = iva.egg_extract.Extractor(os.path.abspath(os.path.join(os.path.dirname(iva.__file__), os.pardir)))
+ embl2gff_egg = os.path.join('iva', 'ratt', 'embl2gff.pl')
+ embl2gff = os.path.join(tmpdir, 'embl2gff.pl')
+ extractor.copy_file(embl2gff_egg, embl2gff)
+ os.chmod(embl2gff, stat.S_IRWXU)
for embl_file in os.listdir(self.embl_dir):
fa = os.path.join(tmpdir, embl_file + '.fa')
gff = os.path.join(tmpdir, embl_file + '.gff')
embl_full = os.path.join(self.embl_dir, embl_file)
- fastaq.tasks.to_fasta(embl_full, fa)
- common.syscall(' '.join([embl2gff, embl_full, '>', gff]))
+ pyfastaq.tasks.to_fasta(embl_full, fa)
+ iva.common.syscall(' '.join([embl2gff, embl_full, '>', gff]))
- common.syscall('cat ' + tmpdir + '/*.gff > ' + self.ref_gff)
- common.syscall('cat ' + tmpdir + '/*.fa > ' + self.ref_fasta)
+ iva.common.syscall('cat ' + tmpdir + '/*.gff > ' + self.ref_gff)
+ iva.common.syscall('cat ' + tmpdir + '/*.fa > ' + self.ref_fasta)
shutil.rmtree(tmpdir)
self._set_ref_fa_data()
def _set_ref_fa_data(self):
self.ref_fasta_fai = self.ref_fasta + '.fai'
- common.syscall('samtools faidx ' + self.ref_fasta)
+ iva.common.syscall('samtools faidx ' + self.ref_fasta)
self.ref_ids = self._ids_in_order_from_fai(self.ref_fasta_fai)
self.ref_lengths = {}
- fastaq.tasks.lengths_from_fai(self.ref_fasta_fai, self.ref_lengths)
+ pyfastaq.tasks.lengths_from_fai(self.ref_fasta_fai, self.ref_lengths)
+ number_of_sequences = pyfastaq.tasks.count_sequences(self.ref_fasta)
+ if number_of_sequences != len(self.ref_lengths):
+ raise Error('At least one reference name has been used more than once.\nNames must be unique. Cannot continue')
self.ref_length_offsets = {}
offset = 0
@@ -213,15 +225,15 @@ class Qc:
def _ids_in_order_from_fai(self, filename):
ids = []
- f = fastaq.utils.open_file_read(filename)
+ f = pyfastaq.utils.open_file_read(filename)
for line in f:
ids.append(line.rstrip().split('\t')[0])
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
return ids
def _get_ref_cds_from_gff(self):
- f = fastaq.utils.open_file_read(self.ref_gff)
+ f = pyfastaq.utils.open_file_read(self.ref_gff)
coords = {}
for line in f:
# no annotation allowed after any fasta sequence. See
@@ -239,9 +251,9 @@ class Qc:
strand = data[6]
if seqname not in coords:
coords[seqname] = []
- coords[seqname].append((fastaq.intervals.Interval(start, end), strand))
+ coords[seqname].append((pyfastaq.intervals.Interval(start, end), strand))
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
for seqname in coords:
coords[seqname].sort()
@@ -251,7 +263,7 @@ class Qc:
def _write_cds_seqs(self, cds_list, fa, f_out):
for coords, strand in cds_list:
seqname = fa.id + ':' + str(coords.start + 1) + '-' + str(coords.end + 1) + ':' + strand
- seq = fastaq.sequences.Fasta(seqname, fa.seq[coords.start:coords.end+1])
+ seq = pyfastaq.sequences.Fasta(seqname, fa.seq[coords.start:coords.end+1])
if strand == '-':
seq.revcomp()
print(seq, file=f_out)
@@ -268,24 +280,24 @@ class Qc:
def _gff_and_fasta_to_cds(self):
cds_coords = self._get_ref_cds_from_gff()
- f = fastaq.utils.open_file_write(self.ref_cds_fasta)
- seq_reader = fastaq.sequences.file_reader(self.ref_fasta)
+ f = pyfastaq.utils.open_file_write(self.ref_cds_fasta)
+ seq_reader = pyfastaq.sequences.file_reader(self.ref_fasta)
for seq in seq_reader:
if seq.id in cds_coords:
self._write_cds_seqs(cds_coords[seq.id], seq, f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _map_cds_to_assembly(self):
if not os.path.exists(self.ref_cds_fasta):
self._gff_and_fasta_to_cds()
if not self.assembly_is_empty:
- mummer.run_nucmer(self.ref_cds_fasta, self.assembly_fasta, self.cds_nucmer_coords_in_assembly, min_length=self.nucmer_min_cds_hit_length, min_id=self.nucmer_min_cds_hit_id)
+ iva.mummer.run_nucmer(self.ref_cds_fasta, self.assembly_fasta, self.cds_nucmer_coords_in_assembly, min_length=self.nucmer_min_cds_hit_length, min_id=self.nucmer_min_cds_hit_id)
def _mummer_coords_file_to_dict(self, filename):
hits = {}
- for hit in mummer.file_reader(filename):
+ for hit in iva.mummer.file_reader(filename):
if hit.qry_name not in hits:
hits[hit.qry_name] = []
hits[hit.qry_name].append(copy.copy(hit))
@@ -293,7 +305,7 @@ class Qc:
def _has_orf(self, fa, start, end, min_length):
- subseq = fastaq.sequences.Fasta('seq', fa[start:end+1])
+ subseq = pyfastaq.sequences.Fasta('seq', fa[start:end+1])
orfs = subseq.all_orfs(min_length=min_length)
return len(orfs) > 0
@@ -304,12 +316,12 @@ class Qc:
self._map_cds_to_assembly()
hits = self._mummer_coords_file_to_dict(self.cds_nucmer_coords_in_assembly)
contigs = {}
- fastaq.tasks.file_to_dict(self.assembly_fasta, contigs)
+ pyfastaq.tasks.file_to_dict(self.assembly_fasta, contigs)
for cds_name, hit_list in hits.items():
self.cds_assembly_stats[cds_name]['number_of_contig_hits'] = len(hit_list)
hit_coords = [x.qry_coords() for x in hit_list]
- fastaq.intervals.merge_overlapping_in_list(hit_coords)
- bases_assembled = fastaq.intervals.length_sum_from_list(hit_coords)
+ pyfastaq.intervals.merge_overlapping_in_list(hit_coords)
+ bases_assembled = pyfastaq.intervals.length_sum_from_list(hit_coords)
self.cds_assembly_stats[cds_name]['bases_assembled'] = bases_assembled
self.cds_assembly_stats[cds_name]['assembled'] = 0.9 <= bases_assembled / self.cds_assembly_stats[cds_name]['length_in_ref'] <= 1.1
@@ -323,26 +335,26 @@ class Qc:
def _get_contig_hits_to_reference(self):
- mummer.run_nucmer(self.assembly_fasta, self.ref_fasta, self.assembly_vs_ref_coords, min_id=self.nucmer_min_ctg_hit_id, min_length=self.nucmer_min_ctg_hit_length, breaklen=500)
+ iva.mummer.run_nucmer(self.assembly_fasta, self.ref_fasta, self.assembly_vs_ref_coords, min_id=self.nucmer_min_ctg_hit_id, min_length=self.nucmer_min_ctg_hit_length, breaklen=500)
self.assembly_vs_ref_mummer_hits = self._mummer_coords_file_to_dict(self.assembly_vs_ref_coords)
def _write_fasta_contigs_hit_ref(self):
contigs = {}
- fastaq.tasks.file_to_dict(self.assembly_fasta, contigs)
- f = fastaq.utils.open_file_write(self.fasta_assembly_contigs_hit_ref)
+ pyfastaq.tasks.file_to_dict(self.assembly_fasta, contigs)
+ f = pyfastaq.utils.open_file_write(self.fasta_assembly_contigs_hit_ref)
for qry_name in sorted(self.assembly_vs_ref_mummer_hits):
print(contigs[qry_name], file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _write_fasta_contigs_not_hit_ref(self):
- seq_reader = fastaq.sequences.file_reader(self.assembly_fasta)
- f = fastaq.utils.open_file_write(self.fasta_assembly_contigs_not_hit_ref)
+ seq_reader = pyfastaq.sequences.file_reader(self.assembly_fasta)
+ f = pyfastaq.utils.open_file_write(self.fasta_assembly_contigs_not_hit_ref)
for seq in seq_reader:
if seq.id not in self.assembly_vs_ref_mummer_hits:
print(seq, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _hash_nucmer_hits_by_ref(self, hits):
@@ -366,11 +378,11 @@ class Qc:
if name in refhits:
hits = refhits[name]
coords = [hit.ref_coords() for hit in hits]
- fastaq.intervals.merge_overlapping_in_list(coords)
+ pyfastaq.intervals.merge_overlapping_in_list(coords)
self.refseq_assembly_stats[name] = {
'hits': len(hits),
- 'bases_assembled': fastaq.intervals.length_sum_from_list(coords),
- 'assembled': 0.9 <= fastaq.intervals.length_sum_from_list(coords) / self.ref_lengths[name],
+ 'bases_assembled': pyfastaq.intervals.length_sum_from_list(coords),
+ 'assembled': 0.9 <= pyfastaq.intervals.length_sum_from_list(coords) / self.ref_lengths[name],
'assembled_ok': len(hits) == 1 and 0.9 <= hits[0].hit_length_ref / self.ref_lengths[name] <= 1.1 and 0.9 <= hits[0].qry_length / hits[0].hit_length_qry,
'longest_matching_contig': self._longest_matching_contig(refhits, name),
}
@@ -387,18 +399,18 @@ class Qc:
def _invert_list(self, coords, seq_length):
if len(coords) == 0:
- return[fastaq.intervals.Interval(0, seq_length - 1)]
+ return[pyfastaq.intervals.Interval(0, seq_length - 1)]
not_covered = []
if coords[0].start != 0:
- not_covered.append(fastaq.intervals.Interval(0, coords[0].start - 1))
+ not_covered.append(pyfastaq.intervals.Interval(0, coords[0].start - 1))
for i in range(len(coords) - 1):
- not_covered.append(fastaq.intervals.Interval(coords[i].end + 1, coords[i+1].start - 1))
+ not_covered.append(pyfastaq.intervals.Interval(coords[i].end + 1, coords[i+1].start - 1))
if coords[-1].end < seq_length - 1:
- not_covered.append(fastaq.intervals.Interval(coords[-1].end + 1, seq_length - 1))
+ not_covered.append(pyfastaq.intervals.Interval(coords[-1].end + 1, seq_length - 1))
return not_covered
@@ -407,7 +419,7 @@ class Qc:
if self.assembly_is_empty:
self.ref_pos_not_covered_by_contigs = {}
for seq, lngth in self.ref_lengths.items():
- self.ref_pos_not_covered_by_contigs[seq] = [fastaq.intervals.Interval(0, lngth - 1)]
+ self.ref_pos_not_covered_by_contigs[seq] = [pyfastaq.intervals.Interval(0, lngth - 1)]
return
for seq in self.assembly_vs_ref_mummer_hits:
@@ -417,7 +429,7 @@ class Qc:
self.ref_pos_covered_by_contigs[hit.ref_name].append(hit.ref_coords())
for coords_list in self.ref_pos_covered_by_contigs.values():
- fastaq.intervals.merge_overlapping_in_list(coords_list)
+ pyfastaq.intervals.merge_overlapping_in_list(coords_list)
for seq in self.ref_ids:
if seq in self.ref_pos_covered_by_contigs:
@@ -469,7 +481,7 @@ class Qc:
if self.assembly_is_empty:
self.incorrect_assembly_bases = {}
else:
- self.incorrect_assembly_bases = mapping.find_incorrect_ref_bases(self.assembly_bam, self.assembly_fasta)
+ self.incorrect_assembly_bases = iva.mapping.find_incorrect_ref_bases(self.assembly_bam, self.assembly_fasta)
def _contig_placement_in_reference(self, hits):
@@ -492,7 +504,7 @@ class Qc:
for qryname, coords_list in self.contig_placement.items():
for qry_coords, refname, ref_coords, same_strand, repetitive in coords_list:
offset = self.ref_length_offsets[refname]
- ref_coords = fastaq.intervals.Interval(ref_coords.start + offset, ref_coords.end + offset)
+ ref_coords = pyfastaq.intervals.Interval(ref_coords.start + offset, ref_coords.end + offset)
contig_positions.append((ref_coords, qry_coords, same_strand, repetitive, qryname))
contig_positions.sort()
@@ -507,17 +519,17 @@ class Qc:
def _map_reads_to_assembly(self):
if not self.assembly_is_empty:
- mapping.map_reads(self.reads_fwd, self.reads_rev, self.assembly_fasta, self.assembly_bam[:-4], sort=True, threads=self.threads, index_k=self.smalt_k, index_s=self.smalt_s, minid=self.smalt_id, extra_smalt_map_ops='-x')
+ iva.mapping.map_reads(self.reads_fwd, self.reads_rev, self.assembly_fasta, self.assembly_bam[:-4], sort=True, threads=self.threads, index_k=self.smalt_k, index_s=self.smalt_s, minid=self.smalt_id, extra_smalt_map_ops='-x')
os.unlink(self.assembly_bam[:-4] + '.unsorted.bam')
def _write_ref_info(self, filename):
assert self.embl_dir is not None
files = sorted(os.listdir(self.embl_dir))
- f = fastaq.utils.open_file_write(filename)
+ f = pyfastaq.utils.open_file_write(filename)
print('EMBL_directory', self.embl_dir, sep='\t', file=f)
print('Files', '\t'.join(files), sep='\t', file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _choose_reference_genome(self):
@@ -525,8 +537,8 @@ class Qc:
assert self.ref_db is not None
assert os.path.exists(self.assembly_bam)
tmp_reads = self.outprefix + '.tmp.subsample.reads.fastq'
- mapping.subsample_bam(self.assembly_bam, tmp_reads, coverage=40)
- db = kraken.Database(self.ref_db, threads=self.threads, preload=self.kraken_preload)
+ iva.mapping.subsample_bam(self.assembly_bam, tmp_reads, coverage=40)
+ db = iva.kraken.Database(self.ref_db, threads=self.threads, preload=self.kraken_preload)
self.embl_dir = db.choose_reference(tmp_reads, self.kraken_prefix)
os.unlink(tmp_reads)
if self.embl_dir is None:
@@ -541,12 +553,12 @@ class Qc:
if self.assembly_is_empty or not self.blast_for_act:
return
- qc_external.run_blastn_and_write_act_script(self.assembly_fasta, self.ref_fasta, self.blast_out, self.act_script)
+ iva.qc_external.run_blastn_and_write_act_script(self.assembly_fasta, self.ref_fasta, self.blast_out, self.act_script)
def _map_reads_to_reference(self):
assert os.path.exists(self.ref_fasta)
- mapping.map_reads(self.reads_fwd, self.reads_rev, self.ref_fasta, self.ref_bam[:-4], sort=True, threads=self.threads, index_k=self.smalt_k, index_s=self.smalt_s, minid=self.smalt_id, extra_smalt_map_ops='-x')
+ iva.mapping.map_reads(self.reads_fwd, self.reads_rev, self.ref_fasta, self.ref_bam[:-4], sort=True, threads=self.threads, index_k=self.smalt_k, index_s=self.smalt_s, minid=self.smalt_id, extra_smalt_map_ops='-x')
os.unlink(self.ref_bam[:-4] + '.unsorted.bam')
@@ -555,9 +567,9 @@ class Qc:
self._map_reads_to_reference()
for seq in self.ref_ids:
assert seq not in self.ref_coverage_fwd
- self.ref_coverage_fwd[seq] = mapping.get_bam_region_coverage(self.ref_bam, seq, self.ref_lengths[seq])
+ self.ref_coverage_fwd[seq] = iva.mapping.get_bam_region_coverage(self.ref_bam, seq, self.ref_lengths[seq])
assert seq not in self.ref_coverage_rev
- self.ref_coverage_rev[seq] = mapping.get_bam_region_coverage(self.ref_bam, seq, self.ref_lengths[seq], rev=True)
+ self.ref_coverage_rev[seq] = iva.mapping.get_bam_region_coverage(self.ref_bam, seq, self.ref_lengths[seq], rev=True)
def _coverage_list_to_low_cov_intervals(self, l):
@@ -572,11 +584,11 @@ class Qc:
start = i
else:
if start is not None:
- bad_intervals.append(fastaq.intervals.Interval(start, i-1))
+ bad_intervals.append(pyfastaq.intervals.Interval(start, i-1))
start = None
if cov_bad and start is not None:
- bad_intervals.append(fastaq.intervals.Interval(start, i))
+ bad_intervals.append(pyfastaq.intervals.Interval(start, i))
return bad_intervals
@@ -588,19 +600,19 @@ class Qc:
self.low_cov_ref_regions_rev[seq] = self._coverage_list_to_low_cov_intervals(self.ref_coverage_rev[seq])
fwd_ok = self._invert_list(self.low_cov_ref_regions_fwd[seq], self.ref_lengths[seq])
rev_ok = self._invert_list(self.low_cov_ref_regions_rev[seq], self.ref_lengths[seq])
- self.ok_cov_ref_regions[seq] = fastaq.intervals.intersection(fwd_ok, rev_ok)
- self.low_cov_ref_regions[seq] = fastaq.intervals.intersection(self.low_cov_ref_regions_fwd[seq], self.low_cov_ref_regions_rev[seq])
+ self.ok_cov_ref_regions[seq] = pyfastaq.intervals.intersection(fwd_ok, rev_ok)
+ self.low_cov_ref_regions[seq] = pyfastaq.intervals.intersection(self.low_cov_ref_regions_fwd[seq], self.low_cov_ref_regions_rev[seq])
def _write_ref_coverage_to_files_for_R(self, outprefix):
assert len(self.ref_coverage_fwd)
assert len(self.ref_coverage_rev)
def list_to_file(d, fname):
- f = fastaq.utils.open_file_write(fname)
+ f = pyfastaq.utils.open_file_write(fname)
for refname in self.ref_ids:
for x in d[refname]:
print(x, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
list_to_file(self.ref_coverage_fwd, outprefix + '.fwd')
list_to_file(self.ref_coverage_rev, outprefix + '.rev')
@@ -624,22 +636,22 @@ class Qc:
l = self.ref_pos_covered_by_contigs[name]
else:
l = []
- self.should_have_assembled[name] = fastaq.intervals.intersection(self._invert_list(l, self.ref_lengths[name]), self.ok_cov_ref_regions[name])
+ self.should_have_assembled[name] = pyfastaq.intervals.intersection(self._invert_list(l, self.ref_lengths[name]), self.ok_cov_ref_regions[name])
def _calculate_gage_stats(self):
if self.assembly_is_empty:
- self.gage_stats = qc_external.dummy_gage_stats()
+ self.gage_stats = iva.qc_external.dummy_gage_stats()
self.gage_stats['Missing Reference Bases'] = sum(self.ref_lengths.values())
else:
- self.gage_stats = qc_external.run_gage(self.ref_fasta, self.assembly_fasta, self.gage_outdir, nucmer_minid=self.gage_nucmer_minid, clean=self.clean)
+ self.gage_stats = iva.qc_external.run_gage(self.ref_fasta, self.assembly_fasta, self.gage_outdir, nucmer_minid=self.gage_nucmer_minid, clean=self.clean)
def _calculate_ratt_stats(self):
if self.assembly_is_empty:
- self.ratt_stats = qc_external.dummy_ratt_stats()
+ self.ratt_stats = iva.qc_external.dummy_ratt_stats()
else:
- self.ratt_stats = qc_external.run_ratt(self.embl_dir, self.assembly_fasta, self.ratt_outdir, config_file=self.ratt_config, clean=self.clean)
+ self.ratt_stats = iva.qc_external.run_ratt(self.embl_dir, self.assembly_fasta, self.ratt_outdir, config_file=self.ratt_config, clean=self.clean)
def _do_calculations(self):
@@ -667,8 +679,8 @@ class Qc:
total_bases = 0
for name in self.assembly_vs_ref_mummer_hits:
coords = [x.qry_coords() for x in self.assembly_vs_ref_mummer_hits[name]]
- fastaq.intervals.merge_overlapping_in_list(coords)
- total_bases += fastaq.intervals.length_sum_from_list(coords)
+ pyfastaq.intervals.merge_overlapping_in_list(coords)
+ total_bases += pyfastaq.intervals.length_sum_from_list(coords)
return total_bases, len(self.assembly_vs_ref_mummer_hits)
@@ -677,10 +689,10 @@ class Qc:
self.stats['ref_EMBL_files'] = ' '.join(self.embl_files)
self.stats['ref_bases'] = sum(self.ref_lengths.values())
self.stats['ref_sequences'] = len(self.ref_lengths)
- self.stats['ref_bases_assembled'] = sum([fastaq.intervals.length_sum_from_list(l) for l in list(self.ref_pos_covered_by_contigs.values())])
+ self.stats['ref_bases_assembled'] = sum([pyfastaq.intervals.length_sum_from_list(l) for l in list(self.ref_pos_covered_by_contigs.values())])
self.stats['ref_sequences_assembled'] = len([1 for x in self.refseq_assembly_stats.values() if x['assembled']])
self.stats['ref_sequences_assembled_ok'] = len([1 for x in self.refseq_assembly_stats.values() if x['assembled_ok']])
- self.stats['ref_bases_assembler_missed'] = sum([fastaq.intervals.length_sum_from_list(l) for l in list(self.should_have_assembled.values())])
+ self.stats['ref_bases_assembler_missed'] = sum([pyfastaq.intervals.length_sum_from_list(l) for l in list(self.should_have_assembled.values())])
self.stats['assembly_bases'] = sum(self.assembly_lengths.values())
self.stats['assembly_contigs'] = len(self.assembly_lengths)
self.stats['assembly_bases_in_ref'], self.stats['assembly_contigs_hit_ref'] = self._contigs_and_bases_that_hit_ref()
@@ -692,27 +704,27 @@ class Qc:
def _write_stats_txt(self):
- f = fastaq.utils.open_file_write(self.stats_file_txt)
+ f = pyfastaq.utils.open_file_write(self.stats_file_txt)
for stat in self.stats_keys:
print(stat, self.stats[stat], sep='\t', file=f)
- for stat in qc_external.gage_stats:
+ for stat in iva.qc_external.gage_stats:
print('gage_' + stat.replace(' ', '_'), self.gage_stats[stat], sep='\t', file=f)
- for stat in qc_external.ratt_stats:
+ for stat in iva.qc_external.ratt_stats:
print('ratt_' + stat.replace(' ', '_'), self.ratt_stats[stat], sep='\t', file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _write_stats_tsv(self):
- f = fastaq.utils.open_file_write(self.stats_file_tsv)
+ f = pyfastaq.utils.open_file_write(self.stats_file_tsv)
print('\t'.join([x.replace(' ', '_') for x in self.stats_keys]), end='\t', file=f)
- print('\t'.join(['gage_' + x.replace(' ', '_') for x in qc_external.gage_stats]), end='\t', file=f)
- print('\t'.join(['ratt_' + x.replace(' ', '_') for x in qc_external.ratt_stats]), file=f)
+ print('\t'.join(['gage_' + x.replace(' ', '_') for x in iva.qc_external.gage_stats]), end='\t', file=f)
+ print('\t'.join(['ratt_' + x.replace(' ', '_') for x in iva.qc_external.ratt_stats]), file=f)
print('\t'.join([str(self.stats[x]) for x in self.stats_keys]),
- '\t'.join([str(self.gage_stats[x]) for x in qc_external.gage_stats]),
- '\t'.join([str(self.ratt_stats[x]) for x in qc_external.ratt_stats]),
+ '\t'.join([str(self.gage_stats[x]) for x in iva.qc_external.gage_stats]),
+ '\t'.join([str(self.ratt_stats[x]) for x in iva.qc_external.ratt_stats]),
sep='\t', file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
def _write_stats_files(self):
@@ -726,7 +738,7 @@ class Qc:
number_of_contigs = len(contig_names)
ref_length = sum(self.ref_lengths.values())
r_script = outprefix + '.R'
- f = fastaq.utils.open_file_write(r_script)
+ f = pyfastaq.utils.open_file_write(r_script)
contig_height = 0.8
vertical_lines = ''
if len(self.ref_ids) > 0:
@@ -800,8 +812,8 @@ class Qc:
print(vertical_lines, file=f)
print('dev.off()', file=f)
- fastaq.utils.close(f)
- common.syscall('R CMD BATCH ' + r_script)
+ pyfastaq.utils.close(f)
+ iva.common.syscall('R CMD BATCH ' + r_script)
def _clean(self):
diff --git a/iva/qc_external.py b/iva/qc_external.py
index 57d7e87..b04bbb1 100644
--- a/iva/qc_external.py
+++ b/iva/qc_external.py
@@ -4,8 +4,8 @@ import shutil
import os
import sys
import inspect
-import fastaq
-from iva import common
+import pyfastaq
+import iva
class Error (Exception): pass
@@ -40,8 +40,6 @@ ratt_stats = [
'gene_models_not_transferred',
]
-default_ratt_config = os.path.join(os.path.dirname(inspect.getfile(inspect.currentframe())), 'ratt', 'ratt.config')
-
def dummy_gage_stats():
return {x:'NA' for x in gage_stats}
@@ -52,8 +50,6 @@ def dummy_ratt_stats():
def run_gage(reference, scaffolds, outdir, nucmer_minid=80, clean=True):
- this_module_dir = os.path.dirname(inspect.getfile(inspect.currentframe()))
- gage_dir = os.path.join(this_module_dir, 'gage')
reference = os.path.abspath(reference)
scaffolds = os.path.abspath(scaffolds)
ref = 'ref.fa'
@@ -64,27 +60,35 @@ def run_gage(reference, scaffolds, outdir, nucmer_minid=80, clean=True):
cwd = os.getcwd()
os.mkdir(outdir)
os.chdir(outdir)
+ extractor = iva.egg_extract.Extractor(os.path.abspath(os.path.join(os.path.dirname(iva.__file__), os.pardir)))
+ gage_code_indir = os.path.join('iva', 'gage')
+ gage_code_outdir = 'gage_code'
+ extractor.copy_dir(gage_code_indir, gage_code_outdir)
os.symlink(reference, ref)
os.symlink(scaffolds, scaffs)
- fastaq.tasks.scaffolds_to_contigs(scaffs, contigs, number_contigs=True)
- f = fastaq.utils.open_file_write(gage_script)
+ pyfastaq.tasks.scaffolds_to_contigs(scaffs, contigs, number_contigs=True)
+ f = pyfastaq.utils.open_file_write(gage_script)
print(' '.join([
'sh',
- os.path.join(gage_dir, 'getCorrectnessStats.sh'),
+ os.path.join(gage_code_outdir, 'getCorrectnessStats.sh'),
ref,
contigs,
scaffs,
str(nucmer_minid),
'>', gage_out
]), file=f)
- fastaq.utils.close(f)
- common.syscall('bash ' + gage_script)
+ pyfastaq.utils.close(f)
+ iva.common.syscall('bash ' + gage_script, allow_fail=True)
+ if not os.path.exists(gage_out):
+ raise Error('Error running GAGE\nbash ' + gage_script)
stats = dummy_gage_stats()
wanted_stats = set(gage_stats)
- f = fastaq.utils.open_file_read(gage_out)
+ f = pyfastaq.utils.open_file_read(gage_out)
+ got_all_stats = False
for line in f:
if line.startswith('Corrected Contig Stats'):
+ got_all_stats = True
break
elif ':' in line:
a = line.rstrip().split(': ')
@@ -96,7 +100,10 @@ def run_gage(reference, scaffolds, outdir, nucmer_minid=80, clean=True):
stats[a[0]] = int(stat)
else:
stats[a[0]] = float(stat)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
+
+ if not got_all_stats:
+ raise Error('Error running GAGE\nbash ' + gage_script)
if clean:
to_clean = [
@@ -131,12 +138,6 @@ def run_gage(reference, scaffolds, outdir, nucmer_minid=80, clean=True):
def run_ratt(embl_dir, assembly, outdir, config_file=None, transfer='Species', clean=True):
embl_dir = os.path.abspath(embl_dir)
assembly = os.path.abspath(assembly)
- this_module_dir =os.path.dirname(inspect.getfile(inspect.currentframe()))
- ratt_dir = os.path.join(this_module_dir, 'ratt')
- if config_file is None:
- ratt_config = default_ratt_config
- else:
- ratt_config = os.path.abspath(config_file)
cwd = os.getcwd()
try:
@@ -145,17 +146,28 @@ def run_ratt(embl_dir, assembly, outdir, config_file=None, transfer='Species', c
except:
raise Error('Error mkdir ' + outdir)
+ extractor = iva.egg_extract.Extractor(os.path.abspath(os.path.join(os.path.dirname(iva.__file__), os.pardir)))
+ ratt_code_indir = os.path.join('iva', 'ratt')
+ ratt_code_outdir = 'ratt_code'
+ extractor.copy_dir(ratt_code_indir, ratt_code_outdir)
+
+ if config_file is None:
+ ratt_config = os.path.join(ratt_code_outdir, 'ratt.config')
+ else:
+ ratt_config = os.path.abspath(config_file)
+
script = 'run.sh'
script_out = 'run.sh.out'
ratt_outprefix = 'out'
- f = fastaq.utils.open_file_write(script)
- print('export RATT_HOME=', ratt_dir, sep='', file=f)
+ f = pyfastaq.utils.open_file_write(script)
+ print('export RATT_HOME=', ratt_code_outdir, sep='', file=f)
print('export RATT_CONFIG=', ratt_config, sep='', file=f)
+ print('for x in $RATT_HOME/*.{sh,pl}; do chmod 755 $x; done', file=f)
print('$RATT_HOME/start.ratt.sh', embl_dir, assembly, ratt_outprefix, transfer, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
cmd = 'bash ' + script + ' > ' + script_out
# sometimes ratt returns nonzero code, but is OK, so ignore it
- common.syscall(cmd, allow_fail=True)
+ iva.common.syscall(cmd, allow_fail=True)
stats = {}
@@ -173,13 +185,13 @@ def run_ratt(embl_dir, assembly, outdir, config_file=None, transfer='Species', c
'Gene models not transferred.': 'gene_models_not_transferred',
}
- f = fastaq.utils.open_file_read(script_out)
+ f = pyfastaq.utils.open_file_read(script_out)
for line in f:
if '\t' in line:
a = line.rstrip().split('\t')
if len(a) == 2 and a[0].isdigit() and a[1] in matches:
stats[matches[a[1]]] = int(a[0])
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
if clean:
for d in ['Query', 'Reference', 'Sequences']:
@@ -188,7 +200,7 @@ def run_ratt(embl_dir, assembly, outdir, config_file=None, transfer='Species', c
except:
pass
- common.syscall('rm query.* Reference.* nucmer.* out.*')
+ iva.common.syscall('rm query.* Reference.* nucmer.* out.*')
os.chdir(cwd)
return stats
@@ -198,9 +210,9 @@ def run_blastn_and_write_act_script(assembly, reference, blast_out, script_out):
tmpdir = tempfile.mkdtemp(prefix='tmp.blastn.', dir=os.getcwd())
assembly_union = os.path.join(tmpdir, 'assembly.union.fa')
reference_union = os.path.join(tmpdir, 'reference.union.fa')
- fastaq.tasks.to_fasta_union(assembly, assembly_union, seqname='assembly_union')
- fastaq.tasks.to_fasta_union(reference, reference_union, seqname='reference_union')
- common.syscall('makeblastdb -dbtype nucl -in ' + reference_union)
+ pyfastaq.tasks.to_fasta_union(assembly, assembly_union, seqname='assembly_union')
+ pyfastaq.tasks.to_fasta_union(reference, reference_union, seqname='reference_union')
+ iva.common.syscall('makeblastdb -dbtype nucl -in ' + reference_union)
cmd = ' '.join([
'blastn',
'-task blastn',
@@ -211,11 +223,11 @@ def run_blastn_and_write_act_script(assembly, reference, blast_out, script_out):
'-evalue 0.01',
'-dust no',
])
- common.syscall(cmd)
+ iva.common.syscall(cmd)
- f = fastaq.utils.open_file_write(script_out)
+ f = pyfastaq.utils.open_file_write(script_out)
print('#!/usr/bin/env bash', file=f)
print('act', reference, blast_out, assembly, file=f)
- fastaq.utils.close(f)
- common.syscall('chmod 755 ' + script_out)
+ pyfastaq.utils.close(f)
+ iva.common.syscall('chmod 755 ' + script_out)
shutil.rmtree(tmpdir)
diff --git a/iva/seed.py b/iva/seed.py
index 1ae37e9..883b689 100644
--- a/iva/seed.py
+++ b/iva/seed.py
@@ -2,7 +2,7 @@ import copy
import os
import shutil
import tempfile
-import fastaq
+import pyfastaq
from iva import kcount, kmers, mapping
class Error (Exception): pass
@@ -77,7 +77,7 @@ class Seed:
def _extensions_from_reads_file(self, reads_file):
- seq_reader = fastaq.sequences.file_reader(reads_file)
+ seq_reader = pyfastaq.sequences.file_reader(reads_file)
left_seqs = []
right_seqs = []
for seq in seq_reader:
@@ -124,9 +124,9 @@ class Seed:
def write_fasta(self, filename, name):
- f = fastaq.utils.open_file_write(filename)
+ f = pyfastaq.utils.open_file_write(filename)
print('>' + name, file=f)
print(self.seq, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
diff --git a/iva/seed_processor.py b/iva/seed_processor.py
index e938831..316a785 100644
--- a/iva/seed_processor.py
+++ b/iva/seed_processor.py
@@ -4,7 +4,7 @@ import sys
import shutil
import multiprocessing
from iva import mapping, seed
-import fastaq
+import pyfastaq
class Error (Exception): pass
@@ -30,7 +30,7 @@ class SeedProcessor:
self.seed_max_count = seed_max_count
self.original_seeds = {}
- fastaq.tasks.file_to_dict(seeds_fasta, self.original_seeds)
+ pyfastaq.tasks.file_to_dict(seeds_fasta, self.original_seeds)
self.processed_seeds = {}
self.tmpdir = None
self.bam_file = None
@@ -71,9 +71,9 @@ class SeedProcessor:
print('Making new seed for', seed_name, ' ... extending most common kmer')
new_seed.extend(self.reads1, self.reads2, self.seed_stop_length)
- f = fastaq.utils.open_file_write(tmp_prefix + '.' + seed_name + '.fa')
- print(fastaq.sequences.Fasta('seed.' + seed_name, new_seed.seq[10:-10]), file=f)
- fastaq.utils.close(f)
+ f = pyfastaq.utils.open_file_write(tmp_prefix + '.' + seed_name + '.fa')
+ print(pyfastaq.sequences.Fasta('seed.' + seed_name, new_seed.seq[10:-10]), file=f)
+ pyfastaq.utils.close(f)
if self.verbose:
print('Making new seed for', seed_name, ' ... finished')
@@ -106,13 +106,13 @@ class SeedProcessor:
for seed_name in self.original_seeds:
fname = tmp_prefix + '.' + seed_name + '.fa'
if os.path.exists(fname):
- fastaq.tasks.file_to_dict(fname, new_seeds)
+ pyfastaq.tasks.file_to_dict(fname, new_seeds)
if len(new_seeds) == 0:
raise Error('Error! did not make any new seeds. Cannot continue')
- f = fastaq.utils.open_file_write(self.outfile)
+ f = pyfastaq.utils.open_file_write(self.outfile)
for seq in new_seeds.values():
print(seq, file=f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
shutil.rmtree(self.tmpdir)
diff --git a/iva/tests/assembly_test.py b/iva/tests/assembly_test.py
index 2c528f9..21d5b33 100644
--- a/iva/tests/assembly_test.py
+++ b/iva/tests/assembly_test.py
@@ -159,7 +159,7 @@ class TestAssembly(unittest.TestCase):
fwd_cov = [0, 1, 1, 2, 5, 10, 100, 10, 10, 6, 0, 10, 10, 10, 5, 10]
rev_cov = [0, 5, 5, 5, 5, 20, 10, 10, 10, 100, 9, 10, 10, 10, 5, 0]
expected = [(3,5), (7,8), (11,14)]
- a = assembly.Assembly()
+ a = assembly.Assembly(strand_bias=0.2)
got = a._good_intervals_from_strand_coverage(fwd_cov, rev_cov)
self.assertListEqual(expected, got)
diff --git a/iva/tests/contig_test.py b/iva/tests/contig_test.py
index d40bf2d..c3a136a 100644
--- a/iva/tests/contig_test.py
+++ b/iva/tests/contig_test.py
@@ -3,7 +3,7 @@
import unittest
import os
from iva import contig
-from fastaq import sequences
+from pyfastaq import sequences
class TestContig(unittest.TestCase):
def test_init(self):
diff --git a/iva/tests/data/contig_trim_test_contigs.fa b/iva/tests/data/contig_trim_test_contigs.fa
index 3de2f28..af1e8e0 100644
--- a/iva/tests/data/contig_trim_test_contigs.fa
+++ b/iva/tests/data/contig_trim_test_contigs.fa
@@ -13,3 +13,5 @@ GCCCAGTTCCGCATTGAATAAGTACTGTTCTAGTACCATG
>5
CAAAAACTATAGAAGGGGCCTCCGCCGATCTCCTGGAGTGATAGTTCTCGATCACCGCTA
GAAAGTGGAACAGAGAAATGGAACCCTTATATGTATGACC
+>6
+ACGCTAGCGAGCGAGCTAA
diff --git a/iva/tests/data/egg_extract_test_dir.zip b/iva/tests/data/egg_extract_test_dir.zip
new file mode 100644
index 0000000..f5f03b0
Binary files /dev/null and b/iva/tests/data/egg_extract_test_dir.zip differ
diff --git a/iva/tests/data/egg_extract_test_dir/.zip b/iva/tests/data/egg_extract_test_dir/.zip
new file mode 100644
index 0000000..25d0cb0
Binary files /dev/null and b/iva/tests/data/egg_extract_test_dir/.zip differ
diff --git a/iva/tests/data/egg_extract_test_dir/file1 b/iva/tests/data/egg_extract_test_dir/file1
new file mode 100644
index 0000000..e69de29
diff --git a/iva/tests/data/egg_extract_test_dir/iva/gage/gage1 b/iva/tests/data/egg_extract_test_dir/iva/gage/gage1
new file mode 100644
index 0000000..658afdc
--- /dev/null
+++ b/iva/tests/data/egg_extract_test_dir/iva/gage/gage1
@@ -0,0 +1,2 @@
+gage1 line1
+gage1 line2
diff --git a/iva/tests/data/egg_extract_test_dir/iva/gage/gage2 b/iva/tests/data/egg_extract_test_dir/iva/gage/gage2
new file mode 100644
index 0000000..e69de29
diff --git a/iva/tests/data/egg_extract_test_dir/iva/ratt/ratt1 b/iva/tests/data/egg_extract_test_dir/iva/ratt/ratt1
new file mode 100644
index 0000000..e69de29
diff --git a/iva/tests/data/egg_extract_test_dir/iva/ratt/ratt2 b/iva/tests/data/egg_extract_test_dir/iva/ratt/ratt2
new file mode 100644
index 0000000..e69de29
diff --git a/iva/tests/data/qc_external_test_run_gage.out b/iva/tests/data/qc_external_test_run_gage.out
new file mode 100644
index 0000000..98d05f8
--- /dev/null
+++ b/iva/tests/data/qc_external_test_run_gage.out
@@ -0,0 +1,55 @@
+MUMMER: /usr/local/bin
+Contig Stats
+Total units: 3
+Reference: 3000
+BasesInFasta: 2998
+Min: 840
+Max: 1260
+N10: 1260 COUNT: 1
+N25: 1260 COUNT: 1
+N50: 898 COUNT: 2
+N75: 840 COUNT: 3
+E-size:1033.2
+
+Genome Size: 3000
+Assembly Size: 2998
+Chaff bases: 0
+Missing Reference Bases: 60(2.00%)
+Missing Assembly Bases: 54(1.80%)
+Missing Assembly Contigs: 0(0.00%)
+Duplicated Reference Bases: 0
+Compressed Reference Bases: 0
+Bad Trim: 0
+Avg Idy: 99.86
+SNPs: 2
+Indels < 5bp: 2
+Indels >= 5: 1
+Inversions: 0
+Relocation: 0
+Translocation: 0
+
+Corrected Contig Stats
+Total units: 4
+Reference: 3000
+BasesInFasta: 2946
+Min: 601
+Max: 900
+N10: 900 COUNT: 1
+N25: 900 COUNT: 1
+N50: 840 COUNT: 2
+N75: 605 COUNT: 3
+E-size:747.61
+
+Scaffold Stats
+Total units: 2
+Reference: 3000
+BasesInFasta: 3058
+Min: 1260
+Max: 1798
+N10: 1798 COUNT: 1
+N25: 1798 COUNT: 1
+N50: 1798 COUNT: 1
+N75: 1260 COUNT: 2
+E-size:1606.8
+
+Corrected Scaffold Stats
diff --git a/iva/tests/data/qc_external_test_run_gage.ref.fa b/iva/tests/data/qc_external_test_run_gage.ref.fa
new file mode 100644
index 0000000..223ea2e
--- /dev/null
+++ b/iva/tests/data/qc_external_test_run_gage.ref.fa
@@ -0,0 +1,51 @@
+>1
+ATCAGCTTGTGGAAAAGGGTATAGGCACCCATTCTTGGAGACGGTTCGGCTGGAGTTTCG
+CCGGTCGAAGAAACGGAAAGGAATTCCTCGTACAACCTCTGTCACATGCGGCTTGTTATC
+CACAACCTCTGTACGAACGTGACCCGCGGATTGGTTGGCACAGTAGCATGGGTCAAAGGT
+CTGAAACACTCCCTATGTCCAGAGGCCTTCTCTTTGACCCGCAGGTAGGAGAGGCAAGCA
+CGAAAATCGCTGACTTTTCGCGCTGAGCGAACACCAGCCTGCTCCGCCGAACTGGGTATA
+CCAGAAACATTGGGTTGCTTCTTGTCATGGGAAACTAGAGGGAACGGAAGGACGGGATAC
+GGCTTTCCGATCCTCGGAATGTGTTTGCCTGGTGTAGTATAACACTAAGCCAAATTACCG
+CTCCTTTTCGTCCACCAACCTATACGTTGGGTCTAGCAGACGTTCAGCCTTCAAACCATT
+CGGTAAGGATCGGGGTAAACTCTTTCGCAAGGAACGAGCGGTCGTAGCAGTGGTGAAATT
+GAGGCCATCAGTGTGCGCCGAATAGCTCGTGAAGCAATGTAGATTTCGCAGTACAAGGTG
+CACCGGTGACTCCAAAACGACTAGGTAAAGGCAAATCGACTTACATGAACACGCGGCTTC
+CAACAGTGTGGTAGGATGAAAAAACGTCATATTGACCTACGTATATTAGTAGTGAGCGAG
+GGACGCAGTATAGTTCCGGCCTGCCCTATAAATAAATAGTTAGCACCAAACGAAGAAGGA
+TCGATATTTTCTTCATAGGAAGATGTCTCTTGCGTTCTTCCGGTTCCCTGCGCAAGTACT
+CCCGGTATGAGTCTCACCTGCCTGGACGCGGTTTTGTAGTGTAAGAACGTTGTGGGCGGT
+GTGAATCTACGAGGTCTTTTCTGTCGCACTAAATGGCGGAGTCAGTTCACGATAAACGTC
+ACTGTCCTTCCTAAAGGCTTATACATGACGAAGTCACCAAACTCTGGACGATGTTTCAAG
+TAATTACCTTGACTGACCTTGGGATGCCGCCTGGCTAAGTCCTTCTTAAACACCAGTAAA
+TGCGCTGAACCCCACGAACCTGGTTCTCTCTTAAAACACACTGTACCACGGTGAGAATAT
+AGACATAAATTTATGCTACGGCCTCAGCGTGGGCCTATCTAGGCCGACATATTCTCATTG
+ATTAGTATAGCTTCATACAAGAAAACTGTACATGGGCCGGGCGAAGGGATTTACAGAGGA
+TTATAGTTGGTGAGGGGGACCGAAGAGTCATGCAGGCGATGACCCGGCGCTAGGCACATC
+CTTCTTTCGTGTCACTACGTAGGGCTTGGACGGGCTAGTTCGACGTTCTTGTCGAGATAT
+GCCAGGCTCAATCTGGAATTCGATCGCTTCTATGAACTTTAGCAAACGGACGCACAAGAC
+CCCCCACTATCAGGCGTCTTAGCATGAATCTTATAGATGAAAGACGCCATCGCCTTTCAA
+CACAGTACGGAGCCCCCCTAGGGTACGACGGCTAGGAAACTTCGATTCCCGATGGTGCAC
+CCAGCAAAAGCGAGGAGATATCTGACGACAAAGTCTTATCACATTCCAGCCATGTATCTG
+CTGCCTAGATTAGGATGGAACAGATGTAATATCGAAATAGTGTCGAAAATAAACACTACC
+AACGTTGCAGCGGGCCTCTGCGGATTAGTATCTTGAACGTCTGCATTGTAGATCTAGATC
+GCTTCCATGTTCTTGGTCGCTGGTATGGGACAGAAGCCATGGATATGAACTCTTTTAGGA
+GTCCAATGGGTGGTAAGAATATGAGGTTAGGTAAAACACTTCAAAACTAAGGACGAGCGA
+GTGTTCTTTTCTTAGGGGGTTTGAAGACTGGCTCCAGACCCCTCGGCTTCGCGAGAAACT
+CAAGCAGCGTATGCACGTGCGCCGCTCAGCCACGAGGCACGCTCCGTGAATCGGACTCGT
+GGCTTAAGCTTCCTACCCGTTCAGTTGGCAGTTCAAGCTATAGTGCGGCACGCACCGTGA
+CATCTTCGAGTCGACCACTTGCCCACGGAATCGGCAGACACTGGGCTTTGGCTCACAAAT
+CGTTCCGAATTGATACTTGAGGTCAACTCTGCTCGGATGCCCAGCCCGAACGCCTTGGCA
+AGAACACTTCAAGAAACTGACGGTATTCCCATCTTCTAACTGATGCGTAACCCACTGAGA
+ACGGATAGGACTCGCACGGCCGAGGGGGACTATATCAGAACGATTGGTCTGTACATGGGC
+TTGCCCGTTGTCCATCGCTTTCTGTGGGCCACACGGATGCAACCTAAGGGACAGTCCGAG
+GCTTTAAAATCAGAAGGGACGCGGAGATCCCCACTTGTTTTCGATCGCGACTCGGTATGA
+CGTGCTCTCGCGGTAACTTTATCATTGCGGTTACTTTCCTCACCGGTGACGCCTAGCTGG
+GAACCCTAGGGTGAGGCGACAGACGACGCTGCTGATTGCCGTCTTTAATATAGAATTCAT
+TGCGCTGGGATCGTCCAATGTTCCAGTCTTCACTTCTGTATCGGGCGAGAACTATTTGCG
+GAACTGAATTCGCTACCACAGACATCGCGAGCGTCACGGTTAATGCTCGGATGTCAGGAT
+GATAGCTCCCACTGTGAGACTATACGAGGCCCTGTAGGTGTCACTGTAGCTGTCGTAGAT
+GTCCAACGACGCCAATAGGGTTCAGTCTTATAAATCAAACCCTATCTAGTCCCGTACAAT
+AAAAATGGGCGCTATTGCCTATTGACATGCATATGGATTCTACACCCATGAACGTCACAC
+CCTTGGGTCACCGGAGAGCTTCGATTTCAGGTTTAGATATGCGGGAATCAAAACCTGGAC
+TGTCAGCATAATAGATTAACAACGCCACGATTGCCCGTCTGTAAATGATCGTGCTTGTTA
+ACGCTTGCTTGAGCGTATTATGGCACCGCTCGTTAGGTGAGATTGGGGCACCTAGAGCCA
diff --git a/iva/tests/data/qc_external_test_run_gage.scaffs.fa b/iva/tests/data/qc_external_test_run_gage.scaffs.fa
new file mode 100644
index 0000000..404da95
--- /dev/null
+++ b/iva/tests/data/qc_external_test_run_gage.scaffs.fa
@@ -0,0 +1,53 @@
+>1
+ATCAGCTTGTGGAAAAGGGTATAGGCACCCATTCTTGGAGACGGTTCGGCTGGAGTTTCG
+CCGGTCGAAGAAACGGAAAGGAATTCCTCGTACAACCTCTGTCACATGCGGCTTGTTATC
+CACAACCTCTGTACGAACGTGACCCGCGGATTGGTTGGCACAGTAGCATGGGTCAAAGGT
+CTGAAACACTCCCTATGTCCAGAGGCCTTCTCTTTGACCCGCAGGTAGGAGAGGCAAGCA
+CGAAAATCGCTGACTTTTCGCGCTGAGCGAACACCAGCCTGCTCCGCCGAACTGGGTATA
+CCAGAAACATTGGGTTGCTTCTTGTCATGGGAAACTAGAGGGAACGGAAGGACGGGATAC
+GGCTTTCCGATCCTCGGAATGTGTTTGCCTGGTGTAGTATAACACTAAGCCAAATTACCG
+CTCCTTTTCGTCCACCAACCTATACGTTGGGTCTAGCAGACGTTCAGCCTTCAAACCATT
+CGGTAAGGATCGGGGTAAACTCTTTCGCAAGGAACGAGCGGTCGTAGCAGTGGTGAAATT
+GAGGCCATCAGTGTGCGCCGAATAGCTCGTGAAGCAATGTAGATTTCGCAGTACAAGGTG
+CACCGGTGACTCCAAAACGACTAGGTAAAGGCAAATCGACTTACATGAACACGCGGCTTC
+CAACAGTGTGGTAGGATGAAAAAACGTCATATTGACCTACGTATATTAGTAGTGAGCGAG
+GGACGCAGTATAGTTCCGGCCTGCCCTATAAATAAATAGTTAGCACCAAACGAAGAAGGA
+TCGATATTTTCTTCATAGGAAGATGTCTCTTGCGTTCTTCCGGTTCCCTGCGCAAGTACT
+NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+GTGAATCTACGAGGTCTTTTCTGTCGCACTAAATGGCGGAGTCAGTTCACGATAAACGTC
+ACTGTCCTTCCTAAAGGCTTATACATGACGAAGTCACCAAACTCTGGACGATGTTTCAAG
+TAATTACCTTGACTGACCTTGGGATGCCGCCTGGCTAAGTCCTTCTTAAACACCAGTAAA
+TGCGCTGAACCCCACGAACCTGGTTCTCTCTTAAAACACACTGTACCACGGTGAGAATAT
+AGACATAAATTTATGCTACGGCCTCAGCGTGGGCCTATCTAGGCCGACATATTCTCATTG
+ATTAGTATAGCTTCATACAAGAAAACTGTACATGGGCCGGGCGAAGGGATTTACAGAGGA
+TTATAGTTGGTGAGGGGGACCGAAGAGTCATGCAGGCGATGACCCGGCGCTAGGCACATC
+CTTCTTTCGTGTCACTACGTAGGGCTTGGACGGGCTAGTTCGACGTTCTTGTCGAGATAT
+GCCAGGCTCAATCTGGAATTCGATCGCTTCTATGAACTTTAGCAAACGGACGCACAAGAC
+CCCCCACTATCAGGCGTCTTAGCATGAATCTTATAGATGAAAGACGCCATCGCCTTTCAA
+CGCAGTACGGAGCCCCCCTAGGGTACGACGGCTAGGAAACTTCGATTCCCGATGGTGCAC
+CCAGCAAAAGCGAGGAGATATCTGACGACAAAGTCTTATCACATTCCAGCCATGTATCTG
+CTGCCTAGATTAGGATGGAACAGATGTAATATCGAAATAGTGTCGAAAATAAACACTACC
+AACGTTGCAGCGGGCCTCTGCGGATTAGTATCTTGAACGTCTGCATTGTAGATCTAGATC
+TTCCATGTTCTTGGTCGCTGGTATGGGACAGAAGCCATGGATATGAACTCTTTTAGGA
+>2
+GTCCAATGGGTGGTAAGAATATGAGGTTAGGTAAAACACTTCAAAACTAAGGACGAGCGA
+GTGTTCTTTTCTTAGGGGGTTTGAAGACTGGCTCCAGACCCCTCGGCTTCGCGAGAAACT
+CAAGCAGCGTATGCACGTGCGCCGCTCAGCCACGAGGCACGCTCCGTGAATCGGACTCGT
+GGCTTAAGCTTCCTACCCGTTCAGTTGGCAGTTCAAGCTATAGTGCGGCACGCACCGTGA
+CATCTTCGAGTCGACCACTTGCCCACGGAATCGGCAGACACTGGGCTTTGGCTCACAAAT
+CGTTCCGAATTGATACTTGAGGTCAACTCTGCTCGGATGCCCAGCCCGAACGCCTTGGCA
+AGAACACTTCAAGAAACTGACGGTATTCCCATCTTCTAACTGATGCGTAACCCACTGAGA
+ACGGATAGGACTCGCACGGCCGAGGGGGACTATATCAGAACGATTGGTCTGTACATGGGC
+TTGCCCGTTGTCCATCGCTTTCTGTGGGCCACACGGATGCAACCTAAGGGACAGTCCGAG
+GGTTTAAAATCAGAAGGGACGCGGAGATCCCCACTTGTTTTCGATCGCGACTCGGTATGA
+CTATCGTCGCTATCGTCGATCGTCCATCATTCATACTTCAATTAGCAATTGCTGCTATGA
+CGTGCTCTCGCGGTAACTTTATCATTGCGGTTACTTTCCTCACCGGTGACGCCTAGCTGG
+GAACCCTAGGGTGAGGCGACAGACGACGCTGCTGATTGCCGTCTTTAATATAGAATTCAT
+TGCGCTGGGATCGTCCAATGTTCCAGTCTTCACTTCTGTATCGGGCGAGAACTATTTGCG
+GAACTGAATTCGCTACCACAGACATCGCGAGCGTCACGGTTAATGCTCGGATGTCAGGAT
+GATAGCTCCCACTGTGAGACTATACGAGGCCCTGTAGGTGTCACTGTAGCTGTCGTAGAT
+GTCCAACGACGCCAATAGGGTTCAGTCTTATAAATCAAACCCTATCTAGTCCCGTACAAT
+AAAAATGGGCGCTATTGCCTATTGACATGCATATGGATTCTACACCCATGAACGTCACAC
+CCTTGGGTCACCGGAGAGCTTCGATTTCAGGTTTAGATATGCGGGAATCAAAACCTGGAC
+TGTCAGCATAATAGATTAACAACGCCACGATTGCCCGTCTGTAAATGATCGTGCTTGTTA
+ACGCTTGCTTGAGCGTATTATGGCACCGCTCGTTAGGTGAGATTGGGGCACCTAGAGCCA
diff --git a/iva/tests/data/qc_external_test_run_ratt.assembly.fa b/iva/tests/data/qc_external_test_run_ratt.assembly.fa
new file mode 100644
index 0000000..dc4045a
--- /dev/null
+++ b/iva/tests/data/qc_external_test_run_ratt.assembly.fa
@@ -0,0 +1,155 @@
+>NC_001802
+ggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccac
+tgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgt
+gtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagca
+gtggcgcccgaacagggacctgaaagcgaaagggaaaccagaggagctctctcgacgcag
+gactcggcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagtacgcc
+aaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaa
+gcgggggagaattagatcgatgggaaaaaattcggttaaggccagggggaaagaaaaaat
+ataaattaaaacatatagtatgggcaagcagggagctagaacgattcgcagttaatcctg
+gcctgttagaaacatcagaaggctgtagacaaatactgggacagctacaaccatcccttc
+agacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgc
+atcaaaggatagagataaaagacaccaaggaagctttagacaagatagaggaagagcaaa
+acaaaagtaagaaaaaagcacagcaagcagcagctgacacaggacacagcaatcaggtca
+gccaaaattaccctatagtgcagaacatccaggggcaaatggtacatcaggccatatcac
+ctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtga
+tacccatgttttcagcattatcagaaggagccaccccacaagatttaaacaccatgctaa
+acacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaag
+ctgcagaatgggatagagtgcatccagtgcatgcagggcctattgcaccaggccagatga
+gagaaccaaggggaagtgacatagcaggaactactagtacccttcaggaacaaataggat
+ggatgacaaataatccacctatcccagtaggagaaatttataaaagatggataatcctgg
+gattaaataaaatagtaagaatgtatagccctaccagcattctggacataagacaaggac
+caaaggaaccctttagagactatgtagaccggttctataaaactctaagagccgagcaag
+cttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccag
+attgtaagactattttaaaagcattgggaccagcggctacactagaagaaatgatgacag
+catgtcagggagtaggaggacccggccataaggcaagagttttggctgaagcaatgagcc
+aagtaacaaattcagctaccataatgatgcagagaggcaattttaggaaccaaagaaaga
+ttgttaagtgtttcaattgtggcaaagaagggcacacagccagaaattgcagggccccta
+ggaaaaagggctgttggaaatgtggaaaggaaggacaccaaatgaaagattgtactgaga
+gacaggctaattttttagggaagatctggccttcctacaagggaaggccagggaattttc
+ttcagagcagaccagagccaacagccccaccagaagagagcttcaggtctggggtagaga
+caacaactccccctcagaagcaggagccgatagacaaggaactgtatcctttaacttccc
+tcaggtcactctttggcaacgacccctcgtcacaataaagataggggggcaactaaagga
+agctctattagatacaggagcagatgatacagtattagaagaaatgagtttgccaggaag
+atggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatca
+gatactcatagaaatctgtggacataaagctataggtacagtattagtaggacctacacc
+tgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccat
+tagccctattgagactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaa
+acaatggccattgacagaagaaaaaataaaagcattagtagaaatttgtacagagatgga
+aaaggaagggaaaatttcaaaaattgggcctgaaaatccatacaatactccagtatttgc
+cataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataa
+gagaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaa
+gaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagatga
+agacttcaggaagtatactgcatttaccatacctagtataaacaatgagacaccagggat
+tagatatcagtacaatgtgcttccacagggatggaaaggatcaccagcaatattccaaag
+tagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatca
+atacatggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaat
+agaggagctgagacaacatctgttgaggtggggacttaccacaccagacaaaaaacatca
+gaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtaca
+gcctatagtgctgccagaaaaagacagctggactgtcaatgacatacagaagttagtggg
+gaaattgaattgggcaagtcagatttacccagggattaaagtaaggcaattatgtaaact
+ccttagaggaaccaaagcactaacagaagtaataccactaacagaagaagcagagctaga
+actggcagaaaacagagagattctaaaagaaccagtacatggagtgtattatgacccatc
+aaaagacttaatagcagaaatacagaagcaggggcaaggccaatggacatatcaaattta
+tcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacac
+taatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaat
+atggggaaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacatggtg
+gacagagtattggcaagccacctggattcctgagtgggagtttgttaatacccctccctt
+agtgaaattatggtaccagttagagaaagaacccatagtaggagcagaaaccttctatgt
+agatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaatagagg
+aagacaaaaagttgtcaccctaactgacacaacaaatcagaagactgagttacaagcaat
+ttatctagctttgcaggattcgggattagaagtaaacatagtaacagactcacaatatgc
+attaggaatcattcaagcacaaccagatcaaagtgaatcagagttagtcaatcaaataat
+agagcagttaataaaaaaggaaaaggtctatctggcatgggtaccagcacacaaaggaat
+tggaggaaatgaacaagtagataaattagtcagtgctggaatcaggaaagtactattttt
+agatggaatagataaggcccaagatgaacatgagaaatatcacagtaattggagagcaat
+ggctagtgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtgataa
+atgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggca
+actagattgtacacatttagaaggaaaagttatcctggtagcagttcatgtagccagtgg
+atatatagaagcagaagttattccagcagaaacagggcaggaaacagcatattttctttt
+aaaattagcaggaagatggccagtaaaaacaatacatactgacaatggcagcaatttcac
+cggtgctacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaattcc
+ctacaatccccaaagtcaaggagtagtagaatctatgaataaagaattaaagaaaattat
+aggacaggtaagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcat
+ccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtaga
+cataatagcaacagacatacaaactaaagaattacaaaaacaaattacaaaaattcaaaa
+ttttcgggtttattacagggacagcagaaatccactttggaaaggaccagcaaagctcct
+ctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaag
+aagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggc
+aagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgttt
+cagggaaagctaggggatggttttatagacatcactatgaaagccctcatccaagaataa
+gttcagaagtacacatcccactaggggatgctagattggtaataacaacatattggggtc
+tgcatacaggagaaagagactggcatttgggtcagggagtctccatagaatggaggaaaa
+agagatatagcacacaagtagaccctgaactagcagaccaactaattcatctgtattact
+ttgactgtttttcagactctgctataagaaaggccttattaggacacatagttagcccta
+ggtgtgaatatcaagcaggacataacaaggtaggatctctacaatacttggcactagcag
+cattaataacaccaaaaaagataaagccacctttgcctagtgttacgaaactgacagagg
+atagatggaacaagccccagaagaccaagggccacagagggagccacacaatgaatggac
+actagagcttttagaggagcttaagaatgaagctgttagacattttcctaggatttggct
+ccatggcttagggcaacatatctatgaaacttatggggatacttgggcaggagtggaagc
+cataataagaattctgcaacaactgctgtttatccattttcagaattgggtgtcgacata
+gcagaataggcgttactcgacagaggagagcaagaaatggagccagtagatcctagacta
+gagccctggaagcatccaggaagtcagcctaaaactgcttgtaccaattgctattgtaaa
+aagtgttgctttcattgccaagtttgtttcataacaaaagccttaggcatctcctatggc
+aggaagaagcggagacagcgacgaagagctcatcagaacagtcagactcatcaagcttct
+ctatcaaagcagtaagtagtacatgtaatgcaacctataccaatagtagcaatagtagca
+ttagtagtagcaataataatagcaatagttgtgtggtccatagtaatcatagaatatagg
+aaaatattaagacaaagaaaaatagacaggttaattgatagactaatagaaagagcagaa
+gacagtggcaatgagagtgaaggagaaatatcagcacttgtggagatgggggtggagatg
+gggcaccatgctccttgggatgttgatgatctgtagtgctacagaaaaattgtgggtcac
+agtctattatggggtacctgtgtggaaggaagcaaccaccactctattttgtgcatcaga
+tgctaaagcatatgatacagaggtacataatgtttgggccacacatgcctgtgtacccac
+agaccccaacccacaagaagtagtattggtaaatgtgacagaaaattttaacatgtggaa
+aaatgacatggtagaacagatgcatgaggatataatcagtttatgggatcaaagcctaaa
+gccatgtgtaaaattaaccccactctgtgttagtttaaagtgcactgatttgaagaatga
+tactaataccaatagtagtagcgggagaatgataatggagaaaggagagataaaaaactg
+ctctttcaatatcagcacaagcataagaggtaaggtgcagaaagaatatgcattttttta
+taaacttgatataataccaatagataatgatactaccagctataagttgacaagttgtaa
+cacctcagtcattacacaggcctgtccaaaggtatcctttgagccaattcccatacatta
+ttgtgccccggctggttttgcgattctaaaatgtaataataagacgttcaatggaacagg
+accatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaac
+tcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtcaattt
+cacggacaatgctaaaaccataatagtacagctgaacacatctgtagaaattaattgtac
+aagacccaacaacaatacaagaaaaagaatccgtatccagagaggaccagggagagcatt
+tgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaa
+atggaataacactttaaaacagatagctagcaaattaagagaacaatttggaaataataa
+aacaataatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagttttaa
+ttgtggaggggaatttttctactgtaattcaacacaactgtttaatagtacttggtttaa
+tagtacttggagtactgaagggtcaaataacactgaaggaagtgacacaatcaccctccc
+atgcagaataaaacaaattataaacatgtggcagaaagtaggaaaagcaatgtatgcccc
+tcccatcagtggacaaattagatgttcatcaaatattacagggctgctattaacaagaga
+tggtggtaatagcaacaatgagtccgagatcttcagacctggaggaggagatatgaggga
+caattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagc
+acccaccaaggcaaagagaagagtggtgcagagagaaaaaagagcagtgggaataggagc
+tttgttccttgggttcttgggagcagcaggaagcactatgggcgcagcctcaatgacgct
+gacggtacaggccagacaattattgtctggtatagtgcagcagcagaacaatttgctgag
+ggctattgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagctcca
+ggcaagaatcctggctgtggaaagatacctaaaggatcaacagctcctggggatttgggg
+ttgctctggaaaactcatttgcaccactgctgtgccttggaatgctagttggagtaataa
+atctctggaacagatttggaatcacacgacctggatggagtgggacagagaaattaacaa
+ttacacaagcttaatacactccttaattgaagaatcgcaaaaccagcaagaaaagaatga
+acaagaattattggaattagataaatgggcaagtttgtggaattggtttaacataacaaa
+ttggctgtggtatataaaattattcataatgatagtaggaggcttggtaggtttaagaat
+agtttttgctgtactttctatagtgaatagagttaggcagggatattcaccattatcgtt
+tcagacccacctcccaaccccgaggggacccgacaggcccgaaggaatagaagaagaagg
+tggagagagagacagagacagatccattcgattagtgaacggatccttggcacttatctg
+ggacgatctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgat
+tgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtg
+gaatctcctacagtattggagtcaggaactaaagaatagtgctgttagcttgctcaatgc
+cacagccatagcagtagctgaggggacagatagggttatagaagtagtacaaggagcttg
+tagagctattcgccacatacctagaagaataagacagggcttggaaaggattttgctata
+agatgggtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaaagaa
+tgagacgagctgagccagcagcagatagggtgggagcagcatctcgagacctggaaaaac
+atggagcaatcacaagtagcaatacagcagctaccaatgctgcttgtgcctggctagaag
+cacaagaggaggaggaggtgggttttccagtcacacctcaggtacctttaagaccaatga
+cttacaaggcagctgtagatcttagccactttttaaaagaaaaggggggactggaagggc
+taattcactcccaaagaagacaagatatccttgatctgtggatctaccacacacaaggct
+acttccctgattagcagaactacacaccagggccaggggtcagatatccactgacctttg
+gatggtgctacaagctagtaccagttgagccagataagatagaagaggccaataaaggag
+agaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccggagagagaag
+tgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgagagctgcatc
+cggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggact
+ttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatat
+aagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggag
+ctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgctt
+c
diff --git a/iva/tests/data/qc_external_test_run_ratt.embl/hiv-1.embl b/iva/tests/data/qc_external_test_run_ratt.embl/hiv-1.embl
new file mode 100644
index 0000000..2272e1a
--- /dev/null
+++ b/iva/tests/data/qc_external_test_run_ratt.embl/hiv-1.embl
@@ -0,0 +1,645 @@
+ID NC_001802; SV 1; linear; unassigned DNA; STD; VRL; 9181 BP.
+XX
+AC NC_001802;
+XX
+DT 16-JUN-2014
+XX
+DE Human immunodeficiency virus 1, complete genome.
+XX
+KW RefSeq
+XX
+OS Human immunodeficiency virus 1 (HIV-1)
+OC Viruses; Retro-transcribing viruses; Retroviridae; Orthoretrovirinae;
+OC Lentivirus; Primate lentivirus group.
+OX NCBI_TaxID=11676
+XX
+RN [1]
+RP 1-9181
+RX PUBMED; 9362478.
+RA Martoglio,B., Graf,R. and Dobberstein,B.;
+RT Signal peptide fragments of preprolactin and HIV-1 p-gp160 interact with
+RT calmodulin;
+RL EMBO J. 16 (22), 6636-6645 (1997)
+XX
+RN [2]
+RP 1-9181
+RA Petropoulos,C.J.;
+RT Appendix 2: Retroviral taxonomy, protein structure, sequences, and genetic
+RT maps;
+RL (in) Coffin,J.M. (Ed.); RETROVIRUSES: 757; Cold Spring Harbor Laboratory
+RL Press, Cold Spring Harbor, New York, NY, USA (1997)
+XX
+RN [3]
+RP 1-9181
+RA ;
+RT Direct Submission;
+RL Submitted (07-AUG-2002) National Center for Biotechnology Information,
+RL NIH, Bethesda, MD 20894, USA
+XX
+RN [4]
+RC Sequence update by submitter
+RP 1-9181
+RA Chappey,C.;
+RT Direct Submission;
+RL Submitted (15-MAR-1999) NIH, NLM, Rockville Pike, Bethesda, MD 20894, USA
+XX
+RN [5]
+RP 1-9181
+RA Chappey,C.;
+RT Direct Submission;
+RL Submitted (12-NOV-1997) NIH, NLM, Rockville Pike, Bethesda, MD 20894, USA
+XX
+DR BioProject; PRJNA15476.
+XX
+CC REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference
+CC sequence was derived from AF033819. The annotation of this sequence was
+CC corrected and updated with the kind help of Dr. Colombe Chappey (ViroLogic
+CC Inc., South San Francisco, CA USA) and Roger Ptak (Southern Research
+CC Institute, Frederick, MD USA). COMPLETENESS: full length.
+XX
+FH Key Location/Qualifiers
+FH
+FT source 1..9181
+FT /mol_type="genomic RNA"
+FT /db_xref="taxon:11676"
+FT /organism="Human immunodeficiency virus 1"
+FT /note="strain for reference annotation"
+FT misc_feature 1..96
+FT /note="repeat; positions of RNA transcription
+FT initialization and polyadenylation; Region: R"
+FT regulatory 73..78
+FT /regulatory_class="polyA_signal_sequence"
+FT /note="both 5' and 3' poly A signals are transcribed into
+FT RNA, but the 5' one is suppressed"
+FT 5'UTR 97..181
+FT primer_bind 182..199
+FT gene 336..4642
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GeneID:155348"
+FT /gene="gag-pol"
+FT CDS join(336..1637,1637..4642)
+FT /locus_tag="HIV1gp1"
+FT /protein_id="NP_057849.4"
+FT /gene="gag-pol"
+FT /note="fusion protein consisting of the viral structural
+FT proteins and enzymes; cleaved by the viral protease into
+FT individual mature proteins; The processing products of the
+FT Gag and Gag-Pol polyproteins were annotated with the help
+FT of Pettit et al., 2003 and references therein; Pr160"
+FT /db_xref="GI:28872819"
+FT /db_xref="GeneID:155348"
+FT /ribosomal_slippage
+FT /codon_start=1
+FT /translation="MGARASVLSGGELDRWEKIRLRPGGKKKYKLKHIVWASRELERFA
+FT VNPGLLETSEGCRQILGQLQPSLQTGSEELRSLYNTVATLYCVHQRIEIKDTKEALDKI
+FT EEEQNKSKKKAQQAAADTGHSNQVSQNYPIVQNIQGQMVHQAISPRTLNAWVKVVEEKA
+FT FSPEVIPMFSALSEGATPQDLNTMLNTVGGHQAAMQMLKETINEEAAEWDRVHPVHAGP
+FT IAPGQMREPRGSDIAGTTSTLQEQIGWMTNNPPIPVGEIYKRWIILGLNKIVRMYSPTS
+FT ILDIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPA
+FT ATLEEMMTACQGVGGPGHKARVLAEAMSQVTNSATIMMQRGNFRNQRKIVKCFNCGKEG
+FT HTARNCRAPRKKGCWKCGKEGHQMKDCTERQANFLREDLAFLQGKAREFSSEQTRANSP
+FT TRRELQVWGRDNNSPSEAGADRQGTVSFNFPQVTLWQRPLVTIKIGGQLKEALLDTGAD
+FT DTVLEEMSLPGRWKPKMIGGIGGFIKVRQYDQILIEICGHKAIGTVLVGPTPVNIIGRN
+FT LLTQIGCTLNFPISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKIS
+FT KIGPENPYNTPVFAIKKKDSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKKKKSVT
+FT VLDVGDAYFSVPLDEDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTK
+FT ILEPFRKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLLRWGLTTPDKKHQKEP
+FT PFLWMGYELHPDKWTVQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLCKLLR
+FT GTKALTEVIPLTEEAELELAENREILKEPVHGVYYDPSKDLIAEIQKQGQGQWTYQIYQ
+FT EPFKNLKTGKYARMRGAHTNDVKQLTEAVQKITTESIVIWGKTPKFKLPIQKETWETWW
+FT TEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRETKLGKAGYVTNR
+FT GRQKVVTLTDTTNQKTELQAIYLALQDSGLEVNIVTDSQYALGIIQAQPDQSESELVNQ
+FT IIEQLIKKEKVYLAWVPAHKGIGGNEQVDKLVSAGIRKVLFLDGIDKAQDEHEKYHSNW
+FT RAMASDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKVILVAVH
+FT VASGYIEAEVIPAETGQETAYFLLKLAGRWPVKTIHTDNGSNFTGATVRAACWWAGIKQ
+FT EFGIPYNPQSQGVVESMNKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSA
+FT GERIVDIIATDIQTKELQKQITKIQNFRVYYRDSRNPLWKGPAKLLWKGEGAVVIQDNS
+FT DIKVVPRRKAKIIRDYGKQMAGDDCVASRQDED"
+FT /product="Gag-Pol"
+FT mat_peptide join(1632..1637,1637..1798)
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:28872821"
+FT /experiment="DESCRIPTION:[PMID:15527852]"
+FT /protein_id="NP_787043.1"
+FT /gene="gag-pol"
+FT /product="Gag-Pol Transframe peptide"
+FT /note="the Glu-Asp-Leu tripeptide (positions 4-6) is a
+FT specific inhibitor of the HIV-1 protease. Involved in
+FT regulation of the protease-mediated polyprotein
+FT processing; alternative p6 protein; p6*"
+FT mat_peptide 1655..4639
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:28872823"
+FT /protein_id="NP_789740.1"
+FT /gene="gag-pol"
+FT /product="Pol"
+FT /note="unprocessed Pol polyprotein; includes part of the
+FT transframe peptide, protease, reverse transcriptase and
+FT integrase domains."
+FT mat_peptide 1799..2095
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:25121906"
+FT /experiment="DESCRIPTION:[PMID:2537531]"
+FT /experiment="DESCRIPTION:[PMID:2548279]"
+FT /experiment="DESCRIPTION:[PMID:3290901]"
+FT /protein_id="NP_705926.1"
+FT /gene="gag-pol"
+FT /product="aspartic peptidase"
+FT /note="The proteinase domain of Gag-Pol (in the form of
+FT homodimer) mediates all the cleavages in the polyprotein.
+FT Cleaves itself from the polyprotein late in particle
+FT assembly; protease"
+FT mat_peptide 2096..3775
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:25121907"
+FT /experiment="DESCRIPTION:[PMID:1374166]"
+FT /experiment="EXISTENCE:[PMID:4316300]"
+FT /protein_id="NP_705927.1"
+FT /gene="gag-pol"
+FT /product="p66 subunit"
+FT /note="transcribes single stranded viral RNA genome into
+FT double stranded proviral DNA; HIV-1 reverse transcriptase
+FT is composed of the p66 subunit (this protein) and the p51
+FT subunit that lacks the RNAse H domain of the larger
+FT subunit"
+FT mat_peptide 2096..3415
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:28872822"
+FT /protein_id="NP_789739.1"
+FT /gene="gag-pol"
+FT /product="reverse transcriptase p51 subunit"
+FT /note="HIV-1 reverse transcriptase is composed of the p66
+FT subunit and the p51 subunit (this protein) that lacks the
+FT RNAse H domain of the larger subunit"
+FT mat_peptide 3776..4639
+FT /locus_tag="HIV1gp1"
+FT /db_xref="GI:25121908"
+FT /experiment="DESCRIPTION:[PMID:7983732]"
+FT /experiment="DESCRIPTION:[PMID:8035478]"
+FT /protein_id="NP_705928.1"
+FT /gene="gag-pol"
+FT /product="integrase"
+FT /note="mediates integration of the viral DNA into the
+FT infected cell chromosome"
+FT gene 336..1838
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GeneID:155030"
+FT /gene="gag"
+FT CDS 336..1838
+FT /locus_tag="HIV1gp2"
+FT /protein_id="NP_057850.1"
+FT /gene="gag"
+FT /note="The processing products of the Gag and Gag-Pol
+FT polyproteins were annotated with the help of Pettit et
+FT al., 2003 and references therein"
+FT /db_xref="GI:9629360"
+FT /db_xref="GeneID:155030"
+FT /codon_start=1
+FT /product="Pr55(Gag)"
+FT /translation="MGARASVLSGGELDRWEKIRLRPGGKKKYKLKHIVWASRELERFA
+FT VNPGLLETSEGCRQILGQLQPSLQTGSEELRSLYNTVATLYCVHQRIEIKDTKEALDKI
+FT EEEQNKSKKKAQQAAADTGHSNQVSQNYPIVQNIQGQMVHQAISPRTLNAWVKVVEEKA
+FT FSPEVIPMFSALSEGATPQDLNTMLNTVGGHQAAMQMLKETINEEAAEWDRVHPVHAGP
+FT IAPGQMREPRGSDIAGTTSTLQEQIGWMTNNPPIPVGEIYKRWIILGLNKIVRMYSPTS
+FT ILDIRQGPKEPFRDYVDRFYKTLRAEQASQEVKNWMTETLLVQNANPDCKTILKALGPA
+FT ATLEEMMTACQGVGGPGHKARVLAEAMSQVTNSATIMMQRGNFRNQRKIVKCFNCGKEG
+FT HTARNCRAPRKKGCWKCGKEGHQMKDCTERQANFLGKIWPSYKGRPGNFLQSRPEPTAP
+FT PEESFRSGVETTTPPQKQEPIDKELYPLTSLRSLFGNDPSSQ"
+FT mat_peptide 336..731
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:28876542"
+FT /experiment="DESCRIPTION:[PMID:12032547]"
+FT /experiment="DESCRIPTION:[PMID:1710290]"
+FT /experiment="DESCRIPTION:[PMID:8610175]"
+FT /protein_id="NP_579876.2"
+FT /gene="gag"
+FT /product="matrix"
+FT /note="viral structural protein; forms the outer
+FT structural shell of HIV-1 virions; involved in the nuclear
+FT import of the HIV-1 preintegration complex; p17"
+FT mat_peptide 732..1424
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:19172948"
+FT /experiment="DESCRIPTION:[PMID:15208690]"
+FT /experiment="DESCRIPTION:[PMID:16041386]"
+FT /experiment="DESCRIPTION:[PMID:21248851]"
+FT /protein_id="NP_579880.1"
+FT /gene="gag"
+FT /product="capsid"
+FT /note="viral structural protein; forms the core of HIV-1
+FT virions; p24"
+FT mat_peptide 1425..1466
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:19172950"
+FT /protein_id="NP_579882.1"
+FT /gene="gag"
+FT /product="p2"
+FT /note="Processing of Gag-Pol by the protease domain dimer
+FT starts with cleavage between the p2 and nucleocapsid
+FT proteins."
+FT mat_peptide 1467..1631
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:19172949"
+FT /experiment="DESCRIPTION:[PMID:1639074]"
+FT /experiment="DESCRIPTION:[PMID:7666546]"
+FT /protein_id="NP_579881.1"
+FT /gene="gag"
+FT /product="nucleocapsid"
+FT /note="viral structural protein; coats the genomic RNA
+FT inside the virion core; binds and delivers full-length
+FT viral RNAs into assembling HIV-1 virions; p7"
+FT mat_peptide 1632..1679
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:28872820"
+FT /protein_id="NP_787042.1"
+FT /gene="gag"
+FT /product="p1"
+FT /note="important for virus infectivity, protein
+FT processing, and genomic RNA dimer stability"
+FT mat_peptide 1680..1835
+FT /locus_tag="HIV1gp2"
+FT /db_xref="GI:19172951"
+FT /experiment="DESCRIPTION:[PMID:10085158]"
+FT /experiment="DESCRIPTION:[PMID:15527852]"
+FT /protein_id="NP_579883.1"
+FT /gene="gag"
+FT /product="p6"
+FT /note="important for incorporation of Vpr into assembling
+FT HIV-1 virions; helps mediate efficient virus particle
+FT release from infected cells"
+FT gene 4587..5165
+FT /locus_tag="HIV1gp3"
+FT /db_xref="GeneID:155459"
+FT /gene="vif"
+FT CDS 4587..5165
+FT /locus_tag="HIV1gp3"
+FT /protein_id="NP_057851.1"
+FT /gene="vif"
+FT /note="p23; viral infectivity factor; viral accessory
+FT protein important for virus replication in vivo"
+FT /db_xref="GI:9629361"
+FT /db_xref="GeneID:155459"
+FT /codon_start=1
+FT /product="Vif"
+FT /translation="MENRWQVMIVWQVDRMRIRTWKSLVKHHMYVSGKARGWFYRHHYE
+FT SPHPRISSEVHIPLGDARLVITTYWGLHTGERDWHLGQGVSIEWRKKRYSTQVDPELAD
+FT QLIHLYYFDCFSDSAIRKALLGHIVSPRCEYQAGHNKVGSLQYLALAALITPKKIKPPL
+FT PSVTKLTEDRWNKPQKTKGHRGSHTMNGH"
+FT gene 5105..5396
+FT /locus_tag="HIV1gp4"
+FT /db_xref="GeneID:155807"
+FT /gene="vpr"
+FT CDS join(5105..5319,5321..5396)
+FT /locus_tag="HIV1gp4"
+FT /protein_id="NP_057852.2"
+FT /gene="vpr"
+FT /note="p15; viral protein R; viral accessory protein
+FT important for virus replication in vivo; involved in the
+FT nuclear import of the HIV-1 preintegration complex;
+FT induces G2 cell cycle arrest; influences mutation rates
+FT during viral DNA synthesis; An artificial frameshift
+FT eliminating the orf-disrupting nucleotide at position 5320
+FT is introduced to obtain the typical HIV-1 Vpr protein
+FT sequence. For this particular HIV-1 strain, HXB2, only a
+FT short (78 amino acid long) variant of the Vpr sequence can
+FT be obtained by translation of nucleotides 5105 through
+FT 5341 without the frameshift"
+FT /db_xref="GI:28872817"
+FT /db_xref="GeneID:155807"
+FT /codon_start=1
+FT /exception="artificial frameshift"
+FT /translation="MEQAPEDQGPQREPHNEWTLELLEELKNEAVRHFPRIWLHGLGQH
+FT IYETYGDTWAGVEAIIRILQQLLFIHFRIGCRHSRIGVTRQRRARNGASRS"
+FT /product="Vpr"
+FT gene 5377..7970
+FT /locus_tag="HIV1gp5"
+FT /db_xref="GeneID:155871"
+FT /gene="tat"
+FT CDS join(5377..5591,7925..7970)
+FT /locus_tag="HIV1gp5"
+FT /protein_id="NP_057853.1"
+FT /gene="tat"
+FT /note="p14; transcriptional activator; viral regulatory
+FT protein required for virus replication; transactivates the
+FT viral LTR promoter through interactions with cellular
+FT transcription factors; associated with pathogenic effects
+FT of the virus; the length of Tat varies depending on virus
+FT strain or clade"
+FT /db_xref="GI:9629358"
+FT /db_xref="GeneID:155871"
+FT /codon_start=1
+FT /product="Tat"
+FT /translation="MEPVDPRLEPWKHPGSQPKTACTNCYCKKCCFHCQVCFITKALGI
+FT SYGRKKRRQRRRAHQNSQTHQASLSKQPTSQPRGDPTGPKE"
+FT gene 5516..8199
+FT /locus_tag="HIV1gp6"
+FT /db_xref="GeneID:155908"
+FT /gene="rev"
+FT CDS join(5516..5591,7925..8199)
+FT /locus_tag="HIV1gp6"
+FT /protein_id="NP_057854.1"
+FT /gene="rev"
+FT /note="p19; regulator of expression of virion proteins;
+FT prevents splicing of viral RNA; shuttles unspliced viral
+FT RNA to the cytoplasm for expression of viral proteins and
+FT incorporation of full length viral genomic RNA into
+FT virions"
+FT /db_xref="GI:9629359"
+FT /db_xref="GeneID:155908"
+FT /codon_start=1
+FT /product="Rev"
+FT /translation="MAGRSGDSDEELIRTVRLIKLLYQSNPPPNPEGTRQARRNRRRRW
+FT RERQRQIHSISERILGTYLGRSAEPVPLQLPPLERLTLDCNEDCGTSGTQGVGSPQILV
+FT ESPTVLESGTKE"
+FT gene 5608..5856
+FT /locus_tag="HIV1gp7"
+FT /db_xref="GeneID:155945"
+FT /gene="vpu"
+FT CDS 5608..5856
+FT /locus_tag="HIV1gp7"
+FT /protein_id="NP_057855.1"
+FT /gene="vpu"
+FT /note="p16; viral protein U; viral accessory protein
+FT important for virus replication in vivo; promotes
+FT degradation of CD4 and down-regulates cell surface
+FT expression of MHC class I proteins; helps mediate
+FT efficient virus particle release from infected cells;
+FT reported to induce apoptosis by suppressing the nuclear
+FT factor kappaB-dependent expression of antiapoptotic
+FT factors; may attenuate the level of Env precursor(gp160)
+FT biosynthesis; Vpu and gp160 are translated from different
+FT reading frames of the same bicistronic mRNA"
+FT /db_xref="GI:9629366"
+FT /db_xref="GeneID:155945"
+FT /codon_start=1
+FT /product="Vpu"
+FT /translation="MQPIPIVAIVALVVAIIIAIVVWSIVIIEYRKILRQRKIDRLIDR
+FT LIERAEDSGNESEGEISALVEMGVEMGHHAPWDVDDL"
+FT gene 5771..8341
+FT /locus_tag="HIV1gp8"
+FT /db_xref="GeneID:155971"
+FT /gene="env"
+FT CDS 5771..8341
+FT /locus_tag="HIV1gp8"
+FT /protein_id="NP_057856.1"
+FT /gene="env"
+FT /note="gp160; envelope glycoprotein; envelope polyprotein;
+FT cleaved by cellular proteases into mature proteins gp120
+FT and gp41"
+FT /db_xref="GI:9629363"
+FT /db_xref="GeneID:155971"
+FT /codon_start=1
+FT /product="Envelope surface glycoprotein gp160, precursor"
+FT /translation="MRVKEKYQHLWRWGWRWGTMLLGMLMICSATEKLWVTVYYGVPVW
+FT KEATTTLFCASDAKAYDTEVHNVWATHACVPTDPNPQEVVLVNVTENFNMWKNDMVEQM
+FT HEDIISLWDQSLKPCVKLTPLCVSLKCTDLKNDTNTNSSSGRMIMEKGEIKNCSFNIST
+FT SIRGKVQKEYAFFYKLDIIPIDNDTTSYKLTSCNTSVITQACPKVSFEPIPIHYCAPAG
+FT FAILKCNNKTFNGTGPCTNVSTVQCTHGIRPVVSTQLLLNGSLAEEEVVIRSVNFTDNA
+FT KTIIVQLNTSVEINCTRPNNNTRKRIRIQRGPGRAFVTIGKIGNMRQAHCNISRAKWNN
+FT TLKQIASKLREQFGNNKTIIFKQSSGGDPEIVTHSFNCGGEFFYCNSTQLFNSTWFNST
+FT WSTEGSNNTEGSDTITLPCRIKQIINMWQKVGKAMYAPPISGQIRCSSNITGLLLTRDG
+FT GNSNNESEIFRPGGGDMRDNWRSELYKYKVVKIEPLGVAPTKAKRRVVQREKRAVGIGA
+FT LFLGFLGAAGSTMGAASMTLTVQARQLLSGIVQQQNNLLRAIEAQQHLLQLTVWGIKQL
+FT QARILAVERYLKDQQLLGIWGCSGKLICTTAVPWNASWSNKSLEQIWNHTTWMEWDREI
+FT NNYTSLIHSLIEESQNQQEKNEQELLELDKWASLWNWFNITNWLWYIKLFIMIVGGLVG
+FT LRIVFAVLSIVNRVRQGYSPLSFQTHLPTPRGPDRPEGIEEEGGERDRDRSIRLVNGSL
+FT ALIWDDLRSLCLFSYHRLRDLLLIVTRIVELLGRRGWEALKYWWNLLQYWSQELKNSAV
+FT SLLNATAIAVAEGTDRVIEVVQGACRAIRHIPRRIRQGLERILL"
+FT sig_peptide 5771..5854
+FT /locus_tag="HIV1gp8"
+FT /db_xref="GI:28876543"
+FT /protein_id="NP_579893.2"
+FT /gene="env"
+FT /product="hypothetical protein"
+FT mat_peptide 5855..7303
+FT /locus_tag="HIV1gp8"
+FT /db_xref="GI:28876544"
+FT /experiment="DESCRIPTION:[PMID:24179160]"
+FT /protein_id="NP_579894.2"
+FT /gene="env"
+FT /product="Envelope surface glycoprotein gp120"
+FT /note="mediates binding of HIV-1 to CD4 and cellular co-
+FT receptors; cooperates with gp41 to mediate fusion of viral
+FT membrane with cellular membrane during virus entry into
+FT cells; Envelope surface unit; SU"
+FT mat_peptide 7304..8338
+FT /locus_tag="HIV1gp8"
+FT /db_xref="GI:19172954"
+FT /protein_id="NP_579895.1"
+FT /gene="env"
+FT /product="Envelope transmembrane domain"
+FT /note="cooperates with gp120 to mediate fusion of viral
+FT membrane with cellular membrane during virus entry into
+FT cells; Envelope transmembrane glycoprotein gp41; TM"
+FT gene complement(6919..7488)
+FT /locus_tag="HIV1gp10"
+FT /db_xref="GeneID:19424028"
+FT /gene="asp"
+FT CDS complement(6919..7488)
+FT /locus_tag="HIV1gp10"
+FT /protein_id="YP_009028572.1"
+FT /gene="asp"
+FT /note="minus-strand-encoded antisense protein of unknown
+FT function; region is highly conserved and likely analogous
+FT to the HTLV-1 encoded, HBZ, a nuclear basic region leucine
+FT zipper (b-ZIP) protein"
+FT /db_xref="GI:636665534"
+FT /db_xref="GeneID:19424028"
+FT /codon_start=1
+FT /product="Asp"
+FT /translation="MPQTVSCNRCCCASIALSKLFCCCTIPDNNCLACTVSVIEAAPIV
+FT LPAAPKNPRNKAPIPTALFSLCTTLLFALVGATPNGSIFTTLYLYNSLLQLSLISPPPG
+FT LKISDSLLLLPPSLVNSSPVIFDEHLICPLMGGAYIAFPTFCHMFIICFILHGRVIVSL
+FT PSVLFDPSVLQVLLNQVLLNSCVELQ"
+FT gene 8343..8963
+FT /locus_tag="HIV1gp9"
+FT /db_xref="GeneID:156110"
+FT /gene="nef"
+FT CDS 8343..8963
+FT /locus_tag="HIV1gp9"
+FT /protein_id="NP_057857.2"
+FT /gene="nef"
+FT /note="p27; negative factor; viral accessory protein;
+FT important for virus replication in vivo; determinant of
+FT HIV-1 pathogenesis; down-regulates cell surface CD4 and
+FT MHC class I molecules; enhances virus infectivity through;
+FT interactions with multiple cellular signaling proteins;
+FT This particular nucleotide sequence has a premature stop
+FT codon in place of a well-conserved tryptophan codon at
+FT position 8712-8714 that truncates the HIV1 Nef protein
+FT sequence to a 123 amino acids-long N-terminal portion (not
+FT shown)"
+FT /db_xref="GI:28872818"
+FT /db_xref="GeneID:156110"
+FT /codon_start=1
+FT /translation="MGGKWSKSSVIGWPTVRERMRRAEPAADRVGAASRDLEKHGAITS
+FT SNTAATNAACAWLEAQEEEEVGFPVTPQVPLRPMTYKAAVDLSHFLKEKGGLEGLIHSQ
+FT RRQDILDLWIYHTQGYFPDWQNYTPGPGVRYPLTFGWCYKLVPVEPDKIEEANKGENTS
+FT LLHPVSLHGMDDPEREVLEWRFDSRLAFHHVARELHPEYFKNC"
+FT /product="Nef"
+FT /transl_except=(pos:8712..8714,aa:Trp)
+FT 3'UTR 8631..9085
+FT misc_feature 9086..9181
+FT /note="repeat; positions of RNA transcription
+FT initialization and polyadenylation; Region: R"
+FT regulatory 9158..9163
+FT /regulatory_class="polyA_signal_sequence"
+FT /note="both 5' and 3' poly A signals are transcribed into
+FT RNA, but the 5' one is suppressed"
+XX
+SQ Sequence 9181 BP; 3272 A; 1642 C; 2225 G; 2042 T; 0 other;
+ ggtctctctg gttagaccag atctgagcct gggagctctc tggctaacta gggaacccac 60
+ tgcttaagcc tcaataaagc ttgccttgag tgcttcaagt agtgtgtgcc cgtctgttgt 120
+ gtgactctgg taactagaga tccctcagac ccttttagtc agtgtggaaa atctctagca 180
+ gtggcgcccg aacagggacc tgaaagcgaa agggaaacca gaggagctct ctcgacgcag 240
+ gactcggctt gctgaagcgc gcacggcaag aggcgagggg cggcgactgg tgagtacgcc 300
+ aaaaattttg actagcggag gctagaagga gagagatggg tgcgagagcg tcagtattaa 360
+ gcgggggaga attagatcga tgggaaaaaa ttcggttaag gccaggggga aagaaaaaat 420
+ ataaattaaa acatatagta tgggcaagca gggagctaga acgattcgca gttaatcctg 480
+ gcctgttaga aacatcagaa ggctgtagac aaatactggg acagctacaa ccatcccttc 540
+ agacaggatc agaagaactt agatcattat ataatacagt agcaaccctc tattgtgtgc 600
+ atcaaaggat agagataaaa gacaccaagg aagctttaga caagatagag gaagagcaaa 660
+ acaaaagtaa gaaaaaagca cagcaagcag cagctgacac aggacacagc aatcaggtca 720
+ gccaaaatta ccctatagtg cagaacatcc aggggcaaat ggtacatcag gccatatcac 780
+ ctagaacttt aaatgcatgg gtaaaagtag tagaagagaa ggctttcagc ccagaagtga 840
+ tacccatgtt ttcagcatta tcagaaggag ccaccccaca agatttaaac accatgctaa 900
+ acacagtggg gggacatcaa gcagccatgc aaatgttaaa agagaccatc aatgaggaag 960
+ ctgcagaatg ggatagagtg catccagtgc atgcagggcc tattgcacca ggccagatga 1020
+ gagaaccaag gggaagtgac atagcaggaa ctactagtac ccttcaggaa caaataggat 1080
+ ggatgacaaa taatccacct atcccagtag gagaaattta taaaagatgg ataatcctgg 1140
+ gattaaataa aatagtaaga atgtatagcc ctaccagcat tctggacata agacaaggac 1200
+ caaaggaacc ctttagagac tatgtagacc ggttctataa aactctaaga gccgagcaag 1260
+ cttcacagga ggtaaaaaat tggatgacag aaaccttgtt ggtccaaaat gcgaacccag 1320
+ attgtaagac tattttaaaa gcattgggac cagcggctac actagaagaa atgatgacag 1380
+ catgtcaggg agtaggagga cccggccata aggcaagagt tttggctgaa gcaatgagcc 1440
+ aagtaacaaa ttcagctacc ataatgatgc agagaggcaa ttttaggaac caaagaaaga 1500
+ ttgttaagtg tttcaattgt ggcaaagaag ggcacacagc cagaaattgc agggccccta 1560
+ ggaaaaaggg ctgttggaaa tgtggaaagg aaggacacca aatgaaagat tgtactgaga 1620
+ gacaggctaa ttttttaggg aagatctggc cttcctacaa gggaaggcca gggaattttc 1680
+ ttcagagcag accagagcca acagccccac cagaagagag cttcaggtct ggggtagaga 1740
+ caacaactcc ccctcagaag caggagccga tagacaagga actgtatcct ttaacttccc 1800
+ tcaggtcact ctttggcaac gacccctcgt cacaataaag ataggggggc aactaaagga 1860
+ agctctatta gatacaggag cagatgatac agtattagaa gaaatgagtt tgccaggaag 1920
+ atggaaacca aaaatgatag ggggaattgg aggttttatc aaagtaagac agtatgatca 1980
+ gatactcata gaaatctgtg gacataaagc tataggtaca gtattagtag gacctacacc 2040
+ tgtcaacata attggaagaa atctgttgac tcagattggt tgcactttaa attttcccat 2100
+ tagccctatt gagactgtac cagtaaaatt aaagccagga atggatggcc caaaagttaa 2160
+ acaatggcca ttgacagaag aaaaaataaa agcattagta gaaatttgta cagagatgga 2220
+ aaaggaaggg aaaatttcaa aaattgggcc tgaaaatcca tacaatactc cagtatttgc 2280
+ cataaagaaa aaagacagta ctaaatggag aaaattagta gatttcagag aacttaataa 2340
+ gagaactcaa gacttctggg aagttcaatt aggaatacca catcccgcag ggttaaaaaa 2400
+ gaaaaaatca gtaacagtac tggatgtggg tgatgcatat ttttcagttc ccttagatga 2460
+ agacttcagg aagtatactg catttaccat acctagtata aacaatgaga caccagggat 2520
+ tagatatcag tacaatgtgc ttccacaggg atggaaagga tcaccagcaa tattccaaag 2580
+ tagcatgaca aaaatcttag agccttttag aaaacaaaat ccagacatag ttatctatca 2640
+ atacatggat gatttgtatg taggatctga cttagaaata gggcagcata gaacaaaaat 2700
+ agaggagctg agacaacatc tgttgaggtg gggacttacc acaccagaca aaaaacatca 2760
+ gaaagaacct ccattccttt ggatgggtta tgaactccat cctgataaat ggacagtaca 2820
+ gcctatagtg ctgccagaaa aagacagctg gactgtcaat gacatacaga agttagtggg 2880
+ gaaattgaat tgggcaagtc agatttaccc agggattaaa gtaaggcaat tatgtaaact 2940
+ ccttagagga accaaagcac taacagaagt aataccacta acagaagaag cagagctaga 3000
+ actggcagaa aacagagaga ttctaaaaga accagtacat ggagtgtatt atgacccatc 3060
+ aaaagactta atagcagaaa tacagaagca ggggcaaggc caatggacat atcaaattta 3120
+ tcaagagcca tttaaaaatc tgaaaacagg aaaatatgca agaatgaggg gtgcccacac 3180
+ taatgatgta aaacaattaa cagaggcagt gcaaaaaata accacagaaa gcatagtaat 3240
+ atggggaaag actcctaaat ttaaactgcc catacaaaag gaaacatggg aaacatggtg 3300
+ gacagagtat tggcaagcca cctggattcc tgagtgggag tttgttaata cccctccctt 3360
+ agtgaaatta tggtaccagt tagagaaaga acccatagta ggagcagaaa ccttctatgt 3420
+ agatggggca gctaacaggg agactaaatt aggaaaagca ggatatgtta ctaatagagg 3480
+ aagacaaaaa gttgtcaccc taactgacac aacaaatcag aagactgagt tacaagcaat 3540
+ ttatctagct ttgcaggatt cgggattaga agtaaacata gtaacagact cacaatatgc 3600
+ attaggaatc attcaagcac aaccagatca aagtgaatca gagttagtca atcaaataat 3660
+ agagcagtta ataaaaaagg aaaaggtcta tctggcatgg gtaccagcac acaaaggaat 3720
+ tggaggaaat gaacaagtag ataaattagt cagtgctgga atcaggaaag tactattttt 3780
+ agatggaata gataaggccc aagatgaaca tgagaaatat cacagtaatt ggagagcaat 3840
+ ggctagtgat tttaacctgc cacctgtagt agcaaaagaa atagtagcca gctgtgataa 3900
+ atgtcagcta aaaggagaag ccatgcatgg acaagtagac tgtagtccag gaatatggca 3960
+ actagattgt acacatttag aaggaaaagt tatcctggta gcagttcatg tagccagtgg 4020
+ atatatagaa gcagaagtta ttccagcaga aacagggcag gaaacagcat attttctttt 4080
+ aaaattagca ggaagatggc cagtaaaaac aatacatact gacaatggca gcaatttcac 4140
+ cggtgctacg gttagggccg cctgttggtg ggcgggaatc aagcaggaat ttggaattcc 4200
+ ctacaatccc caaagtcaag gagtagtaga atctatgaat aaagaattaa agaaaattat 4260
+ aggacaggta agagatcagg ctgaacatct taagacagca gtacaaatgg cagtattcat 4320
+ ccacaatttt aaaagaaaag gggggattgg ggggtacagt gcaggggaaa gaatagtaga 4380
+ cataatagca acagacatac aaactaaaga attacaaaaa caaattacaa aaattcaaaa 4440
+ ttttcgggtt tattacaggg acagcagaaa tccactttgg aaaggaccag caaagctcct 4500
+ ctggaaaggt gaaggggcag tagtaataca agataatagt gacataaaag tagtgccaag 4560
+ aagaaaagca aagatcatta gggattatgg aaaacagatg gcaggtgatg attgtgtggc 4620
+ aagtagacag gatgaggatt agaacatgga aaagtttagt aaaacaccat atgtatgttt 4680
+ cagggaaagc taggggatgg ttttatagac atcactatga aagccctcat ccaagaataa 4740
+ gttcagaagt acacatccca ctaggggatg ctagattggt aataacaaca tattggggtc 4800
+ tgcatacagg agaaagagac tggcatttgg gtcagggagt ctccatagaa tggaggaaaa 4860
+ agagatatag cacacaagta gaccctgaac tagcagacca actaattcat ctgtattact 4920
+ ttgactgttt ttcagactct gctataagaa aggccttatt aggacacata gttagcccta 4980
+ ggtgtgaata tcaagcagga cataacaagg taggatctct acaatacttg gcactagcag 5040
+ cattaataac accaaaaaag ataaagccac ctttgcctag tgttacgaaa ctgacagagg 5100
+ atagatggaa caagccccag aagaccaagg gccacagagg gagccacaca atgaatggac 5160
+ actagagctt ttagaggagc ttaagaatga agctgttaga cattttccta ggatttggct 5220
+ ccatggctta gggcaacata tctatgaaac ttatggggat acttgggcag gagtggaagc 5280
+ cataataaga attctgcaac aactgctgtt tatccatttt cagaattggg tgtcgacata 5340
+ gcagaatagg cgttactcga cagaggagag caagaaatgg agccagtaga tcctagacta 5400
+ gagccctgga agcatccagg aagtcagcct aaaactgctt gtaccaattg ctattgtaaa 5460
+ aagtgttgct ttcattgcca agtttgtttc ataacaaaag ccttaggcat ctcctatggc 5520
+ aggaagaagc ggagacagcg acgaagagct catcagaaca gtcagactca tcaagcttct 5580
+ ctatcaaagc agtaagtagt acatgtaatg caacctatac caatagtagc aatagtagca 5640
+ ttagtagtag caataataat agcaatagtt gtgtggtcca tagtaatcat agaatatagg 5700
+ aaaatattaa gacaaagaaa aatagacagg ttaattgata gactaataga aagagcagaa 5760
+ gacagtggca atgagagtga aggagaaata tcagcacttg tggagatggg ggtggagatg 5820
+ gggcaccatg ctccttggga tgttgatgat ctgtagtgct acagaaaaat tgtgggtcac 5880
+ agtctattat ggggtacctg tgtggaagga agcaaccacc actctatttt gtgcatcaga 5940
+ tgctaaagca tatgatacag aggtacataa tgtttgggcc acacatgcct gtgtacccac 6000
+ agaccccaac ccacaagaag tagtattggt aaatgtgaca gaaaatttta acatgtggaa 6060
+ aaatgacatg gtagaacaga tgcatgagga tataatcagt ttatgggatc aaagcctaaa 6120
+ gccatgtgta aaattaaccc cactctgtgt tagtttaaag tgcactgatt tgaagaatga 6180
+ tactaatacc aatagtagta gcgggagaat gataatggag aaaggagaga taaaaaactg 6240
+ ctctttcaat atcagcacaa gcataagagg taaggtgcag aaagaatatg cattttttta 6300
+ taaacttgat ataataccaa tagataatga tactaccagc tataagttga caagttgtaa 6360
+ cacctcagtc attacacagg cctgtccaaa ggtatccttt gagccaattc ccatacatta 6420
+ ttgtgccccg gctggttttg cgattctaaa atgtaataat aagacgttca atggaacagg 6480
+ accatgtaca aatgtcagca cagtacaatg tacacatgga attaggccag tagtatcaac 6540
+ tcaactgctg ttaaatggca gtctagcaga agaagaggta gtaattagat ctgtcaattt 6600
+ cacggacaat gctaaaacca taatagtaca gctgaacaca tctgtagaaa ttaattgtac 6660
+ aagacccaac aacaatacaa gaaaaagaat ccgtatccag agaggaccag ggagagcatt 6720
+ tgttacaata ggaaaaatag gaaatatgag acaagcacat tgtaacatta gtagagcaaa 6780
+ atggaataac actttaaaac agatagctag caaattaaga gaacaatttg gaaataataa 6840
+ aacaataatc tttaagcaat cctcaggagg ggacccagaa attgtaacgc acagttttaa 6900
+ ttgtggaggg gaatttttct actgtaattc aacacaactg tttaatagta cttggtttaa 6960
+ tagtacttgg agtactgaag ggtcaaataa cactgaagga agtgacacaa tcaccctccc 7020
+ atgcagaata aaacaaatta taaacatgtg gcagaaagta ggaaaagcaa tgtatgcccc 7080
+ tcccatcagt ggacaaatta gatgttcatc aaatattaca gggctgctat taacaagaga 7140
+ tggtggtaat agcaacaatg agtccgagat cttcagacct ggaggaggag atatgaggga 7200
+ caattggaga agtgaattat ataaatataa agtagtaaaa attgaaccat taggagtagc 7260
+ acccaccaag gcaaagagaa gagtggtgca gagagaaaaa agagcagtgg gaataggagc 7320
+ tttgttcctt gggttcttgg gagcagcagg aagcactatg ggcgcagcct caatgacgct 7380
+ gacggtacag gccagacaat tattgtctgg tatagtgcag cagcagaaca atttgctgag 7440
+ ggctattgag gcgcaacagc atctgttgca actcacagtc tggggcatca agcagctcca 7500
+ ggcaagaatc ctggctgtgg aaagatacct aaaggatcaa cagctcctgg ggatttgggg 7560
+ ttgctctgga aaactcattt gcaccactgc tgtgccttgg aatgctagtt ggagtaataa 7620
+ atctctggaa cagatttgga atcacacgac ctggatggag tgggacagag aaattaacaa 7680
+ ttacacaagc ttaatacact ccttaattga agaatcgcaa aaccagcaag aaaagaatga 7740
+ acaagaatta ttggaattag ataaatgggc aagtttgtgg aattggttta acataacaaa 7800
+ ttggctgtgg tatataaaat tattcataat gatagtagga ggcttggtag gtttaagaat 7860
+ agtttttgct gtactttcta tagtgaatag agttaggcag ggatattcac cattatcgtt 7920
+ tcagacccac ctcccaaccc cgaggggacc cgacaggccc gaaggaatag aagaagaagg 7980
+ tggagagaga gacagagaca gatccattcg attagtgaac ggatccttgg cacttatctg 8040
+ ggacgatctg cggagcctgt gcctcttcag ctaccaccgc ttgagagact tactcttgat 8100
+ tgtaacgagg attgtggaac ttctgggacg cagggggtgg gaagccctca aatattggtg 8160
+ gaatctccta cagtattgga gtcaggaact aaagaatagt gctgttagct tgctcaatgc 8220
+ cacagccata gcagtagctg aggggacaga tagggttata gaagtagtac aaggagcttg 8280
+ tagagctatt cgccacatac ctagaagaat aagacagggc ttggaaagga ttttgctata 8340
+ agatgggtgg caagtggtca aaaagtagtg tgattggatg gcctactgta agggaaagaa 8400
+ tgagacgagc tgagccagca gcagataggg tgggagcagc atctcgagac ctggaaaaac 8460
+ atggagcaat cacaagtagc aatacagcag ctaccaatgc tgcttgtgcc tggctagaag 8520
+ cacaagagga ggaggaggtg ggttttccag tcacacctca ggtaccttta agaccaatga 8580
+ cttacaaggc agctgtagat cttagccact ttttaaaaga aaagggggga ctggaagggc 8640
+ taattcactc ccaaagaaga caagatatcc ttgatctgtg gatctaccac acacaaggct 8700
+ acttccctga ttagcagaac tacacaccag ggccaggggt cagatatcca ctgacctttg 8760
+ gatggtgcta caagctagta ccagttgagc cagataagat agaagaggcc aataaaggag 8820
+ agaacaccag cttgttacac cctgtgagcc tgcatgggat ggatgacccg gagagagaag 8880
+ tgttagagtg gaggtttgac agccgcctag catttcatca cgtggcccga gagctgcatc 8940
+ cggagtactt caagaactgc tgacatcgag cttgctacaa gggactttcc gctggggact 9000
+ ttccagggag gcgtggcctg ggcgggactg gggagtggcg agccctcaga tcctgcatat 9060
+ aagcagctgc tttttgcctg tactgggtct ctctggttag accagatctg agcctgggag 9120
+ ctctctggct aactagggaa cccactgctt aagcctcaat aaagcttgcc ttgagtgctt 9180
+ c 9181
+//
diff --git a/iva/tests/edge_test.py b/iva/tests/edge_test.py
index 03a705a..d037a88 100644
--- a/iva/tests/edge_test.py
+++ b/iva/tests/edge_test.py
@@ -1,7 +1,7 @@
import unittest
import copy
from iva import edge
-from fastaq import intervals
+from pyfastaq import intervals
class TestEdge(unittest.TestCase):
def test_open_end(self):
diff --git a/iva/tests/egg_extract_test.py b/iva/tests/egg_extract_test.py
new file mode 100644
index 0000000..65b7c57
--- /dev/null
+++ b/iva/tests/egg_extract_test.py
@@ -0,0 +1,105 @@
+import unittest
+import sys
+import shutil
+import os
+import filecmp
+from iva import egg_extract
+
+modules_dir = os.path.dirname(os.path.abspath(egg_extract.__file__))
+data_dir = os.path.join(modules_dir, 'tests', 'data')
+
+class TestEggExtract(unittest.TestCase):
+ def setUp(self):
+ self.zipped = os.path.join(data_dir, 'egg_extract_test_dir.zip')
+ self.not_zipped = os.path.join(data_dir, 'egg_extract_test_dir')
+ self.egg_zipped = egg_extract.Extractor(self.zipped)
+ self.egg_not_zipped = egg_extract.Extractor(self.not_zipped)
+
+
+ def test_init_and_write(self):
+ '''test __init__'''
+ with self.assertRaises(egg_extract.Error):
+ egg_extract.Extractor('notafilethrowerror')
+
+ self.assertTrue(self.egg_zipped.zip_file is not None)
+ self.assertTrue(self.egg_not_zipped.zip_file is None)
+
+
+ def test_copy_file_unzipped(self):
+ '''test _copy_file_unzipped'''
+ to_copy = os.path.join('iva', 'gage', 'gage1')
+ outfile = 'tmp.copy_file_unzipped'
+ if os.path.exists(outfile):
+ os.unlink(outfile)
+ self.egg_not_zipped._copy_file_unzipped(to_copy, outfile)
+ self.assertTrue(os.path.exists(outfile))
+ os.unlink(outfile)
+
+
+ def test_copy_file_zipped(self):
+ '''test _copy_file_zipped'''
+ outfile = 'tmp.copy_file_zipped'
+ if os.path.exists(outfile):
+ os.unlink(outfile)
+ to_copy = os.path.join('iva', 'gage', 'gage1')
+ self.egg_zipped._copy_file_zipped(to_copy, outfile)
+ self.assertTrue(os.path.exists(outfile))
+ os.unlink(outfile)
+
+
+ def test_copy_file(self):
+ '''test copy_file'''
+ outfile = 'tmp.copy_file'
+ for egg in [self.egg_zipped, self.egg_not_zipped]:
+ if os.path.exists(outfile):
+ os.unlink(outfile)
+ to_copy = os.path.join('iva', 'gage', 'gage1')
+ egg.copy_file(to_copy, outfile)
+ self.assertTrue(os.path.exists(outfile))
+ os.unlink(outfile)
+
+
+ def test_copy_dir_unzipped(self):
+ '''test copy_dir_unzipped'''
+ outdir = 'tmp.copy_dir_unzipped'
+ to_copy = os.path.join('iva', 'gage')
+ if os.path.exists(outdir):
+ shutil.rmtree(outdir)
+
+ self.egg_not_zipped._copy_dir_unzipped(to_copy, outdir)
+ self.assertTrue(os.path.exists(outdir))
+ self.assertTrue(os.path.isdir(outdir))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage1')))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage2')))
+ shutil.rmtree(outdir)
+
+
+ def test_copy_dir_zipped(self):
+ '''test copy_dir_zipped'''
+ outdir = 'tmp.copy_dir_zipped'
+ to_copy = os.path.join('iva', 'gage')
+ if os.path.exists(outdir):
+ shutil.rmtree(outdir)
+
+ self.egg_zipped._copy_dir_zipped(to_copy, outdir)
+ self.assertTrue(os.path.exists(outdir))
+ self.assertTrue(os.path.isdir(outdir))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage1')))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage2')))
+ shutil.rmtree(outdir)
+
+
+ def test_copy_dir(self):
+ '''test copy_dir'''
+ outdir = 'tmp.copy_dir'
+ for egg in [self.egg_zipped, self.egg_not_zipped]:
+ if os.path.exists(outdir):
+ shutil.rmtree(outdir)
+ to_copy = os.path.join('iva', 'gage')
+ egg.copy_dir(to_copy, outdir)
+ self.assertTrue(os.path.exists(outdir))
+ self.assertTrue(os.path.isdir(outdir))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage1')))
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage2')))
+ shutil.rmtree(outdir)
+
diff --git a/iva/tests/graph_test.py b/iva/tests/graph_test.py
index 928ad29..7ba446e 100644
--- a/iva/tests/graph_test.py
+++ b/iva/tests/graph_test.py
@@ -3,7 +3,7 @@ import os
import filecmp
import pysam
from iva import graph, assembly, edge
-from fastaq import intervals
+from pyfastaq import intervals
modules_dir = os.path.dirname(os.path.abspath(graph.__file__))
data_dir = os.path.join(modules_dir, 'tests', 'data')
diff --git a/iva/tests/mapping_test.py b/iva/tests/mapping_test.py
index 4e11d3a..3b20e62 100644
--- a/iva/tests/mapping_test.py
+++ b/iva/tests/mapping_test.py
@@ -4,7 +4,7 @@ import shutil
import os
import filecmp
import pysam
-import fastaq
+import pyfastaq
from iva import mapping
modules_dir = os.path.dirname(os.path.abspath(mapping.__file__))
@@ -193,8 +193,8 @@ class TestMapping(unittest.TestCase):
def test_sam_to_fasta(self):
'''Test sam_to_fasta'''
expected_seqs = {}
- fastaq.tasks.file_to_dict(os.path.join(data_dir, 'mapping_test.reads_1.fasta'), expected_seqs)
- fastaq.tasks.file_to_dict(os.path.join(data_dir, 'mapping_test.reads_2.fasta'), expected_seqs)
+ pyfastaq.tasks.file_to_dict(os.path.join(data_dir, 'mapping_test.reads_1.fasta'), expected_seqs)
+ pyfastaq.tasks.file_to_dict(os.path.join(data_dir, 'mapping_test.reads_2.fasta'), expected_seqs)
sam_reader = pysam.Samfile(os.path.join(data_dir, 'mapping_test.smalt.out.bam'), "rb")
for sam in sam_reader.fetch(until_eof=True):
fa = mapping.sam_to_fasta(sam)
diff --git a/iva/tests/mummer_test.py b/iva/tests/mummer_test.py
index 809a17b..2ed0a35 100644
--- a/iva/tests/mummer_test.py
+++ b/iva/tests/mummer_test.py
@@ -2,7 +2,7 @@ import unittest
import os
import filecmp
import pysam
-import fastaq
+import pyfastaq
from iva import mummer, edge
modules_dir = os.path.dirname(os.path.abspath(mummer.__file__))
@@ -40,7 +40,7 @@ class TestMummer(unittest.TestCase):
]
for h in hits:
m = mummer.NucmerHit(h)
- self.assertEqual(fastaq.intervals.Interval(0,99), m.qry_coords())
+ self.assertEqual(pyfastaq.intervals.Interval(0,99), m.qry_coords())
def test_ref_coords(self):
@@ -50,7 +50,7 @@ class TestMummer(unittest.TestCase):
]
for h in hits:
m = mummer.NucmerHit(h)
- self.assertEqual(fastaq.intervals.Interval(0,99), m.ref_coords())
+ self.assertEqual(pyfastaq.intervals.Interval(0,99), m.ref_coords())
def test_on_same_strand(self):
diff --git a/iva/tests/qc_external_test.py b/iva/tests/qc_external_test.py
new file mode 100644
index 0000000..bf620df
--- /dev/null
+++ b/iva/tests/qc_external_test.py
@@ -0,0 +1,33 @@
+import unittest
+import os
+import shutil
+from iva import qc_external
+
+modules_dir = os.path.dirname(os.path.abspath(qc_external.__file__))
+data_dir = os.path.join(modules_dir, 'tests', 'data')
+
+
+class TestQcExternal(unittest.TestCase):
+ def test_run_gage(self):
+ '''test run_gage'''
+ ref = os.path.join(data_dir, 'qc_external_test_run_gage.ref.fa')
+ scaffs = os.path.join(data_dir, 'qc_external_test_run_gage.scaffs.fa')
+ outdir = 'tmp.qc_external_test_run_gage'
+ qc_external.run_gage(ref, scaffs, outdir)
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'gage.out')))
+ with open(os.path.join(outdir, 'gage.out')) as f:
+ got_lines = f.readlines()
+ with open(os.path.join(data_dir, 'qc_external_test_run_gage.out')) as f:
+ expected_lines = f.readlines()
+ self.assertEqual(got_lines[1:], expected_lines[1:])
+ shutil.rmtree(outdir)
+
+
+ def test_run_ratt(self):
+ '''test run_ratt'''
+ embl_dir = os.path.join(data_dir, 'qc_external_test_run_ratt.embl')
+ assembly = os.path.join(data_dir, 'qc_external_test_run_ratt.assembly.fa')
+ outdir = 'tmp.qc_external_test_run_ratt'
+ qc_external.run_ratt(embl_dir, assembly, outdir)
+ self.assertTrue(os.path.exists(os.path.join(outdir, 'run.sh.out')))
+ shutil.rmtree(outdir)
diff --git a/iva/tests/qc_test.py b/iva/tests/qc_test.py
index f228446..48c47cc 100644
--- a/iva/tests/qc_test.py
+++ b/iva/tests/qc_test.py
@@ -3,7 +3,7 @@ import os
import filecmp
import pysam
from iva import qc, mummer
-import fastaq
+import pyfastaq
modules_dir = os.path.dirname(os.path.abspath(qc.__file__))
data_dir = os.path.join(modules_dir, 'tests', 'data')
@@ -48,14 +48,14 @@ class TestQc(unittest.TestCase):
self.qc.ref_gff = os.path.join(data_dir, 'qc_test.get_ref_cds_from_gff.gff')
coords = self.qc._get_ref_cds_from_gff()
expected = {
- 'contig1': [(fastaq.intervals.Interval(1200, 1499), '+'),
- (fastaq.intervals.Interval(2999, 3901), '+'),
- (fastaq.intervals.Interval(4999, 5499), '+'),
- (fastaq.intervals.Interval(6999, 7599), '+'),
- (fastaq.intervals.Interval(10010, 10041), '-'),
- (fastaq.intervals.Interval(10089, 10999), '-')],
- 'contig2': [(fastaq.intervals.Interval(120, 149), '+'),
- (fastaq.intervals.Interval(299, 391), '+')]
+ 'contig1': [(pyfastaq.intervals.Interval(1200, 1499), '+'),
+ (pyfastaq.intervals.Interval(2999, 3901), '+'),
+ (pyfastaq.intervals.Interval(4999, 5499), '+'),
+ (pyfastaq.intervals.Interval(6999, 7599), '+'),
+ (pyfastaq.intervals.Interval(10010, 10041), '-'),
+ (pyfastaq.intervals.Interval(10089, 10999), '-')],
+ 'contig2': [(pyfastaq.intervals.Interval(120, 149), '+'),
+ (pyfastaq.intervals.Interval(299, 391), '+')]
}
self.assertEqual(coords, expected)
@@ -63,14 +63,14 @@ class TestQc(unittest.TestCase):
def test_write_cds_seqs(self):
'''test _write_cds_seqs'''
- seq = fastaq.sequences.Fasta('seq', 'AGGTGTCACGTGTGTGTCATTCAGGGCA')
- cds_list = [(fastaq.intervals.Interval(1,3), '+'),
- (fastaq.intervals.Interval(7, 9), '-'),
+ seq = pyfastaq.sequences.Fasta('seq', 'AGGTGTCACGTGTGTGTCATTCAGGGCA')
+ cds_list = [(pyfastaq.intervals.Interval(1,3), '+'),
+ (pyfastaq.intervals.Interval(7, 9), '-'),
]
outfile = 'tmp.write_cds_seqs.fa'
- f = fastaq.utils.open_file_write(outfile)
+ f = pyfastaq.utils.open_file_write(outfile)
self.qc._write_cds_seqs(cds_list, seq, f)
- fastaq.utils.close(f)
+ pyfastaq.utils.close(f)
self.assertTrue(filecmp.cmp(outfile, os.path.join(data_dir, 'qc_test.write_cds_seqs.fa'), shallow=False))
os.unlink(outfile)
@@ -121,7 +121,7 @@ class TestQc(unittest.TestCase):
def test_has_orf(self):
'''test _has_orf'''
- fa = fastaq.sequences.Fasta('a', 'ggggTAAxTAAxTAATTAxTTAxTTAgg')
+ fa = pyfastaq.sequences.Fasta('a', 'ggggTAAxTAAxTAATTAxTTAxTTAgg')
self.assertFalse(self.qc._has_orf(fa, 0, 30, 30))
self.assertTrue(self.qc._has_orf(fa, 0, 30, 20))
@@ -137,7 +137,7 @@ class TestQc(unittest.TestCase):
'A:23-784:+': {
'strand': '+',
'length_in_ref': 762,
- 'ref_coords': fastaq.intervals.Interval(22, 783),
+ 'ref_coords': pyfastaq.intervals.Interval(22, 783),
'ref_name': 'A',
'bases_assembled': 762,
'number_of_contig_hits': 1,
@@ -147,7 +147,7 @@ class TestQc(unittest.TestCase):
'B:3-1733:+': {
'strand': '+',
'length_in_ref': 1731,
- 'ref_coords': fastaq.intervals.Interval(2, 1732),
+ 'ref_coords': pyfastaq.intervals.Interval(2, 1732),
'ref_name': 'B',
'bases_assembled': 1731,
'number_of_contig_hits': 1,
@@ -157,7 +157,7 @@ class TestQc(unittest.TestCase):
'C:3-1385:+': {
'strand': '+',
'length_in_ref': 1383,
- 'ref_coords': fastaq.intervals.Interval(2, 1384),
+ 'ref_coords': pyfastaq.intervals.Interval(2, 1384),
'ref_name': 'C',
'bases_assembled': 1383,
'number_of_contig_hits': 1,
@@ -167,7 +167,7 @@ class TestQc(unittest.TestCase):
'D:1-1542:+': {
'strand': '+',
'length_in_ref': 1542,
- 'ref_coords': fastaq.intervals.Interval(0, 1541),
+ 'ref_coords': pyfastaq.intervals.Interval(0, 1541),
'ref_name': 'D',
'bases_assembled': 1000,
'number_of_contig_hits': 1,
@@ -177,7 +177,7 @@ class TestQc(unittest.TestCase):
'E:3-719:+': {
'strand': '+',
'length_in_ref': 717,
- 'ref_coords': fastaq.intervals.Interval(2, 718),
+ 'ref_coords': pyfastaq.intervals.Interval(2, 718),
'ref_name': 'E',
'bases_assembled': 499,
'number_of_contig_hits': 2,
@@ -187,7 +187,7 @@ class TestQc(unittest.TestCase):
'E:290-793:-': {
'strand': '-',
'length_in_ref': 504,
- 'ref_coords': fastaq.intervals.Interval(289, 792),
+ 'ref_coords': pyfastaq.intervals.Interval(289, 792),
'ref_name': 'E',
'bases_assembled': 301,
'number_of_contig_hits': 1,
@@ -197,7 +197,7 @@ class TestQc(unittest.TestCase):
'F:25-2298:+': {
'strand': '+',
'length_in_ref': 2274,
- 'ref_coords': fastaq.intervals.Interval(24, 2297),
+ 'ref_coords': pyfastaq.intervals.Interval(24, 2297),
'ref_name': 'F',
'bases_assembled': 2274,
'number_of_contig_hits': 2,
@@ -207,7 +207,7 @@ class TestQc(unittest.TestCase):
'G:19-2175:+': {
'strand': '+',
'length_in_ref': 2157,
- 'ref_coords': fastaq.intervals.Interval(18, 2174),
+ 'ref_coords': pyfastaq.intervals.Interval(18, 2174),
'ref_name': 'G',
'assembled': False,
'assembled_ok': False,
@@ -215,7 +215,7 @@ class TestQc(unittest.TestCase):
'H:1-2307:+': {
'strand': '+',
'length_in_ref': 2307,
- 'ref_coords': fastaq.intervals.Interval(0, 2306),
+ 'ref_coords': pyfastaq.intervals.Interval(0, 2306),
'ref_name': 'H',
'assembled': False,
'assembled_ok': False,
@@ -337,22 +337,22 @@ class TestQc(unittest.TestCase):
def test_invert_list(self):
'''test _invert_list'''
- self.assertEqual(self.qc._invert_list([], 42), [fastaq.intervals.Interval(0,41)])
+ self.assertEqual(self.qc._invert_list([], 42), [pyfastaq.intervals.Interval(0,41)])
coords = [
[],
- [fastaq.intervals.Interval(0, 41)],
- [fastaq.intervals.Interval(1, 41)],
- [fastaq.intervals.Interval(0, 40)],
- [fastaq.intervals.Interval(10, 30)],
- [fastaq.intervals.Interval(5, 10), fastaq.intervals.Interval(20, 30)],
+ [pyfastaq.intervals.Interval(0, 41)],
+ [pyfastaq.intervals.Interval(1, 41)],
+ [pyfastaq.intervals.Interval(0, 40)],
+ [pyfastaq.intervals.Interval(10, 30)],
+ [pyfastaq.intervals.Interval(5, 10), pyfastaq.intervals.Interval(20, 30)],
]
expected = [
- [fastaq.intervals.Interval(0, 41)],
+ [pyfastaq.intervals.Interval(0, 41)],
[],
- [fastaq.intervals.Interval(0, 0)],
- [fastaq.intervals.Interval(41, 41)],
- [fastaq.intervals.Interval(0, 9), fastaq.intervals.Interval(31, 41)],
- [fastaq.intervals.Interval(0, 4), fastaq.intervals.Interval(11, 19), fastaq.intervals.Interval(31, 41)],
+ [pyfastaq.intervals.Interval(0, 0)],
+ [pyfastaq.intervals.Interval(41, 41)],
+ [pyfastaq.intervals.Interval(0, 9), pyfastaq.intervals.Interval(31, 41)],
+ [pyfastaq.intervals.Interval(0, 4), pyfastaq.intervals.Interval(11, 19), pyfastaq.intervals.Interval(31, 41)],
]
assert len(coords) == len(expected)
for i in range(len(coords)):
@@ -368,13 +368,13 @@ class TestQc(unittest.TestCase):
self.qc._get_contig_hits_to_reference()
self.qc._calculate_ref_positions_covered_by_contigs()
expected = {
- 'A0': [fastaq.intervals.Interval(9, 239)],
- 'A': [fastaq.intervals.Interval(9, 1016)],
- 'B': [fastaq.intervals.Interval(0, 1777)],
- 'C': [fastaq.intervals.Interval(0, 1412)],
- 'D': [fastaq.intervals.Interval(0, 999)],
- 'E': [fastaq.intervals.Interval(0, 199), fastaq.intervals.Interval(399, 699)],
- 'F': [fastaq.intervals.Interval(0, 2340)],
+ 'A0': [pyfastaq.intervals.Interval(9, 239)],
+ 'A': [pyfastaq.intervals.Interval(9, 1016)],
+ 'B': [pyfastaq.intervals.Interval(0, 1777)],
+ 'C': [pyfastaq.intervals.Interval(0, 1412)],
+ 'D': [pyfastaq.intervals.Interval(0, 999)],
+ 'E': [pyfastaq.intervals.Interval(0, 199), pyfastaq.intervals.Interval(399, 699)],
+ 'F': [pyfastaq.intervals.Interval(0, 2340)],
}
self.assertEqual(expected['F'], self.qc.ref_pos_covered_by_contigs['F'])
@@ -421,12 +421,12 @@ class TestQc(unittest.TestCase):
'''test _contig_placement_in_reference'''
h1 = mummer.NucmerHit('\t'.join(['1', '90', '100', '10', '100', '100', '100.00', '100', '100', '1', '+', 'ref1', 'qry1']))
h2 = mummer.NucmerHit('\t'.join(['17', '36', '21', '40', '100', '100', '100.00', '100', '100', '1', '+', 'ref2', 'qry1']))
- expected = [(fastaq.intervals.Interval(9, 99), 'ref1', fastaq.intervals.Interval(0, 89), False, False)]
+ expected = [(pyfastaq.intervals.Interval(9, 99), 'ref1', pyfastaq.intervals.Interval(0, 89), False, False)]
self.assertEqual(self.qc._contig_placement_in_reference([h1]), expected)
expected = [
- (fastaq.intervals.Interval(9, 99), 'ref1', fastaq.intervals.Interval(0, 89), False, True),
- (fastaq.intervals.Interval(20, 39), 'ref2', fastaq.intervals.Interval(16, 35), True, True)
+ (pyfastaq.intervals.Interval(9, 99), 'ref1', pyfastaq.intervals.Interval(0, 89), False, True),
+ (pyfastaq.intervals.Interval(20, 39), 'ref2', pyfastaq.intervals.Interval(16, 35), True, True)
]
self.assertEqual(self.qc._contig_placement_in_reference([h1, h2]), expected)
@@ -440,15 +440,15 @@ class TestQc(unittest.TestCase):
self.qc.assembly_fasta = os.path.join(data_dir, 'qc_test.assembly.fasta')
self.qc._calculate_contig_placement()
expected_placement = {
- 'A:10-1017:+': [(fastaq.intervals.Interval(0, 230), 'A0', fastaq.intervals.Interval(9, 239), True, True),
- (fastaq.intervals.Interval(0, 1007), 'A', fastaq.intervals.Interval(9, 1016), True, True)],
- 'B:1-1778:-': [(fastaq.intervals.Interval(0, 1777), 'B', fastaq.intervals.Interval(0, 1777), False, False)],
- 'C:1-1413:+,E:1-200:-': [(fastaq.intervals.Interval(0, 1412), 'C', fastaq.intervals.Interval(0, 1412), True, False),
- (fastaq.intervals.Interval(1413, 1612), 'E', fastaq.intervals.Interval(0, 199), True, False)],
- 'D:1-1000:+': [(fastaq.intervals.Interval(0, 999), 'D', fastaq.intervals.Interval(0, 999), True, False)],
- 'E:400-700:-': [(fastaq.intervals.Interval(0, 300), 'E', fastaq.intervals.Interval(399, 699), False, False)],
- 'F:1-2341:+': [(fastaq.intervals.Interval(0, 2340), 'F', fastaq.intervals.Interval(0, 2340), True, False)],
- 'F:1-2341:-': [(fastaq.intervals.Interval(0, 2340), 'F', fastaq.intervals.Interval(0, 2340), False, False)],
+ 'A:10-1017:+': [(pyfastaq.intervals.Interval(0, 230), 'A0', pyfastaq.intervals.Interval(9, 239), True, True),
+ (pyfastaq.intervals.Interval(0, 1007), 'A', pyfastaq.intervals.Interval(9, 1016), True, True)],
+ 'B:1-1778:-': [(pyfastaq.intervals.Interval(0, 1777), 'B', pyfastaq.intervals.Interval(0, 1777), False, False)],
+ 'C:1-1413:+,E:1-200:-': [(pyfastaq.intervals.Interval(0, 1412), 'C', pyfastaq.intervals.Interval(0, 1412), True, False),
+ (pyfastaq.intervals.Interval(1413, 1612), 'E', pyfastaq.intervals.Interval(0, 199), True, False)],
+ 'D:1-1000:+': [(pyfastaq.intervals.Interval(0, 999), 'D', pyfastaq.intervals.Interval(0, 999), True, False)],
+ 'E:400-700:-': [(pyfastaq.intervals.Interval(0, 300), 'E', pyfastaq.intervals.Interval(399, 699), False, False)],
+ 'F:1-2341:+': [(pyfastaq.intervals.Interval(0, 2340), 'F', pyfastaq.intervals.Interval(0, 2340), True, False)],
+ 'F:1-2341:-': [(pyfastaq.intervals.Interval(0, 2340), 'F', pyfastaq.intervals.Interval(0, 2340), False, False)],
}
self.assertEqual(expected_placement, self.qc.contig_placement)
@@ -594,9 +594,9 @@ class TestQc(unittest.TestCase):
'''test _coverage_list_to_low_cov_intervals'''
l = [0, 1, 4, 5, 6, 6, 5, 0, 5, 6, 2, 1]
expected = [
- fastaq.intervals.Interval(0,2),
- fastaq.intervals.Interval(7,7),
- fastaq.intervals.Interval(10,11),
+ pyfastaq.intervals.Interval(0,2),
+ pyfastaq.intervals.Interval(7,7),
+ pyfastaq.intervals.Interval(10,11),
]
got = self.qc._coverage_list_to_low_cov_intervals(l)
self.assertEqual(expected, got)
@@ -621,30 +621,30 @@ class TestQc(unittest.TestCase):
self.qc._calculate_ref_read_region_coverage()
expected_low_cov_ref_regions_fwd = {
- 'ref1': [fastaq.intervals.Interval(0, 0), fastaq.intervals.Interval(6, 6)],
- 'ref2': [fastaq.intervals.Interval(0, 9)],
- 'ref3': [fastaq.intervals.Interval(0, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(0, 0), pyfastaq.intervals.Interval(6, 6)],
+ 'ref2': [pyfastaq.intervals.Interval(0, 9)],
+ 'ref3': [pyfastaq.intervals.Interval(0, 9)],
'ref4': [],
}
self.assertEqual(expected_low_cov_ref_regions_fwd, self.qc.low_cov_ref_regions_fwd)
expected_low_cov_ref_regions_rev = {
- 'ref1': [fastaq.intervals.Interval(0, 0), fastaq.intervals.Interval(3, 3)],
- 'ref2': [fastaq.intervals.Interval(0, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(0, 0), pyfastaq.intervals.Interval(3, 3)],
+ 'ref2': [pyfastaq.intervals.Interval(0, 9)],
'ref3': [],
- 'ref4': [fastaq.intervals.Interval(0, 9)],
+ 'ref4': [pyfastaq.intervals.Interval(0, 9)],
}
self.assertEqual(expected_low_cov_ref_regions_rev, self.qc.low_cov_ref_regions_rev)
expected_low_cov_ref_regions = {
- 'ref1': [fastaq.intervals.Interval(0, 0), fastaq.intervals.Interval(3, 3), fastaq.intervals.Interval(6, 6)],
- 'ref2': [fastaq.intervals.Interval(0, 9)],
- 'ref3': [fastaq.intervals.Interval(0, 9)],
- 'ref4': [fastaq.intervals.Interval(0, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(0, 0), pyfastaq.intervals.Interval(3, 3), pyfastaq.intervals.Interval(6, 6)],
+ 'ref2': [pyfastaq.intervals.Interval(0, 9)],
+ 'ref3': [pyfastaq.intervals.Interval(0, 9)],
+ 'ref4': [pyfastaq.intervals.Interval(0, 9)],
}
expected_ok_cov_ref_regions = {
- 'ref1': [fastaq.intervals.Interval(1, 2), fastaq.intervals.Interval(4, 5), fastaq.intervals.Interval(7, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(1, 2), pyfastaq.intervals.Interval(4, 5), pyfastaq.intervals.Interval(7, 9)],
'ref2': [],
'ref3': [],
'ref4': [],
@@ -672,7 +672,7 @@ class TestQc(unittest.TestCase):
def test_cov_to_R_string(self):
'''test _cov_to_R_string'''
- intervals = [fastaq.intervals.Interval(0, 42), fastaq.intervals.Interval(50, 51)]
+ intervals = [pyfastaq.intervals.Interval(0, 42), pyfastaq.intervals.Interval(50, 51)]
expected = 'rect(2, 0.5, 44, 1.5, col="blue", border=NA)\n' + \
'rect(52, 0.5, 53, 1.5, col="blue", border=NA)\n'
self.assertEqual(expected, self.qc._cov_to_R_string(intervals, 'blue', 2, 1, 1))
@@ -682,16 +682,16 @@ class TestQc(unittest.TestCase):
'''test _calculate_should_have_assembled'''
self.qc.ref_ids = ['ref1', 'ref2', 'ref3']
self.qc.ref_lengths = {x:10 for x in self.qc.ref_ids}
- self.qc.ref_pos_covered_by_contigs = {'ref1': [fastaq.intervals.Interval(3, 7)]}
+ self.qc.ref_pos_covered_by_contigs = {'ref1': [pyfastaq.intervals.Interval(3, 7)]}
self.qc.ok_cov_ref_regions = {
- 'ref1': [fastaq.intervals.Interval(0, 7)],
- 'ref2': [fastaq.intervals.Interval(0, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(0, 7)],
+ 'ref2': [pyfastaq.intervals.Interval(0, 9)],
'ref3': []
}
self.qc._calculate_should_have_assembled()
expected = {
- 'ref1': [fastaq.intervals.Interval(0, 2)],
- 'ref2': [fastaq.intervals.Interval(0, 9)],
+ 'ref1': [pyfastaq.intervals.Interval(0, 2)],
+ 'ref2': [pyfastaq.intervals.Interval(0, 9)],
'ref3': []
}
self.assertEqual(expected, self.qc.should_have_assembled)
diff --git a/iva/tests/seed_processor_test.py b/iva/tests/seed_processor_test.py
index e87dd0c..4f14d3e 100644
--- a/iva/tests/seed_processor_test.py
+++ b/iva/tests/seed_processor_test.py
@@ -1,7 +1,7 @@
import unittest
import os
import filecmp
-import fastaq
+import pyfastaq
from iva import seed_processor
modules_dir = os.path.dirname(os.path.abspath(seed_processor.__file__))
@@ -16,7 +16,7 @@ class TestSeedProcessor(unittest.TestCase):
tmp_out = 'tmp.process_seeds.out'
s = seed_processor.SeedProcessor(in_fasta, reads1, reads2, tmp_out, verbose=4, threads=1)
s.process()
- reader = fastaq.sequences.file_reader(tmp_out)
+ reader = pyfastaq.sequences.file_reader(tmp_out)
counter = 0
for seq in reader:
self.assertTrue(len(seq) > 470)
diff --git a/iva/tests/seed_test.py b/iva/tests/seed_test.py
index 38256c4..c099725 100644
--- a/iva/tests/seed_test.py
+++ b/iva/tests/seed_test.py
@@ -2,7 +2,7 @@ import unittest
import os
import filecmp
import shutil
-import fastaq
+import pyfastaq
from iva import seed
modules_dir = os.path.dirname(os.path.abspath(seed.__file__))
@@ -23,18 +23,18 @@ class TestSeed(unittest.TestCase):
def test_extension_from_read(self):
'''Test _test_extension_from_read'''
s = seed.Seed(seq='AGGCT')
- self.assertEqual(None, s._extension_from_read(fastaq.sequences.Fasta('x', 'AAAAA')))
- self.assertEqual(None, s._extension_from_read(fastaq.sequences.Fasta('x', 'AGGC')))
- self.assertEqual('A', s._extension_from_read(fastaq.sequences.Fasta('x', 'AGGCTA')))
- self.assertEqual('AT', s._extension_from_read(fastaq.sequences.Fasta('x', 'AGGCTAT')))
- self.assertEqual('AT', s._extension_from_read(fastaq.sequences.Fasta('x', 'GGGAGGCTAT')))
- self.assertEqual('AA', s._extension_from_read(fastaq.sequences.Fasta('x', 'TTAGCCT')))
-
- self.assertEqual(None, s._extension_from_read(fastaq.sequences.Fasta('x', 'AAAAA'), left=True))
- self.assertEqual(None, s._extension_from_read(fastaq.sequences.Fasta('x', 'AGGCTA'), left=True))
- self.assertEqual('GT', s._extension_from_read(fastaq.sequences.Fasta('x', 'GTAGGCTA'), left=True))
- self.assertEqual('GT', s._extension_from_read(fastaq.sequences.Fasta('x', 'GTAGGCTATTC'), left=True))
- self.assertEqual('GT', s._extension_from_read(fastaq.sequences.Fasta('x', 'AGCCTAC'), left=True))
+ self.assertEqual(None, s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AAAAA')))
+ self.assertEqual(None, s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AGGC')))
+ self.assertEqual('A', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AGGCTA')))
+ self.assertEqual('AT', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AGGCTAT')))
+ self.assertEqual('AT', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'GGGAGGCTAT')))
+ self.assertEqual('AA', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'TTAGCCT')))
+
+ self.assertEqual(None, s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AAAAA'), left=True))
+ self.assertEqual(None, s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AGGCTA'), left=True))
+ self.assertEqual('GT', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'GTAGGCTA'), left=True))
+ self.assertEqual('GT', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'GTAGGCTATTC'), left=True))
+ self.assertEqual('GT', s._extension_from_read(pyfastaq.sequences.Fasta('x', 'AGCCTAC'), left=True))
def test_len(self):
diff --git a/scripts/iva b/scripts/iva
index cdb2cc1..af7eea2 100755
--- a/scripts/iva
+++ b/scripts/iva
@@ -4,7 +4,7 @@ import argparse
import os
import sys
import multiprocessing
-import fastaq
+import pyfastaq
import iva
@@ -33,8 +33,8 @@ mapping_group.add_argument('-y', '--smalt_id', type=float, help='Minimum identit
contig_group = parser.add_argument_group('Contig options')
contig_group.add_argument('--ctg_first_trim', type=int, help='Number of bases to trim off the end of every contig before extending for the first time [%(default)s]', default=25, metavar='INT')
contig_group.add_argument('--ctg_iter_trim', type=int, help='During iterative extension, number of bases to trim off the end of a contig when extension fails (then try extending again) [%(default)s]', default=10, metavar='INT')
-contig_group.add_argument('--ext_min_cov', type=int, help='Minimum kmer depth needed to use that kmer to extend a contig [%(default)s]', default=5, metavar='INT')
-contig_group.add_argument('--ext_min_ratio', type=float, help='Sets N, where kmer for extension must be at least N times more abundant than next most common kmer [%(default)s]', default=2, metavar='FLOAT')
+contig_group.add_argument('--ext_min_cov', type=int, help='Minimum kmer depth needed to use that kmer to extend a contig [%(default)s]', default=10, metavar='INT')
+contig_group.add_argument('--ext_min_ratio', type=float, help='Sets N, where kmer for extension must be at least N times more abundant than next most common kmer [%(default)s]', default=4, metavar='FLOAT')
contig_group.add_argument('--ext_max_bases', type=int, help='Maximum number of bases to try to extend on each iteration [%(default)s]', default=100, metavar='INT')
contig_group.add_argument('--ext_min_clip', type=int, help='Set minimum number of bases soft clipped off a read for those bases to be used for extension [%(default)s]', default=3, metavar='INT')
contig_group.add_argument('--max_contigs', type=int, help='Maximum number of contigs allowed in the assembly. No more seeds generated if the cutoff is reached [%(default)s]', metavar='INT', default=50)
@@ -48,24 +48,22 @@ seed_group.add_argument('--seed_min_kmer_cov', type=int, help='Minimum kmer cove
seed_group.add_argument('--seed_max_kmer_cov', type=int, help='Maximum kmer coverage of initial seed [%(default)s]', default=1000000, metavar='INT')
seed_group.add_argument('--seed_ext_max_bases', type=int, help='Maximum number of bases to try to extend on each iteration [%(default)s]', default=50, metavar='INT')
seed_group.add_argument('--seed_overlap_length', type=int, help='Number of overlapping bases needed between read and seed to use that read to extend [seed_start_length]', metavar='INT')
-seed_group.add_argument('--seed_ext_min_cov', type=int, help='Minimum kmer depth needed to use that kmer to extend a contig [%(default)s]', default=5, metavar='INT')
-seed_group.add_argument('--seed_ext_min_ratio', type=float, help='Sets N, where kmer for extension must be at least N times more abundant than next most common kmer [%(default)s]', default=2, metavar='FLOAT')
+seed_group.add_argument('--seed_ext_min_cov', type=int, help='Minimum kmer depth needed to use that kmer to extend a contig [%(default)s]', default=10, metavar='INT')
+seed_group.add_argument('--seed_ext_min_ratio', type=float, help='Sets N, where kmer for extension must be at least N times more abundant than next most common kmer [%(default)s]', default=4, metavar='FLOAT')
trimming_group = parser.add_argument_group('Read trimming options')
trimming_group.add_argument('--trimmomatic', action=iva.common.abspathAction, help='Provide location of trimmomatic.jar file to enable read trimming. Required if --adapters used', metavar='FILENAME')
trimming_group.add_argument('--trimmo_qual', help='Trimmomatic options used to quality trim reads [%(default)s]', default='LEADING:10 TRAILING:10 SLIDINGWINDOW:4:20', metavar='STRING')
-iva_module_dir = os.path.dirname(iva.read_trim.__file__)
-adapters = os.path.abspath(os.path.join(iva_module_dir, 'read_trim', 'adapters.fasta'))
-trimming_group.add_argument('--adapters', action=iva.common.abspathAction, help='Fasta file of adapter sequences to be trimmed off reads. If used, must also use --trimmomatic. [' + adapters + ']', metavar='FILENAME')
+trimming_group.add_argument('--adapters', action=iva.common.abspathAction, help='Fasta file of adapter sequences to be trimmed off reads. If used, must also use --trimmomatic. Default is file of adapters supplied with IVA', metavar='FILENAME')
trimming_group.add_argument('--min_trimmed_length', type=int, help='Minimum length of read after trimming [%(default)s]', default=50, metavar='INT')
-trimming_group.add_argument('--pcr_primers', action=iva.common.abspathAction, help='FASTA file of primers. The first perfect match found to a sequence in the primers file will be trimmed off the start of each read. This is run after trimmomatic (if --trimmomatic used)', metavar='FILENAME')
+trimming_group.add_argument('--pcr_primers', action=iva.common.abspathAction, help='FASTA file of primers. The first perfect match found to a sequence in the primers file will be trimmed off the start of each read. This is run after trimmomatic (if --trimmomatic used)', metavar='FILENAME')
other_group = parser.add_argument_group('Other options')
-other_group.add_argument('-i', '--max_insert', type=int, help='Maximum insert size (includes read length). Reads with inferred insert size more than the maximum will not be used to extend contigs [%(default)s]', default=500, metavar='INT')
+other_group.add_argument('-i', '--max_insert', type=int, help='Maximum insert size (includes read length). Reads with inferred insert size more than the maximum will not be used to extend contigs [%(default)s]', default=800, metavar='INT')
other_group.add_argument('-t', '--threads', type=int, help='Number of threads to use [%(default)s]', default=1, metavar='INT')
-other_group.add_argument('--strand_bias', type=float, help='Set strand bias cutoff of mapped reads when trimming contig ends, in the interval [0,0.5]. A value of x means that a base needs min(fwd_depth, rev_depth) / total_depth <= x. [%(default)s]', default=0.1, metavar='FLOAT in [0,0.5]')
+other_group.add_argument('--strand_bias', type=float, help='Set strand bias cutoff of mapped reads when trimming contig ends, in the interval [0,0.5]. A value of x means that a base needs min(fwd_depth, rev_depth) / total_depth <= x. The only time this should be used is with libraries with overlapping reads (ie fragment length < 2*read length), and even then, it can make results worse. If used, try a low value like 0.1 first [%(default)s]', default=0, metavar='FLOAT in [0,0.5]')
other_group.add_argument('--version', action='version', version=iva.common.version)
options = parser.parse_args()
@@ -110,18 +108,18 @@ iva.external_progs.write_prog_info('iva', log_file)
reads_prefix = 'reads'
reads_1 = reads_prefix + '_1.fa'
reads_2 = reads_prefix + '_2.fa'
-original_line_length = fastaq.sequences.Fasta.line_length
-fastaq.sequences.Fasta.line_length = 0
+original_line_length = pyfastaq.sequences.Fasta.line_length
+pyfastaq.sequences.Fasta.line_length = 0
if options.reads and not options.trimmomatic:
- fastaq.tasks.deinterleave(options.reads, reads_1, reads_2, fasta_out=True)
+ pyfastaq.tasks.deinterleave(options.reads, reads_1, reads_2, fasta_out=True)
else:
to_delete = []
if options.reads:
reads_for_trimming_1 = 'reads.untrimmed_1.fq'
reads_for_trimming_2 = 'reads.untrimmed_2.fq'
- fastaq.tasks.deinterleave(options.reads, reads_for_trimming_1, reads_for_trimming_2)
+ pyfastaq.tasks.deinterleave(options.reads, reads_for_trimming_1, reads_for_trimming_2)
to_delete.append(reads_for_trimming_1)
to_delete.append(reads_for_trimming_2)
else:
@@ -131,7 +129,13 @@ else:
if options.trimmomatic:
trimmed_reads_prefix = 'reads.trimmed'
if options.adapters is None:
- options.adapters = adapters
+ extractor = iva.egg_extract.Extractor(os.path.abspath(os.path.join(os.path.dirname(iva.__file__), os.pardir)))
+ egg_adapters = os.path.join('iva', 'read_trim', 'adapters.fasta')
+ options.adapters = 'adapters.fasta'
+ extractor.copy_file(egg_adapters, options.adapters)
+
+ assert os.path.exists(options.adapters)
+
iva.read_trim.run_trimmomatic(reads_for_trimming_1, reads_for_trimming_2, trimmed_reads_prefix, options.trimmomatic, options.adapters, minlen=options.min_trimmed_length, verbose=options.verbose, threads=options.threads, qual_trim=options.trimmo_qual)
fq_to_convert_to_fa_1 = trimmed_reads_prefix + '_1.fq'
fq_to_convert_to_fa_2 = trimmed_reads_prefix + '_2.fq'
@@ -141,8 +145,8 @@ else:
fq_to_convert_to_fa_1 = reads_for_trimming_1
fq_to_convert_to_fa_2 = reads_for_trimming_2
- p1 = multiprocessing.Process(target=fastaq.tasks.to_fasta, args=(fq_to_convert_to_fa_1, reads_1), kwargs={'line_length':0})
- p2 = multiprocessing.Process(target=fastaq.tasks.to_fasta, args=(fq_to_convert_to_fa_2, reads_2), kwargs={'line_length':0})
+ p1 = multiprocessing.Process(target=pyfastaq.tasks.to_fasta, args=(fq_to_convert_to_fa_1, reads_1), kwargs={'line_length':0})
+ p2 = multiprocessing.Process(target=pyfastaq.tasks.to_fasta, args=(fq_to_convert_to_fa_2, reads_2), kwargs={'line_length':0})
p1.start()
if options.threads == 1:
@@ -154,24 +158,28 @@ else:
if options.threads > 1:
p1.join()
+ if p1.exitcode != 0 or p2.exitcode != 0:
+ print('Error in input reads files. Cannot continue', file=sys.stderr)
+ sys.exit(1)
+
for fname in to_delete:
os.unlink(fname)
if options.pcr_primers:
tmp_reads_1 = 'reads_1.pcr_trim.fa'
tmp_reads_2 = 'reads_2.pcr_trim.fa'
- fastaq.tasks.sequence_trim(reads_1, reads_2, tmp_reads_1, tmp_reads_2, options.pcr_primers, min_length=options.min_trimmed_length, check_revcomp=True)
+ pyfastaq.tasks.sequence_trim(reads_1, reads_2, tmp_reads_1, tmp_reads_2, options.pcr_primers, min_length=options.min_trimmed_length, check_revcomp=True)
os.rename(tmp_reads_1, reads_1)
os.rename(tmp_reads_2, reads_2)
-fastaq.sequences.Fasta.line_length = original_line_length
+pyfastaq.sequences.Fasta.line_length = original_line_length
if options.contigs:
- contigs = 'contigs.fasta'
- fastaq.tasks.to_fasta(options.contigs, contigs, line_length=60, strip_after_first_whitespace=True)
+ contigs = 'contigs_to_extend.fasta'
+ pyfastaq.tasks.to_fasta(options.contigs, contigs, line_length=60, strip_after_first_whitespace=True)
elif options.reference:
reference = 'reference_in.fasta'
- fastaq.tasks.to_fasta(options.reference, reference, line_length=60, strip_after_first_whitespace=True)
+ pyfastaq.tasks.to_fasta(options.reference, reference, line_length=60, strip_after_first_whitespace=True)
p = iva.seed_processor.SeedProcessor(
reference,
reads_1,
diff --git a/scripts/iva_qc b/scripts/iva_qc
index 23ce297..4849a02 100755
--- a/scripts/iva_qc
+++ b/scripts/iva_qc
@@ -40,7 +40,7 @@ mapping_group.add_argument('-y', '--smalt_id', type=float, help='Minimum identit
external_group = parser.add_argument_group('External tools')
external_group.add_argument('--gage_minid', help='Minimum percent identity used when GAGE runs nucmer [%(default)s]', metavar='INT in [0,100]', default=80)
external_group.add_argument('--kraken_preload', action='store_true', help='Use the --preload option when running kraken')
-external_group.add_argument('--ratt_config', help='Specify your own RATT config file [%(default)s]', metavar='filename', default=iva.qc_external.default_ratt_config)
+external_group.add_argument('--ratt_config', help='Specify your own RATT config file [%(default)s]', metavar='filename')
other_group = parser.add_argument_group('Other options')
diff --git a/setup.py b/setup.py
index 0a53ca1..828fb03 100644
--- a/setup.py
+++ b/setup.py
@@ -1,23 +1,57 @@
import os
import glob
+import sys
+import shutil
from setuptools import setup, find_packages
-def read(fname):
- return open(os.path.join(os.path.dirname(__file__), fname)).read()
+
+required_progs = [
+ 'kmc',
+ 'kmc_dump',
+ 'nucmer',
+ 'delta-filter',
+ 'show-coords',
+ 'samtools',
+ 'smalt',
+]
+
+found_all_progs = True
+print('Checking dependencies found in path:')
+for program in required_progs:
+ if shutil.which(program) is None:
+ found_all_progs = False
+ found = ' NOT FOUND'
+ else:
+ found = ' OK'
+ print(found, program, sep='\t')
+
+if not found_all_progs:
+ print('Cannot install because some dependencies not found.', file=sys.stderr)
+ sys.exit(1)
+
setup(
name='iva',
- version='0.10.1',
+ version='1.0.0',
description='Iterative Virus Assembler',
- long_description=read('README.md'),
packages = find_packages(),
package_data={'iva': ['gage/*', 'ratt/*', 'read_trim/*']},
author='Martin Hunt',
- author_email='mh12 at sanger.ac.uk',
+ author_email='path-help at sanger.ac.uk',
url='https://github.com/sanger-pathogens/iva',
scripts=glob.glob('scripts/*'),
test_suite='nose.collector',
- install_requires=['nose >= 1.3', 'fastaq >= 1.6.0', 'networkx'],
- dependency_links=['http://github.com/sanger-pathogens/fastaq/tarball/master#egg=fastaq-1.7.0'],
+ tests_require=['nose >= 1.3'],
+ install_requires=[
+ 'pyfastaq >= 3.0.1',
+ 'networkx >= 1.7',
+ 'pysam >= 0.8.1'
+ ],
license='GPLv3',
+ classifiers=[
+ 'Development Status :: 4 - Beta',
+ 'Topic :: Scientific/Engineering :: Bio-Informatics',
+ 'Programming Language :: Python :: 3 :: Only',
+ 'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
+ ],
)
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/iva.git
More information about the debian-med-commit
mailing list