[med-svn] [Git][med-team/python-pyfaidx][master] 8 commits: New upstream version 0.5.3.1
Andreas Tille
gitlab at salsa.debian.org
Mon Apr 16 12:21:04 BST 2018
Andreas Tille pushed to branch master at Debian Med / python-pyfaidx
Commits:
e3d8d1b3 by Andreas Tille at 2018-04-16T11:50:28+02:00
New upstream version 0.5.3.1
- - - - -
5fc8481b by Andreas Tille at 2018-04-16T11:50:28+02:00
Update upstream source from tag 'upstream/0.5.3.1'
Update to upstream version '0.5.3.1'
with Debian dir 00eb618e67ca3115f09e4153b232f1ce0e751606
- - - - -
65c5e1b2 by Andreas Tille at 2018-04-16T11:50:29+02:00
New upstream version
- - - - -
8547bc0a by Andreas Tille at 2018-04-16T11:51:25+02:00
Point Vcs fields to salsa.debian.org
- - - - -
f4e25e3e by Andreas Tille at 2018-04-16T11:51:25+02:00
Standards-Version: 4.1.4
- - - - -
3cc39157 by Andreas Tille at 2018-04-16T11:51:25+02:00
debhelper 11
- - - - -
aa5007d6 by Andreas Tille at 2018-04-16T11:54:50+02:00
Testsuite: autopkgtest-pkg-python
- - - - -
6ab6bd1f by Andreas Tille at 2018-04-16T13:15:06+02:00
Upload to unstable
- - - - -
25 changed files:
- + .codecov.yml
- .travis.yml
- + CODE_OF_CONDUCT.md
- README.rst
- + debian/TODO
- debian/changelog
- debian/compat
- debian/control
- pyfaidx/__init__.py
- pyfaidx/cli.py
- scripts/benchmark.py
- + tests/data/chr17.hg19.part.fa
- tests/data/download_gene_fasta.py
- + tests/data/gene.bed12
- + tests/data/gene.bed12.fasta
- tests/test_FastaRecord.py
- tests/test_FastaRecord_iter.py
- tests/test_FastaVariant.py
- + tests/test_Fasta_bgzip.py
- tests/test_faidx.py
- tests/test_feature_bounds_check.py
- tests/test_feature_indexing.py
- tests/test_feature_key_function.py
- + tests/test_feature_spliced_seq.py
- tests/test_feature_split_char.py
Changes:
=====================================
.codecov.yml
=====================================
--- /dev/null
+++ b/.codecov.yml
@@ -0,0 +1,4 @@
+comment: false
+coverage:
+ status:
+ patch: false
=====================================
.travis.yml
=====================================
--- a/.travis.yml
+++ b/.travis.yml
@@ -2,6 +2,7 @@ language: python
sudo: false
python:
- 'nightly'
+ - '3.6'
- '3.5'
- '3.4'
- '3.3'
@@ -13,14 +14,19 @@ install:
- pip wheel -f wheelhouse coverage biopython cython pysam pyvcf || true
- pip install -f wheelhouse biopython cython pysam pyfasta coverage pyvcf || true
- python setup.py install
- - if [ ! -f samtools-1.2 ]; then wget -q -O - https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 | tar -xjv; fi
+ - if [ ! -f samtools-1.2 ]; then curl -sL https://github.com/samtools/samtools/releases/download/1.2/samtools-1.2.tar.bz2 | tar -xjv; fi
- cd samtools-1.2
- make
- export PATH=$PATH:$PWD
- cd ..
+ - if [ ! -f htslib-1.4 ]; then curl -sL https://github.com/samtools/htslib/releases/download/1.4/htslib-1.4.tar.bz2 | tar -xjv; fi
+ - cd htslib-1.4
+ - make
+ - export PATH=$PATH:$PWD
+ - cd ..
before_script:
- - /usr/local/bin/pip install --user biopython
- - /usr/bin/python tests/data/download_gene_fasta.py
+ - if [ $(python --version) -gt 2.6 ]; then env pip install biopython; fi
+ - env python tests/data/download_gene_fasta.py
script: nosetests --with-coverage --cover-package=pyfaidx
deploy:
provider: pypi
@@ -29,18 +35,20 @@ deploy:
secure: MbSaeuitkVTZqxa0PJ3RcR1aMf+B/sMbcx2sWOo9xfLlRFDFpYWJZ0EfXWEhrVu2YWXpBsasgunTDWSi0jNcZMH92MzOC+UTVYr45LO5sy6hm4iSiAgm/DPgYWdjP0SFKr7eL/HWPS+gHvgkXL1upleX21O358bxaezoasuKFvs=
on:
all_branches: true
- python: 2.6
+ python: 2.7
tags: true
repo: mdshw5/pyfaidx
matrix:
allow_failures:
- python: 'nightly'
- python: 'pypy3'
+ - python: 'pypy'
fast_finish: true
cache:
directories:
- tests/data
- samtools-1.2
+ - htslib-1.4
- wheelhouse
after_success:
- bash <(curl -s https://codecov.io/bash)
=====================================
CODE_OF_CONDUCT.md
=====================================
--- /dev/null
+++ b/CODE_OF_CONDUCT.md
@@ -0,0 +1,46 @@
+# Contributor Covenant Code of Conduct
+
+## Our Pledge
+
+In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
+
+## Our Standards
+
+Examples of behavior that contributes to creating a positive environment include:
+
+* Using welcoming and inclusive language
+* Being respectful of differing viewpoints and experiences
+* Gracefully accepting constructive criticism
+* Focusing on what is best for the community
+* Showing empathy towards other community members
+
+Examples of unacceptable behavior by participants include:
+
+* The use of sexualized language or imagery and unwelcome sexual attention or advances
+* Trolling, insulting/derogatory comments, and personal or political attacks
+* Public or private harassment
+* Publishing others' private information, such as a physical or electronic address, without explicit permission
+* Other conduct which could reasonably be considered inappropriate in a professional setting
+
+## Our Responsibilities
+
+Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior.
+
+Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful.
+
+## Scope
+
+This Code of Conduct applies both within project spaces and in public spaces when an individual is representing the project or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers.
+
+## Enforcement
+
+Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at mdshw5 at gmail.com. The project team will review and investigate all complaints, and will respond in a way that it deems appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
+
+## Attribution
+
+This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at [http://contributor-covenant.org/version/1/4][version]
+
+[homepage]: http://contributor-covenant.org
+[version]: http://contributor-covenant.org/version/1/4/
=====================================
README.rst
=====================================
--- a/README.rst
+++ b/README.rst
@@ -1,4 +1,4 @@
-|Travis| |PyPI| |Landscape| |Coverage| |Depsy| |Appveyor| |Flattr|
+|Travis| |PyPI| |Landscape| |Coverage| |Depsy| |Appveyor|
Description
-----------
@@ -38,6 +38,8 @@ or download a `release <https://github.com/mdshw5/pyfaidx/releases>`_ and:
python setup.py install
+If using ``pip install --user`` make sure to add ``/home/$(whoami)/.local/bin`` to your ``$PATH`` if you want to run the ``faidx`` script.
+
Usage
-----
@@ -72,7 +74,7 @@ Acts like a dictionary.
>>> genes['NM_001282543.1'][200:230].end
230
- >>> genes['NM_001282543.1'][200:230].longname
+ >>> genes['NM_001282543.1'][200:230].fancy_name
'NM_001282543.1:201-230'
>>> len(genes['NM_001282543.1'])
@@ -108,27 +110,7 @@ Slices just like a string:
>NM_001282543.1:1-5466
CCCCGCCCCT........
-- Slicing start and end coordinates are 0-based, just like Python sequences.
-
-Sequence can be buffered in memory using a read-ahead buffer
-for fast sequential access:
-
-.. code:: python
-
- >>> from timeit import timeit
- >>> fetch = "genes['NM_001282543.1'][200:230]"
- >>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
- >>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
- >>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"
-
- >>> timeit(fetch, no_read_ahead, number=10000)
- 0.2204863309962093
- >>> timeit(fetch, read_ahead, number=10000)
- 0.1121859749982832
- >>> timeit(fetch, string_slicing, number=10000)
- 0.0033553699977346696
-
-Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.
+- Slicing start and end coordinates are 0-based, just like Python sequences.
Complements and reverse complements just like DNA
@@ -145,6 +127,31 @@ Complements and reverse complements just like DNA
>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
+
+``Fasta`` objects can also be accessed using method calls:
+
+.. code:: python
+
+ >>> genes.get_seq('NM_001282543.1', 201, 210)
+ >NM_001282543.1:201-210
+ CTCGTTCCGC
+
+ >>> genes.get_seq('NM_001282543.1', 201, 210, rc=True)
+ >NM_001282543.1 (complement):210-201
+ GCGGAACGAG
+
+Spliced sequences can be retrieved from a list of [start, end] coordinates:
+**TODO** update this section
+
+.. code:: python
+
+ # new in v0.5.1
+ segments = [[1, 10], [50, 70]]
+ >>> genes.get_spliced_seq('NM_001282543.1', segments)
+ >gi|543583786|ref|NM_001282543.1|:1-70
+ CCCCGCCCCTGGTTTCGAGTCGCTGGCCTGC
+
+.. _keyfn:
Custom key functions provide cleaner access:
@@ -158,6 +165,24 @@ Custom key functions provide cleaner access:
>NR_104212:1-10
CCCCGCCCCT
+You can specify a character to split names on, which will generate additional entries:
+
+.. code:: python
+
+ >>> from pyfaidx import Fasta
+ >>> genes = Fasta('tests/data/genes.fasta', split_char='.', duplicate_action="first") # default duplicate_action="stop"
+ >>> genes.keys()
+ dict_keys(['.1', 'NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
+
+If your `key_function` or `split_char` generates duplicate entries, you can choose what action to take:
+
+.. code:: python
+
+ # new in v0.4.9
+ >>> genes = Fasta('tests/data/genes.fasta', split_char="|", duplicate_action="longest")
+ >>> genes.keys()
+ dict_keys(['gi', '563317589', 'dbj', 'AB821309.1', '', '557361099', 'gb', 'KF435150.1', '557361097', 'KF435149.1', '543583796', 'ref', 'NR_104216.1', '543583795', 'NR_104215.1', '543583794', 'NR_104212.1', '543583788', 'NM_001282545.1', '543583786', 'NM_001282543.1', '543583785', 'NM_000465.3', '543583740', 'NM_001282549.1', '543583738', 'NM_001282548.1', '530384540', 'XM_005249645.1', '530384538', 'XM_005249644.1', '530384536', 'XM_005249643.1', '530384534', 'XM_005249642.1', '530373237','XM_005265508.1', '530373235', 'XM_005265507.1', '530364726', 'XR_241081.1', '530364725', 'XR_241080.1', '530364724', 'XR_241079.1'])
+
Filter functions (returning True) limit the index:
.. code:: python
@@ -226,6 +251,36 @@ Sequence names are truncated on any whitespace. This is a limitation of the inde
gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
...
+ # new in v0.4.9
+ >>> from pyfaidx import Fasta
+ >>> genes = Fasta('tests/data/genes.fasta', read_long_names=True)
+ >>> for record in genes:
+ ... print(record.name)
+ ...
+ gi|563317589|dbj|AB821309.1| Homo sapiens FGFR2-AHCYL1 mRNA for FGFR2-AHCYL1 fusion kinase protein, complete cds
+ gi|557361099|gb|KF435150.1| Homo sapiens MDM4 protein variant Y (MDM4) mRNA, complete cds, alternatively spliced
+ gi|557361097|gb|KF435149.1| Homo sapiens MDM4 protein variant G (MDM4) mRNA, complete cds
+
+Sequence can be buffered in memory using a read-ahead buffer
+for fast sequential access:
+
+.. code:: python
+
+ >>> from timeit import timeit
+ >>> fetch = "genes['NM_001282543.1'][200:230]"
+ >>> read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000)"
+ >>> no_read_ahead = "import pyfaidx; genes = pyfaidx.Fasta('tests/data/genes.fasta')"
+ >>> string_slicing = "genes = {}; genes['NM_001282543.1'] = 'N'*10000"
+
+ >>> timeit(fetch, no_read_ahead, number=10000)
+ 0.2204863309962093
+ >>> timeit(fetch, read_ahead, number=10000)
+ 0.1121859749982832
+ >>> timeit(fetch, string_slicing, number=10000)
+ 0.0033553699977346696
+
+Read-ahead buffering can reduce runtime by 1/2 for sequential accesses to buffered regions.
+
.. role:: red
If you want to modify the contents of your FASTA file in-place, you can use the `mutable` argument.
@@ -304,12 +359,18 @@ cli script: faidx
default base for missing positions and masking. default: N
-d DELIMITER, --delimiter DELIMITER
delimiter for splitting names to multiple values (duplicate names will be discarded). default: None
+ -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
+ python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: lambda x: x.split()[0]
+ -u {stop,first,last,longest,shortest}, --duplicates-action {stop,first,last,longest,shortest}
+ entry to take when duplicate sequence names are encountered. default: stop
-g REGEX, --regex REGEX
selected sequences are those matching regular expression. default: .*
-v, --invert-match selected sequences are those not matching 'regions' argument. default: False
-m, --mask-with-default-seq
mask the FASTA file using --default-seq default: False
-M, --mask-by-case mask the FASTA file by changing to lowercase. default: False
+ -e HEADER_FUNCTION, --header-function HEADER_FUNCTION
+ python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: None
--no-rebuild do not rebuild the .fai index even if it is out of date. default: False
--version print pyfaidx version number
@@ -424,8 +485,6 @@ Examples:
AGCTTCCCTGTGGTTTCCCGAGGCTTCCTTGCTTCCCGCTCTGCGAGGAGCCTTTCATCCGAAGGCGGGA
.......
-
-
$ faidx --size-range 5500,6000 -i chromsizes tests/data/genes.fasta
NM_000465.3 5523
@@ -435,6 +494,14 @@ Examples:
$ faidx -M --bed regions.bed tests/data/genes.fasta
### Modifies tests/data/genes.fasta by masking regions using lowercase characters ###
+ $ faidx -e "lambda x: x.split('.')[0]" tests/data/genes.fasta -i bed
+ AB821309 1 3510
+ KF435150 1 481
+ KF435149 1 642
+ NR_104216 1 4573
+ NR_104215 1 5317
+ .......
+
Similar syntax as ``samtools faidx``
@@ -462,6 +529,19 @@ A lower-level Faidx class is also available:
where "filename.fa" is the original FASTA file.
- Start and end coordinates are 1-based.
+Support for compressed FASTA
+----------------------------
+
+``pyfaidx`` can create and read ``.fai`` indices for FASTA files that have
+been compressed using the `bgzip <http://www.htslib.org/doc/tabix.html>`_
+tool from `samtools <http://www.htslib.org/>`_. ``bgzip`` writes compressed
+data in a ``BGZF`` format. ``BGZF`` is ``gzip`` compatible, consisting of
+multiple concatenated ``gzip`` blocks, each with an additional ``gzip``
+header making it possible to build an index for rapid random access. I.e.,
+files compressed with ``bgzip`` are valid ``gzip`` and so can be read by
+``gunzip``. See `this description
+<http://pydoc.net/Python/biopython/1.66/Bio.bgzf/>`_ for more details on
+``bgzip``.
Changelog
---------
@@ -478,7 +558,7 @@ I try to fix as many bugs as possible, but most of this work is supported by a s
Contributing
------------
-Create a new Pull Request with one feauture. If you add a new feature, please
+Create a new Pull Request with one feature. If you add a new feature, please
create also the relevant test.
To get test running on your machine:
@@ -519,8 +599,3 @@ Comprehensive Cancer Center in the Department of Oncology.
.. |Appveyor| image:: https://ci.appveyor.com/api/projects/status/80ihlw30a003596w?svg=true
:target: https://ci.appveyor.com/project/mdshw5/pyfaidx
-
-.. |Flattr| image:: http://button.flattr.com/flattr-badge-large.png
- :target: https://flattr.com/submit/auto?fid=po00kq&url=https%3A%2F%2Fgithub.com%2Fmdshw5%2Fpyfaidx
-
-
=====================================
debian/TODO
=====================================
--- /dev/null
+++ b/debian/TODO
@@ -0,0 +1,5 @@
+Replace
+
+ Testsuite: autopkgtest-pkg-python
+
+by a real test suite running build time tests
=====================================
debian/changelog
=====================================
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,13 @@
+python-pyfaidx (0.5.3.1-1) unstable; urgency=medium
+
+ * New upstream version
+ * Point Vcs fields to salsa.debian.org
+ * Standards-Version: 4.1.4
+ * debhelper 11
+ * Testsuite: autopkgtest-pkg-python
+
+ -- Andreas Tille <tille at debian.org> Mon, 16 Apr 2018 13:12:29 +0200
+
python-pyfaidx (0.4.8.1-1) unstable; urgency=medium
* New upstream version
=====================================
debian/compat
=====================================
--- a/debian/compat
+++ b/debian/compat
@@ -1 +1 @@
-10
+11
=====================================
debian/control
=====================================
--- a/debian/control
+++ b/debian/control
@@ -2,8 +2,9 @@ Source: python-pyfaidx
Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
Uploaders: Andreas Tille <tille at debian.org>
Section: python
+Testsuite: autopkgtest-pkg-python
Priority: optional
-Build-Depends: debhelper (>= 10),
+Build-Depends: debhelper (>= 11~),
dh-python,
python-all,
python-coverage,
@@ -19,9 +20,9 @@ Build-Depends: debhelper (>= 10),
python3-numpy,
python3-six,
python3-mock
-Standards-Version: 3.9.8
-Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyfaidx.git
-Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyfaidx.git
+Standards-Version: 4.1.4
+Vcs-Browser: https://salsa.debian.org/med-team/python-pyfaidx
+Vcs-Git: https://salsa.debian.org/med-team/python-pyfaidx.git
Homepage: https://github.com/mdshw5/pyfaidx
Package: python-pyfaidx
=====================================
pyfaidx/__init__.py
=====================================
--- a/pyfaidx/__init__.py
+++ b/pyfaidx/__init__.py
@@ -1,5 +1,4 @@
# pylint: disable=R0913, R0914, C0301
-
"""
Fasta file -> Faidx -> Fasta -> FastaRecord -> Sequence
"""
@@ -11,7 +10,7 @@ from six import PY2, PY3, string_types, integer_types
from six.moves import zip_longest
try:
from collections import OrderedDict
-except ImportError: #python 2.6
+except ImportError: #python 2.6
from ordereddict import OrderedDict
from collections import namedtuple
import re
@@ -22,31 +21,44 @@ from threading import Lock
dna_bases = re.compile(r'([ACTGNactgnYRWSKMDVHBXyrwskmdvhbx]+)')
-__version__ = '0.4.8.1'
+__version__ = '0.5.3.1'
+
+
+class KeyFunctionError(ValueError):
+ """Raised if the key_function argument is invalid."""
class FastaIndexingError(Exception):
- def __init__(self, msg=None):
- self.msg = msg
- super(FastaIndexingError, self).__init__(self.msg)
+ """Raised if we encounter malformed FASTA that prevents indexing."""
+
+
+class IndexNotFoundError(IOError):
+ """Raised if read_fai cannot open the index file."""
+
+class FastaNotFoundError(IOError):
+ """Raised if the fasta file cannot be opened."""
-class FetchError(Exception):
- def __init__(self, msg=None):
- self.msg = msg
- super(FetchError, self).__init__(self.msg)
+class FetchError(IndexError):
+ """Raised if a request to fetch a FASTA sequence cannot be fulfilled."""
-class BedError(Exception):
- def __init__(self, msg=None):
- self.msg = 'Malformed BED entry!\n' if not msg else msg
- super(BedError, self).__init__(self.msg)
+
+class BedError(ValueError):
+ """Indicates a malformed BED entry."""
class RegionError(Exception):
- def __init__(self, msg=None):
- self.msg = 'Malformed region! Format = rname:start-end.\n' if not msg else msg
- super(RegionError, self).__init__(self.msg)
+ # This exception class is currently unused, but has been retained for
+ # backwards compatibility.
+ """A region error occurred."""
+
+
+class UnsupportedCompressionFormat(IOError):
+ """
+ Raised when a FASTA file is given with a recognized but unsupported
+ compression extension.
+ """
class Sequence(object):
@@ -56,6 +68,7 @@ class Sequence(object):
start, end = coordinates of subsequence (optional)
comp = boolean switch for complement property
"""
+
def __init__(self, name='', seq='', start=None, end=None, comp=False):
self.name = name
self.seq = seq
@@ -108,22 +121,26 @@ class Sequence(object):
"""
if self.start is None or self.end is None:
correction_factor = 0
- elif len(self.seq) == abs(self.end - self.start) + 1: # determine coordinate system
+ elif len(
+ self.seq
+ ) == abs(self.end - self.start) + 1: # determine coordinate system
one_based = True
correction_factor = -1
elif len(self.seq) == abs(self.end - self.start):
one_based = False
correction_factor = 0
elif len(self.seq) != abs(self.end - self.start):
- raise ValueError("Coordinates start=%s and end=%s imply a diffent length than sequence (length %s)." % (self.start, self.end, len(self.seq)))
+ raise ValueError(
+ "Coordinates (Sequence.start=%s and Sequence.end=%s) imply a different length than Sequence.seq (len=%s). Did you modify Sequence.seq?"
+ % (self.start, self.end, len(self.seq)))
if isinstance(n, slice):
slice_start, slice_stop, slice_step = n.indices(len(self))
if self.start is None or self.end is None: # there should never be self.start != self.end == None
start = None
end = None
- return self.__class__(self.name, self.seq[n],
- start, end, self.comp)
+ return self.__class__(self.name, self.seq[n], start, end,
+ self.comp)
self_end, self_start = (self.end, self.start)
if abs(slice_step) > 1:
start = None
@@ -137,7 +154,8 @@ class Sequence(object):
else:
start = self_start + slice_start
end = self_start + slice_stop + correction_factor
- return self.__class__(self.name, self.seq[n], start, end, self.comp)
+ return self.__class__(self.name, self.seq[n], start, end,
+ self.comp)
elif isinstance(n, integer_types):
if n < 0:
n = len(self) + n
@@ -167,7 +185,7 @@ class Sequence(object):
return self[::-1].complement
def __repr__(self):
- return '\n'.join([''.join(['>', self.longname]), self.seq])
+ return '\n'.join([''.join(['>', self.fancy_name]), self.seq])
def __len__(self):
"""
@@ -177,10 +195,10 @@ class Sequence(object):
return len(self.seq)
@property
- def longname(self):
+ def fancy_name(self):
""" Return the fancy name for the sequence, including start, end, and complementation.
>>> x = Sequence(name='chr1', seq='ATCGTA', start=1, end=6, comp=True)
- >>> x.longname
+ >>> x.fancy_name
'chr1:1-6 (complement)'
"""
name = self.name
@@ -190,6 +208,17 @@ class Sequence(object):
name += ' (complement)'
return name
+ @property
+ def long_name(self):
+ """ DEPRECATED: Use fancy_name instead.
+ Return the fancy name for the sequence, including start, end, and complementation.
+ >>> x = Sequence(name='chr1', seq='ATCGTA', start=1, end=6, comp=True)
+ >>> x.long_name
+ 'chr1:1-6 (complement)'
+ """
+ msg = "The `Sequence.long_name` property is deprecated, and will be removed in future versions. Please use `Sequence.fancy_name` instead."
+ warnings.warn(msg, DeprecationWarning, stacklevel=2)
+ return self.fancy_name
@property
def complement(self):
@@ -199,8 +228,8 @@ class Sequence(object):
>chr1 (complement)
TAGCAT
"""
- comp = self.__class__(self.name, complement(self.seq),
- start=self.start, end=self.end)
+ comp = self.__class__(
+ self.name, complement(self.seq), start=self.start, end=self.end)
comp.comp = False if self.comp else True
return comp
@@ -235,7 +264,6 @@ class Sequence(object):
else:
return None
-
@property
def gc(self):
""" Return the GC content of seq as a float
@@ -251,7 +279,9 @@ class Sequence(object):
return (g + c) / len(self.seq)
-class IndexRecord(namedtuple('IndexRecord', ['rlen', 'offset', 'lenc', 'lenb', 'bend', 'prev_bend'])):
+class IndexRecord(
+ namedtuple('IndexRecord',
+ ['rlen', 'offset', 'lenc', 'lenb', 'bend', 'prev_bend'])):
__slots__ = ()
def __getitem__(self, key):
@@ -260,16 +290,32 @@ class IndexRecord(namedtuple('IndexRecord', ['rlen', 'offset', 'lenc', 'lenb', '
return tuple.__getitem__(self, key)
def __str__(self):
- return "{rlen:n}\t{offset:n}\t{lenc:n}\t{lenb:n}\n".format(**self._asdict())
+ return "{rlen:n}\t{offset:n}\t{lenc:n}\t{lenb:n}\n".format(
+ **self._asdict())
+
+ def __len__(self):
+ return self.rlen
class Faidx(object):
""" A python implementation of samtools faidx FASTA indexing """
- def __init__(self, filename, default_seq=None, key_function=None,
- as_raw=False, strict_bounds=False, read_ahead=None,
- mutable=False, split_char=None, filt_function=None,
+
+ def __init__(self,
+ filename,
+ default_seq=None,
+ key_function=lambda x: x,
+ as_raw=False,
+ strict_bounds=False,
+ read_ahead=None,
+ mutable=False,
+ split_char=None,
+ duplicate_action="stop",
+ filt_function=lambda x: True,
one_based_attributes=True,
- sequence_always_upper=False, rebuild=True):
+ read_long_names=False,
+ sequence_always_upper=False,
+ rebuild=True,
+ build_index=True):
"""
filename: name of fasta file
key_function: optional callback function which should return a unique
@@ -279,41 +325,102 @@ class Faidx(object):
Default: False (i.e. return a Sequence() object).
"""
self.filename = filename
- if mutable:
- self.file = open(filename, 'r+b')
+
+ if filename.lower().endswith('.bgz') or filename.lower().endswith(
+ '.gz'):
+ # Only try to import Bio if we actually need the bgzf reader.
+ try:
+ from Bio import bgzf
+ except ImportError:
+ raise ImportError(
+ "BioPython must be installed to read gzipped files.")
+ else:
+ self._fasta_opener = bgzf.open
+ self._bgzf = True
+ elif filename.lower().endswith('.bz2') or filename.lower().endswith(
+ '.zip'):
+ raise UnsupportedCompressionFormat(
+ "Compressed FASTA is only supported in BGZF format. Use "
+ "bgzip to compresss your FASTA.")
else:
- self.file = open(filename, 'rb')
+ self._fasta_opener = open
+ self._bgzf = False
+
+ try:
+ self.file = self._fasta_opener(filename, 'r+b'
+ if mutable else 'rb')
+ except (ValueError, IOError) as e:
+ if str(e).find('BGZF') > -1:
+ raise UnsupportedCompressionFormat(
+ "Compressed FASTA is only supported in BGZF format. Use "
+ "the samtools bgzip utility (instead of gzip) to "
+ "compress your FASTA.")
+ else:
+ raise FastaNotFoundError(
+ "Cannot read FASTA file %s" % filename)
+
self.indexname = filename + '.fai'
- self.key_function = key_function if key_function else lambda rname: rname
- self.filt_function = filt_function if filt_function else lambda x: True
+ self.read_long_names = read_long_names
+ self.key_function = key_function
+ try:
+ key_fn_test = self.key_function(
+ "TestingReturnType of_key_function")
+ if not isinstance(key_fn_test, string_types):
+ raise KeyFunctionError(
+ "key_function argument should return a string, not {0}".
+ format(type(key_fn_test)))
+ except Exception as e:
+ pass
+ self.filt_function = filt_function
+ assert duplicate_action in ("stop", "first", "last", "longest",
+ "shortest", "drop")
+ self.duplicate_action = duplicate_action
self.as_raw = as_raw
self.default_seq = default_seq
- self.strict_bounds = strict_bounds
+ if self._bgzf and self.default_seq is not None:
+ raise FetchError(
+ "The default_seq argument is not supported with using BGZF compression. Please decompress your FASTA file and try again."
+ )
+ if self._bgzf:
+ self.strict_bounds = True
+ else:
+ self.strict_bounds = strict_bounds
+ self.split_char = split_char
self.one_based_attributes = one_based_attributes
self.sequence_always_upper = sequence_always_upper
self.index = OrderedDict()
self.lock = Lock()
- self.buffer = dict((('seq', None), ('name', None), ('start', None), ('end', None)))
+ self.buffer = dict((('seq', None), ('name', None), ('start', None),
+ ('end', None)))
if not read_ahead or isinstance(read_ahead, integer_types):
self.read_ahead = read_ahead
elif not isinstance(read_ahead, integer_types):
- raise ValueError("read_ahead value must be int, not {0}".format(type(read_ahead)))
+ raise ValueError("read_ahead value must be int, not {0}".format(
+ type(read_ahead)))
self.mutable = mutable
with self.lock: # lock around index generation so only one thread calls method
try:
- if os.path.exists(self.indexname) and getmtime(self.indexname) >= getmtime(self.filename):
- self.read_fai(split_char)
- elif os.path.exists(self.indexname) and getmtime(self.indexname) < getmtime(self.filename) and not rebuild:
- self.read_fai(split_char)
- warnings.warn("Index file {0} is older than FASTA file {1}.".format(self.indexname, self.filename), RuntimeWarning)
- else:
+ if os.path.exists(self.indexname) and getmtime(
+ self.indexname) >= getmtime(self.filename):
+ self.read_fai()
+ elif os.path.exists(self.indexname) and getmtime(
+ self.indexname) < getmtime(
+ self.filename) and not rebuild:
+ self.read_fai()
+ warnings.warn(
+ "Index file {0} is older than FASTA file {1}.".format(
+ self.indexname, self.filename), RuntimeWarning)
+ elif build_index:
self.build_index()
- self.read_fai(split_char)
- except FastaIndexingError as e:
+ self.read_fai()
+ else:
+ self.read_fai()
+
+ except FastaIndexingError:
os.remove(self.indexname)
self.file.close()
- raise FastaIndexingError(e)
+ raise
except Exception:
# Handle potential exceptions other than 'FastaIndexingError'
self.file.close()
@@ -331,36 +438,64 @@ class Faidx(object):
def __repr__(self):
return 'Faidx("%s")' % (self.filename)
- def read_fai(self, split_char):
- duplicate_ids = []
- with open(self.indexname) as index:
- prev_bend = 0
- for line in index:
- line = line.rstrip()
- rname, rlen, offset, lenc, lenb = line.split('\t')
- rlen, offset, lenc, lenb = map(int, (rlen, offset, lenc, lenb))
- newlines = int(ceil(rlen / lenc) * (lenb - lenc))
- bend = offset + newlines + rlen
- rname = filter(self.filt_function, self.key_function(rname).split(split_char))
- for i, key in enumerate(rname): # mdshw5/pyfaidx/issues/64
- if key in self.index and split_char is None:
- if i == 0:
- raise ValueError('Duplicate key "%s"' % key)
- else:
- continue
- # eliminate duplicate keys if they result from split_char
- elif key in self.index and split_char:
- duplicate_ids.append(key)
- continue
+ def _index_as_string(self):
+ """ Returns the string representation of the index as iterable """
+ for k, v in self.index.items():
+ yield '\t'.join([k, str(v)])
+
+ def read_fai(self):
+ try:
+ with open(self.indexname) as index:
+ prev_bend = 0
+ drop_keys = []
+ for line in index:
+ line = line.rstrip()
+ rname, rlen, offset, lenc, lenb = line.split('\t')
+ rlen, offset, lenc, lenb = map(int,
+ (rlen, offset, lenc, lenb))
+ newlines = int(ceil(rlen / lenc) * (lenb - lenc))
+ bend = offset + newlines + rlen
+ rec = IndexRecord(rlen, offset, lenc, lenb, bend,
+ prev_bend)
+ if self.read_long_names:
+ rname = self._long_name_from_index_record(rec)
+ if self.split_char:
+ rname = filter(self.filt_function,
+ self.key_function(rname).split(
+ self.split_char))
else:
- self.index[key] = IndexRecord(rlen, offset, lenc, lenb, bend, prev_bend)
- prev_bend = bend
- for dup in duplicate_ids:
+ # filter must act on an iterable
+ rname = filter(self.filt_function,
+ [self.key_function(rname)])
+ for key in rname: # mdshw5/pyfaidx/issues/64
+ if key in self.index:
+ if self.duplicate_action == "stop":
+ raise ValueError('Duplicate key "%s"' % key)
+ elif self.duplicate_action == "first":
+ continue
+ elif self.duplicate_action == "last":
+ self.index[key] = rec
+ elif self.duplicate_action == "longest":
+ if len(rec) > len(self.index[key]):
+ self.index[key] = rec
+ elif self.duplicate_action == "shortest":
+ if len(rec) < len(self.index[key]):
+ self.index[key] = rec
+ elif self.duplicate_action == "drop":
+ if key not in drop_keys:
+ drop_keys.append(key)
+ else:
+ self.index[key] = rec
+ prev_bend = bend
+ for dup in drop_keys:
self.index.pop(dup, None)
+ except IOError:
+ raise IndexNotFoundError(
+ "Could not read index file %s" % self.indexname)
def build_index(self):
try:
- with open(self.filename, 'r') as fastafile:
+ with self._fasta_opener(self.filename, 'r') as fastafile:
with open(self.indexname, 'w') as indexfile:
rname = None # reference sequence name
offset = 0 # binary offset of end of current line
@@ -369,30 +504,42 @@ class Faidx(object):
clen = None # character line length
bad_lines = [] # lines > || < than blen
thisoffset = offset
+
+ lastline = None
for i, line in enumerate(fastafile):
line_blen = len(line)
line_clen = len(line.rstrip('\n\r'))
+ lastline = i
# write an index line
if line[0] == '>':
- valid_entry = check_bad_lines(rname, bad_lines, i - 1)
+ valid_entry = check_bad_lines(
+ rname, bad_lines, i - 1)
if valid_entry and i > 0:
- indexfile.write("{0}\t{1:d}\t{2:d}\t{3:d}\t{4:d}\n".format(rname, rlen, thisoffset, clen, blen))
+ indexfile.write(
+ "{0}\t{1:d}\t{2:d}\t{3:d}\t{4:d}\n".format(
+ rname, rlen, thisoffset, clen, blen))
elif not valid_entry:
- raise FastaIndexingError("Line length of fasta"
- " file is not "
- "consistent! "
- "Inconsistent line found in >{0} at "
- "line {1:n}.".format(rname, bad_lines[0][0] + 1))
+ raise FastaIndexingError(
+ "Line length of fasta"
+ " file is not "
+ "consistent! "
+ "Inconsistent line found in >{0} at "
+ "line {1:n}.".format(
+ rname, bad_lines[0][0] + 1))
blen = None
rlen = 0
clen = None
bad_lines = []
- try: # must catch empty deflines
- rname = line.rstrip('\n\r')[1:].split()[0] # remove comments
+ try: # must catch empty deflines (actually these might be okay: https://github.com/samtools/htslib/pull/258)
+ rname = line.rstrip('\n\r')[1:].split()[
+ 0] # duplicates are detected with read_fai
except IndexError:
- raise FastaIndexingError("Bad sequence name %s at line %s." % (line.rstrip('\n\r'), str(i)))
+ raise FastaIndexingError(
+ "Bad sequence name %s at line %s." %
+ (line.rstrip('\n\r'), str(i)))
offset += line_blen
- thisoffset = offset
+ thisoffset = fastafile.tell(
+ ) if self._bgzf else offset
else: # check line and advance offset
if not blen:
blen = line_blen
@@ -406,23 +553,36 @@ class Faidx(object):
offset += line_blen
rlen += line_clen
- # write the final index line
- valid_entry = check_bad_lines(rname, bad_lines, i) # advance index since we're at the end of the file
- if not valid_entry:
- raise FastaIndexingError("Line length of fasta"
- " file is not "
- "consistent! "
- "Inconsistent line found in >{0} at "
- "line {1:n}.".format(rname, bad_lines[0][0] + 1))
- indexfile.write("{0:s}\t{1:d}\t{2:d}\t{3:d}\t{4:d}\n".format(rname, rlen, thisoffset, clen, blen))
- except IOError:
- raise IOError("%s may not be writable. Please use Fasta(rebuild=False), Faidx(rebuild=False) or faidx --no-rebuild." % self.indexname)
+ # write the final index line, if there is one.
+ if lastline is not None:
+ valid_entry = check_bad_lines(
+ rname, bad_lines, lastline
+ ) # advance index since we're at the end of the file
+ if valid_entry:
+ indexfile.write(
+ "{0:s}\t{1:d}\t{2:d}\t{3:d}\t{4:d}\n".format(
+ rname, rlen, thisoffset, clen, blen))
+ else:
+ raise FastaIndexingError(
+ "Line length of fasta"
+ " file is not "
+ "consistent! "
+ "Inconsistent line found in >{0} at "
+ "line {1:n}.".format(rname,
+ bad_lines[0][0] + 1))
+ except (IOError, FastaIndexingError) as e:
+ if isinstance(e, IOError):
+ raise IOError(
+ "%s may not be writable. Please use Fasta(rebuild=False), Faidx(rebuild=False) or faidx --no-rebuild."
+ % self.indexname)
+ elif isinstance(e, FastaIndexingError):
+ raise e
def write_fai(self):
with self.lock:
with open(self.indexname, 'w') as outfile:
- for k, v in self.index.items():
- outfile.write('\t'.join([k, str(v)]))
+ for line in self._index_as_string:
+ outfile.write(line)
def from_buffer(self, start, end):
i_start = start - self.buffer['start'] # want [0, 1) coordinates from [1, 1] coordinates
@@ -458,8 +618,8 @@ class Faidx(object):
4. Seek to start position, taking newlines into account
5. Read to end position, return sequence
"""
- assert isinstance(start, integer_types)
- assert isinstance(end, integer_types)
+ assert start == int(start)
+ assert end == int(end)
try:
i = self.index[rname]
except KeyError:
@@ -467,32 +627,40 @@ class Faidx(object):
"Please check your FASTA file.".format(rname))
start0 = start - 1 # make coordinates [0,1)
if start0 < 0:
- raise FetchError("Requested start coordinate must be greater than 1.")
+ raise FetchError(
+ "Requested start coordinate must be greater than 1.")
seq_len = end - start0
-
# Calculate offset (https://github.com/samtools/htslib/blob/20238f354894775ed22156cdd077bc0d544fa933/faidx.c#L398)
- newlines_before = int((start0 - 1) / i.lenc * (i.lenb - i.lenc)) if start0 > 0 else 0
+ newlines_before = int(
+ (start0 - 1) / i.lenc * (i.lenb - i.lenc)) if start0 > 0 else 0
newlines_to_end = int(end / i.lenc * (i.lenb - i.lenc))
newlines_inside = newlines_to_end - newlines_before
seq_blen = newlines_inside + seq_len
bstart = i.offset + newlines_before + start0
+ if seq_blen < 0 and self.strict_bounds:
+ raise FetchError("Requested coordinates start={0:n} end={1:n} are "
+ "invalid.\n".format(start, end))
+ elif end > i.rlen and self.strict_bounds:
+ raise FetchError("Requested end coordinate {0:n} outside of {1}. "
+ "\n".format(end, rname))
with self.lock:
- self.file.seek(bstart)
-
- if bstart + seq_blen > i.bend and not self.strict_bounds:
- seq_blen = i.bend - bstart
- elif bstart + seq_blen > i.bend and self.strict_bounds:
- raise FetchError("Requested end coordinate {0:n} outside of {1}. "
- "\n".format(end, rname))
- if seq_blen > 0:
- seq = self.file.read(seq_blen).decode()
- elif seq_blen <= 0 and not self.strict_bounds:
- seq = ''
- elif seq_blen <= 0 and self.strict_bounds:
- raise FetchError("Requested coordinates start={0:n} end={1:n} are "
- "invalid.\n".format(start, end))
+ if self._bgzf: # We can't add to virtual offsets, so we need to read from the beginning of the record and trim the beginning if needed
+ self.file.seek(i.offset)
+ chunk = start0 + newlines_before + newlines_inside + seq_len
+ chunk_seq = self.file.read(chunk).decode()
+ seq = chunk_seq[start0 + newlines_before:]
+ else:
+ self.file.seek(bstart)
+
+ if bstart + seq_blen > i.bend and not self.strict_bounds:
+ seq_blen = i.bend - bstart
+
+ if seq_blen > 0:
+ seq = self.file.read(seq_blen).decode()
+ elif seq_blen <= 0 and not self.strict_bounds:
+ seq = ''
if not internals:
return seq.replace('\n', '')
@@ -501,7 +669,9 @@ class Faidx(object):
def format_seq(self, seq, rname, start, end):
start0 = start - 1
- if len(seq) < end - start0 and self.default_seq: # Pad missing positions with default_seq
+ if len(
+ seq
+ ) < end - start0 and self.default_seq: # Pad missing positions with default_seq
pad_len = end - start0 - len(seq)
seq = ''.join([seq, pad_len * self.default_seq])
else: # Return less than requested range
@@ -516,19 +686,23 @@ class Faidx(object):
if self.as_raw:
return seq
else:
- return Sequence(name=rname, start=int(start),
- end=int(end), seq=seq)
+ return Sequence(
+ name=rname, start=int(start), end=int(end), seq=seq)
def to_file(self, rname, start, end, seq):
""" Write sequence in region from start-end, overwriting current
contents of the FASTA file. """
if not self.mutable:
- raise IOError("Write attempted for immutable Faidx instance. Set mutable=True to modify original FASTA.")
+ raise IOError(
+ "Write attempted for immutable Faidx instance. Set mutable=True to modify original FASTA."
+ )
file_seq, internals = self.from_file(rname, start, end, internals=True)
with self.lock:
if len(seq) != len(file_seq) - internals['newlines_inside']:
- raise IOError("Specified replacement sequence needs to have the same length as original.")
+ raise IOError(
+ "Specified replacement sequence needs to have the same length as original."
+ )
elif len(seq) == len(file_seq) - internals['newlines_inside']:
line_len = internals['i'].lenc
self.file.seek(internals['bstart'])
@@ -542,6 +716,39 @@ class Faidx(object):
n = m
m += line_len
self.file.write(seq[n:].encode())
+ self.file.flush()
+
+ def get_long_name(self, rname):
+ """ Return the full sequence defline and description. External method using the self.index """
+ index_record = self.index[rname]
+ if self._bgzf:
+ return self._long_name_from_bgzf(index_record)
+ else:
+ return self._long_name_from_index_record(index_record)
+
+ def _long_name_from_index_record(self, index_record):
+ """ Return the full sequence defline and description. Internal method passing IndexRecord """
+ prev_bend = index_record.prev_bend
+ defline_end = index_record.offset
+ self.file.seek(prev_bend)
+ return self.file.read(defline_end - prev_bend).decode()[1:-1]
+
+ def _long_name_from_bgzf(self, index_record):
+ """ Return the full sequence defline and description. Internal method passing IndexRecord
+ This method is present for compatibility with BGZF files, since we cannot subtract their offsets.
+ It may be possible to implement a more efficient method. """
+ raise NotImplementedError(
+ "FastaRecord.long_name and Fasta(read_long_names=True) "
+ "are not supported currently for BGZF compressed files.")
+ prev_bend = index_record.prev_bend
+ self.file.seek(prev_bend)
+ defline = []
+ while True:
+ chunk = self.file.read(4096).decode()
+ defline.append(chunk)
+ if '\n' in chunk or '\r' in chunk:
+ break
+ return ''.join(defline)[1:].split('\n\r')[0]
def close(self):
self.__exit__()
@@ -554,6 +761,7 @@ class Faidx(object):
class FastaRecord(object):
+ __slots__ = ['name', '_fa']
def __init__(self, name, fa):
self.name = name
self._fa = fa
@@ -595,12 +803,66 @@ class FastaRecord(object):
raise StopIteration
start += line_len
+ def __reversed__(self):
+ """ Reverse line-based iterator """
+ line_len = self._fa.faidx.index[self.name].lenc
+ # have to determine last line length
+ last_line = len(self) % line_len
+ if last_line == 0:
+ last_line = line_len
+ end = len(self)
+ start = end - last_line
+ while True:
+ if start > 0:
+ yield self[start:end][::-1]
+ else:
+ yield self[:end][::-1]
+ raise StopIteration
+ if end == len(self): # first iteration
+ end -= last_line
+ else:
+ end -= line_len
+ start = end - line_len
+
def __repr__(self):
return 'FastaRecord("%s")' % (self.name)
def __len__(self):
return self._fa.faidx.index[self.name].rlen
+ @property
+ def unpadded_len(self):
+ """ Returns the length of the contig without 5' and 3' N padding.
+ Functions the same as contigNonNSize in Fasta.cpp at
+ https://github.com/Illumina/hap.py/blob/master/src/c%2B%2B/lib/tools/Fasta.cpp#L284
+ """
+ length = len(self)
+ stop = False
+ for line in iter(self):
+ if stop:
+ break
+ if isinstance(line, Sequence):
+ line = line.seq
+ for base in line.upper():
+ if base == 'N':
+ length -= 1
+ else:
+ stop = True
+ break
+ stop = False
+ for line in reversed(self):
+ if stop:
+ break
+ if isinstance(line, Sequence):
+ line = line.seq
+ for base in line.upper():
+ if base == 'N':
+ length -= 1
+ else:
+ stop = True
+ break
+ return length
+
def __str__(self):
return str(self[:])
@@ -612,25 +874,27 @@ class FastaRecord(object):
for site in var:
if site.is_snp:
sample = site.genotype(self._fa.sample)
- if sample.gt_type in self._fa.gt_type and eval(self._fa.filter):
+ if sample.gt_type in self._fa.gt_type and eval(
+ self._fa.filter):
pos.append(site.POS)
return tuple(pos)
else:
- raise NotImplementedError("variant_sites() only valid for FastaVariant.")
+ raise NotImplementedError(
+ "variant_sites() only valid for FastaVariant.")
@property
def long_name(self):
""" Read the actual defline from self._fa.faidx mdshw5/pyfaidx#54 """
- index_record = self._fa.faidx.index[self.name]
- prev_bend = index_record.prev_bend
- newline_len = index_record.lenb - index_record.lenc
- defline_end = index_record.offset
- self._fa.faidx.file.seek(prev_bend)
- return self._fa.faidx.file.read(defline_end - prev_bend).decode()[1:-1]
+ return self._fa.faidx.get_long_name(self.name)
+
class MutableFastaRecord(FastaRecord):
def __init__(self, name, fa):
super(MutableFastaRecord, self).__init__(name, fa)
+ if self._fa.faidx._fasta_opener != open:
+ raise UnsupportedCompressionFormat(
+ "BGZF compressed FASTA is not supported for MutableFastaRecord. "
+ "Please decompress your FASTA file.")
def __setitem__(self, n, value):
"""Mutate sequence in region [start, end)
@@ -660,26 +924,51 @@ class MutableFastaRecord(FastaRecord):
class Fasta(object):
- def __init__(self, filename, default_seq=None, key_function=None, as_raw=False,
- strict_bounds=False, read_ahead=None, mutable=False, split_char=None,
- filt_function=None, one_based_attributes=True,
- sequence_always_upper=False, rebuild=True):
+ def __init__(self,
+ filename,
+ default_seq=None,
+ key_function=lambda x: x,
+ as_raw=False,
+ strict_bounds=False,
+ read_ahead=None,
+ mutable=False,
+ split_char=None,
+ filt_function=lambda x: True,
+ one_based_attributes=True,
+ read_long_names=False,
+ duplicate_action="stop",
+ sequence_always_upper=False,
+ rebuild=True,
+ build_index=True):
"""
An object that provides a pygr compatible interface.
filename: name of fasta file
"""
self.filename = filename
self.mutable = mutable
- self.faidx = Faidx(filename, key_function=key_function, as_raw=as_raw,
- default_seq=default_seq, strict_bounds=strict_bounds,
- read_ahead=read_ahead, mutable=mutable, split_char=split_char,
- filt_function=filt_function, one_based_attributes=one_based_attributes,
- sequence_always_upper=sequence_always_upper, rebuild=rebuild)
+ self.faidx = Faidx(
+ filename,
+ key_function=key_function,
+ as_raw=as_raw,
+ default_seq=default_seq,
+ strict_bounds=strict_bounds,
+ read_ahead=read_ahead,
+ mutable=mutable,
+ split_char=split_char,
+ filt_function=filt_function,
+ one_based_attributes=one_based_attributes,
+ read_long_names=read_long_names,
+ duplicate_action=duplicate_action,
+ sequence_always_upper=sequence_always_upper,
+ rebuild=rebuild,
+ build_index=build_index)
self.keys = self.faidx.index.keys
if not self.mutable:
- self.records = dict([(rname, FastaRecord(rname, self)) for rname in self.keys()])
+ self.records = dict(
+ [(rname, FastaRecord(rname, self)) for rname in self.keys()])
elif self.mutable:
- self.records = dict([(rname, MutableFastaRecord(rname, self)) for rname in self.keys()])
+ self.records = dict([(rname, MutableFastaRecord(rname, self))
+ for rname in self.keys()])
def __contains__(self, rname):
"""Return True if genome contains record."""
@@ -701,13 +990,40 @@ class Fasta(object):
for rname in self.keys():
yield self[rname]
- def get_seq(self, name, start, end):
+ def get_seq(self, name, start, end, rc=False):
"""Return a sequence by record name and interval [start, end).
- Coordinates are 0-based, end-exclusive.
+ Coordinates are 1-based, end-exclusive.
+ If rc is set, reverse complement will be returned.
"""
# Get sequence from real genome object and save result.
- return self.faidx.fetch(name, start, end)
+ seq = self.faidx.fetch(name, start, end)
+ if rc:
+ return -seq
+ else:
+ return seq
+
+ def get_spliced_seq(self, name, intervals, rc=False):
+ """Return a sequence by record name and list of intervals
+
+ Interval list is an iterable of [start, end].
+ Coordinates are 1-based, end-exclusive.
+ If rc is set, reverse complement will be returned.
+ """
+ # Get sequence for all intervals
+ chunks = [self.faidx.fetch(name, s, e) for s, e in intervals]
+ start = chunks[0].start
+ end = chunks[-1].end
+
+ # reverce complement
+ if rc:
+ seq = "".join([(-chunk).seq for chunk in chunks[::-1]])
+ else:
+ seq = "".join([chunk.seq for chunk in chunks])
+
+ # Sequence coordinate validation wont work since
+ # len(Sequence.seq) != end - start
+ return Sequence(name=name, seq=seq, start=None, end=None)
def close(self):
self.__exit__()
@@ -723,7 +1039,16 @@ class FastaVariant(Fasta):
""" Return consensus sequence from FASTA and VCF inputs
"""
expr = set(('>', '<', '=', '!'))
- def __init__(self, filename, vcf_file, sample=None, het=True, hom=True, call_filter=None, **kwargs):
+
+ def __init__(self,
+ filename,
+ vcf_file,
+ sample=None,
+ het=True,
+ hom=True,
+ call_filter=None,
+ **kwargs):
+ super(FastaVariant, self).__init__(filename, **kwargs)
try:
import pysam
except ImportError:
@@ -734,9 +1059,10 @@ class FastaVariant(Fasta):
raise ImportError("PyVCF must be installed for FastaVariant.")
if call_filter is not None:
try:
- key, expr, value = call_filter.split() # 'GQ > 30'
+ key, expr, value = call_filter.split() # 'GQ > 30'
except IndexError:
- raise ValueError("call_filter must be a string in the format 'XX <>!= NN'")
+ raise ValueError(
+ "call_filter must be a string in the format 'XX <>!= NN'")
assert all([x in self.expr for x in list(expr)])
assert all([x in string.ascii_uppercase for x in list(key)])
assert all([x in string.printable for x in list(value)])
@@ -752,19 +1078,21 @@ class FastaVariant(Fasta):
else:
self.sample = self.vcf.samples[0]
if len(self.vcf.samples) > 1 and sample is None:
- warnings.warn("Using sample {0} genotypes.".format(self.sample), RuntimeWarning)
+ warnings.warn("Using sample {0} genotypes.".format(
+ self.sample), RuntimeWarning)
if het and hom:
self.gt_type = set((1, 2))
elif het:
- self.gt_type = set((1,))
+ self.gt_type = set((1, ))
elif hom:
- self.gt_type = set((2,))
+ self.gt_type = set((2, ))
else:
self.gt_type = set()
- super(FastaVariant, self).__init__(filename, **kwargs)
def __repr__(self):
- return 'FastaVariant("%s", "%s", gt="%s")' % (self.filename, self.vcf.filename, str(self.gt_type))
+ return 'FastaVariant("%s", "%s", gt="%s")' % (self.filename,
+ self.vcf.filename,
+ str(self.gt_type))
def get_seq(self, name, start, end):
"""Return a sequence by record name and interval [start, end).
@@ -797,23 +1125,26 @@ class FastaVariant(Fasta):
def wrap_sequence(n, sequence, fillvalue=''):
args = [iter(sequence)] * n
for line in zip_longest(fillvalue=fillvalue, *args):
- yield ''.join(line + ("\n",))
+ yield ''.join(line + ("\n", ))
+
# To take a complement, we map each character in the first string in this pair
# to the corresponding character in the second string.
-complement_map = ('ACTGNactgnYRWSKMDVHBXyrwskmdvhbx', 'TGACNtgacnRYWSMKHBDVXrywsmkhbdvx')
+complement_map = ('ACTGNactgnYRWSKMDVHBXyrwskmdvhbx',
+ 'TGACNtgacnRYWSMKHBDVXrywsmkhbdvx')
invalid_characters_set = set(
chr(x) for x in range(256) if chr(x) not in complement_map[0])
invalid_characters_string = ''.join(invalid_characters_set)
if PY3:
- complement_table = str.maketrans(
- complement_map[0], complement_map[1], invalid_characters_string)
- translate_arguments = (complement_table,)
+ complement_table = str.maketrans(complement_map[0], complement_map[1],
+ invalid_characters_string)
+ translate_arguments = (complement_table, )
elif PY2:
complement_table = string.maketrans(complement_map[0], complement_map[1])
translate_arguments = (complement_table, invalid_characters_string)
+
def complement(seq):
""" Returns the complement of seq.
>>> seq = 'ATCGTA'
@@ -825,10 +1156,12 @@ def complement(seq):
if len(result) != len(seq):
first_invalid_position = next(
i for i in range(len(seq)) if seq[i] in invalid_characters_set)
- raise ValueError("Sequence contains non-DNA character '{0}' at position {1:n}\n".format(
- seq[first_invalid_position], first_invalid_position + 1))
+ raise ValueError(
+ "Sequence contains non-DNA character '{0}' at position {1:n}\n".
+ format(seq[first_invalid_position], first_invalid_position + 1))
return result
+
def translate_chr_name(from_name, to_name):
chr_name_map = dict(zip(from_name, to_name))
@@ -860,6 +1193,7 @@ def ucsc_split(region):
start, end = (None, None)
return (rname, start, end)
+
def check_bad_lines(rname, bad_lines, i):
""" Find inconsistent line lengths in the middle of an
entry. Allow blank lines between entries, and short lines
=====================================
pyfaidx/cli.py
=====================================
--- a/pyfaidx/cli.py
+++ b/pyfaidx/cli.py
@@ -4,23 +4,23 @@ import sys
import os.path
import re
from pyfaidx import Fasta, wrap_sequence, FetchError, ucsc_split, bed_split
+from collections import defaultdict
keepcharacters = (' ', '.', '_')
-
def write_sequence(args):
_, ext = os.path.splitext(args.fasta)
if ext:
ext = ext[1:] # remove the dot from extension
filt_function = re.compile(args.regex).search
- fasta = Fasta(args.fasta, default_seq=args.default_seq, strict_bounds=not args.lazy, split_char=args.delimiter, filt_function=filt_function, rebuild=not args.no_rebuild)
+ fasta = Fasta(args.fasta, default_seq=args.default_seq, key_function=eval(args.header_function), strict_bounds=not args.lazy, split_char=args.delimiter, filt_function=filt_function, read_long_names=args.long_names, rebuild=not args.no_rebuild)
regions_to_fetch, split_function = split_regions(args)
if not regions_to_fetch:
regions_to_fetch = fasta.keys()
if args.invert_match:
sequences_to_exclude = set([split_function(region)[0] for region in regions_to_fetch])
- fasta = Fasta(args.fasta, default_seq=args.default_seq, strict_bounds=not args.lazy, split_char=args.delimiter, rebuild=not args.no_rebuild)
+ fasta = Fasta(args.fasta, default_seq=args.default_seq, key_function=eval(args.header_function), strict_bounds=not args.lazy, split_char=args.delimiter, rebuild=not args.no_rebuild)
regions_to_fetch = (key for key in fasta.keys() if key not in sequences_to_exclude)
split_function = ucsc_split
@@ -45,14 +45,14 @@ def write_sequence(args):
try:
if args.transform:
if not header and args.transform == 'nucleotide':
- outfile.write("name\tstart\tend\tA\tT\tC\tG\tN\n")
+ outfile.write("name\tstart\tend\tA\tT\tC\tG\tN\tothers\n")
header = True
outfile.write(transform_sequence(args, fasta, name, start, end))
else:
for line in fetch_sequence(args, fasta, name, start, end):
outfile.write(line)
except FetchError as e:
- raise FetchError(e.msg.rstrip() + "Try setting --lazy.\n")
+ raise FetchError(str(e) + " Try setting --lazy.\n")
if args.split_files:
outfile.close()
fasta.__exit__()
@@ -61,7 +61,12 @@ def write_sequence(args):
def fetch_sequence(args, fasta, name, start=None, end=None):
try:
line_len = fasta.faidx.index[name].lenc
- sequence = fasta[name][start:end]
+ if args.auto_strand and start > end and start is not None and end is not None:
+ # flip (0, 1] coordinates
+ sequence = fasta[name][end - 1:start + 1]
+ sequence = sequence.reverse.complement
+ else:
+ sequence = fasta[name][start:end]
except KeyError:
sys.stderr.write("warning: {name} not found in file\n".format(**locals()))
return
@@ -69,13 +74,13 @@ def fetch_sequence(args, fasta, name, start=None, end=None):
sequence = sequence.complement
if args.reverse:
sequence = sequence.reverse
+ if args.no_output:
+ return
if args.no_names:
pass
- elif args.full_names:
- yield ''.join(['>', fasta[name].long_name, '\n'])
else:
- if start or end:
- yield ''.join(['>', sequence.longname, '\n'])
+ if (start or end) and not args.no_coords:
+ yield ''.join(['>', sequence.fancy_name, '\n'])
else:
yield ''.join(['>', sequence.name, '\n'])
for line in wrap_sequence(line_len, sequence.seq):
@@ -114,14 +119,23 @@ def split_regions(args):
def transform_sequence(args, fasta, name, start=None, end=None):
line_len = fasta.faidx.index[name].lenc
s = fasta[name][start:end]
+ if args.no_output:
+ return
if args.transform == 'bed':
- return '{name}\t{start}\t{end}\n'.format(name=s.name, start=s.start, end=s.end)
+ return '{name}\t{start}\t{end}\n'.format(name=s.name, start=s.start - 1 , end=s.end)
elif args.transform == 'chromsizes':
return '{name}\t{length}\n'.format(name=s.name, length=len(s))
elif args.transform == 'nucleotide':
- nucs = Counter(dict([('A', 0), ('T', 0), ('C', 0), ('G', 0), ('N', 0)]))
- nucs.update(str(s).upper())
- return '{name}\t{start}\t{end}\t{A}\t{T}\t{C}\t{G}\t{N}\n'.format(name=s.name, start=s.start, end=s.end, **nucs)
+ ss = str(s).upper()
+ nucs = defaultdict(int)
+ nucs.update([(c, str(ss).count(c)) for c in set(str(ss))])
+ A = nucs.pop('A', 0)
+ T = nucs.pop('T', 0)
+ C = nucs.pop('C', 0)
+ G = nucs.pop('G', 0)
+ N = nucs.pop('N', 0)
+ others = '|'.join([':'.join((k, v)) for k, v in nucs.items()])
+ return '{sname}\t{sstart}\t{send}\t{A}\t{T}\t{C}\t{G}\t{N}\t{others}\n'.format(sname=s.name, sstart=s.start, send=s.end, **locals())
elif args.transform == 'transposed':
return '{name}\t{start}\t{end}\t{seq}\n'.format(name=s.name, start=s.start, end=s.end, seq=str(s))
@@ -133,25 +147,33 @@ def main(ext_args=None):
epilog="Please cite: Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. (2015) Efficient \"pythonic\" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e1196 https://dx.doi.org/10.7287/peerj.preprints.970v1")
parser.add_argument('fasta', type=str, help='FASTA file')
parser.add_argument('regions', type=str, nargs='*', help="space separated regions of sequence to fetch e.g. chr1:1-1000")
- parser.add_argument('-b', '--bed', type=argparse.FileType('r'), help="bed file of regions")
- parser.add_argument('-o', '--out', type=argparse.FileType('w'), help="output file name (default: stdout)")
- parser.add_argument('-i', '--transform', type=str, choices=('bed', 'chromsizes', 'nucleotide', 'transposed'), help="transform the requested regions into another format. default: %(default)s")
- parser.add_argument('-c', '--complement', action="store_true", default=False, help="complement the sequence. default: %(default)s")
- parser.add_argument('-r', '--reverse', action="store_true", default=False, help="reverse the sequence. default: %(default)s")
- parser.add_argument('-a', '--size-range', type=parse_size_range, default=None, help='selected sequences are in the size range [low, high]. example: 1,1000 default: %(default)s')
- names = parser.add_mutually_exclusive_group()
+ _input = parser.add_argument_group('input options')
+ output = parser.add_argument_group('output options')
+ header = parser.add_argument_group('header options')
+ _input.add_argument('-b', '--bed', type=argparse.FileType('r'), help="bed file of regions")
+ output.add_argument('-o', '--out', type=argparse.FileType('w'), help="output file name (default: stdout)")
+ output.add_argument('-i', '--transform', type=str, choices=('bed', 'chromsizes', 'nucleotide', 'transposed'), help="transform the requested regions into another format. default: %(default)s")
+ output.add_argument('-c', '--complement', action="store_true", default=False, help="complement the sequence. default: %(default)s")
+ output.add_argument('-r', '--reverse', action="store_true", default=False, help="reverse the sequence. default: %(default)s")
+ output.add_argument('-y', '--auto-strand', action="store_true", default=False, help="reverse complement the sequence when start > end coordinate. default: %(default)s")
+ output.add_argument('-a', '--size-range', type=parse_size_range, default=None, help='selected sequences are in the size range [low, high]. example: 1,1000 default: %(default)s')
+ names = header.add_mutually_exclusive_group()
names.add_argument('-n', '--no-names', action="store_true", default=False, help="omit sequence names from output. default: %(default)s")
- names.add_argument('-f', '--full-names', action="store_true", default=False, help="output full names including description. default: %(default)s")
- parser.add_argument('-x', '--split-files', action="store_true", default=False, help="write each region to a separate file (names are derived from regions)")
- parser.add_argument('-l', '--lazy', action="store_true", default=False, help="fill in --default-seq for missing ranges. default: %(default)s")
- parser.add_argument('-s', '--default-seq', type=check_seq_length, default='N', help='default base for missing positions and masking. default: %(default)s')
- parser.add_argument('-d', '--delimiter', type=str, default=None, help='delimiter for splitting names to multiple values (duplicate names will be discarded). default: %(default)s')
- matcher = parser.add_mutually_exclusive_group()
+ names.add_argument('-f', '--long-names', action="store_true", default=False, help="output full (long) names from the input fasta headers. default: headers are truncated after the first whitespace")
+ header.add_argument('-t', '--no-coords', action="store_true", default=False, help="omit coordinates (e.g. chr:start-end) from output headers. default: %(default)s")
+ output.add_argument('-x', '--split-files', action="store_true", default=False, help="write each region to a separate file (names are derived from regions)")
+ output.add_argument('-l', '--lazy', action="store_true", default=False, help="fill in --default-seq for missing ranges. default: %(default)s")
+ output.add_argument('-s', '--default-seq', type=check_seq_length, default='N', help='default base for missing positions and masking. default: %(default)s')
+ header.add_argument('-d', '--delimiter', type=str, default=None, help='delimiter for splitting names to multiple values (duplicate names will be discarded). default: %(default)s')
+ header.add_argument('-e', '--header-function', type=str, default='lambda x: x.split()[0]', help='python function to modify header lines e.g: "lambda x: x.split("|")[0]". default: %(default)s')
+ header.add_argument('-u', '--duplicates-action', type=str, default="stop", choices=("stop", "first", "last", "longest", "shortest"), help='entry to take when duplicate sequence names are encountered. default: %(default)s')
+ matcher = header.add_mutually_exclusive_group()
matcher.add_argument('-g', '--regex', type=str, default='.*', help='selected sequences are those matching regular expression. default: %(default)s')
matcher.add_argument('-v', '--invert-match', action="store_true", default=False, help="selected sequences are those not matching 'regions' argument. default: %(default)s")
- masking = parser.add_mutually_exclusive_group()
+ masking = output.add_mutually_exclusive_group()
masking.add_argument('-m', '--mask-with-default-seq', action="store_true", default=False, help="mask the FASTA file using --default-seq default: %(default)s")
masking.add_argument('-M', '--mask-by-case', action="store_true", default=False, help="mask the FASTA file by changing to lowercase. default: %(default)s")
+ output.add_argument('--no-output', action="store_true", default=False, help="do not output any sequence. default: %(default)s")
parser.add_argument('--no-rebuild', action="store_true", default=False, help="do not rebuild the .fai index even if it is out of date. default: %(default)s")
parser.add_argument('--version', action="version", version=__version__, help="print pyfaidx version number")
# print help usage if no arguments are supplied
@@ -162,12 +184,19 @@ def main(ext_args=None):
args = parser.parse_args(ext_args)
else:
args = parser.parse_args()
+
+ if args.auto_strand:
+ if args.complement:
+ sys.stderr.write("--auto-strand and --complement are both set. Are you sure this is what you want?\n")
+ if args.reverse:
+ sys.stderr.write("--auto-strand and --reverse are both set. Are you sure this is what you want?\n")
if args.mask_with_default_seq or args.mask_by_case:
mask_sequence(args)
else:
write_sequence(args)
+
def check_seq_length(value):
if len(value) != 1:
=====================================
scripts/benchmark.py
=====================================
--- a/scripts/benchmark.py
+++ b/scripts/benchmark.py
@@ -48,6 +48,10 @@ def make_long_fasta(filename, nrecs=250, seqlen=SEQLEN):
f.write(line)
return headers
+def bgzip_compress_fasta(filename):
+ from subprocess import call
+ call(' '.join(['bgzip', '-c', filename, '>', filename + '.gz']), shell=True)
+
def read_dict(f, headers):
for k in islice(headers, 0, None, 10):
for start, end in intervals:
@@ -89,7 +93,10 @@ def read_samtools(f, headers):
def main():
fa_file = NamedTemporaryFile()
index = fa_file.name + '.fai'
+ bgzf_index = fa_file.name + '.gz.fai'
headers = make_long_fasta(fa_file.name)
+ bgzip_compress_fasta(fa_file.name)
+
def pyfaidx_fasta(n):
print('timings for pyfaidx.Fasta')
ti = []
@@ -113,6 +120,29 @@ def main():
print(mean(tf)/nreads/10*1000*1000)
tracemalloc.stop()
+ def pyfaidx_bgzf_faidx(n):
+ print('timings for pyfaidx.Faidx with bgzf compression')
+ ti = []
+ tf = []
+ for _ in range(n):
+ t = time.time()
+ f = pyfaidx.Faidx(fa_file.name + '.gz')
+ ti.append(time.time() - t)
+
+ t = time.time()
+ read_faidx(f, headers)
+ tf.append(time.time() - t)
+ os.remove(index)
+ # profile memory usage and report timings
+ tracemalloc.start()
+ f = pyfaidx.Faidx(fa_file.name + '.gz')
+ read_faidx(f, headers)
+ os.remove(index)
+ print(tracemalloc.get_traced_memory())
+ print(mean(ti))
+ print(mean(tf)/nreads/10*1000*1000)
+ tracemalloc.stop()
+
def pyfaidx_faidx(n):
print('timings for pyfaidx.Faidx')
ti = []
@@ -276,6 +306,7 @@ def main():
n = 3
pyfaidx_fasta(n)
pyfaidx_faidx(n)
+ pyfaidx_bgzf_faidx(n)
pyfasta_fasta(n)
pyfasta_fseek(n)
seqio_read(n)
=====================================
tests/data/chr17.hg19.part.fa
=====================================
--- /dev/null
+++ b/tests/data/chr17.hg19.part.fa
@@ -0,0 +1,2 @@
+>chr17
+AAGCTTCTCACCCTGTTCCTGCATAGATAATTGCATGACAATTGCCTTGTCCCTGCTGAATGTGCTCTGGGGTCTCTGGGGTCTCACCCACGACCAACTCCCTGGGCCTGGCACCAGGGAGCTTAACAAACATCTGTCCAGCGAATACCTGCATCCCTAGAAGTGAAGCCACCGCCCAAAGACACGCCCATGTCCAGCTTAACCTGCATCCCTAGAAGTGAAGGCACCGCCCAAAGACACGCCCATGTCCAGCTTATTCTGCCCAGTTCCTCTCCAGAAAGGCTGCATGGTTGACACACAGTGcctgcgacaaagctgaatgctatcatttaaaaactccttgctggtttgagaggcagaaaatgatatctcatagttgctttactttgcatattttAAAATTGTGACTTTCATGGCATAAATAATACTGGTTTATTACAGAAGCACTAGAAAATGCATGTGGACAAAAGTTGGGATTAGGAGagagaaatgaagacatatgtccacacaaaaacctgttcattgcagctttctaccatcaccaaaaattgcaaacaaccacacgcccttcaactggggaactcatcaacaacaaacttgtggtttacccacacaatggaagaccacttagcaacaaaaaggaccaaactcctggtacatgcaactgacagatgaatctcaaacgcattcctccgtgtgaaagaagccggactcacagggcaacacactatctgactgtttcatgggaaagtctggaaacggcaacaccattgagacagaaaacaggtgagtggttgcctggggccagggaactttctggggtcatattctctgtgttgattctggtggtggaaacaagactgtCccagcctgggtgatacagcgagaccccatctctaccaaaaaattaaaaattagctgggcatggtggtgcatgcctgtagtcccagctattcacagtgctgaggtgggaagatgcttgagcccaggagttcaaggctgcaatgagctatgattgcgccactgcactttggcctggacaacagagcaaaaccctgtctctaaaaaaagaaaagaaaagaaaaaCTCACTGGATATGAATGATAcaggttgaggatccattatctgaaatgcttggaccagatgttttgaattttggattttttcatattttgtaatctttgcagtatatttaccagttcagcatccctaactcaaaaattcaaaaatctgaaatcccaaacgcgccaataagcattccctttgagcgtcatgtcggtgcttggaatgtttggggttttggatttacagctttgggacgctcaacctgTACCTCAATAAACCTGATTTTAAAAAAGTTTGGGGGGATTCCCCTAAGCCCGCCACCCGGAGACAGCGGATTTCCTTAGTTACTTACTATGCTCCTTGGCCATTTCTCTAGGTATTGGTATATTGTGTCTGCTGTGAACTGTCCTTGGCCTGTTTGGTGACGGGTGAGGAGCAGGGACAGAAGGGTCCTGCGTGCCCTGCCTTCACAAGCCCCTGGAAGGAAAGTTGTTTTGGGATCTCTGCACCCTCAGCCTGGACAACTTGTGCCCATCTGGTGACCCCTCACTCAGCCACCAGACTTCCACGACAGGCTCCAGCCTCGGCACCTTCAGCCATGGACAGTTCCGCCAGCGTTGCCCTCTGTTCTGCTGTTTTCTCTACCAGAAGTGCCCTTCCCTCCTCACCTGACCACTCTGGGGAAATCCCTCAGCACCCTCCCTGAGCATACCCTACTCTGGCACAAGCCCACCCTGCAAAGCCCCTGAGGCCCGCCCTGTGGCGTCTCTCCCTCCCTTGCTGTCAGGACAGTGGTCCTGGCCACCGGGGCTCACGGAGCCGCCCTGTGCCGTGTACCTCTGAGCCCTCTGCACAGTGCCTTCTGCTTGCCTGTGGCTTTGAGAAGAaaccccttctggttatacataagacagccagagaagggagttgcccagggtggcacagcacgttgctgccagTTACTGCCATTTTCACGGGCATGAAATGGAGATAACAACAGGAGCGACCGCACAGGCTGCTGAGCGCGTCACACGCAGCCATCGCGCAGCTCAGGGATATTACGTGTAACTCGACATGTCAGCGATTGTCACAGGCACTGCTACTCCTGGGGTTTTCCATCAAACCCTCAAGAGCTGGGCCTGGGGTCAACTTCCGGCCTGGGGAAACTGGGGCAAGTATCACCAGAGATGAGCTTTATAAAAATAATGGTGCTAgctgggcatggtggcttgcacctgtaatcccagcactttgggaggccgagctaggaggatcgtttgagtccagcagtttgagaccagcctggccaatacggcaaaacccagtctctacaaaaaatacaaaaaacaactagccaggcgtggtggtgcacacctgtagtcccagctactcaggaggctgagggggaaggactgcttgagcccaggagtttgaggctgctgtgagctgtgatcgcatcactgcattccagcccggtgacagagtgagtcactgtctcaaaaaagaaaggaagaaataaagaaaacaaATAAAAATAATAGTGCAGACAAAAGGCCTTGACCCATCTAGCTTTGGCCCTCAGCATCAACCGCTAGATACGTCCCTCCCTTTCTTCTGGGGCACAGGTCACACTCTCTTCCAGGTCTAGGATGCAGCTGAGGGGTGCCCCTCTTACCATCTAATCTGTGCCCTTATTTCCTCTGCTTTAGTGAGGAAGAGGCCCCTGGTCCATGAAGGGGCCTTTCAGAGACGGGGACCCCTGAGGAGCCCCGAGCAGCAGCCGTCGTGTCTCACCCAGGGTGTCTGAAACAGATGTGGAGGTCTCGGGTGAGGCGTGGCTCAGATACAGGGAGTGGCCCACAGCTCGGCCTGTCTTTGAAAGGCCACGTGACCTGGCCCACGGCTGGCAGGTGGGACCCAGCTGCAGGGGTCCAGCAGCACCCACAGCAGCCACCTGTGGCAGGGAGGAGCTTGTGGTACAGTGGACAGGCCCTGCCCAGATGGCCCCCCGCCTGCCTGTGGAAGTTGACCAGACCATCTGTCACAGCAGGTAAGACTCTGCTTTCTGGGCAACCCAGCAGGTGACCCTGGAATTCCTGTCCATCTGGCAGGTGGGCATTGAAACTGGTTTAAAAATGTCACACCATAggccgggcacagtggctcacgcctgtaatcccagccctttgggaggccagggtgggtggatcacttgaggtcaggagttcaagaccagcctggccaacatggtgaaaccccgtctactaaaaatacaaaaattagcctggcgtggtggcgcatgcctgtaatcccagctacttgggaagctgagggatgagaactgcttgaacctgggaggcagacgttgcagtgagctgagatcacgccactgcactccagcctgggcaacagagtaagactctgtctcaaaaaaaaaaaaaTCACACCATTTTGGCTTCAGATTGCATATCCTCCTGCAAGGATATATACGCGTGAAATTCAAGTCAATGACAAATCAGAAGAAAAAACATATATATACGCAAACCAGTATCCTACTGTGTGTGTCGTTTGTTGTGTTTTCGACAGCTGTCCGTGTTATAATAATTCctctagttcaaatttattcatttttaacttcatagtaccacattctacacactgcccatgtcccctcaagcttcccctggctcctgcaaccacaaatctactctctgcctctgtgggttgacctattctggacacgtcatagaaatagagtcctgcaacacgtggccgtctgtgtctggcttctctcgcttagcatcttgtttccaaggtcctcccacagtgtagcatgcacctgctacactccttcttagggctgatattccaCGCACCTGCTACACTCCTTCTTATGGCTGATATTCCACGCACCTGCTACACTCCTTCTTAGGGCTGATATTCCACACACCCGCTACACTCCTTCTTAGGGCTGATATTCCACGCACCCGCTACACTCCTTCTTAGGGCTGATATTCCACGCACCTGCTACACTCCTTCTTAGGGCTGATATTCCACGCACCTGCTACACTCCTTCTTAGGGCTGATATTCCACGCACCTGCTACACTCCTTCTTAGGGCTGATATTCCACGCACCTGCTACACTCCTTCTTATGACTGATATTCCACGCACCtgctacactccttcttagggctgatattccactgcagggacagacttcatttgtgtatccattcatcagtggatggacacttggggtgtttccacttttggctgttgtggatagtgctgctatgaacattcctgcacaagttttaggatggacatgtttttcttatctcttgggtatataacaaggagtggaattgccagatcaaatggtgattctgtgtttaactttctgaggaactatcagctgcttcccaaagtggccatcccattattctcaTAttatttttatttgtttattatattttgagagtgtctcgctctgtcaccctggctggagtgcagtggtgtgatctcggctcactgcaatctccacctcccaggttgaagtgattttcctgcctcagcctcccaagtagctgcgattacaggcgcccgccaccacacccagctaatttttttatttttagtagagacgggtttccccatattggccaagctggtctcaaactcctgacctcaggtgatccgtgcgcctcggcctcccaaagtactgggattacaagcacgagccactgcactcagccTAATTTATTACTATTTTTAACTGTAGAGACAAGGTCTCAGTATGTTGCCAAGGCTGGTctcaaattcctgggttcaagcaatcctctcaagtagttaggactacagggacatgccactacaccaggctaatttttaatttttttatagagatggaggtctcactatgttgcccaggatggtctcaaactcctagcctcaagcaatcctcctgccttggcctcccaaagtgttcagaatgtaagtgtaactcactgcacctgCCACATACAATTTTTAAAGTGACAGAAATATGATCATGGCCATGGGAATGGTGCCCCTAGAGCTGTGCCACGAGGAGTACTGGCCTCTTCATGGTGCCAAGATGTCCCTGAGGCCTTAGTCACCTGGGTCCTTGGTGTCCCCTAGGTCAGGGCCATCCCTCTGTCATTCCCCCTCCCTGAAGCACCTGCCCCTCCTTTCTGCTGAACTAAATTCCTCCCCAAGTCTCAGTTTTCCAGAGTCTCCCTGTGAGCTCACACTCATACCTACCTAGTTTCTGAAGAGCCCCGAGCACTGACGGTAGTCACTGTGGCACCTGTAGCACCCTCTCCAAAGGGTCGCCAGCTCCTGCCTGGCTCTCAGAGCTACACAGCCTCTCCTGACCAGGCTCACAGCTCCACAGCTCTGCGGCCCAGGCCACCTGGCATGGCCCCCTGAGTTAGTCTCTCCCCAGGCCCACCATCAGCCCTGTTGGTAGAGCTGGGTGGACTCTTATCCGCATCTGTAGCACCATTATAGGGCTGGGCAAATGTGGGCAGTTGCAAAGGCCTGGCAGAGTTCCTCTGCATCCTCCCCCAGCCTCCTGGCTACCCCCGGCACTGGCCCCGCCTTCTGTCCCCTCCTGCAACTCATGGCCCCTCCTGGGCCCCTCAGTCACAGAGAGGCCTGGACATAGCATCAGGTGATAACtaatacctgcatgaccctgggaagccactcagcttctctgtccctcagtttccccatctgtgaaatgggctggccatgcttaacccctggagttgCCAAGGTAGCCCATCAGGGAACACAGCGCCCCTGTACCTCAGGCACTCCCTGGGTGCCTGCCACCTGGGACCACGGAGCCGGCACACGGACCCCCGTCCTTGGAGGTGAAGACGTGGCAGGTGGTCACGCGCACGGCACACTCACGTTTTCACGTAGGGGTCCGAGTAGCCGTTGGCGTCCATGGCGGCCAGGTGGGCGCACCGCACGATGCCTACCAGCAGGCCTTGCTTCTGTGAGCTGTACTTGAGGGAGATGAGGATGCGGCCCCGCTCCTCCAGGGACTTGTCTTCAGTCTTGTCCACCTGTTGGACGGGACGGTCACTCAGTCCTCACCTGCTCCCACCCCTCTCTTGGCCGTCCTTGGCCTCCTCTTCCTGAGCCCGCCCATCCGGCTGCTGCAGCCGGGCCTGGTCACGGTCCCTGGTGAGTGGCCTCACTGTGAACTCACAGCCCTTCTGTCATCTCTTCCCTCCCAAGGGCTCTTCCAGGGCTGGATCTGACCCACACCCTCCCTGTTCTTGGCTGCATGTGGCCTACGGTGGCCTTCACAGCCCAGCAAGGGCCAGCCcaggggttggcaactgcagcccgggggccagatcaggcccgacgcctggctctgtgtgggatgtgagctaagcatggctcttacccttatcttaaacaatttctttttcaaaaaaatagagacaggggtctcactattttgcccaggctgatctcaaactcctgggctcaagggatcttcccaccttggcctcccaaagtgctgggattacagacatgagccaccgtgcccagctggttcttactttttggaatggcagaaagaaaaatgaaatgaaaaatattttgtgacacatgaaaatgacatgaaattcacacttcagcttccattaacaccgtgttgttggaacgcagccttgccagctccgtgatgcttctctatggctgcttttgccctaaggacagagctgcatggtggccacagattgcgaggcacacagagcctaacattagcgctaagcggcccttcacggGTCTGCAGCCCCCGGCCTGGCGCGCTCGGCTTCCGCGGACGGTCTTCCACCAGGTCCTCCTCACAGCCCGCCCGGACTTTCCCTTGGCTTTGAAGCCGGGTCCCACCTCCACACCTTTGCGCACCCTTAATCCACAGCTCCAGCACCCTCCTCTGACCTCCTCAGTGTTCACCGTTGATAAATTCCTGGTCCTTCTGGGCCTGGGGGGCTTTTCTGACCCTCAGGGTGGGTTGGCCGTCCCCTCTGCGCGCCCACTGCCTCGGGTTTATCCTGTCATGGCCTGGTGTGGCCCCCTGCTTCCTCAGGGCTCTGCCCCAGCCTCTAAGTTCCTTGAAGTTGGAACTCTACCTGTAAAttttgattttgacagagtctgactcttgtcacccaggctggagtgcagtggcatgatcttggctcactacaacctctgcctcccgggtttaagcgattctcctgcctcagccttctgagtagctgggactacagtctaatttttctattttttggtagagacagggttttgccatgtctgccaggctggtctcgaattcctgacctcaagtgatccacccacctcagcctcctgaagtgctgggattacaggtatgagccaccgcatccagATAATACCTGCGTCATCTTTAATACCTCAGGGCTGGGCAAGGCCCTGGCACTAAGGGCCCACTGGAGGCTTGTACTAGTGAAGGCAGGAATGGAGGGACAGATGGTCCCCTGTGCGTTCAACCTCATACAAGCAACTCCAGCCTGGAGGTTCAAAGAAAGGCAGAAAGGTCTGCCCATGACAATGGAGCCTGGTGGATGGAAAAGGGTCACCGTGGCCCCAGTTCCCTGAAGGTGCTACCTGGAGCCTCTCAGGGTGCTGGATGTGGCTCCCTCAGGAGAACCCCGAAGACAGAGTTCTGGTACATTCCTCACCCTGGAGAAGCTGGGAGCCAAGTAGGGAGCCCATCCAGTGCCTTCCCTGCCTCAGCAGCTGCCAGTGCCTGCCTCTCAAGGCAGATCTCAGCTCCAGGCCTCCCCATCCCCAGCCAGCCTGATGCTTCTCCATTCCTTGCCCTCCCGAGACTACCCGGCTTCTTCCTTCTGGCTGCTGCACACCCCAAACCCTCTCTCCTAGGCCCTCAGCCCCCCCAGACAAACCCAGCTTGAGCCCACCCACACACGCCCAGCTTGAGCCCTGTGTCCCCCTGGATCCCCTTAGGACTCTACCAGGCTCCTCTCCCAAGCAGCCCCAGCCACCTCCTTTTGCCCACAACCGTGCCCTGCCTAACACCAAAGTGAGTCCCTCTGGCTACATCTAACCCTTTCTGGAGATGAGACTCCTGTGAGGTGAGCAGGGATGTTGAGcagggaagacatacatgccccaccagccaccttcgccacctcccacgcccagagcagacatggctaatcgatcctagcatttgccctaggctccaatgccacctcagaatccttttcaacacagtgctcaggcagccattcctggtggatcaggccaggcctgcgagatCTCGCTATCTGCCGCTCAGGCAGAGCTTTGTGTGTGAGGGCCTTGATGCCCTGTGCTGCTTGTGTCAGTGTGTGTGGTGTGAAAATTAAACAATATAAACTAGCAGGggccaggcacagtgactcatgcctgtaaccccagcactttgggaggccaaggcgggcagatcacctgaggtcagcagctcaggaccagcctggccaacatggtgaaaccccatctctactaaaaatacaaaaattagccgggtgtggtggcgcatgcctgtaatcccagctattgggaggctgaggcagaagaatcacttgaacccgggaggcggaggttgaagtaagccgacattgcgccactgcactccagcctgggcgacatagactccatctcataaaaaaaaaaaaaaaacaaCCAACCAGCAGGCATATTTTTAGCTCTTTTTTCAGGGGTGGGACATTGTACTTCGGGTTGTTTCATATGAAGACCACTGGGTCTTGCTCAGTATTGACTTAAAACAGATAATGTTCGCAGACTGTAAATTCTAAAACCTACCGCCAGAggcctggcacaggggctcatgcctgtaatcccagcactttgggaggccgaggcaggaggatcactggaggtcaggagtttgagacttgcctggccaacatggcgaacccccgtctctactaaaaataccaaaattagccaggcgtggtggcgcacacctgtaatcccagcactttgggaggctgaggcaggtggatcattcaaggtcaggagtttgagacacctggacaatatggtgaaaccccatctctaccaaatatacaaaaattaaccaggcgtggcggcacacgcctgtagtcccagctactcgggaggctgaggcatgagaattgtttgaactcaggaggtagaggttgcagtgaacagagattttgccactgaactccagcctggatgacagagcaagactcagtctcaaaaaataataataataaaaGTACCACCAGAATGTGGCTGTACTGTCAGGGGTGCATCCCCAGCTGCACTCCTGCGGTCACTGTGAGTCCCTGAACGGCACCAATGGGCCGGTAGCGCATCCAGCAACGCCCTGATCATGGCcacgcacagggacgcacatgctttcacgaacgcacaccacacatgtggacacacacactgtcgcacacagacacgtactgacatatgctcttacacacaattcacacacgagcacacacacacacacgctgacaccccacgtacatacccacGTGGTTGTTTGTTTATGCCAGTGATGAAAACTCAGGAACACTAAGGCAGGGCTGGTGTTGCtttttttttttttttttgagacagagtcttgctcttgctcttgtcacccaggctggagtgcaacggtgcaatctcggctcactgcaacctccgcctctcaggttcaagcgattctcctgcctcagcctcctgagtagctgggattacaggcatgcacccccacacccggctaatttttttatttttagcagagacggggtttcgccatgttggccaggctctctcgaactcttgacctcatgacccacctgccttggcctcctaagatgttgCCTTTCTTAAGTGACATAGACCATGTGGAAAAACCGGGTTACCTGTGGTTAGTGACTAACAATAAAACAGGAAAGGTTATATCCATCACACAAATGTCTGAGGGGGAGAGAATGTGACAAGGAATAAAATTGGATCAAATTCTGCAAAAGTAACTGGGATTCTGGGAAGAAGCCGTGGCCTCAGGCTGACTCGCCCCCGGGGCTTGACTTGGGCTAAGCTCGAGGTGAGTCCACGTCCCCGGGCCCCACTGGGGCTGGGTACACTGGGGACAGCCGCCGGGCTCTGTCCTCCCAAACTTGCCCCTTGCCCAGTCCTCTTAGGGGGACAACGTGCCATCGAGGGGACCATGCCTCCGCCTGTGTCCTGAACGCTGGGAGGCTGAGGCCCCAGGATTTCTCTTGACCCCAGTGGCACGGGGGACTCCTGGCTTCACCAGCCCTATGAACCAGGTGAAGGTGAGGCCATAGACAAGGGAGGATGGGGGAGGGAAGAGGGACATAGAGACCAAGACTCAGAGGGCGTAGCTGCTGGAGCAGGCCGAGGGCAAATCTGTTCTGACATAACGTTGAGACAAATGCCATTTCTAGGAAAGGAtactctgctgtctcctctgcgtatctcacaggcactcaggtctaacatgttccaagcgtgctccttgcgcgtcctgtccacccgtggtccctgctgagtcctctgagtgcagcaaacagcccctcaggcttcagtggctcaggccccaaacctcggatctgcccttccctcacccaaggacgtcctatctgctgcctgcacatctggttcagaatcaagccctcctaccgctgccaaggtcaggccagggttgtgcccatctctccccatctccccaggcctcctgccctctcctccctcttcttcaagccatcctgagcccaggccagcgagcctggtaaaatgtcatcccccatgatccctcagcttagcaccctcccgtggccactcagagtgaaagccagggtccttcctcacctccacccccttgactctccatgctcacctccccggtctcccctcccctctcactctgcccctcATGAGTCCCATCACAGGCAGGAAGTTctgccttcccagcacctgccaccgagccaggtacacagcaggtgctcaatcaatCCTCTCACCGGCAGCTGCTTCTCCAGGCAGATGCTGAAGGTCTTGGTGTGGTTGGGTTTCAGCTTCTTCAGGGGCACACGTGTCTCCCCGATGAACTCATTGTGCCGGAATTTGTCCTCGTCACACACAGAGATCCTAGAGGGGGCGGTGGTGAGGGGCACAGCCAGTGCCTCAGACGCACTGGGCATGGTGGAGGTGTGCGCAGGTAGGGCCAGCCCTGGCTTCTCCTGCCCCAAGCCCTGCCCTGGTCTGGGGTGGGAGACGCACAAGATGCCTGGGCCCTGACAGGGGCAGAGTGTGGCACGATATCAGGCACTGTCCTCATGGACAAGTGTCCTCAGGTTGGAAGAGGGGACAGGAGAAGGCAGAACCAGTGCCAGGAGTAGCCAGGAGGCTGGGAGAGCCGGTTCTCTGGAGGGAACCACCTCCAGCACCCTGAGAGGCCCCAGGAAGCACCTTCAGGGGACTGGGGCCAGGGTGACCCTTTTCCACCCACCAGGCTCCATTGGCGTGGGCAGACCTTTCTGCCTTTCTTCAAGCCCCCAAGCTGGAGTTGCTGGCATGAGGTTGAACCCACAGGGGACCATCTGACCCTCCAGTCCTGCCTTCATTAGGAAAAGCCGGGTGGGAGTAGGGGTTGGGGAGGGAGCAGGCGGCCTGGGACCCTCACCCACCGCAGGGTCTTGCGGATCATGTCTTCATCTGTGATCCCGTAGTAAGTGAGGGTCTCGTTCCATGTGGGGTTCAGAGTGTTACGGAGAGTTTTTGTTCTGAGCTTATTTGCCTGGAGAAGAGAAAAATGATCTTATTAGCATCAAAGTGtgtatcaaacagaacaatggcccccagagatggccacgtgctcatcttggagcctgtgaatgtgttatcaacatggccaagtggactgaggctgcaggtggacttagggttggtaatgagctgacattagaatagggagattatcctagattgttgggtggcccaatgtggtcacagggttcttaaaagcagaagaatggacagagaagacagtcagggacgtcaccaaggaagggggccagagagatgcaatggggcccgctgtgaaggtggaggaaggggccacaagcccaggagggccgatggcctctagaagctggaaggagctaggaaactgtgggctccccgggctccagaaggaatgcagccttgccgacaccttgattttagcccagagagagaccactgctggacttctaacctgcagagcagtccgagagtaaacgcgctgctttaagccacgaagtttgtggtcatttgttgcagcagccgtaggaggctcatccaAGGAGGACCCCACCTCCAGCCCGATGCCACGGGGTAGGTTCTGCAGGAGGCCTGTGTGGAACTGGAGTCTCCTTCCCAGGGCATTCCCTGCCTTTGTGGACCGTCCCCTCTGCCTGGAACCCATCCCGTCCTTGGGTCTGCCTCGGGGGTGGCCTCTCCGAGCTGGAATATTGCTGCCACTCCCTCCTCCTGTGCCAGCGGCTCTGAGCTCCTTGTCTGAATCGGTCTGATGCCTGCCACACCCGgcccaggctgggaggtcagacagcctgggcttccaattccagctctgtggcagcatcaggttccccactgtggaaagggcacaggaatccctctctcccattgctttcaggagctgtttgtaaggagactgctctttataaaacactagggaaagtcctggGGACTTCCTACAACTCGGCAGCCATGCACTGCCGGCTCCAGTCCCACAAATGAAGGGTCACTGAGCACACTTCCCTGGTCATACTCGCCCTCGGCCCTCATATCCCTGAGCCCTTCTTGCAGCATAAGGGCATCAAGACCCTGTGTGGGGAGCCCATTTCTTCCCCAGGAGTAGGTGGTGGGGCACTGCCATTTCTCCTACAGTCCTGCCTTCCCTAGAGAAGGGGGAAGGGCCCCTTCTGGGGTGGTCCTCCTGGGCTGCTGTGGGCCCCAGGCCTACCCATTTAGGCCTCCACGTAAACTCAGACTCTGCCCCATGAAATTAAAAATAAAACAGCACTGAAGTGTAGGATGCACAGAAGGAAGTCTAATTAAGTTCTGACTTTCCATGGCTCAAGGTCACCTTGATGCTTTTATTTAACAGCAATTTTTCTCTCTCCTGTCCGAGATGCACCTGTTGTGCTGGTGCCCTGAGCACATGTGTGGTCGGACGCTTGGGTCATCCGGCGTAGGAATGAGGGGCAGATTctgcccacttcatcccccttgcggagagtcagggaacagcccatcaacacgttagatgctctgagaagtcctgtggtacagaaacctgtttaacaatgtttccccatgttatttaaccacagagtcctttgtttGCACCTAATAGTAACTTCCAGCTGAGAATGCTTCTTATAAAATGCTGGCCTGGAAATTCTCCAGGGGTTTCCCCAGATCAGCAGGTAGGTAACGCTGACTCGTGAGGTCCCCAGGAGGCAACAGGGTGTGGGGCAGACTGGGTGCTCTGTGTGCAGTTGAATGGCTGAGGTCCCATCAGCTTGCTTGGACTCTGAGGGGAGTGGCAGCTGTGGGCCTCCCTGGGTGCCCGCAGCCCAGTCCAGGCCCCAAAGCAAAAGGACCAGAGTGCATTGCCTGAGGTGGTGGGCGGGTGCAGCTGGTGGGCAGGCCTGGGTGGAACCCCTCACCTTACTGGCTCCTGGCAGCAGGTGCAGCTTGACGTAGGGGTCTGCCAGCCCATTGTGGTCCATTGGCTTCAGGCCCTGGGCAGAGAAGAGCAAACGGTGTGAACTGGAAATCGGGGACATGGAACTGACAGGGCCTGCAGATGCCCTTGTCCCCTGGGTCTCCTGGCTGGGCCCATTCATGGCCCCTTTCACAGAAGCCCCCGGGCCCCACACCTCCAGATGGAGGGAGTGAGGATGGCTCAACGCTGATGCAATCTCCCCTCCCCAAAACATCCTGGCCCAGGGCAGCTGTCATGGGGCCTCTGTCAGGACAGACTTTCTCTTGTCCTCATCCTGGAAGCCAGGTGGCTCCCCCTTCTCTGAGCCTCCCTCTCCACCCTGCTTTGCTCAGGCCCTGCCATCATCCTGACTCCCGTCTCCCATCTACCAAGCGTGGCCTCCTGAGTGGCCATGCCCCTCACTGACCCTAAATCCACATAACCCCGGGACACAGTCTGAGACTAGGCCCCATTTCTGGCCATATTTGCCACAACTCCCAGTCAGTCCACAGAGATCACACAAGCAGACTGGCCACAGAGGGTCCCTCCACAAGATGGCACCCCTCTCTCACTCTGGTTCTTTCAGAAAACCCAGCCACGCGTGCACAGCCAGAGGCACACAGAAGCATGGACAGAGAGGGGCTTTGCTGACCTCAGCAGGCTCTTGTCTCTGAAAATTGCATTTGTTTCCTTCTGGGATGGTGCACAGGGATGCAGGGAGGCAGCTGTGGCATCTGTGGGGGTGCAGCCGGCAACCCCAGGATAGTATTCAAAGGGCTTCTACCCAGCCGGGTCAGGGAGGCAGCTCGGCGGCTCACACCAGGCCCATCTCCAAGGGGAGGCTGGGGCTCCCTCCCAGGCCCACCCCACCTCGGTTTGACGGAGCTCTGGGGGATGCGGCAGGGGCTTCCAACCACCTACTTCCCTGCcactggcaaagggcagaggtttgggggcctggctggcttgggctccaggctcagctcctgttcaccagctgggaagtggatgagttactttgcacgctaagcctcagctttatcatctgttaaaggggttggccgggcacggtggctcacgcctgtaatcccagcgcttgagcccaggagttcaagaccagcctgggcaacacagtaagaccctgtctcttcaaaaaactaaaaaattagctgggtggggtggcatgtgcctgtggttccagcaactagagaagctgaggtgggaggatcgcttgaacctgggacacggaggctgcagtgaaccatgatcgcaccactgcactccagcctgggtgacagagtgagaccctgtctcaagtttaaaaaaaaaaaaaaaaaaagggcgggggtgtaatactcccaccttcctagggctggagtgagagggagggagactgtgtggacacagggcctggcaggcagGGTTTGCTCCCTTCCCTGCACGCCAGGCCCCCGTGAGACGCCGGAGAAGCGGTTGAGGTCTGCTGGGTGCATCTGAGCTGTGCCCTGCACTGTGGGGCTGCCTCTGGGACAGGGCCTGGCTCAGTGGCCCAACACGTGCTCAAGGTGACAAGGCTGCTCTTCCAAACCCAACTCCCTCACCCTGCCTTGCACCTTTCCCACCAAGTGCCTGAAGTGCAGAGTTCTGGGAATCTGGTGGCCTGGCCTGACCAGGCGGTGCCTGACTGCTGTGTCTTATTCTTACAGTCCTTATACATTTGTTTGTTGTAGCTTAAAATGTTTCCATTTGGGCCACGTTTCCTGACACGTTTCACTTACAAACCTTTGGGCCACCACATGTTCTGGCGTATGTCTTTCCAGTGGCTCTGGTGGTCCCACCTCCGTACCCCGAGACACCCGTGGGAGCTAGGCTGGGTCGAGGGGCGCTGTCATGCTGGGACGGGAGTGATGCCTGCTCTGTCCTTCAGACCCCGGGACTGTGACGAGGCCCCTGCTGCTGCTCTCTACTCCTCCACCCGGCAGTTCCCTGAGCACCCTGGGCAGCCACACATTTCGCCAAAAGCAAAAGGACCAGCAGGCACTGCCTGAAGCCCTGTCACTGTGCTGCCACCAACTTCTGTGCCCAAACAGGCAGCTTCCCTGGTCCTGACCCGGGGTTCCGCCAGTGCCTCCACCTTCTGTGGGGCTGGCCTCCCACGGAGCCTGACCTCTGTCCACGGAAGGGGAAGGTCGGGAGGCTGTTTCCAGGGCAGGGAGCTGCTAGTGGGGCCCTTGGGCACATGCTCCCCAGCTTGGGAGTTGGCAAGAGACATAAACTGATCCCATGGTTTAGGGGTGGGCCTCACGGGAGGtgacgtggtttggcggtgtccccacccaagtctcatctggagttgtactcccataattcctacaggttgtgggaaggacccagcgggagataactgaatcacaaggggaggtttctcccagactgttcttgacgtagtgaatacgtctcacgagatctgatgatctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACACTCTTCTTGACGTAGTGAATACGTCTCACGAGATCTCATGGTCTGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctgatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctgatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctgatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActgttcttgaggtagtgaatacgtctcacgagatctcatggtctGACActcttcttgaggtagtgaatacatctcacgagatctgatggcctgataaggggaaacccgttttccttggctctcattctctctcttgccaccaaacatgtgagaagtgcctttcactttcagccataactgtgaggtcttctcagccacgtgtaacggcaagtccaataaacctctttcctttataaattacacagtctcaggtatgtctttatcggcagcatgaaaatggactaatacaGGGAGGATGGGTGGAAGCGGCCCCGGGGGAGGCCCTGGCTGGTACTGGCACTGAGGGAAGAGATGGGGGGTCTGGCTTTGAGAGGAGAGCTTCTCCCCAAAAACCTAGCCCTGCCCCGCCCTGGGCCTCTCAGAGGCTGTTGCTGGTGAAGTGTTCGGAAGAGGAGCTTTCTAGTCtgaagtatcattcagcctgaaaaggaagttttgacacgtgctgcaatggggatgaagcctgaagacattctgcggagtgaaagaaggcagactcaaaaggacagatcccggggactgcagactcaaaaggacagataccgggagactgcacttacatgaggtccctagaatagtcaaatccatagagaaggaagccaaatggcagcctccccaggggccaggggaggaggaagggggagctgttgtttaatgggtccagtttagttttgcaagctgaagagggctctggagatgggtttcacagcagtgtgaaggtagaaggtacttagcacaactgaactgtacgctaaaaatggttgagCGCTGAGGGAAGAAGTAAAAAAAAGTGGTTGAGGTGGGAAATGTATATCTGTGTATTTTACCACAATAAAAATAAAAAGTCTCCCAGAACTGGTAgtgccaggggccacgtgttaactcatttaatgctcacaacaggcatgtagggcagggacaaccaaccctatttacagatgggcaaactgagactGACCCTTATAAGGGGGGACAAGCAAGGGTGCACCCCAGGGTGTCCAGCCCCCACCCTGGCCCTCCAGAGGCCAGCCCTCCTTCAGCTCACCCACCCTGGGCCCCTCCCCACACCCCAGCCCAGAGCCCCAGCTCTTCCCCAGCCTGCACCACCCCTTCCCTACAGAACTGGATTTACACAGAGAAGGAACTGGGCCTCCCACCCCCACTTCTGATACCTGAGGGATACAGCCCAAAGTGGacacacacttacatgtgtgcacgcacggtaccacacatgtacacacagagacacacatacagccatgtatgtgcatacacacaaacgcacCTGGAGCTGGGAAGGGAAGGCCCTGGTGTCTGGCATGGAGAGAGGAAGGGGTGGGCTTTGGCCAGAGTGGCCTGGCAGCCGGCACCTCTCCAGTCCCCAGGCCTGGACCACCTCTACAAAGTTGGACAGAGGGAAAGGAGGAAGGGTCTAGCTTGGTCTCTACCTTGGCACAGCTGGGATTTGACAAATGCTCAGTTCTGCTCCTAGGGGTGGGCTGGAGCCCCCGCCAGGCAGGGCTGGACATGCCCTGAGTCATAGCATGGGTGGTTCTAGAGAGGGCAGGGGTGGGATGGAGCGTGCAGGCCTCTCAGTGCCCTACCAGGGCCCTGAGGCTTGCGTGGATGGCACTCACACCTACCCATGGCAGTCCACATGTGGCCCAGGCTGGGCTGGGGGACAGCCTGGGGTGGCACGCAGTAGCCTGTCCTGCTGGGTGAGCATGCTGCCAAGGGCAGCCCTTGCTGCCAGGCTGGGAGGAGGGGCAGGGGGCCTGCAGGTTGGGAGGCTGGGTGGGGCCTGGGCCCAGGCAGCTCTGTGGGAAGCCGCTGGATCTGAGCTGGGCTGGCTCAGGCCCTTACATGGCACTACTAGGGAGACTCTACTGGCCATGCAGGCCCTTACCTTGCGAGAGAACATCAATTTTGGCACCTTCCTCCCACAGGGAGCAATGGGGTGAGGGGAAGGGAACAGGACAGTTGAGAACATGGAGCTGACACATGCTTGAGTGGCAGAGCCAGAGGGCAGCACCAGGGACCAGGCCAGGCTGCAGAGGGGAGCACCAGgggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgCAGAGGGGAGAGCACCAGGGGCCGGGCCAGGCTGCAGAGGGGAGAGCACCAGgggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgCAGAGGGGAGAGCACCAGGGGCCGGGCCAGGCTGCAGAGGGGAGAGCACCAGGGGCCGGGCCAGGCTGCAGAGGGGAGAGCACCAGGGGCCGGGCCAGGCTGCAGAGGGCGGCACCAGGGGCCGGGCCAGGTTGCAGAGGGGAGCACCAGgggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagaggggagagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgCAGAGGGGAGAGCACCAGGGgccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgCAGAGGGGAGAGCACCAGGGgccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagagggcagcaccaggggccgggccaggctgcagaggggagagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagcaccaggggccgggccaggctgcagaggggagcaccagcggccgggccaggctgcagaggggTCCACAGGCACCCACAACCCCAGCCCACGTAGTGAGGCTCAGAGGGCCTCTGGGCTCAGGCCGTGGACACCCTGCCTGGAGTGGCATCGGCCTCCTACAGTGGCTCGGCTTCCAGGGTGCAAAGTGGCGTCCCCACTCCTCAGGGCGTCTGGGAGGCCTGGAGGCACCAGTACCCAACCCGCCCTCCCTTGGCCTGCAGGACAGACATCACCCTGCCCCTCTCTTTCCCTCTCAGCAGCCCCTCCCCAGGCTCGAGGGTCCTCGGTCCAGGCCTTCATCTTCCCATTCTCATCTGTTTCTTTGCTCCCCGCAATGCCTGACTGTCCAAGGCATTTCTTGGGGTTGGGTATTCAAGAAGGTTTAAAGAAGAATTCCTTCCTGGCCCCGCACCCCACACAGCGCAGACATCCAAAAGCCTGGACAGGAACCTGGGGAGTGGTGTGGTCTGGCCTCCCTGACCTAGGCCCTCTTGAGGACCCCGGGGCAGGGAATTTGGGGGCAGGCTGGCGGGGCCTACCTTGGCCTTGGTGATGGTGCAGTGGAGGGCGTTGTTCTCCTGGTCATACAGCAGGCTGAAGTCCAGCGTGCCCAGGGCAGCTGCGGACAGAGGAGGGCACAGGTCCCACCCTGGCCGCATCTTGGAGAGGCTTCGCCTGCCCCTGTGAGACCAGATGAGCCTGGCCTGGGCAGGTGCCACCGTTCCATGGCGgctgccaccaaccaagcacctgctctatgccagcccctcacccatcctcccagtcccacccaaccctgggaagtcacaataatctccccactttccagaggaggagctgagacccagagaggtcaggtgggtcactccagttcCCTGTCTGGCCCAGTGTGTGGTCCACCTGAGCTGGGCACCCAGCCAAAGGAGTTCCTGCCCTCCTGGTGCAACTGCCAGTCTGGGccccgtgcctcagtttccctcacctgtgaaatgTCACAAGGATCACACAAGGCGGAGGAGACGAGGCTTTGAGAGGAACAGGTCCTGGCCAGGAAGATCAGCTGATTTGCTCAACAGTCCCCCAGCCAACACACAGGACTGCAGCTCCTCTGTCTCTCTGGCCCGTTGGGAACCCCGAGCAGGCCGTGAGGAGCCAGCTGGGTCCTCATACTGCTGCCCCCAAGTCTCTCAATGGCAGTGGTAACTCCCAAAAGCCGGGGGGAGGGTGCAGCCATAATTGGGGGAGGTTGCAGCCATAATTGGGTGCAGCTGCCTCCCTCCCCGGGGGCAATCACTCATAGCAGCTGTGGCTTTCCATGCAGAAGCGGCTCCCAGCATGGAGGCCAAGGTGATGGTTGGGGCAGAGCTTGGAGATGATGGTGGGGGGCAGAGCTTAGTGGCAGTCCGCAGACAGCAAGATGCACATTCACAGATGGCTTCAGAAGCCCAGAGCCTGCTCCCAGGCTGCAGGGCTGGTCAAATGGTGTCACattccttcattatttaccaagtgtttacaacatgccagactctaggggatggacatctgtgaggcagagtccctatcccaaggaaagcacagcttaaggggaggaacaaatggaaataatttgccgcagtagagtctggtgagccgggagccatggggagcccatgtgaattgggaccaccagggaaggcttcctggaggaggtgatgcttcacctgagccatgagggatgagtaggagttggccaatggcaaagaggggctgggggaggtgacgattccaaccacaggccaaccagcaggtacgggcttagtggctggtgtgatcccggGATAAGGGAAGGCCAAACTTGAAAGACTGTGTTCTTCAGGCTCTGAGCTAGGCTCCCAACCTGGATCTAGATCCAATTACGGCCACACCTGAccaggcccatcatttacatctcttctgcggctgctttcctacttcagaagcagaccctacggcctgcacagctgaaaatatttactatccagcccttctttatagaacaagtttgcagatctctgGGCGAGTGCTGGGCCCAGGCATAACAAAGTTTCTTGGGTCTGCAGGGAAATTGGCTAATGAAAGGCCACGAGTGTGGAGTGTCCTTTATTCTGAGCTGGTGATGGGAGTGGCATTTGTCTAGATCATCCCCTTCTAACACGGGTCAGGCACATGTCTCTGTCTTCGCAgcagtgtggggtgcaagacctgccctctgacatcagtctcctcggttcaaatcccagcgtgcacttccctgctgtgtggccagcttatctaacgtctttatgcctcagtttactcatctgtgaaatggagacaacgatagtatccacctcacagcacagtgtgcctgcggtgtaggaggtgctcagtagatcattatagaaggagcagatCCCTATGGTGGGTTTTCTAGCTAACAGTACTGGCTTCTCTAAGACCCATTCTGGAATGAAACTGTCCTCAATTGGCTCATTTCCCTCCCTGCCTTTGGGAACAGAACATACAAATAGTCACTTAAATTTTTTGAATAACCTGAGTCATTGCTATGGCCACGATGACTCTTCCGTATGTCACGGTCATGTTCCAGAGGCAAAAGGGGACACGAAGCAACCCCTGCACCAGAGGGTCCCGGGTGGCCCTCTTTTCGGTTTTCTTCATTTTTCTGCCTCTTCCTTTGACATCAGCCTGAAACTCCTATTAGATTACTTGTCTGTGTTGGCGCAGACAGCTCTCACCTATTCAATAATTGTTTCTCCCATGAGACCAGGTGATGATACTTGTTGGAAGTGTGTGCAAACTAAGAGCCAACAGGCCTTGAGTGGCTggtgtggggctcatgcctgtcatcccagcagtttgggaggccgaagcagatggatcacttgagctcaggagttcgagaccagcctggccaacatggcaacaaaaactacaaaaatggccgggtgtggtggctcacgcctataatcccagcactttgggaggctgagacgggtggatcacttgagcccaggatttcaagaccggcctggccaacatggaaacacccatttctacaaaaaatacaaaaatagccgggcatggtggcgcatgcctgtggtcccagctactcgggaggctgaggtgggaggaccgcttgagcccaggaagttgaggctacagtgagtgtgattgtgccactgcatccaacctaggcgacagagcgagatcctgtctcaaaaaaataaataaaTACTAGGTCTTGAGTGAAGCCTGAGGCACGTTCTCCTGAGCAGGGTATGTGTGCAGCAGCAATAGCTGTAGCTGGCCTTCAGCTGCCTGCATGCTAGCTGCCATCTCCACACCAGGCAGGTGAGGCTGCTGGAAGGGAGGTGTATCCCACTGCTCCTGCAGCAATGAACACAGCCACAGCCCGAGATGCACCAATCTGTACTTTTGTAACAATATTCTCTGAAGCTGTTGGCTTTACATGATAGTAAATGGATCTGCTGTTCAGGTCAGGCCTACCTGGGgtttgttcccccaacgaattagccccctgagcacagggaccatgccttggccacacctgtgagaccaacaaacagcagatgcagacacacaTGGCTGGTCCATTCAAACCAACTTGCCCTCCAGGTGCTCCCAGGAGCTGGGATCTGGTGTGACACCCAAGTGTAAAAATGCACATTCTGATTTCTGCCTGTTTCCCAACCCCAGAGAAGGACAGCCAGAAGACAGAGCGGCTGCCTGCTCCCCCTGGGACACCCAGCTCCTGGAGGGGAGAAGCCCCTGCACTGGCTCTACAGAAACCCCTGTCCAAGGAGGGCAGCGATCTTCTCCAACTGCCTGGGGGGAAAACTGACACCCGTCACCACCTCCTGGGAAGGAACAGCACACCCAGTGTGTCACCAGGAAGGGACTGATGAATTTCTTTGAACTCCACTGGAAGGTCTGTAAGAAATGATAGTTTATAAATAGAAGCTCGGTCACAGTTTTGAGAGGCTGGTGAAGCTCTcttacggagcacatggcccaaactcttcatttccctttagagaaatgaggctcagagaggggaggggagctgctcagcgtggcccagcagacccaggcacagaatctaggcctcttcacttccatcctgatagtttctcccttaaagcctactggctcCCAAGAAGttttttttttttgagacagggtcttgttctgttacccaggctggagtacagtggtgcgatcacagttcagcctcccaggctcaaacgatcctctcacctcagcctcccaagtagctgggactacaggcgcccaccaccatgcccggctaacttaaaaaatatatttacaaaaaaagagacagtggtgttggtgcggggggtgcgggggggtaggtctcactacgttgctcaggctggtcttgaactcccgggctcaagcaatcctcccgccttcacctcccaaagtgctgggattacaggcacacacgggccactctgcccTGGTCCAAAaagcttttttttggccactttgtttagctgaaatcttaggaggatgcaactaagaaacaaaaagagagaaaagcagagttgctctaggggaaggagagtgggggctcagCCGCTGCCCTGGAGAGCCCTCCGAGTTGGAAAAGCACCTTGGGCTTGTGCTGGGCGTGAGGAGGGCCCTAAGGGCAGGAGAAGGTGCTTCTCTGTGACCCACAGTGCCCTGCCCTTGGAGCGCTCCAATCACATGACCCCACTGAAGTTTACATCTTCGGAAAAATAATGAGGGTTGGTCACTAATGGGCCTTTGAGAGTTGTCACCCGCTTGTCGATAATTACAGAGCTTTAATGGGGCTAATTCAGAGAAAATCAAGCTCAAGCCCCTGTTGGCTCCTCTAAGGCCGAGCACCCCCTGCTTGGCTGATGTAGACCTGACAGCATCATCATAACCTTGGTTTGCACCTGGACTTTCCTGACACCCTTGGGAATAGGAAATCTGCCCTGAttttttttttttttttttttAAGAAAAATGTCAGGCTCTCAAATACTTTAAAAAAAACAAAACAATTGTGGCAAGGACCATAGAAAATCATCAAGCTTGTGTTGAAAGCATTTTATAGTTGGGGAAACAGCCATTCAGCCAGGCCGGCACtgcaggggggacagagagaaatgagccatcatccctgcttcccaggagcttagagacgagtgTTAAGATTTTTATAAACATCTtcattcatctgctttaaattgttgaaaggttctccatcaactcctggactcagttcaacctccttggtgtggcattcaagaccttgtgtgatatgaccactttggcctcctcataggtctcttccccttctttgagccacttggctctactgttttgcctacactgggctgagtttctgacacttcaggacctgtgtacacagctgtccctctgcgaagagcacctccctttcctcttttccctctgtcagcccacctcctccctcctcccagtcttagggctgagccctctctgtgagtgacctccgcctgccagagctttccatctgggttcttgcctccctgtccctcgctgtggggtcacctgtttgtcttggttttgatgaacaccctcaggtcagaggttgcctctcagcattcttgtatctctgtgcctaacacagagcctgccacatagttggtgcCGTAACATCGAGTGCATTACAAATATCACAGGTCCAGGTAGAACATGACTGGGGCCAAAGAGGTTCTGGGCTCAAAGGAAGGGGCACCACAGCCCATGGGGGCTGGAATATGCCCCCTAACACGGAGAACCTTTGTGGTGGACTCAACCCCTTGTCCTTGCTCCAGACCTGCACAGGCATCGGTCCTGACCCGGTCCTGGGTGAACCACACAGGGCAggtttaacacgggcatgtgatccaattctggccaatgagacaaaaggacaggtctcctggggacttctgggagaggtttcctcgctcttaaactgagacatgagaaaaggaatcggtccttaccaattccttacccagtcaaggaccttattgttacatgtgatgcctggacctgcagcagccaccttggacctgagggTCCTGCTTGGCTGATGGAGACTGACAGCGTCATTCTGACTTTGGTCTGCACCTGAACCTTCCTCACCCAGACACCCTTAGGAATAGGAAACGGGCCTTGATTTTGCTCTACAAGGAAAAACGCACAGCCACGAGCAAAGATGGACAGAACCATCACCCCAACAGTGCTGAGGGGCTGCTGCATTGGTGCACCCTGGTGCAGCAAGACCTCAGCTGtcctcattgtttaagccactctgagttggttatctgtcacttgcaacccatggcactctaGTGATTACagctttgaagaatggactgggcttccacaggtgggtgccgggcaggagagcctctcaggtaggggcagcgtgtgGGGTGTGTGGCCCACGGCCCACCTGGAGGGCAGCACGCCACAGAGGCCACCCTAGGAGAACTGGCAGGGCCGACACTGGGTCCTGCTCCTCATTGGGACATTCACTGCCTTTCCTCCCTGTCATCGCCCCTCTTCGAGATCTGCTGGGAACATGCGAGTTATGTGCTTGTGCTTTGACATGACTTAATAAAGAAAGGGCTTTAAAAAGCCAGGAGAAAAGGCTGAGTAAATAAACAGCACTTCACTCTCTGCCTTGGACAACAGAGCAACCGCCCTCATTTCTTTTTGTCAGAACAGAGATAATCCAATAATTAGGGCAGCAAATAGCATAACCCTGGCTAAAGCCGTAATGAACCATCTGGCTTGAATCATTGAGGCTTATTCTAAGGATCTGGTTGTCACAGCCCAGAGGGGGGCACCGCCTGGGGTACTGGGTGGAAGGCAGTCCCACCCCAGAACCTGTTCCCCCACCATGGACAAGAGCCAAGATTTCTGGGCTTCTGGGCCTAGGATGTTTGTCAGCCTCCCATAGGATCCAAGCCTAGGGTGGGGCCCCCCTCGAAGGGGGCGACAGGGGCCCAAGCTGGCCCTAGCTCAGTGGCACACCGTCCTCCTTAGCTCCTCTGAGCCCGTCTTCAGCATTTCTTTGGCATGTCCTGATAATACGGGTGTGGCCAGCCCACAGTTCAAGCTGGCAAAGCTGCATCCAGCTCTGCCCTGCACTGCTGACCTCATGTGAGTCTGGCCCCTGCACCGCATGTATCTCAGGCCAGGGGCCCAGCCCTGCTGCCAGGACATGCTCCCTGAAGGGCTGCTGAGCTGATGGATTGGAGATGGGGCTGGCTGGCCTGAGTTTCTAGCTTCCTGGTCCCCTCCTTGCTTTGTGGGTTCAGAGACAGCAAAAAAAAGAAAAAGAAAAAGCCTCTTGGCCATTCGCCATGCAGAGGGCCCTCTGTGGCACCAGGAGGGTGGCCTGAAGCTCTGCCCTCCCCTCTTACTTCCTTGGTGCAGCAGCAGAGATGCCAGAAATGGGAACAACTTTCCCATCCGTTCTTCTGGGGGAGGCTTTGGATTCAGGGCAGCCAAAGCAGTTACTTGGGTCCCACTCAGGGCCCACCCAGGGGATCACCAAGTCCAGGGTGAGCTTCGCACAGGGTGCCAGGGGCAGAGGGGAGGGGGGGTGGATCTAGGGCACAGCCCTGAGGGCAAAAAACTCTTCCGACCCAATCTCCCCAAGCTGGCAGGAAAGTGGAGGGACAAGACTGCTCCCCAGCCCCCACGCCCCAGGGCAGGGCCTTCATGTGCCAGGCGCTGGCCCGAGGGCCGTGGTCCAGGCCTCTAGAAAGTGCAAAGCAGGCAGTTTCCCCTACAGGGGCCCTGCTCTAACCGGCCACTGCTGATGGGCCTCCCCAGGTGGGGCGATGGGGGGTCTGTGCCCCGGGGGCACTGGTAATCCCTACCTTCAGCTTCTGGTGGCACATTTGATGCTTGGGAAACTCCAGGCCCGCAGCCCACAGGCCCTGGTGAGTGCCCAGGCCAGGCGCAGACATCCCTGCTGCGCAGGGGAGGGGCAGCACCAGCCCTGGAGAAGGGCCACATCCCGGGAAGGGCTGGGGTTTGACGAGACGCTGGTTTTCCAGGTCTCAGTGACAAGTCTGGAGCCACAGCTGAGCTAGGAGGGGGTTCTCACATGCCATCCCCACCCCGCGCAAACCGACTCCTCACTGGACTGCGACCTCTTCCGGCCTCGGTTTCCCAGCCAGTCCCGGCTCGGGCCGGACAGGCACCCTCGGGGACGGGAAAAGGCGCCAGGAGCGCCCACCGGCCGGGCCTCGGTCCCGGGACTCCGGCGCTTGCCTGCTCCCGGGGGCTCAGGGCTCAGTCCGGGAGGAGGGGGAGCGGGCTGGGGGGCCCTCTCTGCCCGGGGGCCGCGGGCGTAACAGGTGGGCGAAGGTGCGCGGCCCTGGCGAGGGCTGCGGCGGGGTCCATGGACACCGGAGGAGGAAACGCCAAGGTTTTTCCAAAGGACAAGCGGCCCCGCGGTCCTCCTGGTCCTCTGCTCGCGCGCCAGCAAAGCAGCTGCGCTCTGCGGGCCGCCGGGACCACGCGGGAGGCCGGGCCGCTCCCAGCCTCGGGCCCCTCCCCAGCTCGCCCCAGCCCCGACCCTCGGCCGCGAGGCCCTCCCGGAGCGGCTGGCGAGCGGGGAGCGACCGCGCGGCCGGCAGCAACTGGTGTCTCCCCGGGACGCAGCTCCGCCCTTCCCGGGAACAAAAGCAGCCGCCCGCGCCGGAGCTCCGGGAGGGCGGGCTGGCAGGGAGGGGGCGCGGCGCCGGCTCCGAGGAACCCGGCCCCGGAAATGGGACACCCCCAGGGGGTGCCCCCGAACTTCCCTCCTGTCCGGCCTGGGTCGGGGGAGGGCTCGAGGCCGGTGGGCAGGGCGCGGAGAGCGCACCGAGTGCGCCAGGGGCCCGCAAGCCCGCGGCGGGGTTGTGAACCGAGGCAGAGCGCGAGCGCGCGAGGGGGACCGGCGGAGGGAAGCCGCGAGGCCGTGGGGGGGCCGAGCCCGAGCCAGGGGAGGGGGCGCGAAGTCGGCGCGTGGGAAACTTACTGCAGTCGTCCGACTCGTAGCCGTCGGCGTCCGGCTCGTCCTCCGGCGGCTTGGCTGGCGGCCGCGCggggctgggacccgggctggggcccgggctggagccgTAGGCTCCGAAGAGCTGGTCCACATCCTCGTCGTCCTcgcgggcgccgtcggaggggctgcggcggccggcaccggccacagccgggcgcgcgggggcgtccgggggtgcagcggctcggggcccggcgtccgggggcaggccccgcgggAAGCGGGGGAAGTAGTCGGAGATCTGCTTGATGGGACGGATGGGGCCGGGGCACACGTCGATGGCCATATGCTCCTGGATGCTGATGGTCGCCTTCTCCCCGCGCCGCCGGAGGGTCATGCAGGCAGCGccgccccgccccgggcgcggcccggcccggcgcgaccccggcccgggggcggctcagcaggcccggcggggcgcggcGGGGGCTGCGGGCATCGCCGGCCGCGCCCCCGGACGGCCCTGACTTGGCCGCTGCCCGCTCCGCTGCGGACGGCGCGAGCGAGTGCCAGAGGCCGGCAGGCAGGGGGCGGGCCCAGCCCGCGTCACCCGGCAGCAACCAAGCAGGGTGAGTGTGCGGGCTGCGCGGGCGGCGCGGAGCGGAGGGAGCCGCGCGGCGCCACACTCACTCGCACTCGCACTCACACCGGCGTGCACGCCGGCCCGGGACCCCGCGCGCGCACACTCGCGGCCAGGCAGGGCCGCCGGGCGCCTTCCGCTCATGCACGCCCGCGGCACAAGCTGGGAGTCAACCCGGCAGGGACCCGCAGGACGCGCACCCACACGTTCCCGCGCGTGCCAGCCCACGCTGGGCTCCGCTCGCTCTCTCGGGACACACGCAGGTGTGCAAGTGCACACATGCGTGTGCAGAGACACGTGGTGGAAGCATCCGCTCGCTCACACCGTAGGTCACACACGCAAACGTCTGCATGCGCACATGGTTGTTTTAGGAAGCTGTGACACAGTACACCCCCAATGCACAGGCGCGCACACCTGATGGAGCACACACACAGGTGATCAAGGGCACCCAGGGCACAGGCTTCCCTACCCCCAAGCACCCCTAACAAGATGCACAAACATGAGCCCATAGAAGTGATCAGGGGACCCTGGGCACAACTCCTTCTCTCCTCCCCCTCTAGCAGGACGTGGAGTCACACTCCTAGCATGTAGGGACAGGTTGCCTACACACGGGCAGGACATGTACACGCTCAACTGCACAAGGACACCGGGGCTCAGTCTGTGCACACATATTCTCAGGTCACATACAACACAGCACTCTAACCTCGGGTCAGTCCCTCACTTGGTCAGTGAGTACAGTCTGAAGGACCCCCCGGGGCAGGGTGGCTGAGAACCTGCACAGGGCCTGGTCAGGAGGGAGTGAGGGCAGGGCCTGGGCGGTGAGTGATGCAATAAGGCTGGGCGGCACGCTCCCCCCACCCCCACTCCTACTCTAGGCCTCCCACACGGTCAGATCACTAAACAAATCCCAGAGGGCCCAGCCCTGGCTGTCCGGCTTTCCGGGACCAGAGCTCTGTTGGGAACTGCTGCTGCTTGGACAGGTGTGTTCCCGGAAAGCCCTGGGCATGGATGGAATCCTGTTTACCCTCTGGTTTCCACTGATGTGTAAGACACTTAGCTTCTTAGTGGGTGCCTTTGGTGCATCTGAATGAGGGGCTCCAAGCCTCTCTTGTTCCCCCACCATAACCCCTGCAGAGTGATGGGGAGCAGAGGAAAGAGAGGcaaagccttggcctgtggcttccagctgcacagttctggcaggctatttgacctctttgagcctcggtttcctcatctatgaagtgaggctatttccaactgcacagccttgtggcaaggccccatccagcacagacaggtaagaggtgctcagCTGGCTTTCCCTTCTCCCTTCCTTTGGAAAGACAAGACTCATGGTGAGAAGTGATGAGAAATTCATGTTTTGTGGTGAGTTCTGAACTGGGTGGTGGCGGGCACTACCGGCCTTTGAAAACACTGGAGAAACACATGGCATATGTTgtactgcgccaagaacattcacagctgtcatctcattgattccttaaaccaccccacgatgtaggcagggcctgctgttcccatttcacaggggaggaaattagcgctcaggcacaaggatgtgtccagggtgactgctggctggcggcagagctgcgatgagagcccagtgtcctgactgttcgatgcttccactttccctcccttctcccttTCCTCCCCTCCACTACAGAGCTCAGGGGCTCAGAGCAGAGTTGGAAACACAGGTAAAACCTCGTTCCCAAAGCTCATCCTGAGGCTTCTTGGACAGGGGAAGCCCAAACTgaggaggaggggaaggagggaaaaaaaggaggaggaagaggaggacgggaggaggCCAAGAGCCTCAGGGGTTACAGTGGGAATGAACCAGCCGGGGTTCCCTAAAGATAGTCTGAGGTCCTGGTGGGAGAAATATTCAGCCTTCCAAGAGCCAAAGGCCAAGGAAAGGACAGAAGAGCCTGGAAGGGCAGCCTGGCACAGAGGGGTTCTCATTTCCAGGAATGCTTAAGGGGATTAATCCTTAAATAGGCACAGGGTGACTTTTTCTGTGCCAGAGGTGGCTGCATCACAAAATGATTACATAATGACGCAGGTATTAAAATACAGACGCTGAACTGTCATTTTATCATTCCTTTACTAAGGAGCTGGAGGGtgtcgtcatcatcttacagatggggaaactgaggctgcgggaggtcaagtgactagcaagaggcagactggagatgagacctggacgtcctgactGCTGAGCCTGCCAAGGCCCTGTACCTGTGTTCACCGAGAGCTGGCCGCTCTCTCCTGGCCTCTTGTCCTATGGGTCTGGTTTTTGGAGGATCGGCTCATGGCTCTTGGCTCTGCAGGAGCTGGCTCTTGGGGAGGGCTTTGAAGAAGTCAGGTGGAGGGCCCAGCCTCCTAAGCATGGAGCCAGGGAACCCAAGGATGCCCACTGGAGAAACACATGGCATGTTGGGGTCTACCCTTTGGCAGGTGGCAGAGTTGAAATCCACTCTAGGTCTGAATTTGCCCAGCTCTGAGCCCAGCTGAAATGGGGTAGGGCCTCCCCGAGGGATAGAAGTGGATAGAAGTGGTGTTATCTGGGCAAGTGTCCACTTTCtagaaagagcataggtcagggagtaggcagacgtagagtcaaaagtcaggtgtgcctcctcctaagcgtaccggactttgggcaagttgtttaacttaggccctgggtctcatctgcaatgggggagaataagaacaacgttgtgtgagttaaaggtgaggatgggtatacagtgcttagcacagtgcccagtgtgcaactggcactcagggaattgtgattctgttGCCCCTGCCTTCCTGGTGCAAACCGTCCCATTGCAAATCTTCCTGGTGCAAACCGTCGTGGTGCAAACCTTCCTGGTGCAAATCGTCGTGGTGCAAACCTTCCTGGTGCAAACCGTCCCAGTGCAAAACTTCCTGGTGCAAATCGTCCTGGTGCAAACCTTCCTGATGCAAACTGTCATTGCGCTCCCAGTGCCTGCCTTGATTTCTCTACCAGTGACATGTGGTTGGCTGCTCCCTGTCTGCTGCCAGGACCAGGTGAGAAATGGATGCACTTGCCAAGGCTGGGCGCTGGCTGGCATGTGTGGGCATCTCTAAGCAGTTGGATATGTCCAAAGGCTCATCAATCATGTTGCCTTCCATTCCCCATGCTGAGGTGGGCAGCTGGGCAGCTGGACCAGCCTCTGGCAAAGTTAAGTGGATGGAGTCTGCCCTTGGTCACAGTCCTCAGAAGCCCTGTGCCCTTGCCTCTCTGCCTTCTGCTTTTCCCAACAAGCTTCTGAGCTTTCTCCCAACTCCCCACAGACCCTCTCAAAGGCTCCTCCTGTCAAGTGAGATAcagtgggttctttaaaaaatggtccaggtatctgtagcactagcatgtctatttactttgtcactgaggctggagtgcagtggtgtgatcatggctcactgcaacctcgacctcccgaggctcaagtgatcttcccaactcagcctctcaagtatctgggaccaaatgcatgcaccaccatccccagctaatttttaaaatttttgtagagacggggtcttgccatgttaccctggctagtctcaaactcctggctttaaatgattttcctgccctggcctctcaaaatgctgagattacaagcgtgagttgccacacccagcccaggatgtctatcaataatgcagattccaggcccccatctcacacccactgactcagaatatgtgtgtgcacactcaggatgcatcttaacaagcgcccctgatgattctggtgcacagtgaaggttgagactcgctgGGTTAGAGGGTGCTAATGGTTTAACGGTGATTTTCAACCTGCTGGGCTGCCTCTATACTGTTCACATGTGTAACATATGCCCATCAGTGACATCTCTCAATATTTAATGATTCTTCACTGGGGAAGTGAGTGACTAACCCAAGGCAGGTGTTGGAATTTCCAGAGAGGTTTGGCAAAGCCACAGTGGGGAGTCTGATCTCCTCCTGTTTGGGTATTCTGACCTCTTTCCCGGTGGAGAAGtgttgggagcaggccccccaaaatctagccataaactggccccaaaactggccataaacaaaacctctgcagcactgtgacatgttcataatggccctaacgcctccgctggaaggttgtgggtttaccggaatgagggcaaggaacacccggcccgcccaggacggaaaaccccttaaaggcgttcttaagccacaaacaataccgtgagtgatctgtgccttaagaacatgctcctgctgcagttaaccagcccaacctattcctttaattcagcccgtcccttcgtttcccataagggatacttttagttgatttaacatctatagaaacaatgccaatgactggcttgctgttagtaaatacgtgggtaaatctctgttccgggctctcagctctgaaggctgtgagacccctgatttcccactccacacctctatatttctgtgtgtgtgtctttaattcctctagcgccgctgggttagggtctccccgaccaagctggtctcggcaGAGAAGGACCAGTTAATGGCTCTTCTTAGCGTAGGTAACGTGTGTTGTTGAGGAAATGCCCTCTGTGCAGCACTTGGCTGGATGTTCCTTGGTTGGCATCTTGGCTGGTGTCCATGTGCCCGGAGAAGAAGGGCCCTCTCTGACCCAGGCCTAATATGTGTCCCCCGTTTCCACCCTTCCCTGATAACCGGAGGAGATCTTGTCCTGCTGTGCTAATAGAACGTTCTCTGTATTAACAGAAAATTTTAACCCAACAGAGGTGCTTGGATGGAGGAAATCCAAACAATGTTGCTGGTTGATGGAGAGGGGCATTCCAAGGTCTCTTGTCCATTCATTTATTCATGAAGCCACATttcaacaaatatttattgtgcacctgccctgtaacaggccccgagctgtgatctgtagggggctcagtgataaatgaagcCCATTTTTCTTTAGATTCCAGATGATGGAGGGGAAGTCAGGAAGGGAGGTAGGAACCACCACTCTAGGGGTCCCTTGCCTCCTTCTTGGCATACGCTATGGCCTCGGCATGGAGAGCATGATGGAGCACTGATGCCAGCAAGTCAGCTCAACATTTGCAATATAGCGTGTGGCTGCTTAGAGGCCAACTTGATATATTAAGTGATATTAAAAATGCAGCCTATTTCTGATAATTTATGCGCATTCTCATAAACACATTACACTCGTCACAGTACATTTACGAGCACCAGGCCTACCAGTGGGGAAAAGTTAACACCTACCTAATGATGTTGTTTTGCAGATGTCAGATGCTCTGGTGCTGCAGTGATGGGCTCAATGGTGCAAAGAAATGTGAGTCTTTCTTCTATCCTGGGACACCAGAAGGCTGCCCAGGGCTCGGCAGAAGAACCAGGCTCTCTCCCTCCACACTGCCAAGGTCTGTGCTGTGGGATTGGAAAGAGGCCGGGTGAGAGGCTCCCCTGCCAGGAGAGACTTAGCCCAGGATCAGGTTGCCTAAGGCAAGAGATTAGATGTCACAGAGGGATAATGCTGAATCCTCTGGGGAAGCTCTAGAAATGACCTCTGGGAGAAAGAGGGGCAGAGCCTGAGGCTTACATCACAGCACAGATCAGCCTCCCTCCCACAGGCCTGGGGACCACTAGAGCCCAGGGTCTTCATGACTGAAGTGAACATCTCAGCATTATACATGAATTAGAATGAGTAGAAACTCTAACTAGCCCAGGGCTGGTGGAAACCCAGGGCCAACAGCCACAAATGTTCCCTCCAATCGCTTTGAATTCATGATGCAATTCAAAGGAATTAACAATGCAAGTGGTAGCTTTCAATGATAACGTGTTCCACtttttgtgtttgagagagtctcactctgttgcccaggctgaagtgcaatggcacgatctcggttcactgccacctccacctcccgggttcaagcaattttcctgcctcagcctcctgagtagctgggactacaggtgactgccaccatgtttggcgggcgtagttctttgtatttttagtagagatggggtttcaccatgttggccacgctggtcttgaattcctgacctcaggtgatctgcccgcctcagtctcccaaagtgctgggattacaggtgtgaaccaccacacctggcACCCCCCGCCttttttttttttttttttttttttttagagactggaccttgctttgtcacccaggctggaatgccatggagtgatcatggctcactgcagtcttgaactcctgggctcacgtgatcctcccatctcagcctcctcaatagccagaaccacagatatgcaccaccacgtccagttaatctttttatgttttattttttaaagatgggtcttgctatgttgcccaggctTACATGAACTGAGttttacatttactacaaatgtgtgtacctataaacaatatatggtattgttttgcaagctttaaaactttgtaaaaatggcatcatactatacataacctcattaaactcatttttggcttattattatttgtgaagttttttcctgtagatgcatctagctatagtttgtttttattgctgtctactgttccactatgtgaatacgccataatttatttctcttttctcctattgttgatcatttaagctggttcctatgcgtggctatcatcaacatgctgcaatggctattcctgtttcctggtctgtatatgggatttctttttttgagacagggtctcactctattgcccaggctaaagttcagtcgtaccatctcggctcaccacagcctcaacctccctgggctcaggcgatcctcccacctcagcctcccgagtagctgggactgcaggcacataccaccacatctggctaatttttataatttttgtagagaaagggtttgccaggttgcccaggctagtctgaaactcctgagctcaagtgatcctcccacctcagcctcgcaaagtgctgggattacaggcgtgtgccaccacacctggccagggagagtttctttaagagagtttcttggataggaaatttgctgggcctagggtctgtatatctttaccttgactagagagtgcctagctgctccccaaactggttagaaccacggccattctcaccaaccatctctaacagtgccagttcctccttgtccttgtcaacacttggtattgccagacttgaatttttgctagtctggtaggttttttttatgtttaattagttctccttgctaactagtgCTACTTCATTTCTGCTGCTAAGGGTGGGCATGTGCTGTCAATAGATAAATGCAACAGATTAAAAATTGAAGAGCTtccatcaataagggattggctaaatacagtatgcctcacctgtacaatagaatactgcacaatcattaacaaagatgAGTGTGCTGATATGGAAGAGATATTGATATTCTGATGTACTAAATATCTTTTCATCTCCCAGATTTATTGTTACAAAGCAAGAGGCATAAAAAGCATATTCCCTTTGTAAATAAATGAAAAGATATGTATACACATGCATATTTGTATGTATATGCGCAGAATACCTCTGAAAGAATGAACAGGAAACTGGTAACCACAGTTCATCTGGGAAGAGCACTAGAGGACAGGGAAACTTTTTTGCTCTGTGAATTCTTACCACGCATGTGTATTAGCCTGTTGGAAAAAATTAGCcctagaataggcaaattcgtagagactgaaagtagaatagaggttgccagaggttttggggtagagaatagggggtttttatttgatagatgcattttctgtttgagatgatgagagagttctgaaatggatagtggtgatggttgtacaacattgtgattgtacttaatgccactcaactgtacacttaaaagcggttgaaatgggctgggcacggtggctcacacctggaatcccagcgcttcgggaagccaaggtgggcagatcacctgaggtcaggagttcacgaccagcctgaccaacatggtgaaaccccgtctctactaaaaatacaaaaattagctgggcgtggtggtggtcgcctataatcccagctactcaggaggctgaggcaggagaattgcttgaacctgggaggtggaggttgcagtgagccaagatcacgccactgtactccagcctgggcaacagaagtgagacctcatctcaaaaaaaaaaaaTGTTGAAATggcctggcacaatggttcacacctgtaatcccagccctcagggatgccaaggcaagaggatcacttgagcccaggagtttgagaccagcctgggaaagatggtgagactctgtctctacaaaatgttttttaaaaattagctgggtgcagtggtgcacaccctgtggtcccag
=====================================
tests/data/download_gene_fasta.py
=====================================
--- a/tests/data/download_gene_fasta.py
+++ b/tests/data/download_gene_fasta.py
@@ -65,6 +65,11 @@ def fake_chr22(filename):
chr22_len -= 70
fake_file.write('N' * mod_70 + '\n')
+def bgzip_compress_fasta(filename):
+ from subprocess import call
+ call(' '.join(['bgzip', '-c', filename, '>', filename + '.gz']), shell=True)
+
+
def fetch_chr22_vcf(filename):
from subprocess import call
call(['curl', '-s', 'ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/pilot_data/release/2010_07/exon/snps/CEU.exon.2010_03.genotypes.vcf.gz',
@@ -83,3 +88,5 @@ if __name__ == "__main__":
fetch_chr22_vcf("chr22.vcf.gz")
if not os.path.isfile("chr22.fasta"):
fetch_chr22("chr22.fasta")
+ bgzip_compress_fasta("genes.fasta")
+ bgzip_compress_fasta("chr22.fasta")
=====================================
tests/data/gene.bed12
=====================================
--- /dev/null
+++ b/tests/data/gene.bed12
@@ -0,0 +1 @@
+chr17 6010 31420 uc010vpx.1 0 - 6012 31270 0 6 158,127,110,75,80,523, 0,5195,5861,7910,16317,24887,
=====================================
tests/data/gene.bed12.fasta
=====================================
--- /dev/null
+++ b/tests/data/gene.bed12.fasta
@@ -0,0 +1,2 @@
+>chr17:6010-31420(-)
+CGGAGCGGGCAGCGGCCAAGTCAGGGCCGTCCGGGGGCGCGGCCGGCGATGCCCGCAGCCCCCgccgcgccccgccgggcctgctgagccgcccccgggccggggtcgcgccgggccgggccgcgcccggggcggggcggCGCTGCCTGCATGACCCTCCGGCGGCGCGGGGAGAAGGCGACCATCAGCATCCAGGAGCATATGGCCATCGACGTGTGCCCCGGCCCCATCCGTCCCATCAAGCAGATCTCCGACTACTTCCCCCGCTTcccgcggggcctgcccccggacgccgggccccgagccgctgcacccccggacgcccccgcgcgcccggctgtggccggtgccggccgccgcagcccctccgacggcgcccgcgAGGACGACGAGGATGTGGACCAGCTCTTCGGAGCCTAcggctccagcccgggccccagcccgggtcccagccccGCGCGGCCGCCAGCCAAGCCGCCGGAGGACGAGCCGGACGCCGACGGCTACGAGTCGGACGACTGCACTGCCCTGGGCACGCTGGACTTCAGCCTGCTGTATGACCAGGAGAACAACGCCCTCCACTGCACCATCACCAAGGCCAAGGGCCTGAAGCCAATGGACCACAATGGGCTGGCAGACCCCTACGTCAAGCTGCACCTGCTGCCAGGAGCCAGTAAGGCAAATAAGCTCAGAACAAAAACTCTCCGTAACACTCTGAACCCCACATGGAACGAGACCCTCACTTACTACGGGATCACAGATGAAGACATGATCCGCAAGACCCTGCGGATCTCTGTGTGTGACGAGGACAAATTCCGGCACAATGAGTTCATCGGGGAGACACGTGTGCCCCTGAAGAAGCTGAAACCCAACCACACCAAGACCTTCAGCATCTGCCTGGAGAAGCAGCTGCCGGTGGACAAGACTGAAGACAAGTCCCTGGAGGAGCGGGGCCGCATCCTCATCTCCCTCAAGTACAGCTCACAGAAGCAAGGCCTGCTGGTAGGCATCGTGCGGTGCGCCCACCTGGCCGCCATGGACGCCAACGGCTACTCGGACCCCTACGTGAAAAC
=====================================
tests/test_FastaRecord.py
=====================================
--- a/tests/test_FastaRecord.py
+++ b/tests/test_FastaRecord.py
@@ -75,6 +75,30 @@ class TestFastaRecord(TestCase):
sys.stdout.writelines(tuple(Differ().compare(deflines, long_names)))
assert deflines == long_names
+ def test_unpadded_length(self):
+ filename = "data/padded.fasta"
+ with open(filename, 'w') as padded:
+ padded.write(">test_padded\n")
+ for n in range(10):
+ padded.write("N" * 80)
+ padded.write("\n")
+ padded.write("N" * 30)
+ padded.write("A" * 20)
+ padded.write("N" * 30)
+ padded.write("\n")
+ for n in range(10):
+ padded.write("N" * 80)
+ padded.write("\n")
+
+ fasta = Fasta(filename)
+ expect = 20
+ result = fasta["test_padded"].unpadded_len
+ print (expect, result)
+ assert expect == result
+ os.remove('data/padded.fasta')
+ os.remove('data/padded.fasta.fai')
+
+
class TestMutableFastaRecord(TestCase):
def setUp(self):
with open('data/genes_mutable.fasta', 'wb') as mutable:
=====================================
tests/test_FastaRecord_iter.py
=====================================
--- a/tests/test_FastaRecord_iter.py
+++ b/tests/test_FastaRecord_iter.py
@@ -25,3 +25,10 @@ class TestFastaRecordIter(TestCase):
fasta = Fasta('data/genes.fasta')
for record in fasta:
assert len(next(iter(record))) == fasta.faidx.index[record.name].lenc
+
+ def test_reverse_iter(self):
+ expect = list(chain(*([line[::-1] for line in record][::-1] for record in Fasta('data/genes.fasta', as_raw=True))))
+ result = list(chain(*([line for line in reversed(record)] for record in Fasta('data/genes.fasta', as_raw=True))))
+ for a, b in zip(expect, result):
+ print(a, b)
+ assert expect == result
=====================================
tests/test_FastaVariant.py
=====================================
--- a/tests/test_FastaVariant.py
+++ b/tests/test_FastaVariant.py
@@ -8,6 +8,9 @@ path = os.path.dirname(__file__)
os.chdir(path)
class TestFastaVariant(TestCase):
+
+ def setUp(self):
+ raise SkipTest
def tearDown(self):
try:
=====================================
tests/test_Fasta_bgzip.py
=====================================
--- /dev/null
+++ b/tests/test_Fasta_bgzip.py
@@ -0,0 +1,258 @@
+import os
+from pyfaidx import Fasta, Faidx, UnsupportedCompressionFormat, FetchError
+from itertools import chain
+try:
+ from unittest import TestCase, expectedFailure
+except ImportError:
+ from unittest import TestCase
+ from nose.plugins.skip import SkipTest as expectedFailure # python2.6
+from nose.tools import raises
+from nose.plugins.skip import SkipTest
+
+path = os.path.dirname(__file__)
+os.chdir(path)
+
+class TestIndexing(TestCase):
+ def setUp(self):
+ try:
+ from Bio import SeqIO
+ except ImportError:
+ raise SkipTest
+
+ def tearDown(self):
+ try:
+ os.remove('data/genes.fasta.gz.fai')
+ except EnvironmentError:
+ pass # some tests may delete this file
+
+ @expectedFailure
+ def test_build_issue_126(self):
+ """ Samtools BGZF index should be identical to pyfaidx BGZF index """
+ expect_index = ("gi|563317589|dbj|AB821309.1| 3510 114 70 71\n"
+ "gi|557361099|gb|KF435150.1| 481 3789 70 71\n"
+ "gi|557361097|gb|KF435149.1| 642 4368 70 71\n"
+ "gi|543583796|ref|NR_104216.1| 4573 5141 70 71\n"
+ "gi|543583795|ref|NR_104215.1| 5317 9901 70 71\n"
+ "gi|543583794|ref|NR_104212.1| 5374 15415 70 71\n"
+ "gi|543583788|ref|NM_001282545.1| 4170 20980 70 71\n"
+ "gi|543583786|ref|NM_001282543.1| 5466 25324 70 71\n"
+ "gi|543583785|ref|NM_000465.3| 5523 30980 70 71\n"
+ "gi|543583740|ref|NM_001282549.1| 3984 36696 70 71\n"
+ "gi|543583738|ref|NM_001282548.1| 4113 40851 70 71\n"
+ "gi|530384540|ref|XM_005249645.1| 2752 45151 70 71\n"
+ "gi|530384538|ref|XM_005249644.1| 3004 48071 70 71\n"
+ "gi|530384536|ref|XM_005249643.1| 3109 51246 70 71\n"
+ "gi|530384534|ref|XM_005249642.1| 3097 54528 70 71\n"
+ "gi|530373237|ref|XM_005265508.1| 2794 57830 70 71\n"
+ "gi|530373235|ref|XM_005265507.1| 2848 60824 70 71\n"
+ "gi|530364726|ref|XR_241081.1| 1009 63849 70 71\n"
+ "gi|530364725|ref|XR_241080.1| 4884 65009 70 71\n"
+ "gi|530364724|ref|XR_241079.1| 2819 70099 70 71\n")
+ index_file = Faidx('data/genes.fasta.gz').indexname
+ result_index = open(index_file).read()
+ assert result_index == expect_index
+
+class TestFastaBGZF(TestCase):
+ def setUp(self):
+ try:
+ from Bio import SeqIO
+ except ImportError:
+ raise SkipTest
+
+ def tearDown(self):
+ try:
+ os.remove('data/genes.fasta.gz.fai')
+ except EnvironmentError:
+ pass # some tests may delete this file
+
+ def test_integer_slice(self):
+ fasta = Fasta('data/genes.fasta.gz')
+ expect = fasta['gi|563317589|dbj|AB821309.1|'][:100].seq
+ result = fasta[0][:100].seq
+ assert expect == result
+
+ def test_integer_index(self):
+ fasta = Fasta('data/genes.fasta.gz')
+ expect = fasta['gi|563317589|dbj|AB821309.1|'][100].seq
+ result = fasta[0][100].seq
+ assert expect == result
+
+ def test_fetch_whole_fasta(self):
+ expect = [line.rstrip('\n') for line in open('data/genes.fasta') if line[0] != '>']
+ result = list(chain(*([line for line in record] for record in Fasta('data/genes.fasta.gz', as_raw=True))))
+ assert expect == result
+
+ def test_line_len(self):
+ fasta = Fasta('data/genes.fasta.gz')
+ for record in fasta:
+ assert len(next(iter(record))) == fasta.faidx.index[record.name].lenc
+
+ @raises(UnsupportedCompressionFormat)
+ def test_mutable_bgzf(self):
+ fasta = Fasta('data/genes.fasta.gz', mutable=True)
+
+ @raises(NotImplementedError)
+ def test_long_names(self):
+ """ Test that deflines extracted using FastaRecord.long_name are
+ identical to deflines in the actual file.
+ """
+ deflines = []
+ with open('data/genes.fasta') as fasta_file:
+ for line in fasta_file:
+ if line[0] == '>':
+ deflines.append(line[1:-1])
+ fasta = Fasta('data/genes.fasta.gz')
+ long_names = []
+ for record in fasta:
+ long_names.append(record.long_name)
+ assert deflines == long_names
+
+ def test_fetch_whole_entry(self):
+ faidx = Faidx('data/genes.fasta.gz')
+ expect = ('ATGACATCATTTTCCACCTCTGCTCAGTGTTCAACATCTGA'
+ 'CAGTGCTTGCAGGATCTCTCCTGGACAAATCAATCAGGTACGACCA'
+ 'AAACTGCCGCTTTTGAAGATTTTGCATGCAGCAGGTGCGCAAGG'
+ 'TGAAATGTTCACTGTTAAAGAGGTCATGCACTATTTAGGTCAGTACAT'
+ 'AATGGTGAAGCAACTTTATGATCAGCAGGAGCAGCATATGGTATATTG'
+ 'TGGTGGAGATCTTTTGGGAGAACTACTGGGACGTCAGAGCTTCTCCGTG'
+ 'AAAGACCCAAGCCCTCTCTATGATATGCTAAGAAAGAATCTTGTCACTTT'
+ 'AGCCACTGCTACTACAGCAAAGTGCAGAGGAAAGTTCCACTTCCAGAAAAA'
+ 'GAACTACAGAAGACGATATCCCCACACTGCCTACCTCAGAGCATAAATGCA'
+ 'TACATTCTAGAGAAGGTGATTGAAGTGGGAAAAAATGATGACCTGGAGGACTC')
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 1, 481)
+ assert str(result) == expect
+
+ def test_fetch_middle(self):
+ faidx = Faidx('data/genes.fasta.gz')
+ expect = 'TTGAAGATTTTGCATGCAGCAGGTGCGCAAGGTGAAATGTTCACTGTTAAA'
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 100, 150)
+ assert str(result) == expect
+
+ def test_fetch_end(self):
+ faidx = Faidx('data/genes.fasta.gz')
+ expect = 'TC'
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 480, 481)
+ assert str(result) == expect
+
+ @raises(FetchError)
+ def test_fetch_border(self):
+ """ Fetch past the end of a gene entry """
+ faidx = Faidx('data/genes.fasta.gz')
+ expect = 'TC'
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 480, 500)
+ print(result)
+ assert str(result) == expect
+
+ def test_rev(self):
+ faidx = Faidx('data/genes.fasta.gz')
+ expect = 'GA'
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 480, 481)
+ assert str(-result) == expect, result
+
+ @raises(FetchError)
+ def test_fetch_past_bounds(self):
+ """ Fetch past the end of a gene entry """
+ faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 480, 5000)
+
+ @raises(FetchError)
+ def test_fetch_negative(self):
+ """ Fetch starting with a negative coordinate """
+ faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ -10, 10)
+
+ @raises(FetchError)
+ def test_fetch_reversed_coordinates(self):
+ """ Fetch starting with a negative coordinate """
+ faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 50, 10)
+
+ @raises(FetchError)
+ def test_fetch_keyerror(self):
+ """ Fetch a key that does not exist """
+ faidx = Faidx('data/genes.fasta.gz', strict_bounds=True)
+ result = faidx.fetch('gi|joe|gb|KF435150.1|',
+ 1, 10)
+
+ def test_blank_string(self):
+ """ seq[0:0] should return a blank string mdshw5/pyfaidx#53 """
+ fasta = Fasta('data/genes.fasta.gz', as_raw=True)
+ assert fasta['gi|557361099|gb|KF435150.1|'][0:0] == ''
+
+ def test_slice_from_beginning(self):
+ fasta = Fasta('data/genes.fasta.gz', as_raw=True)
+ assert fasta['gi|557361099|gb|KF435150.1|'][:4] == 'ATGA'
+
+ def test_slice_from_end(self):
+ fasta = Fasta('data/genes.fasta.gz', as_raw=True)
+ assert fasta['gi|557361099|gb|KF435150.1|'][-4:] == 'ACTC'
+
+ def test_issue_74_start(self):
+ f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
+ f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
+ assert f0['gi|557361099|gb|KF435150.1|'][0:90].start == f1['gi|557361099|gb|KF435150.1|'][0:90].start - 1
+
+ def test_issue_74_consistency(self):
+ f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
+ f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
+ assert str(f0['gi|557361099|gb|KF435150.1|'][0:90]) == str(f1['gi|557361099|gb|KF435150.1|'][0:90])
+
+ def test_issue_74_end_faidx(self):
+ f0 = Faidx('data/genes.fasta.gz', one_based_attributes=False)
+ f1 = Faidx('data/genes.fasta.gz', one_based_attributes=True)
+ end0 = f0.fetch('gi|557361099|gb|KF435150.1|', 1, 90).end
+ end1 = f1.fetch('gi|557361099|gb|KF435150.1|', 1, 90).end
+ assert end0 == end1
+
+ def test_issue_74_end_fasta(self):
+ f0 = Fasta('data/genes.fasta.gz', one_based_attributes=False)
+ f1 = Fasta('data/genes.fasta.gz', one_based_attributes=True)
+ end0 = f0['gi|557361099|gb|KF435150.1|'][1:90].end
+ end1 = f1['gi|557361099|gb|KF435150.1|'][1:90].end
+ print((end0, end1))
+ assert end0 == end1
+
+ def test_issue_79_fix(self):
+ f = Fasta('data/genes.fasta.gz')
+ s = f['gi|557361099|gb|KF435150.1|'][100:105]
+ print((s.start, s.end))
+ assert (101, 105) == (s.start, s.end)
+
+ def test_issue_79_fix_negate(self):
+ f = Fasta('data/genes.fasta.gz')
+ s = f['gi|557361099|gb|KF435150.1|'][100:105]
+ s = -s
+ print((s.start, s.end))
+ assert (105, 101) == (s.start, s.end)
+
+ def test_issue_79_fix_one_based_false(self):
+ f = Fasta('data/genes.fasta.gz', one_based_attributes=False)
+ s = f['gi|557361099|gb|KF435150.1|'][100:105]
+ print((s.start, s.end))
+ assert (100, 105) == (s.start, s.end)
+
+ def test_issue_79_fix_one_based_false_negate(self):
+ f = Fasta('data/genes.fasta.gz', one_based_attributes=False)
+ s = f['gi|557361099|gb|KF435150.1|'][100:105]
+ print(s.__dict__)
+ s = -s
+ print(s.__dict__)
+ assert (105, 100) == (s.start, s.end)
+
+ @raises(FetchError)
+ def test_fetch_border_padded(self):
+ """ Fetch past the end of a gene entry """
+ faidx = Faidx('data/genes.fasta.gz', default_seq='N')
+ expect = 'TCNNNNNNNNNNNNNNNNNNN'
+ result = faidx.fetch('gi|557361099|gb|KF435150.1|',
+ 480, 500)
+ print(result)
+ assert str(result) == expect
=====================================
tests/test_faidx.py
=====================================
--- a/tests/test_faidx.py
+++ b/tests/test_faidx.py
@@ -1,8 +1,10 @@
import os
+import filecmp
from pyfaidx import FastaIndexingError, BedError, FetchError
from pyfaidx.cli import main
from nose.tools import raises
from unittest import TestCase
+from tempfile import NamedTemporaryFile
path = os.path.dirname(__file__)
os.chdir(path)
@@ -36,3 +38,15 @@ class TestCLI(TestCase):
def test_key_warning(self):
main(['data/genes.fasta', 'foo'])
+
+ def test_auto_strand(self):
+ """ Test that --auto-strand produces the same output as --reverse --complement"""
+ with NamedTemporaryFile() as auto_strand:
+ with NamedTemporaryFile() as noto_strand:
+ main(['--auto-strand', '-o', auto_strand.name, 'data/genes.fasta', 'gi|557361099|gb|KF435150.1|:100-1'])
+ main(['--reverse', '--complement', '-o', noto_strand.name, 'data/genes.fasta', 'gi|557361099|gb|KF435150.1|:1-100'])
+ print(auto_strand.read())
+ print()
+ print(noto_strand.read())
+ self.assertTrue(filecmp.cmp(auto_strand.name, noto_strand.name))
+
=====================================
tests/test_feature_bounds_check.py
=====================================
--- a/tests/test_feature_bounds_check.py
+++ b/tests/test_feature_bounds_check.py
@@ -29,7 +29,7 @@ class TestFeatureBoundsCheck:
'GAACTACAGAAGACGATATCCCCACACTGCCTACCTCAGAGCATAAATGCA'
'TACATTCTAGAGAAGGTGATTGAAGTGGGAAAAAATGATGACCTGGAGGACTC')
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
- 1, 482)
+ 1, 481)
assert str(result) == expect
def test_fetch_middle(self):
@@ -43,7 +43,7 @@ class TestFeatureBoundsCheck:
faidx = Faidx('data/genes.fasta')
expect = 'TC'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
- 480, 482)
+ 480, 481)
assert str(result) == expect
def test_fetch_border(self):
@@ -58,7 +58,7 @@ class TestFeatureBoundsCheck:
faidx = Faidx('data/genes.fasta')
expect = 'GA'
result = faidx.fetch('gi|557361099|gb|KF435150.1|',
- 480, 482)
+ 480, 481)
assert str(-result) == expect, result
@raises(FetchError)
=====================================
tests/test_feature_indexing.py
=====================================
--- a/tests/test_feature_indexing.py
+++ b/tests/test_feature_indexing.py
@@ -1,6 +1,6 @@
import os
from os.path import getmtime
-from pyfaidx import Faidx, FastaIndexingError
+from pyfaidx import Faidx, FastaIndexingError, IndexNotFoundError, FastaNotFoundError
from nose.tools import raises
from nose.plugins.skip import Skip, SkipTest
from unittest import TestCase
@@ -32,32 +32,59 @@ class TestIndexing(TestCase):
def test_build(self):
expect_index = ("gi|563317589|dbj|AB821309.1| 3510 114 70 71\n"
- "gi|557361099|gb|KF435150.1| 481 3789 70 71\n"
- "gi|557361097|gb|KF435149.1| 642 4368 70 71\n"
- "gi|543583796|ref|NR_104216.1| 4573 5141 70 71\n"
- "gi|543583795|ref|NR_104215.1| 5317 9901 70 71\n"
- "gi|543583794|ref|NR_104212.1| 5374 15415 70 71\n"
- "gi|543583788|ref|NM_001282545.1| 4170 20980 70 71\n"
- "gi|543583786|ref|NM_001282543.1| 5466 25324 70 71\n"
- "gi|543583785|ref|NM_000465.3| 5523 30980 70 71\n"
- "gi|543583740|ref|NM_001282549.1| 3984 36696 70 71\n"
- "gi|543583738|ref|NM_001282548.1| 4113 40851 70 71\n"
- "gi|530384540|ref|XM_005249645.1| 2752 45151 70 71\n"
- "gi|530384538|ref|XM_005249644.1| 3004 48071 70 71\n"
- "gi|530384536|ref|XM_005249643.1| 3109 51246 70 71\n"
- "gi|530384534|ref|XM_005249642.1| 3097 54528 70 71\n"
- "gi|530373237|ref|XM_005265508.1| 2794 57830 70 71\n"
- "gi|530373235|ref|XM_005265507.1| 2848 60824 70 71\n"
- "gi|530364726|ref|XR_241081.1| 1009 63849 70 71\n"
- "gi|530364725|ref|XR_241080.1| 4884 65009 70 71\n"
- "gi|530364724|ref|XR_241079.1| 2819 70099 70 71\n")
+ "gi|557361099|gb|KF435150.1| 481 3789 70 71\n"
+ "gi|557361097|gb|KF435149.1| 642 4368 70 71\n"
+ "gi|543583796|ref|NR_104216.1| 4573 5141 70 71\n"
+ "gi|543583795|ref|NR_104215.1| 5317 9901 70 71\n"
+ "gi|543583794|ref|NR_104212.1| 5374 15415 70 71\n"
+ "gi|543583788|ref|NM_001282545.1| 4170 20980 70 71\n"
+ "gi|543583786|ref|NM_001282543.1| 5466 25324 70 71\n"
+ "gi|543583785|ref|NM_000465.3| 5523 30980 70 71\n"
+ "gi|543583740|ref|NM_001282549.1| 3984 36696 70 71\n"
+ "gi|543583738|ref|NM_001282548.1| 4113 40851 70 71\n"
+ "gi|530384540|ref|XM_005249645.1| 2752 45151 70 71\n"
+ "gi|530384538|ref|XM_005249644.1| 3004 48071 70 71\n"
+ "gi|530384536|ref|XM_005249643.1| 3109 51246 70 71\n"
+ "gi|530384534|ref|XM_005249642.1| 3097 54528 70 71\n"
+ "gi|530373237|ref|XM_005265508.1| 2794 57830 70 71\n"
+ "gi|530373235|ref|XM_005265507.1| 2848 60824 70 71\n"
+ "gi|530364726|ref|XR_241081.1| 1009 63849 70 71\n"
+ "gi|530364725|ref|XR_241080.1| 4884 65009 70 71\n"
+ "gi|530364724|ref|XR_241079.1| 2819 70099 70 71\n")
index_file = Faidx('data/genes.fasta').indexname
result_index = open(index_file).read()
assert result_index == expect_index
+ def test_build_issue_111(self):
+ expect_index = ("gi|563317589|dbj|AB821309 3510 114 70 71\n"
+ "gi|557361099|gb|KF435150 481 3789 70 71\n"
+ "gi|557361097|gb|KF435149 642 4368 70 71\n"
+ "gi|543583796|ref|NR_104216 4573 5141 70 71\n"
+ "gi|543583795|ref|NR_104215 5317 9901 70 71\n"
+ "gi|543583794|ref|NR_104212 5374 15415 70 71\n"
+ "gi|543583788|ref|NM_001282545 4170 20980 70 71\n"
+ "gi|543583786|ref|NM_001282543 5466 25324 70 71\n"
+ "gi|543583785|ref|NM_000465 5523 30980 70 71\n"
+ "gi|543583740|ref|NM_001282549 3984 36696 70 71\n"
+ "gi|543583738|ref|NM_001282548 4113 40851 70 71\n"
+ "gi|530384540|ref|XM_005249645 2752 45151 70 71\n"
+ "gi|530384538|ref|XM_005249644 3004 48071 70 71\n"
+ "gi|530384536|ref|XM_005249643 3109 51246 70 71\n"
+ "gi|530384534|ref|XM_005249642 3097 54528 70 71\n"
+ "gi|530373237|ref|XM_005265508 2794 57830 70 71\n"
+ "gi|530373235|ref|XM_005265507 2848 60824 70 71\n"
+ "gi|530364726|ref|XR_241081 1009 63849 70 71\n"
+ "gi|530364725|ref|XR_241080 4884 65009 70 71\n"
+ "gi|530364724|ref|XR_241079 2819 70099 70 71\n")
+ index = Faidx(
+ 'data/genes.fasta',
+ read_long_names=True,
+ key_function=lambda x: x.split('.')[0])
+ result_index = ''.join(index._index_as_string())
+ assert result_index == expect_index
+
def test_order(self):
- order = ("gi|563317589|dbj|AB821309.1|",
- "gi|557361099|gb|KF435150.1|",
+ order = ("gi|563317589|dbj|AB821309.1|", "gi|557361099|gb|KF435150.1|",
"gi|557361097|gb|KF435149.1|",
"gi|543583796|ref|NR_104216.1|",
"gi|543583795|ref|NR_104215.1|",
@@ -202,11 +229,14 @@ class TestIndexing(TestCase):
# Write simple fasta file with inconsistent sequence line lengths,
# so building an index raises a 'FastaIndexingError'
with open(fasta_path, 'w') as fasta_out:
- fasta_out.write(">seq1\nCTCCGGGCCCAT\nAACACTTGGGGGTAGCTAAAGTGAA\nATAAAGCCTAAA\n")
+ fasta_out.write(
+ ">seq1\nCTCCGGGCCCAT\nAACACTTGGGGGTAGCTAAAGTGAA\nATAAAGCCTAAA\n"
+ )
builtins_open = builtins.open
- opened_files=[]
+ opened_files = []
+
def test_open(*args, **kwargs):
f = builtins_open(*args, **kwargs)
opened_files.append(f)
@@ -215,7 +245,9 @@ class TestIndexing(TestCase):
with mock.patch('six.moves.builtins.open', side_effect=test_open):
try:
Faidx(fasta_path)
- self.assertFail("Faidx construction should fail with 'FastaIndexingError'.")
+ self.assertFail(
+ "Faidx construction should fail with 'FastaIndexingError'."
+ )
except FastaIndexingError:
pass
self.assertTrue(all(f.closed for f in opened_files))
@@ -240,7 +272,8 @@ class TestIndexing(TestCase):
builtins_open = builtins.open
- opened_files=[]
+ opened_files = []
+
def test_open(*args, **kwargs):
f = builtins_open(*args, **kwargs)
opened_files.append(f)
@@ -249,10 +282,16 @@ class TestIndexing(TestCase):
with mock.patch('six.moves.builtins.open', side_effect=test_open):
try:
Faidx(fasta_path)
- self.assertFail("Faidx construction should fail with 'ValueError'.")
+ self.assertFail(
+ "Faidx construction should fail with 'ValueError'.")
except ValueError:
pass
self.assertTrue(all(f.closed for f in opened_files))
finally:
shutil.rmtree(tmp_dir)
+ @raises(IndexNotFoundError)
+ def test_issue_134_no_build_index(self):
+ """ Ensure that index file is not built when build_index=False. See mdshw5/pyfaidx#134.
+ """
+ faidx = Faidx('data/genes.fasta', build_index=False)
=====================================
tests/test_feature_key_function.py
=====================================
--- a/tests/test_feature_key_function.py
+++ b/tests/test_feature_key_function.py
@@ -34,6 +34,8 @@ def get_duplicated_gene_name(accession):
class TestFeatureKeyFunction(TestCase):
def setUp(self):
+ genes = Fasta('data/genes.fasta')
+ del genes # Support feature introduced in #111
pass
def tearDown(self):
@@ -64,3 +66,15 @@ class TestFeatureKeyFunction(TestCase):
@raises(ValueError)
def test_duplicated_keys(self):
genes = Fasta('data/genes.fasta', key_function=get_duplicated_gene_name)
+
+ def test_duplicated_keys_shortest(self):
+ genes = Fasta('data/genes.fasta', key_function=get_duplicated_gene_name, duplicate_action="shortest")
+ expect = 4573
+ result = len(genes["BARD1"])
+ assert expect == result
+
+ def test_duplicated_keys_longest(self):
+ genes = Fasta('data/genes.fasta', key_function=get_duplicated_gene_name, duplicate_action="longest")
+ expect = 5317
+ result = len(genes["BARD1"])
+ assert expect == result
=====================================
tests/test_feature_spliced_seq.py
=====================================
--- /dev/null
+++ b/tests/test_feature_spliced_seq.py
@@ -0,0 +1,74 @@
+import os
+from pyfaidx import Fasta
+from unittest import TestCase
+
+path = os.path.dirname(__file__)
+os.chdir(path)
+
+class TestFeatureSplicedSeq(TestCase):
+ def setUp(self):
+ pass
+
+ def tearDown(self):
+ fais = [
+ "data/gene.bed12.fasta.fai",
+ "data/chr17.hg19.part.fa.fai"
+ ]
+ for fai in fais:
+ try:
+ os.remove(fai)
+ except EnvironmentError:
+ pass # some tests may delete this file
+
+ def test_split_seq(self):
+ """ Fetch sequence by blocks """
+ fa = Fasta('data/chr17.hg19.part.fa')
+
+ gene = Fasta("data/gene.bed12.fasta")
+ expect = gene[list(gene.keys())[0]][:].seq
+
+ bed = "data/gene.bed12"
+ with open(bed) as fi:
+ record = fi.readline().strip().split("\t")
+
+ chrom = record[0]
+ start = int(record[1])
+ strand = record[5]
+
+ # parse bed12 format
+ starts = [int(x) for x in record[11].split(",")[:-1]]
+ sizes = [int(x) for x in record[10].split(",")[:-1]]
+ starts = [start + x for x in starts]
+ ends = [start + size for start,size in zip(starts, sizes)]
+
+ # bed half-open
+ if strand == "-":
+ starts = [start + 1 for start in starts]
+ else:
+ ends = [end - 1 for end in ends]
+
+ intervals = zip(starts, ends)
+ result = fa.get_spliced_seq(chrom, intervals, rc=True)
+ print(result.seq)
+ print("====")
+ print(expect)
+
+ assert result.seq == expect
+
+ def test_get_seq_rc(self):
+ """ Check get_seq with rc argument """
+ fa = Fasta('data/chr17.hg19.part.fa')
+
+ result = fa.get_seq("chr17", 11, 20, rc=False)
+ expect = "CCCTGTTCCT"
+ print("normal")
+ print(result.seq)
+ print(expect)
+ assert result.seq == expect
+
+ result = fa.get_seq("chr17", 11, 20, rc=True)
+ expect = "AGGAACAGGG"
+ assert result.seq == expect
+ print("rc")
+ print(result.seq)
+ print(expect)
=====================================
tests/test_feature_split_char.py
=====================================
--- a/tests/test_feature_split_char.py
+++ b/tests/test_feature_split_char.py
@@ -18,20 +18,24 @@ class TestFeatureSplitChar(TestCase):
pass # some tests may delete this file
def test_keys(self):
- fasta = Fasta('data/genes.fasta', split_char='|')
+ fasta = Fasta('data/genes.fasta', split_char='|', duplicate_action="drop")
expect = ['530364724', '530364725', '530364726', '530373235', '530373237', '530384534', '530384536', '530384538', '530384540', '543583738', '543583740', '543583785', '543583786', '543583788', '543583794', '543583795', '543583796', '557361097', '557361099', '563317589', 'AB821309.1', 'KF435149.1', 'KF435150.1', 'NM_000465.3', 'NM_001282543.1', 'NM_001282545.1', 'NM_001282548.1', 'NM_001282549.1', 'NR_104212.1', 'NR_104215.1', 'NR_104216.1', 'XM_005249642.1', 'XM_005249643.1', 'XM_005249644.1', 'XM_005249645.1', 'XM_005265507.1', 'XM_005265508.1', 'XR_241079.1', 'XR_241080.1', 'XR_241081.1', 'dbj']
result = sorted(fasta.keys())
assert result == expect
def test_key_function_by_dictionary_get_key(self):
- fasta = Fasta('data/genes.fasta', split_char='|')
+ fasta = Fasta('data/genes.fasta', split_char='|', duplicate_action="drop")
expect = 'TTGAAGATTTTGCATGCAGCAGGTGCGCAAGGTGAAATGTTCACTGTTAAA'
result = fasta['KF435150.1'][100-1:150]
assert str(result) == expect
def test_key_function_by_fetch(self):
- faidx = Faidx('data/genes.fasta', split_char='|')
+ faidx = Faidx('data/genes.fasta', split_char='|', duplicate_action="drop")
expect = 'TTGAAGATTTTGCATGCAGCAGGTGCGCAAGGTGAAATGTTCACTGTTAAA'
result = faidx.fetch('KF435150.1',
100, 150)
assert str(result) == expect
+
+ @raises(ValueError)
+ def test_stop(self):
+ fasta = Fasta('data/genes.fasta', split_char='|')
View it on GitLab: https://salsa.debian.org/med-team/python-pyfaidx/compare/a09e0d29890acdcc1cbfaaf16b562d5d7227faaa...6ab6bd1ff8665cb44fcbfcb531915cf17349bdda
---
View it on GitLab: https://salsa.debian.org/med-team/python-pyfaidx/compare/a09e0d29890acdcc1cbfaaf16b562d5d7227faaa...6ab6bd1ff8665cb44fcbfcb531915cf17349bdda
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20180416/83a3d558/attachment-0001.html>
More information about the debian-med-commit
mailing list