[med-svn] [Git][med-team/python-pyvcf][debian/jessie-backports] 16 commits: Apply patch by Scott Kitterman
Andreas Tille
gitlab at salsa.debian.org
Wed Jul 18 10:47:19 BST 2018
Andreas Tille pushed to branch debian/jessie-backports at Debian Med / python-pyvcf
Commits:
3a0d92bb by Andreas Tille at 2015-12-31T12:21:47+01:00
Apply patch by Scott Kitterman
- - - - -
482dccb9 by Andreas Tille at 2015-12-31T12:22:44+01:00
Upload to unstable
- - - - -
d95234c9 by Andreas Tille at 2016-12-02T19:42:18+01:00
Fix watch file
- - - - -
9c5c2758 by Andreas Tille at 2016-12-02T20:18:10+01:00
Fix obtaining version in get-orig-source
- - - - -
de030e81 by Andreas Tille at 2016-12-02T20:19:03+01:00
New upstream version 0.6.8
- - - - -
e5b6d3ab by Andreas Tille at 2016-12-02T20:19:04+01:00
Merge tag 'upstream/0.6.8'
Upstream version 0.6.8
- - - - -
8bcbef7c by Andreas Tille at 2016-12-02T20:19:30+01:00
New upstream version
- - - - -
a33cd399 by Andreas Tille at 2016-12-02T20:20:09+01:00
cme fix dpkg-control
- - - - -
31ef761b by Andreas Tille at 2016-12-02T20:20:44+01:00
debhelper 10
- - - - -
4e981625 by Andreas Tille at 2016-12-02T20:23:38+01:00
Patch applied upstream
- - - - -
3d2ee246 by Andreas Tille at 2016-12-02T20:30:29+01:00
hardening=+all
- - - - -
1cacc9de by Andreas Tille at 2016-12-02T20:31:07+01:00
Remove unused lintian override
- - - - -
5a22064b by Andreas Tille at 2016-12-02T21:15:30+01:00
Rename all *.py files and upload
- - - - -
03196a79 by Andreas Tille at 2017-06-09T23:05:50+02:00
Merge branch 'master' into debian/jessie-backports
- - - - -
b64aa101 by Andreas Tille at 2017-06-09T23:06:10+02:00
Rebuild for jessie-backports
- - - - -
3f9805e0 by Andreas Tille at 2017-06-16T23:59:18+02:00
export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
- - - - -
24 changed files:
- PKG-INFO
- README.rst
- debian/changelog
- debian/compat
- debian/control
- debian/get-orig-source
- − debian/patches/series
- − debian/patches/use_setuptools.patch
- − debian/python-pyvcf-examples.lintian-overrides
- debian/rules
- debian/watch
- + scripts/vcf_sample_filter.py
- setup.cfg
- setup.py
- vcf/__init__.py
- vcf/cparse.pyx
- vcf/model.py
- vcf/parser.py
- + vcf/sample_filter.py
- + vcf/test/FT.vcf
- + vcf/test/issue-214.vcf
- + vcf/test/strelka.vcf
- vcf/test/test_vcf.py
- vcf/utils.py
Changes:
=====================================
PKG-INFO
=====================================
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,190 +1,27 @@
Metadata-Version: 1.1
Name: PyVCF
-Version: 0.6.7
+Version: 0.6.8
Summary: Variant Call Format (VCF) parser for Python
Home-page: https://github.com/jamescasbon/PyVCF
Author: James Casbon and @jdoughertyii
Author-email: casbon at gmail.com
License: UNKNOWN
-Description: A VCFv4.0 and 4.1 parser for Python.
-
- Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
-
- The intent of this module is to mimic the ``csv`` module in the Python stdlib,
- as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
- will attempt to parse the content of each record based on the data types
- specified in the meta-information lines -- specifically the ##INFO and
- ##FORMAT lines. If these lines are missing or incomplete, it will check
- against the reserved types mentioned in the spec. Failing that, it will just
- return strings.
-
- There main interface is the class: ``Reader``. It takes a file-like
- object and acts as a reader::
-
- >>> import vcf
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> for record in vcf_reader:
- ... print record
- Record(CHROM=20, POS=14370, REF=G, ALT=[A])
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
- Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-
-
- This produces a great deal of information, but it is conveniently accessed.
- The attributes of a Record are the 8 fixed fields from the VCF spec::
-
- * ``Record.CHROM``
- * ``Record.POS``
- * ``Record.ID``
- * ``Record.REF``
- * ``Record.ALT``
- * ``Record.QUAL``
- * ``Record.FILTER``
- * ``Record.INFO``
-
- plus attributes to handle genotype information:
-
- * ``Record.FORMAT``
- * ``Record.samples``
- * ``Record.genotype``
-
- ``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
- of the fixed fields is from the spec. Comma-separated lists in the VCF are
- converted to lists. In particular, one-entry VCF lists are converted to
- one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
- of key=value pairs are converted to Python dictionaries, with flags being given
- a ``True`` value. Integers and floats are handled exactly as you'd expect::
-
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
- >>> print record.POS
- 14370
- >>> print record.ALT
- [A]
- >>> print record.INFO['AF']
- [0.5]
-
- There are a number of convienience methods and properties for each ``Record`` allowing you to
- examine properties of interest::
-
- >>> print record.num_called, record.call_rate, record.num_unknown
- 3 1.0 0
- >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
- 1 1 1
- >>> print record.nucl_diversity, record.aaf, record.heterozygosity
- 0.6 [0.5] 0.5
- >>> print record.get_hets()
- [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
- >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
- True False True False
- >>> print record.var_type, record.var_subtype
- snp ts
- >>> print record.is_monomorphic
- False
-
- ``record.FORMAT`` will be a string specifying the format of the genotype
- fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
- ``None``. Finally, ``record.samples`` is a list of dictionaries containing the
- parsed sample column and ``record.genotype`` is a way of looking up genotypes
- by sample name::
-
- >>> record = vcf_reader.next()
- >>> for sample in record.samples:
- ... print sample['GT']
- 0|0
- 0|1
- 0/0
- >>> print record.genotype('NA00001')['GT']
- 0|0
-
- The genotypes are represented by ``Call`` objects, which have three attributes: the
- corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
- call data in ``data``::
-
- >>> call = record.genotype('NA00001')
- >>> print call.site
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- >>> print call.sample
- NA00001
- >>> print call.data
- CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-
- Please note that as of release 0.4.0, attributes known to have single values (such as
- ``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
- as lists (such as ``HQ`` above).
-
- There are also a number of methods::
-
- >>> print call.called, call.gt_type, call.gt_bases, call.phased
- True 0 T|T True
-
- Metadata regarding the VCF file itself can be investigated through the
- following attributes:
-
- * ``Reader.metadata``
- * ``Reader.infos``
- * ``Reader.filters``
- * ``Reader.formats``
- * ``Reader.samples``
-
- For example::
-
- >>> vcf_reader.metadata['fileDate']
- '20090805'
- >>> vcf_reader.samples
- ['NA00001', 'NA00002', 'NA00003']
- >>> vcf_reader.filters
- OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
- >>> vcf_reader.infos['AA'].desc
- 'Ancestral Allele'
-
- ALT records are actually classes, so that you can interrogate them::
-
- >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
- >>> print row
- Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
- >>> bnd = row.ALT[0]
- >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
- True False True T
-
- Random access is supported for files with tabix indexes. Simply call fetch for the
- region you are interested in::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
- ... print record
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-
- Or extract a single row::
-
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-
-
- The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
- template ``Reader`` which provides the metadata::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
- >>> for record in vcf_reader:
- ... vcf_writer.write_record(record)
-
-
- An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
- declared by other packages will be available for use in this script. Please
- see :doc:`FILTERS` for full description.
-
-
+Description: UNKNOWN
Keywords: bioinformatics
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 2
+Classifier: Programming Language :: Python :: 2.6
+Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
-Classifier: Topic :: Scientific/Engineering
+Classifier: Programming Language :: Python :: 3.2
+Classifier: Programming Language :: Python :: 3.3
+Classifier: Programming Language :: Python :: 3.4
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
=====================================
README.rst
=====================================
--- a/README.rst
+++ b/README.rst
@@ -50,7 +50,7 @@ of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
+ >>> record = next(vcf_reader)
>>> print record.POS
14370
>>> print record.ALT
@@ -58,7 +58,7 @@ a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> print record.INFO['AF']
[0.5]
-There are a number of convienience methods and properties for each ``Record`` allowing you to
+There are a number of convenience methods and properties for each ``Record`` allowing you to
examine properties of interest::
>>> print record.num_called, record.call_rate, record.num_unknown
@@ -82,7 +82,7 @@ fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::
- >>> record = vcf_reader.next()
+ >>> record = next(vcf_reader)
>>> for sample in record.samples:
... print sample['GT']
0|0
@@ -135,27 +135,38 @@ For example::
ALT records are actually classes, so that you can interrogate them::
>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
+ >>> _ = next(reader); row = next(reader)
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T
-Random access is supported for files with tabix indexes. Simply call fetch for the
-region you are interested in::
+The Reader supports retrieval of records within designated regions for files
+with tabix indexes via the fetch method. This requires the pysam module as a
+dependency. Pass in a chromosome, and, optionally, start and end coordinates,
+for the regions of interest::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
+ >>> # fetch all records on chromosome 20 from base 1110696 through 1230237
+ >>> for record in vcf_reader.fetch('20', 1110695, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-Or extract a single row::
+Note that the start and end coordinates are in the zero-based, half-open
+coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very
first base of a chromosome is index 0, and the region includes bases up
+to, but not including the base at the end coordinate. For example::
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
+ >>> # fetch all records on chromosome 4 from base 11 through 20
+ >>> vcf_reader.fetch('4', 10, 20) # doctest: +SKIP
+would include all records overlapping a 10 base pair region from the 11th base
through the 20th base (which is at index 19) of chromosome 4. It would not
+include the 21st base (at index 20). (See
+http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more
+information on the zero-based, half-open coordinate system.)
The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::
@@ -165,8 +176,6 @@ template ``Reader`` which provides the metadata::
>>> for record in vcf_reader:
... vcf_writer.write_record(record)
-
An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
-
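The README changes above hinge on the difference between VCF's one-based `POS` and the zero-based, half-open `start`/`end` coordinates used by `fetch`. A minimal sketch of that conversion (illustrative only, not part of the diff):

```python
# Sketch of the zero-based, half-open convention the new README text
# describes: VCF POS is one-based, while _Record.start / _Record.end
# are zero-based and half-open, like Python slices.
def to_half_open(pos, ref):
    """Convert a one-based VCF POS plus REF string to (start, end)."""
    start = pos - 1            # first base of the contig has index 0
    end = start + len(ref)     # end is exclusive, slice-style
    return start, end

# POS=1110696 with a single-base REF covers the half-open interval
# [1110695, 1110696), which is why the README example passes 1110695
# as the fetch start coordinate.
print(to_half_open(1110696, 'A'))   # (1110695, 1110696)
```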
=====================================
debian/changelog
=====================================
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,30 @@
+python-pyvcf (0.6.8-1~bpo8+1) jessie-backports; urgency=medium
+
+ * Rebuild for jessie-backports.
+ * export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
+ -- Andreas Tille <tille at debian.org> Fri, 09 Jun 2017 23:06:01 +0200
+
+python-pyvcf (0.6.8-1) unstable; urgency=medium
+
+ * New upstream version
+ * Fix watch file
+ * cme fix dpkg-control
+ * debhelper 10
+ * hardening=+all
+ * Remove unused lintian override
+
+ -- Andreas Tille <tille at debian.org> Fri, 02 Dec 2016 20:33:48 +0100
+
+python-pyvcf (0.6.7-2) unstable; urgency=medium
+
+ [ Scott Kitterman ]
+ * Change python3-dev build-dep to python3-all-dev so package is built for
+ all supported python3 versions
+ Closes: #809487
+
+ -- Andreas Tille <tille at debian.org> Thu, 31 Dec 2015 12:22:30 +0100
+
python-pyvcf (0.6.7-1~bpo8+1) jessie-backports; urgency=medium
* Rebuild for jessie-backports.
=====================================
debian/compat
=====================================
--- a/debian/compat
+++ b/debian/compat
@@ -1 +1 @@
-9
+10
=====================================
debian/control
=====================================
--- a/debian/control
+++ b/debian/control
@@ -3,17 +3,17 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.
Uploaders: Andreas Tille <tille at debian.org>
Section: python
Priority: optional
-Build-Depends: debhelper (>= 9),
+Build-Depends: debhelper (>= 10),
dh-python,
cython,
- python-dev (>= 2.7),
+ python-dev,
python-setuptools,
cython3,
- python3-dev,
+ python3-all-dev,
python3-setuptools
-Standards-Version: 3.9.6
+Standards-Version: 3.9.8
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyvcf.git
-Vcs-Git: git://anonscm.debian.org/debian-med/python-pyvcf.git
+Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyvcf.git
Homepage: https://pypi.python.org/pypi/PyVCF
Package: python-pyvcf
=====================================
debian/get-orig-source
=====================================
--- a/debian/get-orig-source
+++ b/debian/get-orig-source
@@ -2,7 +2,6 @@
# Test-Data needs to be fetched separately from Git
set -e
-set -x
PKG=`dpkg-parsechangelog | awk '/^Source/ { print $2 }'`
@@ -10,7 +9,7 @@ if ! echo $@ | grep -q upstream-version ; then
VERSION=`dpkg-parsechangelog | awk '/^Version:/ { print $2 }' | sed 's/\([0-9\.]\+\)-[0-9]\+$/\1/'`
uscan --verbose --force-download
else
- VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?'`
+ VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?'`
if echo "$VERSION" | grep -q "upstream-version" ; then
echo "Unable to parse version number"
exit
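The sed change above drops the trailing space that the old pattern required after the version number. A quick sketch of the difference (the argument string is illustrative):

```shell
# The old pattern required " .*" after the version, so a version at the
# very end of the argument string never matched and the line passed
# through unchanged; the new pattern matches with or without trailing text.
args="--get-orig-source --upstream-version 0.6.8"
old=$(echo "$args" | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?')
new=$(echo "$args" | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?')
echo "old: $old"   # old: --get-orig-source --upstream-version 0.6.8
echo "new: $new"   # new: 0.6.8
```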
=====================================
debian/patches/series deleted
=====================================
--- a/debian/patches/series
+++ /dev/null
@@ -1 +0,0 @@
-use_setuptools.patch
=====================================
debian/patches/use_setuptools.patch deleted
=====================================
--- a/debian/patches/use_setuptools.patch
+++ /dev/null
@@ -1,16 +0,0 @@
-Author: Andreas Tille <tille at debian.org>
-Last-Updated: Tue, 13 Oct 2015 17:02:08 +0200
-Description: Use setuptools instead of distribute
- see https://lists.alioth.debian.org/pipermail/debian-med-packaging/2015-October/035410.html
-
---- a/setup.py
-+++ b/setup.py
-@@ -53,7 +53,7 @@ setup(
- description='Variant Call Format (VCF) parser for Python',
- long_description=DOC,
- test_suite='vcf.test.test_vcf.suite',
-- install_requires=['distribute'],
-+ install_requires=['setuptools'],
- requires=requires,
- entry_points = {
- 'vcf.filters': [
=====================================
debian/python-pyvcf-examples.lintian-overrides deleted
=====================================
--- a/debian/python-pyvcf-examples.lintian-overrides
+++ /dev/null
@@ -1,3 +0,0 @@
-# This is an unchanged file from upstream tarball
-python-pyvcf-examples: package-contains-timestamped-gzip usr/share/doc/python-pyvcf-examples/test/1kg.vcf.gz
-
=====================================
debian/rules
=====================================
--- a/debian/rules
+++ b/debian/rules
@@ -4,6 +4,8 @@
# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1
+export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
export PYBUILD_NAME=pyvcf
examplesdir=debian/python-$(PYBUILD_NAME)-examples/usr/share/doc/python-$(PYBUILD_NAME)-examples
helperdir=debian/$(PYBUILD_NAME)/usr
@@ -21,7 +23,9 @@ override_dh_install:
mkdir -p $(helperdir)
mv debian/python3-$(PYBUILD_NAME)/usr/bin $(helperdir)
# strip .py extension
- mv $(helperdir)/bin/vcf_filter.py $(helperdir)/bin/vcf_filter
+ for py in $(helperdir)/bin/*.py ; do \
+ mv $${py} $(helperdir)/bin/`basename $${py} .py` ; \
+ done
rm -rf debian/python-$(PYBUILD_NAME)/usr/bin
get-orig-source:
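The generalized rename loop in debian/rules can be sketched as plain shell (the temp directory stands in for `$(helperdir)/bin`):

```shell
# Sketch of the new rules loop: strip the .py extension from every
# script in the bin directory, not just vcf_filter.py.
bindir=$(mktemp -d)
touch "$bindir/vcf_filter.py" "$bindir/vcf_sample_filter.py" "$bindir/vcf_melt"
for py in "$bindir"/*.py ; do
    mv "$py" "$bindir/$(basename "$py" .py)"
done
ls "$bindir"
```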
=====================================
debian/watch
=====================================
--- a/debian/watch
+++ b/debian/watch
@@ -1,5 +1,4 @@
# Test-Data needs to be fetched separately from Git so wee need to call get-orig-source
-
-version=3
-https://pypi.python.org/pypi/PyVCF .*/PyVCF-([0-9.]+).tar.[xgb]z2? \
+version=4
+http://pypi.debian.net/PyVCF/PyVCF-(.+)\.(?:tar(?:\.gz|\.bz2)?|tgz) \
debian debian/get-orig-source
=====================================
scripts/vcf_sample_filter.py
=====================================
--- /dev/null
+++ b/scripts/vcf_sample_filter.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env python
+
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import argparse
+import logging
+
+from vcf import SampleFilter
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("file", help="VCF file to filter")
+ parser.add_argument("-o", metavar="outfile",
+ help="File to write out filtered samples")
+ parser.add_argument("-f", metavar="filters",
+ help="Comma-separated list of sample indices or names \
+ to filter")
+ parser.add_argument("-i", "--invert", action="store_true",
+ help="Keep rather than discard the filtered samples")
+ parser.add_argument("-q", "--quiet", action="store_true",
+ help="Less output")
+
+ args = parser.parse_args()
+
+ if args.quiet:
+ log_level = logging.WARNING
+ else:
+ log_level = logging.INFO
+ logging.basicConfig(format='%(message)s', level=log_level)
+
+ sf = SampleFilter(infile=args.file, outfile=args.o,
+ filters=args.f, invert=args.invert)
+ if args.f is None:
+ print "Samples:"
+ for idx, val in enumerate(sf.samples):
+ print "{0}: {1}".format(idx, val)
=====================================
setup.cfg
=====================================
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,5 +1,5 @@
[egg_info]
-tag_build =
tag_date = 0
+tag_build =
tag_svn_revision = 0
=====================================
setup.py
=====================================
--- a/setup.py
+++ b/setup.py
@@ -1,6 +1,6 @@
from setuptools import setup
-from distutils.core import setup
from distutils.extension import Extension
+import sys
try:
from Cython.Distutils import build_ext
@@ -8,23 +8,14 @@ try:
except:
CYTHON = False
-requires = []
+IS_PYTHON26 = sys.version_info[:2] == (2, 6)
-# python 2.6 does not have argparse
-try:
- import argparse
-except ImportError:
- requires.append('argparse')
+DEPENDENCIES = ['setuptools']
+
+if IS_PYTHON26:
+ DEPENDENCIES.extend(['argparse', 'counter', 'ordereddict',
+ 'unittest2'])
-import collections
-try:
- collections.Counter
-except AttributeError:
- requires.append('counter')
-try:
- collections.OrderedDict
-except AttributeError:
- requires.append('ordereddict')
# get the version without an import
VERSION = "Undefined"
@@ -47,14 +38,14 @@ if CYTHON:
setup(
name='PyVCF',
packages=['vcf', 'vcf.test'],
- scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py'],
+ scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py',
+ 'scripts/vcf_sample_filter.py'],
author='James Casbon and @jdoughertyii',
author_email='casbon at gmail.com',
description='Variant Call Format (VCF) parser for Python',
long_description=DOC,
test_suite='vcf.test.test_vcf.suite',
- install_requires=['distribute'],
- requires=requires,
+ install_requires=DEPENDENCIES,
entry_points = {
'vcf.filters': [
'site_quality = vcf.filters:SiteQuality',
@@ -71,10 +62,19 @@ setup(
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
+ 'License :: OSI Approved :: BSD License',
+ 'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
+ 'Programming Language :: Cython',
'Programming Language :: Python',
+ 'Programming Language :: Python :: 2',
+ 'Programming Language :: Python :: 2.6',
+ 'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
- 'Topic :: Scientific/Engineering',
+ 'Programming Language :: Python :: 3.2',
+ 'Programming Language :: Python :: 3.3',
+ 'Programming Language :: Python :: 3.4',
+ 'Topic :: Scientific/Engineering :: Bio-Informatics',
],
keywords='bioinformatics',
use_2to3=True,
=====================================
vcf/__init__.py
=====================================
--- a/vcf/__init__.py
+++ b/vcf/__init__.py
@@ -1,180 +1,15 @@
#!/usr/bin/env python
-'''A VCFv4.0 and 4.1 parser for Python.
+"""
+A VCFv4.0 and 4.1 parser for Python.
Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
+"""
-The intent of this module is to mimic the ``csv`` module in the Python stdlib,
-as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
-will attempt to parse the content of each record based on the data types
-specified in the meta-information lines -- specifically the ##INFO and
-##FORMAT lines. If these lines are missing or incomplete, it will check
-against the reserved types mentioned in the spec. Failing that, it will just
-return strings.
-There main interface is the class: ``Reader``. It takes a file-like
-object and acts as a reader::
-
- >>> import vcf
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> for record in vcf_reader:
- ... print record
- Record(CHROM=20, POS=14370, REF=G, ALT=[A])
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
- Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-
-
-This produces a great deal of information, but it is conveniently accessed.
-The attributes of a Record are the 8 fixed fields from the VCF spec::
-
- * ``Record.CHROM``
- * ``Record.POS``
- * ``Record.ID``
- * ``Record.REF``
- * ``Record.ALT``
- * ``Record.QUAL``
- * ``Record.FILTER``
- * ``Record.INFO``
-
-plus attributes to handle genotype information:
-
- * ``Record.FORMAT``
- * ``Record.samples``
- * ``Record.genotype``
-
-``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
-of the fixed fields is from the spec. Comma-separated lists in the VCF are
-converted to lists. In particular, one-entry VCF lists are converted to
-one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
-of key=value pairs are converted to Python dictionaries, with flags being given
-a ``True`` value. Integers and floats are handled exactly as you'd expect::
-
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
- >>> print record.POS
- 14370
- >>> print record.ALT
- [A]
- >>> print record.INFO['AF']
- [0.5]
-
-There are a number of convienience methods and properties for each ``Record`` allowing you to
-examine properties of interest::
-
- >>> print record.num_called, record.call_rate, record.num_unknown
- 3 1.0 0
- >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
- 1 1 1
- >>> print record.nucl_diversity, record.aaf, record.heterozygosity
- 0.6 [0.5] 0.5
- >>> print record.get_hets()
- [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
- >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
- True False True False
- >>> print record.var_type, record.var_subtype
- snp ts
- >>> print record.is_monomorphic
- False
-
-``record.FORMAT`` will be a string specifying the format of the genotype
-fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
-``None``. Finally, ``record.samples`` is a list of dictionaries containing the
-parsed sample column and ``record.genotype`` is a way of looking up genotypes
-by sample name::
-
- >>> record = vcf_reader.next()
- >>> for sample in record.samples:
- ... print sample['GT']
- 0|0
- 0|1
- 0/0
- >>> print record.genotype('NA00001')['GT']
- 0|0
-
-The genotypes are represented by ``Call`` objects, which have three attributes: the
-corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
-call data in ``data``::
-
- >>> call = record.genotype('NA00001')
- >>> print call.site
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- >>> print call.sample
- NA00001
- >>> print call.data
- CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-
-Please note that as of release 0.4.0, attributes known to have single values (such as
-``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
-as lists (such as ``HQ`` above).
-
-There are also a number of methods::
-
- >>> print call.called, call.gt_type, call.gt_bases, call.phased
- True 0 T|T True
-
-Metadata regarding the VCF file itself can be investigated through the
-following attributes:
-
- * ``Reader.metadata``
- * ``Reader.infos``
- * ``Reader.filters``
- * ``Reader.formats``
- * ``Reader.samples``
-
-For example::
-
- >>> vcf_reader.metadata['fileDate']
- '20090805'
- >>> vcf_reader.samples
- ['NA00001', 'NA00002', 'NA00003']
- >>> vcf_reader.filters
- OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
- >>> vcf_reader.infos['AA'].desc
- 'Ancestral Allele'
-
-ALT records are actually classes, so that you can interrogate them::
-
- >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
- >>> print row
- Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
- >>> bnd = row.ALT[0]
- >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
- True False True T
-
-Random access is supported for files with tabix indexes. Simply call fetch for the
-region you are interested in::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
- ... print record
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-
-Or extract a single row::
-
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-
-
-The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
-template ``Reader`` which provides the metadata::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
- >>> for record in vcf_reader:
- ... vcf_writer.write_record(record)
-
-
-An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
-declared by other packages will be available for use in this script. Please
-see :doc:`FILTERS` for full description.
-
-'''
from vcf.parser import Reader, Writer
from vcf.parser import VCFReader, VCFWriter
from vcf.filters import Base as Filter
from vcf.parser import RESERVED_INFO, RESERVED_FORMAT
+from vcf.sample_filter import SampleFilter
-VERSION = '0.6.7'
+VERSION = '0.6.8'
=====================================
vcf/cparse.pyx
=====================================
--- a/vcf/cparse.pyx
+++ b/vcf/cparse.pyx
@@ -36,7 +36,10 @@ def parse_samples(
vals = sampvals[j]
# short circuit the most common
- if vals == '.' or vals == './.':
+ if samp_fmt._fields[j] == 'GT':
+ sampdat[j] = vals
+ continue
+ elif not vals or vals == '.':
sampdat[j] = None
continue
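The cparse.pyx hunk changes the per-sample short-circuit so GT is always kept verbatim (previously `./.` was collapsed to `None`, losing ploidy information). A rough pure-Python rendering of the new branch (function and variable names are assumptions, not the actual Cython code):

```python
def parse_sample_values(fields, sampvals):
    """Mirror of the new short-circuit: GT verbatim, '.' or empty -> None."""
    sampdat = [None] * len(fields)
    for j, vals in enumerate(sampvals):
        if fields[j] == 'GT':
            sampdat[j] = vals        # keep GT as-is, even './.'
        elif not vals or vals == '.':
            sampdat[j] = None        # missing value
        else:
            sampdat[j] = vals
    return sampdat

print(parse_sample_values(['GT', 'DP'], ['./.', '.']))   # ['./.', None]
```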
=====================================
vcf/model.py
=====================================
--- a/vcf/model.py
+++ b/vcf/model.py
@@ -1,33 +1,39 @@
from abc import ABCMeta, abstractmethod
import collections
import sys
+import re
try:
from collections import Counter
except ImportError:
from counter import Counter
+allele_delimiter = re.compile(r'''[|/]''') # to split a genotype into alleles
class _Call(object):
""" A genotype call, a cell entry in a VCF file"""
- __slots__ = ['site', 'sample', 'data', 'gt_nums', 'called']
+ __slots__ = ['site', 'sample', 'data', 'gt_nums', 'gt_alleles', 'called', 'ploidity']
def __init__(self, site, sample, data):
#: The ``_Record`` for this ``_Call``
self.site = site
#: The sample name
self.sample = sample
- #: Dictionary of data from the VCF file
+ #: Namedtuple of data from the VCF file
self.data = data
- try:
- self.gt_nums = self.data.GT
- #: True if the GT is not ./.
- self.called = self.gt_nums is not None
- except AttributeError:
- self.gt_nums = None
+
+ if hasattr(self.data, 'GT'):
+ self.gt_alleles = [(al if al != '.' else None) for al in allele_delimiter.split(self.data.GT)]
+ self.ploidity = len(self.gt_alleles)
+ self.called = all([al != None for al in self.gt_alleles])
+ self.gt_nums = self.data.GT if self.called else None
+ else:
#62 a call without a genotype is not defined as called or not
+ self.gt_alleles = None
+ self.ploidity = None
self.called = None
+ self.gt_nums = None
def __repr__(self):
return "Call(sample=%s, %s)" % (self.sample, str(self.data))
@@ -51,12 +57,6 @@ class _Call(object):
return "/" if not self.phased else "|"
@property
- def gt_alleles(self):
- '''The numbers of the alleles called at a given sample'''
- # grab the numeric alleles of the gt string; tokenize by phasing
- return self.gt_nums.split(self.gt_phase_char())
-
- @property
def gt_bases(self):
'''The actual genotype alleles.
E.g. if VCF genotype is 0/1, return A/G
@@ -125,10 +125,45 @@ class _Record(object):
INFO and FORMAT are available as properties.
The list of genotype calls is in the ``samples`` property.
+
+ Regarding the coordinates associated with each instance:
+
+ - ``POS``, per VCF specification, is the one-based index
+ (the first base of the contig has an index of 1) of the first
+ base of the ``REF`` sequence.
+ - The ``start`` and ``end`` denote the coordinates of the entire
+ ``REF`` sequence in the zero-based, half-open coordinate
+ system (see
+ http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms),
+ where the first base of the contig has an index of 0, and the
+ interval runs up to, but does not include, the base at the
+ ``end`` index. This indexing scheme is analogous to Python
+ slice notation.
+ - The ``affected_start`` and ``affected_end`` coordinates are
+ also in the zero-based, half-open coordinate system. These
+ coordinates indicate the precise region of the reference
+ genome actually affected by the events denoted in ``ALT``
+ (i.e., the minimum ``affected_start`` and maximum
+ ``affected_end``).
+
+ - For SNPs and structural variants, the affected region
+ includes all bases of ``REF``, including the first base
+ (i.e., ``affected_start = start = POS - 1``).
+ - For deletions, the region includes all bases of ``REF``
+ except the first base, which flanks upstream the actual
+ deletion event, per VCF specification.
+ - For insertions, the ``affected_start`` and ``affected_end``
+ coordinates represent a 0 bp-length region between the two
+ flanking bases (i.e., ``affected_start`` =
+ ``affected_end``). This is analogous to Python slice
+ notation (see http://stackoverflow.com/a/2947881/38140).
+ Neither the upstream nor downstream flanking bases are
+ included in the region.
"""
def __init__(self, CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT,
sample_indexes, samples=None):
self.CHROM = CHROM
+ #: the one-based coordinate of the first nucleotide in ``REF``
self.POS = POS
self.ID = ID
self.REF = REF
@@ -137,9 +172,9 @@ class _Record(object):
self.FILTER = FILTER
self.INFO = INFO
self.FORMAT = FORMAT
- #: 0-based start coordinate
+ #: zero-based, half-open start coordinate of ``REF``
self.start = self.POS - 1
- #: 1-based end coordinate
+ #: zero-based, half-open end coordinate of ``REF``
self.end = self.start + len(self.REF)
#: list of alleles. [0] = REF, [1:] = ALTS
self.alleles = [self.REF]
@@ -148,6 +183,61 @@ class _Record(object):
self.samples = samples or []
self._sample_indexes = sample_indexes
+ # Setting affected_start and affected_end here for Sphinx
+ # autodoc purposes...
+ #: zero-based, half-open start coordinate of affected region of reference genome
+ self.affected_start = None
+ #: zero-based, half-open end coordinate of affected region of reference genome (not included in the region)
+ self.affected_end = None
+ self._set_start_and_end()
+
+
+ def _set_start_and_end(self):
+ self.affected_start = self.affected_end = self.POS
+ for alt in self.ALT:
+ if alt is None:
+ start, end = self._compute_coordinates_for_none_alt()
+ elif alt.type == 'SNV':
+ start, end = self._compute_coordinates_for_snp()
+ elif alt.type == 'MNV':
+ start, end = self._compute_coordinates_for_indel()
+ else:
+ start, end = self._compute_coordinates_for_sv()
+ self.affected_start = min(self.affected_start, start)
+ self.affected_end = max(self.affected_end, end)
+
+
+ def _compute_coordinates_for_none_alt(self):
+ start = self.POS - 1
+ end = start + len(self.REF)
+ return (start, end)
+
+
+ def _compute_coordinates_for_snp(self):
+ if len(self.REF) > 1:
+ start = self.POS
+ end = start + (len(self.REF) - 1)
+ else:
+ start = self.POS - 1
+ end = self.POS
+ return (start, end)
+
+
+ def _compute_coordinates_for_indel(self):
+ if len(self.REF) > 1:
+ start = self.POS
+ end = start + (len(self.REF) - 1)
+ else:
+ start = end = self.POS
+ return (start, end)
+
+
+ def _compute_coordinates_for_sv(self):
+ start = self.POS - 1
+ end = start + len(self.REF)
+ return (start, end)
+
+
# For Python 2
def __cmp__(self, other):
return cmp((self.CHROM, self.POS), (getattr(other, "CHROM", None), getattr(other, "POS", None)))
@@ -221,12 +311,13 @@ class _Record(object):
""" A list of allele frequencies of alternate alleles.
NOTE: Denominator calc'ed from _called_ genotypes.
"""
- num_chroms = 2.0 * self.num_called
+ num_chroms = 0.0
allele_counts = Counter()
for s in self.samples:
if s.gt_type is not None:
- allele_counts.update([s.gt_alleles[0]])
- allele_counts.update([s.gt_alleles[1]])
+ for a in s.gt_alleles:
+ allele_counts.update([a])
+ num_chroms += 1
return [allele_counts[str(i)]/num_chroms for i in range(1, len(self.ALT)+1)]
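The revised ``aaf`` computation counts chromosomes per called allele instead of assuming diploidy. A minimal standalone sketch of the same counting, assuming each call provides a ``gt_alleles``-style list of allele index strings:

```python
from collections import Counter

def alt_allele_freqs(calls, n_alt):
    """Frequency of each ALT allele among called alleles.

    `calls` is a list of allele-index-string lists (e.g. ['0', '1']).
    The denominator is the total number of called chromosomes, so
    haploid and diploid calls mix correctly.
    """
    num_chroms = 0.0
    counts = Counter()
    for alleles in calls:
        for al in alleles:
            counts[al] += 1
            num_chroms += 1
    return [counts[str(i)] / num_chroms for i in range(1, n_alt + 1)]
```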
@property
@@ -239,7 +330,7 @@ class _Record(object):
Derived from:
\"Population Genetics: A Concise Guide, 2nd ed., p.45\"
- John Gillespie.
+ John Gillespie.
"""
# skip if more than one alternate allele. assumes bi-allelic
if len(self.ALT) > 1:
@@ -285,7 +376,7 @@ class _Record(object):
for alt in self.ALT:
if alt is None or alt.type != "SNV":
return False
- if alt not in ['A', 'C', 'G', 'T']:
+ if alt not in ['A', 'C', 'G', 'T', 'N', '*']:
return False
return True
@@ -376,13 +467,14 @@ class _Record(object):
def var_subtype(self):
"""
Return the subtype of variant.
+
- For SNPs and INDELs, yield one of: [ts, tv, ins, del]
- - For SVs yield either "complex" or the SV type defined
- in the ALT fields (removing the brackets).
- E.g.:
- <DEL> -> DEL
- <INS:ME:L1> -> INS:ME:L1
- <DUP> -> DUP
+ - For SVs yield either "complex" or the SV type defined in the ALT
+ fields (removing the brackets). E.g.::
+
+ <DEL> -> DEL
+ <INS:ME:L1> -> INS:ME:L1
+ <DUP> -> DUP
The logic is meant to follow the rules outlined in the following
paragraph at:
=====================================
vcf/parser.py
=====================================
--- a/vcf/parser.py
+++ b/vcf/parser.py
@@ -1,10 +1,11 @@
+import codecs
import collections
-import re
import csv
import gzip
-import sys
import itertools
-import codecs
+import os
+import re
+import sys
try:
from collections import OrderedDict
@@ -62,12 +63,13 @@ SINGULAR_METADATA = ['fileformat', 'fileDate', 'reference']
# Conversion between value in file and Python value
field_counts = {
'.': None, # Unknown number of values
- 'A': -1, # Equal to the number of alleles in a given record
+ 'A': -1, # Equal to the number of alternate alleles in a given record
'G': -2, # Equal to the number of genotypes in a given record
+ 'R': -3, # Equal to the number of alleles including reference in a given record
}
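The ``field_counts`` table maps the spec's ``Number=`` codes to sentinel integers, while fixed counts pass through as plain ints. A small sketch mirroring the ``vcf_field_count`` lookup:

```python
# Sentinels for per-record counts, as in the table above.
field_counts = {'.': None, 'A': -1, 'G': -2, 'R': -3}

def vcf_field_count(num_str):
    """Map a header Number= value to an int, a sentinel, or None."""
    if num_str is None:
        return None                     # e.g. contig lines without a length
    if num_str not in field_counts:
        return int(num_str)             # fixed, explicit count
    return field_counts[num_str]
```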
-_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
+_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc', 'source', 'version'])
_Filter = collections.namedtuple('Filter', ['id', 'desc'])
_Alt = collections.namedtuple('Alt', ['id', 'desc'])
_Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
@@ -80,36 +82,39 @@ class _vcf_metadata_parser(object):
def __init__(self):
super(_vcf_metadata_parser, self).__init__()
self.info_pattern = re.compile(r'''\#\#INFO=<
- ID=(?P<id>[^,]+),
- Number=(?P<number>-?\d+|\.|[AG]),
- Type=(?P<type>Integer|Float|Flag|Character|String),
+ ID=(?P<id>[^,]+),\s*
+ Number=(?P<number>-?\d+|\.|[AGR]),\s*
+ Type=(?P<type>Integer|Float|Flag|Character|String),\s*
Description="(?P<desc>[^"]*)"
+ (?:,\s*Source="(?P<source>[^"]*)")?
+ (?:,\s*Version="?(?P<version>[^"]*)"?)?
>''', re.VERBOSE)
self.filter_pattern = re.compile(r'''\#\#FILTER=<
- ID=(?P<id>[^,]+),
+ ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.alt_pattern = re.compile(r'''\#\#ALT=<
- ID=(?P<id>[^,]+),
+ ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.format_pattern = re.compile(r'''\#\#FORMAT=<
- ID=(?P<id>.+),
- Number=(?P<number>-?\d+|\.|[AG]),
- Type=(?P<type>.+),
+ ID=(?P<id>.+),\s*
+ Number=(?P<number>-?\d+|\.|[AGR]),\s*
+ Type=(?P<type>.+),\s*
Description="(?P<desc>.*)"
>''', re.VERBOSE)
self.contig_pattern = re.compile(r'''\#\#contig=<
- ID=(?P<id>[^,]+),
- .*
- length=(?P<length>-?\d+)
+ ID=(?P<id>[^>,]+)
+ (,.*length=(?P<length>-?\d+))?
.*
>''', re.VERBOSE)
self.meta_pattern = re.compile(r'''##(?P<key>.+?)=(?P<val>.+)''')
def vcf_field_count(self, num_str):
"""Cast vcf header numbers to integer or None"""
- if num_str not in field_counts:
+ if num_str is None:
+ return None
+ elif num_str not in field_counts:
# Fixed, specified number
return int(num_str)
else:
@@ -125,7 +130,8 @@ class _vcf_metadata_parser(object):
num = self.vcf_field_count(match.group('number'))
info = _Info(match.group('id'), num,
- match.group('type'), match.group('desc'))
+ match.group('type'), match.group('desc'),
+ match.group('source'), match.group('version'))
return (match.group('id'), info)
@@ -164,31 +170,28 @@ class _vcf_metadata_parser(object):
match.group('type'), match.group('desc'))
return (match.group('id'), form)
-
+
def read_contig(self, contig_string):
'''Read a contig information line.'''
match = self.contig_pattern.match(contig_string)
if not match:
raise SyntaxError(
"One of the contig lines is malformed: %s" % contig_string)
-
length = self.vcf_field_count(match.group('length'))
-
contig = _Contig(match.group('id'), length)
-
return (match.group('id'), contig)
-
def read_meta_hash(self, meta_string):
- items = re.split("[<>]", meta_string)
- # Removing initial hash marks and final equal sign
- key = items[0][2:-1]
+ # assert re.match("##.+=<", meta_string)
+ items = meta_string.split('=', 1)
+ # Removing initial hash marks
+ key = items[0].lstrip('#')
# N.B., items can have quoted values, so cannot just split on comma
val = OrderedDict()
state = 0
k = ''
v = ''
- for c in items[1]:
+ for c in items[1].strip('[<>]'):
if state == 0: # reading item key
if c == '=':
@@ -219,16 +222,20 @@ class _vcf_metadata_parser(object):
def read_meta(self, meta_string):
if re.match("##.+=<", meta_string):
return self.read_meta_hash(meta_string)
- else:
- match = self.meta_pattern.match(meta_string)
- return match.group('key'), match.group('val')
+ match = self.meta_pattern.match(meta_string)
+ if not match:
+ # Spec only allows key=value, but we try to be liberal and
+ # interpret anything else as key=none (and all values are parsed
+ # as strings).
+ return meta_string.lstrip('#'), 'none'
+ return match.group('key'), match.group('val')
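The fallback added above treats any header that is not ``key=value`` as ``key=none``. A minimal sketch of that dispatch, using the same ``meta_pattern`` regex:

```python
import re

meta_pattern = re.compile(r'##(?P<key>.+?)=(?P<val>.+)')

def read_meta(line):
    """Parse a simple ## header line; liberal fallback to 'none'."""
    match = meta_pattern.match(line)
    if not match:
        # Spec only allows key=value; interpret anything else as key=none.
        return line.lstrip('#'), 'none'
    return match.group('key'), match.group('val')
```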
class Reader(object):
""" Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """
- def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False,
- strict_whitespace=False):
+ def __init__(self, fsock=None, filename=None, compressed=None, prepend_chr=False,
+ strict_whitespace=False, encoding='ascii'):
""" Create a new Reader for a VCF file.
You must specify either fsock (stream) or filename. Gzipped streams
@@ -250,21 +257,26 @@ class Reader(object):
self._reader = fsock
if filename is None and hasattr(fsock, 'name'):
filename = fsock.name
- compressed = compressed or filename.endswith('.gz')
+ if compressed is None:
+ compressed = filename.endswith('.gz')
elif filename:
- compressed = compressed or filename.endswith('.gz')
+ if compressed is None:
+ compressed = filename.endswith('.gz')
self._reader = open(filename, 'rb' if compressed else 'rt')
self.filename = filename
if compressed:
self._reader = gzip.GzipFile(fileobj=self._reader)
if sys.version > '3':
- self._reader = codecs.getreader('ascii')(self._reader)
+ self._reader = codecs.getreader(encoding)(self._reader)
if strict_whitespace:
self._separator = '\t'
else:
self._separator = '\t| +'
+ self._row_pattern = re.compile(self._separator)
+ self._alt_pattern = re.compile('[\[\]]')
+
self.reader = (line.strip() for line in self._reader if line.strip())
#: metadata fields from header (string or hash, depending)
@@ -287,6 +299,7 @@ class Reader(object):
self._prepend_chr = prepend_chr
self._parse_metainfo()
self._format_cache = {}
+ self.encoding = encoding
def __iter__(self):
return self
@@ -301,7 +314,7 @@ class Reader(object):
parser = _vcf_metadata_parser()
- line = self.reader.next()
+ line = next(self.reader)
while line.startswith('##'):
self._header_lines.append(line)
@@ -320,7 +333,7 @@ class Reader(object):
elif line.startswith('##FORMAT'):
key, val = parser.read_format(line)
self.formats[key] = val
-
+
elif line.startswith('##contig'):
key, val = parser.read_contig(line)
self.contigs[key] = val
@@ -334,9 +347,9 @@ class Reader(object):
self.metadata[key] = []
self.metadata[key].append(val)
- line = self.reader.next()
+ line = next(self.reader)
- fields = re.split(self._separator, line[1:])
+ fields = self._row_pattern.split(line[1:])
self._column_headers = fields[:9]
self.samples = fields[9:]
self._sample_indexes = dict([(x,i) for (i,x) in enumerate(self.samples)])
@@ -358,7 +371,7 @@ class Reader(object):
retdict = {}
for entry in entries:
- entry = entry.split('=')
+ entry = entry.split('=', 1)
ID = entry[0]
try:
entry_type = self.infos[ID].type
@@ -389,6 +402,7 @@ class Reader(object):
vals = entry[1].split(',') # commas are reserved characters indicating multiple values
val = self._map(str, vals)
except IndexError:
+ entry_type = 'Flag'
val = True
try:
@@ -430,7 +444,6 @@ class Reader(object):
# check whether we already know how to parse this format
if samp_fmt not in self._format_cache:
self._format_cache[samp_fmt] = self._parse_sample_format(samp_fmt)
-
samp_fmt = self._format_cache[samp_fmt]
if cparse:
@@ -450,7 +463,10 @@ class Reader(object):
for i, vals in enumerate(sample.split(':')):
# short circuit the most common
- if vals == '.' or vals == './.':
+ if samp_fmt._fields[i] == 'GT':
+ sampdat[i] = vals
+ continue
+ elif not vals or vals == ".":
sampdat[i] = None
continue
@@ -494,9 +510,9 @@ class Reader(object):
return samp_data
def _parse_alt(self, str):
- if re.search('[\[\]]', str) is not None:
+ if self._alt_pattern.search(str) is not None:
# Paired breakend
- items = re.split('[\[\]]', str)
+ items = self._alt_pattern.split(str)
remoteCoords = items[1].split(':')
chr = remoteCoords[0]
if chr[0] == '<':
@@ -523,8 +539,8 @@ class Reader(object):
def next(self):
'''Return the next record in the file.'''
- line = self.reader.next()
- row = re.split(self._separator, line.rstrip())
+ line = next(self.reader)
+ row = self._row_pattern.split(line.rstrip())
chrom = row[0]
if self._prepend_chr:
chrom = 'chr' + chrom
@@ -559,6 +575,9 @@ class Reader(object):
fmt = row[8]
except IndexError:
fmt = None
+ else:
+ if fmt == '.':
+ fmt = None
record = _Record(chrom, pos, ID, ref, alt, qual, filt,
info, fmt, self._sample_indexes)
@@ -569,45 +588,60 @@ class Reader(object):
return record
- def fetch(self, chrom, start, end=None):
- """ fetch records from a Tabix indexed VCF, requires pysam
- if start and end are specified, return iterator over positions
- if end not specified, return individual ``_Call`` at start or None
+ def fetch(self, chrom, start=None, end=None):
+ """ Fetches records from a tabix-indexed VCF file and returns an
+ iterable of ``_Record`` instances
+
+ chrom must be specified.
+
+ The start and end coordinates are in the zero-based,
+ half-open coordinate system, similar to ``_Record.start`` and
+ ``_Record.end``. The very first base of a chromosome is
+ index 0, and the region includes bases up to, but not
+ including the base at the end coordinate. For example
+ ``fetch('4', 10, 20)`` would include all variants
+ overlapping a 10 base pair region from the 11th base
+ through the 20th base (which is at index 19) of chromosome
+ 4. It would not include the 21st base (at index 20). See
+ http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
+ for more information on the zero-based, half-open coordinate
+ system.
+
+ If end is omitted, all variants from start until the end of
+ the chromosome chrom will be included.
+
+ If start and end are omitted, all variants on chrom will be
+ returned.
+
+ requires pysam
+
"""
if not pysam:
raise Exception('pysam not available, try "pip install pysam"?')
-
if not self.filename:
raise Exception('Please provide a filename (or a "normal" fsock)')
if not self._tabix:
- self._tabix = pysam.Tabixfile(self.filename)
+ self._tabix = pysam.Tabixfile(self.filename,
+ encoding=self.encoding)
if self._prepend_chr and chrom[:3] == 'chr':
chrom = chrom[3:]
- # not sure why tabix needs position -1
- start = start - 1
-
- if end is None:
- self.reader = self._tabix.fetch(chrom, start, start + 1)
- try:
- return self.next()
- except StopIteration:
- return None
-
self.reader = self._tabix.fetch(chrom, start, end)
return self
class Writer(object):
- """ VCF Writer """
+ """VCF Writer. On Windows Python 2, open stream with 'wb'."""
# Reverse keys and values in header field count dictionary
counts = dict((v,k) for k,v in field_counts.iteritems())
def __init__(self, stream, template, lineterminator="\n"):
- self.writer = csv.writer(stream, delimiter="\t", lineterminator=lineterminator)
+ self.writer = csv.writer(stream, delimiter="\t",
+ lineterminator=lineterminator,
+ quotechar='', quoting=csv.QUOTE_NONE)
self.template = template
self.stream = stream
@@ -639,7 +673,10 @@ class Writer(object):
for line in template.alts.itervalues():
stream.write(two.format(key="ALT", *line))
for line in template.contigs.itervalues():
- stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+ if line.length:
+ stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+ else:
+ stream.write('##contig=<ID={0}>\n'.format(*line))
self._write_header()
@@ -699,28 +736,14 @@ class Writer(object):
sorted(info, key=order_key))
def _format_sample(self, fmt, sample):
- try:
- # Try to get the GT value first.
- gt = getattr(sample.data, 'GT')
- # PyVCF stores './.' GT values as None, so we need to revert it back
- # to './.' when writing.
- if gt is None:
- gt = './.'
- except AttributeError:
- # Failing that, try to check whether 'GT' is specified in the FORMAT
- # field. If yes, use the recommended empty value ('./.')
- if 'GT' in fmt:
- gt = './.'
- # Otherwise use an empty string as the value
- else:
- gt = ''
- # If gt is an empty string (i.e. not stored), write all other data
+ if hasattr(sample.data, 'GT'):
+ gt = sample.data.GT
+ else:
+ gt = './.' if 'GT' in fmt else ''
+
if not gt:
return ':'.join([self._stringify(x) for x in sample.data])
- # Otherwise use the GT values from above and combine it with the rest of
- # the data.
- # Note that this follows the VCF spec, where GT is always the first
- # item whenever it is present.
+ # Following the VCF spec, GT is always the first item whenever it is present.
else:
return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])
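The simplified ``_format_sample`` keeps GT first whenever it is present. A standalone sketch of the same serialization (hypothetical helper; assumes ``data`` is an ordered sequence with GT, if any, at index 0):

```python
def format_sample(fmt, gt, data):
    """Join sample fields with ':'; GT, when present, stays first."""
    stringify = lambda x: str(x) if x is not None else '.'
    if gt is None:
        # Missing GT: use './.' if FORMAT declares GT, else omit it entirely.
        gt = './.' if 'GT' in fmt else ''
    if not gt:
        return ':'.join(stringify(x) for x in data)
    return ':'.join([gt] + [stringify(x) for x in data[1:]])
```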
=====================================
vcf/sample_filter.py
=====================================
--- /dev/null
+++ b/vcf/sample_filter.py
@@ -0,0 +1,115 @@
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import logging
+import sys
+import warnings
+
+
+from parser import Reader, Writer
+
+
+class SampleFilter(object):
+ """
+ Modifies the vcf Reader to filter each row by sample as it is parsed.
+
+ """
+
+ def __init__(self, infile, outfile=None, filters=None, invert=False):
+ # Methods to add to Reader
+ def get_filter(self):
+ return self._samp_filter
+
+ def set_filter(self, filt):
+ self._samp_filter = filt
+ if filt:
+ self.samples = [val for idx, val in enumerate(self.samples)
+ if idx not in set(filt)]
+
+ def filter_samples(fn):
+ """Decorator function to filter sample parameter"""
+ def filt(self, samples, *args):
+ samples = [val for idx, val in enumerate(samples)
+ if idx not in set(self.sample_filter)]
+ return fn(self, samples, *args)
+ return filt
+
+ # Add property to Reader for filter list
+ Reader.sample_filter = property(get_filter, set_filter)
+ Reader._samp_filter = []
+ # Modify Reader._parse_samples to filter samples
+ self._orig_parse_samples = Reader._parse_samples
+ Reader._parse_samples = filter_samples(Reader._parse_samples)
+ self.parser = Reader(filename=infile)
+ # Store initial samples and indices
+ self.samples = self.parser.samples
+ self.smp_idx = dict([(v, k) for k, v in enumerate(self.samples)])
+ # Properties for filter/writer
+ self.outfile = outfile
+ self.invert = invert
+ self.filters = filters
+ if filters is not None:
+ self.set_filters()
+ self.write()
+
+ def __del__(self):
+ try:
+ self._undo_monkey_patch()
+ except AttributeError:
+ pass
+
+ def set_filters(self, filters=None, invert=False):
+ """Convert filters from string to list of indices, set on Reader"""
+ if filters is not None:
+ self.filters = filters
+ if invert:
+ self.invert = invert
+ filt_l = self.filters.split(",")
+ filt_s = set(filt_l)
+ if len(filt_s) < len(filt_l):
+ warnings.warn("Non-unique filters, ignoring", RuntimeWarning)
+
+ def filt2idx(item):
+ """Convert filter to valid sample index"""
+ try:
+ item = int(item)
+ except ValueError:
+ # not an idx, check if it's a value
+ return self.smp_idx.get(item)
+ else:
+ # is int, check if it's an idx
+ if item < len(self.samples):
+ return item
+ filters = set(filter(lambda x: x is not None, map(filt2idx, filt_s)))
+ if len(filters) < len(filt_s):
+ # TODO print the filters that were ignored
+ warnings.warn("Invalid filters, ignoring", RuntimeWarning)
+
+ if self.invert:
+ filters = set(xrange(len(self.samples))).difference(filters)
+
+ # `sample_filter` setter updates `samples`
+ self.parser.sample_filter = filters
+ if len(self.parser.samples) == 0:
+ warnings.warn("Number of samples to keep is zero", RuntimeWarning)
+ logging.info("Keeping these samples: {0}\n".format(self.parser.samples))
+ return self.parser.samples
+
+ def write(self, outfile=None):
+ if outfile is not None:
+ self.outfile = outfile
+ if self.outfile is None:
+ _out = sys.stdout
+ elif hasattr(self.outfile, 'write'):
+ _out = self.outfile
+ else:
+ _out = open(self.outfile, "wb")
+ logging.info("Writing to '{0}'\n".format(self.outfile))
+ writer = Writer(_out, self.parser)
+ for row in self.parser:
+ writer.write_record(row)
+
+ def _undo_monkey_patch(self):
+ Reader._parse_samples = self._orig_parse_samples
+ delattr(Reader, 'sample_filter')
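The core of ``SampleFilter`` is dropping columns whose index is in a filter set, optionally inverted. A minimal sketch of that index filter, independent of the Reader monkey-patching:

```python
def keep_samples(samples, filt, invert=False):
    """Return samples whose index is not in `filt` (or only those, if inverted)."""
    filt = set(filt)
    if invert:
        filt = set(range(len(samples))) - filt
    return [s for i, s in enumerate(samples) if i not in filt]
```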
=====================================
vcf/test/FT.vcf
=====================================
--- /dev/null
+++ b/vcf/test/FT.vcf
@@ -0,0 +1,50 @@
+##fileformat=VCFv4.2
+##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
+##FILTER=<ID=DP125,Description="DP<125">
+##FILTER=<ID=DP130,Description="DP<130">
+##FILTER=<ID=LowQual,Description="Low quality">
+##FILTER=<ID=Q4800,Description="QUAL<4800.0">
+##FILTER=<ID=Q5000,Description="QUAL<5000.0">
+##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
+##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
+##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.6-0-g89b7209,Date="Wed Jul 27 09:50:44 CEST 2016",Epoch=1469605844963,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../ref.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=10.vcf) mask=(RodBinding name= source=UNBOUND) out=/home/redmar/tmp/example/snps/10_filt.vcf filterExpression=[QUAL<5000.0, QUAL<4800.0] filterName=[Q5000, Q4800] genotypeFilterExpression=[DP<130, DP<125] genotypeFilterName=[DP130, DP125] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invertFilterExpression=false invertGenotypeFilterExpression=false setFilteredGtToNocall=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
+##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
+##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+##contig=<ID=ref,length=4888768>
+##reference=file://../ref.fasta
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5
+ref 63393 . C A 29454.60 . AC=5;AF=1.00;AN=5;DP=719;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.67;SOR=0.965 GT:AD:DP:GQ:PL 1:0,166:166:99:6740,0 1:0,142:142:99:5824,0 1:0,134:134:99:5616,0 1:0,122:122:99:4930,0 1:0,155:155:99:6371,0
+ref 65903 . AATTGCGCTG A 7340.57 PASS AC=1;AF=0.200;AN=5;DP=524;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=34.04;SOR=1.091 GT:AD:DP:FT:GQ:PL 1:0,164:164:PASS:99:7383,0 0:95,0:95:DP125;DP130:99:0,1800 0:88,0:88:DP125;DP130:99:0,1800 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
+ref 70837 . C A 4711.61 Q4800;Q5000 AC=1;AF=0.200;AN=5;DP=512;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=27.64;SOR=0.726 GT:AD:DP:FT:GQ:PL 0:121,0:121:DP125;DP130:99:0,1800 0:95,0:95:DP125;DP130:99:0,1800 1:0,120:120:DP125;DP130:99:4745,0 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
+ref 71448 . C T 31134.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.22;ClippingRankSum=0.00;DP=768;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.43;ReadPosRankSum=2.03;SOR=0.295 GT:AD:DP:FT:GQ:PL 1:0,147:147:PASS:99:5996,0 1:1,183:184:PASS:99:7501,0 1:0,113:113:DP125;DP130:99:4559,0 1:0,161:161:PASS:99:6436,0 1:0,163:163:PASS:99:6669,0
+ref 104257 . C T 5521.61 PASS AC=1;AF=0.200;AN=5;DP=506;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=29.45;SOR=0.854 GT:AD:DP:FT:GQ:PL 0:101,0:101:DP125;DP130:99:0,1800 0:109,0:109:DP125;DP130:99:0,1800 1:0,132:132:PASS:99:5555,0 0:67,0:67:DP125;DP130:99:0,1800 0:97,0:97:DP125;DP130:99:0,1800
+ref 140658 . C A 32467.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.24;ClippingRankSum=0.00;DP=801;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.65;ReadPosRankSum=1.27;SOR=0.346 GT:AD:DP:GQ:PL 1:0,170:170:99:6854,0 1:0,198:198:99:8098,0 1:0,136:136:99:5554,0 1:0,141:141:99:5661,0 1:1,155:156:99:6327,0
+ref 147463 . C A 4885.61 Q5000 AC=1;AF=0.200;AN=5;BaseQRankSum=-7.720e-01;ClippingRankSum=0.00;DP=503;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;MQRankSum=0.00;QD=35.03;ReadPosRankSum=-6.950e-01;SOR=0.278 GT:AD:DP:FT:GQ:PL 0:97,0:97:DP125;DP130:99:0,1800 0:104,0:104:DP125;DP130:99:0,1800 0:84,0:84:DP125;DP130:99:0,1800 1:1,128:129:DP130:99:4919,0 0:89,0:89:DP125;DP130:99:0,1800
+ref 154578 . A G 32015.60 PASS AC=5;AF=1.00;AN=5;DP=776;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=0.902 GT:AD:DP:GQ:PL 1:0,152:152:99:6300,0 1:0,183:183:99:7608,0 1:0,137:137:99:5713,0 1:0,148:148:99:6040,0 1:0,156:156:99:6381,0
+ref 203200 . C T 30880.60 PASS AC=5;AF=1.00;AN=5;DP=752;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.65;SOR=0.878 GT:AD:DP:FT:GQ:PL 1:0,161:161:PASS:99:6708,0 1:0,185:185:PASS:99:7602,0 1:0,136:136:PASS:99:5602,0 1:0,126:126:DP130:99:5080,0 1:0,144:144:PASS:99:5915,0
+ref 231665 . C T 30074.60 PASS AC=5;AF=1.00;AN=5;DP=735;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=33.23;SOR=0.938 GT:AD:DP:FT:GQ:PL 1:0,191:191:PASS:99:7867,0 1:0,159:159:PASS:99:6431,0 1:0,130:130:PASS:99:5299,0 1:0,129:129:DP130:99:5290,0 1:0,126:126:DP130:99:5214,0
=====================================
vcf/test/issue-214.vcf
=====================================
--- /dev/null
+++ b/vcf/test/issue-214.vcf
@@ -0,0 +1,32 @@
+##fileformat=VCFv4.1
+##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
+##FILTER=<ID=LowQual,Description="Low quality">
+##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
+##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
+##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
+1 456904 . T C,* 6162.77 . AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800
+1 456940 . * C,T 6162.77 . AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800
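The issue-214 fixture above exercises spanning-deletion (`*`) ALT alleles, including one record whose REF itself is `*`. As a minimal sketch of the classification behaviour the new `TestIssue214` cases assert (a hypothetical helper, not PyVCF's actual implementation):

```python
def is_snp(ref, alts):
    """Sketch of the SNP test the fixture records exercise: a record
    counts as a SNP when REF is one base and every non-spanning-deletion
    ALT is one base; the '*' allele is ignored for classification.
    Hypothetical helper mirroring what TestIssue214 asserts."""
    alleles = [a for a in alts if a != '*']
    return len(ref) == 1 and all(len(a) == 1 for a in alleles)

# The two fixture records: REF=T ALT=C,* and REF=* ALT=C,T
print(is_snp('T', ['C', '*']))  # True
print(is_snp('*', ['C', 'T']))  # True
```

Both records classify as `snp` under this reading, which is exactly what the four `test_issue_214_*` cases in the test diff below check.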
=====================================
vcf/test/strelka.vcf
=====================================
--- /dev/null
+++ b/vcf/test/strelka.vcf
@@ -0,0 +1,57 @@
+##fileformat=VCFv4.1
+##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
+##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
+##FILTER=<ID=QSS_ref,Description="Normal sample is not homozygous ref or ssnv Q-score < 15, ie calls with NT!=ref or QSS_NT < 15">
+##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
+##FILTER=<ID=SpanDel,Description="Fraction of reads crossing site with spanning deletions in either sample exceeeds 0.75">
+##FILTER=<ID=iHpol,Description="Indel overlaps an interrupted homopolymer longer than 14x in the reference sequence">
+##FORMAT=<ID=AU,Number=2,Type=Integer,Description="Number of 'A' alleles used in tiers 1,2">
+##FORMAT=<ID=CU,Number=2,Type=Integer,Description="Number of 'C' alleles used in tiers 1,2">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
+##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
+##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
+##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Number of basecalls filtered from original read depth for tier1">
+##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
+##FORMAT=<ID=GU,Number=2,Type=Integer,Description="Number of 'G' alleles used in tiers 1,2">
+##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Number of reads with deletions spanning this site at tier1">
+##FORMAT=<ID=SUBDP,Number=1,Type=Integer,Description="Number of reads below tier1 mapping quality threshold aligned across this site">
+##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
+##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
+##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
+##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
+##FORMAT=<ID=TU,Number=2,Type=Integer,Description="Number of 'T' alleles used in tiers 1,2">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
+##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interrupted homopolymer length intersecting with the indel">
+##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
+##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
+##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
+##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
+##INFO=<ID=QSS,Number=1,Type=Integer,Description="Quality score for any somatic snv, ie. for the ALT allele to be present at a significantly different frequency in the tumor and normal">
+##INFO=<ID=QSS_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
+##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
+##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
+##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
+##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
+##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
+##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
+##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
+##INFO=<ID=TQSS,Number=1,Type=Integer,Description="Data tier used to compute QSS">
+##INFO=<ID=TQSS_NT,Number=1,Type=Integer,Description="Data tier used to compute QSS_NT">
+##INFO=<ID=set,Number=1,Type=String,Description="Source VCF for the merged record in CombineVariants">
+##content=strelka somatic indel calls
+##fileDate=20151113
+##germlineIndelTheta=0.0001
+##germlineSnvTheta=0.001
+##priorSomaticIndelRate=1e-06
+##priorSomaticSnvRate=1e-06
+##reference=file:///b37.fasta
+##source=strelka
+##source_version=2.0.17.strelka1
+##startTime=Fri Nov 13 19:38:43 2015
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL.variant NORMAL.variant2 TUMOR.variant TUMOR.variant2
+1 1666175 . C T . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=28;QSS_NT=28;SGT=CC->CT;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:42,42:43:0:0,0:0:0:1,1 0,0:45,45:59:0:0,0:0:0:14,14
+1 3750492 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=38;QSS_NT=38;SGT=GG->AG;SOMATIC;TQSS=2;TQSS_NT=2;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:116:0:116,116:0:0:0,0 6,6:0,0:96:0:90,91:0:0:0,0
+1 9117626 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=32;QSS_NT=32;SGT=GG->AG;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:165:0:165,166:0:0:0,0 6,6:0,0:132:0:126,127:0:0:0,0
\ No newline at end of file
=====================================
vcf/test/test_vcf.py
=====================================
--- a/vcf/test/test_vcf.py
+++ b/vcf/test/test_vcf.py
@@ -1,13 +1,27 @@
from __future__ import print_function
import unittest
+try:
+ unittest.skip
+except AttributeError:
+ import unittest2 as unittest
import doctest
import os
import commands
import cPickle
from StringIO import StringIO
+import subprocess
+import sys
+
+try:
+ import pysam
+except ImportError:
+ pysam = None
import vcf
-from vcf import utils
+from vcf import model, utils
+
+IS_PYTHON2 = sys.version_info[0] == 2
+IS_NOT_PYPY = 'PyPy' not in sys.version
suite = doctest.DocTestSuite(vcf)
@@ -100,6 +114,37 @@ class TestVcfSpecs(unittest.TestCase):
print(c)
assert c
+ def test_vcf_4_2(self):
+ reader = vcf.Reader(fh('example-4.2.vcf'))
+ self.assertEqual(reader.metadata['fileformat'], 'VCFv4.2')
+
+ # If INFO contains no Source and Version keys, they should be None.
+ self.assertEqual(reader.infos['DP'].source, None)
+ self.assertEqual(reader.infos['DP'].version, None)
+
+ # According to spec, INFO Version key is required to be double quoted,
+ # but at least SAMtools 1.0 does not quote it. So we want to be
+ # forgiving here.
+ self.assertEqual(reader.infos['VDB'].source, None)
+ self.assertEqual(reader.infos['VDB'].version, '3')
+
+ # test we can walk the file at least
+ for r in reader:
+ for c in r:
+ assert c
+
+ def test_contig_idonly(self):
+ """Test VCF inputs with ##contig inputs containing only IDs. produced by bcftools 1.2+
+ """
+ reader = vcf.Reader(fh("contig_idonly.vcf"))
+ for cid, contig in reader.contigs.items():
+ if cid == "1":
+ assert contig.length is None
+ elif cid == "2":
+ assert contig.length == 2000
+ elif cid == "3":
+ assert contig.length == 3000
+
class TestGatkOutput(unittest.TestCase):
filename = 'gatk.vcf'
@@ -184,6 +229,34 @@ class TestBcfToolsOutput(unittest.TestCase):
for s in r.samples:
s.phased
+class TestIssue214(unittest.TestCase):
+ """ See https://github.com/jamescasbon/PyVCF/issues/214 """
+
+ def test_issue_214_is_snp(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ r=next(reader)
+ self.assertTrue(r.is_snp)
+
+ def test_issue_214_var_type(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ r=next(reader)
+ self.assertEqual(r.var_type,'snp')
+
+ # Can the ref even be a spanning deletion?
+ # Note, this does not trigger issue 214, but I've added it here for completeness
+ def test_issue_214_ref_is_del_is_snp(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ next(reader)
+ r=next(reader)
+ self.assertTrue(r.is_snp)
+
+ # Can the ref even be a spanning deletion?
+ # Note, this does not trigger issue 214, but I've added it here for completeness
+ def test_issue_214_ref_is_del_var_type(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ next(reader)
+ r=next(reader)
+ self.assertEqual(r.var_type,'snp')
class Test1kg(unittest.TestCase):
@@ -248,6 +321,16 @@ class TestGoNL(unittest.TestCase):
self.assertEqual(reader.contigs['1'].length, 249250621)
+class TestStringAsFlag(unittest.TestCase):
+
+ def test_string_as_flag(self):
+ """A flag INFO field is declared as string (not allowed by the spec,
+ but seen in practice)."""
+ reader = vcf.Reader(fh('string_as_flag.vcf', 'r'))
+ for _ in reader:
+ pass
+
+
class TestInfoOrder(unittest.TestCase):
def _assert_order(self, definitions, fields):
@@ -310,6 +393,38 @@ class TestInfoTypeCharacter(unittest.TestCase):
self.assertEquals(l.INFO, r.INFO)
+class TestParseMetaLine(unittest.TestCase):
+ def test_parse(self):
+ reader = vcf.Reader(fh('parse-meta-line.vcf'))
+ f = reader.metadata['MYFIELD'][0]
+ self.assertEqual(f['ID'], 'SomeField')
+ self.assertEqual(f['Version'], '3.4-0-g7e26428')
+ self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"')
+ self.assertEqual(f['Options'], '"< 4 and > 3"')
+ next(reader)
+
+ def test_write(self):
+ reader = vcf.Reader(fh('parse-meta-line.vcf'))
+ out = StringIO()
+ writer = vcf.Writer(out, reader)
+
+ records = list(reader)
+
+ for record in records:
+ writer.write_record(record)
+ out.seek(0)
+ reader2 = vcf.Reader(out)
+
+ f = reader2.metadata['MYFIELD'][0]
+ self.assertEqual(f['ID'], 'SomeField')
+ self.assertEqual(f['Version'], '3.4-0-g7e26428')
+ self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"')
+ self.assertEqual(f['Options'], '"< 4 and > 3"')
+
+ for l, r in zip(records, reader2):
+ self.assertEquals(l.INFO, r.INFO)
+
+
class TestGatkOutputWriter(unittest.TestCase):
def testWrite(self):
@@ -402,6 +517,27 @@ class TestSamplesSpace(unittest.TestCase):
self.assertEqual(self.reader.samples, self.samples)
+class TestMetadataWhitespace(unittest.TestCase):
+ filename = 'metadata-whitespace.vcf'
+ def test_metadata_whitespace(self):
+ """
+ Test parsing metadata header lines with whitespace.
+ """
+ self.reader = vcf.Reader(fh(self.filename))
+
+ # Pick one INFO line and assert that we parsed it correctly.
+ info_indel = self.reader.infos['INDEL']
+ assert info_indel.id == 'INDEL'
+ assert info_indel.num == 0
+ assert info_indel.type == 'Flag'
+ assert info_indel.desc == 'Indicates that the variant is an INDEL.'
+
+ # Test we can walk the file at least.
+ for r in self.reader:
+ for c in r:
+ pass
+
+
class TestMixedFiltering(unittest.TestCase):
filename = 'mixed-filtering.vcf'
def test_mixed_filtering(self):
@@ -426,7 +562,7 @@ class TestRecord(unittest.TestCase):
self.assertEqual(len(var.samples), num_calls)
def test_dunder_eq(self):
- rec = vcf.Reader(fh('example-4.0.vcf')).next()
+ rec = next(vcf.Reader(fh('example-4.0.vcf')))
self.assertFalse(rec == None)
self.assertFalse(None == rec)
@@ -459,6 +595,13 @@ class TestRecord(unittest.TestCase):
self.assertEqual([0.0/6.0], aaf)
elif var.POS == 1234567:
self.assertEqual([2.0/4.0, 1.0/4.0], aaf)
+ reader = vcf.Reader(fh('example-4.1-ploidy.vcf'))
+ for var in reader:
+ aaf = var.aaf
+ if var.POS == 60034:
+ self.assertEqual([4.0/6.0], aaf)
+ elif var.POS == 60387:
+ self.assertEqual([1.0/3.0], aaf)
def test_pi(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
@@ -510,6 +653,24 @@ class TestRecord(unittest.TestCase):
elif var.POS == 1234567:
self.assertEqual(False, is_snp)
+
+ def test_is_snp_for_n_alt(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id1',
+ 'C',
+ [model._Substitution('N')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assertTrue(record.is_snp)
+
+
def test_is_indel(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
for var in reader:
@@ -731,7 +892,7 @@ class TestRecord(unittest.TestCase):
def test_info_multiple_values(self):
reader = vcf.Reader(fh('example-4.1-info-multiple-values.vcf'))
- var = reader.next()
+ var = next(reader)
# check Float type INFO field with multiple values
expected = [19.3, 47.4, 14.0]
actual = var.INFO['RepeatCopies']
@@ -751,11 +912,244 @@ class TestRecord(unittest.TestCase):
self.assertEqual(cPickle.loads(cPickle.dumps(var)), var)
+ def assert_has_expected_coordinates(
+ self,
+ record,
+ expected_coordinates,
+ expected_affected_coordinates
+ ):
+ self.assertEqual(
+ (record.start, record.end),
+ expected_coordinates
+ )
+ self.assertEqual(
+ (record.affected_start, record.affected_end),
+ expected_affected_coordinates
+ )
+
+
+ def test_coordinates_for_snp(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id1',
+ 'C',
+ [model._Substitution('A')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_insertion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id2',
+ 'C',
+ [model._Substitution('CTA')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (10, 10))
+
+
+ def test_coordinates_for_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id3',
+ 'CTA',
+ [model._Substitution('C')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+
+
+ def test_coordinates_for_None_alt(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id4',
+ 'C',
+ [None],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_multiple_snps(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id5',
+ 'C',
+ [
+ model._Substitution('A'),
+ model._Substitution('G'),
+ model._Substitution('T')
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_insert_and_snp(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id6',
+ 'C',
+ [
+ model._Substitution('GTA'),
+ model._Substitution('G'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+ record = model._Record(
+ '1',
+ 10,
+ 'id7',
+ 'C',
+ [
+ model._Substitution('G'),
+ model._Substitution('GTA'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_snp_and_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id8',
+ 'CTA',
+ [
+ model._Substitution('C'),
+ model._Substitution('CTG'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+ record = model._Record(
+ '1',
+ 10,
+ 'id9',
+ 'CTA',
+ [
+ model._Substitution('CTG'),
+ model._Substitution('C'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+
+
+ def test_coordinates_for_insertion_and_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id10',
+ 'CT',
+ [
+ model._Substitution('CA'),
+ model._Substitution('CTT'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 11), (10, 11))
+ record = model._Record(
+ '1',
+ 10,
+ 'id11',
+ 'CT',
+ [
+ model._Substitution('CTT'),
+ model._Substitution('CA'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 11), (10, 11))
+
+
+ def test_coordinates_for_breakend(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id12',
+ 'CTA',
+ [model._Breakend('1', 500, False, True, 'GGTC', True)],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (9, 12))
+
+
class TestCall(unittest.TestCase):
def test_dunder_eq(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
- var = reader.next()
+ var = next(reader)
example_call = var.samples[0]
self.assertFalse(example_call == None)
self.assertFalse(None == example_call)
@@ -808,40 +1202,73 @@ class TestCall(unittest.TestCase):
self.assertEqual([None,1,2], gt_types)
-class TestTabix(unittest.TestCase):
+@unittest.skipUnless(pysam, "test requires installation of PySAM.")
+class TestFetch(unittest.TestCase):
def setUp(self):
self.reader = vcf.Reader(fh('tb.vcf.gz', 'rb'))
- self.run = vcf.parser.pysam is not None
+
+ def assertFetchedExpectedPositions(
+ self, fetched_variants, expected_positions):
+ fetched_positions = [var.POS for var in fetched_variants]
+ self.assertEqual(fetched_positions, expected_positions)
+
+
+ def testNoVariantsInRange(self):
+ fetched_variants = self.reader.fetch('20', 14370, 17329)
+ self.assertFetchedExpectedPositions(fetched_variants, [])
+
+
+ def testNoVariantsForZeroLengthInterval(self):
+ fetched_variants = self.reader.fetch('20', 14369, 14369)
+ self.assertFetchedExpectedPositions(fetched_variants, [])
def testFetchRange(self):
- if not self.run:
- return
- lines = list(self.reader.fetch('20', 14370, 14370))
- self.assertEquals(len(lines), 1)
- self.assertEqual(lines[0].POS, 14370)
+ fetched_variants = self.reader.fetch('20', 14369, 14370)
+ self.assertFetchedExpectedPositions(fetched_variants, [14370])
+
+ fetched_variants = self.reader.fetch('20', 14369, 17330)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [14370, 17330])
- lines = list(self.reader.fetch('20', 14370, 17330))
- self.assertEquals(len(lines), 2)
- self.assertEqual(lines[0].POS, 14370)
- self.assertEqual(lines[1].POS, 17330)
+ fetched_variants = self.reader.fetch('20', 1110695, 1234567)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [1110696, 1230237, 1234567])
- lines = list(self.reader.fetch('20', 1110695, 1234567))
- self.assertEquals(len(lines), 3)
+ def testFetchesFromStartIfStartOnlySpecified(self):
+ fetched_variants = self.reader.fetch('20', 1110695)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [1110696, 1230237, 1234567])
- def testFetchSite(self):
- if not self.run:
- return
- site = self.reader.fetch('20', 14370)
- self.assertEqual(site.POS, 14370)
- site = self.reader.fetch('20', 14369)
- assert site is None
+ def testFetchesAllFromChromIfOnlyChromSpecified(self):
+ fetched_variants = self.reader.fetch('20')
+ self.assertFetchedExpectedPositions(
+ fetched_variants,
+ [14370, 17330, 1110696, 1230237, 1234567]
+ )
+@unittest.skipUnless(pysam, "test requires installation of PySAM.")
+class TestIssue201(unittest.TestCase):
+ def setUp(self):
+ # This file contains some non-ASCII characters in a UTF-8 encoding.
+ # https://github.com/jamescasbon/PyVCF/issues/201
+ self.reader = vcf.Reader(fh('issue-201.vcf.gz', 'rb'),
+ encoding='utf-8')
+
+ def testIterate(self):
+ for record in self.reader:
+ # Should not raise decoding errors.
+ pass
+
+ def testFetch(self):
+ for record in self.reader.fetch(chrom='17'):
+ # Should not raise decoding errors.
+ pass
class TestOpenMethods(unittest.TestCase):
@@ -870,12 +1297,61 @@ class TestOpenMethods(unittest.TestCase):
self.assertEqual(self.samples, r.samples)
+class TestSampleFilter(unittest.TestCase):
+ @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3")
+ def testCLIListSamples(self):
+ proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ out, err = proc.communicate()
+ self.assertEqual(proc.returncode, 0)
+ self.assertFalse(err)
+ expected_out = ['Samples:', '0: NA00001', '1: NA00002', '2: NA00003']
+ self.assertEqual(out.splitlines(), expected_out)
+
+ @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3")
+ def testCLIWithFilter(self):
+ proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf -f 1,2 --quiet', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ out, err = proc.communicate()
+ self.assertEqual(proc.returncode, 0)
+ self.assertTrue(out)
+ self.assertFalse(err)
+ buf = StringIO()
+ buf.write(out)
+ buf.seek(0)
+ #print(buf.getvalue())
+ reader = vcf.Reader(buf)
+ self.assertEqual(reader.samples, ['NA00001'])
+ rec = next(reader)
+ self.assertEqual(len(rec.samples), 1)
+
+ @unittest.skipUnless(IS_NOT_PYPY, "test broken for PyPy")
+ def testSampleFilterModule(self):
+ # init filter with filename, get list of samples
+ filt = vcf.SampleFilter('vcf/test/example-4.1.vcf')
+ self.assertEqual(filt.samples, ['NA00001', 'NA00002', 'NA00003'])
+ # set filter, check which samples will be kept
+ filtered = filt.set_filters(filters="0", invert=True)
+ self.assertEqual(filtered, ['NA00001'])
+ # write filtered file to StringIO
+ buf = StringIO()
+ filt.write(buf)
+ buf.seek(0)
+ #print(buf.getvalue())
+ # undo monkey patch by destroying instance
+ del filt
+ self.assertTrue('sample_filter' not in dir(vcf.Reader))
+ # read output
+ reader = vcf.Reader(buf)
+ self.assertEqual(reader.samples, ['NA00001'])
+ rec = next(reader)
+ self.assertEqual(len(rec.samples), 1)
+
+
class TestFilter(unittest.TestCase):
+ @unittest.skip("test currently broken")
def testApplyFilter(self):
# FIXME: broken with distribute
- return
s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 test/example-4.0.vcf sq')
#print(out)
self.assertEqual(s, 0)
@@ -903,9 +1379,9 @@ class TestFilter(unittest.TestCase):
self.assertEqual(n, 2)
+ @unittest.skip("test currently broken")
def testApplyMultipleFilters(self):
# FIXME: broken with distribute
- return
s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 '
'--genotype-quality 50 test/example-4.0.vcf sq mgq')
self.assertEqual(s, 0)
@@ -925,7 +1401,7 @@ class TestRegression(unittest.TestCase):
def test_issue_16(self):
reader = vcf.Reader(fh('issue-16.vcf'))
- n = reader.next()
+ n = next(reader)
assert n.QUAL == None
def test_null_mono(self):
@@ -940,7 +1416,7 @@ class TestRegression(unittest.TestCase):
out.seek(0)
print(out.getvalue())
p2 = vcf.Reader(out)
- rec = p2.next()
+ rec = next(p2)
assert rec.samples
@@ -1014,26 +1490,105 @@ class TestGATKMeta(unittest.TestCase):
self.assertEqual(reader.metadata['GATKCommandLine'][1]['CommandLineOptions'], '"analysis_type=VariantAnnotator annotation=[HomopolymerRun, VariantType, TandemRepeatAnnotator]"')
+
+class TestUncalledGenotypes(unittest.TestCase):
+ """Test the handling of uncalled (., ./.) sample genotypes."""
+
+ def test_read_uncalled(self):
+ """Test that uncalled genotypes are properly read into
+ gt_nums, gt_bases, ploidity, and gt_alleles properties
+ of _Call objects. For uncalled _Call objects:
+
+ - gt_nums should be None
+ - gt_bases should be None
+ - ploidity should match the input ploidity
+ - gt_alleles should be a list of None's with length
+ matching the ploidity"""
+
+ reader = vcf.Reader(fh('uncalled_genotypes.vcf'))
+ for var in reader:
+ gt_bases = [s.gt_bases for s in var.samples]
+ gt_nums = [s.gt_nums for s in var.samples]
+ ploidity = [s.ploidity for s in var.samples]
+ gt_alleles = [s.gt_alleles for s in var.samples]
+
+ if var.POS == 14370:
+ self.assertEqual(['0|0', None, '1/1'], gt_nums)
+ self.assertEqual(['G|G', None, 'A/A'], gt_bases)
+ self.assertEqual([2,2,2], ploidity)
+ self.assertEqual([['0','0'], [None,None], ['1','1']], gt_alleles)
+ elif var.POS == 17330:
+ self.assertEqual([None, '0|1', '0/0'], gt_nums)
+ self.assertEqual([None, 'T|A', 'T/T'], gt_bases)
+ self.assertEqual([3,2,2], ploidity)
+ self.assertEqual([[None,None,None], ['0','1'], ['0','0']], gt_alleles)
+ elif var.POS == 1234567:
+ self.assertEqual(['0/1', '0/2', None], gt_nums)
+ self.assertEqual(['GTC/G', 'GTC/GTCT', None], gt_bases)
+ self.assertEqual([2,2,1], ploidity)
+ self.assertEqual([['0','1'], ['0','2'], [None]], gt_alleles)
+ reader._reader.close()
+
+
+ def test_write_uncalled(self):
+ """Test that uncalled genotypes are written just as
+ they were read in the input file."""
+
+ reader = vcf.Reader(fh('uncalled_genotypes.vcf'))
+
+ # Write all reader records to a stream.
+ out = StringIO()
+ writer = vcf.Writer(out, reader, lineterminator='\n')
+ for record in reader:
+ writer.write_record(record)
+ reader._reader.close()
+
+
+ # Compare the written stream to the input reader line-by-line.
+ out.seek(0)
+ out_lines = out.getvalue().split('\n')
+ in_file = fh('uncalled_genotypes.vcf')
+ in_lines = [l.rstrip('\n') for l in in_file]
+ in_file.close()
+ for (in_line, out_line) in zip(in_lines, out_lines):
+ self.assertEqual(in_line,out_line)
+
+class TestStrelka(unittest.TestCase):
+
+ def test_strelka(self):
+ reader = vcf.Reader(fh('strelka.vcf'))
+ n = next(reader)
+ assert n is not None
+
+
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestVcfSpecs))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFreebayesOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamtoolsOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutput))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue214))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kg))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kgSites))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGoNL))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStringAsFlag))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoOrder))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoTypeCharacter))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestParseMetaLine))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutputWriter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutputWriter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestWriterDictionaryMeta))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamplesSpace))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMetadataWhitespace))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMixedFiltering))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRecord))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestCall))
-suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestTabix))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFetch))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue201))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestOpenMethods))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSampleFilter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFilter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRegression))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUtils))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGATKMeta))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUncalledGenotypes))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStrelka))
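A recurring pattern in this diff is replacing the old `if not self.run: return` guards with `@unittest.skipUnless(...)` class decorators, so pysam-dependent tests are reported as skips instead of silently passing. A self-contained sketch of that pattern (the class name here is hypothetical):

```python
import unittest

try:
    import pysam  # optional dependency, as in the TestFetch diff above
except ImportError:
    pysam = None

@unittest.skipUnless(pysam, "test requires installation of PySAM.")
class TestNeedsPysam(unittest.TestCase):
    """Skipped as a whole when pysam is missing; every method inside
    is recorded as a skip in the test report."""
    def test_something(self):
        self.assertTrue(True)
```

Either way, with or without pysam installed, the run is successful: the test either passes or is counted as an explicit skip.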
=====================================
vcf/utils.py
=====================================
--- a/vcf/utils.py
+++ b/vcf/utils.py
@@ -28,7 +28,7 @@ def walk_together(*readers, **kwargs):
nexts = []
for reader in readers:
try:
- nexts.append(reader.next())
+ nexts.append(next(reader))
except StopIteration:
nexts.append(None)
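The `walk_together` hunk above swaps `reader.next()` for the builtin `next(reader)`: the builtin works on any iterator under both Python 2 and 3, whereas the `.next()` method exists only on Python 2 iterators (Python 3 renamed it to `__next__`). A minimal stand-in for the pattern (hypothetical function name, not the real `walk_together`):

```python
def first_records(*readers):
    """Pull the first item from each iterator, recording None for any
    that is already exhausted -- the same try/except StopIteration
    shape walk_together uses to prime its readers."""
    firsts = []
    for reader in readers:
        try:
            firsts.append(next(reader))  # builtin: portable across 2/3
        except StopIteration:
            firsts.append(None)
    return firsts

print(first_records(iter([1, 2]), iter([])))  # [1, None]
```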
View it on GitLab: https://salsa.debian.org/med-team/python-pyvcf/compare/fa802f797e730e317890874d3d35cd1895a22c8d...3f9805e00a59a7676a17df7f7ec6f52edf8081bc