[med-svn] [Git][med-team/python-pyvcf][debian/jessie-backports] 16 commits: Apply patch by Scott Kitterman
Andreas Tille
gitlab at salsa.debian.org
Wed Jul 18 10:47:19 BST 2018
Andreas Tille pushed to branch debian/jessie-backports at Debian Med / python-pyvcf
Commits:
3a0d92bb by Andreas Tille at 2015-12-31T12:21:47+01:00
Apply patch by Scott Kitterman
- - - - -
482dccb9 by Andreas Tille at 2015-12-31T12:22:44+01:00
Upload to unstable
- - - - -
d95234c9 by Andreas Tille at 2016-12-02T19:42:18+01:00
Fix watch file
- - - - -
9c5c2758 by Andreas Tille at 2016-12-02T20:18:10+01:00
Fix obtaining version in get-orig-source
- - - - -
de030e81 by Andreas Tille at 2016-12-02T20:19:03+01:00
New upstream version 0.6.8
- - - - -
e5b6d3ab by Andreas Tille at 2016-12-02T20:19:04+01:00
Merge tag 'upstream/0.6.8'
Upstream version 0.6.8
- - - - -
8bcbef7c by Andreas Tille at 2016-12-02T20:19:30+01:00
New upstream version
- - - - -
a33cd399 by Andreas Tille at 2016-12-02T20:20:09+01:00
cme fix dpkg-control
- - - - -
31ef761b by Andreas Tille at 2016-12-02T20:20:44+01:00
debhelper 10
- - - - -
4e981625 by Andreas Tille at 2016-12-02T20:23:38+01:00
Patch applied upstream
- - - - -
3d2ee246 by Andreas Tille at 2016-12-02T20:30:29+01:00
hardening=+all
- - - - -
1cacc9de by Andreas Tille at 2016-12-02T20:31:07+01:00
Remove unused lintian override
- - - - -
5a22064b by Andreas Tille at 2016-12-02T21:15:30+01:00
Rename all *.py files and upload
- - - - -
03196a79 by Andreas Tille at 2017-06-09T23:05:50+02:00
Merge branch 'master' into debian/jessie-backports
- - - - -
b64aa101 by Andreas Tille at 2017-06-09T23:06:10+02:00
Rebuild for jessie-backports
- - - - -
3f9805e0 by Andreas Tille at 2017-06-16T23:59:18+02:00
export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
- - - - -
24 changed files:
- PKG-INFO
- README.rst
- debian/changelog
- debian/compat
- debian/control
- debian/get-orig-source
- − debian/patches/series
- − debian/patches/use_setuptools.patch
- − debian/python-pyvcf-examples.lintian-overrides
- debian/rules
- debian/watch
- + scripts/vcf_sample_filter.py
- setup.cfg
- setup.py
- vcf/__init__.py
- vcf/cparse.pyx
- vcf/model.py
- vcf/parser.py
- + vcf/sample_filter.py
- + vcf/test/FT.vcf
- + vcf/test/issue-214.vcf
- + vcf/test/strelka.vcf
- vcf/test/test_vcf.py
- vcf/utils.py
Changes:
=====================================
PKG-INFO
=====================================
--- a/PKG-INFO
+++ b/PKG-INFO
@@ -1,190 +1,27 @@
Metadata-Version: 1.1
Name: PyVCF
-Version: 0.6.7
+Version: 0.6.8
Summary: Variant Call Format (VCF) parser for Python
Home-page: https://github.com/jamescasbon/PyVCF
Author: James Casbon and @jdoughertyii
Author-email: casbon at gmail.com
License: UNKNOWN
-Description: A VCFv4.0 and 4.1 parser for Python.
-
- Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
-
- The intent of this module is to mimic the ``csv`` module in the Python stdlib,
- as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
- will attempt to parse the content of each record based on the data types
- specified in the meta-information lines -- specifically the ##INFO and
- ##FORMAT lines. If these lines are missing or incomplete, it will check
- against the reserved types mentioned in the spec. Failing that, it will just
- return strings.
-
- There main interface is the class: ``Reader``. It takes a file-like
- object and acts as a reader::
-
- >>> import vcf
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> for record in vcf_reader:
- ... print record
- Record(CHROM=20, POS=14370, REF=G, ALT=[A])
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
- Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-
-
- This produces a great deal of information, but it is conveniently accessed.
- The attributes of a Record are the 8 fixed fields from the VCF spec::
-
- * ``Record.CHROM``
- * ``Record.POS``
- * ``Record.ID``
- * ``Record.REF``
- * ``Record.ALT``
- * ``Record.QUAL``
- * ``Record.FILTER``
- * ``Record.INFO``
-
- plus attributes to handle genotype information:
-
- * ``Record.FORMAT``
- * ``Record.samples``
- * ``Record.genotype``
-
- ``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
- of the fixed fields is from the spec. Comma-separated lists in the VCF are
- converted to lists. In particular, one-entry VCF lists are converted to
- one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
- of key=value pairs are converted to Python dictionaries, with flags being given
- a ``True`` value. Integers and floats are handled exactly as you'd expect::
-
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
- >>> print record.POS
- 14370
- >>> print record.ALT
- [A]
- >>> print record.INFO['AF']
- [0.5]
-
- There are a number of convienience methods and properties for each ``Record`` allowing you to
- examine properties of interest::
-
- >>> print record.num_called, record.call_rate, record.num_unknown
- 3 1.0 0
- >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
- 1 1 1
- >>> print record.nucl_diversity, record.aaf, record.heterozygosity
- 0.6 [0.5] 0.5
- >>> print record.get_hets()
- [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
- >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
- True False True False
- >>> print record.var_type, record.var_subtype
- snp ts
- >>> print record.is_monomorphic
- False
-
- ``record.FORMAT`` will be a string specifying the format of the genotype
- fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
- ``None``. Finally, ``record.samples`` is a list of dictionaries containing the
- parsed sample column and ``record.genotype`` is a way of looking up genotypes
- by sample name::
-
- >>> record = vcf_reader.next()
- >>> for sample in record.samples:
- ... print sample['GT']
- 0|0
- 0|1
- 0/0
- >>> print record.genotype('NA00001')['GT']
- 0|0
-
- The genotypes are represented by ``Call`` objects, which have three attributes: the
- corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
- call data in ``data``::
-
- >>> call = record.genotype('NA00001')
- >>> print call.site
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- >>> print call.sample
- NA00001
- >>> print call.data
- CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-
- Please note that as of release 0.4.0, attributes known to have single values (such as
- ``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
- as lists (such as ``HQ`` above).
-
- There are also a number of methods::
-
- >>> print call.called, call.gt_type, call.gt_bases, call.phased
- True 0 T|T True
-
- Metadata regarding the VCF file itself can be investigated through the
- following attributes:
-
- * ``Reader.metadata``
- * ``Reader.infos``
- * ``Reader.filters``
- * ``Reader.formats``
- * ``Reader.samples``
-
- For example::
-
- >>> vcf_reader.metadata['fileDate']
- '20090805'
- >>> vcf_reader.samples
- ['NA00001', 'NA00002', 'NA00003']
- >>> vcf_reader.filters
- OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
- >>> vcf_reader.infos['AA'].desc
- 'Ancestral Allele'
-
- ALT records are actually classes, so that you can interrogate them::
-
- >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
- >>> print row
- Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
- >>> bnd = row.ALT[0]
- >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
- True False True T
-
- Random access is supported for files with tabix indexes. Simply call fetch for the
- region you are interested in::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
- ... print record
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-
- Or extract a single row::
-
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-
-
- The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
- template ``Reader`` which provides the metadata::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
- >>> for record in vcf_reader:
- ... vcf_writer.write_record(record)
-
-
- An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
- declared by other packages will be available for use in this script. Please
- see :doc:`FILTERS` for full description.
-
-
+Description: UNKNOWN
Keywords: bioinformatics
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
+Classifier: License :: OSI Approved :: BSD License
+Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Cython
Classifier: Programming Language :: Python
+Classifier: Programming Language :: Python :: 2
+Classifier: Programming Language :: Python :: 2.6
+Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
-Classifier: Topic :: Scientific/Engineering
+Classifier: Programming Language :: Python :: 3.2
+Classifier: Programming Language :: Python :: 3.3
+Classifier: Programming Language :: Python :: 3.4
+Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
=====================================
README.rst
=====================================
--- a/README.rst
+++ b/README.rst
@@ -50,7 +50,7 @@ of key=value pairs are converted to Python dictionaries, with flags being given
a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
+ >>> record = next(vcf_reader)
>>> print record.POS
14370
>>> print record.ALT
@@ -58,7 +58,7 @@ a ``True`` value. Integers and floats are handled exactly as you'd expect::
>>> print record.INFO['AF']
[0.5]
-There are a number of convienience methods and properties for each ``Record`` allowing you to
+There are a number of convenience methods and properties for each ``Record`` allowing you to
examine properties of interest::
>>> print record.num_called, record.call_rate, record.num_unknown
@@ -82,7 +82,7 @@ fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
parsed sample column and ``record.genotype`` is a way of looking up genotypes
by sample name::
- >>> record = vcf_reader.next()
+ >>> record = next(vcf_reader)
>>> for sample in record.samples:
... print sample['GT']
0|0
@@ -135,27 +135,38 @@ For example::
ALT records are actually classes, so that you can interrogate them::
>>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
+ >>> _ = next(reader); row = next(reader)
>>> print row
Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
>>> bnd = row.ALT[0]
>>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
True False True T
-Random access is supported for files with tabix indexes. Simply call fetch for the
-region you are interested in::
+The Reader supports retrieval of records within designated regions for files
+with tabix indexes via the fetch method. This requires the pysam module as a
+dependency. Pass in a chromosome, and, optionally, start and end coordinates,
+for the regions of interest::
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
+ >>> # fetch all records on chromosome 20 from base 1110696 through 1230237
+ >>> for record in vcf_reader.fetch('20', 1110695, 1230237): # doctest: +SKIP
... print record
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-Or extract a single row::
+Note that the start and end coordinates are in the zero-based, half-open
+coordinate system, similar to ``_Record.start`` and ``_Record.end``. The very
first base of a chromosome is index 0, and the region includes bases up
+to, but not including the base at the end coordinate. For example::
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
+ >>> # fetch all records on chromosome 4 from base 11 through 20
+ >>> vcf_reader.fetch('4', 10, 20) # doctest: +SKIP
+would include all records overlapping a 10 base pair region from the 11th base
through the 20th base (which is at index 19) of chromosome 4. It would not
+include the 21st base (at index 20). (See
+http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more
+information on the zero-based, half-open coordinate system.)
The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
template ``Reader`` which provides the metadata::
@@ -165,8 +176,6 @@ template ``Reader`` which provides the metadata::
>>> for record in vcf_reader:
... vcf_writer.write_record(record)
-
An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
declared by other packages will be available for use in this script. Please
see :doc:`FILTERS` for full description.
-
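The README changes above hinge on the difference between VCF's one-based `POS` and the zero-based, half-open `start`/`end` coordinates used by `fetch`. A minimal sketch of that conversion (illustrative only, not part of the diff):

```python
# Sketch of the zero-based, half-open convention the new README text
# describes: VCF POS is one-based, while _Record.start / _Record.end
# are zero-based and half-open, like Python slices.
def to_half_open(pos, ref):
    """Convert a one-based VCF POS plus REF string to (start, end)."""
    start = pos - 1            # first base of the contig has index 0
    end = start + len(ref)     # end is exclusive, slice-style
    return start, end

# POS=1110696 with a single-base REF covers the half-open interval
# [1110695, 1110696), which is why the README example passes 1110695
# as the fetch start coordinate.
print(to_half_open(1110696, 'A'))   # (1110695, 1110696)
```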
=====================================
debian/changelog
=====================================
--- a/debian/changelog
+++ b/debian/changelog
@@ -1,3 +1,30 @@
+python-pyvcf (0.6.8-1~bpo8+1) jessie-backports; urgency=medium
+
+ * Rebuild for jessie-backports.
+ * export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
+ -- Andreas Tille <tille at debian.org> Fri, 09 Jun 2017 23:06:01 +0200
+
+python-pyvcf (0.6.8-1) unstable; urgency=medium
+
+ * New upstream version
+ * Fix watch file
+ * cme fix dpkg-control
+ * debhelper 10
+ * hardening=+all
+ * Remove unused lintian override
+
+ -- Andreas Tille <tille at debian.org> Fri, 02 Dec 2016 20:33:48 +0100
+
+python-pyvcf (0.6.7-2) unstable; urgency=medium
+
+ [ Scott Kitterman ]
+ * Change python3-dev build-dep to python3-all-dev so package is built for
+ all supported python3 versions
+ Closes: #809487
+
+ -- Andreas Tille <tille at debian.org> Thu, 31 Dec 2015 12:22:30 +0100
+
python-pyvcf (0.6.7-1~bpo8+1) jessie-backports; urgency=medium
* Rebuild for jessie-backports.
=====================================
debian/compat
=====================================
--- a/debian/compat
+++ b/debian/compat
@@ -1 +1 @@
-9
+10
=====================================
debian/control
=====================================
--- a/debian/control
+++ b/debian/control
@@ -3,17 +3,17 @@ Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.
Uploaders: Andreas Tille <tille at debian.org>
Section: python
Priority: optional
-Build-Depends: debhelper (>= 9),
+Build-Depends: debhelper (>= 10),
dh-python,
cython,
- python-dev (>= 2.7),
+ python-dev,
python-setuptools,
cython3,
- python3-dev,
+ python3-all-dev,
python3-setuptools
-Standards-Version: 3.9.6
+Standards-Version: 3.9.8
Vcs-Browser: https://anonscm.debian.org/cgit/debian-med/python-pyvcf.git
-Vcs-Git: git://anonscm.debian.org/debian-med/python-pyvcf.git
+Vcs-Git: https://anonscm.debian.org/git/debian-med/python-pyvcf.git
Homepage: https://pypi.python.org/pypi/PyVCF
Package: python-pyvcf
=====================================
debian/get-orig-source
=====================================
--- a/debian/get-orig-source
+++ b/debian/get-orig-source
@@ -2,7 +2,6 @@
# Test-Data needs to be fetched separately from Git
set -e
-set -x
PKG=`dpkg-parsechangelog | awk '/^Source/ { print $2 }'`
@@ -10,7 +9,7 @@ if ! echo $@ | grep -q upstream-version ; then
VERSION=`dpkg-parsechangelog | awk '/^Version:/ { print $2 }' | sed 's/\([0-9\.]\+\)-[0-9]\+$/\1/'`
uscan --verbose --force-download
else
- VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?'`
+ VERSION=`echo $@ | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?'`
if echo "$VERSION" | grep -q "upstream-version" ; then
echo "Unable to parse version number"
exit
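The sed change above drops the trailing space that the old pattern required after the version number. A quick sketch of the difference (the argument string is illustrative):

```shell
# The old pattern required " .*" after the version, so a version at the
# very end of the argument string never matched and the line passed
# through unchanged; the new pattern matches with or without trailing text.
args="--get-orig-source --upstream-version 0.6.8"
old=$(echo "$args" | sed 's?^.*--upstream-version \([0-9.]\+\) .*?\1?')
new=$(echo "$args" | sed 's?^.*--upstream-version \([0-9.]\+\).*?\1?')
echo "old: $old"   # old: --get-orig-source --upstream-version 0.6.8
echo "new: $new"   # new: 0.6.8
```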
=====================================
debian/patches/series deleted
=====================================
--- a/debian/patches/series
+++ /dev/null
@@ -1 +0,0 @@
-use_setuptools.patch
=====================================
debian/patches/use_setuptools.patch deleted
=====================================
--- a/debian/patches/use_setuptools.patch
+++ /dev/null
@@ -1,16 +0,0 @@
-Author: Andreas Tille <tille at debian.org>
-Last-Updated: Tue, 13 Oct 2015 17:02:08 +0200
-Description: Use setuptools instead of distribute
- see https://lists.alioth.debian.org/pipermail/debian-med-packaging/2015-October/035410.html
-
---- a/setup.py
-+++ b/setup.py
-@@ -53,7 +53,7 @@ setup(
- description='Variant Call Format (VCF) parser for Python',
- long_description=DOC,
- test_suite='vcf.test.test_vcf.suite',
-- install_requires=['distribute'],
-+ install_requires=['setuptools'],
- requires=requires,
- entry_points = {
- 'vcf.filters': [
=====================================
debian/python-pyvcf-examples.lintian-overrides deleted
=====================================
--- a/debian/python-pyvcf-examples.lintian-overrides
+++ /dev/null
@@ -1,3 +0,0 @@
-# This is an unchanged file from upstream tarball
-python-pyvcf-examples: package-contains-timestamped-gzip usr/share/doc/python-pyvcf-examples/test/1kg.vcf.gz
-
=====================================
debian/rules
=====================================
--- a/debian/rules
+++ b/debian/rules
@@ -4,6 +4,8 @@
# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1
+export DEB_BUILD_MAINT_OPTIONS = hardening=+all,-pie
+
export PYBUILD_NAME=pyvcf
examplesdir=debian/python-$(PYBUILD_NAME)-examples/usr/share/doc/python-$(PYBUILD_NAME)-examples
helperdir=debian/$(PYBUILD_NAME)/usr
@@ -21,7 +23,9 @@ override_dh_install:
mkdir -p $(helperdir)
mv debian/python3-$(PYBUILD_NAME)/usr/bin $(helperdir)
# strip .py extension
- mv $(helperdir)/bin/vcf_filter.py $(helperdir)/bin/vcf_filter
+ for py in $(helperdir)/bin/*.py ; do \
+ mv $${py} $(helperdir)/bin/`basename $${py} .py` ; \
+ done
rm -rf debian/python-$(PYBUILD_NAME)/usr/bin
get-orig-source:
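The generalized rename loop in debian/rules can be sketched as plain shell (the temp directory stands in for `$(helperdir)/bin`):

```shell
# Sketch of the new rules loop: strip the .py extension from every
# script in the bin directory, not just vcf_filter.py.
bindir=$(mktemp -d)
touch "$bindir/vcf_filter.py" "$bindir/vcf_sample_filter.py" "$bindir/vcf_melt"
for py in "$bindir"/*.py ; do
    mv "$py" "$bindir/$(basename "$py" .py)"
done
ls "$bindir"
```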
=====================================
debian/watch
=====================================
--- a/debian/watch
+++ b/debian/watch
@@ -1,5 +1,4 @@
# Test-Data needs to be fetched separately from Git so wee need to call get-orig-source
-
-version=3
-https://pypi.python.org/pypi/PyVCF .*/PyVCF-([0-9.]+).tar.[xgb]z2? \
+version=4
+http://pypi.debian.net/PyVCF/PyVCF-(.+)\.(?:tar(?:\.gz|\.bz2)?|tgz) \
debian debian/get-orig-source
=====================================
scripts/vcf_sample_filter.py
=====================================
--- /dev/null
+++ b/scripts/vcf_sample_filter.py
@@ -0,0 +1,39 @@
+#!/usr/bin/env python
+
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import argparse
+import logging
+
+from vcf import SampleFilter
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("file", help="VCF file to filter")
+ parser.add_argument("-o", metavar="outfile",
+ help="File to write out filtered samples")
+ parser.add_argument("-f", metavar="filters",
+ help="Comma-separated list of sample indices or names \
+ to filter")
+ parser.add_argument("-i", "--invert", action="store_true",
+ help="Keep rather than discard the filtered samples")
+ parser.add_argument("-q", "--quiet", action="store_true",
+ help="Less output")
+
+ args = parser.parse_args()
+
+ if args.quiet:
+ log_level = logging.WARNING
+ else:
+ log_level = logging.INFO
+ logging.basicConfig(format='%(message)s', level=log_level)
+
+ sf = SampleFilter(infile=args.file, outfile=args.o,
+ filters=args.f, invert=args.invert)
+ if args.f is None:
+ print "Samples:"
+ for idx, val in enumerate(sf.samples):
+ print "{0}: {1}".format(idx, val)
=====================================
setup.cfg
=====================================
--- a/setup.cfg
+++ b/setup.cfg
@@ -1,5 +1,5 @@
[egg_info]
-tag_build =
tag_date = 0
+tag_build =
tag_svn_revision = 0
=====================================
setup.py
=====================================
--- a/setup.py
+++ b/setup.py
@@ -1,6 +1,6 @@
from setuptools import setup
-from distutils.core import setup
from distutils.extension import Extension
+import sys
try:
from Cython.Distutils import build_ext
@@ -8,23 +8,14 @@ try:
except:
CYTHON = False
-requires = []
+IS_PYTHON26 = sys.version_info[:2] == (2, 6)
-# python 2.6 does not have argparse
-try:
- import argparse
-except ImportError:
- requires.append('argparse')
+DEPENDENCIES = ['setuptools']
+
+if IS_PYTHON26:
+ DEPENDENCIES.extend(['argparse', 'counter', 'ordereddict',
+ 'unittest2'])
-import collections
-try:
- collections.Counter
-except AttributeError:
- requires.append('counter')
-try:
- collections.OrderedDict
-except AttributeError:
- requires.append('ordereddict')
# get the version without an import
VERSION = "Undefined"
@@ -47,14 +38,14 @@ if CYTHON:
setup(
name='PyVCF',
packages=['vcf', 'vcf.test'],
- scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py'],
+ scripts=['scripts/vcf_melt', 'scripts/vcf_filter.py',
+ 'scripts/vcf_sample_filter.py'],
author='James Casbon and @jdoughertyii',
author_email='casbon at gmail.com',
description='Variant Call Format (VCF) parser for Python',
long_description=DOC,
test_suite='vcf.test.test_vcf.suite',
- install_requires=['distribute'],
- requires=requires,
+ install_requires=DEPENDENCIES,
entry_points = {
'vcf.filters': [
'site_quality = vcf.filters:SiteQuality',
@@ -71,10 +62,19 @@ setup(
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'Intended Audience :: Science/Research',
+ 'License :: OSI Approved :: BSD License',
+ 'License :: OSI Approved :: MIT License',
'Operating System :: OS Independent',
+ 'Programming Language :: Cython',
'Programming Language :: Python',
+ 'Programming Language :: Python :: 2',
+ 'Programming Language :: Python :: 2.6',
+ 'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
- 'Topic :: Scientific/Engineering',
+ 'Programming Language :: Python :: 3.2',
+ 'Programming Language :: Python :: 3.3',
+ 'Programming Language :: Python :: 3.4',
+ 'Topic :: Scientific/Engineering :: Bio-Informatics',
],
keywords='bioinformatics',
use_2to3=True,
=====================================
vcf/__init__.py
=====================================
--- a/vcf/__init__.py
+++ b/vcf/__init__.py
@@ -1,180 +1,15 @@
#!/usr/bin/env python
-'''A VCFv4.0 and 4.1 parser for Python.
+"""
+A VCFv4.0 and 4.1 parser for Python.
Online version of PyVCF documentation is available at http://pyvcf.rtfd.org/
+"""
-The intent of this module is to mimic the ``csv`` module in the Python stdlib,
-as opposed to more flexible serialization formats like JSON or YAML. ``vcf``
-will attempt to parse the content of each record based on the data types
-specified in the meta-information lines -- specifically the ##INFO and
-##FORMAT lines. If these lines are missing or incomplete, it will check
-against the reserved types mentioned in the spec. Failing that, it will just
-return strings.
-There main interface is the class: ``Reader``. It takes a file-like
-object and acts as a reader::
-
- >>> import vcf
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> for record in vcf_reader:
- ... print record
- Record(CHROM=20, POS=14370, REF=G, ALT=[A])
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
- Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])
-
-
-This produces a great deal of information, but it is conveniently accessed.
-The attributes of a Record are the 8 fixed fields from the VCF spec::
-
- * ``Record.CHROM``
- * ``Record.POS``
- * ``Record.ID``
- * ``Record.REF``
- * ``Record.ALT``
- * ``Record.QUAL``
- * ``Record.FILTER``
- * ``Record.INFO``
-
-plus attributes to handle genotype information:
-
- * ``Record.FORMAT``
- * ``Record.samples``
- * ``Record.genotype``
-
-``samples`` and ``genotype``, not being the title of any column, are left lowercase. The format
-of the fixed fields is from the spec. Comma-separated lists in the VCF are
-converted to lists. In particular, one-entry VCF lists are converted to
-one-entry Python lists (see, e.g., ``Record.ALT``). Semicolon-delimited lists
-of key=value pairs are converted to Python dictionaries, with flags being given
-a ``True`` value. Integers and floats are handled exactly as you'd expect::
-
- >>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
- >>> record = vcf_reader.next()
- >>> print record.POS
- 14370
- >>> print record.ALT
- [A]
- >>> print record.INFO['AF']
- [0.5]
-
-There are a number of convienience methods and properties for each ``Record`` allowing you to
-examine properties of interest::
-
- >>> print record.num_called, record.call_rate, record.num_unknown
- 3 1.0 0
- >>> print record.num_hom_ref, record.num_het, record.num_hom_alt
- 1 1 1
- >>> print record.nucl_diversity, record.aaf, record.heterozygosity
- 0.6 [0.5] 0.5
- >>> print record.get_hets()
- [Call(sample=NA00002, CallData(GT=1|0, GQ=48, DP=8, HQ=[51, 51]))]
- >>> print record.is_snp, record.is_indel, record.is_transition, record.is_deletion
- True False True False
- >>> print record.var_type, record.var_subtype
- snp ts
- >>> print record.is_monomorphic
- False
-
-``record.FORMAT`` will be a string specifying the format of the genotype
-fields. In case the FORMAT column does not exist, ``record.FORMAT`` is
-``None``. Finally, ``record.samples`` is a list of dictionaries containing the
-parsed sample column and ``record.genotype`` is a way of looking up genotypes
-by sample name::
-
- >>> record = vcf_reader.next()
- >>> for sample in record.samples:
- ... print sample['GT']
- 0|0
- 0|1
- 0/0
- >>> print record.genotype('NA00001')['GT']
- 0|0
-
-The genotypes are represented by ``Call`` objects, which have three attributes: the
-corresponding Record ``site``, the sample name in ``sample`` and a dictionary of
-call data in ``data``::
-
- >>> call = record.genotype('NA00001')
- >>> print call.site
- Record(CHROM=20, POS=17330, REF=T, ALT=[A])
- >>> print call.sample
- NA00001
- >>> print call.data
- CallData(GT=0|0, GQ=49, DP=3, HQ=[58, 50])
-
-Please note that as of release 0.4.0, attributes known to have single values (such as
-``DP`` and ``GQ`` above) are returned as values. Other attributes are returned
-as lists (such as ``HQ`` above).
-
-There are also a number of methods::
-
- >>> print call.called, call.gt_type, call.gt_bases, call.phased
- True 0 T|T True
-
-Metadata regarding the VCF file itself can be investigated through the
-following attributes:
-
- * ``Reader.metadata``
- * ``Reader.infos``
- * ``Reader.filters``
- * ``Reader.formats``
- * ``Reader.samples``
-
-For example::
-
- >>> vcf_reader.metadata['fileDate']
- '20090805'
- >>> vcf_reader.samples
- ['NA00001', 'NA00002', 'NA00003']
- >>> vcf_reader.filters
- OrderedDict([('q10', Filter(id='q10', desc='Quality below 10')), ('s50', Filter(id='s50', desc='Less than 50% of samples have data'))])
- >>> vcf_reader.infos['AA'].desc
- 'Ancestral Allele'
-
-ALT records are actually classes, so that you can interrogate them::
-
- >>> reader = vcf.Reader(open('vcf/test/example-4.1-bnd.vcf'))
- >>> _ = reader.next(); row = reader.next()
- >>> print row
- Record(CHROM=1, POS=2, REF=T, ALT=[T[2:3[])
- >>> bnd = row.ALT[0]
- >>> print bnd.withinMainAssembly, bnd.orientation, bnd.remoteOrientation, bnd.connectingSequence
- True False True T
-
-Random access is supported for files with tabix indexes. Simply call fetch for the
-region you are interested in::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> for record in vcf_reader.fetch('20', 1110696, 1230237): # doctest: +SKIP
- ... print record
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
- Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
-
-Or extract a single row::
-
- >>> print vcf_reader.fetch('20', 1110696) # doctest: +SKIP
- Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
-
-
-The ``Writer`` class provides a way of writing a VCF file. Currently, you must specify a
-template ``Reader`` which provides the metadata::
-
- >>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
- >>> vcf_writer = vcf.Writer(open('/dev/null', 'w'), vcf_reader)
- >>> for record in vcf_reader:
- ... vcf_writer.write_record(record)
-
-
-An extensible script is available to filter vcf files in vcf_filter.py. VCF filters
-declared by other packages will be available for use in this script. Please
-see :doc:`FILTERS` for full description.
-
-'''
from vcf.parser import Reader, Writer
from vcf.parser import VCFReader, VCFWriter
from vcf.filters import Base as Filter
from vcf.parser import RESERVED_INFO, RESERVED_FORMAT
+from vcf.sample_filter import SampleFilter
-VERSION = '0.6.7'
+VERSION = '0.6.8'
=====================================
vcf/cparse.pyx
=====================================
--- a/vcf/cparse.pyx
+++ b/vcf/cparse.pyx
@@ -36,7 +36,10 @@ def parse_samples(
vals = sampvals[j]
# short circuit the most common
- if vals == '.' or vals == './.':
+ if samp_fmt._fields[j] == 'GT':
+ sampdat[j] = vals
+ continue
+ elif not vals or vals == '.':
sampdat[j] = None
continue
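The cparse.pyx hunk changes the per-sample short-circuit so GT is always kept verbatim (previously `./.` was collapsed to `None`, losing ploidy information). A rough pure-Python rendering of the new branch (function and variable names are assumptions, not the actual Cython code):

```python
def parse_sample_values(fields, sampvals):
    """Mirror of the new short-circuit: GT verbatim, '.' or empty -> None."""
    sampdat = [None] * len(fields)
    for j, vals in enumerate(sampvals):
        if fields[j] == 'GT':
            sampdat[j] = vals        # keep GT as-is, even './.'
        elif not vals or vals == '.':
            sampdat[j] = None        # missing value
        else:
            sampdat[j] = vals
    return sampdat

print(parse_sample_values(['GT', 'DP'], ['./.', '.']))   # ['./.', None]
```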
=====================================
vcf/model.py
=====================================
--- a/vcf/model.py
+++ b/vcf/model.py
@@ -1,33 +1,39 @@
from abc import ABCMeta, abstractmethod
import collections
import sys
+import re
try:
from collections import Counter
except ImportError:
from counter import Counter
+allele_delimiter = re.compile(r'''[|/]''') # to split a genotype into alleles
class _Call(object):
""" A genotype call, a cell entry in a VCF file"""
- __slots__ = ['site', 'sample', 'data', 'gt_nums', 'called']
+ __slots__ = ['site', 'sample', 'data', 'gt_nums', 'gt_alleles', 'called', 'ploidity']
def __init__(self, site, sample, data):
#: The ``_Record`` for this ``_Call``
self.site = site
#: The sample name
self.sample = sample
- #: Dictionary of data from the VCF file
+ #: Namedtuple of data from the VCF file
self.data = data
- try:
- self.gt_nums = self.data.GT
- #: True if the GT is not ./.
- self.called = self.gt_nums is not None
- except AttributeError:
- self.gt_nums = None
+
+ if hasattr(self.data, 'GT'):
+ self.gt_alleles = [(al if al != '.' else None) for al in allele_delimiter.split(self.data.GT)]
+ self.ploidity = len(self.gt_alleles)
+ self.called = all([al != None for al in self.gt_alleles])
+ self.gt_nums = self.data.GT if self.called else None
+ else:
#62 a call without a genotype is not defined as called or not
+ self.gt_alleles = None
+ self.ploidity = None
self.called = None
+ self.gt_nums = None
def __repr__(self):
return "Call(sample=%s, %s)" % (self.sample, str(self.data))
@@ -51,12 +57,6 @@ class _Call(object):
return "/" if not self.phased else "|"
@property
- def gt_alleles(self):
- '''The numbers of the alleles called at a given sample'''
- # grab the numeric alleles of the gt string; tokenize by phasing
- return self.gt_nums.split(self.gt_phase_char())
-
- @property
def gt_bases(self):
'''The actual genotype alleles.
E.g. if VCF genotype is 0/1, return A/G
@@ -125,10 +125,45 @@ class _Record(object):
INFO and FORMAT are available as properties.
The list of genotype calls is in the ``samples`` property.
+
+ Regarding the coordinates associated with each instance:
+
+ - ``POS``, per VCF specification, is the one-based index
+ (the first base of the contig has an index of 1) of the first
+ base of the ``REF`` sequence.
+ - The ``start`` and ``end`` denote the coordinates of the entire
+ ``REF`` sequence in the zero-based, half-open coordinate
+ system (see
+ http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms),
+ where the first base of the contig has an index of 0, and the
+ interval runs up to, but does not include, the base at the
+ ``end`` index. This indexing scheme is analogous to Python
+ slice notation.
+ - The ``affected_start`` and ``affected_end`` coordinates are
+ also in the zero-based, half-open coordinate system. These
+ coordinates indicate the precise region of the reference
+ genome actually affected by the events denoted in ``ALT``
+ (i.e., the minimum ``affected_start`` and maximum
+ ``affected_end``).
+
+ - For SNPs and structural variants, the affected region
+ includes all bases of ``REF``, including the first base
+ (i.e., ``affected_start = start = POS - 1``).
+ - For deletions, the region includes all bases of ``REF``
+ except the first base, which flanks upstream the actual
+ deletion event, per VCF specification.
+ - For insertions, the ``affected_start`` and ``affected_end``
+ coordinates represent a 0 bp-length region between the two
+ flanking bases (i.e., ``affected_start`` =
+ ``affected_end``). This is analogous to Python slice
+ notation (see http://stackoverflow.com/a/2947881/38140).
+ Neither the upstream nor downstream flanking bases are
+ included in the region.
"""
def __init__(self, CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT,
sample_indexes, samples=None):
self.CHROM = CHROM
+ #: the one-based coordinate of the first nucleotide in ``REF``
self.POS = POS
self.ID = ID
self.REF = REF
@@ -137,9 +172,9 @@ class _Record(object):
self.FILTER = FILTER
self.INFO = INFO
self.FORMAT = FORMAT
- #: 0-based start coordinate
+ #: zero-based, half-open start coordinate of ``REF``
self.start = self.POS - 1
- #: 1-based end coordinate
+ #: zero-based, half-open end coordinate of ``REF``
self.end = self.start + len(self.REF)
#: list of alleles. [0] = REF, [1:] = ALTS
self.alleles = [self.REF]
@@ -148,6 +183,61 @@ class _Record(object):
self.samples = samples or []
self._sample_indexes = sample_indexes
+ # Setting affected_start and affected_end here for Sphinx
+ # autodoc purposes...
+ #: zero-based, half-open start coordinate of affected region of reference genome
+ self.affected_start = None
+ #: zero-based, half-open end coordinate of affected region of reference genome (not included in the region)
+ self.affected_end = None
+ self._set_start_and_end()
+
+
+ def _set_start_and_end(self):
+ self.affected_start = self.affected_end = self.POS
+ for alt in self.ALT:
+ if alt is None:
+ start, end = self._compute_coordinates_for_none_alt()
+ elif alt.type == 'SNV':
+ start, end = self._compute_coordinates_for_snp()
+ elif alt.type == 'MNV':
+ start, end = self._compute_coordinates_for_indel()
+ else:
+ start, end = self._compute_coordinates_for_sv()
+ self.affected_start = min(self.affected_start, start)
+ self.affected_end = max(self.affected_end, end)
+
+
+ def _compute_coordinates_for_none_alt(self):
+ start = self.POS - 1
+ end = start + len(self.REF)
+ return (start, end)
+
+
+ def _compute_coordinates_for_snp(self):
+ if len(self.REF) > 1:
+ start = self.POS
+ end = start + (len(self.REF) - 1)
+ else:
+ start = self.POS - 1
+ end = self.POS
+ return (start, end)
+
+
+ def _compute_coordinates_for_indel(self):
+ if len(self.REF) > 1:
+ start = self.POS
+ end = start + (len(self.REF) - 1)
+ else:
+ start = end = self.POS
+ return (start, end)
+
+
+ def _compute_coordinates_for_sv(self):
+ start = self.POS - 1
+ end = start + len(self.REF)
+ return (start, end)
+
+
# For Python 2
def __cmp__(self, other):
return cmp((self.CHROM, self.POS), (getattr(other, "CHROM", None), getattr(other, "POS", None)))
@@ -221,12 +311,13 @@ class _Record(object):
""" A list of allele frequencies of alternate alleles.
NOTE: Denominator calc'ed from _called_ genotypes.
"""
- num_chroms = 2.0 * self.num_called
+ num_chroms = 0.0
allele_counts = Counter()
for s in self.samples:
if s.gt_type is not None:
- allele_counts.update([s.gt_alleles[0]])
- allele_counts.update([s.gt_alleles[1]])
+ for a in s.gt_alleles:
+ allele_counts.update([a])
+ num_chroms += 1
return [allele_counts[str(i)]/num_chroms for i in range(1, len(self.ALT)+1)]
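The revised ``aaf`` computation counts chromosomes per called allele instead of assuming diploidy. A minimal standalone sketch of the same counting, assuming each call provides a ``gt_alleles``-style list of allele index strings:

```python
from collections import Counter

def alt_allele_freqs(calls, n_alt):
    """Frequency of each ALT allele among called alleles.

    `calls` is a list of allele-index-string lists (e.g. ['0', '1']).
    The denominator is the total number of called chromosomes, so
    haploid and diploid calls mix correctly.
    """
    num_chroms = 0.0
    counts = Counter()
    for alleles in calls:
        for al in alleles:
            counts[al] += 1
            num_chroms += 1
    return [counts[str(i)] / num_chroms for i in range(1, n_alt + 1)]
```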
@property
@@ -239,7 +330,7 @@ class _Record(object):
Derived from:
\"Population Genetics: A Concise Guide, 2nd ed., p.45\"
- John Gillespie.
+ John Gillespie.
"""
# skip if more than one alternate allele. assumes bi-allelic
if len(self.ALT) > 1:
@@ -285,7 +376,7 @@ class _Record(object):
for alt in self.ALT:
if alt is None or alt.type != "SNV":
return False
- if alt not in ['A', 'C', 'G', 'T']:
+ if alt not in ['A', 'C', 'G', 'T', 'N', '*']:
return False
return True
@@ -376,13 +467,14 @@ class _Record(object):
def var_subtype(self):
"""
Return the subtype of variant.
+
- For SNPs and INDELs, yield one of: [ts, tv, ins, del]
- - For SVs yield either "complex" or the SV type defined
- in the ALT fields (removing the brackets).
- E.g.:
- <DEL> -> DEL
- <INS:ME:L1> -> INS:ME:L1
- <DUP> -> DUP
+ - For SVs yield either "complex" or the SV type defined in the ALT
+ fields (removing the brackets). E.g.::
+
+ <DEL> -> DEL
+ <INS:ME:L1> -> INS:ME:L1
+ <DUP> -> DUP
The logic is meant to follow the rules outlined in the following
paragraph at:
=====================================
vcf/parser.py
=====================================
--- a/vcf/parser.py
+++ b/vcf/parser.py
@@ -1,10 +1,11 @@
+import codecs
import collections
-import re
import csv
import gzip
-import sys
import itertools
-import codecs
+import os
+import re
+import sys
try:
from collections import OrderedDict
@@ -62,12 +63,13 @@ SINGULAR_METADATA = ['fileformat', 'fileDate', 'reference']
# Conversion between value in file and Python value
field_counts = {
'.': None, # Unknown number of values
- 'A': -1, # Equal to the number of alleles in a given record
+ 'A': -1, # Equal to the number of alternate alleles in a given record
'G': -2, # Equal to the number of genotypes in a given record
+ 'R': -3, # Equal to the number of alleles including reference in a given record
}
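The ``field_counts`` table maps the spec's ``Number=`` codes to sentinel integers, while fixed counts pass through as plain ints. A small sketch mirroring the ``vcf_field_count`` lookup:

```python
# Sentinels for per-record counts, as in the table above.
field_counts = {'.': None, 'A': -1, 'G': -2, 'R': -3}

def vcf_field_count(num_str):
    """Map a header Number= value to an int, a sentinel, or None."""
    if num_str is None:
        return None                     # e.g. contig lines without a length
    if num_str not in field_counts:
        return int(num_str)             # fixed, explicit count
    return field_counts[num_str]
```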
-_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc'])
+_Info = collections.namedtuple('Info', ['id', 'num', 'type', 'desc', 'source', 'version'])
_Filter = collections.namedtuple('Filter', ['id', 'desc'])
_Alt = collections.namedtuple('Alt', ['id', 'desc'])
_Format = collections.namedtuple('Format', ['id', 'num', 'type', 'desc'])
@@ -80,36 +82,39 @@ class _vcf_metadata_parser(object):
def __init__(self):
super(_vcf_metadata_parser, self).__init__()
self.info_pattern = re.compile(r'''\#\#INFO=<
- ID=(?P<id>[^,]+),
- Number=(?P<number>-?\d+|\.|[AG]),
- Type=(?P<type>Integer|Float|Flag|Character|String),
+ ID=(?P<id>[^,]+),\s*
+ Number=(?P<number>-?\d+|\.|[AGR]),\s*
+ Type=(?P<type>Integer|Float|Flag|Character|String),\s*
Description="(?P<desc>[^"]*)"
+ (?:,\s*Source="(?P<source>[^"]*)")?
+ (?:,\s*Version="?(?P<version>[^"]*)"?)?
>''', re.VERBOSE)
self.filter_pattern = re.compile(r'''\#\#FILTER=<
- ID=(?P<id>[^,]+),
+ ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.alt_pattern = re.compile(r'''\#\#ALT=<
- ID=(?P<id>[^,]+),
+ ID=(?P<id>[^,]+),\s*
Description="(?P<desc>[^"]*)"
>''', re.VERBOSE)
self.format_pattern = re.compile(r'''\#\#FORMAT=<
- ID=(?P<id>.+),
- Number=(?P<number>-?\d+|\.|[AG]),
- Type=(?P<type>.+),
+ ID=(?P<id>.+),\s*
+ Number=(?P<number>-?\d+|\.|[AGR]),\s*
+ Type=(?P<type>.+),\s*
Description="(?P<desc>.*)"
>''', re.VERBOSE)
self.contig_pattern = re.compile(r'''\#\#contig=<
- ID=(?P<id>[^,]+),
- .*
- length=(?P<length>-?\d+)
+ ID=(?P<id>[^>,]+)
+ (,.*length=(?P<length>-?\d+))?
.*
>''', re.VERBOSE)
self.meta_pattern = re.compile(r'''##(?P<key>.+?)=(?P<val>.+)''')
def vcf_field_count(self, num_str):
"""Cast vcf header numbers to integer or None"""
- if num_str not in field_counts:
+ if num_str is None:
+ return None
+ elif num_str not in field_counts:
# Fixed, specified number
return int(num_str)
else:
@@ -125,7 +130,8 @@ class _vcf_metadata_parser(object):
num = self.vcf_field_count(match.group('number'))
info = _Info(match.group('id'), num,
- match.group('type'), match.group('desc'))
+ match.group('type'), match.group('desc'),
+ match.group('source'), match.group('version'))
return (match.group('id'), info)
@@ -164,31 +170,28 @@ class _vcf_metadata_parser(object):
match.group('type'), match.group('desc'))
return (match.group('id'), form)
-
+
def read_contig(self, contig_string):
'''Read a contig information line.'''
match = self.contig_pattern.match(contig_string)
if not match:
raise SyntaxError(
"One of the contig lines is malformed: %s" % contig_string)
-
length = self.vcf_field_count(match.group('length'))
-
contig = _Contig(match.group('id'), length)
-
return (match.group('id'), contig)
-
def read_meta_hash(self, meta_string):
- items = re.split("[<>]", meta_string)
- # Removing initial hash marks and final equal sign
- key = items[0][2:-1]
+ # assert re.match("##.+=<", meta_string)
+ items = meta_string.split('=', 1)
+ # Removing initial hash marks
+ key = items[0].lstrip('#')
# N.B., items can have quoted values, so cannot just split on comma
val = OrderedDict()
state = 0
k = ''
v = ''
- for c in items[1]:
+ for c in items[1].strip('[<>]'):
if state == 0: # reading item key
if c == '=':
@@ -219,16 +222,20 @@ class _vcf_metadata_parser(object):
def read_meta(self, meta_string):
if re.match("##.+=<", meta_string):
return self.read_meta_hash(meta_string)
- else:
- match = self.meta_pattern.match(meta_string)
- return match.group('key'), match.group('val')
+ match = self.meta_pattern.match(meta_string)
+ if not match:
+ # Spec only allows key=value, but we try to be liberal and
+ # interpret anything else as key=none (and all values are parsed
+ # as strings).
+ return meta_string.lstrip('#'), 'none'
+ return match.group('key'), match.group('val')
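The fallback added above treats any header that is not ``key=value`` as ``key=none``. A minimal sketch of that dispatch, using the same ``meta_pattern`` regex:

```python
import re

meta_pattern = re.compile(r'##(?P<key>.+?)=(?P<val>.+)')

def read_meta(line):
    """Parse a simple ## header line; liberal fallback to 'none'."""
    match = meta_pattern.match(line)
    if not match:
        # Spec only allows key=value; interpret anything else as key=none.
        return line.lstrip('#'), 'none'
    return match.group('key'), match.group('val')
```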
class Reader(object):
""" Reader for a VCF v 4.0 file, an iterator returning ``_Record objects`` """
- def __init__(self, fsock=None, filename=None, compressed=False, prepend_chr=False,
- strict_whitespace=False):
+ def __init__(self, fsock=None, filename=None, compressed=None, prepend_chr=False,
+ strict_whitespace=False, encoding='ascii'):
""" Create a new Reader for a VCF file.
You must specify either fsock (stream) or filename. Gzipped streams
@@ -250,21 +257,26 @@ class Reader(object):
self._reader = fsock
if filename is None and hasattr(fsock, 'name'):
filename = fsock.name
- compressed = compressed or filename.endswith('.gz')
+ if compressed is None:
+ compressed = filename.endswith('.gz')
elif filename:
- compressed = compressed or filename.endswith('.gz')
+ if compressed is None:
+ compressed = filename.endswith('.gz')
self._reader = open(filename, 'rb' if compressed else 'rt')
self.filename = filename
if compressed:
self._reader = gzip.GzipFile(fileobj=self._reader)
if sys.version > '3':
- self._reader = codecs.getreader('ascii')(self._reader)
+ self._reader = codecs.getreader(encoding)(self._reader)
if strict_whitespace:
self._separator = '\t'
else:
self._separator = '\t| +'
+ self._row_pattern = re.compile(self._separator)
+ self._alt_pattern = re.compile('[\[\]]')
+
self.reader = (line.strip() for line in self._reader if line.strip())
#: metadata fields from header (string or hash, depending)
@@ -287,6 +299,7 @@ class Reader(object):
self._prepend_chr = prepend_chr
self._parse_metainfo()
self._format_cache = {}
+ self.encoding = encoding
def __iter__(self):
return self
@@ -301,7 +314,7 @@ class Reader(object):
parser = _vcf_metadata_parser()
- line = self.reader.next()
+ line = next(self.reader)
while line.startswith('##'):
self._header_lines.append(line)
@@ -320,7 +333,7 @@ class Reader(object):
elif line.startswith('##FORMAT'):
key, val = parser.read_format(line)
self.formats[key] = val
-
+
elif line.startswith('##contig'):
key, val = parser.read_contig(line)
self.contigs[key] = val
@@ -334,9 +347,9 @@ class Reader(object):
self.metadata[key] = []
self.metadata[key].append(val)
- line = self.reader.next()
+ line = next(self.reader)
- fields = re.split(self._separator, line[1:])
+ fields = self._row_pattern.split(line[1:])
self._column_headers = fields[:9]
self.samples = fields[9:]
self._sample_indexes = dict([(x,i) for (i,x) in enumerate(self.samples)])
@@ -358,7 +371,7 @@ class Reader(object):
retdict = {}
for entry in entries:
- entry = entry.split('=')
+ entry = entry.split('=', 1)
ID = entry[0]
try:
entry_type = self.infos[ID].type
@@ -389,6 +402,7 @@ class Reader(object):
vals = entry[1].split(',') # commas are reserved characters indicating multiple values
val = self._map(str, vals)
except IndexError:
+ entry_type = 'Flag'
val = True
try:
@@ -430,7 +444,6 @@ class Reader(object):
# check whether we already know how to parse this format
if samp_fmt not in self._format_cache:
self._format_cache[samp_fmt] = self._parse_sample_format(samp_fmt)
-
samp_fmt = self._format_cache[samp_fmt]
if cparse:
@@ -450,7 +463,10 @@ class Reader(object):
for i, vals in enumerate(sample.split(':')):
# short circuit the most common
- if vals == '.' or vals == './.':
+ if samp_fmt._fields[i] == 'GT':
+ sampdat[i] = vals
+ continue
+ elif not vals or vals == ".":
sampdat[i] = None
continue
@@ -494,9 +510,9 @@ class Reader(object):
return samp_data
def _parse_alt(self, str):
- if re.search('[\[\]]', str) is not None:
+ if self._alt_pattern.search(str) is not None:
# Paired breakend
- items = re.split('[\[\]]', str)
+ items = self._alt_pattern.split(str)
remoteCoords = items[1].split(':')
chr = remoteCoords[0]
if chr[0] == '<':
@@ -523,8 +539,8 @@ class Reader(object):
def next(self):
'''Return the next record in the file.'''
- line = self.reader.next()
- row = re.split(self._separator, line.rstrip())
+ line = next(self.reader)
+ row = self._row_pattern.split(line.rstrip())
chrom = row[0]
if self._prepend_chr:
chrom = 'chr' + chrom
@@ -559,6 +575,9 @@ class Reader(object):
fmt = row[8]
except IndexError:
fmt = None
+ else:
+ if fmt == '.':
+ fmt = None
record = _Record(chrom, pos, ID, ref, alt, qual, filt,
info, fmt, self._sample_indexes)
@@ -569,45 +588,60 @@ class Reader(object):
return record
- def fetch(self, chrom, start, end=None):
- """ fetch records from a Tabix indexed VCF, requires pysam
- if start and end are specified, return iterator over positions
- if end not specified, return individual ``_Call`` at start or None
+ def fetch(self, chrom, start=None, end=None):
+ """ Fetches records from a tabix-indexed VCF file and returns an
+ iterable of ``_Record`` instances
+
+ chrom must be specified.
+
+ The start and end coordinates are in the zero-based,
+ half-open coordinate system, similar to ``_Record.start`` and
+ ``_Record.end``. The very first base of a chromosome is
+ index 0, and the region includes bases up to, but not
+ including the base at the end coordinate. For example
+ ``fetch('4', 10, 20)`` would include all variants
+ overlapping a 10 base pair region from the 11th base
+ through the 20th base (which is at index 19) of chromosome
+ 4. It would not include the 21st base (at index 20). See
+ http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms
+ for more information on the zero-based, half-open coordinate
+ system.
+
+ If end is omitted, all variants from start until the end of
+ the chromosome chrom will be included.
+
+ If start and end are omitted, all variants on chrom will be
+ returned.
+
+ requires pysam
+
"""
if not pysam:
raise Exception('pysam not available, try "pip install pysam"?')
-
if not self.filename:
raise Exception('Please provide a filename (or a "normal" fsock)')
if not self._tabix:
- self._tabix = pysam.Tabixfile(self.filename)
+ self._tabix = pysam.Tabixfile(self.filename,
+ encoding=self.encoding)
if self._prepend_chr and chrom[:3] == 'chr':
chrom = chrom[3:]
- # not sure why tabix needs position -1
- start = start - 1
-
- if end is None:
- self.reader = self._tabix.fetch(chrom, start, start + 1)
- try:
- return self.next()
- except StopIteration:
- return None
-
self.reader = self._tabix.fetch(chrom, start, end)
return self
class Writer(object):
- """ VCF Writer """
+ """VCF Writer. On Windows Python 2, open stream with 'wb'."""
# Reverse keys and values in header field count dictionary
counts = dict((v,k) for k,v in field_counts.iteritems())
def __init__(self, stream, template, lineterminator="\n"):
- self.writer = csv.writer(stream, delimiter="\t", lineterminator=lineterminator)
+ self.writer = csv.writer(stream, delimiter="\t",
+ lineterminator=lineterminator,
+ quotechar='', quoting=csv.QUOTE_NONE)
self.template = template
self.stream = stream
@@ -639,7 +673,10 @@ class Writer(object):
for line in template.alts.itervalues():
stream.write(two.format(key="ALT", *line))
for line in template.contigs.itervalues():
- stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+ if line.length:
+ stream.write('##contig=<ID={0},length={1}>\n'.format(*line))
+ else:
+ stream.write('##contig=<ID={0}>\n'.format(*line))
self._write_header()
@@ -699,28 +736,14 @@ class Writer(object):
sorted(info, key=order_key))
def _format_sample(self, fmt, sample):
- try:
- # Try to get the GT value first.
- gt = getattr(sample.data, 'GT')
- # PyVCF stores './.' GT values as None, so we need to revert it back
- # to './.' when writing.
- if gt is None:
- gt = './.'
- except AttributeError:
- # Failing that, try to check whether 'GT' is specified in the FORMAT
- # field. If yes, use the recommended empty value ('./.')
- if 'GT' in fmt:
- gt = './.'
- # Otherwise use an empty string as the value
- else:
- gt = ''
- # If gt is an empty string (i.e. not stored), write all other data
+ if hasattr(sample.data, 'GT'):
+ gt = sample.data.GT
+ else:
+ gt = './.' if 'GT' in fmt else ''
+
if not gt:
return ':'.join([self._stringify(x) for x in sample.data])
- # Otherwise use the GT values from above and combine it with the rest of
- # the data.
- # Note that this follows the VCF spec, where GT is always the first
- # item whenever it is present.
+ # Following the VCF spec, GT is always the first item whenever it is present.
else:
return ':'.join([gt] + [self._stringify(x) for x in sample.data[1:]])
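The simplified ``_format_sample`` keeps GT first whenever it is present. A standalone sketch of the same serialization (hypothetical helper; assumes ``data`` is an ordered sequence with GT, if any, at index 0):

```python
def format_sample(fmt, gt, data):
    """Join sample fields with ':'; GT, when present, stays first."""
    stringify = lambda x: str(x) if x is not None else '.'
    if gt is None:
        # Missing GT: use './.' if FORMAT declares GT, else omit it entirely.
        gt = './.' if 'GT' in fmt else ''
    if not gt:
        return ':'.join(stringify(x) for x in data)
    return ':'.join([gt] + [stringify(x) for x in data[1:]])
```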
=====================================
vcf/sample_filter.py
=====================================
--- /dev/null
+++ b/vcf/sample_filter.py
@@ -0,0 +1,115 @@
+# Author: Lenna X. Peterson
+# github.com/lennax
+# arklenna at gmail dot com
+
+import logging
+import sys
+import warnings
+
+
+from parser import Reader, Writer
+
+
+class SampleFilter(object):
+ """
+ Modifies the vcf Reader to filter each row by sample as it is parsed.
+
+ """
+
+ def __init__(self, infile, outfile=None, filters=None, invert=False):
+ # Methods to add to Reader
+ def get_filter(self):
+ return self._samp_filter
+
+ def set_filter(self, filt):
+ self._samp_filter = filt
+ if filt:
+ self.samples = [val for idx, val in enumerate(self.samples)
+ if idx not in set(filt)]
+
+ def filter_samples(fn):
+ """Decorator function to filter sample parameter"""
+ def filt(self, samples, *args):
+ samples = [val for idx, val in enumerate(samples)
+ if idx not in set(self.sample_filter)]
+ return fn(self, samples, *args)
+ return filt
+
+ # Add property to Reader for filter list
+ Reader.sample_filter = property(get_filter, set_filter)
+ Reader._samp_filter = []
+ # Modify Reader._parse_samples to filter samples
+ self._orig_parse_samples = Reader._parse_samples
+ Reader._parse_samples = filter_samples(Reader._parse_samples)
+ self.parser = Reader(filename=infile)
+ # Store initial samples and indices
+ self.samples = self.parser.samples
+ self.smp_idx = dict([(v, k) for k, v in enumerate(self.samples)])
+ # Properties for filter/writer
+ self.outfile = outfile
+ self.invert = invert
+ self.filters = filters
+ if filters is not None:
+ self.set_filters()
+ self.write()
+
+ def __del__(self):
+ try:
+ self._undo_monkey_patch()
+ except AttributeError:
+ pass
+
+ def set_filters(self, filters=None, invert=False):
+ """Convert filters from string to list of indices, set on Reader"""
+ if filters is not None:
+ self.filters = filters
+ if invert:
+ self.invert = invert
+ filt_l = self.filters.split(",")
+ filt_s = set(filt_l)
+ if len(filt_s) < len(filt_l):
+ warnings.warn("Non-unique filters, ignoring", RuntimeWarning)
+
+ def filt2idx(item):
+ """Convert filter to valid sample index"""
+ try:
+ item = int(item)
+ except ValueError:
+ # not an idx, check if it's a value
+ return self.smp_idx.get(item)
+ else:
+ # is int, check if it's an idx
+ if item < len(self.samples):
+ return item
+ filters = set(filter(lambda x: x is not None, map(filt2idx, filt_s)))
+ if len(filters) < len(filt_s):
+ # TODO print the filters that were ignored
+ warnings.warn("Invalid filters, ignoring", RuntimeWarning)
+
+ if self.invert:
+ filters = set(xrange(len(self.samples))).difference(filters)
+
+ # `sample_filter` setter updates `samples`
+ self.parser.sample_filter = filters
+ if len(self.parser.samples) == 0:
+ warnings.warn("Number of samples to keep is zero", RuntimeWarning)
+ logging.info("Keeping these samples: {0}\n".format(self.parser.samples))
+ return self.parser.samples
+
+ def write(self, outfile=None):
+ if outfile is not None:
+ self.outfile = outfile
+ if self.outfile is None:
+ _out = sys.stdout
+ elif hasattr(self.outfile, 'write'):
+ _out = self.outfile
+ else:
+ _out = open(self.outfile, "wb")
+ logging.info("Writing to '{0}'\n".format(self.outfile))
+ writer = Writer(_out, self.parser)
+ for row in self.parser:
+ writer.write_record(row)
+
+ def _undo_monkey_patch(self):
+ Reader._parse_samples = self._orig_parse_samples
+ delattr(Reader, 'sample_filter')
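The core of ``SampleFilter`` is dropping columns whose index is in a filter set, optionally inverted. A minimal sketch of that index filter, independent of the Reader monkey-patching:

```python
def keep_samples(samples, filt, invert=False):
    """Return samples whose index is not in `filt` (or only those, if inverted)."""
    filt = set(filt)
    if invert:
        filt = set(range(len(samples))) - filt
    return [s for i, s in enumerate(samples) if i not in filt]
```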
=====================================
vcf/test/FT.vcf
=====================================
--- /dev/null
+++ b/vcf/test/FT.vcf
@@ -0,0 +1,50 @@
+##fileformat=VCFv4.2
+##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
+##FILTER=<ID=DP125,Description="DP<125">
+##FILTER=<ID=DP130,Description="DP<130">
+##FILTER=<ID=LowQual,Description="Low quality">
+##FILTER=<ID=Q4800,Description="QUAL<4800.0">
+##FILTER=<ID=Q5000,Description="QUAL<5000.0">
+##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=FT,Number=.,Type=String,Description="Genotype-level filter">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
+##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
+##GATKCommandLine.VariantFiltration=<ID=VariantFiltration,Version=3.6-0-g89b7209,Date="Wed Jul 27 09:50:44 CEST 2016",Epoch=1469605844963,CommandLineOptions="analysis_type=VariantFiltration input_file=[] showFullBamList=false read_buffer_size=null read_filter=[] disable_read_filter=[] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=../ref.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 static_quantized_quals=null round_down_quantized=false disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false no_cmdline_in_header=false sites_only=false never_trim_vcf_format_field=false bcf=false bam_compression=null simplifyBAM=false disable_bam_indexing=false generate_md5=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 reference_window_stop=0 phone_home= gatk_key=null tag=NA logging_level=INFO log_to_file=null help=false version=false variant=(RodBinding name=variant source=10.vcf) mask=(RodBinding name= source=UNBOUND) out=/home/redmar/tmp/example/snps/10_filt.vcf filterExpression=[QUAL<5000.0, QUAL<4800.0] filterName=[Q5000, Q4800] genotypeFilterExpression=[DP<130, DP<125] genotypeFilterName=[DP130, DP125] clusterSize=3 clusterWindowSize=0 maskExtension=0 maskName=Mask filterNotInMask=false missingValuesInExpressionsShouldEvaluateAsFailing=false invalidatePreviousFilters=false invertFilterExpression=false invertGenotypeFilterExpression=false setFilteredGtToNocall=false filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
+##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
+##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+##contig=<ID=ref,length=4888768>
+##reference=file://../ref.fasta
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1 2 3 4 5
+ref 63393 . C A 29454.60 . AC=5;AF=1.00;AN=5;DP=719;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.67;SOR=0.965 GT:AD:DP:GQ:PL 1:0,166:166:99:6740,0 1:0,142:142:99:5824,0 1:0,134:134:99:5616,0 1:0,122:122:99:4930,0 1:0,155:155:99:6371,0
+ref 65903 . AATTGCGCTG A 7340.57 PASS AC=1;AF=0.200;AN=5;DP=524;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=34.04;SOR=1.091 GT:AD:DP:FT:GQ:PL 1:0,164:164:PASS:99:7383,0 0:95,0:95:DP125;DP130:99:0,1800 0:88,0:88:DP125;DP130:99:0,1800 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
+ref 70837 . C A 4711.61 Q4800;Q5000 AC=1;AF=0.200;AN=5;DP=512;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=27.64;SOR=0.726 GT:AD:DP:FT:GQ:PL 0:121,0:121:DP125;DP130:99:0,1800 0:95,0:95:DP125;DP130:99:0,1800 1:0,120:120:DP125;DP130:99:4745,0 0:87,0:87:DP125;DP130:99:0,1800 0:89,0:89:DP125;DP130:99:0,1800
+ref 71448 . C T 31134.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.22;ClippingRankSum=0.00;DP=768;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.43;ReadPosRankSum=2.03;SOR=0.295 GT:AD:DP:FT:GQ:PL 1:0,147:147:PASS:99:5996,0 1:1,183:184:PASS:99:7501,0 1:0,113:113:DP125;DP130:99:4559,0 1:0,161:161:PASS:99:6436,0 1:0,163:163:PASS:99:6669,0
+ref 104257 . C T 5521.61 PASS AC=1;AF=0.200;AN=5;DP=506;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;QD=29.45;SOR=0.854 GT:AD:DP:FT:GQ:PL 0:101,0:101:DP125;DP130:99:0,1800 0:109,0:109:DP125;DP130:99:0,1800 1:0,132:132:PASS:99:5555,0 0:67,0:67:DP125;DP130:99:0,1800 0:97,0:97:DP125;DP130:99:0,1800
+ref 140658 . C A 32467.60 PASS AC=5;AF=1.00;AN=5;BaseQRankSum=2.24;ClippingRankSum=0.00;DP=801;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;MQRankSum=0.00;QD=29.65;ReadPosRankSum=1.27;SOR=0.346 GT:AD:DP:GQ:PL 1:0,170:170:99:6854,0 1:0,198:198:99:8098,0 1:0,136:136:99:5554,0 1:0,141:141:99:5661,0 1:1,155:156:99:6327,0
+ref 147463 . C A 4885.61 Q5000 AC=1;AF=0.200;AN=5;BaseQRankSum=-7.720e-01;ClippingRankSum=0.00;DP=503;FS=0.000;MLEAC=1;MLEAF=0.200;MQ=60.00;MQRankSum=0.00;QD=35.03;ReadPosRankSum=-6.950e-01;SOR=0.278 GT:AD:DP:FT:GQ:PL 0:97,0:97:DP125;DP130:99:0,1800 0:104,0:104:DP125;DP130:99:0,1800 0:84,0:84:DP125;DP130:99:0,1800 1:1,128:129:DP130:99:4919,0 0:89,0:89:DP125;DP130:99:0,1800
+ref 154578 . A G 32015.60 PASS AC=5;AF=1.00;AN=5;DP=776;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=25.82;SOR=0.902 GT:AD:DP:GQ:PL 1:0,152:152:99:6300,0 1:0,183:183:99:7608,0 1:0,137:137:99:5713,0 1:0,148:148:99:6040,0 1:0,156:156:99:6381,0
+ref 203200 . C T 30880.60 PASS AC=5;AF=1.00;AN=5;DP=752;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=29.65;SOR=0.878 GT:AD:DP:FT:GQ:PL 1:0,161:161:PASS:99:6708,0 1:0,185:185:PASS:99:7602,0 1:0,136:136:PASS:99:5602,0 1:0,126:126:DP130:99:5080,0 1:0,144:144:PASS:99:5915,0
+ref 231665 . C T 30074.60 PASS AC=5;AF=1.00;AN=5;DP=735;FS=0.000;MLEAC=5;MLEAF=1.00;MQ=60.00;QD=33.23;SOR=0.938 GT:AD:DP:FT:GQ:PL 1:0,191:191:PASS:99:7867,0 1:0,159:159:PASS:99:6431,0 1:0,130:130:PASS:99:5299,0 1:0,129:129:DP130:99:5290,0 1:0,126:126:DP130:99:5214,0
=====================================
vcf/test/issue-214.vcf
=====================================
--- /dev/null
+++ b/vcf/test/issue-214.vcf
@@ -0,0 +1,32 @@
+##fileformat=VCFv4.1
+##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
+##FILTER=<ID=LowQual,Description="Low quality">
+##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
+##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
+##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
+##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
+##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
+##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
+##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
+##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
+##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
+##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
+##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
+##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
+##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
+##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
+##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
+##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
+##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
+##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
+##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2
+1 456904 . T C,* 6162.77 . AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800
+1 456940 . * C,T 6162.77 . AC=1,1;AF=8.333e-03,8.333e-03;AN=120;DP=7693;FS=0.000;MLEAC=1,1;MLEAF=8.333e-03,8.333e-03;MQ=60.00;QD=31.36;SOR=0.976 GT:AD:DP:GQ:PL 0:106,0,0:106:99:0,1800,1800 0:110,0,0:110:99:0,1800,1800
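The issue-214 fixture above exercises spanning-deletion (`*`) ALT alleles, including one record whose REF itself is `*`. As a minimal sketch of the classification behaviour the new `TestIssue214` cases assert (a hypothetical helper, not PyVCF's actual implementation):

```python
def is_snp(ref, alts):
    """Sketch of the SNP test the fixture records exercise: a record
    counts as a SNP when REF is one base and every non-spanning-deletion
    ALT is one base; the '*' allele is ignored for classification.
    Hypothetical helper mirroring what TestIssue214 asserts."""
    alleles = [a for a in alts if a != '*']
    return len(ref) == 1 and all(len(a) == 1 for a in alleles)

# The two fixture records: REF=T ALT=C,* and REF=* ALT=C,T
print(is_snp('T', ['C', '*']))  # True
print(is_snp('*', ['C', 'T']))  # True
```

Both records classify as `snp` under this reading, which is exactly what the four `test_issue_214_*` cases in the test diff below check.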
=====================================
vcf/test/strelka.vcf
=====================================
--- /dev/null
+++ b/vcf/test/strelka.vcf
@@ -0,0 +1,57 @@
+##fileformat=VCFv4.1
+##FILTER=<ID=BCNoise,Description="Average fraction of filtered basecalls within 50 bases of the indel exceeds 0.3">
+##FILTER=<ID=QSI_ref,Description="Normal sample is not homozygous ref or sindel Q-score < 30, ie calls with NT!=ref or QSI_NT < 30">
+##FILTER=<ID=QSS_ref,Description="Normal sample is not homozygous ref or ssnv Q-score < 15, ie calls with NT!=ref or QSS_NT < 15">
+##FILTER=<ID=Repeat,Description="Sequence repeat of more than 8x in the reference sequence">
+##FILTER=<ID=SpanDel,Description="Fraction of reads crossing site with spanning deletions in either sample exceeeds 0.75">
+##FILTER=<ID=iHpol,Description="Indel overlaps an interrupted homopolymer longer than 14x in the reference sequence">
+##FORMAT=<ID=AU,Number=2,Type=Integer,Description="Number of 'A' alleles used in tiers 1,2">
+##FORMAT=<ID=CU,Number=2,Type=Integer,Description="Number of 'C' alleles used in tiers 1,2">
+##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth for tier1">
+##FORMAT=<ID=DP2,Number=1,Type=Integer,Description="Read depth for tier2">
+##FORMAT=<ID=DP50,Number=1,Type=Float,Description="Average tier1 read depth within 50 bases">
+##FORMAT=<ID=FDP,Number=1,Type=Integer,Description="Number of basecalls filtered from original read depth for tier1">
+##FORMAT=<ID=FDP50,Number=1,Type=Float,Description="Average tier1 number of basecalls filtered from original read depth within 50 bases">
+##FORMAT=<ID=GU,Number=2,Type=Integer,Description="Number of 'G' alleles used in tiers 1,2">
+##FORMAT=<ID=SDP,Number=1,Type=Integer,Description="Number of reads with deletions spanning this site at tier1">
+##FORMAT=<ID=SUBDP,Number=1,Type=Integer,Description="Number of reads below tier1 mapping quality threshold aligned across this site">
+##FORMAT=<ID=SUBDP50,Number=1,Type=Float,Description="Average number of reads below tier1 mapping quality threshold aligned across sites within 50 bases">
+##FORMAT=<ID=TAR,Number=2,Type=Integer,Description="Reads strongly supporting alternate allele for tiers 1,2">
+##FORMAT=<ID=TIR,Number=2,Type=Integer,Description="Reads strongly supporting indel allele for tiers 1,2">
+##FORMAT=<ID=TOR,Number=2,Type=Integer,Description="Other reads (weak support or insufficient indel breakpoint overlap) for tiers 1,2">
+##FORMAT=<ID=TU,Number=2,Type=Integer,Description="Number of 'T' alleles used in tiers 1,2">
+##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
+##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
+##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
+##INFO=<ID=IC,Number=1,Type=Integer,Description="Number of times RU repeats in the indel allele">
+##INFO=<ID=IHP,Number=1,Type=Integer,Description="Largest reference interrupted homopolymer length intersecting with the indel">
+##INFO=<ID=NT,Number=1,Type=String,Description="Genotype of the normal in all data tiers, as used to classify somatic variants. One of {ref,het,hom,conflict}.">
+##INFO=<ID=OVERLAP,Number=0,Type=Flag,Description="Somatic indel possibly overlaps a second indel.">
+##INFO=<ID=QSI,Number=1,Type=Integer,Description="Quality score for any somatic variant, ie. for the ALT haplotype to be present at a significantly different frequency in the tumor and normal">
+##INFO=<ID=QSI_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
+##INFO=<ID=QSS,Number=1,Type=Integer,Description="Quality score for any somatic snv, ie. for the ALT allele to be present at a significantly different frequency in the tumor and normal">
+##INFO=<ID=QSS_NT,Number=1,Type=Integer,Description="Quality score reflecting the joint probability of a somatic variant and NT">
+##INFO=<ID=RC,Number=1,Type=Integer,Description="Number of times RU repeats in the reference allele">
+##INFO=<ID=RU,Number=1,Type=String,Description="Smallest repeating sequence unit in inserted or deleted sequence">
+##INFO=<ID=SGT,Number=1,Type=String,Description="Most likely somatic genotype excluding normal noise states">
+##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Somatic mutation">
+##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
+##INFO=<ID=TQSI,Number=1,Type=Integer,Description="Data tier used to compute QSI">
+##INFO=<ID=TQSI_NT,Number=1,Type=Integer,Description="Data tier used to compute QSI_NT">
+##INFO=<ID=TQSS,Number=1,Type=Integer,Description="Data tier used to compute QSS">
+##INFO=<ID=TQSS_NT,Number=1,Type=Integer,Description="Data tier used to compute QSS_NT">
+##INFO=<ID=set,Number=1,Type=String,Description="Source VCF for the merged record in CombineVariants">
+##content=strelka somatic indel calls
+##fileDate=20151113
+##germlineIndelTheta=0.0001
+##germlineSnvTheta=0.001
+##priorSomaticIndelRate=1e-06
+##priorSomaticSnvRate=1e-06
+##reference=file:///b37.fasta
+##source=strelka
+##source_version=2.0.17.strelka1
+##startTime=Fri Nov 13 19:38:43 2015
+#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL.variant NORMAL.variant2 TUMOR.variant TUMOR.variant2
+1 1666175 . C T . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=28;QSS_NT=28;SGT=CC->CT;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:42,42:43:0:0,0:0:0:1,1 0,0:45,45:59:0:0,0:0:0:14,14
+1 3750492 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=38;QSS_NT=38;SGT=GG->AG;SOMATIC;TQSS=2;TQSS_NT=2;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:116:0:116,116:0:0:0,0 6,6:0,0:96:0:90,91:0:0:0,0
+1 9117626 . G A . PASS AC=0;AF=0.00;AN=0;NT=ref;QSS=32;QSS_NT=32;SGT=GG->AG;SOMATIC;TQSS=1;TQSS_NT=1;set=variant AU:CU:DP:FDP:GU:SDP:SUBDP:TU 0,0:0,0:165:0:165,166:0:0:0,0 6,6:0,0:132:0:126,127:0:0:0,0
\ No newline at end of file
=====================================
vcf/test/test_vcf.py
=====================================
--- a/vcf/test/test_vcf.py
+++ b/vcf/test/test_vcf.py
@@ -1,13 +1,27 @@
from __future__ import print_function
import unittest
+try:
+ unittest.skip
+except AttributeError:
+ import unittest2 as unittest
import doctest
import os
import commands
import cPickle
from StringIO import StringIO
+import subprocess
+import sys
+
+try:
+ import pysam
+except ImportError:
+ pysam = None
import vcf
-from vcf import utils
+from vcf import model, utils
+
+IS_PYTHON2 = sys.version_info[0] == 2
+IS_NOT_PYPY = 'PyPy' not in sys.version
suite = doctest.DocTestSuite(vcf)
@@ -100,6 +114,37 @@ class TestVcfSpecs(unittest.TestCase):
print(c)
assert c
+ def test_vcf_4_2(self):
+ reader = vcf.Reader(fh('example-4.2.vcf'))
+ self.assertEqual(reader.metadata['fileformat'], 'VCFv4.2')
+
+ # If INFO contains no Source and Version keys, they should be None.
+ self.assertEqual(reader.infos['DP'].source, None)
+ self.assertEqual(reader.infos['DP'].version, None)
+
+ # According to spec, INFO Version key is required to be double quoted,
+ # but at least SAMtools 1.0 does not quote it. So we want to be
+ # forgiving here.
+ self.assertEqual(reader.infos['VDB'].source, None)
+ self.assertEqual(reader.infos['VDB'].version, '3')
+
+ # test we can walk the file at least
+ for r in reader:
+ for c in r:
+ assert c
+
+ def test_contig_idonly(self):
+ """Test VCF inputs with ##contig inputs containing only IDs. produced by bcftools 1.2+
+ """
+ reader = vcf.Reader(fh("contig_idonly.vcf"))
+ for cid, contig in reader.contigs.items():
+ if cid == "1":
+ assert contig.length is None
+ elif cid == "2":
+ assert contig.length == 2000
+ elif cid == "3":
+ assert contig.length == 3000
+
class TestGatkOutput(unittest.TestCase):
filename = 'gatk.vcf'
@@ -184,6 +229,34 @@ class TestBcfToolsOutput(unittest.TestCase):
for s in r.samples:
s.phased
+class TestIssue214(unittest.TestCase):
+ """ See https://github.com/jamescasbon/PyVCF/issues/214 """
+
+ def test_issue_214_is_snp(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ r=next(reader)
+ self.assertTrue(r.is_snp)
+
+ def test_issue_214_var_type(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ r=next(reader)
+ self.assertEqual(r.var_type,'snp')
+
+ # Can the ref even be a spanning deletion?
+ # Note, this does not trigger issue 214, but I've added it here for completeness
+ def test_issue_214_ref_is_del_is_snp(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ next(reader)
+ r=next(reader)
+ self.assertTrue(r.is_snp)
+
+ # Can the ref even be a spanning deletion?
+ # Note, this does not trigger issue 214, but I've added it here for completeness
+ def test_issue_214_ref_is_del_var_type(self):
+ reader=vcf.Reader(fh('issue-214.vcf'))
+ next(reader)
+ r=next(reader)
+ self.assertEqual(r.var_type,'snp')
class Test1kg(unittest.TestCase):
@@ -248,6 +321,16 @@ class TestGoNL(unittest.TestCase):
self.assertEqual(reader.contigs['1'].length, 249250621)
+class TestStringAsFlag(unittest.TestCase):
+
+ def test_string_as_flag(self):
+ """A flag INFO field is declared as string (not allowed by the spec,
+ but seen in practice)."""
+ reader = vcf.Reader(fh('string_as_flag.vcf', 'r'))
+ for _ in reader:
+ pass
+
+
class TestInfoOrder(unittest.TestCase):
def _assert_order(self, definitions, fields):
@@ -310,6 +393,38 @@ class TestInfoTypeCharacter(unittest.TestCase):
self.assertEquals(l.INFO, r.INFO)
+class TestParseMetaLine(unittest.TestCase):
+ def test_parse(self):
+ reader = vcf.Reader(fh('parse-meta-line.vcf'))
+ f = reader.metadata['MYFIELD'][0]
+ self.assertEqual(f['ID'], 'SomeField')
+ self.assertEqual(f['Version'], '3.4-0-g7e26428')
+ self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"')
+ self.assertEqual(f['Options'], '"< 4 and > 3"')
+ next(reader)
+
+ def test_write(self):
+ reader = vcf.Reader(fh('parse-meta-line.vcf'))
+ out = StringIO()
+ writer = vcf.Writer(out, reader)
+
+ records = list(reader)
+
+ for record in records:
+ writer.write_record(record)
+ out.seek(0)
+ reader2 = vcf.Reader(out)
+
+ f = reader2.metadata['MYFIELD'][0]
+ self.assertEqual(f['ID'], 'SomeField')
+ self.assertEqual(f['Version'], '3.4-0-g7e26428')
+ self.assertEqual(f['Date'], '"Wed Oct 07 09:11:47 CEST 2015"')
+ self.assertEqual(f['Options'], '"< 4 and > 3"')
+
+ for l, r in zip(records, reader2):
+ self.assertEquals(l.INFO, r.INFO)
+
+
class TestGatkOutputWriter(unittest.TestCase):
def testWrite(self):
@@ -402,6 +517,27 @@ class TestSamplesSpace(unittest.TestCase):
self.assertEqual(self.reader.samples, self.samples)
+class TestMetadataWhitespace(unittest.TestCase):
+ filename = 'metadata-whitespace.vcf'
+ def test_metadata_whitespace(self):
+ """
+ Test parsing metadata header lines with whitespace.
+ """
+ self.reader = vcf.Reader(fh(self.filename))
+
+ # Pick one INFO line and assert that we parsed it correctly.
+ info_indel = self.reader.infos['INDEL']
+ assert info_indel.id == 'INDEL'
+ assert info_indel.num == 0
+ assert info_indel.type == 'Flag'
+ assert info_indel.desc == 'Indicates that the variant is an INDEL.'
+
+ # Test we can walk the file at least.
+ for r in self.reader:
+ for c in r:
+ pass
+
+
class TestMixedFiltering(unittest.TestCase):
filename = 'mixed-filtering.vcf'
def test_mixed_filtering(self):
@@ -426,7 +562,7 @@ class TestRecord(unittest.TestCase):
self.assertEqual(len(var.samples), num_calls)
def test_dunder_eq(self):
- rec = vcf.Reader(fh('example-4.0.vcf')).next()
+ rec = next(vcf.Reader(fh('example-4.0.vcf')))
self.assertFalse(rec == None)
self.assertFalse(None == rec)
@@ -459,6 +595,13 @@ class TestRecord(unittest.TestCase):
self.assertEqual([0.0/6.0], aaf)
elif var.POS == 1234567:
self.assertEqual([2.0/4.0, 1.0/4.0], aaf)
+ reader = vcf.Reader(fh('example-4.1-ploidy.vcf'))
+ for var in reader:
+ aaf = var.aaf
+ if var.POS == 60034:
+ self.assertEqual([4.0/6.0], aaf)
+ elif var.POS == 60387:
+ self.assertEqual([1.0/3.0], aaf)
def test_pi(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
@@ -510,6 +653,24 @@ class TestRecord(unittest.TestCase):
elif var.POS == 1234567:
self.assertEqual(False, is_snp)
+
+ def test_is_snp_for_n_alt(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id1',
+ 'C',
+ [model._Substitution('N')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assertTrue(record.is_snp)
+
+
def test_is_indel(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
for var in reader:
@@ -731,7 +892,7 @@ class TestRecord(unittest.TestCase):
def test_info_multiple_values(self):
reader = vcf.Reader(fh('example-4.1-info-multiple-values.vcf'))
- var = reader.next()
+ var = next(reader)
# check Float type INFO field with multiple values
expected = [19.3, 47.4, 14.0]
actual = var.INFO['RepeatCopies']
@@ -751,11 +912,244 @@ class TestRecord(unittest.TestCase):
self.assertEqual(cPickle.loads(cPickle.dumps(var)), var)
+ def assert_has_expected_coordinates(
+ self,
+ record,
+ expected_coordinates,
+ expected_affected_coordinates
+ ):
+ self.assertEqual(
+ (record.start, record.end),
+ expected_coordinates
+ )
+ self.assertEqual(
+ (record.affected_start, record.affected_end),
+ expected_affected_coordinates
+ )
+
+
+ def test_coordinates_for_snp(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id1',
+ 'C',
+ [model._Substitution('A')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_insertion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id2',
+ 'C',
+ [model._Substitution('CTA')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (10, 10))
+
+
+ def test_coordinates_for_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id3',
+ 'CTA',
+ [model._Substitution('C')],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+
+
+ def test_coordinates_for_None_alt(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id4',
+ 'C',
+ [None],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_multiple_snps(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id5',
+ 'C',
+ [
+ model._Substitution('A'),
+ model._Substitution('G'),
+ model._Substitution('T')
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_insert_and_snp(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id6',
+ 'C',
+ [
+ model._Substitution('GTA'),
+ model._Substitution('G'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+ record = model._Record(
+ '1',
+ 10,
+ 'id7',
+ 'C',
+ [
+ model._Substitution('G'),
+ model._Substitution('GTA'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 10), (9, 10))
+
+
+ def test_coordinates_for_snp_and_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id8',
+ 'CTA',
+ [
+ model._Substitution('C'),
+ model._Substitution('CTG'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+ record = model._Record(
+ '1',
+ 10,
+ 'id9',
+ 'CTA',
+ [
+ model._Substitution('CTG'),
+ model._Substitution('C'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (10, 12))
+
+
+ def test_coordinates_for_insertion_and_deletion(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id10',
+ 'CT',
+ [
+ model._Substitution('CA'),
+ model._Substitution('CTT'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 11), (10, 11))
+ record = model._Record(
+ '1',
+ 10,
+ 'id11',
+ 'CT',
+ [
+ model._Substitution('CTT'),
+ model._Substitution('CA'),
+ ],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 11), (10, 11))
+
+
+ def test_coordinates_for_breakend(self):
+ record = model._Record(
+ '1',
+ 10,
+ 'id12',
+ 'CTA',
+ [model._Breakend('1', 500, False, True, 'GGTC', True)],
+ None,
+ None,
+ {},
+ None,
+ {},
+ None
+ )
+ self.assert_has_expected_coordinates(record, (9, 12), (9, 12))
+
+
class TestCall(unittest.TestCase):
def test_dunder_eq(self):
reader = vcf.Reader(fh('example-4.0.vcf'))
- var = reader.next()
+ var = next(reader)
example_call = var.samples[0]
self.assertFalse(example_call == None)
self.assertFalse(None == example_call)
@@ -808,40 +1202,73 @@ class TestCall(unittest.TestCase):
self.assertEqual([None,1,2], gt_types)
-class TestTabix(unittest.TestCase):
+@unittest.skipUnless(pysam, "test requires installation of PySAM.")
+class TestFetch(unittest.TestCase):
def setUp(self):
self.reader = vcf.Reader(fh('tb.vcf.gz', 'rb'))
- self.run = vcf.parser.pysam is not None
+
+ def assertFetchedExpectedPositions(
+ self, fetched_variants, expected_positions):
+ fetched_positions = [var.POS for var in fetched_variants]
+ self.assertEqual(fetched_positions, expected_positions)
+
+
+ def testNoVariantsInRange(self):
+ fetched_variants = self.reader.fetch('20', 14370, 17329)
+ self.assertFetchedExpectedPositions(fetched_variants, [])
+
+
+ def testNoVariantsForZeroLengthInterval(self):
+ fetched_variants = self.reader.fetch('20', 14369, 14369)
+ self.assertFetchedExpectedPositions(fetched_variants, [])
def testFetchRange(self):
- if not self.run:
- return
- lines = list(self.reader.fetch('20', 14370, 14370))
- self.assertEquals(len(lines), 1)
- self.assertEqual(lines[0].POS, 14370)
+ fetched_variants = self.reader.fetch('20', 14369, 14370)
+ self.assertFetchedExpectedPositions(fetched_variants, [14370])
+
+ fetched_variants = self.reader.fetch('20', 14369, 17330)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [14370, 17330])
- lines = list(self.reader.fetch('20', 14370, 17330))
- self.assertEquals(len(lines), 2)
- self.assertEqual(lines[0].POS, 14370)
- self.assertEqual(lines[1].POS, 17330)
+ fetched_variants = self.reader.fetch('20', 1110695, 1234567)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [1110696, 1230237, 1234567])
- lines = list(self.reader.fetch('20', 1110695, 1234567))
- self.assertEquals(len(lines), 3)
+ def testFetchesFromStartIfStartOnlySpecified(self):
+ fetched_variants = self.reader.fetch('20', 1110695)
+ self.assertFetchedExpectedPositions(
+ fetched_variants, [1110696, 1230237, 1234567])
- def testFetchSite(self):
- if not self.run:
- return
- site = self.reader.fetch('20', 14370)
- self.assertEqual(site.POS, 14370)
- site = self.reader.fetch('20', 14369)
- assert site is None
+ def testFetchesAllFromChromIfOnlyChromSpecified(self):
+ fetched_variants = self.reader.fetch('20')
+ self.assertFetchedExpectedPositions(
+ fetched_variants,
+ [14370, 17330, 1110696, 1230237, 1234567]
+ )
+@unittest.skipUnless(pysam, "test requires installation of PySAM.")
+class TestIssue201(unittest.TestCase):
+ def setUp(self):
+ # This file contains some non-ASCII characters in a UTF-8 encoding.
+ # https://github.com/jamescasbon/PyVCF/issues/201
+ self.reader = vcf.Reader(fh('issue-201.vcf.gz', 'rb'),
+ encoding='utf-8')
+
+ def testIterate(self):
+ for record in self.reader:
+ # Should not raise decoding errors.
+ pass
+
+ def testFetch(self):
+ for record in self.reader.fetch(chrom='17'):
+ # Should not raise decoding errors.
+ pass
class TestOpenMethods(unittest.TestCase):
@@ -870,12 +1297,61 @@ class TestOpenMethods(unittest.TestCase):
self.assertEqual(self.samples, r.samples)
+class TestSampleFilter(unittest.TestCase):
+ @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3")
+ def testCLIListSamples(self):
+ proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ out, err = proc.communicate()
+ self.assertEqual(proc.returncode, 0)
+ self.assertFalse(err)
+ expected_out = ['Samples:', '0: NA00001', '1: NA00002', '2: NA00003']
+ self.assertEqual(out.splitlines(), expected_out)
+
+ @unittest.skipUnless(IS_PYTHON2, "test broken for Python 3")
+ def testCLIWithFilter(self):
+ proc = subprocess.Popen('python scripts/vcf_sample_filter.py vcf/test/example-4.1.vcf -f 1,2 --quiet', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+ out, err = proc.communicate()
+ self.assertEqual(proc.returncode, 0)
+ self.assertTrue(out)
+ self.assertFalse(err)
+ buf = StringIO()
+ buf.write(out)
+ buf.seek(0)
+ #print(buf.getvalue())
+ reader = vcf.Reader(buf)
+ self.assertEqual(reader.samples, ['NA00001'])
+ rec = next(reader)
+ self.assertEqual(len(rec.samples), 1)
+
+ @unittest.skipUnless(IS_NOT_PYPY, "test broken for PyPy")
+ def testSampleFilterModule(self):
+ # init filter with filename, get list of samples
+ filt = vcf.SampleFilter('vcf/test/example-4.1.vcf')
+ self.assertEqual(filt.samples, ['NA00001', 'NA00002', 'NA00003'])
+ # set filter, check which samples will be kept
+ filtered = filt.set_filters(filters="0", invert=True)
+ self.assertEqual(filtered, ['NA00001'])
+ # write filtered file to StringIO
+ buf = StringIO()
+ filt.write(buf)
+ buf.seek(0)
+ #print(buf.getvalue())
+ # undo monkey patch by destroying instance
+ del filt
+ self.assertTrue('sample_filter' not in dir(vcf.Reader))
+ # read output
+ reader = vcf.Reader(buf)
+ self.assertEqual(reader.samples, ['NA00001'])
+ rec = next(reader)
+ self.assertEqual(len(rec.samples), 1)
+
+
class TestFilter(unittest.TestCase):
+ @unittest.skip("test currently broken")
def testApplyFilter(self):
# FIXME: broken with distribute
- return
s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 test/example-4.0.vcf sq')
#print(out)
self.assertEqual(s, 0)
@@ -903,9 +1379,9 @@ class TestFilter(unittest.TestCase):
self.assertEqual(n, 2)
+ @unittest.skip("test currently broken")
def testApplyMultipleFilters(self):
# FIXME: broken with distribute
- return
s, out = commands.getstatusoutput('python scripts/vcf_filter.py --site-quality 30 '
'--genotype-quality 50 test/example-4.0.vcf sq mgq')
self.assertEqual(s, 0)
@@ -925,7 +1401,7 @@ class TestRegression(unittest.TestCase):
def test_issue_16(self):
reader = vcf.Reader(fh('issue-16.vcf'))
- n = reader.next()
+ n = next(reader)
assert n.QUAL == None
def test_null_mono(self):
@@ -940,7 +1416,7 @@ class TestRegression(unittest.TestCase):
out.seek(0)
print(out.getvalue())
p2 = vcf.Reader(out)
- rec = p2.next()
+ rec = next(p2)
assert rec.samples
@@ -1014,26 +1490,105 @@ class TestGATKMeta(unittest.TestCase):
self.assertEqual(reader.metadata['GATKCommandLine'][1]['CommandLineOptions'], '"analysis_type=VariantAnnotator annotation=[HomopolymerRun, VariantType, TandemRepeatAnnotator]"')
+
+class TestUncalledGenotypes(unittest.TestCase):
+ """Test the handling of uncalled (., ./.) sample genotypes."""
+
+ def test_read_uncalled(self):
+ """Test that uncalled genotypes are properly read into
+ gt_nums, gt_bases, ploidity, and gt_alleles properties
+ of _Call objects. For uncalled _Call objects:
+
+ - gt_nums should be None
+ - gt_bases should be None
+ - ploidity should match the input ploidity
+ - gt_alleles should be a list of None's with length
+ matching the ploidity"""
+
+ reader = vcf.Reader(fh('uncalled_genotypes.vcf'))
+ for var in reader:
+ gt_bases = [s.gt_bases for s in var.samples]
+ gt_nums = [s.gt_nums for s in var.samples]
+ ploidity = [s.ploidity for s in var.samples]
+ gt_alleles = [s.gt_alleles for s in var.samples]
+
+ if var.POS == 14370:
+ self.assertEqual(['0|0', None, '1/1'], gt_nums)
+ self.assertEqual(['G|G', None, 'A/A'], gt_bases)
+ self.assertEqual([2,2,2], ploidity)
+ self.assertEqual([['0','0'], [None,None], ['1','1']], gt_alleles)
+ elif var.POS == 17330:
+ self.assertEqual([None, '0|1', '0/0'], gt_nums)
+ self.assertEqual([None, 'T|A', 'T/T'], gt_bases)
+ self.assertEqual([3,2,2], ploidity)
+ self.assertEqual([[None,None,None], ['0','1'], ['0','0']], gt_alleles)
+ elif var.POS == 1234567:
+ self.assertEqual(['0/1', '0/2', None], gt_nums)
+ self.assertEqual(['GTC/G', 'GTC/GTCT', None], gt_bases)
+ self.assertEqual([2,2,1], ploidity)
+ self.assertEqual([['0','1'], ['0','2'], [None]], gt_alleles)
+ reader._reader.close()
+
+
+ def test_write_uncalled(self):
+ """Test that uncalled genotypes are written just as
+ they were read in the input file."""
+
+ reader = vcf.Reader(fh('uncalled_genotypes.vcf'))
+
+ # Write all reader records to a stream.
+ out = StringIO()
+ writer = vcf.Writer(out, reader, lineterminator='\n')
+ for record in reader:
+ writer.write_record(record)
+ reader._reader.close()
+
+
+ # Compare the written stream to the input reader line-by-line.
+ out.seek(0)
+ out_lines = out.getvalue().split('\n')
+ in_file = fh('uncalled_genotypes.vcf')
+ in_lines = [l.rstrip('\n') for l in in_file]
+ in_file.close()
+ for (in_line, out_line) in zip(in_lines, out_lines):
+ self.assertEqual(in_line,out_line)
+
+class TestStrelka(unittest.TestCase):
+
+ def test_strelka(self):
+ reader = vcf.Reader(fh('strelka.vcf'))
+ n = next(reader)
+ assert n is not None
+
+
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestVcfSpecs))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFreebayesOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamtoolsOutput))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutput))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue214))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kg))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(Test1kgSites))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGoNL))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStringAsFlag))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoOrder))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestInfoTypeCharacter))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestParseMetaLine))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGatkOutputWriter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestBcfToolsOutputWriter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestWriterDictionaryMeta))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSamplesSpace))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMetadataWhitespace))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestMixedFiltering))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRecord))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestCall))
-suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestTabix))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFetch))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestIssue201))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestOpenMethods))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestSampleFilter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestFilter))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestRegression))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUtils))
suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestGATKMeta))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestUncalledGenotypes))
+suite.addTests(unittest.TestLoader().loadTestsFromTestCase(TestStrelka))
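A recurring pattern in this diff is replacing the old `if not self.run: return` guards with `@unittest.skipUnless(...)` class decorators, so pysam-dependent tests are reported as skips instead of silently passing. A self-contained sketch of that pattern (the class name here is hypothetical):

```python
import unittest

try:
    import pysam  # optional dependency, as in the TestFetch diff above
except ImportError:
    pysam = None

@unittest.skipUnless(pysam, "test requires installation of PySAM.")
class TestNeedsPysam(unittest.TestCase):
    """Skipped as a whole when pysam is missing; every method inside
    is recorded as a skip in the test report."""
    def test_something(self):
        self.assertTrue(True)
```

Either way, with or without pysam installed, the run is successful: the test either passes or is counted as an explicit skip.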
=====================================
vcf/utils.py
=====================================
--- a/vcf/utils.py
+++ b/vcf/utils.py
@@ -28,7 +28,7 @@ def walk_together(*readers, **kwargs):
nexts = []
for reader in readers:
try:
- nexts.append(reader.next())
+ nexts.append(next(reader))
except StopIteration:
nexts.append(None)
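The `walk_together` hunk above swaps `reader.next()` for the builtin `next(reader)`: the builtin works on any iterator under both Python 2 and 3, whereas the `.next()` method exists only on Python 2 iterators (Python 3 renamed it to `__next__`). A minimal stand-in for the pattern (hypothetical function name, not the real `walk_together`):

```python
def first_records(*readers):
    """Pull the first item from each iterator, recording None for any
    that is already exhausted -- the same try/except StopIteration
    shape walk_together uses to prime its readers."""
    firsts = []
    for reader in readers:
        try:
            firsts.append(next(reader))  # builtin: portable across 2/3
        except StopIteration:
            firsts.append(None)
    return firsts

print(first_records(iter([1, 2]), iter([])))  # [1, None]
```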
View it on GitLab: https://salsa.debian.org/med-team/python-pyvcf/compare/fa802f797e730e317890874d3d35cd1895a22c8d...3f9805e00a59a7676a17df7f7ec6f52edf8081bc