[med-svn] [Git][med-team/paleomix][master] 8 commits: New upstream version 1.2.14

Andreas Tille gitlab at salsa.debian.org
Tue Dec 17 08:09:17 GMT 2019



Andreas Tille pushed to branch master at Debian Med / paleomix


Commits:
72ba2085 by Andreas Tille at 2019-12-17T07:45:11Z
New upstream version 1.2.14
- - - - -
eec75592 by Andreas Tille at 2019-12-17T07:45:11Z
routine-update: New upstream version

- - - - -
36e02f07 by Andreas Tille at 2019-12-17T07:45:26Z
Update upstream source from tag 'upstream/1.2.14'

Update to upstream version '1.2.14'
with Debian dir 85aa9709385c29641b2799a674ae563d0360c88c
- - - - -
8e4c0e55 by Andreas Tille at 2019-12-17T07:45:36Z
routine-update: Standards-Version: 4.4.1

- - - - -
c9edf295 by Andreas Tille at 2019-12-17T07:45:36Z
R-U: autopkgtest: s/ADTTMP/AUTOPKGTEST_TMP/g

- - - - -
a55a44c4 by Andreas Tille at 2019-12-17T07:46:05Z
Set upstream metadata fields: Repository, Repository-Browse.
- - - - -
a41e449a by Andreas Tille at 2019-12-17T07:57:29Z
Cleanup changelog

- - - - -
d494bff3 by Andreas Tille at 2019-12-17T08:08:39Z
Add TODO

- - - - -


20 changed files:

- CHANGES.md
- debian/changelog
- debian/control
- debian/tests/run-unit-test
- debian/upstream/metadata
- docs/conf.py
- paleomix/__init__.py
- paleomix/nodes/gatk.py
- paleomix/nodes/picard.py
- paleomix/tools/bam_pipeline/nodes.py
- paleomix/tools/zonkey/build_tped.py
- paleomix/tools/zonkey/common.py
- paleomix/tools/zonkey/config.py
- paleomix/tools/zonkey/database.py
- paleomix/tools/zonkey/parts/admixture.py
- paleomix/tools/zonkey/parts/common.py
- paleomix/tools/zonkey/parts/nuclear.py
- paleomix/tools/zonkey/parts/report.py
- paleomix/tools/zonkey/parts/summary.py
- paleomix/tools/zonkey/pipeline.py


Changes:

=====================================
CHANGES.md
=====================================
@@ -1,5 +1,38 @@
 # Changelog
 
+## [1.2.14] - 2019-12-01
+### Changed
+  - Improved handling of K-groups in zonkey database files
+  - Change BAM pipeline version requirement for GATK to < v4.0, as the
+    Indel Realigner has been removed in GATK v4.0
+
+### Fixed
+  - Fixed version detection of GATK for v4.0 (issue #23)
+
+
+## [1.2.13.8] - 2019-10-27
+### Changed
+  - Zonkey now identifies nuclear chromosomes by size instead of name; this
+    is done to better handle FASTAs downloaded from different sources
+
+
+## [1.2.13.7] - 2019-10-15
+### Fixed
+  - Fixed handling of digit only chromosome names in Zonkey
+  - Remove dashes from Zonkey MT genomes when running 'mito' command
+
+
+## [1.2.13.6] - 2019-10-13
+### Fixed
+  - Handle .*miss files created by some versions of plink in Zonkey
+
+
+## [1.2.13.5] - 2019-09-29
+### Fixed
+  - Ignore the ValidateSamFile REF_SEQ_TOO_LONG_FOR_BAI warning when
+    processing genomes with contigs too large for BAI index files.
+
+
 ## [1.2.13.4] - 2019-03-25
 ### Fixed
   - Improved detection of Picard versions in cases where 'java'
@@ -587,7 +620,12 @@ the (partially) updated documentation now hosted on ReadTheDocs.
   - Switching to more traditional version-number tracking.
 
 
-[Unreleased]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.4...HEAD
+[Unreleased]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.14...HEAD
+[1.2.14]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.8...v1.2.14
+[1.2.13.8]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.7...v1.2.13.8
+[1.2.13.7]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.6...v1.2.13.7
+[1.2.13.6]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.5...v1.2.13.6
+[1.2.13.5]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.4...v1.2.13.5
 [1.2.13.4]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.3...v1.2.13.4
 [1.2.13.3]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.2...v1.2.13.3
 [1.2.13.2]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.1...v1.2.13.2


=====================================
debian/changelog
=====================================
@@ -1,12 +1,15 @@
-paleomix (1.2.13.4-1) UNRELEASED; urgency=medium
+paleomix (1.2.14-1) UNRELEASED; urgency=medium
 
   * New upstream version
   * debhelper-compat 12
-  * Standards-Version: 4.4.0
+  * Standards-Version: 4.4.1
   * Use secure URI in Homepage field.
-  TODO: see https://github.com/MikkelSchubert/paleomix/issues/15
+  * autopkgtest: s/ADTTMP/AUTOPKGTEST_TMP/g
+  * Set upstream metadata fields: Repository, Repository-Browse.
+  TODO: Python3 port, see
+       https://github.com/MikkelSchubert/paleomix/issues/15
 
- -- Andreas Tille <tille at debian.org>  Wed, 11 Sep 2019 11:21:12 +0200
+ -- Andreas Tille <tille at debian.org>  Tue, 17 Dec 2019 08:45:11 +0100
 
 paleomix (1.2.13.3-1) unstable; urgency=medium
 


=====================================
debian/control
=====================================
@@ -15,7 +15,7 @@ Build-Depends: debhelper-compat (= 12),
                default-jre-headless,
                bowtie2,
                rsync
-Standards-Version: 4.4.0
+Standards-Version: 4.4.1
 Vcs-Browser: https://salsa.debian.org/med-team/paleomix
 Vcs-Git: https://salsa.debian.org/med-team/paleomix.git
 Homepage: https://geogenetics.ku.dk/publications/paleomix


=====================================
debian/tests/run-unit-test
=====================================
@@ -3,14 +3,14 @@ set -e
 
 pkg="paleomix"
 
-if [ "$ADTTMP" = "" ] ; then
-  ADTTMP=`mktemp -d /tmp/${pkg}-test.XXXXXX`
-#  trap "rm -rf $ADTTMP" 0 INT QUIT ABRT PIPE TERM
+if [ "$AUTOPKGTEST_TMP" = "" ] ; then
+  AUTOPKGTEST_TMP=`mktemp -d /tmp/${pkg}-test.XXXXXX`
+#  trap "rm -rf $AUTOPKGTEST_TMP" 0 INT QUIT ABRT PIPE TERM
 fi
 
-cp -a /usr/share/doc/${pkg}/tests $ADTTMP
+cp -a /usr/share/doc/${pkg}/tests $AUTOPKGTEST_TMP
 
-cd $ADTTMP
+cd $AUTOPKGTEST_TMP
 
 for gz in `find . -name "*.gz"` ; do
     # fasta_file.fasta.gz needs to stay compressed for testing


=====================================
debian/upstream/metadata
=====================================
@@ -1,25 +1,27 @@
 Reference:
- Author: >
-  Mikkel Schubert and Luca Ermini and Clio Der Sarkissian and Hákon Jónsson and
-  Aurélien Ginolhac and Robert Schaefer and Michael D Martin and Ruth Fernández
-  and Martin Kircher and Molly McCue and Eske Willerslev and Ludovic Orlando
- Title: >
-  Characterization of ancient and modern genomes by SNP detection and
-  phylogenomic and metagenomic analysis using PALEOMIX
- Journal: Nature Protocols
- Year: 2014
- Volume: 9
- Number: 5
- Pages: 1056-82
- DOI: 10.1038/nprot.2014.063
- PMID: 24722405
- URL: http://www.nature.com/nprot/journal/v9/n5/full/nprot.2014.063.html
+  Author: >
+    Mikkel Schubert and Luca Ermini and Clio Der Sarkissian and Hákon Jónsson and
+    Aurélien Ginolhac and Robert Schaefer and Michael D Martin and Ruth Fernández
+    and Martin Kircher and Molly McCue and Eske Willerslev and Ludovic Orlando
+  Title: >
+    Characterization of ancient and modern genomes by SNP detection and
+    phylogenomic and metagenomic analysis using PALEOMIX
+  Journal: Nature Protocols
+  Year: 2014
+  Volume: 9
+  Number: 5
+  Pages: 1056-82
+  DOI: 10.1038/nprot.2014.063
+  PMID: 24722405
+  URL: http://www.nature.com/nprot/journal/v9/n5/full/nprot.2014.063.html
 Registry:
- - Name: OMICtools
-   Entry: OMICS_03749
- - Name: SciCrunch
-   Entry: SCR_015057
- - Name: bio.tools
-   Entry: PALEOMIX
- - Name: conda:bioconda
-   Entry: NA
+- Name: OMICtools
+  Entry: OMICS_03749
+- Name: SciCrunch
+  Entry: SCR_015057
+- Name: bio.tools
+  Entry: PALEOMIX
+- Name: conda:bioconda
+  Entry: NA
+Repository: https://github.com/MikkelSchubert/paleomix.git
+Repository-Browse: https://github.com/MikkelSchubert/paleomix


=====================================
docs/conf.py
=====================================
@@ -57,7 +57,7 @@ author = u'Mikkel Schubert'
 # The short X.Y version.
 version = u'1.2'
 # The full version, including alpha/beta/rc tags.
-release = u'1.2.13'
+release = u'1.2.14'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.


=====================================
paleomix/__init__.py
=====================================
@@ -21,8 +21,8 @@
 # SOFTWARE.
 #
 
-__version_info__ = (1, 2, 13, 3)
-__version__ = '%i.%i.%i.%i' % __version_info__
+__version_info__ = (1, 2, 14)
+__version__ = "%i.%i.%i" % __version_info__
 
 
 def run(command=None):


=====================================
paleomix/nodes/gatk.py
=====================================
@@ -45,6 +45,9 @@ from paleomix.nodes.bwa import \
 import paleomix.common.versions as versions
 
 
+_RE_GATK_VERSION = r"^(?:The Genome Analysis Toolkit \(GATK\) v)?(\d+)\.(\d+)"
+
+
 def _get_gatk_version_check(config):
     """Returns a version-check object for the "GenomeAnalysisTK.jar" located at
     config.jar_root; for now, this check only serves to verify that the JAR can
@@ -60,8 +63,8 @@ def _get_gatk_version_check(config):
         # Any version is fine; for now just catch old JREs
         requirement = versions.Requirement(call=params.finalized_call,
                                            name="GenomeAnalysisTK",
-                                           search=r"^(\d+)\.(\d+)",
-                                           checks=versions.Any())
+                                           search=_RE_GATK_VERSION,
+                                           checks=versions.LT(4, 0))
         _GATK_VERSION[jar_file] = requirement
     return _GATK_VERSION[jar_file]
 _GATK_VERSION = {}
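
Note on the version check above: the new _RE_GATK_VERSION pattern accepts both the bare version strings printed by older GATK releases and the banner line printed by GATK 4, while the requirement now rejects anything at or above 4.0. A minimal, standalone sketch (the two version strings are illustrative examples, not taken from this diff):

    import re

    _RE_GATK_VERSION = r"^(?:The Genome Analysis Toolkit \(GATK\) v)?(\d+)\.(\d+)"

    # Bare version string, as printed by GATK 3.x (illustrative)
    print(re.search(_RE_GATK_VERSION, "3.8-1-0-gf15c1c3ef").groups())
    # -> ('3', '8')

    # Banner-style line, as printed by GATK 4 (illustrative)
    print(re.search(_RE_GATK_VERSION,
                    "The Genome Analysis Toolkit (GATK) v4.1.4.1").groups())
    # -> ('4', '1')

With the checks argument changed from versions.Any() to versions.LT(4, 0), the second match now fails the requirement, which is how the pipeline enforces the "GATK < v4.0" constraint from the changelog.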


=====================================
paleomix/nodes/picard.py
=====================================
@@ -60,10 +60,13 @@ class PicardNode(CommandNode):
 
 class ValidateBAMNode(PicardNode):
     def __init__(self, config, input_bam, input_index=None, output_log=None,
-                 ignored_checks=(), dependencies=()):
+                 ignored_checks=(), big_genome_mode=False, dependencies=()):
         builder = picard_command(config, "ValidateSamFile")
         _set_max_open_files(builder, "MAX_OPEN_TEMP_FILES")
 
+        if big_genome_mode:
+            self._configure_for_big_genome(config, builder)
+
         builder.set_option("I", "%(IN_BAM)s", sep="=")
         for check in ignored_checks:
             builder.add_option("IGNORE", check, sep="=")
@@ -79,6 +82,22 @@ class ValidateBAMNode(PicardNode):
                             description=description,
                             dependencies=dependencies)
 
+    @staticmethod
+    def _configure_for_big_genome(config, builder):
+        # CSI uses a different method for assigning BINs to records, which
+        # Picard currently does not support.
+        builder.add_option("IGNORE", "INVALID_INDEXING_BIN", sep="=")
+
+        jar_path = os.path.join(config.jar_root, _PICARD_JAR)
+        version_check = _PICARD_VERSION_CACHE[jar_path]
+
+        try:
+            if version_check.version >= (2, 19, 0):
+                # Useless warning, as we do not build BAI indexes for large genomes
+                builder.add_option("IGNORE", "REF_SEQ_TOO_LONG_FOR_BAI", sep="=")
+        except versions.VersionRequirementError:
+            pass  # Ignored here, handled elsewhere
+
 
 class BuildSequenceDictNode(PicardNode):
     def __init__(self, config, reference, dependencies=()):
@@ -247,7 +266,8 @@ def picard_command(config, command):
     params = AtomicJavaCmdBuilder(jar_path,
                                   temp_root=config.temp_root,
                                   jre_options=config.jre_options,
-                                  CHECK_JAR=version)
+                                  CHECK_JAR=version,
+                                  set_cwd=True)
     params.set_option(command)
 
     return params
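
The version_check.version >= (2, 19, 0) gate above relies on Python's element-wise tuple comparison. A tiny sketch of the idea; the helper name is made up for illustration, and it assumes (based on the gate, not verified here) that the REF_SEQ_TOO_LONG_FOR_BAI ignore token is only recognised by sufficiently recent Picard releases:

    def supports_ref_seq_too_long_ignore(picard_version):
        # Hypothetical helper mirroring the gate above: only pass the IGNORE
        # token to Picard JARs new enough to recognise it.
        return picard_version >= (2, 19, 0)

    print(supports_ref_seq_too_long_ignore((2, 18, 27)))  # False
    print(supports_ref_seq_too_long_ignore((2, 19, 0)))   # True
    print(supports_ref_seq_too_long_ignore((2, 21, 1)))   # True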


=====================================
paleomix/tools/bam_pipeline/nodes.py
=====================================
@@ -43,15 +43,11 @@ def index_and_validate_bam(config, prefix, node, log_file=None,
         # where high-quality reads may cause mis-identification of qualities
         "INVALID_QUALITY_FORMAT"]
 
-    if prefix['IndexFormat'] == '.csi':
-        # CSI uses a different method for assigning BINs to records, which
-        # Picard currently does not support.
-        ignored_checks.append("INVALID_INDEXING_BIN")
-
     return ValidateBAMNode(config=config,
                            input_bam=input_file,
                            input_index=index_file,
                            ignored_checks=ignored_checks,
+                           big_genome_mode=prefix["IndexFormat"] == ".csi",
                            output_log=log_file,
                            dependencies=node)
 


=====================================
paleomix/tools/zonkey/build_tped.py
=====================================
@@ -258,9 +258,10 @@ def write_summary(args, filename, statistics):
             handle.write("%s: %s\n" % (key, statistics.get(key, "MISSING")))
 
 
-def process_bam(args, data, bam_handle):
+def process_bam(args, data, bam_handle, mapping):
+    reverse_mapping = dict(zip(mapping.values(), mapping))
     raw_references = bam_handle.references
-    references = map(common.contig_name_to_plink_name, raw_references)
+    references = [reverse_mapping.get(name, name) for name in raw_references]
 
     if args.downsample:
         sys.stderr.write("Downsampling to at most %i BAM records ...\n"
@@ -343,7 +344,7 @@ def main(argv):
                              "identifiable nuclear alignments.\n")
             return 1
 
-        process_bam(args, data, bam_handle)
+        process_bam(args, data, bam_handle, bam_info.nuclear_contigs)
 
     return 0
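
process_bam() now receives the panel-name-to-BAM-name mapping established during validation and inverts it, so records are written under the reference panel's names regardless of how the BAM names its contigs. A minimal sketch with made-up contig names:

    # Hypothetical mapping as produced during BAM validation: panel name -> BAM name
    mapping = {"1": "chr1", "2": "chr2", "X": "chrX"}

    # Iterating a dict yields its keys, so this pairs each BAM name with its panel name
    reverse_mapping = dict(zip(mapping.values(), mapping))  # e.g. 'chr1' -> '1'

    # Contigs not covered by the panel (e.g. the mitochondria) keep their BAM name
    references = [reverse_mapping.get(name, name) for name in ("chr1", "chrX", "chrM")]
    print(references)  # ['1', 'X', 'chrM']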
 


=====================================
paleomix/tools/zonkey/common.py
=====================================
@@ -56,7 +56,7 @@ def contig_name_to_plink_name(chrom):
     identified.
     """
     if chrom.isdigit():
-        return chrom.upper
+        return chrom
     elif chrom.upper() in "XY":
         return chrom.upper()
     elif chrom.lower().startswith("chr") and chrom[3:].isdigit():
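
The one-character fix above is the change behind the "digit only chromosome names" entry in the changelog: chrom.upper without parentheses returns the bound method rather than the upper-cased string, so purely numeric names could never compare equal to anything downstream. A minimal illustration:

    chrom = "12"

    broken = chrom.upper   # bound method object, not a string
    fixed = chrom          # the value actually intended for digit-only names

    print(broken == "12")  # False - the old return value never matched a contig name
    print(fixed == "12")   # True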


=====================================
paleomix/tools/zonkey/config.py
=====================================
@@ -218,14 +218,14 @@ def _process_samples(config):
                               "mitochondrial BAM specified; the mitochondrial "
                               "genome in the first BAM will not be used!")
 
-                files["Nuc"] = filename
+                files["Nuc"] = {"Path": filename, "Info": filetype}
                 files.setdefault("Mito", filename)
             elif filetype.is_nuclear:
                 if "Nuc" in files:
                     print_err("ERROR: Two nuclear BAMs specified!")
                     return False
 
-                files["Nuc"] = filename
+                files["Nuc"] = {"Path": filename, "Info": filetype}
             elif filetype.is_mitochondrial:
                 if "Mito" in files:
                     print_err("ERROR: Two nuclear BAMs specified!")


=====================================
paleomix/tools/zonkey/database.py
=====================================
@@ -19,7 +19,9 @@
 # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 # SOFTWARE.
+import collections
 import os
+import re
 import tarfile
 
 import pysam
@@ -46,14 +48,14 @@ _SETTINGS_KEYS = ('Format', 'Revision', 'Plink', 'NChroms', 'MitoPadding',
 
 class BAMInfo(object):
     def __init__(self):
-        self.nuclear = False
+        self.nuclear_contigs = {}
         self.mt_contig = None
         self.mt_length = None
         self.mt_padding = None
 
     @property
     def is_nuclear(self):
-        return self.nuclear
+        return bool(self.nuclear_contigs)
 
     @property
     def is_mitochondrial(self):
@@ -62,7 +64,7 @@ class BAMInfo(object):
     def __repr__(self):
         tmpl = "BAMInfo(nuclear=%r, mt_contig=%r, mt_length=%r, mt_padding=%r)"
 
-        return tmpl % (self.nuclear, self.mt_contig,
+        return tmpl % (self.nuclear_contigs, self.mt_contig,
                        self.mt_length, self.mt_padding)
 
 
@@ -74,9 +76,10 @@ _SUPPORTED_DB_FORMAT_MINOR = 20160112
 
 # Required columns in the 'contigs.txt' table; additional columns are ignored
 _CONTIGS_TABLE_COLUMNS = frozenset(('ID', 'Size', 'Checksum'))
-# Required columns in the 'samples.txt' table; additional columns are ignored
-_SAMPELS_TABLE_COLUMNS = frozenset(('ID', 'Group(2)', 'Group(3)', 'Species',
-                                    'Sex', 'SampleID', 'Publication'))
+# Required columns in the 'samples.txt' table; additional non-group columns are ignored
+_SAMPLES_TABLE_COLUMNS = frozenset(('ID', 'Species', 'Sex', 'SampleID', 'Publication'))
+# Regular expression for parsing Group(K) columns in samples.txt
+_SAMPLES_TABLE_GROUP = re.compile(r'^Group\((?P<K>.+)\)$')
 
 
 class ZonkeyDBError(RuntimeError):
@@ -103,7 +106,7 @@ class ZonkeyDB(object):
             print_info('  - Reading list of contigs ...')
             self.contigs = self._read_contigs_table(tar_handle, "contigs.txt")
             print_info('  - Reading list of samples ...')
-            self.samples = self._read_samples_table(tar_handle, "samples.txt")
+            self.samples, self.groups = self._read_samples_table(tar_handle, "samples.txt")
             print_info('  - Reading mitochondrial sequences ...')
             self.mitochondria = self._read_mitochondria(tar_handle,
                                                         "mitochondria.fasta")
@@ -196,7 +199,7 @@ class ZonkeyDB(object):
     def _read_samples_table(cls, tar_handle, filename):
         cls._check_required_file(tar_handle, filename)
 
-        samples = cls._read_table(tar_handle, "samples.txt")
+        samples = cls._read_table(tar_handle, "samples.txt", _SAMPLES_TABLE_COLUMNS)
         if not samples:
             raise ZonkeyDBError("ERROR: No samples found in genotypes table!")
 
@@ -206,16 +209,42 @@ class ZonkeyDB(object):
                                     "expected 'MALE', 'FEMALE', or 'NA'"
                                     % (row["Sex"],))
 
-        for k_groups in (2, 3):
-            key = "Group(%i)" % (k_groups,)
-            groups = frozenset(row[key] for row in samples.itervalues())
-
-            if len(groups - set('-')) not in (0, k_groups):
-                raise ZonkeyDBError("The %r column in the samples table must "
-                                    "either contain %i ancestral groups, or "
-                                    "none" % (key, k_groups))
-
-        return samples
+        group_keys = []
+        for key in samples.values()[0]:
+            match = _SAMPLES_TABLE_GROUP.match(key)
+            if match is not None:
+                k_value = match.groupdict()["K"]
+                if not k_value.isdigit():
+                    raise ZonkeyDBError("Malformed Group column name; K is "
+                                        "not a number: %r" % (key,))
+                elif not (2 <= int(k_value) <= 7):
+                    raise ZonkeyDBError("K must be between 2 and 7, but found %r"
+                                        % (key,))
+
+                group_keys.append((key, int(k_value)))
+
+        groups = {}
+        for key, k_value in group_keys:
+            group = {}
+            for sample_key, sample in samples.items():
+                group[sample_key] = sample.pop(key)
+
+            group_labels = frozenset(group.values())
+            if group_labels == frozenset('-'):
+                continue  # Allowed for backwards compatibility
+            elif '-' in group_labels:
+                raise ZonkeyDBError("Not all samples in column %r assigned a group"
+                                    % (key,))
+            elif len(group_labels) != k_value:
+                raise ZonkeyDBError("Expected %i groups in column %r, found %i"
+                                    % (k_value, key, len(group_labels)))
+
+            groups[k_value] = group
+
+        if not groups:
+            raise ZonkeyDBError("No valid groups in samples.txt")
+
+        return samples, groups
 
     @classmethod
     def _read_sample_order(cls, tar_handle, filename):
@@ -390,9 +419,7 @@ class ZonkeyDB(object):
                                         % (key, linenum, filename, row[key]))
 
             for key in ('Sample1', 'Sample2'):
-                group_key = 'Group(%i)' % (row['K'],)
-                groups = frozenset(row[group_key]
-                                   for row in self.samples.itervalues())
+                groups = frozenset(self.groups[int(row["K"])].values())
 
                 if row[key] not in groups and row[key] != '-':
                     raise ZonkeyDBError('Invalid group in column %r in '
@@ -508,38 +535,44 @@ def _validate_mito_bam(data, handle, info):
 
 
 def _validate_nuclear_bam(data, handle, info):
-    # Check that chromosomes are of expected size; unused chroms are ignored.
-    bam_contigs = dict(zip(map(contig_name_to_plink_name, handle.references),
-                           handle.lengths))
-    ref_contigs = data.contigs
-
-    contigs_found = {}
-    for name, stats in sorted(ref_contigs.iteritems()):
-        if name not in bam_contigs:
-            contigs_found[name] = False
-        elif bam_contigs[name] != stats["Size"]:
-            print_err("\nERROR: Chrom %r in the BAM does not match the "
-                      "length specified in data file:\n"
-                      "    - Expected: %i\n"
-                      "    - Found: %i"
-                      % (name, bam_contigs[name], stats["Size"]))
-
-            return False
-        else:
-            contigs_found[name] = True
-
-    if any(contigs_found.itervalues()):
-        if not all(contigs_found.itervalues()):
-            print_err("\nERROR: Not all nuclear chromosomes found in BAM:")
-            for (name, stats) in sorted(ref_contigs.iteritems()):
-                is_found = "Found" if contigs_found[name] else "Not found!"
-                print_err("  - %s: %s" % (name, is_found))
-
-            return False
-        else:
-            info.nuclear = True
-
-    return True
+    # Match reference panel contigs with BAM contigs; identification is done
+    # by size since different repositories use different naming schemes.
+    bam_contigs = collections.defaultdict(list)
+    for name, length in zip(handle.references, handle.lengths):
+        bam_contigs[length].append(name)
+
+    panel_names_to_bam = {}
+    for name, stats in sorted(data.contigs.iteritems()):
+        bam_contig_names = bam_contigs.get(stats["Size"], ())
+        if len(bam_contig_names) == 1:
+            panel_names_to_bam[name] = bam_contig_names[0]
+        elif len(bam_contig_names) > 1:
+            candidates = []
+            for bam_name in bam_contig_names:
+                if contig_name_to_plink_name(bam_name) == name:
+                    candidates.append(bam_name)
+
+            if len(candidates) == 1:
+                panel_names_to_bam[name] = candidates[0]
+            else:
+                print_warn("\nWARNING: Multiple candidates for chr%s with size %i:"
+                           % (name, stats["Size"]))
+
+                for name in bam_contig_names:
+                    print_warn("  - %r" % (name,))
+
+    if len(panel_names_to_bam) == len(data.contigs):
+        info.nuclear_contigs = panel_names_to_bam
+        return True
+    elif panel_names_to_bam:
+        print_err("\nERROR: Not all nuclear chromosomes found in BAM:")
+        for (name, stats) in sorted(data.contigs.iteritems()):
+            is_found = "OK" if name in panel_names_to_bam else "Not found!"
+            print_err("  - %s: %s" % (name, is_found))
+
+        return False
+    else:
+        return True
 
 
 def _check_file_compression(filename):
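
The samples-table parsing above replaces the previously hard-coded Group(2)/Group(3) columns with any number of Group(K) columns for 2 <= K <= 7, which is the "Improved handling of K-groups" item from the changelog. A minimal sketch of the column-name matching (the header names are examples):

    import re

    _SAMPLES_TABLE_GROUP = re.compile(r'^Group\((?P<K>.+)\)$')

    for column in ("Group(2)", "Group(3)", "Group(7)", "Group(x)", "Species"):
        match = _SAMPLES_TABLE_GROUP.match(column)
        if match is None:
            continue  # not a group column; ignored like any other extra column

        k_value = match.groupdict()["K"]
        if not k_value.isdigit():
            print("malformed group column: %r" % (column,))  # e.g. Group(x)
        elif not (2 <= int(k_value) <= 7):
            print("K out of range: %r" % (column,))
        else:
            print("found K=%i in %r" % (int(k_value), column))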


=====================================
paleomix/tools/zonkey/parts/admixture.py
=====================================
@@ -36,7 +36,6 @@ class AdmixtureError(RuntimeError):
 
 
 def read_admixture_results(filename, data, k_groups, cutoff=CUTOFF):
-    key = "Group(%i)" % (k_groups,)
     names = tuple(data.sample_order) + ("-",)
     table = _admixture_read_results(filename, names)
     _admixture_validate_ancestral_groups(data, table, k_groups, cutoff)
@@ -46,7 +45,7 @@ def read_admixture_results(filename, data, k_groups, cutoff=CUTOFF):
         if sample == '-':
             continue
 
-        group = data.samples[sample][key]
+        group = data.groups[k_groups][sample]
         for index, value in enumerate(row):
             if value >= cutoff:
                 ancestral_groups[index][0].add(group)
@@ -146,16 +145,13 @@ def _admixture_read_results(filename, samples):
 
 
 def _admixture_validate_ancestral_groups(data, table, k_groups, cutoff):
-    key = "Group(%i)" % (k_groups,)
     groups = collections.defaultdict(dict)
     for sample, row in table.iteritems():
-        if sample not in data.samples:
-            continue
-
-        group = data.samples[sample][key]
-        for index, value in enumerate(row):
-            if value >= cutoff:
-                groups[group][index] = True
+        group = data.groups[k_groups].get(sample)
+        if group is not None:
+            for index, value in enumerate(row):
+                if value >= cutoff:
+                    groups[group][index] = True
 
     mixed_groups = []
     for group, memberships in sorted(groups.iteritems()):


=====================================
paleomix/tools/zonkey/parts/common.py
=====================================
@@ -34,6 +34,7 @@ _DEFAULT_COLORS = ("#E69F00", "#56B4E9",
 class WriteSampleList(Node):
     def __init__(self, config, output_file, dependencies=()):
         self._samples = config.database.samples
+        self._groups = config.database.groups
 
         Node.__init__(self,
                       description="<WriteSampleList -> %r>" % (output_file,),
@@ -44,17 +45,18 @@ class WriteSampleList(Node):
     def _run(self, config, temp):
         output_file, = self.output_files
         samples = self._samples
-        groups = set(sample["Group(3)"] for sample in samples.itervalues())
-        colors = dict(zip(groups, _DEFAULT_COLORS))
+
+        group = self._groups[max(self._groups)]
+        group_colors = dict(zip(sorted(set(group.values())), _DEFAULT_COLORS))
 
         with open(fileutils.reroot_path(temp, output_file), "w") as handle:
             handle.write("Name\tGroup\tColor\n")
 
-            for name, sample in sorted(samples.iteritems()):
-                group = sample["Group(3)"]
-                color = colors[group]
+            for sample_name in sorted(samples):
+                group_name = group[sample_name]
+                group_color = group_colors[group_name]
 
-                handle.write("%s\t%s\t%s\n" % (name, group, color))
+                handle.write("%s\t%s\t%s\n" % (sample_name, group_name, group_color))
 
             handle.write("Sample\t-\t#000000\n")
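
WriteSampleList now colours samples by the largest K available instead of the hard-coded Group(3) column; the colour assignment simply zips the sorted set of group labels with the default palette. A small sketch with made-up labels:

    _DEFAULT_COLORS = ("#E69F00", "#56B4E9")  # truncated palette for the example

    group = {"sample_a": "Horse", "sample_b": "Donkey", "sample_c": "Horse"}

    # Sorting first makes the label -> colour assignment deterministic between runs
    group_colors = dict(zip(sorted(set(group.values())), _DEFAULT_COLORS))
    print(group_colors["Donkey"])  # prints #E69F00
    print(group_colors["Horse"])   # prints #56B4E9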
 


=====================================
paleomix/tools/zonkey/parts/nuclear.py
=====================================
@@ -192,17 +192,10 @@ class BuildFilteredBEDFilesNode(CommandNode):
 
 
 class AdmixtureNode(CommandNode):
-    def __init__(self, input_file, k_groups, output_root,
-                 samples=None, dependencies=()):
-        self._samples = samples
+    def __init__(self, input_file, k_groups, output_root, groups, dependencies=()):
+        self._groups = groups
         self._input_file = input_file
-        self._k_groups = k_groups
-
-        group_key = "Group(%i)" % (self._k_groups,)
-        self._supervised = samples and any((row[group_key] != '-')
-                                           for row in samples.itervalues())
 
-        assert k_groups in (2, 3), k_groups
         prefix = os.path.splitext(os.path.basename(input_file))[0]
         output_prefix = os.path.join(output_root,
                                      "%s.%i" % (prefix, k_groups))
@@ -227,9 +220,7 @@ class AdmixtureNode(CommandNode):
                                set_cwd=True)
 
         cmd.set_option("-s", random.randint(0, 2 ** 16 - 1))
-
-        if self._supervised:
-            cmd.set_option("--supervised")
+        cmd.set_option("--supervised")
 
         cmd.add_value("%(TEMP_OUT_FILE_BED)s")
         cmd.add_value(int(k_groups))
@@ -253,20 +244,16 @@ class AdmixtureNode(CommandNode):
             basename = os.path.basename(filename)
             os.symlink(os.path.abspath(filename), os.path.join(temp, basename))
 
-        if self._supervised:
-            fam_filename = fileutils.swap_ext(self._input_file, ".fam")
+        fam_filename = fileutils.swap_ext(self._input_file, ".fam")
+        pop_filename = fileutils.swap_ext(fam_filename, ".pop")
+        pop_filename = fileutils.reroot_path(temp, pop_filename)
 
-            pop_filename = fileutils.swap_ext(fam_filename, ".pop")
-            pop_filename = fileutils.reroot_path(temp, pop_filename)
+        with open(fam_filename) as fam_handle:
+            with open(pop_filename, "w") as pop_handle:
+                for line in fam_handle:
+                    sample, _ = line.split(None, 1)
 
-            key = "Group(%i)" % (self._k_groups,)
-            with open(fam_filename) as fam_handle:
-                with open(pop_filename, "w") as pop_handle:
-                    for line in fam_handle:
-                        sample, _ = line.split(None, 1)
-                        group = self._samples.get(sample, {}).get(key, "-")
-
-                        pop_handle.write("%s\n" % (group,))
+                    pop_handle.write("%s\n" % (self._groups.get(sample, "-"),))
 
 
 class SelectBestAdmixtureNode(Node):
@@ -399,6 +386,8 @@ class BuildFreqFilesNode(CommandNode):
                           IN_BIM=input_prefix + ".bim",
                           IN_FAM=input_prefix + ".fam",
                           TEMP_OUT_CLUST="samples.clust",
+                          TEMP_OUT_IMISS=basename + ".imiss",
+                          TEMP_OUT_LMISS=basename + ".lmiss",
                           OUT_NOSEX=output_prefix + ".frq.strat.nosex",
                           OUT_LOG=output_prefix + ".frq.strat.log",
                           TEMP_OUT_PREFIX=basename,
@@ -692,8 +681,9 @@ class PlotPCANode(CommandNode):
 
 
 class PlotCoverageNode(CommandNode):
-    def __init__(self, contigs, input_file, output_prefix, dependencies=()):
+    def __init__(self, contigs, mapping, input_file, output_prefix, dependencies=()):
         self._contigs = contigs
+        self._mapping = dict(zip(mapping.values(), mapping))
         self._input_file = input_file
 
         script = rtools.rscript("zonkey", "coverage.r")
@@ -726,19 +716,11 @@ class PlotCoverageNode(CommandNode):
                     continue
 
                 name, size, hits, _ = line.split('\t')
-                name = contig_name_to_plink_name(name)
-                if name is None or not (name.isdigit() or name == 'X'):
-                    continue
-                elif name not in self._contigs:
+                name = self._mapping.get(name, name)
+                if name not in self._contigs:
                     # Excluding contigs is allowed
                     continue
 
-                if int(size) != self._contigs[name]['Size']:
-                    raise NodeError("Size mismatch between database and BAM; "
-                                    "expected size %i, found %i for contig %r"
-                                    % (int(size), self._contigs[name]['Size'],
-                                       name))
-
                 row = {
                     'ID': name,
                     'Size': self._contigs[name]['Size'],


=====================================
paleomix/tools/zonkey/parts/report.py
=====================================
@@ -139,26 +139,19 @@ class ReportNode(Node):
         output_handle.write(_SECTION_HEADER.format(name="samples",
                                                    title="Reference Panel"))
 
-        output_handle.write(_SAMPLE_LIST_HEADER)
+        group_headers = []
+        for k_value in sorted(self._data.groups):
+            group_headers.append("          <th>Group(%i)</th>" % k_value)
+        output_handle.write(_SAMPLE_LIST_HEADER % ('\n'.join(group_headers),))
 
         last_group_2 = None
         last_group_3 = None
-        for row in sorted(self._data.samples.itervalues(),
-                          key=lambda row: (row["Group(2)"],
-                                           row["Group(3)"],
-                                           row["ID"])):
-
+        for name, row in sorted(self._data.samples.iteritems()):
             row = dict(row)
-            if last_group_2 != row["Group(2)"]:
-                last_group_2 = row["Group(2)"]
-                last_group_3 = row["Group(3)"]
-            else:
-                row["Group(2)"] = ""
-
-                if last_group_3 != row["Group(3)"]:
-                    last_group_3 = row["Group(3)"]
-                else:
-                    row["Group(3)"] = ""
+            groups = []
+            for key in self._data.groups:
+                groups.append("          <td>%s</td>" % self._data.groups[key][name])
+            row["Groups"] = "\n".join(groups)
 
             pub = row["Publication"]
             if pub.startswith("http"):
@@ -183,7 +176,7 @@ class ReportNode(Node):
         overview = _ADMIXTURE_OVERVIEW.format(ADMIXTURE=admixture_v)
         output_handle.write(overview)
 
-        for k_groups in (2, 3):
+        for k_groups in sorted(self._data.groups):
             summary_incl = self._build_admixture_cell(k_groups, True)
             summary_excl = self._build_admixture_cell(k_groups, False)
 
@@ -372,7 +365,7 @@ class AnalysisReport(object):
 
                 # Include files showing proproation of ancestral populations,
                 # which are required to build admixture figures in the reports.
-                for k_groups in (2, 3):
+                for k_groups in sorted(self._data.groups):
                     input_files.append(os.path.join(admix_root,
                                                     "%s.%i.Q" % (postfix,
                                                                  k_groups)))
@@ -625,9 +618,8 @@ _OVERVIEW_FOOTER = """
 _SAMPLE_LIST_HEADER = """
       <table summary="List of samples in the reference panel.">
         <tr>
-          <th>Group(2)</th>
-          <th>Group(3)</th>
           <th>ID</th>
+%s
           <th>Species</th>
           <th>Sex</th>
           <th>Sample Name</th>
@@ -637,9 +629,8 @@ _SAMPLE_LIST_HEADER = """
 
 _SAMPLE_LIST_ROW = """
         <tr>
-          <td>{Group(2)}
-          <td>{Group(3)}
           <td>{ID}</td>
+{Groups}
           <td><em>{Species}</em></td>
           <td>{Sex}</th>
           <td>{SampleID}</td>


=====================================
paleomix/tools/zonkey/parts/summary.py
=====================================
@@ -146,7 +146,7 @@ class SummaryNode(Node):
                 admix_root = os.path.join(self._root, sample,
                                           "results", "admixture")
 
-                for k_groups in (2, 3):
+                for k_groups in sorted(self._data.groups):
                     filename = os.path.join(admix_root, "%s.%i.Q" % (postfix,
                                                                      k_groups))
 
@@ -189,7 +189,7 @@ class SummaryNode(Node):
             admix_root = os.path.join(self._root, sample,
                                       "results", "admixture")
 
-            for k_groups in (2, 3):
+            for k_groups in sorted(self._data.groups):
                 filename = os.path.join(admix_root, "%s.%i.Q" % (postfix, k_groups))
 
                 lines.append("                <td>")


=====================================
paleomix/tools/zonkey/pipeline.py
=====================================
@@ -37,6 +37,8 @@ from paleomix.common.console import \
     print_info, \
     print_warn
 
+from paleomix.common.formats.fasta import FASTA
+
 from paleomix.pipeline import \
     Pypeline
 
@@ -122,7 +124,7 @@ def build_admixture_nodes(config, data, root, plink):
 
         admix_root = os.path.join(root, "results", "admixture")
         report_root = os.path.join(root, "figures", "admixture")
-        for k_groups in (2, 3):
+        for k_groups in sorted(data.groups):
             replicates = []
 
             input_file = os.path.join(plink["root"], postfix + ".bed")
@@ -132,7 +134,7 @@ def build_admixture_nodes(config, data, root, plink):
                 node = nuclear.AdmixtureNode(input_file=input_file,
                                              output_root=output_root,
                                              k_groups=k_groups,
-                                             samples=data.samples,
+                                             groups=data.groups[k_groups],
                                              dependencies=(bed_node,))
 
                 replicates.append(node)
@@ -228,10 +230,11 @@ def build_pca_nodes(config, data, root, plink):
     return nodes
 
 
-def build_coverage_nodes(data, root, nuc_bam, dependencies=()):
+def build_coverage_nodes(contigs, mapping, root, nuc_bam, dependencies=()):
     output_prefix = os.path.join(root, 'figures', 'coverage', 'coverage')
 
-    return nuclear.PlotCoverageNode(contigs=data.contigs,
+    return nuclear.PlotCoverageNode(contigs=contigs,
+                                    mapping=mapping,
                                     input_file=nuc_bam,
                                     output_prefix=output_prefix,
                                     dependencies=dependencies),
@@ -278,6 +281,8 @@ def build_pipeline(config, root, nuc_bam, mito_bam, cache):
                                            output_file=sample_tbl)
 
     if nuc_bam is not None:
+        nuc_bam, nuc_bam_info = nuc_bam["Path"], nuc_bam["Info"]
+
         # When not sampling, BuildTPED relies on indexed access to ease
         # processing of one chromosome at a time. The index is further required
         # for idxstats used by the PlotCoverageNode.
@@ -292,8 +297,11 @@ def build_pipeline(config, root, nuc_bam, mito_bam, cache):
                                            plink))
 
         if not config.admixture_only:
-            nodes.extend(build_coverage_nodes(config.database,
-                                              root, nuc_bam, (index,)))
+            nodes.extend(build_coverage_nodes(contigs=config.database.contigs,
+                                              mapping=nuc_bam_info.nuclear_contigs,
+                                              root=root,
+                                              nuc_bam=nuc_bam,
+                                              dependencies=(index,)))
             nodes.extend(build_pca_nodes(config, config.database,
                                          root, plink))
             nodes.extend(build_treemix_nodes(config, config.database,
@@ -384,8 +392,6 @@ def setup_mito_mapping(config):
 
             info = config.database.samples.get(record.name)
             if info is not None:
-                mkfile.write("    # Group: %s\n"
-                             % (info.get('Group(3)', 'NA'),))
                 mkfile.write("    # Species: %s\n"
                              % (info.get('Species', 'NA'),))
                 mkfile.write("    # Sex: %s\n"
@@ -401,6 +407,11 @@ def setup_mito_mapping(config):
                                        "%s.fasta" % (record.name,))
 
             with open(fasta_fpath, "w") as fasta_handle:
+                record = FASTA(
+                    name=record.name,
+                    meta=None,
+                    sequence=record.sequence.replace('-', ''))
+
                 fasta_handle.write(str(record))
                 fasta_handle.write("\n")
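
The FASTA(...) rewrap above implements the "Remove dashes from Zonkey MT genomes" fix: the database's mitochondrial records are presumably stored as an alignment (hence the '-' gap characters), which cannot be used directly as a mapping reference. A minimal sketch of the effect on a made-up sequence:

    # Hypothetical aligned mitochondrial sequence as stored in the Zonkey database
    aligned = "ACGT--ACGTA-CGT"

    # Stripping alignment gaps before writing the per-sample reference FASTA
    print(aligned.replace('-', ''))  # ACGTACGTACGT

In the diff this is done by rebuilding the record as FASTA(name=record.name, meta=None, sequence=record.sequence.replace('-', '')) before it is written out.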
 



View it on GitLab: https://salsa.debian.org/med-team/paleomix/compare/0313756ec7fe5613be8c89f9e6cf25f87cde0459...d494bff3f643877763c516b132b22a2331cc0dc3


