[med-svn] [Git][med-team/changeo][upstream] New upstream version 1.2.0

Nilesh Patra (@nilesh) gitlab at salsa.debian.org
Mon Nov 1 18:28:41 GMT 2021



Nilesh Patra pushed to branch upstream at Debian Med / changeo


Commits:
da695a88 by Nilesh Patra at 2021-11-01T23:47:16+05:30
New upstream version 1.2.0
- - - - -


12 changed files:

- INSTALL.rst
- NEWS.rst
- PKG-INFO
- bin/AssignGenes.py
- bin/BuildTrees.py
- bin/MakeDb.py
- bin/ParseDb.py
- changeo.egg-info/PKG-INFO
- changeo.egg-info/requires.txt
- changeo/IO.py
- changeo/Version.py
- requirements.txt


Changes:

=====================================
INSTALL.rst
=====================================
@@ -24,7 +24,7 @@ The minimum dependencies for installation are:
 + `SciPy 0.14 <http://scipy.org>`__
 + `pandas 0.24 <http://pandas.pydata.org>`__
 + `Biopython 1.77 <http://biopython.org>`__
-+ `presto 0.6.2 <http://presto.readthedocs.io>`__
++ `presto 0.7.0 <http://presto.readthedocs.io>`__
 + `airr 1.3.1 <https://docs.airr-community.org>`__
 
 Some tools wrap external applications that are not required for installation.


=====================================
NEWS.rst
=====================================
@@ -1,6 +1,40 @@
 Release Notes
 ===============================================================================
 
+Version 1.2.0:  October 29, 2021
+-------------------------------------------------------------------------------
+
++ Updated dependencies to presto >= v0.7.0.
+
+AssignGenes:
+
++ Fixed reporting of IgBLAST output counts when specifying ``--format airr``.
+
+BuildTrees:
+
++ Added support for specifying fixed omega and hotness parameters at the
+  commandline.
+
+CreateGermlines:
+
++ Will now use the first allele in the reference database when duplicate
+  allele names are provided. Only appears to affect mouse BCR light chains
+  and TCR alleles in the IMGT database when the same allele name differs by
+  strain.
+
+MakeDb:
+
++ Added support for changes in how IMGT/HighV-QUEST v1.8.4 handles special
+  characters in sequence identifiers.
++ Fixed the ``imgt`` subcommand incorrectly allowing execution without
+  specifying the IMGT/HighV-QUEST output file at the commandline.
+
+ParseDb:
+
++ Added reporting of output file sizes to the console log of the ``split``
+  subcommand.
+
+
 Version 1.1.0:  June 21, 2021
 -------------------------------------------------------------------------------
 
@@ -8,13 +42,12 @@ Version 1.1.0:  June 21, 2021
 + Updated dependencies to biopython >= v1.77, airr >= v1.3.1, PyYAML>=5.1.
 
 MakeDb:
-
 + Added the ``--imgt-id-len`` argument to accommodate changes introduced in how
-  IMGT/HighV-QUEST truncates sequence identifiers as of version 1.8.3 (May 7, 2021).
+  IMGT/HighV-QUEST truncates sequence identifiers as of v1.8.3 (May 7, 2021).
   The header lines in the fasta files are now truncated to 49 characters. In
-  IMGT/HighV-QUEST versions older that 1.8.3, they were truncated to 50 characters.
+  IMGT/HighV-QUEST versions older than v1.8.3, they were truncated to 50 characters.
   ``--imgt-id-len`` default value is 49. Users should specify ``--imgt-id-len 50``
-  to analyze IMGT results generated with IMGT/HighV-QUEST versions older that 1.8.3.
+  to analyze IMGT results generated with IMGT/HighV-QUEST versions older than v1.8.3.
 + Added the ``--infer-junction`` argument to ``MakeDb igblast``, to enable the inference
   of the junction sequence when not reported by IgBLAST. Should be used with data from
   IgBLAST v1.6.0 or older; before igblast added the IMGT-CDR3 inference.


=====================================
PKG-INFO
=====================================
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: changeo
-Version: 1.1.0
+Version: 1.2.0
 Summary: A bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
 Home-page: http://changeo.readthedocs.io
 Author: Namita Gupta, Jason Anthony Vander Heiden


=====================================
bin/AssignGenes.py
=====================================
@@ -14,6 +14,7 @@ from collections import OrderedDict
 from pkg_resources import parse_version
 from textwrap import dedent
 from time import time
+import re
 
 # Presto imports
 from presto.IO import printLog, printMessage, printError, printWarning
@@ -99,9 +100,30 @@ def assignIgBLAST(seq_file, amino_acid=False, igdata=default_igdata, loci='ig',
                                   vdb=vdb, output=out_file,
                                   threads=nproc, exec=igblast_exec)
     printMessage('Done', start_time=start_time, end=True, width=25)
-
+    
+    # Get number of processed sequences
+    if (format == 'blast'):
+        with open(out_file, 'rb') as f: 
+            f.seek(-2, os.SEEK_END) 
+            while f.read(1) != b'\n': 
+                f.seek(-2, os.SEEK_CUR)  
+            pass_info = f.readline().decode()
+        num_seqs_match = re.search('(# BLAST processed )(\d+)( .*)', pass_info)
+        num_sequences = num_seqs_match.group(2)
+    else:
+        f = open(out_file, 'rb')
+        lines = 0
+        buf_size = 1024 * 1024
+        read_f = f.raw.read
+        buf = read_f(buf_size)
+        while buf:
+            lines += buf.count(b'\n')
+            buf = read_f(buf_size)
+        num_sequences = lines - 1
+        
     # Print log
     log = OrderedDict()
+    log['PASS'] = num_sequences
     log['OUTPUT'] = os.path.basename(out_file)
     log['END'] = 'AssignGenes'
     printLog(log)
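
Reviewer note on the hunk above: the new counting code reads the last line of the `blast` output for a `# BLAST processed N ...` summary and, for other formats, counts newlines and subtracts one header line. A minimal standalone sketch of the same idea follows. The function names `count_blast_queries` and `count_data_rows` are hypothetical and not part of changeo. Unlike the hunk, the sketch closes both file handles, uses a raw-string regex, and returns `None` when the summary line is absent.

```python
import os
import re


def count_blast_queries(path, tail_bytes=4096):
    """Return N from a trailing '# BLAST processed N ...' line, or None."""
    with open(path, 'rb') as handle:
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        # Read only the end of the file; the summary line is short
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read().decode(errors='replace')
    lines = tail.strip().splitlines()
    if not lines:
        return None
    match = re.search(r'# BLAST processed (\d+) ', lines[-1])
    return int(match.group(1)) if match else None


def count_data_rows(path, buf_size=1024 * 1024):
    """Count newline bytes in a file and subtract one header line."""
    newlines = 0
    with open(path, 'rb') as handle:
        while True:
            buf = handle.read(buf_size)
            if not buf:
                break
            newlines += buf.count(b'\n')
    return max(newlines - 1, 0)
```

Returning an integer (or `None`) rather than the matched string is a design choice of this sketch, assuming the caller only needs the count for logging.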


=====================================
bin/BuildTrees.py
=====================================
@@ -999,7 +999,7 @@ def runIgPhyML(outfile, igphyml_out, clone_dir, nproc=1, optimization="lr", omeg
         if oformat == "tab":
             os.rmdir(clone_dir)
         else:
-            printWarning("Using --clean all with --oformat txt will delete all tree file results.\n"
+            printWarning("Using --clean all with --oformat txt will not delete all tree file results.\n"
                          "You'll have to do that yourself.")
         log = OrderedDict()
         log["END"] = "IgPhyML analysis"
@@ -1323,19 +1323,17 @@ def getArgParser():
                                help="""Optimize combination of topology (t) branch lengths (l) and parameters (r), or 
                                nothing (n), for IgPhyML.""")
     igphyml_group.add_argument("--omega", action="store", dest="omega", type=str, default="e,e",
-                               choices = ("e", "ce", "e,e", "ce,e", "e,ce", "ce,ce"),
                                help="""Omega parameters to estimate for FWR,CDR respectively: 
-                               e = estimate, ce = estimate + confidence interval""")
+                               e = estimate, ce = estimate + confidence interval, or numeric value""")
     igphyml_group.add_argument("-t", action="store", dest="kappa", type=str, default="e",
-                               choices=("e", "ce"),
                                help="""Kappa parameters to estimate: 
-                               e = estimate, ce = estimate + confidence interval""")
+                               e = estimate, ce = estimate + confidence interval, or numeric value""")
     igphyml_group.add_argument("--motifs", action="store", dest="motifs", type=str,
                                default="WRC_2:0,GYW_0:1,WA_1:2,TW_0:3,SYC_2:4,GRS_0:5",
                                help="""Which motifs to estimate mutability.""")
     igphyml_group.add_argument("--hotness", action="store", dest="hotness", type=str, default="e,e,e,e,e,e",
                                help="""Mutability parameters to estimate: 
-                               e = estimate, ce = estimate + confidence interval""")
+                               e = estimate, ce = estimate + confidence interval, or numeric value""")
     igphyml_group.add_argument("--oformat", action="store", dest="oformat", type=str, default="tab",
                                choices=("tab", "txt"),
                                help="""IgPhyML output format.""")
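
Reviewer note on the hunk above: dropping `choices` lets `--omega` and `-t` take numeric values, but it also means argparse no longer rejects malformed input. One possible way to keep validation is a custom `type` callable, sketched below. `estimate_or_number` is a hypothetical helper, not part of changeo, and the rule that only finite numbers are allowed is an assumption of this sketch.

```python
import argparse
import math


def estimate_or_number(allowed_counts=None):
    """Build an argparse type for comma-separated 'e', 'ce', or numbers."""
    def _parse(text):
        parts = text.split(',')
        if allowed_counts is not None and len(parts) not in allowed_counts:
            raise argparse.ArgumentTypeError(
                'expected %s comma-separated values, got %d'
                % (' or '.join(str(n) for n in allowed_counts), len(parts)))
        for part in parts:
            if part in ('e', 'ce'):
                continue
            try:
                value = float(part)
            except ValueError:
                raise argparse.ArgumentTypeError(
                    "'%s' is not e, ce, or a number" % part)
            if not math.isfinite(value):
                raise argparse.ArgumentTypeError(
                    "'%s' is not a finite number" % part)
        # Return the original string so downstream code is unchanged
        return text
    return _parse


parser = argparse.ArgumentParser()
# One or two values, matching the shapes of the removed choices tuple
parser.add_argument('--omega', type=estimate_or_number((1, 2)), default='e,e')
parser.add_argument('-t', dest='kappa', type=estimate_or_number((1,)), default='e')
```

With this sketch, `--omega 0.5,ce` is accepted and `--omega fast` exits with a usage error.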


=====================================
bin/MakeDb.py
=====================================
@@ -118,8 +118,11 @@ def getIDforIMGT(seq_file, imgt_id_len=default_imgt_id_len):
     for rec in readSeqFile(seq_file):
         if len(rec.description) <= imgt_id_len:
             id_key = rec.description
-        else:
-            id_key = re.sub('\||\s|!|&|\*|<|>|\?', '_', rec.description[:imgt_id_len])
+        else: # truncate and replace characters
+            if imgt_id_len == 49: # 28 September 2021 (version 1.8.4)
+                id_key = re.sub('\s|\t', '_', rec.description[:imgt_id_len])
+            else: # older versions
+                id_key = re.sub('\||\s|!|&|\*|<|>|\?', '_', rec.description[:imgt_id_len])
         ids.update({id_key: rec.description})
 
     return ids
@@ -145,8 +148,8 @@ def writeDb(records, fields, aligner_file, total_count, id_dict=None, annotation
             writer=AIRRWriter, out_file=None, out_args=default_out_args):
     """
     Writes parsed records to an output file
-    
-    Arguments: 
+
+    Arguments:
       records : a iterator of Receptor objects containing alignment data.
       fields : a list of ordered field names to write.
       aligner_file : input file name.
@@ -355,7 +358,8 @@ def writeDb(records, fields, aligner_file, total_count, id_dict=None, annotation
 
 
 def parseIMGT(aligner_file, seq_file=None, repo=None, cellranger_file=None, partial=False, asis_id=True,
-              extended=False, format=default_format, out_file=None, out_args=default_out_args, imgt_id_len=default_imgt_id_len):
+              extended=False, format=default_format, out_file=None, out_args=default_out_args,
+              imgt_id_len=default_imgt_id_len):
     """
     Main for IMGT aligned sample sequences.
 
@@ -396,7 +400,7 @@ def parseIMGT(aligner_file, seq_file=None, repo=None, cellranger_file=None, part
 
     # Get (parsed) IDs from fasta file submitted to IMGT
     id_dict = getIDforIMGT(seq_file, imgt_id_len) if seq_file else {}
-    
+
     # Load supplementary annotation table
     if cellranger_file is not None:
         f = cellranger_extended if extended else cellranger_base
@@ -438,7 +442,7 @@ def parseIMGT(aligner_file, seq_file=None, repo=None, cellranger_file=None, part
                 printWarning('Germline reference sequences do not appear to contain IMGT-numbering spacers. Results may be incorrect.')
             germ_iter = (addGermline(x, references) for x in parse_iter)
         # Write db
-        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count, 
+        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count,
                          annotations=annotations, id_dict=id_dict, asis_id=asis_id, partial=partial,
                          writer=writer, out_file=out_file, out_args=out_args)
 
@@ -535,7 +539,7 @@ def parseIgBLAST(aligner_file, seq_file, repo, amino_acid=False, cellranger_file
     with open(aligner_file, 'r') as f:
         parse_iter = parser(f, seq_dict, references, regions=regions, asis_calls=asis_calls, infer_junction=infer_junction)
         germ_iter = (addGermline(x, references, amino_acid=amino_acid) for x in parse_iter)
-        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count, 
+        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count,
                          annotations=annotations, amino_acid=amino_acid, partial=partial, asis_id=asis_id,
                          regions=regions, writer=writer, out_file=out_file, out_args=out_args)
 
@@ -614,7 +618,7 @@ def parseIHMM(aligner_file, seq_file, repo, cellranger_file=None, partial=False,
     with open(aligner_file, 'r') as f:
         parse_iter = IHMMuneReader(f, seq_dict, references)
         germ_iter = (addGermline(x, references) for x in parse_iter)
-        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count, 
+        output = writeDb(germ_iter, fields=fields, aligner_file=aligner_file, total_count=total_count,
                         annotations=annotations, asis_id=asis_id, partial=partial,
                         writer=writer, out_file=out_file, out_args=out_args)
 
@@ -625,7 +629,7 @@ def getArgParser():
     """
     Defines the ArgumentParser.
 
-    Returns: 
+    Returns:
       argparse.ArgumentParser
     """
     fields = dedent(
@@ -637,34 +641,34 @@ def getArgParser():
                   db-fail
                       database with records that fail due to no productivity information,
                       no gene V assignment, no J assignment, or no junction region.
-                 
+
               universal output fields:
-                 sequence_id, sequence, sequence_alignment, germline_alignment, 
-                 rev_comp, productive, stop_codon, vj_in_frame, locus, 
-                 v_call, d_call, j_call, junction, junction_length, junction_aa, 
+                 sequence_id, sequence, sequence_alignment, germline_alignment,
+                 rev_comp, productive, stop_codon, vj_in_frame, locus,
+                 v_call, d_call, j_call, junction, junction_length, junction_aa,
                  v_sequence_start, v_sequence_end, v_germline_start, v_germline_end,
                  d_sequence_start, d_sequence_end, d_germline_start, d_germline_end,
                  j_sequence_start, j_sequence_end, j_germline_start, j_germline_end,
                  np1_length, np2_length, fwr1, fwr2, fwr3, fwr4, cdr1, cdr2, cdr3
 
               imgt specific output fields:
-                  n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length, 
-                  d_frame, v_score, v_identity, d_score, d_identity, j_score, j_identity 
-                               
+                  n1_length, n2_length, p3v_length, p5d_length, p3d_length, p5j_length,
+                  d_frame, v_score, v_identity, d_score, d_identity, j_score, j_identity
+
               igblast specific output fields:
-                  v_score, v_identity, v_support, v_cigar, 
-                  d_score, d_identity, d_support, d_cigar, 
+                  v_score, v_identity, v_support, v_cigar,
+                  d_score, d_identity, d_support, d_cigar,
                   j_score, j_identity, j_support, j_cigar
 
               ihmm specific output fields:
                   vdj_score
-                  
+
               10X specific output fields:
-                  cell_id, c_call, consensus_count, umi_count, 
+                  cell_id, c_call, consensus_count, umi_count,
                   v_call_10x, d_call_10x, j_call_10x,
                   junction_10x, junction_10x_aa
               ''')
-                
+
     # Define ArgumentParser
     parser = ArgumentParser(description=__doc__, epilog=fields,
                             formatter_class=CommonHelpFormatter, add_help=False)
@@ -686,8 +690,7 @@ def getArgParser():
                                            help='Process igblastn output.',
                                            description='Process igblastn output.')
     group_igblast = parser_igblast.add_argument_group('aligner parsing arguments')
-    group_igblast.add_argument('-i', nargs='+', action='store', dest='aligner_files',
-                                required=True,
+    group_igblast.add_argument('-i', nargs='+', action='store', dest='aligner_files', required=True,
                                 help='''IgBLAST output files in format 7 with query sequence
                                      (igblastn argument \'-outfmt "7 std qseq sseq btop"\').''')
     group_igblast.add_argument('-r', nargs='+', action='store', dest='repo', required=True,
@@ -716,18 +719,18 @@ def getArgParser():
     group_igblast.add_argument('--partial', action='store_true', dest='partial',
                                 help='''If specified, include incomplete V(D)J alignments in
                                      the pass file instead of the fail file. An incomplete alignment
-                                     is defined as a record for which a valid IMGT-gapped sequence 
-                                     cannot be built or that is missing a V gene assignment, 
+                                     is defined as a record for which a valid IMGT-gapped sequence
+                                     cannot be built or that is missing a V gene assignment,
                                      J gene assignment, junction region, or productivity call.''')
     group_igblast.add_argument('--extended', action='store_true', dest='extended',
-                               help='''Specify to include additional aligner specific fields in the output. 
+                               help='''Specify to include additional aligner specific fields in the output.
                                     Adds <vdj>_score, <vdj>_identity, <vdj>_support, <vdj>_cigar,
                                     fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.''')
     group_igblast.add_argument('--regions', action='store', dest='regions',
                                choices=('default', 'rhesus-igl'), default='default',
                                help='''IMGT CDR and FWR boundary definition to use.''')
     group_igblast.add_argument('--infer-junction', action='store_true', dest='infer_junction',
-                                 help='''Infer the junction sequence. For use with IgBLAST v1.6.0 or older, 
+                                 help='''Infer the junction sequence. For use with IgBLAST v1.6.0 or older,
                                  prior to the addition of IMGT-CDR3 inference.''')
     parser_igblast.set_defaults(func=parseIgBLAST, amino_acid=False)
 
@@ -737,8 +740,7 @@ def getArgParser():
                                            help='Process igblastp output.',
                                            description='Process igblastp output.')
     group_igblast_aa = parser_igblast_aa.add_argument_group('aligner parsing arguments')
-    group_igblast_aa.add_argument('-i', nargs='+', action='store', dest='aligner_files',
-                                  required=True,
+    group_igblast_aa.add_argument('-i', nargs='+', action='store', dest='aligner_files', required=True,
                                   help='''IgBLAST output files in format 7 with query sequence
                                        (igblastp argument \'-outfmt "7 std qseq sseq btop"\').''')
     group_igblast_aa.add_argument('-r', nargs='+', action='store', dest='repo', required=True,
@@ -763,7 +765,7 @@ def getArgParser():
                                        the sequence identifiers in the reference sequence set and the IgBLAST
                                        database to be exact string matches.''')
     group_igblast_aa.add_argument('--extended', action='store_true', dest='extended',
-                                  help='''Specify to include additional aligner specific fields in the output. 
+                                  help='''Specify to include additional aligner specific fields in the output.
                                        Adds v_score, v_identity, v_support, v_cigar, fwr1, fwr2, fwr3, cdr1 and cdr2.''')
     group_igblast_aa.add_argument('--regions', action='store', dest='regions',
                                   choices=('default', 'rhesus-igl'), default='default',
@@ -779,21 +781,21 @@ def getArgParser():
                                         description='''Process IMGT/HighV-Quest output
                                              (does not work with V-QUEST).''')
     group_imgt = parser_imgt.add_argument_group('aligner parsing arguments')
-    group_imgt.add_argument('-i', nargs='+', action='store', dest='aligner_files',
+    group_imgt.add_argument('-i', nargs='+', action='store', dest='aligner_files', required=True,
                             help='''Either zipped IMGT output files (.zip or .txz) or a
                                  folder containing unzipped IMGT output files (which must
                                  include 1_Summary, 2_IMGT-gapped, 3_Nt-sequences,
                                  and 6_Junction).''')
     group_imgt.add_argument('-s', nargs='*', action='store', dest='seq_files', required=False,
                             help='''List of FASTA files (with .fasta, .fna or .fa
-                                  extension) that were submitted to IMGT/HighV-QUEST. 
+                                  extension) that were submitted to IMGT/HighV-QUEST.
                                   If unspecified, sequence identifiers truncated by IMGT/HighV-QUEST
                                   will not be corrected.''')
     group_imgt.add_argument('-r', nargs='+', action='store', dest='repo', required=False,
                             help='''List of folders and/or fasta files containing
-                                 the germline sequence set used by IMGT/HighV-QUEST. 
+                                 the germline sequence set used by IMGT/HighV-QUEST.
                                  These reference sequences must contain IMGT-numbering spacers (gaps)
-                                 in the V segment. If unspecified, the germline sequence reconstruction 
+                                 in the V segment. If unspecified, the germline sequence reconstruction
                                  will not be included in the output.''')
     group_imgt.add_argument('--10x', action='store', nargs='+', dest='cellranger_file',
                             help='''Table file containing 10X annotations (with .csv or .tsv
@@ -807,17 +809,17 @@ def getArgParser():
     group_imgt.add_argument('--partial', action='store_true', dest='partial',
                             help='''If specified, include incomplete V(D)J alignments in
                                  the pass file instead of the fail file. An incomplete alignment
-                                 is defined as a record that is missing a V gene assignment, 
+                                 is defined as a record that is missing a V gene assignment,
                                  J gene assignment, junction region, or productivity call.''')
     group_imgt.add_argument('--extended', action='store_true', dest='extended',
-                            help='''Specify to include additional aligner specific fields in the output. 
+                            help='''Specify to include additional aligner specific fields in the output.
                                  Adds <vdj>_score, <vdj>_identity>, fwr1, fwr2, fwr3, fwr4,
-                                 cdr1, cdr2, cdr3, n1_length, n2_length, p3v_length, p5d_length, 
+                                 cdr1, cdr2, cdr3, n1_length, n2_length, p3v_length, p5d_length,
                                  p3d_length, p5j_length and d_frame.''')
     group_imgt.add_argument('--imgt-id-len', action='store', dest='imgt_id_len', type=int,
                             default=default_imgt_id_len,
-                            help='''The maximum character length of sequence identifiers reported by IMGT/HighV-QUEST. 
-                            Specify 50 if the IMGT files (-i) were generated with an IMGT/HighV-QUEST version older 
+                            help='''The maximum character length of sequence identifiers reported by IMGT/HighV-QUEST.
+                            Specify 50 if the IMGT files (-i) were generated with an IMGT/HighV-QUEST version older
                             than 1.8.3 (May 7, 2021).''')
     parser_imgt.set_defaults(func=parseIMGT)
 
@@ -851,18 +853,18 @@ def getArgParser():
     group_ihmm.add_argument('--partial', action='store_true', dest='partial',
                              help='''If specified, include incomplete V(D)J alignments in
                                   the pass file instead of the fail file. An incomplete alignment
-                                     is defined as a record for which a valid IMGT-gapped sequence 
-                                     cannot be built or that is missing a V gene assignment, 
+                                     is defined as a record for which a valid IMGT-gapped sequence
+                                     cannot be built or that is missing a V gene assignment,
                                      J gene assignment, junction region, or productivity call.''')
     group_ihmm.add_argument('--extended', action='store_true', dest='extended',
-                             help='''Specify to include additional aligner specific fields in the output. 
+                             help='''Specify to include additional aligner specific fields in the output.
                                   Adds the path score of the iHMMune-Align hidden Markov model as vdj_score;
                                   adds fwr1, fwr2, fwr3, fwr4, cdr1, cdr2 and cdr3.''')
     parser_ihmm.set_defaults(func=parseIHMM)
 
     return parser
-    
-    
+
+
 if __name__ == "__main__":
     """
     Parses command line arguments and calls main
@@ -881,7 +883,7 @@ if __name__ == "__main__":
     if 'seq_files' in args_dict: del args_dict['seq_files']
     if 'out_files' in args_dict: del args_dict['out_files']
     if 'command' in args_dict: del args_dict['command']
-    if 'func' in args_dict: del args_dict['func']           
+    if 'func' in args_dict: del args_dict['func']
 
     # Call main
     for i, f in enumerate(args.__dict__['aligner_files']):
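
Reviewer note on the `getIDforIMGT` hunk in this file: the identifier key now depends on `imgt_id_len`. With a length of 49, only whitespace in the truncated description is replaced; any other length keeps the older, wider character set. The standalone sketch below mirrors that branch so it can be exercised in isolation. `imgt_id_key` is a hypothetical name, and the sketch writes the new pattern as `\s` alone, since `\s` already matches a tab.

```python
import re

# Character sets copied from the hunk; raw strings avoid escape warnings
OLD_SPECIAL = re.compile(r'\||\s|!|&|\*|<|>|\?')
NEW_SPECIAL = re.compile(r'\s')


def imgt_id_key(description, imgt_id_len=49):
    """Return the key used to match a truncated IMGT/HighV-QUEST identifier."""
    if len(description) <= imgt_id_len:
        # Short identifiers are used unchanged
        return description
    pattern = NEW_SPECIAL if imgt_id_len == 49 else OLD_SPECIAL
    return pattern.sub('_', description[:imgt_id_len])
```

For example, a 66-character header `seq|1 AAAA...` keeps its `|` when `imgt_id_len` is 49 and has it replaced by `_` when `imgt_id_len` is 50.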


=====================================
bin/ParseDb.py
=====================================
@@ -142,12 +142,15 @@ def splitDbFile(db_file, field, num_split=None, out_args=default_out_args):
         log['OUTPUT%i' % (i + 1)] = os.path.basename(handles_dict[k].name)
     log['RECORDS'] = rec_count
     log['PARTS'] = len(handles_dict)
-    log['END'] = 'ParseDb'
-    printLog(log)
 
-    # Close output file handles
+    # Close output file handles and log file size
     db_handle.close()
-    for t in handles_dict: handles_dict[t].close()
+    for i, t in enumerate(handles_dict):
+        handles_dict[t].close()
+        log['SIZE%i' % (i + 1)] = countDbFile(handles_dict[t].name)
+
+    log['END'] = 'ParseDb'
+    printLog(log)
 
     return [handles_dict[t].name for t in handles_dict]
 
@@ -364,7 +367,7 @@ def deleteDbFile(db_file, fields, values, logic='any', regex=False,
     """
     Deletes records from a database file
 
-    Arguments: 
+    Arguments:
       db_file : the database file name.
       fields : a list of fields to check for deletion criteria.
       values : a list of values defining deletion targets.
@@ -372,8 +375,8 @@ def deleteDbFile(db_file, fields, values, logic='any', regex=False,
       regex : if False do exact full string matches; if True allow partial regex matches.
       out_file : output file name. Automatically generated from the input file if None.
       out_args : common output argument dictionary from parseCommonArgs.
-                    
-    Returns: 
+
+    Returns:
       str : output file name.
     """
     # Define string match function
@@ -428,14 +431,14 @@ def deleteDbFile(db_file, fields, values, logic='any', regex=False,
         rec_count += 1
         # Check for deletion values in all fields
         delete = _logic_func([_match_func(rec.get(f, False), values) for f in fields])
-        
+
         # Write sequences
         if not delete:
             pass_count += 1
             pass_writer.writeDict(rec)
         else:
             fail_count += 1
-        
+
     # Print counts
     printProgress(rec_count, result_count, 0.05, start_time=start_time)
     log = OrderedDict()
@@ -449,7 +452,7 @@ def deleteDbFile(db_file, fields, values, logic='any', regex=False,
     # Close file handles
     pass_handle.close()
     db_handle.close()
- 
+
     return pass_handle.name
 
 
@@ -867,10 +870,10 @@ def getArgParser():
     """
     Defines the ArgumentParser
 
-    Arguments: 
+    Arguments:
     None
-                      
-    Returns: 
+
+    Returns:
     an ArgumentParser object
     """
     # Define input and output field help message
@@ -888,7 +891,7 @@ def getArgParser():
              required fields:
                  sequence_id
              ''')
-    
+
     # Define ArgumentParser
     parser = ArgumentParser(description=__doc__, epilog=fields,
                             formatter_class=CommonHelpFormatter, add_help=False)
@@ -1027,11 +1030,11 @@ def getArgParser():
                                          description='Merges files.')
     group_merge = parser_merge.add_argument_group('parsing arguments')
     group_merge.add_argument('-o', action='store', dest='out_file', default=None,
-                              help='''Explicit output file name. Note, this argument cannot be used with 
+                              help='''Explicit output file name. Note, this argument cannot be used with
                                    the --failed, --outdir or --outname arguments.''')
     group_merge.add_argument('--drop', action='store_true', dest='drop',
                               help='''If specified, drop fields that do not exist in all input files.
-                                   Otherwise, include all columns in all files and fill missing data 
+                                   Otherwise, include all columns in all files and fill missing data
                                    with empty strings.''')
     parser_merge.set_defaults(func=mergeDbFiles)
 
@@ -1092,4 +1095,4 @@ if __name__ == '__main__':
             args_dict['out_file'] = args.__dict__['out_files'][i] \
                 if args.__dict__['out_files'] else None
             args.func(**args_dict)
- 
+
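
Reviewer note on the `splitDbFile` hunk in this file: the log is now printed after the output handles are closed, so each `SIZE` entry can be counted from a fully flushed file. A minimal sketch of that ordering follows. `split_rows` and `count_records` are hypothetical stand-ins (the diff itself calls `countDbFile`), and the tab-delimited layout with one header line is an assumption.

```python
import csv
import os


def count_records(path):
    """Count data rows in a tab-delimited file, excluding the header line."""
    with open(path, newline='') as handle:
        rows = sum(1 for _ in csv.reader(handle, delimiter='\t'))
    return max(rows - 1, 0)


def split_rows(rows, field, out_dir):
    """Write rows to one file per value of `field` and return a log dict."""
    handles, writers = {}, {}
    for row in rows:
        key = row[field]
        if key not in handles:
            path = os.path.join(out_dir, '%s.tsv' % key)
            handles[key] = open(path, 'w', newline='')
            writers[key] = csv.DictWriter(handles[key], fieldnames=list(row),
                                          delimiter='\t')
            writers[key].writeheader()
        writers[key].writerow(row)

    log = {}
    for i, key in enumerate(handles):
        log['OUTPUT%i' % (i + 1)] = os.path.basename(handles[key].name)
    for i, key in enumerate(handles):
        # Close first so buffered rows reach disk before counting
        handles[key].close()
        log['SIZE%i' % (i + 1)] = count_records(handles[key].name)
    return log
```

Counting before `close()` could undercount, because rows may still sit in the write buffer.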


=====================================
changeo.egg-info/PKG-INFO
=====================================
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: changeo
-Version: 1.1.0
+Version: 1.2.0
 Summary: A bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
 Home-page: http://changeo.readthedocs.io
 Author: Namita Gupta, Jason Anthony Vander Heiden


=====================================
changeo.egg-info/requires.txt
=====================================
@@ -4,5 +4,5 @@ pandas>=0.24
 biopython>=1.77
 PyYAML>=5.1
 setuptools>=2.0
-presto>=0.6.2
+presto>=0.7.0
 airr>=1.3.1


=====================================
changeo/IO.py
=====================================
@@ -13,11 +13,12 @@ import yaml
 import zipfile
 from itertools import chain, groupby, zip_longest
 from tempfile import TemporaryDirectory
+from textwrap import indent
 from Bio import SeqIO
 from Bio.Seq import Seq
 
 # Presto and changeo imports
-from presto.IO import getFileType, printError, printWarning
+from presto.IO import getFileType, printError, printWarning, printDebug
 from changeo.Defaults import default_csv_size
 from changeo.Gene import getAllele, getLocus, getVAllele, getDAllele, getJAllele
 from changeo.Receptor import AIRRSchema, AIRRSchemaAA, ChangeoSchema, ChangeoSchemaAA, Receptor, ReceptorData
@@ -2167,13 +2168,14 @@ class IHMMuneReader:
             return db
 
 
-def readGermlines(references, asis=False):
+def readGermlines(references, asis=False, warn=False):
     """
     Parses germline repositories
 
     Arguments:
       references (list): list of strings specifying directories and/or files from which to read germline records.
       asis (bool): if True use sequence ID as record name and do not parse headers for allele names.
+      warn (bool): print warning messages to standard error if True.
 
     Returns:
       dict: Dictionary of germlines in the form {allele: sequence}.
@@ -2194,12 +2196,20 @@ def readGermlines(references, asis=False):
         printError('No valid germline fasta files (.fasta, .fna, .fa) were found at %s.' % ','.join(references))
 
     repo_dict = {}
+    duplicates = []
     for file_name in repo_files:
         with open(file_name, 'rU') as file_handle:
             germlines = SeqIO.parse(file_handle, 'fasta')
             for g in germlines:
                 germ_key = getAllele(g.description, 'first') if not asis else g.id
-                repo_dict[germ_key] = str(g.seq).upper()
+                if germ_key not in repo_dict:
+                    repo_dict[germ_key] = str(g.seq).upper()
+                else:
+                    duplicates.append(g.description)
+
+    if warn and len(duplicates) > 0:
+        w = indent('\n'.join(duplicates), ' '*9)
+        printWarning('Duplicated germline allele names excluded from references:\n%s' % w)
 
     return repo_dict
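
Reviewer note on the `readGermlines` hunk above: when an allele name repeats, the first sequence is now kept and later ones are collected for an optional warning, where previously the last one silently won. The sketch below isolates that behaviour on plain `(name, sequence)` pairs. `first_wins` and `duplicate_warning` are hypothetical helpers, and the allele names in the usage line are made up for illustration.

```python
from textwrap import indent


def first_wins(records):
    """Keep the first sequence per allele name and collect later duplicates."""
    repo, duplicates = {}, []
    for name, seq in records:
        if name not in repo:
            repo[name] = seq.upper()
        else:
            duplicates.append(name)
    return repo, duplicates


def duplicate_warning(duplicates, pad=9):
    """Format the warning text, or return None when nothing was dropped."""
    if not duplicates:
        return None
    listing = indent('\n'.join(duplicates), ' ' * pad)
    return 'Duplicated germline allele names excluded from references:\n%s' % listing


repo, dups = first_wins([('IGKV1*01', 'acgt'), ('IGKV1*01', 'tttt'), ('IGKV2*01', 'gg')])
```

Here `repo` keeps `ACGT` for the repeated name, and `dups` lists the name once.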
 


=====================================
changeo/Version.py
=====================================
@@ -5,5 +5,5 @@ Version and authorship information
 __author__    = 'Namita Gupta, Jason Anthony Vander Heiden'
 __copyright__ = 'Copyright 2021 Kleinstein Lab, Yale University. All rights reserved.'
 __license__   = 'GNU Affero General Public License 3 (AGPL-3)'
-__version__   = '1.1.0'
-__date__      = '2021.06.21'
+__version__   = '1.2.0'
+__date__      = '2021.10.29'


=====================================
requirements.txt
=====================================
@@ -4,5 +4,5 @@ pandas>=0.24
 biopython>=1.77
 PyYAML>=5.1
 setuptools>=2.0
-presto>=0.6.2
+presto>=0.7.0
 airr>=1.3.1



View it on GitLab: https://salsa.debian.org/med-team/changeo/-/commit/da695a88ac5d3597e58301e55cda02ee5771f446


