[med-svn] [Git][med-team/gubbins][master] 4 commits: routine-update: New upstream version

Andreas Tille (@tille) gitlab at salsa.debian.org
Wed Jul 12 20:36:47 BST 2023



Andreas Tille pushed to branch master at Debian Med / gubbins


Commits:
a780ff6c by Andreas Tille at 2023-07-12T21:30:41+02:00
routine-update: New upstream version

- - - - -
6930509c by Andreas Tille at 2023-07-12T21:30:42+02:00
New upstream version 3.3.0
- - - - -
f117de3e by Andreas Tille at 2023-07-12T21:30:50+02:00
Update upstream source from tag 'upstream/3.3.0'

Update to upstream version '3.3.0'
with Debian dir b9cc26b0d8c0608a7e76714d879f7b0839326bc2
- - - - -
af53dc3f by Andreas Tille at 2023-07-12T21:34:37+02:00
routine-update: Ready to upload to unstable

- - - - -


18 changed files:

- CHANGELOG.md
- VERSION
- debian/changelog
- debian/control
- docs/gubbins_manual.md
- docs/gubbins_tutorial.md
- environment.yml
- python/gubbins/common.py
- + python/gubbins/tests/data/corrected_alignment.aln
- + python/gubbins/tests/data/corrected_alignment.csv
- + python/gubbins/tests/data/invalid_alignment.aln
- + python/gubbins/tests/data/multiple_recombinations_annotation.gff
- + python/gubbins/tests/data/multiple_recombinations_rec_per_gene.tab
- python/gubbins/tests/test_python_scripts.py
- + python/scripts/count_recombinations_per_gene.py
- python/scripts/generate_ska_alignment.py
- python/scripts/gubbins_alignment_checker.py
- python/setup.py


Changes:

=====================================
CHANGELOG.md
=====================================
@@ -1,6 +1,7 @@
 # Change Log
 
-## [v3.3]
+## [v3.3](https://github.com/nickjcroucher/gubbins/releases/tag/v3.3) (2022-10-6)
+[Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.2.1...v3.3)
 
 - Enable time calibration of final tree
 - Use separate models for tree construction and SNP reconstruction


=====================================
VERSION
=====================================
@@ -1 +1 @@
-3.2.1
+3.3.0


=====================================
debian/changelog
=====================================
@@ -1,4 +1,4 @@
-gubbins (3.3-1) UNRELEASED; urgency=medium
+gubbins (3.3.0-1) unstable; urgency=medium
 
   * New upstream version
   * Standards-Version: 4.6.2 (routine-update)
@@ -8,7 +8,7 @@ gubbins (3.3-1) UNRELEASED; urgency=medium
   * Build-Depends: tox and do not override dh_auto_test
   * Test-Depends / Recommends: python3-numba
 
- -- Andreas Tille <tille at debian.org>  Tue, 07 Feb 2023 10:04:38 +0100
+ -- Andreas Tille <tille at debian.org>  Wed, 12 Jul 2023 21:30:57 +0200
 
 gubbins (2.4.1-5) unstable; urgency=medium
 


=====================================
debian/control
=====================================
@@ -10,7 +10,7 @@ Build-Depends: debhelper-compat (= 13),
                pkg-config,
                fasttree,
                iqtree,
-               raxml (>= 8.2.12+dfsg-5),
+               raxml,
                python3-setuptools,
                zlib1g-dev,
                check,


=====================================
docs/gubbins_manual.md
=====================================
@@ -52,13 +52,13 @@ These will automatically be installed within the conda environment. Please cite
 
 The required input file for Gubbins is a whole genome FASTA alignment. Each sequence should have a unique identifier, and special characters should be avoided. The sequences should only use the characters `ACGT` (DNA bases), `N` (unknown base) or `-` (alignment gap). If a starting tree is to be included, then this should be a Newick format.
 
-The alignment is most easily generated through mapping sequences against a reference sequence. This can be achieved with the popular mapping software Snippy, following the instructions on the relevant [Github repository](https://github.com/tseemann/snippy). Alternatively, the alignment can be generated using the Gubbins script `generate_ska_alignment.py`, which creates an alignment using [SKA](https://github.com/simonrharris/SKA), which can be installed through `conda install -c bioconda ska` (SKA is included when installing Gubbins through conda). For instance,
+The alignment is most easily generated through mapping sequences against a reference sequence. This can be achieved with the popular mapping software Snippy, following the instructions on the relevant [Github repository](https://github.com/tseemann/snippy). Alternatively, the alignment can be generated using the Gubbins script `generate_ska_alignment.py`, which creates an alignment using [SKA2](https://github.com/bacpop/ska.rust), which can be installed through `conda install -c bioconda ska2` (SKA2 is included when installing Gubbins through conda). For instance,
 
 ```
-generate_ska_alignment.py --reference seq_X.fa --fasta fasta_files.list --fastq fastq_files.list --out out.aln
+generate_ska_alignment.py --reference seq_X.fa --input input.list --out out.aln
 ```
 
-Where `fasta_files.list` is a two column tab-delimited file containing sequence names (in the first column) and FASTA sequence assembly file paths (in the second column); `fastq_files.list` contains the same for unassembled FASTQ-format read data.
+Where `input.list` is a tab-delimited file with one row per isolate. The first column should be the isolate name, and the subsequent entries on the same row should contain the corresponding sequence data - this may be a single FASTA assembly, or multiple FASTQ raw read files. The alignment will be reformatted for Gubbins (e.g. modifying isolate names and removing non-N IUPAC ambiguity codes) by this script.
 
 The alignment can then be analysed with Gubbins:
 


=====================================
docs/gubbins_tutorial.md
=====================================
@@ -10,10 +10,10 @@ tar xfz PMEN3_assemblies.tar.gz
 
 ## Generating the alignment
 
-The draft genomes can be aligned to the reference using [SKA](https://github.com/simonrharris/SKA). This is first installed through conda:
+The draft genomes can be aligned to the reference using [SKA2](https://github.com/bacpop/ska.rust). This is first installed through conda:
 
 ```
-conda install -c bioconda ska
+conda install -c bioconda ska2
 ```
 
 An index file is then generated to name the isolates to be aligned:
@@ -38,7 +38,7 @@ The `PMEN3_isolates.list` should contain this text:
 The alignment is then constructed using the Gubbins script `generate_ska_alignment.py`:
 
 ```
-generate_ska_alignment.py --reference RMV4.fa --fasta PMEN3_isolates.list --out PMEN3.aln
+generate_ska_alignment.py --reference RMV4.fa --input PMEN3_isolates.list --out PMEN3.aln
 ```
 
 ## Analysis with Gubbins


=====================================
environment.yml
=====================================
@@ -25,7 +25,7 @@ dependencies:
   - dendropy
   - biopython
   - multiprocess
-  - numpy
+  - numpy<=1.23.0
   - numba
 # phylogenetics
   - raxml=8.2.12
@@ -34,4 +34,4 @@ dependencies:
   - raxml-ng=1.0.1
   - fasttree=2.1.10
 # Scripts
-  - ska
+  - ska2


=====================================
python/gubbins/common.py
=====================================
@@ -557,7 +557,7 @@ def parse_and_run(input_args, program_description=""):
             subprocess.check_call(dating_command, shell=True)
         except subprocess.SubprocessError:
             # If this fails, continue to generate rest of output
-            sys.write("Failed running tree time calibration with LSD.")
+            sys.stderr.write("Failed running tree time calibration with LSD.")
             input_args.date = None
 
     # Create the final output


=====================================
python/gubbins/tests/data/corrected_alignment.aln
=====================================
@@ -0,0 +1,8 @@
+>sequence1
+AAANNNNAAA
+>sequence2
+CCCCCCCCCC
+>sequence3
+GGGGGGGGGG
+>sequence4
+TTTTTTTTTT


=====================================
python/gubbins/tests/data/corrected_alignment.csv
=====================================
@@ -0,0 +1,5 @@
+isolate,A,C,G,N,T
+sequence1,6,0,0,4,0
+sequence2,0,10,0,0,0
+sequence3,0,0,10,0,0
+sequence4,0,0,0,0,10


=====================================
python/gubbins/tests/data/invalid_alignment.aln
=====================================
@@ -0,0 +1,8 @@
+>sequence1
+AAARRRRAAA
+>sequence2
+CCCCCCCCCC
+>sequence3
+GGGGGGGGGG
+>sequence4
+TTTTTTTTTT


=====================================
python/gubbins/tests/data/multiple_recombinations_annotation.gff
=====================================
@@ -0,0 +1,5 @@
+##gff-version 3
+##sequence-region CONTIG1 1 20
+##sequence-region CONTIG2 1 200
+CONTIG2	EMBL	CDS	50	80	0.000	+	0	ID="CDS1";transl_table=11;gene="CDS1";locus_tag="CDS1";product="protein"
+CONTIG2	EMBL	CDS	80	120	0.000	+	0	ID="CDS2";transl_table=11;gene="CDS2";locus_tag="CDS2";product="protein"


=====================================
python/gubbins/tests/data/multiple_recombinations_rec_per_gene.tab
=====================================
@@ -0,0 +1,3 @@
+CDS	GeneName	Start	End	NumRec	NumAffectedTaxa	AffectedTaxa
+CDS1	CDS1	70	100	3	9	sequence_1;sequence_2;sequence_3;sequence_4;sequence_5;sequence_6;sequence_7;sequence_8;sequence_9
+CDS2	CDS2	100	140	1	1	sequence_10


=====================================
python/gubbins/tests/test_python_scripts.py
=====================================
@@ -19,17 +19,28 @@ working_dir = os.path.join(modules_dir, 'tests')
 
 class TestPythonScripts(unittest.TestCase):
 
-    ## Test the alignment_checker script 
-
+    ## Test the alignment_checker script
     def test_alignment_checker(self):
         small_aln = os.path.join(data_dir, "valid_alignment.aln")
-        output_file = os.path.join(working_dir, "valid_alignment_test")
         output_csv = os.path.join(working_dir, "valid_alignment_test.csv")
         test_csv = os.path.join(data_dir, "test_valid_output.csv")
-        aln_cmd = "gubbins_alignment_checker.py --aln " + small_aln + " --out " + output_file
+        aln_cmd = "gubbins_alignment_checker.py --aln " + small_aln + " --out " + output_csv
+        subprocess.check_call(aln_cmd, shell=True)
+        assert self.md5_check(output_csv, test_csv)
+        os.remove(output_csv)
+
+    def test_alignment_reformatting(self):
+        invalid_aln = os.path.join(data_dir, "invalid_alignment.aln")
+        output_file = os.path.join(working_dir, "invalid_alignment_test.aln")
+        output_csv = os.path.join(working_dir, "valid_alignment_test.csv")
+        test_aln = os.path.join(data_dir, "corrected_alignment.aln")
+        test_csv = os.path.join(data_dir, "corrected_alignment.csv")
+        aln_cmd = "gubbins_alignment_checker.py --aln " + invalid_aln + " --out-aln " + output_file + " --out " + output_csv
         subprocess.check_call(aln_cmd, shell=True)
         assert self.md5_check(output_csv, test_csv)
+        assert self.md5_check(output_file, test_aln)
         os.remove(output_csv)
+        os.remove(output_file)
 
     ## Test the clade extraction script
     def test_clade_extraction(self):
@@ -82,9 +93,9 @@ class TestPythonScripts(unittest.TestCase):
         ref_seq = os.path.join(preprocess_dir, 'sequence_t1.fasta')
         aln_out = os.path.join(preprocess_dir, 'ska_test_aln.aln')
         # Script name
-        ska_cmd = "generate_ska_alignment.py --fasta " + fasta_loc +\
+        ska_cmd = "generate_ska_alignment.py --input " + fasta_loc +\
             " --reference " + ref_seq + " --out " + aln_out +\
-                " --k 6"
+                " --k 7"
         subprocess.check_call(ska_cmd, shell=True)
         ## Now run gubbins on the aln and check all the output is produced 
         parser = run_gubbins.parse_input_args()
@@ -122,6 +133,17 @@ class TestPythonScripts(unittest.TestCase):
         os.remove(out_gff)
         os.remove(out_tree)
 
+    # Test clade file extraction script
+    def test_recombination_counting_per_gene(self):
+        multiple_gff = os.path.join(data_dir, "multiple_recombinations_gubbins.recombination_predictions.gff")
+        annnotation_gff = os.path.join(data_dir, "multiple_recombinations_annotation.gff")
+        out_tab = os.path.join(data_dir, "test_multiple_recombinations_annotation.gff")
+        check_tab = os.path.join(data_dir, "multiple_recombinations_rec_per_gene.tab")
+        rec_count_cmd = "count_recombinations_per_gene.py --rec-gff " + multiple_gff + " --anno-gff " + annnotation_gff + " --out " + out_tab
+        subprocess.check_call(rec_count_cmd, shell=True)
+        assert self.md5_check(out_tab, check_tab)
+        os.remove(out_tab)
+                
     @staticmethod
     def check_for_output_files(prefix):
         assert os.path.exists(prefix + '.summary_of_snp_distribution.vcf')


=====================================
python/scripts/count_recombinations_per_gene.py
=====================================
@@ -0,0 +1,131 @@
+#! python
+
+# encoding: utf-8
+# Wellcome Trust Sanger Institute and Imperial College London
+# Copyright (C) 2023  Wellcome Trust Sanger Institute and Imperial College London
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+#
+
+# Generic imports
+import sys
+import argparse
+import re
+
+# command line parsing
+def get_options():
+
+    parser = argparse.ArgumentParser(description='Mask recombinant regions detected by'
+                                                    ' Gubbins from the input alignment',
+                                     prog='mask_gubbins_aln')
+
+    # input options
+    parser.add_argument('--rec-gff',
+                        help = 'GFF of recombinant regions detected by Gubbins',
+                        required = True)
+    parser.add_argument('--anno-gff',
+                        help = 'GFF of annotation corresponding to the input alignment',
+                        required = True)
+    parser.add_argument('--out',
+                        help = 'Output file name',
+                        required = True)
+    return parser.parse_args()
+
+# main code
+if __name__ == "__main__":
+
+    # Get command line options
+    args = get_options()
+    
+    # Read recombinant regions from GFF
+    rec_start = []
+    rec_end = []
+    rec_affected = []
+    taxon_pattern = re.compile('taxa="([^"]*)"')
+    with open(args.rec_gff,'r') as gff_file:
+        for line in gff_file.readlines():
+            if not line.startswith('##'):
+                # Calculate stats
+                info = line.rstrip().split('\t')
+                start = int(info[3])
+                end = int(info[4])
+                taxon_set = set(taxon_pattern.search(info[8]).group(1).split())
+                # Record stats
+                rec_start.append(start)
+                rec_end.append(end)
+                rec_affected.append(taxon_set)
+
+    # Read annotation from GFF
+    cds_start = []
+    cds_end = []
+    cds_name = []
+    cds_index = []
+    contig_starts = {}
+    cumulative_length = 0
+    with open(args.anno_gff,'r') as gff_file:
+        for line in gff_file.readlines():
+            if line.startswith('##sequence-region'):
+                info = line.rstrip().split(' ')
+                contig_starts[info[1]] = int(cumulative_length)
+                cumulative_length = contig_starts[info[1]] + int(info[3])
+            elif not line.startswith('##'):
+                info = line.rstrip().split('\t')
+                if len(info) >= 7:
+                  if info[2] == 'CDS':
+                      name = '-'
+                      index = None
+                      cds_data = info[8].replace('"','').split(';')
+                      for datum in cds_data:
+                          qualifier, value = datum.split('=')
+                          if qualifier == 'locus_tag':
+                              index = value
+                          elif qualifier == 'gene':
+                              name = value
+                      if index is not None:
+                          contig_start = contig_starts[info[0]]
+                          cds_start.append(int(info[3]) + contig_start)
+                          cds_end.append(int(info[4]) + contig_start)
+                          cds_index.append(index)
+                          cds_name.append(name)
+
+    # Run checks on whether genes have been detected
+    if len(cds_index) == 0:
+        sys.stderr.write('No genes detected in annotation\n')
+        sys.exit()
+    elif not (len(cds_start) == len(cds_end) and len(cds_start) == len(cds_index) and len(cds_start) == len(cds_name)):
+        sys.stderr.write('Error with extraction of information on annotation\n')
+        sys.exit()
+    elif not (len(rec_start) == len(rec_end) and len(rec_start) == len(rec_affected)):
+        sys.stderr.write('Error with extraction of information on recombination\n')
+        sys.exit()
+
+    # Write out summary statistics
+    with open(args.out,'w') as out_file:
+        out_file.write('CDS\tGeneName\tStart\tEnd\tNumRec\tNumAffectedTaxa\tAffectedTaxa\n')
+        for cnum,cindex in enumerate(cds_index):
+            num_rec = 0
+            affected_taxa = set()
+            cstart = cds_start[cnum]
+            cend = cds_end[cnum]
+            for rnum,rstart in enumerate(rec_start):
+                rend = rec_end[rnum]
+                if (rstart < cend and rend > cend) or \
+                   (rstart < cstart and rend > cstart) or \
+                   (rstart > cstart and rend < cend):
+                    num_rec = num_rec + 1
+                    affected_taxa = affected_taxa.union(rec_affected[rnum])
+            sorted_affected_taxa = list(affected_taxa)
+            sorted_affected_taxa.sort()
+            out_file.write(cds_index[cnum] + '\t' + cds_name[cnum] + '\t' + str(cds_start[cnum]) + '\t' + str(cds_end[cnum]) + '\t' + str(num_rec) + '\t' + str(len(affected_taxa)) + '\t' + ';'.join(sorted_affected_taxa) + '\n')


=====================================
python/scripts/generate_ska_alignment.py
=====================================
@@ -31,7 +31,7 @@ from shutil import which
 # command line parsing
 def get_options():
 
-    parser = argparse.ArgumentParser(description='Generate a ska alignment from a list '
+    parser = argparse.ArgumentParser(description='Generate a ska2 alignment from a list '
                                                     'of assemblies',
                                      prog='generate_ska_alignment')
 
@@ -39,12 +39,8 @@ def get_options():
     parser.add_argument('--reference',
                         help = 'Name of reference sequence to use for alignment',
                         required = True)
-    parser.add_argument('--fasta',
-                        help = 'Two column list of names and FASTA files to include in alignment',
-                        default = None,
-                        required = False)
-    parser.add_argument('--fastq',
-                        help = 'Two/three column list of names and of FASTQ files to include in alignment',
+    parser.add_argument('--input',
+                        help = 'List of sequence data; one row per isolate, with first column being the isolate name',
                         default = None,
                         required = False)
     parser.add_argument('--out',
@@ -53,7 +49,7 @@ def get_options():
     parser.add_argument('--k',
                         help = 'Split kmer size',
                         type = int,
-                        default = 15)
+                        default = 17)
     parser.add_argument('--threads',
                         help = 'Number of threads to use',
                         type = int,
@@ -89,67 +85,37 @@ if __name__ == "__main__":
 
     # Check if ska is installed
     if which('ska') is None:
-        sys.stderr.write('SKA cannot be found on PATH; install with "conda install ska"')
+        sys.stderr.write('ska2 cannot be found on PATH; install with "conda install ska2"')
         sys.exit(1)
 
-    # Dictionary for sequence names
-    seq_names = {}
-    all_names = []
-
-    # Make split kmers from assemblies
-    fasta_names = []
-    if args.fasta is not None:
-        # Read in FASTA assemblies
-        with open(args.fasta,'r') as fasta_list:
-            for line in fasta_list.readlines():
-                info = line.strip().split()
-                if os.path.isfile(info[1]):
-                    fasta_names.append(info[1])
-                    seq_names[info[1]] = info[0]
-                    all_names.append(info[0])
-                else:
-                    sys.stderr.write('Unable to find file ' + info[1] + '\n')
-        # Sketch into split kmers
-        with Pool(processes = args.threads) as pool:
-            pool.map(partial(map_fasta_sequence,
-                                k = args.k,
-                                names = seq_names),
-                                fasta_names)
-    
-    # Make split kmers from FASTQs
-    fastq_names = []
-    if args.fastq is not None:
-        # Read in FASTQ reads
-        with open(args.fastq,'r') as fastq_list:
-            for line in fastq_list.readlines():
-                info = line.strip().split()
-                if os.path.isfile(info[1]) and os.path.isfile(info[2]):
-                    fastq_names.append((info[1],info[2]))
-                    seq_names[info[1]] = info[0]
-                    all_names.append(info[0])
-                else:
-                    sys.stderr.write('Unable to find files ' + info[1] + ' and ' + info[2] + '\n')
-        # Sketch into split kmers
-        with Pool(processes = args.threads) as pool:
-            return_codes = pool.map(partial(map_fastq_sequence,
-                                            k = args.k,
-                                            names = seq_names),
-                                            fastq_names)
-
-    # Map sequences
-    with Pool(processes = args.threads) as pool:
-        return_codes = pool.map(partial(ska_map_sequences,
-                                        k = args.k,
-                                        ref = args.reference),
-                                        all_names)
-
-    # Generate alignment
-    subprocess.check_output('cat ' + ' '.join([seq + '.map.aln' for seq in all_names]) + ' > ' + args.out,
+    # Check if k value is acceptable:
+    if (args.k % 2) == 0 or args.k < 5 or args.k > 63:
+        sys.stderr.write('k must be odd and between 5 and 63\n')
+        sys.exit(1)
+
+    # Build ska sketch
+    subprocess.check_output('ska build -o ' + args.out + ' -k ' + str(args.k) + \
+                            ' -f ' + args.input + ' --threads ' + str(args.threads),
                             shell = True)
 
+    # Run ska mapping
+    if os.path.exists(args.out + '.skf'):
+        tmp_aln = os.path.join(os.path.dirname(args.out), 'tmp.' + os.path.basename(args.out))
+        subprocess.check_output('ska map -o ' + tmp_aln + ' --threads ' + str(args.threads) + ' ' +  \
+                                    args.reference + ' ' + args.out + '.skf',
+                                shell = True)
+    else:
+        sys.stderr.write('ska building failed\n')
+        sys.exit(1)
+                            
+    # Clean alignment to prep for Gubbins
+    subprocess.check_output('gubbins_alignment_checker.py --aln ' + tmp_aln + ' --out-aln '  + args.out + \
+                             ' --out ' + args.out + '.csv',
+                            shell = True)
+    
+    sys.stderr.write("Completed generating alignment with ska2 (https://github.com/bacpop/ska.rust)\n")
+
     # Clean up
     if not args.no_cleanup:
-        subprocess.check_output('rm ' + ' '.join([seq + '.map.aln' for seq in all_names]),
-                                shell = True)
-        subprocess.check_output('rm ' + ' '.join([seq + '.skf' for seq in all_names]),
+        subprocess.check_output('rm ' + args.out + '.skf ' + tmp_aln,
                                 shell = True)


=====================================
python/scripts/gubbins_alignment_checker.py
=====================================
@@ -20,26 +20,35 @@
 #
 
 import argparse
-from asyncore import write
 import re
 from collections import Counter
 
 def parse_input_args():
 
     parser = argparse.ArgumentParser(
-        description='Script to output bases in an alignment by isolate',
+        description='Script to evaluate and reformat an alignment prior to Gubbins analysis',
         formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-    parser.add_argument('--aln', '-a', dest="aln",
-                        help='Multifasta alignment file', required=True)
-    parser.add_argument('--out', '-o',
+    parser.add_argument('--aln',
+                        '-a',
+                        dest="aln",
+                        help='Multifasta alignment filename',
+                        required=True)
+    parser.add_argument('--out-aln',
+                        dest="out_aln",
+                        help='Reformatted alignment filename',
+                        default = None,
+                        required=False)
+    parser.add_argument('--out',
+                        '-o',
                         dest="out",
-                        help="Out csv name for writing results too", required=True)
+                        help='Output CSV filename',
+                        required=True)
 
     return parser.parse_args()
 
 def main(input_args):
 
-    ## Lets set up the csv
+    # Read the number of rows in the alignment
     row_num = 0
     tot_lines = 0
     with open(input_args.aln, "r") as aln_file:
@@ -48,16 +57,29 @@ def main(input_args):
                 row_num += 1
             tot_lines += 1
 
-    # Going to use counter in a first pass to store sequence counts and the range of sequences
-    
+    # Filter alignment if requested
+    aln_filename = input_args.aln
+    if input_args.out_aln:
+        with open(input_args.aln, "r") as aln_file, \
+                open(input_args.out_aln, "w") as out_aln_file:
+            for line in aln_file:
+                if re.search("^>", line):
+                    name = line.replace("#","_").replace(":","_").replace(">","").rstrip().split()
+                    out_aln_file.write('>' + name[0] + '\n')
+                else:
+                    sequence = line.upper().rstrip()
+                    sequence = re.sub('[^ACGTN-]','N',sequence)
+                    out_aln_file.write(sequence + '\n')
+        aln_filename = input_args.out_aln
 
+    # Going to use counter in a first pass to store sequence counts and the range of sequences
     total_base_counts = []
     iso_data = []
     total_headers = []
     tot_length_str = str(tot_lines)
     print("Running through alignment file: %s" % input_args.aln)
     print()
-    with open(input_args.aln, "r") as aln_file:
+    with open(aln_filename, "r") as aln_file:
         for index,line in enumerate(aln_file):
             num_zeros = len(tot_length_str) - len(str(index + 1))
             fmt_index = (("0" * num_zeros) + str(index + 1))
@@ -78,6 +100,7 @@ def main(input_args):
     ## Second pass to line up all the counts across the isolates 
     print("")
     print("Assessing counts...")
+    total_headers.sort()
     isolate_bases = []
     for count in total_base_counts:
         iso_row = []
@@ -88,7 +111,7 @@ def main(input_args):
     total_headers.insert(0, "isolate")
     write_out_headers = [re.sub("-","gap",i) for i in total_headers]
     print("Writing out results...")
-    with open((input_args.out + ".csv"), "w") as output:
+    with open((input_args.out), "w") as output:
         output.write(",".join(write_out_headers) + "\n")
         for i, aln_row in enumerate(isolate_bases):
             aln_row_str = list(map(str, aln_row))


=====================================
python/setup.py
=====================================
@@ -20,7 +20,8 @@ setuptools.setup(
         'scripts/extract_gubbins_clade.py',
         'scripts/mask_gubbins_aln.py',
         'scripts/gubbins_alignment_checker.py',
-        'scripts/generate_files_for_clade_analysis.py'
+        'scripts/generate_files_for_clade_analysis.py',
+        'scripts/count_recombinations_per_gene.py'
     ],
     tests_require=[
         "pytest >= 4.6",



View it on GitLab: https://salsa.debian.org/med-team/gubbins/-/compare/c6b8c30ca01f72a149a8f694d4fb6fcca486318a...af53dc3f5396f3e377f9f5e5bdb8450b039c39c2

-- 
View it on GitLab: https://salsa.debian.org/med-team/gubbins/-/compare/c6b8c30ca01f72a149a8f694d4fb6fcca486318a...af53dc3f5396f3e377f9f5e5bdb8450b039c39c2
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20230712/b531fe40/attachment-0001.htm>


More information about the debian-med-commit mailing list