[med-svn] [Git][med-team/python-pangolearn][master] 6 commits: Exclude binary files in new upstream version

Andreas Tille (@tille) gitlab at salsa.debian.org
Mon Nov 8 08:43:01 GMT 2021



Andreas Tille pushed to branch master at Debian Med / python-pangolearn


Commits:
182078fd by Andreas Tille at 2021-11-08T08:20:59+01:00
Exclude binary files in new upstream version

- - - - -
b4683525 by Andreas Tille at 2021-11-08T08:21:15+01:00
New upstream version 2021-10-18
- - - - -
0d0cdb1d by Andreas Tille at 2021-11-08T08:21:44+01:00
Update upstream source from tag 'upstream/2021-10-18'

Update to upstream version '2021-10-18'
with Debian dir 66ab22410fdeb9b382e47ac3f0b0f9ba8ee65135
- - - - -
5a6d21da by Andreas Tille at 2021-11-08T08:55:41+01:00
Advertise repackaging

- - - - -
fca780e7 by Andreas Tille at 2021-11-08T09:31:02+01:00
New upstream version 2021-10-18+dfsg
- - - - -
c131967e by Andreas Tille at 2021-11-08T09:31:31+01:00
Update upstream source from tag 'upstream/2021-10-18+dfsg'

Update to upstream version '2021-10-18+dfsg'
with Debian dir 5b96839ffcf730bc3fde84327e017a111765950b
- - - - -


23 changed files:

- debian/changelog
- debian/copyright
- debian/watch
- pangoLEARN/__init__.py
- − pangoLEARN/data/decisionTreeHeaders_v1.joblib
- − pangoLEARN/data/decisionTree_v1.joblib
- − pangoLEARN/data/decision_tree_rules.txt
- + pangoLEARN/data/decision_tree_rules.zip
- + pangoLEARN/data/lineageTree.pb
- pangoLEARN/data/lineages.downsample.csv
- pangoLEARN/data/lineages.metadata.csv → pangoLEARN/data/lineages.hash.csv
- + pangoLEARN/scripts/__init__.py
- + pangoLEARN/scripts/curate_alignment.smk
- + pangoLEARN/scripts/training_runner.sh
- − pangoLEARN/supporting_information/.DS_Store
- − pangoLEARN/supporting_information/data_prep_description.md
- + pangoLEARN/training/WH04.gb
- + pangoLEARN/training/__init__.py
- + pangoLEARN/training/downsample.py
- + pangoLEARN/training/getDecisionTreeRules.py
- + pangoLEARN/training/outgroups.csv
- + pangoLEARN/training/pangoLEARNDecisionTree_v1.py
- + pangoLEARN/training/processOutputFile.py


Changes:

=====================================
debian/changelog
=====================================
@@ -1,4 +1,4 @@
-python-pangolearn (2021-04-01-1) unstable; urgency=medium
+python-pangolearn (2021-10-18+dfsg-1) unstable; urgency=medium
 
   * Initial release (Closes: #986458)
 


=====================================
debian/copyright
=====================================
@@ -1,6 +1,7 @@
 Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
 Upstream-Name: pangoLEARN
 Source: https://github.com/cov-lineages/pangoLEARN
+Files-Excluded: */*.joblib
 
 Files: *
 Copyright: 2018-2021 cov-lineages group
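
The new `Files-Excluded: */*.joblib` field explains why the two `.joblib` files appear as deleted in this push. A minimal sketch of the matching follows, assuming Python's `fnmatchcase` behaves closely enough to the copyright-format glob for these paths (in both, `*` also matches `/`). The helper name `is_excluded` and the path list are illustrative only; the real repacking is done by the Debian tooling, not by this code.

```python
from fnmatch import fnmatchcase

# Pattern taken from the Files-Excluded field in debian/copyright.
EXCLUDED_PATTERNS = ["*/*.joblib"]


def is_excluded(path, patterns=EXCLUDED_PATTERNS):
    """Return True if the path matches any exclusion pattern (hypothetical helper)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# A few paths from the changed-files list in this notification.
paths = [
    "pangoLEARN/data/decisionTreeHeaders_v1.joblib",
    "pangoLEARN/data/decisionTree_v1.joblib",
    "pangoLEARN/data/decision_tree_rules.zip",
    "pangoLEARN/data/lineageTree.pb",
]

kept = [p for p in paths if not is_excluded(p)]
dropped = [p for p in paths if is_excluded(p)]
```

Under this approximation only the two `.joblib` files land in `dropped`; the `.zip` and `.pb` files stay in the repacked tarball.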


=====================================
debian/watch
=====================================
@@ -1,4 +1,4 @@
 version=4
 
-opts="filenamemangle=s%(?:.*?)?v?(\d[\d.]*)\.tar\.gz%@PACKAGE@-$1.tar.gz%" \
+opts="filenamemangle=s%(?:.*?)?v?(\d[\d.]*)\.tar\.gz%@PACKAGE@-$1.tar.gz%,repacksuffix=+dfsg,dversionmangle=auto,repack,compression=xz" \
 https://github.com/cov-lineages/pangoLEARN/releases .*/archive.*/v?@ANY_VERSION@\.tar\.gz
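
The added watch options tie in with the changelog bump to `2021-10-18+dfsg-1`: `repacksuffix=+dfsg` marks the repacked tarball, and `dversionmangle=auto` removes that marker again before the Debian version is compared with upstream. The sketch below illustrates the arithmetic only. The regular expression is an assumption modelled on the suffix rule described in the uscan documentation, and the function names are hypothetical.

```python
import re

REPACK_SUFFIX = "+dfsg"  # value of repacksuffix in debian/watch

# Assumed approximation of what dversionmangle=auto strips.
DEB_EXT = re.compile(r"[+~](debian|dfsg|ds|deb)(\.)?(\d+)?$")


def repacked_version(upstream_version):
    """Upstream version as it appears after repacking."""
    return upstream_version + REPACK_SUFFIX


def split_debian_version(full_version):
    """Split 'upstream-revision' at the last hyphen."""
    upstream, _, revision = full_version.rpartition("-")
    return upstream, revision


def comparable_upstream(full_version):
    """Debian version reduced to the form that is compared with upstream."""
    upstream, _ = split_debian_version(full_version)
    return DEB_EXT.sub("", upstream)


changelog_version = "2021-10-18+dfsg-1"
upstream_part, revision = split_debian_version(changelog_version)
```

With the changelog entry from this push, `upstream_part` equals `repacked_version("2021-10-18")` and `comparable_upstream(changelog_version)` gives back the plain upstream date.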


=====================================
pangoLEARN/__init__.py
=====================================
@@ -1,2 +1,3 @@
 _program = "pangoLEARN"
-__version__ = "2021-04-01"
+__version__ = "2021-10-18"
+PANGO_VERSION = "v1.2.88"


=====================================
pangoLEARN/data/decisionTreeHeaders_v1.joblib deleted
=====================================
Binary files a/pangoLEARN/data/decisionTreeHeaders_v1.joblib and /dev/null differ


=====================================
pangoLEARN/data/decisionTree_v1.joblib deleted
=====================================
Binary files a/pangoLEARN/data/decisionTree_v1.joblib and /dev/null differ


=====================================
pangoLEARN/data/decision_tree_rules.txt deleted
=====================================
The diff for this file was not included because it is too large.

=====================================
pangoLEARN/data/decision_tree_rules.zip
=====================================
Binary files /dev/null and b/pangoLEARN/data/decision_tree_rules.zip differ


=====================================
pangoLEARN/data/lineageTree.pb
=====================================
The diff for this file was not included because it is too large.

=====================================
pangoLEARN/data/lineages.downsample.csv
=====================================
The diff for this file was not included because it is too large.

=====================================
pangoLEARN/data/lineages.metadata.csv → pangoLEARN/data/lineages.hash.csv
=====================================
The diff for this file was not included because it is too large.

=====================================
pangoLEARN/scripts/__init__.py
=====================================


=====================================
pangoLEARN/scripts/curate_alignment.smk
=====================================
@@ -0,0 +1,322 @@
+import csv
+from Bio import SeqIO
+import os
+import collections
+import hashlib
+import collections
+import csv
+from Bio import SeqIO
+from pangoLEARN.training import downsample
+from datetime import date
+today = date.today()
+
+def get_hash_string(record):
+    seq = str(record.seq).upper().encode()
+    hash_object = hashlib.md5(seq)
+    hash_string = hash_object.hexdigest()
+    return hash_string
+
+def get_dict(in_csv,name_column,data_column):
+    this_dict = {}
+    with open(in_csv,"r") as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            this_dict[row[name_column]] = row[data_column]
+    return this_dict
+
+def add_to_hash(seq_file):
+    hash_map = {}
+    seq_hash = {}
+    for record in SeqIO.parse(seq_file, "fasta"):
+        seq = str(record.seq).upper().encode()
+        hash_object = hashlib.md5(seq)
+        hash_map[hash_object.hexdigest()] = record.id
+        seq_hash[str(record.seq)] = record.id
+    return hash_map,seq_hash
+
+pangoLEARN_path = config["pangoLEARN_path"].rstrip("/")
+pangolin_path = config["pangolin_path"].rstrip("/")
+pango_designation_path = config["pango_designation_path"].rstrip("/")
+quokka_path = config["quokka_path"].rstrip("/")
+
+data_date = config["data_date"]
+config["trim_start"] = 265
+config["trim_end"] = 29674
+config["lineages_csv"]=f"{pango_designation_path}/lineages.csv"
+config["reference"] = f"{pangolin_path}/pangolin/data/reference.fasta"
+config["outgroups"] = f"{pangoLEARN_path}/pangoLEARN/training/outgroups.csv"
+config["genbank_ref"] = f"{pangoLEARN_path}/pangoLEARN/training/WH04.gb"
+config["datadir"]= f"/localdisk/home/shared/raccoon-dog/{data_date}_gisaid/publish/gisaid"
+
+rule all:
+    input:
+        os.path.join(config["outdir"],"alignment.filtered.fasta"),
+        os.path.join(config["outdir"],"decision_tree_rules.zip"),
+        os.path.join(config["outdir"],"pangolearn.init.py"),
+        os.path.join(config["outdir"],"lineage.hash.csv")
+
+rule make_init:
+    output:
+        init = os.path.join(config["outdir"],"pangolearn.init.py")
+    run:
+        pangolearn_new_v = config["pangolearn_version"]
+        pango_version = config["pango_version"]
+        with open(output.init,"w") as fw:
+            fw.write(f'''_program = "pangoLEARN"
+__version__ = "{pangolearn_new_v}"
+PANGO_VERSION = "{pango_version}"
+''')
+
+rule filter_alignment:
+    input:
+        csv = config["lineages_csv"],
+        fasta = os.path.join(config["datadir"],f"gisaid_{data_date}_all_alignment.fa"),
+        full_csv = os.path.join(config["datadir"],f"gisaid_{data_date}_all_metadata.csv")
+    output:
+        fasta = os.path.join(config["outdir"],"alignment.filtered.fasta"),
+        csv = os.path.join(config["outdir"],"lineages.metadata.filtered.csv"),
+        csv_all = os.path.join(config["outdir"],"lineages.designated.csv")
+    run:
+        csv_len = 0
+        seqs_len = 0
+        lineages = {}
+        all_lineages = {}
+        all_len = 0
+        with open(input.csv,"r") as f:
+            for l in f:
+                l = l.rstrip("\n")
+                name,lineage = l.split(",")
+                
+                lineages[name]=lineage
+                csv_len +=1
+                all_lineages[name]=lineage
+                all_len +=1
+        with open(output.csv,"w") as fw:
+            with open(output.csv_all,"w") as fw2:
+                with open(input.full_csv,"r") as f:
+                
+                    reader = csv.DictReader(f)
+                    header = reader.fieldnames
+
+                    writer = csv.DictWriter(fw, fieldnames=header, lineterminator="\n")
+                    writer.writeheader()
+
+                    writer2 = csv.DictWriter(fw2, fieldnames=header, lineterminator="\n")
+                    writer2.writeheader()
+
+                    for row in reader:
+                        name = row["sequence_name"].replace("SouthAfrica","South_Africa")
+                        if name in lineages:
+                            new_row = row
+                            new_row["lineage"] = lineages[name]
+                            writer.writerow(new_row)
+                        if name in all_lineages:
+                            all_row = row
+                            all_row["lineage"] = all_lineages[name]
+                            writer2.writerow(all_row)
+        written = {}
+        with open(output.fasta,"w") as fw:
+            for record in SeqIO.parse(input.fasta, "fasta"):
+                record.id = record.id.replace("SouthAfrica","South_Africa")
+                if record.id in lineages and not record.id in written:
+                    fw.write(f">{record.id}\n{record.seq}\n")
+
+                    written[record.id]=1
+                    seqs_len +=1
+        
+        print("Number of sequences in gisaid designated", all_len)
+        print("Number of sequences going into pangolearn training",csv_len)
+        print("Number of sequences found on gisaid", seqs_len)
+        
+
+rule align_with_minimap2:
+    input:
+        fasta = os.path.join(config["outdir"],"alignment.filtered.fasta"),
+        reference = config["reference"]
+    output:
+        sam = os.path.join(config["outdir"],"alignment.sam")
+    shell:
+        """
+        minimap2 -a -x asm5 -t {workflow.cores} \
+        {input.reference:q} \
+        {input.fasta:q} > {output.sam:q}
+        """
+
+rule get_variants:
+    input:
+        sam = os.path.join(config["outdir"],"alignment.sam")
+    output:
+        csv = os.path.join(config["outdir"],"variants.csv")
+    shell:
+        """
+        gofasta sam variants -t {workflow.cores} \
+        --samfile {input.sam:q} \
+        --reference {config[reference]} \
+        --genbank {config[genbank_ref]} \
+        --outfile {output.csv}
+        """
+
+rule add_lineage:
+    input:
+        csv = os.path.join(config["outdir"],"variants.csv"),
+        lineages = os.path.join(config["outdir"],"lineages.designated.csv")
+    output:
+        csv = os.path.join(config["outdir"],"variants.lineages.csv")
+    run:
+        lineages_dict = {}
+        with open(input.lineages,"r") as f:
+            reader= csv.DictReader(f)
+            for row in reader:
+                lineages_dict[row["sequence_name"]] = row["lineage"]
+        with open(output.csv, "w") as fw:
+            fw.write("sequence_name,nucleotide_variants,lineage,why_excluded\n")
+            with open(input.csv,"r") as f:
+                for l in f:
+                    l = l.strip("\n")
+                    name,variants = l.split(",")
+                    if name =="query":
+                        pass
+                    elif name in lineages_dict:
+                        fw.write(f"{name},{variants},{lineages_dict[name]},\n")
+                        
+
+rule downsample:
+    input:
+        csv = os.path.join(config["outdir"],"variants.lineages.csv"),
+        fasta = os.path.join(config["outdir"],"alignment.filtered.fasta")
+    output:
+        csv = os.path.join(config["outdir"],"metadata.copy.csv"),
+        fasta = os.path.join(config["outdir"],"alignment.downsample.fasta")
+    run:
+        downsample.downsample(
+            input.csv, 
+            output.csv, 
+            input.fasta, 
+            output.fasta, 
+            1, config["outgroups"], 
+            False, 
+            False, 
+            10)
+
+rule filter_metadata:
+    input:
+        csv = os.path.join(config["outdir"],"metadata.copy.csv"),
+        fasta = os.path.join(config["outdir"],"alignment.downsample.fasta")
+    output:
+        csv = os.path.join(config["outdir"],"metadata.downsample.csv")
+    run:
+        in_downsample = {}
+        for record in SeqIO.parse(input.fasta,"fasta"):
+            in_downsample[record.id] = 1
+
+        with open(output.csv, "w") as fw:
+            fw.write("sequence_name,lineage\n")
+            with open(input.csv, "r") as f:
+                reader = csv.DictReader(f)
+                for row in reader:
+                    if row["sequence_name"] in in_downsample:
+                        name = row["sequence_name"]
+                        lineage = row["lineage"]
+                        fw.write(f"{name},{lineage}\n")
+
+
+rule get_relevant_postions:
+    input:
+        fasta = os.path.join(config["outdir"],"alignment.downsample.fasta"),
+        csv = os.path.join(config["outdir"],"metadata.downsample.csv"),
+        reference = config["reference"]
+    params:
+        path_to_script = quokka_path
+    output:
+        relevant_pos_obj = os.path.join(config["outdir"],"relevantPositions.pickle"),
+    shell:
+        """
+        python {params.path_to_script}/quokka/getRelevantLocationsObject.py \
+        {input.reference:q} \
+        {input.fasta} \
+        {input.csv:q} \
+        {config[outdir]}
+        """
+
+rule run_training:
+    input:
+        fasta = os.path.join(config["outdir"],"alignment.downsample.fasta"),
+        csv = os.path.join(config["outdir"],"metadata.downsample.csv"),
+        reference = config["reference"],
+        relevant_pos_obj = rules.get_relevant_postions.output.relevant_pos_obj
+    params:
+        path_to_script = pangoLEARN_path
+    output:
+        headers = os.path.join(config["outdir"],"decisionTreeHeaders_v1.joblib"),
+        model = os.path.join(config["outdir"],"decisionTree_v1.joblib"),
+        txt = os.path.join(config["outdir"],"training_summary.txt")
+    shell:
+        """
+        python {params.path_to_script}/pangoLEARN/training/pangoLEARNDecisionTree_v1.py \
+        {input.csv:q} \
+        {input.fasta} \
+        {input.reference:q} \
+        {config[outdir]} \
+        {input.relevant_pos_obj} \
+        > {output.txt:q}
+        """
+
+rule get_recall:
+    input:
+        txt = rules.run_training.output.txt
+    params:
+        path_to_script = pangoLEARN_path
+    output:
+        txt = os.path.join(config["outdir"],"lineage_recall_report.txt")
+    shell:
+        """
+        python {params.path_to_script}/pangoLEARN/training/processOutputFile.py {input.txt} > {output.txt}
+        """
+
+rule get_decisions:
+    input:
+        headers = os.path.join(config["outdir"],"decisionTreeHeaders_v1.joblib"),
+        model = os.path.join(config["outdir"],"decisionTree_v1.joblib"),
+        txt = rules.run_training.output.txt
+    params:
+        path_to_script = pangoLEARN_path
+    output:
+        txt = os.path.join(config["outdir"],"tree_rules.txt"),
+        zipped = os.path.join(config["outdir"],"decision_tree_rules.zip")
+    shell:
+        """
+        python {params.path_to_script}/pangoLEARN/training/getDecisionTreeRules.py \
+        {input.model:q} {input.headers:q} {input.txt:q} \
+        > {output.txt:q} && zip {output.zipped:q} {output.txt:q}
+        """
+
+rule create_hash:
+    input:
+        fasta = os.path.join(config["outdir"],"alignment.filtered.fasta"),
+        lin_designation = os.path.join(config["outdir"],"lineages.designated.csv")
+    output:
+        csv = os.path.join(config["outdir"],"lineage.hash.csv"),
+        fasta = os.path.join(config["outdir"],"lineage.hash.fasta"),
+        hashed_designations = os.path.join(config["outdir"],"designations.hash.csv")
+    run:
+        designated = get_dict(input.lin_designation,"sequence_name","lineage")
+
+        hash_map,seq_hash_dict = add_to_hash(input.fasta)
+
+        with open(output.csv,"w") as fw:
+            fw.write("seq_hash,lineage\n")
+            for seq_hash in hash_map:
+                seq_name = hash_map[seq_hash]
+                set_name = designated[seq_name]
+                fw.write(f"{seq_hash},{set_name}\n")
+        
+        num_seqs = 0
+        with open(output.hashed_designations, "w") as fw2:
+            fw2.write("taxon,lineage\n")
+            with open(output.fasta, "w") as fw:
+                for seq in seq_hash_dict:
+                    num_seqs +=1
+                    fw.write(f">{seq_hash_dict[seq]}\n{seq}\n")
+                    fw2.write(f"{seq_hash_dict[seq]},{designated[seq_hash_dict[seq]]}\n")
+
+        print("Number of seqs going into training: ",f"{num_seqs}")
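
The `get_hash_string` and `add_to_hash` helpers in the Snakefile above key sequences by the MD5 digest of their upper-cased text, which is what ends up in `lineage.hash.csv`. The following is a minimal sketch of that idea using plain `(name, sequence)` tuples instead of Biopython records, so it runs without extra dependencies. The names `sequence_hash` and `build_hash_table` are illustrative and do not exist in pangoLEARN.

```python
import hashlib


def sequence_hash(sequence):
    """MD5 hex digest of the upper-cased sequence, as in get_hash_string."""
    return hashlib.md5(sequence.upper().encode()).hexdigest()


def build_hash_table(records):
    """Map digest -> record name for an iterable of (name, sequence) pairs.

    As in add_to_hash, a later record with an identical sequence
    overwrites the earlier entry.
    """
    table = {}
    for name, sequence in records:
        table[sequence_hash(sequence)] = name
    return table


records = [("seq_a", "ACGTACGT"), ("seq_b", "acgtacgt"), ("seq_c", "ACGTTTTT")]
table = build_hash_table(records)
```

Because the text is upper-cased before hashing, `seq_a` and `seq_b` collapse onto one digest, so `table` holds two entries rather than three.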


=====================================
pangoLEARN/scripts/training_runner.sh
=====================================
@@ -0,0 +1,55 @@
+#!/bin/bash
+
+#source /localdisk/home/s1680070/.bashrc
+#conda activate pangolin
+
+TODAY=$(date +%F)
+OUT=${TODAY}_pangoLEARN
+OUTDIR=/localdisk/home/shared/raccoon-dog/$OUT
+
+echo $OUTDIR
+
+if [ -d $OUTDIR ] 
+then
+    echo "Directory $OUTDIR exists." 
+else
+    mkdir $OUTDIR 
+    echo "Directory $OUTDIR does not exist, making it."
+fi
+
+echo "Training model version: $TODAY"
+
+if [ -z "$1" ]
+then
+    DATA_DATE=$TODAY
+else
+    DATA_DATE=$1
+fi
+
+# LATEST_DATA=$(ls -td /localdisk/home/shared/raccoon-dog/2021*_gisaid/publish/gisaid | head -n 1)
+
+REPO_PATH=/localdisk/home/s1362711/repositories
+
+PANGO_PATH=$REPO_PATH/pango-designation
+PLEARN_PATH=$REPO_PATH/pangoLEARN
+PANGOLIN_PATH=$REPO_PATH/pangolin
+QUOKKA_PATH=$REPO_PATH/quokka
+
+echo "pango designation path $PANGO_PATH"
+echo "pangoLEARN path $PLEARN_PATH"
+echo "pangolin path $PANGOLIN_PATH"
+
+
+cd $PANGO_PATH && git pull #gets any updates to the reports in the data directory
+PANGO_V=$(git tag --points-at HEAD)
+echo "pango version $PANGO_V"
+
+cd /localdisk/home/shared/raccoon-dog/ #gets any updates to the reports in the data directory
+echo "--config outdir=$OUTDIR data_date=$DATA_DATE pangolearn_version=$TODAY pango_version=$PANGO_V"
+echo "pangoLEARN training starting" | mail -s "update lineageTree.pb with pango designation version $PANGO_V" angie@soe.ucsc.edu
+snakemake --snakefile $PLEARN_PATH/pangoLEARN/scripts/curate_alignment.smk --rerun-incomplete --nolock --cores 1 --config pango_designation_path=$PANGO_PATH pangolin_path=$PANGOLIN_PATH pangoLEARN_path=$PLEARN_PATH quokka_path=$QUOKKA_PATH outdir=$OUTDIR data_date=$DATA_DATE pangolearn_version=$TODAY pango_version=$PANGO_V
+
+# cp $OUTDIR/pangolearn.init.py   /localdisk/home/s1680070/repositories/pangoLEARN/pangoLEARN/__init__.py
+# cp $OUTDIR/decision*   /localdisk/home/s1680070/repositories/pangoLEARN/pangoLEARN/data/
+# cp $OUTDIR/metadata.downsample.csv   /localdisk/home/s1680070/repositories/pangoLEARN/pangoLEARN/data/lineages.downsample.csv
+# cp $OUTDIR/lineage.hash.csv   /localdisk/home/s1680070/repositories/pangoLEARN/pangoLEARN/data/lineages.hash.csv
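
The runner script above derives three settings from the current date and an optional first argument: the output directory, the data date, and the model version. A rough Python rendering of that defaulting logic is sketched here, purely as an illustration. The function `training_run_settings` is hypothetical; only the base path and the naming pattern come from the script.

```python
from datetime import date


def training_run_settings(argv, today=None,
                          base="/localdisk/home/shared/raccoon-dog"):
    """Mirror the script's defaults: data_date falls back to today's date."""
    today = today or date.today()
    stamp = today.isoformat()  # same YYYY-MM-DD form as `date +%F`
    # `[ -z "$1" ]` is true for a missing or empty argument.
    data_date = argv[0] if argv and argv[0] else stamp
    return {
        "outdir": f"{base}/{stamp}_pangoLEARN",
        "data_date": data_date,
        "pangolearn_version": stamp,
    }


default_run = training_run_settings([], today=date(2021, 10, 18))
dated_run = training_run_settings(["2021-10-11"], today=date(2021, 10, 18))
```

In `dated_run` only `data_date` changes; the output directory and model version still follow the day the training is started.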


=====================================
pangoLEARN/supporting_information/.DS_Store deleted
=====================================
Binary files a/pangoLEARN/supporting_information/.DS_Store and /dev/null differ


=====================================
pangoLEARN/supporting_information/data_prep_description.md deleted
=====================================
@@ -1,16 +0,0 @@
-# Data preparation
-
-### Source
-
-All GISAID data is downloaded and run through [`grapevine`](https://github.com/cov-ert/grapevine) which excludes records without proper dates, removes duplicate sequences (taking the earliest sample of the duplicates), omits some sequences with known issues, filters by length and coverage, and trims the sequences to CDS.
-
-It also aligns the sequences using `mafft` and builds an ML tree using `iqtree`. A lineage is assigned to each sequence using `pangolin` with the previous data release.
-
-### Lineage Curation
-
-The phylogeny is annotated with lineage and then in `FigTree` the lineages are manually curated, drawing together a number of pieces of information including monophyly in the ML phylogeny (generally a bootstrap > 70 is required) and epidemiological data such as country and travel history. Any changes to lineage definitions and new lineages are documented during this process.
-
-- The lineage may have been defined earlier in the outbreak and with added sequence data, there is less support for that lineage. In these cases the associated epidemiological metadata is examined and the lineage may be refined or even dropped entirely. The lineage number will not be 'recycled', but the members will get reassigned the parent lineage designation.
-- The lineage may have very clear epidemiological support and ambiguities or homoplasies in the sequences/tree could contribute to low bootstrap values. In these cases, if the support is strong, the lineages are called. Recall rates for these lineages within `pangolin` may be lower, however.
-
-


=====================================
pangoLEARN/training/WH04.gb
=====================================
@@ -0,0 +1,972 @@
+LOCUS       MT291829               29774 bp    RNA     linear   VRL 14-MAY-2020
+DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate
+            SARS-CoV-2/human/CHN/Wuhan_IME-WH04/2019, complete genome.
+ACCESSION   MT291829 GWHACAV01000001
+VERSION     MT291829.1
+KEYWORDS    .
+SOURCE      Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)
+  ORGANISM  Severe acute respiratory syndrome coronavirus 2
+            Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
+            Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae;
+            Betacoronavirus; Sarbecovirus.
+REFERENCE   1  (bases 1 to 29774)
+  AUTHORS   Fan,H., Qin,E., Wu,Y., Guo,Y., Zhang,X., Yong,Y., Hou,J., Xu,Z.,
+            Mu,J., Teng,Y., Mi,Z., Yang,R., Song,Y., Li,B. and Cui,Y.
+  TITLE     Direct Submission
+  JOURNAL   Submitted (06-APR-2020) Beijing Institute of Microbiology and
+            Epidemiology, Fengtai District, Beijing 100071, China
+COMMENT     This record was submitted to GenBank on behalf of the original
+            #submitter through Genome Warehouse (GWH,
+            https://bigd.big.ac.cn/gwh/) of the China National Center for
+            Bioinformation (CNCB)/National Genomics Data Center (NGDC,
+            https://bigd.big.ac.cn).
+            
+            ##Assembly-Data-START##
+            Assembly Method       :: CLC Genomic Workbench v. V9.0
+            Assembly Name         :: IME-WH04
+            Coverage              :: 46
+            Sequencing Technology :: Ion Torrent X5Plus
+            ##Assembly-Data-END##
+FEATURES             Location/Qualifiers
+     source          1..29774
+                     /organism="Severe acute respiratory syndrome coronavirus
+                     2"
+                     /mol_type="genomic RNA"
+                     /isolate="SARS-CoV-2/human/CHN/Wuhan_IME-WH04/2019"
+                     /isolation_source="bronchoalveolar lavage fluid"
+                     /host="Homo sapiens"
+                     /db_xref="taxon:2697049"
+                     /country="China: Wuhan"
+                     /collection_date="2019-12-30"
+                     /collected_by="Beijing Institute of Micribiology and
+                     Epidemiology"
+     gene            178..21467
+                     /gene="ORF1ab"
+     CDS             join(178..13380,13380..21467)
+                     /gene="ORF1ab"
+                     /ribosomal_slippage
+                     /codon_start=1
+                     /product="ORF1ab polyprotein"
+                     /protein_id="QIU81799.1"
+                     /translation="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQ
+                     HLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGE
+                     TLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQEN
+                     WNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQ
+                     LDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFP
+                     LNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTG
+                     DFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESG
+                     LKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNL
+                     LEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGN
+                     FKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAA
+                     ITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKL
+                     KPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLV
+                     NKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEII
+                     FLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEK
+                     YCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEK
+                     CSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEF
+                     KLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPE
+                     EEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSG
+                     YLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESD
+                     DYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLA
+                     PLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIA
+                     EIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGN
+                     LHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKV
+                     PTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREM
+                     LAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLIN
+                     TLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSS
+                     KTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVIT
+                     FDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPH
+                     NSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIK
+                     WADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELG
+                     DVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQ
+                     IPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSK
+                     ETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYY
+                     KKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVT
+                     FFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIVWHVNNATNKATYKPNTWCIRCLWST
+                     KPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGD
+                     IILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNS
+                     VPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRI
+                     KASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTA
+                     ALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLET
+                     IQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWL
+                     MWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVE
+                     CTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRP
+                     INPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPI
+                     NVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVN
+                     TFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVEC
+                     LKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALI
+                     WNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWL
+                     KQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFA
+                     NKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLP
+                     RVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVA
+                     YESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSG
+                     RWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCL
+                     AYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTN
+                     DVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFE
+                     EAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHL
+                     AKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNG
+                     LWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVL
+                     KLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSC
+                     GSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVN
+                     VLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAV
+                     LDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHW
+                     LLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFL
+                     LPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTAR
+                     TVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLAR
+                     GIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYL
+                     VSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVL
+                     LSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKL
+                     CEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAK
+                     SEFDRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALN
+                     NIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDA
+                     DSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTA
+                     CTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTP
+                     KGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAK
+                     AYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDH
+                     PNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSA
+                     DAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKD
+                     EDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQR
+                     LTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYA
+                     NLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVV
+                     DSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQ
+                     TYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRE
+                     LGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQ
+                     TVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDI
+                     RQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQ
+                     DALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAAT
+                     RGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLAR
+                     KHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQ
+                     AVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMM
+                     ILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCS
+                     QHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKH
+                     PNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTV
+                     LQAVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVNPYVCNAPGCD
+                     VTDVTQLYLGGMSYYCKSHKPPISFPLCANGQVFGLYKNTCVGSDNVTDFNAIATCDW
+                     TNAGDYILANTCTERLKLFAAETLKATEETFKLSYGIATVREVLSDRELHLSWEVGKP
+                     RPPLNRNYVFTGYRVTKNSKVQIGEYTFEKGDYGDAVVYRGTTTYKLNVGDYFVLTSH
+                     TVMPLSAPTLVPQEHYVRITGLYPTLNISDEFSSNVANYQKVGMQKYSTLQGPPGTGK
+                     SHFAIGLALYYPSARIVYTACSHAAVDALCEKALKYLPIDKCSRIIPARARVECFDKF
+                     KVNSTLEQYVFCTVNALPETTADIVVFDEISMATNYDLSVVNARLRAKHYVYIGDPAQ
+                     LPAPRTLLTKGTLEPEYFNSVCRLMKTIGPDMFLGTCRRCPAEIVDTVSALVYDNKLK
+                     AHKDKSAQCFKMFYKGVITHDVSSAINRPQIGVVREFLTRNPAWRKAVFISPYNSQNA
+                     VASKILGLPTQTVDSSQGSEYDYVIFTQTTETAHSCNVNRFNVAITRAKVGILCIMSD
+                     RDLYDKLQFTSLEIPRRNVATLQAENVTGLFKDCSKVITGLHPTQAPTHLSVDTKFKT
+                     EGLCVDIPGIPKDMTYRRLISMMGFKMNYQVNGYPNMFITREEAIRHVRAWIGFDVEG
+                     CHATREAVGTNLPLQLGFSTGVNLVAVPTGYVDTPNNTDFSRVSAKPPPGDQFKHLIP
+                     LMYKGLPWNVVRIKIVQMLSDTLKNLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCL
+                     CDRRATCFSTASDTYACWHHSIGFDYVYNPFMIDVQQWGFTGNLQSNHDLYCQVHGNA
+                     HVASCDAIMTRCLAVHECFVKRVDWTIEYPIIGDELKINAACRKVQHMVVKAALLADK
+                     FPVLHDIGNPKAIKCVPQADVEWKFYDAQPCSDKAYKIEELFYSYATHSDKFTDGVCL
+                     FWNCNVDRYPANSIVCRFDTRVLSNLNLPGCDGGSLYVNKHAFHTPAFDKSAFVNLKQ
+                     LPFFYYSDSPCESHGKQVVSDIDYVPLKSATCITRCNLGGAVCRHHANEYRLYLDAYN
+                     MMISAGFSLWVYKQFDTYNLWNTFTRLQSLENVAFNVVNKGHFDGQQGEVPVSIINNT
+                     VYTKVDGVDVELFENKTTLPVNVAFELWAKRNIKPVPEVKILNNLGVDIAANTVIWDY
+                     KRDAPAHISTIGVCSMTDIAKKPTETICAPLTVFFDGRVDGQVDLFRNARNGVLITEG
+                     SVKGLQPSVGPKQASLNGVTLIGEAVKTQFNYYKKVDGVVQQLPETYFTQSRNLQEFK
+                     PRSQMEIDFLELAMDEFIERYKLEGYAFEHIVYGDFSHSQLGGLHLLIGLAKRFKESP
+                     FELEDFIPMDSTVKNYFITDAQTGSSKCVCSVIDLLLDDFVEIIKSQDLSVVSKVVKV
+                     TIDYTEISFMLWCKDGHVETFYPKLQSSQAWQPGVAMPNLYKMQRMLLEKCDLQNYGD
+                     SATLPKGIMMNVAKYTQLCQYLNTLTLAVPYNMRVIHFGAGSDKGVAPGTAVLRQWLP
+                     TGTLLVDSDLNDFVSDADSTLIGDCATVHTANKWDLIISDMYDPKTKNVTKENDSKEG
+                     FFTYICGFIQQKLALGGSVAIKITEHSWNADLYKLMGHFAWWTAFVTNVNASSSEAFL
+                     IGCNYLGKPREQIDGYVMHANYIFWRNTNPIQLSSYSLFDMSKFPLKLRGTAVMSLKE
+                     GQINDMILSLLSKGRLIIRENNRVVISSDVLVNN"
+     mat_peptide     178..717
+                     /gene="ORF1ab"
+                     /product="leader protein"
+     mat_peptide     718..2631
+                     /gene="ORF1ab"
+                     /product="nsp2"
+     mat_peptide     2632..8466
+                     /gene="ORF1ab"
+                     /product="nsp3"
+     mat_peptide     8467..9966
+                     /gene="ORF1ab"
+                     /product="nsp4"
+     mat_peptide     9967..10884
+                     /gene="ORF1ab"
+                     /product="3C-like proteinase"
+     mat_peptide     10885..11754
+                     /gene="ORF1ab"
+                     /product="nsp6"
+     mat_peptide     11755..12003
+                     /gene="ORF1ab"
+                     /product="nsp7"
+     mat_peptide     12004..12597
+                     /gene="ORF1ab"
+                     /product="nsp8"
+     mat_peptide     12598..12936
+                     /gene="ORF1ab"
+                     /product="nsp9"
+     mat_peptide     12937..13353
+                     /gene="ORF1ab"
+                     /product="nsp10"
+     mat_peptide     join(13354..13380,13380..13382,13378..13380,13380..16148)
+                     /gene="ORF1ab"
+                     /product="RNA-dependent RNA polymerase"
+     mat_peptide     16149..17951
+                     /gene="ORF1ab"
+                     /product="helicase"
+     mat_peptide     17952..19532
+                     /gene="ORF1ab"
+                     /product="3'-to-5' exonuclease"
+     mat_peptide     19533..20570
+                     /gene="ORF1ab"
+                     /product="endoRNAse"
+     mat_peptide     20571..21464
+                     /gene="ORF1ab"
+                     /product="2'-O-ribose methyltransferase"
+     CDS             178..13395
+                     /gene="ORF1ab"
+                     /codon_start=1
+                     /product="ORF1a polyprotein"
+                     /protein_id="QIU81800.1"
+                     /translation="MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQ
+                     HLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGE
+                     TLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQEN
+                     WNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQ
+                     LDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFP
+                     LNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTG
+                     DFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESG
+                     LKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNL
+                     LEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGN
+                     FKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAA
+                     ITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKL
+                     KPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLV
+                     NKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEII
+                     FLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEK
+                     YCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEK
+                     CSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEF
+                     KLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPE
+                     EEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSG
+                     YLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESD
+                     DYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLA
+                     PLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIA
+                     EIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGN
+                     LHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKV
+                     PTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREM
+                     LAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLIN
+                     TLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSS
+                     KTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVIT
+                     FDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPH
+                     NSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIK
+                     WADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELG
+                     DVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQ
+                     IPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSK
+                     ETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYY
+                     KKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVT
+                     FFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIVWHVNNATNKATYKPNTWCIRCLWST
+                     KPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGD
+                     IILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNS
+                     VPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRI
+                     KASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTA
+                     ALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLET
+                     IQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWL
+                     MWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVE
+                     CTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRP
+                     INPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPI
+                     NVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVN
+                     TFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVEC
+                     LKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALI
+                     WNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWL
+                     KQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFA
+                     NKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLP
+                     RVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVA
+                     YESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSG
+                     RWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCL
+                     AYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTN
+                     DVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFE
+                     EAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHL
+                     AKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNG
+                     LWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVL
+                     KLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSC
+                     GSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVN
+                     VLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAV
+                     LDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHW
+                     LLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFL
+                     LPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTAR
+                     TVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLAR
+                     GIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYL
+                     VSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVL
+                     LSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKL
+                     CEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAK
+                     SEFDRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALN
+                     NIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDA
+                     DSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTA
+                     CTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTP
+                     KGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAK
+                     AYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDH
+                     PNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSA
+                     DAQSFLNGFAV"
+     mat_peptide     178..717
+                     /gene="ORF1ab"
+                     /product="leader protein"
+     mat_peptide     718..2631
+                     /gene="ORF1ab"
+                     /product="nsp2"
+     mat_peptide     2632..8466
+                     /gene="ORF1ab"
+                     /product="nsp3"
+     mat_peptide     8467..9966
+                     /gene="ORF1ab"
+                     /product="nsp4"
+     mat_peptide     9967..10884
+                     /gene="ORF1ab"
+                     /product="3C-like proteinase"
+     mat_peptide     10885..11754
+                     /gene="ORF1ab"
+                     /product="nsp6"
+     mat_peptide     11755..12003
+                     /gene="ORF1ab"
+                     /product="nsp7"
+     mat_peptide     12004..12597
+                     /gene="ORF1ab"
+                     /product="nsp8"
+     mat_peptide     12598..12936
+                     /gene="ORF1ab"
+                     /product="nsp9"
+     mat_peptide     12937..13353
+                     /gene="ORF1ab"
+                     /product="nsp10"
+     mat_peptide     13354..13392
+                     /gene="ORF1ab"
+                     /product="nsp11"
+     stem_loop       13388..13415
+                     /gene="ORF1ab"
+                     /note="Coronavirus frameshifting stimulation element
+                     stem-loop 1"
+     stem_loop       13400..13454
+                     /gene="ORF1ab"
+                     /note="Coronavirus frameshifting stimulation element
+                     stem-loop 2"
+     gene            21475..25296
+                     /gene="S"
+     CDS             21475..25296
+                     /gene="S"
+                     /codon_start=1
+                     /product="surface glycoprotein"
+                     /protein_id="QIU81801.1"
+                     /translation="MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFR
+                     SSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIR
+                     GWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVY
+                     SSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQ
+                     GFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFL
+                     LKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITN
+                     LCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCF
+                     TNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYN
+                     YLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPY
+                     RVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFG
+                     RDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAI
+                     HADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPR
+                     RARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTM
+                     YICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFG
+                     GFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFN
+                     GLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQN
+                     VLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGA
+                     ISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMS
+                     ECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAH
+                     FPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELD
+                     SFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELG
+                     KYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSE
+                     PVLKGVKLHYT"
+     gene            25305..26132
+                     /gene="ORF3a"
+     CDS             25305..26132
+                     /gene="ORF3a"
+                     /codon_start=1
+                     /product="ORF3a protein"
+                     /protein_id="QIU81802.1"
+                     /translation="MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFG
+                     WLIVGVALLAVFQSASKIITLKKRWQLALSKGVHFVCNLLLLFVTVYSHLLLVAAGLE
+                     APFLYLYALVYFLQSINFVRIIMRLWLCWKCRSKNPLLYDANYFLCWHTNCYDYCIPY
+                     NSVTSSIVITSGDGTTSPISEHDYQIGGYTEKWESGVKDCVVLHSYFTSDYYQLYSTQ
+                     LSTDTGVEHVTFFIYNKIVDEPEEHVQIHTIDGSSGVVNPVMEPIYDEPTTTTSVPL"
+     gene            26157..26384
+                     /gene="E"
+     CDS             26157..26384
+                     /gene="E"
+                     /codon_start=1
+                     /product="envelope protein"
+                     /protein_id="QIU81803.1"
+                     /translation="MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCC
+                     NIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV"
+     gene            26435..27103
+                     /gene="M"
+     CDS             26435..27103
+                     /gene="M"
+                     /codon_start=1
+                     /product="membrane glycoprotein"
+                     /protein_id="QIU81804.1"
+                     /translation="MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNR
+                     FLYIIKLIFLWLLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRL
+                     FARTRSMWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCD
+                     IKDLPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSDNIA
+                     LLVQ"
+     gene            27114..27299
+                     /gene="ORF6"
+     CDS             27114..27299
+                     /gene="ORF6"
+                     /codon_start=1
+                     /product="ORF6 protein"
+                     /protein_id="QIU81805.1"
+                     /translation="MFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSL
+                     TENKYSQLDEEQPMEID"
+     gene            27306..27671
+                     /gene="ORF7a"
+     CDS             27306..27671
+                     /gene="ORF7a"
+                     /codon_start=1
+                     /product="ORF7a protein"
+                     /protein_id="QIU81806.1"
+                     /translation="MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNS
+                     PFHPLADNKFALTCFSTQFAFACPDGVKHVYQLRARSVSPKLFIRQEEVQELYSPIFL
+                     IVAAIVFITLCFTLKRKTE"
+     gene            27668..27799
+                     /gene="ORF7b"
+     CDS             27668..27799
+                     /gene="ORF7b"
+                     /codon_start=1
+                     /product="ORF7b"
+                     /protein_id="QIU81807.1"
+                     /translation="MIELSLIDFYLCFLAFLLFLVLIMLIIFWFSLELQDHNETCHA"
+     gene            27806..28171
+                     /gene="ORF8"
+     CDS             27806..28171
+                     /gene="ORF8"
+                     /codon_start=1
+                     /product="ORF8 protein"
+                     /protein_id="QIU81808.1"
+                     /translation="MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVVDDPCPIHFYSK
+                     WYIRVGARKSAPLIELCVDEAGSKSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRC
+                     SFYEDFLEYHDVRVVLDFI"
+     gene            28186..29445
+                     /gene="N"
+     CDS             28186..29445
+                     /gene="N"
+                     /codon_start=1
+                     /product="nucleocapsid phosphoprotein"
+                     /protein_id="QIU81809.1"
+                     /translation="MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQG
+                     LPNNTASWFTALTQHGKEDLKFPRGQGVPINTNSSPDDQIGYYRRATRRIRGGDGKMK
+                     DLSPRWYFYYLGTGPEAGLPYGANKDGIIWVATEGALNTPKDHIGTRNPANNAAIVLQ
+                     LPQGTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGDAA
+                     LALLLLDRLNQLESKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYNVTQAFGR
+                     RGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYT
+                     GAIKLDDKDPNFKDQVILLNKHIDAYKTFPPTEPKKDKKKKADETQALPQRQKKQQTV
+                     TLLPAADLDDFSKQLQQSMSSADSTQA"
+     gene            29470..29586
+                     /gene="ORF10"
+     CDS             29470..29586
+                     /gene="ORF10"
+                     /codon_start=1
+                     /product="ORF10 protein"
+                     /protein_id="QIU81810.1"
+                     /translation="MGYINVFAFPFTIYSLLLCRMNSRNYIAQVDVVNFNLT"
+     stem_loop       29521..29556
+                     /gene="ORF10"
+                     /note="Coronavirus 3' UTR pseudoknot stem-loop 1"
+     stem_loop       29541..29569
+                     /gene="ORF10"
+                     /note="Coronavirus 3' UTR pseudoknot stem-loop 2"
+     stem_loop       29640..29680
+                     /note="Coronavirus 3' stem-loop II-like motif (s2m)"
+ORIGIN      
+        1 tggctgtcac tcggctgcat gcttagtgca ctcacgcagt ataattaata actaattact
+       61 gtcgttgaca ggacacgagt aactcgtcta tcttctgcag gctgcttacg gtttcgtccg
+      121 tgttgcagcc gatcatcagc acatctaggt ttcgtccggg tgtgaccgaa aggtaagatg
+      181 gagagccttg tccctggttt caacgagaaa acacacgtcc aactcagttt gcctgtttta
+      241 caggttcgcg acgtgctcgt acgtggcttt ggagactccg tggaggaggt cttatcagag
+      301 gcacgtcaac atcttaaaga tggcacttgt ggcttagtag aagttgaaaa aggcgttttg
+      361 cctcaacttg aacagcccta tgtgttcatc aaacgttcgg atgctcgaac tgcacctcat
+      421 ggtcatgtta tggttgagct ggtagcagaa ctcgaaggca ttcagtacgg tcgtagtggt
+      481 gagacacttg gtgtccttgt ccctcatgtg ggcgaaatac cagtggctta ccgcaaggtt
+      541 cttcttcgta agaacggtaa taaaggagct ggtggccata gttacggcgc cgatctaaag
+      601 tcatttgact taggcgacga gcttggcact gatccttatg aagattttca agaaaactgg
+      661 aacactaaac atagcagtgg tgttacccgt gaactcatgc gtgagcttaa cggaggggca
+      721 tacactcgct atgtcgataa caacttctgt ggccctgatg gctaccctct tgagtgcatt
+      781 aaagaccttc tagcacgtgc tggtaaagct tcatgcactt tgtccgaaca actggacttt
+      841 attgacacta agaggggtgt atactgctgc cgtgaacatg agcatgaaat tgcttggtac
+      901 acggaacgtt ctgaaaagag ctatgaattg cagacacctt ttgaaattaa attggcaaag
+      961 aaatttgaca ccttcaatgg ggaatgtcca aattttgtat ttcccttaaa ttccataatc
+     1021 aagactattc aaccaagggt tgaaaagaaa aagcttgatg gctttatggg tagaattcga
+     1081 tctgtctatc cagttgcgtc accaaatgaa tgcaaccaaa tgtgcctttc aactctcatg
+     1141 aagtgtgatc attgtggtga aacttcatgg cagacgggcg attttgttaa agccacttgc
+     1201 gaattttgtg gcactgagaa tttgactaaa gaaggtgcca ctacttgtgg ttacttaccc
+     1261 caaaatgctg ttgttaaaat ttattgtcca gcatgtcaca attcagaagt aggacctgag
+     1321 catagtcttg ccgaatacca taatgaatct ggcttgaaaa ccattcttcg taagggtggt
+     1381 cgcactattg cctttggagg ctgtgtgttc tcttatgttg gttgccataa caagtgtgcc
+     1441 tattgggttc cacgtgctag cgctaacata ggttgtaacc atacaggtgt tgttggagaa
+     1501 ggttccgaag gtcttaatga caaccttctt gaaatactcc aaaaagagaa agtcaacatc
+     1561 aatattgttg gtgactttaa acttaatgaa gagatcgcca ttattttggc atctttttct
+     1621 gcttccacaa gtgcttttgt ggaaactgtg aaaggtttgg attataaagc attcaaacaa
+     1681 attgttgaat cctgtggtaa ttttaaagtt acaaaaggaa aagctaaaaa aggtgcctgg
+     1741 aatattggtg aacagaaatc aatactgagt cctctttatg catttgcatc agaggctgct
+     1801 cgtgttgtac gatcaatttt ctcccgcact cttgaaactg ctcaaaattc tgtgcgtgtt
+     1861 ttacagaagg ccgctataac aatactagat ggaatttcac agtattcact gagactcatt
+     1921 gatgctatga tgttcacatc tgatttggct actaacaatc tagttgtaat ggcctacatt
+     1981 acaggtggtg ttgttcagtt gacttcgcag tggctaacta acatctttgg cactgtttat
+     2041 gaaaaactca aacccgtcct tgattggctt gaagagaagt ttaaggaagg tgtagagttt
+     2101 cttagagacg gttgggaaat tgttaaattt atctcaacct gtgcttgtga aattgtcggt
+     2161 ggacaaattg tcacctgtgc aaaggaaatt aaggagagtg ttcagacatt ctttaagctt
+     2221 gtaaataaat ttttggcttt gtgtgctgac tctatcatta ttggtggagc taaacttaaa
+     2281 gccttgaatt taggtgaaac atttgtcacg cactcaaagg gattgtacag aaagtgtgtt
+     2341 aaatccagag aagaaactgg cctactcatg cctctaaaag ccccaaaaga aattatcttc
+     2401 ttagagggag aaacacttcc cacagaagtg ttaacagagg aagttgtctt gaaaactggt
+     2461 gatttacaac cattagaaca acctactagt gaagctgttg aagctccatt ggttggtaca
+     2521 ccagtttgta ttaacgggct tatgttgctc gaaatcaaag acacagaaaa gtactgtgcc
+     2581 cttgcaccta atatgatggt aacaaacaat accttcacac tcaaaggcgg tgcaccaaca
+     2641 aaggttactt ttggtgatga cactgtgata gaagtgcaag gttacaagag tgtgaatatc
+     2701 acttttgaac ttgatgaaag gattgataaa gtacttaatg agaagtgctc tgcctataca
+     2761 gttgaactcg gtacagaagt aaatgagttc gcctgtgttg tggcagatgc tgtcataaaa
+     2821 actttgcaac cagtatctga attacttaca ccactgggca ttgatttaga tgagtggagt
+     2881 atggctacat actacttatt tgatgagtct ggtgagttta aattggcttc acatatgtat
+     2941 tgttctttct accctccaga tgaggatgaa gaagaaggtg attgtgaaga agaagagttt
+     3001 gagccatcaa ctcaatatga gtatggtact gaagatgatt accaaggtaa acctttggaa
+     3061 tttggtgcca cttctgctgc tcttcaacct gaagaagagc aagaagaaga ttggttagat
+     3121 gatgatagtc aacaaactgt tggtcaacaa gacggcagtg aggacaatca gacaactact
+     3181 attcaaacaa ttgttgaggt tcaacctcaa ttagagatgg aacttacacc agttgttcag
+     3241 actattgaag tgaatagttt tagtggttat ttaaaactta ctgacaatgt atacattaaa
+     3301 aatgcagaca ttgtggaaga agctaaaaag gtaaaaccaa cagtggttgt taatgcagcc
+     3361 aatgtttacc ttaaacatgg aggaggtgtt gcaggagcct taaataaggc tactaacaat
+     3421 gccatgcaag ttgaatctga tgattacata gctactaatg gaccacttaa agtgggtggt
+     3481 agttgtgttt taagcggaca caatcttgct aaacactgtc ttcatgttgt cggcccaaat
+     3541 gttaacaaag gtgaagacat tcaacttctt aagagtgctt atgaaaattt taatcagcac
+     3601 gaagttctac ttgcaccatt attatcagct ggtatttttg gtgctgaccc tatacattct
+     3661 ttaagagttt gtgtagatac tgttcgcaca aatgtctact tagctgtctt tgataaaaat
+     3721 ctctatgaca aacttgtttc aagctttttg gaaatgaaga gtgaaaagca agttgaacaa
+     3781 aagatcgctg agattcctaa agaggaagtt aagccattta taactgaaag taaaccttca
+     3841 gttgaacaga gaaaacaaga tgataagaaa atcaaagctt gtgttgaaga agttacaaca
+     3901 actctggaag aaactaagtt cctcacagaa aacttgttac tttatattga cattaatggc
+     3961 aatcttcatc cagattctgc cactcttgtt agtgacattg acatcacttt cttaaagaaa
+     4021 gatgctccat atatagtggg tgatgttgtt caagagggtg ttttaactgc tgtggttata
+     4081 cctactaaaa aggctggtgg cactactgaa atgctagcga aagctttgag aaaagtgcca
+     4141 acagacaatt atataaccac ttacccgggt cagggtttaa atggttacac tgtagaggag
+     4201 gcaaagacag tgcttaaaaa gtgtaaaagt gccttttaca ttctaccatc tattatctct
+     4261 aatgagaagc aagaaattct tggaactgtt tcttggaatt tgcgagaaat gcttgcacat
+     4321 gcagaagaaa cacgcaaatt aatgcctgtc tgtgtggaaa ctaaagccat agtttcaact
+     4381 atacagcgta aatataaggg tattaaaata caagagggtg tggttgatta tggtgctaga
+     4441 ttttactttt acaccagtaa aacaactgta gcgtcactta tcaacacact taacgatcta
+     4501 aatgaaactc ttgttacaat gccacttggc tatgtaacac atggcttaaa tttggaagaa
+     4561 gctgctcggt atatgagatc tctcaaagtg ccagctacag tttctgtttc ttcacctgat
+     4621 gctgttacag cgtataatgg ttatcttact tcttcttcta aaacacctga agaacatttt
+     4681 attgaaacca tctcacttgc tggttcctat aaagattggt cctattctgg acaatctaca
+     4741 caactaggta tagaatttct taagagaggt gataaaagtg tatattacac tagtaatcct
+     4801 accacattcc acctagatgg tgaagttatc acctttgaca atcttaagac acttctttct
+     4861 ttgagagaag tgaggactat taaggtgttt acaacagtag acaacattaa cctccacacg
+     4921 caagttgtgg acatgtcaat gacatatgga caacagtttg gtccaactta tttggatgga
+     4981 gctgatgtta ctaaaataaa acctcataat tcacatgaag gtaaaacatt ttatgtttta
+     5041 cctaatgatg acactctacg tgttgaggct tttgagtact accacacaac tgatcctagt
+     5101 tttctgggta ggtacatgtc agcattaaat cacactaaaa agtggaaata cccacaagtt
+     5161 aatggtttaa cttctattaa atgggcagat aacaactgtt atcttgccac tgcattgtta
+     5221 acactccaac aaatagagtt gaagtttaat ccacctgctc tacaagatgc ttattacaga
+     5281 gcaagggctg gtgaagctgc taacttttgt gcacttatct tagcctactg taataagaca
+     5341 gtaggtgagt taggtgatgt tagagaaaca atgagttact tgtttcaaca tgccaattta
+     5401 gattcttgca aaagagtctt gaacgtggtg tgtaaaactt gtggacaaca gcagacaacc
+     5461 cttaagggtg tagaagctgt tatgtacatg ggcacacttt cttatgaaca atttaagaaa
+     5521 ggtgttcaga taccttgtac gtgtggtaaa caagctacaa aatatctagt acaacaggag
+     5581 tcaccttttg ttatgatgtc agcaccacct gctcagtatg aacttaagca tggtacattt
+     5641 acttgtgcta gtgagtacac tggtaattac cagtgtggtc actataaaca tataacttct
+     5701 aaagaaactt tgtattgcat agacggtgct ttacttacaa agtcctcaga atacaaaggt
+     5761 cctattacgg atgttttcta caaagaaaac agttacacaa caaccataaa accagttact
+     5821 tataaattgg atggtgttgt ttgtacagaa attgacccta agttggacaa ttattataag
+     5881 aaagacaatt cttatttcac agagcaacca attgatcttg taccaaacca accatatcca
+     5941 aacgcaagct tcgataattt taagtttgta tgtgataata tcaaatttgc tgatgattta
+     6001 aaccagttaa ctggttataa gaaacctgct tcaagagagc ttaaagttac atttttccct
+     6061 gacttaaatg gtgatgtggt ggctattgat tataaacact acacaccctc ttttaagaaa
+     6121 ggagctaaat tgttacataa acctattgtt tggcatgtta acaatgcaac taataaagcc
+     6181 acgtataaac caaatacctg gtgtatacgt tgtctttgga gcacaaaacc agttgaaaca
+     6241 tcaaattcgt ttgatgtact gaagtcagag gacgcgcagg gaatggataa tcttgcctgc
+     6301 gaagatctaa aaccagtctc tgaagaagta gtggaaaatc ctaccataca gaaagacgtt
+     6361 cttgagtgta atgtgaaaac taccgaagtt gtaggagaca ttatacttaa accagcaaat
+     6421 aatagtttaa aaattacaga agaggttggc cacacagatc taatggctgc ttatgtagac
+     6481 aattctagtc ttactattaa gaaacctaat gaattatcta gagtattagg tttgaaaacc
+     6541 cttgctactc atggtttagc tgctgttaat agtgtccctt gggatactat agctaattat
+     6601 gctaagcctt ttcttaacaa agttgttagt acaactacta acatagttac acggtgttta
+     6661 aaccgtgttt gtactaatta tatgccttat ttctttactt tattgctaca attgtgtact
+     6721 tttactagaa gtacaaattc tagaattaaa gcatctatgc cgactactat agcaaagaat
+     6781 actgttaaga gtgtcggtaa attttgtcta gaggcttcat ttaattattt gaagtcacct
+     6841 aatttttcta aactgataaa tattataatt tggtttttac tattaagtgt ttgcctaggt
+     6901 tctttaatct actcaaccgc tgctttaggt gttttaatgt ctaatttagg catgccttct
+     6961 tactgtactg gttacagaga aggctatttg aactctacta atgtcactat tgcaacctac
+     7021 tgtactggtt ctataccttg tagtgtttgt cttagtggtt tagattcttt agacacctat
+     7081 ccttctttag aaactataca aattaccatt tcatctttta aatgggattt aactgctttt
+     7141 ggcttagttg cagagtggtt tttggcatat attcttttca ctaggttttt ctatgtactt
+     7201 ggattggctg caatcatgca attgtttttc agctattttg cagtacattt tattagtaat
+     7261 tcttggctta tgtggttaat aattaatctt gtacaaatgg ccccgatttc agctatggtt
+     7321 agaatgtaca tcttctttgc atcattttat tatgtatgga aaagttatgt gcatgttgta
+     7381 gacggttgta attcatcaac ttgtatgatg tgttacaaac gtaatagagc aacaagagtc
+     7441 gaatgtacaa ctattgttaa tggtgttaga aggtcctttt atgtctatgc taatggaggt
+     7501 aaaggctttt gcaaactaca caattggaat tgtgttaatt gtgatacatt ctgtgctggt
+     7561 agtacattta ttagtgatga agttgcgaga gacttgtcac tacagtttaa aagaccaata
+     7621 aatcctactg accagtcttc ttacatcgtt gatagtgtta cagtgaagaa tggttccatc
+     7681 catctttact ttgataaagc tggtcaaaag acttatgaaa gacattctct ctctcatttt
+     7741 gttaacttag acaacctgag agctaataac actaaaggtt cattgcctat taatgttata
+     7801 gtttttgatg gtaaatcaaa atgtgaagaa tcatctgcaa aatcagcgtc tgtttactac
+     7861 agtcagctta tgtgtcaacc tatactgtta ctagatcagg cattagtgtc tgatgttggt
+     7921 gatagtgcgg aagttgcagt taaaatgttt gatgcttacg ttaatacgtt ttcatcaact
+     7981 tttaacgtac caatggaaaa actcaaaaca ctagttgcaa ctgcagaagc tgaacttgca
+     8041 aagaatgtgt ccttagacaa tgtcttatct acttttattt cagcagctcg gcaagggttt
+     8101 gttgattcag atgtagaaac taaagatgtt gttgaatgtc ttaaattgtc acatcaatct
+     8161 gacatagaag ttactggcga tagttgtaat aactatatgc tcacctataa caaagttgaa
+     8221 aacatgacac cccgtgacct tggtgcttgt attgactgta gtgcgcgtca tattaatgcg
+     8281 caggtagcaa aaagtcacaa cattgctttg atatggaacg ttaaagattt catgtcattg
+     8341 tctgaacaac tacgaaaaca aatacgtagt gctgctaaaa agaataactt accttttaag
+     8401 ttgacatgtg caactactag acaagttgtt aatgttgtaa caacaaagat agcacttaag
+     8461 ggtggtaaaa ttgttaataa ttggttgaag cagttaatta aagttacact tgtgttcctt
+     8521 tttgttgctg ctattttcta tttaataaca cctgttcatg tcatgtctaa acatactgac
+     8581 ttttcaagtg aaatcatagg atacaaggct attgatggtg gtgtcactcg tgacatagca
+     8641 tctacagata cttgttttgc taacaaacat gctgattttg acacatggtt tagccagcgt
+     8701 ggtggtagtt atactaatga caaagcttgc ccattgattg ctgcagtcat aacaagagaa
+     8761 gtgggttttg tcgtgcctgg tttgcctggc acgatattac gcacaactaa tggtgacttt
+     8821 ttgcatttct tacctagagt ttttagtgca gttggtaaca tctgttacac accatcaaaa
+     8881 cttatagagt acactgactt tgcaacatca gcttgtgttt tggctgctga atgtacaatt
+     8941 tttaaagatg cttctggtaa gccagtacca tattgttatg ataccaatgt actagaaggt
+     9001 tctgttgctt atgaaagttt acgccctgac acacgttatg tgctcatgga tggctctatt
+     9061 attcaatttc ctaacaccta ccttgaaggt tctgttagag tggtaacaac ttttgattct
+     9121 gagtactgta ggcacggcac ttgtgaaaga tcagaagctg gtgtttgtgt atctactagt
+     9181 ggtagatggg tacttaacaa tgattattac agatctttac caggagtttt ctgtggtgta
+     9241 gatgctgtaa atttacttac taatatgttt acaccactaa ttcaacctat tggtgctttg
+     9301 gacatatcag catctatagt agctggtggt attgtagcta tcgtagtaac atgccttgcc
+     9361 tactatttta tgaggtttag aagagctttt ggtgaataca gtcatgtagt tgcctttaat
+     9421 actttactat tccttatgtc attcactgta ctctgtttaa caccagttta ctcattctta
+     9481 cctggtgttt attctgttat ttacttgtac ttgacatttt atcttactaa tgatgtttct
+     9541 tttttagcac atattcagtg gatggttatg ttcacacctt tagtaccttt ctggataaca
+     9601 attgcttata tcatttgtat ttccacaaag catttctatt ggttctttag taattaccta
+     9661 aagagacgtg tagtctttaa tggtgtttcc tttagtactt ttgaagaagc tgcgctgtgc
+     9721 acctttttgt taaataaaga aatgtatcta aagttgcgta gtgatgtgct attacctctt
+     9781 acgcaatata atagatactt agctctttat aataagtaca agtattttag tggagcaatg
+     9841 gatacaacta gctacagaga agctgcttgt tgtcatctcg caaaggctct caatgacttc
+     9901 agtaactcag gttctgatgt tctttaccaa ccaccacaaa cctctatcac ctcagctgtt
+     9961 ttgcagagtg gttttagaaa aatggcattc ccatctggta aagttgaggg ttgtatggta
+    10021 caagtaactt gtggtacaac tacacttaac ggtctttggc ttgatgacgt agtttactgt
+    10081 ccaagacatg tgatctgcac ctctgaagac atgcttaacc ctaattatga agatttactc
+    10141 attcgtaagt ctaatcataa tttcttggta caggctggta atgttcaact cagggttatt
+    10201 ggacattcta tgcaaaattg tgtacttaag cttaaggttg atacagccaa tcctaagaca
+    10261 cctaagtata agtttgttcg cattcaacca ggacagactt tttcagtgtt agcttgttac
+    10321 aatggttcac catctggtgt ttaccaatgt gctatgaggc ccaatttcac tattaagggt
+    10381 tcattcctta atggttcatg tggtagtgtt ggttttaaca tagattatga ctgtgtctct
+    10441 ttttgttaca tgcaccatat ggaattacca actggagttc atgctggcac agacttagaa
+    10501 ggtaactttt atggaccttt tgttgacagg caaacagcac aagcagctgg tacggacaca
+    10561 actattacag ttaatgtttt agcttggttg tacgctgctg ttataaatgg agacaggtgg
+    10621 tttctcaatc gatttaccac aactcttaat gactttaacc ttgtggctat gaagtacaat
+    10681 tatgaacctc taacacaaga ccatgttgac atactaggac ctctttctgc tcaaactgga
+    10741 attgccgttt tagatatgtg tgcttcatta aaagaattac tgcaaaatgg tatgaatgga
+    10801 cgtaccatat tgggtagtgc tttattagaa gatgaattta caccttttga tgttgttaga
+    10861 caatgctcag gtgttacttt ccaaagtgca gtgaaaagaa caatcaaggg tacacaccac
+    10921 tggttgttac tcacaatttt gacttcactt ttagttttag tccagagtac tcaatggtct
+    10981 ttgttctttt ttttgtatga aaatgccttt ttaccttttg ctatgggtat tattgctatg
+    11041 tctgcttttg caatgatgtt tgtcaaacat aagcatgcat ttctctgttt gtttttgtta
+    11101 ccttctcttg ccactgtagc ttattttaat atggtctata tgcctgctag ttgggtgatg
+    11161 cgtattatga catggttgga tatggttgat actagtttgt ctggttttaa gctaaaagac
+    11221 tgtgttatgt atgcatcagc tgtagtgtta ctaatcctta tgacagcaag aactgtgtat
+    11281 gatgatggtg ctaggagagt gtggacactt atgaatgtct tgacactcgt ttataaagtt
+    11341 tattatggta atgctttaga tcaagccatt tccatgtggg ctcttataat ctctgttact
+    11401 tctaactact caggtgtagt tacaactgtc atgtttttgg ccagaggtat tgtttttatg
+    11461 tgtgttgagt attgccctat tttcttcata actggtaata cacttcagtg tataatgcta
+    11521 gtttattgtt tcttaggcta tttttgtact tgttactttg gcctcttttg tttactcaac
+    11581 cgctacttta gactgactct tggtgtttat gattacttag tttctacaca ggagtttaga
+    11641 tatatgaatt cacagggact actcccaccc aagaatagca tagatgcctt caaactcaac
+    11701 attaaattgt tgggtgttgg tggcaaacct tgtatcaaag tagccactgt acagtctaaa
+    11761 atgtcagatg taaagtgcac atcagtagtc ttactctcag ttttgcaaca actcagagta
+    11821 gaatcatcat ctaaattgtg ggctcaatgt gtccagttac acaatgacat tctcttagct
+    11881 aaagatacta ctgaagcctt tgaaaaaatg gtttcactac tttctgtttt gctttccatg
+    11941 cagggtgctg tagacataaa caagctttgt gaagaaatgc tggacaacag ggcaacctta
+    12001 caagctatag cctcagagtt tagttccctt ccatcatatg cagcttttgc tactgctcaa
+    12061 gaagcttatg agcaggctgt tgctaatggt gattctgaag ttgttcttaa aaagttgaag
+    12121 aagtctttga atgtggctaa atctgaattt gaccgtgatg cagccatgca acgtaagttg
+    12181 gaaaagatgg ctgatcaagc tatgacccaa atgtataaac aggctagatc tgaggacaag
+    12241 agggcaaaag ttactagtgc tatgcagaca atgcttttca ctatgcttag aaagttggat
+    12301 aatgatgcac tcaacaacat tatcaacaat gcaagagatg gttgtgttcc cttgaacata
+    12361 atacctctta caacagcagc caaactaatg gttgtcatac cagactataa cacatataaa
+    12421 aatacgtgtg atggtacaac atttacttat gcatcagcat tgtgggaaat ccaacaggtt
+    12481 gtagatgcag atagtaaaat tgttcaactt agtgaaatta gtatggacaa ttcacctaat
+    12541 ttagcatggc ctcttattgt aacagcttta agggccaatt ctgctgtcaa attacagaat
+    12601 aatgagctta gtcctgttgc actacgacag atgtcttgtg ctgccggtac tacacaaact
+    12661 gcttgcactg atgacaatgc gttagcttac tacaacacaa caaagggagg taggtttgta
+    12721 cttgcactgt tatccgattt acaggatttg aaatgggcta gattccctaa gagtgatgga
+    12781 actggtacta tctatacaga actggaacca ccttgtaggt ttgttacaga cacacctaaa
+    12841 ggtcctaaag tgaagtattt atactttatt aaaggattaa acaacctaaa tagaggtatg
+    12901 gtacttggta gtttagctgc cacagtacgt ctacaagctg gtaatgcaac agaagtgcct
+    12961 gccaattcaa ctgtattatc tttctgtgct tttgctgtag atgctgctaa agcttacaaa
+    13021 gattatctag ctagtggggg acaaccaatc actaattgtg ttaagatgtt gtgtacacac
+    13081 actggtactg gtcaggcaat aacagttaca ccggaagcca atatggatca agaatccttt
+    13141 ggtggtgcat cgtgttgtct gtactgccgt tgccacatag atcatccaaa tcctaaagga
+    13201 ttttgtgact taaaaggtaa gtatgtacaa atacctacaa cttgtgctaa tgaccctgtg
+    13261 ggttttacac ttaaaaacac agtctgtacc gtctgcggta tgtggaaagg ttatggctgt
+    13321 agttgtgatc aactccgcga acccatgctt cagtcagctg atgcacaatc gtttttaaac
+    13381 gggtttgcgg tgtaagtgca gcccgtctta caccgtgcgg cacaggcact agtactgatg
+    13441 tcgtatacag ggcttttgac atctacaatg ataaagtagc tggttttgct aaattcctaa
+    13501 aaactaattg ttgtcgcttc caagaaaagg acgaagatga caatttaatt gattcttact
+    13561 ttgtagttaa gagacacact ttctctaact accaacatga agaaacaatt tataatttac
+    13621 ttaaggattg tccagctgtt gctaaacatg acttctttaa gtttagaata gacggtgaca
+    13681 tggtaccaca tatatcacgt caacgtctta ctaaatacac aatggcagac ctcgtctatg
+    13741 ctttaaggca ttttgatgaa ggtaattgtg acacattaaa agaaatactt gtcacataca
+    13801 attgttgtga tgatgattat ttcaataaaa aggactggta tgattttgta gaaaacccag
+    13861 atatattacg cgtatacgcc aacttaggtg aacgtgtacg ccaagctttg ttaaaaacag
+    13921 tacaattctg tgatgccatg cgaaatgctg gtattgttgg tgtactgaca ttagataatc
+    13981 aagatctcaa tggtaactgg tatgatttcg gtgatttcat acaaaccacg ccaggtagtg
+    14041 gagttcctgt tgtagattct tattattcat tgttaatgcc tatattaacc ttgaccaggg
+    14101 ctttaactgc agagtcacat gttgacactg acttaacaaa gccttacatt aagtgggatt
+    14161 tgttaaaata tgacttcacg gaagagaggt taaaactctt tgaccgttat tttaaatatt
+    14221 gggatcagac ataccaccca aattgtgtta actgtttgga tgacagatgc attctgcatt
+    14281 gtgcaaactt taatgtttta ttctctacag tgttcccacc tacaagtttt ggaccactag
+    14341 tgagaaaaat atttgttgat ggtgttccat ttgtagtttc aactggatac cacttcagag
+    14401 agctaggtgt tgtacataat caggatgtaa acttacatag ctctagactt agttttaagg
+    14461 aattacttgt gtatgctgct gaccctgcta tgcacgctgc ttctggtaat ctattactag
+    14521 ataaacgcac tacgtgcttt tcagtagctg cacttactaa caatgttgct tttcaaactg
+    14581 tcaaacccgg taattttaac aaagacttct atgactttgc tgtgtctaag ggtttcttta
+    14641 aggaaggaag ttctgttgaa ttaaaacact tcttctttgc tcaggatggt aatgctgcta
+    14701 tcagcgatta tgactactat cgttataatc taccaacaat gtgtgatatc agacaactac
+    14761 tatttgtagt tgaagttgtt gataagtact ttgattgtta cgatggtggc tgtattaatg
+    14821 ctaaccaagt catcgtcaac aacctagaca aatcagctgg ttttccattt aataaatggg
+    14881 gtaaggctag actttattat gattcaatga gttatgagga tcaagatgca cttttcgcat
+    14941 atacaaaacg taatgtcatc cctactataa ctcaaatgaa tcttaagtat gccattagtg
+    15001 caaagaatag agctcgcacc gtagctggtg tctctatctg tagtactatg accaatagac
+    15061 agtttcatca aaaattattg aaatcaatag ccgccactag aggagctact gtagtaattg
+    15121 gaacaagcaa attctatggt ggttggcaca acatgttaaa aactgtttat agtgatgtag
+    15181 aaaaccctca ccttatgggt tgggattatc ctaaatgtga tagagccatg cctaacatgc
+    15241 ttagaattat ggcctcactt gttcttgctc gcaaacatac aacgtgttgt agcttgtcac
+    15301 accgtttcta tagattagct aatgagtgtg ctcaagtatt gagtgaaatg gtcatgtgtg
+    15361 gcggttcact atatgttaaa ccaggtggaa cctcatcagg agatgccaca actgcttatg
+    15421 ctaatagtgt ttttaacatt tgtcaagctg tcacggccaa tgttaatgca cttttatcta
+    15481 ctgatggtaa caaaattgcc gataagtatg tccgcaattt acaacacaga ctttatgagt
+    15541 gtctctatag aaatagagat gttgacacag actttgtgaa tgagttttac gcatatttgc
+    15601 gtaaacattt ctcaatgatg atactctctg acgatgctgt tgtgtgtttc aatagcactt
+    15661 atgcatctca aggtctagtg gctagcataa agaactttaa gtcagttctt tattatcaaa
+    15721 acaatgtttt tatgtctgaa gcaaaatgtt ggactgagac tgaccttact aaaggacctc
+    15781 atgaattttg ctctcaacat acaatgctag ttaaacaggg tgatgattat gtgtaccttc
+    15841 cttacccaga tccatcaaga atcctagggg ccggctgttt tgtagatgat atcgtaaaaa
+    15901 cagatggtac acttatgatt gaacggttcg tgtctttagc tatagatgct tacccactta
+    15961 ctaaacatcc taatcaggag tatgctgatg tctttcattt gtacttacaa tacataagaa
+    16021 agctacatga tgagttaaca ggacacatgt tagacatgta ttctgttatg cttactaatg
+    16081 ataacacttc aaggtattgg gaacctgagt tttatgaggc tatgtacaca ccgcatacag
+    16141 tcttacaggc tgttggggct tgtgttcttt gcaattcaca gacttcatta agatgtggtg
+    16201 cttgcatacg tagaccattc ttatgttgta aatgctgtta cgaccatgtc atatcaacat
+    16261 cacataaatt agtcttgtct gttaatccgt atgtttgcaa tgctccaggt tgtgatgtca
+    16321 cagatgtgac tcaactttac ttaggaggta tgagctatta ttgtaaatca cataaaccac
+    16381 ccattagttt tccattgtgt gctaatggac aagtttttgg tttatataaa aatacatgtg
+    16441 ttggtagcga taatgttact gactttaatg caattgcaac atgtgactgg acaaatgctg
+    16501 gtgattacat tttagctaac acctgtactg aaagactcaa gctttttgca gcagaaacgc
+    16561 tcaaagctac tgaggagaca tttaaactgt cttatggtat tgctactgta cgtgaagtgc
+    16621 tgtctgacag agaattacat ctttcatggg aagttggtaa acctagacca ccacttaacc
+    16681 gaaattatgt ctttactggt tatcgtgtaa ctaaaaacag taaagtacaa ataggagagt
+    16741 acacctttga aaaaggtgac tatggtgatg ctgttgttta ccgaggtaca acaacttaca
+    16801 aattaaatgt tggtgattat tttgtgctga catcacatac agtaatgcca ttaagtgcac
+    16861 ctacactagt gccacaagag cactatgtta gaattactgg cttataccca acactcaata
+    16921 tctcagatga gttttctagc aatgttgcaa attatcaaaa ggttggtatg caaaagtatt
+    16981 ctacactcca gggaccacct ggtactggta agagtcattt tgctattggc ctagctctct
+    17041 actacccttc tgctcgcata gtgtatacag cttgctctca tgccgctgtt gatgcactat
+    17101 gtgagaaggc attaaaatat ttgcctatag ataaatgtag tagaattata cctgcacgtg
+    17161 ctcgtgtaga gtgttttgat aaattcaaag tgaattcaac attagaacag tatgtctttt
+    17221 gtactgtaaa tgcattgcct gagacgacag cagatatagt tgtctttgat gaaatttcaa
+    17281 tggccacaaa ttatgatttg agtgttgtca atgccagatt acgtgctaag cactatgtgt
+    17341 acattggcga ccctgctcaa ttacctgcac cacgcacatt gctaactaag ggcacactag
+    17401 aaccagaata tttcaattca gtgtgtagac ttatgaaaac tataggtcca gacatgttcc
+    17461 tcggaacttg tcggcgttgt cctgctgaaa ttgttgacac tgtgagtgct ttggtttatg
+    17521 ataataagct taaagcacat aaagacaaat cagctcaatg ctttaaaatg ttttataagg
+    17581 gtgttatcac gcatgatgtt tcatctgcaa ttaacaggcc acaaataggc gtggtaagag
+    17641 aattccttac acgtaaccct gcttggagaa aagctgtctt tatttcacct tataattcac
+    17701 agaatgctgt agcctcaaag attttgggac taccaactca aactgttgat tcatcacagg
+    17761 gctcagaata tgactatgtc atattcactc aaaccactga aacagctcac tcttgtaatg
+    17821 taaacagatt taatgttgct attaccagag caaaagtagg catactttgc ataatgtctg
+    17881 atagagacct ttatgacaag ttgcaattta caagtcttga aattccacgt aggaatgtgg
+    17941 caactttaca agctgaaaat gtaacaggac tctttaaaga ttgtagtaag gtaatcactg
+    18001 ggttacatcc tacacaggca cctacacacc tcagtgttga cactaaattc aaaactgaag
+    18061 gtttatgtgt tgacatacct ggcataccta aggacatgac ctatagaaga ctcatctcta
+    18121 tgatgggttt taaaatgaat tatcaagtta atggttaccc taacatgttt atcacccgcg
+    18181 aagaagctat aagacatgta cgtgcatgga ttggcttcga tgtcgagggg tgtcatgcta
+    18241 ctagagaagc tgttggtacc aatttacctt tacagctagg tttttctaca ggtgttaacc
+    18301 tagttgctgt acctacaggt tatgttgata cacctaataa tacagatttt tccagagtta
+    18361 gtgctaaacc accgcctgga gatcaattta aacacctcat accacttatg tacaaaggac
+    18421 ttccttggaa tgtagtgcgt ataaagattg tacaaatgtt aagtgacaca cttaaaaatc
+    18481 tctctgacag agtcgtattt gtcttatggg cacatggctt tgagttgaca tctatgaagt
+    18541 attttgtgaa aataggacct gagcgcacct gttgtctatg tgatagacgt gccacatgct
+    18601 tttccactgc ttcagacact tatgcctgtt ggcatcattc tattggattt gattacgtct
+    18661 ataatccgtt tatgattgat gttcaacaat ggggttttac aggtaaccta caaagcaacc
+    18721 atgatctgta ttgtcaagtc catggtaatg cacatgtagc tagttgtgat gcaatcatga
+    18781 ctaggtgtct agctgtccac gagtgctttg ttaagcgtgt tgactggact attgaatatc
+    18841 ctataattgg tgatgaactg aagattaatg cggcttgtag aaaggttcaa cacatggttg
+    18901 ttaaagctgc attattagca gacaaattcc cagttcttca cgacattggt aaccctaaag
+    18961 ctattaagtg tgtacctcaa gctgatgtag aatggaagtt ctatgatgca cagccttgta
+    19021 gtgacaaagc ttataaaata gaagaattat tctattctta tgccacacat tctgacaaat
+    19081 tcacagatgg tgtatgccta ttttggaatt gcaatgtcga tagatatcct gctaattcca
+    19141 ttgtttgtag atttgacact agagtgctat ctaaccttaa cttgcctggt tgtgatggtg
+    19201 gcagtttgta tgtaaataaa catgcattcc acacaccagc ttttgataaa agtgcttttg
+    19261 ttaatttaaa acaattacca tttttctatt actctgacag tccatgtgag tctcatggaa
+    19321 aacaagtagt gtcagatata gattatgtac cactaaagtc tgctacgtgt ataacacgtt
+    19381 gcaatttagg tggtgctgtc tgtagacatc atgctaatga gtacagattg tatctcgatg
+    19441 cttataacat gatgatctca gctggcttta gcttgtgggt ttacaaacaa tttgatactt
+    19501 ataacctctg gaacactttt acaagacttc agagtttaga aaatgtggct tttaatgttg
+    19561 taaataaggg acactttgat ggacaacagg gtgaagtacc agtttctatc attaataaca
+    19621 ctgtttacac aaaagttgat ggtgttgatg tagaattgtt tgaaaataaa acaacattac
+    19681 ctgttaatgt agcatttgag ctttgggcta agcgcaacat taaaccagta ccagaggtga
+    19741 aaatactcaa taatttgggt gtggacattg ctgctaatac tgtgatctgg gactacaaaa
+    19801 gagatgctcc agcacatata tctactattg gtgtttgttc tatgactgac atagccaaga
+    19861 aaccaactga aacgatttgt gcaccactca ctgtcttttt tgatggtaga gttgatggtc
+    19921 aagtagactt atttagaaat gcccgtaatg gtgttcttat tacagaaggt agtgttaaag
+    19981 gtttacaacc atctgtaggt cccaaacaag ctagtcttaa tggagtcaca ttaattggag
+    20041 aagccgtaaa aacacagttc aattattata agaaagttga tggtgttgtc caacaattac
+    20101 ctgaaactta ctttactcag agtagaaatt tacaagaatt taaacccagg agtcaaatgg
+    20161 aaattgattt cttagaatta gctatggatg aattcattga acggtataaa ttagaaggct
+    20221 atgccttcga acatatcgtt tatggagatt ttagtcatag tcagttaggt ggtttacatc
+    20281 tactgattgg actagctaaa cgttttaagg aatcaccttt tgaattagaa gattttattc
+    20341 ctatggacag tacagttaaa aactatttca taacagatgc gcaaacaggt tcatctaagt
+    20401 gtgtgtgttc tgttattgat ttattacttg atgattttgt tgaaataata aaatcccaag
+    20461 atttatctgt agtttctaag gttgtcaaag tgactattga ctatacagaa atttcattta
+    20521 tgctttggtg taaagatggc catgtagaaa cattttaccc aaaattacaa tctagtcaag
+    20581 cgtggcaacc gggtgttgct atgcctaatc tttacaaaat gcaaagaatg ctattagaaa
+    20641 agtgtgacct tcaaaattat ggtgatagtg caacattacc taaaggcata atgatgaatg
+    20701 tcgcaaaata tactcaactg tgtcaatatt taaacacatt aacattagct gtaccctata
+    20761 atatgagagt tatacatttt ggtgctggtt ctgataaagg agttgcacca ggtacagctg
+    20821 ttttaagaca gtggttgcct acgggtacgc tgcttgtcga ttcagatctt aatgactttg
+    20881 tctctgatgc agattcaact ttgattggtg attgtgcaac tgtacataca gctaataaat
+    20941 gggatctcat tattagtgat atgtacgacc ctaagactaa aaatgttaca aaagaaaatg
+    21001 actctaaaga gggttttttc acttacattt gtgggtttat acaacaaaag ctagctcttg
+    21061 gaggttccgt ggctataaag ataacagaac attcttggaa tgctgatctt tataagctca
+    21121 tgggacactt cgcatggtgg acagcctttg ttactaatgt gaatgcgtca tcatctgaag
+    21181 catttttaat tggatgtaat tatcttggca aaccacgcga acaaatagat ggttatgtca
+    21241 tgcatgcaaa ttacatattt tggaggaata caaatccaat tcagttgtct tcctattctt
+    21301 tatttgacat gagtaaattt ccccttaaat taaggggtac tgctgttatg tctttaaaag
+    21361 aaggtcaaat caatgatatg attttatctc ttcttagtaa aggtagactt ataattagag
+    21421 aaaacaacag agttgttatt tctagtgatg ttcttgttaa caactaaacg aacaatgttt
+    21481 gtttttcttg ttttattgcc actagtctct agtcagtgtg ttaatcttac aaccagaact
+    21541 caattacccc ctgcatacac taattctttc acacgtggtg tttattaccc tgacaaagtt
+    21601 ttcagatcct cagttttaca ttcaactcag gacttgttct tacctttctt ttccaatgtt
+    21661 acttggttcc atgctataca tgtctctggg accaatggta ctaagaggtt tgataaccct
+    21721 gtcctaccat ttaatgatgg tgtttatttt gcttccactg agaagtctaa cataataaga
+    21781 ggctggattt ttggtactac tttagattcg aagacccagt ccctacttat tgttaataac
+    21841 gctactaatg ttgttattaa agtctgtgaa tttcaatttt gtaatgatcc atttttgggt
+    21901 gtttattacc acaaaaacaa caaaagttgg atggaaagtg agttcagagt ttattctagt
+    21961 gcgaataatt gcacttttga atatgtctct cagccttttc ttatggacct tgaaggaaaa
+    22021 cagggtaatt tcaaaaatct tagggaattt gtgtttaaga atattgatgg ttattttaaa
+    22081 atatattcta agcacacgcc tattaattta gtgcgtgatc tccctcaggg tttttcggct
+    22141 ttagaaccat tggtagattt gccaataggt attaacatca ctaggtttca aactttactt
+    22201 gctttacata gaagttattt gactcctggt gattcttctt caggttggac agctggtgct
+    22261 gcagcttatt atgtgggtta tcttcaacct aggacttttc tattaaaata taatgaaaat
+    22321 ggaaccatta cagatgctgt agactgtgca cttgaccctc tctcagaaac aaagtgtacg
+    22381 ttgaaatcct tcactgtaga aaaaggaatc tatcaaactt ctaactttag agtccaacca
+    22441 acagaatcta ttgttagatt tcctaatatt acaaacttgt gcccttttgg tgaagttttt
+    22501 aacgccacca gatttgcatc tgtttatgct tggaacagga agagaatcag caactgtgtt
+    22561 gctgattatt ctgtcctata taattccgca tcattttcca cttttaagtg ttatggagtg
+    22621 tctcctacta aattaaatga tctctgcttt actaatgtct atgcagattc atttgtaatt
+    22681 agaggtgatg aagtcagaca aatcgctcca gggcaaactg gaaagattgc tgattataat
+    22741 tataaattac cagatgattt tacaggctgc gttatagctt ggaattctaa caatcttgat
+    22801 tctaaggttg gtggtaatta taattacctg tatagattgt ttaggaagtc taatctcaaa
+    22861 ccttttgaga gagatatttc aactgaaatc tatcaggccg gtagcacacc ttgtaatggt
+    22921 gttgaaggtt ttaattgtta ctttccttta caatcatatg gtttccaacc cactaatggt
+    22981 gttggttacc aaccatacag agtagtagta ctttcttttg aacttctaca tgcaccagca
+    23041 actgtttgtg gacctaaaaa gtctactaat ttggttaaaa acaaatgtgt caatttcaac
+    23101 ttcaatggtt taacaggcac aggtgttctt actgagtcta acaaaaagtt tctgcctttc
+    23161 caacaatttg gcagagacat tgctgacact actgatgctg tccgtgatcc acagacactt
+    23221 gagattcttg acattacacc atgttctttt ggtggtgtca gtgttataac accaggaaca
+    23281 aatacttcta accaggttgc tgttctttat caggatgtta actgcacaga agtccctgtt
+    23341 gctattcatg cagatcaact tactcctact tggcgtgttt attctacagg ttctaatgtt
+    23401 tttcaaacac gtgcaggctg tttaataggg gctgaacatg tcaacaactc atatgagtgt
+    23461 gacataccca ttggtgcagg tatatgcgct agttatcaga ctcagactaa ttctcctcgg
+    23521 cgggcacgta gtgtagctag tcaatccatc attgcctaca ctatgtcact tggtgcagaa
+    23581 aattcagttg cttactctaa taactctatt gccataccca caaattttac tattagtgtt
+    23641 accacagaaa ttctaccagt gtctatgacc aagacatcag tagattgtac aatgtacatt
+    23701 tgtggtgatt caactgaatg cagcaatctt ttgttgcaat atggcagttt ttgtacacaa
+    23761 ttaaaccgtg ctttaactgg aatagctgtt gaacaagaca aaaacaccca agaagttttt
+    23821 gcacaagtca aacaaattta caaaacacca ccaattaaag attttggtgg ttttaatttt
+    23881 tcacaaatat taccagatcc atcaaaacca agcaagaggt catttattga agatctactt
+    23941 ttcaacaaag tgacacttgc agatgctggc ttcatcaaac aatatggtga ttgccttggt
+    24001 gatattgctg ctagagacct catttgtgca caaaagttta acggccttac tgttttgcca
+    24061 cctttgctca cagatgaaat gattgctcaa tacacttctg cactgttagc gggtacaatc
+    24121 acttctggtt ggacctttgg tgcaggtgct gcattacaaa taccatttgc tatgcaaatg
+    24181 gcttataggt ttaatggtat tggagttaca cagaatgttc tctatgagaa ccaaaaattg
+    24241 attgccaacc aatttaatag tgctattggc aaaattcaag actcactttc ttccacagca
+    24301 agtgcacttg gaaaacttca agatgtggtc aaccaaaatg cacaagcttt aaacacgctt
+    24361 gttaaacaac ttagctccaa ttttggtgca atttcaagtg ttttaaatga tatcctttca
+    24421 cgtcttgaca aagttgaggc tgaagtgcaa attgataggt tgatcacagg cagacttcaa
+    24481 agtttgcaga catatgtgac tcaacaatta attagagctg cagaaatcag agcttctgct
+    24541 aatcttgctg ctactaaaat gtcagagtgt gtacttggac aatcaaaaag agttgatttt
+    24601 tgtggaaagg gctatcatct tatgtccttc cctcagtcag cacctcatgg tgtagtcttc
+    24661 ttgcatgtga cttatgtccc tgcacaagaa aagaacttca caactgctcc tgccatttgt
+    24721 catgatggaa aagcacactt tcctcgtgaa ggtgtctttg tttcaaatgg cacacactgg
+    24781 tttgtaacac aaaggaattt ttatgaacca caaatcatta ctacagacaa cacatttgtg
+    24841 tctggtaact gtgatgttgt aataggaatt gtcaacaaca cagtttatga tcctttgcaa
+    24901 cctgaattag actcattcaa ggaggagtta gataaatatt ttaagaatca tacatcacca
+    24961 gatgttgatt taggtgacat ctctggcatt aatgcttcag ttgtaaacat tcaaaaagaa
+    25021 attgaccgcc tcaatgaggt tgccaagaat ttaaatgaat ctctcatcga tctccaagaa
+    25081 cttggaaagt atgagcagta tataaaatgg ccatggtaca tttggctagg ttttatagct
+    25141 ggcttgattg ccatagtaat ggtgacaatt atgctttgct gtatgaccag ttgctgtagt
+    25201 tgtctcaagg gctgttgttc ttgtggatcc tgctgcaaat ttgatgaaga cgactctgag
+    25261 ccagtgctca aaggagtcaa attacattac acataaacga acttatggat ttgtttatga
+    25321 gaatcttcac aattggaact gtaactttga agcaaggtga aatcaaggat gctactcctt
+    25381 cagattttgt tcgcgctact gcaacgatac cgatacaagc ctcactccct ttcggatggc
+    25441 ttattgttgg cgttgcactt cttgctgttt ttcagagcgc ttccaaaatc ataaccctca
+    25501 aaaagagatg gcaactagca ctctccaagg gtgttcactt tgtttgcaac ttgctgttgt
+    25561 tgtttgtaac agtttactca caccttttgc tcgttgctgc tggccttgaa gccccttttc
+    25621 tctatcttta tgctttagtc tacttcttgc agagtataaa ctttgtaaga ataataatga
+    25681 ggctttggct ttgctggaaa tgccgttcca aaaacccatt actttatgat gccaactatt
+    25741 ttctttgctg gcatactaat tgttacgact attgtatacc ttacaatagt gtaacttctt
+    25801 caattgtcat tacttcaggt gatggcacaa caagtcctat ttctgaacat gactaccaga
+    25861 ttggtggtta tactgaaaaa tgggaatctg gagtaaaaga ctgtgttgta ttacacagtt
+    25921 acttcacttc agactattac cagctgtact caactcaatt gagtacagac actggtgttg
+    25981 aacatgttac cttcttcatc tacaataaaa ttgttgatga gcctgaagaa catgtccaaa
+    26041 ttcacacaat cgacggttca tccggagttg ttaatccagt aatggaacca atttatgatg
+    26101 aaccgacgac gactactagc gtgcctttgt aagcacaagc tgatgagtac gaacttatgt
+    26161 actcattcgt ttcggaagag acaggtacgt taatagttaa tagcgtactt ctttttcttg
+    26221 ctttcgtggt attcttgcta gttacactag ccatccttac tgcgcttcga ttgtgtgcgt
+    26281 actgctgcaa tattgttaac gtgagtcttg taaaaccttc tttttacgtt tactctcgtg
+    26341 ttaaaaatct gaattcttct agagttcctg atcttctggt ctaaacgaac taaatattat
+    26401 attagttttt ctgtttggaa ctttaatttt agccatggca gattccaacg gtactattac
+    26461 cgttgaagag cttaaaaagc tccttgaaca atggaaccta gtaataggtt tcctattcct
+    26521 tacatggatt tgtcttctac aatttgccta tgccaacagg aataggtttt tgtatataat
+    26581 taagttaatt ttcctctggc tgttatggcc agtaacttta gcttgttttg tgcttgctgc
+    26641 tgtttacaga ataaattgga tcaccggtgg aattgctatc gcaatggctt gtcttgtagg
+    26701 cttgatgtgg ctcagctact tcattgcttc tttcagactg tttgcgcgta cgcgttccat
+    26761 gtggtcattc aatccagaaa ctaacattct tctcaacgtg ccactccatg gcactattct
+    26821 gaccagaccg cttctagaaa gtgaactcgt aatcggagct gtgatccttc gtggacatct
+    26881 tcgtattgct ggacaccatc taggacgctg tgacatcaag gacctgccta aagaaatcac
+    26941 tgttgctaca tcacgaacgc tttcttatta caaattggga gcttcgcagc gtgtagcagg
+    27001 tgactcaggt tttgctgcat acagtcgcta caggattggc aactataaat taaacacaga
+    27061 ccattccagt agcagtgaca atattgcttt gcttgtacag taagtgacaa cagatgtttc
+    27121 atctcgttga ctttcaggtt actatagcag agatattact aattattatg aggactttta
+    27181 aagtttccat ttggaatctt gattacatca taaacctcat aattaaaaat ttatctaagt
+    27241 cactaactga gaataaatat tctcaattag atgaagagca accaatggag attgattaaa
+    27301 cgaacatgaa aattattctt ttcttggcac tgataacact cgctacttgt gagctttatc
+    27361 actaccaaga gtgtgttaga ggtacaacag tacttttaaa agaaccttgc tcttctggaa
+    27421 catacgaggg caattcacca tttcatcctc tagctgataa caaatttgca ctgacttgct
+    27481 ttagcactca atttgctttt gcttgtcctg acggcgtaaa acacgtctat cagttacgtg
+    27541 ccagatcagt ttcacctaaa ctgttcatca gacaagagga agttcaagaa ctttactctc
+    27601 caatttttct tattgttgcg gcaatagtgt ttataacact ttgcttcaca ctcaaaagaa
+    27661 agacagaatg attgaacttt cattaattga cttctatttg tgctttttag cctttctgct
+    27721 attccttgtt ttaattatgc ttattatctt ttggttctca cttgaactgc aagatcataa
+    27781 tgaaacttgt cacgcctaaa cgaacatgaa atttcttgtt ttcttaggaa tcatcacaac
+    27841 tgtagctgca tttcaccaag aatgtagttt acagtcatgt actcaacatc aaccatatgt
+    27901 agttgatgac ccgtgtccta ttcacttcta ttctaaatgg tatattagag taggagctag
+    27961 aaaatcagca cctttaattg aattgtgcgt ggatgaggct ggttctaaat cacccattca
+    28021 gtacatcgat atcggtaatt atacagtttc ctgtttacct tttacaatta attgccagga
+    28081 acctaaattg ggtagtcttg tagtgcgttg ttcgttctat gaagactttt tagagtatca
+    28141 tgacgttcgt gttgttttag atttcatcta aacgaacaaa ctaaaatgtc tgataatgga
+    28201 ccccaaaatc agcgaaatgc accccgcatt acgtttggtg gaccctcaga ttcaactggc
+    28261 agtaaccaga atggagaacg cagtggggcg cgatcaaaac aacgtcggcc ccaaggttta
+    28321 cccaataata ctgcgtcttg gttcaccgct ctcactcaac atggcaagga agaccttaaa
+    28381 ttccctcgag gacaaggcgt tccaattaac accaatagca gtccagatga ccaaattggc
+    28441 tactaccgaa gagctaccag acgaattcgt ggtggtgacg gtaaaatgaa agatctcagt
+    28501 ccaagatggt atttctacta cctaggaact gggccagaag ctggacttcc ctatggtgct
+    28561 aacaaagacg gcatcatatg ggttgcaact gagggagcct tgaatacacc aaaagatcac
+    28621 attggcaccc gcaatcctgc taacaatgct gcaatcgtgc tacaacttcc tcaaggaaca
+    28681 acattgccaa aaggcttcta cgcagaaggg agcagaggcg gcagtcaagc ctcttctcgt
+    28741 tcctcatcac gtagtcgcaa cagttcaaga aattcaactc caggcagcag taggggaact
+    28801 tctcctgcta gaatggctgg caatggcggt gatgctgctc ttgctttgct gctgcttgac
+    28861 agattgaacc agcttgagag caaaatgtct ggtaaaggcc aacaacaaca aggccaaact
+    28921 gtcactaaga aatctgctgc tgaggcttct aagaagcctc ggcaaaaacg tactgccact
+    28981 aaagcataca atgtaacaca agctttcggc agacgtggtc cagaacaaac ccaaggaaat
+    29041 tttggggacc aggaactaat cagacaagga actgattaca aacattggcc gcaaattgca
+    29101 caatttgccc ccagcgcttc agcgttcttc ggaatgtcgc gcattggcat ggaagtcaca
+    29161 ccttcgggaa cgtggttgac ctacacaggt gccatcaaat tggatgacaa agatccaaat
+    29221 ttcaaagatc aagtcatttt gctgaataag catattgacg catacaaaac attcccacca
+    29281 acagagccta aaaaggacaa aaagaagaag gctgatgaaa ctcaagcctt accgcagaga
+    29341 cagaagaaac agcaaactgt gactcttctt cctgctgcag atttggatga tttctccaaa
+    29401 caattgcaac aatccatgag cagtgctgac tcaactcagg cctaaactca tgcagaccac
+    29461 acaaggcaga tgggctatat aaacgttttc gcttttccgt ttacgatata tagtctactc
+    29521 ttgtgcagaa tgaattctcg taactacata gcacaagtag atgtagttaa ctttaatctc
+    29581 acatagcaat ctttaatcag tgtgtaacat tagggaggac ttgaaagagc caccacattt
+    29641 tcaccgaggc cacgcggagt acgatcgagt gtacagtgaa caatgctagg gagagctgcc
+    29701 tatatggaag agccctaatg tgtaaaatta attttagtag tgctatcccc atgtgatttt
+    29761 aatagcttct tagg
+//
+


=====================================
pangoLEARN/training/__init__.py
=====================================


=====================================
pangoLEARN/training/downsample.py
=====================================
@@ -0,0 +1,203 @@
+#!/usr/bin/env python3
+
+import csv
+import datetime
+import operator
+# import argparse
+from Bio import SeqIO
+
+
+# def parse_args():
+#     parser = argparse.ArgumentParser(description="""Pick a representative sample for each unique sequence""",
+#                                     formatter_class=argparse.RawTextHelpFormatter)
+#     parser.add_argument('--in-metadata', dest = 'in_metadata', required=True, help='CSV containing sequence_name and nucleotide_variants columns, the latter being a |-separated list of variants')
+#     parser.add_argument('--in-fasta', dest = 'in_fasta', required=True, help='FASTA of all input seqs')
+#     parser.add_argument('--diff', dest = 'diff', required=True, type=int, help='Samples within distance DIFF of included others may be excluded by the downsampler')
+#     parser.add_argument('--out-metadata', dest = 'out_metadata', required=True, help='CSV to write out')
+#     parser.add_argument('--out-fasta', dest = 'out_fasta', required=True, help='FASTA to write downsampled seqs')
+#     parser.add_argument('--outgroups', dest = 'outgroups', required=False, help='Lineage splits file containing representative outgroups to protect')
+#     parser.add_argument('--downsample_date_excluded', action='store_true', help='Downsample from those excluded as outside date window')
+#     parser.add_argument('--downsample_included', action='store_true', help='Downsample from all included sequences')
+#     parser.add_argument('--downsample_lineage_size', type=int, default=None, help='Min size of lineages to downsample, if unspecified no lineage-aware downsampling')
+
+#     args = parser.parse_args()
+#     return args
+
+def parse_outgroups(outgroup_file):
+    """
+    Input is a CSV file; the last column of each line holds the representative outgroup.
+    """
+    outgroups = []
+    if not outgroup_file:
+        return outgroups
+    with open(outgroup_file, "r") as outgroup_handle:
+        line = outgroup_handle.readline()
+        while line:
+            try:
+                outgroup = line.strip().split(",")[-1]
+                outgroups.append(outgroup)
+            except Exception:
+                pass  # skip a malformed line but still advance to the next one
+            line = outgroup_handle.readline()
+    return outgroups
+
+def get_count_dict(in_metadata):
+    count_dict = {}
+    num_samples = 0
+    with open(in_metadata,"r") as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            num_samples += 1
+            for var in row["nucleotide_variants"].split("|"):
+                if var in count_dict:
+                    count_dict[var] += 1
+                else:
+                    count_dict[var] = 1
+    print("Found", len(count_dict), "variants")
+
+    sorted_tuples = sorted(count_dict.items(), key=operator.itemgetter(1))
+    count_dict = {k: v for k, v in sorted_tuples}
+    return count_dict, num_samples
+
+def get_lineage_dict(in_metadata, min_size):
+    lineage_dict = {}
+    if min_size is None:
+        return lineage_dict
+
+    with open(in_metadata,"r") as f:
+        reader = csv.DictReader(f)
+        for row in reader:
+            if "lineage" in row:
+                lin = row["lineage"]
+                if lin in lineage_dict:
+                    lineage_dict[lin].append(row["sequence_name"])
+                else:
+                    lineage_dict[lin] = [row["sequence_name"]]
+    print("Found", len(lineage_dict), "lineages")
+
+    small_lineages = [lin for lin in lineage_dict if len(lineage_dict[lin]) < min_size]
+    for lin in small_lineages:
+        del lineage_dict[lin]
+    print("Found", len(lineage_dict), "lineages with at least", min_size, "representative sequences")
+
+    return lineage_dict
+
+def get_by_frequency(count_dict, num_samples, band=[0.1,1.0]):
+    lower_bound = num_samples*band[0]
+    upper_bound = num_samples*band[1]
+    most_frequent = [k for k in count_dict if lower_bound < count_dict[k] <= upper_bound]
+    print(len(most_frequent), "lie in frequency band", band)
+    return most_frequent
+
+def num_unique(muts1, muts2):
+    u1 = [m for m in muts1 if m not in muts2]
+    u2 = [m for m in muts2 if m not in muts1]
+    return len(u1+u2)
+
+def should_downsample_row(row, downsample_date_excluded=True, downsample_included=False, downsample_lineage_size=None, lineage_dict={}):
+    if downsample_included and row["why_excluded"] in [None, "None", ""]:
+        return True
+    if downsample_date_excluded and row["why_excluded"] in [None, "None", ""] and "date_filter" in row and row["date_filter"].startswith("sample_date older than"):
+        return True
+    if downsample_lineage_size and row["lineage"] in lineage_dict:
+        return True
+    return False
+
+def downsample(in_metadata, out_metadata, in_fasta, out_fasta, max_diff, outgroup_file, downsample_date_excluded, downsample_included, downsample_lineage_size):
+    original_num_seqs = 0
+    sample_dict = {}
+    var_dict = {}
+
+    count_dict, num_samples = get_count_dict(in_metadata)
+    most_frequent = get_by_frequency(count_dict, num_samples, band=[0.05,1.0])
+    very_most_frequent = get_by_frequency(count_dict, num_samples, band=[0.5,1.0])
+
+    lineage_dict = get_lineage_dict(in_metadata,downsample_lineage_size)
+
+    outgroups = parse_outgroups(outgroup_file)
+    indexed_fasta = SeqIO.index(in_fasta, "fasta")
+
+    with open(in_metadata, 'r', newline = '') as csv_in, \
+        open(out_fasta, 'w', newline = '') as fa_out, \
+        open(out_metadata, 'w', newline = '') as csv_out:
+
+        reader = csv.DictReader(csv_in, delimiter=",", quotechar='\"', dialect = "unix")
+        writer = csv.DictWriter(csv_out, fieldnames = reader.fieldnames, delimiter=",", quotechar='\"', quoting=csv.QUOTE_MINIMAL, dialect = "unix")
+        writer.writeheader()
+
+        for row in reader:
+            fasta_header = row["sequence_name"]
+            if fasta_header not in indexed_fasta:
+                continue
+            if original_num_seqs % 1000 == 0:
+                now = datetime.datetime.now()
+                print("%s Downsampled from %i seqs to %i seqs" %(str(now), original_num_seqs, len(sample_dict)))
+            original_num_seqs += 1
+
+            if fasta_header in outgroups or not should_downsample_row(row,downsample_date_excluded, downsample_included,
+                                                                      downsample_lineage_size,lineage_dict):
+                if fasta_header in outgroups:
+                    row["why_excluded"]=""
+                writer.writerow(row)
+                if row["why_excluded"] in [None, "None", ""] and fasta_header in indexed_fasta:
+                    seq_rec = indexed_fasta[fasta_header]
+                    fa_out.write(">" + seq_rec.id + "\n")
+                    fa_out.write(str(seq_rec.seq) + "\n")
+                else:
+                    print(row["why_excluded"], fasta_header, (fasta_header in indexed_fasta))
+                continue
+
+            muts = row["nucleotide_variants"].split("|")
+            if len(muts) < max_diff:
+                #if not row["why_excluded"]:
+                #    row["why_excluded"] = "downsampled with diff threshold %i" %max_diff
+                writer.writerow(row)
+                continue
+
+            found_close_seq = False
+
+            samples = set()
+            low_frequency_muts = [mut for mut in muts if mut not in most_frequent]
+            if len(low_frequency_muts) == 0:
+                low_frequency_muts = [mut for mut in muts if mut not in very_most_frequent]
+                if len(low_frequency_muts) == 0:
+                    low_frequency_muts = muts
+            if len(low_frequency_muts) > max_diff + 1:
+                low_frequency_muts = low_frequency_muts[:max_diff+1]
+            for mut in low_frequency_muts:
+                if mut in var_dict:
+                    samples.update(var_dict[mut])
+            if downsample_lineage_size:
+                samples = list( samples & set(lineage_dict[row["lineage"]]) )
+
+            for sample in samples:
+                if num_unique(muts, sample_dict[sample]) <= max_diff:
+                    found_close_seq = True
+                    #if not row["why_excluded"]:
+                    #    row["why_excluded"] = "downsampled with diff threshold %i" %max_diff
+                    writer.writerow(row)
+                    break
+            if not found_close_seq:
+                sample_dict[fasta_header] = muts
+                for mut in muts:
+                    if mut not in var_dict:
+                        var_dict[mut] = [fasta_header]
+                    else:
+                        var_dict[mut].append(fasta_header)
+                row["why_excluded"] = ""
+                writer.writerow(row)
+                if fasta_header in indexed_fasta:
+                    seq_rec = indexed_fasta[fasta_header]
+                    fa_out.write(">" + seq_rec.id + "\n")
+                    fa_out.write(str(seq_rec.seq) + "\n")
+
+    now = datetime.datetime.now()
+    print("%s Downsampled from %i seqs to %i seqs" %(str(now), original_num_seqs, len(sample_dict)))
+    # return sample_dict.keys()
+
+# def main():
+#     args = parse_args()
+#     subsample = downsample(args.in_metadata, args.out_metadata, args.in_fasta, args.out_fasta, args.diff, args.outgroups, args.downsample_date_excluded, args.downsample_included, args.downsample_lineage_size)
+
+# if __name__ == '__main__':
+#     main()
\ No newline at end of file
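
Note on downsample.py: the core of downsample() is a greedy pass that keeps a
sequence as a representative only when no previously kept sequence is within
max_diff mutations of it, using a mutation-to-samples index (var_dict) to limit
the comparisons. The sketch below illustrates that idea in isolation. It is a
simplification under stated assumptions: num_unique() is defined outside this
excerpt, so mutation_distance() here is a hypothetical stand-in (size of the
symmetric difference), and the outgroup, lineage-size, frequency-band and
short-mutation-list handling are left out.

```python
def mutation_distance(muts_a, muts_b):
    """Assumed distance: mutations present in exactly one of the two lists."""
    return len(set(muts_a) ^ set(muts_b))


def greedy_downsample(records, max_diff):
    """records: iterable of (name, [mutations]); returns the names kept."""
    kept = {}          # name -> mutation list of each kept representative
    by_mutation = {}   # mutation -> names of kept representatives carrying it
    for name, muts in records:
        # Only compare against representatives sharing at least one mutation.
        candidates = set()
        for mut in muts:
            candidates.update(by_mutation.get(mut, ()))
        if any(mutation_distance(muts, kept[c]) <= max_diff for c in candidates):
            continue  # a close representative already exists, so skip this one
        kept[name] = muts
        for mut in muts:
            by_mutation.setdefault(mut, []).append(name)
    return list(kept)


records = [
    ("s1", ["A1T", "C2G", "G3A"]),
    ("s2", ["A1T", "C2G", "G3A", "T4C"]),  # one mutation away from s1
    ("s3", ["G9T", "C8A", "T7G"]),         # shares nothing with s1
]
kept_names = greedy_downsample(records, max_diff=1)
```

The result depends on input order, as it does in the original loop: whichever
member of a cluster is read first becomes the representative.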


=====================================
pangoLEARN/training/getDecisionTreeRules.py
=====================================
@@ -0,0 +1,145 @@
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.tree import export_text
+import joblib
+import sys
+from sklearn.tree import _tree
+
+sys.setrecursionlimit(4000)
+
+loaded_model = joblib.load(sys.argv[1])
+headers = joblib.load(sys.argv[2])
+output_file = sys.argv[3]
+
+finalHeaders = []
+
+with open(output_file, 'r') as f:
+	for line in f:
+		if line[0] == "[" and len(line) > 1000:
+			print(line)
+			finalHeaders = line[1:-1].split(",")
+
+			for i in range(len(finalHeaders)):
+				finalHeaders[i] = finalHeaders[i].strip()
+				finalHeaders[i] = finalHeaders[i].replace("'", "")
+
+print(finalHeaders)
+
+categories = ['A', 'C', 'G', 'T', 'N', '-']
+
+classes = loaded_model.classes_
+
+
+def getClass(weightArray):
+	weightArray = weightArray[0]
+
+	maxValue = 0
+	maxIndex = 0
+
+	for i in range(len(weightArray)):
+		if weightArray[i] > maxValue:
+			maxValue = weightArray[i]
+			maxIndex = i
+
+	return classes[maxIndex]
+
+
+def reformatRule(sign, name, threshold):
+	position = name.split("_")[0]
+	letter = name.split("_")[1]
+
+	symbol = "=="
+	if sign == "<=":
+		symbol = "!="
+
+	return position + symbol + "'" + letter + "'"
+
+
+outcomesToRules = dict()
+
+
+def tree_to_code(tree, feature_names):
+    tree_ = tree.tree_
+    feature_name = [
+        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
+        for i in tree_.feature
+    ]
+
+    prevRules = []
+
+    def recurse(node, depth, prevRules):
+        indent = "  " * depth
+        if tree_.feature[node] != _tree.TREE_UNDEFINED:
+            name = feature_name[node]
+            threshold = tree_.threshold[node]
+
+            prevRulesRight = prevRules.copy()
+            prevRulesLeft = prevRules.copy()
+
+            prevRulesLeft.append(reformatRule("<=", name, threshold))
+            prevRulesRight.append(reformatRule(">", name, threshold))
+
+            recurse(tree_.children_left[node], depth + 1, prevRulesLeft)
+            recurse(tree_.children_right[node], depth + 1, prevRulesRight)
+        else:
+            rulesString = prevRules[0]
+            for rule in prevRules[1:]:
+            	rule = rule.strip()
+
+            	rulesString = rulesString + "," + rule
+
+            outcome = getClass(tree_.value[node])
+
+            if outcome not in outcomesToRules:
+            	outcomesToRules[outcome] = []
+
+            print(outcome + "\t" + rulesString)
+
+    recurse(0, 1, prevRules)
+
+tree_to_code(loaded_model, finalHeaders)
+
+
+def formatCommonRules(ruleSets):
+	splitRuleSets = []
+	totalRules = []
+	commonRules = []
+
+	for r in ruleSets:
+		splitRuleSets.append(r.split(","))
+
+		for i in r.split(","):
+			totalRules.append(i)
+
+	for i in totalRules:
+
+		addRule = True
+		for r in splitRuleSets:
+			if i not in r:
+				addRule = False
+
+		if addRule:
+			commonRules.append(i)
+
+			for r in splitRuleSets:
+				r.remove(i)
+
+	retString = ""
+
+	if len(commonRules) > 0:
+		retString = commonRules[0]
+
+		for c in commonRules[1:]:
+			retString = retString + "," + c
+
+		retString = retString + " AND "
+
+	for s in splitRuleSets:
+		if len(s) > 0:
+			retString = retString + "(" + s[0]
+			for r in s[1:]:
+				retString = retString + "," + r
+
+			retString = retString + ") OR "
+
+
+	return retString[:-3]
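
Note on getDecisionTreeRules.py: tree_to_code() walks the fitted tree
recursively, which is why the script raises the recursion limit. The same
root-to-leaf rule extraction can be done with an explicit stack. The sketch
below is an illustration, not the upstream method: it works on plain lists
shaped like scikit-learn's tree_.children_left, tree_.children_right and
tree_.feature arrays, and it assumes one-hot feature names of the form
"<position>_<base>", where the left branch means the base is absent and the
right branch means it is present. The function name leaf_rules is hypothetical.

```python
TREE_LEAF = -1  # value scikit-learn stores in children_left/right for a leaf


def leaf_rules(children_left, children_right, feature, feature_names):
    """Return [(leaf_node_id, "rule,rule,...")] for every leaf of the tree."""
    rules = []
    stack = [(0, [])]  # (node id, rules collected on the path from the root)
    while stack:
        node, path = stack.pop()
        if children_left[node] == TREE_LEAF:
            rules.append((node, ",".join(path)))
            continue
        position, letter = feature_names[feature[node]].split("_")
        # Push right first so the left subtree is visited first.
        stack.append((children_right[node], path + [f"{position}=='{letter}'"]))
        stack.append((children_left[node], path + [f"{position}!='{letter}'"]))
    return rules


# Toy tree: node 0 splits on "241_T"; its right child (node 2) splits on "3037_C".
rules = leaf_rules(
    children_left=[1, -1, 3, -1, -1],
    children_right=[2, -1, 4, -1, -1],
    feature=[0, -2, 1, -2, -2],
    feature_names=["241_T", "3037_C"],
)
```

With a real model, the three arrays would come from the fitted classifier's
tree_ attribute, and the leaf id can be mapped to a class through tree_.value
as getClass() does.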


=====================================
pangoLEARN/training/outgroups.csv
=====================================
@@ -0,0 +1,11 @@
+lineage,outgroup
+A,Wuhan/WH04/2020
+B,Wuhan/WHU01/2020
+B.1,Italy/ABR-IZSGC-TE5166/2020
+B.1.1,Germany/BY-MVP-V2010837/2020
+B.1.177,Spain/VC-IBV-98006461/2020
+B.1.1.7,England/MILK-9E05B3/2020
+B.1.160,Belgium/rega-10021225/2020
+B.1.2,USA/TX-HMH-MCoV-16306/2020
+B.1.258,England/LEED-2A8A64/2020
+B.1.243,USA/TX-HMH-MCoV-18678/2020
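
Note on outgroups.csv: downsample() calls parse_outgroups() on this file and
then tests "fasta_header in outgroups", but the parser itself is not part of
this excerpt. A minimal sketch of how such a two-column file could be read is
shown below. read_outgroups and protected_names are hypothetical names, and
returning a dict plus a set of names is an assumption, not the upstream return
type.

```python
import csv
import io


def read_outgroups(handle):
    """Return {lineage: outgroup_name} from a CSV with 'lineage,outgroup' columns."""
    return {row["lineage"]: row["outgroup"] for row in csv.DictReader(handle)}


# Two rows copied from the file above, read from memory instead of disk.
sample = io.StringIO("lineage,outgroup\nA,Wuhan/WH04/2020\nB,Wuhan/WHU01/2020\n")
outgroups = read_outgroups(sample)

# Set of sequence names that must never be downsampled away.
protected_names = set(outgroups.values())
```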


=====================================
pangoLEARN/training/pangoLEARNDecisionTree_v1.py
=====================================
@@ -0,0 +1,356 @@
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn import metrics
+from sklearn.datasets import make_classification
+from sklearn.model_selection import StratifiedShuffleSplit
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix, make_scorer
+from datetime import datetime
+import joblib
+import sys
+import os
+from sklearn.model_selection import cross_val_score
+from Bio import SeqIO
+import pickle
+
+# file with lineage assignments
+lineage_file = sys.argv[1]
+# file with sequences
+sequence_file = sys.argv[2]
+# fraction (not percent) of the data held out for testing instead of training; effectively none
+testing_percentage = 0.0000000001
+
+relevant_positions = pickle.load(open(sys.argv[5], 'rb'))
+relevant_positions.add(0)
+
+# the path to the reference file. 
+# This reference sequence must be the same as is used in the pangolearn script!!
+referenceFile = sys.argv[3]
+
+# data storage
+dataList = []
+# dict for lookup efficiency
+indiciesToKeep = dict()
+
+referenceId = "Wuhan/WH04/2020"
+referenceSeq = ""
+
+idToLineage = dict()
+idToSeq = dict()
+
+mustKeepIds = []
+mustKeepLineages = []
+
+
+# function for handling weird sequence characters
+def clean(x, loc):
+	x = x.upper()
+	
+	if x == 'T' or x == 'A' or x == 'G' or x == 'C' or x == '-':
+		return x
+
+	if x == 'U':
+		return 'T'
+
+	# otherwise return value from reference
+	return referenceSeq[loc]
+
+def findReferenceSeq():
+	with open(referenceFile) as f:
+		currentSeq = ""
+
+		for line in f:
+			if ">" not in line:
+				currentSeq = currentSeq + line.strip()
+
+	f.close()
+	return currentSeq
+
+
+def getDataLine(seqId, seq):
+	dataLine = []
+	dataLine.append(seqId)
+
+	newSeq = ""
+
+	# for each character in the sequence
+	for index in range(len(seq)):
+		newSeq = newSeq + clean(seq[index], index)
+
+	dataLine.append(newSeq)
+	
+	return dataLine
+
+
+def readInAndFormatData():
+
+	# add the data line for the reference seq
+	idToLineage[referenceId] = "A"
+	dataList.append(getDataLine(referenceId, referenceSeq))
+
+	# create a dictionary of sequence ids to their assigned lineages
+	with open(lineage_file, 'r') as f:
+		for line in f:
+			line = line.strip()
+
+			split = line.split(",")
+
+			idToLineage[split[0]] = split[1]
+
+	# close the file
+	f.close()
+
+	seq_dict = {rec.id : rec.seq for rec in SeqIO.parse(sequence_file, "fasta")}
+
+	print("files read in, now processing")
+
+	for key in seq_dict.keys():
+		if key in idToLineage:
+			dataList.append(getDataLine(key, seq_dict[key]))
+		else:
+			print("unable to find the lineage classification for: " + key)
+
+
+# find the columns that vary between sequences (those with SNPs), so the constant ones can be dropped
+def findColumnsWithoutSNPs():
+
+	# for each index in the length of each sequence
+	for index in range(len(dataList[0][1])):
+		keep = False
+
+		# loop through all lines
+		for line in dataList:
+
+			# if there is a difference somewhere, then we want to keep it
+			if dataList[0][1][index] != line[1][index] or index == 0:
+				keep = True
+				break
+
+		# keep the column only if it varies and is listed in relevant_positions
+		if keep and index in relevant_positions:
+			indiciesToKeep[index] = True
+
+
+# remove columns from the data list which don't have any SNPs. We do this because
+# these columns won't be relevant for a decision tree which is trying to use
+# differences between sequences to assign lineages
+def removeOtherIndices(indiciesToKeep):
+
+	# instantiate the final list
+	finalList = []
+
+	indicies = list(indiciesToKeep.keys())
+	indicies.sort()
+
+	# while the dataList isn't empty
+	while len(dataList) > 0:
+
+		# pop the first line
+		line = dataList.pop(0)
+		seqId = line.pop(0)
+
+		line = line[0]
+		# initialize the finalLine
+		finalLine = []
+
+		for index in indicies:
+			if index == 0:
+				# index 0 holds the sequence id (mapped to its lineage later in removeAmbiguous), so keep it
+				finalLine.append(seqId)
+			else:
+				# otherwise keep everything at the indices in indiciesToKeep
+				finalLine.append(line[index])
+
+		# save the finalLine to the finalList
+		finalList.append(finalLine)
+
+	# return
+	return finalList
+
+def allEqual(list):
+		entries = dict()
+
+		for i in list:
+			if i not in entries:
+				entries[i] = True
+
+		return len(entries) == 1
+
+def removeAmbiguous():
+	idsToRemove = set()
+	lineMap = dict()
+	idMap = dict()
+
+	for line in dataList:
+		keyString = ",".join(line[1:])
+
+		if keyString not in lineMap:
+			lineMap[keyString] = []
+			idMap[keyString] = []
+ 
+		if line[0] in idToLineage:
+			lineMap[keyString].append(idToLineage[line[0]])
+			idMap[keyString].append(line[0])
+		else:
+			print("diagnostics")
+			print(line[0])
+			print(keyString)
+			print(line)
+	for key in lineMap:
+		if not allEqual(lineMap[key]):
+
+			skipRest = False
+
+			# see if any protected lineages are contained in the set, if so keep those ids
+			for lineage in lineMap[key]:
+				if lineage in mustKeepLineages:
+					skipRest = True
+
+					for i in idMap[key]:
+						if lineage != idToLineage[i] and i not in mustKeepIds:
+							idsToRemove.add(i)
+
+			# none of the lineages are protected, fire at will
+			if not skipRest:
+
+				lineageToCounts = dict()
+
+				aLineage = False
+				# find most common lineage
+				for lineage in lineMap[key]:
+					if lineage not in lineageToCounts:
+						lineageToCounts[lineage] = 0
+
+					lineageToCounts[lineage] = lineageToCounts[lineage] + 1
+					aLineage = lineage
+
+				m = aLineage
+				for lineage in lineageToCounts:
+					if lineageToCounts[lineage] > lineageToCounts[m]:
+						m = lineage
+
+
+				for i in idMap[key]:
+					if m != idToLineage[i]:
+						idsToRemove.add(i)
+
+	newList = []
+
+	print("keeping indicies:")
+
+	for line in dataList:
+		if line[0] not in idsToRemove:
+			print(line[0])
+			line[0] = idToLineage[line[0]]
+			newList.append(line)
+
+	return newList
+
+
+print("reading in data " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+referenceSeq = findReferenceSeq()
+
+readInAndFormatData()
+
+print("processing snps, formatting data " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+findColumnsWithoutSNPs()
+
+dataList = removeOtherIndices(indiciesToKeep)
+
+print("# sequences before blacklisting")
+print(len(dataList))
+
+dataList = removeAmbiguous()
+
+print("# sequences after blacklisting")
+print(len(dataList))
+
+# headers are the original genome locations
+headers = list(indiciesToKeep.keys())
+headers[0] = "lineage"
+
+print("setting up training " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+pima = pd.DataFrame(dataList, columns=headers)
+
+# nucleotide symbols which can appear
+categories = ['A', 'C', 'G', 'T', '-']
+
+# one hot encoding of all headers other than the first which is the lineage
+dummyHeaders = headers[1:]
+
+# add extra rows to ensure all of the categories are represented, as otherwise 
+# not enough columns will be created when we call get_dummies
+for i in categories:
+	line = [i] * len(dataList[0])
+	pima.loc[len(pima)] = line
+
+# get one-hot encoding
+pima = pd.get_dummies(pima, columns=dummyHeaders)
+
+# get rid of the fake data we just added
+pima.drop(pima.tail(len(categories)).index, inplace=True)
+
+feature_cols = list(pima)
+print(feature_cols)
+
+# remove the first column (the lineage) from the feature list. This is because we are trying to predict these values.
+h = feature_cols.pop(0)
+X = pima[feature_cols]
+y = pima[h]
+
+# separate the data frame into testing/training data sets. testing_percentage is set near zero above, so almost all of the data goes to training.
+X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=testing_percentage,random_state=0)
+
+print("training " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+header_out = os.path.join(sys.argv[4],"decisionTreeHeaders_v1.joblib")
+joblib.dump(headers, header_out, compress=9)
+
+# instantiate a single decision tree with default parameters
+dt = DecisionTreeClassifier()
+
+# fit the model on the full data set (X, y), not just the X_train split
+dt.fit(X,y)
+
+print("testing " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+# classify the test data
+y_pred=dt.predict(X_test)
+
+print(y_pred)
+
+# get the scores from these predictions
+y_scores = dt.predict_proba(X_test)
+
+print("generating statistics " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S"), flush=True)
+
+#print the confusion matrix
+print("--------------------------------------------")
+print("Confusion Matrix")
+cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
+print(cnf_matrix)
+
+print("--------------------------------------------")
+print("Classification report")
+print(metrics.classification_report(y_test, y_pred, digits=3))
+
+# save the model files to compressed joblib files
+# using joblib instead of pickle because these large files need to be compressed
+model_out = os.path.join(sys.argv[4],"decisionTree_v1.joblib")
+joblib.dump(dt,  model_out, compress=9)
+
+print("model files created", flush=True)
+
+# this method is used below when running 10-fold cross validation. It ensures
+# that the per-lineage statistics are generated for each cross-fold
+def classification_report_with_accuracy_score(y_true, y_pred):
+	print("--------------------------------------------")
+	print("Crossfold Classification Report")
+	print(metrics.classification_report(y_true, y_pred, digits=3))
+	return accuracy_score(y_true, y_pred)
+
+# optionally, run 10-fold cross validation (comment this out if not needed as it takes a while to run)
+# cross_validation_scores = cross_val_score(dt, X=X, y=y, cv=10, scoring=make_scorer(classification_report_with_accuracy_score))
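
Note on pangoLEARNDecisionTree_v1.py: removeAmbiguous() groups sequences whose
retained columns are identical and, when their lineage labels disagree, keeps
only the sequences carrying the most common label. The sketch below shows that
majority-vote step on its own. It is a simplified illustration: the protected
lists (mustKeepIds, mustKeepLineages) are left out, ties are broken by
collections.Counter rather than by the original loop, and the function name
drop_minority_labels is hypothetical.

```python
from collections import Counter, defaultdict


def drop_minority_labels(rows, id_to_lineage):
    """rows: list of (seq_id, feature_string); returns the seq_ids to keep."""
    groups = defaultdict(list)  # identical feature string -> seq_ids
    for seq_id, features in rows:
        groups[features].append(seq_id)

    keep = []
    for ids in groups.values():
        counts = Counter(id_to_lineage[i] for i in ids)
        winner, _ = counts.most_common(1)[0]
        # Keep only the members whose label matches the majority label.
        keep.extend(i for i in ids if id_to_lineage[i] == winner)
    return keep


rows = [("a", "ACG"), ("b", "ACG"), ("c", "ACG"), ("d", "TTT")]
id_to_lineage = {"a": "B.1", "b": "B.1", "c": "A", "d": "A"}
kept_ids = drop_minority_labels(rows, id_to_lineage)
```

Here "c" is dropped because its feature string matches "a" and "b" but its
label is in the minority, while "d" survives because nothing contradicts it.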


=====================================
pangoLEARN/training/processOutputFile.py
=====================================
@@ -0,0 +1,57 @@
+import sys
+
+outputfile = sys.argv[1]
+
+nameToLineage = dict()
+
+class Lineage:
+	def __init__(self, name):
+		self.name = name
+		self.precisions = []
+		self.recalls = []
+		self.f1s = []
+		self.supports = []
+
+	def printStats(self):
+		avgPrec = sum(self.precisions) / len(self.precisions)
+		avgRec = sum(self.recalls) / len(self.recalls)
+		avgF1 = sum(self.f1s) / len(self.f1s) 
+		totalSupport = sum(self.supports)
+
+		print(self.name + "," + str(avgPrec) + "," + str(avgRec) + "," + str(avgF1) + "," + str(totalSupport))
+
+with open(outputfile, 'r') as f:
+	for line in f:
+		line = line.strip()
+		split = line.split()
+
+		if "macro avg" in line or "weighted avg" in line:
+			name = split[0] + " " + split[1]
+
+			if name not in nameToLineage:
+				nameToLineage[name] = Lineage(name)
+
+			nameToLineage[name].precisions.append(float(split[2]))
+			nameToLineage[name].recalls.append(float(split[3]))
+			nameToLineage[name].f1s.append(float(split[4]))
+			nameToLineage[name].supports.append(int(split[5]))
+
+		if len(split) == 5 and ":" not in line:
+			name = split[0]
+
+			if name not in nameToLineage:
+				nameToLineage[name] = Lineage(name)
+
+			nameToLineage[name].precisions.append(float(split[1]))
+			nameToLineage[name].recalls.append(float(split[2]))
+			nameToLineage[name].f1s.append(float(split[3]))
+			nameToLineage[name].supports.append(int(split[4]))
+
+print("lineage,precision,recall,f1_score,support")
+
+for key in nameToLineage:
+	if "macro" not in key and "weighted" not in key:
+		nameToLineage[key].printStats()
+
+nameToLineage["macro avg"].printStats()
+nameToLineage["weighted avg"].printStats()
\ No newline at end of file
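
Note on processOutputFile.py: the script scans the captured training log for
classification-report rows and averages precision, recall and F1 per lineage
across folds while summing support. The sketch below restates that parsing as
a function over a list of lines, so it can be tried without a log file. It is
an illustration under assumptions: the name average_report_rows and the tuple
it returns are hypothetical, and the row shapes (five tokens per lineage, six
for the "macro avg" and "weighted avg" rows) are taken from the checks in the
script above.

```python
from collections import defaultdict


def average_report_rows(lines):
    """Return {name: (mean_precision, mean_recall, mean_f1, total_support)}."""
    rows = defaultdict(list)  # name -> list of (precision, recall, f1, support)
    for line in lines:
        parts = line.split()
        if len(parts) == 6 and parts[1] == "avg":
            name, values = " ".join(parts[:2]), parts[2:]
        elif len(parts) == 5 and ":" not in line:
            name, values = parts[0], parts[1:]
        else:
            continue  # header, accuracy row, separators and other log output
        rows[name].append(
            (float(values[0]), float(values[1]), float(values[2]), int(values[3]))
        )

    summary = {}
    for name, entries in rows.items():
        n = len(entries)
        summary[name] = (
            sum(e[0] for e in entries) / n,
            sum(e[1] for e in entries) / n,
            sum(e[2] for e in entries) / n,
            sum(e[3] for e in entries),
        )
    return summary


log_lines = [
    "              precision    recall  f1-score   support",
    "         B.1      0.500     1.000     0.750        10",
    "    accuracy                          0.900        10",
    "   macro avg      0.500     1.000     0.750        10",
    "         B.1      1.000     0.500     0.750        30",
]
summary = average_report_rows(log_lines)
```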



View it on GitLab: https://salsa.debian.org/med-team/python-pangolearn/-/compare/3e8139c7d2e214dbe3c4d409c746f561d544be3b...c131967ec973a5a3c1ce5b67b3c2821e2b8f70fc

-- 
You're receiving this email because of your account on salsa.debian.org.

