[med-svn] [Git][med-team/kaptive][upstream] New upstream version 2.0.0

Andreas Tille (@tille) gitlab at salsa.debian.org
Fri Jan 21 07:08:51 GMT 2022



Andreas Tille pushed to branch upstream at Debian Med / kaptive


Commits:
0d2120ce by Andreas Tille at 2022-01-20T15:42:57+01:00
New upstream version 2.0.0
- - - - -


5 changed files:

- README.md
- kaptive.py
- reference_database/Klebsiella_k_locus_primary_reference.gbk
- reference_database/Klebsiella_o_locus_primary_reference.gbk
- + reference_database/Klebsiella_o_locus_primary_reference.logic


Changes:

=====================================
README.md
=====================================
@@ -1,396 +1,6 @@
-<p align="center"><img src="extras/kaptive_logo.png" alt="Kaptive" width="400"></p>
+<p align="center"><img src="https://github.com/katholt/Kaptive/blob/master/extras/kaptive_logo.png" alt="Kaptive" width="400"></p>
 
 
-Kaptive reports information about surface polysaccharide loci for _Klebsiella_ and _Acinetobacter baumannii_ genome assemblies. You can also run a graphical version of Kaptive via [this web interface](http://kaptive.holtlab.net/) ([source code](https://github.com/kelwyres/Kaptive-Web)).
+Kaptive reports information about surface polysaccharide loci for _Klebsiella pneumoniae_ species complex and _Acinetobacter baumannii_ genome assemblies. You can also run a graphical version of Kaptive via [this web interface](http://kaptive.holtlab.net/) ([source code](https://github.com/kelwyres/Kaptive-Web)).
 
-Given a novel genome and a database of known loci (K, O or OC), Kaptive will help a user to decide whether their sample has a known or novel locus. It carries out the following for each input assembly:
-* BLAST for all known locus nucleotide sequences (using `blastn`) to identify the best match ('best' defined as having the highest coverage).
-* Extract the region(s) of the assembly which correspond to the BLAST hits (i.e. the locus sequence in the assembly) and save it to a FASTA file.
-* BLAST for all known locus genes (using `tblastn`) to identify which expected genes (genes in the best matching locus) are present/missing and whether any unexpected genes (genes from other loci) are present.
-* Output a summary to a table file.
-
-In cases where your input assembly closely matches a known locus, Kaptive should make that obvious. When your assembly has a novel type, that too should be clear. However, Kaptive cannot reliably extract or annotate locus sequences for totally novel types – if it indicates a novel locus is present then extracting and annotating the sequence is up to you! Very poor assemblies can confound the results, so be sure to closely examine any case where the locus sequence in your assembly is broken into multiple pieces.
-If you think you have found a novel locus that should be added to one of the databases distributed with Kaptive please [contact us](mailto:kaptive.typing at gmail.com).
-
-For citation info and details about Kaptive, Kaptive Web and the locus databases, see [our papers](#citation) below.
-
-
-## Table of Contents
-
-* [Quick version (for the impatient)](#quick-version-for-the-impatient)
-* [Installation](#installation)
-* [Input files](#input-files)
-* [Standard output](#standard-output)
-    * [Basic](#basic)
-    * [Verbose](#verbose)
-* [Output files](#output-files)
-    * [Summary table](#summary-table)
-    * [JSON](#json)
-    * [Locus matching sequences](#locus-matching-sequences)
-* [Example results and interpretation](#example-results-and-interpretation)
-    * [Very close match](#very-close-match)
-    * [More distant match](#more-distant-match)
-    * [Broken assembly](#broken-assembly)
-    * [Poor match](#poor-match)
-* [Advanced options](#advanced-options)
-* [SLURM jobs](#slurm-jobs)
-* [Databases distributed with Kaptive](#databases-distributed-with-kaptive)
-    * [_Klebsiella_ K locus databases](#klebsiella-k-locus-databases)
-    * [_Klebsiella_ O locus database](#klebsiella-o-locus-database)
-    * [_Acinetobacter baumannii_ K and OC locus databases](#acinetobacter-baumanii-k-and-oc-locus-databases)
-* [FAQs](#faqs)
-* [Citation](#citation)
-* [License](#license)
-
-
-## Quick version (for the impatient)
-
-Kaptive needs the following input files to run (included in this repository):
-* A multi-record Genbank file with your known loci (nucleotide sequences for each whole locus and protein sequences for their genes)
-* One or more assemblies in FASTA format
-
-Example command:
-
-`kaptive.py -a path/to/assemblies/*.fasta -k database.gbk -o output_directory/prefix`
-
-For each input assembly file, Kaptive will identify the closest known locus type and report information about the corresponding locus genes.
-
-It generates the following output files:
-* A FASTA file for each input assembly with the nucleotide sequences matching the closest locus
-* A table summarising the results for all input assemblies
-
-Character codes in the output indicate problems with the locus match:
-* `?` = the match was not in a single piece, possibly due to a poor match or discontiguous assembly.
-* `-` = genes expected in the locus were not found.
-* `+` = extra genes were found in the locus.
-* `*` = one or more expected genes was found but with low identity.
-
-
-## Installation
-
-Kaptive should work on both Python 2 and 3, but we run/test it on Python 3 and recommend you do the same.
-
-
-#### Clone and run
-
-Kaptive is a single Python script, so you can simply clone (or download) from GitHub and run it. Kaptive depends on [Biopython](http://biopython.org/wiki/Main_Page), so make sure it's installed (either `pip3 install biopython` or read detailed instructions [here](http://biopython.org/DIST/docs/install/Installation.html)).
-
-```
-git clone https://github.com/katholt/Kaptive
-Kaptive/kaptive.py -h
-```
-
-#### Install with pip
-
-Alternatively, you can install Kaptive using [pip](https://pip.pypa.io/en/stable/). This will take care of the Biopython requirement (if necessary) and put the `kaptive.py` script in your PATH for easy access. Pip installing will _not_ provide the reference databases, so you'll need to download them separately from [here](https://github.com/katholt/Kaptive/tree/master/reference_database).
-
-```
-pip3 install --user git+https://github.com/katholt/Kaptive
-kaptive.py -h
-```
-
-#### Other dependencies
-
-Regardless of how you download/install Kaptive, it requires that [BLAST+](http://www.ncbi.nlm.nih.gov/books/NBK279690/) is available on the command line (specifically the commands `makeblastdb`, `blastn` and `tblastn`). BLAST+ can usually be easily installed using a package manager such as [Homebrew](http://brew.sh/) (on Mac) or [apt-get](https://help.ubuntu.com/community/AptGet/Howto) (on Ubuntu and related Linux distributions). Some later versions of BLAST+ have been associated with sporadic crashes when running tblastn with multiple threads; to avoid this problem we recommend running Kaptive with BLAST+ v 2.3.0 or using the "--threads 1" option (see below for full command argument details).
-
-
-## Input files
-
-#### Assemblies
-
-Using the `-a` (or `--assembly`) argument, you must provide one or more FASTA files to analyse. There are no particular requirements about the header formats in these inputs.
-
-#### Locus references
-
-Using the `-k` (or `--k_refs`) argument, you must provide a Genbank file containing one record for each known locus.
-
-This input Genbank has the following requirements:
-* The `source` feature must contain a `note` qualifier which begins with a label such as 'K locus:'. Whatever follows is used as the locus name. The label is automatically determined, and any consistent label ending in a colon will work. However, the user can specify exactly which label to use with `--locus_label`, if desired.
-* Any locus gene should be annotated as `CDS` features. All `CDS` features will be used and any other type of feature will be ignored.
-* If the gene has a name, it should be specified in a `gene` qualifier. This is not required, but if absent the gene will only be named using its numbered position in the locus.
-
-Example piece of input Genbank file:
-```
-source          1..23877
-                /organism="Klebsiella pneumoniae"
-                /mol_type="genomic DNA"
-                /note="K locus: K1"
-CDS             1..897
-                /gene="galF"
-```
-
-#### Allelic typing
-
-You can also supply Kaptive with a FASTA file of gene alleles using the `-g` (or `--allelic_typing`) 
-argument. For example, `wzi_wzc_db.fasta` (included with Kaptive) contains wzi and wzc alleles. This file must be formatted as an [SRST2](https://github.com/katholt/srst2) database with integers for allele names.
-
-If used, Kaptive will report the number of the best allele for each type gene. If there is no perfect match, Kaptive reports the best match and adds a `*` to the allele number.
-
-
-## Standard output
-
-#### Basic
-
-Kaptive will write a simple line to stdout for each assembly:
-* the assembly name
-* the best locus match
-* character codes for any match problems
-
-Example (no type genes supplied):
-```
-assembly_1: K2*
-assembly_2: K4
-assembly_3: KL17?-*
-```
-
-Example (with type genes):
-```
-assembly_1: K2*, wzc=2, wzi=2
-assembly_2: K4, wzc=1, wzi=127
-assembly_3: KL17?-*, wzc=18*, wzi=137*
-```
-
-#### Verbose
-
-If run with the `-v` or `--verbose` option, Kaptive will give detailed information about each assembly including:
-* Which locus reference best matched the assembly
-* Information about the nucleotide sequence match between the assembly and the best locus reference:
-  * % Coverage and % identity
-  * Length discrepancy (only available if assembled locus match is in one piece)
-  * Contig names and coordinates for matching sequences
-* Details about found genes:
-  * Whether they were expected or unexpected
-  * Whether they were found inside or outside the locus matching sequence
-  * % Coverage and % identity
-  * Contig names and coordinates for matching sequences
-* Best alleles for each type gene (if the user supplied a type gene database)
-
-
-## Output files
-
-#### Summary table
-
-Kaptive produces a single tab-delimited table summarising the results of all input assemblies. It has the following columns:
-* **Assembly**: the name of the input assembly, taken from the assembly filename.
-* **Best match locus**: the locus type which most closely matches the assembly, based on BLAST coverage.
-* **Match confidence**: a categorical measure of match quality:
-  * `Perfect` = the locus was found in a single piece with 100% coverage and 100% identity.
-  * `Very high` = the locus was found in a single piece with ≥99% coverage and ≥95% identity, with no missing genes and no extra genes.
-  * `High` = the locus was found in a single piece with ≥99% coverage, with ≤ 3 missing genes and no extra genes.
-  * `Good` = the locus was found in a single piece or with ≥95% coverage, with ≤ 3 missing genes and ≤ 1 extra genes.
-  * `Low` = the locus was found in a single piece or with ≥90% coverage, with ≤ 3 missing genes and ≤ 2 extra genes.
-  * `None` = did not qualify for any of the above.
-* **Problems**: characters indicating issues with the locus match. An absence of any such characters indicates a very good match.
-  * `?` = the match was not in a single piece, possibly due to a poor match or discontiguous assembly.
-  * `-` = genes expected in the locus were not found.
-  * `+` = extra genes were found in the locus.
-  * `*` = one or more expected genes was found but with low identity.
-* **Coverage**: the percent of the locus reference which BLAST found in the assembly.
-* **Identity**: the nucleotide identity of the BLAST hits between locus reference and assembly.
-* **Length discrepancy**: the difference in length between the locus match and the corresponding part of the assembly. Only available if the locus was found in a single piece (i.e. the `?` problem character is not used).
-* **Expected genes in locus**: a fraction indicating how many of the genes in the best matching locus were found in the locus part of the assembly.
-* **Expected genes in locus, details**: gene names and percent identity (from the BLAST hits) for the expected genes found in the locus part of the assembly.
-* **Missing expected genes**: a string listing the gene names of expected genes that were not found.
-* **Other genes in locus**: the number of unexpected genes (genes from loci other than the best match) which were found in the locus part of the assembly.
-* **Other genes in locus, details**: gene names and percent identity (from the BLAST hits) for the other genes found in the locus part of the assembly.
-* **Expected genes outside locus**: the number of expected genes which were found in the assembly but not in the locus part of the assembly (usually zero)
-* **Expected genes outside locus, details**: gene names and percent identity (from the BLAST hits) for the expected genes found outside the locus part of the assembly.
-* **Other genes outside locus**: the number of unexpected genes (genes from loci other than the best match) which were found outside the locus part of the assembly.
-* **Other genes outside locus, details**: gene names and percent identity (from the BLAST hits) for the other genes found outside the locus part of the assembly.
-* One column for each type gene (if the user supplied a type gene database)
-
-If the summary table already exists, Kaptive will append to it (not overwrite it). This allows you to run Kaptive in parallel on many assemblies, all outputting to the same table file.
-
-To disable the table file output, run Kaptive with `--no_table`.
-
-
-#### JSON
-
-Kaptive also outputs its results in a JSON file which contains all information from the above table, as well as more detail about BLAST results and reference sequences.
-
-To disable JSON output, run Kaptive with `--no_json`.
-
-
-#### Locus matching sequences
-
-For each input assembly, Kaptive produces a Genbank file of the region(s) of the assembly which correspond to the best locus match. This may be a single piece (in cases of a good assembly and a strong match) or it may be in multiple pieces (in cases of poor assembly and/or a novel locus). The file is named using the output prefix and the assembly name.
-
-To disable these output files, run Kaptive with `--no_seq_out`.
-
-
-## Example results and interpretation
-
-These examples show what Kaptive's results might look like in the output table. The gene details columns of the table have been excluded for brevity, as they can be quite long.
-
-#### Very close match
-
-Assembly | Best match locus | Problems | Coverage | Identity | Length discrepancy | Expected genes in locus | Expected genes in locus, details | Missing expected genes | Other genes in locus | Other genes in locus, details | Expected genes outside locus | Expected genes outside locus, details | Other genes outside locus | Other genes outside locus, details
-:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
-assembly_1 | K1 |  | 99.94% | 99.81% | -22 bp | 20 / 20 (100%) | ... |  | 0 |  | 0 |  | 2 | ...
-
-This is a case where our assembly very closely matches a known K locus type. There are no characters in the 'Problems' column, the coverage and identity are both high, the length discrepancy is low, and all expected genes were found with high identity. A couple of other low-identity K locus gene hits were found elsewhere in the assembly, but that's not abnormal and no cause for concern.
-
-Overall, this is a nice, solid match for K1.
-
-#### More distant match
-
-Assembly | Best match locus | Problems | Coverage | Identity | Length discrepancy | Expected genes in locus | Expected genes in locus, details | Missing expected genes | Other genes in locus | Other genes in locus, details | Expected genes outside locus | Expected genes outside locus, details | Other genes outside locus | Other genes outside locus, details
-:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
-assembly_2 | K1 | * | 99.84% | 95.32% | +97 bp | 20 / 20 (100%) | ... |  | 0 |  | 0 |  | 2 | ...
-
-This case shows an assembly that also matches the K1 locus sequence, but not as closely as our previous case. The `*` character indicates that one or more of the expected genes falls below the identity threshold (default 95%). The 'Expected genes in K locus, details' columns, excluded here for brevity, would show the identity for each gene.
-
-Our sample still almost certainly has a K locus type of K1, but it has diverged a bit more from our K1 reference, possibly due to mutation and/or recombination.
-
-#### Broken assembly
-
-Assembly | Best match locus | Problems | Coverage | Identity | Length discrepancy | Expected genes in locus | Expected genes in locus, details | Missing expected genes | Other genes in locus | Other genes in locus, details | Expected genes outside locus | Expected genes outside locus, details | Other genes outside locus | Other genes outside locus, details
-:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
-assembly_3 | K2 | ?- | 99.95% | 98.38% | n/a | 17 / 18 (94.4%) | ... | K2-CDS17-manB | 0 |  | 0 |  | 1 | ...
-
-Here is a case where our assembly matched a known K locus type well (high coverage and identity) but with a couple of problems. First, the `?` character indicates that the K locus sequence was not found in one piece in the assembly. Second, one of the expected genes (K2-CDS17-manB) was not found in the gene BLAST search.
-
-In cases like this, it is worth examining the case in more detail outside of Kaptive. For this example, such an examination revealed that the assembly was poor (broken into many small pieces) and the _manB_ gene happened to be split between two contigs. So the _manB_ gene isn't really missing, it's just broken in two. Our sample most likely is a very good match for K2, but the poor assembly quality made it difficult for Kaptive to determine that automatically.
-
-#### Poor match
-
-Assembly | Best match locus | Problems | Coverage | Identity | Length discrepancy | Expected genes in locus | Expected genes in locus, details | Missing expected genes | Other genes in locus | Other genes in locus, details | Expected genes outside locus | Expected genes outside locus, details | Other genes outside locus | Other genes outside locus, details
-:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
-assembly_4 | K3 | ?-* | 77.94% | 83.60% | n/a | 15 / 20 (75%) | ... | ... | 0 |  | 0 |  | 5 | ...
-
-In this case, Kaptive did not find a close match to any known K locus sequence. The best match was to K3, but BLAST only found alignments for 78% of the K3 sequence, and only at 84% nucleotide identity. Five of the twenty K3 genes were not found, and the 15 which were found had low identity. The assembly sequences matching K3 did not come in one piece (indicated by `?`), possibly due to assembly problems, but more likely due to the fact that our sample is not in fact K3 but rather has some novel K locus that was not in our reference inputs.
-
-A case such as this demands a closer examination outside of Kaptive. It is likely a novel K locus type, and you may wish to extract and annotate the K locus sequence from the assembly.
-
-
-## Advanced options
-
-Each of these options has a default and is not required on the command line, but can be adjusted if desired:
-
-* `--start_end_margin`: Kaptive tries to identify whether the start and end of a locus are present in an assembly and in the same contig. This option allows for a bit of wiggle room in this determination. For example, if this value is 10 (the default), a locus match that is missing the first 8 base pairs will still count as capturing the start of the locus. If set to zero, then the BLAST hit(s) must extend to the very start and end of the locus for Kaptive to consider the match complete.
-* `--min_gene_cov`: the minimum required percent coverage for the gene BLAST search. For example if this value is 90 (the default), then a gene BLAST hit which only covers 85% of the gene will be ignored. Using a lower value will allow smaller pieces of genes to be included in the results.
-* `--min_gene_id`: the minimum required percent identity for the gene BLAST search. For example if this value is 80 (the default), then a gene BLAST hit which has only 65% amino acid identity will be ignored. A lower value will allow for more distant gene hits to be included in the results (possibly resulting in more genes in the 'Other genes outside locus' category). A higher value will make Kaptive only accept very close gene hits (possibly resulting in low-identity locus genes not being found and included in 'Missing expected genes').
-* `--low_gene_id`: the percent identity threshold for what counts as a low identity match in the gene BLAST search. This only affects whether or not the `*` character is included in the 'Problems'. Default is 95.
-* `--min_assembly_piece`: the smallest piece of the assembly (measured in bases) that will be included in the output FASTA files. For example, if this value is 100 (the default), then a 50 bp match between the assembly and the best matching locus reference will be ignored.
-* `--gap_fill_size`: the size of assembly gaps to be filled in when producing the output FASTA files. For example, if this value is 100 (the default) and an assembly has two separate locus BLAST hits which are only 50 bp apart in a contig, they will be merged together into one sequence for the output FASTA. But if the two BLAST hits were 150 bp apart, they will be included in the output FASTA as two separate sequences. A lower value will possibly result in more fragmented output FASTA sequences. A higher value will possibly result in more sequences being included in the locus output.
-
-
-## SLURM jobs
-
-If you are running this script on a cluster using [SLURM](http://slurm.schedmd.com/), then you can make use of the extra script: `kaptive_slurm.py`. This will create one SLURM job for each assembly so the jobs can run in parallel. All simultaneous jobs can write to the same output table. It may be necessary to modify this script to suit the details of your cluster.
-
-
-
-## Databases distributed with Kaptive
-
-The Kaptive repository contains _Klebsiella_ K-locus and O-locus databases plus _A. baumannii_ K-locus and OC-locus databases in the [reference_database](https://github.com/katholt/Kaptive/tree/master/reference_database) directory, but you can run Kaptive with any appropriately formatted database of your own.
-
-The databases were developed and curated by [Kelly Wyres](https://holtlab.net/kelly-wyres/) (_Klebsiella_) and [Johanna Kenyon](https://research.qut.edu.au/infectionandimmunity/projects/bacterial-polysaccharide-research/) (_A. baumannii_).
-
-If you have a locus database that you would like to be added to Kaptive for use by yourself and others in the community, please get in touch via the [issues page](https://github.com/katholt/Kaptive/issues) or [email](mailto:kaptive.typing at gmail.com) . Similarly, if you have identified new locus variants not currently in the existing databases, let us know!
-
-#### _Klebsiella_ K locus databases
-
-The _Klebsiella_ K locus primary reference database (`Klebsiella_k_locus_primary_reference.gbk`) comprises full-length (_galF_ to _ugd_) annotated sequences for each distinct _Klebsiella_ K locus, where available:
-* KL1 - KL77 correspond to the loci associated with each of the 77 serologically defined K-type references.
-* KL101 and above are defined from DNA sequence data on the basis of gene content.
-
-Note that insertion sequences (IS) are excluded from this database since we assume that the ancestral sequence was likely IS-free and IS transposase genes are not specific to the K locus.
-Synthetic IS-free K locus sequences were generated for K loci for which no naturally occurring IS-free variants have been identified to date.
-
-The variants database (`Klebsiella_k_locus_variant_reference.gbk`) comprises full-length annotated sequences for variants of the distinct loci:
-* IS variants are named as KLN -1, -2 etc e.g. KL15-1 is an IS variant of KL15.
-* Deletion variants are named KLN-D1, -D2 etc e.g. KL15-D1 is a deletion variant of KL15.
-Note that KL156-D1 is included in the primary reference database since no full-length version of this locus has been identified to date. 
-
-We recommend screening your data with the primary reference database first to find the best-matching K locus type. If you have poor matches or are particularly interested in detecting variant loci you should try the variant database.  
-WARNING: If you use the variant database please inspect your results carefully and decide for yourself what constitutes a confident match! Kaptive is not optimised for accurate variant detection. 
-
-Database versions:
-* Kaptive releases v0.5.1 and below include the original _Klebsiella_ K locus databases, as described in [Wyres, K. et al. Microbial Genomics (2016).](http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000102)
-* Kaptive v0.6.0 and above include four novel primary _Klebsiella_ K locus references defined on the basis of gene content (KL162-KL165) in this [paper.](https://www.biorxiv.org/content/10.1101/557785v1)
-* Kaptive v0.7.1 and above contain updated versions of the KL53 and KL126 loci (see table below for details). The updated KL126 locus sequence will be described in McDougall, F. et al. 2020. _Klebsiella pneumoniae_ diversity and detection of _Klebsiella africana_ in Australian Fruit Bats (_Pteropus policephalus_). _In prep._
-* Kaptive v0.7.2 and above include a novel primary _Klebsiella_ K locus reference defined on the basis of gene content (KL166), which will be described in Li, M. et al. 2020. Characterization of clinically isolated hypermucoviscous _Klebsiella pneumoniae_ in Japan. _In prep._
-* Kaptive v0.7.3 and above include four novel primary _Klebsiella_ K locus references defined on the basis of gene content (KL167-KL170), which will be described in Gorrie, C. et al. 2020. Opportunity and diversity: A year of _Klebsiella pneumoniae_ infections in hospital. _In prep._
-
-
-Changes to the _Klebsiella_ K locus primary reference database:
-
-| Locus  | Change | Reason | Date of change | Kaptive version no. |
-| ------------- | ------------- | ------------- | ------------- | ------------- |
-| KL53  | Annotation update: _wcaJ_ changed to _wbaP_ | Error in original annotation | 21 July 2020 | v 0.7.1 | 
-| KL126  | Sequence update: new sequence from isolate FF923 includes _rmlBADC_ genes between _gnd_ and _ugd_ | Assembly scaffolding error in original sequence from isolate A-003-I-a-1 | 21 July 2020 | v 0.7.1 |
-
-#### _Klebsiella_ O locus database
-
-The _Klebsiella_ O locus database (`Klebsiella_o_locus_primary_reference.gbk`) contains annotated sequences for 12 distinct _Klebsiella_ O loci.
-
-O locus classification requires some special logic, as the O1 and O2 serotypes contain the same locus genes. It is two additional genes elsewhere in the chromosome (_wbbY_ and _wbbZ_) which results in the O1 antigen. Kaptive therefore looks for these genes to properly call an assembly as either O1 or O2. When only one of the two additional genes can be found, the result is ambiguous and Kaptive will report a locus type of O1/O2.
-
-Read more about the O locus and its classification here: [The diversity of _Klebsiella_ pneumoniae surface polysaccharides](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5320592/).
-
-Database versions:
-* Kaptive v0.4.0 and above include the original version of the _Klebsiella_ O locus database, as described in [Wick, R. et al. J Clin Microbiol (2018).](http://jcm.asm.org/content/56/6/e00197-18) 
-
-
-#### _Acinetobacter baumannii_ K and OC locus databases
-
-The _A. baumannii_ K (capsule) locus reference database (`Acinetobacter_baumannii_k_locus_primary_reference.gbk`) contains annotated sequences for 92 distinct K loci.
-The _A. baumannii_ OC (lipooligosaccharide outer core) locus reference database (`Acinetobacter_baumannii_OC_locus_primary_reference.gbk`) contains annotated sequences for 12 distinct OC loci.
-
-WARNING: These databases have been developed and tested specifically for _A. baumannii_ and may not be suitable for screening other _Acinetobacter_ species. You can check that your assembly is a true _A. baumannii_ by screening for the _oxaAB_ gene e.g. using blastn.
-
- Database versions:
-* Kaptive v0.7.0 and above include the original _A. baumannii_ K and OC locus databases, as described in [Wyres, KL. et al. Microbial Genomics, 2020.](https://doi.org/10.1099/mgen.0.000339)
-
-
-
-## FAQs
-
-#### Why are there locus genes found outside the locus?
-
-For _Klebsiella_ K loci in particular, a number of the K-locus genes are orthologous to genes outside of the K-locus region of the genome. E.g. the _Klebsiella_ K-locus <i>man</i> and <i>rml</i> genes have orthologues in the LPS (lipopolysaccharide) locus; so it is not unusual to find a small number of genes "outside" the locus.
-However, if you have a large number of genes (>5) outside the locus it may mean that there is a problem with the locus match, or that your assembly is very fragmented or contaminated (contains more than one sample).
-
-#### How can my sample be missing locus genes when it has a full-length, high identity locus match?
-
-Kaptive uses 'tblastn' to screen for the presence of each locus gene with a coverage threshold of 90%. A single non-sense mutation or small indel in the centre of a gene will interrupt the 'tblastn' match and cause it to fall below the 90% threshold. However, such a small change has only a minor effect on the nucleotide 'blast' match across the full locus.
-
-#### Why does the _Klebsiella_ K-locus region of my sample contain a <i>ugd</i> gene matching another locus?
-
-A small number of the original _Klebsiella_ K locus references are truncated, containing only a partial <i>ugd</i> sequence. The reference annotations for these loci do not include <i>ugd</i>, so are not identified by the 'tblastn' search. Instead <b>Kaptive</b> reports the closest match to the partial sequence (if it exceeds the 90% coverage threshold). 
-
-#### Why has the best matching locus changed after I reran my analysis with an updated version of the database? ####
-
-The databases are updated as novel loci are discovered and curated. If your previous match had a confidence call of 'Low' or 'None' but your new match has higher confidence, this indicates that your genome contains a locus that was absent in the older version of the database! So nothing to worry about here.
-
-But what if your old match and your new match have 'Good' or better confidence levels?
-
-If your old match had 'Perfect' or 'Very High' confidence, please post an issue to the issues page, as this may indicate a problem with the new database!
-
-If your old match had 'Good' or 'High' confidence please read on...
-
-Polysaccharide loci are subject to frequent recombinations and rearrangements, which generates new variants. As a result, a small number of pairs of loci share large regions of homology e.g. the _Klebsiella_ K-locus KL170 is very similar to KL101, and in fact seems to be a hybrid of KL101 plus a small region from KL106. 
-Kaptive can accurately distinguish the KL101 and KL170 loci when it is working with high quality genome assemblies, but this task is much trickier if the assembly is fragmented. This means that matches to KL101 that were reported using an early version of the K-locus database might be reported as KL170 when using a later version of the database.
-However, this should only occur in instances where the K-locus is fragmented in the genome assembly and in that case Kaptive will have indicated 'problems' with the matches (e.g. '?' indicating fragmented assembly or '-' indicating that an expected gene is missing), and the corresponding confidence level will be at the lower end of the scale (i.e. 'Good' or 'High', but not 'Very High' or 'Perfect').
-You may want to try to figure out the correct locus manually, e.g. using [Bandage](https://rrwick.github.io/Bandage/) to BLAST the corresponding loci in your genome assembly graph. 
-
-
-## Citation
-
-If you use Kaptive and/or the _Klebsiella_ K locus database in your research, please cite this paper:
-[Identification of _Klebsiella_ capsule synthesis loci from whole genome data. Microbial Genomics (2016).](http://mgen.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000102)
-
-If you use [Kaptive Web](http://kaptive.holtlab.net/) and/or the _Klebsiella_ O locus database in your research, please cite this paper:
-[Kaptive Web: user-friendly capsule and lipopolysaccharide serotype prediction for _Klebsiella_ genomes. Journal of Clinical Microbiology (2018).](http://jcm.asm.org/content/56/6/e00197-18)
-
-If you use the _A. baumannii_ K or OC locus database(s) in your research please cite this paper:
-[Identification of _Acinetobacter baumannii_ loci for capsular polysaccharide (KL) and lipooligosaccharide outer core (OCL) synthesis in genome assemblies using curated reference databases compatible with Kaptive. Microbial Genomics (2020).](https://doi.org/10.1099/mgen.0.000339)  
-Lists of papers describing each of the individual _A. baumannii_ reference loci can be found [here](https://github.com/katholt/Kaptive/tree/master/extras).
-
-
-## License
-
-GNU General Public License, version 3
-
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.495263.svg)](https://doi.org/10.5281/zenodo.495263)
+**For information on how to install, run, interpret and cite Kaptive please visit the [Wiki](https://github.com/katholt/Kaptive/wiki).**
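
Reviewer note on the kaptive.py changes that follow: the new `load_special_logic` and `apply_special_logic` functions read a tab-separated `.logic` file with three columns (locus, extra loci, type). They use it to turn a best-match locus plus the set of fully found 'Extra genes' loci into a type. The sketch below is a minimal standalone illustration of that lookup, not the upstream implementation. The function names `parse_logic` and `resolve_type` and the rows in `EXAMPLE_LOGIC` are hypothetical and are not taken from the shipped `Klebsiella_o_locus_primary_reference.logic` file. Unlike the upstream code, which asserts on malformed lines, this sketch skips blank lines and raises `ValueError` on anything else that does not have three fields.

```python
# Hypothetical rows; the real .logic file contents are not reproduced here.
EXAMPLE_LOGIC = (
    "locus\textra_loci\ttype\n"
    "LocusA\tnone\tTypeA1\n"
    "LocusA\textraX\tTypeA2\n"
    "LocusB\textraX,extraY\tTypeB1\n"
)


def parse_logic(lines):
    """Parse (locus, extra_loci, type) rows, skipping the header and blank lines."""
    rules = []
    for line in lines:
        if not line.strip():
            continue
        parts = line.strip().split('\t')
        if len(parts) != 3:
            raise ValueError('expected 3 tab-separated fields: ' + repr(line))
        locus, extra_loci, new_type = parts
        if locus == 'locus':  # header line
            continue
        if extra_loci.lower() == 'none':
            extra = []
        else:
            extra = sorted(extra_loci.split(','))
        rules.append((locus, extra, new_type))
    return rules


def resolve_type(locus_name, found_extra_loci, rules):
    """Return the type for this locus and exact set of found extra loci."""
    found = sorted(found_extra_loci)
    matches = [t for name, extra, t in rules
               if name == locus_name and extra == found]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        # Mirrors the 'unknown (<locus name>)' fallback set in main().
        return 'unknown (' + locus_name + ')'
    raise ValueError('redundancy in special logic')


rules = parse_logic(EXAMPLE_LOGIC.splitlines())
print(resolve_type('LocusA', [], rules))
print(resolve_type('LocusB', ['extraY', 'extraX'], rules))
print(resolve_type('LocusA', ['extraY'], rules))
```

As in the diff, the match is exact: the sorted list of found extra loci must equal the sorted list in the rule, so a genome carrying an additional extra locus that no rule lists falls through to 'unknown'.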


=====================================
kaptive.py
=====================================
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Copyright 2018 Ryan Wick (rrwick at gmail.com)
+Copyright 2021 Ryan Wick (rrwick at gmail.com)
 https://github.com/katholt/Kaptive
 
 Kaptive is a tool which reports information about the K and O types for Klebsiella genome
@@ -37,8 +37,6 @@ details. You should have received a copy of the GNU General Public License along
 not, see <http://www.gnu.org/licenses/>.
 """
 
-from __future__ import print_function
-from __future__ import division
 import argparse
 import sys
 import os
@@ -52,7 +50,7 @@ import random
 from collections import OrderedDict
 from Bio import SeqIO
 
-__version__ = '0.7.3'
+__version__ = '2.0.0'
 
 
 def main():
@@ -68,13 +66,16 @@ def main():
     output_json = not args.no_json
 
     temp_dir = make_temp_dir(args)
-    k_ref_seqs, gene_seqs, k_ref_genes = parse_genbank(args.k_refs, temp_dir, args.locus_label)
+    ref_seqs, gene_seqs, ref_genes, ref_types = \
+        parse_genbank(args.k_refs, temp_dir, args.locus_label, args.type_label)
+    special_logic = load_special_logic(args.k_refs, ref_types)
+
     all_gene_dict = {}
-    for gene_list in k_ref_genes.values():
+    for gene_list in ref_genes.values():
         for gene in gene_list:
             all_gene_dict[gene.full_name] = gene
 
-    k_refs = load_k_locus_references(k_ref_seqs, k_ref_genes)
+    refs = load_locus_references(ref_seqs, ref_genes, ref_types)
     type_gene_names = get_type_gene_names(args.allelic_typing)
 
     if output_table:
@@ -83,26 +84,28 @@ def main():
 
     for fasta_file in args.assembly:
         assembly = Assembly(fasta_file)
-        best_k = get_best_k_type_match(assembly, k_ref_seqs, k_refs, args.threads)
-        if best_k is None:
+        best = get_best_locus_match(assembly, ref_seqs, refs, args.threads)
+        if best is None:
             type_gene_results = {}
-            best_k = KLocus('None', '', [])
+            best = Locus('None', '', '', [])
         else:
-            find_assembly_pieces(assembly, best_k, args)
-            assembly_pieces_fasta = save_assembly_pieces_to_file(best_k, assembly, args.out)
+            find_assembly_pieces(assembly, best, args)
+            assembly_pieces_fasta = save_assembly_pieces_to_file(best, assembly, args.out)
             type_gene_results = type_gene_search(assembly_pieces_fasta, type_gene_names, args)
             if args.no_seq_out and assembly_pieces_fasta is not None:
                 os.remove(assembly_pieces_fasta)
-            protein_blast(assembly, best_k, gene_seqs, args)
-            check_name_for_o1_o2(best_k)
+            protein_blast(assembly, best, gene_seqs, args)
+            apply_special_logic(best, special_logic, ref_genes)
+            if best.type == 'unknown':
+                best.type = 'unknown (' + best.name + ')'
 
-        output(args.out, assembly, best_k, args, type_gene_names, type_gene_results,
+        output(args.out, assembly, best, args, type_gene_names, type_gene_results,
                json_list, output_table, output_json, all_gene_dict)
 
     if output_json:
         write_json_file(args.out, json_list)
 
-    clean_up(k_ref_seqs, gene_seqs, temp_dir)
+    clean_up(ref_seqs, gene_seqs, temp_dir)
     sys.exit(0)
 
 
@@ -155,7 +158,12 @@ def add_arguments_to_parser(parser):
                         default='automatically determined',
                         help='In the Genbank file, the source feature must have a note '
                              'identifying the locus name, starting with this label followed by '
-                             'a colon (e.g. /note="K locus: K1")')
+                             'a colon (e.g. /note="K locus: KL1")')
+    parser.add_argument('--type_label', type=str, required=False,
+                        default='automatically determined',
+                        help='In the Genbank file, the source feature must have a note '
+                             'identifying the type name, starting with this label followed by '
+                             'a colon (e.g. /note="K type: K1")')
 
 
 def check_for_blast():
@@ -199,12 +207,12 @@ def make_temp_dir(args):
     return temp_dir
 
 
-def clean_up(k_ref_seqs, gene_seqs, temp_dir):
+def clean_up(ref_seqs, gene_seqs, temp_dir):
     """
     Deletes the temporary FASTA files. If the temp directory is then empty, it is deleted too.
     """
     try:
-        os.remove(k_ref_seqs)
+        os.remove(ref_seqs)
     except OSError:
         pass
     try:
@@ -218,57 +226,75 @@ def clean_up(k_ref_seqs, gene_seqs, temp_dir):
         pass
 
 
-def parse_genbank(genbank, temp_dir, locus_label):
+def parse_genbank(genbank, temp_dir, locus_label, type_label):
     """
     This function reads the input Genbank file and produces two temporary FASTA files: one with the
     loci nucleotide sequences and one with the gene sequences.
     It returns the file paths for these two FASTA files along with a dictionary that links genes to
     loci.
     """
-    k_ref_genes = {}
-    k_ref_seqs_filename = os.path.join(temp_dir, 'temp_k_ref_seqs.fasta')
+    ref_genes, ref_types = {}, {}
+    ref_seqs_filename = os.path.join(temp_dir, 'temp_ref_seqs.fasta')
     gene_seqs_filename = os.path.join(temp_dir, 'temp_gene_seqs.fasta')
-    k_ref_seqs = open(k_ref_seqs_filename, 'wt')
+    ref_seqs = open(ref_seqs_filename, 'wt')
     gene_seqs = open(gene_seqs_filename, 'wt')
+
     if locus_label == 'automatically determined':
-        locus_label = find_locus_label(genbank)
+        locus_label = find_label(genbank, 'locus')
+    else:
+        check_label(genbank, locus_label)
+    if type_label == 'automatically determined':
+        type_label = find_label(genbank, 'type', required=False)
     else:
-        check_locus_label(genbank, locus_label)
+        check_label(genbank, type_label)
+
     for record in SeqIO.parse(genbank, 'genbank'):
-        k_locus_name = ''
+        locus_name, type_name = '', ''
         for feature in record.features:
             if feature.type == 'source' and 'note' in feature.qualifiers:
                 for note in feature.qualifiers['note']:
                     if note.startswith(locus_label):
-                        k_locus_name = get_locus_name_from_note(note, locus_label)
+                        locus_name = get_name_from_note(note, locus_label)
                     elif note.startswith('Extra genes'):
-                        k_locus_name = note.replace(':', '').replace(' ', '_')
-        if k_locus_name in k_ref_genes:
-            quit_with_error('Duplicate reference locus name: ' + k_locus_name)
-        k_ref_genes[k_locus_name] = []
+                        locus_name = note.replace(':', '').replace(' ', '_')
+                    elif type_label is not None and note.startswith(type_label):
+                        type_name = get_name_from_note(note, type_label)
+        if locus_name in ref_genes:
+            quit_with_error('Duplicate reference locus name: ' + locus_name)
+        ref_genes[locus_name] = []
 
         # Extra genes are only used for the gene search, not the nucleotide search.
-        if not k_locus_name.startswith('Extra_genes'):
-            k_ref_seqs.write('>' + k_locus_name + '\n')
-            k_ref_seqs.write(add_line_breaks_to_sequence(str(record.seq), 60))
+        if not locus_name.startswith('Extra_genes'):
+            ref_seqs.write('>' + locus_name + '\n')
+            ref_seqs.write(add_line_breaks_to_sequence(str(record.seq), 60))
+            ref_types[locus_name] = type_name
 
         gene_num = 1
         for feature in record.features:
             if feature.type == 'CDS':
-                gene = Gene(k_locus_name, gene_num, feature, record.seq)
-                k_ref_genes[k_locus_name].append(gene)
+                gene = Gene(locus_name, gene_num, feature, record.seq)
+                ref_genes[locus_name].append(gene)
                 gene_num += 1
                 gene_seqs.write(gene.get_fasta())
-    k_ref_seqs.close()
+    ref_seqs.close()
     gene_seqs.close()
-    return k_ref_seqs_filename, gene_seqs_filename, k_ref_genes
+    return ref_seqs_filename, gene_seqs_filename, ref_genes, ref_types
+
+
+def rreplace(s, old, new):
+    """
+    https://stackoverflow.com/questions/2556108
+    """
+    li = s.rsplit(old, 1)
+    return new.join(li)
 
 
-def find_locus_label(genbank):
+def find_label(genbank, text, required=True):
     """
-    Automatically finds the label for the locus sequences. The Genbank file must have exactly one
-    possible label that is present in a note qualifier in the source feature for every record. If
-    not, Kaptive will quit with an error.
+    Automatically finds the label in the Genbank file which contains the specified text. For
+    example, if the text is 'locus', then the Genbank file must have exactly one possible label
+    containing 'locus' that is present in a note qualifier in the source feature for every record.
+    If not, Kaptive will quit with an error.
     """
     possible_locus_labels = set()
     for record in SeqIO.parse(genbank, 'genbank'):
@@ -276,11 +302,14 @@ def find_locus_label(genbank):
             if feature.type == 'source' and 'note' in feature.qualifiers:
                 for note in feature.qualifiers['note']:
                     if ':' in note:
-                        possible_locus_labels.add(note.split(':')[0].strip())
-                    if '=' in note:
-                        possible_locus_labels.add(note.split('=')[0].strip())
+                        note = note.split(':')[0].strip()
+                        if text in note:
+                            possible_locus_labels.add(note)
     if not possible_locus_labels:
-        quit_with_error('None of the records contain a valid locus label')
+        if required:
+            quit_with_error('None of the records contain a valid ' + text + ' label')
+        else:
+            return None
     available_locus_labels = possible_locus_labels.copy()
     for record in SeqIO.parse(genbank, 'genbank'):
         locus_labels = set()
@@ -289,29 +318,27 @@ def find_locus_label(genbank):
                 for note in feature.qualifiers['note']:
                     if ':' in note:
                         locus_labels.add(note.split(':')[0].strip())
-                    if '=' in note:
-                        locus_labels.add(note.split('=')[0].strip())
         if any(x == 'Extra genes' for x in locus_labels):
             continue
         if not locus_labels:
-            quit_with_error('no possible locus labels were found for ' + record.name)
+            quit_with_error('no possible ' + text + ' labels were found for ' + record.name)
         previous_labels = available_locus_labels.copy()
         available_locus_labels = available_locus_labels.intersection(locus_labels)
         if not available_locus_labels:
-            error_message = record.name + ' does not have a locus label matching the previous ' \
-                            'records\n'
+            error_message = record.name + ' does not have a ' + text + ' label matching the ' \
+                            'previous records\n'
             error_message += 'Previous record labels: ' + ', '.join(list(previous_labels)) + '\n'
             error_message += 'Labels in ' + record.name + ': ' + ', '.join(list(locus_labels))
             quit_with_error(error_message)
     if len(available_locus_labels) > 1:
-        error_message = 'multiple possible locus labels were found: ' + \
+        error_message = 'multiple possible ' + text + ' labels were found: ' + \
                         ', '.join(list(available_locus_labels)) + '\n'
-        error_message += 'Please use the --locus_label option to specify which to use'
+        error_message += 'Please use the --' + text + '_label option to specify which to use'
         quit_with_error(error_message)
     return list(available_locus_labels)[0]
 
 
-def check_locus_label(genbank, locus_label):
+def check_label(genbank, label):
     """
     Makes sure that every record in the Genbank file contains a note in the source feature
     beginning with the given label.
@@ -321,18 +348,18 @@ def check_locus_label(genbank, locus_label):
         for feature in record.features:
             if feature.type == 'source' and 'note' in feature.qualifiers:
                 for note in feature.qualifiers['note']:
-                    if note.startswith(locus_label):
-                        k_locus_name = get_locus_name_from_note(note, locus_label)
-                        if k_locus_name:
+                    if note.startswith(label):
+                        locus_name = get_name_from_note(note, label)
+                        if locus_name:
                             found_label = True
         if not found_label:
-            error_message = record.name + ' is missing a locus label\n'
+            error_message = record.name + ' is missing a label\n'
             error_message += 'The source feature must have a note qualifier beginning with "' + \
-                             locus_label + ':" followed by the locus name'
+                             label + ':" followed by the relevant info'
             quit_with_error(error_message)
 
 
-def get_locus_name_from_note(full_note, locus_label):
+def get_name_from_note(full_note, locus_label):
     """
     Extracts the part of the note following the label (and any colons, spaces or equals signs).
     """
@@ -375,34 +402,34 @@ def quit_with_error(message):
     sys.exit(1)
 
 
-def get_best_k_type_match(assembly, k_refs_fasta, k_refs, threads):
+def get_best_locus_match(assembly, refs_fasta, refs, threads):
     """
     Searches for all known locus types in the given assembly and returns the best match.
     Best match is defined as the locus type for which the largest fraction of the locus has a BLAST
     hit to the assembly. In cases of a tie, the mean identity of the locus type BLAST hits are used
     to determine the best.
     """
-    for k_ref in k_refs.values():
-        k_ref.clear()
-    blast_hits = get_blast_hits(assembly.fasta, k_refs_fasta, threads)
+    for ref in refs.values():
+        ref.clear()
+    blast_hits = get_blast_hits(assembly.fasta, refs_fasta, threads)
 
     for hit in blast_hits:
-        if hit.qseqid not in k_refs:
+        if hit.qseqid not in refs:
             quit_with_error('BLAST hit (' + hit.qseqid + ') not found in locus references')
-        k_refs[hit.qseqid].add_blast_hit(hit)
-    best_k_ref = None
+        refs[hit.qseqid].add_blast_hit(hit)
+    best_ref = None
     best_cov = 0.0
-    for k_ref in k_refs.values():
-        cov = k_ref.get_coverage()
+    for ref in refs.values():
+        cov = ref.get_coverage()
         if cov > best_cov:
             best_cov = cov
-            best_k_ref = k_ref
-        elif cov == best_cov and best_k_ref and \
-                k_ref.get_mean_blast_hit_identity() > best_k_ref.get_mean_blast_hit_identity():
-            best_k_ref = k_ref
-    if best_k_ref is not None:
-        best_k_ref.clean_up_blast_hits()
-    return copy.copy(best_k_ref)
+            best_ref = ref
+        elif cov == best_cov and best_ref and \
+                ref.get_mean_blast_hit_identity() > best_ref.get_mean_blast_hit_identity():
+            best_ref = ref
+    if best_ref is not None:
+        best_ref.clean_up_blast_hits()
+    return copy.copy(best_ref)
 
 
 def type_gene_search(assembly_pieces_fasta, type_gene_names, args):
@@ -443,55 +470,55 @@ def type_gene_search(assembly_pieces_fasta, type_gene_names, args):
     return type_gene_results
 
 
-def find_assembly_pieces(assembly, k_locus, args):
+def find_assembly_pieces(assembly, locus, args):
     """
     This function uses the BLAST hits in the given locus type to find the corresponding pieces of
-    the given assembly. It saves its results in the KLocus object.
+    the given assembly. It saves its results in the Locus object.
     """
-    if not k_locus.blast_hits:
+    if not locus.blast_hits:
         return
-    assembly_pieces = [x.get_assembly_piece(assembly) for x in k_locus.blast_hits]
+    assembly_pieces = [x.get_assembly_piece(assembly) for x in locus.blast_hits]
     merged_pieces = merge_assembly_pieces(assembly_pieces)
     length_filtered_pieces = [x for x in merged_pieces if x.get_length() >= args.min_assembly_piece]
     if not length_filtered_pieces:
         return
-    k_locus.assembly_pieces = fill_assembly_piece_gaps(length_filtered_pieces, args.gap_fill_size)
+    locus.assembly_pieces = fill_assembly_piece_gaps(length_filtered_pieces, args.gap_fill_size)
 
     # Now check to see if the biggest assembly piece seems to capture the whole locus. If so, this
     # is an ideal match.
-    biggest_piece = sorted(k_locus.assembly_pieces, key=lambda z: z.get_length(), reverse=True)[0]
+    biggest_piece = sorted(locus.assembly_pieces, key=lambda z: z.get_length(), reverse=True)[0]
     start = biggest_piece.earliest_hit_coordinate()
     end = biggest_piece.latest_hit_coordinate()
-    if good_start_and_end(start, end, k_locus.get_length(), args.start_end_margin):
-        k_locus.assembly_pieces = [biggest_piece]
+    if good_start_and_end(start, end, locus.get_length(), args.start_end_margin):
+        locus.assembly_pieces = [biggest_piece]
 
     # If it isn't the ideal case, we still want to check if the start and end of the locus were
     # found in the same contig. If so, fill all gaps in between so we include the entire
     # intervening sequence.
     else:
-        earliest, latest, same_contig_and_strand = k_locus.get_earliest_and_latest_pieces()
-        k_start = earliest.earliest_hit_coordinate()
-        k_end = latest.latest_hit_coordinate()
-        if good_start_and_end(k_start, k_end, k_locus.get_length(), args.start_end_margin) and \
+        earliest, latest, same_contig_and_strand = locus.get_earliest_and_latest_pieces()
+        start = earliest.earliest_hit_coordinate()
+        end = latest.latest_hit_coordinate()
+        if good_start_and_end(start, end, locus.get_length(), args.start_end_margin) and \
            same_contig_and_strand:
             gap_filling_piece = AssemblyPiece(assembly, earliest.contig_name, earliest.start,
                                               latest.end, earliest.strand)
-            k_locus.assembly_pieces = merge_assembly_pieces(k_locus.assembly_pieces +
-                                                            [gap_filling_piece])
-    k_locus.identity = get_mean_identity(k_locus.assembly_pieces)
+            locus.assembly_pieces = merge_assembly_pieces(locus.assembly_pieces +
+                                                          [gap_filling_piece])
+    locus.identity = get_mean_identity(locus.assembly_pieces)
 
 
-def protein_blast(assembly, k_locus, gene_seqs, args):
+def protein_blast(assembly, locus, gene_seqs, args):
     """
-    Conducts a BLAST search of all known locus proteins. Stores the results in the KLocus
+    Conducts a BLAST search of all known locus proteins. Stores the results in the Locus
     object.
     """
     hits = get_blast_hits(assembly.fasta, gene_seqs, args.threads, genes=True)
     hits = [x for x in hits if x.query_cov >= args.min_gene_cov and x.pident >= args.min_gene_id]
 
     best_hits = []
-    for expected_gene in k_locus.gene_names:
-        best_hit = get_best_hit_for_query(hits, expected_gene, k_locus)
+    for expected_gene in locus.gene_names:
+        best_hit = get_best_hit_for_query(hits, expected_gene, locus)
         if best_hit is not None:
             best_hits.append(best_hit)
     best_hits = sorted(best_hits, key=lambda x: x.bitscore, reverse=True)
@@ -500,10 +527,10 @@ def protein_blast(assembly, k_locus, gene_seqs, args):
             hits = cull_conflicting_hits(best_hit, hits)
 
     expected_hits = []
-    for expected_gene in k_locus.gene_names:
-        best_hit = get_best_hit_for_query(hits, expected_gene, k_locus)
+    for expected_gene in locus.gene_names:
+        best_hit = get_best_hit_for_query(hits, expected_gene, locus)
         if not best_hit:
-            k_locus.missing_expected_genes.append(expected_gene)
+            locus.missing_expected_genes.append(expected_gene)
         else:
             best_hit.over_identity_threshold = best_hit.pident >= args.low_gene_id
             expected_hits.append(best_hit)
@@ -511,14 +538,14 @@ def protein_blast(assembly, k_locus, gene_seqs, args):
             hits = cull_conflicting_hits(best_hit, hits)
     other_hits = cull_all_conflicting_hits(hits)
 
-    k_locus.expected_hits_inside_locus = [x for x in expected_hits
-                                          if x.in_assembly_pieces(k_locus.assembly_pieces)]
-    k_locus.expected_hits_outside_locus = [x for x in expected_hits
-                                           if not x.in_assembly_pieces(k_locus.assembly_pieces)]
-    k_locus.other_hits_inside_locus = [x for x in other_hits
-                                       if x.in_assembly_pieces(k_locus.assembly_pieces)]
-    k_locus.other_hits_outside_locus = [x for x in other_hits
-                                        if not x.in_assembly_pieces(k_locus.assembly_pieces)]
+    locus.expected_hits_inside_locus = [x for x in expected_hits
+                                        if x.in_assembly_pieces(locus.assembly_pieces)]
+    locus.expected_hits_outside_locus = [x for x in expected_hits
+                                         if not x.in_assembly_pieces(locus.assembly_pieces)]
+    locus.other_hits_inside_locus = [x for x in other_hits
+                                     if x.in_assembly_pieces(locus.assembly_pieces)]
+    locus.other_hits_outside_locus = [x for x in other_hits
+                                      if not x.in_assembly_pieces(locus.assembly_pieces)]
 
 
 def create_table_file(output_prefix, type_gene_names):
@@ -538,6 +565,7 @@ def create_table_file(output_prefix, type_gene_names):
 
     headers = ['Assembly',
                'Best match locus',
+               'Best match type',
                'Match confidence',
                'Problems',
                'Coverage',
@@ -579,78 +607,121 @@ def get_type_gene_names(type_genes_fasta):
     return gene_names
 
 
-def check_name_for_o1_o2(k_locus):
+def load_special_logic(ref_filename, ref_types):
+    """
+    If any of the reference loci have a type of 'special logic', that implies that a corresponding
+    file exists to describe that logic. This function loads that special logic file if needed.
+    """
+    if not any(t == 'special logic' for t in ref_types.values()):
+        return []
+    special_logic = []
+    assert ref_filename.endswith('.gbk')
+    special_logic_filename = rreplace(ref_filename, '.gbk', '.logic')
+    check_file_exists(special_logic_filename)
+    with open(special_logic_filename, 'rt') as special_logic_file:
+        for line in special_logic_file:
+            parts = line.strip().split('\t')
+            assert len(parts) == 3
+            locus, extra_loci, new_type = parts
+            if locus == 'locus':  # header line
+                continue
+            if extra_loci.lower() == 'none':
+                extra_loci = []
+            else:
+                extra_loci = sorted(extra_loci.split(','))
+            special_logic.append((locus, extra_loci, new_type))
+    return special_logic
+
+
+def apply_special_logic(locus, special_logic, ref_genes):
     """
-    This function has special logic for dealing with the O1/O2 locus. If the wbbY and wbbZ genes
-    are both found, then we call the locus O2 (instead of O1/O2). If neither are found, then we
-    call the locus O1.
+    This function has special logic for dealing with the locus -> type situations that depend on
+    other genes in the genome.
     """
-    if not (k_locus.name == 'O1/O2v1' or k_locus.name == 'O1/O2v2'):
+    if not locus.type == 'special logic':
         return
-    other_gene_names = [x.qseqid for x in k_locus.other_hits_outside_locus]
-    both_present = ('Extra_genes_wbbY/wbbZ_01_wbbY' in other_gene_names and
-                    'Extra_genes_wbbY/wbbZ_02_wbbZ' in other_gene_names)
-    both_absent = ('Extra_genes_wbbY/wbbZ_01_wbbY' not in other_gene_names and
-                   'Extra_genes_wbbY/wbbZ_02_wbbZ' not in other_gene_names)
-    if both_present:
-        k_locus.name = k_locus.name.replace('O1/O2', 'O1')
-    elif both_absent:
-        k_locus.name = k_locus.name.replace('O1/O2', 'O2')
-
-
-def output(output_prefix, assembly, k_locus, args, type_gene_names, type_gene_results,
+
+    other_gene_names = [x.qseqid for x in locus.other_hits_outside_locus]
+    extra_gene_names = sorted(n for n in other_gene_names if n.startswith('Extra_genes_'))
+
+    # Look for any 'Extra genes' loci for which all of their genes have been found in this genome.
+    found_loci = []
+    for ref_locus, genes in ref_genes.items():
+        if ref_locus.startswith('Extra_genes_'):
+            short_locus_name = ref_locus.replace('Extra_genes_', '')
+            locus_gene_names = [g.full_name for g in genes]
+            if all(g in extra_gene_names for g in locus_gene_names):
+                found_loci.append(short_locus_name)
+        found_loci = sorted(found_loci)
+
+    # See if the combination of best-match-locus and extra-loci is represented in the special logic,
+    # and if so, change the type.
+    new_types = []
+    for locus_name, extra_loci, new_type in special_logic:
+        if locus.name == locus_name and found_loci == extra_loci:
+            new_types.append(new_type)
+    if len(new_types) == 0:
+        locus.type = 'unknown'
+    elif len(new_types) == 1:
+        locus.type = new_types[0]
+    else:  # multiple matches - shouldn't happen!
+        quit_with_error('redundancy in special logic file')
+
+
+def output(output_prefix, assembly, locus, args, type_gene_names, type_gene_results,
            json_list, output_table, output_json, all_gene_dict):
     """
     Writes a line to the output table describing all that we've learned about the given locus and
     writes to stdout as well.
     """
-    uncertainty_chars = k_locus.get_match_uncertainty_chars()
+    uncertainty_chars = locus.get_match_uncertainty_chars()
 
     try:
-        expected_in_locus_per = 100.0 * len(k_locus.expected_hits_inside_locus) / \
-            len(k_locus.gene_names)
-        expected_out_locus_per = 100.0 * len(k_locus.expected_hits_outside_locus) / \
-            len(k_locus.gene_names)
-        expected_genes_in_locus_str = str(len(k_locus.expected_hits_inside_locus)) + ' / ' + \
-            str(len(k_locus.gene_names)) + ' (' + float_to_str(expected_in_locus_per) + '%)'
-        expected_genes_out_locus_str = str(len(k_locus.expected_hits_outside_locus)) + ' / ' + \
-            str(len(k_locus.gene_names)) + ' (' + float_to_str(expected_out_locus_per) + '%)'
-        missing_per = 100.0 * len(k_locus.missing_expected_genes) / len(k_locus.gene_names)
-        missing_genes_str = str(len(k_locus.missing_expected_genes)) + ' / ' + \
-            str(len(k_locus.gene_names)) + ' (' + float_to_str(missing_per) + '%)'
+        expected_in_locus_per = 100.0 * len(locus.expected_hits_inside_locus) / \
+                                len(locus.gene_names)
+        expected_out_locus_per = 100.0 * len(locus.expected_hits_outside_locus) / \
+            len(locus.gene_names)
+        expected_genes_in_locus_str = str(len(locus.expected_hits_inside_locus)) + ' / ' + \
+            str(len(locus.gene_names)) + ' (' + float_to_str(expected_in_locus_per) + '%)'
+        expected_genes_out_locus_str = str(len(locus.expected_hits_outside_locus)) + ' / ' + \
+            str(len(locus.gene_names)) + ' (' + float_to_str(expected_out_locus_per) + '%)'
+        missing_per = 100.0 * len(locus.missing_expected_genes) / len(locus.gene_names)
+        missing_genes_str = str(len(locus.missing_expected_genes)) + ' / ' + \
+            str(len(locus.gene_names)) + ' (' + float_to_str(missing_per) + '%)'
     except ZeroDivisionError:
         expected_genes_in_locus_str, expected_genes_out_locus_str, missing_genes_str = '', '', ''
 
-    output_to_stdout(assembly, k_locus, args.verbose, type_gene_names, type_gene_results,
+    output_to_stdout(assembly, locus, args.verbose, type_gene_names, type_gene_results,
                      uncertainty_chars, expected_genes_in_locus_str, expected_genes_out_locus_str,
                      missing_genes_str)
     if output_table:
-        output_to_table(output_prefix, assembly, k_locus, type_gene_names, type_gene_results,
+        output_to_table(output_prefix, assembly, locus, type_gene_names, type_gene_results,
                         uncertainty_chars, expected_genes_in_locus_str,
                         expected_genes_out_locus_str)
     if output_json:
-        add_to_json(assembly, k_locus, type_gene_names, type_gene_results, json_list,
+        add_to_json(assembly, locus, type_gene_names, type_gene_results, json_list,
                     uncertainty_chars, all_gene_dict)
 
 
-def output_to_table(output_prefix, assembly, k_locus, type_gene_names, type_gene_results,
+def output_to_table(output_prefix, assembly, locus, type_gene_names, type_gene_results,
                     uncertainty_chars, expected_genes_in_locus_str, expected_genes_out_locus_str):
     line = [assembly.name,
-            k_locus.name,
-            k_locus.get_match_confidence(),
+            locus.name,
+            locus.type,
+            locus.get_match_confidence(),
             uncertainty_chars,
-            k_locus.get_coverage_string(),
-            k_locus.get_identity_string(),
-            k_locus.get_length_discrepancy_string(),
+            locus.get_coverage_string(),
+            locus.get_identity_string(),
+            locus.get_length_discrepancy_string(),
             expected_genes_in_locus_str,
-            get_gene_info_string(k_locus.expected_hits_inside_locus),
-            ';'.join(k_locus.missing_expected_genes),
-            str(len(k_locus.other_hits_inside_locus)),
-            get_gene_info_string(k_locus.other_hits_inside_locus),
+            get_gene_info_string(locus.expected_hits_inside_locus),
+            ';'.join(locus.missing_expected_genes),
+            str(len(locus.other_hits_inside_locus)),
+            get_gene_info_string(locus.other_hits_inside_locus),
             expected_genes_out_locus_str,
-            get_gene_info_string(k_locus.expected_hits_outside_locus),
-            str(len(k_locus.other_hits_outside_locus)),
-            get_gene_info_string(k_locus.other_hits_outside_locus)]
+            get_gene_info_string(locus.expected_hits_outside_locus),
+            str(len(locus.other_hits_outside_locus)),
+            get_gene_info_string(locus.other_hits_outside_locus)]
 
     for gene_name in type_gene_names:
         hit = type_gene_results[gene_name]
@@ -663,18 +734,19 @@ def output_to_table(output_prefix, assembly, k_locus, type_gene_names, type_gene
     table.close()
 
 
-def add_to_json(assembly, k_locus, type_gene_names, type_gene_results, json_list,
+def add_to_json(assembly, locus, type_gene_names, type_gene_results, json_list,
                 uncertainty_chars, all_gene_dict):
     json_record = OrderedDict()
     json_record['Assembly name'] = assembly.name
 
     match_dict = OrderedDict()
-    match_dict['Locus name'] = k_locus.name
-    match_dict['Match confidence'] = k_locus.get_match_confidence()
+    match_dict['Locus name'] = locus.name
+    match_dict['Type'] = locus.type
+    match_dict['Match confidence'] = locus.get_match_confidence()
 
     reference_dict = OrderedDict()
-    reference_dict['Length'] = len(k_locus.seq)
-    reference_dict['Sequence'] = k_locus.seq
+    reference_dict['Length'] = len(locus.seq)
+    reference_dict['Sequence'] = locus.seq
     match_dict['Reference'] = reference_dict
     json_record['Best match'] = match_dict
 
@@ -686,11 +758,11 @@ def add_to_json(assembly, k_locus, type_gene_names, type_gene_results, json_list
     json_record['Problems'] = problems
 
     blast_results = OrderedDict()
-    blast_results['Coverage'] = k_locus.get_coverage_string()
-    blast_results['Identity'] = k_locus.get_identity_string()
-    blast_results['Length discrepancy'] = k_locus.get_length_discrepancy_string()
+    blast_results['Coverage'] = locus.get_coverage_string()
+    blast_results['Identity'] = locus.get_identity_string()
+    blast_results['Length discrepancy'] = locus.get_length_discrepancy_string()
     assembly_pieces = []
-    for i, piece in enumerate(k_locus.assembly_pieces):
+    for i, piece in enumerate(locus.assembly_pieces):
         assembly_piece = OrderedDict()
         assembly_piece['Contig name'] = piece.contig_name
         assembly_piece['Contig start position'] = piece.start + 1
@@ -703,13 +775,13 @@ def add_to_json(assembly, k_locus, type_gene_names, type_gene_results, json_list
     blast_results['Locus assembly pieces'] = assembly_pieces
     json_record['blastn result'] = blast_results
 
-    expected_genes_in_locus = {x.qseqid: x for x in k_locus.expected_hits_inside_locus}
-    expected_hits_outside_locus = {x.qseqid: x for x in k_locus.expected_hits_outside_locus}
-    other_hits_inside_locus = {x.qseqid: x for x in k_locus.other_hits_inside_locus}
-    other_hits_outside_locus = {x.qseqid: x for x in k_locus.other_hits_outside_locus}
+    expected_genes_in_locus = {x.qseqid: x for x in locus.expected_hits_inside_locus}
+    expected_hits_outside_locus = {x.qseqid: x for x in locus.expected_hits_outside_locus}
+    other_hits_inside_locus = {x.qseqid: x for x in locus.other_hits_inside_locus}
+    other_hits_outside_locus = {x.qseqid: x for x in locus.other_hits_outside_locus}
 
-    k_locus_genes = []
-    for gene in k_locus.genes:
+    locus_genes = []
+    for gene in locus.genes:
         gene_dict = OrderedDict()
         gene_name = gene.full_name
         gene_dict['Name'] = gene_name
@@ -731,8 +803,8 @@ def add_to_json(assembly, k_locus, type_gene_names, type_gene_results, json_list
         else:
             gene_dict['Match confidence'] = 'Not found'
 
-        k_locus_genes.append(gene_dict)
-    json_record['Locus genes'] = k_locus_genes
+        locus_genes.append(gene_dict)
+    json_record['Locus genes'] = locus_genes
 
     extra_genes = OrderedDict()
     for gene_name, hit in other_hits_inside_locus.items():
@@ -803,7 +875,7 @@ def write_json_file(output_prefix, json_list):
             fcntl.flock(json_out, fcntl.LOCK_UN)
 
 
-def output_to_stdout(assembly, k_locus, verbose, type_gene_names, type_gene_results,
+def output_to_stdout(assembly, locus, verbose, type_gene_names, type_gene_results,
                      uncertainty_chars, expected_genes_in_locus_str, expected_genes_out_locus_str,
                      missing_genes_str):
     if verbose:
@@ -811,26 +883,27 @@ def output_to_stdout(assembly, k_locus, verbose, type_gene_names, type_gene_resu
         assembly_name_line = 'Assembly: ' + assembly.name
         print(assembly_name_line)
         print('-' * len(assembly_name_line))
-        print('    Best match locus: ' + k_locus.name)
-        print('    Match confidence: ' + k_locus.get_match_confidence())
+        print('    Best match locus: ' + locus.name)
+        print('    Best match type: ' + locus.type)
+        print('    Match confidence: ' + locus.get_match_confidence())
         print('    Problems: ' + (uncertainty_chars if uncertainty_chars else 'None'))
-        print('    Coverage: ' + k_locus.get_coverage_string())
-        print('    Identity: ' + k_locus.get_identity_string())
-        print('    Length discrepancy: ' + k_locus.get_length_discrepancy_string())
+        print('    Coverage: ' + locus.get_coverage_string())
+        print('    Identity: ' + locus.get_identity_string())
+        print('    Length discrepancy: ' + locus.get_length_discrepancy_string())
         print()
-        print_assembly_pieces(k_locus.assembly_pieces)
+        print_assembly_pieces(locus.assembly_pieces)
         print_gene_hits('Expected genes in locus: ' + expected_genes_in_locus_str,
-                        k_locus.expected_hits_inside_locus)
+                        locus.expected_hits_inside_locus)
         print_gene_hits('Expected genes outside locus: ' + expected_genes_out_locus_str,
-                        k_locus.expected_hits_outside_locus)
+                        locus.expected_hits_outside_locus)
         print('    Missing expected genes: ' + missing_genes_str)
-        for missing_gene in k_locus.missing_expected_genes:
+        for missing_gene in locus.missing_expected_genes:
             print('        ' + missing_gene)
         print()
-        print_gene_hits('Other genes in locus: ' + str(len(k_locus.other_hits_inside_locus)),
-                        k_locus.other_hits_inside_locus)
-        print_gene_hits('Other genes outside locus: ' + str(len(k_locus.other_hits_outside_locus)),
-                        k_locus.other_hits_outside_locus)
+        print_gene_hits('Other genes in locus: ' + str(len(locus.other_hits_inside_locus)),
+                        locus.other_hits_inside_locus)
+        print_gene_hits('Other genes outside locus: ' + str(len(locus.other_hits_outside_locus)),
+                        locus.other_hits_outside_locus)
 
         for gene_name in type_gene_names:
             result = 'Not found' if not type_gene_results[gene_name] \
@@ -839,7 +912,7 @@ def output_to_stdout(assembly, k_locus, verbose, type_gene_names, type_gene_resu
         print()
 
     else:  # not verbose
-        simple_output = assembly.name + ': ' + k_locus.name + uncertainty_chars
+        simple_output = assembly.name + ': ' + locus.name + uncertainty_chars
         for gene_name in type_gene_names:
             result = 'Not found' if not type_gene_results[gene_name] \
                 else type_gene_results[gene_name].result
@@ -941,7 +1014,7 @@ def get_blast_version(program):
         return ''
 
 
-def get_best_hit_for_query(blast_hits, query_name, k_locus):
+def get_best_hit_for_query(blast_hits, query_name, locus):
     """
     Given a list of BlastHits, this function returns the best hit for the given query, based first
     on whether or not the hit is in the assembly pieces, then on bit score.
@@ -950,7 +1023,7 @@ def get_best_hit_for_query(blast_hits, query_name, k_locus):
     matching_hits = [x for x in blast_hits if x.qseqid == query_name]
     if matching_hits:
         return sorted(matching_hits,
-                      key=lambda z: (z.in_assembly_pieces(k_locus.assembly_pieces), z.bitscore),
+                      key=lambda z: (z.in_assembly_pieces(locus.assembly_pieces), z.bitscore),
                       reverse=True)[0]
     else:
         return None
@@ -1065,16 +1138,16 @@ def complement_base(base):
     return reverse[forward.find(base)]
 
 
-def save_assembly_pieces_to_file(k_locus, assembly, output_prefix):
+def save_assembly_pieces_to_file(locus, assembly, output_prefix):
     """
     Creates a single FASTA file for all of the assembly pieces.
     Assumes all assembly pieces are from the same assembly.
     """
-    if not k_locus.assembly_pieces:
+    if not locus.assembly_pieces:
         return None
     fasta_file_name = output_prefix + '_' + assembly.name + '.fasta'
     with open(fasta_file_name, 'w') as fasta_file:
-        for piece in k_locus.assembly_pieces:
+        for piece in locus.assembly_pieces:
             fasta_file.write('>' + assembly.name + '_' + piece.get_header() + '\n')
             fasta_file.write(add_line_breaks_to_sequence(piece.get_sequence(), 60))
     return fasta_file_name
@@ -1103,9 +1176,10 @@ def line_iterator(string_with_line_breaks):
         prev_newline = next_newline
 
 
-def load_k_locus_references(fasta, k_ref_genes):
-    """Returns a dictionary of: key = locus name, value = KLocus object"""
-    return {seq[0]: KLocus(seq[0], seq[1], k_ref_genes[seq[0]]) for seq in load_fasta(fasta)}
+def load_locus_references(fasta, ref_genes, ref_types):
+    """Returns a dictionary of: key = locus name, value = Locus object"""
+    return {seq[0]: Locus(seq[0], ref_types[seq[0]], seq[1], ref_genes[seq[0]])
+            for seq in load_fasta(fasta)}
 
 
 def load_fasta(filename):
@@ -1134,12 +1208,12 @@ def load_fasta(filename):
     return fasta_seqs
 
 
-def good_start_and_end(start, end, k_length, allowed_margin):
+def good_start_and_end(start, end, length, allowed_margin):
     """
     Checks whether the given start and end coordinates are within the accepted margin of error.
     """
     good_start = start <= allowed_margin
-    good_end = end >= k_length - allowed_margin
+    good_end = end >= length - allowed_margin
     start_before_end = start < end
     return good_start and good_end and start_before_end
 
@@ -1347,9 +1421,10 @@ class TypeGeneBlastHit(BlastHit):
         return blast_results
 
 
-class KLocus(object):
-    def __init__(self, name, seq, genes):
+class Locus(object):
+    def __init__(self, name, type_name, seq, genes):
         self.name = name
+        self.type = type_name
         self.seq = seq
         self.genes = genes
         self.gene_names = [x.full_name for x in genes]
@@ -1389,7 +1464,7 @@ class KLocus(object):
 
     def clear(self):
         """
-        Clears everything in the KLocus object relevant to a particular assembly - gets it ready
+        Clears everything in the Locus object relevant to a particular assembly - gets it ready
         for the next assembly.
         """
         self.blast_hits = []
@@ -1425,11 +1500,11 @@ class KLocus(object):
         """
         self.blast_hits.sort(key=lambda x: x.length, reverse=True)
         kept_hits = []
-        k_range_so_far = IntRange()
+        range_so_far = IntRange()
         for hit in self.blast_hits:
             hit_range = hit.get_query_range()
-            if not k_range_so_far.contains(hit_range):
-                k_range_so_far.merge_in_range(hit_range)
+            if not range_so_far.contains(hit_range):
+                range_so_far.merge_in_range(hit_range)
                 kept_hits.append(hit)
         self.blast_hits = kept_hits
 
@@ -1466,9 +1541,9 @@ class KLocus(object):
         only_piece = self.assembly_pieces[0]
         a_start = only_piece.start
         a_end = only_piece.end
-        k_start = only_piece.earliest_hit_coordinate()
-        k_end = only_piece.latest_hit_coordinate()
-        expected_length = k_end - k_start
+        start = only_piece.earliest_hit_coordinate()
+        end = only_piece.latest_hit_coordinate()
+        expected_length = end - start
         actual_length = a_end - a_start
         return actual_length - expected_length
 
@@ -1724,11 +1799,11 @@ class IntRange(object):
 
 class Gene(object):
     """This class prepares and stores a gene taken from the input Genbank file."""
-    def __init__(self, k_locus_name, num, feature, k_locus_seq):
-        self.k_locus_name = k_locus_name
+    def __init__(self, locus_name, num, feature, k_locus_seq):
+        self.locus_name = locus_name
         self.feature = feature
         gene_num_string = str(num).zfill(2)
-        self.full_name = k_locus_name + '_' + gene_num_string
+        self.full_name = locus_name + '_' + gene_num_string
         if 'gene' in feature.qualifiers:
             self.gene_name = feature.qualifiers['gene'][0]
             self.full_name += '_' + self.gene_name
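
Besides the `k_locus` to `locus` rename, the kaptive.py diff above makes one functional change: `Locus` takes a type name as its second constructor argument, and `load_locus_references` looks that name up in `ref_types`. The sketch below illustrates that call shape only. It assumes `ref_types` and `ref_genes` are plain dicts keyed by locus name and that `load_fasta` yields (name, sequence) pairs; `SimpleLocus`, `build_loci` and the sample values are hypothetical stand-ins, not Kaptive's own code.

```python
class SimpleLocus:
    """Hypothetical stand-in for Locus, holding only the fields visible in the diff."""

    def __init__(self, name, type_name, seq, genes):
        self.name = name
        self.type = type_name  # new in this commit: stored next to the locus name
        self.seq = seq
        self.genes = genes


def build_loci(fasta_records, ref_genes, ref_types):
    """Mirror the dict comprehension used by load_locus_references.

    fasta_records: iterable of (name, sequence) tuples (assumed load_fasta output).
    ref_genes, ref_types: dicts keyed by locus name (an assumption about the callers).
    """
    return {name: SimpleLocus(name, ref_types[name], seq, ref_genes[name])
            for name, seq in fasta_records}


# Made-up sample data; the locus and type names echo the O locus reference diff.
records = [('O12', 'ATGC'), ('OL101', 'GGCC')]
genes = {'O12': [], 'OL101': []}
types = {'O12': 'O12', 'OL101': 'unknown'}

loci = build_loci(records, genes, types)
print(loci['OL101'].name, loci['OL101'].type)
```

With plain dict indexing as sketched here, a locus name missing from `ref_types` raises `KeyError`, so under this assumption every reference record needs a type entry.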


=====================================
reference_database/Klebsiella_k_locus_primary_reference.gbk
=====================================
The diff for this file was not included because it is too large.
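
The O locus reference diff that follows adds a `/note="O type: ..."` qualifier to each source feature, with values such as `unknown`, `O12` and `special logic`. How kaptive.py reads these notes is not shown in this part of the diff, so the following is only a sketch of one way to pull the type out of a list of note strings. The function name `type_from_notes` and its `label` parameter are hypothetical.

```python
def type_from_notes(notes, label='O type'):
    """Return the text after '<label>:' in the first matching note, or None.

    notes: list of note strings, as a GenBank parser would typically expose
    the repeated /note qualifiers of a source feature (an assumption).
    """
    prefix = label + ':'
    for note in notes:
        if note.startswith(prefix):
            return note[len(prefix):].strip()
    return None


# Note lists modelled on the source features in the O locus reference diff.
notes_o12 = ['O locus: O12', 'O type: O12']
notes_v1 = ['O locus: O1/O2v1', 'O type: special logic']
notes_without_type = ['O locus: O5']

print(type_from_notes(notes_o12))
print(type_from_notes(notes_v1))
print(type_from_notes(notes_without_type))
```

Returning `None` for a record without a type note is a choice made for this sketch; a caller could equally fall back to `unknown`, the value the diff uses for loci OL101 to OL104.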

=====================================
reference_database/Klebsiella_o_locus_primary_reference.gbk
=====================================
@@ -27,6 +27,7 @@ FEATURES             Location/Qualifiers
                      /serovar="rfb_OL101"
                      /db_xref="taxon:573"
                      /note="O locus: OL101"
+                     /note="O type: unknown"
      gene            1..1065
                      /gene="rfbB"
      CDS             1..1065
@@ -440,6 +441,7 @@ FEATURES             Location/Qualifiers
                      /serovar="rfb_OL102"
                      /db_xref="taxon:573"
                      /note="O locus: OL102"
+                     /note="O type: unknown"
      gene            1..525
                      /gene="rfbB"
      CDS             1..525
@@ -723,6 +725,7 @@ FEATURES             Location/Qualifiers
                      /serovar="rfb_OL103"
                      /db_xref="taxon:573"
                      /note="O locus: OL103"
+                     /note="O type: unknown"
      gene            1..192
                      /gene="rmlC"
      CDS             1..192
@@ -1007,6 +1010,7 @@ FEATURES             Location/Qualifiers
                      /serovar="rfb_OL104"
                      /db_xref="taxon:573"
                      /note="O locus: OL104"
+                     /note="O type: unknown"
      gene            1..1416
                      /gene="manC"
      CDS             1..1416
@@ -1403,6 +1407,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O12"
                      /db_xref="taxon:573"
                      /note="O locus: O12"
+                     /note="O type: O12"
      gene            1..1065
                      /gene="rfbB"
      CDS             1..1065
@@ -1754,6 +1759,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O1/O2 Variant 1"
                      /db_xref="taxon:573"
                      /note="O locus: O1/O2v1"
+                     /note="O type: special logic"
      gene            1..768
                      /gene="wzm"
      CDS             1..768
@@ -2051,6 +2057,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O1/O2 Variant 2"
                      /db_xref="taxon:573"
                      /note="O locus: O1/O2v2"
+                     /note="O type: special logic"
      gene            1..768
                      /gene="wzm"
      CDS             1..768
@@ -2446,6 +2453,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O3/O3a"
                      /db_xref="taxon:573"
                      /note="O locus: O3/O3a"
+                     /note="O type: O3/O3a"
                      /note="annotation edited from original by K. Wyres Jan 2018"
      gene            1..1416
                      /gene="manC"
@@ -2844,6 +2852,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O3b"
                      /db_xref="taxon:573"
                      /note="O locus: O3b"
+                     /note="O type: O3b"
                      /note="annotation edited from original by K. Wyres Jan 2018"
      gene            1..1416
                      /gene="manC"
@@ -3241,6 +3250,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O4"
                      /db_xref="taxon:573"
                      /note="O locus: O4"
+                     /note="O type: O4"
                      /note="sequence manually trimmed to removed insertion sequences"
      gene            complement(1..1797)
                      /gene="hydrolase"
@@ -3547,6 +3557,7 @@ FEATURES             Location/Qualifiers
                      /serovar="O5"
                      /db_xref="taxon:573"
                      /note="O locus: O5"
+                     /note="O type: O5"
      gene            1..1416
                      /gene="manC"
      CDS             1..1416
@@ -3960,6 +3971,7 @@ FEATURES             Location/Qualifiers
                      /note="Statens Serum Institut Klebsiella pnuemoniae
                      serotype O8 reference strains"
                      /note="O locus: O8"
+                     /note="O type: special logic"
                      /note="original sequence manually trimmed"
      gene            1..780
                      /gene="wzm"
@@ -4239,81 +4251,694 @@ ORIGIN
      9301 actacccgct gtacaaagcg ttaaaaaggc taatatcaaa aatgtcctgg ccaggtgcta
      9361 aaaaagatta ccgtggtaat cttttaaatc aataa
 //
-LOCUS       LT174607                3130 bp    DNA     linear   BCT 13-JUN-2016
-DEFINITION  Klebsiella pneumoniae D-Gal II biosynthesis gene cluster wbbYZ.
-ACCESSION   LT174607
-VERSION     LT174607.1  GI:1036211132
+LOCUS       O1/O2v3                 9400 bp    DNA     linear   BCT 14-FEB-2018
+DEFINITION  Klebsiella pneumoniae strain CWK53 O2aeh O locus, complete
+            sequence.
+ACCESSION   MG280710
+VERSION     MG280710.1
 KEYWORDS    .
 SOURCE      Klebsiella pneumoniae
   ORGANISM  Klebsiella pneumoniae
-            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
+            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
             Enterobacteriaceae; Klebsiella.
-REFERENCE   1
-  AUTHORS   Follador,R., Heinz,E., Wyres,K.L., Ellington,M.J., Kowarik,M.,
-            Holt,K.E. and Thomson,N.R.
-  TITLE     Towards a vaccine: An investigation of Klebsiella pneumoniae
-            surface antigens
+REFERENCE   1  (bases 1 to 10105)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Molecular basis for structural diversity in serogroup O2-antigen
+            polysaccharides in Klebsiella pneumoniae
   JOURNAL   Unpublished
-REFERENCE   2  (bases 1 to 3130)
-  AUTHORS   Informatics,P.
+REFERENCE   2  (bases 1 to 10105)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
   TITLE     Direct Submission
-  JOURNAL   Submitted (04-FEB-2016) SC, Wellcome Trust Sanger Institute, CB10
-            1SA, UNITED KINGDOM
+  JOURNAL   Submitted (27-OCT-2017) Dept of Molecular and Cellular Biology,
+            University of Guelph, 50 Stone Road, Guelph, Ontario N1G 2W1,
+            Canada
+COMMENT     ##Assembly-Data-START##
+            Assembly Method       :: Velvet v. 1.2.10
+            Assembly Name         :: Klebsiella pneumoniae CWK53 O-antigen
+                                     biosynthesis (rfb) gene cluster
+            Coverage              :: 100x
+            Sequencing Technology :: Illumina
+            ##Assembly-Data-END##
 FEATURES             Location/Qualifiers
-     source          1..3130
+     source          1..9400
                      /organism="Klebsiella pneumoniae"
                      /mol_type="genomic DNA"
-                     /strain="AKPRH07048"
-                     /serovar="O1"
+                     /strain="CWK53"
+                     /serotype="O2aeh"
                      /db_xref="taxon:573"
-                     /note="Extra genes: wbbY/wbbZ"
-                     /note="original sequence manually trimmed"
+                     /note="Sequence edited to remove hisI gene, Kelly Wyres April 2021"
+                     /note="O locus: O1/O2v3"
+                     /note="O type: special logic"
+     gene            1..768
+                     /gene="wzm"
+     CDS             1..768
+                     /gene="wzm"
+                     /note="lipopolysaccharide O-antigen export protein
+                     integral membrane component"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="Wzm"
+                     /protein_id="AVA30565.1"
+                     /translation="MKYNIGYLFDLLVVITNKDLKVRYKSSVFGYLWSIANPLLFAMI
+                     YYFIFKLVMRVQIPNYTLFLITGLFPWQWFASSATNSLFSFIANAQIIKKTVFPRSVI
+                     PLSNVMMEGLHFLCTIPVIIAFLFVYGMSPSISWLWGIPIIAIGQVIFTFGISIIFST
+                     LNLFFRDLERFVSLGIMLMFYCTPILYAADMIPEKFSWIITYNPLASMILSWRQLFMD
+                     GVLNYEYISILYVTGIILTIVGLSIFNKLKYRFAEIL"
+     misc_feature    1..>9400
+                     /note="rfb gene cluster"
+     gene            768..1508
+                     /gene="wzt"
+     CDS             768..1508
+                     /gene="wzt"
+                     /note="lipopolysaccharide O-antigen export protein
+                     nucleotide binding component"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="Wzt"
+                     /protein_id="AVA30566.1"
+                     /translation="MEPVINFSHVTKEYPLYHHIGSGIKDLVFHPKRAFQLLKGRKYL
+                     AIEDISFTVGKGEAVALIGRNGAGKSTSLGLVAGVIKPTKGTVTTQGRVASMLELGGG
+                     FHPELTGRENIYLNATLLGLRRKEVQQRMERIIEFSELGDFIDEPIRVYSSGMLAKLG
+                     FSVISQVEPDILIIDEVLAVGDISFQAKCIQTIKDFKSKGVTILFVSHNMSDVEKICD
+                     RVVWIEDHKLREIGSAERIIELYKQAMA"
+     gene            1523..3415
+                     /gene="wbbM"
+     CDS             1523..3415
+                     /gene="wbbM"
+                     /note="glycosyltransferase involved in lipopolysaccharide
+                     O-antigen biosynthesis"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="WbbM"
+                     /protein_id="AVA30567.1"
+                     /translation="MNSIKIYTCHHKPSAFLNASIIKPLHVGKANTYNDIGCEGEDSG
+                     DNISFKNPFYCELTAHYWVWKNEPLADYVGFMHYRRHLNFAEQQNHPEDNWGVVNYPL
+                     INAEYESQFGLSDESISTCVDGYDLLLPKKWSVTSAGSKNNLDHYAKGEFLHIKDYQS
+                     ALDVVEELYPEYKDAIQQFNNATDGYYTNMFVMRKDMFTDYSEWLFAILSNLEDRISM
+                     NNYNAQEKRVIGHIAERLFNIYIIKSQQDKQLKIKELQRTFVTAETFNGKLNPIFDES
+                     VPVVISFDNNYALSGGALINSIVRHSDANRNYDIVVLENKVSHLNKQRLIKLVAGHNN
+                     ISLRFFDVNSFTEMSDVHTRAHFSASTYARLFIPQLFREYKKVVFIDSDTVVKSDLAT
+                     LLDVEIGTNLVAAVKDIVMEGFVKFGTMSESDDGIMPAGEYLKKTLGMTNPDEYFQAG
+                     IIVFNVEQMVKENTFAQLMSALKAKKYWFLDQDIMNKVFFGRVKFLPLEWNVYHGNGN
+                     TDDFFPNLKFSTYMRFLEARRNPKMIHYAGENKPWNTEKVDFYDDFLENVLNTPWEKE
+                     IYYRQLPVATVVPNQHTELQQTVLLQTKIKRALMPYVNKYAPVGSPRRNKLTKYYYKV
+                     RRSILG"
+     gene            3434..4588
+                     /gene="glf"
+     CDS             3434..4588
+                     /gene="glf"
+                     /EC_number="5.4.99.9"
+                     /note="UDP-galactopyranose mutase"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="Glf"
+                     /protein_id="AVA30568.1"
+                     /translation="MKRNNILIVGAGFSGVVIARQLAEQGHKVKIIDQRDHIGGNSYD
+                     TRDPQTDVMVHVYGPHIFHTDNETVWNYVNRYAEMMPYVNRVKATVNGQVFSLPINLH
+                     TINQFFAKTCSPDEARALISEKGDSSILEPKTFEEQALRFIGKELYEAFFKGYTIKQW
+                     GMQPSELPASILKRLPVRFNYDDNYFNHKFQGMPKLGYTHMIEAIADHENITLQLQRE
+                     FAAENRESYDHVFYSGPLDAFYSYQHGRLGYRTLDFERFTWQGDYQGCAVMNYCSVDV
+                     PYTRITEHKYFSPWESHEGSVCYKEYSRACGEDDIPYYPIRQMGEMALLDKYLSLAEN
+                     EKNITFVGRLGTYRYLDMDVTIAEALKTADKYLSSLSNDEAMPVFVADVR"
+     gene            4585..5475
+                     /gene="wbbN"
+     CDS             4585..5475
+                     /gene="wbbN"
+                     /note="putative glycosyltransferase involved in
+                     lipopolysaccharide O-antigen biosynthesis"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="WbbN"
+                     /protein_id="AVA30569.1"
+                     /translation="MKHTALIVTFNRLEKLKKTVAETVKLHFTSIVIVNNGSTDGTSD
+                     WLQTITDPRVIVLNLACNNGGAGGFKAGSQYICSSVDSDWVFFYDDDAYPQSDILDKF
+                     STINKEGCRVFTALVKDLQGRTCAMNVPFAKVPTSFSDTLQYIKHPQRYVPDNKEMLV
+                     QTVSFVGMIIKREVLNEHLHHIYDELFLYFDDLYFGYQLTLSGEKIIYNPELVFTHDV
+                     SIQGKVISPEWKVYYLCRNLILAKRIFTEVAVFSSSSILLRLCKYISIFPVQRRKWVY
+                     LKFLCRGIVHGVKGISGKFH"
+     gene            5488..6621
+                     /gene="wbbO"
+     CDS             5488..6621
+                     /gene="wbbO"
+                     /note="glycosyltransferase involved in lipopolysaccharide
+                     O-antigen biosynthesis"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="WbbO"
+                     /protein_id="AVA30570.1"
+                     /translation="MKKLCYFINSDWYFDLHWTDRAIAARDAGYEIHIISHFVDDKIA
+                     IKFKTLGFICHNIPLVAQSFNVLIFFRAFFKARKIIQSINPDLLHCITIKPCLIGGFL
+                     AKSTHRPVILSFVGLGRVFSAESGCLKLLRSFTVMAYKYIASNKCSLFMFEHDKDRAK
+                     LADLVGIDYKQTIVIDGAGINPEIYKYSLEQQRDVPVVLFASRMLWSKGLGDLIEAKK
+                     ILSNKNIHFTLNVAGILVENDKDAIPLATIQKWQSEGVINWLGHCSNVFDLIEESNIV
+                     ALPSVYAEGVPRILLEASSVGRACIAYDVGGCDSLIINNDNGLIVKSKSVEELAEKLG
+                     FLLDNPETRVAMGINGRKRIQDKFSSVMIINKTLKTYRDVVEE"
+     CDS             6743..8047
+                     /note="Orf7"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="hypothetical protein"
+                     /protein_id="AVA30571.1"
+                     /translation="MSEKKSIAVSVIIPVHNAAGYISDTLSTVLSQTLNDIEIIIIND
+                     CSNDNTLEIVSALAETDSRIRVINNTTNLGGGGSRNVGLDAAIGEYVIFLDDDDYADN
+                     MMLERMYAQASDTQADVVICRSQSFDPSLQVYAPMPWSVRQELLPDLMVFSSQDIPSD
+                     FFRAFVWWPWDKLLRRQFIATHQLRFQEIRTTNDLFFVCAFMLMANRISVLNETLISH
+                     TINRSESLSATRAESHRCAVEALVALKAFICQQGMMDQRLRDYKNYVVVFLEWHLNTI
+                     SGPAFHPFYQQVKEFIVALDAKDDDFYDEFIAAAHHRITTLSAEEYLFSLKDRVLKEL
+                     EFFQAKSSALQQEVETLTDSFAEQKDENAILHNQLHEIEERVTEQEKNIRQLTDKNNN
+                     MHHEITIKQQEFNEIYQHHKNLISSLSWKLTKPLRVVRRFFK"
+     CDS             8381..9400
+                     /note="Orf8; involved in lipopolysaccharide O-antigen
+                     biosynthesis"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="putative acyltransferase"
+                     /protein_id="AVA30572.1"
+                     /translation="MGCRLSVNRFDAIDGTRGILAGMVMLSHMFGSFMGWPQVRPFSG
+                     AYLCVVYFFIMSGFVLTYAHASGNFFKYVLTRFARLWPLHFLSTILMVAIYYYNAHHG
+                     GYVSSPDVFSISVILKNILFLHGLYWHDFKLVNEPSWSISIEFWVSLLIPLIFTRLGN
+                     MTRAVVAGLLFIFLWHKHPSGIPPSMLTAMLSMLIGSLFFSLSQTEYFKCLMKEKFMA
+                     FFITAAAIVSIVGVYAMNHSRLDYFLFVAFIPMLFIDHLPDNQLIKRVFTSNLFLFLG
+                     YISFPLYLLHELVIVSGFIFDPENAWVSISIAAFASIFIAYIYARFIDYPLYRALKRQ
+                     IAKIS"
+BASE COUNT     2649 a   1738 c   2007 g   3006 t
+ORIGIN
+        1 atgaagtata atataggata tttgtttgat cttcttgttg taataacaaa caaagatttg
+       61 aaagtacgtt ataaaagcag cgtatttggt tacttatggt caatcgcaaa tccactgctt
+      121 ttcgcaatga tatattattt cattttcaaa cttgttatgc gggtacagat accaaattat
+      181 acactctttt taattacggg tcttttccca tggcaatggt ttgcgagttc tgcgaccaac
+      241 tctctattct cgtttattgc taacgcacaa attattaaga agacagtatt ccctcgttca
+      301 gtaatcccat taagcaatgt gatgatggaa gggttacatt ttctctgtac tattcctgtc
+      361 attatcgcat ttttgtttgt ttatggtatg agtccatcaa tatcctggct gtgggggata
+      421 ccaataattg cgattgggca agttattttt acatttggta tttctataat cttttcaaca
+      481 ctcaatctct ttttccggga tctggagcgg tttgtgagtc ttggcatcat gctgatgttt
+      541 tattgcacgc caattttata tgcagcagat atgatcccgg agaagtttag ctggattatc
+      601 acctataatc ctcttgcaag tatgatcctc agctggcgcc agttattcat ggatggtgta
+      661 ttgaactatg agtatatctc aatactttat gttacaggta ttatcctgac catagtgggt
+      721 ctaagcatct ttaataaatt aaaatatcga tttgcagaga ttttgtaatg gaaccagtaa
+      781 taaattttag tcacgttacg aaagaatatc cactttatca tcatattggc tcaggaataa
+      841 aagacttagt atttcatcct aagcgtgctt tccagttact gaaagggcgg aaataccttg
+      901 ctatcgagga tatctctttt actgttggca agggggaagc ggttgcattg atcggacgta
+      961 atggcgcggg gaaaagtact tcattaggtt tggtggctgg tgttatcaag ccaacgaaag
+     1021 gtaccgttac gactcagggc cgggttgcat caatgctgga gctaggcggt gggttccatc
+     1081 cagaattgac tggtcgggaa aatatttatc tcaatgccac ccttcttgga ttacgacgta
+     1141 aagaagttca gcagcgtatg gaacgtatca ttgagttctc ggaattaggt gactttattg
+     1201 atgaacctat ccgtgtttac tcaagcggca tgcttgcaaa actggggttc tctgtcatca
+     1261 gccaggttga gcctgatatt cttatcattg atgaagtgct tgctgtgggc gatatctcat
+     1321 tccaggccaa gtgcattcag accatcaaag actttaaaag caaaggagtg acgatattat
+     1381 tcgtcagtca caatatgagt gatgttgaaa aaatctgtga cagagttgtt tggatcgaag
+     1441 accacaaact ccgcgaaatt ggctcagcag agagaatcat tgaactttac aagcaagcaa
+     1501 tggcttaata attggtaata agatgaatag tatcaaaatt tacacatgtc accacaaacc
+     1561 cagtgctttt ctaaacgcct ctattattaa accactgcat gtcgggaagg ctaatactta
+     1621 taatgatatt ggctgtgaag gggaggatag cggagacaac atttccttta aaaacccctt
+     1681 ttattgtgag ctgacggccc actactgggt gtggaagaat gaaccccttg ctgactatgt
+     1741 gggattcatg cattaccgca gacacttgaa ctttgcagaa caacaaaatc atcctgagga
+     1801 taactggggg gtggtcaact acccgctcat caatgccgaa tacgaaagcc agtttggatt
+     1861 aagtgatgaa tccataagca cctgcgtcga tggatatgat ctgctgttac ccaaaaaatg
+     1921 gtcggtaact tcggcaggaa gtaagaataa ccttgatcat tatgcaaagg gtgagttttt
+     1981 acacattaaa gactatcagt ctgctctgga tgtcgttgaa gagctgtatc cagaatataa
+     2041 agatgcgatt caacaattca ataatgcaac tgatggttat tatacgaaca tgtttgttat
+     2101 gcgcaaagac atgttcacgg attattctga atggcttttt gctattttgt ctaatcttga
+     2161 agatcggatt tcgatgaata actacaatgc tcaagaaaaa cgagttatag gacacattgc
+     2221 tgagcgttta ttcaatattt atatcatcaa aagccaacaa gataaacagc tgaaaataaa
+     2281 agaactacag cgtacgtttg tcactgctga aacatttaat ggcaaattaa atccgatttt
+     2341 tgacgaaagt gttccggttg ttatcagttt tgataataac tatgcattaa gtggcggggc
+     2401 attaataaat tcaattgttc gccattctga tgctaaccga aactatgata tcgttgttct
+     2461 ggaaaataag gtcagtcatt taaataaaca acgccttatc aagctagttg ctggtcataa
+     2521 caacatatca ctgcgctttt ttgatgtgaa ttcattcact gagatgagtg atgttcatac
+     2581 ccgtgcgcat tttagtgcgt cgacatatgc gcgcttgttc atcccgcaac ttttccgcga
+     2641 gtataaaaaa gttgtgttta tcgattctga taccgtggtg aagtctgatt tggcgacact
+     2701 tctggatgtt gagatcggca ctaacctggt tgccgctgtt aaagacatcg tcatggaagg
+     2761 atttgtgaag tttggtacga tgtcagaatc tgatgatggc attatgccag caggggaata
+     2821 cctgaaaaag acattaggaa tgactaaccc tgacgaatat tttcaggccg ggattattgt
+     2881 ttttaacgtc gaacaaatgg ttaaagagaa cacctttgcg caattgatgt cagcattgaa
+     2941 agccaaaaag tattggttct tagatcagga tatcatgaac aaagtcttct ttggccgagt
+     3001 caagttttta ccattagaat ggaatgtgta ccatggtaat ggtaataccg atgatttctt
+     3061 cccgaatctc aagttttcaa cctatatgcg gtttttggaa gccagaagaa atccaaaaat
+     3121 gattcactat gcgggtgaaa acaaaccatg gaatactgag aaagtcgatt tctatgatga
+     3181 ttttcttgaa aatgttttaa atacgccatg ggaaaaagaa atttactatc gccagttacc
+     3241 tgtggccacg gtagtaccta accaacatac tgaactgcag caaaccgtgt tactgcagac
+     3301 aaagattaaa cgagctttaa tgccatatgt taacaaatat gctcctgtcg gttcgccaag
+     3361 aagaaataag ctcactaaat attattataa agttcgtcgc tcgattcttg gctaatataa
+     3421 ctggagatac attatgaagc gtaacaatat tctcatcgtc ggcgctggtt tttcaggtgt
+     3481 ggtcattgcc cgccagcttg ctgaacaagg tcacaaggtt aaaatcatcg atcagcggga
+     3541 ccacattgga ggaaactctt atgacactcg cgatcctcag actgatgtca tggttcatgt
+     3601 ctacggtccg catattttcc atacggataa tgaaaccgtc tggaactatg tcaatcggta
+     3661 tgcggaaatg atgccctatg tgaatagagt aaaagctact gttaatggtc aggttttttc
+     3721 actgccgatt aatctgcata ctattaatca gttctttgct aaaacctgtt ctcctgatga
+     3781 agctcgcgcg ctcattagcg aaaaaggtga tagctcgatt ctggaaccga agacatttga
+     3841 agagcaggct ttgcgcttta ttggtaaaga attatatgaa gcgttcttca aaggctacac
+     3901 cataaaacag tgggggatgc agccttccga gcttccggca tctatactca aacggttgcc
+     3961 tgtgcgtttc aattatgacg ataactactt caatcataaa ttccagggca tgccgaagtt
+     4021 gggctatacc catatgattg aagcgatcgc cgatcatgaa aatatcactc tgcaattaca
+     4081 gcgtgagttt gcggctgaaa atcgtgaaag ttatgatcat gtgttttata gcgggccgtt
+     4141 agacgcgttc tattcatacc agcatgggcg tcttggctac cgtactctgg actttgagcg
+     4201 atttacctgg caaggtgatt atcagggctg cgcggtcatg aattattgct cagttgatgt
+     4261 cccgtatacg cgaattacgg aacataaata tttttccccg tgggaaagcc atgaaggttc
+     4321 cgtatgttat aaggaatata gtcgcgcttg cggtgaagat gatattcctt actatccaat
+     4381 ccgccagatg ggtgagatgg ctctgctgga taaatattta tcactggcgg agaatgagaa
+     4441 aaatattact ttcgttggac gcctggggac ttatcgctac cttgatatgg atgtaacgat
+     4501 tgcagaagca ttgaaaacgg ccgataaata tctgtcctca ttgtccaacg acgaagcgat
+     4561 gcctgtattt gtggccgacg tacgatgaaa catactgcat taatagtcac ttttaaccgt
+     4621 cttgaaaagc taaaaaagac ggttgctgaa actgttaagc tgcactttac ctctatcgtt
+     4681 attgtcaaca acgggtccac ggatgggacc tctgattggc ttcagacaat cactgatccc
+     4741 cgtgtaattg tcttgaatct ggcctgtaat aacggtggcg ccggagggtt taaggccggt
+     4801 agtcaataca tatgcagttc tgttgatagc gattgggtct tcttttatga cgatgatgct
+     4861 tatccgcaaa gtgatattct tgataagttc tcaacgataa ataaagaagg ttgtcgtgtt
+     4921 tttacggctt tagttaaaga tttgcagggc cggacatgcg cgatgaatgt tccatttgca
+     4981 aaagtcccga cctcgtttag tgatactttg caatatatca aacatcccca acgatatgta
+     5041 ccagataata aggaaatgct ggtgcaaaca gtttcatttg ttggcatgat cattaagcgt
+     5101 gaagtcttga atgagcacct tcatcatatt tatgatgaac ttttccttta ttttgatgac
+     5161 ctctattttg gttatcaatt aactctaagt ggtgagaaaa ttatctataa tccggaactt
+     5221 gtttttactc atgatgtgag tatccaggga aaggtcatat cgccagagtg gaaggtttac
+     5281 tatctttgtc gaaacttaat tttggctaag agaatattta cggaagtggc agtttttagt
+     5341 agttcatcaa tccttttgcg cctatgcaag tatatctcca tattccctgt acagcgccgg
+     5401 aaatgggtgt atctaaagtt tttatgccgt gggattgtac atggtgtaaa aggaattagc
+     5461 ggtaagtttc actaagtgga catagcaatg aaaaaactgt gttatttcat aaattcggat
+     5521 tggtattttg atttgcattg gaccgatcgt gcaattgccg cccgagatgc cggttatgag
+     5581 attcacatta ttagccattt tgtcgatgat aaaatagcga taaaattcaa aacactaggt
+     5641 tttatttgtc ataatattcc tcttgtcgct caatcgttca acgttttgat tttctttcgg
+     5701 gcctttttta aggctcggaa aattattcaa agcattaatc ctgatttatt acactgcatt
+     5761 accatcaaac cgtgtttgat cggcggattt ctggctaaaa gtacccatcg tcctgttatt
+     5821 ctgagttttg tcggtttagg ccgagtattt tcggcagagt ctggctgtct taagctgctg
+     5881 cgtagtttta ctgttatggc atataagtat atcgccagta acaaatgcag cttgtttatg
+     5941 ttcgaacatg ataaagacag agccaaactc gccgatctgg ttggtatcga ttacaaacag
+     6001 actattgtta ttgatggtgc gggtattaat ccagagattt acaaatattc tctggagcaa
+     6061 caacgtgatg tcccggtcgt cctttttgcc agccgtatgc tgtggagtaa aggacttggt
+     6121 gatctgattg aagccaaaaa aatactgagt aataaaaata ttcactttac gctgaatgtt
+     6181 gccggtatct tagttgagaa tgataaagac gcgattccgc tagcgacgat acagaagtgg
+     6241 caaagcgaag gcgtgattaa ctggctcggt cattgctcta atgtatttga tttaattgaa
+     6301 gaatcaaata tcgttgcttt gccgtcggtc tacgccgaag gcgtaccacg tatcttgctg
+     6361 gaagcttcct cggtcgggcg tgcttgtatc gcttatgatg ttggtggctg tgatagctta
+     6421 attatcaaca acgataatgg gttaattgta aaaagtaaat ctgttgagga attagcggag
+     6481 aaactcggtt tcctgttaga taatcctgaa acgcgagtcg caatgggtat caatggcaga
+     6541 aagcgtattc aagataaatt ctcgagtgtg atgatcatta ataaaacatt aaaaacatat
+     6601 cgcgatgttg ttgaagagta atttttgatg aagctgagtt gctttcatct tacctatcct
+     6661 taatgaactg gaaacctaca aattaaatgc taccaataaa taaatataat tgcgtttaga
+     6721 agctaagcgt caaggtgatt acatgtctga aaaaaagagt attgctgttt ctgttattat
+     6781 ccctgtgcac aatgctgctg gatatatttc cgatacatta agcaccgttt tgtcgcaaac
+     6841 attaaatgat atcgaaatca ttattatcaa tgattgttct aacgataata ccttagagat
+     6901 tgtctctgcg ctggctgaaa ctgattcacg gattagagtg attaacaaca cgacaaatct
+     6961 tggtggtgga ggttcgagaa acgttgggtt ggatgctgct atcggtgagt acgttatttt
+     7021 tctcgatgac gacgattacg ctgataacat gatgttggag cgcatgtacg cgcaggccag
+     7081 cgatacacaa gcggatgtcg ttatttgtcg cagccagtct tttgatccct ccttacaagt
+     7141 ttatgctccg atgccgtggt cagttcgcca ggagctgcta cctgatctta tggttttttc
+     7201 ctcccaggat attccctctg actttttccg cgcctttgtc tggtggccat gggacaagct
+     7261 tctcaggcgt caatttattg ctactcatca gcttcgtttt caggaaatca ggacaacaaa
+     7321 cgatcttttc ttcgtctgtg ccttcatgct gatggcgaac agaatctcgg tcttaaatga
+     7381 aacgttaatt tctcatacca tcaatcgtag tgaatcctta tccgccacga gagcagaatc
+     7441 gcaccgctgt gctgttgaag cattagtggc tcttaaagca tttatctgtc agcaggggat
+     7501 gatggatcaa cgtctcagag attacaaaaa ctatgttgtc gttttccttg agtggcattt
+     7561 aaacacgata tctggtccgg catttcatcc gttttatcag caagtgaaag agtttattgt
+     7621 tgcgttagat gcaaaggatg atgattttta tgatgaattt atcgccgccg cgcatcatag
+     7681 aatcaccacg ctgtcagctg aagagtacct attctccctt aaagatcgtg ttctgaagga
+     7741 gcttgagttt ttccaggcaa agagctccgc attacagcaa gaagttgaga cgctgactga
+     7801 ctcattcgct gagcaaaagg atgaaaatgc gatattacat aatcaattgc atgagattga
+     7861 agagcgggtt actgaacagg aaaagaatat tcgtcagtta actgacaaaa ataataatat
+     7921 gcatcatgaa ataactatca aacagcaaga attcaatgaa atttatcagc atcacaaaaa
+     7981 tttaatctct tctttatcct ggaagttgac taagccacta cgagtggtaa gacgtttttt
+     8041 taaataattt caaaaagaag ttaagcaagt ttcacttctt tgttagtgtt cttgttattt
+     8101 tcaataaaaa tttttcgtta aataaaagag ctcctggtca tatgatttgc gagtctctat
+     8161 tcctgttaga ttaataattt attagttaga gagtaatagg tttttgtcga gttttactgg
+     8221 gggcagtaat atctcgatgt gtgtacttgt gtttttttgt gacttcctgc cttttgaatt
+     8281 ttttctatcg ggcagtgatc acctttgaat attgatagct attcttttta ttactaataa
+     8341 attctttctt ctaaaagaat gtagttttac tcattggaat atggggtgta gattgtctgt
+     8401 taacagattt gatgccattg atggcactcg tgggatattg gcgggtatgg tcatgctcag
+     8461 ccatatgttt ggcagcttta tgggttggcc gcaagtcaga ccatttagcg gtgcctatct
+     8521 gtgcgtggta tattttttca tcatgagtgg ctttgttcta acctatgccc acgcctcagg
+     8581 aaactttttc aaatacgttc ttacacgttt tgccagatta tggccgcttc attttctgtc
+     8641 aacgatactg atggtagcca tttattacta caatgcacac catggcggct atgtttccag
+     8701 tccggatgtt ttttctatta gcgtgatatt aaaaaatata ttatttcttc atggtcttta
+     8761 ctggcatgac ttcaagttag tcaatgagcc atcatggagt atcagtattg agttttgggt
+     8821 gtcactttta attcctctaa tctttacgcg attaggaaat atgactcgtg cggtagttgc
+     8881 aggtttatta tttatttttc tgtggcataa acatccatca gggataccgc catcaatgct
+     8941 aacagcaatg ctgtcaatgc tgattggttc attgtttttt tctttatcac agacagagta
+     9001 ctttaagtgt ctgatgaaag aaaagttcat ggctttcttt attaccgcag cggctatcgt
+     9061 atcgatcgtt ggcgtatatg caatgaacca tagtcgactg gactattttt tattcgttgc
+     9121 gtttatcccg atgctattca tcgatcatct tccggataat caattgataa agagagtttt
+     9181 cacctctaat ttattcttgt ttctcggtta tattagcttc cctttgtatt tattgcatga
+     9241 actggttatc gtatccggat tcatttttga ccctgagaat gcctgggtat caatatcaat
+     9301 agcggcattt gcatcaatat tcattgccta tatttatgca cgatttatcg attatcctct
+     9361 ctatcgggcg ttaaagcgtc agattgccaa aataagttaa
+//
+LOCUS       gmlABD                  3354 bp    DNA     linear   BCT 14-FEB-2018
+DEFINITION  Klebsiella pneumoniae strain CWK53 gmlABD gene cluster complete sequence.
+ACCESSION   MG458670
+VERSION     MG458670.1
+KEYWORDS    .
+SOURCE      Klebsiella pneumoniae
+  ORGANISM  Klebsiella pneumoniae
+            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
+            Enterobacteriaceae; Klebsiella.
+REFERENCE   1  (bases 1 to 5075)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Molecular basis for structural diversity in serogroup O2-antigen
+            polysaccharides in Klebsiella pneumoniae
+  JOURNAL   Unpublished
+REFERENCE   2  (bases 1 to 5075)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Direct Submission
+  JOURNAL   Submitted (30-OCT-2017) Dept of Molecular and Cellular Biology,
+            University of Guelph, 50 Stone Road, Guelph, Ontario N1G 2W1,
+            Canada
+COMMENT     ##Assembly-Data-START##
+            Assembly Method       :: Velvet v. 1.2.10
+            Assembly Name         :: Klebsiella pneumoniae CWK53 gmlABD cluster
+            Coverage              :: 100x
+            Sequencing Technology :: Illumina
+            ##Assembly-Data-END##
+FEATURES             Location/Qualifiers
+     source          1..3354
+                     /organism="Klebsiella pneumoniae"
+                     /mol_type="genomic DNA"
+                     /strain="CWK53"
+                     /serotype="O2aeh"
+                     /db_xref="taxon:573"
+                     /note="sequence edited to trim to gmlABD only by Kelly Wyres, April 2021"
+                     /note="Extra genes: gmlABD"
+     gene            1..375
+                     /gene="gmlA"
+     CDS             1..375
+                     /gene="gmlA"
+                     /note="putative export component of the periplasmic
+                     O-antigen modification pathway"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="GmlA"
+                     /protein_id="AVA30553.1"
+                     /translation="MKKILSNPALKYVMVGLLNTAITAVVIFLLMSAGVGVYLSNAMG
+                     YVVGILFSFVVNSLFTFSIALTGKRFIKFLASCAMCWVANIITVKMFLLIFPDLIYIS
+                     QLCGMIVYTIAGFLINKLWVMK"
+     misc_feature    <1..>3354
+                     /note="gmlABD gene cluster"
+     gene            375..1334
+                     /gene="gmlB"
+     CDS             375..1334
+                     /gene="gmlB"
+                     /note="putative glycosyltransferase synthesizing
+                     undecaprenol-P-galactose for periplasmic side-group
+                     modification of the O-antigen"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="GmlB"
+                     /protein_id="AVA30554.1"
+                     /translation="MKQPKLAIVVPCYNEEEVFSHCLSELQAVLAELVTATLISADSY
+                     ILFVDDGSCDSTWQLIAQASMDYSTVKGVKLSRNRGHQVALVAGLEAAESDITVSIDA
+                     DLQDDTSVIALMVKEYLNGNEIVYGVRNDRASDSPFKRKTAGLFYSCMKWMGVQQIPQ
+                     HADFRLLSDRAKKALLSFKEQNLYLRGLVPLIGYRSTQVAYSRTVRLAGESKYPLKKM
+                     LALAIEGITSLTITPLRIIAISGFAISILSVFAAIYALFEKFSGNTVEGWASVMIAIF
+                     FLGGVQMLSLGIIGEYVGKIYMESKGRPKYFIEKTTQQKVDDE"
+     gene            1327..3354
+                     /gene="gmlD"
+     CDS             1327..3354
+                     /gene="gmlD"
+                     /note="putative GT-C type glycosyltransferase that adds an
+                     alpha(1-2)-linked galactose side group to the O-antigen
+                     from Und-P-galactose in the periplasm."
+                     /codon_start=1
+                     /transl_table=11
+                     /product="GmlD"
+                     /protein_id="AVA30555.1"
+                     /translation="MNNEKIISRIQYVVFIAILSVLCYSFVASENQAYTWDSRFYWVV
+                     WNEYTGLLHDSFSQWLASIKYSVYSADYNPLPVIALIPFNYLPIGNREAYILGVYILY
+                     FLPFAYIAMRLFATAAGVDNKYQPYIFVFLASFIPFVTPTLRGYPDIIGMIPLSLCCL
+                     ILFKVNILKYQGARFVLLAIAMGVLLWLPFAFRRWYAYSVVSLFVTLPFLNTFLFGEA
+                     ARFSKERIFKYFIFFTLAGLVVIFLVCVFQFELAERILTTNYSDIYSAYKATEQATFF
+                     ITLNYAGLYLLPLIVLGGLYIVFGTSFRLKVLCAFALANFIITYIIFTRTQTPGMQHG
+                     MPFCFWLAILVLSALKLFFDRLNKGLAITLTIAFSLLTVAVFISTYSKPFGKPELVSR
+                     YLPGKEYPLRLENFDHYHELIAFMSARVEPDDKVAVFASSGSLNNDLFSAISPASFVS
+                     HIANVSQVDLRDKLATEAFSSRYAIVTDPAQTHLGIHGQQVIYLPNNLILEGKGIGAA
+                     YKRVSQPFILSGGVQAYVYEKIRPYTIDEYRSMIDEFSRSYPSWKAEYENNLTESYLS
+                     AAVEKGDTWGQFSMYREGRIYAHPGATTPTIVRMFVGEYDTLRITSVNKNCGKTDGVK
+                     VIIKDATHRVEKHIATAESVVFDMRDFHNKDITLIIDNNGSSACDSLDINQ"
+BASE COUNT      902 a    603 c    741 g   1108 t
+ORIGIN
+        1 gtgaaaaaaa tactttctaa tcccgcattg aaatatgtaa tggtggggtt attgaataca
+       61 gcaataactg cggtggtgat atttttatta atgtctgccg gggttggagt atatctatca
+      121 aatgccatgg gatatgtggt aggtattctc tttagttttg tggttaattc gttgtttacc
+      181 ttttcaatag cgctgaccgg caaacggttt ataaagtttc tggcatcttg tgcaatgtgc
+      241 tgggtagcta atattataac agttaaaatg tttttattaa tattccctga tttgatttat
+      301 atatctcagt tatgtggcat gattgtttac acaattgctg gttttttaat taataaactc
+      361 tgggtgatga aataatgaaa caaccaaagt tggctattgt tgtgccatgt tataacgagg
+      421 aagaagtatt cagtcactgc ttatctgaat tacaagcggt tctggcggaa ttagtcacgg
+      481 caacactgat ttcagctgat agctatattt tattcgttga tgatggcagc tgtgattcaa
+      541 cgtggcagtt gattgctcaa gcatcgatgg attactcaac ggttaaaggg gttaagcttt
+      601 caagaaatcg tggtcatcag gtagccctgg ttgcgggtct tgaggccgcc gaatcggata
+      661 ttactgtgag cattgatgca gatcttcagg atgatacttc tgttattgca ttaatggtta
+      721 aagagtacct caatggtaat gagatcgtat atggtgtacg taatgatcgg gcatctgact
+      781 ctccatttaa aagaaaaacc gcgggtctct tttatagttg tatgaagtgg atgggcgttc
+      841 agcagattcc ccagcatgcc gatttcagat tgttaagcga tcgcgctaaa aaggcgttgc
+      901 tgagctttaa agagcaaaat ttatatttac gtggattagt gccactgatt ggttatcgtt
+      961 ccactcaggt tgcatattcg agaacagttc gtctggctgg tgagtcgaag tatccattaa
+     1021 aaaaaatgct tgccctggca attgaaggca tcacttctct gaccattacc ccattaagga
+     1081 ttatcgctat ctcagggttt gcgatcagta ttctttcggt gtttgcagca atttatgcgc
+     1141 tgtttgagaa attctcagga aataccgttg aagggtgggc ctcagttatg atcgctattt
+     1201 tcttcctcgg cggtgtacag atgctatcgt taggcatcat tggtgagtat gtcggtaaga
+     1261 tatatatgga gtctaaaggt cgtcctaaat attttatcga aaaaacaacg cagcagaagg
+     1321 ttgatgatga ataacgaaaa aataatcagt agaatacagt atgtggtttt catcgccata
+     1381 ttatcagtac tttgttattc ctttgtggca agtgaaaatc aggcttatac atgggattcc
+     1441 agattttact gggtagtatg gaatgagtat actggcttac tgcatgactc atttagtcaa
+     1501 tggcttgcta gtataaaata ttcagtctat tcagctgatt ataacccttt gccagtcatt
+     1561 gcactgatcc cgtttaacta tctgccgata ggaaatcggg aagcttacat attgggtgtg
+     1621 tatattcttt attttttacc atttgcttat attgcaatgc gtttgtttgc gactgctgca
+     1681 ggtgtggaca acaaatacca gccgtatatc tttgtttttc tagcaagttt tattcctttt
+     1741 gtcacaccaa ccttgcgcgg ctatcctgat atcataggaa tgattccact atcgctgtgc
+     1801 tgcttgattt tgtttaaagt taatatatta aaataccagg gagcccgttt tgtattactt
+     1861 gcaatcgcaa tgggagtatt actatggttg ccctttgcat tccggcgatg gtatgcctat
+     1921 agcgtagttt ccttgtttgt tacgttgcct tttttgaata catttttgtt tggtgaagcg
+     1981 gccaggttta gtaaggaaag aatatttaaa tactttattt tctttactct tgctgggctg
+     2041 gtagtgattt ttttagtctg tgtttttcag ttcgaactgg ctgagagaat actgacgaca
+     2101 aactattctg atatttattc tgcatacaag gcgacagagc aagcaacctt ctttataacc
+     2161 ctgaactatg ctggtttata tttgctgcca ttaattgttt tgggaggctt gtatattgtt
+     2221 ttcggaacca gttttcgtct taaagttctt tgtgcttttg cattagctaa ttttattatc
+     2281 acctatatta tatttacccg aacacagaca ccagggatgc agcatggaat gccattctgt
+     2341 ttttggttgg caatactcgt gctttcggct ttaaaactat tctttgacag gttaaacaaa
+     2401 ggactggcga taacacttac gatcgcattc tcattactta cggttgcagt ctttattagc
+     2461 acatattcaa aaccatttgg taaaccggaa ttagtcagcc gctatcttcc tggcaaagag
+     2521 tatcctctca gattagaaaa ctttgaccat taccatgagc tgattgcctt tatgagtgcg
+     2581 cgtgttgagc cagatgataa ggtggctgtt tttgcatcca gcggttcact taataatgat
+     2641 ctgttttcag ccatatcacc ggcatcgttt gtcagccata tcgccaacgt ctcgcaggta
+     2701 gatttgcgtg ataagctggc tactgaagcc tttagttctc gttatgcgat tgtgaccgat
+     2761 ccagctcaaa cgcatcttgg aattcatgga caacaggtta tatatctgcc aaataatctt
+     2821 atcctggaag gtaagggaat tggtgccgct tataaaagag tcagtcagcc atttatactt
+     2881 agcggcggtg ttcaggcata cgtctatgaa aaaatcagac cgtatacaat cgatgaatac
+     2941 agaagcatga ttgatgaatt ttcccgttcc tacccctcat ggaaagcgga gtatgaaaat
+     3001 aatcttacag agagttatct aagcgcggca gtggagaaag gcgacacctg ggggcagttt
+     3061 agtatgtatc gtgaaggtcg aatctacgcc cacccaggcg caacaacacc gactattgtc
+     3121 aggatgttcg tcggggaata tgacacctta cgtattacgt cagtcaataa aaactgtggg
+     3181 aaaactgacg gagtgaaagt gattatcaaa gatgcgacgc atcgtgtaga aaagcatatt
+     3241 gccacagcag agagcgtggt ttttgatatg cgcgactttc acaacaaaga tattaccttg
+     3301 attattgata ataatggttc ctctgcttgt gatagcctgg atattaacca gtaa
+//
+LOCUS       wbmVW                   2733 bp    DNA     linear   BCT 14-FEB-2018
+DEFINITION  Klebsiella pneumoniae strain 5053 wbmVW genes, complete
+            sequence.
+ACCESSION   MG602074
+VERSION     MG602074.1
+KEYWORDS    .
+SOURCE      Klebsiella pneumoniae
+  ORGANISM  Klebsiella pneumoniae
+            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
+            Enterobacteriaceae; Klebsiella.
+REFERENCE   1  (bases 1 to 4944)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Molecular basis for structural diversity in serogroup O2-antigen
+            polysaccharides in Klebsiella pneumoniae
+  JOURNAL   Unpublished
+REFERENCE   2  (bases 1 to 4944)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Direct Submission
+  JOURNAL   Submitted (04-DEC-2017) Dept of Molecular and Cellular Biology,
+            University of Guelph, 50 Stone Road, Guelph, Ontario N1G 2W1,
+            Canada
+COMMENT     ##Assembly-Data-START##
+            Assembly Method       :: SPades v. 3.10.0
+            Assembly Name         :: Klebsiella pneumoniae 5053 wbmVWX cluster
+            Coverage              :: 100x
+            Sequencing Technology :: Illumina
+            ##Assembly-Data-END##
+FEATURES             Location/Qualifiers
+     source          1..2733
+                     /organism="Klebsiella pneumoniae"
+                     /mol_type="genomic DNA"
+                     /strain="5053"
+                     /serotype="O2ac"
+                     /db_xref="taxon:573"
+                     /note="Sequence edited to trim to wbmVW only, Kelly Wyres April 2021"
+                     /note="Extra genes: wbmVW"
+     gene            1..1266
+                     /gene="wbmV"
+     CDS             1..1266
+                     /gene="wbmV"
+                     /note="putative glycosyltransferase involved in
+                     biosynthesis of the K. pneumoniae O2c antigen
+                     polysaccharide"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="WbmV"
+                     /protein_id="AVA30547.1"
+                     /translation="MEWFSDIKNSKIDLSIIIPMYNVEDYIIETLNSMRGVRGFKVEL
+                     ILVDDGSTDNTLNLAKEWANSSDFDTFILSKENGGPSIARNLGLSKARGELVTFFDSD
+                     DIALSYLYTDIINIMLSHGVDFCIFRGSSFDHVTQKVYEFPDYYVWEKIMGGAEFKVL
+                     TPQQEPRLARLEPSAVVRVYRKDFITSNEIQFPENLFFEDVLFHAKCVLNAKKIALLN
+                     KTLLLYRVNREGQTTGSFGKKRRDVLKVIDLIISDYSKMKVTNDIWANLVGLLIRMNV
+                     WCSENCAYQDKADFVKESISLFSKLPKNVLEIYMNEYSYNDWEIKLCKAFIERNEKVL
+                     NVSAEGGYPIFDNVHESLHPEENQVSLVSIDGKLEHLFSQQKEGWAHDRFNVLNGKLE
+                     SLIAQNKELYEKLEHKRSISSLINKIFKG"
+     misc_feature    <1..>2733
+                     /note="wbmVWX gene cluster"
+     gene            1270..2733
+                     /gene="wbmW"
+     CDS             1270..2733
+                     /gene="wbmW"
+                     /note="putative glycosyltransferase involved in
+                     biosynthesis of the K. pneumoniae O2c antigen
+                     polysaccharide"
+                     /codon_start=1
+                     /transl_table=11
+                     /product="WbmW"
+                     /protein_id="AVA30548.1"
+                     /translation="MLRMESFEALMEEMKFSLVDESLARKDYDRLTNSYYGKSCNPLI
+                     FCENLLIIGVRSGLEFLIAKSVNSNAVVHIVESNKSFLEKINKLNIANVTAFSSFKEY
+                     ISSVKSQKYDYVRINRVNFNLDLVAQLYSKHNILSICGELKEADCKPLTLYRLSRNSC
+                     DSFFFNIKDKNYNVAGARKCIEQPEVSVVVAAYGVENYLDECVSSIVHQTLKNIEILI
+                     VDDGSIDGTGKKADEWQKRFPDIVRVIHKENGGCASARLEGMREAKGEYVAFIDGDDW
+                     VEQPMYEDLFESAALHNSEIAQCGFYEFFADQTKNYYSTAWGADGQNGTTGLVKEPTE
+                     YLTLMPSIWRRIYKTSFLRKHGIEFPVQIRRHDDLPFAFMTIARAKRISVIPDCYYAY
+                     RLNRPGQDVGATDERLFIHFEIFEWLYEQVRPWASMAIMSNMRLVEIGTHSWVLSRLD
+                     SHLKNDYLNKSISGISERYKRYELDDGWLDRLKLASR"
+BASE COUNT      882 a    380 c    598 g    873 t
+ORIGIN
+        1 atggaatggt tttctgatat aaaaaattcg aagattgatc tatcaattat tataccgatg
+       61 tataatgttg aggattatat tattgaaact ttgaattcca tgagaggtgt caggggtttc
+      121 aaagttgagt taatactcgt tgacgatgga agcactgata atacgcttaa cctcgctaaa
+      181 gaatgggcta atagtagtga ttttgatact ttcattttaa gtaaagaaaa tggggggcca
+      241 tcaatagcaa ggaatttagg tttaagtaag gcgcgcggtg agttggtgac gttttttgac
+      301 agcgatgata ttgcactatc atatctatac acagatataa tcaatattat gctatctcat
+      361 ggagtggatt tttgcatctt cagaggcagc tctttcgacc atgtcactca aaaagtttat
+      421 gagttccctg attactatgt atgggaaaaa attatgggag gggcggaatt taaagtcctt
+      481 actccacaac aagaaccgcg tttagcacga cttgagccta gcgctgtcgt tagggtttac
+      541 agaaaagatt ttattacatc caacgaaata caatttcctg aaaacctctt ttttgaggat
+      601 gttctgtttc acgctaaatg tgtattgaac gcaaaaaaaa tcgctctttt gaataaaacg
+      661 cttctgctct atcgagtaaa ccgtgagggg cagacgaccg gttctttcgg aaaaaaaaga
+      721 agggatgtac tcaaagttat tgatctcatt atatccgatt attctaaaat gaaagtgacc
+      781 aatgatattt gggctaatct cgttgggctt ttaataagga tgaatgtttg gtgcagtgaa
+      841 aattgtgctt atcaggataa ggctgatttt gtaaaggaat caatctcttt gttttccaag
+      901 ctcccaaaga atgtgcttga gatttacatg aatgaatatt catataacga ttgggaaata
+      961 aagttgtgta aagccttcat agaaagaaat gaaaaggttt tgaatgttag cgctgaaggt
+     1021 ggttatccca tctttgataa tgttcatgaa tcattacacc cagaagaaaa tcaagtttct
+     1081 ttagtttcta ttgatgggaa gctagaacat cttttctctc agcaaaagga ggggtgggca
+     1141 cacgaccgat ttaatgtttt gaatggtaaa ttagaaagct tgattgctca gaataaagaa
+     1201 ttatacgaaa agttagagca taagcgttca atttcatctc tcattaataa aatttttaaa
+     1261 ggataagata tgttaagaat ggagtctttc gaagctctaa tggaagaaat gaagttttca
+     1321 ttagtggatg agtcattggc aagaaaagat tatgatagat taactaatag ctattatggg
+     1381 aagtcttgca atccattaat attttgcgag aacctactga tcattggagt aagatccggc
+     1441 ttggagtttt tgattgctaa aagcgttaat agcaatgcag ttgttcatat tgttgaaagt
+     1501 aacaaatctt ttttggaaaa aataaataaa cttaatattg ccaatgttac tgctttttct
+     1561 tcatttaaag aatatataag ttcagtaaag tctcaaaaat atgattacgt tagaattaat
+     1621 cgagttaatt ttaatttgga tttggtagct cagttatatt ctaagcacaa tattttatca
+     1681 atttgtgggg aattgaaaga ggctgattgt aagccgttaa ctctttacag gttgagtagg
+     1741 aatagttgtg attctttttt ctttaatatc aaggataaga attacaatgt cgcaggagca
+     1801 cgtaagtgta tcgaacagcc cgaagtatct gtagtggttg ctgcgtatgg ggttgaaaat
+     1861 tatttagatg aatgtgtctc ttcaatagta catcaaactc ttaagaatat tgaaattctc
+     1921 atcgttgatg atggctccat agatggcacc ggaaaaaaag cagacgaatg gcaaaagaga
+     1981 tttcctgata tagtgagggt aattcataaa gaaaatggtg gatgcgcttc tgcgcgctta
+     2041 gaaggcatga gggaagctaa aggcgagtat gtcgcattca ttgatggtga tgactgggtt
+     2101 gagcaaccaa tgtatgagga tttattcgaa tcagctgcgt tgcataattc tgaaatagct
+     2161 caatgtggtt tttatgagtt ttttgctgac caaactaaga attactattc aacagcttgg
+     2221 ggggctgatg gtcaaaatgg gactactgga ttagtaaaag aaccaacaga atatttgact
+     2281 ttgatgccgt ctatctggag aaggatttat aagacatcgt ttttgagaaa gcatggcatt
+     2341 gaatttcctg tgcagataag aagacatgat gatttgccat ttgcatttat gactattgcc
+     2401 cgtgcaaaaa gaataagtgt gatacctgat tgctattatg cttataggct taatcgccct
+     2461 ggacaagatg ttggtgctac agatgaacgg ctttttatac attttgaaat ttttgagtgg
+     2521 ctttatgagc aagttagacc atgggcctca atggcaatca tgagtaatat gagacttgtc
+     2581 gaaatcggaa cgcatagctg ggtcctgagc agattagata gtcatttaaa aaatgattat
+     2641 ttaaataaat ctatttctgg aattagtgaa agatataaga ggtatgagtt agatgatggc
+     2701 tggctagata gacttaagct ggccagccgc taa
+//
+LOCUS       wbbY                    2223 bp    DNA     linear   BCT 14-FEB-2018
+DEFINITION  Klebsiella pneumoniae strain CWK2 wbbY gene, complete
+            sequence.
+ACCESSION   MG458672
+VERSION     MG458672.1
+KEYWORDS    .
+SOURCE      Klebsiella pneumoniae
+  ORGANISM  Klebsiella pneumoniae
+            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
+            Enterobacteriaceae; Klebsiella.
+REFERENCE   1  (bases 1 to 5324)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Molecular basis for structural diversity in serogroup O2-antigen
+            polysaccharides in Klebsiella pneumoniae
+  JOURNAL   Unpublished
+REFERENCE   2  (bases 1 to 5324)
+  AUTHORS   Clarke,B.R., Ovchinnikova,O.G., Kelly,S.D., Williamson,M.L.,
+            Butler,J.E., Liu,B., Wang,L., Gou,X., Follador,R., Lowary,T.L. and
+            Whitfield,C.
+  TITLE     Direct Submission
+  JOURNAL   Submitted (30-OCT-2017) Dept of Molecular and Cellular Biology,
+            University of Guelph, 50 Stone Road, Guelph, Ontario N1G 2W1,
+            Canada
+COMMENT     ##Assembly-Data-START##
+            Assembly Method       :: SPAdes v. 3.10.0
+            Assembly Name         :: Klebsiella pneumoniae CWK2 wbbYZ region
+            Sequencing Technology :: Illumina
+            ##Assembly-Data-END##
+FEATURES             Location/Qualifiers
+     source          1..2223
+                     /organism="Klebsiella pneumoniae"
+                     /mol_type="genomic DNA"
+                     /strain="CWK2"
+                     /serotype="O1"
+                     /db_xref="taxon:573"
+                     /note="Sequence edited to trim to wbbY only, Kelly Wyres April 2021"
+                     /note="Extra genes: wbbY"
      gene            complement(1..2223)
                      /gene="wbbY"
      CDS             complement(1..2223)
                      /gene="wbbY"
-                     /inference="ab initio prediction:Prodigal:2.60"
-                     /inference="similar to AA sequence:RefSeq:YP_006496170.1"
+                     /note="glycosyltransferase involved in biosynthesis of the
+                     O1 antigen polysaccharide (D-galactan II)"
                      /codon_start=1
                      /transl_table=11
-                     /product="glycosyl transferase"
-                     /protein_id="CZQ25346.1"
-                     /db_xref="GI:1036211141"
+                     /product="WbbY"
+                     /protein_id="AVA30562.1"
                      /translation="MKKILIMTPDIEGPVRNGGIGTAFTALATTLAKKGYDVDVLYTC
-                     GDYSESSVSKFSDWSRIYSTFGINLLRTGLIKEINIDAPYFRRKSYSIYLWLKENNTY
+                     GDYSESSVSKFSDWSRIYSTFGINLLRTGLIKEINIDAPYFRRKSYSIYLWLKENNIY
                      DTVISCEWQADLYYTLLSKKNGTDFENTKFIVNTHSSTLWADEGNYQLPYDQNHLELY
                      YMEKMVVEMADEVVSPSQYLIDWMLSKHWNVPEERHVILNCEPFQGFVTRDDVTVKIN
                      EKPASGVELVFFGRLETRKGLDIFLRALRKLSDEDKESISGVTFLGKNVTMGKTDSFT
                      YIMNQTKNLGLAVNVISDYDRTNANEYIKRKNVLVIIPSLVENSPYTVYECLINNVNF
-                     LASNVGGIPELIPQEHHAEVLFIPTPADLYGKIHYRLKNINIKPGLAESQDNIKEAWF
-                     VAVERKNNRTFKKIDEANSPLVSVCITHFERHHLLQQALASIKSQTYQNIEVILVDDG
+                     LASNVGGIPELIPQEHHAEVLFIPTPVDLYGKIHYRLKNINIKPGLAESQDNIKEAWF
+                     VAVERKNNRAFKKIDEANSPLVSVCITHFERHHLLQQALASIKSQTYQNIEVILVDDG
                      STTEDSHRYLNLIENDFNSRGWKIVRSSNNYLGAARNLAARHASGEYLMFMDDDNVAK
                      PFEVETFVTAALNSGADVLTTPSDLIFGEEFPSPFRKMTHCWLPLGPDLNIASFSNCF
                      GDANALIRKEVFEKVGGFTEDYGLGHEDWEFFAKISLQGYKLQIVPEPLFWYRVANSG
                      MLLSGNKSKNNYRSFRPFMDENVKYNYAMGLIPSYLEKIQELESEVNRLRSINGGHSV
                      SNELQLLNNKVDGLISQQRDGWAHDRFNALYEAIHVQGAKRGTSLVRRVARKVKSMLK
                      "
-     gene            2321..3130
-                     /gene="wbbZ"
-     CDS             2321..3130
-                     /gene="wbbZ"
-                     /inference="ab initio prediction:Prodigal:2.60"
-                     /codon_start=1
-                     /transl_table=11
-                     /product="Exopolysaccharide biosynthesis protein"
-                     /protein_id="CZQ25347.1"
-                     /db_xref="GI:1036211142"
-                     /translation="MTNMKLKFDLLLKSYHLSHRFVYKANPGNAGDGVIASATYDFFE
-                     RNALTYIPYRDGERYSSETDILIFGGGGNLIEGLYSEGHDFIQNNIGKFHKVIIMPST
-                     IRGYSDLFINNIDKFVVFCRENITFDYIKSLNYEPNKNVFITDDMAFYLDLNKYLSLK
-                     PVYKKQANCFRTDSESLTGDYKENNHDISLTWNGDYWDNEFLARNSTRCMINFLEEYK
-                     VVNTDRLHVAILASLLGKEVNFYPNSYYKNEAVYNYSLFNRYPKTCFITAS"
-BASE COUNT      955 a    640 c    554 g    981 t
+     misc_feature    1..2223
+                     /note="wbbY gene cluster"
+BASE COUNT      664 a    480 c    392 g    687 t
 ORIGIN
         1 ttattttaac attgatttca ctttccgggc aacccggcga accaggctgg tgcctcgttt
        61 tgcgccttgg acatgaattg cttcatacag agcattaaaa cggtcatggg cccagccatc
       121 tctttgctga gaaataagac catcaacctt attatttaaa agttgtaact cgttactgac
-      181 agaatgacca ccattgatgc tccgcaagcg attcacttca ctctcaagtt cttgaatctt
+      181 agaatgacca ccattgatgc tccgtaatcg attcacttca ctctcaagtt cttgaatctt
       241 ctcgaggtag gaaggtatca accccattgc atagttatat ttaacattct catccataaa
       301 aggacggaaa ctgcggtagt tatttttact cttatttcca cttaacaaca tgccggagtt
       361 tgcaactcta taccaaaata gaggttccgg gacgatttgc aatttatatc cctgtaatga
@@ -4326,11 +4951,11 @@ ORIGIN
       781 atagttatta gaactacgga caattttcca gcctcgagag ttaaaatcat tctcgatgag
       841 attcaaataa cgatgagaat cttctgtcgt acttccatca tcaaccaaga tgacctcaat
       901 attttggtac gtctgagatt ttattgatgc gagtgcttgc tgaagcaaat ggtgacgttc
-      961 gaagtgagtt atacacacgc taactaacgg gctgttagct tcatcgattt tcttgaatgt
+      961 gaagtgagtt atacacacgc taactaacgg gctgttagct tcatcgattt tcttgaatgc
      1021 gcggttgttt tttcgttcaa ctgcgacaaa ccaagcttct ttaatattgt cttgtgattc
-     1081 agcaagccct ggttttatat ttatattttt taagcgatag tggatttttc cgtataaatc
-     1141 ggcaggtgta ggaataaata gaacttccgc atgatgctcc tgcggaataa gctctggaat
-     1201 tccaccaacg tttgaagcga gaaaattaac gttattaatc aaacattcat aaacagtata
+     1081 agcaagccct ggttttatat ttatattttt taagcgatag tggattttcc cgtataaatc
+     1141 gacaggtgta ggaataaata gaacttccgc atgatgctcc tgcggaataa gctctggaat
+     1201 tccaccaacg tttgaagcga ggaaattaac gttattaatc aagcattcat aaacagtata
      1261 gggtgagttt tctacaagtg atggaatgat gactaataca ttttttcttt ttatatattc
      1321 attagcgttg gtacgatcat agtcgctgat gacattaact gcgagtccca aatttttagt
      1381 ctgattcata atataagtaa atgaatcagt tttccccata gtgacatttt ttccgaggaa
@@ -4342,25 +4967,10 @@ ORIGIN
      1741 ataatagagt tcaagatggt tctgatcata tggaagctgg taattacctt catcagccca
      1801 taacgttgaa ctgtgagtat ttacaatgaa ctttgtattt tcaaaatccg ttccattctt
      1861 tttgcttaat aaagtgtaat aaagatctgc ctgccactca caagaaataa cagtgtcata
-     1921 ggtgttattt tctttcaacc agagataaat tgaataactt ttccttctaa aatacggtgc
+     1921 gatgttattt tctttcaacc agagataaat tgaataactt ttccttctaa aatacggtgc
      1981 atcaatatta atctctttta tcagtccggt tcttagcaga ttgataccaa aggtactata
      2041 aatacgtgac cagtcgctaa atttcgatac agatgattca gaatagtcgc cacatgtata
-     2101 caatacatca acatcatacc ccttttttgc caaagtagtg gcaagggcag tgaaagcagt
-     2161 accaataccg ccgttacgga caggcccctc aatgtccggc gtcattataa gaattttctt
-     2221 cattgtaacc cttcctttgt aacctagact tttctatgat attagtgaat tgaagtagtg
-     2281 taagatagca gtcggtagct tctgttaaac aggataaaaa atgaccaata tgaagttaaa
-     2341 atttgatttg cttctaaaat cttatcatct atctcatcga tttgtctata aggcaaaccc
-     2401 tggtaatgct ggtgatggtg taattgcatc tgcgacgtat gacttttttg aacgaaatgc
-     2461 tcttacctat atcccttaca gagatggcga gcgctacagt tctgaaactg atattttaat
-     2521 ttttggaggc ggaggaaacc tgatagaagg attgtattct gaaggtcatg actttatcca
-     2581 gaataatatt gggaagtttc ataaagtaat aataatgccg tcgacaatca gagggtatag
-     2641 cgatttattc atcaacaata ttgataagtt tgttgttttt tgtcgcgaaa atatcacctt
-     2701 cgattatatt aaatctctca actacgaacc aaacaagaac gtattcatta ctgatgatat
-     2761 ggcattttat ctcgatctta ataaatacct gtcacttaaa cccgtctata aaaaacaggc
-     2821 caactgcttc agaacggact ccgaatctct aactggagac tacaaagaaa acaatcatga
-     2881 tatttcgctc acctggaatg gcgattattg ggataatgaa tttctggcgc gtaattctac
-     2941 ccgttgcatg ataaactttc ttgaagagta taaagttgtc aataccgaca ggctgcatgt
-     3001 ggcaatttta gcatctctgc ttggcaaaga agtcaacttc tatcctaact catattacaa
-     3061 aaatgaagct gtttacaatt attcactttt taatcgttat ccaaaaacat gctttattac
-     3121 ggcaagttga
+     2101 caatacatca acatcatacc ccttttttgc caaagtagtg gcaagagcag tgaaagcagt
+     2161 accgataccg ccgttacgga caggcccctc aatgtccggc gtcattataa gaattttctt
+     2221 cat
 //
\ No newline at end of file


=====================================
reference_database/Klebsiella_o_locus_primary_reference.logic
=====================================
@@ -0,0 +1,21 @@
+locus	extra_loci	type
+O1/O2v1	None	O2a
+O1/O2v2	None	O2afg
+O1/O2v3	None	O2a
+O8	None	O2a
+O1/O2v1	wbbY	O1
+O1/O2v2	wbbY	O1
+O1/O2v3	wbbY	O1
+O8	wbbY	O8
+O1/O2v1	wbmVW	O2ac
+O1/O2v2	wbmVW	O2ac
+O1/O2v3	wbmVW	O2ac
+O8	wbmVW	O2ac
+O1/O2v1	gmlABD	O2aeh
+O1/O2v2	gmlABD	O2aeh
+O1/O2v3	gmlABD	O2aeh
+O8	gmlABD	O2aeh
+O1/O2v1	wbbY,wbmVW	O1 (O2ac)
+O1/O2v2	wbbY,wbmVW	O1 (O2ac)
+O1/O2v3	wbbY,wbmVW	O1 (O2ac)
+O8	wbbY,wbmVW	O1 (O2ac)
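
The new `.logic` file above is a tab-separated lookup table. Each row maps a best-match O locus plus an optional comma-separated set of extra loci (`None` when there are none) to a predicted type. The following is a minimal sketch of how such a table could be read and queried. It is not taken from `kaptive.py`, whose handling of this file is not shown here. The names `load_logic` and `predict_type`, and the `"unknown"` fallback, are hypothetical, and only a few rows of the table are reproduced.

```python
import csv
import io

# A few rows copied from Klebsiella_o_locus_primary_reference.logic (tab-separated).
LOGIC_TABLE = (
    "locus\textra_loci\ttype\n"
    "O1/O2v1\tNone\tO2a\n"
    "O1/O2v2\tNone\tO2afg\n"
    "O1/O2v1\twbbY\tO1\n"
    "O8\twbbY\tO8\n"
    "O1/O2v1\twbmVW\tO2ac\n"
    "O8\twbbY,wbmVW\tO1 (O2ac)\n"
)


def load_logic(text):
    """Parse the table into {(locus, frozenset(extra_loci)): type}.

    Hypothetical helper: the set-based key is an assumption made so that
    the order of the extra loci does not matter during lookup.
    """
    table = {}
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        extra = row["extra_loci"]
        # The literal string "None" marks a row with no extra loci.
        genes = frozenset() if extra == "None" else frozenset(extra.split(","))
        table[(row["locus"], genes)] = row["type"]
    return table


def predict_type(table, locus, extra_loci=()):
    """Return the type for a locus and its extra loci, or "unknown"."""
    return table.get((locus, frozenset(extra_loci)), "unknown")


table = load_logic(LOGIC_TABLE)
print(predict_type(table, "O1/O2v1", ["wbbY"]))
print(predict_type(table, "O8", ["wbmVW", "wbbY"]))
```

Keying on a `frozenset` means `wbbY,wbmVW` and `wbmVW,wbbY` resolve to the same row. Whether upstream treats the order as significant is not stated in this commit.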



View it on GitLab: https://salsa.debian.org/med-team/kaptive/-/commit/0d2120ce8179d99ce8354cc96245581c1c879fd1


