[Debian-med-packaging] Bug#944785: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph
Michael R. Crusoe
michael.crusoe at gmail.com
Fri Nov 15 11:21:05 GMT 2019
Package: wnpp
Severity: wishlist
Subject: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph
Owner: Michael R. Crusoe <michael.crusoe at gmail.com>
* Package name : pufferfish
Version : 1.0.0
Upstream Author : 2016 Rob Patro, Avi Srivastava, Hirak Sarkar
* URL : https://github.com/COMBINE-lab/pufferfish
* License : GPL-3+
Programming Lang: C++
Description : An efficient index for the colored, compacted, de Bruijn graph
Pufferfish is a new time- and memory-efficient data structure for indexing a
compacted, colored de Bruijn graph (ccdBG).
.
Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an
assembly and sequence comparison data structure, it has only relatively
recently begun to see use as an index of the reference sequences (e.g. deBGA,
kallisto). In particular, these tools index the compacted dBG (cdBG), in which
all non-branching paths are collapsed into individual nodes and labeled with
the string they spell out. This data structure is particularly well-suited for
representing repetitive reference sequences, since a single contig in the cdBG
represents all occurrences of the repeated sequence. The original positions in
the reference can be recovered with the help of an auxiliary "contig table"
that maps each contig to the reference sequence, position, and orientation
where it appears as a substring. The deBGA paper has a nice description of how
this kind of index looks (they call it a unipath index, because the contigs we
index are unitigs in the cdBG), and how all the pieces fit together to be able
to resolve the queries we care about. Moreover, the cdBG can be built on
multiple reference sequences (transcripts, chromosomes, genomes), where each
reference is given a distinct color (or colour, if you're of the British
persuasion). The resulting structure, which also encodes the relationships
between the cdBGs of the underlying reference sequences, is called the
compacted, colored de Bruijn graph (ccdBG).
.
This is not, of course, the only variant of the dBG that has proven useful
from an indexing perspective. The
(pruned) dBG has also proven useful as a graph upon which to build a path
index of arbitrary variation / sequence graphs, which has enabled very
interesting and clever indexing schemes like that adopted in GCSA2. Also,
thinking about sequence search in terms of the dBG has led to interesting
representations for variation-aware sequence search backed by indexes like the
vBWT (implemented in the excellent gramtools package).
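
To make the description above concrete, here is a minimal Python sketch of the
idea: collapse non-branching paths of k-mers into unitigs, then build a
"contig table" that maps each unitig back to reference name, position, and
orientation. It is a toy under stated assumptions, not pufferfish's actual
implementation: it is forward-strand only (no reverse complements, so the
orientation is always "+"), it forces unitigs to break at the first and last
k-mer of every reference, and the names build_cdbg and locate are hypothetical.

```python
from collections import defaultdict


def build_cdbg(refs, k):
    """Toy forward-strand cdBG: return (unitigs, contig_table)."""
    succ, pred = defaultdict(set), defaultdict(set)
    kmers, starts, ends = set(), set(), set()
    for seq in refs.values():
        ks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        if not ks:
            continue  # a reference shorter than k contributes nothing
        kmers.update(ks)
        starts.add(ks[0])  # assumption: unitigs break at reference ends
        ends.add(ks[-1])
        for a, b in zip(ks, ks[1:]):
            succ[a].add(b)
            pred[b].add(a)

    def extends(a):
        # True when `a` and its only successor belong to the same unitig.
        if a in ends or len(succ[a]) != 1:
            return False
        (b,) = succ[a]
        return b != a and b not in starts and len(pred[b]) == 1

    def is_start(x):
        # A k-mer opens a unitig unless its unique predecessor extends into it.
        return not (len(pred[x]) == 1 and extends(next(iter(pred[x]))))

    unitigs = []
    for x in sorted(kmers):
        if not is_start(x):
            continue
        path = [x]
        while extends(path[-1]):
            (nxt,) = succ[path[-1]]
            path.append(nxt)
        # Spell the path: first k-mer plus the last base of each later k-mer.
        unitigs.append(path[0] + "".join(p[-1] for p in path[1:]))

    # Contig table: unitig -> [(reference name, position, orientation)].
    contig_table = {}
    for u in unitigs:
        hits = []
        for name, seq in refs.items():
            pos = seq.find(u)
            while pos != -1:
                hits.append((name, pos, "+"))  # forward-only toy model
                pos = seq.find(u, pos + 1)
        contig_table[u] = hits
    return unitigs, contig_table


def locate(kmer, unitigs, contig_table):
    """Map a k-mer (of the same length k) to its reference positions."""
    for u in unitigs:
        off = u.find(kmer)
        if off != -1:
            return [(n, p + off, s) for n, p, s in contig_table[u]]
    return []


refs = {"t1": "ACGTAGC", "t2": "TCGTAGG"}
unitigs, table = build_cdbg(refs, 3)
# The "colors" of a unitig are the references in which it occurs.
colors = {u: sorted({name for name, _, _ in table[u]}) for u in unitigs}
```

With these two toy references and k = 3, the shared stretch CGTAG becomes a
single unitig whose contig-table entry lists one occurrence in each of t1 and
t2, which is the sense in which one contig stands for every occurrence of a
repeated sequence.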
Remark: This package is maintained by the Debian Med Packaging Team at
https://salsa.debian.org/med-team/pufferfish
This package will be team-maintained by Debian-Med