[med-svn] [patman] 12/15: New upstream version 1.2.2+dfsg
Andreas Tille
tille at debian.org
Wed Dec 13 19:28:26 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository patman.
commit 9a9e681d00105d330ef308c356ef358170604697
Author: Andreas Tille <tille at debian.org>
Date: Wed Dec 13 20:24:05 2017 +0100
New upstream version 1.2.2+dfsg
---
AUTHORS | 2 +
CHANGES | 6 +
LICENSE | 340 ++++++++++++++++++++++++
Makefile | 49 ++++
README | 10 +
debian/changelog | 6 -
debian/clean | 1 -
debian/compat | 1 -
debian/control | 16 --
debian/copyright | 13 -
debian/docs | 1 -
debian/rules | 4 -
debian/upstream/metadata | 12 -
debian/watch | 5 -
fasta.cpp | 47 ++++
fasta.h | 30 +++
global.h | 35 +++
main.cpp | 201 ++++++++++++++
patman.1 | 301 +++++++++++++++++++++
prefix_tree.cpp | 668 +++++++++++++++++++++++++++++++++++++++++++++++
prefix_tree.h | 92 +++++++
21 files changed, 1781 insertions(+), 59 deletions(-)
diff --git a/AUTHORS b/AUTHORS
new file mode 100644
index 0000000..7f2b374
--- /dev/null
+++ b/AUTHORS
@@ -0,0 +1,2 @@
+Kay Pruefer <pruefer at eva.mpg.de>
+Udo Stenzel <udo_stenzel at eva.mpg.de>
diff --git a/CHANGES b/CHANGES
new file mode 100644
index 0000000..bfdebe7
--- /dev/null
+++ b/CHANGES
@@ -0,0 +1,6 @@
+
+Version 1.2.1
+- fixed bug in makefile
+- added cstdlib & cstring headers to make it compile with g++ 4.4
+Thanks to Dan Tomlinson for reporting this!
+
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..b7b5f53
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,340 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) year name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..ab592ea
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,49 @@
+# PatMaN DNA pattern matcher
+# (C) 2007 Kay Pruefer, Udo Stenzel
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or (at
+# your option) any later version. See the LICENSE file for details.
+
+prefix := ${HOME}
+version := 1.2
+
+TARGETS := patman
+OBJS := prefix_tree.o fasta.o main.o
+CPPFLAGS += -Wall
+CPPFLAGS += -O3 -funroll-loops -DNDEBUG -march=k8
+# CPPFLAGS += -ggdb
+LDLIBS += -lpopt
+
+all: $(TARGETS)
+
+patman: $(OBJS)
+ g++ $(CPPFLAGS) -o $@ $(OBJS) $(LDFLAGS) $(LDLIBS)
+
+%.o: %.cpp
+ g++ -DVERSION="\"$(version)\"" $(CPPFLAGS) -c -o $@ $<
+
+install: $(TARGETS)
+ install -d ${prefix}/bin
+ install -d ${prefix}/share/man/man1
+ install -s -m755 $^ ${prefix}/bin
+ install -m644 patman.1 ${prefix}/share/man/man1
+
+fasta.o: fasta.h
+prefix_tree.o: prefix_tree.h global.h
+main.o: fasta.h prefix_tree.h global.h
+
+.SUFFIXES:
+
+.PHONY: clean mrproper dist all
+
+dist: all
+ $(MAKE) clean
+ strip $(TARGETS)
+
+clean:
+ -rm -f $(OBJS)
+
+mrproper: clean
+ -rm -r $(TARGETS)
diff --git a/README b/README
new file mode 100644
index 0000000..d43d87f
--- /dev/null
+++ b/README
@@ -0,0 +1,10 @@
+PatMaN: Pattern Matching in Nucleotide data bases
+
+PatMaN compares a set of short patterns against one or more sequence
+databases. Patterns can match with an arbitrary (but generally small)
+number of mismatches and/or gaps given as arguments. See the man page
+for details.
+
+To install into your home directory, simply call 'make install'. You
+can call 'make install prefix=dir' to install into 'dir' instead. To
+compile, a reasonable C++ compiler and the popt library are needed.
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index 2d28936..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,6 +0,0 @@
-patman (1.2.2+dfsg-1) UNRELEASED; urgency=low
-
- * Initial release (Closes: #482555)
-
- -- Charles Plessy <plessy at debian.org> Thu, 22 May 2008 07:36:37 +0900
-
diff --git a/debian/clean b/debian/clean
deleted file mode 100644
index 7a0d4d9..0000000
--- a/debian/clean
+++ /dev/null
@@ -1 +0,0 @@
-patman
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index f599e28..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-10
diff --git a/debian/control b/debian/control
deleted file mode 100644
index c3be598..0000000
--- a/debian/control
+++ /dev/null
@@ -1,16 +0,0 @@
-Source: patman
-Section: science
-Priority: optional
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Charles Plessy <plessy at debian.org>
-Build-Depends: debhelper (>= 10)
-Standards-Version: 4.1.2
-Homepage: http://bioinf.eva.mpg.de/patman/
-
-Package: patman
-Architecture: any
-Depends: ${shlibs:Depends}, ${misc:Depends}
-Description: rapid alignment of short sequences to large databases
- Patman searches for short patterns in large DNA databases, allowing
- for approximate matches. It is optimized for searching for many small
- pattern at the same time, for example microarray probes.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index f50c689..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,13 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Name: PatMaN
-Source: https://bioinf.eva.mpg.de/patman/patman-1.2.2.tar.gz
-Files-Excluded: */patman
- */.svn
-
-Files: *
-Copyright: © 2008-2009 Kay Pruefer, Udo Stenzel
-License: GPL-2+
-
-Files: debian/*
-Copyright: © 2008 Charles Plessy <plessy at debian.org>
-License: GPL-2+
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index e845566..0000000
--- a/debian/docs
+++ /dev/null
@@ -1 +0,0 @@
-README
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index 053bee6..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,4 +0,0 @@
-#!/usr/bin/make -f
-
-%:
- dh $@
diff --git a/debian/upstream/metadata b/debian/upstream/metadata
deleted file mode 100644
index e085e38..0000000
--- a/debian/upstream/metadata
+++ /dev/null
@@ -1,12 +0,0 @@
-Reference:
- Author: Kay Prüfer and Udo Stenzel and Michael Dannemann and Richard E Green and Michael Lachmann and Janet Kelso
- Title: "PatMaN: rapid alignment of short sequences to large databases"
- Journal: Bioinformatics
- Year: 2008
- Volume: 24
- Number: 13
- Pages: 1530-1
- DOI: 10.1093/bioinformatics/btn223
- PMID: 18467344
- URL: http://bioinformatics.oxfordjournals.org/content/24/13/1530
- eprint: http://bioinformatics.oxfordjournals.org/content/24/13/1530.full.pdf+html
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index 63fd461..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,5 +0,0 @@
-version=4
-
-opts="repacksuffix=+dfsg,dversionmangle=s/\+dfsg//g,repack,compression=xz" \
- https://bioinf.eva.mpg.de/patman/ patman-(.*)\.tar\.gz
-
diff --git a/fasta.cpp b/fasta.cpp
new file mode 100644
index 0000000..60d0bf8
--- /dev/null
+++ b/fasta.cpp
@@ -0,0 +1,47 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+#include "fasta.h"
+
+using namespace std ;
+
+fasta* fasta_fac::get( )
+{
+ string headline ;
+ fasta* ret = new fasta ;
+ ret->null = 0 ;
+ // headline
+ getline( in, ret->headerline ) ;
+ line++ ;
+ if ( ret->headerline[0] != '>' ) {
+ if ( !in.eof() ) cerr << "No headerline in fasta file at line " << line << endl ;
+ ret->null = 1 ;
+ return ret ;
+ } else {
+ // cut the '>' in headerline
+ ret->headerline = ret->headerline.substr( 1, ret->headerline.size()-1 ) ;
+
+ // read the sequence
+ // handles eof, \n and '>', everything else is
+ // considered part of the sequence.
+ int i ;
+ while ( (i = in.get()) && in.good() ) {
+ if ( i == '>' ) {
+ in.putback( i ) ;
+ return ret ;
+ } else {
+ // add to seq ;
+ if ( i != '\n' )
+ ret->seq += static_cast<char>(i) ;
+ else
+ line++ ;
+ }
+ }
+ return ret ;
+ }
+}
diff --git a/fasta.h b/fasta.h
new file mode 100644
index 0000000..ce8a7e7
--- /dev/null
+++ b/fasta.h
@@ -0,0 +1,30 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+#ifndef INCLUDED_FASTA_H
+#define INCLUDED_FASTA_H
+
+#include <string>
+#include <iostream>
+
+typedef struct fasta {
+ std::string seq ;
+ std::string headerline ;
+ int null ;
+} fasta ;
+
+class fasta_fac {
+ public:
+ fasta_fac( std::istream& i ) : in(i), line(0) { }
+ fasta* get( ) ;
+ private:
+ std::istream& in ;
+ int line ;
+} ;
+
+#endif
diff --git a/global.h b/global.h
new file mode 100644
index 0000000..bf16f53
--- /dev/null
+++ b/global.h
@@ -0,0 +1,35 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+#ifndef INCLUDED_GLOBAL_H
+#define INCLUDED_GLOBAL_H
+
+#include <iosfwd>
+
+extern size_t cutoff ;
+extern size_t allow_gaps ;
+extern long position_genome ;
+extern const char* genome_name ;
+extern int only_plus_strand ;
+extern int discount_adenine ;
+extern int quiet ;
+extern int chop5 ;
+extern std::ostream *output ;
+extern int debug_flags ;
+extern int do_prefetch ;
+extern long long total_nodes ;
+
+static const int debug_notquiet = 1 ;
+static const int debug_verbose = 2 ;
+static const int debug_trie = 4 ;
+static const int debug_countnodes = 8 ;
+static const int debug_numnodes = 16 ;
+static const int debug_sequence = 32 ;
+
+#endif
+
diff --git a/main.cpp b/main.cpp
new file mode 100644
index 0000000..10a8b44
--- /dev/null
+++ b/main.cpp
@@ -0,0 +1,201 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+
+#include <iostream>
+#include <string>
+#include <vector>
+#include <fstream>
+#include <cstdlib>
+#include <cstring>
+
+#include <popt.h>
+
+#include "prefix_tree.h"
+#include "fasta.h"
+
+using namespace std ;
+
+// global variables
+size_t cutoff = 2 ;
+size_t allow_gaps = 0 ;
+size_t minlength = 0 ;
+long long total_nodes = 0 ;
+int only_plus_strand = 0 ;
+int discount_adenine = 0 ;
+int ambi_codes = 0 ;
+int chop5 = 0, chop3 = 0 ;
+int debug_flags = debug_notquiet ;
+int do_prefetch = 3 ;
+ostream *output = 0 ;
+
+long position_genome ;
+const char* genome_name ;
+
+enum option_flag { opt_none, opt_db, opt_pat, opt_output, opt_version } ;
+
+struct poptOption options[] = {
+ { "version", 'V', POPT_ARG_NONE, 0, opt_version, "Print version number and exit", 0 },
+ { "edits", 'e', POPT_ARG_INT | POPT_ARGFLAG_SHOW_DEFAULT, &cutoff, 0, "Set maximum edit distance to N", "N" },
+ { "gaps", 'g', POPT_ARG_INT | POPT_ARGFLAG_SHOW_DEFAULT, &allow_gaps, 0, "Set maximum number of gaps to N", "N" },
+ { "databases", 'D', POPT_ARG_NONE, 0, opt_db, "Following files are databases", 0 },
+ { "patterns", 'P', POPT_ARG_NONE, 0, opt_pat, "Following files contain patterns", 0 },
+ { "output", 'o', POPT_ARG_STRING, 0, opt_output, "Write output to FILE", "FILE" },
+ { "ambicodes", 'a', POPT_ARG_NONE, &ambi_codes, 0, "Interpret ambiguity codes in patterns", 0 },
+ { "singlestrand", 's', POPT_ARG_NONE, &only_plus_strand, 0, "Do not match reverse-complements", 0 },
+ { "prefetch", 'p', POPT_ARG_INT | POPT_ARGFLAG_SHOW_DEFAULT, &do_prefetch, 0, "Prefetch N nodes", "N" },
+ { "min-length", 'l', POPT_ARG_INT | POPT_ARGFLAG_DOC_HIDDEN, &minlength, 0, "Only consider patterns longer than N", "N" },
+ { "chop-3", 'x', POPT_ARG_INT | POPT_ARGFLAG_DOC_HIDDEN, &chop3, 0, "Chop N bp off the 3' end of patterns", "N" },
+ { "chop-5", 'X', POPT_ARG_INT | POPT_ARGFLAG_DOC_HIDDEN, &chop5, 0, "Chop N bp off the 5' end of patterns", "N" },
+ { "adenine-hack", 'A', POPT_ARG_NONE | POPT_ARGFLAG_DOC_HIDDEN, &discount_adenine, 0, "Ignore inserted adenines in db", 0 },
+ { "quiet", 'q', POPT_ARG_NONE | POPT_BIT_CLR, &debug_flags, debug_notquiet, "Turn off warnings", 0 },
+ { "verbose", 'v', POPT_ARG_NONE | POPT_BIT_SET, &debug_flags, debug_verbose, "Print additional progress reports", 0 },
+ { "debug", 'd', POPT_ARG_INT | POPT_ARGFLAG_DOC_HIDDEN, &debug_flags, 0, "Set debug flags to N (see man page)", "N" },
+ POPT_AUTOHELP POPT_TABLEEND
+} ;
+
+
+int main(int argc, const char *argv[]) {
+
+ std::vector<std::string> dbfiles, patfiles ;
+
+ poptContext pc = poptGetContext( "patman", argc, argv, options, 0 ) ;
+ poptSetOtherOptionHelp( pc, "[OPTION...] [input-file...]" ) ;
+ poptReadDefaultConfig( pc, 0 ) ;
+ bool isdbfiles = false ;
+
+ if( argc <= 1 ) { poptPrintHelp( pc, stderr, 0 ) ; return 1 ; }
+ int rc = poptGetNextOpt(pc) ;
+ for(;;) {
+ if( rc == opt_version ) {
+ std::cout << poptGetInvocationName(pc) << ", revision " << VERSION << std::endl ;
+ return 0 ;
+ }
+ else if( rc == opt_db || rc == opt_pat || rc == -1 ) {
+ while( poptPeekArg(pc) )
+ ( isdbfiles ? dbfiles : patfiles ).push_back( poptGetArg(pc) ) ;
+ isdbfiles = (rc == opt_db) ;
+ }
+ else if( rc == opt_output )
+ {
+ delete output ;
+ const char *arg = poptGetOptArg(pc) ;
+ if( 0 == strcmp( arg, "-" ) ) output = 0 ;
+ else output = new std::ofstream( arg ) ;
+ }
+ else {
+ std::cerr << argv[0] << ": " << poptStrerror( rc )
+ << ' ' << poptBadOption( pc, 0 ) << std::endl ;
+ return 1 ;
+ }
+ if( rc == -1 ) break ; else rc = poptGetNextOpt(pc) ;
+ }
+
+ if( dbfiles.empty() )
+ {
+ dbfiles.push_back( "-" ) ;
+ if( debug_flags & debug_notquiet )
+ clog << "No databases given, using stdin." << endl ;
+ }
+ if( patfiles.empty() )
+ {
+ clog << "No patterns given, aborting." << endl ;
+ exit(1) ;
+ }
+
+ prefix_tree trie ;
+ for( size_t i = 0 ; i != patfiles.size() ; ++i )
+ {
+ ifstream *infile = 0 ;
+ if( patfiles[i] != "-" ) {
+ infile = new ifstream( patfiles[i].c_str() ) ;
+ if( debug_flags & debug_verbose ) clog << "reading patterns from " << patfiles[i] << endl ;
+ if( !infile || !*infile ) { cerr << "cannot read " << patfiles[i] << endl ; return 1 ; }
+ }
+ else if( debug_flags & debug_verbose ) clog << "reading patterns from STDIN" << endl ;
+
+ // add pattern to prefix tree
+ fasta_fac f( infile ? *infile : cin ) ;
+ fasta* ff ;
+ while ( (ff = f.get()) && !ff->null ) {
+ if( minlength <= ff->seq.size() )
+ trie.add_seq(
+ ff->seq.substr( chop5, ff->seq.size() - chop5 - chop3 ),
+ ff->headerline, !!ambi_codes ) ;
+ delete ff ;
+ }
+ delete infile ;
+ }
+
+ if( debug_flags & debug_verbose ) clog << "post-processing trie" << endl ;
+ trie.add_suffix_links() ;
+
+#ifndef NDEBUG
+ trie.debug( cerr ) ;
+#endif
+
+ mismatch_ptr *ptrs = 0 ;
+
+ // read database
+ string headerline ;
+ for( size_t i = 0 ; i != dbfiles.size() ; ++i ) {
+ ifstream *dbfile = 0 ;
+ if( dbfiles[i] != "-" ) {
+ dbfile = new ifstream( dbfiles[i].c_str() ) ;
+ if( debug_flags & debug_verbose ) clog << "processing database from " << dbfiles[i] << endl ;
+ if( !dbfile || !*dbfile ) { cerr << "cannot read " << dbfiles[i] << endl ; return 1 ; }
+ }
+ else if( debug_flags & debug_verbose ) clog << "processing database from STDIN" << endl ;
+
+ getline( dbfile ? *dbfile : cin, headerline ) ;
+ while( dbfile ? *dbfile : cin ) {
+ if ( headerline[0] != '>' ) {
+ cerr << "database is not in fasta format" << endl ;
+ exit( 2 ) ;
+ }
+ headerline = headerline.substr( 1, headerline.size()-1 ) ;
+ genome_name = headerline.c_str() ; // global variable
+
+ string seq ;
+ bool eat_n = 0 ;
+ size_t last_n = 0 ;
+ position_genome = 1 ; // global variable
+
+ ptrs = trie.init( ptrs ) ;
+
+ while ( getline( dbfile ? *dbfile : cin, seq )
+ && ( seq.empty() || seq[0] != '>' ) ) {
+ for ( size_t i = 0 ; i < seq.size() ; i++,position_genome++ )
+ {
+ if ( toupper( seq[i] ) == 'N' )
+ if ( eat_n ) continue ;
+ else {
+ last_n ++ ;
+ if ( last_n == cutoff + 1 ) eat_n = 1 ;
+ }
+ else if ( eat_n ) { eat_n = 0 ; last_n = 0 ; }
+
+ if( debug_flags & debug_sequence ) std::clog << seq[i] << std::flush ;
+ ptrs = trie.compare( seq[i], ptrs ) ;
+ }
+ }
+ trie.compare( 0, ptrs ) ;
+ headerline = seq ;
+ }
+
+ delete dbfile ;
+ }
+ ( output ? *output : cout ) << std::flush ;
+ delete output ;
+
+ if( debug_flags & debug_countnodes )
+ clog << "Fetched " << total_nodes << " nodes in total" << endl ;
+
+ poptFreeContext( pc ) ;
+ return 0 ;
+}
diff --git a/patman.1 b/patman.1
new file mode 100644
index 0000000..401427b
--- /dev/null
+++ b/patman.1
@@ -0,0 +1,301 @@
+.\" PatMaN DNA pattern matcher
+.\" (C) 2007 Kay Pruefer, Udo Stenzel
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or (at
+.\" your option) any later version. See the LICENSE file for details.
+
+.\" Process this file with
+.\" groff -man -Tascii patman.1
+.\"
+.TH PATMAN 1 "JANUARY 2008" Applications "User Manuals"
+.SH NAME
+PatMaN \- search for approximate patterns in DNA libraries
+.SH SYNOPSIS
+.B patman [
+.I option
+.B |
+.I file
+.B ... ]
+.SH DESCRIPTION
+.B PatMaN
+searches for (small)
+.I patterns
+in (huge) DNA
+.IR databases ,
+allowing for some mismatches and optionally gaps.
+.I Patterns
+and
+.I databases
+are read from one or more
+.BR fasta (5)
+files listed as non-option arguments, depending on whether the
+.I -D
+or
+.I -P
+option last preceeded them, and matched against each other. The output
+of
+.B PatMaN
+is a table containing one line for each match, consisting of
+tab-separated fields:
+.IP \(bu 4
+name of database sequence,
+.IP \(bu 4
+name of pattern,
+.IP \(bu 4
+position of first matched base in database sequence, the sequence's
+beginning has position 1,
+.IP \(bu 4
+position of last matched base in database sequence,
+.IP \(bu 4
+strand (+ for literal match, - for reverse complement),
+.IP \(bu 4
+edit distance (number of mismatches plus number of gaps).
+
+.SH OPTIONS
+.IP "-V, --version"
+Print version number and exit.
+
+.IP "-e num, --edits num"
+Allow up to
+.I num
+mismatches and/or gaps per match.
+
+.IP "-g num, --gaps num"
+Allow up to
+.I num
+gaps per match. Note that gaps count as mismatches, too, so the
+.I -e
+option should always be set at least as high as the
+.I -g
+option. Allowing many gaps can incur a considerable computational cost.
+
+.IP "-D, --databases"
+Treat the following files as
+.IR database .
+Databases must be in
+.BR fasta (5)
+format. Multiple
+.I database
+files, including "-" for standard input, are allowed and are read in
+turn.
+
+.IP "-P, --patterns"
+Treat the following files as
+.IR patterns .
+Pattern files must be in
+.BR fasta (5)
+format. Multiple
+.I pattern
+files, including "-" for standard input, are allowed and are all read before
+touching the
+.I databases.
+
+.IP "-o file, --output file"
+Redirect output to
+.IR file .
+The file name "-" causes output to be written to stdout, which is also
+the default
+
+.IP "-a, --ambicodes"
+Activate the interpretation of ambiguity codes in patterns. This
+results in the expansion of any
+.I pattern
+with ambiguity codes into multiple patterns which can match
+independently. Compare
+.B Unknown Nucleotides
+below.
+
+.IP "-s, --singlestrand"
+Deactivate matching of reverse-complements. Normally,
+.B PatMaN
+will try to match patterns both literally and after
+reverse-complementing them, with this option set, only straight forward
+matches are considered.
+
+.IP "-p num, --prefetch num"
+Causes
+.I num
+pointers to be prefetched in advance. This feature can improve
+performance, if
+.B PatMaN
+has been compiled for a processor architecture that supports
+prefetching. The optimum value for your particular setup has to be
+determined empirically, but the default should be reasonably good.
+
+.IP "-l len, --min-length len"
+Only consider patterns with a length of at least
+.IR len .
+Use this if your
+.I pattern
+collection contains short sequences that you don't want lots of possible
+matches reported for.
+
+.IP "-x num, --chop3 num"
+Cut off
+.I num
+bases from the 3' end of each
+.I pattern.
+Use this for
+.I patterns
+with damaged, edited, etc. 3' ends that should be ignored. The chopped
+bases are neither matched nor included in the reported match regions.
+
+.IP "-X num, --chop5 num"
+Cut off
+.I num
+bases from the 5' end of each
+.I pattern.
+Use this for
+.I patterns
+with damaged, edited, etc. 5' ends that should be ignored. The chopped
+bases are neither matched nor included in the reported match regions.
+
+.IP "-A, --adenine-hack"
+Allow adenine to be ignored in patterns. This is essentially equivalent
+to not counting gaps in the
+.I database,
+as long as it was an A that was gapped. Using
+.I -A
+can be computationally extremely expensive, both in terms of memory and
+time consumed.
+
+.IP "-q, --quiet"
+Suppress warnings (about unrecognized characters in input sequences or
+missing input files). Even without
+.IR -q ,
+at most one such warning is given per run.
+
+.IP "-v, --verbose"
+Prints additional progress information to stderr.
+
+.IP "-d flags, --debug flags"
+Sets debugging flags to
+.IR flags . Flags
+may be the logical
+.I OR
+of any of the following values, each of which causes some output to
+appear on
+.IR stderr .
+Some of the values may only work if
+.B PatMaN
+has been compiled in debug mode. The default value is 1.
+
+.IP 1
+Print warnings. Equivalent to not setting
+.IR -q .
+
+.IP 2
+Print progress information. Equivalent to setting
+.IR -v .
+
+.IP 4
+Dump the suffix trie of the
+.IR patterns .
+Only available in debug build.
+
+.IP 8
+Count number of visited nodes and print that number in each iteration.
+Only available in debug build.
+
+.IP 16
+Print total number of nodes fetched from memory after completing all
+.IR databases .
+
+.IP 32
+Output
+.I database
+sequence while it is being matched.
+
+.SH NOTES
+.SS Non-Option Arguments
+Non-option arguments (bare filenames) are either treated as
+.I database
+or
+.I pattern
+files, depending on whether the
+.I -D
+or
+.I -P
+option was the the last that occured before the filename. If neither
+.I -D
+nor
+.I -P
+was given, file names are treated as
+.I pattern
+files. If no
+.I database
+was given, it is instead read from standard input. Standard input can
+be explicitly given as either a
+.I database
+or a
+.I pattern
+file by using the filename "-". A warning is given if standard input is
+selected implicitly as
+.I database,
+an error message is given if no
+.I pattern
+files have been named at all.
+
+
+.SS Gapped Matching
+Allowing gaps often causes overlapping matches of single
+.I patterns
+at almost the same position.
+.B PatMaN
+makes no attempt to filter these redundant matches. Also note that
+allowing many gaps, and especially allowing an arbitrary amount of gaps
+through the
+.I -A
+hack can slow down
+.B PatMaN
+considerably and cause it to produce enormous amounts of output. The
+use of some sorty of post-processor to filter these is highly
+recommended.
+
+.SS Unknown Nucleotides
+Unknown nucleotides are most often encoded by the letter
+.BR N .
+If the
+.I --ambicodes
+option is not given, Ns in patterns are interpreted as unknown
+nucleotides and can never match without penalty. If
+.I --ambicodes
+is given, Ns in
+.I patterns
+are expanded just like the other amibuguity codes, and effectively work
+as wildcards. Unknown nucleotides can still be encoded by an
+.B X
+and will never match anything. The database is treated differently in
+that anything other than
+.IR A ", " C ", " G ", " T " and " U ,
+including ambiguity codes, is treated as unknown and can never match
+without penalty.
+
+.SH FILES
+.I /etc/popt
+.RS
+The system wide configuration file for
+.BR popt (3).
+.B PatMaN
+identifies itself as "patman" to popt.
+.RE
+
+.I ~/.popt
+.RS
+Per user configuration file for
+.BR popt (3).
+.RE
+
+.SH BUGS
+None known.
+
+.SH AUTHOR
+Kay Pruefer <pruefer at eva.mpg.de>
+.br
+Udo Stenzel <udo_stenzel at eva.mpg.de>
+
+.SH "SEE ALSO"
+.BR popt (3), fasta (5)
+
diff --git a/prefix_tree.cpp b/prefix_tree.cpp
new file mode 100644
index 0000000..6d53a4a
--- /dev/null
+++ b/prefix_tree.cpp
@@ -0,0 +1,668 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+
+#include "prefix_tree.h"
+
+#include <cassert>
+#include <iostream>
+#include <queue>
+#include <cstdlib>
+
+using namespace std ;
+
+// Why a deque? Because allocation of many small objects of
+// equal size is faster in it and deallocation of the whole
+// deque is much faster than deallocation of the small objects,
+// especially with our extremely simple allocation pattern.
+// This is probably the most sensible way to plug this memory
+// leak.
+prefix_tree::prefix_tree( ) : nodes()
+{
+ nodes.push_back( node() ) ;
+}
+
+#ifndef NDEBUG
+void prefix_tree::debug( std::ostream& s, const node& n ) const
+{
+ if( debug_flags & debug_trie )
+ {
+ for( int i = 0 ; i != 5 ; ++i )
+ s << n.label << " -" << "ACGTN"[i] << "> " << n.childs[i]->label << '/' << (int)n.drop[i] << std::endl ;
+ if( n.suffix ) s << n.label << " ==> " << n.suffix->label << '/' << n.drop_suffix << std::endl ;
+
+ for( int i = 0 ; i != 5 ; ++i )
+ if( !n.drop[i] ) debug( s, *n.childs[i] ) ;
+ }
+}
+#endif
+
+
+static mismatch_ptr *junk_yard = 0 ;
+
+static mismatch_ptr *new_node()
+{
+ if( junk_yard )
+ {
+ mismatch_ptr *p = junk_yard ;
+ junk_yard = p->next ;
+ return p ;
+ }
+ else
+ {
+ return new mismatch_ptr ;
+ }
+}
+
+mismatch_ptr *prefix_tree::seed() const
+{
+ // add ptr to the root node
+ mismatch_ptr *p = new_node() ;
+ p->ptr = &nodes[0] ;
+ p->mismatch = 0 ;
+ p->gaps = 0 ;
+ p->first = 0 ;
+ p->isgap = 0 ;
+ p->matched = 0 ;
+ p->next = 0 ;
+ return p ;
+}
+
+mismatch_ptr *prefix_tree::init( mismatch_ptr *p ) const
+{
+ if( p ) {
+ mismatch_ptr *q = p ;
+ while( p->next ) p = p->next ;
+ p->next = junk_yard ;
+ junk_yard = q ;
+ }
+ return seed() ;
+}
+
+// creates child node # pos if necessary and returns ptr to this child node
+node* prefix_tree::create_go_node( node* n, int pos ) {
+ if ( n->childs[pos] == 0 ) {
+ node o ;
+ o.depth = n->depth + 1 ;
+ nodes.push_back( o ) ;
+ node *next = &nodes.back() ;
+ n->childs[pos] = next ;
+#ifndef NDEBUG
+ if( debug_flags & debug_trie )
+ n->childs[pos]->label = n->label + "ACGTN"[pos] ;
+#endif
+ return next ;
+ } else {
+ return n->childs[pos] ;
+ }
+}
+
+void prefix_tree::add_recursion( const string& seq, const string& name, node* ptr, size_t i, char strand )
+{
+ // is this a leaf?
+ if ( i == seq.size() ) {
+
+ probe_s *probe = new probe_s() ;
+ probe->name = name ;
+ probe->strand = strand ;
+ probe->length = seq.size() ;
+ probe->next = ptr->probes ;
+ ptr->probes = probe ;
+
+ // no leaf -> add
+ } else {
+ switch ( toupper(seq[i]) ) {
+ case 'A':
+ add_recursion( seq, name, create_go_node( ptr, 0 ), ++i, strand ) ;
+ break ;
+ case 'C':
+ add_recursion( seq, name, create_go_node( ptr, 1 ), ++i, strand ) ;
+ break ;
+ case 'G':
+ add_recursion( seq, name, create_go_node( ptr, 2 ), ++i, strand ) ;
+ break ;
+ case 'T':
+ case 'U':
+ add_recursion( seq, name, create_go_node( ptr, 3 ), ++i, strand ) ;
+ break ;
+ case 'N':
+ case 'X':
+ add_recursion( seq, name, create_go_node( ptr, 4 ), ++i, strand ) ;
+ break ;
+ case '-':
+ break ;
+ default:
+ add_recursion( seq, name, create_go_node( ptr, 4 ), ++i, strand ) ;
+ if( debug_flags & debug_notquiet )
+ clog << "Warning: Sequence with name " << name
+ << " has character '" << seq[i]
+ << "' other than ACGTUXN." << endl ;
+ debug_flags &= ~debug_notquiet ;
+ break ;
+ }
+
+ }
+}
+
+/***
+ * Adding the sequence taking into account the ambiguity codes.
+ * i.e. Sequence with "N" will be added to A, C, G and T node
+ */
+void prefix_tree::add_recursion_ambiguity( const string& seq, const string& name, node* ptr, size_t i, char strand )
+{
+ // is this a leaf?
+ if ( i == seq.size() ) {
+
+ probe_s *probe = new probe_s() ;
+ probe->name = name ;
+ probe->strand = strand ;
+ probe->length = seq.size() ;
+ probe->next = ptr->probes ;
+ ptr->probes = probe ;
+
+ // no leaf -> add
+ } else {
+ switch ( toupper(seq[i]) ) {
+ case 'A':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ break ;
+ case 'C':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ break ;
+ case 'G':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ break ;
+ case 'T':
+ case 'U':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'N':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'R':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ break ;
+ case 'Y':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'M':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ break ;
+ case 'K':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'S':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ break ;
+ case 'W':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'B':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'D':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'H':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+ break ;
+ case 'V':
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+ break ;
+
+ case '-':
+ break ;
+
+ default:
+ add_recursion_ambiguity( seq, name, create_go_node( ptr, 4 ), i+1, strand ) ;
+ if( debug_flags & debug_notquiet )
+ clog << "Warning: Sequence with name " << name
+ << " has character '" << seq[i]
+ << "' other than ACGTU, BDH, SWKMRY, or XN." << endl ;
+ debug_flags &= ~debug_notquiet ;
+ break ;
+ }
+
+ }
+}
+
+/* Ambiguity codes:
+ * R = G or A (puRine)
+ * Y = C or T (pYrimidine)
+ * M = A or C (aMine)
+ * K = G or T (Ketone)
+ * S = G or C (Strong coupling)
+ * W = A or T (Weak coupling)
+ * B = not A (C or G or T)
+ * D = not C (A or G or T)
+ * H = not G (A or C or T)
+ * V = not U (A or C or G)
+ * N = A or C or G or T (Nucleotide)
+ */
+
+/*
+ * reverse complement of a dna sequence, including correct ambiguity mapping
+ */
+inline static string reverse_complement( const string& s ) {
+ string ret ;
+ for ( int i = s.size()-1 ; i >= 0 ; i-- ) {
+ switch ( toupper( s[i] ) ) {
+ case 'A':
+ ret += 'T' ; break ;
+ case 'C':
+ ret += 'G' ; break ;
+ case 'G':
+ ret += 'C' ; break ;
+ case 'T':
+ case 'U':
+ ret += 'A' ; break ;
+ case 'R':
+ ret += 'Y' ; break ;
+ case 'Y':
+ ret += 'R' ; break ;
+ case 'M':
+ ret += 'K' ; break ;
+ case 'K':
+ ret += 'M' ; break ;
+ case 'B':
+ ret += 'V' ; break ;
+ case 'V':
+ ret += 'B' ; break ;
+ case 'D':
+ ret += 'H' ; break ;
+ case 'H':
+ ret += 'D' ; break ;
+ case 'S':
+ case 'W':
+ case 'N':
+ case '-':
+ ret += s[i] ;
+ break ;
+ default:
+ clog << "Sequence has invalid character '"
+ << s[i] << "'." << endl ;
+ exit( 1 ) ;
+ }
+ }
+ return ret ;
+}
+
+void prefix_tree::add_seq( const string& seq, const string& name, bool ambi_codes )
+{
+ node *ptr = &nodes[0] ;
+ int i = 0 ;
+ if ( ! ambi_codes ) {
+ add_recursion( seq, name, ptr, i, '+' ) ;
+ if ( only_plus_strand != 1 ) {
+ string rev_seq = reverse_complement( seq ) ;
+ add_recursion( rev_seq, name, ptr, i, '-' ) ;
+ }
+ } else {
+ add_recursion_ambiguity( seq, name, ptr, i, '+' ) ;
+ if ( only_plus_strand != 1 ) {
+ string rev_seq = reverse_complement( seq ) ;
+ add_recursion_ambiguity( rev_seq, name, ptr, i, '-' ) ;
+ }
+ }
+}
+
+// Suffix links. These are only needed as an intermediate step. Each
+// node gets a suffix link, which points to the node that is the longest
+// existing true suffix. Given a node that already has a suffix, we
+// find the suffix of child "A" by following the "A" link from our
+// suffix and adding the 'drop' values for the "suffix" and the "A"
+// link. We can then fill in the missing children for this node in the
+// same fashion. To ensure that any node we visit already has a suffix,
+// we do a breadth-first search.
+//
+// The root never has a suffix. To start the fill-in, each missing
+// child of the root becomes the root itself with a "drop" of one and
+// the existing ones get the root as suffix with a drop of one.
+//
+// Also, after we added the missing children, we no longer need the
+// suffix link for anything but accessing query sequences that happen to
+// be substrings of other queries. So we can optimize these.
+
+void prefix_tree::add_suffix_links()
+{
+ std::queue< node* > q ;
+ node& root = nodes[0] ;
+ for( int i = 0 ; i != 5 ; ++i )
+ {
+ if( root.childs[i] )
+ {
+ root.childs[i]->suffix = &root ;
+ root.childs[i]->drop_suffix = 1 ;
+ q.push( root.childs[i] ) ;
+ }
+ else
+ {
+ root.childs[i] = &root ;
+ root.drop[i] = 1 ;
+ }
+ }
+
+ while( !q.empty() )
+ {
+ node* n = q.front() ;
+ q.pop() ;
+ for( int i = 0 ; i != 5 ; ++i )
+ {
+ if( n->childs[i] )
+ {
+ n->childs[i]->suffix = n->suffix->childs[i] ;
+ n->childs[i]->drop_suffix = n->drop_suffix + n->suffix->drop[i] ;
+ q.push( n->childs[i] ) ;
+ }
+ else
+ {
+ n->childs[i] = n->suffix->childs[i] ;
+ n->drop[i] = n->drop_suffix + n->suffix->drop[i] ;
+ }
+ }
+ if( !n->suffix->probes )
+ {
+ if( n->suffix->suffix )
+ {
+ n->drop_suffix += n->suffix->drop_suffix ;
+ n->suffix = n->suffix->suffix ;
+ }
+ else
+ {
+ n->suffix = 0 ;
+ }
+ }
+ }
+}
+
+inline static void out( int mm, const node* n, int mlen )
+{
+ for( const probe_s *probe = n->probes ; probe ; probe = probe->next )
+ {
+ ( output ? *output : cout )
+ << genome_name << '\t' << probe->name << '\t'
+ << position_genome - mlen << '\t'
+ << position_genome - 1 << '\t'
+ << probe->strand << '\t' << mm << '\n' ;
+ }
+}
+
+// Why does this work? Because when we follow suffix links (to print
+// matches that are completely inside other matches), we know that we
+// only drop perfectly matching stuff from the beginning. This allows
+// us to update "mlen".
+
+inline static void out_rec( int mm, const node* n, int first, int mlen )
+{
+ if( n->probes ) out( mm, n, mlen ) ;
+
+ const node* nn = n ;
+ while( nn->suffix && nn->drop_suffix <= first )
+ {
+ first -= nn->drop_suffix ;
+ mlen -= nn->drop_suffix ;
+ nn = nn->suffix ;
+ if( nn->probes ) out( mm, nn, mlen ) ;
+ }
+}
+
+
+/* Advancing the mismatch pointers.
+ * - We receive a (singly linked) list of mps. There has to be exactly
+ * one mp that didn't encounter any mismatches in its history (keeping
+ * this one ensures that the algorithm reduces to Aho-Corasick if
+ * mismatches aren't allowed). Too avoid to much special treatment,
+ * this special pointer will always be the first in the list. We
+ * process it first, initalize the result list with it or re-seed the
+ * result list, then go into the loop over the rest of the list. This
+ * implies an if-block in the middle of the loop and it also implies
+ * that some pointers are only initialized in the loop.
+ *
+ * - We will keep multiple pointers into the two involved lists: one to
+ * the end of the result list; this is where we splice in pointers to
+ * be queued for the next round. This list is at all time properly
+ * terminated and disconnected from the stuff we're processing this
+ * round.
+ *
+ * - One to the next node to be processed. This one is easy, it is
+ * advanced once per loop, it's the head of the incoming list and
+ * we're done if it becomes null.
+ *
+ * - One several places ahead of the current one. We prefetch the node
+ * pointed to by this one and its next link. Keeping another yet more
+ * advanced pointer to prefetch mismatch pointers is useless because
+ * of the need to follow the chain of next pointers. This one has to
+ * be valid at all times (see next point), so if it would become null,
+ * we don't advance it.
+ *
+ * - When gapping, we need to splice new mps into the queue being
+ * processed. We do this after the node we prefetched last. That
+ * makes sense, because we already have that mp in cache, so we can
+ * splice without penalty. It works fine, as long as there is such a
+ * pointer, and to ensure that, we first handle gaps (in this
+ * direction) while the current node is still in the queue, then
+ * cache the contents of the current node, handle the node itself,
+ * thereby removing and or splicing it, then handle mismatches and/or
+ * gaps.
+ */
+
+inline static mismatch_ptr *splice_fresh_after( mismatch_ptr *mp )
+{
+ if( junk_yard )
+ {
+ mismatch_ptr *p0 = junk_yard ;
+ junk_yard = junk_yard->next ;
+ p0->next = mp->next ;
+ mp->next = p0 ;
+ return p0 ;
+ }
+ else
+ {
+ mismatch_ptr *p0 = new mismatch_ptr ;
+ p0->next = mp->next ;
+ mp->next = p0 ;
+ return p0 ;
+ }
+}
+
+mismatch_ptr *prefix_tree::compare( char c, mismatch_ptr *ptrs ) const
+{
+ int pos ;
+ switch( c )
+ {
+ case '-': return ptrs ;
+ case 'A': case 'a': pos = 0 ; break ;
+ case 'C': case 'c': pos = 1 ; break ;
+ case 'G': case 'g': pos = 2 ; break ;
+ case 'T': case 't': pos = 3 ; break ;
+ case 'U': case 'u': pos = 3 ; break ;
+ default: pos = -1 ; break ;
+ }
+
+ bool is_first = true ;
+ mismatch_ptr *new_head = 0, *last_done = 0 ;
+ mismatch_ptr *next_in_queue = ptrs ;
+ mismatch_ptr *next_to_prefetch = ptrs ;
+ for( int i = 0 ; i != do_prefetch && next_to_prefetch->next ; ++i )
+ next_to_prefetch = next_to_prefetch->next ;
+
+ while( next_in_queue )
+ {
+ ++total_nodes ;
+ if( next_to_prefetch && next_to_prefetch->ptr ) __builtin_prefetch( next_to_prefetch->ptr ) ;
+ if( next_to_prefetch && next_to_prefetch->next ) __builtin_prefetch( next_to_prefetch->next ) ;
+
+ const node* n = next_in_queue->ptr ;
+ size_t mm = next_in_queue->mismatch ;
+ size_t gg = next_in_queue->gaps ;
+ int ff = next_in_queue->first ;
+ int nn = next_in_queue->matched ;
+
+ if( !next_in_queue->isgap ) out_rec( mm, n, is_first ? 255 : ff, nn ) ;
+
+ // If gaps are allowed, also add gapped nodes for
+ // processing _in_this_iteration_. If
+ // discount_adenine is set, don't gap an A (makes
+ // bookkeeping difficult).
+
+ if( gg < allow_gaps && mm < cutoff && n->depth > 0 ) {
+ for ( int i = discount_adenine ? 1 : 0 ; i != 5 ; i++ ) {
+ int ff_ = is_first ? n->depth-1 : ff ;
+ if( ff_ >= n->drop[i] ) {
+ assert( next_to_prefetch ) ;
+ mismatch_ptr *mp2 = splice_fresh_after( next_to_prefetch ) ;
+ mp2->ptr = n->childs[i] ;
+ mp2->mismatch = mm + 1 ;
+ mp2->gaps = gg + 1 ;
+ mp2->first = ff_ - n->drop[i] ;
+ mp2->isgap = 1 ;
+ mp2->matched = nn ;
+ }
+ }
+ }
+
+ // Are we gapping As? then queue that up, no matter how many
+ // mismatches we've already accumulated. (Note how this isn't
+ // even considered a gap, note that gapping the root actually
+ // makes a certain kind of sense.)
+
+ if( discount_adenine ) {
+ int ff_ = is_first ? n->depth-1 : ff ;
+ if( ff_ >= n->drop[0] && n->drop[0] ) {
+ assert( next_to_prefetch ) ;
+ mismatch_ptr *mp2 = splice_fresh_after( next_to_prefetch ) ;
+ mp2->ptr = n->childs[0] ;
+ mp2->mismatch = mm ;
+ mp2->gaps = gg ;
+ mp2->first = ff_ - n->drop[0] ;
+ mp2->isgap = 0 ;
+ mp2->matched = nn ;
+ }
+ }
+
+ // No longer is anything enqueued into the old queue, so we no
+ // longer need the next_to_prefetch pointer and can move it
+ // ahead.
+ if( next_to_prefetch->next ) next_to_prefetch = next_to_prefetch->next ;
+
+ // At this point, we no longer run into difficulties if the
+ // incoming queue gets emptied. So now we move the pointer to its
+ // matching child and throw it out if dropping the prefix kills
+ // a mismatch. Also, we can move the next_to_prefetch pointer,
+ // since it's not needed anymore in this iteration and might
+ // point to nowhere now. Moreover, if we're at the first node
+ // (the one without any mismatches), this is the place to
+ // initialize the new list.
+
+ int pos_ = pos == -1 ? 4 : pos ;
+ if( is_first ) {
+ next_in_queue = ptrs->next ;
+ if( pos == -1 ) {
+ new_head = seed() ;
+ ptrs->next = junk_yard ;
+ junk_yard = ptrs ;
+ } else {
+ new_head = ptrs ;
+ new_head->matched += 1 - ptrs->ptr->drop[pos] ;
+ new_head->ptr = ptrs->ptr->childs[pos] ;
+ new_head->next = 0 ;
+ }
+ last_done = new_head ;
+ } else if( ff >= n->drop[pos_] ) {
+ next_in_queue->ptr = n->childs[pos_] ;
+ next_in_queue->first -= n->drop[pos_] ;
+ next_in_queue->matched ++ ;
+ next_in_queue->matched -= n->drop[pos_] ;
+ next_in_queue->isgap = 0 ;
+ assert( !last_done->next ) ;
+ last_done->next = next_in_queue ;
+ last_done = next_in_queue ;
+ next_in_queue = next_in_queue->next ;
+ last_done->next = 0 ;
+ } else {
+ mismatch_ptr *p = next_in_queue ;
+ next_in_queue = next_in_queue->next ;
+ p->next = junk_yard ;
+ junk_yard = p ;
+ }
+
+ // The following always produces new mismatches, so skip it
+ // completely if we're already at the limit.
+
+ if ( mm < cutoff ) {
+ // if gaps are allowed, always keep the node (but don't
+ // uselessly gap the root node)
+ if( gg < allow_gaps && n->depth > 0 )
+ {
+ assert( last_done ) ;
+ mismatch_ptr *p = splice_fresh_after( last_done ) ;
+ p->ptr = n ;
+ p->mismatch = mm + 1 ;
+ p->gaps = gg + 1 ;
+ p->isgap = 1 ;
+ p->first = is_first ? n->depth-1 : ff ;
+ p->matched = nn + 1 ;
+ last_done = p ;
+ }
+
+ // Take care of the children, but not the one that matches.
+ // (That one has already been handled.)
+
+ for ( int i = 0 ; i != 5 ; i++ ) {
+ if( i != pos ) {
+ // The child always exists, but we follow it only
+ // iff the first mismatch would not be forgotten by
+ // following a suffix link.
+ assert( n->childs[i] ) ;
+ int ff_ = is_first ? n->depth : ff ;
+ if( ff_ >= n->drop[i] )
+ {
+ mismatch_ptr *p = splice_fresh_after( last_done ) ;
+ p->ptr = n->childs[i] ;
+ p->first = ff_ - n->drop[i] ;
+ p->mismatch = mm + 1 ;
+ p->gaps = gg ;
+ p->isgap = 0 ;
+ p->matched = nn + 1 - n->drop[i] ;
+ assert( !p->next ) ;
+ assert( last_done->next == p ) ;
+ last_done = p ;
+ }
+ }
+ }
+ }
+ is_first = false ;
+ assert( !last_done->next ) ;
+ }
+
+ if( debug_flags & debug_numnodes )
+ {
+ int n = 0 ;
+ for( mismatch_ptr *p = new_head ; p ; p=p->next ) ++n ;
+ std::clog << '\r' << n << std::endl;
+ }
+ return new_head ;
+}
+
diff --git a/prefix_tree.h b/prefix_tree.h
new file mode 100644
index 0000000..25b0609
--- /dev/null
+++ b/prefix_tree.h
@@ -0,0 +1,92 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version. See the LICENSE file for details.
+
+#ifndef INCLUDED_PREFIX_TREE_H
+#define INCLUDED_PREFIX_TREE_H
+
+#include <deque>
+#include <string>
+
+#ifndef NDEBUG
+#include <istream>
+#endif
+
+#include "global.h"
+
+struct probe_s {
+ probe_s *next ;
+ std::string name ;
+ int length ;
+ char strand ; // + or -
+} ;
+
+struct node {
+ node* childs[5] ;
+ node* suffix ;
+ probe_s *probes ;
+ unsigned char drop[5] ;
+ unsigned char drop_suffix ;
+ unsigned char depth ;
+#ifndef NDEBUG
+ std::string label ;
+#endif
+
+ node() : suffix(0), probes(0), drop_suffix(0), depth(0)
+ {
+ for( int i = 0 ; i != 5 ; ++i )
+ {
+ childs[i] = 0 ;
+ drop[i] = 0 ;
+ }
+ }
+} ;
+
+/* Mismatch pointers (elsewhere called "walkers") point into the prefix
+ * trie, track the history of an alignment and are advanced according to
+ * the read database.
+ *
+ * We will keep mismatch pointers in a queue, implemented as singly
+ * linked list, and because we splice inside that list, it's probably
+ * best to implement it directly (instead of encapsulating everything
+ * behind an object). Unused nodes will be collected in another list
+ * (the junk heap).
+ */
+struct mismatch_ptr {
+ struct mismatch_ptr *next ;
+ const node *ptr ;
+ unsigned char mismatch ;
+ unsigned char first ;
+ unsigned char matched ;
+ unsigned char isgap : 1 ;
+ unsigned char gaps : 7 ;
+} ;
+
+class prefix_tree
+{
+ public:
+ prefix_tree( ) ;
+ void add_seq( const std::string& seq, const std::string& name, bool ambi_codes ) ;
+ mismatch_ptr *compare( char c, mismatch_ptr *ptrs ) const ;
+ void add_suffix_links() ;
+ mismatch_ptr *init( mismatch_ptr* ) const ;
+
+#ifndef NDEBUG
+ void debug( std::ostream& s ) const { debug( s, nodes[0] ) ; }
+ void debug( std::ostream&, const node& ) const ;
+#endif
+
+ private:
+ node* create_go_node( node* n, int pos ) ;
+ mismatch_ptr *seed() const ;
+ void add_recursion( const std::string& seq, const std::string& name, node* ptr, size_t i, char strand ) ;
+ void add_recursion_ambiguity( const std::string& seq, const std::string& name, node* ptr, size_t i, char strand ) ;
+
+ std::deque<node> nodes ;
+} ;
+
+#endif
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/patman.git
More information about the debian-med-commit
mailing list