[med-svn] [patman] 12/15: New upstream version 1.2.2+dfsg

Andreas Tille tille at debian.org
Wed Dec 13 19:28:26 UTC 2017


This is an automated email from the git hooks/post-receive script.

tille pushed a commit to branch master
in repository patman.

commit 9a9e681d00105d330ef308c356ef358170604697
Author: Andreas Tille <tille at debian.org>
Date:   Wed Dec 13 20:24:05 2017 +0100

    New upstream version 1.2.2+dfsg
---
 AUTHORS                  |   2 +
 CHANGES                  |   6 +
 LICENSE                  | 340 ++++++++++++++++++++++++
 Makefile                 |  49 ++++
 README                   |  10 +
 debian/changelog         |   6 -
 debian/clean             |   1 -
 debian/compat            |   1 -
 debian/control           |  16 --
 debian/copyright         |  13 -
 debian/docs              |   1 -
 debian/rules             |   4 -
 debian/upstream/metadata |  12 -
 debian/watch             |   5 -
 fasta.cpp                |  47 ++++
 fasta.h                  |  30 +++
 global.h                 |  35 +++
 main.cpp                 | 201 ++++++++++++++
 patman.1                 | 301 +++++++++++++++++++++
 prefix_tree.cpp          | 668 +++++++++++++++++++++++++++++++++++++++++++++++
 prefix_tree.h            |  92 +++++++
 21 files changed, 1781 insertions(+), 59 deletions(-)

diff --git a/AUTHORS b/AUTHORS
new file mode 100644
index 0000000..7f2b374
--- /dev/null
+++ b/AUTHORS
@@ -0,0 +1,2 @@
+Kay Pruefer <pruefer at eva.mpg.de>
+Udo Stenzel <udo_stenzel at eva.mpg.de>
diff --git a/CHANGES b/CHANGES
new file mode 100644
index 0000000..bfdebe7
--- /dev/null
+++ b/CHANGES
@@ -0,0 +1,6 @@
+
+Version 1.2.1
+- fixed bug in makefile
+- added cstdlib & cstring headers to make it compile with g++ 4.4
+Thanks to Dan Tomlinson for reporting this!
+
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..b7b5f53
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,340 @@
+		    GNU GENERAL PUBLIC LICENSE
+		       Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+	51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users.  This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it.  (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.)  You can apply it to
+your programs, too.
+
+  When we speak of free software, we are referring to freedom, not
+price.  Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+  To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+  For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have.  You must make sure that they, too, receive or can get the
+source code.  And you must show them these terms so they know their
+rights.
+
+  We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+  Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software.  If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+  Finally, any free program is threatened constantly by software
+patents.  We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary.  To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.
+

+		    GNU GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License.  The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language.  (Hereinafter, translation is included without limitation in
+the term "modification".)  Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+  1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+  2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) You must cause the modified files to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    b) You must cause any work that you distribute or publish, that in
+    whole or in part contains or is derived from the Program or any
+    part thereof, to be licensed as a whole at no charge to all third
+    parties under the terms of this License.
+
+    c) If the modified program normally reads commands interactively
+    when run, you must cause it, when started running for such
+    interactive use in the most ordinary way, to print or display an
+    announcement including an appropriate copyright notice and a
+    notice that there is no warranty (or else, saying that you provide
+    a warranty) and that users may redistribute the program under
+    these conditions, and telling the user how to view a copy of this
+    License.  (Exception: if the Program itself is interactive but
+    does not normally print such an announcement, your work based on
+    the Program is not required to print an announcement.)
+

+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+    a) Accompany it with the complete corresponding machine-readable
+    source code, which must be distributed under the terms of Sections
+    1 and 2 above on a medium customarily used for software interchange; or,
+
+    b) Accompany it with a written offer, valid for at least three
+    years, to give any third party, for a charge no more than your
+    cost of physically performing source distribution, a complete
+    machine-readable copy of the corresponding source code, to be
+    distributed under the terms of Sections 1 and 2 above on a medium
+    customarily used for software interchange; or,
+
+    c) Accompany it with the information you received as to the offer
+    to distribute corresponding source code.  (This alternative is
+    allowed only for noncommercial distribution and only if you
+    received the program in object code or executable form with such
+    an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it.  For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable.  However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+

+  4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+  5. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Program or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+  6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+  7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+

+  8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded.  In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+  9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time.  Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation.  If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+  10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission.  For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this.  Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+			    NO WARRANTY
+
+  11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW.  EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU.  SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+  12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+

+	    How to Apply These Terms to Your New Programs
+
+  If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+  To do so, attach the following notices to the program.  It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the program's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301 USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+    Gnomovision version 69, Copyright (C) year  name of author
+    Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+    This is free software, and you are welcome to redistribute it
+    under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License.  Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+  `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+  <signature of Ty Coon>, 1 April 1989
+  Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs.  If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library.  If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..ab592ea
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,49 @@
+# PatMaN DNA pattern matcher
+# (C) 2007 Kay Pruefer, Udo Stenzel
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or (at
+# your option) any later version.  See the LICENSE file for details.
+
+prefix := ${HOME}
+version := 1.2
+
+TARGETS := patman
+OBJS := prefix_tree.o fasta.o main.o
+CPPFLAGS += -Wall 
+CPPFLAGS += -O3 -funroll-loops -DNDEBUG -march=k8
+# CPPFLAGS += -ggdb
+LDLIBS += -lpopt
+
+all: $(TARGETS)
+
+patman:	$(OBJS)
+	g++ $(CPPFLAGS) -o $@ $(OBJS) $(LDFLAGS) $(LDLIBS)
+
+%.o:	%.cpp
+	g++ -DVERSION="\"$(version)\"" $(CPPFLAGS) -c -o $@ $<
+
+install: $(TARGETS)
+	install -d ${prefix}/bin
+	install -d ${prefix}/share/man/man1
+	install -s -m755 $^ ${prefix}/bin
+	install -m644 patman.1 ${prefix}/share/man/man1
+	
+fasta.o: fasta.h 
+prefix_tree.o: prefix_tree.h global.h
+main.o: fasta.h prefix_tree.h global.h
+
+.SUFFIXES:
+
+.PHONY: clean mrproper dist all
+
+dist: all
+	$(MAKE) clean
+	strip $(TARGETS)
+
+clean:	
+	-rm -f $(OBJS)
+	
+mrproper: clean
+	-rm -r $(TARGETS)
diff --git a/README b/README
new file mode 100644
index 0000000..d43d87f
--- /dev/null
+++ b/README
@@ -0,0 +1,10 @@
+PatMaN: Pattern Matching in Nucleotide data bases 
+
+PatMaN compares a set of short patterns against one or more sequence
+databases.  Patterns can match with an arbitrary (but generally small)
+number of mismatches and/or gaps given as arguments.  See the man page
+for details.
+
+To install into your home directory, simply call 'make install'.  You
+can call 'make install prefix=dir' to install into 'dir' instead.  To
+compile, a reasonable C++ compiler and the popt library are needed.
diff --git a/debian/changelog b/debian/changelog
deleted file mode 100644
index 2d28936..0000000
--- a/debian/changelog
+++ /dev/null
@@ -1,6 +0,0 @@
-patman (1.2.2+dfsg-1) UNRELEASED; urgency=low
-
-  * Initial release (Closes: #482555)
-
- -- Charles Plessy <plessy at debian.org>  Thu, 22 May 2008 07:36:37 +0900
-
diff --git a/debian/clean b/debian/clean
deleted file mode 100644
index 7a0d4d9..0000000
--- a/debian/clean
+++ /dev/null
@@ -1 +0,0 @@
-patman
diff --git a/debian/compat b/debian/compat
deleted file mode 100644
index f599e28..0000000
--- a/debian/compat
+++ /dev/null
@@ -1 +0,0 @@
-10
diff --git a/debian/control b/debian/control
deleted file mode 100644
index c3be598..0000000
--- a/debian/control
+++ /dev/null
@@ -1,16 +0,0 @@
-Source: patman
-Section: science
-Priority: optional
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Charles Plessy <plessy at debian.org>
-Build-Depends: debhelper (>= 10)
-Standards-Version: 4.1.2
-Homepage: http://bioinf.eva.mpg.de/patman/
-
-Package: patman
-Architecture: any
-Depends: ${shlibs:Depends}, ${misc:Depends}
-Description: rapid alignment of short sequences to large databases
- Patman searches for short patterns in large DNA databases, allowing
- for approximate matches. It is optimized for searching for many small
- pattern at the same time, for example microarray probes.
diff --git a/debian/copyright b/debian/copyright
deleted file mode 100644
index f50c689..0000000
--- a/debian/copyright
+++ /dev/null
@@ -1,13 +0,0 @@
-Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
-Upstream-Name: PatMaN
-Source: https://bioinf.eva.mpg.de/patman/patman-1.2.2.tar.gz
-Files-Excluded: */patman
-                */.svn
-
-Files: *
-Copyright: © 2008-2009 Kay Pruefer, Udo Stenzel
-License: GPL-2+
-
-Files: debian/*
-Copyright: © 2008 Charles Plessy <plessy at debian.org>
-License: GPL-2+
diff --git a/debian/docs b/debian/docs
deleted file mode 100644
index e845566..0000000
--- a/debian/docs
+++ /dev/null
@@ -1 +0,0 @@
-README
diff --git a/debian/rules b/debian/rules
deleted file mode 100755
index 053bee6..0000000
--- a/debian/rules
+++ /dev/null
@@ -1,4 +0,0 @@
-#!/usr/bin/make -f
-  
-%:
-        dh $@
diff --git a/debian/upstream/metadata b/debian/upstream/metadata
deleted file mode 100644
index e085e38..0000000
--- a/debian/upstream/metadata
+++ /dev/null
@@ -1,12 +0,0 @@
-Reference:
-  Author: Kay Prüfer and Udo Stenzel and Michael Dannemann and Richard E Green and Michael Lachmann and Janet Kelso
-  Title: "PatMaN: rapid alignment of short sequences to large databases"
-  Journal: Bioinformatics
-  Year: 2008
-  Volume: 24
-  Number: 13
-  Pages: 1530-1
-  DOI: 10.1093/bioinformatics/btn223
-  PMID: 18467344
-  URL: http://bioinformatics.oxfordjournals.org/content/24/13/1530
-  eprint: http://bioinformatics.oxfordjournals.org/content/24/13/1530.full.pdf+html
diff --git a/debian/watch b/debian/watch
deleted file mode 100644
index 63fd461..0000000
--- a/debian/watch
+++ /dev/null
@@ -1,5 +0,0 @@
-version=4
-
-opts="repacksuffix=+dfsg,dversionmangle=s/\+dfsg//g,repack,compression=xz" \
-   https://bioinf.eva.mpg.de/patman/ patman-(.*)\.tar\.gz
-
diff --git a/fasta.cpp b/fasta.cpp
new file mode 100644
index 0000000..60d0bf8
--- /dev/null
+++ b/fasta.cpp
@@ -0,0 +1,47 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+#include "fasta.h"
+
+using namespace std ;
+
+fasta* fasta_fac::get(  ) 
+{
+	string headline ;
+	fasta* ret = new fasta ;
+	ret->null = 0 ;
+	// headline
+	getline( in, ret->headerline ) ;
+	line++ ;
+	if ( ret->headerline[0] != '>' ) {
+		if ( !in.eof() ) cerr << "No headerline in fasta file at line " << line << endl ;
+		ret->null = 1 ;
+		return ret ;
+	} else {
+		// cut the '>' in headerline
+		ret->headerline = ret->headerline.substr( 1, ret->headerline.size()-1 ) ;
+
+		// read the sequence
+		// handles eof, \n and '>', everything else is
+		// considered part of the sequence.
+		int i ;
+		while ( (i = in.get()) && in.good() ) {
+			if ( i == '>' ) {
+				in.putback( i ) ;
+				return ret ;
+			} else {
+				// add to seq ;
+				if ( i != '\n' ) 
+					ret->seq += static_cast<char>(i) ;
+				else 
+					line++ ;
+			}
+		}
+		return ret ;
+	}
+}
diff --git a/fasta.h b/fasta.h
new file mode 100644
index 0000000..ce8a7e7
--- /dev/null
+++ b/fasta.h
@@ -0,0 +1,30 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+#ifndef INCLUDED_FASTA_H
+#define INCLUDED_FASTA_H
+
+#include <string>
+#include <iostream>
+
+typedef struct fasta {
+	std::string seq ;
+	std::string headerline ;
+	int null ;
+} fasta ;
+
+class fasta_fac {
+	public:
+		fasta_fac( std::istream& i ) : in(i), line(0) {  } 
+		fasta* get(  ) ;
+	private: 
+		std::istream& in ;
+		int line ;
+} ;
+
+#endif
diff --git a/global.h b/global.h
new file mode 100644
index 0000000..bf16f53
--- /dev/null
+++ b/global.h
@@ -0,0 +1,35 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+#ifndef INCLUDED_GLOBAL_H
+#define INCLUDED_GLOBAL_H
+
+#include <iosfwd>
+
+extern size_t cutoff ;
+extern size_t allow_gaps ;
+extern long position_genome ;
+extern const char* genome_name ;
+extern int only_plus_strand ;
+extern int discount_adenine ;
+extern int quiet ;
+extern int chop5 ;
+extern std::ostream *output ;
+extern int debug_flags ;
+extern int do_prefetch ;
+extern long long total_nodes ;
+
+static const int debug_notquiet   = 1 ;
+static const int debug_verbose    = 2 ;
+static const int debug_trie       = 4 ;
+static const int debug_countnodes = 8 ;
+static const int debug_numnodes   = 16 ;
+static const int debug_sequence   = 32 ;
+
+#endif
+
diff --git a/main.cpp b/main.cpp
new file mode 100644
index 0000000..10a8b44
--- /dev/null
+++ b/main.cpp
@@ -0,0 +1,201 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+
+#include <iostream>
+#include <string>
+#include <vector>
+#include <fstream>
+#include <cstdlib>
+#include <cstring>
+
+#include <popt.h>
+
+#include "prefix_tree.h"
+#include "fasta.h"
+
+using namespace std ;
+
+// global variables
+size_t cutoff = 2 ;
+size_t allow_gaps = 0 ;
+size_t minlength = 0 ;
+long long total_nodes = 0 ;
+int only_plus_strand = 0 ;
+int discount_adenine = 0 ;
+int ambi_codes = 0 ; 
+int chop5 = 0, chop3 = 0 ;
+int debug_flags = debug_notquiet ;
+int do_prefetch = 3 ;
+ostream *output = 0 ;
+
+long position_genome ;
+const char* genome_name ;
+
+enum option_flag { opt_none, opt_db, opt_pat, opt_output, opt_version } ;
+
+struct poptOption options[] = {
+	{ "version",      'V', POPT_ARG_NONE,                             0,       opt_version, "Print version number and exit", 0 },
+	{ "edits",        'e', POPT_ARG_INT  | POPT_ARGFLAG_SHOW_DEFAULT, &cutoff,           0, "Set maximum edit distance to N", "N" },
+	{ "gaps",         'g', POPT_ARG_INT  | POPT_ARGFLAG_SHOW_DEFAULT, &allow_gaps,       0, "Set maximum number of gaps to N", "N" },
+	{ "databases",    'D', POPT_ARG_NONE,                             0,            opt_db, "Following files are databases", 0 },
+	{ "patterns",     'P', POPT_ARG_NONE,                             0,           opt_pat, "Following files contain patterns", 0 },
+	{ "output",       'o', POPT_ARG_STRING,                           0,        opt_output, "Write output to FILE", "FILE" },
+	{ "ambicodes",    'a', POPT_ARG_NONE,                             &ambi_codes,       0, "Interpret ambiguity codes in patterns", 0 },
+	{ "singlestrand", 's', POPT_ARG_NONE,                             &only_plus_strand, 0, "Do not match reverse-complements", 0 },
+	{ "prefetch",     'p', POPT_ARG_INT  | POPT_ARGFLAG_SHOW_DEFAULT, &do_prefetch,      0, "Prefetch N nodes", "N" },
+	{ "min-length",   'l', POPT_ARG_INT  | POPT_ARGFLAG_DOC_HIDDEN,   &minlength,        0, "Only consider patterns longer than N", "N" },
+	{ "chop-3",       'x', POPT_ARG_INT  | POPT_ARGFLAG_DOC_HIDDEN,   &chop3,            0, "Chop N bp off the 3' end of patterns", "N" },
+	{ "chop-5",       'X', POPT_ARG_INT  | POPT_ARGFLAG_DOC_HIDDEN,   &chop5,            0, "Chop N bp off the 5' end of patterns", "N" },
+	{ "adenine-hack", 'A', POPT_ARG_NONE | POPT_ARGFLAG_DOC_HIDDEN,   &discount_adenine, 0, "Ignore inserted adenines in db", 0 },
+	{ "quiet",        'q', POPT_ARG_NONE | POPT_BIT_CLR, &debug_flags,      debug_notquiet, "Turn off warnings", 0 },
+	{ "verbose",      'v', POPT_ARG_NONE | POPT_BIT_SET, &debug_flags,       debug_verbose, "Print additional progress reports", 0 },
+	{ "debug",        'd', POPT_ARG_INT  | POPT_ARGFLAG_DOC_HIDDEN,   &debug_flags,      0, "Set debug flags to N (see man page)", "N" },
+	POPT_AUTOHELP POPT_TABLEEND
+} ;
+
+
+int main(int argc, const char *argv[]) {
+
+	std::vector<std::string> dbfiles, patfiles ;
+
+	poptContext pc = poptGetContext( "patman", argc, argv, options, 0 ) ;
+	poptSetOtherOptionHelp( pc, "[OPTION...] [input-file...]" ) ;
+	poptReadDefaultConfig( pc, 0 ) ;
+	bool isdbfiles = false ;
+
+	if( argc <= 1 ) { poptPrintHelp( pc, stderr, 0 ) ; return 1 ; }
+	int rc = poptGetNextOpt(pc) ;
+	for(;;) {
+		if( rc == opt_version ) {
+			std::cout << poptGetInvocationName(pc) << ", revision " << VERSION << std::endl ;
+			return 0 ;
+		}
+		else if( rc == opt_db || rc == opt_pat || rc == -1 ) {
+			while( poptPeekArg(pc) )
+				( isdbfiles ? dbfiles : patfiles ).push_back( poptGetArg(pc) ) ;
+			isdbfiles = (rc == opt_db) ;
+		}
+		else if( rc == opt_output )
+		{
+			delete output ;
+			const char *arg = poptGetOptArg(pc) ;
+			if( 0 == strcmp( arg, "-" ) ) output = 0 ;
+			else output = new std::ofstream( arg ) ;
+		}
+		else {
+			std::cerr << argv[0] << ": " << poptStrerror( rc ) 
+				<< ' ' << poptBadOption( pc, 0 ) << std::endl ;
+			return 1 ; 
+		}
+		if( rc == -1 ) break ; else rc = poptGetNextOpt(pc) ;
+	}
+
+	if( dbfiles.empty() )
+	{ 
+		dbfiles.push_back( "-" ) ;
+		if( debug_flags & debug_notquiet ) 
+			clog << "No databases given, using stdin." << endl ;
+	}
+	if( patfiles.empty() )
+	{
+		clog << "No patterns given, aborting." << endl ;
+		exit(1) ;
+	}
+
+	prefix_tree trie ;
+	for( size_t i = 0 ; i != patfiles.size() ; ++i )
+	{
+		ifstream *infile = 0 ;
+		if( patfiles[i] != "-" ) {
+			infile = new ifstream( patfiles[i].c_str() ) ;
+			if( debug_flags & debug_verbose ) clog << "reading patterns from " << patfiles[i] << endl ;
+			if( !infile || !*infile ) { cerr << "cannot read " << patfiles[i] << endl ; return 1 ; }
+		}
+		else if( debug_flags & debug_verbose ) clog << "reading patterns from STDIN" << endl ;
+
+		// add pattern to prefix tree
+		fasta_fac f( infile ? *infile : cin ) ;
+		fasta* ff ;
+		while ( (ff = f.get()) && !ff->null ) {
+			if( minlength <= ff->seq.size() )
+				trie.add_seq(
+						ff->seq.substr( chop5, ff->seq.size() - chop5 - chop3 ),
+						ff->headerline, !!ambi_codes ) ;
+			delete ff ;
+		}
+		delete infile ;
+	}
+
+	if( debug_flags & debug_verbose ) clog << "post-processing trie" << endl ;
+	trie.add_suffix_links() ;
+
+#ifndef NDEBUG
+	trie.debug( cerr ) ;
+#endif
+
+	mismatch_ptr *ptrs = 0 ;
+
+	// read database
+	string headerline ;
+	for( size_t i = 0 ; i != dbfiles.size() ; ++i ) {
+		ifstream *dbfile = 0 ;
+		if( dbfiles[i] != "-" ) {
+			dbfile = new ifstream( dbfiles[i].c_str() ) ;
+			if( debug_flags & debug_verbose ) clog << "processing database from " << dbfiles[i] << endl ;
+			if( !dbfile || !*dbfile ) { cerr << "cannot read " << dbfiles[i] << endl ; return 1 ; }
+		}
+		else if( debug_flags & debug_verbose ) clog << "processing database from STDIN" << endl ;
+
+		getline( dbfile ? *dbfile : cin, headerline ) ;
+		while( dbfile ? *dbfile : cin ) {
+			if ( headerline[0] != '>' ) {
+				cerr << "database is not in fasta format" << endl ;
+				exit( 2 ) ;
+			}
+			headerline = headerline.substr( 1, headerline.size()-1 ) ; 
+			genome_name = headerline.c_str() ; // global variable
+
+			string seq ;
+			bool eat_n = 0 ;
+			size_t last_n = 0 ;
+			position_genome = 1 ; // global variable
+
+			ptrs = trie.init( ptrs ) ;
+
+			while ( getline( dbfile ? *dbfile : cin, seq )
+					&& ( seq.empty() || seq[0] != '>' ) ) {
+				for ( size_t i = 0 ; i < seq.size() ; i++,position_genome++ ) 
+				{
+					if ( toupper( seq[i] ) == 'N' ) 
+						if ( eat_n ) continue ;
+						else { 
+							last_n ++ ;
+							if ( last_n == cutoff + 1 ) eat_n = 1 ;
+						}
+					else if ( eat_n ) { eat_n = 0 ; last_n = 0 ; }
+
+					if( debug_flags & debug_sequence ) std::clog << seq[i] << std::flush ;
+					ptrs = trie.compare( seq[i], ptrs ) ;
+				}
+			}
+			trie.compare( 0, ptrs ) ;
+			headerline = seq ;
+		}
+
+		delete dbfile ;
+	}
+	( output ? *output : cout ) << std::flush ;
+	delete output ;
+
+	if( debug_flags & debug_countnodes )
+		clog << "Fetched " << total_nodes << " nodes in total" << endl ;
+
+	poptFreeContext( pc ) ;
+	return 0 ;
+}
diff --git a/patman.1 b/patman.1
new file mode 100644
index 0000000..401427b
--- /dev/null
+++ b/patman.1
@@ -0,0 +1,301 @@
+.\" PatMaN DNA pattern matcher
+.\" (C) 2007 Kay Pruefer, Udo Stenzel
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or (at
+.\" your option) any later version.  See the LICENSE file for details.
+
+.\" Process this file with
+.\" groff -man -Tascii patman.1
+.\"
+.TH PATMAN 1 "JANUARY 2008" Applications "User Manuals"
+.SH NAME
+PatMaN \- search for approximate patterns in DNA libraries
+.SH SYNOPSIS
+.B patman [
+.I option
+.B |
+.I file
+.B ... ]
+.SH DESCRIPTION
+.B PatMaN
+searches for (small) 
+.I patterns 
+in (huge) DNA 
+.IR databases ,
+allowing for some mismatches and optionally gaps.  
+.I Patterns 
+and
+.I databases
+are read from one or more 
+.BR fasta (5)
+files listed as non-option arguments, depending on whether the
+.I -D
+or
+.I -P
+option last preceeded them, and matched against each other.  The output
+of
+.B PatMaN 
+is a table containing one line for each match, consisting of
+tab-separated fields: 
+.IP \(bu 4
+name of database sequence,
+.IP \(bu 4
+name of pattern,
+.IP \(bu 4
+position of first matched base in database sequence, the sequence's
+beginning has position 1,
+.IP \(bu 4
+position of last matched base in database sequence,
+.IP \(bu 4
+strand (+ for literal match, - for reverse complement),
+.IP \(bu 4
+edit distance (number of mismatches plus number of gaps).
+
+.SH OPTIONS
+.IP "-V, --version"
+Print version number and exit.
+
+.IP "-e num, --edits num"
+Allow up to
+.I num
+mismatches and/or gaps per match.
+
+.IP "-g num, --gaps num"
+Allow up to 
+.I num
+gaps per match.  Note that gaps count as mismatches, too, so the
+.I -e
+option should always be set at least as high as the
+.I -g
+option.  Allowing many gaps can incur a considerable computational cost.
+
+.IP "-D, --databases"
+Treat the following files as
+.IR database .
+Databases must be in
+.BR fasta (5)
+format.  Multiple 
+.I database
+files, including "-" for standard input, are allowed and are read in
+turn.
+
+.IP "-P, --patterns"
+Treat the following files as
+.IR patterns .
+Pattern files must be in
+.BR fasta (5)
+format.  Multiple 
+.I pattern
+files, including "-" for standard input, are allowed and are all read before
+touching the 
+.I databases.
+
+.IP "-o file, --output file"
+Redirect output to
+.IR file .
+The file name "-" causes output to be written to stdout, which is also
+the default
+
+.IP "-a, --ambicodes"
+Activate the interpretation of ambiguity codes in patterns.  This
+results in the expansion of any
+.I pattern
+with ambiguity codes into multiple patterns which can match
+independently.  Compare 
+.B Unknown Nucleotides
+below.
+
+.IP "-s, --singlestrand"
+Deactivate matching of reverse-complements.  Normally,
+.B PatMaN
+will try to match patterns both literally and after
+reverse-complementing them, with this option set, only straight forward
+matches are considered.
+
+.IP "-p num, --prefetch num"
+Causes 
+.I num
+pointers to be prefetched in advance.  This feature can improve
+performance, if
+.B PatMaN
+has been compiled for a processor architecture that supports
+prefetching.  The optimum value for your particular setup has to be
+determined empirically, but the default should be reasonably good.
+
+.IP "-l len, --min-length len"
+Only consider patterns with a length of at least 
+.IR len .
+Use this if your
+.I pattern
+collection contains short sequences that you don't want lots of possible
+matches reported for.
+
+.IP "-x num, --chop3 num"
+Cut off 
+.I num
+bases from the 3' end of each
+.I pattern.
+Use this for 
+.I patterns
+with damaged, edited, etc. 3' ends that should be ignored.  The chopped
+bases are neither matched nor included in the reported match regions.
+
+.IP "-X num, --chop5 num"
+Cut off 
+.I num
+bases from the 5' end of each
+.I pattern.
+Use this for 
+.I patterns
+with damaged, edited, etc. 5' ends that should be ignored.  The chopped
+bases are neither matched nor included in the reported match regions.
+
+.IP "-A, --adenine-hack"
+Allow adenine to be ignored in patterns.  This is essentially equivalent
+to not counting gaps in the 
+.I database,
+as long as it was an A that was gapped.  Using 
+.I -A
+can be computationally extremely expensive, both in terms of memory and
+time consumed.
+
+.IP "-q, --quiet"
+Suppress warnings (about unrecognized characters in input sequences or
+missing input files).  Even without 
+.IR -q ,
+at most one such warning is given per run.
+
+.IP "-v, --verbose"
+Prints additional progress information to stderr.
+
+.IP "-d flags, --debug flags"
+Sets debugging flags to
+.IR flags . Flags
+may be the logical
+.I OR
+of any of the following values, each of which causes some output to
+appear on 
+.IR stderr .
+Some of the values may only work if 
+.B PatMaN
+has been compiled in debug mode.  The default value is 1.
+
+.IP 1
+Print warnings.  Equivalent to not setting 
+.IR -q .
+
+.IP 2
+Print progress information.  Equivalent to setting
+.IR -v .
+
+.IP 4
+Dump the suffix trie of the
+.IR patterns .
+Only available in debug build.
+
+.IP 8
+Count number of visited nodes and print that number in each iteration.
+Only available in debug build.
+
+.IP 16
+Print total number of nodes fetched from memory after completing all
+.IR databases .
+
+.IP 32
+Output
+.I database
+sequence while it is being matched.
+
+.SH NOTES
+.SS Non-Option Arguments
+Non-option arguments (bare filenames) are either treated as 
+.I database
+or
+.I pattern
+files, depending on whether the
+.I -D
+or 
+.I -P
+option was the the last that occured before the filename.  If neither 
+.I -D
+nor
+.I -P
+was given, file names are treated as
+.I pattern 
+files.  If no 
+.I database
+was given, it is instead read from standard input.  Standard input can
+be explicitly given as either a
+.I database
+or a
+.I pattern
+file by using the filename "-".  A warning is given if standard input is
+selected implicitly as 
+.I database,
+an error message is given if no
+.I pattern
+files have been named at all.
+
+
+.SS Gapped Matching
+Allowing gaps often causes overlapping matches of single
+.I patterns
+at almost the same position. 
+.B PatMaN 
+makes no attempt to filter these redundant matches.  Also note that
+allowing many gaps, and especially allowing an arbitrary amount of gaps
+through the 
+.I -A
+hack can slow down
+.B PatMaN 
+considerably and cause it to produce enormous amounts of output.  The
+use of some sorty of post-processor to filter these is highly
+recommended.
+
+.SS Unknown Nucleotides
+Unknown nucleotides are most often encoded by the letter
+.BR N .
+If the
+.I --ambicodes
+option is not given, Ns in patterns are interpreted as unknown
+nucleotides and can never match without penalty.  If
+.I --ambicodes
+is given, Ns in 
+.I patterns 
+are expanded just like the other amibuguity codes, and effectively work
+as wildcards.  Unknown nucleotides can still be encoded by an
+.B X
+and will never match anything.  The database is treated differently in
+that anything other than 
+.IR A ", " C ", " G ", " T " and " U ,
+including ambiguity codes, is treated as unknown and can never match
+without penalty.
+
+.SH FILES
+.I /etc/popt
+.RS
+The system wide configuration file for
+.BR popt (3).
+.B PatMaN
+identifies itself as "patman" to popt.
+.RE
+
+.I ~/.popt
+.RS
+Per user configuration file for
+.BR popt (3).
+.RE
+
+.SH BUGS
+None known.
+
+.SH AUTHOR
+Kay Pruefer <pruefer at eva.mpg.de>
+.br
+Udo Stenzel <udo_stenzel at eva.mpg.de>
+
+.SH "SEE ALSO"
+.BR popt (3), fasta (5)
+
diff --git a/prefix_tree.cpp b/prefix_tree.cpp
new file mode 100644
index 0000000..6d53a4a
--- /dev/null
+++ b/prefix_tree.cpp
@@ -0,0 +1,668 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+
+#include "prefix_tree.h" 
+
+#include <cassert>
+#include <iostream>
+#include <queue>
+#include <cstdlib>
+
+using namespace std ;
+
+// Why a deque?  Because allocation of many small objects of
+// equal size is faster in it and deallocation of the whole
+// deque is much faster than deallocation of the small objects,
+// especially with our extremely simple allocation pattern.
+// This is probably the most sensible way to plug this memory
+// leak.
+prefix_tree::prefix_tree(  ) : nodes()
+{
+	nodes.push_back( node() ) ;
+}
+
+#ifndef NDEBUG
+void prefix_tree::debug( std::ostream& s, const node& n ) const
+{
+	if( debug_flags & debug_trie )
+	{
+		for( int i = 0 ; i != 5 ; ++i )
+			s << n.label << " -" << "ACGTN"[i] << "> " << n.childs[i]->label << '/' << (int)n.drop[i] << std::endl ;
+		if( n.suffix ) s << n.label << " ==> " << n.suffix->label << '/' << n.drop_suffix << std::endl ;
+
+		for( int i = 0 ; i != 5 ; ++i )
+			if( !n.drop[i] ) debug( s, *n.childs[i] ) ;
+	}
+}
+#endif
+
+
+static mismatch_ptr *junk_yard = 0 ;
+
+static mismatch_ptr *new_node()
+{
+	if( junk_yard )
+	{
+		mismatch_ptr *p = junk_yard ;
+		junk_yard = p->next ;
+		return p ;
+	}
+	else
+	{
+		return new mismatch_ptr ;
+	}
+}
+
+mismatch_ptr *prefix_tree::seed() const 
+{
+	// add ptr to the root node
+	mismatch_ptr *p = new_node() ;
+	p->ptr = &nodes[0] ;
+	p->mismatch = 0 ;
+	p->gaps = 0 ;
+	p->first = 0 ;
+	p->isgap = 0 ;
+	p->matched = 0 ;
+	p->next = 0 ;
+	return p ;
+}
+
+mismatch_ptr *prefix_tree::init( mismatch_ptr *p ) const 
+{
+	if( p ) {
+		mismatch_ptr *q = p ;
+		while( p->next ) p = p->next ;
+		p->next = junk_yard ;
+		junk_yard = q ;
+	}
+	return seed() ;
+}
+
+// creates child node # pos if necessary and returns ptr to this child node
+node* prefix_tree::create_go_node( node* n, int pos ) {
+	if ( n->childs[pos] == 0 ) {
+		node o ;
+		o.depth = n->depth + 1 ;
+		nodes.push_back( o ) ;
+		node *next = &nodes.back() ;
+		n->childs[pos] = next ;
+#ifndef NDEBUG
+		if( debug_flags & debug_trie )
+			n->childs[pos]->label = n->label + "ACGTN"[pos] ;
+#endif
+		return next ;
+	} else {
+		return n->childs[pos] ;
+	}
+}
+
+void prefix_tree::add_recursion( const string& seq, const string& name, node* ptr, size_t i, char strand ) 
+{
+	// is this a leaf?
+	if ( i == seq.size() ) {
+		
+		probe_s *probe = new probe_s() ;
+		probe->name = name ;
+		probe->strand = strand ;
+		probe->length = seq.size() ;
+		probe->next = ptr->probes ;
+		ptr->probes = probe ;
+	
+	// no leaf -> add 
+	} else {
+		switch ( toupper(seq[i]) ) {
+			case 'A':
+				add_recursion( seq, name, create_go_node( ptr, 0 ), ++i, strand ) ;
+				break ;
+			case 'C':
+				add_recursion( seq, name, create_go_node( ptr, 1 ), ++i, strand ) ;
+				break ;
+			case 'G':
+				add_recursion( seq, name, create_go_node( ptr, 2 ), ++i, strand ) ;
+				break ;
+			case 'T':
+			case 'U':
+				add_recursion( seq, name, create_go_node( ptr, 3 ), ++i, strand ) ;
+				break ;
+			case 'N':
+			case 'X':
+				add_recursion( seq, name, create_go_node( ptr, 4 ), ++i, strand ) ;
+				break ;
+			case '-':
+				break ;
+			default:
+				add_recursion( seq, name, create_go_node( ptr, 4 ), ++i, strand ) ;
+				if( debug_flags & debug_notquiet )
+					clog << "Warning: Sequence with name " << name 
+						<< " has character '" << seq[i]
+						<< "' other than ACGTUXN." << endl ;
+				debug_flags &= ~debug_notquiet ;
+				break ;
+		}
+		
+	}
+}
+
+/***
+ * Adding the sequence taking into account the ambiguity codes. 
+ * i.e. Sequence with "N" will be added to A, C, G and T node
+ */
+void prefix_tree::add_recursion_ambiguity( const string& seq, const string& name, node* ptr, size_t i, char strand ) 
+{
+	// is this a leaf?
+	if ( i == seq.size() ) {
+		
+		probe_s *probe = new probe_s() ;
+		probe->name = name ;
+		probe->strand = strand ;
+		probe->length = seq.size() ;
+		probe->next = ptr->probes ;
+		ptr->probes = probe ;
+	
+	// no leaf -> add 
+	} else {
+		switch ( toupper(seq[i]) ) {
+			case 'A':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				break ;
+			case 'C':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				break ;
+			case 'G':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				break ;
+			case 'T':
+			case 'U':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'N':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'R':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				break ;
+			case 'Y':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'M':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				break ;
+			case 'K':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'S':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				break ;
+			case 'W':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'B':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'D':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'H':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 3 ), i+1, strand ) ;
+				break ;
+			case 'V':
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 0 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 1 ), i+1, strand ) ;
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 2 ), i+1, strand ) ;
+				break ;
+
+			case '-':
+				break ;
+
+			default:
+				add_recursion_ambiguity( seq, name, create_go_node( ptr, 4 ), i+1, strand ) ;
+				if( debug_flags & debug_notquiet )
+					clog << "Warning: Sequence with name " << name 
+						<< " has character '" << seq[i]
+						<< "' other than ACGTU, BDH, SWKMRY, or XN." << endl ;
+				debug_flags &= ~debug_notquiet ;
+				break ;
+		}
+		
+	}
+}
+
+/* Ambiguity codes:
+ *                           R = G or A (puRine)
+ *                           Y = C or T (pYrimidine)
+ *                           M = A or C (aMine)
+ *                           K = G or T (Ketone)
+ *                           S = G or C (Strong coupling)
+ *                           W = A or T (Weak coupling)
+ *                           B = not A (C or G or T)
+ *                           D = not C (A or G or T)
+ *                           H = not G (A or C or T)
+ *                           V = not U (A or C or G)
+ *                           N = A or C or G or T (Nucleotide)
+ */
+
+/*
+ * reverse complement of a dna sequence, including correct ambiguity mapping
+ */
+inline static string reverse_complement( const string& s ) {
+	string ret ;
+	for ( int i = s.size()-1 ; i >= 0 ; i-- ) {
+		switch ( toupper( s[i] ) ) {
+			case 'A':
+				ret += 'T' ; break ;
+			case 'C':
+				ret += 'G' ; break ;
+			case 'G':
+				ret += 'C' ; break ;
+			case 'T':
+			case 'U':
+				ret += 'A' ; break ;
+			case 'R':
+				ret += 'Y' ; break ;
+			case 'Y':
+				ret += 'R' ; break ;
+			case 'M':
+				ret += 'K' ; break ;
+			case 'K':
+				ret += 'M' ; break ;
+			case 'B':
+				ret += 'V' ; break ;
+			case 'V':
+				ret += 'B' ; break ;
+			case 'D':
+				ret += 'H' ; break ;
+			case 'H':
+				ret += 'D' ; break ;
+			case 'S':
+			case 'W':
+			case 'N':
+			case '-':
+				ret += s[i] ;
+				break ; 
+			default:
+				clog << "Sequence has invalid character '"
+					 << s[i] << "'." << endl ;
+				exit( 1 ) ;
+		}
+	}
+	return ret ;
+}
+
+void prefix_tree::add_seq( const string& seq, const string& name, bool ambi_codes ) 
+{
+	node *ptr = &nodes[0] ;
+	int i = 0 ;
+	if ( ! ambi_codes ) {
+		add_recursion( seq, name, ptr, i, '+' ) ;
+		if ( only_plus_strand != 1 ) {
+			string rev_seq = reverse_complement( seq ) ;
+			add_recursion( rev_seq, name, ptr, i, '-' ) ;
+		}
+	} else {
+		add_recursion_ambiguity( seq, name, ptr, i, '+' ) ;
+		if ( only_plus_strand != 1 ) {
+			string rev_seq = reverse_complement( seq ) ;
+			add_recursion_ambiguity( rev_seq, name, ptr, i, '-' ) ;
+		}
+	}
+}
+
+// Suffix links.  These are only needed as an intermediate step.  Each
+// node gets a suffix link, which points to the node that is the longest
+// existing true suffix.  Given a node that already has a suffix, we
+// find the suffix of child "A" by following the "A" link from our
+// suffix and adding the 'drop' values for the "suffix" and the "A"
+// link.  We can then fill in the missing children for this node in the
+// same fashion.  To ensure that any node we visit already has a suffix,
+// we do a breadth-first search.
+//
+// The root never has a suffix.  To start the fill-in, each missing
+// child of the root becomes the root itself with a "drop" of one and
+// the existing ones get the root as suffix with a drop of one.
+//
+// Also, after we added the missing children, we no longer need the
+// suffix link for anything but accessing query sequences that happen to
+// be substrings of other queries.  So we can optimize these.
+
+void prefix_tree::add_suffix_links()
+{
+	std::queue< node* > q ;
+	node& root = nodes[0] ;
+	for( int i = 0 ; i != 5 ; ++i )
+	{
+		if( root.childs[i] )
+		{
+			root.childs[i]->suffix = &root ;
+			root.childs[i]->drop_suffix = 1 ;
+			q.push( root.childs[i] ) ;
+		}
+		else
+		{
+			root.childs[i] = &root ;
+			root.drop[i] = 1 ;
+		}
+	}
+
+	while( !q.empty() ) 
+	{
+		node* n = q.front() ;
+		q.pop() ;
+		for( int i = 0 ; i != 5 ; ++i )
+		{
+			if( n->childs[i] )
+			{
+				n->childs[i]->suffix = n->suffix->childs[i] ;
+				n->childs[i]->drop_suffix = n->drop_suffix + n->suffix->drop[i] ;
+				q.push( n->childs[i] ) ;
+			}
+			else
+			{
+				n->childs[i] = n->suffix->childs[i] ;
+				n->drop[i] = n->drop_suffix + n->suffix->drop[i] ;
+			}
+		}
+		if( !n->suffix->probes )
+		{
+			if( n->suffix->suffix )
+			{
+				n->drop_suffix += n->suffix->drop_suffix ;
+				n->suffix = n->suffix->suffix ;
+			}
+			else
+			{
+				n->suffix = 0 ;
+			}
+		}
+	}
+}
+
+inline static void out( int mm, const node* n, int mlen )
+{
+	for( const probe_s *probe = n->probes ; probe ; probe = probe->next )
+	{
+		( output ? *output : cout )
+		    << genome_name <<  '\t' << probe->name << '\t'
+			<< position_genome - mlen << '\t'
+			<< position_genome - 1 << '\t'
+			<< probe->strand << '\t' << mm << '\n' ;
+	}
+}
+
+// Why does this work?  Because when we follow suffix links (to print
+// matches that are completely inside other matches), we know that we
+// only drop perfectly matching stuff from the beginning.  This allows
+// us to update "mlen".
+ 
+inline static void out_rec( int mm, const node* n, int first, int mlen )
+{
+	if( n->probes ) out( mm, n, mlen ) ;
+
+	const node* nn = n ;
+	while( nn->suffix && nn->drop_suffix <= first )
+	{
+		first -= nn->drop_suffix ;
+		mlen -= nn->drop_suffix ;
+		nn = nn->suffix ;
+		if( nn->probes ) out( mm, nn, mlen ) ;
+	}
+}
+
+
+/* Advancing the mismatch pointers.
+ * - We receive a (singly linked) list of mps.  There has to be exactly
+ *   one mp that didn't encounter any mismatches in its history (keeping
+ *   this one ensures that the algorithm reduces to Aho-Corasick if
+ *   mismatches aren't allowed).  Too avoid to much special treatment,
+ *   this special pointer will always be the first in the list.  We
+ *   process it first, initalize the result list with it or re-seed the
+ *   result list, then go into the loop over the rest of the list.  This
+ *   implies an if-block in the middle of the loop and it also implies
+ *   that some pointers are only initialized in the loop.
+ *
+ * - We will keep multiple pointers into the two involved lists: one to
+ *   the end of the result list; this is where we splice in pointers to
+ *   be queued for the next round.  This list is at all time properly
+ *   terminated and disconnected from the stuff we're processing this
+ *   round.
+ *
+ * - One to the next node to be processed.  This one is easy, it is
+ *   advanced once per loop, it's the head of the incoming list and
+ *   we're done if it becomes null.
+ *
+ * - One several places ahead of the current one.  We prefetch the node
+ *   pointed to by this one and its next link.  Keeping another yet more
+ *   advanced pointer to prefetch mismatch pointers is useless because
+ *   of the need to follow the chain of next pointers.  This one has to
+ *   be valid at all times (see next point), so if it would become null,
+ *   we don't advance it.
+ *
+ * - When gapping, we need to splice new mps into the queue being
+ *   processed.  We do this after the node we prefetched last.  That
+ *   makes sense, because we already have that mp in cache, so we can
+ *   splice without penalty.  It works fine, as long as there is such a
+ *   pointer, and to ensure that, we first handle gaps (in this
+ *   direction) while the current node is still in the queue, then
+ *   cache the contents of the current node, handle the node itself,
+ *   thereby removing and or splicing it, then handle mismatches and/or
+ *   gaps.
+ */
+
+inline static mismatch_ptr *splice_fresh_after( mismatch_ptr *mp )
+{
+	if( junk_yard )
+	{
+		mismatch_ptr *p0 = junk_yard ;
+		junk_yard = junk_yard->next ;
+		p0->next = mp->next ;
+		mp->next = p0 ;
+		return p0 ;
+	}
+	else
+	{
+		mismatch_ptr *p0 = new mismatch_ptr ;
+		p0->next = mp->next ;
+		mp->next = p0 ;
+		return p0 ;
+	}
+}
+
+mismatch_ptr *prefix_tree::compare( char c, mismatch_ptr *ptrs ) const
+{
+	int pos ;
+	switch( c )
+	{
+		case '-': return ptrs ;
+		case 'A': case 'a': pos = 0 ; break ;
+		case 'C': case 'c': pos = 1 ; break ;
+		case 'G': case 'g': pos = 2 ; break ;
+		case 'T': case 't': pos = 3 ; break ;
+		case 'U': case 'u': pos = 3 ; break ;
+		default: pos = -1 ; break ; 
+	}
+
+	bool is_first = true ;
+	mismatch_ptr *new_head = 0, *last_done = 0 ;
+	mismatch_ptr *next_in_queue = ptrs ;
+	mismatch_ptr *next_to_prefetch = ptrs ;
+	for( int i = 0 ; i != do_prefetch && next_to_prefetch->next ; ++i )
+		next_to_prefetch = next_to_prefetch->next ;
+
+	while( next_in_queue ) 
+	{
+		++total_nodes ;
+		if( next_to_prefetch && next_to_prefetch->ptr  ) __builtin_prefetch( next_to_prefetch->ptr  ) ;
+		if( next_to_prefetch && next_to_prefetch->next ) __builtin_prefetch( next_to_prefetch->next ) ;
+
+		const node* n = next_in_queue->ptr ;
+		size_t mm = next_in_queue->mismatch ;
+		size_t gg = next_in_queue->gaps ;
+		int ff = next_in_queue->first ;
+		int nn = next_in_queue->matched ;
+
+		if( !next_in_queue->isgap ) out_rec( mm, n, is_first ? 255 : ff, nn ) ;
+
+		// If gaps are allowed, also add gapped nodes for
+		// processing _in_this_iteration_.  If
+		// discount_adenine is set, don't gap an A (makes
+		// bookkeeping difficult).
+				
+		if( gg < allow_gaps && mm < cutoff && n->depth > 0 ) {
+			for ( int i = discount_adenine ? 1 : 0 ; i != 5 ; i++ ) {
+				int ff_ = is_first ? n->depth-1 : ff ;
+				if( ff_ >= n->drop[i] ) {
+					assert( next_to_prefetch ) ;
+					mismatch_ptr *mp2 = splice_fresh_after( next_to_prefetch ) ;
+					mp2->ptr = n->childs[i] ;
+					mp2->mismatch = mm + 1 ;
+					mp2->gaps = gg + 1 ;
+					mp2->first = ff_ - n->drop[i] ;
+					mp2->isgap = 1 ;
+					mp2->matched = nn ;
+				}
+			}
+		}
+
+		// Are we gapping As?  then queue that up, no matter how many
+		// mismatches we've already accumulated.  (Note how this isn't
+		// even considered a gap, note that gapping the root actually
+		// makes a certain kind of sense.)
+
+		if( discount_adenine ) {
+			int ff_ = is_first ? n->depth-1 : ff ;
+			if( ff_ >= n->drop[0] && n->drop[0] ) {
+				assert( next_to_prefetch ) ;
+				mismatch_ptr *mp2 = splice_fresh_after( next_to_prefetch ) ;
+				mp2->ptr = n->childs[0] ;
+				mp2->mismatch = mm ;
+				mp2->gaps = gg ;
+				mp2->first = ff_ - n->drop[0] ;
+				mp2->isgap = 0 ;
+				mp2->matched = nn ;
+			}
+		}
+
+		// No longer is anything enqueued into the old queue, so we no
+		// longer need the next_to_prefetch pointer and can move it
+		// ahead.
+		if( next_to_prefetch->next ) next_to_prefetch = next_to_prefetch->next ;
+
+		// At this point, we no longer run into difficulties if the
+		// incoming queue gets emptied.  So now we move the pointer to its
+		// matching child and throw it out if dropping the prefix kills
+		// a mismatch.  Also, we can move the next_to_prefetch pointer,
+		// since it's not needed anymore in this iteration and might
+		// point to nowhere now.  Moreover, if we're at the first node
+		// (the one without any mismatches), this is the place to
+		// initialize the new list.
+
+		int pos_ = pos == -1 ? 4 : pos ;
+		if( is_first ) {
+			next_in_queue = ptrs->next ;
+			if( pos == -1 ) {
+				new_head = seed() ;
+				ptrs->next = junk_yard ;
+				junk_yard = ptrs ;
+			} else {
+				new_head = ptrs ;
+				new_head->matched += 1 - ptrs->ptr->drop[pos] ;
+				new_head->ptr = ptrs->ptr->childs[pos] ;
+				new_head->next = 0 ;
+			}
+			last_done = new_head ;
+		} else if( ff >= n->drop[pos_] ) {
+			next_in_queue->ptr = n->childs[pos_] ;
+			next_in_queue->first -= n->drop[pos_] ;
+			next_in_queue->matched ++ ;
+			next_in_queue->matched -= n->drop[pos_] ;
+			next_in_queue->isgap = 0 ;
+			assert( !last_done->next ) ;
+			last_done->next = next_in_queue ;
+			last_done = next_in_queue ;
+			next_in_queue = next_in_queue->next ;
+			last_done->next = 0 ;
+		} else {
+			mismatch_ptr *p = next_in_queue ;
+			next_in_queue = next_in_queue->next ;
+			p->next = junk_yard ;
+			junk_yard = p ;
+		}
+
+		// The following always produces new mismatches, so skip it
+		// completely if we're already at the limit.
+
+		if ( mm < cutoff ) {
+			// if gaps are allowed, always keep the node (but don't
+			// uselessly gap the root node)
+			if( gg < allow_gaps && n->depth > 0 )
+			{
+				assert( last_done ) ;
+				mismatch_ptr *p = splice_fresh_after( last_done ) ;
+				p->ptr = n ;
+				p->mismatch = mm + 1 ;
+				p->gaps = gg + 1 ;
+				p->isgap = 1 ;
+				p->first = is_first ? n->depth-1 : ff ;
+				p->matched = nn + 1 ;
+				last_done = p ;
+			}
+
+			// Take care of the children, but not the one that matches.
+			// (That one has already been handled.)
+
+			for ( int i = 0 ; i != 5 ; i++ ) {
+				if( i != pos ) {
+					// The child always exists, but we follow it only
+					// iff the first mismatch would not be forgotten by
+					// following a suffix link.
+					assert( n->childs[i] ) ;
+					int ff_ = is_first ? n->depth : ff ;
+					if( ff_ >= n->drop[i] )
+					{
+						mismatch_ptr *p = splice_fresh_after( last_done ) ;
+						p->ptr = n->childs[i] ;
+						p->first = ff_ - n->drop[i] ;
+						p->mismatch = mm + 1 ;
+						p->gaps = gg ;
+						p->isgap = 0 ;
+						p->matched = nn + 1 - n->drop[i] ;
+						assert( !p->next ) ;
+						assert( last_done->next == p ) ;
+						last_done = p ;
+					}
+				}
+			}
+		}
+		is_first = false ;
+		assert( !last_done->next ) ;
+	} 
+
+	if( debug_flags & debug_numnodes )
+	{
+		int n = 0 ;
+		for( mismatch_ptr *p = new_head ; p ; p=p->next ) ++n ;
+		std::clog << '\r' << n << std::endl;
+	}
+	return new_head ;
+}
+
diff --git a/prefix_tree.h b/prefix_tree.h
new file mode 100644
index 0000000..25b0609
--- /dev/null
+++ b/prefix_tree.h
@@ -0,0 +1,92 @@
+// PatMaN DNA pattern matcher
+// (C) 2007 Kay Pruefer, Udo Stenzel
+//
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 2 of the License, or (at
+// your option) any later version.  See the LICENSE file for details.
+
+#ifndef INCLUDED_PREFIX_TREE_H
+#define INCLUDED_PREFIX_TREE_H
+
+#include <deque>
+#include <string>
+
+#ifndef NDEBUG
+#include <istream>
+#endif
+
+#include "global.h"
+
+struct probe_s {
+	probe_s *next ;
+	std::string name ;
+	int length ;
+	char strand ; // + or -
+} ;
+
+struct node {
+	node* childs[5] ;
+	node* suffix ;
+	probe_s *probes ;
+	unsigned char drop[5] ;
+	unsigned char drop_suffix ;
+	unsigned char depth ;
+#ifndef NDEBUG
+	std::string label ;
+#endif
+
+	node() : suffix(0), probes(0), drop_suffix(0), depth(0)
+	{
+		for( int i = 0 ; i != 5 ; ++i )
+		{
+			childs[i] = 0 ;
+			drop[i] = 0 ;
+		}
+	}
+} ;
+
+/* Mismatch pointers (elsewhere called "walkers") point into the prefix
+ * trie, track the history of an alignment and are advanced according to
+ * the read database.
+ *
+ * We will keep mismatch pointers in a queue, implemented as singly
+ * linked list, and because we splice inside that list, it's probably
+ * best to implement it directly (instead of encapsulating everything
+ * behind an object).  Unused nodes will be collected in another list
+ * (the junk heap).
+ */
+struct mismatch_ptr {
+	struct mismatch_ptr *next ;
+	const node *ptr ;
+	unsigned char mismatch ;
+	unsigned char first ;
+	unsigned char matched ;
+	unsigned char isgap : 1 ;
+	unsigned char gaps : 7 ;
+} ;
+
+class prefix_tree 
+{
+	public:
+		prefix_tree( ) ;
+		void add_seq( const std::string& seq, const std::string& name, bool ambi_codes ) ;
+		mismatch_ptr *compare( char c, mismatch_ptr *ptrs ) const ;
+		void add_suffix_links() ;
+		mismatch_ptr *init( mismatch_ptr* ) const ; 
+
+#ifndef NDEBUG
+		void debug( std::ostream& s ) const { debug( s, nodes[0] ) ; }
+		void debug( std::ostream&, const node& ) const ;
+#endif
+
+	private:
+		node* create_go_node( node* n, int pos ) ;
+		mismatch_ptr *seed() const ; 
+		void add_recursion( const std::string& seq, const std::string& name, node* ptr, size_t i, char strand ) ;
+		void add_recursion_ambiguity( const std::string& seq, const std::string& name, node* ptr, size_t i, char strand ) ;
+
+		std::deque<node> nodes ;
+} ;
+
+#endif

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/patman.git



More information about the debian-med-commit mailing list