[med-svn] [libpsortb] 01/02: New upstream version 1.0+dfsg
Andreas Tille
tille at debian.org
Mon Apr 24 13:18:29 UTC 2017
This is an automated email from the git hooks/post-receive script.
tille pushed a commit to branch master
in repository libpsortb.
commit b8c45319613f970b3676c9cfe0d1f1093566e60c
Author: Andreas Tille <tille at debian.org>
Date: Mon Apr 24 15:10:53 2017 +0200
New upstream version 1.0+dfsg
---
AUTHORS | 3 +
COPYING | 340 +++
ChangeLog | 3 +
INSTALL | 24 +
Makefile.am | 6 +
NEWS | 0
README | 85 +
configure.ac | 74 +
modhmm0.92b/LICENCE | 23 +
modhmm0.92b/Makefile.am | 26 +
modhmm0.92b/S_TMHMM_0.92b.hmg.res | 0
modhmm0.92b/cmdline_hmmsearch.c.flc | 4 +
modhmm0.92b/cmdline_hmmsearch.h | 107 +
modhmm0.92b/core_algorithms_multialpha.c | 3391 ++++++++++++++++++++++++
modhmm0.92b/core_algorithms_multialpha.c.flc | 4 +
modhmm0.92b/debug_funcs.c | 1071 ++++++++
modhmm0.92b/funcs.h | 302 +++
modhmm0.92b/hmmsearch.c | 1337 ++++++++++
modhmm0.92b/hmmsearch.c.flc | 4 +
modhmm0.92b/readhmm.c | 1434 ++++++++++
modhmm0.92b/readhmm.c.flc | 4 +
modhmm0.92b/readhmm_multialpha.c | 1793 +++++++++++++
modhmm0.92b/readseqs_multialpha.c | 2054 +++++++++++++++
modhmm0.92b/readseqs_multialpha.c.flc | 4 +
modhmm0.92b/std_calculation_funcs.c | 1459 +++++++++++
modhmm0.92b/std_funcs.c | 2808 ++++++++++++++++++++
modhmm0.92b/std_funcs.c.flc | 4 +
modhmm0.92b/structs.h | 644 +++++
modhmm0.92b/training_algorithms_multialpha.c | 3629 ++++++++++++++++++++++++++
svmloc/Makefile.am | 10 +
svmloc/binding.cpp | 27 +
svmloc/binding.h | 28 +
svmloc/svmloc.cpp | 507 ++++
svmloc/svmloc.h | 90 +
34 files changed, 21299 insertions(+)
diff --git a/AUTHORS b/AUTHORS
new file mode 100644
index 0000000..c3d85e2
--- /dev/null
+++ b/AUTHORS
@@ -0,0 +1,3 @@
+This package was written by Matthew Laird of the Brinkman Laboratory at Simon Fraser University, Burnaby, BC, Canada.
+
+It uses a number of open source packages to build libraries needed by PSortb.
diff --git a/COPYING b/COPYING
new file mode 100644
index 0000000..623b625
--- /dev/null
+++ b/COPYING
@@ -0,0 +1,340 @@
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Library General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) year name of author
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Library General
+Public License instead of this License.
diff --git a/ChangeLog b/ChangeLog
new file mode 100644
index 0000000..d1ca604
--- /dev/null
+++ b/ChangeLog
@@ -0,0 +1,3 @@
+2008-06-24
+
+ - Initial version created
diff --git a/INSTALL b/INSTALL
new file mode 100644
index 0000000..7d62f52
--- /dev/null
+++ b/INSTALL
@@ -0,0 +1,24 @@
+PSortb libraries
+================
+
+INSTALLATION
+------------
+
+libpsortb has been tested under linux/x86, linux/x86_64, OSX
+
+Other operating systems are not supported at this time, but it doesn't mean the package won't work. Testing has been a matter of access to different OS and hardware platforms. If you have success with another platform, or would like our assistance porting the package to your platform please contact us.
+
+Installation should be as simple as:
+
+./configure
+make
+make install
+ldconfig
+
+configure takes the usual parameters, however the final libraries must be in a path that the dynamic linker from Perl can find.
+
+
+DEPENDENCIES
+------------
+
+libpsortb depends on an operating system which allows dynamically linked libraries. These will be loaded by Perl and Psortb runtime.
diff --git a/Makefile.am b/Makefile.am
new file mode 100644
index 0000000..562a307
--- /dev/null
+++ b/Makefile.am
@@ -0,0 +1,6 @@
+# Made by me.
+
+SUBDIRS = svmloc squid hmmer modhmm0.92b include
+
+noinst_HEADERS = \
+README
diff --git a/NEWS b/NEWS
new file mode 100644
index 0000000..e69de29
diff --git a/README b/README
new file mode 100644
index 0000000..5d68385
--- /dev/null
+++ b/README
@@ -0,0 +1,85 @@
+PSortb libraries
+================
+
+1) DESCRIPTION
+--------------
+
+libpsortb was created by Matthew Laird of the Brinkman Laboratory at Simon Fraser University. It was created to provide a more user-friendly installer for the various C and C++ libraries used by Psortb. This library must be successfully installed before the Perl bindings from the main Psortb package will work.
+
+The package consists of repackaged versions of the following open source tools:
+
+HMMER from Washington University in St. Louis. For more information on HMMER, see the project homepage at:
+
+ http://hmmer.wustl.edu/
+
+libsvm written by Chih-Chung Chang and Chih-Jen Lin at the National Taiwan University. For more information on libsvm see the project homepage at:
+
+http://www.csie.ntu.edu.tw/~cjlin/libsvm
+
+modhmm, part of the prodiv_tmhmm package by Hakan Viklund at Stockholm University. For more information on prodiv_tmhmm see the project homepage at:
+
+http://www.pdc.kth.se/~hakanv/prodiv-tmhmm/
+
+Full versions of the packages used as a basis for these customized versions of the libraries can be found at www.psort.org/downloads/src/
+
+The author invites feedback on this library bundle. If you find a bug, please send the information described in the BUGS section below.
+
+
+2) INSTALLATION
+---------------
+
+libpsortb has been tested under linux/x86, linux/x86_64
+
+Other operating systems are not supported at this time, but it doesn't mean the package won't work. Testing has been a matter of access to different OS and hardware platforms. If you have success with another platform, or would like our assistance porting the package to your platform please contact us.
+
+Installation should be as simple as:
+
+./configure
+make
+make install
+ldconfig
+
+configure takes the usual parameters, however the final libraries must be in a path that the dynamic linker from Perl can find.
+
+
+3) DEPENDENCIES
+---------------
+
+libpsortb depends on an operating system which allows dynamically linked libraries. These will be loaded by Perl and Psortb runtime.
+
+
+4) BUGS
+-------
+
+If you find a bug, please report it to the author along with the
+following information:
+
+ * version of Perl (output of 'perl -V' is best)
+ * version of libpsortb
+ * operating system type and version
+ * hardware description
+ * exact text of error message or description of problem
+
+If we don't have access to a system similar to yours, you may be asked
+to insert some debugging lines and report back on the results.
+The more help and information you can provide, the better.
+
+
+5) libpsortb COPYRIGHT AND LICENCE
+----------------------------------
+
+libpsortb is Copyright (C) 2008, the Brinkman Laboratory, Simon Fraser University. All rights reserved.
+
+HMMER, libsvm, and prodiv_tmhmm are the sole Copyright of their respective authors.
+
+This package is licensed under the terms of the GNU General Public
+License v2. For more information on the GPL, see the LICENSE file
+included with this distribution.
+
+
+6) AUTHOR INFORMATION
+---------------------
+
+libpsortb was originally written by Matthew Laird <lairdm at sfu.ca> of the Brinkman Laboratory at Simon Fraser University, Burnaby, BC, Canada.
+
+ http://www.pathogenomics.sfu.ca/brinkman
diff --git a/configure.ac b/configure.ac
new file mode 100644
index 0000000..a68371d
--- /dev/null
+++ b/configure.ac
@@ -0,0 +1,74 @@
+# -*- Autoconf -*-
+# Process this file with autoconf to produce a configure script.
+
+AC_PREREQ(2.60)
+AC_INIT([libpsortb], [1.0], [lairdm at sfu.ca])
+AC_CONFIG_SRCDIR([svmloc/binding.cpp])
+AC_CONFIG_HEADER([config.h])
+AM_INIT_AUTOMAKE
+
+# Checks for programs.
+AC_PROG_CXX
+AC_PROG_CC
+AC_PROG_CPP
+AC_PROG_INSTALL
+AC_PROG_LN_S
+AC_PROG_MAKE_SET
+AC_PROG_LIBTOOL
+
+# Checks for libraries.
+# FIXME: Replace `main' with a function in `-lm':
+AC_CHECK_LIB([m], [main])
+
+# We need to set lib64 for 64-bit versions of Linux
+libnn=lib
+case "${host_os}" in
+ linux*)
+ ## Not all distros use this: some choose to march out of step
+ case "${host_cpu}" in
+ x86_64|mips64|ppc64|powerpc64|sparc64|s390x)
+ libnn=lib64
+ ;;
+ esac
+ ;;
+ solaris*)
+ ## libnn=lib/sparcv9 ## on 64-bit only, but that's compiler-specific
+ ;;
+esac
+: ${LIBnn=$libnn}
+AC_SUBST(LIBnn)
+## take care not to override the command-line setting
+if test "${libdir}" = '${exec_prefix}/lib'; then
+ libdir='${exec_prefix}/${LIBnn}'
+fi
+
+
+# Checks for header files.
+AC_HEADER_STDC
+AC_CHECK_HEADERS([float.h limits.h memory.h netinet/in.h stdlib.h string.h unistd.h])
+
+# Checks for typedefs, structures, and compiler characteristics.
+AC_HEADER_STDBOOL
+AC_C_CONST
+AC_C_INLINE
+AC_TYPE_SIZE_T
+AC_HEADER_TIME
+AC_STRUCT_TM
+
+# Checks for library functions.
+AC_FUNC_FSEEKO
+AC_FUNC_MALLOC
+AC_FUNC_REALLOC
+AC_FUNC_STAT
+AC_FUNC_STRFTIME
+AC_FUNC_VPRINTF
+AC_FUNC_STRTOD
+AC_CHECK_FUNCS([atexit floor memmove memset pow sqrt strchr strcspn strerror strpbrk strrchr strspn strstr strtol])
+
+AC_CONFIG_FILES([Makefile
+ hmmer/Makefile
+ squid/Makefile
+ svmloc/Makefile
+ modhmm0.92b/Makefile
+ include/Makefile])
+AC_OUTPUT
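
The library-directory selection in configure.ac above can be read as a small decision function. The following is a minimal sketch of that logic only, not part of the commit; the function name pick_libnn is hypothetical, and the arguments stand in for the host_os and host_cpu values that configure would normally supply.

```shell
# Hypothetical helper mirroring the case statement in configure.ac.
# Usage: pick_libnn HOST_OS HOST_CPU
# Prints "lib64" for the listed 64-bit CPUs on Linux, otherwise "lib".
pick_libnn() {
    host_os=$1
    host_cpu=$2
    libnn=lib
    case "$host_os" in
        linux*)
            case "$host_cpu" in
                x86_64|mips64|ppc64|powerpc64|sparc64|s390x)
                    libnn=lib64
                    ;;
            esac
            ;;
    esac
    printf '%s\n' "$libnn"
}

pick_libnn linux-gnu x86_64   # prints lib64
pick_libnn linux-gnu i686     # prints lib
pick_libnn darwin x86_64      # prints lib
```

In configure.ac the result only replaces libdir when libdir still holds its default value, so an explicit --libdir given on the command line takes precedence. Distributions that do not install into lib64 would need that override.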
diff --git a/modhmm0.92b/LICENCE b/modhmm0.92b/LICENCE
new file mode 100644
index 0000000..28b5d69
--- /dev/null
+++ b/modhmm0.92b/LICENCE
@@ -0,0 +1,23 @@
+C The modhmm software package
+C
+C Copyright (C) Hakan Viklund
+C
+C
+C This program is free software; you can redistribute it and/or modify
+C it under the terms of the GNU General Public License as published by
+C the Free Software Foundation; version 2 of the License. With the
+C exception that if you use this program in any scientific work you have
+C to explicitly state that you have used 'modhmm' and cite the relevant
+C publication (depending on what you have used 'modhmm' for). My
+C publicationlist can be found at http://www.sbc.su.se/~hakanv/.
+C This program is distributed in the hope that it will be useful, but
+C WITHOUT ANY WARRANTY; without even the implied warranty of
+C MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+C General Public License for more details. You should have received a
+C copy of the GNU General Public License along with this program (in
+C the file gpl.txt); if not, write to the Free Software Foundation,
+C Inc., 59 Temple Place - Suite 330, Boston, MA 02111-13
+C
+C For support please use email to hakanv at sbc.su.se and add 'modhmm-support'
+C in the header
+C
diff --git a/modhmm0.92b/Makefile.am b/modhmm0.92b/Makefile.am
new file mode 100644
index 0000000..c72762c
--- /dev/null
+++ b/modhmm0.92b/Makefile.am
@@ -0,0 +1,26 @@
+lib_LTLIBRARIES = libmodhmm.la
+libmodhmm_la_SOURCES = \
+core_algorithms_multialpha.c \
+debug_funcs.c \
+hmmsearch.c \
+readhmm.c \
+readhmm_multialpha.c \
+readseqs_multialpha.c \
+std_calculation_funcs.c \
+std_funcs.c \
+training_algorithms_multialpha.c \
+cmdline_hmmsearch.h \
+funcs.h \
+structs.h
+
+noinst_HEADERS = \
+cmdline_hmmsearch.c.flc \
+core_algorithms_multialpha.c.flc \
+hmmsearch.c.flc \
+readhmm.c.flc \
+readseqs_multialpha.c.flc \
+std_funcs.c.flc \
+S_TMHMM_0.92b.hmg.res \
+LICENCE
+
+libmodhmm_la_LDFLAGS = -version-info 0:0:0
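
The -version-info 0:0:0 flag above uses libtool's current:revision:age scheme. As a rough illustration, assuming the usual Linux mapping (file suffix = (current - age).age.revision), the shared-object suffix can be derived as below. The function name version_info_to_suffix is hypothetical, and other platforms may map the triple differently.

```shell
# Hypothetical helper: convert a libtool current:revision:age triple
# into the version suffix typically used for shared objects on Linux.
version_info_to_suffix() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split "c:r:a" into $1 $2 $3
    IFS=$old_ifs
    current=$1
    revision=${2:-0}     # libtool treats omitted fields as 0
    age=${3:-0}
    printf '%s.%s.%s\n' "$((current - age))" "$age" "$revision"
}

version_info_to_suffix 0:0:0   # prints 0.0.0
version_info_to_suffix 3:2:1   # prints 2.1.2
```

Under that assumption, the 0:0:0 setting here would be expected to yield libmodhmm.so.0.0.0.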
diff --git a/modhmm0.92b/S_TMHMM_0.92b.hmg.res b/modhmm0.92b/S_TMHMM_0.92b.hmg.res
new file mode 100644
index 0000000..e69de29
diff --git a/modhmm0.92b/cmdline_hmmsearch.c.flc b/modhmm0.92b/cmdline_hmmsearch.c.flc
new file mode 100644
index 0000000..a69d49f
--- /dev/null
+++ b/modhmm0.92b/cmdline_hmmsearch.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17032 . 19365)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
diff --git a/modhmm0.92b/cmdline_hmmsearch.h b/modhmm0.92b/cmdline_hmmsearch.h
new file mode 100644
index 0000000..f704995
--- /dev/null
+++ b/modhmm0.92b/cmdline_hmmsearch.h
@@ -0,0 +1,107 @@
+/* cmdline_hmmsearch.h */
+
+/* File autogenerated by gengetopt version 2.12.2 */
+
+#ifndef CMDLINE_HMMSEARCH_H
+#define CMDLINE_HMMSEARCH_H
+
+/* If we use autoconf. */
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif /* __cplusplus */
+
+#ifndef CMDLINE_PARSER_PACKAGE
+#define CMDLINE_PARSER_PACKAGE "modhmms"
+#endif
+
+#ifndef CMDLINE_PARSER_VERSION
+#define CMDLINE_PARSER_VERSION "0.92"
+#endif
+
+struct gengetopt_args_info
+{
+ char * hmmnamefile_arg; /* model namefile for models in hmg format. */
+ char * seqnamefile_arg; /* sequence namefile (for sequences in fasta, smod, msamod or prfmod format). */
+ char * seqformat_arg; /* format of input sequences (fa=fasta, s=smod, msa=msamod, prf=prfmod). */
+ char * outpath_arg; /* output directory. */
+ char * freqfile_arg; /* background frequency file. */
+ char * smxfile_arg; /* substitution matrix file. */
+ char * replfile_arg; /* replacement letter file. */
+ char * priorfile_arg; /* sequence prior file (for msa input files). */
+ char * nullfile_arg; /* null model file. */
+ char * anchor_arg; /* hmm=results are hmm-anchored (default), seq=results are sequence anchored. */
+ int labeloutput_flag; /* output will print predicted labeling and posterior label probabilities (default=off). */
+ int alignmentoutput_flag; /* output will print log likelihood, log odds and reversi scores (default=off). */
+ char * msascoring_arg; /* scoring method for alignment and profile data options = DP/DPPI/GM/GMR/DPPI/PI/PIS default=GM. */
+ char * usecolumns_arg; /* specify which columns to use for alignment input data, options = all/nr, where all means use all columns
+and nr specifies a sequence in the alignment and the columns where this sequence has non-gap symbols are used
+default = all. */
+ int nolabels_flag; /* do not use labels even though the input sequences are labeled (default=off). */
+ int verbose_flag; /* print some information about what is going on (default=off). */
+ int max_d_flag; /* Retrain model on each sequence using Baum-Welch before scoring (default=off). */
+ int path_flag; /* Print most likely statepath (default=off). */
+ int nopostout_flag; /* no posterior probability information for label scoring (default=off). */
+ int nolabelout_flag; /* no predicted labeling for label scoring (default=off). */
+ int nollout_flag; /* no log likelihood score for alignment scoring (default=off). */
+ int nooddsout_flag; /* no log odds score for alignment scoring (default=off). */
+ int norevout_flag; /* no reversi score for alignment scoring (default=off). */
+ int alignpostout_flag; /* print posterior probability information for alignment scoring (default=off). */
+ int alignlabelout_flag; /* print predicted labeling for alignment scoring (default=off). */
+ int labelllout_flag; /* print log likelihood score for label scoring (default=off). */
+ int labeloddsout_flag; /* print log odds score for label scoring (default=off). */
+ int labelrevout_flag; /* print reversi score for label scoring (default=off). */
+
+ int help_given ; /* Whether help was given. */
+ int version_given ; /* Whether version was given. */
+ int hmmnamefile_given ; /* Whether hmmnamefile was given. */
+ int seqnamefile_given ; /* Whether seqnamefile was given. */
+ int seqformat_given ; /* Whether seqformat was given. */
+ int outpath_given ; /* Whether outpath was given. */
+ int freqfile_given ; /* Whether freqfile was given. */
+ int smxfile_given ; /* Whether smxfile was given. */
+ int replfile_given ; /* Whether replfile was given. */
+ int priorfile_given ; /* Whether priorfile was given. */
+ int nullfile_given ; /* Whether nullfile was given. */
+ int anchor_given ; /* Whether anchor was given. */
+ int labeloutput_given ; /* Whether labeloutput was given. */
+ int alignmentoutput_given ; /* Whether alignmentoutput was given. */
+ int msascoring_given ; /* Whether msascoring was given. */
+ int usecolumns_given ; /* Whether usecolumns was given. */
+ int nolabels_given ; /* Whether nolabels was given. */
+ int verbose_given ; /* Whether verbose was given. */
+ int viterbi_given ; /* Whether viterbi was given. */
+ int nbest_given ; /* Whether nbest was given. */
+ int forward_given ; /* Whether forward was given. */
+ int max_d_given ; /* Whether max_d was given. */
+ int path_given ; /* Whether path was given. */
+ int nopostout_given ; /* Whether nopostout was given. */
+ int nolabelout_given ; /* Whether nolabelout was given. */
+ int nollout_given ; /* Whether nollout was given. */
+ int nooddsout_given ; /* Whether nooddsout was given. */
+ int norevout_given ; /* Whether norevout was given. */
+ int alignpostout_given ; /* Whether alignpostout was given. */
+ int alignlabelout_given ; /* Whether alignlabelout was given. */
+ int labelllout_given ; /* Whether labelllout was given. */
+ int labeloddsout_given ; /* Whether labeloddsout was given. */
+ int labelrevout_given ; /* Whether labelrevout was given. */
+
+ int score_algs_group_counter; /* counter for group score_algs */
+} ;
+
+int cmdline_parser (int argc, char * const *argv, struct gengetopt_args_info *args_info);
+int cmdline_parser2 (int argc, char * const *argv, struct gengetopt_args_info *args_info, int override, int initialize, int check_required);
+
+void cmdline_parser_print_help(void);
+void cmdline_parser_print_version(void);
+
+void cmdline_parser_init (struct gengetopt_args_info *args_info);
+void cmdline_parser_free (struct gengetopt_args_info *args_info);
+
+#ifdef __cplusplus
+}
+#endif /* __cplusplus */
+#endif /* CMDLINE_HMMSEARCH_H */
diff --git a/modhmm0.92b/core_algorithms_multialpha.c b/modhmm0.92b/core_algorithms_multialpha.c
new file mode 100644
index 0000000..3faf74a
--- /dev/null
+++ b/modhmm0.92b/core_algorithms_multialpha.c
@@ -0,0 +1,3391 @@
+#include <stdio.h>
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "structs.h"
+#include "funcs.h"
+
+#define V_LIST_END -99
+#define V_LIST_NEXT -9
+
+//#define DEBUG_LABELING_UPDATE
+//#define DEBUG_DEALLOCATE_LABELINGS
+//#define DEBUG_FW
+//#define DEBUG_BW
+//#define DEBUG_VI
+//#define PATH
+
+extern int verbose;
+
+double dot_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method);
+double dot_prouct_picasso_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method,
+ double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4);
+double picasso_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method,
+ double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4);
+double picasso_sym_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method,
+ double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4);
+double sjolander_score_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method);
+double sjolander_reversed_score_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize,
+ int multi_scoring_method);
+double subst_mtx_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method);
+double subst_mtx_dot_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int a_index, int a_index_2,
+ int a_index_3, int a_index_4,
+ int normalize, int multi_scoring_method);
+double subst_mtx_dot_product_prior_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int a_index, int a_index_2,
+ int a_index_3, int a_index_4,
+ int normalize, int multi_scoring_method);
+double get_msa_emission_score_multi(struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp, int c, int v,
+ int use_labels, int normalize, int a_index, int a_index_2,
+ int a_index_3, int a_index_4, int scoring_method, int use_gap_shares,
+ int use_prior_shares, int multi_scoring_method, double *aa_freqs, double *aa_freqs_2,
+ double *aa_freqs_3, double *aa_freqs_4);
+double get_single_emission_score_multi(struct hmm_multi_s *hmmp, struct letter_s *seq, struct letter_s *seq_2,
+ struct letter_s *seq_3, struct letter_s *seq_4, int c, int v, int replacement_letter_c,
+ int replacement_letter_c_2, int replacement_letter_c_3,
+ int replacement_letter_c_4, int use_labels, int a_index, int a_index_2,
+ int a_index_3, int a_index_4, int multi_scoring_method);
+
+
+/************************* the forward algorithm **********************************/
+int forward_multi(struct hmm_multi_s *hmmp, struct letter_s *s, struct letter_s *s_2, struct letter_s *s_3,
+ struct letter_s *s_4, struct forward_s **ret_forw_mtxpp,
+ double **ret_scale_fspp, int use_labels, int multi_scoring_method)
+{
+ struct forward_s *forw_mtxp, *cur_rowp, *prev_rowp; /* pointers to forward matrix */
+ double *scale_fsp; /* pointer to array of scaling factors */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to the sequence */
+ int i,j; /* loop variables */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ double row_sum, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ int nr_v;
+ struct path_element *wp; /* for traversing the paths from v to w */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ int replacement_letter_c, replacement_letter_c_2, replacement_letter_c_3, replacement_letter_c_4;
+
+#ifdef DEBUG_FW
+ printf("running forward\n");
+#endif
+ /* Allocate memory for matrix and scaling factors + some initial setup:
+ * Note 1: forward probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_forw_mtxpp and *ret_scale_fspp are allocated here, but must be
+ * freed by caller */
+ nr_v = hmmp->nr_v;
+ seq_len = get_seq_length(s);
+ *ret_forw_mtxpp = (struct forward_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct forward_s)));
+ forw_mtxp = *ret_forw_mtxpp;
+ *ret_scale_fspp = (double*)(malloc_or_die((seq_len+2) * sizeof(double)));
+ scale_fsp = *ret_scale_fspp;
+
+ /* Convert sequence to 1...L for easier indexing */
+ seq = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, s, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, s_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, s_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, s_4, seq_len * sizeof(struct letter_s));
+ }
+
+ /* Initialize the first row of the matrix */
+ forw_mtxp->prob = 1.0; /* sets index (0,0) to 1.0,
+ the rest are already 0.0 as they should be */
+ *scale_fsp = 1.0;
+
+ /* Fill in middle rows */
+ prev_rowp = forw_mtxp;
+ cur_rowp = forw_mtxp + hmmp->nr_v;
+ for(c = 1; c <= seq_len; c++) {
+
+ /* get alphabet index for c*/
+ if(hmmp->alphabet_type == DISCRETE) {
+ replacement_letter_c = NO;
+ a_index = get_alphabet_index(&seq[c], hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = get_replacement_letter_index_multi(&seq[c], hmmp->replacement_letters, 1);
+ if(a_index < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c = YES;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ replacement_letter_c_2 = NO;
+ a_index_2 = get_alphabet_index(&seq_2[c], hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = get_replacement_letter_index_multi(&seq_2[c], hmmp->replacement_letters, 2);
+ if(a_index_2 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_2[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_2 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ replacement_letter_c_3 = NO;
+ a_index_3 = get_alphabet_index(&seq_3[c], hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = get_replacement_letter_index_multi(&seq_3[c], hmmp->replacement_letters, 3);
+ if(a_index_3 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_3[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_3 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ replacement_letter_c_4 = NO;
+ a_index_4 = get_alphabet_index(&seq_4[c], hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = get_replacement_letter_index_multi(&seq_4[c], hmmp->replacement_letters, 4);
+ if(a_index_4 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_4[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_4 = YES;
+ }
+ }
+ }
+#ifdef DEBUG_FW
+ printf("seq[c] = %s\n", (&seq[c])->letter);
+ if(hmmp->alphabet_type == DISCRETE) {
+ printf("a_index = %d\n", a_index);
+ }
+#endif
+ /* calculate sum of probabilities */
+ row_sum = 0;
+ for(v = 1; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_FW
+ printf("prob to vertex %d\n", v);
+#endif
+ res = 0.0;
+ wp = *(hmmp->tot_from_trans_array + v);
+ while((w = wp->vertex) != END) /* w = from-vertex */ {
+ /* calculate intermediate results */
+ res += (prev_rowp + w)->prob * *(hmmp->tot_transitions + (w * nr_v + v));
+ if(*(hmmp->tot_transitions + (w * nr_v + v)) < 0) {
+ printf("found model transition prob from %d to %d < 0.0\n", w, v);
+ exit(0);
+ }
+ wp++;
+#ifdef DEBUG_FW
+ printf("prev = %f: ", (prev_rowp + w)->prob);
+ printf("trans = %f\n", *(hmmp->tot_transitions + (w * nr_v + v)));
+#endif
+ }
+
+
+ /* calculate prob of producing letter l in v */
+ t_res3 = get_single_emission_score_multi(hmmp, seq, seq_2, seq_3, seq_4, c, v, replacement_letter_c, replacement_letter_c_2,
+ replacement_letter_c_3, replacement_letter_c_4, use_labels, a_index,
+ a_index_2, a_index_3, a_index_4, multi_scoring_method);
+
+ res = res * t_res3;
+ row_sum += res;
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+
+
+#ifdef DEBUG_FW
+ printf("res = %f\n", res);
+#endif
+ }
+
+ /* scale the results, row_sum = the total probability of
+ * having produced the sequence up to and including character c */
+ scale_fsp[c] = row_sum;
+#ifdef DEBUG_FW
+ printf("rowsum = %f\n",row_sum);
+ printf("scaling set to: %f\n", scale_fsp[c]);
+#endif
+ if(row_sum == 0.0) {
+ printf("Sequence cannot be produced by this hmm\n");
+ sequence_as_string(s);
+ printf("pos = %d\n",c);
+ return NOPROB;
+ }
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0){
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/row_sum; /* scaling */
+ }
+ }
+
+ /* move row pointers one row forward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+ }
+
+
+ /* Fill in transition to end state */
+ res = 0;
+ wp = *(hmmp->tot_from_trans_array + hmmp->nr_v - 1);
+ while(wp->vertex != END) {
+ t_res1 = (prev_rowp + wp->vertex)->prob;
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(wp->vertex, hmmp->nr_v-1, hmmp->nr_v));
+ if(t_res2 > 1.0) {
+ t_res2 = 1.0;
+ }
+ res += t_res1 * t_res2;
+ wp++;
+ }
+
+
+#ifdef DEBUG_FW
+ printf("res = %f\n", res);
+#endif
+ (cur_rowp + hmmp->nr_v - 1)->prob = res; /* obs: no scaling performed here */
+
+#ifdef DEBUG_FW
+ dump_forward_matrix(seq_len + 2, hmmp->nr_v, forw_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+
+ /* Garbage collection and return */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ return OK;
+}
+
+
+
+/**************************** the backward algorithm **********************************/
+int backward_multi(struct hmm_multi_s *hmmp, struct letter_s *s, struct letter_s *s_2, struct letter_s *s_3,
+ struct letter_s *s_4, struct backward_s **ret_backw_mtxpp,
+ double *scale_fsp, int use_labels, int multi_scoring_method)
+{
+ struct backward_s *backw_mtxp, *cur_rowp, *prev_rowp; /* pointers to backward matrix */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to the sequence */
+ int i; /* loop index */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double row_sum, res,t_res1,t_res2, t_res3; /* temporary variables to calculate results */
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ struct path_element *wp;
+ int replacement_letter_c, replacement_letter_c_2, replacement_letter_c_3, replacement_letter_c_4;
+
+ /* Allocate memory for matrix + some initial setup:
+ * Note 1: probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_backw_mtxpp is allocated here, but must be
+ * freed by caller
+ * Note 3: scale_fsp is the scaling array produced by forward_multi(), meaning backward_multi()
+ * must be called after forward_multi() */
+ seq_len = get_seq_length(s);
+ *ret_backw_mtxpp = (struct backward_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct backward_s)));
+ backw_mtxp = *ret_backw_mtxpp;
+
+ /* Convert sequence to 1...L for easier indexing */
+ seq = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, s, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, s_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, s_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, s_4, seq_len * sizeof(struct letter_s));
+ }
+
+
+ /* Initialize the last row of the matrix */
+ (backw_mtxp + get_mtx_index(seq_len + 1, hmmp->nr_v - 1,
+ hmmp->nr_v))->prob = 1.0; /* sets index
+ * (seq_len+1,nr_v-1) i.e.
+ * the lower right
+ * corner of the matrix to
+ * 1.0,the rest are already
+ * 0.0 as they should be*/
+
+ /* Fill in next to last row in matrix (i.e. add prob for path to end state for all states
+ * that have a transition path to the end state) */
+ prev_rowp = backw_mtxp + get_mtx_index(seq_len + 1 , 0, hmmp->nr_v);
+ cur_rowp = prev_rowp - hmmp->nr_v;
+ wp = *(hmmp->from_trans_array + hmmp->nr_v - 1);
+
+ while(wp->vertex != END) {
+ w = wp->vertex;
+ t_res1 = 1.0;
+ while(wp->next != NULL) {
+ t_res1 = t_res1 * *(hmmp->transitions +
+ get_mtx_index(wp->vertex, (wp + 1)->vertex, hmmp->nr_v));
+ wp++;
+ }
+ (cur_rowp + w)->prob += t_res1;
+ if((cur_rowp + w)->prob > 1.0) {
+ (cur_rowp + w)->prob = 1.0;
+ }
+ wp++;
+ }
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp - hmmp->nr_v;
+
+
+ /* Fill in first rows moving upwards in matrix */
+ for(c = seq_len; c >= 1; c--) {
+
+ /* get alphabet index for c */
+#ifdef DEBUG_BW
+ printf("c = %d\n", c);
+#endif
+
+ /* get alphabet index for c*/
+ if(hmmp->alphabet_type == DISCRETE) {
+ replacement_letter_c = NO;
+ a_index = get_alphabet_index(&seq[c], hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = get_replacement_letter_index_multi(&seq[c], hmmp->replacement_letters, 1);
+ if(a_index < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c = YES;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ replacement_letter_c_2 = NO;
+ a_index_2 = get_alphabet_index(&seq_2[c], hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = get_replacement_letter_index_multi(&seq_2[c], hmmp->replacement_letters, 2);
+ if(a_index_2 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_2[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_2 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ replacement_letter_c_3 = NO;
+ a_index_3 = get_alphabet_index(&seq_3[c], hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = get_replacement_letter_index_multi(&seq_3[c], hmmp->replacement_letters, 3);
+ if(a_index_3 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_3[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_3 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ replacement_letter_c_4 = NO;
+ a_index_4 = get_alphabet_index(&seq_4[c], hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = get_replacement_letter_index_multi(&seq_4[c], hmmp->replacement_letters, 4);
+ if(a_index_4 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_4[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_4 = YES;
+ }
+ }
+ }
+
+
+ /* calculate sum of probabilities */
+ for(v = 0; v < hmmp->nr_v - 1; v++) /* v = from-vertex */{
+#ifdef DEBUG_BW
+ printf("prob passing through vertex %d:\n", v);
+#endif
+
+ res = 0;
+ wp = *(hmmp->tot_to_trans_array + v);
+ while(wp->vertex != END) /* w = to-vertex */ {
+ /* total probability of transiting from v to w on all possible paths (via silent states) */
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(v, wp->vertex, hmmp->nr_v));
+ if(t_res2 < 0) {
+ printf("found model transition prob from %d to %d < 0.0\n", v, wp->vertex);
+ exit(0);
+ }
+ t_res1 = (prev_rowp + wp->vertex)->prob; /* probability of having produced the
+ * sequence after c passing through
+ * vertex w */
+
+
+ /* calculate prob of producing letter l in w (the to-vertex) */
+ t_res3 = get_single_emission_score_multi(hmmp, seq, seq_2, seq_3, seq_4, c, wp->vertex,
+ replacement_letter_c, replacement_letter_c_2,
+ replacement_letter_c_3, replacement_letter_c_4, use_labels, a_index,
+ a_index_2, a_index_3, a_index_4, multi_scoring_method);
+
+
+ res += t_res1 * t_res2 * t_res3;
+ wp++;
+ }
+#ifdef DEBUG_BW
+ printf("prev = %f: ", t_res1);
+ printf("trans = %f\n", t_res2);
+#endif
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+#ifdef DEBUG_BW
+ printf("res = %f\n", res);
+#endif
+ }
+
+ /* scale the results using the scaling factors produced by forward() */
+#ifdef DEBUG_BW
+ printf("scaling set to: %f\n", scale_fsp[c]);
+ printf("and c = %d\n", c);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0){
+ if(scale_fsp[c] != 0.0) {
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/scale_fsp[c]; /* scaling */
+ }
+ else {
+ printf("Sequence cannot be produced by this hmm\n");
+ sequence_as_string(s);
+ return NOPROB;
+ }
+ }
+ }
+
+ /* move row pointers one row backward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp - hmmp->nr_v;
+ }
+
+#ifdef DEBUG_BW
+ dump_backward_matrix(seq_len + 2, hmmp->nr_v, backw_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+
+ /* Garbage collection and return */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ return OK;
+}
+
+
+/************************* the viterbi algorithm **********************************/
+int viterbi_multi(struct hmm_multi_s *hmmp, struct letter_s *s, struct letter_s *s_2, struct letter_s *s_3,
+ struct letter_s *s_4, struct viterbi_s **ret_viterbi_mtxpp, int use_labels, int multi_scoring_method)
+{
+ struct viterbi_s *viterbi_mtxp, *cur_rowp;
+ struct viterbi_s *prev_rowp; /* pointers to viterbi matrix */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to the sequence */
+ int i; /* loop index */
+ int prev; /* nr of previous vertex in viterbi path */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double max, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ struct path_element *wp, *from_vp, *prevp; /* pointers to elements in to_trans_array and
+ * from_trans_array are used for retrieving the path
+ * from state a to state b */
+ int replacement_letter_c, replacement_letter_c_2, replacement_letter_c_3, replacement_letter_c_4;
+ int MARKOV_CHAIN; /* set to yes means all emission probs = 1.0, independent of the letter and the state distribution */
+
+
+ MARKOV_CHAIN = NO;
+ /* Allocate memory for matrix + some initial setup:
+ * Note 1: viterbi probability matrix has the vertex indices horizontally
+ * and the sequence indices vertically meaning it will be filled row by row
+ * Note 2: *ret_viterbi_mtxpp is allocated here, but must be
+ * freed by caller
+ * Note 3: The viterbi algorithm uses the to_trans_array to remember the path
+ * taken from a state to another (by letting the prevp point to the path in
+ * the to_trans_array that was taken to reach this state) */
+ seq_len = get_seq_length(s);
+ *ret_viterbi_mtxpp = (struct viterbi_s*)
+ (malloc_or_die((seq_len+2) * hmmp->nr_v * sizeof(struct viterbi_s)));
+ init_viterbi_s_mtx(*ret_viterbi_mtxpp, DEFAULT, (seq_len+2) * hmmp->nr_v);
+ viterbi_mtxp = *ret_viterbi_mtxpp;
+
+ /* Convert sequence to 1...L for easier indexing */
+ seq = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, s, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, s_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, s_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, s_4, seq_len * sizeof(struct letter_s));
+ }
+
+ /* Initialize the first row of the matrix */
+ viterbi_mtxp->prob = 0.0; /* sets index (0,0) to 0.0 (i.e. prob 1.0),
+ the rest are already DEFAULT as they should be */
+
+ /* Fill in middle rows */
+ prev_rowp = viterbi_mtxp;
+ cur_rowp = viterbi_mtxp + hmmp->nr_v;
+
+ for(c = 1; c <= seq_len; c++) {
+
+ /* get alphabet index for c*/
+ if(hmmp->alphabet_type == DISCRETE) {
+ replacement_letter_c = NO;
+ a_index = get_alphabet_index(&seq[c], hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = get_replacement_letter_index_multi(&seq[c], hmmp->replacement_letters, 1);
+ if(a_index < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c = YES;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ replacement_letter_c_2 = NO;
+ a_index_2 = get_alphabet_index(&seq_2[c], hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = get_replacement_letter_index_multi(&seq_2[c], hmmp->replacement_letters, 2);
+ if(a_index_2 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_2[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_2 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ replacement_letter_c_3 = NO;
+ a_index_3 = get_alphabet_index(&seq_3[c], hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = get_replacement_letter_index_multi(&seq_3[c], hmmp->replacement_letters, 3);
+ if(a_index_3 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_3[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_3 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ replacement_letter_c_4 = NO;
+ a_index_4 = get_alphabet_index(&seq_4[c], hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = get_replacement_letter_index_multi(&seq_4[c], hmmp->replacement_letters, 4);
+ if(a_index_4 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_4[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_4 = YES;
+ }
+ }
+ }
+
+
+ /* calculate the maximum log probability of reaching each vertex */
+ for(v = 1; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_VI
+ printf("prob to vertex %d:\n", v);
+#endif
+ prev = NO_PREV;
+ max = DEFAULT;
+ wp = *(hmmp->tot_from_trans_array + v);
+ res = 0;
+ while((w = wp->vertex) != END) /* w = from-vertex */ {
+ /* calculate intermediate results */
+ from_vp = wp;
+ t_res1 = (prev_rowp + w)->prob; /* probability of having produced the
+ * sequence up to the last letter ending
+ * in vertex w */
+ /* calculate intermediate results */
+ t_res2 = *(hmmp->max_log_transitions + (w * hmmp->nr_v + v));
+ if(t_res1 != DEFAULT && t_res2 != DEFAULT) {
+ res = t_res1 + t_res2;
+ }
+ else {
+ res = DEFAULT;
+ }
+
+ if(prev == NO_PREV && res != DEFAULT &&
+ (use_labels == NO || seq[c].label == *(hmmp->vertex_labels + v) || seq[c].label == '.')) {
+ max = res;
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ continue;
+ }
+#ifdef DEBUG_VI
+ printf("seqlabel = %c, vertexlabel = %c\n", seq[c].label,*(hmmp->vertex_labels + v));
+#endif
+ if(res > max && res != DEFAULT &&
+ (use_labels == NO || seq[c].label == *(hmmp->vertex_labels + v) || seq[c].label == '.')) {
+ max = res;
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ }
+ wp++;
+#ifdef DEBUG_VI
+ printf("prev in pos(%d) = %f: ",from_vp->vertex, t_res1);
+ printf("trans from %d to %d = %f\n", from_vp->vertex, v, t_res2);
+#endif
+ }
+
+#ifdef DEBUG_VI
+ printf("max before setting: %f\n", max);
+#endif
+
+ /* add the logprob of reaching state v to the logprob of producing letter l in v*/
+ if(MARKOV_CHAIN == YES) {
+ if((*((hmmp->log_emissions) + (v * (hmmp->a_size)) + a_index)) != DEFAULT &&
+ max != DEFAULT) {
+ max = max + 0.0;
+ }
+ else {
+ max = DEFAULT;
+ }
+ }
+ else {
+ /* calculate prob of producing letter l in v */
+ t_res3 = get_single_emission_score_multi(hmmp, seq, seq_2, seq_3, seq_4, c, v, replacement_letter_c, replacement_letter_c_2,
+ replacement_letter_c_3, replacement_letter_c_4, use_labels, a_index,
+ a_index_2, a_index_3, a_index_4, multi_scoring_method);
+ }
+
+ if(t_res3 == 0.0) {
+ max = DEFAULT;
+ }
+ else {
+ t_res3 = log10(t_res3);
+ if(max != DEFAULT) {
+ max = max + t_res3;
+ }
+ }
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = max;
+ (cur_rowp + v)->prev = prev;
+ (cur_rowp + v)->prevp = prevp;
+#ifdef DEBUG_VI
+ printf("max after setting: %f\n", max);
+#endif
+ }
+
+ /* move row pointers one row forward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+ }
+
+ /* Fill in transition to end state */
+#ifdef DEBUG_VI
+ printf("\ntransition to end state:\n");
+#endif
+ max = DEFAULT;
+ prev = NO_PREV;
+ prevp = NULL;
+ wp = *(hmmp->from_trans_array + hmmp->nr_v-1);
+ while(wp->vertex != END) /* w = from-vertex */ {
+ from_vp = wp;
+ t_res1 = (prev_rowp + wp->vertex)->prob; /* probability of having produced the
+ * sequence up to the last letter ending
+ * in vertex w */
+ t_res2 = 0.0;
+ while(wp->next != NULL) {
+ t_res2 = t_res2 + *((hmmp->log_transitions) +
+ get_mtx_index(wp->vertex, (wp+1)->vertex, hmmp->nr_v));
+ wp++;
+ }
+ t_res2 = t_res2 + *((hmmp->log_transitions) +
+ get_mtx_index(wp->vertex, hmmp->nr_v - 1, hmmp->nr_v));
+
+#ifdef DEBUG_VI
+ printf("prev = %f: ", t_res1);
+ printf("trans = %f\n", t_res2);
+#endif
+ if(t_res1 != DEFAULT && t_res2 != DEFAULT) {
+ res = t_res1 + t_res2;
+ }
+ else {
+ res = DEFAULT;
+ }
+ if(prev == NO_PREV && res != DEFAULT) {
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ max = res;
+ continue;
+ }
+ if(res > max && res != DEFAULT) {
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ max = res;
+ }
+ wp++;
+ }
+#ifdef DEBUG_VI
+ printf("res = %f\n", max);
+#endif
+ if(max != DEFAULT) {
+ (cur_rowp + hmmp->nr_v-1)->prob = max;
+ (cur_rowp + hmmp->nr_v-1)->prev = prev;
+ (cur_rowp + hmmp->nr_v-1)->prevp = prevp;
+ }
+ else {
+#ifdef DEBUG_VI
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->log_transitions);
+#endif
+ printf("Sequence cannot be produced by this hmm\n");
+ sequence_as_string(s);
+ return NOPROB;
+ }
+
+#ifdef PATH
+ //dump_viterbi_matrix(seq_len + 2, hmmp->nr_v, viterbi_mtxp);
+ //dump_viterbi_path((cur_rowp + w)->prevp);
+ printf("normalized log likelihood for most probable path = %f\n",
+ 0.0 - (((cur_rowp + hmmp->nr_v - 1)->prob) / seq_len));
+ printf("and most probable path is: ");
+ //dump_viterbi_path((cur_rowp + hmmp->nr_v - 1), hmmp, viterbi_mtxp, seq_len + 1, hmmp->nr_v);
+ printf("%d\n",hmmp->nr_v - 1);
+ printf("log prob = %f\n", (cur_rowp + hmmp->nr_v-1)->prob);
+ printf("real prob = %f\n", pow(10,(cur_rowp + hmmp->nr_v-1)->prob));
+ dump_viterbi_label_path((cur_rowp + hmmp->nr_v - 1), hmmp, viterbi_mtxp, seq_len + 1, hmmp->nr_v);
+ printf("\n");
+#endif
+
+ /* Garbage collection and return */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ return OK;
+}
+
+
+/************************* the one_best algorithm **********************************/
+int one_best_multi(struct hmm_multi_s *hmmp, struct letter_s *s, struct letter_s *s_2, struct letter_s *s_3,
+ struct letter_s *s_4, struct one_best_s **ret_one_best_mtxpp,
+ double **ret_scale_fspp, int use_labels, char *best_labeling, int multi_scoring_method)
+{
+ struct one_best_s *one_best_mtxp, *cur_rowp, *prev_rowp; /* pointers to one_best matrix */
+ double *scale_fsp; /* pointer to array of scaling factors */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to the sequence */
+ int *sorted_v_list; /* final list for the association of same-labeling-elements */
+ int v_list_index;
+ int i,j; /* loop variables */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ double row_sum, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ int nr_v;
+ struct path_element *wp; /* for traversing the paths from v to w */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ int replacement_letter_c, replacement_letter_c_2, replacement_letter_c_3, replacement_letter_c_4;
+
+ double scaled_result;
+
+ /* Allocate memory for matrix and scaling factors + some initial setup:
+ * Note 1: one_best probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_one_best_mtxpp and *ret_scale_fspp are allocated here, but must be
+ * freed by caller */
+ nr_v = hmmp->nr_v;
+ seq_len = get_seq_length(s);
+ *ret_one_best_mtxpp = (struct one_best_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct one_best_s)));
+ one_best_mtxp = *ret_one_best_mtxpp;
+ *ret_scale_fspp = (double*)(malloc_or_die((seq_len+2) * sizeof(double)));
+ scale_fsp = *ret_scale_fspp;
+ sorted_v_list = (int*)(malloc_or_die((hmmp->nr_v * 2 + 1) * sizeof(int)));
+
+ /* Convert sequence to 1...L for easier indexing */
+ seq = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, s, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, s_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, s_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, s_4, seq_len * sizeof(struct letter_s));
+ }
+
+ /* Initialize the first row of the matrix */
+ one_best_mtxp->prob = 1.0; /* sets index (0,0) to 1.0,
+ the rest are already 0.0 as they should be */
+ one_best_mtxp->labeling = (char*)(malloc_or_die(1 * sizeof(char)));
+ for(i = 1; i < hmmp->nr_v; i++) {
+ (one_best_mtxp+i)->labeling = NULL;
+ }
+ *scale_fsp = 1.0;
+
+ /* create initial sorted v-list*/
+ *(sorted_v_list) = 0; /* 0 is always the number of the start state */
+ *(sorted_v_list + 1) = V_LIST_NEXT;
+ *(sorted_v_list + 2) = V_LIST_END;
+
+
+ /* Fill in middle rows */
+ prev_rowp = one_best_mtxp;
+ cur_rowp = one_best_mtxp + hmmp->nr_v;
+ for(c = 1; c <= seq_len; c++) {
+
+ /* get alphabet index for c*/
+ if(hmmp->alphabet_type == DISCRETE) {
+ replacement_letter_c = NO;
+ a_index = get_alphabet_index(&seq[c], hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = get_replacement_letter_index_multi(&seq[c], hmmp->replacement_letters, 1);
+ if(a_index < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c = YES;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ replacement_letter_c_2 = NO;
+ a_index_2 = get_alphabet_index(&seq_2[c], hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = get_replacement_letter_index_multi(&seq_2[c], hmmp->replacement_letters, 2);
+ if(a_index_2 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_2[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_2 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ replacement_letter_c_3 = NO;
+ a_index_3 = get_alphabet_index(&seq_3[c], hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = get_replacement_letter_index_multi(&seq_3[c], hmmp->replacement_letters, 3);
+ if(a_index_3 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_3[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_3 = YES;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ replacement_letter_c_4 = NO;
+ a_index_4 = get_alphabet_index(&seq_4[c], hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = get_replacement_letter_index_multi(&seq_4[c], hmmp->replacement_letters, 4);
+ if(a_index_4 < 0) {
+ printf("Letter '%s' is not in alphabet\n", (&seq_4[c])->letter);
+ return NOPROB;
+ }
+ replacement_letter_c_4 = YES;
+ }
+ }
+ }
+
+#ifdef DEBUG_FW
+ printf("seq[c] = %s\n", (&seq[c])->letter);
+ printf("a_index = %d\n", a_index);
+ printf("v_list dump out ");
+ dump_v_list(sorted_v_list);
+
+#endif
+
+ /* calculate probabilities */
+ row_sum = 0.0;
+ for(v = 0; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_FW
+ printf("prob to vertex %d\n", v);
+#endif
+ v_list_index = 0;
+ (cur_rowp + v)->labeling = NULL;
+
+ while(*(sorted_v_list + v_list_index) != V_LIST_END) {
+ res = 0.0;
+ while(*(sorted_v_list + v_list_index) != V_LIST_NEXT) {
+ w = *(sorted_v_list + v_list_index); /* w = from-vertex */
+ /* calculate intermediate results */
+ res += (prev_rowp + w)->prob * *(hmmp->tot_transitions + (w * nr_v + v));
+ if(*(hmmp->tot_transitions + (w * nr_v + v)) < 0) {
+ printf("Error: found model transition prob from %d to %d < 0.0\n", w, v);
+ exit(0);
+ }
+#ifdef DEBUG_FW
+ printf("prev = %f: ", (prev_rowp + w)->prob);
+ printf("trans = %f\n", *(hmmp->tot_transitions + (w * nr_v + v)));
+#endif
+ v_list_index++;
+ }
+
+ if(res == 0.0) {
+ v_list_index++;
+ continue;
+ }
+
+ /* calculate prob of producing letter l in v */
+ t_res3 = get_single_emission_score_multi(hmmp, seq, seq_2, seq_3, seq_4, c, v, replacement_letter_c, replacement_letter_c_2,
+ replacement_letter_c_3, replacement_letter_c_4, use_labels, a_index,
+ a_index_2, a_index_3, a_index_4, multi_scoring_method);
+
+
+ res = res * t_res3;
+
+
+ /* check if this score is best so far */
+#ifdef DEBUG_FW
+ printf("best score = %f\n",(cur_rowp + v)->prob);
+#endif
+ if(res > (cur_rowp + v)->prob) {
+#ifdef DEBUG_FW
+ printf("updating best score to %f\n", res);
+#endif
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+ /* set pointer to point to current labeling */
+ (cur_rowp + v)->labeling = (prev_rowp + w)->labeling;
+ if((cur_rowp + v)->labeling == NULL) {
+ printf("Error: NULL labeling when updating best score\n");
+ exit(0);
+ }
+ }
+
+#ifdef DEBUG_FW
+ printf("res = %f\n", res);
+#endif
+ v_list_index++;
+ }
+ }
+
+ /* update labeling pointers */
+ update_labelings(cur_rowp, hmmp->vertex_labels, sorted_v_list, seq_len, c, hmmp->labels, hmmp->nr_labels, hmmp->nr_v);
+ deallocate_row_labelings(prev_rowp, hmmp->nr_v);
+
+ /* scale the results, row_sum = the total probability of
+ * having produced the labelings up to and including character c */
+ row_sum = 0.0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ row_sum = row_sum + (cur_rowp + v)->prob;
+#ifdef DEBUG_FW
+ dump_labeling((cur_rowp + v)->labeling, c);
+ printf("c = %d\n", c);
+#endif
+ }
+ scale_fsp[c] = row_sum;
+#ifdef DEBUG_FW
+ printf("rowsum = %f\n",row_sum);
+ printf("scaling set to: %f\n", scale_fsp[c]);
+#endif
+ if(row_sum == 0.0) {
+ printf("Sequence cannot be produced by this hmm\n");
+ sequence_as_string(s);
+ exit(0);
+ return NOPROB;
+ }
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0.0){
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/row_sum; /* scaling */
+ }
+ }
+
+ /* move row pointers one row forward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+ }
+
+ /* Fill in transition to end state */
+ v_list_index = 0;
+ (cur_rowp + hmmp->nr_v - 1)->labeling = NULL;
+ while(*(sorted_v_list + v_list_index) != V_LIST_END) {
+ res = 0.0;
+ while(*(sorted_v_list + v_list_index) != V_LIST_NEXT) {
+ w = *(sorted_v_list + v_list_index); /* w = from-vertex */
+ t_res1 = (prev_rowp + w)->prob;
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(w, hmmp->nr_v-1, hmmp->nr_v));
+ if(t_res2 > 1.0) {
+ t_res2 = 1.0;
+ }
+ res += t_res1 * t_res2;
+ v_list_index++;
+ }
+
+ /* check if this score is best so far */
+ if(res > (cur_rowp + hmmp->nr_v - 1)->prob) {
+ /* save result in matrices */
+ (cur_rowp + hmmp->nr_v - 1)->prob = res;
+ /* set pointer to point to current labeling */
+ (cur_rowp + hmmp->nr_v - 1)->labeling = (prev_rowp + w)->labeling;
+ }
+
+ v_list_index++;
+ }
+
+#ifdef DEBUG_FW
+ dump_one_best_matrix(seq_len + 2, hmmp->nr_v, one_best_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+
+ /* store results */
+ scaled_result = (cur_rowp + hmmp->nr_v - 1)->prob;
+ memcpy(best_labeling, ((cur_rowp + hmmp->nr_v - 1)->labeling) + 1, seq_len * sizeof(char));
+ best_labeling[seq_len] = '\0';
+#ifdef DEBUG_PATH
+ printf("seq_len = %d\n", seq_len);
+ printf("best labeling = %s\n", best_labeling);
+#endif
+
+ /* Garbage collection and return */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ free(sorted_v_list);
+ /* FREE labelings, except result labeling */
+ deallocate_row_labelings(prev_rowp, hmmp->nr_v);
+
+ return OK;
+}
+
+
+
+
+
+/*************************************************************************************
+ ************************** the forward, backward and viterbi algorithms *************
+ ************************** for scoring an msa against the hmm ***********************
+ *************************************************************************************/
+
+
+/************************* the msa_forward algorithm **********************************/
+int msa_forward_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int use_lead_columns,
+ int use_gap_shares, int use_prior_shares, struct forward_s **ret_forw_mtxpp, double **ret_scale_fspp,
+ int use_labels, int normalize, int scoring_method, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+ struct forward_s *forw_mtxp, *cur_rowp, *prev_rowp; /* pointers to forward matrix */
+ double *scale_fsp; /* pointer to array of scaling factors */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int i,j; /* general loop indices */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double row_sum, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ struct path_element *wp; /* for traversing the paths from v to w */
+ int nr_v;
+
+#ifdef DEBUG_FW
+ printf("entering msa_forward\n");
+#endif
+
+ /* Allocate memory for matrix and scaling factors + some initial setup:
+ * Note 1: forward probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_forw_mtxpp and *ret_scale_fspp are allocated here, but must be
+ * freed by caller */
+ nr_v = hmmp->nr_v;
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+ *ret_forw_mtxpp = (struct forward_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct forward_s)));
+ forw_mtxp = *ret_forw_mtxpp;
+ *ret_scale_fspp = (double*)(malloc_or_die((seq_len+2) * sizeof(double)));
+ scale_fsp = *ret_scale_fspp;
+
+ /* Initialize the first row of the matrix */
+ forw_mtxp->prob = 1.0; /* sets index (0,0) to 1.0,
+ the rest are already 0.0 as they should be */
+ *scale_fsp = 1.0;
+
+ /* Fill in middle rows */
+ prev_rowp = forw_mtxp;
+ cur_rowp = forw_mtxp + hmmp->nr_v;
+ j = 0;
+ if(use_lead_columns == YES) {
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c = 0;
+ }
+ while(c != END && c < msa_seq_infop->msa_seq_length) {
+ a_index_2 = -1;
+ a_index_3 = -1;
+ a_index_4 = -1;
+
+ if(hmmp->alphabet_type == DISCRETE) {
+
+ a_index = get_alphabet_index((msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->query_letter, hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = hmmp->a_size; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index((msa_seq_infop->msa_seq_2 + (c * (hmmp->a_size_2+1)))->query_letter,
+ hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = hmmp->a_size_2; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index((msa_seq_infop->msa_seq_3 + (c * (hmmp->a_size_3+1)))->query_letter,
+ hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = hmmp->a_size_3; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index((msa_seq_infop->msa_seq_4 + (c * (hmmp->a_size_4+1)))->query_letter,
+ hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = hmmp->a_size_4; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ /* calculate sum of probabilities */
+ row_sum = 0;
+ for(v = 1; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_FW
+ printf("prob to vertex %d:\n", v);
+#endif
+ res = 0;
+ wp = *(hmmp->tot_from_trans_array + v);
+ while((w = wp->vertex) != END) /* w = from-vertex */ {
+ /* calculate intermediate results */
+ res += (prev_rowp + w)->prob * *((hmmp->tot_transitions) + (w * nr_v + v));
+ if(*((hmmp->tot_transitions) + (w * nr_v + v)) < 0) {
+ printf("found model transition prob from %d to %d < 0.0\n", w, v);
+ exit(0);
+ }
+ wp++;
+#ifdef DEBUG_FW
+ printf("prev = %f: ",(prev_rowp + w)->prob );
+ printf("trans = %f\n",*((hmmp->tot_transitions) + (w * nr_v + v)) );
+#endif
+ }
+
+ /* calculate the prob of producing letters l in v*/
+ t_res3 = get_msa_emission_score_multi(msa_seq_infop, hmmp, c, v, use_labels, normalize, a_index, a_index_2,
+ a_index_3, a_index_4, scoring_method, use_gap_shares,
+ use_prior_shares, multi_scoring_method, aa_freqs, aa_freqs_2,
+ aa_freqs_3, aa_freqs_4);
+
+ res = res * t_res3;
+ row_sum += res;
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+#ifdef DEBUG_FW
+ printf("letter = %f\n", t_res3);
+ printf("res = %f\n", res);
+#endif
+ }
+
+ /* scale the results, row_sum = the total probability of
+ * having produced the sequence up to and including character c */
+ if(use_lead_columns == YES) {
+ scale_fsp[j+1] = row_sum;
+ }
+ else {
+ scale_fsp[c+1] = row_sum;
+ }
+
+#ifdef DEBUG_FW
+ printf("rowsum for row %d = %f\n", c, row_sum);
+ printf("scaling set to: %f\n", scale_fsp[c+1]);
+#endif
+ if(row_sum == 0.0) {
+ printf("Probability for this msa = 0.0, pos = %d\n", c);
+ printf("in forward\n");
+ exit(0);
+ return NOPROB;
+ }
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0){
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/row_sum; /* scaling */
+ }
+ }
+
+ /* move row pointers one row forward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+
+ /* update current column */
+ if(use_lead_columns == YES) {
+ j++;
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c++;
+ }
+ }
+
+
+ /* Fill in transition to end state */
+#ifdef DEBUG_FW
+ printf("\ntransition to end state:\n");
+#endif
+ res = 0;
+ wp = *(hmmp->tot_from_trans_array + hmmp->nr_v - 1);
+ while(wp->vertex != END) {
+ t_res1 = (prev_rowp + wp->vertex)->prob;
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(wp->vertex, hmmp->nr_v - 1, hmmp->nr_v));
+ if(t_res2 > 1.0) {
+ t_res2 = 1.0;
+ }
+ res += t_res1 * t_res2;
+
+ wp++;
+ }
+#ifdef DEBUG_FW
+ printf("res = %f\n", res);
+#endif
+ (cur_rowp + hmmp->nr_v - 1)->prob = res; /* note: no scaling performed here */
+
+#ifdef DEBUG_FW
+ dump_forward_matrix(seq_len + 2, hmmp->nr_v, forw_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+ printf("exiting msa_forward\n\n\n");
+#endif
+
+ /* Garbage collection and return */
+ return OK;
+}
+
+
+
+
+/**************************** the msa_backward algorithm **********************************/
+int msa_backward_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int use_lead_columns,
+ int use_gap_shares, struct backward_s **ret_backw_mtxpp, double *scale_fsp, int use_labels, int normalize,
+ int scoring_method, int multi_scoring_method, double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3,
+ double *aa_freqs_4)
+{
+ struct backward_s *backw_mtxp, *cur_rowp, *prev_rowp; /* pointers to backward matrix */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int i,j; /* general loop indices */
+ int all_columns; /* boolean for checking whether all columns have been looped over */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double row_sum, res,t_res1,t_res2, t_res3; /* temporary variables to calculate results */
+ struct path_element *wp;
+
+ /* Allocate memory for matrix + some initial setup:
+ * Note 1: probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_backw_mtxpp is allocated here, but must be
+ * freed by caller
+ * Note 3: scale_fsp is the scaling array produced by forward(), meaning backward()
+ * must be called after forward() */
+
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+ *ret_backw_mtxpp = (struct backward_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct backward_s)));
+ backw_mtxp = *ret_backw_mtxpp;
+
+ /* Initialize the last row of the matrix */
+ (backw_mtxp + get_mtx_index(seq_len + 1, hmmp->nr_v - 1,
+ hmmp->nr_v))->prob = 1.0; /* sets index
+ * (seq_len+1,nr_v-1) i.e.
+ * the lower right
+ * corner of the matrix to
+ * 1.0, the rest are already
+ * 0.0 as they should be */
+
+ /* Fill in next to last row in matrix (i.e. add prob for path to end state for all states
+ * that have a transition path to the end state) */
+ prev_rowp = backw_mtxp + get_mtx_index(seq_len + 1, 0, hmmp->nr_v);
+ cur_rowp = prev_rowp - hmmp->nr_v;
+ wp = *(hmmp->from_trans_array + hmmp->nr_v - 1);
+ while(wp->vertex != END) {
+ w = wp->vertex;
+ t_res1 = 1.0;
+ while(wp->next != NULL) {
+ t_res1 = t_res1 * *(hmmp->transitions +
+ get_mtx_index(wp->vertex, (wp + 1)->vertex, hmmp->nr_v));
+ wp++;
+ }
+ (cur_rowp + w)->prob += t_res1;
+ if((cur_rowp + w)->prob > 1.0) {
+ (cur_rowp + w)->prob = 1.0;
+ }
+ wp++;
+ }
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp - hmmp->nr_v;
+
+
+ /* Fill in first rows moving upwards in matrix */
+ all_columns = NO;
+ j = 1;
+ if(use_lead_columns == YES) {
+ c = *(msa_seq_infop->lead_columns_end - j);
+ }
+ else {
+ c = seq_len - 1;
+ }
+
+ while(c >= 0 && all_columns == NO) {
+ a_index_2 = -1;
+ a_index_3 = -1;
+ a_index_4 = -1;
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_index = get_alphabet_index((msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->query_letter, hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = hmmp->a_size; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index((msa_seq_infop->msa_seq_2 + (c * (hmmp->a_size_2+1)))->query_letter,
+ hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = hmmp->a_size_2; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index((msa_seq_infop->msa_seq_3 + (c * (hmmp->a_size_3+1)))->query_letter,
+ hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = hmmp->a_size_3; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index((msa_seq_infop->msa_seq_4 + (c * (hmmp->a_size_4+1)))->query_letter,
+ hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = hmmp->a_size_4; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ /* calculate sum of probabilities */
+ for(v = 0; v < hmmp->nr_v - 1; v++) /* v = from-vertex */{
+#ifdef DEBUG_BW
+ printf("prob passing through vertex %d:\n", v);
+#endif
+
+ res = 0;
+ wp = *(hmmp->tot_to_trans_array + v);
+ while(wp->vertex != END) /* w = to-vertex */ {
+ /* probability of transiting from v to w on this particular path */
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(v, wp->vertex, hmmp->nr_v));
+ if(t_res2 < 0) {
+ printf("found model transition prob from %d to %d < 0.0\n", v, wp->vertex);
+ exit(0);
+ }
+ t_res1 = (prev_rowp + wp->vertex)->prob; /* probability of having produced the
+ * sequence after c passing through
+ * vertex w */
+
+ /* calculate the prob of producing letters l in w (the to-vertex) */
+ t_res3 = get_msa_emission_score_multi(msa_seq_infop, hmmp, c, wp->vertex, use_labels, normalize, a_index, a_index_2,
+ a_index_3, a_index_4, scoring_method, use_gap_shares,
+ NO, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+
+ res += t_res1 * t_res2 * t_res3; /* multiply the probabilities and
+ * add to previous sum */
+ wp++;
+ }
+#ifdef DEBUG_BW
+ printf("prev = %f: ", t_res1);
+ printf("trans = %f\n", t_res2);
+#endif
+
+
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+#ifdef DEBUG_BW
+ printf("res = %f\n", res);
+#endif
+ }
+
+ /* scale the results using the scaling factors produced by forward() */
+#ifdef DEBUG_BW
+ printf("c = %d\n", c);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0){
+ if(use_lead_columns == YES) {
+ if(scale_fsp[msa_seq_infop->nr_lead_columns + 1 - j] != 0.0) {
+ (cur_rowp + v)->prob =
+ ((cur_rowp + v)->prob)/scale_fsp[msa_seq_infop->nr_lead_columns + 1 - j]; /* scaling */
+ }
+ else {
+ printf("This msa cannot be produced by hmm\n");
+ return NOPROB;
+ }
+ }
+ else {
+ if(scale_fsp[c+1] != 0.0) {
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/scale_fsp[c+1]; /* scaling */
+ }
+ else {
+ printf("This msa cannot be produced by hmm\n");
+ return NOPROB;
+ }
+ }
+ }
+ }
+
+ /* move row pointers one row backward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp - hmmp->nr_v;
+
+
+ /* update current column */
+ if(use_lead_columns == YES) {
+ if(c == *(msa_seq_infop->lead_columns_start)) {
+ all_columns = YES;
+ }
+ j++;
+ c = *(msa_seq_infop->lead_columns_end - j);
+ }
+ else {
+ c = c - 1;
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_backward_matrix(seq_len + 2, hmmp->nr_v, backw_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+
+ /* Garbage collection and return */
+ return OK;
+}
+
+
+/************************* the msa_viterbi algorithm **********************************/
+int msa_viterbi_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int use_lead_columns,
+ int use_gap_shares, int use_prior_shares, struct viterbi_s **ret_viterbi_mtxpp, int use_labels, int normalize,
+ int scoring_method, int multi_scoring_method, double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3,
+ double *aa_freqs_4)
+{
+ struct viterbi_s *viterbi_mtxp, *cur_rowp, *prev_rowp; /* pointers to viterbi matrix */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int i,j; /* general loop indices */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double max, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ struct path_element *wp, *from_vp, *prevp; /* for traversing the paths from v to w */
+ int nr_v;
+ int prev;
+ double seq_normalizer, state_normalizer;
+
+ /* Allocate memory for matrix + some initial setup:
+ * Note 1: viterbi probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_viterbi_mtxpp is allocated here, but must be
+ * freed by caller */
+ nr_v = hmmp->nr_v;
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+ *ret_viterbi_mtxpp = (struct viterbi_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct viterbi_s)));
+ init_viterbi_s_mtx(*ret_viterbi_mtxpp, DEFAULT, (seq_len+2) * hmmp->nr_v);
+ viterbi_mtxp = *ret_viterbi_mtxpp;
+
+ /* Initialize the first row of the matrix */
+ viterbi_mtxp->prob = 0.0; /* sets index (0,0) to 0.0 (i.e. prob 1.0),
+ the rest are already DEFAULT as they should be */
+
+ /* Fill in middle rows */
+ prev_rowp = viterbi_mtxp;
+ cur_rowp = viterbi_mtxp + hmmp->nr_v;
+ j = 0;
+ if(use_lead_columns == YES) {
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c = 0;
+ }
+
+
+
+ while(c != END && c < msa_seq_infop->msa_seq_length) {
+ a_index_2 = -1;
+ a_index_3 = -1;
+ a_index_4 = -1;
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_index = get_alphabet_index((msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->query_letter, hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = hmmp->a_size; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index((msa_seq_infop->msa_seq_2 + (c * (hmmp->a_size_2+1)))->query_letter,
+ hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = hmmp->a_size_2; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index((msa_seq_infop->msa_seq_3 + (c * (hmmp->a_size_3+1)))->query_letter,
+ hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = hmmp->a_size_3; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index((msa_seq_infop->msa_seq_4 + (c * (hmmp->a_size_4+1)))->query_letter,
+ hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = hmmp->a_size_4; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+
+ /* find the most probable path to each vertex */
+#ifdef DEBUG_VI
+ printf("label nr %d for pos %d = %c\n", j, c, (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label);
+#endif
+ for(v = 1; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_VI
+ printf("prob to vertex %d:\n", v);
+#endif
+ prev = NO_PREV;
+ max = DEFAULT;
+ wp = *(hmmp->tot_from_trans_array + v);
+ res = 0;
+ while((w = wp->vertex) != END) /* w = from-vertex */ {
+ /* calculate intermediate results */
+ from_vp = wp;
+ t_res1 = (prev_rowp + w)->prob; /* probability of having produced the
+ * sequence up to the last letter ending
+ * in vertex w */
+ /* look up the max log transition score from w to v */
+ t_res2 = *(hmmp->max_log_transitions + (w * hmmp->nr_v + v));
+ if(t_res1 != DEFAULT && t_res2 != DEFAULT) {
+ res = t_res1 + t_res2;
+ }
+ else {
+ res = DEFAULT;
+ }
+
+
+ if(prev == NO_PREV && res != DEFAULT &&
+ (use_labels == NO || (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label == *(hmmp->vertex_labels + v) ||
+ (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label == '.')) {
+ max = res;
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ continue;
+ }
+
+ if(res > max && res != DEFAULT &&
+ (use_labels == NO || (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label == *(hmmp->vertex_labels + v) ||
+ (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label == '.')) {
+ max = res;
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ }
+ wp++;
+#ifdef DEBUG_VI
+ printf("prev in pos(%d) = %f: ",from_vp->vertex, t_res1);
+ printf("trans from %d to %d = %f\n", from_vp->vertex, v, t_res2);
+ printf("tot score = %f\n", res);
+#endif
+ }
+
+#ifdef DEBUG_VI
+ printf("max before setting: %f\n", max);
+#endif
+
+ /* calculate the prob of producing letters l in v*/
+ t_res3 = get_msa_emission_score_multi(msa_seq_infop, hmmp, c, v, use_labels, normalize, a_index, a_index_2,
+ a_index_3, a_index_4, scoring_method, use_gap_shares,
+ use_prior_shares, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3,
+ aa_freqs_4);
+
+ if(t_res3 == 0.0) {
+ max = DEFAULT;
+ }
+ else {
+ t_res3 = log10(t_res3);
+ if(max != DEFAULT) {
+ max = max + t_res3;
+ }
+ }
+
+ /* save result in matrices */
+ (cur_rowp + v)->prob = max;
+ (cur_rowp + v)->prev = prev;
+ (cur_rowp + v)->prevp = prevp;
+#ifdef DEBUG_VI
+ printf("max after setting: %f\n", max);
+#endif
+ }
+
+ /* move row pointers one row forward */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+
+ /* update current column */
+ if(use_lead_columns == YES) {
+ j++;
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c++;
+ }
+ }
+
+
+ /* Fill in transition to end state */
+#ifdef DEBUG_VI
+ printf("\ntransition to end state:\n");
+#endif
+ max = DEFAULT;
+ prev = NO_PREV;
+ prevp = NULL;
+ wp = *(hmmp->from_trans_array + hmmp->nr_v-1);
+ while(wp->vertex != END) /* w = from-vertex */ {
+ from_vp = wp;
+ t_res1 = (prev_rowp + wp->vertex)->prob; /* probability of having produced the
+ * sequence up to the last letter ending
+ * in vertex w */
+ t_res2 = 0.0;
+ while(wp->next != NULL) {
+ t_res2 = t_res2 + *((hmmp->log_transitions) +
+ get_mtx_index(wp->vertex, (wp+1)->vertex, hmmp->nr_v));
+ wp++;
+ }
+ t_res2 = t_res2 + *((hmmp->log_transitions) +
+ get_mtx_index(wp->vertex, hmmp->nr_v - 1, hmmp->nr_v));
+
+#ifdef DEBUG_VI
+ printf("prev = %f: ", t_res1);
+ printf("trans = %f\n", t_res2);
+#endif
+ if(t_res1 != DEFAULT && t_res2 != DEFAULT) {
+ res = t_res1 + t_res2;
+ }
+ else {
+ res = DEFAULT;
+ }
+ if(prev == NO_PREV && res != DEFAULT) {
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ max = res;
+ continue;
+ }
+ if(res > max && res != DEFAULT) {
+ prev = from_vp->vertex;
+ prevp = from_vp;
+ max = res;
+ }
+ wp++;
+ }
+#ifdef DEBUG_VI
+ printf("res = %f\n", max);
+#endif
+ if(max != DEFAULT) {
+ (cur_rowp + hmmp->nr_v-1)->prob = max;
+ (cur_rowp + hmmp->nr_v-1)->prev = prev;
+ (cur_rowp + hmmp->nr_v-1)->prevp = prevp;
+ }
+ else {
+#ifdef DEBUG_VI
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->log_transitions);
+#endif
+ printf("This msa cannot be produced by hmm\n");
+ printf("inne\n"); /* leftover debug marker ("inne" is Swedish for "inside") */
+ return NOPROB;
+ }
+
+#ifdef PATH
+ //dump_viterbi_matrix(seq_len + 2, hmmp->nr_v, viterbi_mtxp);
+ //dump_viterbi_path((cur_rowp + w)->prevp);
+ printf("normalized log likelihood for most probable path = %f\n",
+ 0.0 - (((cur_rowp + hmmp->nr_v - 1)->prob) / seq_len));
+ printf("and most probable path is: ");
+ dump_viterbi_path((cur_rowp + hmmp->nr_v - 1), hmmp, viterbi_mtxp, seq_len + 1, hmmp->nr_v);
+ printf("%d\n",hmmp->nr_v - 1);
+ printf("log prob = %f\n", (cur_rowp + hmmp->nr_v-1)->prob);
+ printf("real prob = %f\n", pow(10,(cur_rowp + hmmp->nr_v-1)->prob));
+ dump_viterbi_label_path((cur_rowp + hmmp->nr_v - 1), hmmp, viterbi_mtxp, seq_len + 1, hmmp->nr_v);
+ printf("\n");
+#endif
+
+ /* Garbage collection and return */
+ return OK;
+}
+
+
+/************************* the msa_one_best algorithm **********************************/
+int msa_one_best_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int use_lead_columns,
+ int use_gap_shares, int use_prior_shares, struct one_best_s **ret_one_best_mtxpp, double **ret_scale_fspp,
+ int use_labels, char *best_labeling, int normalize, int scoring_method, int multi_scoring_method,
+ double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+ struct one_best_s *one_best_mtxp, *cur_rowp, *prev_rowp; /* pointers to one_best matrix */
+ double *scale_fsp; /* pointer to array of scaling factors */
+ int seq_len; /* length of the sequence */
+ int c, v, w; /* looping indices, c loops over the sequence,
+ v and w over the vertices in the HMM */
+ int i,j; /* general loop indices */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's position in the alphabet */
+ double row_sum, res, t_res1, t_res2, t_res3; /* temporary variables to calculate probabilities */
+ struct path_element *wp; /* for traversing the paths from v to w */
+ int nr_v;
+ double scaled_result;
+
+ int *sorted_v_list;
+ int v_list_index;
+
+ /* Allocate memory for matrix and scaling factors + some initial setup:
+ * Note 1: one_best probability matrix has the sequence indices vertically
+ * and the vertex indices horizontally meaning it will be filled row by row
+ * Note 2: *ret_one_best_mtxpp and *ret_scale_fspp are allocated here, but must be
+ * freed by caller */
+ nr_v = hmmp->nr_v;
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+ *ret_one_best_mtxpp = (struct one_best_s*)(malloc_or_die((seq_len+2) *
+ hmmp->nr_v *
+ sizeof(struct one_best_s)));
+ one_best_mtxp = *ret_one_best_mtxpp;
+ *ret_scale_fspp = (double*)(malloc_or_die((seq_len+2) * sizeof(double)));
+ scale_fsp = *ret_scale_fspp;
+ sorted_v_list = (int*)(malloc_or_die((hmmp->nr_v * 2 + 1) * sizeof(int)));
+
+ /* Initialize the first row of the matrix */
+ one_best_mtxp->prob = 1.0; /* sets index (0,0) to 1.0; the rest are expected to be 0.0 already,
+ which relies on malloc_or_die returning zeroed memory */
+ one_best_mtxp->labeling = (char*)(malloc_or_die(1 * sizeof(char)));
+ for(i = 1; i < hmmp->nr_v; i++) {
+ (one_best_mtxp+i)->labeling = NULL;
+ }
+ *scale_fsp = 1.0;
+
+ /* create initial sorted v-list */
+ *(sorted_v_list) = 0; /* 0 is always the number of the start state */
+ *(sorted_v_list + 1) = V_LIST_NEXT;
+ *(sorted_v_list + 2) = V_LIST_END;
+
+
+
+
+ /* Fill in middle rows */
+ prev_rowp = one_best_mtxp;
+ cur_rowp = one_best_mtxp + hmmp->nr_v;
+ j = 0;
+ if(use_lead_columns == YES) {
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c = 0;
+ }
+ while(c != END && c < msa_seq_infop->msa_seq_length) {
+ a_index_2 = -1;
+ a_index_3 = -1;
+ a_index_4 = -1;
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_index = get_alphabet_index((msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->query_letter, hmmp->alphabet, hmmp->a_size);
+ if(a_index < 0) {
+ a_index = hmmp->a_size; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index((msa_seq_infop->msa_seq_2 + (c * (hmmp->a_size_2+1)))->query_letter,
+ hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index_2 < 0) {
+ a_index_2 = hmmp->a_size_2; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index((msa_seq_infop->msa_seq_3 + (c * (hmmp->a_size_3+1)))->query_letter,
+ hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index_3 < 0) {
+ a_index_3 = hmmp->a_size_3; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index((msa_seq_infop->msa_seq_4 + (c * (hmmp->a_size_4+1)))->query_letter,
+ hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index_4 < 0) {
+ a_index_4 = hmmp->a_size_4; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ }
+
+ /* calculate sum of probabilities */
+#ifdef DEBUG_FW
+ printf("label for pos %d = %c\n", c, (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label);
+#endif
+ for(v = 1; v < hmmp->nr_v - 1; v++) /* v = to-vertex */ {
+#ifdef DEBUG_FW
+ printf("prob to vertex %d:\n", v);
+#endif
+
+ v_list_index = 0;
+ (cur_rowp + v)->labeling = NULL;
+ while(*(sorted_v_list + v_list_index) != V_LIST_END) {
+ res = 0.0;
+ while(*(sorted_v_list + v_list_index) != V_LIST_NEXT) {
+ w = *(sorted_v_list + v_list_index); /* w = from-vertex */
+ /* calculate intermediate results */
+ res += (prev_rowp + w)->prob * *(hmmp->tot_transitions + (w * nr_v + v));
+ if(*(hmmp->tot_transitions + (w * nr_v + v)) < 0) {
+ printf("found model transition prob from %d to %d < 0.0\n", w, v);
+ exit(1); /* a negative transition probability is an error, so exit non-zero */
+ }
+#ifdef DEBUG_FW
+ printf("prev = %f: ", (prev_rowp + w)->prob);
+ printf("trans = %f\n", *(hmmp->tot_transitions + (w * nr_v + v)));
+#endif
+ v_list_index++;
+ }
+
+
+ /* calculate the probability of emitting the letters of column c in vertex v */
+ t_res3 = get_msa_emission_score_multi(msa_seq_infop, hmmp, c, v, use_labels, normalize, a_index, a_index_2,
+ a_index_3, a_index_4, scoring_method, use_gap_shares,
+ use_prior_shares, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3,
+ aa_freqs_4);
+
+ res = res * t_res3;
+
+ /* check if this score is best so far */
+#ifdef DEBUG_FW
+ printf("best score = %f\n",(cur_rowp + v)->prob);
+#endif
+ if(res > (cur_rowp + v)->prob) {
+#ifdef DEBUG_FW
+ printf("updating best score to %f\n", res);
+#endif
+ /* save result in matrices */
+ (cur_rowp + v)->prob = res;
+ /* set pointer to point to current labeling */
+ (cur_rowp + v)->labeling = (prev_rowp + w)->labeling;
+ if((cur_rowp + v)->labeling == NULL) {
+ printf("Error: NULL labeling when updating best score\n");
+ exit(1); /* a NULL labeling here is an error, so exit non-zero */
+ }
+ }
+
+#ifdef DEBUG_FW
+ printf("res = %f\n", res);
+#endif
+ v_list_index++;
+ }
+ }
+
+ /* update labeling pointers */
+ if(use_lead_columns == YES) {
+ update_labelings(cur_rowp, hmmp->vertex_labels, sorted_v_list, seq_len, j+1, hmmp->labels, hmmp->nr_labels, hmmp->nr_v);
+ }
+ else {
+ update_labelings(cur_rowp, hmmp->vertex_labels, sorted_v_list, seq_len, c+1, hmmp->labels, hmmp->nr_labels, hmmp->nr_v);
+ }
+ deallocate_row_labelings(prev_rowp, hmmp->nr_v);
+
+ /* scale the results, row_sum = the total probability of
+ * having produced the labelings up to and including character c */
+ row_sum = 0.0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ row_sum = row_sum + (cur_rowp + v)->prob;
+ }
+
+ if(use_lead_columns == YES) {
+ scale_fsp[j+1] = row_sum;
+ }
+ else {
+ scale_fsp[c+1] = row_sum;
+ }
+
+
+#ifdef DEBUG_FW
+ printf("rowsum for row %d = %f\n", c, row_sum);
+ printf("scaling set to: %f\n", row_sum); /* the value just stored in scale_fsp */
+#endif
+ if(row_sum == 0.0) {
+ printf("Probability for this msa = 0.0\n");
+ return NOPROB;
+ }
+ for(v = 0; v < hmmp->nr_v; v++) {
+ if((cur_rowp + v)->prob != 0){
+ (cur_rowp + v)->prob = ((cur_rowp + v)->prob)/row_sum; /* scaling */
+ }
+ }
+
+ /* advance both row pointers by one row of the one_best matrix */
+ prev_rowp = cur_rowp;
+ cur_rowp = cur_rowp + hmmp->nr_v;
+
+ /* update current column */
+ if(use_lead_columns == YES) {
+ j++;
+ c = *(msa_seq_infop->lead_columns_start + j);
+ }
+ else {
+ c++;
+ }
+ }
+
+
+ /* Fill in transition to end state */
+ v_list_index = 0;
+ (cur_rowp + hmmp->nr_v - 1)->labeling = NULL;
+ while(*(sorted_v_list + v_list_index) != V_LIST_END) {
+ res = 0.0;
+ while(*(sorted_v_list + v_list_index) != V_LIST_NEXT) {
+ w = *(sorted_v_list + v_list_index); /* w = from-vertex */
+ t_res1 = (prev_rowp + w)->prob;
+ t_res2 = *((hmmp->tot_transitions) + get_mtx_index(w, hmmp->nr_v-1, hmmp->nr_v));
+ if(t_res2 > 1.0) {
+ t_res2 = 1.0;
+ }
+ res += t_res1 * t_res2;
+ v_list_index++;
+ }
+
+ /* check if this score is best so far */
+ if(res > (cur_rowp + hmmp->nr_v - 1)->prob) {
+ /* save result in matrices */
+ (cur_rowp + hmmp->nr_v - 1)->prob = res;
+ /* set pointer to point to current labeling */
+ (cur_rowp + hmmp->nr_v - 1)->labeling = (prev_rowp + w)->labeling;
+ }
+
+ v_list_index++;
+ }
+#ifdef DEBUG_FW
+ dump_one_best_matrix(seq_len + 2, hmmp->nr_v, one_best_mtxp);
+ dump_scaling_array(seq_len + 1, scale_fsp);
+#endif
+
+ /* store results */
+ scaled_result = (cur_rowp + hmmp->nr_v - 1)->prob;
+ memcpy(best_labeling, ((cur_rowp + hmmp->nr_v - 1)->labeling) + 1, seq_len * sizeof(char));
+ best_labeling[seq_len] = '\0';
+#ifdef DEBUG_PATH
+ printf("seq_len = %d\n", seq_len);
+ printf("best labeling = %s\n", best_labeling);
+#endif
+
+ /* Garbage collection and return */
+ free(sorted_v_list);
+ /* FREE labelings, except result labeling */
+ deallocate_row_labelings(prev_rowp, hmmp->nr_v);
+
+ return OK;
+}
+
+
+
+
+/************************* helper functions *********************************/
+
+
+
+double dot_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double seq_normalizer;
+ double state_normalizer;
+
+ t_res3_tot = 0.0;
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_dp_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1, c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_dp_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2, c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_dp_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3, c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_dp_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4, c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double dot_product_picasso_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double seq_normalizer;
+ double state_normalizer;
+
+ t_res3_tot = 0.0;
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_dp_picasso_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1,
+ c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_dp_picasso_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_dp_picasso_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_dp_picasso_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+
+double picasso_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double seq_normalizer;
+ double state_normalizer;
+
+ t_res3_tot = 0.0;
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_picasso_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1, c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs);
+
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_picasso_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2, c,
+ hmmp->emissions_2, v, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_picasso_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3, c,
+ hmmp->emissions_3, v, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_picasso_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4, c,
+ hmmp->emissions_4, v, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double picasso_sym_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double seq_normalizer;
+ double state_normalizer;
+
+ t_res3_tot = 0.0;
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+
+
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_picasso_sym_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1, c,
+ hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_picasso_sym_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2, c,
+ hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_picasso_sym_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3, c,
+ hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_picasso_sym_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4, c,
+ hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+
+ }
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+
+
+double sjolander_score_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_tot, t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ double seq_normalizer;
+ double state_normalizer;
+
+
+ t_res3_tot = 0.0;
+ t_res3_1 = 1.0;
+ t_res3_2 = 1.0;
+ t_res3_3 = 1.0;
+ t_res3_4 = 1.0;
+
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_sjolander_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1, c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_sjolander_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE){
+ t_res3_3 = get_sjolander_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_sjolander_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double sjolander_reversed_score_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int normalize, int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_tot, t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ double seq_normalizer;
+ double state_normalizer;
+
+
+ t_res3_tot = 0.0;
+ t_res3_1 = 1.0;
+ t_res3_2 = 1.0;
+ t_res3_3 = 1.0;
+ t_res3_4 = 1.0;
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_sjolander_reversed_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1,
+ c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_sjolander_reversed_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_sjolander_reversed_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_sjolander_reversed_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ fprintf(stderr, "Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double subst_mtx_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int normalize, int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ /* Note: normalization not used for this scoring method */
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+ t_res3_tot = 0.0;
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_subst_mtx_product_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1,
+ c, hmmp->emissions,
+ v, hmmp->subst_mtx);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_subst_mtx_product_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, hmmp->subst_mtx_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_subst_mtx_product_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, hmmp->subst_mtx_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_subst_mtx_product_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, hmmp->subst_mtx_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ fprintf(stderr, "Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double subst_mtx_dot_product_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop, int c, int v, int a_index, int a_index_2,
+ int a_index_3, int a_index_4, int normalize,
+ int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+
+
+
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+ t_res3_tot = 0.0;
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_subst_mtx_dot_product_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1,
+ c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares, a_index, hmmp->subst_mtx);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_subst_mtx_dot_product_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares, a_index_2, hmmp->subst_mtx_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_subst_mtx_dot_product_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares, a_index_3, hmmp->subst_mtx_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_subst_mtx_dot_product_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares, a_index_4, hmmp->subst_mtx_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ fprintf(stderr, "Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+double subst_mtx_dot_product_prior_multi(struct hmm_multi_s *hmmp, int use_prior_shares, int use_gap_shares,
+ struct msa_sequences_multi_s *msa_seq_infop,
+ int c, int v, int a_index, int a_index_2, int a_index_3, int a_index_4,
+ int normalize, int multi_scoring_method)
+{
+ int i,j;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4, t_res3_tot;
+ double rest_share;
+ double default_share;
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+
+ t_res3_1 = 0.0;
+ t_res3_2 = 0.0;
+ t_res3_3 = 0.0;
+ t_res3_4 = 0.0;
+ t_res3_tot = 0.0;
+
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ t_res3_1 = get_subst_mtx_dot_product_prior_statescore(hmmp->a_size, use_gap_shares, use_prior_shares, msa_seq_infop->msa_seq_1,
+ c, hmmp->emissions,
+ v, normalize, msa_seq_infop->gap_shares, a_index, hmmp->subst_mtx);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->nr_occurences > 0.0) {
+ t_res3_1 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res3_1 += get_single_gaussian_statescore((*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3))),
+ (*((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(c, 0, hmmp->a_size+1))->share) *
+ *((hmmp->emissions) + (v * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_1 = 1.0;
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ t_res3_2 = get_subst_mtx_dot_product_prior_statescore(hmmp->a_size_2, use_gap_shares, use_prior_shares,
+ msa_seq_infop->msa_seq_2,
+ c, hmmp->emissions_2,
+ v, normalize, msa_seq_infop->gap_shares, a_index_2, hmmp->subst_mtx_2);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->nr_occurences > 0.0) {
+ t_res3_2 = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res3_2 += get_single_gaussian_statescore((*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3))),
+ (*((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(c, 0, hmmp->a_size_2+1))->share) *
+ *((hmmp->emissions_2) + (v * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_2 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ t_res3_3 = get_subst_mtx_dot_product_prior_statescore(hmmp->a_size_3, use_gap_shares, use_prior_shares,
+ msa_seq_infop->msa_seq_3,
+ c, hmmp->emissions_3,
+ v, normalize, msa_seq_infop->gap_shares, a_index_3, hmmp->subst_mtx_3);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->nr_occurences > 0.0) {
+ t_res3_3 = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res3_3 += get_single_gaussian_statescore((*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3))),
+ (*((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(c, 0, hmmp->a_size_3+1))->share) *
+ *((hmmp->emissions_3) + (v * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_3 = 1.0;
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ t_res3_4 = get_subst_mtx_dot_product_prior_statescore(hmmp->a_size_4, use_gap_shares, use_prior_shares,
+ msa_seq_infop->msa_seq_4,
+ c, hmmp->emissions_4,
+ v, normalize, msa_seq_infop->gap_shares, a_index_4, hmmp->subst_mtx_4);
+ }
+ else {
+ if((msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->nr_occurences > 0.0) {
+ t_res3_4 = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res3_4 += get_single_gaussian_statescore((*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3))),
+ (*((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 1))),
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(c, 0, hmmp->a_size_4+1))->share) *
+ *((hmmp->emissions_4) + (v * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res3_4 = 1.0;
+ }
+ }
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3_tot = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3_tot *= t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3_tot *= t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3_tot *= t_res3_4;
+ }
+ return t_res3_tot;
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ fprintf(stderr, "Error: only JOINT_PROB scoring is implemented\n");
+ exit(1);
+ }
+}
+
+
+double get_msa_emission_score_multi(struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp, int c, int v,
+ int use_labels, int normalize, int a_index, int a_index_2,
+ int a_index_3, int a_index_4, int scoring_method, int use_gap_shares,
+ int use_prior_shares, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4)
+{
+
+ double t_res3;
+
+
+ /* multiply the prob of reaching state v with the prob of producing letters l in v */
+ t_res3 = 0.0;
+ if(use_labels == YES && (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label != *(hmmp->vertex_labels + v) &&
+ (msa_seq_infop->msa_seq_1 + (c * (hmmp->a_size+1)))->label != '.') { /* label mismatch: score stays 0.0 */
+ }
+ /* calculate the simple dot product of the hmm-state vector and the msa vector */
+ else if(scoring_method == DOT_PRODUCT) {
+ t_res3 += dot_product_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize, multi_scoring_method);
+ }
+ else if(scoring_method == DOT_PRODUCT_PICASSO) {
+ t_res3 += dot_product_picasso_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize,
+ multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ }
+ else if(scoring_method == PICASSO) {
+ t_res3 += picasso_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize, multi_scoring_method,
+ aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ }
+
+ else if(scoring_method == PICASSO_SYM) {
+ t_res3 += picasso_sym_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize,
+ multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ }
+ /* calculate sjolander score for hmm-state vector and msa vector */
+ else if(scoring_method == SJOLANDER) {
+ t_res3 += sjolander_score_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize, multi_scoring_method);
+ }
+ /* calculate reversed sjolander score for hmm-state vector and msa vector */
+ else if(scoring_method == SJOLANDER_REVERSED) {
+ t_res3 += sjolander_reversed_score_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize,
+ multi_scoring_method);
+ }
+
+ /* calculate the joint sum of the emissions-vector and the msa-vector
+ * multiplying each result with the probability of the two amino acids being related, taken from
+ * a given substitution matrix */
+ else if(scoring_method == SUBST_MTX_PRODUCT) {
+ t_res3 += subst_mtx_product_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, normalize,
+ multi_scoring_method);
+ }
+ /* a simpler and faster form of substitution mtx product, which multiplies the values of amino acids i
+ * in the two columns and multiplies this value with a scaling factor taken from the substitution matrix, which
+ * depends on the original aa in the query sequence and aa i */
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT) {
+ t_res3 += subst_mtx_dot_product_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, a_index,
+ a_index_2, a_index_3, a_index_4, normalize,
+ multi_scoring_method);
+ }
+ /* like subst mtx dot product, but with a prior prob for the part of the seq-column not used in the
+ * multiplication */
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT_PRIOR) {
+ t_res3 += subst_mtx_dot_product_prior_multi(hmmp, use_prior_shares, use_gap_shares, msa_seq_infop, c, v, a_index,
+ a_index_2, a_index_3, a_index_4, normalize,
+ multi_scoring_method);
+ }
+
+ return t_res3;
+}
+
+
+double get_single_emission_score_multi(struct hmm_multi_s *hmmp, struct letter_s *seq, struct letter_s *seq_2,
+ struct letter_s *seq_3, struct letter_s *seq_4, int c, int v, int replacement_letter_c,
+ int replacement_letter_c_2, int replacement_letter_c_3,
+ int replacement_letter_c_4, int use_labels, int a_index, int a_index_2,
+ int a_index_3, int a_index_4, int multi_scoring_method)
+{
+ double t_res3;
+ double t_res3_1, t_res3_2, t_res3_3, t_res3_4;
+ double cur_t_res3;
+ int i,j;
+ int cur_alphabet;
+ double *cur_probs, *cur_emissions;
+ int cur_a_size;
+ int cur_a_index;
+ int cur_replacement_letter;
+ int cur_alphabet_type;
+ struct letter_s *cur_seq;
+
+
+ for(cur_alphabet = 1; cur_alphabet <= hmmp->nr_alphabets; cur_alphabet++) {
+ if(cur_alphabet == 1) {
+ cur_a_size = hmmp->a_size;
+ cur_a_index = a_index;
+ cur_probs = hmmp->replacement_letters->probs_1;
+ cur_emissions = hmmp->emissions;
+ cur_replacement_letter = replacement_letter_c;
+ cur_alphabet_type = hmmp->alphabet_type;
+ cur_seq = seq;
+ }
+ else if(cur_alphabet == 2) {
+ cur_a_size = hmmp->a_size_2;
+ cur_a_index = a_index_2;
+ cur_probs = hmmp->replacement_letters->probs_2;
+ cur_emissions = hmmp->emissions_2;
+ cur_replacement_letter = replacement_letter_c_2;
+ cur_alphabet_type = hmmp->alphabet_type_2;
+ cur_seq = seq_2;
+ }
+ else if(cur_alphabet == 3) {
+ cur_a_size = hmmp->a_size_3;
+ cur_a_index = a_index_3;
+ cur_probs = hmmp->replacement_letters->probs_3;
+ cur_emissions = hmmp->emissions_3;
+ cur_replacement_letter = replacement_letter_c_3;
+ cur_alphabet_type = hmmp->alphabet_type_3;
+ cur_seq = seq_3;
+ }
+ else if(cur_alphabet == 4) {
+ cur_a_size = hmmp->a_size_4;
+ cur_a_index = a_index_4;
+ cur_probs = hmmp->replacement_letters->probs_4;
+ cur_emissions = hmmp->emissions_4;
+ cur_replacement_letter = replacement_letter_c_4;
+ cur_alphabet_type = hmmp->alphabet_type_4;
+ cur_seq = seq_4;
+ }
+
+ if(cur_alphabet_type == DISCRETE) {
+ if(cur_replacement_letter == YES) {
+ /* count emission prob with dot-product method */
+ cur_t_res3 = 0.0;
+ if(use_labels == YES && seq[c].label != *(hmmp->vertex_labels + v) && seq[c].label != '.') {
+ cur_t_res3 = 0.0;
+ }
+ else {
+ for(i = 0; i < cur_a_size; i++) {
+ cur_t_res3 += *(cur_probs + get_mtx_index(cur_a_index, i, cur_a_size)) *
+ *(cur_emissions + get_mtx_index(v,i,cur_a_size));
+ }
+ }
+ }
+ else {
+ if(use_labels == YES && seq[c].label != *(hmmp->vertex_labels + v) && seq[c].label != '.') {
+ cur_t_res3 = 0.0;
+ }
+ else {
+ cur_t_res3 = (*((cur_emissions) + (v * (cur_a_size)) + cur_a_index));
+ }
+ }
+ }
+ else {
+ if(use_labels == YES && seq[c].label != *(hmmp->vertex_labels + v) && seq[c].label != '.') {
+ cur_t_res3 = 0.0;
+ }
+ else {
+ cur_t_res3 = 0.0;
+ for(j = 0; j < cur_a_size / 3; j++) {
+ cur_t_res3 += get_single_gaussian_statescore((*((cur_emissions) + (v * (cur_a_size)) + (j * 3))),
+ (*((cur_emissions) + (v * (cur_a_size)) + (j * 3 + 1))),
+ cur_seq[c].cont_letter) *
+ *((cur_emissions) + (v * (cur_a_size)) + (j * 3 + 2));
+ }
+ }
+ }
+
+ if(cur_alphabet == 1) {
+ t_res3_1 = cur_t_res3;
+ }
+ else if(cur_alphabet == 2) {
+ t_res3_2 = cur_t_res3;
+ }
+ else if(cur_alphabet == 3) {
+ t_res3_3 = cur_t_res3;
+ }
+ else if(cur_alphabet == 4) {
+ t_res3_4 = cur_t_res3;
+ }
+ }
+
+ /* calculate total res for this to-vertex (v) */
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res3 = t_res3_1;
+ if(hmmp->nr_alphabets > 1) {
+ t_res3 = t_res3 * t_res3_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ t_res3 = t_res3 * t_res3_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ t_res3 = t_res3 * t_res3_4;
+ }
+ }
+
+ return t_res3;
+}
diff --git a/modhmm0.92b/core_algorithms_multialpha.c.flc b/modhmm0.92b/core_algorithms_multialpha.c.flc
new file mode 100644
index 0000000..37b69a6
--- /dev/null
+++ b/modhmm0.92b/core_algorithms_multialpha.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17032 . 19351)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
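
Reviewer note on the dump helpers in debug_funcs.c that follow: they walk flat,
row-major arrays by incrementing a pointer, and several list dumps stop at an END
sentinel. The sketch below shows both access patterns side by side. The names
mtx_index(), sum_by_index(), sum_by_pointer(), count_until_end() and SKETCH_END
are hypothetical. The formula row * row_size + col is an assumption about
get_mtx_index(), inferred from the scoring code using get_mtx_index(v, i, a_size)
and v * a_size + i interchangeably. The value of the real END macro is not shown
in this part of the diff, so -1 is only a placeholder.

```c
#include <assert.h>

#define SKETCH_END (-1) /* placeholder, not the real END value */

/* Assumed equivalent of get_mtx_index(): row-major offset into a flat array. */
static int mtx_index(int row, int col, int row_size)
{
    return row * row_size + col;
}

/* Visit every cell through an explicit index. */
static double sum_by_index(const double *mtx, int nr_rows, int nr_cols)
{
    double total = 0.0;
    int i, j;

    for (i = 0; i < nr_rows; i++) {
        for (j = 0; j < nr_cols; j++) {
            total += mtx[mtx_index(i, j, nr_cols)];
        }
    }
    return total;
}

/* Visit every cell by stepping a pointer, the way the dump loops do. */
static double sum_by_pointer(const double *mtx, int nr_rows, int nr_cols)
{
    double total = 0.0;
    int i, j;

    for (i = 0; i < nr_rows; i++) {
        for (j = 0; j < nr_cols; j++) {
            total += *mtx;
            mtx++;
        }
    }
    return total;
}

/* Count the entries that precede the sentinel in a terminated int array. */
static int count_until_end(const int *values)
{
    int n = 0;

    while (values[n] != SKETCH_END) {
        n++;
    }
    return n;
}
```

Both traversals visit the cells in the same order, so a dump written with
pointer stepping prints exactly what an indexed loop would. The pointer form only
stays correct while the caller passes the true row and column counts.
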
diff --git a/modhmm0.92b/debug_funcs.c b/modhmm0.92b/debug_funcs.c
new file mode 100644
index 0000000..7fc3985
--- /dev/null
+++ b/modhmm0.92b/debug_funcs.c
@@ -0,0 +1,1071 @@
+#include <stdlib.h>
+#include <stdio.h>
+
+
+
+#include "structs.h"
+
+
+void dump_align_matrix(int nr_rows, int nr_cols, struct align_mtx_element_s *mtx)
+{
+ int i,j;
+ struct align_mtx_element_s *mtx_tmp;
+ mtx_tmp = mtx;
+
+ printf("alignment matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ printf("scores:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%d ", mtx_tmp->score);
+ mtx_tmp++;
+ }
+ printf("\n");
+ }
+ printf("\n\n");
+
+ mtx_tmp = mtx;
+
+ printf("lasts:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%c ", mtx_tmp->last);
+ mtx_tmp++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_trans_matrix(int nr_rows, int nr_cols, double *mtx)
+{
+ int i,j;
+ printf("transition matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_int_trans_matrix(int nr_rows, int nr_cols, int *mtx)
+{
+ int i,j;
+ printf("transition matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%d ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_emiss_matrix(int nr_rows, int nr_cols, double *mtx)
+{
+ int i,j;
+ printf("emission matrix dump:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_post_prob_matrix(int nr_rows, int nr_cols, double *mtx)
+{
+ int i,j;
+ printf("posterior probability matrix dump:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_forward_matrix(int nr_rows, int nr_cols, struct forward_s *mtx)
+{
+ int i,j;
+ printf("forward matrix dump:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", mtx->prob);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_viterbi_matrix(int nr_rows, int nr_cols, struct viterbi_s *mtx)
+{
+ int i,j;
+ struct viterbi_s *mtx_2;
+
+ mtx_2 = mtx;
+ printf("viterbi matrix dump probs:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ //printf("%d ", mtx->prev);
+ printf("%f ", mtx->prob);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+
+ printf("viterbi matrix dump prevs:\n");
+ for(i = 0; i < nr_rows; i++) {
+ printf("row: %d\n", i);
+ for(j = 0; j < nr_cols; j++) {
+ printf("col: %d = %d \n", j, mtx_2->prev);
+ //printf("%f ", mtx->prob);
+ mtx_2++;
+ }
+ printf("nr_cols = %d\n", j);
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_one_best_matrix(int nr_rows, int nr_cols, struct one_best_s *mtx)
+{
+ int i,j;
+ printf("one best matrix dump:\n");
+ printf("%4d ", 0);
+ for(j = 0; j < nr_cols; j++) {
+ printf("%8d ", j);
+ }
+ printf("\n");
+ for(i = 0; i < nr_rows; i++) {
+ printf("%4d ", i);
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", mtx->prob);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_backward_matrix(int nr_rows, int nr_cols, struct backward_s *mtx)
+{
+ int i,j;
+ printf("backward matrix dump:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", mtx->prob);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_scaling_array(int len, double *mtx)
+{
+ int i;
+ printf("scaling array dump\n");
+ for(i = 0; i < len; i++) {
+ printf("%d: %f ", i, *mtx);
+ mtx++;
+ }
+ printf("\n");
+}
+
+void dump_from_trans_array(int nr_v, struct path_element **array)
+{
+ int i;
+ struct path_element *cur;
+
+
+ printf("\nfrom_trans_array_dump:\n");
+ printf("nr_v = %d\n", nr_v);
+
+
+ for(i = 0; i < nr_v; i++) {
+ if((*(array + i))->vertex == END) {
+ printf("no transitions to vertex %d\n", i);
+ }
+ else {
+ printf("paths to vertex %d:\n", i);
+ cur = *(array + i);
+ while(cur->vertex != END) {
+ printf("%d ", cur->vertex);
+ while(cur->next != NULL) {
+ cur = cur->next;
+ printf("%d ", cur->vertex);
+ }
+ printf("%d ", i);
+ printf("\n");
+ cur++;
+ }
+ }
+ }
+ printf("\n");
+}
+
+void dump_subst_mtx(double *mtx, int a_size)
+{
+ int i,j;
+ int nr_rows = a_size + 1;
+ int nr_cols = a_size;
+ printf("substitution matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_to_trans_array(int nr_v, struct path_element **array)
+{
+ int i;
+ struct path_element *cur;
+
+
+ printf("\nto_trans_array_dump:\n");
+ printf("nr_v = %d\n", nr_v);
+
+ for(i = 0; i < nr_v; i++) {
+ if((*(array + i))->vertex == END) {
+ printf("no transitions from vertex %d\n", i);
+ }
+ else {
+ printf("paths from vertex %d:\n", i);
+ cur = *(array + i);
+ while(cur->vertex != END) {
+ printf("%d ", i);
+ printf("%d ", cur->vertex);
+ while(cur->next != NULL) {
+ cur = cur->next;
+ printf("%d ", cur->vertex);
+ }
+ printf("\n");
+ cur++;
+ }
+ }
+ }
+ printf("\n");
+}
+
+
+void dump_T_matrix(int nr_rows, int nr_cols, double *mtx)
+{
+ int i,j;
+ printf("T matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_E_matrix(int nr_rows, int nr_cols, double *mtx)
+{
+ int i,j;
+ printf("E matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+/* dumps most probable state path for given string (misses end state) */
+void dump_viterbi_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size)
+{
+ struct path_element *p_el;
+
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ printf("0 ");
+ while(p_el->next != NULL) {
+ printf("%d ", p_el->vertex);
+ p_el++;
+ }
+ }
+ else {
+ dump_viterbi_path(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size);
+ p_el = cur->prevp;
+ printf("%d ", cur->prev);
+ while(p_el->next != NULL) {
+ p_el++;
+ printf("%d ", p_el->vertex);
+ }
+ }
+}
+
+void dump_viterbi_label_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size)
+{
+ struct path_element *p_el;
+
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ printf("0 ");
+ while(p_el->next != NULL) {
+ printf("%c ", *(hmmp->vertex_labels + p_el->vertex));
+ p_el++;
+ }
+ }
+ else {
+ dump_viterbi_label_path(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size);
+ p_el = cur->prevp;
+ printf("%c ", *(hmmp->vertex_labels + (int)(cur->prev)));
+ //printf("%d ", (int)(cur->prev));
+ while(p_el->next != NULL) {
+ p_el++;
+ printf("%c ", *(hmmp->vertex_labels + p_el->vertex));
+ }
+ }
+}
+
+void dump_modules(struct hmm_s *hmmp)
+{
+ int i,j;
+
+ printf("\nModule dump:\n");
+ for(i = 0 ; i < hmmp->nr_m; i++) {
+ printf("module: %s", (*(hmmp->modules + i))->name);
+ printf("nr_v: %d\n", (*(hmmp->modules + i))->nr_v);
+ printf("vertices: ");
+ for(j = 0; j < (*(hmmp->modules + i))->nr_v;j++) {
+ printf("%d ",*(((*(hmmp->modules + i))->vertices) + j));
+ }
+ printf("\n");
+ }
+
+
+}
+
+void dump_multi_modules(struct hmm_multi_s *hmmp)
+{
+ int i,j;
+
+ printf("\nModule dump:\n");
+ for(i = 0 ; i < hmmp->nr_m; i++) {
+ printf("module: %s", (*(hmmp->modules + i))->name);
+ printf("nr_v: %d\n", (*(hmmp->modules + i))->nr_v);
+ printf("vertices: ");
+ for(j = 0; j < (*(hmmp->modules + i))->nr_v;j++) {
+ printf("%d ",*(((*(hmmp->modules + i))->vertices) + j));
+ }
+ printf("\n");
+ }
+}
+
+void dump_distrib_groups(int *distrib_groups, int nr_d)
+{
+ int i,j;
+
+ printf("\ndistribution groups dump:\n");
+ for(i = 0 ; i < nr_d; i++) {
+ printf("Group %d: ", i+1);
+ while(1) {
+ if(*distrib_groups == END) {
+ distrib_groups++;
+ printf("\n");
+ break;
+ }
+ else {
+ printf("%d ", *distrib_groups);
+ distrib_groups++;
+ }
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_trans_tie_groups(struct transition_s *trans_tie_groups, int nr_ttg)
+{
+ int i,j;
+
+ printf("\ntrans tie groups dump:\n");
+ for(i = 0 ; i < nr_ttg; i++) {
+ printf("Tie %d: ", i+1);
+ while(1) {
+ if(trans_tie_groups->from_v == END) {
+ trans_tie_groups++;
+ printf("\n");
+ break;
+ }
+ else {
+ printf("%d->%d ", trans_tie_groups->from_v, trans_tie_groups->to_v);
+ trans_tie_groups++;
+ }
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_prior_struct(struct emission_dirichlet_s *emission_priorsp)
+{
+ int i,j;
+ double *mtx = emission_priorsp->prior_values;
+
+ printf("prifile: %s\n", emission_priorsp->name);
+ printf("nr components: %d\n", emission_priorsp->nr_components);
+ printf("q_values: ");
+ for(i = 0; i < emission_priorsp->nr_components; i++) {
+ printf("%f ", *((emission_priorsp->q_values) + i));
+ }
+ printf("\n");
+ printf("alpha_sums: ");
+ for(i = 0; i < emission_priorsp->nr_components; i++) {
+ printf("%f ", *((emission_priorsp->alpha_sums) + i));
+ }
+ printf("\n");
+ printf("alpha_values:\n");
+ for(i = 0; i < emission_priorsp->nr_components; i++) {
+ for(j = 0; j < emission_priorsp->alphabet_size; j++) {
+ printf("%f ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+
+ printf("logbeta_values: ");
+ for(i = 0; i < emission_priorsp->nr_components; i++) {
+ printf("%f ", *((emission_priorsp->logbeta_values) + i));
+ }
+ printf("\n");
+}
+
+void dump_silent_vertices(struct hmm_s *hmmp)
+{
+ int i;
+
+ printf("silent vertices dump: ");
+ for(i = 0;;i++) {
+ if(*(hmmp->silent_vertices + i) == END) {
+ break;
+ }
+ else {
+ printf("%d ", *(hmmp->silent_vertices + i));
+ }
+ }
+ printf("\n");
+}
+
+void dump_silent_vertices_multi(struct hmm_multi_s *hmmp)
+{
+ int i;
+
+ printf("silent vertices dump: ");
+ for(i = 0;;i++) {
+ if(*(hmmp->silent_vertices + i) == END) {
+ break;
+ }
+ else {
+ printf("%d ", *(hmmp->silent_vertices + i));
+ }
+ }
+ printf("\n");
+}
+
+void dump_locked_vertices(struct hmm_s *hmmp)
+{
+ int i;
+
+ printf("locked vertices dump: ");
+ for(i = 0;;i++) {
+ if(*(hmmp->locked_vertices + i) == END) {
+ break;
+ }
+ else {
+ printf("%d ", *(hmmp->locked_vertices + i));
+ }
+ }
+ printf("\n");
+}
+
+
+void dump_seq(struct letter_s *seq)
+{
+ int i;
+
+ while(seq->letter[0] != '\0') {
+ i = 0;
+ while(seq->letter[i] != '\0') {
+ printf("%c", seq->letter[i]);
+ i++;
+ }
+ printf(";");
+ seq++;
+ }
+ printf("\n");
+}
+
+void dump_seqs(struct sequences_s *seq_infop)
+{
+ int i,j;
+ struct letter_s *seqsp;
+
+
+ printf("seqs dump\n");
+ for(i = 0; i < seq_infop->nr_seqs; i++) {
+ seqsp = (seq_infop->seqs + i)->seq;
+ printf("seq_name = %s\n", (seq_infop->seqs + i)->name);
+ printf("seq_length = %d\n", (seq_infop->seqs + i)->length);
+ while(seqsp->letter[0] != '\0') {
+ j = 0;
+ while(seqsp->letter[j] != '\0') {
+ printf("%c", seqsp->letter[j]);
+ j++;
+ }
+ printf(";");
+ seqsp++;
+ }
+ printf("\n");
+ }
+}
+
+void dump_seqs_multi(struct sequences_multi_s *seq_infop)
+{
+ int i,j,k;
+ struct letter_s *seqsp;
+
+ printf("multi alphabet seqs dump\n");
+ printf("nr_seqs = %d\n", seq_infop->nr_seqs);
+ printf("longest_seq = %d\n", seq_infop->longest_seq);
+ printf("shortest_seq = %d\n", seq_infop->shortest_seq);
+ printf("avg_seq_len = %d\n", seq_infop->avg_seq_len);
+ for(i = 0; i < seq_infop->nr_seqs; i++) {
+ for(k = 0; k < seq_infop->nr_alphabets; k++) {
+ if(k == 0) {
+ seqsp = (seq_infop->seqs + i)->seq_1;
+ }
+ if(k == 1) {
+ seqsp = (seq_infop->seqs + i)->seq_2;
+ }
+ if(k == 2) {
+ seqsp = (seq_infop->seqs + i)->seq_3;
+ }
+ if(k == 3) {
+ seqsp = (seq_infop->seqs + i)->seq_4;
+ }
+ printf("seq_name = %s\n", (seq_infop->seqs + i)->name);
+ printf("seq_length = %d\n", (seq_infop->seqs + i)->length);
+ while(seqsp->letter[0] != '\0') {
+ j = 0;
+ while(seqsp->letter[j] != '\0') {
+ printf("%c", seqsp->letter[j]);
+ j++;
+ }
+ printf(";");
+ seqsp++;
+ }
+ printf("\n");
+ }
+ }
+}
+
+void dump_labeled_seqs_multi(struct sequences_multi_s *seq_infop)
+{
+ int i,j,k;
+ struct letter_s *seqsp;
+
+ printf("seqs dump\n");
+ for(i = 0; i < seq_infop->nr_seqs; i++) {
+ for(k = 0; k < seq_infop->nr_alphabets; k++) {
+ if(k == 0) {
+ seqsp = (seq_infop->seqs + i)->seq_1;
+ }
+ if(k == 1) {
+ seqsp = (seq_infop->seqs + i)->seq_2;
+ }
+ if(k == 2) {
+ seqsp = (seq_infop->seqs + i)->seq_3;
+ }
+ if(k == 3) {
+ seqsp = (seq_infop->seqs + i)->seq_4;
+ }
+ printf("seq_name = %s\n", (seq_infop->seqs + i)->name);
+ printf("seq_length = %d\n", (seq_infop->seqs + i)->length);
+ while(seqsp->letter[0] != '\0') {
+ j = 0;
+ while(seqsp->letter[j] != '\0') {
+ printf("%c", seqsp->letter[j]);
+ j++;
+ }
+ if(k == 0) {
+ printf("/%c",seqsp->label);
+ }
+ printf(";");
+ seqsp++;
+ }
+ printf("\n");
+ }
+ }
+}
+
+void dump_labeled_seqs(struct sequences_s *seq_infop)
+{
+ int i,j;
+ struct letter_s *seqsp;
+
+ printf("seqs dump\n");
+ for(i = 0; i < seq_infop->nr_seqs; i++) {
+ seqsp = (seq_infop->seqs + i)->seq;
+ printf("seq_name = %s\n", (seq_infop->seqs + i)->name);
+ printf("seq_length = %d\n", (seq_infop->seqs + i)->length);
+ while(seqsp->letter[0] != '\0') {
+ j = 0;
+ while(seqsp->letter[j] != '\0') {
+ printf("%c", seqsp->letter[j]);
+ j++;
+ }
+ printf("/%c",seqsp->label);
+ printf(";");
+ seqsp++;
+ }
+ printf("\n");
+ }
+}
+
+
+void dump_msa_seqs(struct msa_sequences_s *msa_seq_infop, int a_size)
+{
+ int i,j;
+ int *cur_pos;
+
+ cur_pos = (int*)(msa_seq_infop->gaps + msa_seq_infop->msa_seq_length);
+ printf("msa seqs dump:\n");
+ printf("nr_seqs = %d\n", msa_seq_infop->nr_seqs);
+ printf("msa_seq_length = %d\n", msa_seq_infop->msa_seq_length);
+ printf("nr_lead_columns = %d\n", msa_seq_infop->nr_lead_columns);
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ", i+1);
+ printf("\n");
+ printf("label: %c\n", (msa_seq_infop->msa_seq + get_mtx_index(i,0,a_size+1))->label);
+ for(j = 0; j < a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq + get_mtx_index(i,j,a_size+1))->nr_occurences);
+ }
+ printf("\n");
+ for(j = 0; j < a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq + get_mtx_index(i,j,a_size+1))->share);
+ }
+ printf("\n");
+ for(j = 0; j < a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq + get_mtx_index(i,j,a_size+1))->prior_share);
+ }
+ printf("\n");
+ }
+ printf("\n");
+ printf("gaps:\n");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ",i+1);
+ while(*cur_pos != END) {
+ printf("%d ", *cur_pos);
+ cur_pos++;
+ }
+ printf("\n");
+ cur_pos++;
+ }
+
+ printf("lead columns: ");
+ i = 0;
+ while(*(msa_seq_infop->lead_columns_start + i) != END) {
+ printf("%d ", *(msa_seq_infop->lead_columns_start + i));
+ i++;
+ }
+ printf("\n");
+
+ printf("gap_shares: ");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("%f ", *(msa_seq_infop->gap_shares + i));
+ }
+ printf("\n\n");
+}
+
+void dump_msa_seqs_multi(struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp)
+{
+ int i,j;
+ int *cur_pos;
+
+ cur_pos = (int*)(msa_seq_infop->gaps + msa_seq_infop->msa_seq_length);
+ printf("msa seqs dump:\n");
+
+ printf("nr_seqs = %d\n", msa_seq_infop->nr_seqs);
+ printf("msa_seq_length = %d\n", msa_seq_infop->msa_seq_length);
+ printf("nr_lead_columns = %d\n\n", msa_seq_infop->nr_lead_columns);
+
+ printf("Alphabet 1\n");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ", i+1);
+ printf("\n");
+ printf("label: %c\n", (msa_seq_infop->msa_seq_1 + get_mtx_index(i,0,hmmp->a_size+1))->label);
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_1 + get_mtx_index(i,j,hmmp->a_size+1))->nr_occurences);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_1 + get_mtx_index(i,j,hmmp->a_size+1))->share);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_1 + get_mtx_index(i,j,hmmp->a_size+1))->prior_share);
+ }
+ printf("\n");
+ }
+ printf("\n");
+
+ if(hmmp->nr_alphabets > 1) {
+ printf("Alphabet 2\n");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ", i+1);
+ printf("\n");
+ printf("label: %c\n", (msa_seq_infop->msa_seq_2 + get_mtx_index(i,0,hmmp->a_size_2+1))->label);
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_2 + get_mtx_index(i,j,hmmp->a_size_2+1))->nr_occurences);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_2 + get_mtx_index(i,j,hmmp->a_size_2+1))->share);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_2 + get_mtx_index(i,j,hmmp->a_size_2+1))->prior_share);
+ }
+ printf("\n");
+ }
+ printf("\n");
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ printf("Alphabet 3\n");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ", i+1);
+ printf("\n");
+ printf("label: %c\n", (msa_seq_infop->msa_seq_3 + get_mtx_index(i,0,hmmp->a_size_3+1))->label);
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_3 + get_mtx_index(i,j,hmmp->a_size_3+1))->nr_occurences);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_3 + get_mtx_index(i,j,hmmp->a_size_3+1))->share);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_3 + get_mtx_index(i,j,hmmp->a_size_3+1))->prior_share);
+ }
+ printf("\n");
+ }
+ printf("\n");
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ printf("Alphabet 4\n");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ", i+1);
+ printf("\n");
+ printf("label: %c\n", (msa_seq_infop->msa_seq_4 + get_mtx_index(i,0,hmmp->a_size_4+1))->label);
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_4 + get_mtx_index(i,j,hmmp->a_size_4+1))->nr_occurences);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_4 + get_mtx_index(i,j,hmmp->a_size_4+1))->share);
+ }
+ printf("\n");
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ printf("%f ",(msa_seq_infop->msa_seq_4 + get_mtx_index(i,j,hmmp->a_size_4+1))->prior_share);
+ }
+ printf("\n");
+ }
+ printf("\n");
+ }
+
+
+ printf("gaps:\n");
+
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("pos %d: ",i+1);
+ while(*cur_pos != END) {
+ printf("%d ", *cur_pos);
+ cur_pos++;
+ }
+ printf("\n");
+ cur_pos++;
+ }
+
+
+ printf("lead columns: ");
+ i = 0;
+ while(*(msa_seq_infop->lead_columns_start + i) != END) {
+ printf("%d ", *(msa_seq_infop->lead_columns_start + i));
+ i++;
+ }
+ printf("\n");
+
+ printf("gap_shares: ");
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ printf("%f ", *(msa_seq_infop->gap_shares + i));
+ }
+ printf("\n\n");
+}
+
+
+void dump_replacement_letters(struct replacement_letter_s *replacement_letters, int a_size)
+{
+ int i,j;
+ int nr_rows, nr_cols;
+ double *prob_mtx;
+
+ prob_mtx = replacement_letters->probs;
+ nr_rows = replacement_letters->nr_rl;
+ nr_cols = a_size;
+
+
+ printf("replacement letter dump:\n");
+ printf("nr_letters = %d\n", replacement_letters->nr_rl);
+ printf("letters: ");
+ for(i = 0; i < nr_rows;i++) {
+ printf("%c ", (*(replacement_letters->letters + i)));
+ }
+ printf("\nprobs:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *prob_mtx);
+ prob_mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_replacement_letters_multi(struct replacement_letter_multi_s *replacement_letters, int alphabet, int a_size)
+{
+ int i,j;
+ int nr_rows, nr_cols;
+ double *prob_mtx;
+
+ if(alphabet == 1) {
+ prob_mtx = replacement_letters->probs_1;
+ nr_rows = replacement_letters->nr_rl_1;
+ }
+ if(alphabet == 2) {
+ prob_mtx = replacement_letters->probs_2;
+ nr_rows = replacement_letters->nr_rl_2;
+ }
+ if(alphabet == 3) {
+ prob_mtx = replacement_letters->probs_3;
+ nr_rows = replacement_letters->nr_rl_3;
+ }
+ if(alphabet == 4) {
+ prob_mtx = replacement_letters->probs_4;
+ nr_rows = replacement_letters->nr_rl_4;
+ }
+ nr_cols = a_size;
+
+
+ printf("replacement letter dump:\n");
+ if(alphabet == 1) {
+ printf("nr_letters = %d\n", replacement_letters->nr_rl_1);
+ }
+ if(alphabet == 2) {
+ printf("nr_letters = %d\n", replacement_letters->nr_rl_2);
+ }
+ if(alphabet == 3) {
+ printf("nr_letters = %d\n", replacement_letters->nr_rl_3);
+ }
+ if(alphabet == 4) {
+ printf("nr_letters = %d\n", replacement_letters->nr_rl_4);
+ }
+ printf("letters: ");
+ for(i = 0; i < nr_rows;i++) {
+ if(alphabet == 1) {
+ printf("%c ", (*(replacement_letters->letters_1 + i)));
+ }
+ if(alphabet == 2) {
+ printf("%c ", (*(replacement_letters->letters_2 + i)));
+ }
+ if(alphabet == 3) {
+ printf("%c ", (*(replacement_letters->letters_3 + i)));
+ }
+ if(alphabet == 4) {
+ printf("%c ", (*(replacement_letters->letters_4 + i)));
+ }
+ }
+ printf("\nprobs:\n");
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%f ", *prob_mtx);
+ prob_mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_calibration_matrix(int nr_rows, int nr_cols, int *mtx)
+{
+ int i,j;
+ printf("calibration matrix dump:\n");
+ printf("nr rows: %d\n", nr_rows);
+ printf("nr columns: %d\n", nr_cols);
+ for(i = 0; i < nr_rows; i++) {
+ for(j = 0; j < nr_cols; j++) {
+ printf("%d ", *mtx);
+ mtx++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_to_silent_trans_array(int nr_v, int **array)
+{
+ int i;
+ int pos;
+
+ printf("To_silent_trans_array_dump:\n");
+ for(i = 0; i < nr_v; i++) {
+ printf("Transitions from vertex %d: ", i);
+ pos = 0;
+ while(*(*(array + i) + pos) != END) {
+ printf("%d ", *(*(array + i) + pos));
+ pos++;
+ }
+ printf("\n");
+ }
+ printf("\n");
+}
+
+void dump_aa_distrib_mtx(struct aa_distrib_mtx_s *aa_distrib_mtxp)
+{
+ int i;
+
+ printf("aa_distrib_mtx_dump\n");
+ printf("a_size = %d\n", aa_distrib_mtxp->a_size);
+ for(i = 0; i < aa_distrib_mtxp->a_size;i++) {
+ printf("%f %f %f\n", *(aa_distrib_mtxp->inside_values + i), *(aa_distrib_mtxp->outside_values + i),
+ *(aa_distrib_mtxp->membrane_values + i));
+ }
+}
+
+
+void dump_v_list(int *sorted_v_list)
+{
+ int i = 0;
+ printf("v-list dump:\n");
+ while(*(sorted_v_list + i) != -99) {
+ if(*(sorted_v_list + i) == -9) {
+ printf(" | ");
+ }
+ else {
+ printf("%d ", *(sorted_v_list + i));
+ }
+ i++;
+ }
+ printf("\n");
+}
+
+void dump_labeling(char *labeling, int c)
+{
+ int i;
+
+ printf("labeling dump:\n");
+ if(labeling == NULL) {
+ printf("NULL\n");
+ return;
+ }
+ for(i = 1; i <= c; i++) {
+ printf("%c",labeling[i]);
+ }
+ printf("\n");
+}
+
+void dump_label_tmp_list(int *list)
+{
+ int i;
+
+ printf("label_list_dump: \n");
+
+ for(i = 0; list[i] != TOT_END;i++) {
+ printf("%d ",list[i]);
+ }
+ printf("\n");
+}
+
+void dump_labels(char *labels, int nr_labels)
+{
+ int i;
+
+ printf("label_dump: \n");
+ for(i = 0; i < nr_labels; i++) {
+ printf("%c ", labels[i]);
+ }
+ printf("\n");
+}
+
+void check_for_corrupt_values(int nr_rows, int nr_cols, double *mtx, char *name)
+{
+ int v,w;
+ int x;
+
+ /* test if this matrix contains strange values */
+ for(v = 0; v < nr_rows; v++) {
+ for(w = 0; w < nr_cols; w++) {
+ x = get_mtx_index(v,w,nr_cols);
+ if(*(mtx + x) < 0.0) {
+ printf("%s[%d][%d] = %f\n", name, v,w,*(mtx + x));
+ if(*(mtx + x) == 0.0) {
+ }
+ }
+ else if(!(*(mtx + x) <= 1000000000000.0)) {
+ printf("%s[%d][%d] = %f\n", name, v,w,*(mtx + x));
+ }
+ else {
+ //printf("%s[%d][%d] = %f\n", name, v,w,*(mtx + x));
+ }
+ }
+ }
+}
+
+void dump_weights(double *weights, int nr_seqs)
+{
+ int i;
+
+ printf("weight dump:\n");
+ for(i = 0; i < nr_seqs; i++) {
+ printf("%f\n", *(weights + i));
+ }
+}
diff --git a/modhmm0.92b/funcs.h b/modhmm0.92b/funcs.h
new file mode 100644
index 0000000..88e003c
--- /dev/null
+++ b/modhmm0.92b/funcs.h
@@ -0,0 +1,302 @@
+
+
+
+/* function declarations */
+
+/* readhmm */
+int readhmm(FILE*, struct hmm_multi_s*, char* path);
+
+/* readhmm_multialpha */
+int readhmm_multialpha(FILE*, struct hmm_multi_s*);
+void transform_singlehmmfile_to_multi(FILE *hmmfile, FILE *outfile);
+int readhmm_check(FILE *hmmfile);
+void copy_hmm_struct(struct hmm_multi_s *hmm, struct hmm_multi_s *retrain_hmm);
+
+/* readseqs */
+void get_sequences_std(FILE*, struct sequences_s*, struct hmm_s*);
+void get_labeled_sequences_std(FILE*, struct sequences_s*, struct hmm_s*);
+void get_sequences_fasta(FILE*, struct sequences_s*);
+void get_sequences_msa_std(FILE*, FILE*, struct msa_sequences_s*, struct hmm_s*, int, struct replacement_letter_s*);
+void get_sequences_msa_prf(FILE *seqfile, FILE *priorfile, struct msa_sequences_s *msa_seq_infop,
+ struct hmm_s *hmmp, int lead_seq);
+
+
+/* readseqs_multi */
+int seqfile_has_labels(FILE *seqfile);
+void get_sequence_fasta_multi(char *seq, struct sequences_multi_s *seq_infop, int seq_nr);
+void get_sequences_std_multi(FILE *seqfile, struct sequences_multi_s *seq_infop, struct hmm_multi_s *hmmp, int seq_nr);
+void get_sequences_msa_std_multi(FILE*, FILE*, struct msa_sequences_multi_s*, struct hmm_multi_s*,
+ int, struct replacement_letter_multi_s*);
+void get_sequences_msa_prf_multi(FILE *seqfile, FILE *priorfile, struct msa_sequences_multi_s *msa_seq_infop,
+ struct hmm_multi_s *hmmp);
+
+/* savehmm */
+int savehmm(FILE*, struct hmm_multi_s*);
+int savehmm_multialpha(FILE*, struct hmm_multi_s*);
+
+
+/* core_algorithms */
+int forward(struct hmm_s*, struct letter_s*, struct forward_s**, double**, int);
+int backward(struct hmm_s*, struct letter_s*, struct backward_s**, double*, int);
+int viterbi(struct hmm_s*, struct letter_s*, struct viterbi_s**, int);
+int one_best(struct hmm_s*, struct letter_s*, struct one_best_s**, double**, int, char*);
+int msa_forward(struct hmm_s*, struct msa_sequences_s*, int,
+ int, int, struct forward_s**, double**, int, int, int, double*);
+int msa_backward(struct hmm_s*, struct msa_sequences_s*, int,
+ int, struct backward_s**, double*, int, int, int, double*);
+int msa_viterbi(struct hmm_s*, struct msa_sequences_s*, int,
+ int, int, struct viterbi_s**, int, int, int, double*);
+int msa_one_best(struct hmm_s*, struct msa_sequences_s*, int,
+ int, int, struct one_best_s**, double**, int, char*, int, int, double*);
+
+
+/* core_algorithms_multialpha */
+int forward_multi(struct hmm_multi_s*, struct letter_s*, struct letter_s*, struct letter_s*, struct letter_s*,
+ struct forward_s**, double**, int, int);
+int backward_multi(struct hmm_multi_s*, struct letter_s*, struct letter_s*, struct letter_s*, struct letter_s*,
+ struct backward_s**, double*, int, int);
+int viterbi_multi(struct hmm_multi_s*, struct letter_s*, struct letter_s*, struct letter_s*, struct letter_s*,
+ struct viterbi_s**, int, int);
+int one_best_multi(struct hmm_multi_s*, struct letter_s*, struct letter_s*, struct letter_s*, struct letter_s*,
+ struct one_best_s**, double**, int, char*, int);
+int msa_forward_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, int,
+ int, int, struct forward_s**, double**, int, int, int, int, double*, double*, double*, double*);
+int msa_backward_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, int,
+ int, struct backward_s**, double*, int, int, int, int, double*, double*, double*, double*);
+int msa_viterbi_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, int,
+ int, int, struct viterbi_s**, int, int, int, int, double*, double*, double*, double*);
+int msa_one_best_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, int,
+ int, int, struct one_best_s**, double**, int, char*, int, int, int, double*, double*, double*, double*);
+
+/* tm_core_algorithms */
+int tm_viterbi(struct hmm_s*, struct letter_s*, struct viterbi_s**, struct aa_distrib_mtx_s*, int);
+
+
+/* training_algorithms */
+void baum_welch_std(struct hmm_s*, struct sequence_s*, int, int, int);
+void baum_welch_dirichlet(struct hmm_s*, struct sequence_s*, int, int, int, int, int);
+void extended_baum_welch_dirichlet(struct hmm_s*, struct sequence_s*, int, int, int, int, int);
+void msa_baum_welch_dirichlet(struct hmm_s*, struct msa_sequences_s*, int, int, int, int, int, int, int, int, int, int, double*);
+void extended_msa_baum_welch_dirichlet(struct hmm_s*, struct msa_sequences_s*, int, int, int, int, int, int, int, int, int, int,
+ double*);
+
+/* training_algorithms_multialpha */
+void baum_welch_std_multi(struct hmm_multi_s *hmmp, struct sequence_multi_s *seqsp, int nr_seqs, int annealing, int use_labels,
+ int multi_scoring_method, int use_prior);
+void baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct sequence_multi_s *seqsp, int nr_seqs, int annealing, int use_labels,
+ int use_transition_pseudo_counts, int use_emission_pseudo_counts, int multi_scoring_method,
+ int use_prior);
+void msa_baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int nr_seqs,
+ int annealing,
+ int use_gap_shares, int use_lead_columns, int use_labels, int use_transition_pseudo_counts,
+ int use_emission_pseudo_counts, int normalize, int scoring_method, int use_nr_occ,
+ int multi_scoring_method, double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3,
+ double *aa_freqs_4, int use_prior);
+void extended_msa_baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop,
+ int nr_seqs, int annealing,
+ int use_gap_shares, int use_lead_columns, int use_labels,
+ int use_transition_pseudo_counts,
+ int use_emission_pseudo_counts, int normalize, int scoring_method, int use_nr_occ,
+ int multi_scoring_method, double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3,
+ double *aa_freqs_4, int use_prior);
+
+
+
+/* std_funcs */
+void* malloc_or_die(int);
+void init_float_mtx(double*, double, int);
+void init_viterbi_s_mtx(struct viterbi_s*, double, int);
+void printhelp_modhmms();
+void printhelp_modhmms_msa();
+void printhelp_hmmtrain();
+void printhelp_hmmtrain_msa();
+void printhelp_modhmms_multialpha();
+void printhelp_modhmms_msa_multialpha();
+void printhelp_hmmtrain_multialpha();
+void printhelp_hmmtrain_msa_multialpha();
+void printhelp_modhmms_tm_multialpha();
+void printhelp_modhmms_tm_msa_multialpha();
+void printhelp_hmmtrain_tm_multialpha();
+void printhelp_hmmtrain_tm_msa_multialpha();
+void printhelp_modhmms_tm();
+void printhelp_modhmms_tm_msa();
+void printhelp_hmmtrain_tm();
+void printhelp_hmmtrain_tm_msa();
+
+void printhelp_chmmtrain();
+void printhelp_chmmtrain_msa();
+void printhelp_chmmtrain_multialpha();
+void printhelp_chmmtrain_msa_multialpha();
+void printhelp_add_alphabet();
+void printhelp_add2profilehmm();
+void printhelp_cal();
+void printhelp_opt();
+
+int get_mtx_index(int,int,int);
+int get_alphabet_index(struct letter_s*, char*, int);
+int get_alphabet_index_msa_query(char*, char*, int);
+int get_replacement_letter_index(struct letter_s*, struct replacement_letter_s*);
+int get_replacement_letter_index_multi(struct letter_s *c, struct replacement_letter_multi_s *replacement_letters, int alphabet);
+int get_alphabet_index_single(char*, char, int);
+int get_replacement_letter_index_single(char*, struct replacement_letter_s*);
+int get_seq_length(struct letter_s*);
+int path_length(int, int, struct hmm_s*, int);
+int path_length_multi(int, int, struct hmm_multi_s*, int);
+void print_seq(struct letter_s*, FILE*, int, char*, int);
+struct path_element* get_end_path_start(int l, struct hmm_s *hmmp);
+struct path_element* get_end_path_start_multi(int l, struct hmm_multi_s *hmmp);
+char* get_profile_vertex_type(int, int*);
+void get_replacement_letters(FILE*, struct replacement_letter_s*);
+void get_aa_distrib_mtx(FILE *distribmtxfile, struct aa_distrib_mtx_s *aa_distrib_mtxp);
+void get_replacement_letters_multi(FILE *replfile, struct replacement_letter_multi_s *replacement_lettersp);
+char* letter_as_string(struct letter_s*);
+char* sequence_as_string(struct letter_s*);
+void get_viterbi_label_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, char *labels, int *ip);
+void get_viterbi_label_path_multi(struct viterbi_s *cur, struct hmm_multi_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, char *labels, int *ip);
+void get_viterbi_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, int *path, int *ip);
+void get_viterbi_path_multi(struct viterbi_s *cur, struct hmm_multi_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, int *path, int *ip);
+void itoa(char* s, int nr);
+void ftoa(char* s, double nr, int prec);
+int read_subst_matrix(double **mtx, FILE *substmtxfile);
+int read_subst_matrix_multi(double **mtxpp, double **mtxpp_2, double **mtxpp_3, double **mtxpp_4, FILE *substmtxfile);
+int read_prior_file(struct emission_dirichlet_s *em_di, struct hmm_s *hmmp, FILE *priorfile);
+int read_frequencies(FILE *freqfile, double **aa_freqs);
+int read_frequencies_multi(FILE *freqfile, double **aa_freqsp, double **aa_freqsp_2, double **aa_freqsp_3, double **aa_freqsp_4);
+int read_prior_file_multi(struct emission_dirichlet_s *em_di, struct hmm_multi_s *hmmp, FILE *priorfile, int alphabet);
+int locked_state(struct hmm_s *hmmp, int v);
+int locked_state_multi(struct hmm_multi_s *hmmp, int v);
+int get_best_reliability_score(double reliability_score_1, double reliability_score_2, double reliability_score_3);
+void hmm_garbage_collection(FILE *hmmfile, struct hmm_s *hmmp);
+void hmm_garbage_collection_multi(FILE *hmmfile, struct hmm_multi_s *hmmp);
+void hmm_garbage_collection_multi_no_dirichlet(FILE *hmmfile, struct hmm_multi_s *hmmp);
+void msa_seq_garbage_collection_multi(struct msa_sequences_multi_s *msa_seq_info, int nr_alphabets);
+void seq_garbage_collection_multi(struct sequences_multi_s *seq_info, int nr_alphabets);
+void get_msa_labels(FILE *labelfile, struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp);
+void get_msa_labels_all_columns(FILE *labelfile, struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp);
+int update_shares_prior(struct emission_dirichlet_s *em_di, struct hmm_s *hmmp,
+ struct msa_sequences_s *msa_seq_infop, int l);
+int replacement_letter(struct letter_s *cur_letterp, struct replacement_letter_s *replacement_letters,
+ struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp, int seq_pos);
+void get_labels_multi(FILE *labelfile, struct sequences_multi_s *seq_infop, struct hmm_multi_s *hmmp, int seq_nr);
+void get_msa_labels_multi(FILE *labelfile, struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp);
+void get_msa_labels_all_columns_multi(FILE *labelfile, struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp);
+int update_shares_prior_multi(struct emission_dirichlet_s *em_di, struct hmm_multi_s *hmmp,
+ struct msa_sequences_multi_s *msa_seq_infop, int l, int alphabet);
+int replacement_letter_multi(struct letter_s *cur_letterp, struct replacement_letter_multi_s *replacement_letters,
+ struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp, int seq_pos, int alphabet);
+int get_nr_alphabets(FILE *hmmfile);
+void get_set_of_labels(struct hmm_s *hmmp);
+void get_set_of_labels_multi(struct hmm_multi_s *hmmp);
+void get_reverse_msa_seq_multi(struct msa_sequences_multi_s *msa_seq_infop, struct msa_sequences_multi_s *reverse_msa_seq_infop,
+ struct hmm_multi_s *hmmp);
+void get_reverse_seq_multi(struct sequence_multi_s *seqs, struct letter_s **reverse_seq_1,
+ struct letter_s **reverse_seq_2, struct letter_s **reverse_seq_3,
+ struct letter_s **reverse_seq_4, struct hmm_multi_s *hmmp, int seq_len);
+
+/* std calculation funcs */
+double get_single_gaussian_statescore(double mu, double sigma_square, double letter);
+double get_dp_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares);
+double get_dp_picasso_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs);
+double get_sjolander_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares);
+double get_sjolander_reversed_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares);
+double get_picasso_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs);
+double get_picasso_sym_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs);
+double get_subst_mtx_product_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, double *subst_mtx);
+double get_subst_mtx_dot_product_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares,
+ int query_index, double *subst_mtx);
+double get_subst_mtx_dot_product_prior_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares,
+ int query_index, double *subst_mtx);
+
+void add_to_E_dot_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_dot_product_picasso(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_picasso(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_picasso_sym(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_sjolander_score(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_sjolander_reversed_score(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_subst_mtx_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx);
+void add_to_E_subst_mtx_dot_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet);
+void add_to_E_subst_mtx_dot_product_prior(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet);
+
+void add_to_E_dot_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_dot_product_picasso_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_picasso_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_picasso_sym_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_sjolander_score_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_sjolander_reversed_score_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize);
+void add_to_E_subst_mtx_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx);
+void add_to_E_subst_mtx_dot_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet);
+void add_to_E_subst_mtx_dot_product_prior_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet);
+
+void update_labelings(struct one_best_s *cur_rowp, char *vertex_labels,
+ int *sorted_v_list, int seq_len, int c, char *labels, int nr_of_labels, int nr_v);
+void deallocate_row_labelings(struct one_best_s *prev_rowp, int nr_v);
+
+
+/* debug_funcs */
+void dump_align_matrix(int nr_rows, int nr_cols, struct align_mtx_element_s *mtx);
+void dump_trans_matrix(int,int,double*);
+void dump_int_trans_matrix(int nr_rows, int nr_cols, double *mtx);
+void dump_emiss_matrix(int,int,double*);
+void dump_post_prob_matrix(int nr_rows, int nr_cols, double *mtx);
+void dump_forward_matrix(int,int,struct forward_s*);
+void dump_backward_matrix(int,int,struct backward_s*);
+void dump_viterbi_matrix(int nr_rows, int nr_cols, struct viterbi_s *mtx);
+void dump_one_best_matrix(int, int, struct one_best_s*);
+void dump_scaling_array(int,double*);
+void dump_from_trans_array(int,struct path_element**);
+void dump_to_trans_array(int,struct path_element**);
+void dump_viterbi_path(struct viterbi_s*, struct hmm_s*, struct viterbi_s*, int, int);
+void dump_viterbi_label_path(struct viterbi_s*, struct hmm_s*, struct viterbi_s*, int, int);
+void dump_T_matrix(int,int,double*);
+void dump_E_matrix(int,int,double*);
+void dump_modules(struct hmm_s*);
+void dump_distrib_groups(int*, int);
+void dump_trans_tie_groups(struct transition_s*, int);
+void dump_prior_struct(struct emission_dirichlet_s*);
+void dump_silent_vertices(struct hmm_s*);
+void dump_silent_vertices_multi(struct hmm_multi_s *hmmp);
+void dump_locked_vertices(struct hmm_s*);
+void dump_seqs(struct sequences_s*);
+void dump_seqs_multi(struct sequences_multi_s*);
+void dump_msa_seqs(struct msa_sequences_s*, int);
+void dump_msa_seqs_multi(struct msa_sequences_s*, struct hmm_multi_s*);
+void dump_to_silent_trans_array(int, int**);
+void dump_aa_distrib_mtx(struct aa_distrib_mtx_s *aa_distrib_mtxp);
+void dump_v_list(int*);
+void dump_labeling(char*, int);
+void dump_label_tmp_list(int *list);
+void check_for_corrupt_values(int nr_rows, int nr_cols, double *mtx, char *name);
+void dump_subst_mtx(int a_size, double *mtx);
+void dump_multi_modules(struct hmm_multi_s *hmmp);
+void dump_weights(double *total_weights, int nr_seqs);
diff --git a/modhmm0.92b/hmmsearch.c b/modhmm0.92b/hmmsearch.c
new file mode 100644
index 0000000..b9e491e
--- /dev/null
+++ b/modhmm0.92b/hmmsearch.c
@@ -0,0 +1,1337 @@
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <math.h>
+#include <limits.h>
+
+#include "structs.h"
+#include "funcs.h"
+#include "cmdline_hmmsearch.h"
+
+
+#define NORM_LOG_LIKE 0
+#define LOGODDS 1
+
+#define HMM 20
+#define SEQ 21
+
+#define LONGEST_SEQ -1 /* Note: this constant is also defined in read_seqs.c */
+#define FIRST_SEQ 1
+
+#define LEAD_SEQ 10
+#define ALL_SEQS 11
+
+#define MAX_LINE 500
+
+#define HMMFILEERROR -10
+#define REPFILEERROR -11
+#define SEQERROR -12
+#define PATHERROR -13
+
+extern int verbose;
+
+
+static struct hmm_multi_s hmm;
+static struct hmm_multi_s retrain_hmm;
+static struct msa_sequences_multi_s *msa_seq_infop;
+static struct sequences_multi_s seq_info;
+static struct replacement_letter_multi_s replacement_letters;
+static struct null_model_multi_s null_model;
+double *subst_mtxp;
+double *subst_mtxp_2;
+double *subst_mtxp_3;
+double *subst_mtxp_4;
+double *aa_freqs;
+double *aa_freqs_2;
+double *aa_freqs_3;
+double *aa_freqs_4;
+
+
+void get_null_model_multi(FILE *nullfile);
+double get_nullmodel_score_multi(struct letter_s *seq, struct letter_s *seq_2, struct letter_s *seq_3,
+ struct letter_s *seq_4, int seq_len, int multi_scoring_method);
+double get_nullmodel_score_msa_multi(int seq_len, int prf_mode, int use_prior,
+ int use_gap_shares, int normalize,
+ int scoring_method, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4);
+int get_scores(int use_labels, int scoring_method, int multi_scoring_method, helix_sites* hSites);
+void get_post_prob_matrix(double **ret_post_prob_matrixp, double forw_score, struct forward_s *forw_mtx,
+ struct backward_s *backw_mtx, int seq_len);
+
+helix_sites* get_helices(char* seq, char* hmmfilename, char* repfilename, char* path)
+{
+
+ /* command line options */
+ helix_sites *hSites;
+ FILE *hmmfile; /* file to read hmms from */
+ FILE *seqnamefile; /* file for reading sequence names */
+ FILE *replfile; /* file for reading special replacement letters */
+ FILE *freqfile;
+ int use_labels;
+ int scoring_method;
+ int multi_scoring_method;
+ int helices;
+
+ /* temporary variables */
+ int i,j; /*standard loop indices */
+
+ /*init some variables */
+ hmmfile = NULL;
+ seqnamefile = NULL;
+ replfile = NULL;
+ freqfile = NULL;
+ use_labels = NO;
+ scoring_method = DOT_PRODUCT;
+ multi_scoring_method = JOINT_PROB;
+ subst_mtxp = NULL;
+ subst_mtxp_2 = NULL;
+ subst_mtxp_3 = NULL;
+ subst_mtxp_4 = NULL;
+ aa_freqs = NULL;
+ aa_freqs_2 = NULL;
+ aa_freqs_3 = NULL;
+ aa_freqs_4 = NULL;
+
+
+ /* Create the structure to return the helix sites in */
+ hSites = (helix_sites *) malloc_or_die(sizeof (helix_sites));
+ hSites->helix_count = 0;
+ hSites->helix = NULL;
+
+ /* compulsory options */
+ if(hmmfilename) {
+ if((hmmfile = fopen(hmmfilename, "r")) == NULL) {
+ hSites->helix_count = HMMFILEERROR;
+ return hSites;
+ }
+ } else {
+ hSites->helix_count = HMMFILEERROR;
+ return hSites;
+ }
+
+ if(!path) {
+ hSites->helix_count = PATHERROR;
+ return hSites;
+ }
+
+ if(!seq) {
+ hSites->helix_count = SEQERROR;
+ return hSites;
+ }
+
+ if(repfilename) {
+ if((replfile = fopen(repfilename, "r")) == NULL) {
+ hSites->helix_count = REPFILEERROR;
+ return hSites;
+ }
+ } else {
+ hSites->helix_count = REPFILEERROR;
+ return hSites;
+ }
+
+
+ /* get replacement letters */
+ get_replacement_letters_multi(replfile, &replacement_letters);
+
+ /* no null model file is read; a_size and nr_alphabets of -1 select the default null model */
+ null_model.a_size = -1;
+ null_model.nr_alphabets = -1;
+
+
+ /* get hmm and score all seqs against them */
+ /* get hmm from file */
+ readhmm(hmmfile, &hmm, path);
+
+ hmm.subst_mtx = subst_mtxp;
+ hmm.subst_mtx_2 = subst_mtxp_2;
+ hmm.subst_mtx_3 = subst_mtxp_3;
+ hmm.subst_mtx_4 = subst_mtxp_4;
+ hmm.replacement_letters = &replacement_letters;
+
+
+ /* check sequence file for labels + check nolabel flag */
+ use_labels = NO;
+
+ /* allocate space for msa_seq_info structs */
+ seq_info.seqs = malloc_or_die(sizeof(struct sequence_multi_s) * 1);
+ seq_info.nr_alphabets = hmm.nr_alphabets;
+ seq_info.nr_seqs = 1;
+ seq_info.longest_seq = 0;
+ seq_info.shortest_seq = INT_MAX;
+ seq_info.avg_seq_len = 0;
+
+
+ get_sequence_fasta_multi(seq, &seq_info, 0);
+ hmm.alphabet_type = DISCRETE;
+
+ seq_info.avg_seq_len = ((int)(seq_info.avg_seq_len / seq_info.nr_seqs));
+
+
+ /* get score info for this seq */
+
+ helices = get_scores(use_labels, scoring_method, multi_scoring_method, hSites);
+
+ /* deallocate seqinfo */
+ free(((seq_info.seqs))->seq_1);
+ free(seq_info.seqs); /* also release the sequence record allocated above */
+ free(msa_seq_infop);
+
+
+ /* deallocate hmm_info */
+ hmm_garbage_collection_multi(hmmfile, &hmm);
+
+
+ /* deallocate replacement letters */
+ if(replfile != NULL) {
+ if(replacement_letters.nr_rl_1 > 0) {
+ free(replacement_letters.letters_1);
+ free(replacement_letters.probs_1);
+ }
+ fclose(replfile);
+
+ }
+
+ return hSites;
+
+}
+
+
+int find_helices(char* seq, char* hmmfilename, char* repfilename, char* path)
+{
+ /* command line options */
+ FILE *hmmfile; /* file to read hmms from */
+ FILE *seqnamefile; /* file for reading sequence names */
+ FILE *replfile; /* file for reading special replacement letters */
+ FILE *freqfile;
+ int use_labels;
+ int scoring_method;
+ int multi_scoring_method;
+ int helices;
+
+ /* temporary variables */
+ int i,j; /*standard loop indices */
+
+ /*init some variables */
+ hmmfile = NULL;
+ seqnamefile = NULL;
+ replfile = NULL;
+ freqfile = NULL;
+ use_labels = NO;
+ scoring_method = DOT_PRODUCT;
+ multi_scoring_method = JOINT_PROB;
+ subst_mtxp = NULL;
+ subst_mtxp_2 = NULL;
+ subst_mtxp_3 = NULL;
+ subst_mtxp_4 = NULL;
+ aa_freqs = NULL;
+ aa_freqs_2 = NULL;
+ aa_freqs_3 = NULL;
+ aa_freqs_4 = NULL;
+
+
+ /* compulsory options */
+ if(hmmfilename) {
+ if((hmmfile = fopen(hmmfilename, "r")) == NULL) {
+ return HMMFILEERROR;
+ }
+ } else {
+ return HMMFILEERROR;
+ }
+
+ if(!path) {
+ return PATHERROR;
+ }
+
+ if(!seq) {
+ return SEQERROR;
+ }
+
+ if(repfilename) {
+ if((replfile = fopen(repfilename, "r")) == NULL) {
+ return REPFILEERROR;
+ }
+ } else {
+ return REPFILEERROR;
+ }
+
+
+ /* get replacement letters */
+ get_replacement_letters_multi(replfile, &replacement_letters);
+
+ /* no null model file is read; a_size and nr_alphabets of -1 select the default null model */
+ null_model.a_size = -1;
+ null_model.nr_alphabets = -1;
+
+
+ /* get hmm and score all seqs against them */
+ /* get hmm from file */
+ readhmm(hmmfile, &hmm, path);
+
+ hmm.subst_mtx = subst_mtxp;
+ hmm.subst_mtx_2 = subst_mtxp_2;
+ hmm.subst_mtx_3 = subst_mtxp_3;
+ hmm.subst_mtx_4 = subst_mtxp_4;
+ hmm.replacement_letters = &replacement_letters;
+
+
+ /* check sequence file for labels + check nolabel flag */
+ use_labels = NO;
+
+ /* allocate space for msa_seq_info structs */
+ seq_info.seqs = malloc_or_die(sizeof(struct sequence_multi_s) * 1);
+ seq_info.nr_alphabets = hmm.nr_alphabets;
+ seq_info.nr_seqs = 1;
+ seq_info.longest_seq = 0;
+ seq_info.shortest_seq = INT_MAX;
+ seq_info.avg_seq_len = 0;
+
+
+ get_sequence_fasta_multi(seq, &seq_info, 0);
+ hmm.alphabet_type = DISCRETE;
+
+ seq_info.avg_seq_len = ((int)(seq_info.avg_seq_len / seq_info.nr_seqs));
+
+
+ /* get score info for this seq */
+
+ helices = get_scores(use_labels, scoring_method, multi_scoring_method, NULL);
+
+ /* deallocate seqinfo */
+ free(((seq_info.seqs))->seq_1);
+ free(seq_info.seqs); /* also release the sequence record allocated above */
+ free(msa_seq_infop);
+
+
+ /* deallocate hmm_info */
+ hmm_garbage_collection_multi(hmmfile, &hmm);
+
+
+ /* deallocate replacement letters */
+ if(replfile != NULL) {
+ if(replacement_letters.nr_rl_1 > 0) {
+ free(replacement_letters.letters_1);
+ free(replacement_letters.probs_1);
+ }
+ fclose(replfile);
+ }
+
+ return helices;
+}
+
+
+
+
+
+/*************************************************************************************************/
+void get_null_model_multi(FILE *nullfile)
+{
+ int i,j;
+ char s[MAX_LINE];
+
+ if(nullfile != NULL) {
+ /* read nullfile */
+ while(1) {
+ if(fgets(s, MAX_LINE, nullfile) == NULL) {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ if(s[0] == '#' || s[0] == '\n') {
+ continue;
+ }
+ else {
+ null_model.nr_alphabets = atoi(s);
+ break;
+ }
+ }
+
+ for(i = 0; i < null_model.nr_alphabets; i++) {
+ while(1) {
+ if(fgets(s, MAX_LINE, nullfile) == NULL) {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ if(s[0] == '#' || s[0] == '\n') {
+ continue;
+ }
+ else {
+ switch(i) {
+ case 0:
+ null_model.a_size = atoi(s);
+ null_model.emissions = (double*)malloc_or_die(null_model.a_size * sizeof(double));
+ break;
+ case 1:
+ null_model.a_size_2 = atoi(s);
+ null_model.emissions_2 = (double*)malloc_or_die(null_model.a_size_2 * sizeof(double));
+ break;
+ case 2:
+ null_model.a_size_3 = atoi(s);
+ null_model.emissions_3 = (double*)malloc_or_die(null_model.a_size_3 * sizeof(double));
+ break;
+ case 3:
+ null_model.a_size_4 = atoi(s);
+ null_model.emissions_4 = (double*)malloc_or_die(null_model.a_size_4 * sizeof(double));
+ break;
+ }
+ break;
+ }
+ }
+ j = 0;
+ switch(i) {
+ case 0:
+ while(j < null_model.a_size) {
+ if (fgets(s, MAX_LINE, nullfile) != NULL) {
+ if(s[0] != '#' && s[0] != '\n') {
+ *(null_model.emissions + j) = atof(s);
+ j++;
+ }
+ }
+ else {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ }
+ break;
+ case 1:
+ while(j < null_model.a_size_2) {
+ if (fgets(s, MAX_LINE, nullfile) != NULL) {
+ if(s[0] != '#' && s[0] != '\n') {
+ *(null_model.emissions_2 + j) = atof(s);
+ j++;
+ }
+ }
+ else {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ }
+ break;
+ case 2:
+ while(j < null_model.a_size_3) {
+ if (fgets(s, MAX_LINE, nullfile) != NULL) {
+ if(s[0] != '#' && s[0] != '\n') {
+ *(null_model.emissions_3 + j) = atof(s);
+ j++;
+ }
+ }
+ else {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ }
+ break;
+ case 3:
+ while(j < null_model.a_size_4) {
+ if (fgets(s, MAX_LINE, nullfile) != NULL) {
+ if(s[0] != '#' && s[0] != '\n') {
+ *(null_model.emissions_4 + j) = atof(s);
+ j++;
+ }
+ }
+ else {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ }
+ break;
+ }
+ }
+ while(1) {
+ if(fgets(s, MAX_LINE, nullfile) != NULL) {
+ if(s[0] != '#' && s[0] != '\n') {
+ null_model.trans_prob = atof(s);
+ break;
+ }
+ }
+ else {
+ printf("Could not read null model file\n");
+ exit(0);
+ }
+ }
+ }
+ else {
+ null_model.a_size = -1;
+ null_model.nr_alphabets = -1;
+ }
+}
+
+
+
+/*********************score methods******************************************/
+int get_scores(int use_labels, int scoring_method, int multi_scoring_method, helix_sites* hSites)
+{
+ struct forward_s *forw_mtx;
+ struct one_best_s *one_best_mtx;
+ struct backward_s *backw_mtx;
+ struct forward_s *rev_forw_mtx;
+ double *forw_scale, *rev_forw_scale, *scaling_factors;
+ char *labels;
+ struct letter_s *reverse_seq, *reverse_seq_2, *reverse_seq_3, *reverse_seq_4;
+ struct helix_site *helixSite;
+
+ int seq_len;
+ double forward_score, backward_score, rev_forward_score, raw_forward_score;
+
+ int a,b;
+ int i,j,k;
+
+ struct hmm_multi_s *hmmp;
+ int IN_HELIX;
+ int helices;
+
+ hmmp = &hmm;
+
+
+ /* get seq_length */
+ seq_len = get_seq_length(seq_info.seqs->seq_1);
+
+
+ /* if needed run forward */
+ forward_multi(hmmp, seq_info.seqs->seq_1, seq_info.seqs->seq_2, seq_info.seqs->seq_3, seq_info.seqs->seq_4,
+ &forw_mtx, &forw_scale, use_labels, multi_scoring_method);
+ raw_forward_score = (forw_mtx + get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob;
+ forward_score = log10((forw_mtx + get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for (j = seq_len; j > 0; j--) {
+ forward_score = forward_score + log10(*(forw_scale + j));
+ }
+
+ /* if needed run backward */
+ backward_multi(hmmp, seq_info.seqs->seq_1, seq_info.seqs->seq_2, seq_info.seqs->seq_3, seq_info.seqs->seq_4,
+ &backw_mtx, forw_scale, use_labels, multi_scoring_method);
+ backward_score = log10((backw_mtx + get_mtx_index(0, 0, hmmp->nr_v))->prob);
+ for (j = seq_len; j > 0; j--) {
+ backward_score = backward_score + log10(*(forw_scale + j));
+ }
+
+
+ /* if needed run n-best */
+ labels = (char*)malloc_or_die((seq_len + 1) * sizeof(char));
+ one_best_multi(hmmp, seq_info.seqs->seq_1, seq_info.seqs->seq_2, seq_info.seqs->seq_3, seq_info.seqs->seq_4,
+ &one_best_mtx, &scaling_factors, use_labels, labels, multi_scoring_method);
+
+
+ /* count helices + deallocate */
+ j = 0;
+ helices = 0;
+ IN_HELIX = NO;
+ while(1) {
+ if(labels[j] == '\0') {
+ break;
+ }
+ else {
+ if((labels[j] == 'M') && (IN_HELIX == NO)) {
+ IN_HELIX = YES;
+ helices++;
+ } else if((labels[j] != 'M') && (IN_HELIX == YES)) {
+ IN_HELIX = NO;
+ }
+ j++;
+ }
+ }
+
+ /* Check if we care where the helices are, if so, loop
+ through and record their locations */
+
+ if(hSites) {
+ hSites->helix = (struct helix_site *) malloc(sizeof (struct helix_site) * helices);
+ helixSite = hSites->helix;
+
+ hSites->helix_count = helices;
+
+ j = 0;
+ helices = 0;
+ IN_HELIX = NO;
+ while(1) {
+ if(labels[j] == '\0') {
+ break;
+ }
+ else {
+ if((labels[j] == 'M') && (IN_HELIX == NO)) {
+ IN_HELIX = YES;
+ helixSite->start = (j+1);
+ helices++;
+ } else if((labels[j] != 'M') && (IN_HELIX == YES)) {
+ IN_HELIX = NO;
+ helixSite->end = (j);
+ helixSite++;
+ }
+ j++;
+ }
+ }
+ if(IN_HELIX == YES) { helixSite->end = j; } /* helix extends to the last residue */
+ }
+
+ /* deallocate */
+ free(labels);
+
+ /* deallocate result matrix info */
+ free(forw_mtx);
+ free(forw_scale);
+ free(backw_mtx);
+ free(one_best_mtx);
+ free(scaling_factors);
+
+ return helices;
+
+}
+
+
+/*********************help functions***************************************/
+double get_nullmodel_score_multi(struct letter_s *seq, struct letter_s *seq_2, struct letter_s *seq_3,
+ struct letter_s *seq_4, int seq_len, int multi_scoring_method)
+{
+ int avg_seq_len;
+ double emiss_prob, emiss_prob_2, emiss_prob_3, emiss_prob_4;
+ double e1, e2, e3, e4;
+ double trans_prob;
+ double null_model_score;
+ int letter, letter_2, letter_3, letter_4;
+ int l,i;
+ double t_res;
+
+ /* calculate null model score for seq */
+ if(null_model.nr_alphabets < 0) {
+ /* use default null model */
+ emiss_prob = 1.0 / (double)hmm.a_size;
+ if(hmm.nr_alphabets > 1) {
+ emiss_prob_2 = 1.0 / (double)hmm.a_size_2;
+ }
+ if(hmm.nr_alphabets > 2) {
+ emiss_prob_3 = 1.0 / (double)hmm.a_size_3;
+ }
+ if(hmm.nr_alphabets > 3) {
+ emiss_prob_4 = 1.0 / (double)hmm.a_size_4;
+ }
+ trans_prob = (double)(seq_len)/(double)(seq_len + 1);
+ if(multi_scoring_method == JOINT_PROB) {
+ if(hmm.nr_alphabets == 1) {
+ null_model_score = seq_len * (log10(emiss_prob) + log10(trans_prob));
+ }
+ else if(hmm.nr_alphabets == 2) {
+ null_model_score = seq_len * (log10(emiss_prob) + log10(emiss_prob_2) + log10(trans_prob));
+ }
+ else if(hmm.nr_alphabets == 3) {
+ null_model_score = seq_len * (log10(emiss_prob) + log10(emiss_prob_2) + log10(emiss_prob_3) + log10(trans_prob));
+ }
+ else if(hmm.nr_alphabets == 4) {
+ null_model_score = seq_len * (log10(emiss_prob) + log10(emiss_prob_2) + log10(emiss_prob_3) + log10(emiss_prob_4)
+ + log10(trans_prob));
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+ }
+ else {
+ /* use specified null model */
+ null_model_score = 0.0;
+ for(l = 0; l < seq_len; l++) {
+ letter = get_alphabet_index(&seq[l], hmm.alphabet, hmm.a_size);
+ if(hmm.nr_alphabets > 1) {
+ letter_2 = get_alphabet_index(&seq_2[l], hmm.alphabet_2, hmm.a_size_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ letter_3 = get_alphabet_index(&seq_3[l], hmm.alphabet_3, hmm.a_size_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ letter_4 = get_alphabet_index(&seq_4[l], hmm.alphabet_4, hmm.a_size_4);
+ }
+
+ if(letter >= 0) {
+ e1 = null_model.emissions[letter];
+ //null_model_score += log10(null_model.emissions[letter]) + log10(null_model.trans_prob);
+ }
+ else {
+ /* need replacement letters */
+ letter = get_replacement_letter_index_multi(&seq[l], &replacement_letters, 1);
+ if(letter >= 0) {
+ e1 = 0.0;
+ for(i = 0; i < hmm.a_size; i++) {
+ e1 += *(hmm.replacement_letters->probs_1 + get_mtx_index(letter, i, hmm.a_size)) * null_model.emissions[i];
+ }
+ //null_model_score += log10(t_res) + log10(null_model.trans_prob);
+ }
+ else {
+ printf("Could not find letter %s when scoring null model\n", &seq[l]);
+ return DEFAULT;
+ }
+ }
+ if(hmm.nr_alphabets > 1) {
+ if(letter_2 >= 0) {
+ e2 = null_model.emissions_2[letter_2];
+ }
+ else {
+ /* need replacement letters */
+ letter_2 = get_replacement_letter_index_multi(&seq_2[l], &replacement_letters, 2);
+ if(letter_2 >= 0) {
+ e2 = 0.0;
+ for(i = 0; i < hmm.a_size_2; i++) {
+ e2 += *(hmm.replacement_letters->probs_2 + get_mtx_index(letter_2, i, hmm.a_size_2)) * null_model.emissions_2[i];
+ }
+ }
+ else {
+ printf("Could not find letter %s when scoring null model\n", &seq_2[l]);
+ return DEFAULT;
+ }
+ }
+ }
+ if(hmm.nr_alphabets > 2) {
+ if(letter_3 >= 0) {
+ e3 = null_model.emissions_3[letter_3];
+ }
+ else {
+ /* need replacement letters */
+ letter_3 = get_replacement_letter_index_multi(&seq_3[l], &replacement_letters, 3);
+ if(letter_3 >= 0) {
+ e3 = 0.0;
+ for(i = 0; i < hmm.a_size_3; i++) {
+ e3 += *(hmm.replacement_letters->probs_3 + get_mtx_index(letter_3, i, hmm.a_size_3)) * null_model.emissions_3[i];
+ }
+ }
+ else {
+ printf("Could not find letter %s when scoring null model\n", &seq_3[l]);
+ return DEFAULT;
+ }
+ }
+ }
+ if(hmm.nr_alphabets > 3) {
+ if(letter_4 >= 0) {
+ e4 = null_model.emissions_4[letter_4];
+ }
+ else {
+ /* need replacement letters */
+ letter_4 = get_replacement_letter_index_multi(&seq_4[l], &replacement_letters, 4);
+ if(letter_4 >= 0) {
+ e4 = 0.0;
+ for(i = 0; i < hmm.a_size_4; i++) {
+ e4 += *(hmm.replacement_letters->probs_4 + get_mtx_index(letter_4, i, hmm.a_size_4)) * null_model.emissions_4[i];
+ }
+ }
+ else {
+ printf("Could not find letter %s when scoring null model\n", &seq_4[l]);
+ return DEFAULT;
+ }
+ }
+ }
+
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(e1) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(e2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(e3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(e4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+ }
+ }
+
+ return null_model_score;
+}
+
+
+
+
+
+
+double get_nullmodel_score_msa_multi(int seq_len, int prf_mode, int use_prior,
+ int use_gap_shares, int normalize,
+ int scoring_method, int multi_scoring_method, double *aa_freqs, double *aa_freqs_2,
+ double *aa_freqs_3, double *aa_freqs_4)
+{
+ int avg_seq_len;
+ double emiss_prob, emiss_prob_2, emiss_prob_3, emiss_prob_4;
+ double trans_prob;
+ double null_model_score;
+ int letter;
+ int k,l,p,m;
+ double col_score, col_score_2, col_score_3, col_score_4;
+ int using_default_null_model;
+ double seq_normalizer, state_normalizer;
+
+ using_default_null_model = NO;
+ /* calculate null model score for seq */
+ if(null_model.nr_alphabets < 0) {
+ /* use default null model */
+ emiss_prob = 1.0 / (double)hmm.a_size;
+ if(hmm.nr_alphabets > 1) {
+ emiss_prob_2 = 1.0 / (double)hmm.a_size_2;
+ }
+ if(hmm.nr_alphabets > 2) {
+ emiss_prob_3 = 1.0 / (double)hmm.a_size_3;
+ }
+ if(hmm.nr_alphabets > 3) {
+ emiss_prob_4 = 1.0 / (double)hmm.a_size_4;
+ }
+ trans_prob = (double)(seq_len)/(double)(seq_len + 1);
+ null_model.a_size = hmm.a_size;
+ null_model.emissions = (double*)malloc_or_die(null_model.a_size * sizeof(double));
+ if(hmm.nr_alphabets > 1) {
+ null_model.a_size_2 = hmm.a_size_2;
+ null_model.emissions_2 = (double*)malloc_or_die(null_model.a_size_2 * sizeof(double));
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model.a_size_3 = hmm.a_size_3;
+ null_model.emissions_3 = (double*)malloc_or_die(null_model.a_size_3 * sizeof(double));
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model.a_size_4 = hmm.a_size_4;
+ null_model.emissions_4 = (double*)malloc_or_die(null_model.a_size_4 * sizeof(double));
+ }
+ for(k = 0; k < hmm.a_size; k++) {
+ null_model.emissions[k] = emiss_prob;
+ }
+ for(k = 0; hmm.nr_alphabets > 1 && k < hmm.a_size_2; k++) {
+ null_model.emissions_2[k] = emiss_prob_2;
+ }
+ for(k = 0; hmm.nr_alphabets > 2 && k < hmm.a_size_3; k++) {
+ null_model.emissions_3[k] = emiss_prob_3;
+ }
+ for(k = 0; hmm.nr_alphabets > 3 && k < hmm.a_size_4; k++) {
+ null_model.emissions_4[k] = emiss_prob_4;
+ }
+
+ null_model.trans_prob = trans_prob;
+ using_default_null_model = YES;
+ }
+
+
+ /* NOTE: must include other scoring methods here as well (copy from core_algorithms) */
+ /* use specified null model */
+ null_model_score = 0.0;
+ if(scoring_method == DOT_PRODUCT) {
+ l = 0;
+ while(1) {
+
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_dp_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_dp_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_dp_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_dp_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ if(scoring_method == DOT_PRODUCT_PICASSO) {
+ l = 0;
+ while(1) {
+
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_dp_picasso_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_dp_picasso_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_dp_picasso_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_dp_picasso_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ if(scoring_method == PICASSO) {
+ l = 0;
+ while(1) {
+
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_picasso_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_picasso_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_picasso_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_picasso_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ if(scoring_method == PICASSO_SYM) {
+ l = 0;
+ while(1) {
+
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_picasso_sym_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_picasso_sym_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_picasso_sym_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_picasso_sym_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares, aa_freqs_4);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(0);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ if(scoring_method == SJOLANDER) {
+ l = 0;
+ while(1) {
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_sjolander_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_sjolander_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_sjolander_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_sjolander_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(-1);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ if(scoring_method == SJOLANDER_REVERSED) {
+ l = 0;
+ while(1) {
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_sjolander_reversed_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, normalize, msa_seq_infop->gap_shares);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_sjolander_reversed_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_sjolander_reversed_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_sjolander_reversed_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, normalize, msa_seq_infop->gap_shares);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(-1);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ else if(scoring_method == SUBST_MTX_PRODUCT) {
+ l = 0;
+ while(1) {
+ /* set index for seq pointer */
+ if(prf_mode == ALL_SEQS) {
+ p = l;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + l);
+ }
+
+ /* get col_score for the different alphabets */
+ col_score = 0.0;
+ col_score_2 = 0.0;
+ col_score_3 = 0.0;
+ col_score_4 = 0.0;
+ col_score = get_subst_mtx_product_statescore(null_model.a_size, use_gap_shares, use_prior, msa_seq_infop->msa_seq_1,
+ p, null_model.emissions,
+ 0, hmm.subst_mtx);
+ if(hmm.nr_alphabets > 1) {
+ col_score_2 = get_subst_mtx_product_statescore(null_model.a_size_2, use_gap_shares, use_prior, msa_seq_infop->msa_seq_2,
+ p, null_model.emissions_2,
+ 0, hmm.subst_mtx_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ col_score_3 = get_subst_mtx_product_statescore(null_model.a_size_3, use_gap_shares, use_prior, msa_seq_infop->msa_seq_3,
+ p, null_model.emissions_3,
+ 0, hmm.subst_mtx_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ col_score_4 = get_subst_mtx_product_statescore(null_model.a_size_4, use_gap_shares, use_prior, msa_seq_infop->msa_seq_4,
+ p, null_model.emissions_4,
+ 0, hmm.subst_mtx_4);
+ }
+
+ /* calculate total column score */
+ if(multi_scoring_method == JOINT_PROB) {
+ null_model_score += log10(col_score) + log10(null_model.trans_prob);
+ if(hmm.nr_alphabets > 1) {
+ null_model_score += log10(col_score_2);
+ }
+ if(hmm.nr_alphabets > 2) {
+ null_model_score += log10(col_score_3);
+ }
+ if(hmm.nr_alphabets > 3) {
+ null_model_score += log10(col_score_4);
+ }
+ }
+ else {
+ /* use other multialpha scoring method, not implemented yet */
+ printf("Error: only JOINT_PROB scoring is implemented\n");
+ exit(-1);
+ }
+
+ /* update seq index and check if we are done */
+ l++;
+ if(prf_mode == ALL_SEQS) {
+ if(l >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + l) == END) {
+ break;
+ }
+ }
+ }
+ }
+
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT) {
+ printf("SMDP scoring not implemented yet\n");
+ exit(-1);
+
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT_PRIOR) {
+ printf("SMDPP scoring not implemented yet\n");
+ exit(-1);
+ }
+
+ if(using_default_null_model == YES) {
+ free(null_model.emissions);
+ null_model.a_size = -1;
+ }
+ return null_model_score;
+}
+
+
+
+void get_post_prob_matrix(double **ret_post_prob_matrixp, double forw_score, struct forward_s *forw_mtx,
+ struct backward_s *backw_mtx, int seq_len)
+{
+ int i,j;
+ double *post_prob_matrix;
+ double post_prob_score;
+
+ /* must be freed by caller */
+ post_prob_matrix = (double*)(malloc_or_die((seq_len+2) * hmm.nr_v * sizeof(double)));
+ for(i = 0; i < seq_len + 2; i++) {
+ for(j = 0; j < hmm.nr_v; j++) {
+ //printf("forw_score = %f, forw_mtx = %f, backw_mtx = %f\n", forw_score,(forw_mtx + get_mtx_index(i,j,hmm.nr_v))->prob,
+ // (backw_mtx + get_mtx_index(i,j,hmm.nr_v))->prob);
+ post_prob_score = (forw_mtx + get_mtx_index(i,j,hmm.nr_v))->prob * (backw_mtx + get_mtx_index(i,j,hmm.nr_v))->prob /
+ forw_score;
+ *(post_prob_matrix + get_mtx_index(i,j,hmm.nr_v)) = post_prob_score;
+ }
+ }
+
+ *ret_post_prob_matrixp = post_prob_matrix;
+}
diff --git a/modhmm0.92b/hmmsearch.c.flc b/modhmm0.92b/hmmsearch.c.flc
new file mode 100644
index 0000000..8099015
--- /dev/null
+++ b/modhmm0.92b/hmmsearch.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17618 . 34116)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
diff --git a/modhmm0.92b/readhmm.c b/modhmm0.92b/readhmm.c
new file mode 100644
index 0000000..fbbaa5c
--- /dev/null
+++ b/modhmm0.92b/readhmm.c
@@ -0,0 +1,1434 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <math.h>
+
+
+#include "structs.h" /* data structures etc */
+#include "funcs.h" /* function header */
+
+#define MAX_LINE 500
+
+//#define DEBUG_RD
+//#define DEBUG_RDPRI
+
+extern int verbose;
+
+int read_module(char*, FILE*, struct hmm_multi_s*, struct module_multi_s*, int*, int*);
+void check_probs(struct hmm_multi_s*);
+void create_to_silent_trans_array(struct hmm_multi_s *hmmp);
+void create_from_silent_trans_array(struct hmm_multi_s *hmmp);
+void create_from_trans_array(struct hmm_multi_s*);
+void create_to_trans_array(struct hmm_multi_s*);
+void create_tot_transitions(struct hmm_multi_s*);
+void create_tot_trans_arrays(struct hmm_multi_s *hmmp);
+void add_all_from_paths(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp, int length);
+void add_all_to_paths(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp, int length);
+int read_prior_files(int, struct emission_dirichlet_s*, struct hmm_multi_s*, char*, FILE*);
+int read_trans_prior_files(int, void*, struct hmm_multi_s*, FILE*);
+int silent_vertex(int v, struct hmm_multi_s *hmmp);
+
+int readhmm(FILE *file, struct hmm_multi_s *hmmp, char* path)
+{
+
+ char s[MAX_LINE], *c;
+ int i,j;
+ int res;
+ int **from_trans_array, **to_trans_array;
+ int *from_trans, *to_trans, *cur;
+ struct module_multi_s *module;
+ struct emission_dirichlet_s *emission_priorsp;
+ void *transition_priorsp;
+ int nr_priorfiles, nr_trans_priorfiles;
+ int silent_counter, locked_counter;
+ char *nr_trans_tiesp, *nr_distrib_tiesp;
+ struct transition_s *trans_ties;
+ struct transition_s trans;
+
+ if(verbose == YES) {
+ printf("reading hmm ");
+ }
+
+ /* set alphabet type to DISCRETE by default; this should be reset when reading the sequences */
+ hmmp->alphabet_type = DISCRETE;
+
+ /* read header */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* name */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->name, &s[6]);
+ if(verbose == YES) {
+ printf("%s ... ", hmmp->name);
+ fflush(stdout);
+ }
+ }
+ /* creation time */
+ fgets(s, MAX_LINE, file);
+ /* alphabet */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet, &s[10]);
+ hmmp->nr_alphabets = 1;
+ }
+ /* alphabet length */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size = atoi(&s[17]);
+ }
+ /* nr of modules */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_m = atoi(&s[15]);
+ }
+ /* nr of vertices */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_v = atoi(&s[16]);
+ hmmp->transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ hmmp->log_transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_transitions, DEFAULT, hmmp->nr_v * hmmp->nr_v);
+ hmmp->emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ hmmp->log_emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions, DEFAULT, hmmp->nr_v * hmmp->a_size);
+ hmmp->modules = (struct module_multi_s**)(malloc_or_die(hmmp->nr_m * sizeof(struct module_multi_s*) +
+ hmmp->nr_m * sizeof(struct module_multi_s)));
+ module = (struct module_multi_s*)(hmmp->modules + hmmp->nr_m);
+ hmmp->silent_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ hmmp->locked_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ for(i = 0; i < hmmp->nr_v; i++) {
+ *(hmmp->locked_vertices + i) = NO;
+ }
+ hmmp->vertex_labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+ hmmp->vertex_trans_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ hmmp->vertex_emiss_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+
+ }
+ /* nr of transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_t = atoi(&s[19]);
+ }
+ /* nr of distribution groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_d = atoi(&s[27]);
+ hmmp->distrib_groups = (int*)(malloc_or_die((hmmp->nr_d + hmmp->nr_v) * sizeof(int)));
+ }
+ /* nr of trans tie groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_ttg = atoi(&s[29]);
+ hmmp->trans_tie_groups = (int*)(malloc_or_die((hmmp->nr_t + hmmp->nr_ttg) * sizeof(struct transition_s)));
+ }
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[27]);
+ hmmp->nr_ed = nr_priorfiles;
+ emission_priorsp = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets = emission_priorsp;
+ hmmp->ed_ps = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files(nr_priorfiles, emission_priorsp, hmmp, path, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+ /* nr of transition priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_trans_priorfiles = atoi(&s[29]);
+ transition_priorsp = NULL;
+ /* not implemented yet */
+ }
+ /* read the transition priorfiles */
+ if(read_trans_prior_files(nr_trans_priorfiles, transition_priorsp, hmmp, file) < 0) {
+ printf("Could not read transition priorfiles\n");
+ exit(-1);
+ }
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* reads ****************Modules*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* read the modules */
+ silent_counter = 0;
+ locked_counter = 0;
+ for(i = 0; i < hmmp->nr_m; i++) {
+ *(hmmp->modules + i) = module;
+ if((res = read_module(s, file, hmmp, module, &silent_counter, &locked_counter)) < 0) {
+ printf("Could not read modules\n");
+ exit(-1);
+ }
+ module++;
+ }
+ *(hmmp->silent_vertices + silent_counter) = END;
+ *(hmmp->locked_vertices + hmmp->nr_v) = END;
+
+#ifdef DEBUG_RD
+ //dump_locked_vertices(hmmp);
+ //dump_silent_vertices(hmmp);
+ //dump_modules(hmmp);
+#endif
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Emission distribution groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* read the distribution groups */
+ cur = hmmp->distrib_groups;
+ for(i = 0; i < hmmp->nr_d; i++) {
+ j = 0;
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ *cur = atoi(&s[j]);
+ cur++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ *cur = END;
+ cur++;
+ }
+ }
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Transition tie groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* read the trans tie groups */
+ trans_ties = hmmp->trans_tie_groups;
+ for(i = 0; i < hmmp->nr_ttg; i++) {
+ if(fgets(s, MAX_LINE, file) != NULL && s[0] != '\n') {
+ j = 0;
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ trans.from_v = atoi(&s[j]);
+ while(s[j] != '>') {
+ j++;
+ }
+ j++;
+ trans.to_v = atoi(&s[j]);
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ trans.to_v = END;
+ trans.from_v = END;
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ }
+ else {
+ hmmp->nr_ttg = i;
+ break;
+ }
+ }
+#ifdef DEBUG_RD
+ dump_distrib_groups(hmmp->distrib_groups, hmmp->nr_d);
+ dump_trans_tie_groups(hmmp->trans_tie_groups, hmmp->nr_ttg);
+#endif
+
+ /* create to_silent_trans_array */
+ create_to_silent_trans_array(hmmp);
+
+ /* create from_trans_array */
+ create_from_trans_array(hmmp);
+
+ /* create to_trans_array */
+ create_to_trans_array(hmmp);
+
+ /* create tot_transitions */
+ create_tot_transitions(hmmp);
+
+ /* create tot_to_trans_array and tot_from_trans_arrays*/
+ create_tot_trans_arrays(hmmp);
+
+ /* get the set of labels and the number of labels */
+ get_set_of_labels_multi(hmmp);
+
+#ifdef DEBUG_RD
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ printf("hmmp->emission_dirichlets = %p\n", (void*)hmmp->emission_dirichlets);
+ for(i = 0; i < hmmp->nr_v; i++) {
+ printf("hmmp->ed_ps for vertex %d = %p\n", i, (void*)*(hmmp->ed_ps + i));
+ }
+#endif
+
+
+ /* make sure all probabilities are legal*/
+ //check_probs(hmmp);
+
+ if(verbose == YES) {
+ printf("done\n");
+ }
+
+ return 0;
+}
+
+
+int readanhmm(FILE *file, struct hmm_multi_s *hmmp, char* path)
+{
+
+ char s[MAX_LINE], *c;
+ int i,j;
+ int res;
+ int **from_trans_array, **to_trans_array;
+ int *from_trans, *to_trans, *cur;
+ struct module_multi_s *module;
+ struct emission_dirichlet_s *emission_priorsp;
+ void *transition_priorsp;
+ int nr_priorfiles, nr_trans_priorfiles;
+ int silent_counter, locked_counter;
+ char *nr_trans_tiesp, *nr_distrib_tiesp;
+ struct transition_s *trans_ties;
+ struct transition_s trans;
+
+ if(verbose == YES) {
+ printf("reading hmm ");
+ }
+
+ /* set alphabet type to DISCRETE by default; this should be reset when reading the sequences */
+ hmmp->alphabet_type = DISCRETE;
+
+ /* read header */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* name */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->name, &s[6]);
+ if(verbose == YES) {
+ printf("%s ... ", hmmp->name);
+ fflush(stdout);
+ }
+ }
+ /* creation time */
+ fgets(s, MAX_LINE, file);
+ /* alphabet */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet, &s[10]);
+ hmmp->nr_alphabets = 1;
+ }
+ /* alphabet length */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size = atoi(&s[17]);
+ }
+ /* nr of modules */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_m = atoi(&s[15]);
+ }
+ /* nr of vertices */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_v = atoi(&s[16]);
+ hmmp->transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ hmmp->log_transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_transitions, DEFAULT, hmmp->nr_v * hmmp->nr_v);
+ hmmp->emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ hmmp->log_emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions, DEFAULT, hmmp->nr_v * hmmp->a_size);
+ hmmp->modules = (struct module_multi_s**)(malloc_or_die(hmmp->nr_m * sizeof(struct module_multi_s*) +
+ hmmp->nr_m * sizeof(struct module_multi_s)));
+ module = (struct module_multi_s*)(hmmp->modules + hmmp->nr_m);
+ hmmp->silent_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ hmmp->locked_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ for(i = 0; i < hmmp->nr_v; i++) {
+ *(hmmp->locked_vertices + i) = NO;
+ }
+ hmmp->vertex_labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+ hmmp->vertex_trans_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ hmmp->vertex_emiss_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+
+ }
+ /* nr of transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_t = atoi(&s[19]);
+ }
+ /* nr of distribution groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_d = atoi(&s[27]);
+ hmmp->distrib_groups = (int*)(malloc_or_die((hmmp->nr_d + hmmp->nr_v) * sizeof(int)));
+ }
+ /* nr of trans tie groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_ttg = atoi(&s[29]);
+ hmmp->trans_tie_groups = (int*)(malloc_or_die((hmmp->nr_t + hmmp->nr_ttg) * sizeof(struct transition_s)));
+ }
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[27]);
+ hmmp->nr_ed = nr_priorfiles;
+ emission_priorsp = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets = emission_priorsp;
+ hmmp->ed_ps = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files(nr_priorfiles, emission_priorsp, hmmp, path, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+ /* nr of transition priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_trans_priorfiles = atoi(&s[29]);
+ transition_priorsp = NULL;
+ /* not implemented yet */
+ }
+ /* read the transition priorfiles */
+ if(read_trans_prior_files(nr_trans_priorfiles, transition_priorsp, hmmp, file) < 0) {
+ printf("Could not read transition priorfiles\n");
+ exit(-1);
+ }
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* reads ****************Modules*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* read the modules */
+ silent_counter = 0;
+ locked_counter = 0;
+ for(i = 0; i < hmmp->nr_m; i++) {
+ *(hmmp->modules + i) = module;
+ if((res = read_module(s, file, hmmp, module, &silent_counter, &locked_counter)) < 0) {
+ printf("Could not read modules\n");
+ exit(-1);
+ }
+ module++;
+ }
+ *(hmmp->silent_vertices + silent_counter) = END;
+ *(hmmp->locked_vertices + hmmp->nr_v) = END;
+
+#ifdef DEBUG_RD
+ //dump_locked_vertices(hmmp);
+ //dump_silent_vertices(hmmp);
+ //dump_modules(hmmp);
+#endif
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Emission distribution groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* read the distribution groups */
+ cur = hmmp->distrib_groups;
+ for(i = 0; i < hmmp->nr_d; i++) {
+ j = 0;
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ *cur = atoi(&s[j]);
+ cur++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ *cur = END;
+ cur++;
+ }
+ }
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Transition tie groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* read the trans tie groups */
+ trans_ties = hmmp->trans_tie_groups;
+ for(i = 0; i < hmmp->nr_ttg; i++) {
+ if(fgets(s, MAX_LINE, file) != NULL && s[0] != '\n') {
+ j = 0;
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ trans.from_v = atoi(&s[j]);
+ while(s[j] != '>') {
+ j++;
+ }
+ j++;
+ trans.to_v = atoi(&s[j]);
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ trans.to_v = END;
+ trans.from_v = END;
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ }
+ else {
+ hmmp->nr_ttg = i;
+ break;
+ }
+ }
+#ifdef DEBUG_RD
+ dump_distrib_groups(hmmp->distrib_groups, hmmp->nr_d);
+ dump_trans_tie_groups(hmmp->trans_tie_groups, hmmp->nr_ttg);
+#endif
+
+ /* create to_silent_trans_array */
+ create_to_silent_trans_array(hmmp);
+
+ /* create from_trans_array */
+ create_from_trans_array(hmmp);
+
+ /* create to_trans_array */
+ create_to_trans_array(hmmp);
+
+ /* create tot_transitions */
+ create_tot_transitions(hmmp);
+
+ /* create tot_to_trans_array and tot_from_trans_arrays*/
+ create_tot_trans_arrays(hmmp);
+
+ /* get the set of labels and the number of labels */
+ get_set_of_labels_multi(hmmp);
+
+#ifdef DEBUG_RD
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ printf("hmmp->emission_dirichlets = %p\n", (void*)hmmp->emission_dirichlets);
+ for(i = 0; i < hmmp->nr_v; i++) {
+ printf("hmmp->ed_ps for vertex %d = %p\n", i, (void*)*(hmmp->ed_ps + i));
+ }
+#endif
+
+
+ /* make sure all probabilities are legal*/
+ //check_probs(hmmp);
+
+ if(verbose == YES) {
+ printf("done\n");
+ }
+ return 0;
+}
+
+
+/************************read_module *************************************/
+int read_module(char *s, FILE *file, struct hmm_multi_s *hmmp, struct module_multi_s *modulep,
+ int *silent_counter, int *locked_counter)
+{
+ int nr_v, nr_t, nr_e, nr_et;
+ int i,j,k;
+ int from_v, to_v;
+ double prob, log_prob;
+ char type[50];
+ char prifile_name[500];
+ char *p, *probp;
+ struct emission_dirichlet_s *priorsp;
+ int silent_vertex;
+
+ /* module name */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(modulep->name, &s[8]);
+ }
+ /* module type */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(type, &s[6]);
+ if(strncmp(type, "Singlenode", 10) == 0) {
+ modulep->type = SINGLENODE;
+ }
+ else if(strncmp(type, "Cluster", 7) == 0) {
+ modulep->type = CLUSTER;
+ }
+ else if(strncmp(type, "Forward_std", 11) == 0) {
+ modulep->type = FORWARD_STD;
+ }
+ else if(strncmp(type, "Forward_alt", 11) == 0) {
+ modulep->type = FORWARD_ALT;
+ }
+ else if(strncmp(type, "Singleloop", 10) == 0) {
+ modulep->type = SINGLELOOP;
+ }
+ else if(strncmp(type, "Profile7", 8) == 0) {
+ modulep->type = PROFILE7;
+ }
+ else if(strncmp(type, "Profile9", 8) == 0) {
+ modulep->type = PROFILE9;
+ }
+ else {
+ printf("Error: module is of unknown type\n");
+ exit(-1);
+ }
+ }
+ /* nr vertices */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ modulep->nr_v = atoi(&s[12]);
+ modulep->vertices = (int*)(malloc_or_die(modulep->nr_v * sizeof(int)));
+ }
+ /* emission prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(prifile_name, (&s[21]));
+ if((p = strstr(prifile_name, "\n")) != NULL) {
+ *p = '\0';
+ }
+ if(strncmp(prifile_name, "null", 4) == 0) {
+ strcpy(modulep->priorfile_name, "null");
+ priorsp = NULL;
+ }
+ else {
+ strcpy(modulep->priorfile_name, prifile_name);
+ strcat(modulep->priorfile_name, "\0");
+ for(i = 0; i < hmmp->nr_ed; i++) {
+ priorsp = (hmmp->emission_dirichlets + i);
+ if((strncmp(prifile_name, priorsp->name, 200)) == 0) {
+ /* keep this priorsp */
+ break;
+ }
+ else {
+#ifdef DEBUG_RD
+ printf("prifile_name = %s\n", prifile_name);
+ printf("priorsp->name = %s\n", priorsp->name);
+#endif
+ }
+ if(i == hmmp->nr_ed-1) /* no name equals priorfile_name */{
+ printf("Couldn't find emission priorfile '%s'\n", prifile_name);
+ exit(-1);
+ }
+ }
+ }
+ }
+ /* transition prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ /* not implemented yet */
+ }
+ /* empty line */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* loop over the vertices */
+ for(i = 0; i < modulep->nr_v; i++) {
+ /* Vertex nr */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ from_v = atoi(&s[7]);
+ *(modulep->vertices + i) = from_v;
+
+ /* connect this vertex to its priorfile */
+ *(hmmp->ed_ps + from_v) = priorsp;
+ }
+ /* Vertex type */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(type, &s[13]);
+ if(modulep->type == PROFILE7 || modulep->type == PROFILE9) {
+ modulep->v_type = PROFILEV;
+ }
+ if(strncmp(type, "standard", 8) == 0) {
+ if(modulep->type != PROFILE7 && modulep->type != PROFILE9) {
+ modulep->v_type = STANDARDV;
+ silent_vertex = NO;
+ }
+ }
+ else if(strncmp(type, "silent", 6) == 0) {
+ silent_vertex = YES;
+ if(modulep->type != PROFILE7 && modulep->type != PROFILE9) {
+ modulep->v_type = SILENTV;
+ }
+ *(hmmp->silent_vertices + *silent_counter) = from_v;
+ *silent_counter = *silent_counter + 1;
+ }
+ else if(strncmp(type, "locked", 6) == 0) {
+ modulep->v_type = LOCKEDV;
+ *(hmmp->locked_vertices + from_v) = YES;
+ *locked_counter = *locked_counter + 1;
+ silent_vertex = NO;
+ }
+ else if(strncmp(type, "start", 5) == 0) {
+ modulep->v_type = STARTV;
+ hmmp->startnode = from_v;
+ }
+ else if(strncmp(type, "end", 3) == 0) {
+ modulep->v_type = ENDV;
+ hmmp->endnode = from_v;
+ }
+ else {
+ printf("Error: vertex type is undefined\n");
+ printf("vertex type = %s\n", type);
+ printf("from_v = %d\n", from_v);
+ exit(-1);
+ }
+ }
+ /* Vertex label */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_labels + from_v) = s[14];
+ }
+ /* transition prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_trans_prior_scalers + from_v) = atof(&(s[25]));
+ }
+ /* emission prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_emiss_prior_scalers + from_v) = atof(&(s[23]));
+ }
+ /* Nr transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_t = atoi(&s[17]);
+ }
+ /* Nr end transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_et = atoi(&s[21]);
+ }
+ /* Nr emissions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_e = atoi(&s[15]);
+ }
+ /* read transition probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_t; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ to_v = atoi(&s[8]);
+ if(to_v < 10 ) {
+ prob = (double)(atof(&s[11]));
+ }
+ else if(to_v < 100) {
+ prob = (double)(atof(&s[12]));
+ }
+ else if(to_v < 1000) {
+ prob = (double)(atof(&s[13]));
+ }
+ else if(to_v < 10000) {
+ prob = (double)(atof(&s[14]));
+ }
+ else {
+ printf("Sorry, reader cannot handle HMMs with more than 10000 states\n");
+ exit(-1);
+ }
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+#ifdef DEBUG_RD
+ printf("prob from %d to %d = %f\n", from_v, to_v, prob);
+#endif
+ *(hmmp->transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = prob;
+ *(hmmp->log_transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = log_prob;
+ }
+ }
+
+ /* read end transition probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_et; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ to_v = atoi(&s[8]);
+ if(to_v < 10 ) {
+ prob = (double)(atof(&s[11]));
+ }
+ else if(to_v < 100) {
+ prob = (double)(atof(&s[12]));
+ }
+ else if(to_v < 1000) {
+ prob = (double)(atof(&s[13]));
+ }
+ else if(to_v < 10000) {
+ prob = (double)(atof(&s[14]));
+ }
+ else {
+ printf("Sorry, reader cannot handle HMMs with more than 10000 states\n");
+ exit(-1);
+ }
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+#ifdef DEBUG_RD
+ printf("end prob from %d to %d = %f\n", from_v, to_v, prob);
+#endif
+ *(hmmp->transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = prob;
+ *(hmmp->log_transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = log_prob;
+ }
+ }
+
+ /* read emission probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_e; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ k = 0;
+ while(s[k] != ' ') {
+ k++;
+ }
+ if(k > 7) {
+ printf("Cannot read hmm file, please check hmm specification\n");
+ }
+ k++;
+ prob = (double)(atof(&s[k]));
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+ if(silent_vertex == YES) {
+ prob = SILENT;
+ log_prob = SILENT;
+ }
+#ifdef DEBUG_RD
+ printf("emissprob in %d of letter %d = %f\n", from_v, j, prob);
+#endif
+ *(hmmp->emissions + get_mtx_index(from_v, j, hmmp->a_size)) = prob;
+ *(hmmp->log_emissions + get_mtx_index(from_v, j, hmmp->a_size)) = log_prob;
+ }
+ }
+ fgets(s, MAX_LINE, file);
+ silent_vertex = NO;
+ }
+ /* read ---------------------------------------- */
+ fgets(s, MAX_LINE, file);
+
+#ifdef DEBUG_RD
+ printf("exiting read_module\n");
+#endif
+ return 0;
+
+}
+
+
+void create_to_silent_trans_array(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ int malloc_size;
+ int *values;
+
+ malloc_size = 0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0 && silent_vertex(w,hmmp) == YES) {
+ malloc_size++;
+ }
+ }
+ malloc_size++;
+ }
+
+ hmmp->to_silent_trans_array = (int**)malloc_or_die(hmmp->nr_v * sizeof(int*) + malloc_size * sizeof(int));
+ values = (int*)(hmmp->to_silent_trans_array + hmmp->nr_v);
+
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->to_silent_trans_array + v) = values;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0 && silent_vertex(w,hmmp) == YES) {
+ *values = w;
+ values++;
+ }
+ }
+ *values = END;
+ values++;
+ }
+
+#ifdef DEBUG_RD
+ dump_to_silent_trans_array(hmmp->nr_v, hmmp->to_silent_trans_array);
+#endif
+}
+
+
+
+
+
+
+
+
+/* Go through the transition matrix and collect all nonzero probabilities
+ * into from_trans_array */
+void create_from_trans_array(struct hmm_multi_s *hmmp)
+{
+ int v,w,*xp;
+ int has_to_trans;
+ int array_head_size, array_tail_size;
+ struct path_element **from_trans_array, *from_trans, *temp_path;
+
+ array_tail_size = 0;
+ array_head_size = hmmp->nr_v;
+
+ /* estimate how much space we need to store transitions */
+ array_tail_size = (hmmp->nr_t/hmmp->nr_v + 3 + MAX_GAP_SIZE) * MAX_GAP_SIZE/2 * hmmp->nr_v;
+
+#ifdef DEBUG_RD
+ printf("array_head_size, array_tail_size = %d, %d\n", array_head_size, array_tail_size);
+#endif
+ from_trans_array = (struct path_element**)
+ (malloc_or_die(array_head_size * sizeof(struct path_element*) +
+ (array_tail_size + hmmp->nr_v) * sizeof(struct path_element)));
+ from_trans = (struct path_element*)(from_trans_array + hmmp->nr_v);
+ hmmp->from_trans_array = from_trans_array;
+
+ /* find all paths and add them to from_trans_array */
+ for(v = 0; v < hmmp->nr_v; v++) /* to-vertex */ {
+ *from_trans_array = from_trans;
+ if(silent_vertex(v, hmmp) == YES) {
+ from_trans->vertex = END;
+ from_trans->next = NULL;
+ from_trans++;
+ from_trans_array++;
+ continue;
+ }
+ for(w = 0; w < hmmp->nr_v; w++) /* from-vertex */ {
+ if(silent_vertex(w,hmmp) == YES) {
+ continue;
+ }
+ temp_path = (struct path_element*)(malloc_or_die(1000 * sizeof(struct path_element)));
+ add_all_from_paths(w, v, hmmp, &from_trans, temp_path, 0);
+ free(temp_path);
+ }
+ from_trans->vertex = END;
+ from_trans->next = NULL;
+ from_trans++;
+ from_trans_array++;
+ }
+#ifdef DEBUG_RD
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_from_trans_array(hmmp->nr_v, hmmp->from_trans_array);
+#endif
+}
+
+void add_all_from_paths(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp,
+ int length)
+{
+ int i,j;
+ int *xp;
+ struct path_element p_el, cur_p_el;
+
+ if(length > MAX_GAP_SIZE) {
+ return;
+ }
+
+ cur_p_el.vertex = v;
+ cur_p_el.next = NULL;
+ memcpy(temp_pathp + length, &cur_p_el, sizeof(struct path_element));
+
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+#ifdef DEBUG_RD
+ printf("adding path: ");
+#endif
+ /* direct path to w, add total path */
+ for(i = 0; i < length; i++) {
+ p_el.vertex = (temp_pathp + i)->vertex;
+#ifdef DEBUG_RD
+ printf("%d ", p_el.vertex);
+#endif
+ p_el.next = (*from_transp) + 1;
+ memcpy(*from_transp, &p_el, sizeof(struct path_element));
+ (*from_transp)++;
+ }
+ memcpy(*from_transp, &cur_p_el, sizeof(struct path_element));
+#ifdef DEBUG_RD
+ printf("%d %d\n", cur_p_el.vertex, w);
+#endif
+ (*from_transp)++;
+ }
+ xp = *(hmmp->to_silent_trans_array + v);
+ while(*xp != END) {
+ add_all_from_paths(*xp, w, hmmp, from_transp, temp_pathp, length + 1);
+ xp++;
+ }
+
+}
+
+
+/* Go through transition matrix and get all probabilities that are not 0
+ * into to_trans_array */
+void create_to_trans_array(struct hmm_multi_s *hmmp)
+{
+ int v,w,*xp;
+ int has_to_trans;
+ int array_head_size, array_tail_size;
+ struct path_element **to_trans_array, *to_trans, *temp_path;
+
+ array_tail_size = 0;
+ array_head_size = hmmp->nr_v;
+
+ /* estimate how much space we need to store transitions */
+ array_tail_size = (hmmp->nr_t/hmmp->nr_v + 3 + MAX_GAP_SIZE) * MAX_GAP_SIZE/2 * hmmp->nr_v;
+#ifdef DEBUG_RD
+ printf("array_tail_size = %d\n", array_tail_size);
+#endif
+ to_trans_array = (struct path_element**)
+ (malloc_or_die(array_head_size * sizeof(struct path_element*) +
+ (array_tail_size + hmmp->nr_v) * sizeof(struct path_element)));
+ to_trans = (struct path_element*)(to_trans_array + hmmp->nr_v);
+ hmmp->to_trans_array = to_trans_array;
+
+ /* find all paths and add them to to_trans_array */
+ for(v = 0; v < hmmp->nr_v; v++) /* from-vertex */ {
+ *to_trans_array = to_trans;
+ if(silent_vertex(v, hmmp) == YES) {
+ to_trans->vertex = END;
+ to_trans->next = NULL;
+ to_trans++;
+ to_trans_array++;
+ continue;
+ }
+ for(w = 0; w < hmmp->nr_v; w++) /* to-vertex */ {
+ if(silent_vertex(w,hmmp) == YES) {
+ continue;
+ }
+ temp_path = (struct path_element*)(malloc_or_die(1000 * sizeof(struct path_element)));
+ add_all_to_paths(v, w, hmmp, &to_trans, temp_path, 0);
+ free(temp_path);
+ }
+ to_trans->vertex = END;
+ to_trans->next = NULL;
+ to_trans++;
+ to_trans_array++;
+ }
+
+
+#ifdef DEBUG_RD
+ printf("array_head_size, array_tail_size = %d, %d\n", array_head_size, array_tail_size);
+ dump_to_trans_array(hmmp->nr_v, hmmp->to_trans_array);
+#endif
+}
+
+void add_all_to_paths(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **to_transp, struct path_element *temp_pathp, int length)
+{
+ int i,j;
+ int *xp;
+ struct path_element p_el;
+
+ if(length > MAX_GAP_SIZE) {
+ return;
+ }
+
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+ /* direct path to w, add total path */
+ for(i = 0; i < length; i++) {
+ p_el.vertex = (temp_pathp + i)->vertex;
+ p_el.next = (*to_transp) + 1;
+ memcpy(*to_transp, &p_el, sizeof(struct path_element));
+ (*to_transp)++;
+ }
+ p_el.vertex = w;
+ p_el.next = NULL;
+ memcpy(*to_transp, &p_el, sizeof(struct path_element));
+ (*to_transp)++;
+ }
+
+ xp = *(hmmp->to_silent_trans_array + v);
+ while(*xp != END) {
+ (temp_pathp + length)->vertex = *xp;
+ (temp_pathp + length)->next = NULL;
+ add_all_to_paths(*xp, w, hmmp, to_transp, temp_pathp, length + 1);
+ xp++;
+ }
+
+
+}
+
+
+int silent_vertex(int k, struct hmm_multi_s *hmmp)
+{
+#ifdef DEBUG_RD
+ printf("startnode = %d\n",hmmp->startnode);
+ printf("endnode = %d\n",hmmp->endnode);
+ dump_silent_vertices_multi(hmmp);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+ if(*(hmmp->emissions + get_mtx_index(k,0,hmmp->a_size)) == SILENT && k != hmmp->startnode && k != hmmp->endnode) {
+ return YES;
+ }
+ else {
+ return NO;
+ }
+}
+
+
+
+/* check all probabilities and abort if some prob > 1.0 or < 0.0 */
+void check_probs(struct hmm_multi_s *hmmp)
+{
+ int i,j;
+ double sum;
+ double diff;
+ double prob;
+ /* transition probabilities first */
+ for(i = 0; i < hmmp->nr_v; i++) {
+ sum = 0;
+ for(j = 0; j < hmmp->nr_v; j++) {
+ prob = *((hmmp->transitions) + (i*hmmp->nr_v) + j);
+ if(prob > 1.0 || prob < 0.0) {
+ printf("Illegal probabilities (prob < 0.0 or prob > 1.0)\n");
+ exit(-1);
+ }
+ else {
+ sum += prob;
+ }
+ }
+ diff = 1.0 - sum;
+ /* maybe something about auto correcting the probabilities
+ * will be implemented later */
+ }
+
+ /* then emission probabilities */
+ for(i = 0; i < hmmp->nr_v; i++) {
+ sum = 0;
+ for(j = 0; j < hmmp->a_size; j++) {
+ prob = *((hmmp->emissions) + (i*hmmp->a_size) + j);
+ if((prob > 1.0 || prob < 0.0) && prob != SILENT) {
+ printf("Illegal probabilities (prob < 0.0 or prob > 1.0)\n");
+ exit(-1);
+ }
+ else {
+ sum += prob;
+ }
+ }
+ diff = 1.0 - sum;
+ /* maybe something about auto correcting the probabilities
+ * will be implemented later */
+ }
+
+#ifdef DEBUG_RD
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->log_transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->log_emissions);
+#endif
+
+}
+
+int read_prior_files(int nr_priorfiles, struct emission_dirichlet_s *emission_priorsp,
+ struct hmm_multi_s *hmmp, char* path, FILE *file)
+{
+ int i,j,k;
+ double q_value, alpha_value, alpha_sum, logbeta;
+ char s[2048], *p;
+ char ps[2048];
+ char full_filename[2048];
+ char *file_name;
+ char *pri;
+ FILE *priorfile;
+
+
+ /* find \n sign and remove it */
+ if(fgets(s, 2048, file) != NULL) {
+ p = s;
+ while((p = strstr(p, "\n")) != NULL) {
+ strncpy(p, " ", 1);
+ }
+ }
+ /* read all before first filename */
+ strtok(s," ");
+
+ if((file_name = strtok(NULL, " ")) == NULL) {
+ /* done */
+ return 0;
+ }
+ for(i = 0; i < nr_priorfiles; i++) {
+ /* open the priorfile */
+ if((file_name = strtok(NULL, " ")) == NULL) {
+ /* done */
+ return 0;
+ }
+
+ else {
+ strcpy(full_filename, path);
+ strcat(full_filename, file_name);
+ if((priorfile = fopen(full_filename,"r")) != NULL) {
+ // printf("Opened priorfile %s\n", full_filename);
+ }
+ else {
+ perror(full_filename);
+ return -1;
+ }
+ }
+
+ /* put name in struct */
+ strcpy(emission_priorsp->name, file_name);
+
+ /* put nr of components in struct */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(1) {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ if(strncmp(ps, "START", 5) == 0) {
+ break;
+ }
+ }
+ else {
+ return -1;
+ }
+ }
+ fgets(ps, 2048, priorfile);
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ (emission_priorsp + i)->nr_components = atoi(&ps[0]);
+ (emission_priorsp + i)->alphabet_size = hmmp->a_size;
+
+ /* allocate memory for arrays and matrix to this prior struct */
+ (emission_priorsp + i)->q_values = malloc_or_die((emission_priorsp + i)->nr_components *
+ sizeof(double));
+ (emission_priorsp + i)->alpha_sums = malloc_or_die((emission_priorsp + i)->nr_components *
+ sizeof(double));
+ (emission_priorsp + i)->logbeta_values =
+ malloc_or_die((emission_priorsp + i)->nr_components * sizeof(double));
+ (emission_priorsp + i)->prior_values = malloc_or_die((emission_priorsp + i)->nr_components *
+ hmmp->a_size * sizeof(double));
+
+ for(j = 0; j < (emission_priorsp + i)->nr_components; j++) {
+ /* put q-value in array */
+ while(1) {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ if(*ps == '#' || *ps == '\n') {
+
+ }
+ else {
+ break;
+ }
+ }
+ else {
+ printf("Prior file has incorrect format\n");
+ return -1;
+ }
+ }
+ q_value = atof(&ps[0]);
+ *( (emission_priorsp + i)->q_values + j) = q_value;
+#ifdef DEBUG_RDPRI
+ printf("q_value = %f\n", *(((emission_priorsp + i)->q_values) + j));
+#endif
+
+ /* put alpha-values of this component in matrix */
+ alpha_sum = 0.0;
+ k = 0;
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ pri = &ps[0];
+ for(k = 0; k < hmmp->a_size; k++) {
+ alpha_value = strtod(pri, &pri);
+ alpha_sum += alpha_value;
+ *(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k,hmmp->a_size)) = alpha_value;
+ }
+
+ /* put sum of alphavalues in array */
+ *(((emission_priorsp + i)->alpha_sums) + j) = alpha_sum;
+
+ /* calculate logB(alpha) for this component, store in array */
+ logbeta = 0;
+ for(k = 0; k < hmmp->a_size; k++) {
+ logbeta += lgamma(*(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size)));
+
+#ifdef DEBUG_RDPRI
+ printf("prior_value = %f\n", *(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size)));
+ printf("lgamma_value = %f\n", lgamma(*(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size))));
+#endif
+ }
+ logbeta = logbeta - lgamma(*(((emission_priorsp + i)->alpha_sums) + j));
+ *(((emission_priorsp + i)->logbeta_values) + j) = logbeta;
+ }
+#ifdef DEBUG_RDPRI
+ dump_prior_struct(emission_priorsp + i);
+ exit(0);
+#endif
+
+ /* some cleanup before continuing with next prior file */
+ fclose(priorfile);
+ emission_priorsp++;
+
+ }
+ return 0;
+}
+
+int read_trans_prior_files(int nr_priorfiles, void *emission_priorsp,
+ struct hmm_multi_s *hmmp, FILE *file)
+{
+ char s[2048];
+
+ /* not implemented yet */
+ if(fgets(s, 2048, file) != NULL) {
+ }
+ return 0;
+}
+
+void create_tot_transitions(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ struct path_element *wp;
+ double t_res;
+ double log_t_res, cur_value;
+
+ hmmp->tot_transitions = (double*)malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double));
+ hmmp->max_log_transitions = (double*)malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double));
+ init_float_mtx(hmmp->max_log_transitions, DEFAULT, hmmp->nr_v * hmmp->nr_v);
+
+ for(v = 0; v < hmmp->nr_v; v++) {
+ wp = *(hmmp->from_trans_array + v);
+ while(wp->vertex != END) /* w = from-vertex */ {
+ t_res = 1.0;
+ w = wp->vertex;
+ while(wp->next != NULL) {
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, (wp + 1)->vertex, hmmp->nr_v));
+ /* probability of transition from w to v via silent states */
+ wp++;
+ }
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, v, hmmp->nr_v));
+ /* tot_transitions */
+ *(hmmp->tot_transitions + get_mtx_index(w,v,hmmp->nr_v)) += t_res;
+
+ /* max_log_transitions */
+ if(t_res != 0.0) {
+ log_t_res = log10(t_res);
+ cur_value = *(hmmp->max_log_transitions + get_mtx_index(w,v,hmmp->nr_v));
+ if(cur_value == DEFAULT || log_t_res > cur_value) {
+ *(hmmp->max_log_transitions + get_mtx_index(w,v,hmmp->nr_v)) = log_t_res;
+ }
+ }
+ wp++;
+ }
+ }
+}
+
+void create_tot_trans_arrays(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ struct path_element *p_elp;
+ int malloc_size;
+
+ malloc_size = 0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0.0) {
+ malloc_size++;
+ }
+ }
+ }
+
+ hmmp->tot_to_trans_array = (struct path_element**)malloc_or_die(hmmp->nr_v * sizeof(struct path_element*) +
+ (malloc_size + hmmp->nr_v) * sizeof(struct path_element));
+
+ hmmp->tot_from_trans_array = (struct path_element**)malloc_or_die(hmmp->nr_v * sizeof(struct path_element*) +
+ (malloc_size + hmmp->nr_v) * sizeof(struct path_element));
+
+ /* fill in tot_to_trans_array */
+ p_elp = (struct path_element*)(hmmp->tot_to_trans_array + hmmp->nr_v);
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->tot_to_trans_array + v) = p_elp;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0.0) {
+ p_elp->vertex = w;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+ }
+ p_elp->vertex = END;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+
+
+ /* fill in tot_from_trans_array */
+ p_elp = (struct path_element*)(hmmp->tot_from_trans_array + hmmp->nr_v);
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->tot_from_trans_array + v) = p_elp;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(w,v,hmmp->nr_v)) != 0.0) {
+ p_elp->vertex = w;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+ }
+ p_elp->vertex = END;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+}
+
+
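
Note on the array layout used in readhmm.c above: create_to_silent_trans_array,
create_from_trans_array and create_tot_trans_arrays all request one block that
holds a table of nr_v row pointers followed directly by the terminated rows.
The following is a minimal sketch of that idiom, not code from the package.
The name build_rows, the plain malloc call, and the value chosen for END are
assumptions made for illustration only; the real code uses malloc_or_die and
the END constant from structs.h.

```c
#include <assert.h>
#include <stdlib.h>

#define END -1  /* hypothetical sentinel; the real END is defined in structs.h */

/* Build a pointer table plus END-terminated rows in a single allocation.
 * adj is an n*n row-major matrix; row v lists every w with adj[v*n+w] != 0.
 * The caller releases everything with one free() on the returned pointer. */
int **build_rows(const double *adj, int n)
{
    int v, w;
    size_t count = 0;
    int **table;
    int *values;

    assert(adj != NULL && n > 0);

    /* pass 1: count stored entries, plus one terminator per row */
    for (v = 0; v < n; v++) {
        for (w = 0; w < n; w++) {
            if (adj[v * n + w] != 0.0) {
                count++;
            }
        }
        count++;
    }

    table = malloc((size_t)n * sizeof(int *) + count * sizeof(int));
    if (table == NULL) {
        return NULL;
    }
    values = (int *)(table + n);  /* rows start right after the pointer table */

    /* pass 2: point each table slot at its row and fill the row */
    for (v = 0; v < n; v++) {
        table[v] = values;
        for (w = 0; w < n; w++) {
            if (adj[v * n + w] != 0.0) {
                *values++ = w;
            }
        }
        *values++ = END;
    }
    return table;
}
```

Because the rows live inside the same block as the table, a row is walked with
a plain pointer until END is reached, and the whole structure is freed with a
single free() on the table pointer.
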
diff --git a/modhmm0.92b/readhmm.c.flc b/modhmm0.92b/readhmm.c.flc
new file mode 100644
index 0000000..4b1ba10
--- /dev/null
+++ b/modhmm0.92b/readhmm.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17032 . 19407)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
diff --git a/modhmm0.92b/readhmm_multialpha.c b/modhmm0.92b/readhmm_multialpha.c
new file mode 100644
index 0000000..90ea8f7
--- /dev/null
+++ b/modhmm0.92b/readhmm_multialpha.c
@@ -0,0 +1,1793 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <math.h>
+
+
+#include "structs.h" /* data structures etc */
+#include "funcs.h" /* function header */
+
+#define MAX_LINE 500
+
+//#define DEBUG_RD
+//#define DEBUG_RDPRI
+
+extern int verbose;
+
+int read_module_multi(char*, FILE*, struct hmm_multi_s*, struct module_multi_s*, int*, int*);
+void check_probs_multi(struct hmm_multi_s*);
+void create_to_silent_trans_array_multi(struct hmm_multi_s *hmmp);
+void create_from_silent_trans_array_multi(struct hmm_multi_s *hmmp);
+void create_from_trans_array_multi(struct hmm_multi_s*);
+void create_to_trans_array_multi(struct hmm_multi_s*);
+void create_tot_transitions_multi(struct hmm_multi_s*);
+void create_tot_trans_arrays_multi(struct hmm_multi_s *hmmp);
+void add_all_from_paths_multi(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp, int length);
+void add_all_to_paths_multi(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp, int length);
+int read_prior_files_multi(int, struct emission_dirichlet_s*, int, FILE*);
+int read_trans_prior_files_multi(int, void*, struct hmm_multi_s*, FILE*);
+int silent_vertex_multi(int v, struct hmm_multi_s *hmmp);
+
+
+int readhmm_check(FILE *hmmfile) {
+ char s[MAX_LINE];
+ int filetype;
+
+ filetype = SINGLE_HMM;
+ while(fgets(s, MAX_LINE, hmmfile) != NULL) {
+ if(strncmp(s, "NR OF ALPHABETS:",16) == 0) {
+ filetype = MULTI_HMM;
+ break;
+ }
+ else if(strncmp(s, "ALPHABET LENGTH:", 16) == 0) {
+ filetype = SINGLE_HMM;
+ break;
+ }
+ }
+ rewind(hmmfile);
+ return filetype;
+}
+
+
+void transform_singlehmmfile_to_multi(FILE *hmmfile, FILE *outfile)
+{
+ char s[MAX_LINE], s_2[MAX_LINE];
+ if(outfile == NULL || hmmfile == NULL) {
+ printf("Could not transform singlehmm to multi\n");
+ exit(0);
+ }
+
+ while(1) {
+ if(fgets(s, MAX_LINE, hmmfile) != NULL) {
+ if(strncmp(s, "ALPHABET:", 9) == 0) {
+ if(fputs("NR OF ALPHABETS: 1\n", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs("ALPHABET 1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[9], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "ALPHABET LENGTH:", 16) == 0) {
+ if(fputs("ALPHABET LENGTH 1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[16], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "NR OF EMISSION PRIORFILES:", 26) == 0) {
+ if(fputs("NR OF EMISSION PRIORFILES 1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[26], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "EMISSION PRIORFILES:", 20) == 0) {
+ if(fputs("EMISSION PRIORFILES_1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[20], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "Emission prior file:", 20) == 0) {
+ if(fputs("Emission prior file 1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[20], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "Emission prior scaler:", 22) == 0) {
+ if(fputs("Emission prior scaler 1:", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[22], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "Nr emissions =", 14) == 0) {
+ if(fputs("Nr emissions 1 =", outfile) == EOF) {
+ perror("");
+ }
+ if(fputs(&s[14], outfile) == EOF) {
+ perror("");
+ }
+ }
+ else if(strncmp(s, "Emission probabilities", 22) == 0) {
+ if(fputs("Emission probabilities 1\n", outfile) == EOF) {
+ perror("");
+ }
+ }
+ else {
+ if(fputs(s, outfile) == EOF) {
+ perror("");
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+}
+
+void copy_hmm_struct(struct hmm_multi_s *hmmp, struct hmm_multi_s *retrain_hmmp)
+{
+
+ memcpy(retrain_hmmp->name, hmmp->name, 100); /* name of the HMM */
+ retrain_hmmp->constr_t = NULL; /* time of construction */
+ retrain_hmmp->nr_alphabets = hmmp->nr_alphabets;
+ memcpy(retrain_hmmp->alphabet, hmmp->alphabet, 1000); /* the alphabet */
+ memcpy(retrain_hmmp->alphabet_2, hmmp->alphabet_2, 1000); /* the alphabet */
+ memcpy(retrain_hmmp->alphabet_3, hmmp->alphabet_3, 1000); /* the alphabet */
+ memcpy(retrain_hmmp->alphabet_4, hmmp->alphabet_4, 1000); /* the alphabet */
+ retrain_hmmp->alphabet_type = hmmp->alphabet_type;
+ retrain_hmmp->alphabet_type_2 = hmmp->alphabet_type_2;
+ retrain_hmmp->alphabet_type_3 = hmmp->alphabet_type_3;
+ retrain_hmmp->alphabet_type_4 = hmmp->alphabet_type_4;
+ retrain_hmmp->a_size = hmmp->a_size;
+ retrain_hmmp->a_size_2 = hmmp->a_size_2;
+ retrain_hmmp->a_size_3 = hmmp->a_size_3;
+ retrain_hmmp->a_size_4 = hmmp->a_size_4;
+ retrain_hmmp->nr_m = hmmp->nr_m;
+ retrain_hmmp->nr_v = hmmp->nr_v;
+ retrain_hmmp->nr_t = hmmp->nr_t;
+ retrain_hmmp->nr_d = hmmp->nr_d;
+ retrain_hmmp->nr_dt = hmmp->nr_dt;
+ retrain_hmmp->nr_ttg = hmmp->nr_ttg;
+ retrain_hmmp->nr_tt = hmmp->nr_tt;
+ retrain_hmmp->nr_ed = hmmp->nr_ed;
+ retrain_hmmp->nr_ed_2 = hmmp->nr_ed_2;
+ retrain_hmmp->nr_ed_3 = hmmp->nr_ed_3;
+ retrain_hmmp->nr_ed_4 = hmmp->nr_ed_4;
+ retrain_hmmp->nr_tp = hmmp->nr_tp;
+ retrain_hmmp->startnode = hmmp->startnode;
+ retrain_hmmp->endnode = hmmp->endnode;
+
+ retrain_hmmp->replacement_letters = hmmp->replacement_letters; /* point to same struct */
+
+
+ retrain_hmmp->modules = (struct module_multi_s**)(malloc_or_die(hmmp->nr_m * sizeof(struct module_multi_s*) +
+ hmmp->nr_m * sizeof(struct module_multi_s)));
+ memcpy(retrain_hmmp->modules, hmmp->modules, hmmp->nr_m * sizeof(struct module_multi_s*) +
+ hmmp->nr_m * sizeof(struct module_multi_s));
+
+
+ retrain_hmmp->silent_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ retrain_hmmp->locked_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ memcpy(retrain_hmmp->silent_vertices, hmmp->silent_vertices, (hmmp->nr_v + 1) * sizeof(int));
+ memcpy(retrain_hmmp->locked_vertices, hmmp->locked_vertices, (hmmp->nr_v + 1) * sizeof(int));
+
+ retrain_hmmp->vertex_labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+ memcpy(retrain_hmmp->vertex_labels, hmmp->vertex_labels, hmmp->nr_v * sizeof(char));
+ retrain_hmmp->labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+ memcpy(retrain_hmmp->labels, hmmp->labels, hmmp->nr_v * sizeof(char));
+ retrain_hmmp->nr_labels = hmmp->nr_labels;
+
+
+
+ retrain_hmmp->vertex_trans_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->vertex_trans_prior_scalers, hmmp->vertex_trans_prior_scalers, hmmp->nr_v * sizeof(double));
+
+ retrain_hmmp->vertex_emiss_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->vertex_emiss_prior_scalers, hmmp->vertex_emiss_prior_scalers, hmmp->nr_v * sizeof(double));
+ if(hmmp->nr_alphabets > 1) {
+ retrain_hmmp->vertex_emiss_prior_scalers_2 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->vertex_emiss_prior_scalers_2, hmmp->vertex_emiss_prior_scalers_2, hmmp->nr_v * sizeof(double));
+
+ }
+ if(hmmp->nr_alphabets > 2) {
+ retrain_hmmp->vertex_emiss_prior_scalers_3 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->vertex_emiss_prior_scalers_3, hmmp->vertex_emiss_prior_scalers_3, hmmp->nr_v * sizeof(double));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ retrain_hmmp->vertex_emiss_prior_scalers_4 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->vertex_emiss_prior_scalers_4, hmmp->vertex_emiss_prior_scalers_4, hmmp->nr_v * sizeof(double));
+ }
+
+ retrain_hmmp->transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ retrain_hmmp->log_transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ memcpy(retrain_hmmp->transitions, hmmp->transitions, hmmp->nr_v * hmmp->nr_v * sizeof(double));
+ memcpy(retrain_hmmp->log_transitions, hmmp->log_transitions, hmmp->nr_v * hmmp->nr_v * sizeof(double));
+
+ retrain_hmmp->emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ retrain_hmmp->log_emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ memcpy(retrain_hmmp->emissions, hmmp->emissions, hmmp->nr_v * hmmp->a_size * sizeof(double));
+ memcpy(retrain_hmmp->log_emissions, hmmp->log_emissions, hmmp->nr_v * hmmp->a_size * sizeof(double));
+ if(hmmp->nr_alphabets > 1) {
+ retrain_hmmp->emissions_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ retrain_hmmp->log_emissions_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ memcpy(retrain_hmmp->emissions_2, hmmp->emissions_2, hmmp->nr_v * hmmp->a_size_2 * sizeof(double));
+ memcpy(retrain_hmmp->log_emissions_2, hmmp->log_emissions_2, hmmp->nr_v * hmmp->a_size_2 * sizeof(double));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ retrain_hmmp->emissions_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ retrain_hmmp->log_emissions_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ memcpy(retrain_hmmp->emissions_3, hmmp->emissions_3, hmmp->nr_v * hmmp->a_size_3 * sizeof(double));
+ memcpy(retrain_hmmp->log_emissions_3, hmmp->log_emissions_3, hmmp->nr_v * hmmp->a_size_3 * sizeof(double));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ retrain_hmmp->emissions_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ retrain_hmmp->log_emissions_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ memcpy(retrain_hmmp->emissions_4, hmmp->emissions_4, hmmp->nr_v * hmmp->a_size_4 * sizeof(double));
+ memcpy(retrain_hmmp->log_emissions_4, hmmp->log_emissions_4, hmmp->nr_v * hmmp->a_size_4 * sizeof(double));
+ }
+
+
+
+ /* create to_silent_trans_array */
+ create_to_silent_trans_array_multi(retrain_hmmp);
+
+ /* create from_trans_array */
+ create_from_trans_array_multi(retrain_hmmp);
+
+ /* create to_trans_array */
+ create_to_trans_array_multi(retrain_hmmp);
+
+ /* create tot_transitions */
+ create_tot_transitions_multi(retrain_hmmp);
+
+ /* create tot_to_trans_array and tot_from_trans_arrays */
+ create_tot_trans_arrays_multi(retrain_hmmp);
+
+
+
+
+ /* data structures */
+
+ retrain_hmmp->distrib_groups = (int*)(malloc_or_die((hmmp->nr_d + hmmp->nr_v) * sizeof(int)));
+ memcpy(retrain_hmmp->distrib_groups, hmmp->distrib_groups,(hmmp->nr_d + hmmp->nr_v) * sizeof(int));
+
+ retrain_hmmp->trans_tie_groups = (int*)(malloc_or_die((hmmp->nr_t + hmmp->nr_ttg) * sizeof(struct transition_s)));
+ memcpy(retrain_hmmp->trans_tie_groups, hmmp->trans_tie_groups,(hmmp->nr_t + hmmp->nr_ttg) * sizeof(struct transition_s));
+
+
+ retrain_hmmp->emission_dirichlets = hmmp->emission_dirichlets;
+ retrain_hmmp->ed_ps = hmmp->ed_ps;
+ retrain_hmmp->emission_dirichlets_2 = hmmp->emission_dirichlets_2;
+ retrain_hmmp->ed_ps_2 = hmmp->ed_ps_2;
+ retrain_hmmp->emission_dirichlets_3 = hmmp->emission_dirichlets_3;
+ retrain_hmmp->ed_ps_3 = hmmp->ed_ps_3;
+ retrain_hmmp->emission_dirichlets_4 = hmmp->emission_dirichlets_4;
+ retrain_hmmp->ed_ps_4 = hmmp->ed_ps_4;
+
+ retrain_hmmp->subst_mtx = hmmp->subst_mtx;
+ retrain_hmmp->subst_mtx_2 = hmmp->subst_mtx_2;
+ retrain_hmmp->subst_mtx_3 = hmmp->subst_mtx_3;
+ retrain_hmmp->subst_mtx_4 = hmmp->subst_mtx_4;
+
+}
+
+int readhmm_multialpha(FILE *file, struct hmm_multi_s *hmmp)
+{
+ char s[MAX_LINE], *c;
+ int i,j,k;
+ int res;
+ int **from_trans_array, **to_trans_array;
+ int *from_trans, *to_trans, *cur;
+ struct module_multi_s *module;
+ struct emission_dirichlet_s *emission_priorsp, *emission_priorsp_2, *emission_priorsp_3, *emission_priorsp_4;
+ void *transition_priorsp;
+ int nr_priorfiles, nr_trans_priorfiles;
+ int silent_counter, locked_counter;
+ char *nr_trans_tiesp, *nr_distrib_tiesp;
+ struct transition_s *trans_ties;
+ struct transition_s trans;
+
+ if(verbose == YES) {
+ printf("reading hmm ");
+ }
+ /* read header */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* name */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->name, &s[6]);
+ if(verbose == YES) {
+ printf("%s ... ", hmmp->name);
+ fflush(stdout);
+ }
+ }
+
+
+ /* creation time */
+ fgets(s, MAX_LINE, file);
+ /* nr of alphabets */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_alphabets = atoi(&s[17]);
+ }
+
+ if(hmmp->nr_alphabets < 1 || hmmp->nr_alphabets > 4) {
+ printf("Wrong nr of alphabets\n");
+ exit(0);
+ }
+ /* alphabet 1 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet, &s[12]);
+ }
+
+ /* set alphabet to be = DISCRETE as default, this should be reset when reading the sequences */
+ hmmp->alphabet_type = DISCRETE;
+
+ /* alphabet length 1 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size = atoi(&s[19]);
+ }
+ if(hmmp->nr_alphabets > 1) {
+ /* alphabet 2 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet_2, &s[12]);
+ }
+
+ /* set alphabet to be = DISCRETE as default, this should be reset when reading the sequences */
+ hmmp->alphabet_type_2 = DISCRETE;
+
+ /* alphabet length 2 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size_2 = atoi(&s[19]);
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ /* alphabet 3 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet_3, &s[12]);
+ }
+
+ /* set alphabet to be = DISCRETE as default, this should be reset when reading the sequences */
+ hmmp->alphabet_type_3 = DISCRETE;
+
+ /* alphabet length 3 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size_3 = atoi(&s[19]);
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ /* alphabet 4 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(hmmp->alphabet_4, &s[12]);
+ }
+
+ /* set alphabet to be = DISCRETE as default, this should be reset when reading the sequences */
+ hmmp->alphabet_type_4 = DISCRETE;
+
+ /* alphabet length 4 */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->a_size_4 = atoi(&s[19]);
+ }
+ }
+ /* nr of modules */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_m = atoi(&s[15]);
+ }
+
+
+ /* nr of vertices */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_v = atoi(&s[16]);
+ hmmp->transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ hmmp->log_transitions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_transitions, DEFAULT, hmmp->nr_v * hmmp->nr_v);
+ hmmp->emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ hmmp->log_emissions = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions, DEFAULT, hmmp->nr_v * hmmp->a_size);
+ if(hmmp->nr_alphabets > 1) {
+ hmmp->emissions_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ hmmp->log_emissions_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions_2, DEFAULT, hmmp->nr_v * hmmp->a_size_2);
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ hmmp->emissions_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ hmmp->log_emissions_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions_3, DEFAULT, hmmp->nr_v * hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ hmmp->emissions_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ hmmp->log_emissions_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ init_float_mtx(hmmp->log_emissions_4, DEFAULT, hmmp->nr_v * hmmp->a_size_4);
+ }
+
+ hmmp->modules = (struct module_multi_s**)(malloc_or_die(hmmp->nr_m * sizeof(struct module_multi_s*) +
+ hmmp->nr_m * sizeof(struct module_multi_s)));
+
+ module = (struct module_multi_s*)(hmmp->modules + hmmp->nr_m);
+ hmmp->silent_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+ hmmp->locked_vertices = (int*)(malloc_or_die((hmmp->nr_v + 1) * sizeof(int)));
+
+ for(i = 0; i < hmmp->nr_v; i++) {
+ *(hmmp->locked_vertices + i) = NO;
+ }
+ hmmp->vertex_labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+ hmmp->vertex_trans_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+
+ hmmp->vertex_emiss_prior_scalers = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ if(hmmp->nr_alphabets > 1) {
+ hmmp->vertex_emiss_prior_scalers_2 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ hmmp->vertex_emiss_prior_scalers_3 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ hmmp->vertex_emiss_prior_scalers_4 = (double*)(malloc_or_die(hmmp->nr_v * sizeof(double)));
+ }
+ }
+ /* nr of transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_t = atoi(&s[19]);
+ }
+ /* nr of distribution groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_d = atoi(&s[27]);
+ hmmp->distrib_groups = (int*)(malloc_or_die((hmmp->nr_d + hmmp->nr_v) * sizeof(int)));
+ }
+
+ /* nr of trans tie groups */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ hmmp->nr_ttg = atoi(&s[29]);
+ hmmp->trans_tie_groups = (int*)(malloc_or_die((hmmp->nr_t + hmmp->nr_ttg) * sizeof(struct transition_s)));
+ }
+
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[29]);
+ hmmp->nr_ed = nr_priorfiles;
+ emission_priorsp = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets = emission_priorsp;
+ hmmp->ed_ps = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files_multi(nr_priorfiles, emission_priorsp, hmmp->a_size, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+
+ if(hmmp->nr_alphabets > 1) {
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[29]);
+ hmmp->nr_ed_2 = nr_priorfiles;
+ emission_priorsp_2 = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets_2 = emission_priorsp_2;
+ hmmp->ed_ps_2 = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files_multi(nr_priorfiles, emission_priorsp_2, hmmp->a_size_2, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[29]);
+ hmmp->nr_ed_3 = nr_priorfiles;
+ emission_priorsp_3 = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets_3 = emission_priorsp_3;
+ hmmp->ed_ps_3 = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files_multi(nr_priorfiles, emission_priorsp_3, hmmp->a_size_3, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ /* nr of emission priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_priorfiles = atoi(&s[29]);
+ hmmp->nr_ed_4 = nr_priorfiles;
+ emission_priorsp_4 = malloc_or_die(nr_priorfiles * sizeof(struct emission_dirichlet_s));
+ hmmp->emission_dirichlets_4 = emission_priorsp_4;
+ hmmp->ed_ps_4 = malloc_or_die(hmmp->nr_v * sizeof(struct emission_dirichlet_s*));
+ }
+ /* read the emission priorfiles */
+ if(read_prior_files_multi(nr_priorfiles, emission_priorsp_4, hmmp->a_size_4, file) < 0) {
+ printf("Could not read emission priorfiles\n");
+ exit(-1);
+ }
+ }
+
+ /* nr of transition priorfiles */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_trans_priorfiles = atoi(&s[29]);
+ transition_priorsp = NULL;
+ /* not implemented yet */
+ }
+ /* read the transition priorfiles */
+ if(read_trans_prior_files_multi(nr_trans_priorfiles, transition_priorsp, hmmp, file) < 0) {
+ printf("Could not read transition priorfiles\n");
+ exit(-1);
+ }
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* reads ****************Modules*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* read the modules */
+ silent_counter = 0;
+ locked_counter = 0;
+ for(i = 0; i < hmmp->nr_m; i++) {
+ *(hmmp->modules + i) = module;
+ if((res = read_module_multi(s, file, hmmp, module, &silent_counter, &locked_counter)) < 0) {
+ printf("Could not read modules\n");
+ exit(-1);
+ }
+ module++;
+ }
+
+ *(hmmp->silent_vertices + silent_counter) = END;
+ *(hmmp->locked_vertices + hmmp->nr_v) = END;
+#ifdef DEBUG_RD
+ //dump_locked_vertices(hmmp);
+ //dump_silent_vertices(hmmp);
+ //dump_multi_modules(hmmp);
+#endif
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Emission distribution groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+
+ /* read the distribution groups */
+ cur = hmmp->distrib_groups;
+ for(i = 0; i < hmmp->nr_d; i++) {
+ j = 0;
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ *cur = atoi(&s[j]);
+ cur++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ *cur = END;
+ cur++;
+ }
+ }
+
+ /* empty row */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+ /* reads ****************Transition tie groups*****************/
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+ /* read the trans tie groups */
+ trans_ties = hmmp->trans_tie_groups;
+ for(i = 0; i < hmmp->nr_ttg; i++) {
+ if(fgets(s, MAX_LINE, file) != NULL && s[0] != '\n') {
+ j = 0;
+ while(1) {
+ if(s[j] == ':') {
+ break;
+ }
+ j++;
+ }
+ j++;
+ j++;
+ while(1) {
+ trans.from_v = atoi(&s[j]);
+ while(s[j] != '>') {
+ j++;
+ }
+ j++;
+ trans.to_v = atoi(&s[j]);
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ while(s[j] != ' ' && s[j] != '\n') {
+ j++;
+ }
+ while(s[j] == ' ') {
+ j++;
+ }
+ if(s[j] == '\n') {
+ break;
+ }
+ }
+ trans.to_v = END;
+ trans.from_v = END;
+ memcpy(trans_ties, &trans, sizeof(struct transition_s));
+ trans_ties++;
+ }
+ else {
+ hmmp->nr_ttg = i;
+ break;
+ }
+ }
+#ifdef DEBUG_RD
+ dump_distrib_groups(hmmp->distrib_groups, hmmp->nr_d);
+ dump_trans_tie_groups(hmmp->trans_tie_groups, hmmp->nr_ttg);
+#endif
+ /* create to_silent_trans_array */
+ create_to_silent_trans_array_multi(hmmp);
+
+ /* create from_trans_array */
+ create_from_trans_array_multi(hmmp);
+
+ /* create to_trans_array */
+ create_to_trans_array_multi(hmmp);
+
+ /* create tot_transitions */
+ create_tot_transitions_multi(hmmp);
+
+ /* create tot_to_trans_array and tot_from_trans_array */
+ create_tot_trans_arrays_multi(hmmp);
+
+ /* get the set of labels and the number of labels */
+ get_set_of_labels_multi(hmmp);
+
+#ifdef DEBUG_RD
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ printf("hmmp->emission_dirichlets = %p\n", (void*)hmmp->emission_dirichlets);
+ for(i = 0; i < hmmp->nr_v; i++) {
+ printf("hmmp->ed_ps for vertex %d = %p\n", i, (void*)*(hmmp->ed_ps + i));
+ }
+#endif
+
+ /* make sure all probabilities are legal*/
+ //check_probs_multi(hmmp);
+
+ if(verbose == YES) {
+ printf("done\n");
+ }
+}
+
+
+/************************read_module_multi *************************************/
+int read_module_multi(char *s, FILE *file, struct hmm_multi_s *hmmp, struct module_multi_s *modulep,
+ int *silent_counter, int *locked_counter)
+{
+ int nr_v, nr_t, nr_e, nr_e2, nr_e3, nr_e4, nr_et;
+ int i,j,k;
+ int from_v, to_v;
+ double prob, log_prob;
+ char type[50];
+ char prifile_name[500], prifile_name_2[500], prifile_name_3[500], prifile_name_4[500];
+ char *p, *probp;
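+ struct emission_dirichlet_s *priorsp = NULL, *priorsp_2 = NULL, *priorsp_3 = NULL, *priorsp_4 = NULL;
+ int silent_vertex = NO; /* start/end vertices never set this, so give it a defined value */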
+
+ //i = 0;
+
+ /* module name */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(modulep->name, &s[8]);
+#ifdef DEBUG_RD
+ printf("module name %s", s);
+#endif
+ }
+ /* module type */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(type, &s[6]);
+ if(strncmp(type, "Singlenode", 10) == 0) {
+ modulep->type = SINGLENODE;
+ }
+ else if(strncmp(type, "Cluster", 7) == 0) {
+ modulep->type = CLUSTER;
+ }
+ else if(strncmp(type, "Forward_std", 11) == 0) {
+ modulep->type = FORWARD_STD;
+ }
+ else if(strncmp(type, "Forward_alt", 11) == 0) {
+ modulep->type = FORWARD_ALT;
+ }
+ else if(strncmp(type, "Singleloop", 10) == 0) {
+ modulep->type = SINGLELOOP;
+ }
+ else if(strncmp(type, "Profile7", 8) == 0) {
+ modulep->type = PROFILE7;
+ }
+ else if(strncmp(type, "Profile9", 8) == 0) {
+ modulep->type = PROFILE9;
+ }
+ else {
+ printf("Error: module is of unknown type\n");
+ exit(-1);
+ }
+
+ }
+
+ /* nr vertices */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ modulep->nr_v = atoi(&s[12]);
+ modulep->vertices = (int*)(malloc_or_die(modulep->nr_v * sizeof(int)));
+ }
+ /* emission prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(prifile_name, (&s[23]));
+ if((p = strstr(prifile_name, "\n")) != NULL) {
+ *p = '\0';
+ }
+ if(strncmp(prifile_name, "null", 4) == 0) {
+ strcpy(modulep->priorfile_name, "null");
+ priorsp = NULL;
+ }
+ else {
+ strcpy(modulep->priorfile_name, prifile_name);
+ strcat(modulep->priorfile_name, "\0");
+ for(i = 0; i < hmmp->nr_ed; i++) {
+ priorsp = (hmmp->emission_dirichlets + i);
+ if((strncmp(prifile_name, priorsp->name, 200)) == 0) {
+ /* keep this priorsp */
+ break;
+ }
+ else {
+#ifdef DEBUG_RD
+ printf("prifile_name = %s\n", prifile_name);
+ printf("priorsp->name = %s\n", priorsp->name);
+#endif
+ }
+ if(i == hmmp->nr_ed-1) /* no name equals priorfile_name */{
+ printf("Couldn't find emission priorfile '%s'\n", prifile_name);
+ exit(-1);
+ }
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 1) {
+ /* emission prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(prifile_name_2, (&s[23]));
+ if((p = strstr(prifile_name_2, "\n")) != NULL) {
+ *p = '\0';
+ }
+ if(strncmp(prifile_name_2, "null", 4) == 0) {
+ strcpy(modulep->priorfile_name_2, "null");
+ priorsp_2 = NULL;
+ }
+ else {
+ strcpy(modulep->priorfile_name_2, prifile_name_2);
+ strcat(modulep->priorfile_name_2, "\0");
+ for(i = 0; i < hmmp->nr_ed_2; i++) {
+ priorsp_2 = (hmmp->emission_dirichlets_2 + i);
+ if((strncmp(prifile_name_2, priorsp_2->name, 200)) == 0) {
+ /* keep this priorsp */
+ break;
+ }
+ else {
+#ifdef DEBUG_RD
+ printf("prifile_name_2 = %s\n", prifile_name_2);
+ printf("priorsp_2->name = %s\n", priorsp_2->name);
+#endif
+ }
+ if(i == hmmp->nr_ed_2 - 1) /* no name equals priorfile_name */{
+ printf("Couldn't find emission priorfile '%s'\n", prifile_name_2);
+ exit(-1);
+ }
+ }
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ /* emission prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(prifile_name_3, (&s[23]));
+ if((p = strstr(prifile_name_3, "\n")) != NULL) {
+ *p = '\0';
+ }
+ if(strncmp(prifile_name_3, "null", 4) == 0) {
+ strcpy(modulep->priorfile_name_3, "null");
+ priorsp_3 = NULL;
+ }
+ else {
+ strcpy(modulep->priorfile_name_3, prifile_name_3);
+ strcat(modulep->priorfile_name_3, "\0");
+ for(i = 0; i < hmmp->nr_ed_3; i++) {
+ priorsp_3 = (hmmp->emission_dirichlets_3 + i);
+ if((strncmp(prifile_name_3, priorsp_3->name, 200)) == 0) {
+ /* keep this priorsp */
+ break;
+ }
+ else {
+#ifdef DEBUG_RD
+ printf("prifile_name_3 = %s\n", prifile_name_3);
+ printf("priorsp_3->name = %s\n", priorsp_3->name);
+#endif
+ }
+ if(i == hmmp->nr_ed_3 - 1) /* no name equals priorfile_name */{
+ printf("Couldn't find emission priorfile '%s'\n", prifile_name_3);
+ exit(-1);
+ }
+ }
+ }
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ /* emission prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ strcpy(prifile_name_4, (&s[23]));
+ if((p = strstr(prifile_name_4, "\n")) != NULL) {
+ *p = '\0';
+ }
+ if(strncmp(prifile_name_4, "null", 4) == 0) {
+ strcpy(modulep->priorfile_name_4, "null");
+ priorsp_4 = NULL;
+ }
+ else {
+ strcpy(modulep->priorfile_name_4, prifile_name_4);
+ strcat(modulep->priorfile_name_4, "\0");
+ for(i = 0; i < hmmp->nr_ed_4; i++) {
+ priorsp_4 = (hmmp->emission_dirichlets_4 + i);
+ if((strncmp(prifile_name_4, priorsp_4->name, 200)) == 0) {
+ /* keep this priorsp */
+ break;
+ }
+ else {
+#ifdef DEBUG_RD
+ printf("prifile_name_4 = %s\n", prifile_name_4);
+ printf("priorsp_4->name = %s\n", priorsp_4->name);
+#endif
+ }
+ if(i == hmmp->nr_ed_4 - 1) /* no name equals priorfile_name */{
+ printf("Couldn't find emission priorfile '%s'\n", prifile_name_4);
+ exit(-1);
+ }
+ }
+ }
+ }
+ }
+
+ /* transition prior file */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ /* not implemented yet */
+ }
+ /* empty line */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ }
+
+#ifdef DEBUG_RD
+ printf("modulep->nr_v = %d\n", modulep->nr_v);
+#endif
+/* loop over the vertices */
+ for(i = 0; i < modulep->nr_v; i++) {
+ /* Vertex nr */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+#ifdef DEBUG_RD
+ printf("Vertex nr: %s", s);
+#endif
+ from_v = atoi(&s[7]);
+ *(modulep->vertices + i) = from_v;
+ /* connect this vertex to its priorfile */
+ *(hmmp->ed_ps + from_v) = priorsp;
+ if(hmmp->nr_alphabets > 1) {
+ *(hmmp->ed_ps_2 + from_v) = priorsp_2;
+ }
+ if(hmmp->nr_alphabets > 2) {
+ *(hmmp->ed_ps_3 + from_v) = priorsp_3;
+ }
+ if(hmmp->nr_alphabets > 3) {
+ *(hmmp->ed_ps_4 + from_v) = priorsp_4;
+ }
+ }
+ else {
+ printf("Error in hmm spec\n");
+ exit(-1);
+ }
+ /* Vertex type */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+#ifdef DEBUG_RD
+ printf("Vertex type: %s", s);
+#endif
+ strcpy(type, &s[13]);
+ if(modulep->type == PROFILE7 || modulep->type == PROFILE9) {
+ modulep->v_type = PROFILEV;
+ }
+ if(strncmp(type, "standard", 8) == 0) {
+ if(modulep->type != PROFILE7 && modulep->type != PROFILE9) {
+ modulep->v_type = STANDARDV;
+ silent_vertex = NO;
+ }
+ }
+ else if(strncmp(type, "silent", 6) == 0) {
+ silent_vertex = YES;
+ if(modulep->type != PROFILE7 && modulep->type != PROFILE9) {
+ modulep->v_type = SILENTV;
+ }
+ *(hmmp->silent_vertices + *silent_counter) = from_v;
+ *silent_counter = *silent_counter + 1;
+ }
+ else if(strncmp(type, "locked", 6) == 0) {
+ modulep->v_type = LOCKEDV;
+ *(hmmp->locked_vertices + from_v) = YES;
+ *locked_counter = *locked_counter + 1;
+ silent_vertex = NO;
+ }
+ else if(strncmp(type, "start", 5) == 0) {
+ modulep->v_type = STARTV;
+ hmmp->startnode = from_v;
+ }
+ else if(strncmp(type, "end", 3) == 0) {
+ modulep->v_type = ENDV;
+ hmmp->endnode = from_v;
+ }
+ else {
+ printf("Error: vertex type is undefined\n");
+ printf("vertex type = %s\n", type);
+ printf("from_v = %d\n", from_v);
+ exit(-1);
+ }
+ }
+ /* Vertex label */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_labels + from_v) = s[14];
+ }
+ /* transition prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_trans_prior_scalers + from_v) = atof(&(s[25]));
+ }
+ /* emission prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_emiss_prior_scalers + from_v) = atof(&(s[25]));
+ }
+ if(hmmp->nr_alphabets > 1) {
+ /* emission prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_emiss_prior_scalers_2 + from_v) = atof(&(s[25]));
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ /* emission prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_emiss_prior_scalers_3 + from_v) = atof(&(s[25]));
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ /* emission prior scaler */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ *(hmmp->vertex_emiss_prior_scalers_4 + from_v) = atof(&(s[25]));
+ }
+ }
+ /* Nr transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_t = atoi(&s[17]);
+ }
+ /* Nr end transitions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_et = atoi(&s[21]);
+ }
+ /* Nr emissions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_e = atoi(&s[17]);
+ }
+ if(hmmp->nr_alphabets > 1) {
+ /* Nr emissions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_e2 = atoi(&s[17]);
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ /* Nr emissions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_e3 = atoi(&s[17]);
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ /* Nr emissions */
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ nr_e4 = atoi(&s[17]);
+ }
+ }
+ /* read transition probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_t; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ to_v = atoi(&s[8]);
+ if(to_v < 10 ) {
+ prob = (double)(atof(&s[11]));
+ }
+ else if(to_v < 100) {
+ prob = (double)(atof(&s[12]));
+ }
+ else if(to_v < 1000) {
+ prob = (double)(atof(&s[13]));
+ }
+ else if(to_v < 10000) {
+ prob = (double)(atof(&s[14]));
+ }
+ else {
+ printf("Sorry, reader cannot handle HMMs with more than 10000 states\n");
+ exit(-1);
+ }
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+#ifdef DEBUG_RD
+ printf("prob from %d to %d = %f\n", from_v, to_v, prob);
+#endif
+ *(hmmp->transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = prob;
+ *(hmmp->log_transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = log_prob;
+ }
+ }
+
+ /* read end transition probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_et; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ to_v = atoi(&s[8]);
+ if(to_v < 10 ) {
+ prob = (double)(atof(&s[11]));
+ }
+ else if(to_v < 100) {
+ prob = (double)(atof(&s[12]));
+ }
+ else if(to_v < 1000) {
+ prob = (double)(atof(&s[13]));
+ }
+ else if(to_v < 10000) {
+ prob = (double)(atof(&s[14]));
+ }
+ else {
+ printf("Sorry, reader cannot handle HMMs with more than 10000 states\n");
+ exit(-1);
+ }
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+#ifdef DEBUG_RD
+ printf("prob from %d to %d = %f\n", from_v, to_v, prob);
+#endif
+ *(hmmp->transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = prob;
+ *(hmmp->log_transitions + get_mtx_index(from_v, to_v, hmmp->nr_v)) = log_prob;
+ }
+ }
+ /* read emission probabilities */
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_e; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+#ifdef DEBUG_RD
+ printf("%s", s);
+#endif
+ k = 0;
+ while(s[k] != ' ') {
+ k++;
+ }
+ if(k > 7) {
+ printf("Cannot read hmm file, please check hmm specification\n");
+ }
+ k++;
+ prob = (double)(atof(&s[k]));
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+ if(silent_vertex == YES) {
+ prob = SILENT;
+ log_prob = SILENT;
+ }
+ *(hmmp->emissions + get_mtx_index(from_v, j, hmmp->a_size)) = prob;
+ *(hmmp->log_emissions + get_mtx_index(from_v, j, hmmp->a_size)) = log_prob;
+ }
+ }
+ fgets(s, MAX_LINE, file);
+
+ if(hmmp->nr_alphabets > 1) {
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_e2; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ k = 0;
+ while(s[k] != ' ') {
+ k++;
+ }
+ if(k > 7) {
+ printf("Cannot read hmm file, please check hmm specification\n");
+ }
+ k++;
+ prob = (double)(atof(&s[k]));
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+ if(silent_vertex == YES) {
+ prob = SILENT;
+ log_prob = SILENT;
+ }
+ *(hmmp->emissions_2 + get_mtx_index(from_v, j, hmmp->a_size_2)) = prob;
+ *(hmmp->log_emissions_2 + get_mtx_index(from_v, j, hmmp->a_size_2)) = log_prob;
+ }
+ }
+ fgets(s, MAX_LINE, file);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_e3; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ k = 0;
+ while(s[k] != ' ') {
+ k++;
+ }
+ if(k > 7) {
+ printf("Cannot read hmm file, please check hmm specification\n");
+ }
+ k++;
+ prob = (double)(atof(&s[k]));
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+ if(silent_vertex == YES) {
+ prob = SILENT;
+ log_prob = SILENT;
+ }
+ *(hmmp->emissions_3 + get_mtx_index(from_v, j, hmmp->a_size_3)) = prob;
+ *(hmmp->log_emissions_3 + get_mtx_index(from_v, j, hmmp->a_size_3)) = log_prob;
+ }
+ }
+ fgets(s, MAX_LINE, file);
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ fgets(s, MAX_LINE, file);
+ for(j = 0; j < nr_e4; j++) {
+ if(fgets(s, MAX_LINE, file) != NULL) {
+ k = 0;
+ while(s[k] != ' ') {
+ k++;
+ }
+ if(k > 7) {
+ printf("Cannot read hmm file, please check hmm specification\n");
+ }
+ k++;
+ prob = (double)(atof(&s[k]));
+ if(prob != 0.0) {
+ log_prob = log10(prob);
+ }
+ else {
+ log_prob = DEFAULT;
+ }
+ if(silent_vertex == YES) {
+ prob = SILENT;
+ log_prob = SILENT;
+ }
+
+ *(hmmp->emissions_4 + get_mtx_index(from_v, j, hmmp->a_size_4)) = prob;
+ *(hmmp->log_emissions_4 + get_mtx_index(from_v, j, hmmp->a_size_4)) = log_prob;
+ }
+ }
+ fgets(s, MAX_LINE, file);
+ }
+
+ silent_vertex = NO;
+ }
+
+ /* read ---------------------------------------- */
+ fgets(s, MAX_LINE, file);
+ return 0;
+
+}
+
+
+void create_to_silent_trans_array_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ int malloc_size;
+ int *values;
+
+ malloc_size = 0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0 && silent_vertex_multi(w,hmmp) == YES) {
+ malloc_size++;
+ }
+ }
+ malloc_size++;
+ }
+
+ hmmp->to_silent_trans_array = (int**)malloc_or_die(hmmp->nr_v * sizeof(int*) + malloc_size * sizeof(int));
+ values = (int*)(hmmp->to_silent_trans_array + hmmp->nr_v);
+
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->to_silent_trans_array + v) = values;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0 && silent_vertex_multi(w,hmmp) == YES) {
+ *values = w;
+ values++;
+ }
+ }
+ *values = END;
+ values++;
+ }
+
+#ifdef DEBUG_RD
+ dump_to_silent_trans_array(hmmp->nr_v, hmmp->to_silent_trans_array);
+#endif
+}
+
+/* Go through transition matrix and get all probabilities that are not 0
+ * into from_trans_array */
+void create_from_trans_array_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w,*xp;
+ int has_to_trans;
+ int array_head_size, array_tail_size;
+ struct path_element **from_trans_array, *from_trans, *temp_path;
+
+ array_tail_size = 0;
+ array_head_size = hmmp->nr_v;
+
+ /* estimate how much space we need to store transitions */
+ array_tail_size = (hmmp->nr_t/hmmp->nr_v + 3 + MAX_GAP_SIZE) * MAX_GAP_SIZE/2 * hmmp->nr_v;
+
+#ifdef DEBUG_RD
+ printf("array_head_size, array_tail_size = %d, %d\n", array_head_size, array_tail_size);
+#endif
+ from_trans_array = (struct path_element**)
+ (malloc_or_die(array_head_size * sizeof(struct path_element*) +
+ (array_tail_size + hmmp->nr_v) * sizeof(struct path_element)));
+ from_trans = (struct path_element*)(from_trans_array + hmmp->nr_v);
+ hmmp->from_trans_array = from_trans_array;
+
+ /* find all paths and add them to from_trans_array */
+ for(v = 0; v < hmmp->nr_v; v++) /* to-vertex */ {
+ *from_trans_array = from_trans;
+ if(silent_vertex_multi(v, hmmp) == YES) {
+ from_trans->vertex = END;
+ from_trans->next = NULL;
+ from_trans++;
+ from_trans_array++;
+ continue;
+ }
+ for(w = 0; w < hmmp->nr_v; w++) /* from-vertex */ {
+ if(silent_vertex_multi(w,hmmp) == YES) {
+ continue;
+ }
+ temp_path = (struct path_element*)(malloc_or_die(1000 * sizeof(struct path_element)));
+ add_all_from_paths_multi(w, v, hmmp, &from_trans, temp_path, 0);
+ free(temp_path);
+ }
+ from_trans->vertex = END;
+ from_trans->next = NULL;
+ from_trans++;
+ from_trans_array++;
+ }
+#ifdef DEBUG_RD
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_from_trans_array(hmmp->nr_v, hmmp->from_trans_array);
+#endif
+}
+
+void add_all_from_paths_multi(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **from_transp, struct path_element *temp_pathp,
+ int length)
+{
+ int i,j;
+ int *xp;
+ struct path_element p_el, cur_p_el;
+
+ if(length > MAX_GAP_SIZE) {
+ return;
+ }
+
+ cur_p_el.vertex = v;
+ cur_p_el.next = NULL;
+ memcpy(temp_pathp + length, &cur_p_el, sizeof(struct path_element));
+
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+#ifdef DEBUG_RD
+ printf("adding path: ");
+#endif
+ /* direct path to w, add total path */
+ for(i = 0; i < length; i++) {
+ p_el.vertex = (temp_pathp + i)->vertex;
+#ifdef DEBUG_RD
+ printf("%d ", p_el.vertex);
+#endif
+ p_el.next = (*from_transp) + 1;
+ memcpy(*from_transp, &p_el, sizeof(struct path_element));
+ (*from_transp)++;
+ }
+ memcpy(*from_transp, &cur_p_el, sizeof(struct path_element));
+#ifdef DEBUG_RD
+ printf("%d %d\n", cur_p_el.vertex, w);
+#endif
+ (*from_transp)++;
+ }
+ xp = *(hmmp->to_silent_trans_array + v);
+ while(*xp != END) {
+ add_all_from_paths_multi(*xp, w, hmmp, from_transp, temp_pathp, length + 1);
+ xp++;
+ }
+
+}
+
+
+/* Go through transition matrix and get all probabilities that are not 0
+ * into to_trans_array */
+void create_to_trans_array_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w,*xp;
+ int has_to_trans;
+ int array_head_size, array_tail_size;
+ struct path_element **to_trans_array, *to_trans, *temp_path;
+
+ array_tail_size = 0;
+ array_head_size = hmmp->nr_v;
+
+ /* estimate how much space we need to store transitions */
+ array_tail_size = (hmmp->nr_t/hmmp->nr_v + 3 + MAX_GAP_SIZE) * MAX_GAP_SIZE/2 * hmmp->nr_v;
+#ifdef DEBUG_RD
+ printf("array_tail_size = %d\n", array_tail_size);
+#endif
+ to_trans_array = (struct path_element**)
+ (malloc_or_die(array_head_size * sizeof(struct path_element*) +
+ (array_tail_size + hmmp->nr_v) * sizeof(struct path_element)));
+ to_trans = (struct path_element*)(to_trans_array + hmmp->nr_v);
+ hmmp->to_trans_array = to_trans_array;
+
+ /* find all paths and add them to to_trans_array */
+ for(v = 0; v < hmmp->nr_v; v++) /* from-vertex */ {
+ *to_trans_array = to_trans;
+ if(silent_vertex_multi(v, hmmp) == YES) {
+ to_trans->vertex = END;
+ to_trans->next = NULL;
+ to_trans++;
+ to_trans_array++;
+ continue;
+ }
+ for(w = 0; w < hmmp->nr_v; w++) /* to-vertex */ {
+ if(silent_vertex_multi(w,hmmp) == YES) {
+ continue;
+ }
+ temp_path = (struct path_element*)(malloc_or_die(1000 * sizeof(struct path_element)));
+ add_all_to_paths_multi(v, w, hmmp, &to_trans, temp_path, 0);
+ free(temp_path);
+ }
+ to_trans->vertex = END;
+ to_trans->next = NULL;
+ to_trans++;
+ to_trans_array++;
+ }
+
+
+#ifdef DEBUG_RD
+ printf("array_head_size, array_tail_size = %d, %d\n", array_head_size, array_tail_size);
+ dump_to_trans_array(hmmp->nr_v, hmmp->to_trans_array);
+#endif
+}
+
+void add_all_to_paths_multi(int v, int w, struct hmm_multi_s *hmmp,
+ struct path_element **to_transp, struct path_element *temp_pathp, int length)
+{
+ int i,j;
+ int *xp;
+ struct path_element p_el;
+
+ if(length > MAX_GAP_SIZE) {
+ return;
+ }
+
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+ /* direct path to w, add total path */
+ for(i = 0; i < length; i++) {
+ p_el.vertex = (temp_pathp + i)->vertex;
+ p_el.next = (*to_transp) + 1;
+ memcpy(*to_transp, &p_el, sizeof(struct path_element));
+ (*to_transp)++;
+ }
+ p_el.vertex = w;
+ p_el.next = NULL;
+ memcpy(*to_transp, &p_el, sizeof(struct path_element));
+ (*to_transp)++;
+ }
+
+ xp = *(hmmp->to_silent_trans_array + v);
+ while(*xp != END) {
+ (temp_pathp + length)->vertex = *xp;
+ (temp_pathp + length)->next = NULL;
+ add_all_to_paths_multi(*xp, w, hmmp, to_transp, temp_pathp, length + 1);
+ xp++;
+ }
+
+
+}
+
+
+int silent_vertex_multi(int k, struct hmm_multi_s *hmmp)
+{
+#ifdef DEBUG_RD
+ printf("startnode = %d\n",hmmp->startnode);
+ printf("endnode = %d\n",hmmp->endnode);
+ dump_silent_vertices_multi(hmmp);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+ if(*(hmmp->emissions + get_mtx_index(k,0,hmmp->a_size)) == SILENT && k != hmmp->startnode && k != hmmp->endnode) {
+ return YES;
+ }
+ else {
+ return NO;
+ }
+}
+
+
+
+/* check all probabilities and abort if some prob > 1.0 or < 0.0 */
+void check_probs_multi(struct hmm_multi_s *hmmp)
+{
+ int i,j;
+ double sum;
+ double diff;
+ double prob;
+ /* transition probabilities first */
+ for(i = 0; i < hmmp->nr_v; i++) {
+ sum = 0;
+ for(j = 0; j < hmmp->nr_v; j++) {
+ prob = *((hmmp->transitions) + (i*hmmp->nr_v) + j);
+ if(prob > 1.0 || prob < 0.0) {
+ printf("Illegal probabilities (prob < 0.0 or prob > 1.0)\n");
+ exit(-1);
+ }
+ else {
+ sum += prob;
+ }
+ }
+ diff = 1.0 - sum;
+ /* maybe something about auto correcting the probabilities
+ * will be implemented later */
+ }
+
+ /* then emission probabilities */
+ for(i = 0; i < hmmp->nr_v; i++) {
+ sum = 0;
+ for(j = 0; j < hmmp->a_size; j++) {
+ prob = *((hmmp->emissions) + (i*hmmp->a_size) + j);
+ if((prob > 1.0 || prob < 0.0) && prob != SILENT) {
+ printf("Illegal probabilities (prob < 0.0 or prob > 1.0)\n");
+ exit(-1);
+ }
+ else {
+ sum += prob;
+ }
+ }
+ diff = 1.0 - sum;
+ /* maybe something about auto correcting the probabilities
+ * will be implemented later */
+ }
+
+#ifdef DEBUG_RD
+ //dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ //dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->log_transitions);
+ //dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ //dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->log_emissions);
+#endif
+
+}
+
+int read_prior_files_multi(int nr_priorfiles, struct emission_dirichlet_s *emission_priorsp,
+ int a_size, FILE *file)
+{
+ int i,j,k;
+ double q_value, alpha_value, alpha_sum, logbeta;
+ char s[2048], *p;
+ char ps[2048];
+ char *file_name;
+ char *pri;
+ FILE *priorfile;
+
+
+ /* find \n sign and remove it */
+ if(fgets(s, 2048, file) != NULL) {
+ p = s;
+ while((p = strstr(p, "\n")) != NULL) {
+ strncpy(p, " ", 1);
+ }
+ }
+
+ /* read all before first filename */
+ strtok(s," ");
+
+
+ if((file_name = strtok(NULL, " ")) == NULL) {
+ /* done */
+ return 0;
+ }
+
+ for(i = 0; i < nr_priorfiles; i++) {
+ /* open the priorfile */
+ if((file_name = strtok(NULL, " ")) == NULL) {
+ /* done */
+ return 0;
+ }
+
+ else {
+ if((priorfile = fopen(file_name,"r")) != NULL) {
+ printf("Opened priorfile %s\n", file_name);
+ }
+ else {
+ perror(file_name);
+ return -1;
+ }
+ }
+
+ /* put name in struct */
+ strcpy((emission_priorsp + i)->name, file_name);
+
+ /* put nr of components in struct */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ (emission_priorsp + i)->nr_components = atoi(&ps[0]);
+ (emission_priorsp + i)->alphabet_size = a_size;
+
+
+ /* allocate memory for arrays and matrix to this prior struct */
+ (emission_priorsp + i)->q_values = malloc_or_die((emission_priorsp + i)->nr_components *
+ sizeof(double));
+ (emission_priorsp + i)->alpha_sums = malloc_or_die((emission_priorsp + i)->nr_components *
+ sizeof(double));
+ (emission_priorsp + i)->logbeta_values =
+ malloc_or_die((emission_priorsp + i)->nr_components * sizeof(double));
+ (emission_priorsp + i)->prior_values = malloc_or_die((emission_priorsp + i)->nr_components *
+ a_size * sizeof(double));
+
+ for(j = 0; j < (emission_priorsp + i)->nr_components; j++) {
+ /* put q-value in array, skip empty and comment lines */
+ while(1) {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ if(*ps == '#' || *ps == '\n') {
+
+ }
+ else {
+ break;
+ }
+ }
+ else {
+ /* end of file reached before the q-value: fail instead of looping forever */
+ printf("Prior file has incorrect format\n");
+ return -1;
+ }
+ }
+ q_value = atof(&ps[0]);
+ *((emission_priorsp + i)->q_values + j) = q_value;
+#ifdef DEBUG_RDPRI
+ printf("q_value = %f\n", *(((emission_priorsp + i)->q_values) + j));
+#endif
+
+ /* put alpha-values of this component in matrix */
+ alpha_sum = 0.0;
+ k = 0;
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+
+ pri = &ps[0];
+ for(k = 0; k < a_size; k++) {
+ alpha_value = strtod(pri, &pri);
+ alpha_sum += alpha_value;
+ *(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k,a_size)) = alpha_value;
+ }
+
+ /* put sum of alphavalues in array */
+ *(((emission_priorsp + i)->alpha_sums) + j) = alpha_sum;
+
+ /* calculate logB(alpha) for this component, store in array*/
+ logbeta = 0;
+ for(k = 0; k < a_size; k++) {
+ logbeta += lgamma(*(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, a_size)));
+
+#ifdef DEBUG_RDPRI
+ printf("prior_value = %f\n", *(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, a_size)));
+ printf("lgamma_value = %f\n", lgamma(*(((emission_priorsp + i)->prior_values) +
+ get_mtx_index(j, k, a_size))));
+#endif
+ }
+ logbeta = logbeta - lgamma(*(((emission_priorsp + i)->alpha_sums) + j));
+ *(((emission_priorsp + i)->logbeta_values) + j) = logbeta;
+ }
+#ifdef DEBUG_RDPRI
+ dump_prior_struct(emission_priorsp + i);
+#endif
+
+ /* some cleanup before continuing with next prior file; the struct is
+ * selected by index i, so the base pointer must not be advanced as well */
+ fclose(priorfile);
+
+ }
+ return 0;
+}
+
+int read_trans_prior_files_multi(int nr_priorfiles, void *emission_priorsp,
+ struct hmm_multi_s *hmmp, FILE *file)
+{
+ char s[2048];
+
+ /* not implemented yet */
+ if(fgets(s, 2048, file) != NULL) {
+ }
+ return 0;
+}
+
+void create_tot_transitions_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ struct path_element *wp;
+ double t_res;
+ double log_t_res, cur_value;
+
+ hmmp->tot_transitions = (double*)malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double));
+ hmmp->max_log_transitions = (double*)malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double));
+ init_float_mtx(hmmp->max_log_transitions, DEFAULT, hmmp->nr_v * hmmp->nr_v);
+
+ for(v = 0; v < hmmp->nr_v; v++) {
+ wp = *(hmmp->from_trans_array + v);
+ while(wp->vertex != END) /* w = from-vertex */ {
+ t_res = 1.0;
+ w = wp->vertex;
+ while(wp->next != NULL) {
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, (wp + 1)->vertex, hmmp->nr_v));
+ /* probability of transition from w to v via silent states */
+ wp++;
+ }
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, v, hmmp->nr_v));
+ /* tot_transitions */
+ *(hmmp->tot_transitions + get_mtx_index(w,v,hmmp->nr_v)) += t_res;
+
+ /* max_log_transitions */
+ if(t_res != 0.0) {
+ log_t_res = log10(t_res);
+ cur_value = *(hmmp->max_log_transitions + get_mtx_index(w,v,hmmp->nr_v));
+ if(cur_value == DEFAULT || log_t_res > cur_value) {
+ *(hmmp->max_log_transitions + get_mtx_index(w,v,hmmp->nr_v)) = log_t_res;
+ }
+ }
+ wp++;
+ }
+ }
+}
+
+void create_tot_trans_arrays_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ struct path_element *p_elp;
+ int malloc_size;
+
+ malloc_size = 0;
+ for(v = 0; v < hmmp->nr_v; v++) {
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0.0) {
+ malloc_size++;
+ }
+ }
+ }
+
+ hmmp->tot_to_trans_array = (struct path_element**)malloc_or_die(hmmp->nr_v * sizeof(struct path_element*) +
+ (malloc_size + hmmp->nr_v) * sizeof(struct path_element));
+
+ hmmp->tot_from_trans_array = (struct path_element**)malloc_or_die(hmmp->nr_v * sizeof(struct path_element*) +
+ (malloc_size + hmmp->nr_v) * sizeof(struct path_element));
+
+ /* fill in tot_to_trans_array */
+ p_elp = (struct path_element*)(hmmp->tot_to_trans_array + hmmp->nr_v);
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->tot_to_trans_array + v) = p_elp;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(v,w,hmmp->nr_v)) != 0.0) {
+ p_elp->vertex = w;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+ }
+ p_elp->vertex = END;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+
+
+ /* fill in tot_from_trans_array */
+ p_elp = (struct path_element*)(hmmp->tot_from_trans_array + hmmp->nr_v);
+ for(v = 0; v < hmmp->nr_v; v++) {
+ *(hmmp->tot_from_trans_array + v) = p_elp;
+ for(w = 0; w < hmmp->nr_v; w++) {
+ if(*(hmmp->tot_transitions + get_mtx_index(w,v,hmmp->nr_v)) != 0.0) {
+ p_elp->vertex = w;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+ }
+ p_elp->vertex = END;
+ p_elp->next = NULL;
+ p_elp++;
+ }
+}
diff --git a/modhmm0.92b/readseqs_multialpha.c b/modhmm0.92b/readseqs_multialpha.c
new file mode 100644
index 0000000..85ef3f7
--- /dev/null
+++ b/modhmm0.92b/readseqs_multialpha.c
@@ -0,0 +1,2054 @@
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <math.h>
+#include <float.h>
+//#include <double.h>
+#include <limits.h>
+
+#include "structs.h"
+#include "funcs.h"
+
+#define MAX_LINE 60000
+
+#define DIRICHLET 12
+#define NONE -2
+
+#define LONGEST_SEQ -1 /* Note: this constant is also defined in hmmsearch_msa.c */
+
+//#define DEBUG_SEQ_STD
+//#define DEBUG_SEQ_FASTA
+//#define DEBUG_MSA_SEQ_STD
+//#define DEBUG_MSA_SEQ_PRF
+//#define DEBUG_PRI
+
+extern int verbose;
+
+
+int seqfile_has_labels(FILE *seqfile)
+{
+ char row[30000];
+
+ rewind(seqfile);
+ while(1) {
+ if(fgets(row,30000,seqfile) != NULL) {
+ if(row[0] == '/') {
+ rewind(seqfile);
+ return YES;
+ }
+ }
+ else {
+ rewind(seqfile);
+ return NO;
+ }
+ }
+
+}
+
+
+void get_sequence_fasta_multi(char* seq, struct sequences_multi_s *seq_infop, int seq_nr)
+{
+ int i,j,a;
+ int nr_letters;
+ int longest_seq, shortest_seq, avg_seq_len;
+ int inside_seq;
+ char c;
+ struct letter_s *seqsp;
+
+
+ nr_letters = 0;
+ longest_seq = seq_infop->longest_seq;
+ shortest_seq = seq_infop->shortest_seq;
+ inside_seq = NO;
+
+ /* Find out how much space to allocate for the sequence = count total number of letters*/
+ nr_letters = strlen(seq);
+
+ /* Allocate memory, must be freed by caller*/
+ if(nr_letters > seq_infop->longest_seq) {
+ seq_infop->longest_seq = nr_letters;
+ }
+ if(nr_letters < seq_infop->shortest_seq) {
+ seq_infop->shortest_seq = nr_letters;
+ }
+ seq_infop->avg_seq_len += nr_letters;
+ (seq_infop->seqs + seq_nr)->seq_1 = (struct letter_s*)(malloc_or_die(nr_letters * 2 * sizeof(struct letter_s)));
+ (seq_infop->seqs + seq_nr)->length = nr_letters;
+
+ /* Read sequences into memory */
+ seqsp = (struct letter_s*)((seq_infop->seqs + seq_nr)->seq_1);
+ inside_seq = NO;
+
+ /* read sequence name */
+ (seq_infop->seqs + seq_nr)->name[0] = 's';
+
+ i = 0;
+ while((i <= nr_letters) && (seq[i] != '\0')) {
+ c = seq[i];
+
+ (seqsp->letter)[0] = c;
+ (seqsp->letter)[1] = '\0';
+ seqsp->label = '.';
+ seqsp++;
+ i++;
+ }
+
+}
+
+
+void get_sequence_std_multi(FILE *seqfile, struct sequences_multi_s *seq_infop, struct hmm_multi_s *hmmp, int seq_nr)
+{
+ int i,j,a,k,last;
+ int nr_letters;
+ int longest_seq, shortest_seq, avg_seq_len;
+ int temp_seq_len;
+ int inside_seq;
+ int nr_alphabets, nr_alphabets_temp;
+ char c;
+ struct letter_s *seqsp;
+ struct sequence_multi_s temp_seq;
+
+ nr_letters = 0;
+ inside_seq = NO;
+ nr_alphabets = 0;
+ nr_alphabets_temp = 0;
+ last = '!';
+ char line[MAX_LINE], *cur_line;
+ double letter_val;
+
+ cur_line = line;
+ while(fgets(line, MAX_LINE, seqfile) != NULL) {
+ if(line[0] == '<' || line[0] == '#' || line[0] == ' ' || line[0] == '\n' || line[0] == '/') {
+
+ }
+ else {
+ printf("Sequence file does not seem to be in correct format\n");
+ exit(0);
+ }
+ for(i = 0; line[i] != '\0'; i++) {
+ if((line[i] == '>' || line[i] == '+') && i > 0 && line[i-1] != ';') {
+ printf("Sequence file does not seem to be in correct format\n");
+ printf("All letters must be followed by ';'\n");
+ exit(0);
+ }
+ }
+ }
+ rewind(seqfile);
+
+ /* Find out how much space to allocate for the sequences = count total number of letters*/
+ while((i = fgetc(seqfile)) != EOF) {
+ c = (char)i;
+ if(c == ';' && inside_seq == YES) {
+ nr_letters++;
+ last = '!';
+ }
+ else if(c == '<') {
+ inside_seq = YES;
+ nr_letters = 0;
+ if(last == '<') {
+ nr_alphabets_temp++;
+ }
+ else {
+ nr_alphabets_temp = 1;
+ last = '<';
+ }
+ }
+ else if(c == '#') {
+ inside_seq = YES;
+ if(last == '#') {
+ nr_alphabets_temp++;
+ }
+ else {
+ nr_alphabets_temp = 1;
+ last = '#';
+ }
+ }
+ else if((c == '>' || c == '+')&& inside_seq == YES) {
+ inside_seq = NO;
+ if(nr_alphabets_temp > nr_alphabets) {
+ nr_alphabets = nr_alphabets_temp;
+ nr_alphabets_temp = 0;
+ }
+ if(c == '>') {
+ switch(nr_alphabets) {
+ case 1: hmmp->alphabet_type = DISCRETE; break;
+ case 2: hmmp->alphabet_type_2 = DISCRETE; break;
+ case 3: hmmp->alphabet_type_3 = DISCRETE; break;
+ case 4: hmmp->alphabet_type_4 = DISCRETE; break;
+ }
+ }
+ if(c == '+') {
+ switch(nr_alphabets) {
+ case 1: hmmp->alphabet_type = CONTINUOUS; break;
+ case 2: hmmp->alphabet_type_2 = CONTINUOUS; break;
+ case 3: hmmp->alphabet_type_3 = CONTINUOUS; break;
+ case 4: hmmp->alphabet_type_4 = CONTINUOUS; break;
+ }
+ }
+ last = '!';
+ }
+ else if(c == '/') {
+ break;
+ }
+ else {
+ last = '!';
+ }
+ }
+
+#ifdef DEBUG_SEQ_STD
+ printf("nr_letters = %d\n", nr_letters);
+ printf("nr_alphabets = %d\n", nr_alphabets);
+#endif
+
+ /* check if nr of alphabets in sequence file corresponds to nr of alphabets in hmm file */
+ if(hmmp->nr_alphabets < nr_alphabets) {
+ printf("Warning: HMM has %d alphabets, while sequence file contains %d alphabets\n", hmmp->nr_alphabets, nr_alphabets);
+ printf("Only the first %d alphabets will be read and used\n", hmmp->nr_alphabets);
+ nr_alphabets = hmmp->nr_alphabets;
+ }
+ else if(hmmp->nr_alphabets > nr_alphabets) {
+ printf("Error: HMM has %d alphabets, while sequence file contains %d alphabets\n", hmmp->nr_alphabets, nr_alphabets);
+ exit(0);
+ }
+
+ /* Allocate memory, must be freed by caller*/
+ if(nr_letters > seq_infop->longest_seq) {
+ seq_infop->longest_seq = nr_letters;
+ }
+ if(nr_letters < seq_infop->shortest_seq) {
+ seq_infop->shortest_seq = nr_letters;
+ }
+ seq_infop->avg_seq_len += nr_letters;
+ (seq_infop->seqs + seq_nr)->seq_1 = (struct letter_s*)(malloc_or_die(nr_letters * 2 * sizeof(struct letter_s)));
+ if(nr_alphabets > 1) {
+ (seq_infop->seqs + seq_nr)->seq_2 = (struct letter_s*)(malloc_or_die(nr_letters * 2 * sizeof(struct letter_s)));
+ }
+ if(nr_alphabets > 2) {
+ (seq_infop->seqs + seq_nr)->seq_3 = (struct letter_s*)(malloc_or_die(nr_letters * 2 * sizeof(struct letter_s)));
+ }
+ if(nr_alphabets > 3) {
+ (seq_infop->seqs + seq_nr)->seq_4 = (struct letter_s*)(malloc_or_die(nr_letters * 2 * sizeof(struct letter_s)));
+ }
+ (seq_infop->seqs + seq_nr)->length = nr_letters;
+
+ /* Read sequences into memory */
+ rewind(seqfile);
+
+ for(k = 0; k < nr_alphabets;) {
+ if(k == 0) {
+ seqsp = (seq_infop->seqs + seq_nr)->seq_1;
+ }
+ if(k == 1) {
+ seqsp = (seq_infop->seqs + seq_nr)->seq_2;
+ }
+ if(k == 2) {
+ seqsp = (seq_infop->seqs + seq_nr)->seq_3;
+ }
+ if(k == 3) {
+ seqsp = (seq_infop->seqs + seq_nr)->seq_4;
+ }
+ a = 0;
+ c = (char)(i = fgetc(seqfile)); /* keep the int result so EOF stays detectable */
+ if(c != '<' && c != '#') {
+ /* not a sequence on this line: skip to end of line, or stop at EOF */
+ while(i != EOF && i != '\n') {
+ i = fgetc(seqfile);
+ }
+ if(i == EOF) break;
+ continue;
+ }
+ else if(c == '<') /* this line is a sequence */ {
+ while((c = (char)(fgetc(seqfile))) != '>') {
+ if(c == '<') {
+
+ }
+ else if(c == ';') {
+ seqsp->letter[a] = '\0';
+ seqsp->label = '.';
+ seqsp++;
+ a = 0;
+ }
+ else {
+ seqsp->letter[a] = c;
+ a++;
+ }
+ }
+
+ seqsp->letter[0] = '\0';
+ seqsp++;
+ strcpy((seq_infop->seqs + seq_nr)->name, "\0");
+ if(k == 0) {
+ (seq_infop->seqs + seq_nr)->length = get_seq_length((seq_infop->seqs + seq_nr)->seq_1);
+ }
+ k++;
+ }
+ else if(c == '#') {
+ fgets(line, MAX_LINE, seqfile);
+ cur_line = line;
+ while(*cur_line == '#') {
+ cur_line++;
+ }
+ while(*cur_line != '+') {
+ letter_val = strtod(cur_line, &cur_line);
+ seqsp->letter[0] = 'X';
+ seqsp->letter[1] = '\0';
+ seqsp->label = '.';
+ seqsp->cont_letter = letter_val;
+ seqsp++;
+ if(*cur_line != ';') {
+ printf("Strange continuous sequence file format\n");
+ }
+ else {
+ cur_line = cur_line + 1;
+ }
+ }
+ seqsp->letter[0] = '\0';
+ seqsp++;
+ strcpy((seq_infop->seqs + seq_nr)->name, "\0");
+ if(k == 0) {
+ (seq_infop->seqs + seq_nr)->length = get_seq_length((seq_infop->seqs + seq_nr)->seq_1);
+ }
+ k++;
+ }
+ }
+
+#ifdef DEBUG_SEQ_STD
+ printf("exiting read_seqs\n");
+#endif
+
+}
+
+
+
+/* Note: msa_seq_infop->seq and msa_seq_infop->gaps will be allocated here but must be
+ * freed by caller */
+void get_sequences_msa_std_multi(FILE *seqfile, FILE *priorfile_a1, struct msa_sequences_multi_s *msa_seq_infop,
+ struct hmm_multi_s *hmmp, int lead_seq, struct replacement_letter_multi_s *replacement_letters)
+{
+ int i,j,k,l,m;
+ int done, inside_seq, seq_pos, letter_pos, a_index, nr_seqs, cur_pos, cur_seq;
+ int msa_seq_length, msa_length, seq_length, longest_seq, longest_seq_length;
+ int seq_index, nr_lead_columns;
+ int gaps_per_column, tot_nr_gaps, nr_gaps;
+ double occurences_per_column;
+ int get_letter_columns, get_query_letters;
+ int use_priordistribution_1, use_priordistribution_2, use_priordistribution_3, use_priordistribution_4;
+ char c;
+ struct letter_s cur_letter;
+ int **cur_posp;
+ struct emission_dirichlet_s em_di_1, em_di_2, em_di_3, em_di_4;
+ int is_first;
+ int use_replacement_letters;
+ int is_empty;
+ char last;
+ int nr_alphabets, nr_alphabets_temp, nr_continuous_alphabets, nr_continuous_alphabets_temp;
+ char line[MAX_LINE], *cur_line;
+ double letter_val;
+ int read_priorfile;
+
+
+ /* find out length of alignment and allocate memory for probability matrix by
+ * reading all sequence rows and remembering the longest */
+ tot_nr_gaps = 0;
+ done = NO;
+ inside_seq = NO;
+ msa_seq_length = 0;
+ nr_seqs = 0;
+ msa_length = 0;
+ seq_length = 0;
+ seq_index = 0;
+ longest_seq = -1;
+ longest_seq_length = 0;
+ is_first = YES;
+
+ while(1) {
+ i = fgetc(seqfile);
+ if(i == '<') {
+ break;
+ }
+ else if(i == '#') {
+ break;
+ }
+ else if(i == ' ' || i == '\n') {
+
+ }
+ else {
+ printf("Sequence file does not seem to be in correct format\n");
+ exit(0);
+ }
+ }
+ rewind(seqfile);
+
+ if(priorfile_a1 == NULL) {
+ use_priordistribution_1 = NONE;
+ use_priordistribution_2 = NONE;
+ use_priordistribution_3 = NONE;
+ use_priordistribution_4 = NONE;
+ }
+ else {
+ read_priorfile = read_multi_prior_file_multi(&em_di_1, hmmp, priorfile_a1, 1);
+ if(read_priorfile > 0) {
+ use_priordistribution_1 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_1 = NONE;
+ }
+ if(hmmp->nr_alphabets > 1) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_2, hmmp, priorfile_a1, 2);
+ if(read_priorfile > 0) {
+ use_priordistribution_2 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_2 = NONE;
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_3, hmmp, priorfile_a1, 3);
+ if(read_priorfile > 0) {
+ use_priordistribution_3 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_3 = NONE;
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_4, hmmp, priorfile_a1, 4);
+ if(read_priorfile > 0) {
+ use_priordistribution_4 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_4 = NONE;
+ }
+ }
+ }
+
+ if(replacement_letters->nr_alphabets == 0) {
+ use_replacement_letters = NO;
+ }
+ else {
+ use_replacement_letters = YES;
+ }
+
+ /* check if file is empty */
+ is_empty = YES;
+ while(done != YES) {
+ c = (char)(i = fgetc(seqfile)); /* test the int result: a char cannot represent EOF reliably */
+ if(i == EOF) {
+ break;
+ }
+ else if (c == '<' || c == '#') {
+ is_empty = NO;
+ break;
+ }
+ }
+ if(is_empty == YES) {
+ if(verbose == YES) {
+ printf("File is empty\n");
+ }
+ }
+ else {
+ }
+ rewind(seqfile);
+
+ last = '!';
+ nr_alphabets = 0;
+ nr_alphabets_temp = 0;
+ nr_continuous_alphabets = 0;
+ nr_continuous_alphabets_temp = 0;
+ while(done != YES) {
+ c = (char)(i = fgetc(seqfile)); /* test the int result: a char cannot represent EOF reliably */
+ if(i == EOF) {
+ done = YES;
+ continue;
+ }
+ if(c == '<' && last != '<') {
+ inside_seq = YES;
+ seq_length = 0;
+ if(is_first == YES) {
+ msa_seq_length = 0;
+ }
+ seq_index++;
+ nr_seqs++;
+ last = '<';
+ nr_alphabets_temp = 1;
+ }
+ else if(c == '#' && last != '#') {
+ inside_seq = YES;
+ seq_length = 0;
+ if(is_first == YES) {
+ msa_seq_length = 0;
+ }
+ seq_index++;
+ last = '#';
+ nr_alphabets_temp = 1;
+ }
+ else if(c == '<') {
+ nr_alphabets_temp++;
+ }
+ else if(c == '#') {
+ nr_alphabets_temp++;
+ nr_continuous_alphabets_temp++;
+ }
+ else if(c == '>') {
+ if(seq_length > longest_seq_length) {
+ longest_seq = seq_index;
+ longest_seq_length = seq_length;
+ }
+ inside_seq = NO;
+ is_first = NO;
+ if(nr_alphabets_temp > nr_alphabets) {
+ nr_alphabets = nr_alphabets_temp;
+ switch(nr_alphabets) {
+ case 1: hmmp->alphabet_type = DISCRETE; break;
+ case 2: hmmp->alphabet_type_2 = DISCRETE; break;
+ case 3: hmmp->alphabet_type_3 = DISCRETE; break;
+ case 4: hmmp->alphabet_type_4 = DISCRETE; break;
+ }
+ }
+
+ nr_alphabets_temp = 0;
+ last = '!';
+ }
+ else if(c == '+') {
+ if(seq_length > longest_seq_length) {
+ longest_seq = seq_index;
+ longest_seq_length = seq_length;
+ }
+ inside_seq = NO;
+ is_first = NO;
+ if(nr_alphabets_temp > nr_alphabets) {
+ nr_alphabets = nr_alphabets_temp;
+ switch(nr_alphabets) {
+ case 1: hmmp->alphabet_type = CONTINUOUS; break;
+ case 2: hmmp->alphabet_type_2 = CONTINUOUS; break;
+ case 3: hmmp->alphabet_type_3 = CONTINUOUS; break;
+ case 4: hmmp->alphabet_type_4 = CONTINUOUS; break;
+ }
+ }
+ if(nr_continuous_alphabets_temp > nr_continuous_alphabets) {
+ nr_continuous_alphabets = nr_continuous_alphabets_temp;
+ }
+
+
+ nr_alphabets_temp = 0;
+ nr_continuous_alphabets_temp = 0;
+ last = '!';
+ }
+ if(c == ';' && inside_seq == YES) {
+ if(is_first == YES) {
+ msa_seq_length++;
+ }
+ seq_length++;
+ last = '!';
+ }
+ else if(inside_seq == YES && (c == '_' || c == ' ' || c == '-' || c == '.')) {
+ seq_length--;
+ last = '!';
+ }
+ }
+ nr_seqs = (nr_alphabets > nr_continuous_alphabets) ? nr_seqs / (nr_alphabets - nr_continuous_alphabets) : 0; /* no division by zero on empty input */
+ msa_seq_infop->nr_seqs = nr_seqs;
+
+#ifdef DEBUG_MSA_SEQ_STD
+ printf("reached checkpoint 1\n");
+ printf("msa_seq_length = %d\n", msa_seq_length);
+ printf("nr_alphabets = %d\n", nr_alphabets);
+ printf("longest seq = %d\n", longest_seq);
+ printf("longest_seq_length = %d\n", longest_seq_length);
+ printf("nr_seqs = %d\n", nr_seqs);
+#endif
+
+ msa_seq_infop->msa_seq_1 = (struct msa_letter_s*)
+ malloc_or_die(msa_seq_length * (hmmp->a_size+1) * sizeof(struct msa_letter_s));
+ msa_seq_infop->lead_columns_start = (int*)malloc_or_die((msa_seq_length+1) * sizeof(int));
+ if(hmmp->nr_alphabets > 1){
+ msa_seq_infop->msa_seq_2 = (struct msa_letter_s*)
+ malloc_or_die(msa_seq_length * (hmmp->a_size_2+1) * sizeof(struct msa_letter_s));
+ }
+ if(hmmp->nr_alphabets > 2){
+ msa_seq_infop->msa_seq_3 = (struct msa_letter_s*)
+ malloc_or_die(msa_seq_length * (hmmp->a_size_3+1) * sizeof(struct msa_letter_s));
+ }
+ if(hmmp->nr_alphabets > 3){
+ msa_seq_infop->msa_seq_4 = (struct msa_letter_s*)
+ malloc_or_die(msa_seq_length * (hmmp->a_size_4+1) * sizeof(struct msa_letter_s));
+ }
+
+ /* read first alphabet, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ rewind(seqfile);
+ seq_pos = 0;
+ inside_seq = NO;
+ seq_index = 0;
+ get_letter_columns = get_query_letters = NO; /* get_query_letters could otherwise be read before it is set */
+ k = 0;
+ l = 0;
+ nr_lead_columns = 0;
+ if(lead_seq == LONGEST_SEQ) {
+ lead_seq = longest_seq;
+ }
+
+ for(m = 0; m < nr_seqs;) {
+ i = fgetc(seqfile);
+ if(i == EOF) {
+ break;
+
+ }
+ c = (char)i;
+ if(c == '<') {
+ seq_pos = 0;
+ inside_seq = YES;
+ seq_index++;
+ if(seq_index == lead_seq) {
+ get_letter_columns = YES;
+ get_query_letters = YES;
+ }
+ }
+ else if(c == '#') {
+ get_letter_columns = YES;
+ fgets(line, MAX_LINE, seqfile);
+ cur_line = line;
+ while(*cur_line == '#') {
+ cur_line++;
+ }
+ while(*cur_line != '+') {
+ if(*cur_line == '_' || *cur_line == ' ' ||
+ (*cur_line == '-' && *(cur_line + 1) == ';') || *cur_line == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos, 0, hmmp->a_size+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos, 0, hmmp->a_size+1))->nr_occurences = 0.0;
+ cur_line++;
+ }
+ else if(*cur_line == 'X') {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos, 0, hmmp->a_size+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos, 0, hmmp->a_size+1))->nr_occurences = -1.0;
+ cur_line++;
+ }
+ else {
+ letter_val = strtod(cur_line, &cur_line);
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos,0, hmmp->a_size+1))->share = letter_val;
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos,0, hmmp->a_size+1))->nr_occurences = 1.0;
+ if(get_letter_columns == YES) {
+ *(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ k++;
+ nr_lead_columns++;
+ }
+ }
+ if(*cur_line != ';') {
+ printf("Strange continuous sequence file format\n");
+ exit(0);
+ }
+ else {
+ cur_line = cur_line + 1;
+ seq_pos++;
+ }
+ }
+ get_letter_columns = NO;
+ break;
+ }
+ else if(c == '>') {
+ inside_seq = NO;
+ get_letter_columns = NO;
+ get_query_letters = NO;
+ m++;
+ }
+ else if(inside_seq == YES) {
+ /* reading one letter */
+ letter_pos = 0;
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ while((c = (char)fgetc(seqfile)) != ';') {
+ if(letter_pos >= 4) { /* reject before writing past the 4-character limit */
+ printf("max letter size = 4 characters\n");
+ exit(0);
+ }
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ }
+ cur_letter.letter[letter_pos] = '\0';
+ if(seq_pos + 1 > msa_length) {
+ msa_length = seq_pos + 1; /* update nr of positions */
+ }
+ if(cur_letter.letter[0] == '_' || cur_letter.letter[0] == ' ' ||
+ cur_letter.letter[0] == '-' || cur_letter.letter[0] == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos, hmmp->a_size, hmmp->a_size+1))->nr_occurences
+ += 1;
+ seq_pos++;
+ }
+ else /* non gap character */ {
+ if(use_replacement_letters == YES &&
+ replacement_letter_multi(&cur_letter, replacement_letters, msa_seq_infop, hmmp, seq_pos,1) == YES){
+ /* adding of nr_occurences for these letters is done on the fly in replacement_letter-function */
+ }
+ else /* add occurrence of this letter */ {
+ a_index = get_alphabet_index(&cur_letter, hmmp->alphabet, hmmp->a_size);
+ //printf("a_index = %d\n", a_index);
+ if(a_index < 0) {
+ printf("Could not read msa seq from file: letter '%s' is not in alphabet\n", cur_letter.letter);
+ exit(0);
+ }
+ else {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos,a_index, hmmp->a_size+1))->nr_occurences += 1;
+ }
+ }
+ if(get_letter_columns == YES) {
+ *(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ k++;
+ nr_lead_columns++;
+ }
+ seq_pos++;
+ }
+ if(get_query_letters == YES) {
+ memcpy((msa_seq_infop->msa_seq_1 + l * (hmmp->a_size + 1))->query_letter, &(cur_letter.letter),
+ sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ }
+ //printf("m = %d\n",m);
+ }
+ /* read second alphabet, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ if(hmmp->nr_alphabets > 1) {
+ seq_pos = 0;
+ inside_seq = NO;
+ seq_index = 0;
+ get_letter_columns = NO;
+ l = 0;
+ last = '!';
+ for(m = 0; m < nr_seqs;) {
+ i = fgetc(seqfile);
+ if(i == EOF) {
+ break;
+ }
+ c = (char)i;
+ //printf("alpha2:c = %c\n",c);
+
+ if(c == '<' && last != '<') {
+ seq_pos = 0;
+ inside_seq = YES;
+ seq_index++;
+ if(seq_index == lead_seq) {
+ get_query_letters = YES;
+ }
+ last = c;
+ }
+ else if(c == '<') {
+ last = c;
+ }
+ else if(c == '#') {
+ fgets(line, MAX_LINE, seqfile);
+ cur_line = line;
+ while(*cur_line == '#') {
+ cur_line++;
+ }
+ while(*cur_line != '+') {
+ if(*cur_line == '_' || *cur_line == ' ' ||
+ (*cur_line == '-' && *(cur_line + 1) == ';') || *cur_line == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos, 0, hmmp->a_size_2+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos, 0, hmmp->a_size_2+1))->nr_occurences = 0.0;
+ cur_line++;
+ }
+ else if(*cur_line == 'X') {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos, 0, hmmp->a_size_2+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos, 0, hmmp->a_size_2+1))->nr_occurences = -1.0;
+ cur_line++;
+ }
+ else {
+ letter_val = strtod(cur_line, &cur_line);
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos,0, hmmp->a_size_2+1))->share = letter_val;
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos,0, hmmp->a_size_2+1))->nr_occurences = 1.0;
+ }
+ if(*cur_line != ';') {
+ printf("Strange continuous sequence file format\n");
+ exit(0);
+ }
+ else {
+ cur_line = cur_line + 1;
+ seq_pos++;
+ }
+ }
+ break;
+ }
+ else if(c == '>') {
+ inside_seq = NO;
+ get_query_letters = NO;
+ m++;
+ last = '!';
+ }
+ else if(inside_seq == YES) {
+ /* reading one letter */
+ letter_pos = 0;
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ while((c = (char)fgetc(seqfile)) != ';') {
+ if(letter_pos >= 4) { /* reject before writing past the 4-character limit */
+ printf("max letter size = 4 characters\n");
+ exit(0);
+ }
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ }
+ cur_letter.letter[letter_pos] = '\0';
+ if(seq_pos + 1 > msa_length) {
+ msa_length = seq_pos + 1; /* update nr of positions */
+ }
+ if(cur_letter.letter[0] == '_' || cur_letter.letter[0] == ' ' ||
+ cur_letter.letter[0] == '-' || cur_letter.letter[0] == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos, hmmp->a_size_2, hmmp->a_size_2+1))->nr_occurences
+ += 1;
+ seq_pos++;
+ }
+ else /* non gap character */ {
+ if(use_replacement_letters == YES &&
+ replacement_letter_multi(&cur_letter, replacement_letters, msa_seq_infop, hmmp, seq_pos,2) == YES){
+ /* adding of nr_occurences for these letters is done on the fly in replacement_letter-function */
+ }
+ else /* add occurrence of this letter */ {
+ a_index = get_alphabet_index(&cur_letter, hmmp->alphabet_2, hmmp->a_size_2);
+ if(a_index < 0) {
+ printf("Could not read msa seq from file: letter '%s' is not in alphabet\n", cur_letter.letter);
+ exit(0);
+ }
+ else {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos,a_index, hmmp->a_size_2+1))->nr_occurences += 1;
+ }
+ }
+ seq_pos++;
+ }
+ if(get_query_letters == YES) {
+ memcpy((msa_seq_infop->msa_seq_2 + l * (hmmp->a_size_2 + 1))->query_letter, &(cur_letter.letter),
+ sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ }
+ }
+ }
+
+ /* read third alphabet, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ if(hmmp->nr_alphabets > 2) {
+ seq_pos = 0;
+ inside_seq = NO;
+ seq_index = 0;
+ get_letter_columns = NO;
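+ /* k and nr_lead_columns keep the lead-column values found while reading
+ * alphabet 1; resetting them here would discard those columns */
+ l = 0;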
+ last = '!';
+ for(m = 0; m < nr_seqs;) {
+ i = fgetc(seqfile);
+ if(i == EOF) {
+ break;
+ }
+ c = (char)i;
+ //printf("alpha3:c = %c\n",c);
+
+ if(c == '<' && last != '<') {
+ seq_pos = 0;
+ inside_seq = YES;
+ seq_index++;
+ if(seq_index == lead_seq) {
+ get_query_letters = YES;
+ }
+ last = c;
+ }
+ else if(c == '<') {
+ last = c;
+ }
+ else if(c == '#') {
+ fgets(line, MAX_LINE, seqfile);
+ cur_line = line;
+ while(*cur_line == '#') {
+ cur_line++;
+ }
+ while(*cur_line != '+') {
+ if(*cur_line == '_' || *cur_line == ' ' ||
+ (*cur_line == '-' && *(cur_line + 1) == ';') || *cur_line == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos, 0, hmmp->a_size_3+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos, 0, hmmp->a_size_3+1))->nr_occurences = 0.0;
+ /* seq_pos is advanced once, below, when the ';' is consumed */
+ cur_line++;
+ }
+ else if(*cur_line == 'X') {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos, 0, hmmp->a_size_3+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos, 0, hmmp->a_size_3+1))->nr_occurences = -1.0;
+ /* seq_pos is advanced once, below, when the ';' is consumed */
+ cur_line++;
+ }
+ else {
+ letter_val = strtod(cur_line, &cur_line);
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos,0, hmmp->a_size_3+1))->share = letter_val;
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos,0, hmmp->a_size_3+1))->nr_occurences = 1.0;
+ }
+ if(*cur_line != ';') {
+ printf("Strange continuous sequence file format\n");
+ exit(0);
+ }
+ else {
+ cur_line = cur_line + 1;
+ seq_pos++;
+ }
+ }
+ break;
+ }
+ else if(c == '>') {
+ inside_seq = NO;
+ get_query_letters = NO;
+ m++;
+ last = '!';
+ }
+ else if(inside_seq == YES) {
+ /* reading one letter */
+ letter_pos = 0;
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ while((c = (char)fgetc(seqfile)) != ';') {
+ if(letter_pos >= 4) { /* reject before writing past the 4-character limit */
+ printf("max letter size = 4 characters\n");
+ exit(0);
+ }
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ }
+ cur_letter.letter[letter_pos] = '\0';
+ if(seq_pos + 1 > msa_length) {
+ msa_length = seq_pos + 1; /* update nr of positions */
+ }
+ if(cur_letter.letter[0] == '_' || cur_letter.letter[0] == ' ' ||
+ cur_letter.letter[0] == '-' || cur_letter.letter[0] == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos, hmmp->a_size_3, hmmp->a_size_3+1))->nr_occurences
+ += 1;
+ seq_pos++;
+ }
+ else /* non gap character */ {
+ if(use_replacement_letters == YES &&
+ replacement_letter_multi(&cur_letter, replacement_letters, msa_seq_infop, hmmp, seq_pos,3) == YES){
+ /* adding of nr_occurences for these letters is done on the fly in replacement_letter-function */
+ }
+ else /* add occurrence of this letter */ {
+ a_index = get_alphabet_index(&cur_letter, hmmp->alphabet_3, hmmp->a_size_3);
+ if(a_index < 0) {
+ printf("Could not read msa seq from file: letter '%s' is not in alphabet\n", cur_letter.letter);
+ exit(0);
+ }
+ else {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos,a_index, hmmp->a_size_3+1))->nr_occurences += 1;
+ }
+ }
+ seq_pos++;
+ }
+ if(get_query_letters == YES) {
+ memcpy((msa_seq_infop->msa_seq_3 + l * (hmmp->a_size_3 + 1))->query_letter, &(cur_letter.letter),
+ sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ }
+ }
+ }
+
+ /* read fourth alphabet, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ if(hmmp->nr_alphabets > 3) {
+ seq_pos = 0;
+ inside_seq = NO;
+ seq_index = 0;
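+ /* k and nr_lead_columns keep the lead-column values found while reading
+ * alphabet 1; resetting them here would discard those columns */
+ l = 0;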
+ last = '!';
+ for(m = 0; m < nr_seqs;) {
+ i = fgetc(seqfile);
+ if(i == EOF) {
+ break;
+ }
+ c = (char)i;
+ //printf("alpha4:c = %c\n",c);
+
+ if(c == '<' && last != '<') {
+ seq_pos = 0;
+ inside_seq = YES;
+ seq_index++;
+ if(seq_index == lead_seq) {
+ get_query_letters = YES;
+ }
+ last = c;
+ }
+ else if(c == '<') {
+ last = c;
+ }
+ else if(c == '#') {
+ fgets(line, MAX_LINE, seqfile);
+ cur_line = line;
+ while(*cur_line == '#') {
+ cur_line++;
+ }
+ while(*cur_line != '+') {
+ if(*cur_line == '_' || *cur_line == ' ' ||
+ (*cur_line == '-' && *(cur_line + 1) == ';') || *cur_line == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos, 0, hmmp->a_size_4+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos, 0, hmmp->a_size_4+1))->nr_occurences = 0.0;
+ /* seq_pos is advanced once, below, when the ';' is consumed */
+ cur_line++;
+ }
+ else if(*cur_line == 'X') {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos, 0, hmmp->a_size_4+1))->share = 0.0;
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos, 0, hmmp->a_size_4+1))->nr_occurences = -1.0;
+ /* seq_pos is advanced once, below, when the ';' is consumed */
+ cur_line++;
+ }
+ else {
+ letter_val = strtod(cur_line, &cur_line);
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos,0, hmmp->a_size_4+1))->share = letter_val;
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos,0, hmmp->a_size_4+1))->nr_occurences = 1.0;
+ }
+ if(*cur_line != ';') {
+ printf("Strange continuous sequence file format\n");
+ exit(0);
+ }
+ else {
+ cur_line = cur_line + 1;
+ seq_pos++;
+ }
+ }
+ break;
+ }
+ else if(c == '>') {
+ inside_seq = NO;
+ get_query_letters = NO;
+ m++;
+ last = '!';
+ }
+ else if(inside_seq == YES) {
+ /* reading one letter */
+ letter_pos = 0;
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ while((c = (char)fgetc(seqfile)) != ';') {
+ if(letter_pos >= 4) { /* reject before writing past the 4-character limit */
+ printf("max letter size = 4 characters\n");
+ exit(0);
+ }
+ cur_letter.letter[letter_pos] = c;
+ letter_pos++;
+ }
+ cur_letter.letter[letter_pos] = '\0';
+ if(seq_pos + 1 > msa_length) {
+ msa_length = seq_pos + 1; /* update nr of positions */
+ }
+ if(cur_letter.letter[0] == '_' || cur_letter.letter[0] == ' ' ||
+ cur_letter.letter[0] == '-' || cur_letter.letter[0] == '.') /* gap */ {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos, hmmp->a_size_4, hmmp->a_size_4+1))->nr_occurences
+ += 1;
+ seq_pos++;
+ }
+ else /* non gap character */ {
+ if(use_replacement_letters == YES &&
+ replacement_letter_multi(&cur_letter, replacement_letters, msa_seq_infop, hmmp, seq_pos,4) == YES){
+ /* adding of nr_occurences for these letters is done on the fly in replacement_letter-function */
+ }
+ else /* add occurrence of this letter */ {
+ a_index = get_alphabet_index(&cur_letter, hmmp->alphabet_4, hmmp->a_size_4);
+ if(a_index < 0) {
+ printf("Could not read msa seq from file: letter '%s' is not in alphabet\n", cur_letter.letter);
+ exit(0);
+ }
+ else {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos,a_index, hmmp->a_size_4+1))->nr_occurences += 1;
+ }
+ }
+ seq_pos++;
+ }
+ if(get_query_letters == YES) {
+ memcpy((msa_seq_infop->msa_seq_4 + l * (hmmp->a_size_4 + 1))->query_letter, &(cur_letter.letter),
+ sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ }
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_STD
+ printf("reached checkpoint 2\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+
+ *(msa_seq_infop->lead_columns_start + k) = END;
+ msa_seq_infop->lead_columns_end = msa_seq_infop->lead_columns_start + k;
+ msa_seq_infop->nr_lead_columns = nr_lead_columns;
+ //msa_seq_infop->nr_seqs = nr_seqs;
+ msa_seq_infop->msa_seq_length = msa_length;
+ msa_seq_infop->gap_shares = (double*)malloc_or_die(msa_length * sizeof(double));
+
+ tot_nr_gaps = 0;
+ gaps_per_column = 0;
+ /* go through each position and calculate distribution for all letters in alphabet 1 */
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ if(j == hmmp->a_size) {
+ tot_nr_gaps += (int)(msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ gaps_per_column = (int)(msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ }
+ }
+ if(use_priordistribution_1 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_1, hmmp, msa_seq_infop, i,1);
+
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ if(hmmp->alphabet_type == DISCRETE) {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(i,j, hmmp->a_size+1))->share =
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences /
+ occurences_per_column;
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(i,j, hmmp->a_size+1))->label = '.';
+ }
+ }
+
+ /* calculate gap_share */
+ if(hmmp->alphabet_type == DISCRETE) {
+ *(msa_seq_infop->gap_shares + i) = (double)(gaps_per_column) / occurences_per_column;
+ }
+ }
+
+
+ /* go through each position and calculate distribution for all letters in alphabet 2 */
+ if(hmmp->nr_alphabets > 1) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_2 +
+ get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences;
+ }
+ if(use_priordistribution_2 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_2, hmmp, msa_seq_infop, i,2);
+ //printf("inside if\n");
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(i,j, hmmp->a_size_2+1))->share =
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences /
+ occurences_per_column;
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(i,j, hmmp->a_size_2+1))->label = '.';
+ }
+ }
+ }
+ }
+ /* go through each position and calculate distribution for all letters in alphabet 3 */
+ if(hmmp->nr_alphabets > 2) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_3 +
+ get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences;
+ }
+ if(use_priordistribution_3 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_3, hmmp, msa_seq_infop, i,3);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(i,j, hmmp->a_size_3+1))->share =
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences /
+ occurences_per_column;
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(i,j, hmmp->a_size_3+1))->label = '.';
+ }
+ }
+ }
+ }
+
+ /* go through each position and calculate distribution for all letters in alphabet 4 */
+ if(hmmp->nr_alphabets > 3) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_4 +
+ get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences;
+ }
+ if(use_priordistribution_4 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_4, hmmp, msa_seq_infop, i,4);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(i,j, hmmp->a_size_4+1))->share =
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences /
+ occurences_per_column;
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(i,j, hmmp->a_size_4+1))->label = '.';
+ }
+ }
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_STD
+ printf("reached checkpoint 3\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+
+#endif
+/* allocate memory for gaps array and set initial pointers for each position*/
+ msa_seq_infop->gaps = (int**)malloc_or_die(msa_length * sizeof(int*) +
+ (tot_nr_gaps + msa_length) * sizeof(int));
+ nr_gaps = 0;
+ for(i = 0; i < msa_length; i++) {
+ *(msa_seq_infop->gaps + i) = (int*)(msa_seq_infop->gaps + msa_length) + nr_gaps;
+ nr_gaps += (int)(msa_seq_infop->msa_seq_1 + get_mtx_index(i, hmmp->a_size, hmmp->a_size+1))->nr_occurences;
+ nr_gaps += 1;
+ }
+#ifdef DEBUG_MSA_SEQ_STD
+ printf("%p\n", (void*)msa_seq_infop->gaps);
+ printf("%p\n", (void*)*(msa_seq_infop->gaps));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 1));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 2));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 3));
+ printf("nr_seqs: %d\n", nr_seqs);
+#endif
+ /********** verify that this is done correctly even if there are multiple alphabets */
+
+ /* go through every sequence and store which sequences have gaps at what
+ * positions in gaps array */
+ rewind(seqfile);
+ inside_seq = NO;
+ seq_pos = 0;
+ cur_posp = msa_seq_infop->gaps;
+ last = '!';
+ for(i = 0; i < nr_seqs;) {
+ c = (char)fgetc(seqfile);
+ if(c == '<' && last == '<') {
+ break;
+ }
+ else if(c == '<') {
+ inside_seq = YES;
+ cur_posp = msa_seq_infop->gaps;
+ last = '<';
+ }
+ else if(c == '>') {
+ inside_seq = NO;
+ i++;
+ last = '!';
+ }
+ else if(inside_seq == YES && (c == '_' || c == ' ' || c == '-' || c == '.')) {
+ (**cur_posp) = i+1;
+
+#ifdef DEBUG_MSA_SEQ_STD
+ printf("posp = %p\n", (void*)cur_posp);
+ printf("*posp = %p\n", (void*)*cur_posp);
+ printf("**posp = %x\n", **cur_posp);
+#endif
+ (*cur_posp)++;
+ cur_posp++;
+ while((c = (char)fgetc(seqfile)) != ';') {
+
+ }
+ last = '!';
+ }
+ else if(inside_seq == YES) {
+ while((c = (char)fgetc(seqfile)) != ';') {
+
+ }
+ cur_posp++;
+ last = '!';
+ }
+ }
+ cur_posp = msa_seq_infop->gaps;
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ (**cur_posp) = END;
+ cur_posp++;
+ }
+
+ nr_gaps = 0;
+ for(i = 0; i < msa_length; i++) {
+ *(msa_seq_infop->gaps + i) = (int*)(msa_seq_infop->gaps + msa_length) + nr_gaps;
+ nr_gaps += (int)(msa_seq_infop->msa_seq_1 + get_mtx_index(i, hmmp->a_size, hmmp->a_size+1))->nr_occurences;
+ nr_gaps += 1;
+ }
+
+
+
+#ifdef DEBUG_MSA_SEQ_STD
+ dump_msa_seqs_multi(msa_seq_infop, hmmp);
+ printf("reached checkpoint 4\n");
+ exit(0);
+#endif
+
+ /* cleanup and return */
+ if(use_priordistribution_1 == DIRICHLET) {
+ if(em_di_1.nr_components > 0) {
+ free(em_di_1.q_values);
+ free(em_di_1.alpha_sums);
+ free(em_di_1.logbeta_values);
+ free(em_di_1.prior_values);
+ }
+ }
+ if(use_priordistribution_2 == DIRICHLET) {
+ if(em_di_2.nr_components > 0) {
+ free(em_di_2.q_values);
+ free(em_di_2.alpha_sums);
+ free(em_di_2.logbeta_values);
+ free(em_di_2.prior_values);
+ }
+ }
+ if(use_priordistribution_3 == DIRICHLET) {
+ if(em_di_3.nr_components > 0) {
+ free(em_di_3.q_values);
+ free(em_di_3.alpha_sums);
+ free(em_di_3.logbeta_values);
+ free(em_di_3.prior_values);
+ }
+ }
+ if(use_priordistribution_4 == DIRICHLET) {
+ if(em_di_4.nr_components > 0) {
+ free(em_di_4.q_values);
+ free(em_di_4.alpha_sums);
+ free(em_di_4.logbeta_values);
+ free(em_di_4.prior_values);
+ }
+ }
+
+ return;
+}
+
+
+/* Note: msa_seq_infop->seq and msa_seq_infop->gaps will be allocated here but must be
+ * freed by caller */
+void get_sequences_msa_prf_multi(FILE *seqfile, FILE *priorfile, struct msa_sequences_multi_s *msa_seq_infop,
+ struct hmm_multi_s *hmmp)
+{
+
+ int i,j,k,l,m;
+ int done, inside_seq, seq_pos, letter_pos, a_index, nr_seqs, cur_pos, cur_seq;
+ int msa_length, seq_length, longest_seq, longest_seq_length;
+ int seq_index, nr_lead_columns;
+ int gaps_per_column, tot_nr_gaps, nr_gaps;
+ double occurences_per_column;
+ int get_letter_columns, get_query_letters;
+ int use_priordistribution,use_priordistribution_2, use_priordistribution_3, use_priordistribution_4;
+ char c;
+ char s[MAX_LINE];
+ struct letter_s cur_letter;
+ int **cur_posp;
+ struct emission_dirichlet_s em_di,em_di_2, em_di_3, em_di_4;
+ int is_first;
+ int use_replacement_letters;
+ int is_empty;
+ int col_pos;
+ char *seq_ptr;
+ char *endptr;
+ double nr_occurences;
+ int label_pos, cur_letter_pos;
+ char label, letter;
+ int read_priorfile;
+
+
+ /* find out length of alignment and allocate memory for probability matrix by
+ * reading all sequence rows and remembering the longest */
+ done = NO;
+ inside_seq = NO;
+ msa_length = 0;
+ seq_length = 0;
+ seq_index = 0;
+
+if(priorfile == NULL) {
+ use_priordistribution = NONE;
+ use_priordistribution_2 = NONE;
+ use_priordistribution_3 = NONE;
+ use_priordistribution_4 = NONE;
+ }
+ else {
+ read_priorfile = read_multi_prior_file_multi(&em_di, hmmp, priorfile, 1);
+ if(read_priorfile > 0) {
+ use_priordistribution = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution = NONE;
+ }
+ if(hmmp->nr_alphabets > 1) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_2, hmmp, priorfile, 2);
+ if(read_priorfile > 0) {
+ use_priordistribution_2 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_2 = NONE;
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_3, hmmp, priorfile, 3);
+ if(read_priorfile > 0) {
+ use_priordistribution_3 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_3 = NONE;
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ read_priorfile = read_multi_prior_file_multi(&em_di_4, hmmp, priorfile, 4);
+ if(read_priorfile > 0) {
+ use_priordistribution_4 = DIRICHLET;
+ }
+ else if(read_priorfile < 0) {
+ printf("Error: Incorrect priorfile format\n");
+ exit(0);
+ }
+ else {
+ use_priordistribution_4 = NONE;
+ }
+ }
+ }
+
+
+
+ /* check if file is empty */
+ is_empty = YES;
+ while(done != YES) {
+ c = (char)fgetc(seqfile);
+ if((int)c == EOF) {
+ break;
+ }
+ else {
+ is_empty = NO;
+ break;
+ }
+ }
+ if(is_empty == YES) {
+ if(verbose == YES) {
+ printf("File is empty\n");
+ }
+ }
+ else {
+ }
+ rewind(seqfile);
+
+ while(fgets(s, MAX_LINE, seqfile) != NULL) {
+ if(strncmp(s,"COL",3) == 0) {
+ msa_length = atoi(&s[4]);
+ }
+ }
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 1\n");
+ printf("msa_length = %d\n", msa_length);
+#endif
+
+
+ msa_seq_infop->lead_columns_start = (int*)malloc_or_die((msa_length+1) * sizeof(int));
+ msa_seq_infop->msa_seq_1 = (struct msa_letter_s*)
+ malloc_or_die(msa_length * (hmmp->a_size+1) * sizeof(struct msa_letter_s));
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("malloced ok\n");
+#endif
+
+ /* read alphabet 1, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ rewind(seqfile);
+ seq_pos = 0;
+ nr_lead_columns = 0;
+ k = 0;
+ l = 0;
+ inside_seq = NO;
+ while(fgets(s, MAX_LINE, seqfile) != NULL) {
+ if(strncmp(s,"END 1",5 ) == 0) {
+ break;
+ }
+ if(strncmp(s,"START 1",7) == 0) {
+ inside_seq = YES;
+ }
+ else if(strncmp(s,"NR of aligned sequences",23) == 0) {
+ nr_seqs = strtol(s + 24, NULL, 10);
+ }
+ else if(strncmp(s,"COL",3) == 0 && inside_seq == YES) {
+ /* split into columns depending on a_size, add '-' */
+ col_pos = 11;
+ seq_ptr = s + col_pos;
+ for(m = 0; m < hmmp->a_size + 1; m++) {
+ nr_occurences = strtod(seq_ptr, &endptr);
+ if(endptr == seq_ptr) {
+ printf("Error reading column: no frequency was read\n");
+ exit(0);
+ }
+ else {
+ /* add nr of occurrences to data structure */
+ seq_ptr = endptr;
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos,m, hmmp->a_size+1))->nr_occurences = nr_occurences;
+ }
+ }
+ /* read space */
+ strtod(seq_ptr, &endptr);
+ seq_ptr = endptr;
+
+ /* read label */
+ label_pos = 0;
+ for(label_pos = 0;seq_ptr[label_pos] != '\n';label_pos++) {
+ if(seq_ptr[label_pos] == ' ') {
+
+ }
+ else {
+ label = seq_ptr[label_pos];
+ seq_ptr = seq_ptr + label_pos + 1;
+ break;
+ }
+ }
+
+ /* read query letter */
+ letter_pos = 0;
+ cur_letter_pos = 0;
+ for(letter_pos = 0;seq_ptr[letter_pos] != '\n';letter_pos++) {
+ if(seq_ptr[letter_pos] == ' ') {
+
+ }
+ else {
+ letter = seq_ptr[letter_pos];
+ cur_letter.letter[cur_letter_pos] = letter;
+ cur_letter_pos++;
+ seq_ptr = seq_ptr + letter_pos + 1;
+ break;
+ }
+ }
+
+ if(cur_letter_pos >= 5) {
+ printf("Maximum of four characters for one letter\n");
+ exit(0);
+ }
+ else {
+ cur_letter.letter[cur_letter_pos] = '\0';
+ }
+
+ /* store lead seq and query columns + label */
+ if(cur_letter.letter[0] != '-') {
+ *(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ if(label == '-') {
+ label = '.';
+ }
+ /* only add label to first msa_letter of the first alphabet */
+ (msa_seq_infop->msa_seq_1 + (*(msa_seq_infop->lead_columns_start + k)) * (hmmp->a_size+1))->label = label;
+ k++;
+ nr_lead_columns++;
+ }
+ if(1 == 1) {
+ memcpy((msa_seq_infop->msa_seq_1 + l * (hmmp->a_size + 1))->query_letter, &(cur_letter.letter),
+ sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ seq_pos++;
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 2\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+
+
+
+
+ /* read alphabet 2, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ if(hmmp->nr_alphabets > 1) {
+ msa_seq_infop->msa_seq_2 = (struct msa_letter_s*)
+ malloc_or_die(msa_length * (hmmp->a_size_2+1) * sizeof(struct msa_letter_s));
+ rewind(seqfile);
+ seq_pos = 0;
+ k = 0;
+ l = 0;
+ inside_seq = NO;
+ while(fgets(s, MAX_LINE, seqfile) != NULL) {
+ if(strncmp(s,"END 2",5 ) == 0) {
+ break;
+ }
+ if(strncmp(s,"START 2",7) == 0) {
+ inside_seq = YES;
+ }
+ else if(strncmp(s,"NR of aligned sequences",23) == 0) {
+ nr_seqs = strtol(s + 24, NULL, 10);
+ }
+ else if(strncmp(s,"COL",3) == 0 && inside_seq == YES) {
+ /* split into columns depending on a_size, add '-' */
+ col_pos = 11;
+ seq_ptr = s + col_pos;
+ for(m = 0; m < hmmp->a_size_2 + 1; m++) {
+ nr_occurences = strtod(seq_ptr, &endptr);
+ if(endptr == seq_ptr) {
+ printf("Error reading column: no frequency was read\n");
+ exit(0);
+ }
+ else {
+ /* add nr of occurrences to data structure */
+ seq_ptr = endptr;
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos,m, hmmp->a_size_2+1))->nr_occurences = nr_occurences;
+ }
+ }
+ /* read space */
+ strtod(seq_ptr, &endptr);
+ seq_ptr = endptr;
+
+ /* read label */
+ label_pos = 0;
+ for(label_pos = 0;seq_ptr[label_pos] != '\n';label_pos++) {
+ if(seq_ptr[label_pos] == ' ') {
+
+ }
+ else {
+ label = seq_ptr[label_pos];
+ seq_ptr = seq_ptr + label_pos + 1;
+ break;
+ }
+ }
+
+ /* read query letter */
+ letter_pos = 0;
+ cur_letter_pos = 0;
+ for(letter_pos = 0;seq_ptr[letter_pos] != '\n';letter_pos++) {
+ if(seq_ptr[letter_pos] == ' ') {
+
+ }
+ else {
+ letter = seq_ptr[letter_pos];
+ cur_letter.letter[cur_letter_pos] = letter;
+ cur_letter_pos++;
+ seq_ptr = seq_ptr + letter_pos + 1;
+ break;
+ }
+ }
+
+ if(cur_letter_pos >= 5) {
+ printf("Maximum of four characters for one letter\n");
+ exit(0);
+ }
+ else {
+ cur_letter.letter[cur_letter_pos] = '\0';
+ }
+
+ /* store lead seq and query columns + label */
+ if(cur_letter.letter[0] != '-') {
+ //*(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ //(msa_seq_infop->msa_seq_2 + (*(msa_seq_infop->lead_columns_start + k)) * (hmmp->a_size_2+1))->label = label;
+ k++;
+ }
+ if(1 == 1) {
+ //memcpy((msa_seq_infop->msa_seq_2 + l * (hmmp->a_size_2 + 1))->query_letter, &(cur_letter.letter),
+ //sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ seq_pos++;
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 2.1\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ msa_seq_infop->msa_seq_3 = (struct msa_letter_s*)
+ malloc_or_die(msa_length * (hmmp->a_size_3+1) * sizeof(struct msa_letter_s));
+
+ /* read alphabet 3, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ rewind(seqfile);
+ seq_pos = 0;
+ k = 0;
+ l = 0;
+ inside_seq = NO;
+ while(fgets(s, MAX_LINE, seqfile) != NULL) {
+ if(strncmp(s,"END 3",5 ) == 0) {
+ break;
+ }
+ if(strncmp(s,"START 3",7) == 0) {
+ inside_seq = YES;
+ }
+ else if(strncmp(s,"NR of aligned sequences",23) == 0) {
+ nr_seqs = strtol(s + 24, NULL, 10);
+ }
+ else if(strncmp(s,"COL",3) == 0 && inside_seq == YES) {
+ /* split into columns depending on a_size, add '-' */
+ col_pos = 11;
+ seq_ptr = s + col_pos;
+ for(m = 0; m < hmmp->a_size_3 + 1; m++) {
+ nr_occurences = strtod(seq_ptr, &endptr);
+ if(endptr == seq_ptr) {
+ printf("Error reading column: no frequency was read\n");
+ exit(0);
+ }
+ else {
+ /* add nr of occurrences to data structure */
+ seq_ptr = endptr;
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos,m, hmmp->a_size_3+1))->nr_occurences = nr_occurences;
+ }
+ }
+ /* read space */
+ strtod(seq_ptr, &endptr);
+ seq_ptr = endptr;
+
+ /* read label */
+ label_pos = 0;
+ for(label_pos = 0;seq_ptr[label_pos] != '\n';label_pos++) {
+ if(seq_ptr[label_pos] == ' ') {
+
+ }
+ else {
+ label = seq_ptr[label_pos];
+ seq_ptr = seq_ptr + label_pos + 1;
+ break;
+ }
+ }
+
+ /* read query letter */
+ letter_pos = 0;
+ cur_letter_pos = 0;
+ for(letter_pos = 0;seq_ptr[letter_pos] != '\n';letter_pos++) {
+ if(seq_ptr[letter_pos] == ' ') {
+
+ }
+ else {
+ letter = seq_ptr[letter_pos];
+ cur_letter.letter[cur_letter_pos] = letter;
+ cur_letter_pos++;
+ seq_ptr = seq_ptr + letter_pos + 1;
+ break;
+ }
+ }
+
+ if(cur_letter_pos >= 5) {
+ printf("Maximum of four characters for one letter\n");
+ exit(0);
+ }
+ else {
+ cur_letter.letter[cur_letter_pos] = '\0';
+ }
+
+ /* store lead seq and query columns + label */
+ if(cur_letter.letter[0] != '-') {
+ //*(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ //(msa_seq_infop->msa_seq_3 + (*(msa_seq_infop->lead_columns_start + k)) * (hmmp->a_size_3+1))->label = label;
+ k++;
+ }
+ if(1 == 1) {
+ //memcpy((msa_seq_infop->msa_seq_3 + l * (hmmp->a_size_3 + 1))->query_letter, &(cur_letter.letter),
+ // sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ seq_pos++;
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 2.2\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ msa_seq_infop->msa_seq_4 = (struct msa_letter_s*)
+ malloc_or_die(msa_length * (hmmp->a_size_4+1) * sizeof(struct msa_letter_s));
+
+ /* read alphabet 4, save nr of occurrences of each letter in each position,
+ * including nr of gaps */
+ rewind(seqfile);
+ seq_pos = 0;
+ k = 0;
+ l = 0;
+ inside_seq = NO;
+ while(fgets(s, MAX_LINE, seqfile) != NULL) {
+ if(strncmp(s,"END 4",5 ) == 0) {
+ break;
+ }
+ if(strncmp(s,"START 4",7) == 0) {
+ inside_seq = YES;
+ }
+ else if(strncmp(s,"NR of aligned sequences",23) == 0) {
+ nr_seqs = strtol(s + 24, NULL, 10);
+ }
+ else if(strncmp(s,"COL",3) == 0 && inside_seq == YES) {
+ /* split into columns depending on a_size, add '-' */
+ col_pos = 11;
+ seq_ptr = s + col_pos;
+ for(m = 0; m < hmmp->a_size_4 + 1; m++) {
+ nr_occurences = strtod(seq_ptr, &endptr);
+ if(endptr == seq_ptr) {
+ printf("Error reading column: no frequency was read\n");
+ exit(0);
+ }
+ else {
+ /* add nr of occurrences to data structure */
+ seq_ptr = endptr;
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos,m, hmmp->a_size_4+1))->nr_occurences = nr_occurences;
+ }
+ }
+ /* read space */
+ strtod(seq_ptr, &endptr);
+ seq_ptr = endptr;
+
+ /* read label */
+ label_pos = 0;
+ for(label_pos = 0;seq_ptr[label_pos] != '\n';label_pos++) {
+ if(seq_ptr[label_pos] == ' ') {
+
+ }
+ else {
+ label = seq_ptr[label_pos];
+ seq_ptr = seq_ptr + label_pos + 1;
+ break;
+ }
+ }
+
+ /* read query letter */
+ letter_pos = 0;
+ cur_letter_pos = 0;
+ for(letter_pos = 0;seq_ptr[letter_pos] != '\n';letter_pos++) {
+ if(seq_ptr[letter_pos] == ' ') {
+
+ }
+ else {
+ letter = seq_ptr[letter_pos];
+ cur_letter.letter[cur_letter_pos] = letter;
+ cur_letter_pos++;
+ seq_ptr = seq_ptr + letter_pos + 1;
+ break;
+ }
+ }
+
+ if(cur_letter_pos >= 5) {
+ printf("Maximum of four characters for one letter\n");
+ exit(0);
+ }
+ else {
+ cur_letter.letter[cur_letter_pos] = '\0';
+ }
+
+ /* store lead seq and query columns + label */
+ if(cur_letter.letter[0] != '-') {
+ //*(msa_seq_infop->lead_columns_start + k) = seq_pos; /* letter column */
+ //(msa_seq_infop->msa_seq_4 + (*(msa_seq_infop->lead_columns_start + k)) * (hmmp->a_size_4+1))->label = label;
+ k++;
+ }
+ if(1 == 1) {
+ //memcpy((msa_seq_infop->msa_seq_4 + l * (hmmp->a_size_4 + 1))->query_letter, &(cur_letter.letter),
+ //sizeof(char) * 5); /* query letter */
+ l++;
+ }
+ seq_pos++;
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 2.3\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+ }
+
+
+ /* from here on, everything should be untouched */
+ *(msa_seq_infop->lead_columns_start + k) = END;
+ msa_seq_infop->lead_columns_end = msa_seq_infop->lead_columns_start + k;
+ msa_seq_infop->nr_lead_columns = nr_lead_columns;
+ msa_seq_infop->nr_seqs = nr_seqs;
+ msa_seq_infop->msa_seq_length = msa_length;
+ msa_seq_infop->gap_shares = (double*)malloc_or_die(msa_length * sizeof(double));
+ tot_nr_gaps = 0;
+ gaps_per_column = 0;
+
+ /* go through each position and calculate distribution for all letters in alphabet 1 */
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ if(j == hmmp->a_size) {
+ tot_nr_gaps += (int)(msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ gaps_per_column = (int)(msa_seq_infop->msa_seq_1 +
+ get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences;
+ }
+ }
+ if(use_priordistribution == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di, hmmp, msa_seq_infop, i,1);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size+1; j++) {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(i,j, hmmp->a_size+1))->share =
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(i,j, hmmp->a_size+1))->nr_occurences /
+ occurences_per_column;
+ }
+
+ /* calculate gap_share */
+ *(msa_seq_infop->gap_shares + i) = (double)(gaps_per_column) / occurences_per_column;
+ }
+
+ /* go through each position and calculate distribution for all letters in alphabet 2 */
+ if(hmmp->nr_alphabets > 1) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_2 +
+ get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences;
+ if(j == hmmp->a_size_2) {
+ tot_nr_gaps += (int)(msa_seq_infop->msa_seq_2 +
+ get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences;
+ gaps_per_column = (int)(msa_seq_infop->msa_seq_2 +
+ get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences;
+ }
+ }
+ if(use_priordistribution_2 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_2, hmmp, msa_seq_infop, i,2);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_2+1; j++) {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(i,j, hmmp->a_size_2+1))->share =
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(i,j, hmmp->a_size_2+1))->nr_occurences /
+ occurences_per_column;
+ }
+
+ /* calculate gap_share */
+ *(msa_seq_infop->gap_shares + i) = (double)(gaps_per_column) / occurences_per_column;
+ }
+ }
+
+ /* go through each position and calculate distribution for all letters in alphabet 3 */
+ if(hmmp->nr_alphabets > 2) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_3 +
+ get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences;
+ if(j == hmmp->a_size_3) {
+ tot_nr_gaps += (int)(msa_seq_infop->msa_seq_3 +
+ get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences;
+ gaps_per_column = (int)(msa_seq_infop->msa_seq_3 +
+ get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences;
+ }
+ }
+ if(use_priordistribution_3 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_3, hmmp, msa_seq_infop, i,3);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_3+1; j++) {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(i,j, hmmp->a_size_3+1))->share =
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(i,j, hmmp->a_size_3+1))->nr_occurences /
+ occurences_per_column;
+ }
+
+ /* calculate gap_share */
+ *(msa_seq_infop->gap_shares + i) = (double)(gaps_per_column) / occurences_per_column;
+ }
+ }
+
+ /* go through each position and calculate distribution for all letters in alphabet 4 */
+ if(hmmp->nr_alphabets > 3) {
+ for(i = 0; i < msa_length; i++) {
+ occurences_per_column = 0;
+ gaps_per_column = 0;
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ occurences_per_column += (msa_seq_infop->msa_seq_4 +
+ get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences;
+ if(j == hmmp->a_size_4) {
+ tot_nr_gaps += (int)(msa_seq_infop->msa_seq_4 +
+ get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences;
+ gaps_per_column = (int)(msa_seq_infop->msa_seq_4 +
+ get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences;
+ }
+ }
+ if(use_priordistribution_4 == DIRICHLET) /* calculate share update using dirichlet prior mixture */ {
+ update_shares_prior_multi(&em_di_4, hmmp, msa_seq_infop, i,4);
+ }
+ /* simple share update just using the standard quotient */
+ for(j = 0; j < hmmp->a_size_4+1; j++) {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(i,j, hmmp->a_size_4+1))->share =
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(i,j, hmmp->a_size_4+1))->nr_occurences /
+ occurences_per_column;
+ }
+
+ /* calculate gap_share */
+ *(msa_seq_infop->gap_shares + i) = (double)(gaps_per_column) / occurences_per_column;
+ }
+ }
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("reached checkpoint 3\n");
+ printf("total_nr_gaps = %d\n", tot_nr_gaps);
+#endif
+/* allocate memory for gaps array and set initial pointers for each position*/
+ msa_seq_infop->gaps = (int**)malloc_or_die(msa_length * sizeof(int*) +
+ (tot_nr_gaps + msa_length) * sizeof(int));
+ nr_gaps = 0;
+ for(i = 0; i < msa_length; i++) {
+ *(msa_seq_infop->gaps + i) = (int*)(msa_seq_infop->gaps + msa_length) + i;
+ **(msa_seq_infop->gaps + i) = END;
+ }
+#ifdef DEBUG_MSA_SEQ_PRF
+ printf("%p\n", (void*)msa_seq_infop->gaps);
+ printf("%p\n", (void*)*(msa_seq_infop->gaps));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 1));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 2));
+ printf("%p\n", (void*)*(msa_seq_infop->gaps + 3));
+ printf("nr_seqs: %d\n", nr_seqs);
+#endif
+
+#ifdef DEBUG_MSA_SEQ_PRF
+ dump_msa_seqs_multi(msa_seq_infop, hmmp);
+ exit(0);
+#endif
+
+ /* cleanup and return */
+ if(use_priordistribution == DIRICHLET) {
+ free(em_di.q_values);
+ free(em_di.alpha_sums);
+ free(em_di.logbeta_values);
+ free(em_di.prior_values);
+ }
+ if(hmmp->nr_alphabets > 1 && use_priordistribution_2 == DIRICHLET) {
+ free(em_di_2.q_values);
+ free(em_di_2.alpha_sums);
+ free(em_di_2.logbeta_values);
+ free(em_di_2.prior_values);
+ }
+ if(hmmp->nr_alphabets > 2 && use_priordistribution_3 == DIRICHLET) {
+ free(em_di_3.q_values);
+ free(em_di_3.alpha_sums);
+ free(em_di_3.logbeta_values);
+ free(em_di_3.prior_values);
+ }
+ if(hmmp->nr_alphabets > 3 && use_priordistribution_4 == DIRICHLET) {
+ free(em_di_4.q_values);
+ free(em_di_4.alpha_sums);
+ free(em_di_4.logbeta_values);
+ free(em_di_4.prior_values);
+ }
+ return;
+}
diff --git a/modhmm0.92b/readseqs_multialpha.c.flc b/modhmm0.92b/readseqs_multialpha.c.flc
new file mode 100644
index 0000000..663d463
--- /dev/null
+++ b/modhmm0.92b/readseqs_multialpha.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17618 . 33645)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
diff --git a/modhmm0.92b/std_calculation_funcs.c b/modhmm0.92b/std_calculation_funcs.c
new file mode 100644
index 0000000..693249a
--- /dev/null
+++ b/modhmm0.92b/std_calculation_funcs.c
@@ -0,0 +1,1459 @@
+#include <stdio.h>
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include <float.h>
+
+
+
+
+#include "structs.h"
+#include "funcs.h"
+
+//#define DEBUG_LABELING_UPDATE
+//#define DEBUG_DEALLOCATE_LABELINGS
+
+
+#define REST_LETTER_INDEX 0.5
+
+#define V_LIST_END -99
+#define V_LIST_NEXT -9
+
+
+double get_single_gaussian_statescore(double mu, double sigma_square, double letter)
+{
+
+ double res;
+
+ if(sigma_square <= 0.0) {
+ return 0.0;
+ }
+ else {
+ res = exp(0.0 - (pow((letter-mu),2) / (2.0 * sigma_square))) / sqrt(sigma_square * 2.0 * 3.14159265358979);
+ //printf("mu = %f, letter = %f, sigma_square = %f, res = %f\n", mu, letter, sigma_square, res);
+ return res;
+ }
+}
+
+double get_dp_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares)
+{
+
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+ t_res_3 = 0.0;
+ /* scoring using dot-product method */
+ for(a_index = 0; a_index < a_size; a_index++) {
+
+ if(use_prior_shares == YES) {
+ t_res_3 += *(emissions + (vertex * a_size + a_index)) *
+ (msa_seq + (p * (a_size+1) + a_index))->prior_share;
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all gap column in sequence\n");
+ exit(0);
+ }
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share);
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share;
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ }
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer);
+ }
+ }
+
+ if(t_res_3 < 0.0) {
+ printf("t_res_3 = %f\n", t_res_3);
+ printf("Error: got strange dot product state result value\n");
+ exit(0);
+ }
+
+ return t_res_3;
+}
+
+
+double get_dp_picasso_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs)
+{
+
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+ t_res_3 = 0.0;
+ /* scoring using dot-product method, with shares divided by background frequencies (picasso) */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ t_res_3 += *(emissions + (vertex * a_size + a_index)) *
+ (msa_seq + (p * (a_size+1) + a_index))->prior_share / *(aa_freqs + a_index);
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ ((1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share) * *(aa_freqs + a_index));
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share / *(aa_freqs + a_index);
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ }
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+#ifdef DEBUG_BW
+ printf("state_normalizer = %f\n", state_normalizer);
+ printf("seq_normalizer = %f\n", seq_normalizer);
+#endif
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer);
+ }
+ }
+
+ if(t_res_3 < 0.0) {
+ printf("t_res_3 = %f\n", t_res_3);
+ printf("Error: got negative dot product state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_sjolander_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares)
+{
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+ t_res_3 = 1.0;
+ /* scoring using sjolander score method */
+
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ t_res_3 *= pow(*(emissions + (vertex * a_size + a_index)),
+ (msa_seq + (p * (a_size+1) + a_index))->prior_share);
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+
+ t_res_3 *= pow(*(emissions + get_mtx_index(vertex, a_index, a_size)),
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share));
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else {
+ t_res_3 *= pow(*(emissions + get_mtx_index(vertex, a_index, a_size)),
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share);
+
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ }
+
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+#ifdef DEBUG_BW
+ printf("state_normalizer = %f\n", state_normalizer);
+ printf("seq_normalizer = %f\n", seq_normalizer);
+#endif
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer);
+ }
+ }
+ if(t_res_3 < 0.0) {
+ printf("Error: got negative geometric mean state result value\n");
+ printf("t_res_3 = %f\n", t_res_3);
+ exit(1);
+ }
+ return t_res_3;
+}
+
+
+
+double get_sjolander_reversed_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares)
+{
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+
+ t_res_3 = 1.0;
+ /* scoring using reversed sjolander score method: shares raised to the power of the emissions */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share != 0.0) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share,
+ *(emissions + get_mtx_index(vertex, a_index, a_size)));
+ }
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share),
+ *(emissions + get_mtx_index(vertex, a_index, a_size)));
+ }
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ else {
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->share,
+ *(emissions + get_mtx_index(vertex, a_index, a_size)));
+ }
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ }
+ }
+ }
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+#ifdef DEBUG_BW
+ printf("state_normalizer = %f\n", state_normalizer);
+ printf("seq_normalizer = %f\n", seq_normalizer);
+#endif
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer);
+ }
+ }
+ if(t_res_3 < 0.0) {
+ printf("Error: got negative geometric mean state result value\n");
+ printf("t_res_3 = %f\n", t_res_3);
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_picasso_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs)
+{
+
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+
+ t_res_3 = 1.0;
+ /* scoring using picasso-product method */
+
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share / *(aa_freqs + a_index),
+ *(emissions + (vertex * a_size + a_index)));
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow(((msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share)) / *(aa_freqs + a_index),
+ *(emissions + get_mtx_index(vertex, a_index, a_size)));
+ }
+ }
+ else {
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->share / *(aa_freqs + a_index),
+ *(emissions + get_mtx_index(vertex, a_index, a_size)));
+ }
+ }
+ }
+
+ if(t_res_3 < 0.0 || t_res_3 > 1000000000000.0) {
+ printf("Error: got out-of-range picasso product state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_picasso_sym_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares, double *aa_freqs)
+{
+
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+
+ t_res_3 = 1.0;
+ /* scoring using symmetric picasso-product method */
+
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share / *(aa_freqs + a_index),
+ *(emissions + (vertex * a_size + a_index))) *
+ pow(*(emissions + (vertex * a_size + a_index)) / *(aa_freqs + a_index),
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->prior_share);
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow(((msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share)) / *(aa_freqs + a_index),
+ *(emissions + get_mtx_index(vertex, a_index, a_size))) *
+ pow(*(emissions + (vertex * a_size + a_index)) / *(aa_freqs + a_index),
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share /
+ (1.0 - (msa_seq + get_mtx_index(p,a_size, a_size+1))->share));
+ }
+ }
+ else {
+ //printf(" *(aa_freqs + a_index) = %f\n", *(aa_freqs + a_index));
+ //printf("(msa_seq + (p * (a_size+1) + a_index))->share = %f\n", (msa_seq + get_mtx_index(p,a_index, a_size+1))->share);
+ //printf("*(emissions + get_mtx_index(vertex, a_index, a_size)) = %f\n",*(emissions + get_mtx_index(vertex, a_index, a_size)));
+ if((msa_seq + get_mtx_index(p,a_index, a_size+1))->share != 0.0 &&
+ *(emissions + get_mtx_index(vertex, a_index, a_size)) != SILENT) {
+ t_res_3 *= pow((msa_seq + get_mtx_index(p,a_index, a_size+1))->share / *(aa_freqs + a_index),
+ *(emissions + get_mtx_index(vertex, a_index, a_size))) *
+ pow(*(emissions + (vertex * a_size + a_index)) / *(aa_freqs + a_index),
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share);
+ }
+ }
+ }
+
+ if(t_res_3 < 0.0 || t_res_3 > 1000000000000.0) {
+ printf("Error: got out-of-range picasso product state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_subst_mtx_product_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, double *subst_mtx)
+{
+ int a_index, a_index2;
+ double t_res_3;
+
+ t_res_3 = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ for(a_index2 = 0; a_index2 < a_size; a_index2++) {
+ if(use_gap_shares == YES) {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index2, a_size+1))->share /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share) *
+ *(subst_mtx + (a_index * a_size + a_index2));
+ }
+ else {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index2, a_size+1))->share *
+ *(subst_mtx + (a_index * a_size + a_index2));
+ }
+ }
+ }
+
+ if(t_res_3 < 0.0) {
+ printf("Error: got negative subst mtx product state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_subst_mtx_dot_product_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares,
+ int query_index, double *subst_mtx)
+{
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+
+ t_res_3 = 0.0;
+ /* scoring using subst_mtx_dot-product method */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ t_res_3 += *(emissions + (vertex * a_size + a_index)) *
+ (msa_seq + (p * (a_size+1) + a_index))->prior_share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size)) /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share);
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+ else {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+ }
+
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+ subst_mtx_normalizer = sqrt(subst_mtx_normalizer);
+#ifdef DEBUG
+ printf("state_normalizer = %f\n", state_normalizer);
+ printf("seq_normalizer = %f\n", seq_normalizer);
+#endif
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer * subst_mtx_normalizer);
+ }
+ }
+
+ if(t_res_3 < 0.0) {
+ printf("Error: got negative subst mtx dot product state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+double get_subst_mtx_dot_product_prior_statescore(int a_size, int use_gap_shares, int use_prior_shares, struct msa_letter_s *msa_seq,
+ int p, double *emissions, int vertex, int normalize, double *gap_shares,
+ int query_index, double *subst_mtx)
+{
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int a_index;
+ double t_res_3;
+ double rest_share, default_share;
+
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+ default_share = 1.0 / (double)(a_size);
+
+ t_res_3 = 0.0;
+
+ /* scoring using subst_mtx_dot-product method; the leftover share is scored against a uniform default */
+ rest_share = 1.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(use_prior_shares == YES) {
+ t_res_3 += *(emissions + (vertex * a_size + a_index)) *
+ (msa_seq + (p * (a_size+1) + a_index))->prior_share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ if(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)) != 0.0) {
+ rest_share = rest_share - (msa_seq + (p * (a_size+1) + a_index))->prior_share;
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->prior_share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+ }
+ else if(use_gap_shares == YES) {
+ if((msa_seq + get_mtx_index(p,a_size, a_size+1))->share == 1.0) {
+ printf("Error: all-gap column in sequence\n");
+ exit(1);
+ }
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size)) /
+ (1.0 -(msa_seq + get_mtx_index(p,a_size, a_size+1))->share);
+ if(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)) != 0.0) {
+ rest_share = rest_share - (msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p));
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share /
+ (1.0 - *(gap_shares + p)), 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+
+ }
+ else {
+ t_res_3 += *(emissions + get_mtx_index(vertex, a_index, a_size)) *
+ (msa_seq + get_mtx_index(p,a_index, a_size+1))->share *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ if(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)) != 0.0) {
+ rest_share = rest_share - (msa_seq + (p * (a_size+1) + a_index))->share;
+ if(normalize == YES) {
+ seq_normalizer += pow((msa_seq + (p * (a_size+1) + a_index))->share, 2);
+ state_normalizer += pow(*(emissions + (vertex * a_size + a_index)), 2);
+ subst_mtx_normalizer += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)), 2);
+ }
+ }
+ }
+ }
+ if(rest_share < 0.0) {
+ rest_share = 0.0;
+ }
+ t_res_3 += default_share * rest_share;
+ seq_normalizer += pow(rest_share, 2);
+ state_normalizer += pow(default_share, 2);
+
+ if(normalize == YES) {
+ seq_normalizer = sqrt(seq_normalizer);
+ state_normalizer = sqrt(state_normalizer);
+ subst_mtx_normalizer = sqrt(subst_mtx_normalizer);
+#ifdef DEBUG_BW
+ printf("state_normalizer = %f\n", state_normalizer);
+ printf("seq_normalizer = %f\n", seq_normalizer);
+#endif
+ if(t_res_3 != 0.0) {
+ t_res_3 = t_res_3 / (seq_normalizer * state_normalizer * subst_mtx_normalizer);
+ }
+ }
+
+ if(t_res_3 < 0.0) {
+ printf("Error: got negative subst mtx dot product prior state result value\n");
+ exit(1);
+ }
+
+ return t_res_3;
+}
+
+
+
+/************************************* add to E methods *********************************************/
+void add_to_E_continuous(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, double *emissions)
+{
+ double mean_value, varians;
+ int i,j;
+ double continuous_score_all, continuous_score_j, gamma_p_j;
+
+ mean_value = (msa_seq + get_mtx_index(p,0,a_size+1))->share;
+
+
+ continuous_score_all = 0.0;
+ for(j = 0; j < a_size / 3; j++) {
+ continuous_score_all +=
+ get_single_gaussian_statescore(*(emissions + get_mtx_index(k, (j * 3), a_size)),
+ *(emissions + get_mtx_index(k, (j * 3 + 1), a_size)),
+ mean_value) *
+ *((emissions) + (k * (a_size)) + (j * 3 + 2));
+ }
+
+ for(j = 0; j < a_size / 3; j++) {
+ continuous_score_j =
+ get_single_gaussian_statescore(*(emissions + get_mtx_index(k, (j * 3), a_size)),
+ *(emissions + get_mtx_index(k, (j * 3 + 1), a_size)),
+ mean_value) *
+ *((emissions) + (k * (a_size)) + (j * 3 + 2));
+ varians = pow((msa_seq + get_mtx_index(p,0,a_size+1))->share - *(emissions + get_mtx_index(k, j * 3, a_size)), 2);
+ if(continuous_score_all > 0.0) {
+ gamma_p_j = Eka_base * continuous_score_j / continuous_score_all;
+ }
+ else {
+ gamma_p_j = 0.0;
+ }
+ *(E + get_mtx_index(k, j * 3, a_size + 1)) += mean_value * gamma_p_j;
+ *(E + get_mtx_index(k, j * 3 + 1, a_size + 1)) += varians * gamma_p_j;
+ *(E + get_mtx_index(k, j * 3 + 2, a_size + 1)) += gamma_p_j;
+ }
+
+ *(E + get_mtx_index(k, j * 3, a_size + 1)) += Eka_base;
+}
+
+
+
+void add_to_E_dot_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_dot_product_picasso(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_picasso(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_picasso_sym(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_sjolander_score(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_sjolander_reversed_score(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_dot_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_dot_product_picasso_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_picasso_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_picasso_sym_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_sjolander_score_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_sjolander_reversed_score_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences);
+
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) / prf_column_length;
+ }
+ }
+}
+
+void add_to_E_subst_mtx_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx)
+{
+ int a_index, a_index2;
+ double prf_column_length;
+ double subst_mtx_row_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ for(a_index2 = 0; a_index2 < a_size; a_index2++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) *
+ *(subst_mtx + get_mtx_index(a_index, a_index2, a_size));
+ }
+ }
+ }
+ else {
+ printf("Error: normalization is not yet implemented for subst_mtx_product\n");
+ exit(1);
+ }
+
+}
+void add_to_E_subst_mtx_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx)
+{
+ int a_index, a_index2;
+ double prf_column_length;
+ double subst_mtx_row_length;
+
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ for(a_index2 = 0; a_index2 < a_size; a_index2++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) *
+ *(subst_mtx + get_mtx_index(a_index, a_index2, a_size));
+ }
+ }
+ }
+ else {
+ printf("Error: normalization is not yet implemented in add_to_E_subst_mtx_product_nr_occ\n");
+ exit(-1);
+ }
+
+}
+
+void add_to_E_subst_mtx_dot_product(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length, subst_mtx_row_length;
+ int query_index;
+
+ query_index = get_alphabet_index_msa_query((msa_seq + (p * (a_size+1)))->query_letter, alphabet, a_size);
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ subst_mtx_row_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ subst_mtx_row_length += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)),2);
+ }
+ subst_mtx_row_length = sqrt(subst_mtx_row_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size))/
+ (prf_column_length * subst_mtx_row_length);
+ }
+ }
+}
+
+void add_to_E_subst_mtx_dot_product_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length, subst_mtx_row_length;
+ int query_index;
+
+ query_index = get_alphabet_index_msa_query((msa_seq + (p * (a_size+1)))->query_letter, alphabet, a_size);
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+ }
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ subst_mtx_row_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ subst_mtx_row_length += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)),2);
+ }
+ subst_mtx_row_length = sqrt(subst_mtx_row_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size))/
+ (prf_column_length * subst_mtx_row_length);
+ }
+ }
+}
+
+void add_to_E_subst_mtx_dot_product_prior(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length, subst_mtx_row_length;
+ int query_index;
+ double rli;
+
+ rli = REST_LETTER_INDEX;
+
+ query_index = get_alphabet_index_msa_query((msa_seq + (p * (a_size+1)))->query_letter, alphabet, a_size);
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) * rli;
+ }
+
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ subst_mtx_row_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ subst_mtx_row_length += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)),2);
+ }
+ subst_mtx_row_length = sqrt(subst_mtx_row_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size))/
+ (prf_column_length * subst_mtx_row_length);
+
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->share) * rli /
+ (prf_column_length * subst_mtx_row_length);
+ }
+ }
+}
+
+
+void add_to_E_subst_mtx_dot_product_prior_nr_occ(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p,
+ int k, int a_size, int normalize, double *subst_mtx, char *alphabet)
+{
+ /* k is a state index, p is a sequence position index */
+
+ int a_index;
+ double prf_column_length, subst_mtx_row_length;
+ int query_index;
+ double rli;
+
+ rli = REST_LETTER_INDEX;
+
+ query_index = get_alphabet_index_msa_query((msa_seq + (p * (a_size+1)))->query_letter, alphabet, a_size);
+ if(normalize == NO) {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size));
+
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) * rli;
+ }
+
+ }
+ else {
+ prf_column_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ prf_column_length += pow(((msa_seq + get_mtx_index(p, a_index, a_size+1))->share),2);
+ }
+ prf_column_length = sqrt(prf_column_length);
+ subst_mtx_row_length = 0.0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ subst_mtx_row_length += pow(*(subst_mtx + get_mtx_index(query_index, a_index, a_size)),2);
+ }
+ subst_mtx_row_length = sqrt(subst_mtx_row_length);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) *
+ *(subst_mtx + get_mtx_index(query_index, a_index, a_size))/
+ (prf_column_length * subst_mtx_row_length);
+
+ *(E + get_mtx_index(k, a_index, a_size)) += Eka_base *
+ (double)((msa_seq + get_mtx_index(p, a_index, a_size+1))->nr_occurences) * rli /
+ (prf_column_length * subst_mtx_row_length);
+ }
+ }
+}
+
+
+
+/* General versions of the functions needed for keeping track of the labelings in the one-best algorithm */
+
+void update_labelings(struct one_best_s *cur_rowp, char *vertex_labels,
+ int *sorted_v_list, int seq_len, int c, char *labels, int nr_of_labels, int nr_v)
+{
+ int v,w;
+ int v_list_index;
+ int cur_address;
+ int first;
+ char *tmp_labeling;
+ char cur_label;
+ int **same_labeling_lists;
+ int *same_labeling_list_indices;
+ int i;
+
+
+#ifdef DEBUG_LABELING_UPDATE
+ dump_v_list(sorted_v_list);
+ printf("nr of labels = %d\n", nr_of_labels);
+ printf("dump of nr three\n");
+#endif
+
+ v_list_index = 0;
+
+ same_labeling_list_indices = (int*)(malloc_or_die(nr_of_labels * sizeof(int)));
+
+ same_labeling_lists = (int**)(malloc_or_die(nr_of_labels * sizeof(int*)));
+
+
+ for(i = 0; i < nr_of_labels; i++) {
+ same_labeling_lists[i] = (int*)(malloc_or_die((nr_v * 2 + 1) * sizeof(int)));
+ }
+
+
+ for(v = 0; v < nr_v; v++) {
+ (cur_rowp + v)->is_updated = NO;
+ }
+
+ for(i = 0; i < nr_of_labels; i++) {
+ same_labeling_list_indices[i] = 0;
+ }
+
+
+ for(v = 1; v < nr_v-1; v++) {
+ /* find all states with same labeling as this state up to previous row */
+
+ if((cur_rowp + v)->is_updated == NO && (cur_rowp + v)->labeling != NULL) {
+ /* labeling buffers are matched by pointer identity; casting a pointer to int would truncate it on LP64 platforms */
+#ifdef DEBUG_LABELING_UPDATE
+ printf("searching vertex %d\n", v);
+#endif
+ for(i = 0; i < nr_of_labels; i++) {
+ if(*(vertex_labels + v) == *(labels + i)) {
+ *(*(same_labeling_lists + i) + same_labeling_list_indices[i]) = v;
+ same_labeling_list_indices[i] += 1;
+ break;
+ }
+ }
+
+ (cur_rowp+v)->is_updated = YES;
+ for(w = v+1; w < nr_v-1; w++) {
+ if((cur_rowp+w)->labeling == (cur_rowp+v)->labeling && (cur_rowp + w)->is_updated == NO) {
+#ifdef DEBUG_LABELING_UPDATE
+ printf("found same address, vertex nr = %d\n", w);
+#endif
+ for(i = 0; i < nr_of_labels; i++) {
+ if(*(vertex_labels + w) == *(labels + i)) {
+ *(*(same_labeling_lists + i) + same_labeling_list_indices[i]) = w;
+ same_labeling_list_indices[i] += 1;
+ break;
+ }
+ }
+
+ (cur_rowp+w)->is_updated = YES;
+ }
+ }
+
+ for(i = 0; i < nr_of_labels; i++) {
+ *(*(same_labeling_lists + i) + same_labeling_list_indices[i]) = END;
+ same_labeling_list_indices[i] += 1;
+ }
+ }
+ }
+ for(i = 0; i < nr_of_labels; i++) {
+ *(*(same_labeling_lists + i) + same_labeling_list_indices[i]) = TOT_END;
+ }
+
+
+#ifdef DEBUG_LABELING_UPDATE
+ for(i = 0; i < nr_of_labels; i++) {
+ printf("same_labeling_lists, label: %c\n", labels[i]);
+ dump_label_tmp_list(*(same_labeling_lists + i));
+ }
+ //exit(0);
+#endif
+
+ for(i = 0; i < nr_of_labels; i++) {
+ same_labeling_list_indices[i] = 0;
+ while(*(*(same_labeling_lists + i) + same_labeling_list_indices[i]) != TOT_END) {
+ first = YES;
+ while(*(*(same_labeling_lists + i) + same_labeling_list_indices[i]) != END) {
+ /* update sorted_v_list */
+ *(sorted_v_list + v_list_index) = *(*(same_labeling_lists + i) + same_labeling_list_indices[i]);
+ v_list_index++;
+
+ /* update pointers and label paths */
+ if(first == YES) {
+ tmp_labeling = (cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling;
+ (cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling =
+ (char*)malloc_or_die((c+1) * sizeof(char));
+ memcpy((cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling,
+ tmp_labeling, (c) * sizeof(char));
+ ((cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling)[c] = labels[i];
+#ifdef DEBUG_LABELING_UPDATE
+ printf("added label;labels[%d] = %c\n", i, labels[i]);
+#endif
+ first = NO;
+ tmp_labeling = (cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling;
+ }
+ else {
+ (cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling = tmp_labeling;
+ }
+#ifdef DEBUG_LABELING_UPDATE
+ printf("label length c = %d\n", c);
+ dump_labeling((cur_rowp + *(*(same_labeling_lists + i) + same_labeling_list_indices[i]))->labeling, c);
+#endif
+
+
+ same_labeling_list_indices[i] += 1;
+ }
+ if(first == NO) {
+ *(sorted_v_list + v_list_index) = V_LIST_NEXT;
+ v_list_index++;
+ }
+ same_labeling_list_indices[i] += 1;
+ }
+ }
+
+ *(sorted_v_list + v_list_index) = V_LIST_END;
+
+ for(v = 1; v < nr_v; v++) {
+ (cur_rowp + v)->is_updated = NO;
+ }
+
+
+ /* garbage collection */
+ for(i = 0; i < nr_of_labels; i++) {
+ free(same_labeling_lists[i]);
+ }
+ free(same_labeling_list_indices);
+ free(same_labeling_lists);
+}
+
+void deallocate_row_labelings(struct one_best_s *prev_rowp, int nr_v)
+{
+ int dealloc_index;
+ int cur_address;
+ int v,w;
+ int *dealloc_list;
+
+#ifdef DEBUG_DEALLOCATE_LABELINGS
+ printf("starting dealloc\n");
+ printf("nr_v = %d\n", nr_v);
+#endif
+
+ dealloc_list = (int*)(malloc_or_die((nr_v+1) * sizeof(int)));
+
+ for(v = 0; v < nr_v; v++) {
+ (prev_rowp + v)->is_updated = NO;
+ }
+
+
+ /* deallocate last row's labelings */
+ dealloc_index = 0;
+ for(v = 0; v < nr_v; v++) {
+ /* find all states with same labeling as this state up to previous row */
+ if((prev_rowp + v)->is_updated == NO && (prev_rowp + v)->labeling != NULL) {
+ /* labeling buffers are matched by pointer identity; casting a pointer to int would truncate it on LP64 platforms */
+ dealloc_list[dealloc_index] = v;
+ dealloc_index++;
+ (prev_rowp+v)->is_updated = YES;
+ for(w = v+1; w < nr_v; w++) {
+ if((prev_rowp+w)->labeling == (prev_rowp+v)->labeling) {
+#ifdef DEBUG_DEALLOCATE_LABELINGS
+ printf("found same address, vertices %d and %d: %p\n", v, w, (void*)((prev_rowp + w)->labeling));
+#endif
+ (prev_rowp+w)->is_updated = YES;
+ (prev_rowp+w)->labeling = NULL;
+ }
+ }
+ }
+ }
+ dealloc_list[dealloc_index] = END;
+
+ for(dealloc_index = 0; dealloc_list[dealloc_index] != END; dealloc_index++) {
+#ifdef DEBUG_DEALLOCATE_LABELINGS
+ printf("dealloc_index = %d\n", dealloc_index);
+ printf("freeing labeling of vertex %d\n", dealloc_list[dealloc_index]);
+#endif
+ free((prev_rowp + dealloc_list[dealloc_index])->labeling);
+#ifdef DEBUG_DEALLOCATE_LABELINGS
+ printf("done\n");
+#endif
+ }
+
+ free(dealloc_list);
+#ifdef DEBUG_DEALLOCATE_LABELINGS
+ printf("exiting dealloc\n");
+#endif
+}
+
+
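Review note on deallocate_row_labelings() above: several vertices in a row can point at the same labeling buffer, so the function has to free each distinct buffer exactly once. The following is a minimal sketch of that idea, not the upstream implementation. The names struct row_entry and free_shared_labelings are hypothetical stand-ins introduced only for illustration, and unlike the original the sketch also clears the owning entry after freeing it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct one_best_s; only the field needed here. */
struct row_entry {
    char *labeling;
};

/* Frees every distinct labeling buffer in the row exactly once and returns
 * the number of buffers freed. Aliases are found by comparing the pointers
 * themselves rather than integer casts of them. */
static int free_shared_labelings(struct row_entry *row, int nr_v)
{
    int v, w;
    int freed = 0;

    assert(nr_v >= 0);

    for (v = 0; v < nr_v; v++) {
        char *buf = row[v].labeling;
        if (buf == NULL) {
            continue; /* never allocated, or already handled as an alias */
        }
        /* Clear every later alias before freeing, so no entry keeps a
         * dangling pointer and no buffer is freed twice. */
        for (w = v + 1; w < nr_v; w++) {
            if (row[w].labeling == buf) {
                row[w].labeling = NULL;
            }
        }
        free(buf);
        row[v].labeling = NULL;
        freed++;
    }
    return freed;
}
```

Because handled entries are reset to NULL, calling the function a second time on the same row is a no-op, which may be convenient if rows are recycled between sequence positions.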
diff --git a/modhmm0.92b/std_funcs.c b/modhmm0.92b/std_funcs.c
new file mode 100644
index 0000000..b4f517c
--- /dev/null
+++ b/modhmm0.92b/std_funcs.c
@@ -0,0 +1,2808 @@
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <math.h>
+#include <float.h>
+//#include <double.h>
+#include <limits.h>
+
+#include "structs.h"
+#include "funcs.h"
+
+//#define DEBUG_ALPHA
+//#define DEBUG_REVERSE_SEQ
+//#define DEBUG_PRI
+
+#define POS 0
+
+
+int verbose = NO;
+
+
+
+void *malloc_or_die(int mem_size)
+{
+ void *mem;
+ int i;
+
+ if(mem_size > 6000 * 6000 * 8) {
+ printf("Trying to allocate too much memory: %d bytes\n", mem_size);
+ exit(-1);
+ }
+
+ if((mem = malloc(mem_size)) != NULL) {
+ memset(mem, 0, mem_size);
+ return mem;
+ }
+ else {
+ perror("Memory trouble");
+ exit(-1);
+ }
+}
+
+void init_float_mtx(double *mtx, double init_nr, int mtx_size)
+{
+ int i;
+ for(i = 0; i < mtx_size; i++) {
+ *mtx = init_nr;
+ mtx++;
+ }
+}
+
+void init_viterbi_s_mtx(struct viterbi_s *mtx, double init_nr, int mtx_size)
+{
+ int i;
+ for(i = 0; i < mtx_size; i++) {
+ mtx->prob = init_nr;
+ mtx++;
+ }
+}
+
+int get_mtx_index(int row, int col, int row_size)
+{
+ return row * row_size + col;
+}
+
+int get_alphabet_index(struct letter_s *c, char *alphabet, int a_size)
+{
+ int a_index;
+ int found_index;
+ int i;
+
+ for(a_index = 0; a_index < a_size; a_index++) {
+
+ found_index = NO;
+
+#ifdef DEBUG_ALPHA
+ printf("a_index = %d\n", a_index);
+ printf("c = %s\n", c->letter);
+ printf("Alphabet = %s\n", alphabet);
+#endif
+
+ for(i = 0; c->letter[i] != '\0'; i++) {
+ if(*alphabet == ';') {
+ found_index = NO;
+ break;
+ }
+ else if(c->letter[i] == *alphabet) {
+ found_index = YES;
+ }
+ else {
+ found_index = NO;
+ break;
+ }
+ alphabet++;
+ }
+ if(found_index == YES) {
+ break;
+ }
+ while(*alphabet != ';') {
+ alphabet++;
+ }
+ alphabet++;
+ }
+
+ if(a_index >= a_size) {
+ return -1;
+ }
+ return a_index;
+}
+
+int get_alphabet_index_msa_query(char *c, char *alphabet, int a_size)
+{
+ int a_index;
+ int found_index;
+ int i;
+
+ for(a_index = 0; a_index < a_size; a_index++) {
+
+ found_index = NO;
+
+#ifdef DEBUG_ALPHA
+ printf("a_index = %d\n", a_index);
+ printf("c = %s\n", c);
+ printf("Alphabet = %s\n", alphabet);
+#endif
+
+ for(i = 0; c[i] != '\0'; i++) {
+ if(*alphabet == ';') {
+ found_index = NO;
+ break;
+ }
+ else if(c[i] == *alphabet) {
+ found_index = YES;
+ }
+ else {
+ found_index = NO;
+ break;
+ }
+ alphabet++;
+ }
+ if(found_index == YES) {
+ break;
+ }
+ while(*alphabet != ';') {
+ alphabet++;
+ }
+ alphabet++;
+ }
+
+ if(a_index >= a_size) {
+ return -1;
+ }
+ return a_index;
+}
+
+/* if letter is known to be only one character */
+int get_alphabet_index_single(char *alphabet, char letter, int a_size)
+{
+ int a_index;
+ //printf("alphabet = %s\n", alphabet);
+ //printf("letter = %c\n", letter);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ if(letter == *alphabet) {
+ return a_index;
+ }
+ alphabet++;
+ alphabet++;
+ }
+
+ return -1;
+
+}
+
+int get_seq_length(struct letter_s *s)
+{
+ int i, len;
+
+ len = 0;
+ while((s->letter)[0] != '\0') {
+ len++;
+ s++;
+ }
+ return len;
+}
+
+/* checks if there is a path in the hmm from vertex 'v' to vertex 'w'
+ * either directly or via silent states */
+int path_length(int v, int w, struct hmm_s *hmmp, int length)
+{
+ int *xp;
+ int tot_length;
+ int temp_length;
+
+ if(length > MAX_GAP_SIZE) {
+ return 0;
+ }
+
+ tot_length = 0;
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+ tot_length = length + 1; /* direct path */
+ }
+ for(xp = hmmp->silent_vertices; *xp != END; xp++) {
+ if(*(hmmp->transitions + get_mtx_index(v, *xp, hmmp->nr_v)) != 0.0) {
+ if((temp_length = path_length(*xp, w, hmmp, length + 1)) > 0) {
+ tot_length += length + temp_length; /* path via a silent state */
+ }
+ }
+ }
+ return tot_length; /* if 0, then there is no path */
+}
+
+/* checks if there is a path in the hmm from vertex 'v' to vertex 'w'
+ * either directly or via silent states */
+int path_length_multi(int v, int w, struct hmm_multi_s *hmmp, int length)
+{
+ int *xp;
+ int tot_length;
+ int temp_length;
+
+
+ if(length > MAX_GAP_SIZE) {
+ return 0;
+ }
+
+ tot_length = 0;
+ if(*(hmmp->transitions + get_mtx_index(v, w, hmmp->nr_v)) != 0.0) {
+ tot_length = length + 1; /* direct path */
+ }
+ for(xp = hmmp->silent_vertices; *xp != END; xp++) {
+ if(*(hmmp->transitions + get_mtx_index(v, *xp, hmmp->nr_v)) != 0.0) {
+ if((temp_length = path_length_multi(*xp, w, hmmp, length + 1)) > 0) {
+ tot_length += length + temp_length; /* path via a silent state */
+ }
+ }
+ }
+ return tot_length; /* if 0, then there is no path */
+}
+
+
+struct path_element* get_end_path_start(int l, struct hmm_s *hmmp)
+{
+ struct path_element *pep, *pep_const;
+
+ pep_const = *(hmmp->to_trans_array + l);
+ pep = pep_const;
+ while(pep->vertex != END) {
+ while(pep->next != NULL) {
+ pep++;
+ }
+ if(pep->vertex == hmmp->nr_v-1) {
+ return pep_const;
+ }
+ else {
+ pep++;
+ pep_const = pep;
+ }
+ }
+ return NULL;
+}
+
+struct path_element* get_end_path_start_multi(int l, struct hmm_multi_s *hmmp)
+{
+ struct path_element *pep, *pep_const;
+
+ pep_const = *(hmmp->to_trans_array + l);
+ pep = pep_const;
+ while(pep->vertex != END) {
+ while(pep->next != NULL) {
+ pep++;
+ }
+ if(pep->vertex == hmmp->nr_v-1) {
+ return pep_const;
+ }
+ else {
+ pep++;
+ pep_const = pep;
+ }
+ }
+ return NULL;
+}
+
+void print_seq(struct letter_s *seq, FILE *outfile, int seq_nr, char *name, int seq_length)
+{
+ int i,j;
+
+ fprintf(outfile, "NR %d: %s\n", seq_nr, name);
+ fprintf(outfile, ">");
+ for(i = 0; i < seq_length; i++) {
+ j = 0;
+ while(*((seq+i)->letter + j) != '\0') {
+ fputc((int)*((seq+i)->letter + j), outfile);
+ fputc((int)';', outfile);
+ j++;
+ }
+ }
+ fputc('\n', outfile);
+}
+
+
+char* get_profile_vertex_type(int v, int *silent_vertices)
+{
+ while(*silent_vertices != END) {
+ if(v == *silent_vertices) {
+ return "silent\n";
+ }
+ silent_vertices++;
+ }
+
+ return "standard\n";
+}
+
+void get_aa_distrib_mtx(FILE *distribmtxfile, struct aa_distrib_mtx_s *aa_distrib_mtxp)
+{
+ int MAX_LINE = 500;
+ char s[500];
+ char *temp;
+ int i;
+
+ if(distribmtxfile == NULL) {
+ aa_distrib_mtxp->a_size = -1;
+ return;
+ }
+
+ i = 0;
+ while(fgets(s, MAX_LINE, distribmtxfile) != NULL) {
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ aa_distrib_mtxp->a_size = atoi(s);
+ aa_distrib_mtxp->inside_values = (double*)malloc_or_die(aa_distrib_mtxp->a_size * sizeof(double));
+ aa_distrib_mtxp->outside_values = (double*)malloc_or_die(aa_distrib_mtxp->a_size * sizeof(double));
+ aa_distrib_mtxp->membrane_values = (double*)malloc_or_die(aa_distrib_mtxp->a_size * sizeof(double));
+ break;
+ }
+ }
+ while(fgets(s, MAX_LINE, distribmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ temp = s+2;
+ *(aa_distrib_mtxp->inside_values + i) = strtod(temp, &temp);
+ *(aa_distrib_mtxp->outside_values + i) = strtod(temp, &temp);
+ *(aa_distrib_mtxp->membrane_values + i) = strtod(temp, &temp);
+ *(aa_distrib_mtxp->membrane_values + i) += strtod(temp, &temp);
+ i++;
+ }
+ }
+ //dump_aa_distrib_mtx(aa_distrib_mtxp);
+}
+
+
+void get_replacement_letters(FILE *replfile, struct replacement_letter_s *replacement_lettersp)
+{
+ int MAX_LINE = 1000;
+ char s[1000];
+ int i,j,k;
+ int nr_repl_letters;
+ int cur_letter;
+ struct letter_s le;
+ double prob;
+ int a_index;
+ int done;
+ int a_size;
+ char *alphabet;
+
+
+ if(replfile == NULL) {
+ replacement_lettersp->nr_rl = 0;
+ return;
+ }
+
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ a_size = atoi(s);
+ alphabet = (char*)malloc_or_die(a_size * sizeof(char) * 5);
+ break;
+ }
+ }
+ i = 0;
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ strcpy(alphabet, s);
+ break;
+ }
+ }
+
+
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ nr_repl_letters = atoi(s);
+ replacement_lettersp->nr_rl = nr_repl_letters;
+ replacement_lettersp->letters = (struct letter_s*)(malloc_or_die(nr_repl_letters * sizeof(struct letter_s)));
+ replacement_lettersp->probs = (double*)(malloc_or_die(nr_repl_letters * a_size * sizeof(double)));
+ break;
+ }
+ }
+
+ cur_letter = 0;
+ while (fgets(s, MAX_LINE, replfile) != NULL) {
+ if(s[0] == '#' || s[0] == '\n') {
+
+ }
+ else {
+ i = 0;
+ while(s[i] != ' ') {
+ *((replacement_lettersp->letters + cur_letter)->letter + i) = s[i];
+ i++;
+ }
+ *((replacement_lettersp->letters + cur_letter)->letter + i) = '\0';
+ while(s[i] == ' ' || s[i] == '=') {
+ i++;
+ }
+ done = NO;
+ while(done == NO) {
+ j = 0;
+ while(s[i] != ':') {
+ *(le.letter + j) = s[i];
+ i++;
+ j++;
+ }
+ *(le.letter + j) = '\0';
+ i++;
+
+ prob = atof(&s[i]);
+ a_index = get_alphabet_index(&le, alphabet, a_size);
+ *(replacement_lettersp->probs + get_mtx_index(cur_letter, a_index, a_size)) = prob;
+ while(s[i] != ' ' && s[i] != '\n') {
+ i++;
+ }
+ if(s[i] == '\n') {
+ done = YES;
+ }
+ i++;
+ }
+ cur_letter++;
+ }
+ }
+ free(alphabet);
+ //dump_replacement_letters(replacement_lettersp, a_size);
+}
+
+void get_replacement_letters_multi(FILE *replfile, struct replacement_letter_multi_s *replacement_lettersp)
+{
+ int MAX_LINE = 1000;
+ char s[1000];
+ int i,j,k,l,m;
+ int nr_repl_letters;
+ int cur_letter;
+ struct letter_s le;
+ double prob;
+ int a_index;
+ int done;
+ int a_size;
+ char *alphabet;
+ int nr_alphabets;
+ int letterindex;
+
+
+ if(replfile == NULL) {
+ replacement_lettersp->nr_alphabets = 0;
+ return;
+ }
+ replacement_lettersp->nr_rl_1 = 0;
+ replacement_lettersp->nr_rl_2 = 0;
+ replacement_lettersp->nr_rl_3 = 0;
+ replacement_lettersp->nr_rl_4 = 0;
+
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ nr_alphabets = atoi(s);
+ replacement_lettersp->nr_alphabets = nr_alphabets;
+ break;
+ }
+ }
+
+ for(l = 0; l < nr_alphabets; l++) {
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ a_size = atoi(s);
+ alphabet = (char*)malloc_or_die(a_size * sizeof(char) * 5);
+ break;
+ }
+ }
+
+
+ i = 0;
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ strcpy(alphabet, s);
+ break;
+ }
+ }
+
+ while(fgets(s, MAX_LINE, replfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ if(l == 0) {
+ nr_repl_letters = atoi(s);
+ replacement_lettersp->nr_rl_1 = nr_repl_letters;
+ replacement_lettersp->letters_1 = (struct letter_s*)(malloc_or_die(nr_repl_letters * sizeof(struct letter_s)));
+ replacement_lettersp->probs_1 = (double*)(malloc_or_die(nr_repl_letters * a_size * sizeof(double)));
+ break;
+ }
+ if(l == 1) {
+ nr_repl_letters = atoi(s);
+ replacement_lettersp->nr_rl_2 = nr_repl_letters;
+ replacement_lettersp->letters_2 = (struct letter_s*)(malloc_or_die(nr_repl_letters * sizeof(struct letter_s)));
+ replacement_lettersp->probs_2 = (double*)(malloc_or_die(nr_repl_letters * a_size * sizeof(double)));
+ break;
+ }
+ if(l == 2) {
+ nr_repl_letters = atoi(s);
+ replacement_lettersp->nr_rl_3 = nr_repl_letters;
+ replacement_lettersp->letters_3 = (struct letter_s*)(malloc_or_die(nr_repl_letters * sizeof(struct letter_s)));
+ replacement_lettersp->probs_3 = (double*)(malloc_or_die(nr_repl_letters * a_size * sizeof(double)));
+ break;
+ }
+ if(l == 3) {
+ nr_repl_letters = atoi(s);
+ replacement_lettersp->nr_rl_4 = nr_repl_letters;
+ replacement_lettersp->letters_4 = (struct letter_s*)(malloc_or_die(nr_repl_letters * sizeof(struct letter_s)));
+ replacement_lettersp->probs_4 = (double*)(malloc_or_die(nr_repl_letters * a_size * sizeof(double)));
+ break;
+ }
+ }
+ }
+
+ cur_letter = 0;
+ letterindex = 0;
+ while (fgets(s, MAX_LINE, replfile) != NULL) {
+ if(letterindex >= nr_repl_letters) {
+ break;
+ }
+ if(s[0] == '#' || s[0] == '\n') {
+
+ }
+ else {
+ i = 0;
+ while(s[i] != ' ') {
+ if(l == 0) {
+ *((replacement_lettersp->letters_1 + cur_letter)->letter + i) = s[i];
+ }
+ if(l == 1) {
+ *((replacement_lettersp->letters_2 + cur_letter)->letter + i) = s[i];
+ }
+ if(l == 2) {
+ *((replacement_lettersp->letters_3 + cur_letter)->letter + i) = s[i];
+ }
+ if(l == 3) {
+ *((replacement_lettersp->letters_4 + cur_letter)->letter + i) = s[i];
+ }
+ i++;
+ }
+
+ if(l == 0) {
+ *((replacement_lettersp->letters_1 + cur_letter)->letter + i) = '\0';
+ }
+ if(l == 1) {
+ *((replacement_lettersp->letters_2 + cur_letter)->letter + i) = '\0';
+ }
+ if(l == 2) {
+ *((replacement_lettersp->letters_3 + cur_letter)->letter + i) = '\0';
+ }
+ if(l == 3) {
+ *((replacement_lettersp->letters_4 + cur_letter)->letter + i) = '\0';
+ }
+ while(s[i] == ' ' || s[i] == '=') {
+ i++;
+ }
+ done = NO;
+ while(done == NO) {
+ j = 0;
+ while(s[i] != ':') {
+ *(le.letter + j) = s[i];
+ i++;
+ j++;
+ }
+ *(le.letter + j) = '\0';
+ i++;
+
+ prob = atof(&s[i]);
+ /* get alphabet index for the correct alphabet (must be hardcoded to the same order as in the hmm) */
+ a_index = get_alphabet_index(&le, alphabet, a_size);
+
+ if(l == 0) {
+ *(replacement_lettersp->probs_1 + get_mtx_index(cur_letter, a_index, a_size)) = prob;
+ }
+ if(l == 1) {
+ *(replacement_lettersp->probs_2 + get_mtx_index(cur_letter, a_index, a_size)) = prob;
+ }
+ if(l == 2) {
+ *(replacement_lettersp->probs_3 + get_mtx_index(cur_letter, a_index, a_size)) = prob;
+ }
+ if(l == 3) {
+ *(replacement_lettersp->probs_4 + get_mtx_index(cur_letter, a_index, a_size)) = prob;
+ }
+ while(s[i] != ' ' && s[i] != '\n') {
+ i++;
+ }
+ if(s[i] == '\n') {
+ done = YES;
+ }
+ i++;
+ }
+ cur_letter++;
+ letterindex++;
+ }
+
+ }
+ free(alphabet);
+ //dump_replacement_letters_multi(replacement_lettersp, l+1, a_size);
+ }
+}
+
+int get_replacement_letter_index(struct letter_s *c, struct replacement_letter_s *replacement_letters)
+{
+ int a_index;
+ for(a_index = 0; a_index < replacement_letters->nr_rl; a_index++) {
+ if(strcmp((replacement_letters->letters + a_index)->letter, c->letter) == 0) {
+ return a_index;
+ }
+ }
+
+ return -1;
+}
+
+int get_replacement_letter_index_multi(struct letter_s *c, struct replacement_letter_multi_s *replacement_letters, int alphabet)
+{
+ int a_index;
+ if(alphabet == 1) {
+ for(a_index = 0; a_index < replacement_letters->nr_rl_1; a_index++) {
+ if(strcmp((replacement_letters->letters_1 + a_index)->letter, c->letter) == 0) {
+ return a_index;
+ }
+ }
+ }
+ if(alphabet == 2) {
+ for(a_index = 0; a_index < replacement_letters->nr_rl_2; a_index++) {
+ if(strcmp((replacement_letters->letters_2 + a_index)->letter, c->letter) == 0) {
+ return a_index;
+ }
+ }
+ }
+ if(alphabet == 3) {
+ for(a_index = 0; a_index < replacement_letters->nr_rl_3; a_index++) {
+ if(strcmp((replacement_letters->letters_3 + a_index)->letter, c->letter) == 0) {
+ return a_index;
+ }
+ }
+ }
+ if(alphabet == 4) {
+ for(a_index = 0; a_index < replacement_letters->nr_rl_4; a_index++) {
+ if(strcmp((replacement_letters->letters_4 + a_index)->letter, c->letter) == 0) {
+ return a_index;
+ }
+ }
+ }
+
+ return -1;
+}
+
+int get_replacement_letter_index_single(char *c, struct replacement_letter_s *replacement_letters)
+{
+ int a_index;
+ for(a_index = 0; a_index < replacement_letters->nr_rl; a_index++) {
+ if(((replacement_letters->letters + a_index)->letter)[0] == c[0]) {
+ return a_index;
+ }
+ }
+
+ return -1;
+}
+
+char* sequence_as_string(struct letter_s *sequence)
+{
+ struct letter_s *c;
+ char *s;
+ int s_length;
+
+ s_length = 0;
+ c = sequence;
+ while(c->letter[0] != '\0') {
+ s_length++;
+ c++;
+ }
+
+ s = (char*)(malloc_or_die((s_length * 5 + 1) * sizeof(char)));
+
+ c = sequence;
+ while(c->letter[0] != '\0') {
+ strcat(s,c->letter);
+ c++;
+ }
+
+ printf("%s\n", s);
+ free(s);
+ return NULL;
+}
+
+void get_viterbi_label_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, char *labels, int *ip)
+{
+ struct path_element *p_el;
+
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ while(p_el->next != NULL) {
+ labels[*ip] = *(hmmp->vertex_labels + p_el->vertex);
+ *ip = (*ip) + 1;
+ p_el++;
+ }
+ }
+ else {
+ get_viterbi_label_path(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size, labels, ip);
+ p_el = cur->prevp;
+ labels[*ip] = *(hmmp->vertex_labels + (int)(cur->prev));
+ *ip = (*ip) + 1;
+ while(p_el->next != NULL) {
+ p_el++;
+ labels[*ip] = *(hmmp->vertex_labels + p_el->vertex);
+ *ip = (*ip) + 1;
+ }
+ }
+}
+
+void get_viterbi_label_path_multi(struct viterbi_s *cur, struct hmm_multi_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, char *labels, int *ip)
+{
+ struct path_element *p_el;
+
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ while(p_el->next != NULL) {
+ labels[*ip] = *(hmmp->vertex_labels + p_el->vertex);
+ *ip = (*ip) + 1;
+ p_el++;
+ }
+ }
+ else {
+ get_viterbi_label_path_multi(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size, labels, ip);
+ p_el = cur->prevp;
+ labels[*ip] = *(hmmp->vertex_labels + (int)(cur->prev));
+ *ip = (*ip) + 1;
+ while(p_el->next != NULL) {
+ p_el++;
+ labels[*ip] = *(hmmp->vertex_labels + p_el->vertex);
+ *ip = (*ip) + 1;
+ }
+ }
+}
+
+
+void get_viterbi_path(struct viterbi_s *cur, struct hmm_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, int *path, int *ip)
+{
+ struct path_element *p_el;
+
+ //printf("cur->prev = %d\n", cur->prev);
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ while(p_el->next != NULL) {
+ path[*ip] = p_el->vertex;
+ *ip = (*ip) + 1;
+ p_el++;
+ }
+ }
+ else {
+ get_viterbi_path(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size, path, ip);
+ p_el = cur->prevp;
+ path[*ip] = p_el->vertex;
+ //printf("%d ", path[*ip]);
+ *ip = (*ip) + 1;
+ while(p_el->next != NULL) {
+ p_el++;
+ path[*ip] = p_el->vertex;
+ *ip = (*ip) + 1;
+ }
+ }
+}
+
+void get_viterbi_path_multi(struct viterbi_s *cur, struct hmm_multi_s *hmmp,
+ struct viterbi_s *viterbi_mtxp, int row, int row_size, int *path, int *ip)
+{
+ struct path_element *p_el;
+
+ //printf("cur->prev = %d\n", cur->prev);
+ if(cur->prev == 0) {
+ p_el = cur->prevp;
+ while(p_el->next != NULL) {
+ path[*ip] = p_el->vertex;
+ *ip = (*ip) + 1;
+ p_el++;
+ }
+ }
+ else {
+ get_viterbi_path_multi(viterbi_mtxp + get_mtx_index(row-1, cur->prev, row_size), hmmp,
+ viterbi_mtxp, row-1, row_size, path, ip);
+ p_el = cur->prevp;
+ path[*ip] = p_el->vertex;
+ //printf("%d ", path[*ip]);
+ *ip = (*ip) + 1;
+ while(p_el->next != NULL) {
+ p_el++;
+ path[*ip] = p_el->vertex;
+ *ip = (*ip) + 1;
+ }
+ }
+}
+
+
+void loosen_labels(char *labels, char *loose_labels, int label_looseness, int seq_len)
+{
+ int *locked_labels;
+ int reg_len;
+ int i,j;
+ char cur;
+
+ /* allocate the lock array and copy the labels */
+ locked_labels = (int*)(malloc_or_die(seq_len * sizeof(int)));
+ memcpy(loose_labels, labels, seq_len * sizeof(char));
+
+
+ /* lock middle labels */
+ reg_len = 1;
+ cur = labels[0];
+ locked_labels[0] = 1;
+ for(i = 1; i < seq_len; i++) {
+ if(labels[i] != cur) {
+ cur = labels[i];
+ if(reg_len == i) {
+ /* first reg shift, do nothing */
+ }
+ else {
+ if(reg_len % 2 == 1) {
+ locked_labels[i - (reg_len/2 + 1)] = 1;
+ }
+ else {
+ locked_labels[i - reg_len/2] = 1;
+ locked_labels[i - (reg_len/2 + 1)] = 1;
+ }
+ }
+ reg_len = 1;
+ }
+ else {
+ reg_len++;
+ }
+ }
+
+ /* loosen labels */
+ reg_len = 1;
+ cur = labels[0];
+ for(i = 1; i < seq_len; i++) {
+ if(labels[i] != cur) {
+ for(j = 0; j < label_looseness; j++) {
+ if(i + j < seq_len && locked_labels[i + j] == 0 && locked_labels[i-1-j] == 0) {
+ //printf("i = %d, j = %d\n", i , j);
+ loose_labels[i + j] = '.';
+ loose_labels[i-1-j] = '.';
+ }
+ else {
+ break;
+ }
+ }
+
+ cur = labels[i];
+ reg_len = 1;
+ }
+ else {
+ reg_len++;
+ }
+
+ }
+ free(locked_labels);
+
+ //dump_labeling(labels, seq_len);
+ //dump_labeling(loose_labels, seq_len);
+}
+
+int read_subst_matrix(double **mtxpp, FILE *substmtxfile)
+{
+ int MAX_LINE = 1000;
+ char s[1000];
+ int i,j,k;
+ int nr_rows;
+ int row_le_index, col_le_index;
+ struct letter_s row_le, col_le;
+ double prob;
+ int row_a_index, col_a_index;
+ int done;
+ int a_size;
+ char *alphabet;
+ double *mtxp;
+
+ if(substmtxfile == NULL) {
+ return NO;
+ }
+
+ while(fgets(s, MAX_LINE, substmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ a_size = atoi(s);
+ alphabet = (char*)malloc_or_die(a_size * sizeof(char) * 5);
+ *mtxpp = (double*)(malloc_or_die(a_size * (a_size + 1) * sizeof(double)));
+ mtxp = *mtxpp;
+ break;
+ }
+ }
+ i = 0;
+ while(fgets(s, MAX_LINE, substmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ strcpy(alphabet, s);
+ break;
+ }
+ }
+
+ while (fgets(s, MAX_LINE, substmtxfile) != NULL) {
+ if(s[0] == '#' || s[0] == '\n') {
+
+ }
+ else {
+ i = 0;
+ while(s[i] != ' ') {
+ *(row_le.letter + i) = s[i];
+ i++;
+ }
+ *(row_le.letter + i) = '\0';
+ while(s[i] == ' ' || s[i] == '=') {
+ i++;
+ }
+ done = NO;
+ while(done == NO) {
+ j = 0;
+ while(s[i] != ':') {
+ *(col_le.letter + j) = s[i];
+ i++;
+ j++;
+ }
+ *(col_le.letter + j) = '\0';
+ i++;
+
+ prob = atof(&s[i]);
+ row_a_index = get_alphabet_index(&row_le, alphabet, a_size);
+ if(row_a_index < 0) {
+ row_a_index = a_size;
+ }
+ col_a_index = get_alphabet_index(&col_le, alphabet, a_size);
+ *(mtxp + get_mtx_index(row_a_index, col_a_index, a_size)) = prob;
+ while(s[i] != ' ' && s[i] != '\n') {
+ i++;
+ }
+ if(s[i] == '\n') {
+ done = YES;
+ }
+ i++;
+ }
+ }
+ }
+ free(alphabet);
+ //dump_subst_mtx(mtxp, a_size);
+ //exit(0);
+ return YES;
+}
+
+
+int read_subst_matrix_multi(double **mtxpp, double **mtxpp_2, double **mtxpp_3, double **mtxpp_4, FILE *substmtxfile)
+{
+ int MAX_LINE = 1000;
+ char s[1000];
+ int i,j,k, l;
+ int nr_rows;
+ int row_le_index, col_le_index;
+ struct letter_s row_le, col_le;
+ double prob;
+ int row_a_index, col_a_index;
+ int done;
+ int a_size;
+ char *alphabet;
+ double *mtxp;
+ int nr_alphabets;
+
+ if(substmtxfile == NULL) {
+ return NO;
+ }
+
+ while(fgets(s, MAX_LINE, substmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ nr_alphabets = atoi(s);
+ break;
+ }
+ }
+
+ for(l = 0; l < nr_alphabets; l++) {
+ while(fgets(s, MAX_LINE, substmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+ }
+ else {
+ a_size = atoi(s);
+ alphabet = (char*)malloc_or_die(a_size * sizeof(char) * 5);
+ if(l == 0) {
+ *mtxpp = (double*)(malloc_or_die(a_size * (a_size + 1) * sizeof(double)));
+ mtxp = *mtxpp;
+ }
+ if(l == 1) {
+ *mtxpp_2 = (double*)(malloc_or_die(a_size * (a_size + 1) * sizeof(double)));
+ mtxp = *mtxpp_2;
+ }
+ if(l == 2) {
+ *mtxpp_3 = (double*)(malloc_or_die(a_size * (a_size + 1) * sizeof(double)));
+ mtxp = *mtxpp_3;
+ }
+ if(l == 3) {
+ *mtxpp_4 = (double*)(malloc_or_die(a_size * (a_size + 1) * sizeof(double)));
+ mtxp = *mtxpp_4;
+ }
+ break;
+ }
+ }
+ i = 0;
+ while(fgets(s, MAX_LINE, substmtxfile) != NULL){
+ if(s[0] == '\n' || s[0] == '#') {
+
+ }
+ else {
+ strcpy(alphabet, s);
+ break;
+ }
+ }
+
+ while (fgets(s, MAX_LINE, substmtxfile) != NULL) {
+ if(s[0] == '#' || s[0] == '\n') {
+
+ }
+ else if(strncmp(s, "END", 3) == 0) {
+ break;
+ }
+ else {
+ i = 0;
+ while(s[i] != ' ') {
+ *(row_le.letter + i) = s[i];
+ i++;
+ }
+ *(row_le.letter + i) = '\0';
+ while(s[i] == ' ' || s[i] == '=') {
+ i++;
+ }
+ done = NO;
+ while(done == NO) {
+ j = 0;
+ while(s[i] != ':') {
+ *(col_le.letter + j) = s[i];
+ i++;
+ j++;
+ }
+ *(col_le.letter + j) = '\0';
+ i++;
+
+ prob = atof(&s[i]);
+ row_a_index = get_alphabet_index(&row_le, alphabet, a_size);
+ if(row_a_index < 0) {
+ row_a_index = a_size;
+ }
+ col_a_index = get_alphabet_index(&col_le, alphabet, a_size);
+ *(mtxp + get_mtx_index(row_a_index, col_a_index, a_size)) = prob;
+ while(s[i] != ' ' && s[i] != '\n') {
+ i++;
+ }
+ if(s[i] == '\n') {
+ done = YES;
+ }
+ i++;
+ }
+ }
+ }
+ free(alphabet);
+ //dump_subst_mtx(mtxp, a_size);
+ //exit(0);
+ }
+ return YES;
+}
+
+
+int read_prior_file(struct emission_dirichlet_s *em_di, struct hmm_s *hmmp, FILE *priorfile)
+{
+ int j,k;
+ double q_value, alpha_value, alpha_sum, logbeta;
+ char s[2048];
+ char ps[2048];
+ char *file_name;
+ char *pri;
+
+ rewind(priorfile);
+
+ /* put default name */
+ strcpy(em_di->name, "default");
+
+ /* put nr of components in struct */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ em_di->nr_components = atoi(&ps[0]);
+
+ /* allocate memory for arrays and matrix to this prior struct */
+ em_di->q_values = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->alpha_sums = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->logbeta_values =
+ malloc_or_die(em_di->nr_components * sizeof(double));
+ em_di->prior_values = malloc_or_die(em_di->nr_components *
+ hmmp->a_size * sizeof(double));
+
+ for(j = 0; j < em_di->nr_components; j++) {
+ /* put q-value in array */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ q_value = atof(&ps[0]);
+ *(em_di->q_values + j) = q_value;
+#ifdef DEBUG_PRI
+ printf("q_value = %f\n", *(em_di->q_values + j));
+#endif
+
+ /* put alpha-values of this component in matrix */
+ alpha_sum = 0.0;
+ k = 0;
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ pri = &ps[0];
+ for(k = 0; k < hmmp->a_size; k++) {
+ alpha_value = strtod(pri, &pri);
+ alpha_sum += alpha_value;
+ *((em_di->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size)) = alpha_value;
+ }
+
+ /* put sum of alphavalues in array */
+ *((em_di->alpha_sums) + j) = alpha_sum;
+
+ /* calculate logB(alpha) for this component, store in array */
+ logbeta = 0;
+ for(k = 0; k < hmmp->a_size; k++) {
+ logbeta += lgamma(*(em_di->prior_values +
+ get_mtx_index(j, k, hmmp->a_size)));
+
+#ifdef DEBUG_PRI
+ printf("prior_value = %f\n", *((em_di->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size)));
+ printf("lgamma_value = %f\n", lgamma(*((em_di->prior_values) +
+ get_mtx_index(j, k, hmmp->a_size))));
+#endif
+ }
+ logbeta = logbeta - lgamma(*(em_di->alpha_sums + j));
+ *(em_di->logbeta_values + j) = logbeta;
+ }
+
+#ifdef DEBUG_PRI
+ dump_prior_struct(em_di);
+ exit(0);
+#endif
+
+ return 1;
+}
+
+int read_frequencies(FILE *freqfile, double **aa_freqsp)
+{
+ char ps[2048];
+ int cur;
+ int a_size;
+ double *aa_freqs;
+
+ /* validate the file handle before reading from it */
+ if(freqfile == NULL) {
+ printf("Could not read prior frequencies\n");
+ exit(0);
+ }
+
+ /* read alphabet size */
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+
+ a_size = atoi(ps);
+
+ aa_freqs = (double*)malloc_or_die(a_size * sizeof(double));
+ *aa_freqsp = aa_freqs;
+
+ /* read frequencies */
+ for(cur = 0; cur < a_size; cur++) {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ *(aa_freqs + cur) = atof(ps);
+ //printf("aa_freqs[%d] = %f\n", cur, *(aa_freqs + cur));
+ }
+
+ return 1;
+
+}
+
+
+int read_frequencies_multi(FILE *freqfile, double **aa_freqsp, double **aa_freqsp_2, double **aa_freqsp_3, double **aa_freqsp_4)
+{
+ char ps[2048];
+ int cur;
+ int a_size;
+ double *aa_freqs, *aa_freqs_2, *aa_freqs_3, *aa_freqs_4, *aa_freqs_temp;
+ int nr_alphabets;
+ int i;
+
+ /* read frequencies */
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+
+ nr_alphabets = atoi(ps);
+
+ for(i = 0; i < nr_alphabets; i++) {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ a_size = atoi(ps);
+ if(i == 0) {
+ aa_freqs = (double*)malloc_or_die(a_size * sizeof(double));
+ *aa_freqsp = aa_freqs;
+ aa_freqs_temp = aa_freqs;
+ }
+ if(i == 1) {
+ aa_freqs_2 = (double*)malloc_or_die(a_size * sizeof(double));
+ *aa_freqsp_2 = aa_freqs_2;
+ aa_freqs_temp = aa_freqs_2;
+ }
+ if(i == 2) {
+ aa_freqs_3 = (double*)malloc_or_die(a_size * sizeof(double));
+ *aa_freqsp_3 = aa_freqs_3;
+ aa_freqs_temp = aa_freqs_3;
+ }
+ if(i == 3) {
+ aa_freqs_4 = (double*)malloc_or_die(a_size * sizeof(double));
+ *aa_freqsp_4 = aa_freqs_4;
+ aa_freqs_temp = aa_freqs_4;
+ }
+
+
+ /* read frequencies */
+ for(cur = 0; cur < a_size; cur++) {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, freqfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ *(aa_freqs_temp + cur) = atof(ps);
+ //printf("aa_freqs_temp[%d] = %f\n", cur, *(aa_freqs + cur));
+ }
+ }
+
+ return 1;
+
+}
+
+
+
+
+int read_prior_file_multi(struct emission_dirichlet_s *em_di, struct hmm_multi_s *hmmp, FILE *priorfile, int alphabet)
+{
+ int j,k;
+ double q_value, alpha_value, alpha_sum, logbeta;
+ char s[2048];
+ char ps[2048];
+ char *file_name;
+ char *pri;
+ int a_size;
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ if(hmmp->nr_alphabets < 2) {
+ printf("Trying to read priorfile for alphabet 2, but hmm only has one alphabet\n");
+ exit(0);
+ }
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ if(hmmp->nr_alphabets < 3) {
+ printf("Trying to read priorfile for alphabet 3, but hmm only has two alphabets\n");
+ exit(0);
+ }
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ if(hmmp->nr_alphabets < 4) {
+ printf("Trying to read priorfile for alphabet 4, but hmm only has three alphabets\n");
+ exit(0);
+ }
+ }
+
+ if(priorfile == NULL) {
+ em_di->nr_components = 0;
+ return 0;
+ }
+ rewind(priorfile);
+
+ /* put default name */
+ strcpy(em_di->name, "default");
+
+ /* put nr of components in struct */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ em_di->nr_components = atoi(&ps[0]);
+
+ /* allocate memory for arrays and matrix to this prior struct */
+ em_di->q_values = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->alpha_sums = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->logbeta_values =
+ malloc_or_die(em_di->nr_components * sizeof(double));
+ em_di->prior_values = malloc_or_die(em_di->nr_components *
+ a_size * sizeof(double));
+
+ for(j = 0; j < em_di->nr_components; j++) {
+ /* put q-value in array */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ q_value = atof(&ps[0]);
+ *(em_di->q_values + j) = q_value;
+#ifdef DEBUG_PRI
+ printf("q_value = %f\n", *(em_di->q_values + j));
+#endif
+
+ /* put alpha-values of this component in matrix */
+ alpha_sum = 0.0;
+ k = 0;
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ pri = &ps[0];
+ for(k = 0; k < a_size; k++) {
+ alpha_value = strtod(pri, &pri);
+ alpha_sum += alpha_value;
+ *((em_di->prior_values) +
+ get_mtx_index(j, k, a_size)) = alpha_value;
+ }
+
+ /* put sum of alphavalues in array */
+ *((em_di->alpha_sums) + j) = alpha_sum;
+
+ /* calculate logB(alpha) for this component, store in array */
+ logbeta = 0;
+ for(k = 0; k < a_size; k++) {
+ logbeta += lgamma(*(em_di->prior_values +
+ get_mtx_index(j, k, a_size)));
+
+#ifdef DEBUG_PRI
+ printf("prior_value = %f\n", *((em_di->prior_values) +
+ get_mtx_index(j, k, a_size)));
+ printf("lgamma_value = %f\n", lgamma(*((em_di->prior_values) +
+ get_mtx_index(j, k, a_size))));
+#endif
+ }
+ logbeta = logbeta - lgamma(*(em_di->alpha_sums + j));
+ *(em_di->logbeta_values + j) = logbeta;
+ }
+
+#ifdef DEBUG_PRI
+ dump_prior_struct(em_di);
+ exit(0);
+#endif
+
+ return 1;
+}
+
+int read_multi_prior_file_multi(struct emission_dirichlet_s *em_di, struct hmm_multi_s *hmmp, FILE *priorfile, int alphabet)
+{
+
+ /* returns a negative value if an error in the priorfile is detected, 0 if there is no prior information for the alphabet,
+ * and a positive value if prior components were read successfully */
+
+ int j,k;
+ double q_value, alpha_value, alpha_sum, logbeta;
+ char s[2048];
+ char ps[2048];
+ char *file_name;
+ char *pri;
+ int a_size;
+ int cur_alphabet;
+
+ cur_alphabet = -1;
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ if(hmmp->nr_alphabets < 2) {
+ printf("Trying to read priorfile for alphabet 2, but hmm only has one alphabet\n");
+ exit(0);
+ }
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ if(hmmp->nr_alphabets < 3) {
+ printf("Trying to read priorfile for alphabet 3, but hmm only has two alphabets\n");
+ exit(0);
+ }
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ if(hmmp->nr_alphabets < 4) {
+ printf("Trying to read priorfile for alphabet 4, but hmm only has three alphabets\n");
+ exit(0);
+ }
+ }
+
+ if(priorfile == NULL) {
+ em_di->nr_components = 0;
+ return 0;
+ }
+
+ rewind(priorfile);
+
+ /* put default name */
+ strcpy(em_di->name, "default");
+
+ /* find correct alphabet */
+
+ while(fgets(ps, 2048, priorfile) != NULL) {
+ if(strncmp(ps, "ALPHABET:", 9) == 0) {
+ cur_alphabet = atoi(&ps[9]);
+ if(cur_alphabet == alphabet) {
+ break;
+ }
+ }
+ }
+ if(cur_alphabet != alphabet) {
+ return 0;
+ }
+ while(fgets(ps, 2048, priorfile) != NULL) {
+ if(strncmp(ps, "START", 5) == 0) {
+ break;
+ }
+ }
+
+ /* put nr of components in struct */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ em_di->nr_components = atoi(&ps[0]);
+ em_di->alphabet_size = a_size;
+
+ /* allocate memory for arrays and matrix to this prior struct */
+ em_di->q_values = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->alpha_sums = malloc_or_die(em_di->nr_components *
+ sizeof(double));
+ em_di->logbeta_values =
+ malloc_or_die(em_di->nr_components * sizeof(double));
+ em_di->prior_values = malloc_or_die(em_di->nr_components *
+ a_size * sizeof(double));
+
+ for(j = 0; j < em_di->nr_components; j++) {
+ /* put q-value in array */
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ q_value = atof(&ps[0]);
+ *(em_di->q_values + j) = q_value;
+#ifdef DEBUG_PRI
+ printf("q_value = %f\n", *(em_di->q_values + j));
+#endif
+
+ /* put alpha-values of this component in matrix */
+ alpha_sum = 0.0;
+ k = 0;
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ while(*ps == '#' || *ps == '\n') {
+ if(fgets(ps, 2048, priorfile) != NULL) {
+ }
+ else {
+ return -1;
+ }
+ }
+ pri = &ps[0];
+ for(k = 0; k < a_size; k++) {
+ alpha_value = strtod(pri, &pri);
+ alpha_sum += alpha_value;
+ *((em_di->prior_values) +
+ get_mtx_index(j, k, a_size)) = alpha_value;
+ }
+
+ /* put sum of alphavalues in array */
+ *((em_di->alpha_sums) + j) = alpha_sum;
+
+ /* calculate logB(alpha) for this component, store in array */
+ logbeta = 0;
+ for(k = 0; k < a_size; k++) {
+ logbeta += lgamma(*(em_di->prior_values +
+ get_mtx_index(j, k, a_size)));
+
+#ifdef DEBUG_PRI
+ printf("prior_value = %f\n", *((em_di->prior_values) +
+ get_mtx_index(j, k, a_size)));
+ printf("lgamma_value = %f\n", lgamma(*((em_di->prior_values) +
+ get_mtx_index(j, k, a_size))));
+#endif
+ }
+ logbeta = logbeta - lgamma(*(em_di->alpha_sums + j));
+ *(em_di->logbeta_values + j) = logbeta;
+ }
+
+#ifdef DEBUG_PRI
+ dump_prior_struct(em_di);
+ exit(0);
+#endif
+
+ return 1;
+
+}
+
+
+
+
+int locked_state(struct hmm_s *hmmp, int v)
+{
+ if(*(hmmp->locked_vertices + v) == YES) {
+ return YES;
+ }
+ else {
+ return NO;
+ }
+}
+
+int locked_state_multi(struct hmm_multi_s *hmmp, int v)
+{
+ if(*(hmmp->locked_vertices + v) == YES) {
+ return YES;
+ }
+ else {
+ return NO;
+ }
+}
+
+int get_best_reliability_score(double reliability_score_1, double reliability_score_2, double reliability_score_3)
+{
+ int max = 2;
+ if(reliability_score_1 > reliability_score_2 && reliability_score_1 > reliability_score_3) {
+ max = 1;
+ }
+ else if(reliability_score_3 > reliability_score_2) {
+ max = 3;
+ }
+ return max;
+}
+
+
+
+
+
+void itoa(char* s, int nr) {
+ int dec, sign;
+ char temp[30];
+
+ strcpy(temp, fcvt(nr, 0, &dec, &sign));
+ if(sign == POS) {
+ strncpy(s, temp, dec);
+ s[dec] = '\0';
+ }
+ else {
+ s[0] = '-';
+ strncpy(s+1, temp, dec);
+ s[dec+1] = '\0';
+ }
+
+
+}
+
+void ftoa(char* s, double nr, int prec) {
+ int dec, sign;
+ char temp[30];
+ int i, pos;
+
+ strcpy(temp, fcvt(nr, prec, &dec, &sign));
+ if(sign == POS) {
+ if(dec <= 0) {
+ s[0] = '0';
+ s[1] = '.';
+ pos = 2;
+ for(i = 0; i > dec; i--) {
+ s[pos] = '0';
+ pos++;
+ }
+ strncpy(s+pos, temp, prec);
+ s[pos+prec] = '\0';
+ }
+ else {
+ strncpy(s, temp, dec);
+ s[dec] = '.';
+ strncpy(s+dec+1, temp+dec, prec);
+ s[dec+1+prec] = '\0';
+ }
+ }
+ else {
+ if(dec <= 0) {
+ s[0] = '-';
+ s[1] = '0';
+ s[2] = '.';
+ pos = 3;
+ for(i = 0; i > dec; i--) {
+ s[pos] = '0';
+ pos++;
+ }
+ strncpy(s+pos, temp, prec);
+ s[pos+prec] = '\0';
+ }
+ else {
+ s[0] = '-';
+ strncpy(s+1, temp, dec);
+ s[dec+1] = '.';
+ strncpy(s+dec+2, temp+dec, prec);
+ s[dec+2+prec] = '\0';
+ }
+ }
+}
+
+
+void hmm_garbage_collection(FILE *hmmfile, struct hmm_s *hmmp)
+{
+ int i;
+
+ free(hmmp->transitions);
+ free(hmmp->log_transitions);
+ free(hmmp->emissions);
+ free(hmmp->log_emissions);
+ for(i = 0; i < hmmp->nr_m; i++) {
+ free((*(hmmp->modules + i))->vertices);
+ }
+ free(hmmp->silent_vertices);
+ free(hmmp->vertex_labels);
+ free(hmmp->labels);
+ free(hmmp->vertex_trans_prior_scalers);
+ free(hmmp->vertex_emiss_prior_scalers);
+ free(hmmp->modules);
+ free(hmmp->to_trans_array);
+ free(hmmp->from_trans_array);
+ free(hmmp->to_silent_trans_array);
+ free(hmmp->tot_transitions);
+ free(hmmp->max_log_transitions);
+ free(hmmp->tot_from_trans_array);
+ free(hmmp->tot_to_trans_array);
+ free(hmmp->distrib_groups);
+ free(hmmp->trans_tie_groups);
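+ for(i = 0; i < hmmp->nr_ed; i++) {
+ /* free the arrays of the i-th prior; freeing element 0 on every pass would be a double free */
+ free((hmmp->emission_dirichlets + i)->q_values);
+ free((hmmp->emission_dirichlets + i)->alpha_sums);
+ free((hmmp->emission_dirichlets + i)->logbeta_values);
+ free((hmmp->emission_dirichlets + i)->prior_values);
+ }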
+ free(hmmp->emission_dirichlets);
+ free(hmmp->ed_ps);
+ if(hmmfile != NULL) {
+ fclose(hmmfile);
+ }
+}
+
+void hmm_garbage_collection_multi(FILE *hmmfile, struct hmm_multi_s *hmmp)
+{
+ int i;
+
+ free(hmmp->transitions);
+ free(hmmp->log_transitions);
+ free(hmmp->emissions);
+ free(hmmp->log_emissions);
+ if(hmmp->nr_alphabets > 1) {
+ free(hmmp->emissions_2);
+ free(hmmp->log_emissions_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(hmmp->emissions_3);
+ free(hmmp->log_emissions_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(hmmp->emissions_4);
+ free(hmmp->log_emissions_4);
+ }
+ for(i = 0; i < hmmp->nr_m; i++) {
+ free((*(hmmp->modules + i))->vertices);
+ }
+ free(hmmp->silent_vertices);
+ free(hmmp->vertex_labels);
+ free(hmmp->labels);
+ free(hmmp->vertex_trans_prior_scalers);
+ free(hmmp->vertex_emiss_prior_scalers);
+ if(hmmp->nr_alphabets > 1) {
+ free(hmmp->vertex_emiss_prior_scalers_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(hmmp->vertex_emiss_prior_scalers_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(hmmp->vertex_emiss_prior_scalers_4);
+ }
+ free(hmmp->modules);
+ free(hmmp->to_trans_array);
+ free(hmmp->from_trans_array);
+ free(hmmp->to_silent_trans_array);
+ free(hmmp->tot_transitions);
+ free(hmmp->max_log_transitions);
+ free(hmmp->tot_from_trans_array);
+ free(hmmp->tot_to_trans_array);
+ free(hmmp->distrib_groups);
+ free(hmmp->trans_tie_groups);
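+ for(i = 0; i < hmmp->nr_ed; i++) {
+ /* free the arrays of the i-th prior; freeing element 0 on every pass would be a double free */
+ free((hmmp->emission_dirichlets + i)->q_values);
+ free((hmmp->emission_dirichlets + i)->alpha_sums);
+ free((hmmp->emission_dirichlets + i)->logbeta_values);
+ free((hmmp->emission_dirichlets + i)->prior_values);
+ }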
+ free(hmmp->emission_dirichlets);
+ free(hmmp->ed_ps);
+ if(hmmp->nr_alphabets > 1) {
+ for(i = 0; i < hmmp->nr_ed_2; i++) {
+ free((hmmp->emission_dirichlets_2 + i)->q_values);
+ free((hmmp->emission_dirichlets_2 + i)->alpha_sums);
+ free((hmmp->emission_dirichlets_2 + i)->logbeta_values);
+ free((hmmp->emission_dirichlets_2 + i)->prior_values);
+ }
+ free(hmmp->emission_dirichlets_2);
+ free(hmmp->ed_ps_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ for(i = 0; i < hmmp->nr_ed_3; i++) {
+ free((hmmp->emission_dirichlets_3 + i)->q_values);
+ free((hmmp->emission_dirichlets_3 + i)->alpha_sums);
+ free((hmmp->emission_dirichlets_3 + i)->logbeta_values);
+ free((hmmp->emission_dirichlets_3 + i)->prior_values);
+ }
+ free(hmmp->emission_dirichlets_3);
+ free(hmmp->ed_ps_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ for(i = 0; i < hmmp->nr_ed_4; i++) {
+ free((hmmp->emission_dirichlets_4 + i)->q_values);
+ free((hmmp->emission_dirichlets_4 + i)->alpha_sums);
+ free((hmmp->emission_dirichlets_4 + i)->logbeta_values);
+ free((hmmp->emission_dirichlets_4 + i)->prior_values);
+ }
+ free(hmmp->emission_dirichlets_4);
+ free(hmmp->ed_ps_4);
+ }
+ if(hmmfile != NULL) {
+ fclose(hmmfile);
+ }
+}
+
+void hmm_garbage_collection_multi_no_dirichlet(FILE *hmmfile, struct hmm_multi_s *hmmp)
+{
+ int i;
+
+ //printf("transitions\n");
+ free(hmmp->transitions);
+ free(hmmp->log_transitions);
+ free(hmmp->emissions);
+ //printf("log_emissions\n");
+ free(hmmp->log_emissions);
+ if(hmmp->nr_alphabets > 1) {
+ free(hmmp->emissions_2);
+ free(hmmp->log_emissions_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(hmmp->emissions_3);
+ free(hmmp->log_emissions_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(hmmp->emissions_4);
+ free(hmmp->log_emissions_4);
+ }
+ for(i = 0; i < hmmp->nr_m; i++) {
+ //free((*(hmmp->modules + i))->vertices);
+ }
+ //printf("silent_vertices\n");
+ free(hmmp->silent_vertices);
+ //printf("vertex_labels\n");
+ free(hmmp->vertex_labels);
+ //printf("labels\n");
+ free(hmmp->labels);
+ //printf("trans_prior_scalers\n");
+ free(hmmp->vertex_trans_prior_scalers);
+ //printf("emiss_prior_scalers\n");
+ free(hmmp->vertex_emiss_prior_scalers);
+ if(hmmp->nr_alphabets > 1) {
+ free(hmmp->vertex_emiss_prior_scalers_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(hmmp->vertex_emiss_prior_scalers_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(hmmp->vertex_emiss_prior_scalers_4);
+ }
+ //printf("modules\n");
+ free(hmmp->modules);
+ free(hmmp->to_trans_array);
+ free(hmmp->from_trans_array);
+ free(hmmp->to_silent_trans_array);
+ //printf("tot_transitionss\n");
+ free(hmmp->tot_transitions);
+ free(hmmp->max_log_transitions);
+ free(hmmp->tot_from_trans_array);
+ free(hmmp->tot_to_trans_array);
+ free(hmmp->distrib_groups);
+ //printf("trans_tie_groups\n");
+ free(hmmp->trans_tie_groups);
+
+ if(hmmfile != NULL) {
+ fclose(hmmfile);
+ }
+}
+
+void msa_seq_garbage_collection_multi(struct msa_sequences_multi_s *msa_seq_info, int nr_alphabets)
+{
+
+}
+
+void seq_garbage_collection_multi(struct sequences_multi_s *seq_info, int nr_alphabets)
+{
+
+}
+
+struct letter_s* get_reverse_seq(struct letter_s *seq_const, int seq_len)
+{
+ struct letter_s *reverse_seq, *seq;
+ int i;
+
+ seq = seq_const;
+ reverse_seq = (struct letter_s*)(malloc_or_die((seq_len+1)* sizeof(struct letter_s)));
+ for(i = seq_len - 1; i >= 0; i--) {
+ memcpy(reverse_seq + i, seq, sizeof(struct letter_s));
+ seq++;
+ }
+
+ memcpy(reverse_seq + seq_len, seq, sizeof(struct letter_s));
+
+
+#ifdef DEBUG_REVERSE_SEQ
+ dump_seq(reverse_seq);
+ dump_seq(seq_const);
+ printf("reverse_seq = %p\n", (void*)reverse_seq);
+#endif
+
+ return reverse_seq;
+}
+
+void get_reverse_seq_multi(struct sequence_multi_s *seqs, struct letter_s **reverse_seq_1,
+ struct letter_s **reverse_seq_2, struct letter_s **reverse_seq_3,
+ struct letter_s **reverse_seq_4, struct hmm_multi_s *hmmp, int seq_len)
+{
+ /* letter_s pointers allocated here must be freed by caller */
+
+ struct letter_s *reverse_seq, *seq, *seq_const;
+ int i,j;
+ *reverse_seq_1 = NULL;
+ *reverse_seq_2 = NULL;
+ *reverse_seq_3 = NULL;
+ *reverse_seq_4 = NULL;
+
+ for(j = 0; j < hmmp->nr_alphabets; j++) {
+ reverse_seq = (struct letter_s*)(malloc_or_die((seq_len+1)* sizeof(struct letter_s)));
+ if(j == 0) {
+ seq = seqs->seq_1;
+ *reverse_seq_1 = reverse_seq;
+ }
+ if(j == 1) {
+ seq = seqs->seq_2;
+ *reverse_seq_2 = reverse_seq;
+ }
+ if(j == 2) {
+ seq = seqs->seq_3;
+ *reverse_seq_3 = reverse_seq;
+ }
+ if(j == 3) {
+ seq = seqs->seq_4;
+ *reverse_seq_4 = reverse_seq;
+ }
+ seq_const = seq;
+ for(i = seq_len - 1; i >= 0; i--) {
+ memcpy(reverse_seq + i, seq, sizeof(struct letter_s));
+ seq++;
+ }
+
+ memcpy(reverse_seq + seq_len, seq, sizeof(struct letter_s));
+
+#ifdef DEBUG_REVERSE_SEQ
+ printf("seq_len = %d\n", seq_len);
+ dump_seq(reverse_seq);
+ dump_seq(*reverse_seq_1);
+ dump_seq(seq_const);
+ printf("reverse_seq = %p\n", (void*)reverse_seq);
+#endif
+ }
+}
+
+
+void get_reverse_msa_seq(struct msa_sequences_s *msa_seq_infop, struct msa_sequences_s *reverse_msa_seq_infop, int a_size)
+{
+
+ /* note that this function does not implement the gaps-pointing, everything points to END */
+
+ int i,j;
+
+ reverse_msa_seq_infop->nr_seqs = msa_seq_infop->nr_seqs;
+ reverse_msa_seq_infop->msa_seq_length = msa_seq_infop->msa_seq_length;
+ reverse_msa_seq_infop->nr_lead_columns = msa_seq_infop->nr_lead_columns;
+ reverse_msa_seq_infop->msa_seq = (struct msa_letter_s*)(malloc_or_die(msa_seq_infop->msa_seq_length * (a_size + 1) *
+ sizeof(struct msa_letter_s)));
+ reverse_msa_seq_infop->gaps = (int**)(malloc_or_die(msa_seq_infop->msa_seq_length * sizeof(int*) +
+ msa_seq_infop->msa_seq_length * sizeof(int)));
+ reverse_msa_seq_infop->lead_columns_start = (int*)(malloc_or_die((msa_seq_infop->nr_lead_columns +1)* sizeof(int)));
+ reverse_msa_seq_infop->lead_columns_end = reverse_msa_seq_infop->lead_columns_start + (msa_seq_infop->nr_lead_columns);
+ reverse_msa_seq_infop->gap_shares = (double*)(malloc_or_die(msa_seq_infop->msa_seq_length * sizeof(double)));
+
+ /* get sequence data */
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->msa_seq + (i * (a_size + 1)),
+ msa_seq_infop->msa_seq + (j * (a_size + 1)),
+ sizeof(struct msa_letter_s) * (a_size+1));
+ j++;
+ }
+
+ /* get gap data (not implemented) */
+ /* the int payload starts right after the msa_seq_length row pointers; index it as int so writes stay inside the allocation */
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ *((int*)(reverse_msa_seq_infop->gaps + msa_seq_infop->msa_seq_length) + i) = END;
+ }
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ *(reverse_msa_seq_infop->gaps + i) = (int*)(reverse_msa_seq_infop->gaps + msa_seq_infop->msa_seq_length) + i;
+ }
+
+ /* get gap shares */
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->gap_shares + i, msa_seq_infop->gap_shares + j, 1 * sizeof(double));
+ j++;
+ }
+
+ /* get lead_columns */
+ j = 0;
+ for(i = msa_seq_infop->nr_lead_columns - 1; i >= 0; i--) {
+ *(reverse_msa_seq_infop->lead_columns_start + i) = (msa_seq_infop->msa_seq_length - 1) - *(msa_seq_infop->lead_columns_start + j);
+ j++;
+ }
+ *(reverse_msa_seq_infop->lead_columns_start + msa_seq_infop->nr_lead_columns) = END;
+
+#ifdef DEBUG_REVERSE_SEQ
+ dump_msa_seqs(msa_seq_infop, a_size);
+ dump_msa_seqs(reverse_msa_seq_infop, a_size);
+#endif
+}
+
+void get_reverse_msa_seq_multi(struct msa_sequences_multi_s *msa_seq_infop, struct msa_sequences_multi_s *reverse_msa_seq_infop,
+ struct hmm_multi_s *hmmp)
+{
+
+ /* note that this function does not implement the gaps-pointing, everything points to END */
+
+ int i,j;
+ int nr_alphabets;
+ int a_size, a_size_2, a_size_3, a_size_4;
+
+ nr_alphabets = hmmp->nr_alphabets;
+ a_size = hmmp->a_size;
+ a_size_2 = hmmp->a_size_2;
+ a_size_3 = hmmp->a_size_3;
+ a_size_4 = hmmp->a_size_4;
+
+
+
+ reverse_msa_seq_infop->nr_seqs = msa_seq_infop->nr_seqs;
+ reverse_msa_seq_infop->msa_seq_length = msa_seq_infop->msa_seq_length;
+ reverse_msa_seq_infop->nr_lead_columns = msa_seq_infop->nr_lead_columns;
+ reverse_msa_seq_infop->msa_seq_1 = (struct msa_letter_s*)(malloc_or_die(msa_seq_infop->msa_seq_length * (a_size + 1) *
+ sizeof(struct msa_letter_s)));
+ if(nr_alphabets > 1) {
+ reverse_msa_seq_infop->msa_seq_2 = (struct msa_letter_s*)(malloc_or_die(msa_seq_infop->msa_seq_length * (a_size_2 + 1) *
+ sizeof(struct msa_letter_s)));
+ }
+ if(nr_alphabets > 2) {
+ reverse_msa_seq_infop->msa_seq_3 = (struct msa_letter_s*)(malloc_or_die(msa_seq_infop->msa_seq_length * (a_size_3 + 1) *
+ sizeof(struct msa_letter_s)));
+ }
+ if(nr_alphabets > 3) {
+ reverse_msa_seq_infop->msa_seq_4 = (struct msa_letter_s*)(malloc_or_die(msa_seq_infop->msa_seq_length * (a_size_4 + 1) *
+ sizeof(struct msa_letter_s)));
+ }
+ reverse_msa_seq_infop->gaps = (int**)(malloc_or_die(msa_seq_infop->msa_seq_length * sizeof(int*) +
+ msa_seq_infop->msa_seq_length * sizeof(int)));
+ reverse_msa_seq_infop->lead_columns_start = (int*)(malloc_or_die((msa_seq_infop->nr_lead_columns +1)* sizeof(int)));
+ reverse_msa_seq_infop->lead_columns_end = reverse_msa_seq_infop->lead_columns_start + (msa_seq_infop->nr_lead_columns);
+ reverse_msa_seq_infop->gap_shares = (double*)(malloc_or_die(msa_seq_infop->msa_seq_length * sizeof(double)));
+
+ /* get sequence data */
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->msa_seq_1 + (i * (a_size + 1)),
+ msa_seq_infop->msa_seq_1 + (j * (a_size + 1)),
+ sizeof(struct msa_letter_s) * (a_size+1));
+ j++;
+ }
+ if(nr_alphabets > 1) {
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->msa_seq_2 + (i * (a_size_2 + 1)),
+ msa_seq_infop->msa_seq_2 + (j * (a_size_2 + 1)),
+ sizeof(struct msa_letter_s) * (a_size_2+1));
+ j++;
+ }
+ }
+ if(nr_alphabets > 2) {
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->msa_seq_3 + (i * (a_size_3 + 1)),
+ msa_seq_infop->msa_seq_3 + (j * (a_size_3 + 1)),
+ sizeof(struct msa_letter_s) * (a_size_3+1));
+ j++;
+ }
+ }
+ if(nr_alphabets > 3) {
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->msa_seq_4 + (i * (a_size_4 + 1)),
+ msa_seq_infop->msa_seq_4 + (j * (a_size_4 + 1)),
+ sizeof(struct msa_letter_s) * (a_size_4+1));
+ j++;
+ }
+ }
+
+
+
+ /* get gap data (not implemented) */
+ /* the int payload starts right after the msa_seq_length row pointers; index it as int so writes stay inside the allocation */
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ *((int*)(reverse_msa_seq_infop->gaps + msa_seq_infop->msa_seq_length) + i) = END;
+ }
+ for(i = 0; i < msa_seq_infop->msa_seq_length; i++) {
+ *(reverse_msa_seq_infop->gaps + i) = (int*)(reverse_msa_seq_infop->gaps + msa_seq_infop->msa_seq_length) + i;
+ }
+
+ /* get gap shares */
+ j = 0;
+ for(i = msa_seq_infop->msa_seq_length - 1; i >= 0; i--) {
+ memcpy(reverse_msa_seq_infop->gap_shares + i, msa_seq_infop->gap_shares + j, 1 * sizeof(double));
+ j++;
+ }
+
+ /* get lead_columns */
+ j = 0;
+ for(i = msa_seq_infop->nr_lead_columns - 1; i >= 0; i--) {
+ *(reverse_msa_seq_infop->lead_columns_start + i) = (msa_seq_infop->msa_seq_length - 1) - *(msa_seq_infop->lead_columns_start + j);
+ j++;
+ }
+ *(reverse_msa_seq_infop->lead_columns_start + msa_seq_infop->nr_lead_columns) = END;
+
+#ifdef DEBUG_REVERSE_SEQ
+ dump_msa_seqs(msa_seq_infop, a_size);
+ dump_msa_seqs(reverse_msa_seq_infop, a_size);
+#endif
+}
+
+
+
+void get_msa_labels(FILE *labelfile, struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp) {
+ char row[30000];
+ int i,j;
+
+ rewind(labelfile);
+ while(1) {
+ if(fgets(row,30000,labelfile) != NULL) {
+ if(row[0] == '/') {
+ for(i = 1; row[i] != '/' && row[i] != '\0'; i++) {
+ (msa_seq_infop->msa_seq + (*(msa_seq_infop->lead_columns_start + i-1)) * (hmmp->a_size+1))->label = row[i];
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+}
+
+void get_msa_labels_all_columns(FILE *labelfile, struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp) {
+ char row[30000];
+ int i,j;
+
+ rewind(labelfile);
+ while(1) {
+ if(fgets(row,30000,labelfile) != NULL) {
+ if(row[0] == '/') {
+ for(i = 1; row[i] != '/' && row[i] != '\0'; i++) {
+ (msa_seq_infop->msa_seq + (i-1) * (hmmp->a_size+1))->label = row[i];
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+}
+
+int update_shares_prior(struct emission_dirichlet_s *em_di, struct hmm_s *hmmp,
+ struct msa_sequences_s *msa_seq_infop, int l)
+{
+ int nr_components, comps, a_index;
+ double occ_sums;
+ double q_value, scaling_factor, X_sum, *X_values, ed_res1, *logbeta_an_values;
+ double exponent, prior_prob, tot_prior_prob;
+
+
+ /************the update part ********************/
+ nr_components = em_di->nr_components;
+ logbeta_an_values = malloc_or_die(nr_components * sizeof(double));
+ scaling_factor = -FLT_MAX;
+ X_sum = 0.0;
+ X_values = malloc_or_die(hmmp->a_size * sizeof(double));
+
+
+ /* calculate logB(alpha + n) for all components +
+ * calculate scaling factor for logB(alpha + n) - logB(alpha) */
+ for(comps = 0; comps < nr_components; comps++) {
+ ed_res1 = 0;
+ occ_sums = 0;
+ for(a_index = 0; a_index < hmmp->a_size; a_index++) {
+ ed_res1 += lgamma(*(em_di->prior_values +
+ get_mtx_index(comps, a_index, hmmp->a_size)) +
+ (double)((msa_seq_infop->msa_seq +
+ get_mtx_index(l,a_index,hmmp->a_size+1))->nr_occurences));
+ occ_sums += (msa_seq_infop->msa_seq + get_mtx_index(l,a_index, hmmp->a_size+1))->nr_occurences;
+ }
+ ed_res1 = ed_res1 - lgamma(*(em_di->alpha_sums + comps) + (double)(occ_sums));
+ *(logbeta_an_values + comps) = ed_res1;
+ if((ed_res1 = ed_res1 - *(em_di->logbeta_values + comps)) > scaling_factor) {
+ scaling_factor = ed_res1;
+ }
+ }
+
+ /* calculate all the Xi's */
+ for(a_index = 0; a_index < hmmp->a_size; a_index++) {
+ *(X_values + a_index) = 0;
+ for(comps = 0; comps < nr_components; comps++) {
+ q_value = *(em_di->q_values + comps);
+ exponent = (*(logbeta_an_values + comps) - *(em_di->logbeta_values + comps) -
+ scaling_factor);
+ prior_prob = (*(em_di->prior_values + get_mtx_index(comps,a_index, hmmp->a_size)) +
+ (double)((msa_seq_infop->msa_seq +
+ get_mtx_index(l,a_index,hmmp->a_size+1))->nr_occurences));
+ tot_prior_prob = (*(em_di->alpha_sums + comps) + (double)(occ_sums));
+ *(X_values + a_index) += q_value * exp(exponent) * prior_prob / tot_prior_prob;
+#ifdef DEBUG_PRI
+ printf("\nscaling factor = %f\n", scaling_factor);
+ printf("a_index = %d\n", a_index);
+ printf("q_value = %f\n", q_value);
+ printf("exponent = %f\n", exponent);
+ printf("prior_prob = %f\n", prior_prob);
+ printf("tot_prior_prob = %f\n\n", tot_prior_prob);
+#endif
+ }
+ X_sum += *(X_values + a_index);
+ }
+
+ /* update share values */
+ for(a_index = 0; a_index < hmmp->a_size; a_index++) {
+ ed_res1 = *(X_values + a_index) / X_sum;
+ if(ed_res1 != 0.0) {
+ (msa_seq_infop->msa_seq + get_mtx_index(l, a_index, hmmp->a_size+1))->prior_share = ed_res1;
+ }
+ else {
+ (msa_seq_infop->msa_seq + get_mtx_index(l, a_index, hmmp->a_size+1))->prior_share = ed_res1;
+ }
+ }
+ /* cleanup */
+ free(logbeta_an_values);
+ free(X_values);
+ return 0;
+}
+
+
+int replacement_letter(struct letter_s *cur_letterp, struct replacement_letter_s *replacement_letters,
+ struct msa_sequences_s *msa_seq_infop, struct hmm_s *hmmp, int seq_pos)
+{
+ int i,j,k;
+ struct letter_s *repl_letter;
+ int same_letter = NO; /* stays NO when there are no replacement letters to compare against */
+
+ /* find out if letter in cur_letterp is a replacement_letter */
+ for(i = 0; i < replacement_letters->nr_rl; i++) {
+ repl_letter = replacement_letters->letters + i;
+ same_letter = YES;
+ j = 0;
+ while(*(repl_letter->letter + j) != '\0' && *(cur_letterp->letter + j) != '\0') {
+ if(*(repl_letter->letter + j) == *(cur_letterp->letter + j)) {
+ }
+ else {
+ same_letter = NO;
+ }
+ j++;
+ }
+ if(*(repl_letter->letter + j) != '\0' || *(cur_letterp->letter + j) != '\0') {
+ same_letter = NO;
+ }
+ else if(same_letter == YES) {
+ break;
+ }
+ }
+ if(same_letter == NO) {
+ return NO;
+ }
+ else { /* k represents the regular letter, i represents which repl_letter this is */
+ for(k = 0; k < hmmp->a_size; k++) {
+ (msa_seq_infop->msa_seq + get_mtx_index(seq_pos,k, hmmp->a_size+1))->nr_occurences +=
+ *(replacement_letters->probs + get_mtx_index(i,k,hmmp->a_size));
+ }
+ return YES;
+ }
+
+}
+
+
+void get_labels_multi(FILE *labelfile, struct sequences_multi_s *seq_infop, struct hmm_multi_s *hmmp, int seq_nr) {
+ char row[30000];
+ int i,j;
+
+ rewind(labelfile);
+ while(1) {
+ if(fgets(row,30000,labelfile) != NULL) {
+ if(row[0] == '/') {
+ for(i = 1; row[i] != '/' && row[i] != '\0'; i++) {
+ ((seq_infop->seqs + seq_nr)->seq_1 + (i-1))->label = row[i];
+ if(hmmp->nr_alphabets > 1) {
+ ((seq_infop->seqs + seq_nr)->seq_2 + (i-1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 2) {
+ ((seq_infop->seqs + seq_nr)->seq_3 + (i-1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 3) {
+ ((seq_infop->seqs + seq_nr)->seq_4 + (i-1))->label = row[i];
+ }
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+ rewind(labelfile);
+}
+
+void get_msa_labels_multi(FILE *labelfile, struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp) {
+ char row[30000];
+ int i,j;
+
+ rewind(labelfile);
+ while(1) {
+ if(fgets(row,30000,labelfile) != NULL) {
+ if(row[0] == '/') {
+ for(i = 1; row[i] != '/' && row[i] != '\0'; i++) {
+ (msa_seq_infop->msa_seq_1 + (*(msa_seq_infop->lead_columns_start + i-1)) * (hmmp->a_size+1))->label = row[i];
+ if(hmmp->nr_alphabets > 1) {
+ (msa_seq_infop->msa_seq_2 + (*(msa_seq_infop->lead_columns_start + i-1)) * (hmmp->a_size_2+1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 2) {
+ (msa_seq_infop->msa_seq_3 + (*(msa_seq_infop->lead_columns_start + i-1)) * (hmmp->a_size_3+1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 3) {
+ (msa_seq_infop->msa_seq_4 + (*(msa_seq_infop->lead_columns_start + i-1)) * (hmmp->a_size_4+1))->label = row[i];
+ }
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+}
+
+
+
+void get_msa_labels_all_columns_multi(FILE *labelfile, struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp) {
+ char row[30000];
+ int i,j;
+
+ rewind(labelfile);
+ while(1) {
+ if(fgets(row,30000,labelfile) != NULL) {
+ if(row[0] == '/') {
+ for(i = 1; row[i] != '/' && row[i] != '\0'; i++) {
+ (msa_seq_infop->msa_seq_1 + (i-1) * (hmmp->a_size+1))->label = row[i];
+ if(hmmp->nr_alphabets > 1) {
+ (msa_seq_infop->msa_seq_2 + (i-1) * (hmmp->a_size_2+1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 2) {
+ (msa_seq_infop->msa_seq_3 + (i-1) * (hmmp->a_size_3+1))->label = row[i];
+ }
+ if(hmmp->nr_alphabets > 3) {
+ (msa_seq_infop->msa_seq_4 + (i-1) * (hmmp->a_size_4+1))->label = row[i];
+ }
+ }
+ }
+ }
+ else {
+ break;
+ }
+ }
+}
+
+
+int update_shares_prior_multi(struct emission_dirichlet_s *em_di, struct hmm_multi_s *hmmp,
+ struct msa_sequences_multi_s *msa_seq_infop, int l, int alphabet)
+{
+ int nr_components, comps, a_index;
+ double occ_sums;
+ double q_value, scaling_factor, X_sum, *X_values, ed_res1, *logbeta_an_values;
+ double exponent, prior_prob, tot_prior_prob;
+ int a_size;
+ struct msa_letter_s *msa_seq;
+
+ /* check if this alphabet has a prior */
+ if(em_di->nr_components <= 0) {
+ //printf("em_di nr comps = %d\n",em_di->nr_components);
+ //printf("alphabet = %d\n", alphabet);
+ return 0;
+ }
+
+ /* set a_size and msa_seq according to alphabet */
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ msa_seq = msa_seq_infop->msa_seq_1;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ msa_seq = msa_seq_infop->msa_seq_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ msa_seq = msa_seq_infop->msa_seq_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ msa_seq = msa_seq_infop->msa_seq_4;
+ }
+
+ /************the update part ********************/
+ nr_components = em_di->nr_components;
+ logbeta_an_values = malloc_or_die(nr_components * sizeof(double));
+ scaling_factor = -FLT_MAX;
+ X_sum = 0.0;
+ X_values = malloc_or_die(a_size * sizeof(double));
+
+
+ /* calculate logB(alpha + n) for all components +
+ * calculate scaling factor for logB(alpha + n) - logB(alpha) */
+ for(comps = 0; comps < nr_components; comps++) {
+ ed_res1 = 0;
+ occ_sums = 0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ ed_res1 += lgamma(*(em_di->prior_values +
+ get_mtx_index(comps, a_index, a_size)) +
+ (double)((msa_seq +
+ get_mtx_index(l,a_index,a_size+1))->nr_occurences));
+ occ_sums += (msa_seq + get_mtx_index(l,a_index, a_size+1))->nr_occurences;
+ }
+ ed_res1 = ed_res1 - lgamma(*(em_di->alpha_sums + comps) + (double)(occ_sums));
+ *(logbeta_an_values + comps) = ed_res1;
+ if((ed_res1 = ed_res1 - *(em_di->logbeta_values + comps)) > scaling_factor) {
+ scaling_factor = ed_res1;
+ }
+ }
+
+ /* calculate all the Xi's */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(X_values + a_index) = 0;
+ for(comps = 0; comps < nr_components; comps++) {
+ q_value = *(em_di->q_values + comps);
+ exponent = (*(logbeta_an_values + comps) - *(em_di->logbeta_values + comps) -
+ scaling_factor);
+ prior_prob = (*(em_di->prior_values + get_mtx_index(comps,a_index, a_size)) +
+ (double)((msa_seq +
+ get_mtx_index(l,a_index,a_size+1))->nr_occurences));
+ tot_prior_prob = (*(em_di->alpha_sums + comps) + (double)(occ_sums));
+ *(X_values + a_index) += q_value * exp(exponent) * prior_prob / tot_prior_prob;
+#ifdef DEBUG_PRI
+ printf("\nscaling factor = %f\n", scaling_factor);
+ printf("a_index = %d\n", a_index);
+ printf("q_value = %f\n", q_value);
+ printf("exponent = %f\n", exponent);
+ printf("prior_prob = %f\n", prior_prob);
+ printf("tot_prior_prob = %f\n\n", tot_prior_prob);
+#endif
+ }
+ X_sum += *(X_values + a_index);
+ }
+
+ /* update share values */
+ //printf("a_size = %d\n",a_size);
+ for(a_index = 0; a_index < a_size; a_index++) {
+ ed_res1 = *(X_values + a_index) / X_sum;
+ if(ed_res1 != 0.0) {
+ (msa_seq + get_mtx_index(l, a_index, a_size+1))->prior_share = ed_res1;
+ }
+ else {
+ (msa_seq + get_mtx_index(l, a_index, a_size+1))->prior_share = ed_res1;
+ }
+ //printf("msa_seq = %f\n",(msa_seq + get_mtx_index(l, a_index, a_size+1))->prior_share);
+ }
+
+ /* cleanup */
+ free(logbeta_an_values);
+ free(X_values);
+ return 0;
+}
+
+
+int replacement_letter_multi(struct letter_s *cur_letterp, struct replacement_letter_multi_s *replacement_letters,
+ struct msa_sequences_multi_s *msa_seq_infop, struct hmm_multi_s *hmmp, int seq_pos, int alphabet)
+{
+ int i,j,k;
+ struct letter_s *repl_letter;
+ int same_letter;
+
+
+
+ /* find out if letter in cur_letterp is a replacement_letter */
+ if(alphabet == 1) {
+ if(replacement_letters->nr_rl_1 <= 0) {
+ return NO;
+ }
+
+ for(i = 0; i < replacement_letters->nr_rl_1; i++) {
+ repl_letter = replacement_letters->letters_1 + i;
+ same_letter = YES;
+ j = 0;
+ while(*(repl_letter->letter + j) != '\0' && *(cur_letterp->letter + j) != '\0') {
+ if(*(repl_letter->letter + j) == *(cur_letterp->letter + j)) {
+ }
+ else {
+ same_letter = NO;
+ }
+ j++;
+ }
+ if(*(repl_letter->letter + j) != '\0' || *(cur_letterp->letter + j) != '\0') {
+ same_letter = NO;
+ }
+ else if(same_letter == YES) {
+ break;
+ }
+ }
+ if(same_letter == NO) {
+ return NO;
+ }
+ else { /* k represents the regular letter, i represents which repl_letter this is */
+ for(k = 0; k < hmmp->a_size; k++) {
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(seq_pos,k, hmmp->a_size+1))->nr_occurences +=
+ *(replacement_letters->probs_1 + get_mtx_index(i,k,hmmp->a_size));
+ }
+ return YES;
+ }
+ }
+
+ /* find out if letter in cur_letterp is a replacement_letter */
+ if(alphabet == 2) {
+ if(replacement_letters->nr_rl_2 <= 0) {
+ return NO;
+ }
+ for(i = 0; i < replacement_letters->nr_rl_2; i++) {
+ repl_letter = replacement_letters->letters_2 + i;
+ same_letter = YES;
+ j = 0;
+ while(*(repl_letter->letter + j) != '\0' && *(cur_letterp->letter + j) != '\0') {
+ if(*(repl_letter->letter + j) == *(cur_letterp->letter + j)) {
+ }
+ else {
+ same_letter = NO;
+ }
+ j++;
+ }
+ if(*(repl_letter->letter + j) != '\0' || *(cur_letterp->letter + j) != '\0') {
+ same_letter = NO;
+ }
+ else if(same_letter == YES) {
+ break;
+ }
+ }
+ if(same_letter == NO) {
+ return NO;
+ }
+ else { /* k represents the regular letter, i represents which repl_letter this is */
+ for(k = 0; k < hmmp->a_size_2; k++) {
+ (msa_seq_infop->msa_seq_2 + get_mtx_index(seq_pos,k, hmmp->a_size_2+1))->nr_occurences +=
+ *(replacement_letters->probs_2 + get_mtx_index(i,k,hmmp->a_size_2));
+ }
+ return YES;
+ }
+ }
+
+ /* find out if letter in cur_letterp is a replacement_letter */
+ if(alphabet == 3) {
+ if(replacement_letters->nr_rl_3 <= 0) {
+ return NO;
+ }
+ for(i = 0; i < replacement_letters->nr_rl_3; i++) {
+ repl_letter = replacement_letters->letters_3 + i;
+ same_letter = YES;
+ j = 0;
+ while(*(repl_letter->letter + j) != '\0' && *(cur_letterp->letter + j) != '\0') {
+ if(*(repl_letter->letter + j) == *(cur_letterp->letter + j)) {
+ }
+ else {
+ same_letter = NO;
+ }
+ j++;
+ }
+ if(*(repl_letter->letter + j) != '\0' || *(cur_letterp->letter + j) != '\0') {
+ same_letter = NO;
+ }
+ else if(same_letter == YES) {
+ break;
+ }
+ }
+ if(same_letter == NO) {
+ return NO;
+ }
+ else { /* k represents the regular letter, i represents which repl_letter this is */
+ for(k = 0; k < hmmp->a_size_3; k++) {
+ (msa_seq_infop->msa_seq_3 + get_mtx_index(seq_pos,k, hmmp->a_size_3+1))->nr_occurences +=
+ *(replacement_letters->probs_3 + get_mtx_index(i,k,hmmp->a_size_3));
+ }
+ return YES;
+ }
+ }
+
+ /* find out if letter in cur_letterp is a replacement_letter */
+ if(alphabet == 4) {
+ if(replacement_letters->nr_rl_4 <= 0) {
+ return NO;
+ }
+ for(i = 0; i < replacement_letters->nr_rl_4; i++) {
+ repl_letter = replacement_letters->letters_4 + i;
+ same_letter = YES;
+ j = 0;
+ while(*(repl_letter->letter + j) != '\0' && *(cur_letterp->letter + j) != '\0') {
+ if(*(repl_letter->letter + j) == *(cur_letterp->letter + j)) {
+ }
+ else {
+ same_letter = NO;
+ }
+ j++;
+ }
+ if(*(repl_letter->letter + j) != '\0' || *(cur_letterp->letter + j) != '\0') {
+ same_letter = NO;
+ }
+ else if(same_letter == YES) {
+ break;
+ }
+ }
+ if(same_letter == NO) {
+ return NO;
+ }
+ else { /* k represents the regular letter, i represents which repl_letter this is */
+ for(k = 0; k < hmmp->a_size_4; k++) {
+ (msa_seq_infop->msa_seq_4 + get_mtx_index(seq_pos,k, hmmp->a_size_4+1))->nr_occurences +=
+ *(replacement_letters->probs_4 + get_mtx_index(i,k,hmmp->a_size_4));
+ }
+ return YES;
+ }
+ }
+
+ return NO;
+}
+
+int get_nr_alphabets(FILE *hmmfile)
+{
+ int MAX_LINE = 4000;
+ int i;
+ char s[4000]; /* must hold MAX_LINE bytes, since fgets() below reads up to MAX_LINE */
+
+ if(hmmfile == NULL) {
+ return 0;
+ }
+ else {
+ for(i = 0; i < 3; i++) {
+ if(fgets(s, MAX_LINE, hmmfile) != NULL) {
+
+ }
+ else {
+ return 0;
+ }
+ }
+ if(fgets(s, MAX_LINE, hmmfile) != NULL) {
+ if(strncmp(s, "NR OF", 5) == 0) {
+ return atoi(&s[17]);
+ }
+ else {
+ return 11;
+ }
+ }
+ else {
+ return 0;
+ }
+ }
+}
+
+void get_set_of_labels(struct hmm_s *hmmp)
+{
+ int i,j;
+ int nr_l;
+ char *labels;
+ int is_listed;
+
+ labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+
+ nr_l = 0;
+ for(i = 1; i < hmmp->nr_v-1; i++) {
+ is_listed = NO;
+ for(j = 0; j < nr_l; j++) {
+ if(hmmp->vertex_labels[i] == *(labels + j)) {
+ is_listed = YES;
+ break;
+ }
+ }
+
+ if(is_listed == NO) {
+ *(labels + nr_l) = hmmp->vertex_labels[i];
+ nr_l ++;
+ }
+ }
+
+ hmmp->labels = labels;
+ hmmp->nr_labels = nr_l;
+}
+
+void get_set_of_labels_multi(struct hmm_multi_s *hmmp)
+{
+ int i,j;
+ int nr_l;
+ char *labels;
+ int is_listed;
+
+ labels = (char*)(malloc_or_die(hmmp->nr_v * sizeof(char)));
+
+ nr_l = 0;
+ for(i = 1; i < hmmp->nr_v-1; i++) {
+ is_listed = NO;
+ for(j = 0; j < nr_l; j++) {
+ if(hmmp->vertex_labels[i] == *(labels + j)) {
+ is_listed = YES;
+ break;
+ }
+ }
+
+ if(is_listed == NO) {
+ *(labels + nr_l) = hmmp->vertex_labels[i];
+ nr_l ++;
+ }
+ }
+
+ hmmp->labels = labels;
+ hmmp->nr_labels = nr_l;
+}
diff --git a/modhmm0.92b/std_funcs.c.flc b/modhmm0.92b/std_funcs.c.flc
new file mode 100644
index 0000000..01d0d8b
--- /dev/null
+++ b/modhmm0.92b/std_funcs.c.flc
@@ -0,0 +1,4 @@
+
+(fast-lock-cache-data 3 (quote (17032 . 19439)) (quote nil) (quote nil) (quote (t ("^\\(\\sw+\\)[ ]*(" (1 font-lock-function-name-face)) ("^#[ ]*error[ ]+\\(.+\\)" (1 font-lock-warning-face prepend)) ("^#[ ]*\\(import\\|include\\)[ ]*\\(<[^>\"
+]*>?\\)" (2 font-lock-string-face)) ("^#[ ]*define[ ]+\\(\\sw+\\)(" (1 font-lock-function-name-face)) ("^#[ ]*\\(elif\\|if\\)\\>" ("\\<\\(defined\\)\\>[ ]*(?\\(\\sw+\\)?" nil nil (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t))) ("^#[ ]*\\(define\\|e\\(?:l\\(?:if\\|se\\)\\|ndif\\|rror\\)\\|file\\|i\\(?:f\\(?:n?def\\)?\\|nclude\\)\\|line\\|pragma\\|undef\\)\\>[ !]*\\(\\sw+\\)?" (1 font-lock-builtin-face) (2 font-lock-variable-name-face nil t)) ("\\<\\(c\\(?:har\\|o [...]
+") (point)) nil (1 font-lock-constant-face nil t))) (":" ("^[ ]*\\(\\sw+\\)[ ]*:[ ]*$" (beginning-of-line) (end-of-line) (1 font-lock-constant-face))) ("\\<\\(c\\(?:har\\|omplex\\)\\|double\\|float\\|int\\|long\\|s\\(?:hort\\|igned\\)\\|\\(?:unsigne\\|voi\\)d\\|FILE\\|\\sw+_t\\|Lisp_Object\\)\\>\\([ *&]+\\sw+\\>\\)*" (font-lock-match-c-style-declaration-item-and-skip-to-next (goto-char (or (match-beginning 2) (match-end 1))) (goto-char (match-end 1)) (1 (if (match-beginning 2) font-l [...]
diff --git a/modhmm0.92b/structs.h b/modhmm0.92b/structs.h
new file mode 100644
index 0000000..8bb42e1
--- /dev/null
+++ b/modhmm0.92b/structs.h
@@ -0,0 +1,644 @@
+#include <time.h>
+#include <stdio.h>
+
+/* various constants */
+#define END -999 /* used in to_trans_array, from_trans_array, silent_vertices, etc
+ * for marking the end of a transition group */
+#define TOT_END -1010 /* used as a more final end marker than END */
+#define DEFAULT 99.0 /* used in log_matrices to represent probability 0.0 */
+#define NO_PREV -1 /* used in viterbi to indicate no max value is found yet */
+#define NOPROB -1 /* used in forward, backward and viterbi for signaling to
+ * caller that the given sequence has 0 prob of being produced
+ * by the hmm */
+#define OK 0 /* all went as expected in the function that returns this value */
+#define SILENT 888.0 /* if emission probability = SILENT (in emissions and log emissions)
+ * then this state is a silent state */
+
+#define MAX_SEQ_NAME_SIZE 100 /* maximal size of a sequence name, if larger it will
+ * simply be truncated */
+#define MAX_GAP_SIZE 20 /* maximum gap size, meaning paths between states longer than this value will
+ * be disregarded */
+#define ENDLETTER '\0'
+
+#define NO -1
+#define YES 0
+#define OH_MY_YES 100
+
+/* hmm file types */
+#define SINGLE_HMM 0
+#define MULTI_HMM 1
+
+/* core alg scoring methods for msa */
+#define DOT_PRODUCT 1
+#define SUBST_MTX_DOT_PRODUCT 3
+#define SUBST_MTX_PRODUCT 2
+#define SUBST_MTX_DOT_PRODUCT_PRIOR 4
+#define SJOLANDER 5
+#define PICASSO 6
+#define PICASSO_SYM 7
+#define DOT_PRODUCT_PICASSO 8
+#define SJOLANDER_REVERSED 9
+
+/* multi alphabet scoring methods */
+#define JOINT_PROB 10
+#define AVERAGE 11
+
+/* training algorithms */
+#define BW_STD 0
+#define CML_STD 1
+
+/* types of modules */
+#define SINGLENODE 2
+#define CLUSTER 3
+#define SINGLELOOP 4
+#define FORWARD_STD 5
+#define FORWARD_ALT 6
+#define PROFILE7 7
+#define PROFILE9 9
+
+/* types of vertices */
+#define STANDARDV 101
+#define SILENTV 102
+#define STARTV 100
+#define LOCKEDV 104
+#define PROFILEV 103
+#define ENDV 999
+
+/* sequence formats */
+#define STANDARD 0 /* default */
+#define FASTA 1
+#define LABELED 3
+#define PROFILE 4
+#define MSA_STANDARD 2
+
+/* score algorithms */
+#define FORW_ALG 10
+#define VITERBI_ALG 11
+#define ONE_BEST_ALG 12
+
+/* max number of alphabets */
+#define MAX_NR_ALPHABETS 4
+
+/* alphabet types */
+#define DISCRETE 1
+#define CONTINUOUS 2
+
+/* model maker options */
+#define FAST 1
+#define QUERY 2
+#define HAND 3
+
+
+/* weighting schemes */
+#define NONE -2
+#define HENIKOFF 1
+
+//#define DEBUG
+//#define DEBUG2
+
+/*
+ * Structure: helix_site
+ *
+ * An individual TM helix
+ */
+
+struct helix_site {
+
+ int start;
+ int end;
+};
+
+/*
+ * Structure: helix_sites
+ *
+ * Structure to hold helices
+ * helix_count must match the actual number of helices
+ * in the structure
+ */
+
+typedef struct {
+
+ int helix_count;
+ struct helix_site *helix;
+} helix_sites;
+
+typedef helix_sites Bio__Tools__PSort__ModHMM__HelixSites;
+
+extern int helix_count(helix_sites *hSites);
+extern int helix_start(helix_sites *hSites, int helix);
+extern int helix_end(helix_sites *hSites, int helix);
+extern void helix_DESTROY(helix_sites *hSites);
+
+/* Structure: hmm_s
+ *
+ * Declaration of a general HMM
+ */
+
+struct hmm_s {
+
+ /* general info */
+ char name[100]; /* name of the HMM */
+ struct time_t *constr_t; /* time of construction */
+ char alphabet[1000]; /* the alphabet */
+ int alphabet_type; /* discrete or continuous */
+ int a_size; /* size of alphabet */
+ int nr_m; /* nr of modules */
+ int nr_v; /* nr of vertices (states) */
+ int nr_t; /* nr of transitions */
+ int nr_d; /* nr of distribution groups */
+ int nr_dt; /* total nr of distribution ties */
+ int nr_ttg; /* nr of transition tie groups */
+ int nr_tt; /* total nr of transition ties */
+ int nr_ed; /* nr of emission dirichlet structs */
+ int nr_tp;
+ int startnode; /* nr of startnode */
+ int endnode; /* nr of endnode */
+ struct replacement_letter_s *replacement_letters; /* pointer to wildcard letters */
+ struct module_s **modules; /* pointer to array of pointers to the modules */
+
+ /* data structures */
+ int *silent_vertices; /* pointer to array of the silent vertices */
+ int *locked_vertices; /* pointer to array of locked vertices */
+ char *vertex_labels; /* pointer to array of vertex labels */
+ char *labels; /* pointer to array which contains the set of vertex labels */
+ int nr_labels; /* the number of labels (set count) for this hmm */
+ double *vertex_trans_prior_scalers; /* pointer to values that decide how to weight prior distribution contribution
+ * when updating vertex parameters */
+ double *vertex_emiss_prior_scalers; /* --------------------------------- " --------------------------------------- */
+ double *transitions; /* pointer to transition probabilities matrix,
+ * from_v on vertical, to_v on horizontal */
+ double *log_transitions; /* pointer to log of trans probs, for viterbi */
+ double *emissions; /* pointer to emission probabilities matrix, letters on horizontal,
+ * vertices on vertical */
+ double *log_emissions; /* pointer to log of emiss probs, for viterbi */
+ double *tot_transitions; /* pointer to transition probability matrix storing the total prob of going from a to b, adding all
+ * paths via silent vertices */
+ double *max_log_transitions; /* pointer to log of maximal transprob for going from a to b of all possible paths via silent states*/
+ struct path_element **from_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the paths to the vertices that have
+ * a transition to that vertex (including via one or more
+ * silent states) */
+ struct path_element **to_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the paths to the vertices that it has
+ * a transition to (including via one or more silent
+ * states) */
+ int **to_silent_trans_array; /* pointer to array that for each vertex stores a pointer to the silent vertices this vertex
+ * has a DIRECT transition to */
+ struct path_element **tot_to_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the vertices that it has
+ * a transition to (including via one or more silent
+ * states) */
+ struct path_element **tot_from_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the vertices that have
+ * a transition to that vertex (including via one or more
+ * silent states) */
+ int *distrib_groups; /* a pointer to the arrays for the distribution groups,
+ * i.e. groups of vertices that have identical emission
+ * probabilities */
+ struct transition_s *trans_tie_groups; /* a pointer to the arrays for the transition tie groups,
+ * i.e. groups of transitions that have identical transition probabilities */
+ struct emission_dirichlet_s *emission_dirichlets; /* pointer to the different dirichlet
+ * mixtures */
+ struct emission_dirichlet_s **ed_ps; /* pointer to array of pointers (one for each state)
+ * to the different dirichlet mixtures */
+ double *subst_mtx; /* substitution matrix array giving the probability of two alphabet letters being related */
+};
+
+struct hmm_multi_s {
+
+ /* general info */
+ char name[100]; /* name of the HMM */
+ struct time_t *constr_t; /* time of construction */
+ int nr_alphabets;
+ char alphabet[1000]; /* the alphabet */
+ char alphabet_2[1000]; /* the second alphabet */
+ char alphabet_3[1000]; /* the third alphabet */
+ char alphabet_4[1000]; /* the fourth alphabet */
+ int alphabet_type; /* discrete or continuous */
+ int alphabet_type_2; /* discrete or continuous */
+ int alphabet_type_3; /* discrete or continuous */
+ int alphabet_type_4; /* discrete or continuous */
+ int a_size; /* size of alphabet */
+ int a_size_2; /* size of alphabet */
+ int a_size_3; /* size of alphabet */
+ int a_size_4; /* size of alphabet */
+ int nr_m; /* nr of modules */
+ int nr_v; /* nr of vertices (states) */
+ int nr_t; /* nr of transitions */
+ int nr_d; /* nr of distribution groups */
+ int nr_dt; /* total nr of distribution ties */
+ int nr_ttg; /* nr of transition tie groups */
+ int nr_tt; /* total nr of transition ties */
+ int nr_ed; /* nr of emission dirichlet structs */
+ int nr_ed_2; /* nr of emission dirichlet structs */
+ int nr_ed_3; /* nr of emission dirichlet structs */
+ int nr_ed_4; /* nr of emission dirichlet structs */
+ int nr_tp;
+ int startnode; /* nr of startnode */
+ int endnode; /* nr of endnode */
+ struct replacement_letter_multi_s *replacement_letters; /* pointer to wildcard letters */
+
+ struct module_multi_s **modules; /* pointer to array of pointers to the modules */
+
+ /* data structures */
+ int *silent_vertices; /* pointer to array of the silent vertices */
+ int *locked_vertices; /* pointer to array of locked vertices */
+ char *vertex_labels; /* pointer to array of vertex labels */
+ char *labels; /* pointer to array which contains the set of vertex labels */
+ int nr_labels; /* the number of labels (set count) for this hmm */
+ double *vertex_trans_prior_scalers; /* pointer to values that decide how to weight prior distribution contribution
+ * when updating vertex parameters */
+ double *vertex_emiss_prior_scalers; /* --------------------------------- " --------------------------------------- */
+ double *vertex_emiss_prior_scalers_2; /* --------------------------------- " --------------------------------------- */
+ double *vertex_emiss_prior_scalers_3; /* --------------------------------- " --------------------------------------- */
+ double *vertex_emiss_prior_scalers_4; /* --------------------------------- " --------------------------------------- */
+ double *transitions; /* pointer to transition probabilities matrix,
+ * from_v on vertical, to_v on horizontal */
+ double *log_transitions; /* pointer to log of trans probs, for viterbi */
+ double *emissions; /* pointer to emission probabilities matrix, letters on horizontal,
+ * vertices on vertical */
+ double *emissions_2; /* pointer to emission probabilities matrix, letters on horizontal,
+ * vertices on vertical */
+ double *emissions_3; /* pointer to emission probabilities matrix, letters on horizontal,
+ * vertices on vertical */
+ double *emissions_4; /* pointer to emission probabilities matrix, letters on horizontal,
+ * vertices on vertical */
+ double *log_emissions; /* pointer to log of emiss probs, for viterbi */
+ double *log_emissions_2; /* pointer to log of emiss probs, for viterbi */
+ double *log_emissions_3; /* pointer to log of emiss probs, for viterbi */
+ double *log_emissions_4; /* pointer to log of emiss probs, for viterbi */
+ double *tot_transitions; /* pointer to transition probability matrix storing the total prob of going from a to b, adding all
+ * paths via silent vertices */
+ double *max_log_transitions; /* pointer to log of maximal transprob for going from a to b of all possible paths via silent states*/
+ struct path_element **from_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the paths to the vertices that have
+ * a transition to that vertex (including via one or more
+ * silent states) */
+ struct path_element **to_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the paths to the vertices that it has
+ * a transition to (including via one or more silent
+ * states) */
+ int **to_silent_trans_array; /* pointer to array that for each vertex stores a pointer to the silent vertices this vertex
+ * has a DIRECT transition to */
+
+ struct path_element **tot_to_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the vertices that it has
+ * a transition to (including via one or more silent
+ * states) */
+ struct path_element **tot_from_trans_array; /* pointer to an array that for each vertex stores
+ * a pointer to the vertices that have
+ * a transition to that vertex (including via one or more
+ * silent states) */
+ int *distrib_groups; /* a pointer to the arrays for the distribution groups,
+ * i.e. groups of vertices that have identical emission
+ * probabilities */
+ struct transition_s *trans_tie_groups; /* a pointer to the arrays for the transition tie groups,
+ * i.e. groups of transitions that have identical transition probabilities */
+ struct emission_dirichlet_s *emission_dirichlets; /* pointer to the different dirichlet
+ * mixtures */
+ struct emission_dirichlet_s *emission_dirichlets_2; /* pointer to the different dirichlet
+ * mixtures */
+ struct emission_dirichlet_s *emission_dirichlets_3; /* pointer to the different dirichlet
+ * mixtures */
+ struct emission_dirichlet_s *emission_dirichlets_4; /* pointer to the different dirichlet
+ * mixtures */
+ struct emission_dirichlet_s **ed_ps; /* pointer to array of pointers (one for each state)
+ * to the different dirichlet mixtures */
+ struct emission_dirichlet_s **ed_ps_2; /* pointer to array of pointers (one for each state)
+ * to the different dirichlet mixtures */
+ struct emission_dirichlet_s **ed_ps_3; /* pointer to array of pointers (one for each state)
+ * to the different dirichlet mixtures */
+ struct emission_dirichlet_s **ed_ps_4; /* pointer to array of pointers (one for each state)
+ * to the different dirichlet mixtures */
+ double *subst_mtx; /* substitution matrix array giving the probability of two alphabet letters being related */
+ double *subst_mtx_2; /* substitution matrix array giving the probability of two alphabet letters being related */
+ double *subst_mtx_3; /* substitution matrix array giving the probability of two alphabet letters being related */
+ double *subst_mtx_4; /* substitution matrix array giving the probability of two alphabet letters being related */
+};
+
+
+/* Structure: null_model_s
+ *
+ * Declaration of null model struct
+ */
+struct null_model_s {
+ double trans_prob;
+ int a_size;
+ char alphabet[1000];
+ double *emissions;
+};
+
+/* Structure: null_model_multi_s
+ *
+ * Declaration of null model multi struct
+ */
+struct null_model_multi_s {
+ int nr_alphabets;
+ double trans_prob;
+ int a_size;
+ int a_size_2;
+ int a_size_3;
+ int a_size_4;
+ char alphabet[1000];
+ char alphabet_2[1000];
+ char alphabet_3[1000];
+ char alphabet_4[1000];
+ double *emissions;
+ double *emissions_2;
+ double *emissions_3;
+ double *emissions_4;
+
+};
+
+
+/* Structure: module_s
+ *
+ * Declaration of module
+ */
+struct module_s {
+ char name[50];
+ int type;
+ int v_type;
+ int nr_v;
+ int *vertices;
+ char priorfile_name[200];
+};
+
+
+/* Structure: module_multi_s
+ *
+ * Declaration of module
+ */
+struct module_multi_s {
+ char name[50];
+ int type;
+ int v_type;
+ int nr_v;
+ int *vertices;
+ char priorfile_name[200];
+ char priorfile_name_2[200];
+ char priorfile_name_3[200];
+ char priorfile_name_4[200];
+};
+
+/* Structure: emission_dirichlet_s
+ *
+ * Declaration of dirichlet mixture
+ */
+struct emission_dirichlet_s {
+ char name[200];
+ int nr_components;
+ int alphabet_size;
+ double *q_values; /* each component's "probability" */
+ double *alpha_sums; /* sums of the prior values */
+ double *logbeta_values; /* precalculated beta values B(alpha) for each alpha */
+ double *prior_values; /* matrix with prior values */
+};
+
+
+/* Structure: viterbi_s
+ *
+ * Declaration of viterbi matrix element
+ */
+
+struct viterbi_s {
+ double prob; /* is really a log prob */
+ int prev;
+ struct path_element *prevp;
+};
+
+/* Structure: forward_s
+ *
+ * Declaration of forward matrix element
+ */
+
+struct forward_s {
+ double prob;
+ //int distance_to_next;
+};
+
+/* Structure: backward_s
+ *
+ * Declaration of backward matrix element
+ */
+
+struct backward_s {
+ double prob;
+ //int distance_to_next;
+};
+
+/* Structure: one_best_s
+ *
+ * Declaration of one_best matrix element
+ */
+
+struct one_best_s {
+ double prob;
+ int is_updated;
+ char *labeling;
+};
+
+
+
+/* Structure: letter_prob_s
+ *
+ * Declaration of letter probability distribution element
+ */
+struct letter_prob_s {
+ char letter;
+ double share;
+};
+
+
+/* Structure: path_element
+ *
+ * Declaration of trans_array elements
+ */
+struct path_element {
+ int vertex;
+ struct path_element *next;
+};
+
+/* Structure: letter_s
+ *
+ * Declaration of alphabet symbol
+ */
+struct letter_s {
+ char letter[5];
+ char label;
+ double cont_letter;
+};
+
+/* Structure: sequence_s
+ *
+ * Declaration of sequence info holder
+ */
+struct sequence_s {
+ char name[MAX_SEQ_NAME_SIZE];
+ int length;
+ double weight;
+ struct letter_s *seq;
+};
+
+/* Structure: sequence_multi_s
+ *
+ * Declaration of sequence info holder for multiple alphabet sequences
+ */
+struct sequence_multi_s {
+ char name[MAX_SEQ_NAME_SIZE];
+ int length;
+ double weight;
+ struct letter_s *seq_1;
+ struct letter_s *seq_2;
+ struct letter_s *seq_3;
+ struct letter_s *seq_4;
+};
+
+
+/* Structure: sequences_s
+ *
+ * Declaration of struct for info about the sequences
+ */
+
+struct sequences_s {
+ int nr_seqs;
+ int longest_seq;
+ int shortest_seq;
+ int avg_seq_len;
+ struct sequence_s *seqs;
+};
+
+
+/* Structure: sequences_multi_s
+ *
+ * Declaration of struct for info about the sequences
+ */
+struct sequences_multi_s {
+ int nr_alphabets;
+ int nr_seqs;
+ int longest_seq;
+ int shortest_seq;
+ int avg_seq_len;
+ struct sequence_multi_s *seqs;
+};
+
+
+/* Structure: msa_letter_s
+ *
+ * Declaration of struct for MSA_letter
+ */
+struct msa_letter_s {
+ double nr_occurences; /* non-integer occurrences exist */
+ double share;
+ double prior_share;
+ char label;
+ char query_letter[5];
+ double cont_letter;
+};
+
+/* Structure: msa_sequences_s
+ *
+ * Declaration of struct for msa sequence info
+ */
+struct msa_sequences_s {
+ int nr_seqs;
+ int msa_seq_length;
+ int nr_lead_columns;
+ struct msa_letter_s *msa_seq;
+ int **gaps;
+ int *lead_columns_start;
+ int *lead_columns_end;
+ double *gap_shares;
+};
+
+/* Structure: msa_sequences_multi_s
+ *
+ * Declaration of struct for msa sequence info
+ */
+struct msa_sequences_multi_s {
+ int nr_alphabets;
+ int nr_seqs;
+ int msa_seq_length;
+ int nr_lead_columns;
+ struct msa_letter_s *msa_seq_1;
+ struct msa_letter_s *msa_seq_2;
+ struct msa_letter_s *msa_seq_3;
+ struct msa_letter_s *msa_seq_4;
+ int **gaps;
+ int *lead_columns_start;
+ int *lead_columns_end;
+ double *gap_shares;
+};
+
+
+/* Structure: replacement_letter_s
+ *
+ * Declaration of struct for replacement letter info
+ */
+struct replacement_letter_s {
+ int nr_rl;
+ struct letter_s *letters;
+ double *probs;
+};
+
+/* Structure: replacement_letter_multi_s
+ *
+ * Declaration of struct for replacement letter info
+ */
+struct replacement_letter_multi_s {
+ int nr_alphabets;
+ int nr_rl_1;
+ int nr_rl_2;
+ int nr_rl_3;
+ int nr_rl_4;
+ struct letter_s *letters_1;
+ double *probs_1;
+ struct letter_s *letters_2;
+ double *probs_2;
+ struct letter_s *letters_3;
+ double *probs_3;
+ struct letter_s *letters_4;
+ double *probs_4;
+ };
+
+/* Structure: aa_distrib_mtx_s
+ *
+ * Declaration of struct for amino acid distribution matrix
+ */
+struct aa_distrib_mtx_s {
+ int a_size;
+ double *inside_values;
+ double *outside_values;
+ double *membrane_values;
+};
+
+
+/* Structure: v_list_element_s
+ *
+ * Declaration of struct for n_best v_list elements
+ */
+struct v_list_element_s {
+ int vertex; /* vertex nr */
+ int address; /* pointer address of this vertex's labeling */
+};
+
+/* Structure: transition_s
+ *
+ * Declaration of struct for transition (used for transition ties)
+ */
+struct transition_s {
+ int from_v;
+ int to_v;
+};
+
+struct align_mtx_element_s {
+ int score;
+ char last;
+};
+
+struct alignment_s {
+ int target_pos;
+ int template_pos;
+ char target_letter[5];
+ char template_letter[5];
+};
diff --git a/modhmm0.92b/training_algorithms_multialpha.c b/modhmm0.92b/training_algorithms_multialpha.c
new file mode 100644
index 0000000..bf90646
--- /dev/null
+++ b/modhmm0.92b/training_algorithms_multialpha.c
@@ -0,0 +1,3629 @@
+#include <stdio.h>
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include <float.h>
+//#include <double.h>
+
+
+#include "structs.h"
+#include "funcs.h"
+
+#define INNER_BW_THRESHOLD 0.1
+#define OUTER_BW_THRESHOLD 0.1
+#define CML_THRESHOLD 0.2
+#define TRUE 1
+#define FALSE -1
+#define STARTRANDOM 0
+
+#define REST_LETTER_INDEX 0.5
+
+/* for simulated annealing */
+#define INIT_TEMP 1.0
+#define INIT_COOL 0.8
+#define ANNEAL_THRESHOLD 0.1
+#define DONE -10
+#define ACTIVE 25
+
+/* for transition matrix pseudo count */
+#define TRANSITION_PSEUDO_VALUE 0.1
+#define EMISSION_PSEUDO_VALUE 1.0
+
+//#define DEBUG_BW
+//#define DEBUG_BW_TRANS
+//#define DEBUG_BW2
+//#define DEBUG_EXTBW
+//#define DEBUG_PRIORS
+//#define DEBUG_Tkl
+
+extern int verbose;
+
+void update_emiss_mtx_std_multi(struct hmm_multi_s*, double*, int, int);
+void update_emiss_mtx_std_continuous_multi(struct hmm_multi_s*, double*, int, int);
+void update_emiss_mtx_pseudocount_multi(struct hmm_multi_s*, double*, int, int);
+void update_emiss_mtx_prior_multi(struct hmm_multi_s*, double*, int, struct emission_dirichlet_s*, int);
+void update_trans_mtx_std_multi(struct hmm_multi_s*, double*, int);
+void update_trans_mtx_pseudocount_multi(struct hmm_multi_s*, double*, int);
+void update_tot_trans_mtx_multi(struct hmm_multi_s*);
+void recalculate_emiss_expectations_multi(struct hmm_multi_s*, double*, int);
+void recalculate_trans_expectations_multi(struct hmm_multi_s *hmmp, double *T);
+double add_Eka_contribution_multi(struct hmm_multi_s*, struct letter_s*, struct forward_s*,
+ struct backward_s*, int, int, int);
+double add_Eka_contribution_continuous_multi(struct hmm_multi_s*, struct letter_s*, struct forward_s*,
+ struct backward_s*, int, int, int, double*, int);
+double add_Eka_contribution_msa_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, struct forward_s*,
+ struct backward_s*, int, int, int, int);
+void add_Tkl_contribution_multi(struct hmm_multi_s*, struct letter_s*, struct letter_s*, struct letter_s*,
+ struct letter_s*, struct forward_s*,
+ struct backward_s*, double*, int, int,
+ struct path_element*, int, int, int, int, double*, int use_labels, int multi_scoring_method);
+void add_Tkl_contribution_msa_multi(struct hmm_multi_s*, struct msa_sequences_multi_s*, struct forward_s*,
+ struct backward_s*, double*,
+ int, int, struct path_element*, double*, int use_gap_shares, int use_lead_columns, int i,
+ int use_labels, int scoring_method, int normalize, int multi_scoring_method,
+ double *aa_freqs, double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4);
+void random_walk_multi(struct hmm_multi_s*, double*, double*, char**, int, int, int);
+int silent_state_multi(int, struct hmm_multi_s*);
+void anneal_E_matrix_multi(double temperature, double *E, struct hmm_multi_s *hmmp, int alphabet);
+void anneal_T_matrix_multi(double temperature, double *T, struct hmm_multi_s *hmmp);
+void calculate_TE_contributions_multi(double *T, double *E, double *E_2, double *E_3, double *E_4,
+ double *T_lab, double *E_lab, double *E_lab_2, double *E_lab_3, double *E_lab_4,
+ double *T_ulab, double *E_ulab, double *E_ulab_2, double *E_ulab_3, double *E_ulab_4,
+ double *emissions, double *emissions_2, double *emissions_3, double *emissions_4,
+ double *transitions, int nr_v, int a_size, int a_size_2, int a_size_3, int a_size_4,
+ double *emiss_prior_scalers, double *emiss_prior_scalers_2, double *emiss_prior_scalers_3,
+ double *emiss_prior_scalers_4, int rd, int nr_alphabets);
+void add_to_E_multi(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p, int k, int a_size, int normalize,
+ double *subst_mtx, int alphabet, int scoring_method, int use_nr_occ, int alphabet_type, double *emissions);
+
+
+
+/************** baum-welch training algorithm ************************/
+
+
+
+
+/* implementation of the baum-welch training algorithm using dirichlet prior mixture to
+ * calculate update of emission (and transition) matrices */
+void baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct sequence_multi_s *seqsp, int nr_seqs, int annealing, int use_labels,
+ int use_transition_pseudo_counts, int use_emission_pseudo_counts,
+ int multi_scoring_method, int use_prior)
+{
+ double *T, *E, *E_2, *E_3, *E_4; /* matrices for the estimated number of times
+ * each transition (T) and emission (E) is used */
+ struct forward_s *forw_mtx; /* forward matrix */
+ struct backward_s *backw_mtx; /* backward matrix */
+ double *forw_scale; /* scaling array */
+ int s,p,k,l,a,d; /* loop counters, s loops over the sequences, p over the
+ * positions in the sequence, k and l over states, a over the alphabet
+ * and d over the distribution groups */
+ struct path_element *lp;
+ double t_res, t_res_1, t_res_2, t_res_3; /* for temporary results */
+ double t_res_4, t_res_5, t_res_6; /* for temporary results */
+ double e_res, e_res_1, e_res_2, e_res_3; /* for temporary results */
+
+ int seq_len; /* length of the sequences */
+ int a_index, a_index_2, a_index_3, a_index_4; /* holds the current letter's index in the alphabet */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to current sequence */
+ double old_log_likelihood, new_log_likelihood; /* to calculate when to stop */
+ double likelihood; /* temporary variable for calculating likelihood of a sequence */
+ int max_nr_iterations, iteration;
+
+ /* dirichlet prior variables */
+ struct emission_dirichlet_s *priorp;
+ struct emission_dirichlet_s *priorp_2;
+ struct emission_dirichlet_s *priorp_3;
+ struct emission_dirichlet_s *priorp_4;
+
+ /* simulated annealing variables */
+ double temperature;
+ double cooling_factor;
+ int annealing_status;
+
+
+ /* some initialization */
+ old_log_likelihood = 9999.0;
+ new_log_likelihood = 9999.0;
+ max_nr_iterations = 20;
+ iteration = 1;
+ if(annealing == YES) {
+ temperature = INIT_TEMP;
+ cooling_factor = INIT_COOL;
+ annealing_status = ACTIVE;
+ }
+ else {
+ annealing_status = DONE;
+ }
+
+
+ do {
+ T = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+ if(hmmp->alphabet_type == DISCRETE) {
+ E = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ }
+ else {
+ E = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size + 1) *
+ sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ }
+ else {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_2 + 1) *
+ sizeof(double)));
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ }
+ else {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_3 + 1) *
+ sizeof(double)));
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ }
+ else {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_4 + 1) *
+ sizeof(double)));
+ }
+ }
+
+ old_log_likelihood = new_log_likelihood;
+ new_log_likelihood = 0.0;
+ for(s = 0; s < nr_seqs; s++) {
+ /* Convert sequence to 1...L for easier indexing */
+ seq_len = (seqsp + s)->length;
+ seq = (struct letter_s*) (malloc_or_die((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, (seqsp + s)->seq_1, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) (malloc_or_die((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, (seqsp + s)->seq_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, (seqsp + s)->seq_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, (seqsp + s)->seq_4, seq_len * sizeof(struct letter_s));
+ }
+
+
+ /* calculate forward and backward matrices */
+ forward_multi(hmmp, (seqsp + s)->seq_1,(seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &forw_mtx, &forw_scale, use_labels, multi_scoring_method);
+ backward_multi(hmmp, (seqsp + s)->seq_1, (seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &backw_mtx, forw_scale, use_labels, multi_scoring_method);
+ /* memory for forw_mtx, forw_scale and
+ * backw_mtx is allocated in the functions */
+
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood += likelihood;
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ for(p = 1; p <= seq_len; p++) {
+
+ /* get alphabet index for c (add replacement letter stuff here) */
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ }
+ if(hmmp->nr_alphabets > 1 && hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ }
+ if(hmmp->nr_alphabets > 2 && hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3 && hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ }
+
+ /* add T[k][l] contribution for this sequence */
+ add_Tkl_contribution_multi(hmmp, seq+1, seq_2+1, seq_3+1, seq_4+1, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, a_index, a_index_2, a_index_3, a_index_4, T, use_labels,
+ multi_scoring_method);
+
+ /* continuous? */
+
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ for(p = 1; p <= seq_len; p++) {
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ *(E + get_mtx_index(k, a_index, hmmp->a_size)) +=
+ add_Eka_contribution_multi(hmmp, seq+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ else {
+ add_Eka_contribution_continuous_multi(hmmp, seq+1, forw_mtx, backw_mtx, p, k, multi_scoring_method, E, 1);
+ }
+ if(hmmp->nr_alphabets > 1 && hmmp->alphabet_type_2 == DISCRETE) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ *(E_2 + get_mtx_index(k, a_index_2, hmmp->a_size_2)) +=
+ add_Eka_contribution_multi(hmmp, seq_2+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ else if(hmmp->nr_alphabets > 1) {
+ add_Eka_contribution_continuous_multi(hmmp, seq_2+1, forw_mtx, backw_mtx, p, k, multi_scoring_method, E_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2 && hmmp->alphabet_type_3 == DISCRETE) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ *(E_3 + get_mtx_index(k, a_index_3, hmmp->a_size_3)) +=
+ add_Eka_contribution_multi(hmmp, seq_3+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ else if(hmmp->nr_alphabets > 2) {
+ add_Eka_contribution_continuous_multi(hmmp, seq_3+1, forw_mtx, backw_mtx, p, k, multi_scoring_method, E_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3 && hmmp->alphabet_type_4 == DISCRETE) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ *(E_4 + get_mtx_index(k, a_index_4, hmmp->a_size_4)) +=
+ add_Eka_contribution_multi(hmmp, seq_4+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ else if(hmmp->nr_alphabets > 3) {
+ add_Eka_contribution_continuous_multi(hmmp, seq_4+1, forw_mtx, backw_mtx, p, k, multi_scoring_method, E_4, 4);
+ }
+ }
+ }
+ }
+ /* some garbage collection */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+ }
+ if(verbose == YES) {
+ printf("log likelihood rd %d: %f\n", iteration, new_log_likelihood);
+ }
+
+#ifdef DEBUG_BW2
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E);
+ //dump_E_matrix(hmmp->nr_v, hmmp->a_size_2 + 1, E_2);
+#endif
+
+ /* check if likelihood change is small enough, then we are done */
+ if(fabs(new_log_likelihood - old_log_likelihood) < INNER_BW_THRESHOLD && annealing_status == DONE) {
+ break;
+ }
+
+ /* if simulated annealing is used, scramble results in E and T matrices */
+ if(annealing == YES && temperature > ANNEAL_THRESHOLD) {
+ anneal_E_matrix_multi(temperature, E, hmmp, 1);
+ if(hmmp->nr_alphabets > 1) {
+ anneal_E_matrix_multi(temperature, E_2, hmmp, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ anneal_E_matrix_multi(temperature, E_3, hmmp, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ anneal_E_matrix_multi(temperature, E_4, hmmp, 4);
+ }
+ anneal_T_matrix_multi(temperature, T, hmmp);
+ temperature = temperature * cooling_factor;
+ }
+
+ if(temperature < ANNEAL_THRESHOLD) {
+ annealing_status = DONE;
+ }
+
+ /* recalculate emission expectations according to distribution groups
+ * by simply taking the mean of the expected emissions within this group
+ * for each letter in the alphabet and replacing each expectation for the
+ * letter with this value for every member of the distribution group */
+ recalculate_emiss_expectations_multi(hmmp, E, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_4, 4);
+ }
+
+ /* recalculate transition expectations for tied transitions according
+ * to the same scheme as for emission distribution groups */
+ recalculate_trans_expectations_multi(hmmp, T);
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from-vertex */ {
+ /* update transition matrix */
+ if(use_transition_pseudo_counts == YES) {
+ update_trans_mtx_pseudocount_multi(hmmp, T, k);
+ }
+ else {
+ update_trans_mtx_std_multi(hmmp, T, k);
+ }
+
+
+#ifdef DEBUG_PRIORS
+ printf("Starting emission matrix update\n");
+#endif
+
+ /* update emission matrix using Dirichlet prior files if they exist */
+ priorp = *(hmmp->ed_ps + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type == DISCRETE) {
+#ifdef DEBUG_PRIORS
+ printf("k = %d\n", k);
+ printf("value = %p\n", (void*)priorp);
+#endif
+ update_emiss_mtx_prior_multi(hmmp, E, k, priorp, 1);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E, k, 1);
+ }
+ else if(hmmp->alphabet_type == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E, k, 1);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E, k, 1);
+ }
+
+ if(hmmp->nr_alphabets > 1) {
+ priorp = *(hmmp->ed_ps_2 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_2 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_2, k, priorp, 2);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_2 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_2, k, 2);
+ }
+ else if(hmmp->alphabet_type_2 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_2, k, 2);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_2, k, 2);
+ }
+ }
+
+
+ if(hmmp->nr_alphabets > 2) {
+ priorp = *(hmmp->ed_ps_3 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_3 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_3, k, priorp, 3);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_3 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_3, k, 3);
+ }
+ else if(hmmp->alphabet_type_3 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_3, k, 3);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_3, k, 3);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ priorp = *(hmmp->ed_ps_4 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_4 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_4, k, priorp, 4);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_4 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_4, k, 4);
+ }
+ else if(hmmp->alphabet_type_4 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_4, k, 4);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_4, k, 4);
+ }
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+
+ /* some garbage collection */
+ free(E);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_4);
+ }
+ free(T);
+ max_nr_iterations--;
+ iteration++;
+ }
+ while(max_nr_iterations > 0); /* break condition is also when log_likelihood_difference is
+ * smaller than THRESHOLD, checked inside the loop for
+ * better efficiency */
+
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+
+}
+
+/* implementation of the baum-welch training algorithm using dirichlet prior mixture to
+ * calculate update of emission (and transition) matrices and using a multiple sequence
+ * alignment as the training sequence */
+void msa_baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop, int nr_seqs,
+ int annealing, int use_gap_shares, int use_lead_columns, int use_labels,
+ int use_transition_pseudo_counts, int use_emission_pseudo_counts, int normalize,
+ int scoring_method, int use_nr_occ, int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4, int use_prior)
+{
+ struct msa_sequences_multi_s *msa_seq_infop_start;
+ double *T, *E, *E_2, *E_3, *E_4; /* matrices for the estimated number of times
+ * each transition (T) and emission (E) is used */
+ struct forward_s *forw_mtx; /* forward matrix */
+ struct backward_s *backw_mtx; /* backward matrix */
+ double *forw_scale; /* scaling array */
+ int s,p,k,l,a,d,i; /* loop counters, s loops over the sequences, p over the
+ * positions in the sequence, k and l over states, a over the alphabet,
+ * d over the distribution groups and i is a slush variable */
+ struct path_element *lp;
+ double t_res, t_res_1, t_res_2, t_res_3; /* for temporary results */
+ double t_res_4, t_res_5, t_res_6; /* for temporary results */
+ double e_res, e_res_1, e_res_2, e_res_3; /* for temporary results */
+
+ int seq_len; /* length of the sequences */
+ int a_index; /* holds the current letter's index in the alphabet */
+ double old_log_likelihood, new_log_likelihood; /* to calculate when to stop */
+ double likelihood; /* temporary variable for calculating likelihood of a sequence */
+ int max_nr_iterations, iteration;
+ double Eka_base;
+ int query_index; /* index of query seq */
+
+ /* dirichlet prior variables */
+ struct emission_dirichlet_s *priorp;
+ struct emission_dirichlet_s *priorp_2;
+ struct emission_dirichlet_s *priorp_3;
+ struct emission_dirichlet_s *priorp_4;
+
+ /* simulated annealing variables */
+ double temperature;
+ double cooling_factor;
+ int annealing_status;
+
+ /* help variables for add_to_E */
+ int alphabet_nr;
+ int alphabet;
+ int a_size;
+ double *E_cur;
+ double *subst_mtx;
+ struct msa_letter_s *msa_seq;
+ double *tmp_emissions;
+ int alphabet_type;
+
+ /* remember start of sequences */
+ msa_seq_infop_start = msa_seq_infop;
+
+ old_log_likelihood = 9999.0;
+ new_log_likelihood = 9999.0;
+ max_nr_iterations = 20;
+ iteration = 1;
+ if(annealing == YES) {
+ temperature = INIT_TEMP;
+ cooling_factor = INIT_COOL;
+ annealing_status = ACTIVE;
+ }
+ else {
+ annealing_status = DONE;
+ }
+
+#ifdef DEBUG_BW2
+ check_for_corrupt_values(hmmp->nr_v, hmmp->a_size, hmmp->emissions , "emiss");
+ check_for_corrupt_values(hmmp->nr_v, hmmp->nr_v, hmmp->transitions , "trans");
+#endif
+
+ do {
+#ifdef DEBUG_BW2
+ printf("starting baum-welch loop\n");
+#endif
+ /* initialize matrices */
+ T = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v *
+ sizeof(double)));
+
+ if(hmmp->alphabet_type == DISCRETE) {
+ E = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size *
+ sizeof(double)));
+ }
+ else {
+ E = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size + 1) *
+ sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 *
+ sizeof(double)));
+ }
+ else {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_2 + 1) *
+ sizeof(double)));
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 *
+ sizeof(double)));
+ }
+ else {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_3 + 1) *
+ sizeof(double)));
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 *
+ sizeof(double)));
+ }
+ else {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * (hmmp->a_size_4 + 1) *
+ sizeof(double)));
+ }
+ }
+
+ /* reset sequence pointer */
+ msa_seq_infop = msa_seq_infop_start;
+
+
+ old_log_likelihood = new_log_likelihood;
+ new_log_likelihood = 0.0;
+
+ for(s = 0; s < nr_seqs; s++) {
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+
+ /* calculate forward and backward matrices
+ * memory for forw_mtx, forw_scale and
+ * backw_mtx is allocated in the functions */
+#ifdef DEBUG_BW2
+ printf("running forward for seq %d\n", s + 1);
+#endif
+ msa_forward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, NO, &forw_mtx, &forw_scale, use_labels, normalize,
+ scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+#ifdef DEBUG_BW2
+ dump_forward_matrix(seq_len + 2, hmmp->nr_v, forw_mtx);
+ printf("running backward\n");
+#endif
+ msa_backward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, &backw_mtx, forw_scale, use_labels, normalize,
+ scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+#ifdef DEBUG_BW2
+ check_for_corrupt_values(seq_len + 2, hmmp->nr_v, forw_mtx , "F");
+ check_for_corrupt_values(seq_len + 2, hmmp->nr_v, backw_mtx , "B");
+ printf("done with backward\n");
+#endif
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW2
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood += likelihood;
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ i = 0;
+ while(1) {
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+ /* add T[k][l] contribution for the msa-sequence */
+ add_Tkl_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, T, use_gap_shares, use_lead_columns, i, use_labels, scoring_method,
+ normalize, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ i = 0;
+ while(1) {
+
+ /* get correct index for this letter column */
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+
+ /* get basic scoring result */
+ Eka_base = add_Eka_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx, p, k,
+ i, use_lead_columns);
+
+ /* loop over the alphabets */
+ for(alphabet_nr = 1; alphabet_nr <= hmmp->nr_alphabets; alphabet_nr++) {
+ if(alphabet_nr == 1) {
+ alphabet = hmmp->alphabet;
+ subst_mtx = hmmp->subst_mtx;
+ a_size = hmmp->a_size;
+ E_cur = E;
+ msa_seq = msa_seq_infop->msa_seq_1;
+ alphabet_type = hmmp->alphabet_type;
+ tmp_emissions = hmmp->emissions;
+ }
+ else if(alphabet_nr == 2) {
+ alphabet = hmmp->alphabet_2;
+ subst_mtx = hmmp->subst_mtx_2;
+ a_size = hmmp->a_size_2;
+ E_cur = E_2;
+ msa_seq = msa_seq_infop->msa_seq_2;
+ alphabet_type = hmmp->alphabet_type_2;
+ tmp_emissions = hmmp->emissions_2;
+ }
+ else if(alphabet_nr == 3) {
+ alphabet = hmmp->alphabet_3;
+ subst_mtx = hmmp->subst_mtx_3;
+ a_size = hmmp->a_size_3;
+ E_cur = E_3;
+ msa_seq = msa_seq_infop->msa_seq_3;
+ alphabet_type = hmmp->alphabet_type_3;
+ tmp_emissions = hmmp->emissions_3;
+ }
+ else if(alphabet_nr == 4) {
+ alphabet = hmmp->alphabet_4;
+ subst_mtx = hmmp->subst_mtx_4;
+ a_size = hmmp->a_size_4;
+ E_cur = E_4;
+ msa_seq = msa_seq_infop->msa_seq_4;
+ alphabet_type = hmmp->alphabet_type_4;
+ tmp_emissions = hmmp->emissions_4;
+ }
+
+ /* get result and add to matrix according to scoring method */
+ add_to_E_multi(E_cur, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet, scoring_method, use_nr_occ, alphabet_type, tmp_emissions);
+ }
+ /* update loop index, check if we are done */
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ }
+ }
+
+ /* some garbage collection */
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+
+ msa_seq_infop++;
+ }
+
+ if(verbose == YES) {
+ printf("log likelihood rd %d: %f\n", iteration, new_log_likelihood);
+ }
+
+
+
+#ifdef DEBUG_BW2
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E);
+ if(hmmp->nr_alphabets > 1) {
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size_2 + 1, E_2);
+ }
+#endif
+ /* check if likelihood change is small enough, then we are done */
+ if(fabs(new_log_likelihood - old_log_likelihood) < INNER_BW_THRESHOLD && annealing_status == DONE) {
+ break;
+ }
+
+ /* if simulated annealing is used, scramble results in E and T matrices */
+ if(annealing == YES && temperature > ANNEAL_THRESHOLD) {
+ anneal_E_matrix_multi(temperature, E, hmmp, 1);
+ if(hmmp->nr_alphabets > 1) {
+ anneal_E_matrix_multi(temperature, E_2, hmmp, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ anneal_E_matrix_multi(temperature, E_3, hmmp, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ anneal_E_matrix_multi(temperature, E_4, hmmp, 4);
+ }
+ anneal_T_matrix_multi(temperature, T, hmmp);
+ temperature = temperature * cooling_factor;
+ }
+
+ if(temperature < ANNEAL_THRESHOLD) {
+ annealing_status = DONE;
+ }
+
+ /* recalculate emission expectations according to distribution groups
+ * by simply taking the mean of the expected emissions within this group
+ * for each letter in the alphabet and replacing each expectation for the
+ * letter with this value for every member of the distribution group */
+ recalculate_emiss_expectations_multi(hmmp, E, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_4, 4);
+ }
+
+ /* recalculate transition expectations for tied transitions according
+ * to the same scheme as for emission distribution groups */
+ recalculate_trans_expectations_multi(hmmp, T);
+
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from-vertex */ {
+ /* update transition matrix */
+ if(use_transition_pseudo_counts == YES) {
+ update_trans_mtx_pseudocount_multi(hmmp, T, k);
+ }
+ else {
+ update_trans_mtx_std_multi(hmmp, T, k);
+ }
+
+
+#ifdef DEBUG_PRIORS
+ printf("Starting emission matrix update\n");
+#endif
+
+
+ /* update emission matrix using Dirichlet prior files if they exist */
+ priorp = *(hmmp->ed_ps + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type == DISCRETE) {
+#ifdef DEBUG_PRIORS
+ printf("k = %d\n", k);
+ printf("value = %p\n", (void *)priorp);
+#endif
+ update_emiss_mtx_prior_multi(hmmp, E, k, priorp, 1);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E, k, 1);
+ }
+ else if(hmmp->alphabet_type == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E, k, 1);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E, k, 1);
+ }
+
+
+ if(hmmp->nr_alphabets > 1) {
+ priorp = *(hmmp->ed_ps_2 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_2 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_2, k, priorp, 2);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_2 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_2, k, 2);
+ }
+ else if(hmmp->alphabet_type_2 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_2, k, 2);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_2, k, 2);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ priorp = *(hmmp->ed_ps_3 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_3 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_3, k, priorp, 3);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_3 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_3, k, 3);
+ }
+ else if(hmmp->alphabet_type_3 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_3, k, 3);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_3, k, 3);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ priorp = *(hmmp->ed_ps_4 + k);
+ if(priorp != NULL && use_prior == YES && hmmp->alphabet_type_4 == DISCRETE) {
+ update_emiss_mtx_prior_multi(hmmp, E_4, k, priorp, 4);
+ }
+ else if(use_emission_pseudo_counts == YES && hmmp->alphabet_type_4 == DISCRETE)
+ /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_4, k, 4);
+ }
+ else if(hmmp->alphabet_type_4 == DISCRETE) {
+ update_emiss_mtx_std_multi(hmmp, E_4, k, 4);
+ }
+ else {
+ update_emiss_mtx_std_continuous_multi(hmmp, E_4, k, 4);
+ }
+ }
+ }
+
+#ifdef DEBUG_BW2
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+ if(hmmp->nr_alphabets > 1) {
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size_2, hmmp->emissions_2);
+ }
+#endif
+
+ /* some garbage collection */
+ free(E);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_4);
+ }
+ free(T);
+ max_nr_iterations--;
+ iteration++;
+#ifdef DEBUG_BW2
+ printf("end of baum-welch-loop\n");
+#endif
+ }
+ while(max_nr_iterations > 0); /* break condition is also when log_likelihood_difference is
+ * smaller than THRESHOLD, checked inside the loop for
+ * better efficiency */
+
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+}
+
+
+/* implementation of the conditional maximum likelihood version of the
+ * baum-welch training algorithm using dirichlet prior mixture to
+ * calculate update of emission (and transition) matrices */
+void extended_baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct sequence_multi_s *seqsp, int nr_seqs,
+ int annealing, int use_labels,
+ int use_transition_pseudo_counts, int use_emission_pseudo_counts,
+ int multi_scoring_method, int use_prior)
+{
+ double *T, *E, *E_2, *E_3, *E_4; /* matrices for the estimated number of times
+ * each transition (T) and emission (E) is used */
+ double *T_lab, *E_lab, *E_lab_2, *E_lab_3, *E_lab_4, *T_ulab, *E_ulab, *E_ulab_2, *E_ulab_3, *E_ulab_4;
+ struct forward_s *forw_mtx; /* forward matrix */
+ struct backward_s *backw_mtx; /* backward matrix */
+ double *forw_scale; /* scaling array */
+ int s,p,k,l,a,d; /* loop counters, s loops over the sequences, p over the
+ * positions in the sequence, k and l over states, a over the alphabet
+ * and d over the distribution groups */
+ struct path_element *lp;
+ double t_res, t_res_1, t_res_2, t_res_3; /* for temporary results */
+ double t_res_4, t_res_5, t_res_6; /* for temporary results */
+ double e_res, e_res_1, e_res_2, e_res_3; /* for temporary results */
+ double t_res_ulab;
+
+ int seq_len; /* length of the sequences */
+ int a_index, a_index_2, a_index_3, a_index_4; /* hold the current letter's index in each alphabet */
+ struct letter_s *seq, *seq_2, *seq_3, *seq_4; /* pointer to current sequence */
+ double old_log_likelihood_lab, new_log_likelihood_lab;
+ double old_log_likelihood_ulab, new_log_likelihood_ulab; /* to calculate when to stop */
+ double likelihood; /* temporary variable for calculating likelihood of a sequence */
+ int max_nr_iterations, iteration;
+
+ /* dirichlet prior variables */
+ struct emission_dirichlet_s *priorp;
+
+ /* simulated annealing variables */
+ double temperature;
+ double cooling_factor;
+ int annealing_status;
+
+ /* some initialization */
+ old_log_likelihood_lab = 9999.0;
+ new_log_likelihood_lab = 9999.0;
+ old_log_likelihood_ulab = 9999.0;
+ new_log_likelihood_ulab = 9999.0;
+ max_nr_iterations = 70;
+ iteration = 1;
+ if(annealing == YES) {
+ temperature = INIT_TEMP;
+ cooling_factor = INIT_COOL;
+ annealing_status = ACTIVE;
+ }
+ else {
+ annealing_status = DONE;
+ }
+
+
+ do {
+ /* allocate per iteration matrices */
+ T = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ T_lab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E_lab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ T_ulab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E_ulab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+
+ if(hmmp->nr_alphabets > 1) {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ E_lab_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ E_ulab_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ E_lab_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ E_ulab_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ E_lab_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ E_ulab_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ }
+
+ old_log_likelihood_ulab = new_log_likelihood_ulab;
+ new_log_likelihood_ulab = 0.0;
+ old_log_likelihood_lab = new_log_likelihood_lab;
+ new_log_likelihood_lab = 0.0;
+ for(s = 0; s < nr_seqs; s++) {
+ /* Convert sequence to 1...L for easier indexing */
+ seq_len = (seqsp + s)->length;
+ seq = (struct letter_s*) (malloc_or_die((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq+1, (seqsp + s)->seq_1, seq_len * sizeof(struct letter_s));
+ if(hmmp->nr_alphabets > 1) {
+ seq_2 = (struct letter_s*) (malloc_or_die((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_2+1, (seqsp + s)->seq_2, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ seq_3 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_3+1, (seqsp + s)->seq_3, seq_len * sizeof(struct letter_s));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ seq_4 = (struct letter_s*) malloc_or_die(((seq_len + 2) * sizeof(struct letter_s)));
+ memcpy(seq_4+1, (seqsp + s)->seq_4, seq_len * sizeof(struct letter_s));
+ }
+
+
+ /* calculate forward and backward matrices */
+ forward_multi(hmmp, (seqsp + s)->seq_1,(seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &forw_mtx, &forw_scale, NO, multi_scoring_method);
+ backward_multi(hmmp, (seqsp + s)->seq_1, (seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &backw_mtx, forw_scale, NO, multi_scoring_method);
+
+ /* memory for forw_mtx, scale_mtx and
+ * backw_mtx is allocated in the functions */
+
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood_ulab += likelihood;
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ for(p = 1; p <= seq_len; p++) {
+
+ /* get alphabet index for the current letter */
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ if(hmmp->nr_alphabets > 1) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ }
+
+ /* add T[k][l] contribution for this sequence */
+ add_Tkl_contribution_multi(hmmp, seq+1, seq_2+1, seq_3+1, seq_4+1, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, a_index, a_index_2, a_index_3, a_index_4, T_ulab, NO,
+ multi_scoring_method);
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ for(p = 1; p <= seq_len; p++) {
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ if(hmmp->nr_alphabets > 1) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ }
+ /* get result and add to matrix */
+ *(E_ulab + get_mtx_index(k, a_index, hmmp->a_size)) +=
+ add_Eka_contribution_multi(hmmp, seq+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ if(hmmp->nr_alphabets > 1) {
+ *(E_ulab_2 + get_mtx_index(k, a_index_2, hmmp->a_size_2)) +=
+ add_Eka_contribution_multi(hmmp, seq_2+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ *(E_ulab_3 + get_mtx_index(k, a_index_3, hmmp->a_size_3)) +=
+ add_Eka_contribution_multi(hmmp, seq_3+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ *(E_ulab_4 + get_mtx_index(k, a_index_4, hmmp->a_size_4)) +=
+ add_Eka_contribution_multi(hmmp, seq_4+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ }
+ }
+ }
+
+ t_res_ulab = (forw_mtx + get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob;
+ /* some garbage collection */
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+
+
+ /********* calculations using labels *************/
+
+ /* calculate forward and backward matrices */
+ forward_multi(hmmp, (seqsp + s)->seq_1,(seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &forw_mtx, &forw_scale, YES, multi_scoring_method);
+ backward_multi(hmmp, (seqsp + s)->seq_1, (seqsp + s)->seq_2, (seqsp + s)->seq_3, (seqsp + s)->seq_4,
+ &backw_mtx, forw_scale, YES, multi_scoring_method);
+ /* memory for forw_mtx, scale_mtx and
+ * backw_mtx is allocated in the functions */
+
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood_lab += likelihood;
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ for(p = 1; p <= seq_len; p++) {
+
+ /* get alphabet index for the current letter */
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ if(hmmp->nr_alphabets > 1) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ }
+
+ /* add T[k][l] contribution for this sequence */
+ add_Tkl_contribution_multi(hmmp, seq+1, seq_2+1, seq_3+1, seq_4+1, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, a_index, a_index_2, a_index_3, a_index_4, T_lab, YES,
+ multi_scoring_method);
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ for(p = 1; p <= seq_len; p++) {
+ a_index = get_alphabet_index(&seq[p], hmmp->alphabet, hmmp->a_size);
+ if(hmmp->nr_alphabets > 1) {
+ a_index_2 = get_alphabet_index(&seq_2[p], hmmp->alphabet_2, hmmp->a_size_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ a_index_3 = get_alphabet_index(&seq_3[p], hmmp->alphabet_3, hmmp->a_size_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ a_index_4 = get_alphabet_index(&seq_4[p], hmmp->alphabet_4, hmmp->a_size_4);
+ }
+ /* get result and add to matrix */
+ *(E_lab + get_mtx_index(k, a_index, hmmp->a_size)) +=
+ add_Eka_contribution_multi(hmmp, seq+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ if(hmmp->nr_alphabets > 1) {
+ *(E_lab_2 + get_mtx_index(k, a_index_2, hmmp->a_size_2)) +=
+ add_Eka_contribution_multi(hmmp, seq_2+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ *(E_lab_3 + get_mtx_index(k, a_index_3, hmmp->a_size_3)) +=
+ add_Eka_contribution_multi(hmmp, seq_3+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ *(E_lab_4 + get_mtx_index(k, a_index_4, hmmp->a_size_4)) +=
+ add_Eka_contribution_multi(hmmp, seq_4+1, forw_mtx, backw_mtx, p, k, multi_scoring_method);
+ }
+ }
+ }
+ }
+
+ /* some garbage collection */
+ free(seq);
+ if(hmmp->nr_alphabets > 1) {
+ free(seq_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(seq_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(seq_4);
+ }
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+
+
+ }
+ if(verbose == YES) {
+ printf("log likelihood diff rd %d: %f\n", iteration, new_log_likelihood_ulab - new_log_likelihood_lab);
+ }
+
+#ifdef DEBUG_BW
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E);
+#endif
+
+ /* recalculate emission expectations according to distribution groups
+ * by simply taking the mean of the expected emissions within this group
+ * for each letter in the alphabet and replacing each expectation for the
+ * letter with this value for every member of the distribution group */
+ recalculate_emiss_expectations_multi(hmmp, E_lab, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_4, 4);
+ }
+
+ recalculate_emiss_expectations_multi(hmmp, E_ulab, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_4, 4);
+ }
+
+ /* recalculate transition expectations for tied transitions according
+ * to the same scheme as for emission distribution groups */
+ recalculate_trans_expectations_multi(hmmp, T_lab);
+ recalculate_trans_expectations_multi(hmmp, T_ulab);
+
+
+ /* update real T and E matrices */
+ calculate_TE_contributions_multi(T, E, E_2, E_3, E_4, T_lab, E_lab, E_lab_2, E_lab_3, E_lab_4, T_ulab, E_ulab,
+ E_ulab_2, E_ulab_3, E_ulab_4, hmmp->emissions, hmmp->emissions_2, hmmp->emissions_3,
+ hmmp->emissions_4, hmmp->transitions, hmmp->nr_v, hmmp->a_size,
+ hmmp->a_size_2, hmmp->a_size_3, hmmp->a_size_4, hmmp->vertex_emiss_prior_scalers,
+ hmmp->vertex_emiss_prior_scalers_2, hmmp->vertex_emiss_prior_scalers_3,
+ hmmp->vertex_emiss_prior_scalers_4, iteration, hmmp->nr_alphabets);
+
+ /* check if likelihood change is small enough, then we are done */
+ if(fabs((new_log_likelihood_ulab - new_log_likelihood_lab) - (old_log_likelihood_ulab - old_log_likelihood_lab))
+ < CML_THRESHOLD && annealing_status == DONE) {
+ break;
+ }
+
+ /* if simulated annealing is used, scramble results in E and T matrices */
+ if(annealing == YES && temperature > ANNEAL_THRESHOLD) {
+ anneal_E_matrix_multi(temperature, E, hmmp, 1);
+ if(hmmp->nr_alphabets > 1) {
+ anneal_E_matrix_multi(temperature, E_2, hmmp, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ anneal_E_matrix_multi(temperature, E_3, hmmp, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ anneal_E_matrix_multi(temperature, E_4, hmmp, 4);
+ }
+ anneal_T_matrix_multi(temperature, T, hmmp);
+ temperature = temperature * cooling_factor;
+ }
+
+ if(temperature < ANNEAL_THRESHOLD) {
+ annealing_status = DONE;
+ }
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from-vertex */ {
+ /* update transition matrix */
+ if(use_transition_pseudo_counts == YES) {
+ update_trans_mtx_pseudocount_multi(hmmp, T, k);
+ }
+ else {
+ update_trans_mtx_std_multi(hmmp, T, k);
+ }
+
+
+#ifdef DEBUG_PRIORS
+ printf("Starting emission matrix update\n");
+#endif
+
+ /* update emission matrix using Dirichlet prior files if they exist */
+ priorp = *(hmmp->ed_ps + k);
+ if(priorp != NULL && use_prior == YES) {
+#ifdef DEBUG_PRIORS
+ printf("k = %d\n", k);
+ printf("value = %p\n", (void *)priorp);
+#endif
+ update_emiss_mtx_prior_multi(hmmp, E, k, priorp, 1);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E, k, 1);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E, k, 1);
+ }
+
+
+ if(hmmp->nr_alphabets > 1) {
+ priorp = *(hmmp->ed_ps_2 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_2, k, priorp, 2);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_2, k, 2);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_2, k, 2);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ priorp = *(hmmp->ed_ps_3 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_3, k, priorp, 3);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_3, k, 3);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_3, k, 3);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ priorp = *(hmmp->ed_ps_4 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_4, k, priorp, 4);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_4, k, 4);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_4, k, 4);
+ }
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+
+ /* some garbage collection */
+ free(E);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_4);
+ }
+ free(T);
+ free(T_lab);
+ free(E_lab);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_lab_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_lab_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_lab_4);
+ }
+ free(T_ulab);
+ free(E_ulab);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_ulab_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_ulab_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_ulab_4);
+ }
+ max_nr_iterations--;
+ iteration++;
+ }
+ while(max_nr_iterations > 0); /* break condition is also when log_likelihood_difference is
+ * smaller than THRESHOLD, checked inside the loop for
+ * better efficiency */
+#ifdef DEBUG_BW2
+ printf("exiting\n");
+#endif
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+
+}
+
+
+
+/* implementation of the baum-welch training algorithm using dirichlet prior mixture to
+ * calculate update of emission (and transition) matrices and using a multiple sequence
+ * alignment as the training sequence */
+void extended_msa_baum_welch_dirichlet_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop,
+ int nr_seqs, int annealing,
+ int use_gap_shares, int use_lead_columns, int use_labels, int use_transition_pseudo_counts,
+ int use_emission_pseudo_counts, int normalize, int scoring_method, int use_nr_occ,
+ int multi_scoring_method, double *aa_freqs,
+ double *aa_freqs_2, double *aa_freqs_3, double *aa_freqs_4, int use_prior)
+{
+ struct msa_sequences_multi_s *msa_seq_infop_start;
+ double *T, *E, *E_2, *E_3, *E_4; /* matrices for the estimated number of times
+ * each transition (T) and emission (E) is used */
+ double *T_lab, *E_lab, *E_lab_2, *E_lab_3, *E_lab_4, *T_ulab, *E_ulab, *E_ulab_2, *E_ulab_3, *E_ulab_4;
+ struct forward_s *forw_mtx; /* forward matrix */
+ struct backward_s *backw_mtx; /* backward matrix */
+ double *forw_scale; /* scaling array */
+ int s,p,k,l,a,d,i; /* loop counters, s loops over the sequences, p over the
+ * positions in the sequence, k and l over states, a over the alphabet,
+ * d over the distribution groups and i is a scratch index over alignment columns */
+ struct path_element *lp;
+ double t_res, t_res_1, t_res_2, t_res_3; /* for temporary results */
+ double t_res_4, t_res_5, t_res_6; /* for temporary results */
+ double e_res, e_res_1, e_res_2, e_res_3; /* for temporary results */
+
+ int seq_len; /* length of the sequences */
+ int a_index, a_index_2, a_index_3, a_index_4; /* hold the current letter's index in each alphabet */
+ struct letter_s *seq; /* pointer to current sequence */
+ double old_log_likelihood_lab, new_log_likelihood_lab;
+ double old_log_likelihood_ulab, new_log_likelihood_ulab; /* to calculate when to stop */
+ double likelihood; /* temporary variable for calculating likelihood of a sequence */
+ int max_nr_iterations, iteration;
+ double Eka_base;
+
+ /* dirichlet prior variables */
+ struct emission_dirichlet_s *priorp;
+
+ /* simulated annealing variables */
+ double temperature;
+ double cooling_factor;
+ int annealing_status;
+
+ /* help variables for add_to_E */
+ int alphabet_nr;
+ int alphabet;
+ int a_size;
+ double *E_cur;
+ double *subst_mtx;
+ struct msa_letter_s *msa_seq;
+
+ /* remember start of sequence pointer */
+ msa_seq_infop_start = msa_seq_infop;
+
+ old_log_likelihood_lab = 9999.0;
+ new_log_likelihood_lab = 9999.0;
+ old_log_likelihood_ulab = 9999.0;
+ new_log_likelihood_ulab = 9999.0;
+ max_nr_iterations = 70;
+ iteration = 1;
+ if(annealing == YES) {
+ temperature = INIT_TEMP;
+ cooling_factor = INIT_COOL;
+ annealing_status = ACTIVE;
+ }
+ else {
+ annealing_status = DONE;
+ }
+
+ do {
+#ifdef DEBUG_BW2
+ printf("starting baum-welch loop\n");
+#endif
+ /* allocate per iteration matrices */
+ T = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ T_lab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E_lab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+ T_ulab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double)));
+ E_ulab = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size * sizeof(double)));
+
+ if(hmmp->nr_alphabets > 1) {
+ E_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ E_lab_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ E_ulab_2 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_2 * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 2) {
+ E_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ E_lab_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ E_ulab_3 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_3 * sizeof(double)));
+ }
+ if(hmmp->nr_alphabets > 3) {
+ E_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ E_lab_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ E_ulab_4 = (double*)(malloc_or_die(hmmp->nr_v * hmmp->a_size_4 * sizeof(double)));
+ }
+
+ old_log_likelihood_ulab = new_log_likelihood_ulab;
+ new_log_likelihood_ulab = 0.0;
+ old_log_likelihood_lab = new_log_likelihood_lab;
+ new_log_likelihood_lab = 0.0;
+
+ /* reset sequence pointer */
+ msa_seq_infop = msa_seq_infop_start;
+
+ for(s = 0; s < nr_seqs; s++) {
+ if(use_lead_columns == YES) {
+ seq_len = msa_seq_infop->nr_lead_columns;
+ }
+ else {
+ seq_len = msa_seq_infop->msa_seq_length;
+ }
+
+ /* calculate for unlabeled sequences */
+
+ /* calculate forward and backward matrices
+ * memory for forw_mtx, scale_mtx and
+ * backw_mtx is allocated in the functions */
+#ifdef DEBUG_BW2
+ printf("running forward unlabeled\n");
+#endif
+ msa_forward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, NO, &forw_mtx, &forw_scale, NO, normalize,
+ scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+#ifdef DEBUG_BW2
+ printf("running backward unlabeled\n");
+#endif
+ msa_backward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, &backw_mtx, forw_scale, NO, normalize,
+ scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+#ifdef DEBUG_BW2
+ printf("done with backward unlabeled\n");
+#endif
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood_ulab += likelihood;
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ i = 0;
+ while(1) {
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+ add_Tkl_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, T_ulab, use_gap_shares, use_lead_columns, i, NO, scoring_method,
+ normalize, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+
+
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ i = 0;
+ while(1) {
+
+ /* get correct index for this letter column */
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+
+ /* get basic scoring result */
+ Eka_base = add_Eka_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx, p, k,
+ i, use_lead_columns);
+
+ /* loop over the alphabets */
+ for(alphabet_nr = 1; alphabet_nr <= hmmp->nr_alphabets; alphabet_nr++) {
+ if(alphabet_nr == 1) {
+ alphabet = hmmp->alphabet;
+ subst_mtx = hmmp->subst_mtx;
+ a_size = hmmp->a_size;
+ E_cur = E_ulab;
+ msa_seq = msa_seq_infop->msa_seq_1;
+ }
+ else if(alphabet_nr == 2) {
+ alphabet = hmmp->alphabet_2;
+ subst_mtx = hmmp->subst_mtx_2;
+ a_size = hmmp->a_size_2;
+ E_cur = E_ulab_2;
+ msa_seq = msa_seq_infop->msa_seq_2;
+ }
+ else if(alphabet_nr == 3) {
+ alphabet = hmmp->alphabet_3;
+ subst_mtx = hmmp->subst_mtx_3;
+ a_size = hmmp->a_size_3;
+ E_cur = E_ulab_3;
+ msa_seq = msa_seq_infop->msa_seq_3;
+ }
+ else if(alphabet_nr == 4) {
+ alphabet = hmmp->alphabet_4;
+ subst_mtx = hmmp->subst_mtx_4;
+ a_size = hmmp->a_size_4;
+ E_cur = E_ulab_4;
+ msa_seq = msa_seq_infop->msa_seq_4;
+ }
+
+ /* get result and add to matrix according to scoring method */
+ add_to_E_multi(E_cur, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet, scoring_method, use_nr_occ, DISCRETE, NULL);
+ }
+ /* update loop index, check if we are done */
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T_ulab);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E_ulab);
+#endif
+
+
+ /* some garbage collection */
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+
+
+ /* calculate for labeled seqs */
+
+ /* calculate forward and backward matrices
+ * memory for forw_mtx, scale_mtx and
+ * backw_mtx is allocated in the functions */
+#ifdef DEBUG_BW2
+ printf("running forward labeled\n");
+#endif
+ msa_forward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, NO, &forw_mtx,
+ &forw_scale, YES, normalize, scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3,
+ aa_freqs_4);
+#ifdef DEBUG_BW2
+ printf("running backward labeled\n");
+#endif
+ msa_backward_multi(hmmp, msa_seq_infop, use_lead_columns, use_gap_shares, &backw_mtx, forw_scale,
+ YES, normalize, scoring_method, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+#ifdef DEBUG_BW2
+ printf("done with backward labeled\n");
+#endif
+ /* update new_log_likelihood */
+ likelihood = log10((forw_mtx +
+ get_mtx_index(seq_len+1, hmmp->nr_v-1, hmmp->nr_v))->prob);
+ for(k = 0; k <= seq_len; k++) {
+ likelihood = likelihood + log10(*(forw_scale + k));
+ }
+#ifdef DEBUG_BW
+ dump_scaling_array(k-1,forw_scale);
+ printf("likelihood = %f\n", likelihood);
+#endif
+ new_log_likelihood_lab += likelihood;
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from vertex */ {
+ lp = *(hmmp->to_trans_array + k);
+ while(lp->vertex != END) /* l = to-vertex */ {
+ i = 0;
+ while(1) {
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+ /* add T[k][l] contribution for the msa-sequence */
+ add_Tkl_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx,
+ forw_scale, p, k, lp, T_lab, use_gap_shares, use_lead_columns, i, YES, scoring_method,
+ normalize, multi_scoring_method, aa_freqs, aa_freqs_2, aa_freqs_3, aa_freqs_4);
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ /* move on to next path */
+ while(lp->next != NULL) {
+ lp++;
+ }
+ lp++;
+ }
+
+ /* calculate E[k][a] contribution from this sequence */
+ if(silent_state_multi(k, hmmp) != 0) {
+ i = 0;
+ while(1) {
+
+ /* get correct index for this letter column */
+ if(use_lead_columns == NO) {
+ p = i;
+ }
+ else {
+ p = *(msa_seq_infop->lead_columns_start + i);
+ }
+
+ /* get basic scoring result */
+ Eka_base = add_Eka_contribution_msa_multi(hmmp, msa_seq_infop, forw_mtx, backw_mtx, p, k,
+ i, use_lead_columns);
+
+ /* loop over the alphabets */
+ for(alphabet_nr = 1; alphabet_nr <= hmmp->nr_alphabets; alphabet_nr++) {
+ if(alphabet_nr == 1) {
+ alphabet = hmmp->alphabet;
+ subst_mtx = hmmp->subst_mtx;
+ a_size = hmmp->a_size;
+ E_cur = E_lab;
+ msa_seq = msa_seq_infop->msa_seq_1;
+ }
+ else if(alphabet_nr == 2) {
+ alphabet = hmmp->alphabet_2;
+ subst_mtx = hmmp->subst_mtx_2;
+ a_size = hmmp->a_size_2;
+ E_cur = E_lab_2;
+ msa_seq = msa_seq_infop->msa_seq_2;
+ }
+ else if(alphabet_nr == 3) {
+ alphabet = hmmp->alphabet_3;
+ subst_mtx = hmmp->subst_mtx_3;
+ a_size = hmmp->a_size_3;
+ E_cur = E_lab_3;
+ msa_seq = msa_seq_infop->msa_seq_3;
+ }
+ else if(alphabet_nr == 4) {
+ alphabet = hmmp->alphabet_4;
+ subst_mtx = hmmp->subst_mtx_4;
+ a_size = hmmp->a_size_4;
+ E_cur = E_lab_4;
+ msa_seq = msa_seq_infop->msa_seq_4;
+ }
+
+ /* get result and add to matrix according to scoring method */
+ add_to_E_multi(E_cur, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet, scoring_method, use_nr_occ, DISCRETE, NULL);
+ }
+ /* update loop index, check if we are done */
+ i++;
+ if(use_lead_columns == NO) {
+ if(i >= seq_len) {
+ break;
+ }
+ }
+ else {
+ if(*(msa_seq_infop->lead_columns_start + i) == END) {
+ break;
+ }
+ }
+ }
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T_lab);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E_lab);
+#endif
+
+ /* some garbage collection */
+ free(forw_mtx);
+ free(backw_mtx);
+ free(forw_scale);
+
+ msa_seq_infop++;
+ }
+
+ if(verbose == YES) {
+ printf("log likelihood diff rd %d: %f\n", iteration, new_log_likelihood_ulab - new_log_likelihood_lab);
+ }
+
+ /* recalculate emission expectations according to distribution groups
+ * by simply taking the mean of the expected emissions within this group
+ * for each letter in the alphabet and replacing each expectation for the
+ * letter with this value for every member of the distribution group */
+ recalculate_emiss_expectations_multi(hmmp, E_lab, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_lab_4, 4);
+ }
+
+ recalculate_emiss_expectations_multi(hmmp, E_ulab, 1);
+ if(hmmp->nr_alphabets > 1) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_2, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_3, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ recalculate_emiss_expectations_multi(hmmp, E_ulab_4, 4);
+ }
+
+ /* recalculate transition expectations for tied transitions according
+ * to the same scheme as for emission distribution groups */
+ recalculate_trans_expectations_multi(hmmp, T_lab);
+ recalculate_trans_expectations_multi(hmmp, T_ulab);
+
+#ifdef DEBUG_EXTBW
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T_lab);
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T_ulab);
+ dump_E_matrix(hmmp->nr_v, hmmp->a_size, E_lab);
+#endif
+
+ /* update real T and E matrices */
+ calculate_TE_contributions_multi(T, E, E_2, E_3, E_4, T_lab, E_lab, E_lab_2, E_lab_3, E_lab_4, T_ulab, E_ulab,
+ E_ulab_2, E_ulab_3, E_ulab_4, hmmp->emissions, hmmp->emissions_2, hmmp->emissions_3,
+ hmmp->emissions_4, hmmp->transitions, hmmp->nr_v, hmmp->a_size,
+ hmmp->a_size_2, hmmp->a_size_3, hmmp->a_size_4, hmmp->vertex_emiss_prior_scalers,
+ hmmp->vertex_emiss_prior_scalers_2, hmmp->vertex_emiss_prior_scalers_3,
+ hmmp->vertex_emiss_prior_scalers_4, iteration, hmmp->nr_alphabets);
+
+ /* check if likelihood change is small enough, then we are done */
+ if(fabs((new_log_likelihood_ulab - new_log_likelihood_lab) - (old_log_likelihood_ulab - old_log_likelihood_lab)) <
+ CML_THRESHOLD && annealing_status == DONE) {
+ break;
+ }
+
+ /* if simulated annealing is used, scramble results in E and T matrices */
+ if(annealing == YES && temperature > ANNEAL_THRESHOLD) {
+ anneal_E_matrix_multi(temperature, E, hmmp, 1);
+ if(hmmp->nr_alphabets > 1) {
+ anneal_E_matrix_multi(temperature, E_2, hmmp, 2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ anneal_E_matrix_multi(temperature, E_3, hmmp, 3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ anneal_E_matrix_multi(temperature, E_4, hmmp, 4);
+ }
+ anneal_T_matrix_multi(temperature, T, hmmp);
+ temperature = temperature * cooling_factor;
+ }
+
+ if(temperature < ANNEAL_THRESHOLD) {
+ annealing_status = DONE;
+ }
+
+ for(k = 0; k < hmmp->nr_v-1; k++) /* k = from-vertex */ {
+ /* update transition matrix */
+ if(use_transition_pseudo_counts == YES) {
+ update_trans_mtx_pseudocount_multi(hmmp, T, k);
+ }
+ else {
+ update_trans_mtx_std_multi(hmmp, T, k);
+ }
+
+#ifdef DEBUG_PRIORS
+ printf("Starting emission matrix update\n");
+#endif
+
+ /* update emission matrix using Dirichlet prior files if they exist*/
+ priorp = *(hmmp->ed_ps + k);
+ if(priorp != NULL && use_prior == YES) {
+#ifdef DEBUG_PRIORS
+ printf("k = %d\n", k);
+ printf("value = %p\n", (void *)priorp);
+#endif
+ update_emiss_mtx_prior_multi(hmmp, E, k, priorp, 1);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E, k, 1);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E, k, 1);
+ }
+
+
+ if(hmmp->nr_alphabets > 1) {
+ priorp = *(hmmp->ed_ps_2 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_2, k, priorp, 2);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_2, k, 2);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_2, k, 2);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 2) {
+ priorp = *(hmmp->ed_ps_3 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_3, k, priorp, 3);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_3, k, 3);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_3, k, 3);
+ }
+ }
+
+ if(hmmp->nr_alphabets > 3) {
+ priorp = *(hmmp->ed_ps_4 + k);
+ if(priorp != NULL && use_prior == YES) {
+ update_emiss_mtx_prior_multi(hmmp, E_4, k, priorp, 4);
+ }
+ else if(use_emission_pseudo_counts == YES) /* update emissions matrix "normally" when dirichlet file is missing */ {
+ update_emiss_mtx_pseudocount_multi(hmmp, E_4, k, 4);
+ }
+ else {
+ update_emiss_mtx_std_multi(hmmp, E_4, k, 4);
+ }
+ }
+ }
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+
+ /* some garbage collection */
+ free(E);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_4);
+ }
+ free(T);
+ free(T_lab);
+ free(E_lab);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_lab_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_lab_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_lab_4);
+ }
+ free(T_ulab);
+ free(E_ulab);
+ if(hmmp->nr_alphabets > 1) {
+ free(E_ulab_2);
+ }
+ if(hmmp->nr_alphabets > 2) {
+ free(E_ulab_3);
+ }
+ if(hmmp->nr_alphabets > 3) {
+ free(E_ulab_4);
+ }
+
+ max_nr_iterations--;
+ iteration++;
+#ifdef DEBUG_BW2
+ printf("end of baum-welch-loop\n");
+#endif
+ }
+ while(max_nr_iterations > 0); /* break condition is also when log_likelihood_difference is
+ * smaller than THRESHOLD, checked inside the loop for
+ * better efficiency */
+
+
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_emiss_matrix(hmmp->nr_v, hmmp->a_size, hmmp->emissions);
+#endif
+}
+
+
+
+
+/*************************************************************/
+/************** help functions *******************************/
+/*************************************************************/
+
+/* calculates T[k][l] contribution */
+void add_Tkl_contribution_multi(struct hmm_multi_s *hmmp, struct letter_s *seq, struct letter_s *seq_2,
+ struct letter_s *seq_3, struct letter_s *seq_4, struct forward_s *forw_mtx,
+ struct backward_s *backw_mtx, double *forw_scale, int p, int k,
+ struct path_element *lp_const, int a_index, int a_index_2, int a_index_3, int a_index_4,
+ double *T, int use_labels, int multi_scoring_method)
+{
+
+ double t_res, t_res_1, t_res_2, t_res_3, t_res_3_1, t_res_3_2, t_res_3_4, t_res_4, t_res_5, t_res_6, t_res_3_temp;
+ struct path_element *lp_shadow, *lp, *lp_end;
+ int from_v_end;
+ int j;
+
+ lp = lp_const;
+ lp_shadow = lp;
+
+ /* calculate T[k][l] contribution using scaled values*/
+ t_res_1 = (forw_mtx + get_mtx_index(p-1, k, hmmp->nr_v))->prob;
+ t_res_2 = *(hmmp->transitions + get_mtx_index(k, lp->vertex, hmmp->nr_v));
+ while(lp->next != NULL) {
+ t_res_2 = t_res_2 * *(hmmp->transitions + get_mtx_index(lp->vertex, (lp+1)->vertex, hmmp->nr_v));
+ lp++;
+ }
+ if(use_labels == YES && *(hmmp->vertex_labels + lp->vertex) != seq[p-1].label && seq[p-1].label != '.') {
+ t_res_3 = 0.0;
+ }
+ else {
+ if(multi_scoring_method == JOINT_PROB) {
+ if(hmmp->alphabet_type == DISCRETE) {
+ if(a_index < 0) {
+ /* letter is not in alphabet => it is a replacement letter.
+ * This is not implemented yet, simple solution is to ignore this value, i.e. set letter to 'X' */
+ t_res_3 = 0.0;
+ }
+ else {
+ t_res_3 = *(hmmp->emissions + get_mtx_index(lp->vertex, a_index, hmmp->a_size));
+ }
+ }
+ else {
+ t_res_3 = 0.0;
+ for(j = 0; j < hmmp->a_size / 3; j++) {
+ t_res_3 += get_single_gaussian_statescore(*(hmmp->emissions + get_mtx_index(lp->vertex, (j * 3), hmmp->a_size)),
+ *(hmmp->emissions + get_mtx_index(lp->vertex, (j * 3 + 1), hmmp->a_size)),
+ seq[p-1].cont_letter) *
+ *((hmmp->emissions) + (lp->vertex * (hmmp->a_size)) + (j * 3 + 2));
+ }
+ }
+ if(hmmp->nr_alphabets > 1) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ if(a_index_2 < 0) {
+ /* letter is not in alphabet => it is a replacement letter.
+ * This is not implemented yet, simple solution is to ignore this value, i.e. set letter to 'X' */
+ t_res_3 = 0.0;
+ }
+ else {
+ t_res_3 *= *(hmmp->emissions_2 + get_mtx_index(lp->vertex, a_index_2, hmmp->a_size_2));
+ }
+ }
+ else {
+ t_res_3_temp = 0.0;
+ for(j = 0; j < hmmp->a_size_2 / 3; j++) {
+ t_res_3_temp += get_single_gaussian_statescore(*(hmmp->emissions_2 + get_mtx_index(lp->vertex, (j * 3), hmmp->a_size_2)),
+ *(hmmp->emissions_2 + get_mtx_index(lp->vertex, (j * 3 + 1), hmmp->a_size_2)),
+ seq_2[p-1].cont_letter) *
+ *((hmmp->emissions_2) + (lp->vertex * (hmmp->a_size_2)) + (j * 3 + 2));
+ }
+ t_res_3 *= t_res_3_temp;
+ }
+ }
+ if(hmmp->nr_alphabets > 2) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ if(a_index_3 < 0) {
+ /* letter is not in alphabet => it is a replacement letter.
+ * This is not implemented yet, simple solution is to ignore this value, i.e. set letter to 'X' */
+ t_res_3 = 0.0;
+ }
+ else {
+ t_res_3 *= *(hmmp->emissions_3 + get_mtx_index(lp->vertex, a_index_3, hmmp->a_size_3));
+ }
+ }
+ else {
+ t_res_3_temp = 0.0;
+ for(j = 0; j < hmmp->a_size_3 / 3; j++) {
+ t_res_3_temp += get_single_gaussian_statescore(*(hmmp->emissions_3 + get_mtx_index(lp->vertex, (j * 3), hmmp->a_size_3)),
+ *(hmmp->emissions_3 + get_mtx_index(lp->vertex, (j * 3 + 1), hmmp->a_size_3)),
+ seq_3[p-1].cont_letter) *
+ *((hmmp->emissions_3) + (lp->vertex * (hmmp->a_size_3)) + (j * 3 + 2));
+ }
+ t_res_3 *= t_res_3_temp;
+ }
+ }
+ if(hmmp->nr_alphabets > 3) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ if(a_index_4 < 0) {
+ /* letter is not in alphabet => it is a replacement letter.
+ * This is not implemented yet, simple solution is to ignore this value, i.e. set letter to 'X' */
+ t_res_3 = 0.0;
+ }
+ else {
+ t_res_3 *= *(hmmp->emissions_4 + get_mtx_index(lp->vertex, a_index_4, hmmp->a_size_4));
+ }
+ }
+ else {
+ t_res_3_temp = 0.0;
+ for(j = 0; j < hmmp->a_size_4 / 3; j++) {
+ t_res_3_temp += get_single_gaussian_statescore(*(hmmp->emissions_4 + get_mtx_index(lp->vertex, (j * 3), hmmp->a_size_4)),
+ *(hmmp->emissions_4 + get_mtx_index(lp->vertex, (j * 3 + 1), hmmp->a_size_4)),
+ seq_4[p-1].cont_letter) *
+ *((hmmp->emissions_4) + (lp->vertex * (hmmp->a_size_4)) + (j * 3 + 2));
+ }
+ t_res_3 *= t_res_3_temp;
+ }
+ }
+ }
+ }
+ t_res_4 = (backw_mtx + get_mtx_index(p, lp->vertex, hmmp->nr_v))->prob;
+ t_res = t_res_1 * t_res_2 * t_res_3 * t_res_4;
+ if(t_res == 0) {
+ return ; /* no reason to update with zero value */
+ }
+
+#ifdef DEBUG_BW_TRANS
+ printf("for T[%d][%d]\n", k, lp->vertex);
+ sequence_as_string(seq);
+ printf("forw_mtx value = %f\n", t_res_1);
+ printf("transitions value = %f\n", t_res_2);
+ printf("emissions value = %f\n", t_res_3);
+ printf("backw_mtx value = %f\n", t_res_4);
+#endif
+
+ /* divide by scaled probability for sequence s */
+ t_res_5 = (forw_mtx + get_mtx_index(get_seq_length(seq)+1, hmmp->nr_v-1, hmmp->nr_v))->prob;
+#ifdef DEBUG_BW_TRANS
+ printf("seq_length = %d\n", get_seq_length(seq));
+ printf("t_res_5 = %f\n", t_res_5);
+#endif
+
+ t_res = t_res / t_res_5;
+
+ /* divide result with scaling factor for position p,
+ * since it is not included in the contribution, but is
+ * included in the probability for the sequence */
+ t_res_6 = *(forw_scale + p);
+ t_res = t_res / t_res_6;
+
+#ifdef DEBUG_BW_TRANS
+ printf("tot prob for sequence = %f\n", t_res_5);
+ printf("scale p = %f\n", t_res_6);
+ printf("res = %f\n", t_res);
+#endif
+
+ /* if last letter of sequence and a path from the emitting state for this letter to end state exists
+ update this path as well with the same value */
+ if(p == get_seq_length(seq) && path_length_multi(lp->vertex, hmmp->nr_v-1, hmmp, 0) > 0) {
+ from_v_end = lp->vertex;
+ lp_end = get_end_path_start_multi(lp->vertex, hmmp);
+ *(T + get_mtx_index(from_v_end, lp_end->vertex, hmmp->nr_v)) += t_res;
+ from_v_end = lp_end->vertex;
+ lp_end++;
+ while(lp_end->next != NULL) {
+ *(T + get_mtx_index(from_v_end, lp_end->vertex, hmmp->nr_v)) += t_res;
+ from_v_end = lp_end->vertex;
+ lp_end++;
+ }
+ }
+
+ /* update T-matrix for transition indices that correspond to current path */
+ lp = lp_shadow;
+ *(T + get_mtx_index(k, lp->vertex, hmmp->nr_v)) += t_res;
+ lp++;
+ while(lp_shadow->next != NULL) {
+ *(T + get_mtx_index(lp_shadow->vertex, lp->vertex, hmmp->nr_v)) += t_res;
+ lp_shadow = lp;
+ lp++;
+ }
+
+#ifdef DEBUG_BW_TRANS
+ printf("adding result to T-mtx index: %d\n", get_mtx_index(k,lp_shadow->vertex,hmmp->nr_v));
+#endif
+}
+
+/* calculates T[k][l] contribution */
+void add_Tkl_contribution_msa_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop,
+ struct forward_s *forw_mtx, struct backward_s *backw_mtx,
+ double *forw_scale, int p, int k,
+ struct path_element *lp_const, double *T, int use_gap_shares,
+ int use_lead_columns, int i, int use_labels, int scoring_method, int normalize,
+ int multi_scoring_method, double *aa_freqs_1, double *aa_freqs_2,
+ double *aa_freqs_3, double *aa_freqs_4)
+{
+ double t_res, t_res_1, t_res_2, t_res_3, t_res_4, t_res_5, t_res_6, temp_res, t_res_3_tot;
+ struct path_element *lp_shadow, *lp, *lp_end;
+ int a_index, a_index2;
+ int query_index;
+ double default_share, rest_share;
+ double seq_normalizer;
+ double state_normalizer;
+ double subst_mtx_normalizer;
+ int from_v_end;
+
+ int alphabet;
+
+ int a_size, a_size_1;
+ struct msa_letter_s *msa_seq;
+ double *emissions;
+ double *subst_mtx;
+ double *aa_freqs;
+ int alphabet_type;
+ int j;
+
+ lp = lp_const;
+ lp_shadow = lp;
+
+
+ /* calculate T[k][l] contribution using scaled values*/
+
+ /* get f_i value */
+ t_res_1 = (forw_mtx + get_mtx_index(i, k, hmmp->nr_v))->prob;
+
+ /* get a_kl value*/
+ t_res_2 = *(hmmp->transitions + get_mtx_index(k, lp->vertex, hmmp->nr_v));
+ while(lp->next != NULL) {
+ t_res_2 = t_res_2 * *(hmmp->transitions + get_mtx_index(lp->vertex, (lp+1)->vertex, hmmp->nr_v));
+ lp++;
+ }
+
+ /* calculate e_i(x) */
+ t_res_3_tot = 1.0;
+ a_size_1 = hmmp->a_size;
+ for(alphabet = 1; alphabet <= hmmp->nr_alphabets; alphabet++) {
+ seq_normalizer = 0.0;
+ state_normalizer = 0.0;
+ subst_mtx_normalizer = 0.0;
+ if(alphabet == 1) {
+ if(hmmp->alphabet_type == DISCRETE) {
+ query_index = get_alphabet_index_msa_query((msa_seq_infop->msa_seq_1 + (p * (hmmp->a_size+1)))->query_letter,
+ hmmp->alphabet, hmmp->a_size);
+ if(query_index < 0) {
+ query_index = hmmp->a_size; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ a_size = hmmp->a_size;
+ msa_seq = msa_seq_infop->msa_seq_1;
+ emissions = hmmp->emissions;
+ subst_mtx = hmmp->subst_mtx;
+ alphabet_type = hmmp->alphabet_type;
+ aa_freqs = aa_freqs_1;
+ }
+ if(alphabet == 2) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ query_index = get_alphabet_index_msa_query((msa_seq_infop->msa_seq_2 + (p * (hmmp->a_size_2+1)))->query_letter,
+ hmmp->alphabet_2, hmmp->a_size_2);
+ if(query_index < 0) {
+ query_index = hmmp->a_size_2; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ a_size = hmmp->a_size_2;
+ msa_seq = msa_seq_infop->msa_seq_2;
+ emissions = hmmp->emissions_2;
+ subst_mtx = hmmp->subst_mtx_2;
+ alphabet_type = hmmp->alphabet_type_2;
+ aa_freqs = aa_freqs_2;
+ }
+ if(alphabet == 3) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ query_index = get_alphabet_index_msa_query((msa_seq_infop->msa_seq_3 + (p * (hmmp->a_size_3+1)))->query_letter,
+ hmmp->alphabet_3, hmmp->a_size_3);
+ if(query_index < 0) {
+ query_index = hmmp->a_size_3; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ a_size = hmmp->a_size_3;
+ msa_seq = msa_seq_infop->msa_seq_3;
+ emissions = hmmp->emissions_3;
+ subst_mtx = hmmp->subst_mtx_3;
+ alphabet_type = hmmp->alphabet_type_3;
+ aa_freqs = aa_freqs_3;
+ }
+ if(alphabet == 4) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ query_index = get_alphabet_index_msa_query((msa_seq_infop->msa_seq_4 + (p * (hmmp->a_size_4+1)))->query_letter,
+ hmmp->alphabet_4, hmmp->a_size_4);
+ if(query_index < 0) {
+ query_index = hmmp->a_size_4; /* if letter is wild card, use default column in subst matrix */
+ }
+ }
+ a_size = hmmp->a_size_4;
+ msa_seq = msa_seq_infop->msa_seq_4;
+ emissions = hmmp->emissions_4;
+ subst_mtx = hmmp->subst_mtx_4;
+ alphabet_type = hmmp->alphabet_type_4;
+ aa_freqs = aa_freqs_4;
+ }
+
+ default_share = 1.0 / (double)(a_size);
+ t_res_3 = 0.0;
+
+ /* use first alphabet here since the labels are placed in the first alphabet */
+ if(use_labels == YES && *(hmmp->vertex_labels + lp->vertex) !=
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(p,0, a_size_1+1))->label &&
+ (msa_seq_infop->msa_seq_1 + get_mtx_index(p,0, a_size_1+1))->label != '.') {
+ t_res_3 = 0.0;
+ }
+ else if(alphabet_type == CONTINUOUS) {
+ if((msa_seq + get_mtx_index(p, 0, a_size + 1))->nr_occurences > 0.0) {
+ t_res_3 = 0.0;
+ for(j = 0; j < a_size / 3; j++) {
+ t_res_3 += get_single_gaussian_statescore(*(emissions + get_mtx_index(lp->vertex, (j * 3), a_size)),
+ *(emissions + get_mtx_index(lp->vertex, (j * 3 + 1), a_size)),
+ (msa_seq + get_mtx_index(p, 0, a_size + 1))->share) *
+ *((emissions) + (lp->vertex * (a_size)) + (j * 3 + 2));
+ }
+ }
+ else {
+ t_res_3 = 0.0;
+ }
+ }
+ else if(scoring_method == DOT_PRODUCT) {
+ t_res_3 = get_dp_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex, normalize, msa_seq_infop->gap_shares);
+ }
+ else if(scoring_method == DOT_PRODUCT_PICASSO) {
+ t_res_3 = get_dp_picasso_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex ,normalize, msa_seq_infop->gap_shares, aa_freqs);
+ }
+ else if(scoring_method == PICASSO) {
+ t_res_3 = get_picasso_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex ,normalize, msa_seq_infop->gap_shares, aa_freqs);
+ }
+ else if(scoring_method == PICASSO_SYM) {
+ t_res_3 = get_picasso_sym_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex ,normalize, msa_seq_infop->gap_shares, aa_freqs);
+ }
+ else if(scoring_method == SJOLANDER) {
+ t_res_3 = get_sjolander_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex, normalize, msa_seq_infop->gap_shares);
+ }
+ else if(scoring_method == SJOLANDER_REVERSED) {
+ t_res_3 = get_sjolander_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex, normalize, msa_seq_infop->gap_shares);
+ }
+ else if(scoring_method == SUBST_MTX_PRODUCT) {
+ t_res_3 = get_subst_mtx_product_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex, subst_mtx);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT) {
+ t_res_3 = get_subst_mtx_dot_product_statescore(a_size, use_gap_shares, NO, msa_seq, p, emissions,
+ lp->vertex, normalize, msa_seq_infop->gap_shares,
+ query_index, subst_mtx);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT_PRIOR) {
+ t_res_3 = get_subst_mtx_dot_product_prior_statescore(a_size, use_gap_shares, NO, msa_seq, p,
+ emissions, lp->vertex, normalize, msa_seq_infop->gap_shares,
+ query_index, subst_mtx);
+ }
+
+
+ if(multi_scoring_method == JOINT_PROB) {
+ t_res_3_tot *= t_res_3;
+ }
+ else {
+ printf("Error: only joint prob multiscoring is implemented\n");
+ }
+ }
+
+
+ /* get b_i+1 value */
+ t_res_4 = (backw_mtx + get_mtx_index(i+1, lp->vertex, hmmp->nr_v))->prob;
+
+ t_res = t_res_1 * t_res_2 * t_res_3_tot * t_res_4;
+
+#ifdef DEBUG_Tkl
+ printf("for T[%d][%d]\n", k, lp->vertex);
+ printf("forw_mtx value = %f\n", t_res_1);
+ printf("transitions value = %f\n", t_res_2);
+ printf("emissions value = %f\n", t_res_3_tot);
+ printf("backw_mtx value = %f\n", t_res_4);
+ printf("t_res = %f\n", t_res);
+#endif
+ if(t_res == 0) {
+ return ; /* no reason to update with zero value */
+ }
+
+ /* divide by scaled probability for sequence s */
+ if(use_lead_columns == NO) {
+ t_res_5 = (forw_mtx + get_mtx_index((msa_seq_infop->msa_seq_length)+1,
+ hmmp->nr_v-1, hmmp->nr_v))->prob;
+ }
+ else {
+ t_res_5 = (forw_mtx + get_mtx_index((msa_seq_infop->nr_lead_columns)+1,
+ hmmp->nr_v-1, hmmp->nr_v))->prob;
+ }
+ t_res = t_res / t_res_5;
+
+ /* divide result with scaling factor for position p+1,
+ * since it is not included in the contribution, but is
+ * included in the probability for the sequence */
+ t_res_6 = *(forw_scale + i+1);
+ t_res = t_res / t_res_6;
+
+#ifdef DEBUG_Tkl
+ printf("tot prob for sequence = %f\n", t_res_5);
+ printf("scale p = %f\n", t_res_6);
+ printf("res = %f\n", t_res);
+ printf("p = %d\n", p);
+ printf("seq_length = %d\n", msa_seq_infop->msa_seq_length);
+#endif
+
+
+ /* if last letter of sequence and a path from the emitting state for this letter to end state exists
+ update this path as well with the same value */
+ if(p == msa_seq_infop->msa_seq_length - 1 && path_length_multi(lp->vertex, hmmp->nr_v-1, hmmp, 0) > 0) {
+ from_v_end = lp->vertex;
+ lp_end = get_end_path_start_multi(lp->vertex, hmmp);
+ *(T + get_mtx_index(from_v_end, lp_end->vertex, hmmp->nr_v)) += t_res;
+ from_v_end = lp_end->vertex;
+ lp_end++;
+ while(lp_end->next != NULL) {
+ *(T + get_mtx_index(from_v_end, lp_end->vertex, hmmp->nr_v)) += t_res;
+ from_v_end = lp_end->vertex;
+ lp_end++;
+ }
+ }
+
+
+
+ /* update T-matrix for transition indices that correspond to current path */
+ lp = lp_shadow;
+ *(T + get_mtx_index(k, lp->vertex, hmmp->nr_v)) += t_res;
+ lp++;
+ while(lp_shadow->next != NULL) {
+ *(T + get_mtx_index(lp_shadow->vertex, lp->vertex, hmmp->nr_v)) += t_res;
+ lp_shadow = lp;
+ lp++;
+ }
+
+#ifdef DEBUG_Tkl
+ printf("adding result to T-mtx index: %d\n\n", get_mtx_index(k,lp_shadow->vertex,hmmp->nr_v));
+#endif
+}
+
+
+double add_Eka_contribution_multi(struct hmm_multi_s *hmmp, struct letter_s *seq, struct forward_s *forw_mtx,
+ struct backward_s *backw_mtx, int p, int k, int multi_scoring_method)
+{
+ double e_res, e_res_1, e_res_2, e_res_3;
+
+ /* get contribution from this position in the sequence */
+ e_res_1 = (forw_mtx + get_mtx_index(p, k, hmmp->nr_v))->prob;
+ e_res_2 = (backw_mtx + get_mtx_index(p, k, hmmp->nr_v))->prob;
+ e_res = e_res_1 * e_res_2;
+ if(e_res == 0) {
+ return 0.0; /* no use updating with a zero value */
+ }
+
+#ifdef DEBUG_BW
+ printf("forw_mtx = %f\n", e_res_1);
+ printf("backw_mtx = %f\n", e_res_2);
+ printf("res = %f\n", e_res);
+#endif
+
+ /* divide with total probability of current sequence */
+ e_res_3 = (forw_mtx + get_mtx_index(get_seq_length(seq)+1, hmmp->nr_v-1, hmmp->nr_v))->prob;
+ e_res = e_res / e_res_3;
+
+#ifdef DEBUG_BW
+ printf("total prob = %f\n", e_res_3);
+ printf("e_res = %f\n", e_res);
+#endif
+
+ if(multi_scoring_method == JOINT_PROB) {
+ /* nothing more needs to be done, this updating procedure will update with the same number for all alphabets */
+ }
+
+ return e_res;
+}
+
+double add_Eka_contribution_continuous_multi(struct hmm_multi_s *hmmp, struct letter_s *seq, struct forward_s *forw_mtx,
+ struct backward_s *backw_mtx, int p, int k, int multi_scoring_method, double *E,
+ int alphabet)
+{
+ double e_res, e_res_1, e_res_2, e_res_3;
+ double mean_value, varians;
+ double continuous_score_all, continuous_score_j, gamma_p_j;
+ int j;
+ double *emissions;
+ int a_size;
+
+
+ if(alphabet == 1) {
+ emissions = hmmp->emissions;
+ a_size = hmmp->a_size;
+ }
+ else if(alphabet == 2) {
+ emissions = hmmp->emissions_2;
+ a_size = hmmp->a_size_2;
+ }
+ else if(alphabet == 3) {
+ emissions = hmmp->emissions_3;
+ a_size = hmmp->a_size_3;
+ }
+ else if(alphabet == 4) {
+ emissions = hmmp->emissions_4;
+ a_size = hmmp->a_size_4;
+ }
+ else {
+ printf("strange alphabet nr: %d\n", alphabet);
+ exit(1);
+ }
+
+ /* get contribution from this position in the sequence */
+ e_res_1 = (forw_mtx + get_mtx_index(p, k, hmmp->nr_v))->prob;
+ e_res_2 = (backw_mtx + get_mtx_index(p, k, hmmp->nr_v))->prob;
+ e_res = e_res_1 * e_res_2;
+ if(e_res == 0) {
+ return 0.0; /* no use updating with a zero value */
+ }
+
+#ifdef DEBUG_BW
+ printf("forw_mtx = %f\n", e_res_1);
+ printf("backw_mtx = %f\n", e_res_2);
+ printf("res = %f\n", e_res);
+#endif
+
+
+/* divide with total probability of current sequence */
+ e_res_3 = (forw_mtx + get_mtx_index(get_seq_length(seq)+1, hmmp->nr_v-1, hmmp->nr_v))->prob;
+ e_res = e_res / e_res_3;
+
+ mean_value = (seq + p - 1)->cont_letter;
+
+
+#ifdef DEBUG_BW
+ printf("mean = %f\n", mean_value);
+ printf("varians = %f\n", varians);
+ printf("e_res = %f\n", e_res);
+ printf("e_res_3 = %f\n", e_res_3);
+#endif
+
+ if(multi_scoring_method == JOINT_PROB) {
+ /* nothing more needs to be done, this updating procedure will update with the same number for all alphabets */
+ }
+
+ continuous_score_all = 0.0;
+ for(j = 0; j < a_size / 3; j++) {
+ continuous_score_all +=
+ get_single_gaussian_statescore(*(emissions + get_mtx_index(k, (j * 3), a_size)),
+ *(emissions + get_mtx_index(k, (j * 3 + 1), a_size)),
+ mean_value) *
+ *((emissions) + (k * (a_size)) + (j * 3 + 2));
+ }
+
+ for(j = 0; j < a_size / 3; j++) {
+ continuous_score_j =
+ get_single_gaussian_statescore(*(emissions + get_mtx_index(k, (j * 3), a_size)),
+ *(emissions + get_mtx_index(k, (j * 3 + 1), a_size)),
+ mean_value) *
+ *((emissions) + (k * (a_size)) + (j * 3 + 2));
+ varians = pow((seq + p - 1)->cont_letter - *(emissions + get_mtx_index(k, j * 3, a_size)), 2);
+ if(continuous_score_all > 0.0) {
+ gamma_p_j = e_res * continuous_score_j / continuous_score_all;
+ }
+ else {
+ gamma_p_j = 0.0;
+ }
+ *(E + get_mtx_index(k, j * 3, a_size + 1)) += mean_value * gamma_p_j;
+ *(E + get_mtx_index(k, j * 3 + 1, a_size + 1)) += varians * gamma_p_j;
+ *(E + get_mtx_index(k, j * 3 + 2, a_size + 1)) += gamma_p_j;
+ }
+
+ *(E + get_mtx_index(k, j * 3, a_size + 1)) += e_res;
+
+}
+
+
+double add_Eka_contribution_msa_multi(struct hmm_multi_s *hmmp, struct msa_sequences_multi_s *msa_seq_infop,
+ struct forward_s *forw_mtx, struct backward_s *backw_mtx, int p,
+ int k, int i, int use_lead_columns)
+{
+ double e_res, e_res_1, e_res_2, e_res_3;
+
+ /* get contribution from this position in the sequence */
+ e_res_1 = (forw_mtx + get_mtx_index(i+1, k, hmmp->nr_v))->prob;
+ e_res_2 = (backw_mtx + get_mtx_index(i+1, k, hmmp->nr_v))->prob;
+ e_res = e_res_1 * e_res_2;
+ if(e_res == 0) {
+ return 0.0; /* no use updating with a zero value */
+ }
+
+#ifdef DEBUG_BW
+ printf("forw_mtx = %f\n", e_res_1);
+ printf("backw_mtx = %f\n", e_res_2);
+ printf("res = %f\n", e_res);
+#endif
+
+ /* divide with total probability of current sequence */
+ if(use_lead_columns == NO) {
+ e_res_3 = (forw_mtx + get_mtx_index((msa_seq_infop->msa_seq_length)+1,
+ hmmp->nr_v-1, hmmp->nr_v))->prob;
+ }
+ else {
+ e_res_3 = (forw_mtx + get_mtx_index((msa_seq_infop->nr_lead_columns)+1,
+ hmmp->nr_v-1, hmmp->nr_v))->prob;
+ }
+ e_res = e_res / e_res_3;
+
+#ifdef DEBUG_BW
+ printf("total prob = %f\n", e_res_3);
+ printf("e_res = %f\n", e_res);
+#endif
+
+ return e_res;
+}
+
+
+void recalculate_emiss_expectations_multi(struct hmm_multi_s *hmmp, double *E, int alphabet)
+{
+ int a, d, k, l;
+ double e_res;
+ int a_size;
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ }
+
+
+ for(a = 0; a < a_size;a++) {
+ k = 0;
+ for(d = 0; d < hmmp->nr_d; d++) {
+ e_res = 0;
+ l = 0;
+ while(*(hmmp->distrib_groups + k) != END) {
+ e_res += *(E + get_mtx_index(*(hmmp->distrib_groups + k),
+ a, a_size));
+ k++;
+ l++;
+ }
+#ifdef DEBUG_BW
+ printf("e_res = %f, l = %d\n", e_res, l);
+#endif
+ e_res = e_res/(double)l;
+ k = k - l;
+ while(*(hmmp->distrib_groups + k) != END) {
+ *(E + get_mtx_index(*(hmmp->distrib_groups + k),
+ a, a_size)) = e_res;
+ k++;
+ }
+ k++;
+ }
+ }
+}
+
+void recalculate_trans_expectations_multi(struct hmm_multi_s *hmmp, double *T)
+{
+ int t,k,l;
+ double t_res;
+ k = 0;
+ for(t = 0; t < hmmp->nr_ttg; t++) {
+ t_res = 0;
+ l = 0;
+ while((hmmp->trans_tie_groups + k)->from_v != END) {
+ t_res += *(T + get_mtx_index((hmmp->trans_tie_groups + k)->from_v, (hmmp->trans_tie_groups + k)->to_v, hmmp->nr_v));
+ k++;
+ l++;
+ }
+ t_res = t_res/(double)l;
+ k = k - l;
+ while((hmmp->trans_tie_groups + k)->from_v != END) {
+ *(T + get_mtx_index((hmmp->trans_tie_groups + k)->from_v, (hmmp->trans_tie_groups + k)->to_v, hmmp->nr_v)) = t_res;
+ k++;
+ }
+ k++;
+ }
+}
+
+
+void update_trans_mtx_std_multi(struct hmm_multi_s *hmmp, double *T, int k)
+{
+ int l;
+ double t_res_1, t_res_2;
+ int i;
+
+#ifdef DEBUG_BW_TRANS
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+ dump_T_matrix(hmmp->nr_v, hmmp->nr_v, T);
+#endif
+
+ t_res_1 = 0;
+ for(l = 0; l < hmmp->nr_v-1; l++) {
+ t_res_1 += *(T + get_mtx_index(k,l,hmmp->nr_v));
+ }
+ if(t_res_1 == 0.0) {
+ /* no dividing with 0 */
+ }
+ else {
+ for(l = 0; l < hmmp->nr_v-1; l++) /* l = to-vertex, k = from-vertex */ {
+ t_res_2 = *(T + get_mtx_index(k, l, hmmp->nr_v));
+ *(hmmp->transitions + get_mtx_index(k,l,hmmp->nr_v)) = t_res_2/t_res_1;
+ if(t_res_2 != 0.0 ) {
+ *(hmmp->log_transitions + get_mtx_index(k,l,hmmp->nr_v)) =
+ log10(t_res_2/t_res_1);
+ }
+ else {
+ *(hmmp->log_transitions + get_mtx_index(k,l,hmmp->nr_v)) = DEFAULT;
+ }
+ }
+ }
+ update_tot_trans_mtx_multi(hmmp);
+}
+
+void update_trans_mtx_pseudocount_multi(struct hmm_multi_s *hmmp, double *T, int k)
+{
+ int l;
+ double t_res_1, t_res_2;
+ int i;
+ double pseudo_value;
+
+ pseudo_value = TRANSITION_PSEUDO_VALUE;
+ t_res_1 = 0.0;
+ for(l = 0; l < hmmp->nr_v-1; l++) {
+ t_res_1 += *(T + get_mtx_index(k,l,hmmp->nr_v));
+ if(*(T + get_mtx_index(k,l,hmmp->nr_v)) != 0.0) {
+ t_res_1 += pseudo_value;
+ }
+ }
+
+
+ if(t_res_1 == 0.0) {
+ /* no dividing with 0 */
+ }
+
+ else {
+ for(l = 0; l < hmmp->nr_v-1; l++) /* l = to-vertex */ {
+ t_res_2 = *(T + get_mtx_index(k, l, hmmp->nr_v));
+
+ if(t_res_2 != 0.0) {
+ t_res_2 += pseudo_value;
+ }
+ *(hmmp->transitions + get_mtx_index(k,l,hmmp->nr_v)) = t_res_2/t_res_1;
+ if(t_res_2 != 0.0 ) {
+ *(hmmp->log_transitions + get_mtx_index(k,l,hmmp->nr_v)) =
+ log10(t_res_2/t_res_1);
+ }
+ else {
+ *(hmmp->log_transitions + get_mtx_index(k,l,hmmp->nr_v)) = DEFAULT;
+ }
+ }
+ }
+ update_tot_trans_mtx_multi(hmmp);
+}
+
+void update_emiss_mtx_std_multi(struct hmm_multi_s *hmmp, double *E, int k, int alphabet)
+{
+ /* NOTE: k = current vertex */
+ double e_res_1, e_res_2;
+ int a_index;
+ int a_size;
+ double *emissions;
+ double *log_emissions;
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ emissions = hmmp->emissions;
+ log_emissions = hmmp->log_emissions;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ emissions = hmmp->emissions_2;
+ log_emissions = hmmp->log_emissions_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ emissions = hmmp->emissions_3;
+ log_emissions = hmmp->log_emissions_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ emissions = hmmp->emissions_4;
+ log_emissions = hmmp->log_emissions_4;
+ }
+
+ if(silent_state_multi(k, hmmp) == YES) {
+ return; /* this is a silent state, no updating should be done */
+ }
+ else if(locked_state_multi(hmmp, k) == YES) {
+ return; /* this state's parameters are locked, don't update */
+ }
+
+
+ e_res_1 = 0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ e_res_1 += *(E + get_mtx_index(k, a_index, a_size));
+ }
+ if(e_res_1 == 0.0) {
+ }
+ else {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ e_res_2 = *(E + get_mtx_index(k, a_index, a_size));
+ *(emissions + get_mtx_index(k, a_index, a_size)) = e_res_2/e_res_1;
+ if(e_res_2 != 0) {
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) =
+ log10(e_res_2/e_res_1);
+ }
+ else {
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = DEFAULT;
+ }
+ }
+ }
+}
+
+void update_emiss_mtx_std_continuous_multi(struct hmm_multi_s *hmmp, double *E, int k, int alphabet)
+{
+ /* NOTE: k = current vertex */
+ double e_res_1, e_res_2;
+ int a_index;
+ int a_size;
+ double *emissions;
+ double *log_emissions;
+ double gamma_all, gamma_j, mean_j, var_j;
+
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ emissions = hmmp->emissions;
+ log_emissions = hmmp->log_emissions;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ emissions = hmmp->emissions_2;
+ log_emissions = hmmp->log_emissions_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ emissions = hmmp->emissions_3;
+ log_emissions = hmmp->log_emissions_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ emissions = hmmp->emissions_4;
+ log_emissions = hmmp->log_emissions_4;
+ }
+ if(silent_state_multi(k, hmmp) == YES) {
+ return; /* this is a silent state, no updating should be done */
+ }
+ else if(locked_state_multi(hmmp, k) == YES) {
+ return; /* this state's parameters are locked, don't update */
+ }
+ else {
+ gamma_all = *(E + get_mtx_index(k, a_size, a_size + 1));
+ for(a_index = 0; a_index < a_size; a_index += 3) {
+ gamma_j = *(E + get_mtx_index(k, a_index + 2, a_size + 1));
+ mean_j = *(E + get_mtx_index(k, a_index, a_size + 1));
+ var_j = *(E + get_mtx_index(k, a_index + 1, a_size + 1));
+ if(gamma_all == 0.0) {
+ *(emissions + get_mtx_index(k, a_index + 2, a_size)) = 0.0;
+ *(log_emissions + get_mtx_index(k, a_index + 2, a_size)) = DEFAULT;
+ }
+ else {
+ *(emissions + get_mtx_index(k, a_index + 2, a_size)) = gamma_j / gamma_all;
+ *(log_emissions + get_mtx_index(k, a_index + 2, a_size)) = log10(gamma_j / gamma_all);
+ }
+
+ if(gamma_j == 0.0) {
+ *(emissions + get_mtx_index(k, a_index, a_size)) = 0.0;
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = DEFAULT;
+ *(emissions + get_mtx_index(k, a_index + 1, a_size)) = 0.0;
+ *(log_emissions + get_mtx_index(k, a_index + 1, a_size)) = DEFAULT;
+ }
+ else {
+ *(emissions + get_mtx_index(k, a_index, a_size)) = mean_j / gamma_j;
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = log10(mean_j / gamma_j);
+ *(emissions + get_mtx_index(k, a_index + 1, a_size)) = var_j / gamma_j;
+ *(log_emissions + get_mtx_index(k, a_index + 1, a_size)) = log10(var_j / gamma_j);
+ }
+ }
+ }
+}
+
+
+void update_emiss_mtx_pseudocount_multi(struct hmm_multi_s *hmmp, double *E, int k, int alphabet)
+{
+ /* NOTE: k = current vertex */
+ double e_res_1, e_res_2;
+ int a_index;
+ double pseudo_value;
+ int a_size;
+ double *emissions;
+ double *log_emissions;
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ emissions = hmmp->emissions;
+ log_emissions = hmmp->log_emissions;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ emissions = hmmp->emissions_2;
+ log_emissions = hmmp->log_emissions_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ emissions = hmmp->emissions_3;
+ log_emissions = hmmp->log_emissions_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ emissions = hmmp->emissions_4;
+ log_emissions = hmmp->log_emissions_4;
+ }
+
+ pseudo_value = EMISSION_PSEUDO_VALUE;
+ if(silent_state_multi(k,hmmp) == YES) {
+ return; /* this is a silent state, no updating should be done */
+ }
+ else if(locked_state_multi(hmmp, k) == YES) {
+ return; /* this state's parameters are locked, don't update */
+ }
+
+ e_res_1 = 0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ e_res_1 += *(E + get_mtx_index(k, a_index, a_size));
+ if(*(E + get_mtx_index(k, a_index, a_size)) != 0.0) {
+ e_res_1 += pseudo_value;
+ }
+ }
+ if(e_res_1 == 0.0) {
+ }
+ else {
+ for(a_index = 0; a_index < a_size; a_index++) {
+ e_res_2 = *(E + get_mtx_index(k, a_index, a_size));
+ if(e_res_2 != 0.0) {
+ e_res_2 += pseudo_value;
+ }
+ *(emissions + get_mtx_index(k, a_index, a_size)) = e_res_2/e_res_1;
+ if(e_res_2 != 0) {
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) =
+ log10(e_res_2/e_res_1);
+ }
+ else {
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = DEFAULT;
+ }
+ }
+ }
+}
+
+
+void update_emiss_mtx_prior_multi(struct hmm_multi_s *hmmp, double *E, int k, struct emission_dirichlet_s *priorp, int alphabet)
+{
+
+ /* NOTE: k = current vertex */
+ int nr_components, comps, a_index;
+ double scaling_factor, X_sum, *X_values, ed_res1, E_sums, *logbeta_an_values;
+ double q_value, exponent, prior_prob, tot_prior_prob;
+ double prior_scaler;
+ int a_size;
+ double *emissions;
+ double *log_emissions;
+
+
+ if(alphabet == 1) {
+ a_size = hmmp->a_size;
+ emissions = hmmp->emissions;
+ log_emissions = hmmp->log_emissions;
+ }
+ if(alphabet == 2) {
+ a_size = hmmp->a_size_2;
+ emissions = hmmp->emissions_2;
+ log_emissions = hmmp->log_emissions_2;
+ }
+ if(alphabet == 3) {
+ a_size = hmmp->a_size_3;
+ emissions = hmmp->emissions_3;
+ log_emissions = hmmp->log_emissions_3;
+ }
+ if(alphabet == 4) {
+ a_size = hmmp->a_size_4;
+ emissions = hmmp->emissions_4;
+ log_emissions = hmmp->log_emissions_4;
+ }
+
+ /* Note that this function expects that a prior struct is loaded and ready for this particular alphabet */
+
+ prior_scaler = *(hmmp->vertex_emiss_prior_scalers + k);
+ if(*(emissions + get_mtx_index(k, 0, a_size)) == SILENT) {
+ return; /* this is a silent state, no updating should be done */
+ }
+ else if(locked_state_multi(hmmp, k) == YES) {
+ return; /* this state's parameters are locked, don't update */
+ }
+
+ nr_components = priorp->nr_components;
+ logbeta_an_values = malloc_or_die(nr_components * sizeof(double));
+ scaling_factor = 0.0 - FLT_MAX;
+ X_sum = 0.0;
+ X_values = malloc_or_die(a_size * sizeof(double));
+
+ /* calculate logB(alpha + n) for all components +
+ * calculate scaling factor for logB(alpha + n) - logB(alpha) */
+ for(comps = 0; comps < nr_components; comps++) {
+ ed_res1 = 0;
+ E_sums = 0;
+ for(a_index = 0; a_index < a_size; a_index++) {
+ ed_res1 += lgamma(*(priorp->prior_values +
+ get_mtx_index(comps, a_index, a_size)) +
+ *(E + get_mtx_index(k,a_index,a_size)));
+ E_sums += *(E + get_mtx_index(k,a_index, a_size));
+ }
+ ed_res1 = ed_res1 - lgamma(*(priorp->alpha_sums + comps) + E_sums);
+ *(logbeta_an_values + comps) = ed_res1;
+ if((ed_res1 = ed_res1 - *(priorp->logbeta_values + comps)) > scaling_factor) {
+ scaling_factor = ed_res1;
+ }
+ }
+#ifdef DEBUG_PRIORS
+ printf("ed_res1(top) = %f\n", ed_res1);
+ printf("scaling_factor = %f\n", scaling_factor);
+#endif
+
+ /* calculate all the Xi's */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ *(X_values + a_index) = 0;
+ for(comps = 0; comps < nr_components; comps++) {
+ q_value = *(priorp->q_values + comps);
+ exponent = (*(logbeta_an_values + comps) - *(priorp->logbeta_values + comps) -
+ scaling_factor);
+ prior_prob = (*(priorp->prior_values + get_mtx_index(comps,a_index, a_size)) * prior_scaler +
+ *(E + get_mtx_index(k,a_index,a_size)));
+ tot_prior_prob = (*(priorp->alpha_sums + comps) + E_sums);
+ *(X_values + a_index) += q_value * exp(exponent) * prior_prob / tot_prior_prob;
+#ifdef DEBUG_PRIORS
+ printf("q_value = %f\n", q_value);
+ printf("exponent = %f\n", exponent);
+ printf("prior_prob = %f\n", prior_prob);
+ printf("tot_prior_prob = %f\n", tot_prior_prob);
+ printf("X_values[%d] = %f\n", a_index, *(X_values + a_index));
+#endif
+ }
+ X_sum += *(X_values + a_index);
+ }
+
+ /* update emission matrix */
+ for(a_index = 0; a_index < a_size; a_index++) {
+ ed_res1 = *(X_values + a_index) / X_sum;
+#ifdef DEBUG_PRIORS
+ printf("ed_res1 = %f\n", ed_res1);
+ printf("X_sum = %f\n", X_sum);
+#endif
+ if(ed_res1 != 0.0) {
+ *(emissions + get_mtx_index(k, a_index, a_size)) = ed_res1;
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = log10(ed_res1);
+ }
+ else {
+ *(emissions + get_mtx_index(k, a_index, a_size)) = ed_res1;
+ *(log_emissions + get_mtx_index(k, a_index, a_size)) = DEFAULT;
+ }
+ }
+
+ free(logbeta_an_values);
+ free(X_values);
+}
+
+
+void anneal_E_matrix_multi(double temperature, double *E, struct hmm_multi_s *hmmp, int alphabet)
+{
+ int i,j;
+ double rand_nr;
+ int a_size;
+
+ if(alphabet == 1) {
+ if(hmmp->alphabet_type == DISCRETE) {
+ a_size = hmmp->a_size;
+ }
+ else {
+ a_size = hmmp->a_size + 1;
+ }
+ }
+ if(alphabet == 2) {
+ if(hmmp->alphabet_type_2 == DISCRETE) {
+ a_size = hmmp->a_size_2;
+ }
+ else {
+ a_size = hmmp->a_size_2 + 1;
+ }
+ }
+ if(alphabet == 3) {
+ if(hmmp->alphabet_type_3 == DISCRETE) {
+ a_size = hmmp->a_size_3;
+ }
+ else {
+ a_size = hmmp->a_size_3 + 1;
+ }
+ }
+ if(alphabet == 4) {
+ if(hmmp->alphabet_type_4 == DISCRETE) {
+ a_size = hmmp->a_size_4;
+ }
+ else {
+ a_size = hmmp->a_size_4 + 1;
+ }
+ }
+
+
+ srand(time(0));
+ for(i = 1; i < hmmp->nr_v - 1; i++) {
+ for(j = 0; j < a_size; j++) {
+ rand_nr = (double)rand()/RAND_MAX;
+ rand_nr = rand_nr * temperature;
+ *(E + get_mtx_index(i, j, a_size)) += *(E + get_mtx_index(i, j, a_size)) * rand_nr;
+ }
+ }
+}
+
+void anneal_T_matrix_multi(double temperature, double *T, struct hmm_multi_s *hmmp)
+{
+ int i,j;
+ double rand_nr;
+
+ srand(time(0));
+ for(i = 1; i < hmmp->nr_v - 1; i++) {
+ for(j = 0; j < hmmp->nr_v; j++) {
+ rand_nr = (double)rand()/RAND_MAX;
+ rand_nr = rand_nr * temperature;
+ *(T + get_mtx_index(i, j, hmmp->nr_v)) += *(T + get_mtx_index(i, j, hmmp->nr_v)) * rand_nr;
+ }
+ }
+}
+
+void calculate_TE_contributions_multi(double *T, double *E, double *E_2, double *E_3, double *E_4,
+ double *T_lab, double *E_lab, double *E_lab_2, double *E_lab_3, double *E_lab_4,
+ double *T_ulab, double *E_ulab, double *E_ulab_2, double *E_ulab_3, double *E_ulab_4,
+ double *emissions, double *emissions_2, double *emissions_3, double *emissions_4,
+ double *transitions, int nr_v, int a_size, int a_size_2, int a_size_3, int a_size_4,
+ double *emiss_prior_scalers, double *emiss_prior_scalers_2, double *emiss_prior_scalers_3,
+ double *emiss_prior_scalers_4, int rd, int nr_alphabets) {
+ int v,w;
+ int a;
+ int x,y, y_2, y_3, y_4;
+ double rowsum;
+ double T_divider, E_divider, E_divider_2, E_divider_3, E_divider_4, max_T_ulab;
+ double max_E_ulab, max_E_ulab_2, max_E_ulab_3, max_E_ulab_4;
+ double E_limiter, E_limiter_2, E_limiter_3, E_limiter_4, T_limiter;
+ double DIVIDER_SCALER = 7.0;
+
+
+#ifdef DEBUG_EXTBW
+ printf("matrices before update emissions mtx\n");
+ dump_E_matrix(nr_v, a_size, emissions);
+#endif
+
+#ifdef DEBUG_EXTBW
+ printf("matrices before update ulab\n");
+ dump_E_matrix(nr_v, a_size, E_ulab);
+#endif
+
+#ifdef DEBUG_EXTBW
+ printf("matrices before update lab\n");
+ dump_E_matrix(nr_v, a_size, E_lab);
+#endif
+
+
+ /* copy current emission and transition matrix values to E and T matrices + scale values */
+ for(v = 0; v < nr_v-1; v++) {
+ for(w = 1; w < nr_v-1; w++) {
+ x = get_mtx_index(v,w,nr_v);
+ *(T + x) = *(transitions + x);
+ }
+ }
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size; a++) {
+ y = get_mtx_index(v,a,a_size);
+ *(E + y) = *(emissions + y);
+ }
+ }
+ if(nr_alphabets > 1) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_2; a++) {
+ y = get_mtx_index(v,a,a_size_2);
+ *(E_2 + y) = *(emissions_2 + y);
+ }
+ }
+ }
+ if(nr_alphabets > 2) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_3; a++) {
+ y = get_mtx_index(v,a,a_size_3);
+ *(E_3 + y) = *(emissions_3 + y);
+ }
+ }
+ }
+ if(nr_alphabets > 3) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_4; a++) {
+ y = get_mtx_index(v,a,a_size_4);
+ *(E_4 + y) = *(emissions_4 + y);
+ }
+ }
+ }
+
+#ifdef DEBUG_EXTBW
+ printf("matrices after emission copy update E matrices\n");
+ dump_E_matrix(nr_v, a_size, E);
+ dump_T_matrix(nr_v, nr_v, T);
+#endif
+
+ /* check matrices for potential negative values + compensate */
+ T_divider = 1.0;
+ E_divider = 1.0;
+ E_divider_2 = 1.0;
+ E_divider_3 = 1.0;
+ E_divider_4 = 1.0;
+ for(v = 0; v < nr_v-1; v++) {
+ for(w = 1; w < nr_v-1; w++) {
+ x = get_mtx_index(v,w,nr_v);
+ T_limiter = *(T + x);
+ if(*(T + x) > (1.0 - *(T + x))) {
+ T_limiter = 1.0 - *(T + x);
+ }
+ if(*(T + x) != 0.0 && (*(T_ulab + x) - *(T_lab + x)) / T_limiter > T_divider) {
+ T_divider = (*(T_ulab + x) - *(T_lab + x)) / T_limiter;
+#ifdef DEBUG_EXTBW
+ printf("needed T_divider\n");
+#endif
+ }
+ }
+ }
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size; a++) {
+ y = get_mtx_index(v,a,a_size);
+ E_limiter = *(E + y);
+ if(*(E + y) > (1.0 - *(E + y))) {
+ E_limiter = 1.0 - *(E + y);
+ }
+ if(*(E + y) != 0.0 && (*(E_ulab + y) - *(E_lab + y)) / E_limiter > E_divider) {
+ E_divider = (*(E_ulab + y) - *(E_lab + y)) / E_limiter;
+#ifdef DEBUG_EXTBW
+ printf("needed E_divider\n");
+ printf("E_divider = %f\n", E_divider);
+#endif
+ }
+ }
+ }
+ if(nr_alphabets > 1) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_2; a++) {
+ y = get_mtx_index(v,a,a_size_2);
+ E_limiter_2 = *(E_2 + y);
+ if(*(E_2 + y) > (1.0 - *(E_2 + y))) {
+ E_limiter_2 = 1.0 - *(E_2 + y);
+ }
+ if(*(E_2 + y) != 0.0 && (*(E_ulab_2 + y) - *(E_lab_2 + y)) / E_limiter_2 > E_divider_2) {
+ E_divider_2 = (*(E_ulab_2 + y) - *(E_lab_2 + y)) / *(E_2 + y);
+#ifdef DEBUG_EXTBW
+ printf("needed E_divider_2\n");
+ printf("E_divider_2 = %f\n", E_divider_2);
+#endif
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 2) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_3; a++) {
+ y = get_mtx_index(v,a,a_size_3);
+ E_limiter_3 = *(E_3 + y);
+ if(*(E_3 + y) > (1.0 - *(E_3 + y))) {
+ E_limiter_3 = 1.0 - *(E_3 + y);
+ }
+ if(*(E_3 + y) != 0.0 && (*(E_ulab_3 + y) - *(E_lab_3 + y)) / E_limiter_3 > E_divider_3) {
+ E_divider_3 = (*(E_ulab_3 + y) - *(E_lab_3 + y)) / *(E_3 + y);
+#ifdef DEBUG_EXTBW
+ printf("needed E_divider_3\n");
+#endif
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 3) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_4; a++) {
+ y = get_mtx_index(v,a,a_size_4);
+ E_limiter_4 = *(E_4 + y);
+ if(*(E_4 + y) > (1.0 - *(E_4 + y))) {
+ E_limiter_4 = 1.0 - *(E_4 + y);
+ }
+ if(*(E_4 + y) != 0.0 && (*(E_ulab_4 + y) - *(E_lab_4 + y)) / E_limiter_4 > E_divider_4) {
+ E_divider_4 = (*(E_ulab_4 + y) - *(E_lab_4 + y)) / *(E_4 + y);
+#ifdef DEBUG_EXTBW
+ printf("needed E_divider_4\n");
+#endif
+ }
+ }
+ }
+ }
+
+ T_divider = T_divider * DIVIDER_SCALER;
+ E_divider = E_divider * DIVIDER_SCALER;
+ E_divider_2 = E_divider_2 * DIVIDER_SCALER;
+ E_divider_3 = E_divider_3 * DIVIDER_SCALER;
+ E_divider_4 = E_divider_4 * DIVIDER_SCALER;
+
+ /* add T_lab - T_ulab and E_lab - E_ulab values to T and E matrices */
+ for(v = 0; v < nr_v-1; v++) {
+ for(w = 1; w < nr_v-1; w++) {
+ x = get_mtx_index(v,w,nr_v);
+ *(T + x) += (*(T_lab + x) - *(T_ulab + x)) / T_divider;
+ if(*(T + x) < 0.0) {
+ *(T + x) = 0.0;
+ }
+ }
+ }
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size; a++) {
+ y = get_mtx_index(v,a,a_size);
+ *(E + y) += (*(E_lab + y) - *(E_ulab + y)) / (E_divider * *(emiss_prior_scalers + v));
+ if(*(E + y) < 0.0) {
+ *(E + y) = 0.0;
+ }
+ }
+ }
+ if(nr_alphabets > 1) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_2; a++) {
+ y = get_mtx_index(v,a,a_size_2);
+ *(E_2 + y) += (*(E_lab_2 + y) - *(E_ulab_2 + y)) / (E_divider_2 * *(emiss_prior_scalers_2 + v));
+ if(*(E_2 + y) < 0.0) {
+ *(E_2 + y) = 0.0;
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 2) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_3; a++) {
+ y = get_mtx_index(v,a,a_size_3);
+ *(E_3 + y) += (*(E_lab_3 + y) - *(E_ulab_3 + y)) / (E_divider_3 * *(emiss_prior_scalers_3 + v));
+ if(*(E_3 + y) < 0.0) {
+ *(E_3 + y) = 0.0;
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 3) {
+ for(v = 1; v < nr_v-1; v++) {
+ for(a = 0; a < a_size_4; a++) {
+ y = get_mtx_index(v,a,a_size_4);
+ *(E_4 + y) += (*(E_lab_4 + y) - *(E_ulab_4 + y)) / (E_divider_4 * *(emiss_prior_scalers_4 + v));
+ if(*(E_4 + y) < 0.0) {
+ *(E_4 + y) = 0.0;
+ }
+ }
+ }
+ }
+
+
+#ifdef DEBUG_EXTBW
+ printf("matrices after lab/ulab update E matrices\n");
+ dump_E_matrix(nr_v, a_size, E);
+
+ printf("T matrices after update\n");
+ dump_T_matrix(nr_v, nr_v, T);
+#endif
+
+ /* weight matrices with prior distribution */
+ /* not implemented, not obvious if it will work */
+
+
+ /* normalize matrices */
+ for(v = 0; v < nr_v-1; v++) {
+ rowsum = 0.0;
+ for(w = 1; w < nr_v-1; w++) {
+ x = get_mtx_index(v,w,nr_v);
+ rowsum += *(transitions + x);
+ }
+ if(rowsum != 0.0) {
+ for(w = 1; w < nr_v-1; w++) {
+ x = get_mtx_index(v,w,nr_v);
+ *(transitions + x) = *(transitions + x) / rowsum;
+ }
+ }
+ }
+ for(v = 1; v < nr_v-1; v++) {
+ rowsum = 0.0;
+ for(a = 0; a < a_size; a++) {
+ y = get_mtx_index(v,a,a_size);
+ rowsum += *(emissions + y);
+ }
+ if(rowsum != 0.0) {
+ for(a = 0; a < a_size; a++) {
+ y = get_mtx_index(v,a,a_size);
+ *(emissions + y) = *(emissions + y) / rowsum;
+ }
+ }
+ }
+ if(nr_alphabets > 1) {
+ for(v = 1; v < nr_v-1; v++) {
+ rowsum = 0.0;
+ for(a = 0; a < a_size_2; a++) {
+ y = get_mtx_index(v,a,a_size_2);
+ rowsum += *(emissions_2 + y);
+ }
+ if(rowsum != 0.0) {
+ for(a = 0; a < a_size_2; a++) {
+ y = get_mtx_index(v,a,a_size_2);
+ *(emissions_2 + y) = *(emissions_2 + y) / rowsum;
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 2) {
+ for(v = 1; v < nr_v-1; v++) {
+ rowsum = 0.0;
+ for(a = 0; a < a_size_3; a++) {
+ y = get_mtx_index(v,a,a_size_3);
+ rowsum += *(emissions_3 + y);
+ }
+ if(rowsum != 0.0) {
+ for(a = 0; a < a_size_3; a++) {
+ y = get_mtx_index(v,a,a_size_3);
+ *(emissions_3 + y) = *(emissions_3 + y) / rowsum;
+ }
+ }
+ }
+ }
+ if(nr_alphabets > 3) {
+ for(v = 1; v < nr_v-1; v++) {
+ rowsum = 0.0;
+ for(a = 0; a < a_size_4; a++) {
+ y = get_mtx_index(v,a,a_size_4);
+ rowsum += *(emissions_4 + y);
+ }
+ if(rowsum != 0.0) {
+ for(a = 0; a < a_size_4; a++) {
+ y = get_mtx_index(v,a,a_size_4);
+ *(emissions_4 + y) = *(emissions_4 + y) / rowsum;
+ }
+ }
+ }
+ }
+#ifdef DEBUG_EXTBW
+ printf("E matrices after update\n");
+ dump_E_matrix(nr_v, a_size, E);
+
+ printf("T matrices after update\n");
+ dump_T_matrix(nr_v, nr_v, T);
+ printf("***************************************\n");
+#endif
+
+}
+
+/****************************some utility methods***********************************************/
+int silent_state_multi(int k, struct hmm_multi_s *hmmp)
+{
+ if(*(hmmp->emissions + get_mtx_index(k,0,hmmp->a_size)) == SILENT) {
+ return YES;
+ }
+ else {
+ return NO;
+ }
+}
+
+void update_tot_trans_mtx_multi(struct hmm_multi_s *hmmp)
+{
+ int v,w;
+ struct path_element *wp;
+ double t_res;
+
+#ifdef DEBUG_BW
+ //hmmp->tot_transitions = (double*)malloc_or_die(hmmp->nr_v * hmmp->nr_v * sizeof(double));
+#endif
+
+ /***************** changed to loop over trans to end state as well, may not always work *********************/
+
+ for(v = 0; v < hmmp->nr_v;v++) {
+ wp = *(hmmp->from_trans_array + v);
+ while(wp->vertex != END) /* w = from-vertex */ {
+ t_res = 1.0;
+ w = wp->vertex;
+ while(wp->next != NULL) {
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, (wp + 1)->vertex, hmmp->nr_v));
+ /* probability of transition from w to v via silent states */
+ wp++;
+ }
+ t_res = t_res * *((hmmp->transitions) +
+ get_mtx_index(wp->vertex, v, hmmp->nr_v));
+ *(hmmp->tot_transitions + get_mtx_index(w,v,hmmp->nr_v)) = t_res;
+ wp++;
+ }
+ }
+#ifdef DEBUG_BW
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->tot_transitions);
+ dump_trans_matrix(hmmp->nr_v, hmmp->nr_v, hmmp->transitions);
+#endif
+}
+
+/* the general method for adding values to E matrix */
+void add_to_E_multi(double *E, double Eka_base, struct msa_letter_s *msa_seq, int p, int k, int a_size, int normalize,
+ double *subst_mtx, int alphabet, int scoring_method, int use_nr_occ, int alphabet_type, double *emissions)
+{
+
+ if(alphabet_type == CONTINUOUS) {
+ add_to_E_continuous(E, Eka_base, msa_seq, p, k, a_size, emissions);
+ }
+ else if(scoring_method == DOT_PRODUCT && use_nr_occ == YES) {
+ add_to_E_dot_product_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == DOT_PRODUCT_PICASSO && use_nr_occ == YES) {
+ add_to_E_dot_product_picasso_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == PICASSO && use_nr_occ == YES) {
+ add_to_E_picasso_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == PICASSO_SYM && use_nr_occ == YES) {
+ add_to_E_picasso_sym_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SJOLANDER && use_nr_occ == YES) {
+ add_to_E_sjolander_score_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SJOLANDER_REVERSED && use_nr_occ == YES) {
+ add_to_E_sjolander_reversed_score_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SUBST_MTX_PRODUCT && use_nr_occ == YES) {
+ add_to_E_subst_mtx_product_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT && use_nr_occ == YES) {
+ add_to_E_subst_mtx_dot_product_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT_PRIOR && use_nr_occ == YES) {
+ add_to_E_subst_mtx_dot_product_prior_nr_occ(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet);
+ }
+ else if(scoring_method == DOT_PRODUCT) {
+ add_to_E_dot_product(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == DOT_PRODUCT_PICASSO) {
+ add_to_E_dot_product_picasso(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == PICASSO) {
+ add_to_E_picasso(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == PICASSO_SYM) {
+ add_to_E_picasso_sym(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SJOLANDER) {
+ add_to_E_sjolander_score(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SJOLANDER_REVERSED) {
+ add_to_E_sjolander_reversed_score(E, Eka_base, msa_seq, p, k, a_size, normalize);
+ }
+ else if(scoring_method == SUBST_MTX_PRODUCT) {
+ add_to_E_subst_mtx_product(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT) {
+ add_to_E_subst_mtx_dot_product(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet);
+ }
+ else if(scoring_method == SUBST_MTX_DOT_PRODUCT_PRIOR) {
+ add_to_E_subst_mtx_dot_product_prior(E, Eka_base, msa_seq, p, k, a_size, normalize, subst_mtx,
+ alphabet);
+ }
+ else {
+ printf("Error: Unrecognized scoring method\n");
+ exit(1);
+ }
+}
+
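Note on the tie-group averaging in recalculate_emiss_expectations_multi and recalculate_trans_expectations_multi above: both walk a flat list of entries in which each group ends with a sentinel, sum the expected counts of the group, and write the mean back to every member. The following is a minimal standalone sketch of that idea, not code from the patch. The names average_tied_values and GROUP_END are hypothetical (GROUP_END stands in for END), and it indexes a plain array instead of going through get_mtx_index. Unlike the patch, it skips empty groups instead of dividing by a zero count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the END sentinel used in the patch. */
#define GROUP_END (-1)

/*
 * Replace every value that belongs to a tie group with the mean of its group.
 * "groups" is a flat list of indices into "values"; each of the nr_groups
 * groups is terminated by GROUP_END. Values in no group are left untouched.
 */
static void average_tied_values(double *values, const int *groups, int nr_groups)
{
    int g;
    int k = 0; /* current position in the flat group list */

    assert(values != NULL && groups != NULL);

    for (g = 0; g < nr_groups; g++) {
        double sum = 0.0;
        int start = k;
        int count = 0;

        /* first pass: accumulate the members of this group */
        while (groups[k] != GROUP_END) {
            sum += values[groups[k]];
            k++;
            count++;
        }

        /* second pass: write the mean back, skipping empty groups */
        if (count > 0) {
            double mean = sum / (double)count;
            int j;
            for (j = start; j < k; j++) {
                values[groups[j]] = mean;
            }
        }

        k++; /* step over the terminator to the next group */
    }
}
```

Remembering the start position of each group avoids the k = k - l rewind used in the patch; the result is the same for non-empty groups.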
diff --git a/svmloc/Makefile.am b/svmloc/Makefile.am
new file mode 100644
index 0000000..30c02ba
--- /dev/null
+++ b/svmloc/Makefile.am
@@ -0,0 +1,10 @@
+# Process this file with automake to produce Makefile.in
+# Made by me.
+# svmloc dir
+
+lib_LTLIBRARIES = libsvmloc.la
+libsvmloc_la_SOURCES = binding.cpp libsvm.cpp svmloc.cpp binding.h libsvm.h svmloc.h
+libsvmloc_la_LDFLAGS = -version-info 0:0:0
+
+noinst_HEADERS = \
+COPYRIGHT
diff --git a/svmloc/binding.cpp b/svmloc/binding.cpp
new file mode 100644
index 0000000..402e492
--- /dev/null
+++ b/svmloc/binding.cpp
@@ -0,0 +1,27 @@
+#include "binding.h"
+
+SVM* createSVM(int st, int kt, int d, double g, double c0, double C, double nu, double e)
+{
+ return new SVM(st, kt, d, g, c0, C, nu, e);
+}
+
+void destroySVM(SVM* pSVM)
+{
+ delete pSVM;
+}
+
+int loadSVMModel(SVM* pSVM, char* filename)
+{
+ return pSVM->loadModel(filename);
+}
+
+int loadSVMFreqPattern(SVM* pSVM, char* filename)
+{
+ return pSVM->loadFreqPattern(filename);
+}
+
+double SVMClassify(SVM* pSVM, char* seq)
+{
+ return pSVM->classify(seq);
+}
+
diff --git a/svmloc/binding.h b/svmloc/binding.h
new file mode 100644
index 0000000..1593a9c
--- /dev/null
+++ b/svmloc/binding.h
@@ -0,0 +1,28 @@
+#ifndef __BINDING_H__
+#define __BINDING_H__
+
+#ifdef __cplusplus
+
+#include "svmloc.h"
+
+#else
+ typedef
+ struct SVM
+ SVM;
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+ extern SVM* createSVM(int st, int kt, int d, double g, double c0, double C, double nu, double e);
+ extern void destroySVM(SVM*);
+ extern int loadSVMModel(SVM*, char*);
+ extern int loadSVMFreqPattern(SVM*, char*);
+ extern double SVMClassify(SVM*, char*);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/svmloc/svmloc.cpp b/svmloc/svmloc.cpp
new file mode 100644
index 0000000..8922fb5
--- /dev/null
+++ b/svmloc/svmloc.cpp
@@ -0,0 +1,507 @@
+#include "svmloc.h"
+#include <errno.h>
+
+#ifdef DEBUG
+#include <stdarg.h>
+void printf_dbg(const char *a, ...) {
+ va_list alist;
+ va_start(alist,a);
+ vfprintf(stdout,a,alist);
+ va_end(alist);
+ fflush(NULL);
+}
+#else
+void printf_dbg(const char *a, ...) {}
+#endif
+
+DataSet::DataSet(double l) {
+ label = l;
+ realigned=false;
+ n=0;
+ max_n=16;
+ attributes = (struct svm_node *)malloc(sizeof(struct svm_node) * max_n);
+ assert(attributes!=NULL);
+ attributes[0].index=-1; // insert end-of-data marker
+ max_i=-1;
+}
+
+DataSet::~DataSet() {
+ printf_dbg("destructor DS called\n");
+ if (realigned) {
+ attributes[n].value=-1; // notify svm that dataset is destroyed
+ } else {
+ free(attributes);
+ }
+}
+
+void DataSet::realign(struct svm_node *address) {
+ assert(address!=NULL);
+ memcpy(address,attributes,sizeof(struct svm_node)*(n+1));
+ free(attributes); attributes=address;
+ max_n=n+1; realigned=true; attributes[n].value=0;
+}
+
+void DataSet::setAttribute(int k, double v) {
+ if (realigned) {
+ printf_dbg("set Attr with realigned k=%d, v=%lf\n",k,v);
+ max_n=n+2; attributes[n].value=-1; // notify svm to not care about allocating memory for this dataset
+ struct svm_node *address=(struct svm_node *)malloc(sizeof(struct svm_node)*max_n);
+ assert(address!=NULL);
+ memcpy(address,attributes,sizeof(struct svm_node)*(n+1));
+ attributes=address; realigned=false; if (k==-1) { return; }
+ } else {
+ printf_dbg("set Attr without realigned k=%d, v=%lf\n",k,v);
+ }
+
+ if (k>max_i) {
+ max_i=k;
+ if (v!=0) {
+ attributes[n].index=k;
+ attributes[n].value=v; n++; attributes[n].index=-1;
+ }
+ } else {
+ // assume sorted array - check where it belongs
+ int upper = n-1; int lower=0; int midpos=0; int midk=-1;
+ while (lower<=upper) {
+ midpos = (upper+lower)/2;
+ midk=attributes[midpos].index;
+ if (k>midk) { lower=midpos+1; }
+ else if (k<midk) { upper=midpos-1; }
+ else { break; }
+ }
+ if (k==midk) { attributes[midpos].value=v; }
+ else {
+ if (v!=0) {
+ for (int i=n; i>lower; i--) {
+ attributes[i].index=attributes[i-1].index;
+ attributes[i].value=attributes[i-1].value;
+ }
+ attributes[lower].index=k;
+ attributes[lower].value=v; n++; attributes[n].index=-1;
+ }
+ }
+ }
+ if (n>=max_n-1) {
+ max_n*=2;
+ attributes = (struct svm_node *)realloc(attributes,sizeof(struct svm_node)*max_n);
+ assert(attributes!=NULL);
+ }
+}
+
+double DataSet::getAttribute(int k) {
+ int upper = n-1; int lower=0; int midpos=0; int midk=-1;
+ while (upper>=lower) {
+ midpos = (upper+lower)/2;
+ midk=attributes[midpos].index;
+ if (k>midk) { lower=midpos+1; }
+ else if (k<midk) { upper=midpos-1; }
+ else { break; }
+ }
+ if (k==midk) { return attributes[midpos].value; } else { return 0; }
+ return -1;
+}
+
+
+SVM::SVM(int st, int kt, int d, double g, double c0, double C, double nu,
+ double e) {
+
+ // Default parameter settings.
+ param.svm_type = st;
+ param.kernel_type = kt;
+ param.degree = d;
+ param.gamma = g;
+ param.coef0 = c0;
+ param.nu = nu;
+ param.cache_size = 40;
+ param.C = 1;
+ param.eps = 1e-3;
+ param.p = e;
+ param.shrinking = 1;
+ param.nr_weight = 0;
+ param.weight_label = NULL;
+ param.weight = NULL;
+ param.probability = 0;
+ nelem=0;
+
+ x_space = NULL;
+ model = NULL;
+ prob = NULL;
+
+ randomized = 0;
+}
+
+void SVM::addDataSet(DataSet *ds) {
+
+ if(ds != NULL) dataset.push_back(ds);
+}
+
+
+void SVM::clearDataSet() {
+ dataset.clear();
+}
+
+void SVM::free_x_space() {
+ if (x_space!=NULL) {
+ long idx=nelem;
+ for (int i=dataset.size()-1; i>=0; i--) {
+ assert(x_space[idx-1].index==-1);
+ if (x_space[idx-1].value!=-1) {
+ printf_dbg((dataset[i]->realigned ? "+" : "-"));
+ printf_dbg("%lf\n",x_space[idx-1].value);
+ idx-=((dataset[i]->n)+1);
+ dataset[i]->setAttribute(-1,0);
+ } else {
+ printf_dbg("%d already destroyed or changed.\n",i);
+ idx-=2; while (idx >= 0 && x_space[idx].index!=-1) { idx--; }
+ idx++;
+ }
+ }
+ assert(idx==0);
+ free(x_space); x_space=NULL;
+ }
+}
+
+int SVM::train(int retrain) {
+ const char *error;
+
+ // Free any old model we have.
+ if(model != NULL) {
+ svm_destroy_model(model);
+ model = NULL;
+ }
+
+ if(retrain) {
+ if(prob == NULL) return 0;
+ model = svm_train(prob, &param);
+ return 1;
+ }
+
+ if (x_space != NULL) free_x_space();
+ if(prob != NULL) free(prob);
+
+ model = NULL;
+ prob = NULL;
+
+ // Allocate memory for the problem struct.
+ if((prob = (struct svm_problem *)malloc(sizeof(struct svm_problem))) == NULL) return 0;
+
+ prob->l = dataset.size();
+
+ // Allocate memory for the labels/nodes.
+ prob->y = (double *)malloc(sizeof(double) * prob->l);
+ prob->x = (struct svm_node **)malloc(sizeof(struct svm_node *) * prob->l);
+
+ if((prob->y == NULL) || (prob->x == NULL)) {
+ if(prob->y != NULL) free(prob->y);
+ if(prob->x != NULL) free(prob->x);
+ free(prob);
+ return 0;
+ }
+
+ // Check for errors with the parameters.
+ error = svm_check_parameter(prob, &param);
+ if(error) { free(prob->x); free (prob->y); free(prob); return 0; }
+
+ // Allocate x_space and successively release dataset memory
+ // (realigning the dataset memory to x_space)
+ nelem=0;
+ for (unsigned int i=0; i<dataset.size(); i++) {
+ nelem+=dataset[i]->n+1;
+ }
+ x_space = (struct svm_node *)malloc(sizeof(struct svm_node)*nelem);
+ long idx=0;
+ for (unsigned int i=0; i<dataset.size(); i++) {
+ dataset[i]->realign(x_space+idx);
+ idx+=(dataset[i]->n)+1;
+ }
+
+ if (x_space==NULL) {
+ free(prob->y);
+ free(prob->x);
+ free(prob);
+ nelem=0;
+ return 0;
+ }
+
+ // Munge the datasets into the format that libsvm expects.
+ int maxi = 0; long n=0;
+ for(int i = 0; i < prob->l; i++) {
+ prob->x[i] = &x_space[n]; //dataset[i]->attributes;
+ assert((dataset[i]->attributes)==(&x_space[n]));
+ n+=dataset[i]->n+1;
+ prob->y[i] = dataset[i]->getLabel();
+
+ if( dataset[i]->max_i > maxi) maxi = dataset[i]->max_i;
+ }
+ printf_dbg("\nnelem=%ld\n",n);
+
+ if(param.gamma == 0) param.gamma = 1.0/maxi;
+
+ model = svm_train(prob, &param);
+
+ return 1;
+}
+
+double SVM::predict_value(DataSet *ds) {
+ double pred[100];
+
+ if(ds == NULL) return 0;
+
+ svm_predict_values(model, ds->attributes, pred);
+
+ return pred[0];
+}
+
+
+double SVM::predict(DataSet *ds) {
+ double pred;
+
+ if(ds == NULL) return 0;
+
+ pred = svm_predict(model, ds->attributes);
+
+ return pred;
+}
+
+int SVM::saveModel(char *filename) {
+
+ if((model == NULL) || (filename == NULL)) {
+ return 0;
+ } else {
+ return ! svm_save_model(filename, model);
+ }
+}
+
+int SVM::loadModel(char *filename) {
+ struct svm_model *tmodel;
+
+ if(filename == NULL) return 0;
+
+ if(x_space != NULL) {
+ free_x_space();
+ }
+
+ if(model != NULL) {
+ svm_destroy_model(model);
+ model = NULL;
+ }
+
+ if((tmodel = svm_load_model(filename)) != NULL) {
+ model = tmodel;
+ return 1;
+ }
+
+ return 0;
+}
+
+int SVM::loadFreqPattern(char *filename) {
+
+ ifstream infile;
+ string line;
+ size_t space;
+
+ freqpatterns.clear();
+
+ infile.open(filename, ifstream::in);
+
+ if(!infile.is_open()) {
+ cout << "Error opening file" << endl;
+ return 0;
+ }
+
+ while(infile.good()) {
+ getline(infile, line, '\n');
+
+ if(!line.empty()) {
+
+ space = line.find_first_of(' ');
+ if(space != string::npos) {
+ line.erase(space);
+ }
+ freqpatterns.push_back(line);
+ }
+ }
+
+ infile.close();
+
+ return 1;
+}
+
+double SVM::classify(char *sequence) {
+ vector<string>::iterator it;
+ string seq = sequence;
+ set<int> vectors;
+ int n = 1;
+
+ if(freqpatterns.empty() || model == NULL)
+ return -1;
+
+ // Find all the frequent patters in the sequence and
+ // record the sequence number of that patterns
+
+ for (it=freqpatterns.begin(); it < freqpatterns.end(); it++) {
+ if(seq.find(*it) != string::npos) {
+ vectors.insert(n);
+ }
+ n++;
+ }
+
+ // Allocate the svm_node array to pass to libsvm + 1
+ // for stop element
+ int v_size = vectors.size();
+ struct svm_node* attributes = (struct svm_node *)malloc(sizeof(struct svm_node) * (v_size+1));
+
+ // Go through the set of vectors and build the
+ // svm_node array, it must be ordered
+ set<int>::iterator vit;
+ n=0;
+ for(vit=vectors.begin() ; vit != vectors.end(); vit++) {
+ attributes[n].index = *vit;
+ attributes[n].value=1;
+ n++;
+ }
+
+ // Set the last element to -1, which is the stop signal
+ // for libsvm
+ attributes[n].index = -1;
+
+ double results = svm_predict(model, attributes);
+
+ delete attributes;
+
+ return results;
+
+}
+
+double SVM::crossValidate(int nfolds) {
+ double sumv = 0, sumy = 0, sumvv = 0, sumyy = 0, sumvy = 0;
+ double total_error = 0;
+ int total_correct = 0;
+ int i;
+
+ if(! prob) return 0;
+
+ if(! randomized) {
+ // random shuffle
+ for(i=0;i<prob->l;i++) {
+ int j = i+rand()%(prob->l-i);
+ struct svm_node *tx;
+ double ty;
+
+ tx = prob->x[i];
+ prob->x[i] = prob->x[j];
+ prob->x[j] = tx;
+
+ ty = prob->y[i];
+ prob->y[i] = prob->y[j];
+ prob->y[j] = ty;
+ }
+
+ randomized = 1;
+ }
+
+ for(i=0;i<nfolds;i++) {
+ int begin = i*prob->l/nfolds;
+ int end = (i+1)*prob->l/nfolds;
+ int j,k;
+ struct svm_problem subprob;
+
+ subprob.l = prob->l-(end-begin);
+ subprob.x = (struct svm_node**)malloc(sizeof(struct svm_node)*subprob.l);
+ subprob.y = (double *)malloc(sizeof(double)*subprob.l);
+
+ k=0;
+ for(j=0;j<begin;j++) {
+ subprob.x[k] = prob->x[j];
+ subprob.y[k] = prob->y[j];
+ ++k;
+ }
+
+ for(j=end;j<prob->l;j++) {
+ subprob.x[k] = prob->x[j];
+ subprob.y[k] = prob->y[j];
+ ++k;
+ }
+
+ if(param.svm_type == EPSILON_SVR || param.svm_type == NU_SVR) {
+ struct svm_model *submodel = svm_train(&subprob,&param);
+ double error = 0;
+ for(j=begin;j<end;j++) {
+ double v = svm_predict(submodel,prob->x[j]);
+ double y = prob->y[j];
+ error += (v-y)*(v-y);
+ sumv += v;
+ sumy += y;
+ sumvv += v*v;
+ sumyy += y*y;
+ sumvy += v*y;
+ }
+ svm_destroy_model(submodel);
+ // cout << "Mean squared error = %g\n", error/(end-begin));
+ total_error += error;
+ } else {
+ struct svm_model *submodel = svm_train(&subprob,&param);
+
+ int correct = 0;
+ for(j=begin;j<end;j++) {
+ double v = svm_predict(submodel,prob->x[j]);
+ if(v == prob->y[j]) ++correct;
+ }
+ svm_destroy_model(submodel);
+ //cout << "Accuracy = " << 100.0*correct/(end-begin) << " (" <<
+ //correct << "/" << (end-begin) << endl;
+ total_correct += correct;
+ }
+
+ free(subprob.x);
+ free(subprob.y);
+ }
+ if(param.svm_type == EPSILON_SVR || param.svm_type == NU_SVR) {
+ return ((prob->l*sumvy-sumv*sumy)*(prob->l*sumvy-sumv*sumy))/
+ ((prob->l*sumvv-sumv*sumv)*(prob->l*sumyy-sumy*sumy));
+ } else {
+ return 100.0*total_correct/prob->l;
+ }
+}
+
+int SVM::getNRClass() {
+
+ if(model == NULL) {
+ return 0;
+ } else {
+ return svm_get_nr_class(model);
+ }
+}
+
+int SVM::getLabels(int* label) {
+ if(model == NULL) {
+ return 0;
+ } else {
+ svm_get_labels(model, label);
+ return 1;
+ }
+}
+
+double SVM::getSVRProbability() {
+
+ if((model == NULL) || (svm_check_probability_model(model))) {
+ return 0;
+ } else {
+ return svm_get_svr_probability(model);
+ }
+}
+
+int SVM::checkProbabilityModel() {
+
+ if(model == NULL) {
+ return 0;
+ } else {
+ return svm_check_probability_model(model);
+ }
+}
+
+SVM::~SVM() {
+ if(x_space!=NULL) { free_x_space(); }
+ if(model != NULL) { svm_destroy_model(model); model=NULL; }
+ if(prob != NULL) { free(prob); prob=NULL; }
+}
diff --git a/svmloc/svmloc.h b/svmloc/svmloc.h
new file mode 100644
index 0000000..f99fe56
--- /dev/null
+++ b/svmloc/svmloc.h
@@ -0,0 +1,90 @@
+#ifndef __SVMLOC_H__
+#define __SVMLOC_H__
+
+#include <vector>
+#include <map>
+#include <string>
+#include <fstream>
+#include <iostream>
+#include <set>
+#include <assert.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "libsvm.h"
+
+using namespace std;
+
+class DataSet {
+ friend class SVM;
+
+ private:
+ double label;
+ struct svm_node *attributes;
+ int n; int max_n; int max_i;
+ bool realigned;
+ public:
+ DataSet(double l);
+ void setLabel(double l) { label = l; }
+ double getLabel() { return label; }
+ int getMaxI() { return max_i; }
+ void setAttribute(int k, double v);
+ double getAttribute(int k);
+ int getIndexAt(int i) { if (i<=n) { return attributes[i].index; } else { return -1; }}
+ double getValueAt(int i) { if (i<=n) { return attributes[i].value; } else { return 0; }}
+
+ void realign(struct svm_node *address);
+ ~DataSet();
+};
+
+
+class SVM {
+ public:
+ SVM(int st, int kt, int d, double g, double c0, double C, double nu,
+ double e);
+ void addDataSet(DataSet *ds);
+ int saveModel(char *filename);
+ int loadModel(char *filename);
+ int loadFreqPattern(char *filename);
+ double classify(char *sequence);
+ void clearDataSet();
+ int train(int retrain);
+ double predict_value(DataSet *ds);
+ double predict(DataSet *ds);
+ void free_x_space();
+ void setSVMType(int st) { param.svm_type = st; }
+ int getSVMType() { return param.svm_type; }
+ void setKernelType(int kt) { param.kernel_type = kt; }
+ int getKernelType() { return param.kernel_type; }
+ void setGamma(double g) { param.gamma = g; }
+ double getGamma() { return param.gamma; }
+ void setDegree(int d) { param.degree = d; }
+ double getDegree() { return param.degree; }
+ void setCoef0(double c) { param.coef0 = c; }
+ double getCoef0() { return param.coef0; }
+ void setC(double c) { param.C = c; }
+ double getC() { return param.C; }
+ void setNu(double n) { param.nu = n; }
+ double getNu() { return param.nu; }
+ void setEpsilon(double e) { param.p = e; }
+ double getEpsilon() { return param.p; }
+ double crossValidate(int nfolds);
+ int getNRClass();
+ int getLabels(int* label);
+ double getSVRProbability();
+ int checkProbabilityModel();
+
+ ~SVM();
+ private:
+ long nelem;
+ struct svm_parameter param;
+ vector<DataSet *> dataset;
+ struct svm_problem *prob;
+ struct svm_model *model;
+ vector<string> freqpatterns;
+ struct svm_node *x_space;
+ int randomized;
+};
+
+#endif
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/libpsortb.git