[med-svn] r1447 - trunk/packages/dialign/trunk

Tue Feb 19 04:47:22 UTC 2008

Author: charles-guest
Date: 2008-02-19 04:47:21 +0000 (Tue, 19 Feb 2008)
New Revision: 1447

Removed:
   trunk/packages/dialign/trunk/INSTALLATION_GUIDE
   trunk/packages/dialign/trunk/USER_GUIDE
   trunk/packages/dialign/trunk/dialign2_dir/
   trunk/packages/dialign/trunk/license/
   trunk/packages/dialign/trunk/src/
Log:
Switched to MergeWithUpstream mode

Deleted: trunk/packages/dialign/trunk/INSTALLATION_GUIDE
===================================================================

--- trunk/packages/dialign/trunk/INSTALLATION_GUIDE	2008-02-19 04:44:44 UTC (rev 1446)
+++ trunk/packages/dialign/trunk/INSTALLATION_GUIDE	2008-02-19 04:47:21 UTC (rev 1447)
@@ -1,52 +0,0 @@
-
-
-                    installation guide for  
- 
-                          DIALIGN2 
-                          ========
-    
-                   program code written by  
-
-           Burkhard Morgenstern and Said Abdeddaim
- 
-               e-mail contact: dialign at gobics.de
-
-
-(1) cd to the directory `src' that contains the dialign source code 
-
-(2) type `make' to compile the program. This should create an executable
-    binary file called `dialign2-2'.
-
-(3) you may remove all object files (type `rm *.o')
-
-(4) to run DIALIGN2, you must create an environment variable 
-    `DIALIGN2_DIR' pointing to the directory `dialign2_dir' 
-
-    (type `setenv DIALIGN2_DIR /your_path/dialign2_dir/' where 
-    `your_path' is the directory where you de-tarred the file with the
-    sources)  
-
-    The program needs the files 
-
-       tp400_dna
-       tp400_prot
-       tp400_trans
-       BLOSUM 
- 
-    that are contained in the directory `dialign2_dir'. You may move 
-    these files to any other directory and set DIALIGN2_DIR accordingly.
-
-
-
-Please note that, unlike in the first version of DIALIGN, it is NOT 
-POSSIBLE to replace the BLOSUM matrix by other similarity matrices !!
-
-DIALIGN comes with a detailed user guide. For additional information, please 
-consult the DIALIGN home page at
-
-  http://bibiserv.techfak.uni-bielefeld.de/dialign/
- 
-----------------------------------------------------------------------------
-BM, Goettingen, February 2003 
-
-

Deleted: trunk/packages/dialign/trunk/USER_GUIDE
===================================================================
--- trunk/packages/dialign/trunk/USER_GUIDE	2008-02-19 04:44:44 UTC (rev 1446)
+++ trunk/packages/dialign/trunk/USER_GUIDE	2008-02-19 04:47:21 UTC (rev 1447)
@@ -1,537 +0,0 @@
-
-
-                           DIALIGN 2.2.2 
-
-                            User Guide
-
-                     Program code written by 
-
-               Burkhard Morgenstern, Said Abdeddaim
-
-at University of Bielefeld (FSPM and International Graduate School in 
-Bioinformatics and Genome Research), GSF (ISG, IBB, MIPS/IBI), 
-North Carolina State University, Universite de Rouen, MPI fuer 
-Biochemie (Martinsried), University of Goettingen, Institute of
-Microbiology and Genetics.
-
-
-E-mail contact: dialign at gobics.de
-
-               
-                            Reference: 
-
-     B. Morgenstern (1999). 
-     DIALIGN 2: improvement of the segment-to-segment approach to
-     multiple sequence alignment.
-     Bioinformatics 15, 211 - 218.
-
-Public research assisted by DIALIGN should cite this article. For more 
-information, updated references etc. please visit the DIALIGN home page at
-
-  http://dialign.gobics.de/
-
-
-Program usage: 
-
-  dialign2-2 [ options ] <seq_file>
-
-
-<seq_file> is the name of the input sequence file; this must be a multiple
-FASTA file (all sequences in one file), a description of the format is  
-given below. The following options are available (a more detailed description
-of these options is given below):
-
- -afc            Creates additional output file "*.afc" containing data of 
-                 all fragments considered for alignment
-                 WARNING: this file can be HUGE ! 
- 
- -afc_v          like "-afc" but verbose: fragments are explicitly printed 
-                 WARNING: this file can be EVEN BIGGER ! 
-
- -anc            Anchored alignment. Requires a file <seq_file>.anc 
-                 containing anchor points.   
-
- -cs             if segments are translated, not only the `Watson strand'
-                 but also the `Crick strand' is looked at.
-
- -cw             additional output file in CLUSTAL W format.
-
- -ds             `dna alignment speed up' - non-translated nucleic acid
-                 fragments are taken into account only if they start with
-                 at least two matches. Speeds up DNA alignment at the expense
-                 of sensitivity.
-
- -fa             additional output file in FASTA format.
-
- -ff             Creates file *.frg containing information about all 
-                 fragments that are part of the respective optimal pairwise 
-                 alignmnets plus information about consistency in the multiple 
-                 alignment 
- 
- -fn <out_file>  output files are named <out_file>.<extension> .
-
-
- -fop            Creates file *.fop containing coordinates of all fragments 
-                 that are part of the respective pairwise alignments. 
-
- -fsm            Creates file *.fsm containing coordinates of all fragments 
-                 that are part of the final alignment 
- 
- -iw             overlap weights switched off (by default, overlap weights are
-                 used if up to 35 sequences are aligned). This option
-                 speeds up the alignment but may lead to reduced alignment
-                 quality.
-
- -lgs            `long genomic sequences' - combines the following options:
-                 -ma, -thr 2, -lmax 30, -smin 8, -nta, -ff,
-                 -fop, -ff, -cs, -ds, -pst 
-
- -lgs_t          Like "-lgs" but with all segment pairs assessed at the 
-                 peptide level (rather than 'mixed alignments' as with the
-                 "-lgs" option). Therefore faster than -lgs but not very 
-                 sensitive for non-coding regions.         
-
- -lmax <x>       maximum fragment length = x  (default: x = 40 or x = 120
-                 for `translated' fragments). Shorter x speeds up the program
-                 but may affect alignment quality. 
-
- -lo             (Long Output) Additional file *.log with information abut
-                 fragments selected for pairwise alignment and about 
-                 consistency in multi-alignment proceedure 
-
- -ma             `mixed alignments' consisting of P-fragments and N-fragments
-                 if nucleic acid sequences are aligned.
-
- -mask           residues not belonging to selected fragments are replaced
-                 by `*' characters in output alignment (rather than being
-                 printed in lower-case characters)
-
- -mat            Creates file *mat with substitution counts derived from the
-                 fragments that have been selected for alignment 
-
- -mat_thr <t>    Like "-mat" but only fragments with weight score > t 
-                 are considered         
-
- -max_link       "maximum linkage" clustering used to construct sequence tree
-                 (instead of UPGMA).
-
- -min_link       "minimum linkage" clustering used.
-
- -mot            "motif" option. 
-
- -msf            separate output file in MSF format.
-
- -n              input sequences are nucleic acid sequences. No translation
-                 of fragments.
-
- -nt             input sequences are nucleic acid sequences and `nucleic acid
-                 segments' are translated to `peptide segments'.
-
- -nta            `no textual alignment' - textual alignment suppressed. This
-                 option makes sense if other output files are of intrest -- 
-                 e.g. the fragment files created with -ff, -fop, -fsm or -lo   
- 
- -o              fast version, resulting alignments may be slightly different.
-
- -ow             overlap weights enforced (By default, overlap weights are
-                 used only if up to 35 sequences are aligned since calculating
-                 overlap weights is time consuming). Warning: overlap weights
-                 generally improve alignment quality but the running time
-                 increases in the order O(n^4) with the number of sequences.
-                 This is why, by default, overlap weights are used only for 
-                 sequence sets with < 35 sequences.  
-
- -pst            "print status". Creates and updates a file *.sta with
-                 information about the current status of the program run.
-                 This option is recommended if large data sets are aligned
-                 since it allows the user to estimate the remaining running
-                 time.
-
- -smin <x>       minimum similarity value for first residue pair (or codon
-                 pair) in fragments. Speeds up protein alignment or alignment
-                 of translated DNA fragments at the expense of sensitivity.
-
- -stars <x>      maximum number of `*' characters indicating degree of 
-                 local similarity among sequences. By default, no stars 
-                 are used but numbers between 0 and 9, instead.  
-
- -stdo           Results written to standard output.
-
- -ta             standard textual alignment printed (overrides suppression
-                 of textual alignments in special options, e.g. -lgs)     
-
- -thr <x>        Threshold T = x.
-
- -xfr            "exclude fragments" - list of fragments can be specified
-                 that are NOT considered for pairwise alignment 
-
-
-General remark: If contradictory options are used, subsequent options 
-override previous ones, e.g.:  
-
-  dialign2-2 -nt -n <seq_file> 
-
-runs the program with the "-n" option (no translation!), while 
-
-  dialign2-2 -n -nt <seq_file>
-
-runs it with the "-nt" option (translation!). 
- 
-
-
-                            Input File:
-
-Sequences to be aligned must be contained in a single file in FASTA
-format. Example:
-
-
-        >HTL2  
-        LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
-        SLNIFLDSKY
-        >MMLV   
-        GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
-        CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
-        >HEPB 
-        RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
-        GRTSLYADSPSVPSHLPDRVH
-
-
-The first line for each sequence starts with ">" and contains the name of 
-the sequence. Please make sure, that the first line in the input file is
-not empty and that the first character in the first line is not blank.
-
-Some details about avaliable options:
-
-     (1) Sequence Type: 
-
-     The user can decide if nucleic acid or protein sequences are to be 
-     aligned. 
-
-     (2) Threshold T: 
-
-     As described in our papers, the program DIALIGN constructs alignments 
-     from gapfree pairs of similar segments of the sequences. Such segment 
-     pairs are referred to as `(alignment) fragments' (previously, we called
-     them `diagonals'). 
-
-     Every possible fragment is given a so-called weight reflecting the 
-     degree of similarity among the two segments involved. The overall 
-     score of an alignment is then defined as the sum of weights of the 
-     fragments it consists of and the program tries to find an alignment with
-     maximum score -- in other words: the program tries to find a consistent
-     collection of fragments with maximum sum of weights. This novel scoring
-     scheme for alignments is the basic difference between DIALIGN and other
-     global or local alignment methods. Note that DIALIGN does not employ any 
-     kind of gap penalty. 
-
-     It is possible to use a threshold T for the quality of the fragments. 
-     In this case, a fragment is considered for alignment only if its 
-     `weight' exceeds this threshold. Regions of lower similarity are ignored. 
-
-     In the first version of the program (DIALIGN 1), this threshold was in 
-     many situations absolutely necessary to obtain meaningful alignments. 
-     By contrast, DIALIGN 2 should produce reasonable alignments without a 
-     threshold, i.e. with T = 0. This is the most important difference between
-     DIALIGN 2 and the first version of the program. Nevertheless, it is still
-     possible to use a positive threshold T to filter out regions of lower 
-     significance and to include only high scoring fragments into the 
-     alignment.
-
-     (3) Different levels of sequence similarity:  
-
-     If (possibly) coding nucleic acid sequences are to be aligned, DIALIGN 
-     optionally translates the compared `nucleic acid segments' to `peptide 
-     segments' according to the genetic code -- without presupposing any of 
-     the three possible reading frames, so all combinations of reading frames 
-     get checked for significant similarity. If this option is used, the 
-     similarity among segments will be assessed on the `peptide level' rather 
-     than on the `nucleotide level'. 
-
-     We strongly recommend to use the `translation' option if nucleic acid 
-     sequences are expected to contain protein coding regions, as it will 
-     significantly increase the sensitivity of the alignment procedure in 
-     such cases. 
-
-     For the levels of sequence similarity, release 2.2 of DIALIGN has 
-     two additional options:
-
-     (a) it can measure the similarity among segment pairs at both levels
-     of similarity (nucleotide-level and peptide-level similarity). The 
-     score of a fragment is based on whatever similarity is stronger. As a
-     result, the program can now produce `mixed alignments' that contain 
-     both types of fragments. Fragments with stronger similarity at the
-     `nucleotide level' referred to as N-fragments whereas fragments with
-     stronger similarity a the peptide level are called P-fragments.
-
-     (b) if the `translation' or `mixed alignment' option is used, it is
-     possible to consider the `reverse complements' of segments, too. In 
-     this case, both the original segments and their reverse complements 
-     are translated and both pairs of implied `peptide segments' are 
-     compared. This option is useful if DNA sequences contain coding regions 
-     not only on the `Watson strand' but also on the `Crick strand'.   
-
-     (4) The score that DIALIGN assigns to a fragment is based on the 
-     probability to find a fragment of the same respective length and number
-     of matches (or BLOSUM values, if the translation option is used) in
-     random sequences of the same length as the input sequences. If long
-     genomic sequences are aligned, an iterative procedure can be applied
-     where the program first looks for fragments with strong similarity.
-     In subsequent steps, regions between these fragments are realigned.
-     Here, the score of a fragment is based on random occurrence in these
-     regions between the previously aligned segment pairs. 
-
-     (5) With the -ff (or -lgs) option, a file with all fragments contained 
-     in the output alignment can be returned. This file contains additional 
-     information about the identified fragments such as 
-
-       - start coordinates in the respective sequences 
-       - length 
-       - fragment weight,
-       - iteration step (if the iterative option is used) 
-       - whether the similarity among the segments is strongest at the 
-         nucleotide level (N-frg) or at the peptide level (P-frg) if the 
-         `mixed alignment' option is used 
-       - whether the similarity is stronger on the `Watson strand' (" + " ) 
-         or on the `Crick strand' (" - " ) - if a fragment is translated
-         and the respective option is used    
-
-     All this information can be used to further post-process the DIALIGN 
-     output, for example by customized visualisation tools. 
-
-     The file containing this information looks like this: 
-
-
-      #  program call: ./dialign2-2 -lgs seq_file  
-
-       seq_len:   552   527 
-       sequences:   seq1   seq2 
-
-         1) seq: 1 2  beg: 161  351 len: 27 wgt: 7.60 it: 1   cons  P-frg +
-         2) seq: 1 2  beg: 300  507 len: 17 wgt: 4.40 it: 1   cons  N-frg
-         3) seq: 1 2  beg: 111  170 len: 12 wgt: 4.34 it: 1   cons  N-frg
-
-
-     (6) Degree of local sequence similarity: 
-
-     Numbers between 0 and 9 are printed below the alignment to indicate
-     the degree of local sequence similarity (in previous verions of the
-     program, "*" characters were used instead of numbers). These numbers
-     are normalized such that the region of highest similarity gets a
-     score of 9. With the -stars option, "*" characters can be used as 
-     previously.  
-
-     (7) `overlap weights':
-
-     This option improves the sensitivity of the program if multiple sequences
-     are aligned but it also increases the running time, especially if large
-     numbers of sequences are aligned. By default, `overlap weights' are used
-     if up to 35 sequences are aligned but switched off for larger data sets. 
-     In the command-line version, `overlap weights' can be switched on or off 
-     for data sets of any size, see below.
-
-     (8) `anchored alignment':
-   
-     Forces the program to align user-specified anchor points to speed-up
-     the alignment procedure for long sequences. Anchor points are given in
-     a file <seq_file>.anc where <seq_file> is the name of the sequence file 
-     (without extension .fa or .seq). Note that anchoring is possible for 
-     pairwise as well as for multiple alignment. The format of the .anc file 
-     is as follows (each line represents one anchor point): 
-    
-
-       2 5 13724 7646 23  23.45345   
-       1 3  6596  517  5  12.34555 
-       3 5 33511 9438 34  27.45459  
- 
-     The first two columns are the sequences to be anchored, columns 3 
-     and 4 contain the beginning positions of the anchored segments in 
-     the specified sequences, and column 5 contains a score of the
-     anchor that specifies its priority compared to other anchoring 
-     regions in case there is a conflict between inconsistent anchor
-     points (see below).  
-
-     In the above example, three anchored segment pairs are specified.  
-     Here, 13724 is the beginning position of the first anchor in sequence 2, 
-     7646 is the beginning position of the first anchor in sequence 5 and
-     23 is the length of the first anchor. In other words, the program is
-     forced to align positions 13724 - 13746 in sequence 2 with positions
-     7646 - 7668 in sequence 5. Similarly, a segment of sequence 1 starting
-     at position  6596 is anchored with a segment of sequence 3 starting
-     at position 517 etc.   
-
-     The program can use only consistent sets of anchor points. This means,
-     that all anchored regions must fit into one single multiple alignment 
-     (see our papers for our notion of "consistency"). The anchor points 
-     in the specified file are sorted according to their scores (as given
-     in the last column of the anchor file) and then accepted one-by-one 
-     -- provided they are consistent with the already accepted anchor points.   
-
-     This is exactly the way, dialign includes fragments (segment pairs
-     or "diagonals") into a resulting multiple alignment, see the dialign
-     papers for more details. 
-
-     Anchor points can be created by any suitable software program, 
-     for example by CHAOS developed by Mike Brudno, Stanford:
-      
-           http://www.stanford.edu/~brudno/chaos/    
-
-
-    (9) `Motif' option:
-
-    A motif can be specified by a simple regular expression such as "TY[ILV]A".
-    Gaps are not allowed in motifs; all residues within brackets are allowed 
-    at the respective position. For example, "TYIA", "TYLA" and "TYVA" would 
-    match the above motif. Alignments where instances of the motif are aligned 
-    to each other, are preferred. They receive a bonus which can be specified 
-    by the user. There are two paramters to determine the bonus for matched 
-    motifs: a first weighting factor (fct1) assigns a bonus for aligned 
-    instances of the motif occurring at the same relative position in the 
-    input sequences. The bonus decreases with the distance between the 
-    matched motif in the sequences. A second parameter (fct2) controls how fast 
-    the bonus decreases.  
-
-    With the two user-defined parameters fct1 and fct2, the bonus for each 
-    matched motif is calculated as follows: If a matched motif occurs at
-    positions i and j in two of the input sequences, |i-j| is the `offset'
-    of the motif. The bonus is then 
-
-       fct1 * exp - ( |i-j|^2 / (fct2^2 * 10 ) )  
-
-    I.e. a high value of fct2 means that even matches of the motif that are
-    far apart within the sequences reveive a high bonus. 
-
-    With the motif-search option, the program call is:
-
-      ./dialign2-2 [para] -mot <regex> <fct1> <fct2> [para] <seq>
-
-    where
-      <regex>  is a regular expression, e.g. "AT[CG]XT",
-      <fct1>    is the first parameter 
-      <fct2>    is the second parameter  
-      <seq>    is the input sequence file and
-      [para]   are (optional) additional program parameters
-
-
-Similarity Matrix:
-
-DIALIGN 2 uses the BLOSUM62 amino acid substitution matrix. In the current 
-version, it is NOT possible to replace BLOSUM62 by other similarity matrices,
-since the probability values contained in the files n_prob and p_prob refer 
-to the BLOSUM62 matrix. 
-
-
-
-                             Program Output: 
-
-By default, DIALIGN creates a single file containing
-
-    - An alignment of the input sequences in DIALIGN format. 
-    - The same alignment in FASTA format. 
-    - A sequence tree in PHYLIP format. This tree is constructed by applying 
-      the UPGMA clustering method to the DIALIGN similarity scores. It roughly 
-      reflects the different degrees of similarity among sequences. For 
-      detailed phylogenetic analysis, we recommend the usual methods for 
-      phylogenetic reconstruction. 
-
-
-This is the DIALIGN alignment format: 
-
-
-
-SMb21199_AA-       1   mtemkdsila vrglkvdfyt pd-GTVE-AV KGIDLDVRSG ETLAVVGESG
-SMb21206_AA-       1   mpapatepgt apfVRLTGVT KRFGTARpAL DAVAGEIFGG RVTGLVGPDG
-SMb21592_AA-       1   mtlq------ ---IELNGVN KFYGSYH-AL KDIDLAIEEG TFVALVGPSG
-SMb21605_AA-       1   msg------- ---IKLTGVS KSFGAVK-VI HGVDIEIGQG EFAVFVGPSG
-
-                       0000000000 0000000000 0002222022 2222233356 6666666666
-
-
-SMb21199_AA-      49   SGKSQTMMGI MGLLakngtv tgsaryrgqe lvgLAPKALN KVRGS-KITM
-SMb21206_AA-      51   AGKTTLIRLM TGLMLPDAGT IE-------- ---VLGydtr rdpasiQAAI
-SMb21592_AA-      41   CGKSTLLRSL AGLEKISAGE MK-------- ---IAGARMN DVPPR-KRDV
-SMb21605_AA-      40   CGKSTLLRMI AGLEETTGGE IR-------- ---Idaedvt hkePS-KRGV
-
-                       6666666666 6664333333 3300000000 0003110000 0001102222
-
-
-SMb21199_AA-      98   IFQEPMTSLD PLYTIGRQIA EPIvhhRGGS FKEA---RRR VLELLELVGI
-SMb21206_AA-      90   GYMPQRFGLY EDLSVQENLD LYADL-RGLP KTER---SRT FGELLDFTDL
-SMb21592_AA-      79   AMVFQSYALY PHMTVEENLT YSLRI-RGVK KAEA---LKA AAEVATTTGL
-SMb21605_AA-      78   AMVFQSYALY PHLSVFDNMA FSLSI-ARRP KAEieqkVKA AAEIlrlsdy
-
-                       2222222222 2222222222 2222202222 2220000000 0000000000
-
-
-
-     Names of aligned sequences are shown on the left hand side of the 
-     alignment. 
-     
-     Numbers on the left hand side of the alignment denote the position 
-     of the first residue in a line within the respective sequence. 
-     
-     Capital letters denote aligned residues, i.e. residues involved in 
-     at least one of the fragments the alignment consists of. Lower-case
-     letters denote residues not belonging to any of these selected 
-     fragments. They are not considered to be aligned by DIALIGN. Thus, 
-     if a lower-case letter is standing in the same column with other letters,
-     this is pure chance; these residues are not considered to be homologous. 
-
-     Numbers below the alignment reflects the degree of local similarity 
-     among sequences. More precisely: They represent the sum of `weights' 
-     of fragments connecting residues at the respective position.
-
-     These numbers are normalized such that regions of maximum similarity 
-     always get a score of 9 - no matter how strong this maximum simliarity 
-     is. 
-
-
-
-This is FASTA alignment format: 
-
-
->HTL2
-ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
-QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
--------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
-NEYTDSLILApl--------------------------------------
-----------
->MMLV
-pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
-QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
-GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
-NRMADQAARKAAITETPDTStll---------------------------
-----------
->HEPB
-rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
---AELLAACFArsrsgan---IIGTDN-----------------------
--------------SVVLSR--------------KYTSFPWLLGCAANWI-
-LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
-vpshlpdrvh
-
-
-
-This is PHYLIP tree format: 
-
- 
-((HTL2:0.111024,
-(MMLV:0.078471,
-ECOL:0.078471):0.032554):0.121218,
-HEPB:0.232242);
-
-
-
-Trees can be visualized using the treetool program that is part of 
-Joe Felsenstein's PHYLIP software package:
-
-   http://evolution.genetics.washington.edu/phylip.html
-
-
----------------------------------------------------------------------
-
-Last update by Burkhard Morgenstern, Goettingen, February 2005  
-
-
-
-
-
-