[med-svn] r13007 - in trunk/packages/meme/trunk/debian: . meme_manpages
Andreas Tille
tille at alioth.debian.org
Thu Feb 14 13:58:05 UTC 2013
Author: tille
Date: 2013-02-14 13:58:05 +0000 (Thu, 14 Feb 2013)
New Revision: 13007
Added:
trunk/packages/meme/trunk/debian/meme_manpages/mast.1
Removed:
trunk/packages/meme/trunk/debian/mast_manual.txt
trunk/packages/meme/trunk/debian/meme_manual.txt
Log:
Remove unneeded text copies of manuals (meme.1 existed, mast.1 just written)
Deleted: trunk/packages/meme/trunk/debian/mast_manual.txt
===================================================================
--- trunk/packages/meme/trunk/debian/mast_manual.txt 2013-02-14 12:38:03 UTC (rev 13006)
+++ trunk/packages/meme/trunk/debian/mast_manual.txt 2013-02-14 13:58:05 UTC (rev 13007)
@@ -1,413 +0,0 @@
-USAGE:
- mast <mfile> [optional arguments ...]
-
- <mfile> file containing motifs to use; may be a MEME output
- file or a file with the format given below
- [<database>] or
- [-d <database>] database to search with motifs or
- [-stdin] read database from standard input;
- Default: reads database specified inside <mfile>
- [-c <count>] only use the first <count> motifs
- [-a <alphabet>] <mfile> is assumed to contain motifs in the
- format output by bin/make_logodds
- and <alphabet> is their alphabet; -d <database>
- or -stdin must be specified when this option is used
- [-stdout] print output to standard output instead of file
- [-text] output in text (ASCII) format;
- (default: hypertext (HTML) format)
-
- [-sep] score reverse complement DNA strand as a separate
- sequence
- [-norc] do not score reverse complement DNA strand
- [-dna] translate DNA sequences to protein
- [-comp] adjust p-values and E-values for sequence composition
- [-rank <rank>] print results starting with <rank> best (default: 1)
- [-smax <smax>] print results for no more than <smax> sequences
- (default: all)
- [-ev <ev>] print results for sequences with E-value < <ev>
- (default: 10)
- [-mt <mt>] show motif matches with p-value < mt (default: 0.0001)
- [-w] show weak matches (mt<p-value<mt*10) in angle brackets
- [-bfile <bfile>] read background frequencies from <bfile>
- [-seqp] use SEQUENCE p-values for motif thresholds
- (default: use POSITION p-values)
- [-mf <mf>] print <mf> as motif file name
- [-df <df>] print <df> as database name
- [-minseqs <minseqs>] lower bound on number of sequences in db
- [-mev <mev>]+ use only motifs with E-values less than <mev>
- [-m <m>]+ use only motif(s) number <m> (overrides -mev)
- [-diag <diag>] nominal order and spacing of motifs
- [-best] include only the best motif in diagrams
- [-remcorr] remove highly correlated motifs from query
- [-brief] brief output--do not print documentation
- [-b] print only sections I and II
- [-nostatus] do not print progress report
- [-hit_list] print hit_list instead of diagram; implies -text
-
-
- MAST: Motif Alignment and Search Tool
-
- MAST is a tool for searching biological sequence databases for sequences
- that contain one or more of a group of known motifs.
-
- A motif is a sequence pattern that occurs repeatedly in a group of related
- protein or DNA sequences. Motifs are represented as position-dependent
- scoring matrices that describe the score of each possible letter at each
- position in the pattern. Individual motifs may not contain gaps. Patterns with
- variable-length gaps must be split into two or more separate motifs before
- being submitted as input to MAST.
-
- MAST takes as input a file containing the descriptions of one or more motifs
- and searches a sequence database that you select for sequences that match
- the motifs. The motif file can be the output of the MEME motif discovery tool
- or any file in the appropriate format.
-
- MAST outputs three things:
-
- 1. The names of the high-scoring sequences sorted by the strength of the
- combined match of the sequence to all of the motifs in the group.
- 2. Motif diagrams showing the order and spacing of the motifs within each
- matching sequence.
- 3. Detailed annotation of each matching sequence showing the sequence
- and the locations and strengths of matches to the motifs.
-
- MAST works by calculating match scores for each sequence in the database
- compared with each of the motifs in the group of motifs you provide. For each
- sequence, the match scores are converted into various types of p-values and
- these are used to determine the overall match of the sequence to the group of
- motifs and the probable order and spacing of occurrences of the motifs in the
- sequence.
-
- MAST outputs a file containing:
-
- * the version of MAST and the date it was built,
- * the reference to cite if you use MAST in your research,
- * a description of the database and motifs used in the search,
- * an explanation of the results,
- * high-scoring sequences--sequences matching the group of motifs
- above a stated level of statistical significance,
- * motif diagrams showing the order and spacing of occurrences of the
- motifs in the high-scoring sequences and
- * annotated sequences showing the positions and p-values of all motif
- occurrences in each of the high-scoring sequences.
-
- Each section of the results file contains an explanation of how to interpret
- them.
-
- Match Scores
-
- The match score of a motif to a position in a sequence is the sum of the
- score from each column of the position-dependent scoring matrix
- corresponding to the letter at that position in the sequence. For example, if
- the sequence is
-
- TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
- ========
-
- and the motif is represented by the position-dependent scoring matrix (where
- each row of the matrix corresponds to a position in the motif)
-
- =========|=================================
- POSITION | A C G T
- =========|=================================
- 1 | 1.447 0.188 -4.025 -4.095
- 2 | 0.739 1.339 -3.945 -2.325
- 3 | 1.764 -3.562 -4.197 -3.895
- 4 | 1.574 -3.784 -1.594 -1.994
- 5 | 1.602 -3.935 -4.054 -1.370
- 6 | 0.797 -3.647 -0.814 0.215
- 7 |-1.280 1.873 -0.607 -1.933
- 8 |-3.076 1.035 1.414 -3.913
- =========|=================================
-
- then the match score of the fourth position in the sequence (underlined)
- would be found by summing the score for T in position 1, G in position 2 and
- so on until G in position 8. So the match score would be
-
- score = -4.095 + -3.945 + -3.895 + -1.994
- + -4.054 + -0.814 + -1.933 + 1.414
- = -19.316
-
- The match scores for other positions in the sequence are calculated in the
- same way. Match scores are only calculated if the match completely fits within
- the sequence. Match scores are not calculated if the motif would overhang
- either end of the sequence.
-
- P-values
-
- MAST reports all matches of a sequence to a motif or group of motifs in terms
- of the p-value of the match. MAST considers the p-values of four types of
- events:
-
- position p-value: the match of a single position within a sequence to
- a given motif,
- sequence p-value: the best match of any position within a sequence
- to a given motif,
- combined p-value: the combined best matches of a sequence to a
- group of motifs, and
- E-value: observing a combined p-value at least as small in a random
- database of the same size.
-
- All p-values are based on a random sequence model that assumes each
- position in a random sequence is generated according to the average letter
- frequencies of all sequences in the the appropriate (peptide or nucleotide)
- non-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/) on September 22,
- 1996. This can be overridden in two ways:
-
- 1) -bfile <bfile>
- The random model uses the letter frequencies given in <bfile>
- instead of the non-redundant database frequencies.
- The format of <bfile> is the same as that for the MEME -bfile opton;
- see the MEME documentation for details. Sample files are given in
- directory tests: tests/nt.freq and tests/na.freq.)
-
- 2) -comp
- The random model uses the letter frequencies in the current target
- sequence instead of the non-redundant database frequencies. This
- causes p-values and E-values to be compensated individually for the
- actual composition of each sequence in the database. This option
- can increase search time substantially due to the need to compute
- a different score distribution for each high-scoring sequence.
-
-
- Position p-value
-
- The p-value of a match of a given position within a sequence to a
- motif is defined as the probability of a randomly selected position in a
- randomly generated sequence having a match score at least as large
- as that of the given position.
-
- Sequence p-value
-
- The p-value of a match of a sequence to a motif is defined as the
- probability of a randomly generated sequence of the same length
- having a match score at least as large as the largest match score of
- any position in the sequence.
-
- Combined p-value
-
- The p-value of a match of a sequence to a group of motifs is defined
- as the probability of a randomly generated sequence of the same
- length having sequence p-values whose product is at least as small
- as the product of the sequence p-values of the matches of the motifs
- to the given sequence.
-
- E-value
-
- The E-value of the match of a sequence in a database to a a group
- of motifs is defined as the expected number of sequences in a random
- database of the same size that would match the motifs as well as the
- sequence does and is equal to the combined p-value of the sequence
- times the number of sequences in the database.
-
- High-scoring Sequences
-
- MAST lists the names and part of the descriptive text of all sequences
- whose E-value is less than E. Sequences shorter than one or more of the
- motifs are skipped. The sequences are sorted by increasing E-value. The
- value of E is set to 10 for the WEB server but is user-selectable in the
- down-loadable version of MAST.
-
- Motif Diagrams
-
- Motif diagrams show the order and spacing of non-overlapping matches to
- the motifs in each high-scoring sequence. Motif occurrences are determined
- based on the position p-value of matches to the motif. Strong matches
- (p-value < M) are shown in square brackets (`[ ]'), weak matches (M <
- p-value < M × 10) are shown in angle brackets (`< >') and the length of
- non-motif sequence ("spacer") is shown between dashes (`-'). For example,
-
- 27-[3]-44-<4>-99-[1]-7
-
- shows an initial spacer of length 27, followed by a strong match to motif 3, a
- spacer of length 44, a weak match to motif 4, a spacer of length 99, a strong
- match to motif 1 and a final non-motif sequence of length 7. The value of M is
- 0.0001 for the WEB server but is user-selectable in the down-loadable
- version of MAST.
-
- Note: If you specify the -hit_list switch to MAST, the motif "diagram" takes the form
- of a comma separated list of motif occurrences ("hits"). Each "hit" has the format:
- <strand><motif> <start> <end> <p-value>
- where
- <strand> is the strand (+ or - for DNA, blank for protein),
- <motif> is the motif number,
- <start> is the starting position of the hit,
- <end> is the ending position of the hit, and
- <p-value> is the position p-value of the hit.
-
- Annotated Sequences
-
- MAST annotates each high-scoring sequence by printing the sequence
- along with the position and strength of all the non-overlapping motif
- occurrences. The four lines above each motif occurrence contain,
- respectively,
-
- the motif number of the occurrence,
- the position p-value of the occurence,
- the best possible match to the motif, and
- a plus sign (`+') above each letter in the occurrence that has a positive
- match score to the motif.
-
- The best possible match to a motif is the sequence of letters which would
- acheive the highest match score.
-
-
- MOTIF FORMAT
-
- MAST can search using (multiple) motifs contained in
-
- a MEME output file,
- a GCG profile file,
- two or more GCG profile filess concatenated together, or
- a file with the following format.
-
- Motif file format
-
- ALPHABET= alphabet
- log-odds matrix: alength= alength w= w
- row_1
- row_2
- ...
- row_w
-
-
-
- A motif is represented by a position-dependent scoring matrix.
- A scoring matrix is preceded by a line starting with the words
- log-odds matrix: and specifying alength, the length of
- the alphabet (number of columns in the scoring matrix), and the w, the
- width of the motif (number of rows in the scoring matrix).
- The following w lines (no blank lines allowed) contain the rows of the
- scoring matrix. Row i, column j of the matrix gives the score for the j-th
- letter in alphabet appearing at position i in an occurrence of the
- motif.
- The spaces after the equals signs and the colon are required.
- The number of letters in alphabet must equal alength.
- Any number of additional motifs may follow the first one.
- The motif file must contain a line starting with
-
- ALPHABET=
-
- followed by alphabet, a list containing the letters used in the motifs.
- The order of the letters in alphabet must be the same as the order of the
- columns of scores in the motifs. The order need not be alphabetical
- and case does not matter, but there should be no spaces in alphabet.
- The letters in alphabet must be a subset of either the IUB/IUPAC DNA
- (ABCDGHKMNRSTUVWY) or protein
- (ABCDEFGHIKLMNPQRSTUVWXYZ) alphabets. DNA alphabets
- must contain at least the letters ACGT. Protein alphabets must contain
- at least the letters ACDEFGHIKLMNPQRSTVWY. All other letters in
- the alphabets are optional. If any of the optional letters are missing
- from alphabet, MAST automatically generates scores for them by taking the
- weighted average of the scores for the letters which the missing letter
- could match. (The weights are the frequencies of the replaced letters in
- the appropriate non-redundant database.) Replacements for the
- optional letters are given in the following table.
-
- LETTERS MATCHED BY OPTIONAL LETTERS
- =================================================
- optional matches
- letter DNA protein
- =================================================
- B CGT DN
- D AGT
- H ACT
- K GT
- M AC
- N ACGT
- R AG
- S CG
- U T ACDEFGHIKLMNPQRSTVWY
- V CAG
- W AT
- X ACDEFGHIKLMNPQRSTVWY
- Y CT
- Z EQ
- * ACGT ACDEFGHIKLMNPQRSTVWY
- - ACGT ACDEFGHIKLMNPQRSTVWY
- =================================================
-
-
- EXAMPLE
-
- Here is an example of a DNA motif file that contains two motifs.
-
- Sample motif file
-
- ALPHABET= ACGT
- log-odds matrix: alength= 4 w= 9
- -4.275 -0.182 -4.195 1.408
- -4.296 -1.487 1.880 -0.816
- -2.160 -1.492 -4.171 1.474
- -0.810 -4.076 1.872 -2.164
- 1.537 -1.487 -4.195 -4.205
- 0.113 0.340 -0.237 -0.209
- -0.454 0.923 0.390 -0.834
- -1.336 -0.082 0.905 0.100
- 0.674 -4.183 0.130 -0.201
- log-odds matrix: alength= 4 w= 6
- -2.032 0.324 1.371 -0.781
- -0.409 0.560 -0.250 0.119
- -4.274 -0.519 -0.260 1.167
- -2.188 2.300 -4.191 -2.465
- 1.265 -4.111 -0.267 -2.180
- -1.977 2.158 -1.661 -2.071
-
-
-
- In the example above, because the order of the letters in alphabet is
- ACGT, the first column of each motif gives the scores for the letter A at each
- position in the motif, the second column gives the scores for C and so forth.
-
- Note: If -d <database> is not given, MAST looks for database
- specified inside of <mfile>
-
- Creates file (unless [-stdout] given) after stripping ".html" from the end of
- <mfile>:
- mast.<mfile>[.<database>][.c<count>][.m<motif>]+[.rank<rank>][.ev<ev>][.mt<mt>][.b]
-
- EXAMPLES:
-
- The following examples assume that file "meme.results" is the
- output of a MEME run containing at least 3 motifs and file
- SwissProt is a copy of the Swiss-Prot database on your local disk.
- DNA_DB is a copy of a DNA database on your local disk.
-
- 1) Annotate the training set:
-
- mast meme.results
-
- 2) Find sequences matching the motif and annotate them in
- the SwissProt database:
-
- mast meme.results -d SwissProt
-
- 3) Show sequences with weaker combined matches to motifs.
-
- mast meme.results -d SwissProt -ev 200
-
- 4) Indicate weaker matches to single motifs in the annotation so
- that sequences with weak matches to the motifs (but perhaps with
- the "correct" order and spacing) can be seen:
-
- mast meme.results -d SwissProt -w
-
- 5) Include a nominal order and spacing of the first three motifs
- in the calculation of the sequence p-values to increase the
- sensitivity of the search for matching sequences:
-
- mast meme.results -d SwissProt -diag "9-[2]-61-[1]-62-[3]-91"
-
- 6) Use only the first and third motifs in the search:
-
- mast meme.results -d SwissProt -m 1 -m 3
-
- 7) Use only the first two motifs in the search:
-
- mast meme.results -d SwissProt -c 2
-
- 8) Search DNA sequences using protein motifs, adjusting p-values and E-values
- for each sequence by that sequence's composition:
-
- mast meme.results -d DNA_DB -dna -comp
-
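The match-score arithmetic in the manual text above can be checked with a short script. This is only a minimal sketch and not part of MAST: the names `ALPHABET`, `MATRIX`, `SEQUENCE` and `match_score` are hypothetical, while the matrix values and the sequence are copied from the manual's example. It sums one matrix entry per motif position and returns nothing when the motif would overhang the sequence.

```python
# Position-dependent scoring matrix from the manual's example.
# Rows are motif positions 1..8; columns follow the order of ALPHABET.
ALPHABET = "ACGT"
MATRIX = [
    [1.447, 0.188, -4.025, -4.095],
    [0.739, 1.339, -3.945, -2.325],
    [1.764, -3.562, -4.197, -3.895],
    [1.574, -3.784, -1.594, -1.994],
    [1.602, -3.935, -4.054, -1.370],
    [0.797, -3.647, -0.814, 0.215],
    [-1.280, 1.873, -0.607, -1.933],
    [-3.076, 1.035, 1.414, -3.913],
]
SEQUENCE = "TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC"


def match_score(sequence, start, matrix=MATRIX, alphabet=ALPHABET):
    """Return the match score of the motif placed at 1-based position `start`.

    Returns None when the motif would overhang either end of the sequence,
    mirroring the rule that such positions are not scored.
    """
    width = len(matrix)
    begin = start - 1  # convert to a 0-based index
    if begin < 0 or begin + width > len(sequence):
        return None
    return sum(
        matrix[i][alphabet.index(sequence[begin + i])] for i in range(width)
    )


if __name__ == "__main__":
    # Fourth position of the example sequence (the underlined window).
    print(f"{match_score(SEQUENCE, 4):.3f}")  # prints -19.316
```

Called with `start=4`, the function reads the window TGTTGGTG and reproduces the total of -19.316 given in the manual.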
Added: trunk/packages/meme/trunk/debian/meme_manpages/mast.1
===================================================================
--- trunk/packages/meme/trunk/debian/meme_manpages/mast.1 (rev 0)
+++ trunk/packages/meme/trunk/debian/meme_manpages/mast.1 2013-02-14 13:58:05 UTC (rev 13007)
@@ -0,0 +1,475 @@
+.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.40.10.
+.TH MAST "1" "February 2013" "Motif Alignment and Search Tool" "User Commands"
+.SH NAME
+MAST \- Motif Alignment and Search Tool
+.SH SYNOPSIS
+.B mast <motif file> <sequence file>
+[\fIoptions\fR]
+.SH DESCRIPTION
+MAST: Motif Alignment and Search Tool
+.SS
+Inputs
+.TP
+\fB<motif file>\fR
+file containing motifs to use; normally a MEME output file
+.TP
+\fB<sequence file>\fR
+search sequences in FASTA\-formatted database with motifs
+.TP
+\fB\-bfile <file>\fR
+read background frequencies from <file>
+.TP
+\fB\-dblist\fR
+read the <sequence file> as a list of FASTA\-formatted databases
+.SS
+Outputs
+.TP
+\fB\-o <dir>\fR
+directory to output mast results; directory must not exist
+.TP
+\fB\-oc <dir>\fR
+directory to output mast results with overwriting allowed
+.TP
+\fB\-hit_list\fR
+print a machine\-readable list of all hits only; outputs to standard out and overrides \fB\-seqp\fR
+.SS
+Which Motifs To Use
+.TP
+\fB\-remcorr\fR
+remove highly correlated motifs from query
+.TP
+\fB\-m <m>+\fR
+use only motif number \fB<m>\fR (overrides \fB\-mev\fR); this can be
+repeated to select multiple motifs
+.TP
+\fB\-c <count>\fR
+only use the first \fB<count>\fR motifs or all motifs when \fB<count>\fR is zero (default: 0)
+.TP
+\fB\-mev <mev>\fR
+use only motifs with E\-values less than \fB<mev>\fR
+.TP
+\fB\-diag <diag>\fR
+nominal order and spacing of motifs is specified by \fB<diag>\fR which is a block diagram
+.SS
+DNA\-Only Options
+.TP
+\fB\-norc\fR
+do not score reverse complement DNA strand
+.TP
+\fB\-sep\fR
+score reverse complement DNA strand as a separate sequence
+.TP
+\fB\-dna\fR
+translate DNA sequences to protein; motifs must be protein; sequences must be DNA
+.TP
+\fB\-comp\fR
+adjust p\-values and E\-values for sequence composition
+.SS
+Which Results To Print
+.TP
+\fB\-ev <ev>\fR
+print results for sequences with E\-value < \fB<ev>\fR (default: 10)
+.SS
+Appearance Of Block Diagrams
+.TP
+\fB\-mt <mt>\fR
+show motif matches with p\-value < \fB<mt>\fR (default: 0.0001)
+.TP
+\fB\-w\fR
+show weak matches (\fB<mt>\fR < p\-value < \fB<mt>\fR*10) in angle brackets in the hit list or when the xml is converted to text
+.TP
+\fB\-best\fR
+include only the best motif hits in \fB\-hit_list\fR diagrams
+.TP
+\fB\-seqp\fR
+use SEQUENCE p\-values for motif thresholds (default: use POSITION p\-values)
+.SS
+Miscellaneous
+.TP
+\fB\-mf <mf>\fR
+in results use \fB<mf>\fR as motif file name
+.TP
+\fB\-df <df>\fR
+in results use \fB<df>\fR as database name (ignored when \fB\-dblist\fR)
+.TP
+\fB\-dl <dl>\fR
+in results use \fB<dl>\fR as link to search sequence names; token
+SEQUENCEID is replaced with the FASTA sequence ID; ignored when \fB\-dblist\fR is given
+.TP
+\fB\-minseqs <ms>\fR
+lower bound on number of sequences in db
+.TP
+\fB\-nostatus\fR
+do not print progress report
+.TP
+\fB\-notext\fR
+do not create text output
+.TP
+\fB\-nohtml\fR
+do not create html output
+.SS
+Description
+.P
+MAST is a tool for searching biological sequence databases for
+sequences that contain one or more of a group of known motifs.
+.PP
+A motif is a sequence pattern that occurs repeatedly in a group of
+related protein or DNA sequences. Motifs are represented as
+position\-dependent scoring matrices that describe the score of each
+possible letter at each position in the pattern. Individual motifs may
+not contain gaps. Patterns with variable\-length gaps must be split into
+two or more separate motifs before being submitted as input to MAST.
+.PP
+MAST takes as input a file containing the descriptions of one or more
+motifs and searches a sequence database that you select for sequences
+that match the motifs. The motif file can be the output of the MEME
+motif discovery tool or any file in the appropriate format.
+.PP
+MAST outputs an xml file which can then be converted into html or text
+format. The xml file is designed for machine processing and the html
+file is designed for human viewing. The text format is available for
+backwards compatibility; however, because the xml was optimised for
+html generation, the text output for separate scoring mode is not
+identical to the earlier text output and some options were removed. The
+text format will be unsupported in future releases, so we recommend that
+you migrate any programs reading mast output to the xml format.
+.PP
+MAST outputs three things:
+.IP
+1. The names of the high\-scoring sequences sorted by the strength of
+the combined match of the sequence to all of the motifs in the
+group.
+.IP
+2. Motif diagrams showing the order and spacing of the motifs within
+each matching sequence.
+.IP
+3. Detailed annotation of each matching sequence showing the sequence
+and the locations and strengths of matches to the motifs.
+.PP
+MAST works by calculating match scores for each sequence in the
+database compared with each of the motifs in the group of motifs you
+provide. For each sequence, the match scores are converted into various
+types of p\-values and these are used to determine the overall match of
+the sequence to the group of motifs and the probable order and spacing
+of occurrences of the motifs in the sequence.
+.PP
+MAST generates a human readable file from the xml output containing:
+.IP
+* the version of MAST and the date it was built,
+.IP
+* the reference to cite if you use MAST in your research,
+.IP
+* a description of the databases and motifs used in the search,
+.IP
+* an explanation of the result,
+.IP
+* the identifier and score of each sequence matching the group of
+motifs above a stated level of statistical significance, sorted by score,
+.IP
+* motif diagrams showing the order and spacing of occurrences of the
+motifs in the significant sequences, and
+.IP
+* annotated sequences showing the positions and p\-values of all motif
+occurrences in each of the high\-scoring sequences.
+.PP
+The html version is the recommended version for human reading and has
+all sections documented; however, the text version has no documentation
+for the first section. That section lists each motif along with the
+sequence that would achieve the best possible match score. In order to
+avoid biased scores when multiple motif scores are combined, MAST also
+computes the pairwise correlations between each pair of motifs. The
+correlation between two motifs is the maximum sum of Pearson's
+correlation coefficients for aligned columns divided by the width of
+the shorter motif. The maximum is found by trying all alignments of the
+two motifs. Motifs with correlations below 0.60 have little effect on
+the accuracy of the combined scores. Pairs of motifs with higher
+correlations should be removed from the query.
+.SS
+Match Scores
+.PP
+The match score of a motif to a position in a sequence is the sum of
+the score from each column of the position\-dependent scoring matrix
+corresponding to the letter at that position in the sequence. For
+example, if the sequence is
+.IP
+TAATGTTGGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGC
+.IP
+========
+.PP
+and the motif is represented by the position\-dependent scoring matrix
+(where each row of the matrix corresponds to a position in the motif)
+.TP
+Position
+A C G T
+.TP
+1
+1.447 0.188 \-4.025 \-4.095
+.TP
+2
+0.739 1.339 \-3.945 \-2.325
+.TP
+3
+1.764 \-3.562 \-4.197 \-3.895
+.TP
+4
+1.574 \-3.784 \-1.594 \-1.994
+.TP
+5
+1.602 \-3.935 \-4.054 \-1.370
+.TP
+6
+0.797 \-3.647 \-0.814 0.215
+.TP
+7
+\-1.280 1.873 \-0.607 \-1.933
+.TP
+8
+\-3.076 1.035 1.414 \-3.913
+.PP
+then the match score of the fourth position in the sequence
+(underlined) would be found by summing the score for T in position 1, G
+in position 2 and so on until G in position 8. So the match score would
+be
+.IP
+score = \-4.095 + \-3.945 + \-3.895 + \-1.994
+.IP
++ \-4.054 + \-0.814 + \-1.933 + 1.414
+.IP
+= \-19.316
+.PP
+The match scores for other positions in the sequence are calculated in
+the same way. Match scores are only calculated if the match completely
+fits within the sequence. Match scores are not calculated if the motif
+would overhang either end of the sequence.
+.SS
+P\-values
+.PP
+MAST reports all matches of a sequence to a motif or group of motifs in
+terms of the p\-value of the match. MAST considers the p\-values of four
+types of events:
+.IP
+* position p\-value: the match of a single position within a sequence
+to a given motif,
+.IP
+* sequence p\-value: the best match of any position within a sequence
+to a given motif,
+.IP
+* combined p\-value: the combined best matches of a sequence to a
+group of motifs, and
+.IP
+* E\-value: observing a combined p\-value at least as small in a random
+database of the same size.
+.PP
+All p\-values are based on a random sequence model that assumes each
+position in a random sequence is generated according to the average
+letter frequencies of all sequences in the appropriate (peptide or
+nucleotide) non\-redundant database (ftp://ncbi.nlm.nih.gov/blast/db/)
+on September 22, 1996. This can be overridden by specifying the \fB\-bfile\fR
+or \fB\-comp\fR options (see below). For DNA sequences, unless \fB\-norc\fR is given,
+the positive and reverse complement strand frequencies are averaged
+together.
+.IP
+1. \fB\-bfile\fR <bfile> The random model uses the letter frequencies given
+in <bfile> instead of the non\-redundant database frequencies. The
+format of <bfile> is the same as that for the MEME \fB\-bfile\fR option;
+see the MEME documentation for details. You can create files in the
+appropriate format based on the base/residue composition of your
+own FASTA sequence files using the command "fasta\-get\-markov"
+included in the MEME distribution. Type fasta\-get\-markov on the
+command line for documentation. (Sample files are also given in
+directory tests: tests/nt.freq and tests/na.freq.)
+.IP
+2. \fB\-comp\fR The random model uses the letter frequencies in the current
+target sequence instead of the non\-redundant database frequencies.
+This causes p\-values and E\-values to be compensated individually
+for the actual composition of each sequence in the database. This
+option can increase search time substantially due to the need to
+compute a different score distribution for each high\-scoring
+sequence. With this option and DNA sequences, the positive and
+reverse complement strand frequencies are not averaged together.
+.SS
+Position p\-value
+.PP
+The p\-value of a match of a given position within a sequence to a motif
+is defined as the probability of a randomly selected position in a
+randomly generated sequence having a match score at least as large as
+that of the given position. Note: If MAST is combining reverse
+complement DNA strands, the position p\-value is not corrected for
+multiple tests.
+.SS
+Sequence p\-value
+.PP
+The p\-value of a match of a sequence to a motif is defined as the
+probability of a randomly generated sequence of the same length having
+a match score at least as large as the largest match score of any
+position in the sequence.
+.SS
+Combined p\-value
+.PP
+The p\-value of a match of a sequence to a group of motifs is defined as
+the probability of a randomly generated sequence of the same length
+having sequence p\-values whose product is at least as small as the
+product of the sequence p\-values of the matches of the motifs to the
+given sequence.
+.SS
+E\-value
+.PP
+The E\-value of the match of a sequence in a database to a group of
+motifs is defined as the expected number of sequences in a random
+database of the same size that would match the motifs as well as the
+sequence does and is equal to the combined p\-value of the sequence
+times the number of sequences in the database.
+.SS
+High\-scoring Sequences
+.PP
+MAST lists the names and part of the descriptive text of all sequences
+whose E\-value is less than E. Sequences shorter than one or more of the
+motifs are skipped. The sequences are sorted by increasing E\-value. The
+value of E is set to 10 for the WEB server but is user\-selectable in
+the downloadable version of MAST.
+.SS
+Motif Diagrams
+.PP
+Motif diagrams show the order and spacing of non\-overlapping matches to
+the motifs in each high\-scoring sequence. Motif occurrences are
+determined based on the position p\-value of matches to the motif.
+Strong matches (p\-value < M) are shown in square brackets (`[ ]'), weak
+matches (M < p\-value < M x 10) are shown in angle brackets (`< >') and
+the length of non\-motif sequence ("spacer") is shown between
+underscores (`_'). For example,
+.IP
+27_[3]_44_<4>_99_[1]_7
+.PP
+shows an initial spacer of length 27, followed by a strong match to
+motif 3, a spacer of length 44, a weak match to motif 4, a spacer of
+length 99, a strong match to motif 1 and a final non\-motif sequence of
+length 7. The value of M is 0.0001 for the WEB server but is
+user\-selectable in the downloadable version of MAST.
+.SS
+Annotated Sequences
+.PP
+MAST annotates each high\-scoring sequence by printing the sequence
+along with the position and strength of all the non\-overlapping motif
+occurrences. The four lines above each motif occurrence contain,
+respectively,
+.IP
+* the motif number of the occurrence,
+.IP
+* the position p\-value of the occurrence,
+.IP
+* the best possible match to the motif, and
+.IP
+* a plus sign (`+') above each letter in the occurrence that has a
+positive match score to the motif.
+.PP
+The best possible match to a motif is the sequence of letters which
+would achieve the highest match score.
+.SS
+Hit List
+.PP
+If you specify the \fB\-hit_list\fR switch to MAST, MAST outputs ONLY a list
+of "hits" in easily machine\-readable format. Each line corresponds to
+one motif occurrence in one sequence. The format of the hit lines is
+.IP
+[<sequence_name> <strand><motif> <start> <end> <score> <p\-value>]+
+.PP
+where
+.TP
+<sequence_name> is the name of the sequence containing the hit
+.TP
+<strand> is the strand (+ or \- for DNA, blank for protein),
+.TP
+<motif> is the motif number,
+.TP
+<start> is the starting position of the hit,
+.TP
+<end> is the ending position of the hit,
+.TP
+<score> is the score of the hit, and
+.TP
+<p\-value> is the position p\-value of the hit.
+.PP
+Two comment lines (starting with "#") are written above the list of
+hits, and the MAST command line is printed as a comment line after the
+list. An example of the output using the \fB\-hit_list\fR switch to MAST is:
+.IP
+# All non\-overlapping hits in all sequences.
+.IP
+# sequence_name motif hit_start hit_end score hit_p\-value
+.IP
+ce1cg \fB\-2\fR 8 22 1459.90 1.67e\-06
+.IP
+ara +2 2 16 1661.18 5.04e\-08
+.IP
+bglr1 +2 1 15 1274.97 1.42e\-05
+.IP
+cya \-2 19 33 1101.37 6.64e\-05
+.IP
+gale +2 5 19 1076.21 8.11e\-05
+.IP
+ilv \-2 6 20 1098.85 6.78e\-05
+.IP
+malk +2 37 51 1085.02 7.56e\-05
+.IP
+ompa +2 5 19 1583.18 2.43e\-07
+.IP
+# mast tests/meme/meme.crp0.oops tests/common/crp0.s \fB\-hit_list\fR \fB\-m\fR 2
+.SS
+Loading Multiple Sequence Databases
+.PP
+Multiple sequence databases can be loaded by MAST by putting the file
+names into a file and specifying that file instead of the sequence
+database with the option \fB\-dblist\fR.
+.PP
+The file list has one file name on each line with the optional name and
+link as follows:
+.IP
+<file> [<name> <link>]
+.IP
+\&...
+.IP
+\&...
+.PP
+If the name is specified, it is used instead of the file name
+in the output. If the link is specified, all sequences from that
+database in the HTML output are hyperlinked to the specified URL,
+with the text SEQUENCEID replaced by the FASTA sequence id.
+.SH
+EXAMPLES:
+.PP
+The following examples assume that file "meme.results" is the output of
+a MEME run containing at least 3 motifs which was created on the
+training set "training.fasta" and file SwissProt is a copy of the
+Swiss\-Prot database on your local disk. DNA_DB is a copy of a DNA
+database on your local disk.
+.IP
+1. Annotate the training set:
+.IP
+mast meme.results training.fasta
+.IP
+2. Find sequences matching the motif and annotate them in the
+SwissProt database:
+.IP
+mast meme.results SwissProt
+.IP
+3. Show sequences with weaker combined matches to motifs:
+.IP
+mast meme.results SwissProt \fB\-ev\fR 200
+.IP
+4. Include a nominal order and spacing of the first three motifs in
+the calculation of the sequence p\-values to increase the
+sensitivity of the search for matching sequences:
+.IP
+mast meme.results SwissProt \fB\-diag\fR "9\-[2]\-61\-[1]\-62\-[3]\-91"
+.IP
+5. Use only the first and third motifs in the search:
+.IP
+mast meme.results SwissProt \fB\-m\fR 1 \fB\-m\fR 3
+.IP
+6. Use only the first two motifs in the search:
+.IP
+mast meme.results SwissProt \fB\-c\fR 2
+.IP
+7. Search DNA sequences using protein motifs, adjusting p\-values and
+E\-values for each sequence by that sequence's composition:
+.IP
+mast meme.results DNA_DB \fB\-dna\fR \fB\-comp\fR
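The hit list format documented in the mast.1 page above (one hit per line, comment lines starting with "#") is simple enough to read with a few lines of script. The following is only a minimal sketch in Python, not part of MAST or of this commit. It assumes the fields are whitespace-separated exactly as in the example output, with the strand sign fused to the motif number; the names Hit and parse_hit_list are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional


class Hit(NamedTuple):
    """One line of -hit_list output (hypothetical container)."""
    sequence_name: str
    strand: Optional[str]  # "+" or "-" for DNA, None for protein
    motif: int
    start: int
    end: int
    score: float
    p_value: float


def parse_hit_list(lines: Iterable[str]) -> List[Hit]:
    """Parse hit lines, skipping blank lines and '#' comment lines."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, motif_field, start, end, score, p_value = line.split()
        # Assumption: the strand sign, if present, is the first character
        # of the motif field ("+2", "-2"); protein hits carry no sign.
        strand = motif_field[0] if motif_field[0] in "+-" else None
        motif = int(motif_field.lstrip("+-"))
        hits.append(Hit(name, strand, motif, int(start), int(end),
                        float(score), float(p_value)))
    return hits


# Lines copied from the example output in the man page.
sample = """\
# All non-overlapping hits in all sequences.
# sequence_name motif hit_start hit_end score hit_p-value
ce1cg -2 8 22 1459.90 1.67e-06
ara +2 2 16 1661.18 5.04e-08
# mast tests/meme/meme.crp0.oops tests/common/crp0.s -hit_list -m 2
"""
hits = parse_hit_list(sample.splitlines())
for hit in hits:
    print(hit.sequence_name, hit.strand, hit.motif, hit.p_value)
```

Keeping the strand separate from the motif number makes it easy to filter hits by strand or by motif afterwards.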
Deleted: trunk/packages/meme/trunk/debian/meme_manual.txt
===================================================================
--- trunk/packages/meme/trunk/debian/meme_manual.txt 2013-02-14 12:38:03 UTC (rev 13006)
+++ trunk/packages/meme/trunk/debian/meme_manual.txt 2013-02-14 13:58:05 UTC (rev 13007)
@@ -1,650 +0,0 @@
-USAGE:
- meme <dataset> [optional arguments]
-
- <dataset> file containing sequences in FASTA format
- [-h] print this message
- [-dna] sequences use DNA alphabet
- [-protein] sequences use protein alphabet
- [-mod oops|zoops|anr] distribution of motifs
- [-nmotifs <nmotifs>] maximum number of motifs to find
- [-evt <ev>] stop if motif E-value greater than <ev>
- [-nsites <sites>] number of sites for each motif
- [-minsites <minsites>] minimum number of sites for each motif
- [-maxsites <maxsites>] maximum number of sites for each motif
- [-wnsites <wnsites>] weight on expected number of sites
- [-w <w>] motif width
- [-minw <minw>] minimum motif width
- [-maxw <maxw>] maximum motif width
- [-nomatrim] do not adjust motif width using multiple
- alignment
- [-wg <wg>] gap opening cost for multiple alignments
- [-ws <ws>] gap extension cost for multiple alignments
- [-noendgaps] do not count end gaps in multiple alignments
- [-bfile <bfile>] name of background Markov model file
- [-revcomp] allow sites on + or - DNA strands
- [-pal] force palindromes (requires -dna)
- [-maxiter <maxiter>] maximum EM iterations to run
- [-distance <distance>] EM convergence criterion
- [-prior dirichlet|dmix|mega|megap|addone]
- type of prior to use
- [-b <b>] strength of the prior
- [-plib <plib>] name of Dirichlet prior file
- [-spfuzz <spfuzz>] fuzziness of sequence to theta mapping
- [-spmap uni|pam] starting point seq to theta mapping type
- [-cons <cons>] consensus sequence to start EM from
- [-text] output in text format (default is HTML)
- [-maxsize <maxsize>] maximum dataset size in characters
- [-nostatus] do not print progress reports to terminal
- [-p <np>] use parallel version with <np> processors
- [-time <t>] quit before <t> CPU seconds consumed
- [-sf <sf>] print <sf> as name of sequence file
-
- MEME -- Multiple EM for Motif Elicitation
-
- MEME is a tool for discovering motifs in a group of related DNA or protein
- sequences.
-
- A motif is a sequence pattern that occurs repeatedly in a group of related
- protein or DNA sequences. MEME represents motifs as position-dependent
- letter-probability matrices which describe the probability of each possible
- letter at each position in the pattern. Individual MEME motifs do not
- contain gaps. Patterns with variable-length gaps are split by MEME into two
- or more separate motifs.
-
- MEME takes as input a group of DNA or protein sequences (the training set)
- and outputs as many motifs as requested. MEME uses statistical modeling
- techniques to automatically choose the best width, number of occurrences,
- and description for each motif.
-
- MEME outputs its results as a hypertext (HTML) document.
-
- The MEME results consist of:
-
- The version of MEME and the date it was released.
-
- The reference to cite if you use MEME in your research.
-
- A description of the sequences you submitted (the "training set")
- showing the name, "weight" and length of each sequence.
-
- The command line summary detailing the parameters with which you
- ran MEME.
-
- Information on each of the motifs MEME discovered, including:
- 1.A summary line showing the width, number of occurrences, log
- likelihood ratio and statistical significance of the motif.
- 2.A simplified position-specific probability matrix.
- 3.A diagram showing the degree of conservation at each motif
- position.
- 4.A multilevel consensus sequence showing the most conserved
- letter(s) at each motif position.
- 5.The occurrences of the motif sorted by p-value and aligned with
- each other.
- 6.Block diagrams of the occurrences of the motif within each
- sequence in the training set.
- 7.The motif in BLOCKS format.
- 8.A position-specific scoring matrix (PSSM) for use by the
- MAST database search program.
- 9.The position specific probability matrix (PSPM) describing the
- motif.
-
- A summary of motifs showing an optimized (non-overlapping) tiling of
- all of the motifs onto each of the sequences in the training set.
-
- The reason why MEME stopped and the name of the CPU on which it
- ran.
-
- This explanation of how to interpret MEME results.
-
- REQUIRED ARGUMENTS:
- <dataset> The name of the file containing the training set
- sequences. If <dataset> is the word "stdin", MEME
- reads from standard input.
-
- The sequences in the dataset should be in
- Pearson/FASTA format. For example:
-
- >ICYA_MANSE INSECTICYANIN A FORM (BLUE BILIPROTEIN)
- GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
- LPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDA
- >LACB_BOVIN BETA-LACTOGLOBULIN PRECURSOR (BETA-LG)
- MKCLLLALALTCGAQALIVTQTMKGLDI
- QKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKW
-
- Sequences start with a header line followed by
- sequence lines. A header line has
- the character ">" in position one, followed by
- a unique name without any spaces, followed by
- (optional) descriptive text. After the header line
- come the actual sequence lines. Spaces and blank
- lines are ignored. Sequences may be in capital or
- lowercase or both.
-
- MEME uses the first word in the header line of each
- sequence, truncated to 24 characters if necessary,
- as the name of the sequence. This name must be unique.
- Sequences with duplicate names will be ignored.
- (The first word in the title line is
- everything following the ">" up to the first blank.)
-
- Sequence weights may be specified in the dataset
- file by special header lines where the unique name
- is "WEIGHTS" (all caps) and the descriptive
- text is a list of sequence weights.
- Sequence weights are numbers in the range 0 < w <=1.
- All weights are assigned in order to the
- sequences in the file. If there are more sequences
- than weights, the remainder are given weight one.
- Weights must be greater than zero and less than
- or equal to one. Weights may be specified by
- more than one "WEIGHTS" entry which may appear
- anywhere in the file. When weights are used,
- sequences will contribute to motifs in proportion
- to their weights. Here is an example for a file
- of three sequences where the first two sequences are
- very similar and it is desired to down-weight them:
-
- >WEIGHTS 0.5 .5 1.0
- >seq1
- GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAK
- >seq2
- GDMFCPGYCPDVKPVGDFDLSAFAGAWHELAK
- >seq3
- QKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKW
-
-
- OPTIONAL ARGUMENTS:
-
- MEME has a large number of optional inputs that can be used
- to fine-tune its behavior. To make these easier to understand
- they are divided into the following categories:
-
- ALPHABET - control the alphabet for the motifs
- (patterns) that MEME will search for
-
- DISTRIBUTION - control how MEME assumes the occurrences
- of the motifs are distributed throughout
- the training set sequences
-
- SEARCH - control how MEME searches for motifs
-
- SYSTEM - the -p <np> argument causes a version of MEME
- compiled for a parallel CPU architecture
- to be run. (By placing <np> in quotes you
- may pass installation specific switches to
- the 'mpirun' command. The number of
- processors to run on must be the first
- argument following -p).
-
-
- In what follows, <n> is an integer, <a> is a decimal number, and <string>
- is a string of characters.
-
- ALPHABET
- --------
- MEME accepts either DNA or protein sequences, but not both in the same run.
- By default, sequences are assumed to be protein. The sequences must be in
- FASTA format.
-
- DNA sequences must contain only the letters "ACGT", plus the ambiguous
- letters "BDHKMNRSUVWY*-".
- Protein sequences must contain only the letters "ACDEFGHIKLMNPQRSTVWY",
- plus the ambiguous letters "BUXZ*-".
-
- MEME converts all ambiguous letters to "X", which is treated as "unknown".
-
- -dna Assume sequences are DNA; default: protein sequences
- -protein Assume sequences are protein
-
-
- DISTRIBUTION
- ------------
- If you know how occurrences of motifs are distributed in the training set
- sequences, you can specify it with the following optional switches. The
- default distribution of motif occurrences is assumed to be zero or one
- occurrence per sequence.
-
- -mod <string> The type of distribution to assume.
- oops One Occurrence Per Sequence
- MEME assumes that each sequence in the dataset
- contains exactly one occurrence of each motif.
- This option is the fastest and most sensitive
- but the motifs returned by MEME may be
- "blurry" if any of the sequences is missing
- them.
-
- zoops Zero or One Occurrence Per Sequence
- MEME assumes that each sequence may contain at
- most one occurrence of each motif. This option
- is useful when you suspect that some motifs
- may be missing from some of the sequences. In
- that case, the motifs found will be more
- accurate than using the first option. This
- option takes more computer time than the
- first option (about twice as much) and is
- slightly less sensitive to weak motifs present
- in all of the sequences.
-
- anr Any Number of Repetitions
- MEME assumes each sequence may contain any
- number of non-overlapping occurrences of each
- motif. This option is useful when you suspect
- that motifs repeat multiple times within a
- single sequence. In that case, the motifs
- found will be much more accurate than using
- one of the other options. This option can also
- be used to discover repeats within a single
- sequence. This option takes much more
- computer time than the first option (about ten
- times as much) and is somewhat less sensitive
- to weak motifs which do not repeat within a
- single sequence than the other two options.
-
-
- SEARCH
- ------
-
- A) OBJECTIVE FUNCTION
-
- MEME uses an objective function on motifs to select the "best" motif.
- The objective function is based on the statistical significance of the
- log likelihood ratio (LLR) of the occurrences of the motif.
- The E-value of the motif is an estimate of the number of motifs (with the
- same width and number of occurrences) that would have equal or higher log
- likelihood ratio if the training set sequences had been generated randomly
- according to the (0-order portion of the) background model.
-
- MEME searches for the motif with the smallest E-value.
- It searches over different motif widths, numbers of occurrences, and
- positions in the training set for the motif occurrences.
- The user may limit the range of motif widths and number of occurrences
- that MEME tries using the switches described below. In addition,
- MEME trims the motif (using a dynamic programming multiple alignment) to
- eliminate any positions where there is a gap in any of the occurrences.
-
- The log likelihood ratio of a motif is
- llr = log (Pr(sites | motif) / Pr(sites | back))
- and is a measure of how different the sites are from the background model.
- Pr(sites | motif) is the probability of the occurrences given a model
- consisting of the position-specific probability matrix (PSPM) of the motif.
- (The PSPM is output by MEME).
- Pr(sites | back) is the probability of the occurrences given the background
- model. The background model is an n-order Markov model. By default,
- it is a 0-order model consisting of the frequencies of the letters in
- the training set. A different 0-order Markov model or higher order Markov
- models can be specified to MEME using the -bfile option described below.
-
- The E-value reported by MEME is actually an approximation of the E-value
- of the log likelihood ratio. (An approximation is used because it is far
- more efficient to compute.) The approximation is based on the fact that
- the log likelihood ratio of a motif is the sum of the log
- likelihood ratios of each column of the motif. Instead of computing the
- statistical significance of this sum (its p-value), MEME computes the
- p-value of each column and then computes the significance of their product.
- Although not identical to the significance of the log likelihood ratio, this
- easier to compute objective function works very similarly in practice.
-
- The motif significance is reported as the E-value of the motif.
- The statistical significance of a motif is computed based on:
- 1) the log likelihood ratio,
- 2) the width of the motif,
- 3) the number of occurrences,
- 4) the 0-order portion of the background model,
- 5) the size of the training set, and
- 6) the type of model (oops, zoops, or anr, which determines the
- number of possible different motifs of the given width and
- number of occurrences).
-
- MEME searches for motifs by performing Expectation Maximization (EM) on a
- motif model of a fixed width and using an initial estimate of the number of
- sites. It then sorts the possible sites by the probability assigned to them
- by EM. MEME then calculates the E-values of the first n sites
- in the sorted list for different values of n. This procedure (first EM,
- followed by computing E-values for different numbers of sites) is repeated
- with different widths and different initial estimates of the number of
- sites. MEME outputs the motif with the lowest E-value.
-
-
- B) NUMBER OF MOTIFS
-
- -nmotifs <n> The number of *different* motifs to search
- for. MEME will search for and output <n> motifs.
- Default: 1
-
- -evt <p> Quit looking for motifs if E-value exceeds <p>.
- Default: infinite (so by default MEME never quits
- before -nmotifs <n> have been found.)
-
-
- C) NUMBER OF MOTIF OCCURRENCES
-
- -nsites <n>
- -minsites <n>
- -maxsites <n>
- The (expected) number of occurrences of each motif.
- If -nsites is given, only that number of occurrences
- is tried. Otherwise, numbers of occurrences between
- -minsites and -maxsites are tried as initial guesses
- for the number of motif occurrences. These
- switches are ignored if mod = oops.
-
- Default: -minsites sqrt(number sequences)
- -maxsites Default:
- zoops # of sequences
- anr MIN(5*#sequences, 50)
-
- -wnsites <n> The weight on the prior on nsites. This controls
- how strong the bias towards motifs with exactly
- nsites sites (or between minsites and maxsites sites)
- is. It is a number in the range [0..1). The
- larger it is, the stronger the bias towards
- motifs with exactly nsites occurrences is.
- Default: 0.8
-
- D) MOTIF WIDTH
-
- -w <n>
- -minw <n>
- -maxw <n>
-
- The width of the motif(s) to search for.
- If -w is given, only that width is tried.
- Otherwise, widths between -minw and -maxw are tried.
- Default: -minw 8, -maxw 50 (defined in user.h)
-
- Note: If <n> is greater than the length of the shortest
- sequence in the dataset, <n> is reset by MEME to
- that value.
-
- -nomatrim
- -wg <a>
- -ws <a>
- -noendgaps
- These switches control trimming (shortening) of
- motifs using the multiple alignment method.
- Specifying -nomatrim causes MEME to skip this and
- causes the other switches to be ignored.
- MEME finds the best motif
- and then trims (shortens) it using the multiple
- alignment method (described below). The number of
- occurrences is then adjusted to minimize the motif
- E-value, and then the motif width is further
- shortened to optimize the E-value.
-
- The multiple alignment method performs a separate
- pairwise alignment of the site with the highest
- probability and each other possible site.
- (The alignment includes width/2 positions on either
- side of the sites.) The pairwise alignment
- is controlled by the switches:
- -wg <a> (gap cost; default: 11),
- -ws <a> (space cost; default 1), and,
- -noendgaps (do not penalize endgaps; default:
- penalize endgaps).
- The pairwise alignments are then combined and the
- method determines the widest section of the motif with
- no insertions or deletions. If this alignment
- is shorter than <minw>, it tries to find an alignment
- allowing up to one insertion/deletion per motif
- column. This continues (allowing up to 2, 3 ...
- insertions/deletions per motif column) until an
- alignment of width at least <minw> is found.
-
-
- E) BACKGROUND MODEL
- -bfile <bfile> The name of the file containing the background model
- for sequences. The background model is the model
- of random sequences used by MEME. The background
- model is used by MEME
- 1) during EM as the "null model",
- 2) for calculating the log likelihood ratio
- of a motif,
- 3) for calculating the significance (E-value)
- of a motif, and,
- 4) for creating the position-specific scoring
- matrix (log-odds matrix).
-
- By default, the background model is a 0-order Markov
- model based on the letter frequencies in the training
- set.
-
- Markov models of any order can be specified in <bfile>
- by listing frequencies of all possible tuples of
- length up to order+1.
-
- Note that MEME uses only the 0-order portion (single
- letter frequencies) of the background model for
- purposes 3) and 4), but uses the full-order model
- for purposes 1) and 2), above.
-
- Example: To specify a 1-order Markov background model
- for DNA, <bfile> might contain the following
- lines. Note that optional comment lines are
- marked by "#" and are ignored by MEME.
-
- # tuple frequency_non_coding
- a 0.324
- c 0.176
- g 0.176
- t 0.324
- # tuple frequency_non_coding
- aa 0.119
- ac 0.052
- ag 0.056
- at 0.097
- ca 0.058
- cc 0.033
- cg 0.028
- ct 0.056
- ga 0.056
- gc 0.035
- gg 0.033
- gt 0.052
- ta 0.091
- tc 0.056
- tg 0.058
- tt 0.119
-
- Sample -bfile files are given in directory tests:
- tests/nt.freq (DNA), and
- tests/na.freq (amino acid).
-
- F) DNA PALINDROMES AND STRANDS
-
- -revcomp motif occurrences may be on the given DNA strand
- or on its reverse complement.
- Default: look for DNA motifs only on the strand given
- in the training set.
-
- -pal
- Choosing -pal causes MEME to look for palindromes in
- DNA datasets.
-
- MEME averages the letter frequencies in corresponding
- columns of the motif (PSPM) together. For instance,
- if the width of the motif is 10, columns 1 and 10, 2
- and 9, 3 and 8, etc., are averaged together. The
- averaging combines the frequency of A in one column
- with T in the other, and the frequency of C in one
- column with G in the other.
- If this option is not chosen, MEME does not
- search for DNA palindromes.
-
-
- G) EM ALGORITHM
-
- -maxiter <n> The number of iterations of EM to run from
- any starting point.
- EM is run for <n> iterations or until convergence
- (see -distance, below) from each starting point.
- Default: 50
-
- -distance <a> The convergence criterion. MEME stops
- iterating EM when the change in the
- motif frequency matrix is less than <a>.
- (Change is the Euclidean distance between
- two successive frequency matrices.)
- Default: 0.001
-
- -prior <string> The prior distribution on the model parameters:
- dirichlet simple Dirichlet prior
- This is the default for -dna and
- -alph. It is based on the
- non-redundant database letter
- frequencies.
- dmix mixture of Dirichlets prior
- This is the default for -protein.
- mega extremely low variance dmix;
- variance is scaled inversely with
- the size of the dataset.
- megap mega for all but last iteration
- of EM; dmix on last iteration.
- addone add +1 to each observed count
-
- -b <a> The strength of the prior on model parameters:
- <a> = 0 means use intrinsic strength of prior
- for prior = dmix.
- Defaults:
- 0.01 if prior = dirichlet
- 0 if prior = dmix
-
- -plib <string> The name of the file containing the Dirichlet prior
- in the format of file prior30.plib.
-
-
- H) SELECTING STARTS FOR EM
-
- The default is for MEME to search the dataset for good starts for EM. How
- the starting points are derived from the dataset is specified by the
- following switches.
-
- The default type of mapping MEME uses is:
- -spmap uni for -dna and -alph <string>
- -spmap pam for -protein
-
- -spfuzz <a> The fuzziness of the mapping.
- Possible values are greater than 0. Meaning
- depends on -spmap, see below.
-
- -spmap <string> The type of mapping function to use.
- uni Use add-<a> prior when converting a substring
- to an estimate of theta.
- Default -spfuzz <a>: 0.5
- pam Use columns of PAM <a> matrix when converting
- a substring to an estimate of theta.
- Default -spfuzz <a>: 120 (PAM 120)
-
- Other types of starting points
- can be specified using the following switches.
-
- -cons <string> Override the sampling of starting points
- and just use a starting point derived from
- <string>.
- This is useful when an actual occurrence of
- a motif is known and can be used as the
- starting point for finding the motif.
-
- EXAMPLES:
-
- The following examples use data files provided in this release of MEME.
- MEME writes its output to standard output, so you will want to redirect it
- to a file for use with MAST.
-
- 1) A simple DNA example:
-
- meme crp0.s -dna -mod oops -pal > ex1.html
-
- MEME looks for a single motif in the file crp0.s which contains DNA
- sequences in FASTA format. The OOPS model is used so MEME assumes that
- every sequence contains exactly one occurrence of the motif. The
- palindrome switch is given so the motif model (PSPM) is converted into a
- palindrome by combining corresponding frequency columns. MEME automatically
- chooses the best width for the motif in this example since no width was
- specified.
-
- 2) Searching for motifs on both DNA strands:
-
- meme crp0.s -dna -mod oops -revcomp > ex2.html
-
- This is like the previous example except that the -revcomp switch tells
- MEME to consider both DNA strands, and the -pal switch is absent so the
- palindrome conversion is omitted. When MEME uses both DNA strands, motif
- occurrences on the two strands may not overlap. That is, any position
- in the sequence given in the training set may be contained in an occurrence
- of a motif on the positive strand or the negative strand, but not both.
-
- 3) A fast DNA example:
-
- meme crp0.s -dna -mod oops -revcomp -w 20 > ex3.html
-
- This example differs from example 2) in that MEME is told to only
- consider motifs of width 20. This causes MEME to execute about 10
- times faster. The -w switch can also be used with protein datasets if
- the width of the motifs is known in advance.
-
- 4) Using a higher-order background model:
-
- meme INO_up800.s -dna -mod anr -revcomp -bfile yeast.nc.6.freq > ex4.html
-
- In this example we use -mod anr and -bfile yeast.nc.6.freq. This specifies
- that
- a) the motif may have any number of occurrences in each sequence, and,
- b) the Markov model specified in yeast.nc.6.freq is used as the
- background model. This file contains a fifth-order Markov model
- for the non-coding regions in the yeast genome.
- Using a higher order background model can often result in more sensitive
- detection of motifs. This is because the background model more accurately
- models non-motif sequence, allowing MEME to discriminate against it and find
- the true motifs.
-
- 5) A simple protein example:
-
- meme lipocalin.s -mod oops -maxw 20 -nmotifs 2 > ex5.html
-
- The -dna switch is absent, so MEME assumes the file lipocalin.s contains
- protein sequences. MEME searches for two motifs each of width less than or
- equal to 20.
- (Specifying -maxw 20 makes MEME run faster since it does not have to
- consider motifs longer than 20.) Each motif is assumed to occur in each
- of the sequences because the OOPS model is specified.
-
- 6) Another simple protein example:
-
- meme farntrans5.s -mod anr -maxw 40 -maxsites 50 > ex6.html
-
- MEME searches for a motif of width up to 40 with up to 50 occurrences in
- the entire training set. The ANR sequence model is specified,
- which allows each motif to have any number of occurrences in each sequence.
- This dataset contains motifs with multiple repeats in each
- sequence. This example is fairly time consuming because the
- time required to initialize the motif probability tables is proportional
- to <maxw> times <maxsites>. By default, MEME only looks for motifs up to
- 29 letters wide with a maximum total number of occurrences equal to twice
- the number of sequences or 30, whichever is less.
-
- 7) A much faster protein example:
-
- meme farntrans5.s -mod anr -w 10 -maxsites 30 -nmotifs 3 > ex7.html
-
- This time MEME is constrained to search for three motifs of width exactly
- ten. The effect is to break up the long motif found in the previous
- example. The -w switch forces motifs to be *exactly* ten letters wide.
- This example is much faster because, since only one width is considered, the
- time to build the motif probability tables is only proportional to
- <maxsites>.
-
- 8) Splitting the sites into three:
-
- meme farntrans5.s -mod anr -maxw 12 -nsites 24 -nmotifs 3 > ex8.html
-
- This forces each motif to have 24 occurrences, exactly, and be up to 12
- letters wide.
-
- 9) A larger protein example with E-value cutoff:
-
- meme adh.s -mod zoops -nmotifs 20 -evt 0.01 > ex9.html
-
- In this example, MEME looks for up to 20 motifs, but stops when a motif is
- found with E-value greater than 0.01. Motifs with large E-values are likely
- to be statistical artifacts rather than biologically significant.
-
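The removed manual defines the log likelihood ratio as llr = log (Pr(sites | motif) / Pr(sites | back)) and notes that it is the sum of the log likelihood ratios of the individual motif columns. The following Python sketch illustrates that decomposition only; it is not MEME's implementation. It assumes a 0-order background, equal-width aligned sites, a PSPM estimated from raw letter frequencies with no prior or pseudocounts, and natural logarithms (the manual does not state the base). The function names and the example sites are made up for illustration; the background frequencies are the single-letter values from the sample -bfile listing above.

```python
import math
from typing import Dict, List


def estimate_pspm(sites: List[str], alphabet: str) -> List[Dict[str, float]]:
    """Per-column letter frequencies of equal-width aligned sites."""
    width = len(sites[0])
    pspm = []
    for j in range(width):
        column = [site[j] for site in sites]
        pspm.append({a: column.count(a) / len(sites) for a in alphabet})
    return pspm


def column_llrs(sites: List[str],
                pspm: List[Dict[str, float]],
                background: Dict[str, float]) -> List[float]:
    """Log likelihood ratio of each motif column (0-order background)."""
    llrs = []
    for j, column in enumerate(pspm):
        # Every observed letter has a nonzero frequency in its own column,
        # so the logarithm is always defined here.
        llrs.append(sum(math.log(column[site[j]] / background[site[j]])
                        for site in sites))
    return llrs


# Single-letter frequencies from the sample background file.
background = {"A": 0.324, "C": 0.176, "G": 0.176, "T": 0.324}
# Made-up aligned sites, for illustration only.
sites = ["TGTGA", "TGTGA", "TGAGA", "TGTGC"]

pspm = estimate_pspm(sites, "ACGT")
per_column = column_llrs(sites, pspm, background)
llr = sum(per_column)  # motif LLR is the sum over columns
print(per_column)
print(llr)
```

A perfectly conserved column of a rare letter contributes more to the total than a conserved column of a common letter, which is why the background model matters for the significance of a motif.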