[med-svn] r9642 - in trunk/packages/R/r-cran-haplo.stats/trunk/debian: . doc_source
Andreas Tille
tille at alioth.debian.org
Fri Feb 10 22:52:57 UTC 2012
Author: tille
Date: 2012-02-10 22:52:56 +0000 (Fri, 10 Feb 2012)
New Revision: 9642
Added:
trunk/packages/R/r-cran-haplo.stats/trunk/debian/README.Debian
trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc-base
trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc_source/
trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc_source/manualHaploStats.Rnw
trunk/packages/R/r-cran-haplo.stats/trunk/debian/docs
Modified:
trunk/packages/R/r-cran-haplo.stats/trunk/debian/changelog
trunk/packages/R/r-cran-haplo.stats/trunk/debian/rules
Log:
Enhance documentation including source provided by upstream; upload to unstable
Added: trunk/packages/R/r-cran-haplo.stats/trunk/debian/README.Debian
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/README.Debian (rev 0)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/README.Debian 2012-02-10 22:52:56 UTC (rev 9642)
@@ -0,0 +1,5 @@
+Notes on how this package can be tested.
+────────────────────────────────────────
+
+This package can be tested by loading it into R with the command
+‘library(haplo.stats)’ in order to confirm its integrity.
Modified: trunk/packages/R/r-cran-haplo.stats/trunk/debian/changelog
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/changelog 2012-02-10 15:27:06 UTC (rev 9641)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/changelog 2012-02-10 22:52:56 UTC (rev 9642)
@@ -1,4 +1,4 @@
-r-cran-haplo.stats (1.5.2-1) UNRELEASED; urgency=low
+r-cran-haplo.stats (1.5.2-1) unstable; urgency=low
[ Charles Plessy ]
* debian/upstream-metadata.yaml -> debian/upstream
@@ -14,9 +14,13 @@
* debian/rules: use usual R:depends variable substitution
* debian/lintian-overrides: do not warn about duplicate license
* debian/copyright: DEP5
+ * debian/doc_source/manualHaploStats.Rnw: Source for user
+ manual provided by upstream
+ Closes: #655439
+ * debian/docs: Install user manual
+ * debian/doc-base: Doc-base control file
+ * debian/README.Debian: Hint how to test the package
- TODO: Clarify #655439 (just sendet mail to upstream)
-
-- Andreas Tille <tille at debian.org> Fri, 10 Feb 2012 15:54:27 +0100
r-cran-haplo.stats (1.4.4-1) unstable; urgency=low
Added: trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc-base
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc-base (rev 0)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc-base 2012-02-10 22:52:56 UTC (rev 9642)
@@ -0,0 +1,13 @@
+Document: r-cran-haplo.stats
+Title: Statistical Methods for Haplotypes When Linkage Phase is Ambiguous
+Author: Jason P. Sinnwell and Daniel J. Schaid
+Abstract: GNU R package for haplotype analysis
+ Haplo Stats is a suite of R routines for the analysis of indirectly
+ measured haplotypes. The statistical methods assume that all subjects
+ are unrelated and that haplotypes are ambiguous (due to unknown
+ linkage phase of the genetic markers), while also allowing for missing
+ alleles.
+Section: Science/Biology
+
+Format: pdf
+Files: /usr/share/doc/r-cran-haplo.stats/manualHaploStats.pdf
Added: trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc_source/manualHaploStats.Rnw
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc_source/manualHaploStats.Rnw (rev 0)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/doc_source/manualHaploStats.Rnw 2012-02-10 22:52:56 UTC (rev 9642)
@@ -0,0 +1,2066 @@
+%\VignetteIndexEntry{haplo.stats}
+%\VignetteKeywords{haplotype, score, glm}
+%\VignetteDepends{haplo.stats}
+%\VignettePackage{haplo.stats}
+
+\RequirePackage[T1]{fontenc}
+\RequirePackage{graphicx,ae,fancyvrb}
+
+\IfFileExists{upquote.sty}{\RequirePackage{upquote}}{}
+\setkeys{Gin}{width=0.8\textwidth}
+
+
+\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontfamily=courier,
+ fontshape=sl, fontseries=b, xleftmargin=.25cm}
+
+\DefineVerbatimEnvironment{Soutput}{Verbatim}{fontfamily=courier,
+ fontshape=sl, xleftmargin=.5cm}
+
+
+\documentclass[11pt]{article}
+\def\version{version 1.5.0}
+\def\Rversion{version 2.14.0}
+\def\package{{\sl haplo.stats}}
+\def\Explain{\paragraph{\large {\bf Explanation of Results \vspace{.1in}} \\}}
+
+\usepackage{Sweave}
+\usepackage{fullpage}
+
+%\textwidth 6.5in
+%\oddsidemargin 0in
+%\evensidemargin 0in
+%\textheight 7.7in
+
+\title{{\bf Haplo Stats} \\
+(\version) \vspace{.25in}\\
+Statistical Methods for Haplotypes When
+Linkage Phase is Ambiguous \vspace{1.5in}\\
+}
+
+\author{Jason P. Sinnwell\thanks{sinnwell at mayo.edu} and Daniel J. Schaid \\
+Mayo Clinic Division of Health Sciences Research\\Rochester MN USA \vspace{.25in} \\}
+
+\date{{\today} \vspace{2in}}
+\begin{document}
+
+\maketitle
+
+\pagebreak
+\tableofcontents
+\pagebreak
+
+\section{Introduction}
+
+Haplo Stats is a suite of R routines for the analysis of indirectly
+measured haplotypes. The statistical methods assume that all subjects
+are unrelated and that haplotypes are ambiguous (due to unknown linkage
+phase of the genetic markers), while also allowing for missing alleles.
+
+
+The user-level functions in Haplo Stats are:
+\begin{itemize}
+ \item{{\sl haplo.em:}\ \ for the estimation of haplotype frequencies
+ and posterior probabilities of haplotype pairs for each subject,
+ conditional on the observed marker data}
+ \item{{\sl haplo.glm:}\ \ generalized linear models for the regression of
+ a trait on haplotypes, with the option of including covariates and
+ interactions}
+ \item{{\sl haplo.score:}\ \ score statistics to test associations
+ between haplotypes and a variety of traits, including binary, ordinal,
+ quantitative, and Poisson}
+ \item{{\sl haplo.score.slide:} \ \ haplo.score computed on sub-haplotypes
+ of a larger region}
+  \item{{\sl seqhap:}\ \ sequentially scan markers, enlarging a haplotype
+    to test for association with a trait}
+ \item{{\sl haplo.cc:} \ \ run a combined analysis for haplotype
+ frequencies, scores, and regression results for a case-control study}
+  \item{{\sl haplo.power.qt/haplo.power.cc:}\ \ power or sample size
+    calculations for a quantitative or binary trait}
+ \item{{\sl haplo.scan:}\ \ search for a trait locus for all sizes
+ of sub-haplotypes within a fixed maximum window width for all
+ markers in a region}
+ \item{{\sl haplo.design:} \ \ create a design matrix for haplotype
+ effects}
+\end{itemize}
+
+\noindent This manual explains the basic and advanced usage of these
+routines, with guidelines for running the analyses and interpreting results.
+We provide many of these details in the function help pages, which are
+accessed within an R session using
+{\sl help(haplo.em)}, for example. We also provide brief examples in the
+help files, which can be run in the R session with {\sl example(haplo.em)}.
+
+\subsection{Updates}
+This version of Haplo Stats contains updates to {\sl haplo.glm}
+(section~\ref{hapGLM}), along with new methods written for it that resemble
+glm class methods. These methods include residuals, fitted.values, vcov, and
+anova, and they are detailed in section~\ref{glmMethods}. For the full history
+of updates, see the NEWS file, or type {\sl news(package=``haplo.stats'')}
+at the R command prompt.
+
+\subsection{Operating System and Installation}
+Haplo Stats \version\ is written for R (\Rversion). It has been uploaded
+to the Comprehensive R Archive Network (CRAN), and is made available
+for various operating systems through CRAN. The package is installed
+from within R using {\sl install.packages(``haplo.stats'')},
+but other procedures for installing R packages can be found at the R project
+website (http://www.r-project.org).
+
+\subsection{R Basics}
+For users who are new to the R environment, we demonstrate some basic
+concepts. In the following example we create a vector of character alleles and
+use the {\sl table} function to get allele counts. We first show how to save
+the results of {\sl table(snp)} into an R session variable, {\sl tab}. We show
+that {\sl tab} is an object of the {\sl table} class, and use the print and summary
+methods that are defined for {\sl table} objects. Note that when you enter just
+{\sl tab} or {\sl table(snp)} at the prompt, the print method is invoked.
+
+<<label=simpleR>>=
+snp <- c("A", "T", "T", "A", "A", "T", "T")
+
+tab <- table(snp)
+tab
+
+class(tab)
+print.table(tab)
+summary(tab)
+
+tab[2]
+table(snp)
+@
+
+\noindent The routines in \package\ are computationally intensive and
+return a large amount of information in their returned objects. Therefore, we
+assign classes to the returned objects and provide various methods for each
+of them.
+
+\section{Data Setup}
+
+We begin with some typical steps taken when you first load a package and look
+for details on a function of interest. In the sample code below, we
+load \package, check which functions are available in the package,
+view a help file, and run the example that is within the help file.
+
+<<getstarted, echo=TRUE, eval=FALSE>>=
+# load the library, list its functions, and view a help file and example
+library(haplo.stats)
+ls(name="package:haplo.stats")
+help(haplo.em)
+example(haplo.em)
+@
+
+<<hidsettings, echo=FALSE, eval=TRUE>>=
+options(width=80)
+rm(list=ls())
+library(haplo.stats, lib.loc="~/Rlib")
+@
+
+\subsection{Example Data}
+
+The \package\ package contains three example data sets. The primary data
+set used in this manual is named {\sl hla.demo}; it contains 11
+loci from the HLA region
+on chromosome 6, with covariates, qualitative, and quantitative
+responses. The data is stored in tab-delimited format in
+{\sl haplo.stats/data/hla.demo.tab}. Typically, data stored in this format
+can be read in using {\sl read.table()}. Since the data is provided in the
+package, we load it in R using {\sl data()} and view the names of
+the columns. Then, to make the columns of {\sl hla.demo} accessible without
+typing the data frame name each time, we attach it to the current session.
+
+<<label=hladat, echo=TRUE>>=
+# load and preview demo dataset stored in ~/haplo.stats/data/hla.demo.tab
+data(hla.demo)
+names(hla.demo)
+# attach hla.demo to make columns available in the session
+attach(hla.demo)
+@
+
+The column names of {\sl hla.demo} are shown above. They are defined as
+follows:
+\begin{itemize}
+\item{\bf resp:\ }{quantitative antibody response to measles vaccination}
+\item{\bf resp.cat:\ }{a factor with levels "low", "normal", "high", for
+categorical antibody response}
+\item{\bf male:\ }{gender code with {\sl 1="male"}, {\sl 0="female"}}
+\item{\bf age:\ }{age (in months) at immunization}
+\end{itemize}
+
+\noindent The remaining columns are genotypes for 11 HLA loci, with a prefix
+name (e.g., "DQB") and a suffix for each of two alleles (".a1" and ".a2").
+The variables in {\sl hla.demo}\ can be accessed by typing {\sl hla.demo\$}\
+before their names, such as {\sl hla.demo\$resp}.
+Alternatively, it is easier for these examples to attach {\sl hla.demo}
+(as shown above with {\sl attach()}) so the variables can be accessed by
+simply typing their names.
+
+\subsection{Creating a Genotype Matrix}
+
+Many of the functions require a matrix of genotypes, denoted below as
+{\sl geno}. This matrix is arranged such that each locus has a pair of
+adjacent columns of alleles, and the order of columns corresponds
+to the order of loci on a chromosome. If there are K loci, then the
+number of columns of {\sl geno} is $2K$. Rows represent the alleles for each
+subject. For example, if there are three loci, in the order A-B-C,
+then the 6 columns of {\sl geno} would be arranged as A.a1, A.a2, B.a1,
+B.a2, C.a1, C.a2. For illustration, three of the loci in
+{\sl hla.demo} will be used to demonstrate some of the functions. Create
+a separate data frame for 3 of the loci, and call this {\sl geno}. Then create
+a vector of labels for the loci.
+
+<<label=makegeno, echo=TRUE>>=
+geno <- hla.demo[,c(17,18,21:24)]
+label <-c("DQB","DRB","B")
+@
+
+\noindent The {\sl hla.demo} data already had alleles in two columns for each
+locus. For many SNP datasets, the data is in a one column format,
+giving the count of the minor allele. To assist in converting this
+format to two columns, a function named {\sl geno1to2} has been added
+to the package. See its help file for more details.
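+
+\noindent As an illustrative sketch only (the one-column input below is
+hypothetical; see {\sl help(geno1to2)} for the exact arguments), a two-SNP
+data set coded as minor-allele counts might be converted as follows:
+
+<<geno1to2sketch, echo=TRUE, eval=FALSE>>=
+# hypothetical minor-allele counts (0, 1, or 2) for two SNPs
+snp.counts <- data.frame(snp1=c(0,1,2), snp2=c(1,0,2))
+# convert to the two-columns-per-locus genotype matrix format
+geno2col <- geno1to2(snp.counts, locus.label=c("snp1","snp2"))
+@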
+
+\subsection{Preview Missing Data: {\sl summaryGeno}}
+
+Before performing a haplotype analysis, the user will want to assess
+missing genotype data to determine the completeness of the data. If
+many genotypes are missing, the functions may take a long time to
+compute results, or even run out of memory. For these reasons, the user
+may want to remove some of the subjects with a lot of missing data.
+This step can be guided by using the {\sl summaryGeno}\ function, which checks
+for missing allele information and counts the number of potential
+haplotype pairs that are consistent with the observed data (see the
+Appendix for a description of this counting scheme).
+
+The codes for missing alleles are defined by the parameter
+{\sl miss.val}, a vector to define all possible missing value
+codes. Below, the result is saved in {\sl geno.desc}, which is a data frame,
+so individual rows may be printed. Here we show the results for
+subjects 1-10, 80-85, and 135-140, some of which have missing alleles.
+
+<<summarygeno, echo=TRUE>>=
+geno.desc <- summaryGeno(geno, miss.val=c(0,NA))
+print(geno.desc[c(1:10,80:85,135:140),])
+@
+
+\noindent The columns labeled {\sl 'loc miss-'} give the number of loci
+missing either 0,
+1, or 2 alleles, and the last column, {\sl num\_enum\_rows}, gives the
+number of haplotype pairs that are consistent with the observed
+data. In the example above, subjects indexed by rows 81 and 137 have
+missing alleles. Subject \#81 has one locus missing two alleles, while
+subject \#137 has two loci missing two alleles. As indicated by
+{\sl num\_enum\_rows}, subject \#81 has 1,800 potential haplotype pairs, while
+subject \#137 has nearly 130,000.
+
+Although 130,000 haplotype pairs is a large number, {\sl haplo.em},
+{\sl haplo.score}, and {\sl haplo.glm} complete in roughly 3-6 minutes (depending
+on system limits or control parameter settings). If person \#137 were removed,
+the methods would take less than half that time. It is preferable to keep people
+who provide information to the analysis, provided that run time and memory usage
+are not too much of a burden.
+
+When a person has no genotype information, they do not provide
+information to any of the methods in \package. Furthermore, they cause a
+much longer run time. Below, using the {\sl table} function on the third
+column of {\sl geno.desc}, we can tabulate how many people are missing two alleles
+at any of the three loci. If there were people missing two alleles at
+all three loci, they should be removed. The second command below shows how to
+make an index of which people to remove from {\sl hla.demo} because they are
+missing all their alleles.
+
+<<tabledesc, echo=TRUE>>=
+
+# find if there are any people missing all alleles
+table(geno.desc[,3])
+@
+
+<<rmmiss, echo=TRUE, eval=FALSE>>=
+## create an index of people missing all alleles
+miss.all <- which(geno.desc[,3]==3)
+
+# use index to subset hla.demo
+hla.demo.updated <- hla.demo[-miss.all,]
+@
+
+
+\subsection{Random Numbers and Setting Seed \label{seed} }
+
+Random numbers are used in several of the functions (e.g., to determine
+random starting frequencies within {\sl haplo.em}, and to compute permutation
+p-values in {\sl haplo.score}). To reproduce calculations involving random
+numbers, we set the seed values stored in a vector called
+{\sl .Random.seed}. This vector can be set using {\sl set.seed()}
+before any function which uses random numbers. Section \ref{glmBaseline}
+shows one example of setting the seed for {\sl haplo.glm}. We illustrate
+setting the seed below.
+
+<<setseed, echo=TRUE>>=
+# this is how to set the seed for reproducing results where haplo.em is
+# involved, and also if simulations are run. In practice, don't reset seed.
+seed <- c(17, 53, 1, 40, 37, 0, 62, 56, 5, 52, 12, 1)
+set.seed(seed)
+@
+
+
+\section{Haplotype Frequency Estimation: {\sl haplo.em}}
+
+\subsection{Algorithm}
+For genetic markers measured on unrelated subjects, with linkage phase
+unknown, {\sl haplo.em} computes maximum likelihood estimates of haplotype
+probabilities. Because there may be more than one pair of haplotypes
+that are consistent with the observed marker phenotypes, posterior
+probabilities of haplotype pairs for each subject are also
+computed. Unlike the usual EM which attempts to enumerate all possible
+haplotype pairs before iterating over the EM steps, our
+{\em progressive insertion} algorithm progressively inserts batches of
+loci into haplotypes of growing lengths, runs the EM steps, trims
+off pairs of haplotypes per subject when the posterior probability of
+the pair is below a specified threshold, and then continues these
+insertion, EM, and trimming steps until all loci are inserted into
+the haplotype. The user can choose the batch size. If the batch size
+is chosen to be all loci, and the threshold for trimming is set to 0,
+then this reduces to the usual EM algorithm. The basis of this
+progressive insertion algorithm is from the "snphap" software by
+David Clayton\cite{Clayton}.
+Although some of the features and control parameters of {\sl haplo.em} are
+modeled after {\sl snphap}, there are substantial differences, such as
+extension to allow for more than two alleles per locus, and some
+other nuances on how the algorithm is implemented.
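+
+\noindent As an illustrative sketch of the special case noted above (syntax
+only, not evaluated here), the usual EM can be recovered by inserting all
+loci in one batch and setting the trimming threshold to zero:
+
+<<emusual, echo=TRUE, eval=FALSE>>=
+# batch size = all 3 loci, no trimming: reduces to the usual EM algorithm
+save.em.usual <- haplo.em(geno=geno, locus.label=label, miss.val=c(0, NA),
+          control = haplo.em.control(insert.batch.size = length(label),
+                                     min.posterior = 0))
+@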
+
+\subsection{Example Usage \label{hapEM}}
+We use {\sl haplo.em} on {\sl geno} for the 3 loci defined above and save
+the result in an object named {\sl save.em}, which has the {\sl haplo.em}
+class. The print method would normally print all 178 haplotypes from {\sl save.em},
+but to keep the results short for this manual, we give a quick glance of
+the output by using the option {\sl nlines=10}, which prints
+only the first 10 haplotypes of the full results. The {\sl nlines}
+parameter has been employed in some of Haplo Stats' print methods
+for when there are many haplotypes. In practice, it is best to exclude this
+parameter so that the default will print all results.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<label=runEM, echo=TRUE>>=
+save.em <- haplo.em(geno=geno, locus.label=label, miss.val=c(0,NA))
+names(save.em)
+print(save.em, nlines=10)
+@
+
+\Explain The print method shows the haplotypes and their estimated frequencies,
+followed by the final log-likelihood statistic and the {\sl lr stat for no LD},
+which is the likelihood ratio test statistic contrasting the {\sl lnlike} for the
+estimated haplotype frequencies versus the {\sl lnlike} under the null assuming
+that alleles from all loci are in linkage equilibrium. We note that the trimming
+by the progressive insertion algorithm can invalidate the {\sl lr stat} and the
+degrees of freedom ({\sl df}).\\
+
+\subsection{Summary Method}
+
+The {\sl summary} method for a {\sl haplo.em} object on {\sl save.em} shows the
+list of haplotypes per subject, and their posterior probabilities:
+<<summaryEM, echo=TRUE>>=
+summary(save.em, nlines=7)
+@
+
+\Explain The first part of the {\sl summary} output lists the subject id
+(row number of input {\sl geno} matrix), the codes for the haplotypes of
+each pair, and the posterior probabilities of the haplotype pairs. The
+second part gives a table of the maximum number of pairs of haplotypes per subject,
+versus the number of pairs used in the final posterior probabilities.
+The haplotype codes remove the clutter of illustrating all the alleles
+of the haplotypes, but may not be as informative as the actual
+haplotypes themselves. To see the actual haplotypes, use the
+{\sl show.haplo=TRUE} option, as in the following example.
+
+<<summaryEMshow, echo=TRUE>>=
+# show full haplotypes, instead of codes
+summary(save.em, show.haplo=TRUE, nlines=7)
+@
+
+\subsection{Control Parameters for {\sl haplo.em} \label{emControl}}
+
+A set of control parameters can be passed together to {\sl haplo.em} as the
+``{\sl control}" argument. This is a list of parameters that control the
+EM algorithm based on progressive insertion of loci. The default values are set by a
+function called {\sl haplo.em.control} (see {\sl help(haplo.em.control)}
+for a complete description). Although the user can accept the default
+values, there are times when control parameters may need to be adjusted.
+
+These parameters are defined below:
+
+\begin{itemize}
+\item{} {\bf insert.batch.size:\ } Number of loci to be inserted in a single
+batch.
+\item{} {\bf min.posterior:\ } Minimum posterior probability of haplotype pair,
+conditional on observed marker genotypes. Posteriors below this
+minimum value will have their pair of haplotypes "trimmed" off the
+list of possible pairs.
+\item{}{\bf max.iter:\ } Maximum number of iterations allowed for the EM
+algorithm before it stops and prints an error.
+\item{}{\bf n.try:\ } Number of times to try to maximize the {\sl lnlike}
+ by the EM algorithm. The first try will use, as initial starting values for
+the posteriors, either equal values or uniform random variables, as
+determined by {\sl random.start}. All subsequent tries will use uniform
+random values as initial starting values for the posterior probabilities.
+\item{}{\bf max.haps.limit:\ }
+Maximum number of haplotypes for the input genotypes. Within
+{\sl haplo.em}, the first step is to try to allocate memory for the
+sum of the result of {\sl geno.count.pairs()}; if that exceeds
+{\sl max.haps.limit}, it starts by allocating {\sl max.haps.limit}. If that is
+exceeded in the progressive-insertion steps, the C function doubles
+the memory until it can no longer request more.
+\end{itemize}
+
+One reason to adjust control parameters is to help find the global
+maximum of the log-likelihood, which can be difficult, particularly for
+small sample sizes and many possible haplotypes. Different
+maximizations of the log-likelihood may result in different results from
+{\sl haplo.em}, {\sl haplo.score}, or {\sl haplo.glm} when
+rerunning the analyses. The algorithm uses multiple attempts to maximize the
+log-likelihood, starting each attempt with random starting values.
+To increase the chance of finding the global maximum of the
+log-likelihood, the user can increase the number of attempts
+({\sl n.try}), increase the batch size ({\sl insert.batch.size}), or
+decrease the trimming threshold for posterior probabilities
+({\sl min.posterior}).
+
+Another reason to adjust control parameters is when the algorithm runs
+out of memory because there are too many haplotypes. If
+{\sl max.haps.limit} is exceeded when a batch of markers is added,
+the algorithm requests twice as much memory until it runs out. One option is
+to set {\sl max.haps.limit} to a different value, either to make {\sl haplo.em}
+request more memory initially, or to request more memory in smaller
+chunks. Another solution is to make the algorithm trim the number of
+haplotypes more aggressively by decreasing {\sl insert.batch.size} or
+increasing {\sl min.posterior}. Any changes to these parameters should be made
+with caution, and not drastically different from the default values.
+For instance, the default for {\sl min.posterior} used to be $1e-7$,
+and in some rare circumstances with many markers in only moderate linkage
+disequilibrium, some subjects had all their possible haplotype pairs
+trimmed. The default is now set at $1e-9$, and we recommend not
+increasing {\sl min.posterior} much greater than $1e-7$.
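+
+\noindent A memory-related adjustment might be sketched as follows (the
+values shown are illustrative only, not recommendations):
+
+<<emmemory, echo=TRUE, eval=FALSE>>=
+# request memory in smaller chunks and trim haplotype pairs more aggressively
+save.em.mem <- haplo.em(geno=geno, locus.label=label, miss.val=c(0, NA),
+          control = haplo.em.control(max.haps.limit = 1e6,
+                                     min.posterior = 1e-7))
+@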
+
+The example below gives the command for increasing the number of tries
+to $20$, and the batch size to $2$, since not much more can be done
+for three markers.
+
+<<emcontrol, echo=TRUE, eval=FALSE>>=
+# demonstrate only the syntax of control parameters
+save.em.try20 <- haplo.em(geno=geno, locus.label=label, miss.val=c(0, NA),
+ control = haplo.em.control(n.try = 20, insert.batch.size=2))
+@
+
+\subsection{Haplotype Frequencies by Group Subsets}
+To compute the haplotype frequencies for each level of a grouping
+variable, use the function {\sl haplo.group}. The following example
+illustrates the use of a binary response, {\sl y.bin}, based on {\sl resp.cat},
+which splits the subjects into two groups.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<groupBin, echo=TRUE >>=
+## run haplo.em on sub-groups
+## create ordinal and binary variables
+y.bin <- 1*(resp.cat=="low")
+group.bin <- haplo.group(y.bin, geno, locus.label=label, miss.val=0)
+print(group.bin, nlines=15)
+@
+
+\Explain The {\sl group.bin} object can be very large, depending on the number
+of possible haplotypes, so only a portion of the output is illustrated above
+(limited again by {\sl nlines}). The first section gives a short summary of
+how many subjects appear in each of the groups. The second section is a table
+with the following columns:
+\begin{itemize}
+\item The first column gives row numbers.
+\item The next columns (3 in this example) illustrate the alleles of
+the haplotypes.
+\item {\sl Total} are the estimated haplotype frequencies for the entire
+data set.
+\item The last columns are the estimated haplotype frequencies for
+the subjects in the levels of the group variable ({\sl y.bin=0} and
+{\sl y.bin=1} in this example). Note that some haplotype frequencies
+have an {\sl NA}, which appears when the haplotypes do not occur in the
+subgroups.
+\end{itemize}
+
+\section{Power and Sample Size for Haplotype Association Studies\label{hapPower}}
+
+In some circumstances, using haplotypes gives greater power than single
+markers to detect genetic association. There is little
+guidance, however, for determining sample size and power under
+different circumstances, some of which include marker type, dominance,
+and effect size. The \package\ package now includes functions to calculate
+sample size and power for haplotype association studies, which are flexible
+enough to handle these varied circumstances.
+
+Based on work in Schaid 2005\cite{Schaid 2005}, we can take a set of
+haplotypes with their population frequencies, assign a risk to a subset
+of the haplotypes, then determine either the sample size to achieve a
+stated power, or the power for a stated sample size. Sample size and power
+can be calculated for either quantitative traits or case-control studies.
+
+\subsection{Quantitative Traits: {\sl haplo.power.qt}\label{powerQT}}
+
+We assume that quantitative traits will be modeled by a linear
+regression. Some well-known tests for association between haplotypes and
+the trait include score statistics\cite{Schaid et al. 2002} and an
+F-test\cite{Zaykin}. For both types of tests, power depends on the
+amount of variance in the trait that is explained by haplotypes, or a
+multiple correlation coefficient, $R^{2}$. Rather than specifying the
+haplotype coefficients directly, we calculate the vector of
+coefficients based on an $R^{2}$ value.
+
+In the example below, we load an example set of haplotypes that contain 5
+markers, and specify the indices of the at-risk haplotypes; in this case,
+any haplotype with allele $1$ at the 2nd and 3rd markers.
+We set the first haplotype (most common) as the baseline.
+With these values we calculate the vector of coefficients for haplotype
+effects from {\sl find.haplo.beta.qt} using an $R^{2}=0.01$.
+Next, we use {\sl haplo.power.qt} to calculate the sample size for the
+set of haplotypes and their coefficients, type-I error (alpha) set to $0.05$,
+power at 80\%, and the same mean and variance used to get haplotype coefficients.
+Then we use the sample size needed for 80\% power for un-phased haplotypes
+($2,826$) to get the power for both phased and un-phased haplotypes.
+
+<<hapPowerQT, echo=TRUE >>=
+
+# load a set of haplotypes (hap-1 from Schaid 2005)
+
+data(hapPower.demo)
+
+#### an example using save.em hla markers may go like this.
+# keep <- which(save.em$hap.prob > .004) # get an index of non-rare haps
+# hfreq <- save.em$hap.prob[keep]
+# hmat <- save.em$haplotype[keep,]
+# hrisk <- which(hmat[,1]==31 & hmat[,2]==11) # contains 3 haps with freq=.01
+# hbase <- 4 # 4th hap has max freq of .103
+####
+
+## separate the haplotype matrix and the frequencies
+hmat <- hapPower.demo[,-6]
+hfreq <- hapPower.demo[,6]
+
+## Define risk haplotypes as those with "1" allele at loc2 and loc3
+hrisk <- which(hmat$loc.2==1 & hmat$loc.3==1)
+
+# define index for baseline haplotype
+hbase <- 1
+
+hbeta.list <- find.haplo.beta.qt(haplo=hmat, haplo.freq=hfreq, base.index=hbase,
+ haplo.risk=hrisk, r2=.01, y.mu=0, y.var=1)
+hbeta.list
+
+ss.qt <- haplo.power.qt(hmat, hfreq, hbase, hbeta.list$beta,
+ y.mu=0, y.var=1, alpha=.05, power=.80)
+ss.qt
+
+power.qt <- haplo.power.qt(hmat, hfreq, hbase, hbeta.list$beta,
+ y.mu=0, y.var=1, alpha=.05, sample.size=2826)
+power.qt
+@
+
+\subsection{Case-Control Studies: {\sl haplo.power.cc} \label{powerCC}}
+
+The steps to compute sample size and power for case-control studies are
+similar to the steps for quantitative traits. If we assume a
+log-additive model for haplotype effects, the haplotype
+coefficients can be specified first as odds ratios (OR), and then converted
+to logistic regression coefficients according to $log(OR)$.
+
+In the example below, we assume the same baseline and risk haplotypes
+defined in section~\ref{powerQT}, give
+the risk haplotypes an odds ratio of 1.50, and specify a population disease
+prevalence of 10\%. We also assume cases make up 50\% ({\sl case.frac})
+of the study's subjects. We first compute the sample size for this scenario for
+Type-I error (alpha) at $0.05$ and 80\% power, and then compute power for the
+sample size required for un-phased haplotypes ($4,566$).
+
+<<hapPowerCC, echo=TRUE >>=
+
+## get power and sample size for a case-control study
+## get beta vector based on odds ratios
+
+cc.OR <- 1.5
+
+# determine beta regression coefficients for risk haplotypes
+
+hbeta.cc <- numeric(length(hfreq))
+hbeta.cc[hrisk] <- log(cc.OR)
+
+# Compute sample size for stated power
+
+ss.cc <- haplo.power.cc(hmat, hfreq, hbase, hbeta.cc, case.frac=.5, prevalence=.1,
+ alpha=.05, power=.8)
+ss.cc
+
+# Compute power for given sample size
+
+power.cc <- haplo.power.cc(hmat, hfreq, hbase, hbeta.cc, case.frac=.5, prevalence=.1,
+ alpha=.05, sample.size=4566)
+power.cc
+@
+
+
+\section{Haplotype Score Tests: {\sl haplo.score} \label{hapScore}}
+
+The {\sl haplo.score} routine is used to compute score statistics to test
+association between haplotypes and a wide variety of traits,
+including binary, ordinal, quantitative, and Poisson. This function
+provides several different global and haplotype-specific tests for
+association and allows for adjustment for non-genetic covariates.
+Haplotype effects can be specified as additive,
+dominant, or recessive. This method also has an option to compute
+permutation p-values, which may be needed for sparse data when
+distribution assumptions may not be met. Details on the background and
+theory of the score statistics can be found in
+Schaid et al.\cite{Schaid et al. 2002}.
+
+\subsection{Quantitative Trait Analysis}
+First, we assess a haplotype association with a quantitative trait in
+{\sl hla.demo} called {\sl resp}. To tell {\sl haplo.score} the trait is
+quantitative, specify the parameter {\sl trait.type="gaussian"}
+(a reminder that a gaussian distribution is assumed for the error terms). The other
+arguments, all set to default values, are explained in the help file. Note that
+rare haplotypes can result in unstable variance estimates, and hence
+unreliable test statistics for rare haplotypes. We restrict the
+analysis to get scores for haplotypes with a minimum sample count using
+{\sl min.count=5}. For more explanation on handling rare haplotypes, see
+section~\ref{freqmin}. Below is an example of running {\sl haplo.score} with
+a quantitative trait, then viewing the results using the {\sl print}
+method for the {\sl haplo.score} class (again, output shortened by {\sl nlines}).
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scoregauss, echo=TRUE>>=
+# score statistics w/ Gaussian trait
+score.gaus.add <- haplo.score(resp, geno, trait.type="gaussian",
+ min.count=5,
+ locus.label=label, simulate=FALSE)
+print(score.gaus.add, nlines=10)
+@
+
+\Explain First, the model effect chosen by {\sl haplo.effect} is
+printed across the top. The section {\sl Global Score Statistics} shows results for
+testing an overall association between haplotypes and the response. The
+{\sl global-stat} has an asymptotic $\chi^{2}$ distribution, with degrees of
+freedom ({\sl df}) and {\sl p-value} as indicated. Next, {\sl Haplotype-specific
+scores} are given in a table format. The column descriptions are as follows:
+\begin{itemize}
+\item{} The first column gives row numbers.
+\item{} The next columns (3 in this example) illustrate the alleles of
+the haplotypes.
+\item{} {\sl Hap-Freq} is the estimated frequency of the haplotype in the
+pool of all subjects.
+\item{} {\sl Hap-Score} is the score for the haplotype, the results
+ are sorted by this value. Note, the score statistic should not be
+ interpreted as a measure of the haplotype effect.
+\item{} {\sl p-val} is the asymptotic $\chi^{2}_1$ p-value, calculated
+ from the square of the score statistic.
+\end{itemize}
+
+
+\subsection{Binary Trait Analysis}
+Let us assume that ``low'' responders are of primary interest, so we create
+a binary trait that has values of 1 when {\sl resp.cat} is ``low'', and 0 otherwise.
+Then in {\sl haplo.score} specify the parameter {\sl trait.type="binomial"}.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scorebinom, echo=TRUE>>=
+# scores, binary trait
+y.bin <- 1*(resp.cat=="low")
+score.bin <- haplo.score(y.bin, geno, trait.type="binomial",
+ x.adj = NA, min.count=5,
+ haplo.effect="additive", locus.label=label,
+ miss.val=0, simulate=FALSE)
+print(score.bin, nlines=10)
+@
+
+\subsection{Ordinal Trait Analysis}
+To create an ordinal trait, here we convert {\sl resp.cat} (described above) to
+numeric values, {\sl y.ord} (with levels 1, 2, 3). For {\sl haplo.score},
+use {\sl y.ord} as the response variable, and set the parameter
+{\sl trait.type = "ordinal"}.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scoreOrd, echo=TRUE >>=
+# scores w/ ordinal trait
+y.ord <- as.numeric(resp.cat)
+score.ord <- haplo.score(y.ord, geno, trait.type="ordinal",
+ x.adj = NA, min.count=5,
+ locus.label=label,
+ miss.val=0, simulate=FALSE)
+print(score.ord, nlines=7)
+@
+
+\noindent{\bf \large Warning for Ordinal Traits \vspace{.06in}\\}
+
+When analyzing an ordinal trait with adjustment for covariates (using
+the {\sl x.adj} option), the software requires the {\sl rms} package,
+distributed by Frank Harrell \cite{Harrell}. If the user
+does not have this package installed, it will not be possible
+to use the {\sl x.adj} option. However, the unadjusted scores
+for an ordinal trait (using the default option {\sl x.adj=NA}) do not
+require it. Check which packages are installed locally by entering
+{\sl library()} at the R prompt.
+
+
+\subsection{Haplotype Scores, Adjusted for Covariates}
+To adjust for covariates in {\sl haplo.score}, first set up a
+matrix of covariates from the example data. For example, use
+a column for male (1 if male; 0 if female), and
+a second column for age. Then pass the matrix to {\sl haplo.score}
+using parameter {\sl x.adj}. The results change slightly in this example.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scoreAdj, echo=TRUE >>=
+# score w/gaussian, adjusted by covariates
+x.ma <- cbind(male, age)
+score.gaus.adj <- haplo.score(resp, geno, trait.type="gaussian",
+ x.adj = x.ma, min.count=5,
+ locus.label=label, simulate=FALSE)
+print(score.gaus.adj, nlines=10)
+@
+
+
+\subsection{Plots and Haplotype Labels}
+A convenient way to view results from {\sl haplo.score} is a plot of
+the haplotype frequencies ({\sl Hap-Freq}) versus the haplotype score
+statistics ({\sl Hap-Score}). This plot, and the syntax for creating
+it, are shown in Figure~\ref{scorePlot}.
+
+Some points on the plot may be of interest. To identify
+individual points on the plot, use \mbox{\sl locator.haplo(score.gaus.add)},
+which is similar to {\sl locator()}. Use the mouse to select points on
+the plot. After points are chosen, click on the middle
+mouse button, and the points are labeled with their haplotype labels.
+Note, in constructing Figure~\ref{scorePlot}, we had to define which points
+to label, and then assign labels in the same way as done within the
+{\sl locator.haplo} function.
+
+\begin{figure}[h]
+ \begin{center}
+<<scorePlot, fig=TRUE, echo=TRUE>>=
+## plot score vs. frequency, gaussian response
+plot(score.gaus.add, pch="o")
+
+## locate and label pts with their haplotypes
+## works similar to locator() function
+#> pts.haplo <- locator.haplo(score.gaus.add)
+
+pts.haplo <- list(x.coord=c(0.05098, 0.03018, .100),
+ y.coord=c(2.1582, 0.45725, -2.1566),
+ hap.txt=c("62:2:7", "51:1:35", "21:3:8"))
+
+text(x=pts.haplo$x.coord, y=pts.haplo$y.coord, labels=pts.haplo$hap.txt)
+@
+ \caption{Haplotype Statistics: Score vs. Frequency, Quantitative
+ Response} \label{scorePlot}
+ \end{center}
+\end{figure}
+\clearpage
+
+\subsection{Skipping Rare Haplotypes\label{freqmin}}
+For {\sl haplo.score}, the {\sl skip.haplo} and {\sl min.count}
+parameters control which rare haplotypes are pooled
+into a common group. The {\sl min.count} parameter is a more recent
+addition to {\sl haplo.score}; it performs the same task as
+{\sl skip.haplo}, and is the same idea as {\sl haplo.min.count} used in
+{\sl haplo.glm.control} for {\sl haplo.glm}. As a guideline, you may wish
+to set {\sl min.count} to calculate scores for haplotypes with expected
+haplotype counts of $5$ or greater in the sample. We concentrate on
+this expected count because it adjusts to the size of the input data.
+If $N$ is the number of subjects and $f$ the haplotype frequency,
+then the expected haplotype count is \mbox{$count=2 \times N \times f$}.
+Alternatively, you can choose \mbox{$skip.haplo= \frac{count}{2 \times N}$}.
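+
+The conversion above can be checked directly in R. This small sketch is not
+part of the package's own examples; it assumes the $N=220$ subjects of
+{\sl hla.demo}, and converts a minimum expected count to a frequency
+cut-off and back:
+
+<<minCountSketch, echo=TRUE, eval=FALSE>>=
+## convert a minimum expected count to a skip.haplo frequency, N=220
+N <- 220
+min.count <- 5
+min.count / (2 * N)     # skip.haplo cut-off, about 0.0114
+
+## and back: expected count for a haplotype with frequency f
+f <- 0.02
+2 * N * f               # expected haplotype count, 8.8
+@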
+
+In the following example we try a different cut-off than before,
+{\sl min.count=10}, which corresponds to a {\sl skip.haplo}
+of $10 \div (2 \times 220) = .023$.
+In the output, note that the global statistic, degrees of freedom, and
+p-value change because fewer haplotypes are scored, while the haplotype-specific
+scores do not change.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scoremin10, echo=TRUE>>=
+# increase min.count so expected hap counts >= 10
+score.gaus.min10 <- haplo.score(resp, geno, trait.type="gaussian",
+ x.adj = NA, min.count=10,
+ locus.label=label, simulate=FALSE)
+print(score.gaus.min10)
+@
+
+\subsection{Score Statistic Dependencies:
+ the {\sl eps.svd} parameter \label{scoreEPS}}
+
+The global score test is calculated using the vector of scores and
+the generalized inverse of their variance-covariance matrix,
+performed by the {\sl Ginv} function. This function
+determines the rank of the variance matrix by its singular value
+decomposition, and an epsilon value is used as the cut-off for small
+singular values. If all of the haplotypes in the sample are scored,
+then there is dependence between them and the variance matrix is
+not of full rank. More often, however, one or more
+rare haplotypes are not scored because of low frequency. It is then not
+clear how strong the dependencies are between the remaining score statistics,
+and likewise, the rank of the variance matrix is ambiguous. For these
+instances we give the user control over the epsilon
+parameter for {\sl haplo.score} with {\sl eps.svd}.
+
+We have seen instances where the global score test had a very
+significant p-value, but none of the haplotype-specific scores showed strong
+association. In such instances, we found the default epsilon value
+in {\sl Ginv} was incorrectly considering the variance matrix as
+having full rank, and the misleading global score test was corrected when
+we increased epsilon for {\sl Ginv}. We now set the default for {\sl eps.svd}
+at $1e-5$, which seems to perform better in the troublesome circumstances
+than the previous default of $1e-6$.
+
+
+\subsection{Haplotype Model Effect \label{scoreEffect} }
+A recent addition to {\sl haplo.score} is the ability to select
+non-additive effects to score haplotypes. The possible effects for haplotypes
+are additive, dominant, and recessive. Under recessive effects, fewer
+haplotypes may be scored, because subjects are required
+to be homozygous for haplotypes. Furthermore, there would have
+to be {\sl min.count} such persons in the sample to have the recessive
+effect scored. Therefore, a recessive model should
+only be used on samples with common haplotypes. In the example below with
+the gaussian response, set the haplotype effect to dominant using the parameter
+{\sl haplo.effect = ``dominant''}. Notice the results change slightly
+compared to the {\sl score.gaus.add} results above.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scoreDom, echo=TRUE >>=
+# score w/gaussian, dominant effect
+
+score.gaus.dom <- haplo.score(resp, geno, trait.type="gaussian",
+ x.adj=NA, min.count=5,
+ haplo.effect="dominant", locus.label=label,
+ simulate=FALSE)
+print(score.gaus.dom, nlines=10)
+@
+
+\subsection{Simulation p-values \label{scoreSim}}
+When {\sl simulate=TRUE}, {\sl haplo.score} gives simulated p-values.
+Simulated haplotype score statistics are re-computed from a permuted
+re-ordering of the trait and covariates, while the ordering of the
+genotype matrix is left unchanged. The simulated p-value
+for the global score statistic
+(\mbox{\sl Global sim. p-val}) is the number of times the simulated
+global score statistic exceeds the observed, divided by the total
+number of simulations. Likewise, simulated p-value for the maximum
+score statistic (\mbox{\sl Max-stat sim. p-val}) is the number of
+times the simulated maximum haplotype score statistic exceeds the observed
+maximum score statistic, divided by the total number of simulations.
+The maximum score statistic is the maximum of the square of the
+haplotype-specific score statistics, which has an unknown distribution, so its
+significance can only be given by the simulated p-value.
+Intuitively, if only one or two haplotypes are associated with
+the trait, the maximum score statistic should have greater power to
+detect association than the global statistic.
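+
+The permutation idea can be sketched in a few lines of plain R. This toy
+example is not {\sl haplo.score} itself: it uses a simple correlation
+statistic and simulated data purely to illustrate how permuting the trait,
+while leaving the predictor fixed, yields a simulated p-value:
+
+<<permSketch, echo=TRUE, eval=FALSE>>=
+## toy permutation p-value (illustration only, not haplo.score)
+set.seed(1)
+x <- rnorm(100)
+y <- 0.2 * x + rnorm(100)
+stat <- function(y, x) abs(cor(x, y))   # stand-in for a score statistic
+obs <- stat(y, x)
+sim <- replicate(1000, stat(sample(y), x))
+mean(sim >= obs)                        # simulated p-value
+@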
+
+The {\sl score.sim.control} function manages control parameters for
+simulations. The {\sl haplo.score} function employs the simulation p-value
+precision criteria of Besag and Clifford\cite{Besag and Clifford 1991}.
+These criteria ensure that the simulated p-values for both the global
+and the maximum score statistics are precise for small p-values.
+The algorithm performs a user-defined minimum number of permutations
+({\sl min.sim}) to guarantee sufficient precision for the simulated
+p-values for score statistics of individual haplotypes. Permutations
+beyond this minimum are then conducted until the sample standard
+errors for simulated p-values for both the {\sl global-stat} and {\sl
+ max-stat} score statistics are less than a threshold
+\mbox{({\sl p.threshold * p-value})}. The default value for
+\mbox{{\sl p.threshold= $\frac{1}{4}$}} provides a two-sided 95\%
+confidence interval for the p-value with a width that is approximately
+as wide as the p-value itself. Effectively, simulations are
+more precise for smaller p-values. The following example illustrates
+computation of simulation p-values with {\sl min.sim=1000}.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<scorePerm, echo=TRUE >>=
+# simulations when binary response
+score.bin.sim <- haplo.score(y.bin, geno, trait.type="binomial",
+ x.adj = NA, locus.label=label, min.count=5,
+ simulate=TRUE, sim.control = score.sim.control() )
+print(score.bin.sim)
+@
+
+\section{Regression Models: {\sl haplo.glm} \label{hapGLM}}
+The {\sl haplo.glm} function computes the regression of a trait on
+haplotypes, and possibly other covariates and their interactions with
+haplotypes. We currently support the gaussian, binomial, and Poisson
+families of traits with their canonical link functions. The effects
+of haplotypes on the link function can be modeled as either additive,
+dominant (heterozygotes and homozygotes for a particular haplotype
+assumed to have equivalent effects), or recessive (homozygotes of a
+particular haplotype considered to have an alternative effect on the
+trait). The basis of the algorithm is a two-step iteration process;
+the posterior probabilities of pairs of haplotypes per subject are
+used as weights to update the regression coefficients, and the
+regression coefficients are used to update the haplotype posterior
+probabilities. See Lake et al.\cite{Lake et al. 2003} for details.
+
+\subsection{New and Updated Methods for haplo.glm}
+
+We initially wrote {\sl haplo.glm} without much effort to make it work with
+R's glm class methods. We have now refined the {\sl haplo.glm} class to
+look and act as much like a glm object as possible, with methods defined
+specifically for the {\sl haplo.glm} class. We provide print and
+summary methods that make use of the corresponding methods
+for glm and then add extra information for the haplotypes and their
+frequencies. Furthermore, we have defined for the {\sl haplo.glm} class
+some of the standard methods for
+regression fits, including residuals, fitted.values, vcov, and anova. We
+describe the challenges that haplotype regression presents with these methods
+in section~\ref{glmMethods}.
+
+\subsection{Preparing the {\sl data.frame} for {\sl haplo.glm} \label{glmSetup}}
+A critical distinction between {\sl haplo.glm} and all other functions
+in Haplo Stats is that the definition of the regression model follows
+the S/R formula standard (see {\sl lm} or {\sl glm}). So, a
+{\sl data.frame} must be defined, and this object must contain the trait and
+other optional covariates, plus a special kind of genotype matrix
+({\sl geno.glm} for this example) that contains the genotypes of the
+marker loci. We require the genotype matrix to be prepared using
+{\sl setupGeno()}, which handles character, numeric, or factor alleles,
+and keeps the columns of the genotype matrix as a single unit when
+inserting into (and extracting from) a {\sl data.frame}. The
+{\sl setupGeno} function recodes all missing genotype value codes given
+by {\sl miss.val} to NA, and also recodes alleles to integer values.
+The original allele codes are preserved within an attribute of {\sl geno.glm},
+and are utilized within {\sl haplo.glm}. The returned object has class
+{\sl model.matrix}, and it can be included in a {\sl data.frame} to
+be used in {\sl haplo.glm}. In the example below we prepare a
+genotype matrix, {\sl geno.glm}, and create a {\sl data.frame} object,
+{\sl glm.data}, for use in {\sl haplo.glm}.
+
+<<glmGeno, echo=TRUE>>=
+# set up data for haplo.glm, include geno.glm,
+# covariates age and male, and responses resp and y.bin
+geno <- hla.demo[,c(17,18,21:24)]
+geno.glm <- setupGeno(geno, miss.val=c(0,NA), locus.label=label)
+attributes(geno.glm)
+y.bin <- 1*(resp.cat=="low")
+glm.data <- data.frame(geno.glm, age=age, male=male, y=resp, y.bin=y.bin)
+@
+
+\subsection{Rare Haplotypes \label{rareHap}}
+The issue of deciding which haplotypes to use for association is
+critical in {\sl haplo.glm}. By default it will model a rare
+haplotype effect so that the effects of the other haplotypes are in
+reference to the baseline effect of the most common haplotype. The
+rules for choosing haplotypes to be modeled in {\sl haplo.glm} are
+similar to the rules in {\sl haplo.score}: by a minimum frequency or a
+minimum expected count in the sample.
+
+Two control parameters in {\sl haplo.glm.control} may be used to
+control this setting: {\sl haplo.freq.min} may be set to a selected
+minimum haplotype frequency, and {\sl haplo.min.count} may be set to select
+the cut-off for minimum expected haplotype count in the sample. The
+default minimum frequency cut-off in {\sl haplo.glm} is set to
+$0.01$. More discussion on rare haplotypes takes place in
+section~\ref{controlRare}.
+
+\subsection{Regression for a Quantitative Trait \label{glmQuant}}
+
+The following illustrates how to fit a regression of a quantitative
+trait {\sl y} on the haplotypes estimated from the {\sl geno.glm} matrix,
+and the covariate {\sl male}. For {\sl na.action}, we use
+{\sl na.geno.keep}, which keeps a subject with missing values in the genotype
+matrix if they are not missing all alleles, but removes subjects
+with missing values ({\sl NA}) in either the response or covariate.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmGauss, echo=TRUE>>=
+# glm fit with haplotypes, additive gender covariate on gaussian response
+fit.gaus <- haplo.glm(y ~ male + geno.glm, family=gaussian, data=glm.data,
+ na.action="na.geno.keep", locus.label = label, x=TRUE,
+ control=haplo.glm.control(haplo.freq.min=.02))
+
+summary(fit.gaus)
+@
+
+\Explain The new summary function for {\sl haplo.glm} shows much the same
+information as summary for glm objects with the extra table for the haplotype
+frequencies. The above table for {\sl Coefficients} lists the estimated regression
+coefficients ({\sl coef}), standard errors ({\sl se}), the
+corresponding t-statistics ({\sl t.stat}), and p-values ({\sl pval}).
+The labels for haplotype coefficients are a concatenation of the name of the
+genotype matrix ({\sl geno.glm}) and unique haplotype codes assigned
+within {\sl haplo.glm}. The haplotypes corresponding to these haplotype
+codes are listed in the {\sl Haplotypes} table, along with the estimates of the
+haplotype frequencies ({\sl hap.freq}). The rare haplotypes, those with
+expected counts less than \mbox{\sl haplo.min.count=5} (equivalent to
+having frequencies less than \mbox{{\sl haplo.freq.min =} %.01136
+\Sexpr{5/(2*nrow(glm.data))}} in the
+above example), are pooled into a single category labeled {\sl geno.glm.rare}.
+The haplotype chosen as the baseline category for the design matrix (most
+frequent haplotype is the default) is labeled as {\sl haplo.base}; more
+information on the baseline may be found in section \ref{glmBaseline}.
+
+
+\subsection{Fitting Haplotype x Covariate Interactions}
+Interactions are fit by the standard S-language model syntax, using
+a '$*$' in the model formula to indicate main effects and interactions.
+Some other formula constructs are not supported, so use the formula
+parameter with caution. Below is an example of modeling the
+interaction of {\sl male} and the haplotypes. Because more terms will
+be estimated in this case, we limit how many haplotypes will be
+included by increasing {\sl haplo.min.count} to 10.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmInter, echo=TRUE >>=
+# glm fit haplotypes with covariate interaction
+fit.inter <- haplo.glm(formula = y ~ male * geno.glm,
+ family = gaussian, data=glm.data,
+ na.action="na.geno.keep",
+ locus.label = label,
+ control = haplo.glm.control(haplo.min.count = 10))
+summary(fit.inter)
+@
+
+\Explain The listed results are as explained under section~\ref{glmQuant}.
+The main difference is that the interaction coefficients are labeled as a
+concatenation of the covariate ({\sl male} in this example) and the
+name of the haplotype, as described above. In addition, estimates may
+differ because the model has changed.
+
+\subsection{Regression for a Binomial Trait \label{glmBinom}}
+Next we illustrate the fitting of a binomial trait with the same
+genotype matrix and covariate.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmBinom, echo=TRUE>>=
+# gender and haplotypes fit on binary response,
+# return model matrix
+fit.bin <- haplo.glm(y.bin ~ male + geno.glm, family = binomial,
+ data=glm.data, na.action = "na.geno.keep",
+ locus.label=label,
+ control = haplo.glm.control(haplo.min.count=10))
+summary(fit.bin)
+@
+
+\Explain The underlying methods for {\sl haplo.glm} are based on a prospective
+likelihood. Normally, this type of likelihood works well for
+case-control studies with standard covariates. For ambiguous
+haplotypes, however, one needs to be careful when interpreting
+the results from fitting {\sl haplo.glm} to case-control data. Because
+cases are over-sampled, relative to the population prevalence (or
+incidence, for incident cases), haplotypes associated with disease
+will be over-represented in the case sample, and so estimates
+of haplotype frequencies will be biased. Positively associated
+haplotypes will have haplotype frequency estimates that are higher
+than the population haplotype frequency. To avoid this problem, one
+can weight each subject. The weights for the cases should be the
+population prevalence, and the weights for controls should be 1
+(assuming the disease is rare in the population, and controls are
+representative of the general population). See Stram et al.\cite{Stram et al.
+2003} for background on using weights, and see the help file for
+{\sl haplo.glm} for how to implement weights.
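+
+As a rough sketch of the weighting scheme just described (the prevalence
+value here is hypothetical, and {\sl help(haplo.glm)} documents the actual
+mechanism for supplying weights), the weight vector might be built as:
+
+<<ccWeightSketch, echo=TRUE, eval=FALSE>>=
+## hypothetical sketch: weight cases by an assumed population
+## prevalence and controls by 1 (Stram et al.); see help(haplo.glm)
+prevalence <- 0.1
+wts <- ifelse(y.bin == 1, prevalence, 1)
+@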
+
+The estimated regression coefficients for case-control studies can be
+biased by either a large amount of haplotype ambiguity and
+mis-specified weights, or by departures from Hardy-Weinberg
+Equilibrium of the haplotypes in the pool of cases and controls.
+Generally, the bias is small, but tends to be towards the null of no
+association. See Stram et al. \cite{Stram et al. 2003} and Epstein and
+Satten \cite{Epstein and Satten 2003} for further details.
+
+\subsubsection{Caution on Rare Haplotypes with Binomial Response \label{rarebinom}}
+If a rare haplotype occurs only in cases or only in controls, the
+fitted values go to 0 or 1, and R issues a
+warning. Also, the coefficient estimate for that haplotype diverges
+to positive or negative infinity. If the default {\sl haplo.min.count=5}
+were used above, this warning would appear. To keep this from
+occurring in other model fits, increase the minimum count or minimum frequency.
+
+\subsection{Control Parameters \label{glmControl}}
+Additional parameters are handled using {\sl control}, which is a list of
+parameters providing additional functionality in {\sl haplo.glm}. This list
+is set up by the function {\sl haplo.glm.control}. See the help
+file ({\sl help(haplo.glm.control)}) for a full list of control
+parameters, with details of their usage. Some of the options are
+described here.
+
+\subsubsection{Controlling Genetic Models: {\sl haplo.effect}}
+The {\sl haplo.effect} control parameter for {\sl haplo.glm}
+instructs whether the haplotype effects are fit as additive, dominant,
+or recessive. That is, {\sl haplo.effect} determines whether the
+covariate ($x$) coding of haplotypes follows the values in Table 1
+for each effect type. Heterozygous means a subject has one copy of a
+particular haplotype, and homozygous means a subject has two copies of
+a particular haplotype. \\ \\ \\ \\
+
+\noindent {\bf Table 1:} Coding haplotype covariates in a model matrix\\
+\begin{center}
+\begin{tabular}{|r|c|c|c|}
+\hline
+Hap - Pair & additive & dominant & recessive\\ \hline \hline
+Heterozygous & 1 & 1 & 0 \\ \hline
+Homozygous & 2 & 1 & 1 \\ \hline
+\end{tabular}
+\end{center}
+
+\noindent Note that in a recessive model, the haplotype effects are
+estimated only from subjects who are homozygous for a haplotype.
+Some of the haplotypes that meet the {\sl haplo.freq.min} and
+{\sl haplo.min.count} cut-offs may occur as homozygous in only a few of the
+subjects. As stated in section~\ref{scoreEffect}, recessive models should only
+be used when the region has multiple common haplotypes.
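+
+The codings in Table 1 can be reproduced from a subject's count of copies
+of a particular haplotype (0, 1, or 2); a brief sketch:
+
+<<hapCodingSketch, echo=TRUE, eval=FALSE>>=
+## x = copies of a given haplotype carried by each subject
+x <- c(0, 1, 2)
+additive  <- x                  # 0, 1, 2
+dominant  <- 1 * (x >= 1)       # 0, 1, 1
+recessive <- 1 * (x == 2)       # 0, 0, 1
+@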
+
+The default {\sl haplo.effect} is {\sl additive}, whereas the example below
+illustrates the fit of a {\sl dominant} effect of haplotypes for the
+gaussian trait with the gender covariate.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmDom, echo=TRUE >>=
+# control dominant effect of haplotypes (haplo.effect)
+# by using haplo.glm.control
+fit.dom <- haplo.glm(y ~ male + geno.glm, family = gaussian,
+ data = glm.data, na.action = "na.geno.keep",
+ locus.label = label,
+ control = haplo.glm.control(haplo.effect='dominant',
+ haplo.min.count=8))
+
+summary(fit.dom)
+@
+
+\subsubsection{Selecting the Baseline Haplotype \label {glmBaseline} }
+The haplotype chosen for the baseline in the model is the one with the
+highest frequency. Sometimes the most frequent haplotype may be an at-risk
+haplotype, and so the measure of its effect is desired. To specify a
+more appropriate haplotype as the baseline in the binomial example, choose
+from the list of other common haplotypes, {\sl fit.bin\$haplo.common}. To specify an
+alternative baseline, such as haplotype 77, use the control parameter
+{\sl haplo.base} and haplotype code, as in the example below.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmBaseline, echo=TRUE >>=
+# control baseline selection, perform the same exact run as fit.bin,
+# but different baseline by using haplo.base chosen from haplo.common
+fit.bin$haplo.common
+fit.bin$haplo.freq.init[fit.bin$haplo.common]
+fit.bin.base77 <- haplo.glm(y.bin ~ male + geno.glm, family = binomial,
+ data = glm.data, na.action = "na.geno.keep",
+ locus.label = label,
+ control = haplo.glm.control(haplo.base=77,
+ haplo.min.count=8))
+summary(fit.bin.base77)
+@
+
+\Explain The above model has the same haplotypes as
+{\sl fit.bin}, except haplotype $4$, the old baseline, now has an
+effect estimate while haplotype $77$ is the new baseline.
+Due to randomness in the starting values of the haplotype
+frequency estimation, different runs of {\sl haplo.glm} may result in a
+different set of haplotypes meeting the minimum counts requirement for being
+modeled. Therefore, once you have arrived at a suitable model, and you wish
+to modify it by changing baseline and/or effects, you can make results
+consistent by controlling the randomness using {\sl set.seed}, as described
+in section \ref{seed}. In this document, we use the same seed before
+making {\sl fit.bin} and {\sl fit.bin.base77}.
+
+
+\subsubsection{Rank of Information Matrix and eps.svd ({\bf NEW}) \label{glmSVD}}
+
+Similar to recent additions to {\sl haplo.score} in section~\ref{scoreEPS},
+we give the user control over the epsilon parameter determining the number of
+singular values when determining the rank of the information matrix in
+{\sl haplo.glm}. Finding the generalized inverse of this matrix can be
+problematic when either the response variable or a covariate has a large variance
+and is not scaled before being passed to {\sl haplo.glm}. The rank of the information
+matrix is determined by the number of singular values larger than a small cutoff,
+epsilon. When the singular values for the
+coefficients are on a larger numeric scale than those for the haplotype
+frequencies, the generalized inverse may incorrectly determine the information
+matrix is not of full rank. Therefore, we allow the user to specify the epsilon
+as {\sl eps.svd} in the control parameters for {\sl haplo.glm}.
+A simpler fix, which we strongly suggest, is for the user to pre-scale any
+continuous responses or covariates with a large variance.
+
+Here we demonstrate what happens when we multiply a gaussian
+response by $50$, increasing its variance by a factor of $2500$. We see that the
+coefficients are all highly significant
+and the rank of the information matrix is much smaller than for the original gaussian fit.
+
+<<eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+
+<<glmEPS1, eval=TRUE>>=
+glm.data$ybig <- glm.data$y*50
+fit.gausbig <- haplo.glm(formula = ybig ~ male + geno.glm, family = gaussian,
+ data = glm.data, na.action = "na.geno.keep", locus.label = label,
+ control = haplo.glm.control(haplo.freq.min = 0.02), x = TRUE)
+
+summary(fit.gausbig)
+
+fit.gausbig$rank.info
+fit.gaus$rank.info
+@
+
+Now we set a smaller value for the {\sl eps.svd} control parameter and find the fit
+matches the original Gaussian fit.
+
+<<echo=FALSE, eval=TRUE>>=
+set.seed(seed)
+@
+<<glmEPS2>>=
+
+fit.gausbig.eps <- haplo.glm(formula = ybig ~ male + geno.glm, family = gaussian,
+ data = glm.data, na.action = "na.geno.keep", locus.label = label,
+ control = haplo.glm.control(eps.svd=1e-10, haplo.freq.min = 0.02), x = TRUE)
+
+summary(fit.gausbig.eps)
+fit.gausbig.eps$rank.info
+@
+
+
+\subsubsection{Rare Haplotypes and {\sl haplo.min.info} \label{controlRare}}
+
+Another notable control parameter is the minimum frequency for a rare
+haplotype to be included in the calculations for standard error (se)
+of the coefficients, or {\sl haplo.min.info}. The default value is
+$0.001$, which means that haplotypes with frequency less
+than that will be part of the rare haplotype coefficient estimate,
+but it will not be used in the standard error calculation.
+
+The following example demonstrates a possible result when dealing with
+the rare haplotype effect. Using the hla genotype data, we show one
+consequence of this setting. We force it to happen by
+setting {\sl haplo.freq.min} equal to {\sl haplo.min.info}, a practice we
+strongly advise against in real analyses.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<glmRare2, echo=TRUE >>=
+## set haplo.freq.min and haplo.min.info to same value to show how the
+## rare coefficient may be modeled but standard error estimate is not
+## calculated because all haps are below haplo.min.info
+
+fit.bin.rare02 <- haplo.glm(y.bin ~ geno.glm, family = binomial,
+ data = glm.data, na.action = "na.geno.keep", locus.label = label,
+ control = haplo.glm.control(haplo.freq.min=.02, haplo.min.info=.02))
+summary(fit.bin.rare02)
+@
+
+\Explain The above results show the standard error for the rare
+haplotype coefficient is ``NaN'', or ``Not a Number'' in R, which is a
+consequence of having most, or all, of the rare haplotypes discarded for the
+standard error estimate. In other datasets there may be only a few
+haplotypes between {\sl haplo.min.info} and {\sl haplo.freq.min}, which
+may yield misleading results for the rare haplotype coefficient. For
+this reason, we recommend that any inference on the rare
+haplotypes be made with caution, if at all.
+
+
+\section{Methods for {\sl haplo.glm} {\bf (NEW)} \label{glmMethods}}
+
+The latest update to \package\ is our work to make {\sl haplo.glm}
+act more like a glm object, with methods to compare and assess model
+fits. In this section we describe the challenges and caveats of defining
+these methods for a {\sl haplo.glm} object and show how to use them.
+
+
+\subsection{fitted.values}
+
+A challenge when defining methods for {\sl haplo.glm} is accounting for
+the ambiguity in a person's haplotype pair. To handle this in the glm framework, the
+response and non-haplotype covariates are expanded for each person, with the
+posterior probability of each haplotype pair given the genotype used as a weight.
+The returned object from {\sl haplo.glm} looks somewhat like a regular glm,
+but the model matrix, response, and thus the fitted values, are all expanded.
+Users who want to work with the expanded versions of those items are welcome
+to access them from the returned object.
+
+We now provide a method to get the fitted values for each person,
+{\sl fitted.haplo.glm}. These collapsed fitted values are calculated
+by a weighted sum of the expanded fitted values for each person where
+the weights are the posterior probabilities of the person's expanded
+haplotype pairs.
+
+\subsection{residuals}
+
+The residuals within the {\sl haplo.glm} object are also expanded over the
+haplotype pairs for each subject. We provide {\sl residuals.haplo.glm} to
+get the collapsed deviance, Pearson, working, and response residuals for
+each person. Because we have not implemented a predict method for
+{\sl haplo.glm}, the method does not calculate partial residuals.
+
+\subsection{vcov}
+
+We provide {\sl vcov.haplo.glm} as a method to get the variance-covariance matrix of
+model parameters in the {\sl haplo.glm} object. Unlike the standard glm object, this
+matrix is computed and retained in the returned object, because the model
+parameters include both the model coefficients and the haplotype frequencies,
+and the matrix is computationally intensive to compute.
+
+We show how to get the variance matrix for all the parameters and for only the model
+coefficients.
+
+<<vcovGLM>>=
+
+varmat <- vcov(fit.gaus)
+dim(varmat)
+
+varmat <- vcov(fit.gaus, freq=FALSE)
+dim(varmat)
+print(varmat, digits=2)
+@
+
+\subsection{anova and Model Comparison}
+
+We use the {\sl anova.glm} method as a framework for {\sl anova.haplo.glm} to allow
+comparisons of model fits. We limit the model comparisons to multiple
+nested model fits, which requires that each model to be compared be either a
+{\sl haplo.glm} or {\sl glm} fitted object. We eliminate the functionality of testing
+sub-models of a single fit because removal of a single covariate would require
+re-fitting the reduced model to get updated coefficient and haplotype frequency
+estimates with a maximized log-likelihood. We decided to simplify the usage and
+require that all models to be compared are fully fitted.
+
+As with the {\sl anova.glm} method, it is difficult to check for truly nested models,
+so we pass the responsibility on to the user. We discuss some of the requirements.
+
+One type of two-model comparison is between models with haplotypes (expanded subjects)
+and a reduced model without haplotypes. We check for the same sample size in these models
+by comparing the collapsed sample size from a {\sl haplo.glm} fit to the sample size from
+the {\sl glm} fit, which we remind users is only a loose check of model comparability.
+
+The other comparison of two models in {\sl anova.haplo.glm} is to compare two models
+that contain the same genotypes, and inherently the same haplotypes. This is
+trickier because a subject may not have the same expanded set of possible haplotype pairs
+across two fits of {\sl haplo.glm} unless the same seed is set before both fits.
+Even if the seed is the same, the other effects that differ between the two models
+will affect the haplotype frequency estimates, and may still result in a different
+expansion of haplotype pairs per subject. Our check of the collapsed sample size for the
+two models still applies with the same pitfalls, but a better assurance of model
+comparability is to use the same seed.
+
+In the {\sl haplo.glm} fit we provide the likelihood ratio test of the null model
+against the full model, which is the most appropriate
+test available for {\sl haplo.glm} objects, but it is difficult to compare the log-likelihood
+across two {\sl haplo.glm} fits. Therefore, we remain consistent with
+glm model comparison \cite{McCullaghNelder}, and use the difference in deviance to
+compare models. Furthermore, we restrict the asymptotic test for model
+comparison to be the $\chi^{2}$ test for goodness of fit.
+
+Below we show how to get the LRT from the {\sl fit.gaus} result, then show how to
+compare some of the nested models fit above, including a regular
+glm fit of $y \sim male$. The anova method requires the nested model to be given first,
+and any anova with a {\sl haplo.glm} object should explicitly call {\sl anova.haplo.glm}.
+
+<<anovaGLM>>=
+fit.gaus$lrt
+
+glmfit.gaus <- glm(y~male, family="gaussian", data=glm.data)
+
+anova.haplo.glm(glmfit.gaus, fit.gaus)
+anova.haplo.glm(fit.gaus, fit.inter)
+anova.haplo.glm(glmfit.gaus, fit.gaus, fit.inter)
+@
+
+
+
+\section{Extended Applications}
+The following functions are designed to wrap the functionality of the
+major functions in Haplo Stats into other useful applications.
+
+\subsection{Combine Score and Group Results:
+{\sl haplo.score.merge}}
+When analyzing a qualitative trait, such as a binary trait, it can be helpful
+to align the results from {\sl haplo.score} with those from {\sl haplo.group}.
+To do so, use the function {\sl haplo.score.merge}, as illustrated in
+the following example:
+
+<<scoreMerge,echo=TRUE>>=
+# merge haplo.score and haplo.group results
+merge.bin <- haplo.score.merge(score.bin, group.bin)
+print(merge.bin, nlines=10)
+@
+
+\Explain The first column is a row index, the next columns (3 in this example)
+display the haplotype alleles, the {\sl Hap.Score} column is the score
+statistic, and {\sl p.val} is the corresponding $ \chi^{2} $ p-value.
+{\sl Hap.Freq} is the haplotype frequency for the total sample, and
+the remaining columns are the estimated haplotype frequencies
+for each of the group levels ({\sl y.bin} in this example). The default
+print method only prints results for haplotypes appearing in the
+{\sl haplo.score} output. To view all haplotypes, use the print option
+{\sl all.haps=TRUE}, which prints all haplotypes from the
+{\sl haplo.group} output. The output is ordered by the score
+statistic, but the {\sl order.by} parameter can specify ordering by
+haplotypes or by haplotype frequencies.
+
+
+\subsection{Case-Control Haplotype Analysis: {\sl haplo.cc} \label{hapCC} }
+It is possible to combine the results of {\sl haplo.score},
+{\sl haplo.group}, and {\sl haplo.glm} for case-control data, all
+performed within {\sl haplo.cc}. The function performs a score test and
+fits a glm on the same haplotypes. The parameters that determine which
+haplotypes are used are {\sl haplo.min.count} and {\sl haplo.freq.min},
+which are set in the {\sl control} parameter, as done for {\sl
+ haplo.glm}. This is a change from previous versions, where
+{\sl haplo.min.count} was in the parameter list for {\sl haplo.cc}.
+
+Below we run {\sl haplo.cc} setting the minimum haplotype
+frequency at $0.02$. The print results are shown, in addition to the
+names of the objects stored in the {\sl cc.hla} result.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<hapCC, echo=TRUE >>=
+
+# demo haplo.cc where haplo.min.count is specified
+# use geno, and this function prepares it for haplo.glm
+y.bin <- 1*(hla.demo$resp.cat=="low")
+cc.hla <- haplo.cc(y=y.bin, geno=geno, locus.label = label,
+ control=haplo.glm.control(haplo.freq.min=.02))
+
+print(cc.hla, nlines=25, digits=2)
+names(cc.hla)
+@
+
+\Explain First, from the names function we see that {\sl cc.hla} also contains
+{\sl score.lst} and {\sl fit.lst}, which are the {\sl haplo.score} and
+{\sl haplo.glm} objects, respectively.
+For the printed results of {\sl haplo.cc}, first are the global statistics
+from {\sl haplo.score}, followed by cell counts for cases and
+controls. The last portion of the output is a data frame containing combined
+results for individual haplotypes:
+
+\begin{itemize}
+\item{} {\bf Hap-Score:}{\ haplotype score statistic}
+\item{} {\bf p-val:}{\ haplotype score statistic p-value}
+\item{} {\bf sim p-val:}{\ (if simulations performed) simulated p-value for the haplotype score statistic}
+\item{} {\bf pool.hf:}{\ haplotype frequency for the pooled sample}
+\item{} {\bf control.hf:}{\ haplotype frequencies for the control sample only}
+\item{} {\bf case.hf:}{\ haplotype frequencies for the case sample only}
+\item{} {\bf glm.eff:}{\ one of three ways the haplotype appeared in the
+ glm model: {\sl Eff}: modeled as an effect; {\sl Base}: part of the
+ baseline; and {\sl R}: a rare haplotype, included in the effect of
+ pooled rare haplotypes}
+\item{} {\bf OR.lower:}{\ Odds Ratio confidence interval
+ lower limit}
+\item{} {\bf OR:}{\ Odds Ratio for each effect in the model}
+\item{} {\bf OR.upper:}{\ Odds Ratio confidence interval upper limit}
+\end{itemize}
+
+Significance levels are indicated by the p-values for the score statistics,
+and the odds ratio (OR) confidence intervals for the haplotype effects. Note
+that the Odds Ratios are effect sizes of haplotypes, assuming haplotype
+effects are multiplicative. Since this last table has many columns,
+lines are wrapped in the output in this manual. You can align wrapped
+lines by the haplotype code which appears on the far left.
+Alternatively, instruct the print function to only print {\sl digits}
+significant digits, and set the width settings for output in your session
+using the {\sl options()} function.
+
+
+\subsection{Score Tests on Sub-Haplotypes: {\sl haplo.score.slide}}
+To evaluate the association of sub-haplotypes (subsets of alleles from
+the full haplotype) with a trait, the user can evaluate a "window" of
+alleles by {\sl haplo.score}, and slide this window across the entire
+haplotype. This procedure is implemented by the function
+{\sl haplo.score.slide}. To illustrate this method, we use all 11 loci
+in the demo data, {\sl hla.demo}.
+
+First, make the geno matrix and the locus labels for the 11 loci.
+Then use {\sl haplo.score.slide} for a window of 3 loci
+({\sl n.slide=3}), which will slide along the haplotype for all 9
+contiguous subsets of size 3, using the previously defined gaussian trait
+{\sl resp}.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+
+<<label=scoreSlide, echo=TRUE, eval=TRUE >>=
+# haplo.score on 11 loci, slide on 3 consecutive loci at a time
+geno.11 <- hla.demo[,-c(1:4)]
+label.11 <- c("DPB","DPA","DMA","DMB","TAP1","TAP2","DQB","DQA","DRB","B","A")
+score.slide.gaus <- haplo.score.slide(hla.demo$resp, geno.11, trait.type =
+ "gaussian", n.slide=3, min.count=5, locus.label=label.11)
+print(score.slide.gaus)
+@
+
+\Explain The first column is the row index of the nine calls to
+{\sl haplo.score}, the second column is the number of the starting
+locus of the sub-haplotype, and the third column is the global score
+statistic p-value for each call. The last two columns are the
+simulated p-values for the global and maximum score statistics,
+respectively. If you specify {\sl simulate=TRUE} in the function
+call, the simulated p-values will be computed and shown.
+
+\subsubsection{Plot Results from {\sl haplo.score.slide}}
+The results from {\sl haplo.score.slide} can be easily viewed in a plot
+shown in Figure~\ref{slidePlot} below.
+The x-axis has tick marks for each locus, and the y-axis is the
+$ -log_{10}(pval) $. To select which p-value to plot, use the
+parameter pval, with choices "{\sl global}", "{\sl global.sim}", and
+"{\sl max.sim}" corresponding to p-values described above. If the
+simulated p-values were not computed, the default is to plot the
+global p-values. For each p-value, a horizontal line is drawn at the
+height of $ -log_{10}(pval) $\ across the loci over which it was
+calculated. For example, the p-value {\sl score.global.p = 0.009963}
+for loci 8-10 is plotted as a horizontal line at $ y=2.002 $
+spanning the $8^{th}$, $9^{th}$, and $10^{th}$ x-axis tick marks.
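As a quick arithmetic check of the plotted height for that example (a one-line
computation, shown in Python purely for illustration):

```python
import math

# height of the horizontal line for loci 8-10: -log10 of its global p-value
y = -math.log10(0.009963)
print(round(y, 3))  # 2.002
```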
+
+\begin{figure}[ht]
+<<label=plotSlide, fig=TRUE, echo=TRUE >>=
+# plot global p-values for sub-haplotypes from haplo.score.slide
+plot.haplo.score.slide(score.slide.gaus)
+@
+ \caption{Global p-values for sub-haplotypes;
+ Gaussian Response} \label{slidePlot}
+\end{figure}
+\clearpage
+
+\subsection{Scanning Haplotypes Within a Fixed-Width Window: {\sl haplo.scan}}
+Another method to search for a candidate locus within a genome region
+is {\sl haplo.scan}, an implementation of the method proposed in Cheng
+et al. 2005 \cite{Cheng2005}. This method searches for a region for which the haplotypes
+have the strongest association with a binary trait by sliding a window of fixed
+width over each marker locus, and then scans over all haplotype lengths within
+each window. This latter step, scanning over all possible haplotype
+lengths within a window, distinguishes {\sl haplo.scan} from
+{\sl haplo.score.slide} (which considers only the maximum haplotype
+length within a window). To account for unknown linkage phase, the
+function {\sl haplo.em} is called prior to scanning, to create a list of
+haplotype pairs and posterior probabilities. To illustrate the scanning
+window, consider a 10-locus dataset. When placing a window of
+width 3 over locus 5, the possible haplotype lengths that contain locus 5
+are three (loci 3-4-5, 4-5-6, and 5-6-7), two (loci 4-5 and 5-6) and one
+(locus 5).
+For each of these loci subsets a score statistic is computed, which is
+based on the difference between the mean vector of haplotype counts
+for cases and that for controls. The maximum of these score
+statistics, over all possible haplotype lengths within a window, is
+the locus-specific test statistic, or the locus scan statistic. The global
+test statistic is the maximum over all computed score statistics. To
+compute p-values, the case/control status is randomly permuted. Below we
+run {\sl haplo.scan} on the 11-locus HLA dataset with a binary
+response and a window width of $3$. Because all $11$ loci are highly
+polymorphic and have many missing alleles, we first use the results of
+{\sl summaryGeno} to restrict the analysis to subjects with fewer than
+$50,000$ possible haplotype pairs, to speed the calculations.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+
+<<hapScan, echo=TRUE, eval=TRUE >>=
+geno.11 <- hla.demo[,-c(1:4)]
+y.bin <- 1*(hla.demo$resp.cat=="low")
+hla.summary <- summaryGeno(geno.11, miss.val=c(0,NA))
+
+# track those subjects with too many possible haplotype pairs ( > 50,000)
+many.haps <- (1:length(y.bin))[hla.summary[,4] > 50000]
+
+# For speed, or even just so it will finish, make y.scan and geno.scan
+# for genotypes that don't have too many ambiguous haplotypes
+geno.scan <- geno.11[-many.haps,]
+y.scan <- y.bin[-many.haps]
+
+# scan haplotypes for regions within width of 3 for each locus.
+# test statistic measures difference in haplotype counts in cases and controls
+# p-values are simulated for each locus and the maximum statistic;
+# we do 100 simulations here, but default settings should be used for analysis
+
+scan.hla <- haplo.scan(y.scan, geno.scan, width=3,
+ sim.control=score.sim.control(min.sim=100, max.sim=100),
+ em.control=haplo.em.control())
+
+print(scan.hla)
+@
+
+\Explain In the output we report the simulated p-values for each
+locus test statistic. Additionally, we report the loci (or locus)
+which provided the maximum observed test statistic, and the
+{\sl Max-Stat Simulated Global p-value} is the simulated p-value for that
+maximum statistic. We print the number of simulations, because they
+are performed until p-value precision criteria are met, as described
+in section~\ref{scoreSim}. We would typically allow simulations to
+run under default parameters rather than limiting to $100$ by the
+control parameters.
+
+\subsection{Sequential Haplotype Scan Methods: {\sl seqhap} \label{seqhap}}
+
+Another approach for choosing loci for haplotype associations is by
+{\sl seqhap}, as described in Yu and Schaid, 2007 \cite{Yu}.
+The {\sl seqhap} method performs three tests for association
+of a binary trait over a set of bi-allelic loci. When evaluating each
+locus, loci close to it are added in a sequential manner based on the
+Mantel-Haenszel test \cite{Mantel}. For each marker locus, three tests
+are provided:
+
+\begin{itemize}
+\item{\bf single locus}, the traditional single-locus $\chi^{2}_1$ test
+ of association,
+\item{\bf sequential haplotype}, based on a haplotype test for sequentially
+ chosen loci,
+\item{\bf sequential sum}, based on the sum of a series of conditional
+ $\chi^{2}$ statistics.
+\end{itemize}
+
+All three tests are assessed for significance with permutation
+p-values, in addition to the asymptotic p-value. The point-wise p-value
+for a statistic at a locus is the fraction of times that the statistic
+for the permuted data is larger than that for the observed data. The
+regional p-value is the chance of observing a permuted test statistic,
+maximized over a region, that is greater than that for the observed
+data.
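Both permutation p-values are simple to state computationally. A sketch
(Python, illustrative only; the inputs {\sl obs}, the observed statistic per
locus, and {\sl perm}, one replicate per permutation, are hypothetical):

```python
def perm_pvalues(obs, perm):
    """Point-wise and regional permutation p-values.

    obs:  observed statistic at each locus.
    perm: one list per permuted replicate, parallel to obs."""
    n = len(perm)
    # point-wise: fraction of replicates whose statistic at locus i
    # exceeds the observed statistic at locus i
    pointwise = [sum(p[i] > obs[i] for p in perm) / n
                 for i in range(len(obs))]
    # regional: fraction of replicates whose maximum over the region
    # exceeds the observed maximum
    obs_max = max(obs)
    regional = sum(max(p) > obs_max for p in perm) / n
    return pointwise, regional

pw, reg = perm_pvalues(obs=[4.0, 1.0],
                       perm=[[5.0, 0.5], [3.0, 2.0],
                             [1.0, 0.2], [2.0, 6.0]])
# pw = [0.25, 0.5]; reg = 0.5
```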
+
+Similar to the permutation p-values in {\sl haplo.score} as described in
+section \ref{scoreSim}, permutations are performed until a precision threshold
+is reached for the regional p-values. A minimum and maximum number of
+permutations, specified in the {\sl sim.control} parameter list, ensure
+that a certain accuracy is met for every simulated p-value while imposing a
+limit to avoid excessive run-time.
+
+Below is an example of using {\sl seqhap} on data with case-control
+response for a chromosome region. First set up the binary response, y,
+with 0=control, 1=case, then a genotype matrix with two columns per
+locus, and a vector of chromosome positions. The genotype data is available
+in the {\sl seqhap.dat} dataset while the chromosome positions are in
+{\sl seqhap.pos}. The following example runs {\sl seqhap} with default
+settings for permutations and threshold parameters.
+
+<<seed, eval=TRUE, echo=FALSE>>=
+set.seed(seed)
+@
+<<seqhap, echo=TRUE, eval=TRUE>>=
+# define binary response and genotype matrix
+data(seqhap.dat)
+data(seqhap.pos)
+y <- seqhap.dat$disease
+geno <- seqhap.dat[,-1]
+# get vector with chrom position
+pos <- seqhap.pos$pos
+seqhap.out <- seqhap(y=y, geno=geno, pos=pos, miss.val=c(0,NA),
+ r2.threshold=.95, mh.threshold=3.84)
+seqhap.out$n.sim
+print(seqhap.out)
+@
+
+\Explain
+The output from this example first shows {\sl n.sim}, the number of
+permutations needed for precision on the regional p-values. Next, in
+the printed results, the first section ({\sl Single-locus Chi-square
+Test}) shows a table with columns for single-locus tests. The table
+includes test statistics, permuted p-values, and asymptotic p-values
+based on a $\chi^{2}_1$ distribution. The second section ({\sl Sequential Scan}) shows which loci are
+combined for association. In this example, the table shows the first locus is
+not combined with other loci, whereas the second locus is combined
+with loci 3, 4, and 5. The third section ({\sl Sequential Haplotype
+ Test}), shows the test statistics for the sequential haplotype method with
+degrees of freedom and permuted and asymptotic p-values. The fourth
+section ({\sl Sequential Sum Test}) shows similar information for the
+sequential sum tests.
+
+\subsubsection{Plot Results from {\sl seqhap}}
+The results from {\sl seqhap} can be viewed in a useful plot
+shown in Figure~\ref{seqhapPlot}. The plot is similar to the plot
+for {\sl haplo.score.slide} results: the x-axis has tick marks for all
+loci and the y-axis is the $ -log_{10}(pval) $\ for the tests performed.
+For the sequential result for each locus, a horizontal line at the
+height of $ -log_{10}(pval) $\ is drawn across the loci combined. The start
+locus is indicated by a filled triangle, and other loci combined with
+the start locus are indicated by an asterisk or circle.
+The choices for pval include {\sl "hap"} (sequential haplotype asymptotic
+p-value), {\sl "hap.sim"} (sequential haplotype simulated p-value),
+{\sl "sum"} (sequential sum asymptotic p-value), and {\sl "sum.sim"}
+(sequential sum simulated p-value). The other parameter option
+is {\sl single}, indicating whether to plot a line for the single-locus tests.
+
+\begin{figure}[tpb]
+<<label=plotSeqHap,fig=TRUE, echo=TRUE >>=
+# plot p-values for sequential haplotype scan and single-locus tests
+plot(seqhap.out, pval="hap", single=TRUE, las=2)
+@
+ \caption{Plot p-values for sequential haplotype scan
+ and single-locus tests} \label{seqhapPlot}
+\end{figure}
+\clearpage
+
+\subsection{Creating Haplotype Effect Columns: {\sl haplo.design}
+\label{hapDesign}}
+
+In some instances, the desired model for haplotype effects is not
+possible with the methods given in {\sl haplo.glm}. Examples
+include modeling just one haplotype effect, or modeling an
+interaction of haplotypes from
+different chromosomes, or analyzing censored data. To circumvent
+these limitations, we provide a function called {\sl haplo.design}, which
+will set up an expected haplotype design matrix from a {\sl haplo.em}
+object, to create columns that can be used to model haplotype effects
+in other modeling functions.
+
+The function {\sl haplo.design} first creates a design matrix for all
+pairs of haplotypes over all subjects, and then uses the posterior
+probabilities to create a weighted average contribution for each
+subject, so that the number of rows of the final design matrix is
+equal to the number of subjects. This is sometimes called the
+expectation-substitution method, as proposed by Zaykin et al.
+2002 \cite{Zaykin}, and using this haplotype design matrix in a
+regression model is asymptotically equivalent to the score statistics
+from {\sl haplo.score} (Xie and Stram 2005 \cite{XieStram}).
+Although this provides much flexibility, by using the design matrix in
+any type of regression model, the estimated regression parameters can
+be biased toward zero (see Lin and Zeng, 2006 \cite{LinZeng} for concerns
+about the expectation-substitution method).
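The expectation-substitution step itself reduces to a posterior-weighted
haplotype dosage per subject. A sketch (Python, illustrative only; the input
rows, with subject id, haplotype pair, and posterior probability, mimic a
{\sl haplo.em}-style expansion and are hypothetical):

```python
from collections import defaultdict

def expected_design(rows, haplotypes):
    """Expectation-substitution (additive coding): for each subject,
    E[# copies of haplotype h] = sum over the subject's possible pairs
    of posterior * (copies of h in that pair)."""
    dosage = {h: defaultdict(float) for h in haplotypes}
    for sid, h1, h2, post in rows:
        for h in haplotypes:
            dosage[h][sid] += post * ((h1 == h) + (h2 == h))
    return dosage

# subject 1: pair (A,B) with posterior 0.6, pair (A,A) with posterior 0.4
d = expected_design(rows=[(1, "A", "B", 0.6), (1, "A", "A", 0.4)],
                    haplotypes=["A", "B"])
# E[A] = 0.6*1 + 0.4*2 = 1.4; E[B] = 0.6*1 = 0.6
```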
+
+In the first example below, using default parameters, the returned data.frame
+contains a column for each haplotype that meets a minimum count
+({\sl min.count}) in the sample. The columns are named by the code they are
+assigned in {\sl haplo.em}.
+
+<<label=hapDesign1, echo=TRUE>>=
+# create a matrix of haplotype effect columns from haplo.em result
+hap.effect.frame <- haplo.design(save.em)
+
+names(hap.effect.frame)
+
+hap.effect.frame[1:10,1:8]
+@
+
+Additionally, {\sl haplo.design} gives the user flexibility to make
+a more specific design matrix with the following parameters:
+
+\begin{itemize}
+\item{}{\bf hapcodes:\ }{codes assigned in the {\sl haplo.em} object,
+ the only haplotypes to be made into effects}
+\item{}{\bf haplo.effect:\ }{the coding of haplotypes as additive,
+ dominant, or recessive}
+\item{}{\bf haplo.base:\ }{code for the baseline haplotype}
+\item{}{\bf min.count:\ }{minimum haplotype count}
+\end{itemize}
+
+This second example below creates columns for specific haplotype
+codes that were most interesting in {\sl score.gaus.add}, haplotypes
+with alleles 21-3-8 and 62-2-7, corresponding to codes 4 and 138 in
+{\sl haplo.em}, respectively. Assume we want to test their individual
+effects when they are coded with {\sl haplo.effect=``dominant''}.
+
+<<label=hapDesign2,echo=TRUE>>=
+# create haplotype effect cols for haps 4 and 138
+hap4.hap138.frame <- haplo.design(save.em, hapcodes=c(4,138),
+ haplo.effect="dominant")
+
+hap4.hap138.frame[1:10,]
+
+dat.glm <- data.frame(resp, male, age,
+ hap.4=hap4.hap138.frame$hap.4,
+ hap.138=hap4.hap138.frame$hap.138)
+
+glm.hap4.hap138 <- glm(resp ~ male + age + hap.4 + hap.138,
+ family="gaussian", data=dat.glm)
+summary(glm.hap4.hap138)
+@
+
+\pagebreak
+
+\section{License and Warranty}
+
+\noindent License:\linebreak
+
+\noindent Copyright 2003 Mayo Foundation for Medical Education and Research. \linebreak
+
+\noindent This program is free software; you can redistribute it and/or modify
+it under the terms of
+the GNU General Public License as published by the Free Software
+Foundation; either
+version 2 of the License, or (at your option) any later version.\\
+
+\noindent This program is distributed in the hope that it will be useful, but
+WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+for more details. \\
+
+\noindent You should have received a copy of the GNU General Public License
+along with this program; if not, write to \\
+\noindent Free Software Foundation, Inc. \\
+59 Temple Place, Suite 330 \\
+Boston, MA 02111-1307 USA \\
+
+\noindent For other licensing arrangements, please contact Daniel J. Schaid.\\
+Daniel J. Schaid, Ph.D.\\
+Division of Biostatistics\\
+Harwick Building - Room 775\\
+Mayo Clinic\\
+200 First St., SW\\
+Rochester, MN 55905\\
+phone: 507-284-0639\\
+fax:\ \ \ 507-284-9542\\
+email: schaid at mayo.edu\\
+
+
+\section{Acknowledgements}
+This research was supported by United States Public Health Services,
+National Institutes of Health; Contract grant numbers R01 DE13276,
+R01 GM 65450, N01 AI45240, and R01 2AI33144. The {\sl hla.demo} data is
+kindly provided by Gregory A. Poland, M.D. and the Mayo Vaccine
+Research Group for illustration only, and may not be used for
+publication.
+
+\pagebreak
+
+\appendix
+\noindent{\large {\bf Appendix}}
+
+\section{Counting Haplotype Pairs When Marker Phenotypes
+ Have Missing Alleles}
+
+The following describes the process for counting the number of
+haplotype pairs that are consistent with a subject's observed marker
+phenotypes, allowing for some loci with missing data. Note that we
+refer to marker phenotypes, but our algorithm is oriented towards
+typical markers that have a one-to-one correspondence with their
+genotypes. We first describe how to count when none of the loci
+have missing alleles, and then generalize to allow loci to have
+either one or two missing alleles. When there are no missing alleles,
+note that homozygous loci are not ambiguous with respect to the
+underlying haplotypes, because at these loci the underlying
+haplotypes will not differ if we interchange alleles between
+haplotypes. In contrast, heterozygous loci are ambiguous,
+because we do not know the haplotype origin of the
+distinguishable alleles (i.e., unknown linkage phase).
+However, if there is only one heterozygous locus, then it
+doesn't matter if we interchange alleles, because the pair of
+haplotypes will be the same. In this situation, if parental
+origin of alleles were known, then interchanging alleles would
+switch parental origin of haplotypes, but not the composition
+of the haplotypes. Hence, ambiguity arises only when there
+are at least two heterozygous loci. For each heterozygous
+locus beyond the first one, the number of possible haplotypes
+increases by a factor of 2, because we interchange the two
+alleles at each heterozygous locus to create all possible
+pairs of haplotypes. Hence, the number of possible haplotype
+pairs can be expressed as $2^{x}$, where $x = H-1$, if $H$\ (the number of
+heterozygous loci) is at least 2, otherwise $x = 0$.
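This no-missing-data rule can be written directly (a Python sketch of the
$2^{x}$\ count, for illustration only):

```python
def n_hap_pairs(n_het):
    """Haplotype pairs consistent with a phenotype that has n_het
    heterozygous loci and no missing alleles: 2**(n_het - 1) when
    n_het >= 2, otherwise 1 (no phase ambiguity)."""
    x = n_het - 1 if n_het >= 2 else 0
    return 2 ** x

counts = [n_hap_pairs(h) for h in range(5)]
# counts == [1, 1, 2, 4, 8]
```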
+
+Now consider a locus with missing alleles. The possible
+alleles at a given locus are considered to be those that
+are actually observed in the data. Let $a_{i}$\ denote the number
+of distinguishable alleles at the locus. To count the number
+of underlying haplotypes that are consistent with the observed
+and missing marker data, we need to enumerate all possible
+genotypes for the loci with missing data, and consider whether
+the imputed genotypes are heterozygous or homozygous.
+
+To develop our method, first consider how to count the number
+of genotypes at a locus, say the $i^{th}$ locus, when either one or two
+alleles are missing. This locus could have either a homozygous
+or heterozygous genotype, and both possibilities must be
+considered for our counting method. If the locus is considered
+as homozygous, and there is one allele missing, then there is
+only one possible genotype; if there are two alleles missing,
+then there are $a_{i}$ possible genotypes. A function to perform this
+counting for homozygous loci is denoted $f(a_{i})$. If the locus is
+considered as heterozygous, and there is one allele missing,
+then there are $ a_{i}-1$\ possible genotypes; if there are two alleles
+missing, then there are $\frac{a_{i}(a_{i}-1)}{2}$\ possible
+genotypes. A function to
+perform this counting for heterozygous loci is denoted $g(a_{i})$.
+These functions and counts are summarized in Table A.1. \\
+
+\noindent {\bf Table A.1:} Factors for when a locus having missing allele(s)
+is counted as homozygous ($f()$) or heterozygous ($g()$)\\
+
+\begin{table}[h]
+\begin{center}
+\begin{tabular}{|r|c|c|}
+\hline
+Number of & Homozygous & Heterozygous \\
+missing alleles & function $f(a_{i})$ & function $g(a_{i})$ \\
+\hline \hline
+1 & 1 & $a_{i} - 1$ \\ \hline
+2 & $ a_{i}$ & $\frac{a_{i}(a_{i}-1)}{2}$ \\ \hline
+\end{tabular}
+\end{center}
+\end{table}
+
+Now, to use these genotype counting functions to determine the number
+of possible haplotype pairs, first consider a simple case where only
+one locus, say the $i^{th}$ locus, has two missing alleles. Suppose that
+the phenotype has H heterozygous loci (H is the count of heterozygous
+loci among those without missing data). We consider whether the locus with
+missing data is either homozygous or heterozygous, to give the count of possible
+haplotype pairs as
+
+\begin{equation}
+a_{i}2^{x} + \left[{\frac{a_{i}(a_{i}-1)}{2}}\right]2^{x+1}
+\end{equation}
+
+\noindent where again $x = H-1$\ if $H$\ is at least 2, otherwise $x = 0$. This special
+case can be represented by our more general genotype counting functions as
+
+\begin{equation}
+f(a_{i})\,2^{x} + g(a_{i})\,2^{x+1}
+\end{equation}
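As a worked instance of this count (a Python sketch; the values $a_{i}=4$\ and
$H=3$\ are made up for illustration):

```python
def pairs_one_locus_two_missing(a, H):
    """Haplotype-pair count when a single locus has two missing alleles:
    f(a)*2**x     (locus imputed homozygous,   f(a) = a)  plus
    g(a)*2**(x+1) (locus imputed heterozygous, g(a) = a*(a-1)/2),
    where x = H - 1 if H >= 2 else 0."""
    x = H - 1 if H >= 2 else 0
    f = a                 # homozygous genotypes with 2 missing alleles
    g = a * (a - 1) // 2  # heterozygous genotypes with 2 missing alleles
    return f * 2 ** x + g * 2 ** (x + 1)

# a = 4 observed alleles, H = 3 heterozygous complete loci:
# 4*2**2 + 6*2**3 = 16 + 48 = 64
total = pairs_one_locus_two_missing(4, 3)
```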
+
+When multiple loci have missing data, we need to sum over all possible
+combinations of heterozygous and homozygous genotypes for the
+incomplete loci. The rows of Table A.2 below present these
+combinations for up to $m=3$\ loci with missing data. Note that as the
+number of heterozygous loci increases (across the columns of Table A.2),
+so too does the exponent of 2. To calculate the total number of pairs
+of haplotypes, given observed and possibly missing genotypes, we need
+to sum the terms in Table A.2 across the appropriate row. For example,
+with $m=3$, there are eight terms to sum over. The general
+formulation for this counting method can be expressed as
+\begin{equation}
+\mbox{Total Pairs} = \sum_{j=0}^{m} \sum_{combo} C(combo,j)
+\end{equation}
+
+\noindent where $combo$\ is a particular pattern of heterozygous and homozygous
+loci among the loci with missing values (e.g., for $m = 3$, one
+combination is the first locus heterozygous and the $2^{nd}$\ and
+$3^{rd}$\ as homozygous), and $C(combo,j)$\ is the corresponding count for
+this pattern when there are $j$\ loci that are heterozygous (e.g., for
+$m = 3$\ and $j = 1$, as illustrated in Table A.2).
+
+
+\noindent {\bf Table A.2:} Genotype counting terms when $m$ loci
+have missing \\
+alleles, grouped by number of heterozygous loci (out of $m$)\\
+\begin{table}[h]
+\begin{center}
+\begin{tabular}{|r||r|r|r|r|}
+\hline
+$m$ & $j=0\,of\, m$ & $j=1\, of\, m$ &$j=2\, of\, m$ & $j=3\, of\, m$ \\
+\hline \hline
+$0$ & $2^{x}$ & & & \\
+\hline
+$1$ & $f(a_{1})2^{x}$ & $g(a_{1})2^{x+1}$ & & \\
+\hline
+$2$ & $f(a_{1})f(a_{2})2^{x}$ & $g(a_{1})f(a_{2})2^{x+1}$ &$g(a_{1})g(a_{2})2^{x+1}$ & \\
+ & & $f(a_{1})g(a_{2})2^{x+1}$ & & \\
+\hline
+$3$ & $f(a_{1})f(a_{2})f(a_{3})2^{x}$&$g(a_{1})f(a_{2})f(a_{3})2^{x+1}$&$g(a_{1})g(a_{2})f(a_{3})2^{x+2}$&
+ $g(a_{1})g(a_{2})g(a_{3})2^{x+2}$ \\
+ & &$f(a_{1})g(a_{2})f(a_{3})2^{x+1}$&$g(a_{1})f(a_{2})g(a_{3})2^{x+2}$& \\
+ & &$f(a_{1})f(a_{2})g(a_{3})2^{x+1}$&$f(a_{1})g(a_{2})g(a_{3})2^{x+2}$& \\
+\hline
+\end{tabular}
+\end{center}
+\end{table}
+
+\pagebreak
+
+\begin{thebibliography}{}
+
+ %% using AMA, listed in order of appearance
+\bibitem{Clayton}
+ Clayton, David. Personal web page, software list.
+ \mbox{<http://www-gene.cimr.cam.ac.uk/clayton/software/>.}
+ Accessed April 1, 2004.
+
+\bibitem{Schaid 2005}
+ Schaid DJ. Power and Sample Size for Testing Associations of
+ Haplotypes with Complex Traits. {\em Annals of Human Genetics}
+ 2005;70:116-130.
+
+\bibitem{Schaid et al. 2002}
+ Schaid DJ, Rowland CM, Tines DE, Jacobson RM,
+ Poland GA. Score tests for association between traits and
+ haplotypes when linkage phase is ambiguous. {\em Am J Hum Genet}
+ 2002;70:425-34.
+
+\bibitem{Zaykin}
+ Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG.
+ Testing Association of Statistically Inferred Haplotypes with Discrete
+ and Continuous Traits in Samples of Unrelated Individuals.
+ {\em Human Heredity} 2002;53:79-91.
+
+\bibitem{Harrell}
+ Harrell, FE. {\em Regression Modeling Strategies.} New York: Springer-Verlag; 2001.
+
+\bibitem{Besag and Clifford 1991}
+ Besag J, Clifford P. Sequential Monte Carlo p-Values.
+ {\em Biometrika} 1991;78:301-304.
+
+\bibitem{Lake et al. 2003}
+ Lake S, Lyon H, Silverman E, Weiss S, Laird N, Schaid D.
+ Estimation and tests of haplotype-environment interaction
+ when linkage phase is ambiguous. {\em Human Heredity} 2003;55:56-65.
+
+\bibitem{Stram et al. 2003}
+ Stram D, Pearce C, Bretsky P, Freedman M, Hirschhorn J,
+ Altshuler D, Kolonel L, Henderson B, Thomas D. Modeling and
+ E-M estimation of haplotype-specific relative risks from genotype
+ data for case-control study of unrelated individuals.
+ {\em Hum Hered} 2003;55:179-190.
+
+\bibitem{Epstein and Satten 2003}
+ Epstein MP, Satten GA. Inference on haplotype effects
+ in case-control studies using unphased genotype data.
+ {\em Am J Hum Genet} 2003;73:1316-1329.
+
+\bibitem{McCullaghNelder}
+ McCullagh P, Nelder JA. {\em Generalized Linear Models, Second Edition}.
+ Boca Raton, FL: Chapman and Hall. 1989:35-36.
+
+%\bibitem{Cheng2003}
+% Cheng R, Ma JZ, Wright FA, Lin S, Gau X, Wang D, Elston RC, Li
+% MD. Nonparametric disequilibrium mapping of functional sites using
+% haplotypes of multiple tightly linked single-nucleotide polymorphism markers.
+% {\em Genetics} 2003;164:1175-1187.
+
+\bibitem{Cheng2005}
+ Cheng R, Ma JZ, Elston RC, Li MD. Fine Mapping Functional Sites or
+ Regions from Case-Control Data Using Haplotypes of Multiple Linked SNPs.
+ {\em Annals of Human Genetics} 2005;69: 102-112.
+
+\bibitem{Yu}
+ Yu Z, Schaid DJ. Sequential haplotype scan methods for
+ association analysis. {\em Gen Epi} 2007;31:553-564.
+
+\bibitem{Mantel}
+ Mantel N, Haenszel W. Statistical aspects of the analysis of
+  data from retrospective studies of disease. {\em J Natl Cancer Inst}
+ 1959;22:719-48.
+
+\bibitem{XieStram}
+ Xie R, Stram DO. Asymptotic equivalence between two score tests
+ for haplotype-specific risk in general linear models. {\em Gen Epi}
+ 2005;29:166-170.
+
+\bibitem{LinZeng}
+ Lin DY, Zeng D. Likelihood-based inference on haplotype effects
+ in genetic association studies. {\em J Am Stat Assoc} 2006;101:473.
+
+\end{thebibliography}
+
+\end{document}
Added: trunk/packages/R/r-cran-haplo.stats/trunk/debian/docs
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/docs (rev 0)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/docs 2012-02-10 22:52:56 UTC (rev 9642)
@@ -0,0 +1 @@
+inst/doc/manualHaploStats.pdf
Modified: trunk/packages/R/r-cran-haplo.stats/trunk/debian/rules
===================================================================
--- trunk/packages/R/r-cran-haplo.stats/trunk/debian/rules 2012-02-10 15:27:06 UTC (rev 9641)
+++ trunk/packages/R/r-cran-haplo.stats/trunk/debian/rules 2012-02-10 22:52:56 UTC (rev 9642)
@@ -4,6 +4,8 @@
# Copyright 2012 by Andreas Tille <tille at debian.org>
# GPL
+DEB_COMPRESS_EXCLUDE_ALL := .pdf
+
include /usr/share/R/debian/r-cran.mk
# Require a version equal to or greater than the R version the package was built with.