gfit {happy}		R Documentation
Description

gfit() fits a QTL model to a happy() object. The model is a mixture of Gaussians, each with a different mean but a common variance, and corresponds loosely to the "full" model in hfit(). The difference is that hfit() regresses the observed phenotype values on the expected phenotypes under a full model, whereas gfit() uses maximum likelihood to fit the observed phenotype values to the mixture directly. The other functions in this suite are not usually called directly.
Details

The statistical model fitted is as follows. Consider first the case of fitting a QTL at a locus L. Let y be the vector of trait values, X(L) the design matrix for fitting a QTL at the locus L, and t(L) the vector of parameters to be estimated at the locus. For an additive QTL the parameters are the strain effect sizes; for a full interaction model there is a parameter for every possible strain combination. Then the one-QTL model is

E(y) = X(L) \, t(L)

There are S(S+1)/2 parameters to be estimated in a full model allowing for interactions between the alleles within the locus, so the (i,j)'th element of the design matrix is related to the strain probabilities thus:

X_{Lij} = F_{iLst}

where F_{iLst} is the probability that individual i carries the strain pair (s,t) at locus L, and the unordered pair (s,t) is mapped to the column index

j(s,t) = \min\left( s + S(t-1),\; t + S(s-1) \right)
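To make the index mapping concrete, here is a minimal R sketch (an illustration, not part of the package; pair.index is a hypothetical helper) that computes j(s,t) and enumerates the distinct column indices:

# Map an unordered strain pair (s,t) to a single column index j,
# as in the formula above; S is the number of founder strains.
# (pair.index is a hypothetical helper, not a happy function.)
pair.index <- function(s, t, S) min(s + S * (t - 1), t + S * (s - 1))

# With S = 3 strains, the S(S+1)/2 = 6 unordered pairs map to
# 6 distinct indices:
S <- 3
for (s in 1:S) for (t in s:S)
  cat(sprintf("(%d,%d) -> %d\n", s, t, pair.index(s, t, S)))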
In the function hfit(), the observed phenotypes are regressed directly on the expected trait values. This is not an optimal procedure because the data are really a mixture:

y_i \sim \sum_{s,t} F_{iLst} \, \frac{1}{\sigma_L} \, f\!\left( \frac{y_i - \beta_{Lst}}{\sigma_L} \right)

where f(x) is the standard Gaussian density and \beta_L is the vector of mean trait values for the strain combinations. The parameters \beta_L, \sigma_L are estimated by maximum likelihood, and the test for the presence of a QTL at locus L is equivalent to the test that all the \beta_{Lst} = \mu, in which case the model collapses to a single Gaussian distribution.
The model fitting is implemented in the function gfit() by an iterative process, rather like a simplified version of EM. It is slower than hfit(), and generally gives similar results as far as overall QTL detection is concerned, but gives more accurate parameter estimates. The log-likelihood for the data is
L = \sum_i \log\left( \sum_j p_{ij} \, \frac{\exp\!\left( -\frac{(y_i - \beta_j)^2}{2\sigma^2} \right)}{\sqrt{2\pi\sigma^2}} \right)
  = \sum_i \log\left( \sum_j p_{ij} \exp\!\left( -\frac{(y_i - \beta_j)^2}{2\sigma^2} \right) \right) - \frac{N \log(2\pi\sigma^2)}{2}

where p_{ij} is the prior probability (given by F_{iLst} with j = j(s,t)) that individual i belongs to strain combination j.
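As a concrete rendering of this expression, the following R sketch (an illustration, not the package's internal code; mixture.logL is a hypothetical helper) evaluates the log-likelihood for given means beta, common variance sigma2, and an N x J matrix p of mixture weights:

# Mixture log-likelihood for phenotypes y, group means beta,
# common variance sigma2, and N x J weight matrix p (p[i,j] = p_ij).
# (mixture.logL is a hypothetical helper, not a happy function.)
mixture.logL <- function(y, beta, sigma2, p) {
  e <- p * exp(-outer(y, beta, "-")^2 / (2 * sigma2))  # e[i,j]
  sum(log(rowSums(e))) - length(y) * log(2 * pi * sigma2) / 2
}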
Differentiating with respect to the parameters, and writing e_{ij} = p_{ij} \exp\!\left( -\frac{(y_i - \beta_j)^2}{2\sigma^2} \right) for brevity, gives

\frac{\partial L}{\partial \sigma^2} = \sum_i \frac{\sum_j (y_i - \beta_j)^2 \, e_{ij}}{2\sigma^4 \sum_k e_{ik}} - \frac{N}{2\sigma^2}

\frac{\partial L}{\partial \beta_j} = \sum_i \frac{(y_i - \beta_j) \, e_{ij}}{\sigma^2 \sum_k e_{ik}} = \frac{1}{\sigma^2} \left( \sum_i \frac{y_i \, e_{ij}}{\sum_k e_{ik}} - \beta_j \sum_i \frac{e_{ij}}{\sum_k e_{ik}} \right)
Writing

w_{ij} = \frac{e_{ij}}{\sum_k e_{ik}} = \frac{p_{ij} \exp\!\left( -\frac{(y_i - \beta_j)^2}{2\sigma^2} \right)}{\sum_k p_{ik} \exp\!\left( -\frac{(y_i - \beta_k)^2}{2\sigma^2} \right)}
the maximum-likelihood estimates satisfy

\hat{\sigma}^2 = \frac{1}{N} \sum_i \frac{\sum_j (y_i - \hat{\beta}_j)^2 \, \hat{e}_{ij}}{\sum_k \hat{e}_{ik}} = \frac{1}{N} \sum_i \sum_j \hat{w}_{ij} (y_i - \hat{\beta}_j)^2

\hat{\beta}_j = \frac{\sum_i \hat{w}_{ij} \, y_i}{\sum_i \hat{w}_{ij}}
and the maximised log-likelihood is

\hat{L} = \sum_i \log\left( \sum_j \hat{e}_{ij} \right) - \frac{N \log(2\pi\hat{\sigma}^2)}{2}
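These equations suggest a simple fixed-point update of the parameters. The sketch below is an illustrative R implementation of one such step, in the spirit of gaussian.iterate() but not its actual source:

# One EM-style update of (beta, sigma2), following the
# weighted-mean and weighted-variance equations above.
# (mixture.update is a hypothetical helper, not a happy function.)
mixture.update <- function(y, beta, sigma2, p) {
  e <- p * exp(-outer(y, beta, "-")^2 / (2 * sigma2))  # e_ij
  w <- e / rowSums(e)                                  # w_ij
  beta.new <- colSums(w * y) / colSums(w)              # updated beta_j
  sigma2.new <- sum(w * outer(y, beta.new, "-")^2) / length(y)
  list(beta = beta.new, sigma2 = sigma2.new)
}

Iterating such updates until the log-likelihood changes by less than eps corresponds to the stopping rule used in gaussian.loop().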
Usage

gfit( h, eps=1.0e-4, shuffle=FALSE, method="optim" )
gaussian.loop( d, maxit=100, eps=1.0e-3, df=NULL )
gaussian.null( n, sigma2 )
gaussian.init( d )
gaussian.iterate( d, params )
gaussian.fn( p, d=NULL )
gaussian.gr( p, d=NULL )
Arguments

h: an object returned by a previous call to happy()

shuffle: boolean indicating whether to shuffle the phenotypes in order to perform a permutation test

method: the optimisation algorithm. The default is to use R's optim() function, which uses derivative information; all other values of this argument will use an EM-type iteration

d: a list comprising the two elements d and probs

maxit: the maximum number of iterations in the ML fitting

eps: the termination accuracy in the model fitting: the log-likelihood must change by less than eps between successive iterations

df: the degrees of freedom to use. If NULL then this is computed as the rank of the data

n: the number of observations with non-missing phenotypes

sigma2: the variance of the errors in the data

params: a list with two components, beta (the group means) and sigma (the standard deviation)

p: vector of parameters. For internal use only
Value

gfit() returns a matrix with columns "marker", "LogL", "Null", "Chi", "df", "Pval", "LogPval". Each row describes the fit of the model for the corresponding marker interval.
gaussian.loop() fits the model to a single marker and returns a list with the same elements as in hfit().

gaussian.iterate() performs a single iteration of the fitting process and returns a list with the updated LogL, beta, sigma, dbeta and dsigma.

gaussian.init() initialises the parameters under the Null model, i.e. to the case where the means are all identical and the variance is the overall variance.

gaussian.null() returns the log-likelihood under the Null model.

gaussian.fn() and gaussian.gr() are the function and gradient required by the optim() function.
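For reference, optim() expects the function and gradient to share a common signature, with extra data passed through its "..." argument. The toy example below shows the calling pattern; a quadratic stands in for the mixture likelihood, so this is not the body of gaussian.fn():

# Generic optim() calling pattern with a function/gradient pair.
# A toy quadratic stands in for the mixture log-likelihood;
# gaussian.fn() and gaussian.gr() have the same (p, d) signature.
fn <- function(p, d = NULL) sum((p - d$target)^2)
gr <- function(p, d = NULL) 2 * (p - d$target)
d <- list(target = c(1, 2, 3))
fit <- optim(par = c(0, 0, 0), fn = fn, gr = gr, d = d, method = "BFGS")
fit$par  # converges to d$target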
Author(s)

Richard Mott
See Also

happy(), hprob()
Examples

## An example session:
# initialise happy
## Not run: h <- happy('HS.data','HS.alleles')
# fit all the markers
## Not run: f <- gfit(h)
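The shuffle argument can then be used to calibrate the fits by permutation; for instance, continuing the session above:

# inspect the fitted matrix; columns are
# "marker", "LogL", "Null", "Chi", "df", "Pval", "LogPval"
## Not run: head(f)
# permutation test: refit with shuffled phenotypes
## Not run: fperm <- gfit(h, shuffle=TRUE)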