11 Technical Details
11.1 Model Form
The probability of the \(i^{th}\) data point is represented by:
\[f_{i}(y_{i} | l_{i}) \label{pyl-Eq} \tag{11.1}\]
where \(f_{i}\) is the probability density function associated with \(y_{i}\). For example, if \(y_{i}\) was assumed to be Poisson distributed and we used the canonical log link function, then Equation (11.1) would have the form:
\[f_{P}\left(y_{i} | \lambda = \textrm{exp}(l_{i})\right) \label{pyl2-Eq} \tag{11.2}\]
where \(\lambda\) is the canonical parameter of the Poisson density function \(f_{P}\). Table 11.1 has a full list of supported distributions and link functions.
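The mapping in Equation (11.2) is easy to check numerically. Below is a minimal Python sketch (illustrative only, not \(\texttt{MCMCglmm}\) code; the values of \(l_i\) and \(y_i\) are made up) that evaluates the Poisson probability of an observation given its latent variable:

```python
import numpy as np
from scipy.stats import poisson

l_i = 1.5                  # latent variable on the linear-predictor scale
lam = np.exp(l_i)          # canonical log link: lambda = exp(l_i)

y_i = 3                    # an observed count
p = poisson.pmf(y_i, lam)  # f_P(y_i | lambda = exp(l_i)), as in Eq. (11.2)
```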
The vector of latent variables follows the linear model
\[{\bf l} = {\bf X}{\boldsymbol{\mathbf{\beta}}}+{\bf Z}{\bf u}+{\bf e} \label{l-Eq} \tag{11.3}\]
where \({\bf X}\) is a design matrix relating fixed predictors to the data, and \({\bf Z}\) is a design matrix relating random predictors to the data. These predictors have associated parameter vectors \({\boldsymbol{\mathbf{\beta}}}\) and \({\bf u}\), and \({\bf e}\) is a vector of residuals. In the Poisson case these residuals deal with any overdispersion in the data after accounting for fixed and random sources of variation.
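Equation (11.3) is plain matrix arithmetic. The following Python sketch (toy dimensions and simulated effects, purely for illustration) builds a latent vector from its parts:

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, q = 6, 2, 3                  # observations, fixed effects, random effects
X = rng.normal(size=(n, p))        # fixed-effect design matrix
Z = rng.normal(size=(n, q))        # random-effect design matrix
beta = np.array([0.5, -1.0])       # fixed effects
u = rng.normal(scale=0.3, size=q)  # random effects
e = rng.normal(scale=0.1, size=n)  # residuals (absorb any overdispersion)

l = X @ beta + Z @ u + e           # the linear model of Eq. (11.3)
```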
The location effects (\({\boldsymbol{\mathbf{\beta}}}\) and \({\bf u}\)), and the residuals (\({\bf e}\)) are assumed to come from a multivariate normal distribution:
\[\left[ \begin{array}{c} {\boldsymbol{\mathbf{\beta}}}\\ {\bf u}\\ {\bf e} \end{array} \right] \sim N\left( \left[ \begin{array}{c} {\boldsymbol{\mathbf{\beta}}}_{0}\\ {\bf 0}\\ {\bf 0}\\ \end{array} \right] , \left[ \begin{array}{ccc} {\bf B}&{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf G}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf R}\\ \end{array} \right] \right) \label{V-Eq} \tag{11.4}\]
where \({\boldsymbol{\mathbf{\beta}}}_{0}\) is a vector of prior means for the fixed effects with prior (co)variance \({\bf B}\), and \({\bf G}\) and \({\bf R}\) are the expected (co)variances of the random effects and residuals respectively. The zero off-diagonal matrices imply a priori independence between fixed effects, random effects, and residuals. Generally, \({\bf G}\) and \({\bf R}\) are large square matrices with dimensions equal to the number of random effects or residuals. Typically they are unknown and must be estimated from the data, usually by assuming they are structured so that they can be parameterised by a few parameters. Below we focus on the structure of \({\bf G}\), but the same logic applies to \({\bf R}\).
At its most general, \(\texttt{MCMCglmm}\) allows variance structures of the form:
\[{\bf G}= \left({\bf V}_{1}\otimes{\bf A}_{1}\right) \oplus \left({\bf V}_{2}\otimes{\bf A}_{2}\right) \oplus \ldots \label{G3-Eq} \tag{11.5}\]
where the parameter (co)variance matrices (\({\bf V}\)) are usually low-dimensional and are to be estimated, and the structured matrices (\({\bf A}\)) are usually high dimensional and treated as known.
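The structure in Equation (11.5) can be assembled directly with Kronecker products and a direct sum (block diagonal). A Python sketch with two toy components, where the \({\bf V}_{i}\) and \({\bf A}_{i}\) below are arbitrary illustrative matrices:

```python
import numpy as np
from scipy.linalg import block_diag

# Two variance components: V_i (low-dimensional, to be estimated) and
# A_i (structured, treated as known).
V1 = np.array([[1.0, 0.3],
               [0.3, 0.5]])
A1 = np.eye(3)                           # e.g. unrelated grouping levels
V2 = np.array([[2.0]])
A2 = np.array([[1.0, 0.5],
               [0.5, 1.0]])              # e.g. a known relationship matrix

# G = (V1 kron A1) directsum (V2 kron A2), as in Eq. (11.5)
G = block_diag(np.kron(V1, A1), np.kron(V2, A2))
```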
In the case of ordinal probit models with \(>2\) categories (i.e. "threshold" or "ordinal" models), \(f_{T}/f_{O}\) depends on an extra set of parameters in addition to the latent variable: the \(\textrm{max}(y)+1\) cutpoints \({\boldsymbol{\mathbf{\gamma}}}\). The probability of \(y_{i}\) is then:
\[f_{T}(y_{i} | l_{i}, {\boldsymbol{\mathbf{\gamma}}}) = 1\ \textrm{if}\ \gamma_{y_{i}+1} < l_{i} < \gamma_{y_{i}}\]
and
\[f_{O}(y_{i} | l_{i}, {\boldsymbol{\mathbf{\gamma}}}) = F_{N}(\gamma_{y_{i}} | l_{i}, 1)-F_{N}(\gamma_{y_{i}+1} | l_{i},1)\]
where \(F_{N}\) is the cumulative distribution function of the normal. Note that the two models can be made equivalent.
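To make the ordinal density concrete, the sketch below computes the category probabilities as differences of normal CDFs for a three-category probit model. It uses the conventional increasing ordering of cutpoints with the outer cutpoints fixed at \(\mp\infty\) (the text's indexing is the mirror image of this, but the probabilities are the same); the cutpoint and latent values are made up:

```python
import numpy as np
from scipy.stats import norm

# Increasing cutpoints; outer boundaries fixed at -inf and +inf.
gamma = np.array([-np.inf, 0.0, 1.0, np.inf])  # 3 categories
l_i = 0.4                                      # latent variable

# P(y = k) is the normal mass between consecutive cutpoints,
# i.e. a difference of F_N values evaluated with mean l_i and variance 1.
probs = norm.cdf(gamma[1:] - l_i) - norm.cdf(gamma[:-1] - l_i)
```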
11.2 MCMC Sampling Schemes
11.2.1 Updating the latent variables \({\bf l}\)
The conditional density of \(l_{i}\) is given by:
\[Pr(l_{i}| {\bf y}, \boldsymbol{\theta}, {\bf R}, {\bf G}) \propto f_{i}(y_{i} | l_{i})f_{N}(e_{i}|{\bf r}_{i}{\bf R}_{/i}^{-1}{\bf e}_{/i}, r_{i}-{\bf r}_{i}{\bf R}_{/i}^{-1}{\bf r}^{'}_{i}) \label{pcl-Eq} \tag{11.6}\]
where \(f_{N}\) indicates a multivariate normal density with specified mean vector and covariance matrix. Equation (11.6) is the probability of the data point \(y_{i}\) from distribution \(f_{i}\) with latent variable \(l_{i}\), multiplied by the probability of the linear predictor residual. The linear predictor residual follows a conditional normal distribution where the conditioning is on the residuals associated with data points other than \(i\). Vectors and matrices with the row and/or column associated with \(i\) removed are denoted \(/i\).
Three special cases exist for which we sample directly from Equation (11.6): i) when \(y_{i}\) is normal, \(f_{i}(y_{i} | l_{i})=1\) if \(y_{i}=l_{i}\) and zero otherwise, so \(l_{i}=y_{i}\) without the need for updating; ii) when \(y_{i}\) is discrete and modelled using family="threshold", Equation (11.6) defines a truncated normal distribution and can be slice sampled (Robert 1995); and iii) when \(y_{i}\) is missing, \(f_{i}(y_{i} | l_{i})\) is not defined and samples can be drawn directly from the normal.
In practice, the conditional distribution in Equation (11.6) only involves other residuals which are expected to show some form of residual covariation, as defined by the \({\bf R}\) structure. Because of this we actually update the latent variables in blocks, where each block is a group of residuals that are expected to be correlated:
\[Pr({\bf l}_{j}|{\bf y}, \boldsymbol{\theta}, {\bf R}, {\bf G}) \propto \prod_{i \in j}{p}_{i}({y}_{i} | l_{i})f_{N}({\bf e}_{j}|{\bf 0}, {\bf R}_{j}) \label{pcl2-Eq} \tag{11.7}\]
where \(j\) indexes blocks of latent variables that have non-zero residual covariances. For response variables that are neither Gaussian nor threshold, the density in Equation (11.7) is in non-standard form and so Metropolis-Hastings updates are employed. We use adaptive methods during the burn-in phase to determine an efficient multivariate normal proposal distribution centered at the previous value of \({\bf l}_{j}\) with covariance matrix \(m{\bf M}\). For computational efficiency we use the same \({\bf M}\) for each block \(j\), where \({\bf M}\) is the average posterior (co)variance of \({\bf l}_{j}\) within blocks and is updated each iteration of the burn-in period (Haario, Saksman, and Tamminen 2001). The scalar \(m\) is chosen using the method of Ovaskainen et al. (2008) so that the proportion of successful jumps is optimal: 0.44 when \({\bf l}_{j}\) is a scalar, declining to 0.23 when \({\bf l}_{j}\) is high dimensional (Gelman et al. 2004).
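The adaptive-scaling idea can be illustrated in one dimension: a random-walk Metropolis sampler whose proposal scale is tuned by a Robbins-Monro recursion towards the 0.44 target rate. This is a simplified sketch of the general principle, not the Ovaskainen et al. (2008) algorithm as implemented; the target density and tuning schedule are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Log-density of the target (a standard normal, for illustration)."""
    return -0.5 * x * x

target_rate = 0.44   # optimal acceptance rate for a scalar update
x, log_m = 0.0, 0.0  # current state and log proposal scale
accepted = 0
n_iter = 5000

for t in range(1, n_iter + 1):
    prop = x + np.exp(log_m) * rng.normal()   # random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(x):
        x, acc = prop, 1.0
        accepted += 1
    else:
        acc = 0.0
    # Robbins-Monro step: nudge the scale towards the target rate.
    log_m += (acc - target_rate) / np.sqrt(t)

rate = accepted / n_iter
```

After burn-in the realised acceptance rate settles near the target, at which point the scale can be frozen for the sampling phase.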
A special case arises for multi-parameter distributions in which each parameter is associated with a linear predictor. For example, in the zero-inflated Poisson two linear predictors are used to model the same data point, one to predict zero-inflation, and one to predict the Poisson variable. In this case the two linear predictors are updated in a single block even when the residual covariance between them is set to zero, because the first probability in Equation (11.7) cannot be factored:
\[Pr({\bf l}_{j}|{\bf y}, \boldsymbol{\theta}, {\bf R}, {\bf G}) \propto {p}_{i}({y}_{i} | {\bf l}_{j})f_{N}({\bf e}_{j}|{\bf 0}, {\bf R}_{j}) \label{pcl3-Eq} \tag{11.8}\]
When the block size is one (i.e. a univariate analysis) then the latent variables can be slice sampled for two-category ordinal and categorical models if slice=TRUE is passed to \(\texttt{MCMCglmm}\).
11.2.2 Updating the location vector \(\boldsymbol{\theta} = \left[{\boldsymbol{\mathbf{\beta}}}^{'}\; {\bf u}^{'}\right]^{'}\)
Garcia-Cortes and Sorensen (2001) provide a method for sampling \(\boldsymbol{\theta}\) as a complete block that involves solving the sparse linear system:
\[\tilde{\boldsymbol{\theta}} = {\bf C}^{-1}{\bf W}^{'}{\bf R}^{-1}({\bf l} - {\bf W}\boldsymbol{\theta}_{\star}-{\bf e}_{\star}) \label{sMME-Eq} \tag{11.9}\]
where \({\bf C}\) is the mixed model coefficient matrix:
\[{\bf C} = {\bf W}^{'}{\bf R}^{-1}{\bf W}+ \left[ \begin{array}{c c} {\bf B}^{-1}&{\bf 0}\\ {\bf 0}&{\bf G}^{-1}\\ \end{array} \right]\]
and \({\bf W} = \left[{\bf X}\; {\bf Z}\right]\), and \({\bf B}\) is the prior (co)variance matrix for the fixed effects.
\(\boldsymbol{\theta}_{\star}\) and \({\bf e}_{\star}\) are random draws from the multivariate normal distributions:
\[\boldsymbol{\theta}_{\star} \sim N\left( \left[ \begin{array}{c} {\boldsymbol{\mathbf{\beta}}_{0}}\\ {\bf 0}\\ \end{array} \right] , \left[ \begin{array}{c c} {\bf B}&{\bf 0}\\ {\bf 0}&{\bf G}\\ \end{array} \right] \right)\]
and
\[{\bf e}_{\star} \sim N\left({\bf 0},{\bf R}\right)\]
\(\tilde{\boldsymbol{\theta}} + \boldsymbol{\theta}_{\star}\) gives a realisation from the required probability distribution:
\[Pr(\boldsymbol{\theta} | {\bf l}, {\bf W}, {\bf R}, {\bf G})\]
Equation (11.9) is solved using Cholesky factorisation. Because \({\bf C}\) is sparse and the pattern of non-zero elements fixed, an initial symbolic Cholesky factorisation of \({\bf P}{\bf C}{\bf P}^{'}\) is performed where \({\bf P}\) is a fill-reducing permutation matrix (Davis 2006). Numerical factorisation must be performed each iteration but the fill-reducing permutation (found via a minimum degree ordering of \({\bf C}+{\bf C}^{'}\)) reduces the computational burden dramatically compared to a direct factorisation of \({\bf C}\) (Davis 2006).
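A dense-matrix sketch of the Garcia-Cortes and Sorensen (2001) draw may help fix ideas. It takes the prior mean of the fixed effects to be zero and skips the sparse Cholesky machinery that \(\texttt{MCMCglmm}\) actually uses; all dimensions and inputs are illustrative:

```python
import numpy as np

def sample_location(l, X, Z, B, G, R, rng):
    """One joint draw of theta = [beta' u']' given l, per Eq. (11.9).

    Dense illustration with the prior mean of beta taken as zero;
    the real implementation factorises the sparse C matrix instead.
    """
    W = np.hstack([X, Z])
    p, q = X.shape[1], Z.shape[1]
    Binv, Ginv, Rinv = map(np.linalg.inv, (B, G, R))
    # Mixed-model coefficient matrix C = W'R^{-1}W + blockdiag(B^{-1}, G^{-1}).
    C = W.T @ Rinv @ W
    C[:p, :p] += Binv
    C[p:, p:] += Ginv
    # theta_star ~ N(0, blockdiag(B, G)) and e_star ~ N(0, R).
    theta_star = np.concatenate([rng.multivariate_normal(np.zeros(p), B),
                                 rng.multivariate_normal(np.zeros(q), G)])
    e_star = rng.multivariate_normal(np.zeros(len(l)), R)
    # Solve C theta_tilde = W'R^{-1}(l - W theta_star - e_star).
    theta_tilde = np.linalg.solve(C, W.T @ Rinv @ (l - W @ theta_star - e_star))
    return theta_tilde + theta_star
```

Averaged over many calls, the draws have mean \({\bf C}^{-1}{\bf W}^{'}{\bf R}^{-1}{\bf l}\) (with \({\boldsymbol{\mathbf{\beta}}}_{0}={\bf 0}\)) and covariance \({\bf C}^{-1}\), as required.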
Forming the inverse of the variance structures is usually simpler because they can be expressed as a series of direct sums and Kronecker products:
\[{\bf G}= \left({\bf V}_{1}\otimes{\bf A}_{1}\right) \oplus \left({\bf V}_{2}\otimes{\bf A}_{2}\right) \oplus \ldots\]
and the inverse of such a structure has the form
\[{\bf G}^{-1} = \left({\bf V}^{-1}_{1}\otimes{\bf A}^{-1}_{1}\right) \oplus \left({\bf V}^{-1}_{2}\otimes{\bf A}^{-1}_{2}\right) \oplus \ldots\]
which involves inverting the parameter (co)variance matrices (\({\bf V}\)), which are usually of low dimension, and inverting \({\bf A}\). For many problems \({\bf A}\) is actually an identity matrix and so inversion is not required. When \({\bf A}\) is a relationship matrix associated with a pedigree, Henderson (1976; Meuwissen and Luo 1992) give efficient recursive algorithms for obtaining the inverse, and Hadfield and Nakagawa (2010) derive a similar procedure for phylogenies.
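The Kronecker identity behind this inverse is easy to verify numerically for a single component; the \({\bf V}\) and \({\bf A}\) below are arbitrary positive-definite examples:

```python
import numpy as np

V = np.array([[1.0, 0.4],
              [0.4, 2.0]])           # low-dimensional parameter matrix
A = np.array([[1.00, 0.50, 0.25],
              [0.50, 1.00, 0.50],
              [0.25, 0.50, 1.00]])   # known structured matrix

# (V kron A)^{-1} equals V^{-1} kron A^{-1}.
lhs = np.linalg.inv(np.kron(V, A))
rhs = np.kron(np.linalg.inv(V), np.linalg.inv(A))
```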
11.2.3 Updating the variance structures \({\bf G}\) and \({\bf R}\)
Components of the direct sum used to construct the desired variance structures are conditionally independent. The sum of squares matrix associated with each component term has the form:
\[{\bf S} = {\bf U}^{'}{\bf A}^{-1}{\bf U}\]
where \({\bf U}\) is a matrix of random effects where each column is associated with the relevant row/column of \({\bf V}\) and each row associated with the relevant row/column of \({\bf A}\). The parameter (co)variance matrix can then be sampled from the inverse Wishart distribution:
\[{\bf V} \sim IW(({\bf S}_{p}+{\bf S})^{-1},\ n_{p}+n) \label{pIW-Eq} \tag{11.10}\]
where \(n\) is the number of rows in \({\bf U}\), and \({\bf S}_{p}\) and \(n_{p}\) are the prior sum of squares and prior degrees of freedom, respectively.
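A Python sketch of this conditional update, using SciPy's inverse-Wishart (whose scale argument corresponds to \({\bf S}_{p}+{\bf S}\) under the parameterisation of Equation (11.10)); the random effects and prior here are toy stand-ins:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(3)

n, d = 50, 2                 # levels of a 2-dimensional random effect
U = rng.normal(size=(n, d))  # stand-in draws of the random effects
Ainv = np.eye(n)             # A is an identity here, so no inversion is needed

S = U.T @ Ainv @ U           # sum-of-squares matrix S = U'A^{-1}U
S_p = np.eye(d)              # prior sum of squares
n_p = d + 1                  # prior degrees of freedom

# Conditional inverse-Wishart draw of V, as in Eq. (11.10).
V = invwishart.rvs(df=n_p + n, scale=S_p + S, random_state=rng)
```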
In some models, some elements of a parameter (co)variance matrix cannot be estimated from the data and all the information comes from the prior. In these cases it can be advantageous to fix these elements at some value and Korsgaard, Andersen, and Sorensen (1999) provide a strategy for sampling from a conditional inverse-Wishart distribution which is appropriate when the rows/columns of the parameter matrix can be permuted so that the conditioning occurs on some diagonal sub-matrix. When this is not possible Metropolis-Hastings updates can be made.
11.2.4 Ordinal Models
For ordinal models it is necessary to update the cutpoints which define the bin boundaries for latent variables associated with each category of the outcome. To achieve good mixing we use the method developed by Cowles (1996), which allows the latent variables and cutpoints to be updated simultaneously using a Hastings-with-Gibbs update.
11.2.5 Path Analyses
Elements of the response vector can be regressed on each other using the \(\texttt{sir}\) and \(\texttt{path}\) functions. Using the matrix notation of Gianola and Sorensen (2004), Equation (11.3) can be rewritten as:
\[\boldsymbol{\Lambda}{\bf l} = {\bf X}{\boldsymbol{\mathbf{\beta}}}+{\bf Z}{\bf u}+{\bf e} \label{rs-Eq1} \tag{11.11}\]
where \(\boldsymbol{\Lambda}\) is a square matrix of the form:
\[\begin{array}{rl} \boldsymbol{\Lambda} =& {\bf I}-\sum_{l}\boldsymbol{\Psi}^{(l)}\lambda_{l}\\ \end{array} \label{rs-Eq2} \tag{11.12}\]
This sets up a regression in which the \(j^{th}\) element of the response vector acts as a weighted (by \(\Psi^{(l)}_{i,j}\)) predictor for the \(i^{th}\) element of the response vector, with associated regression parameter \(\lambda_{l}\). Often \(\boldsymbol{\Psi}^{(l)}\) is an incidence matrix, with the pattern of ones determining which elements of the response are regressed on each other.
Conditional on the vector of regression coefficients \(\boldsymbol{\lambda}\) (and hence \(\boldsymbol{\Lambda}\)), the location effects and variance structures can be updated as before by simply substituting \(\boldsymbol{\Lambda}{\bf l}\) for \({\bf l}\) in the necessary equations. Gianola and Sorensen (2004) provide a simple scheme for updating \(\boldsymbol{\Lambda}\). Note that Equation (11.11) can be rewritten as:
\[\begin{array}{rl} {\bf l} - {\bf X}{\boldsymbol{\mathbf{\beta}}} - {\bf Z}{\bf u} =& {\bf e}+\sum_{l}\boldsymbol{\Psi}^{(l)}{\bf l}\lambda_{l}\\ =& {\bf e}+{\bf L}\boldsymbol{\lambda}\\ \end{array}\]
where \({\bf L}\) is the design matrix \(\left[\boldsymbol{\Psi}^{(1)}{\bf l}, \boldsymbol{\Psi}^{(2)}{\bf l} \dots \boldsymbol{\Psi}^{(L)}{\bf l}\right]\) and \(\boldsymbol{\lambda}\) is the vector of \(L\) path coefficients. Conditional on \({\boldsymbol{\mathbf{\beta}}}\) and \({\boldsymbol{\mathbf{u}}}\), \(\boldsymbol{\lambda}\) can then be sampled using the method of Garcia-Cortes and Sorensen (2001) with \({\bf l} - {\bf X}{\boldsymbol{\mathbf{\beta}}} - {\bf Z}{\bf u}\) as response and \({\bf L}\) as predictor. However, only in a fully recursive system (there exists a row/column permutation by which all \(\boldsymbol{\Psi}\)'s are triangular) are the resulting draws from the appropriate conditional distribution, which requires multiplication by the Jacobian of the transform: \(|\boldsymbol{\Lambda}|\). An extra Metropolis-Hastings step is used to accept/reject the proposed draw when \(|\boldsymbol{\Lambda}|\neq 1\).
When the response vector is Gaussian and fully observed, the latent variable does not need updating. For non-Gaussian data, or with missing responses, updating the latent variable is difficult because Equation (11.6) becomes:
\[Pr(l_{i}| {\bf y}, \boldsymbol{\theta}, {\bf R}, {\bf G}, \boldsymbol{\Lambda}) \propto f_{i}(y_{i} | l_{i})f_{N}((\boldsymbol{\Lambda}^{-1}{\bf e})_{i}|{\bf q}_{i}{\bf Q}_{/i}^{-1}{\bf e}_{/i}, q_{i}-{\bf q}_{i}{\bf Q}_{/i}^{-1}{\bf q}^{'}_{i})\]
where \({\bf Q} = \boldsymbol{\Lambda}^{-1}{\bf R}\boldsymbol{\Lambda}^{-\top}\). In the general case \({\bf Q}\) will not have block diagonal structure like \({\bf R}\) and so the scheme for updating latent variables within residual blocks (i.e. Equation (11.7)) is not possible. However, in some cases \(\boldsymbol{\Lambda}\) may have the form where all non-zero elements correspond to elements of the response vector that are in the same residual block. In such cases updating the latent variables remains relatively simple:
\[Pr({\bf l}_{j}|{\bf y}, \boldsymbol{\theta}, {\bf R}, {\bf G}) \propto {p}_{i}({y}_{i} | {\bf l}_{j})f_{N}(\boldsymbol{\Lambda}^{-1}_{j}{\bf e}_{j}|{\bf 0}, \boldsymbol{\Lambda}^{-1}_{j}{\bf R}_{j}\boldsymbol{\Lambda}^{-\top}_{j})\]
11.2.6 Deviance and DIC
The deviance \(D\) is defined as:
\[D = -2\textrm{log}(\Pr({\bf y} | {\boldsymbol{\mathbf{\Omega}}}))\]
where \({\boldsymbol{\mathbf{\Omega}}}\) is some parameter set of the model. The deviance can be calculated in different ways depending on what is in ‘focus’, and \(\texttt{MCMCglmm}\) calculates this probability for the lowest level of the hierarchy (Spiegelhalter et al. 2002). For fully-observed Gaussian response variables the likelihood is the density:
\[f_{N}({\bf y} | {\bf W}\boldsymbol{\theta},\ {\bf R})\]
where \({\boldsymbol{\mathbf{\Omega}}} = \left\{\boldsymbol{\theta},\ {\bf R}\right\}\). For discrete response variables in univariate analyses modeled using family="threshold" the density is
\[\prod_{i} F_{N}(\gamma_{y_{i}} | {\bf w}_{i}\boldsymbol{\theta}, \ r_{ii})-F_{N}(\gamma_{y_{i}+1} | {\bf w}_{i}\boldsymbol{\theta}, \ r_{ii})\]
where \({\boldsymbol{\mathbf{\Omega}}} = \left\{{\boldsymbol{\mathbf{\gamma}}},\ \boldsymbol{\theta},\ {\bf R}\right\}\). For other response variables (including discrete response variables modeled using family="ordinal") it is the product:
\[\prod_{i}f_{i}(y_{i} | l_{i}) \label{LLikL} \tag{11.13}\]
with \({\boldsymbol{\mathbf{\Omega}}} = {\bf l}\).
For multivariate models with mixtures of Gaussian (g), threshold (t) and other non-Gaussian (n) data (including missing data) we can define the deviance in terms of three conditional densities:
\[\begin{array}{rl} Pr({\bf y} | {\boldsymbol{\mathbf{\Omega}}}) =& \Pr({\bf y}_{g}, {\bf y}_{t}, {\bf y}_{n} | {\boldsymbol{\mathbf{\gamma}}}, \boldsymbol{\theta}_{g}, \boldsymbol{\theta}_{t}, {\boldsymbol{\mathbf{l}}}_{n}, {\boldsymbol{\mathbf{R}}})\\ =& \Pr({\bf y}_{t} | {\boldsymbol{\mathbf{\gamma}}}, \boldsymbol{\theta}_{t}, {\bf y}_{g}, {\boldsymbol{\mathbf{l}}}_{n}, {\boldsymbol{\mathbf{R}}})\Pr({\bf y}_{g} | \boldsymbol{\theta}_{g}, {\boldsymbol{\mathbf{l}}}_{n}, {\boldsymbol{\mathbf{R}}})\Pr({\bf y}_{n} | {\boldsymbol{\mathbf{l}}}_{n})\\ \label{Eq-MVdeviance} \end{array} \tag{11.14}\]
with \({\boldsymbol{\mathbf{\Omega}}} = \left\{{\boldsymbol{\mathbf{\gamma}}},\ \boldsymbol{\theta}_{/n},\ {\boldsymbol{\mathbf{l}}}_{n},\ {\bf R}\right\}\). Define \(({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{a|b}={\boldsymbol{\mathbf{W}}}_{a}\boldsymbol{\theta}_{a}+{\bf R}_{a,b}{\bf R}^{-1}_{b,b}({\bf l}_{b}-{\boldsymbol{\mathbf{W}}}_{b}\boldsymbol{\theta}_{b})\) and \({\bf R}_{a|b} = {\bf R}_{a,a}-{\bf R}_{a,b}{\bf R}^{-1}_{b,b}{\bf R}_{b,a}\), where the subscripts denote rows of the data vector/design matrices or rows/columns of the \({\bf R}\)-structure. Then the conditional density of \({\bf y}_{g}\) in Equation (11.14) is:
\[f_{N}\left({\bf y}_{g} | ({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{g|n},\ {\bf R}_{g|n}\right)\]
The conditional density of \({\bf y}_{n}\) in Equation (11.14) is identical to that given in Equation (11.13), and for a single "threshold" trait
\[\prod_{i} F_{N}(\gamma_{y_{i}} | ({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{ti|g,n}, \ r_{ti|g, n})-F_{N}(\gamma_{y_{i}+1} | ({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{ti|g,n}, \ r_{ti|g, n}) \label{Eq-cpmvnorm} \tag{11.15}\]
is the conditional density for \({\bf y}_{t}\) in Equation (11.14), where \(({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{ti|g,n}\) is the \(i^{\textrm{th}}\) element of \(({\boldsymbol{\mathbf{W}}}\boldsymbol{\theta})_{t|g,n}\). Currently the deviance (and hence the DIC) will not be returned if there is more than one threshold trait.
The deviance is calculated at each iteration if DIC=TRUE and stored every thin\(^{th}\) iteration after burn-in. However, for computational reasons the deviance is calculated mid-iteration, such that the deviance returned at iteration \(i\) uses \({\boldsymbol{\mathbf{\Omega}}}_{i} = \left\{{\boldsymbol{\mathbf{\gamma}}}_{i},\ \boldsymbol{\theta}_{/n, i},\ {\boldsymbol{\mathbf{l}}}_{n, i-1},\ {\bf R}_{i}\right\}\). The mean deviance (\(\bar{D}\)) is calculated over all iterations, as are the means of the latent variables (\({\bf l}\)), the \({\bf R}\)-structure, and the vector of predictors (\({\bf W}\boldsymbol{\theta}\)). The deviance is calculated at the mean estimate of the parameters (\(D(\bar{\boldsymbol{\mathbf{\Omega}}})\)) and the deviance information criterion is calculated as:
\[\textrm{DIC} = 2\bar{D}-D(\bar{\boldsymbol{\mathbf{\Omega}}})\]
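The DIC computation can be sketched for a toy univariate Poisson model with a single mean in focus; the posterior draws below are stand-ins, not output from a real chain:

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(4)

y = np.array([0, 1, 3, 2, 4])      # toy counts

def deviance(lam):
    """D = -2 log Pr(y | lambda) for a common Poisson mean."""
    return -2.0 * poisson.logpmf(y, lam).sum()

# Stand-in posterior draws of the Poisson mean (conjugate-style toy values).
lam_samples = rng.gamma(shape=2.0 + y.sum(), scale=1.0 / (1.0 + len(y)),
                        size=1000)

D_bar = np.mean([deviance(lam) for lam in lam_samples])  # mean deviance
D_hat = deviance(lam_samples.mean())                     # deviance at the mean
DIC = 2.0 * D_bar - D_hat
```

Since the deviance is convex in \(\lambda\), \(\bar{D}\ge D(\bar{\boldsymbol{\mathbf{\Omega}}})\) and the effective number of parameters \(\bar{D}-D(\bar{\boldsymbol{\mathbf{\Omega}}})\) is non-negative.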
Table 11.1: Distributions supported by \(\texttt{MCMCglmm}\): the number of data columns and traits (latent variables) associated with each, and their density functions.

| Distribution | No. data columns | No. traits | Density | Function |
|---|---|---|---|---|
| `"gaussian"` | 1 | 1 | \(Pr(y)\) | \(=f_{N}({\bf w}^{\top}\boldsymbol{\theta},\sigma^{2}_{e})\) |
| `"poisson"` | 1 | 1 | \(Pr(y)\) | \(=f_{P}(\textrm{exp}(l))\) |
| `"categorical"` | 1 | \(J-1\) | \(Pr(y=k \mid k\neq k_1)\) | \(=\frac{\textrm{exp}(l_{k})}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\) |
| | | | \(Pr(y=k \mid k=k_1)\) | \(=\frac{1}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\) |
| `"multinomialJ"` | \(J\) | \(J-1\) | \(Pr(y_k \mid k\neq J)\) | \(={\sum^{J}_{j=1}y_j \choose y_k}\left(\frac{\textrm{exp}(l_{k})}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\right)^{y_k}\left(1-\frac{\textrm{exp}(l_{k})}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\right)^{\sum^{J}_{j\neq k} y_j}\) |
| | | | \(Pr(y_k \mid k=J)\) | \(={\sum^{J}_{j=1}y_j \choose y_k}\left(\frac{1}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\right)^{y_k}\left(1-\frac{1}{1+\sum^{J-1}_{j=1}\textrm{exp}(l_{j})}\right)^{\sum^{J}_{j\neq k} y_j}\) |
| `"ordinal"` | 1 | 1 | \(Pr(y=k)\) | \(=F_{N}(\gamma_{k} \mid l,1)-F_{N}(\gamma_{k+1} \mid l,1)\) |
| `"threshold"` | 1 | 1 | \(Pr(y=k)\) | \(=F_{N}(\gamma_{k} \mid {\bf w}^{\top}\boldsymbol{\theta}, \sigma^{2}_{e})-F_{N}(\gamma_{k+1} \mid {\bf w}^{\top}\boldsymbol{\theta}, \sigma^{2}_{e})\) |
| `"exponential"` | 1 | 1 | \(Pr(y)\) | \(=f_{E}(\textrm{exp}(-l))\) |
| `"geometric"` | 1 | 1 | \(Pr(y)\) | \(=f_{G}\left(\frac{\textrm{exp}(l)}{1+\textrm{exp}(l)}\right)\) |
| `"cengaussian"` | 2 | 1 | \(Pr(y_{1}>y>y_{2})\) | \(=F_{N}(y_{2} \mid {\bf w}^{\top}\boldsymbol{\theta},\sigma^{2}_{e})-F_{N}(y_{1} \mid {\bf w}^{\top}\boldsymbol{\theta},\sigma^{2}_{e})\) |
| `"cenpoisson"` | 2 | 1 | \(Pr(y_{1}>y>y_{2})\) | \(=F_{P}(y_{2} \mid l)-F_{P}(y_{1} \mid l)\) |
| `"cenexponential"` | 2 | 1 | \(Pr(y_{1}>y>y_{2})\) | \(=F_{E}(y_{2} \mid l)-F_{E}(y_{1} \mid l)\) |
| `"zipoisson"` | 1 | 2 | \(Pr(y=0)\) | \(=\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}+\left(1-\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\right)f_{P}(y\mid\textrm{exp}(l_{1}))\) |
| | | | \(Pr(y \mid y>0)\) | \(=\left(1-\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\right)f_{P}(y \mid\textrm{exp}(l_{1}))\) |
| `"ztpoisson"` | 1 | 1 | \(Pr(y)\) | \(=\frac{f_{P}(y \mid\textrm{exp}(l))}{1-f_{P}(0 \mid\textrm{exp}(l))}\) |
| `"hupoisson"` | 1 | 2 | \(Pr(y=0)\) | \(=\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\) |
| | | | \(Pr(y \mid y>0)\) | \(=\left(1-\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\right)\frac{f_{P}(y \mid\textrm{exp}(l_{1}))}{1-f_{P}(0 \mid\textrm{exp}(l_{1}))}\) |
| `"zapoisson"` | 1 | 2 | \(Pr(y=0)\) | \(=1-\textrm{exp}(-\textrm{exp}(l_{2}))\) |
| | | | \(Pr(y \mid y>0)\) | \(=\textrm{exp}(-\textrm{exp}(l_{2}))\frac{f_{P}(y \mid\textrm{exp}(l_{1}))}{1-f_{P}(0 \mid\textrm{exp}(l_{1}))}\) |
| `"hubinomial"` | 2 | 2 | \(Pr(y_{1}=0)\) | \(=\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}+\left(1-\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\right)f_{B}(0 \mid n=y_{1}+y_{2}, \frac{\textrm{exp}(l_{1})}{1+\textrm{exp}(l_{1})})\) |
| | | | \(Pr(y_{1} \mid y_{1}>0)\) | \(=\left(1-\frac{\textrm{exp}(l_{2})}{1+\textrm{exp}(l_{2})}\right)f_{B}(y_{1} \mid n=y_{1}+y_{2}, \frac{\textrm{exp}(l_{1})}{1+\textrm{exp}(l_{1})})\) |
| `"nzbinom"` | 2 | 1 | \(Pr(y=1)\) | \(=1-\left(1-\frac{\textrm{exp}(l)}{1+\textrm{exp}(l)}\right)^{y_2}\) |
| | | | \(Pr(y=0)\) | \(=\left(1-\frac{\textrm{exp}(l)}{1+\textrm{exp}(l)}\right)^{y_2}\) |
| `"ncst"` | 3 | 1 | \(Pr(y)\) | \(=f_{NCt}(y_1 \mid \nu=y_3, \mu=l/y_2)/y_2\) |
| `"msst"` | 3 | 1 | \(Pr(y)\) | \(=f_{t}(y_1-l/y_2 \mid \nu=y_3)/y_2\) |
| `"ztmbJ"` | \(J\) | \(J\) | \(Pr(y_k=1)\) | \(=\left(1-\frac{\textrm{exp}(l_k)}{1+\textrm{exp}(l_k)}\right)\left(1-\prod^{J}_{j=1}\frac{\textrm{exp}(l_j)}{1+\textrm{exp}(l_j)}\right)^{-1}\) |
| | | | \(Pr(y_k=0)\) | \(=\left(\frac{\textrm{exp}(l_k)}{1+\textrm{exp}(l_k)}\right)\left(1-\prod^{J}_{j=1}\frac{\textrm{exp}(l_j)}{1+\textrm{exp}(l_j)}\right)^{-1}\) |
| `"ztmultinomialJ"` | \(J\) | \(J-1\) | \(Pr(y_k \mid k\neq J, y_k>0)\) | \(={\sum^{J}_{j=1}y_j \choose y_k}\left(\frac{\textrm{exp}(l_{k})}{P_k(k)}\right)^{y_k}\left(1-\frac{\textrm{exp}(l_{k})}{P_k(k)}\right)^{\sum^{J}_{j\neq k} y_j}\) |
| | | | \(Pr(y_k \mid k=J, y_k>0)\) | \(={\sum^{J}_{j=1}y_j \choose y_k}\left(\frac{1}{P_k(k)}\right)^{y_k}\left(1-\frac{1}{P_k(k)}\right)^{\sum^{J}_{j\neq k} y_j}\) |
| | | | \(Pr(y_k=0 \mid k=J)\) | ignored |
11.3 Parameter Expansion
As the covariance matrix approaches a singularity the mixing of the chain becomes notoriously slow. This problem is often encountered in single-response models when a variance component is small and the chain becomes stuck at values close to zero. Similar problems occur for the EM algorithm and C. H. Liu, Rubin, and Wu (1998) introduced parameter expansion to speed up the rate of convergence. The idea was quickly applied to Gibbs sampling problems (J. S. Liu and Wu 1999) and has now been extensively used to develop more efficient mixed-model samplers (e.g. Dyk and Meng 2001; Gelman, Dyk, et al. 2008; Browne et al. 2009).
The columns of the design matrix (\({\bf W}\)) can be multiplied by the non-identified working parameters \({\boldsymbol{\mathbf{\alpha}}} = \left[1,\ \alpha_{1},\ \alpha_{2},\ \dots \alpha_{k}\right]^{'}\):
\[{\bf W}_{\alpha} = \left[{\bf X}\ {\bf Z}_{1}\alpha_{1}\ {\bf Z}_{2}\alpha_{2}\ \dots\ {\bf Z}_{k}\alpha_{k}\right] \label{wstar} \tag{11.16}\]
where the indices denote submatrices of \({\bf Z}\) which pertain to effects associated with the same variance component. Replacing \({\bf W}\) with \({\bf W}_{\alpha}\) we can sample the new location effects \(\boldsymbol{\theta}_{\alpha}\) as described above, and rescale them to obtain \(\boldsymbol{\theta}\):
\[\boldsymbol{\theta} = ({\bf I}_{\beta}\oplus_{i=1}^{k}{\bf I}_{u_{i}}\ \alpha_{i})\boldsymbol{\theta}_{\alpha}\]
where the identity matrices are of dimension equal to the length of the subscripted parameter vectors.
Likewise, the (co)variance matrices can be rescaled by the set of \(\alpha\)’s associated with the variances of a particular variance structure component (\({\boldsymbol{\mathbf{\alpha}}}_{\mathcal{V}}\)):
\[{\bf V} = Diag({\boldsymbol{\mathbf{\alpha}}}_{\mathcal{V}}){\bf V}_{\alpha}Diag({\boldsymbol{\mathbf{\alpha}}}_{\mathcal{V}})\]
The working parameters are not identifiable in the likelihood, but do have a proper conditional distribution. Defining the \(n\times(k+1)\) design matrix \({\bf X}_{\alpha}\) with each column equal to the submatrices in Equation (11.16) postmultiplied by the relevant subvectors of \(\boldsymbol{\theta}_{\alpha}\), we can see that \({\boldsymbol{\mathbf{\alpha}}}\) is a vector of regression coefficients:
\[\begin{array}{rl} \bf{l} =& {\bf X}_{\alpha}{\boldsymbol{\mathbf{\alpha}}}+\bf{e}\\ \end{array}\]
and so the methods described above can be used to update them.
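The rescaling steps can be sketched directly; the \(\alpha\) values, block sizes, and \({\bf V}_{\alpha}\) below are arbitrary illustrations:

```python
import numpy as np
from scipy.linalg import block_diag

n_beta = 2                         # number of fixed effects
n_u = (3, 2)                       # sizes of the two random-effect blocks
alpha = np.array([0.7, 1.8])       # working parameters (alpha_1, alpha_2)

theta_alpha = np.arange(1.0, 8.0)  # a stand-in draw on the working scale

# theta = (I_beta directsum alpha_1 I_{u_1} directsum alpha_2 I_{u_2}) theta_alpha
rescale = block_diag(np.eye(n_beta),
                     alpha[0] * np.eye(n_u[0]),
                     alpha[1] * np.eye(n_u[1]))
theta = rescale @ theta_alpha

# V = Diag(alpha_V) V_alpha Diag(alpha_V) for one component's (co)variances.
V_alpha = np.array([[1.0, 0.2],
                    [0.2, 0.5]])
alpha_V = np.array([0.7, 1.8])
V = np.diag(alpha_V) @ V_alpha @ np.diag(alpha_V)
```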
11.4 Priors for corg and corgh structures
For corg and corgh structures, the diagonals of \({\bf V}\) define the fixed variances (corgh) or are ignored and the variances set to one (corg). I use the prior specification of Barnard, McCulloch, and Meng (2000), where nu controls how much the correlation matrix approaches an identity matrix. The marginal distribution of individual correlations (\(r\)) is given by Barnard, McCulloch, and Meng (2000; see also Box and Tiao 1973):
\[\begin{array}{lr} Pr(r) \propto (1-r^{2})^\frac{\texttt{nu-dim(V)-1}}{\texttt{2}}, & |r|<1\\ \end{array}\]
and, as the exponent shows, setting nu = dim(V)+1 results in marginal correlations that are uniform on the interval \([-1,1]\).
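This marginal density is simple to evaluate; the sketch below checks that the exponent vanishes when nu = dim(V)+1, giving a flat prior over correlations, and that larger nu shrinks correlations towards zero (the grid of \(r\) values is illustrative):

```python
import numpy as np

def log_marginal_corr(r, nu, d):
    """Unnormalised log density Pr(r) propto (1 - r^2)^((nu - d - 1)/2)."""
    return 0.5 * (nu - d - 1.0) * np.log1p(-r * r)

r = np.linspace(-0.99, 0.99, 5)

flat = log_marginal_corr(r, nu=3.0, d=2)    # nu = dim(V)+1: flat over r
peaked = log_marginal_corr(r, nu=6.0, d=2)  # larger nu: shrinks r towards 0
```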
In most cases the full conditional distributions of correlation matrices do not have known form and so cannot be Gibbs sampled directly. \(\texttt{MCMCglmm}\) uses a method proposed by X. F. Liu and Daniels (2006) with the target prior as in Barnard, McCulloch, and Meng (2000). Generally this algorithm is very efficient because the Metropolis-Hastings acceptance probability only depends on the degree to which the candidate prior and the target prior (the prior you specify) conflict. The candidate prior is equivalent to the prior in Barnard, McCulloch, and Meng (2000) with nu=0, so as long as a diffuse prior is set, mixing is generally not a problem. If nu=0 is set (the default) then the Metropolis-Hastings steps are always accepted, resulting in Gibbs sampling. However, a prior of this form puts high density on extreme correlations, which can cause problems if the data give support to correlations in this region.
In versions \(<2.18\), cor fitted what is now a corg structure. The reason for the change is to keep the asreml and \(\texttt{MCMCglmm}\) syntax equivalent. However, the corgh structure in asreml is a reparameterised us structure, whereas in \(\texttt{MCMCglmm}\) the variances are fixed in the prior.↩︎