8 Pedigrees and Phylogenies

Pedigrees and phylogenies are similar things: they are both ways of representing shared ancestry. Under a quantitative genetic model of inheritance, or a Brownian motion model of evolution, GLMM’s can be readily extended to model the similarities that exist between the phenotypes of related individuals or taxa. In the context of quantitative genetics these models are known as ‘animal’ models (Henderson 1976), and in the context of the comparative method these models are known as phylogenetic mixed models (Lynch 1991). The two models are almost identical, and are relatively minor modifications to the basic mixed model (Hadfield and Nakagawa 2010).

8.1 Pedigree and phylogeny formats

8.1.1 Pedigrees

\(\texttt{MCMCglmm}\) handles pedigrees stored in 3-column tabular form, with each row representing a single individual. The first column should contain the unique identifier of the individual, and columns 2 and 3 should be the unique identifiers of the individual’s parents. Parents must appear before their offspring in the table. I usually have the dam (mother) in the first column and the sire (father) in the third. I prefer the words dam and sire because if I subscript things with \(m\) and \(f\) I can never remember whether I mean male and female, or mother and father. In hermaphrodite systems the same individual may appear in both columns, even within the same row if an individual was produced through selfing. This is not a problem, but \(\texttt{MCMCglmm}\) will issue a warning in case hermaphrodites are not present and a data entry mistake has been made. Impossible pedigrees (for example individual’s that give birth to their own mother) are a problem and \(\texttt{MCMCglmm}\) will issue an error, hopefully with an appropriate message, when impossibilities are detected.

If the parent(s) of an individual are unknown then a missing value (NA) should be assigned in the relevant column. All individuals appearing as dams or sires need to have their own record, even if both of their parents are unknown. Often the number of individuals in a pedigree will be greater than the number of individuals for which phenotypic data exist. \(\texttt{MCMCglmm}\) can handle this, as long as all the individuals appearing in the data frame passed to data also appear in the pedigree.

To illustrate, we can load a pedigree for a population of blue tits and display the pedigree for the nuclear family that has individuals \(\texttt{R187920}\) and \(\texttt{R187921}\) as parents:

data(BTped)
Nped <- BTped[which(apply(BTped, 1, function(x) {
    any(x == "R187920" | x == "R187921")
})), ]
Nped
##       animal     dam    sire
## 66   R187920    <NA>    <NA>
## 172  R187921    <NA>    <NA>
## 325  R187726 R187920 R187921
## 411  R187724 R187920 R187921
## 503  R187723 R187920 R187921
## 838  R187613 R187920 R187921
## 932  R187612 R187920 R187921
## 1030 R187609 R187920 R187921

Both parents form part of what is known as the base population - they are outbred and unrelated to anybody else in the pedigree.

\(\texttt{MCMCglmm}\) and MasterBayes have several pedigree manipulation functions. (MasterBayes::orderPed) orders a pedigree so parents appear before their offspring, (MasterBayes::insertPed) inserts records for individuals that only appear as parents (or a vector of specified individuals). When the number of individuals with phenotypic data is less than the number of individuals in the pedigree it is sometimes possible to remove uninformative individuals from the pedigree and thus reduce the computation time. This is known as pruning the pedigree and is implemented in the \(\texttt{MCMCglmm}\) function prunePed. A vector of measured individuals is specified in the argument keep and specifying make.base=TRUE implements the most complete pruning. Note, make.base=FALSE is the default argument so you’ll need to explicitly specify TRUE in the call to prunePed. Michael Morrissey’s pedantics package, and the kinship2 package also have many other useful pedigree orientated functions. In fact, the orderPed function in MasterBayes is built around functions provided by kinship2.

8.1.2 Phylogenies

Phylogenies can be expressed in tabular form, although only two columns are required because each species only has a single parent. In general however, phylogenies are not expressed in this form presumably because it is hard to traverse phylogenies (and pedigrees) backwards in time when they are stored this way. For phylogenetic mixed models we generally only need to traverse phylogenies forward in time (if at all) but I have stuck with convention and used the phylo class from the ape package to store phylogenies. As with pedigrees, all species appearing in the data frame passed to data need to appear in the phylogeny. Typically, this will only include species at the tips of the phylogeny and so the measured species should appear in the tip.label element of the phylo object. An error message will be issued if this is not the case. Data may also exist for ancestral species, or even for species present at the tips but measured many generations before. It is possible to include these data as long as the phylogeny has labelled internal nodes. If nodes are unlabelled then \(\texttt{MCMCglmm}\) names them internally using the default arguments of makeNodeLabel from ape.

To illustrate, lets take the phylogeny of bird families included in the ape package, and extract the phylogeny in tabular form for the Paridae (Tits), Certhiidae (Treecreepers), Gruidae (Cranes) and the Struthionidae (Ostriches):

data("bird.families")
bird.families <- makeNodeLabel(bird.families)
some.families <- c("Certhiidae", "Paridae", "Gruidae", "Struthionidae")
Nphylo <- drop.tip(bird.families, setdiff(bird.families$tip.label, some.families))
INphylo <- inverseA(Nphylo)
INphylo$pedigree
##      node.names                  
## [1,] "Node58"        NA        NA
## [2,] "Node122"       "Node58"  NA
## [3,] "Struthionidae" NA        NA
## [4,] "Gruidae"       "Node58"  NA
## [5,] "Certhiidae"    "Node122" NA
## [6,] "Paridae"       "Node122" NA
A phylogeny of bird families from \citet{Sibley.1990} The families in red are the Tits (Paridae), Treecreepers (Certhiidae),  Cranes (Gruidae)  and the Ostriches (Struthionidae) from top to bottom. Blue tits are in the Paridae, and the word pedigree comes from the french for crane's foot.

(#fig:bird.families)A phylogeny of bird families from The families in red are the Tits (Paridae), Treecreepers (Certhiidae), Cranes (Gruidae) and the Ostriches (Struthionidae) from top to bottom. Blue tits are in the Paridae, and the word pedigree comes from the french for crane’s foot.

The full phylogeny, with these families and their connecting notes displayed, is shown in Figure @ref(fig:bird.families). You will notice that \(\texttt{Node1}\) - the root - does not appear in the phylogeny in tabular form. This is because the root is equivalent to the base population in a pedigree analysis, an issue which we will come back to later. Another piece of information that seems to be lacking in the tabular form is the branch length information. Branch lengths are equivalent to inbreeding coefficients in a pedigree. As with pedigrees the inbreeding coefficients are calculated by inverseA:

INphylo$inbreeding
## [1] 0.2285714 0.3857143 1.0000000 0.7714286 0.3857143 0.3857143

You will notice that the Struthionidae have an inbreeding coefficient of 1 because we used the default scale=TRUE in the call to inverseA. Only ultrametric trees can be scaled in \(\texttt{MCMCglmm}\) and in this case the sum of the inbreeding coefficients connecting the root to a terminal node is one. To take the Paridae as an example:

sum(INphylo$inbreeding[which(INphylo$pedigree[, 1] %in% c("Paridae", "Node122", "Node58"))])
## [1] 1

The inbreeding coefficients for the members of the blue tit nuclear family are of course all zero:

inverseA(Nped)$inbreeding
## [1] 0 0 0 0 0 0 0 0

8.2 The animal model and the phylogenetic mixed model

The structure of pedigrees and phylogenies can be expressed in terms of the relatedness matrix \({\bf A}\). This matrix is symmetric, square, and has dimensions equal to the number of individuals in the pedigree (or the number of taxa in the phylogeny). For pedigrees, element \(A_{i,j}\) is twice the probability that an allele drawn from individual \(i\) is identical by descent to an allele in individual \(j\). For phylogenies, element \(A_{i,j}\) is the amount of time that elapsed (since the common ancestor of all sampled taxa) before the speciation event that resulted in taxa \(i\) and \(j\). Simple, but perhaps slow, recursive methods exist for calculating \({\bf A}\) in both cases:

Aped <- 2 * kinship2::kinship(Nped[, 1], Nped[, 2], Nped[, 3])
Aped
##         R187920 R187921 R187726 R187724 R187723 R187613 R187612 R187609
## R187920     1.0     0.0     0.5     0.5     0.5     0.5     0.5     0.5
## R187921     0.0     1.0     0.5     0.5     0.5     0.5     0.5     0.5
## R187726     0.5     0.5     1.0     0.5     0.5     0.5     0.5     0.5
## R187724     0.5     0.5     0.5     1.0     0.5     0.5     0.5     0.5
## R187723     0.5     0.5     0.5     0.5     1.0     0.5     0.5     0.5
## R187613     0.5     0.5     0.5     0.5     0.5     1.0     0.5     0.5
## R187612     0.5     0.5     0.5     0.5     0.5     0.5     1.0     0.5
## R187609     0.5     0.5     0.5     0.5     0.5     0.5     0.5     1.0
Aphylo <- vcv.phylo(Nphylo, cor = T)
Aphylo
##               Struthionidae   Gruidae Certhiidae   Paridae
## Struthionidae             1 0.0000000  0.0000000 0.0000000
## Gruidae                   0 1.0000000  0.2285714 0.2285714
## Certhiidae                0 0.2285714  1.0000000 0.6142857
## Paridae                   0 0.2285714  0.6142857 1.0000000

Note that specifying cor=T is equivalent to scaling the tree as we did in the argument to inverseA.

In fact, all of the mixed models we fitted in earlier sections also used an \({\bf A}\) matrix, but in those cases the matrix was an identity matrix (i.e. \({\bf A}={\bf I}\)) and we didn’t have to worry about it. Let’s reconsider the Blue tit model m3a.1 from Section 5 where we were interested in estimating sex effects for tarsus length together with the amount of variance explained by genetic mother (dam) and foster mother (fosternest):

data(BTdata)
m3a.1 <- MCMCglmm(tarsus ~ sex, random = ~dam + fosternest, data = BTdata, verbose = FALSE)

All individuals that contributed to that analysis are from a single generation and appear in BTped together with their parents. However, individuals in the parental generation do not have tarsus length measurements so they do not have their own records in BTdata.

The model can be expressed as:

\[{\bf y} = {\bf X}{\boldsymbol{\mathbf{\beta}}}+{\bf Z}_{1}{\bf u}_{1}+{\bf Z}_{2}{\bf u}_{2}+{\bf e}\]

where the design matrices contain information relating each individual to a sex (\({\bf X}\)) a dam (\({\bf Z}_{1}\)) and a fosternest(\({\bf Z}_{2}\)). The associated parameter vectors (\({\boldsymbol{\mathbf{\beta}}}\), \({\bf u}_{1}\) and \({\bf u}_{2}\)) are the effects of each sex, mother and fosternest on tarsus length, and \({\bf e}\) is the vector of residuals.

In the model, the \(u\)’s are treated as random so we estimate their variance instead of fixing it in the prior at some (large) value, as we did with the \(\beta\)’s. We can be a little more explicit about what this means:

\[{\bf u}_{1} \sim N({\bf 0},\ {\bf I}\sigma^{2}_{1})\]

where \(\sim\) stands for ‘is distributed as’ and \(N\) a (multivariate) normal distribution. The distribution has two sets of parameters; a vector of means and a covariance matrix. We assume the random effects are deviations around the fixed effect part of the model and so they have a prior expectation of zero. The (co)variance matrix of the random effects is \({\bf I}\sigma^{2}_{1}\) where \(\sigma^{2}_{1}\) is the variance component to be estimated. The use of the identity matrix makes two things explicit. First, because all off-diagonal elements of an identity matrix are zero we are assuming that all dam effects are independent (no covariance exists between any two dam effects). Second, all diagonal elements of an identity matrix are 1 implying that the range of possible values the dam effect could take is equivalent for every dam, this range being governed by the magnitude of the variance component.

Since dam’s have very little interaction with the subset of offspring that were moved to a fosternest, we may be willing to assume that any similarity that exists between the tarsus lengths of this susbset and the subset that remained at home must be due to genetic effects. Although not strictly true we can assume that individuals that shared the same dam also shared the same sire, and so share around 50% of their genes.

References

Hadfield, J. D., and S. Nakagawa. 2010. “General Quantitative Genetic Methods for Comparative Biology: Phylogenies, Taxonomies, Meta-Analysis and Multi-Trait Models for Continuous and Categorical Characters.” Journal of Evolutionary Biology 23 (3): 494–508.
Henderson, C. R. 1976. “Simple Method for Computing Inverse of a Numerator Relationship Matrix Used in Prediction of Breeding Values.” Biometrics 32 (1): 69–83.
Lynch, M. 1991. “Methods for the Analysis of Comparative Data in Evolutionary Biology.” Evolution 45 (5): 1065–80.