
A review of mutation processes and methods of phylogenetic inference

David B. Goldstein

Department of Zoology, South Parks Road
University of Oxford
Oxford, OX1 3PS
United Kingdom

and

David D. Pollock

Department of Mathematical Biology
National Institute for Medical Research
Mill Hill, London, NW7 1AA
United Kingdom

Corresponding author:
David B. Goldstein

phone: 01865 271250
fax: 01865 310447
email: david.goldstein@zoo.ox.ac.uk

Contents

  1. Introduction
  2. Molecular details
  3. Genetic distance measures
  4. Discussion
  5. Acknowledgements

Introduction

Microsatellites are short segments of DNA in which a specific motif of 1-6 bases is repeated, usually up to a maximum of about 60 times. Due to their exceptional variability and relative ease of scoring, microsatellites are now generally considered the most powerful genetic marker. It is typical to observe loci with more than 10 alleles and heterozygosities above 0.60, even in relatively small samples (Bowcock et al. 1994, Deka et al. 1995), while certain loci can be considerably more variable (Primmer et al. 1996). In addition to being highly variable, microsatellites are also densely distributed throughout eukaryotic genomes, making them the preferred marker for very high-resolution genetic mapping (Dib et al. 1996, Dietrich et al. 1996). Microsatellites are rapidly replacing RFLPs and RAPDs in most applications in population biology, from identifying relatives to inferring demographic parameters (Blouin et al. 1996, Bowcock et al. 1994, Goldstein et al. 1996, Jarne and Lagoda 1996). Part of the appeal of microsatellites over RFLPs and RAPDs is that the genetic basis of microsatellite variability is readily apparent: unique primers amplify a genomic region including a well-defined repeat structure that is responsible for the observed variation. This allows the development of inferential methods based on explicit models of microsatellite evolution (Slatkin 1995a,b; Goldstein et al. 1995a,b; Goldstein et al. 1996; Feldman et al. 1996; Pollock et al. 1996). These advantages suggest that microsatellites will enjoy a lengthy reign in population studies.

One perceived difficulty with microsatellites is the long lead time in identifying and characterising them in new taxonomic groups. This problem is partially alleviated, however, by the continuing popularity of microsatellites in genetic mapping. Dense microsatellite maps are now available in nearly all organisms of genetic and/or economic interest, including humans, mice, fruitflies, cows, sheep, chickens, pigs, tomatoes, soybeans and rice (Dib et al. 1996, Dietrich et al. 1996, Goldstein and Clark 1995, Taramino and Tingey 1996, Postlethwait et al. 1994, Su and Willems 1996, Ma et al. 1996, Crawford et al. 1995, Crooijmans et al. 1996, Rohrer et al. 1996, Broun and Tanksley 1996, Akkaya et al. 1995, Xiao et al. 1994). In addition, large databases of microsatellites isolated for population work are accumulating: one maintained at the Smithsonian Laboratory of Molecular Systematics includes 25 species, and is certainly an underestimate of those available. One practical long-term difficulty with microsatellite markers is the requirement of determining fragment lengths, which would seem to complicate automation. Ultimately the future may belong to markers amenable to yes/no tests which can be set up on dense chips, such as single nucleotide polymorphisms.

In contrast with their importance in intraspecific studies, microsatellites have yet to make any real contribution to phylogeny reconstruction. This failure has come as a surprise to those who suspected that the huge number of microsatellites available, coupled with their very rapid rate of evolution, would make them particularly useful in working out the relationships among very closely related species (e.g. Goldstein et al. 1995a). Although it is not yet entirely clear why microsatellites have not been more successful in reconstructing phylogenies, part of the difficulty certainly stems from restrictions to divergence imposed by range constraints, irregularities and asymmetries in the mutation process, and the degradation of microsatellites over time. A number of recent studies have developed theoretical methods both to estimate the relevant molecular details and to correct for them statistically, but they have yet to be tested. Our purpose here is to provide a non-technical introduction to the concerns related to the available methods and to suggest how the methods might be applied. We begin with a review of microsatellite mutation and evolution, emphasising those features relevant to the basic assumptions of the early stepwise distances (Goldstein et al. 1995a, Slatkin 1995a). In particular, we consider

  1. the mutation rate
  2. the distribution of mutation sizes
  3. constraints on the number of repeats (repeat count or allele size)
  4. the degree of asymmetry in the mutation distribution, and
  5. the dependence of the mutation process on heterozygous genotype.

Next we describe the principal analytic distances for microsatellites and a number of recent modifications that have been made, focusing in particular on their assumptions about the molecular details of microsatellites. We will also indicate how the modified distances can be used to estimate the parameters governing microsatellite mutation and evolution. Such estimation may ultimately allow the partitioning of loci into classes appropriate for particular problems.

Molecular details

Mutation rate: A variety of in vivo and in vitro studies indicate that microsatellite loci are highly unstable, having some of the highest mutation rates observed at molecular loci. Microsatellite mutation processes have been inferred by direct observations both on artificial constructs in yeast (Henderson and Petes 1992) and in human pedigrees (Weber and Wong 1993). The general conclusion from these studies is that there is an exceptionally high rate of mutation adding or subtracting a small number of perfect repeats. In humans, the average overall mutation rate for 28 di- and tetranucleotide microsatellites was estimated at about 0.001, with the tetranucleotide repeats significantly more mutable than the dinucleotide repeats. The most popular explanation for the high mutation rate is polymerase slippage (Levinson and Gutman 1987), a hypothesis which received considerable support from an elegant in vitro analysis showing that polymerase tends to miscopy repeated tracks of DNA (Schloetterer and Tautz 1992).

Distribution of mutation sizes: While the majority of observed mutations are of a single step (one repeat unit), a significant minority of mutations may be of larger size. Out of 22 observed germ line mutations, Weber and Wong (1993) confirmed no mutations larger than 2 repeats. Twenty of these mutations involved a change of a single step, so that a proportion of 0.91 (20 of 22) were single-step mutations and the remainder were two-step mutations. A subsequent study by Amos et al. (1996) confirmed only a single mutation larger than one repeat unit out of 15 observed mutations. Engineered repeat tracks in yeast also show a great preponderance of single and two-step mutations (Henderson and Petes 1992). The general conclusion from these studies is that the majority of mutations are of one or two steps. It should be kept in mind, however, that in observing relatively few mutations these studies are biased toward the most common types of mutation. It remains possible that mutations of much larger sizes occur, but too infrequently to be routinely picked up in such studies. Indirect evidence for such mutations comes from the study of distances among human populations, in which the calibrated mutation rate is somewhat higher than that observed in pedigrees (Goldstein et al. 1995b).

The occurrence of mutations larger than one or two steps is confirmed by studies of trinucleotide expansions in which alleles beyond a certain size threshold have asymmetric distributions of mutations including some of very large size. In fragile chromosome sites, for example, the disease-causing allele may have over a thousand repeats of CCG. The CAG repeats associated with some neurological disorders may also mutate to alleles with over a thousand repeats when they occur outside of coding regions (Ashley and Warren 1995). Perhaps surprisingly, these very large sizes associated with diseases are rarely reported at other microsatellites, although it should be noted that a sampling bias exists in that the expanded trinucleotides are identified by their phenotypic effects. A detailed characterization of maximal allele sizes at loci not associated with disease is necessary to determine whether trinucleotide expansion behaviour can be generalised to other types of microsatellites. Since the expanded trinucleotides have phenotypic effects even when they are not expressed, alleles above a certain size are probably eliminated quickly. Given that atypically large alleles are hypermutable, leading to the production of expanded, symptomatic alleles, this suggests a potential mechanism of size constraint.

Asymmetry of mutation distribution: A tendency to mutate to alleles of larger size (positive asymmetry) was first observed in the asymmetric mutation distribution for large alleles at trinucleotide-expansion loci (Ashley and Warren 1995). Subsequently, positive asymmetry was invoked by Rubinsztein et al. (1995) as part of an explanation of observed differences in average repeat sizes between humans and other primates. They posited that microsatellites in humans have a greater tendency toward positive asymmetric mutation than those in other primates. One major problem with this explanation is that since the loci were selected in humans, any real differences in microsatellite characteristics between the species are confounded with ascertainment bias (Ellegren et al. 1995, Box A). Subsequent studies, however, have demonstrated that asymmetric mutation is not restricted to trinucleotide expansion loci. Primmer et al. (1996) made a detailed study of a single highly polymorphic tetranucleotide locus in the swallow, Hirundo rustica. Out of 841 meioses, 26 mutations increasing size were observed compared with 7 decreasing it, with the majority of changes involving the gain or loss of a single repeat unit. Amos et al. (1996) added 15 new germline mutations to those studied in Weber and Wong (1993), and showed a significant excess of mutations increasing allelic size. The generality of these results is not yet clear, however, especially given that artificially constructed repeat tracks introduced into both bacteria (Levinson and Gutman 1987) and yeast (Henderson and Petes 1992) show an asymmetry towards mutations that decrease size. These observations, together with the behavior of expanded trinucleotide alleles, suggest that the degree of asymmetry may depend on allele size. In assessing asymmetry it will therefore be important not only to consider differences among loci but also differences among alleles within a locus.
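
The excess of size-increasing mutations at the swallow locus (26 gains against 7 losses) can be checked with an exact sign test. The sketch below is only an illustration and is not the analysis used in the original study: it assumes that mutations are independent and that, under the null hypothesis of a symmetric mutation process, each one is equally likely to increase or decrease allele size. The function name is hypothetical.

```python
from math import comb


def sign_test_two_sided(n_up, n_down):
    """Exact two-sided binomial (sign) test of symmetry (p = 0.5 under the null)."""
    n = n_up + n_down
    k = max(n_up, n_down)
    # One-tailed probability of a split at least as uneven as the observed one.
    one_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for a two-sided test, capping at 1 for perfectly even splits.
    return min(1.0, 2 * one_tail)


# Counts quoted in the text: 26 size-increasing and 7 size-decreasing mutations.
p_value = sign_test_two_sided(26, 7)
print(f"two-sided p = {p_value:.4f}")
```

With these counts the two-sided probability is roughly 0.001, so the asymmetry at this single locus is unlikely to be a sampling accident under the stated assumptions.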

These results raise key questions about microsatellite persistence. In particular, since loci with more than 60 or so repeats are rarely observed (but see Primmer et al. 1996), something must restrict the size of those loci showing positive asymmetry. Alternatively, microsatellites may be unstable above a certain threshold and quickly degraded through large deletions or through the introduction of imperfections. It is interesting in this regard that large GT repeat tracks cloned into plasmids tend to undergo large deletions (Levinson and Gutman 1987). This would, however, seem to predict a shorter life span for microsatellites than is consistent with observations on at least some loci. Coote and Bruford (1996), for example, found a set of microsatellites first identified in humans that are polymorphic in the majority of apes and old world monkeys, which includes species that last shared a common ancestor about 30 million years ago. More dramatically, FitzSimmons et al. (1995) reported conservation of orthologous microsatellite loci over 300 million years in marine turtles. It will be especially interesting to determine whether a relationship exists between microsatellite longevity and mutational asymmetry.

Range constraints: Perhaps the most compelling evidence that the number of repeats at microsatellite loci is under some form of constraint is simply the absence of alleles of very large size. Given the high mutation rate, and the very large number of loci that have been characterised, it is clear that if the process were an unconstrained random walk we would expect to regularly observe loci with very large alleles. In fact, with the exception of trinucleotide-expansion loci, alleles much greater than 60 repeats are very rarely observed (but see Primmer et al. 1996).

Other lines of argument have provided less direct evidence of a length ceiling. Bowcock et al. (1994), for example, found that the variance in repeat score among primates is not significantly larger than that among human populations. Under an unconstrained random walk the greater evolutionary distance among primates would be expected to lead to a greater variance by increasing the between-group component of the total variance. Similarly, a number of groups have reported that the ratio of the genetic distance between apes and humans compared with that between African and non-African populations is much less than would be expected in the absence of range constraints (Garza et al. 1995, Goldstein et al. 1995b). Since these loci were first selected in humans, however, they are expected to represent a biased sample of the locus properties in humans. Some of these observations, therefore, may be due to ascertainment bias as opposed to range constraints per se (see Box A).

Dependence of the mutation process on allele size and sequence: In trinucleotide repeat expansion loci, the rate and distribution of mutations change dramatically as allele sizes pass from the pre-mutation (atypically large but nonsymptomatic) to full-mutation state (Ashley and Warren 1995). A number of population studies have also tested the dependence of the mutation rate on allelic size by correlating observed levels of variation with average allele size. This approach utilizes the theoretical result of Moran (1975), who showed that at mutation-drift equilibrium the variance in size at a locus undergoing stepwise mutations is 2(N-1)β, where N is the haploid population size and 2β is the total mutation rate. The largest such study to date (Valdes et al. 1993) used PCR fragment size as a substitute for the number of repeats, and reported no correlation between average fragment size and the observed variance. The relationship between PCR fragment size and number of repeats is not particularly tight, however, because the size of the non-repeat portion of the PCR fragment varies from locus to locus. Moreover, systematic bias may have been introduced in the data by the research procedure used to select primers; algorithms often seek fragments in a specified size range. Goldstein and Clark (1995) analysed the dependence of the allelic variance on the repeat count itself, considering both the average size and the maximum size at a locus. Both correlations were significant, but the latter more so. This suggests that the increase in mutation rate with repeat size is not linear, or that some other assumption of the stepwise mutation model is violated. Interestingly, the same pattern was observed for both di- and trinucleotide microsatellites.
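
The use of Moran's equilibrium variance can be made concrete with a small sketch. It takes the relationship quoted above at face value (strict stepwise mutation, mutation-drift equilibrium, constant population size), and the population size and mutation rate in the example are hypothetical values chosen only for illustration. Because the population size is shared by all loci sampled from the same population, the ratio of two observed variances gives a rough estimate of the ratio of their mutation rates.

```python
def expected_variance(n_haploid, beta):
    """Equilibrium variance in repeat count, 2(N-1)*beta, as quoted from Moran (1975).

    n_haploid is the haploid population size; beta is half the total
    mutation rate (the total rate is 2*beta).
    """
    return 2 * (n_haploid - 1) * beta


def relative_mutation_rate(variance_a, variance_b):
    """Mutation rate of locus A relative to locus B in the same population.

    N cancels in the ratio, so only the two observed variances are needed.
    """
    return variance_a / variance_b


# Hypothetical values: N = 20000 haploid genomes, total rate 2*beta = 0.001.
v = expected_variance(20000, 0.0005)
print(f"expected variance in repeat count: {v:.3f}")
print(f"relative rate for variances 12 and 4: {relative_mutation_rate(12.0, 4.0):.1f}")
```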

It has also been inferred from population studies that imperfections in the repeat array tend to stabilize microsatellites (Goldstein and Clark 1995). This conclusion is supported by the observation that normal alleles at trinucleotide expansion loci often carry imperfections, while the pre- and full-mutation alleles do not (Ashley and Warren 1995). The sensitive dependence on the exact sequence in the repeated regions further complicates comparisons of microsatellite variability between species, and suggests that microsatellite degradation may involve the introduction of imperfections (Garza and Freimer 1996). It is therefore especially important to sequence at least a single allele from each locus when extending primers from a focal species to close relatives (see Box A).

Effects of heterozygosity: In the case of minisatellite regions, which involve repetitions of longer sequence motifs than microsatellites, it is known that mutation can result from unequal exchange during meiosis (Jeffreys et al. 1988), and it seems reasonable that this mechanism can also operate at microsatellite loci, at least for the larger alleles. The suggestion in Amos et al. (1996) that the probability of mutation increases with the difference in size between homologous alleles is consistent with a role for unequal exchange in microsatellite mutation. Mutational dependence on diploid genotype would have a dramatic impact on the dynamics of allele frequency evolution at microsatellite loci, and warrants more detailed study.

Genetic distance measures

If a distance is used to estimate relative times of divergence, it is essential that its expectation increases linearly with time, and beneficial if the coefficient of variation is low. For reconstruction of phylogenetic relationships, the combination of linearity and variance determines the performance (Goldstein and Pollock 1994, Pollock and Goldstein 1995). A useful measure which combines these features is the accuracy index of Tajima and Takezaki (1994), defined as the slope of the distance at any time divided by its standard deviation. Thus, if the variance is constant, distances will be most accurate over time if they maintain a constant rather than decreasing slope. In general, distances are constructed to be both as linear and as precise as possible under the assumption of a particular model of evolution. It should be appreciated, however, that a tradeoff between the two often exists (Goldstein and Pollock 1994, Pollock and Goldstein 1995).
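
The accuracy index described above (slope of the expected distance divided by its standard deviation) can be sketched numerically. The two distance curves, the rate and the standard deviation below are hypothetical and serve only to illustrate the point that, for a constant variance, a distance with a constant slope stays accurate while a saturating one does not. The slope is approximated by a central finite difference.

```python
from math import exp


def accuracy_index(expected_distance, std_dev, t, dt=1.0):
    """Slope of the expected distance at time t divided by its standard deviation."""
    slope = (expected_distance(t + dt) - expected_distance(t - dt)) / (2 * dt)
    return slope / std_dev(t)


RATE = 0.001  # hypothetical rate of increase per generation
SD = 0.05     # hypothetical constant standard deviation


def linear(t):
    """A distance whose expectation grows linearly with time."""
    return RATE * t


def saturating(t):
    """A distance whose expectation flattens out as time becomes large."""
    return 1 - exp(-RATE * t)


def constant_sd(t):
    return SD


for t in (100, 1000, 5000):
    a_lin = accuracy_index(linear, constant_sd, t)
    a_sat = accuracy_index(saturating, constant_sd, t)
    print(f"t = {t:5d}  linear: {a_lin:.4f}  saturating: {a_sat:.4f}")
```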

The majority of mutations at microsatellite loci are stepwise in nature, changing allelic sizes by one or a very few repeats, and thus distances which are designed specifically to apply to microsatellites generally assume Ohta and Kimura's (1973) stepwise mutation model (SSM) or one of its generalizations. Most classical distance measures, however, are based either on Kimura and Crow's (1964) infinite alleles model (IAM), or upon multidimensional geometric considerations without reference to a particular evolutionary model. The assumptions of the SSM differ sharply from the assumptions of the IAM, and therefore distances designed to increase linearly under the IAM, such as Nei's standard distance, are both non-linear and inaccurate for microsatellite loci (Goldstein et al. 1995a, Takezaki and Nei 1996).

Despite the fact that they are not based on the SSM or any other evolutionary model, a group of related distances performs well for reconstruction of phylogenies when taxa are closely related. Cavalli-Sforza and Edwards' (1967) chord distance, Dc, Nei et al.'s (1983) distance Da, and Stephens et al.'s (1992) allele sharing distance, Das, all make use of the product of allele frequencies shared between populations (see Box B), and have been shown to reconstruct closely-related phylogenies better than SSM-based distances (Goldstein et al. 1995a,b; Takezaki and Nei 1996). It is clear from these studies that these distances do not increase linearly with time, however, and become extremely flat as time becomes large. Thus, they do not reflect divergence time unless taxa are very closely related. Their accuracy at short distances stems from their use of the information available in the degree of overlap between the allele-frequency distributions of two populations, and they are less accurate at greater distances where the amount of overlap cannot decrease below zero and the distance between distributions becomes important. The amount of overlap between distributions is also sensitive to fluctuations in the effective population size, and thus it is not surprising that these distances are much less accurate when population bottlenecks have occurred (Takezaki and Nei 1996). It should be noted that each locus may have been subjected to different apparent fluctuations in effective population size due to positive, balancing, or slightly deleterious selection (Nauta and Weissing 1996, Slatkin 1995b). The sensitivity of these distances to fluctuations in effective population size thus presents a very serious complication.
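
As an illustration of why overlap-based distances saturate, the sketch below computes the chord distance and Nei et al.'s distance for a single locus. Box B is not reproduced here, so the formulas are the single-locus forms as they are usually stated (for example in Takezaki and Nei 1996) and should be read as an assumption; multi-locus values are normally averages over loci, and the allele sharing distance is omitted. Allele frequencies are held in dictionaries keyed by repeat count, and the example populations are hypothetical.

```python
from math import pi, sqrt


def overlap(x, y):
    """Sum over shared alleles of sqrt(x_i * y_i); 1 when identical, 0 when disjoint."""
    return sum(sqrt(x[a] * y[a]) for a in x.keys() & y.keys())


def chord_distance(x, y):
    """Single-locus chord distance: (2 / pi) * sqrt(2 * (1 - overlap))."""
    s = min(1.0, overlap(x, y))  # guard against rounding slightly above 1
    return (2 / pi) * sqrt(2 * (1 - s))


def nei_da(x, y):
    """Single-locus form of Nei et al.'s distance: 1 - overlap."""
    return 1 - overlap(x, y)


# Hypothetical allele-frequency distributions (repeat count -> frequency).
pop_a = {10: 0.5, 11: 0.5}
pop_b = {11: 0.5, 12: 0.5}   # shares one allele with pop_a
pop_c = {20: 0.5, 21: 0.5}   # no shared alleles with pop_a
pop_d = {40: 0.5, 41: 0.5}   # even further away, still no shared alleles

for name, other in (("b", pop_b), ("c", pop_c), ("d", pop_d)):
    print(name, round(chord_distance(pop_a, other), 4), round(nei_da(pop_a, other), 4))
```

Once two populations share no alleles, both distances sit at their maximum however far apart the allele sizes are, which is the flattening described above.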

Three distances have recently been developed specifically for application to microsatellite evolution assuming the SSM (see also Chakraborty and Nei 1977). Goldstein et al.'s (1995a) and Slatkin's (1995a) distance, ASD, described in Box C, increases linearly with time under the unconstrained SSM model. The main difficulty with this distance is its high variance, partly due to its dependence on the variation within populations. In addition to this, because population sizes are likely to vary among taxa in any phylogeny, the inclusion of the intra-population variance term 2(N-1)β obscures the relationship between separation time and the observed value of ASD. The intra-population variance term also makes ASD very sensitive to fluctuations in population size in a manner similar to the geometric distances described above (Nauta and Weissing 1996, Takezaki and Nei 1996). Goldstein et al.'s (1995b) distance, (δμ)², was specifically designed to overcome problems associated with the variance term (see Box C). This distance increases linearly with time at the same rate as ASD, but has a lower variance, and thus seems to be always preferable for both phylogenetic reconstruction and estimation of relative separation times. Although independence of this distance from population size was derived under the assumption of constant population size, computer simulations show that the standardisation achieved by averaging scores within populations results in a distance that is extremely robust to fluctuations in population size, perhaps more so than any other distance defined in terms of allele frequencies (Takezaki and Nei 1996).
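
The relationship between ASD and the distance of Goldstein et al. (1995b) can be seen in a short sketch. Box C is not reproduced here, so the code assumes the definitions as they are usually given: ASD is the average squared difference in repeat count between an allele drawn from one population and an allele drawn from the other, and the second distance is the squared difference between the two mean repeat counts. Sample-size corrections are omitted and the example frequencies are hypothetical.

```python
def mean_size(freqs):
    """Mean repeat count of an allele-frequency distribution (size -> frequency)."""
    return sum(size * f for size, f in freqs.items())


def variance_size(freqs):
    """Variance in repeat count within a population."""
    m = mean_size(freqs)
    return sum(f * (size - m) ** 2 for size, f in freqs.items())


def asd(x, y):
    """Average squared difference between alleles drawn from x and from y."""
    return sum(fx * fy * (i - j) ** 2 for i, fx in x.items() for j, fy in y.items())


def delta_mu_squared(x, y):
    """Squared difference between the mean repeat counts of x and y."""
    return (mean_size(x) - mean_size(y)) ** 2


# Hypothetical populations (repeat count -> frequency).
pop_x = {10: 0.5, 12: 0.5}
pop_y = {14: 0.25, 15: 0.5, 16: 0.25}

print("ASD                   :", asd(pop_x, pop_y))
print("squared mean difference:", delta_mu_squared(pop_x, pop_y))
print("within-population part :", variance_size(pop_x) + variance_size(pop_y))
```

Under these definitions ASD is exactly the squared mean difference plus the two within-population variances, which is why removing the variance term leaves a distance that no longer depends directly on population size.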

Shriver et al.'s (1995) stepwise distance, Dsw, is similar in form to ASD (with variance correction) but with an absolute value operation replacing the square function on the difference between allele sizes (Box C). Dsw was developed through heuristic argument and an explicit dynamic has not been derived. Nevertheless, the exact linearity of ASD implies that Dsw cannot be linear, an inference borne out by computer simulations (Shriver et al. 1995). Under some circumstances, however, Dsw has a lower coefficient of variation, and may therefore be preferred for phylogenetic reconstruction. Under the conditions analysed by Takezaki and Nei (1996), however, this was rarely the case. In addition, this distance is extremely sensitive to variation in population size.
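
A minimal sketch of a Dsw-style calculation follows. It assumes the form suggested by the description above: the average absolute difference in repeat count between populations, minus the mean of the two within-population averages. Box C is not reproduced here, so details such as sample-size corrections may differ from the published definition, and the example frequencies are hypothetical.

```python
def mean_abs_difference(x, y):
    """Average absolute difference in repeat count between alleles from x and y."""
    return sum(fx * fy * abs(i - j) for i, fx in x.items() for j, fy in y.items())


def dsw(x, y):
    """Between-population average minus the mean of the within-population averages."""
    within = (mean_abs_difference(x, x) + mean_abs_difference(y, y)) / 2
    return mean_abs_difference(x, y) - within


# Hypothetical populations (repeat count -> frequency).
pop_x = {10: 0.5, 12: 0.5}
pop_y = {14: 0.25, 15: 0.5, 16: 0.25}

print("Dsw-style distance:", dsw(pop_x, pop_y))
```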

Although ASD and (δμ)² were derived assuming a strict stepwise mutation model, they are in fact considerably more general. In particular, if mutation sizes vary, the expectation of ASD is altered only by replacing the overall mutation rate, 2β, with the product of the mutation rate and the variance of mutational step sizes (Slatkin 1995a). Kimmel et al. (1996) also note that the linearity of ASD and (δμ)² is independent of the assumptions of both single-step sizes and symmetry in the mutation rate, the latter point being of particular significance given recent demonstrations of asymmetry as described above. These results suggest that the two greatest concerns for extension of microsatellite loci to phylogenetic reconstruction of more distantly related organisms are constraints on allele sizes and the longevity of the mutational properties of microsatellite loci (Garza and Freimer 1996). The rate of degradation of microsatellite loci requires careful comparative analysis. As described above, it is essential to sequence microsatellites in all taxa to confirm that the repeated motifs have not been interrupted by imperfections. Assuming that microsatellites with sufficient longevity can be identified, the restrictions on divergence imposed by range constraints must be accounted for.
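
Why linearity survives multi-step and asymmetric mutation can be illustrated in a deliberately simplified setting: one lineage per population, no genetic drift, and no range constraint. Under these assumptions (which are ours, not those of the cited derivations) the expected squared difference between two lineages descended from a common ancestor equals twice the number of generations times the per-generation variance of the change in repeat count, whatever the shape of the step distribution. The sketch below checks this by exact convolution; the step probabilities are hypothetical and exaggerated so that the example runs quickly.

```python
def step_variance(step_probs):
    """Per-generation variance of the change in repeat count."""
    mean = sum(k * p for k, p in step_probs.items())
    return sum(p * (k - mean) ** 2 for k, p in step_probs.items())


def convolve(dist, step_probs):
    """Advance the distribution of repeat counts by one generation."""
    out = {}
    for size, p in dist.items():
        for k, q in step_probs.items():
            out[size + k] = out.get(size + k, 0.0) + p * q
    return out


def expected_squared_difference(step_probs, generations, start=20):
    """Exact E[(X - Y)^2] for two independent lineages from the same ancestor."""
    dist = {start: 1.0}
    for _ in range(generations):
        dist = convolve(dist, step_probs)
    return sum(p * q * (i - j) ** 2 for i, p in dist.items() for j, q in dist.items())


# Hypothetical asymmetric, multi-step mutation distribution per generation:
# gains are more likely than losses, and two-step gains occur.
steps = {0: 0.90, 1: 0.06, -1: 0.02, 2: 0.02}

for t in (25, 50, 100):
    exact = expected_squared_difference(steps, t)
    predicted = 2 * t * step_variance(steps)
    print(f"t = {t:3d}  exact: {exact:.4f}  2*t*variance: {predicted:.4f}")
```

The asymmetry shifts both lineages in the same direction and so cancels in the difference, leaving only the variance of the step distribution to set the rate of divergence.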

If the number of repeats attainable by microsatellite loci is restricted, the accuracy and linearity with time of all distances are strongly affected (Goldstein et al. 1995a, Feldman et al. 1996; Pollock et al. 1996). To statistically adjust for the effects of range constraints, our group has introduced a number of new distances (See Box D), including a log-based distance denoted Dl (Feldman et al. 1996), a least squares distance denoted Dls (Pollock et al. 1996), and a generalized least squares distance denoted Dgls (Pollock et al. 1996). Although the model upon which these distances are based is overly simplified (the range terminates in reflecting boundaries at the upper and lower ends; there is no asymmetry or dependence in the mutation rate with repeat number), we suspect that some of the results are significantly more general than these assumptions suggest. In particular, much of the behavior of the distances can be attributed to the fact that the length of time over which a locus will accurately reflect separation times decreases both as the mutation rate increases and as the number of attainable states decreases. This basic interaction between the range and the mutation rate is likely to come into effect for any reasonable mutation model.
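
The interaction between range and mutation rate can be illustrated with a small exact calculation. The sketch follows a single lineage per population (no drift) under a symmetric single-step model on a bounded range, and treats a mutation that would leave the range as simply not occurring; this is one simple way of implementing a reflecting boundary and is an assumption here, not necessarily the rule used in the cited models. The range, mutation rate and starting allele are hypothetical.

```python
def step_bounded(dist, beta, lo, hi):
    """One generation of a symmetric single-step walk confined to lo..hi."""
    out = {s: 0.0 for s in range(lo, hi + 1)}
    for s, p in dist.items():
        up = s + 1 if s < hi else s      # a blocked gain leaves the allele unchanged
        down = s - 1 if s > lo else s    # a blocked loss leaves the allele unchanged
        out[up] += beta * p
        out[down] += beta * p
        out[s] += (1 - 2 * beta) * p
    return out


def expected_squared_difference(generations, beta=0.05, lo=1, hi=10, start=5):
    """E[(X - Y)^2] for two independent lineages from the same ancestral allele."""
    dist = {start: 1.0}
    for _ in range(generations):
        dist = step_bounded(dist, beta, lo, hi)
    mean = sum(s * p for s, p in dist.items())
    variance = sum(p * (s - mean) ** 2 for s, p in dist.items())
    return 2 * variance


BETA = 0.05          # hypothetical, exaggerated rate in each direction
N_STATES = 10        # hypothetical number of attainable repeat counts
ceiling = (N_STATES ** 2 - 1) / 6   # twice the variance of a uniform distribution

for t in (10, 100, 1000, 5000):
    constrained = expected_squared_difference(t, beta=BETA)
    unconstrained = 2 * t * 2 * BETA
    print(f"t = {t:4d}  constrained: {constrained:7.3f}  unconstrained: {unconstrained:7.1f}")
print("ceiling:", ceiling)
```

The constrained distance tracks the unconstrained one only while few lineages have reached a boundary, and then flattens at a ceiling set by the number of attainable states; a higher mutation rate or a narrower range shortens the useful period.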

Since the usefulness of a locus for assessment of deep divergence times depends upon the locus range and mutation rate, it is critical to accurately assess these parameters. Pollock et al. (1996) developed methods to assess range constraints under Feldman et al.'s (1996) and Nauta and Weissing's (1996) reflecting boundary SSM model. The obvious estimator, the difference between minimum and maximum allele sizes, is extremely inaccurate when only one or a few independent populations are available. Corrections developed have increased accuracy (Pollock et al. 1996), but assume that the populations are sufficiently diverged that mean allele sizes are no longer correlated. Reasonable adjustments for phylogenetic relatedness may eventually be developed for these estimators, but in the meantime it is probably best to make estimates based on well-diverged populations for application to closer populations (assuming the microsatellites have not been degraded). Under the conditions analysed in Pollock et al. (1996), relative mutation rates estimated from the allelic variance (Nauta and Weissing 1996, Feldman et al. 1996) were somewhat more accurate than when estimated via the least squares distance methods they introduced. The variance-based methods may be affected more by population size fluctuations and selection, however, and the least squares methods may improve faster as more taxa are introduced. It will be particularly interesting to see how well these different methods perform on real data sets, in particular whether they succeed in dramatically improving predictions concerning long-term microsatellite locus evolution, and whether they can be used to effectively partition loci according to usefulness in addressing particular phylogenetic questions.

For very recently separated populations, the distances which make use of the product of allele frequencies shared between populations are most accurate (although not linear with time), and they are more accurate when allelic variance, and thus the mutation rate, is high (Takezaki and Nei 1996). For more distant populations, these distances become much less accurate than the distances which make use of the degree of separation between alleles. The (δμ)² based distances are not sensitive to levels of allelic variance in the absence of range constraints, but with range constraints, loci with the lowest mutation rates will remain accurate longer. Thus, when extending phylogenetic analysis using microsatellites beyond the subspecies level, it is preferable to select those loci with lower allelic variation, exactly the opposite of the preference for studying differentiation of subpopulations within species.

Discussion

With all the difficulties itemized, we wish to emphasize that for certain phylogenetic problems microsatellites remain the most promising approach and it seems well worth the effort of improving methods for their analysis. For example, a method of "genetic absolute dating" based on microsatellites has recently been introduced (Goldstein et al. 1995b). The novelty of this method is that the expected rate of differentiation can be estimated by studying microsatellite mutations in pedigrees, which removes the requirement of rate calibration using uncertain paleontological dates. Moreover, since microsatellite analyses can easily collect information from a number of different genomic regions it is possible to model the divergence of populations as opposed to the genealogical history of particular genomic regions. For some applications, such as the study of human evolutionary history, inferences about population differentiation are of particular importance. In addition to these advantages, the rapid rate of microsatellite evolution also means that reliable information may be gained even for taxa so closely related that it would be impractical to collect enough sequence information to work out their relationships.

For these reasons it is important to determine whether the incorporation of more details about microsatellite behavior can lead to more accurate inferences. We are especially encouraged in this regard by a simple comparison with the use of sequence variation. Just as genes with evolutionary rates and properties appropriate to a particular phylogenetic problem must be carefully selected, we might expect that microsatellite loci appropriate to particular phylogenetic problems must be screened and selected.

A great deal of empirical work remains to be done in evaluating how best to employ microsatellites in phylogeny reconstruction. The methods for assessing range constraints and mutation rates need to be applied to real data from different taxonomic groups and types of microsatellite loci. Clustering loci with similar ranges and mutation rates could be very useful, but the statistical considerations involved in such procedures remain to be elucidated. One of our major goals here is to encourage the collection of the data necessary to estimate the relevant details of microsatellite behavior, that is, data on both the length and sequence structure of homologous microsatellites in sets of related species. Once multiple observations for different types of microsatellites are available it will be possible to determine whether a priori characteristics (e.g. motif size/type) correlate with key features such as range constraints, longevity, and mutation rates.

Acknowledgements

We thank Dr. Stephen J. O'Brien and members of the Laboratory of Genomic Diversity for encouraging us to write this review. DDP is a Hitchings-Elion Fellow of the Burroughs Wellcome Fund. A computer program written by Minch et al. which calculates statistics for microsatellite data is available from L. L. Cavalli-Sforza's pages at http://crick.stanford.edu or by writing to E. Minch at minch@crick.stanford.edu.