(C1) Sum[Sum[(i-j)² Xi Yj]]
where Xi and Yj are the frequencies in populations X and Y of alleles with i and j repeats, respectively. (δμ)² and Dsw take the form
(C2) Sum[Sum[f(i-j) Xi Yj]] - {Sum[Sum[f(i-j) Xi Xj]] + Sum[Sum[f(i-j) Yi Yj]]}/2
where for Dsw, f is the absolute value function, and for (δμ)², f is the square function. (δμ)², of course, conveniently simplifies to its familiar form, (δμ)² = (Mx-My)², where Mx and My are the means of allele sizes in populations X and Y, respectively. ASD and (δμ)² were derived analytically under the SSM model, and both are expected to increase linearly with time; Dsw was derived heuristically and does not increase linearly. Both (δμ)² and Dsw equal zero when the time separating two populations is zero, and increase with time. The expectation of ASD, however, is 2βt + Vx + Vy, where t is the time since separation, β is the stepwise mutation rate in each direction, and Vx and Vy are the variances in allelic score of populations X and Y, respectively. Thus, ASD does not begin at zero, and the inclusion of the variance terms adds significantly to the variance of the distance measure.
In the preceding discussion it is assumed that sampling is complete. In practice, however, we must estimate (δμ)², which describes populations X and Y, based on samples from these populations. Equation (C2) for (δμ)² can be rewritten as (δμ)² = ASD - Vx - Vy. Goldstein et al. (1995a) showed that ASD calculated for the sample is in fact an unbiased estimator of ASD for the population. It is well known, however, that sample variances are not unbiased estimators of the parametric variance. Therefore, an unbiased estimator of (δμ)² is easily obtained by substituting unbiased estimators for Vx and Vy. That is, in practice (δμ)² can be calculated as:
(C3) (δμ)² = ASD - V~x - V~y
= Sum[Sum[(i-j)² Xi Yj]] - Nx/(Nx-1) Sum[(i-Mx)² Xi] - Ny/(Ny-1) Sum[(i-My)² Yi]
where V~ denotes the usual unbiased estimate of the parametric variance based on the variance in the sample; Xi and Yj now represent the frequencies of alleles i and j in the samples from populations X and Y, Mx and My are the sample means, and Nx and Ny are the numbers of alleles sampled from each population. Equation (C3) shows that the difference between the unbiased estimate of (δμ)² and the squared difference between sample means (that is, (δμ)² calculated directly for
the sample) will be very slight unless both the sample size and the level of differentiation are small. The amount of differentiation matters because the bias enters only through the estimation of the variance. If the expected difference between the means is very small relative to the variance, even a change of a few percent in the variance will be large as a proportion of the distance. In those cases where differentiation is very slight relative to the population variance, however, non-stepwise distances should not be used (see above). In practice, therefore, we do not expect that the bias-corrected version of (δμ)² will be noticeably different from the uncorrected version.
Zhivotovsky and Feldman (1995) derived the variance of (δμ)² under the unbounded SSM model: it is 2(2βt)², that is, twice the square of its expectation, 2βt. Thus, the expected standard deviation for (δμ)² averaged over equivalent loci is the square root of two times the expectation of (δμ)², divided by the square root of the number of loci. This makes it clear why many loci are needed to achieve an acceptable degree of accuracy with these measures. We would predict that reasonable reliability would generally require between 50 and 100 loci as a minimum. The requirement of a large set of loci becomes even more pronounced when one considers the variation among loci in mutation rate and other basic properties. Finally, the large drift variances of these genetic distance measures necessitate that empirical confidence intervals for them be derived by bootstrapping over loci. It is never appropriate to use bootstrapping over individuals as a substitute for bootstrapping over loci in assessing the reliability of these distances.
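The quantities discussed in this appendix can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each sample is given as a plain list of repeat counts, one entry per sampled allele, and that a per-locus distance has already been computed for each locus before bootstrapping. All function names are hypothetical. The relative_sd helper further assumes the reading that the single-locus variance of (δμ)² is twice the square of its expectation, so that the standard deviation of an average over L equivalent loci is sqrt(2/L) times the expectation.

```python
import math
import random
import statistics


def asd(x, y):
    """Average squared difference in repeat number over all X-Y allele pairs, as in (C1)."""
    return sum((i - j) ** 2 for i in x for j in y) / (len(x) * len(y))


def delta_mu_sq(x, y):
    """Uncorrected (delta mu)^2: squared difference between the two sample means."""
    return (statistics.fmean(x) - statistics.fmean(y)) ** 2


def delta_mu_sq_unbiased(x, y):
    """Bias-corrected (delta mu)^2 as in (C3): ASD minus the unbiased variance estimates."""
    # statistics.variance divides by N-1, which equals N/(N-1) times the
    # frequency-weighted variance Sum[(i-M)^2 Xi] used in (C3).
    return asd(x, y) - statistics.variance(x) - statistics.variance(y)


def relative_sd(n_loci):
    """SD of the locus-averaged distance as a fraction of its expectation (assumed sqrt(2/L))."""
    return math.sqrt(2.0 / n_loci)


def bootstrap_ci_over_loci(per_locus, n_boot=1000, alpha=0.05, seed=0):
    """Percentile interval for the mean distance, resampling loci with replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n = len(per_locus)
    means = sorted(
        statistics.fmean(rng.choices(per_locus, k=n)) for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi
```

For the deliberately tiny, made-up samples x = [10, 11, 12] and y = [12, 13, 14, 15], delta_mu_sq gives 6.25 and delta_mu_sq_unbiased gives about 5.5. The gap is this large only because the samples are so small. Under the stated assumption, relative_sd(50) is about 0.2 and relative_sd(100) is about 0.14, which is in line with the suggestion that 50 to 100 loci are a minimum. The bootstrap resamples entries of the per-locus list, not individuals within a locus.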