Microsat
Brief instructions for using microsat, the microsatellite distance program.
Instructions for obtaining microsat
Microsat source code and executables are available here.
Microsat data is available here.
The current version includes the following new features:
- New menu structure separating statistical descriptions from distance selection and main control menu. New statistical descriptions include:
- More diversity indices: Fst (by variance and by heterozygosity methods), standardized Rst¹, and average and total values for heterozygosity, variance, number of alleles, allele size range, maximum allele size, and entropy of allele size distribution. These can be calculated by locus or by taxon.
- Checks for missing data, anomalous frequencies, incommensurable taxa, and taxon-specific alleles.
- Crosstabulation of the number of data values found in each taxon-locus combination.
- Frequency distribution: mean, median, mode, minimum, and maximum allele size, and the frequency of each allele, by locus or by taxon.
- New distance measures:
- Fst: Reynolds' distance.
- Rst: standardized Rst¹.
- D1: average square distance.
- Dsw: stepwise weighted genetic distance.
- Dad: absolute difference distance (like (Delta Mu)² except with the absolute value instead of the square of the difference).
- More options in specifying distance calculations:
- The distances Gst, Dps, Dfs, and Dkf can be calculated as -ln(ratio) or as 1-ratio.
- The distances D1 and Dkf can be adjusted by subtracting pairwise averages.
- Distance matrices can be calculated separately for each locus.
- A check for exhaustive bootstrapping (used for a small number of loci).
- A limited help facility, which gives a short description of each menu option, has been included.
Some helpful researchers have volunteered executable versions of microsat, compiled for various hardware/OS combinations. [Disclaimer: this does not constitute a promise on their part or on mine to support these executables or to fix problems; they are provided at this site only to simplify distribution and to make the program easier to use for those who have no way of compiling the source.]
You can also get sample data and output files to test your version.
¹ The algorithm used is the unbiased standardized estimate from Simon Goodman, whose RstCALC program is available from a website. Goodman's method will be included in the distance page once the paper describing its derivation has been accepted.
Instructions for running microsat
The microsat program is written in ANSI C and is self-contained, i.e., it requires only the standard routines in stdio, stdlib, math, string, and ctype. Selection of options controlling the program is via simple tty menus. The format of the input data is
<taxon> <locus> <repeatlength> <frequency>
(where repeatlength can be number of repeats or nucleotide length, and frequency is the number of occurrences, i.e., absolute frequency) or
<taxon> <locus> <repeatlength>
(with implied absolute frequency=1) for m loci and n taxa. A "taxon" can be either a population or an individual. The program ignores blank lines and comment lines (any that start with '%'). Example data and output files are available. The following options are currently supported:
- Calculations can take various aspects of the allele sizes into account:
- Allele sizes in input data can be either repeat scores or nucleotide counts. The default is to assume nucleotide counts.
- Repeat lengths greater than two can be used. The default is to assume dinucleotide repeats.
- Whether alleles are specified as nucleotide counts (repeat length>=2) or as repeat scores (repeat length=1), the data from each locus are checked to see if the minimum difference matches the repeat length. If not, a warning is raised and you are asked if the observed or the specified repeat length should be used for that locus.
- Input files can be checked for missing data (taxon-locus combinations with no data), anomalous frequencies (taxon-locus combinations with odd-numbered frequencies), incommensurable taxa (taxon pairs with no loci in common), or taxon-specific alleles (alleles occurring in only one taxon, or in only one taxon of a taxon pair).
- Outliers may be detected and eliminated by either of two methods; each has a default value for the multiplicative coefficient, which may be overridden.
- Hinge method, in which outliers are defined as those less than
(lower quartile) - x * (interquartile range),
or greater than
(upper quartile) + x * (interquartile range).
This is suitable for nonnormal distributions. The default value for x is 4.0.
- Sigma method, in which outliers are defined as those less than
µ - x * σ,
or greater than
µ + x * σ,
where µ and σ are the mean and standard deviation of the allele sizes for a locus. This is suitable for normal distributions. The default value for x is 3.0.
- The distance measure to be calculated may be any one of the following:
- D1: average square
- D1*: adjusted average square
- Dad: absolute difference
- Fst
- Dkf: -ln(kinship coefficient)
- Dkf': 1-kinship coefficient
- Dkf*: -ln(adjusted kinship coefficient)
- Dkf*': 1-adjusted kinship coefficient
- Ddm: delta mu squared
- Gst
- Dps: -ln(proportion of shared alleles)
- Dps': 1-proportion of shared alleles
- Rst: standardized Rst¹
- Dfs: -ln(fuzzy set similarity)
- Dfs': 1-fuzzy set similarity
- Dsw: stepwise weighted genetic distance
- If distance measures are being calculated, the calculation can be repeated over multiple bootstraps. In this case, the distance matrix reported will be the average over all bootstraps, and it will be followed by a matrix of the standard errors of the distances.
- Distance measures can be calculated separately for each locus.
- Estimates of duration of linearity can be calculated for each locus, and averaged over all loci. If linearity is being estimated:
- a value for the mutation rate can be specified. If it is not specified, a default rate of 0.00056 is used.
- an estimate of the primer error (size of the region flanking the microsatellite) can be entered and corrected for. The default is to assume no primer error (i.e., 0 nucleotides).
- Values for Fst (by variance and by heterozygosity methods), standardized Rst¹, and average and total values for heterozygosity, variance, number of alleles, allele size range, maximum allele size, and entropy of allele size distribution can be calculated by locus or by taxon.
- Crosstabulation of the number of data values found in each taxon-locus combination.
- Frequency distribution: mean, median, mode, minimum, and maximum allele size, and the frequency of each allele, by locus or by taxon.
Suggestions for improvements to the program, and offers of binaries for distribution, should be mailed to Eric Minch.
If you wish to test your compiled version of the program, the two sample data files test.dat1 and test.dat2 yield the results shown in test.out. The two sample data files contain exactly the same data; the first is in the first input format, i.e., with explicit absolute frequencies, while the second is in the second input format, i.e., with no explicit frequencies.
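As an illustration of the two input formats, the following C sketch parses a single line of input. It is not the parser used by microsat: the `Record` type, the `parse_record` function, and the 63-character name limit are hypothetical, and it assumes that taxon and locus names contain no whitespace and that allele sizes and frequencies are integers. It also skips leading whitespace before checking for '%', which is slightly more lenient than the rule stated above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define NAME_MAX_LEN 64  /* hypothetical limit on taxon and locus names */

typedef struct {
    char taxon[NAME_MAX_LEN];
    char locus[NAME_MAX_LEN];
    int  size;       /* repeat score or nucleotide count */
    int  frequency;  /* absolute frequency; 1 when the field is omitted */
} Record;

/* Parse one input line.
 * Returns 1 if a record was read, 0 for a blank or '%' comment line,
 * and -1 if the line has fewer than three usable fields. */
int parse_record(const char *line, Record *out)
{
    int nfields;

    assert(line != NULL && out != NULL);

    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || *line == '%')
        return 0;

    nfields = sscanf(line, "%63s %63s %d %d",
                     out->taxon, out->locus, &out->size, &out->frequency);
    if (nfields == 3) {
        out->frequency = 1;   /* second format: implied frequency of 1 */
        return 1;
    }
    return (nfields == 4) ? 1 : -1;
}
```

A caller would typically read the file line by line with `fgets` and pass each line to `parse_record`. Under these assumptions, a line with an explicit frequency of 3 carries the same information as three lines in the second format with the frequency omitted.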