Microsatellite analyzer (MSA)
4.05
Description
of genetic distances
Background:
Input files can be generated using spreadsheet software, such as Excel, in which the data are arranged in either one column per locus or two columns per locus (sample input file). As MSA was mainly designed for microsatellites, data should be entered as the PCR product size. Missing data can be indicated by either an empty cell or a negative value, but do not enter '0', as this would be treated as a PCR product of zero bases. Note that the two-column format provides some additional input options (see below).
Please note that if Excel is used to generate an MSA input file, it has to be saved as a "TAB DELIMITED" file.
Size constraints:
The number of individuals, populations and loci is constrained only by the available memory.
Generation of an MSA input file:
· The third column allows populations to be grouped, and some analyses will also be performed for the specified groups. Please note that this column must not remain empty. In the absence of grouping, give the same number to all populations. Furthermore, only consecutive group numbers are allowed, but groups can be assigned in any order.
· The first two rows provide information about each locus. The first row specifies the repeat type (1, 2, 3, etc.). The second row indicates the length of the sequence flanking the microsatellite (in bp); this information allows MSA to calculate the number of repeats from your PCR product size. If no information is provided in the second row (empty cell), MSA does not calculate the variance in repeat number, but the inferred repeat type is specified in the output file.
· The third row contains the name of the microsatellite locus. In the two-column format, MSA allows two different names for the same locus (each entered in one cell).
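The relation between PCR product size, flanking length and repeat type can be sketched as follows. This is a minimal illustration that assumes the repeat number is simply the product size minus the flanking length, divided by the repeat type; the function name `repeat_number` is hypothetical, and the exact rounding used by MSA is not stated in this manual.

```python
def repeat_number(product_size: int, flank_length: int, repeat_type: int) -> float:
    """Infer the number of repeats from a PCR product size (all lengths in bp).

    Assumption: repeats = (product size - flanking length) / repeat type.
    """
    if product_size < 0:
        # Negative values mark missing data in the input file.
        raise ValueError("negative sizes denote missing data")
    if repeat_type <= 0:
        raise ValueError("repeat type must be a positive integer")
    return (product_size - flank_length) / repeat_type


# A 150 bp product with 110 bp of flanking sequence at a dinucleotide locus.
print(repeat_number(150, 110, 2))
```

A non-integer result would hint that the locus does not follow the assumed step size, which is the kind of discrepancy the error report is meant to flag.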
Remarks:
For compatibility with PHYLIP, population labels are limited to 8 characters. For individual-based distances, only the first 4 characters of the population label are used to label individuals. Therefore, it is highly advised that the first 4 characters differ among population labels.
If you are formatting your data with Excel (or other spreadsheet software), please make sure that you save it in the "TAB DELIMITED" format. Other formats will not be accepted by MSA.
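The two label rules (at most 8 characters, and distinct first 4 characters) can be checked before running MSA. The helper below is only a sketch; `check_labels` is a hypothetical name and is not part of MSA.

```python
def check_labels(labels):
    """Return a list of problems found in the population labels."""
    problems = []
    first_seen = {}  # maps a 4-character prefix to the first label using it
    for label in labels:
        if len(label) > 8:
            problems.append(f"{label!r} is longer than 8 characters")
        prefix = label[:4]
        if prefix in first_seen and first_seen[prefix] != label:
            problems.append(
                f"{label!r} and {first_seen[prefix]!r} share the prefix {prefix!r}"
            )
        first_seen.setdefault(prefix, label)
    return problems


for problem in check_labels(["Vienna", "Vienna2", "Copenhagen"]):
    print(problem)
```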
Generation of an MSA IM input file (for IM input file conversion only):
· Create the normal MSA input file.
· Make a copy with a different name.
· Open the copy and remove all data (population names and alleles) so that only the locus information is left.
· Write 'inheritance' as one "individual" name and then, below the (first) locus names, enter the desired inheritance scalar (e.g. 1 for autosomes, 0.75 for X chromosomes, 0.25 for Y chromosomes or mtDNA, ...).
· Write 'mutation' as another "individual" name and do the same with the mutation rates (as mutation rate per year).
· Save the file again as "TAB DELIMITED".
· In the MSA menu, go to the submenu '(c)...Data conversion settings', choose the option '(f) ... Locus information for IM : No' and enter the name of the file containing the inheritance information.
· For an example file with the inheritance information, see "testdata.im".
Starting the program:
Option 1: double clicking
Windows: double click to start the program and follow the instructions on the screen.
OS X: double click to start the program.
Load data by selecting option i) in the start menu of MSA. Rather than typing the name of the input file, drag and drop the input file onto the MSA window. By doing so you automatically specify the path (location of your input file).
Option 2: command line (OS X; works similarly on Linux)
open the 'Terminal' application (located in: Applications>Utilities)
type: cd (note that 'cd' is followed by a blank)
drag the folder containing the MSA executable and input file onto the Terminal window
hit the 'return' key
type: ./msa
hit the 'return' key again
The normal MSA menu will appear and the input file can be specified by typing its name. Note: the path does not need to be specified, as the file is located in the same folder as the executable.
Available
Functions:
Starting with MSA version 4.0, it is also possible to use command line arguments instead of the MSA menu.
Starting MSA without any argument opens the MSA menu. Starting MSA with arguments runs MSA without opening the menu. If you are not sure how to use the arguments, start MSA without any argument and MSA will behave as normal.
Explanation of the following notation:
Statements in "[]" are optional and need not be given; MSA will use the standard settings instead. If a statement is written like [g[lobal]], you can use either 'global' or only 'g' to activate the function.
YES/NO means you have to choose one of the two possibilities, YES or NO. The same is true for all other such statements (ON/OFF, 1/LN).
XXXX means that you have to enter a number.
Please write the commands exactly as they are given below (lower/upper case), otherwise the result cannot be predicted.
For more information about the functions themselves and the output files, please read the chapters below.
MSA Command Line Options:
► i: "FILENAME"
Specifies the input file for MSA (essential).
► fim: "FILENAME"
This file contains the additional information required for conversion into the IM format. It has nearly the same format as the MSA input file. Please look at the sample file and at the description of the IM input file.
► fst: [g[lobal]] [p[airwise]] [beta] [n=XXXX] [HW[=YES/NO]] [RD[=ON/OFF]] [locus]
Starts the F_{ST} analysis. It is essential to give [global] and/or [pairwise], for global F_{ST} or pairwise F_{ST} between populations.
[beta] Calculates the moment estimator β values of the data.
[n=XXXX] Sets the number of permutations; the standard is 10000 (n=1 means no permutation).
[HW[=YES/NO]] Hardy-Weinberg equilibrium is by default NO (HW=NO), so use HW or HW=YES when you assume your data to be in Hardy-Weinberg equilibrium.
[RD[=ON/OFF]] Random discarding for inbred lines before calculating F_{ST} is by default OFF (RD=OFF), so use RD or RD=ON to activate it.
[locus] For pairwise F_{ST}, this option gives the F_{ST} values for each population pair and locus.
► "dist:"/"dist " [p[opulation]] [i[ndividual]] [...] [n=XXXX] [calc=1/ln] [NEXUS] [locus]
Calculates standard genetic distances. It is essential to activate at least [population] and/or [individual].
[...] You have to specify at least one distance method to get results; use one of the following codes:
POSA ... D_{ps} proportion of shared alleles
FSS ... D_{fs} fuzzy set similarity
ADA ... D_{ad} absolute difference algorithm
KSC ... D_{kf} kinship coefficient
DMS ... D_{dm} (δµ)^{2}
ASD ... D1 average square
CAS ... D_{c} Cavalli-Sforza and Edwards chord distance (1967)
DAN ... D_{a} Nei's chord distance (1983)
DSG ... D Nei's standard genetic distance (corrected for sample size)
[n=XXXX] Number of bootstraps. The standard is no bootstrapping (n=1).
[bl=XXXX] Specifies the number of loci for bootstrapping the data set. The standard is the same number as in the original data set (bl=1).
[calc=1/ln] Some distances (see below) can be calculated in two ways: 1-factor (standard, calc=1) or -ln(factor) (calc=ln). (Please don't use calc alone!)
[NEXUS] NEXUS output files can be generated for individual-based distances without bootstrapping.
[locus] For pairwise distances, this option gives the distance values for each population pair and locus.
► format: [msvar] [arlequin] [migrate] [im]
To convert your data into additional formats, use these options:
[msvar] Writes MSVAR (Beaumont) input files, one for each population separately.
[arlequin] Writes ARLEQUIN input files.
[migrate] Writes input files for MIGRATE.
[im] Creates the IM input files for all population pairs.
► hetrange: [n=XXXX] [RD[=ON/OFF]] [zero[=ON/OFF]] [l[inked]] [ind[ividual]]
Estimates the sampling variance of gene diversity (using bootstrapping).
[n=XXXX] Number of replications (the standard is 1000).
[RD[=ON/OFF]] Activates random discarding for inbred data. (Please use this only together with [individual].)
[zero[=ON/OFF]] The correction for gene diversity 0 and 1 is by default ON (zero=ON); to deactivate it, use zero or zero=OFF.
[l[inked]] By default all loci are assumed to be unlinked and are therefore resampled separately. Use linked to change this (all loci are then treated as completely linked). Also activate [individual], or you would resample chromosomes (always the first/second chromosomes together)!
[ind[ividual]] Samples individuals instead of chromosomes (loci can also be unlinked!).
► rare: [number]
Sets the reference population size used to calculate the allelic richness (rarefaction). (1: minimal sample size = STANDARD)
All descriptive statistics are given per population and locus.
· Allele counts and frequencies
· Number of chromosomes per population and locus
· Expected number of alleles for each locus (IAM [Ewens W.J., 1972] and SMM [Kimura and Ohta, 1975])
· Allelic richness [El Mousadik A. and Petit R.J., 1996; Hurlbert S.H., 1971; Krebs C.J., 1989] and the variance of this value
· Observed heterozygosity
· Expected heterozygosity (= gene diversity), corrected for sample size
· Estimate of theta based on gene diversity and the SMM mutation model
· Constrained gene diversity (0<H<1)
· Variance in PCR product size and in repeat number, corrected for sample size
· Shannon index of diversity [Shannon and Weaver] (Shannon entropy)
· Minimum, maximum, and mean allele length
· Minimum, maximum, and mean repeat number
· h_{s}, Nei's unbiased estimator for gene diversity [Nei M., 1987, p. 164] (this option is available only for "outbred" populations)
· F_{IS} (only for "outbred" populations) for each population
· Estimate of the sampling variance of gene diversity (using bootstrapping)
· Calculation of G_{ST} with the methods of Nei as well as Hamrick & Godt (for a comparison of these methods see [Culley et al. 2002])
· Calculation of G_{ST}' for both methods (see above) [Hedrick P.W., 2005]
Remarks:
· For inbred lines, expected heterozygosity, variance and number of alleles are determined for 200 randomly discarded data sets and the mean is reported. Note that this procedure can result in non-integer allele counts.
· MSA also provides most descriptive statistics for the specified groups by reporting the unweighted mean of all populations in a given group.
· Standard allelic richness is calculated based on the minimum number of sampled individuals (total individuals - missing data) for each locus. The drawback is that only populations within the same data set can be compared. To avoid this, it is possible to specify the number of individuals used for all loci. If this number is provided in a publication, the result can be compared across publications. Note, however, that for populations with a smaller sample size than specified, the allelic richness is not calculated ('n.d.').
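Allelic richness by rarefaction can be illustrated with the formula of Hurlbert (1971) as used by El Mousadik and Petit (1996): the expected number of distinct alleles in a subsample of fixed size. The sketch below is not MSA's code; the function name `allelic_richness` is hypothetical, and it assumes that the reference size `n` is counted in sampled gene copies.

```python
from math import comb


def allelic_richness(allele_counts, n):
    """Expected number of distinct alleles in a random subsample of n gene copies.

    allele_counts: observed count of each allele at one locus in one population.
    Returns None when the sample is smaller than n (reported as 'n.d.').
    """
    total = sum(allele_counts)
    if n > total:
        return None
    # For each allele: 1 - P(allele is absent from the subsample).
    return sum(1 - comb(total - count, n) / comb(total, n) for count in allele_counts)


print(allelic_richness([12, 5, 2, 1], 10))
```

Using the same `n` for every population is what makes the values comparable, which is why a sample smaller than `n` yields no value.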
· Proportion of shared alleles D_{ps}
· Fuzzy set similarity D_{fs}
· Kinship coefficient D_{kf}
· Absolute difference algorithm D_{ad}
· (δµ)^{2} D_{dm}
· Average square D1 (ASD); note that this distance is not available for trees of individuals
· D_{a}, Nei's chord distance (1983)
· D, Nei's standard genetic distance (corrected for sample size)
· D_{c}, Cavalli-Sforza and Edwards chord distance (1967)
Statistical support: A bootstrap option is available for all distances (select this option in the distance menu [n]). When using bootstrapping (and only then), there is the additional option of bootstrapping a user-specified number of loci. In that case, MSA will resample only N loci from the total data set.
D_{ps}, D_{fs}, D_{kf} and D can be calculated in two alternative ways: -ln(similarity factor) or 1-(similarity factor).
Remarks:
· All distances can be calculated between populations or between individuals. Make sure that the use of an individual-based distance makes sense.
· Genetic distances are calculated WITHOUT randomly discarding alleles for inbred data sets. If distances should be calculated from a randomly discarded data set, use the file "FILENAME_2C_RD" and run MSA again.
· Genetic distances which consider size differences between alleles [(δµ)^{2}, D1, D_{ad}] require that the repeat size is specified in the input file. The errorreport.txt file provides information on the minimum size difference among alleles. For loci evolving in a stepwise manner, this corresponds to the repeat size.
· Distance matrices between individuals without bootstrapping can be saved in NEXUS format instead of PHYLIP format. NOTE: The menu option to activate this function is invisible as long as distances between individuals are NOT activated; therefore, FIRST activate distances between individuals, and the menu option becomes visible.
· Global F_{ST}, F_{IS}, F_{IT} estimators [Weir, B.S. and Cockerham, C.C., 1984] across all loci
· Global F_{ST}, F_{IS}, F_{IT} estimators per locus
· F_{ST}, F_{IS}, F_{IT} estimators [Weir, B.S. and Cockerham, C.C., 1984] across loci and between population pairs
· F_{ST} values per locus and population pair
· P-values for F_{ST} (global and pairwise), with and without Bonferroni correction for the P-values of pairwise F_{ST}
· Estimation of the heterogeneity in F-statistics among loci (based on bootstrapping)
· Calculation of the moment estimator β for populations and between population pairs [Weir, B.S. and Hill, W.G., 2002]
· G_{ST} and G_{ST}' calculation is included in the calculation of global F_{ST}. As for F_{ST}, P-values are calculated as the proportion of permutations that result in a G_{ST} or G_{ST}' larger than or equal to the observed G_{ST} (G_{ST}') value.
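The permutation P-value described here, the proportion of permuted statistics that are larger than or equal to the observed one, can be written in a few lines. This is a generic sketch; `permutation_p_value` is a hypothetical helper, and the permuted values would come from recomputing the statistic on permuted data.

```python
def permutation_p_value(observed, permuted):
    """Proportion of permuted statistics that are >= the observed statistic."""
    if not permuted:
        raise ValueError("at least one permutation is required")
    hits = sum(1 for value in permuted if value >= observed)
    return hits / len(permuted)


# Four permuted G_ST values compared with an observed value of 0.05.
print(permutation_p_value(0.05, [0.01, 0.02, 0.05, 0.07]))
```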
Statistical support:
When loci are unlinked and in Hardy-Weinberg equilibrium, permutation of alleles should be selected. Otherwise, genotypes should be permuted, which is a conservative but not entirely satisfactory procedure [Michalakis and Excoffier (1996)].
For inbred lines a special option exists, which provides an estimate of the variance introduced by the random discarding of alleles. For each population consisting of inbred individuals, a user-specified number of discarded data sets is produced and the F_{ST} values are calculated for each data set. The statistical significance of the F_{ST} value for each data set is determined by permutation. MSA reports the mean F_{ST} and P-values.
P-values are given without Bonferroni correction (upper right, above the diagonal) and with Bonferroni correction (lower left, below the diagonal). Values above 10% (0.1) after Bonferroni correction are generally reported as 'n.s.'.
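How corrected values end up as 'n.s.' can be illustrated with the standard Bonferroni procedure, in which each P-value is multiplied by the number of tests. The sketch assumes that the number of tests equals the number of pairwise comparisons; the manual does not spell out the exact rule used by MSA, and `bonferroni` is a hypothetical helper.

```python
def bonferroni(p_values, threshold=0.1):
    """Bonferroni-correct P-values and replace values above the threshold by 'n.s.'.

    Assumption: the number of tests is the number of P-values passed in.
    """
    tests = len(p_values)
    corrected = []
    for p in p_values:
        adjusted = min(1.0, p * tests)
        corrected.append("n.s." if adjusted > threshold else adjusted)
    return corrected


print(bonferroni([0.01, 0.04, 0.5]))
```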
P-values for global F_{ST} estimates are calculated across all loci as well as for each locus separately. Note that random discarding of alleles (optional for inbred lines) is not possible for global F_{ST} estimates.
No statistical test is available for the moment estimator β.
The calculation of lnRH (see Kauer, Dieringer & Schlötterer, Genetics (2003)) and theta requires gene diversities larger than 0 and smaller than 1. In the case of an invariant population, MSA adds one different allele to the data set before calculating the gene diversity. For populations with gene diversity = 1, MSA duplicates one allele before calculating the gene diversity.
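The correction for gene diversities of 0 and 1 can be sketched as follows. The code assumes the usual sample-size-corrected gene diversity, n/(n-1) * (1 - sum of squared allele frequencies); which allele MSA adds or duplicates is not stated here, so the choices below (a new allele one unit larger than the largest, and a copy of the first allele) are purely illustrative.

```python
from collections import Counter


def gene_diversity(alleles):
    """Expected heterozygosity corrected for sample size (needs >= 2 alleles)."""
    n = len(alleles)
    homozygosity = sum((count / n) ** 2 for count in Counter(alleles).values())
    return n / (n - 1) * (1 - homozygosity)


def constrained_gene_diversity(alleles):
    """Gene diversity forced into the open interval (0, 1)."""
    alleles = list(alleles)
    distinct = len(set(alleles))
    if distinct == 1:
        # Invariant sample: add one different allele (illustrative choice).
        alleles.append(max(alleles) + 1)
    elif distinct == len(alleles):
        # Every allele is unique (diversity 1): duplicate one allele.
        alleles.append(alleles[0])
    return gene_diversity(alleles)


print(constrained_gene_diversity([100, 100, 100, 100]))
print(constrained_gene_diversity([100, 102, 104, 106]))
```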
Data bootstrapping (also called Hetrange)
MSA can bootstrap over individuals or chromosomes within populations to obtain the range within which the gene diversity, the variance and the number of alleles can vary just by sampling.
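A minimal sketch of such a bootstrap over chromosomes at a single locus is given below. It resamples the observed alleles with replacement and reports the minimum, mean and maximum gene diversity; the names `het_range` and `seed` are illustrative and not taken from MSA, and the default of 1000 replicates follows the standard given for the hetrange option.

```python
import random
from collections import Counter


def gene_diversity(alleles):
    """Expected heterozygosity corrected for sample size."""
    n = len(alleles)
    homozygosity = sum((count / n) ** 2 for count in Counter(alleles).values())
    return n / (n - 1) * (1 - homozygosity)


def het_range(alleles, replicates=1000, seed=0):
    """Bootstrap chromosomes and return (min, mean, max) gene diversity."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    values = []
    for _ in range(replicates):
        sample = rng.choices(alleles, k=len(alleles))  # sampling with replacement
        values.append(gene_diversity(sample))
    return min(values), sum(values) / len(values), max(values)


low, mean, high = het_range([100, 100, 102, 104, 104, 106])
print(low, mean, high)
```

Resampling individuals instead of chromosomes would draw pairs of alleles together; that variant is omitted to keep the sketch short.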
The program reports:
· Large gaps in allele sizes at a given locus
· Outlier alleles
· Discrepancies between the assumed and observed step size
· Inferred step size if no repeat type was specified in the input file
This
file provides all settings used for the analysis of the data and
could be used for documentation purposes.
The program converts your data into the following formats:
· GENEPOP two-digit format
· MSVAR (Beaumont) input format for each population separately
· STRUCTURE
· ARLEQUIN
· MIGRATE format
· Randomly discarded data sheets (1-column and 2-column format), in which one allele is randomly discarded for all populations marked as inbred (h). The corresponding GENEPOP files are also written. All files which contain randomly discarded alleles are marked with the RD label. Non-inbred populations are not changed. RD input files are generated automatically for GENEPOP and STRUCTURE.
· IM format: creates the input files for all possible population pairs (IM = isolation and migration, by Jody Hey and Rasmus Nielsen). NOTE: You need an additional file for IM to set the inheritance scalar (standard = 1) and the mutation rate per year (see the IM documentation for this). See the description of the IM input file.
LAMARC's XML format can be easily obtained by converting the MIGRATE format into the LAMARC format (see the LAMARC documentation).
ARLEQUIN has an optional setting for genetic structuring (important only for AMOVA). For this setting, MSA uses the grouping given in the MSA data sheet.
Remarks:
· The MIGRATE, ARLEQUIN, MSVAR and IM input formats require that the alleles correspond to the number of repeats (not to the allele length); therefore, make sure that your microsatellites behave AS EXPECTED (observed mutational step = size of repeat unit). This can be easily checked in the error report file. MSA will assume that the observed mutational step size is the correct one. Alternatively, select the option "reduce to stepsize", which bins the observed alleles into the size classes specified in the input file. Allele bins are constructed by reducing the observed size to the next size class compatible with the specified repeat number. As indels of unknown size contribute to these size shifts, the preferred strategy is to remove all loci with incorrect mutation step sizes from your data set before converting it.
· For data sets containing inbred individuals, randomly discarded data sheets must be used to generate the MSVAR, ARLEQUIN, MIGRATE or IM formats: simply rerun MSA and use the randomly discarded data set as the input file.
· The results of each run of MSA are stored in a directory labeled FILENAME_resultXX. FILENAME is the name of your data sheet and XX is a number between 0 and 99. In the case of multiple runs using the same data set, the result directories are numbered consecutively.
· Each directory contains various files and subdirectories. The table below provides information about the content of the files generated by MSA.
Subdirectory | Name of file | Description
 | Errorreport.txt | The error report of the data set. Possible errors (wrong mutation step, large gaps, ...) detected by MSA.
 | Summary.xls | All summary statistics split by locus and population
 | MSA.log | The protocol of the analysis
Allelecount | |
 | Allelecount.xls | Allele counts
 | Allelefrequency.xls | Allele frequencies
Distance_data | | Distances between populations are marked with _POP, between individuals with _IND
 | ADA_XXX.txt | D_{ad}, absolute differences
 | ASD_XXX.txt | D1, average square
 | CAS_XXX.txt | D_{c}, chord distance
 | DAN_XXX.txt | D_{a}, Nei's distance
 | DMS_XXX.txt | D_{dm}, delta mu square
 | FSS_XXX.txt | D_{fs}, fuzzy set similarity
 | DSG_XXX.txt | Nei's standard genetic distance (1978), corrected for small sample size
 | KSC_XXX.txt | D_{kf}, kinship coefficient
 | POSA_XXX.txt | D_{ps}, proportion of shared alleles
FStatistic | |
 | FIS_WC84.xls | F_{IS}
 | FIS_WC84_MULTI.xls | Mean F_{IS} from repeated random discarding (for inbred lines only)
 | FIT_WC84.xls | F_{IT}
 | FIT_WC84_MULTI.xls | Mean F_{IT} from repeated random discarding (for inbred lines only)
 | FST_WC84.xls | F_{ST}
 | FST_WC84_MULTI.xls | Mean F_{ST} from repeated random discarding (for inbred lines only)
 | FST_WC84_OG.xls | Upper value of the 95% CI for F_{ST} from repeated random discarding (for inbred lines only)
 | FST_WC84_UG.xls | Lower value of the 95% CI for F_{ST} from repeated random discarding (for inbred lines only)
 | FST_WC84pValue.xls | P-value determined by permuting genotypes/alleles
 | GlobFst.xls | Global F_{ST}, F_{IS}, F_{IT} over all loci and for each locus separately, with the P-values for the corresponding F_{ST} estimates; G_{ST} (Nei) values and corresponding P-values
 | LocusFHeterogeneity.xls | The 95%/99% confidence interval of F_{ST}, calculated by resampling a given number of loci (with replacement)
 | Beta_POP.xls | β_{i} data for populations on the diagonal and β_{ij} data (between population pairs)
 | Betaij_POP.txt | β_{ij} data (between population pairs) only, in PHYLIP input format
Formats&Data | |
 | Genepop.gen | GENEPOP format
 | GenepopRD.gen | Randomly discarded GENEPOP format
 | FILENAME_2C_RD | Data sheet in 2-column format, randomly discarded
 | FILENAME_1C_RD | Data sheet in 1-column format, randomly discarded
 | POPNAME.beau.infile | MSVAR input file for each population separately
 | FILENAME.migrate | MIGRATE/LAMARC input file
 | FILENAME.arlequin.arp | ARLEQUIN input file
 | FILENAME.struct | Input file for STRUCTURE
Group_data | | Unweighted averages of the populations within a specified group
 | Hetexpgroup.xls | Expected heterozygosity split by group and locus
 | HetexpRDgroup.xls | Expected heterozygosity split by group and locus, based on the mean of 200 randomly discarded data sets
 | Vargroup.xls | Variance split by group and locus
 | VarRDgroup.xls | Variance split by group and locus, based on the mean of 200 randomly discarded data sets
 | VarRDrepeatgroup.xls | Variance in repeat number split by group and locus, based on the mean of 200 randomly discarded data sets
 | Varrepeatgroup.xls | Variance in repeat number split by group and locus
 | SumGrpXX.xls | Corresponds to the file Summary.xls, but separate for each group and split only by locus
Resampling_data | | Randomized range of gene diversity for each locus and population
 | AlleleRangeMax.xls | Maximum allele number (determined by bootstrapping) split by locus and population
 | AlleleRangeMean.xls | Mean allele number (determined by bootstrapping) split by locus and population
 | AlleleRangeMin.xls | Minimum allele number (determined by bootstrapping) split by locus and population
 | HetRangeMax.xls | Maximum gene diversity (determined by bootstrapping) split by locus and population
 | HetRangeMean.xls | Mean gene diversity (determined by bootstrapping) split by locus and population
 | HetRangeMin.xls | Minimum gene diversity (determined by bootstrapping) split by locus and population
 | VarRangeMax.xls | Maximum variance (determined by bootstrapping) split by locus and population
 | VarRangeMean.xls | Mean variance (determined by bootstrapping) split by locus and population
 | VarRangeMin.xls | Minimum variance (determined by bootstrapping) split by locus and population
 | NAME_AlleleRange.xls | Report of all allele numbers obtained by N bootstrap replications
 | NAME_HetRange.xls | Report of all gene diversities obtained by N bootstrap replications
 | NAME_VarRange.xls | Report of all variances obtained by N bootstrap replications
Single_data | | All of them split by locus and population
 | AllelicRichness.xls | Reports the calculated values of allelic richness
 | GST.xls | G_{ST} per locus and global (Nei as well as Hamrick & Godt)
 | Hetexp.xls | Expected heterozygosity
 | HetRDexp.xls | Expected heterozygosity based on the mean of 200 randomly discarded data sets
 | MaxAllele.xls | Maximum allele length
 | Mean.xls | Mean allele length (in the case of RD: based on the mean of 200 randomly discarded data sets)
 | MinAllele.xls | Minimum allele length
 | NumAlleles.xls | Number of alleles
 | NumChromosomes.xls | Number of analyzed chromosomes
 | Shannon.xls | Shannon index
 | Var.xls | Variance in allele length
 | VarRD.xls | Variance in allele length based on the mean of 200 randomly discarded data sets
 | Varrepeat.xls | Variance in repeat number
 | VarRDrepeat.xls | Variance in repeat number based on the mean of 200 randomly discarded data sets
Dieringer, Daniel & Schlötterer, Christian (2003) Microsatellite analyser (MSA): a platform independent analysis tool for large microsatellite data sets. Molecular Ecology Notes 3 (1), 167-169
daniel.dieringer@boku.ac.at
The program can be found
here.
Copyright © 2001, 2002, 2003, 2004, 2005, 2006, 2007 Daniel Dieringer.
Permission to use and distribute this software and its documentation for any purpose is hereby granted without fee, provided the above copyright notice, author statement and this permission notice appear in all copies of this software and related documentation.
THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL THE AUTHOR, THE "INSTITUT FÜR TIERZUCHT UND GENETIK" OR THE VETERINARY UNIVERSITY OF VIENNA BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
Bowcock, A.M., Ruíz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J.R. and Cavalli-Sforza, L.L. (1994) High resolution human evolutionary trees with polymorphic microsatellites. Nature 368: 455-457.
Cavalli-Sforza, L.L. and Bodmer, W.F. (1971) The Genetics of Human Populations, p. 399. San Francisco, W.H. Freeman and Company.
Cavalli-Sforza, L.L. and Edwards, A.W.F. (1967) Phylogenetic analysis: models and estimation procedures. Amer. J. Hum. Genet. 19: 233-257.
Culley, T.M., Wallace, L.E., Gengler-Nowak, K.M. and Crawford, D.J. (2002) A comparison of two methods of calculating G_{ST}, a genetic measure of population differentiation. Amer. J. of Botany 89(3): 460-465.
Dubois, D. and Prade, H. (1980) Fuzzy Sets and Systems: Theory and Applications, p. 24. New York, Academic Press.
El Mousadik, A. and Petit, R.J. (1996) High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theor. Appl. Genet. 92: 832-839.
Ewens, W.J. (1972) The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87-112.
Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L. and Feldman, M.W. (1995a) Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92: 6723-6727.
Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L. and Feldman, M.W. (1995b) An evaluation of genetic distances for use with microsatellite loci. Genetics 139: 463-471.
Hedrick, P.W. (2005) A standardized genetic differentiation measure. Evolution 59(8): 1633-1638.
Hurlbert, S.H. (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52: 577-586.
Kimura, M. and Ohta, T. (1975) Distribution of allelic frequencies in a finite population under stepwise production of neutral alleles. Proc. Natl. Acad. Sci. USA 72: 2761-2764.
Krebs, C.J. (1989) Ecological Methodology. Harper & Row, New York.
Michalakis, Y. and Excoffier, L. (1996) A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics 142: 1061-1064.
Nei, M., Tajima, F. and Tateno, Y. (1983) Accuracy of estimated phylogenetic trees from molecular data. J. Mol. Evol. 19: 153-170.
Nei, M. (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583-590.
Nei, M. (1987) Molecular Evolutionary Genetics, eq. (7.39), p. 164. Columbia University Press, New York.
Slatkin, M. (1995) A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462.
Shannon, C.E. and Weaver, W. (1949) The Mathematical Theory of Communication. University of Illinois Press, IL.
Weir, B.S. and Cockerham, C.C. (1984) Estimating F-Statistics for the Analysis of Population Structure. Evolution 38(6): 1358-1370.
Weir, B.S. and Hill, W.G. (2002) Estimating F-Statistics. Annu. Rev. Genet. 36: 721-750.
Changes with version 3.00:
Bugfixes:
· Problems with the distance calculations D1, (δµ)^{2}, D_{ad}. Note that the previous version incorrectly estimated the distances based on PCR product size rather than repeat length.
· D_{c} per locus was calculated incorrectly.
Both bugs were reported by Ana Dominguez Sanjurjo.
Note: if you intend to use D1, (δµ)^{2}, D_{ad} or D_{c} per locus, do NOT use versions earlier than 3.0.
New features:
· Estimate of the sampling variance of gene diversity: calculates the range of gene diversity (min, mean, max) determined by bootstrapping
· Estimate of theta based on gene diversity and the SMM mutation model
Bugfixes with version 3.01:
· The calculation of theta based on gene diversity and the SMM mutation model was (obviously) wrong when a locus had gene diversity 1. Because of this error, the correct result was written to the file ThetaRDexp.xls instead of the file Thetaexp.xls.
Version 3.10:
· New function added: estimates the heterogeneity of the F_{ST} values by bootstrapping a specified number of loci.
Version 3.12:
· Calculation of D1 (ASD) between individuals is now suppressed, as it requires the variance of the population.
Version 3.14:
· A bug in the global F_{ST} calculations was removed that replaced the P-value for the global F_{ST} with zero when it was larger than the minimum (1/number of permutations).
Version 3.15:
· Version 3.14 had a problem generating GENEPOP-formatted files. This problem is now fixed.
Version 3.17:
· MSA is able to write individual-based distance files in NEXUS format instead of PHYLIP format. (Please note that this function is only available without bootstrapping.)
· The menu structure was changed to make the F_{ST} menu easier to understand.
Version 4.00:
· MSA is now able to run the analysis with command line arguments (see topic above).
· The function "sampling variance of gene diversity" had an error when individuals were sampled.
· The same function gave a wrong result for the zero correction; this is now corrected.
Version 4.02:
· Output of tabbed results for F_{IS} was -nan when the number of alleles was 1. This is now corrected to "n.d."
· MSA can now create input files for IM (see above).
· MSA now calculates the F_{ST}-like moment estimators β_{i}/β_{ij} (see above).
Version 4.04:
· When running the hetrange analysis, MSA now also outputs the variance and the number of alleles.
· Allelic richness was added to the overview methods.
Version 4.05:
· For bootstrapping of distances, it is possible to set the number of loci used for each replicate data set.
· The variance of allelic richness was added.
· Small changes in the documentation.
· Data files no longer have to be located in the same folder as the MSA executable.
· G_{ST} and G_{ST}' were included in the descriptive statistics and in the global F_{ST} calculation (including P-values).
Description of genetic distances
D1, ASD, Average Square [Goldstein et al., 1995b; Slatkin, 1995]:
The average square distance was derived by employing the analytical theory developed by Moran for the distribution of alleles mutating under a strict stepwise mutation process in a population of finite constant size with non-overlapping generations. It and its family of related distances ((δµ)^{2}) are superior to other distances for microsatellites in that they have a linear expectation with time, making them good for evolutionary studies.
\[ N_{k} = (\mathrm{Mean}(Pop1,k) - \mathrm{Mean}(Pop2,k))^{2} + \frac{(n_{i}-1)\,\mathrm{Var}(Pop1,k)}{n_{i}} + \frac{(n_{j}-1)\,\mathrm{Var}(Pop2,k)}{n_{j}} \]
\[ N = \sum_{k=1}^{D} N_{k} \] (sum over all loci)
\[ ASD = N / D \]
D = number of loci
A distance based on individuals is not calculated, since D1 is equal to (δµ)^{2} [D_{dm}] when the variance term is removed.
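The ASD formula can be traced with a small numerical sketch. It assumes that n_i and n_j are the numbers of alleles sampled per locus in the two populations and that Var is the sample variance (`statistics.variance`); neither is stated explicitly in this section, so both are labeled assumptions. The function name `asd` is hypothetical.

```python
from statistics import mean, variance


def asd(pop1, pop2):
    """Average square distance between two populations.

    pop1, pop2: one list of repeat numbers per locus, loci in the same order.
    Assumption: Var is the sample variance and n is the number of alleles.
    """
    total = 0.0
    for x, y in zip(pop1, pop2):
        n_i, n_j = len(x), len(y)
        total += (
            (mean(x) - mean(y)) ** 2
            + (n_i - 1) * variance(x) / n_i
            + (n_j - 1) * variance(y) / n_j
        )
    return total / len(pop1)


# One locus: squared mean difference 16, plus a variance term of 1 per population.
print(asd([[10, 10, 12, 12]], [[14, 14, 16, 16]]))
```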
D_{ps}, Proportion of shared alleles [Bowcock et al., 1994]
The general definition of the proportion of shared alleles at a given locus, which holds true whether the taxa are individuals or population samples, is the mean over loci of the sums of the minima of the relative frequencies of all alleles between compared taxa.
\[ ps = \frac{1}{D} \sum_{k} \sum_{a} \min(f_{a,i}, f_{a,j}) \]
Sum over all loci and all alleles
f_{a,i}, f_{a,j} = frequency of allele a in pop i, j
D = number of loci
The distance can be taken as: D_{ps} = -ln(ps), or D_{ps}' = 1 - ps
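As an illustration, the proportion of shared alleles and both distance variants (1 - ps and -ln(ps)) can be computed from per-locus allele frequency tables. The data layout (one dict per locus, mapping allele to relative frequency) and the function names are assumptions made for this sketch.

```python
from math import log


def shared_allele_proportion(freqs_i, freqs_j):
    """Mean over loci of the summed minima of the allele frequencies."""
    total = 0.0
    for f_i, f_j in zip(freqs_i, freqs_j):
        alleles = set(f_i) | set(f_j)
        total += sum(min(f_i.get(a, 0.0), f_j.get(a, 0.0)) for a in alleles)
    return total / len(freqs_i)


def d_ps(freqs_i, freqs_j, calc="1"):
    """Distance as 1 - ps (calc='1') or -ln(ps) (calc='ln')."""
    ps = shared_allele_proportion(freqs_i, freqs_j)
    # Note: -ln(ps) is undefined when no allele is shared (ps = 0).
    return -log(ps) if calc == "ln" else 1 - ps


pop_i = [{100: 0.5, 102: 0.5}]
pop_j = [{102: 0.25, 104: 0.75}]
print(d_ps(pop_i, pop_j), d_ps(pop_i, pop_j, calc="ln"))
```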
D_{fs}, Fuzzy set similarity [Dubois and Prade, 1980]
The fuzzy set similarity for a pair of taxa is the ratio between the cardinality of the intersection of their alleles and the cardinality of the union of their alleles, e.g., if two individuals have genotypes ab and ac, the intersection is {a}, the union is {a,b,c}, and the ratio is 1/3.
The distance can be taken as: Dfs = -ln(fs), or Dfs' = 1 - fs
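The worked example in the text (genotypes ab and ac) can be reproduced with Python sets. This is a single-locus sketch with a hypothetical function name; how MSA combines the ratio over several loci is not stated here, so the sketch does not attempt it.

```python
def fuzzy_set_similarity(alleles_i, alleles_j):
    """Size of the allele intersection divided by size of the union."""
    set_i, set_j = set(alleles_i), set(alleles_j)
    return len(set_i & set_j) / len(set_i | set_j)

# Genotypes ab and ac: intersection {a}, union {a, b, c}.
fs = fuzzy_set_similarity("ab", "ac")
d_fs_prime = 1 - fs  # Dfs' = 1 - fs
```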
D_{kf}, Kinship coefficient [Cavalli-Sforza and Bodmer, 1971]
The kinship coefficient is the probability that a gene taken at random from population i (at a given locus) is identical by descent to a gene taken at random from population j at the same locus.
kf = \sum_{k} \sum_{a} (f_{a,i} * f_{a,j}) / D
Sum over all loci and all alleles
f_{a,i}, f_{a,j} = frequency of allele a in pop i, pop j
D = number of loci
The distance can be taken as: Dkf = -ln(kf), or Dkf' = 1 - kf
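A minimal sketch of the kf formula, assuming the same hypothetical input layout as before (one list of allele sizes per locus); the function names are illustrative and not taken from MSA.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Relative frequency of each allele in one sample at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def kinship(pop_i, pop_j):
    """Mean over loci of the summed products of the allele frequencies."""
    total = 0.0
    for alleles_i, alleles_j in zip(pop_i, pop_j):
        f_i = allele_frequencies(alleles_i)
        f_j = allele_frequencies(alleles_j)
        # The product is zero for alleles absent from either sample.
        total += sum(f_i[a] * f_j[a] for a in f_i.keys() & f_j.keys())
    return total / len(pop_i)

kf = kinship([[100, 100, 102, 104]], [[100, 102, 102, 102]])
d_kf_prime = 1 - kf  # Dkf' = 1 - kf
```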
D_{ad} = \sum_{k} abs(Mean(Pop1,k) - Mean(Pop2,k)) / D
Sum over all loci
Mean(Pop1,k), Mean(Pop2,k) ... mean allele of population 1, 2 at locus k
D = number of loci
(\delta\mu)^{2}, D_{dm} [Goldstein et al., 1995a]
D_{dm} = \sum_{k} (Mean(Pop1,k) - Mean(Pop2,k))^2 / D
Sum over all loci
Mean(Pop1,k), Mean(Pop2,k) ... mean allele of population 1, 2 at locus k
D = number of loci
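D_{ad} and D_{dm} differ only in whether the per-locus difference of mean allele sizes is taken as an absolute value or squared. The sketch below computes both; the function name and input layout (one list of allele sizes per locus) are hypothetical.

```python
from statistics import mean

def mean_based_distances(pop1, pop2):
    """Return (D_ad, D_dm) from the per-locus mean allele sizes."""
    diffs = [mean(a1) - mean(a2) for a1, a2 in zip(pop1, pop2)]
    n_loci = len(diffs)
    d_ad = sum(abs(d) for d in diffs) / n_loci  # absolute differences
    d_dm = sum(d * d for d in diffs) / n_loci   # squared differences
    return d_ad, d_dm

d_ad, d_dm = mean_based_distances([[10, 12], [5, 5]], [[14, 16], [5, 7]])
```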
D_{c} [Cavalli-Sforza and Edwards, 1967]
D_{c} = 2 * \sum_{k} (D_{c,k}) / (\pi * D)
Sum over all loci
D_{c,k} = sqrt(2 * (1 - \sum_{a} sqrt(f_{a,i} * f_{a,j})))
Sum over all alleles
D = number of loci
f_{a,i}, f_{a,j} = frequency of allele a in pop i, pop j
D_{a} [Nei et al., 1983]
D_{a} = 1 - \sum_{k} \sum_{a} sqrt(f_{a,i} * f_{a,j}) / D
Sum over all loci and all alleles
f_{a,i}, f_{a,j} = frequency of allele a in pop i, pop j
D = number of loci
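D_{c} and D_{a} are both built from the per-locus sum of sqrt(f_{a,i} * f_{a,j}), so one sketch can return both. It uses the chord form sqrt(2 * (1 - sum)), which is zero when the two frequency vectors are identical. Function names and the input layout are hypothetical, not MSA's own.

```python
from collections import Counter
from math import pi, sqrt

def allele_frequencies(alleles):
    """Relative frequency of each allele in one sample at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def chord_and_da(pop_i, pop_j):
    """Return (D_c, D_a) for two populations, one allele list per locus."""
    chord_sum = 0.0
    overlap_sum = 0.0
    for alleles_i, alleles_j in zip(pop_i, pop_j):
        f_i = allele_frequencies(alleles_i)
        f_j = allele_frequencies(alleles_j)
        overlap = sum(sqrt(f_i[a] * f_j[a]) for a in f_i.keys() & f_j.keys())
        overlap_sum += overlap
        # Clamp at zero to absorb floating-point rounding above 1.
        chord_sum += sqrt(2 * max(0.0, 1 - overlap))
    n_loci = len(pop_i)
    d_c = 2 * chord_sum / (pi * n_loci)
    d_a = 1 - overlap_sum / n_loci
    return d_c, d_a

same = chord_and_da([[100, 100, 102, 102]], [[100, 100, 102, 102]])
disjoint = chord_and_da([[100, 100]], [[102, 102]])
```

With no shared alleles the overlap is 0, so D_{a} reaches its maximum of 1 and D_{c} reaches 2 * sqrt(2) / \pi.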
D [Nei, 1978]
Nei's standard genetic distance
Sum1 = sqrt((n_{Cki} * \sum_{a} (f_{a,i} * f_{a,i}) - 1) / (n_{Cki} - 1))
Sum2 = sqrt((n_{Ckj} * \sum_{a} (f_{a,j} * f_{a,j}) - 1) / (n_{Ckj} - 1))
id = \sum_{k} (\sum_{a} (f_{a,i} * f_{a,j}) / (Sum1 * Sum2)) / D_{L}
Sum over all loci and all alleles
f_{a,i}, f_{a,j} = frequency of allele a in pop i, pop j
n_{Cki}, n_{Ckj} = number of chromosomes at locus k in pop i, pop j
D_{L} = number of loci
The distance can be taken as: D = -ln(id), or D' = 1 - id
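The identity formula above can be sketched as follows. Names and the input layout (one list of allele sizes per locus, one entry per chromosome) are hypothetical. The sketch assumes the terms under the square roots are (n * sum(f^2) - 1) / (n - 1), and it raises an error when that term is not positive, for example when every sampled chromosome carries a different allele; how MSA handles that case is not documented here.

```python
from collections import Counter
from math import sqrt

def allele_frequencies(alleles):
    """Relative frequency of each allele in one sample at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def corrected_homozygosity(freqs, n_chrom):
    """Sample-size corrected sum of squared allele frequencies."""
    j = (n_chrom * sum(f * f for f in freqs.values()) - 1) / (n_chrom - 1)
    if j <= 0:
        raise ValueError("homozygosity estimate is not positive at this locus")
    return j

def nei_identity(pop_i, pop_j):
    """Mean over loci of the normalised identity between two populations."""
    total = 0.0
    for alleles_i, alleles_j in zip(pop_i, pop_j):
        f_i = allele_frequencies(alleles_i)
        f_j = allele_frequencies(alleles_j)
        sum1 = sqrt(corrected_homozygosity(f_i, len(alleles_i)))
        sum2 = sqrt(corrected_homozygosity(f_j, len(alleles_j)))
        shared = sum(f_i[a] * f_j[a] for a in f_i.keys() & f_j.keys())
        total += shared / (sum1 * sum2)
    return total / len(pop_i)

identity = nei_identity([[100, 100, 100, 102]], [[102, 102, 102, 100]])
d_prime = 1 - identity  # D' = 1 - id
```

Because of the sample-size correction, the identity can exceed 1 for small, similar samples, in which case -ln(id) becomes slightly negative.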
F_{ST}, F_{IS}, F_{IT} [Weir and Cockerham, 1984]
For one allele:
F_{IT,l,a} = 1 - c / (a + b + c)
F_{ST,l,a} = a / (a + b + c)
F_{IS,l,a} = 1 - c / (b + c)
a = nq * (s^2 - (pq * (1 - pq) - (r - 1) * s^2 / r - hq / 4) / (nq - 1)) / nc
b = nq * (pq * (1 - pq) - (r - 1) * s^2 / r - (2 * nq - 1) * hq / (4 * nq)) / (nq - 1)
c = hq / 2
p_{i} ... frequency of allele A in population i
n_{i} ... sample size of population i
h_{i} ... observed proportion of individuals heterozygous for allele A in population i
nq ... average sample size (genomes!)
nc ... (r * nq - \sum_{i} (n_{i}^2) / (r * nq)) / (r - 1)
r ... number of compared populations
pq ... \sum_{i} (n_{i} * p_{i}) / (r * nq), average sample frequency of allele A
s^2 ... \sum_{i} (n_{i} * (p_{i} - pq)^2) / ((r - 1) * nq), sample variance of allele A frequencies over populations
hq ... \sum_{i} (n_{i} * h_{i}) / (r * nq), average heterozygote frequency for allele A
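For a single allele, the components a, b and c and the three F values can be computed directly from the per-population inputs n_{i}, p_{i} and h_{i}. The sketch below follows the definitions above term by term; the function names are hypothetical, and it covers only the one-allele case, not the combination over alleles and loci.

```python
def wc_components(n, p, h):
    """Variance components (a, b, c) for one allele.

    n[i]: sample size, p[i]: frequency of allele A,
    h[i]: observed proportion heterozygous for A, in population i.
    """
    r = len(n)
    nq = sum(n) / r
    nc = (r * nq - sum(x * x for x in n) / (r * nq)) / (r - 1)
    pq = sum(ni * pi for ni, pi in zip(n, p)) / (r * nq)
    s2 = sum(ni * (pi - pq) ** 2 for ni, pi in zip(n, p)) / ((r - 1) * nq)
    hq = sum(ni * hi for ni, hi in zip(n, h)) / (r * nq)
    a = nq * (s2 - (pq * (1 - pq) - (r - 1) * s2 / r - hq / 4) / (nq - 1)) / nc
    b = nq * (pq * (1 - pq) - (r - 1) * s2 / r
              - (2 * nq - 1) * hq / (4 * nq)) / (nq - 1)
    c = hq / 2
    return a, b, c

def f_statistics(a, b, c):
    """Return (F_IT, F_ST, F_IS) for one allele from its components."""
    f_it = 1 - c / (a + b + c)
    f_st = a / (a + b + c)
    f_is = 1 - c / (b + c)  # undefined when b + c is zero
    return f_it, f_st, f_is

# Two samples of 10 with identical frequencies and heterozygosity.
a, b, c = wc_components([10, 10], [0.5, 0.5], [0.5, 0.5])
f_it, f_st, f_is = f_statistics(a, b, c)
```

In this example the two samples are identical, so a is slightly negative and F_{ST} comes out below zero, which is expected for this estimator when there is no differentiation.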
Over all alleles and loci:
F_{IT} = \sum_{l} \sum_{a} F_{IT,l,a}
F_{ST} = \sum_{l} \sum_{a} F_{ST,l,a}
F_{IS} = \sum_{l} \sum_{a} F_{IS,l,a}
Note: Pairwise F_{ST} values over all loci are calculated excluding every locus that is missing in one of the two populations.
Alleles or genotypes of each pair of populations are permuted N times. The resampling units are alleles (when Hardy-Weinberg equilibrium is assumed) or genotypes (when it is not assumed).
The result, given in the upper triangle, is the proportion of permuted F_{ST} values equal to or greater than the observed F_{ST}.
The lower triangle contains these proportions multiplied by the number of tests (strict Bonferroni correction).
These values estimate the probability of obtaining an F_{ST} at least as large as the observed one if the populations were equal.
A value greater than 0.05 is replaced with ns (not significant).
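The permutation procedure can be sketched as below for one locus, with alleles as the resampling unit (the Hardy-Weinberg case). To keep the example short it does not implement the Weir and Cockerham estimator; it uses a simple stand-in statistic, (H_T - H_S) / H_T, which is an assumption made for illustration only. All function names are hypothetical.

```python
import random
from collections import Counter

def differentiation(pop1, pop2):
    """Stand-in statistic (H_T - H_S) / H_T; NOT MSA's F_ST estimator."""
    def freqs(alleles):
        counts = Counter(alleles)
        return {a: c / len(alleles) for a, c in counts.items()}
    f1, f2 = freqs(pop1), freqs(pop2)
    # Mean within-population and total expected heterozygosity.
    h_s = 1 - (sum(f * f for f in f1.values())
               + sum(f * f for f in f2.values())) / 2
    h_t = 1 - sum(((f1.get(a, 0) + f2.get(a, 0)) / 2) ** 2
                  for a in f1.keys() | f2.keys())
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

def permutation_p_value(pop1, pop2, n_perm, rng):
    """Proportion of permuted statistics >= the observed statistic."""
    observed = differentiation(pop1, pop2)
    pooled = list(pop1) + list(pop2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign alleles to the two populations
        permuted = differentiation(pooled[:len(pop1)], pooled[len(pop1):])
        if permuted >= observed:
            hits += 1
    return hits / n_perm

rng = random.Random(1)  # fixed seed so the run is repeatable
p_fixed = permutation_p_value(["a"] * 20, ["b"] * 20, 1000, rng)
p_equal = permutation_p_value(["a", "b"] * 10, ["a", "b"] * 10, 1000, rng)
n_tests = 3                                # e.g. three populations
p_corrected = min(1.0, p_fixed * n_tests)  # strict Bonferroni
```

Two populations fixed for different alleles give a proportion near zero, while two identical samples give a proportion near one.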
In general, N should be greater than: number of tests / level of significance.
To obtain the number of tests before running the permutations, analyse your data without distances and choose the option [s] in the main menu to show data information, or calculate (P*P - P)/2, where P is the number of populations.
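As a small worked example of this rule, the sketch below computes the number of pairwise tests and the resulting lower bound on N. The function names are hypothetical, and the default significance level of 0.05 is taken from the threshold mentioned above.

```python
def number_of_tests(n_populations):
    """Number of population pairs: (P*P - P) / 2."""
    return (n_populations * n_populations - n_populations) // 2

def permutation_lower_bound(n_populations, alpha=0.05):
    """N should be greater than: number of tests / level of significance."""
    return number_of_tests(n_populations) / alpha

tests = number_of_tests(10)          # 10 populations give 45 pairs
bound = permutation_lower_bound(10)  # about 900, so N should exceed 900
```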
Handling inbred lines:
To get better performance when analysing inbred lines, you can choose the random discarding option, which is applied before each permutation (for all "haploid" populations).
This gives the mean over all randomly discarded F_{ST} values, as well as the P-value for them, together with the upper and lower bounds of the 95% CI.
In practice, however, this procedure is in most cases not necessary, because non-discarded data sets are not biased and do not deviate from the mean of 1000 randomly discarded data sets (data not shown).
Moment Estimator [Weir and Hill, 2003]