Microsatellite analyzer (MSA) 4.05

Mission

Input file

Command line

Descriptive statistics

Distances

F-statistics

Constrained gene diversity

Data bootstrapping

Error report

Working report (Log-file)

Formats

Result structure

How to cite

Contact

Download

Disclaimer

References

History / bug reports

Description of genetic distances

Sample input file

Mission:

 


 

Input file:

Background:

Input files can be generated using spreadsheet software, such as Excel, in which the data are arranged either in one column per locus or two columns per locus (sample input file). As MSA was mainly designed for microsatellites, data should be entered as the PCR product size. Missing data can be indicated by either an empty cell or a negative value, but do not enter '0', as this would be treated as a PCR product of zero bases. Note that the two-column format provides some additional input options (see below).

 

Please note: if Excel is used to generate an MSA input file, the file has to be saved as a "TAB DELIMITED" file.

 

Size constraints:

The numbers of individuals, populations and loci are constrained only by the available memory.

 

Generation of an MSA input file:

•      The third column allows populations to be grouped; some analyses are also performed for the specified groups. This column must not remain empty: in the absence of grouping, give the same number to all populations. Only consecutive group numbers are allowed, but they may be assigned to populations in any order.

•      The first two rows provide information about each locus. The first row specifies the repeat type (1, 2, 3, etc.). The second row indicates the length of the sequence flanking the microsatellite (in bp). If no information is provided in the second row (empty cell), MSA does not calculate the variance in repeat number, but the inferred repeat type is specified in the output file. Providing the flanking length allows MSA to calculate the number of repeats from your PCR product size.

•      The third row contains the name of the microsatellite locus. In the two-column format, MSA allows two different names for the same locus (each entered in one cell).

 

 

Remarks:

For compatibility with PHYLIP the population labels are limited to 8 characters. For individual based distances, only the first 4 characters of the population label are used to label individuals. Therefore, it is highly advised that the first 4 characters differ among population labels.

 

If you are formatting your data with Excel (or other spreadsheet software), please make sure that you save it in the "TAB DELIMITED" format. Other formats are not accepted by MSA.

 

Generation of an MSA IM input file (for IM-input file conversion only):

•      Create the normal MSA input file

•      Make a copy with a different name

•      Open the copy and remove all data (population names and alleles) so that only the locus information is left

•      Write 'inheritance' as one "individual" name and then, below the (first) locus names, the value of the desired inheritance scalar (e.g. 1 for autosomes, 0.75 for X-chromosomes, 0.25 for Y-chromosomes or mtDNA, ...)

•      Write 'mutation' as another "individual" name and do the same with the mutation rates (as mutation rate per year).

•      Save the file AGAIN AS "TAB-DELIMITED"

•      In the MSA menu, go to the submenu '(c)...Data conversion settings', choose the option
'(f) ... Locus information for IM : No'
and enter the name of the file containing the inheritance information.

•      For an example file with the inheritance information see "testdata.im"

 

Starting the program:

Option 1: double clicking:

 

Windows: double click to start the program and follow the instructions on the screen

OS X: double click to start the program

Load data by selecting option i) in the start menu of MSA. Rather than typing the name of the input file, drag and drop the input file onto the MSA window. By doing so you automatically specify the path (location of your input file).

 

Option 2: command line (OS X; works similarly on Linux)

open the 'Terminal' software (located in: Applications->Utilities)

type: cd (note that 'cd' is followed by a blank)

drag the folder containing the MSA executable and input file onto the Terminal window

hit the 'return' key

type: ./msa

hit the 'return' key again

the normal MSA menu will appear and the input file can be specified by typing its name. Note: the path does not need to be specified, as the file is located in the same folder as the executable.

 


Available Functions:

 

Command Line:

Starting with MSA version 4.0, command line arguments can be used instead of the MSA menu.

Starting MSA without any argument opens the MSA menu. Starting MSA with arguments runs MSA without opening the menu. If you are not sure how to use the arguments, start MSA without any argument and it will behave as usual.

Explanation of the following notation:

Statements in "[]" are optional and need not be given; MSA then uses the standard settings. If a statement is written like [g[lobal]], you can use either 'global' or just 'g' to activate the function.

YES/NO means that you have to choose one of the two possibilities, YES or NO. The same is true for all other alternatives: ON/OFF, 1/LN.

XXXX means that you have to enter a number.

Please write the commands exactly as given below (including lower/upper case); otherwise the result cannot be predicted.

For more information about the functions themselves and the output files, please read the chapters below.

 

MSA Command Line Options:

     -i: "FILENAME"

Specifies the input file for MSA (essential)

     -fim: "FILENAME"

This file contains the additional information required for conversion into the IM format. Its format is nearly the same as that of the MSA input file; see the sample file and the section on the MSA IM input file.

     -fst: [g[lobal]] [p[airwise]] [beta] [n=XXXX] [HW[=YES/NO]] [RD[=ON/OFF]] [locus]

This option starts the FST analysis. It is essential to give [global] and/or [pairwise], for global FST or pairwise FST between populations.

[beta] calculates the moment estimator β values of the data

[n=XXXX] sets the number of permutations; the standard is 10000 (n=1 means no permutation)

[HW[=YES/NO]]: Hardy-Weinberg equilibrium is NO by default (HW=NO); use HW or HW=YES if you assume your data to be in Hardy-Weinberg equilibrium.

[RD[=ON/OFF]] Random discarding for inbred lines before calculating FST is OFF by default (RD=OFF); use RD or RD=ON to activate it.

[locus] For pairwise FST, this option gives the FST values for each population pair and locus.

     "-dist:"/"-dist " [p[opulation]] [i[ndividual]] [...] [n=XXXX] [calc=1/ln] [NEXUS] [locus]

Calculates standard genetic distances. It is essential to activate at least [population] and/or [individual].

[...] You have to specify at least one distance method to get results. Use one or more of the following codes:

POSA ...         Dps        proportion of shared alleles

FSS    ...         Dfs        fuzzy set similarity

ADA   ...         Dad        absolute difference algorithm

KSC   ...         Dkf        kinship coefficient

DMS   ...         Ddm       (δµ)²

ASD   ...         D1        average square

CAS   ...         Dc         Cavalli-Sforza and Edwards chord distance (1967)

DAN   ...         Da         Nei's chord distance (1983)

DSG   ...         D          Nei's standard genetic distance (corrected for sample size)

[n=XXXX] Number of bootstraps. Standard is no bootstrapping (n=1).

[bl=XXXX] Specifies the number of loci for bootstrapping DS. Standard is the same number as original dataset (bl=-1)

[calc=1/ln] Some distances (see below) can be calculated in two ways: 1-factor (standard, calc=1) or -ln(factor) (calc=ln). (Please do not use calc on its own!)

[NEXUS] NEXUS output files can be generated for individual-based distances without bootstrapping

[locus] For pairwise distances this option gives the distance values for each population pair and locus.

     -format: [msvar] [arlequin] [migrate] [im]

To convert your data into additional formats, use these options:

[msvar] MSVAR (Beaumont) input format for each population separately

[arlequin] Creates ARLEQUIN input files.

[migrate] Creates input files for MIGRATE.

[im] Creates IM input files for all population pairs.

     -hetrange: [n=XXXX] [RD[=ON/OFF]] [zero[=ON/OFF]] [l[inked]] [ind[ividual]]

Estimates the sampling variance of gene diversity (using bootstrapping)

[n=XXXX] Number of replications (standard is 1000)

[RD[=ON/OFF]] Activates random discarding for inbred data. (Please use this only together with [individual].)

[zero[=ON/OFF]] The correction for gene diversities of 0 and 1 is ON by default (zero=ON); to deactivate it, use zero or zero=OFF.

[l[inked]] By default all loci are assumed to be unlinked and are therefore resampled separately. Use linked to change this (all loci are then treated as completely linked). Also activate [individual]; otherwise chromosomes are resampled (always the first or the second chromosomes together)!

[ind[ividual]] Samples individuals instead of chromosomes (loci can also be unlinked!).

     -rare: [number]

Sets the reference population size used to calculate allelic richness (rarefaction).

(-1: minimal sample size = STANDARD)

 

 


Descriptive Statistics:

All descriptive statistics are given per population and locus.

•      Allele counts and frequencies

•      Number of chromosomes per population and locus

•      Expected number of alleles for each locus (IAM [Ewens W.J., 1972] and SMM [Kimura and Ohta, 1975])

•      Allelic richness [El Mousadik A. and Petit R.J., 1996; Hurlbert S.H., 1971; Krebs C.J., 1989] and the variance of this value

•      Observed heterozygosity

•      Expected heterozygosity (= gene diversity), corrected for sample size

•      Estimate of theta based on gene diversity and the SMM mutation model

•      Constrained gene diversity (0 &lt; H &lt; 1)

•      Variance in PCR product size and in repeat number, corrected for sample size

•      Shannon index of diversity (Shannon entropy) [Shannon and Weaver, 1949]

•      Minimum, maximum, and mean allele length

•      Minimum, maximum, and mean repeat number

•      hs, Nei's unbiased estimator for gene diversity [Nei M., 1987, p. 164] (this option is available only for "outbred" populations)

•      FIS for each population (only for "outbred" populations)

•      Estimate of the sampling variance of gene diversity (using bootstrapping)

•      Calculation of GST with the methods of Nei as well as Hamrick & Godt (for a comparison of these methods see [Culley et al., 2002])

•      Calculation of GST' for both methods (see above) [Hedrick P.W., 2005]

 

Remarks:

•      For inbred lines, expected heterozygosity, variance and number of alleles are determined for 200 randomly discarded data sets, and the mean is reported. Note that this procedure can result in non-integer allele counts.

•      MSA also provides most descriptive statistics for the specified groups by reporting the unweighted mean of all populations in a given group.

•      By default, allelic richness is calculated based on the minimum number of sampled individuals (total individuals minus missing data) for each locus. The drawback is that only populations within the same data set can be compared. To avoid this, the number of individuals used for all loci can be specified. If this number is provided in a publication, results can be compared across publications. Note, however, that allelic richness is not calculated ('n.d.') for populations with a smaller sample size than specified.
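The rarefaction idea behind allelic richness can be sketched as follows. This is a minimal illustration, assuming the rarefaction formula of Hurlbert (1971) as applied to alleles by El Mousadik and Petit (1996); it is not MSA's own code. The function name `allelic_richness` and the use of gene copies (rather than individuals) as the reference size `g` are assumptions made for the example.

```python
from math import comb

def allelic_richness(allele_counts, g):
    """Expected number of distinct alleles in a random subsample of g gene copies.

    allele_counts: observed count of each allele at one locus in one population.
    g: reference sample size (hypothetical parameter, counted in gene copies).
    """
    n = sum(allele_counts)
    if g > n:
        # Sample smaller than the reference size: nothing to report ('n.d.').
        return None
    # For each allele: 1 minus the probability that it is absent from the subsample.
    return sum(1 - comb(n - c, g) / comb(n, g) for c in allele_counts)

# Three alleles observed 5, 3 and 2 times, rarefied to 2 gene copies.
print(allelic_richness([5, 3, 2], 2))
```

With `g` equal to the full sample size the function returns the observed number of alleles, and with `g = 1` it returns 1, which is a quick sanity check on the formula.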

 

 


 

Distances:

•      proportion of shared alleles Dps

•      fuzzy set similarity Dfs

•      kinship coefficient Dkf

•      absolute difference algorithm Dad

•      (δµ)² Ddm

•      average square D1 (ASD)
           note that this distance is not available for trees of individuals

•      Da Nei's chord distance (1983)

•      D Nei's standard genetic distance (corrected for sample size)

•      Dc Cavalli-Sforza and Edwards chord distance (1967)

Statistical support: A bootstrap option is available for all distances (select this option in the distance menu [n]).
When bootstrapping is used (and only then), there is the additional option of bootstrapping a user-specified number of loci. MSA will then resample only N loci from the total data set.

Dps, Dfs, Dkf and D can be calculated in two alternative ways:
-ln(similarity factor) or 1-(similarity factor).
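The two alternative conversions can be sketched as a small helper. This is an illustration only; the function name `similarity_to_distance` and the handling of a similarity of zero (returned as infinity in the logarithmic form) are assumptions, and MSA's own behaviour in that case is not documented here. The `calc` values mirror the `calc=1/ln` command line option.

```python
import math

def similarity_to_distance(similarity, calc="1"):
    """Convert a similarity factor in [0, 1] to a distance.

    calc="1"  -> 1 - similarity
    calc="ln" -> -ln(similarity)
    """
    if calc == "1":
        return 1.0 - similarity
    if calc == "ln":
        if similarity <= 0:
            # Assumption: no shared alleles at all gives an infinite distance.
            return math.inf
        return -math.log(similarity)
    raise ValueError("calc must be '1' or 'ln'")

print(similarity_to_distance(0.5, "1"))
print(similarity_to_distance(0.5, "ln"))
```

Both forms give a distance of zero for identical taxa, but only the logarithmic form is unbounded as the similarity approaches zero.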

 

Remarks:

•      All distances can be calculated between populations or between individuals. Make sure that using an individual-based distance makes sense.

•      Genetic distances are calculated WITHOUT randomly discarding alleles for inbred data sets. If distances should be calculated from a randomly discarded data set, use the file "FILENAME_2C_RD" and run MSA again.

•      Genetic distances that consider size differences between alleles [(δµ)², D1, Dad] require that the repeat size is specified in the input file. The errorreport.txt file provides information on the minimum size difference among alleles. For loci evolving in a stepwise manner, this corresponds to the repeat size.

•      Distance matrices between individuals without bootstrapping can be saved in NEXUS format instead of PHYLIP format.

NOTE: The menu option for this function stays invisible as long as distances between individuals are NOT activated. Therefore, FIRST activate distances between individuals; the menu item then becomes visible.


 

F-Statistics: 

•      Global FST, FIS, FIT estimators [Weir, B.S. and Cockerham, C.C., 1984] across all loci

•      Global FST, FIS, FIT estimators per locus

•      FST, FIS, FIT estimators [Weir, B.S. and Cockerham, C.C., 1984] across loci and between population pairs

•      FST values per locus and population pair

•      P-values for FST (global and pairwise), with and without Bonferroni correction for the P-values of pairwise FSTs

•      Estimation of the heterogeneity in F-statistics among loci (based on bootstrapping)

•      Calculation of the moment estimators βi for populations and βij between population pairs
[Weir, B.S. and Hill, W.G., 2002]

•      GST and GST' calculation is included in the calculation of global FST. As for FST, P-values are calculated as the proportion of permutations that result in a GST or GST' larger than or equal to the observed GST (GST') value.

 

Statistical support:

When loci are unlinked and in Hardy-Weinberg equilibrium, permutation of alleles should be selected. Otherwise, genotypes should be permuted, which is a conservative but not entirely satisfactory procedure. [Michalakis and Excoffier (1996)].

For inbred lines a special option exists, which provides an estimate for the variance introduced by the random discarding of alleles. For each population consisting of inbred individuals a user specified number of discarded data sets will be produced and the FST values will be calculated for each data set. The statistical significance of the FST value for each data set will be determined by permutation. MSA reports mean FST and P values.

P-values are given without Bonferroni correction (above the diagonal) and with Bonferroni correction (below the diagonal). Values that exceed 10% (0.1) after Bonferroni correction are generally reported as 'n.s.'.
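As an illustration of the reporting rule, the sketch below applies the standard Bonferroni correction (each P-value multiplied by the number of pairwise comparisons and capped at 1) and replaces corrected values above 0.1 with 'n.s.'. Whether MSA computes the correction in exactly this way is an assumption; the function name `bonferroni_report` is hypothetical.

```python
def bonferroni_report(p_values, threshold=0.1):
    """Return Bonferroni-corrected P-values, with 'n.s.' for values above threshold.

    p_values: uncorrected P-values, one per population pair.
    threshold: corrected values above this are reported as 'n.s.' (0.1 as in the text).
    """
    m = len(p_values)  # number of comparisons
    report = []
    for p in p_values:
        corrected = min(1.0, p * m)  # standard Bonferroni correction (assumption)
        report.append("n.s." if corrected > threshold else corrected)
    return report

# Three population pairs: only the first stays below 0.1 after correction.
print(bonferroni_report([0.01, 0.04, 0.2]))
```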

P-values for global FST estimates are calculated across all loci as well as for each locus separately. Note that random discarding of alleles (optional for inbred lines) is not possible for global FST estimates.

No statistical test is available for the moment estimator β.

 

Constrained gene diversities

The calculation of lnRH (see Kauer, Dieringer & Schlötterer, Genetics (2003)) and theta requires gene diversities larger than 0 and smaller than 1. In the case of an invariant population, MSA adds one different allele to the data set before calculating the gene diversity. For populations with gene diversity = 1, MSA duplicates one allele before calculating gene diversity.
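The two corrections described above can be sketched as follows. This is a minimal illustration under stated assumptions: gene diversity is taken as n/(n-1) * (1 - sum of squared allele frequencies), the sample-size-corrected form; alleles are integer sizes; and the size given to the added allele is an arbitrary placeholder, since only its being different matters. The function names are hypothetical and this is not MSA's own code.

```python
from collections import Counter

def gene_diversity(alleles):
    """Expected heterozygosity, corrected for sample size (assumed estimator)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two gene copies are required")
    freqs = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))

def constrained_gene_diversity(alleles):
    """Gene diversity forced into the open interval (0, 1)."""
    alleles = list(alleles)
    distinct = set(alleles)
    if len(distinct) == 1:
        # Invariant population: add one allele that differs from the observed one.
        alleles.append(alleles[0] + 1)  # placeholder size, only 'different' matters
    elif len(distinct) == len(alleles):
        # Every gene copy carries a different allele (H = 1): duplicate one allele.
        alleles.append(alleles[0])
    return gene_diversity(alleles)

print(constrained_gene_diversity([100] * 10))            # invariant sample
print(constrained_gene_diversity([100, 102, 104, 106]))  # all alleles different
```

Samples that already have a gene diversity strictly between 0 and 1 pass through unchanged.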

 

Data bootstrapping (also called Hetrange)

MSA can bootstrap over individuals or chromosomes within populations to estimate the range within which the gene diversity, the variance and the number of alleles can vary through sampling alone.
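A minimal sketch of this kind of resampling, for one locus in one population, is shown below. It resamples chromosomes with replacement and reports the minimum, mean and maximum gene diversity over the replicates. The estimator, the function names and the `seed` parameter are assumptions for the example; the default of 1000 replicates follows the standard value given for the -hetrange option.

```python
import random
from collections import Counter

def gene_diversity(alleles):
    """Expected heterozygosity, corrected for sample size (assumed estimator)."""
    n = len(alleles)
    freqs = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))

def het_range(alleles, n_replicates=1000, seed=None):
    """Bootstrap chromosomes with replacement; return (min, mean, max) gene diversity."""
    rng = random.Random(seed)  # seed is only there to make the example repeatable
    n = len(alleles)
    values = []
    for _ in range(n_replicates):
        resample = [rng.choice(alleles) for _ in range(n)]
        values.append(gene_diversity(resample))
    return min(values), sum(values) / len(values), max(values)

sample = [100, 100, 102, 102, 104, 104, 104, 106, 108, 108]
low, mean, high = het_range(sample, n_replicates=200, seed=1)
print(low, mean, high)
```

The same loop could collect the variance in allele size and the number of distinct alleles per replicate, which is what the AlleleRange and VarRange output files report.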

 

Error Report:

The program reports:

•      Large gaps in allele sizes at a given locus

•      Outlier alleles

•      Discrepancies between the assumed and observed step size

•      Inferred step size if no repeat type was specified in the input file

 


Working report:

This file provides all settings used for the analysis of the data and could be used for documentation purposes.


 

Formats:

The program converts your data into the following formats:

•      GENEPOP two digit format

•      MSVAR (Beaumont) input format for each population separately

•      STRUCTURE

•      ARLEQUIN

•      MIGRATE-format

•      Randomly discarded data sheets (1 column and 2 column format), in which one allele is randomly discarded for all populations marked as inbred (h).

The corresponding GENEPOP files are also written. All files that contain randomly discarded alleles are marked with the RD label. Non-inbred populations are not changed. RD input files are generated automatically for GENEPOP and STRUCTURE.

•      IM-format
Creates the input files for all possible population pairs (IM = isolation and migration by Jody Hey and Rasmus Nielsen).
NOTE: IM needs an additional file that sets the inheritance scalar (standard = 1) and the mutation rate per year (see the IM documentation for this).
See the input file section for how to create it.

LAMARC's XML format can easily be obtained by converting the MIGRATE format into the LAMARC format (see LAMARC documentation).

ARLEQUIN has an optional setting for genetic structure (important only for AMOVA). For this setting, MSA uses the grouping given in the MSA data sheet.

 

Remarks:

•      The MIGRATE, ARLEQUIN, MSVAR and IM input formats require that the alleles correspond to the number of repeats (not to the allele length). Therefore make sure that your microsatellites behave AS EXPECTED (observed mutational step = size of repeat unit). This can easily be checked in the error report file. MSA will assume that the observed mutational step size is the correct one. Alternatively, select the option "reduce to stepsize", which bins the observed alleles into size classes specified in the input file. Allele bins are constructed by reducing the observed size to the next size class compatible with the specified repeat number. As indels of unknown size contribute to these size shifts, the preferred strategy is to remove all loci with incorrect mutation step sizes from your data set before converting it.

•      For data sets containing inbred individuals, randomly discarded data sheets must be used to generate the MSVAR, ARLEQUIN, MIGRATE or IM formats: simply re-run MSA and use the randomly discarded data set as input file.

 

Results:

•      The results of each run of MSA are stored in a directory labeled FILENAME_resultXX. FILENAME is the name of your data sheet and XX is a number between 0 and 99. In the case of multiple runs using the same data set, the result directories are numbered consecutively.

•      Each directory contains various files and subdirectories. The table below provides information about the content of the files generated by MSA.

Subdirectory     Name of file             Description

                 Errorreport.txt          The error report of the data set. Possible errors (wrong mutation step, large gaps, ...) detected by MSA.
                 Summary.xls              All summary statistics split by locus and populations
                 MSA.log                  The protocol of the analysis

Allelecount
                 Allelecount.xls          Allele counts
                 Allelefrequency.xls      Allele frequencies

Distance_data                             Distances between populations are marked with _POP, between individuals with _IND
                 ADA_XXX.txt              Dad, Absolute differences
                 ASD_XXX.txt              D1, Average square
                 CAS_XXX.txt              Dc, Chord distance
                 DAN_XXX.txt              Da, Nei's distance
                 DMS_XXX.txt              Ddm, Delta mu square
                 FSS_XXX.txt              Dfs, Fuzzy set similarity
                 DSG_XXX.txt              Nei's standard genetic distance (1978) corrected for small sample size
                 KSC_XXX.txt              Dkf, kinship coefficient
                 POSA_XXX.txt             Dps, proportion of shared alleles

F-Statistic
                 FIS_WC84.xls             Fis
                 FIS_WC84_MULTI.xls       Mean Fis from repeated random discarding (for inbred lines only)
                 FIT_WC84.xls             Fit
                 FIT_WC84_MULTI.xls       Mean Fit from repeated random discarding (for inbred lines only)
                 FST_WC84.xls             Fst
                 FST_WC84_MULTI.xls       Mean Fst from repeated random discarding (for inbred lines only)
                 FST_WC84_OG.xls          Upper value of the 95% CI for Fst from repeated random discarding (for inbred lines only)
                 FST_WC84_UG.xls          Lower value of the 95% CI for Fst from repeated random discarding (for inbred lines only)
                 FST_WC84-pValue.xls      P-value determined by permuting genotypes/alleles
                 GlobFst.xls              Global FST, FIS, FIT over all loci and for each locus separately, with the P-values for the corresponding Fst estimates; GST (Nei) values and corresponding P-values
                 LocusFHeterogeneity.xls  Calculates the 95%/99% confidence interval of FST by resampling a given number of loci (with replacement)
                 Beta_POP.xls             βi data for populations on the diagonal and βij data (between population pairs)
                 Betaij_POP.txt           βij data (between population pairs) only, in PHYLIP input format

Formats&Data
                 Genepop.gen              Genepop format
                 GenepopRD.gen            Randomly discarded Genepop format
                 FILENAME_2C_RD           Data sheet in 2 column format, randomly discarded
                 FILENAME_1C_RD           Data sheet in 1 column format, randomly discarded
                 POPNAME.beau.infile      MSVAR infile for each population separately
                 FILENAME.migrate         Migrate/LAMARC input file
                 FILENAME.arlequin.arp    Arlequin input file
                 FILENAME.struct          Input file for Structure

Group_data                                Calculates the unweighted average from populations within a specified group
                 Hetexpgroup.xls          Expected heterozygosity split by group and loci
                 HetexpRDgroup.xls        Expected heterozygosity split by group and loci based on the mean of 200 randomly discarded data sets
                 Vargroup.xls             Variance split by group and loci
                 VarRDgroup.xls           Variance split by group and loci based on the mean of 200 randomly discarded data sets
                 VarRDrepeatgroup.xls     Variance in repeat number split by group and loci based on the mean of 200 randomly discarded data sets
                 Varrepeatgroup.xls       Variance in repeat number split by group and loci
                 SumGrpXX.xls             Corresponds to the file Summary.xls, but separate for each group and split only by loci

Resampling_data                           Randomized range of gene diversity for each locus and population
                 AlleleRangeMax.xls       Maximum allele number (determined by bootstrapping) split by locus and population
                 AlleleRangeMean.xls      Mean allele number (determined by bootstrapping) split by locus and population
                 AlleleRangeMin.xls       Minimum allele number (determined by bootstrapping) split by locus and population
                 HetRangeMax.xls          Maximum gene diversity (determined by bootstrapping) split by locus and population
                 HetRangeMean.xls         Mean gene diversity (determined by bootstrapping) split by locus and population
                 HetRangeMin.xls          Minimum gene diversity (determined by bootstrapping) split by locus and population
                 VarRangeMax.xls          Maximum variance (determined by bootstrapping) split by locus and population
                 VarRangeMean.xls         Mean variance (determined by bootstrapping) split by locus and population
                 VarRangeMin.xls          Minimum variance (determined by bootstrapping) split by locus and population
                 NAME_AlleleRange.xls     Report of all allele numbers obtained by N bootstrap replications
                 NAME_HetRange.xls        Report of all gene diversities obtained by N bootstrap replications
                 NAME_VarRange.xls        Report of all variances obtained by N bootstrap replications

Single_data                               All of them split by loci and populations
                 AllelicRichness.xls      Reports the calculated values of allelic richness
                 GST.xls                  GST per locus and global (Nei as well as Hamrick & Godt)
                 Hetexp.xls               Expected heterozygosity
                 HetRDexp.xls             Expected heterozygosity based on the mean of 200 randomly discarded data sets
                 MaxAllele.xls            Maximum allele length
                 Mean.xls                 Mean allele length (in the case of RD: based on the mean of 200 randomly discarded data sets)
                 MinAllele.xls            Minimum allele length
                 NumAlleles.xls           Number of alleles
                 NumChromosomes.xls       Number of analyzed chromosomes
                 Shannon.xls              Shannon index
                 Var.xls                  Variance in allele length
                 VarRD.xls                Variance in allele length based on the mean of 200 randomly discarded data sets
                 Varrepeat.xls            Variance in repeat number
                 VarRDrepeat.xls          Variance in repeat number based on the mean of 200 randomly discarded data sets



How to cite

Dieringer, Daniel & Schlötterer, Christian (2003) Microsatellite analyser (MSA): a platform independent analysis tool for large microsatellite data sets. Molecular Ecology Notes 3 (1), 167-169

Contact:

daniel.dieringer@boku.ac.at

 

Download:

The program can be found at: http://i122server.vu-wien.ac.at

 

Disclaimer:

Copyright © 2001,2002,2003,2004,2005,2006,2007 Daniel Dieringer.

Permission to use and distribute this software and its documentation for any purpose is hereby granted without fee, provided the above copyright notice, author statement and this permission notice appear in all copies of this software and related documentation.

 

THE SOFTWARE IS PROVIDED "AS-IS" AND WITHOUT WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR OTHERWISE, INCLUDING WITHOUT LIMITATION, ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT SHALL THE AUTHOR, THE "INSTITUT FÜR TIERZUCHT UND GENETIK" OR THE VETERINARY UNIVERSITY OF VIENNA BE LIABLE FOR ANY SPECIAL, INCIDENTAL, INDIRECT OR CONSEQUENTIAL DAMAGES OF ANY KIND, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER OR NOT ADVISED OF THE POSSIBILITY OF DAMAGE, AND ON ANY THEORY OF LIABILITY, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

References


Bowcock, A.M., Ruíz-Linares, A., Tomfohrde, J., Minch, E., Kidd, J.R., Cavalli-Sforza, L.L. (1994) High resolution human evolutionary trees with polymorphic microsatellites. Nature 368: 455-457

Cavalli-Sforza, L.L. and Bodmer, W.F. (1971) The Genetics of Human Population, p. 399, San Francisco, W.H. Freeman and Company

Cavalli-Sforza, L.L., and Edwards, A.W.F. (1967) Phylogenetic analysis: models and estimation procedures. Amer. J. Hum. Genet. 19: 233-257.

Culley, T.M., Wallace, L.E., Gengler-Nowak, K.M., and Crawford, D.J. (2002) A comparison of two methods of calculating GST, a genetic measure of population differentiation. Amer. J. of Botany 89(3): 460-465.

Dubois, D. and Prade, H. (1980) Fuzzy Sets and Systems: Theory and Applications, p. 24, New York, Academic Press.

El Mousadik, A. and Petit, R.J. (1996) High level of genetic differentiation for allelic richness among populations of the argan tree [Argania spinosa (L.) Skeels] endemic to Morocco. Theor. Appl. Genet. 92: 832-839

Ewens, W.J. (1972) The sampling theory of selectively neutral alleles. Theor. Popul. Biol. 3: 87-112

Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L. and Feldman, M.W. (1995a) Genetic absolute dating based on microsatellites and the origin of modern humans. Proc. Natl. Acad. Sci. USA 92: 6723-6727.

Goldstein, D.B., Ruiz Linares, A., Cavalli-Sforza, L.L. and Feldman, M.W. (1995b) An evaluation of genetic distances for use with microsatellite loci. Genetics  139: 463-471.

Hedrick, P.W. (2005) A standardized genetic differentiation measure. Evolution 59(8): 1633-1638.

Hurlbert, S.H. (1971) The nonconcept of species diversity: a critique and alternative parameters. Ecology 52: 577-586

Kimura, M. and Ohta, T. (1975) Distribution of allelic frequencies in a finite population under stepwise production of neutral alleles. Proc. Natl. Acad. Sci. USA 72: 2761-2764.

Krebs, C.J. (1989) Ecological Methodology. Harper & Row, New York.

Michalakis, Y. and Excoffier, L. (1996) A generic estimation of population subdivision using distances between alleles with special reference for microsatellite loci. Genetics 142: 1061-1064.

Nei, M., Tajima, F. and Tateno, Y. (1983) Accuracy of estimated phylogenetic trees from molecular data. J. Mol. Evol. 19: 153-170

Nei, M. (1978) Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583-590

Nei, M. (1987) Molecular Evolutionary Genetics, eq. (7.39), p. 164. Columbia University Press, New York

Slatkin, M. (1995) A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462

Shannon, C.E. and Weaver, W. (1949) The Mathematical Theory of Communication. University of Illinois Press, IL

Weir, B.S. and Cockerham, C.C. (1984) Estimating F-Statistics for the Analysis of Population Structure, Evolution 38(6): 1358-1370

Weir, B.S. and Hill, W.G. (2002) Estimating F-Statistics, Annu. Rev. Genet. 36: 721-750

 


History

Changes with version 3.00:

Bugfixes:

•      Problems with the distance calculations D1, (δµ)², Dad.
Note that the previous version incorrectly estimated the distances based on PCR product size rather than repeat length

•      Dc per locus was calculated incorrectly
Both bugs were reported by Ana Dominguez Sanjurjo

Note: if you intend to use D1, (δµ)², Dad or Dc per locus, do NOT use versions earlier than 3.0

 

New features:

Estimate for sampling variance of gene diversity:

Calculates the range of gene diversity (min, mean, max) determined by bootstrapping

 

Estimate of theta based on gene diversity and the SMM mutation model

 

Bugfixes with version 3.01:

•      Calculation of theta based on gene diversity and the SMM mutation model was (obviously) wrong when a locus had gene diversity 1. As a result of the error, the correct result was written to the file ThetaRDexp.xls instead of the file Thetaexp.xls.

 

Version 3.10:  

•      New function added: estimates the heterogeneity of the FST values by bootstrapping a specified number of loci.

Version 3.12:

•      Calculation of D1 (ASD) between individuals is now suppressed, as it requires the variance of the population.

Version 3.14:

•      A bug in global FST calculations was removed that replaced the P-value for the global FST with zero when it was larger than the minimum (1/number of permutations)

Version 3.15:

•      Version 3.14 had a problem generating GENEPOP-formatted files. This problem is now fixed.

Version 3.17:

•      MSA is able to write individual-based distance files in NEXUS format instead of PHYLIP format. (Please note that this function is only available without bootstrapping.)

•      The menu structure was changed to make the FST menu easier to understand.

Version 4.00:

•      MSA can now run the analysis using command line arguments (see the Command Line section).

•      The function "sampling variance of gene diversity" had an error when individuals were sampled.

•      The same function gave a wrong result for the zero correction; this is now corrected.

Version 4.02:

•      The tabbed output for FIS was -nan when the number of alleles was 1. This is now corrected to "n.d."

•      MSA can now create input files for IM (see above)

•      MSA now calculates the FST-like moment estimators βi and βij (see above)

Version 4.04:

•      For the hetrange analysis, MSA now also outputs the variance and the number of alleles.

•      Allelic richness was added to the overview methods.

Version 4.05:

•      For bootstrapping of distances, the number of loci used for the replicate data sets can now be set.

•      Added the variance of allelic richness.

•      Small changes in documentation

•      Data files no longer have to be located in the same folder as the MSA executable

•      Included GST and GST' in descriptive statistics and in the global FST calculation (including P-values)

 


Description of genetic distances

D1, ASD, Average Square Distance [Goldstein et al., 1995b; Slatkin, 1995]:

The average square distance was derived by employing the analytical theory developed by Moran for the distribution of alleles mutating under a strict stepwise mutation process in a population of finite constant size with non-overlapping generations. ASD and its family of related distances, such as (δμ)², are superior to other distances for microsatellites in that their expectation is linear in time, which makes them well suited to evolutionary studies.
 

Nk = (Mean(Pop1,k) - Mean(Pop2,k))^2 + (ni-1)*Var(Pop1,k)/ni + (nj-1)*Var(Pop2,k)/nj

D = number of loci

N = Σk(Nk)
Sum over all loci (k = 1..D)

ASD = N/D

Distance based on individuals is not calculated, since D1 is equal to (δμ)² [Ddm] when the variance term is removed.
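
The ASD formula above can be sketched in Python as follows. This is a minimal illustration, not MSA's code: the function name asd and the input layout (one list of allele sizes per locus) are hypothetical, and Var is assumed to be the unbiased sample variance.

```python
from statistics import mean, variance

def asd(pop1, pop2):
    """Average square distance between two populations.

    pop1, pop2: lists of loci; each locus is a list of allele sizes
    (repeat counts), one entry per sampled chromosome.
    """
    total = 0.0
    for a1, a2 in zip(pop1, pop2):
        ni, nj = len(a1), len(a2)
        # Assumption: Var is the unbiased sample variance (n - 1 denominator),
        # so (n - 1) * Var / n rescales it to the n denominator.
        nk = ((mean(a1) - mean(a2)) ** 2
              + (ni - 1) * variance(a1) / ni
              + (nj - 1) * variance(a2) / nj)
        total += nk
    return total / len(pop1)  # ASD = N / D
```

Each locus needs at least two alleles per population, otherwise the sample variance is undefined.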

 

Dps, Proportion of shared alleles [Bowcock et al., 1994]

The general definition of the proportion of shared alleles at a given locus, which holds true whether the taxa are individuals or population samples, is the mean over loci of the sums of the minima of the relative frequencies of all alleles between compared taxa.

ps = ΣkΣa min( fa,i , fa,j )/D

Sum over all loci and all alleles
fa,i / fa,j = frequency of allele a in pop i / j
D = number of loci

The distance can be taken as: Dps = -ln(ps), or Dps' = 1-ps
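
As an illustration of the definition above, the proportion of shared alleles could be computed like this. The function name and the dict-per-locus input are assumptions made for this sketch, not part of MSA.

```python
from math import log

def proportion_shared(freqs_i, freqs_j):
    """freqs_i, freqs_j: one dict per locus mapping allele -> relative frequency."""
    total = 0.0
    for fi, fj in zip(freqs_i, freqs_j):
        alleles = set(fi) | set(fj)
        # An allele absent from one taxon has frequency 0 there.
        total += sum(min(fi.get(a, 0.0), fj.get(a, 0.0)) for a in alleles)
    return total / len(freqs_i)

ps = proportion_shared([{100: 0.5, 102: 0.5}], [{100: 0.25, 104: 0.75}])
dps = -log(ps)       # Dps
dps_prime = 1 - ps   # Dps'
```

Note that -ln(ps) is undefined when the two taxa share no alleles at all (ps = 0).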


Dfs, Fuzzy set similarity [Dubois and Prade, 1980]

The fuzzy set similarity for a pair of taxa is the ratio between the cardinality of the intersection of their alleles and the cardinality of the union of their alleles, e.g., if two individuals have genotypes ab and ac, the intersection is {a}, the union is {a,b,c}, and the ratio is 1/3.

The distance can be taken as: Dfs = -ln(fs), or Dfs'= 1-fs
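
The single-locus example above (genotypes ab and ac) can be reproduced with Python sets. The function name is hypothetical, and how MSA combines several loci is not described here, so the sketch covers one locus only.

```python
def fuzzy_similarity(alleles_a, alleles_b):
    """Ratio of shared alleles to all alleles seen in either taxon (one locus)."""
    sa, sb = set(alleles_a), set(alleles_b)
    return len(sa & sb) / len(sa | sb)

fs = fuzzy_similarity("ab", "ac")  # intersection {a}, union {a, b, c}
```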

 

Dkf, Kinship coefficient [Cavalli-Sforza and Bodmer, 1971]

The kinship coefficient is the probability that a gene taken at random from population i (at a given locus) be identical by descent to a gene taken at random from population j at the same locus.

kf = ΣkΣa ( fa,i * fa,j )/D

Sum over all loci and all alleles
fa,i / fa,j = frequency of allele a in pop i / j
D = number of loci

The distance can be taken as: Dkf = -ln(kf), or Dkf' = 1-kf
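
A short sketch of the kinship formula, assuming allele frequencies are available as one dict per locus (the function name and input layout are illustrative only):

```python
def kinship(freqs_i, freqs_j):
    """Mean over loci of the summed products of allele frequencies."""
    total = 0.0
    for fi, fj in zip(freqs_i, freqs_j):
        # Alleles missing from population j contribute a product of 0.
        total += sum(f * fj.get(a, 0.0) for a, f in fi.items())
    return total / len(freqs_i)
```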
 

Dad, absolute difference

Dad = Σk abs( Mean(Pop1,k) - Mean(Pop2,k) )/D

Sum over all loci
Mean(Pop1/2,k) ... mean allele size of population 1/2 at locus k
D = number of loci

 

(δμ)², Ddm [Goldstein et al. 1995a]

Ddm = Σk ( Mean(Pop1,k) - Mean(Pop2,k) )^2 /D

Sum over all loci
Mean(Pop1/2,k) ... mean allele size of population 1/2 at locus k
D = number of loci
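
Dad and Ddm both depend only on the per-locus difference in mean allele size, so they can be sketched together. The function name and input layout (one list of allele sizes per locus) are hypothetical.

```python
from statistics import mean

def mean_based_distances(pop1, pop2):
    """Return (Dad, Ddm) from per-locus differences in mean allele size."""
    diffs = [mean(a1) - mean(a2) for a1, a2 in zip(pop1, pop2)]
    d = len(diffs)
    dad = sum(abs(x) for x in diffs) / d   # absolute difference
    ddm = sum(x * x for x in diffs) / d    # squared difference
    return dad, ddm
```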


Dc [Cavalli-Sforza and Edwards, 1967]

Dc = 2 Σk (Dc,k)/(π D)

Sum over all loci

Dc,k = sqrt(2(1 - Σa(sqrt(fa,i * fa,j ) )))

Sum over all alleles
D = number of loci
fa,i / fa,j = frequency of allele a in pop i / j

Da [Nei et al. 1983]

Da = 1 - ΣkΣa sqrt( fa,i * fa,j )/D

Sum over all Loci and all Alleles
fa,i/a,j = Frequency of allele a in Pop i/j
D=Number of Loci
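
Dc and Da share the per-locus term "sum over alleles of sqrt(fa,i * fa,j)". The sketch below uses the published chord distance of Cavalli-Sforza and Edwards (1967), in which that sum is subtracted from 1 inside the square root. Function names and the dict-per-locus input are assumptions of this example.

```python
from math import pi, sqrt

def shared_root_sum(fi, fj):
    """Sum over alleles of sqrt(fa,i * fa,j) at one locus."""
    return sum(sqrt(f * fj.get(a, 0.0)) for a, f in fi.items())

def chord_distance(freqs_i, freqs_j):
    """Dc: (2 / (pi * D)) times the sum over loci of sqrt(2 * (1 - root sum))."""
    d = len(freqs_i)
    # max() guards against tiny negative values caused by rounding.
    total = sum(sqrt(2 * max(0.0, 1 - shared_root_sum(fi, fj)))
                for fi, fj in zip(freqs_i, freqs_j))
    return 2 * total / (pi * d)

def nei_da(freqs_i, freqs_j):
    """Da: 1 minus the mean over loci of the root sum."""
    d = len(freqs_i)
    return 1 - sum(shared_root_sum(fi, fj)
                   for fi, fj in zip(freqs_i, freqs_j)) / d
```

With this form, two populations with identical allele frequencies have Dc = Da = 0, and two populations sharing no alleles have Da = 1.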

D  [Nei, 1978]

Nei's standard genetic distance

Sum1 = sqrt((nCki*Σa ( fa,i * fa,i )-1)/(nCki-1))

Sum2 = sqrt((nCkj*Σa ( fa,j * fa,j )-1)/(nCkj-1))

id = Σk(Σa ( fa,i * fa,j )/( Sum1*Sum2)  )/DL

 

Sum over all loci and all alleles
fa,i / fa,j = frequency of allele a in pop i / j
nCki / nCkj = number of chromosomes at locus k in pop i / j
DL = number of loci

The distance can be taken as: D= -ln(id), or D'= 1-id
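
The identity formula above can be transcribed as follows. This is a sketch under the stated definitions, with hypothetical names; it does not guard against a locus where nCk times the sum of squared frequencies equals 1 (every sampled chromosome carrying a different allele), which makes Sum1 or Sum2 zero.

```python
from math import log, sqrt

def nei_identity(freqs_i, freqs_j, n_chrom_i, n_chrom_j):
    """Mean over loci of the per-locus identity.

    freqs_*: one dict per locus mapping allele -> relative frequency.
    n_chrom_*: number of sampled chromosomes per locus.
    """
    total = 0.0
    for fi, fj, ni, nj in zip(freqs_i, freqs_j, n_chrom_i, n_chrom_j):
        sum1 = sqrt((ni * sum(f * f for f in fi.values()) - 1) / (ni - 1))
        sum2 = sqrt((nj * sum(f * f for f in fj.values()) - 1) / (nj - 1))
        cross = sum(f * fj.get(a, 0.0) for a, f in fi.items())
        total += cross / (sum1 * sum2)
    return total / len(freqs_i)

def nei_distance(identity):
    return -log(identity)  # D = -ln(id)
```

Because of the sample-size correction in Sum1 and Sum2, the identity can slightly exceed 1 for small samples, in which case D = -ln(id) becomes negative.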


F-Statistics

FST,FIS,FIT [Weir and Cockerham, 1984]

For one Allele:

FIT,l,a=1-(c/(a+b+c))
FST,l,a=a/(a+b+c)
FIS,l,a=1-(c/(b+c))

a = nq (s^2-(pq(1-pq)-(r-1)s^2/r-hq/4)/(nq-1))/nc
b = nq (pq(1-pq)-(r-1)s^2/r-(2nq-1)hq/(4nq))/(nq-1)
c = hq/2

pi ... frequency of allele A in population i
ni ... sample size of population i
hi ... observed proportion of individuals heterozygous for allele A
nq ... average sample size (genomes!)
nc ... (r nq - Σi(ni^2) / (r nq))/(r-1)
r  ... number of compared populations
pq ... Σi(ni pi) / (r nq)
        average sample frequency of allele A
s^2... Σi(ni (pi-pq)^2)/((r-1) nq)
        sample variance of allele A frequencies over populations
hq ... Σi(ni hi) /(r nq)
        average heterozygote frequency for allele A
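
The single-allele quantities above can be written out step by step. The sketch below is a direct transcription of those formulas with hypothetical function names; the sample sizes ni are taken as given by the definitions above, and no handling of missing data is included.

```python
def wc_components(ns, ps, hs):
    """Variance components a, b, c for one allele.

    ns: sample size per population (ni)
    ps: frequency of the allele per population (pi)
    hs: proportion of individuals heterozygous for the allele (hi)
    """
    r = len(ns)
    nq = sum(ns) / r
    nc = (r * nq - sum(n * n for n in ns) / (r * nq)) / (r - 1)
    pq = sum(n * p for n, p in zip(ns, ps)) / (r * nq)
    s2 = sum(n * (p - pq) ** 2 for n, p in zip(ns, ps)) / ((r - 1) * nq)
    hq = sum(n * h for n, h in zip(ns, hs)) / (r * nq)
    within = pq * (1 - pq) - (r - 1) * s2 / r
    a = nq * (s2 - (within - hq / 4) / (nq - 1)) / nc
    b = nq * (within - (2 * nq - 1) * hq / (4 * nq)) / (nq - 1)
    c = hq / 2
    return a, b, c

def f_statistics(a, b, c):
    """Return (FIT, FST, FIS) for one allele."""
    return 1 - c / (a + b + c), a / (a + b + c), 1 - c / (b + c)
```

The ratios are undefined when a + b + c or b + c is zero, for example when the allele is fixed in every population, so such alleles need separate handling.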

Over all alleles and loci:

FIT = ΣlΣa FIT,l,a
FST = ΣlΣa FST,l,a
FIS = ΣlΣa FIS,l,a
 

Note: Pairwise FST values over all loci are calculated excluding any locus that is missing in one of the two populations.

Calculation of P-values:

Alleles or genotypes of each pair of populations are permuted N times.
The resampling units are alleles (when Hardy-Weinberg equilibrium is assumed) or genotypes (when it is not).

The upper triangle contains the proportion of permuted FST values equal to or greater than the observed FST.
The lower triangle contains these proportions multiplied by the number of tests (= strict Bonferroni correction).
These values test the hypothesis that the populations are equal: a small value is evidence against it.
A value greater than 0.05 is replaced with ns (= not significant).
In general, N should be greater than: number of tests / level of significance.
To obtain the number of tests before running the permutations, analyse your data without distances and choose the option [s] in the main menu to show data information, or calculate (P*P-P)/2, where P is the number of populations.
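
The permutation procedure and the Bonferroni step can be sketched generically. Everything here is illustrative: the function names are hypothetical, the statistic is passed in as a function (for FST it would be computed from the permuted alleles or genotypes), and MSA's own implementation may differ in detail.

```python
import random

def n_pairwise_tests(p):
    """Number of population pairs: (P*P - P) / 2."""
    return (p * p - p) // 2

def min_permutations(n_tests, alpha=0.05):
    """N should exceed: number of tests / level of significance."""
    return n_tests / alpha

def permutation_p_value(units_i, units_j, statistic, n_perm, rng):
    """Proportion of permuted statistics >= the observed one.

    units_i, units_j: resampling units (alleles or genotypes) of two populations.
    """
    observed = statistic(units_i, units_j)
    pooled = list(units_i) + list(units_j)
    k = len(units_i)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign units to the two populations at random
        if statistic(pooled[:k], pooled[k:]) >= observed:
            hits += 1
    return hits / n_perm

def corrected_entry(p, n_tests, alpha=0.05):
    """Lower-triangle entry: Bonferroni-corrected value, or 'ns' above alpha."""
    corrected = p * n_tests
    return "ns" if corrected > alpha else corrected
```

For example, with 5 populations there are 10 pairwise tests, so more than 200 permutations are needed before a Bonferroni-corrected value can fall below 0.05. A seeded generator such as random.Random(0) makes a run reproducible.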

Handling inbred lines:
To get better performance when analysing inbred lines, you can choose the random discarding option, which is applied before each permutation (for all "haploid" populations).
This gives you the mean over all randomly discarded FST values, as well as the P-value for them.
You also get the upper and lower bounds of the 95% CI.
In practice, this procedure is in most cases not necessary, because non-discarded data sets are not biased and do not deviate from the mean of 1000 randomly discarded data sets (data not shown).

Moment Estimator  [Weir and Hill, 2003]


Sample input file

 


