The initial approach was conceived by NAR, with help from JKP and MWF. The construction of subsamples with different levels of geographic dispersion was performed by SM and NAR. CZ contributed to the design and construction of the marker panels and to initial analysis with STRUCTURE; the full STRUCTURE analysis was designed by NAR and SM, with help from JKP, and was performed by SM with help from NAR. The regression analyses were designed by NAR with help from MWF, and were performed by NAR. The genetic/geographic distance analysis was designed by SR and NAR and was performed by SR. NAR wrote the paper with help from SR, JKP, and MWF.

The authors have declared that no competing interests exist.

Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables—sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample—on the “clusteredness” of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.

By helping to frame the ways in which human genetic variation is conceptualized, an understanding of the genetic structure of human populations can assist in inferring human evolutionary history, as well as in designing studies that search for disease-susceptibility loci. Previously, it has been observed that when individual genomes are clustered solely by genetic similarity, individuals sort into broad clusters that correspond to large geographic regions. It has also been seen that allele frequencies tend to vary continuously across geographic space. These two perspectives seem to be contradictory, but in this article the authors show that they are indeed compatible.

First the authors demonstrate that the clusters are robust, in that if sufficient data are used, the geographic distribution of the sampled individuals has little effect on the analysis. They then show that allele frequency differences generally increase gradually with geographic distance. However, small discontinuities occur as geographic barriers are crossed, allowing clusters to be produced. These results provide a greater understanding of the factors that generate the clusters, verifying that they arise from genuine features of the underlying pattern of human genetic variation, rather than as artifacts of uneven sampling along continuous gradients of allele frequencies.

It has recently been demonstrated in several studies that to a large extent, without prior knowledge of individual origins, the geographic ancestries of individuals can be inferred from genetic markers [

To further ascertain the degree of difficulty in obtaining the genetic clusters, several articles have considered the influence of properties of the study design on the extent of clustering [

Other factors besides sample size and number of markers, however, may influence clustering patterns. Serre and Pääbo [

Each distribution is obtained by binning the values of _{n}

In this article, we perform an extensive evaluation of the role of study design on genetic clustering, considering both geographic dispersion and allele frequency correlation, as well as sample size, number of loci, and number of clusters. The dataset employed is an expansion of our original data [

We utilized the unsupervised clustering algorithm implemented in STRUCTURE [_{st}

Each individual is represented by a thin line partitioned into

A total of 367,220 runs of STRUCTURE were performed on subsets of a dataset consisting of 1,048 individuals from the Human Genome Diversity Project–Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) Human Genome Diversity Panel [

Each point shows the mean clusteredness of 2,000 runs with the specified sample size and allele frequency correlation model: two replicates for each of ten sets of loci for each of 100 sets of individuals (for 1,048 individuals, it is the mean of 20 runs, as only one set of individuals was used; for 1,048 individuals and 993 loci, it is the mean of two runs, as only one set of loci was used). Error bars denote standard deviations. The

The 100 sets of individuals used were selected to have a wide range of levels of geographic dispersion (_{n}_{n}_{n}_{n}

Each point shows the mean clusteredness of 20 runs with the specified number of loci and allele frequency correlation model: two replicates for each of ten sets of loci (for 993 loci, it is the mean of two runs, as only one set of loci was used). From left to right, the three groups of points in each plot respectively represent sets of 100, 250, and 500 individuals.

The two sets of 100 individuals represent extremes of the distribution of _{n}:

Red circles indicate comparisons between pairs of populations with majority representation in the same cluster in the

Representative estimates of the population structure based on the full dataset are shown in

To examine the influence of the study design parameters on clusteredness, we separately considered each variable, holding the others constant. This analysis included linear regressions of clusteredness on each variable for each possible combination of values of the other variables. We also analyzed the full collection of runs to determine the relative contributions of the quantities considered to variability in clusteredness.

Holding the number of clusters, sample size, and allele frequency correlation model fixed, the general trend was that clusteredness was noticeably smaller for ten and 20 loci, and was larger for 50 or more loci. The coefficient of determination (R^{2}) across the 40 regressions equaled 0.454.

When the number of loci, sample size, and correlation model were held constant, R^{2} was larger across the 28 combinations with the correlated model (0.382) than it was for the 28 combinations with the uncorrelated model (0.147).

Clusteredness Mean and Standard Deviation for the Correlated and Uncorrelated Allele Frequency Models

Influence of the Number of Clusters

Holding the number of loci, number of clusters, and correlation model fixed, clusteredness was generally higher for the samples of size 250 and 500 than it was for the samples of size 100. R^{2} across the 70 combinations equaled 0.511.

Influence of the Sample Size on Clusteredness

With the correlation model and the numbers of loci, clusters, and individuals held constant, the inferred population structure was generally similar for different values of _{n}_{n}

Often, geographic dispersion had a negative rather than a positive influence on clusteredness. R^{2} across the 210 regressions was only 0.045.

Influence of the Geographic Dispersion

With the numbers of loci, clusters, and individuals held constant, the correlation model had a noticeable influence on clusteredness, with the correlated model usually producing higher clusteredness than the uncorrelated model (see

With each sample size, considering all 122,000 STRUCTURE runs with the given sample size, the R^{2} values for regressions of clusteredness on individual variables were greatest for the number of loci and the allele frequency correlation model, and smallest for the number of clusters and the geographic dispersion (

Values of R^{2} for Regressions of Clusteredness on Study Design Variables

In this article, we have systematically analyzed the influence of five variables on the genetic clustering of individuals from genome-wide markers: number of loci, sample size, number of clusters, geographic dispersion of the sample, and assumptions about allele frequency correlation. Each of these variables was found to have an effect on clustering. Holding all other variables constant, geographic dispersion had a relatively modest effect on clusteredness, with a considerably smaller R^{2} than number of loci, sample size, or number of clusters. Additionally, geographic dispersion was generally less consistent in the direction in which it affected clusteredness, although in contrast to what was expected based on the results of [_{n}

Unlike geographic dispersion, the number of loci and sample size both had strong direct relationships with clusteredness for nearly all combinations of the other variables. Excluding a few scenarios that utilized two clusters, the correlated model produced significantly greater clusteredness for nearly all combinations of the other variables, when a large number of STRUCTURE runs were performed. The number of clusters influences the way in which individual membership coefficients are distributed, but its effect on the clusteredness statistic was found to be smaller than that of the number of loci or the sample size. The effect of the number of clusters depended on the choice of correlation model: in the correlated model, clusteredness generally increased with

Two main claims of Serre and Pääbo [_{n}_{n}

Second, in three analyses with the uncorrelated allele frequencies model, each of which used 261 individuals, Serre and Pääbo observed a reduction in clusteredness compared with analyses using 261 individuals and the correlated model, and compared with analyses based on 1,066 individuals and either model. They attributed the different results in these scenarios to the use of the uncorrelated frequencies model. We found, however, that with either the correlated or the uncorrelated allele frequencies model, holding all other variables constant, when 100 samples of size 250 were considered, clusteredness differed for the samples of size 250 compared with those of size 1,048 (

Even if the frequency correlation model actually provided the sole explanation for the weaker clustering in their analysis, we question the basis for assuming that allele frequencies are uncorrelated across populations. Allele frequencies should be expected to be correlated, on the basis of the shared descent of all human populations from the same set of ancestral groups. Clearly, as has been shown in simulations [

Correlation Coefficients of Allele Frequencies

In summary, the observation of [

Serre and Pääbo [

How can these seemingly discordant perspectives on human genetic diversity be reconciled?

For population pairs from the same cluster, as geographic distance increases, genetic distance increases in a linear manner, consistent with a clinal population structure. However, for pairs from different clusters, genetic distance is generally larger than that between intracluster pairs that have the same geographic distance. For example, genetic distances for population pairs with one population in Eurasia and the other in East Asia are greater than those for pairs at equivalent geographic distance within Eurasia or within East Asia. Loosely speaking, it is these small discontinuous jumps in genetic distance—across oceans, the Himalayas, and the Sahara—that provide the basis for the ability of STRUCTURE to identify clusters that correspond to geographic regions.

Two exceptions to the pattern include the Hazara and Uygur populations, from Pakistan and western China, respectively, whose genetic distances scale continuously with geographic distance both for populations in Eurasia and for those in East Asia. These populations were evenly split across the clusters corresponding to Eurasia and East Asia, and thus, unlike most other populations, they do not reflect a discontinuous jump in genetic distance with geographic distance. Finally, a third population of interest in the plot is the Kalash population (of Pakistan), whose genetic distances to other populations are large at all geographic distances, illustrating the distinctiveness of the group as the only member of its own genetic cluster in some STRUCTURE analyses with

Excluding points that involve Hazara, Kalash, or Uygur, a linear regression on geographic distance for the points in the plot yields R^{2} = 0.690. When an additional binary variable is included, R^{2} increases to 0.729. The regression equation is _{st}_{st}^{2}, the discontinuities that give rise to genetic clusters, as we have stated previously [

Our evidence for clustering should not be taken as evidence of our support of any particular concept of “biological race.” In general, representations of human genetic diversity are evaluated based on their ability to facilitate further research into such topics as human evolutionary history and the identification of medically important genotypes that vary in frequency across populations. Both clines and clusters are among the constructs that meet this standard of usefulness: for example, clines of allele frequency variation have proven important for inference about the genetic history of Europe [

The dataset analyzed here consists of 1,048 individuals from the HGDP-CEPH Human Genome Diversity Panel [

The set of individuals used here differs slightly from that studied by [

The geographic dispersion of a set of

where ψ_{ij}_{n}_{n}_{ij}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{j}_{j}_{j}_{j}_{j}_{j}_{j}_{j}
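The definition above is built from pairwise quantities ψ_{ij} for the sampled individuals, but the full expression is not legible in this text. As a rough illustration only, the sketch below computes one plausible dispersion summary: the mean pairwise great-circle distance between sampling locations given as latitude and longitude in degrees. The haversine formula, the Earth radius constant, the simple averaging, and the function names are all assumptions made for this example and should not be read as the statistic used in the paper.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the paper


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km; inputs are in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to guard against tiny floating-point overshoot above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def mean_pairwise_distance(coords):
    """Hypothetical dispersion summary: average distance over all pairs.

    coords is a list of (latitude, longitude) tuples, one per individual.
    """
    pairs = list(combinations(coords, 2))
    if not pairs:
        raise ValueError("need at least two individuals")
    total = sum(great_circle_km(a[0], a[1], b[0], b[1]) for a, b in pairs)
    return total / len(pairs)
```

Under this assumed summary, a sample drawn from a single location has dispersion 0, and the value grows as the sampling locations spread across the globe.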

Method 1 of [_{n}

To determine the distribution of _{n}_{n}

To obtain the distribution of _{n}_{n}_{n}

To measure the average “clusteredness” of individuals, or the extent to which individuals were estimated to belong to a single cluster rather than to a combination of clusters, we computed for each STRUCTURE run the quantity

where _{ik}
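Because the formula itself is missing from this text, the following is only a sketch of a statistic with the behaviour described: it averages, over individuals, a scaled distance between each individual's membership coefficients and the uniform vector (1/K, ..., 1/K). The scaling factor K/(K-1), the symbol names, and the function name are assumptions chosen so that the value is 0 when every individual has equal membership in all K clusters and 1 when every individual belongs entirely to one cluster.

```python
import math


def clusteredness(q):
    """Average clusteredness for one run (assumed form, see lead-in).

    q is a list of rows, one per individual; each row holds that
    individual's K membership coefficients, which should sum to 1.
    """
    if not q:
        raise ValueError("need at least one individual")
    k = len(q[0])
    if k < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for row in q:
        if len(row) != k:
            raise ValueError("all rows must have K coefficients")
        # Squared distance from equal membership in all clusters.
        ss = sum((x - 1.0 / k) ** 2 for x in row)
        total += math.sqrt(k / (k - 1) * ss)
    return total / len(q)
```

For example, `clusteredness([[1.0, 0.0], [0.5, 0.5]])` averages a fully assigned individual (1) with an evenly mixed one (0).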

All runs of the STRUCTURE program [

Linear regression was used to test the influence of study design variables on clusteredness. To control for the effects of the other variables, each regression utilized only STRUCTURE runs in which variables other than the one being tested were held constant. For example, to examine the influence of the number of clusters on clusteredness, 56 separate regressions were performed, one for each combination of the number of loci (seven possibilities), the sample size (four possibilities), and the allele frequency correlation model (two possibilities). Similarly, 40 regressions of clusteredness on the base-10 logarithm of the number of loci were performed, as were 70 regressions of clusteredness on sample size and 210 regressions of clusteredness on _{n}_{n},_{n}
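The stratified design described here can be sketched as follows: runs are grouped by every variable except the one being tested, and a simple linear regression is fitted within each group. This is a minimal illustration, not the authors' code; the record layout (a list of dictionaries), the key names, and both function names are hypothetical.

```python
from collections import defaultdict


def r_squared(xs, ys):
    """R^2 of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return float("nan")  # regression undefined without variation
    return sxy * sxy / (sxx * syy)


def stratified_r_squared(runs, target, response="clusteredness"):
    """One regression per combination of the variables held constant.

    runs is a list of dicts; every key other than target and response
    is treated as a variable to hold fixed.
    """
    groups = defaultdict(list)
    for run in runs:
        key = tuple(sorted((k, v) for k, v in run.items()
                           if k not in (target, response)))
        groups[key].append((run[target], run[response]))
    return {key: r_squared([x for x, _ in pts], [y for _, y in pts])
            for key, pts in groups.items()}
```

With records keyed by, say, number of loci, sample size, correlation model, and number of clusters, calling `stratified_r_squared(runs, "clusters")` would produce one R^{2} per combination of the remaining three variables.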

In the case of the allele frequency correlation model, the runs with the correlated and uncorrelated models were compared using the Wilcoxon two-sample test instead of with linear regression. Because there were seven numbers of loci, five numbers of clusters, and four numbers of individuals, 140 separate tests were performed.
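For readers unfamiliar with the test, the sketch below computes a Wilcoxon two-sample (rank-sum) statistic and its large-sample z-score for two lists of clusteredness values. It is a simplified illustration under stated assumptions: tied values receive midranks, the variance uses the standard no-ties formula without a tie correction, and no p-value is computed. The function name is hypothetical, and the paper does not say which implementation was used.

```python
import math


def rank_sum(a, b):
    """Return (W, z): rank sum of sample a and its normal-approximation z-score."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = midrank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    n1, n2 = len(a), len(b)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    return w, (w - mean_w) / sd_w
```

In the setting described above, `a` and `b` would hold the clusteredness values from the correlated and uncorrelated runs for one combination of number of loci, number of clusters, and sample size.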

For each sample size, regressions of clusteredness on individual variables were also performed using all 122,000 runs with the given sample size. Additional regressions were also performed using all 367,220 runs. These regressions used the base-10 logarithm of the number of loci.
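The run and test counts quoted in the text are consistent with a full factorial design, which the following sketch rechecks. The level counts come from the text; two points are inferences rather than stated facts: that 993 loci is the one level with a single set of loci, and that the 210 regressions on geographic dispersion use only the three subsampled sizes, since the full set of 1,048 individuals has just one set of individuals.

```python
# Design levels as stated in the text.
N_LOCI_LEVELS = 7          # numbers of loci
N_CLUSTER_LEVELS = 5       # numbers of clusters
N_SAMPLE_SIZES = 4         # numbers of individuals (including all 1,048)
N_MODELS = 2               # correlated and uncorrelated allele frequencies
REPLICATES = 2             # replicate runs per configuration
LOCI_SETS_PER_LEVEL = 10   # sets of loci, except for the full 993-locus set
INDIVIDUAL_SETS = 100      # sets of individuals per subsampled size

# Inference: six loci levels have ten sets each; the full set has one.
loci_sets = (N_LOCI_LEVELS - 1) * LOCI_SETS_PER_LEVEL + 1
runs_per_individual_set = loci_sets * REPLICATES * N_CLUSTER_LEVELS * N_MODELS
runs_per_subsample_size = runs_per_individual_set * INDIVIDUAL_SETS
total_runs = (N_SAMPLE_SIZES - 1) * runs_per_subsample_size + runs_per_individual_set

tests = {
    "clusters": N_LOCI_LEVELS * N_SAMPLE_SIZES * N_MODELS,
    "loci": N_CLUSTER_LEVELS * N_SAMPLE_SIZES * N_MODELS,
    "sample_size": N_LOCI_LEVELS * N_CLUSTER_LEVELS * N_MODELS,
    # Inference: only the three subsampled sizes vary in dispersion.
    "dispersion": N_LOCI_LEVELS * N_CLUSTER_LEVELS * (N_SAMPLE_SIZES - 1) * N_MODELS,
    "wilcoxon": N_LOCI_LEVELS * N_CLUSTER_LEVELS * N_SAMPLE_SIZES,
}

print(runs_per_subsample_size, total_runs, tests)
```

Under these assumptions the products reproduce the 122,000 runs per sample size, the 367,220 runs overall, and the 56, 40, 70, 210, and 140 separate analyses mentioned above.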

For the comparison of genetic and geographic distance, calculations were performed as in [_{st}_{st}_{ij}_{ij}

We thank J. Long, J. Molitor, C. Roseman, H. Tang, E. Ziv, and an anonymous reviewer for suggestions that have greatly improved the manuscript. This work was supported by National Institutes of Health GM28016 to MWF, by the Stanford Genome Training Program (T32 HG00044 from the National Human Genome Research Institute), by a Burroughs Wellcome Fund Career Award in the Biomedical Sciences to NAR, and by a grant from the University of Southern California. The Mammalian Genotyping Service is supported by the National Heart, Lung, and Blood Institute (HV48141). The data used in this study are a subset of the genotypes available at

HGDP-CEPH, Human Genome Diversity Project–Centre d'Etude du Polymorphisme Humain