Skip to main content
  • Methodology article
  • Open access
  • Published:

Identifying rare variants for quantitative traits in extreme samples of population via Kullback-Leibler distance

Abstract

Background

The rapid development of sequencing technology and simultaneously the availability of large quantities of sequence data has facilitated the identification of rare variant associated with quantitative traits. However, existing statistical methods depend on certain assumptions and thus lacking uniform power. The present study focuses on mapping rare variant associated with quantitative traits.

Results

In the present study, we proposed a two-stage strategy to identify rare variant of quantitative traits using phenotype extreme selection design and Kullback-Leibler distance, where the first stage was association analysis and the second stage was fine mapping. We presented a statistic and a linkage disequilibrium measure for the first stage and the second stage, respectively. Theory analysis and simulation study showed that (1) the power of the proposed statistic for association analysis increased with the stringency of the sample selection and was affected slightly by non-causal variants and opposite effect variants, (2) the statistic here achieved higher power than three commonly used methods, and (3) the linkage disequilibrium measure for fine mapping was independent of the frequencies of non-causal variants and simply dependent on the frequencies of causal variants.

Conclusions

We conclude that the two-stage strategy here can be used effectively to mapping rare variant associated with quantitative traits.

Background

Thanks to the rapid development of sequencing technology and the lowering of sequencing costs in the last decade, the availability of large quantities of sequence data provides an unprecedented opportunity for researchers to investigate the role of rare variants in complex traits [1,2,3,4]. But due to the low minor allele frequency (MAF < 5%) and thus resulting in weak linkage disequilibrium (LD) with nearby markers, detecting rare variant (RV) association with complex traits faces great challenges [5,6,7,8]. One challenge is that detection of rare causal variants with traditional designs usually requires a large sample, which will be the high cost [3, 6]. Thus cost-effective design should be considered to reduce sample size. Another challenge is that the statistical power with test statistics of single-marker tests is generally low in genetic association studies of rare variants with more moderate or weak genetic effects [8,9,10]. To date many statistical methods have been developed for rare variant association analysis, including burden tests [11,12,13], variance-component tests [14, 15], series of sequence kernel association tests [10, 16, 17]. Any of these methods has relative perfect performance in special scenario, but none of them can overwhelm others in all scenarios [8, 9], especially for quantitative traits.

In fact, rare variant association analysis in the past several years mainly focused on the qualitative trait. Only a few statistical methods have been developed for the quantitative trait [13, 18,19,20,21]. One approach for rare variant association analysis of quantitative traits is the linear regression model. However, most regression-based methods rely on the normality assumption of the phenotype [8, 21]. Another commonly used approach adopts phenotype extreme selection design where one can transform the quantitative trait association study into case-control association study of qualitative traits by treating the upper extreme as cases and the lower extreme as controls in a strategy using extreme phenotype [22,23,24,25]. Extreme phenotypes of a quantitative trait are generally considered to be more informative. Moreover, a smaller sample size for extreme-phenotype sampling than that for random sampling is needed to achieve similar power [23, 24].

In this report, we use phenotype extreme selection design and Kullback-Leibler distance (KL-distance) [26] to propose a simple statistic method to identify rare variants for quantitative traits. Two stages strategies are adopted in our analysis where association analysis and fine mapping will be done in the first stage and the second stage, respectively. This method will compare the frequency distributions of rare variant in two extreme phenotypes based on KL-distance. Our method has three features: (1) it has increasing power with the stringency of the sample selection, (2) it is affected slightly by non-causal variants and the opposite effect variants.in the first stage for association analysis, and (3) it is not depend on the frequencies of non-causal variants and just dependent on the frequencies of causal variants in the second stage for fine mapping. Through simulation studies, we investigate the performance of the proposed method and compare it with three commonly used methods of the burden test [12], the sequence kernel association test (SKAT) [17], and the optimal test that combines SKAT and the burden test (SKAT-O) [10].

Results

Type I error rate and power for association analysis

Table 1 exhibits the estimated type I error rates of the statistic TKL for the extreme sample with sample-selection threshold value of 20, 10, and 5% and with sample size of 1000 and 1500. It can be seen that, under various genetic parameters, type I error rates of TKL are not appreciably different from the nominal alpha levels, which indicates the validity of the statistic TKL.

Table 1 Estimated type I error rates of the statistic TKL

Figure 1 shows the results of power for 9 scenarios when sample sizes are 1000 and 1500. It is found that the power of the statistic TKL with the sample size of 1500 is nearly 0.20 larger than that with the sample size of 1000, indicating that the power of the statistic TKL significantly increase with the increasing of the sample size. It can be seen that, under the same sample size, the powers of the statistic TKL with the low 5% samples and the up 5% samples are highest and the powers with the low 20% samples and the up 20% samples are lowest, which indicates that the powers of the statistic TKL increase with the stringency of the sample selection. It is observed from scenarios {1, 2, 3} that, when rare variant effects are in the same direction, the powers of the statistic TKL increase with the increasing of the number of causal variants. The same above conclusions are observed when 80% causal variants have positive effect and 20% causal variants have negative effect (scenarios {4, 5, 6}) and when there is the same number of causal variants with positive effects and negative effects (scenarios {7, 8, 9}). By comparing the powers under scenarios {1, 4, 7} with 10 causal variants, the powers under scenarios {2, 5, 8} with 20 causal variants, and the powers under scenarios {3, 6, 9} with 50 causal variants, we found that, when the number of causal variants with negative effect increases, the power of the statistic TKL decreases slightly. From Fig. 1, we can observe that, among four statistics of the TKL, the burden test, the SKAT, and the SKAT-O, the power of TKL is higher than that of other three statistics. The burden test, the SKAT, and the SKAT-O are severely affected by the number of non-causal variants and the opposite effect variants, especially when there are the same number of opposite effect variants. Although non-causal variants and the opposite effect variants affect the power of the TKL, the impact is slight. For example, when the sample size is 1500 and the number of causal variants is 50 for 10% sample-selection threshold value (B2), as the number of variants with negative effect increases from zero to 25, the powers of the burden test, the SKAT, and the SKAT-O decrease from ~ 0.80, ~ 0.79, and ~ 0.84 to ~ 0.23, ~ 0.63, and ~ 0.74, respectively, with the decline rate of 71.2, 20.2, and 12.0%. Nevertheless, when the number of variants with negative effect is 25, the TKL still achieves ~ 0.83 power, with the decline rate of just 7% comparing to ~ 0.90 when the number of variants with negative effect is zero.

Fig. 1
figure 1

Empirical power of four statistics from the extreme samples with 20% threshold value a, 10% threshold value b, and 5% threshold value c when the sample sizes are 1000 (a1, b1, c1) and 1500 (a2, b2, c2) at a 0.05 significance level

Power for fine mapping

In fine mapping study, the QTL can be located by the maximum value of the measure lKL. So we sample 10 times from each of 100 simulation populations where each sample includes 750 individuals with the up-extreme phenotype of Y > U and 750 individuals with the lower-extreme phenotype with Y < L. For each sample, we calculate the value of the measure lKL for each variant. In order to guard against noisy distributions of the measure lKL, we adopt the 5-point moving-average method to determine the maximum value. We count the number (here, we denote it B) of the maximum values that locate at variant 10 or variant 11. Then the probability that the maximum values of lKL locate at variant 10 or variant 11 is B/1000. We refer this value as the power of lKL, which measure the likelihood of fine mapping the QTL. Table 2 shows the results of the power for lKL. It can be seen that the power of lKL for fine mapping under dominant model is highest and the power of lKL for fine mapping under recessive model is lowest. The power of lKL increases with increasing of the heritability h2 of the causal variant and the stringency of the sample selection. For example, power of lKL under dominant model with the heritability h2 of 0.01 is 0.52, 0.62, and 0.67 at 20, 10, and 5% sample-selection threshold value, respectively; power of lKL under dominant at 5% sample-selection threshold value increase from 0.67 to 0.83 with the heritability h2 of the causal variant increasing from 0.01 to 0.10. We also investigate the effect of different sample sizes (e.g., 2n =1000, 1500, and 2000). As expected, power of lKL increases with the increasing sample size (data not shown). In order to assess the performance of lKL, we compare it with two LD measures l [27] and pexcess [28] with case-control design using extreme samples. Table 2 also lists the powers for l and pexcess. We found that the powers of lKL and l are nearly the same and higher than those of pexcess.

Table 2 The power of the QTL fine mapping for three LD measures by use of five-point moving average

Discussion

In this report, we present a robust approach to identify rare variant of quantitative traits. The proposed approach adopts phenotype extreme selection design and KL-distance method. We use a two-stage strategy in our analysis where the first stage is association analysis and the second stage is fine mapping of QTL if the first stage is positive result. We propose a statistic TKL for association analysis and a LD measure lKL for fine mapping. Simulation studies present the performance of the proposed method. We found that the power of the TKL increases with the stringency of the sample selection and the increasing of the number of causal variants. The TKL here has higher power for association analysis than three existing statistics. Meanwhile, the impact of non-causal variants and the opposite effect variants on the TKL is slight. The LD measure lKL for fine mapping in the second stage has a good feature of not dependence on the frequencies of non-causal variants and just dependence on the frequencies of causal variants. These results show that our method can be used to detect rare variant associated with quantitative traits. At the same time, we found that the proposed method can be easily extended to case-control study by treating cases and controls as samples with upper extreme phenotype and lower extreme phenotype, respectively.

In rare variant association analysis, in order to achieve high statistical power of tests, usually a large sample with high sequencing costs is needed. Thus less costly sequencing design is preferred in rare variant association study. For quantitative traits, extreme phenotypes are generally considered to be more informative because of rare causal variants enriched among them. One can use a smaller sample size for extreme-phenotype sampling to achieve similar power as that for random sampling [23, 24]. Moreover, because extreme phenotypes of quantitative traits relative to human health are of primary clinical significance and thus data set can be obtained easily for subjects with extreme phenotypic values, using extreme phenotype samples in association analysis will make our study useful and practical. Here we use KL-distance to construct the statistics TKL to measure the difference between two probability distributions of rare variants in two extreme populations. Based on the principle that the larger TKL value is, the more dissimilar two probability distributions of rare variants, the statistics TKL can be used as a test statistic to quantify the magnitude of association between the variants and the quantitative trait in the first stage of association analysis. We found that the statistic TKL here for association analysis has higher power than three existing statistics of the burden test, the SKAT, and the SKAT-O. Moreover, whereas increasing the number of non-causal variants and the opposite effect variants result in decreasing severely the powers of the burden test, the SKAT, and the SKAT-O, non-causal variants and the opposite effect variants affect slightly on the TKL. The TKL has relatively stable power with small change range under various parameters set.

In the second stage of fine mapping, lKL is essentially a measure of LD between the variant and the QTL. Although LD between rare variant and QTL maybe weak [24], the maximum value across all rare variants can be usually found to identify the causal variant (QTL). The measure lKL here has a good performance of just dependence on the frequency of the causal variant. In practice, not dependence on the frequency of the non-causal variant can eliminate“noise” and even bias introduced by varying frequencies of non-causal variants. In our early works, we proposed the LD measure l for mapping common variant of the QTL [27]. The performance of the measure l for mapping rare variant is unknown. We found from theory analysis that the two LD measures lKL and l are parallel and have the same performance, that is, both of them can quantify LD between the variant and the QTL and do not depend on frequencies of non-causal variants. The difference between them is that the measure lKL here is based on KL-distance and the measure l is based on entropy theory. Another LD measure for fine mapping is pexcess [28]. The pexcess is originally developed for fine mapping common variant of qualitative trait. We compare the performance of these three LD measures for fine mapping rare variant of quantitative traits using extreme samples. We found from theory analysis and simulation study that lKL is superior to pexcess.

It is noted that, in practice, we do not know how many causal variants there are in the region established through association analysis at first stage. Although we considered a region having only a single causal variant, our method works for the general case with a region consisting of multiple causal variants. In fact, when there is a region linked to a quantitative trait has multiple causal variants, we can detect all causal variants using following steps: (1) lKL is used to mapping a causal variant with the maximum value of lKL; (2) TKL is used to do association analysis for all variants except the causal variant detected in (1). If the association analysis result is positive, then return to (1). All causal variants will be found when the association analysis result is negative. It should be noted that we use the permutation procedure to assess the statistical significance of the statistic TKL for association analysis. Permutation procedure may need more computing time to conduct simulation. But with the development of high-performance computing, computing time may not be a problem in our study. In addition, it can be seen that our method involves only rare variants. A phenotype may affected by common variants or both common variants and rare variants. So our further work will involve extensive field for common variants or both common variants and rare variants.

Conclusions

The statistic TKL is affected slightly by non-causal variants and the opposite effect variants. The power of the TKL for association analysis of rare variants increases with the stringency of the sample selection for quantitative traits. Extreme phenotypes allow TKL to achieve higher power than three commonly used methods. The LD measure lKL for fine mapping is independent of the frequencies of non-causal variants and just dependent on the frequencies of causal variants.

Methods

In this study, all datasets were publically available and no research requiring ethics approval was conducted.

We consider an interesting gene region with k biallelic variants and assume that each variant has a minor allele m with the MAF Pm and a normal allele M with the allele frequency PM (Pm + PM = 1). The variants are indexed by i (i = 1, ..., k). The index i may or may not correspond to the variant orders. Let Xi be minor allele count at ith variant carried by a subject. Assume that there is a quantitative trait Y: Y = β0 + G + ε, where \( G=\underset{i=1}{\overset{k}{\Sigma}}{\beta}_i{X}_i \), β0 is the mean baseline value, and ε is residual due to random environmental effects. Without loss of generality, we assume β0 =0 and ε~N(0, σ2). To simplify our presentation, we use a measure with a superscript “U” to indicate a measure in the upper extreme population that has phenotypic values of the quantitative trait Y > U (U is an upper-threshold value, chosen from the continuous distribution of the study quantitative trait). We also use a measure with a superscript “L” to indicate a measure in the low extreme population that has phenotypic values of the quantitative trait Y < L (L is an low-threshold value, chosen from the continuous distribution of the study quantitative trait). Assume NU and NL subjects are sequenced with k variants in the upper extreme population and in the low extreme population, respectively, which are indexed by j (j = 1,..., NU/ NL). Denote \( {X}_{ij}^{\mathrm{U}} \) and \( {X}_{ij}^{\mathrm{L}} \) as the number of copies “m” for jth subject at ith variant in the upper extreme population and in the low extreme population, respectively. Then the frequencies of Pm and PM at ith variant in the upper extreme population and in the low extreme population, denoted as \( {p}_{mi}^{\mathrm{U}} \), \( {p}_{Mi}^{\mathrm{U}} \), and \( {p}_{mi}^{\mathrm{L}} \), \( {p}_{Mi}^{\mathrm{L}} \), respectively, are estimated as follows:

\( {p}_{mi}^{\mathrm{U}}=\frac{\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}}{2{N}^{\mathrm{U}}} \), \( {p}_{Mi}^{\mathrm{U}}=1-{p}_{mi}^{\mathrm{U}} \), \( {p}_{mi}^{\mathrm{L}}=\frac{\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}}{2{N}^{\mathrm{L}}} \), and \( {p}_{Mi}^{\mathrm{L}}=1-{p}_{mi}^{\mathrm{L}} \).

A statistic test for association analysis in the first stage

In the first stage, we propose a statistic test for association analysis. We define a k-dimensional random vector \( {\tilde{p}}_m={\left({\tilde{p}}_{m1},\cdots, {\tilde{p}}_{mk}\right)}^T \) as the proportion of the minor allele m among all k variants, where \( {\tilde{p}}_{mi}=\frac{\underset{j}{\Sigma}{X}_{ij}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j}{\Sigma}{X}_{ij}} \) and Xij is the number of copies “m” for jth subject at ith variant. In the upper extreme population and in the low extreme population, the k-dimensional random vector of the proportion of the minor allele m are denoted as \( {\tilde{p}}_m^{\mathrm{U}}={\left({\tilde{p}}_{m1}^{\mathrm{U}},\cdots, {\tilde{p}}_{mk}^{\mathrm{U}}\right)}^T \) and \( {\tilde{p}}_m^{\mathrm{L}}={\left({\tilde{p}}_{m1}^{\mathrm{L}},\cdots, {\tilde{p}}_{mk}^{\mathrm{L}}\right)}^T \), respectively, where \( {\tilde{p}}_{mi}^{\mathrm{U}}=\frac{\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}} \) and \( {\tilde{p}}_{mi}^{\mathrm{L}}=\frac{\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}} \) (i = 1, 2, , k). We compare the two probability distributions \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \) using the KL-distance which is defined as in Kullback & Leibler [26], here, we denote it the statistic TKL:

$$ {\mathrm{T}}_{KL}=H\left({\tilde{p}}_m^{\mathrm{U}},{\tilde{p}}_m^{\mathrm{L}}\right)=\frac{1}{2}\left(\underset{i=1}{\overset{k}{\Sigma}}{\tilde{p}}_{mi}^{\mathrm{U}}\cdot \log \frac{{\tilde{p}}_{mi}^{\mathrm{U}}}{{\tilde{p}}_{mi}^{\mathrm{L}}}+\underset{i=1}{\overset{k}{\Sigma}}{\tilde{p}}_{mi}^{\mathrm{L}}\cdot \log \frac{{\tilde{p}}_{mi}^{\mathrm{L}}}{{\tilde{p}}_{mi}^{\mathrm{U}}}\right) $$
(1)

It is easy to find the relationship between TKL and the frequencies of Pm and PM as follows:

$$ {\mathrm{T}}_{\mathrm{KL}}=\frac{1}{2}\left(\underset{i=1}{\overset{k}{\Sigma}}\frac{p_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\cdot \log \left(\frac{p_{mi}^{\mathrm{U}}}{p_{mi}^{\mathrm{L}}}\cdot \frac{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{L}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\right)+\underset{i=1}{\overset{k}{\Sigma}}\frac{p_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\cdot \log \left(\frac{p_{mi}^{\mathrm{L}}}{p_{mi}^{\mathrm{U}}}\cdot \frac{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{L}}}\right)\right) $$
(2)

TKL is the mean between two KL-distances where one is the KL-distance between \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \) and the other is the KL-distance between \( {\tilde{p}}_m^{\mathrm{L}} \) and \( {\tilde{p}}_m^{\mathrm{U}} \). KL-distance provides a non-symmetric measure of how big of the difference between two probability distributions are. The KL-distance is always non-negative and equal to 0 only if two distributions are identical. It can be seen that TKL is a non-negative and symmetric measure of the two probability distributions \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \). So, TKL can be used as a statistic to quantify the magnitude of association between the variants and the quantitative trait: a much larger TKL value will be observed under the alternative hypothesis of association compared to that under the null hypothesis of no association.

A KL-distance index for fine mapping of QTL in the second stage

Assume that a region linked to a quantitative trait has already been established through association analysis at first stage. In order to simplify our presentation, we assume that this region contains only a causal variant with a minor allele a (with frequency pa) and a normal allele A (with frequency pA = 1 − pa), here, we call it the quantitative trait locus (QTL). We consider the quantitative trait Y = GQ + ε, where GQ is the genotypic value at the QTL and ε~N(0, σ2). We hope to fine map this region by calculating the linkage disequilibrium (LD) measure between the QTL and a variant. We still use KL-distance to construct this measure through comparing the probability distributions of allele m and M at a variant in the upper extreme population and in the low extreme population. Following the previous symbols, let Pm and PM be the frequencies of allele m and M at a variant. From Eq. (1), we have

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)=\frac{1}{2}\left({p}_m^{\mathrm{U}}\cdot \log \frac{p_m^{\mathrm{U}}}{p_m^{\mathrm{L}}}+{p}_M^{\mathrm{U}}\cdot \log \frac{p_M^{\mathrm{U}}}{p_M^{\mathrm{L}}}+{p}_m^{\mathrm{L}}\cdot \log \frac{p_m^{\mathrm{L}}}{p_m^{\mathrm{U}}}+{p}_M^{\mathrm{L}}\cdot \log \frac{p_M^{\mathrm{L}}}{p_M^{\mathrm{U}}}\right) $$
(3)

From Appendix, \( H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right) \) can be asymptotically expressed as a function of LD (δam) between the QTL and the variant:

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)\approx \frac{\delta_{ma}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_m\cdot {p}_M} $$
(4)

Assume that there is an initial complete association between the variant allele m and the QTL allele a, at the 0th generation when the allele a is initially introduced into the study population. Let \( {\delta}_{ma}^{(0)} \) be the initial complete LD between the allele a and m at the 0th generation, \( {\delta}_{ma}^{(0)}={p}_M\cdot {p}_a \). After n generations, the LD between the allele m and a is \( {\delta}_{ma}^{(n)}={\left(1-\theta \right)}^n{\delta}_{ma}^{(0)}={\left(1-\theta \right)}^n\cdot {p}_M\cdot {p}_a \) [29], where, θ is the recombination between the QTL and the variant. Then we have

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)\approx \frac{\delta_{am}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_M\cdot {p}_m}=\frac{{\left(1-\theta \right)}^{2n}{p}_a^2\cdot {p}_M^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_M\cdot {p}_m} $$
(5)

Now we define a LD measure, here, we denote it as lKL, as follows:

$$ {l}_{KL}=\frac{p_m}{p_M}H\ \left(\left\{{p_m}^{\mathrm{U}},{p_M}^{\mathrm{U}}\right\},\left\{{p_m}^{\mathrm{L}},{p_M}^{\mathrm{L}}\right\}\right)\approx \frac{1}{2}{\left(1-\theta \right)}^{2n}{p}_a^2\cdot {b}^2 $$
(6)

where b = bU − bL. It can be seen that lKL is a decreasing function of the recombination θ and reaches its maximum at θ = 0. So we can use lKL to find the variant closest to the QTL and thus fine map the QTL. Notice from Eq. (6) that lKL is independent of the frequency of the variant, just only dependent on the frequency of the QTL.

Simulation

Simulation for association analysis

To evaluate the performance of the test statistic TKL, we perform a series of simulation studies. We consider k = 100 variants with MAF values of causal variants determined by a uniform distribution U (0.001, 0.01) and MAF values of non-causal variants determined by a uniform distribution U (0.001, 0.05). The genotype data are simulated similar to those in Wang and Elston [30]. We first generate haplotypes for k variants based on a latent variable Z = (Z1, , Zk) from a multivariate normal distribution with covariance structure cov(Zi, Zj) = 0.4i − j between any two latent components. Then we combine two haplotypes to obtain the genotype value for each individual Xi = (Xi1, , Xik). A phenotype Y under the null hypothesis of no association is generated using the model Y = ε with ε~N(0, 1) (β1 =  = βk = 0). Under the alternative hypothesis of association, we randomly chose s variants as causal variants while other k-s variants as non-causal variants having βj = 0. Here, we let s = 10, 20, 50 in which 10, 20%, or 50% of rare variants were causal. For causal variants under the alternative hypothesis, we set βj = c  log10(pmi) as used in Lee et al. [10], where c is 0.6, 0.3, and 0.2 for different values of s and different direction of the effects of causal variants. We consider 9 scenarios in the simulation study with the parameter values detailed in Table 3. We conduct 1000 simulations for each scenario. In each simulation, we select three extreme sample strategies, the low 20% and the up 20%, the low 10% and the up 10%, and, the low 5% and the up 5%, each of which consists of 2 N individuals including N individuals in an upper sample and N individuals in a lower sample. The statistical significance is assessed by a permutation procedure. We first calculate the value of the data-based statistic TKL for each simulation. Then we permute the “upper sample” and “lower sample” labels with equal probability and recalculate the statistic TKL for 1000 times. The estimated P value is then the proportion of permutation-based statistics that are larger than the data-based statistic in 1000 permutations for each simulation. For a given significance level α, the power/type I error rate is estimated as the proportion of rejecting the null hypothesis when p-value ≤ α in 1000 simulations. In order to compare the performance of the test statistic TKL with the existing methods, we also obtain the power for the burden, SKAT, and SKAT-O tests using case-control design with the same samples as for the test statistic TKL.

Table 3 The parameter values for power study

Simulation for fine mapping

To assess the performance of the LD measure lKL in fine mapping rare causal variants of quantitative traits, we conduct a simulation study using the method similar to those described in our early work [27, 28]. We consider a genetic region that has 21 variants, where only a variant locating at the middle of variant 10 and variant 11 is causal variant (that is, the QTL). The MAF of the causal variant is set to be 0.01(pa =0.01) and the MAFs for 20 other variants are uniformly determined with values ranging from 0.001 to 0.05. Other parameters in simulation include the ratio d/v (here, v and d are the genotypic values for individuals with genotypes aa and Aa, respectively), the thresholds L and U, the heritability (h2) of the causal variant, and the sample size (2 N) [31]. We let the ratio d/v be − 1, 0, and 1 which correspond to recessive, additive, and dominant models, respectively. Once the parameter values are chosen, a population with the effective size of 15,000 is simulated starting from the 0th generation, with an initial complete association between the minor allele a at the causal variant and m at other variants [P(m|a) = 1]. The population then evolved for 50 generations under random mating and genetic drift. A hundred populations are simulated for analyses.

Availability of data and materials

All data generated or analyzed during this study are included in this published article.

Abbreviations

LD:

Linkage disequilibrium

RV:

Rare variant

KL-distance:

Kullback-Leibler distance

SKAT:

Sequence kernel association test

SKAT-O:

The optimal test that combines SKAT and the burden test

QTL:

Quantitative trait locus

HW:

Hardy-Weinberg

References

  1. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11(6):415–25.

    Article  CAS  PubMed  Google Scholar 

  2. Visscher PM, Brown MA, McCarthy MI, Yang J. Five years of GWAS discovery. Am J Hum Genet. 2012;90(1):7–24.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hin-dorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Fu W, O’Connor TD, Jun G, Kang HM, Abecasis G, Leal SM, et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-cod-ing variants. Nature. 2013;493(7431):216–20.

    Article  CAS  PubMed  Google Scholar 

  5. Nelson MR, Wegmann D, Ehm MG, Kessner D, St Jean P, Verzilli C, et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012;337(6090):100–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Bansal V, Libiger O, Torkamani A, Schork NJ. Statistical analysis strategies for association studies involving rare variants. Nat Rev Genet. 2010;11(11):773–85.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Kiezun A, Garimella K, Do R, Stitziel NO, Neale BM, McLaren PJ, et al. Exome sequencing and the genetic basis of complex traits. Nat Genet. 2012;44(6):623–30.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014;95(1):5–23.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Li Z, Li X, Liu Y, Shen J, Chen H, Zhou H, et al. Dynamic scan procedure for detecting rare-variant association regions in whole-genome sequencing studies. Am J Hum Genet. 2019;104(5):802–14.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Lee S, Wu MC, Lin X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics. 2012;13(4):762–75.

    Article  PubMed  PubMed Central  Google Scholar 

  11. Morgenthaler S, Thilly WG. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutat Res. 2007;615(1):28–56.

    Article  CAS  PubMed  Google Scholar 

  12. Li B, Leal SM. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet. 2008;83(3):311–21.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  13. Madsen BE, Browning SR. A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet. 2009;5(2):e1000384.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Pan W. Asymptotic tests of association with multiple SNPs in linkage disequilibrium. Genet Epidemiol. 2009;33(6):497–507.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Neale BM, Rivas MA, Voight BF, Altshuler D, Devlin B, Orho-Melander M, et al. Testing for an unusual distribution of rare variants. PLoS Genet. 2011;7(3):e1001322.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-setanalysis for case-control genome-wide association studies. Am J Hum Genet. 2010;86(6):929–42.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Wu MC, Lee S, Cai T, Li Y, Boehnke MC, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Bacanu SA, Nelson MR, Whittaker JC. Comparison of methods and sampling designs to test for association between rare variants and quantitative traits. Genet Epidemiol. 2011;35(4):226–35.

    PubMed  Google Scholar 

  19. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, Wei LJ, et al. Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet. 2010;86(6):832–8.

    Article  PubMed  PubMed Central  Google Scholar 

  20. Morris AP, Zeggini E. An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol. 2010;34(2):188–93.

    Article  PubMed  Google Scholar 

  21. Luo L, Zhu Y, Xiong M. Quantitative trait locus (QTL) analysis for next-generation sequencing with the functional linear models. J Med Genet. 2012;49(8):513–24.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Guey LT, Kravic J, Melander O, Burtt NP, Laramie JM, Lyssenko V, et al. Power in the phenotypic extremes: a simulation study of power in discovery and replication of rare variants. Genet Epidemiol. 2011;35(4):236–46.

    PubMed  Google Scholar 

  23. Barnett IJ, Lee S, Lin X. Detecting rare variant effects using extreme phenotype sampling in sequencing association studies. Genet Epidemiol. 2013;37(2):142–51.

    Article  PubMed  Google Scholar 

  24. Li D, Lewinger JP, Gauderman WJ, Murcray CE, Conti D. Using extreme phenotype sampling to identify the rare causal variants of quantitative traits in association studies. Genet Epidemiol. 2011;35(8):790–9.

    Article  PubMed  PubMed Central  Google Scholar 

  25. Emond MJ, Louie T, Emerson J, Zhao W, Mathias RA, Knowles MR, et al. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis. Nat Genet. 2012;44(8):886–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  26. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22(1):79–86.

    Article  Google Scholar 

  27. Deng HW, Chen WM, Recker RR. QTL fine mapping by measuring and test for hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of populations. Am J Hum Genet. 2000;66(3):1027–45.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Li YM, Xiang Y, Sun ZQ. An entropy-based measure for QTL mapping using extreme samples of population. Hum Hered. 2008;65(3):121–8.

    Article  PubMed  Google Scholar 

  29. Hartl DL. A primer of population genetics, Sinauer, Sunderland, Massachusetts, 3rd; 1999.

    Google Scholar 

  30. Wang T, Elston RC. Improved power by use of a weighted score test for linkage disequilibrium mapping. Am J Hum Genet. 2007;80(2):353–60.

    Article  CAS  PubMed  Google Scholar 

  31. Lehesjoki AE, Koskiniemi M, Norio R, Tirrito S, Sistonen P, Lander E, et al. Localization of the EPM1 gene for progressive myoclonus epilepsy on chromosome 21: linkage disequilibrium allows high resolution mapping. Hum Mol Genet. 1993;2(8):1229–34.

    Article  CAS  PubMed  Google Scholar 

Download references

Acknowledgments

This work was supported by the Foundation of Hunan Double First-rate Discipline Construction Projects of Bioengineering.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

XY developed the statistical method and wrote the manuscript. XXR developed the statistical method and revised the manuscript. LYM conceived the idea, designed the study, and revised the manuscript. All authors have read and approved the final version of the manuscript.

Corresponding author

Correspondence to Yumei Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Appendix

Firstly, we calculate \( {p}_m^{\mathrm{U}}\cdot \log \frac{p_m^{\mathrm{U}}}{p_m^{\mathrm{L}}} \). Under the assumption of random mating and, thus, Hardy-Weinberg (HW) equilibrium holding in the population, we can get.

\( {p}_m^{\mathrm{L}}= pr\left(m|Y<L\right)= pr\left(m,Y<L\right)/{\varphi}_L \), where φL = pr(Y < L)

$$ {\displaystyle \begin{array}{l} pr\left(m,Y<L\right)= pr\left(m, aa,Y<L\right)+ pr\left(m, aA,s<T\right)+ pr\left(m, AA,Y<L\right)\\ {}={p}_{ma}\cdot {p}_a\cdot pr\left(Y<L| aa\right)+\left({p}_{ma}\cdot {p}_A+{p}_{mA}\cdot {p}_a\right)\cdot pr\left(Y<L| aA\right)+{p}_{mA}\cdot {p}_A\cdot pr\left(Y<L| AA\right)\\ {}={p}_{ma}\cdot {p}_a\cdot {\phi}_{11}+\left({p}_{ma}\cdot {p}_A+{p}_{mA}\cdot {p}_a\right)\cdot {\phi}_{12}+{p}_{mA}\cdot {p}_A\cdot {\phi}_{22}\\ {}={a}_1\cdot {p}_{ma}+{a}_2\cdot {p}_{mA}\end{array}} $$

where ϕ11 = pr(Y < L| aa), ϕ12 = pr(Y < L| aA), ϕ22 = pr(Y < L| AA), a1 = (ϕ11pa + ϕ12pA)/φL, a2 = (ϕ22pA + ϕ12pa)/φL.

Note that.

pma = pmpa + δma,   pmA = pmpA + δmA, δma =  − δmA and  a1pa + a2pA = 1, here δm~ is the measure of LD between variant allele m and the QTL allele ~ and is defined as δ~m = p~m − p~pm, where pm~ is the frequency of haplotype m ~ .

Then,

$$ {\displaystyle \begin{array}{l}{p}_m^L={a}_1\cdot \left({p}_m\cdot {p}_a+{\delta}_{ma}\right)+{a}_2\cdot \left({p}_m\cdot {p}_A+{\delta}_{mA}\right)={p}_m+{\delta}_{ma}\left({a}_1-{a}_2\right)\\ {}={p}_m+{b}_L\cdot {\delta}_{ma}\end{array}} $$
$$ {b}_L={a}_1-{a}_2 $$
(Where)

Similarly, we can get \( {p}_M^L={p}_M+{b}_L\cdot {\delta}_{Ma} \), \( {p}_m^U={p}_m+{b}_U\cdot {\delta}_{ma} \) and \( {p}_M^U={p}_M+{b}_U\cdot {\delta}_{Ma} \), where bU = c1 − c2, c1 = (γ11pa + γ12pA)/φU, c2 = (γ22pA + γ12pa)/φU, φU = pr(Y > U),

γ11 = pr(Y > U| aa), γ12 = pr(Y > U| aA), γ22 = pr(Y > U| AA), c1 = (γ11pa + γ12pA)/φU, c2 = (γ22pA + γ12pa)/φU.

Then

$$ {\displaystyle \begin{array}{l}{p}_m^{\mathrm{U}}\cdot \log \frac{p_m^{\mathrm{U}}}{p_m^{\mathrm{L}}}={p}_m^{\mathrm{U}}\cdot \log {p}_m^{\mathrm{U}}-{p}_m^{\mathrm{U}}\cdot \log {p}_m^{\mathrm{L}}=\left({p}_m+{b}_U\cdot {\delta}_{ma}\right)\cdot \log \frac{p_m\cdot \left(1+{b}_U\cdot {\delta}_{ma}/{p}_m\right)}{p_m\cdot \left(1+{b}_L\cdot {\delta}_{ma}/{p}_m\right)}\\ {}=\left({p}_m+{b}_U\cdot {\delta}_{ma}\right)\cdot \left[\log \left(1+{b}_U\cdot {\delta}_{ma}/{p}_m\right)-\log \left(1+{b}_L\cdot {\delta}_{ma}/{p}_m\right)\right]\\ {}\left[ by\ \log \left(1+x\right)\approx x-{x}^2/2\right]\\ {}\approx \left({p}_m+{b}_U\cdot {\delta}_{ma}\right)\cdot \left[\left(\frac{b_U\cdot {\delta}_{ma}}{p_m}-\frac{{b_U}^2\cdot {\delta_{ma}}^2}{2{p_m}^2}\right)-\left(\frac{b_L\cdot {\delta}_{ma}}{p_m}-\frac{{b_L}^2\cdot {\delta_{ma}}^2}{2{p_m}^2}\right)\right]\\ {}=\left({p}_m+{b}_U\cdot {\delta}_{ma}\right)\cdot \frac{\left({b}_U-{b}_L\right)\cdot {\delta}_{ma}}{p_m}\cdot \left(1-\frac{\left({b}_U+{b}_L\right)\cdot {\delta}_{ma}}{2{p}_m}\right)\end{array}} $$

Similar

$$ {p}_M^{\mathrm{U}}\cdot \log \frac{p_M^{\mathrm{U}}}{p_M^{\mathrm{L}}}\approx \left({p}_M+{b}_U\cdot {\delta}_{Ma}\right)\cdot \frac{\left({b}_U-{b}_L\right)\cdot {\delta}_{Ma}}{p_M}\cdot \left(1-\frac{\left({b}_U+{b}_L\right)\cdot {\delta}_{Ma}}{2{p}_M}\right) $$
$$ {p}_m^{\mathrm{L}}\cdot \log \frac{p_m^{\mathrm{L}}}{p_m^{\mathrm{U}}}\approx \left({p}_m+{b}_L\cdot {\delta}_{ma}\right)\cdot \frac{\left({b}_L-{b}_U\right)\cdot {\delta}_{ma}}{p_m}\cdot \left(1-\frac{\left({b}_L+{b}_U\right)\cdot {\delta}_{ma}}{2{p}_m}\right) $$
$$ {p}_M^{\mathrm{L}}\cdot \log \frac{p_M^{\mathrm{L}}}{p_M^{\mathrm{U}}}\approx \left({p}_M+{b}_L\cdot {\delta}_{Ma}\right)\cdot \frac{\left({b}_L-{b}_U\right)\cdot {\delta}_{Ma}}{p_M}\cdot \left(1-\frac{\left({b}_L+{b}_U\right)\cdot {\delta}_{Ma}}{2{p}_M}\right) $$

Then

$$ {\displaystyle \begin{array}{l}H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)=\frac{1}{2}\left({p}_m^{\mathrm{U}}\cdot \log \frac{p_m^{\mathrm{U}}}{p_m^{\mathrm{L}}}+{p}_M^{\mathrm{U}}\cdot \log \frac{p_M^{\mathrm{U}}}{p_M^{\mathrm{L}}}+{p}_m^{\mathrm{L}}\cdot \log \frac{p_m^{\mathrm{L}}}{p_m^{\mathrm{U}}}+{p}_M^{\mathrm{L}}\cdot \log \frac{p_M^{\mathrm{L}}}{p_M^{\mathrm{U}}}\right)\\ {}\approx \frac{1}{2}\left[\frac{{\left({b}_U-{b}_L\right)}^2\cdot {\delta_{ma}}^2}{p_m}+\frac{{\left({b}_U-{b}_L\right)}^2\cdot {\delta_{Ma}}^2}{p_M}\right]\\ {}=\frac{{\delta_{ma}}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_m\cdot {p}_M}\end{array}} $$

Assume that there is an initial complete association between the variant allele m and the QTL allele a, at the 0th generation when the allele a is initially introduced into the study population. Let \( {\delta}_{ma}^{(0)} \) be the initial complete LD between the allele a and m at the 0th generation, \( {\delta}_{ma}^{(0)}={p}_M\cdot {p}_a \). After n generations, the LD between the allele m and a is \( {\delta}_{ma}^{(n)}={\left(1-\theta \right)}^n{\delta}_{ma}^{(0)}={\left(1-\theta \right)}^n\cdot {p}_M\cdot {p}_a \) [29], where, θ is the recombination between the QTL and the variant. Then we have

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)\approx \frac{{\delta_{ma}}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_m\cdot {p}_M}=\frac{{\left(1-\theta \right)}^{2n}{p}_a^2\cdot {p}_M^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_m\cdot {p}_M} $$

Then, we have

$$ {l}_{KL}=\frac{p_m}{p_M}H\ \left(\left\{{p_m}^{\mathrm{U}},{p_M}^{\mathrm{U}}\right\},\left\{{p_m}^{\mathrm{L}},{p_M}^{\mathrm{L}}\right\}\right)\kern0.5em \approx \frac{1}{2}\ {\left(\ 1-\theta \right)}^{2n}{p}_a^2\cdot {b}^2 $$

where b = bU − bL.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Xiang, Y., Xiang, X. & Li, Y. Identifying rare variants for quantitative traits in extreme samples of population via Kullback-Leibler distance. BMC Genet 21, 130 (2020). https://0-doi-org.brum.beds.ac.uk/10.1186/s12863-020-00951-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/s12863-020-00951-2

Keywords