Correcting confounding biases in genome-wide association studies


Linkage disequilibrium score regression is already being put to good use in drug discovery, by filtering out false positives in genome-wide association studies. What are the limits – and the potential – of its application?

Linkage disequilibrium score regression (LDSC) is a post-processing method for genome-wide association studies (GWAS) that reduces false positives and improves structured drug discovery by correcting confounding biases such as polygenicity.

The challenge

Although a number of diseases, including several types of cancer, are driven by malfunctions in a single gene, most diseases are linked to an intricate interplay of underlying genetic factors – often with more subtle effects. Identifying the genetic variants that correlate most strongly with disease traits is more difficult than finding monogenic relationships. But these variants could also lead to a better understanding of diseases, help us identify drug targets and biomarkers that predict treatment success, and help overcome resistance to treatment.1

By using genome-wide association studies (GWAS) to identify genetic variants associated with disease traits, we could also pinpoint multiple points of attack in the underlying disease pathways, which could be tackled by combination treatments. However, because of the unique nature of genetic data, extracting such information statistically is anything but trivial. That’s why linkage disequilibrium score regression2 (LDSC) is used to correct confounding biases.


GWAS enables the identification of associations between single-nucleotide polymorphisms (SNPs) and a trait of interest. Although associations do not imply causality, they can still form the basis for discovering new drug targets3. However, even a single GWAS often yields many hundreds of significant associations for a given disease context. Some of these will be false positives, due to confounding biases – such as polygenicity, cryptic relatedness or population stratification – which GWAS does not account for. LDSC provides a post-processing method to counteract these confounding biases and thus make GWAS-powered drug target identification more precise.

Sources of confounding

The main sources of confounding are: 

  • polygenicity – where lots of small genetic effects contribute to a single trait, such as predisposition to a disease
  • cryptic relatedness – the presence of close relatives in the data
  • population stratification – the presence of multiple subpopulations with a different ancestral background

While polygenicity is a more general difficulty for GWAS, cryptic relatedness and population stratification could be at least partly accounted for if they were known – but in most cases they are not. 

All these sources of confounding can lead to false positives – associations that are only detected because of the presence of confounding factors such as those described above. That’s why any method that could correct confounding without knowledge of its specific sources would be highly beneficial. LDSC is such a method.

How it works

LDSC builds directly upon GWAS results (association statistics between traits and SNPs) by taking into account linkage disequilibrium (LD) scores, which measure the extent to which an SNP is correlated with other nearby SNPs. As its name suggests, LDSC uses a univariate linear regression with the GWAS association statistics as the target variable and the LD score as the explanatory variable; basically, it asks the question: “How well do SNP correlations explain trait-SNP associations?”
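To make the notion of an LD score concrete, here is a minimal sketch in Python, assuming a genotype matrix of allele counts. The function name `ld_scores` and the fixed SNP-count window are illustrative assumptions rather than part of any published tool; real implementations typically define the window by genetic or physical distance and correct the r² estimate for sampling bias.

```python
import numpy as np


def ld_scores(genotypes, window=50):
    """Sum of squared correlations (r^2) between each SNP and its neighbours.

    genotypes: array of shape (n_individuals, n_snps) with allele counts.
    window:    number of SNPs on either side that count as "nearby"
               (a simplifying assumption made for this sketch).
    """
    g = np.asarray(genotypes, dtype=float)
    n_snps = g.shape[1]
    # Pairwise squared Pearson correlation between SNP columns.
    r2 = np.corrcoef(g, rowvar=False) ** 2
    scores = np.empty(n_snps)
    for j in range(n_snps):
        lo, hi = max(0, j - window), min(n_snps, j + window + 1)
        # The slice includes SNP j itself, which contributes r^2 = 1.
        scores[j] = r2[j, lo:hi].sum()
    return scores


# Toy data: 500 individuals, 20 independent SNPs, then make SNP 1 a copy of SNP 0.
rng = np.random.default_rng(1)
toy_genotypes = rng.binomial(2, 0.3, size=(500, 20)).astype(float)
toy_genotypes[:, 1] = toy_genotypes[:, 0]
toy_scores = ld_scores(toy_genotypes, window=5)
print(toy_scores.round(2))
```

In this toy example, the two perfectly correlated SNPs end up with a higher LD score than the independent ones, because each one "tags" the other.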

The most interesting part of this regression is the intercept. Lee et al.4 show that, in the absence of confounding, it should be close to unity. They also show that any upward departure from unity must be due to confounding. As such, the estimate of the intercept provides a correction for confounding biases – and that is the major use for this method. Beyond this core application, two further uses take advantage of the slope: estimating the heritability of a trait and the genetic correlation between traits.
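The role of the intercept and the slope can be illustrated with a small simulation. The sketch below assumes the model from the LD Score regression paper2, in which the expected chi-square statistic of SNP j is 1 + N·a + (N·h²/M)·ℓ_j, with N the sample size, M the number of SNPs, h² the heritability, a the contribution of confounding and ℓ_j the LD score. All function names and parameter values are hypothetical, and the fit is a plain unweighted least-squares regression, which is a simplification: published implementations use weighted regression and resampling-based standard errors.

```python
import numpy as np


def simulate_chi2(ld, n_samples, n_snps, h2, confounding, rng):
    """Draw chi-square statistics whose mean follows the assumed LDSC model."""
    expected = 1.0 + n_samples * confounding + n_samples * h2 / n_snps * ld
    # A squared normal with variance `expected` has mean `expected`.
    z = rng.normal(0.0, np.sqrt(expected))
    return z ** 2


def ld_score_regression(chi2, ld, n_samples, n_snps):
    """Unweighted fit of chi2 = intercept + slope * ld (simplified sketch)."""
    slope, intercept = np.polyfit(ld, chi2, deg=1)
    h2_estimate = slope * n_snps / n_samples  # rescale slope to heritability
    return intercept, slope, h2_estimate


# Hypothetical study: 50,000 individuals, 1,000,000 SNPs, h2 = 0.4,
# and confounding that adds 0.1 to every expected chi-square statistic.
rng = np.random.default_rng(0)
n_samples, n_snps = 50_000, 1_000_000
ld = rng.gamma(shape=2.0, scale=50.0, size=n_snps)  # made-up LD scores
chi2 = simulate_chi2(ld, n_samples, n_snps, h2=0.4, confounding=2e-6, rng=rng)

intercept, slope, h2_estimate = ld_score_regression(chi2, ld, n_samples, n_snps)
print(f"intercept: {intercept:.3f} (a value above 1 suggests confounding)")
print(f"heritability estimate from slope: {h2_estimate:.3f}")
```

In this simulation, the fitted intercept should land near 1.1 and the heritability estimate near 0.4, which shows how the two quantities are read off separately from the same regression line.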

Could this work in practice?

In fact, GWAS, LDSC and their results are already being used to improve drug discovery.

Confounding-corrected GWAS results have also helped solve challenges in other fields, such as personalizing medicine:

  • Personalized medicine is aided by the identification of new targets or target combinations, and of biomarkers that predict treatment success. For cancer patients, personalized treatment combinations might help to prevent or overcome resistance, which currently occurs when only a single disease target, such as an overactive kinase, is attacked.
  • The treatment of chronic hepatitis C infection could also be improved. Usually, patients undergo a 48-week program of medication that can have substantial side effects. In 2009, Ge et al.1 identified a genetic variant that could be used to predict the likelihood of success for this treatment.

Open questions

Although LDSC helps to refine GWAS results so that they are more useful for identifying potential relationships between genetic variants and a given phenotype, the general GWAS approach still faces some challenges:

  • Pleiotropy. How should we deal with cases where one genetic variant affects multiple traits?
  • Small effect sizes. In general, each association found between a genetic variant and a phenotype has a small effect size. That necessitates large datasets, even in the era of big data. Because data is distributed around the world, this is commonly achieved by forming a consortium, in which researchers perform the same GWAS on their own data and report the results to the project leads. This requires additional organizational effort to formulate the analysis plan, designate teams and ensure quality control of the submitted results.
  • Lack of completeness. The X chromosome is commonly omitted from analysis, and yet it holds 5% of the human genome and codes for proteins that are important for both men and women. (Thus “genome-wide” analysis is something of an overstatement.)
  • No causal relationships. Even if we correct for confounding factors using LDSC, we can only identify correlations. This has formed the basis of much criticism – and yet GWAS is a method for exploring and discovering potentially causal variants. GWAS results give biologists a foundation for wet lab research that otherwise might not exist, and that research is the first of many steps towards clinical testing of a target.
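The point about small effect sizes in the list above can be made tangible with a rough power calculation. The sketch below is an approximation under stated assumptions, not a method from the cited papers: it treats the association test as a two-sided z-test, uses the conventional genome-wide significance threshold of 5e-8 (not mentioned in this article), and takes the expected z-score to be the square root of sample size times the fraction of trait variance explained by the variant. The function name `gwas_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def gwas_power(n_samples, variance_explained, alpha=5e-8):
    """Approximate power to detect one variant at significance level alpha.

    variance_explained: fraction of trait variance attributable to the variant.
    alpha:              two-sided threshold (5e-8 is an assumed convention).
    """
    norm = NormalDist()
    z_crit = -norm.inv_cdf(alpha / 2)          # critical |z| for the threshold
    shift = sqrt(n_samples * variance_explained)  # approximate expected z-score
    # Probability that the observed z falls beyond either critical value.
    return (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)


# A variant explaining 0.1% of trait variance, at increasing sample sizes.
for n in (10_000, 30_000, 100_000):
    print(f"N = {n:>7,}: power ~ {gwas_power(n, 0.001):.3f}")
```

Under these assumptions, power is very low at ten thousand samples and close to certain at a hundred thousand, which is one way to see why pooling data across a consortium is so common.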

The fact that new molecular entities (NMEs) with genetic support are twice as likely to be approved – together with the list of drugs that have been discovered as a result of GWAS – justifies the immense effort required and should encourage the use of GWAS. LDSC helps make this process simpler by increasing the precision of significant GWAS associations – thereby saving valuable research time and resources – and has, as such, proved its worth.


1. Ge et al., Genetic variation in IL28B predicts hepatitis C treatment-induced viral clearance

2. Bulik-Sullivan et al., LD Score regression distinguishes confounding from polygenicity in genome-wide association studies

3. Tam et al., Benefits and limitations of genome-wide association studies

4. Lee et al., The accuracy of LD Score regression as an estimator of confounding and genetic correlations in genome-wide association studies

5. King et al., Are drug targets with genetic support twice as likely to be approved? Revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval

6. Visscher et al., 10 Years of GWAS Discovery: Biology, Function, and Translation

7. Gns et al., An update on Drug Repurposing: Re-written saga of the drug’s fate