Connecting Data across Biomedical Databases
Open data in biomedicine provides a wealth of information scattered across a variety of databases. Nucleotide sequences, protein structures, and clinical data, for instance, all reside in their different silos. Since knowledge discovery necessitates the integration of several aspects of the life sciences, linking records across resources is essential. That is why cross-references play a decisive role in the setup of a data analysis project.
PubMed is arguably the best-known resource for accessing biomedical literature. First released in 1996, this website gives access to the abstracts of scientific articles referenced in MEDLINE. MEDLINE is a bibliographic database that offers a good example of useful connections between biomedical resources: its records are indexed with the MeSH (Medical Subject Headings) taxonomy to make search queries more effective. MeSH was introduced as early as 1960 and was originally designed for library computerisation. It still plays a powerful role in information retrieval today, as it maps varied terminologies to unique medical concepts.
OMIM is one of many databases displaying links to multiple external sources on its web interface, such as Ensembl, UniProt, and KEGG PATHWAY, to name just a few. In OMIM, record linkage is based on manual expert review, which leads to high-quality results but is a time-consuming and expensive process. In contrast, automated matching methods are faster and cheaper. They can be categorised into two classes depending on the nature of the data to be integrated.
In cases where records can be identified unambiguously across datasets, links can be generated based on rules. This process is called deterministic matching, or joining. Here, the main challenge lies in pre-processing the data so that its quality is sufficient for high-performance record linkage, because databases frequently contain inconsistencies and duplicates.
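As a minimal sketch of deterministic matching, the Python snippet below normalises a shared key, drops duplicates, and joins two small record lists on exact key equality. The function names, field names, and identifiers are all made up for illustration and do not come from any real database.

```python
def normalise_id(raw: str) -> str:
    """Canonicalise an identifier: trim whitespace, upper-case, drop separators."""
    return raw.strip().upper().replace("-", "").replace(" ", "")


def deterministic_join(left, right, key):
    """Join two lists of dict records on a normalised key (exact match only)."""
    index = {}
    for rec in right:
        # setdefault keeps the first record per key, so later duplicates are ignored.
        index.setdefault(normalise_id(rec[key]), rec)

    joined, seen = [], set()
    for rec in left:
        k = normalise_id(rec[key])
        if k in index and k not in seen:
            seen.add(k)
            # Merge both records and store the normalised key.
            joined.append({**rec, **index[k], key: k})
    return joined


# Hypothetical records with inconsistent formatting and one duplicate.
genes = [
    {"gene_id": " ensg-001 ", "symbol": "GENE_A"},
    {"gene_id": "ENSG001", "symbol": "GENE_A"},  # duplicate of the first record
    {"gene_id": "ENSG002", "symbol": "GENE_B"},
]
proteins = [
    {"gene_id": "ensg001", "protein": "PROT_A"},
    {"gene_id": "ENSG003", "protein": "PROT_C"},
]

links = deterministic_join(genes, proteins, "gene_id")
print(links)
```

Only the record whose cleaned key appears in both lists is linked. Without the normalisation step, the three spellings of the same identifier would not match at all, which is why pre-processing tends to dominate the effort.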
A prime example of deterministic record linking is the International Clinical Trials Registry Platform (ICTRP). It is a central database that gathers information from several national and regional registries approved by the WHO, including ClinicalTrials.gov and the EU Clinical Trials Register. Since trials are often run in multiple countries, the ICTRP strives to unambiguously identify clinical studies registered in various locations. To do so, it bridges records across registries by matching main and secondary trial identifiers.
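The idea of bridging records through main and secondary identifiers can be sketched as follows. This is not the ICTRP's actual procedure, only an illustration under simple assumptions: each record carries one main identifier and a list of secondary identifiers, and two records describe the same trial whenever they share any identifier, directly or through a chain. The record layout and all identifiers are hypothetical.

```python
from collections import defaultdict


def group_trials(records):
    """Group registry records that share any main or secondary identifier."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for rec in records:
        find(rec["main_id"])  # register trials that have no secondary ids
        for sec in rec["secondary_ids"]:
            union(rec["main_id"], sec)

    groups = defaultdict(set)
    for rec in records:
        groups[find(rec["main_id"])].add(rec["main_id"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical registry entries; the identifiers are invented.
records = [
    {"main_id": "REG-A-0001", "secondary_ids": ["REG-B-0042"]},
    {"main_id": "REG-B-0042", "secondary_ids": []},
    {"main_id": "REG-C-0007", "secondary_ids": ["SPONSOR-77"]},
    {"main_id": "REG-D-0100", "secondary_ids": ["SPONSOR-77"]},
    {"main_id": "REG-E-0005", "secondary_ids": []},
]

trial_groups = group_trials(records)
print(trial_groups)
```

In this sketch the first two records are linked because one cites the other's main identifier, and the next two because they share a sponsor-assigned secondary identifier. The last record stays on its own.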
When links cannot be based on unique identifiers, they have to be inferred from the content. This is called fuzzy matching. The approach consists of algorithmically estimating the probability that two records refer to the same object by assigning weights to the different content-based indicators. Machine learning algorithms developed for this task include naïve Bayes classifiers and artificial neural networks. In addition, pairs for which the link is highly uncertain are often reviewed by human experts.
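To make the weighting idea concrete, here is a minimal sketch in the spirit of the classical Fellegi-Sunter model of probabilistic record linkage, which is one possible way to implement such weights rather than the only one. Each field contributes a log-likelihood ratio depending on whether it agrees, and two thresholds split pairs into matches, non-matches, and cases for human review. The field names, the m and u probabilities, and the thresholds are assumed values chosen for illustration. In practice they would be estimated from data.

```python
import math

# Assumed, illustrative probabilities per field:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "title": (0.90, 0.05),
    "sponsor": (0.85, 0.20),
    "year": (0.95, 0.30),
}


def match_score(a, b):
    """Sum per-field log2 likelihood ratios for a pair of records."""
    score = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if a[field].strip().lower() == b[field].strip().lower():
            score += math.log2(m / u)  # agreement adds evidence for a match
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score


def classify(score, lower=0.0, upper=4.0):
    """Map a score to a decision; the thresholds are arbitrary assumptions."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"


# Hypothetical records.
rec_a = {"title": "Drug X in Asthma", "sponsor": "Acme Pharma", "year": "2015"}
rec_b = {"title": "drug x in asthma ", "sponsor": "ACME Pharma", "year": "2015"}
rec_c = {"title": "Drug X for Asthma Patients", "sponsor": "Acme Pharma", "year": "2015"}
rec_d = {"title": "Vaccine Y Safety", "sponsor": "Other Org", "year": "2009"}

for other in (rec_b, rec_c, rec_d):
    print(classify(match_score(rec_a, other)))
```

Exact comparison after lower-casing is a deliberate simplification. A fuller implementation would typically use string similarity measures per field, so that near-identical titles count as partial agreement instead of outright disagreement.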
In any case, the choice between deterministic and fuzzy matching depends on the requirements of the analysis. Since each option has its advantages and drawbacks, it is critical to understand the nature of the data to be matched and the level of accuracy required. Although technically much more difficult, fuzzy matching can extract valuable pieces of information from unstructured data.
Meaningful data integration is the foundation of actionable knowledge discovery. It can boost pharma R&D by optimising clinical trial design, identifying therapeutic targets, and supporting portfolio management. The challenge is to purposefully connect heterogeneous data spread across sources.
The biomedical literature is a major resource to consider, as the myriad recent publications hold valuable information on potential scientific breakthroughs. To detect those nuggets, it is essential to link the data found in the literature to other resources, which is done using text mining. This process makes it possible to extract information by turning unstructured text into structured data, paving the way for useful cross-references. At the interface of biomedicine and data science, text mining helps discover knowledge from research pursued at the forefront of science.
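A very small example of turning unstructured text into a structured cross-reference is pulling trial registration numbers out of an abstract. The sketch below relies on the fact that ClinicalTrials.gov identifiers consist of the prefix NCT followed by eight digits. The function name and the sample abstract are invented for illustration. Real text mining pipelines go well beyond such rule-based patterns, for instance with named entity recognition for genes, diseases, and drugs.

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by exactly eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def extract_trial_ids(text):
    """Return the distinct NCT identifiers in order of first appearance."""
    # dict.fromkeys removes repeats while preserving insertion order.
    return list(dict.fromkeys(NCT_PATTERN.findall(text)))


# Invented abstract text with placeholder identifiers.
abstract = (
    "The trial was registered as NCT00000001. A follow-up study "
    "(NCT00000002) extended the observation period. Results of "
    "NCT00000001 were consistent across sites."
)

trial_ids = extract_trial_ids(abstract)
print(trial_ids)
```

Each extracted identifier can then serve as a key for deterministic matching against a trial registry, which is how a mention buried in free text becomes a usable link.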