Mundane, but magical: the power of database matching


With its reputation for being dull and straightforward, database matching is hardly the poster child of AI. The truth, however, is that knotty database-matching problems can reveal almost limitless opportunities 

In recent years AI methods have been making headlines in life science research. Alpha Fold, the algorithm designed by Google Deep Mind that promises advancements in one of biologies central problems – protein folding – is a prime example.

However, deep learning models are not the only tools at our disposal for generating new insights.

Very often, value can already be created by simply connecting siloed data sources. This is especially true of life science, which has a super-rich universe of different vital entities – such as disease conditions, drugs, companies – but no single focus, like life itself. 

Connections for context 

One new link can yield a wealth of valuable new research areas. Connecting data of protein-protein interactions with tissue expression of proteins, for example, gives us a better idea of the context in which an unknown protein might function, especially if we combine it with  canonical pathway data.

Another example would be to combine data on clinical trials with funding data of biotech companies to identify ‘hot’ fields in pharma. Funding data from the likes of Pitchbook, Crunchbase or Bloomberg linked with biomedical data from Integrity,, GlobalData or AdisInsight might be usefully put to use by both venture backers and the biotech companies themselves.

Perhaps unsurprisingly, the data that is hardest to match is the least exploited. Matching funding data to company data is particularly challenging, because for private companies there are no unique ids – whereas matching it to stock tickers, which are unique, is easier.

Clinical data, meanwhile, tends to be incomplete and is mostly accessible in a quantitative fashion (e.g. in which clinical phase per indication is an asset, have there been discontinued trials, has data been published); the quality of clinical data (how good is it in comparison to other contestants in the field) is much more difficult to assess.

For anyone who can devise a way of linking these two datasets, the opportunity is huge. 

Where clinical meets commercial 

When it comes to linking clinical trials data with commercial data, the entity common to both data sets is the company. However, even this is not straightforward: any irrelevant entities (such as companies from other sectors) need to be excluded. Furthermore the granularity of the company names can vary between two datasets – such as the holding company being used in dataset A and a specific branch of the company being used in dataset B.

First, filter

Comparing all companies of dataset A with all companies of dataset B can be infeasible – the combinatorics explode – so our first priority should be to filter out irrelevant entities. Some non-commercial entities, such as research institutions, may be marked by specific terms in the name, such as ‘hospital’. Removing these from the dataset can drastically reduce the combinatorics and thus the computational complexity. Combinatorics can be further reduced by a set of tools that identify likely matches and sort out all the rest (based on a rough assessment of similarities).

Next, take care of the obvious

Next, we can fish out the obvious matches – those in which a pair share a feature that confirms they are identical. This is more problematic than it might sound: when it comes to funding data, for example, where references are to a legal company name they can be considered an obvious match. Patent numbers are similarly objective. However, matching people by their surname – or even their first and last name – is not infallible. The baseline of absolute certainty needs to be established before we can decide how much “fuzziness” to build in.  

…and the less obvious

Next, we should fish out the less obvious matches, for which the differences are so small as to suggest a match. The subjective nature of exactly where our cut-off points should be – where we draw the line between a pair being considered a match or not – as well as the range of methods that can be employed, makes this an interesting area of research.  

Algorithms such as those found in the Python Record Linkage Library, suggest matches based on  word similarity – measured via Jaro-Winkler distance or longest common substring, for example.

These algorithms are broadly split into two camps: supervised and unsupervised. For the former, you need “correctly matched samples” – a lot of them. This is a rare luxury in entity matching, so supervised algorithms are rarely used in this area. For the latter, no correctly matched samples are required – and so unsupervised algorithms, such as the Expectation/Conditional Maximization classifier, are often the most practicable solution, although the performance of such algorithms can be tricky to evaluate.      

Multiple choices

Lastly, we need to decide how to treat multiple matches; when one company in A is matched to 2 or more companies in B. There are three options here: combine, keep or delete. Most preferable, where possible, is to combine – to aggregate the information in B into one entity. This makes sense if the entities in B are different branches of the same company; we can combine all their divisions, then match that with the funding information given for the company as a whole.

Keeping multiple matches is more problematic. You may choose to keep multiple matches in situations where there can be more than one answer. In a situation where you’re looking to link university research funding to research output, for example, one researcher might be linked to more than one university (and, therefore, be named on papers from more than one university). The problem with situations such as these, in which multiple matches are retained, is that they introduce ambiguity.

Lastly, in a situation in which the information in B is less important than A, you may choose to keep one match and delete the others from B.

What do we get?

By filtering out before matching identical names we omit that a webhosting company agplus is matched with the diagnostic company agplus. A fuzzy matching algorithm enables us to link Moonbiotech with moon biotech co ltd. Combining matches can ensure all products of the different Pfizer entities, such as Primary Care, Specialty Care and Oncology, are linked to Pfizer financing.

By linking the commercial data with biomedical data we can monitor commercial interest in specific pharmaceutical areas. But this does not have to be the end of the story; more dimensions can be added  depending on the angle we would like to shine a light on – we just need the right data sources.

If, for example, we include academic  research output we may be able to identify key opinion leaders in interesting fields. This could be relevant for companies looking to acquire talents in trending promising research areas. Adding stock exchange data to this process would create a tool that could highlight  investment opportunities in biotech companies – useful not just for venture capitalists and other funds looking to invest, but also, in turn, for the target companies themselves. 

Useful resources

  • A useful and established Python library for entity matching is record linkage. It includes different indexing techniques to reduces the number of possible matches. A variety of classifiers (supervised and unsupervised) are provided and well documented
  • An unsupervised method not requiring correct examples is Expectation/Conditional Maximization classifier. It’s a classic method for the task of linking entities
  • How do I reduce the number of possible pairs that I have to evaluate? Many clever methods are described here: Smart indexing to reduce combinatorics

Mona Schirmer

Student Assistant, Data Science


Potsdamer Straße 68
10785 Berlin