Data Science & Ancestry
History is exciting, especially when we can relate to it. If our ancestors were involved in a historic event, we are much more likely to attach importance to it and to want to know the details well beyond storytelling. Who were our ancestors? What traces did they leave in history? How far back can we trace their bloodlines? Many people share a passion for ancestry: they travel to archives in different countries and piece together information from different documents to gain new insights about their roots. The internet provides those efforts with new and powerful research tools. Not only are historic documents being digitized, but online networks of people with a passion for ancestry, e.g. ancestry.com, allow for new ways to gather insights about one’s family tree. Interestingly, it is data science that powers this conquest of family history.
The question of name distinguishability throughout the centuries
Anyone who has been involved in genealogy and has tried to track down distant relatives from the 16th century is well aware of one of the key challenges: surnames change over time. The current surname is most often just a derivative of earlier versions, phonetically adjusted over time. Some names changed completely; others evolved gradually over the centuries along with the rules of orthography. Connecting the dots when combing through dusty documents in church and municipal archives is already a challenging endeavour, and automating this process at scale is harder still.
There are fairly well-functioning machine learning algorithms that account for potential phonetic transformations and spelling mistakes and allow first names and surnames to be clustered. Nevertheless, the digitization of personal research endeavours, and thus of the respective family trees, allows for further fine-tuning of those algorithmic libraries. For one, information about actual historic name transformations now feeds the algorithms, not just assumptions about potential phonetic variations. These adaptations improve performance, which in turn enables genealogy research platforms to offer their users comprehensive searches.
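To make the idea of phonetic grouping concrete, here is a minimal sketch in Python. It uses the classic Soundex code, a simple rule-based scheme rather than a learned model, and it is not a description of what any genealogy platform actually runs. The function names `soundex` and `cluster_by_sound` are chosen for illustration only.

```python
from collections import defaultdict

# Soundex digit for each consonant group; vowels and other letters get no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (encoded + "000")[:4]


def cluster_by_sound(names):
    """Group spelling variants that share the same Soundex code."""
    clusters = defaultdict(list)
    for name in names:
        clusters[soundex(name)].append(name)
    return dict(clusters)


if __name__ == "__main__":
    print(cluster_by_sound(["Meyer", "Meier", "Smith", "Smyth", "Schmidt"]))
```

A fixed rule set like this only captures assumed phonetic variations. The advantage of digitized family trees described above is that observed historic name changes can be used to refine or replace such rules.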
Genealogy Search Algorithms – helping to find the relevant “John Smith”
At the same time, searching for a name, unless it is extremely rare, often does not narrow the results down to a subset that can be reviewed manually. With millions of documents available for reference, however, a lot of contextual information about each individual is available. Certain individuals might, for example, automatically be tagged with geographic locations. Sophisticated search algorithms make historic data accessible to the time-constrained researcher and make the trip to a distant library unnecessary. Ancestry.com claims that it has digitized more than 200 billion documents (pictures, bureaucratic files, etc.), both through its users, who upload them, and through access to archives. Powerful backend algorithms are necessary to make sense of this overflow of history and to surface precisely the information the user is searching for. In addition, when users build their family tree online, the platform can automatically give hints and recommendations to fill in blank spots in the different branches. With a large and active user base, machine learning is used to power the record linking that forms the engine behind these “recommendations” and family history discoveries.
Does this take away the mystery of genealogy, the romanticized time spent in archives? Not really, but it offers a more cost-effective way to acquire information about family history and to find precisely that one “John Smith”. Those who desire physical journeys and the ups and downs of the information hunt are certainly free to continue them.
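How context can single out the right “John Smith” may be easier to see in a small sketch. The Python example below scores candidate records against a query by combining name similarity with birth year and place. The `Record` fields, the weights, the tolerance and the threshold are all assumptions made for illustration; a real platform would more plausibly learn such parameters from data, and nothing here reflects ancestry.com's actual implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical minimal record; real archival records carry far more context.
    name: str
    birth_year: int
    place: str


def match_score(a, b, year_tolerance=5):
    """Score in [0, 1]; the weights 0.5 / 0.3 / 0.2 are illustrative assumptions."""
    name_sim = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    year_gap = abs(a.birth_year - b.birth_year)
    year_sim = max(0.0, 1.0 - year_gap / year_tolerance)
    place_sim = 1.0 if a.place.lower() == b.place.lower() else 0.0
    return 0.5 * name_sim + 0.3 * year_sim + 0.2 * place_sim


def best_matches(query, candidates, threshold=0.8):
    """Return (score, record) pairs at or above the threshold, best first."""
    scored = [(match_score(query, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    query = Record("John Smith", 1850, "Boston")
    archive = [
        Record("John Smith", 1851, "Boston"),
        Record("John Smith", 1790, "London"),
    ]
    for score, record in best_matches(query, archive):
        print(round(score, 2), record)
```

In this sketch an identical name alone is not enough to pass the threshold: the second record shares the name but neither the period nor the place, so it is filtered out.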
Using all that data to create a better understanding of history
While there are many obvious advantages to having a central platform for managing family tree research, it must also be mentioned that users might err when composing their family tree, whether intentionally or not. However, modern technology should be sophisticated enough to account for these flaws.
The potential is simply too attractive to ignore: online genealogy platforms could contribute to a fully personalized immersion in history. Registered users could, for example, get location-based information about their family members’ historic presence at a given place. Linked with relevant documents (personal and general), this would function almost like a portable, location-based museum. With increasingly sophisticated image processing, it could even be sufficient to take a picture of a location to get contextual information.
The organization of family history is fascinating, as it is composed of so many intersecting family trees. Attaching and providing information at the relevant nodes with data science is a key challenge. Solving it not only benefits users on their specific research quests, but also forms the backbone of a more personalized experience of history. To add some craziness, ancestry.com has for several years been trying out a DNA feature, which could allow for a more fine-grained analysis of potential connections between users.