Summer Book Recommendation: “Everybody Lies”
“Everybody Lies” is the harsh title of Seth Stephens-Davidowitz’s new book. While it offers no feasible recipe for preventing people from lying, the book helps the reader in one essential respect: grasping and conceptualizing the power of data and data science. Its key strength is that it does so in a very engaging and accessible way.
Perceived truth and objective truth are often tough to reconcile. Whoever provides the stronger narrative usually has a fair shot at branding their perceived truth as the unbiased point of reference. Donald Trump, US president and Twitter’s darling, who likes to label unfavorable news articles “fake news”, embodies this like no other. For anyone really interested in objective truth, though, data is the key ingredient and data science an essential toolbox.
Stephens-Davidowitz, who worked as a data scientist at Google and has since compiled and worked through hundreds of datasets, is uniquely positioned to elaborate on these issues. Not surprisingly, he uses Google Search data to shed light on various phenomena. His hypothesis: any poll, survey or dataset compiled with standard methods is far inferior to Google Search data. Only through search data (terms entered into a search engine in presumed privacy) can we understand the real issues that concern humanity and even shed new light on the human mind. In surveys, by contrast, “everybody lies”.
One catchy example to illustrate his point: Stephens-Davidowitz found that comparing searches for “Trump-Clinton” with searches for “Clinton-Trump” could explain the election outcome in swing states that were perceived to be “too close to call” before election night. Undecided voters, despite being unable to express a preference in polls, indicated it by placing the name of their preferred candidate first. There are many more such examples, ranging from horse racing and sexuality to general human concerns and food preferences. These examples, the backbone of the book, make it a light summer read with plenty of dinner party takeaways. But beyond that, Stephens-Davidowitz manages to get across two essential points.
Ask the right questions: big data alone will not solve your problems
Stephens-Davidowitz rightly refuses to define the term “big data” in his book. These days, “big data” is more buzzword than helpful framework. What is undisputed, though: there is a lot of data, there are platforms with an abundance of data about our lives (Google, Facebook, etc.), and the volume of data is likely to increase in the coming years. As this dynamic accelerates, we can observe a shift in attention from mere data collection to more value-adding activities, primarily data science as a tool to foster insights. After all, data, no matter the volume, is useless if one does not know how to ask the right questions.
Narrowing large questions down to actionable sets of smaller questions, and then identifying suitable data sources, is a tough mindset to adopt, because the term “big data” suggests that the more data we accumulate, the more insights will automatically follow. This, however, is a common misconception. Stephens-Davidowitz picks his examples so as to convincingly convey this very basic but powerful point: asking the right questions of an existing set of data sources is the crucial groundwork for generating insights and knowledge.
Value Creation through Discovery and Creation of New Data Sources
While regaling the reader with many humorous stories, among them one about the preference of Indian men for being breastfed by their partners, Stephens-Davidowitz also sharpens his readers’ understanding of what counts as a data source. Data does not only come in standardized formats (sensor data, machine performance data, purchasing data, etc.); pictures and text also count as data and might open up even larger avenues for value creation.
More than ever, though, tackling a given set of questions is about intelligently combining and blending data sources. Stephens-Davidowitz mentions, for example, how satellite images helped to understand GDP movements in African countries, which often struggle to produce valid estimates of their GDP. Going even further, truly unique insights (which could also form a competitive edge) often require the creation of completely new data sources.
This starts at the basics, with putting in human effort to label an unstructured and fuzzy dataset, which is required to train algorithms. But it also extends to a continuous thought process about what additional data could be generated to help answer the questions at hand. In “Everybody Lies”, Stephens-Davidowitz describes how an “algorithm” to determine the quality of racehorses (as in: their likelihood of winning big races) was developed by explicitly going beyond existing data sources. Eventually, the size of the horse’s organs (in combination with some other factors) turned out to be the best predictor of the horse’s future success. Going the extra mile to generate this additional data (no one had ever systematically recorded the organ size of horses) is a cumbersome process, but it might be the “magic sauce” when traditional data sources appear to have little explanatory value.
Thus, if you are looking for a light and entertaining summer read along the lines of “Freakonomics” or “Think Like a Freak” by Stephen Dubner and Steven Levitt, “Everybody Lies” is highly recommended. And best of all: you’ll probably learn a good deal about data science as well.
The book: “Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are”, by Seth Stephens-Davidowitz.