A Question of Data Quality
80% of the work in data science projects is dedicated to data quality assessment, data preparation and integration. Applying and tweaking the algorithms and improving the performance of models (basically all the fun stuff) covers only the remaining 20%. What’s the reason for this?
When one turns to common data science challenges, such as those posed on kaggle.com (a company recently acquired by Google), one could easily get a wrong impression of the art of data science. Data sets are prepared, fully labelled and accessible in an easily understood structure. Accompanied by a well-crafted overall project goal, the data science teams (or individuals) are ready to “do the magic” without further ado.
However, these competitions hardly resemble the majority of data science projects. Certainly, at times, companies without an in-house data science team, but with a superb data warehouse and storage culture, approach external service providers for analytical capabilities. These projects are then, indeed, fairly straightforward and their upside potential fairly predictable.
Most projects, however, are driven by lofty, ambitious visions and first need to be reviewed from a “data angle”. An essential activity during this phase of any data science project is the identification of data sources that could enhance the understanding of a certain phenomenon (the target variable). This process – as easy as it sounds – can often be lengthy, as data sources and access rights are scattered across the organization. In addition to internal data sources, external data sources (which oftentimes come at a cost) need to be scouted.
Depending on the project, a shortage of adequate data sources might also lead to a revision of the project goal, or to internal efforts to improve data collection and storage.
While it might be tempting to include “as much data as possible”, time and budget constraints render an upfront assessment necessary. Which data sources are essential to the success of the project?
The answer to this question is primarily a function of “relevance of information” and “data quality”. The relevance and insight a data source might provide is certainly project-specific, but if one wants to predict the daily sales of a local ice cream shop based on weather data, data on daily sales and the respective weather at the location in question is certainly essential.
The assessment of data quality, however, is even more crucial – and far more difficult, especially if the data is not taken from well-documented APIs. Even large corporations, which have been continuously exposed to data-related challenges over the last decades, still struggle with data quality. It is the necessity of nitty-gritty work and the absence of standardized off-the-shelf solutions to ensure data quality that often hamper the integration of additional data sources.
The quality of the data input naturally bounds the quality of data science results – and drawing the wrong conclusions from low-quality data can have a disastrous business impact. When discovering shortcomings in a certain data source, the options on the table are fixing them (oftentimes with partly manual work) or cleaning the data set to remove the affected entries. The latter might hamper model quality, the former might cut into the available budget. Striking the balance is crucial.
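In practice, such an assessment often starts with a handful of simple checks – completeness, uniqueness, validity. A minimal sketch in pandas, using a hypothetical daily-sales table (column names and values are illustrative, not from the text):

```python
import pandas as pd

# Hypothetical daily-sales data exhibiting typical quality issues
sales = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-02", "2021-06-02", "2021-06-04", None],
    "units_sold": [120, 95, 95, -3, 80],  # -3 units sold is implausible
})

# 1. Completeness: missing values per column
missing = sales.isna().sum()

# 2. Uniqueness: fully duplicated records
duplicates = sales.duplicated().sum()

# 3. Validity: values outside a plausible range
invalid = sales[sales["units_sold"] < 0]

# The "cleaning" option from above: drop problematic rows entirely.
# Cheap, but every dropped row is lost signal for the model.
clean = (sales.dropna()
              .drop_duplicates())
clean = clean[clean["units_sold"] >= 0]
```

The alternative – fixing the flawed entries, e.g. by checking the original receipts – keeps the rows but costs manual effort, which is exactly the budget-versus-model-quality trade-off described above.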
Another component of “the 80%” of data science work dedicated to data preparation is “data integration”, which largely depends on the connectivity of the various sources. What does this mean in practice?
Any proper dataset operates with a unique identifier. If this unique identifier is not present in another data set, any information associated with the specific entries can serve as a point of connection. However, this information needs to be coherent and well understood. For example, if a dataset contains logs of events with date, time and location, then – besides the unique identifier of each entry – date, time and location (if stored in an accessible format, such as GPS coordinates) can serve as connection points, as time and location are well-understood concepts.
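Connecting two sources via such shared attributes boils down to a join. A minimal sketch, assuming a hypothetical event log and weather table that share no identifier but both carry date and location:

```python
import pandas as pd

# Hypothetical event log: has its own unique identifier
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "date": ["2021-06-01", "2021-06-01", "2021-06-02"],
    "location": ["48.137,11.575", "52.520,13.405", "48.137,11.575"],
})

# Hypothetical weather table: no event_id, but the same date/location columns
weather = pd.DataFrame({
    "date": ["2021-06-01", "2021-06-01", "2021-06-02"],
    "location": ["48.137,11.575", "52.520,13.405", "48.137,11.575"],
    "temp_c": [24.5, 19.0, 27.1],
})

# Date and location act as the connection point between the two sources
merged = events.merge(weather, on=["date", "location"], how="left")
```

A left join keeps every event even when no weather record matches, which makes gaps in the secondary source visible instead of silently dropping rows.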
If, however, data is stored as free text and was manually entered, it becomes more difficult to make sense of the entries and tie them to other data sources that relate to categories only implicitly contained in the unstructured text. In a well-thought-out data storage structure, these issues rarely arise.
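To make such free-text entries joinable at all, they first have to be normalized to a canonical form. A minimal sketch, with an assumed lookup table of known spellings (the entries and mapping are illustrative):

```python
import re

# Hypothetical manually entered location strings for two cities
raw = ["Munich ", "munich", "München", "Berlin!!", " berlin"]

# Assumed lookup table mapping known spellings to a canonical label
canonical = {
    "munich": "Munich",
    "münchen": "Munich",
    "berlin": "Berlin",
}

def normalize(entry):
    """Lowercase, strip punctuation/whitespace, then look up the label."""
    key = re.sub(r"\W", "", entry.lower())  # \W drops non-word characters
    return canonical.get(key)  # None if the spelling is unknown

labels = [normalize(e) for e in raw]
```

Real projects typically need fuzzier matching than an exact lookup, but even this simple step turns five distinct strings into two clean categories that another source can join on.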
However, if data sources are newly integrated for project-specific purposes, these tasks form the essential backbone of project success.
Data availability, data quality and data connectivity: data sets rarely come well prepared and neatly integrated. On the contrary, data is scattered and often scarce. That’s why data scientists oftentimes spend 80% of their time on data preparation and integration. Does that sound very sexy? Not really, but it is the essential groundwork for applying robust data science methods, thus enabling the sexy (value-creating) part of the data science profession.