Data Science Rapid Proof-of-Concept Projects
Data science projects are complex and inherently carry the risk of not being feasible. Unlike large-scale IT projects, however, there are ways to act on ideas quickly in a sandbox environment. This allows for a fail-fast approach and enables companies to sustainably allocate their resources towards those projects that will reliably create value, be it through the optimization of processes, the enabling of new services or increased customer loyalty. In this blog article, we will briefly outline the five steps of a rapid proof-of-concept (PoC) data science project and explain how it allows companies to move forward in data science and AI in a more entrepreneurial manner, avoiding large misinvestments in lofty product visions.
Even in a rapid prototyping setup, fully understanding the business and commercial incentives is crucial. Some ideas are formulated rather vaguely (“let’s predict trends in our industry with big data and use it to enhance our services”) and need to be narrowed down to a set of workable hypotheses and goals. While the problem often seems evident on the surface, its exact definition and scope are rarely clearly established. Especially in projects where representatives from various departments are involved, everyone will have slightly different incentives and agendas. Aligning these in a shared project description is the key activity during the first step of a PoC data science project – always keeping in mind the available data sources, which should inform this discussion throughout.
After a project definition has been established, attention shifts to the available data sources, internal and external alike. While external data sources – especially fee-based ones – usually adhere to certain quality standards, internal data sources can vary widely in form and quality. Contrary to popular belief, perfectly assembled data streams are a rare exception. Projects often involve a mix of data formats (CSV, PDF, etc.) with differing quality standards. During the data understanding phase, data scientists aim to fully understand each data source, column by column and line by line. This can be a lengthy process and often involves time-consuming 1:1 sessions with domain experts.
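As a minimal sketch of what this column-by-column inspection can look like in practice – the sample CSV, field names and customer ids below are made up purely for illustration:

```python
import csv
import io
from collections import Counter

# Hypothetical CRM extract; a real PoC would load this from files.
RAW = """customer_id,region,monthly_spend
C001,north,120.50
C002,south,
C003,north,87.00
C004,,45.25
"""

def profile_columns(rows):
    """Per-column summary of missing values, distinct values and examples -
    the kind of overview that is then discussed with domain experts."""
    profile = {}
    for name in rows[0].keys():
        values = [r[name] for r in rows]
        profile[name] = {
            "missing": sum(1 for v in values if not v),
            "distinct": len(set(v for v in values if v)),
            "examples": list(Counter(v for v in values if v))[:3],
        }
    return profile

rows = list(csv.DictReader(io.StringIO(RAW)))
report = profile_columns(rows)
print(report["monthly_spend"]["missing"])  # 1 - one spend value is empty
```

Even a lightweight profile like this surfaces concrete questions (why is a spend value missing? why does one row lack a region?) that structure the sessions with domain experts.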
An integrated database is the essential groundwork for project success. Closely intertwined with data understanding, the data preparation phase of a PoC data science project deals with the integration of all relevant data sources. During these integration efforts, further questions about the respective datasets often arise. While the data understanding phase is partly carried out to ensure smooth integration downstream, some issues naturally only surface in the process. One of the crucial activities during data science projects is therefore to raise awareness among project participants that data understanding and integration are key activities and success enablers. It is better to invest an extra week into these phases than to move on prematurely, as otherwise all downstream activities will be negatively affected.
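A toy sketch of such an integration step, joining two keyed extracts and recording mismatches – the dataset names, fields and ids are illustrative assumptions, not a prescribed setup:

```python
# Hypothetical CRM and billing extracts keyed on a shared customer id.
crm = {"C001": {"region": "north"}, "C002": {"region": "south"}}
billing = {"C001": {"spend": 120.5}, "C003": {"spend": 87.0}}

def integrate(left, right):
    """Outer-join two keyed datasets and record unmatched keys -
    exactly the kind of issue that triggers follow-up questions
    for the domain experts during data preparation."""
    merged, issues = {}, []
    for key in sorted(set(left) | set(right)):
        if key not in left:
            issues.append(f"{key}: missing in CRM")
        if key not in right:
            issues.append(f"{key}: missing in billing")
        merged[key] = {**left.get(key, {}), **right.get(key, {})}
    return merged, issues

merged, issues = integrate(crm, billing)
print(issues)  # ['C002: missing in billing', 'C003: missing in CRM']
```

The point is less the join itself than the issue log: every unmatched key is a question that must be resolved before any downstream modelling can be trusted.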
Once the comprehensive dataset has been assembled, data scientists start to extract features from the data and to apply, tweak and augment models. Algorithmic approaches to data always have an exploratory component, and within a rapid PoC project, rarely can every possible approach be tested. However, the modelling phase should give data scientists a good understanding of what kind of further optimization is possible and how much effort (as in: cost) it would take to achieve it. In the end, it is a business decision to assess how further optimization could influence the bottom line down the road. During the PoC phase, however, often more than an 80/20 solution is delivered, which in itself can already help the respective company in their endeavor. The modelling phase – where the magic happens – is often more straightforward than the data preparation phase, even if it sounds far more attractive and magical to outside observers.
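To make the 80/20 idea concrete, here is a deliberately simple sketch on synthetic data: a trivial majority-class baseline against a one-feature threshold rule. The churn scenario, numbers and thresholds are invented for illustration; a real PoC would use proper models on the integrated dataset.

```python
import random

random.seed(42)

# Hypothetical synthetic data: churn is more likely for low spenders.
examples = []
for _ in range(500):
    spend = random.uniform(0, 200)
    churned = 1 if (spend < 80 and random.random() < 0.85) else 0
    examples.append((spend, churned))

train, test = examples[:400], examples[400:]

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

# Naive baseline: always predict the majority class of the training set.
majority = round(sum(y for _, y in train) / len(train))
baseline_acc = accuracy(lambda x: majority, test)

# Simple "model": sweep spend thresholds on the training set.
best_t = max(range(0, 201, 10),
             key=lambda t: accuracy(lambda x, t=t: int(x < t), train))
model_acc = accuracy(lambda x: int(x < best_t), test)

print(f"baseline: {baseline_acc:.2f}, threshold rule: {model_acc:.2f}")
```

The gap between the baseline and the simple rule is a first, rough answer to the business question above: it shows what is achievable quickly and hints at how much headroom further optimization effort might buy.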
Once the algorithms have been optimized to a satisfactory level, the closing phase of a PoC data science project concerns evaluation. Data scientists will discuss what further optimizations could be conducted, how the algorithm performs and where its potential weaknesses lie. In short, a detailed roadmap will be specified, outlining concrete next steps should the decision be made to move on and bring the algorithm to work in a commercial setting. At the same time, the decision could be taken that it is not worthwhile to push the project further. Or it could be agreed that specific data needs to be collected from now on, in order to allow for analysis at a later stage. Enabling such informed business decisions is also one of the key deliverables of a PoC data science project.
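One way such weaknesses are surfaced is by slicing performance per segment rather than reporting a single global score. A minimal sketch, with invented segments, labels and predictions:

```python
from collections import defaultdict

# Hypothetical evaluation records: (segment, true label, predicted label).
results = [
    ("north", 1, 1), ("north", 0, 0), ("north", 1, 1), ("north", 0, 1),
    ("south", 1, 0), ("south", 0, 0), ("south", 1, 0), ("south", 0, 0),
]

def accuracy_by_segment(records):
    """Accuracy per customer segment, to surface where the model is weak."""
    hits, totals = defaultdict(int), defaultdict(int)
    for segment, truth, pred in records:
        totals[segment] += 1
        hits[segment] += int(truth == pred)
    return {s: hits[s] / totals[s] for s in totals}

print(accuracy_by_segment(results))  # {'north': 0.75, 'south': 0.5}
```

In this made-up example, the model systematically misses churners in the south segment – exactly the kind of concrete weakness that belongs in the evaluation roadmap, either as a modelling task or as a trigger to collect more data for that segment.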
In any case, the details of any rapid proof-of-concept data science project are highly specific to the respective environment. As in: there is no definitive blueprint; customization is the default scenario. Thus, the project timeline and activity setup might look different in your organization. Feel free to reach out to us – we’d be more than happy to share more insights with you.