In early 2012, a group of Stanford University researchers interviewed 35 data analysts from 25 organizations across a variety of sectors, including health care, retail, marketing and finance, and identified the various challenges data scientists face in the data analysis process.
Although data scientists are in high demand and their profession has been hailed as one of the hottest of the 21st century, much of their work is dominated by the time-consuming process of changing data into a usable form. The data analysis process involves four tasks – discovery, transformation, modeling and reporting – with data scientists spending as much as 60 to 80 percent of their time in the data transformation stage.
In this slideshow, Trifacta, a provider of productivity platforms for data analysis, takes you through each of these tasks in greater detail, highlighting the pain points data scientists face at each stage. It is clear that data scientists need tools that simplify the data analysis process while increasing productivity and collaboration.
Discovery: Acquiring data necessary to complete an analysis task.
Before data scientists start to transform data for analysis, they must first acquire the necessary data. However, they often find that data is distributed across multiple databases, and that their organization lacks documentation or search capabilities, requiring them to rely on database administrators or others for help.
Transformation: Manipulating the acquired data for analysis, diagnosing data quality and understanding what assumptions can be made.
Transformation, the most time-consuming component of the analysis process, involves reformatting and validating data to make it palatable for databases and visualization tools, diagnosing the data for quality issues and working out what assumptions can safely be made about it. In the transformation phase, data scientists encounter numerous challenges, including data sets that contain missing, erroneous or extreme values. As a result, the assumptions that data scientists make about such data sets can turn out to be wrong or misleading.
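To make the three kinds of quality issue concrete, here is a minimal Python sketch of a diagnostic pass over a single numeric column. It is an illustration only, not a method described by Trifacta or the Stanford study: the function name `diagnose_column`, the plausible-range bounds and the choice of Tukey's interquartile-range rule for "extreme" values are all assumptions made for this example.

```python
import statistics


def diagnose_column(values, lower=None, upper=None, iqr_factor=1.5):
    """Return the positions of missing, erroneous and extreme entries.

    Hypothetical helper for illustration. "Erroneous" means outside the
    caller-supplied plausible range; "extreme" means outside Tukey's
    fences (quartiles -/+ iqr_factor * IQR) among the plausible values.
    """
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]

    # Values that cannot be right, given what we assume about the column.
    erroneous = [
        i for i, v in present
        if (lower is not None and v < lower)
        or (upper is not None and v > upper)
    ]

    # Values that are possible but unusually far from the rest.
    plausible = [(i, v) for i, v in present if i not in erroneous]
    extreme = []
    if len(plausible) >= 4:
        q1, _, q3 = statistics.quantiles([v for _, v in plausible], n=4)
        spread = q3 - q1
        low_fence = q1 - iqr_factor * spread
        high_fence = q3 + iqr_factor * spread
        extreme = [i for i, v in plausible if v < low_fence or v > high_fence]

    return {"missing": missing, "erroneous": erroneous, "extreme": extreme}


# Example: customer ages with one gap, one impossible value, one outlier.
ages = [34, 29, None, 41, -5, 38, 33, 36, 31, 120]
report = diagnose_column(ages, lower=0, upper=130)
print(report)  # {'missing': [2], 'erroneous': [4], 'extreme': [9]}
```

Separating "erroneous" from "extreme" matters because the two call for different decisions: an age of -5 can be discarded on sight, whereas an age of 120 may be genuine, and whether to keep it is exactly the kind of assumption that should be recorded.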
Modeling: Constructing a model of the assembled data.
The biggest difficulty in constructing a model is understanding the relevance of each data set to a given analysis task. When data scientists get to this stage, they often find their data has not been completely transformed and must go back to the transformation (or "wrangling") stage before they can identify useful patterns or relationships. Data scientists also find that during this stage, many existing analytics packages, tools or algorithms do not scale with the size of their data sets.
Reporting: Sharing insights gained from the data.
Because assumptions made during analysis are often poorly documented, data scientists may find it hard to distribute and consume reports, which can affect the interpretation of results. Readers frequently have little to no knowledge of how the original input data was transformed, and many reports do not allow for interactive verification or sensitivity analysis.
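One way to picture what better documentation of assumptions could look like is a small log that records each transformation step alongside the assumption behind it. The Python sketch below is purely illustrative: the `TransformLog` class, its methods and the example assumptions are hypothetical and are not drawn from Trifacta's product or the Stanford study.

```python
from dataclasses import dataclass, field


@dataclass
class TransformLog:
    """Hypothetical record of transformation steps and their assumptions."""

    steps: list = field(default_factory=list)

    def apply(self, rows, description, assumption, func):
        """Run one transformation and note what it did and why."""
        result = func(rows)
        self.steps.append({
            "description": description,
            "assumption": assumption,
            "rows_in": len(rows),
            "rows_out": len(result),
        })
        return result

    def summary(self):
        """Return one readable line per step, suitable for a report appendix."""
        return [
            f"{n}. {s['description']} (assumes: {s['assumption']}; "
            f"{s['rows_in']} -> {s['rows_out']} rows)"
            for n, s in enumerate(self.steps, 1)
        ]


log = TransformLog()
ages = [34, None, 41, -5, 38]
ages = log.apply(ages, "drop missing ages", "values are missing at random",
                 lambda rows: [v for v in rows if v is not None])
ages = log.apply(ages, "drop negative ages", "negative ages are entry errors",
                 lambda rows: [v for v in rows if v >= 0])
for line in log.summary():
    print(line)
```

A reader who receives the summary with the report can see which rows were dropped and on what grounds, and can rerun the analysis with a step removed to check how sensitive the result is to that assumption.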