EMC consultant David Dietrich says a “dirty little secret” of Big Data is that even when you’re dealing with the industry’s top minds, data preparation still takes up a huge share of the time on any Big Data project.
“Data Prep can easily absorb 80% of the time of a project,” Dietrich writes. “Many times I see leaders who want to get their data science projects going quickly, so their teams jump right into making models, only to slide back a few phases, because they are dealing with messy or dirty data. They must then try to regroup and create predictive models.”
Dietrich is a member of an MIT consortium called bigdata@csail. The idea is to bring together academic elites with industry leaders to drive innovation in Big Data technology. Microsoft, EMC, SAP, Intel and AIG are among the member organizations.
Dietrich points out that, too often, technology focuses on advancing the gee-whiz parts of Big Data, such as analytics, visualization tools, streaming data or dashboards, while neglecting the more ho-hum, but very critical, task of data preparation, which includes data integration, matching and other data quality work.
Right now, those steps are surprisingly human-intensive. One industry expert recently told me he’s never had a client run a data matching/cleansing transformation without keeping a set of eyes on the process.
In fact, according to Dietrich, it’s such a cumbersome task that recent MIT research shows that of the roughly 5,000 data sources found in large enterprises, only about 1-2 percent make it into the Enterprise Data Warehouse. That’s in part because the other 98 percent is unused, inaccessible or simply not cleaned up enough for people to use.
This month, the bigdata@csail group discussed Big Data integration, and several members presented projects that use machine learning to make integration and data quality technology smarter.
“The thesis is that rather than using brute force techniques to merge data together, we can use more intelligent techniques to make inferences about different kinds of data and automate some of the decision making,” he said, explaining how one project achieved this through algorithmic intelligence. “This means some of the brute force-level work can happen with machine learning, and some of the more difficult decisions about merging data can be left to humans, who can exercise higher-level judgments based on their deep domain knowledge and experience.”
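To make that division of labor concrete, here is a minimal, hypothetical sketch in R (not Data Tamer or any of the projects presented at the event): fuzzy-match entity names across two sources, automatically merge the pairs the algorithm is confident about, and route borderline cases to a human reviewer. The sample names and thresholds are illustrative assumptions, not anything from Dietrich’s article.

```r
# Hypothetical illustration of machine-assisted record matching with a
# human in the loop; the names and cut-off values below are made up.
source_a <- c("Acme Corp.", "Globex Corporation", "Initech LLC")
source_b <- c("ACME Corp",  "Globex Corp",        "Initrode LLC")

# Normalized edit distance between every pair of names (base R's adist)
d <- adist(tolower(source_a), tolower(source_b)) /
  outer(nchar(source_a), nchar(source_b), pmax)

best      <- apply(d, 1, which.min)  # closest candidate in source_b
best_dist <- apply(d, 1, min)

# Confident matches merge automatically; ambiguous ones go to a person
decision <- ifelse(best_dist < 0.15, "auto-merge",
            ifelse(best_dist < 0.40, "send to domain expert", "no match"))

data.frame(a = source_a, b = source_b[best],
           distance = round(best_dist, 2), decision = decision)
```

The thresholds capture the split Dietrich describes: the machine grinds through the brute-force comparisons, and people only see the cases that require domain judgment.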
It’s pretty obvious that this kind of automation is going to be required to manage Big Data sets. Dietrich says these types of tools would allow enterprises to use 10-20 percent of their data, rather than the current 1-2 percent.
He lists five available technologies for smarter data preparation work:
- Data Tamer, which focuses on integration and is still being developed at MIT.
- OpenRefine (formerly Google Refine), which helps with data clean-up.
- Data Wrangler, a data cleaning and transformation tool developed at Stanford.
- reshape2, an R package that lets you restructure and aggregate data.
- plyr, an R package that applies a split-apply-combine strategy to data analysis (see the short sketch after this list).
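For a sense of what the last two look like in practice, here is a small illustrative snippet using R’s built-in airquality dataset (the example is mine, not from Dietrich’s article): reshape2 melts a table into long form and recasts it with an aggregate, while plyr’s ddply applies the split-apply-combine pattern.

```r
library(reshape2)
library(plyr)

# reshape2: restructure and aggregate
# Melt airquality to long form, then recast it so each row is a month and
# each column is the monthly mean of a measurement.
long <- melt(airquality, id.vars = c("Month", "Day"))
monthly_means <- dcast(long, Month ~ variable, fun.aggregate = mean,
                       na.rm = TRUE)

# plyr: split-apply-combine
# Split by Month, apply a summary to each piece, combine into one data frame.
ddply(airquality, "Month", summarise,
      mean_ozone = mean(Ozone, na.rm = TRUE),
      max_temp   = max(Temp))
```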
For more on how Big Data integration can be smarter, be sure to read Dietrich’s full article.