EMC consultant David Dietrich says a “dirty little secret” of Big Data is that even when you’re dealing with the industry’s top minds, data preparation still takes up a huge share of the time in any Big Data project.
“Data Prep can easily absorb 80% of the time of a project,” Dietrich writes. “Many times I see leaders who want to get their data science projects going quickly, so their teams jump right into making models, only to slide back a few phases, because they are dealing with messy or dirty data. They must then try to regroup and create predictive models.”
Dietrich is a member of an MIT consortium called bigdata@csail. The idea is to bring together academic elites with industry leaders to drive innovation in Big Data technology. Microsoft, EMC, SAP, Intel and AIG are among the member organizations.
Dietrich points out that, too often, technology focuses on advancing the gee-whiz parts of data, such as data analytics, visualization tools, streaming data or dashboards, while neglecting the more ho-hum, but very critical, task of data preparation, which includes data integration, matching and other data quality tasks.
Right now, those steps are surprisingly human-intensive. One industry expert recently told me he’s never had a client run a data matching/cleansing transformation without keeping a set of eyes on the process.
In fact, according to Dietrich, it’s such a cumbersome task that recent MIT research shows that, out of about 5,000 data sources found in large enterprises, only about 1-2 percent make it into the Enterprise Data Warehouse. That’s in part because the other 98 percent is unused, inaccessible or just not cleaned up enough for people to use.
This month, the bigdata@csail group discussed Big Data integration and several people presented projects they’re working on to make the technology for integration and data quality smarter using machine learning.
It’s pretty obvious that this kind of automation is going to be required to manage Big Data sets. Dietrich says these types of tools would allow enterprises to use 10-20 percent of their data, rather than the current 1-2 percent.
He lists five available technologies for smarter data preparation work:
For more on how Big Data integration can be smarter, be sure to read Dietrich’s full article.