
Five New Tools for Smarter Big Data Integration

Written By
Loraine Lawson
Apr 22, 2013

EMC consultant David Dietrich says a “dirty little secret” of Big Data is that even when you’re dealing with the industry’s top minds, data preparation still takes up a huge amount of any Big Data project.

“Data Prep can easily absorb 80% of the time of a project,” Dietrich writes. “Many times I see leaders who want to get their data science projects going quickly, so their teams jump right into making models, only to slide back a few phases, because they are dealing with messy or dirty data. They must then try to regroup and create predictive models.”

Dietrich is a member of an MIT consortium called bigdata@csail. The idea is to bring together academic elites with industry leaders to drive innovation in Big Data technology. Microsoft, EMC, SAP, Intel and AIG are among the member organizations.

Dietrich points out that, too often, technology focuses on advancing the gee-whiz parts of data, such as data analytics, visualization tools, streaming data or dashboards, while neglecting the more ho-hum, but very critical, task of data preparation, which includes data integration, matching and other data quality tasks.

Right now, those steps are surprisingly human-intensive. One industry expert recently told me he’s never had a client run a data matching/cleansing transformation without keeping a set of eyes on the process.

In fact, according to Dietrich, it’s such a cumbersome task that recent MIT research shows that, of the roughly 5,000 data sources found in large enterprises, only about 1-2 percent make it into the Enterprise Data Warehouse. That’s in part because the other 98 percent is unused, inaccessible or just not cleaned up enough for people to use.

This month, the bigdata@csail group discussed Big Data integration and several people presented projects they’re working on to make the technology for integration and data quality smarter using machine learning.

“The thesis is that rather than using brute force techniques to merge data together, we can use more intelligent techniques to make inferences about different kinds of data and automate some of the decision making,” Dietrich said, explaining how one project achieved this through algorithmic intelligence. “This means some of the brute force-level work can happen with machine learning, and some of the more difficult decisions about merging data can be left to humans, who can exercise higher-level judgments based on their deep domain knowledge and experience.”
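To make that division of labor concrete, here is a minimal sketch of how candidate record pairs might be triaged. It is not how any of the projects Dietrich describes actually works: it swaps in plain string similarity from Python's standard library in place of machine learning, and the function names and both thresholds are hypothetical values picked for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_MERGE = 0.90   # at or above this score, merge without review
AUTO_REJECT = 0.50  # below this score, treat the records as distinct


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two strings, ignoring case and padding."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def triage(pairs):
    """Sort candidate pairs into auto-merged, auto-rejected and human-review lists."""
    merged, rejected, review = [], [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MERGE:
            merged.append((a, b, score))
        elif score < AUTO_REJECT:
            rejected.append((a, b, score))
        else:
            # Ambiguous cases are left to a person with domain knowledge.
            review.append((a, b, score))
    return merged, rejected, review


if __name__ == "__main__":
    candidates = [
        ("Acme Corp", "ACME CORP "),
        ("Acme Corporation", "Acme Corp"),
        ("abc", "xyz"),
    ]
    merged, rejected, review = triage(candidates)
    print(len(merged), "merged,", len(rejected), "rejected,", len(review), "for review")
```

The point of the two thresholds is that only the middle band of uncertain matches reaches a person, which is the part of the work Dietrich suggests should stay with humans.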

This kind of automation will clearly be required to manage Big Data sets. Dietrich says these types of tools would allow enterprises to use 10-20 percent of their data, rather than the current 1-2 percent.

He lists five available technologies for smarter data preparation work:

  • Data Tamer, which focuses on integration and is still being developed at MIT.
  • OpenRefine, formerly Google Refine, which helps with clean-up.
  • Data Wrangler, a cleaning and transformation tool developed by Stanford.
  • Reshape2, an R package that lets you restructure and aggregate data.
  • Plyr, an R package built around the split-apply-combine strategy.
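
For readers unfamiliar with the last two ideas, the sketch below illustrates split-apply-combine and a wide-to-long reshape using only Python's standard library. Reshape2 and Plyr are R packages, and this is not their API; the function names `split_apply_combine` and `melt`, and the sample field names, are hypothetical and used only for illustration.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, return one result per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])        # split
    return {k: func(v) for k, v in groups.items()}  # apply, then combine


def melt(rows, id_field, value_fields):
    """Reshape wide rows into long form: one output row per (id, variable) pair."""
    return [
        {id_field: row[id_field], "variable": field, "value": row[field]}
        for row in rows
        for field in value_fields
    ]


if __name__ == "__main__":
    sales = [
        {"region": "east", "sales": 10},
        {"region": "west", "sales": 20},
        {"region": "east", "sales": 30},
    ]
    print(split_apply_combine(sales, "region", "sales", mean))

    wide = [{"id": 1, "q1": 5, "q2": 7}]
    print(melt(wide, "id", ["q1", "q2"]))
```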

For more on how Big Data integration can be smarter, be sure to read Dietrich’s full article.
