Overcoming Hadoop Data Indigestion

Michael Vizard

As enterprise IT organizations become more familiar with Hadoop, many of them are becoming all too familiar with the trials and tribulations of loading large amounts of data into the open source data management framework.

As a result, providers of Hadoop distributions such as EMC Greenplum are now starting to form alliances with providers of extract, transform and load (ETL) tools, such as Syncsort, whose offerings load data into Hadoop significantly faster than existing open source tools such as Sqoop.
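For context, loading data into Hadoop with the open source Sqoop tool typically looks something like the sketch below. The connection string, credentials, table name, and target directory are illustrative placeholders, not details from any deployment mentioned here:

```shell
# Hypothetical Sqoop import: pull a relational table into HDFS in parallel.
# All names and hosts below are placeholders for illustration only.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user \
  --table orders \
  --target-dir /data/warehouse/orders \
  --num-mappers 8
```

Each mapper opens its own database connection and writes a slice of the table to HDFS, which is why throughput depends heavily on how well the source table can be split. Commercial ETL tools compete largely on doing this partitioning and transformation work more efficiently.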

According to Mitch Seigle, Syncsort vice president of marketing and product management, Syncsort achieved a load throughput of more than 12 terabytes per hour using just four ETL servers. Naturally, there is going to be a lot of debate over who has the best ETL tools for loading data into Hadoop. But one thing is clear: commercial offerings are generally going to be far superior to open source tools that, while free, wind up costing a lot more in terms of time and effort.

In fact, Seigle says the whole point of using a commercial ETL tool is reducing the amount of time it takes to gain value from Hadoop by making it faster and easier to load data into the system.

The speed at which a Big Data framework can ingest data may not be an issue when most organizations start evaluating Hadoop. But once a Hadoop system goes into pilot, it’s not long before figuring out how to actually live with Hadoop on a daily basis suddenly becomes a top-of-mind issue.
