Scaling Big Data Performance Across Multicore Processors

Michael Vizard

As the management of Big Data begins to increasingly move from the theory into actual practice, IT organizations are discovering that moving all that data around is no simple task. While the cost of storing massive amounts of information has fallen thanks to technologies such as the open source Apache Hadoop data management framework, moving large amounts of data around the enterprise is as complex an undertaking as it has ever been.

To address that specific issue, the folks at Pervasive Software, a provider of data integration tools and cloud computing services, came up with DataRush, which optimizes the processing of Hadoop data across every core available on multicore processor servers.

According to Pervasive Software CTO Mike Hoskins, one of the reasons that Hadoop clusters tend to be large is that as a technology Hadoop was never really designed to optimally take advantage of multicore processors. DataRush essentially adds a layer of software that manages the processing of Hadoop data in parallel across all the cores in the system, which means the Hadoop environment will scale linearly.

Hoskins says that also means Hadoop clusters can be much smaller than they are today, which means they can be more easily managed by IT teams that don't tend to have a lot of experience running large clusters of servers.

Hadoop, says Hoskins, is much more than a framework for processing large amounts of data. It's evolving into an entirely new computing architecture for building enterprise-class applications. But that won't happen, says Hoskins, until IT organizations can leverage multicore processors to process that data. Otherwise, IT organizations will have to rely on massively parallel database appliances that wind up adding additional cost and complexity to the IT environment.

It's unknown to what extent Hadoop will transform IT, but it's clear that significant changes are starting to occur that go way beyond the amount of data that can be processed. Hoskins says that will become a lot more apparent once a more accessible form of the MapReduce application programming interface becomes available later this year. Once that happens, Hoskins says the use cases for Hadoop are going to rapidly expand, which in turn will drive a much greater need to leverage the largely untapped parallel processing capabilities of multicore processors.

Add Comment      Leave a comment on this blog post
Apr 10, 2012 6:34 AM Gopal Gopal  says:

Challenge of taking advantage of Multi-core with hadoop is taking data to the core for processing....verbatim (copy/paste) from yahoo post..

Even if hundreds or thousands of CPU cores are placed on a single machine, it would not be possible to deliver input data to these cores fast enough for processing. Individual hard drives can only sustain read speeds between 60-100 MB/second. These speeds have been increasing over time, but not at the same breakneck pace as processors. Optimistically assuming the upper limit of 100 MB/second, and assuming four independent I/O channels are available to the machine, that provides 400 MB of data every second. A 4 terabyte data set would thus take over 10,000 seconds to read--about three hours just to load the data! With 100 separate machines each with two I/O channels on the job, this drops to three minutes.

How is this issue resolved?

Apr 21, 2012 7:25 AM jwhite jwhite  says: in response to Gopal

Check out ETI they can do exactly what you claim is the limiting step for delivering data to muticores.

May 4, 2012 5:17 AM Mike Hoskins, Pervasive CTO Mike Hoskins, Pervasive CTO  says: in response to Gopal

Gopal is right that we always need more parallel IO throughput, and that can be achieved by distributing IO across nodes in a cluster. In fact Hadoop and HDFS are good at distributing data across multiple nodes to take advantage of all the IO channels on all the nodes. However, as nodes get 'fatter'  (more cores and therefore more inherent processing power in each node), the simple, coarse-grained high-latency process-level parallelism of traditional MapReduce (think MR1), while offering scaling, in many cases grossly under-utilizes the latent horsepower of the multicore revolution. By deploying fine-grained extreme low-latency thread-level parallelism in your execution layer (think MR2), like that found in the DataRush dataflow environment, many workloads could see vastly more efficient use of each high core-count node in a Hadoop cluster.

Also, there are new I/O systems being put out by Intel such as Sandy Bridge and recently released Ivy Bridge. The latest I/O systems by Intel include native speeds of up to 6GB/s (need to verify this number), with disks reaching speeds of 3GB/s. These are off the shelf systems that will handle a much larger I/O workload than systems just a year or two older. So the mix of increased compute power plus increased I/O capability does lead to a smaller h/w footprint.


Post a comment





(Maximum characters: 1200). You have 1200 characters left.




Subscribe to our Newsletters

Sign up now and get the best business technology insights direct to your inbox.