Big Data Gets a Little More Manageable


    Big Data isn’t just Hadoop and in-memory anymore. Big Data technologies and tools have grown significantly over the past few years — so much so that it’s hard to keep up with them.

    If you’d like to get up to speed and are primarily interested in open source solutions, I recommend this column by Virenda Gupta, senior vice president at Huawei Technologies India.

    He discusses new open source solutions in the areas of Big Data processing, analytics and mining. He also addresses Big Data virtualization, where he sees a shortage of comprehensive platforms.

    What I particularly appreciate are his succinct definitions of Apache’s Spark and Storm projects. Gupta explains:

    • Spark improves data processing speeds on Hadoop by keeping data in memory so the same data can be processed multiple times without being reread from disk.
    • Storm supports stream processing in real time.

    These solutions are starting to attract attention, so it’s nice to have those short explanations in mind for pieces like this GigaOm article about AMPLab, which created Spark:

    “…Spark, has taken the data-processing world by storm as a faster, easier and more flexible framework for a wide variety of tasks. Spark’s creators and backers promised it can tackle everything from batch process to stream processing and SQL queries to machine learning jobs, without the performance and overhead complexity of tools such as MapReduce and Storm.”

    Perhaps even more significantly, the article points out that even Hadoop vendors are “falling all over themselves to support it.”

    Since Spark is all about improving data processing speeds, one of the feats AMPLab is trying to accomplish is “making big data really small,” the article states. That means storing more data in a smaller footprint, so the same amount of data fits in less memory.

    AMPLab has had impressive results with a new in-memory database project called Succinct. GigaOm reports that Succinct “blew away” MongoDB, Cassandra and HyperDex in storage efficiency, fitting 123 gigabytes of raw data onto 64 gigabytes of RAM. Now, I am no technologist, but even I know that’s impressive.

    The article does a nice job of explaining how Succinct achieves that goal, so rather than repeat it, I’ll let you read it there. It also looks at other AMPLab projects, including how the lab is working with doctors to speed up DNA sequencing.

    Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson on Google+ and Twitter.
