One of the ongoing and largely pointless debates about Big Data is what, exactly, the term means. I call it pointless because, let's face it, vendors aren't going to stop using the term just because some analyst or journalist says what they're selling doesn't meet the criteria for "Big Data."
Still, the loose usage does add confusion: you have vendors calling something a "Big Data" solution when it may be just a terabyte sitting in a data warehouse. Generally, that's not what people mean when they talk about Big Data, even if deployments like that are creeping toward the line.
So at what point do you need to consider a Big Data solution like Hadoop or in-memory processing? Here are a few signs you need Big Data.
Your datasets exceed two terabytes. If you have more than two terabytes of data, you should "consider" Hadoop, Josh Sullivan, a vice president at Booz Allen Hamilton and founder of the Hadoop-DC Meetup group, told Katherine Reynolds Lewis in "What the Heck is Hadoop?" (A popular theme, I might add.)
You have more than 100 terabytes of data. I realize that's a huge jump from the two terabytes mentioned above, but Sullivan qualifies it, saying that's the point at which you shift from "considering" Hadoop to "absolutely want to be looking at Hadoop."
You need speed. Velocity is one of the defining characteristics of Big Data. As data grows, some companies are finding it takes more than 24 hours to finish the ETL batch jobs they used to run overnight so the results would be ready the next day. Both Hadoop and in-memory solutions can cut that time down to size. One example of how Hadoop changes the speed of things? Genome sequencing that took hours now takes minutes, according to Lewis.
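To make that velocity point a little more concrete, here's a toy sketch of the map-and-reduce pattern Hadoop applies to batch jobs like an overnight ETL run. Everything in it is invented for illustration: the record layout (order ID, date, amount) and the sales-per-day rollup are stand-ins, and the script just simulates locally what a real cluster would spread across many machines.

```python
# A toy map/reduce version of a nightly ETL rollup: total sales per day.
# On a Hadoop cluster the "map" step runs in parallel on every block of the
# input files, which is what shrinks a 24-hour batch window; here we simply
# simulate the map -> shuffle/sort -> reduce flow locally on a few records.
# The record layout (order_id, date, amount) is invented for illustration.
from itertools import groupby
from operator import itemgetter

raw_records = [
    "1001,2013-04-01,19.99",
    "1002,2013-04-01,5.00",
    "1003,2013-04-02,42.50",
]

def mapper(line):
    """Parse one raw record and emit a (date, amount) pair."""
    order_id, date, amount = line.split(",")
    return date, float(amount)

def reducer(date, amounts):
    """Sum all the amounts the shuffle grouped under one date."""
    return date, sum(amounts)

# Map every record, then sort by key. Hadoop's shuffle does this grouping
# for you across the whole cluster.
pairs = sorted((mapper(line) for line in raw_records), key=itemgetter(0))

# Reduce each group down to a single (date, total) line.
for date, group in groupby(pairs, key=itemgetter(0)):
    print("%s\t%.2f" % reducer(date, [amount for _, amount in group]))
```

The point isn't these few lines of Python; it's that once a job is expressed as a map step and a reduce step, something like Hadoop Streaming can fan the map work out across as many machines as you have data blocks, which is where the time savings come from.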
You want to analyze more historical data without building an extensive infrastructure. This is a bit of an evolving area, but as SAP's Big Data strategist David Jonker explained, putting ERP, CRM, or other enterprise applications on top of in-memory solutions lets you keep more data available for in-application analysis. That means you no longer need a separate, complicated infrastructure just to store historical data and support queries against it.
Your data is unstructured. This is the much-touted "variety" component of Big Data. If you're interested in processing video, photos, log files, free text, or GPS/GIS data, then you'll definitely want to investigate Hadoop.
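As a rough illustration of what "processing unstructured data" often looks like in practice, here's a small script that pulls structured fields out of raw web-server log lines. The log format and field names are assumptions on my part; it reads from standard input, so it could run on its own or be handed to a Hadoop Streaming job as a mapper, but treat it as a sketch rather than a recipe.

```python
#!/usr/bin/env python
# Sketch: turn raw access-log lines (unstructured text) into structured,
# tab-separated records. Reads stdin and writes stdout, so it can run on
# its own or as a Hadoop Streaming mapper. Assumes a common-log-style
# format; adjust the pattern to match whatever your logs actually contain.
import re
import sys

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

for line in sys.stdin:
    match = LOG_PATTERN.match(line)
    if not match:
        continue  # unstructured data is messy; skip lines that don't parse
    rec = match.groupdict()
    print("\t".join([rec["ip"], rec["time"], rec["method"],
                     rec["path"], rec["status"], rec["size"]]))
```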
Of course, this list still doesn't give us a definitive definition of Big Data. My suggestion? Don't worry about it. Focus on what you need and on the tools that can meet that need.