Here’s an interesting data point in the debate over Hadoop clusters versus data warehouses: The cost per terabyte can be in the six figures in a data warehouse, but only a few hundred dollars in Hadoop, according to Jack Norris, vice president of marketing at Hadoop distribution vendor MapR Technologies.
“So, a big ROI,” he said.
That’s not exactly a new argument for Hadoop and, to be fair, Norris isn’t suggesting that Hadoop will usurp the data warehouse. Most people agree there’s room for both.
But in some cases, the technologies do compete.
Norris and I were actually discussing another issue: How enterprises are really putting Big Data technology to use, particularly when it comes to existing applications or problems.
But the cost of a terabyte is an interesting discussion in its own right. It turns out, there are different approaches to calculating the cost of a terabyte.
However, most calculations of storage total cost of ownership tend to be based on a "usable" terabyte, which, as I understand it, is simply the cost of a terabyte of capacity sitting in some form of storage.
It’s an interesting exercise, since Merrill was actually setting out to “show (let alone prove) that enterprise disk on a SAN (storage area network) was more cost effective than the JBOD” (just a bunch of disk drives; he doesn’t say what these were running, but calls them “big data systems”). He also looked at Hitachi’s FC (Fibre Channel) SAN or iSCSI solution, and how this all plays out against cloud costs.
All of which makes me a bit suspicious for three reasons:
Reservations aside, Merrill’s premise is that the cost of a written terabyte is a more precise measure of total cost of ownership than the cost of a stored terabyte.
“When measuring written-TB, we were able to get to some closer metrics around total transaction cost, or the analytics query cost within this IT environment,” Merrill states.
And he’s certainly doing a more complete analysis than you’d typically see. By calculating the cost of a written terabyte, he includes “incidentals” such as migrating data off and on the big data environment, backing up data and network costs.
“Upon further analysis, we found that big data processing required the local CPU to do many mundane storage tasks, and extra processors had to be employed,” he explains. “We forget how much work RAID, controllers and intelligent array-based software do to offload the host. These extra server costs were added to the unit cost model to serve up a TB of capacity.”
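The difference between the two metrics can be sketched in a few lines of code. This is a minimal illustration of the idea, not Merrill’s actual model: all dollar figures below are hypothetical placeholders, and the incidental categories (migration, backup, network, extra CPU) are the ones mentioned above.

```python
# Hypothetical sketch of the two costing approaches discussed above.
# All dollar figures are illustrative placeholders, not Merrill's numbers.

def usable_tb_cost(hardware_cost, usable_tb):
    """Naive 'usable terabyte' metric: hardware cost spread over raw capacity."""
    return hardware_cost / usable_tb

def written_tb_cost(hardware_cost, tb_written,
                    migration=0.0, backup=0.0, network=0.0, extra_cpu=0.0):
    """'Written terabyte' metric: total cost of ownership, incidentals
    included, divided by the data actually written over the period."""
    total = hardware_cost + migration + backup + network + extra_cpu
    return total / tb_written

# A JBOD rack can look cheap per usable terabyte...
jbod_usable = usable_tb_cost(hardware_cost=50_000, usable_tb=500)

# ...but once you charge for migration, backup, network traffic and the
# host CPU cycles spent on mundane storage tasks, and divide by the
# terabytes actually written, the comparison can flip.
jbod_written = written_tb_cost(hardware_cost=50_000, tb_written=150,
                               migration=10_000, backup=15_000,
                               network=5_000, extra_cpu=20_000)

print(f"usable-TB: ${jbod_usable:.0f}/TB, written-TB: ${jbod_written:.0f}/TB")
```

With these made-up inputs the usable-terabyte figure is $100/TB while the written-terabyte figure is several times higher, which is the shape of the reversal Merrill describes, even though the real magnitudes depend entirely on his measured inputs.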
The results? They flip the total cost of ownership, making SANs look more affordable than the JBOD/rack disk (labeled “direct attached,” or DAS). Oh, and Hitachi’s option ranks about the same as the SAN.
As an added bonus, he computes the carbon footprint of all these approaches. Not surprisingly, using a whole bunch of individual drives for storage leaves a chunky carbon footprint behind.
He also provides links to a three-part analysis of the economics of cloud storage.
So, besides a headache, where does all this leave CIOs and other IT leaders? Honestly, I can’t say. I suspect it very well might mean that you can prove a favorable TCO for whichever technology you choose.
IT Business Edge’s Arthur Cole is the man you really want to follow for more on the hardware cost discussions. But certainly, Merrill’s approach comes closer to measuring the overall costs associated with managing a terabyte.
“Infonomics” aside, it’s clear organizations are embracing Hadoop to better perform everyday tasks, whether that’s expanding the use of sensors, offloading ETL processing, gathering operational intelligence or analyzing customer data.
“With Hadoop, there are new applications that are being created with new data sources, but there also are existing applications that are just better by using Hadoop as the underlying platform,” Norris said.