Forrester predicted earlier this year that we'd see more open source data tools find their way into the enterprise, thanks in no small part to the fact that open source rules the Big Data space.
James Kobielus, who is now a Big Data evangelist for IBM but was then an analyst for Forrester Research, made the case for why the enterprise's doors are now open (ahem) for open source. I pointed out that he wasn't alone: The TDWI had also foreseen a new era for open source data tools.
PC World wrote on this theme recently with a discussion about open source's role in Big Data. One thing I found really revealing is that data scientists are graduating with a solid knowledge of R - an open source language for Big Data - but little knowledge of proprietary, GUI-based systems.
Here's what's interesting about that: It's not just about the cost of open source. No, it's also about the transparency of open source, according to Imran Ahmad, a data scientist who developed his own grid-computing algorithm called Bileg, which competes with Hadoop.
He used an open source toolkit because, he explains, open source platforms let you see the underlying mathematics, which in turn helps you evaluate the results of your data analysis.
"If it's in open-source, you can dig down and see why I'm getting these results, why these results are the optimal ones," Ahmad told PC World. Proprietary analytics software doesn't let you see the why of your results - which is fine until you have questionable results.
Primarily, when experts have talked about open source adoption, it's been all about the same few points: cost, community and flexibility.
But, clearly, there are other reasons at play when it comes to data management and analysis.
Another example: I was recently talking with data quality/governance expert Jim Harris, a consultant who runs the Obsessive Compulsive Data Quality blog. We were talking specifically about selecting a data quality tool, and he pointed out that it's actually very hard these days to buy a stand-alone tool.
He used IBM as an example of what's happened in the data quality space.
"IBM has acquired a lot of companies that used to just offer a data quality tool, but then IBM integrated it into a platform of technology where you can now no longer just buy a data quality tool," Harris said. "You have to buy the entire platform, which at the low end of the price point would be around a quarter of a million dollars."
But that hasn't happened with open source tools. Open source vendors still offer stand-alone data quality tools, he added:
That's why I think open source has gained. Talend is not the only example of this, but they're the best-known open source data quality and data integration vendor. If you just want to do some data profiling, you can download the Talend open data profiler for free. I mean, you give them your contact information and you will be mercilessly spammed by their marketing department until you decide to buy, but you can download their data profiling tool for free.
Of course, it's still early days for Big Data, so who knows what will happen. So far, even the proprietary companies are relying on open source solutions like Hadoop and R, according to the PC World article. What does that mean? The piece quotes Kobielus on this question:
As the footprint of closed-source software shrinks in many data/analytics environments, many incumbent vendors will evolve their business models toward open-source approaches, and also ramp up professional services and systems integration to assist customers in their moves towards open-source, cloud-oriented analytics, much of it focused on Hadoop and R.
Ten years ago, when we were all talking about Linux, who would've thought open source would find its way in through the data and a programming language?