Jim Kaskade, CEO of Infochimps, says companies are making mistakes when it comes to Big Data. Infochimps offers three types of analytics as a cloud service: Cloud Stream for real-time analytics, Cloud Queries for ad hoc or near real-time reports and queries, and Cloud Hadoop for large data sets. Kaskade explains further in a conversation with IT Business Edge’s Loraine Lawson.
Lawson: Tell me the elevator pitch about Infochimps?
Kaskade: We’re a cloud service provider, and we have cloud services that are purpose-built around three different types of data analytics for Fortune 1000 companies, allowing them to address existing, very important business problems.
Cloud Hadoop really allows people to look at trends over large historic time frames or large volumes of data. We’re talking about data sets that are larger than what is in traditional data warehouses. That means that if I’m going to be a cloud service provider, I have to have a cloud service that’s really fairly close to, if not co-located with, your data. To ask you to slurp your data into our cloud in Amazon is just not an option for the Fortune 1000s.
So we’re in tier four data centers in North America, where our customers’ data already resides and where it’s co-located next to our cloud service. That’s the only way that we can actually analyze large volumes, as well as large velocities, of data that are highly governed by companies. Every vertical in the Fortune 1000 has data that’s highly secure and requires a lot of governance, which means it can’t be running in a public cloud.
Lawson: You said people are making mistakes, more mistakes than we think, when it comes to doing Big Data. What do you mean, and what types of mistakes are you seeing?
Kaskade: I think the biggest mistake is people are diving in and making investments in these technologies, creating what I’ll call a sandbox. That means they don’t necessarily know what the use case is or they haven’t really focused their energy behind a single use case. They’ve just said, “We know we can use this across any use case. Let’s create a Big Data platform.”
The result is you boil the ocean. The projects get started, budgets get allocated and then you embark on this journey that in most cases takes you about 24 months before you realize you might have made some wrong choices along the way and that you truly don’t have the talent internally to help you really execute.
We did a survey of 300 companies. Basically, I’ll give you the rules of thumb: 100 percent of the Big Data projects go over schedule, 100 percent go over budget, and half of them fail.
This is not just what we’re seeing in terms of customers that we’re working with directly, but also out of a CIO survey that we did of 300 others who are not customers. Over half, 58 percent, attributed failure to an inaccurate scope of the project. Well, did you even scope it in terms of a business use case to begin with?
Forty-one percent said technical roadblocks contributed to their failure. This comes down, to a large extent, to the fact that the technology is still pretty complex. It’s very immature and there’s a lot of it.
Hadoop, as difficult as it is and as challenging as it is to get a Hadoop cluster up and running, is still only 20 percent of your solution. So it’s probably one of the most challenging technologies to administer yourself because it’s not been made easy to use yet. But then again, when you do master it, it’s only 20 percent of your solution.
So this 41 percent technical roadblock is beyond just Hadoop. It’s NoSQL, where there are 150 NoSQL databases to choose from, and it’s real-time technologies such as stream processing.
The third is that 39 percent attributed the failure of Big Data projects to the fact that the data is siloed and there’s not a lot of cooperation in gaining access to that data. Now that is the oldest problem in the history of IT.
Back in the day when I started at Teradata, I remember if you were an Oracle or Teradata DBA, you could name your price because not a lot of people knew how to get these relational databases working and tuned. The same thing happens today: because the technology is so challenging, a handful of people have a lock on it. So the people who own the data schemas and the databases, and who basically hold the keys to the data sets within the organization, still exist.
When you bring in this new technology and you say, “I have Hadoop and I want to populate the cluster with data,” then you’ve got to arm wrestle with all the various organizations that still have people who control the data sets. So that speaks to the fact that a lot of Hadoop clusters are empty. They’re empty because companies are still struggling to get data into them because of these politics.
Those are the three reasons: inaccurate scope, technical roadblocks and siloed data with lots of politics wrapped around it. Those are probably the three biggest mistakes, based on our direct discussions.
Lawson: How would they know if they’re in over their heads with the scope? I mean, are there warning signs for, “Wow, we’re going to be in over our heads”?
Kaskade: For us, as a cloud service provider, you know who the best customers are? They’re the ones that have failed already, because they have learned the hard way that, “We totally underscoped this.”
They didn’t know as much as they needed to about the data, what data sources were required to answer the business problem, or the types of analytics to apply to the data sets.
When you look at a typical design pattern for Big Data, of course it starts with sourcing the data.
Data integration has always been the ugliest, yet most important, aspect of data infrastructure, but it’s given the least amount of love in terms of appreciation. So getting the data integrated into your analytics environment is always underestimated. Then, once you’ve got it into the environment, there is the process of understanding the structure that you want to glean from that data or those data sets.
So there’s a lot of work in terms of understanding what your data models are going to become as you evolve the questions, queries or analytics you’re operating against that data.
I generally use the rule of thumb that about 80 percent of your time is data wrangling and 20 percent of your time is actually the analytics and presentation of it.
That’s the way it was back in the day of data warehousing, when you had to do your ETL processes and all of your data cleansing and make sure you had your schemas in place. All of the data munging is 80 to 90 percent of the time, yet all of the value is associated with that 20 percent, which would be the analytics and the presentation or reporting.