Recently, I've been talking to a lot of vendors about their Big Data offerings. To ensure I'm comparing apples to apples as much as possible, I've taken to asking two questions: how they define Big Data, and what they see as the key elements of a Big Data platform.
Typically, vendors and experts tell me it's important to remember that Big Data involves more than ridiculously large stores of data. Big Data is often also defined by how fast it's created and by its many formats, which can include everything from unstructured text to Web or sensor logs to digital media. In other words, Big Data is about three Vs: volume, velocity and variety, as IBM and Informatica put it to me.
I shared in July how IBM's David Corrigan answered the Big Data platform question. More recently, I interviewed Informatica's CTO James Markarian about Informatica 9.1, which was released in June and is all about empowering Big Data. I asked him the same questions, with one difference: I asked what he saw as the key elements of a Big Data integration platform, which is a term Informatica's using.
Now, in some ways, this is a gimme question, because a vendor can always describe the platform according to its own capabilities. But, in general, most leaders I talk to - and CTOs especially - are good about answering with an eye toward the key elements, rather than just listing their own features. Also, even if the vendor does give you a bit of a biased answer, at least you know what's important to that vendor and where it sees its strengths. So, that's why I go ahead with the softball question.
Some of the same issues covered by IBM came up, of course. But overall, Informatica's take on a Big Data integration platform focuses more on where Big Data comes from and how it's managed in terms of quality and governance - which makes sense, since that plays to Informatica's strength as a data management company.
Markarian suggests five key elements that a Big Data integration platform should include:
Integration from a variety (there's that word again!) of sources, from mainframes to messaging systems. "These big data processing environments aren't the originators of the information," he said.
Data quality and data governance. You have to be able to trust the data if you're going to use it for analytics or decision making, he warned, and that means a Big Data platform has to support data quality and data governance. "We view data quality as being even more important in the big data environments because small problems with information and information-handling can result in large magnification of errors," he told me.
Text analytics and sentiment analysis. This one was new to me, but it means being able to put context around the data you're receiving. For instance, when processing millions of feeds, you need to be able to decide whether that information is something you care about. That may mean integrating the data with your MDM system or another enterprise application, he said.
Then you need to put that information into context. For instance, is the comment positive or negative? Does it represent a trend? "That's where something like the combination of text analytics combined with identity resolution so that you can match that information against your existing enterprise systems comes into play," he said.
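To make the idea concrete, here is a very rough sketch of the kind of triage Markarian describes: score an incoming comment for sentiment, then resolve the author against existing enterprise records to decide whether it matters. The lexicon, customer records, and matching rule are all invented for illustration; a real system would use statistical models and fuzzy matching rather than exact lookups.

```python
# Toy sentiment lexicons -- invented for illustration only.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "broken", "terrible", "slow"}

# Hypothetical MDM-style customer records keyed by a normalized name.
CUSTOMER_RECORDS = {
    "j. smith": {"customer_id": 1001, "segment": "enterprise"},
}

def sentiment(text):
    """Crude lexicon-based polarity: positive, negative, or neutral."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def resolve_identity(author):
    """Naive identity resolution: normalize the name, look it up."""
    return CUSTOMER_RECORDS.get(author.strip().lower())

def triage(author, text):
    """Combine the two signals: is this a known customer, and how do they feel?"""
    record = resolve_identity(author)
    return {
        "author": author,
        "polarity": sentiment(text),
        "known_customer": record is not None,
        "record": record,
    }
```

So a feed item like `triage("J. Smith", "The new release is terrible and slow")` comes back flagged as a negative comment from a known enterprise customer - the kind of match against existing systems Markarian is pointing at.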
Support for R. "The open source tendencies of the analytics world are leaning towards using languages like R and away from things like SPSS or SAS or Unica and saying, 'Hey, we're making a commitment to open technology,' and that right now seems to be R-language-based," he said. "If you look at the Hadoop processing environment itself, you look at the text analytics and you look at the traditional structured analytics around R language, those are going to be some of the staples of big data environments."
A new attitude about analysis and design schemas. Markarian said handling unstructured information will require a new attitude toward how you approach the data:
People don't necessarily want to presuppose an analytics schema for handling the information for their big data problems. Rather, they want to dump the data into these platforms and then use some sort of indexing scheme and more of a dynamic information discovery model, rather than predetermining what the schema is and only being able to answer the questions that that schema can reveal.
This, to me, was the most interesting of all the characteristics, because it points out something that's unique about Big Data: At this point, companies are approaching Big Data as something of an exploration, rather than a way to answer specific questions.
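The schema-on-read model Markarian describes, dumping raw data first and discovering structure at query time, can be sketched in a few lines. The field names and records below are invented for illustration; the point is only that nothing about the records' shape is declared before they land in the store.

```python
import json
from collections import Counter

# Raw, heterogeneous records stored as-is -- no schema declared up front.
# These example records are invented for illustration.
RAW_STORE = [
    '{"user": "a", "action": "click", "page": "/home"}',
    '{"user": "b", "action": "purchase", "amount": 19.99}',
    '{"sensor": "t-7", "reading": 21.4, "unit": "C"}',
]

def discover_fields(store):
    """Dynamic discovery: scan the raw records and count each field's occurrences."""
    counts = Counter()
    for line in store:
        counts.update(json.loads(line).keys())
    return counts

def query(store, **criteria):
    """Filter raw records by arbitrary field values chosen at query time."""
    return [
        record
        for record in (json.loads(line) for line in store)
        if all(record.get(k) == v for k, v in criteria.items())
    ]
```

Here `discover_fields` plays the role of the "indexing scheme": it tells you what questions the data can answer, rather than a predetermined schema limiting the questions you're able to ask.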
For more on Informatica's take on Big Data and what it's doing to make Big Data more accessible to Informatica customers, check out my full interview with Markarian.