It’s not easy making sense out of a lot of “noise” in data, which is why I think you see conflicting answers about how to “best” approach large datasets. It’s like two different approaches to fishing.
Some experts say you need to relax, take a tackle box, see what bites and adjust accordingly. Others are much more Type A: They come with bait designed to snare largemouth bass and they fish accordingly.
It’s a tough balancing act. On one hand, you don’t want to think so small you miss what Big Data has to offer. On the other, you also don’t want to “dig tons of dirt to discover an ounce of gold,” as IBM recently described it.
It should be an academic question because ideally, you’d be able to do both. But when it comes to data, many organizations have it and want it but can’t manage it. We’re like Tantalus, standing in a river of water with not a drop to drink.
Consider these telling stats from the Harvard Business Review: 85 percent of organizations reported that they have Big Data initiatives planned or in progress … but only 17 percent rank their ability to use the data and analytics to transform their business as more than adequate or world-class.
“The majority of companies are on the sidelines because they think they can’t readily access the data they have, they don’t have in house tools or talent to analyze it and don’t have the ability to put the data to use anyway,” writes Matthew Crowl in a recent CAN blog post.
One answer may be the graph database, which uses nodes, properties and edges rather than traditional indexing to store data. In other words, it allows you to create a graph of connections between people, objects and data.
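To make the vocabulary concrete, here is a minimal sketch of the node, property and edge idea in plain Python. It is a toy in-memory structure, not InfiniteGraph's API; the class name `PropertyGraph`, its methods and the sample records are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Toy in-memory property graph: nodes and edges both carry properties."""
    nodes: dict = field(default_factory=dict)   # node id -> properties
    edges: list = field(default_factory=list)   # (source id, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, target, **props):
        # Each connection is stored as its own record with its own properties.
        self.edges.append((source, target, props))

    def neighbors(self, node_id):
        """Return ids directly connected to node_id, in either direction."""
        found = set()
        for source, target, _ in self.edges:
            if source == node_id:
                found.add(target)
            elif target == node_id:
                found.add(source)
        return found


# Hypothetical sample data: two people, one phone number, two relationships.
graph = PropertyGraph()
graph.add_node("alice", kind="person")
graph.add_node("bob", kind="person")
graph.add_node("555-0100", kind="phone")
graph.add_edge("alice", "555-0100", relation="owns")
graph.add_edge("alice", "bob", relation="called", minutes=12)
```

Asking `graph.neighbors("alice")` returns both the phone number and the person she called, which is the "graph of connections between people, objects and data" in miniature.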
I recently received a briefing from Objectivity on its Oct. 1 release of InfiniteGraph version 3.0. The briefing included a demonstration of how police might use two phone numbers (one from an arrested suspect, the other a number that person called) to detect crime patterns, including drug rings, fraud, money laundering and even terrorism.
“Relational databases are not very good at relationships … They’re great at the things they do, but when you get a lot of relationships in the data, it all becomes very clumsy and the queries become more complex because everything runs more slowly,” explained Leon Guzenda, the CTO and one of Objectivity’s founders. “A graph database is all about the real connections. Imagine a map. So we have cities. So the nodes in the graph are the cities and then the connection between the cities are technically called edges, so each road, each waterway is represented by an edge.”
For example, if you wanted to map a route from San Francisco to New York based on shipping preferences such as weight, length and cargo content, a graph database could tell you the best, cheapest route while eliminating any options that couldn’t handle the cargo.
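The routing idea can be sketched with a standard shortest-path search. The example below uses Dijkstra's algorithm in plain Python and simply drops any link that cannot carry the cargo before searching. It is an illustration under stated assumptions, not how InfiniteGraph does it: the function name `cheapest_route`, the single tonnage constraint, and every cost and limit in `ROUTES` are invented for the example.

```python
import heapq


def cheapest_route(edges, start, goal, cargo_tons):
    """Dijkstra's algorithm over the links that can carry the cargo.

    edges: iterable of (city_a, city_b, cost, max_tons), treated as two-way.
    Returns (total_cost, [cities]) or None if no usable route exists.
    """
    adjacency = {}
    for a, b, cost, max_tons in edges:
        if max_tons < cargo_tons:
            continue  # eliminate links that cannot handle the cargo
        adjacency.setdefault(a, []).append((b, cost))
        adjacency.setdefault(b, []).append((a, cost))

    queue = [(0, start, [start])]
    settled = set()
    while queue:
        total, city, path = heapq.heappop(queue)
        if city == goal:
            return total, path
        if city in settled:
            continue
        settled.add(city)
        for nxt, cost in adjacency.get(city, []):
            if nxt not in settled:
                heapq.heappush(queue, (total + cost, nxt, path + [nxt]))
    return None


# Hypothetical network: costs and tonnage limits are made up for illustration.
ROUTES = [
    ("San Francisco", "Denver", 400, 20),
    ("Denver", "Chicago", 300, 20),
    ("Chicago", "New York", 250, 40),
    ("San Francisco", "Dallas", 500, 40),
    ("Dallas", "New York", 600, 40),
]
```

With a 10-ton load, `cheapest_route(ROUTES, "San Francisco", "New York", 10)` picks the cheaper northern route through Denver and Chicago. Raise the load to 30 tons and the Denver links drop out, so the search falls back to the Dallas route.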
While a graph database allows you to ask specific questions, it is also very open-ended in the further relationships it can present. For instance, the demo traced the connections between the two phone numbers and all the calls in between, but it could also let you explore outward by degrees of separation: What if you want to look at the circle of calls within five degrees versus seven?
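Exploring "degrees" from a starting point is, at its simplest, a breadth-first search with a cutoff. The sketch below shows that idea in plain Python. The function name `within_degrees` and the call records in `CALLS` are hypothetical, and a real graph database would run this against stored data rather than an in-memory list.

```python
from collections import deque


def within_degrees(calls, start, max_degrees):
    """Breadth-first search: map each reachable number to its degree of separation."""
    adjacency = {}
    for a, b in calls:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    degree = {start: 0}
    queue = deque([start])
    while queue:
        number = queue.popleft()
        if degree[number] == max_degrees:
            continue  # do not expand past the requested radius
        for other in adjacency.get(number, ()):
            if other not in degree:
                degree[other] = degree[number] + 1
                queue.append(other)
    return degree


# Hypothetical call records (caller, callee); the letters stand in for phone numbers.
CALLS = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "F")]
```

Calling `within_degrees(CALLS, "A", 2)` stops at C, while widening the radius to 5 pulls in D and E as well. Changing one argument is all it takes to compare a narrow circle with a wider one.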
Patterns quickly become apparent, Guzenda explained. A circle of connections that only call each other is likely to indicate a terrorist cell, for instance, he said.
“They don’t call anyone else in the world — that’s it, a very closed ring, you’re probably looking at a terrorist cell,” Guzenda said. “If there’s 30 or 40 people involved and some of them call others and a few people in this group call a lot of people in this group, you’re probably looking at a drug ring and dealers and so on.”
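The "closed ring" pattern Guzenda describes maps onto a basic graph idea: a small connected component, meaning a group of numbers linked to each other and to nobody else. The sketch below finds such groups in plain Python. It is a toy illustration of the structure only, not an investigative method or anything from the InfiniteGraph product; the function name `closed_groups`, the size threshold and the records in `RING_CALLS` are all assumptions.

```python
def closed_groups(calls, max_size):
    """Return connected components of the call graph with 2..max_size members.

    A component is closed by construction: no member has a call record
    linking it to any number outside the component.
    """
    adjacency = {}
    for a, b in calls:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    groups = []
    for number in adjacency:
        if number in seen:
            continue
        # Depth-first walk to collect everything reachable from this number.
        component = set()
        stack = [number]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        if 2 <= len(component) <= max_size:
            groups.append(component)
    return groups


# Hypothetical records: X, Y and Z only call each other; P through T form a larger group.
RING_CALLS = [
    ("X", "Y"), ("Y", "Z"), ("Z", "X"),
    ("P", "Q"), ("Q", "R"), ("R", "S"), ("S", "T"), ("T", "P"), ("P", "R"),
]
```

Here `closed_groups(RING_CALLS, 3)` returns only the X, Y, Z trio. The larger, busier group would need a different test, such as counting how many calls each member makes, to match the second pattern in the quote.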
The original system took anywhere from four to 20 or more hours to run a query. In the demo, Guzenda ran it within mere minutes.
“I pushed out seven degrees here. That would take Oracle about 10 hours,” Guzenda said. “That meant after three days, they’d have to release the suspect.”
In the marketing world, relationships can be used to expand something like MCI’s friends and family calling plan.
Most organizations have graph data that could be used, whether it’s logistics data, operational data, financial transactions or other hierarchical data, he explained.
“The data is in there, it’s just a matter of having a way to represent it and look at it properly,” he said.
The Oct. 1 release of InfiniteGraph v. 3.0 includes boosts to scalability and performance in distributed environments. A recent test scaled InfiniteGraph 3.0 to more than a billion nodes and edges in less than 30 minutes, according to Objectivity.
The new version also includes faster querying using flexible placement algorithms, which also improve index performance; distributed parallel navigation; a faster filtering tool for the graph view; support for predefined navigation policies; and a free 60-day trial period.
If you’re interested in learning more about graph databases as an option for exploring Big Data, Wikipedia maintains a list of other graph database solutions.