We talk about Big Data and, now, Small Data as if it’s always clear which one you’re dealing with: Big Data means volume, variety or velocity (or all three), and small data is structured data and everything else.
Of course, the reality isn’t always so binary, according to a panel of medical and pharmaceutical experts at the recent MIT Chief Data Officer and Information Quality Symposium.
SearchCIO.com covered the event, and, in a recent article, shared a few lessons from the panel’s trial-and-error approach to dealing with data variety. Mark Schreiber’s experience is a perfect example.
As the associate director of knowledge engineering at Novartis International AG, Schreiber headed a team that had to handle a variety of data, including unstructured medical and scientific research. The team’s first approach was to build a data warehouse.
If you’re a regular reader, you can probably see where this is headed. And you’re right: It didn’t work.
The problem comes down to how you’re using that data model. In more traditional scenarios — retail, for example — the warehouse’s data model looks like the business model. However, in research, the situation shifts: You’re now in the business of exploring your data model as a way of learning, then monetizing the findings.
Data warehouses don’t like this kind of willy-nilly, hippy-dippy stuff. Or, as SearchCIO put it:
By normalizing the data and bringing it into a data warehouse, researchers painted themselves into a corner, effectively constraining the kinds of questions they could ask as well as their “ability to innovate and learn new things,” he said.
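To see why normalizing into a warehouse "paints you into a corner," consider a toy sketch (not from the article; the field names are invented) contrasting a fixed, schema-on-write warehouse load with simply keeping the raw record and deciding the schema at query time:

```python
# Illustrative only: a fixed warehouse schema decided up front (schema-on-write)
# versus keeping raw records and choosing a schema at read time.

WAREHOUSE_SCHEMA = ("compound_id", "assay", "result")

def load_into_warehouse(record: dict) -> tuple:
    """Normalize a record to the fixed schema; anything else is dropped."""
    return tuple(record.get(col) for col in WAREHOUSE_SCHEMA)

raw_store = []  # schema-on-read: just keep the record as it arrived

record = {
    "compound_id": "NVS-001",
    "assay": "binding",
    "result": 0.42,
    "lab_notes": "unexpected secondary activity",  # unstructured field
}

warehouse_row = load_into_warehouse(record)
raw_store.append(record)

# The warehouse row has silently lost the field a future question might need:
print("lab_notes" in dict(zip(WAREHOUSE_SCHEMA, warehouse_row)))  # False
# The raw store can still answer that question later:
print(raw_store[0]["lab_notes"])  # unexpected secondary activity
```

The warehouse load isn't wrong for retail-style reporting; the trouble is that in research you don't know at load time which fields tomorrow's question will need.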
I seriously doubt he’s the only one making that mistake. If he were, I doubt Cloudera would have published a recent guest post in which Dr. Geoffrey Malafsky, founder of the PSIKORS Institute and Phasic Systems, argues for handling small data in Hadoop 2.
This piece is a tricky read, in part because it doubles as a pointer to an Inside Analysis event that happened last week. The meat of the piece, however, is this: Malafsky argues for using Hadoop 2 as a normalization tool for small data.
“Hadoop 2 makes Data Normalization for corporate Small Data a reality,” he writes. “With its inexpensive, innovative, speedy capabilities for ingesting, digesting, modifying, merging, and delivering data we can now apply Data Science to regular corporate data.”
“Now, we can marry the knowledge of all people, update in a realistic manner in days as part of normal business meetings and tempo, and synchronize operational data with varied analytical derivative sets: coordinated; correlated; visible; meaningful,” he adds.
He goes on to make the case for Cloudera’s solutions in particular, but the main point is that Hadoop 2 isn’t just for Big Data anymore. It can also be used to:
- Normalize small data.
- Synchronize operational data with analytical derivative sets.
- Manage the IT environment so it’s easier for business and IT users to retrieve corporate data.
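The first item, normalizing small data, is essentially a group-and-reduce job — the same pattern a Hadoop 2 MapReduce or Hive job would run, sketched here in plain Python so it stands alone. The records and field names are hypothetical, not from Malafsky’s post:

```python
# Hypothetical sketch of normalizing small data: map messy records to a
# canonical key, then reduce (merge) the duplicates - MapReduce-style,
# but in plain Python for illustration.
from collections import defaultdict

records = [
    {"customer": "ACME Corp", "region": "EMEA", "revenue": 100},
    {"customer": "Acme Corp.", "region": "EMEA", "revenue": 250},  # same entity, messy spelling
    {"customer": "Globex", "region": "APAC", "revenue": 400},
]

def canonical_key(record: dict) -> str:
    """Map step: derive one canonical key from a messy customer name."""
    return record["customer"].lower().rstrip(".")

# Shuffle step: group records by their canonical key.
grouped = defaultdict(list)
for r in records:
    grouped[canonical_key(r)].append(r)

# Reduce step: merge each group into a single normalized total.
normalized = {key: sum(r["revenue"] for r in rows) for key, rows in grouped.items()}
print(normalized)  # {'acme corp': 350, 'globex': 400}
```

At warehouse scale you’d express the same logic in Hive or Pig on YARN; the point of Malafsky’s argument is that Hadoop 2 makes running this kind of cleanup over ordinary corporate data cheap.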
That sounds great, but before you download any distribution of Hadoop 2, it’s a good idea to cover the basics first, especially data governance.
Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.