If one of your goals in the New Year is to move toward using Big Data, then it’s time to move beyond the theoretical discussion to the nitty-gritty of implementations.
That doesn’t mean you should ignore your strategic goals, of course: It just means filling in the integration blanks between having Big Data and using Big Data.
TechTarget recently published a good starting point by excerpting chapter 10 from “Data Warehousing in the Age of Big Data,” written by Krish Krishnan, who is a Chicago-based executive consultant with Daugherty Business Solutions and a TDWI faculty member.
The two-part excerpt, like most technology books, is no-nonsense and on the dry side. Still, it’s laser-focused on the integration layer, so it’s definitely worth reading if you’re a CIO, an architect or a technology decision-maker who is likely to be involved in a Big Data project.
Krishnan breaks it down into two overarching steps. The first is the data-driven architecture, which covers all the head-work you need to do before you build. The second is the physical architecture implementation, which is where you get to buy and build things.
Each part has several action items. (You didn’t think it’d be that easy, did you?)
Everybody always wants to skip ahead to that last one, but don’t. We all know you have to do your homework first.
“Finalizing the data architecture is the most time-consuming task that, once completed, will provide a strong foundation for the physical implementation,” Krishnan writes. “The physical implementation will be accomplished using technologies from the earlier discussions, including big data and RDBMS systems.”
Then Krishnan hunkers down to the kind of down-and-dirty integration discussion we’ve wanted for years now, such as how you know what kind of tools you need, including whether a Big Data appliance would be a good fit for your organization.
First, a Word About Data Integration
In the past, integration was pretty straightforward: If you needed batch processing, you got ETL. If you needed incremental data loads, you looked at change data capture.
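The batch-versus-incremental distinction can be sketched in a few lines. This is a single-process toy, not anything from Krishnan’s book: the `orders`-style rows, the `updated_at` watermark field and the trivial transform are all illustrative assumptions.

```python
# Sketch of the two classic load styles, assuming hypothetical source
# rows shaped like {"customer": ..., "updated_at": <int timestamp>}.

def transform(row):
    # Trivial transform for illustration: tidy up the customer name.
    return {**row, "customer": row["customer"].strip().title()}

def batch_etl(source_rows):
    """Full batch load: extract everything, transform, load."""
    return [transform(r) for r in source_rows]

def incremental_load(source_rows, last_watermark):
    """Change-data-capture style: only rows changed since the last run."""
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in changed),
                        default=last_watermark)
    return [transform(r) for r in changed], new_watermark
```

The batch version reprocesses everything each run; the incremental version carries a watermark forward so each run touches only what changed since the last one.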
With Big Data, you’ll likely use a variety of integration approaches, as well as master data management, semantic technologies and so on.
That’s why Krishnan recommends you make the integration layer “data-driven,” which means your first order of business will be to identify all the different data types you want to use. That includes everything from enterprise data warehouse data to sensor and semi-structured data.
Don’t worry: Krishnan’s included a list to help you. It’s comprehensive and in no way what you would call “short.”
Only after you’ve figured that out should you move on to action item two: Designing the architecture. This does not by any stretch mean you’re almost ready to start buying things, by the way.
No: It means you start defining master data elements, data types, the data owners and the data stewards. In other words, this is where you start the data quality and data governance discussion.
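To make that concrete, here is one way the “define your master data elements, owners and stewards” step could look as a tiny registry. The field names and the registry class are my own assumptions for illustration, not Krishnan’s design.

```python
from dataclasses import dataclass

# Hypothetical sketch: each master data element records its type plus
# the people accountable for it, which is where governance starts.

@dataclass(frozen=True)
class MasterDataElement:
    name: str        # e.g. "customer_id"
    data_type: str   # e.g. "string", "integer"
    owner: str       # business owner accountable for the element
    steward: str     # day-to-day data quality contact

class MasterDataRegistry:
    def __init__(self):
        self._elements = {}

    def register(self, element: MasterDataElement):
        # Refuse duplicates: two owners for one element is a governance bug.
        if element.name in self._elements:
            raise ValueError(f"{element.name} is already registered")
        self._elements[element.name] = element

    def steward_for(self, name: str) -> str:
        return self._elements[name].steward
```

Even a sketch like this forces the data quality conversation: you can’t register an element without naming who owns it and who stewards it.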
Action step three is workload management, which Krishnan writes is the biggest need for processing Big Data. That’s a pretty long, dense read, so I’ll let you go to the source for that.
Finally, the last action step for building the data architecture is pinpointing your analytical processing requirements. You can almost taste the RFPs.
Now You Buy and Build
Most of the second excerpt deals with the physical parts of your integration layer and architecture, which includes discussions on:
- Data loading options, including MapReduce, appliances and batch processing
- Data availability and special issues related to Big Data
- Data volumes and problem areas such as retention, compliance and legal issues in acquiring data
- Storage performance
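As a taste of the first bullet, the MapReduce model behind many Big Data loading tools boils down to three phases. Real frameworks such as Hadoop distribute these across machines; this single-process word-count sketch just shows the map → shuffle → reduce shape, and the sample inputs are mine, not from the excerpt.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split):
    # Emit a (key, 1) pair for each word in one input split.
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each key.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big plans", "data warehousing"]
counts = reduce_phase(shuffle(chain.from_iterable(map_phase(s) for s in splits)))
# counts -> {"big": 2, "data": 2, "plans": 1, "warehousing": 1}
```

The appeal for loading large volumes is that the map and reduce phases are independent per key, so the framework can spread them across as many machines as the data demands.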
As I said, it’s not an easy read, but it is an extremely useful read. If you’d like to hear more about Big Data from Krishnan, I recommend following him @datagenius on Twitter.