One of the challenges technologists are still working out is how best to put unstructured data to work for everyday end users.
It’s particularly a head-scratcher for business intelligence analysts, who want to combine the unstructured data stored in Hadoop with the structured data we all know and love.
John O’Brien recently took an in-depth look at some possible solutions in an excellent Inside Analysis piece.
It seems that BI analysts are still working out the details when it comes to combining Hadoop’s unstructured data with the structured data we find in traditional enterprise BI tools and data warehouses. The big issue: What’s the best BI integration architecture to use with Hadoop?
As O’Brien explains, BI experts are focusing on three specific architectures:
- The first generation of integration architecture. Data stored in Hadoop is accessed or extracted to add context to existing database or BI tools. Given that he calls it the “first generation,” I’m guessing this is a typical first step for making unstructured data meaningful to BI systems.
- An architecture that utilizes Hadoop as the data warehouse staging area. I’ve heard repeatedly that this is the most common use of Hadoop for BI. A variation on this is to use Hadoop as a sandbox for playing with the data without changing it at its source. He explains the tools and software you can use to do this.
- An architecture that uses Hadoop for your entire data warehouse. “Here context is defined only [by] the abstraction layer, and satellite data marts are used for specialized applications and BI workloads such [as] OLAP,” O’Brien writes. But he also adds that this architecture is the “furthest out for most companies and data warehouses.”
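To make the staging-area idea in the second architecture a bit more concrete, here’s a minimal sketch in Hive’s SQL dialect. The table name, schema, and HDFS path are all hypothetical; the point is that an external table lets analysts query files staged in Hadoop without changing the data at its source:

```sql
-- Hypothetical example: expose raw log files staged in HDFS to SQL-speaking
-- BI tools without moving or altering the underlying files.
CREATE EXTERNAL TABLE staged_weblogs (
  ip      STRING,
  request STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/staging/weblogs/';  -- the raw files stay untouched in HDFS

-- From here, analysts can explore the data in place, or extract a cleaned
-- subset into the warehouse proper, e.g.:
-- INSERT INTO warehouse_db.weblogs SELECT * FROM staged_weblogs;
```

Because the table is declared EXTERNAL, dropping it removes only the metadata, not the files, which is what makes this pattern safe as a sandbox.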
While this piece is not overly technical — I could definitely follow it — it’s not a lightweight read, either, with specific solutions and suggestions for how you can actually make Hadoop work with your BI tools and enterprise applications. This makes it an excellent resource for BI analysts or any IT manager who wants to move beyond the “talking about Hadoop” stage.
One solution he devotes a good amount of space to is HCatalog, which really sounds like a game-changer for BI professionals and Hadoop.
Yahoo developers created HCatalog to add two important capabilities to Hadoop data stores:
1. A table abstraction over Hadoop data stores, making them more familiar and easier to use
2. A metadata service for Hadoop
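As a rough sketch of what that table abstraction buys you, a table registered once in HCatalog’s shared metastore becomes visible by name to Hive, Pig, and MapReduce alike. The schema below is hypothetical:

```sql
-- Hypothetical table, defined once in HCatalog's shared metastore.
-- Tools then address it by name instead of by file path and format.
CREATE TABLE pageviews (
  userid STRING,
  url    STRING
)
PARTITIONED BY (dt STRING)
STORED AS RCFILE;
```

Once the table is defined, a Pig script can read it through HCatalog’s HCatLoader, and MapReduce jobs can use its input format, so users no longer need to know the file layout — or write raw MapReduce code — to get at the data.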
“Data stored in Hadoop is no longer confined to only those few who possess MapReduce skills. A larger portion of the BI community can explore data and allow their definitions to be leveraged in HCatalog for many more users of traditional tools,” O’Brien writes. “HCatalog is a key milestone in the maturing of Hadoop for the industry and helps bridge user access through the use of metadata.”
If you’d like more ideas on how you can get started with Hadoop, I explored some easy options in “7 Enterprise-friendly Ways of Dealing with Big Data,” which appeared on our sister site, and more recently here on IT Business Edge in “Eight Ways Any IT Division Can Use Hadoop.”