Creating a data integration sandbox and integrating with Hadoop data through vendor products are among the best practices emerging for the Hadoop ecosystem, according to a recent TDWI report.
TDWI Data Management Research Director Philip Russom wrote the report on the heels of a survey (registration required to download) that found 63 percent of organizations expect to deploy the Hadoop Distributed File System (HDFS) within three years. Adoption currently hovers around 10 percent, so if organizations follow through, reaching that 63 percent will require a significant ramp-up.
But Russom says it will happen, thanks in part to these trends:
- Organizations want more business value from Big Data, with 70 percent saying Hadoop is a real opportunity if the data is leveraged through analytics.
- Hadoop is a scalable and cost-effective complement to existing data warehouse platforms, data integration and analytics tools.
- Users are moving toward multi-platform environments for data warehousing, data integration and analytics.
“Most of the organizations adopting Hadoop are completely new to it, so they need to educate themselves quickly about emerging best practices,” Russom states in the best practices report. “The checklist of best practices presented here can help users make sustainable decisions as they plan their first Hadoop deployments.”
One of the challenges with Hadoop is finding someone who can write code to process data using MapReduce. Until recently this was a major barrier to adoption, but three changes are reducing the need for MapReduce coding (for a sense of what that hand-coding involves, see the sketch after this list):
- Most vendors now offer a tool or platform that lets you interface with HDFS through a modern GUI rather than relying on raw open-source Hadoop tools or hand-coding data integration, migration or processing jobs.
- Hadoop itself is evolving toward easier use, thanks to the YARN architecture and Stinger, a community effort to bring faster, fuller SQL support to Hadoop through Hive, bridging Hadoop’s NoSQL world and the structured SQL world.
- More vendors are eliminating code generation for Hadoop interfaces altogether by building interfaces that run natively on Hadoop tools, including MapReduce, without generating code, an approach TDWI calls an “emerging best practice.”
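To see why hand-coded MapReduce was such a barrier, consider the classic word-count job, shown below roughly as it appears in the Apache Hadoop tutorial. Even this trivial aggregation takes dozens of lines of Java:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With Hive, the SQL-on-Hadoop layer the Stinger effort is improving, the equivalent aggregation is a single GROUP BY query, which is why GUI- and SQL-based tooling lowers the barrier so dramatically.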
What I love about this report is that it really does fill in those gaps you might have when you think about deploying Hadoop. For instance, it outlines best practices for extending your data warehouse architecture with Hadoop. Among the recommendations:
- A data integration sandbox that lets integration specialists test large joins, aggregations and transformation logic (see the sketch after this list)
- A data lake, which can be a logical data warehouse or virtual data warehouse, but is essentially a reservoir for business data collected for analytics
- An analytics sandbox
- A data staging area
- Operational data stores
- An online data archive
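To make the sandbox idea concrete, here is a minimal sketch of the kind of join-and-aggregation test an integration specialist might run against a sandbox cluster through HiveServer2’s standard JDBC interface. The host, credentials, database and table names are all hypothetical; substitute your own cluster and schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Requires the hive-jdbc driver on the classpath.
public class SandboxJoinTest {
  public static void main(String[] args) throws Exception {
    // Hypothetical sandbox host and database; HiveServer2 listens on port 10000 by default.
    String url = "jdbc:hive2://sandbox-host:10000/integration_sandbox";
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(url, "etl_tester", "");
         Statement stmt = conn.createStatement()) {
      // A representative large join plus aggregation to validate
      // before promoting the logic to the production warehouse.
      ResultSet rs = stmt.executeQuery(
          "SELECT c.region, SUM(o.amount) AS total_sales " +
          "FROM orders o JOIN customers c ON o.customer_id = c.customer_id " +
          "GROUP BY c.region");
      while (rs.next()) {
        System.out.println(rs.getString("region") + "\t" + rs.getDouble("total_sales"));
      }
    }
  }
}
```

Because the sandbox runs on the same Hadoop engine as production, logic that behaves well here should scale the same way once promoted.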
If you’re among that 63 percent planning a Hadoop deployment, you’ll find this 10-page checklist of eight Hadoop best practices an essential resource. It’s available for free download with basic user registration, and there’s also a webinar in which Philip Russom discusses the report.