My husband loves roller coasters. When he rides them, he doesn't just scream, he yells, "More and faster." Meanwhile, I'm too terrified to even breathe.
I mention this because it seems to me "more and faster" is the data integration slogan for a lot of organizations these days. According to Ventana Research, organizations are already integrating on average five or more different data sources, and that number is expected to rise as more applications are delivered as services. And then, there's all this talk of Big Data.
Couple this with the demand for better analytics based on real-time or near-real-time information, and you can see why Josh Rogers thinks "data integration acceleration" is a concept organizations will embrace.
I spoke with Rogers recently about Syncsort's new solution called DMExpress for Hadoop. He explained how DMX Hadoop edition, as well as the basic DMX solution, can cut data integration costs. According to Rogers:
We can offer tremendous throughput at scale with limited hardware resources, and on top of that, we have a self-tuning environment that is resource-aware. So the way we drive cost out of data integration is by allowing you to move a lot more data in a much shorter period of time. More than that, you use a fraction of the resources: generally 25 percent of the CPU, generally 15 to 20 percent of the storage, and we can generally cut the labor involved in achieving those outcomes significantly because of our self-tuning capabilities.
That efficiency is achieved because Syncsort has a core engine that is aware of the memory, storage, and chipset available to do the processing. It also understands the data structures and the tasks that need to be performed, Rogers explained; hence, it can optimize a job at runtime without the need for human intervention. He said:
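To make the idea of a resource-aware, self-tuning engine concrete, here is a minimal sketch of the general technique: the code inspects how much memory it has been given and picks a sort strategy accordingly, with no manual tuning. The thresholds, names, and size estimate are purely illustrative assumptions, not Syncsort's actual logic.

```python
import heapq

def estimate_size(records):
    # Crude size estimate: total string length (illustrative only).
    return sum(len(r) for r in records)

def self_tuning_sort(records, available_memory):
    """Pick a sort strategy from the resources at hand."""
    if estimate_size(records) <= available_memory:
        # Data fits in the memory budget: plain in-memory sort.
        return sorted(records)
    # Otherwise sort fixed-size chunks and k-way merge them,
    # the way an external sort would after spilling to disk.
    chunk_len = max(1, available_memory // 8)  # arbitrary heuristic
    chunks = [sorted(records[i:i + chunk_len])
              for i in range(0, len(records), chunk_len)]
    return list(heapq.merge(*chunks))
```

The point is the decision, not the sort: the engine, not an engineer, chooses the execution plan based on what the hardware can hold.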
When you look at the labor savings and the hardware savings and the server savings and the storage savings, we've seen projects where we could have driven as much as 70 percent of the cost structure out of those projects.
So what does Syncsort's technology add to Hadoop, which uses distributed computing to store and process massive amounts of data? Said Rogers:
I think that there are opportunities to make the workload that happens on each node in the cluster more efficient, meaning use less storage and less CPU to do the same work, and that's where we contribute. The second thing we hear from our customers is that the skill set required to both write and manage MapReduce is fairly significant and sophisticated. If we can make it more accessible using a graphical user environment, that should drive labor costs down.
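For readers who haven't written MapReduce, the snippet below shows the kind of hand-rolled map/shuffle/reduce plumbing a word count requires, which is exactly the code a graphical environment would generate for you. This is a single-process Python simulation I've written for illustration; on a real cluster, Hadoop distributes the map and reduce phases across nodes.

```python
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for each word in the line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Shuffle step: group values by key, as Hadoop does
    # between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce step: sum the counts for one word.
    return key, sum(values)

def word_count(lines):
    pairs = (kv for line in lines for kv in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even this toy version requires the developer to think in keys, values, and grouping semantics, which is the learning curve Rogers is describing.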
Syncsort isn't the only company focused on making Hadoop and MapReduce more accessible. IBM, Informatica, SnapLogic and others have announced solutions designed to support Hadoop this summer. I'll be writing more about those in future posts.
If you'd like to hear more about Syncsort's new solution, as well as what the company has proposed to contribute to the open-source Hadoop framework, check out my recent Q&A with Rogers.