Coming Soon: Solutions that Target Data Lake Pain Points

    I think I can safely say that data lakes are probably the most controversial integration topic I’ve covered since SOA came on the scene. Ostensibly, the goal is to pool information into Hadoop, providing one big data repository for organizations.

    That has led to vendors “selling” data lakes, and some, including Gartner, are skeptical of these claims. They also caution against dumping data in without any plan, which some fear could lead to data swamps. Data swamps, as the name implies, are where data sinks and disappears, lost to time.

    After talking and reading about this topic for about six months, I think it’s safe to say that data lakes can work — if you follow these two steps:

    1. Don’t dump everything. It’s a lake, not an ocean. Data lakes should hold only relevant data for the project.
    2. Before you dump, incorporate a support structure (think metadata, data lineage, data governance) that will let you search and query the data effectively.
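
    To make that second step concrete, here is a minimal sketch in Python of the kind of record a metadata catalog might attach to each dataset before it lands in the lake. All of the class, field and dataset names are hypothetical; the point is simply that tagged, attributed data stays searchable, while untagged data becomes swamp:

    ```python
    from dataclasses import dataclass, field
    from datetime import date

    # A hypothetical, minimal metadata record: the "support structure"
    # (metadata, lineage, governance) attached to a dataset before it
    # is loaded into the lake.
    @dataclass
    class DatasetRecord:
        name: str
        source_system: str   # lineage: where the data came from
        owner: str           # governance: who is accountable for it
        ingested_on: date
        tags: list = field(default_factory=list)

    class Catalog:
        """An in-memory stand-in for a real metadata catalog."""
        def __init__(self):
            self._records = []

        def register(self, record: DatasetRecord):
            self._records.append(record)

        def search(self, tag: str):
            # Only datasets registered with tags can be found later.
            return [r for r in self._records if tag in r.tags]

    catalog = Catalog()
    catalog.register(DatasetRecord(
        name="web_clickstream_2015_02",
        source_system="apache_logs",
        owner="marketing-analytics",
        ingested_on=date(2015, 2, 13),
        tags=["clickstream", "web"],
    ))
    matches = catalog.search("clickstream")
    ```

    Products like Loom automate this bookkeeping at scale, but the principle is the same: capture source, ownership and tags at ingest time, not after the fact.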

    Vendors are already wise to this, of course, and I’m starting to hear about new solutions and products designed to tame data lakes. For instance, this week, Teradata announced Loom 2.4, which is designed to help manage Hadoop data lakes.

    “The latest version of Loom aims to solve these problems by tracking the lineage of data in the lakes, and integrating metadata so that information can be identified and maintained,” V3 reports.

    ZDNet reports that Loom 2.4 is also compatible with JavaScript Object Notation (JSON), a common data format for sensors, IoT devices, mobile devices and web browsers. It also supports dates, currency and other international data formats, and can be used to create data partitions in Apache Hive.
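
    Hive partitions map a column value to a directory in the warehouse (for example, /warehouse/readings/date=2015-02-12/), so queries that filter on that column only scan the matching directories. As a rough sketch of the idea, not of Loom's actual behavior, here is a few lines of Python, with made-up sensor records, that groups JSON readings by a Hive-style partition key:

    ```python
    import json
    from collections import defaultdict

    # Hypothetical newline-delimited JSON sensor readings, the kind of
    # data format the article mentions for sensors and IoT devices.
    raw = """
    {"sensor_id": "s-01", "reading": 21.4, "date": "2015-02-12"}
    {"sensor_id": "s-02", "reading": 19.8, "date": "2015-02-12"}
    {"sensor_id": "s-01", "reading": 22.1, "date": "2015-02-13"}
    """

    # Group records by the Hive-style partition key "date=<value>";
    # each key corresponds to one partition directory in the warehouse.
    partitions = defaultdict(list)
    for line in raw.strip().splitlines():
        record = json.loads(line)
        partitions[f"date={record['date']}"].append(record)

    for path in sorted(partitions):
        print(path, len(partitions[path]))
    ```

    A query restricted to one date then touches only that partition's files, which is why partitioning matters for large Hadoop data lakes.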

    While Loom 2.4 won’t be available until March, you can try out Loom Community Edition 2.3 now as a free download.

    Hitachi Acquiring Open Source Integration Company Pentaho

    Hitachi Data Systems (HDS) revealed this week that it will acquire open source data integration and analytics company Pentaho. The deal is expected to close by June.

    Pentaho CEO Quentin Gallivan will stay on as head of Pentaho, but will now report to HDS Senior VP Kevin Eggleston, who oversees the company’s Social Innovation and Global Industries divisions.

    The press announcement did not include many details about the acquisition, but did explain that Hitachi’s shared analytics platform will be a reference architecture “that brings together and orchestrates different technologies from Hitachi, partners and the open source community.”

    The move is being positioned as a Big Data play, but what isn’t these days? More precisely, it seems to align with Hitachi’s “ambitions in the machine data, IT and analytics space as the ‘Internet of Things’ technology continues to rapidly perpetuate,” reports RCC. That makes sense, given Pentaho’s open source integration tool and the unique integration challenges we’ll see with the Internet of Things.

    Data Management

    Deloitte Consulting predicts that integration and interoperability challenges will emerge as one of the top analytical trends in the coming year.

    “There needs to be more interoperability, more interconnectivity, more integration of all these devices, otherwise we’re just going to have these competing standards, competing formats and I think you’ll have disappointed customers in the end,” John Lucker, Deloitte Consulting principal and global advanced analytics and modeling market leader, said in a recent interview with IT Business Edge.

    Talend, Others Support Integration on Hadoop

    This week I featured Xplenty, which offers an integration platform built on top of Hadoop. Ciaran Dynes, vice president of products at Talend, pointed out to me that Talend also supports integration running on Hadoop.

    “Talend relies 100 percent on the Hadoop API for transformation, mapping, machine learning, security, deployment, job scheduling, everything,” he wrote. “Talend Big Data Studio generates native MapReduce, Pig and Hive code, which is optimized for Hadoop.”

    His full comment is worth reading if you’re looking at Hadoop integration capabilities. He notes that Talend also supports Spark, Spark Streaming and Storm.
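
    For readers unfamiliar with what “generates native MapReduce code” implies, here is a toy illustration of the MapReduce model itself, written in plain Python. This is not Talend’s generated code, just the map-then-reduce pattern that such tools compile integration jobs down to:

    ```python
    from itertools import groupby
    from operator import itemgetter

    # Classic word count expressed as explicit map and reduce phases.
    def map_phase(lines):
        # Map: emit a (key, 1) pair for every word seen.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Hadoop sorts and shuffles by key between the phases;
        # sorting here simulates that step before summing per key.
        for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                                  key=itemgetter(0)):
            yield (key, sum(count for _, count in group))

    counts = dict(reduce_phase(map_phase(["big data big lakes",
                                          "data lakes"])))
    ```

    The appeal of generating this code natively, as Dynes describes, is that the transformation runs inside the Hadoop cluster rather than pulling data out to a separate integration engine.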

    That’s an important point, and one I’d like to highlight now.

    Certainly, Hadoop integration is well supported by other vendors — even Xplenty didn’t deny that. Where it differs from some of the older solutions, such as Talend, is that it was built on top of Hadoop. Whether or not that really matters when other solutions are optimized for Hadoop as well, I can’t say. Xplenty co-founder and CEO Yaniv Mor said it did provide advantages, but he didn’t provide specifics.

    Loraine Lawson is a veteran technology reporter and blogger. She currently writes the Integration blog for IT Business Edge, which covers all aspects of integration technology, including data governance and best practices. She has also covered IT/Business Alignment and IT Security for IT Business Edge. Before becoming a freelance writer, Lawson worked at TechRepublic as a site editor and writer, covering mobile, IT management, IT security and other technology trends. Previously, she was a webmaster at the Kentucky Transportation Cabinet and a newspaper journalist. Follow Lawson at Google+ and on Twitter.
