As organizations of all sizes start to amass more data than ever in the hopes of making better business decisions faster, the quality of that data becomes increasingly critical. To help ensure that data quality, Talend this week added data preparation tools to its integration platform that promise to make it simpler to collaborate on the cleansing of data using a self-service Data Stewardship application.
In addition, the winter 2017 version of the Talend Data Fabric provides tighter integration with the Apache Spark 2.0 in-memory computing framework for analyzing Hadoop data.
Ashley Stirrup, chief marketing officer for Talend, says the Data Stewardship application is designed to make it simpler for the average business user to maintain the quality of the data being added to a larger Big Data lake. Without that capability, organizations will find themselves having to hire dedicated data managers to perform those same functions, says Stirrup.
“Organizations run the risk of finding themselves in specialization hell,” says Stirrup.
To enable the level of self-service required to maintain data quality, the Talend Data Stewardship application includes a pre-configured data dictionary that auto-recognizes the meaning of the raw data stored in the data lake. Organizations can also augment the dictionary with their own vocabulary, such as product codes or names, or crowdsource new data definitions from external sources or the Talend Community.
The combined effect of the new release is that Talend is moving to combine data integration and governance into a single integrated workflow. Rather than requiring organizations to acquire a separate set of data quality tools, Talend is making it simpler to essentially federate management of those functions between end users that better understand what specific data sets represent and the IT organization tasked with making it easier to combine data residing in multiple applications.
Of course, combining data governance and integration within the same set of workflow processes has been a long-time goal for many organizations. The issue now is to what degree the rise of data lakes that are shared by multiple departments across an organization will finally force organizations to address both integration and governance issues at the same time.