Talend Releases Open Source Data Cleaning Tool

For years, my cousin, Scottie, would visit from Albuquerque, New Mexico, every summer. One summer, she announced that she now went by "Margaret," her first name. Margaret turned into Maggie, then Mags, then Maggie again. At the ripe age of 25, she offered us dispensation: we were allowed to use Scottie again, which was a relief, since I'd never been able to give up calling her Scottie in the first place. I'm not sure what prompted all the aliases, but I imagine her casual name changes would cause quite a few database headaches today.


Last Wednesday, Talend unveiled a new offering to deal with exactly these kinds of data quirks. The new product, Talend Data Quality, is certainly not the first such tool, though Talend does contend it's the first product that combines "data integration, data profiling, and data quality in a single open source suite," according to this TMCnet article.


Regardless of its standing in the pantheon of firsts, it's an open source tool for finding and fixing dirty data, including nicknames and duplicated records. It can also confirm addresses, phone numbers, ZIP codes and abbreviations by comparing the data with reference information from providers such as the U.S. Postal Service and mailing databases in other countries, according to the press release. Talend even promises the tool can pick up on variations, including, I was pleased to note, the incarnations of "Margaret."
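
To make the nickname problem concrete, here is a minimal Python sketch of the general idea: map known nicknames to one canonical form, then group records that match on name and ZIP code. This is not Talend's code or API. The nickname table, the function names and the sample records are all invented for illustration, and a real tool would rely on much larger reference lists and fuzzier matching.

```python
# Hypothetical nickname table (an assumption for this sketch);
# a real data quality tool would use a far larger reference list.
NICKNAMES = {
    "maggie": "margaret",
    "mags": "margaret",
    "meg": "margaret",
    "peggy": "margaret",
}


def canonical_name(first_name):
    """Trim and lowercase a first name, then map known nicknames to one form."""
    key = first_name.strip().lower()
    return NICKNAMES.get(key, key)


def find_duplicates(records):
    """Group records that share a canonical first name, last name and ZIP code."""
    groups = {}
    for record in records:
        key = (
            canonical_name(record["first"]),
            record["last"].strip().lower(),
            record["zip"].strip(),
        )
        groups.setdefault(key, []).append(record)
    # Only groups with more than one record are likely duplicates.
    return [group for group in groups.values() if len(group) > 1]


# Invented sample data, not taken from any real database.
records = [
    {"first": "Margaret", "last": "Smith", "zip": "87101"},
    {"first": "Maggie", "last": "Smith", "zip": "87101"},
    {"first": "Mags ", "last": "smith", "zip": "87101"},
    {"first": "Scott", "last": "Jones", "zip": "87102"},
]

duplicates = find_duplicates(records)
```

Here the three "Margaret" variants land in a single group, while the unrelated record stays out. A middle-name alias like "Scottie" would slip past a lookup this simple, which is part of why dedicated tools exist.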


Yves De Montcheuil, vice president of worldwide marketing for Talend, told Network World that the tool can also be used for cleaning product data. That article also includes a nice description of the tool's drag-and-drop graphical interface.


It'll be interesting to see how data quality offerings fit in with the growth of master data management. Perhaps tools like this will appeal to smaller companies that can't afford MDM's hefty price tag.


A recent TechTarget article reports that mid-market companies are increasingly concerned about data integrity. Michael Dortch, senior analyst at Boston-based Aberdeen Group, told TechTarget that many mid-market companies are adopting tiered storage and building structured document repositories to address data quality problems.


Dortch also said there may be SaaS versions of some MDM products announced in December, which could make for an intriguing shake-up in the data quality space.


For now, however, those who can't afford MDM or proprietary data quality tools might want to look into Talend Data Quality, which will be released at the end of September under the GPL. Subscription fees for tech support and other services will start at $15,000 per year.


Talend CEO Bertrand Diard told IT Business Edge's Lora Bentley in March that the open source model gives customers an "insurance policy" against vendor changes in the data integration space:

"... the recent M&As in the space tremendously help the open source cause. Clients are tired of being victims of product strategy shifts, price list changes and other demands of proprietary vendors."