When we talk about Hadoop and integration, are we talking about semantic interoperability or just statistically weighted string matching? John O’Gorman asked this after reading last week’s news round-up.
I’d mentioned a number of new releases, all primarily around integration, which prompted the question. O’Gorman explained further:
My understanding is that users of Hadoop can dump huge amounts of pretty much any kind of data into its hoppers without worrying too much about integrity constraints, and get some basic patterns of relationship. If that's the case, and I understand its appeal, wouldn't the tradeoff be between the ease of use vs. the precision of the conclusions coming out the other end?
It’s a pretty technical question, but what he’s really speaking to is how reliable or useful the data will be when it comes out of Hadoop.
As he noted in a follow-up comment, when he reports statistics to his executives, he can’t “imply anything.” That means he needs a precise count of customers. He wrote:
Since so much of executive information reports depend on precise counts (of customers, for example) and percentages using that count as a denominator, I can't afford to miss the fact that John Smith appeared five times and was in fact five different people or, alternatively is spelled five different ways and is the same person. There are very strong tendencies, when technology is hyped as much as Big Data has been, to commingle the language of 'how' it works to an examination of 'why' (or whether) it is delivering useful results.
So, clearly, there’s a strong business driver for ensuring data quality with Big Data at his organization. I suspect, however, he’s not alone in wondering just how useful Big Data will be for analysis that depends on pristine data, where matching records takes more than simple string comparison.
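To make the John Smith problem concrete, here is a minimal sketch of why exact string matching can miscount customers. This is my own illustration, not drawn from any of the vendors' products; the names, the use of Python's standard-library difflib, and the 0.8 similarity threshold are all arbitrary choices for the example:

```python
from difflib import SequenceMatcher

# Five records that naive exact matching treats as five customers.
records = ["John Smith", "Jon Smith", "J. Smith", "John Smyth", "Smith, John"]

def similarity(a, b):
    # Ratio in [0, 1]; a crude proxy for "is this the same person?"
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact matching: every distinct spelling counts separately.
exact_count = len(set(records))

# Fuzzy matching: greedily cluster names whose similarity to a cluster's
# first member exceeds the (arbitrary) threshold.
clusters = []
for name in records:
    for cluster in clusters:
        if similarity(name, cluster[0]) > 0.8:
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(exact_count)    # 5 distinct strings
print(len(clusters))  # fewer clusters; the exact number depends on the threshold
```

Real entity resolution layers in normalization, blocking, and survivorship rules on top of this; the point is only that the denominator in an executive report changes depending on how matching is done.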
But heck if I know how well these products handle data quality and integration on semantics. As I try to point out regularly, I only play an integration “expert” on the Internet. In real life, I’m just somebody who asks questions of much smarter people, and then tries to translate their answers into English.
So I did what I do best. I asked for responses via Twitter and emailed the vendors mentioned in the post for more information.
Hadoop, as it turns out, is a great place to dump data if you need to be fast — but if you want to be efficient, or in this case, efficient and effective, then Hadoop alone isn’t going to cut it.
Data warehouse (DW) architect Mark Madsen, a principal with consultancy Third Nature, makes that clear in a recent TDWI ebook, "Big Data Integration."
“Hadoop is brute force parallelism,” he told TDWI. “If you have some data such as user IDs and you want to get some other data via a look-up, you [have to] write your own join in code. There’s no facility for single-row look-up, so you [have to] write your own joins and deal with sorting, also on your own.”
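To illustrate what "write your own join in code" means in practice, here is a toy Python stand-in for a reduce-side join, the pattern a Hadoop programmer would otherwise hand-code in Java. The datasets and field names are invented for the example; Hadoop supplies only the map, sort, and reduce machinery, and the join logic itself is yours:

```python
from collections import defaultdict

users = [("u1", "Alice"), ("u2", "Bob")]                      # (user_id, name)
clicks = [("u1", "/home"), ("u1", "/buy"), ("u2", "/home")]   # (user_id, page)

# "Map" phase: tag each record with its source so the reducer can tell them apart.
mapped = [(k, ("user", v)) for k, v in users] + \
         [(k, ("click", v)) for k, v in clicks]

# "Shuffle" phase: group all values by key (in Hadoop, the framework's sort does this).
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# "Reduce" phase: emit the cross product of user rows and click rows per key.
joined = []
for key, values in sorted(grouped.items()):
    names = [v for tag, v in values if tag == "user"]
    pages = [v for tag, v in values if tag == "click"]
    for name in names:
        for page in pages:
            joined.append((key, name, page))

print(joined)
# [('u1', 'Alice', '/home'), ('u1', 'Alice', '/buy'), ('u2', 'Bob', '/home')]
```

A SQL engine expresses all of this as a one-line `JOIN`; that gap is exactly what Madsen is pointing at.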
That’s easy to do if you have a parallel programming or engineering expert on staff — but most people don’t. And that’s one of the many reasons why good integration and data quality tools will be so important moving forward with Hadoop and other Big Data tools.
“Inside Hadoop, many of the operations involve sorting. The sort [capability] that’s included in Hadoop, the native sort, is not very scalable, and it doesn’t have a lot of functionality. If you are doing different types of sort to achieve performance or to achieve more functionality, you have to go Java,” said Jorge Lopez, senior manager for data integration with Syncsort, in the same article.
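To see why "different types of sort" pushes you into custom code, consider the classic secondary-sort problem: Hadoop sorts records by key only, so ordering values within a key (say, a user's events by timestamp) means constructing a composite key yourself. Here is the idea sketched in Python rather than Java, with invented data:

```python
# Events as (user_id, timestamp, action); arrival order is arbitrary.
events = [("u1", 3, "buy"), ("u2", 1, "view"), ("u1", 1, "view"), ("u1", 2, "cart")]

# Composite key (user_id, timestamp): a single sort pass then delivers each
# user's events already time-ordered, which is what a Hadoop secondary sort
# achieves with custom key, partitioner, and grouping-comparator classes.
events.sort(key=lambda e: (e[0], e[1]))

print(events)
# [('u1', 1, 'view'), ('u1', 2, 'cart'), ('u1', 3, 'buy'), ('u2', 1, 'view')]
```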
This leaves us, then, with the issue of semantic interoperability and achieving trust with data. And that, it seems, may still be more wishful thinking than reality.
Pervasive Software is one of the companies mentioned in the original post. It recently announced a new Hadoop edition of its Data Integrator tool, which loads data into and exports it from HBase, Hadoop’s distributed database. The company’s CTO, Jim Falgout, responded to O’Gorman:
Hadoop along with the ‘no SQL’ data stores do tend to loosen the constraints of semantic interoperability. This being a trade-off between tighter semantic constraints and the ability to ingest huge amounts of data with minimal friction and actually do something positive with the data. But you have a point. At some level, the consistency and precision of your results are at the mercy of the quality of your Hadoop programmers.
That will change as the market matures, said Falgout.
“As these mature, semantic interoperability will again come to the forefront, allowing enterprises the ability to do things like track data lineage,” he said. “For now the trade-off is the ability to get some defined value out of big data [using] ‘looser’ techniques.”
For now, the best option may be to focus on master data management (MDM) and data quality to address the problems upstream.