Big Data and Integration: Old Solutions for a New Market?



When you hear about Big Data, the focus is usually on the Internet and how dot-com bigwigs like Google and Twitter use it. But the Internet didn't create Big Data, and Big Data didn't come into use just because of Hadoop.


In fact, financial services companies have been dealing with large data loads for a long time, although, as we all know by now, their handling of that data hasn't always been stellar.


A recent InformationWeek article, written by Editor-at-Large Doug Henschen, notes that online marketing data company comScore has dealt with Big Data for a decade now. Henschen also points out an under-discussed fact about Big Data: It's not all about the storage.


The article focuses on what comScore - and, one assumes, other companies - does to manage huge data volumes before loading them into storage. As you may have guessed from the fact that I am writing about it, those pre-storage steps involve running the data through integration software to sort and organize it:

... comScore collects about 2 billion new rows of panel data and more than 18 billion new rows of census data each day. That means more than 20 billion rows of new data is loaded into the data warehouse each day. Of course, most every organization will apply compression to reduce storage demands. But comScore also uses Syncsort DMExpress data integration software to sort and bring alphanumeric order to the data before it's loaded into the warehouse. This improves compression ratios.

In essence, the article notes, running the data through Syncsort's DMExpress in this integration step reduces 10 bytes of data to three or four bytes.
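To see why sorted data compresses better, here is a minimal sketch in Python. It is not how DMExpress or comScore's warehouse works internally: the standard library's `sorted` and `zlib` stand in for the sort and compression steps, and the rows (site, country, page views) are synthetic data I made up for illustration.

```python
import random
import zlib


def compressed_size(rows):
    """Return the zlib-compressed size, in bytes, of newline-joined rows."""
    payload = "\n".join(rows).encode("utf-8")
    return len(zlib.compress(payload, 6))


# Synthetic, hypothetical rows: "site,country,views". Real panel or census
# data would look different; this only needs lots of repeated values.
random.seed(42)
sites = [f"site{n:04d}.example.com" for n in range(50)]
countries = ["US", "GB", "DE"]
rows = [
    f"{random.choice(sites)},{random.choice(countries)},{random.randint(1, 5)}"
    for _ in range(20000)
]

raw_size = len("\n".join(rows).encode("utf-8"))
unsorted_size = compressed_size(rows)
# Sorting puts identical and near-identical rows next to each other, which
# gives the compressor long repeated runs to work with.
sorted_size = compressed_size(sorted(rows))

print(f"raw bytes:               {raw_size}")
print(f"compressed, arrival order: {unsorted_size}")
print(f"compressed, sorted first:  {sorted_size}")
```

The same rows take noticeably fewer bytes when they are sorted before compression. The exact ratio depends entirely on the data and the compressor, so don't read the numbers from this toy example as a prediction for a real warehouse.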


If you don't know about Syncsort's data software, you're not alone - even though the company has been in business for more than 40 years. Its data processing and integration software hails from the old days of mainframes, and it's designed to deal with large amounts of data. As of last year, when I interviewed Syncsort's technology strategist Joe Lichtenberg, its client list included 90 of the Fortune 100 companies. As you might expect, it hasn't exactly been a product for the budget-minded, smaller organization. But the company is finding a new market as more organizations become involved with Big Data, and it's interesting to see how this mainframe company is applying its technology today.


Recently, I received an email from a press person informing me that Syncsort had a new executive team and, apparently, a new approach to how it markets its product, since the email noted that "Syncsort advocates a new best practice, data integration acceleration."


As the InformationWeek article explains, data integration acceleration is one way to speed up the handling of Big Data loads and to reduce some of the complexities of ETL loads:

Not every company operates at comScore's scale, but the lesson is that not every big-data challenge is best left to the high-powered database platform to solve. Sorting, filtering, aggregation, and transformation steps can streamline data before it gets to the data warehouse, saving CPU cycles and storage space before and after the crucial data-loading stage.
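As a rough illustration of those four steps, here is a small Python sketch that filters, transforms, aggregates and sorts records before anything is handed to a warehouse. The function name `prepare_for_load`, the record layout and the sample events are all hypothetical; a real pipeline would run these steps in an integration tool or a distributed job rather than in memory.

```python
from collections import defaultdict


def prepare_for_load(events):
    """Filter, transform, aggregate, and sort raw events before loading.

    Each event is a hypothetical (site, country, views) tuple.
    """
    totals = defaultdict(int)
    for site, country, views in events:
        if views <= 0:
            # Filter: drop rows that carry no information.
            continue
        # Transform: normalize keys so equivalent rows match.
        key = (site.strip().lower(), country.upper())
        # Aggregate: one running total per (site, country).
        totals[key] += views
    # Sort: hand the loader rows in a stable, ordered form.
    return sorted(
        (site, country, views) for (site, country), views in totals.items()
    )


raw_events = [
    ("Example.com ", "us", 3),
    ("example.com", "US", 2),
    ("news.example.org", "gb", 0),
    ("news.example.org", "GB", 4),
]

load_ready = prepare_for_load(raw_events)
print(load_ready)
# [('example.com', 'US', 5), ('news.example.org', 'GB', 4)]
```

Four raw rows become two load-ready rows, so the database has less to ingest, store and scan. That is the point the article is making, just at a vastly smaller scale.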

I think it's great that we're seeing some press about how integration and processing of data tie in with Big Data beyond the basic database/storage issues. I hope to hear more about it, obviously.


But I suspect it could also become confusing as we see more solutions and marketing messages trying to grab a piece of that growing Big Data pie. For instance, after reading the article, I'm still not sure whether this concept of data integration acceleration is unique to Syncsort, and possibly more of a marketing term, or whether there are other solutions in this space. I can tell you that a quick Google search shows almost all of the results pointing back to Syncsort, which, in my experience, usually means a phrase is more of a marketing term. Syncsort's email explained the process this way:

The Syncsort approach rests on four basic tenets: High performance at scale, minimum resource utilization, ease of use with no tuning required and the ability to integrate with other data integration platforms. The result is simpler maintenance with all transformations happening in one place, faster responses to new demands for information, and greater flexibility to adapt to changing conditions. Ultimately, it is about utilizing data faster and more efficiently than the competition.

When I first read this and the InformationWeek article, I wondered if MapReduce was an open source tool for this same function. Apparently not. IT Business Edge's Mike Vizard wrote this week that Syncsort is offering a DMExpress Hadoop Edition to accelerate Hadoop's processing of MapReduce and hide some of its complexity.


I plan on catching up with Syncsort soon to find out more, but my big question is about cost. One reason that Big Data is becoming more accessible to more organizations is that Hadoop makes it very affordable to store and process large amounts of data.


I suspect this won't be the last time I'll be confused about Big Data offerings. What's going to be tricky is separating what you need from what you don't - and working out whether there's an existing open source product that fills the same need. It'll be interesting to see, then, how add-ons such as DMExpress Hadoop Edition will do in the marketplace.