The Untapped Potential of Good Ole ETL


ETL is in a weird place these days. It's a great market, with growth in the double digits, and yet, there's all this other integration "stuff" out there that seems to compete with it.


Recently, I interviewed Ross Mason and Ken Yagen of ESB vendor MuleSoft about what they say is the decline of ETL. They pointed to recent movement by ETL-focused vendors into other areas as a sign of the times and ETL's fading allure. Mason, MuleSoft's founder and CTO, said:

I think ETL is a very old-world way of looking at moving data around the enterprise and, you know, what we're seeing is the location of data is changing, so it's not only in databases anymore. ... And I think the demands on the need to gather information in real-time has also changed. ETL used to be very sort of batch-oriented, but what the enterprise really need is real-time information because a lot of our business processes and orchestration need that data.

Bob Potter, co-founder and CEO of ETL provider expressor, responded pretty adamantly that the whole interview was "complete nonsense." In the reader comments, Potter defended ETL's continued relevance:

First of all ETL systems move data in both batch and continuous mode. Secondly, the ability to transform data, particularly massive big data cannot be duplicated by ESB's. Also, many ETL systems, including ours, can load multiple databases simultaneously. The data integration business is growing faster than the middleware infrastructure business in which ESB's live. Let's set the record straight - ETL will continue to grow in the double digits for the foreseeable future!

Point taken. Admittedly, I did use a pretty tabloid-esque headline ("Are Real-time and Cloud Demands the Death Knell for ETL?"), in part because I wanted to flush out some different opinions. Come on, we all know ETL isn't going away anytime soon; Mason admits as much, really. As Potter noted, it's still growing in the double digits. Michael Curry of IBM added in a recent DM Radio podcast that not only is it growing, it's one of the "fastest growing in software," despite how long it's been around.


But Mason and Yagen aren't the only ones who've noted a shift in the data integration technology market and wondered what it means for ETL. As I shared recently, consultant Seth Grimes looked at what he calls "New Data Integration": new approaches that move beyond and complement ETL.


Just this week, data warehouse and integration consultant Rick Sherman penned a post with the telling title, "My take on why ETL has not always kept up with the integration workload." This piece is a preview of remarks he made during the DM Radio podcast I mentioned earlier, which also looks at the use of ETL. The podcast is entitled "Carrying the Load: Why ETL Still Rules the Roost," which, again, tells you a lot about where that discussion is headed, but it also explores why "many organizations have reached their limits with the workhorse discipline of Extract-Transform-Load (ETL), either because of shrinking batch windows, or because they lack a strategic view of data transformations and movements throughout the enterprise."


The discussion is complicated by a difference in how small- and medium-sized companies use ETL versus how large companies do. Yes, ETL is growing, particularly in small- and mid-sized companies, although, ironically enough, its main competition there is still an even older approach to data integration called "hand coding." Meanwhile, large companies are exploring complementary technologies, and in some cases, are actually missing opportunities to use ETL tools to their full capacity.


The podcast is good, although there are some sound quality issues, but at an hour, it does go on a bit. I think it's worth a listen, particularly the first 35 minutes, but if you're short on time, Rick Sherman's post does a great job of summing up the four key obstacles to squeezing more from your ETL tools. Beyond what he discusses in his post, the podcast adds a few more dimensions to the ETL and evolving data integration market.


First, people still think hand-coding is cheaper and tools are too expensive. Sherman points out that this was true in the first generation, but now there are some very affordable options. He also points out that the hand-coding problem is cumulative. Sure, it's easier to hand-code that first upload, but before you know it, you're 70 sources and two years into it, nothing's documented and you're still reinventing the wheel every time you do an upload. "When I talk to management, they certainly aren't happy about the situation, but when you talk to programmers, they still think that they can out-code ETL," Sherman said.


Second, the tools are too complex, according to Gaurav Dhillon of SnapLogic. That complexity deters programmers from learning the tools, and it is competing with increasingly cheap storage. Dhillon believes the next generation of tools needs to address this complexity issue or we'll be having this same discussion in 10 years.


Third, there's the problem of unstructured information, which accounts for 80 percent of the growth in information, according to the podcast.


The podcast also discusses data governance and the unexplored, but troubling, topic of how risk management and integration relate. As I said, it's a great discussion, with representatives from Syncsort, IBM and SnapLogic, but long.


The bottom line seems to be that ETL still has a lot of life left, and in fact, a good deal of untapped potential, particularly in SMBs, where hand-coding is still king, and outside the data warehouse. That said, there are new data challenges arising and new technologies emerging to address them. The key seems to be to think of these new approaches as add-ons to, rather than replacements for, data integration's old, faithful ETL.