One of the major challenges with Big Data, I think, is figuring out your options. It is such a new space that it’s tricky to identify what types of tools you’ll even need, much less figure out which vendors actually offer them.
A large number of lists about Big Data are available: The Big Data 100, the Hot Start-Ups, the Most-Powerful Big Data Companies, and so on. All of these lists are informative, but they don’t necessarily help you piece together a basic Big Data architecture or a list of must-have solutions, particularly when it comes to Big Data integration.
Organizations need to realize that not everything changes just because they’re dealing with Big Data.
“In reality, Big Data integration fits into the overall process of integration of data across your company,” states Big Data for Dummies. “Therefore, you can't simply toss aside everything you have learned from data integration of traditional data sources. The same rules apply whether you are thinking about traditional data management or Big Data management.”
Still, Big Data integration does have unique challenges. For instance, while mature ETL solutions now have Hadoop connectors, you can still run into problems with ETL tools, according to Rick Percuoco, senior vice president of Research and Development at Harte-Hanks Trillium Software.
“The aspect of extract, transform, load (ETL) will definitely be affected with the large data volumes, as you can't move the data like you used to in the past,” Percuoco told Interarbor Solutions Principal Analyst Dana Gardner during a recent BriefingsDirect podcast. “Also, governance is becoming a stronger force within companies, because as you load many sources of data into one repository, it’s easier to have some kind of governance capabilities around that.”
Other challenges include accessing and preparing unstructured and semi-structured data, and accessing legacy data sets.
So, sometimes what’s more helpful than a list is some discussion of what kinds of tools you’ll actually need. It’s easy to find pieces about storage and analytics tools, but integration tends to be overlooked.
Recently, I stumbled across three pieces that more specifically address the data integration aspects of Big Data.
The first, and my favorite, is a December series by technology journalist and editor Rutrell Yasin, published on Government Computer News (GCN).
The first article discusses the hardware, as well as “data ingestion” tools for preparing source data. These tools focus on migrating data from other systems to Hadoop. Again, Yasin is pointing out specific tools used in the public sector, which in this case include:
But it’s the second article in the series that specifically covers integration and retrieval tools, including:
You might also check out this Big Data Republic piece. My hesitation is that the writer, Anand Srinivasan, has no profile, and the name is fairly common, so it’s hard to verify the author’s background.
The piece is a discussion of the Big Data tools that can help with data integration, analysis and visualization. Although the piece as a whole discusses several competing vendors, it mentions only Oracle’s Data Integrator Enterprise Edition as an integration option.