10 Options for Prepping, Moving and Integrating Data Sets for Big Data


    One of the major challenges with Big Data, I think, is figuring out your options. The space is so new that it’s tricky to identify what type of tools you’ll even need, much less which vendors actually offer them.

    A large number of lists about Big Data are available: the Big Data 100, the Hot Start-Ups, the Most-Powerful Big Data Companies, and so on. These lists are informative, but they don’t necessarily help you piece together a basic Big Data architecture or a list of must-have solutions, particularly when it comes to Big Data integration.

    Organizations need to realize that not everything changes just because they’re dealing with Big Data.

    “In reality, Big Data integration fits into the overall process of integration of data across your company,” states Big Data for Dummies. “Therefore, you can’t simply toss aside everything you have learned from data integration of traditional data sources. The same rules apply whether you are thinking about traditional data management or Big Data management.”

    Still, Big Data does pose unique integration challenges. For instance, while mature ETL solutions now have Hadoop connectors, you can still run into problems with ETL tools, according to Rick Percuoco, senior vice president of Research and Development at Harte-Hanks Trillium Software.

    “The aspect of extract, transform, load (ETL) will definitely be affected with the large data volumes, as you can’t move the data like you used to in the past,” Percuoco told Interarbor Solutions Principal Analyst Dana Gardner during a recent BriefingsDirect podcast. “Also, governance is becoming a stronger force within companies, because as you load many sources of data into one repository, it’s easier to have some kind of governance capabilities around that.”

    Maybe that’s why one of the main uses of Hadoop to date has been as an ETL engine.
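    To make that concrete, here is a minimal, self-contained Python sketch of the map/shuffle/reduce pattern that underlies Hadoop’s appeal as an ETL engine. This is an illustration of the pattern only, not Hadoop code, and the log format and function names are made up for the example:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the map function to each record, yielding (key, value) pairs."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group mapped values by key, as Hadoop's shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the reduce function to each key's list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# ETL-style example: aggregate raw log lines into per-status-code counts.
logs = ["200 /home", "404 /missing", "200 /about", "500 /api"]

def extract_status(line):
    status, _path = line.split(" ", 1)
    yield (status, 1)

counts = reduce_phase(shuffle(map_phase(logs, extract_status)),
                      lambda key, values: sum(values))
print(counts)  # {'200': 2, '404': 1, '500': 1}
```

    In a real Hadoop job, the map and reduce functions would run in parallel across the cluster, with the framework handling the shuffle; that parallelism is what makes the pattern attractive for heavy transform-and-load work.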

    Other challenges include accessing and preparing unstructured and semi-structured data, as well as getting at legacy data sets.

    So, sometimes what’s more helpful than a list is a discussion of which tools you’ll actually need. It’s easy to find pieces about storage and analytics tools, but integration tends to be overlooked.

    Recently, I stumbled across three pieces that more specifically address the data integration aspects of Big Data.

    The first, and my favorite, is a December series written by technology journalist and editor Rutrell Yasin and published on Government Computer News (GCN).

    The first article discusses the hardware, as well as “data ingestion” tools for preparing source data. These tools focus on migrating data from other systems to Hadoop. Yasin points out specific tools used in the public sector, which in this case include:

    • Apache Flume, for collecting, aggregating and moving log data from multiple sources to a centralized data store, like the Hadoop Distributed File System (HDFS). Yasin notes that it’s considered the de facto standard for sending data streams into Hadoop.
    • Apache Sqoop, which automates the import and export of data from relational databases, data warehouses and NoSQL systems.
    • MapReduce, which is an integral part of Hadoop systems and handles the hard work of processing data. The problem: It requires expertise and custom coding.
    • Pivotal GemFire, which is a good option if you’re using the commercial Hadoop distribution Pivotal HD for traditional Big Data management and analytics, along with Pivotal HAWQ.
    • IBM InfoSphere Streams.
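    For a flavor of what setting up one of these ingestion tools looks like, here is a minimal Flume agent definition that tails a log file into HDFS. The agent, source, channel and sink names, the log path and the HDFS URL are all illustrative:

```properties
# Sketch of a minimal Flume agent: tail an application log into HDFS.
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

# Source: follow a log file as new lines arrive.
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/app/app.log
agent1.sources.logsrc.channels = memch

# Channel: buffer events in memory between source and sink.
agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

# Sink: write events into date-partitioned HDFS directories.
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfssink.channel = memch
```

    The source/channel/sink split is Flume’s core design: sources collect events, channels buffer them, and sinks deliver them, which is what makes it well suited to streaming data into Hadoop.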

    But it’s the second article in the series that specifically covers integration and retrieval tools, including:

    • NoSQL databases such as MongoDB, which uses in-memory computing (another Big Data technology), and Apache Cassandra, which distributes the workloads across data centers, so there’s no single point of failure.
    • Apache HBase, an open-source column-oriented store.
    • MarkLogic’s NoSQL database platform, which couples government-grade security with integration from legacy databases, open-source technologies and Web data sources.
    • Accumulo, an NSA-developed solution that offers cell-level security. It was turned over to the Apache Software Foundation in 2011.
    • ETL tools—The article specifically mentions Talend’s Open Studio and Pentaho’s Data Integration (aka the Kettle ETL engine).
    • UIA (Universal Information Access), an emerging area that combines search with database technologies. Solutions include Attivio’s Active Intelligence Engine and Cambridge Semantics’ Anzo Unstructured, Yasin writes.
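    Accumulo’s cell-level security is worth a closer look, since it’s the feature that sets it apart in this list. Here is a toy Python sketch of the idea, not Accumulo’s actual API: each cell carries a visibility label, and a scan returns only cells whose label is among the reader’s authorizations (the table schema and data are invented for the example):

```python
# Toy model of cell-level security: every cell is tagged with a
# visibility label, checked against the reader's authorizations.
table = [
    # (row, column, visibility, value)
    ("patient1", "name",      "public",    "Alice"),
    ("patient1", "diagnosis", "clinician", "flu"),
    ("patient1", "ssn",       "admin",     "123-45-6789"),
]

def scan(table, authorizations):
    """Return only the cells the reader is authorized to see."""
    return [(row, col, val)
            for row, col, vis, val in table
            if vis in authorizations]

print(scan(table, {"public"}))
# [('patient1', 'name', 'Alice')]
print(scan(table, {"public", "clinician"}))
# [('patient1', 'name', 'Alice'), ('patient1', 'diagnosis', 'flu')]
```

    Real Accumulo visibility labels support boolean expressions (AND/OR of labels) and the filtering happens server-side during the scan, but the principle is the same: access control travels with each cell rather than with the table.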

    You might also check out this Big Data Republic piece. My hesitation with it is that the writer, Anand Srinivasan, has no author profile, and the name is fairly common.

    The piece discusses Big Data tools that can help with data integration, analysis and visualization. Although it covers several different and competing vendors as a whole, it mentions only Oracle’s Data Integrator Enterprise Edition as an integration option.

    Loraine Lawson
    Loraine Lawson is a freelance writer specializing in technology and business issues, including integration, health care IT, cloud and Big Data.
