The emergence of Hadoop as the de facto Big Data operating system has brought on a flurry of beliefs and expectations that are sometimes simply untrue. Organizations embarking on their Hadoop journey face multiple pitfalls that, if not addressed early, will lead to wasted time, runaway expenditures and performance bottlenecks. By anticipating these issues and using smarter tools, organizations can realize the full potential of Hadoop. Syncsort has identified five pitfalls that should be avoided with Hadoop.
A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain ETL jobs. Typical capabilities of a data integration tool include: an intuitive graphical interface, pre-built data transformation functions (aggregations, joins, change data capture [CDC], cleansing, filtering, reformatting, lookups, data type conversions, and so on), metadata management to enable re-use and data lineage, powerful connectivity to source and target systems, and advanced features to make data integration easily accessible to data analysts.
Although the primary use case of Hadoop is ETL, Hadoop is not a data integration tool itself. Rather, Hadoop is a reliable, scale-out parallel processing framework, meaning servers (nodes) can be easily added as workloads increase. It frees the programmer from concerns about how to physically manage large data sets when spreading processing across multiple nodes.
Tips to avoid this pitfall:
ETL is emerging as the key use case for Hadoop implementations. However, Hadoop alone lacks many attributes needed for successful ETL deployments. Therefore, it’s important to choose a data integration tool that can fill the ETL gaps.
- Choose a user-friendly graphical interface to easily build ETL jobs without writing MapReduce code.
- Ensure that the solution has a large library of pre-built data integration functions that can be easily reused.
- Include a metadata repository to enable re-use of developments, as well as data lineage tracking.
- Select a tool with a wide variety of connectors to source and target systems.
Programming with the MapReduce processing paradigm in Hadoop requires not only Java programming skills, but also a deep understanding of how to develop the appropriate Mappers, Reducers, Partitioners, Combiners, etc. A typical Hadoop task often has multiple steps and a typical application can have multiple tasks. Most of these steps need to be coded by a Java developer (or written as Pig scripts). With hand-coding, these steps can quickly become unwieldy to create and maintain.
Not only does MapReduce programming require specialized skills that are expensive and hard to find, but hand-coding also does not scale well in terms of job creation productivity, job re-use and job maintenance.
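To give a feel for how many moving parts even a trivial job involves, here is a minimal sketch of a word count that imitates the Mapper, Combiner, Partitioner and Reducer roles in plain Python. It is an in-process simulation for illustration only, not Hadoop's Java API; the function names, the toy hash in `partitioner` and the sample input are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import groupby


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Pre-aggregate one mapper's output locally to shrink shuffle traffic."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partitioner(key, num_reducers):
    """Pick the reducer for a key (deterministic toy hash, an assumption)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum all counts collected for one key."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    # Map + combine: each line stands in for one input split.
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            partitions[partitioner(key, num_reducers)].append((key, value))

    # Shuffle/sort + reduce: sort each partition by key, then group and reduce.
    results = {}
    for part in partitions:
        part.sort(key=lambda kv: kv[0])
        for key, group in groupby(part, key=lambda kv: kv[0]):
            word, total = reducer(key, (value for _, value in group))
            results[word] = total
    return results


counts = run_job(["big data big cluster", "data in hadoop", "big hadoop"])
print(counts)
```

Four separate functions plus driver logic are needed just to count words. A real job adds input and output formats, serialization and cluster configuration on top, which is where hand-coded flows tend to become hard to maintain.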
Tips to avoid this pitfall:
Hadoop ETL requires organizations to acquire a completely new set of advanced programming skills that are expensive and difficult to find. To overcome this pitfall, it’s critical to choose a data integration tool that both complements Hadoop and leverages skills organizations already have.
- Select a tool with a graphical user interface (GUI) that abstracts the complexities of MapReduce programming.
- Look for pre-built templates specifically to create MapReduce jobs without manually writing code.
- Insist on the ability to re-use previously created MapReduce flows as a means to increase developers’ productivity.
- Avoid code generation since it frequently requires tuning and maintenance.
- Visually track data flows with metadata and lineage.
Most data integration solutions offered for Hadoop do not run natively; instead, they generate hundreds of lines of code to accomplish even simple tasks. This can have a significant impact on the overall time it takes to load and process data. That’s why it’s critical to choose a data integration tool that is tightly integrated within Hadoop and can run natively within the MapReduce framework. Moreover, it’s important to consider not only the horizontal scalability inherent to Hadoop, but also the vertical scalability within each node. Remember, vertical scalability is about the processing efficiency of each node. A good example of vertical scalability is sorting, a key component of every MapReduce process (equally important is connectivity efficiency, covered in Pitfall #5). The more efficiently each node processes data, the faster jobs complete, which reduces overall time to value.
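Why sorting matters so much per node can be illustrated with a simplified sketch of the sort-and-merge pattern: records are sorted in small runs and the sorted runs are then merged into one ordered stream. This is loosely modeled on how sorted intermediate data is typically handled, not on Hadoop's actual implementation; `sorted_runs`, `merge_runs`, the run size and the sample keys are assumptions made for this example.

```python
import heapq


def sorted_runs(records, run_size):
    """Split records into fixed-size chunks and sort each chunk (one "run")."""
    for start in range(0, len(records), run_size):
        yield sorted(records[start:start + run_size])


def merge_runs(runs):
    """Lazily merge already-sorted runs into a single sorted stream."""
    return heapq.merge(*runs)


# Hypothetical (key, value) pairs, as a map step might emit them.
records = [("k7", 1), ("k2", 1), ("k9", 1), ("k2", 1), ("k5", 1), ("k1", 1)]

runs = list(sorted_runs(records, run_size=2))
merged = list(merge_runs(runs))
print(merged)
```

Every record passes through this sort at least once on the node that produced it, so any inefficiency here is multiplied by the full data volume on every node in the cluster.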
Tips to avoid this pitfall:
Most data integration tools are simply code generators that add extra overhead to the Hadoop framework. A smarter approach must fully integrate with Hadoop and provide a means to seamlessly optimize performance without adding complexity.
- Understand how different solutions are specifically interacting with Hadoop and the amount of code that they are generating.
- Choose solutions with the ability to run natively within each Hadoop node without generating code.
- Run performance benchmarks and study which tools deliver the best combination of price and performance for your most common use cases.
- Select an approach with built-in optimizations to maximize Hadoop’s vertical scalability.
Hadoop is significantly disrupting the cost structure of processing data at scale. However, deploying Hadoop is not free, and significant costs can add up. Vladimir Boroditsky, a director of software engineering at Google’s Motorola Mobility Holdings Inc., recognized in a Wall Street Journal article that “there is a very substantial cost to free software,” noting that Hadoop comes with additional costs of hiring in-house expertise and consultants. In all, the primary costs to consider for a complete enterprise data integration solution powered with Hadoop include: software, technical support, skills, hardware and time-to-value.
The first three factors – software, support and skills – should be considered together. While the Hadoop software itself is open source and free, typically it’s desirable to purchase a support subscription with an enterprise service-level agreement (SLA). Likewise, it’s important to consider the software and subscription costs as a whole when choosing the data integration tool to work in tandem with Hadoop. In terms of skills, the Wall Street Journal reports that a Hadoop programmer, also sometimes referred to as a data scientist, can easily command at least $300,000 per year. Although the data integration tool may add costs on the software and support side, using the right tool can reduce overall costs of development and maintenance by dramatically reducing the time needed to build and manage Hadoop jobs. Finally, data integration tool skills are much more broadly available and much less expensive than specialized Hadoop MapReduce developer skills.
Tips to avoid this pitfall:
Hadoop provides virtually unlimited horizontal scalability. However, hardware and development costs can quickly hinder sustainable growth. Therefore, it’s important to maximize developer productivity and per-node efficiency to contain costs.
- Choose cost-effective software and support, including both the Hadoop distribution and the data integration tool.
- Ensure tools include features to reduce development and maintenance efforts of MapReduce jobs.
- Look for optimizations that enhance Hadoop’s vertical scalability to reduce hardware requirements.
One of Hadoop’s hallmark strengths is its ability to process massive data volumes of nearly any type. But that strength cannot be fully utilized unless the Hadoop cluster is adequately connected to all available data sources and targets, including relational databases, files, CRM systems, social media, mainframes and so on. However, moving data in and out of Hadoop is not trivial. Moreover, with the birth of new categories of data management technologies, broadly generalized as NoSQL and NewSQL, mission-critical systems like mainframes can all too often be neglected. The fact is that at least 70 percent of the world’s transactional production applications run on mainframe platforms. The ability to process and analyze mainframe data with Hadoop could open up a wealth of opportunities by delivering deeper analytics, at lower cost, for many organizations.
Shortening the time it takes to get data into the Hadoop Distributed File System (HDFS) can be critical for many companies, such as those that must load billions of records each day. Reducing load times can also be important for organizations that plan to increase the amount and types of data they load into Hadoop as their application or business grows. Finally, pre-processing data before loading it into Hadoop is vital in order to filter out the noise of irrelevant data, achieve significant storage space savings and optimize performance.
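As a rough illustration of pre-processing before a load, the sketch below drops irrelevant rows, keeps only the needed columns and compresses the result. It works purely in memory and does not talk to HDFS; the column names, the `is_relevant` rule and the synthetic input are assumptions made for this example.

```python
import csv
import gzip
import io


def preprocess(raw_csv, keep_columns, is_relevant):
    """Filter rows, keep selected columns, return (gzip bytes, rows kept)."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=keep_columns,
        extrasaction="ignore",  # silently drop columns we do not keep
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if is_relevant(row):
            writer.writerow(row)
            kept += 1
    return gzip.compress(out.getvalue().encode("utf-8")), kept


# Synthetic input: most rows are heartbeat noise with a bulky debug column.
lines = ["event,user,status,debug_blob"]
for i in range(200):
    status = "ok" if i % 4 == 0 else "heartbeat"
    lines.append(f"click,u{i},{status},{'x' * 40}")
raw = "\n".join(lines)

payload, kept = preprocess(
    raw,
    keep_columns=["event", "user", "status"],
    is_relevant=lambda row: row["status"] == "ok",
)
print(f"kept {kept} of 200 rows; {len(raw.encode('utf-8'))} -> {len(payload)} bytes")
```

The exact savings depend entirely on the data, but filtering and trimming before the load means the cluster stores and scans only records that can contribute to an analysis.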
Tips to avoid this pitfall:
Without the right connectivity, Hadoop risks becoming another data silo within the enterprise. Tools to get the needed data in and out of Hadoop at the right time are critical to maximize the value of Big Data.
- Select tools with a wide range of native connectors, particularly for popular relational databases, appliances, files and systems.
- Don’t forget to include mainframe data in your Hadoop and Big Data strategies.
- Make sure connectivity is provided not only from a stand-alone data integration server to Hadoop, but also directly from the Hadoop cluster itself to a variety of sources and targets.
- Look for connectors that don’t require writing additional code.
- Ensure high-performance connectivity in both loading and extracting data from various sources and targets.