A data integration tool provides an environment that makes it easier for a broad audience to develop and maintain ETL jobs. Typical capabilities include: an intuitive graphical interface; pre-built data transformation functions such as aggregations, joins, change data capture (CDC), cleansing, filtering, reformatting, lookups, and data type conversions; metadata management to enable reuse and data lineage; broad connectivity to source and target systems; and advanced features that make data integration accessible to data analysts, not just developers.
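To make the transformation functions above concrete, the sketch below expresses a few of them (filtering, data type conversion, a lookup, and an aggregation) in plain Python. The record layouts and field names are hypothetical; a real data integration tool would provide these as pre-built, configurable components rather than hand-written code.

```python
# Hypothetical ETL transformations written by hand, to illustrate what
# a data integration tool packages as pre-built functions.
from collections import defaultdict

orders = [
    {"order_id": 1, "cust_id": "C1", "amount": "120.50", "status": "ok"},
    {"order_id": 2, "cust_id": "C2", "amount": "80.00",  "status": "bad"},
    {"order_id": 3, "cust_id": "C1", "amount": "45.25",  "status": "ok"},
]
customers = {"C1": "Acme Corp", "C2": "Globex"}  # lookup table

# Filtering: keep only valid records.
valid = [r for r in orders if r["status"] == "ok"]

# Data type conversion: amount arrives as a string; convert to float.
for r in valid:
    r["amount"] = float(r["amount"])

# Lookup: enrich each record with the customer name.
for r in valid:
    r["customer"] = customers.get(r["cust_id"], "UNKNOWN")

# Aggregation: total amount per customer.
totals = defaultdict(float)
for r in valid:
    totals[r["customer"]] += r["amount"]

print(dict(totals))  # {'Acme Corp': 165.75}
```

Each step here is a few lines, but across hundreds of jobs, sources, and schema changes, maintaining such hand-written logic is exactly the burden a data integration tool is designed to remove.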
Although the primary use case of Hadoop is ETL, Hadoop is not itself a data integration tool. Rather, it is a reliable, scale-out parallel processing framework: servers (nodes) can be added easily as workloads increase, and the framework frees the programmer from the details of physically managing large data sets when spreading processing across multiple nodes.
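The division of labor described above can be sketched with Hadoop's MapReduce programming model. Production Hadoop jobs are typically written in Java against the MapReduce API; the following is only a small, in-memory Python simulation of the map/shuffle/reduce phases, showing that the programmer writes just the per-record logic while the framework owns data distribution.

```python
# A minimal local simulation (plain Python, not actual Hadoop) of the
# MapReduce model. The programmer supplies map_phase() and
# reduce_phase(); on a real cluster, Hadoop handles splitting the
# input, shuffling intermediate keys, and running the work in
# parallel across nodes.
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) for every word in one input line.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Sum all counts emitted for one word.
    return (key, sum(values))

def run_job(records):
    # Shuffle: group intermediate values by key. Hadoop performs this
    # step across the cluster; here it is a simple in-memory dict.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

lines = ["ETL at scale", "scale out ETL"]
print(run_job(lines))  # word counts across both lines
```

Note that nothing in `map_phase` or `reduce_phase` mentions nodes, files, or partitions; that separation is what lets Hadoop scale the same user code from one machine to many.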
The emergence of Hadoop as the de facto Big Data operating system has brought a flurry of beliefs and expectations that are sometimes simply untrue. Organizations embarking on their Hadoop journey face multiple pitfalls that, if not proactively addressed, lead to wasted time, runaway expenditures, and performance bottlenecks. By anticipating these issues and using smarter tools, organizations can realize Hadoop's full potential. Syncsort has identified five pitfalls to avoid with Hadoop.