Ingest and Delivery
Data lakes need mechanisms for getting data into and out of the backend storage platform. In traditional data warehouses, data is inserted and queried using some form of SQL and a database driver, possibly via ODBC or JDBC. While compatibility drivers do exist for accessing Hadoop data, the variety of formats that data can take requires more flexible tooling. Open source tools such as Sqoop and Flume provide low-level interfaces for pulling in data from relational databases and for collecting log data, respectively. In addition, custom MapReduce programs and scripts are currently used to import data from APIs and other sources. Commercial tools add pre-built connectors and broad data-format support for mixing and matching data sources with the repositories in the data lake.
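For illustration, the following Python sketch mimics the kind of custom ingest script mentioned above: it pulls rows from a relational source and lands them as newline-delimited JSON in a staging directory, much as Sqoop would move table data into HDFS. The connection details, table name, and landing path are assumptions made for the example, not part of any particular tool.

```python
# Minimal sketch of a hand-rolled ingest script: dump one table from a
# relational source into a landing directory as newline-delimited JSON.
# sqlite3 stands in here for any JDBC/ODBC-accessible database.
import json
import sqlite3
from pathlib import Path

LANDING_DIR = Path("/data/lake/landing/orders")  # hypothetical landing zone


def ingest_table(db_path: str, table: str) -> Path:
    """Write every row of `table` to <LANDING_DIR>/<table>.jsonl."""
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    out_file = LANDING_DIR / f"{table}.jsonl"
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    with conn, out_file.open("w") as out:
        # Table name is assumed to come from trusted configuration.
        for row in conn.execute(f"SELECT * FROM {table}"):
            out.write(json.dumps(dict(row)) + "\n")
    return out_file


if __name__ == "__main__":
    print(ingest_table("source.db", "orders"))
```

A production pipeline would add incremental (high-watermark) extraction, error handling, and a move into HDFS or object storage, which is precisely the plumbing that Sqoop, Flume, and the commercial connectors package up.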
Given the variety of formats Hadoop data can take, no comprehensive schema management tool yet exists. Hive's metastore, extended via HCatalog, provides a relational schema manager for Hadoop data, yet not all data formats can be described in HCatalog. To date, quite a bit of Hadoop data is defined inside the applications themselves, perhaps using JSON, Avro, RCFile, or Parquet. As with data endpoints and data formats, the right commercial tools can help describe the data in the lake and surface its schemas to end users more readily.
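As a concrete example of a schema living inside application code rather than in a shared catalog, the sketch below declares an Avro schema in Python and uses it to write a small file; unless that schema is also registered with HCatalog, the metastore knows nothing about it. The fastavro package, the record fields, and the output path are assumptions made for the illustration.

```python
# Minimal sketch: an Avro schema defined in the application itself.
from fastavro import parse_schema, writer

# This schema exists only in code -- invisible to HCatalog unless it is
# separately registered there.
clickstream_schema = parse_schema({
    "name": "ClickEvent",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "ts", "type": "long"},  # event time, epoch seconds
    ],
})

records = [
    {"user_id": "u-123", "url": "/home", "ts": 1700000000},
    {"user_id": "u-456", "url": "/cart", "ts": 1700000042},
]

# Write an Avro file; the schema travels with the data, but nothing here
# publishes it to a central metastore for other users to discover.
with open("clicks.avro", "wb") as out:
    writer(out, clickstream_schema, records)
```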