Discovery and Preparation
Due to the flexibility of data formats in Hadoop and other data lake backend storage platforms, it is common to dump data into the lake before fully understanding its schema. In fact, a lot of lake data may be highly unstructured. In any case, the cost effectiveness of Hadoop storage makes it practical to prepare data after it has been acquired. This is more ELT (extract, load, transform) than traditional ETL (extract, transform, load). Eventually, though, to do useful work with a data set, the format of the data must be understood.
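As a minimal sketch of this ELT, schema-on-read pattern, the Python snippet below reads raw records exactly as they were dumped and defers all schema interpretation to read time. The file path and the idea of counting field occurrences are illustrative assumptions, not a prescribed workflow.

    import json

    # ELT / schema-on-read: events were dumped into the lake as-is
    # (here, newline-delimited JSON) with no schema imposed at load time.
    # The path below is hypothetical.
    RAW_PATH = "raw/events.jsonl"

    def read_events(path):
        """Interpret the schema at read time, tolerating malformed rows."""
        with open(path) as f:
            for line in f:
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    continue  # corrupt or unstructured rows are skipped, not rejected at load

    # A first discovery step: find out which fields actually occur,
    # and how often, before committing to a schema.
    field_counts = {}
    for event in read_events(RAW_PATH):
        for key in event:
            field_counts[key] = field_counts.get(key, 0) + 1

    print(field_counts)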
In the open source ecosystem, discovery and preparation can be done at the command line with scripting languages such as Python and Pig. Ultimately, native MapReduce jobs, Pig, or Hive can be used to extract structured, useful data from semi-structured sources. This newly accessible data can feed further analytic queries or machine-learning algorithms. In addition, the prepared data can be delivered to traditional relational databases so that conventional business intelligence tools can query it directly.
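As a rough sketch of the kind of extraction such a job performs (shown here in plain Python rather than MapReduce, Pig, or Hive; the file names and the Apache-style log format are assumptions made for illustration):

    import csv
    import re

    # Hypothetical example: pull structured fields out of semi-structured
    # web-server logs, producing a flat file a relational database can load.
    LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3})')

    with open("access.log") as logs, open("requests.csv", "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["ip", "timestamp", "method", "path", "status"])
        for line in logs:
            match = LOG_LINE.match(line)
            if match:  # keep only rows that fit the expected structure
                writer.writerow(match.groups())

The resulting flat file could then be bulk-loaded into a relational database, where conventional business intelligence tools can query it.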
Commercial products in the data discovery and basic data preparation space provide web-based interfaces (although some are basic on-premises tools for so-called "data blending") for investigating raw data and then devising strategies for cleansing it and pulling out relevant fields. Such commercial tools range from lightweight, spreadsheet-like interfaces to heuristic-based analysis interfaces that help guide data discovery and extraction.