Scheduling and Workflow
Orchestration is a mandatory capability of the data lake. Scheduling refers to launching jobs at specified times or in response to an external trigger. Workflow refers to specifying job dependencies and executing jobs so that those dependencies are respected. A job could be a form of data acquisition, data transformation, or data delivery. In the context of a data lake, scheduling and workflow both need to interface with the underlying data storage and data processing platforms. For enterprise use, scheduling and workflow should be defined through a graphical user interface rather than the command line.
The open source ecosystem provides some low-level tools such as Oozie, Azkaban, and Luigi. These tools offer command-line interfaces and file-based configuration, and they are useful mainly for orchestrating work within Hadoop. A small example of the dependency idea follows.
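To make the dependency concept concrete, here is a minimal sketch using Luigi, one of the tools mentioned above. The task and file names are hypothetical; the point is that the transformation task declares a dependency on the acquisition task, and the workflow engine runs them in the right order once something (cron or an external trigger) invokes the terminal task.

```python
# Minimal sketch of a dependency-driven workflow, assuming Luigi is installed
# (pip install luigi). Task names and file paths are hypothetical.
import luigi


class AcquireData(luigi.Task):
    """Land raw data in the lake (data acquisition)."""

    def output(self):
        return luigi.LocalTarget("raw/events.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("user,action\nalice,login\n")


class TransformData(luigi.Task):
    """Clean the raw file (data transformation); runs only after AcquireData."""

    def requires(self):
        # Declares the dependency: Luigi will not run this task until
        # AcquireData's output exists.
        return AcquireData()

    def output(self):
        return luigi.LocalTarget("clean/events.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())


if __name__ == "__main__":
    # Scheduling is external: cron or a trigger invokes this entry point,
    # and Luigi executes the tasks in dependency order.
    luigi.build([TransformData()], local_scheduler=True)
```

Note that the workflow itself is just code and configuration here; there is no graphical interface, which is why these tools tend to stay in the hands of developers rather than a broader IT audience.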
Commercial data integration tools provide high-level interfaces to scheduling and workflow, making such tasks more accessible to a wider range of IT professionals.