As the use of Big Data platforms such as Hadoop and Apache Spark becomes more mainstream, a clearer separation of duties between IT organizations and data scientists needs to emerge. IT operations teams should exert more control over data preparation, which in turn will free up data scientists to spend more time analyzing data rather than massaging it.
With that construct in mind, Pentaho, a unit of Hitachi Corp., today announced Pentaho Business Analytics 7.0, which provides a set of visual tools that makes it simpler for IT operations teams to manage the flow of data within any given pipeline.
Chuck Yarbrough, senior director of solutions marketing and management for Pentaho, says that with this release of its analytics software, Pentaho is moving more of the data preparation process into a discrete set of functions that internal IT operations teams can use without having to master arcane extract, transform and load (ETL) tools.
“We don’t think you need to have an ETL specialist,” says Yarbrough.
Pentaho Business Analytics uses metadata injection techniques developed by Pentaho to provide a set of graphical tools for managing the data preparation process. The basic idea is to allow internal IT organizations to visually inspect each part of the Big Data preparation process without requiring help from a data scientist, says Yarbrough.
A lot of data scientists today spend far too much time on data plumbing issues. At a time when most data scientists earn six-figure salaries to create Big Data analytics applications, using them to perform data preparation and integration tasks is a gigantic waste of time and money. At the same time, it’s apparent that IT organizations now need access to data preparation tools that the average IT generalist can use to accomplish those tasks without necessarily having to master ETL tools that were primarily designed for another data management era.
The challenge is making it possible for data scientists and IT operations teams to work together using a more hand-in-glove approach that ideally removes the data scientist as much as possible from the process of managing the flow of data in and out of any data lake.