While there’s a lot of enthusiasm for all things relating to Big Data, hiring a data scientist can be a frustrating experience for all concerned. All too often, data scientists wind up spending more time cleaning data than actually analyzing it.
To overcome that challenge, Pentaho, a unit of Hitachi, today released an update to its namesake analytics application. The update adds support for metadata injection, which makes it easier to ingest and transform large amounts of similar data. Version 6.1 of Pentaho, says Chuck Yarbrough, director of Pentaho solutions, makes use of metadata to identify patterns in data sources. That metadata is then shared with the Pentaho Data Integration engine at run time to dramatically accelerate the extract, transform and load (ETL) process.
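To make the general idea concrete, here is a minimal Python sketch of a metadata-driven ingestion template. It is not Pentaho's API or implementation. The names `FieldSpec`, `SourceMetadata` and `run_template`, along with the two sample sources, are hypothetical and exist only for illustration. The point is that one generic transformation is written once, and the per-source details (delimiter, column names, types) are supplied as metadata at run time rather than hand-coded for each source.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldSpec:
    """Hypothetical description of one column in a source file."""
    source_name: str               # column name as it appears in the incoming file
    target_name: str               # column name in the unified output
    cast: Callable[[str], object]  # conversion applied to the raw string value


@dataclass
class SourceMetadata:
    """Hypothetical per-source metadata supplied to the template at run time."""
    delimiter: str
    fields: list[FieldSpec]


def run_template(raw_text: str, meta: SourceMetadata) -> list[dict]:
    """Apply one generic transformation, configured entirely by metadata."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=meta.delimiter)
    rows = []
    for record in reader:
        # Rename and cast each field according to the injected metadata.
        rows.append(
            {f.target_name: f.cast(record[f.source_name].strip()) for f in meta.fields}
        )
    return rows


# Two similar sources that differ only in layout (illustrative data).
STORE_A = SourceMetadata(
    delimiter=",",
    fields=[FieldSpec("id", "customer_id", int), FieldSpec("amt", "amount", float)],
)
STORE_B = SourceMetadata(
    delimiter=";",
    fields=[FieldSpec("CustomerNo", "customer_id", int), FieldSpec("Total", "amount", float)],
)

rows_a = run_template("id,amt\n1,9.50\n2,12.00\n", STORE_A)
rows_b = run_template("CustomerNo;Total\n7;3.25\n", STORE_B)
```

Under this approach, onboarding another similar source means adding a metadata entry rather than writing and maintaining another transformation.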
Given the repetitive nature of onboarding data, Yarbrough says IT departments are increasingly being asked to offload this process from data scientists, who usually command six-figure salaries. Unfortunately, most organizations are not going to recoup their investment in those data scientists until the data pipeline on which they depend becomes more automated. To help facilitate that process, version 6.1 of Pentaho also now includes a series of blueprints that IT organizations can follow to enable self-service data ingestion, so that someone in the IT department no longer has to be involved every time data is ingested.
Other enhancements in version 6.1 include the ability to create virtual data sets across a wider range of data blends and the ability to automatically model and publish analytic data. In addition, Yarbrough notes Pentaho has made it easier to collaboratively share metrics across a larger number of users.
Data scientists may be the rock stars of IT for the moment. But right now, they also represent one of the most expensive IT tickets in town. For that reason, every minute a data scientist or analyst spends on what amounts to data maintenance work winds up costing the business not only a lot of money in manual labor, but also more time before it gains any actionable insight from all the data being collected in the first place.