Conventional wisdom holds that if you want to do Big Data, you need a data lake. This is true up to a point, but as the technology unfolds, it is becoming clear that not all data lakes are created equal, and if you don’t build it the right way you could wind up with more problems than you’ve solved.
To be fair, the task placed on the data lake is not an easy one. Enormous volumes of data, the vast majority of it unstructured, have to be ingested, digested, inspected and corrected in order to produce actionable intelligence for a wide range of enterprise processes and strategic initiatives. This requires a lot of powerful hardware and cutting-edge software, not just to produce results but to do so quickly in order to preserve the value of data that is largely ephemeral in nature.
So the first question enterprise executives must ask themselves is, do I really need a data lake and, if so, what kind?
Adam Wray, president and CEO of NoSQL database developer Basho, pulls no punches when he says data lakes are downright evil. As he explained to Forbes recently, data lakes interrupt the normal data supply chain that transforms mere bits into actual knowledge. The lake is intended to store everything and then figure out what to do with it later. Without data being prioritized before it enters the supply chain, the data lake becomes just a giant collection of junk that may or may not contain some priceless artifacts. Even with advanced analytics at its disposal, it will produce diminished results compared to a more distributed approach, at higher latency and much greater cost.
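The alternative to "store everything, sort it out later" can be sketched in a few lines: classify each record as it arrives and route it accordingly. This is a minimal illustration only; the priority tiers, field names, and routing rules are hypothetical and stand in for whatever rules a given enterprise would define.

```python
# Sketch of prioritizing data at ingest instead of dumping everything
# into the lake. Tiers, sources, and schema fields are illustrative
# assumptions, not any vendor's API.

def classify(record: dict) -> str:
    """Assign a priority tier before the record enters the pipeline."""
    if record.get("source") in {"payments", "fraud-alerts"}:
        return "hot"    # time-sensitive: analyze immediately
    if record.get("schema"):
        return "warm"   # structured: route to curated storage
    return "cold"       # unstructured: archive, revisit later

def ingest(records):
    """Route each incoming record to its tier rather than one big pool."""
    routed = {"hot": [], "warm": [], "cold": []}
    for r in records:
        routed[classify(r)].append(r)
    return routed

batch = [
    {"source": "payments", "schema": "txn_v2", "value": 42},
    {"source": "clickstream"},                 # no schema: goes cold
    {"source": "crm", "schema": "contact_v1"},
]
print({tier: len(rs) for tier, rs in ingest(batch).items()})
# → {'hot': 1, 'warm': 1, 'cold': 1}
```

The point of the sketch is the ordering: the classification decision happens before storage, so the downstream analytics engine never has to wade through undifferentiated junk.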
This is why some experts are calling for greater deployment of so-called “fog computing.” Moshe Kranc, CTO of Ness Digital Engineering, for example, argues that driving more of the analytics and intelligence to the edge will streamline the process of gleaning knowledge from data while keeping the results close to the user, where they can produce the most value. At the same time, edge facilities will be able to communicate with each other to share data and resources when necessary, and to push information to central facilities for more macro-level strategic initiatives. While still large, these centralized data engines will be dramatically less complex than a standard data lake because much of the data coming in will have already been processed and analyzed on the edge.
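The fog pattern described above can be reduced to a small sketch: each edge node boils its raw readings down to a compact summary, and only those summaries travel to the central facility. The node names and summary format here are illustrative assumptions, not a real fog-computing API.

```python
# Fog-style sketch: local analytics at the edge, lightweight rollup at
# the center. The central engine never sees the raw readings.

from statistics import mean

def edge_summary(node_id, readings):
    """Edge-local analytics: keep the insight, drop the raw volume."""
    return {
        "node": node_id,
        "count": len(readings),
        "mean": mean(readings),
        "max": max(readings),
    }

def central_rollup(summaries):
    """Central engine combines pre-digested summaries, not raw data."""
    total = sum(s["count"] for s in summaries)
    weighted = sum(s["mean"] * s["count"] for s in summaries) / total
    return {"nodes": len(summaries), "readings": total,
            "global_mean": weighted}

edges = [
    edge_summary("edge-a", [21.0, 22.5, 23.1]),
    edge_summary("edge-b", [19.8, 20.4]),
]
print(central_rollup(edges))
```

Note what crosses the network: two four-field summaries instead of five raw readings; at sensor scale, that ratio is what keeps the central engine "dramatically less complex" than a lake that ingests everything raw.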
In the end, it comes down to how the enterprise wants to manage this flood of data coming in from the Internet of Things, says Satish Vutukuru, data scientist at data lake developer Zaloni. While the infrastructure itself is massive, the real power of the data lake lies in the management and governance architecture, which is in a constant state of evolution. So one of the worst things the enterprise can do at this point is begin construction of the lake without a clear idea of how it is to function, and since few organizations possess the internal expertise to take on a project of this magnitude, the best approach is to partner with someone who does. At the same time, it is important to understand that data science is a fluid concept, so any approach to analytics and high-volume data processing must incorporate the flexibility to adapt and change as new technologies enter the market.
There are also the operational aspects of the data lake to consider, says Brad Anderson, vice president of big data informatics at Liaison Technologies. Recent surveys suggest that most data scientists devote the majority of their time simply to preparing raw data for analysis, such as merging and de-duplicating records, rather than to performing the actual analytics. Anderson advocates dPaaS (Data Platform as a Service) as an effective means of creating agile Big Data architectures that are both easy to implement and flexible enough to accommodate new data formats and analytics applications as they emerge from development.
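The kind of prep work that consumes most of that time is mundane but fiddly. A minimal sketch of merging two record sources and de-duplicating on a key looks like this; the field names and sources are hypothetical, and real pipelines face far messier matching rules than a lowercase comparison.

```python
# Sketch of basic data prep: merge records from multiple sources and
# de-duplicate on a normalized key. Later sources win on field conflicts.

def merge_dedup(*sources, key="email"):
    """One surviving record per key, with fields pooled across sources."""
    merged = {}
    for source in sources:
        for rec in source:
            k = rec[key].strip().lower()   # normalize before matching
            merged.setdefault(k, {}).update(rec)
    return list(merged.values())

crm = [{"email": "Ann@example.com", "name": "Ann"}]
web = [{"email": "ann@example.com", "plan": "pro"}]
print(merge_dedup(crm, web))
# one record, carrying fields from both sources
```

Even this toy version hints at why prep dominates the workload: the normalization rule, the conflict policy, and the choice of key are all judgment calls that must be made before any analytics can run.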
The most disturbing aspect of developing a Big Data analytics engine is not the actual infrastructure or architecture to be deployed, but the urgency with which it needs to be done. High-speed analytics are already tearing down legacy business models and replacing them with agile, service-based ones. So the longer it takes today’s enterprise to leverage the volumes of data at its disposal, the greater the risk of becoming obsolete in the new digital economy.
IT has always been under the gun to provide the latest and greatest in data technology, but this time the stakes are sky high. Failure to execute won’t merely diminish performance; it could put the enterprise out of business.
Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.