The enterprise is formulating big plans for Big Data, but first there is the little matter of deploying big infrastructure to handle the load.
To be sure, not all of the data generated by legions of smartphone apps and RF-connected sensors will need to be compiled in a central repository. Much of it will be too fleeting to be of any use after a few minutes: think optimized search results for recent logins or sales specials based on the buying histories of in-store customers. These are best handled by automated on-site or near-site systems.
Still, large amounts of data will head back to the data center where it can be used to chart historical trends, update user records and generally optimize and refine business processes. For these volumes, the most readily available solution is the data lake, which is part repository, part warehouse and part analytics engine—but wholly expensive and complex.
This is why organizations should resist the temptation to rush the deployment of mass infrastructure and instead take a more careful approach to the data lake, says Mindtree’s Akshey Gupta. When the goal is to lower storage and archival costs to acceptable levels while incorporating the architectures needed to derive value from Big Data, the best approach is to begin with some fundamental questions: what type of data will go into the lake, where is it coming from, and can it be trusted? There are also considerations around the product the lake is intended to produce, such as who will make use of it, and how. In between lie numerous issues regarding data integration, handling, security and workflow processing. Get these wrong, and your data lake may turn into a data landfill.
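Those fundamental questions can be made concrete as a minimal ingestion gate. The sketch below is purely illustrative (the `DatasetProfile` fields and `admit_to_lake` rule are hypothetical, not from any particular product): a dataset is admitted to the lake only when its source has been vetted and at least one consumer is known, which is one simple way to keep a lake from becoming a landfill.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProfile:
    """Hypothetical metadata record answering the planning questions."""
    name: str
    source: str            # where the data is coming from
    data_format: str       # what type of data it is, e.g. "json", "csv"
    trusted: bool          # has the source been vetted?
    consumers: list = field(default_factory=list)  # who will use it, and how

def admit_to_lake(profile: DatasetProfile) -> bool:
    # Refuse data with unvetted provenance or no known audience.
    return profile.trusted and len(profile.consumers) > 0

good = DatasetProfile("pos_sales", "store-registers", "json",
                      trusted=True, consumers=["bi-team"])
bad = DatasetProfile("mystery_dump", "unknown-ftp", "csv", trusted=False)

print(admit_to_lake(good))  # True
print(admit_to_lake(bad))   # False
```

A real catalog would of course track far more (retention, lineage, access policy), but even this small gate forces the "who, where, and why" conversation before storage costs are incurred.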
Most data lakes will be built to support Hadoop workloads, since this is the framework that kicked off the Big Data movement to begin with, says Actian’s Mike Hoskins. But in order to make use of Hadoop, it is best to access it through a familiar interface like SQL. This is both a problem and an opportunity for the enterprise, as there are currently few SQL capabilities in Hadoop products. New developments can help in this regard: YARN (Yet Another Resource Negotiator) decouples resource management from data processing to allow for broader application support, and SQL-on-Hadoop engines are bringing familiar query semantics to the platform. Still, there is a way to go before Hadoop can effectively support common business intelligence tools. Enterprises or analytics developers who can successfully thread the Hadoop-SQL needle will be in a strong position to push Big Data into the economic mainstream.
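The appeal of the SQL-on-Hadoop approach is that analysts keep writing ordinary SQL while the engine worries about the cluster underneath. The sketch below uses Python's built-in sqlite3 as a local stand-in for such an engine (the table and data are invented for illustration); the point is only that the same BI-style aggregate query an analyst writes today is what a SQL-on-Hadoop layer would push down to Hadoop instead.

```python
import sqlite3

# sqlite3 stands in here for a SQL-on-Hadoop engine; in production the
# same SQL would be executed against data stored in the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "login"), (1, "purchase"), (2, "login"), (2, "login")],
)

# A typical BI aggregate: how often does each user action occur?
rows = conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall()
print(rows)  # [('login', 3), ('purchase', 1)]
```

Threading the "Hadoop-SQL needle" means making queries like this run at lake scale without the analyst having to learn a new language.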
Yet another crucial piece of the data lake is the gateway, says John Thielens, VP of technology for data integration specialist Cleo. If the scale of Big Data is only half what the experts predict, then the old ways of discovering, ingesting, extracting and sequencing data will be woefully inadequate for the real-time purposes that some analytics require. A gateway optimized for Big Data supports tasks like “schema on read” while also preserving the raw data, which speeds up handling without compromising data sets. Such gateways also enable elastic scalability to foster higher resource utilization, and handle functions like security and governance on the spot while still allowing collaborative community management for improved workflow and data routing.
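The "schema on read" idea mentioned above can be shown in a few lines. In this minimal sketch (the function names and in-memory store are illustrative, not any vendor's API), the gateway keeps every record exactly as it arrived and imposes a schema only at read time, so later consumers can project the same raw data in different ways.

```python
import json

RAW_STORE = []  # raw records preserved exactly as they arrived

def ingest(raw_line: str) -> None:
    """Gateway ingest: keep the raw record; no schema is imposed yet."""
    RAW_STORE.append(raw_line)

def read_with_schema(fields):
    """Schema on read: project each raw record onto the requested fields."""
    for line in RAW_STORE:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

ingest('{"user": "ann", "device": "phone", "ts": 1}')
ingest('{"user": "bob", "ts": 2}')  # a missing field is tolerated

result = list(read_with_schema(["user", "device"]))
print(result)
# [{'user': 'ann', 'device': 'phone'}, {'user': 'bob', 'device': None}]
```

Because nothing is discarded at ingest, a second team could later read the same store with a different field list, which is precisely why the raw data is worth preserving.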
Search will also emerge as a primary tool in the data lake, but it won’t be the same old search that resides in traditional enterprise environments, according to Datanami’s Alex Woodie. In the first place, the new search will have to incorporate many of the intuitive, contextual algorithms that most Web-facing platforms like Google and Bing provide. When data scales into millions of records, simple keyword and page-ranking results won’t be of much use. At the same time, search will require a high degree of intelligence so it can actually learn about the data under management and how it is being used and then tailor results accordingly.
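One way to picture search that "learns about the data and how it is being used" is a ranker that blends keyword relevance with a usage signal. The toy below is a deliberately simplified sketch (the documents, scoring weights, and click counter are all invented): results that users keep opening get boosted over time, so the same query can return different rankings as behavior accumulates.

```python
from collections import Counter

DOCS = {
    "d1": "quarterly sales report for retail stores",
    "d2": "sensor logs from retail store network",
    "d3": "sales forecast model notes",
}
clicks = Counter()  # usage signal: which results people actually open

def search(query: str):
    """Rank by keyword overlap, boosted by how often a doc was opened."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in DOCS.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap + 0.1 * clicks[doc_id], doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

before = search("retail sales")   # "d1" matches both terms, ranks first
clicks["d3"] += 20                # analysts keep opening the forecast notes
after = search("retail sales")    # usage now lifts "d3" to the top
print(before[0], after[0])
```

Production-grade implementations use far richer signals (context, lineage, access patterns), but the principle is the same: the index adapts to how the lake is actually used rather than relying on static keyword matching.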
It is also important to realize that the data lake is not intended to replace current archiving and warehouse infrastructure, but to supplement it in ways that allow all data to become more productive. This is not an easy task, and it won’t be accomplished overnight, but the enterprise industry is clearly heading in this direction, so individual organizations should at least start making rudimentary plans for Big Data – if only to avoid getting caught flat-footed when the movement kicks into high gear.
Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.