The closer the enterprise gets to implementing Big Data analytics, the more daunting it appears. Even organizations that are well-versed in data warehousing realize that building infrastructure for the so-called “data lake” is a completely different ballgame.
Not only does the data lake require large amounts of computing power and storage, but it also has to be integrated with cutting-edge analytics, automation, orchestration and machine intelligence. And ideally, this state-of-the-art infrastructure should be accessible to the average business executive who has little or no experience in the data sciences.
But as we’ve seen many times, things that seem impossible at the outset are often possible once you put your mind to it. And data lake technology is already starting to make its mark at the top end of the enterprise market and shows every indication of trickling down to the lower tiers.
According to Computerworld’s James Henderson, classic warehousing technology is already evolving to meet the needs of the data lake and Big Data analytics. This is not to say the data lake will replace the warehouse – in fact, the two seem likely to complement each other – but key elements like data ingestion and staging are changing on the warehouse side to meet the needs of Hadoop analytics. For instance, the data lake is more adept at working with raw data, so much of the conditioning and pre-processing that went on in the warehouse data mart becomes unnecessary. This also makes life easier for applications higher up the stack, which no longer have to know which underlying databases to query.
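This shift from pre-processed data marts to raw storage is often called "schema-on-read": structure is imposed at query time rather than at ingestion. A minimal Python sketch of the idea (the event fields and the `query` helper are hypothetical, not from any particular product):

```python
import json

# Raw events land in the lake untouched: no conditioning or
# pre-processing at ingestion time (schema-on-read).
raw_events = [
    '{"user": "alice", "action": "login", "ts": 1000}',
    '{"user": "bob", "action": "purchase", "amount": 42.0, "ts": 1005}',
]

def query(events, predicate):
    """Apply structure only at query time, tolerating missing fields."""
    for line in events:
        record = json.loads(line)
        if predicate(record):
            yield record

# The consuming application filters raw records directly; it never
# needs to know which conditioned data mart would have held them.
purchases = list(query(raw_events, lambda r: r.get("action") == "purchase"))
print(purchases[0]["amount"])  # → 42.0
```

Note that records with different shapes (the login event has no `amount`) coexist in the same store, which is exactly the flexibility, and the risk, that the raw-data approach trades for up-front conditioning.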
But this only works if the enterprise embraces the proper architecture for the data lake. As Infosys’ Abdul Razack told Inside Big Data recently, many early attempts at the data lake resembled jumbled repositories of structured, semi-structured and unstructured data, turning what should be a highly efficient analytics engine into a data wasteland.
Current thinking attempts to reverse-engineer the analytics process: first determine what kinds of results the enterprise hopes to receive, then optimize each component of the lake for those purposes. Invariably, this results in a lot of customization of the architecture, but also in greater support for desirable outcomes like real-time analytics and highly dynamic scalability.
Indeed, most data lake development so far has assumed that the lake’s primary purpose was to collect and store data in its raw, native format in Hadoop and then let analysts make sense of it later, says Third Nature’s Mark Madsen (download required). But it soon became evident that this approach does not scale – not because the technology was unsound, but because the people who need the analytics lacked the skill sets to locate and process the data.
Going forward, the data lake should consist of separate but integrated components for storage, standardization, structuring and processing. This allows data architects and scientists to do what they do best – condition and prepare data for optimal value – while still enabling line of business managers to draw results in a prompt, intuitive fashion.
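As a rough illustration of that separation, the four components could be modeled as distinct stages in a pipeline. The class and function names below are invented for the sketch (they do not refer to any real product), but the division of labor mirrors the one described above: storage keeps data in its original form, while standardization, structuring and processing add value in later, independent steps.

```python
class Storage:
    """Storage layer: keeps ingested data exactly as it arrived."""
    def __init__(self):
        self._blobs = []

    def ingest(self, raw):
        self._blobs.append(raw)

    def scan(self):
        return iter(self._blobs)

def standardize(raw):
    """Standardization layer: normalize messy field names."""
    return {k.lower().strip(): v for k, v in raw.items()}

def structure(record):
    """Structuring layer: impose a schema for downstream analysts."""
    return {"user": record.get("user"), "amount": float(record.get("amount", 0))}

def process(records):
    """Processing layer: the business-facing aggregation."""
    return sum(r["amount"] for r in records)

lake = Storage()
lake.ingest({"User ": "alice", "Amount": "10"})   # raw, inconsistent input
lake.ingest({"USER": "bob", "amount": "5"})

total = process(structure(standardize(r)) for r in lake.scan())
print(total)  # 15.0
```

Because each stage is a separate component, a data architect can refine `standardize` or `structure` without touching storage, while a line-of-business user only ever calls `process`.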
This is also why many of the key aspects of the data lake – things like preparation, analytics and integration – are becoming automated, says Data Informed’s Jelani Harper. By removing much of the rote work from human operators, the automated lake frees up highly trained data scientists and architects for high-level management and development, while non-data-savvy users can still derive the results they need without a lot of retraining.
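One concrete example of the rote preparation work that lends itself to automation is type inference on incoming records, so that a business user never has to declare a schema by hand. A hypothetical sketch, assuming string-valued raw fields:

```python
def infer(value):
    """Automatically coerce a raw string to the narrowest numeric type."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # leave genuine text alone

# A raw row as it might arrive from a CSV export: everything is a string.
row = {"user": "alice", "age": "34", "score": "7.5"}
prepared = {k: infer(v) for k, v in row.items()}
print(prepared)  # {'user': 'alice', 'age': 34, 'score': 7.5}
```

Trivial on one row, but applied across millions of records this is exactly the kind of mechanical work that, once automated, no longer consumes a data scientist’s time.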
At the same time, automation enables the kind of real-time or near-real-time functionality that will be vital in the fast-moving world of Big Data and the Internet of Things. Indeed, much of the analytics taking place will target opportunities in highly fluid, highly ephemeral business environments and will in fact be instigated by applications that are driven by their own sets of automated triggers.
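The application-driven triggers described above can be sketched as a simple threshold rule attached to a stream consumer. The threshold, the stand-in stream, and the handler below are all illustrative assumptions, not a real event-processing API:

```python
THRESHOLD = 100   # assumed business rule: values above this are opportunities
alerts = []

def on_event(metric_value):
    """Trigger registered by the application itself, no human in the loop."""
    if metric_value > THRESHOLD:
        alerts.append(f"opportunity detected: {metric_value}")

stream = [20, 50, 140, 90, 210]   # stand-in for a real-time feed
for value in stream:
    on_event(value)

print(len(alerts))  # 2
```

The point is that the analytic fires the moment a value crosses the rule, rather than waiting for someone to run a batch query after the opportunity has passed.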
The idea of data-driven constructs acting on their own to engage customers and drive business processes may be unnerving to some, but it will likely become a crucial factor in the emerging app-centric economy. To meet this need, the data lake must not only be large enough to encompass the data load, but responsive enough to capitalize on opportunities.
And that will require a clear set of expectations and architectural elements before the first compute module is unboxed.
Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.