When the Cloud Meets Big Data

    Big Data is not an ironic term. It really is big. For the enterprise looking to make use of it, this leads to a perplexing problem: Do you host your Big Data environment in the cloud, or create a data lake on premises?

    Each approach has its pluses and minuses. The cloud is a good way to get started, of course, but since Big Data is all about scale, the cost can very quickly get out of control. On-premises infrastructure requires up-front capital, but modular hardware has brought costs down dramatically. And while the cloud lets you offload all the hands-on maintenance and management of Big Data infrastructure, there is also something to be said for having that expertise in-house when it comes time to leverage Big Data in novel ways.
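The trade-off above can be sketched with a back-of-the-envelope model. All dollar figures below are hypothetical placeholders, not quotes from any provider; the point is the shape of the curves: cloud spend grows linearly with data volume, while on-premises spend is dominated by an up-front capital floor that is amortized over the hardware's lifetime.

```python
# Hypothetical cost model: cloud vs. on-premises data lake storage.
# Every constant here is an illustrative assumption, not a real price.

CLOUD_PER_TB_MONTH = 25.0        # hypothetical $/TB/month (storage + egress)
ONPREM_CAPEX_PER_TB = 400.0      # hypothetical up-front hardware cost per TB
ONPREM_LIFETIME_MONTHS = 48      # amortization window for that capital
ONPREM_OPEX_PER_TB_MONTH = 5.0   # hypothetical power, space and staff per TB
ONPREM_MIN_CLUSTER_TB = 100.0    # smallest practical on-prem footprint

def monthly_cost_cloud(tb: float) -> float:
    """Cloud cost scales linearly: you pay only for what you store."""
    return tb * CLOUD_PER_TB_MONTH

def monthly_cost_onprem(tb: float) -> float:
    """On-prem cost has a floor: you buy a minimum cluster up front,
    then amortize that capital whether or not it is full."""
    provisioned = max(tb, ONPREM_MIN_CLUSTER_TB)
    amortized_capex = provisioned * ONPREM_CAPEX_PER_TB / ONPREM_LIFETIME_MONTHS
    return amortized_capex + provisioned * ONPREM_OPEX_PER_TB_MONTH

if __name__ == "__main__":
    for tb in (10, 100, 1000):
        print(f"{tb:>5} TB: cloud ${monthly_cost_cloud(tb):>9,.0f}/mo, "
              f"on-prem ${monthly_cost_onprem(tb):>9,.0f}/mo")
```

Under these assumed numbers the cloud is cheaper for a small starter workload, but once data volume climbs past the on-prem cluster's capital floor, the linear cloud bill overtakes the amortized hardware cost, which is the dynamic the paragraph above describes.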

    According to tech consultant Andrew Froehlich, however, it is no accident that many early local Big Data projects failed. Not only is the learning curve steep, but as more and more of the enterprise data load gravitates to the cloud, the migration challenges start to mount. This is why he argues that Big Data in the cloud has reached a tipping point: because normal data operations themselves are becoming increasingly cloud-native. And this need to keep data and analytics relatively close together will only increase as real-time performance requirements start to take center stage.

    Still, to cloud or not to cloud your Big Data is not a simple question. First, there is the decision as to which cloud, or clouds, to employ, and then you must create the proper environment. As Clarity Insights noted recently, the right decision will depend on multiple factors, including the time-to-value for your business model, your desired level of risk, your ability to manage remote resource configurations and enforce consistency across architecture and DevOps teams, and the provider’s ability to quickly implement the appropriate dev, test and staging environments.

    New Big Data cloud platforms are already hitting the channel, offering users multiple ways to implement both the storage and analytics aspects of the data lake quickly and easily. A California company called Cask recently unveiled the Cask Data Application Platform (CDAP) Cloud Sandbox for Amazon Web Services (AWS). The system is a fully configured but scale-limited single-node instance of the CDAP offering, enabling users to create a working Big Data engine without having to first set up a Hadoop cluster. The package includes the latest version of CDAP, 4.2, as well as an SDK, the runtime, a user interface, and a command-line interface with related tools.

    Meanwhile, cloud developer Talend is out with a multi-cloud integration solution that allows organizations to distribute Big Data workloads across AWS, Azure, Google and other providers. The Talend Data Fabric features drag-and-drop visual tools for functions like warehousing, NoSQL management and messaging, and provides built-in support for the Cloudera Altus Platform as a Service solution. It also supports rapid migration to the Snowflake cloud warehouse via a bulk load connector that increases transfer speeds 20-fold.
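A minimal sketch of why bulk loading outruns row-at-a-time transfer in general (this is not Talend's implementation, and the timing constants are hypothetical): every round trip to a warehouse carries a fixed overhead, so batching many rows per request amortizes that overhead across the batch.

```python
# Illustrative model of bulk vs. row-at-a-time loading.
# Both constants are assumptions chosen only to show the effect.

PER_REQUEST_OVERHEAD_MS = 5.0   # hypothetical network/commit cost per request
PER_ROW_TRANSFER_MS = 0.01      # hypothetical serialization cost per row

def load_time_ms(rows: int, batch_size: int) -> float:
    """Total time = (number of requests) * fixed overhead + per-row cost."""
    requests = -(-rows // batch_size)  # ceiling division
    return requests * PER_REQUEST_OVERHEAD_MS + rows * PER_ROW_TRANSFER_MS

if __name__ == "__main__":
    rows = 1_000_000
    print(f"row-at-a-time: {load_time_ms(rows, 1) / 1000:,.0f} s")
    print(f"bulk (10k/batch): {load_time_ms(rows, 10_000) / 1000:,.1f} s")
```

The fixed per-request overhead dominates at batch size 1 and all but disappears at large batch sizes, which is the general mechanism behind the kind of speedup bulk load connectors advertise.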

    The cloud most certainly has the scale to suit just about anyone’s Big Data requirements, and with proper data management – most likely utilizing the same Big Data tools that drive business insight – there are ways to prevent both scale and costs from becoming unbearable. But there are still questions surrounding security, availability, performance, multi-tenancy and everything else that comes with cloud infrastructure.

    As with any scale-out architecture, the devil is in the details, which means the enterprise will need to know exactly what it hopes to gain from cloud-based Big Data infrastructure before it makes a firm commitment.

    Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.
