The Next Challenge for Big Data: Geo-Distributed Architectures

    Few enterprises have the means to build their own scale-out infrastructure for Big Data collection and analytics. That means much of the workload will be ported to the cloud.

    But since it is neither wise nor practical to push all that data to a single cloud facility, organizations will need to develop the skills and technology to manage Big Data operations across multiple data centers, which will most likely be spread across large geographic areas.

    This isn’t as easy as it seems, however. As Morpheus Data’s Darren Perucci points out to DZone, bursting data onto the cloud is still more of a theory than a practice. It isn’t enough to simply push volumes onto third-party infrastructure; it has to involve authentication, usage tracking, performance monitoring, and a host of other functions. At the same time, distributed cloud environments incorporate a vast array of platforms, formats, protocols and other elements, making it difficult to effectively coordinate resource consumption and data flow, even in open source environments. Emerging systems and services are smoothing out many of the rough edges, of course, but we are still a long way from an integrated cloud environment capable of functioning across multiple data centers.

    One of the keys to such a scheme for Big Data workloads is to embed cross-data center coordination into the streaming module of the database cluster framework. Confluent recently took this step with the Confluent Enterprise version of Apache Kafka, providing critical tools like multi-data center (MDC) replication, automatic load balancing and cloud migration. The system allows the enterprise to establish secure cluster replication across geo-distributed infrastructure while maintaining centralized configuration management. It also takes care of the synchronization between clusters, as well as SSL encryption and SASL-based authentication against Kerberos and Active Directory.
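
    As a rough illustration of what SSL encryption plus Kerberos authentication means at the client level, the sketch below builds one connection configuration per data center. It is not Confluent's replication tooling: the function name, cluster names, broker hosts, principal and file paths are all hypothetical, and the property keys follow the librdkafka naming used by Kafka's non-Java clients, so they should be checked against the client actually in use.

```python
# Sketch only: builds per-data-center Kafka client settings for
# SASL_SSL with Kerberos (GSSAPI). All hosts and paths are hypothetical.

def secure_client_config(bootstrap_servers, ca_path, principal, keytab_path):
    """Return a librdkafka-style config dict for an encrypted, Kerberos-authenticated client."""
    return {
        "bootstrap.servers": ",".join(bootstrap_servers),
        "security.protocol": "SASL_SSL",        # TLS transport plus SASL authentication
        "sasl.mechanisms": "GSSAPI",            # GSSAPI is the SASL mechanism for Kerberos
        "sasl.kerberos.service.name": "kafka",  # assumed broker service principal name
        "sasl.kerberos.principal": principal,
        "sasl.kerberos.keytab": keytab_path,
        "ssl.ca.location": ca_path,             # CA bundle used to verify broker certificates
    }

# Hypothetical clusters in two regions.
CLUSTERS = {
    "us-east": ["kafka1.us-east.example.com:9093", "kafka2.us-east.example.com:9093"],
    "eu-west": ["kafka1.eu-west.example.com:9093"],
}

configs = {
    name: secure_client_config(
        brokers,
        ca_path="/etc/kafka/ca.pem",
        principal="replicator@EXAMPLE.COM",
        keytab_path="/etc/kafka/replicator.keytab",
    )
    for name, brokers in CLUSTERS.items()
}

for name, cfg in configs.items():
    print(name, cfg["bootstrap.servers"], cfg["security.protocol"])
```

    Keeping the shared security settings in one function and varying only the broker list per site is a small-scale version of the centralized configuration management described above.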

    Since many Big Data processes have to be automated, this same functionality must extend to multiple data centers as well. Snowflake Computing recently added new resilience and fault-tolerance capabilities to its Elastic Data Warehouse platform to maintain the high-speed performance that users expect from single-entity data environments. The system now enables automatic scalability with no delays or operator intervention, as well as improved dashboarding and reporting for faster query management, plus data milestoning and continuous data access for improved replication and disaster recovery. In this way, the company says it can support advanced Hadoop-based workloads alongside normal business intelligence and reporting applications in the same warehouse.

    Much of the Big Data analytics process is also expected to take place on in-memory infrastructure to provide faster transfer between storage and processing. This can become problematic in distributed environments due to the need to quickly compile data from diverse sources and deliver results to users who may be some distance away from the processing center. This is what GigaSpaces is hoping to address with the new XAP 12 platform, which features an open and decoupled core to support high-performance in-memory data grids. The solution supports millisecond performance in cross-platform architectures utilizing both RAM and SSD storage. At the same time, it enables multi-center replication for both recovery operations and data localization requirements, as well as full session replication to cut latency across distributed clusters.

    As mentioned above, however, none of these solutions addresses the full gamut of challenges involved in distributed Big Data environments. This will likely lead to a layered approach at most enterprises, with the most critical, time-sensitive applications hosted close to home while multiple regional and edge solutions provide results to users and other stakeholders wherever they reside.
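
    The layered approach can be pictured as a simple placement rule: critical workloads stay in the core data center, and everything else goes to whichever regional or edge site is closest to the user. The sketch below is only an illustration of that idea. The site names, tiers and latency figures are invented, and a real deployment would weigh cost, capacity and data-residency rules as well.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    tier: str         # "core", "regional" or "edge"
    latency_ms: dict  # hypothetical round-trip estimates keyed by user region

def choose_site(sites, user_region, critical):
    """Pick the core site for critical work, otherwise the nearest non-core site."""
    if critical:
        candidates = [s for s in sites if s.tier == "core"]
    else:
        candidates = [s for s in sites if s.tier != "core"]
    # min() raises ValueError if no site matches the requested tier.
    return min(candidates, key=lambda s: s.latency_ms[user_region])

# Invented example topology.
SITES = [
    Site("hq-datacenter", "core", {"us": 10, "eu": 90}),
    Site("frankfurt-cloud", "regional", {"us": 95, "eu": 15}),
    Site("nyc-edge", "edge", {"us": 5, "eu": 100}),
]

print(choose_site(SITES, "us", critical=True).name)
print(choose_site(SITES, "eu", critical=False).name)
print(choose_site(SITES, "us", critical=False).name)
```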

    In time, as networking and automation tools become more sophisticated, this should evolve into an integrated data analytics presence that extends across the wide area, providing both the scale and resilience the enterprise needs to mine real value from all accumulated data.

    Arthur Cole writes about infrastructure for IT Business Edge. Cole has been covering the high-tech media and computing industries for more than 20 years, having served as editor of TV Technology, Video Technology News, Internet News and Multimedia Weekly. His contributions have appeared in Communications Today and Enterprise Networking Planet and as web content for numerous high-tech clients like TwinStrata and Carpathia. Follow Art on Twitter @acole602.

    Arthur Cole
    With more than 20 years of experience in technology journalism, Arthur has written on the rise of everything from the first digital video editing platforms to virtualization, advanced cloud architectures and the Internet of Things. He is a regular contributor to IT Business Edge and Enterprise Networking Planet and provides blog posts and other web content to numerous company web sites in the high-tech and data communications industries.
