CEO, ScaleOut Software
In-memory storage technologies are rapidly gaining ground as enterprises look for help in keeping Big Data environments under control. ScaleOut Software’s CEO William Bain argues that in-memory computing is the best way to support scalable, real-time analytics for highly fluid workloads. By staging data in RAM and using an integrated analytics engine, in-memory solutions overcome two of Hadoop’s main weaknesses: poor I/O performance and heavy batch-scheduling overhead. With a properly configured system, 11-fold improvements in latency are becoming increasingly common.
Cole: It seems odd that the enterprise is already looking beyond solid-state drives and even flash modules with the arrival of in-memory solutions. What are some of the primary applications driving the technology?
Bain: Solid-state drives and flash modules have emerged to provide faster storage alternatives for persistent data that has been traditionally kept on hard disks. However, because they are significantly slower than RAM, SSDs and flash are not alternatives for RAM as primary storage for in-memory computing platforms that provide tightly integrated data storage and analytics. For example, advanced in-memory data grids (IMDGs) enable scalable, real-time analytics for live, fast-changing data hosted in RAM. IMDGs combine the RAM of a server cluster to provide a distributed, in-memory computing solution that is both fast and linearly scalable to handle large workloads. Commercial IMDGs, such as ScaleOut Analytics Server, are employed across a wide variety of use cases, including real-time risk analysis and scalable position-keeping in financial services, smart grid power optimization, real-time recommendation engines, credit card fraud detection, and many more. All of these use cases are characterized by the need to rapidly update live data and concurrently analyze that data to detect patterns that reveal business opportunities or issues.
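To make the idea of pooling the RAM of several servers concrete, here is a minimal single-process sketch. It is not ScaleOut's API: the `TinyGrid` class, its method names, and the CRC32-based key placement are all assumptions chosen for illustration. Each "node" is just a dictionary, keys are assigned to nodes by hash, and an analysis runs per node before the partial results are combined.

```python
import zlib
from typing import Any, Callable, Dict, List


class TinyGrid:
    """Hypothetical toy model of an in-memory data grid (illustration only)."""

    def __init__(self, node_count: int) -> None:
        # Each dict stands in for the RAM of one server in the cluster.
        self.nodes: List[Dict[str, Any]] = [dict() for _ in range(node_count)]

    def _owner(self, key: str) -> int:
        # Assumed placement rule: a stable hash of the key picks the node.
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key: str, value: Any) -> None:
        self.nodes[self._owner(key)][key] = value

    def get(self, key: str) -> Any:
        return self.nodes[self._owner(key)].get(key)

    def analyze(
        self,
        map_fn: Callable[[Any], Any],
        combine_fn: Callable[[Any, Any], Any],
        initial: Any,
    ) -> Any:
        # Step 1: each node reduces only the data it holds.
        partials = []
        for node in self.nodes:
            partial = initial
            for value in node.values():
                partial = combine_fn(partial, map_fn(value))
            partials.append(partial)
        # Step 2: merge the per-node partial results into one answer.
        result = initial
        for partial in partials:
            result = combine_fn(result, partial)
        return result


# Usage: keep live positions in the grid, analyze, update, analyze again.
grid = TinyGrid(node_count=4)
for symbol, quantity in {"ACME": 1200, "GLOBEX": -300, "INITECH": 450}.items():
    grid.put(symbol, quantity)

net_before = grid.analyze(lambda q: q, lambda a, b: a + b, 0)
grid.put("GLOBEX", 100)  # a live update to one position
net_after = grid.analyze(lambda q: q, lambda a, b: a + b, 0)
```

In a real grid the per-node step would run in parallel on separate servers, which is where the scaling comes from; the loop here only shows the shape of the computation.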
Cole: One sticking point with Big Data tools like Hadoop is the lack of real-time analytics. How does IMDG technology help on that score?
Bain: IMDGs, such as ScaleOut Analytics Server, enable real-time data analytics by attacking the well-known performance bottlenecks inherent in Hadoop’s architecture. First, staging data in RAM dramatically reduces the disk I/O Hadoop requires to input a data set for analysis. For example, performance measurements using the TeraSort benchmark have demonstrated an 11X reduction in access latency. Second, IMDGs can eliminate Hadoop’s batch scheduling overhead and streamline data combining by employing a simplified computation model and an analytics engine tightly integrated with the IMDG’s RAM-based storage.
To help combine Hadoop with an IMDG and thereby improve its real-time capabilities, we have recently released a new in-memory computing product called ScaleOut hServer, specifically designed to work with Hadoop. Our ultimate aim is to enable Hadoop to perform real-time analytics, and this release focuses on the data access bottleneck as a first step toward that goal. Instead of storing data on disk within the Hadoop Distributed File System (HDFS), ScaleOut hServer enables Hadoop to use a fast, scalable IMDG. This provides two key benefits. First, live data can be continuously updated while it is being accessed for analysis using standard Hadoop MapReduce applications, eliminating the limitation that HDFS data cannot be updated. Second, ScaleOut hServer can be used as a transparent in-memory cache for HDFS data sets that fit within the IMDG’s memory. In this usage model, when a Hadoop MapReduce job accesses HDFS data, ScaleOut hServer automatically caches the data set in the IMDG. On subsequent runs, Hadoop transparently reads key/value pairs from the IMDG, significantly speeding up data access time.
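The caching model described here follows the general read-through pattern, which can be sketched in a few lines. This is a generic illustration, not ScaleOut hServer's implementation or API: `ReadThroughCache`, `fake_hdfs_read`, and the `/data/trades` path are hypothetical names, and the loader simply returns hard-coded key/value pairs in place of a real HDFS read.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # one key/value pair


class ReadThroughCache:
    """Hypothetical in-memory cache placed in front of a slower data store."""

    def __init__(self, load_from_disk: Callable[[str], Iterable[Record]]) -> None:
        self._load_from_disk = load_from_disk
        self._cached: Dict[str, List[Record]] = {}
        self.disk_reads = 0  # counts how often the slow path was taken

    def read(self, dataset: str) -> List[Record]:
        if dataset not in self._cached:
            # First access: go to the slow store and keep a copy in memory.
            self.disk_reads += 1
            self._cached[dataset] = list(self._load_from_disk(dataset))
        # Later accesses are served from memory without touching the store.
        return self._cached[dataset]


def fake_hdfs_read(path: str) -> List[Record]:
    # Stand-in for reading a data set from disk-based storage (assumption).
    return [("trade-1", "ACME,100"), ("trade-2", "GLOBEX,250")]


cache = ReadThroughCache(fake_hdfs_read)
first_run = cache.read("/data/trades")   # loads from the slow store
second_run = cache.read("/data/trades")  # served from memory
```

The job code calls `read` the same way on both runs, which is the sense in which such a cache is transparent. The sketch assumes the data set fits in memory and ignores eviction and invalidation.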
Cole: What about scalability? Will in-memory be able to keep up with the steadily increasing data loads that most enterprises are dealing with?
Bain: Well-designed IMDG architectures support linear performance scaling across hundreds or even thousands of servers, limited only by interconnect bandwidth, power and heat dissipation. As multicore servers increase their memory capacity and speed and interconnect technologies advance, IMDGs will continue to be scaled to handle steadily growing data sets. However, given the premium that must be paid for RAM versus SSDs, hybrid architectures that incorporate both technologies are likely to have an advantage in overall cost-effectiveness and may emerge as the dominant platform for real-time analytics.
In the current market, companies are moving toward real-time in-memory systems to store and analyze their live data while continuing to store their static data on disk for batch analysis. Live data sets are typically less than 10TB and many are 1TB or less. These will easily fit into today’s memory-based computing solutions and enable companies to gain a competitive edge in managing their operations and detecting perishable business opportunities.