Next Generation Cloud Storage Architecture

Cameron Bahar

Few would dispute that enterprise data is growing at a torrid pace. What is less well understood is that not all data is expanding at the same rate. A major source of that growth is machine-generated content, such as log files, RFID scan data, video surveillance footage, genomics data and other data types specific to particular vertical market segments. As organizations struggle to cope with this exploding data, a new alternative is emerging: enterprise cloud storage specifically architected to address the challenges of fast-growing enterprise content.

With machine-generated content consuming hefty portions of enterprise storage real estate, many organizations have turned to cloud storage to keep pace with demand. Often defined simply as storage located remotely within a service provider's data center, cloud storage is more than storage accessed over the Internet. Many organizations need to cope with fast-growing content, but they also want to maintain control of it. The next generation of cloud storage provides a new architecture that does both, whether data is stored within an organization's own data center (a private cloud), within a service provider's data center (a hosted private cloud), or both (a hybrid cloud).

In contrast to the popular notion of cloud storage as nothing more than storage on the Internet, the next generation of cloud storage provides a new architecture: one optimized to address the demands of large files and fast-growing content with highly scalable, easy-to-use storage. It provides a platform for new enterprise applications that reduce costs and management complexity and drive faster business results. This next generation of cloud storage also involves a major paradigm shift: bringing the processing to the data.

Bringing the Processing to the Data

Why is a new approach needed? With the influx of machine-generated content, storage tiers must manage and process large data sets more efficiently and intelligently than is possible with current solutions. With large-scale data sets, the bottleneck has shifted to the network: it no longer makes sense to move large data sets across it. It is far more efficient to bring the processing to the data.
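A rough back-of-envelope calculation makes the point. The figures below, a 1 PB data set and a dedicated 10 Gb/s link, are illustrative assumptions rather than measurements from any particular environment:

    # Back-of-envelope: how long does it take just to move a large data set?
    # The data-set size and link speed are illustrative assumptions.
    DATA_SET_BYTES = 1_000_000_000_000_000   # 1 PB of machine-generated content
    LINK_BITS_PER_SEC = 10_000_000_000       # a dedicated 10 Gb/s network link

    transfer_seconds = (DATA_SET_BYTES * 8) / LINK_BITS_PER_SEC
    print(f"Bulk transfer time: {transfer_seconds / 86400:.1f} days")   # ~9.3 days

    # By contrast, shipping only the results of a computation (a few megabytes
    # of summaries, indexes or matches) takes seconds on the same link.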

Storage area networks (SANs) emerged because it was more efficient to share storage than to buy a new mainframe whenever internal mainframe storage capacity was exhausted. Similarly, network attached storage (NAS) appliances added efficiency by pooling storage across local area networks. Fast-growing machine-generated data is now swamping these architectures. Organizations are struggling to keep up with storage demand and with the sheer volume of management tasks involved in provisioning new storage and coordinating the migration of data across the network.

Applying Google's Insights

This is the insight behind Google's architecture. Google recognized early on that managing very large data sets would not be practical with existing storage architectures. Its insight was to break a large data set into smaller, manageable partitions stored on a cluster of servers and to perform distributed, parallel processing on these partitions. Google's MapReduce framework treats the network as the bottleneck: instead of moving large data sets across the network, processing is performed locally where the data resides, and only the results are transmitted.
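The pattern can be sketched in a few lines of Python. The partitions, names and sample data below are purely illustrative; they stand in for data stored on separate nodes and are not Google's actual APIs:

    # Minimal MapReduce-style sketch: each "node" processes only the partition
    # it stores locally; only the small per-partition result crosses the network.
    from collections import Counter

    # Illustrative stand-ins for log data partitioned across three storage nodes.
    partitions = [
        ["GET /index", "GET /login", "POST /login"],
        ["GET /index", "GET /index"],
        ["POST /login", "GET /health"],
    ]

    def map_local(partition):
        """Runs on the node that holds the partition; returns a small summary."""
        return Counter(line.split()[0] for line in partition)

    def reduce_results(partial_counts):
        """Only these compact summaries are shipped and merged centrally."""
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    print(reduce_results(map_local(p) for p in partitions))
    # Counter({'GET': 5, 'POST': 2})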

Similarly, the next generation of cloud storage pools the underlying storage of commodity hardware. Compute tasks can access the cloud as a whole or run directly on the storage nodes and access local data.

Direct Data Services

This is a paradigm shift, and it will usher in new levels of efficiency and performance. With large data sets, it makes more sense for data services and certain applications to run directly on the storage nodes. For example, instead of running on a separate application server and becoming a potential bottleneck in a highly distributed environment, an antivirus application can be deployed and run on each storage node. This is particularly true of open source tools such as ClamAV, which have no licensing issues limiting their distribution across nodes. Other examples include video transcoding applications that run on each storage node and perform transformations directly, or Apache Lucene/Solr, which can perform search and indexing on each storage node. This transforms an extremely large storage tier holding petabytes of information into a high-performing, searchable information asset.
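As a sketch of what such a direct data service might look like, the following snippet runs a virus scan against a node's local data and reports back only a compact summary. It assumes ClamAV's clamscan command is installed on the node; the mount point and the shape of the returned summary are illustrative assumptions, not any vendor's actual interface:

    # Per-node scan task sketch: the full data set never leaves the node;
    # only a short summary is reported back. Assumes ClamAV's clamscan CLI is
    # installed locally; consult the ClamAV documentation for exact options.
    import subprocess

    def scan_local_data(path="/srv/storage/local"):   # illustrative local mount point
        result = subprocess.run(
            ["clamscan", "--recursive", "--infected", path],
            capture_output=True, text=True,
        )
        # clamscan conventionally exits 0 when clean and 1 when infections are found.
        return {
            "node_path": path,
            "infected_found": result.returncode == 1,
            "report_tail": result.stdout.splitlines()[-10:],  # compact summary only
        }

    if __name__ == "__main__":
        print(scan_local_data())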

Simple Scalability

In addition to bringing processing to the data, next-generation cloud storage scales simply. Cloud storage requires a high degree of automation, with self-management, self-configuration and self-healing, all orchestrated through high-level policies specified by the storage administrator. Next-generation storage gives administrators flexibility and control. As the cloud grows, additional nodes service data requests in parallel. If administrators need more performance, they can add a processing node; if they need more capacity, they can add disks. Each can be scaled independently or both simultaneously.
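A high-level policy of this kind might look something like the following sketch. The fields and values are hypothetical, intended only to show the idea of administrators stating intent while the cloud automates the details:

    # Hypothetical policy sketch (not any vendor's actual policy language):
    # the administrator declares intent; the cloud handles placement,
    # healing and growth to satisfy it.
    storage_policy = {
        "replicas": 2,                  # keep two copies of every file
        "scrub_interval_days": 30,      # background integrity checks per node
        "rebalance_on_node_add": True,  # spread data when capacity is added
        "capacity_alert_percent": 80,   # warn before the pool fills up
    }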

Scalability is not simply the ability to add nodes and claim linear growth. It applies to all of the storage services provided in a cloud, including data security, data protection and multi-tenancy. For example, how is data protected? With petabytes of information stored, silent data corruption must be addressed effectively. Building a centralized database of checksums and requiring each storage node to perform a centralized transaction would not appreciably impact performance with a few nodes, but at much larger node and file counts it could become a bottleneck. Similarly, how encryption and key management are performed can also affect performance and scalability.
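One widely used way to keep integrity checking scalable is to let each node verify checksums for its own local files and report only mismatches, rather than funneling every check through a central database. The sketch below illustrates the idea; the file layout and manifest format are assumptions for illustration, not a description of any particular product:

    # Per-node scrub sketch: each storage node checksums its own local files
    # and reports only mismatches, avoiding a centralized checksum database.
    # File layout and manifest format are illustrative assumptions.
    import hashlib, json, os

    MANIFEST = "/srv/storage/local/.checksums.json"   # kept on the node itself
    DATA_DIR = "/srv/storage/local/data"

    def sha256(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def scrub():
        """Verify local files against the local manifest; return only mismatches."""
        with open(MANIFEST) as f:
            expected = json.load(f)          # {relative_path: hex_digest}
        mismatches = []
        for rel, digest in expected.items():
            if sha256(os.path.join(DATA_DIR, rel)) != digest:
                mismatches.append(rel)       # candidate for repair from a replica
        return mismatches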


The next generation of cloud storage provides a new architecture to address the storage, management and analysis of fast-growing machine-generated data. By providing advanced scalability and manageability, and the potential to collapse compute and storage together onto the same processing nodes, the next generation of cloud storage will bring new levels of efficiency and economy to enterprise data centers.

As data continues to explode and storage tiers grow ever larger, ParaScale delivers a solution to manage, access and process data more effectively. ParaScale provides the ability to integrate applications and services directly on the storage nodes for greater efficiency, manageability and effectiveness.
