Solid-state storage manufacturers are acutely aware of the technology’s reputation for unreliability. During the initial rollout of enterprise-class models, there was much talk about speed, versatility and low-cost operation, but the reliability question went largely unmentioned.
These days, however, vendors across the board tout the ruggedness and reliability of their latest models as key selling points, and it’s true that in many benchmarks the latest SSDs are exhibiting near-HDD reliability.
Even so, storage is only one component of a larger data ecosystem. Even if SSDs have become more robust of late, the question remains how well they hold up when the failure occurs elsewhere in the data center, say, in the power supply.
Extreme Tech recently published an eye-opening report on SSD data integrity in the event of a power failure and, while it names no names, the results are not good for the leading drive manufacturers. Data presented at the recent USENIX Conference on File and Storage Technologies showed that only one of the 15 drives tested came through without failure, while a third became unusable due to corruption of metadata. The remainder, including several high-end enterprise models, exhibited varying degrees of data loss, even those with supposedly sophisticated ECC capabilities. Hard drives put through the same testing also showed high failure rates, but their data was generally easier to recover once normal operations were restored.
This has potentially serious implications for the enterprise industry, which has embarked on a concerted effort to boost solid state’s presence in virtual and cloud-based storage environments. As Kroll Ontrack’s Robert Winter argued in a recent post, the recovery process for SSDs is highly complex: it involves not only navigating encryption and file system formats but also drilling down into the minute, chip-level data structures that solid-state devices employ. And the very dynamism that SSDs are prized for under normal circumstances can work against you during recovery, particularly if the drive in question is part of a complex RAID configuration.
New virtual storage platforms are starting to take the issue of SSD reliability seriously. VMware’s Cormac Hogan, for example, described VSAN’s use of object storage and cluster partitioning as a means to improve VM availability in cases of outright SSD or PCIe Flash failure. In that instance, VSAN will begin building storage objects on other disk groups within the immediate cluster. While a single drive failure could take down the one SSD and maybe six HDDs in one disk group, the damage is still much less than if the system is configured as a single large disk group.
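The effect of splitting storage into smaller disk groups can be put in rough numbers. The sketch below is a deliberately simplified model, not a description of how VSAN actually accounts for capacity: it assumes every disk group holds the same number of HDDs behind a single SSD, and that losing that SSD takes the whole group offline. The function name and the 24-drive figure are hypothetical.

```python
def offline_fraction(total_hdds: int, hdds_per_group: int) -> float:
    """Fraction of HDDs taken offline when one disk group's SSD fails.

    Simplified model (an assumption, not VSAN's real behavior): equal-sized
    disk groups, one SSD per group, and an SSD failure that takes down
    every HDD in that group.
    """
    if total_hdds <= 0 or hdds_per_group <= 0:
        raise ValueError("drive counts must be positive")
    if total_hdds % hdds_per_group != 0:
        raise ValueError("total_hdds must divide evenly into groups")
    return hdds_per_group / total_hdds


# Hypothetical cluster with 24 HDDs.
print(offline_fraction(24, 6))   # four groups of six HDDs each
print(offline_fraction(24, 24))  # one large disk group
```

With six HDDs per group, a single SSD failure sidelines a quarter of the drives in this model; with one large group, it sidelines all of them.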
But even in large, highly redundant disk arrays, there is still the possibility of multiple SSD failures at one time, argues Silverton Consulting’s Ray Lucchesi. This is primarily because SSDs have a much more predictable lifespan than hard disks: flash wears out after a set number of write (program/erase) cycles, as opposed to the more random nature of mechanical degradation. So when organizations employ techniques like wear leveling and data striping to spread the load across multiple SSDs, they increase the chance that those drives will fail at the same time, potentially bringing the entire storage environment to its knees. Accurate monitoring and liberal use of DRAM buffering can alleviate the problem, but it would also be wise to mix older and newer drives, as well as models from different manufacturers, and even to avoid RAID 1 and RAID 5 in SSD arrays, as these tend to keep write loads nearly identical across drives.
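The monitoring side of this advice can be illustrated with a small sketch. It assumes each drive has a rated write endurance and a steady average write rate, projects when each drive will reach its rating, and flags drives whose projected wear-out dates fall close together. The `Drive` fields, the 30-day window and the sample figures are all hypothetical; real endurance data would come from the drives' own health reporting.

```python
import math
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    rated_writes_tb: float   # rated write endurance (hypothetical figure)
    written_tb: float        # total data written so far
    daily_writes_tb: float   # average write rate


def days_until_wearout(drive: Drive) -> float:
    """Projected days until the drive reaches its rated write endurance."""
    if drive.daily_writes_tb <= 0:
        return math.inf
    remaining = max(drive.rated_writes_tb - drive.written_tb, 0.0)
    return remaining / drive.daily_writes_tb


def clustered_wearouts(drives, window_days=30.0):
    """Group drives whose projected wear-out falls within window_days
    of the earliest drive in the same group."""
    ranked = sorted(
        (d for d in drives if math.isfinite(days_until_wearout(d))),
        key=days_until_wearout,
    )
    groups, current = [], []
    for d in ranked:
        if current and days_until_wearout(d) - days_until_wearout(current[0]) > window_days:
            if len(current) > 1:
                groups.append([x.name for x in current])
            current = []
        current.append(d)
    if len(current) > 1:
        groups.append([x.name for x in current])
    return groups


array = [
    Drive("ssd-a", 600, 300, 1.0),   # same age and load as ssd-b
    Drive("ssd-b", 600, 310, 1.0),
    Drive("ssd-c", 600, 100, 1.0),   # newer drive mixed into the array
    Drive("ssd-d", 1200, 300, 1.0),  # different model, higher endurance
]
print(clustered_wearouts(array))
```

In this made-up array, the two drives of the same age and load are flagged together, while the newer drive and the higher-endurance model fall outside the window, which is the effect that mixing drive ages and models is meant to achieve.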
Data integrity in solid-state storage environments need not come from improvements to the drives themselves. All drives fail eventually, so the goal should not be to prevent failure at all costs but to manage the loss of data to produce minimal disruption to working environments.
But as the enterprise continues to increase its reliance on solid-state technology, and particularly when it crosses that line into mission-critical functions, a broader assessment is needed of how the storage environment is actually changing and what it will take to ensure the continued health of the data environment.