Google Gmail Outage Underscores Relevance of Offline Backup

Paul Mah

One corner of Google's vast cloud skipped a beat over the weekend, resulting in the deletion of e-mail messages belonging to some Gmail users. Estimates of the number of affected users vary, with Google quick to emphasize that the affected accounts range from an initial 0.08 percent to a later revised 0.02 percent of all Gmail users. Nobody has a concrete figure at this point (and Google will probably never say), though with an estimated 170 million users, I think it is fair to say that affected Gmail users could only number in the tens of thousands.


The root of the issue appears to be a software bug in a recently released storage software update that inadvertently corrupted and deleted e-mails belonging to a number of Gmail users. While Google says that the presence of backups meant that no e-mail was lost, there are a couple of lessons we can learn from a quick examination of this episode.


The Insidious Danger of Data Corruption


Google has always prided itself on the multiple copies of data it keeps mirrored across multiple data centers. Such an architecture deserves praise in our post-9/11 world, where geographical spread and multiple redundancies are considered necessary for data survival. However, it is all too clear from the Gmail outage that hyper-redundancy offers practically zero protection against data corruption, and could ironically end up replicating bad files over good copies even faster. Not that this was the case at Google here, but the same vulnerability that applies to data corruption also applies to outright sabotage.
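To see why mirroring alone offers no protection, consider this minimal sketch (all names and data here are hypothetical, not Google's actual architecture): the replication layer faithfully copies a corrupted write to every data center, while a point-in-time copy taken offline before the bad update is untouched.

```python
# Two mirrored data centers holding the same (hypothetical) mail archive,
# plus an offline point-in-time copy taken before the bad update.
replicas = {"dc-east": b"mail archive v1", "dc-west": b"mail archive v1"}
offline_copy = replicas["dc-east"]

def replicate(data: bytes) -> None:
    """Mirror a write to every data center -- corrupted or not."""
    for dc in replicas:
        replicas[dc] = data

# A buggy storage update writes bad data; replication spreads it everywhere.
replicate(b"\x00corrupted\x00")
```

After the bad write, every online mirror holds the corrupted bytes; only the offline copy still holds the original archive, which is exactly the gap an offline tier fills.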


The Value of Offline Backup


For all the hype surrounding cloud storage, the only real defense against data corruption or sabotage is the presence of an offline storage tier. As such, SMBs should not automatically disregard tape backup, which is inherently offline. And as I highlighted in my recent blog "Re-examine Your SMB Backup Strategy," the use of tape also instills a certain level of traceability that can serve as a guard against deliberate data modifications.
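An offline tier need not be elaborate to be useful. As a rough sketch (the function names and paths here are my own, purely illustrative), an SMB backup step can copy a file to a removable or offline target and record a checksum alongside it, so that later corruption or tampering is at least detectable:

```python
# A minimal sketch of a checksummed offline backup step.
# Names and paths are hypothetical, not from any particular product.
import hashlib
import shutil
from pathlib import Path

def offline_backup(source: Path, target_dir: Path) -> Path:
    """Copy source to the offline target and record its SHA-256 digest."""
    target_dir.mkdir(parents=True, exist_ok=True)
    dest = target_dir / source.name
    shutil.copy2(source, dest)  # preserves timestamps for traceability
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    (target_dir / (dest.name + ".sha256")).write_text(digest)
    return dest

def verify(backup: Path) -> bool:
    """Recompute the digest and compare against the recorded one."""
    recorded = (backup.parent / (backup.name + ".sha256")).read_text()
    return hashlib.sha256(backup.read_bytes()).hexdigest() == recorded
```

Once the target media is disconnected and stored, no software bug or saboteur on the live systems can touch it, which is the whole point of the offline tier.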


In fact, as I highlighted in an earlier post on the relevance of tape in SMBs, Google itself is the largest consumer of tape in the world today, using some 50,000 LTO (Linear Tape-Open) tape cartridges every quarter. Still, offline backup does have its disadvantages. In a blog update on the official Gmail Blog, Ben Treynor, VP of engineering and site reliability czar, had this to say:

But restoring data from them also takes longer than transferring your requests to another data center, which is why it's taken us hours to get the email back instead of milliseconds.

Restoration work over at Gmail continues at the time of this writing. In the meantime, I would love to hear from readers who deploy any form of offline backup in their small and mid-sized businesses.



Comments
Mar 1, 2011 2:38 PM Open Hosting says:

Whenever you deal with anything of importance online, backup should be one of the very first things a person or company does to protect themselves.

Mar 1, 2011 4:35 PM greg schulz says:

The recent Google Gmail outage is a good example of the difference between data loss and loss of access to data. Granted, when you cannot access your data it may seem like lost data. However, if the data had actually been lost, there would be no restoration.

Here's a recent post related to data loss vs. loss of access:




Mar 1, 2011 7:30 PM Riaz says:

Skip messing with complicated software. Get a DVD of your gmail sent to you. gmailtodisk.com

Mar 3, 2011 11:53 AM sharath says:

Such software bugs are unexpected, and we cannot blame Google either. Anyway, backup is a great option for such problems.

Mar 3, 2011 12:05 PM Andrew says:

The "restoring from tape" explanation from Google and the "no lost data" claim after all online copies of the data were corrupted don't add up to me. Hopefully I'm missing something.

Tape is a (great) point-in-time backup, but time must have elapsed between the backup being made and the corruption occurring. Doesn't that add up to lost data?

Mar 3, 2011 4:06 PM udit says:

Yes, as a precautionary measure, backing up your Gmail account is always good practice.

Here we have a tool for Gmail/IMAP backup that can copy, move, and restore email from one Gmail account to another.


Mar 4, 2011 11:56 AM greg schulz says: in response to Andrew

Andrew, you bring up a good point in that some things do not add up, or there are gaps that we may or may not hear about in the future.

Now to be clear, what I'm about to say is based on other experiences and in no way reflects what may or may not have occurred at Google; in other words, this is educated speculation ;)...

We can assume or piece together based on what Google has said/documented in the past:

a) They keep copies of data on multiple servers/nodes, potentially in different locations

b) They have some sort of time/interval (e.g., RPO) based protection, and presumably some types of journals/logs

c) They have some form of change control management process for detecting and backing out of changes

d) Their systems have some degree of logical isolation or segmentation to prevent a runaway or uncontained fault from spreading to all users

e) In addition to having multiple copies of data on disk with time-interval RPO protection using snaps, logs, backup, or whatever your preference or assumption is, they also copy either the data or those time-based copies offline at some point (which could mean to another data center on disk), in addition to ending up on tape as a gold/master copy

What we know:

a) Google was doing a software update of some sort that involved storage; something went wrong, they detected it and stopped the rollout/change.

b) The result was that some users lost access to data; however, I have not heard of anyone actually losing data yet.

c) Based on previously published Google information, what they call a storage system and storage software is not the same as what most IT or other environments use. The Google software and storage systems are custom and proprietary, compared to a commercially available storage system, volume manager or file system, backup, or BC/DR type tools.

d) Google has an HA, BC and DR plan and strategy that leverages multiple technologies and techniques to isolate faults.

e) There is plenty of speculation and armchair data center/BC/DR quarterbacking taking place.

What we can piece together or speculate on:

a) Once it was detected, Google stopped the change, backing out of it and reverting to a known-good copy of the software being updated.

b) Not all user accounts were impacted, or equally impacted.

c) No data appears to have been lost, just access to the data.

What we don't know (if someone does know for sure without speculating, please tell us):

a) What type of software was being updated that involved storage?

b) Was it a configuration setting of some sort, or actual software being changed?

c) Was it an operating system on a storage server node being changed?

d) Was it a file system, volume manager, or some other storage-related application on a storage node?

e) Was it actually a storage node, or one of Google's other server nodes running storage management or access software that was impacted?

f) What were/are Google's RTO and RPO for Gmail, including most recently accessed vs. older data?

g) Is it the case that the most recently updated data was in fact intact or recoverable via journals/logs accessible from other systems, while older data needed to be restored, recovered, and rebuilt before an account could be re-enabled? In other words, a user's account could have been turned back on, but if some data (older or newer) were missing, there would have been a consistency and data loss perception, if not reality.

h) What type of data is being brought back from tape: is it tarball-type backups, or log and journal files that take time to restore, replay, and rebuild to get things back to a known consistency point?

What I'm guessing:

Based on the above, either older data was lost and thus had to be restored, then repaired and validated across multiple Google storage nodes, given how it is understood they disperse their data. Or older data is intact; however, some newer data needed to be reconstructed from logs and journals, combined with the older data, and consistency checks and repairs made before re-enabling access to the accounts. Possibly it was some combination of the above.

Also, most technology failures are tied to human intervention or error. If faults are not contained or isolated, they will spread, expanding into larger disasters. And don't confuse loss of data access with loss of data; granted, while you cannot access your data, it can have the same impact as actual data loss. For any information that is important, keep multiple copies in different locations. Practicing what I preach, that's how I protect my data, combining cloud, onsite, and traditional offsite.

Shameless plug alert! If you want to learn more about HA, BC, DR, data protection and related topics for data infrastructure, data centers, IT and information services delivery environments, check out my books at storageio.com/books

The bottom line lesson here: don't be scared; have a plan and be prepared. Technology will fail. It's not if, but when, why and how.

Now back to the regularly scheduled speculation programming ;)...

Cheers gs







