Comments on: Backblaze Durability Calculates at 99.999999999% — And Why It Doesn’t Matter
https://www.backblaze.com/blog/cloud-storage-durability/

By: Anthony Mai https://www.backblaze.com/blog/cloud-storage-durability/#comment-329383 Tue, 01 Nov 2022 21:01:42 +0000
The math of the calculation is correct, but the result of eleven nines is way off: you stretched some numbers with unreasonable assumptions. First, the expected annual disk failure rate is more like 1.64% on average, not the 0.44% you use in the calculation.

Second, and more importantly, the result depends heavily on how often you scrub the disks to check for errors. If, for example, you let the disks sit for a year without checking, there is a high chance that a year later you suddenly find more than 3 failed disks and lose data. If you could scrub every disk once per second, which is physically impossible, there would be virtually no chance of finding more than one failure at a time. The truth is somewhere in between, depending on how often you scrub. My calculation shows you would need to scrub all disks every 2 days to achieve 11 nines, which is impossible in practice. If you scrub every quarter, the interval is 90/2 = 45 times longer, so the chance of data loss is 45^3 = 91,125 times higher, and the real durability is only about 6 nines, not 11.
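
For illustration, here is a rough sketch of that scaling argument. It is a simplified binomial model, not Backblaze's actual pipeline: it assumes independent drive failures at a constant 1.64% annual rate and no repairs between scrubs, so the numbers are only indicative.

```python
# Probability that a 17+3 group of 20 drives loses data because more
# than 3 of them fail within one scrub interval (failures are assumed
# independent and undetected until the next scrub).
from math import comb

DRIVES = 20          # 17 data shards + 3 parity shards
TOLERATED = 3        # the group survives up to 3 concurrent failures
AFR = 0.0164         # assumed per-drive annual failure rate

def p_loss_per_window(window_days: float) -> float:
    """P(more than TOLERATED drives fail within one scrub window)."""
    p = AFR * window_days / 365.0   # per-drive failure probability in the window
    return sum(comb(DRIVES, k) * p**k * (1 - p)**(DRIVES - k)
               for k in range(TOLERATED + 1, DRIVES + 1))

def p_loss_per_year(window_days: float) -> float:
    """Chance of at least one such loss across all windows in a year."""
    windows = 365.0 / window_days
    return 1 - (1 - p_loss_per_window(window_days)) ** windows

for days in (2, 7, 90):
    print(f"scrub every {days:2d} days: annual loss probability ~ {p_loss_per_year(days):.1e}")
```

Run it and you will see that the ratio between the quarterly and 2-day results comes out close to 45^3, the cubic scaling described above; the absolute number of nines depends on the exact assumptions and AFR.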

Bottom line: your 17+3 erasure coding is just not good enough. Your first catastrophic customer data loss will likely happen within the next three years, given the scale of your current customer data. You need a more powerful erasure code than the Reed-Solomon you currently implement in Java. Talk to me. I will show you how to spend less overhead, scrub no more than once a year, and still achieve better than 11 nines of durability. You cannot achieve that with legacy Reed-Solomon codes.

By: Serkay ölmez https://www.backblaze.com/blog/cloud-storage-durability/#comment-325475 Tue, 27 Nov 2018 20:09:06 +0000
I have to disagree with this analysis. You implicitly assume that if you didn’t have 4 or more failures within a given 156-hour period, you start the next round of 156 hours as a fully healthy system that can still handle up to 3 failures. In reality there are four possibilities: you may exit the first period with all drives up and running (this is what you assume implicitly), or you may have 1, 2, or 3 restores still in progress right at t = 157 hrs. The failure probability in the second 156-hour period depends on how you exit the previous one: you do not necessarily start fresh with a budget of 3 failures. To proceed to the third frame, you have to calculate how you would exit the second 156-hour period depending on how you started it, and you can see how quickly this gets complicated.

Your model assumes that you are back to square one, with all 20 drives running, at the beginning of each of the 56 frames in a year, and that is clearly not true.

The proper way to analyze such systems is with Markov chains, which are covered in the literature: https://www.researchgate.net/publication/2680057_MDS_Disk_Array_Reliability

If you compare the results in that paper with yours, you will notice that you are overestimating the reliability, which is a consequence of the implicit assumption I discussed above.
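
As a concrete sketch of what I mean (a minimal chain, not the exact model from that paper): take a single 17+3 group, let the state be the number of failed drives, and make 4 failures the absorbing data-loss state. The 0.44% AFR and the 156-hour rebuild window are the figures from the post; independent failures and exponential rebuild times are simplifying assumptions on my part.

```python
# States 0..4 = number of failed drives in one 17+3 group of 20;
# state 4 (more failures than parity can cover) is absorbing.
import numpy as np
from scipy.linalg import expm

DRIVES, TOLERATED = 20, 3
AFR = 0.0044                     # per-drive annual failure rate used in the post
lam = AFR / (365 * 24)           # failure rate per drive, per hour
mu = 1.0 / 156                   # rebuild rate per failed drive (156-hour rebuild)

n = TOLERATED + 2                # states 0, 1, 2, 3 and the absorbing state 4
Q = np.zeros((n, n))             # generator matrix of the chain
for s in range(TOLERATED + 1):
    Q[s, s + 1] = (DRIVES - s) * lam   # one more drive fails
    if s > 0:
        Q[s, s - 1] = s * mu           # one failed drive finishes rebuilding
    Q[s, s] = -Q[s].sum()              # rows of a generator sum to zero

# Probability of hitting the absorbing state within one year,
# starting with all 20 drives healthy.
P = expm(Q * 365 * 24)
print(f"P(data loss within one year) ~ {P[0, -1]:.2e}")
```

Because the chain carries the 1-, 2-, and 3-failure states across period boundaries, it does not need the “fresh start every 156 hours” assumption at all.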

Let me know if you disagree.

By: jimp https://www.backblaze.com/blog/cloud-storage-durability/#comment-325552 Thu, 11 Oct 2018 18:02:15 +0000
Wouldn’t the billing data-loss scenario be eliminated by suspending access to the data for a month or longer before permanently deleting it? Ideally the API, uploads, and downloads would all stop working, which would present multiple opportunities for the account owner to be notified (from reports or customer calls). That would also protect against sabotage, where someone maliciously social-engineers an account cancellation or triggers one via the backend (assuming that is possible).
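
Something like the following hypothetical flow is what I have in mind; the 30-day hold, the state names, and the Account class are all made up for illustration, not Backblaze’s actual billing code.

```python
# Sketch of a cancellation grace period: a cancelled account is frozen
# (all API calls refused) and only purged after the hold has elapsed
# without the owner reinstating it.
from datetime import datetime, timedelta, timezone

HOLD = timedelta(days=30)            # illustrative suspension window

class Account:
    def __init__(self, name: str):
        self.name = name
        self.state = "active"        # active -> suspended -> deleted
        self.suspended_at = None

    def cancel(self) -> None:
        """Freeze instead of deleting, so the owner has a chance to notice."""
        self.state = "suspended"
        self.suspended_at = datetime.now(timezone.utc)

    def reinstate(self) -> None:
        """The owner (or support) reverses an accidental or malicious cancellation."""
        self.state = "active"
        self.suspended_at = None

    def request(self, op: str) -> str:
        """API, uploads, and downloads all refuse work while suspended."""
        if self.state != "active":
            raise PermissionError(f"{op} refused: account {self.name!r} is {self.state}")
        return f"{op} ok"

    def purge_if_due(self, now=None) -> str:
        """Permanent deletion happens only after the hold expires."""
        now = now or datetime.now(timezone.utc)
        if self.state == "suspended" and now - self.suspended_at >= HOLD:
            self.state = "deleted"   # only at this point is the data destroyed
        return self.state
```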

By: Rodrigo_Roa_Duterte https://www.backblaze.com/blog/cloud-storage-durability/#comment-325677 Tue, 07 Aug 2018 18:03:00 +0000
I have been using B2 for almost 2 years and they have not charged me a single cent yet. I store more data with them than I do with either AWS or APIS, yet those giants charge me 32 cents and 50 cents a month. I want Backblaze to charge me something so that I don’t feel guilty using their service the same way I use AWS and APIS.

By: Michael Kubler https://www.backblaze.com/blog/cloud-storage-durability/#comment-325689 Tue, 31 Jul 2018 17:00:30 +0000
I’ve been reading The Drunkard’s Walk by Leonard Mlodinow, and I think he’d be proud of the calculations done here. It looks like you know your stuff, and there’s certainly a far higher chance of a billing issue or a full data center failure than of file loss.

But… Do you have any stats on files or objects that have actually suffered such data failures? Have you seen any in the wild, or do you intentionally create some as part of your code deploy process to see what happens: how recovery failures, and recoveries that succeed but leave the data temporarily inaccessible, are handled at the various layers up to the UI?

By: Edgar Maloverian https://www.backblaze.com/blog/cloud-storage-durability/#comment-325694 Sat, 28 Jul 2018 16:31:06 +0000
Exciting read! Did you consider decreasing durability to a level that still gives adequate risk, and investing the savings in fixing the more likely loss scenarios instead?

By: alhopper https://www.backblaze.com/blog/cloud-storage-durability/#comment-325695 Sat, 28 Jul 2018 01:25:47 +0000
I think your calculations, while interesting, don’t provide an accurate result. I say this based on observational experience:

I’ve seen A/C failures – usually more than one A/C unit goes offline for various reasons – and the remaining A/C units can’t maintain the desired temperature. Everything in the datacenter continues to work without any indication of issues. The failed A/C units are put back online. And then, over the next 5 to 10 days, you’ll see a “rash” of disk drive failures. The failure rate will usually follow a bell-curve type distribution. During the peak of the bell curve, when many of the failures occur in a relatively short timeframe, you usually have to hustle to replace the failed drives so that you don’t have a cascading type failure, where more than one device fails in a RAID 5 config. [Yes, I understand that you guys are not using RAID 5.] My point is that to maximize the probability of resolving this “A/C event” without data loss, you have a relatively short window in which to swap out the failed drives. This is generally referred to as MTTR (Mean Time To Repair).

The big issue I see from what you have described is your MTTR number: 1 TB in 24 hours. That is *way* too long, especially considering that you’re moving to 14 TB drives.

What I don’t understand is why that MTTR is so high. Modern disk drives can move *approximately* (insert nit-picking argument here, but I say *approximately*) 200 MB/s, so again approximately 1 TB in 90 minutes. And I’m assuming a minimum of 10 Gbit/s for your Ethernet interconnects.
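
A quick sketch of that arithmetic (the sustained 200 MB/s and the assumption that the network is not the bottleneck are mine, not measured figures):

```python
# Back-of-the-envelope single-drive rewrite time at a sustained throughput.
# Real rebuilds compete with customer traffic, so actual MTTR will be longer.
def rebuild_hours(capacity_tb: float, throughput_mb_s: float = 200.0) -> float:
    return capacity_tb * 1e6 / throughput_mb_s / 3600.0   # TB -> MB, then seconds -> hours

for tb in (1, 4, 14):
    print(f"{tb:2d} TB drive at 200 MB/s: ~{rebuild_hours(tb):.1f} hours")
```

Even a 14 TB drive comes out under a day at that rate, which is why the 24-hours-per-TB figure looks so conservative to me.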

IMHO, this is your current challenge: to drive down your MTTR.

By: ohno jones https://www.backblaze.com/blog/cloud-storage-durability/#comment-325701 Fri, 27 Jul 2018 16:03:07 +0000
You did not account for the possibility of a natural disaster. That’s why you need geographic redundancy.

By: Homero https://www.backblaze.com/blog/cloud-storage-durability/#comment-325703 Fri, 27 Jul 2018 14:29:22 +0000
What if a bug starts corrupting data and the corruption cascades to all disks?
