Comments on: Backblaze Durability Calculates at 99.999999999% — And Why It Doesn’t Matter
https://www.backblaze.com/blog/cloud-storage-durability/

By: Anthony Mai https://www.backblaze.com/blog/cloud-storage-durability/#comment-329383 Tue, 01 Nov 2022 21:01:42 +0000
The math of the calculation is correct, but the result of eleven nines is way off: you stretched some numbers with unreasonable assumptions. First, the expected annual disk failure rate is more like 1.64% on average, not the 0.44% you use in the calculation.

Second, and more importantly, the result depends heavily on how often you scrub the disks to check for errors. If, for example, you let the disks sit for a year without checking, there is a high chance that a year later you suddenly find more than 3 failed disks and lose data. If you could scrub every disk once per second, which is physically impossible, there would be virtually no chance of finding more than one failure at a time. The truth is somewhere in between, depending on how often you scrub. My calculation shows you would need to scrub all disks every 2 days to achieve 11 nines, which is impossible in practice. If you scrub every quarter, the interval is 90/2 = 45 times longer, so the chance of data loss is 45^3 = 91,125 times higher, and the real durability is only about 6 nines, not 11.
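
For illustration, here is a rough sketch of that scaling argument. It is a simplified binomial model, not Backblaze's actual pipeline: it assumes independent drive failures at a constant 1.64% annual rate and no repairs between scrubs, so the numbers are only indicative.

```python
# Probability that a 17+3 group of 20 drives loses data because more
# than 3 of them fail within one scrub interval (failures are assumed
# independent and undetected until the next scrub).
from math import comb

DRIVES = 20          # 17 data shards + 3 parity shards
TOLERATED = 3        # the group survives up to 3 concurrent failures
AFR = 0.0164         # assumed per-drive annual failure rate

def p_loss_per_window(window_days: float) -> float:
    """P(more than TOLERATED drives fail within one scrub window)."""
    p = AFR * window_days / 365.0   # per-drive failure probability in the window
    return sum(comb(DRIVES, k) * p**k * (1 - p)**(DRIVES - k)
               for k in range(TOLERATED + 1, DRIVES + 1))

def p_loss_per_year(window_days: float) -> float:
    """Chance of at least one such loss across all windows in a year."""
    windows = 365.0 / window_days
    return 1 - (1 - p_loss_per_window(window_days)) ** windows

for days in (2, 7, 90):
    print(f"scrub every {days:2d} days: annual loss probability ~ {p_loss_per_year(days):.1e}")
```

Run it and you will see that the ratio between the quarterly and 2-day results comes out close to 45^3, the cubic scaling described above; the absolute number of nines depends on the exact assumptions and AFR.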

Bottom line: your 17+3 erasure coding is just not good enough. Your first catastrophic customer data loss will likely happen within the next three years, given the scale of your current customer data. You need a more powerful erasure code than the Reed-Solomon you currently implement in Java. Talk to me. I will show you how to spend less overhead, scrub no more than once a year, and still achieve better than 11 nines of durability. You cannot achieve that with legacy Reed-Solomon codes.

By: Serkay ölmez https://www.backblaze.com/blog/cloud-storage-durability/#comment-325475 Tue, 27 Nov 2018 20:09:06 +0000
I have to disagree with this analysis. You implicitly assume that if you didn’t have 4 or more failures within a given 156-hour period, you start the next round of 156 hours as a fully healthy system that can still handle up to 3 failures. In reality there are four possibilities: you may exit the first period with all drives up and running (this is what you assume implicitly), or you may have 1, 2, or 3 restores still in progress right at t = 157 hrs. The failure probability in the second 156-hour period depends on how you exit the previous one: you do not necessarily start fresh with a budget of 3 failures. To proceed to the third frame, you have to calculate how you would exit the second 156-hour period depending on how you started it, and you can see how quickly this gets complicated.

Your model assumes that you are back to square one, with all 20 drives running, at the beginning of each of the 56 frames in a year, and that is clearly not true.

The proper way to analyze such systems is with Markov chains, which are covered in the literature: https://www.researchgate.net/publication/2680057_MDS_Disk_Array_Reliability

If you compare the results in that paper with yours, you will notice that you are overestimating the reliability, which is a consequence of the implicit assumption I discussed above.
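
As a concrete sketch of what I mean (a minimal chain, not the exact model from that paper): take a single 17+3 group, let the state be the number of failed drives, and make 4 failures the absorbing data-loss state. The 0.44% AFR and the 156-hour rebuild window are the figures from the post; independent failures and exponential rebuild times are simplifying assumptions on my part.

```python
# States 0..4 = number of failed drives in one 17+3 group of 20;
# state 4 (more failures than parity can cover) is absorbing.
import numpy as np
from scipy.linalg import expm

DRIVES, TOLERATED = 20, 3
AFR = 0.0044                     # per-drive annual failure rate used in the post
lam = AFR / (365 * 24)           # failure rate per drive, per hour
mu = 1.0 / 156                   # rebuild rate per failed drive (156-hour rebuild)

n = TOLERATED + 2                # states 0, 1, 2, 3 and the absorbing state 4
Q = np.zeros((n, n))             # generator matrix of the chain
for s in range(TOLERATED + 1):
    Q[s, s + 1] = (DRIVES - s) * lam   # one more drive fails
    if s > 0:
        Q[s, s - 1] = s * mu           # one failed drive finishes rebuilding
    Q[s, s] = -Q[s].sum()              # rows of a generator sum to zero

# Probability of hitting the absorbing state within one year,
# starting with all 20 drives healthy.
P = expm(Q * 365 * 24)
print(f"P(data loss within one year) ~ {P[0, -1]:.2e}")
```

Because the chain carries the 1-, 2-, and 3-failure states across period boundaries, it does not need the “fresh start every 156 hours” assumption at all.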

Let me know if you disagree.

By: jimp https://www.backblaze.com/blog/cloud-storage-durability/#comment-325552 Thu, 11 Oct 2018 18:02:15 +0000
Wouldn’t the billing data-loss scenario be eliminated by suspending access to the data for a month or longer before permanently deleting it? Ideally the API, uploads, and downloads would all stop working, which would present multiple opportunities for the account owner to be notified (from reports or customer calls). That would also protect against sabotage, where someone maliciously social-engineers an account cancellation or triggers one via the backend (assuming that is possible).
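
Something like the following hypothetical flow is what I have in mind; the 30-day hold, the state names, and the Account class are all made up for illustration, not Backblaze’s actual billing code.

```python
# Sketch of a cancellation grace period: a cancelled account is frozen
# (all API calls refused) and only purged after the hold has elapsed
# without the owner reinstating it.
from datetime import datetime, timedelta, timezone

HOLD = timedelta(days=30)            # illustrative suspension window

class Account:
    def __init__(self, name: str):
        self.name = name
        self.state = "active"        # active -> suspended -> deleted
        self.suspended_at = None

    def cancel(self) -> None:
        """Freeze instead of deleting, so the owner has a chance to notice."""
        self.state = "suspended"
        self.suspended_at = datetime.now(timezone.utc)

    def reinstate(self) -> None:
        """The owner (or support) reverses an accidental or malicious cancellation."""
        self.state = "active"
        self.suspended_at = None

    def request(self, op: str) -> str:
        """API, uploads, and downloads all refuse work while suspended."""
        if self.state != "active":
            raise PermissionError(f"{op} refused: account {self.name!r} is {self.state}")
        return f"{op} ok"

    def purge_if_due(self, now=None) -> str:
        """Permanent deletion happens only after the hold expires."""
        now = now or datetime.now(timezone.utc)
        if self.state == "suspended" and now - self.suspended_at >= HOLD:
            self.state = "deleted"   # only at this point is the data destroyed
        return self.state
```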

By: Rodrigo_Roa_Duterte https://www.backblaze.com/blog/cloud-storage-durability/#comment-325677 Tue, 07 Aug 2018 18:03:00 +0000
I have been using B2 for almost 2 years and they have not charged me a single cent yet. I store more data with them than I do with either AWS or APIS, yet those giants charge me 32 cents and 50 cents a month. I want Backblaze to charge me something so that I don’t feel guilty using their service the same way I use AWS and APIS.

By: Michael Kubler https://www.backblaze.com/blog/cloud-storage-durability/#comment-325689 Tue, 31 Jul 2018 17:00:30 +0000
I’ve been reading The Drunkard’s Walk by Leonard Mlodinow, and I think he’d be proud of the calculations done here. It looks like you know your stuff, and there’s certainly a far higher chance of a billing issue or a full data center failure than of file loss.

But… Do you have any stats on files or objects that have actually suffered such data failures? Have you seen any in the wild, or do you intentionally create some as part of your code deploy process to see what happens: how recovery failures, and recoveries that succeed but leave the data temporarily inaccessible, are handled at the various layers up to the UI?

By: Edgar Maloverian https://www.backblaze.com/blog/cloud-storage-durability/#comment-325694 Sat, 28 Jul 2018 16:31:06 +0000
Exciting read! Did you consider decreasing durability to a level that still gives adequate risk, and investing the savings in fixing the more likely loss scenarios instead?

By: alhopper https://www.backblaze.com/blog/cloud-storage-durability/#comment-325695 Sat, 28 Jul 2018 01:25:47 +0000
I think your calculations, while interesting, don’t provide an accurate result. I say this based on observational experience:

I’ve seen A/C failures – usually more than one A/C unit goes offline for various reasons – and the remaining A/C units can’t maintain the desired temperature. Everything in the datacenter continues to work without any indication of issues. The failed A/C units are put back online. And then, over the next 5 to 10 days, you’ll see a “rash” of disk drive failures. The failure rate will usually follow a bell-curve type distribution. During the peak of the bell curve, when many of the failures occur in a relatively short timeframe, you usually have to hustle to replace the failed drives so that you don’t have a cascading type failure, where more than one device fails in a RAID 5 config. [Yes, I understand that you guys are not using RAID 5.] My point is that to maximize the probability of resolving this “A/C event” without data loss, you have a relatively short window in which to swap out the failed drives. This is generally referred to as MTTR (Mean Time To Repair).

The big issue I see from what you have described is your MTTR number: 1 TB in 24 hours. That is *way* too long, especially considering that you’re moving to 14 TB drives.

What I don’t understand is why that MTTR is so high. Modern disk drives can move *approximately* (insert nit-picking argument here, but I say *approximately*) 200 MB/s, so again approximately 1 TB in 90 minutes. And I’m assuming a minimum of 10 Gbit/s for your Ethernet interconnects.
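
A quick sketch of that arithmetic (the sustained 200 MB/s and the assumption that the network is not the bottleneck are mine, not measured figures):

```python
# Back-of-the-envelope single-drive rewrite time at a sustained throughput.
# Real rebuilds compete with customer traffic, so actual MTTR will be longer.
def rebuild_hours(capacity_tb: float, throughput_mb_s: float = 200.0) -> float:
    return capacity_tb * 1e6 / throughput_mb_s / 3600.0   # TB -> MB, then seconds -> hours

for tb in (1, 4, 14):
    print(f"{tb:2d} TB drive at 200 MB/s: ~{rebuild_hours(tb):.1f} hours")
```

Even a 14 TB drive comes out under a day at that rate, which is why the 24-hours-per-TB figure looks so conservative to me.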

IMHO, this is your current challenge: to drive down your MTTR.

By: ohno jones https://www.backblaze.com/blog/cloud-storage-durability/#comment-325701 Fri, 27 Jul 2018 16:03:07 +0000
You did not account for the possibility of a natural disaster. That’s why you need geographic redundancy.

By: Homero https://www.backblaze.com/blog/cloud-storage-durability/#comment-325703 Fri, 27 Jul 2018 14:29:22 +0000
What if a bug starts corrupting data and the corruption cascades to all disks?
