Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture (June 18, 2019)

A lot has changed in the four years since Brian Beach wrote a post announcing Backblaze Vaults, our software architecture for cloud data storage. Just looking at how the major statistics have changed: we now have over 100,000 hard drives in our data centers instead of the 41,000 mentioned in the original post. We have three data centers (soon four) instead of one. We’re approaching one exabyte of data stored for our customers (almost seven times the 150 petabytes back then), and we’ve recovered over 41 billion files for our customers, up from the 10 billion in the 2015 post.

In the original post, we discussed having durability of seven nines. Shortly thereafter, it was upped to eight nines. In July of 2018, we took a deep dive into the calculation and found our durability closer to eleven nines (and went into detail on the calculations used to arrive at that number). And, as followers of our Hard Drive Stats reports will be interested in knowing, we’ve just started testing our first 16 TB drives, which are twice the size of the biggest drives we used back at the time of this post — then a whopping eight TB.

We’ve updated the details here and there in the text from the original post that was published on our blog on March 11, 2015. We’ve left the original 135 comments intact, although some of them might be non sequiturs after the changes to the post. We trust that you will be able to sort out the old from the new and make sense of what’s changed. If not, please add a comment and we’ll be happy to address your questions.

— Editor

Storage Vaults form the core of Backblaze’s cloud services. Backblaze Vaults are not only incredibly durable, scalable, and performant, but they dramatically improve availability and operability, while still being incredibly cost-efficient at storing data. Back in 2009, we shared the design of the original Storage Pod hardware we developed; here we’ll share the architecture and approach of the cloud storage software that makes up a Backblaze Vault.

Backblaze Vault Architecture for Cloud Storage

The Vault design follows the overriding design principle that Backblaze has always followed: keep it simple. As with the Storage Pods themselves, the new Vault storage software relies on tried and true technologies used in a straightforward way to build a simple, reliable, and inexpensive system.

A Backblaze Vault is the combination of the Backblaze Vault cloud storage software and the Backblaze Storage Pod hardware.

Putting The Intelligence in the Software

Another design principle for Backblaze is to anticipate that all hardware will fail and build intelligence into our cloud storage management software so that customer data is protected from hardware failure. The original Storage Pod systems provided good protection for data and Vaults continue that tradition while adding another layer of protection. In addition to leveraging our low-cost Storage Pods, Vaults take advantage of the cost advantage of consumer-grade hard drives and cleanly handle their common failure modes.

Distributing Data Across 20 Storage Pods

A Backblaze Vault comprises 20 Storage Pods, with the data evenly spread across all 20 pods. Each Storage Pod in a given vault has the same number of drives, and the drives are all the same size.

Drives in the same drive position in each of the 20 Storage Pods are grouped together into a storage unit we call a tome. Each file is stored in one tome and is spread out across the tome for reliability and availability.

Twenty hard drives, one per Storage Pod, form a single tome that shares the pieces of each file.

Every file uploaded to a Vault is divided into pieces before being stored. Each of those pieces is called a shard. Parity shards are computed to add redundancy, so that a file can be fetched from a vault even if some of the pieces are not available.

Each file is stored as 20 shards: 17 data shards and three parity shards. Because those shards are distributed across 20 Storage Pods, the Vault is resilient to the failure of a Storage Pod.

Files can be written to the Vault when one pod is down and still have two parity shards to protect the data. Even in the extreme and unlikely case where three Storage Pods in a Vault lose power, the files in the vault are still available because they can be reconstructed from the 17 pods that remain available.

Storing Shards

Each of the drives in a Vault has a standard Linux file system, ext4, on it. This is where the shards are stored. There are fancier file systems out there, but we don’t need them for Vaults. All that is needed is a way to write files to disk and read them back. Ext4 is good at handling power failure on a single drive cleanly without losing any files. It’s also good at storing lots of files on a single drive and providing efficient access to them.

Compared to a conventional RAID, we have swapped the layers here by putting the file systems under the replication. Usually, RAID puts the file system on top of the replication, which means that a file system corruption can lose data. With the file system below the replication, a Vault can recover from a file system corruption because a single corrupt file system can lose at most one shard of each file.

Creating Flexible and Optimized Reed-Solomon Erasure Coding

Just like RAID implementations, the Vault software uses Reed-Solomon erasure coding to create the parity shards. But, unlike Linux software RAID, which offers just one or two parity blocks, our Vault software allows for an arbitrary mix of data and parity. We are currently using 17 data shards plus three parity shards, but this could be changed on new vaults in the future with a simple configuration update.

Vault Row of Storage Pods

For Backblaze Vaults, we threw out the Linux RAID software we had been using and wrote a Reed-Solomon implementation from scratch, which we wrote about in “Backblaze Open-sources Reed-Solomon Erasure Coding Source Code.” It was exciting to be able to use our group theory and matrix algebra from college.

The beauty of Reed-Solomon is that we can then re-create the original file from any 17 of the shards. If one of the original data shards is unavailable, it can be re-computed from the other 16 original shards, plus one of the parity shards. Even if three of the original data shards are not available, they can be re-created from the other 17 data and parity shards. Matrix algebra is awesome!
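To make that concrete, here’s a small sketch of the 17+3 scheme driven through the Java Reed-Solomon library we open-sourced (the `ReedSolomon` class and its `encodeParity`/`decodeMissing` calls come from that library; the shard size and contents are made up for the demo):

```java
import com.backblaze.erasure.ReedSolomon;
import java.util.Arrays;

public class VaultShardDemo {
    static final int DATA_SHARDS = 17;
    static final int PARITY_SHARDS = 3;
    static final int TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS;

    public static void main(String[] args) {
        int shardSize = 1024; // made-up size; real shards are much larger

        // 17 data shards filled with arbitrary bytes; 3 parity shards start empty.
        byte[][] shards = new byte[TOTAL_SHARDS][shardSize];
        for (int i = 0; i < DATA_SHARDS; i++) {
            Arrays.fill(shards[i], (byte) i);
        }

        // Compute the three parity shards from the 17 data shards.
        ReedSolomon codec = ReedSolomon.create(DATA_SHARDS, PARITY_SHARDS);
        codec.encodeParity(shards, 0, shardSize);

        // Simulate three pods going dark: wipe any three shards.
        byte[] original = shards[2].clone();
        boolean[] present = new boolean[TOTAL_SHARDS];
        Arrays.fill(present, true);
        for (int lost : new int[] { 2, 9, 18 }) {
            shards[lost] = new byte[shardSize]; // contents gone
            present[lost] = false;
        }

        // Any 17 of the 20 shards are enough to rebuild the rest.
        codec.decodeMissing(shards, present, 0, shardSize);
        System.out.println("shard 2 recovered: " + Arrays.equals(shards[2], original));
    }
}
```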

Handling Drive Failures

The reason for distributing the data across multiple Storage Pods and using erasure coding to compute parity is to keep the data safe and available. How are different failures handled?

If a disk drive just up and dies, refusing to read or write any data, the Vault will continue to work. Data can be written to the other 19 drives in the tome, because the policy setting allows files to be written as long as there are two parity shards. All of the files that were on the dead drive are still available and can be read from the other 19 drives in the tome.

Building a Backblaze Vault Storage Pod

When a dead drive is replaced, the Vault software will automatically populate the new drive with the shards that should be there; they can be recomputed from the contents of the other 19 drives.

A Vault can lose up to three drives in the same tome at the same moment without losing any data, and the contents of the drives will be re-created when the drives are replaced.

Handling Data Corruption

Disk drives try hard to correctly return the data stored on them, but once in a while they return the wrong data, or are just unable to read a given sector.

Every shard stored in a Vault has a checksum, so that the software can tell if it has been corrupted. When that happens, the bad shard is recomputed from the other shards and then re-written to disk. Similarly, if a shard just can’t be read from a drive, it is recomputed and re-written.
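Conceptually, the check looks something like this minimal sketch (the SHA-1 choice and the way the checksum is stored alongside the shard are illustrative assumptions, not our production format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/**
 * Illustrative shard integrity check: hash the shard on disk and compare it
 * to the checksum stored for it. On a mismatch (or an unreadable file), the
 * shard would be recomputed from the other shards in the tome and rewritten.
 */
public class ShardCheck {
    static boolean shardIsHealthy(Path shardFile, byte[] storedChecksum)
            throws IOException, NoSuchAlgorithmException {
        byte[] actual = MessageDigest.getInstance("SHA-1")
                .digest(Files.readAllBytes(shardFile));
        return Arrays.equals(actual, storedChecksum);
    }
}
```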

Conventional RAID can reconstruct a drive that dies, but does not deal well with corrupted data because it doesn’t checksum the data.

Scaling Horizontally

Each vault is assigned a number. We carefully designed the numbering scheme to allow for a lot of vaults to be deployed, and designed the management software to handle scaling up to that level in the Backblaze data centers.

The overall design scales very well because file uploads (and downloads) go straight to a vault, without having to go through a central point that could become a bottleneck.

There is an authority server that assigns incoming files to specific Vaults. Once that assignment has been made, the client then uploads data directly to the Vault. As the data center scales out and adds more Vaults, the capacity to handle incoming traffic keeps going up. This is horizontal scaling at its best.

We could deploy a new data center with 10,000 Vaults holding 16TB drives and it could accept uploads fast enough to reach its full capacity of 160 exabytes in about two months! (At 20 Gbps per Vault, 10,000 Vaults can ingest roughly 25 terabytes per second, which fills 160 exabytes in about 75 days.)

Backblaze Vault Benefits

The Backblaze Vault architecture has six benefits:

1. Extremely Durable

The Vault architecture is designed for 99.999999% (eight nines) annual durability (now 11 nines — Editor). At cloud-scale, you have to assume hard drives die on a regular basis, and we replace about 10 drives every day. We have published a variety of articles sharing our hard drive failure rates.

The beauty with Vaults is that not only does the software protect against hard drive failures, it also protects against the loss of entire Storage Pods or even entire racks. A single Vault can have three Storage Pods — a full 180 hard drives — die at the exact same moment without a single byte of data being lost or even becoming unavailable.

2. Infinitely Scalable

A Backblaze Vault comprises 20 Storage Pods, each with 60 disk drives, for a total of 1,200 drives. Depending on the size of the hard drive, each vault will hold the following (usable capacity after parity and formatting overhead):

12 TB hard drives => 12.1 petabytes/vault (Deploying today.)
14 TB hard drives => 14.2 petabytes/vault (Deploying today.)
16 TB hard drives => 16.2 petabytes/vault (Small-scale testing.)
18 TB hard drives => 18.2 petabytes/vault (Announced by WD and Toshiba.)
20 TB hard drives => 20.2 petabytes/vault (Announced by Seagate.)

Backblaze Data Center

At our current growth rate, Backblaze deploys one to three Vaults each month. As the growth rate increases, the deployment rate will also increase. We can incrementally add more storage by adding more and more Vaults. Without changing a line of code, the current implementation supports deploying 10,000 Vaults per location. That’s 160 exabytes of data in each location. The implementation also supports up to 1,000 locations, which enables storing a total of 160 zettabytes (also known as 160,000,000,000,000 GB)!

3. Always Available

Data backups have always been highly available: if a Storage Pod was in maintenance, the Backblaze online backup application would contact another Storage Pod to store data. Previously, however, if a Storage Pod was unavailable, some restores would pause. For large restores this was not an issue since the software would simply skip the Storage Pod that was unavailable, prepare the rest of the restore, and come back later. However, for individual file restores and remote access via the Backblaze iPhone and Android apps, it became increasingly important to have all data be highly available at all times.

The Backblaze Vault architecture enables both data backups and restores to be highly available.

With the Vault arrangement of 17 data shards plus three parity shards for each file, all of the data is available as long as 17 of the 20 Storage Pods in the Vault are available. This keeps the data available while allowing for normal maintenance and rare expected failures.

4. Highly Performant

The original Backblaze Storage Pods could individually accept 950 Mbps (megabits per second) of data for storage.

The new Vault pods have more overhead, because they must break each file into pieces, distribute the pieces across the local network to the other Storage Pods in the vault, and then write them to disk. In spite of this extra overhead, the Vault is able to achieve 1,000 Mbps of data arriving at each of the 20 pods.

Backblaze Vault Networking

This capacity required a new type of Storage Pod that could handle this volume. The net of this: a single Vault can accept a whopping 20 Gbps of data.

Because there is no central bottleneck, adding more Vaults linearly adds more bandwidth.

5. Operationally Easier

When Backblaze launched in 2008 with a single Storage Pod, many of the operational analyses (e.g. how to balance load) could be done on a simple spreadsheet and manual tasks (e.g. swapping a hard drive) could be done by a single person. As Backblaze grew to nearly 1,000 Storage Pods and over 40,000 hard drives, the systems we developed to streamline and operationalize the cloud storage became more and more advanced. However, because our system relied on Linux RAID, there were certain things we simply could not control.

With the new Vault software, we have direct access to all of the drives and can monitor their individual performance and any indications of upcoming failure. And, when those indications say that maintenance is needed, we can shut down one of the pods in the Vault without interrupting any service.

6. Astoundingly Cost Efficient

Even with all of these wonderful benefits that Backblaze Vaults provide, if they raised costs significantly, it would be nearly impossible for us to deploy them since we are committed to keeping our online backup service affordable for completely unlimited data. However, the Vault architecture is nearly cost neutral while providing all these benefits.

Backblaze Vault Cloud Storage

When we were running on Linux RAID, we used RAID6 over 15 drives: 13 data drives plus two parity. That’s 15.4% storage overhead for parity.

With Backblaze Vaults, we wanted to be able to do maintenance on one pod in a vault and still have it be fully available, both for reading and writing. And, for safety, we weren’t willing to have fewer than two parity shards for every file uploaded. Using 17 data plus three parity drives raises the storage overhead just a little bit, to 17.6%, but still gives us two parity drives even in the infrequent times when one of the pods is in maintenance. In the normal case when all 20 pods in the Vault are running, we have three parity drives, which adds even more reliability.
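For clarity, the overhead figures above are simply parity shards divided by data shards:

```latex
% Parity overhead = parity shards / data shards
\frac{2}{13} \approx 15.4\% \qquad \frac{3}{17} \approx 17.6\%
```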

Summary

Backblaze’s cloud storage Vaults deliver 99.999999% (eight nines) annual durability (now 11 nines — Editor), horizontal scalability, and 20 Gbps of per-Vault performance, while being operationally efficient and extremely cost effective. Driven from the same mindset that we brought to the storage market with Backblaze Storage Pods, Backblaze Vaults continue our singular focus of building the most cost-efficient cloud storage available anywhere.

•  •  •

Note: This post was updated from the original version posted on March 11, 2015.

An Inside Look at the Backblaze Storage Pod Museum (February 14, 2019)

image of the back of a Backblaze Storage Pod

Merriam-Webster defines a museum as “an institution devoted to the procurement, care, study, and display of objects of lasting interest or value.” With that definition in mind, we’d like to introduce the Backblaze Storage Pod Museum. While some folks think of a museum as a place of static, outdated artifacts, others realize that those artifacts can tell a story over time of experimentation, evolution, and innovation. That is certainly the case with our Storage Pods. Modesty prevents us from saying that we changed the storage industry with our Storage Pod design, so let’s say we added a lot of red to the picture.

Over the years, Larry, our data center manager, has stashed away the various versions of our Storage Pods as they were removed from service. He also kept drives, SATA cards, power supplies, cables, and more. Thank goodness. With the equipment that Larry’s pack-rat tendencies saved, and a couple of current Storage Pods we borrowed (shhhh, don’t tell Larry), we were able to start the Backblaze Storage Pod Museum. Let’s take a quick photo trip through the years.

Storage Pod History Slide Show

Before Storage Pod 1.0

Before we announced Storage Pod 1.0 to the world nearly 10 years ago, we had already built about twenty Storage Pods. These early pods used Western Digital 1.0 TB Green drives. There were multiple prototypes, but once we went into production, we had settled on the 45-drive design with 3 rows of 15 vertically mounted drives. We ordered the first batch of ten chassis to be built and then discovered we did not spec a hole for the on/off switch. We improvised.

Storage Pod 1.0 — Petabytes on a Budget

We introduced the storage world to inexpensive cloud storage with Storage Pod 1.0. Funny thing, we didn’t refer to this innovation as version 1.0 — just a Backblaze Storage Pod. We not only introduced the Storage Pod, we also open-sourced the design, publishing the design specs, parts list, and more. People took notice. We introduced the design with Seagate 1.5 TB drives for a total of 67 TB of storage. This version also had an Intel Desktop motherboard (DG43NB) and 4 GB of memory.

Storage Pod 2.0 — More Petabytes on a Budget

Storage Pod 2.0 was basically twice the system that 1.0 was. It had twice the memory, twice the speed, and twice the storage, but it was in the same chassis with the same number of drives. All of this combined to reduce the cost per GB of the Storage Pod system by over 50%: from $0.117/GB in version 1 to $0.055/GB in version 2.

Among the changes: the desktop motherboard in V1 was upgraded to a server class motherboard, we simplified things by using three four-port SATA cards, and reduced the cost of the chassis itself. In addition, we used Hitachi (HGST) 3 TB drives in Storage Pod 2.0 to double the total amount of storage to 135 TB. Over their lifetime, these HGST drives had an annualized failure rate of 0.82%, with the last of them being replaced in Q2 2017.

Storage Pod 3.0 — Good Vibrations

Storage Pod 3.0 brought the first significant chassis redesign in our efforts to make the design easier to service and provide the opportunity to use a wider variety of components. The most noticeable change was the introduction of drive lids — one for each row of 15 drives. Each lid was held in place by a pair of steel rods. The drive lids held the drives below in place and replaced the drive bands used previously. The motherboard and CPU were upgraded and we went with memory that was Supermicro certified. In addition, we added standoffs to the chassis to allow for Micro ATX motherboards to be used if desired, and we added holes where needed to allow for someone to use one or two 2.5” drives as boot drives — we use one 3.5” drive.

Storage Pod 4.0 — Direct Wire

Up through Storage Pod 3.0, Protocase helped design and then build our Storage Pods. During that time, they also designed and produced a direct wire version, which replaced the nine backplanes with direct wiring to the SATA cards. Storage Pod 4.0 was based on the direct wire technology. We deployed a small number of these systems but we fought driver problems between our software and the new SATA cards. In the end, we went back to our backplanes and Protocase continued forward with direct wire systems that they continued to deploy successfully. Conclusion: there are multiple ways you can be successful with the Storage Pod design.

Storage Pod 4.5 — Backplanes are Back

This version started with the Storage Pod 3.0 design and introduced new 5-port backplanes and upgraded to SATA III cards. Both of these parts were built on Marvell chipsets. The backplanes we previously used were being phased out, which prompted us to examine other alternatives like the direct wire pods. Now we had a ready supply of 5-port backplanes and Storage Pod 4.5 was ready to go.

We also began using Evolve Manufacturing to build these systems. They were located near Backblaze and were able to scale to meet our ever increasing production needs. In addition, they were full of great ideas on how to improve the Storage Pod design.

Storage Pod 5.0 — Evolution from the Chassis on Up

While Storage Pod 3.0 was the first chassis redesign, Storage Pod 5.0 was, to date, the most substantial. Working with Evolve Manufacturing, we examined everything down to the rivets and stand-offs, looking for a better, more cost efficient design. Driving many of the design decisions was the introduction of Backblaze B2 Cloud Storage that was designed to run on our Backblaze Vault architecture. From a performance point-of-view we upgraded the motherboard and CPU, increased memory fourfold, upgraded the networking to 10 GB on the motherboard, and moved from SATA II to SATA III. We also completely redid the drive enclosures, replacing the 15-drive clampdown lids with nine five-drive compartments with drive guides.

Storage Pod 6.0 — 60 Drives

Storage Pod 6.0 increased the number of drives from 45 to 60. When the idea was first proposed, we had a lot of questions: would we need bigger power supplies (answer: no), more memory (no), a bigger CPU (no), or more fans (no)? We did need to redesign our SATA cable routes from the SATA cards to the backplanes, as we needed to stay under the one meter spec length for the SATA cables. We also needed to update our power cable harness, and, of course, add length to the chassis to accommodate the 15 additional drives, but nothing unexpected cropped up — it just worked.

What’s Next?

We’ll continue to increase the density of our storage systems. For example, we unveiled a Backblaze Vault full of 14 TB drives in our 2018 Drive Stats report. Each Storage Pod in that vault contains 840 terabytes worth of hard drives, meaning the 20 Storage Pods that make up the Backblaze Vault bring 16.8 petabytes of storage online when the vault is activated. As higher density drives and new technologies like HAMR and MAMR are brought to market, you can be sure we’ll be testing them for inclusion in our environment.

Nearly 10 years after the first Storage Pod altered the storage landscape, the innovation continues to deliver great returns to the market. Many other companies, from 45Drives to Dell and HP, have leveraged the Storage Pod’s concepts to make affordable, high-density storage systems. We think that’s awesome.

Design Thinking: B2 APIs (& The Hidden Costs of S3 Compatibility) (August 8, 2018)

API Functions - Authorize, Download, List, Upload

When we get asked, “why did Backblaze make its own set of APIs for B2,” the question behind the question is most often “why didn’t Backblaze just implement an S3-compatible interface for B2?”

Both are totally reasonable questions to ask. The quick answer to either question? So our customers and partners can move faster while simultaneously enabling Backblaze to sustainably offer a cloud storage service that is ¼ of the price of S3.

But, before we jump all the way to the end, let me step you through our thinking.

The Four Major Functions of Cloud Storage APIs

Throughout cloud storage, S3 or otherwise, APIs are meant to mainly provide access to four major underlying functions:

  • Authorization — Providing account/bucket/file access
  • Upload — Sending files to the cloud
  • Download — Data retrieval
  • List — Data checking/selection/comparison

The comparison between B2 and S3 on the List and Download functions is, candidly, not that interesting. Fundamentally, we ended up having similar approaches when solving those challenges. If the detail is of interest, I’m happy to get into that on a later post or answer questions in the comments below.

Backblaze and Amazon did take different approaches to how each service handles Authorization. The five-step approach for S3 is well outlined in Amazon’s documentation. B2’s architecture enables secure authorization in just two steps. My assertion is that a two-step architecture is ~60% simpler than a five-step approach. To understand what we’re doing, I’d like to introduce the concept of Backblaze’s “Contract Architecture.”

The easiest way to understand B2’s Contract Architecture is to deep dive into how we handle the Upload process.

Uploads (Load Balancing vs Contract Architecture)

The interface to upload data into Amazon S3 is actually a bit simpler than Backblaze B2’s API. But it comes at a literal cost. It requires Amazon to have a massive and expensive choke point in their network: load balancers. When a customer tries to upload to S3, she is given a single upload URL to use. For instance, http://s3.amazonaws.com/<bucketname>. This is great for the customer as she can just start pushing data to the URL. But that requires Amazon to be able to take that data and then, in a second step behind the scenes, find available storage space and push the data to that location. The second step creates a choke point as it requires having high bandwidth load balancers. That, in turn, carries a significant customer implication: load balancers cost a lot of money.

When we were creating the B2 APIs, we faced a dilemma — do we go a simple but expensive route like S3? Or is there a way to remove significant cost even if it means introducing some slight complexity? We understood that there are perfectly great reasons to go either way — and there are customers at either end of this decision tree.

We realized the expense savings could be significant; we know load balancers well. We use them for our Download capabilities. They are expensive so, to run a sustainable service, we charge for downloads. That B2 download pricing is 1¢/GB while Amazon S3 starts at 9¢/GB is a subject we covered in a prior blog post.

Back to the B2 upload function. With our existing knowledge of the “expensive” design, the next step was to understand the alternative path. We found a way to create significant savings by only introducing a modest level of complexity. Here’s how: When a “client” wants to push data to the servers, it does not just start uploading data to a “well known URL” and have the SERVER figure out where to put the data. At the start, the client contacts a “dispatching server” that has the job of knowing where there is optimally available space in a Backblaze data center.

The dispatching server (the API server answering the b2_get_upload_url call) tells the client, “there is space over on Vault-8329.” This next step is our magic. Armed with the knowledge of the open vault, the client ends its connection with the dispatching server and creates a brand new request DIRECTLY to Vault-8329 (calling b2_upload_file or b2_upload_part). No load balancers involved! This is guaranteed to scale infinitely for very little overhead cost. A side note is that the client can continue to directly call b2_upload_file repeatedly from now on (without asking the dispatching server ever again), up until it gets the response indicating that particular vault is full. In other words, this does NOT double the number of network requests.

The “Contract” concept emanates from a simple truth: all APIs are contracts between two entities (machines). Since the client knows exactly where to go and exactly what authorization to bring with it, it can establish a secure “contract” with the Vault specified by the dispatching server.[1] The modest complexity only comes into play if Vault-8329 fills up, gets too busy, or goes offline. In that case, the client will receive either a 500 or 503 error as notification that the contract has been terminated (in effect, it’s a firm message that says “stop uploading to Vault-8329, it doesn’t have room for more data”). When this happens, the client is responsible for going BACK to the dispatching server, asking for a new vault, and retrying the upload to a different vault. In the scenario where the client has to go back to the dispatching server, the “two phase” process becomes more work for the client versus S3’s singular “well known URL” architecture. Of course, this is all handled at the code level and is well documented. In effect, your code just needs to know that “if you receive a 500 or 503 error, just retry.” It’s free, it’s easy, and it will work.
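To make the contract loop concrete, here’s a rough sketch in Java (the endpoint paths, JSON fields, and X-Bz-* headers follow our public B2 API docs; the retry policy, error handling, and JSON parsing are simplified for illustration):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.security.MessageDigest;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Sketch of the two-step contract upload. Assumes b2_authorize_account has
 * already been called (apiUrl and accountToken in hand).
 */
public class ContractUpload {
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Step 1: ask the dispatching server which vault has room.
    static String[] getUploadTarget(String apiUrl, String accountToken, String bucketId)
            throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(apiUrl + "/b2api/v2/b2_get_upload_url"))
                .header("Authorization", accountToken)
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"bucketId\": \"" + bucketId + "\"}"))
                .build();
        String body = HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        return new String[] { jsonField(body, "uploadUrl"),
                              jsonField(body, "authorizationToken") };
    }

    // Step 2: push the bytes DIRECTLY to that vault -- no load balancer.
    static int uploadOnce(String uploadUrl, String uploadToken, String fileName,
                          byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        StringBuilder sha1 = new StringBuilder();
        for (byte b : digest) sha1.append(String.format("%02x", b));
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(uploadUrl))
                .header("Authorization", uploadToken)
                .header("X-Bz-File-Name", fileName)
                .header("Content-Type", "b2/x-auto")
                .header("X-Bz-Content-Sha1", sha1.toString())
                .POST(HttpRequest.BodyPublishers.ofByteArray(data))
                .build();
        return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).statusCode();
    }

    // The contract loop: a 500/503 means "this vault is full or busy," so go
    // back to the dispatching server for a new vault and retry.
    static void upload(String apiUrl, String accountToken, String bucketId,
                       String fileName, byte[] data) throws Exception {
        for (int attempt = 0; attempt < 5; attempt++) {
            String[] target = getUploadTarget(apiUrl, accountToken, bucketId);
            int status = uploadOnce(target[0], target[1], fileName, data);
            if (status == 200) return;                 // contract fulfilled
            if (status != 500 && status != 503) break; // some other failure
        }
        throw new RuntimeException("upload failed");
    }

    // Crude JSON field extraction, just to keep this sketch dependency-free.
    private static String jsonField(String json, String key) {
        Matcher m = Pattern.compile("\"" + key + "\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
        if (!m.find()) throw new IllegalStateException(key + " not in response");
        return m.group(1);
    }
}
```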

So while the Backblaze approach introduces some modest complexity, it can quickly and easily be reliably handled with code. Looking at S3’s approach, it is certainly simpler, but it results in three expensive consequences:

1) Expensive fixed costs. Amazon S3 has a single upload URL choke point that requires load balancers and extremely high bandwidth requirements. Backblaze’s architecture does not require moving data around internally; this lets us use commodity 10 Gbps connections that are affordable and will scale infinitely. Further, as discussed above, load balancing hardware is expensive. By removing it from our Upload system, we remove a significant cost center.

2) Expensive single URL availability issues. Amazon S3’s solution requires high availability of the single upload URL for massive amounts of data. The Contract concept from Backblaze works more reliably, but does add slight additional complexity when (rare) extra network round trips are needed.

3) Expensive, time consuming data copy needs (and “eventual consistency”). Amazon S3 requires the copying of massive amounts of data from one part of their network (the upload server) to wherever the data’s ultimate resting place will be. This is at the root of one of the biggest frustrations when dealing with S3: Amazon’s “eventual consistency.” It means that you can’t use your data until it has been pushed to all the places it needs to go. As the article notes, this is usually fast, but can take a material amount of time, at any time. The lack of predictability around access times is something anyone dealing with S3 is all too familiar with.

The B2 architecture offers what one could consider “strong consistency.” There are different definitions of that idea. Ours is that the client connects DIRECTLY with the correct final location for the data to land. Once our system has confirmed a write, the data has been written to enough places that we can guarantee that the data can be seen without delay.

Was our decision a good one? Customers will continue to vote on that, but it appears that the marginal complexity is more than offset by the fact that B2 is sustainable service offered at ¼ of S3’s price.

But Seriously, Why Aren’t You Just “S3 Compatible?”

The use of Object Storage requires some sort of interface. You can build it yourself by learning your vendor’s APIs or you can go through a third party integration. Regardless of what route you choose, somebody is becoming fluent in the vendor’s APIs. And beyond the difference in cost, there’s a reliability component.

This is a good time to clear up a common misconception. The S3 protocol is not a specification: it’s an API doc. Why does this matter? Because API docs leave many outcomes undocumented. For instance, when one uses S3’s list_files function, a developer canNOT know what is going to happen just by reading the API docs. Compounding this issue is the sheer scope of the S3 API; it is huge and expanding. Systems that purport to be “S3 compatible” are unlikely to implement the full API and have to document whatever subset they implement. Once that is done, they will have to work with integration partners and customers to communicate what subset they choose as “important.”

Ultimately, we have chosen to create robust documentation describing, among other things, the engineering specification (this input returns that output, here’s how B2 handles various error cases, etc).

With hundreds of integrations from third parties and hundreds of thousands of customers, it’s clear that our APIs have proven easy to implement. The reality is the first time anyone implements cloud storage into their application it can take weeks. The first move into the cloud can be particularly tough for legacy applications. But the marginal cloud implementation can be reliably completed in days, if not hours, if the documentation is clear and the APIs can be well understood. I’m pleased that we’ve been able to create a complete solution that is proving quite easy to use.

And it’s a big deal that B2 is free of the “load balancer problem.” It solves for a huge scaling issue. When we roll out new vaults in new data centers in new countries, the clients are contacting those vaults DIRECTLY (over whatever network path is shortest) and so there are fewer choke points in our architecture.

It all means that, over an infinite time horizon, our customers can rely on B2 as the most affordable, easiest to use cloud storage on the planet. And, at the end of the day, if we’re doing that, we’ve done the right thing.


[1] The Contract Architecture also explains how we got to a secure two step Authorization process. When you call the dispatching server, we run the authentication process and then give you a Vault for uploads and an Auth token. When you are establishing the contract with the Vault, the Auth token is required before any other operation can begin.

Yes, Backblaze Just Ordered 100 Petabytes of Hard Drives (October 5, 2017)

10 Petabyte vault, 100 Petabytes ordered, 400 Petabytes stored

Backblaze just ordered 100 petabytes’ worth of hard drives, and yes, we’ll use nearly all of them in Q4. In fact, we’ll begin the process of sourcing the Q1 hard drive order in the next few weeks.

What are we doing with all those hard drives? Let’s take a look.

Our First 10 Petabyte Backblaze Vault

Ken clicked the submit button and 10 Petabytes of Backblaze Cloud Storage came online, ready to accept customer data. Ken (aka the Pod Whisperer) is one of our Datacenter Operations Managers at Backblaze, and with that one click he activated Backblaze Vault 1093, which was built with 1,200 Seagate 10 TB drives (model: ST10000NM0086). After formatting and configuration of the disks, there is 10.12 Petabytes of free space remaining for customer data. Back in 2011, when Ken started at Backblaze, he was amazed that we had amassed as much as 10 Petabytes of data storage.

The Seagate 10 TB drives we deployed in Vault 1093 are helium-filled drives. We had previously deployed 45 HGST 8 TB helium-filled drives, where we learned one of the benefits of helium drives: they consume less power than traditional air-filled drives. Here’s a quick comparison of the power consumption of several high-density drive models we deploy:

MFR     | Model           | Fill   | Size  | Idle (1)  | Operating (2)
Seagate | ST8000DM002     | Air    | 8 TB  | 7.2 watts | 9.0 watts
Seagate | ST8000NM0055    | Air    | 8 TB  | 7.6 watts | 8.6 watts
HGST    | HUH728080ALE600 | Helium | 8 TB  | 5.1 watts | 7.4 watts
Seagate | ST10000NM0086   | Helium | 10 TB | 4.8 watts | 8.6 watts

(1) Idle: Average idle power in watts, as reported by the manufacturer.
(2) Operating: The maximum operational consumption in watts, as reported by the manufacturer — typically for read operations.

I’d like 100 Petabytes of Hard Drives To Go, Please

100 Petabytes should get us through Q4.” — Tim Nufire, Chief Cloud Officer, Backblaze

The 1,200 Seagate 10 TB drives are just the beginning. The next Backblaze Vault will be configured with 12 TB drives which will give us 12.2 petabytes of storage in one vault. We are currently building and adding two to three Backblaze Vaults a month to our cloud storage system, so we are going to need more drives. When we did all of our “drive math,” we decided to place an order for 100 petabytes of hard drives comprised of 10 and 12 TB models. Gleb, our CEO and occasional blogger, exhaled mightily as he signed the biggest purchase order in company history. Wait until he sees the one for Q1.

Enough drives for a 10 petabyte vault

400 Petabytes of Cloud Storage

When we added Backblaze Vault 1093, we crossed over 400 Petabytes of total available storage. For those of you keeping score at home, we reached 350 Petabytes about 3 months ago as you can see in the chart below.

Petabytes of data stored by Backblaze

Backblaze Vault Primer

All of the storage capacity we’ve added in the last two years has been on our Backblaze Vault architecture, with Vault 1093 being the 60th one we have placed into service. Each Backblaze Vault comprises 20 Backblaze Storage Pods logically grouped together into one storage system. Today, each Storage Pod contains sixty 3 ½” hard drives, giving each vault 1,200 drives. Early vaults were built on Storage Pods with 45 hard drives, for a total of 900 drives in a vault.

A Backblaze Vault accepts data directly from an authenticated user. Each data blob (object, file, group of files) is divided into 20 shards (17 data shards and 3 parity shards) using our erasure coding library. Each of the 20 shards is stored on a different Storage Pod in the vault. At any given time, several vaults stand ready to receive data storage requests.

Drive Stats for the New Drives

In our Q3 2017 Drive Stats report, due out in late October, we’ll start reporting on the 10 TB drives we are adding. It looks like the 12 TB drives will come online in Q4. We’ll also get a better look at the 8 TB consumer and enterprise drives we’ve been following. Stay tuned.

Other Big Data Clouds

We have always been transparent here at Backblaze, including about how much data we store, how we store it, even how much it costs to do so. Very few others do the same. But, if you have information on how much data a company or organization stores in the cloud, let us know in the comments. Please include the source and make sure the data is not considered proprietary. If we get enough tidbits we’ll publish a “big cloud” list.

The Cloud’s Software: A Look Inside Backblaze (March 10, 2017)

When most of us think about “the cloud,” we have an abstract idea that it’s computers in a data center somewhere—racks of blinking lights and lots of loud fans. There’s truth to that. Have a look inside our data center to get an idea. But besides the impressive hardware—and the skilled techs needed to keep it running—there’s software involved. Let’s take a look at a few of the software tools that keep our operation working.

Our data center is populated with Storage Pods, the servers that hold the data you entrust to us if you’re a Backblaze customer or you use Backblaze B2 Cloud Storage. Inside each Storage Pod are dozens of 3.5-inch spinning hard disk drives—the same kind you’ll find inside a desktop PC. Storage Pods are mounted on racks inside the data center. Those Storage Pods work together in Vaults.

Vault Software

The Vault software that keeps those Storage Pods humming is one of the backbones of our operation. It’s what makes it possible for us to scale our services to meet your needs with durability, scalability, and fast performance.

The Vault software distributes data across 20 different Storage Pods, with the data spread evenly across all 20 Pods. Drives in the same position inside each Storage Pod are grouped together in software in what we call a Tome. When a file gets uploaded to Backblaze, it’s split into pieces we call Shards and distributed across all 20 drives.

Each file is stored as 20 shards: 17 data shards and three parity shards. As the name implies, the data shards comprise the information in the files you upload to Backblaze. Parity shards add redundancy so that a file can be completely restored from a Vault even if some of the pieces are not available.

Because those shards are distributed across 20 Storage Pods in 20 cabinets, a Storage Pod can go down and the Vault will still operate unimpeded. An entire cabinet can lose power and the Vault will still work fine.

Files can be written to the Vault even if a Storage Pod is down with two parity shards to protect the data. Even in the extreme—and unlikely—case where three Storage Pods in a Vault are offline, the files in the vault are still available because they can be reconstructed from the 17 available pieces.

Reed-Solomon Erasure Coding

Erasure coding makes it possible to rebuild a data file even if parts of the original are lost. Having effective erasure coding is vital in a distributed environment like a Backblaze Vault. It helps us keep your data safe even when the hardware that the data is stored on needs to be serviced.

We use Reed-Solomon erasure encoding. It’s a proven technique used in Linux RAID systems, by Microsoft in its Azure cloud storage, and by Facebook, too. The Backblaze Vault Architecture is calculated at 99.999999999% annual durability thanks in part to our Reed-Solomon erasure coding implementation.

Our own Brian Beach explains how Reed-Solomon encoding works in a video embedded in the original post.

We threw out the Linux RAID software we had been using prior to the implementation of the Vaults and wrote our own Reed-Solomon implementation from scratch. We’re very proud of it. So much so that we’ve released it as open source that you can use in your own projects, if you wish.

We developed our Reed-Solomon implementation as a Java library. Why? When we first started this project, we assumed that we would need to write it in C to make it run as fast as we needed. It turns out that modern Java virtual machines running on our servers are great, and the just-in-time compiler produces code that runs pretty quickly.

All the work we’ve done to build a reliable, scalable, affordable solution for storing data in a “cloud” led to the creation of B2 Cloud Storage. Backblaze B2 lets you store your data in the cloud for a fraction of what you’d spend elsewhere—one-fourth the price of Amazon S3, for example.

Using Our Storage

Having over 300PB of data storage available isn’t very useful unless we can store data and reliably restore it, too. We offer two ways to store data with Backblaze: via a client application or via direct access. Our client application, Backblaze Computer Backup, is installed on your Mac or Windows system and basically does everything related to automatically backing up your computer. We locate the files that are new or changed and back them up. We manage versions, deduplicate files, and more. The Backblaze app does all the work behind the scenes.

The other way to use our storage is via direct access. You can use a web GUI, a command line interface (CLI), or an application programming interface (API). With any of these methods, you are in charge of what gets stored in the Backblaze cloud. This is what Backblaze B2 is all about. You can log into Backblaze B2 and use the web GUI to drag and drop files that are stored in the Backblaze cloud. You decide what gets added and deleted, and how many versions of a file you want to keep. Think of B2 Cloud Storage as your very own bucket in the cloud where you can store your files.
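For a taste of direct API access, here’s a hedged sketch of the first call an integration makes, b2_authorize_account (endpoint and response fields per the public B2 docs; the credentials are placeholders):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/**
 * Sketch of b2_authorize_account, the entry point to the B2 API. The JSON
 * response carries the apiUrl, downloadUrl, and authorizationToken that all
 * later calls use.
 */
public class B2Authorize {
    public static void main(String[] args) throws Exception {
        String keyId = "<applicationKeyId>"; // placeholder credentials
        String key = "<applicationKey>";
        String basic = Base64.getEncoder().encodeToString(
                (keyId + ":" + key).getBytes(StandardCharsets.UTF_8));

        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://api.backblazeb2.com/b2api/v2/b2_authorize_account"))
                .header("Authorization", "Basic " + basic)
                .GET()
                .build();

        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.body());
    }
}
```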

We also have mobile apps for iOS and Android devices to help you view and share any backed up files you have on the go. You can download them, play back or view media files, and share them as you need.

We focused on creating a native, integrated experience for you when you use our software. We didn’t take a shortcut to create a Java app for the desktop. On the Mac our app is built using Xcode and on the PC it was built using C. The app is designed for lightweight, unobtrusive performance. If you do need to adjust its performance, we give you that ability. You have control over throttling the backup rate. You can even adjust the number of CPU threads dedicated to Backblaze, if you choose.

When we first released the software almost a decade ago we had no idea that we’d iterate it more than 1,000 times. That’s the threshold we reached late last year, when we released version 4.3.0 in December. We’re still plugging away at it and have plans for the future, too.

Our Philosophy: Keep it Simple

“Keep it simple” is the philosophy that underlies all of the technology that powers our hardware. It makes it possible for you to affordably, reliably back up your computers and store data in the cloud.

We’re not interested in creating elaborate, difficult to implement solutions or pricing schemes that confuse and confound you. Our backup service is unlimited and unthrottled for one low price. We offer cloud storage for one-fourth the price of the competition. And we make it easy to access with desktop, mobile and web interfaces, command line tools, and APIs.

Hopefully we’ve shed some light on the software that lets our cloud services operate. Have questions? Join the discussion and let us know.

Backblaze Open-sources Reed-Solomon Erasure Coding Source Code (June 16, 2015)

Reed Solomon Erasure Coding

At Backblaze we have built an extremely cost-effective storage system that enables us to offer a great price on our online backup service. Along the path to building our storage system, we have used time-tested technologies off the shelf, but we have also built in-house technologies ourselves when things weren’t available, or when the price was too high.

We have taken advantage of many open-source projects, and want to do our part in contributing back to the community. Our first foray into open-source was our original Storage Pod design, back in September of 2009.

Today, we are releasing our latest open-source project: Backblaze Reed-Solomon, a Java library for erasure coding.

An erasure code takes a “message,” such as a data file, and makes a longer message in a way that the original can be reconstructed from the longer message even if parts of the longer message have been lost. Reed-Solomon is an erasure code with exactly the properties we needed for file storage, and it is simple and straightforward to implement. It ensures that an entire data element can be recovered even when part or parts of the original stored data element are lost or unavailable.

Download the Open-Source Code

You can find the source code for Backblaze Reed-Solomon on GitHub.

The code is licensed with the MIT License, which means that you can use it in your own projects for free. You can even use it in commercial projects. We’ve put together a video titled “Reed-Solomon Erasure Coding Overview” to get you started.

Erasure Codes and Storage

Erasure coding is standard practice for systems that store data reliably, and many of them use Reed-Solomon coding.

The RAID system built into Linux uses Reed-Solomon. It has a carefully tuned Reed-Solomon implementation in C that is part of the RAID module. Microsoft Azure uses a similar, but different, erasure coding strategy. We’re not sure exactly what Amazon S3 and Google Cloud Storage use because they haven’t said, but it’s bound to be Reed-Solomon or something similar. Facebook’s new cold storage system also uses Reed-Solomon.

If you want reliable storage that can recover from the loss of parts of the data, then Reed-Solomon is a well-proven technique.

Backblaze Vaults Utilize Erasure Coding

Earlier this year, I wrote about Backblaze Vaults, our new software architecture that allows a file to be stored across multiple Storage Pods, so that the file can be available for download even when some Storage Pods are shut down for maintenance.

To make Backblaze Vaults work, we needed an erasure coding library to compute “parity” and then use it to reconstruct files. When a file is stored in a Vault, it is broken into 17 pieces, all the same size. Then three additional pieces are created that hold parity, resulting in a total of 20 pieces. The original file can then be reconstructed from any 17 of the 20 pieces.

We needed a simple, reliable, and efficient Java library to do Reed-Solomon coding, but didn’t find any. So we built our own. And now we are releasing that code for you to use in your own projects.

Performance

Backblaze Vaults store a vast amount of data and need to be able to ingest it quickly. This means that the Reed-Solomon coding must be fast. When we started designing Vaults, we assumed that we would need to code in C to make things fast. It turned out, though, that modern Java virtual machines are really good, and the just-in-time compiler produces code that runs fast.

Our Java library for Reed-Solomon is as fast as a C implementation, and is much easier to integrate with a software stack written in Java.

A Vault splits data into 17 shards, and has to calculate three parity shards from that, so that’s the configuration we use for performance measurements. Running in a single thread on Storage Pod hardware, our library can process incoming data at 149 megabytes per second. (This test was run on a single processor core, on a Pod with an Intel Xeon E5-1620 v2, clocked at 3.70GHz, on data not already in cache memory.)

Reed-Solomon Encoding Matrix Example

Feel free to skip this section if you aren’t into the math.

We are fortunate that mathematicians have been working on matrix algebra, group theory, and information theory for centuries. Reed and Solomon used this body of knowledge to create a coding system that seems like magic. It can take a message, break it into n pieces, add k “parity” pieces, and then reconstruct the original from n of the (n+k) pieces.

The examples below use a “4+2” coding system, where the original file is broken into four pieces, and then two parity pieces are added. In Backblaze Vaults, we use 17+3 (17 data plus three parity). The math—and the code—works with any numbers as long as you have at least one data shard and don’t have more than 256 shards total. To use Reed-Solomon, you put your data into a matrix. For computer files, each element of the matrix is one byte from the file. The bytes are laid out in a grid to form a matrix. If your data file has “ABCDEFGHIJKLMNOP” in it, you can lay it out like this:

The Original Data

In this example, the four pieces of the file are each four bytes long. Each piece is one row of the matrix. The first one is “ABCD.” The second one is “EFGH.” And so on.
The Reed-Solomon algorithm creates a coding matrix that you multiply with your data matrix to create the coded data. The matrix is set up so that the first four rows of the result are the same as the first four rows of the input. That means that the data is left intact, and all it’s really doing is computing the parity.

Applying the Coding Matrix
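The numeric matrices were in the original figure; symbolically (our notation, with I the 4×4 identity and P the 2×4 parity rows of the coding matrix), the relationship is:

```latex
% The 6x4 coding matrix times the 4x4 data matrix D: the identity rows
% pass the data through untouched; the P rows compute the parity.
\begin{pmatrix} I \\ P \end{pmatrix} D \;=\; \begin{pmatrix} D \\ P\,D \end{pmatrix}
```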

The result is a matrix with two more rows than the original. Those two rows are the parity pieces.

Each row of the coding matrix produces one row of the result. So each row of the coding matrix makes one of the resulting pieces of the file. Because the rows are independent, you can cross out two of the rows and the equation still holds.

Data Loss: Two of the Six Rows Are “Lost”

With those rows completely gone, it looks like this:

Data Loss: The Matrix Without the Two “Lost” Rows

Because of all the work that mathematicians have done over the years, we know the coding matrix, the matrix on the left, is invertible. There is an inverse matrix that, when multiplied by the coding matrix, produces the identity matrix. As in basic algebra, in matrix algebra you can multiply both sides of an equation by the same thing. In this case, we’ll multiply on the left by the inverse matrix:

Multiplying Each Side of the Equation by the Inverse Matrix
The Inverse Matrix and the Coding Matrix Cancel Out

This leaves the equation for reconstructing the original data from the pieces that are available:

Reconstructing the Original Data

So to make a decoding matrix, the process is to take the original coding matrix, cross out the rows for the missing pieces, and then find the inverse matrix. You can then multiply the inverse matrix and the pieces that are available to reconstruct the original data.
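Running the post’s “ABCDEFGHIJKLMNOP” example through the library ties the steps together (a sketch using the encodeParity/decodeMissing calls from our GitHub repo; which two shards go missing is an arbitrary choice):

```java
import com.backblaze.erasure.ReedSolomon;

/**
 * The 4+2 example from this post, driven through the library:
 * encode parity, lose two rows, reconstruct the original.
 */
public class FourPlusTwoDemo {
    public static void main(String[] args) {
        byte[] message = "ABCDEFGHIJKLMNOP".getBytes();
        int shardSize = 4;

        // Lay the 16 bytes out as 4 data shards of 4 bytes each,
        // plus 2 empty parity shards.
        byte[][] shards = new byte[6][shardSize];
        for (int i = 0; i < 4; i++) {
            System.arraycopy(message, i * shardSize, shards[i], 0, shardSize);
        }

        ReedSolomon codec = ReedSolomon.create(4, 2);
        codec.encodeParity(shards, 0, shardSize);

        // "Lose" two rows: one data shard and one parity shard.
        boolean[] present = { true, false, true, true, false, true };
        shards[1] = new byte[shardSize];
        shards[4] = new byte[shardSize];

        // Any four of the six rows recover the original.
        codec.decodeMissing(shards, present, 0, shardSize);
        System.out.println(new String(shards[1])); // prints "EFGH"
    }
}
```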

Summary

That was a quick overview of the math. Once you understand the steps, it’s not super complicated. The Java code goes through the same steps outlined above.

There is one small part of the code that does the actual matrix multiplications that has been carefully optimized for speed. The rest of the code does not need to be fast, so we aimed more for simple and clear.

If you need to store or transmit data, and be able to recover it if some is lost, you might want to look at Reed-Solomon coding. Using our code is an easy way to get started.
