Developer Archives - Backblaze Blog | Cloud Storage & Cloud Backup

B2 Copy File is Now Public
https://www.backblaze.com/blog/b2-copy-file-is-now-public/ (Tue, 13 Aug 2019)

At the beginning of summer, we put the B2 Copy File APIs into beta. We’re pleased to announce that they are now public and ready to use (and we even added a new feature). Here's the lowdown on how developers can use these new capabilities with B2 Cloud Storage.

B2 Copy File

At the beginning of summer, we put B2 Copy File APIs into beta. We’re pleased to announce the end of the beta and that the APIs are all now public!

A number of people used the beta features and gave us great feedback. In fact, that feedback led us to implement an additional feature.

New Feature — Bucket to Bucket Copies

Initially, our guidance was that these new APIs were only to be used within the same B2 bucket, but in response to customer and partner feedback, we added the ability to copy files from one bucket to another bucket within the same account.

To use this new feature with b2_copy_file, simply pass in the destinationBucketId where the new file copy will be stored. If this is not set, the copied file will default to the same bucket as the source file. Within b2_copy_part, there is a subtle difference in that the Source File ID can belong to a different bucket than the Large File ID.
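For illustration, here is a minimal sketch of such a call in Python using the requests library. The api_url and auth_token values would come from an earlier b2_authorize_account call, and the file and bucket IDs shown are placeholders, not real values:

```python
import requests

# Placeholders: in a real integration these come from b2_authorize_account.
api_url = "https://api002.backblazeb2.com"
auth_token = "account-auth-token"

# Copy an existing file into a different bucket in the same account.
resp = requests.post(
    f"{api_url}/b2api/v2/b2_copy_file",
    headers={"Authorization": auth_token},
    json={
        "sourceFileId": "source-file-id",                 # placeholder
        "fileName": "photos/2019/vacation-copy.jpg",      # name of the new file
        "destinationBucketId": "destination-bucket-id",   # omit to copy within the source bucket
        "metadataDirective": "COPY",                      # reuse the source file's metadata
    },
)
resp.raise_for_status()
print(resp.json()["fileId"])  # ID of the newly created copy
```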

For the complete API documentation, refer to the Backblaze B2 docs online (the direct links are listed under Where to Learn More below).

What You Can Do With B2 Copy File

In a literal sense, the new capability enables you to create a new file (or new part of a large file) that is a copy of an existing file (or range of an existing file). You can either copy over the source file’s metadata or specify new metadata for the new file that is created. This all occurs without having to download or re-upload any data.

This has been one of our most requested features as it unlocks:

  • Rename/Re-organize. The new capabilities give customers the ability to re-organize their files without having to download and re-upload. This is especially helpful when trying to mirror the contents of a file system to B2.
  • Synthetic Backup. With the ability to copy ranges of a file, users can now leverage B2 for synthetic backup, i.e. uploading a full backup but then only uploading incremental changes (as opposed to re-uploading the whole file with every change). This is particularly helpful for applications like backing up VMs where re-uploading the entirety of the file every time it changes can be inefficient. A sketch of such a range copy follows this list.
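For the synthetic backup case, here is a minimal sketch (Python, requests) of the range-copy step mentioned above. The api_url, auth_token, and file IDs are placeholders; the new large file would have been started with b2_start_large_file and is later sealed with b2_finish_large_file:

```python
import requests

# Placeholders: api_url/auth_token come from b2_authorize_account.
api_url = "https://api002.backblazeb2.com"
auth_token = "account-auth-token"

# Reuse an unchanged byte range of the previous full backup as part 1 of a new large file.
resp = requests.post(
    f"{api_url}/b2api/v2/b2_copy_part",
    headers={"Authorization": auth_token},
    json={
        "sourceFileId": "previous-full-backup-file-id",   # placeholder
        "largeFileId": "new-backup-large-file-id",        # from b2_start_large_file (placeholder)
        "partNumber": 1,
        "range": "bytes=0-104857599",                     # first 100MB, unchanged since the last backup
    },
)
resp.raise_for_status()
# Changed ranges are then uploaded with b2_upload_part, and the backup is
# completed with b2_finish_large_file.
```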

While many of our customers directly leverage our APIs, just as many use 3rd party software (B2 Integration Partners) to facilitate storage into B2. Our Integration Partners were very helpful and active in giving us feedback during the beta. Some highlights of those that are already supporting the copy_file feature:

Transmit: a macOS file transfer/cloud storage application that supports high speed copying of data between your Mac and more than 15 different cloud services.
Rclone: "rsync for cloud storage," a powerful command line tool to copy and sync files to and from local disk, SFTP servers, and many cloud storage providers.
Mountain Duck: mounts server and cloud storage as a disk (Finder on macOS; File Explorer on Windows). With Mountain Duck, you can also open remote files with any application as if the file were on a local volume.
Cyberduck: a file transfer/cloud storage browser for Mac and Windows with support for more than 10 different cloud services.
FileZilla Pro: transfers files between your machine and remote servers and supports a number of different protocols and storage locations.

Where to Learn More

The endpoint documentation can be found here:

b2_copy_file:  https://www.backblaze.com/b2/docs/b2_copy_file.html
b2_copy_part:  https://www.backblaze.com/b2/docs/b2_copy_part.html

Backblaze B2 Copy File Beta is Now Public
https://www.backblaze.com/blog/backblaze-b2-copy-file-beta-is-now-public/ (Tue, 21 May 2019)

One of the most requested developer features for B2, copy file, is now in beta and ready for you to try out.

B2 Copy File Beta

Since introducing B2 Cloud Storage nearly four years ago, we’ve been busy adding enhancements and new functionality to the service. We continually look for ways to make B2 more useful for our customers, be it through service level enhancements, partnerships with leading Compute providers, or lowering our download price, already the industry’s lowest, to 1¢/GB. Today, we’re pleased to announce the beta release of our newest functionality: Copy File.

What You Can Do With B2 Copy File

This new capability enables you to create a new file (or new part of a large file) that is a copy of an existing file (or range of an existing file). You can either copy over the source file’s metadata or specify new metadata for the new file that is created. This all occurs without having to download or reupload any data.

This has been one of our most requested features, as it unlocks:

  • Rename/Re-organize. The new capabilities give customers the ability to reorganize their files without having to download and reupload. This is especially helpful when trying to mirror the contents of a file system to B2.
  • Synthetic Backup. With the ability to copy ranges of a file, users can now leverage B2 for synthetic backup, which is uploading a full backup but then only uploading incremental changes (as opposed to reuploading the whole file with every change). This is particularly helpful for uses such as backing up VMs, where reuploading the entirety of the file every time it changes is inefficient.

Where to Learn More About B2 Copy File

The endpoint documentation can be found here:

b2_copy_file:  https://www.backblaze.com/b2/docs/b2_copy_file.html
b2_copy_part:  https://www.backblaze.com/b2/docs/b2_copy_part.html

More About the Beta Program

We’re introducing these endpoints as a beta so that developers can provide us feedback before the endpoints go into production. Specifically, this means that the APIs may evolve as a result of the feedback we get. We encourage you to give Copy File a try and, if you have any comments, you can email our B2 beta team at b2beta@backblaze.com. Thanks!

How We Optimized Storage and Performance of Apache Cassandra at Backblaze
https://www.backblaze.com/blog/wide-partitions-in-apache-cassandra-3-11/ (Thu, 10 Jan 2019)

Conventional wisdom for Apache’s high-performance, scalable database has been to avoid wide partitions because of the risk of impacting database performance and running into storage limitations. Backblaze’s work with a Cassandra consultant shows how version 3.11 can dramatically decrease garbage collection, lower latencies, and require nearly 30% less space when using wide partitions versus version 2.2.

Guest post by Mick Semb Wever

Backblaze uses Apache Cassandra, a high-performance, scalable distributed database, to help manage hundreds of petabytes of data. We engaged the folks at The Last Pickle to use their extensive experience to optimize the capabilities and performance of our Cassandra 3.11 cluster, and now they want to share their experience with a wider audience to explain what they found. We agree; enjoy!

— Andy

Wide Partitions in Apache Cassandra 3.11

by Mick Semb Wever, Consultant, The Last Pickle

Wide partitions in Cassandra can put tremendous pressure on the Java heap and garbage collector, impact read latencies, and can cause issues ranging from load shedding and dropped messages to crashed and downed nodes.

While the theoretical limit on the number of cells per partition has always been two billion cells, the reality has been quite different, as the impacts of heap pressure show. To mitigate these problems, the community has offered a standard recommendation for Cassandra users to keep partitions under 400MB, and preferably under 100MB.

However, in version 3 many improvements were made that affected how Cassandra handles wide partitions. Memtables, caches, and SSTable components were moved off-heap, the storage engine was rewritten in CASSANDRA-8099, and Robert Stupp made a number of other improvements listed under CASSANDRA-11206.

While working with Backblaze and operating a Cassandra version 3.11 cluster, we had the opportunity to test and validate how Cassandra actually handles partitions with this latest version. We will demonstrate that well-designed data models can go beyond the existing 400MB recommendation without nodes crashing from heap pressure.

Below, we walk through how Cassandra writes partitions to disk in 3.11, look at how wide partitions impact read latencies, and then present our testing and verification of wide partition impacts on the cluster using the work we did with Backblaze.

The Art and Science of Writing Wide Partitions to Disk

First we need to understand what a partition is and how Cassandra writes partitions to disk in version 3.11.

Each SSTable contains a set of files, and the -Data.db file contains numerous partitions.

The layout of a partition in the -Data.db file has three components: a header, followed by zero or one static rows, followed by zero or more ordered Clusterable objects. A Clusterable object in this file may be either a row or a RangeTombstone that deletes data, and each wide partition contains many Clusterable objects. For an excellent in-depth examination of this, see Aaron’s blog post Cassandra 3.x Storage Engine.

The -Index.db file stores offsets for the partitions, as well as the serialized IndexInfo objects for each partition. These indices facilitate locating the data on disk within the -Data.db file. Stored partition offsets are represented by a subclass of RowIndexEntry. This subclass is chosen by the ColumnIndex and depends on the size of the partition (a simplified sketch of the selection rule follows the list):

  • RowIndexEntry is used when there are no Clusterable objects in the partition, such as when there is only a static row. In this case there are no IndexInfo objects to store and so the parent RowIndexEntry class is used rather than a subclass.
  • The IndexEntry subclass holds the IndexInfo objects in memory until the partition has finished writing to disk. It is used in partitions where the total serialized size of the IndexInfo objects is less than the column_index_cache_size_in_kb configuration setting (which defaults to 2KB).
  • The ShallowIndexEntry subclass serializes IndexInfo objects to disk as they are created and references these objects using only their position in the file. This is used in partitions where the total serialized size of the IndexInfo objects is more than the column_index_cache_size_in_kb configuration setting.
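To summarize the three cases, here is a simplified Python illustration of the selection rule described above. This is not Cassandra’s actual (Java) code; the 2KB constant is the column_index_cache_size_in_kb default mentioned in the list:

```python
COLUMN_INDEX_CACHE_SIZE_KB = 2  # column_index_cache_size_in_kb default

def row_index_entry_flavor(num_clusterables: int, serialized_index_info_bytes: int) -> str:
    """Simplified illustration of which RowIndexEntry flavor is written for a partition."""
    if num_clusterables == 0:
        # Only a static row: no IndexInfo objects to store, so the plain parent class is used.
        return "RowIndexEntry"
    if serialized_index_info_bytes < COLUMN_INDEX_CACHE_SIZE_KB * 1024:
        # Small index: IndexInfo objects are held in memory until the partition is written.
        return "indexed entry (IndexInfo kept in memory)"
    # Large index: IndexInfo objects are serialized to disk as they are created.
    return "shallow entry (IndexInfo referenced by file position)"
```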

These IndexInfo objects provide a sampling of positional offsets for rows within a partition, creating an index. Each object specifies the offset the page starts at, the first row and the last row.

So, in general, the bigger the partition, the more IndexInfo objects need to be created when writing to disk — and if they are held in memory until the partition is fully written to disk they can cause memory pressure. This is why the column_index_cache_size_in_kb setting was added in Cassandra 3.6 and the objects are now serialized as they are created.

The relationship between partition size and the number of objects was quantified by Robert Stupp in his presentation, Myths of Big Partitions.

IndexInfo numbers from Robert Stupp

How Wide Partitions Impact Read Latencies

Cassandra’s key cache is an optimization that is enabled by default and helps to improve the speed and efficiency of the read path by reducing the amount of disk activity per read.

Each key cache entry is identified by a combination of the keyspace, table name, SSTable, and the partition key. The value of the key cache is a RowIndexEntry or one of its subclasses — either IndexedEntry or the new ShallowIndexedEntry. The size of the key cache is limited by the key_cache_size_in_mb configuration setting.

When a read operation in the storage engine gets a cache hit it avoids having to access the -Summary.db and -Index.db SSTable components, which reduces that read request’s latency. Wide partitions, however, can decrease the efficiency of this key cache optimization because fewer hot partitions will fit into the allocated cache size.

Indeed, before the ShallowIndexedEntry was added in Cassandra version 3.6, a single wide row could fill the key cache, reducing the hit rate efficiency. When applied to multiple rows, this will cause greater churn of additions and evictions of cache entries.

For example, if the IndexEntry for a 512MB partition contains 100K+ IndexInfo objects and if these IndexInfo objects total 1.4MB, then the key cache would only be able to hold 140 entries.
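As a quick back-of-the-envelope check of that example (the key cache size here is an assumption: roughly 200MB is what the quoted 140-entry figure implies, and the actual key_cache_size_in_mb setting may be smaller):

```python
index_info_per_entry_mb = 1.4   # serialized IndexInfo held for one ~512MB partition (from the example)
key_cache_size_mb = 200         # assumed cache size implied by the example's figure
print(int(key_cache_size_mb / index_info_per_entry_mb))  # ~142 entries, in line with the ~140 above
```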

The introduction of ShallowIndexedEntry objects changed how the key cache can hold data. The ShallowIndexedEntry contains a list of file pointers referencing the serialized IndexInfo objects and can binary search through this list, rather than having to deserialize the entire IndexInfo objects list. Thus when the ShallowIndexedEntry is used no IndexInfo objects exist within the key cache. This increases the storage efficiency of the key cache in storing more entries, but does still require that the IndexInfo objects are binary searched and deserialized from the –Index.db file on a cache hit.

In short, on wide partitions a key cache miss still results in two additional disk reads, as it did before Cassandra 3.6, but now a key cache hit incurs a disk read to the -Index.db file where it did not before Cassandra 3.6.

Object Creation and Heap Behavior with Wide Partitions in 2.2.13 vs 3.11.3

Introducing the ShallowIndexedEntry into Cassandra version 3.6 creates a measurable improvement in the performance of wide partitions. To test the effects of this and the other performance enhancement features introduced in version 3 we compared how Cassandra 2.2.13 and 3.11.3 performed when one hundred thousand, one million, or ten million rows were each written to a single partition.

The results and accompanying screenshots help illustrate the impact of object creation and heap behavior when inserting rows into wide partitions. While version 2.2.13 crashed repeatedly during this test, 3.11.3 was able to write over 30 million rows to a single partition before Cassandra Out-of-Memory crashed. The test and results are reproduced below.

Both Cassandra versions were started as single-node clusters with default configurations, excepting heap customization in cassandra-env.sh:

MAX_HEAP_SIZE="1G"
HEAP_NEWSIZE="600M"

In Cassandra, only the configured concurrency of memtable flushes and compactors determines how many partitions are processed by a node, and thus how much pressure is put on its heap, at any one time. Based on this known concurrency limitation, profiling can be done by inserting data into one partition against one Cassandra node with a small heap. These results extrapolate to production environments.

The tlp-stress tool inserted data in three separate profiling passes against both versions of Cassandra, creating wide partitions of one hundred thousand (100K), one million (1M), or ten million (10M) rows.

A tlp-stress profile for wide partitions was written, as no suitable profile existed. The read to write ratio used the default setting of 1:100.

The following command lines then implemented the tlp-stress tool:

# To write 100000 rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 100K

# To write 1M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 1M

# To write 10M rows into one partition
tlp-stress run Wide --replication "{'class':'SimpleStrategy','replication_factor': 1}" -n 10M

Each time tlp-stress executed it was immediately followed by a command to ensure the full count of specified rows passed through the memtable flush and were written to disk:

nodetool flush

The graphs in the sections below, taken from the Apache NetBeans Profiler, illustrate how the ShallowIndexEntry in Cassandra version 3.11 avoids keeping IndexInfo objects in memory.

Notably, the IndexInfo objects are instantiated far more often, but are referenced for much shorter periods of time. The Garbage Collector is more effective at removing short-lived objects, as illustrated by the GC pause times being barely present in the Cassandra 3.11 graphs compared to Cassandra 2.2 where GC pause times overwhelm the JVM.

Wide Partitions in Cassandra 2.2

Benchmarks were against Cassandra 2.2.13

One Partition with 100K Rows (2.2.13)

The following three screenshots show the number of IndexInfo objects instantiated during the write benchmark, the number instantiated during compaction, and a heap profile.

The partition grew to be ~40MB.

Objects created during tlp-stress

screenshot of Cassandra 2.2 objects created during tlp-stress

Objects created during subsequent major compaction

screenshot of Cassandra 2.2 objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

screenshot of Cassandra 2.2 Heap profiled during tlp-stress and major compaction

The above diagrams do not have their x-axis expanded to the full width, but still encompass the startup, stress test, flush, and compaction periods of the benchmark.

When stress testing starts with tlp-stress, the CPU Time and Surviving Generations starts to climb. During this time the heap also starts to increase and decrease more frequently as it fills up and then the Garbage Collector cleans it out. In these diagrams the garbage collection intervals are easy to identify and isolate from one another.

One Partition with 1M Rows (2.2.13)

Here, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Times and the heap profile from the time writes started through when the compaction was completed.

The partition grew to be ~400MB.

Already at this size, the Cassandra JVM is thrashing on garbage collection and has occasionally crashed with Out-of-Memory errors.

Objects created during tlp-stress

screenshot of Cassandra 2.2.13 Objects created during tlp-stress

Objects created during subsequent major compaction

screenshot of Cassandra 2.2.13 Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

screenshot of Cassandra 2.2.13 Heap profiled during tlp-stress and major compaction

The above diagrams display a longer running benchmark, with the quiet period during the startup barely noticeable on the very left-hand side of each diagram. The number of garbage collection intervals and the oscillations in heap size are far more frequent. The GC Pause Time during the stress testing period is now consistently higher and comparable to the CPU Time. It only dissipates when the benchmark performs the flush and compaction.

One Partition with 10M Rows (2.2.13)

In this final test of Cassandra version 2.2.13, the results were difficult to reproduce reliably, as more often than not the test crashed with an Out-of-Memory error from GC heap pressure.

The first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the GC Pause Time and the heap profile from the time writes started until compaction was completed.

The partition grew to be ~4GB.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

screenshot of Cassandra Heap profiled during tlp-stress and major compaction

The above diagrams display consistently very high GC Pause Time compared to CPU Time. Any Cassandra node under this much duress from garbage collection is not healthy. It is suffering from high read latencies, could become blacklisted by other nodes due to its lack of responsiveness, and even crash altogether from Out-of-Memory errors (as it did often during this benchmark).

Wide Partitions in Cassandra 3.11.3

Benchmarks were against Cassandra 3.11.3

In this series, the graphs demonstrate how IndexInfo objects are created either from memtable flushes or from deserialization off disk. The ShallowIndexEntry is used in Cassandra 3.11.3 when deserializing the IndexInfo objects from the -Index.db file.

Neither form of IndexInfo object resides long in the heap, and thus the GC Pause Time is barely visible in comparison to Cassandra 2.2.13, despite the additional number of IndexInfo objects created via deserialization.

One Partition with 100K Rows (3.11.3)

As with the earlier version test of this size, the following two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started through when the compaction was completed.

The partition grew to be ~40MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

screenshot of Cassandra 3.11.3 objects created during tlp-stress

Objects created during subsequent major compaction

screenshot of Cassandra 3.11.3 objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

screenshot of Cassandra 3.11.3 Heap profiled during tlp-stress and major compaction

The diagrams above are roughly comparable to the first diagrams presented under Cassandra 2.2.13, except here the x-axis is expanded to full width. Note there are significantly more instantiated IndexInfo objects, but barely any noticeable GC Pause Time.

One Partition with 1M Rows (3.11.3)

Again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started until the compaction was completed.

The partition grew to be ~400MB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

The above diagrams show a wildly oscillating heap as many IndexInfo objects are created, and many garbage collection intervals, yet the GC Pause Time remains low, if noticeable at all.

One Partition with 10M Rows (3.11.3)

Here again, the first two screenshots show the number of IndexInfo objects instantiated during the write benchmark and during the subsequent compaction process. The third screenshot shows the CPU & GC Pause Time and the heap profile from the time writes started until the compaction was completed.

The partition grew to be ~4GB, the same as with Cassandra 2.2.13.

Objects created during tlp-stress

Objects created during subsequent major compaction

Heap profiled during tlp-stress and major compaction

Unlike the corresponding profile in 2.2.13, the cluster remains stable, just as it was when running 1M rows per partition. The above diagrams display an oscillating heap as IndexInfo objects are created, and many garbage collection intervals, yet the GC Pause Time remains low, if noticeable at all.

Maximum Rows in 1GB Heap (3.11.3)

In an attempt to push Cassandra 3.11.3 to the limit, we ran a test to see how much data could be written to a single partition before Cassandra Out-of-Memory crashed.

The result was 30M+ rows, which is ~12GB of data on disk.

This is similar to the limit of 17GB of data written to a single partition as Robert Stupp found in CASSANDRA-9754 when using a 5GB Java heap.

screenshot of Cassandra 3.11.3 memory usage

What About Reads?

The following graph reruns the benchmark on Cassandra version 3.11.3 over a longer period of time with a read to write ratio of 10:1. It illustrates that reads of wide partitions do not create the heap pressure that writes do.

screenshot of Cassandra 3.11.3 read functions

Conclusion

While the 400MB community recommendation for partition size is clearly appropriate for version 2.2.13, version 3.11.3 shows that performance improvements have created a tremendous ability to handle wide partitions: they can easily be an order of magnitude larger than in earlier versions of Cassandra without nodes crashing from heap pressure.

The trade-off for better supporting wide partitions in Cassandra 3.11.3 is increased read latency, as row offsets now need to be read off disk. However, modern SSDs and kernel page caches, taking advantage of larger amounts of physical memory, provide enough I/O improvement to compensate for the read latency trade-off.

The improved stability, and the ability to fall back on better hardware to deal with the read latency issue, allows Cassandra operators to worry less about how to store massive amounts of data in different schemas and about unexpected data growth patterns on those schemas.

Looking ahead, the custom B+ tree structures from CASSANDRA-9754 will be used to more effectively look up the deserialized row offsets and to further avoid the deserialization and instantiation of short-lived, unused IndexInfo objects.


Mick Semb Wever designs, builds, and evangelizes distributed systems, from data-driven backends using Cassandra, Hadoop, and Spark, to enterprise microservices platforms.

Backblaze and Cloudflare Partner to Provide Free Data Transfer
https://www.backblaze.com/blog/backblaze-and-cloudflare-partner-to-provide-free-data-transfer/ (Wed, 26 Sep 2018)

Today we are announcing that beginning immediately, Backblaze B2 customers will be able to download data stored in B2 to Cloudflare for zero transfer fees. This means that customers can save up to 75% on storage versus Amazon S3 to store their content in the cloud and deliver it worldwide.

 Backblaze B2 Free Data Transfer to Cloudflare

Today we are announcing that beginning immediately, Backblaze B2 customers will be able to download data stored in B2 to Cloudflare for zero transfer fees. This happens automatically once Cloudflare is configured to distribute your B2 files. This means that Backblaze B2 can now be used as an origin store for the Cloudflare CDN and edge network, providing customers enhanced performance and access to their content stored on B2. The result is that customers can save up to 75% on storage versus Amazon S3 to store their content in the cloud and deliver it worldwide.

The zero B2 transfer fees are available to all Cloudflare customers using any plan. Cloudflare customers can also use paid add-ons such as Argo and Workers to enhance the routing and security of the B2 files being delivered over the Cloudflare CDN. To implement this service, Backblaze and Cloudflare have directly connected, thereby allowing near-instant data transfers from B2 to Cloudflare.

Backblaze has prepared a guide on “Using Backblaze B2 storage with Cloudflare.” This guide provides step-by-step instructions on how to set up Backblaze B2 with Cloudflare to take advantage of this program.

The Bandwidth Alliance

The driving force behind the free transfer program is the Bandwidth Alliance. Backblaze and Cloudflare are two of the founding members of this group of forward-thinking cloud and networking companies that are committed to providing the best and most cost-efficient experience for our mutual customers. Additional founding members of the Bandwidth Alliance include Automattic (WordPress), DigitalOcean, IBM Cloud, Microsoft Azure, Packet, and other leading cloud and networking companies.

How Companies Can Leverage the Bandwidth Alliance

Below are examples of how Bandwidth Alliance partners can work together to save customers on their data transfer fees.

Hosting Website Assets

Whether you are a professional webmaster or just run a few homegrown sites, you’ve lived the frustration of having a slow website. Over the past few years these challenges have become more acute as video and other types of rich media have become core to the website experience. This new content has also translated to higher storage and bandwidth costs. That’s where Backblaze B2 and Cloudflare come in.

diagram of zero cost data transfer from Backblaze B2 to Cloudflare CDN

Customers can store their videos, photos, and other assets in Backblaze B2’s pay-as-you-go cloud storage and serve the site with Cloudflare’s CDN and edge services. The result is an amazingly affordable cloud-based solution that dramatically improves web site performance and reliability. And customers pay each service for what they do best.

Note that Cloudflare’s subscription agreement for self-serve accounts limits serving non-HTML content, including “video or a disproportionate percentage of pictures, audio files, or other non-HTML content.” If you are planning to serve media files from Backblaze B2 via Cloudflare, you should review your subscription agreement with Cloudflare and ensure that your Cloudflare subscription matches your plans.

“I am extremely happy with my experience serving html/css/js and over 17 million images from B2 via Cloudflare Workers. Page load time has been great and costs are minimal.”

— Jacob Hands, Lead Developer, FactorioMaps.com

Media Content Distribution

The ability to download content from B2 cloud storage to the Cloudflare CDN for zero transfer cost is just the beginning. A company needing to distribute media can now store original assets in Backblaze B2, send them to a compute service to transcode and transmux them, and forward the finished assets to be served up by Cloudflare. Backblaze and Packet previously announced zero transfer fees between Backblaze B2 storage and Packet compute services. This enabled customers to store data in B2 at 1/4th the price of competitive offerings and then process data for transcoding, AI, data analysis, and more inside of Packet without worrying about data transfer fees. Packet is also a member of the Bandwidth Alliance and will deliver content to Cloudflare for zero transfer fees as well.

diagram of zero cost data transfer flow from Backblaze B2 to Packet Compute to Cloudflare CDN

Process Now, Distribute Later

A variation of the example above is for a company to store the originals in B2, transcode and transmux the files in Packet, then put those versions back into B2, and finally serve them up via Cloudflare. All of this is done with zero transfer fees between Backblaze, Packet, and Cloudflare. The result is that all originals and transmuxed versions are stored at 1/4th the price of other storage, and served up efficiently via Cloudflare.

diagram of data transfer flow between B2 to Packet back to B2 to Cloudflare

In all cases you would only pay for the services you use and not for the cost to move data between those services. This results in a predictable and affordable cost for a given project using industry-leading, best-of-breed services.

Moving Forward

The members of the Bandwidth Alliance are committed to enabling the best and most cost-efficient cloud services when it comes to working with data stored in the cloud. Backblaze has committed to a transfer fee of $0 to move content from B2 to either Cloudflare or Packet. We think that’s a great step in the right direction. And if you are a cloud provider, let us know if you’d be interested in taking a step like this one with Backblaze.

Backblaze B2 API Version 2 Beta is Now Open
https://www.backblaze.com/blog/backblaze-b2-api-version-2-beta-is-now-open/ (Fri, 14 Sep 2018)

Version 2 of the Backblaze B2 API brings a number of enhancements. While existing applications using version 1 of the API will continue to function as before with no changes required, we encourage developers to try out the version 2 beta and submit comments to us.


Since B2 cloud storage was introduced nearly 3 years ago, we’ve been adding enhancements and new functionality to the B2 API, including capabilities like CORS support and lifecycle rules. Today, we’d like to introduce the beta of version 2 of the B2 API, which formalizes rules on application keys, provides a consistent structure for all API calls returning information about files, and cleans up outdated request parameters and returned data. All version 1 B2 API calls will continue to work as is, so no changes are required to existing integrations and applications.

The API Versions section of the B2 documentation on the Backblaze website provides the details on how the V1 and V2 APIs differ, but in the meantime here’s an overview into the what, why, and how of the V2 API.

What Has Changed Between the B2 Cloud Storage Version 1 and Version 2 APIs?

The most obvious difference between a V1 and V2 API call is the version number in the URL. For example:

https://apiNNN.backblazeb2.com/b2api/v1/b2_create_bucket

https://apiNNN.backblazeb2.com/b2api/v2/b2_create_bucket

In addition, the V2 API call may have different required request parameters and/or required response data. For example, the V2 version of b2_hide_file always returns accountId and bucketId, while V1 returns only accountId.

The documentation for each API call will show whether there are any differences between API versions for a given API call.
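Since the version is just a path segment, an integration can centralize the choice in one place. Here is a trivial, illustrative Python helper (the function name is ours, not part of the B2 SDKs):

```python
def b2_api_url(api_base: str, call_name: str, version: str = "v2") -> str:
    """Build a B2 API URL; api_base is the apiUrl value returned by b2_authorize_account."""
    return f"{api_base}/b2api/{version}/{call_name}"

# b2_api_url("https://api002.backblazeb2.com", "b2_create_bucket")        -> .../b2api/v2/b2_create_bucket
# b2_api_url("https://api002.backblazeb2.com", "b2_create_bucket", "v1")  -> .../b2api/v1/b2_create_bucket
```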

No Change is Required For V1 Applications

With the introduction of V2 of the B2 API there will be V1 and V2 versions for every B2 API call. All applications using V1 API calls will continue to work with no change in behavior. In some cases, a given V2 API call will be different from its companion V1 API call as noted in the B2 API documentation. For the remaining API calls a given V1 API call and its companion V2 call will be the same, have identical parameters, return the same data, and have the same errors. This provides a B2 developer the flexibility to choose how to upgrade to the V2 API.

Obviously, if you want to use the functionality associated with a V2 API version, then you must use the V2 API call and update your code accordingly.

One last thing: beginning today, if we create a new B2 API call it will be created in the current API version (V2) and most likely will not be created in V1.

Standardizing B2 File Related API Calls

As requested by many B2 developers, the V2 API now uses a consistent structure for all API calls returning information about files. To enable this, some V2 API calls return additional fields so that every file-related response carries the same structure.

Restricted Application Keys

In August we introduced the ability to create restricted application keys using the B2 API. This capability allows an account owner to restrict who, how, and when the data in a given bucket can be accessed. This changed the functionality of multiple B2 API calls such that a user could create a restricted application key that could break a 3rd party integration with Backblaze B2. We subsequently updated the affected V1 API calls so they could continue to work with the existing 3rd party integrations.

The V2 API fully implements the expected behavior when it comes to working with restricted application keys. The V1 API calls continue to operate as before.

Here is an example of how the V1 API and the V2 API will act differently as it relates to restricted application keys.

Set-up

  • The B2 account owner has created 2 public buckets, “Backblaze_123” and “Backblaze_456”
  • The account owner creates a restricted application key that allows the user to read the files in the bucket named “Backblaze_456”
  • The account owner uses the restricted application key in an application that uses the b2_list_buckets API call

In Version 1 of the B2 API

  • Action: The account owner uses the restricted application key (for bucket Backblaze_456) to access/list all the buckets they own (2 public buckets).
  • Result: The results returned are just for Backblaze_456, as the restricted application key is just for that bucket. Data about other buckets is not returned.

While this result may seem appropriate, the data returned did not match the question asked, i.e. list all buckets. V2 of the API ensures the data returned is responsive to the question asked.

In Version 2 of the B2 API

  • Action: The account owner uses the restricted application key (for bucket Backblaze_456) to access/list all the buckets they own (2 public buckets).
  • Result: A “401 unauthorized” error is returned, as the request for access to all buckets does not match the scope of the restricted application key, which is limited to bucket Backblaze_456. To achieve the desired result, the account owner can specify the bucket that matches the restricted application key in the API call, as in the sketch below.
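A hedged sketch of that V2 call in Python (requests), with placeholder values; in practice api_url, restricted_auth_token, and account_id come from calling b2_authorize_account with the restricted key:

```python
import requests

# Placeholders for values returned by b2_authorize_account with the restricted key.
api_url = "https://api002.backblazeb2.com"
restricted_auth_token = "token-scoped-to-Backblaze_456"
account_id = "account-id"

# Asking V2 for *all* buckets with this key returns 401; naming the permitted bucket succeeds.
resp = requests.post(
    f"{api_url}/b2api/v2/b2_list_buckets",
    headers={"Authorization": restricted_auth_token},
    json={"accountId": account_id, "bucketName": "Backblaze_456"},
)
print(resp.status_code)  # 200 here; omitting bucketName/bucketId would yield 401 unauthorized
```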

Cleaning up the API

There are a handful of API calls in V2 where we dropped fields that were deprecated in V1 of the B2 API but were still being returned. So, in V2:

  • b2_authorize_account: The response no longer contains minimumPartSize. Use partSize and absoluteMinimumPartSize instead.
  • b2_list_file_names: The response no longer contains size. Use contentLength instead (see the snippet after this list).
  • b2_list_file_versions: The response no longer contains size. Use contentLength instead.
  • b2_hide_file: The response no longer contains size. Use contentLength instead.
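For example, a V2 client sums file sizes from b2_list_file_names like this (a sketch with placeholder credentials; under V1 the same code would have read the deprecated size field):

```python
import requests

# Placeholders: api_url/auth_token from b2_authorize_account, bucket_id from b2_list_buckets.
api_url, auth_token, bucket_id = "https://api002.backblazeb2.com", "auth-token", "bucket-id"

files = requests.post(
    f"{api_url}/b2api/v2/b2_list_file_names",
    headers={"Authorization": auth_token},
    json={"bucketId": bucket_id},
).json()["files"]
total_bytes = sum(f["contentLength"] for f in files)  # was f["size"] under V1
```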

Support for Version 1 of the B2 API

As noted previously, V1 of the B2 API continues to function. There are no plans to stop supporting V1. If at some point in the future we do deprecate the V1 API, we will provide advance notice of at least one year before doing so.

The B2 Java SDK and the B2 Command Line Tool

Both the B2 Java SDK and the B2 Command Line Tool do not currently support Version 2 of the B2 API. They are being updated and will support the V2 API at the time the V2 API exits beta and goes GA. Both of these tools, and more, can be found in the Backblaze GitHub repository.

More About the Version 2 Beta Program

We introduced Version 2 of the B2 API as beta so that developers can provide us feedback before V2 goes into production. With every B2 integration being coded differently, we want to hear from as many developers as possible. Give the V2 API a try and if you have any comments you can email our B2 beta team at b2beta@backblaze.com or contact Backblaze B2 support. Thanks.

How to Leverage Your Amazon S3 Experience to Code the Backblaze B2 API
https://www.backblaze.com/blog/how-to-code-backblaze-b2-api-interface/ (Thu, 30 Aug 2018)

In a guest post, John Matze, founder of BridgeSTOR, outlines the differences for application developers between the S3 and B2 APIs, and concludes that adding B2 support is not a major job.

Going from S3 to learning Backblaze B2

We wrote recently about how the Backblaze B2 and Amazon S3 APIs are different. What we neglected to mention was how to bridge those differences so a developer can create a B2 interface if they’ve already coded one for S3. John Matze, Founder of BridgeSTOR, put together his list of things to consider when leveraging your S3 API experience to create a B2 interface. Thanks John.   — Andy
Backblaze B2 to Amazon S3 Conversion
by John Matze, Founder of BridgeSTOR

Backblaze B2 Cloud Storage Platform has developed into a real alternative to the Amazon S3 online storage platform with the same redundancy capabilities but at a fraction of the cost.

Sounds great — sign up today!

Wait. If you’re an application developer, it doesn’t come free. The Backblaze REST API is not compatible with the Amazon S3 REST API. That is the bad news. The good news: it includes almost the entire set of functionality, so converting from S3 to B2 can be done with minimal work once you understand the differences between the two platforms. We created an S3 to B2 shim in a week, followed by a few extra weeks of testing and bug fixes.

This article will help you shortcut the process by describing the differences between B2 and S3.

  1. Endpoints: AWS has a standard endpoint of s3.amazonaws.com which redirects to the region where the bucket is located, or you may send requests directly to the bucket by a region endpoint. B2 does not have regions, but does have an initial endpoint called api.backblazeb2.com. Every application must start by talking to this endpoint. B2 also requires three other endpoints: one for uploading an object, one for API calls, and a third for downloading an object. The upload endpoint is generated on demand when uploading an object, while the other two are returned during the authentication process and may be saved for API and download requests.
  2. Host: Unlike Amazon S3, the HTTP header requires the host token. If it is not present, B2 will respond with an error.
  3. JSON: Unlike S3, which uses XML, all B2 calls use JSON. Some API calls require data to be sent on the request. This data must be in JSON, and all APIs return JSON as a result. Fortunately, the amount of JSON required is minimal or none at all. We just built a JSON request when required and made a simple JSON parser for returned data.
  4. Authentication: Amazon currently has two major authentication mechanisms with complicated hashing formulas. B2 simply uses the industry standard “HTTP basic auth” algorithm. It takes only a few minutes to get up to speed on this algorithm (a sketch appears after this list).
  5. Keys: Amazon has the concept of an access key and a secret key. B2 has the equivalent, with the access key being your key ID (an application key ID or account ID) and the secret key being the application key (returned from the website or returned by the b2_create_key API) that maps to the secret key.
  6. Bucket ID: Unlike S3, almost every B2 API requires a bucket ID. There is a special list bucket call that will display bucket IDs by bucket name. Once you find your bucket name, capture the bucket ID and save it for future API calls.
  7. Directory Listings: B2 directories again have the same functionality as S3, but with a different API format. Again the mapping is easy: marker is startFileName, prefix is prefix, max-keys is maxFileCount and delimiter is delimiter. In both cases, the marker tells the service where to start the next search. The Amazon S3 nextmarker is literally the next marker to be searched. The B2 nextFileName is also the next place to search; it’s the last file name that was returned with a trailing space character. In both systems, when paging through the files in a bucket, you take the “marker” or “nextFileName” from one request and pass it to the next. You’ll get a list of all files with no repeats (a paging sketch appears after this list).
  8. Uploading an object: Uploading an object in B2 is quite different than S3. S3 just requires you to send the object to an endpoint and Amazon will automatically place the object somewhere in their environment. In the B2 world, you must request a location for the object with an API call and then send the object to the returned location. The first API will send you a temporary upload token, and you can continue to use this token for one day without generating another, with the caveat that you have to monitor for failures from B2. The B2 environment may become full, or some other issue may require you to request another upload token (an upload sketch appears after this list).
  9. Downloading an Object: Downloading an object in B2 is really easy. There is a download endpoint that is returned during the authentication process and you pass your request to that endpoint. The object is downloaded just like Amazon S3.
  10. Multipart Upload: Finally, multipart upload. The beast in S3 is just as much of a beast in B2. Again the good news is there is a one to one mapping.
    a. Multipart Init: The equivalent initialization returns a fileId. This ID will be used for future calls.
    b. Multipart Upload: Similar to uploading an object, you will need to get the API location to place the part. So use the fileId from “a” above and call B2 for the endpoint to place the part. Backblaze recommends that the upload payload be hashed with the SHA1 algorithm. Once done, simply pass the SHA1 and the part number to the URL and the part is uploaded. This SHA1 component is equivalent to an etag in the S3 world, so save it for later.
    c. Multipart Complete: Like S3, you will have to build a return structure for each part. B2 of course requires this structure to be in JSON, but like S3, B2 requires the part number and the SHA1 (etag) for each part.
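As a rough illustration of items 4 and 8 above, here is a hedged Python sketch (using the requests library and the current v2 endpoint paths, with placeholder credentials, bucket ID, and file name) that authenticates with HTTP basic auth, asks B2 for an upload location, and uploads a single object:

```python
import hashlib
import requests

KEY_ID, APP_KEY = "your-key-id", "your-application-key"   # placeholders

# Item 4: industry-standard HTTP basic auth against the initial endpoint.
auth = requests.get(
    "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
    auth=(KEY_ID, APP_KEY),
).json()

# Item 8: request a location (upload URL + temporary token) for the object...
upload = requests.post(
    f"{auth['apiUrl']}/b2api/v2/b2_get_upload_url",
    headers={"Authorization": auth["authorizationToken"]},
    json={"bucketId": "placeholder-bucket-id"},
).json()

# ...then send the object to the returned location.
data = open("hello.txt", "rb").read()
requests.post(
    upload["uploadUrl"],
    headers={
        "Authorization": upload["authorizationToken"],
        "X-Bz-File-Name": "hello.txt",                     # URL-encode names with special characters
        "Content-Type": "b2/x-auto",
        "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
    },
    data=data,
).raise_for_status()
```

And a sketch of the directory-listing pagination from item 7, reusing the auth response from the sketch above (the bucket ID is again a placeholder):

```python
start = None
while True:
    body = {"bucketId": "placeholder-bucket-id", "maxFileCount": 1000}
    if start:
        body["startFileName"] = start
    page = requests.post(
        f"{auth['apiUrl']}/b2api/v2/b2_list_file_names",
        headers={"Authorization": auth["authorizationToken"]},
        json=body,
    ).json()
    for f in page["files"]:
        print(f["fileName"])
    start = page["nextFileName"]   # B2's equivalent of S3's next marker
    if start is None:              # no more pages
        break
```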

What Doesn’t Port

We found almost everything we required easily mapped from S3 to B2 except for a few issues. To be fair, Backblaze is working on the following in future versions.

  1. Copy Object doesn’t exist: This could cause some issues with applications for copying or renaming objects. BridgeSTOR has a workaround for this situation, so it wasn’t a big deal for our application.

  2. Directory Objects don’t exist: Unlike Amazon, where an object whose name ends with a “/” is considered a directory, this does not port to B2. There is an undocumented object name that B2 applications use called .bzEmpty. Numerous 3rd party applications, including BridgeSTOR, treat an object ending with .bzEmpty as a directory name. This is also important for the directory listings described above. If you choose to use this method, you will be required to replace the “.bzEmpty” with a “/”.

In conclusion, you can see that the B2 API is different from the Amazon S3 API, but as far as functionality goes they are basically the same. At first it looked like it was going to be a large task for us, but once we took the time to understand the differences, porting to B2 was not a major job for our application. I hope this document helps in your S3 to B2 conversion.

— John Matze, BridgeSTOR

 

9-4-2018 (AK) — Cleaned up the language in sections 1, 7, and 8 to reflect the most current operation of B2.

The B2 Developers’ Community
https://www.backblaze.com/blog/object-storage-developer-community/ (Fri, 17 Aug 2018)

It’s been three years since the launch of Backblaze B2 Cloud Storage and the B2 developer community is strong and growing. In this post we survey what developers are doing with B2 and some of the many tools and resources available to developers.

Developers at Work Using Object Storage

When we launched B2 Cloud Storage in September of 2015, we were hoping that the low cost, reliability, and openness of B2 would result in developers integrating B2 object storage into their own applications and platforms.

Since that launch, we’ve continually strengthened and encouraged the development of tools and resources for the B2 developer community. These resources include APIs, a Command-Line tool, a Java SDK, and code examples for Swift and C++. Backblaze recently added application keys for B2, which enable developers to restrict access to B2 data and control how an application interacts with that data. Those who use Amazon APIs can interact with B2 using Minio as an Amazon S3 gateway.

An Active B2 Developer Community

It’s three years later and we are happy to see that an active developer community has sprung up around B2. Just a quick look at GitHub shows over 250 repositories for B2 code with projects in ten different languages that range from C# to Go to Ruby to Elixir. A recent discussion on Hacker News about a B2 Python Library resulted in 225 comments.

B2 coding languages - Java, Ruby, C#, Shell, PHP, R, JavaScript, C++, Elixir, Go, Python, Swift

What’s Happening in the B2 Developer Community?

We believe that the two major reasons for the developer activity supporting B2 are: 1) user demand for inexpensive and reliable storage, and 2) the ease of implementation of the B2 API. We discussed the B2 API design decisions in a recent blog post.

Sharing and transparency have been cornerstone values for Backblaze since our founding, and we believe openness and transparency breed trust and further innovation in the community. Since we ask customers to trust us with their data, we want our actions to show why we are worthy of that trust.

Here are Just Some of the Many B2 Projects Currently Underway

We’re excited about all the developer activity and all of the fresh and creative ways you are using Backblaze B2 storage. We want everyone to know about these developer projects so we’re spotlighting some of the exciting work that is being done to integrate and extend B2.

Rclone (Go) — In addition to being an open-source command line program to sync files and directories to and from cloud storage systems, Rclone is being used in conjunction with other applications such as restic. See Rclone on GitHub, as well.

Duplicacy (Go) — Duplicacy is a cross-platform cloud backup tool based on the idea of lock-free deduplication.

CORS (General web development) — Backblaze supports CORS for efficient cross-site media serving. CORS allows developers to store large or infrequently accessed files on B2 storage, and then refer to and serve them securely from another website without having to re-download the asset.

b2blaze (Python) — The b2blaze Python library for B2.

PHP SDK (PHP) — A standard PHP SDK for Backblaze B2 cloud files & storage system.

BackBlaze PHP Flysystem Adapter (PHP) — The Backblaze adapter enables the use of the Flysystem filesystem abstraction library with Backblaze. It uses the Backblaze B2 SDK to communicate with the API.

Laravel Backblaze Adapter (PHP) — The Laravel Backblaze B2 Storage Service Provider uses the Backblaze B2 SDK & Flysystem Adapter to communicate with the Backblaze B2 API.

Wal-E (Postgres) — Continuous archiving to Backblaze for your Postgres databases.

Phoenix (Elixir) — File upload utility for the Phoenix web dev framework.

ZFS Backup (Go) — Backup tool to move your ZFS snapshots to B2.

Django Storage (Python) — B2 storage for the Python Django web development framework.

Arq Backup (Mac and Windows application) — Arq Backup is an example of a single developer, Stefan Reitshamer, creating and supporting a successful and well-regarded application for cloud backup. Stefan also is known for being responsive to his users.

Go Client & Libraries (Go) — Go is a popular language that is being used for a number of projects that support B2, including restic, Minio, and Rclone.

Ruby fog for B2 (Ruby) — Integration library for gem fog and B2 Cloud Storage.

How to Get Involved as a B2 Developer

If you’re considering developing for B2, we encourage you to give it a try. It’s easy to implement and your application and users will benefit from dependable and economical cloud storage.

Start by checking out the B2 documentation and resources on our website. GitHub and other code repositories are also great places to look. If you follow discussions on Reddit, you could learn of projects in the works and maybe find users looking for solutions.

We’ve written a number of blog posts highlighting the integrations for B2. You can find those by searching for a specific integration on our blog or under the tag B2. Posts for developers are tagged developer.


If you have a B2 integration that you believe will appeal to a significant audience, you should consider submitting it to us. Those that pass our review are listed on the B2 Integrations page on our website. We’re adding more each week. When you’re ready, just review the B2 Integration Checklist and submit your application. We’re looking forward to showcasing your work!

Now’s a good time to join the B2 developers’ community. Jump on in — the water’s great!

P.S. We want to highlight and promote more developers working with B2. If you have a B2 integration or project that we haven’t mentioned in this post, please tell us what you’re working on in the comments.

What To Do When You Get a B2 503 (or 500) Server Error
https://www.backblaze.com/blog/b2-503-500-server-error/ (Thu, 16 Aug 2018)

If you get a 503 or 500 server error it does not mean that the B2 service is down. It simply means that B2 is functioning as designed. Let’s look at the process.

Just try again — it’s free, easy, and will work.

Seriously, that’s it. Occasionally, I’ll see questions that amount to, “I’m getting a 503 error; does that mean B2 is down?” To address that question, I wanted to use today’s post to go into a bit more detail on how to handle a 500 or 503 error. The short answer is: no, B2 is not down. It simply means that B2 is functioning as designed as the most affordable, easy to use cloud storage service on the planet.

As we’ve described in our developer docs, the best decision is to write your integration in a way that it retries in the event of a 500 or 503. This modest amount of upfront work will result in a stable and transparent long term experience.

The Backblaze Contract Architecture

To understand the vast majority of B2 500 and 503 errors, it’s helpful to go into the “contract architecture” for B2. To create a service that is fully scalable at incredibly low cost, Backblaze has had to innovate in a number of areas. One way is what we refer to as “contract architecture.” It’s the approach that let us cut a large expense in traditional cloud storage infrastructure — high bandwidth load balancers for uploads.

Here’s how it works: when a client wants to push data to Backblaze, it contacts a “dispatching server.” That dispatching server figures out where that data will ultimately live inside a given Backblaze data center.

The dispatching server tells the client, “there is space over on vault-9015.”

Armed with that information (and an auth token), the client ends its connection with the dispatching server and creates a brand new request directly to vault-9015. The “contract” concept is not novel: ultimately, all APIs are contracts between two entities (machines). In the B2 case, our design leverages that insight as the client and vault negotiate how they will work together. In this example, once authenticated, the client continues to transmit to vault-9015 until it’s done or the vault fills up (or happens to go offline). In those instances, all the client has to do is return to the dispatching server to get information for the next available vault. This is a relatively trivial step and can be easily handled at the software level.

What Causes a B2 500 or 503 Error Response?

The client knows when to go back to the dispatching server because it receives (wait for it) a 500 or 503 error from vault-9015. The system is designed to send a firm message that says, in effect, “stop uploading to vault-9015.” We documented the specifics of what happens where in the B2 error handling protocols. The bottom line is that an error in the 500 block should be interpreted by the client as the signal to GO BACK to the dispatching server and ask for a new vault for uploads. Rinse and repeat. It’s a free process that causes negligible incremental overhead and no charges to the customer. Unlike other services, B2 uploads and upload transactions are free.

What if, after getting a 503 and asking the dispatch server for a new URL, you try to upload and get ANOTHER 503 from the new vault? To address this unusual case, write your software to pause for a few seconds, then go back to the dispatch server. In this scenario, you’ve hit a statistically unusual situation: the client was told to go to a vault with very little space left, and somebody else got there first and filled up that space. The second 503 is a sign the system is functioning as designed. Your program can elegantly handle it by going back to the dispatch server.
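
Sticking with the hypothetical helpers from the sketch earlier in this post (get_upload_target and upload_once), handling that rare back-to-back 503 only takes a couple of extra lines. The pause length and attempt count below are arbitrary choices for illustration, not values the B2 docs prescribe.

```python
import time

def upload_with_patient_retry(api_url, account_auth_token, bucket_id, file_name,
                              data, max_attempts=5, pause_seconds=3):
    """Same retry loop as above, but pause briefly before re-asking the
    dispatching server once a vault has already turned us away."""
    for attempt in range(max_attempts):
        if attempt > 0:
            time.sleep(pause_seconds)  # rare back-to-back 503: wait a few seconds first
        target = get_upload_target(api_url, account_auth_token, bucket_id)
        resp = upload_once(target, file_name, data)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code not in (500, 503):
            resp.raise_for_status()    # non-5xx problems are real errors, not "try again"
    raise RuntimeError("Several vaults in a row were full or busy; try again later")
```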

Other services, notably Amazon S3, provide the client with a “well known URL.” The client can merrily push data to that URL, and Amazon handles load balancing and finding open storage space after receiving the data. That’s a totally valid approach, but it is objectively more expensive as it involves high-bandwidth load balancers. There are other interesting implications to the load balancing scenario. If you’re interested, I wrote a blog post on the difference between the two approaches.

As I discussed in that post, the contract architecture does introduce some complexity when the client has to go back to the dispatching server. But, for that modest amount of error handling upfront, we help fuel Backblaze B2 as an infinitely scalable, fully sustainable service that has been, and will continue to be, the affordability leader in the object storage market.

The post What To Do When You Get a B2 503 (or 500) Server Error appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Design Thinking: B2 APIs (& The Hidden Costs of S3 Compatibility) https://www.backblaze.com/blog/design-thinking-b2-apis-the-hidden-costs-of-s3-compatibility/ https://www.backblaze.com/blog/design-thinking-b2-apis-the-hidden-costs-of-s3-compatibility/#comments Wed, 08 Aug 2018 15:57:21 +0000 https://www.backblaze.com/blog/?p=84556 We're often asked why Backblaze didn't implement S3-compatible APIs for our B2 Cloud Storage. Here are the reasons why we believe our approach has advantages over Amazon's APIs and enables us to offer our cloud services at lower cost than our competitors.

API Functions - Authorize, Download, List, Upload

When we get asked, “why did Backblaze make its own set of APIs for B2,” the question behind the question is most often “why didn’t Backblaze just implement an S3-compatible interface for B2?”

Both are totally reasonable questions to ask. The quick answer to either question? So our customers and partners can move faster, while simultaneously enabling Backblaze to sustainably offer a cloud storage service that is ¼ of the price of S3.

But, before we jump all the way to the end, let me step you through our thinking.

The Four Major Functions of Cloud Storage APIs

Throughout cloud storage, S3 or otherwise, APIs are mainly meant to provide access to four major underlying functions:

  • Authorization — Providing account/bucket/file access
  • Upload — Sending files to the cloud
  • Download — Data retrieval
  • List — Data checking/selection/comparison
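
For reference, here is how those four functions line up with calls in B2’s native API; this is a non-exhaustive mapping I’ve pulled from the B2 documentation, not an official taxonomy.

```python
# The four functions above, expressed as B2 native API calls (non-exhaustive).
B2_NATIVE_CALLS = {
    "Authorization": ["b2_authorize_account"],
    "Upload":        ["b2_get_upload_url", "b2_upload_file",
                      "b2_start_large_file", "b2_upload_part", "b2_finish_large_file"],
    "Download":      ["b2_download_file_by_name", "b2_download_file_by_id"],
    "List":          ["b2_list_buckets", "b2_list_file_names", "b2_list_file_versions"],
}
```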

The comparison between B2 and S3 on the List and Download functions is, candidly, not that interesting. Fundamentally, we ended up with similar approaches when solving those challenges. If the detail is of interest, I’m happy to get into that in a later post or answer questions in the comments below.

Backblaze and Amazon did take different approaches to how each service handles Authorization. The 5-step approach for S3 is well outlined here. B2’s architecture enables secure authorization in just 2 steps. My assertion is that a 2-step architecture is ~60% simpler than a 5-step approach. To understand what we’re doing, I’d like to introduce the concept of Backblaze’s “Contract Architecture.”
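
Here is a minimal sketch of those two steps in Python using the requests library and the documented b2_authorize_account and b2_list_buckets calls; the v2 endpoint path and the absence of real error handling are simplifications on my part.

```python
import requests

def authorize_account(key_id, application_key):
    """Step 1: a single HTTP Basic-auth call to b2_authorize_account."""
    resp = requests.get(
        "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
        auth=(key_id, application_key),
    )
    resp.raise_for_status()
    return resp.json()  # includes "accountId", "apiUrl", and "authorizationToken"

def list_buckets(auth):
    """Step 2: every subsequent call just presents the token from step 1."""
    resp = requests.post(
        f"{auth['apiUrl']}/b2api/v2/b2_list_buckets",
        headers={"Authorization": auth["authorizationToken"]},
        json={"accountId": auth["accountId"]},
    )
    resp.raise_for_status()
    return resp.json()["buckets"]
```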

The easiest way to understand B2’s Contract Architecture is to take a deep dive into how we handle the Upload process.

Uploads (Load Balancing vs Contract Architecture)

The interface to upload data into Amazon S3 is actually a bit simpler than Backblaze B2’s API. But it comes at a literal cost. It requires Amazon to have a massive and expensive choke point in their network: load balancers. When a customer tries to upload to S3, she is given a single upload URL to use. For instance, http://s3.amazonaws.com/<bucketname>. This is great for the customer, as she can just start pushing data to the URL. But it requires Amazon to take that data and then, in a second step behind the scenes, find available storage space and push the data to that location. The second step creates a choke point because it requires high-bandwidth load balancers. That, in turn, carries a significant customer implication: load balancers cost significant money.

When we were creating the B2 APIs, we faced a dilemma — do we go a simple but expensive route like S3? Or is there a way to remove significant cost even if it means introducing some slight complexity? We understood that there are perfectly great reasons to go either way — and there are customers at either end of this decision tree.

We realized the expense savings could be significant; we know load balancers well. We use them for our Download capabilities. They are expensive, so, to run a sustainable service, we charge for downloads. The fact that B2 download pricing is 1¢/GB while Amazon S3 starts at 9¢/GB is a subject we covered in a prior blog post.

Back to the B2 upload function. With our existing knowledge of the “expensive” design, the next step was to understand the alternative path. We found a way to create significant savings by only introducing a modest level of complexity. Here’s how: When a “client” wants to push data to the servers, it does not just start uploading data to a “well known URL” and have the SERVER figure out where to put the data. At the start, the client contacts a “dispatching server” that has the job of knowing where there is optimally available space in a Backblaze data center.

The dispatching server (the API server answering the b2_get_upload_url call) tells the client “there is space over on Vault-8329.” This next step is our magic. Armed with the knowledge of the open vault, the client ends its connection with the dispatching server and creates a brand new request DIRECTLY to Vault-8329 (calling b2_upload_file or b2_upload_part). No load balancers involved! This is guaranteed to scale infinitely for very little overhead cost. A side note is that the client can continue to directly call b2_upload_file repeatedly from now on (without asking the dispatching server ever again), up until it gets the response indicating that particular vault is full. In other words, this does NOT double the number of network requests.
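
As a rough sketch of that reuse pattern (assuming an auth dict returned by b2_authorize_account, plain ASCII file names, and deliberately minimal error handling; the function name is mine, not part of the API):

```python
import hashlib
import requests

def upload_batch(auth, bucket_id, files):
    """Upload (name, bytes) pairs, reusing one upload URL until the vault says stop."""
    target = None
    for name, data in files:
        while True:
            if target is None:
                # Ask the dispatching server (b2_get_upload_url) for a vault.
                resp = requests.post(
                    f"{auth['apiUrl']}/b2api/v2/b2_get_upload_url",
                    headers={"Authorization": auth["authorizationToken"]},
                    json={"bucketId": bucket_id},
                )
                resp.raise_for_status()
                target = resp.json()
            # b2_upload_file: talk to the vault directly; no load balancer in the path.
            resp = requests.post(
                target["uploadUrl"],
                headers={
                    "Authorization": target["authorizationToken"],
                    "X-Bz-File-Name": name,
                    "Content-Type": "b2/x-auto",
                    "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
                },
                data=data,
            )
            if resp.status_code in (500, 503):
                target = None   # contract terminated: go back for a new vault
                continue
            resp.raise_for_status()
            break               # file stored; keep the same upload URL for the next one
```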

The “Contract” concept emanates from a simple truth: all APIs are contracts between two entities (machines). Since the client knows exactly where to go and exactly what authorization to bring with it, it can establish a secure “contract” with the Vault specified by the dispatching server.[1] The modest complexity only comes into play if Vault-8329 fills up, gets too busy, or goes offline. In that case, the client will receive either a 500 or 503 error as notification that the contract has been terminated (in effect, it’s a firm message that says “stop uploading to Vault-8329, it doesn’t have room for more data”). When this happens, the client is responsible for going BACK to the dispatching server, asking for a new vault, and retrying the upload to a different vault. In the scenario where the client has to go back to the dispatching server, the “two phase” process becomes more work for the client versus S3’s singular “well known URL” architecture. Of course, this is all handled at the code level and is well documented. In effect, your code just needs to know that “if you receive a 500 block error, just retry.” It’s free, it’s easy, and it will work.

So while the Backblaze approach introduces some modest complexity, it can be quickly and reliably handled in code. Looking at S3’s approach, it is certainly simpler, but it results in three expensive consequences:

1) Expensive fixed costs. Amazon S3 has a single upload URL choke point that requires load balancers and extremely high bandwidth. Backblaze’s architecture does not require moving data around internally; this lets us use commodity 10 Gb/s connections that are affordable and will scale infinitely. Further, as discussed above, load balancing hardware is expensive. By removing it from our Upload system, we remove a significant cost center.

2) Expensive single URL availability issues. Amazon S3’s solution requires high availability of the single upload URL for massive amounts of data. The Contract concept from Backblaze works more reliably, but does add slight additional complexity when (rare) extra network round trips are needed.

3) Expensive, time consuming data copy needs (and “eventual consistency”). Amazon S3 requires copying massive amounts of data from one part of their network (the upload server) to wherever the data’s ultimate resting place will be. This is at the root of one of the biggest frustrations when dealing with S3: Amazon’s “eventual consistency.” It means that you can’t use your data until it has been pushed to all the places it needs to go. As the article notes, this is usually fast, but it can take a material amount of time, at any time. The lack of predictability around access times is something anyone dealing with S3 is all too familiar with.

The B2 architecture offers what one could consider “strong consistency.” There are different definitions of that idea. Ours is that the client connects DIRECTLY with the correct final location for the data to land. Once our system has confirmed a write, the data has been written to enough places that we can guarantee that the data can be seen without delay.

Was our decision a good one? Customers will continue to vote on that, but it appears that the marginal complexity is more than offset by the fact that B2 is a sustainable service offered at ¼ of S3’s price.

But Seriously, Why Aren’t You Just “S3 Compatible?”

The use of Object Storage requires some sort of interface. You can build it yourself by learning your vendor’s APIs or you can go through a third party integration. Regardless of what route you choose, somebody is becoming fluent in the vendor’s APIs. And beyond the difference in cost, there’s a reliability component.

This is a good time to clear up a common misconception. The S3 protocol is not a specification: it’s an API doc. Why does this matter? Because API docs leave many outcomes undocumented. For instance, when a developer uses S3’s list_files function, they canNOT know what is going to happen just by reading the API docs. Compounding this issue is the sheer scope of the S3 API; it is huge and expanding. Systems that purport to be “S3 compatible” are unlikely to implement the full API and have to document whatever subset they implement. Once that is done, they will have to work with integration partners and customers to communicate which subset they have chosen as “important.”

Ultimately, we have chosen to create robust documentation describing, among other things, the engineering specification (this input returns that output, here’s how B2 handles various error cases, etc.).

With hundreds of integrations from third parties and hundreds of thousands of customers, it’s clear that our APIs have proven easy to implement. The reality is that the first time anyone implements cloud storage in an application, it can take weeks. The first move into the cloud can be particularly tough for legacy applications. But each subsequent cloud implementation can be reliably completed in days, if not hours, if the documentation is clear and the APIs are well understood. I’m pleased that we’ve been able to create a complete solution that is proving quite easy to use.

And it’s a big deal that B2 is free of the “load balancer problem.” It solves a huge scaling issue. When we roll out new vaults in new data centers in new countries, the clients contact those vaults DIRECTLY (over whatever network path is shortest), and so there are fewer choke points in our architecture.

It all means that, over an infinite time horizon, our customers can rely on B2 as the most affordable, easiest to use cloud storage on the planet. And, at the end of the day, if we’re doing that, we’ve done the right thing.


[1] The Contract Architecture also explains how we got to a secure two step Authorization process. When you call the dispatching server, we run the authentication process and then give you a Vault for uploads and an Auth token. When you are establishing the contract with the Vault, the Auth token is required before any other operation can begin.

The post Design Thinking: B2 APIs (& The Hidden Costs of S3 Compatibility) appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.
