New Open Source Tool for Consistency in Cassandra Migrations

By Elliott Sims | January 5, 2024

Sometimes you find a problem because something breaks, and other times you find a problem the good way—by thinking it through before things break. This is a story about one of those bright, shining, lightbulb moments when you find a problem the good way.

On the Backblaze Site Reliability Engineering (SRE) team, we were thinking through an upcoming datacenter migration in Cassandra. We were running through all of the various types of queries we would have to do when we had the proverbial “aha” moment. We discovered an inconsistency in the way Cassandra handles lightweight transactions (LWTs).

If you’ve ever tried to do a datacenter migration in Cassandra and something got corrupted in the process but you couldn’t figure out why or how—this might be why. I’m going to walk through a short intro on Cassandra, how we use it, and the issue we ran into. Then, I’ll explain the workaround, which we open sourced. 

Get the Open Source Code

You can download the open source code from our Git repository. We’d love to know how you’re using it and how it’s working for you—let us know in the comments.

How We Use Cassandra

First, if you’re not a Cassandra dev, I should mention that when we say “datacenter migration” it means something slightly different in Cassandra than what it sounds like. It doesn’t mean a data center migration in the physical sense (although you can use datacenter migrations in Cassandra when you’re moving data from one physical data center to another). In the simplest terms, it involves moving data between two Cassandra or Cassandra-compatible database replica sets within a cluster.

And, if you’re not familiar with Cassandra at all, it’s an open-source, NoSQL, distributed database management system. It was created to handle large amounts of data across many commodity servers, so it fits our use case—lots of data, lots of servers. 

At Backblaze, we use Cassandra to map file names to their storage locations for data stored in Backblaze B2, for example. Because it’s customer data and not just analytics, we care more about durability and consistency than some other applications of Cassandra do. We run with three replicas in a single datacenter, and we set the commit log to “batch” mode, which requires writes to be committed to disk before they are acknowledged, rather than the default “periodic” mode.
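For reference, the relevant knob is the commit log sync mode in cassandra.yaml. Here’s a minimal sketch using the standard option names from the 3.x config file; the values shown are illustrative rather than our production settings:

# cassandra.yaml -- commit log sync settings
# "batch": don't acknowledge a write until the commit log has been fsynced.
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2

# The default is "periodic": acknowledge immediately, fsync on a timer.
# commitlog_sync: periodic
# commitlog_sync_period_in_ms: 10000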

Datacenter migrations are an important aspect of running Cassandra, especially on bare metal. We do a few datacenter migrations per year either for physical data moves, hardware refresh, or to change certain cluster layout parameters like tokens per host that are otherwise static. 

What Are LWTs and Why Do They Matter for Datacenter Migrations in Cassandra?

First of all, LWTs are neither lightweight nor transactions, but that’s neither here nor there. They are an important feature in Cassandra. Here’s why. 

Cassandra is great at scaling. In something like a replicated SQL cluster, you can add additional replicas for read throughput, but not writes. Cassandra scales writes (as well as reads) nearly linearly with the number of hosts—into the hundreds. Adding nodes is a fairly straightforward and “automagic” process as well, with no need to do something like manual token range splits. It also handles individual down nodes with little to no impact on queries. Unfortunately, these properties come with a trade-off: a complex and often nonintuitive consistency model that engineers and operators need to understand well.

In a distributed database like Cassandra, data is replicated across multiple nodes for durability and availability. 

Although databases generally allow multiple reads and writes to be submitted at once, they make it look to the outside world like all the operations are happening in order, one at a time. This property is known as serializability, and Cassandra is not serializable. Although it does have a “last write wins” system, there’s no transaction isolation and timestamps can be identical. 

It’s possible, for example, to have a row that has some columns from one write and other columns from another write. It’s safe if you’re only appending additional rows, but mutating existing rows safely requires careful design. Put another way, you can have two transactions with different data that, to the system, appear to have equal priority. 

How Do LWTs Solve This Problem?

As a solution for cases where stronger consistency is needed, Cassandra has a feature called “Lightweight Transactions” or LWTs. These are not really identical to traditional database transactions, but provide a sort of “compare and set” operation that also guarantees pending writes are completed before answering a read. This means if you’re trying to change a row’s value from “A” to “B”, a simultaneous attempt to change that row from “A” to “C” will return a failure. This is accomplished by doing a full—not at all lightweight—Paxos round complete with multiple round trips and slow expensive retries in the event of a conflict.
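As a quick CQL sketch (the table and values here are made up for illustration), the IF clause is what turns an ordinary UPDATE into an LWT:

-- Compare-and-set: only apply the write if the current value is still 'A'.
UPDATE example_table SET state = 'B' WHERE id = 42 IF state = 'A';

-- If another writer changed the row to 'C' first, the update is rejected and
-- Cassandra returns the current value instead:
--  [applied] | state
-- -----------+-------
--      False | C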

In Cassandra, the minimum consistency level for read and write operations is ONE, meaning that only a single replica needs to acknowledge the operation for it to be considered successful. This is fast, but if the one replica that acknowledged a write is lost before the data reaches the others, it can mean data loss, and later reads may or may not see the newest write depending on which replicas are involved and whether they’ve received the previous write. For better durability and consistency, Cassandra also provides various quorum levels that require a response from multiple replicas, as well as an ALL consistency level that requires responses from every replica.

Cassandra Is My Type of Database

Curious to know more about consistency limitations and LWTs in Cassandra? Christopher Batey’s presentation at the 2016 Cassandra Summit does a good job of explaining the details.

The Problem We Found With LWTs During Datacenter Migrations

Usually we use one datacenter in Cassandra, but there are circumstances where we sometimes stand up a second datacenter in the cluster and migrate to it, then tear down the original. We typically do this either to change num_tokens, to move data when we’re refreshing hardware, or to physically move to another nearby data center.

The TL;DR

We reasoned through the interaction between LWTs/serial and datacenter migrations and found a hole—there’s no guarantee of LWT correctness during a topology change (that is, a change to the number of replicas) large enough to change the number of replicas needed to satisfy quorum. It turns out that combining LWTs and datacenter migrations can violate consistency guarantees in subtle ways without some specific steps and tools to work around it.

The Long Version

Let’s say you are standing up a new datacenter, and you need to copy an existing datacenter to it. So, you have two datacenters—datacenter A, the existing datacenter, and datacenter B, the new datacenter. Let’s say datacenter A has three replicas you need to copy for simplicity’s sake, and you’re using quorum writes to ensure consistency.

Refresher: What is Quorum-Based Consistency in Cassandra?

Quorum consistency in Cassandra is based on the concept that a specific number of replicas must participate in a read or write operation to ensure consistency and availability—a majority (n/2 +1) of the nodes must respond before considering the operation as successful. This ensures that the data is durably stored and available even if a minority of replicas are unavailable.

You have different types of quorum to choose from, and here’s how each one decides whether an operation succeeds (using our example of two datacenters with three replicas each): 

  • Local quorum: Two out of the three replicas in the datacenter I’m talking to must respond in order to return success. I don’t care about the other datacenter.
  • Global quorum: Four out of the six total replicas must respond in order to return success, and it doesn’t matter which datacenter they come from.
  • Each quorum: Two out of the three replicas in each datacenter must respond in order to return success.

Most of these quorum types also have a serial equivalent for LWTs.

Type of Quorum | Serial       | Regular
---------------|--------------|-------------
Local          | LOCAL_SERIAL | LOCAL_QUORUM
Each           | unsupported  | EACH_QUORUM
Global         | SERIAL       | QUORUM

The problem you might run into, however, is that LWTs do not have an EACH_SERIAL mode. They only have local (LOCAL_SERIAL) and global (SERIAL). There’s no way to tell an LWT that you want quorum in each datacenter. 

LOCAL_SERIAL is good for performance, but transactions in different datacenters could overlap and be inconsistent. SERIAL is more expensive, but normally guarantees correctness as long as all queries agree on the cluster size. But what if a query straddles a topology change that changes the quorum size? 

Let’s use global quorum to show how this plays out. If an LWT starts when RF=3, at least two hosts must process it. 

While it’s running, the topology changes to two datacenters (A and B), each with RF=3 (so six replicas total) and a quorum of four. There’s a chance that a query affecting the same partition could then run without overlapping nodes, which means consistency guarantees are not maintained for those queries. For that query, quorum is four out of six, where those four could be the three replicas in datacenter B plus the one replica in datacenter A that wasn’t involved in the first transaction. 

Those two queries are on the same partition, but they’re not overlapping any hosts, so they don’t know about each other. It violates the LWT guarantees.

The Solution to LWT Inconsistency

What we needed was a way to make sure that the definition of “quorum” didn’t change too much in the middle of a LWT running. Some change is okay, as long as old and new are guaranteed to overlap.

To account for this, you need to change the replication factor one replica at a time, and before each change make sure there are no transactions still running that started before the previous topology change. Three replicas with a quorum of two can only change to four replicas with a quorum of three. That way, at least one replica must overlap. The same thing happens when you go from four to five replicas or five to six replicas. This also applies when reducing the replication factor, such as when tearing down the old datacenter after everything has moved to the new one.
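In practice, that means stepping the new datacenter’s replication factor up with a series of ALTER KEYSPACE statements. Here’s a rough sketch, assuming a hypothetical keyspace named example_ks, NetworkTopologyStrategy, and datacenters named dc_a and dc_b:

-- Starting point: dc_a only at RF=3 (quorum = 2 of 3).
-- Step 1: four replicas total, quorum of 3 -- any new quorum overlaps the old one.
ALTER KEYSPACE example_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc_a': 3, 'dc_b': 1};

-- Confirm no LWTs that started before the previous change are still pending,
-- then continue one replica at a time:

-- Step 2: five replicas total, quorum of 3.
ALTER KEYSPACE example_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc_a': 3, 'dc_b': 2};

-- Step 3: six replicas total, quorum of 4.
ALTER KEYSPACE example_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc_a': 3, 'dc_b': 3};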

Then, you just need to make sure no LWT overlaps multiple changes. You could just wait long enough that they’ve timed out, but it’s better to be sure. This requires querying the internal-only system.paxos table on each host in the cluster between topology changes.

We built a tool that checks to see whether there are still transactions running from before we made a topology change. It reads system.paxos on each host, ignoring any rows with proposal_ballot=null, and records them. Then after a short delay, it re-reads system.paxos, ignoring any rows that weren’t present in the previous run, or any with proposal_ballot=null in either read, or any where in_progress_ballot has changed. Any remaining rows are potentially active transactions. 
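Conceptually, each pass boils down to a read of system.paxos on every host, something like the sketch below. The proposal_ballot and in_progress_ballot columns are the ones described above; the remaining column names assume the Cassandra 3.x schema, and since system.paxos is internal-only, none of this is a stable interface.

-- Run on each host; repeat after a short delay and compare the two passes.
SELECT row_key, cf_id, proposal_ballot, in_progress_ballot FROM system.paxos;

-- Pass 1: record the rows where proposal_ballot is not null.
-- Pass 2: a row counts as a potentially active LWT only if it appeared in
--         pass 1, proposal_ballot is non-null in both reads, and
--         in_progress_ballot has not changed.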

This worked well the first few times we used it, on Cassandra 3.11.6. To our surprise, when we tried to migrate a cluster running 3.11.10, the tool reported hundreds of thousands of long-running LWTs. After a lot of digging, we found a small (but fortunately well-commented) performance optimization added as part of a correctness fix (CASSANDRA-12126): proposal_ballot no longer gets set to null if the proposal is empty/no-op. To work around this, we had to actually parse the proposal field. Fortunately, all we need is the is_empty flag in the third field, so there was no need to reimplement the full parsing code. A seemingly small and innocuous change piggybacked onto a correctness fix ended up having a big impact on us, but that’s the risk of reading internal-only tables directly. 

We’ve used the tool several times now for migrations with good results, but it’s still relatively basic and limited. It requires running repeatedly until all transactions are complete, and sometimes manual intervention to deal with incomplete transactions. In some cases we’ve been able to force-commit a long-pending LWT by doing a SERIAL read of the partition affected, but in a couple of cases we actually ended up running across LWTs that still didn’t seem to complete. Fortunately in every case so far it was in a temporary table and a little work allowed us to confirm that we no longer needed the partition at all and could just delete it.

Most people who use Cassandra may never run across this problem, and most of those who do will likely never track down what caused the small mystery inconsistency around the time they did a datacenter migration. If you rely on LWTs and are doing a datacenter migration, we definitely recommend going through the extra steps to guarantee consistency until and unless Cassandra implements an EACH_SERIAL consistency level.

Using the Tool

If you want to use the tool for yourself to help maintain consistency through datacenter migrations, you can find it here. Drop a note in the comments to let us know how it’s working for you and if you think of any other ways around this problem—we’re all ears!

If You’ve Made It This Far

You might be interested in signing up for our Developer Newsletter where our resident Chief Technical Evangelist, Pat Patterson, shares the latest and greatest ways you can use B2 Cloud Storage in your applications.

Load Balancing and Backblaze B2 Cloud Storage

By Elliott Sims | March 18, 2016

A few months ago we announced Backblaze B2 Cloud Storage, our cloud storage product. Cloud storage presents different challenges in the server environment than cloud backup does, and load balancing is one such issue. Let’s take a look at the challenges that have come up, and our solutions.

A load balancer is a server or specialized device that distributes load among the servers that actually do work and reply to requests. In addition to allowing the work to be distributed among several servers, it also makes sure that requests only get sent to healthy servers that are prepared to handle them.

For Backblaze Personal Backup and Backblaze Business Backup products, we’ve had it easy in terms of load balancing: We wrote the client, so we can just make it smart enough to ask us which server to talk to. No separate load balancer needed. The Backblaze products are also very tolerant of short outages, since the client will just upload the files slightly later.

For Backblaze B2, though, the clients are web browsers or programming language libraries like libcurl. They just make a single request for a file and expect an answer immediately. This means we need a load balancer both to distribute the load and allow us to take individual servers offline to update them.

Option 1: Layer 7, Full Proxy

The simplest and most flexible way to do load balancing is to have a pair of hosts, one active and one standby, that accept HTTPS connections from the client and create new connections to the server, then proxy the traffic back and forth between the two. This is usually referred to as “layer 7, full proxy” load balancing.

This doesn’t generally require any special setup on the server, other than perhaps making sure it understands the x-forwarded-for header so it knows the actual client’s IP address. It does have one big downside, though: The load balancer has to have enough bandwidth to handle every request and response in both directions, and enough CPU to handle TCP and SSL in both directions. Modern processors with AES-NI—onboard AES encryption and decryption—help a lot with this, but it can still quickly become a performance bottleneck when you’re talking about transferring large files at 1 Gb/s or higher.
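As a rough illustration (not our configuration), a layer 7, full proxy setup in HAProxy might look something like this, with placeholder names, addresses, and certificate paths:

frontend https_in
    mode http
    bind *:443 ssl crt /etc/haproxy/certs/example.pem
    option forwardfor            # add X-Forwarded-For with the client's real IP
    default_backend storage_pods

backend storage_pods
    mode http
    balance roundrobin
    server pod1 10.0.0.11:443 ssl verify none check
    server pod2 10.0.0.12:443 ssl verify none check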

Option 2: Layer 4, Full Proxy

Another option, if you want to reduce the burden on the load balancers, is layer 4 load balancing. The load balancers accept TCP connections from the client and create a new TCP session to the server, but they proxy through the HTTPS traffic inside the TCP session without decrypting or re-encrypting it. This still requires that the load balancer have enough bandwidth to handle all your traffic, but a lot less CPU compared to layer 7. Unfortunately, it also means that your servers don’t really have a good way to see the original client’s IP address short of hijacking the TCP options field with a proprietary extension.

Option 3: DSR

All of this is adding a lot of work layered on top of a load balancer’s basic purpose: to distribute client requests among multiple healthy back-end servers. To do this, the load balancer only needs to see the request and modify the destination at the outermost layers. No need to parse all the way to layer 7, and no need to even see the response. This is generally called Direct Server Return (DSR).

Especially when serving large files with SSL, DSR requires minimal amounts of bandwidth and CPU power on the load balancer. Because the source IP address is unchanged, the server can see the original client’s IP without even needing an x-forwarded-for header. This does have a few tradeoffs, though: It requires a fairly complex setup not only on the load balancers, but also on the individual servers. In full-proxy modes the load balancer can intercept bad responses and retry the request on a different back-end server or display a friendlier error message to the client, but since the response bypasses the load balancer in DSR mode this isn’t possible. This also makes health-checking tricky because there’s no path for responses from the back-end host to the load balancer.

After some testing, we ended up settling on DSR. Although it’s a lot more complicated to set up and maintain, it allows us to handle large amounts of traffic with minimal hardware. It also makes it easy to fulfill our goal of keeping user traffic encrypted even within our data center.

How Does It Work?

DSR load balancing requires two things:

  1. A load balancer with the VIP (virtual IP) address attached to an external NIC and answering ARP requests for it, so that the rest of the network knows it “owns” the IP.
  2. Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are not replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network, and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.
[Network diagram: the request passes from the client through the load balancer to a back-end server; the response returns directly to the client.]

When the packet arrives at the server’s network interface, it checks to make sure the destination MAC address matches its own. It does, so it accepts the packet. It then, separately, checks whether the destination IP address is attached to one of its interfaces. It is (even though the rest of the network doesn’t know that), so it accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so the response is routed directly to the client without passing back through the load balancer.

How Do I Set It Up?

DSR setup is very specific to each individual network setup, but we’ll try to provide enough information that this can be adapted to most cases. The simplest way is probably to just pay a vendor like F5, A10, or Kemp to handle it. You’ll still need the complex setup on the individual hosts, though, and the commercial options tend to be pretty pricey. We also tend to prefer open-source over black-box solutions, since they’re more flexible and debuggable.

HAProxy and likely other applications can do DSR, but we ended up using IPVS (formerly known as LVS). The core packet routing of IPVS is actually part of the Linux kernel, and various user-space utilities are used for health checks and other management. For user-space management, there are a number of other good options like Keepalived, Ldirectord, Piranha, and Google’s recently released Seesaw. We ended up choosing Keepalived because we also wanted VRRP support for failing over between load balancers, and because it’s both simple and stable/mature.

Setting Up IPVS and Keepalived

Good news! If your kernel is 2.6.10 or newer (and it almost certainly is), IPVS is already included. If /proc/net/ip_vs exists, it’s already loaded. If not, modprobe ip_vs will load the module. Most distributions will probably compile it as a kernel module, but your results with VPS providers may vary. At this point, you’ll probably also want to install the ipvsadm utility so you can manually inspect and modify the IPVS config.
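The checks and installation look roughly like this (package names assume a Debian or Ubuntu system):

# Is IPVS available and loaded?
cat /proc/net/ip_vs || sudo modprobe ip_vs

# Install the management utility (and keepalived, used below)
sudo apt-get install ipvsadm keepalived

# Show the current IPVS virtual server table
sudo ipvsadm -L -n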

The Keepalived load-balancing config is fairly straightforward: a virtual_server section with a real_server section inside of it for each back-end server. Most of the rest depends on your specific needs, but you’ll want to set lb_kind to “DR.” You can use SSL_GET as a simple health checker, but we use “MISC_CHECK” with a custom script, which lets us stop sending new traffic to a server that’s shutting down by setting its weight to 0.
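Here’s a trimmed-down sketch of that structure. The VIP, back-end address, port, and health check script are placeholders, and a real config will need more than this (including the VRRP section for failover between load balancers):

virtual_server <VIPADDRESS> 443 {
    delay_loop 10
    lb_algo wrr
    lb_kind DR                  # Direct Server Return
    protocol TCP

    real_server <SERVERIPADDRESS> 443 {
        weight 100
        MISC_CHECK {
            misc_path "/usr/local/bin/check_pod.sh <SERVERIPADDRESS>"
            misc_timeout 5
        }
    }
}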

The host config is where things get a bit more complicated. The important part is that the VIP address is assigned to an interface, but the server isn’t sending ARP replies about it. There are a few ways to do this that work about equally well, but we use arptables rules defined in /etc/network/interfaces:

pre-up /sbin/arptables -I INPUT -j DROP -d <VIPADDRESS>
pre-up /sbin/arptables -I OUTPUT -j mangle -s <VIPADDRESS> --mangle-ip-s <SERVERIPADDRESS>
pre-down /sbin/arptables -D INPUT -j DROP -d <VIPADDRESS>
pre-down /sbin/arptables -D OUTPUT -j mangle -s <VIPADDRESS> --mangle-ip-s <SERVERIPADDRESS>

Once the arptables rules are in place, you’ll want to add the actual address to the interface:

post-up /sbin/ip addr add <VIPADDRESS>/32 dev $LOGICAL
pre-down /sbin/ip addr del <VIPADDRESS>/32 dev $LOGICAL

If your backend server doesn’t have an actual external IP and normally talks to the outside via NAT, you will need to create a source-based route so that traffic from the VIP goes out through your real gateway (<GATEWAYADDRESS> below), also in the interfaces config:

post-up /sbin/ip rule add from <VIPADDRESS> lookup 200
pre-down /sbin/ip rule del from <VIPADDRESS> lookup 200
post-up /sbin/ip route add default via <GATEWAYADDRESS> dev $LOGICAL table 200
pre-down /sbin/ip route del default via <GATEWAYADDRESS> dev $LOGICAL table 200

Finally, make sure your web server (or other daemon) is listening on <VIPADDRESS> specifically, or on all addresses (0.0.0.0 or ::).

It’s Not Working!

  • First, make sure the load balancer and server are on the same layer-2 network. If they’re on different subnets, none of this will work.
  • Check the value of /proc/sys/net/ipv4/conf/*/rp_filter and make sure it’s not set to 1 anywhere.
  • Run tcpdump -e -n host <VIPADDRESS> on the load balancer and make sure that the requests are reaching the load balancer with a destination IP of <VIPADDRESS> and a destination MAC address belonging to the load balancer, then leaving again with the same source and destination IP but the MAC address of the back-end server.
  • Run ipvsadm on the load balancer and make sure IPVS is configured with the VIP and back-end servers you expect.
  • Run tcpdump -e -n host <VIPADDRESS> on the server and make sure that requests are arriving with a destination IP of <VIPADDRESS>, and leaving again with a source of <VIPADDRESS> and a destination of the client’s IP address.
  • On the server, run ip route get <CLIENTIPADDRESS> from <VIPADDRESS> to make sure the host has a non-NATted return route to the outside.
  • On the server, run ip neigh show <GATEWAYADDRESS>, where <GATEWAYADDRESS> is the “via” IP from the previous ip route get command, to make sure the host knows how to reach the gateway.
