Thursday 10 March 2011

Flexible Storage Replication

I have recently been looking at a lot of different storage setups, including storage replication, and so far have mostly relied on running rsync to copy a file system to an appropriate secondary host. For large file systems - either with a lot of files or simply a lot of changing data - this is slow and resource intensive. That's not really a problem in some cases, but it is very problematic if you want your secondary system to have very current data. If you want to cobble something together yourself from commodity hardware, DRBD is an excellent tool and very feature-rich.

First of all, I can't recommend the DRBD User's Guide enough. It lays out the features and usage not just of DRBD but also of the common tools you would use alongside it, like LVM for storage management and Pacemaker and Heartbeat (among others) for clustering.

What DRBD does is basically copy writes to a block device over the network to a replica device - this storage set is called a "resource". Generally, you will have two nodes for each resource. During normal operation, one node is "Primary" and the other is "Secondary" for each resource, which is to say one node is writing changes to the resource while the other keeps a copy. DRBD is generally very slick about handling replication and tracking the status of the nodes. When you configure a resource, you specify an IP address for the replication target, and generally you are going to want this to be a separate network interface from your general data plane - for example a cross-over cable giving a point-to-point connection between the two nodes.

If the replication path goes down, DRBD marks the point in time at which it happened and keeps track of which blocks have changed since then, so when the path comes back up it has a list of blocks that need to be transferred instead of having to resync the whole device. That's another thing - it does the initial whole-device sync for you when you create the device. And you get basically the same behaviour if your secondary node tanks, or the primary, or both nodes for that matter.
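To make that concrete, here is a rough sketch of what a two-node resource definition might look like - the host names, backing device, and 10.0.0.x replication addresses are all made up for illustration, and the exact syntax depends on your DRBD version:

    resource r0 {
      protocol C;

      on node-a {
        device    /dev/drbd0;
        disk      /dev/sdb1;          # backing block device on this node
        address   10.0.0.1:7789;      # dedicated replication interface (e.g. the cross-over link)
        meta-disk internal;
      }

      on node-b {
        device    /dev/drbd0;
        disk      /dev/sdb1;
        address   10.0.0.2:7789;
        meta-disk internal;
      }
    }

From there, "drbdadm create-md r0" and "drbdadm up r0" on both nodes, plus forcing one node to Primary (the exact flag for that varies by DRBD version), kicks off the initial full-device sync mentioned above.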

That is, unless both nodes end up in "Primary" state during some overlapping window. If you automatically promote the secondary node on a primary failure with Pacemaker, for example, but the issue was a path failure and not a node failure, then both nodes may end up "Primary". Since DRBD tracks when communication is disrupted, it will detect this problem - a "split brain". You get several options for resolution, both manual and automatic, including keeping the changes of one node or the other, the node with the most changes, the node with the fewest changes, the older primary, the younger primary... You may still be stuck losing some data - but you can also leave both nodes in split brain and consolidate the changes externally (e.g. if you have critical data like financial records where you can never drop a transaction).
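For illustration, the automatic policies live in the resource's net section, and manual recovery is a couple of drbdadm commands - the resource name "r0" is just a placeholder and the option syntax shown here is the DRBD 8.3-era form, so check the User's Guide for your version:

    net {
      after-sb-0pri discard-zero-changes;   # neither node Primary: drop the side with no changes
      after-sb-1pri discard-secondary;      # one Primary: the Secondary's changes lose
      after-sb-2pri disconnect;             # two Primaries: give up and wait for a human
    }

    # Manual resolution, run on the node whose changes you are willing to throw away:
    drbdadm secondary r0
    drbdadm -- --discard-my-data connect r0
    # ...and on the surviving node, if it has also disconnected:
    drbdadm connect r0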

DRBD supports three replication "protocols" called, intuitively, A, B, and C. "A" is asynchronous: a write unblocks as soon as the local device finishes writing (and the data has been handed off to the network). "B" is "semi-synchronous": a write unblocks once the data has reached the peer. "C" is synchronous: a write only completes once the data has been written to both devices. I was finding that "A" and "B" got me similar speeds and "C" was slower - but this was not very rigorous testing, and my replication link was 100Mbps through a shared data plane.
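The protocol is just a one-line choice in the resource definition (shown here at the resource level, as in the 8.3-era configs I was testing with):

    resource r0 {
      protocol A;    # or B, or C - use C if you cannot afford to lose a write
                     # that the application has already seen acknowledged
      # ... rest of the resource definition as before ...
    }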

One nice thing about any of these replication options compared to rsync is that they are generally much easier on your memory. When rsync scrapes the file system, it effectively nukes the OS's disk cache, so after rsync runs, users may notice it takes a while to "warm up" again. But replication is not a backup - if a virus eats your files on the primary node, it will eat them on the secondary node too, synchronously or asynchronously - your choice.

If you are using LVM (and you should be - I've posted about LVM before, and so have others), you'll wonder whether to layer DRBD on top of LVM or vice versa. As Chef would say: use DRBD on top of your LVs. Dramatic over-simplification aside, it does depend on what you are doing. If you are using LVM to carve up a pool of storage, for example for virtualization, and you want the storage layer to replicate your VMs, it may make more sense to create your DRBD volume from physical storage; then it will replicate the whole LVM structure to your replica node. But there are complications, like making sure LVM will even look at DRBD devices for PVs, managing size changes, and so on. There's a time and a place for everything, and that's college.
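Here is a rough sketch of the two layouts - the volume group, LV, and device names are invented for the example, and the lvm.conf filter line is the general idea rather than a drop-in value:

    # Option 1: DRBD on top of LVM - the resource's backing "disk" is a logical volume
    disk /dev/vg0/lv_drbd;

    # Option 2: LVM on top of DRBD - the DRBD device becomes a physical volume,
    # so the whole VG/LV structure gets replicated to the peer
    pvcreate /dev/drbd0
    vgcreate vg_replicated /dev/drbd0
    # ...and in /etc/lvm/lvm.conf, teach LVM to scan the DRBD device rather than
    # its backing partition, something along the lines of:
    # filter = [ "a|/dev/drbd.*|", "r|/dev/sdb1|" ]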

Um, what else is awesome about DRBD? Offline initialization and "truck-based replication" (a.k.a. sneakernet): replicate the node locally, ship it to the remote site, and turn it up from there. DRBD Proxy (a paid feature) for when you need to buffer replication over slow or unreliable network links. Dual-primary operation (for use with a cluster file system like GFS). Three-node operation by layering DRBD on top of DRBD.
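Dual-primary, for instance, is a couple of extra directives in the resource definition - you still need a cluster file system like GFS or OCFS2 on top and fencing you trust, and "r0" is again just a placeholder:

    resource r0 {
      net {
        allow-two-primaries;        # both nodes may be Primary at the same time
      }
      startup {
        become-primary-on both;     # promote both nodes automatically at startup
      }
      # ... rest of the resource definition ...
    }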

Yeah, it's cool. It's Free and free. You can get it stock with Fedora and CentOS (probably Ubuntu and others too, but I haven't tried it yet).

And one last thing - you cannot mount a resource that is "Secondary". So if you are getting crazy error messages that you can neither mount nor even fsck your file system, it's probably in Secondary - don't bang your head against the wall, just do "drbdadm primary <resourcename>". Is clear?
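In other words, something along these lines (the resource name, device, and mount point are whatever you configured):

    cat /proc/drbd              # check the connection state and role (Primary/Secondary)
    drbdadm primary r0          # promote this node
    mount /dev/drbd0 /srv/data  # now the mount (or fsck) will actually work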
