Proxmox, VM Redundancy Using ZFS Replication
In the video below, we show you how to set up ZFS replication and High Availability in Proxmox VE
One of the many useful benefits of hypervisors is the ability to provide High Availability for applications that don’t have their own redundancy solution
Now, in a small cluster, that typically involves using central storage like a NAS, but the problem is that if the NAS goes offline, so do your virtual machines
Fortunately, Proxmox VE supports ZFS, allowing us to take advantage of ZFS replication
With this, you store your virtual machines on local storage and replicate them to the local storage of another server
Useful links:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_types
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvesr
Assumptions:
Now, because this video is about ZFS replication and High Availability, I’m going to assume you already have a cluster set up
If not, then I do have another video which shows you how to create Proxmox VE clusters
In addition, if you only want to run two servers, I have another video which shows you how to set up a qdevice, which can save you money
Overview:
In order to be able to run a virtual machine, the hypervisor needs access to that VM’s files
Now, let’s say you create a VM on server 1 and choose that server’s local storage for the hard drive destination
Well, if server 1 goes offline, server 2 can’t spin the VM back up, or at least not without restoring the VM from a backup, for instance
That’s because server 2 can’t access the files on server 1 and it’s the reason why shared storage is popular
If all of the servers are running VMs from shared storage, then they all have access to the files
An alternative strategy is to use ZFS replication
Each server will have a ZFS pool that uses local storage
And the VM files will be replicated to the local storage of another server
By default, replication is done every 15 minutes, but you can set specific times of day, and even reduce this to as little as every minute if you like
The initial pass requires copying the entire disk across, but subsequent ones only copy the delta changes
If a server goes offline, another server can then spin the VM back up because it has a local copy of the VM’s hard drive
You do have a brief window of data loss, but if you had used shared storage and that went offline, then you’d likely restore from a backup taken anywhere up to 24 hours ago
Yes, you can replicate from one NAS to another, but you’ll probably still have a small window where data is lost, and it will cost more money to run a redundant NAS
Some of the benefits of ZFS replication to consider are:
Faster throughput for your VMs, as the storage is local
It’s cheaper to implement than running multiple NAS solutions
Maintenance of your NAS doesn’t affect your virtual machines
Granted this isn’t suited to critical data like a database server, but that would be expected to have its own redundancy solution anyway
Create ZFS Storage:
In order to be able to replicate using ZFS, each node requires ZFS storage
Hard drives, particularly mechanical ones, are more prone to failure than other server parts, so it makes sense for the hypervisors to have disk redundancy
In this demo though, we’ll set up a simple mirror comprising two drives on one server, but just a single disk on the other
This is to demonstrate that it’s an option if you’re running a lab or you’re on a budget like me, for instance
In a community post I showed some hardware I’d bought which included two Enterprise SSDs, one for each server
https://amzn.to/4dZhg1z
These can handle a lot more writes than your typical retail SSD, so for now each server will have one disk, but once I get the money I’ll add another to improve redundancy
To create the storage, click on the server then navigate to Disks | ZFS
Click Create: ZFS and provide a name for the new pool, for example, LocalZFS
To keep things simple you will want to use the same name when setting this up on another server
Select the disks to use and the RAID level
By default, compression is set to on, which means lz4 compression will be used, at least at the time of recording
If you’d prefer to be specific or choose a different algorithm, you can select one from the drop down menu
Unless you know what you’re doing, you’ll probably want to leave ashift at 12
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html
NOTE: Disk performance can suffer if you make the wrong choice, and once the pool is created, you can’t change this setting
Finally, click Create and shortly after the pool will be ready
Now repeat this for other servers that will be involved in the replication, making sure to use the same settings, including the name
Typically though I’d expect you’d want to replicate between two servers, with one server as the primary and the other as the secondary
For a cluster of 5 servers or more, Ceph would probably make more sense
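For reference, here’s a rough shell equivalent of the pool creation above. The pool name LocalZFS matches the example, but the device names and the single-disk layout on the second server are assumptions you’ll need to adjust
```
# Server 1: create a two-disk mirror called LocalZFS with ashift=12
# (/dev/sdb and /dev/sdc are placeholder device names)
zpool create -o ashift=12 LocalZFS mirror /dev/sdb /dev/sdc

# Server 2: same pool name, but a single disk, so no disk redundancy
zpool create -o ashift=12 LocalZFS /dev/sdb

# lz4 compression, matching the GUI's "on" default
zfs set compression=lz4 LocalZFS

# Register the pool as Proxmox VE storage so it can hold VM disks
# (storage definitions are cluster-wide, so this only needs running once)
pvesm add zfspool LocalZFS --pool LocalZFS --content images,rootdir
```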
VM Disk Storage:
A virtual machine has one or more hard drives and these are files that need to be stored somewhere the hypervisor can access
In this case, we want the files stored in the ZFS pool comprised of local disks
For new VMs, you can set this to be the ZFS pool as part of the build process
But for existing VMs, you’ll need to migrate the hard drive(s) and that’s what we’ll do for this demo
To do that, select the VM on the server then click on Hardware
Select the Hard Disk, then from the Disk Action drop-down menu select Move Storage
Select the ZFS pool created earlier from the Target Storage drop-down menu, enable Delete source, then click Move disk
Usually there’s no gain in leaving the original files in place, as they just take up space, which is why I suggest deleting the source
After a while, the task will complete and the VM will be ready for replication
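If you prefer the command line, the same move can be done with qm. The disk name scsi0 below is an assumption; check the VM’s Hardware tab or qm config for the actual name
```
# Move VM 100's scsi0 disk to the LocalZFS storage and delete the source
# (on older Proxmox VE releases the command is spelled 'qm move_disk')
qm move-disk 100 scsi0 LocalZFS --delete 1

# Confirm the disk now lives on LocalZFS
qm config 100 | grep scsi0
```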
ZFS Replication:
Although both nodes in this demo have a storage called LocalZFS, only one server holds a copy of the hard drive
You’ll see that if you click a server’s LocalZFS storage then select VM Disks
What we need to do therefore is to setup replication of the VM’s hard disk to the other server
You can setup replication from the Datacenter, Server and VM level
However, since you can’t pick multiple VMs to replicate at once, it’s easier to configure this on the VM itself
Once it’s setup though, you can maintain replication jobs at any level
For this demo we’ll replicate the hard drive of VM 100, so we’ll click on that VM then navigate to Replication
From there, click Add
Make sure to choose the correct Target server
By default, the replication will run every 15 minutes, but you can change this to */1 for every minute, and that’s what I’ll do for this video demonstration to speed things up
What you choose really depends on how often you expect disk changes on the VM, how much data loss can be tolerated, etc.
But how much network bandwidth you have can be really important as well. If this data is constantly going over a production interface, for instance, there could be little room left to serve clients
TIP: After the initial migration, deltas are replicated which reduces the amount of data to transfer
Now you can set rate limiting here as well, but you have to allocate enough bandwidth to send all the data in that time window, so you’ll probably need to experiment
Add a comment if you like
By default, the replication job will be enabled, so unless you need to delay this you’ll want to leave this as is
Now click Create
You’ll now need to wait for the interval period chosen before the job will start
Once the task completes, you should see a copy of the hard drive on the target server
TIP: We don’t get asked what storage to replicate to, which is why both servers need to have ZFS storage with the same name
Going forward, this task will be repeated at the chosen interval by the cluster
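The same job can be created and checked from the shell with pvesr. The job ID 100-0 and the target node name pve2 below are assumptions for this sketch
```
# Replicate VM 100 to node pve2 every minute (job IDs take the form <vmid>-<number>)
pvesr create-local-job 100-0 pve2 --schedule '*/1' --comment 'VM 100 to pve2'

# Optionally cap the transfer rate (MB/s) so replication doesn't swamp the link
pvesr update 100-0 --rate 50

# Show all jobs, when they last ran and when the next sync is due
pvesr status
```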
High Availability:
Now although we’re replicating a virtual machine’s hard drive to another server, that doesn’t include vital information about the virtual machine’s hardware, its CPU, memory, etc
But that’s not an issue because we want to cover against the primary server going offline, and we also want to minimise the downtime
And this is where High Availability comes in
First, we need to create a High Availability group, so click on Datacenter then navigate to HA | Groups
Click Create and enter something meaningful for the name, for example Prefer_Node1 to define a group where node 1 is the preferred server VMs should run on
The restricted option is for when you want VMs to only run on servers in the group. This makes more sense in larger clusters where you need to balance resources, for instance
Personally, I prefer to enable nofailback, because if something goes wrong with a server and it keeps restarting, your virtual machines could become unstable
If a server goes offline, I’d rather the VMs are migrated to another server by HA and then I’ll manually move them back once the situation is properly resolved
Fortunately Proxmox VE has a bulk migration option, so it’s relatively easy to do
TIP: You can assign multiple tags to a VM, including one for the server it should run on. This makes it easy to filter which VMs need migrating back
Select the servers that you want included in the group
Typically you’ll want to set priorities so VMs are running on a specific server during normal times
This group is called Prefer_Node1, so we’ll give server 1 a priority of 10 and server 2 a priority of 1
Now click Create
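For reference, the same group can be created from the shell with ha-manager; pve1 and pve2 are placeholder node names for this sketch
```
# Group where pve1 (priority 10) is preferred over pve2 (priority 1),
# with nofailback enabled so VMs stay put after a failed node recovers
ha-manager groupadd Prefer_Node1 --nodes "pve1:10,pve2:1" --nofailback 1
```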
The next thing to do is to configure HA for our VM
Navigate back up a level to HA then click Add
Select the VM you want High Availability for from the drop-down menu
Now select the group we just created
Max Restart defines how many times you want HA to try to restart a VM on the same server if it fails to start
Now it might be worth trying more than once if the problem is the OS on the VM, but the more attempts that are made, the longer it may take before the VM is back online when a migration to another server is ultimately needed
Max Relocate is how many times a migration should be attempted before giving up
Typically the Request State will be started and that’s what it is by default
TIP: If you’ll be carrying out maintenance on a server or VM, you might want to set this to ignored and then change it back to started after that’s complete
Add a comment if needed, then click Add
Technically, the replication job we created only replicates the hard drive of the virtual machine
Fortunately, Proxmox VE and HA handle the rest for us, so we don’t need to create a dummy VM on the target server and start it up manually, for instance
Now in this demo, we’ve only created one group, but typically you’d have at least two. This way, some VMs will be run on server 1, while others will be run on server 2
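The equivalent CLI steps, assuming VM 100 and the group created above, look something like this
```
# Put VM 100 under HA management in the Prefer_Node1 group
ha-manager add vm:100 --group Prefer_Node1 --max_restart 1 --max_relocate 1 --state started

# During maintenance, stop HA from acting on the VM, then re-enable it afterwards
ha-manager set vm:100 --state ignored
ha-manager set vm:100 --state started

# Check what HA thinks each resource and node is doing
ha-manager status
```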
Testing:
Now in theory this should all work, but it always pays to test
To make sure things are working as normal I have a ping test running on another VM
Next I’ll simulate an outage on server 1
What we should see after a while is that HA decides server 1 is out of service
And then it will start the VM on server 2
Once the VM is back up and running we’ll check how our ping test is doing
Now granted, there is still an outage window, but the only way to get near-seamless failover is when you have an application that has its own HA solution. The backup server, for instance, should detect the primary is down within milliseconds and take over. In that situation, we wouldn’t need this type of replication anyway
Now we’ll make one final check and that is to check replication
If you remember, we set ZFS replication from server 1 to server 2
But if we navigate to Datacenter | Replication, select the job and click Edit, we’ll see that server 1 is now the Target
In other words, Proxmox VE updated our replication job because the VM was migrated to server 2
It’s best to clean things up, and because this VM is intended to be run on server 1, we’ll manually migrate it back
Once that’s completed, we should see the replication job updated again
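If you’d rather do the cleanup from the shell, migrating back to the preferred node might look like this, assuming the nodes are called pve1 and pve2. Because the VM is HA-managed, the request goes through ha-manager rather than qm
```
# Ask HA to migrate VM 100 back to its preferred node
ha-manager migrate vm:100 pve1

# Once it's back, the replication job should show pve2 as the target again
pvesr status
```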
Summary:
In an ideal world, we’d all like to have a hyper-converged solution such as Ceph
But I don’t run enough VMs to justify the extra servers that would be needed
I still want a cluster, but I use a qdevice rather than a 3rd server, as that to me would be a waste of money
In the case of Ceph, you need at least 3 servers to get it up and running, but it’s recommended you have a minimum of 4
During normal operations though, you should have an odd number of servers in a cluster, which means you should really have 5 servers
Now I can run all of my VMs on one server, so I only have one extra server because I want the redundancy
But adding another 3 just so that I could have Ceph wouldn’t make any sense at all
In which case, ZFS replication offers a very useful way to deal with that single point of failure that is shared storage
Sharing is caring!