Proxmox, VM Redundancy Using ZFS Replication
In the video below, we show you how to set up ZFS replication and High Availability in Proxmox VE
One of the many useful benefits of hypervisors is the ability to provide High Availability for applications that don’t have their own redundancy solution
Now, in a small cluster, that typically involves using central storage like a NAS, but the problem is that if the NAS goes offline, so do your virtual machines
Fortunately, Proxmox VE supports ZFS, allowing us to take advantage of ZFS replication
With this, you store your virtual machines on local storage and replicate them to the local storage of another server
Useful links:
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_storage_types
https://pve.proxmox.com/pve-docs/pve-admin-guide.html#chapter_pvesr
Assumptions:
Now, because this video is about ZFS replication and High Availability, I’m going to assume you already have a cluster set up
If not, then I do have another video which shows you how to create Proxmox VE clusters
In addition, if you only want to run two servers, I have another video which shows you how to set up a qdevice, which can save you money
Overview:
In order to be able to run a virtual machine, the hypervisor needs access to that VM’s files
Now, let’s say you create a VM on server 1 and choose that server’s local storage for the hard drive destination
Well, if server 1 goes offline, server 2 can’t spin the VM back up, or at least not without restoring the VM from a backup, for instance
That’s because server 2 can’t access the files on server 1 and it’s the reason why shared storage is popular
If all of the servers are running VMs from shared storage, then they all have access to the files
An alternative strategy is to use ZFS replication
Each server will have a ZFS pool that uses local storage
And the VM files will be replicated to the local storage of another server
By default, replication is done every 15 minutes, but you can set specific times of day, and even reduce this to as little as every minute if you like
The initial pass requires copying the entire disk across, but subsequent ones only copy the delta changes
If a server goes offline, another server can then spin the VM back up because it has a local copy of the VM’s hard drive
You do have a brief window of data loss, but if you had used shared storage and that went offline, then you’d likely restore from a backup taken anywhere up to 24 hours ago
Yes, you can replicate from one NAS to another, but you’ll probably still have a small window where data is lost, and it will cost more money to run a redundant NAS
Some of the benefits of ZFS replication to consider are:
Faster throughput for your VMs, as the storage is local
It’s cheaper to implement than running multiple NAS solutions
Maintenance of your NAS doesn’t affect your virtual machines
Granted this isn’t suited to critical data like a database server, but that would be expected to have its own redundancy solution anyway
Create ZFS Storage:
In order to be able to replicate using ZFS, each node requires ZFS storage
Hard drives, particularly mechanical ones, are more prone to failure than other server parts, so it makes sense for the hypervisors to have disk redundancy
In this demo though, we’ll set up a simple mirror comprising two drives on one server, but just a single disk on the other
This is to demonstrate that it’s an option if you’re running a lab or you’re on a budget like me, for instance
In a community post I showed some hardware I’d bought which included two Enterprise SSDs, one for each server
https://amzn.to/4dZhg1z
These can handle a lot more writes than your typical retail SSD, so for now each server will have one disk, but once I get the money I’ll add another to improve redundancy
To create the storage, click on the server then navigate to Disks | ZFS
Click Create: ZFS and provide a name for the new pool, for example, LocalZFS
To keep things simple you will want to use the same name when setting this up on another server
Select the disks to use and the RAID level
By default, compression is set to on, which means lz4 compression will be used, at least at the time of recording
If you’d prefer to be specific or choose a different algorithm, you can select one from the drop down menu
Unless you know what you’re doing, you’ll probably want to leave ashift at 12
https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/Workload%20Tuning.html
NOTE: Disk performance can suffer if you make the wrong choice, and once the pool is created, you can’t change this setting
Finally, click Create and shortly after the pool will be ready
Now repeat this for other servers that will be involved in the replication, making sure to use the same settings, including the name
Typically though I’d expect you’d want to replicate between two servers, with one server as the primary and the other as the secondary
For a cluster of 5 servers or more, Ceph would probably make more sense
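For reference, here’s a rough shell equivalent of the pool creation above. The pool name LocalZFS matches the example, but the device names and the single-disk layout on the second server are assumptions you’ll need to adjust
```
# Server 1: create a two-disk mirror called LocalZFS with ashift=12
# (/dev/sdb and /dev/sdc are placeholder device names)
zpool create -o ashift=12 LocalZFS mirror /dev/sdb /dev/sdc

# Server 2: same pool name, but a single disk, so no disk redundancy
zpool create -o ashift=12 LocalZFS /dev/sdb

# lz4 compression, matching the GUI's "on" default
zfs set compression=lz4 LocalZFS

# Register the pool as Proxmox VE storage so it can hold VM disks
# (storage definitions are cluster-wide, so this only needs running once)
pvesm add zfspool LocalZFS --pool LocalZFS --content images,rootdir
```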
VM Disk Storage:
A virtual machine has one or more hard drives and these are files that need to be stored somewhere the hypervisor can access
In this case, we want the files stored in the ZFS pool comprised of local disks
For new VMs, you can set this to be the ZFS pool as part of the build process
But for existing VMs, you’ll need to migrate the hard drive(s) and that’s what we’ll do for this demo
To do that, select the VM on the server then click on Hardware
Select the Hard Disk, then from the Disk Action drop-down menu select Move Storage
Select the ZFS pool created earlier from the Target Storage drop-down menu, enable Delete source, then click Move disk
Usually there’s no gain in leaving the original files in place, as they just take up space, which is why I suggest deleting the source
After a while, the task will complete and the VM will be ready for replication
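If you prefer the command line, the same move can be done with qm. The disk name scsi0 below is an assumption; check the VM’s Hardware tab or qm config for the actual name
```
# Move VM 100's scsi0 disk to the LocalZFS storage and delete the source
# (on older Proxmox VE releases the command is spelled 'qm move_disk')
qm move-disk 100 scsi0 LocalZFS --delete 1

# Confirm the disk now lives on LocalZFS
qm config 100 | grep scsi0
```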
ZFS Replication:
Although both nodes in this demo have a storage called LocalZFS, only one server holds a copy of the hard drive
You’ll see that if you click a server’s LocalZFS storage then select VM Disks
What we need to do therefore is to setup replication of the VM’s hard disk to the other server
You can setup replication from the Datacenter, Server and VM level
However, since you can’t pick multiple VMs to replicate at once, it’s easier to configure this on the VM itself
Once it’s setup though, you can maintain replication jobs at any level
For this demo we’ll replicate the hard drive of VM 100, so we’ll click on that VM then navigate to Replication
From there, click Add
Make sure to choose the correct Target server
By default, the replication will run every 15 minutes, but you can change this to */1 for every minute, and that’s what I’ll do for this video demonstration to speed things up
What you choose really depends on how often you expect disk changes on the VM, how much data loss can be tolerated, etc.
But how much network bandwidth you have can be really important as well. If this data is constantly going over a production interface, for instance, there could be little room left to serve clients
TIP: After the initial migration, deltas are replicated which reduces the amount of data to transfer
Now you can set rate limiting here as well, but you have to allocate enough bandwidth to send all the data in that time window, so you’ll probably need to experiment
Add a comment if you like
By default, the replication job will be enabled, so unless you need to delay this you’ll want to leave this as is
Now click Create
You’ll now need to wait for the interval period chosen before the job will start
Once the task completes, you should see a copy of the hard drive on the target server
TIP: We don’t get asked what storage to replicate to, which is why both servers need to have ZFS storage with the same name
Going forward, this task will be repeated at the chosen interval by the cluster
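The same job can be created and checked from the shell with pvesr. The job ID 100-0 and the target node name pve2 below are assumptions for this sketch
```
# Replicate VM 100 to node pve2 every minute (job IDs take the form <vmid>-<number>)
pvesr create-local-job 100-0 pve2 --schedule '*/1' --comment 'VM 100 to pve2'

# Optionally cap the transfer rate (MB/s) so replication doesn't swamp the link
pvesr update 100-0 --rate 50

# Show all jobs, when they last ran and when the next sync is due
pvesr status
```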
High Availability:
Now although we’re replicating a virtual machine’s hard drive to another server, that doesn’t include vital information about the virtual machine’s hardware, its CPU, memory, etc
But that’s not an issue because we want to cover against the primary server going offline, and we also want to minimise the downtime
And this is where High Availability comes in
First, we need to create a High Availability group, so click on Datacenter then navigate to HA | Groups
Click Create and enter something meaningful for the name, for example Prefer_Node1 to define a group where node 1 is the preferred server VMs should run on
The restricted option is for when you want VMs to only run on servers in the group. This makes more sense in larger clusters where you need to balance resources, for instance
Personally, I prefer to enable nofailback, because if something goes wrong with a server and it keeps restarting, your virtual machines could become unstable
If a server goes offline, I’d rather the VMs are migrated to another server by HA and then I’ll manually move them back once the situation is properly resolved
Fortunately Proxmox VE has a bulk migration option, so it’s relatively easy to do
TIP: You can assign multiple tags to a VM, including one for the server it should run on. This makes it easy to filter which VMs need migrating back
Select the servers that you want included in the group
Typically you’ll want to set priorities so VMs are running on a specific server during normal times
This group is called Prefer_Node1, so we’ll give server 1 a priority of 10 and server 2 a priority of 1
Now click Create
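For reference, the same group can be created from the shell with ha-manager; pve1 and pve2 are placeholder node names for this sketch
```
# Group where pve1 (priority 10) is preferred over pve2 (priority 1),
# with nofailback enabled so VMs stay put after a failed node recovers
ha-manager groupadd Prefer_Node1 --nodes "pve1:10,pve2:1" --nofailback 1
```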
The next thing to do is to configure HA for our VM
Navigate back up a level to HA then click Add
Select the VM you want High Availability for from the drop-down menu
Now select the group we just created
Max Restart defines how many times you want HA to try to restart a VM on the same server if it fails to start
Now it might be worth trying more than once if the problem is the OS on the VM, but the more attempts that are made, the longer it may take before the VM is back online when a migration to another server is ultimately needed
Max Relocate is how many times a migration should be attempted before giving up
Typically the Request State will be started and that’s what it is by default
TIP: If you’ll be carrying out maintenance on a server or VM, you might want to set this to ignored and then change it back to started after that’s complete
Add a comment if needed, then click Add
Technically, the replication job we created only replicates the hard drive of the virtual machine
Fortunately, Proxmox VE and HA handle the rest for us, so we don’t need to create a dummy VM on the target server and start it up manually, for instance
Now in this demo, we’ve only created one group, but typically you’d have at least two. This way, some VMs will be run on server 1, while others will be run on server 2
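The equivalent CLI steps, assuming VM 100 and the group created above, look something like this
```
# Put VM 100 under HA management in the Prefer_Node1 group
ha-manager add vm:100 --group Prefer_Node1 --max_restart 1 --max_relocate 1 --state started

# During maintenance, stop HA from acting on the VM, then re-enable it afterwards
ha-manager set vm:100 --state ignored
ha-manager set vm:100 --state started

# Check what HA thinks each resource and node is doing
ha-manager status
```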
Testing:
Now in theory this should all work, but it always pays to test
To make sure things are working as normal I have a ping test running on another VM
Next I’ll simulate an outage on server 1
What we should see after a while is that HA decides server 1 is out of service
And then it will start the VM on server 2
Once the VM is back up and running we’ll check how our ping test is doing
Now granted, there is still an outage window, but the only way to get near-seamless failover is when you have an application that has its own HA solution. The backup server, for instance, should detect the primary is down within milliseconds and take over. In that situation, we wouldn’t need this type of replication anyway
Now we’ll make one final check and that is to check replication
If you remember, we set ZFS replication from server 1 to server 2
But if we navigate to Datacenter | Replication, select the job and click Edit, we’ll see that server 1 is now the Target
In other words, Proxmox VE updated our replication job because the VM was migrated to server 2
It’s best to clean things up, and because this VM is intended to be run on server 1, we’ll manually migrate it back
Once that’s completed, we should see the replication job updated again
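If you’d rather do the cleanup from the shell, migrating back to the preferred node might look like this, assuming the nodes are called pve1 and pve2. Because the VM is HA-managed, the request goes through ha-manager rather than qm
```
# Ask HA to migrate VM 100 back to its preferred node
ha-manager migrate vm:100 pve1

# Once it's back, the replication job should show pve2 as the target again
pvesr status
```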
Summary:
In an ideal world, we’d all like to have a hyper-converged solution such as Ceph
But I don’t run enough VMs to justify the extra servers that would be needed
I still want a cluster, but I use a qdevice rather than a 3rd server, as that to me would be a waste of money
In the case of Ceph, you need at least 3 servers to get it up and running, but it’s recommended you have a minimum of 4
During normal operations though, you should have an odd number of servers in a cluster, which means you should really have 5 servers
Now I can run all of my VMs on one server, so I only have one extra server because I want the redundancy
But adding another 3 just so that I could have Ceph wouldn’t make any sense at all
In which case, ZFS replication offers a very useful way to deal with that single point of failure that is shared storage
Sharing is caring!