Proxmox: How To Remove A Server From A Cluster And Add A Rebuilt Server
In the video below, we show you how to remove a server from a Proxmox VE cluster and how to add a rebuilt server
Now, there may come a time when you need to remove or rebuild a server in your Proxmox VE cluster
In my case, the boot drive on one server needs to be replaced and so the server will be rebuilt
But you can’t remove a server from the cluster via the GUI; instead, you have to use the command line
So in this video we go over how to remove a server from the cluster and how to add a rebuilt server back in
Useful links:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
https://forum.proxmox.com/threads/lrm-unable-to-read-lrm-status.65415/
Warning:
Removing a server from a cluster is a one-way trip and it may not be easy to recover the cluster back to its previous state should things go wrong
And while the process should be trouble-free, you have to be prepared for a potential outage, so make sure you have a backup of everything to hand
You need to be fully aware of what needs to be done ahead of time, and have a process in place, long before you touch your own system
Don’t simply follow along and make changes as you watch the video or read my blog, because your circumstances may be different
Watch the video and/or read the blog first and then plan what you need to do
Ideally, test any changes in a lab first, before making changes to a production environment
Server Upgrade:
Now if all you want to do is to remove a server from a cluster then feel free to skip ahead
But in my case I’m rebuilding a server and I want to add it back into the cluster afterwards
The server doesn’t have a spare PCIe slot, so I have to replace an NVMe drive that occupies that slot with an SSD that will actually go into a DVD tray; probably not a strategy the designers had in mind
When you add a server to a cluster it’s best that the servers are running the same software version
And the easiest way to do that is for them all to be running the latest version of software
Now I only have 2 servers plus a qdevice in this cluster, so now is the ideal time to upgrade server 2, as the virtual machines can run on server 1 while that’s being done
But the first thing I’ll do is migrate all the running virtual machines to server 1
A minor upgrade for a server is pretty straightforward, just click on the server you want to upgrade and navigate to Updates
From there click on Refresh and wait for the task to finish. Once that’s done, close the window then click on Upgrade
It will then open a console session and you’ll be asked to confirm that you want to upgrade the server
Once the upgrade is completed you’ll probably want to reboot the server for good measure
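If you’d rather do a minor upgrade from the command line, the Refresh and Upgrade buttons roughly correspond to the standard apt commands below, run from a shell on the server; treat this as an approximate equivalent rather than exactly what the GUI runs
apt update
apt dist-upgrade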
Prep Work:
Before you remove a server from the cluster you will want to make sure it’s not in use and that the cluster will support the removal
The first thing to do is to make sure that nothing depends on this server
So if you’re running Ceph for instance, you might need to make changes to that before the server is removed
You also want to check Backup tasks, Replication tasks and also HA tasks and groups
If the server is referenced in any of these, you’ll either need to modify or delete them
This is particularly important for replication tasks, we’re told, because you may not be able to remove a job afterwards
HA might also get you stuck in a loop if you migrate virtual machines to a server only for HA to migrate them back
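If you prefer to check from a shell rather than the GUI, replication jobs and HA resources can be listed with the commands below; these are just read-only checks and the output will depend on your own setup
pvesr list
ha-manager config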
The next thing to do is to check the cluster status
For this, you can either open a shell session or connect to a server using SSH
From here we’ll run this command
pvecm status
Typically a cluster will show that it requires 2 servers to achieve quorum
So, assuming you are seeing 3 voting devices in the cluster, it should be OK to remove one server
NOTE: If you only have two voting machines in the cluster, you’ll likely get a warning that you can’t remove a server. This is because you’d be left with only 1 and so quorum wouldn’t be possible, or at least not without making configuration changes
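As a rough illustration, the quorum section of the output in a healthy 2 server plus qdevice cluster like mine looks something like this; your names and numbers will differ
Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice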
The next thing to do is to make sure the server is empty
So I’ll migrate everything to server 2, by doing a bulk migration of all virtual machines, containers and templates
TIP: If you plan on rebuilding a server and will add it back to the cluster later, then you might want to note down any network interface and IP addressing as you’ll need these for the rebuild
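One simple way to capture those details, assuming a default Proxmox networking setup, is to save a copy of the network configuration and current addressing before shutting the server down, for example
cat /etc/network/interfaces
ip addr show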
Since I want to remove server 1 we’ll switch over to server 2 in the GUI
From here, we’ll shutdown server 1
Bear in mind, once you remove a server from the cluster it should not be powered back on, or at the very least it should not be able to talk to the other servers in the cluster again, as that risks breaking the cluster
Since I’m going to install Proxmox VE onto a new boot drive, I’ll be replacing the original drive anyway once this server is powered off, which avoids that risk for me
Another thing to factor in here is redundancy
During this work I will only have 1 working server to run virtual machines on so I have to be prepared for an outage and maybe the need to restore from backup
Removing A Server:
Removing a node from the cluster is fairly straightforward but be warned, this is a one way process so you have to be prepared for potential problems
Make sure your systems and data are backed up and be prepared for a potential outage should things not go to plan
Ideally, you should test things in a lab first
To remove a server, we have to do this from the command line
I want to remove server 1 and so I’m connected to server 2 via the web browser
In which case, I could open a shell session here or I could SSH into the server
Both options support copying and pasting so feel free to pick whichever method you prefer
The first thing we’ll do is to identify which servers are operational in this cluster by running this command
pvecm nodes
Because I’ve powered off server 1, it shouldn’t appear in the list
NOTE: If the server you’re going to remove is showing, then check why and resolve that before going any further because it must be offline
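For illustration, with server 1 powered off I’d only expect to see something like the following (here I’m assuming server 2 is called pvedemo2); node IDs and names will of course be different on your cluster
Membership information
----------------------
    Nodeid      Votes Name
         2          1 pvedemo2 (local)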
Assuming the server is seen to be offline, we’ll remove it from the cluster with this command
pvecm delnode pvedemo1
Since pvedemo1 is the name of my server, you’ll need to substitute this with the name of yours
NOTE: There is no request for confirmation, so make sure you are using the correct server name. Another reason to test this in a lab first
Now you might see the following warning
Could not kill node (error = CS_ERR_NOT_EXIST)
From what we’re told, we can ignore this because the message is from corosync and the node it’s referring to is offline
TIP: If you get this response
cluster not ready - no quorum?
It’s because you don’t have enough devices in the cluster to achieve quorum and so changes cannot be made
To confirm the node is removed we’ll check the cluster status with this command
pvecm status
Compared to the last time we ran this, the number of nodes and the expected votes, for instance, will now be lower because a server has been removed from the cluster
Although the server has been removed, its configuration files will be retained
In my case, these can be found in /etc/pve/nodes/pvedemo1
Now, you can recover these files if you think you’ll need them, but either way we’re told to delete the folder for completeness
rm -rf /etc/pve/nodes/pvedemo1
Again, substitute the name of your server for pvedemo1
TIP: The server you do this on should sync with your other servers so you only need to remove this folder once
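As a quick sanity check, if you want one, listing the nodes folder should now only show the servers that remain in the cluster
ls /etc/pve/nodes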
As an extra measure you might want to tidy up the known_hosts file
For me this makes sense because I’ll be adding a server into the cluster with the same IP address and hostname, but it will have a different fingerprint
To do this we’ll edit the file using nano
nano /etc/pve/priv/known_hosts
And then remove the fingerprint for the server we removed
Save and exit and this should also update the file /etc/ssh/ssh_known_hosts because it’s a linked file
TIP: The server you do this on should sync with any other servers in the cluster so you only need to do this once
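If you’d rather not edit the file by hand, I believe ssh-keygen can remove the entries for a named host instead; substitute your own server name, and repeat with its IP address if that’s what is stored in the file
ssh-keygen -R pvedemo1 -f /etc/pve/priv/known_hosts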
Now, although I didn’t see this in the documentation, there is one more step to tidy up if you’ve been using High Availability
If you don’t do this then chances are, when you check the HA status at the Datacenter level, it will still reference the old server and that can be confusing
I ran into that myself and found this forum post
https://forum.proxmox.com/threads/lrm-unable-to-read-lrm-status.65415/
First, you need to stop the CRM service on ALL nodes, by running this command
systemctl stop pve-ha-crm.service
Then on one server we need to remove a status file
rm -f /etc/pve/ha/manager_status
Then on that same server, start the service back up
systemctl start pve-ha-crm.service
Once this is back up, start the service on the other nodes
This doesn’t impact any existing HA jobs. But once the service is restarted, if you have any existing jobs or add a new one, the status will be updated
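To confirm the old server is no longer being referenced, you can also check the HA status from the command line rather than the GUI
ha-manager status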
Chances are, the server that was removed will still show in the web browser. If so, this can be resolved with a refresh (Ctrl-F5)
If your goal was to simply remove a server then that’s as far as you need to go
But again, make sure the server you removed does not come back online otherwise it could damage the cluster
Adding A Rebuilt Server:
Now in my case, the goal was to rebuild server 1
So off camera I replaced the boot drive and carried out a fresh installation of Proxmox VE
Basically, there are plenty of videos out there that show you how to install Proxmox VE, including one that I’ve made, and I didn’t want to make this video any longer
So now it’s a matter of adding the rebuilt server into the cluster
Now, although I’ve already done a video about this as well, I’ll make an exception in this case because it involves adding a server to the cluster that the cluster more or less already knew about, and that’s quite unusual
One thing I’ve noticed is that when a server is added to the cluster, it loses its TLS certificate
Even the backup files in the folder were lost, suggesting the cluster creates a new folder for the server, which makes sense
I didn’t want to retain the original folder as I wanted a fresh installation, but it’s something to think about
So if you do plan on providing your own certificate, then you may as well add that after the server has been added to the cluster
To add the server to the cluster we need to get the Join Information
So whilst connected to server 2 in the web browser, we’ll navigate to Datacenter | Cluster and then click Join Information
Here we can copy the details to the clipboard by clicking Copy Information
Now connected to server 1 in the web browser, we’ll navigate to Datacenter | Cluster and this time click Join Cluster
Then we’ll paste in the join information, and check the details
At the very least we need to provide the root password and then we’ll click the Join button to begin the join process
Now what I tend to find is that the process is usually quick, because if you check another server, the new server shows up
However, on the original server, the process seems to stall and eventually you’ll start seeing connection error messages if you click anywhere on the page
Assuming the server is showing up in the cluster, we may as well close this connection to server 1 and start a new one
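As an aside, there is also a command line way of joining a cluster, run from the new server and pointed at an existing member; I used the GUI here, so treat this as an alternative sketch and substitute the IP address of one of your existing servers for the placeholder
pvecm add <IP-of-an-existing-server>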
Another thing I’ve noticed is that while existing storage connections, for example an NFS share to a NAS, are added to the new server, SDN networking is not added automatically
An SDN network does show up under a new server, but it will be in a state of pending
In which case, if you’re using SDN then you need to apply your SDN configuration in order to update the new server
Assuming nothing else stands out, it makes sense to migrate any virtual machines back to the server, and add/update any Backup/HA/Replication jobs as well
Sharing is caring!