Proxmox, How To Remove A Server From A Cluster And Add A Rebuilt Server

Sep 16, 2024 · 12 mins read

In the video below, we show you how to remove a server from a Proxmox VE cluster and how to add a rebuilt server


Now, there may come a time when you need to remove or rebuild a server in your Proxmox VE cluster

In my case, the boot drive on one server needs to be replaced and so the server will be rebuilt

But you can’t remove a server from the cluster via the GUI; instead, you have to use the command line

So in this video we go over how to remove a server from the cluster and how to add a rebuilt server back in

Useful links:
https://pve.proxmox.com/wiki/Cluster_Manager#_remove_a_cluster_node
https://forum.proxmox.com/threads/lrm-unable-to-read-lrm-status.65415/

Warning:
Removing a server from a cluster is a one-way trip and it may not be easy to restore the cluster to its previous state should things go wrong

And while the process should be trouble-free, you have to be prepared for a potential outage, so make sure you have a backup of everything to hand

You need to be fully aware of what needs to be done ahead of time, and have a process in place, long before you touch your own system

Don’t simply follow along and make changes as you watch the video or read my blog, because your circumstances may be different

Watch the video and/or read the blog first and then plan what you need to do

Ideally, test any changes in a lab first, before making changes to a production environment

Server Upgrade:
Now if all you want to do is to remove a server from a cluster then feel free to skip ahead

But in my case I’m rebuilding a server and I want to add it back into the cluster afterwards

The server doesn’t have a spare PCIe slot, so I have to replace an NVMe drive that occupies that slot with an SSD that will actually go into a DVD tray; probably not a strategy the designers had in mind

When you add a server to a cluster it’s best that the servers are running the same software version

And the easiest way to do that is for them all to be running the latest version of software

Now I only have 2 servers plus a qdevice in this cluster, so this is the ideal time to upgrade server 2, as the virtual machines can run on server 1 while the upgrade takes place

But the first thing I’ll do is migrate all the running virtual machines to server 1

A minor upgrade for a server is pretty straightforward, just click on the server you want to upgrade and navigate to Updates

From there click on Refresh and wait for the task to finish. Once that’s done, close the window then click on Upgrade

It will then open a console session and you’ll be asked to confirm that you want to upgrade the server

Once the upgrade is completed you’ll probably want to reboot the server for good measure
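
If you’d rather do this from the shell, the GUI’s Refresh and Upgrade buttons are essentially wrappers around apt, so the following should be equivalent for a minor upgrade (run as root on the server being upgraded):

# refresh the package lists (what the Refresh button does)
apt update
# apply the upgrade; Proxmox recommends dist-upgrade rather than a plain upgrade
apt dist-upgrade
# reboot for good measure
reboot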

Prep Work:
Before you remove a server from the cluster you will want to make sure it’s not in use and that the cluster will support the removal

The first thing to do is to make sure that nothing depends on this server

So if you’re running Ceph for instance, you might need to make changes to that before the server is removed

You also want to check Backup tasks, Replication tasks and also HA tasks and groups

If the server is referenced in any of these, you’ll either need to modify or delete them

This is particularly important for replication tasks because, we’re told, you may not be able to remove a job afterwards

HA might also get you stuck in a loop, where you migrate virtual machines off a server only for HA to migrate them back

The next thing to do is to check the cluster status

For this, you can either open a shell session or connect to a server using SSH

From here we’ll run this command

pvecm status

Typically a cluster like this will show that it requires 2 votes to achieve quorum

So, assuming you are seeing 3 voting devices in the cluster, it should be OK to remove one server
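
To illustrate, for a healthy cluster of two servers plus a qdevice, the tail end of the output should look something like this (the names, IDs and addresses here are made up and will differ on your system):

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate Qdevice

Membership information
----------------------
    Nodeid      Votes    Qdevice Name
0x00000001          1    A,V,NMW 192.168.1.11
0x00000002          1    A,V,NMW 192.168.1.12 (local)
0x00000000          1            Qdevice

The key lines to check are Total votes, Quorum and the Quorate flag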

NOTE: If you only have two voting machines in the cluster, you’ll likely get a warning that you can’t remove a server. This is because you’d be left with only 1 and so quorum wouldn’t be possible, or at least not without making configuration changes
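
For reference, the usual way around that is to temporarily tell the cluster to expect fewer votes, although treat this as a last resort because it weakens the protection quorum provides:

pvecm expected 1

This only makes sense as a temporary measure while you repair the cluster, not as a way of running day to day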

The next thing to do is to make sure the server is empty

So I’ll migrate everything to server 2, by doing a bulk migration of all virtual machines, containers and templates
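
The bulk migrate option in the GUI is the easiest route, but the same migrations can also be done from the shell; a minimal sketch, assuming VM 100 and container 101 are the guests and pvedemo2 is the target server:

# live-migrate a running virtual machine
qm migrate 100 pvedemo2 --online

# containers can’t be live-migrated, so use restart mode
pct migrate 101 pvedemo2 --restart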

TIP: If you plan on rebuilding a server and will add it back to the cluster later, then you might want to note down any network interface and IP addressing as you’ll need these for the rebuild

Since I want to remove server 1 we’ll switch over to server 2 in the GUI

From here, we’ll shutdown server 1

Bear in mind, once you remove a server from the cluster, it should not be powered back on, or at least it should not be able to talk to the other servers in the cluster again, as that risks breaking the cluster

Since I’m going to install Proxmox VE onto a new boot drive, I’ll replace the original drive anyway once this server is powered off, which avoids that risk for me

Another thing to factor in here is redundancy

During this work I will only have 1 working server to run virtual machines on so I have to be prepared for an outage and maybe the need to restore from backup

Removing A Server:
Removing a node from the cluster is fairly straightforward but be warned, this is a one-way process, so you have to be prepared for potential problems

Make sure your systems and data are backed up and be prepared for a potential outage should things not go to plan

Ideally, you should test things in a lab first

To remove a server, we have to do this from the command line

I want to remove server 1 and so I’m connected to server 2 via the web browser

In which case, I could open a shell session here or I could SSH into the server

Both options support copying and pasting so feel free to pick whichever method you prefer

The first thing we’ll do is to identify which servers are operational in this cluster by running this command

pvecm nodes

Because I’ve powered off server 1, it shouldn’t appear in the list
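
As an illustration, with server 1 powered off you’d expect to see something like this, with only the surviving server listed (note that the qdevice never appears in this particular output):

Membership information
----------------------
    Nodeid      Votes Name
         2          1 pvedemo2 (local)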

NOTE: If the server you’re going to remove is still showing, find out why and resolve that before going any further, because the server must be offline

Assuming the server is seen to be offline, we’ll remove it from the cluster with this command

pvecm delnode pvedemo1

Since pvedemo1 is the name of my server, you’ll need to substitute this with the name of yours

NOTE: There is no request for confirmation, so make sure you are using the correct server name. Another reason to test this in a lab first

Now you might see the following warning
Could not kill node (error = CS_ERR_NOT_EXIST)

From what we’re told, we can ignore this because the message is from corosync and the node it’s referring to is offline

TIP: If you get this response
cluster not ready - no quorum?

It’s because you don’t have enough devices in the cluster to achieve quorum and so changes cannot be made

To confirm the node is removed we’ll check the cluster status with this command

pvecm status

Compared to the last time we ran this, values such as the number of nodes and the expected votes will now be lower because a server has been removed from the cluster

Although the server has been removed, its configuration files will be retained

In my case, these can be found in /etc/pve/nodes/pvedemo1

Now, you can keep a copy of these files if you think you’ll need them, but either way we’re told to delete the folder for completeness

rm -rf /etc/pve/nodes/pvedemo1

Again, substitute the name of your server for pvedemo1

TIP: /etc/pve is a cluster-wide filesystem, so the server you do this on will sync with your other servers and you only need to remove this folder once

As an extra measure you might want to tidy up the known_hosts file

For me this makes sense because I’ll be adding a server into the cluster with the same IP address and hostname, but it will have a different fingerprint

To do this we’ll edit the file using nano

nano /etc/pve/priv/known_hosts

And then remove the fingerprint for the server we removed

Save and exit and this should also update the file /etc/ssh/ssh_known_hosts because it’s a linked file

TIP: The server you do this on should sync with any other servers in the cluster so you only need to do this once
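
If you’d rather not edit the file by hand, ssh-keygen can remove the entry for you; a sketch, assuming pvedemo1 is the hostname of the removed server:

ssh-keygen -R pvedemo1 -f /etc/pve/priv/known_hosts

If the file also has an entry keyed by the server’s IP address, repeat the command with that address in place of the hostname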

Now, although I didn’t see this in the documentation, there is one more step to tidy up if you’ve been using High Availability

If you don’t do this then chances are, when you check the HA status at the Datacenter level, it will still reference the old server and that can be confusing

I ran into that myself and found this forum post
https://forum.proxmox.com/threads/lrm-unable-to-read-lrm-status.65415/

First, you need to stop the CRM service on ALL nodes, by running this command

systemctl stop pve-ha-crm.service

Then on one server we need to remove a status file

rm -f /etc/pve/ha/manager_status

Then on that same server, start the service back up

systemctl start pve-ha-crm.service

Once this is back up, start the service on the other nodes

This doesn’t impact any existing HA jobs. But once the service is restarted, if you have any existing jobs or add a new one, the status will be updated
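
To verify the cleanup worked, you can also check the HA status from the shell; the removed server should no longer be listed:

ha-manager status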

Chances are, the server that was removed will still show in the web browser. If so, this can be resolved with a refresh (Ctrl-F5)

If your goal was to simply remove a server then that’s as far as you need to go

But again, make sure the server you removed does not come back online otherwise it could damage the cluster

Adding A Rebuilt Server:
Now in my case, the goal was to rebuild server 1

So off camera I replaced the boot drive and carried out a fresh installation of Proxmox VE

Basically, there are plenty of videos out there that show you how to install Proxmox VE, including one that I’ve made, and I didn’t want to make this video any longer

So now it’s a matter of adding the rebuilt server into the cluster

Now, although I’ve already done a video about this as well, I’ll make an exception in this case because it involves adding a server that the cluster more or less already knew about, and that’s quite unusual

One thing I’ve noticed is that when a server is added to the cluster, it loses its TLS certificate

Even the backup files in the folder were lost, suggesting the cluster creates a new folder for the server, which makes sense

I didn’t want to retain the original folder as I wanted a fresh installation, but it’s something to think about

So if you do plan on providing your own certificate, then you may as well add that after the server has been added to the cluster

To add the server to the cluster we need to get the Join Information

So whilst connected to server 2 in the web browser, we’ll navigate to Datacenter | Cluster and then click Join Information

Here we can copy the details to the clipboard by clicking Copy Information

Now connected to server 1 in the web browser, we’ll navigate to Datacenter | Cluster and this time click Join Cluster

Then we’ll paste in the join information, and check the details

At the very least we need to provide the root password and then we’ll click the Join button to begin the join process

Now what I tend to find is that the process is usually quick, because if you check another server, the new server shows up

However, on the original server, the process seems to stall and eventually you’ll start seeing connection error messages if you click anywhere on the page

Assuming the server is showing up in the cluster, we may as well close this connection to server 1 and start a new one
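
For completeness, the join can also be done from the shell on the new server instead of the GUI; a sketch, assuming 192.168.1.12 is the address of an existing cluster member (you’ll be prompted for that server’s root password):

pvecm add 192.168.1.12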

Another thing I’ve noticed is that while existing storage connections, for example an NFS share to a NAS, are added to the new server, SDN networking is not added automatically

An SDN network does show up under a new server, but it will be in a state of pending

In which case, if you’re using SDN then you need to apply your SDN configuration in order to update the new server
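
Applying the SDN configuration is done from Datacenter | SDN using the Apply button. As far as I can tell this maps onto the PUT /cluster/sdn API call, so something like the following should work if you’d rather script it, but treat that as an assumption and test it in a lab first:

pvesh set /cluster/sdn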

Assuming nothing else stands out, it makes sense to migrate any virtual machines back to the server, and add/update any Backup/HA/Replication jobs as well

Sharing is caring!