Replacing a Proxmox Node

Dealing with Ceph

When a Proxmox node is part of a Ceph cluster, the process to replace the node requires additional steps for handling the Ceph components.

To replace a node in a Ceph cluster, follow this process:

  1. Back Up All Critical Data: Back up all critical data before changing anything on the cluster.

  2. Remove the Node from the Ceph Cluster:

    Before you can remove the node from the Proxmox cluster, it should be removed from the Ceph cluster. For each Ceph service (ceph-osd, ceph-mon, ceph-mds, etc.) on the node to be replaced, you must do the following:

    • For OSDs (Object Storage Daemons):

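      • Mark the OSD as out: ceph osd out osd.<ID>
      • Wait until rebalancing has finished (ceph status shows all placement groups as active+clean) before continuing.
      • Stop the OSD: systemctl stop ceph-osd@<ID>
      • Remove the OSD from the CRUSH map: ceph osd crush remove osd.<ID>
      • Delete the OSD's authentication key: ceph auth del osd.<ID>
      • Delete the OSD: ceph osd rm osd.<ID>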
    • For MONs (MONitors):

      • Remove the monitor: ceph mon remove <MONID>

    Replace <ID> with the number of the OSD and <MONID> with the ID of the monitor you wish to remove (on Proxmox, the monitor ID is usually the node's hostname).

  3. Shutdown and Remove the Old Node:

    Follow the steps in the section "After Ceph has been taken care of" below.

  4. Install Proxmox and Ceph on the New Node:

    Install Proxmox on the new node and add it to the Proxmox and Ceph clusters.

    Adding a node to the Proxmox cluster is covered in the section "After Ceph has been taken care of" below. Adding it to the Ceph cluster typically involves installing Ceph and recreating each Ceph service (monitor, manager, OSDs) that existed on the old node.
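As a rough sketch of recreating those services, the helper below prints the pveceph commands that would rebuild a monitor, a manager, and one OSD per disk on the new node. It only prints a plan and runs nothing. The function name ceph_recreate_plan and the example device paths are assumptions for illustration, the subcommand names are those of recent Proxmox VE releases (older releases used forms such as createmon and createosd), and which services you actually need depends on what the old node hosted.

```shell
#!/bin/sh
# Print (do not run) the pveceph commands to rebuild Ceph services on a new node.
# Arguments: block devices that should become OSDs, e.g. /dev/sdb /dev/sdc.
# The device names are placeholders; check them with lsblk before running anything.
ceph_recreate_plan() {
    echo "pveceph install"       # install the Ceph packages on the new node
    echo "pveceph mon create"    # only needed if the old node hosted a monitor
    echo "pveceph mgr create"    # only needed if the old node hosted a manager
    for dev in "$@"; do
        echo "pveceph osd create $dev"    # one OSD per empty disk
    done
}

# Example: ceph_recreate_plan /dev/sdb /dev/sdc
```

Review the printed commands, drop the ones that do not apply, and run the rest on the new node once it has joined the Proxmox cluster.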

Replacing a node that is part of a Ceph cluster is an advanced procedure that requires a solid understanding of both Proxmox and Ceph. Work carefully: incorrect settings can lead to data loss. Always back up important data before taking the node offline, and consider involving an experienced system administrator.

After Ceph has been taken care of

Replacing a node in a Proxmox cluster involves the following general steps:

  1. Back Up All Critical Data:

Ensure that you've backed up all critical data from the node to be replaced.

  2. Shut Down and Remove the Old Node:

Migrate or shut down all guests on the old node first, then shut the system down (for example with shutdown -h now) and physically remove the old hardware.

  • Log in to one of the remaining cluster nodes and run pvecm nodes to confirm the name of the node you are removing.
  • Remove it from the cluster: pvecm delnode <nodename>

Do not reconnect the removed node to the cluster network with its old configuration once it has been removed from the cluster.

  3. Install Proxmox on the New Node:

Install Proxmox Virtual Environment on the new hardware according to the official Proxmox VE installation guide.

  4. Add the New Node to the Cluster:

A node joins an existing cluster with the pvecm tool, which transfers the cluster configuration and certificates for you; do not copy corosync.conf or certificates by hand. The new node must not host any VMs or containers when it joins. Run the following on the new node:

  • Join the cluster: pvecm add existing_node

    Replace existing_node with the IP or hostname of an existing Proxmox node in the cluster. You will be asked for that node's root password.

  • Verify membership and quorum: pvecm status

  • If certificates need to be regenerated afterwards: pvecm updatecerts --force

    Now, the new node should be part of the cluster.
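One way to check the result is to look at the quorum state reported by pvecm status. The sketch below reads that output on standard input and succeeds only if it finds a "Quorate: Yes" line. The helper name cluster_is_quorate is hypothetical, and the line format it matches is an assumption based on typical pvecm status output.

```shell
#!/bin/sh
# Succeed (exit 0) if the text on stdin contains a "Quorate: Yes" line,
# fail (non-zero) otherwise.
# Intended use: pvecm status | cluster_is_quorate
cluster_is_quorate() {
    grep -Eq '^[[:space:]]*Quorate:[[:space:]]+Yes[[:space:]]*$'
}

# Example:
#   if pvecm status | cluster_is_quorate; then
#       echo "cluster is quorate"
#   else
#       echo "cluster is NOT quorate; check pvecm status and the corosync logs"
#   fi
```

Reading from standard input keeps the check independent of the cluster, so it can be tried against saved output before being used on a live node.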

If you have a Proxmox VE Subscription:

A subscription key is bound to the server it was activated on, so it has to be released from the old node before the new node can use it:

  1. Release the Old Node's Subscription:

    • Log in at my.proxmox.com and go to your subscription.
    • Remove the old node from the subscription and confirm the removal request.

  2. Activate the Subscription on the New Node:

    • Log in to the web interface of the new Proxmox VE node.
    • Open the Subscription tab in the node view.
    • Upload the subscription key, then use Check to verify that it is active on the new node.

Remember to migrate or shut down any running VMs or containers before removing the old node from the cluster. After adding the new node to the cluster, you can migrate the guests back or restore them from backup. This process can be complex, so make sure you understand each step, or consider working with an experienced system administrator.

On Proxmox, Ceph is usually managed through the Proxmox management interface. That said, some cleanup operations are easier to handle on the command line.

If a Ceph node goes down, Ceph automatically handles some aspects of recovery. However, if you know that a Ceph OSD node is permanently gone (due to hardware failure etc.) and isn't just temporarily offline, it's better to manually remove the lost OSDs to help Ceph recover more quickly.

Therefore, the following instructions on removing a lost Ceph node focus on removing OSDs as this is typically the most relevant scenario in a cluster.

Removing OSDs from a Proxmox Ceph Cluster following a Node Loss

Run the following commands on a working Ceph node:

  1. Get an Overview of Your Cluster:

    Run ceph osd tree. This command will give you an overview of the current OSDs, including their up/down state and which are in the CRUSH map. The output will resemble:

    ID CLASS WEIGHT  TYPE NAME      STATUS REWEIGHT PRI-AFF 
     -1       0.21859 root default                            
     -3       0.21859     rack 2U_IBM                           
      0   hdd 0.21859         osd.0       up  1.00000 1.00000 
    

    You should see your missing node's OSDs here, and they'll typically be tagged as down.

  2. Mark the OSDs as Out:

    To mark the OSDs as out, use ceph osd out followed by the OSD ID, for example:

    ceph osd out osd.0
    

    Because the lost OSDs can no longer serve data, Ceph will then start re-creating their contents from the remaining replicas on the other OSDs in your cluster.

  3. Remove the OSDs from the CRUSH Map and Cluster:

    Once an OSD has been marked out and the data has finished backfilling to other OSDs in the cluster, you'll want to remove the lost OSDs from the cluster.

    To do this, you'll first want to remove it from the CRUSH map:

    ceph osd crush remove osd.0
    

    And then remove the OSD completely:

    ceph auth del osd.0
    ceph osd rm osd.0
    
  4. Repeat for All Lost OSDs:

    Repeat the process above for each OSD from your lost node.

  5. Check the Cluster Health:

    When you've removed all lost OSDs, you should check your cluster health with ceph status in order to verify that the cluster is back to HEALTH_OK status.
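The steps above can be sketched as two small shell helpers: one that picks the down OSDs out of plain ceph osd tree output, and one that prints the removal commands for a single OSD. Nothing is executed against the cluster; the helpers only print text. The function names list_down_osds and osd_removal_plan are hypothetical, and the parser assumes the column layout shown in step 1, where the status follows directly after the osd.N name.

```shell
#!/bin/sh
# List OSDs reported as "down" in plain `ceph osd tree` output read from stdin.
# Prints one OSD name (e.g. osd.3) per line.
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down")
                print $i
    }'
}

# Print (do not run) the removal sequence from steps 2 and 3 for one OSD.
# Argument: the OSD name, e.g. osd.3.
osd_removal_plan() {
    echo "ceph osd out $1"
    echo "ceph osd crush remove $1"    # run only after backfilling has finished
    echo "ceph auth del $1"
    echo "ceph osd rm $1"
}

# Example: print the plan for every down OSD.
#   ceph osd tree | list_down_osds | while read -r osd; do
#       osd_removal_plan "$osd"
#   done
```

Printing the plan instead of running it leaves room to confirm that each listed OSD really belongs to the lost node, since an OSD can also be down for temporary reasons.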

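If the lost node also ran a monitor, remove it from the monitor map with ceph mon remove <MONID>, then delete the old monitor references by editing /etc/ceph/ceph.conf on a remaining node (on Proxmox this file is normally a link to /etc/pve/ceph.conf, which is shared across the cluster):

Remove the IP for the lost node from the line mon_host = [IPs]. Remove any entry for the lost node in the form of

[mon.ID]
     public_addr = [IP]

Together with the OSD removal above, these changes let Ceph know that the lost daemons are gone for good and are not expected to return to the cluster.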
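To confirm that no reference to the lost node remains in the configuration, a simple search is usually enough. The sketch below prints every line of a config file that still mentions a given address. The helper name find_stale_refs is hypothetical, the default path is the one named above, and the -w option (whole-word match) is assumed to be available, as it is in GNU and BSD grep.

```shell
#!/bin/sh
# Print every line (with its line number) of a config file that still
# mentions the given IP address or hostname as a whole word.
# Exit status is 0 if at least one line was found, non-zero otherwise.
# Usage: find_stale_refs <ip-or-hostname> [config-file]
find_stale_refs() {
    grep -n -w -F -- "$1" "${2:-/etc/ceph/ceph.conf}"
}

# Example (10.0.0.3 stands in for the lost node's address):
#   if find_stale_refs 10.0.0.3; then
#       echo "stale references found, edit the lines listed above"
#   fi
```

The whole-word match keeps an address such as 10.0.0.3 from also matching 10.0.0.30.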

Remember, these operations might have a serious impact on your cluster and might lead to data loss if not performed properly. Always consider getting professional assistance when orchestrating a recovery of a Ceph cluster.