1 min read

Tips: How to Resolve Most “PartiallyFailed” Errors in Velero Volume Backups

Tips: How to Resolve Most “PartiallyFailed” Errors in Velero Volume Backups

In some environments, Velero backups frequently fail to capturing some pods peristent volumes, resulting in a “PartiallyFailed” status. This issue can be critical when volume-level backups are required on mission critical enviroments. It tends to occurs most frequently during full cluster backups that demand reliable volume coverage and consistency.

After testing with Tanzu Kubernetes Grid (TKG) 2.5, I’ve found a reliable workaround that ensures successful full-cluster backups, even including persistent volumes and works every time.

This solution is not compatible with backups managed via Tanzu Mission Control (TMC). Applying this fix will break TMC integration.

Solution: Patch Velero Node Agents

To use this solution, Velero must be installed with the --use-node-agent flag. This allows direct control over the node-agent DaemonSet.

Edit the Node Agent DaemonSet

kubectl -n velero edit ds node-agent

Add Tolerations to Enable Scheduling on Control Plane Nodes

Insert the following under .spec.template.spec:

  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists

Now run a full cluster backup to test that all components, including peristent volumes are backuped successfully.