Managed SolrCloud Rolling Updates

Since v0.2.7

Solr Clouds are complex distributed systems, and thus require a more delicate and informed approach to rolling updates.

If the Managed update strategy is specified in the SolrCloud CRD, then the Solr Operator will take control over deleting SolrCloud pods when they need to be updated.
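
For example, the strategy and its limits can be configured on the SolrCloud resource roughly as follows (a sketch: the replica and unavailability values here are arbitrary, and maxShardReplicasUnavailable is shown only as a related option):

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 6
  updateStrategy:
    method: Managed
    managed:
      # At most this many pods may be unavailable at once during the update.
      maxPodsUnavailable: 2
      # At most this many replicas of any single shard may be down at once.
      maxShardReplicasUnavailable: 1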

The operator will find all pods that have not yet been updated and choose the next set of pods to delete for an update, using the following workflow.

Note: Managed updates are executed via Cluster Operation Locks; please refer to that documentation for more information about how these operations are executed.

Pod Update Workflow

The logic goes as follows:

  1. Find the pods that are out-of-date
  2. Update all out-of-date pods that do not have a started Solr container.
    • This allows for updating a pod that cannot start, even if other pods are not available.
    • This step does not respect the maxPodsUnavailable option, because these pods have not even started the Solr process.
  3. Retrieve the cluster state of the SolrCloud if there are any ready pods.
    • If no pods are ready, then there is no endpoint to retrieve the cluster state from.
  4. Sort the pods in order of safety for being restarted (see Pod Update Sorting Order below).
  5. Iterate through the sorted pods, greedily choosing which pods to update (see Pod Update Selection Logic below).
    • The maximum number of pods that can be updated is determined by starting with maxPodsUnavailable, then subtracting the number of updated pods that are unavailable as well as the number of not-yet-started, out-of-date pods that were updated in a previous step. This check ensures that any pods taken down during this step do not violate the maxPodsUnavailable constraint.

Pod Update Sorting Order

The pods are sorted by the following criteria, in the given order. If any two pods are equal on a criterion, then the next criterion (in the order below) is used to sort them.

In this context, the pods sorted highest are the first chosen to be updated; the pods sorted lowest will be selected last.

  1. If the pod is the overseer, it will be sorted lowest.
  2. If the pod is not represented in the clusterState, it will be sorted highest.
    • A pod is not in the clusterState if it does not host any replicas and is not the overseer.
  3. Number of leader replicas hosted in the pod, sorted low -> high
  4. Number of active or recovering replicas hosted in the pod, sorted low -> high
  5. Number of total replicas hosted in the pod, sorted low -> high
  6. If the pod is not a liveNode, then it will be sorted lower.
  7. Any pods that are equal on the above criteria will be sorted lexicographically.
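
For example, suppose out-of-date pod A hosts two replicas, one of which is a leader, while out-of-date pod B hosts three replicas, none of them leaders. Criterion 3 sorts pod B above pod A (zero leaders versus one), so pod B is updated first; the later criteria, including total replica count, never come into play.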

Pod Update Selection Logic

Loop over the sorted pods until the number of pods selected to be updated has reached the maximum. This maximum is calculated by taking the given (or default) maxPodsUnavailable and subtracting the number of updated pods that are unavailable or have yet to be re-created.
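
As a concrete illustration (with made-up numbers): given maxPodsUnavailable: 2, if one previously updated pod is still unavailable and one not-yet-started, out-of-date pod was already deleted in step 2 of the workflow, then the budget for this pass is 2 - 1 - 1 = 0, and no further pods are taken down until a later reconcile loop.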

Triggering a Manual Rolling Restart

Given these complex requirements, kubectl rollout restart statefulset will generally not work on a SolrCloud, since the Solr Operator manages the underlying StatefulSet and will generally revert changes made to it directly.

One option to trigger a manual restart is to change one of the podOptions annotations. For example, you could set an annotation to the date and time of the manual restart:

apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  customSolrKubeOptions:
    podOptions:
      annotations:
        manualrestart: "2021-10-20T08:37:00Z"
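
Each time the annotation value is changed and the manifest is re-applied, the pod template changes, every pod becomes out-of-date, and the operator restarts the pods according to the managed workflow described above.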

Pod Readiness During Updates

The Solr Operator sets up at least two Services for every SolrCloud: a common Service that load balances across the Solr nodes, and a headless Service (or individual per-node Services, when external addresses are used) for addressing nodes directly.

Only the common service uses the publishNotReadyAddresses: false option, since the common service should load balance between all available nodes. The other services have individual endpoints for each node, so there is no reason to de-list pods that are not available.
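
As an illustration, the two kinds of Services differ roughly as sketched below. The names follow the operator's usual <cloud-name>-solrcloud-* convention, but treat this as a simplified sketch rather than the operator's exact output:

apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-common    # load-balances across ready Solr nodes
spec:
  publishNotReadyAddresses: false   # not-ready pods are dropped from endpoints
---
apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-headless  # addresses each node individually
spec:
  clusterIP: None
  publishNotReadyAddresses: true    # nodes stay listed even while not ready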

When doing a rolling upgrade, or taking down a pod for any reason, we want to first stop all requests to that pod. Solr handles this while stopping by first removing itself from the cluster's set of liveNodes, so that other Solr nodes and clients consider it not running. However, for clusters using ephemeral storage, the operator also evicts data before the pod is deleted, so we want to stop requests to the node since its data will soon no longer live there.

Kubernetes allows the Solr Operator to control whether a pod is considered ready, and therefore available for requests, via readinessGates and pod conditions. When the Solr Operator begins the shut-down procedure for a pod, it first sets a readiness condition to false, so that no load-balanced requests (through the common service) go to the pod. This readiness condition stays false until the pod is deleted and a new pod is created in its place.
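
Conceptually, this relies on the standard Kubernetes readinessGates mechanism, sketched below; the conditionType here is a placeholder, not necessarily the exact condition name the operator registers:

apiVersion: v1
kind: Pod
metadata:
  name: example-solrcloud-0
spec:
  readinessGates:
    # Placeholder condition type. Before shutting a pod down, the operator
    # patches the matching entry in pod.status.conditions to "False", which
    # marks the pod not-ready and removes it from the common Service's endpoints.
    - conditionType: "solr.apache.org/nodeReady"
  containers:
    - name: solrcloud-node
      image: solr:8.11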

For this reason, it’s a good idea to avoid very aggressive Update Strategies. During a rolling restart with a high maxPodsUnavailable, requests that go through the common service might be routed to a very small number of pods.