_Since v0.2.7_
Solr Clouds are complex distributed systems, and thus require a more delicate and informed approach to rolling updates.
If the `Managed` update strategy is specified in the SolrCloud CRD, then the Solr Operator will take control over deleting SolrCloud pods when they need to be updated.
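For reference, here is a minimal sketch of selecting this strategy (field names follow the SolrCloud CRD's `updateStrategy` section; treat this as illustrative rather than a complete manifest):

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  updateStrategy:
    # "Managed" hands control of pod deletion during updates to the operator
    method: Managed
```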
The operator will find all pods that have not been updated yet and choose the next set of pods to delete for an update, given the following workflow.
Note: Managed updates are executed via Cluster Operation Locks; please refer to that documentation for more information about how these operations are executed.
The logic goes as follows:
1. Update all out-of-date pods whose Solr containers have not started. These pods do not count against the `maxPodsUnavailable` option, because these pods have not even started the Solr process.
2. Retrieve the cluster state of the SolrCloud, if there are any `ready` pods to query it from.
3. Sort the out-of-date pods by how safe they are to restart, and greedily select which of them to update. The maximum number of pods that can be updated at once is determined by starting with `maxPodsUnavailable`, then subtracting the number of updated pods that are unavailable as well as the number of not-yet-started, out-of-date pods that were updated in a previous step. This check makes sure that any pods taken down during this step do not violate the `maxPodsUnavailable` constraint.

The pods are sorted by a series of criteria, applied in order: if any two pods are equal on one criterion, then the next criterion is used to sort them.
In this context the pods sorted highest are the first chosen to be updated, and the pods sorted lowest are selected last.
Loop over the sorted pods until the number of pods selected to be updated has reached the maximum.
This maximum is calculated by taking the given, or default, `maxPodsUnavailable` and subtracting the number of updated pods that are unavailable or have yet to be re-created.
For example, with `maxPodsUnavailable: 3`, one updated pod that is still unavailable, and one out-of-date pod that has not yet been re-created, at most 3 - 1 - 1 = 1 pod can be selected in this pass.
Because the disruption of a rolling update is bounded by this limit, set `maxPodsUnavailable` to a value you are comfortable with (both it and `maxShardReplicasUnavailable` are configurable; see the sketch below this list). A pod is then selected, or skipped, as follows:

- If the pod is not `live`, the pod is chosen to be updated.
- If all replicas in the pod are in a `down` or `recovery_failed` state, the pod is chosen to be updated.
- Otherwise, if taking the pod down would not put any shard it hosts over `maxShardReplicasUnavailable`, then the pod can be updated.
Once a pod with replicas has been chosen to be updated, the replicas hosted in that pod are then considered unavailable for the rest of the selection logic.
- If a shard already has replicas that are unavailable for other reasons, the `maxShardReplicasUnavailable` calculation will take these replicas into account, as a starting point.
- Replicas that are not `active` (for example, ones still recovering) also count toward the `maxShardReplicasUnavailable` calculation.
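As referenced above, both limits live under the `managed` section of the update strategy. A hedged sketch (the values are illustrative; per the SolrCloud CRD these fields accept an absolute number or a percentage string):

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  updateStrategy:
    method: Managed
    managed:
      # No more than 2 pods may be unavailable at any point during the update
      maxPodsUnavailable: 2
      # No more than 1 replica of any shard may be taken down at once
      maxShardReplicasUnavailable: 1
```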
Given these complex requirements, `kubectl rollout restart statefulset` will generally not work on a SolrCloud.
One option to trigger a manual restart is to change one of the `podOptions` annotations. For example, you could set an annotation to the date and time of the manual restart:
```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
spec:
  customSolrKubeOptions:
    podOptions:
      annotations:
        manualrestart: "2021-10-20T08:37:00Z"
```
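Because this annotation is part of the pod template, applying the change (for example with `kubectl apply`) makes every pod out-of-date, and the operator then restarts the pods according to the configured update strategy.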
The Solr Operator sets up at least two Services for every SolrCloud.
Only the common service uses the `publishNotReadyAddresses: false` option, since the common service should load balance between all available nodes.
The other services have individual endpoints for each node, so there is no reason to de-list pods that are not available.
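For illustration, here is a minimal sketch of the Kubernetes field in question; this is not the exact manifest the operator generates, and the name and selector label are hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-solrcloud-headless  # hypothetical name
spec:
  clusterIP: None
  # Keep per-pod endpoints listed even while pods are not "ready", so
  # individual nodes stay directly addressable. The common service leaves
  # this false so it only balances across available nodes.
  publishNotReadyAddresses: true
  selector:
    solr-cloud: example  # hypothetical selector label
```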
When doing a rolling upgrade, or taking down a pod for any reason, we want to first stop all requests to this pod.
Solr does this itself while stopping, by first removing itself from the cluster’s set of `liveNodes`, so that other Solr nodes and clients think it is not running.
However, for ephemeral clusters we are also evicting data before the pod is deleted, so we want to stop requests to this node since the data will soon no longer live there.
Kubernetes allows the Solr Operator to control whether a pod is considered `ready`, or available for requests, via readiness conditions and readinessGates.
When the Solr Operator begins the shut-down procedure for a pod, it first sets a readiness condition to `false`, so that no load-balanced requests (through the common service) go to the pod.
This readiness condition stays `false` until the pod is deleted and a new pod is created in its place.
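Mechanically, a readiness gate is declared in the pod spec, and the matching condition is then patched in the pod's status. A generic sketch follows; the `conditionType` is a hypothetical placeholder, not the operator's actual condition name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-solrcloud-0  # hypothetical name
spec:
  readinessGates:
    # The pod only counts as "ready" while this custom condition is True;
    # a controller can set it to False to drain traffic before shutdown.
    - conditionType: example.com/solr-ready  # hypothetical placeholder
  containers:
    - name: solr
      image: solr:9  # hypothetical image
```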
For this reason, it’s a good idea to avoid very aggressive update strategies.
During a rolling restart with a high `maxPodsUnavailable`, requests that go through the common service might be routed to a very small number of pods.