Solr Operator Documentation

Cluster Operations

Since v0.8.0

Solr Clouds are complex distributed systems, and thus any operations that deal with data availability should be handled with care.

Cluster Operation Locks

Since cluster operations deal with Solr’s index data (either its availability or its movement), it’s safest to allow only one operation to take place at a time. That is why these operations must first obtain a lock on the SolrCloud before execution can start.
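The one-operation-at-a-time rule can be illustrated with a minimal sketch. It models the StatefulSet's annotations as a plain dictionary and the lock as a single entry in it. This is an illustration under stated assumptions, not the Solr Operator's implementation: the annotation key and the function names `try_acquire` and `release` are hypothetical.

```python
from typing import Dict

# Hypothetical annotation key, used only for this illustration.
LOCK_KEY = "example.invalid/cluster-op-lock"


def try_acquire(annotations: Dict[str, str], operation: str) -> bool:
    """Take the lock for `operation` unless another operation already holds it."""
    if LOCK_KEY in annotations:
        return False  # some operation is already running
    annotations[LOCK_KEY] = operation
    return True


def release(annotations: Dict[str, str], operation: str) -> bool:
    """Give up the lock, but only if `operation` is the current holder."""
    if annotations.get(LOCK_KEY) != operation:
        return False
    del annotations[LOCK_KEY]
    return True


# Usage: a second operation has to wait until the first releases the lock.
annotations: Dict[str, str] = {}
first = try_acquire(annotations, "RollingUpdate")   # lock is free, so this succeeds
second = try_acquire(annotations, "ScaleDown")      # lock is held, so this fails
release(annotations, "RollingUpdate")
third = try_acquire(annotations, "ScaleDown")       # lock is free again
```

Keeping the holder's name as the lock value means a release from a non-holder is rejected rather than silently clearing another operation's lock.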

Lockable Operations

How is the Lock Implemented?

The lock is implemented as an annotation on the SolrCloud’s StatefulSet. The cluster operation retry queue is also implemented as an annotation. These locks can be viewed at the following annotation keys:

Avoiding Deadlocks

If all cluster operations executed without any issues, there would be no need to worry about deadlocks. Cluster operations give up the lock when the operation is complete, and then other operations that have been waiting can proceed. Unfortunately, these cluster operations can and will fail for a number of reasons:

If this is the case, then we need to be able to stop the locked cluster operation if it hasn’t succeeded within a certain time period. The cluster operation can only be stopped if there is no background task (async request) being executed in the Solr Cluster. Once the cluster operation reaches a point at which it can stop, and either the locking timeout has been exceeded or an error was found, the cluster operation is paused and added to a queue to be retried later. The timeout is different per-operation:

Immediately afterwards, the Solr Operator checks whether any other operations need to take place before the queued cluster operation is restarted. This allows users to make changes that fix the reason why the cluster operation was failing. Examples:

If a queued operation is going to be retried, the Solr Operator first makes sure that its values are still valid. For the Scale Down example above, when the Solr Operator tries to restart the queued Scale Down operation, it sees that the SolrCloud.Spec.Replicas is no longer lower than the current number of Solr Pods. Therefore, the Scale Down does not need to be retried, and a “fake” Scale Up needs to take place.
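The pause-and-retry flow described in this section can be sketched as two small checks. This is a simplified model under stated assumptions, not the Solr Operator's code: the `ClusterOp` type, the function names, and the timeout value in the usage lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterOp:
    kind: str                    # e.g. "ScaleDown"
    started_at: float            # time the lock was taken, in seconds
    timeout: float               # per-operation locking timeout (hypothetical value)
    error: Optional[str] = None  # set when the operation has hit an error


def can_pause(op: ClusterOp, now: float, async_task_running: bool) -> bool:
    """An operation may be paused only when no async request is in flight,
    and only if it has exceeded its timeout or hit an error."""
    if async_task_running:
        return False
    timed_out = (now - op.started_at) > op.timeout
    return timed_out or op.error is not None


def next_for_queued_scale_down(spec_replicas: int, current_pods: int) -> str:
    """Re-validate a queued Scale Down before retrying it."""
    if spec_replicas < current_pods:
        return "ScaleDown"  # still valid, so retry it
    # Spec.Replicas is no longer lower than the pod count:
    # the text calls for a "fake" Scale Up instead.
    return "ScaleUp"


# Usage: a Scale Down that timed out, after the user restored the replica count.
op = ClusterOp(kind="ScaleDown", started_at=0.0, timeout=60.0)
paused = can_pause(op, now=61.0, async_task_running=False)
follow_up = next_for_queued_scale_down(spec_replicas=5, current_pods=5)
```

Checking for a running async request first reflects the rule that an operation cannot be stopped while a background task is still executing in the Solr Cluster, even if its timeout has already passed.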

In the case of an emergency

When all else fails, and you need to stop a cluster operation, you can remove the lock annotation from the StatefulSet manually.

Edit the StatefulSet (e.g. kubectl edit statefulset <name>) and remove the cluster operation lock annotation:

This can be done via the following command:

$ kubectl annotate statefulset ${statefulSetName}

This will only remove the currently running cluster operation. If other cluster operations have been queued, they will be retried once the lock annotation is removed. Also, if the operation still needs to occur to put the SolrCloud in its expected state, then the operation will be retried once a lock can be acquired. The only way to keep the cluster operation from running again is to put the SolrCloud back into its previous state (for scaling, set SolrCloud.Spec.replicas to the value found in StatefulSet.Spec.replicas). If the SolrCloud requires a rolling restart, it cannot be “put back to its previous state”. In that case, the only way to move forward is to either delete the StatefulSet (a very dangerous operation) or find a way to allow the RollingUpdate operation to succeed.