Lifecycle¶
The operator manages database migrations and application initialization through dedicated lifecycle tasks. This page covers configuration, behavior, and troubleshooting.
Overview¶
The spec.lifecycle section controls up to four sequential tasks:
- clone — database snapshot from an external source (staging workflows)
- migrate —
superset db upgrade(database schema migration) - rotate —
superset re-encrypt-secrets(secret key rotation) - init —
superset init(application initialization: roles, permissions)
Tasks run as parent-owned Jobs. The parent
Superset controller orchestrates sequencing, gating, re-runs, Job lifecycle,
retries, and timeouts, and stores durable task state in
status.lifecycle.
Lifecycle is enabled by default even when spec.lifecycle is nil; disable it
explicitly with spec.lifecycle.disabled: true.
Key behaviors:
- Clone must complete before migrate starts; migrate before rotate; rotate before init
- Components are not created or updated until all enabled tasks complete
- When config or image changes require a re-run, the parent deletes the old task Job and creates a fresh one
Task Triggers¶
Each task has hardcoded trigger inputs — what it watches for changes:
| Task | Watches | Re-runs when... |
|---|---|---|
| Clone | trigger field, cronSchedule tick, source config, excludes, target Superset image |
Trigger value changes, schedule tick boundary crossed, source DB config changes, or target image changes (so a downgrade triggers a fresh clone before migrate sees the older schema) |
| Migrate | Image (resolved lifecycle image), metastore.createDatabase flag |
Image tag or repository changes, or createDatabase is toggled |
| Rotate | trigger field, secretKeyFrom ref, previousSecretKeyFrom ref |
Secret key references change or trigger value changes |
| Init | Config checksum (rendered Python config) | Any config-affecting field changes |
All tasks also re-run when an upstream task re-executes (automatic propagation).
Manual Trigger¶
Every task has a trigger field (on BaseTaskSpec) — an opaque string that
forces a re-run when changed. Changing a trigger also cascades to all downstream
tasks:
spec:
lifecycle:
migrate:
trigger: "force-2026-05-10" # forces migrate + init to re-run
init:
trigger: "reset-roles" # forces only init to re-run
Disabling Tasks¶
Set disabled: true to skip a task entirely:
Scheduled Execution¶
Tasks that support scheduling (currently clone) accept a cronSchedule field —
a standard 5-field cron expression that triggers periodic re-execution:
spec:
environment: Staging
lifecycle:
clone:
cronSchedule: "0 2 * * *" # daily at 2 AM UTC
source:
host: postgres-prod.db.svc
database: superset_prod
username: prod_reader
passwordFrom:
name: prod-reader-creds
key: password
When a schedule is configured, the operator automatically re-runs the full
lifecycle pipeline (clone → migrate → rotate → init) each time a cron tick boundary is
crossed. The trigger field remains functional for manual overrides on top of
the schedule — both contribute independently.
How it works:
- The operator computes the "current tick" (most recent past time matching the expression) and includes it in the task checksum
- When the clock crosses a cron boundary, the tick changes, the checksum changes, and the pipeline re-runs
- The operator requeues itself to wake at the next cron tick
- If the operator is down during a scheduled tick, it catches up on the next reconcile
- If the pipeline is still running when a tick fires, it completes normally; the new tick is detected afterward and triggers one re-run (no backlog accumulation)
Status reporting:
The clone task status includes lastScheduledAt (the tick that triggered the
most recent run) and nextScheduleAt (the next future tick).
Alternative — external CronJob:
For teams that prefer external scheduling, a Kubernetes CronJob can patch
the trigger field on a cron schedule. This requires a CronJob resource,
ServiceAccount, RoleBinding, and a kubectl image, but keeps the scheduling
logic outside the operator.
When disabled, the task's Job and ConfigMap are deleted, its projected status is cleared from the parent, and it does not participate in the pipeline. Downstream tasks still run but don't receive propagation from the disabled task.
Upgrade Mode¶
The upgradeMode field controls how image upgrades are handled:
- Automatic (default) — tasks run immediately when an image change is detected
- Supervised — tasks wait for an annotation-based approval before running
When an image change is detected in supervised mode, the operator sets
status.phase: AwaitingApproval and records the upgrade context in
status.lifecycle.upgrade. Approve the upgrade by annotating the CR with the
recorded approval token:
TOKEN=$(kubectl get superset my-superset \
-o jsonpath='{.status.lifecycle.upgrade.approvalToken}')
kubectl annotate superset my-superset \
superset.apache.org/approve-upgrade="${TOKEN}" --overwrite
The approval is consumed once: the operator clears the annotation automatically after lifecycle tasks complete or the image change is otherwise settled. This prevents a stale approval annotation from approving a later image change. If the target image changes while an upgrade is awaiting approval, the approval token changes and the new transition must be approved separately. You can monitor the upgrade status with:
The operator also performs semver comparison on image tags and blocks downgrades
to prevent accidental database corruption. A blocked downgrade sets
status.phase: Blocked — revert the image tag to resolve.
Downgrade protection requires semantically comparable (semver) image tags.
Non-semver tags — such as latest, date stamps, or digest pins — cannot be
ordered, so the operator cannot detect a downgrade and lifecycle tasks proceed
normally. In that case it emits a VersionComparisonSkipped warning event
rather than blocking. Use semver tags in Staging and Production if you rely on
downgrade protection.
Drain Behavior¶
Each task declares whether it requires components to be drained (scaled to zero) before execution. The operator drains once before the first pending task that requires it when at least one configured component has desired replicas greater than zero, and recreates components after the pipeline completes. A config-only change that only re-runs the default init task does not drain components.
| Task | Default requiresDrain |
Rationale |
|---|---|---|
| Clone | true |
DROP DATABASE fails with active connections |
| Migrate | true |
Schema changes risk deadlocks and version/schema inconsistencies |
| Rotate | true |
After re-encryption, stored secrets use the new key — components with the old key would fail to decrypt |
| Init | false |
Role/permission operations are safe with components running |
Override per-task when needed:
spec:
lifecycle:
migrate:
requiresDrain: false # opt-in to rolling migrations (additive changes only)
init:
requiresDrain: true # force drain before init (rare)
During a drain, Ingress/HTTPRoute and NetworkPolicy resources are preserved because they are owned by the parent CR. Once all lifecycle tasks complete, components are recreated and traffic resumes automatically.
Maintenance Page¶
By default, the web UI is unreachable while components are drained. To avoid user confusion during upgrades, enable the maintenance page:
When enabled, the operator spins up a lightweight maintenance page before draining components and redirects all traffic from the web-server Service to it. The maintenance page is only started when a drain will actually run and the web-server component already has an existing workload. Initial installs skip the maintenance page because there is no existing web traffic to preserve. The Service name and ClusterIP are preserved, so Ingress, HTTPRoute, and direct Service consumers continue working without interruption. After lifecycle tasks complete, traffic is returned to the web-server pods automatically.
Parent Superset events report operator-level lifecycle milestones such as maintenance routing, drain start/completion, task starts, retries, and failures. Kubernetes-native Deployment, ReplicaSet, Job, and Pod events remain the source for lower-level workload creation and scaling details.
All paths return a 302 redirect to /, which serves the maintenance HTML page
with a 30-second auto-refresh.
Customizing the message¶
spec:
lifecycle:
maintenancePage:
title: "Upgrade in Progress"
message: "Superset is being upgraded to v4.1. Expected downtime: 10 minutes."
For complete control over the HTML page, use body:
spec:
lifecycle:
maintenancePage:
body: |
<!DOCTYPE html>
<html>
<head><title>Maintenance</title><meta http-equiv="refresh" content="30"></head>
<body><h1>We'll be right back</h1><p>02:00–03:00 UTC</p></body>
</html>
Custom image¶
For advanced use cases (custom branding, dynamic status pages, HA), provide your own image. It must serve HTTP on the web-server port (default 8088):
spec:
lifecycle:
maintenancePage:
message: "Back in 30 minutes"
image:
repository: my-org/maintenance-server
tag: v2
replicas: 2
podTemplate:
container:
command: ["/serve", "--port=8088", "--redirect-all=/"]
resources:
requests:
cpu: 50m
memory: 64Mi
In custom mode, title, message, and body are passed as environment variables
(SUPERSET_OPERATOR__MAINTENANCE_TITLE, SUPERSET_OPERATOR__MAINTENANCE_MESSAGE,
SUPERSET_OPERATOR__MAINTENANCE_BODY) — your image can use them for dynamic content.
Lifecycle Flow¶
The following diagram shows the lifecycle pipeline. Tasks execute sequentially; components are drained before the first task that requires it, and recreated after the pipeline completes.
%%{init: {'theme': 'neutral', 'themeVariables': {'fontSize': '12px'}}}%%
flowchart TD
A[Reconcile] --> B{Lifecycle disabled?}
B -->|Yes| Z[Components reconcile normally]
B -->|No| C{upgradeMode}
C -->|Automatic| E
C -->|Supervised| D[Await approval]
D --> E{Any task requires drain?}
E -->|Yes| MP{Maintenance page configured?}
E -->|No| G
MP -->|Yes| MP1[Deploy maintenance page, switch Service selector]
MP -->|No| F
MP1 --> F[Drain: delete component resources, wait for pod termination]
F --> G[Clone task]
G -->|checksum match| H[Skip]
G -->|checksum mismatch| G1[Execute clone job]
G1 --> H
H --> I[Migrate task]
I -->|checksum match| J[Skip]
I -->|checksum mismatch| I1[Execute migrate job]
I1 --> J
J --> R[Rotate task]
R -->|checksum match| S[Skip]
R -->|checksum mismatch| R1[Execute rotate job]
R1 --> S
S --> K[Init task]
K -->|checksum match| L[Skip]
K -->|checksum mismatch| K1[Execute init job]
K1 --> L
L --> M[Pipeline complete]
M --> N[Recreate component resources]
N --> Z
Disabled tasks are removed from the pipeline entirely (not shown as "skip").
Custom Commands¶
spec:
lifecycle:
migrate:
command: ["/bin/sh", "-c", "superset db upgrade && custom-migrate"]
init:
command: ["/bin/sh", "-c", "superset init && custom-seed"]
Both adminUser and loadExamples (see below) are mutually exclusive with a
custom lifecycle.init.command — when using these fields, the operator
constructs the full init command automatically.
Auto-Creating the Metastore Database¶
Setting metastore.createDatabase: true attaches an idempotent init container to the migrate Job that runs CREATE DATABASE against the server before superset db upgrade. This avoids a chicken-and-egg pre-install step on fresh PostgreSQL/MySQL servers. See Auto-creating the database for full details, including the privilege requirement on the metastore user. The flag is redundant alongside lifecycle.clone (which already drops and re-creates the target database) but harmless.
Timeout, Retries, and Pod Retention¶
Each task has configurable timeout and retry behavior:
spec:
lifecycle:
podRetention:
policy: Retain # Delete | Retain | RetainOnFailure (default)
migrate:
timeout: 10m # max time per attempt (default: 5m)
maxRetries: 5 # attempts before permanent failure (default: 3)
init:
timeout: 5m
maxRetries: 3
On failure, the operator retries with exponential backoff (10s * 2^(attempt-1),
capped at 5m). If a Job exceeds the timeout while Running or Pending, it counts
as a failed attempt.
If a task pod cannot start at all — a CreateContainerConfigError (for example a
runAsNonRoot policy the image can't satisfy), an image pull failure, or an
unschedulable pod — the operator surfaces the reason on the Superset status and
events rather than silently waiting. Once you change the task's pod
configuration (securityContext, resources, image, etc.), the operator
recreates the task Job from the corrected spec automatically. This also rescues a
task that has already exhausted its retries: editing how the pod runs lets it run
again, with no need to delete the Job by hand. A purely application-level failure
(unchanged pod, e.g. a broken migration) stays terminal so it does not loop.
Task retention policies:
| Policy | On Success | On Failure |
|---|---|---|
Delete |
Job and Pods deleted | Job and Pods deleted |
Retain |
Job and Pods kept | Job and Pods kept |
RetainOnFailure (default) |
Job and Pods deleted | Job and Pods kept for debugging |
The default keeps only failed task Jobs and Pods so you can inspect logs without
cluttering the namespace with completed-success Jobs. Override to Retain
if you want the full history, or Delete to garbage-collect everything.
To inspect logs of a retained failed Job:
Admin User (Development Mode Only)¶
In Development mode, the operator can create an admin user during initialization:
spec:
environment: Development
lifecycle:
init:
adminUser:
username: admin # default
password: admin # default
firstName: Superset # default
lastName: Admin # default
email: admin@example.com # default
All fields have defaults, so adminUser: {} creates a user with
username/password admin/admin. The operator passes credentials as env vars
and appends a superset fab create-admin step to the init command. This field
is rejected in Production and Staging modes by CRD validation.
Load Examples (Development Mode Only)¶
Load Superset's example dashboards and datasets during initialization:
The operator appends a superset load-examples step to the init command. This
field is rejected in Production and Staging modes by CRD validation. Note that Superset's built-in
examples require an admin user with username admin — if you customize
adminUser.username, example loading may fail.
Lifecycle Pod Template¶
The spec.lifecycle section supports podTemplate with the same Pod and
container fields as other components (tolerations, nodeSelector, volumes, etc.
on podTemplate; env, resources, securityContext, etc. on
podTemplate.container), so task Job Pods inherit top-level scheduling and security
settings and can be customized independently:
spec:
lifecycle:
podTemplate:
container:
resources:
limits:
memory: "2Gi"
migrate:
command: ["/bin/sh", "-c", "superset db upgrade"]
Secret Key Rotation¶
The rotate task runs superset re-encrypt-secrets to re-encrypt stored secrets
when the application secret key is rotated. It runs after migrate and before init.
To enable secret key rotation, set previousSecretKey (dev mode) or
previousSecretKeyFrom (staging/production) on the parent spec, and add
lifecycle.rotate: {}:
apiVersion: superset.apache.org/v1alpha1
kind: Superset
metadata:
name: my-superset
spec:
secretKeyFrom:
name: superset-secret-v2
key: secret-key
previousSecretKeyFrom:
name: superset-secret-v1
key: secret-key
lifecycle:
rotate: {}
The operator injects both SECRET_KEY and PREVIOUS_SECRET_KEY into all Python
components. The rotate task Job uses both to decrypt stored secrets with the old
key and re-encrypt with the new one. After rotation completes, components restart
with the new key and can use PREVIOUS_SECRET_KEY for fallback decryption during
the transition.
The command is idempotent: re-running it skips already-converted values. If no
PREVIOUS_SECRET_KEY is set, it exits cleanly. If decryption fails for any entry,
the entire transaction rolls back.
Rotation Triggers¶
The task re-runs when:
- The
secretKeyFromorpreviousSecretKeyFromreferences change (different Secret name or key) - The
triggerfield changes (use this for in-place Secret content updates where the reference stays the same)
Drain¶
By default, the rotate task requires drain (requiresDrain: true). After
re-encryption commits, stored secrets are encrypted with the new key — components
still running with the old key would fail to decrypt them. Override with
requiresDrain: false only if you understand the implications.
Cleanup¶
After confirming rotation succeeded, remove previousSecretKeyFrom and the
lifecycle.rotate section. The previous secret key is no longer needed once all
components have restarted with the new key.
Clone (Development and Staging Mode Only)¶
The clone task creates a database snapshot from an external source into the CR's metastore before running migrations. This enables staging workflows where you test version upgrades against a copy of production data.
Clone is only allowed when environment: Development or environment: Staging
— it performs a destructive DROP DATABASE on the target metastore and must never
run against a production instance. Staging mode enforces secrets (like
Production) while still permitting clone operations.
Staging Workflow¶
The recommended pattern is a separate Superset CR for staging:
apiVersion: superset.apache.org/v1alpha1
kind: Superset
metadata:
name: superset-staging
spec:
environment: Staging
image:
tag: "6.0.0" # version to test
secretKeyFrom:
name: staging-secret
key: secret-key
metastore:
type: PostgreSQL
host: postgres-staging.db.svc
database: superset_staging
username: superset_admin # needs CREATEDB rights
passwordFrom:
name: staging-db-creds
key: password
lifecycle:
clone:
trigger: "2026-05-09-v1" # change to re-clone
source:
host: postgres-prod.db.svc
database: superset_prod
username: prod_reader # read-only on production
passwordFrom:
name: prod-reader-creds
key: password
excludeTables:
- tab_state
excludeTableData:
- logs
- query
timeout: 30m
migrate: {}
init: {}
webServer: {}
celeryWorker: {}
# celeryBeat intentionally omitted — prevents alert double-triggers
The lifecycle pipeline runs: clone → migrate → rotate → init → components. Components
are not deployed until all enabled tasks complete, and clone always drains existing
components before running (DROP DATABASE fails with active connections). Only
the tasks you configure run; rotate, for example, is skipped when no
lifecycle.rotate spec is set.
Clone Trigger and Scheduling¶
The clone task runs when its checksum changes. Two mechanisms trigger re-execution:
triggerfield — an opaque string (date, UUID, CI build ID). Changing it causes a re-clone. Use this for manual or CI-driven refreshes.cronSchedulefield — a 5-field cron expression for periodic re-execution. When the clock crosses a cron boundary, the task checksum changes automatically.
To disable clone without removing its configuration, set disabled: true.
Table Exclusion¶
excludeTables— tables excluded entirely (schema and data). Use for tables that are not needed and can be recreated by migrations (e.g.,tab_state).excludeTableData— schema is dumped but data is not. Use for large tables where migrations expect the schema to exist but the data is not needed for testing (e.g.,logs,query).
Custom Clone Command¶
By default the operator constructs a streaming pg_dump | psql (or
mysqldump | mysql) command. Override it for custom workflows (anonymization,
database-native snapshots, etc.):
spec:
lifecycle:
clone:
command: ["/bin/sh", "-c", "custom-clone-script.sh"]
image:
repository: my-registry/custom-tools
tag: "v1"
source:
host: postgres-prod.db.svc
database: superset_prod
username: prod_reader
passwordFrom:
name: prod-reader-creds
key: password
When a custom command is set, the operator still injects all env vars
(SUPERSET_OPERATOR__CLONE_SRC_* and SUPERSET_OPERATOR__DB_*) so your script
can use them.
Clone Image¶
The clone pod uses a database-tool image (not the Superset image):
| Source type | Default image |
|---|---|
postgresql |
postgres:17-alpine |
mysql |
mysql:8-alpine |
Override with clone.image if you need additional tools.
Requirements¶
- Metastore must use structured mode (host, database, username) — passthrough URI mode is not supported for clone.
- Metastore user must have CREATEDB rights — the clone drops and recreates the target database.
- Source user should be read-only — the clone only reads from production.
- Network access — the clone pod needs egress to both source and target databases. Configure NetworkPolicy accordingly.
How It Works Under the Hood¶
Pipeline Checksum Model¶
The lifecycle uses a checksum chain to determine which tasks execute and
which skip. Each task receives an incoming checksum from the previous stage,
adds its own unique inputs, and produces a task checksum. On completion, that
checksum is stored in the parent status.lifecycle field and becomes the
incoming checksum for the next task.
parentUID (stable anchor)
↓
clone.checksum = hash(parentUID, "Clone", command, trigger, scheduleTick, source, excludes, postCloneSQL, cloneImage, targetImage)
↓ (stored in parent status on completion)
migrate.checksum = hash(clone.status.checksum, "Migrate", command, trigger, image)
↓ (stored in parent status on completion)
rotate.checksum = hash(migrate.status.checksum, "Rotate", command, trigger, image, secretKey, secretKeyFrom, previousSecretKey, previousSecretKeyFrom)
↓ (stored in parent status on completion)
init.checksum = hash(rotate.status.checksum, "Init", command, trigger, image, configChecksum)
The universal rule: a task executes when its computed checksum differs from the completed checksum stored in parent status. If they match, the task skips.
Upstream propagation is automatic: when clone re-runs (e.g., trigger changed), its status checksum changes. That new value flows into migrate's checksum computation, making it differ from its stored value — so migrate re-runs too. The chain continues transitively through rotate to init.
Isolation by design: each task watches only its own relevant inputs.
Migrate is image/schema-version driven and intentionally ignores feature/config
changes — a config tweak does not re-run migrations. Init is the
config-sensitive task: rendered superset_config.py changes propagate here.
Clone tracks both the source connection identity and the target Superset
image, so a staging image change (including a downgrade) triggers a fresh clone
before migrations re-run. Rotate fires on secret-key transitions.
What checksums do NOT cover: checksums hash task-semantic inputs only.
Pod-level fields like resource requests/limits, node selectors, tolerations,
affinity, and other podTemplate knobs are not part of the task checksum and
do not by themselves trigger a re-run. Such changes apply on the next execution
that is triggered by a semantic input change. This is intentional: a pure
scheduling tweak should not, for example, re-run a destructive clone.
Why Jobs¶
- Idempotent creation — Each task uses a deterministic Job name, so repeated reconciles cannot create duplicate clone/migrate/init executions.
- Controlled retries — Jobs use
backoffLimit: 0; the operator decides when and how to retry with configurable max attempts and exponential backoff. - Durable checkpoints — Task completion is stored on the parent status before the operator advances to the next task or deletes completed Jobs.
Job State Machine¶
Task Jobs transition through these states:
- Pending — No Job exists yet. The operator creates one.
- Running — Job is executing. If it exceeds the timeout, Kubernetes marks the Job failed through
activeDeadlineSeconds. - Succeeded → Complete — Task is done; the next task (or components) can proceed.
- Failed — If
attempts < maxRetries, the operator waits for exponential backoff, deletes the failed Job, and creates a replacement. Ifattempts >= maxRetries, the task is permanently failed.
Job Naming and Discovery¶
Jobs use deterministic names ({parent}-{task}, e.g. my-superset-migrate).
The operator reads Jobs by name and also labels them with
superset.apache.org/instance and superset.apache.org/init-task for querying.
Task Job Pod Spec¶
Task Job Pods inherit scheduling, security, volumes, and env from the top-level
podTemplate, just like other components. Key fields:
- Image: From
spec.image - Command: From
spec.lifecycle.migrate.commandorspec.lifecycle.init.command(defaults:superset db upgradeandsuperset init) - Config: Mounted from the task ConfigMap (
{parent}-{task}-config) - Env vars: Database credentials, secret key (via plain env vars in dev mode, or
valueFrom.secretKeyRefwhen*Fromfields are used) - Resources: From
spec.lifecycle.podTemplate.container.resourcesif set - Service account: Inherited from parent spec
- Restart policy: Always
Never— the operator handles retries
Status Reporting¶
Lifecycle task progress is tracked per-task in the parent status:
status:
lifecycle:
phase: Complete # Cloning | Draining | Migrating | Rotating | Initializing | Restoring | Complete | Blocked | AwaitingApproval
clone:
state: Complete # Pending | Running | Complete | Failed (only present when clone is enabled)
attempts: 1
migrate:
state: Complete # Pending | Running | Complete | Failed
startedAt: "2026-03-16T10:00:00Z"
completedAt: "2026-03-16T10:00:12Z"
attempts: 1
image: apache/superset:latest
init:
state: Complete
startedAt: "2026-03-16T10:00:13Z"
completedAt: "2026-03-16T10:00:22Z"
attempts: 1
image: apache/superset:latest
Task Job names are deterministic: {parentName}-{taskType} (e.g. my-superset-migrate). Inspect the Job directly with kubectl get job my-superset-migrate.
Parent phase values related to lifecycle:
| Phase | Meaning |
|---|---|
Initializing |
First deployment — lifecycle tasks running for the first time |
Upgrading |
Image change detected — lifecycle tasks running against new version |
Blocked |
Downgrade detected — lifecycle tasks will not run (manual intervention required) |
AwaitingApproval |
Supervised upgrade mode — waiting for the target-bound approval annotation before proceeding |
Drain progress appears in the Lifecycle column and
status.lifecycle.phase=Draining, while the top-level phase remains
Initializing or Upgrading.
After lifecycle tasks finish, status.lifecycle.phase=Restoring while the
operator recreates component workloads and waits for them to become ready. It
switches to Complete once the enabled components are available.