fizz.today

Kubeflow version is a matrix, so snapshot runtime plus reconciler inputs

I asked a simple platform question: what version of Kubeflow are we on?

The useful answer was not one number.

It was a matrix.

What we saw in one cluster

At the same time, all of these were true:

If you only report the control-plane version, you miss runtime drift.

If you only report one upgraded component, you miss fallback defaults that can reapply older behavior on next reconcile.

Why this is an incident risk, not a dashboard nit

In controller-heavy platforms, version is distributed policy.

The system doesn’t ask, “what number do humans prefer to say?”

It asks, “what values are wired into reconcile logic right now?”

That distinction matters under restart and rollout pressure.

A cluster can look healthy and still carry downgrade pressure in controller inputs. The risk only becomes visible when reconcile runs again and resolves conflicting truths.

This is how teams end up saying “we upgraded this” and “it reverted” in the same incident while both statements are locally correct.

What to snapshot every time

You need both:

  1. Runtime truth: what images are running now.
  2. Reconciler truth: what config/env/defaults will be rendered next.

In practice, our minimum snapshot includes:

I published the script we now run:

The minimal policy that helps

Before calling a platform “upgraded,” require all lanes to agree:

If one lane disagrees, you’re partially converged, not done.

That wording alone improves operational decisions because “partially converged” triggers follow-up work instead of release notes optimism.

What this catches in practice

Longform context: https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/

#kubeflow #kubernetes #reliability #platform-engineering