Snowflake handled 6 million VARIANT rows. DuckDB crashed at 4,000,001.
I have a 6-million-row timeline parquet with an event_data column containing serialized JSON — 20 distinct key patterns across 3.8 million non-null rows. I wanted to know how Snowflake’s VARIANT compares to DuckDB’s new VARIANT type for extracting fields from that column.
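The "20 distinct key patterns" figure comes from grouping rows by their set of top-level JSON keys. A minimal sketch of that count — synthetic rows and pure Python here, since the real parquet isn't reproduced in this post:

```python
import json
from collections import Counter

def key_patterns(json_rows):
    """Count how many rows share each set of top-level JSON keys."""
    patterns = Counter()
    for raw in json_rows:
        if raw is None:  # skip NULL event_data rows
            continue
        patterns[frozenset(json.loads(raw))] += 1
    return patterns

rows = [
    '{"time": 132000, "payloadTeam": "home"}',
    '{"time": 85000, "payloadTeam": "away"}',
    '{"cardType": "yellow"}',
    None,
]
patterns = key_patterns(rows)
print(len(patterns))  # 2 distinct key patterns among the non-null rows
```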
DuckDB 1.5.1 crashed writing the VARIANT parquet. I binary searched the boundary: LIMIT 4,000,000 succeeds, LIMIT 4,000,001 crashes with an internal assertion error. Exactly 4 million — not a power of 2, not a page boundary. Confirmed on 1.6.0-dev12 too. Filed as duckdb/duckdb#21779.
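The boundary search is an ordinary bisection over `LIMIT` values. A sketch of the loop, with a stand-in `fails()` predicate where the real run re-executed the DuckDB `COPY ... TO` each time:

```python
def first_failing(lo, hi, fails):
    """Smallest n in (lo, hi] where fails(n) is True,
    assuming fails(lo) is False and fails(hi) is True."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Stand-in for "does COPY (SELECT ... LIMIT n) TO 'out.parquet' crash?"
fails = lambda n: n > 4_000_000
print(first_failing(1, 6_000_000, fails))  # 4000001
```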
Snowflake loaded all 6 million rows into a VARIANT column without complaint: PARSE_JSON($1:event_data::STRING) at ingest, then the : path accessor at query time.
The query — filter on the top-level event_name column, extract fields from the VARIANT:

```sql
SELECT match_id,
       event_data:time::NUMBER / 1000.0 AS countdown_sec,
       event_data:payloadTeam::STRING AS team
FROM timeline_variant
WHERE event_name = 'goal';
```
The numbers — all filtering on event_name = 'goal', extracting countdown time from JSON, run back-to-back in a single session:
|                               | 4M rows (1,750K goals) | 6M rows (1,814K goals) |
|-------------------------------|-----------------------:|-----------------------:|
| pandas json.loads per row     | 8,369ms                | 9,316ms                |
| DuckDB VARIANT read           | 10,817ms               | 💥                     |
| Snowflake VARIANT :accessor   | 2,974ms                | 5,853ms                |
| DuckDB json_extract_string    | 782ms                  | 688ms                  |
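The pandas baseline row is the naive approach: json.loads on every string, then pull the field out of the dict. A minimal stand-in, using plain lists instead of a DataFrame and the field names from the query above:

```python
import json

def countdown_secs(event_rows):
    """json.loads per row, extract time (ms), convert to seconds."""
    out = []
    for raw in event_rows:
        doc = json.loads(raw)
        out.append(doc["time"] / 1000.0)
    return out

rows = ['{"time": 132000, "payloadTeam": "home"}',
        '{"time": 85000, "payloadTeam": "away"}']
print(countdown_secs(rows))  # [132.0, 85.0]
```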
DuckDB’s string-based json_extract_string on a VARCHAR column — no VARIANT conversion, just parsing the JSON at read time — is 3.8x faster than Snowflake’s shredded VARIANT at 4M rows. The gap widens at 6M because Snowflake’s query time includes network fetch that scales with result size, while DuckDB runs in-process on my laptop. Snowflake’s VARIANT is mature and handles the heterogeneous schemas without flinching, but the comparison isn’t quite fair — one is remote compute, the other is local.
DuckDB’s own VARIANT read is the surprise loser. At 4M rows (the largest it can write), reading back from its own shredded parquet takes nearly 11 seconds — slower than pandas json.loads. The shredding overhead for 20 heterogeneous JSON schemas produces a parquet structure that’s more expensive to reassemble than to just parse the strings fresh.
The DuckDB json_extract_string numbers (782ms and 688ms) are suspiciously close across 4M and 6M — the 6M run is actually faster, which is OS page cache warming from running second, not a real scaling relationship. In a cold-cache scenario the 6M number would be higher.
The approach that crashes (DuckDB VARIANT) would theoretically be fastest because the JSON is pre-parsed at write time. The approach that works (DuckDB json_extract_string on raw VARCHAR) scans millions of JSON strings on every query and still wins everything.
688ms. Parse on read. No shredding required.