Typed compiler, not a templating engine
Column-level type inference across the full DAG. 35+ diagnostic codes with actionable suggestions. E013 blocks the PR before a row is written.
Rocky exists because the most expensive failures in modern data platforms aren’t slow queries. They’re trust failures. A column type changes upstream and a revenue dashboard quietly diverges for three days. A SELECT * pulls a new column nobody designed for. A Snowflake-only function lands in a Databricks-targeted project and only fails in prod. Warehouse spend doubles and nobody can attribute which model caused it. An auditor asks who changed fct_revenue.amount and when.
Rocky moves correctness into the compiler. The disasters above become things you catch before a row is written: a column-type change is E013 at compile, a rename’s blast radius is a rocky lineage-diff comment on the PR, a cost spike is a [budget] block that fails the run, and every run leaves a content-addressed record. These failures are invisible to the warehouse and out of reach of the templating layer above it. Rocky is the typed graph in between.
Column-level lineage at compile time
Every column traced through every transformation, before execution. rocky lineage-diff main lists per-column downstream blast radius for PR review.
Branches + content-addressed run records
Named branches as isolated schemas. rocky replay <run_id> inspects and verifies a run against its content-addressed record: per-model SQL hashes, row counts, and bytes. (Re-execution from the pinned record is on the roadmap.)
Per-model cost attribution
Cost is a column on every run, not a dashboard. [budget] blocks fail the run on overspend. rocky preview cost projects spend at PR time.
AI gated through the compiler
Every AI suggestion type-checks before it lands. Generate, compile, auto-fix, ship. The closed loop nobody else has.
Dialect-divergence lint
P001 catches Snowflake-only constructs in a Databricks project, and the reverse. Cross-warehouse teams stop discovering portability bugs in prod.
    curl -fsSL https://raw.githubusercontent.com/rocky-data/rocky/main/engine/install.sh | bash
    rocky playground my-first-project
    cd my-first-project
    rocky compile   # type-check
    rocky test      # run assertions locally
    rocky run       # materialize the DAG against local DuckDB

No credentials needed; the playground is DuckDB-backed and seeds itself on first run.
rocky run is the one-step path for local iteration and automation. For production or PR-gated deploys, split it into rocky plan (persists an auditable plan to .rocky/plans/<id>.json) and rocky apply <plan-id>.
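A minimal sketch of how the split flow might look in a PR-gated pipeline. Only rocky compile, rocky test, rocky plan, rocky apply <plan-id>, and the .rocky/plans/<id>.json location come from the text above. The helpers latest_plan_id and apply_latest, the ROCKY variable, and the idea of taking the newest file in the plans directory as the plan to apply are illustrative assumptions, not part of Rocky; the CLI may well report the plan id directly.

```bash
# Assumption: the rocky binary is on PATH; ROCKY can be overridden for dry runs.
ROCKY="${ROCKY:-rocky}"

# Hypothetical helper (not part of Rocky): print the id of the newest plan
# file in a plans directory, i.e. its file name without the .json suffix.
latest_plan_id() {
  local dir="${1:-.rocky/plans}"
  local newest=""
  local f
  for f in "$dir"/*.json; do
    [ -e "$f" ] || continue            # the glob matched nothing
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"                      # keep the most recently modified file
    fi
  done
  [ -n "$newest" ] || return 1
  basename "$newest" .json
}

# Hypothetical helper: apply the newest persisted plan, or fail loudly.
apply_latest() {
  local id
  id="$(latest_plan_id "${1:-.rocky/plans}")" || {
    echo "no plan found" >&2
    return 1
  }
  "$ROCKY" apply "$id"
}

# Assumed CI usage:
#   "$ROCKY" compile && "$ROCKY" test   # gate the PR
#   "$ROCKY" plan                       # persists .rocky/plans/<id>.json
#   apply_latest                        # after review or approval
```

Keeping the plan file as the hand-off between the two steps means the reviewed artifact and the applied artifact are the same file, which is the point of splitting rocky run in the first place.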
rocky run writes a single RunRecord to the state store. Three read commands project that record from different angles: rocky trace for causality and concurrency, rocky cost for per-model warehouse spend, and rocky replay for inspecting and verifying the run against its content-addressed record. One record, three views. Walk through it end-to-end in POC #17 (trace + cost + replay against the same run_id).
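As a rough sketch, the three views can be pulled for one run in a single pass. The form rocky replay <run_id> is documented above; passing the run id positionally to rocky trace and rocky cost is an assumption based on the POC description, and inspect_run and the ROCKY variable are hypothetical names introduced for this example.

```bash
# Assumption: the rocky binary is on PATH; ROCKY can be overridden for dry runs.
ROCKY="${ROCKY:-rocky}"

# Hypothetical wrapper (not part of Rocky): run the three read commands
# against the same run id and stop at the first one that fails.
inspect_run() {
  local run_id="${1:-}"
  local view
  if [ -z "$run_id" ]; then
    echo "usage: inspect_run <run_id>" >&2
    return 2
  fi
  for view in trace cost replay; do
    echo "== rocky $view $run_id =="
    # Assumed argument form for trace and cost; replay <run_id> is documented.
    "$ROCKY" "$view" "$run_id" || return 1
  done
}

# Assumed usage:
#   inspect_run <run_id>
```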
Rocky is built first for data platform engineers running production-critical, multi-tenant pipelines on Databricks: teams for whom silent failures cost real money and dbt Core has hit a ceiling. The trust primitives are most battle-tested there.
The next ring out is Snowflake and BigQuery shops evaluating SQLMesh that want correctness moved to the compiler rather than the planner and prefer SQL by default. The Snowflake and BigQuery adapters are in Beta today; see the Roadmap.