Rocky: the typed graph between your code and your warehouse

The typed graph between your code and whichever warehouse, table format, or query engine you've chosen. Branches, replay, column-level lineage, contracts, and per-model cost; storage and compute stay where they are.


The disasters Rocky prevents

Rocky exists because the most expensive failures in modern data platforms aren’t slow queries. They’re trust failures. A column type changes upstream and a revenue dashboard quietly diverges for three days. A SELECT * pulls a new column nobody designed for. A Snowflake-only function lands in a Databricks-targeted project and only fails in prod. Warehouse spend doubles and nobody can attribute which model caused it. An auditor asks who changed fct_revenue.amount and when.

Rocky moves correctness into the compiler. The disasters above become things you catch before a row is written: a column-type change is E013 at compile, a rename’s blast radius is a rocky lineage-diff comment on the PR, a cost spike is a [budget] block that fails the run, and every run leaves a content-addressed record. These failures are invisible to the warehouse and out of reach of the templating layer above it. Rocky is the typed graph in between.

Typed compiler, not a templating engine

Column-level type inference across the full DAG. 35+ diagnostic codes with actionable suggestions. E013 blocks the PR before a row is written.

Column-level lineage at compile time

Every column traced through every transformation, before execution. rocky lineage-diff main lists per-column downstream blast radius for PR review.
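
As a rough illustration of what a per-column blast radius means, the sketch below walks a column-level lineage graph breadth-first and collects everything downstream of a changed column. The graph contents, the `LINEAGE` name, and the `blast_radius` helper are hypothetical and chosen for illustration; this is not Rocky's implementation or its output format.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
LINEAGE = {
    "raw_orders.amount": ["stg_orders.amount"],
    "stg_orders.amount": ["fct_revenue.amount", "fct_orders.order_total"],
    "fct_revenue.amount": ["dash_revenue.monthly_total"],
    "raw_orders.status": ["stg_orders.status"],
}


def blast_radius(lineage, changed):
    """Return every column downstream of `changed`, sorted by name."""
    seen = set()
    queue = deque([changed])
    while queue:
        column = queue.popleft()
        for child in lineage.get(column, []):
            if child not in seen:  # guard against revisiting shared descendants
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    for column in blast_radius(LINEAGE, "stg_orders.amount"):
        print(column)
```

Because the graph is known before anything executes, a traversal like this can run at review time rather than after a failed run.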

Branches + content-addressed run records

Named branches as isolated schemas. rocky replay <run_id> inspects and verifies a run against its content-addressed record: per-model SQL hashes, row counts, and bytes. (Re-execution from the pinned record is on the roadmap.)
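
To make "content-addressed" concrete, here is a minimal sketch, assuming a record that stores a SQL hash, row count, and byte count per model and derives its own identifier from a hash of that content. The field names, the choice of SHA-256, and the `build_record` / `verify_record` helpers are assumptions for illustration; Rocky's actual record layout may differ.

```python
import hashlib
import json


def _digest(text):
    """SHA-256 hex digest of a string (hash choice is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(entries):
    """Serialize deterministically so equal content always hashes the same."""
    return json.dumps(entries, sort_keys=True, separators=(",", ":"))


def build_record(models):
    """Build a record whose id is derived from its own content.

    `models` maps a model name to a dict with `sql`, `row_count`, and `bytes`.
    """
    entries = {
        name: {
            "sql_hash": _digest(model["sql"]),
            "row_count": model["row_count"],
            "bytes": model["bytes"],
        }
        for name, model in models.items()
    }
    return {"id": _digest(_canonical(entries)), "models": entries}


def verify_record(record):
    """Recompute the content hash and compare it with the stored id."""
    return _digest(_canonical(record["models"])) == record["id"]


if __name__ == "__main__":
    record = build_record(
        {"fct_revenue": {"sql": "SELECT 1 AS amount", "row_count": 1, "bytes": 8}}
    )
    print(verify_record(record))
```

Any change to a model's SQL, row count, or byte count changes the recomputed hash, so a tampered or drifted record no longer matches its id.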

Per-model cost attribution

Cost is a column on every run, not a dashboard. [budget] blocks fail the run on overspend. rocky preview cost projects spend at PR time.
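
The idea of failing a run on overspend can be sketched as a simple guard over per-model costs. The `enforce_budget` function, the `BudgetExceeded` exception, and the USD figures are hypothetical; the sketch does not reproduce the `[budget]` block's real schema or how Rocky measures cost.

```python
class BudgetExceeded(Exception):
    """Raised when a run's total cost is above the configured limit."""


def enforce_budget(model_costs, max_usd):
    """Return the total cost, or raise if it exceeds `max_usd`.

    `model_costs` maps a model name to its cost in USD for this run.
    """
    total = sum(model_costs.values())
    if total > max_usd:
        # Name the costliest model so the failure is attributable.
        worst = max(model_costs, key=model_costs.get)
        raise BudgetExceeded(
            f"run cost {total:.2f} USD exceeds budget {max_usd:.2f} USD; "
            f"most expensive model: {worst}"
        )
    return total


if __name__ == "__main__":
    costs = {"stg_orders": 1.5, "fct_revenue": 2.0}
    print(enforce_budget(costs, max_usd=5.0))
```

Keeping cost per model, rather than only a run total, is what lets the failure message point at the model responsible.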

AI gated through the compiler

Every AI suggestion type-checks before it lands. Generate, compile, auto-fix, ship. The closed loop nobody else has.

Dialect-divergence lint

P001 catches Snowflake-only constructs in a Databricks project, and the reverse. Cross-warehouse teams stop discovering portability bugs in prod.

Get Started in 60 Seconds

curl -fsSL https://raw.githubusercontent.com/rocky-data/rocky/main/engine/install.sh | bash
rocky playground my-first-project
cd my-first-project
rocky compile            # type-check
rocky test               # run assertions locally
rocky run                # materialize the DAG against local DuckDB

No credentials needed; the playground is DuckDB-backed and seeds itself on first run.

rocky run is the one-step path for local iteration and automation. For production or PR-gated deploys, split it into rocky plan (persists an auditable plan to .rocky/plans/<id>.json) and rocky apply <plan-id>.

One trace, one cost graph, one replay handle

rocky run writes a single RunRecord to the state store. Three read commands project that record from different angles: rocky trace for causality and concurrency, rocky cost for per-model warehouse spend, and rocky replay for inspecting and verifying the run against its content-addressed record. One record, three views. Walk through it end-to-end in POC #17 (trace + cost + replay against the same run_id).

Who Rocky is for

Rocky is built first for data platform engineers running production-critical, multi-tenant pipelines on Databricks: teams for whom silent failures cost real money and dbt Core has hit a ceiling. The trust primitives are most battle-tested there.

The next ring out: Snowflake and BigQuery shops evaluating SQLMesh that want correctness moved to the compiler rather than the planner and prefer SQL by default. Those adapters are Beta today; see the Roadmap.

Coming from dbt? Run rocky import-dbt against your project, get a Rocky repo on disk, and ship the gains incrementally with no rewrite.

Using Dagster? dagster-rocky wraps the CLI as a ConfigurableResource: auto-discovery, asset checks, and Pipes through one subprocess hop.