Initial Thoughts on SQLMesh

For the past couple of weeks, I've been experimenting with SQLMesh, a new open source data modeling/transformation tool. SQLMesh is intentionally positioned to compete with dbt, which has become the go-to option for companies adopting the ELT paradigm of managing data transformations in their cloud data warehouse.

My team and I started adopting dbt a couple years ago. It's been a slow transition so far. For a variety of reasons, our primary data warehouse still uses traditional ETL. For now, we use dbt to support numerous other data marts and standardized reporting. Eventually, the plan is to completely transition all our data pipelines to the ELT pattern. We don't use dbt Cloud (and don't intend to). While I would say our efforts have so far been successful and beneficial, there have been some pain points along the way as our adoption has developed.

SQLMesh caught my attention a few months ago, not because it is the shiny new toy on the market, but because it was architected to directly address the pain points of dbt. Let's get into it.

Getting Started

The process of installing and trying SQLMesh is not that much different from dbt. Both are built as Python packages, so it's a simple pip install and you're off. In fact, SQLMesh makes this trial experience even simpler than dbt because you don't even need access to a cloud data warehouse to get started. It comes with and defaults to DuckDB, so you can try things out completely locally. In my case, we use Snowflake and I wanted to see how it worked with that, so I installed SQLMesh with the Snowflake extra.

python -m pip install "sqlmesh[snowflake,web]"

I also included the web extra because part of my goal was to try out the UI SQLMesh provides. Unlike dbt's Cloud IDE, SQLMesh's browser-based IDE is freely available and open source.

I created a sample SQLMesh project using its built-in init command, which creates a small demo project with a few sample models on a very tiny sample of data.

sqlmesh init sqlmeshtest

To configure SQLMesh to use Snowflake, I added a config.yaml file to the project with the connection details. SQLMesh uses the term gateway to refer to your database connections. In its simplest form, a gateway is just a single connection to your cloud data warehouse. But a SQLMesh project can actually involve a collection of database connections, potentially to separate database engines, so I guess a gateway is the term they came up with to represent these collections. Besides the connection to your data warehouse, you can have the project state (which is central to SQLMesh's function) stored in a different database like PostgreSQL. There are other connection types too, but I'm not going to go into that detail. Just know that you only need a connection to your data warehouse to get SQLMesh to work.
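
Here's a minimal sketch of what that config.yaml can look like. The account and credential values are placeholders, and the Postgres state connection is only there to illustrate the option:

gateways:
  snowflake:
    connection:
      type: snowflake
      account: <your_account>
      user: <your_user>
      password: <your_password>
      database: analytics
      warehouse: compute_wh
    # Optional: keep SQLMesh's state out of the warehouse
    state_connection:
      type: postgres
      host: localhost
      port: 5432
      database: sqlmesh_state
      user: <state_user>
      password: <state_password>

default_gateway: snowflake

model_defaults:
  dialect: snowflake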

User Interface

As I mentioned, SQLMesh comes with a browser-based UI that runs completely local to your machine. This is in stark contrast to dbt, which monetizes its cloud-based IDE. Even though it is free, SQLMesh didn't skimp on the UI. It is extremely functional and actually very nice looking, and it gives new users a point-and-click way into the different commands they need to learn. In addition to a code editor, you get documentation of your models and a lineage graph that includes column-level lineage. It's probably worth mentioning that this functionality is also available with dbt, but it doesn't come for free. In fact, I'm pretty sure that dbt, and some of the extensions that have sprouted up to support it, use the same package as SQLMesh to provide this functionality: SQLGlot, which the authors of SQLMesh also wrote and actively maintain.

In the UI, you can view and edit your models, see how they relate to each other, create new environments and execute plans (which I will discuss in a bit). You can also execute custom queries against your database and see how your models render to SQL.

Now, I'm not totally enamored with the SQLMesh UI. It looks nice and it has good functionality, but the coding experience is not as good as what I get using VS Code for dbt. VS Code has a free extension, dbt Power User, that is a very nice alternative to dbt Cloud.

SQLMesh can automatically format your SQL for you, and the UI does this by default whenever you save your file. I didn't know this at first, and I found it quite jarring when my precisely crafted SQL suddenly changed on me. You can disable that behavior by adding this to your config.yaml:

ui:
    format_on_save: false

I do like that SQLMesh offers a way to standardize SQL code out of the box. I've seen a lot of ugly SQL in my time, and it is quite annoying to go through someone else's code and spend 30 minutes just reformatting it. I just wish SQLMesh gave a little more control over how it formats. It does provide a few configuration options, which I won't cover in detail this time. But I've been writing SQL for a while, and I have a particular taste that doesn't mesh with SQLMesh's formatting options. (I might change my mind if I can get SQL keywords to be lowercase instead of uppercase.)
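
Just to give a flavor, the formatting options live under a format key in config.yaml. A sketch, with option names as I understand them from the docs (treat these as illustrative, not exhaustive):

format:
  normalize: true            # normalize identifier casing
  pad: 2
  indent: 2
  normalize_functions: lower # function names in lowercase
  leading_comma: false
  max_text_width: 100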

I couldn't figure out what keyboard shortcuts are available in the UI. There were a few visible in tooltips, but the main one I was missing was how to execute a custom query. I don't like having to use my mouse to hunt down the run query button. None of the usual options like Ctrl+Enter or F5 seemed to work.

I couldn't seem to get multiple file tabs to stay open, so I was constantly going back to the file tree to refer back to another model file. Adding new files was also awkward, since it initially creates the file with an odd random string of letters as the name. I'd prefer it just prompted me for the file name first, to save me the step of renaming it.

This next point probably sounds very nit-picky (and maybe I am). The code editor does this odd thing with comparison operators. For example, if you input a less than or equal to operator, <=, the UI will combine them into a single ≤ character. If you inspect the actual file in a text editor like Notepad++, you will see that it is still the two characters in the file. So the UI is just altering how it is displayed. It looks neat, but I don't like it. It makes me feel like I'm coding in Word or something.

There are a few other papercuts with the editor, but I've spent a lot of words griping about the UI already, and there are plenty of other nice things I want to cover, so let's move on. In short, the UI is a fantastic feature for onboarding new users, but the coding experience is not quite at the level of other IDEs like VS Code.

Environments and Plans

Understanding the concept of environments and plans is central to grasping how SQLMesh differentiates itself from dbt. An environment in dbt is typically a separate schema or database. When you run dbt, all the selected models are recreated in the target environment. If you have table materializations (full or incremental), that run can be slow. Creating separate dev environments costs as much as creating the prod environment (unless you add some jinja to filter the data in dev). SQLMesh's approach to environments aims to make them lighter and cheaper than dbt's.

An environment in SQLMesh is also a separate schema, but all the models in the environment are views. Each view is a simple SELECT against the real model in another private schema that SQLMesh manages. It is in this private schema that SQLMesh materializes your models' tables or views. SQLMesh calls this concept a Virtual Environment.

When you want to create a new dev environment, SQLMesh creates a set of new schemas mirroring production, with the name of the environment appended to them, e.g. analytics__dev. (I'd prefer it prefixed the environment name, so all the schemas for an environment naturally grouped together.) By default, these new schemas start off referencing the same objects in the private schemas that prod references. Since views are fast to create, you very quickly get a dev environment that completely matches prod. It's also cheap, since you don't have to spend extra compute to regenerate any tables that already exist in prod.
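
To make the layering concrete, here is roughly what one of those views looks like. The schema names and the fingerprint suffix on the physical object are illustrative; SQLMesh generates its own:

-- The view in the virtual (environment) layer points at a fingerprinted
-- physical object that SQLMesh manages in its private schema:
CREATE VIEW analytics__dev.orders AS
SELECT * FROM sqlmesh__analytics.analytics__orders__3078928823;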

What happens next when you make a change in your new dev environment? That's where the SQLMesh plan comes in. dbt does not have anything analogous to a SQLMesh plan. When you run sqlmesh plan <target_env>, SQLMesh compares the code in your model files to the state of your target environment. It then determines which models are different and will need to be rebuilt. It also automatically determines whether those changes will affect any dependent downstream models. With column-level lineage, it can do this intelligently and avoid unnecessary downstream rebuilds, which saves compute costs. When you apply the plan, which SQLMesh refers to as backfilling, new tables/views are created in the private schemas, and the affected views in the target virtual environment are repointed to those new objects. The prod environment is unaffected by this because it still uses the original tables/views in the private schemas. This makes SQLMesh behave similarly to git branching and merging.

This design also makes deploying changes to prod very fast. Once the updated models in the dev environment are ready, running sqlmesh plan prod will apply a virtual update to prod. Because SQLMesh recognizes that the tables have already been built, it doesn't need to build them again; it simply re-points the affected views in the prod virtual environment to the new tables/views in the private schemas. Once again, you save compute costs because nothing that was already built and tested in dev gets recreated.
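
So the day-to-day loop ends up looking something like this:

sqlmesh plan dev     # diff local model code against the dev environment, then backfill the changes there
sqlmesh plan prod    # promote to prod; already-built tables are reused, so this is usually just a virtual update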

Compare this to dbt, where all your tables are recreated when you first build dev, changed tables are rebuilt in dev, and then those changed tables are rebuilt yet again in prod, including tables that may not have been semantically affected by upstream changes.

dbt Compatibility

Clearly, the authors of SQLMesh understand the challenge involved in competing with a product like dbt that has become so entrenched in the marketplace. Data/analytics engineers are not the type to jump at rewriting all their data pipelines for every shiny new option that pops up. To ease that pain and entice existing dbt users, SQLMesh supports dbt projects almost out of the box.

The "almost" part is because there are a few minor things you have to tweak to allow SQLMesh to run a dbt project. I won't go into great detail, but here are a few things I observed when I tried it out.

You need to install dbt into the same Python environment as your SQLMesh installation. SQLMesh seems to function as a wrapper around dbt to handle compiling the SQL code for the models.

You will need to modify the config for any incremental models to add the options SQLMesh's incremental models require, as well as update the jinja portion of your code to include SQLMesh-specific macros. These steps are described in SQLMesh's documentation.
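
As a rough illustration, adapted from the pattern in SQLMesh's dbt documentation (the model, table, and column names here are made up):

{{ config(
    materialized='incremental',
    time_column='event_date'
) }}

SELECT event_date, user_id, action
FROM {{ ref('stg_events') }}

{% if sqlmesh_incremental is defined %}
WHERE event_date BETWEEN '{{ start_ds }}' AND '{{ end_ds }}'
{% elif is_incremental() %}
WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}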

Any table or incremental dbt materializations will need to be dropped initially so that SQLMesh can rebuild them. This is because of how SQLMesh is architected to use view-based virtual environments: SQLMesh will not automatically migrate existing dbt-built tables into its private schemas. That's a one-time cost to consider if you have any really large incremental models.

Interval-based incremental models

You can tell SQLMesh has put a lot of thought into supporting incremental models. Their approach is a measurable improvement over dbt's. When you create an incremental model in dbt, dbt executes it in two different modes: the initial run and then incremental runs. You have to design your queries to support both modes. Depending on the amount of data and your query, the initial run may have performance issues that you have to work around. Then you have to add special jinja code to alter your query for the incremental runs.

Instead, SQLMesh uses an interval-based approach to incremental models. The byproduct of this design is that you only have to design your incremental models for one mode: the incremental runs. Instead of using conditional jinja to alter the query for incremental runs, you use the pre-defined SQLMesh macros @start_ds and @end_ds. You position them around your time column in your WHERE clause and SQLMesh takes care of the rest. Since SQLMesh tracks the state of your incremental models, it knows which time intervals have already been processed. Each time the model runs, the start and end macros are set to fill in only the missing intervals.
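
A minimal incremental model ends up looking something like this (the model and table names are mine, not from the demo project):

MODEL (
  name analytics.daily_events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  )
);

SELECT
  event_date,
  user_id,
  COUNT(*) AS event_count
FROM raw.events
WHERE event_date BETWEEN @start_ds AND @end_ds  -- SQLMesh fills in the interval bounds
GROUP BY event_date, user_id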

One aspect of this that I did not understand at first is how SQLMesh decides the values for @start_ds and @end_ds. I initially assumed that SQLMesh would automatically run a separate query for each interval, even when you are initially backfilling the model. So, if your intervals were months and you backfilled 12 months, there would be 12 separate queries invoked, with the start and end corresponding to the beginning and end of each month. However, that is not how it works. In reality, the default behavior for incremental models is very similar to dbt's: the initial backfill attempts to populate all the prior time intervals using one query, with @start_ds set to the first day of the first interval and @end_ds set to the last day of the last interval.

Does this mean SQLMesh incremental models are actually no better than dbt's? Not at all. The query still only needs to be written in one form, and since SQLMesh tracks the state of the intervals, we can modify the loading behavior to address performance when necessary. The two key properties are batch_size and batch_concurrency. If you set batch_size to 5, then SQLMesh will only run queries that fill at most 5 time intervals at a time. batch_concurrency limits how many of those queries SQLMesh will run at once for the model. This gives you more control to backfill large incremental models without resorting to workarounds or hacks.
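
In model terms, those properties sit on the incremental kind. Again a sketch, with arbitrary values:

MODEL (
  name analytics.daily_events,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date,
    batch_size 5,        -- fill at most 5 intervals per query
    batch_concurrency 2  -- run at most 2 of those queries at once
  )
);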

Multi-project support

This is a potential game changer. dbt works great for small teams, but there are challenges when you start to expand its footprint to other teams and projects. dbt recently introduced the contracts feature, which allows upstream projects to enforce standard interfaces for dependent projects. But it adds a burden on the upstream project to manually code those contracts, and they still have limited insight into what is happening downstream.

SQLMesh is able to track lineage between projects automatically via its shared state database. This means teams maintaining independent projects can still see end-to-end lineage across all upstream and downstream projects. It also makes contracts unnecessary, because an upstream project will instantly know when a change it makes will break something in a downstream project. Even better, the upstream maintainer can still apply that change and trigger the downstream project's models to rebuild!

External models

SQLMesh does not rely on functions like ref() or source() to interpret data lineage and compile the DAG. Instead, SQLMesh understands when a table name in your SQL is a SQLMesh model, which means the data lineage comes automatically. When a table name does not refer to a model but to an existing table in your data warehouse, that is what SQLMesh calls an external model. The dbt equivalent is a source.
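
So a model can reference other relations directly by name (illustrative names):

SELECT
  o.order_id,
  c.region
FROM analytics.stg_orders AS o  -- a SQLMesh model: lineage is picked up automatically
JOIN raw.customers AS c         -- not a model: SQLMesh treats it as an external model
  ON o.customer_id = c.customer_id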

This is one feature where dbt has the advantage. dbt allows you to add metadata and configurations to sources via YAML. You can specify the source database and schema, apply alternative names, document descriptions, specify a data freshness check, and more.

SQLMesh also uses a YAML file, but pretty much just to support column-level lineage. You can add descriptions to tables, but not much else beyond that. It would be nice to have the option to override the source database or schema.
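
The file is mostly a column manifest. You can write it by hand or generate it with the sqlmesh create_external_models command, and it looks roughly like this (table and column names are made up):

- name: raw.customers
  description: Customer records landed by our ingestion tool
  columns:
    customer_id: int
    region: text
    created_at: timestamp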

It should be noted that SQLMesh does not require you to declare your source tables like dbt does.

Looking ahead

There's more to SQLMesh than what I covered here. SQLMesh also has audits, which are similar to dbt tests. SQLMesh took this one step further and supports unit tests too, which let you easily test your models against a standard set of inputs. (As I write this, dbt just announced support for unit tests in its latest version.) SQLMesh, like dbt, also supports seeds and Python-based models. And while SQLMesh supports jinja too, SQLMesh's own macro system promises to be a far more powerful and elegant option, allowing macros to be written in SQL or Python.
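
For a flavor of audits: they are SQL queries that return the rows that violate your expectation, and the audit fails if any rows come back. A hedged sketch of a standalone audit (the name and column are mine):

AUDIT (
  name assert_no_negative_amounts
);

SELECT *
FROM @this_model
WHERE amount < 0  -- any rows returned cause the audit to fail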

My next steps are to assess the incremental benefit of moving from dbt to SQLMesh and weigh that against the cost of retraining my team and updating our processes. dbt compatibility helps reduce that cost, but there is still a lot to consider before implementing SQLMesh in a production capacity. There are also non-technical factors. dbt is still enhancing its features, like the recent release of unit tests. dbt is also well established, while SQLMesh is new to the market. And while SQLMesh is open source, the maintainers do seem to be working on some paid offerings around it to support larger enterprises. Only time will tell if SQLMesh will gain enough momentum to be around long term.

