Scheduling dbt with GitHub Actions: Pros & Cons


A step-by-step illustration using dbt & re_data

Idriss Chebak · re_data

Introduction

GitHub Actions is a powerful CI/CD automation tool for software development workflows and is fast becoming one of the go-to choices for pipeline orchestration. Alternatives include Airflow and dbt Cloud’s built-in job scheduler.

In this article, we will explore some of the reasons why GitHub Actions is one of the best pipeline orchestrators available today.

Why choose GitHub Actions?

First, GitHub Actions is tightly integrated with the rest of the GitHub ecosystem, making pipelines easier to use and manage. It lets you create workflows triggered by various events such as commits, pull requests, and releases, so you can automate pipelines that are tightly linked to your development process, reducing errors and increasing efficiency.
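
For instance, a minimal on: block reacting to all three event types might look like the sketch below (triggering only on published releases and only the main branch are assumptions; pick the events that fit your process):

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  release:
    types: [ published ]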

GitHub Actions is built on open standards: this means it can be easily integrated with other tools and services. This also makes it easy to create pipelines that span multiple stages: from development and testing to deployment and monitoring. You can also use various third-party actions and tools that help you further extend the functionality of your pipeline, such as sqlfluff.
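
As a sketch of such an extension, a standalone linting workflow using sqlfluff could look like this (the "ansi" dialect and the models/ path are assumptions; adjust them to your warehouse and project layout):

name: Lint SQL
on: [ pull_request ]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Lint dbt models with sqlfluff
        run: |
          pip install sqlfluff
          # "ansi" and "models/" are placeholders for your SQL dialect and model path
          sqlfluff lint models/ --dialect ansi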

Additionally, GitHub Actions has a large and active community. That means you’ll find a wealth of resources and best practices to help you get started and optimize your pipeline. This community offers a variety of tutorials, sample workflows, and other resources to help you get the most out of GitHub Actions.

Another aspect to emphasize is cost efficiency: GitHub Actions offers a free tier with a generous number of minutes per month, so you can use it for small or personal projects (knock knock.. entrepreneurs!). And, thanks to GitHub Actions Importer, you can now plan and automatically migrate your CI/CD pipelines to GitHub Actions from Azure DevOps, CircleCI, GitLab, Jenkins, and Travis CI.

How it works

Here is a simple illustration of how to use GitHub Actions to deploy a dbt project:

  1. Create a new workflow: in your GitHub repository, click the “New workflow” option in the Actions tab.
  2. Set the workflow trigger: choose when the workflow should run. For instance, you might want it to execute on a predetermined schedule or each time changes are pushed to the repository.
  3. Set up the environment: specify, for instance, the Python version you wish to use or the environment variables needed for the deployment.
  4. Install dbt: use the “actions/setup-python” action to set up Python, then install dbt and any necessary dependencies with pip.
  5. Run dbt: add a step whose run command executes dbt run.
  6. Commit the workflow: once the file is complete, commit it to the repository so GitHub Actions can run it.

All in all, a complete GitHub Actions workflow for deploying a dbt project on a schedule can look like this:

name: Deploy dbt
on:
  schedule:
    - cron: '0 4 * * *'
env:
  DB_HOST: '${{ secrets.DB_HOST }}'
  DB_PORT: '${{ secrets.DB_PORT }}'
  DB_USER: '${{ secrets.DB_USER }}'
  DB_PASS: '${{ secrets.DB_PASS }}'
  DB_NAME: '${{ secrets.DB_NAME }}'
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.8'
      - name: Install dependencies
        run: |
          # dbt-postgres is one example adapter; install the one for your warehouse
          pip install dbt-core dbt-postgres
          dbt deps
      - name: Run dbt
        run: |
          # "prod" must match a target defined in your profiles.yml
          dbt run --target prod
      - name: Test using dbt
        run: |
          dbt test --target prod

This workflow runs on a cron schedule: it checks out the repository, sets up Python, installs dbt, and runs the dbt project. You can also see how environment variables are used to pass database connection information stored as GitHub secrets. As you may have noticed, before using GitHub Actions you need to create service credentials for your data warehouse; with BigQuery, for example, you can do this using a dedicated service account.
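
To show how those secrets reach dbt, here is a minimal profiles.yml sketch that reads them through dbt’s env_var() function (the profile name, the Postgres adapter, and the schema are assumptions chosen to match the workflow above):

# profiles.yml — committed to the repo; secret values arrive via the env block above
my_dbt_project:
  target: prod
  outputs:
    prod:
      type: postgres
      host: "{{ env_var('DB_HOST') }}"
      port: "{{ env_var('DB_PORT') | as_number }}"
      user: "{{ env_var('DB_USER') }}"
      password: "{{ env_var('DB_PASS') }}"
      dbname: "{{ env_var('DB_NAME') }}"
      schema: analytics
      threads: 4

Note that dbt needs to find this file; pointing DBT_PROFILES_DIR at the workspace, as the re_data example further down does, is one way.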

Additionally, you can use the workflow in conjunction with the re_data open-source library. Here’s an example from re_data’s repo below:

name: analysis

on:
  push:
    branches: [ main ]

  schedule:
    - cron: '0 0 * * *'

  workflow_dispatch:

env:
  DBT_PROFILES_DIR: ${{ github.workspace }}
  RE_DATA_CONFIG_DIR: ${{ github.workspace }}
  RE_DATA_SEND_ANONYMOUS_USAGE_STATS: 0
  RE_CLOUD_API_KEY: ${{ secrets.RE_CLOUD_API_KEY }}
  SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
  SNOWFLAKE_DBT_PASSWORD: ${{ secrets.SNOWFLAKE_DBT_PASSWORD }}

jobs:

  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v1
        with:
          python-version: "3.7.x"

      - name: Install re_data and python dependencies
        run: |
          pip install dbt-core==1.3.0
          pip install dbt-snowflake==1.3.0
          pip install re_data
          pip install re_cloud

      - name: Run dbt
        run: |
          dbt --version
          dbt deps
          dbt run
          re_data run

      - name: Generate dbt docs
        run: |
          dbt docs generate

      - name: Generate re_data overview
        run: |
          re_data overview generate

      - name: Configure re_data.yml for re_cloud production
        uses: mikefarah/yq@master
        with:
          cmd: |
            yq -i '.re_cloud.api_key = "${{secrets.RE_CLOUD_API_KEY}}"' re_data.yml 
      
      - name: Upload dbt docs and re_data to production cloud
        run: |
          re_cloud upload dbt-docs --name analytics-docs 
          re_cloud upload re-data --name analytics-report

      - name: Configure re_data.yml for re_cloud testing
        uses: mikefarah/yq@master
        with:
          cmd: |
            yq -i '.re_cloud.api_key = "${{secrets.RE_CLOUD_TEST_API_KEY}}"' re_data.yml 
            yq -i '.re_cloud.base_url = "https://test.getre.io"' re_data.yml
      
      - name: Upload dbt docs to testing cloud
        run: |
            re_cloud upload dbt-docs --name analytics-docs 
            re_cloud upload re-data --name analytics-report

This is a GitHub Actions workflow, written in YAML, that drives the data analysis process.

Trigger:

  • Any push to the “main” branch will trigger the workflow.
  • The workflow runs once a day at midnight via a cron schedule.
  • It can also be started manually via the workflow_dispatch event.

Environment variables:

The workflow sets several environment variables, including:

  • the location of the dbt profiles and re_data configuration directories.
  • API keys and passwords for accessing the re_cloud and Snowflake services (see the profile sketch below for how dbt consumes them).
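
For illustration, a Snowflake profile could read those variables like this (the profile name, user, role, database, warehouse, and schema are hypothetical):

re_data_project:
  target: prod
  outputs:
    prod:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: dbt_user              # hypothetical user
      password: "{{ env_var('SNOWFLAKE_DBT_PASSWORD') }}"
      role: TRANSFORMER           # hypothetical role
      database: ANALYTICS         # hypothetical database
      warehouse: TRANSFORMING     # hypothetical warehouse
      schema: dbt
      threads: 4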

Job:

  • The workflow defines a single job named “build”.
  • The job runs on the latest available Ubuntu runner (ubuntu-latest).

The job consists of several steps:

  1. Checks out the code repository.
  2. Sets up a Python 3.7.x environment.
  3. Installs the required dependencies: the dbt, re_data, and re_cloud packages.
  4. Runs dbt and re_data.
  5. Generates the dbt docs.
  6. Generates the re_data overview.
  7. Configures re_data.yml for the re_cloud production environment using a secret API key.
  8. Uploads the dbt docs and the re_data overview report to re_cloud in production.
  9. Configures re_data.yml for the re_cloud testing environment (separate API key and base URL).
  10. Uploads the dbt docs and re_data report to the testing cloud.

GitHub Actions’ downsides

In my view, the main downside of GitHub Actions is the usage limits on its free tier:

  1. Concurrent jobs: For public repositories, you can run up to 20 concurrent jobs across all workflows. For private repositories, the limit depends on your GitHub plan (the concurrency sketch below shows one way to keep runs from piling up).
  2. Actions per repository: There is a limit on the number of Actions you can use in a single repository.
  3. Workflow run queue: No more than 500 workflow runs can be queued in a 10-second interval per repository; runs beyond that limit fail.
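
If overlapping scheduled runs ever push you toward the concurrent-job limit, the workflow-level concurrency setting is one way to serialize them. A minimal sketch (the group name is arbitrary):

# Added at the top level of the workflow file, next to "on:" and "env:"
concurrency:
  group: dbt-deploy          # runs sharing a group name queue behind each other
  cancel-in-progress: false  # wait for the running job rather than cancelling it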

As of now, if you need to run workflows that exceed these limits, you should consider using a paid plan or contacting GitHub support for more information.

Conclusion

In summary, GitHub Actions is a powerful CI/CD pipeline automation tool that offers ease of use, tight integration with the rest of the GitHub ecosystem, cost-effectiveness, and a large and active community. By creating a workflow and following the steps described in this article, you can use it to deploy your dbt project on a schedule. However, the free tier has usage limits, such as caps on concurrent jobs and on included minutes per month.

Nonetheless, GitHub Actions is still a great option for those looking to automate their pipelines and streamline their software development workflows.

 
