GitHub Actions in Databricks | databricks-import-directory

Praddyum Verma
Mar 5, 2023
Adding GitHub Actions to your Databricks workflow

TL;DR

In this article, we’ll set up databricks-import-directory, a GitHub Action that updates our Databricks notebooks whenever we change the same files in their GitHub repo.

Overview

Before we begin, let’s cover the basics in brief:

  • Databricks: Databricks is like your Jupyter notebooks or VS Code, but on the cloud, with its own computing resources and rich integration with many other services. It enables developers to do all their data engineering, AI, and ML coding on the cloud.
  • GitHub Actions: GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want.

Need

When we work in a team sharing the same workspace environment, it’s best practice to restrict users from making direct changes to the code and instead have them raise a pull request. The exact CI/CD structure depends on team needs; one example is shown below.

Master | Develop | Feature

In this setup, we assume our code repository has two branches:
1) Master: our production-level code, which in this case contains the code for our prod Databricks environment.
2) Develop: our development-level code, which in this case contains the code for our dev Databricks environment.

Whenever we want to add or remove code in either environment, instead of making changes directly in the Databricks environment, we’ll create a branch from develop, update the code, and raise a pull request to merge it into develop. After testing in develop, we’ll raise another pull request from develop to master in GitHub, as sketched below.
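
To make that cycle concrete, here’s a minimal command-line sketch (the branch name, file path, and PR text are just examples, not from the original setup):

# Start from the latest develop branch
git checkout develop
git pull origin develop

# Create a feature branch for the change (name is an example)
git checkout -b feature/update-notebook

# Edit the notebook source, then commit and push
git add code_folder/
git commit -m "Update notebook"
git push -u origin feature/update-notebook

# Raise a pull request into develop (here via the GitHub CLI, if you use it)
gh pr create --base develop --title "Update notebook" --body "Merge to master after testing in develop"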

This is where GitHub Actions comes in.

Since we’re making changes in GitHub and not directly in the Databricks environment, we’ll set up a GitHub Action that copies every changed file to the Databricks environment, so the two stay in sync.

Steps

  • Sync your Databricks code to a GitHub repo (let me know in the comments if you don’t know how to do it).
  • Generate a Databricks access token:
    Databricks webpage -> Click on your name at the top right -> User Settings -> Access tokens -> Generate new token -> <Give it a name> -> Generate -> Save the token somewhere safe.
  • Make a copy of your Databricks host address (your workspace URL).
  • Go to the GitHub repo where you synced the code and create repository secrets:
    Inside your repo -> Settings -> Secrets and variables -> Actions -> New repository secret.
    Add the access token: give it a name (in this case ADB_SECRET) and paste the Databricks access token we generated earlier into the Secret field.
    Repeat the same for the Databricks host (ADB_HOST) and add the URL. (A command-line equivalent is sketched right after the workflow below.)
  • Now, back in the repository, click on:
    Actions -> New workflow -> set up a workflow yourself, and add the following code.
name: Databricks CD

on:
  push:
    branches:
      - master

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: install-databricks-cli
        uses: microsoft/install-databricks-cli@v1.0.0

      - name: Import Databricks notebooks
        uses: ./
        with:
          databricks-host: ${{ secrets.ADB_HOST }}
          databricks-token: ${{ secrets.ADB_SECRET }}
          local-path: ./code_folder/
          remote-path: /code_folder
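
Before relying on the workflow, it can help to check the token and host locally; the repository secrets can also be created from the command line. A quick sketch using the legacy Databricks CLI and the GitHub CLI (the placeholder values are yours to fill in; ADB_HOST and ADB_SECRET match the secret names above):

# The legacy Databricks CLI authenticates via these environment variables
export DATABRICKS_HOST="https://<your-workspace-url>"
export DATABRICKS_TOKEN="<your-access-token>"

# If this lists your workspace root, the token and host are good
databricks workspace ls /

# Command-line equivalent of the New repository secret steps above
gh secret set ADB_HOST --body "$DATABRICKS_HOST"
gh secret set ADB_SECRET --body "$DATABRICKS_TOKEN"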

About the Code

This workflow runs whenever there is a push to the master branch. It spins up an ubuntu-latest runner and installs the Databricks CLI on that machine, checks out the current repository, and then, using databricks-host and databricks-token, pushes whatever is inside ./code_folder/ to /code_folder on Databricks.

Click on Start commit and commit it to master directly, or create a new branch and merge it into master via a pull request.

This workflow needs one more file in the master branch to run successfully: the uses: ./ step above refers to a composite action defined in the repository itself. Create an action.yml file in the root of your master branch with the following code:

name: databricks-import-directory
description: 'GitHub Action that imports a local directory into the Databricks Workspace'

inputs:
  databricks-host:
    description: 'Databricks host'
    required: true
  databricks-token:
    description: 'Databricks token'
    required: false
  local-path:
    description: 'LOCAL_NOTEBOOKS_PATH'
    required: true
  remote-path:
    description: 'REMOTE_NOTEBOOK_PATH'
    required: true

runs:
  using: "composite"
  steps:
    - id: import-notebooks
      run: |
        echo "Uploading notebooks from $LOCAL_NOTEBOOKS_PATH to $REMOTE_NOTEBOOK_PATH in the workspace $DATABRICKS_HOST"
        databricks workspace import_dir --overwrite "${LOCAL_NOTEBOOKS_PATH}" "${REMOTE_NOTEBOOK_PATH}" --debug
      shell: bash
      env:
        DATABRICKS_HOST: ${{ inputs.databricks-host }}
        DATABRICKS_TOKEN: ${{ inputs.databricks-token }}
        LOCAL_NOTEBOOKS_PATH: ${{ inputs.local-path }}
        REMOTE_NOTEBOOK_PATH: ${{ inputs.remote-path }}

And save it.
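
A note on the env: block above: the legacy Databricks CLI reads DATABRICKS_HOST and DATABRICKS_TOKEN straight from the environment, which is why the action exports them instead of running a separate configure step. Assuming those two variables are exported as in the earlier sketch, you can reproduce the action’s step locally:

# Same command the composite action runs, with the paths from our workflow
databricks workspace import_dir --overwrite ./code_folder/ /code_folder --debug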

Closing

Make sure all your changes are reflected in master (in this case) or whichever branch you’re targeting. From now on, whenever you push or merge anything to master, this action will run under the Actions tab and copy the changes to your Databricks workspace.
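
To confirm a run actually landed, you can list the target folder in the workspace, assuming the same /code_folder path as above:

# The imported notebooks should show up here after the action runs
databricks workspace ls /code_folder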

I hope this was helpful. You can also check out Microsoft’s official GitHub repo to get the most up-to-date code. See you next time with a different blog.
