@nihil0
Last active July 30, 2024 09:34

Deploying Databricks Workflows with Serverless Compute using Terraform

Introduction

Serverless compute for workflows enables you to run your Databricks jobs without the need for configuring and deploying infrastructure. This allows you to focus solely on implementing your data processing and analysis pipelines. Databricks takes care of managing compute resources, including optimizing and scaling them for your workloads. With autoscaling and Photon automatically enabled, you can be assured of efficient resource utilization.

Additionally, serverless compute for workflows features auto-optimization, which selects the appropriate resources such as instance types, memory, and processing engines based on your workload. It also automatically retries failed jobs, ensuring smooth and efficient execution of your data workflows.

On July 15, 2024, Serverless Compute for Notebooks, Workflows, and Delta Live Tables became generally available. However, documentation on how to deploy Workflows with serverless compute using Terraform or the Jobs 2.1 REST API is still limited. Nevertheless, it is possible to deploy such workflows by piecing together information from various public sources. In this blog post, I will describe step by step how to deploy a serverless workflow with Terraform.

Method

  1. Enable serverless compute for your account. This allows you to use serverless compute from all workspaces in your account.

  2. Write your job. Note that to use serverless compute, your code must be written in PySpark or SQL. In this case, I will create a workflow with a single Python script deployed as a spark_python_task. The Python script is shown below:

    from pyspark.sql import SparkSession
    
    def count_rows_in_table(catalog, schema, table):
        """
        Count the rows in a specified table in a given catalog and schema.
    
        :param catalog: The catalog containing the table.
        :param schema: The schema containing the table.
        :param table: The table to count rows from.
        :return: The number of rows in the table.
        """
    
        # Initialize Spark session
        spark = SparkSession.builder.appName("CountRowsInTable").getOrCreate()
    
        # Construct the full table name
        full_table_name = f"{catalog}.{schema}.{table}"
    
        # Load the table into a DataFrame
        df = spark.sql(f"SELECT * FROM {full_table_name}")
    
        # Count the number of rows in the DataFrame
        row_count = df.count()
    
        return row_count
    
    if __name__ == "__main__":
        catalog = "dev"
        schema = "my_schema"
        table = "mytable"
    
        row_count = count_rows_in_table(catalog, schema, table)
    
        print(f"The table {schema}.{table} in catalog {catalog} has {row_count} rows.")
  3. Write the Terraform code to deploy the Python script. Assume the script is already available in the workspace at the location specified in the Terraform code (a way to upload it with Terraform is sketched further below):

    resource "databricks_job" "srvless-job" {
    
      name = "row-count"
    
      task {
        task_key        = "task_srvless"
        environment_key = "default"
    
        spark_python_task {
          python_file = "/Users/me/projects/my_script.py"
        }
    
      }
    
      environment {
        environment_key = "default"
        spec {
          client = "1"
          dependencies = []
        }
      }
    }

Note the following points:

  • There is no need to specify the job_cluster block.
  • Note the presence of the environment block and the environment_key; these are critical to getting the resource to deploy. The definition of the environment block and the use of environment_key can be found in the source code of the Databricks Terraform Provider. A sketch of declaring package dependencies in this block follows below.
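
If your job needs extra Python packages, the dependencies list inside the spec block is where they go. As far as I can tell, the entries are pip-style requirement strings; a minimal sketch (the package names and version pins below are just placeholders) would look like this:

    environment {
      environment_key = "default"
      spec {
        client = "1"
        # Pip-style requirement strings; the packages and pins here are placeholders.
        dependencies = [
          "pandas==2.2.2",
          "requests>=2.31"
        ]
      }
    }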

That's it! This is all you need to start deploying workflows using Serverless Compute on Databricks.
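
One optional addition: in step 3 I assumed the Python script was already uploaded to the workspace. If you would rather manage that with Terraform as well, something along the lines of the sketch below should work, using the provider's databricks_workspace_file resource (the local and workspace paths are placeholders):

    resource "databricks_workspace_file" "row_count_script" {
      # Local path to the script within your Terraform project (placeholder).
      source = "${path.module}/src/my_script.py"
      # Destination path in the workspace (placeholder).
      path   = "/Users/me/projects/my_script.py"
    }

The spark_python_task can then reference databricks_workspace_file.row_count_script.path instead of a hard-coded string, which also lets Terraform track the dependency between the file and the job.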

P.S. It seems the disable_auto_optimization setting is not yet available in the Terraform provider, although it is already available in the REST API. It controls the serverless auto-optimization feature mentioned above. Hopefully, it will be added to the Databricks Terraform Provider soon.

EDIT 2024-07-18: Seems like disable_auto_optimization is available now.
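
I have not tried it yet, but assuming the provider exposes the flag the same way the Jobs 2.1 REST API does, i.e. as a boolean on the task, the sketch below shows where it would go (treat the exact placement as an assumption and verify against the provider docs):

    task {
      task_key        = "task_srvless"
      environment_key = "default"

      # Turn off serverless auto-optimization (e.g. the automatic retries mentioned above).
      # The task-level placement mirrors the REST API field; verify against the provider docs.
      disable_auto_optimization = true

      spark_python_task {
        python_file = "/Users/me/projects/my_script.py"
      }
    }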

EDIT 2024-07-22: The environment feature is now documented here
