@dantonnoriega
Created March 5, 2022 02:09
Instructions for using the Simba JDBC driver in R to connect to Databricks

Databricks - JDBC

This is the preferred method if you're collecting small objects from the Spark cluster: it's easy to set up, doesn't appear to cause many "out of memory" errors, and requires fewer dependencies.

Setup

  1. Install and configure Java.

    • Install Java 8

        brew tap AdoptOpenJDK/openjdk
        brew install --cask adoptopenjdk8
      
    • Export the correct JAVA_HOME and run R CMD javareconf

        export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
        R CMD javareconf
      
  2. Install rJava and RJDBC

     # rJava can hit some issues:
     # - Error: `clang: error: unsupported option '-fopenmp'`
     #   - https://stackoverflow.com/a/43943631
     # - Error: `ld: framework not found JavaVM`            
     #   - https://stackoverflow.com/a/65297107
     install.packages('rJava', configure.args="--disable-jri")
     
     install.packages('RJDBC')
    
  3. Make sure you have a proper ~/.databricks-connect file.

    • This is just a simple JSON object. If you don't have it, you can create it in vim.

        vim ~/.databricks-connect
      
    • Then paste

        {
          "host": "https://abc-d1e2345f-a6b2.cloud.databricks.com",
          "token": "[API_TOKEN_OMITTED]",
          "cluster_id": "0123-2345678-hello987",
          "org_id": "0"
        }
      
    • The only value you'll have to fill in yourself is the token: replace [API_TOKEN_OMITTED] with your Databricks API token. (A quick sanity check of the full setup follows below.)

See the Databricks documentation for instructions on generating a new token.
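
With all three steps done, it's worth running a quick sanity check before moving on. A minimal sketch, assuming the packages above installed cleanly (jsonlite is also needed later by jdbc_url()):

    # expect a "1.8.0_..." version string, then a parsed config list
    rJava::.jinit()
    rJava::.jcall("java/lang/System", "S", "getProperty", "java.version")
    jsonlite::read_json("~/.databricks-connect")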

Using JDBC

The following script has a JDBC "header" which does the setup for you. Include it at the start of your script and you get two functions, get_simba_driver() and jdbc_url().

  • get_simba_driver() downloads the Simba JDBC driver if it isn't already present, then returns the path(s) to the driver jar(s) on your machine.
  • jdbc_url() generates the correct JDBC URL using your ~/.databricks-connect credentials.

These two functions are then used to build the JDBC driver object with RJDBC::JDBC() and the connection with DBI::dbConnect(), as shown in the example below.

!! IMPORTANT !!

  • dbGetQuery() works for small queries, but if you try to pull large objects you'll likely hit java.lang.OutOfMemoryError: Java heap space.
  • Use dbSendQuery(), dbFetch(), and dbClearResult() to fetch the result in small chunks instead.
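
A common mitigation for the heap-space error, worth trying before any of the code below (an assumption about your workload, not something this setup requires): raise the JVM heap size. rJava only reads the option when the JVM starts, so it must be set before rJava/RJDBC load.

    # set before rJava initializes; the JVM reads java.parameters only at startup
    options(java.parameters = "-Xmx4g")
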
# <<<< JDBC Header <<<< --------------
# function to get simba JDBC driver
# SOURCE: https://databricks.com/spark/jdbc-drivers-download
get_simba_driver <- function(
  simba_url = paste0(
    "https://databricks-bi-artifacts.s3.us-east-2.amazonaws.com/",
    "simbaspark-drivers/jdbc/2.6.17/SimbaSparkJDBC42-2.6.17.1021.zip")
) {
  #
  SIMBA_DRIVER_PATH <- "~/Library/simba/spark"
  DEST_FILE <- file.path(SIMBA_DRIVER_PATH, basename(simba_url))
  dir.create(SIMBA_DRIVER_PATH, recursive = TRUE, showWarnings = FALSE)
  if(!file.exists(DEST_FILE)) {
    download.file(
      simba_url,
      destfile = DEST_FILE
    )
  }
  unzip(DEST_FILE, exdir = SIMBA_DRIVER_PATH)
  # return the path(s) to the extracted driver jar(s)
  list.files(SIMBA_DRIVER_PATH,
             pattern = "\\.jar$",
             full.names = TRUE)
  #
}

# MOTIVATING SOURCE:
# https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html
jdbc_url <- function() {
  DATABRICKS_CONF <- '~/.databricks-connect'
  stopifnot(file.exists(DATABRICKS_CONF))
  conf <- jsonlite::read_json(DATABRICKS_CONF)
  url <- gsub('http[s]?://', '', conf[['host']])
  sprintf(
    paste0(
      "jdbc:spark://%s:443/default;transportMode=http;",
      "ssl=1;httpPath=sql/protocolv1/o/%s/%s;AuthMech=3;",
      "UID=token;PWD=%s"),
    url, conf[['org_id']],
    conf[['cluster_id']], conf[['token']]
  )
}
# >>>> END JDBC Header >>>> -------------------
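
# For reference, with the sample ~/.databricks-connect shown above, jdbc_url()
# returns a string of this shape (token elided):
#   jdbc:spark://abc-d1e2345f-a6b2.cloud.databricks.com:443/default;
#     transportMode=http;ssl=1;httpPath=sql/protocolv1/o/0/0123-2345678-hello987;
#     AuthMech=3;UID=token;PWD=[API_TOKEN_OMITTED]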

# JDBC example -------------

drv <- RJDBC::JDBC(
  "com.simba.spark.jdbc.Driver",
  get_simba_driver(),
  identifier.quote="`"
)

# load driver and connect to the jdbc_url
conn <- DBI::dbConnect(
  drv,
  jdbc_url()
)

# we fetch the table in chunks to avoid memory issues
#  - note that `DBI::dbGetQuery` will try to pull the whole object at once
#    - better for small queries
qry <- "
  SELECT id, date, value
  FROM my_schema.some_table
  LIMIT 5000"
  
n <- 500 # rows fetched per chunk
rs <- DBI::dbSendQuery(conn, qry)
chunks <- list()
while (!DBI::dbHasCompleted(rs)) {
  chunks <- append(
    chunks,
    list(data.table::setDT(DBI::dbFetch(rs, n)))
  )
}
DBI::dbClearResult(rs)

df <- do.call(rbind, chunks) # combine chunks into one table

head(df)
nrow(df)

DBI::dbDisconnect(conn)
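
# If you run several chunked queries, it may help to wrap the pattern above in
# a small helper. A minimal sketch (fetch_in_chunks() is a hypothetical name,
# not part of DBI); call it while the connection is still open:
fetch_in_chunks <- function(conn, qry, n = 500) {
  rs <- DBI::dbSendQuery(conn, qry)
  on.exit(DBI::dbClearResult(rs), add = TRUE) # always release the result set
  chunks <- list()
  while (!DBI::dbHasCompleted(rs)) {
    chunks[[length(chunks) + 1]] <- DBI::dbFetch(rs, n) # n rows per fetch
  }
  data.table::rbindlist(chunks) # bind all chunks into one data.table
}
# e.g. df <- fetch_in_chunks(conn, qry, n = 500)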
