# Create a DataFrame from the file paths in a directory
directory = '/path/on/dbfs'
file_paths = dbutils.fs.ls(directory)

# Each entry is a FileInfo object with .path and .name attributes, e.g.:
print(file_paths[0].path)
print(file_paths[0].name)

# Build a two-column DataFrame of (path, name) pairs
files_df = spark.createDataFrame(
    [(f.path, f.name) for f in file_paths],
    ["path", "name"]
)
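The mapping step above simply turns each `FileInfo` object into a `(path, name)` tuple before handing the list to Spark. A minimal local sketch of that transformation, with `FileInfo` simulated by a namedtuple (on Databricks, `dbutils.fs.ls` returns objects with these attributes; the example paths here are made up):

```python
from collections import namedtuple

# Stand-in for the FileInfo objects returned by dbutils.fs.ls
FileInfo = namedtuple("FileInfo", ["path", "name"])

file_paths = [
    FileInfo("dbfs:/path/on/dbfs/a.csv", "a.csv"),
    FileInfo("dbfs:/path/on/dbfs/b.csv", "b.csv"),
]

# Same transformation as in the snippet: one (path, name) tuple per file
rows = [(f.path, f.name) for f in file_paths]
print(rows)
```

These tuples, paired with the `["path", "name"]` schema, are exactly what `spark.createDataFrame` consumes.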
Created: July 7, 2021 18:41
Create a PySpark DataFrame of file paths by reading a directory in Databricks