Created May 28, 2019 21:12
Read / Write Spark Schema to JSON
##### READ SPARK DATAFRAME
# Read the CSV with the header in the first row, and infer the column types from the data
df = spark.read.option("header", "true").option("inferSchema", "true").csv(fname)
df_schema = df.schema

##### SAVE JSON SCHEMA INTO S3 / BLOB STORAGE
# Save the schema so the streaming job can load it later without re-inferring types
dbutils.fs.rm("/home/mwc/airline_schema.json", True)
with open("/dbfs/home/mwc/airline_schema.json", "w") as f:
    f.write(df.schema.json())

##### LOAD JSON SCHEMA BACK TO DATAFRAME SCHEMA OBJECT
import json
from pyspark.sql.types import StructType

schema_path = '/dbfs/home/mwc/airline_schema.json'
with open(schema_path, 'r') as content_file:
    schema_json = content_file.read()
new_schema = StructType.fromJson(json.loads(schema_json))
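For reference, the file that `df.schema.json()` writes is a plain `{"type": "struct", "fields": [...]}` JSON document, so it can be inspected or round-tripped with the standard `json` module alone. A minimal sketch (the field names below are illustrative placeholders, not the actual airline columns):

```python
import json

# A hand-written example of the schema JSON that df.schema.json() emits
# (field names here are made up for illustration)
schema_json = '''
{
  "type": "struct",
  "fields": [
    {"name": "carrier", "type": "string", "nullable": true, "metadata": {}},
    {"name": "delay", "type": "integer", "nullable": true, "metadata": {}}
  ]
}
'''

# Parse it back into a dict, the same shape StructType.fromJson() consumes
parsed = json.loads(schema_json)
print(parsed["type"])                        # struct
print([f["name"] for f in parsed["fields"]]) # ['carrier', 'delay']
```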
Upload JSON schema to S3
import boto3

s3_client = boto3.client('s3')
# df.schema.json() already returns a JSON string, so no extra json.dumps() is needed
schema_json = df.schema.json()
s3_client.put_object(Body=schema_json, Bucket='S3-BucketName', Key='FileName.json')
thanks!
How do I upload a JSON schema to S3?
How do I load a JSON file containing the schema from S3 and use that schema to read a CSV file?
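One way to answer the question above is to download the schema JSON with boto3 and rebuild the `StructType` before reading the CSV. This is a sketch, not code from the gist: `load_schema_from_s3` and its arguments are hypothetical names, and it assumes boto3 credentials, a bucket/key of your own, and an active `SparkSession` named `spark`.

```python
import json

def load_schema_from_s3(bucket, key):
    """Download a schema JSON file written by df.schema.json() from S3
    and rebuild the StructType object. boto3/pyspark are imported inside
    the function so it can be defined without them installed."""
    import boto3
    from pyspark.sql.types import StructType

    s3_client = boto3.client('s3')
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    schema_json = obj['Body'].read().decode('utf-8')
    return StructType.fromJson(json.loads(schema_json))

# Usage inside a job with an active SparkSession named `spark`
# (bucket and key names are placeholders):
# schema = load_schema_from_s3('S3-BucketName', 'FileName.json')
# df = spark.read.option("header", "true").schema(schema).csv(fname)
```

Passing the saved schema via `.schema(...)` skips `inferSchema`, which avoids an extra pass over the CSV data when the job runs.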