Describe table output:
>>> spark.sql('describe table extended test.sample').show(truncate=False, n=100)
+----------------------------+--------------------------------------------------------------+----------------------------+
|col_name |data_type |comment |
+----------------------------+--------------------------------------------------------------+----------------------------+
|some_id |bigint |some identifier column |
|created_at |timestamp |when this record was created|
|status |string |status of this record |
|stats |struct<stat1:float,stat2:float,stat3:float> |stats for this record |
|ds |string |date partition |
|# Partition Information | | |
|# col_name |data_type |comment |
|ds |string |date partition |
| | | |
|# Detailed Table Information| | |
|Database |test | |
|Table |sample | |
|Owner |kandelin | |
|Created Time |Tue Jul 26 09:15:46 EDT 2022 | |
|Last Access |UNKNOWN | |
|Created By |Spark 3.2.1 | |
|Type |EXTERNAL | |
|Provider |parquet | |
|Location |file:/tmp/warehouse/test/sample | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Partition Provider |Catalog | |
+----------------------------+--------------------------------------------------------------+----------------------------+
Define a helper to extract the table metadata from the above output:
from typing import Dict

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def get_describe_table_metadata(spark: SparkSession, table: str) -> Dict[str, str]:
    """Extract table metadata from the `describe table extended` command, returned as a dict."""
    # Metadata rows (e.g. 'Database', 'Location') start with an uppercase letter,
    # which distinguishes them from the lowercase column rows and the '# ...' section headers.
    metadata_df = spark.sql(f'describe table extended {table}').where(col('col_name').rlike('^[A-Z]'))
    return {row.col_name.lower(): row.data_type for row in metadata_df.collect()}
Which outputs the following:
>>> get_describe_table_metadata(spark, 'test.sample')
{
"database": "test",
"table": "sample",
"owner": "kandelin",
"created time": "Tue Jul 26 09:15:46 EDT 2022",
"last access": "UNKNOWN",
"created by": "Spark 3.2.1",
"type": "EXTERNAL",
"provider": "parquet",
"location": "file:/tmp/warehouse/test/sample",
"serde library": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"inputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"outputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"partition provider": "Catalog"
}
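Since the helper returns a plain dict, downstream checks on the table are straightforward. As a minimal sketch that exercises the same dict-comprehension logic without a live Spark session (the namedtuple rows below are stand-ins for what `metadata_df.collect()` would return, with a hypothetical subset of the metadata above):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, exposing the same attribute access.
Row = namedtuple('Row', ['col_name', 'data_type'])

# A hypothetical subset of the metadata rows collected above.
rows = [
    Row('Database', 'test'),
    Row('Table', 'sample'),
    Row('Type', 'EXTERNAL'),
    Row('Provider', 'parquet'),
]

# Same extraction step as in get_describe_table_metadata.
metadata = {row.col_name.lower(): row.data_type for row in rows}

# The dict then supports simple checks, e.g. that the table is external parquet.
assert metadata['type'] == 'EXTERNAL'
assert metadata['provider'] == 'parquet'
```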