Describe table output:
>>> spark.sql('describe table extended test.sample').show(truncate=False, n=100)
+----------------------------+--------------------------------------------------------------+----------------------------+
|col_name |data_type |comment |
+----------------------------+--------------------------------------------------------------+----------------------------+
|some_id |bigint |some identifier column |
|created_at |timestamp |when this record was created|
|status |string |status of this record |
|stats |struct<stat1:float,stat2:float,stat3:float> |stats for this record |
|ds |string |date partition |
|# Partition Information | | |
|# col_name |data_type |comment |
|ds |string |date partition |
| | | |
|# Detailed Table Information| | |
|Database |test | |
|Table |sample | |
|Owner |kandelin | |
|Created Time |Tue Jul 26 09:15:46 EDT 2022 | |
|Last Access |UNKNOWN | |
|Created By |Spark 3.2.1 | |
|Type |EXTERNAL | |
|Provider |parquet | |
|Location |file:/tmp/warehouse/test/sample | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Partition Provider |Catalog | |
+----------------------------+--------------------------------------------------------------+----------------------------+
Define a helper to extract the table metadata from the above output:
from typing import Dict

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def get_describe_table_metadata(spark: SparkSession, table: str) -> Dict[str, str]:
    """Extract table metadata from the `describe table extended` command, returned as a dict."""
    # Metadata rows (e.g. 'Database', 'Location') start with an uppercase letter,
    # which distinguishes them from the lowercase column rows and the '# ...' section headers.
    metadata_df = spark.sql(f'describe table extended {table}').where(col('col_name').rlike('^[A-Z]'))
    return {row.col_name.lower(): row.data_type for row in metadata_df.collect()}
Which outputs the following:
>>> get_describe_table_metadata(spark, 'test.sample')
{
"database": "test",
"table": "sample",
"owner": "kandelin",
"created time": "Tue Jul 26 09:15:46 EDT 2022",
"last access": "UNKNOWN",
"created by": "Spark 3.2.1",
"type": "EXTERNAL",
"provider": "parquet",
"location": "file:/tmp/warehouse/test/sample",
"serde library": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
"inputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
"outputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
"partition provider": "Catalog"
}
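Since the helper returns a plain dict, downstream checks on the table are straightforward. As a minimal sketch that exercises the same dict-comprehension logic without a live Spark session (the namedtuple rows below are stand-ins for what `metadata_df.collect()` would return, with a hypothetical subset of the metadata above):

```python
from collections import namedtuple

# Stand-in for pyspark.sql.Row, exposing the same attribute access.
Row = namedtuple('Row', ['col_name', 'data_type'])

# A hypothetical subset of the metadata rows collected above.
rows = [
    Row('Database', 'test'),
    Row('Table', 'sample'),
    Row('Type', 'EXTERNAL'),
    Row('Provider', 'parquet'),
]

# Same extraction step as in get_describe_table_metadata.
metadata = {row.col_name.lower(): row.data_type for row in rows}

# The dict then supports simple checks, e.g. that the table is external parquet.
assert metadata['type'] == 'EXTERNAL'
assert metadata['provider'] == 'parquet'
```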