@okumin
Last active September 3, 2024 03:21
Hive + Iceberg split

Reproduction

I used Hive 4.0.0.

Create a table with a big Parquet file

set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;

$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 1 items
-rw-r--r--   3 zookage hive      1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0

Directly query the table

As expected, the file was split into multiple InputSplits.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      9          9        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 27.69 s    
----------------------------------------------------------------------------------------------
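The 9 map tasks are consistent with splitting a 1.1 GB file at a fixed maximum split size. As a minimal sketch (assuming a 128 MB split size, which is a common default; the exact value depends on the Tez grouping configuration):

```python
import math

def expected_splits(file_size_bytes: int, max_split_bytes: int) -> int:
    """Estimate how many InputSplits one splittable file yields
    when each split is capped at max_split_bytes."""
    return math.ceil(file_size_bytes / max_split_bytes)

# 1.1 GB file with an assumed 128 MB split cap -> 9 splits,
# matching the 9 map tasks in the DAG output above.
print(expected_splits(int(1.1 * 1024**3), 128 * 1024**2))  # -> 9
```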

Migrate it to an Iceberg table

The migration created manifest files while keeping the big Parquet file in place.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> ALTER TABLE web_sales_parquet SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler', 'format-version' = '2');
...
$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 2 items
-rw-r--r--   3 zookage hive      1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0
drwxr-xr-x   - zookage hive          0 2024-09-02 13:50 /user/hive/warehouse/web_sales_parquet/metadata

Query the Iceberg table

The same number of tasks was created.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      9          9        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 25.58 s    
----------------------------------------------------------------------------------------------
@BsoBird

BsoBird commented Sep 3, 2024

I can provide a 600GB dataset and the SQL to reproduce the problem.
@okumin

@BsoBird

BsoBird commented Sep 3, 2024

I copied this dataset from Iceberg to a normal ORC table, and the slow execution of the map tasks disappeared, but I still observed the following:

  1. Reduce tasks execute slowly.
  2. Reduce tasks fail with high probability, triggering fault tolerance (task retries).

Since the size of the full dataset is 600GB, I will provide you with a mock program to generate the simulated data.

@BsoBird

BsoBird commented Sep 3, 2024

Regarding the Iceberg split issue:
This was my mistake; I found out that IcebergInputSplit does in fact calculate the split information. However, I recommend that you try again after compressing with ZSTD. Because ZSTD has a high compression ratio, too much shuffle data will cause PipelinedSorter.sort() in Tez to perform slowly, since too many elements end up stored in the in-memory queue.
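The point about compression ratio can be sketched with some back-of-the-envelope arithmetic: the sorter's workload scales with the *decompressed* record count, not the compressed byte size on the wire. All numbers below are hypothetical, purely for illustration:

```python
def decompressed_records(compressed_bytes: int, compression_ratio: float,
                         avg_record_bytes: int) -> int:
    """Estimate how many records a sorter must hold in memory once
    compressed shuffle data is decompressed.

    compressed_bytes  : shuffle bytes as seen on disk/network
    compression_ratio : decompressed size / compressed size (hypothetical)
    avg_record_bytes  : average decompressed record size (hypothetical)
    """
    return int(compressed_bytes * compression_ratio) // avg_record_bytes

# 1 GB of ZSTD-compressed shuffle at an assumed 10x ratio with
# 100-byte records -> ~1e8 in-memory records to sort.
print(decompressed_records(1024**3, 10.0, 100))
```

So a modest-looking compressed shuffle can translate into an enormous element count inside the sorter, which matches the observation that a high compression ratio makes PipelinedSorter.sort() the bottleneck.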
