@okumin
Last active September 3, 2024 03:21
Hive + Iceberg split

Reproduction

I used Hive 4.0.0.

Create a table with a big Parquet file

set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;

$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 1 items
-rw-r--r--   3 zookage hive      1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0

Directly query the table

As expected, the file was split into multiple InputSplits.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      9          9        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 27.69 s    
----------------------------------------------------------------------------------------------
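The 9 map tasks are consistent with splitting a 1.1 GB file at a fixed maximum split size. As a minimal sketch (assuming a 128 MB split size, which is a common default; the exact value depends on the Tez grouping configuration):

```python
import math

def expected_splits(file_size_bytes: int, max_split_bytes: int) -> int:
    """Estimate how many InputSplits one splittable file yields
    when each split is capped at max_split_bytes."""
    return math.ceil(file_size_bytes / max_split_bytes)

# 1.1 GB file with an assumed 128 MB split cap -> 9 splits,
# matching the 9 map tasks in the DAG output above.
print(expected_splits(int(1.1 * 1024**3), 128 * 1024**2))  # -> 9
```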

Migrate it to an Iceberg table

The migration created manifest files while keeping the big Parquet file in place.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> ALTER TABLE web_sales_parquet SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler', 'format-version' = '2');
...
$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 2 items
-rw-r--r--   3 zookage hive      1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0
drwxr-xr-x   - zookage hive          0 2024-09-02 13:50 /user/hive/warehouse/web_sales_parquet/metadata

Query the Iceberg table

The same number of tasks was created.

0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  
----------------------------------------------------------------------------------------------
Map 1 .......... container     SUCCEEDED      9          9        0        0       0       0  
----------------------------------------------------------------------------------------------
VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 25.58 s    
----------------------------------------------------------------------------------------------
@BsoBird

BsoBird commented Sep 3, 2024

I can provide a 600GB dataset and the SQL to reproduce the problem.
@okumin

@BsoBird

BsoBird commented Sep 3, 2024

I copied this dataset from Iceberg to a normal ORC table, and the slow execution of the map tasks disappeared, but I still observed the following:

  1. Reduce tasks execute slowly.
  2. Reduce tasks fail with high probability, triggering fault tolerance (task retries).

Since the size of the full dataset is 600GB, I will provide you with a mock program to generate the simulated data.

@BsoBird

BsoBird commented Sep 3, 2024

Regarding the Iceberg split issue:
This was my mistake; I found out that IcebergInputSplit does in fact calculate the split information. However, I recommend that you try again after compressing with ZSTD. Because ZSTD has a high compression ratio, too much shuffle data will cause PipelinedSorter.sort() in Tez to perform slowly, since too many elements end up stored in the in-memory queue.
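The point about compression ratio can be sketched with some back-of-the-envelope arithmetic: the sorter's workload scales with the *decompressed* record count, not the compressed byte size on the wire. All numbers below are hypothetical, purely for illustration:

```python
def decompressed_records(compressed_bytes: int, compression_ratio: float,
                         avg_record_bytes: int) -> int:
    """Estimate how many records a sorter must hold in memory once
    compressed shuffle data is decompressed.

    compressed_bytes  : shuffle bytes as seen on disk/network
    compression_ratio : decompressed size / compressed size (hypothetical)
    avg_record_bytes  : average decompressed record size (hypothetical)
    """
    return int(compressed_bytes * compression_ratio) // avg_record_bytes

# 1 GB of ZSTD-compressed shuffle at an assumed 10x ratio with
# 100-byte records -> ~1e8 in-memory records to sort.
print(decompressed_records(1024**3, 10.0, 100))
```

So a modest-looking compressed shuffle can translate into an enormous element count inside the sorter, which matches the observation that a high compression ratio makes PipelinedSorter.sort() the bottleneck.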
