ianmcook/maintain_row_order.md

This is a set of examples demonstrating whether various Python and R dataframe libraries and OLAP query engines preserve (or do not preserve) the original order of the records in the data.

Example data

The examples all use this dataset describing the 28 times when a person walked on the moon:

year	mission	name	minutes
1969	Apollo 11	Neil Armstrong	151
1969	Apollo 11	Buzz Aldrin	151
1969	Apollo 12	Pete Conrad	236
1969	Apollo 12	Alan Bean	236
1969	Apollo 12	Pete Conrad	229
1969	Apollo 12	Alan Bean	229
1971	Apollo 14	Alan Shepard	287
1971	Apollo 14	Edgar Mitchell	287
1971	Apollo 14	Alan Shepard	274
1971	Apollo 14	Edgar Mitchell	274
1971	Apollo 15	David Scott	392
1971	Apollo 15	James Irwin	392
1971	Apollo 15	David Scott	432
1971	Apollo 15	James Irwin	432
1971	Apollo 15	David Scott	289
1971	Apollo 15	James Irwin	289
1972	Apollo 16	John Young	431
1972	Apollo 16	Charles Duke	431
1972	Apollo 16	John Young	443
1972	Apollo 16	Charles Duke	443
1972	Apollo 16	John Young	340
1972	Apollo 16	Charles Duke	340
1972	Apollo 17	Gene Cernan	431
1972	Apollo 17	Harrison Schmitt	431
1972	Apollo 17	Gene Cernan	456
1972	Apollo 17	Harrison Schmitt	456
1972	Apollo 17	Gene Cernan	435
1972	Apollo 17	Harrison Schmitt	435

The rows of this data are implicitly ordered first by the Apollo mission (in chronological order from top to bottom), second by the order of moonwalks during the mission (earliest at the top), and third by the order in which the astronauts egressed from the lunar module during the first or only moonwalk of the mission (first at the top). See the included file moonwalks.csv which contains this data.

Operation: Return the unique values in a column

pandas

Code

import pandas as pd

moonwalks = pd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

pandas preserves the original row order.

Polars

Code (with `maintain_order` set to `True`)

import polars as pl

moonwalks = pl.read_csv("moonwalks.csv")

moonwalks.select("mission").unique(maintain_order=True)

Result (with `maintain_order` set to `True`)

shape: (6, 1)
┌───────────┐
│ mission   │
│ ---       │
│ str       │
╞═══════════╡
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
│ Apollo 15 │
│ Apollo 16 │
│ Apollo 17 │
└───────────┘

Code (with `maintain_order` set to `False`)

import polars as pl

moonwalks = pl.read_csv("moonwalks.csv")

moonwalks.select("mission").unique(maintain_order=False)

Result (with `maintain_order` set to `False`)

shape: (6, 1)
┌───────────┐
│ mission   │
│ ---       │
│ str       │
╞═══════════╡
│ Apollo 17 │
│ Apollo 14 │
│ Apollo 12 │
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 15 │
└───────────┘

Conclusion

Polars preserves the original row order only when maintain_order is set to True.

Dask DataFrame

Code

import dask.dataframe as dd

moonwalks = dd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates().compute()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

When the input index is monotonically increasing (as in this example), Dask DataFrame preserves the original row order.

Modin

Code

import modin.pandas as pd

moonwalks = pd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

Modin preserves the original row order.

dplyr

Code

library(dplyr)
library(readr)

moonwalks <- read_csv("moonwalks.csv")

moonwalks |> distinct(mission)

Result

# A tibble: 6 × 1
  mission  
  <chr>    
1 Apollo 11
2 Apollo 12
3 Apollo 14
4 Apollo 15
5 Apollo 16
6 Apollo 17

Conclusion

dplyr preserves the original row order.

data.table

Code

library(data.table)

moonwalks <- fread("moonwalks.csv")

unique(moonwalks[, .(mission)])

Result

     mission
      <char>
1: Apollo 11
2: Apollo 12
3: Apollo 14
4: Apollo 15
5: Apollo 16
6: Apollo 17

Conclusion

data.table preserves the original row order.

DuckDB

Code (with `preserve_insertion_order` set to `true`)

SET preserve_insertion_order = true;

SELECT DISTINCT(mission) FROM "moonwalks.csv";

Result (with `preserve_insertion_order` set to `true`)

┌───────────┐
│  mission  │
│  varchar  │
├───────────┤
│ Apollo 16 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
└───────────┘

Code (with `preserve_insertion_order` set to `false`)

SET preserve_insertion_order = false;

SELECT DISTINCT(mission) FROM "moonwalks.csv";

Result (with `preserve_insertion_order` set to `false`)

┌───────────┐
│  mission  │
│  varchar  │
├───────────┤
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 14 │
└───────────┘

Conclusion

DuckDB does not preserve the original row order, regardless of the value of preserve_insertion_order.

DataFusion

Code

CREATE EXTERNAL TABLE moonwalks
STORED AS CSV
LOCATION "moonwalks.csv"
OPTIONS ("has_header" "true");

SELECT DISTINCT(mission) FROM moonwalks;

Result

+-----------+
| mission   |
+-----------+
| Apollo 14 |
| Apollo 11 |
| Apollo 12 |
| Apollo 16 |
| Apollo 15 |
| Apollo 17 |
+-----------+

Conclusion

DataFusion does not preserve the original row order.

Spark

Code

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

moonwalks = spark.read.csv("moonwalks.csv", header=True, inferSchema=True)

moonwalks.select("mission").distinct().show()

Result

+---------+
|  mission|
+---------+
|Apollo 11|
|Apollo 16|
|Apollo 12|
|Apollo 14|
|Apollo 15|
|Apollo 17|
+---------+

Conclusion

Spark does not preserve the original row order.

year	mission	name	minutes
1969	Apollo 11	Neil Armstrong	151
1969	Apollo 11	Buzz Aldrin	151
1969	Apollo 12	Pete Conrad	236
1969	Apollo 12	Alan Bean	236
1969	Apollo 12	Pete Conrad	229
1969	Apollo 12	Alan Bean	229
1971	Apollo 14	Alan Shepard	287
1971	Apollo 14	Edgar Mitchell	287
1971	Apollo 14	Alan Shepard	274
1971	Apollo 14	Edgar Mitchell	274
1971	Apollo 15	David Scott	392
1971	Apollo 15	James Irwin	392
1971	Apollo 15	David Scott	432
1971	Apollo 15	James Irwin	432
1971	Apollo 15	David Scott	289
1971	Apollo 15	James Irwin	289
1972	Apollo 16	John Young	431
1972	Apollo 16	Charles Duke	431
1972	Apollo 16	John Young	443
1972	Apollo 16	Charles Duke	443
1972	Apollo 16	John Young	340
1972	Apollo 16	Charles Duke	340
1972	Apollo 17	Gene Cernan	431
1972	Apollo 17	Harrison Schmitt	431
1972	Apollo 17	Gene Cernan	456
1972	Apollo 17	Harrison Schmitt	456
1972	Apollo 17	Gene Cernan	435
1972	Apollo 17	Harrison Schmitt	435

ianmcook/maintain_row_order.md

Example data

Operation: Return the unique values in a column

pandas

Code

Result

Conclusion

Polars

Code (with maintain_order set to True)

Result (with maintain_order set to True)

Code (with maintain_order set to False)

Result (with maintain_order set to False)

Conclusion

Dask DataFrame

Code

Result

Conclusion

Modin

Code

Result

Conclusion

dplyr

Code

Result

Conclusion

data.table

Code

Result

Conclusion

DuckDB

Code (with preserve_insertion_order set to true)

Result (with preserve_insertion_order set to true)

Code (with preserve_insertion_order set to false)

Result (with preserve_insertion_order set to false)

Conclusion

DataFusion

Code

Result

Conclusion

Spark

Code

Result

Conclusion

Code (with `maintain_order` set to `True`)

Result (with `maintain_order` set to `True`)

Code (with `maintain_order` set to `False`)

Result (with `maintain_order` set to `False`)

Code (with `preserve_insertion_order` set to `true`)

Result (with `preserve_insertion_order` set to `true`)

Code (with `preserve_insertion_order` set to `false`)

Result (with `preserve_insertion_order` set to `false`)