Skip to content

Instantly share code, notes, and snippets.

@ianmcook
Last active August 20, 2024 21:20
Show Gist options
  • Save ianmcook/2cdf9c6b1afef8cf044eafe26bd1ec84 to your computer and use it in GitHub Desktop.
Save ianmcook/2cdf9c6b1afef8cf044eafe26bd1ec84 to your computer and use it in GitHub Desktop.
Examples demonstrating whether systems maintain row order

This is a set of examples demonstrating whether various Python and R dataframe libraries and OLAP query engines preserve (or do not preserve) the original order of the records in the data.

Example data

The examples all use this dataset describing the 28 times when a person walked on the moon:

year mission name minutes
1969 Apollo 11 Neil Armstrong 151
1969 Apollo 11 Buzz Aldrin 151
1969 Apollo 12 Pete Conrad 236
1969 Apollo 12 Alan Bean 236
1969 Apollo 12 Pete Conrad 229
1969 Apollo 12 Alan Bean 229
1971 Apollo 14 Alan Shepard 287
1971 Apollo 14 Edgar Mitchell 287
1971 Apollo 14 Alan Shepard 274
1971 Apollo 14 Edgar Mitchell 274
1971 Apollo 15 David Scott 392
1971 Apollo 15 James Irwin 392
1971 Apollo 15 David Scott 432
1971 Apollo 15 James Irwin 432
1971 Apollo 15 David Scott 289
1971 Apollo 15 James Irwin 289
1972 Apollo 16 John Young 431
1972 Apollo 16 Charles Duke 431
1972 Apollo 16 John Young 443
1972 Apollo 16 Charles Duke 443
1972 Apollo 16 John Young 340
1972 Apollo 16 Charles Duke 340
1972 Apollo 17 Gene Cernan 431
1972 Apollo 17 Harrison Schmitt 431
1972 Apollo 17 Gene Cernan 456
1972 Apollo 17 Harrison Schmitt 456
1972 Apollo 17 Gene Cernan 435
1972 Apollo 17 Harrison Schmitt 435

The rows of this data are implicitly ordered first by the Apollo mission (in chronological order from top to bottom), second by the order of moonwalks during the mission (earliest at the top), and third by the order in which the astronauts egressed from the lunar module during the first or only moonwalk of the mission (first at the top). See the included file moonwalks.csv which contains this data.

Operation: Return the unique values in a column

pandas

Code

import pandas as pd

moonwalks = pd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

pandas preserves the original row order.

Polars

Code (with maintain_order set to True)

import polars as pl

moonwalks = pl.read_csv("moonwalks.csv")

moonwalks.select("mission").unique(maintain_order=True)

Result (with maintain_order set to True)

shape: (6, 1)
┌───────────┐
│ mission   │
│ ---       │
│ str       │
╞═══════════╡
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
│ Apollo 15 │
│ Apollo 16 │
│ Apollo 17 │
└───────────┘

Code (with maintain_order set to False)

import polars as pl

moonwalks = pl.read_csv("moonwalks.csv")

moonwalks.select("mission").unique(maintain_order=False)

Result (with maintain_order set to False)

shape: (6, 1)
┌───────────┐
│ mission   │
│ ---       │
│ str       │
╞═══════════╡
│ Apollo 17 │
│ Apollo 14 │
│ Apollo 12 │
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 15 │
└───────────┘

Conclusion

Polars preserves the original row order only when maintain_order is set to True.

Dask DataFrame

Code

import dask.dataframe as dd

moonwalks = dd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates().compute()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

When the input index is monotonically increasing (as in this example), Dask DataFrame preserves the original row order.

Modin

Code

import modin.pandas as pd

moonwalks = pd.read_csv("moonwalks.csv")

moonwalks[["mission"]].drop_duplicates()

Result

      mission
0   Apollo 11
2   Apollo 12
6   Apollo 14
10  Apollo 15
16  Apollo 16
22  Apollo 17

Conclusion

Modin preserves the original row order.

dplyr

Code

library(dplyr)
library(readr)

moonwalks <- read_csv("moonwalks.csv")

moonwalks |> distinct(mission)

Result

# A tibble: 6 × 1
  mission  
  <chr>    
1 Apollo 11
2 Apollo 12
3 Apollo 14
4 Apollo 15
5 Apollo 16
6 Apollo 17

Conclusion

dplyr preserves the original row order.

data.table

Code

library(data.table)

moonwalks <- fread("moonwalks.csv")

unique(moonwalks[, .(mission)])

Result

     mission
      <char>
1: Apollo 11
2: Apollo 12
3: Apollo 14
4: Apollo 15
5: Apollo 16
6: Apollo 17

Conclusion

data.table preserves the original row order.

DuckDB

Code (with preserve_insertion_order set to true)

SET preserve_insertion_order = true;

SELECT DISTINCT(mission) FROM "moonwalks.csv";

Result (with preserve_insertion_order set to true)

┌───────────┐
│  mission  │
│  varchar  │
├───────────┤
│ Apollo 16 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 14 │
└───────────┘

Code (with preserve_insertion_order set to false)

SET preserve_insertion_order = false;

SELECT DISTINCT(mission) FROM "moonwalks.csv";

Result (with preserve_insertion_order set to false)

┌───────────┐
│  mission  │
│  varchar  │
├───────────┤
│ Apollo 16 │
│ Apollo 11 │
│ Apollo 12 │
│ Apollo 15 │
│ Apollo 17 │
│ Apollo 14 │
└───────────┘

Conclusion

DuckDB does not preserve the original row order, regardless of the value of preserve_insertion_order.

DataFusion

Code

CREATE EXTERNAL TABLE moonwalks
STORED AS CSV
LOCATION "moonwalks.csv"
OPTIONS ("has_header" "true");

SELECT DISTINCT(mission) FROM moonwalks;

Result

+-----------+
| mission   |
+-----------+
| Apollo 14 |
| Apollo 11 |
| Apollo 12 |
| Apollo 16 |
| Apollo 15 |
| Apollo 17 |
+-----------+

Conclusion

DataFusion does not preserve the original row order.

Spark

Code

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

moonwalks = spark.read.csv("moonwalks.csv", header=True, inferSchema=True)

moonwalks.select("mission").distinct().show()

Result

+---------+
|  mission|
+---------+
|Apollo 11|
|Apollo 16|
|Apollo 12|
|Apollo 14|
|Apollo 15|
|Apollo 17|
+---------+

Conclusion

Spark does not preserve the original row order.

year mission name minutes
1969 Apollo 11 Neil Armstrong 151
1969 Apollo 11 Buzz Aldrin 151
1969 Apollo 12 Pete Conrad 236
1969 Apollo 12 Alan Bean 236
1969 Apollo 12 Pete Conrad 229
1969 Apollo 12 Alan Bean 229
1971 Apollo 14 Alan Shepard 287
1971 Apollo 14 Edgar Mitchell 287
1971 Apollo 14 Alan Shepard 274
1971 Apollo 14 Edgar Mitchell 274
1971 Apollo 15 David Scott 392
1971 Apollo 15 James Irwin 392
1971 Apollo 15 David Scott 432
1971 Apollo 15 James Irwin 432
1971 Apollo 15 David Scott 289
1971 Apollo 15 James Irwin 289
1972 Apollo 16 John Young 431
1972 Apollo 16 Charles Duke 431
1972 Apollo 16 John Young 443
1972 Apollo 16 Charles Duke 443
1972 Apollo 16 John Young 340
1972 Apollo 16 Charles Duke 340
1972 Apollo 17 Gene Cernan 431
1972 Apollo 17 Harrison Schmitt 431
1972 Apollo 17 Gene Cernan 456
1972 Apollo 17 Harrison Schmitt 456
1972 Apollo 17 Gene Cernan 435
1972 Apollo 17 Harrison Schmitt 435
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment