A dataframe is a 2 dimensional data structure, a "list of lists", where the items in the list can be
of any type. Dataframes are similar to a spreadsheet or a SQL table. The word "dataframe" was
popularized by pandas
, the widely-used Python library, and R, a programming language devoted to
data analysis.
# A dataframe where column 0 is a string and column 1 is a float
[["hello", 10.2]
["world", 0.1]]
What makes dataframes special is that they provide an API for programmers to do things with the contents.
Dataframes let you
- query for contents
- slice or subset rows
- perform common statistics tasks
- easily serialize to csv
Dataframes really shine in interpreted languages with first-class support for interfactive programming. Unlike other kinds of programming, data analysis involves a lot of trial and error. Why? Because
- data sets are messy, so usually one needs to poke around first
- many analyses are dead ends; a lot of stuff doesn't pan out, so there is less of a demand for production-level development workflow
Having a live interpreter also lets one
- print and visualize data quickly
- dynamically add and store dataframes (and other objects) in the environment