David Stansby
In late 2022 I got a small development grant from NumFOCUS to scope the future of time-series data in sunpy. The successful application can be read on the sunpy wiki; it contains context that I won't repeat here.
The current document will be the key outcome of the small development grant, with a record of what I did, the recommendations I made, and any decisions we came to as a community.
While the document is in its current form, please feel free to leave comments.
The first stage of my work investigated what the user requirements are for a sunpy data container. As part of this I used my own experience and the following community engagement:
- Discussion at one of the weekly sunpy community meetings in December 2022
- Discussion on the Python in Heliophysics mailing list
- Discussion on the SunPy forum
From these discussions came the following list of requirements:
Requirement | Notes |
---|---|
Store data that is a function of time | This means the time column should be treated as the index or coordinates to the data, and be stored as a time-like type. |
Handle different time scales | Data can have times defined in a variety of different time scales (e.g. UTC, TAI) |
Store multi-dimensional data | Although time is a common index to timeseries data, it isn't always the only one. As an example, velocity distribution functions measured in the solar wind are 4D datasets, with data as a function of time and three dimensions in velocity space. |
Handle time scales with leap seconds | Some time scales can contain timestamps that occur within a leap second. |
Store and use physical units with the data and any non-time indices | |
Store data in a format that can be used with scientific Python libraries | |
Support for storing out-of-memory datasets | |
Store metadata alongside actual data | |
Have a way to store an observer coordinate alongside the time index | |
Have an easy way to do common data manipulation tasks | e.g. interpolating, resampling, rebinning |
Have a way to combine multiple timeseries objects, and keep track of metadata | |
Ability to convert to other common time series objects (e.g. pandas.DataFrame) | |
Functionality for loading and saving out to common file formats | |
The next step was to identify a set of possible data containers that could be used to store time-series data in sunpy. The identified options were:
- `astropy.timeseries.TimeSeries`
- `pandas.DataFrame`
- `xarray.DataArray` (or `xarray.Dataset`)
- `numpy.ndarray`
- `ndcube`
I also looked at what Python in Heliophysics projects use (as of writing, in Jan 2023):
Package | Container |
---|---|
sunpy | Custom TimeSeries object, backed by pandas.DataFrame |
HAPI Client | numpy.ndarray |
pySPEDAS | Not sure, can users actually get at the data itself? |
spacepy | Unclear if there is any specific timeseries container object? |
aidapy | xarray.DataArray |
cdflib | numpy.ndarray |
NDCube | NDCube |
pytplot | xarray.DataArray |
solo-epd-loader | pandas.DataFrame |
speasy | Custom DataContainer object, backed by numpy.ndarray |
There is no common container in use. Of the possible options listed above, only astropy.TimeSeries is not used by any of these packages.
sunpy currently has built-in support for reading CDF files that conform to the Space Physics Guidelines for CDF, as long as the dataset is one- or two-dimensional. Alongside this, several custom data readers have been written to support different data sources:
(links point to the data source information web page)
Data product(s) | File format |
---|---|
SDO EVE/ESP L1 | FITS |
SDO EVE/ESP L0CS | Text file |
FERMI GBM summary | FITS |
GOES XRS | FITS, netCDF |
PROBA-2 LYRA lightcurve | FITS |
NOAA solar cycle monthly indices | JSON |
NOAA solar cycle predicted indices | JSON |
NoRH radio | FITS |
RHESSI x-ray summary | FITS |
Having found possible options, in this section I've evaluated them against the criteria set out above.
numpy.ndarray | Rating | Notes |
---|---|---|
Time-like index data | not green | Can store datetime64 data, but no support for indexes |
Different time scales | not green | No support |
Multi-dimensional data | 🟩 | |
Physical units | not green | No support |
Interop with scientific Python | 🟩 | |
Out of memory | not green | numpy arrays are always in memory |
Metadata | not green | No support |
Observer coordinates | not green | No support |
Easy data manipulation | | |
I/O | not green | Can save to binary .npy format or text file |

pandas.DataFrame | Rating | Notes |
---|---|---|
Time-like index data | 🟩 | |
Different time scales | not green | No support |
Multi-dimensional data | not green | Possible, but recommended to use xarray instead |
Physical units | not green | No native support (tracking issue), could be possible with pint-pandas |
Interop with scientific Python | 🟩 | |
Out of memory | not green | pandas DataFrames are always in memory |
Metadata | 🟩 | Possible to add additional properties to a DataFrame |
Observer coordinates | not green | No support |
Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
I/O | 🟩 | Lots of I/O options |

xarray.DataArray | Rating | Notes |
---|---|---|
Time-like index data | 🟩 | |
Different time scales | not green | No support |
Multi-dimensional data | 🟩 | |
Physical units | not green | No native support (tracking issue), could be possible with pint-xarray |
Interop with scientific Python | 🟩 | |
Out of memory | 🟩 | Support for computing using dask |
Metadata | 🟩 | Possible to add metadata to a DataArray |
Observer coordinates | not green | Support for adding "non-dimensional" coordinates (e.g. longitude/latitude), but not clear if storing astropy SkyCoord would work |
Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
I/O | 🟩 | Lots of I/O options |

astropy.TimeSeries | Rating | Notes |
---|---|---|
Time-like index data | 🟩 | |
Different time scales | 🟩 | |
Multi-dimensional data | not green | |
Physical units | 🟩 | |
Interop with scientific Python | 🟩 | |
Out of memory | not green | As far as I can tell, data has to be loaded into memory |
Metadata | 🟩 | Can store on the .meta attribute |
Observer coordinates | 🟩 | Support for adding "non-dimensional" coordinates (e.g. longitude/latitude), but not clear if storing astropy SkyCoord would work |
Easy data manipulation | not green | |
I/O | 🟩 | Lots of options via the astropy.table API |

NDCube | Rating | Notes |
---|---|---|
Time-like index data | 🟩 | |
Different time scales | 🟩 | |
Multi-dimensional data | 🟩 | |
Physical units | 🟩 | |
Interop with scientific Python | 🟩 | |
Out of memory | not green | Seems to be supported in theory, but little documentation |
Metadata | 🟩 | Can store arbitrary FITS metadata |
Observer coordinates | 🟩 | Support using the .extra_coords attribute |
Easy data manipulation | not green | Very few manipulation methods implemented |
I/O | not green | |
- `numpy.ndarray` doesn't implement several key features, and these are almost certainly out of scope for future `ndarray` development, so I suggest `ndarray` is discounted.
- `xarray.DataArray` builds on top of `pandas.DataFrame` with additional features that would be useful to us, so I suggest `pandas.DataFrame` is discounted.
- `NDCube` is designed specifically to store data that is associated with a FITS world coordinate system (WCS). While some solar timeseries data is already in the FITS format, a large portion is in CDF format, which is tabular, and FITS is not primarily designed to represent tabular data. So I suggest `NDCube` is discounted.
This leaves us with `astropy.TimeSeries` and `xarray.DataArray`, with the following comparison:
Requirement | astropy.TimeSeries | xarray.DataArray |
---|---|---|
Time-like index data | 🟩 | 🟩 |
Different time scales | 🟩 | not green |
Multi-dimensional data | not green | 🟩 |
Physical units | 🟩 | not green |
Interop with scientific Python | 🟩 | 🟩 |
Out of memory | not green | 🟩 |
Metadata | 🟩 | 🟩 |
Observer coordinates | not green | not green |
Easy data manipulation | not green | 🟩 |
I/O | 🟩 | 🟩 |
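As an illustration of several of the `xarray.DataArray` entries in this comparison, the sketch below builds a small two-dimensional dataset (time by velocity), attaches metadata and a non-dimensional coordinate, resamples it in time, and converts it to a `pandas.DataFrame`. It assumes numpy, pandas and xarray are installed. All names and values (`observer_lon`, the velocity bins, the `instrument` attribute) are hypothetical and are not sunpy conventions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical axes: six timestamps at 10 minute cadence, three velocity bins
times = pd.date_range("2023-01-01", periods=6, freq="10min")
velocity = np.array([300.0, 400.0, 500.0])

# Illustrative data, shape (time, velocity)
data = np.arange(18, dtype=float).reshape(6, 3)

da = xr.DataArray(
    data,
    dims=["time", "velocity"],
    coords={
        "time": times,
        "velocity": velocity,
        # "Non-dimensional" coordinate that varies along the time axis
        "observer_lon": ("time", np.linspace(0.0, 5.0, 6)),
    },
    # Free-form metadata stored alongside the data
    attrs={"instrument": "hypothetical", "velocity_units": "km/s"},
)

# Rebin onto a 30 minute cadence by averaging within each bin
resampled = da.resample(time="30min").mean()

# A 2D DataArray converts directly to a pandas.DataFrame
df = da.to_pandas()

print(resampled.shape, df.shape)
```

Note that the units here are only recorded as a string in `attrs`; nothing enforces or propagates them, which is the gap the physical units row refers to.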
My initial recommendation would be to adopt `xarray.DataArray`, as there are more green items compared to `astropy.TimeSeries`. I also think the two red items have the possibility of being solved with `DataArray`:
- It should (I haven't confirmed this) be possible to convert times in different time scales (including ones with leap seconds) to a single time scale that doesn't have leap seconds, and store this in an `xarray.DataArray`.
- Alternatively, it is possible to use `ExtensionArray`s to extend the data types used for a pandas `Index`, which is the data type used to index `xarray.DataArray`. I haven't checked yet if it's possible to use `ExtensionArray`s to store astropy `Time`-like objects, and therefore support different time scales without conversion.
- Although there is no native support for units in `DataArray` currently, there is interest and ongoing development to support them.
Finally, `xarray` has a much bigger development community than `astropy.TimeSeries`, so implementing bug fixes and new features would probably be much easier with `xarray`.
One immediate comment is that ndcube very much does have support for non-dimensional coordinates via the `.extra_coords` property.
I am sure we will discuss this later, but I feel this lacks any discussion of the relative importance of the requirements. Also, I would love to know what it would take to add astropy time support to pandas and xarray as indices.