Our starting point is raw counts in a sparse matrix, smat (4 genes, 5 cells):
c1 c2 c3 c4 c5
g1 8 . . . 5
g2 . 5 7 . .
g3 1 . . 5 .
g4 8 . . 7 .
As part of our analysis we center/scale each gene (row) of the data, resulting in a dense matrix, dmat:
c1 c2 c3 c4 c5
g1 1.45363114 -0.6998965 -0.6998965 -0.6998965 0.6460583
g2 -0.71395694 0.7734534 1.3684175 -0.7139569 -0.7139569
g3 -0.09225312 -0.5535187 -0.5535187 1.7528093 -0.5535187
g4 1.21267813 -0.7276069 -0.7276069 0.9701425 -0.7276069
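For reference, the centering/scaling step can be sketched in plain Python. This is a minimal sketch that assumes each gene (row) is centered by its mean and divided by its sample standard deviation (n - 1 denominator), which reproduces the values shown above; the function name scale_rows is hypothetical.

```python
from statistics import mean, stdev

# Raw counts from smat, with empty cells written as 0
# (rows g1..g4, columns c1..c5).
smat = [
    [8, 0, 0, 0, 5],
    [0, 5, 7, 0, 0],
    [1, 0, 0, 5, 0],
    [8, 0, 0, 7, 0],
]

def scale_rows(counts):
    """Center each row by its mean, then divide by its sample standard deviation."""
    scaled = []
    for row in counts:
        mu, sd = mean(row), stdev(row)  # stdev uses the n - 1 denominator
        scaled.append([(value - mu) / sd for value in row])
    return scaled

dmat = scale_rows(smat)
for name, row in zip(["g1", "g2", "g3", "g4"], dmat):
    print(name, [round(value, 7) for value in row])
```

Note that a gene with constant counts has a standard deviation of 0 and would need special handling, which this sketch leaves out.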
Instead of storing all 20 values in dmat, I’m proposing we only store:
c1 c2 c3 c4 c5
g1 1.45363114 . . . 0.6460583
g2 . 0.7734534 1.368417 . .
g3 -0.09225312 . . 1.7528093 .
g4 1.21267813 . . 0.9701425 .
which corresponds to the 8 non-empty coordinates of smat.
Then in the metadata we store the gene/row-specific offsets that were applied to the previously empty cells. Concretely, g1 in smat contained 3 empty values. In dmat those 3 empty values have been replaced with the same offset, -0.6998965, because every empty (zero) cell in a row maps to (0 - mean) / sd for that row. For g2 the offset is -0.71395694, and so on.
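The offsets themselves can be computed from the raw counts alone. The following is a minimal sketch under the same assumption as before (per-row centering and scaling with the sample standard deviation); row_offsets is a hypothetical name.

```python
from statistics import mean, stdev

# Raw counts from smat, with empty cells written as 0.
smat = [
    [8, 0, 0, 0, 5],
    [0, 5, 7, 0, 0],
    [1, 0, 0, 5, 0],
    [8, 0, 0, 7, 0],
]

def row_offsets(counts):
    """Value that every empty (zero) cell of a row takes after centering/scaling."""
    return [(0 - mean(row)) / stdev(row) for row in counts]

offsets = row_offsets(smat)
print([round(value, 7) for value in offsets])
```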
On disk dmat would look something like this:
[array]
i j x
1 g1 c1 1.45363114
3 g3 c1 -0.09225312
4 g4 c1 1.21267813
6 g2 c2 0.77345335
10 g2 c3 1.36841747
15 g3 c4 1.75280930
16 g4 c4 0.97014250
17 g1 c5 0.64605828
[metadata]
offsets: -0.6998965, -0.71395694, -0.5535187, -0.7276069
And using this array/metadata we could reconstruct the original matrix, dmat, without any loss of information and without making any assumptions about how the data was generated.
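The reconstruction could be sketched as follows. This is only an illustration: it uses 0-based row/column indices instead of the gene/cell labels, assumes the number of columns is stored alongside the offsets, and the name reconstruct is hypothetical.

```python
# Stored non-empty entries as (row, column, value), 0-based,
# in the same order as the [array] listing above.
entries = [
    (0, 0, 1.45363114), (2, 0, -0.09225312), (3, 0, 1.21267813),
    (1, 1, 0.77345335),
    (1, 2, 1.36841747),
    (2, 3, 1.75280930), (3, 3, 0.97014250),
    (0, 4, 0.64605828),
]

# Per-row offsets from the [metadata] listing above.
offsets = [-0.6998965, -0.71395694, -0.5535187, -0.7276069]
n_cols = 5  # assumed to be available from the matrix dimensions

def reconstruct(entries, offsets, n_cols):
    """Fill each row with its offset, then overwrite the stored entries."""
    dense = [[offset] * n_cols for offset in offsets]
    for i, j, x in entries:
        dense[i][j] = x
    return dense

dmat = reconstruct(entries, offsets, n_cols)
for row in dmat:
    print(row)
```

Only 8 values plus 4 offsets are stored here instead of 20, and the saving grows with the proportion of empty cells in smat.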