Few days ago Mike Anderson wrote a proposal for a generic matrix API in Clojure which could compete with NumPy. I wanted to write a similar post for months, but was reluctant to start. This problem is very dear to me. Basically, it is a matter if I can use Clojure for most of my work, or it remains a toy language for me. Thanks to Mike for bringing the question up. Though, I have a different vision of how we should approach arrays and matrices in Clojure.
The strong point of Python for scientific computing is that the community has managed to agree on one array abstraction used across multiple libraries (IMO, the lack of such an agreement is the weekest point of Haskell). It is exactly the case of having thousands of functions operating on one data structure (at least from the end-used view point).
And it gives a huge boost to composability. It's not uncommon to mix
three or four libraries built for different problem domains, and glue
them together in just few lines of Python
code. numpy.ndarray
proved to be an abstraction which is
enough for most of the problem domains, but not too specific.
Let's look at ndarray
:
- it appears to be a fixed size multi-dimensional container,
- it is often just a view to another array or an object in memory,
- the number of dimensions and items is defined by its shape tuple,
- all items are numeric values of the same type and size,
- the contents can be accessed by indexing and slicing,
- there are many ways to operate on the array as a whole (like element-by-element ops),
- its in-memory storage layout is interoperable with C and Fortran libraries.
In the future NumPy will go well beyond this abstraction, but
it is ndarray
which made Python the language of choice of many for
numerical computation.
Please note what ndarray
is not a matrix, nor a tensor. It
implements some linear algebra methods, and some statistics
methods, but use of additional specialized modules is
required to do more complex stuff.
I suppose the purpose of the common multi-dimensional array abstractions is to:
-
make an array-like output of any Clojure/Java library consumable by any other library
-
allow for zero-copying composition of views over arbitrary array-like values
-
allow for building array-like values from Clojure collections and other array-like values
-
favor array-level operations (arrays as values), rather than explicit iteration over elements,
Unlike in NumPy, in Clojure we cannot force one conventional storage model. We have an opposite problem of having many storage models to support.
- allow for destructive updates and implement all the API for at least one "standard" storage type (and promote copy-on-write usage pattern).
There are always some special-purpose array implementations, like sparse arrays or analytically defined matrices. The abstraction should not be too restrictive about how the underlying storage works.
Complexity:
- API/protocol should be as simple and orthogonal as possible, so it is easy to implement its support for new array-like types.
So I think that the basic Clojure array abstraction should behave like a generic multi-dimensional container; linear algebra, image processing, signal processing libraries etc. should not be part of it.
So the core array API would be something like:
-
Random access for an arbitrary number of dimensions.
(get my-3d-array [42 17 1])
The last index is supposed to be the most frequently changing index.
The API should not mention things like rows or columns explicitly. Instead only a generic
shape
function should be available:(shape my-3d-array) ;; => [100 100 3] for a 100x100x3 array
shape
function is enough to know the size of the array(apply * (shape ...))
, its rank(count (shape ...))
and every dimension. -
Slicing/striding syntax to produce partial array views, for example:
(shape (get my-3d-array [[:from 50], [:every 2], 0])) ;; => [50 50 1]
or probably with a separate slicing function (not
get
) and with some shortcuts:(shape (slice my-3d-array [:all [:from 50 :every 2] :last])) ;; => [100 25 1]
-
Index-space transforms. Many array operations (like transposition, cyclic shifts, reversing etc.) are just index-space transforms;
index-map
may produce new array views:(index-map (comp vec reverse) my-2d-array) ;; transpose
-
Changing shape could be done with
index-map
only, but then we need to deduce too much about the image of the index transform; having a separatereshape
is probably more pragmatic:(reshape my-array (apply * (shape my-array))) ;; flatten (reshape my-100x100-array [1000 10])
-
Stacking along an arbitrary axis; stacking for arrays is like concatenation and
zipmap
for sequences:(stack 0 array1 array2) ;; along the first axis (stack nil array-2d array-2d array-2d) ;; => 3D array
-
Lifting (
map
) scalar functions to element-by-element functions; support multiple-arity functions too:(map + my-array-spam my-array-eggs)
I suppose, lifting may make the hypothetical Clojure
ndarray
better than its Python namesake. In Python it works only for predefined binary arithmetic operators and some predefined unary functions written in C. It requires to write list comprehensions/loops in Python and lose performance benefits of NumPy.A similar concept is applying a function along some dimension:
(apply-dim 2 (fn [rgb] (apply + (map * [0.2126 0.7152 0.0722] rgb))) my-image-rgb) ;; linear RGB to BT.709 luminance
-
A function to explicitly copy/change storage/make contiguous/force view:
(copy my-array-view) ;; now contiguous and mutable
-
Converting “multi-dimensional” collections (lists of lists, vectors of vectors etc) to a multi-dimensional array:
(to-ndarray [[1 2] [3 4]])
It should create a contiguous mutable storage by default.
I counted my most used NumPy functions
recently. Not counting slicing and arithmetic
operations, they are: asarray
, linspace
, mean
,
dot
, sqrt
, roll
, hstack
, loadtxt
, linalg.norm
.
My overuse of asarray
is probably due to how universal (unary
element-by-element) functions are implemented in NumPy. They should be
written in C, and only a limited set of them is available out of the
box. If I cannot express the algorithm using existing UFuncs, masked
arrays, array stacking, reshaping etc, then I run list comprehensions
to do anything I want in Python, and convert the result to ndarray
with asarray
. Hopefully, we can do better in Clojure.
It indicates the priority features in building a usable Clojure
ndarray
suite:
- lifting element-level functions to element-wise array functions
- building arrays from arrays (cyclic shifts, stacking, ranges)
- basic input-output (
loadtxt
/savetxt
,imread
/imwrite
are useful too) - some linear algebra and some common statistics/reductions.
I suppose lifting and building array views should be part of the array API (see above).
Other features to consider for the core API are:
- search by value or by condition (like
argmin
,argmax
in NumPy) - zero-copy sorting (like
argsort
in NumPy) -- can replace search too - element type query (like
.dtype
) and conversion (probably into-ndarray
) - full and partial reductions (along some axis)
- masking and choice
Provided that arrays are iterable, the core API seems to be enough to
implement almost all features of the NumPy ndarray
. A separate set
of utility functions needs to be provided for the end user (things
like transpose
).
Strictly speaking, reshape
is not orthogonal to index-map
, but
it's easier to implement it as a separate function; apply-dim
is
redundant if we have slice
and map
, but I suppose some
implementations may optimize it; slice
can be merged with get
;
copy
and to-ndarray
can be merged too, no-op conversion may be
moved to utilities.
Some candidates are redundant too. I suppose that complete reductions
could be done with reduce
if arrays behave like
iterables. clojure.core/map
+ slice
+ reduce
+ stack
would do
the work of partial reductions (along some axis), but some
implementations may do better.
Input-output can be implemented as utility functions using the core API.
Linear algebra is optional, and only some storage models may implement it correctly. It might be a good idea to decide on similar extended API with the support for linear algebra. Providing generic unoptimized implementations as a fallback is another solution.
Choice can be covered by stacking and apply-dim
. Masking is
equivalent to lifting a conditional function.
I suppose the good approach is to start with an almost orthogonal core API, and try to implement all utility functions on top of it. Extend the API, if some functions are too inefficient to do in Clojure.
I still don't know what Java implementation should be the default storage type. I suppose it is better to deal with a contiguous 1D storage by default, and convert it to other types if one starts doing things like linear algebra.
Commons Math's ArrayRealVector is my prime candidate. It doesn't support integer-valued arrays unfortunately.
+1