xarray-einstats

Documentation Status
Code style: black
PyPI

Stats, linear algebra and einops for xarray

⚠️ Caution: This project is still in a very early development stage

Installation

To install, run

(.venv) $ pip install xarray-einstats

Overview

As stated in their website:

xarray makes working with multi-dimensional labeled arrays simple, efficient and fun!

The code is often more verbose, but it is generally because it is clearer and thus less error prone
and intuitive. Here are some examples of such trade-off:

numpy xarray
a[2, 5] da.sel(drug="paracetamol", subject=5)
a.mean(axis=(0, 1)) da.mean(dim=("chain", "draw"))

In some other cases however, using xarray can result in overly verbose code
that often also becomes less clear. xarray-einstats provides wrappers
around some numpy and scipy functions (mostly numpy.linalg and scipy.stats)
and around einops with an api and features adapted to xarray.

% ⚠️ Attention: A nicer rendering of the content below is available at our documentation

Data for examples

The examples in this overview page use the DataArrays from the Dataset below
(stored as ds variable) to illustrate xarray-einstats features:

<xarray.Dataset>
Dimensions:  (dim_plot: 50, chain: 4, draw: 500, team: 6)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
Data variables:
    x_plot   (dim_plot) float64 0.0 0.2041 0.4082 0.6122 ... 9.592 9.796 10.0
    atts     (chain, draw, team) float64 0.1063 -0.01913 ... -0.2911 0.2029
    sd_att   (draw) float64 0.272 0.2685 0.2593 0.2612 ... 0.4112 0.2117 0.3401

Stats

xarray-einstats provides two wrapper classes {class}xarray_einstats.XrContinuousRV
and {class}xarray_einstats.XrDiscreteRV that can be used to wrap any distribution
in {mod}scipy.stats so they accept {class}~xarray.DataArray as inputs.

We can evaluate the logpdf using inputs that wouldn’t align if using numpy
in a couple lines:

norm_dist = xarray_einstats.XrContinuousRV(scipy.stats.norm)
norm_dist.logpdf(ds["x_plot"], ds["atts"], ds["sd_att"])

which returns:

<xarray.DataArray (dim_plot: 50, chain: 4, draw: 500, team: 6)>
array([[[[ 3.06470249e-01,  3.80373065e-01,  2.56575936e-01,
...
          -4.41658154e+02, -4.57599982e+02, -4.14709280e+02]]]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot

einops

only rearrange wrapped for now

einops uses a convenient notation inspired in
Einstein notation to specify operations on multidimensional arrays.
It uses spaces as a delimiter between dimensions, parenthesis to
indicate splitting or stacking of dimensions and -> to separate
between input and output dim specification. einstats uses
an adapted notation then translates to einops and calls {func}xarray.apply_ufunc
under the hood.

Why change the notation? There are three main reasons, each concerning one
of the elements respectively: ->, space as delimiter and parenthesis:

  • In xarray dimensions are already labeled. In many cases, the left
    side in the einops notation is only used to label the dimensions.
    In fact, 5/7 examples in https://einops.rocks/api/rearrange/ fall in this category.
    This is not necessary when working with xarray objects.
  • In xarray dimension names can be any {term}xarray:hashable. xarray-einstats only
    supports strings as dimension names, but the space can’t be used as delimiter.
  • In xarray dimensions are labeled and the order doesn’t matter.
    This might seem the same as the first reason but it is not. When splitting
    or stacking dimensions you need (and want) the names of both parent and children dimensions.
    In some cases, for example stacking, we can autogenerate a default name, but
    in general you’ll want to give a name to the new dimension. After all,
    dimension order in xarray doesn’t matter and there isn’t much to be done without knowing
    the dimension names.

xarray-einstats uses two separate arguments, one for the input pattern (optional) and
another for the output pattern. Each is a list of dimensions (strings)
or dimension operations (lists or dictionaries). Some examples:

We can combine the chain and draw dimensions and name the resulting dimension sample
using a list with a single dictionary. The team dimension is not present in the pattern
and is not modified.

rearrange(ds.atts, [{"sample": ("chain", "draw")}])

Out:

<xarray.DataArray 'atts' (team: 6, sample: 2000)>
array([[ 0.10632395,  0.1538294 ,  0.17806237, ...,  0.16744257,
         0.14927569,  0.21803568],
         ...,
       [ 0.30447644,  0.22650416,  0.25523419, ...,  0.28405435,
         0.29232681,  0.20286656]])
Coordinates:
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample

Note that following xarray convention, new dimensions and dimensions on which we operated
are moved to the end. This only matters when you access the underlying array with .values
or .data and you can always transpose using {meth}xarray.Dataset.transpose, but
it can matter. You can change the pattern to enforce the output dimension order:

rearrange(ds.atts, [{"sample": ("chain", "draw")}, "team"])

Out:

<xarray.DataArray 'atts' (sample: 2000, team: 6)>
array([[ 0.10632395, -0.01912607,  0.13671159, -0.06754783, -0.46083807,
         0.30447644],
       ...,
       [ 0.21803568, -0.11394285,  0.09447937, -0.11032643, -0.29111234,
         0.20286656]])
Coordinates:
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample

Now to a more complicated pattern. We will split the chain and draw dimension,
then combine those split dimensions between them.

rearrange(
    ds.atts,
    # combine split chain and team dims between them
    # here we don't use a dict so the new dimensions get a default name
    out_dims=[("chain1", "team1"), ("team2", "chain2")],
    # use dicts to specify which dimensions to split, here we *need* to use a dict
    in_dims=[{"chain": ("chain1", "chain2")}, {"team": ("team1", "team2")}],
    # set the lengths of split dimensions as kwargs
    chain1=2, chain2=2, team1=2, team2=3
)

Out:

<xarray.DataArray 'atts' (draw: 500, chain1,team1: 4, team2,chain2: 6)>
array([[[ 1.06323952e-01,  2.47005252e-01, -1.91260714e-02,
         -2.55769582e-02,  1.36711590e-01,  1.23165119e-01],
...
        [-2.76616968e-02, -1.10326428e-01, -3.99582340e-01,
         -2.91112341e-01,  1.90714405e-01,  2.02866563e-01]]])
Coordinates:
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
Dimensions without coordinates: chain1,team1, team2,chain2

More einops examples at {ref}einops

Linear Algebra

Still missing in the package

There is no one size fits all solution, but knowing the function
we are wrapping we can easily make the code more concise and clear.
Without xarray-einstats, to invert a batch of matrices stored in a 4d
array you have to do:

inv = xarray.apply_ufunc(   # output is a 4d labeled array
    numpy.linalg.inv,
    batch_of_matrices,      # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    output_core_dims=[["matrix_dim", "matrix_dim_bis"]]
)

to calculate it’s norm instead, it becomes:

norm = xarray.apply_ufunc(  # output is a 2d labeled array
    numpy.linalg.norm,
    batch_of_matrices,      # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
)

With xarray-einstats, those operations become:

inv = xarray_einstats.inv(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
norm = xarray_einstats.norm(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))

Similar projects

Here we list some similar projects we know of. Note that all of
them are complementary and don’t overlap:

GitHub

View Github