# xarray-einstats

Stats, linear algebra and einops for xarray

⚠️ Caution: This project is still in a very early development stage

## Installation

To install, run

```
(.venv) $ pip install xarray-einstats
```

## Overview

As stated on their website:

> xarray makes working with multi-dimensional labeled arrays simple, efficient and fun!

The code is often more verbose, but that is generally because it is clearer and thus less error prone and more intuitive. Here are some examples of this trade-off:

numpy | xarray |
---|---|
`a[2, 5]` | `da.sel(drug="paracetamol", subject=5)` |
`a.mean(axis=(0, 1))` | `da.mean(dim=("chain", "draw"))` |
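The trade-off in the table can be tried on a small synthetic array. This is a minimal sketch: the array shape, the variable names and the three team labels are illustrative assumptions, not data from this project.

```python
import numpy as np
import xarray as xr

# Synthetic data; the shape and the team labels are illustrative assumptions
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 10, 3))
da = xr.DataArray(
    a,
    dims=("chain", "draw", "team"),
    coords={"team": ["Wales", "France", "Ireland"]},
)

# numpy: the reader has to remember that axes 0 and 1 are chain and draw
np_mean = a.mean(axis=(0, 1))
# xarray: the reduced dimensions are named explicitly
xr_mean = da.mean(dim=("chain", "draw"))

# numpy: purely positional indexing
np_value = a[2, 5, 1]
# xarray: label-based selection for team, position-based for chain and draw
xr_value = da.sel(team="France").isel(chain=2, draw=5).item()
```

Both versions compute the same numbers; the xarray one is longer but states which dimensions are involved.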

In some other cases however, using xarray can result in overly verbose code that often also becomes less clear. `xarray-einstats` provides wrappers around some numpy and scipy functions (mostly `numpy.linalg` and `scipy.stats`) and around einops with an api and features adapted to xarray.

% ⚠️ Attention: A nicer rendering of the content below is available at our documentation

### Data for examples

The examples in this overview page use the `DataArray`s from the `Dataset` below (stored as the `ds` variable) to illustrate `xarray-einstats` features:

```
<xarray.Dataset>
Dimensions: (dim_plot: 50, chain: 4, draw: 500, team: 6)
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
Data variables:
x_plot (dim_plot) float64 0.0 0.2041 0.4082 0.6122 ... 9.592 9.796 10.0
atts (chain, draw, team) float64 0.1063 -0.01913 ... -0.2911 0.2029
sd_att (draw) float64 0.272 0.2685 0.2593 0.2612 ... 0.4112 0.2117 0.3401
```
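To follow along without the original data, a dataset with the same structure can be built from random numbers. This is only a sketch: the values are synthetic, the random distributions are arbitrary choices, and the team labels are placeholders because the printed output above elides some of them.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Same variable names, dimensions and sizes as the Dataset printed above;
# all values except x_plot are random stand-ins
ds = xr.Dataset(
    {
        "x_plot": ("dim_plot", np.linspace(0, 10, 50)),
        "atts": (("chain", "draw", "team"), rng.normal(scale=0.3, size=(4, 500, 6))),
        "sd_att": ("draw", rng.uniform(0.2, 0.45, size=500)),
    },
    coords={
        "chain": np.arange(4),
        "draw": np.arange(500),
        # placeholder labels, not the real team names
        "team": [f"team_{i}" for i in range(6)],
    },
)
```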

### Stats

`xarray-einstats` provides two wrapper classes, {class}`xarray_einstats.XrContinuousRV` and {class}`xarray_einstats.XrDiscreteRV`, that can be used to wrap any distribution in {mod}`scipy.stats` so they accept {class}`~xarray.DataArray` as inputs.

We can evaluate the logpdf using inputs that wouldn't align if using numpy in a couple of lines:

```
import scipy.stats
import xarray_einstats
norm_dist = xarray_einstats.XrContinuousRV(scipy.stats.norm)
norm_dist.logpdf(ds["x_plot"], ds["atts"], ds["sd_att"])
```

which returns:

```
<xarray.DataArray (dim_plot: 50, chain: 4, draw: 500, team: 6)>
array([[[[ 3.06470249e-01, 3.80373065e-01, 2.56575936e-01,
...
-4.41658154e+02, -4.57599982e+02, -4.14709280e+02]]]])
Coordinates:
* chain (chain) int64 0 1 2 3
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
```
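To see why these inputs "wouldn't align" in numpy, the sketch below evaluates the same kind of logpdf on synthetic arrays, first with hand-inserted axes and then with plain `xarray.apply_ufunc`, which broadcasts by dimension name. The data are random stand-ins, and this is not meant to describe how `XrContinuousRV` is implemented.

```python
import numpy as np
import xarray as xr
from scipy import stats

rng = np.random.default_rng(0)
x_plot = np.linspace(0, 10, 50)                  # (dim_plot,)
atts = rng.normal(scale=0.3, size=(4, 500, 6))   # (chain, draw, team)
sd_att = rng.uniform(0.2, 0.45, size=500)        # (draw,)

# numpy: insert length-1 axes by hand so the shapes broadcast to
# (dim_plot, chain, draw, team)
logpdf_np = stats.norm.logpdf(
    x_plot[:, None, None, None],
    loc=atts[None, :, :, :],
    scale=sd_att[None, None, :, None],
)

# xarray: the same inputs broadcast by dimension name
logpdf_xr = xr.apply_ufunc(
    stats.norm.logpdf,
    xr.DataArray(x_plot, dims="dim_plot"),
    xr.DataArray(atts, dims=("chain", "draw", "team")),
    xr.DataArray(sd_att, dims="draw"),
)
```

The numpy version only works if every `None` is placed correctly; the labeled version needs no knowledge of axis positions.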

### einops

**only rearrange wrapped for now**

einops uses a convenient notation inspired by Einstein notation to specify operations on multidimensional arrays. It uses spaces as a delimiter between dimensions, parentheses to indicate splitting or stacking of dimensions and `->` to separate the input and output dimension specifications. `einstats` uses an adapted notation, translates it to einops and calls {func}`xarray.apply_ufunc` under the hood.

Why change the notation? There are three main reasons, each concerning one of the elements respectively: `->`, space as delimiter and parentheses:

- In xarray dimensions are already labeled. In many cases, the left side in the einops notation is only used to label the dimensions. In fact, 5/7 examples in https://einops.rocks/api/rearrange/ fall in this category. This is not necessary when working with xarray objects.
- In xarray dimension names can be any {term}`xarray:hashable`. `xarray-einstats` only supports strings as dimension names, but since those strings may contain spaces, the space can't be used as a delimiter.
- In xarray dimensions are labeled and the order doesn't matter. This might seem the same as the first reason but it is not. When splitting or stacking dimensions you need (and want) the names of both parent and children dimensions. In some cases, for example stacking, we can autogenerate a default name, but in general you'll want to give a name to the new dimension. After all, dimension order in xarray doesn't matter and there isn't much to be done without knowing the dimension names.

`xarray-einstats` uses two separate arguments, one for the input pattern (optional) and another for the output pattern. Each is a list of dimensions (strings) or dimension operations (lists or dictionaries). Some examples:

We can combine the chain and draw dimensions and name the resulting dimension `sample` using a list with a single dictionary. The `team` dimension is not present in the pattern and is not modified.

`rearrange(ds.atts, [{"sample": ("chain", "draw")}])`

Out:

```
<xarray.DataArray 'atts' (team: 6, sample: 2000)>
array([[ 0.10632395, 0.1538294 , 0.17806237, ..., 0.16744257,
0.14927569, 0.21803568],
...,
[ 0.30447644, 0.22650416, 0.25523419, ..., 0.28405435,
0.29232681, 0.20286656]])
Coordinates:
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
```

Note that following xarray convention, new dimensions and dimensions on which we operated are moved to the end. This only matters when you access the underlying array with `.values` or `.data` and you can always transpose using {meth}`xarray.Dataset.transpose`, but it can matter. You can change the pattern to enforce the output dimension order:

`rearrange(ds.atts, [{"sample": ("chain", "draw")}, "team"])`

Out:

```
<xarray.DataArray 'atts' (sample: 2000, team: 6)>
array([[ 0.10632395, -0.01912607, 0.13671159, -0.06754783, -0.46083807,
0.30447644],
...,
[ 0.21803568, -0.11394285, 0.09447937, -0.11032643, -0.29111234,
0.20286656]])
Coordinates:
* team (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
```
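For comparison, a similar result can be obtained with plain xarray using `stack` followed by `transpose`. The sketch below runs on synthetic data with the same dimensions as `atts`. One difference from the outputs shown above: `stack` attaches a MultiIndex coordinate to `sample`.

```python
import numpy as np
import xarray as xr

# Random stand-in with the same dimensions as ds.atts
rng = np.random.default_rng(0)
values = rng.normal(size=(4, 500, 6))
atts = xr.DataArray(values, dims=("chain", "draw", "team"))

# The stacked dimension is placed last: dims are (team, sample)
stacked = atts.stack(sample=("chain", "draw"))

# An explicit transpose enforces the (sample, team) order
reordered = stacked.transpose("sample", "team")
```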

Now to a more complicated pattern. We will split the chain and team dimensions, then combine the resulting split dimensions with each other.

```
rearrange(
    ds.atts,
    # combine split chain and team dims between them
    # here we don't use a dict so the new dimensions get a default name
    out_dims=[("chain1", "team1"), ("team2", "chain2")],
    # use dicts to specify which dimensions to split, here we *need* to use a dict
    in_dims=[{"chain": ("chain1", "chain2")}, {"team": ("team1", "team2")}],
    # set the lengths of split dimensions as kwargs
    chain1=2, chain2=2, team1=2, team2=3
)
```

Out:

```
<xarray.DataArray 'atts' (draw: 500, chain1,team1: 4, team2,chain2: 6)>
array([[[ 1.06323952e-01, 2.47005252e-01, -1.91260714e-02,
-2.55769582e-02, 1.36711590e-01, 1.23165119e-01],
...
[-2.76616968e-02, -1.10326428e-01, -3.99582340e-01,
-2.91112341e-01, 1.90714405e-01, 2.02866563e-01]]])
Coordinates:
* draw (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
Dimensions without coordinates: chain1,team1, team2,chain2
```
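The split-then-combine pattern can be spelled out step by step on a plain numpy array. This sketch assumes the usual einops semantics (row-major splitting and stacking), which is consistent with the values printed above; the data here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
atts = rng.normal(size=(4, 500, 6))  # (chain, draw, team)

# 1. split: chain -> (chain1, chain2) and team -> (team1, team2)
split = atts.reshape(2, 2, 500, 2, 3)  # (chain1, chain2, draw, team1, team2)

# 2. reorder so that the axes to be combined sit next to each other
ordered = split.transpose(2, 0, 3, 4, 1)  # (draw, chain1, team1, team2, chain2)

# 3. combine: (chain1, team1) -> length 4 and (team2, chain2) -> length 6
combined = ordered.reshape(500, 4, 6)
```

Writing the three steps by hand makes it clear how much bookkeeping on axis positions the pattern-based call avoids.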

More einops examples at {ref}`einops`

### Linear Algebra

**Still missing in the package**

There is no one size fits all solution, but knowing the function we are wrapping we can easily make the code more concise and clear.

Without `xarray-einstats`, to invert a batch of matrices stored in a 4d array you have to do:

```
inv = xarray.apply_ufunc(  # output is a 4d labeled array
    numpy.linalg.inv,
    batch_of_matrices,  # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    output_core_dims=[["matrix_dim", "matrix_dim_bis"]]
)
```

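To calculate the norm of each matrix instead, it becomes:

```
norm = xarray.apply_ufunc(  # output is a 2d labeled array
    numpy.linalg.norm,
    batch_of_matrices,  # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    # core dims are moved to the end; without `axis`, numpy.linalg.norm
    # would flatten the whole array and return a single scalar
    kwargs={"axis": (-2, -1)},
)
```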

With `xarray-einstats`, those operations become:

```
inv = xarray_einstats.inv(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
norm = xarray_einstats.norm(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
```
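The plain `xarray.apply_ufunc` approach can be checked on a small synthetic batch. In this sketch the batch shape and the `chain`/`draw` batch dimension names are illustrative assumptions, and the identity shift is only there to keep the random matrices well conditioned. Only numpy and xarray are used, since the wrappers are not yet in the package.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# 4 x 10 batch of 3 x 3 matrices, shifted by 5 * I to keep them invertible
batch_of_matrices = xr.DataArray(
    rng.normal(size=(4, 10, 3, 3)) + 5 * np.eye(3),
    dims=("chain", "draw", "matrix_dim", "matrix_dim_bis"),
)

# inv acts on the last two axes, which is where apply_ufunc puts the core dims
inv = xr.apply_ufunc(
    np.linalg.inv,
    batch_of_matrices,
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    output_core_dims=[["matrix_dim", "matrix_dim_bis"]],
)

# norm needs an explicit axis to reduce each matrix separately
norm = xr.apply_ufunc(
    np.linalg.norm,
    batch_of_matrices,
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    kwargs={"axis": (-2, -1)},
)
```

`inv` keeps all four dimensions, while `norm` is left with only the `chain` and `draw` batch dimensions.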

## Similar projects

Here we list some similar projects we know of. Note that all of them are complementary and don't overlap: