## Splitter

A PyTorch implementation of Splitter: Learning Node Representations that Capture Multiple Social Contexts (WWW 2019).

### Abstract

Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the ego-network, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph a phenomenon that we illustrate through state-of-the-art results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.

This repository provides a PyTorch implementation of Splitter as described in the paper:

Splitter: Learning Node Representations that Capture Multiple Social Contexts.

Alessandro Epasto and Bryan Perozzi.

WWW, 2019.

[Paper]

The original Tensorflow implementation is available [here].

### Requirements

The codebase is implemented in Python 3.5.2. package versions used for development are just below.

```
networkx 1.11
tqdm 4.28.1
numpy 1.15.4
pandas 0.23.4
texttable 1.5.0
scipy 1.1.0
argparse 1.1.0
torch 1.1.0
gensim 3.6.0
```

### Datasets

The code takes the **edge list** of the graph in a csv file. Every row indicates an edge between two nodes separated by a comma. The first row is a header. Nodes should be indexed starting with 0. A sample graph for `Cora` is included in the `input/` directory.

### Outputs

The embeddings are saved in the `input/` directory. Each embedding has a header and a column with the node IDs. Finally, the node embedding is sorted by the node ID column.

### Options

The training of a Splitter embedding is handled by the `src/main.py` script which provides the following command line arguments.

#### Input and output options

```
--edge-path STR Edge list csv. Default is `input/chameleon_edges.csv`.
--embedding-output-path STR Embedding output csv. Default is `output/chameleon_embedding.csv`.
--persona-output-path STR Persona mapping JSON. Default is `output/chameleon_personas.json`.
```

#### Model options

```
--seed INT Random seed. Default is 42.
--number of walks INT Number of random walks per node. Default is 10.
--window-size INT Skip-gram window size. Default is 5.
--negative-samples INT Number of negative samples. Default is 5.
--walk-length INT Random walk length. Default is 40.
--lambd FLOAT Regularization parameter. Default is 0.1
--dimensions INT Number of embedding dimensions. Default is 128.
--workers INT Number of cores for pre-training. Default is 4.
--learning-rate FLOAT SGD learning rate. Default is 0.025
```

### Examples

The following commands learn an embedding and save it with the persona map. Training a model on the default dataset.

```
python src/main.py
```

Training a Splitter model with 32 dimensions.

```
python src/main.py --dimensions 32
```

Increasing the number of walks and the walk length.

```
python src/main.py --number-of-walks 20 --walk-length 80
```