PyKEEN
PyKEEN (Python KnowlEdge EmbeddiNgs) is a Python package designed to train and evaluate knowledge graph embedding models (incorporating multi-modal information).
Installation
The latest stable version of PyKEEN can be downloaded and installed from
PyPI with:
```shell
$ pip install pykeen
```
The latest version of PyKEEN can be installed directly from the
source on GitHub with:
```shell
$ pip install git+https://github.com/pykeen/pykeen.git
```
More information about installation (e.g., development mode, Windows installation, Colab, Kaggle, extras)
can be found in the installation documentation.
Quickstart
This example shows how to train a model on a dataset and evaluate it on held-out test triples.
The fastest way to get up and running is to use the pipeline function. It
provides a high-level entry into the extensible functionality of this package.
The following example shows how to train and evaluate the TransE
model on the Nations
dataset. By default, the training loop uses the stochastic local closed world assumption (sLCWA)
training approach and evaluates with rank-based evaluation.
```python
from pykeen.pipeline import pipeline

result = pipeline(
    model='TransE',
    dataset='nations',
)
```
The results are returned in an instance of the PipelineResult
dataclass that has attributes for the trained model, the training loop, the evaluation, and more. See the tutorials
on using your own dataset,
understanding the evaluation,
and making novel link predictions.
PyKEEN is extensible such that:
- Each model has the same API, so anything from pykeen.models can be dropped in
- Each training loop has the same API, so pykeen.training.LCWATrainingLoop can be dropped in
- Triples factories can be generated by the user with pykeen.triples.TriplesFactory
The full documentation can be found at https://pykeen.readthedocs.io.
Implementation
Below are the models, datasets, training modes, evaluators, and metrics implemented in pykeen.
Datasets (27)
The following datasets are built into PyKEEN. The citation for each dataset corresponds to either the paper
describing the dataset, the first paper published using the dataset with knowledge graph embedding models,
or the URL for the dataset if neither of the first two are available. If you want to use a custom dataset,
see the Bring Your Own Dataset tutorial. If you
have a suggestion for another dataset to include in PyKEEN, please let us know
here.
Models (30)
Losses (7)
Name | Reference | Description |
---|---|---|
Binary cross entropy (after sigmoid) | pykeen.losses.BCEAfterSigmoidLoss | A module for the numerically unstable version of explicit Sigmoid + BCE loss. |
Binary cross entropy (with logits) | pykeen.losses.BCEWithLogitsLoss | A module for the binary cross entropy loss. |
Cross entropy | pykeen.losses.CrossEntropyLoss | A module for the cross entropy loss that evaluates the cross entropy after softmax output. |
Margin ranking | pykeen.losses.MarginRankingLoss | A module for the margin ranking loss. |
Mean square error | pykeen.losses.MSELoss | A module for the mean square error loss. |
Self-adversarial negative sampling | pykeen.losses.NSSALoss | An implementation of the self-adversarial negative sampling loss function proposed by [sun2019]_. |
Softplus | pykeen.losses.SoftplusLoss | A module for the softplus loss. |
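To make the loss interfaces above concrete, here is an illustrative pure-Python sketch of the margin ranking loss (not PyKEEN's implementation; it assumes the convention that a higher score means a more plausible triple):

```python
# Illustrative sketch of the margin ranking loss: a positive triple should
# score at least `margin` higher than a negative one; any shortfall is the loss.
def margin_ranking_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    return max(0.0, margin + neg_score - pos_score)

print(margin_ranking_loss(pos_score=2.0, neg_score=0.5))  # 0.0 -> already separated by > margin
print(margin_ranking_loss(pos_score=0.5, neg_score=0.4))  # 0.9 -> still within the margin
```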
Regularizers (5)
Name | Reference | Description |
---|---|---|
combined | pykeen.regularizers.CombinedRegularizer | A convex combination of regularizers. |
lp | pykeen.regularizers.LpRegularizer | A simple L_p norm based regularizer. |
no | pykeen.regularizers.NoRegularizer | A regularizer which does not perform any regularization. |
powersum | pykeen.regularizers.PowerSumRegularizer | A simple x^p based regularizer. |
transh | pykeen.regularizers.TransHRegularizer | A regularizer for the soft constraints in TransH. |
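As a rough illustration of what an L_p-style regularization term computes (a pure-Python sketch, not PyKEEN's LpRegularizer; the weight is an arbitrary example value):

```python
# Illustrative sketch of an L_p regularization term: the p-th power of each
# embedding entry, summed over all vectors and scaled by a weight. This penalty
# is added to the training loss to keep embedding norms small.
def lp_regularization(vectors, p: float = 2.0, weight: float = 0.01) -> float:
    return weight * sum(abs(x) ** p for vector in vectors for x in vector)

embeddings = [[3.0, 4.0], [0.0, 1.0]]  # two toy embedding vectors
print(lp_regularization(embeddings))   # 0.01 * (9 + 16 + 0 + 1) = 0.26
```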
Optimizers (6)
Name | Reference | Description |
---|---|---|
adadelta | torch.optim.Adadelta | Implements Adadelta algorithm. |
adagrad | torch.optim.Adagrad | Implements Adagrad algorithm. |
adam | torch.optim.Adam | Implements Adam algorithm. |
adamax | torch.optim.Adamax | Implements Adamax algorithm (a variant of Adam based on infinity norm). |
adamw | torch.optim.AdamW | Implements AdamW algorithm. |
sgd | torch.optim.SGD | Implements stochastic gradient descent (optionally with momentum). |
Training Loops (2)
Name | Reference | Description |
---|---|---|
lcwa | pykeen.training.LCWATrainingLoop | A training loop that uses the local closed world assumption training approach. |
slcwa | pykeen.training.SLCWATrainingLoop | A training loop that uses the stochastic local closed world assumption training approach. |
Negative Samplers (3)
Name | Reference | Description |
---|---|---|
basic | pykeen.sampling.BasicNegativeSampler | A basic negative sampler. |
bernoulli | pykeen.sampling.BernoulliNegativeSampler | An implementation of the Bernoulli negative sampling approach proposed by [wang2014]_. |
pseudotyped | pykeen.sampling.PseudoTypedNegativeSampler | A sampler that accounts for which entities co-occur with a relation. |
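The core idea behind basic negative sampling can be sketched in a few lines of pure Python (not PyKEEN's implementation; the entities and triple are made-up examples): a positive triple is corrupted by replacing its head or tail with a randomly drawn entity.

```python
import random

# Illustrative sketch of basic negative sampling: corrupt a positive triple by
# replacing either its head or its tail with a random entity. The Bernoulli
# variant instead chooses head vs. tail with a per-relation probability.
def corrupt(triple, entities, rng):
    head, relation, tail = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), relation, tail)  # corrupt the head
    return (head, relation, rng.choice(entities))      # corrupt the tail

rng = random.Random(0)
entities = ['brazil', 'uk', 'usa', 'china']
negative = corrupt(('brazil', 'intergovorgs', 'uk'), entities, rng)
print(negative)
```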
Stoppers (2)
Name | Reference | Description |
---|---|---|
early | pykeen.stoppers.EarlyStopper | A harness for early stopping. |
nop | pykeen.stoppers.NopStopper | A stopper that does nothing. |
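The patience-based logic that early stopping relies on can be sketched as follows (an illustrative simplification, not PyKEEN's EarlyStopper; the metric values are invented):

```python
# Illustrative sketch of early stopping: stop once the validation metric has
# failed to improve for `patience` consecutive evaluations.
class PatienceStopper:
    def __init__(self, patience: int = 2, higher_is_better: bool = True):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best = None
        self.bad_evaluations = 0

    def should_stop(self, metric: float) -> bool:
        improved = (
            self.best is None
            or (metric > self.best if self.higher_is_better else metric < self.best)
        )
        if improved:
            self.best = metric
            self.bad_evaluations = 0
        else:
            self.bad_evaluations += 1
        return self.bad_evaluations >= self.patience

stopper = PatienceStopper(patience=2)
history = [0.10, 0.15, 0.14, 0.13]  # e.g. validation MRR after each evaluation
decisions = [stopper.should_stop(mrr) for mrr in history]
print(decisions)  # [False, False, False, True]
```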
Evaluators (2)
Name | Reference | Description |
---|---|---|
rankbased | pykeen.evaluation.RankBasedEvaluator | A rank-based evaluator for KGE models. |
sklearn | pykeen.evaluation.SklearnEvaluator | An evaluator that uses a Scikit-learn metric. |
Metrics (16)
Name | Description |
---|---|
AUC-ROC | The area under the ROC curve, on [0, 1]. Higher is better. |
Adjusted Arithmetic Mean Rank (AAMR) | The mean over all chance-adjusted ranks, on (0, 2). Lower is better. |
Adjusted Arithmetic Mean Rank Index (AAMRI) | The re-indexed adjusted mean rank (AAMR), on [-1, 1]. Higher is better. |
Average Precision | The area under the precision-recall curve, on [0, 1]. Higher is better. |
Geometric Mean Rank (GMR) | The geometric mean over all ranks, on [1, inf). Lower is better. |
Harmonic Mean Rank (HMR) | The harmonic mean over all ranks, on [1, inf). Lower is better. |
Hits @ K | The relative frequency of ranks not larger than a given k, on [0, 1]. Higher is better. |
Inverse Arithmetic Mean Rank (IAMR) | The inverse of the arithmetic mean over all ranks, on (0, 1]. Higher is better. |
Inverse Geometric Mean Rank (IGMR) | The inverse of the geometric mean over all ranks, on (0, 1]. Higher is better. |
Inverse Median Rank | The inverse of the median over all ranks, on (0, 1]. Higher is better. |
Mean Rank (MR) | The arithmetic mean over all ranks, on [1, inf). Lower is better. |
Mean Reciprocal Rank (MRR) | The inverse of the harmonic mean over all ranks, on (0, 1]. Higher is better. |
Median Rank | The median over all ranks, on [1, inf). Lower is better. |
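Several of the metrics above follow directly from a list of ranks, where a rank of 1 means the true entity was scored highest. A pure-Python illustration (not PyKEEN's implementation; the ranks are invented):

```python
# Illustrative computation of rank-based metrics from a list of 1-based ranks.
def mean_rank(ranks):
    # Arithmetic mean over all ranks; lower is better.
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    # Mean of reciprocal ranks, i.e. the inverse of the harmonic mean rank;
    # higher is better.
    return sum(1 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    # Fraction of ranks not larger than k; higher is better.
    return sum(r <= k for r in ranks) / len(ranks)

ranks = [1, 2, 4, 8]
print(mean_rank(ranks))             # 3.75
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0.25 + 0.125) / 4 = 0.46875
print(hits_at_k(ranks, k=3))        # 2/4 = 0.5
```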
Trackers (7)
Name | Reference | Description |
---|---|---|
console | pykeen.trackers.ConsoleResultTracker | A class that directly prints to console. |
csv | pykeen.trackers.CSVResultTracker | Tracking results to a CSV file. |
json | pykeen.trackers.JSONResultTracker | Tracking results to a JSON lines file. |
mlflow | pykeen.trackers.MLFlowResultTracker | A tracker for MLflow. |
neptune | pykeen.trackers.NeptuneResultTracker | A tracker for Neptune.ai. |
tensorboard | pykeen.trackers.TensorBoardResultTracker | A tracker for TensorBoard. |
wandb | pykeen.trackers.WANDBResultTracker | A tracker for Weights and Biases. |
Hyper-parameter Optimization
Samplers (3)
Name | Reference | Description |
---|---|---|
grid | optuna.samplers.GridSampler | Sampler using grid search. |
random | optuna.samplers.RandomSampler | Sampler using random sampling. |
tpe | optuna.samplers.TPESampler | Sampler using TPE (Tree-structured Parzen Estimator) algorithm. |
Any sampler class extending optuna.samplers.BaseSampler, such as Optuna's implementation of the CMA-ES algorithm (optuna.samplers.CmaEsSampler), can also be used.
Experimentation
Reproduction
PyKEEN includes a set of curated experimental settings for reproducing past landmark
experiments. They can be accessed and run like:
```shell
$ pykeen experiments reproduce tucker balazevic2019 fb15k
```
where the three arguments are the model name, the reference, and the dataset. The output directory can optionally be set with -d.
Ablation
PyKEEN includes the ability to specify ablation studies using the
hyper-parameter optimization module. They can be run like:
```shell
$ pykeen experiments ablation ~/path/to/config.json
```
Large-scale Reproducibility and Benchmarking Study
We used PyKEEN to perform a large-scale reproducibility and benchmarking study, which is described in our article:
```bibtex
@article{ali2020benchmarking,
  title={Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework},
  author={Ali, Mehdi and Berrendorf, Max and Hoyt, Charles Tapley and Vermue, Laurent and Galkin, Mikhail and Sharifzadeh, Sahand and Fischer, Asja and Tresp, Volker and Lehmann, Jens},
  journal={arXiv preprint arXiv:2006.13365},
  year={2020}
}
```
We have made all code, experimental configurations, results, and analyses that led to our interpretations available at https://github.com/pykeen/benchmarking.