/ Natural Language Processing

Python library for knowledge graph embedding and representation learning

Python library for knowledge graph embedding and representation learning

Pykg2vec

Python library for knowledge graph embedding and representation learning.

Pykg2vec is a library, currently in active development, for learning the representation of entities and relations in Knowledge Graphs. We have attempted to bring all the state-of-the-art knowledge graph embedding algorithms and the necessary building blocks in knowledge graph embedding task pipeline into a single library. We hope Pykg2vec is both practical and educational for users who want to explore the related fields.

For people who just start working on Knowledge Graph Embedding Methods, the papers A Review of Relational Machine Learning for Knowledge Graphs, Knowledge Graph Embedding: A Survey of Approaches and Applications, and An overview of embedding models of entities and relationships for knowledge base completion are well-written materials for reading! The figure below illustrates the current overall architecture.

Dependencies

The goal of this library is to minimize the dependency on other libraries as far as possible to rapidly test the algorithms against different dataset. We emphasize that in the beginning, we will not be focus in run-time performance. However, in the future, may provide faster implementation of each of the algorithms. We encourage installing the tensorflow-gpu version for optimal usage.

  • tensorflow==<version suitable for your workspace>
  • networkx>=2.2
  • setuptools>=40.8.0
  • matplotlib>=3.0.3
  • numpy>=1.16.2
  • seaborn>=0.9.0
  • scikit_learn>=0.20.3
  • hyperopt>=0.1.2
  • progressbar2>=3.39.3
  • pathlib>=1.0.1

Features

  • A lot of state-of-the-art KGE model implementations and well-known datasets.
  • Tools that support automatic hyperparameter tuning (bayesian optimizer).
  • Optimized performance by making a proper use of CPUs and GPUs (multiprocess and Tensorflow).
    • Will be adding C++ implementation to further optimize!
  • A suite of visualization and summerization tools
    • TSNE based visualization. (Support TSV export)
    • KPI summary visualization (mean rank, hit ratio) in various format. (csvs, figures, latex table)
Training Loss Plot Testing Rank Results Testing Hits Result
Freebase15k_training_loss_plot_0-1 Freebase15k_testing_rank_plot_1-1 Freebase15k_testing_hits_plot_1-1
Relation embedding plot Entity embedding plot Relation and Entity Plot
TransE_rel_plot_embedding_plot_0 TransE_entity_plot_embedding_plot_0 TransE_ent_n_rel_plot_embedding_plot_0

Repository Structure

  • pyKG2Vec/config: This folder consists of the configuration module. It provides the necessary configuration to parse the datasets, and also consists of the baseline hyperparameters for the knowledge graph embedding algorithms.
  • pyKG2Vec/core: This folder consists of the core codes of the knowledge graph embedding algorithms. Inside this folder, each algorithm is implemented as a separate python module.
  • pyKG2Vec/utils: This folders consists of modules providing various utilities, such as data preparation, data visualization, and evaluation of the algorithms, data generators, baynesian optimizer.
  • pyKG2Vec/example: This folders consists of example codes that can be used to run individual modules or run all the modules at once or tune the model.

Installation

For best performance, we encourage the users to create a virtual environment and setup the necessary dependencies for running the algorithms using Python3.6.

Please install tensorflow cpu or gpu version before performing pip install of pykg2vec!

#Prepare your environment
$ sudo apt update
$ sudo apt install python3-dev python3-pip
$ sudo pip3 install -U virtualenv     

#Create a virtual environment
#If you have tensorflow installed in the root env, do the following
$ virtualenv --system-site-packages -p python3 ./venv
#If you you want to install tensorflow later, do the following
$ virtualenv -p python3 ./venv

#Activate the virtual environment using a shell-specific command:  
$ source ./venv/bin/activate

#Upgrade pip:  
$ pip install --upgrade pip

#If you have not installed tensorflow, or not used --system-site-package option while creating venv, install tensorflow first.
(venv) $ pip install tensorflow

#Install pyKG2Vec:  
(venv) $ pip install pykg2vec

#Install stable version directly from github repo:  
(venv) $ git clone https://github.com/Sujit-O/pykg2vec.git
(venv) $ cd pykg2vec
(venv) $ python setup.py install

#Install development version directly from github repo:  
(venv) $ git clone https://github.com/Sujit-O/pykg2vec.git
(venv) $ cd pykg2vec
(venv) $ git checkout development
(venv) $ python setup.py install

Usage Example

Running a single algorithm:

Train.py

from pykg2vec.utils.kgcontroller import KnowledgeGraph
from pykg2vec.config.config import Importer, KGEArgParser
from pykg2vec.utils.trainer import Trainer

def main():
    # getting the customized configurations from the command-line arguments.
    args = KGEArgParser().get_args()

    # Preparing data and cache the data for later usage
    knowledge_graph = KnowledgeGraph(dataset=args.dataset_name, negative_sample=args.sampling)
    knowledge_graph.prepare_data()

    # Extracting the corresponding model config and definition from Importer(). 
    config_def, model_def = Importer().import_model_config(args.model_name.lower())
    config = config_def(args=args)
    model = model_def(config)

    # Create, Compile and Train the model. While training, several evaluation will be performed.
    trainer = Trainer(model=model, debug=args.debug)
    trainer.build_model()
    trainer.train_model()


if __name__ == "__main__":
    main()

with train.py we then can train the existed model using command:

python train.py -h # check all tunnable parameters.
python train.py -mn TransE # Run TransE model.
python train.py -mn Complex # Run Complex model. 

Tuning a single algorithm:

tune_model.py


from pykg2vec.config.hyperparams import KGETuneArgParser
from pykg2vec.utils.bayesian_optimizer import BaysOptimizer

def main():
    # getting the customized configurations from the command-line arguments.
    args = KGETuneArgParser().get_args()

    # initializing bayesian optimizer and prepare data.
    bays_opt = BaysOptimizer(args=args)

    # perform the golden hyperparameter tuning. 
    bays_opt.optimize()
    
if __name__ == "__main__":
    main()

with tune_model.py we then can train the existed model using command:

python tune_model.py -h # check all tunnable parameters.
python tune_model.py -mn TransE # Tune TransE model.

Switch between Implemented Methods:

Pykg2vec aims to include most of the state-of-the-art KGE methods. You can check Implemented Algorithms for more information about the algorithms implemented in pykg2vec. With train.py described in usage examples, you can switch the models to train on a dataset using command:

python train.py -mn TransE # Run TransE model.
python train.py -mn Complex # Run Complex model. 
# you can select one of models from ["complex", "conve","convkb","hole", 
                                     "distmult", "kg2e", "ntn": "NTN", 
                                     "proje_pointwise","rescal","rotate",
                                     "slm","sme","transd","TransD",
                                     "transe","transh","transg","transm","transr","tucker"]

Switch between Datasets:

Pykg2vec aims to include all the well-known datasets available online so that you can test all available KGE models or your own model on those datasets. Currently, pykg2vec has FK15K, WN18, WN18-RR, YAGO, FK15K_237, Kinship, Nations, UMLS. You can check Datasets for more information.
With train.py described in usage examples, you can switch the models to train on a dataset using command:

Using Well-Known Dataset

python train.py -mn TransE -ds FB15K # Run TransE model on Freebase15k(FK15)
python train.py -mn TransE -ds dl50a # Run TransE model on Deeplearning50a(dl50a)
# you can select one of models from ["fb15k','dl50a','wn18','wn18_rr',
                                     'yago3_10','fb15k_237','ks','nations','umls']

Using Custom Dataset

For custom dataset, some steps are provided:

  1. For triples, store all of them in a text-format with each line formatted as follows,
head\trelation\ttail
  1. For the text file, separate it into three files according to your reference give names as follows,
[name]-train.txt, [name]-valid.txt, [name]-test.txt
  1. For those three files, create a folder [path_storing_text_files] to include them.
  2. Once finished, you then can use the custom dataset to train on a specific model using command:
python train.py -mn TransE -ds [name] -dsp [path_storing_text_files] 
# Run TransE model on a custom dataset [name].

Perform Inference Tasks:

inference.py

import sys, code

from pykg2vec.utils.kgcontroller import KnowledgeGraph
from pykg2vec.config.config import Importer, KGEArgParser
from pykg2vec.utils.trainer import Trainer

def main():
    # getting the customized configurations from the command-line arguments.
    args = KGEArgParser().get_args(sys.argv[1:])

    # Preparing data and cache the data for later usage
    knowledge_graph = KnowledgeGraph(dataset=args.dataset_name, negative_sample=args.sampling, custom_dataset_path=args.dataset_path)
    knowledge_graph.prepare_data()

    # Extracting the corresponding model config and definition from Importer().
    config_def, model_def = Importer().import_model_config(args.model_name.lower())
    config = config_def(args=args)
    model = model_def(config)

    # Create, Compile and Train the model. While training, several evaluation will be performed.
    trainer = Trainer(model=model, debug=args.debug)
    trainer.build_model()
    trainer.train_model()
    
    #can perform all the inference here after training the model
    trainer.enter_interactive_mode()
    
    code.interact(local=locals())

    trainer.exit_interactive_mode()

if __name__ == "__main__":
    main()

For inference task, you can use the following command:

python inference.py -mn TransE # train a model on FK15K dataset and enter interactive CMD for manual inference tasks.
python inference.py -mn TransE -ld true # pykg2vec will look for the location of cached pretrained parameters in your local.

# Once interactive mode is reached, you can execute instruction manually like
# Example 1: trainer.infer_tails(1,10,topk=5) => give the list of top-5 predicted tails. 
# Example 2: trainer.infer_heads(10,20,topk=5) => give the list of top-5 predicted heads.

Common Installation Problems

Cite

Please kindly cite the paper corresponding to the library.

@article{yu2019pykg2vec,
title={Pykg2vec: A Python Library for Knowledge Graph Embedding},
author={Yu, Shih Yuan and Rokka Chhetri, Sujit and Canedo, Arquimedes and Goyal, Palash and Faruque, Mohammad Abdullah Al},
journal={arXiv preprint arXiv:1906.04239},
year={2019}
}
 

GitHub