wikirec is a framework that allows users to parse the Wikipedia of any language for entries of a given type and then seamlessly create recommendation engines based on unsupervised natural language processing. The goal is for wikirec to both refine and deploy models that provide accurate content recommendations based solely on open-source data.
Installation via PyPI
wikirec can be downloaded from PyPI via pip or sourced directly from this repository:
    pip install wikirec

    git clone https://github.com/andrewtavis/wikirec.git
    cd wikirec
    python setup.py install
wikirec.data_utils allows a user to download Wikipedia texts of a given document type including movies, TV shows, books, music, and countless other classes of information. These texts then serve as the basis to recommend similar content given an input of what the user is interested in.
Article classes are derived from infobox types found on Wikipedia articles. The article on infoboxes (and its translations) contains all the allowed arguments to subset the data by. Simply passing "Infobox chosen_type" to the topic argument of data_utils.parse_to_ndjson in the following example will subset all Wikipedia articles for the given type. For the English Wikipedia, wikirec also provides concise arguments for data that commonly serve as recommendation inputs, including books and video_games, as well as various categories of people.
Downloading and parsing Wikipedia for the needed data is as simple as:
    from wikirec import data_utils

    # Downloads the most recent stable bz2 compressed English Wikipedia dump
    files = data_utils.download_wiki(language="en")

    # Produces an ndjson of all book articles on Wikipedia
    data_utils.parse_to_ndjson(
        topic="books",
        output_path="enwiki_books.ndjson",
        multicore=True,
        verbose=True,
    )
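For orientation, ndjson (newline-delimited JSON) stores one JSON value per line, so a parsed dump can be read record by record. The sketch below writes and reads a tiny file in that layout. The [title, text] record shape, the sample entries, and the file name sample_books.ndjson are assumptions for illustration, not output produced by wikirec.

```python
import json
import os
import tempfile

# Hypothetical records: each line is assumed to hold a [title, text] pair.
sample_records = [
    ["Book A", "A novel about a sea voyage."],
    ["Book B", "A history of early printing."],
]

path = os.path.join(tempfile.mkdtemp(), "sample_books.ndjson")

# Write one JSON value per line.
with open(path, "w", encoding="utf-8") as fout:
    for record in sample_records:
        fout.write(json.dumps(record) + "\n")

# Read the file back line by line, skipping blank lines.
with open(path, "r", encoding="utf-8") as fin:
    loaded = [json.loads(line) for line in fin if line.strip()]

print(len(loaded), loaded[0][0])
```

Because every line is an independent JSON value, a large file can also be streamed one record at a time instead of being loaded whole.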
Generating a clean text and token corpus is achieved through the following:
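    import json

    from wikirec import data_utils

    with open("enwiki_books.ndjson", "r") as fin:
        books = [json.loads(l) for l in fin]

    # Each record is assumed to be a [title, text] pair
    titles = [b[0] for b in books]
    texts = [b[1] for b in books]

    text_corpus, token_corpus = data_utils.clean(texts=texts)[:2]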
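As a rough picture of how one set of raw texts yields both a text corpus and a token corpus, the sketch below lowercases each text, strips punctuation, and drops a few stop words. The function simple_clean and the STOP_WORDS set are hypothetical and far simpler than data_utils.clean, whose actual steps are not reproduced here.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}


def simple_clean(texts):
    """Return (text_corpus, token_corpus) for a list of raw strings."""
    token_corpus = []
    for text in texts:
        # Lowercase, then keep runs of letters and digits as tokens.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        token_corpus.append([t for t in tokens if t not in STOP_WORDS])
    # The text corpus is each token list joined back into one string.
    text_corpus = [" ".join(tokens) for tokens in token_corpus]
    return text_corpus, token_corpus


sample_text_corpus, sample_token_corpus = simple_clean(
    ["The Art of War is a treatise."]
)
print(sample_text_corpus)
print(sample_token_corpus)
```

The two outputs hold the same content in different shapes: one cleaned string per document, and one list of tokens per document.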
Implemented NLP modeling methods include:
Bidirectional Encoder Representations from Transformers (BERT) derives representations of words based on NLP models run over open source Wikipedia data. These representations are leveraged to derive article similarities that are then used to deliver recommendations.
    from wikirec import model

    # We can pass kwargs for sentence_transformers.SentenceTransformer.encode
    sim_matrix = model.gen_sim_matrix(
        method="bert",
        metric="cosine",
        corpus=text_corpus,
    )
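Whatever model produces the document vectors, the cosine similarity step itself is simple. The sketch below builds a cosine similarity matrix from a few made-up embedding vectors. The function cosine_sim_matrix and the toy values are assumptions for illustration and do not show how model.gen_sim_matrix is implemented.

```python
import math


def cosine_sim_matrix(vectors):
    """Pairwise cosine similarities for a list of equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, u in enumerate(vectors):
        row = []
        for j, v in enumerate(vectors):
            dot = sum(a * b for a, b in zip(u, v))
            denom = norms[i] * norms[j]
            # Treat an all-zero vector as similar to nothing.
            row.append(dot / denom if denom else 0.0)
        matrix.append(row)
    return matrix


# Toy 3-dimensional vectors standing in for real document embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
]
toy_sim = cosine_sim_matrix(toy_embeddings)
print(toy_sim[0])
```

Entry [i][j] scores how closely document i and document j point in the same direction, regardless of vector length.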
Doc2vec (a generalization of Word2vec) is an NLP algorithm for deriving vector representations of documents from contextual word interrelations. These representations are then used as a baseline for recommendations.
    from wikirec import model

    # We can pass kwargs for gensim.models.doc2vec.Doc2Vec
    sim_matrix = model.gen_sim_matrix(
        method="doc2vec",
        metric="cosine",
        corpus=text_corpus,
    )
Latent Dirichlet Allocation is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. In the case of wikirec, Wikipedia articles are posited to be a mixture of a given number of topics, and the presence of each word in a text body comes from its relation to these derived topics. These topic-word relations are then used to determine article similarities and then make recommendations.
    from wikirec import model

    # We can pass kwargs for gensim.models.ldamulticore.LdaMulticore
    sim_matrix = model.gen_sim_matrix(
        method="lda",
        metric="cosine",
        corpus=token_corpus,
        num_topics=10,
    )
Term Frequency Inverse Document Frequency (TF-IDF) is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. In the case of wikirec, word importances are combined and compared to derive article similarities and thus provide recommendations.
    from wikirec import model

    sim_matrix = model.gen_sim_matrix(
        method="tfidf",
        metric="cosine",
        corpus=text_corpus,
    )
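To make the weighting concrete, the sketch below computes TF-IDF vectors for a toy tokenized corpus using raw term counts and idf = log(N / df). This is one common textbook variant, chosen here as an assumption. Libraries typically add smoothing and normalization, and the function tfidf_vectors is hypothetical rather than wikirec's code.

```python
import math
from collections import Counter


def tfidf_vectors(token_corpus):
    """Return (vocabulary, vectors) with weights tf * log(N / df)."""
    n_docs = len(token_corpus)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in token_corpus for term in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in token_corpus:
        counts = Counter(doc)
        vectors.append(
            [counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


toy_tokens = [
    ["dragon", "quest", "magic"],
    ["dragon", "war", "kingdom"],
    ["cooking", "recipes", "kitchen"],
]
vocab, vectors = tfidf_vectors(toy_tokens)
print(vocab)
print(vectors[0])
```

Under this variant a word that appears in every document gets weight log(1) = 0, so it contributes nothing when documents are compared, while a word unique to one document gets the largest weight.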
Once any of the above methods has been trained, generating recommendations is as simple as the following:
    from wikirec import model

    # Using sim_matrix generated by BERT
    recs = model.recommend(
        inputs="title_or_list_of_titles",
        titles=titles,
        sim_matrix=sim_matrix,
        n=10,
    )
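Conceptually, a recommendation is a lookup in the similarity matrix: take the row for the input title, rank the other titles by score, and keep the top n. The sketch below does that for a hand-written matrix. The function top_n_similar, the titles, and the scores are hypothetical and are not a description of how model.recommend works internally.

```python
def top_n_similar(title, titles, sim_matrix, n=2):
    """Return the n titles most similar to `title`, best first."""
    idx = titles.index(title)
    scored = [
        (other, sim_matrix[idx][j])
        for j, other in enumerate(titles)
        if j != idx  # never recommend the input itself
    ]
    # Sort by similarity, highest first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


toy_titles = ["Book A", "Book B", "Book C", "Book D"]
toy_sim_matrix = [
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.6],
    [0.4, 0.3, 0.6, 1.0],
]
toy_recs = top_n_similar("Book A", toy_titles, toy_sim_matrix, n=2)
print(toy_recs)  # [('Book B', 0.8), ('Book D', 0.4)]
```

The row order of the matrix must match the order of the titles list, which is why titles and the corpus are built from the same records.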
- Adding further methods for recommendations
- Compiling other sources of open source data that can be used to augment input data
- Potentially writing scripts to load this data for significant topics
- Allowing multiple infobox topics to be subset at once in wikirec.data_utils functions
- Updates to wikirec.languages as lemmatization and other linguistic package dependencies evolve
- Creating, improving and sharing examples
- Updating and refining the documentation
- Improving tests for greater code coverage