Neural Distance Embeddings for Biological Sequences
Official PyTorch implementation of Neural Distance Embeddings for Biological Sequences (NeuroSEED). NeuroSEED is a novel framework to embed biological sequences in geometric vector spaces. The preprint will be published soon.
The repository is organised in four main folders, one for each of the tasks analysed. Each folder contains the scripts and models used for the task, instructions on how to run them, and the tuned hyperparameters found.
edit_distance for the edit distance approximation task
closest_string for the closest string retrieval task
hierarchical_clustering for the hierarchical clustering task, further divided into relaxed and unsupervised for the two approaches explored
multiple_alignment for the multiple sequence alignment task, further divided into subfolders
util contains a series of utility routines shared between all the tasks
tests contains a wide range of tests for the various components of the repository
Create a virtual (or conda) environment and install the dependencies:
    python3 -m venv neuroseed
    source neuroseed/bin/activate
    pip install -r requirements.txt
Then build the mst and unionfind packages used for the hierarchical clustering:
    cd hierarchical_clustering/relaxed/mst; python setup.py build_ext --inplace; cd ../../..
    cd hierarchical_clustering/relaxed/unionfind; python setup.py build_ext --inplace; cd ../../..
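To check that both builds succeeded, the following minimal sketch lists the compiled extension modules in each folder. It is not part of the repository: the helper name find_built_extensions is hypothetical, and it assumes only that build_ext --inplace leaves the compiled module next to its setup.py, with one of the extension suffixes that the running Python interpreter recognises. Run it from the repository root.

```python
import importlib.machinery
from pathlib import Path


def find_built_extensions(directory):
    """Return the names of compiled extension modules directly inside `directory`.

    Hypothetical helper for illustration; a file counts as a compiled
    extension if its name ends with one of the interpreter's extension
    suffixes (for example ".so" on Linux or ".pyd" on Windows).
    """
    suffixes = tuple(importlib.machinery.EXTENSION_SUFFIXES)
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )


if __name__ == "__main__":
    # Folder paths are taken from the build commands above.
    for folder in (
        "hierarchical_clustering/relaxed/mst",
        "hierarchical_clustering/relaxed/unionfind",
    ):
        path = Path(folder)
        built = find_built_extensions(path) if path.is_dir() else []
        status = ", ".join(built) if built else "no compiled extension found"
        print(f"{folder}: {status}")
```

If a folder reports no compiled extension, rerunning the corresponding build command and reading its compiler output is usually the quickest way to find the cause.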