ictv-mmseqs2-protein-database

This repository contains instructions to generate an MMseqs2 protein database with ICTV taxonomy.

Dependencies (the tools called in the commands below): aria2, csvtk, TaxonKit, ripgrep, SeqKit, and MMseqs2.

Instructions

First, download the latest Virus Metadata Resource (VMR) release from ICTV and convert it to a tab-separated file:

aria2c -x 4 -o ictv.xlsx "https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/13426/download"

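# Convert xlsx to tsv, replace non-breaking spaces (UTF-8 bytes c2 a0) with plain spaces,
# and trim leading and trailing whitespace in every field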
csvtk xlsx2csv ictv.xlsx \
    | csvtk csv2tab \
    | sed 's/\xc2\xa0/ /g' \
    | csvtk replace -t -F -f "*" -p "^\s+|\s+$" \
    > ictv.tsv

# Keep only the rank columns, drop duplicate rows, and remove the header
csvtk cut -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" ictv.tsv \
    | csvtk uniq -t -f "Realm,Subrealm,Kingdom,Subkingdom,Phylum,Subphylum,Class,Subclass,Order,Suborder,Family,Subfamily,Genus,Subgenus,Species" \
    | csvtk del-header -t \
    > ictv.taxonomy.tsv

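The later steps address ranks by column position (1, 3, 5 and so on up to 15), so a row with a missing or extra field would shift names to the wrong rank. A minimal sketch of a field-count check follows, run on a made-up two-row table. The file name, the rank names, and the variable names are illustrative assumptions, not part of this repository:

```shell
# Toy stand-in for ictv.taxonomy.tsv; rows and names are hypothetical.
workdir=$(mktemp -d)
toy="$workdir/toy.taxonomy.tsv"

# Row 1 has all 15 fields (sub-ranks left empty); row 2 is deliberately truncated.
printf 'RealmA\t\tKingdomA\t\tPhylumA\t\tClassA\t\tOrderA\t\tFamilyA\t\tGenusA\t\tSpeciesA\n' > "$toy"
printf 'RealmA\t\tKingdomA\t\tPhylumA\n' >> "$toy"

# Count the rows whose number of tab-separated fields differs from 15.
bad_rows=$(awk -F '\t' 'NF != 15 { n++ } END { print n + 0 }' "$toy")
echo "rows without exactly 15 fields: $bad_rows"
```

On the real table, point the awk command at ictv.taxonomy.tsv; a result of 0 means every row has the expected layout.
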
Create a file listing every ICTV taxon name at the main ranks (realm through species), one name per line:

csvtk cut -t -H -f 1,3,5,7,9,11,13,15 ictv.taxonomy.tsv \
    | sed 's/\t/\n/g' \
    | awk '!/^[[:blank:]]*$/' \
    | sort -u \
    > ictv.names.txt

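To make the effect of this pipeline concrete, here is a sketch of the same logic on a three-row toy table, using only cut, tr, awk, and sort. tr stands in for the sed call because the \t and \n escapes are not supported by every sed implementation. The rows, names, and file paths are invented for illustration:

```shell
workdir=$(mktemp -d)
toy="$workdir/toy.taxonomy.tsv"
names="$workdir/toy.names.txt"

# Three hypothetical lineages; the third has no ranks above family.
printf 'RealmA\t\tKingdomA\t\tPhylumA\t\tClassA\t\tOrderA\t\tFamilyA\t\tGenusA\t\tSpeciesA\n' > "$toy"
printf 'RealmA\t\tKingdomA\t\tPhylumA\t\tClassA\t\tOrderA\t\tFamilyA\t\tGenusA\t\tSpeciesB\n' >> "$toy"
printf '\t\t\t\t\t\t\t\t\t\tFamilyB\t\tGenusB\t\tSpeciesC\n' >> "$toy"

# Main-rank columns -> one name per line -> drop blank lines -> unique, sorted.
cut -f 1,3,5,7,9,11,13,15 "$toy" \
    | tr '\t' '\n' \
    | awk '!/^[[:blank:]]*$/' \
    | LC_ALL=C sort -u \
    > "$names"

name_count=$(wc -l < "$names" | tr -d ' ')
echo "distinct names: $name_count"
```

Names shared between rows (RealmA, GenusA, and so on) appear once in the output, and the empty ranks of the third row leave no blank lines behind.
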
Use taxonkit create-taxdump to build a custom taxdump for ICTV. Then run the fix_taxdump.py script, which renumbers the taxids sequentially so that they are compatible with MMseqs2:

taxonkit create-taxdump -K 1 -P 3 -C 5 -O 7 -F 9 -G 11 -S 13 -T 15 \
    --rank-names "realm","kingdom","phylum","class","order","family","genus","species" \
    ictv.taxonomy.tsv --out-dir ictv-taxdump

./fix_taxdump.py
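
The script is not reproduced here. As a rough illustration of what sequential renumbering involves, the sketch below rewrites a toy nodes.dmp and names.dmp so that taxids become 1, 2, 3 in order of first appearance. It is an assumption about the general technique, not a description of fix_taxdump.py: it assumes the root keeps taxid 1, handles only these two files, and uses invented taxids and names:

```shell
workdir=$(mktemp -d)
nodes="$workdir/nodes.dmp"
names="$workdir/names.dmp"

# Toy taxdump with large, non-consecutive taxids (hypothetical values).
printf '1\t|\t1\t|\tno rank\t|\n'                > "$nodes"
printf '1234567\t|\t1\t|\trealm\t|\n'            >> "$nodes"
printf '987654321\t|\t1234567\t|\tspecies\t|\n'  >> "$nodes"
printf '1\t|\troot\t|\t\t|\tscientific name\t|\n'              > "$names"
printf '1234567\t|\tRealmA\t|\t\t|\tscientific name\t|\n'      >> "$names"
printf '987654321\t|\tSpeciesA\t|\t\t|\tscientific name\t|\n'  >> "$names"

# nodes.dmp is read twice: once to number the taxids, once to rewrite them.
awk -F '\t[|]\t' -v OFS='\t|\t' -v out_nodes="$nodes.seq" -v out_names="$names.seq" '
    BEGIN { new_id["1"] = 1; next_id = 2 }   # assumption: the root keeps taxid 1
    FNR == 1 { file_no++ }
    file_no == 1 {                            # pass 1: assign new ids in order of appearance
        if (!($1 in new_id)) new_id[$1] = next_id++
        next
    }
    file_no == 2 {                            # pass 2: rewrite taxid and parent taxid
        $1 = new_id[$1]; $2 = new_id[$2]
        print > out_nodes
        next
    }
    { $1 = new_id[$1]; print > out_names }    # names.dmp: rewrite the taxid column
' "$nodes" "$nodes" "$names"

# Taxid and parent taxid of the third node after renumbering.
renumbered=$(awk -F '\t' 'NR == 3 { print $1 " " $3 }' "$nodes.seq")
echo "third node (taxid parent): $renumbered"
```

The important property is that one old-to-new map is applied to the taxid column and the parent column alike, so the tree structure is unchanged.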

Download the NCBI taxdump and the prot.accession2taxid.FULL file. Then filter prot.accession2taxid.FULL to keep only viral proteins:

# Download the NCBI taxdump
aria2c -x 4 "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"
mkdir ncbi-taxdump
tar zxfv taxdump.tar.gz -C ncbi-taxdump
rm taxdump.tar.gz

# Download the protein → taxid association and filter for viruses
aria2c -x 4 "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz"

gunzip prot.accession2taxid.FULL.gz

awk '{print $2}' prot.accession2taxid.FULL \
    | sort -u \
    | taxonkit --data-dir ncbi-taxdump lineage \
    | rg "\tViruses;" \
    | awk '{print $1}' \
    > virus_taxid.list

csvtk grep -t -f 2 -P virus_taxid.list prot.accession2taxid.FULL > virus.accession2taxid

rm prot.accession2taxid.FULL
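
The csvtk grep call keeps the rows of the accession table whose second column appears in the taxid list. The same selection logic can be sketched with awk on a toy table. The accessions, taxids, and file names below are made up, and the two-column layout with a header line is assumed; this only illustrates what the filter does and is not a replacement for the command above:

```shell
workdir=$(mktemp -d)
taxids="$workdir/toy_taxid.list"
table="$workdir/toy.accession2taxid"

printf '10\n' > "$taxids"                         # hypothetical viral taxid
printf 'accession.version\ttaxid\n' > "$table"    # header line (assumed layout)
printf 'P1.1\t10\nP2.1\t20\nP3.1\t10\n' >> "$table"

# First file: remember the taxids. Second file: keep the header and matching rows.
# (The NR == FNR idiom assumes the taxid list is not empty.)
kept=$(awk -F '\t' 'NR == FNR { keep[$1]; next }
                    FNR == 1 || ($2 in keep)' "$taxids" "$table")
printf '%s\n' "$kept"
```

Only the rows with taxid 10 survive, together with the header.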

Execute the get_ictv_taxids.py script to create an accession2taxid file with ICTV taxids:

# Find the ICTV-compliant proteins and write a new table with the ICTV taxids
./get_ictv_taxids.py
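
The script itself is not reproduced here. As a rough illustration of the kind of mapping it has to produce, the sketch below assumes (this is an assumption, not a description of get_ictv_taxids.py) that each NCBI taxid receives the ICTV taxid of the lowest name in its NCBI lineage that also exists in the ICTV taxonomy. All names, taxids, and file paths are invented:

```shell
workdir=$(mktemp -d)
ictv_names="$workdir/ictv_name2taxid.tsv"   # hypothetical: ICTV name -> ICTV taxid
lineages="$workdir/ncbi_lineage.tsv"        # hypothetical: NCBI taxid -> lineage
acc2taxid="$workdir/toy.accession2taxid"    # hypothetical: accession -> NCBI taxid

printf 'RealmA\t2\nGenusA\t3\nSpeciesA\t4\n' > "$ictv_names"
printf '100\tViruses;RealmA;GenusA;SpeciesA;isolate X\n'   > "$lineages"
printf '200\tViruses;RealmA;GenusA;unclassified GenusA\n'  >> "$lineages"
printf '300\tViruses;unplaced\n'                           >> "$lineages"
printf 'P1.1\t100\nP2.1\t200\nP3.1\t300\n' > "$acc2taxid"

mapped=$(awk -F '\t' -v OFS='\t' '
    FNR == 1 { file_no++ }
    file_no == 1 { ictv[$1] = $2; next }     # load ICTV names
    file_no == 2 {                            # walk each lineage from the lowest rank up
        n = split($2, part, ";")
        for (i = n; i >= 1; i--)
            if (part[i] in ictv) { ncbi2ictv[$1] = ictv[part[i]]; break }
        next
    }
    ($2 in ncbi2ictv) { print $1, ncbi2ictv[$2] }   # accessions with an ICTV match
' "$ictv_names" "$lineages" "$acc2taxid")
printf '%s\n' "$mapped"
```

In this toy run P1.1 maps to the species taxid, P2.1 falls back to its genus, and P3.1 is dropped because no name in its lineage is an ICTV name.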

Download the proteins from NCBI and filter the FASTA file to keep only the proteins associated with ICTV viruses:

# Download the NR proteins
aria2c -x 4 "https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz"

# Create a list containing the accessions of the proteins of ICTV viruses
cut -f 1 virus.accession2taxid.ictv > virus.accession.txt

# Filter the NR proteins to keep the proteins encoded by ICTV viruses
seqkit grep -j 4 -f virus.accession.txt nr.gz | seqkit seq -i -w 0 -o nr.virus.faa.gz

rm nr.gz

Some proteins in virus.accession2taxid.ictv are not in NR, so keep only the entries for proteins that are present in the filtered NR FASTA file:

# Filter the NR virus taxid table
seqkit fx2tab -n -i nr.virus.faa.gz > nr.virus.list.txt
csvtk grep -t -H -f 1 -P nr.virus.list.txt virus.accession2taxid.ictv > nr.virus.accession2taxid.ictv

Using the filtered NR FASTA, the ICTV taxdump, and the nr.virus.accession2taxid.ictv tabular file, create an MMseqs2 protein database with taxonomy information:

# Create the MMseqs2 database
mkdir virus_tax_db
mmseqs createdb --dbtype 1 nr.virus.faa.gz virus_tax_db/virus_tax_db
mmseqs createtaxdb virus_tax_db/virus_tax_db tmp --ncbi-tax-dump ictv-taxdump --tax-mapping-file nr.virus.accession2taxid.ictv
rm -rf tmp

Finally, to assign taxonomy to viral sequences in an input file (input.fna):

mmseqs easy-taxonomy input.fna virus_tax_db/virus_tax_db taxonomy_results tmp -e 1e-5 -s 6 --blacklist "" --tax-lineage 1
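
Once the run finishes, the per-query assignments can be summarized with standard tools. The sketch below assumes the LCA table is written to taxonomy_results_lca.tsv with the query in column 1, the taxid in column 2, the rank in column 3, and the taxon name in column 4; check this against the files your MMseqs2 version produces. It runs on a made-up table so that it is self-contained:

```shell
workdir=$(mktemp -d)
lca="$workdir/toy_lca.tsv"

# Hypothetical assignments: query, taxid, rank, name.
printf 'contig_1\t5\tspecies\tSpeciesA\n'      > "$lca"
printf 'contig_2\t4\tgenus\tGenusA\n'          >> "$lca"
printf 'contig_3\t0\tno rank\tunclassified\n'  >> "$lca"
printf 'contig_4\t6\tspecies\tSpeciesB\n'      >> "$lca"

# Count how many queries were assigned at each rank.
rank_counts=$(awk -F '\t' '{ c[$3]++ } END { for (r in c) print r "\t" c[r] }' "$lca" \
    | LC_ALL=C sort)
printf '%s\n' "$rank_counts"
```

To use it on real output, replace "$lca" with the path to the LCA table from the run above.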
