/ Data Visualization

Anchor-t-SNE for large-scale and high-dimension vector visualization

Anchor-t-SNE for large-scale and high-dimension vector visualization

AtSNE

AtSNE is a solution of high-dimensional data visualization problem. It can project large-scale high-dimension vectors into low-dimension space while keeping the pair-wise similarity amount point. AtSNE is efficient and scalable and can visualize 20M points in less than 5 hours using GPU. The spatial structure of its result is also robust to random initializations. It implements the algorithm of our KDD'19 paper - AtSNE: Efficient and Robust Visulization on GPU through Hierarchical Optimization

Benchmark datasets

Dataset Dimensions Number of Points Number of Categories Data Label
CIFAR10 1024 60,000 10 .txt .fvces .txt .ivces
CIFAR100 1024 60,000 100 .txt .fvces .txt .ivces
MNIST 784 70,000 10 .txt .fvces .txt .ivces
Fashion-MNIST 784 70,000 10 .txt .fvces .txt .ivces
AG’s News 100 120,000 4 .txt .fvces .txt .ivces
DBPedia 100 560,000 14 .txt .fvces .txt .ivces
ImageNet 128 1,281,167 1000 .txt .fvces .txt .ivces
Yahoo 100 1,400,000 10 .txt .fvces .txt .ivces
Crawl 300 2,000,000 10 .txt .fvces .txt .ivces
Amazon3M 100 3,000,000 5 .txt .fvces .txt .ivces
Amazon20M 96 19,531,329 5 .txt .fvces .txt .ivces
  • Details of dataset pre-processing are provided in our papers

Visualization Examples

AtSNE

Performance

Compared Algorithms:

Dataset method 10-NN accuracy time Memory (GB) speedup
CIFAR10 BH-t-SNE 0.966 5m12s 2.61 1.6
LargeVis 0.965 8m23s 7.90 1.0
tsne-cuda 0.963 27.7s 2.17 18.1
AtSNE 0.957 19.6s 0.93 25.7
CIFAR100 BH-t-SNE 0.636 9m51s 2.62 0.9
LargeVis 0.607 8m50s 7.90 1.0
tsne-cuda 0.646 28.3s 2.33 18.7
AtSNE 0.600 19s 0.93 27.9
MNIST BH-t-SNE 0.970 5m20s 2.35 1.7
LargeVis 0.966 8m59s 7.15 1.0
tsne-cuda 0.968 31.3s 2.33 14.7
AtSNE 0.967 19.6s 0.93 27.5
Fashion-MNIST BH-t-SNE 0.821 3m46s 2.28 2.3
LargeVis 0.797 8m30s 7.18 1.0
tsne-cuda 0.827 31.1s 2.17 16.4
AtSNE 0.822 19.9s 0.93 25.6
AG’s News BH-t-SNE 0.993 5m30s 0.95 1.9
LargeVis 0.994 10m37s 2.65 1.0
tsne-cuda 0.993 39.3s 2.17 16.2
AtSNE 0.995 23s 0.88 27.7
DBPedia BH-t-SNE 0.993 36m8s 4.22 0.93
LargeVis 0.999 33m43s 12.71 1.0
tsne-cuda - - -
AtSNE 0.999 3m 2.03 11.2
ImageNet BH-t-SNE 0.412 4h7m53s 10.8 0.3
LargeVis 0.608 1h18m45s 53.09 1.0
tsne-cuda - - -
AtSNE 0.649 11m53s 4.01 6.6
Yahoo BH-t-SNE 0.537 2h17m17s 10.47 0.62
LargeVis 0.775 1h25m17s 49.99 1.0
tsne-cuda
AtSNE 0.780 12m52s 4.27 6.6
Crawl BH-t-SNE - >24h
LargeVis 0.688 2h34m14s 139.05 1.0
tsne-cuda -
AtSNE 0.692 30m1s 7.19 5.1
Amazon3M BH-t-SNE - > 24h -
LargeVis 0.606 2h53m25s 104 1.0
tsne-cuda - -
AtSNE 0.603 34m4s 7.98 5.1
Amazon20M BH-t-SNE - - -
LargeVis - - -
tsne-cuda - - -
AtSNE 0.755 4h54m 19.70
  • Tested on i9-7980XE (18Cores, 36 Threads) with 128GB Memory
  • AtSNE and TSNE-CUDA use one GTX 1080Ti GPU
  • BH-t-SNE and LargeVis use 32 threads in the table above
  • - means this method crashed in testing progress, mostly because of memory issues
  • Tested version of LargeVis, BH-t-SNE and TSNE-CUDA are feb8121, 62dedde and efa2098 respectively
  • For Amazon20M dataset which is too large to fit in memory, we use Product Quantization to build KNN graph. AtSNE use extra parameters -k 50 --ivfpq 1 --subQuantizers 24 --bitsPerCode 8.
  • AtSNE just use the default parameters in the test above, except --n_negative 400. Exact parameters of aforementioned result are provided below in case you need it.

--lr 0.05 --vis_iter 2000 --save_interval 0 -k 100 --clusters 1000 --n_negative 400 --center_number 5 --nprobe 50 --knn_negative_rate 0 -p 50 --early_pull_rate 20 --center_pull_iter 500 --early_pull_iter 1000 --scale 10 --center_perplexity 70 --center_grad_coeff 1

How to use

Requirement

  • CUDA (8 or later), nvcc and cublas included
  • gcc
  • faiss

Compile

  1. Clone this project
  2. init submodule (cmdline and faiss)
    • enter the project root directory
    • run git submodule init; git submodule update
  3. Compile faiss, enter directory of faiss (vendor/faiss), and follow Step1 and Step3, confirm that vendor/faiss/libfaiss.a and vendor/faiss/gpu/libgpufaiss.a is generated. Simplified instructions are shown below:
    • install required BLAS library (MKL, openblas): sudo apt install libopenblas-dev
    • cd vender/faiss
    • build faiss cpu library: ./configure && make -j8
    • build faiss gpu library: cd gpu; make -j
  4. enter project root directory, run make -j

Run

./qvis_gpu -b mnist_vec784D_data.txt.fvecs -o mnist_result.txt

We choose good default parameters for you. And there are many other parameters you can change. If you want to reproduce the test in our KDD paper, please add --n_negative 400.

./qvis_gpu -b mnist_vec784D_data.txt.fvecs --n_negative 400 -o mnist_result.txt

ivecs/fvecs vector file formats are defined here

Supplementary tools

There are some supplementary tools we use during developing/debugging/experimentation

  • tools/view.py Draw the result in 2D space and save images for you.
  • Label file is optional.
  • Use multi-process to draw images for results with the same filename-prefix
  • tools/txt_to_fvecs.py covert txt file, like result of largevVis or label file, to ivecs/fvecs
  • tools/largevis_convert.py convert dataset of fvecs/ivecs to largeVis input format
  • tools/imagenet_infer.py generate 128D imagenet feature vectors from ImageNet dataset
  • tools/box_filter.py Give a bounding-box, print the points and corresponding labels. Used for case-study in our paper
  • test_knn_accuracy (Build required) Test knn classifier accuracy(label needed) of visualization result
  • test_top1_error (Build required) Test top-1 error of visualization result. The top-1 error is the ratio that the nearest neighbor of one point in low-dimension is not the nearest neighbor in high-dimension

GitHub