Crawl and Visualize ICLR 2020 OpenReview Data

ICLR2020-OpenReviewData

A script that crawls metadata from the ICLR OpenReview webpage, plus a tutorial on installing and using Selenium and ChromeDriver on Ubuntu.

This Jupyter Notebook contains the data crawled from ICLR 2020 OpenReview webpages and their visualizations.

Visualizations

Rating distribution

The distribution of reviewer ratings centers around 4 (mean: 3.9063).

[Figure: rating distribution]
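As a rough sketch of how the per-paper averages behind such a distribution might be computed, the snippet below assumes the crawled ratings are stored as a dictionary mapping a paper identifier to its list of reviewer ratings. The name `ratings` and its values are made up for illustration and are not the crawled data. Whether the reported mean is taken over papers or over individual reviews is not stated here, so the sketch pools all reviews as one possible reading.

```python
from statistics import mean

# Hypothetical toy data: paper id -> reviewer ratings (not the crawled data).
ratings = {
    "paper_a": [6, 6, 3],
    "paper_b": [1, 3, 3],
    "paper_c": [8, 6, 6],
}

# Mean rating of each paper, in the same order as the dictionary.
rating_mean = [float(mean(r)) for r in ratings.values()]

# Mean over all individual reviews, pooled across papers.
all_ratings = [score for r in ratings.values() for score in r]
overall_mean = float(mean(all_ratings))

print(rating_mean)
print(round(overall_mean, 4))
```

A list like `rating_mean` is the kind of input the percentile function in this section expects.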

You can compute the percentage of submissions that your paper beats with:

import numpy as np

# Percentage of submissions whose mean rating is strictly below yours.
# rating_mean is assumed to hold the mean rating of every crawled submission.
def PR(rating_mean, your_rating):
    pr = np.sum(your_rating > np.array(rating_mean)) / len(rating_mean) * 100
    return pr

my_rating = (6 + 6 + 3) / 3.  # your average rating here
print('Your paper ({:.2f}) beats {:.2f}% of submissions based on the ratings.'.format(
          my_rating, PR(rating_mean, my_rating)))

#            accept rate       orals     posters
# ICLR 2017: 39.1% (198/507)    15         183
# ICLR 2018: 32.0% (314/981)    23         291
# ICLR 2019: 31.4% (500/1591)   24         476
# ICLR 2020: ?     (?/2594)

[Output]

Your paper (5.00) beats 69.97% of submissions based on the ratings.

Word clouds

The word cloud formed from the keywords of submissions shows the hot topics, including deep learning, reinforcement learning, representation learning, generative models, and graph neural networks.

[Figure: keyword word cloud]

This figure is plotted with the Python word cloud generator (the wordcloud package):

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# keywords is a list of keyword strings collected from all submissions
wordcloud = WordCloud(max_font_size=64, max_words=160,
                      width=1280, height=640,
                      background_color="black").generate(' '.join(keywords))
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Frequent keywords

The 50 most common keywords and their frequencies.

[Figure: keyword frequency]
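A minimal sketch of how such a frequency ranking could be produced with the standard library is shown below. It assumes each submission's keywords are available as a list of strings. The name `keywords_per_paper` and its contents are hypothetical, and the lowercasing step is an assumption about how spelling variants might be merged, not necessarily what the notebook does.

```python
from collections import Counter

# Hypothetical toy data: one keyword list per submission.
keywords_per_paper = [
    ["Deep Learning", "reinforcement learning"],
    ["deep learning", "generative models"],
    ["Reinforcement Learning", "deep learning"],
]

# Lowercase and strip each keyword so spelling variants are counted together.
keywords = [k.strip().lower() for paper in keywords_per_paper for k in paper]

# most_common(n) returns up to n (keyword, count) pairs, most frequent first.
top_keywords = Counter(keywords).most_common(50)
print(top_keywords)
```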

The average reviewer ratings and the frequency of keywords indicate that, to maximize your chance of getting higher ratings, you would use keywords such as compositionality, deep learning theory, or gradient descent.

[Figure: average rating versus keyword frequency]
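The relationship between keyword frequency and average rating can be sketched as follows, assuming each submission is reduced to a pair of its keyword list and its mean rating. The `papers` list is hypothetical toy data, and the grouping approach is one plausible way to compute these statistics rather than the notebook's confirmed method.

```python
from collections import defaultdict

# Hypothetical toy data: (keywords, mean rating) per submission.
papers = [
    (["deep learning", "gradient descent"], 6.0),
    (["deep learning"], 3.0),
    (["gradient descent", "compositionality"], 7.0),
]

# Collect the ratings of every paper that lists a given keyword.
ratings_by_keyword = defaultdict(list)
for paper_keywords, rating in papers:
    for kw in paper_keywords:
        ratings_by_keyword[kw].append(rating)

# keyword -> (frequency, average rating)
keyword_stats = {
    kw: (len(r), sum(r) / len(r)) for kw, r in ratings_by_keyword.items()
}

# Print keywords sorted by average rating, highest first.
for kw, (freq, avg) in sorted(keyword_stats.items(), key=lambda kv: -kv[1][1]):
    print(f"{kw}: frequency={freq}, average rating={avg:.2f}")
```

Averages for keywords that appear only a handful of times rest on very few papers, so they are worth reading alongside the frequency.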

Review length histogram

The average review length is 395.36 words. The histogram is as follows.

[Figure: review length histogram]

[Figure: review length histogram with ratings]
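For illustration, an average review length of this kind could be computed by counting whitespace-separated words. The `reviews` list below is made-up sample text, and splitting on whitespace is an assumed definition of a word, since the exact tokenization used for the figure above is not stated.

```python
from statistics import mean

# Hypothetical toy data: review texts (not the crawled reviews).
reviews = [
    "This paper proposes a new method.",
    "The experiments are convincing but the writing needs work.",
    "Reject.",
]

# Word count per review, splitting on whitespace.
review_lengths = [len(review.split()) for review in reviews]
average_length = mean(review_lengths)

print(review_lengths)
print(f"{average_length:.2f}")
```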

How it works

See Installing Selenium and ChromeDriver on Ubuntu below for the setup.

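To crawl data from dynamic websites such as OpenReview, a headless browser can be created with:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

executable_path = '/Users/waltersun/Desktop/chromedriver'  # path to your ChromeDriver executable
options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
# Selenium 3 style. executable_path was deprecated in Selenium 4 and later removed;
# there, use webdriver.Chrome(service=Service(executable_path), options=options)
# with Service imported from selenium.webdriver.chrome.service.
browser = webdriver.Chrome(options=options, executable_path=executable_path)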

Then, we can load a webpage and get its content:

browser.get(url)

To know what content to crawl, we need to inspect the webpage layout.

[Figure: inspecting the webpage layout]

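I chose to get the content by class name:

# Selenium 3 style; on Selenium 4.3+ use browser.find_elements(By.CLASS_NAME, ...)
# with By imported from selenium.webdriver.common.by.
key = browser.find_elements_by_class_name("note_content_field")    # field labels
value = browser.find_elements_by_class_name("note_content_value")  # field contents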

The data includes the abstract, keywords, TL;DR, and comments.
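Once the `key` and `value` elements are collected, their texts can be paired into a dictionary. The sketch below works on plain strings so it runs without a browser. `pair_fields` is a hypothetical helper, the sample texts are invented, and the trailing colon on each label is an assumption about how the page renders field names.

```python
def pair_fields(keys, values):
    """Pair field labels with field texts, dropping a trailing colon from each label."""
    return {k.strip().rstrip(":").strip(): v.strip() for k, v in zip(keys, values)}


# Hypothetical texts, standing in for [e.text for e in key] and [e.text for e in value].
key_texts = ["Keywords:", "TL;DR:", "Abstract:"]
value_texts = ["deep learning, optimization", "A short summary.", "We study optimization."]

fields = pair_fields(key_texts, value_texts)
print(fields["Keywords"].split(", "))
```

With a live browser, the two input lists would come from the `.text` attribute of each element in `key` and `value`.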

Installing Selenium and ChromeDriver on Ubuntu

The following content is largely borrowed from a nice post written by Christopher Su.

  • Install Google Chrome for Debian/Ubuntu
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
  • Install xvfb to run Chrome on a headless device
sudo apt-get install xvfb
  • Install ChromeDriver for 64-bit Linux
sudo apt-get install unzip  # If you don't have unzip package

wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

If your system is 32-bit, please find the matching build among the ChromeDriver releases and modify the above download command accordingly.

  • Install Python dependencies (Selenium and pyvirtualdisplay)
pip install pyvirtualdisplay selenium
  • Test your setup in Python
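from pyvirtualdisplay import Display
from selenium import webdriver

# Start a virtual display (backed by xvfb) so Chrome can run without a screen.
display = Display(visible=0, size=(1024, 1024))
display.start()
browser = webdriver.Chrome()
browser.get('http://shaohua0116.github.io/')
print(browser.title)
# Selenium 3 style; on Selenium 4.3+ use browser.find_element(By.CLASS_NAME, 'bio')
# with By imported from selenium.webdriver.common.by.
print(browser.find_element_by_class_name('bio').text)

# Clean up the browser and the virtual display.
browser.quit()
display.stop()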
