Crawl data from ICLR 2019 OpenReview webpage

ICLR2019-OpenReviewData

Crawl and Visualize ICLR 2019 OpenReview Data.

This Jupyter Notebook contains data crawled from the ICLR 2019 OpenReview webpages, together with visualizations of that data. As some of the reviews were still missing (11.3299% at the time the data was crawled), the results might not be accurate.

Prerequisites

Visualizations

The word cloud formed from the keywords of submissions shows the hot topics, including reinforcement learning, generative adversarial networks, generative models, imitation learning, representation learning, etc.

This figure is plotted with the Python word cloud generator:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# `keywords` is the collection of keyword strings gathered from the submissions
wordcloud = WordCloud(max_font_size=64, max_words=160,
                      width=1280, height=640,
                      background_color="black").generate(' '.join(keywords))
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")  # hide the axes around the image
plt.show()

The distribution of reviewer ratings centers around 5 to 6 (mean: 5.15).
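As a rough illustration of how a per-paper mean rating could be derived before looking at the distribution, the sketch below averages each paper's review scores and then averages across papers. The `reviews` dictionary, its paper IDs, and the helper `per_paper_means` are made-up placeholders rather than the crawled data or the notebook's code, and averaging per paper (instead of per review) is an assumption.

```python
from statistics import mean

# Hypothetical data: paper id -> list of reviewer ratings
# (an empty list stands for a paper whose reviews are still missing).
reviews = {
    "paper_a": [6, 7, 5],
    "paper_b": [4, 5],
    "paper_c": [],
    "paper_d": [8, 7, 9],
}

def per_paper_means(reviews):
    """Return the mean rating of every paper that has at least one review."""
    return {pid: mean(scores) for pid, scores in reviews.items() if scores}

paper_means = per_paper_means(reviews)
rating_mean = list(paper_means.values())  # one mean rating per reviewed paper
overall_mean = mean(rating_mean)          # mean of the per-paper means

print(paper_means)
print('overall mean: {:.2f}'.format(overall_mean))
```

Papers without any review are skipped here rather than counted as zero, so missing reviews do not drag the mean down.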

You can compute how many papers are beaten by yours with:

import numpy as np

def PR(rating_mean, your_rating):
    # Percentage of papers whose mean rating is at or below your rating
    pr = np.sum(your_rating >= np.array(rating_mean))/len(rating_mean)*100
    return pr

# `rating_mean` holds the mean rating of each crawled submission
my_rating = (7+7+9)/3  # your average rating here
print('Your paper beats {:.2f}% of submissions '
      '(well, just based on the ratings...)'.format(PR(rating_mean, my_rating)))
# ICLR 2017: accept rate 39.1% (198/507) (15 orals and 183 posters)
# ICLR 2018: accept rate 32% (314/981) (23 orals and 291 posters)
# ICLR 2019: accept rate ?% (?/1580)
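To make the percentile computation concrete, here is a small worked example on made-up numbers. It is a plain-Python sketch of the same comparison, not the notebook's code; `percentile_rank` and `sample_means` are hypothetical names, and the five ratings are invented for illustration.

```python
def percentile_rank(rating_mean, your_rating):
    """Percentage of papers whose mean rating is at or below your_rating."""
    beaten = sum(1 for r in rating_mean if your_rating >= r)
    return beaten / len(rating_mean) * 100

# Hypothetical mean ratings of five papers (not the crawled data).
sample_means = [3.0, 4.5, 5.0, 6.5, 8.0]

my_rating = (7 + 7 + 9) / 3  # about 7.67
result = percentile_rank(sample_means, my_rating)
print('{:.2f}%'.format(result))  # prints "80.00%"
```

A rating of about 7.67 is at or above four of the five sample means, hence 80%. Because the comparison uses `>=`, papers tied with yours count as beaten.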

The top 50 most common keywords and their frequencies.
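One way such a keyword ranking could be computed is with `collections.Counter`. The sketch below is only an assumption about the approach: the `keywords_per_paper` sample and the helper `top_keywords` are hypothetical, and lower-casing keywords before counting is a choice made here, not something the notebook is known to do.

```python
from collections import Counter

# Hypothetical keyword lists, one per submission (not the crawled data).
keywords_per_paper = [
    ["Reinforcement Learning", "imitation learning"],
    ["GAN", "generative models"],
    ["reinforcement learning", "exploration"],
]

def top_keywords(keywords_per_paper, n=50):
    """Return the n most frequent keywords as (keyword, count) pairs."""
    counts = Counter(
        kw.strip().lower()           # normalize case and surrounding spaces
        for paper in keywords_per_paper
        for kw in paper
        if kw.strip()                # skip empty keywords
    )
    return counts.most_common(n)

for keyword, count in top_keywords(keywords_per_paper, n=3):
    print(keyword, count)
```

Normalizing case merges "Reinforcement Learning" and "reinforcement learning" into one entry; without it, spelling variants would split the counts.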

The average reviewer ratings and the keyword frequencies suggest that, to maximize your chance of getting higher ratings, you could use keywords such as theory, robustness, or graph neural network.
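The relationship between keywords and ratings can be sketched by grouping papers by keyword and averaging their mean ratings. Everything below is illustrative: the `papers` sample, the helper `mean_rating_by_keyword`, and the `min_count` threshold are assumptions, not the notebook's actual data or method.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (keywords, mean rating) pairs, one per paper.
papers = [
    (["theory", "optimization"], 6.5),
    (["robustness"], 6.0),
    (["theory"], 5.5),
    (["gan", "robustness"], 5.0),
]

def mean_rating_by_keyword(papers, min_count=2):
    """Map each keyword to (average rating, paper count), dropping rare keywords."""
    ratings = defaultdict(list)
    for keywords, rating in papers:
        for kw in set(keywords):     # count each keyword once per paper
            ratings[kw].append(rating)
    return {
        kw: (mean(rs), len(rs))
        for kw, rs in ratings.items()
        if len(rs) >= min_count
    }

by_keyword = mean_rating_by_keyword(papers)
for kw, (avg, n) in sorted(by_keyword.items(), key=lambda item: -item[1][0]):
    print('{}: {:.2f} over {} papers'.format(kw, avg, n))
```

The `min_count` filter matters because a keyword used by a single highly rated paper would otherwise top the list without saying much about the keyword itself.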

How it works

See How to install Selenium and ChromeDriver on Ubuntu.

To crawl data from dynamic websites such as OpenReview, we create a headless browser with Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

executable_path = '/Users/waltersun/Desktop/chromedriver'  # path to your ChromeDriver executable
options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
# Selenium 3 style; Selenium 4 passes the driver path through a Service object instead
browser = webdriver.Chrome(options=options, executable_path=executable_path)

Then, we can load a webpage and get its content:

browser.get(url)

To know what content we can crawl, we will need to inspect the webpage layout.

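I chose to get the content by class name:

# Field names and field values on the page (Selenium 3 API;
# Selenium 4 uses find_elements(By.CLASS_NAME, ...) instead)
key = browser.find_elements_by_class_name("note_content_field")
value = browser.find_elements_by_class_name("note_content_value")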

The data includes the abstract, keywords, TL;DR, and comments.
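Once the field-name and field-value elements are collected, they can be paired into one record per submission. The sketch below assumes the two lists align one-to-one and that labels end with a colon (for example "Keywords:"); both are assumptions about the page layout. `pair_fields` is a hypothetical helper, and `SimpleNamespace` objects stand in for Selenium elements so the example runs without a browser.

```python
from types import SimpleNamespace

def pair_fields(key_elements, value_elements):
    """Zip field-name elements with field-value elements into a dict.

    Only the `.text` attribute of each element is used.
    """
    record = {}
    for k, v in zip(key_elements, value_elements):
        name = k.text.strip().rstrip(":").strip().lower()
        record[name] = v.text.strip()
    return record

# Stand-ins for the elements returned by the browser (made-up content).
key = [SimpleNamespace(text="Abstract:"), SimpleNamespace(text="Keywords:")]
value = [
    SimpleNamespace(text="A toy abstract."),
    SimpleNamespace(text="theory, robustness"),
]

record = pair_fields(key, value)
# Splitting on commas is an assumption about how keywords are listed.
keywords = [kw.strip() for kw in record["keywords"].split(",")]
print(record["abstract"])
print(keywords)
```

If the page ever renders a field name without a matching value, `zip` would silently misalign the pairs, so checking that both lists have the same length before pairing is a sensible safeguard.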

Installing Selenium and ChromeDriver on Ubuntu

The following content is largely borrowed from a nice post written by Christopher Su.

  • Install Google Chrome for Debian/Ubuntu
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
  • Install xvfb to run Chrome on a headless device
sudo apt-get install xvfb
  • Install ChromeDriver for 64-bit Linux
sudo apt-get install unzip  # If you don't have unzip package

wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver

sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver

If your system is 32-bit, please find the ChromeDriver releases here and modify the above download command.

  • Install Python dependencies (Selenium and pyvirtualdisplay)
pip install pyvirtualdisplay selenium
  • Test your setup in Python
from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(1024, 1024))
display.start()
browser = webdriver.Chrome()
browser.get('http://shaohua0116.github.io/')
print(browser.title)
print(browser.find_element_by_class_name('bio').text)
