ICLR2020-OpenReviewData
A script that crawls metadata from the ICLR OpenReview webpages, plus a tutorial on installing and using Selenium and ChromeDriver on Ubuntu.
This Jupyter Notebook contains the data crawled from ICLR 2020 OpenReview webpages and their visualizations.
Visualizations
Rating distribution
The distribution of reviewer ratings centers around 4 (mean: 3.9063).
You can compute the percentage of submissions that your paper beats with:
# See how many papers are beaten by yours
import numpy as np

def PR(rating_mean, your_rating):
    # Percentage of submissions whose mean rating is strictly below yours
    pr = np.sum(your_rating > np.array(rating_mean)) / len(rating_mean) * 100
    return pr

# rating_mean: mean reviewer rating of each submission (assumed to be computed earlier in the notebook)
my_rating = (6 + 6 + 3) / 3.  # your average rating here
print('Your paper ({:.2f}) beats {:.2f}% of submissions based on the ratings.'.format(
    my_rating, PR(rating_mean, my_rating)))
#             accept rate        orals  posters
# ICLR 2017:  39.1% (198/507)    15     183
# ICLR 2018:  32.0% (314/981)    23     291
# ICLR 2019:  31.4% (500/1591)   24     476
# ICLR 2020:  ?     (?/2594)
[Output]
Your paper (5.00) beats 69.97% of submissions based on the ratings.
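As a self-contained illustration of the same computation, the sketch below replaces the crawled rating_mean with a handful of made-up reviewer scores. The function name percentile_rank and the toy data are assumptions for illustration and are not part of the notebook.

```python
from statistics import mean

def percentile_rank(all_means, your_mean):
    """Percentage of submissions whose mean rating is strictly below your_mean."""
    beaten = sum(1 for m in all_means if m < your_mean)
    return 100.0 * beaten / len(all_means)

# Hypothetical reviewer scores, one inner list per submission
toy_reviews = [[1, 3, 3], [3, 6, 6], [6, 6, 8], [3, 3, 1], [6, 8, 8]]
toy_means = [mean(scores) for scores in toy_reviews]

my_mean = mean([6, 6, 3])  # your own reviewer scores
print('Your paper ({:.2f}) beats {:.2f}% of submissions.'.format(
    my_mean, percentile_rank(toy_means, my_mean)))
```

Because the comparison is strict, a submission with exactly the same mean rating as yours is not counted as beaten, which matches the `your_rating > ...` test in the original function.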
Word clouds
The word cloud built from the keywords of submissions highlights hot topics such as deep learning, reinforcement learning, representation learning, generative models, and graph neural networks.
This figure is plotted with the Python word cloud generator (the wordcloud package):
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# keywords: list of keyword strings collected from all submissions
wordcloud = WordCloud(max_font_size=64, max_words=160,
                      width=1280, height=640,
                      background_color="black").generate(' '.join(keywords))
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
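The keywords list passed to the word cloud has to be assembled from the crawled keyword fields first. A minimal sketch of that step, assuming each submission's keywords arrive as one comma-separated string; the raw format, the sample strings, and the helper name split_keywords are all assumptions for illustration.

```python
def split_keywords(raw_fields):
    """Flatten comma-separated keyword strings into one list of lowercase keywords."""
    keywords = []
    for field in raw_fields:
        for kw in field.split(','):
            kw = kw.strip().lower()
            if kw:  # skip empty entries left by trailing commas
                keywords.append(kw)
    return keywords

# Hypothetical keyword fields from three submissions
raw_fields = [
    "Deep Learning, Reinforcement Learning",
    "graph neural network, representation learning,",
    "deep learning",
]
keywords = split_keywords(raw_fields)
print(keywords)
```

If whole phrases should stay intact in the figure, WordCloud.generate_from_frequencies accepts a keyword-to-count mapping instead of raw text.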
Frequent keywords
The 50 most common keywords and their frequencies.
The average reviewer ratings and keyword frequencies suggest that, to maximize your chance of getting higher ratings, you could use keywords such as compositionality, deep learning theory, or gradient descent.
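The two statistics behind these plots, keyword frequency and average rating per keyword, can be sketched with the standard library. The paper records below are invented, and the min_count cutoff is an assumption added so that keywords seen only once do not dominate the ranking.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical (keywords, mean rating) pairs, one per submission
papers = [
    (["deep learning", "gradient descent"], 6.0),
    (["deep learning", "compositionality"], 5.0),
    (["reinforcement learning"], 3.0),
    (["deep learning", "reinforcement learning"], 4.0),
    (["gradient descent"], 7.0),
]

# Number of submissions that list each keyword
freq = Counter(kw for kws, _ in papers for kw in kws)

# Collect the ratings of every paper that lists a given keyword
ratings_by_kw = defaultdict(list)
for kws, rating in papers:
    for kw in kws:
        ratings_by_kw[kw].append(rating)

def average_rating(min_count=2):
    """Mean rating per keyword, restricted to keywords seen at least min_count times."""
    return {kw: mean(r) for kw, r in ratings_by_kw.items() if len(r) >= min_count}

avg = average_rating()
best = max(avg, key=avg.get)  # keyword with the highest average rating
print(freq.most_common(3))
print(best, avg[best])
```

A ranking like this shows correlation only: with few papers per keyword, the averages are noisy.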
Review length histogram
The average review length is 395.36 words. The histogram is as follows.
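Review length here is measured in words. A minimal sketch of how such an average and histogram could be computed, assuming words are counted by splitting on whitespace; the review strings are invented, and the bin width of 5 is chosen only to suit the toy data.

```python
from collections import Counter
from statistics import mean

BIN_WIDTH = 5  # assumed bin width, sized for the toy data below

# Hypothetical review texts standing in for the crawled reviews
reviews = [
    "The paper is well written and the experiments are convincing.",
    "Interesting idea but the evaluation is limited.",
    "I recommend rejection.",
]

# Word count per review, splitting on whitespace
lengths = [len(r.split()) for r in reviews]

def histogram(values, bin_width):
    """Map each bin's lower edge to the number of values falling in that bin."""
    return Counter((v // bin_width) * bin_width for v in values)

hist = histogram(lengths, BIN_WIDTH)
print('Average review length: {:.2f} words'.format(mean(lengths)))
for lower in sorted(hist):
    print('{:>3}-{:<3} {}'.format(lower, lower + BIN_WIDTH - 1, '#' * hist[lower]))
```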
How it works
See the section Installing Selenium and ChromeDriver on Ubuntu below for setup instructions.
To crawl data from dynamic websites such as OpenReview, create a headless browser with:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
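executable_path = '/Users/waltersun/Desktop/chromedriver'  # path to your ChromeDriver executable
options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
# Note: recent Selenium 4 releases no longer accept executable_path here;
# pass service=Service(executable_path) from selenium.webdriver.chrome.service instead.
browser = webdriver.Chrome(options=options, executable_path=executable_path)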
Then, we can get the content from a webpage
browser.get(url)
To know what content to crawl, we need to inspect the webpage layout.
I chose to get the content by
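from selenium.webdriver.common.by import By
# find_elements(By.CLASS_NAME, ...) replaces find_elements_by_class_name, which Selenium 4 removed
key = browser.find_elements(By.CLASS_NAME, "note_content_field")
value = browser.find_elements(By.CLASS_NAME, "note_content_value")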
The data includes the abstract, keywords, TL;DR, and comments.
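Once the field and value elements are collected, they can be paired into one record per page. A sketch under several assumptions: the text of each element has already been extracted, labels and values appear in matching order, and labels end with a colon. The sample strings are invented and to_record is a hypothetical helper.

```python
def to_record(field_texts, value_texts):
    """Pair field labels with values, normalizing labels like 'Keywords:' to 'keywords'."""
    record = {}
    for field, value in zip(field_texts, value_texts):
        name = field.strip().rstrip(':').strip().lower()
        record[name] = value.strip()
    return record

# Invented strings standing in for [e.text for e in key] and [e.text for e in value]
field_texts = ["Abstract:", "Keywords:", "TL;DR:"]
value_texts = [
    "We study a toy problem.",
    "deep learning, optimization",
    "A toy summary.",
]
record = to_record(field_texts, value_texts)
print(record["keywords"])
```

Keeping each page as a dictionary makes it easy to look up fields by name when computing the statistics shown earlier.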
Installing Selenium and ChromeDriver on Ubuntu
The following content is largely borrowed from a nice post written by Christopher Su.
- Install Google Chrome for Debian/Ubuntu
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb
sudo apt-get install -f
- Install xvfb to run Chrome on a headless device
sudo apt-get install xvfb
- Install ChromeDriver for 64-bit Linux
sudo apt-get install unzip # If you don't have unzip package
wget -N http://chromedriver.storage.googleapis.com/2.26/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
If your system is 32-bit, please find the ChromeDriver releases here and modify the above download command.
- Install Python dependencies (Selenium and pyvirtualdisplay)
pip install pyvirtualdisplay selenium
- Test your setup in Python
from pyvirtualdisplay import Display
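from selenium import webdriver
from selenium.webdriver.common.by import By
display = Display(visible=0, size=(1024, 1024))  # virtual display backed by xvfb
display.start()
browser = webdriver.Chrome()
browser.get('http://shaohua0116.github.io/')
print(browser.title)
# find_element(By.CLASS_NAME, ...) also works on Selenium 4, where find_element_by_class_name was removed
print(browser.find_element(By.CLASS_NAME, 'bio').text)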