Bokeh Plotting Backend for Pandas and GeoPandas

Pandas Bokeh

Pandas Bokeh provides a Bokeh plotting backend for Pandas and GeoPandas, similar to the already existing Visualization feature of Pandas. Importing the library adds a complementary plotting method plot_bokeh() on DataFrames and Series (and also on GeoDataFrames).

With Pandas Bokeh, creating stunning, interactive, HTML-based visualizations is as easy as calling:

df.plot_bokeh()

Startimage

Installation

You can install Pandas Bokeh from PyPI via pip:

pip install pandas-bokeh

Pandas Bokeh is supported on Python 2.7, as well as Python 3.6 and above.


How To Use

The Pandas-Bokeh library should be imported after Pandas. After the import, one should define the plotting output, which can be:

  • pandas_bokeh.output_notebook(): Embeds the plots in the cell outputs of the notebook. Ideal when working in Jupyter Notebooks.
  • pandas_bokeh.output_file(filename): Exports the plot to the provided filename as an HTML file.

For more details about the plotting outputs, see the reference here or the Bokeh documentation.

Notebook output (see also bokeh.io.output_notebook)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

File output to "Interactive Plot.html" (see also bokeh.io.output_file)

import pandas as pd
import pandas_bokeh
pandas_bokeh.output_file("Interactive Plot.html")

Lineplot

Basic Lineplot

This simple lineplot already contains various interactive elements:

  • a pannable and zoomable plot (zooming works both inside the plot area and on the axes)
  • by clicking on the legend elements, one can hide and show the individual lines
  • a Hovertool for the plotted lines

Note: If the x parameter is not specified, the index is used for the x-values of the plot.

import numpy as np

np.random.seed(42)
df = pd.DataFrame({"Google": np.random.randn(1000)+0.2, 
                   "Apple": np.random.randn(1000)+0.17}, 
                   index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
df.plot_bokeh(kind="line")

ApplevsGoogle_1

Advanced Lineplot

There are various optional parameters to tune the plots, for example:

  • kind: Which kind of plot should be produced. Currently supported are: "line", "point", "scatter", "bar" and "hist". In the near future many more will be implemented, such as horizontal barplots, boxplots, pie-charts, etc.

  • figsize: Choose width & height of the plot

  • title: Sets title of the plot

  • xlim/ylim: Set the visible range of the plot for the x- and y-axis (also works for a datetime x-axis)

  • xlabel/ylabel: Set x- and y-labels

  • logx/logy: Set log-scale on x-/y-axis

  • xticks/yticks: Explicitly set the ticks on the axes

  • color: Defines a single color for a plot.

  • colormap: Defines the colors to plot. Can be either a list of colors or the name of a Bokeh color palette

  • hovertool: If True a Hovertool is active, else if False no Hovertool is drawn.

  • **kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Try them out to get a feeling for the effects. Let us consider now:

df.plot_bokeh(
    kind="line",
    figsize=(800, 450),
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0,100,200,300,400],
    ylim=(0,400),
    colormap=["red", "blue"])

ApplevsGoogle_2
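Two of the options listed above that the example does not use are logy and a datetime xlim. The following is only a sketch on made-up data: the function name plot_growth is hypothetical, and it assumes that logy accepts a boolean, as it does in the regular Pandas plotting API. The plotting call is wrapped in a function so that the data preparation can be run on its own.

```python
import numpy as np
import pandas as pd

# Illustrative data: exponential growth on a daily datetime index.
index = pd.date_range("2000-01-01", periods=100)
df_growth = pd.DataFrame({"Growth": np.exp(np.linspace(0, 5, 100))}, index=index)


def plot_growth(df):
    """Sketch: log-scaled y-axis and a visible x-range given as date strings."""
    import pandas_bokeh  # noqa: F401  (the import adds plot_bokeh to DataFrames)

    return df.plot_bokeh(
        kind="line",
        title="Exponential growth",
        logy=True,  # assumed to be a boolean switch
        xlim=("2000-02-01", "2000-03-01"),
        show_figure=False,
    )
```

Calling plot_growth(df_growth) should then return the figure without displaying it.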

Lineplot with data points

For lineplots, as for many other plot-kinds, there are some special keyword arguments that only work for this plotting type. For lineplots, these are:

  • plot_data_points: Plot also the data points on the lines

  • plot_data_points_size: Determines the size of the data points

  • marker: Defines the point type (Default: "circle"). Possible values are: 'circle', 'square', 'triangle', 'asterisk', 'circle_x', 'square_x', 'inverted_triangle', 'x', 'circle_cross', 'square_cross', 'diamond', 'cross'

  • **kwargs: Optional keyword arguments of bokeh.plotting.figure.line

Let us use this information to have another version of the same plot:

df.plot_bokeh(
    kind="line",
    figsize=(800, 450),
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0,100,200,300,400],
    ylim=(100,200),
    xlim=("2001-01-01","2001-02-01"),
    colormap=["red", "blue"],
    plot_data_points=True,
    plot_data_points_size=10,
    marker="asterisk",
    toolbar_location="right"
)

ApplevsGoogle_3

Pointplot

If you just wish to draw the data points for curves, the pointplot option is the right choice. It also accepts the kwargs of bokeh.plotting.figure.scatter like marker or size:

import numpy as np

x = np.arange(-3, 3, 0.1)
y2 = x**2
y3 = x**3
df = pd.DataFrame({"x": x, "Parabola": y2, "Cube": y3})
df.plot_bokeh(
    kind="point",
    x="x",
    xticks=range(-3, 4),
    size=5,
    colormap=["#009933", "#ff3399"],
    title="Pointplot (Parabola vs. Cube)",
    marker="x")

Pointplot

Scatterplot

A basic scatterplot can be created using the kind="scatter" option. For scatterplots, the x and y parameters have to be specified and the following optional keyword argument is allowed:

  • category: Determines the category column to use for coloring the scatter points

  • **kwargs: Optional keyword arguments of bokeh.plotting.figure.scatter

Note that the pandas.DataFrame.plot_bokeh() method returns a Bokeh figure by default, which can be embedded in dashboard layouts with other figures and Bokeh objects (for more details about (sub)plot layouts and embedding the resulting Bokeh plots as HTML, see the Outputs and Layouts section).

In the example below, we use the built-in grid layout support of Pandas Bokeh to display both the DataFrame (embedded in a Div) and the resulting scatterplot:

#Load Iris Dataset from Scikit Learn:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris["data"])
df.columns = iris["feature_names"]
df["species"] = iris["target"]
df["species"] = df["species"].map(dict(zip(range(3), iris["target_names"])))
df = df.sample(frac=1)

#Create Div with DataFrame:
from bokeh.models import Div
div_df = Div(text=df.head(10).to_html(index=False), 
             width=550)

#Create Scatterplot:
p_scatter = df.plot_bokeh(
    kind="scatter",
    x="petal length (cm)",
    y="sepal width (cm)",
    category="species",
    title="Iris DataSet Visualization",
    show_figure=False)

#Combine Div and Scatterplot via grid layout:
pandas_bokeh.plot_grid([[div_df, p_scatter]], 
                       plot_width=400, 
                       plot_height=350)

Scatterplot

Barplot

The barplot API has no special keyword arguments, but accepts optional kwargs of bokeh.plotting.figure.vbar like alpha. By default, it uses the index for the bar categories (however, a column can also be used as the x-axis category via the x argument).

data = {
    'fruits':
    ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
    '2015': [2, 1, 4, 3, 2, 4],
    '2016': [5, 3, 3, 2, 4, 6],
    '2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")

df.plot_bokeh(
    kind="bar",
    ylabel="Price per Unit [€]",
    title="Fruit prices per Year",
    alpha=0.6)

Barplot
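The example above takes the bar categories from the index. Using the x argument mentioned in the text instead might look like the sketch below, where the fruits column stays a regular column. The data is shortened for illustration and the function name plot_fruit_bars is hypothetical.

```python
import pandas as pd

# Shortened version of the fruit data, without calling set_index.
df_fruits = pd.DataFrame({
    "fruits": ["Apples", "Pears", "Plums"],
    "2015": [2, 1, 3],
    "2016": [5, 3, 2],
})


def plot_fruit_bars(df):
    """Sketch: use the 'fruits' column as bar categories via the x argument."""
    import pandas_bokeh  # noqa: F401  (the import adds plot_bokeh to DataFrames)

    return df.plot_bokeh(
        kind="bar",
        x="fruits",
        ylabel="Price per Unit [€]",
        title="Fruit prices per Year",
        alpha=0.6,
        show_figure=False,
    )
```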

Histogram

For drawing histograms, Pandas Bokeh has a lot of customization features. The optional keyword arguments for kind="hist" are:

  • bins: Determines bins to use for the histogram. If bins is an int, it defines the number of equal-width bins in the given range (10, by default). If bins is a sequence, it defines the bin edges, including the rightmost edge, allowing for non-uniform bin widths. If bins is a string, it defines the method used to calculate the optimal bin width, as defined by histogram_bin_edges.

  • histogram_type: Either "sidebyside", "topontop" or "stacked". Default: "topontop"

  • **kwargs: Optional keyword arguments of bokeh.plotting.figure.quad
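To get a feeling for the three forms of bins described above, the resulting bin edges can be inspected directly with numpy.histogram_bin_edges. This is a sketch on made-up sample data that uses NumPy only; it assumes that Pandas Bokeh interprets bins the same way NumPy does, which the description suggests but which is not checked here.

```python
import numpy as np

# Illustrative sample data (not taken from the examples in this document).
rng = np.random.default_rng(42)
samples = rng.normal(size=1000)

# int: number of equal-width bins over the data range, giving bins + 1 edges.
edges_int = np.histogram_bin_edges(samples, bins=10)

# sequence: explicit bin edges, which may have non-uniform widths.
edges_seq = np.histogram_bin_edges(samples, bins=[-5, -1, 0, 1, 5])

# string: name of a NumPy bin-width estimator, for example "auto".
edges_auto = np.histogram_bin_edges(samples, bins="auto")

print(len(edges_int), len(edges_seq), len(edges_auto))
```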

import numpy as np

df_hist = pd.DataFrame({
    'a': np.random.randn(1000) + 1,
    'b': np.random.randn(1000),
    'c': np.random.randn(1000) - 1
}, columns=['a', 'b', 'c'])

#Top-on-Top Histogram (Default):
df_hist.plot_bokeh(
    kind="hist",
    bins=np.linspace(-5, 5, 41),
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Top-on-Top)",
    line_color="black")

#Side-by-Side Histogram (multiple bars share bin side-by-side):
df_hist.plot_bokeh(
    kind="hist",
    bins=np.linspace(-5, 5, 41),
    histogram_type="sidebyside",
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Side-by-Side)",
    line_color="black")

#Stacked histogram:
df_hist.plot_bokeh(
    kind="hist",
    bins=np.linspace(-5, 5, 41),
    histogram_type="stacked",
    vertical_xlabel=True,
    hovertool=False,
    title="Normal distributions (Stacked)",
    line_color="black")

Histograms_all

Geoplots

Pandas Bokeh also allows for interactive plotting of maps using GeoPandas by providing a geopandas.GeoDataFrame.plot_bokeh() method. The following geodata can be plotted on a map:

  • Points/MultiPoints
  • Lines/MultiLines
  • Polygons/MultiPolygons

Note: It is not possible to mix object types; for example, a GeoDataFrame containing both Points and Lines is not allowed.

Let us start with a simple example. First, we import all necessary libraries, then download and read an example GeoJSON file of the US states:

import requests
import geopandas as gpd
import pandas as pd
import pandas_bokeh
pandas_bokeh.output_notebook()

#Download GeoJSON example data:

geojson = requests.get(r"https://raw.githubusercontent.com/PatrikHlobil/Pandas-Bokeh/master/Documentation/Testdata/states/states.geojson").text
with open("states.geojson", "w") as f:
    f.write(geojson)
    
#Read in GeoJSON:
df_states = gpd.read_file("states.geojson")
df_states.head()

STATE_NAME    REGION  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  POPESTIMATE2013  POPESTIMATE2014  POPESTIMATE2015  POPESTIMATE2016  POPESTIMATE2017  geometry
Hawaii        4       1363817          1378323          1392772          1408038          1417710          1426320          1428683          1427538          (POLYGON ((-160.0738033454681 22.0041773479577...
Washington    4       6741386          6819155          6890899          6963410          7046931          7152818          7280934          7405743          (POLYGON ((-122.4020153103835 48.2252163723779...
Montana       4       990507           996866           1003522          1011921          1019931          1028317          1038656          1050493          POLYGON ((-111.4754253002074 44.70216236909688...
Maine         1       1327568          1327968          1328101          1327975          1328903          1327787          1330232          1335907          (POLYGON ((-69.77727626137293 44.0741483685119...
North Dakota  2       674518           684830           701380           722908           738658           754859           755548           755393           POLYGON ((-98.73043728833767 45.93827137024809...

Plotting the data on a map is as simple as calling:

df_states.plot_bokeh(simplify_shapes=10000)

US_States_1

We also passed the optional parameter simplify_shapes (given roughly in meters) to improve plotting performance (for reference, see shapely.object.simplify). The above geolayer thus has an accuracy of about 10 km.

Many keyword arguments like xlabel, ylabel, title, colormap, hovertool, ... for customizing the plot are also available for the geoplotting API and can be used as in the examples shown above. There are, however, also many other options especially for plotting geodata:

  • hovertool_columns: Specify column names, for which values should be shown in hovertool
  • colormap_uselog: If set to True, the colormapper uses a log scale. Default: False
  • colormap_range: Specify the value range of the colormapper via (min, max) tuple
  • tile_provider: Define a built-in tile provider for background maps. Possible values: 'CARTODBPOSITRON', 'CARTODBPOSITRON_RETINA', 'STAMEN_TERRAIN', 'STAMEN_TERRAIN_RETINA', 'STAMEN_TONER', 'STAMEN_TONER_BACKGROUND', 'STAMEN_TONER_LABELS'. Default: CARTODBPOSITRON_RETINA
  • tile_provider_url: An arbitrary tile_provider_url of the form '/{Z}/{X}/{Y}*.png' can be passed to be used as a background map.
  • legend: Hide or show legend. Default: True

One of the most common uses of map plots is the choropleth map, where the color of each object is determined by a property of the object itself. There are three ways of drawing choropleth maps using Pandas Bokeh, which are described below.

Categories

This is the simplest way. Just provide the category keyword for the selection of the property column:

  • category: Specifies the column of the GeoDataFrame that should be used to draw a choropleth map
  • show_colorbar: Whether or not to show a colorbar for categorical plots. Default: True

Let us now draw the regions as a choropleth plot using the category keyword (at the moment, only numerical columns are supported for choropleth plots):

df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    category="REGION",
    show_colorbar=False,
    colormap=["blue", "yellow", "green", "red"],
    hovertool_columns=["STATE_NAME", "REGION"],
    tile_provider="STAMEN_TERRAIN_RETINA")

US_States_2
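Since only numerical columns are supported for the category keyword at the moment, a text column has to be encoded as numbers before it can be used. One possible approach, sketched here on a plain Pandas DataFrame, is pandas.factorize. The region names and the columns REGION_NAME and REGION_CODE are made up for illustration; the same assignment should work on a GeoDataFrame column.

```python
import pandas as pd

# Hypothetical text column that cannot be passed to category directly.
df_regions = pd.DataFrame({
    "STATE_NAME": ["Hawaii", "Washington", "Maine"],
    "REGION_NAME": ["West", "West", "Northeast"],
})

# factorize assigns one integer code per distinct label, in order of appearance.
codes, labels = pd.factorize(df_regions["REGION_NAME"])
df_regions["REGION_CODE"] = codes

# Keep the lookup table to translate codes back to names, e.g. for a legend.
code_to_name = dict(enumerate(labels))
```

The numeric REGION_CODE column could then be passed as category, with REGION_NAME kept in hovertool_columns so that the readable label is still shown.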

Dropdown

By passing a list of column names of the GeoDataFrame as the dropdown keyword argument, a dropdown menu is shown above the map. The user can select the choropleth layer from this menu.

df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    dropdown=["POPESTIMATE2010", "POPESTIMATE2017"],
    colormap="Viridis",
    hovertool_columns=["STATE_NAME", "POPESTIMATE2010", "POPESTIMATE2017"],
    tile_provider_url=r"http://c.tile.stamen.com/watercolor/{Z}/{X}/{Y}.jpg"
    )

US_States_3

When hovering over the states, the state-name and the population of 2010 and 2017 are shown as specified in the hovertool_columns argument.

Sliders

Another option for interactive choropleth maps is the slider implementation of Pandas Bokeh. The available keyword arguments are:

  • slider: By passing a list of column names of the GeoDataFrame, a slider is shown that the user can use to select the choropleth layer.
  • slider_range: Pass a range (or numpy.arange) object to relate the slider values to the slider columns. With range(0,10), the slider has the values [0, 1, 2, ..., 9]; with numpy.arange(3,5,0.5), it has the values [3, 3.5, 4, 4.5]. Default: range(0, len(slider))
  • slider_name: Specifies the title of the slider. Default is an empty string.
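The relation between slider_range and the slider columns can be pictured as a simple pairing of slider values and column names. The sketch below uses plain Python and NumPy only. It assumes one slider value per column, which the default range(0, len(slider)) suggests; the variable names are chosen for illustration.

```python
import numpy as np

# Column names as they could be passed to the slider keyword.
slider_columns = ["Delta_Population_201%d" % i for i in range(8)]

# Default described above: one integer position per column.
default_values = list(range(0, len(slider_columns)))

# Custom range: each year is paired with the column at the same position.
years = range(2010, 2018)
value_to_column = dict(zip(years, slider_columns))

# Non-integer steps are possible with numpy.arange.
half_steps = np.arange(3, 5, 0.5).tolist()

print(value_to_column[2013])
print(half_steps)
```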

The slider can be used to display the change in population relative to the year 2010:

#Calculate change of population relative to 2010:
for i in range(8):
    df_states["Delta_Population_201%d"%i] = ((df_states["POPESTIMATE201%d"%i] / df_states["POPESTIMATE2010"]) -1 ) * 100

#Specify slider columns:
slider_columns = ["Delta_Population_201%d"%i for i in range(8)]

#Specify slider-range (Maps "Delta_Population_2010" -> 2010, 
#                           "Delta_Population_2011" -> 2011, ...):
slider_range = range(2010, 2018)

#Make slider plot:
df_states.plot_bokeh(
    figsize=(900, 600),
    simplify_shapes=5000,
    slider=slider_columns,
    slider_range=slider_range,
    slider_name="Year", 
    colormap="Inferno",
    hovertool_columns=["STATE_NAME"] + slider_columns,
    title="Change of Population [%]")

US_States_4
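The relative change computed in the loop above is (population / population in 2010 - 1) * 100. As a small worked sketch, the same formula can be applied to all columns at once with DataFrame.div. Only two states and two years from the table above are used here, and the plain DataFrame pop stands in for the GeoDataFrame.

```python
import pandas as pd

# Population estimates for two states, copied from the table above.
pop = pd.DataFrame(
    {
        "POPESTIMATE2010": [1363817, 6741386],
        "POPESTIMATE2017": [1427538, 7405743],
    },
    index=["Hawaii", "Washington"],
)

# Divide every column row-wise by the 2010 baseline, then convert to percent.
delta = (pop.div(pop["POPESTIMATE2010"], axis=0) - 1) * 100
delta.columns = [c.replace("POPESTIMATE", "Delta_Population_") for c in pop.columns]

print(delta.round(2))
```

By construction, the 2010 column is zero for every state, so the slider starts from a uniformly colored map.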

Outputs and Layouts

Output options

The pandas.DataFrame.plot_bokeh API has the following additional keyword arguments:

  • show_figure: If True, the resulting figure is shown (either in the notebook or exported and shown as an HTML file, see How To Use). If False, the figure is not displayed and is only returned, as in the scatterplot example. Default: True
  • return_html: If True, the method call returns an HTML string that contains all Bokeh CSS&JS resources and the figure embedded in a div. This HTML representation of the plot can be used for embedding the plot in an HTML document. Default: False

If you have a Bokeh figure or layout, you can also use the pandas_bokeh.embedded_html function to generate an embeddable HTML representation of the plot. This can be included into any valid HTML (note that this is not possible directly with the HTML generated by the pandas_bokeh.output_file output option, because it includes an HTML header). Let us consider the following simple example:

#Import Pandas and Pandas-Bokeh (if you do not specify an output option, the standard is
#output_file):
import pandas as pd
import pandas_bokeh

#Create DataFrame to Plot:
import numpy as np
x = np.arange(-10, 10, 0.1)
sin = np.sin(x)
cos = np.cos(x)
tan = np.tan(x)
df = pd.DataFrame({"x": x, "sin(x)": sin, "cos(x)": cos, "tan(x)": tan})

#Make Bokeh plot from DataFrame using Pandas Bokeh. Do not show the plot, but export
#it to an embeddable HTML string:
html_plot = df.plot_bokeh(
    kind="line",
    x="x",
    y=["sin(x)", "cos(x)", "tan(x)"],
    xticks=range(-20, 20),
    title="Trigonometric functions",
    show_figure=False,
    return_html=True,
    ylim=(-1.5, 1.5))

#Write some HTML and embed the HTML plot below it. For production use, please use
#Templates and the awesome Jinja library.
html = r"""
<script type="text/x-mathjax-config">
  MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

<h1> Trigonometric functions </h1>

<p> The basic trigonometric functions are:</p>

<p>$ sin(x) $</p>
<p>$ cos(x) $</p>
<p>$ tan(x) = \frac{sin(x)}{cos(x)}$</p>

<p>Below is a plot that shows them</p>

""" + html_plot

#Export the HTML string to an external HTML file and show it:
with open("test.html" , "w") as f:
    f.write(html)
    
import webbrowser
webbrowser.open("test.html")

This code will open a web browser and show the following page. As you can see, the interactive Bokeh plot is embedded nicely into the HTML layout. The return_html option is ideal for use in a templating engine like Jinja.

embedded_HTML
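The string concatenation used above can be replaced by a template with a named placeholder. The sketch below uses string.Template from the Python standard library as a stand-in for a full templating engine, and html_plot is a placeholder string instead of a real plot. With Jinja, the same pattern would use a {{ plot }} placeholder, and depending on the autoescape setting the string may have to be marked as safe.

```python
from string import Template

# Placeholder for the string returned by plot_bokeh(..., return_html=True).
html_plot = "<div id='plot'>...plot markup...</div>"

# Page skeleton with two named placeholders.
page = Template("""<html>
<body>
<h1>$title</h1>
<p>Below is a plot that shows them</p>
$plot
</body>
</html>""")

# substitute raises a KeyError if a placeholder is left without a value.
html = page.substitute(title="Trigonometric functions", plot=html_plot)
print(html)
```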

Dashboard Layouts

As shown in the Scatterplot Example, combining plots with plots or other HTML elements is straightforward in Pandas Bokeh due to the layout capabilities of Bokeh. The easiest way to generate a dashboard layout is using the pandas_bokeh.plot_grid method (which is an extension of bokeh.layouts.gridplot):

import pandas as pd
import numpy as np
import pandas_bokeh
pandas_bokeh.output_notebook()

#Barplot:
data = {
    'fruits':
    ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries'],
    '2015': [2, 1, 4, 3, 2, 4],
    '2016': [5, 3, 3, 2, 4, 6],
    '2017': [3, 2, 4, 4, 5, 3]
}
df = pd.DataFrame(data).set_index("fruits")
p_bar = df.plot_bokeh(
    kind="bar",
    ylabel="Price per Unit [€]",
    title="Fruit prices per Year",
    show_figure=False)

#Lineplot:
np.random.seed(42)
df = pd.DataFrame({
    "Google": np.random.randn(1000) + 0.2,
    "Apple": np.random.randn(1000) + 0.17
}, index=pd.date_range('1/1/2000', periods=1000))
df = df.cumsum()
df = df + 50
p_line = df.plot_bokeh(
    kind="line",
    title="Apple vs Google",
    xlabel="Date",
    ylabel="Stock price [$]",
    yticks=[0, 100, 200, 300, 400],
    ylim=(0, 400),
    colormap=["red", "blue"],
    show_figure=False)

#Scatterplot:
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris["data"])
df.columns = iris["feature_names"]
df["species"] = iris["target"]
df["species"] = df["species"].map(dict(zip(range(3), iris["target_names"])))
p_scatter = df.plot_bokeh(
    kind="scatter",
    x="petal length (cm)",
    y="sepal width (cm)",
    category="species",
    title="Iris DataSet Visualization",
    show_figure=False)

#Histogram:
df_hist = pd.DataFrame({
    'a': np.random.randn(1000) + 1,
    'b': np.random.randn(1000),
    'c': np.random.randn(1000) - 1
}, columns=['a', 'b', 'c'])

p_hist = df_hist.plot_bokeh(
    kind="hist",
    bins=np.arange(-6, 6.5, 0.5),
    vertical_xlabel=True,
    normed=100,
    hovertool=False,
    title="Normal distributions",
    show_figure=False)

#Make Dashboard with Grid Layout:
pandas_bokeh.plot_grid([[p_line, p_bar], 
                        [p_scatter, p_hist]], plot_width=450)

Startimage

Using a combination of row and column elements (see Bokeh Layouts) allows for a very easy general arrangement of elements. An alternative layout to the one above is:

from bokeh.layouts import row, column

p_line.plot_width = 900
p_hist.plot_width = 900

layout = column(p_line,
                row(p_scatter, p_bar),
                p_hist)

pandas_bokeh.show(layout)

Alternative Dashboard Layout
