- Software engineer and data scientist at Bloomberg in New York City
- Working in information security
- Author of Hands-On Data Analysis with Pandas (currently in its second edition; translated into Korean)
- BS in operations research from Columbia University
- MS in computer science (ML specialization) from Georgia Tech

- To follow along with this workshop, you will need to either configure a local environment or use a cloud solution like GitHub Codespaces. Consult the README for step-by-step setup instructions.
- In addition, you should have basic knowledge of Python and be comfortable working in Jupyter Notebooks; if not, check out the resources here to get up to speed.

- Getting Started With Matplotlib
- Moving Beyond Static Visualizations
- Building Interactive Visualizations for Data Exploration

We will begin by familiarizing ourselves with Matplotlib. Moving beyond the default options, we will explore how to customize various aspects of our visualizations. By the end of this section, you will be able to generate plots using the Matplotlib API directly, as well as customize the plots that libraries like pandas and Seaborn create for you.

- **Why start with Matplotlib?**
- Matplotlib basics
- Plotting with Matplotlib

There are many libraries for creating data visualizations in Python (even more if you include those that build on top of them). In this section, we will learn about Matplotlib's role in the Python data visualization ecosystem before diving into the library itself.

The `stackoverflow.zip` dataset contains the title and tags for all Stack Overflow questions tagged with a select few Python libraries, from Stack Overflow's inception (Sept. 2008) through Sept. 12, 2021. The data comes from the Stack Overflow API; more information can be found in this notebook. Here, we are aggregating the data monthly to get the total number of questions per library per month:

In [1]:

```
import pandas as pd

stackoverflow_monthly = pd.read_csv(
    '../data/stackoverflow.zip', parse_dates=True, index_col='creation_date'
).loc[:'2021-08', 'pandas':'bokeh'].resample('1M').sum()
stackoverflow_monthly.sample(5, random_state=1)
```

Out[1]:

creation_date | pandas | matplotlib | numpy | seaborn | geopandas | geoviews | altair | yellowbrick | vega | holoviews | hvplot | bokeh |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2018-06-30 | 2690 | 612 | 931 | 75 | 12 | 0 | 9 | 0 | 10 | 9 | 0 | 82 |
2014-12-31 | 417 | 280 | 420 | 17 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 20 |
2012-12-31 | 124 | 159 | 209 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2011-04-30 | 2 | 58 | 101 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2011-08-31 | 0 | 74 | 124 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

*Source: Stack Exchange Network*

Pandas provides the `plot()` method to generate visualizations. We will start by configuring our Matplotlib plotting backend to generate SVG output (first argument) with custom metadata (second argument):

In [2]:

```
import matplotlib_inline
from utils import mpl_svg_config

matplotlib_inline.backend_inline.set_matplotlib_formats(
    'svg',  # output images using SVG format
    **mpl_svg_config('section-1')  # optional: configure metadata
)
```

The second argument sets a `hashsalt` along with some metadata, which Matplotlib will use when generating any SVG output (see the `utils.py` file for more details). Without this argument, different runs of the same plotting code will generate plots that are visually identical, but differ at the HTML level due to different IDs, metadata, etc.
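To make the role of the hash salt more concrete, the sketch below sets Matplotlib's `svg.hashsalt` rcParam directly and renders a small figure to in-memory SVG. The salt value `'example-salt'` and the helper name `render_svg` are made up for this illustration, and the workshop's `mpl_svg_config()` helper may well configure things differently.

```python
import io

import matplotlib.pyplot as plt

# assumption: any fixed string can serve as the salt; this value is arbitrary
plt.rcParams['svg.hashsalt'] = 'example-salt'


def render_svg():
    """Render a small line plot to SVG bytes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(3, 2))
    ax.plot([0, 1, 2], [0, 1, 4])
    buffer = io.BytesIO()
    # Date=None leaves out the creation timestamp, which would otherwise
    # change from one run to the next
    fig.savefig(buffer, format='svg', metadata={'Date': None})
    plt.close(fig)
    return buffer.getvalue()


svg_bytes = render_svg()
```

With a fixed salt, the element IDs in the SVG are derived from that salt rather than generated randomly, which is what keeps repeated runs from differing at the markup level.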

Next, we plot monthly Matplotlib questions over time by calling the `plot()` method:

In [3]:

```
stackoverflow_monthly.matplotlib.plot(
    figsize=(8, 2), xlabel='creation date', ylabel='total questions',
    title='Matplotlib Questions per Month\n(since the creation of Stack Overflow)'
)
```

Out[3]:

<Axes: title={'center': 'Matplotlib Questions per Month\n(since the creation of Stack Overflow)'}, xlabel='creation date', ylabel='total questions'>

Notice that this returns a Matplotlib `Axes` object since pandas is using Matplotlib as a plotting backend. This means that pandas takes care of a lot of the legwork for us – some examples include the following:

- Creating the figure: source code
- Calling the `Axes.plot()` method: source code
- Adding titles/labels: source code

Working with Matplotlib directly, we can plot other data structures (such as NumPy arrays) without the overhead of converting to a pandas data structure just to plot.
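As a small illustration of this point, the following sketch plots two NumPy arrays with the Matplotlib API and never creates a pandas object. The sine-wave data is made up for the example and is unrelated to the Stack Overflow dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# made-up data for illustration: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(8, 2))
lines = ax.plot(x, y)  # the NumPy arrays are passed to Matplotlib as-is
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
```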

Even if we use pandas to make the initial plot, we can use Matplotlib commands on the `Axes` object that is returned to tweak other parts of the visualization. This is also the case for any library that uses Matplotlib as its plotting backend – examples of which include the following:

- Cartopy: geospatial data processing to produce map visualizations
- ggplot: Python version of the popular `ggplot2` R package
- HoloViews: interactive visualizations with minimal code
- Seaborn: high-level interface for creating statistical visualizations with Matplotlib
- Yellowbrick: extension of Scikit-Learn for creating visualizations to analyze machine learning performance

*Note: Matplotlib maintains a list of such libraries here. We will cover HoloViews later in this workshop, and examples with Seaborn can be found in this pandas workshop.*

You can also build on top of Matplotlib for personal/work libraries. This might mean defining custom plot themes or functionality to create commonly used visualizations.

One example is the `refline()` method in the Seaborn library. This method makes it possible to draw horizontal/vertical reference lines on all subplots at once. The Matplotlib methods `axhline()` and `axvline()` are the basis of this contribution.
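To show the underlying idea with Matplotlib alone, here is a minimal sketch that draws the same reference line on every subplot by looping over the `Axes` objects. The helper name `add_reflines`, its default styling, and the data are hypothetical; this is not Seaborn's implementation.

```python
import matplotlib.pyplot as plt


def add_reflines(axes, x=None, y=None, **line_kwargs):
    """Draw the same reference line(s) on every Axes object (hypothetical helper)."""
    # assumed defaults for illustration; callers can override them
    line_kwargs.setdefault('color', 'gray')
    line_kwargs.setdefault('linestyle', '--')
    for ax in axes:
        if x is not None:
            ax.axvline(x, **line_kwargs)  # vertical line at x
        if y is not None:
            ax.axhline(y, **line_kwargs)  # horizontal line at y


fig, axes = plt.subplots(1, 2, figsize=(8, 2), sharey=True)
axes[0].plot([0, 1, 2], [1, 3, 2])  # made-up data
axes[1].plot([0, 1, 2], [2, 1, 3])
add_reflines(axes, y=2)
```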

- Why start with Matplotlib?
- **Matplotlib basics**
- Plotting with Matplotlib

In this workshop, we will explore the static and animated visualization functionality to gain a breadth of knowledge of the library. While we won't go too in depth, additional resources will be provided throughout. Now, let's get started with the basics.

The `Figure` object is the container for all components of our visualization. It contains one or more `Axes` objects, which can be thought of as the (sub)plots, as well as other *Artists*, which draw on the plot canvas (x-axis, y-axis, legend, lines, etc.). The following image from the Matplotlib documentation illustrates the different components of a figure:
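The same containment can also be inspected in code. The sketch below, which uses made-up data, creates a figure with two subplots and looks at a few of the objects it holds.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, axes = plt.subplots(1, 2, figsize=(8, 2))
line, = axes[0].plot([0, 1, 2], [0, 1, 4], label='example')  # made-up data
legend = axes[0].legend()

print(fig.axes)  # the Axes objects contained in the Figure
print(axes[0].xaxis, axes[0].yaxis)  # each Axes has its own x-axis and y-axis
print(isinstance(line, Artist), isinstance(legend, Artist))  # lines and legends are Artists
```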

Matplotlib provides two main plotting interfaces:

- **Functional (implicit)**: call *functions* provided by the `pyplot` module
- **Object-oriented (explicit)**: call *methods* on `Figure` and `Axes` *objects*

Regardless of the plotting interface we choose, we must import the `pyplot` module:

In [4]:

```
import matplotlib.pyplot as plt
```

In [5]:

```
# figsize is determined by rcParams for plt.plot()
plt.plot(stackoverflow_monthly.index, stackoverflow_monthly.matplotlib)
_ = plt.xlabel('creation date')
_ = plt.ylabel('total questions')
_ = plt.title('Matplotlib Questions per Month\n(since the creation of Stack Overflow)')
```

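Outside of a Jupyter Notebook (for example, when running a script), the figure is not displayed automatically; we need to call `plt.show()` to do so.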

In [6]:

```
# creates the Figure and adds a single Axes object
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(stackoverflow_monthly.index, stackoverflow_monthly.matplotlib)
ax.set_xlabel('creation date')
ax.set_ylabel('total questions')
ax.set_title('Matplotlib Questions per Month\n(since the creation of Stack Overflow)')
```

Out[6]:

Text(0.5, 1.0, 'Matplotlib Questions per Month\n(since the creation of Stack Overflow)')

In [7]:

```
ax = stackoverflow_monthly.matplotlib.plot(
    figsize=(8, 2), xlabel='creation date', ylabel='total questions',
    title='Matplotlib Questions per Month\n(since the creation of Stack Overflow)'
)
ax.set_ylim(0, None)  # this can also be done with pandas

# hide some of the spines (must be done with Matplotlib)
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
```

With the functional interface, we would replace `ax.set_ylim(0, None)` with `plt.ylim(0, None)`.
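For comparison, a functional-interface version of the same tweaks might look like the sketch below. The data is made up, and `plt.gca()` is used here to reach the current `Axes` object for the spines.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 2))
plt.plot([0, 1, 2, 3], [5, 7, 6, 9])  # made-up data
plt.ylim(0, None)  # acts on the current Axes, like ax.set_ylim(0, None)

# spines belong to the Axes object, so grab the current one
ax = plt.gca()
for spine in ['top', 'right']:
    ax.spines[spine].set_visible(False)
```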

- Why start with Matplotlib?
- Matplotlib basics
- **Plotting with Matplotlib**

Now that we understand a little bit of how Matplotlib works, we will walk through some more involved examples, which include legends, reference lines, and/or annotations, building them up step by step. Note that while using a library like pandas to do the initial plot creation can make things easier, we will focus on using Matplotlib exclusively to get more familiar with it.

Each example in this section will showcase both how to build a specific plot with Matplotlib directly and how to customize it with some of the more advanced plotting techniques available. In particular, we will learn how to build and customize the following plot types:

- line plots
- scatter plots
- area plots
- bar plots
- stacked bar plots
- histograms
- box plots

The Stack Overflow data we have been working with thus far is a time series, so the first set of visualizations will be for studying the evolution of the data over time. However, rather than using a monthly aggregate like before, we will use daily data, so we will read in the data once more and this time aggregate it daily:

In [8]:

```
stackoverflow_daily = pd.read_csv(
    '../data/stackoverflow.zip', parse_dates=True, index_col='creation_date'
).loc[:, 'pandas':'bokeh'].resample('1D').sum()
stackoverflow_daily.tail()
```

Out[8]:

creation_date | pandas | matplotlib | numpy | seaborn | geopandas | geoviews | altair | yellowbrick | vega | holoviews | hvplot | bokeh |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-09-08 | 132 | 33 | 49 | 5 | 2 | 0 | 2 | 1 | 1 | 1 | 0 | 2 |
2021-09-09 | 182 | 33 | 51 | 8 | 1 | 0 | 1 | 0 | 3 | 0 | 0 | 2 |
2021-09-10 | 132 | 19 | 44 | 7 | 4 | 0 | 0 | 0 | 2 | 0 | 0 | 2 |
2021-09-11 | 66 | 19 | 17 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2021-09-12 | 69 | 14 | 24 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |

In [9]:

```
avgs = stackoverflow_daily.rolling('30D').mean()
stds = stackoverflow_daily.rolling('30D').std()
avgs.tail()
```

Out[9]:

creation_date | pandas | matplotlib | numpy | seaborn | geopandas | geoviews | altair | yellowbrick | vega | holoviews | hvplot | bokeh |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2021-09-08 | 136.933333 | 26.966667 | 37.133333 | 5.766667 | 1.833333 | 0.000000 | 0.500000 | 0.133333 | 0.500000 | 0.400000 | 0.033333 | 1.033333 |
2021-09-09 | 138.000000 | 27.033333 | 37.933333 | 5.766667 | 1.833333 | 0.000000 | 0.533333 | 0.133333 | 0.566667 | 0.400000 | 0.000000 | 1.033333 |
2021-09-10 | 137.100000 | 26.733333 | 37.966667 | 5.800000 | 1.833333 | 0.000000 | 0.533333 | 0.133333 | 0.566667 | 0.366667 | 0.000000 | 1.066667 |
2021-09-11 | 133.433333 | 26.400000 | 37.233333 | 5.666667 | 1.833333 | 0.000000 | 0.533333 | 0.133333 | 0.533333 | 0.333333 | 0.000000 | 1.000000 |
2021-09-12 | 130.466667 | 25.933333 | 36.666667 | 5.666667 | 1.733333 | 0.033333 | 0.533333 | 0.133333 | 0.533333 | 0.233333 | 0.000000 | 0.866667 |

Now, we can proceed to building this visualization. We will work through the following steps over the next few slides:

- Create the line plot.
- Add a shaded region for $\pm$2 standard deviations from the mean.
- Set the axis labels, y-axis limits, plot title, and despine the plot.

By default, the `plot()` method draws a line plot:

In [10]:

```
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(avgs.index, avgs.matplotlib)
```

Out[10]:

[<matplotlib.lines.Line2D at 0x1440ffe20>]

Next, we use the `fill_between()` method to shade the region $\pm$2 standard deviations from the mean. Note that we also set `alpha=0.25` to make the region 25% opaque – transparent enough to easily see the line for the rolling 30-day mean:

In [11]:

```
fig, ax = plt.subplots(figsize=(8, 2))
ax.plot(avgs.index, avgs.matplotlib)
ax.fill_between(
    avgs.index, avgs.matplotlib - 2 * stds.matplotlib,
    avgs.matplotlib + 2 * stds.matplotlib, alpha=0.25
)
```

Out[11]:

<matplotlib.collections.PolyCollection at 0x1367c60b0>