These datasets are clearly different:
However, we would not know that if we only looked at the summary statistics:
What we call summary statistics summarize only part of the distribution. We need many moments to describe the shape of a distribution (and distinguish between these datasets):
Adding in histograms for the marginal distributions, we can see the distributions of both x and y are indeed quite different across datasets. Some of these differences are captured in the third moment (skewness) and the fourth moment (kurtosis), which measure the asymmetry and weight in the tails of the distribution, respectively:
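For instance, scipy.stats can compute these higher moments directly. A quick sketch using illustrative data (not the datasets above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=1_000)          # roughly zero skewness
right_skewed = rng.exponential(size=1_000)  # positive skewness

# third moment: asymmetry of the distribution
print(stats.skew(symmetric), stats.skew(right_skewed))

# fourth moment: weight in the tails (excess kurtosis)
print(stats.kurtosis(symmetric), stats.kurtosis(right_skewed))
```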
However, the moments aren't capturing the relationship between x and y. If we suspect a linear relationship, we may use the Pearson correlation coefficient, which is the same for all three datasets below. Here, the visualization tells us much more about the relationships between the variables:
The Pearson correlation coefficient measures linear correlation, so if we don't visualize our data, then we have another problem: a high correlation (close in absolute value to 1) does not mean the relationship is actually linear. Without a visualization to contextualize the summary statistics, we do not have an accurate understanding of the data.
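A quick sketch of this pitfall: a plainly nonlinear relationship can still produce a Pearson coefficient close to 1.

```python
import numpy as np

x = np.linspace(0, 10, 100)
y = x ** 2  # clearly nonlinear

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # high, despite the curvature
```

Reporting only r here would suggest a near-linear relationship; a scatter plot would immediately show otherwise.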
For example, all four datasets in Anscombe's Quartet (constructed in 1973) have strong correlations, but only I and III have linear relationships:
In their 2020 paper, A hypothesis is a liability, researchers Yanai and Lercher argue that simply approaching a dataset with a hypothesis in mind may limit how thoroughly the data is explored.
Let's take a look at their experiment.
Students in a statistical data analysis course were split into two groups. One group was given the open-ended task of exploring the data, while the other group was instructed to test the following hypotheses:
Here's what that dataset looked like:
In 2017, Autodesk researchers created the Datasaurus Dozen, building upon the idea of Anscombe's Quartet to make a more impactful example:
They also employed animation, which makes the example even more striking. Every frame in the transition between the Datasaurus and the circle shares the same summary statistics:
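Concretely, "the same summary statistics" means the x/y means, standard deviations, and correlation agree to two decimal places. A small pandas sketch of that check (`same_stats` is a hypothetical helper, not part of the Autodesk code or Data Morph):

```python
import pandas as pd

def same_stats(df1: pd.DataFrame, df2: pd.DataFrame, decimals: int = 2) -> bool:
    """Check that x/y means, standard deviations, and correlation match."""
    def summary(df):
        return (
            round(df.x.mean(), decimals), round(df.y.mean(), decimals),
            round(df.x.std(), decimals), round(df.y.std(), decimals),
            round(df.x.corr(df.y), decimals),
        )
    return summary(df1) == summary(df2)
```

Run against any two frames of the animation, this would return True, even though the scatter plots look nothing alike.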
But, now we have a new problem...
Since there was no easy way to do this for arbitrary datasets, people assumed that this capability was a property of the Datasaurus and were shocked to see it work with other shapes. The more ways people see this idea, and the more memorable those examples are, the better the concept will stick – repetition is key to learning.
This is why I built Data Morph.
Here's the code to create that example:
$ python -m pip install data-morph-ai
$ data-morph --start-shape Python --target-shape heart
Here's what's going on behind the scenes:
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory

# load a built-in starting dataset
dataset = DataLoader.load_dataset('Python')

# generate a target shape sized to that dataset
target_shape = ShapeFactory(dataset).generate_shape('heart')

# morph the dataset into the target shape, preserving
# summary statistics to two decimal places
morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
Data Morph provides the Dataset class, which wraps the data (stored as a pandas.DataFrame) with information about the bounds for the data, the morphing process, and plotting. This allows for the use of arbitrary datasets by providing a way to calculate target shapes – no more hardcoded values.
To spark creativity, there are built-in datasets to inspire you:
Depending on the target shape, bounds and/or statistics from the dataset are used to generate a custom target shape for the dataset to morph into.
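As an illustration of what "generated from the dataset" could mean, a circle target might be centered on the data's means and sized from its bounds. This is a made-up heuristic to convey the idea, not Data Morph's actual sizing logic:

```python
import pandas as pd

def circle_from_data(df: pd.DataFrame) -> tuple[tuple[float, float], float]:
    """Illustrative only: center a circle on the means, size it from the bounds."""
    center = (df.x.mean(), df.y.mean())
    # use half the smaller data range as the radius (assumed heuristic)
    radius = min(df.x.max() - df.x.min(), df.y.max() - df.y.min()) / 2
    return center, radius
```

Because the target is derived from the dataset itself, any dataset gets a shape that fits its own scale and position.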
The following target shapes are currently available:
Shape class hierarchy
In Data Morph, shapes are structured as a hierarchy of classes, each of which must provide a distance() method. This makes them interchangeable in the morphing logic.
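A minimal sketch of what such a hierarchy could look like – these classes are illustrative, not Data Morph's actual implementation:

```python
from abc import ABC, abstractmethod
import math

class Shape(ABC):
    """Base class: every shape must say how far a point is from it."""

    @abstractmethod
    def distance(self, x: float, y: float) -> float:
        ...

class Circle(Shape):
    def __init__(self, cx: float, cy: float, r: float):
        self.cx, self.cy, self.r = cx, cy, r

    def distance(self, x: float, y: float) -> float:
        # distance from the point to the circle's edge
        return abs(math.hypot(x - self.cx, y - self.cy) - self.r)

class PointCollection(Shape):
    def __init__(self, points: list[tuple[float, float]]):
        self.points = points

    def distance(self, x: float, y: float) -> float:
        # distance to the closest point in the collection
        return min(math.hypot(x - px, y - py) for px, py in self.points)
```

Since the morphing logic only ever calls distance(), any class implementing it can serve as a target.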
A point is selected at random (blue) and moved a small, random amount to a new location (red), preserving summary statistics. This part of the codebase comes from the Autodesk research and is mostly unchanged:
Sometimes, the algorithm will move a point away from the target shape, while still preserving summary statistics. This helps to avoid getting stuck:
The likelihood of doing this decreases over time and is governed by the temperature of the simulated annealing process:
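A sketch of that acceptance rule, assuming moves toward the shape are always kept and moves away are kept with probability equal to the current temperature (the exact rule in the codebase may differ):

```python
import random

def accept_move(old_dist: float, new_dist: float, temperature: float) -> bool:
    """Simulated-annealing-style acceptance (sketch)."""
    if new_dist < old_dist:
        return True  # closer to the target shape: always accept
    # otherwise, accept with a probability that shrinks as the process cools
    return random.random() < temperature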
The maximum amount that a point can move at a given iteration decreases over time for a better visual effect. This makes points move faster when the morphing starts and slow down as we approach the target shape:
Unlike temperature, we don't allow this value to fall to zero, since we don't want to halt movement:
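The two schedules might be sketched like this – linear decay and the specific constants are assumptions, but the key contrast holds: temperature falls all the way to zero, while the movement bound is floored:

```python
def temperature(iteration: int, total: int, t_max: float = 0.4) -> float:
    """Cooling schedule (assumed linear): falls all the way to zero."""
    return t_max * (1 - iteration / total)

def max_shift(iteration: int, total: int,
              start: float = 1.0, floor: float = 0.1) -> float:
    """Shrinks over time but never below a floor, so points keep moving."""
    return max(floor, start * (1 - iteration / total))
```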
How do we encourage points to fill out the target shape and not just clump together?
Currently, we can only morph from dataset to shape (and shape to dataset by playing the animation in reverse). I would like to support dataset to dataset and shape to shape morphing, but there are challenges to both:
Goal | Challenges
---|---
shape→shape | determining the initial sizing, possibly aligning scale across the shapes, and solving the bald spot problem
dataset→dataset | defining a distance metric, determining the scale and position of the target, and solving the bald spot problem
The algorithm from the original research is largely untouched and parts of it could potentially be vectorized to speed up the morphing process.
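As an example of the kind of speedup vectorization can offer, a point-to-shape distance computed in a Python loop can be replaced with a single NumPy reduction. This is a sketch of the idea, not code from the actual algorithm:

```python
import numpy as np

def min_distance_loop(point, shape_points):
    """Pure-Python version: distance to the nearest shape point."""
    return min(
        ((point[0] - px) ** 2 + (point[1] - py) ** 2) ** 0.5
        for px, py in shape_points
    )

def min_distance_vectorized(point, shape_points):
    """Same computation as a single NumPy reduction."""
    diffs = np.asarray(shape_points) - np.asarray(point)
    return np.hypot(diffs[:, 0], diffs[:, 1]).min()
```

Both return the same value, but the NumPy version avoids a Python-level loop per shape point on every iteration.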
Smaller values (left subplot) morph in fewer iterations than larger values (right subplot) since we only move small amounts at a time:
My first step was to use the Autodesk researchers' code to recreate the conversion of the Datasaurus into a circle and figure out how the code worked.
Challenges at this stage:
TIME TAKEN: 4 hours
From there, I tried to get it to work with a panda-shaped dataset, reworked to have similar statistics to the Datasaurus.
Challenges at this stage:
TIME TAKEN: 1.75 days
Once I got the transformation working with the panda (my original goal), I realized this would be a helpful teaching tool and decided to make a package.
Challenges at this stage:
TIME TAKEN: 2 months (v0.1.0)
Here are some cases I bumped into while building Data Morph:
$ python -m pip install data-morph-ai
$ conda install -c conda-forge data-morph-ai
I hope you enjoyed the session. You can follow my work on the following platforms: