These datasets are clearly different:
However, we would not know that if we were to only look at the summary statistics:
What we call summary statistics summarize only part of the distribution. We need many moments to describe the shape of a distribution (and distinguish between these datasets):
Adding in histograms for the marginal distributions, we can see the distributions of both x and y are indeed quite different across datasets. Some of these differences are captured in the third moment (skewness) and the fourth moment (kurtosis), which measure the asymmetry and weight in the tails of the distribution, respectively:
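As a quick illustration of how the higher moments pick up differences that the mean and standard deviation miss, here is a small sketch using scipy.stats on synthetic data (the samples are made up for illustration, not taken from the datasets above):

import numpy as np
from scipy import stats

# two made-up samples with roughly the same mean and standard deviation,
# but very different shapes
rng = np.random.default_rng(seed=42)
symmetric = rng.normal(loc=0, scale=1, size=10_000)
skewed = rng.exponential(scale=1, size=10_000) - 1  # shifted to mean ~0

for name, sample in [('symmetric', symmetric), ('skewed', skewed)]:
    print(
        f'{name}: mean={sample.mean():.2f}, std={sample.std():.2f}, '
        f'skew={stats.skew(sample):.2f}, kurtosis={stats.kurtosis(sample):.2f}'
    )

The first two moments look nearly identical for both samples; only the skewness and kurtosis reveal that the shapes differ.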
However, the moments aren't capturing the relationship between x and y. If we suspect a linear relationship, we may use the Pearson correlation coefficient, which is the same for all three datasets below. Here, the visualization tells us much more about the relationships between the variables:
The Pearson correlation coefficient measures linear correlation, so if we don't visualize our data, then we have another problem: a high correlation (close in absolute value to 1) does not mean the relationship is actually linear. Without a visualization to contextualize the summary statistics, we do not have an accurate understanding of the data.
For example, all four datasets in Anscombe's Quartet (constructed in 1973) have strong correlations, but only I and III have linear relationships:
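Seaborn happens to bundle a copy of Anscombe's Quartet, so we can verify this ourselves (assuming seaborn and pandas are installed):

import seaborn as sns

# a tidy DataFrame with a 'dataset' column (I, II, III, IV) plus x and y
anscombe = sns.load_dataset('anscombe')

# the Pearson correlation coefficient is nearly identical (~0.816) for all
# four datasets, even though only I and III are roughly linear
print(anscombe.groupby('dataset')[['x', 'y']].corr())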
In their 2020 paper, A hypothesis is a liability, researchers Yanai and Lercher argue that simply approaching a dataset with a hypothesis may limit the thoroughness with which the data is explored.
Let's take a look at their experiment.
Students in a statistical data analysis course were split into two groups. One group was given the open-ended task of exploring the data, while the other group was instructed to test the following hypotheses:
Here's what that dataset looked like:
In 2017, Autodesk researchers created the Datasaurus Dozen, building upon the idea of Anscombe's Quartet to make a more impactful example:
They also employed animation, which is even more impactful. Every frame of the transition between the Datasaurus and the circle shares the same summary statistics:
But, now we have a new problem...
Since there was no easy way to do this for arbitrary datasets, people assumed this capability was a property of the Datasaurus itself and were shocked to see it work with other shapes. The more ways people see this idea, and the more memorable those examples are, the better the concept will stick – repetition is key to learning.
This is why I built Data Morph.
It addresses the limitations of previous methods:
Here's the code to create that example:
$ python -m pip install data-morph-ai
$ data-morph --start-shape Python --target-shape heart
Here's what's going on behind the scenes:
from data_morph.data.loader import DataLoader
from data_morph.morpher import DataMorpher
from data_morph.shapes.factory import ShapeFactory

# load the built-in Python logo dataset
dataset = DataLoader.load_dataset('Python')

# size a heart target shape to this dataset
target_shape = ShapeFactory(dataset).generate_shape('heart')

# preserve summary statistics to 2 decimal places while morphing
morpher = DataMorpher(decimals=2, in_notebook=False)
_ = morpher.morph(dataset, target_shape)
Data Morph provides the Dataset class that wraps the data (stored as a pandas.DataFrame) with information about bounds for the data, the morphing process, and plotting. This allows for the use of arbitrary datasets by providing a way to calculate target shapes – no more hardcoded values.
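For instance, a minimal sketch of poking at this wrapper directly (the attribute names below are my assumptions based on the description above; consult the Dataset API docs for the exact names in your installed version):

from data_morph.data.loader import DataLoader

dataset = DataLoader.load_dataset('Python')

# the underlying data is a pandas.DataFrame of x/y coordinates
# (.df and the bounds attribute names are assumptions on my part)
print(dataset.df.head())
print(dataset.data_bounds)   # bounds derived from the data itself
print(dataset.morph_bounds)  # bounds the points may move within
print(dataset.plot_bounds)   # bounds used when plotting frames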
To spark creativity, there are built-in datasets to inspire you:
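If you want to see what ships with the package, something like the following should work (AVAILABLE_DATASETS is an assumption on my part; check the DataLoader documentation for the exact attribute):

from data_morph.data.loader import DataLoader

# list the names of the bundled starter datasets
print(DataLoader.AVAILABLE_DATASETS)

# load one of them to use as a morphing starting point
dataset = DataLoader.load_dataset('panda')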
Depending on the target shape, bounds and/or statistics from the dataset are used to generate a custom target shape for the dataset to morph into.
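As a rough illustration of the idea (this is not Data Morph's actual code), a circle target for a given dataset might be derived from its statistics like this:

import numpy as np
import pandas as pd

def circle_from_data(df: pd.DataFrame) -> tuple[tuple[float, float], float]:
    # illustrative sketch: center the circle on the means and size the
    # radius from the spread of the points
    center = (df['x'].mean(), df['y'].mean())
    radius = float(np.mean([df['x'].std(), df['y'].std()])) * 1.5
    return center, radius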
The following target shapes are currently available:
Shape class hierarchy

In Data Morph, shapes are structured as a hierarchy of classes, which must provide a distance() method. This makes them interchangeable in the morphing logic.
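A minimal sketch of what that contract looks like (illustrative only; the real classes live in data_morph.shapes and do more):

import math

class Shape:
    # illustrative base class: every shape answers "how far is this point
    # from me?", which is all the morphing logic needs
    def distance(self, x: float, y: float) -> float:
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, cx: float, cy: float, r: float) -> None:
        self.cx, self.cy, self.r = cx, cy, r

    def distance(self, x: float, y: float) -> float:
        # distance from the point to the circle's edge
        return abs(math.hypot(x - self.cx, y - self.cy) - self.r)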
A point is selected at random (blue) and moved a small, random amount to a new location (red), preserving summary statistics. This part of the codebase comes from the Autodesk research and is mostly unchanged:
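Conceptually, a single step looks something like the sketch below (names and structure are mine, not the library's); the real algorithm also verifies that the summary statistics still match to the chosen number of decimals before accepting the move:

import numpy as np

def perturb_once(points: np.ndarray, shape, max_shift: float,
                 rng: np.random.Generator) -> np.ndarray:
    candidate = points.copy()
    i = rng.integers(len(points))                       # pick a point at random (blue)
    candidate[i] += rng.uniform(-max_shift, max_shift, size=2)  # proposed location (red)

    # keep the move if it brings the point closer to the target shape;
    # the caller still checks that the summary statistics are preserved
    if shape.distance(*candidate[i]) <= shape.distance(*points[i]):
        return candidate
    return points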
Sometimes, the algorithm will move a point away from the target shape, while still preserving summary statistics. This helps to avoid getting stuck:
The likelihood of doing this decreases over time and is governed by the temperature of the simulated annealing process:
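A hedged sketch of that idea; the exact schedule and acceptance rule in the original code differ, but the shape is the same:

import numpy as np

def temperature(iteration: int, total_iterations: int,
                t_start: float = 0.4, t_end: float = 0.0) -> float:
    # linear cooling schedule: starts at t_start and decays toward t_end
    frac = iteration / total_iterations
    return t_start + (t_end - t_start) * frac

def allow_worse_move(temp: float, rng: np.random.Generator) -> bool:
    # occasionally accept a move away from the target shape; the chance
    # shrinks as the temperature drops
    return rng.random() < temp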
The maximum amount that a point can move at a given iteration decreases over time for a better visual effect. This makes points move faster when the morphing starts and slow down as we approach the target shape:
Unlike temperature, we don't allow this value to fall to zero, since we don't want to halt movement:
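A sketch of such a schedule, with an arbitrary floor value for illustration:

def max_shift(iteration: int, total_iterations: int,
              start: float = 1.0, floor: float = 0.01) -> float:
    # the allowed movement per step shrinks over the run, but is clamped
    # at a floor so points never stop moving entirely
    frac = iteration / total_iterations
    return max(start * (1 - frac), floor)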
How do we encourage points to fill out the target shape and not just clump together?
Currently, we can only morph from dataset to shape (and shape to dataset by playing the animation in reverse). I would like to support dataset to dataset and shape to shape morphing, but there are challenges to both:
| Goal | Challenges |
| --- | --- |
| shape→shape | determining the initial sizing and possibly aligning scale across the shapes, and solving the bald spot problem |
| dataset→dataset | defining a distance metric, determining scale and position of target, and solving the bald spot problem |
The algorithm from the original research is largely untouched and parts of it could potentially be vectorized to speed up the morphing process.
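For example, a hedged sketch of the kind of change that might help: replacing a Python loop over a shape's reference points with a single NumPy expression:

import numpy as np

def min_distance_vectorized(point: np.ndarray, shape_points: np.ndarray) -> float:
    # distance from one point to the nearest of a shape's reference points,
    # computed in one vectorized expression instead of a Python loop
    return float(np.min(np.linalg.norm(shape_points - point, axis=1)))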
Datasets with smaller values (left subplot) morph in fewer iterations than those with larger values (right subplot), since points only move small amounts at a time:
My first step was to use the Autodesk researchers' code to recreate the conversion of the Datasaurus into a circle and figure out how the code worked.
Challenges at this stage:
TIME TAKEN: 4 hours
From there, I tried to get it to work with a panda-shaped dataset, reworked to have similar statistics to the Datasaurus.
Challenges at this stage:
TIME TAKEN: 1.75 days
Once I got the transformation working with the panda (my original goal), I realized this would be a helpful teaching tool and decided to make a package.
Challenges at this stage:
TIME TAKEN: 2 months (v0.1.0)
Here are some cases I bumped into while building Data Morph:
python -m pip install data-morph-ai
conda install -c conda-forge data-morph-ai
I hope you enjoyed the session. You can follow my work on the following platforms: