## Practical Data Science: Visualization and Data Exploration

- 1 year ago
- Gritinai

**Basics of visualization**

**Two types of visualization:**

**Data exploration visualization:** figuring out what is true

**Data presentation visualization: **convincing other people it is true

We will mostly be focused on the first

“Data exploration” is much broader than just visualization

**Importance of visualization**

Before you run any analysis, build any machine learning system, etc, always

visualize your data

If you can’t identify a trend or make a prediction for your dataset, it’s unlikely that an automated algorithm will.

This is especially important to keep in mind as you hear stories of

“superhuman” performance of AI methods (it is possible, but takes a long time, and is not the norm)

**Visualization vs. statistics**

Visualization almost always presents a more informative (though less quantitative) view of your data than statistics (the noun, not the field). Despite the temptation to summarize data using “simple” statistics, unless you have a good sense of the full distribution of your data (best achieved via visualization), these statistics can be misleading.

[Source: https://twitter.com/JustinMatejka/status/770682771656368128 Credit: @JustinMatejka, @albertocairo]

This is a mathematical property: 𝑛 data points and 𝑚 equations to satisfy, with 𝑛 >𝑚

**Data types**

Although becoming an expert in visualization techniques is well beyond the scope of this one set of notes on the topic, there are some basic rules of thumb about data, and what types of visualizations are appropriate for these different types of data, that can help avoid some of the more “obviously incorrect” errors you may make when generating visualizations for data exploration purposes. To get at this point, we’re going to review that four “basic types” of data typically presented in Statistics courses.

**Nominal:** categorical data, no ordering

Example — Pet: {dog, cat, rabbit, …}

Operations: =, ≠

**Ordinal: **categorical data, with ordering

Example — Rating: {1,2,3,4,5}

Operations: =, ≠, ≥, ≤, >, <

**Interval: **numerical data, zero doesn’t mean zero “quantity”

Example — Temperature Fahrenheit

Operations: =, ≠, ≥, ≤, >, <, +, −

**Ratio: **numerical data, zero has meaning related to zero “quantity”

Example — Temperature Kelvin

Operations: =, ≠, ≥, ≤, >, <, +, −,÷

At a course level, the first two data types can simply be viewed as “categorial” data (data taking on discrete values), whereas the later are “numerical” (taking real-valued numbers), though with the caveat that some level of discretization is acceptable even in numeric data, as long as the notion of differences are properly preserved. Indeed, most of the later discussion on visualization will fall exactly along the categorial/real differentiation,

**Matplotlib**

We will use the matplotlib library, which integrates well with the Jupyter notebook.

To import Matplotlib plotting into the notebook, the common module you’ll need is the matplotlib.pyplot module, which is common enough that we’ll just import it as plt

import matplotlib.pyplot **as** plt

import numpy **as** np

To display plots in the notebook you’ll want to use one of the following two magic commands, either

**%**matplotlib inline

which will generate static plots inline in the notebook, or

**%**matplotlib notebook

**Visualization Types**

Most discussion of visualization types emphasizes what elements the chart is trying to convey

Instead, we are going to focus on the type and dimensionality of the underlying data

**Visualization types (not an exhaustive list):**

1D: bar chart, pie chart, histogram

2D: scatter plot, line plot, box and whisker plot, heatmap

3D+: scatter matrix, bubble chart

**1D DATA**

One dimensional can either be categorical (nominal or ordinal) or numerical (interval or ratio).

**Bar charts — categorical data**

The following code generates some (fake) one dimensional data and then plots it with a bar chart

import collections

data **=** np.random.permutation(np.array(["dog"]*****10 **+** ["cat"]*****7 **+** ["rabbit"]*****3))

counts **=** collections.Counter(data)

plt.bar(range(len(counts)), list(counts.values()), tick_label**=**list(counts.keys()))

*# DON'T DO THIS*

plt.bar(range(len(counts)), counts.values(), tick_label**=**list(counts.keys()))

plt.plot(range(len(counts)), counts.values(), 'ko-')*# DON'T DO THIS EITHER*

data **=** {"strongly disagree": 5,

"slightly disagree": 3,

"neutral": 8,

"slightly agree": 12,

"strongly agree": 9}

plt.bar(range(len(data)), data.values())

plt.xticks(range(len(data)), data.keys(), rotation**=**45, ha**=**"right")

plt.plot(range(len(data)), data.values(), 'ko-');

**Pie charts — just say no**

Pie charts can also be used to plot 1D categorical data. Pie charts offer remarkably little information, especially if you have more than 2 (or at most 3) different categories. Should you decide to go that route, here is the code that does it

plt.pie(data.values(), labels**=**data.keys(), autopct**=**'%1.1f%%')

plt.axis('equal');

**Histograms — numerical data**

Histograms show frequency counts as well. Histograms are the workhorse of exploratory data analysis.

np.random.seed(0)

data **=** np.concatenate([30 **+** 4*****np.random.randn(10000),

18 **+** 2*****np.random.randn(7000),

12 **+** 3*****np.random.randn(3000)])

plt.hist(data, bins**=**50);

The entire range of the data is divided into bins number of equal-sized bins, spanning the entire range of the data.

It also is not incorrect to create a line plot showing the overall shape of the distribution.

y,x,_ **=** plt.hist(data, bins**=**50);

plt.plot((x[1:]**+**x[:**-**1])**/**2,y,'k-')

**2D Data**

Two dimensional data could either be numerical, categorical, or of mixed types.

**Scatter plots — numeric x numeric**

If both dimensions of the data are numeric, the most natural first type of plot to consider is the scatter plot.

x **=** np.random.randn(1000)

y **=** 0.4*****x******2 **+** x **+** 0.7*****np.random.randn(1000)

plt.scatter(x,y,s**=**10)

In this case of excess data, we can also create a *2D histogram* of the data (which bins the data along both dimensions), and indicate the “height” of each block via a color map.

plt.hist2d(x,y,bins**=**100);

plt.colorbar();

plt.set_cmap('hot')

**Choose Colormaps Carefully**

Several factors to consider. For example:

• Accessibility

• Printing in grayscale

• Unintentional boundaries

• Intentional boundaries

• Color semantics

**Accessibility**

Image: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0199239

**Printing in grayscale**

Image: https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/

**Unintentional boundaries**

Image: https://eagereyes.org/basics/rainbow-color-map

**Intentional boundaries**

Image: https://weather.com/maps/currentusweather

**Color semantics**

Image: https://covid.cdc.gov/covid-data-tracker/#county-view

**Line plots — numeric x numeric (sequential)**

The following examples illustrates how to use the plt.plot for a simple line plot.

x **=** np.linspace(0,10,1000)

y **=** np.cumsum(0.01*****np.random.randn(1000))

plt.plot(x,y)

**Box and whiskers and violin plots — categorical x numeric**

Let’s consider a simple example, where here (just with fictitious data), we’re plotting pet type versus weight.

data**=** {"dog": 30 **+** 4*****np.random.randn(1000),

"cat": 16 **+** 10*****np.random.rand(700),

"rabbit": 12 **+** 3*****np.random.randn(300)}

plt.scatter(np.concatenate([i*****np.ones(len(x)) **for** i,x **in** enumerate(data.values())]),

np.concatenate(list(data.values())))

plt.xticks(range(len(data)), data.keys());

Very little can be determined by looking at just this plot, as there is not enough information in the dense line of points to really understand the distribution of the numeric variable for each point.

**Box and whiskers**

It plots the median of the data (as the line in the middle of the box), the 25th and 75th percentiles of the data (as the bottom and top of the box), the “whiskers” are set by a number of different possible conventions (by default Matplotlib uses 1.5 times the interquartile range, the distance between the 25th and 75th percentile), and any points outside this range (“outliers”) plotted individually.

plt.boxplot(data.values())

plt.xticks(range(1,len(data)**+**1), data.keys());

The box and whisker statistics don’t fully capture the distribution of the data.

**Violin plot**

It creates mini-histograms (symmetrized, largely for aesthetic purposes) in the vertical direction for each category. The advantage of these plots is that they carry a great deal of information about the actual distributions over each categorical variable, so are typically going to give more information especially when there is sufficient data to build this histogram.

plt.violinplot(data.values())

plt.xticks(range(1,len(data)**+**1), data.keys());

**Heat map and bubble plots — categorical x categorical**

Considering a fictitious data set of pet-type vs. house type:

types **=** np.array([('dog', 'house'), ('dog', 'appt'),

('cat', 'house'), ('cat', 'appt'),

('rabbit', 'house'), ('rabbit', 'appt')])

data **=** types[np.random.choice(range(6), 2000, p**=**[0.4, 0.1, 0.12, 0.18, 0.05, 0.15]),:]label_x, x **=** np.unique(data[:,0], return_inverse**=**True)

label_y, y **=** np.unique(data[:,1], return_inverse**=**True)

M, xt, yt, _ **=** plt.hist2d(x,y, bins**=**(len(label_x), len(label_y)))

plt.xticks((xt[:**-**1]**+**xt[1:])**/**2, label_x)

plt.yticks((yt[:**-**1]**+**yt[1:])**/**2, label_y)

plt.colorbar()

the range of colors is admittedly not very informative in some settings, and so a scatter plot with sizes associated with each data type may be more appropriate (this is also called a bubble plot). This can be easily constructed from the results of our previous calls.

xy, cnts **=** np.unique((x,y), axis**=**1, return_counts**=**True)

plt.scatter(xy[0], xy[1], s**=**cnts*****5)

plt.xticks(range(len(label_x)), label_x)

plt.yticks(range(len(label_y)), label_y)

They can be quick and easy visualizations of the data.

**3D+ data**

Going beyond two dimensions, effective visualization becomes much more difficult.

Much like the pie charts, avoid 3D scatter plots whenever possible. The reason for this is that they don’t work well as 2D charts: out of necessity we loose information about the third dimensions, because we are only looking at a single projection of the data onto the 2D screen. For examples, consider the following data:

x **=** np.random.randn(1000)

y **=** 0.4*****x******2 **+** x **+** 0.7*****np.random.randn(1000)

z **=** 0.5 **+** 0.2*****(y**-**1)******2 **+** 0.1*****np.random.randn(1000)from mpl_toolkits.mplot3d import Axes3D

fig **=** plt.figure()

ax **=** fig.add_subplot(111, projection**=**'3d')

ax.scatter(x,y,z)

It is virtually impossible to understand the data. The *one* exception to the rule against 3d scatterplots is explicitly if you want to *interact* with a 3d plot: by rotating the plot you can form a reasonable internal model of what the data looks like. This can be accomplished with the previously mentioned %matplotlib notebook call.

**Scatter matrices**

This plot shows all *pairwise* visualizations across all dimensions of the data set.

import pandas **as** pd

df **=** pd.DataFrame([x,y,z]).transpose()

pd.plotting.scatter_matrix(df, hist_kwds**=**{'bins':50});

Do *not* try to use these for data presentation. It will take a great deal of time staring at your problem before you really understand the nature of the data as presented in the scatter matrix, and its sole use is in trying to see patterns when you are willing to invest substantial cognitive load.

**Bubble plots**

plt.scatter(x,y,s**=**z*****20)

**Colored scatter plots**

One setting where using color to denote a third dimension *does* work well is when that third dimension is a categorical variable.

np.random.seed(0)

xy1 **=** np.random.randn(1000,2) **@** np.random.randn(2,2) **+** np.random.randn(2)

xy2 **=** np.random.randn(1000,2) **@** np.random.randn(2,2) **+** np.random.randn(2)

plt.scatter(xy1[:,0], xy1[:,1])

plt.scatter(xy2[:,0], xy2[:,1])

Please like and leave a comment, let us know how to improve.

Thank you!!