Check relationships between the target variable and the numeric features with a seaborn heatmap plot.
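A minimal sketch of such a heatmap, assuming the data sits in a pandas DataFrame. The column names (feature_a, feature_b, target) and the synthetic values are made up for illustration and do not come from any dataset in these notes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two numeric features and a target driven by one of them.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = 2.0 * df["feature_a"] + rng.normal(scale=0.5, size=200)

# Pairwise Pearson correlations between all columns (all are numeric here).
corr = df.corr()

# Annotated heatmap; a diverging palette centred on 0 makes the sign easy to read.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1, center=0)
ax.set_title("Correlation between target and numeric features")
```

Reading along the target row (or column) shows which features move with the target. A correlation heatmap only captures linear relationships, so a near-zero cell does not rule out a nonlinear one.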

Graphics

Contingency tables
Scatter plot
Bar chart
Biplot
Box plot
Control chart
Correlogram
Fan chart
Forest plot
Histogram
Pie chart
Q–Q plot
Run chart
Stem-and-leaf display
Radar chart

Non-hierarchical

Histogram
Pie
Stack
Chord
Force

Hierarchical

Tree
Cluster
Tree map
Partition
Pack

T-Distribution - Visualize high density

Basic Summary

Below is the code to plot the univariate distribution of the numerical columns which contains the histograms and the estimated PDF. We use displot of the seaborn library to plot this graph:

Visualizations

Graphical Integrity

https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130

There are six principles to ensure graphical integrity:

Make the representation of numbers proportional to quantities
Use clear, detailed, and thorough labeling
Show data variation, not design variation
Use standardized units, not nominal values
Depict ’n’ data dimensions with less than or equal to ’n’ variable dimensions
Quote data in full context

Misc Visualization Notes

https://realpython.com/python-data-visualization-bokeh

Histograms and Density Plots

Histograms work very well for displaying a single variable from one category (in this case, the one category was all the flights). For displaying multiple categories, however, a single histogram does not work well because the overlapping plots obscure each other.

Solution 1: Side-by-Side Histograms
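A minimal sketch of a side-by-side histogram with matplotlib. The airline names and delay values are synthetic placeholders standing in for the flight data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical arrival delays (minutes) for three made-up airlines.
rng = np.random.default_rng(0)
delays = {
    "Airline A": rng.normal(0, 15, size=500),
    "Airline B": rng.normal(10, 20, size=500),
    "Airline C": rng.normal(-5, 10, size=500),
}

fig, ax = plt.subplots()
# Passing a list of arrays places one bar per category next to each other in every bin.
counts, bin_edges, _ = ax.hist(list(delays.values()), bins=15,
                               label=list(delays.keys()))
ax.set_xlabel("Delay (min)")
ax.set_ylabel("Flights")
ax.legend()
```

All categories share the same bin edges, so bars within a bin are directly comparable. The plot gets crowded quickly as the number of categories grows.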

Solution 2: Stacked Histograms
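A minimal sketch of a stacked histogram, again with synthetic placeholder data (the airline names and delays are made up). The only change from a side-by-side histogram is stacked=True.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical arrival delays (minutes) for three made-up airlines.
rng = np.random.default_rng(0)
delays = {
    "Airline A": rng.normal(0, 15, size=500),
    "Airline B": rng.normal(10, 20, size=500),
    "Airline C": rng.normal(-5, 10, size=500),
}

fig, ax = plt.subplots()
# stacked=True draws each category on top of the previous one within each bin,
# so the total bar height is the count over all categories.
counts, bin_edges, _ = ax.hist(list(delays.values()), bins=15, stacked=True,
                               label=list(delays.keys()))
ax.set_xlabel("Delay (min)")
ax.set_ylabel("Flights")
ax.legend()
```

The stacked form shows the overall distribution well, but only the bottom category has a flat baseline, so comparing the upper categories with each other is harder.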

Solution 3: Density Plots

Density with Rug Plot

Create a Pairplot Using a Scatter, Histogram, Density Plots

Default Pair Plot with All Data

Let's use the entire dataset and sns.pairplot to create a simple yet useful plot.

Group and Color by a Variable

To better understand the data, we can color the pairplot by a categorical variable using the hue keyword. First, we will color the plots by continent.

Customizing pairplot

First, let's change the diagonal from a histogram to a KDE, which better shows the differences between continents. We can also adjust the alpha (opacity) of the scatter plots to better show all the data and change the size of the markers on the scatter plot. Finally, I increase the size of all the plots with the height keyword (named size in seaborn versions before 0.9) to better show the data.

sns.pairplot(df, hue = 'continent', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);

The density plots on the diagonal make comparisons easier when the data falls into multiple categories. We can color the plot by any categorical variable we like. For example, here is a plot colored by a decade variable that we create from the year column.

df['decade'] = pd.cut(df['year'], bins = range(1950, 2010, 10))

sns.pairplot(df, hue = 'decade', diag_kind = 'kde', vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);

BoxPlot

Below is the code to draw a box plot for each column named in the list col_names. The box plot lets us visually identify outliers in the dataset. The key terms to note are as follows:

The range of the data is a measure of spread, equal to the difference between the largest data point (max) and the smallest one (min).
The interquartile range (IQR) is the range covered by the middle 50% of the data.
IQR = Q3 - Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data. The third quartile (Q3) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
The IQR can be used to detect outliers using the 1.5(IQR) criterion: outliers are observations that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). Based on this definition, the black dots in the plot are outliers in the strength factor attribute, and the red box spans the IQR.

Datascience

Charles Karpati | Data Visualization