Graphics

Non-hierarchical

Hierarchical

T-Distribution - Visualize high density

Check the relationships between the target variable and the numeric features with a seaborn heatmap plot.
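A minimal sketch of such a heatmap, assuming a pandas DataFrame with numeric features and a numeric target. The column names (feature_a, feature_b, target) and the synthetic data are hypothetical stand-ins, not taken from these notes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, hypothetical data: the target depends strongly on feature_a only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = 2.0 * df["feature_a"] + rng.normal(scale=0.5, size=200)

# Pairwise Pearson correlations between the numeric columns.
corr = df.select_dtypes("number").corr()

# Draw the correlation matrix as an annotated heatmap on a fixed [-1, 1] scale.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)

# Rank the features by their correlation with the target.
target_corr = corr["target"].drop("target").sort_values(ascending=False)
```

Reading the target row (or the sorted target_corr series) gives a quick ranking of which numeric features move with the target.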

Basic Summary

The univariate distribution of each numerical column can be plotted as a histogram together with the estimated PDF. We use displot from the seaborn library to plot this graph.

Visualizations

Graphical Integrity

https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130

There are six principles to ensure graphical integrity:

Misc Visualization Notes

https://realpython.com/python-data-visualization-bokeh

Histograms and Density Plots

Histograms work very well for displaying a single variable from one category (in this case the one category was all the flights). However, for displaying multiple categories, a histogram does not work well because the overlapping bars obscure one another.

Solution 1: Side-by-Side Histograms
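A rough sketch of side-by-side histograms with matplotlib: passing a list of arrays to hist draws one bar per category within each bin. The airline names and delay values below are synthetic and purely illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical arrival delays (minutes) for two made-up airlines.
rng = np.random.default_rng(2)
delays = {
    "Airline A": rng.normal(5, 20, 500),
    "Airline B": rng.normal(15, 25, 400),
}

# Shared bin edges spanning all values so the categories are comparable.
all_values = np.concatenate(list(delays.values()))
bins = np.linspace(all_values.min(), all_values.max(), 21)

# A list of arrays produces grouped (side-by-side) bars in every bin.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(list(delays.values()), bins=bins, label=list(delays.keys()))
ax.set_xlabel("Delay (min)")
ax.set_ylabel("Flights")
ax.legend()
```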

Solution 2: Stacked Histograms
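A stacked version can be sketched the same way by passing stacked=True to matplotlib's hist, so each bin shows the categories piled on top of one another. As before, the airline names and delays are synthetic placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical arrival delays (minutes) for two made-up airlines.
rng = np.random.default_rng(2)
delays = {
    "Airline A": rng.normal(5, 20, 500),
    "Airline B": rng.normal(15, 25, 400),
}

# Shared bin edges spanning all values.
all_values = np.concatenate(list(delays.values()))
bins = np.linspace(all_values.min(), all_values.max(), 21)

# stacked=True piles each category's bar on top of the previous one.
fig, ax = plt.subplots()
ax.hist(list(delays.values()), bins=bins, stacked=True, label=list(delays.keys()))
ax.set_xlabel("Delay (min)")
ax.set_ylabel("Flights")
ax.legend()

# Total height of each stack, computed independently for reference.
stack_totals, _ = np.histogram(all_values, bins=bins)
```

The total height of each stack is easy to read, but comparing the upper category across bins is harder because its baseline shifts.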

Solution 3: Density Plots

Density with Rug Plot

Create a Pairplot Using a Scatter, Histogram, Density Plots

Default Pair Plot with All Data

Let's use the entire dataset and sns.pairplot to create a simple yet useful plot.

Group and Color by a Variable

In order to better understand the data, we can color the pairplot using a categorical variable and the hue keyword. First, we will color the plots by the continent.

Customizing pairplot

First, let's change the diagonal from a histogram to a KDE, which can better show the differences between continents. We can also adjust the alpha (opacity) of the scatter plots to better show all the data and change the size of the markers on the scatter plot. Finally, we increase the size of all the plots to better show the data.

# seaborn 0.9+ uses height (formerly size) to set the size of each panel
sns.pairplot(df, hue='continent', diag_kind='kde', plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height=4);

The density plots on the diagonal make comparisons easier when the data fall into multiple categories. We can color the plot by any variable we like. For example, here is a plot colored by a decade categorical variable that we create from the year column.

df['decade'] = pd.cut(df['year'], bins=range(1950, 2010, 10))

sns.pairplot(df, hue='decade', diag_kind='kde', vars=['life_exp', 'log_pop', 'log_gdp_per_cap'], plot_kws={'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height=4);
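To see how pd.cut turns years into decade bins, here is a small standalone sketch. The sample years are hypothetical. Note that the bins are right-closed intervals, and any value beyond the last edge (2000 for range(1950, 2010, 10)) becomes NaN, so it is worth checking the edges against the years actually present in the data.

```python
import pandas as pd

# Hypothetical sample of years.
years = pd.Series([1952, 1967, 1982, 1997, 2007], name="year")

# Edges 1950, 1960, ..., 2000 give the intervals (1950, 1960] ... (1990, 2000].
decade = pd.cut(years, bins=range(1950, 2010, 10))

# 2007 lies past the last edge, so it has no bin and is marked missing.
missing = decade.isna().tolist()

# Extending the range by one step adds a (2000, 2010] bin that covers 2007.
decade_wide = pd.cut(years, bins=range(1950, 2020, 10))
```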

BoxPlot

A box plot can be drawn for every column named in the list col_names. The box plot allows us to visually analyze the outliers in the dataset. The key terminology to note here is as follows:
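A minimal sketch of such a plot, assuming col_names is a list of numeric column names in a DataFrame. The data are synthetic and include one deliberately extreme value. The sketch also flags outliers with the conventional 1.5 × IQR rule that box-plot whiskers commonly use; that rule and the terms in the comments (quartiles, IQR, whiskers) are standard box-plot vocabulary assumed here, not taken from the original list.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data; the last height value (300.0) is an injected outlier.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "height": np.append(rng.normal(170, 10, 199), 300.0),
    "weight": rng.normal(70, 12, 200),
})
col_names = ["height", "weight"]

fig, axes = plt.subplots(1, len(col_names), figsize=(8, 4))
outliers = {}
for ax, col in zip(axes, col_names):
    # Box: first quartile to third quartile, with a line at the median.
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)

    # Interquartile range (IQR) and the 1.5 * IQR fences for flagging outliers.
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    outliers[col] = df.loc[mask, col]
```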

Word Cloud