Graphics
- Contingency tables
- Scatter plot
- Bar chart
- Biplot
- Box plot
- Control chart
- Correlogram
- Fan chart
- Forest plot
- Histogram
- Pie chart
- Q-Q plot
- Run chart
- Scatter plot
- Stem-and-leaf display
- Radar chart
Non-hierarchical
- Histogram
- Pie
- Stack
- Chord
- Force
Hierarchical
- Tree
- Cluster
- Tree map
- Partition
- Pack
T-Distribution - Visualize high density
Check relationships between the target variable and the numeric features with a seaborn heatmap of their correlations.
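A minimal sketch of such a heatmap, assuming a small synthetic DataFrame. The column names feature_a, feature_b and target are hypothetical and stand in for the real dataset, which these notes do not show.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two numeric features and a target that depends on feature_a only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["target"] = 2.0 * df["feature_a"] + rng.normal(scale=0.5, size=200)

# Pairwise Pearson correlations between all columns (all are numeric here).
corr = df.corr()

# Annotated heatmap; a diverging palette centred on 0 separates
# positive from negative correlations.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
                 vmin=-1, vmax=1, center=0)
ax.set_title("Correlation matrix")
plt.tight_layout()
```

Reading along the target row (or column) shows which features move with the target. Note that Pearson correlation only captures linear relationships.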
Basic Summary
The univariate distribution of each numerical column can be plotted as a histogram with the estimated PDF overlaid. We use displot from the seaborn library to plot this graph.
Visualizations
Graphical Integrity
https://www.amazon.com/Visual-Display-Quantitative-Information/dp/1930824130
There are six principles to ensure graphical integrity:
- Make the representation of numbers proportional to quantities
- Use clear, detailed, and thorough labeling
- Show data variation, not design variation
- Use standardized units, not nominal values
- Depict "n" data dimensions with less than or equal to "n" variable dimensions
- Quote data in full context
Misc Visualization Notes
https://realpython.com/python-data-visualization-bokeh
Histograms and Density Plots
Histograms work very well for displaying a single variable from one category (in this case the one category was all the flights). However, they do not work well for displaying multiple categories, because overlapping bars obscure one another.
Solution 1: Side-by-Side Histograms
Solution 2: Stacked Histograms
Solution 3: Density Plots
Density with Rug Plot
Create a Pairplot Using a Scatter, Histogram, Density Plots
Default Pair Plot with All Data
Let's use the entire dataset and sns.pairplot to create a simple, yet useful plot.
Group and Color by a Variable
To better understand the data, we can color the pairplot by a categorical variable using the hue keyword. First, we will color the plots by continent.
Customizing pairplot
First, let's change the diagonal from a histogram to a kde, which can better show the differences between continents. We can also lower the alpha (opacity) of the scatter plots so that overlapping points stay visible, and change the size of the markers. Finally, we increase the size of all the plots to better show the data. Note that recent seaborn versions use height rather than size for the plot size.
sns.pairplot(df, hue = 'continent', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);
The density plots on the diagonal are better for comparing data across multiple categories. We can color the plot by any variable we like. For example, here is a plot colored by a decade categorical variable we create from the year column.
df['decade'] = pd.cut(df['year'], bins = range(1950, 2010, 10))
sns.pairplot(df, hue = 'decade', diag_kind = 'kde', vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, height = 4);
BoxPlot
A box plot of each column listed in col_names allows us to visually analyze the outliers in the dataset. The key terminology to note here is as follows:
- The range of the data provides us with a measure of spread and is equal to the difference between the largest data point (max) and the smallest one (min).
- The interquartile range (IQR) is the range covered by the middle 50% of the data.
- IQR = Q3 - Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data. The third quartile (Q3) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
- The IQR can be used to detect outliers using the 1.5(IQR) criterion. Outliers are observations that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
Based on the above definition of how we identify outliers, the black dots are outliers in the strength factor attribute and the red box is the IQR.
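The 1.5(IQR) criterion can be sketched as below, together with one box plot per column in col_names. The data and the helper iqr_outliers are hypothetical. Quartiles are computed with the default linear interpolation in pandas, which can differ slightly from the "median of each half" definition on small samples.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: "strength" contains one extreme value, "age" contains none.
df = pd.DataFrame({
    "strength": [10, 12, 13, 14, 15, 16, 17, 18, 20, 60],
    "age": [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
})
col_names = ["strength", "age"]


def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values of s outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s[(s < lower) | (s > upper)]


# One box plot per column; points beyond the whiskers are drawn individually.
fig, axes = plt.subplots(1, len(col_names), figsize=(4 * len(col_names), 4))
for ax, col in zip(axes, col_names):
    sns.boxplot(y=df[col], color="red", ax=ax)
    ax.set_title(col)
fig.tight_layout()

for col in col_names:
    print(col, "outliers:", iqr_outliers(df[col]).tolist())
```

For the strength column, Q1 = 13.25 and Q3 = 17.75, so IQR = 4.5 and the fences are 6.5 and 24.5. Only the value 60 falls outside them.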