Visualising Complex Data: Advanced Histogram and Box Plot Techniques with Matplotlib


Introduction

Data visualisation is a crucial part of data analysis, and one that every data professional seeks to master. This is especially true in large organisations, where analysts must work with many stakeholders, such as business strategists and business developers, who may not be as tech-savvy as they are. A Data Science Course in Chennai, or in a similar city, that covers data visualisation attracts professional data analysts in large numbers because expertise in this discipline allows them to present complex datasets and communicate insights effectively. Histograms and box plots are two of the most commonly used visualisation tools for understanding the distribution and variability of data. In this article, we will explore advanced techniques for creating and customising histograms and box plots using Matplotlib, Python’s go-to library for data visualisation.

Why Use Histograms and Box Plots?

Two terms you will encounter frequently in a Data Science Course that covers data visualisation techniques are histograms and box plots. Here is a brief description of each.

Histograms: They provide a visual representation of the distribution of a dataset by dividing the data into bins and counting the number of observations within each bin. This helps in understanding the frequency distribution, skewness, and the presence of outliers.

Box Plots: Box plots, also known as box-and-whisker plots, summarise the distribution of a dataset by displaying the median, quartiles, and potential outliers. They are particularly useful for comparing distributions between different groups or datasets.
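The two descriptions above can be made concrete with a few lines of NumPy. This is only a minimal sketch: the seeded sample, the choice of 10 bins, and the variable names are assumptions for illustration, not part of the original examples.

```python
import numpy as np

# Assumed sample: 1,000 draws from a standard normal, seeded for repeatability
rng = np.random.default_rng(0)
sample = rng.normal(0, 1, 1000)

# Histogram: split the data range into 10 equal-width bins and count observations
counts, edges = np.histogram(sample, bins=10)

# Box plot: the three quartiles that define the box and its median line
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range, i.e. the height of the box

print("bin counts:", counts)
print("Q1, median, Q3:", q1, median, q3)
```

A histogram draws the counts as bars, while a box plot draws the quartiles as a box with a line at the median.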

Setting Up Matplotlib

Before diving into advanced techniques, ensure you have Matplotlib installed, along with NumPy, pandas, and Seaborn, which the examples below also use:

pip install matplotlib numpy pandas seaborn

Then, import the necessary libraries:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Advanced Histogram Techniques

The following histogram techniques are the kind typically covered in a standard Data Science Course.

1. Creating Overlaid Histograms

Overlaid histograms are useful when you want to compare the distribution of multiple datasets on the same plot.

# Sample data
data1 = np.random.normal(0, 1, 1000)
data2 = np.random.normal(1, 1.5, 1000)

# Plotting overlaid histograms
plt.hist(data1, bins=30, alpha=0.5, label='Dataset 1')
plt.hist(data2, bins=30, alpha=0.5, label='Dataset 2')
plt.legend(loc='upper right')
plt.title('Overlaid Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

In this example, alpha=0.5 controls the transparency, making it easier to compare the two distributions.
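When each dataset gets its own bins=30, the two histograms use different bin edges, and datasets of different sizes produce bars of very different heights. One possible refinement, sketched below, is to compute a single set of bin edges for both samples and plot densities instead of raw counts. The sample sizes, the seed, and the variable names are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(0, 1, 300)      # assumed: 300 observations
large = rng.normal(1, 1.5, 3000)   # assumed: 3,000 observations

# One set of bin edges spanning both samples, so the bars line up exactly
shared_edges = np.histogram_bin_edges(np.concatenate([small, large]), bins=30)

fig, ax = plt.subplots()
# density=True rescales each histogram to unit area, removing the size difference
dens_small, _, _ = ax.hist(small, bins=shared_edges, density=True,
                           alpha=0.5, label="Small sample")
dens_large, _, _ = ax.hist(large, bins=shared_edges, density=True,
                           alpha=0.5, label="Large sample")
ax.legend(loc="upper right")
ax.set_title("Overlaid Histograms on Shared Bins")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
```

Call plt.show() afterwards to display the figure. With shared edges and density scaling, differences between the two shapes reflect the distributions rather than the binning or the sample sizes.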

2. Creating a Density Plot with a Histogram

A density plot superimposed on a histogram provides a smoother representation of the distribution, making patterns easier to identify.

import seaborn as sns

# Sample data
data = np.random.normal(0, 1, 1000)

# Plotting histogram with density plot
sns.histplot(data, kde=True, bins=30)
plt.title('Histogram with Density Plot')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Using Seaborn’s histplot, you can quickly add a kernel density estimate (KDE) to your histogram.
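To see what a kernel density estimate does, it can help to build one by hand. The sketch below averages one Gaussian curve per observation and uses Silverman's rule of thumb for the bandwidth. It is an illustrative implementation under those assumptions, not a reproduction of Seaborn's internals, and the function name gaussian_kde_curve is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on `grid`."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)
q1, q3 = np.percentile(data, [25, 75])
bandwidth = 0.9 * min(data.std(ddof=1), (q3 - q1) / 1.34) * data.size ** (-1 / 5)

grid = np.linspace(data.min() - 1, data.max() + 1, 400)
density = gaussian_kde_curve(data, grid, bandwidth)

fig, ax = plt.subplots()
# density=True puts the bars on the same vertical scale as the KDE curve
ax.hist(data, bins=30, density=True, alpha=0.5)
ax.plot(grid, density)
ax.set_title("Histogram with Hand-Built KDE")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
```

A smaller bandwidth makes the curve follow the individual bars more closely, while a larger one smooths over them.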

3. Customising Histogram Bins

Customising the bin size and edges can reveal finer details in the data distribution.

# Sample data
data = np.random.normal(0, 1, 1000)

# Custom bins
bins = np.linspace(-4, 4, 20)

# Plotting histogram with custom bins
plt.hist(data, bins=bins, edgecolor='black')
plt.title('Histogram with Custom Bins')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Here, we specify custom bin edges using np.linspace, allowing for more control over the histogram’s appearance.
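Fixed edges are one option. NumPy can also derive bin edges from the data using named rules such as "sturges", "fd" (Freedman-Diaconis), or "auto". The sketch below compares them on an assumed seeded sample and also checks how many observations fall outside the fixed range, since those are left out of the counts.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# Fixed edges as in the example above: 20 edges from -4 to 4 give 19 bins
manual_edges = np.linspace(-4, 4, 20)

# Data-driven edges: NumPy picks the bin width from the sample itself
rule_bins = {}
for rule in ("sturges", "fd", "auto"):
    edges = np.histogram_bin_edges(data, bins=rule)
    rule_bins[rule] = len(edges) - 1
    print(f"{rule}: {rule_bins[rule]} bins of width {edges[1] - edges[0]:.3f}")

# Values outside the fixed edges are not counted in any bin
counts, _ = np.histogram(data, bins=manual_edges)
print("observations outside [-4, 4]:", data.size - counts.sum())
```

plt.hist accepts the same rule names through its bins argument, so bins='fd' can be passed directly when plotting.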

Advanced Box Plot Techniques

The following box plot techniques are the kind typically covered in a standard Data Science Course.

1. Creating Grouped Box Plots

Grouped box plots are effective for comparing the distribution of different groups side by side.

# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})

# Plotting grouped box plots
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Value', data=data)
plt.title('Grouped Box Plots')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()

This example uses Seaborn’s boxplot to create grouped box plots, which allow for easy comparison between different groups.
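The same comparison can be drawn with Matplotlib alone by splitting the long-format DataFrame into one array per group. The sketch below does that and also prints the quartiles per group as a numeric companion to the plot. The seed and the variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Group": np.repeat(["A", "B", "C"], 100),
    "Value": np.concatenate([rng.normal(0, 1, 100),
                             rng.normal(1, 1.5, 100),
                             rng.normal(2, 0.5, 100)]),
})

# One array of values per group, in sorted label order
labels = sorted(str(g) for g in data["Group"].unique())
arrays = [data.loc[data["Group"] == g, "Value"].to_numpy() for g in labels]

fig, ax = plt.subplots(figsize=(8, 6))
ax.boxplot(arrays)  # one box per array, drawn at x = 1, 2, 3
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_title("Grouped Box Plots (Matplotlib only)")
ax.set_xlabel("Group")
ax.set_ylabel("Value")

# Quartiles per group: rows are groups, columns are 0.25, 0.5, 0.75
summary = data.groupby("Group")["Value"].quantile([0.25, 0.5, 0.75]).unstack()
print(summary)
```

Call plt.show() afterwards to display the figure. The printed table gives exact values for the box edges and medians that the plot shows visually.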

2. Adding Notches to Box Plots

Notched box plots provide a visual indication of the confidence interval around the median, useful for comparing medians between groups.

# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})

# Plotting notched box plots
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Value', data=data, notch=True)
plt.title('Notched Box Plots')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()

Adding the notch=True argument introduces notches in the box plot, making it easier to assess whether the medians of different groups are significantly different.
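A common convention, and to our understanding the one Matplotlib applies by default when no bootstrap is requested, places the notch at the median plus or minus 1.57 × IQR / √n, an approximate 95% interval. Notches that do not overlap suggest, but do not prove, that two medians differ. The sketch below computes those intervals directly. The helper names notch_interval and notches_overlap are hypothetical, and the seed is an assumption.

```python
import numpy as np

def notch_interval(values):
    """Approximate 95% interval for the median: median +/- 1.57 * IQR / sqrt(n)."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    half_width = 1.57 * (q3 - q1) / np.sqrt(len(values))
    return med - half_width, med + half_width

def notches_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(0, 1, 100),
    "B": rng.normal(1, 1.5, 100),
    "C": rng.normal(2, 0.5, 100),
}

intervals = {name: notch_interval(vals) for name, vals in groups.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: notch from {low:.2f} to {high:.2f}")

print("A and C overlap:", notches_overlap(intervals["A"], intervals["C"]))
```

Because the interval narrows with √n, larger groups produce tighter notches for the same spread.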

3. Displaying Outliers with Box Plots

Box plots automatically show outliers, but you can customise how they are displayed to make them stand out more.

# Sample data
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': np.concatenate([np.random.normal(0, 1, 100),
                             np.random.normal(1, 1.5, 100),
                             np.random.normal(2, 0.5, 100)])
})

# Plotting box plots with customised outliers
# 'color' alone styles the line, not the marker, so the marker colours are set explicitly
plt.figure(figsize=(8, 6))
sns.boxplot(x='Group', y='Value', data=data,
            flierprops={'marker': 'o', 'markerfacecolor': 'red',
                        'markeredgecolor': 'red', 'alpha': 0.5})
plt.title('Box Plots with Customised Outliers')
plt.xlabel('Group')
plt.ylabel('Value')
plt.show()

Here, the flierprops parameter customises the appearance of outliers, using red circles (marker='o') to make them more noticeable.
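It can also be useful to know which points are drawn as outliers. With default settings, a box plot flags values more than 1.5 × IQR below the first quartile or above the third. The sketch below applies that rule by hand and compares the result with the points Matplotlib actually draws. The two planted extreme values, the seed, and the variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# 200 normal draws plus two planted extremes that should be flagged
values = np.concatenate([rng.normal(0, 1, 200), np.array([6.0, -5.5])])

# The 1.5 * IQR rule, applied by hand
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
manual_outliers = values[(values < lower_fence) | (values > upper_fence)]

# Marker colours are set explicitly so the fliers render as red circles
flier_style = {"marker": "o", "markerfacecolor": "red",
               "markeredgecolor": "red", "alpha": 0.5}

fig, ax = plt.subplots()
artists = ax.boxplot(values, flierprops=flier_style)

# boxplot returns its artists; the flier line holds the plotted outlier values
plotted_outliers = artists["fliers"][0].get_ydata()
print("flagged by hand:", np.sort(manual_outliers))
print("drawn by Matplotlib:", np.sort(plotted_outliers))
```

Reading the outliers back from the plot makes it easy to list or label them in a report instead of only showing them as markers.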

Combining Histograms and Box Plots

In some cases, you may want to use both histograms and box plots together to provide a more comprehensive view of your data distribution.

# Sample data
data = np.random.normal(0, 1, 1000)

# Creating a figure with subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 10))

# Histogram
axs[0].hist(data, bins=30, edgecolor='black')
axs[0].set_title('Histogram')

# Box Plot
axs[1].boxplot(data, vert=False)
axs[1].set_title('Box Plot')

plt.show()

This example creates a figure with two subplots: one for the histogram and one for the box plot, allowing you to analyse the data distribution from multiple perspectives.
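A possible variation on this layout shares the x-axis between the two panels, so the box lines up with the histogram bars, and gives the box plot a shorter panel. The height ratio of 4 to 1, the seed, and the axis names below are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0, 1, 1000)

# sharex=True keeps both panels on the same horizontal scale;
# height_ratios gives the histogram four times the height of the box plot
fig, (ax_hist, ax_box) = plt.subplots(
    2, 1, sharex=True, figsize=(8, 6),
    gridspec_kw={"height_ratios": [4, 1]},
)

ax_hist.hist(data, bins=30, edgecolor="black")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram and Box Plot on a Shared Axis")

# vert=False draws the box horizontally, as in the example above
ax_box.boxplot(data, vert=False)
ax_box.set_yticks([])  # the single box needs no y tick
ax_box.set_xlabel("Value")

fig.tight_layout()
```

Call plt.show() afterwards to display the figure. With a shared axis, the median line and whiskers can be read directly against the histogram bars above them.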

Conclusion

Advanced histogram and box plot techniques in Matplotlib offer a powerful toolkit for visualising complex data. Whether you are comparing distributions, identifying outliers, or customising plots for clearer communication, these techniques can help you gain deeper insights into your data. By acquainting yourself with these tools, for example by enrolling in a Data Science Course in Chennai or a similar city that offers lessons in advanced visualisation techniques, you will be better equipped to create effective and informative visualisations that resonate with your audience.
