Histograms are a robust tool for examining how the values in a dataset are distributed, and Matplotlib makes them convenient to plot in Python. We begin with a definition: a histogram shows a distribution as blocks (bars) whose heights represent how frequently values fall in each range. The article then covers why histograms matter, when to use them, and where they are applied in practice. Because they help extract business insights from raw numbers, they are a staple of data visualization in data science projects.
This article was published as a part of the Data Science Blogathon.
Before using any technique, we first need to understand when it (in this case, the histogram) is the right tool, so that we get the most out of it. Hence, in this section, we will discuss some key situations where histograms can be extremely useful.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
plt.rcParams["figure.figsize"] = (10,6)
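In a nutshell, the libraries imported above are used as follows: NumPy for numerical arrays and for loading the data, Matplotlib for plotting the histograms, Seaborn for statistical plots built on top of Matplotlib, and pandas for handling tabular data.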
Let’s load some 1D data and get some insight into it!
To keep this section easy to understand, the data generated is something I’ve thrown together. It’s not just an analytic function from SciPy (that would make it too easy), but I’ve ensured it’s not pathological.
Let’s start with the imports to make sure we have everything right at the beginning. If this errors, pip install whichever dependency you don’t have. If you have issues (especially on Windows machines with NumPy), try conda install instead. For example:
import numpy as np
# Uncomment the next line if you need to install NumPy using Conda.
# !conda install numpy
# Now, let's load the data from two files, "example_1.txt" and "example_2.txt".
d1 = np.loadtxt("example_1.txt")
d2 = np.loadtxt("example_2.txt")
# Print the shapes of the loaded data arrays.
print("Output:", d1.shape, d2.shape)
Output: (500,) (500,)
Inference: We have loaded the two datasets that I created for demonstration purposes. Their shapes show that each one is a one-dimensional array of 500 values.
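If you don't have the two text files at hand, you can still follow along with stand-in data. The snippet below is only a sketch: the file names, the random seed, and the distributions (one bell-shaped sample and one two-peaked mixture) are my own assumptions and are not the data used in this article, so your plots will look different.

```python
import numpy as np

# Hypothetical stand-in data; NOT the article's example_1.txt / example_2.txt.
rng = np.random.default_rng(seed=0)

# Assumed shapes only: one bell-shaped sample and one two-peaked mixture.
standin_1 = rng.normal(loc=10.0, scale=2.0, size=500)
standin_2 = np.concatenate([
    rng.normal(loc=6.0, scale=1.0, size=250),
    rng.normal(loc=12.0, scale=1.5, size=250),
])

# Save to hypothetical file names, then reload with np.loadtxt,
# mirroring how the article loads its own files.
np.savetxt("standin_example_1.txt", standin_1)
np.savetxt("standin_example_2.txt", standin_2)
d1 = np.loadtxt("standin_example_1.txt")
d2 = np.loadtxt("standin_example_2.txt")

print(d1.shape, d2.shape)
```

Both arrays come back as one-dimensional arrays of 500 values, so the plotting code in the rest of the article can be run on them unchanged.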
Now it’s time to get our hands dirty and look at different aspects of histograms. Here we will discuss three plots that show the distributions of both datasets together on the same axes.
plt.hist(d1, label="D1")
plt.hist(d2, label="D2")
plt.legend()
plt.ylabel("Counts");
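# Give both datasets the same 50 bin edges spanning their combined range,
# so that the bars line up and can be compared directly.
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
counts1, _, _ = plt.hist(d1, bins=bins, label="D1")
plt.hist(d2, bins=bins, label="D2")
plt.legend()
plt.ylabel("Counts");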
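Why bother building a shared `bins` array? A quick numerical check makes the reason visible. The sketch below uses `np.histogram` (the counting step without the drawing) on two made-up samples; the sample parameters and variable names are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
a = rng.normal(0.0, 1.0, size=500)  # hypothetical sample
b = rng.normal(3.0, 2.0, size=500)  # hypothetical sample with a wider range

# Default binning: each dataset gets 10 bins over its OWN range,
# so the two sets of bin edges do not line up.
_, edges_a = np.histogram(a)
_, edges_b = np.histogram(b)
same_default_edges = np.allclose(edges_a, edges_b)

# Shared binning: one set of edges covering the combined range.
bins = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 50)
counts_a, shared_a = np.histogram(a, bins=bins)
counts_b, shared_b = np.histogram(b, bins=bins)

print("default edges identical:", same_default_edges)
print("shared edges identical:", np.array_equal(shared_a, shared_b))
print("number of bins:", len(counts_a))  # 50 edges define 49 bins
```

Note that passing 50 edges produces 49 bins, and that every one of the 500 values in each sample lands in some bin because the edges span the combined range.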
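# Same shared bins, but density=True rescales each histogram so that
# its total area is 1, which makes the two shapes directly comparable.
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
counts1, _, _ = plt.hist(d1, bins=bins, label="D1", density=True)
plt.hist(d2, bins=bins, label="D2", density=True)
plt.legend()
plt.ylabel("Probability");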
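It helps to be precise about what `density=True` does: each count is divided by the total number of points and by the bin width, so the bar areas (not the bar heights) sum to 1. Strictly speaking the y-axis is then a probability density. The sketch below verifies this with `np.histogram` on a made-up sample; the sample itself is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
data = rng.normal(5.0, 1.5, size=500)  # hypothetical sample
bins = np.linspace(data.min(), data.max(), 50)

counts, edges = np.histogram(data, bins=bins)
density, _ = np.histogram(data, bins=bins, density=True)

widths = np.diff(edges)  # width of each bin

# The normalization done by hand: count / (total count * bin width).
manual_density = counts / (counts.sum() * widths)

# Total area under the normalized histogram.
area = np.sum(density * widths)

print("matches manual formula:", np.allclose(density, manual_density))
print("area is 1:", np.isclose(area, 1.0))
```

Because the normalization involves the bin width, individual bar heights can exceed 1 when the bins are narrow; only the total area is constrained.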
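# Styled version: semi-transparent stacked bars for both datasets,
# overlaid with a step outline for each individual dataset.
bins = np.linspace(min(d1.min(), d2.min()), max(d1.max(), d2.max()), 50)
# histtype="barstacked" actually stacks the bars; the default would draw them side by side.
plt.hist([d1, d2], bins=bins, label="Stacked", density=True, histtype="barstacked", alpha=0.5)
plt.hist(d1, bins=bins, label="D1", density=True, histtype="step", lw=1)
plt.hist(d2, bins=bins, label="D2", density=True, histtype="step", ls=":")
plt.legend()
plt.ylabel("Probability");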
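Inference: The last plot is the styling part, where we improve the look and feel of Matplotlib’s histogram by changing and adding a few parameters in the hist() call: alpha makes the bars semi-transparent, histtype="step" draws only the outline of a histogram, and lw and ls set the line width and line style of that outline.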
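To see what these styling parameters change under the hood, here is a small self-contained sketch. It uses a made-up sample, the object-oriented `ax.hist` form, and the non-interactive "Agg" backend so that it runs without a display; all of these are my own choices for illustration rather than part of the article's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumed so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=3)
sample = rng.normal(0.0, 1.0, size=500)  # hypothetical sample
bins = np.linspace(sample.min(), sample.max(), 30)  # 30 edges -> 29 bins

fig, ax = plt.subplots()

# Default histtype ("bar"): one semi-transparent rectangle per bin.
_, _, filled = ax.hist(sample, bins=bins, density=True, alpha=0.5,
                       label="filled bars")

# histtype="step": a single outline, here thin (lw=1) and dotted (ls=":").
_, _, outline = ax.hist(sample, bins=bins, density=True, histtype="step",
                        lw=1, ls=":", label="step outline")

ax.legend()
ax.set_ylabel("Probability density")

n_filled = len(filled)    # number of rectangles drawn
n_outline = len(outline)  # number of outline polygons drawn
print(n_filled, n_outline)
plt.close(fig)
```

The filled histogram is made of one rectangle per bin, while the step version is a single polygon, which is why step outlines stay readable when several distributions overlap.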
We now reach the final part of the article, where we will briefly review everything discussed so far about histograms, covering their definition, purpose, applications, and practical implementation. Rather than providing a paragraph explanation, let’s opt for a point-by-point briefing.
A. Histograms visually represent data distribution, offering insights into patterns, central tendencies, and variations. Understanding when to use them enhances data analysis and interpretation.
A. A histogram comprises four main elements: bins (intervals representing data ranges), frequencies (counts of data points in each bin), axes (x and y), and bars (rectangular blocks indicating data frequency).
A. Two crucial characteristics of a histogram are shape (indicating data distribution) and central tendency (highlighted by the location of the central peak or mean).
A. Histograms reveal the process’s central tendency, variability, shape of the distribution, and potential outliers or anomalies, providing a comprehensive overview of the underlying data patterns.
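The elements and characteristics described in these answers can be read straight off the numbers behind a histogram. The sketch below uses a tiny made-up array and hand-picked bin edges (both are assumptions for illustration) to show the bins, the frequencies, the central peak, and a potential outlier.

```python
import numpy as np

# Made-up values; 12 is deliberately far from the rest.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])

# Bins: intervals [0, 2), [2, 4), [4, 6), [6, 8), [8, 10), [10, 12].
counts, edges = np.histogram(values, bins=[0, 2, 4, 6, 8, 10, 12])
print(counts)  # frequencies per bin: [1 5 3 0 0 1]

# Central tendency: the tallest bar marks where most values sit.
peak = int(np.argmax(counts))
peak_interval = (edges[peak], edges[peak + 1])  # the interval from 2 to 4

# Potential outlier: an occupied bin separated from the bulk by empty bins.
empty_bins = np.flatnonzero(counts == 0)  # bins 3 and 4 are empty

# The mean is pulled toward the outlier; the median is not.
mean_value = values.mean()        # 3.9
median_value = np.median(values)  # 3.0
```

The gap of empty bins before the last bar is the visual cue for an outlier, and the difference between the mean and the median shows how much that single value shifts the average.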
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.