Introduction to EDA: Scatter Plots

Surbhi Srivastava
6 min read · Nov 13, 2021

In this article we are going to learn about scatter plots, a useful tool for exploratory data analysis (EDA). EDA is one of the most crucial parts of any data science project. It is always wise to get to know the data we are given before doing anything else, and to pull as much information out of it as we can.

There are many tools and techniques at our disposal that help with EDA. Here we are going to discuss a basic yet very useful technique for exploring data: scatter plots.

We are going to learn a few plotting techniques using the dataset shown below.

So, let’s talk about the data first. The dataset is in tabular format and contains information about three species of flowers from the Iris genus, namely Iris setosa, Iris versicolor, and Iris virginica.

Iris dataset top 5 view

The objective given to us with this dataset is to classify a new flower as belonging to one of the three classes, given its four features (sepal_length, sepal_width, petal_length, and petal_width). Understanding the objective is the first step in solving any problem statement, and it also helps us determine the type of analysis our dataset needs. Here the ‘species’ column holds the classes/labels and all the other columns are features.

That is the overview of our dataset. If you want to explore the dataset further, refer to the site below.

First of all, I import the libraries, load the dataset using pandas, and print its shape. It has 150 rows and 5 columns. We can also extract the column names from the DataFrame’s columns attribute.
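The loading step might look roughly like the sketch below. The article's own CSV file is not available here, so this assumes scikit-learn's bundled copy of Iris as a stand-in and renames its columns to the names used in this article; the variable names `bunch` and `iris` are my own choices.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of Iris stands in for the article's CSV file.
bunch = load_iris(as_frame=True)

# Rename "sepal length (cm)" -> "sepal_length", etc., to match the article's column names.
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))

# Replace the numeric target (0, 1, 2) with a readable 'species' column.
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

print(iris.shape)          # (150, 5): 150 rows, 5 columns
print(list(iris.columns))  # the four feature names followed by 'species'
print(iris.head())         # first 5 rows
```

If you have the data as a CSV file instead, `pandas.read_csv` on that file would replace the first three statements.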

To get the number of data points per class/species, we can count the values in the ‘species’ column.
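A minimal sketch of that count, using pandas' `value_counts`. The loading lines are repeated so the snippet runs on its own, and they assume scikit-learn's bundled Iris data rather than the article's CSV.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# Number of rows (data points) for each species.
counts = iris["species"].value_counts()
print(counts)
```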

We can see that we have a balanced dataset, which means every class/species contains an equal number of data points. In real-life scenarios we more often encounter imbalanced datasets, and the degree of imbalance varies: some are severely imbalanced, others only slightly. The analysis of an imbalanced dataset differs slightly from that of a balanced one, so it is important to check whether the given dataset is imbalanced.

Let’s look at some simple and interesting plotting tools.

1. 2-D Scatter Plot:- Scatter plots are used to observe the relationship between two variables. They also reveal patterns across the whole dataset and help us spot correlation between features. There are several ways of drawing a scatter plot.

Using matplotlib:
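The original snippet is not reproduced here, so the following is only a sketch of a plain matplotlib scatter plot of sepal_length against sepal_width. It assumes scikit-learn's bundled Iris data and uses `Axes.scatter`, which may differ from the exact call the article used.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# One point per flower: sepal_length on the X-axis, sepal_width on the Y-axis.
fig, ax = plt.subplots()
ax.scatter(iris["sepal_length"], iris["sepal_width"])
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.grid(True)

# plt.show()  # uncomment when running outside a notebook
```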

To see the complete signature of a particular function, it is always best to check the official documentation, which you can access below for the function used above.

Using pandas:

The ‘plot’ method of a pandas DataFrame or Series is a wrapper around matplotlib’s plotting functions (‘plt.plot’ by default).
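As a sketch, the same scatter plot through the pandas wrapper could look like this, with `kind="scatter"` selecting the plot type. The data again comes from scikit-learn's bundled copy of Iris, which is an assumption of this example.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# DataFrame.plot returns a matplotlib Axes; the axis labels are taken from the column names.
ax = iris.plot(kind="scatter", x="sepal_length", y="sepal_width")

# import matplotlib.pyplot as plt; plt.show()  # uncomment when running outside a notebook
```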

After plotting the graph, look at both axes and their scales, which will help us derive insights. Here the species/classes/labels are not distinguishable because every point is drawn in the same color. So we are going to color the points by species with the help of the Seaborn library.
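One way to color the points by species is seaborn's `scatterplot` with the `hue` parameter, sketched below. This is not necessarily the call the article used, and with seaborn's default palette the colors may differ from the blue, red, and green described next. The data is assumed to come from scikit-learn's bundled Iris copy.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# hue="species" assigns one color per species and adds a legend.
fig, ax = plt.subplots()
sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species", ax=ax)

# plt.show()  # uncomment when running outside a notebook
```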

We can see that the blue points can easily be separated from the red and green points by drawing a straight line. The red and green points, on the other hand, overlap and cannot be separated so easily. These observations hold for this particular pair of features: we can say that Iris-setosa can be distinguished from Iris-versicolor and Iris-virginica using sepal_length and sepal_width. In the same way, we can put other features on the axes, plot different graphs, and draw further observations.

2. 3-D Scatter Plot:- We can also plot 3-D scatter plots. Please refer to the visualization below.
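The original visualization is not reproduced here. As a stand-in, this sketch uses matplotlib's 3-D axes, which may not be the library behind the article's figure. The choice of sepal_length, sepal_width, and petal_length for the three axes is arbitrary, and the data is assumed to come from scikit-learn's bundled Iris copy.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# A 3-D axes object; each species is drawn as its own group so it gets its own color.
fig = plt.figure(figsize=(7, 6))
ax = fig.add_subplot(projection="3d")
for name, group in iris.groupby("species"):
    ax.scatter(group["sepal_length"], group["sepal_width"], group["petal_length"], label=name)

ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.set_zlabel("petal_length")
ax.legend()

# plt.show()  # uncomment when running outside a notebook
```

In an interactive window the plot can be rotated with the mouse, which is usually needed before the point clouds can be told apart.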

As you can notice, 3-D plots need a lot of mouse interaction before the data can be interpreted. In real-life scenarios we usually get many more features, and plotting such high-dimensional graphs is not fruitful because humans are not equipped to visualize high-dimensional space. There is a hack for this that lets us visualize the data in one go. Our Iris dataset has four features, and since we cannot visualize a four-dimensional graph, the following is a smart way around it.

3. Pair Plots:- As the name suggests, we draw scatter plots for pairs of features, like we did with sepal_width and sepal_length a while ago. We have 4 features, so we can form 4C2, i.e. 6, unique pairs and plot each of them. Once we have these 6 unique 2-dimensional plots, we can get a sense of what the data looks like in 4 dimensions. Let’s check out the code:

Beautiful, isn’t it?

This is a matrix of plots. Ignore the diagonal elements for now; we will cover them in a separate topic on histograms and PDFs/CDFs. In the matrix, the X-axis for each column is labeled at the bottom and the Y-axis for each row is labeled on the left-hand side. The plots above and below the diagonal are mirror images of each other (the same pair of features with the axes swapped), hence we have 12 plots but only 6 unique ones. So we can focus on the 6 plots above the diagonal; the 6 below it give the same insights.

Notice that in these graphs the Y-axes are the same but the X-axes are different. The major takeaway is that petal_length (PL) and petal_width (PW) do a better job of distinguishing the setosa flowers from the other two species. There is also less overlap between the red and green points. And just like that, we have found the important features to consider while building a model.
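One rough way to back this observation with numbers is to compare the per-species minimum and maximum of the two petal features. This check is my own addition rather than part of the original analysis, and it assumes scikit-learn's bundled Iris data.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data, reshaped to the article's column names.
bunch = load_iris(as_frame=True)
iris = bunch.frame.rename(columns=lambda c: c.replace(" (cm)", "").replace(" ", "_"))
iris["species"] = iris.pop("target").map(dict(enumerate(bunch.target_names)))

# Smallest and largest petal measurements for each species.
ranges = iris.groupby("species")[["petal_length", "petal_width"]].agg(["min", "max"])
print(ranges)

# A positive gap means every setosa value lies below every value of the other two species.
others = ranges.drop(index="setosa")
gaps = {}
for feature in ["petal_length", "petal_width"]:
    gaps[feature] = others[(feature, "min")].min() - ranges.loc["setosa", (feature, "max")]
    print(f"{feature}: gap between setosa and the nearest other species = {gaps[feature]:.1f}")
```

A positive gap on a feature means a single threshold on that feature is enough to separate setosa from the rest.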

This gives us the gist of how EDA helps in drawing insights from a dataset.

Take a dataset and start experimenting with such cool techniques.

Thank You
