Machine learning seeks to create models and algorithms that can predict future outcomes based on historical data and improve their accuracy over time. A model's accuracy depends heavily on which predictor variables are chosen for a given target variable.

Exploratory data analysis (EDA) is a set of tools and procedures that help in identifying patterns, analyzing data, finding correlations between variables, and selecting the most effective variables. EDA thus plays a crucial role in the selection of variables for building ML models.

Exploratory Data Analysis

Exploratory data analysis, or EDA, simply means getting to know the data you’re dealing with: understanding its significance, nature of occurrence, data types, and distributions, and discovering relationships between the predictor variables and the response (target) variable. Statistical analysis, data visualization, and data transformation are all part of EDA.

EDA is a prerequisite step of almost every ML project as it helps to:

Improve model accuracy.
Choose a feasible response (target) variable.
Select useful predictor variables.
Identify skewed variable distributions.
Identify patterns and outliers in the data that could distort results or encourage overfitting.
Remove inconsistencies from the data.

Let’s have a quick look at some data analysis techniques commonly used in Machine Learning processes:

EDA Techniques for ML Processes

1. Summary Statistics

To get started with building models, one must first understand the data. Summary statistics are typically the initial stage in any machine learning process, providing a concise overview of variable distribution. This aids in better preprocessing and handling of the dataset.

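In Pandas, the .describe() method returns, for each numerical column, the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles. Other statistics such as the mode, variance, range, and interquartile range are not part of its output, but they can be obtained with methods like .mode() and .var() or derived from the values above.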
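
As a rough illustration, the sketch below calls .describe() on a tiny DataFrame and then computes a few additional statistics separately. The column names and values are invented for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical data: the values are invented purely for illustration.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 35],
    "income": [28000, 52000, 91000, 60000, 47000],
})

# count, mean, std, min, 25%, 50% (median), 75%, max for each numeric column
stats = df.describe()
print(stats)

# A few statistics that are not in the describe() output
age = df["age"]
extras = {
    "mode": age.mode().iloc[0],
    "variance": age.var(),  # sample variance (ddof=1)
    "range": age.max() - age.min(),
    "iqr": age.quantile(0.75) - age.quantile(0.25),
}
print(extras)
```

For non-numerical columns, df.describe(include="all") adds the count of unique values, the most frequent value, and its frequency.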

2. Feature Selection

Feature selection is a process in machine learning in which we actively choose the most significant features of a dataset as input variables that can contribute to a model’s accuracy.

This process involves trial and error, with various combinations of features evaluated to see which gives the best performance. The better our input variables are, the better the learning algorithm will perform.
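
One way to make this less manual is to score each feature against the target and keep the best ones. The sketch below is only one possible approach, not the only way to select features: it assumes scikit-learn's SelectKBest with an ANOVA F-test, and it uses the Iris dataset bundled with scikit-learn simply because it needs no download.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Iris ships with scikit-learn; it is used here only as a convenient example.
X, y = load_iris(return_X_y=True, as_frame=True)

# Score every feature against the target with an ANOVA F-test, keep the top 2.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

scores = dict(zip(X.columns, selector.scores_))
selected = list(X.columns[selector.get_support()])

print(scores)
print("Selected features:", selected)
```

A univariate score like this ignores interactions between features, so it is best treated as a first filter before trying combinations in an actual model.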

3. Data Analysis

Depending on the variables in our dataset, the data can be analyzed in a variety of ways. There are three types of data analysis:

Univariate analysis:

Analysis of a single variable to determine its structure and distribution. This includes determining the central tendency values, min and max values, and range. Histograms and KDE plots are the most commonly used distribution charts.

Bivariate analysis:

Analysis of two variables to understand their relationship. It helps in assessing the dependency of one variable on another. The most prominent plots include scatterplots, heatmaps, box plots, and regression graphs. Bivariate analysis is important in machine learning during the feature selection phase, where the predictor variables are chosen.
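
As a small bivariate example, the sketch below measures the Pearson correlation between two columns and draws a scatterplot with a fitted regression line. It again assumes the Iris dataset bundled with scikit-learn, and the pair of columns is an arbitrary choice for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame
x_col, y_col = "petal length (cm)", "petal width (cm)"

# Pearson correlation coefficient between the two variables
r = iris[x_col].corr(iris[y_col])
print(f"Pearson r = {r:.3f}")

# Scatterplot with a linear regression line
ax = sns.regplot(data=iris, x=x_col, y=y_col)
ax.set_title(f"{y_col} vs. {x_col}")
```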

Multivariate analysis:

Multivariate analysis involves finding the correlation between more than two variables or attributes. The most widely used plots include pair plots, multiple regression plots, multivariate scatter plots, heatmaps, and grouped boxplots.

4. Data Transformation

Data transformation entails converting variables into a more consistent and usable form and removing inconsistencies in the data. In machine learning, transformations such as scaling ensure that the algorithm treats the predictor variables on comparable scales. This is how we prepare the data for model building after the feature engineering step.

Some common methods of data transformation include:

Standardization:

When the features in a dataset are measured on very different scales, features with large values can dominate the model and bias it. Standardization rescales each variable by subtracting its mean and dividing by its standard deviation, so that the result has a mean of 0 and a standard deviation of 1. This changes the scale of the variable but not the shape of its distribution.

The word is also used more loosely for making formats consistent. For instance, a dataset may have datetime values in different formats, such as MM/DD/YYYY, DD-MM-YYYY, or YYYY-MM-DD, and converting them to a single format is a data-cleaning step that is separate from the statistical scaling described above.

Standardization can be implemented using the sklearn library.
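
The original code for this step is not available here, so the following is a minimal sketch of what such an implementation might look like, assuming scikit-learn's StandardScaler. The feature matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (values are invented):
# column 0 could be a height in cm, column 1 a salary.
X = np.array([
    [170.0, 65000.0],
    [160.0, 48000.0],
    [180.0, 82000.0],
    [175.0, 51000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # (x - mean) / std, column by column

print(X_scaled)
print("means:", X_scaled.mean(axis=0))  # approximately 0
print("stds: ", X_scaled.std(axis=0))   # approximately 1
```

In a real project the scaler should be fitted on the training data only and then applied to the test data with scaler.transform(), so that no information from the test set leaks into the model.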


Normalization:

Data normalization, also known as min-max scaling, rescales data to a specific range, often between 0 and 1. This step improves the reliability of distance-based algorithms such as KNN (K-Nearest Neighbours), a supervised method, and K-means clustering, an unsupervised one, because both rely on distance metrics that are sensitive to feature scale.

Data normalization is commonly used with numerical attributes.

Min-max normalization can also be implemented using the sklearn library.
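
The original example is not available here. A minimal sketch, assuming scikit-learn's MinMaxScaler and a small made-up feature matrix, could look like this:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numerical features (values are invented for illustration).
X = np.array([
    [170.0, 65000.0],
    [160.0, 48000.0],
    [180.0, 82000.0],
    [175.0, 51000.0],
])

# Rescale every column to the range [0, 1]: (x - min) / (max - min)
scaler = MinMaxScaler(feature_range=(0, 1))
X_norm = scaler.fit_transform(X)

print(X_norm)
print("column minima:", X_norm.min(axis=0))
print("column maxima:", X_norm.max(axis=0))
```

Because the minimum and maximum define the scale, a single extreme outlier can squeeze all other values into a narrow band, which is worth checking before choosing this method.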

Dummy Variables:

Often we encounter categorical variables in our dataset which are not inherently numerical, such as gender, class, and location. Such variables are transformed into dummy variables, allowing for more useful and structured numerical data.

With dummy variables, each category is represented by a binary 0 or 1 column. Because most ML algorithms operate on numerical inputs, this encoding lets categorical information be used directly in a model.

For instance, consider an attribute “answer” with the values “Yes” and “No”.

To obtain dummy variables for the given attribute, we can use the get_dummies() function from the Pandas library.
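
The original code is not available here, so the following is a minimal sketch of the idea. The five example rows are made up; only the attribute name “answer” and its two values come from the text above.

```python
import pandas as pd

# Hypothetical rows for the "answer" attribute.
df = pd.DataFrame({"answer": ["Yes", "No", "Yes", "No", "Yes"]})

# One 0/1 column per category; dtype=int gives integers instead of booleans.
dummies = pd.get_dummies(df, columns=["answer"], dtype=int)
print(dummies)

# With two categories one column is redundant, so drop_first=True keeps
# a single indicator column (here: answer_Yes).
compact = pd.get_dummies(df, columns=["answer"], drop_first=True, dtype=int)
print(compact)
```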

5. Data Visualization

Tables of numbers and statistics can be daunting to read and interpret. This is where data visualization comes in handy: plots and graphs often reveal structure in the data that is hard to see in raw numerical output.

Python libraries like Matplotlib, Seaborn, and Plotly offer powerful ways to create visually appealing plots for exploring the data behind ML models. These plots help in identifying patterns, spotting outliers, examining the distribution of data, finding correlations between features and target variables, and so on. Some common plots used in machine learning projects are:

Boxplot:

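Box and whisker plots help to understand the distribution and spread of data. The box spans the 25th percentile (1st quartile) to the 75th percentile (3rd quartile), with a line drawn through it at the median. The whiskers extend toward the minimum and maximum values; in many libraries, including seaborn by default, they stop at the most extreme points within 1.5 times the interquartile range of the box, and points beyond that are drawn individually as outliers.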

In the figure below, we draw a boxplot of the “penguins” dataset from the seaborn library.
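
The original plotting code is not available here; a sketch along the following lines would produce such a plot with seaborn's boxplot. To keep the example runnable offline, it builds a stand-in table with made-up flipper lengths. The real data can be loaded with sns.load_dataset("penguins"), which downloads the dataset on first use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Stand-in rows with MADE-UP values so the sketch runs without a download.
# For the real data use: penguins = sns.load_dataset("penguins")
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 30),
    "flipper_length_mm": np.concatenate([
        rng.normal(190, 6, 30),
        rng.normal(196, 7, 30),
        rng.normal(217, 6, 30),
    ]),
})

# One box per species, showing the spread of flipper length
ax = sns.boxplot(data=penguins, x="species", y="flipper_length_mm")
ax.set_title("Flipper length by species")
```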

[Figure: box plot of flipper_length_mm by species for the penguins dataset]

One can easily see the distribution of the “flipper_length_mm” variable across all three species of penguins, which makes similarities and differences in flipper length between species easy to spot.

Pairplot:

Pairplots, as the name suggests, are used to plot pairwise relationships between all the numerical variables present in a dataset.
In the figure below, we draw a pairplot of the “penguins” dataset from the seaborn library. We additionally use the parameter “hue” to categorize the data points by “species” variable.

[Figure: pair plot of the penguins dataset, colored by species]

Pairplot with Regression Line:

Pairplots can also be drawn with a regression line in each panel, which indicates the direction of the trend between each pair of variables. To do so, we pass ‘reg’ as the value of the ‘kind’ parameter.

[Figure: pair plot of the penguins dataset with regression lines]

Correlation Matrix Plots:

Correlation measures the strength and direction of the linear relationship between two variables. The correlation matrix plot visualizes these relationships using the correlation coefficient of each pair of variables.
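
The original code and figures are not available here, and the text does not say which dataset they used. The sketch below therefore assumes the Iris dataset bundled with scikit-learn, computes the correlation matrix with pandas, and draws it with seaborn's heatmap.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import seaborn as sns
from sklearn.datasets import load_iris

# Example data chosen for convenience; keep only the four feature columns.
iris = load_iris(as_frame=True).frame.drop(columns="target")

# Pairwise Pearson correlation coefficients
corr = iris.corr()
print(corr.round(2))

# Heatmap with the coefficient written in each cell; fixing the color
# range to [-1, 1] keeps the colors comparable across datasets.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
```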

[Figure: correlation matrix plot]

Conclusion

Exploratory data analysis can help ML engineers to:

Uncover insights that are hard to see from summary statistics alone.
Understand the nature, structure, characteristics, and distribution of attributes, which can be used to enhance feature selection.
Identify outliers in the data that could otherwise distort the analysis or the model.
Discover trends and significant relationships between the variables.
Use appealing visualizations to effectively deliver the findings to the audience/stakeholders.

In this post, we have just scratched the surface of EDA. If you are an aspiring data scientist, data analyst, or machine learning engineer, or simply want to get into data, it is worth researching further, since EDA can help you build powerful and efficient machine learning models.
