Every data analysis project requires a set of data transformation steps before the analysis itself can begin. These pre-processing steps help data scientists and analysts ensure a cleaner, less biased analysis of the data.
In this tutorial, you will learn some advanced data wrangling methods using SQL to make the most of your datasets.
What is data wrangling?
Data wrangling is the process of data transformation while preserving the quality of the dataset. It involves cleaning, combining, sorting, grouping, and filtering various datasets, as well as dealing with outliers and missing data.
But why would you want to perform data wrangling?
It helps reduce inconsistencies in the data, makes it less error-prone, and reduces the need for manual entry of missing values through various data imputation techniques.
Clean and structured data makes the data analysis process easier.
Data Wrangling with SQL
Data wrangling can be performed with Microsoft Excel, SQL, Python libraries like pandas and NumPy, or ETL tools like Apache Spark and Apache Airflow. SQL is often preferred for data wrangling for a few reasons, some of which are:
- Large databases may be analyzed with the use of SQL queries, which can efficiently clean and aggregate data. This makes it ideal for relational databases containing structured data for improved data quality.
- Most firms use SQL extensively for data analysis and management.
- Because of its great scalability, SQL enables pre-processing of huge datasets in organizations.
- SQL is standardized across multiple databases, which ensures consistency in datasets.
SQL Basics: Queries for Data Wrangling
The SELECT statement is used to return specific columns from a table:
-- return all columns
SELECT * FROM student_db;
-- return specific columns
SELECT name, age FROM student_db;
The WHERE clause is used to filter rows based on one or more conditions:
SELECT * FROM student_db
WHERE score >= 30;
The given query will retrieve only those rows where the score is greater than or equal to 30.
The WHERE clause is often used to filter null values in a dataset:
-- select rows where last_name is NULL
SELECT * FROM student_db WHERE last_name IS NULL;
-- select rows where score is 80
SELECT * FROM student_db WHERE score = 80;
-- select rows where name is 'Finn'
SELECT * FROM student_db WHERE name = 'Finn';
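These filtering queries can be tried out end to end using SQLite from Python; the table layout and sample rows below are made up for illustration:

```python
import sqlite3

# In-memory database with an illustrative student_db table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (name TEXT, last_name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO student_db VALUES (?, ?, ?)",
    [("Finn", None, 80), ("Ada", "Lovelace", 25), ("Max", "Planck", 47)],
)

# Rows where last_name is NULL
missing = conn.execute(
    "SELECT name FROM student_db WHERE last_name IS NULL"
).fetchall()
print(missing)  # [('Finn',)]

# Rows where score >= 30
passing = conn.execute(
    "SELECT name FROM student_db WHERE score >= 30"
).fetchall()
print(passing)  # [('Finn',), ('Max',)]
```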
SQL Queries for Data Cleaning
1. Deleting data
To delete unwanted rows from a table:
DELETE FROM student_db
WHERE score < 15;
The above query will remove student records if their score is less than 15.
2. Updating data
To update database records, you can use the UPDATE statement:
UPDATE student_db
SET score = 79
WHERE student_id = 4322;
The above query will update the “score” of the student in the student_db table to 79 for the specified ID.
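The DELETE and UPDATE statements above can be exercised together in a small SQLite sketch; the sample IDs and scores are illustrative:

```python
import sqlite3

# Illustrative student_db table with an id column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO student_db VALUES (?, ?)",
                 [(4322, 55), (4323, 12), (4324, 90)])

# Update one record, then remove rows with a score below 15
conn.execute("UPDATE student_db SET score = 79 WHERE student_id = 4322")
conn.execute("DELETE FROM student_db WHERE score < 15")

rows = conn.execute(
    "SELECT student_id, score FROM student_db ORDER BY student_id"
).fetchall()
print(rows)  # [(4322, 79), (4324, 90)]
```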
3. Removing duplicate rows
To eliminate duplicate rows from a table, a self-join is used to identify the duplicates. A DELETE statement can then remove the redundant rows under certain conditions:
DELETE s1
FROM student_db s1
INNER JOIN student_db s2
ON s1.name = s2.name
AND (s1.score, s1.percentage) < (s2.score, s2.percentage);
The above query (written in MySQL's multi-table DELETE syntax) will delete the duplicate records and keep the row with the higher score and percentage for each student.
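SQLite has no multi-table DELETE, so an equivalent sketch keeps each student's best row using ROW_NUMBER() (available in SQLite 3.25+); the table and rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (name TEXT, score INTEGER, percentage REAL)")
conn.executemany("INSERT INTO student_db VALUES (?, ?, ?)", [
    ("Finn", 80, 88.0),
    ("Finn", 65, 70.0),  # duplicate with a lower score: should be removed
    ("Ada", 90, 95.0),
])

# Keep only the best (score, percentage) row per name
conn.execute("""
    DELETE FROM student_db
    WHERE rowid NOT IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY name
                       ORDER BY score DESC, percentage DESC
                   ) AS rn
            FROM student_db
        )
        WHERE rn = 1
    )
""")
rows = conn.execute("SELECT name, score FROM student_db ORDER BY name").fetchall()
print(rows)  # [('Ada', 90), ('Finn', 80)]
```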
4. Renaming data
Sometimes the original data source may have given complicated names. To make the columns easier to comprehend, you can rename them as follows:
ALTER TABLE employees
RENAME COLUMN month_end_employee_salary TO salary;
This will rename the column name from “month_end_employee_salary” to “salary”.
5. Filtering data
To filter data, a set of conditions can be applied to tables:
SELECT * FROM employees
WHERE salary > 40000;
This will simply return all the rows where the “salary” value is greater than 40000.
Advanced Data Wrangling with SQL
After going over the fundamentals, let us move ahead to some more advanced data wrangling techniques using SQL:
1. Data Transformation
Data obtained originally from a data source is often unprepared and messy. Data transformation involves some processes like modifying and converting data types, standardizing and normalizing data, and handling categorical data to make it fit for insightful data analysis.
Conversion of Data Types
Consider a “students” table that has a “score” column storing numerical values as text (VARCHAR). In order to perform numeric analysis, you will need an integer type. SQL offers the CONVERT and CAST functions for data type conversion:
-- Using the CONVERT function (SQL Server)
SELECT name, CONVERT(INT, score) AS num_score
FROM students;
-- Using the CAST function
SELECT name, CAST(score AS INT) AS num_score
FROM students;
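A quick way to see the cast in action is with SQLite, where text scores (a common artifact of CSV imports) can be converted on the fly; the data below is illustrative:

```python
import sqlite3

# Scores stored as text, as often happens after a CSV import
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", "80"), ("Ada", "95")])

rows = conn.execute(
    "SELECT name, CAST(score AS INTEGER) AS num_score FROM students"
).fetchall()
print(rows)  # [('Finn', 80), ('Ada', 95)]

# The cast values now support arithmetic
avg = conn.execute("SELECT AVG(CAST(score AS INTEGER)) FROM students").fetchone()[0]
print(avg)  # 87.5
```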
SQL CASE Statements
SQL provides CASE statements that aid in logic building. These come in particularly handy when you want to retrieve and filter particular results based on certain conditions.
CASE statements are also used to handle NULL values in data. This contributes to the consistency and reliability of the data.
Some example use cases are as follows:
Handling NULL values
Consider a table called “students,” where the “score” column contains several NULL values. You can replace the NULL values with a default score of 0.
SELECT name,
       CASE WHEN score IS NULL THEN 0
            ELSE score
       END AS new_score
FROM students;
Handling categorical values
CASE statements are often used to handle categorical data.
Suppose the “students” table has another column “gender”, but with variations like “MALE”, “male”, “m”, “FEMALE”, “F”. This can make the data messy.
These values can be standardized as follows:
SELECT name,
       CASE
           WHEN UPPER(gender) = 'M'
             OR UPPER(gender) = 'MALE' THEN 'Male'
           WHEN UPPER(gender) = 'F'
             OR UPPER(gender) = 'FEMALE' THEN 'Female'
       END AS new_gender
FROM students;
Alternatively, SQL CASE can be used to encode categorical variables as binary values:
SELECT name,
       CASE
           WHEN gender = 'Male' THEN 1
           WHEN gender = 'Female' THEN 0
       END AS gender_group
FROM students;
Handling conditional statements
Suppose you want to categorize student performance from the “score” column as ‘High,’ ‘Medium,’ or ‘Low’, based on their scores:
SELECT name, score,
       CASE
           WHEN score >= 70 THEN 'High'
           WHEN score >= 40 THEN 'Medium'
           ELSE 'Low'
       END AS score_category
FROM students;
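Both CASE patterns, replacing NULLs and bucketing scores, can be verified in one SQLite query; the sample students are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", 80), ("Ada", 55), ("Max", None)])

# Replace NULL scores with 0 and bucket scores into categories
rows = conn.execute("""
    SELECT name,
           CASE WHEN score IS NULL THEN 0 ELSE score END AS new_score,
           CASE
               WHEN score >= 70 THEN 'High'
               WHEN score >= 40 THEN 'Medium'
               ELSE 'Low'
           END AS score_category
    FROM students
""").fetchall()
print(rows)  # [('Finn', 80, 'High'), ('Ada', 55, 'Medium'), ('Max', 0, 'Low')]
```

Note that a NULL score falls through every WHEN comparison (NULL compares as unknown) and lands in the ELSE branch.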
CASE and COALESCE
COALESCE is a function that returns the first non-null value from its list of arguments; it is a compact alternative to a CASE expression for filling in missing values.
Let’s say the “students” database has an “age” column with some missing values. One way to deal with missing data is to set a default age:
SELECT name, COALESCE(age, 15) AS default_age
FROM students;
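A minimal SQLite sketch of the COALESCE default, with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", 14), ("Ada", None)])

# Missing ages fall back to the default of 15
rows = conn.execute(
    "SELECT name, COALESCE(age, 15) AS default_age FROM students"
).fetchall()
print(rows)  # [('Finn', 14), ('Ada', 15)]
```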
2. Grouping data
SQL offers various functions for filtering and aggregating data so that it can be organized according to given criteria.
The GROUP BY clause is used along with SQL aggregate functions to group similar rows and apply calculations to get the required results. Some common aggregate functions include SUM, AVG, COUNT, MIN, and MAX.
-- to calculate the sum of values for each group
SELECT gender, SUM(score) AS total_score
FROM students
GROUP BY gender;
-- to calculate the average of values for each group
SELECT gender, AVG(age) AS average_age
FROM students
GROUP BY gender;
-- to calculate the total count for each group
SELECT age, COUNT(*) AS student_count
FROM students
GROUP BY age;
-- to find the maximum value in a column
SELECT MAX(percentage) AS max_percentage
FROM students;
-- to find the minimum value in a column
SELECT MIN(percentage) AS min_percentage
FROM students;
The HAVING clause in SQL allows the aggregated data to be filtered after grouping.
HAVING is used along with aggregate functions to filter data based on a condition.
SELECT age, AVG(score) AS avg_score
FROM students
GROUP BY age
HAVING AVG(score) > 70;
This query will return the average score for each age group, keeping only the groups whose average score is above 70.
It could be unclear when to use the HAVING and WHERE clauses. A simple rule: WHERE filters individual rows before grouping, while HAVING filters groups after the aggregate functions have been applied.
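The GROUP BY and HAVING behavior above can be confirmed with SQLite; the ages and scores below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("Finn", 15, 80), ("Ada", 15, 90),  # age 15 averages 85
    ("Max", 16, 60), ("Eva", 16, 50),   # age 16 averages 55
])

# Only groups with an average score above 70 survive the HAVING filter
rows = conn.execute("""
    SELECT age, AVG(score) AS avg_score
    FROM students
    GROUP BY age
    HAVING AVG(score) > 70
""").fetchall()
print(rows)  # [(15, 85.0)]
```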
The SQL UNION clause combines results from multiple queries whose result sets are compatible: the same number of columns, in the same order, with similar data types.
While UNION ALL permits duplicate records, the UNION clause only returns unique records.
Let’s say there are two tables in a database, student_p and student_q, with the same set of columns and different records, and you want to combine distinct rows from both of the tables:
SELECT name, age, score FROM student_p
UNION
SELECT name, age, score FROM student_q;
As a result, the unique records are preserved, and duplicates are removed from both tables.
To include all rows:
SELECT name, age, score FROM student_p
UNION ALL
SELECT name, age, score FROM student_q;
As a result, every record, including duplicates from the two tables, will be retrieved.
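The difference between UNION and UNION ALL is easy to see in a small SQLite sketch; both tables and their rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("student_p", "student_q"):
    conn.execute(f"CREATE TABLE {table} (name TEXT, age INTEGER, score INTEGER)")
conn.executemany("INSERT INTO student_p VALUES (?, ?, ?)",
                 [("Finn", 15, 80), ("Ada", 15, 90)])
conn.executemany("INSERT INTO student_q VALUES (?, ?, ?)",
                 [("Ada", 15, 90), ("Max", 16, 60)])  # Ada is duplicated

# UNION deduplicates; UNION ALL keeps every row
distinct = conn.execute(
    "SELECT name, age, score FROM student_p "
    "UNION SELECT name, age, score FROM student_q"
).fetchall()
everything = conn.execute(
    "SELECT name, age, score FROM student_p "
    "UNION ALL SELECT name, age, score FROM student_q"
).fetchall()
print(len(distinct), len(everything))  # 3 4
```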
3. Handling Outliers
Outliers are records or extreme data points that differ significantly from the other values in a column or lie far from the average. These records can skew your analysis, so it is better to handle them beforehand.
Statistical measures such as the mean and standard deviation of a column can help identify outliers. The SQL UPDATE statement can then be used to cap the outlier values.
Let us say there are 100 rows in the “students” table, and only 3–4 of those records have a student’s score higher than 95. In this scenario:
UPDATE students
SET score = 95
WHERE score > 95;
A log transformation can also even out a skewed distribution of values, reducing the influence of extreme data points:
SELECT student_id, LOG(score) AS log_score
FROM students;
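Both outlier treatments can be sketched in SQLite; note that SQLite only ships a built-in log() from version 3.35, so this sketch registers one from Python (the scores are illustrative):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(1, 40), (2, 99), (3, 70)])

# Cap outlier scores at 95, as in the UPDATE above
conn.execute("UPDATE students SET score = 95 WHERE score > 95")
rows = conn.execute("SELECT score FROM students ORDER BY student_id").fetchall()
print(rows)  # [(40,), (95,), (70,)]

# Register a base-10 LOG so the log-transform query runs on any SQLite version
conn.create_function("LOG", 1, math.log10)
logs = conn.execute(
    "SELECT student_id, LOG(score) AS log_score FROM students"
).fetchall()
```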
4. Handling Time Series Data
Time series data includes TIMESTAMP, DATE, and DATETIME values, which are used to keep track of time. Formatting time-based data is crucial, as it enables insightful analysis of data over time. Time-based indexing also improves query performance and aids query optimization for huge datasets.
To obtain time-based summary statistics such as daily, weekly, hourly, or annual results, SQL aggregate functions such as SUM, AVG, and COUNT are used.
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS total_sales
FROM orders  -- assuming an "orders" table holds the sales records
GROUP BY month
ORDER BY month;
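DATE_TRUNC is PostgreSQL syntax; in SQLite the same monthly roll-up can be sketched with strftime (the orders table and dates below are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2023-01-05", 100.0), ("2023-01-20", 50.0), ("2023-02-02", 75.0),
])

# strftime('%Y-%m', ...) plays the role of DATE_TRUNC('month', ...)
rows = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS total_sales
    FROM orders
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [('2023-01', 150.0), ('2023-02', 75.0)]
```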
The WHERE clause is used to filter data based on specific time periods, for example, calculating total sales per year.
SELECT SUM(amount) AS total_sales
FROM sales  -- assuming a "sales" table
WHERE EXTRACT(YEAR FROM sale_date) = 2023;
LAG and LEAD functions are used to access data before and after a specific timestamp, respectively. This is often used to identify trends and patterns in data.
SELECT date, price,
       LAG(price) OVER (ORDER BY date) AS previous_price,
       LEAD(price) OVER (ORDER BY date) AS next_price
FROM stock_prices;  -- assuming a "stock_prices" table
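LAG and LEAD can be tried out in SQLite (window functions are available from version 3.25); the price table below is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (date TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", [
    ("2023-01-01", 10.0), ("2023-01-02", 12.0), ("2023-01-03", 11.0),
])

# Each row sees its neighbors; the first LAG and last LEAD are NULL
rows = conn.execute("""
    SELECT date, price,
           LAG(price)  OVER (ORDER BY date) AS previous_price,
           LEAD(price) OVER (ORDER BY date) AS next_price
    FROM prices
""").fetchall()
print(rows)
# [('2023-01-01', 10.0, None, 12.0),
#  ('2023-01-02', 12.0, 10.0, 11.0),
#  ('2023-01-03', 11.0, 12.0, None)]
```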
In this blog, we’ve covered some valuable data wrangling techniques, including SQL functions for data transformation, aggregation, and handling outliers. These skills are essential for anyone working with data, ensuring the data is cleaned and ready for meaningful analysis.