Every data analysis project requires a set of data transformation steps before the analysis itself can begin. These pre-processing steps help data scientists and analysts ensure a cleaner, less biased analysis of the data.
In this tutorial, you will learn some advanced data wrangling methods using SQL to make the most of your datasets.
What is data wrangling?
Data wrangling is the process of data transformation while preserving the quality of the dataset. It involves cleaning, combining, sorting, grouping, and filtering various datasets, as well as dealing with outliers and missing data.
But why would you want to perform data wrangling?
It helps reduce inconsistencies in the data, makes it less error-prone, and reduces the need for manual entry of missing values through various data imputation techniques.
Clean and structured data makes the data analysis process easier.
Data Wrangling with SQL
Data wrangling can be performed with Microsoft Excel, SQL, Python libraries like pandas and NumPy, or ETL tools like Apache Spark and Apache Airflow. SQL is often preferred for data wrangling for a few reasons, some of which are:
- Large databases may be analyzed with the use of SQL queries, which can efficiently clean and aggregate data. This makes it ideal for relational databases containing structured data for improved data quality.
- Most firms use SQL extensively for data analysis and management.
- Because of its great scalability, SQL enables pre-processing of huge datasets in organizations.
- SQL is standardized across multiple databases, which ensures consistency in datasets.
SQL Basics: Queries for Data Wrangling
The SELECT statement is used to return specific columns from a table:
-- return all columns
SELECT * FROM student_db;
-- return specific columns
SELECT name, age FROM student_db;
The WHERE clause is used to filter rows based on one or more conditions:
SELECT * FROM student_db
WHERE score >= 30;
The given query will retrieve only those rows where the score is greater than or equal to 30.
The WHERE clause is often used to filter null values in a dataset:
-- select rows where last_name is NULL
SELECT * FROM student_db WHERE last_name IS NULL;
-- select rows where score is 80
SELECT * FROM student_db WHERE score = 80;
-- select rows where name is 'Finn'
SELECT * FROM student_db WHERE name = 'Finn';
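These filtering queries can be tried out end to end using SQLite from Python; the table layout and sample rows below are made up for illustration:

```python
import sqlite3

# In-memory database with an illustrative student_db table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (name TEXT, last_name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO student_db VALUES (?, ?, ?)",
    [("Finn", None, 80), ("Ada", "Lovelace", 25), ("Max", "Planck", 47)],
)

# Rows where last_name is NULL
missing = conn.execute(
    "SELECT name FROM student_db WHERE last_name IS NULL"
).fetchall()
print(missing)  # [('Finn',)]

# Rows where score >= 30
passing = conn.execute(
    "SELECT name FROM student_db WHERE score >= 30"
).fetchall()
print(passing)  # [('Finn',), ('Max',)]
```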
SQL Queries for Data Cleaning
1. Deleting data
To delete unwanted rows from a table:
DELETE FROM student_db
WHERE score < 15;
The above query will remove student records if their score is less than 15.
2. Updating data
To update database records, you can use the UPDATE statement:
UPDATE student_db
SET score = 79
WHERE student_id = 4322;
The above query will update the “score” of the student in the student_db table to 79 for the specified ID.
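The DELETE and UPDATE statements above can be exercised together in a small SQLite sketch; the sample IDs and scores are illustrative:

```python
import sqlite3

# Illustrative student_db table with an id column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO student_db VALUES (?, ?)",
                 [(4322, 55), (4323, 12), (4324, 90)])

# Update one record, then remove rows with a score below 15
conn.execute("UPDATE student_db SET score = 79 WHERE student_id = 4322")
conn.execute("DELETE FROM student_db WHERE score < 15")

rows = conn.execute(
    "SELECT student_id, score FROM student_db ORDER BY student_id"
).fetchall()
print(rows)  # [(4322, 79), (4324, 90)]
```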
3. Removing duplicate rows
To eliminate duplicate rows from a table, a self-join is used to identify the duplicates. A DELETE statement can then remove the redundant rows under certain conditions:
DELETE s1
FROM student_db s1
INNER JOIN student_db s2
ON s1.name = s2.name
AND (s1.score, s1.percentage) < (s2.score, s2.percentage);
The above query (written in MySQL's multi-table DELETE syntax) will delete the duplicate records and keep the row with the higher score and percentage for each student.
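SQLite has no multi-table DELETE, so an equivalent sketch keeps each student's best row using ROW_NUMBER() (available in SQLite 3.25+); the table and rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_db (name TEXT, score INTEGER, percentage REAL)")
conn.executemany("INSERT INTO student_db VALUES (?, ?, ?)", [
    ("Finn", 80, 88.0),
    ("Finn", 65, 70.0),  # duplicate with a lower score: should be removed
    ("Ada", 90, 95.0),
])

# Keep only the best (score, percentage) row per name
conn.execute("""
    DELETE FROM student_db
    WHERE rowid NOT IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY name
                       ORDER BY score DESC, percentage DESC
                   ) AS rn
            FROM student_db
        )
        WHERE rn = 1
    )
""")
rows = conn.execute("SELECT name, score FROM student_db ORDER BY name").fetchall()
print(rows)  # [('Ada', 90), ('Finn', 80)]
```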
4. Renaming data
Sometimes the original data source may have given complicated names. To make the columns easier to comprehend, you can rename them as follows:
ALTER TABLE employees
RENAME COLUMN month_end_employee_salary TO salary;
This will rename the column name from “month_end_employee_salary” to “salary”.
5. Filtering data
To filter data, a set of conditions can be applied to tables:
SELECT * FROM employees
WHERE salary > 40000;
This will simply return all the rows where the “salary” value is greater than 40000.
Advanced Data Wrangling with SQL
After going over the fundamentals, let us move ahead to some more advanced data wrangling techniques using SQL:
1. Data Transformation
Data obtained originally from a data source is often unprepared and messy. Data transformation involves some processes like modifying and converting data types, standardizing and normalizing data, and handling categorical data to make it fit for insightful data analysis.
Conversion of Data Types
Consider a “students” table that has a “score” column storing numerical values as text (VARCHAR). In order to perform numeric analysis, you will need an integer type. SQL offers the CONVERT and CAST functions for data type conversion:
-- Using the CONVERT function (SQL Server)
SELECT name, CONVERT(INT, score) AS num_score
FROM students;
-- Using the CAST function
SELECT name, CAST(score AS INT) AS num_score
FROM students;
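A quick way to see the cast in action is with SQLite, where text scores (a common artifact of CSV imports) can be converted on the fly; the data below is illustrative:

```python
import sqlite3

# Scores stored as text, as often happens after a CSV import
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", "80"), ("Ada", "95")])

rows = conn.execute(
    "SELECT name, CAST(score AS INTEGER) AS num_score FROM students"
).fetchall()
print(rows)  # [('Finn', 80), ('Ada', 95)]

# The cast values now support arithmetic
avg = conn.execute("SELECT AVG(CAST(score AS INTEGER)) FROM students").fetchone()[0]
print(avg)  # 87.5
```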
SQL CASE Statements
SQL provides CASE statements that aid in logic building. These come in particularly handy when you want to retrieve and filter particular results based on certain conditions.
CASE statements are also used to handle NULL values in data. This contributes to the consistency and reliability of the data.
Some example use cases are as follows:
Handling NULL values
Consider a table called “students,” where the “score” column contains several NULL values. You can replace the NULL values with a default score of 0.
SELECT name,
       CASE WHEN score IS NULL THEN 0
            ELSE score
       END AS new_score
FROM students;
Handling categorical values
CASE statements are often used to handle categorical data.
Suppose the “students” table has another column “gender”, but with variations like “MALE”, “male”, “m”, “FEMALE”, “F”. This can make the data messy.
These values can be standardized as follows:
SELECT name,
       CASE
           WHEN UPPER(gender) = 'M'
             OR UPPER(gender) = 'MALE' THEN 'Male'
           WHEN UPPER(gender) = 'F'
             OR UPPER(gender) = 'FEMALE' THEN 'Female'
       END AS new_gender
FROM students;
Alternatively, SQL CASE can be used to encode categorical variables as binary values:
SELECT name,
       CASE
           WHEN gender = 'Male' THEN 1
           WHEN gender = 'Female' THEN 0
       END AS gender_group
FROM students;
Handling conditional statements
Suppose you want to categorize student performance from the “score” column as ‘High,’ ‘Medium,’ or ‘Low’, based on their scores:
SELECT name, score,
       CASE
           WHEN score >= 70 THEN 'High'
           WHEN score >= 40 THEN 'Medium'
           ELSE 'Low'
       END AS score_category
FROM students;
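Both CASE patterns, replacing NULLs and bucketing scores, can be verified in one SQLite query; the sample students are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", 80), ("Ada", 55), ("Max", None)])

# Replace NULL scores with 0 and bucket scores into categories
rows = conn.execute("""
    SELECT name,
           CASE WHEN score IS NULL THEN 0 ELSE score END AS new_score,
           CASE
               WHEN score >= 70 THEN 'High'
               WHEN score >= 40 THEN 'Medium'
               ELSE 'Low'
           END AS score_category
    FROM students
""").fetchall()
print(rows)  # [('Finn', 80, 'High'), ('Ada', 55, 'Medium'), ('Max', 0, 'Low')]
```

Note that a NULL score falls through every WHEN comparison (NULL compares as unknown) and lands in the ELSE branch.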
CASE and COALESCE
COALESCE is a function that returns the first non-null value from its list of arguments; it is a compact alternative to a CASE expression for filling in missing values.
Let’s say the “students” database has an “age” column with some missing values. One way to deal with missing data is to set a default age:
SELECT name, COALESCE(age, 15) AS default_age
FROM students;
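A minimal SQLite sketch of the COALESCE default, with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [("Finn", 14), ("Ada", None)])

# Missing ages fall back to the default of 15
rows = conn.execute(
    "SELECT name, COALESCE(age, 15) AS default_age FROM students"
).fetchall()
print(rows)  # [('Finn', 14), ('Ada', 15)]
```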
2. Grouping data
SQL offers various functions for filtering and aggregating data so that it can be organized according to given criteria.
The GROUP BY clause is used along with SQL aggregate functions to group similar rows and apply calculations to get the required results. Some common aggregate functions include SUM, AVG, COUNT, MIN, and MAX.
-- to calculate the sum of values for each group
SELECT gender, SUM(score) AS total_score
FROM students
GROUP BY gender;
-- to calculate the average of values for each group
SELECT gender, AVG(age) AS average_age
FROM students
GROUP BY gender;
-- to calculate the total count for each group
SELECT age, COUNT(*) AS student_count
FROM students
GROUP BY age;
-- to find the maximum value in a column
SELECT MAX(percentage) AS max_percentage
FROM students;
-- to find the minimum value in a column
SELECT MIN(percentage) AS min_percentage
FROM students;
The HAVING clause in SQL allows the aggregated data to be filtered after grouping.
HAVING is used along with aggregate functions to filter data based on a condition.
SELECT age, AVG(score) AS avg_score
FROM students
GROUP BY age
HAVING AVG(score) > 70;
This query will return the average score for each age group, keeping only the groups whose average score is above 70.
It could be unclear when to use the HAVING and WHERE clauses. A simple rule: WHERE filters individual rows before grouping, while HAVING filters groups after the aggregate functions have been applied.
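The GROUP BY and HAVING behavior above can be confirmed with SQLite; the ages and scores below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    ("Finn", 15, 80), ("Ada", 15, 90),  # age 15 averages 85
    ("Max", 16, 60), ("Eva", 16, 50),   # age 16 averages 55
])

# Only groups with an average score above 70 survive the HAVING filter
rows = conn.execute("""
    SELECT age, AVG(score) AS avg_score
    FROM students
    GROUP BY age
    HAVING AVG(score) > 70
""").fetchall()
print(rows)  # [(15, 85.0)]
```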
The SQL UNION clause combines results from multiple queries whose result sets are compatible: the same number of columns, in the same order, with similar data types.
While UNION ALL permits duplicate records, the UNION clause only returns unique records.
Let’s say there are two tables in a database, student_p and student_q, with the same set of columns and different records, and you want to combine distinct rows from both of the tables:
SELECT name, age, score FROM student_p
UNION
SELECT name, age, score FROM student_q;
As a result, the unique records are preserved, and duplicates are removed from both tables.
To include all rows:
SELECT name, age, score FROM student_p
UNION ALL
SELECT name, age, score FROM student_q;
As a result, every record, including duplicates from the two tables, will be retrieved.
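The difference between UNION and UNION ALL is easy to see in a small SQLite sketch; both tables and their rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("student_p", "student_q"):
    conn.execute(f"CREATE TABLE {table} (name TEXT, age INTEGER, score INTEGER)")
conn.executemany("INSERT INTO student_p VALUES (?, ?, ?)",
                 [("Finn", 15, 80), ("Ada", 15, 90)])
conn.executemany("INSERT INTO student_q VALUES (?, ?, ?)",
                 [("Ada", 15, 90), ("Max", 16, 60)])  # Ada is duplicated

# UNION deduplicates; UNION ALL keeps every row
distinct = conn.execute(
    "SELECT name, age, score FROM student_p "
    "UNION SELECT name, age, score FROM student_q"
).fetchall()
everything = conn.execute(
    "SELECT name, age, score FROM student_p "
    "UNION ALL SELECT name, age, score FROM student_q"
).fetchall()
print(len(distinct), len(everything))  # 3 4
```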
3. Handling Outliers
Outliers are records or extreme data points that differ significantly from the other values in a column or lie far from the average. These records can skew your analysis, so it is better to handle them beforehand.
Statistical measures such as the mean and standard deviation of a column can help identify outliers. The SQL UPDATE statement can then be used to cap the outlier values.
Let us say there are 100 rows in the “students” table, and only 3–4 of those records have a student’s score higher than 95. In this scenario:
UPDATE students
SET score = 95
WHERE score > 95;
A log transformation can also even out a skewed distribution of values, reducing the influence of extreme data points:
SELECT student_id, LOG(score) AS log_score
FROM students;
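Both outlier treatments can be sketched in SQLite; note that SQLite only ships a built-in log() from version 3.35, so this sketch registers one from Python (the scores are illustrative):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(1, 40), (2, 99), (3, 70)])

# Cap outlier scores at 95, as in the UPDATE above
conn.execute("UPDATE students SET score = 95 WHERE score > 95")
rows = conn.execute("SELECT score FROM students ORDER BY student_id").fetchall()
print(rows)  # [(40,), (95,), (70,)]

# Register a base-10 LOG so the log-transform query runs on any SQLite version
conn.create_function("LOG", 1, math.log10)
logs = conn.execute(
    "SELECT student_id, LOG(score) AS log_score FROM students"
).fetchall()
```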
4. Handling Time Series Data
Time series data includes TIMESTAMP, DATE, and DATETIME values, which are used to keep track of time. Formatting time-based data is crucial, as it enables insightful analysis of data over time. Time-based indexing also improves query performance and aids query optimization for huge datasets.
To obtain time-based summary statistics such as daily, weekly, hourly, or annual results, SQL aggregate functions such as SUM, AVG, and COUNT are used.
SELECT DATE_TRUNC('month', order_date) AS month, SUM(amount) AS total_sales
FROM orders  -- assuming an "orders" table holds the sales records
GROUP BY month
ORDER BY month;
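DATE_TRUNC is PostgreSQL syntax; in SQLite the same monthly roll-up can be sketched with strftime (the orders table and dates below are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("2023-01-05", 100.0), ("2023-01-20", 50.0), ("2023-02-02", 75.0),
])

# strftime('%Y-%m', ...) plays the role of DATE_TRUNC('month', ...)
rows = conn.execute("""
    SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) AS total_sales
    FROM orders
    GROUP BY month
    ORDER BY month
""").fetchall()
print(rows)  # [('2023-01', 150.0), ('2023-02', 75.0)]
```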
The WHERE clause is used to filter data based on specific time periods, for example, calculating total sales per year.
SELECT SUM(amount) AS total_sales
FROM sales  -- assuming a "sales" table
WHERE EXTRACT(YEAR FROM sale_date) = 2023;
LAG and LEAD functions are used to access data before and after a specific timestamp, respectively. This is often used to identify trends and patterns in data.
SELECT date, price,
       LAG(price) OVER (ORDER BY date) AS previous_price,
       LEAD(price) OVER (ORDER BY date) AS next_price
FROM stock_prices;  -- assuming a "stock_prices" table
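LAG and LEAD can be tried out in SQLite (window functions are available from version 3.25); the price table below is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (date TEXT, price REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", [
    ("2023-01-01", 10.0), ("2023-01-02", 12.0), ("2023-01-03", 11.0),
])

# Each row sees its neighbors; the first LAG and last LEAD are NULL
rows = conn.execute("""
    SELECT date, price,
           LAG(price)  OVER (ORDER BY date) AS previous_price,
           LEAD(price) OVER (ORDER BY date) AS next_price
    FROM prices
""").fetchall()
print(rows)
# [('2023-01-01', 10.0, None, 12.0),
#  ('2023-01-02', 12.0, 10.0, 11.0),
#  ('2023-01-03', 11.0, 12.0, None)]
```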
In this blog, we’ve covered some valuable data wrangling techniques, including SQL functions for data transformation, aggregation, and handling outliers. These skills are essential for anyone working with data, ensuring the data is cleaned and ready for meaningful analysis.