Introduction
SQL, or Structured Query Language, is used for storing and manipulating information in relational databases. SQL provides us with a number of commands to perform various operations on a database. Window functions are one such command that allows advanced queries to return more streamlined results. In this tutorial, you will get a brief overview of some of the most common SQL window functions that make your work easier as a developer.
What are Window Functions in SQL?
To understand window functions better, we will first learn about databases.
A SQL database is a relational database, which means the data points are linked across one or more tables. A relational database typically contains multiple tables, each with its own unique set of rows and columns holding information in different formats, such as string, integer, date, and time values.
We can retrieve such data using simple SQL queries.
Window Functions
Window functions are used to perform complex calculations across a set of rows in a defined window. This window is created using an OVER() clause in SQL, which returns a single value for each row in the query.
You might assume they function similarly to aggregate functions in SQL, but window functions preserve the individuality (identity) of rows, unlike aggregate functions.
Window functions can use aggregate functions using the OVER() clause to compute a running calculation for each row.
Based on the type of calculations, window functions can be categorized into three main types:
- Aggregate window functions
- Ranking window functions
- Value window functions
Why do we use window functions in SQL?
Although window functions require a sheer amount of practice to get used to, they are quite useful for dealing with huge databases in SQL by executing queries within a specified window frame. This reduces the loading time and makes SQL queries efficient.
With window functions, you can perform tasks like finding the running averages of a filtered query, which would otherwise require complex operations like self-joins and subqueries.
Another advantage of using window functions over traditional SQL functions is that it formats longer queries and enhances readability.
Types of SQL Window Functions
Before moving forward, let us begin by creating a basic table to understand window functions better. Open a DBMS (database management) tool of your choice and run this simple query to create a Students table. For this tutorial, I’ll be using MySQL Workbench:
CREATE TABLE Students ( student_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(100) NOT NULL, — Combine first and last name city VARCHAR(50), country VARCHAR(50), marks DECIMAL(5,2) ); INSERT INTO Students (name, city, country, marks) VALUES (‘Alice Miller, ‘New York‘, ‘USA‘, 95.0), (‘Charlie Brown‘, ‘Chicago‘, ‘USA‘, 98.0), (‘Bart Simpson‘, ‘El Paso‘, ‘USA‘, 89.0), (‘Mulan’, ‘Charlotte’, ‘USA‘, 94.0), — add more of your choice |
Return the Students table:
SELECT * FROM Students |
Output:
Consider adding some additional data for a comprehensive analysis of your queries.
Aggregate Functions
Aggregate functions are most frequently used to perform basic calculations, such as computing averages, sums, minimum and maximum values.
COUNT()
In standard SQL, aggregate functions are generally used with GROUP BY statements.
When we use window functions, we can replace the GROUP BY statement with the OVER clause. For example:
— using GROUP BY statement SELECT city, COUNT(*) AS total_students FROM Students GROUP BY city; — using OVER clause SELECT city, COUNT(*) OVER (PARTITION BY city) AS total_students FROM Students; |
Both queries return the total number of students in each city. However, the GROUP BY statement first groups the data by city and then computes the required query within each group.
On the other hand, the OVER clause creates a partition by city within the specified window, and calculations are then performed across this partition.
AVG()
— with GROUP BY statement SELECT s1.name, s1.marks, AVG(s2.marks) AS moving_average FROM Students s1 INNER JOIN Students s2 ON s1.name >= s2.name GROUP BY s1.name, s1.marks ORDER BY s1.name; — with OVER clause SELECT name, marks, AVG(marks) OVER (ORDER BY name) AS moving_average FROM Students; |
Both queries calculate the moving average of marks for each student in the Students table. But as you can see, the OVER clause improves the readability of the code. In contrast, self-joins become less efficient to use, particularly with larger databases.
SUM()
Similarly, the SUM() function calculates the total of a specified column in a database.
SELECT name, marks, SUM(marks) OVER (ORDER BY name) AS running_total FROM Students; |
MIN() & MAX()
The MIN() and MAX() functions are used to calculate the minimum and maximum values of a column, respectively:
— with GROUP BY() SELECT city, MIN(marks) AS lowest_marks, MAX(marks) AS highest_marks FROM Students GROUP BY city; |
This query returns the lowest and highest marks scored by students in each city.
— with OVER() SELECT name, marks, MIN(marks) OVER (ORDER BY name ROWS BETWEEN PRECEDING 2 FOLLOWING 0) AS previous_min, MAX(marks) OVER (ORDER BY name ROWS BETWEEN PRECEDING 2 FOLLOWING 0) AS previous_max FROM Students; |
This query produces the same output, but it creates an output window including the current student and the two previous students. This can help us find the previous minimum and maximum marks as well.
As you can see, window functions can help you analyze student performances by comparing marks with those of their immediate competitors.
Ranking Functions
ROW_NUMBER()
When a window is defined within the OVER() clause, ROW_NUMBER() assigns a distinct sequential number to each row:
SELECT * FROM ( SELECT name, marks, ROW_NUMBER() OVER (ORDER BY name) AS row_num FROM Students ) AS ranked_students WHERE row_num <= 5; |
This query will assign a unique number to each student, ordered by their names in ascending order.
RANK()
SELECT name, marks, RANK() OVER (ORDER BY marks DESC) AS ‘rank’ FROM ( SELECT name, marks FROM Students ) AS ranked_students; |
This query will return the top 5 students who scored the highest marks from the Students table.
Rows with the same value will be assigned the same rank, and subsequent rows will get the next rank with gaps in the ordering.
DENSE_RANK()
SELECT name, marks, DENSE_RANK() OVER (ORDER BY marks DESC) AS ‘dense_rank’ FROM Students WHERE ‘dense_rank’ <= 5; |
Similar to the previous query, this query will also return the top 5 students scoring the highest marks, but using the DENSE_RANK() function.
DENSE_RANK() will assign the same rank to identical values, but keep in mind that unlike RANK(), it does not include gaps in between. This means that the next row will be assigned the consecutive ranking in order.
PERCENT_RANK()
PERCENT_RANK() returns 0 to 1 percentile ranks for a row within a specified window.
SELECT name, marks, PERCENT_RANK() OVER (ORDER BY marks DESC) AS percentile_rank FROM Students; |
The above result indicates that Mulan scored higher than 50% of the class. It can also be interpreted that Mulan scored lower than 51% of the class. Similarly, Bart scored higher than 75% but lower than 76% of the students.
NTILE()
The NTILE() function allows you to divide the records of an output table into approximately equal tiles or groups. NTILE(num) takes a num integer as input, where we specify the desired number of groups.
This output table is the result of the ordered partitioning of a specific table in a database.
For example, if our output table contains 20 records and we use NTILE(5), it will divide the records into 5 groups, with each group containing 4 records.
SELECT name, marks, NTILE(3) OVER (ORDER BY marks DESC) AS student_performance FROM Students; |
This query divides the students into 3 groups based on their performance, where a student_performance of 1 corresponds to the top performers with the highest marks.
Value Functions
LAG()
The LAG() function is used to retrieve a value from a previous row within a partition, i.e., the succeeding value that comes before the current row in the ordered set of rows.
SELECT name, city, marks, LAG(name) OVER (PARTITION BY city ORDER BY marks) AS prev_name, LAG(marks) OVER (PARTITION BY city ORDER BY marks) AS prev_marks FROM Students; |
This SQL query returns the student who has the second highest marks and scores lower than the current student within the same city.
LEAD()
The LEAD() function is used to retrieve a value from a subsequent row within a partition, i.e., the preceding value or next row in the ordered set of rows.
SELECT name, city, marks, LEAD(name) OVER (PARTITION BY city ORDER BY marks) AS next_name, LEAD(marks) OVER (PARTITION BY city ORDER BY marks) AS next_marks FROM Students; |
This query finds the student with the next highest marks who scores higher than the current student within the same city.
FIRST_VALUE() & LAST_VALUE()
The FIRST_VALUE() function retrieves the first row within an ordered window frame, while the LAST_VALUE() function retrieves the last row within an ordered window frame.
SELECT name, city, marks, FIRST_VALUE(marks) OVER (ORDER BY marks DESC) AS highest_marks, LAST_VALUE(marks) OVER (ORDER BY marks ASC) AS lowest_marks FROM Students; |
This query will return highest and lowest marks for each row. But if you think about it, we can achieve the same result using the MIN() and MAX() functions:
SELECT name, city, marks, MIN(marks) OVER () AS lowest_marks, MAX(marks) OVER () AS highest_marks FROM Students; |
This is efficient and faster compared to the above query. Depending on your use case and the size of the dataset, it is important to understand whether using SQL window functions would be beneficial or not.
Conclusion
SQL window functions offer an efficient way of dealing with huge databases to make the queries more readable, faster and promote clean code.
While window functions are a great choice when writing complex SQL queries, we should always consider factors like database size, working environment, and specific use cases when choosing the optimal approach to writing queries.
For instance, we can use simple queries like MIN() and MAX() to perform simple calculations in SQL queries.
It takes an ample amount of practice to master window functions in SQL, but once you get your command on it, it can greatly improve your productivity and save time when working with databases.
Add comment