Aggregate Functions (AVG, MAX/MIN, COUNT, SUM)
Above is a wine dataset found on StrataScratch. Suppose we want to write a query to find out what the most expensive wine from each country. We can utilize window function to do this.
The SQL query is looking for the maximum price from each country. Using only MAX(price) alone wouldn’t work, because you need to use GROUP BY to group the. For this query to execute, you have to group them by: id, country, points, price. Using “GROUP BY id, country, points, price” doesn’t give you any real insight because it would just copy the price column over.
Here is the query result:
Another great example:
This query added three new columns: running_total, overall and running%. Let’s break them down…
SUM(length) OVER(ORDER BY film_id) AS running_total
Here, we don’t use “PARTITION BY” because we are not grouping into any sub-categories as the previous example. We use ORDER BY instead, to get a running_total of film_id.
SUM(length) OVER() AS overall
This query is similar to the first query, except there is no clause inside of OVER(). When the clause is empty, it would fill up every row with the total from SUM(length).
SUM(length) OVER(ORDER BY film_id) * 100 / SUM(length) OVER() AS running_percentage
This query combines the first two queries, dividing running total by overall giving us the running percentage.
ROW_NUMBER, RANK, DENSE_RANK
ROW_NUMBER() OVER(ORDER BY length DESC) AS row_num
This query generates an integer sequence from 1 for each row. However, movies with the same length are assigned unique number. In my opinion, this query is not very useful because it assigns random number to identical lengths. Let’s assume, we want to rank them based on the movie genre to get better insights.
Adding PARTITION BY C.name inside the OVER clause tells SQL engine to split them by their name, and then assign a row_number based on their length. As you can see, movies with length of 185 are randomly ranked as well. Let’s assume we want to write a query where movies with the same lengths get the same number. We use DENSE_RANK(), instead of ROW_NUMBER() to achieve this.
DENSE_RANK Vs RANK
RANK and DENSE_RANK functions are similar functions. If we were to apply RANK function, instead of DENSE_RANK(), we would get the following:
As you can see here, RANK function counts the “tie” lengths as one ranking, and it skips the next rank count.
If we want to make a histogram, we can utilize the window function to prepare our data.
Here, we create 100 buckets and we ordered ALL of the movie by length.
We can also create decile and quartile data. Queries above are partitioned by rating. NTILE are pretty straight forward queries.
LAG and LEAD
Lag and Lead are used to analyze time-related data. LAG gets data from previous rows, while LEAD fetches data from the following rows.
LAG(revenue, 1) OVER (ORDER BY date) prev_day_sales
The last two columns of the first row are empty because it’s the first available date. We specified the previous day of revenue with (revenue,1). If we change it to two, we are comparing to current data to the one from two days ago.
LEAD is similar to LAG. Instead of the previous day, it takes the next day data.