Analytic Functions – Aggregations

Let us see how we can perform aggregations with in a partition or group using Windowing/Analytics Functions.

  • For simple aggregations where we have to get grouping key and aggregated results we can use GROUP BY.

  • If we want to get the raw data along with aggregated results, then using GROUP BY is not possible or overly complicated.

  • Using aggregate functions with OVER Clause not only simplifies the process of writing query, but also better with respect to performance.

  • Let us take an example of getting employee salary percentage when compared to department salary expense.

Warning

If you are using Jupyter based environment make sure to restart the kernel, as the session might have been already connected with retail database.

%load_ext sql
%env DATABASE_URL=postgresql://itversity_hr_user:hr_password@localhost:5432/itversity_hr_db
env: DATABASE_URL=postgresql://itversity_hr_user:hr_password@localhost:5432/itversity_hr_db
%%sql

SELECT employee_id, department_id, salary 
FROM employees 
ORDER BY department_id, salary
LIMIT 10
10 rows affected.
employee_id department_id salary
200 10 4400.00
202 20 6000.00
201 20 13000.00
119 30 2500.00
118 30 2600.00
117 30 2800.00
116 30 2900.00
115 30 3100.00
114 30 11000.00
203 40 6500.00

Note

Let us write the query using GROUP BY approach.

%%sql

SELECT department_id,
    sum(salary) AS department_salary_expense
FROM employees
GROUP BY department_id
ORDER BY department_id
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
12 rows affected.
department_id department_salary_expense
10 4400.00
20 19000.00
30 24900.00
40 6500.00
50 156400.00
60 28800.00
70 10000.00
80 304500.00
90 58000.00
100 51600.00
110 20300.00
None 7000.00
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    ae.department_salary_expense,
    ae.avg_salary_expense
FROM employees e JOIN (
    SELECT department_id, 
        sum(salary) AS department_salary_expense,
        round(avg(salary)::numeric, 2) AS avg_salary_expense
    FROM employees
    GROUP BY department_id
) ae
ON e.department_id = ae.department_id
ORDER BY department_id, salary
LIMIT 10
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
10 rows affected.
employee_id department_id salary department_salary_expense avg_salary_expense
200 10 4400.00 4400.00 4400.00
202 20 6000.00 19000.00 9500.00
201 20 13000.00 19000.00 9500.00
119 30 2500.00 24900.00 4150.00
118 30 2600.00 24900.00 4150.00
117 30 2800.00 24900.00 4150.00
116 30 2900.00 24900.00 4150.00
115 30 3100.00 24900.00 4150.00
114 30 11000.00 24900.00 4150.00
203 40 6500.00 6500.00 6500.00
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    ae.department_salary_expense,
    ae.avg_salary_expense,
    round(e.salary/ae.department_salary_expense * 100, 2) pct_salary
FROM employees e JOIN (
    SELECT department_id, 
        sum(salary) AS department_salary_expense,
        round(avg(salary)::numeric, 2) AS avg_salary_expense
    FROM employees
    GROUP BY department_id
) ae
ON e.department_id = ae.department_id
ORDER BY department_id, salary
LIMIT 10
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
10 rows affected.
employee_id department_id salary department_salary_expense avg_salary_expense pct_salary
200 10 4400.00 4400.00 4400.00 100.00
202 20 6000.00 19000.00 9500.00 31.58
201 20 13000.00 19000.00 9500.00 68.42
119 30 2500.00 24900.00 4150.00 10.04
118 30 2600.00 24900.00 4150.00 10.44
117 30 2800.00 24900.00 4150.00 11.24
116 30 2900.00 24900.00 4150.00 11.65
115 30 3100.00 24900.00 4150.00 12.45
114 30 11000.00 24900.00 4150.00 44.18
203 40 6500.00 6500.00 6500.00 100.00

Note

Let us see how we can get it using Analytics/Windowing Functions.

  • We can use all standard aggregate functions such as count, sum, min, max, avg etc.

%%sql

SELECT e.employee_id, e.department_id, e.salary,
    sum(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS department_salary_expense
FROM employees e
ORDER BY e.department_id
LIMIT 10
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
10 rows affected.
employee_id department_id salary department_salary_expense
200 10 4400.00 4400.00
201 20 13000.00 19000.00
202 20 6000.00 19000.00
114 30 11000.00 24900.00
115 30 3100.00 24900.00
116 30 2900.00 24900.00
117 30 2800.00 24900.00
118 30 2600.00 24900.00
119 30 2500.00 24900.00
203 40 6500.00 6500.00
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    sum(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS department_salary_expense,
    round(e.salary / sum(e.salary) OVER (
        PARTITION BY e.department_id
    ) * 100, 2) AS pct_salary
FROM employees e
ORDER BY e.department_id,
    e.salary
LIMIT 10
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
10 rows affected.
employee_id department_id salary department_salary_expense pct_salary
200 10 4400.00 4400.00 100.00
202 20 6000.00 19000.00 31.58
201 20 13000.00 19000.00 68.42
119 30 2500.00 24900.00 10.04
118 30 2600.00 24900.00 10.44
117 30 2800.00 24900.00 11.24
116 30 2900.00 24900.00 11.65
115 30 3100.00 24900.00 12.45
114 30 11000.00 24900.00 44.18
203 40 6500.00 6500.00 100.00
%%sql

SELECT e.employee_id, e.department_id, e.salary,
    sum(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS sum_sal_expense,
    round(avg(e.salary) OVER (
        PARTITION BY e.department_id
    ), 2) AS avg_sal_expense,
    min(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS min_sal_expense,
    max(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS max_sal_expense,
    count(e.salary) OVER (
        PARTITION BY e.department_id
    ) AS cnt_sal_expense
FROM employees e
ORDER BY e.department_id,
    e.salary
LIMIT 10
 * postgresql://itversity_hr_user:***@localhost:5432/itversity_hr_db
10 rows affected.
employee_id department_id salary sum_sal_expense avg_sal_expense min_sal_expense max_sal_expense cnt_sal_expense
200 10 4400.00 4400.00 4400.00 4400.00 4400.00 1
202 20 6000.00 19000.00 9500.00 6000.00 13000.00 2
201 20 13000.00 19000.00 9500.00 6000.00 13000.00 2
119 30 2500.00 24900.00 4150.00 2500.00 11000.00 6
118 30 2600.00 24900.00 4150.00 2500.00 11000.00 6
117 30 2800.00 24900.00 4150.00 2500.00 11000.00 6
116 30 2900.00 24900.00 4150.00 2500.00 11000.00 6
115 30 3100.00 24900.00 4150.00 2500.00 11000.00 6
114 30 11000.00 24900.00 4150.00 2500.00 11000.00 6
203 40 6500.00 6500.00 6500.00 6500.00 6500.00 1

Warning

If you are using Jupyter based environment make sure to restart the kernel, as the session might have been already connected with hr database.

%load_ext sql
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db
env: DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db
%%sql

SELECT
    order_date,
    order_item_product_id,
    revenue,
    sum(revenue) OVER (PARTITION BY order_date) AS sum_revenue,
    min(revenue) OVER (PARTITION BY order_date) AS min_revenue,
    max(revenue) OVER (PARTITION BY order_date) AS max_revenue
FROM daily_product_revenue
ORDER BY order_date,
    revenue DESC
LIMIT 10
10 rows affected.
order_date order_item_product_id revenue sum_revenue min_revenue max_revenue
2013-07-25 00:00:00 1004 5599.72 31547.23 49.98 5599.72
2013-07-25 00:00:00 191 5099.49 31547.23 49.98 5599.72
2013-07-25 00:00:00 957 4499.70 31547.23 49.98 5599.72
2013-07-25 00:00:00 365 3359.44 31547.23 49.98 5599.72
2013-07-25 00:00:00 1073 2999.85 31547.23 49.98 5599.72
2013-07-25 00:00:00 1014 2798.88 31547.23 49.98 5599.72
2013-07-25 00:00:00 403 1949.85 31547.23 49.98 5599.72
2013-07-25 00:00:00 502 1650.00 31547.23 49.98 5599.72
2013-07-25 00:00:00 627 1079.73 31547.23 49.98 5599.72
2013-07-25 00:00:00 226 599.99 31547.23 49.98 5599.72