Merging or Upserting DataΒΆ
At times we need to merge or upsert the data (update existing records and insert new records)
One of the way to achieve merge or upsert is to develop 2 statements - one to update and other to insert.
The queries in both the statements (update and insert) should return mutually exclusive results.
Even though the statements can be executed in any order, updating first and then inserting perform better in most of the cases (as update have to deal with lesser number of records with this approach)
We can also take care of merge or upsert using
INSERT
withON CONFLICT (columns) DO UPDATE
.Postgres does not have either
MERGE
orUPSERT
as part of the SQL syntax.
%load_ext sql
The sql extension is already loaded. To reload it, use:
%reload_ext sql
%env DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db
env: DATABASE_URL=postgresql://itversity_retail_user:retail_password@localhost:5432/itversity_retail_db
%sql DROP TABLE IF EXISTS customer_order_metrics_dly
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
%%sql
CREATE TABLE customer_order_metrics_dly (
customer_id INT,
order_date DATE,
order_count INT,
order_revenue FLOAT
)
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
%%sql
ALTER TABLE customer_order_metrics_dly
ADD PRIMARY KEY (customer_id, order_date)
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
Note
Let us go through the 2 statement approach. Here we are inserting data for the month of August 2013.
%%sql
INSERT INTO customer_order_metrics_dly
SELECT o.order_customer_id,
o.order_date,
count(1) order_count,
NULL
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_date BETWEEN '2013-08-01' AND '2013-08-31'
GROUP BY o.order_customer_id,
o.order_date
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
4708 rows affected.
[]
Note
Now we want to merge data into the table using 2013 August to 2013 October. As we are using 2 statement approach, first we should update and then we should insert
%%sql
UPDATE customer_order_metrics_dly comd
SET
(order_count, order_revenue) = (
SELECT count(1),
round(sum(oi.order_item_subtotal)::numeric, 2)
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_date BETWEEN '2013-08-01' AND '2013-10-31'
AND o.order_customer_id = comd.customer_id
AND o.order_date = comd.order_date
GROUP BY o.order_customer_id,
o.order_date
)
WHERE comd.order_date BETWEEN '2013-08-01' AND '2013-10-31'
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
4708 rows affected.
[]
%%sql
SELECT * FROM customer_order_metrics_dly
ORDER BY order_date, customer_id
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
10 rows affected.
customer_id | order_date | order_count | order_revenue |
---|---|---|---|
34 | 2013-08-01 | 4 | 789.92 |
109 | 2013-08-01 | 3 | 799.9 |
174 | 2013-08-01 | 5 | 654.89 |
267 | 2013-08-01 | 4 | 559.97 |
478 | 2013-08-01 | 5 | 729.9 |
553 | 2013-08-01 | 2 | 399.9 |
692 | 2013-08-01 | 2 | 479.92 |
696 | 2013-08-01 | 2 | 649.88 |
800 | 2013-08-01 | 5 | 609.95 |
835 | 2013-08-01 | 5 | 589.9 |
%%sql
SELECT to_char(order_date, 'yyyy-MM'), count(1) FROM customer_order_metrics_dly
GROUP BY to_char(order_date, 'yyyy-MM')
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
1 rows affected.
to_char | count |
---|---|
2013-08 | 4708 |
%%sql
INSERT INTO customer_order_metrics_dly
SELECT o.order_customer_id AS customer_id,
o.order_date,
count(1) order_count,
round(sum(order_item_subtotal)::numeric, 2)
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_date BETWEEN '2013-08-01' AND '2013-10-31'
AND NOT EXISTS (
SELECT 1 FROM customer_order_metrics_dly codm
WHERE o.order_customer_id = codm.customer_id
AND o.order_date = codm.order_date
)
GROUP BY o.order_customer_id,
o.order_date
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
9265 rows affected.
[]
%%sql
SELECT * FROM customer_order_metrics_dly
WHERE order_date::varchar ~ '2013-09'
ORDER BY order_date, customer_id
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
10 rows affected.
customer_id | order_date | order_count | order_revenue |
---|---|---|---|
19 | 2013-09-01 | 5 | 839.92 |
95 | 2013-09-01 | 5 | 969.85 |
136 | 2013-09-01 | 4 | 639.94 |
247 | 2013-09-01 | 2 | 639.94 |
383 | 2013-09-01 | 5 | 729.9 |
437 | 2013-09-01 | 4 | 829.97 |
543 | 2013-09-01 | 4 | 1489.83 |
601 | 2013-09-01 | 2 | 159.99 |
689 | 2013-09-01 | 2 | 419.96 |
842 | 2013-09-01 | 4 | 954.87 |
%%sql
SELECT to_char(order_date, 'yyyy-MM'), count(1) FROM customer_order_metrics_dly
GROUP BY to_char(order_date, 'yyyy-MM')
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
3 rows affected.
to_char | count |
---|---|
2013-08 | 4708 |
2013-10 | 4417 |
2013-09 | 4848 |
Note
Let us see how we can upsert or merge the data using INSERT
with ON CONFLICT (columns) DO UPDATE
. We will first insert data for the month of August 2013 and then upsert or merge for the months of August 2013 to October 2013.
%sql TRUNCATE TABLE customer_order_metrics_dly
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
%%sql
INSERT INTO customer_order_metrics_dly
SELECT o.order_customer_id,
o.order_date,
count(1) order_count,
NULL
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_date BETWEEN '2013-08-01' AND '2013-08-31'
GROUP BY o.order_customer_id,
o.order_date
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
4708 rows affected.
[]
Note
We need to have unique or primary key constraint on the columns specified as part of ON CONFLICT
clause.
%%sql
ALTER TABLE customer_order_metrics_dly DROP CONSTRAINT customer_order_metrics_dly_pkey
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
%%sql
ALTER TABLE customer_order_metrics_dly
ADD PRIMARY KEY (customer_id, order_date)
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
Done.
[]
%%sql
INSERT INTO customer_order_metrics_dly
SELECT o.order_customer_id,
o.order_date,
count(1) order_count,
round(sum(order_item_subtotal)::numeric, 2) AS order_revenue
FROM orders o
JOIN order_items oi
ON o.order_id = oi.order_item_order_id
WHERE o.order_date BETWEEN '2013-08-01' AND '2013-10-31'
GROUP BY o.order_customer_id,
o.order_date
ON CONFLICT (customer_id, order_date) DO UPDATE SET
order_count = EXCLUDED.order_count,
order_revenue = EXCLUDED.order_revenue
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
13973 rows affected.
[]
%%sql
SELECT * FROM customer_order_metrics_dly
WHERE order_date::varchar ~ '2013-09'
ORDER BY order_date, customer_id
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
10 rows affected.
customer_id | order_date | order_count | order_revenue |
---|---|---|---|
19 | 2013-09-01 | 5 | 839.92 |
95 | 2013-09-01 | 5 | 969.85 |
136 | 2013-09-01 | 4 | 639.94 |
247 | 2013-09-01 | 2 | 639.94 |
383 | 2013-09-01 | 5 | 729.9 |
437 | 2013-09-01 | 4 | 829.97 |
543 | 2013-09-01 | 4 | 1489.83 |
601 | 2013-09-01 | 2 | 159.99 |
689 | 2013-09-01 | 2 | 419.96 |
842 | 2013-09-01 | 4 | 954.87 |
%%sql
SELECT to_char(order_date, 'yyyy-MM'), count(1) FROM customer_order_metrics_dly
GROUP BY to_char(order_date, 'yyyy-MM')
LIMIT 10
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
3 rows affected.
to_char | count |
---|---|
2013-08 | 4708 |
2013-10 | 4417 |
2013-09 | 4848 |