String Manipulation Functions¶
We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.
Case Conversion -
lower
,upper
,initcap
Getting size of the column value -
length
Extracting Data -
substr
andsplit_part
Trimming and Padding functions -
trim
,rtrim
,ltrim
,rpad
andlpad
Reversing strings -
reverse
Concatenating multiple strings
concat
andconcat_ws
Case Conversion and Length¶
Let us understand how to perform case conversion of a string and also get length of a string.
Case Conversion Functions -
lower
,upper
,initcap
Getting length -
length
Let us see how to use these functions on top of the table. We will use orders table which was loaded as part of last section.
order_status for all the orders is in upper case and we will convert every thing to lower case.
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | COMPLETE |
4068 | 2013-08-17 00:00:00 | 12293 | PENDING |
5881 | 2013-08-30 00:00:00 | 3715 | CLOSED |
7564 | 2013-09-09 00:00:00 | 8648 | CLOSED |
8766 | 2013-09-18 00:00:00 | 855 | COMPLETE |
8926 | 2013-09-19 00:00:00 | 10517 | ON_HOLD |
9290 | 2013-09-21 00:00:00 | 11879 | COMPLETE |
9793 | 2013-09-24 00:00:00 | 9809 | COMPLETE |
9816 | 2013-09-24 00:00:00 | 1753 | COMPLETE |
14047 | 2013-10-20 00:00:00 | 6473 | CLOSED |
order_id | order_date | order_customer_id | order_status | order_status_length |
---|---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | complete | 8 |
4068 | 2013-08-17 00:00:00 | 12293 | pending | 7 |
5881 | 2013-08-30 00:00:00 | 3715 | closed | 6 |
7564 | 2013-09-09 00:00:00 | 8648 | closed | 6 |
8766 | 2013-09-18 00:00:00 | 855 | complete | 8 |
8926 | 2013-09-19 00:00:00 | 10517 | on_hold | 7 |
9290 | 2013-09-21 00:00:00 | 11879 | complete | 8 |
9793 | 2013-09-24 00:00:00 | 9809 | complete | 8 |
9816 | 2013-09-24 00:00:00 | 1753 | complete | 8 |
14047 | 2013-10-20 00:00:00 | 6473 | closed | 6 |
Extracting Data - substr and split_part¶
Let us understand how to extract data from strings using substr
/substring
as well as split_part
.
We can extract sub string from main string using
substr
orsubstring
position and length.For example, get first 4 characters from date to get year or get last 4 characters from fixed length unique id.
substring
have broader options (regular expression) and also can be used with different styles (using keywords such asFROM
,FOR
).Unlike in other relational databases, we cannot pass negative integers to
substr
orsubstring
to get the information from right. We need to use functions likeright
instead.
Note
We can also use combination of substring
and length
like below to get last 4 digits or characters from a string.
Note
Getting first 3 characters or digits as well as last 4 characters or digits using substring
. However, this works only when the strings are of fixed length.
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-284' UNION
SELECT '876-99-585' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '48'
) SELECT unique_id,
substring(unique_id FROM 1 FOR 3) AS unique_id_first3,
substring(unique_id FROM '....$') AS unique_id_last4
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_first3 | unique_id_last4 |
---|---|---|
241-80-7115 | 241 | 7115 |
399-88-3617 | 399 | 3617 |
48 | 48 | None |
586-92-5361 | 586 | 5361 |
694-30-6851 | 694 | 6851 |
733-17-4217 | 733 | 4217 |
831-59-5593 | 831 | 5593 |
873-68-9778 | 873 | 9778 |
876-99-585 | 876 | -585 |
884-65-284 | 884 | -284 |
Let us see how we can extract date part from order_date of orders.
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 00:00:00 | 10118 | COMPLETE |
4068 | 2013-08-17 00:00:00 | 12293 | PENDING |
5881 | 2013-08-30 00:00:00 | 3715 | CLOSED |
7564 | 2013-09-09 00:00:00 | 8648 | CLOSED |
8766 | 2013-09-18 00:00:00 | 855 | COMPLETE |
8926 | 2013-09-19 00:00:00 | 10517 | ON_HOLD |
9290 | 2013-09-21 00:00:00 | 11879 | COMPLETE |
9793 | 2013-09-24 00:00:00 | 9809 | COMPLETE |
9816 | 2013-09-24 00:00:00 | 1753 | COMPLETE |
14047 | 2013-10-20 00:00:00 | 6473 | CLOSED |
order_id | order_date | order_customer_id | order_status |
---|---|---|---|
1021 | 2013-07-30 | 10118 | COMPLETE |
4068 | 2013-08-17 | 12293 | PENDING |
5881 | 2013-08-30 | 3715 | CLOSED |
7564 | 2013-09-09 | 8648 | CLOSED |
8766 | 2013-09-18 | 855 | COMPLETE |
8926 | 2013-09-19 | 10517 | ON_HOLD |
9290 | 2013-09-21 | 11879 | COMPLETE |
9793 | 2013-09-24 | 9809 | COMPLETE |
9816 | 2013-09-24 | 1753 | COMPLETE |
14047 | 2013-10-20 | 6473 | CLOSED |
Let us understand how to extract the information from the string where there is a delimiter.
split_part
can be used to split a string using delimiter and extract the information.If there is no data in a given position after splitting, it will be represented as empty string ‘’.
%%sql
WITH addresses AS (
SELECT '593 Fair Oaks Pass, Frankfort, Kentucky, 40618' AS address UNION
SELECT ', Vancouver, Washington, 98687' UNION
SELECT '83047 Glacier Hill Circle, Sacramento, California, 94237' UNION
SELECT '935 Columbus Junction, Cincinnati, Ohio, 45213' UNION
SELECT '03010 Nevada Crossing, El Paso, Texas, 88579' UNION
SELECT '9 Dunning Circle, , Arizona, 85271' UNION
SELECT '96 Fair Oaks Way, Decatur, Illinois, 62525' UNION
SELECT '999 Caliangt Avenue, Greenville, South Carolina, 29615' UNION
SELECT '2 Saint Paul Trail, Bridgeport, , 06673' UNION
SELECT '3 Reindahl Center, Ogden, Utah'
) SELECT split_part(address, ', ', 1) street,
split_part(address, ', ', 2) city,
split_part(address, ', ', 3) state,
split_part(address, ', ', 4) postal_code
FROM addresses
ORDER BY postal_code
street | city | state | postal_code |
---|---|---|---|
3 Reindahl Center | Ogden | Utah | |
2 Saint Paul Trail | Bridgeport | 06673 | |
999 Caliangt Avenue | Greenville | South Carolina | 29615 |
593 Fair Oaks Pass | Frankfort | Kentucky | 40618 |
935 Columbus Junction | Cincinnati | Ohio | 45213 |
96 Fair Oaks Way | Decatur | Illinois | 62525 |
9 Dunning Circle | Arizona | 85271 | |
03010 Nevada Crossing | El Paso | Texas | 88579 |
83047 Glacier Hill Circle | Sacramento | California | 94237 |
Vancouver | Washington | 98687 |
%%sql
WITH addresses AS (
SELECT '593 Fair Oaks Pass, Frankfort, Kentucky, 40618' AS address UNION
SELECT ', Vancouver, Washington, 98687' UNION
SELECT '83047 Glacier Hill Circle, Sacramento, California, 94237' UNION
SELECT '935 Columbus Junction, Cincinnati, Ohio, 45213' UNION
SELECT '03010 Nevada Crossing, El Paso, Texas, 88579' UNION
SELECT '9 Dunning Circle, , Arizona, 85271' UNION
SELECT '96 Fair Oaks Way, Decatur, Illinois, 62525' UNION
SELECT '999 Caliangt Avenue, Greenville, South Carolina, 29615' UNION
SELECT '2 Saint Paul Trail, Bridgeport, , 06673' UNION
SELECT '3 Reindahl Center, Ogden, Utah'
) SELECT split_part(address, ', ', 1) street,
split_part(address, ', ', 2) city,
split_part(address, ', ', 3) state,
split_part(address, ', ', 4) postal_code
FROM addresses
WHERE split_part(address, ', ', 1) = ''
ORDER BY postal_code
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-284' UNION
SELECT '876-99-585' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '480-69-032'
) SELECT unique_id,
substring(unique_id FROM 1 FOR 3) AS unique_id_first3,
substring(unique_id FROM '....$') AS unique_id_last4,
CASE WHEN length(split_part(unique_id, '-', 3)) = 4
THEN split_part(unique_id, '-', 3)
ELSE 'Invalid'
END AS unique_id_last
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_first3 | unique_id_last4 | unique_id_last |
---|---|---|---|
241-80-7115 | 241 | 7115 | 7115 |
399-88-3617 | 399 | 3617 | 3617 |
480-69-032 | 480 | -032 | Invalid |
586-92-5361 | 586 | 5361 | 5361 |
694-30-6851 | 694 | 6851 | 6851 |
733-17-4217 | 733 | 4217 | 4217 |
831-59-5593 | 831 | 5593 | 5593 |
873-68-9778 | 873 | 9778 | 9778 |
876-99-585 | 876 | -585 | Invalid |
884-65-284 | 884 | -284 | Invalid |
Using position or strpos¶
At times we might want to get the position of a substring in a main string. For example, we might want to check whether email ids have @ in them. We can use functions such as position
or strpos
.
%%sql
WITH email_ids AS (
SELECT 'bsellan0@yellowbook.com' AS email_id UNION
SELECT 'rstelljes1@illinois.edu' UNION
SELECT 'mmalarkey2@webeden.co.uk' UNION
SELECT 'emussared3@redcross.org' UNION
SELECT 'livashin4@bloglovin.com' UNION
SELECT 'gkeach5@cbc.ca' UNION
SELECT 'emasham6@xing.com' UNION
SELECT 'rcobbald7@house.gov' UNION
SELECT 'rdrohan8@washingtonpost.com' UNION
SELECT 'aebben9@arstechnica.com'
)
* postgresql://itversity_retail_user:***@localhost:5432/itversity_retail_db
(psycopg2.errors.SyntaxError) syntax error at end of input
LINE 12: )
^
[SQL: WITH email_ids AS (
SELECT 'bsellan0@yellowbook.com' AS email_id UNION
SELECT 'rstelljes1@illinois.edu' UNION
SELECT 'mmalarkey2@webeden.co.uk' UNION
SELECT 'emussared3@redcross.org' UNION
SELECT 'livashin4@bloglovin.com' UNION
SELECT 'gkeach5@cbc.ca' UNION
SELECT 'emasham6@xing.com' UNION
SELECT 'rcobbald7@house.gov' UNION
SELECT 'rdrohan8@washingtonpost.com' UNION
SELECT 'aebben9@arstechnica.com'
)]
(Background on this error at: http://sqlalche.me/e/13/f405)
Trimming and Padding Functions¶
Let us understand how to trim or remove leading and/or trailing spaces in a string.
ltrim
is used to remove the spaces on the left side of the string.rtrim
is used to remove the spaces on the right side of the string.trim
is used to remove the spaces on both sides of the string.
Let us understand how to use padding to pad characters to a string.
Let us assume that there are 3 fields - year, month and date which are of type integer.
If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.
lpad
is used more often thanrpad
especially when we try to build the date from separate columns.
Reverse and Concatenating multiple strings¶
Let us understand how to reverse a string as well as concatenate multiple strings.
We can use
reverse
to reverse a string.We can concatenate multiple strings using
concat
andconcat_ws
.concat_ws
is typically used if we want to have the same string between all the strings that are being concatenated.
result |
---|
Order Status is COMPLETE |
Order Status is PENDING |
Order Status is CLOSED |
Order Status is CLOSED |
Order Status is COMPLETE |
Order Status is ON_HOLD |
Order Status is COMPLETE |
Order Status is COMPLETE |
Order Status is COMPLETE |
Order Status is CLOSED |
String Replacement¶
Let us go through the details related to string replacement.
replace
can be used to replace a sub string with in a string with another string.overlay
can be used to replace a sub string with in a string by position with another string.translate
can be used to replace individual characters with other characters.
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30-6851' UNION
SELECT '586-92-5361' UNION
SELECT '884-65-2844' UNION
SELECT '876-99-5856' UNION
SELECT '831-59-5593' UNION
SELECT '399-88-3617' UNION
SELECT '733-17-4217' UNION
SELECT '873-68-9778' UNION
SELECT '487-21-9802'
) SELECT unique_id,
replace(unique_id, '-', ' ') AS unique_id_replaced,
translate(unique_id, '-', ' ') AS unique_id_translated,
overlay(unique_id PLACING ' ' FROM 4 FOR 1) AS unique_id_overlaid
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_replaced | unique_id_translated | unique_id_overlaid |
---|---|---|---|
241-80-7115 | 241 80 7115 | 241 80 7115 | 241 80-7115 |
399-88-3617 | 399 88 3617 | 399 88 3617 | 399 88-3617 |
487-21-9802 | 487 21 9802 | 487 21 9802 | 487 21-9802 |
586-92-5361 | 586 92 5361 | 586 92 5361 | 586 92-5361 |
694-30-6851 | 694 30 6851 | 694 30 6851 | 694 30-6851 |
733-17-4217 | 733 17 4217 | 733 17 4217 | 733 17-4217 |
831-59-5593 | 831 59 5593 | 831 59 5593 | 831 59-5593 |
873-68-9778 | 873 68 9778 | 873 68 9778 | 873 68-9778 |
876-99-5856 | 876 99 5856 | 876 99 5856 | 876 99-5856 |
884-65-2844 | 884 65 2844 | 884 65 2844 | 884 65-2844 |
%%sql
WITH unique_ids AS (
SELECT '241-80-7115' AS unique_id UNION
SELECT '694-30:6851' UNION
SELECT '586-92-5361' UNION
SELECT '884:65-2844' UNION
SELECT '876/99-5856' UNION
SELECT '831-59:5593' UNION
SELECT '399-88-3617' UNION
SELECT '733:17-4217' UNION
SELECT '873:68-9778' UNION
SELECT '487-21/9802'
) SELECT unique_id,
replace(replace(unique_id, '-', ' '), ':', ' ') AS unique_id_replaced,
translate(unique_id, '-:/', ' ') AS unique_id_translated,
overlay(overlay(unique_id PLACING ' ' FROM 4 FOR 1) PLACING ' ' FROM 7 FOR 1) AS unique_id_overlaid
FROM unique_ids
ORDER BY unique_id
unique_id | unique_id_replaced | unique_id_translated | unique_id_overlaid |
---|---|---|---|
241-80-7115 | 241 80 7115 | 241 80 7115 | 241 80 7115 |
399-88-3617 | 399 88 3617 | 399 88 3617 | 399 88 3617 |
487-21/9802 | 487 21/9802 | 487 21 9802 | 487 21 9802 |
586-92-5361 | 586 92 5361 | 586 92 5361 | 586 92 5361 |
694-30:6851 | 694 30 6851 | 694 30 6851 | 694 30 6851 |
733:17-4217 | 733 17 4217 | 733 17 4217 | 733 17 4217 |
831-59:5593 | 831 59 5593 | 831 59 5593 | 831 59 5593 |
873:68-9778 | 873 68 9778 | 873 68 9778 | 873 68 9778 |
876/99-5856 | 876/99 5856 | 876 99 5856 | 876 99 5856 |
884:65-2844 | 884 65 2844 | 884 65 2844 | 884 65 2844 |
Note
In case of translate
, if we do not have characters for replacement, then those will be replaced with empty string. For example, translate('+86 (238) 954-9649', '+() -', '0')
will result in 0862389549649.
%%sql
WITH phone_numbers AS (
SELECT '+86 (238) 954-9649' AS phone_number UNION
SELECT '+420 (331) 900-5807' UNION
SELECT '+1 (320) 484-4495' UNION
SELECT '+45 (238) 961-9801' UNION
SELECT '+51 (123) 545-6543' UNION
SELECT '+63 (308) 354-2560' UNION
SELECT '+86 (433) 851-1260' UNION
SELECT '+63 (332) 705-0319' UNION
SELECT '+351 (147) 359-3767' UNION
SELECT '+57 (714) 557-0468'
) SELECT phone_number,
translate(phone_number, '+() -', '') phone_number_int
FROM phone_numbers
ORDER BY phone_number
phone_number | phone_number_int |
---|---|
+1 (320) 484-4495 | 13204844495 |
+351 (147) 359-3767 | 3511473593767 |
+420 (331) 900-5807 | 4203319005807 |
+45 (238) 961-9801 | 452389619801 |
+51 (123) 545-6543 | 511235456543 |
+57 (714) 557-0468 | 577145570468 |
+63 (308) 354-2560 | 633083542560 |
+63 (332) 705-0319 | 633327050319 |
+86 (238) 954-9649 | 862389549649 |
+86 (433) 851-1260 | 864338511260 |