Wednesday, February 27, 2008

SQL Tips & Techniques- String Aggregation

In my experience, application or database developers will have to deal with writing SQL queries more than a DBA (specially a System DBA). However, many a times DBAs are expected to help with writing complex queries. Keeping that in mind, I have decided to post SQL tips and techniques as I encounter them.

Recently, I had a requirement of performing string aggregation by concatenating them while grouping based on another column. For example, we have a table myTab1 with CATEGORY and NAME column values as shown below.

Listing 1

SELECT category, name

FROM myTab1

ORDER BY category


CATEGORY NAME

----------- ----------

1 abc

1 bcd

2 cde

2 def

2 efg

3 fgh

3 ghi

7 record(s) selected.

The expected result is to have one CATEGORY per row with all the NAME values for a CATEGORY concatenated with a comma and a space in between. So the result should look as shown below.

Listing 2

CATEGORY LIST

----------- ----------

1 abc, bcd

2 cde, def, efg

3 fgh, ghi

This could have been so simple if there was an aggregate function (similar to SUM, AVG, COUNT) to perform this task. However, there is no built-in function for it. So either you can write a function of your own or you have to use recursive SQL. You may refer the article “User-defined Aggregate Functions in DB2 Universal Database”, if you want to write a function for it. Here, I will show you a query that uses recursive SQL

Listing 3

WITH

t1(rowNum, category, name) AS

( SELECT rownumber() over(partition by category),

category, name

FROM myTab1 ),

t2(category, list, cnt) AS

( SELECT category, VARCHAR(name, 50), 1

FROM t1

WHERE rowNum = 1

UNION ALL

SELECT t2.category, t2.list || ', ' || t1.name,

t2.cnt + 1

FROM t2, t1

WHERE t2.category = t1.category AND

t2.cnt + 1 = t1.rowNum )

SELECT category, list

FROM t2

WHERE ( category, cnt ) IN (

SELECT category,

MAX(rowNum)

FROM t1

GROUP BY category )

ORDER BY 1


CATEGORY LIST

----------- -------------------------------------------------

SQL0347W The recursive common table expression "RKUMAR.T2" may contain an infinite loop. SQLSTATE=01605

1 abc, bcd

2 cde, def, efg

3 fgh, ghi

3 record(s) selected with 1 warning messages printed.

Notice that in the above query, the definition of table expression t2 refers itself and hence we call such queries recursive. If you first run the query used in the definition of table expression t1, then run the entire query with t1 and t2 but without the last WHERE clause, you can easily follow logic involved here.

DB2 version 9.5 also supports the CONNECT BY syntax of Oracle for recursive queries. You may check the article DB2 Viper 2 compatibility features for details on CONNECT BY syntax. We may write the query using CONNECT BY as shown below.

Listing 4

SELECT category, list

FROM

(SELECT category, level,

MAX(level) OVER(PARTITION BY category) as maxlevel,

VARCHAR(SUBSTR(SYS_CONNECT_BY_PATH(name,', '), 3),50)

as list

FROM

(SELECT category,

name,

ROW_NUMBER() OVER (PARTITION BY category

ORDER BY name) AS curr,

ROW_NUMBER() OVER (PARTITION BY category

ORDER BY name) -1 AS prev

FROM myTab1)

CONNECT BY prev = PRIOR curr and category = PRIOR category)

WHERE level = maxlevel

ORDER BY 1


CATEGORY LIST

----------- -------------------------------------------------

1 abc, bcd

2 cde, def, efg

3 fgh, ghi

3 record(s) selected.


In addition to CONNECT BY, the above query also uses the function MAX as OLAP aggregation function. You may check DB2 SQL Reference manual or the article “OLAP Functions in DB2” to learn about OLAP functions.

5 comments:

Anonymous said...

Radhesh,

This is a great way transforming a technique into something it was not exactly meant to be used for.

Did you know Serge Rielau did string concatenation using XMLAGG in his "SQL on Fire Part 1"? I am eager to see the performance results for both...

Frederik

Radhesh Kumar said...
This comment has been removed by the author.
Radhesh Kumar said...

Frederik,
Thanks for your comment and pointing to XMLAGG function. More heads produce better solution in such scenario.

I had not looked at Serge's "SQL on Fire Part 1" until now. Even though we do not have a built-in function just for string aggregation, the XMLAGG function is pretty close to that. Not being familiar with XML functions, I overlooked the possibility of exploring XML functions outside XML world.

In my quick test, the query using XMLAGG showed the lowest estimated cost. However, the difference in performance would not be noticeable if you are querying a small table.

Thanks,
Radhesh

Eldho Mathew said...

Hello

Most of the posts are really amazing..!

Anonymous said...

Hello and thanks for your script.

Just one thing :
I think the script need something more to be sure about the lenght of the VARCHAR list, especially on a long list :

WITH
t1(rowNum, category, name) AS
( SELECT rownumber() over(partition by category),
category, name
FROM myTab1 ),
t2(category, list, cnt) AS
( SELECT category, VARCHAR(name, 1500), 1
FROM t1
WHERE rowNum = 1
UNION ALL
SELECT t2.category, VARCHAR(t2.list || ', ' || t1.name, 1500),
t2.cnt + 1
FROM t2, t1
WHERE t2.category = t1.category AND
t2.cnt + 1 = t1.rowNum )

SELECT category, list
FROM t2
WHERE ( category, cnt ) IN (
SELECT category,
MAX(rowNum)
FROM t1
GROUP BY category )
ORDER BY 1

Matthieu