Saturday, June 20, 2009

SQL Tips & Techniques- String Splitting

In my previous post SQL Tips & Techniques- String Aggregation, I discussed how to perform string aggregation. Recently, I received a question to do the reverse using SQL i.e. String Splitting.

Listing 1


For example, we have a table myTab2 with column values as shown above. As you can see, the string value in LIST column is a list of sub-strings separated with “, “. Now, we would like to split the string in LIST column and present the sub-strings in separate rows. Hence, the query should return two rows for CATEGORY = 1, three rows for CATEGORY = 2 and two rows for CATEGORY = 3.

Here is the query that performs the required string splitting.

Listing 2


The above query uses recursive SQL and it will return the following.

Listing 3


I learned the above use of recursive SQL from Knut Stolze’s developerWorks article Parsing Strings in SQL. If you know a different way to do the string splitting using SQL, please share it.

Saturday, April 11, 2009

DB2 Authentication & Password Encryption Method

Few days ago, I received an e-mail from my colleague with following error message

I verified the password of db2inst1 on OS level and the password “db2test” was correct. I verified the authentication related configuration parameters and everything looked okay. The authentication type was set to SERVER. The connection was working fine if we only run “db2 connect to test” and do not specify user and using keywords. The version of DB2 was 8.1 and the OS was SUSE Linux Enterprise Server 10.

You might be thinking how come I am still dealing with DB2 version 8.1. Well, as I have mentioned in my earlier posts, the company I work for (Stratavia Corporation) has software product for data center automation called
Data Palette. Data Palette still supports DB2 V8.1 and hence we have to test it on DB2 V8.1 also. Going back to the original problem, I approached our system administrator to figure out what is so unique about db2inst1 password that is causing the above authentication error. We found that the password encryption method used for db2inst1 was different than the other V8.1 instances where we don’t have the above mentioned authentication problem. On this problematic system, the password encryption method was “blowfish”. We changed it to “md5” as shown in the portion of “/etc/default/passwd” below. On SUSE Linux, “/etc/default/passwd” has the configuration parameter to control the password encryption method.


After changing the password encryption method, we set the password of db2inst1 and it was encrypted using md5 this time. On Linux systems, you can check the encrypted password in /etc/shadow file (you have to be root). The length of blowfish encrypted password is almost double than md5 encrypted password. Also, I think md5 encrypted password starts with $1 and blowfish encrypted password starts with $2.

After the above change, the authentication problem was gone. So it appears DB2 V8.1 (Fix pack 0) does not support authentication with password that is encrypted using blowfish encryption. Out of curiosity, I upgraded the same instance to V8.2 to test if blowfish password encryption will work or not and with V8.2 blowfish encryption worked fine. Hence, we should keep in mind that DB2 authentication (while using OS authentication) has dependency on password encryption method used by OS. The user password may work fine on the OS level, but DB2 authentication may still fail if DB2 does not support the password encryption method used by OS.

Monday, February 16, 2009

DB2 and EC2

Though irrelevant, there is a similarity of alphabetical pattern between DB2 and EC2. E comes just after D and C comes just after B in the alphabet. May be this similarity aroused my interest in EC2... J Also, IBM and Amazon have partnered together to deliver DB2 in Amazon EC2 (Elastic Compute Cloud) environment.

Cloud computing, the biggest buzz word in current date, is no more perceived as mere hype. It has already started helping small startup companies. There are new businesses emerging on top of cloud computing infrastructure. Gradually more and more companies will start embracing the cloud computing.

Though I have been hearing about cloud computing for over a year, I never bothered to think about how similar or different it would be managing DB2 databases hosted in cloud computing environment. When I read the IBM’s announcement (actually followed the link from Anant Jhingran’s blog post), it crossed my mind that very soon we DBA’s may have to deal with the DB2 instances hosted in the cloud computing environment. Suddenly, several questions started arising in my mind. Will it be accessible through web browser only or we can still use our favorite SSH client or Remote Desktop (in case of Windows)? How will be the database physical design? Can we also access these virtual servers through VPN and so on...?

Out of curiosity, I thought of signing up for the Amazon’s service and get a feel of EC2. Looking at their pricing, it did not appear to cost much just to get a feel of it. On a second thought, I decided to go through their documentation first. The documentation is concise and to the point. It helped to clear the clouds over my understanding of Elastic Compute Cloud.

Here is a very high level summary of what I learned about EC2 from their documentation.
  1. The virtual servers in EC2 environment are basically the running instances of AMI (Amazon Machine Image). AMI contains all the software, including OS and associated configuration settings, applications, libraries etc.
  2. There are already many existing AMI (free as well as paid). Also, new AMI can be created from scratch or based on an existing AMI. However, new AMI has to be saved on Amazon S3 (Simple Storage Service) and there is a charge for this storage.
  3. There are different types of instances based on CPU, Memory, Storage and I/O requirements and accordingly Amazon has different pricing for each type of instance.
  4. AMI instances are not persistent. You lose all the changes when you shutdown your OS or terminate the AMI instance.
  5. For all the online data that requires persistence (for example, a DB2 database), they have a different solution. It’s called Elastic Block Store (EBS). EBS is storage volume that can be attached to the host (running instances of AMI) and it can either be used as raw device or a file system can be created on it.
  6. Snapshots of EBS can be created on the Amazon S3 (Simple Storage Service). EBS snapshot can be used to launch new EBS volumes using the previous snapshot as the starting point of those new volumes.
  7. Static IP addresses can also be assigned to instances. They are called Elastic IP address.
  8. We can access the servers hosted in EC2 environment using SSH client or Remote Desktop.

I like the ease of provisioning, server downsizing and upsizing available in EC2 environment. Also, it appears to me that set up and maintenance of disaster recovery and business continuity solutions will be relatively easier in EC2 environment. Testing the additional computing power (capacity planning) for increased workload will be much easier and cost effective. We don’t have to make permanent investment in computing power and then later realize that the investment did not solve the capacity problem.

Having said all the good things about EC2, I do agree that we will come across several unpleasant aspects of it when we start using it. I found some of the experiences discussed on Colin Percival’s blog.

In general, one of the big concerns about cloud computing is company policies and security issues related with storing company’s data outside company network. Though EC2 appears to have reasonable security provision, I am not an expert on network security and hence I will abstain from making any comment on security. However, if cloud computing provides or will provide robust security, I would anticipate revisions in company policies over a period of time to adapt to the new computing environment.

Tuesday, January 13, 2009

Where a DB Request Spends Time?

While analyzing slow response or poor database performance, you might be curious to know where a database request spends time. This information will feed to database performance intelligence and guide the tuning effort in the right direction. However, I doubt if anyone ever found a satisfactory answer to this question for a DB2 database.

Basically, a database request may spend time either waiting for a service (wait time) or being serviced (service time). In fact this applies to any request to a computer system. The total time spent by a database request should be sum total of all the wait times and the service times for that request. Also, a database request may spend time inside the database system as well as outside the database system (e.g. Network). For a database request, the monitoring features of a database system should ideally provide all these wait times and service times spent inside database system.

In this post I will investigate the monitoring info available through db2batch and snapshot monitoring in attempt to identify the individual components that make up the total time spent by a SQL statement. First, I will investigate the monitoring info captured by db2batch. Listing 1 shows the SQL script I used for db2batch test


Listing 1


In case you are curious about the definition of table t1, it has the same structure as syscat.columns and I inserted all the rows from syscat.columns four times into t1. Also, t1 is created under a DMS tablespace with the associated buffer pool of just 1000 4K pages.

PERF_DETAIL 3 instructs db2batch to capture summary of monitoring information. Listing 2 shows the db2batch output after filtering out all the lines not related with time from the monitoring info


Listing 2



The SQL statement took total 130.567 seconds. Out of which Agent CPU Time was 124.36 seconds, but what about the remaining time (6.207 seconds). Even if we assume that buffer pool read time is not inclusive in Agent CPU Time, still it does not sum up to total elapsed time.

Next I will investigate the DB2 snapshot monitoring and identify any time related stats. Ideally, a database request should correspond to a statement. However, application level snapshot provides more time related stats, hence I will present a test example using application snapshot.

I created a test script as shown in Listing 3:


Listing 3


I ran the script with the time command as shown below


Listing 4



Here is the content of test.out


Listing 5


Listing 6 shows the time related lines from the application snapshot


Listing 6


Now let’s try to analyze the output of this test. The script test.sh took 2 minutes and 11.12 seconds (131.12 seconds). The query was run locally on the database server to avoid any network time. Also, the query returns just a single row to minimize the time associated with sending the result set to standard output. That’s the reason “Elapsed time of last completed uow” from application snapshot is pretty close to the time captured by the time command. Even the difference of fraction of a second is mostly because of other statements in test.sh (connect, get snapshot, terminate etc.).

Since the statement was run in auto-commit mode, the UOW comprises of single statement. Thus the SELECT statement took 130.9 seconds. Now let’s try to break up the 130.9 seconds and identify the individual components. We see that User CPU Time by agents is 124.78 seconds. Apart from that, the following time components sum up to less than a second


Listing 7


Adding them all to User CPU time will still be less than 126 seconds. So where was the remaining 5 seconds spent? In fact, many of the time components shown in Listing 7 (sort time, buffer pool read time etc.) might be inclusive in User CPU Time. In any case, we don’t have any accountability of at least 5 seconds. This gap of unaccountable time will be even longer for more complex queries. Also, note that all the time components shown in Listing 7 are not even available at the statement level where it makes more sense.

With these simple examples, I just wanted to show that DB2 monitoring does not provide the complete picture of all the time spent inside the database system. I have reached to the same conclusion even with the event monitoring. Ideally, the database monitoring should provide the total time spent by a statement and its individual components in a parent-child format. I would love to hear from you if you have any more insight or info regarding the time spent by a database request inside DB2.

Sunday, November 30, 2008

SQL Tips & Techniques- Correlated Reference

Sometimes even a simple SQL query may become challenging if we don’t approach it with a right technique. Few months ago I came across one such SQL query problem that I am going to share in this post. I was writing a database monitoring routine using snapshot table function. For simplicity, let’s consider the following query:

Listing 1

The above query will give the log used and available on partition 0, because the second argument is 0 in the snapshot_database table function. I had to collect the above info on each partition and using db2_all was not an option because the SQL was supposed to run through a JDBC connection. My first instinct was to find the list of partitions using following query and then run the query in Listing 1 in a loop for each partition

Listing 2


However, I wanted to have a single SQL query to get the complete result as shown below:

Listing 3

The problem appeared to be easy in the beginning, but it started becoming difficult as I thought over it and could not come up with the SQL query. Then I thought of checking the complete syntax of DB2 SQL query in DB2 SQL Reference to make sure I am not missing any syntax that can help with this problem. And I was indeed missing something. It was Correlated Reference, which can also be used in arguments of table function.

Most likely you might already be familiar with the Correlated Subquery and Correlation Name. Even if these terms do not sound familiar, I am sure you must have used them in your SQL. Let’s look at an example to make it clear


Listing 4

In the above query, notice the use of x.workdept inside the subquery. The employee table in the FROM clause of parent query has correlation name x. Though x is not present in the FROM clause of subquery, still x.workdept is used in the WHERE clause of subquery. This is called correlated subquery. In the above example, both the parent and subquery has employee table. However, subquery may have completely different tables and still it can use the columns from the parent table. Correlated subquery can also appear in SELECT and HAVING clause as shown below

Listing 5


Correlated reference can also be used in nested table expression as shown below

Listing 6

Above example has a nested table expression empinfo. Notice the use of d.deptno in the query of nested table expression empinfo.

As I mentioned earlier, correlated reference can also be used in the arguments of a table function and this helped me to get the solution of my SQL problem. The following SQL will give the result I wanted

Listing 7

In the above query, x.partition_number is used in the argument of snapshot_database table function. Changing the order of db_partitions and snapshot_database table functions in the FROM clause of above query will result in error because table references in a FROM clause are processed from left to right and x.partition_number must be resolved before it’s used in the argument of snapshot_database table function. The same rule applies with correlated subquery as well. The correlated reference must be from a table reference at a higher level in the hierarchy of subqueries.

Tuesday, September 30, 2008

Don’t Ignore Naming Convention

Recently I worked with an organization having hundreds of DB2 databases and still a uniform and consistent naming convention across all these databases. Naming convention started from Server Name, Instance Name and Database Name itself. Just by looking at an Instance Name, you can tell whether it’s a production, development or QA instance and the associated application for that instance. Instance home directory, database path, log path, archive log path, database backup path ..., all followed a consistent naming convention.

In the above mentioned environment, if a file system is getting full on any of the DB2 database server, you immediately know if that file system is database related and which instance/database is using that file system. If a new DBA joins the team, he can quickly get familiar with the large environment just by learning the conventions. If a database has to be backed up, he knows the backup target and if a database has to be recovered, he knows where to find the backup image and archive logs, all because of consistent naming convention.

A good naming convention not only helps a new DBA quickly come up to speed, but it also simplifies automation of DBA tasks. Many of the decision points in automation can get benefitted by the established convention. Automation platform like Data Palette has the capability to capture all the variations and weirdness in your database environment and still enable the decision making. However, more variations often lead to complexity in implementing and managing the automation.

Just having any naming convention is not enough, you should have a good naming convention. I have described below some of the qualities of a good naming convention

1. Make your naming convention informative
You can name your instances like db2ins01, db2ins02 and so on. But such naming convention does not have much information except that it is a DB2 instance. Instead if your instance name can tell you whether it’s production, development or QA instance, then you get more information out of that naming convention. Also, if your instance name includes INST or your database name includes DB, it does not provide you any additional info if you already know that you are looking at an instance name or database name.

Similarly the directory structure of your instance, database and table space level paths should incorporate the corresponding instance name, database partition number, database name hierarchy. By looking at instance/database configuration and system catalog, you can figure out most of the associated paths. However, the reverse mapping i.e. you know the path and trying to figure out who is using that path is straightforward only if you have an appropriate naming convention.

Whenever a name is derived from some other base names, do not abbreviate or shorten the base names while using into derived names. For example, if your table space container path includes a directory with your database name, this directory name should have a complete database name and not abbreviated or short form. If you use the abbreviated or short name, your path name becomes less informative because now you need additional data (the mapping between full name and short name) to determine if that directory name corresponds to the database name.


2. Use fixed-width numbers in your naming convention
If your naming convention includes a sequence number, have a fixed width sequence number. For example, if you know that you will need max four digits for a sequence number in your naming convention, use 0001 instead of just 1. Also, it’s better to have your database partition numbers (in db2nodes.cfg) as fixed width. For example, if you have 16 partitions in your database from 0 to 15, have your partitions named as 00, 01, 02 and so on. Such partition numbers has a distinct advantage while using “##” in your db2_all command. The character sequence “##” (two hash characters) can be used to get replaced by partition number while running a command through db2_all. So if you use NODE00## in your db2_all command, it can get replaced as NODE0000, NODE0001...NODE0015 in the above example of a 16 partition database.

No matter how good a naming convention is, if it’s not followed consistently, it’s not useful. I started this post with an ideal example of consistency in naming convention. On the other extreme, I have also come across organizations where DBA’s come up with a new naming convention whenever they set up a new database. To make it even worse, within a database also the naming convention would start changing over a period of time...it all depends on the mood of DBA while coming up with the name.

In my opinion, naming conventions should be properly defined as a planned effort, it should be well documented and strictly enforced by change control board or a similar regulation authority. Though, I fully agree that there are several scenarios where we can’t expect to have a uniform naming convention across a larger organization. For example, let’s imagine a company being acquired by another company and as a result the IT department of these companies getting merged into a single organization. However, naming conventions should not get destroyed just because of ignorance or negligence of individuals in an organization.

XML Superstar Contest

Did you hear about “Search for the XML Superstar” contest? If not yet, check out the details of this contest below (as provided by IBM). If you or your friend or colleague is a XML guru, here is a chance to win great prizes.

The International DB2 User Group (IDUG) is proud to announce the launch of the “Search for the XML Superstar” contest in the United States. This worldwide initiative is aimed at highlighting the benefits of DB2 and related technologies to database developers and university students.

The “Search for the XML Superstar” has five categories:
  1. XML application programming
  2. Ported applications
  3. Database gadgets
  4. Videos
  5. Query challenge
Contestants will participate in a combination of education, exposure to practical applications and “tests” that will validate understanding and ability to program IBM database technologies. Entries will be judged by an independent panel of DB2 professionals.

Participants in the contest will be eligible to win one of a series of great prizes. Those who submit star-quality entries could receive an all-expense paid trip to an IDUG conference, a new laptop, an iPod, a Sony PlayStation 3 or a Nintendo Wii system. The contest will also allow students to win opportunities for interviews and lab visits, and will enhance the skills that catch employers’ eyes on a resume. Developers will be able to gain recognition from world-class consultants and be celebrated for standing out among their peers.

The contest launches in U.S. tomorrow. There will be a IBM press release coming out tomorrow, and the website (xmlchallenge.com), will launch then, as well. Also, check out the IBM DB2 Query Console (beta), an application hosted in the cloud that allows you to run SQL and XQuery statements against a DB2 database. Good luck for the contest!

Wednesday, August 20, 2008

SSV Backup

As I mentioned in my previous post DB2 Partitioning, Database partitioning feature (DPF) of DB2 enables DB2 to provide a very scalable database solution. The applications connecting to database do not have to be aware of database partitioning. However, the DBAs managing the multi-partition database have to deal with some complexities. One of the administration complexities of multi-partition database is Backup and Restore. Each database partition has to be backed up separately. Also, the catalog partition has to be backed up before any other partition. The db2_all command is used in such scenarios to run a command against one or more than one database partitions as shown in the example below. Following example shows the offline backup of a multi-partition database “test” with catalog partition 0.


db2_all ‘<<+0< db2 BACKUP DATABASE test to /backup_path’

db2_all ‘|<<-0< db2 BACKUP DATABASE test to /backup_path’


The first line in the above example will backup the partition 0 and the second line will backup the remaining partitions in parallel. With the above approach, the backup timestamp of all the partitions would not be the same. Hence, we have to separately identify the backup images for each partition while restoring the database.


In DB2 9.5, you don’t have to deal with the above mentioned complexity because of Single system view (SSV) backup. SSV backup allows to backup some or all the database partitions using ON DBPARTITIONNUMS or ON ALL DBPARTITIONNUMS clause of BACKUP DATABASE command. The backup images created by SSV backup have the same timestamp for each database partition. The SSV backup has to be run from catalog partition. In the following example, all the database partitions are backed up using SSV backup. The backup image of all the database partitions has the same timestamp. Hence, all the database partitions could be restored with a single command using db2_all.



db2 BACKUP DB test ON ALL DBPARTITIONNUMS ONLINE INCLUDE LOGS

db2_all “db2 RESTORE DB test TAKEN AT 2008-08-17-23.44.56.125237“

db2 ROLLFORWARD DB test TO END OF BACKUP ON ALL DBPARTITIONNUMS



Some more good news... In the above example, notice that the backup image would include the logs. Prior to version 9.5, logs can be included with the online backup image for single partition databases only. With SSV backup, it’s possible to include the logs with online backup image for multi-partition databases also. Finally, notice the use of TO END OF BACKUP clause with the ROLLFORWARD command. It really simplifies to rollforward all the database partitions to the minimum recovery time and saves the effort to determine minimum recovery time over all the database partitions.

New Poll- Backup Target

Thanks everyone for participating in my previous poll regarding DB2 autonomic features. Here is the final poll result


Self-tuning memory manager – 14 (51%)

Automatic storage – 11 (40%)

Automatic maintenance – 8 (29%)

Configuration advisor – 7 (25%)

Health monitor – 9 (33%)

Other – 1 (3%)

None – 6 (22%)


Interestingly, the use of the above features is almost in the same order as their positions in the listing. It’s evident that the top two autonomic features, self-tuning memory manager and automatic storage are the most popular autonomic features as per the poll result.


I have added a new poll ... No, this is not about Democratic vs. Republican J This is related to your database backup target. Please indicate the backup target as you specify in your backup command. For example, if your backup target is TSM, choose TSM only even though TSM will store the backup image on tape. If you select “Other”, it would be nice if you can explicitly mention the exact target through a comment to this post. I hope there will be more participation this time and thus the poll result will be more realistic. So, please don’t forget to provide your input in this poll !!!