Wednesday, January 13, 2010

Multi-terabyte

After a long gap, I am back again to share some of my thoughts. Leaving my Blog dry for so long makes me feel drained as well. However, I try to console myself thinking now it’s the age of information overload. Also, the growth of community websites in past few years is pouring so much information that it’s difficult for one to cope up with it. Hence, no harm slowing down my contribution to this information overload :)

Even though I was not posting anything on my Blog, I was not completely detached from it. In the meanwhile, I was tackling with the junk comments on my posts :) Yes, you read it correctly... junk comments, not junk emails. Since the email applications already have ways to filter out most of the junk emails, the junkies are targeting blogs now.

One day while looking at my blog, my eyes got stuck at a keyword used in my profile and triggered a chain of thoughts... this word was “multi-terabyte”. I recalled when I had written that profile, I did not mean more than 10 terabytes by it. Just 5 years ago from now, a 10 terabyte data warehouse was considered to be huge and working with such databases used to make me feel privileged. However, now a days I doubt if anyone will consider a smaller than 100 terabyte data warehouse huge. In fact, we have already started talking about petabyte size of data warehouses.

That naturally makes me think, will DB2 be able to support muti-petabyte warehouses in future with its current architecture. If not, what’s there in IBM’s roadmap to support such warehouses? Well, I have not heard much except the empty promises that current scalability features (data partitions, balanced configuration units etc.) should be able to tackle it. Beyond DB2 world, I do see some promising work and an important one to note would be Column-oriented DBMS.

Column-oriented DBMS is not a new concept, though recently it has started getting more traction. If you are hearing about it first time, you can get a quick overview of it on the Wikipedia. Also, you will find tons of articles regarding it if you google it. IBM has been downplaying column-oriented DBMS mentioning that it’s not suitable for all type of workloads in data warehousing environment, the companies who provide Column-oriented DBMS still have a very small footprint in the data warehousing market and so on. These arguments are basically deviating the discussion from the main point.

The important question is, will the majority of workload in data warehousing environment greatly benefit by Column-oriented DBMS approach or not. And the researches on this topic have answered this question affirmatively. Also, I have not come across any clear statement from IBM or IBMers disproving this fact. Well, I am not proposing that IBM should change DB2 from row-oriented to column-oriented or come up with a brand new column-oriented DBMS. However, it can provide an option for column-oriented tables in DB2. May be it will make more sense to have this option on table space level and all the tables under that table space will be stored column-oriented. Please don’t take me wrong, I don’t think this feature is going to be so easy to implement. I can anticipate its far reaching impact on DB2 core engine and optimizer, but so what? When IBM came up with hybrid XML/relational DB2 engine for native support of XML data, it was not a simple feature. If I recall correctly, IBM publicized it to involve 5 years of development time by a large development team.

The next important question is, is it worth investing so much effort in column-oriented table space feature. I think it is... because it will prepare DB2 for the next generation data warehouses. DB2 has been stronger in the data warehousing market compared to OLTP and this feature will help it become even stronger in the data warehousing market. However, I am curious to hear your thoughts on this.