If you can measure it, you can manage it. I’m a measurement, monitoring, analysis and statistics addict
That’s why I’ve always wanted to be able to monitor the IO load of the Linux systems I’ve worked with. While there are well established monitoring and accounting tools for the CPU usage – both system wide and per process – there were virtually none for the IO system until very recently.
Two of the more important reasons why I’d like to see better IO load monitoring are:
- The mechanical drives have big latency. In general the CPU feels much better than the disks when overloaded. For example if load average 10 is caused by CPU bound processes the system feels much more responsive than the same load but caused by IO bond processes. CPU load average 10 on a server system with two processors isn’t very noticeable. At the same time IO load average of 10 on the same system with 2x 7200 rpm disk drives in RAID1 feels very sluggish.
- The hard disk drives failed to keep up with the performance improvements in microprocessor technologies. Disk capacity has grown quite well, but the speed and especially access times are far behind. The IO performance is the most common bottleneck and most precious resource in today’s systems. Or at least the systems I work with
At the beginning of my Linux career, ten years ago, there was only one metric – blocks read/written. And that’s it. How busy the disk is you can guess only by looking at load average and checking how many processes are stuck in D state. I wish there are separate load average readings for CPU and IO…
At some point (linux 2.5 times?) extended statistics were added and things like queue size, utilization in % etc. became available. Much better. Still it was hard to tell who exactly is causing the load. If we speak of multi user system all you can see is multiple processes in D state. It’s unclear whether these are the ones causing the IO havoc or just victims of the already overloaded IO subsystem waiting.
In Linux 2.6.20 another step was made by adding per process IO accounting. I was very excited when I heard about this feature and eager to try it. It turned out that this per process IO accounting counts only the bytes read/written by a process. Not that better. A modern 7200 rpm SATA drive is only capable of about 90 IOPS so it could be choked with the pathetic 90 bytes per second…
Then there are the atop patches. These add per process IO occupation percentage. That sounds great but… when you have a lot of small random writes they go to the page cache first and only then are periodically flushed to the physical device. This is performance feature and is generally a (very) good thing as it allows the elevators to group writes together etc. Unfortunately, atop ends up accounting all these writes and IO utilization to pdflush and kjournald.
Ok, lets see what’s the state of the affairs in some other operating system. Everybody talks about dtrace so it’s time to check it out. Linux doesn’t have dtrace. At least yet. There is work in progress by Paul Fox. On the other hand Linux has system tap but it doesn’t look very mature to me. Anyway, there are number of operating systems that support dtrace: as it is create by Sun engineers first come Solaris and OpenSolaris. Then there is the FreeBSD port and Apple OS X. I’m familiar with FreeBSD but I wanted to check the current state of OpenSolaris kernel. On the other hand I wanted to keep the learning curve less sloppy, so I opted for Nexenta core 2 rc1. Nexenta is GNU userspace (Debian/Ubuntu) and OpenSolaris kernel.
Download, install – everything was smooth. The install defaulted to root fs on ZFS. Good! I was thinking about playing with ZFS these days anyway.
And the moment of truth:
I started dbench -S 1, run dtrace -s iotop.d and here’s the output:
UID PID PPID CMD DEVICE MAJ MIN D %I/O 0 0 0 sched cmdk0 102 0 W 17
Hm, that looks somewhat familiar. I see a pattern there. Isn’t sched the ZFS cousin of pdflush/kjournald? Oh, well it is: http://opensolaris.org/jive/thread.jspa?threadID=39545&tstart=285
No luck… dtrace’s iotop works with UFS but has problem with ZFS.
Turns out the proper IO monitoring is a very tricky business.