If you can measure it, you can manage it. I’m a measurement, monitoring, analysis and statistics addict 🙂
That’s why I’ve always wanted to be able to monitor the IO load of the Linux systems I’ve worked with. While there are well-established monitoring and accounting tools for CPU usage – both system-wide and per process – until very recently there were virtually none for the IO subsystem.
Two of the more important reasons why I’d like to see better IO load monitoring are:
- Mechanical drives have high latency, and an overloaded CPU feels much better than overloaded disks. For example, a load average of 10 caused by CPU-bound processes leaves the system feeling much more responsive than the same load caused by IO-bound processes. A CPU load average of 10 on a server with two processors is barely noticeable; an IO load average of 10 on the same system with two 7200 rpm drives in RAID1 feels very sluggish.
- Hard disk drives have failed to keep up with the performance improvements in microprocessor technology. Disk capacity has grown quite well, but throughput and especially access times are far behind. IO performance is the most common bottleneck and the most precious resource in today’s systems. Or at least the systems I work with 🙂
At the beginning of my Linux career, ten years ago, there was only one metric – blocks read/written. And that’s it. You could only guess how busy the disk was by looking at the load average and checking how many processes were stuck in D state. I wish there were separate load average readings for CPU and IO…
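That guessing game still works, for what it’s worth. As a rough sketch of my own (not a real IO metric), you can count the D-state tasks straight from /proc:

```shell
# Count tasks currently in uninterruptible sleep (state "D") -- a crude
# proxy for IO pressure. In /proc/<pid>/stat the state is the first
# field after the parenthesised command name, so strip up to ") " first.
count=0
for f in /proc/[0-9]*/stat; do
    state=$(sed 's/^.*) //; s/ .*//' "$f" 2>/dev/null)
    if [ "$state" = "D" ]; then
        count=$((count + 1))
    fi
done
echo "$count"
```

On a healthy system this prints 0 most of the time; a persistently high number usually means the disks are the bottleneck.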
At some point (the Linux 2.5 era?) extended statistics were added, and things like queue size, utilization percentage etc. became available. Much better. Still, it was hard to tell who exactly was causing the load. On a multi-user system all you can see is multiple processes in D state, and it’s unclear whether these are the ones causing the IO havoc or just victims waiting on an already overloaded IO subsystem.
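Those extended statistics live in /proc/diskstats; as a sketch (field positions per the kernel’s Documentation/iostats.txt), the interesting per-device counters can be pulled out with awk:

```shell
# /proc/diskstats fields (see kernel Documentation/iostats.txt):
# $3 = device name, $4 = reads completed, $8 = writes completed,
# $13 = milliseconds spent doing IO (the basis of iostat's %util)
awk '{ printf "%-10s reads=%-12s writes=%-12s io_ms=%s\n", $3, $4, $8, $13 }' /proc/diskstats
```

These are exactly the system-wide counters tools like iostat -x sample and turn into rates – which is why they can tell you the disk is busy, but not who is keeping it busy.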
In Linux 2.6.20 another step was made by adding per-process IO accounting. I was very excited when I heard about this feature and eager to try it. It turned out that this per-process IO accounting counts only the bytes read/written by a process. Not much better. A modern 7200 rpm SATA drive is only capable of about 90 IOPS, so it can be choked while transferring a pathetic 90 bytes per second…
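The counters in question are exposed in /proc/&lt;pid&gt;/io (assuming the kernel was built with CONFIG_TASK_IO_ACCOUNTING); note that everything there is bytes, not operations:

```shell
# Per-process IO accounting added in 2.6.20. rchar/wchar count all
# read()/write() traffic, read_bytes/write_bytes only what actually hit
# the block layer -- but all of them are byte counts, not IO operations,
# which is exactly the limitation discussed above.
cat /proc/self/io
```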
Then there are the atop patches, which add a per-process IO occupation percentage. That sounds great, but… when you have a lot of small random writes, they go to the page cache first and are only periodically flushed to the physical device. This is a performance feature and generally a (very) good thing, as it allows the elevators to group writes together etc. Unfortunately, atop ends up accounting all these writes and their IO utilization to pdflush and kjournald.
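You can actually watch those buffered writes sitting in the page cache before the flusher threads get to them:

```shell
# Dirty: data modified in the page cache, not yet written to disk.
# Writeback: data currently being flushed. When the flusher threads
# (pdflush/kjournald in those kernels) finally write it out, tools like
# atop charge the IO to them rather than to the originating process.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```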
OK, let’s see the state of affairs in some other operating systems. Everybody talks about DTrace, so it’s time to check it out. Linux doesn’t have DTrace – at least not yet; there is work in progress by Paul Fox. Linux does have SystemTap, but it doesn’t look very mature to me. Anyway, a number of operating systems support DTrace: as it was created by Sun engineers, first come Solaris and OpenSolaris, then the FreeBSD port and Apple OS X. I’m familiar with FreeBSD, but I wanted to check the current state of the OpenSolaris kernel. On the other hand, I wanted to keep the learning curve less steep, so I opted for Nexenta Core 2 RC1. Nexenta is a GNU userspace (Debian/Ubuntu) on an OpenSolaris kernel.
Download, install – everything was smooth. The install defaulted to root fs on ZFS. Good! I was thinking about playing with ZFS these days anyway.
And the moment of truth:
I started dbench -S 1, ran dtrace -s iotop.d, and here’s the output:
UID  PID  PPID  CMD    DEVICE  MAJ  MIN  D  %I/O
  0    0     0  sched  cmdk0   102    0  W    17
Hm, that looks somewhat familiar. I see a pattern there. Isn’t sched the ZFS cousin of pdflush/kjournald? Oh, well it is: http://opensolaris.org/jive/thread.jspa?threadID=39545&tstart=285
No luck… dtrace’s iotop works with UFS but has problems with ZFS.
It turns out that proper IO monitoring is a very tricky business.
I followed the link from your comment on my recent post on Linux disk I/O monitoring, and was pleasantly surprised to see this post discussing it more! I am somewhat glad to see that I’m not the only one stymied by this.
It’s a really essential metric and I, like you, feed on such statistics with relish 🙂 If you find any more complete solutions, share them!
Hi, Samuel 🙂
These days the storage is all the hype and that’s for a reason.
You can rest assured I’ll be sniffing around for better ways to analyse IO usage. If I find something I’ll share it on this blog. I’ll keep an eye on your blog too.
If you’re a performance monitoring junkie I’ll bet you’d like collectl – http://collectl.sourceforge.net/. Its goal is to monitor everything from one tool, so that you can correlate what is going on not just with your storage subsystem but with everything. For example, if your disk is slow, it could be related to memory fragmentation (buddyinfo), slab activity or other resources. collectl does it all, and you can even run it at sub-second monitoring intervals, synchronized to the nearest second within microseconds!
But don’t take my word for it – download it and check it out for yourself.
Thanks, I’ll check it out. You can take a look at atop – http://www.atcomputing.nl/Tools/atop/. atop is probably the best performance monitoring and accounting tool I’ve tried so far – bearing in mind I haven’t tried collectl yet.
Since you are fond of statistics, you may want to check out fsstat, which is new in Solaris 10. It allows you to track the actual data being transferred at the file system level. While it doesn’t give you per-user or per-disk IO, it does show you the data transferred from the file cache, which is very important for monitoring system performance when ZFS is involved: its data caching mechanics are far more intelligent than other systems’, so it is much more likely to serve data from cache.
Also, Ben Rockwood of cuddletech has a cool script that lets you see just how effective your disk caching is – http://www.cuddletech.com/blog/pivot/entry.php?id=979 – so you can see the possible effect of adding more memory to a system.
For someone with your appetite for statistics on your systems, Solaris must be your favorite OS: besides the regular tools – prstat, mpstat, vmstat – it also has dtrace and kstat to give you all the more insight into what your systems are doing.
Wow, sir, this fsstat looks great.
Now this is something very interesting:
sudo sysdig -p "%evt.arg.name" "fd.type=directory or fd.type=file" | head -n1000 | cut -d/ -f1-4 | sort | uniq -c | sort -n