Tag: linux
-
Can’t believe there’s (still) no way to list just the ipset members
I find it amusing that the ipset command has all kinds of output modifiers like -name to show only the name of the set, -terse to show the name and headers, and even -output xml, but lacks a straightforward way to list only the set members. Many tools have options like --skip-headers, --no-headers, --skip-column-names and so on, but not ipset.
I guess someone needed the XML output and contributed it, so maybe I should try and contribute -no-headers or something similar.
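In the meantime, stripping the headers by hand is easy enough. A minimal sketch, assuming a set named blacklist and that the plain listing prints a "Members:" line right before the entries (the exact header layout may differ between ipset versions):
ipset list blacklist | sed -e '1,/^Members:/d'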
At least this gives me some shell scripting practice. 🙂 -
Attack vectors deja vu
I have to keep an eye on the IT security news. You know, “security is a process, not a product”. Just recently, the Linux kernel vulnerability CVE-2009-1337 caught my attention. This one even has l33t in its name 🙂 The more interesting part is, of course, not the CVE number but the attack vector used in a recent exploit. Basically, a core is dumped into the logrotate.d directory; logrotate then executes the malicious instructions embedded in this dump, since it uses rather naive parsing to find directives in its configuration files.
Inevitably, this reminded me of a very similar situation from a few years ago. In 2006, CVE-2006-2451, another kernel vulnerability, allowed a core to be dumped into a directory the attacker isn’t allowed to write to. A weakness in cron.d parsing, similar to the one in recent versions of logrotate, was used as the attack vector.
Just a few weeks ago, I had another deja vu. There’s a flaw in udev versions before 1.4.1 that allows local users to gain root privileges because udev doesn’t check whether a NETLINK message originates from the kernel (CVE-2009-1185). It took me some time to remember why this sounded so familiar, since the older case dates back to 2003. Back then, the zebra routing suite also failed to check the originator of NETLINK messages (CVE-2003-0858).
Oh well, to err is human, don’t you think?
-
IO performance monitoring
If you can measure it, you can manage it. I’m a measurement, monitoring, analysis and statistics addict 🙂
That’s why I’ve always wanted to be able to monitor the IO load of the Linux systems I’ve worked with. While there are well established monitoring and accounting tools for CPU usage, both system-wide and per process, there were virtually none for the IO subsystem until very recently.
Two of the more important reasons why I’d like to see better IO load monitoring are:
- Mechanical drives have high latency. In general, an overloaded CPU hurts much less than overloaded disks. A load average of 10 caused by CPU-bound processes leaves the system feeling far more responsive than the same load caused by IO-bound processes. A CPU load average of 10 on a server with two processors is barely noticeable, while an IO load average of 10 on the same system with 2x 7200 rpm disk drives in RAID1 feels very sluggish.
- Hard disk drives have failed to keep up with the performance improvements in microprocessor technology. Disk capacity has grown quite well, but throughput and especially access times lag far behind. IO performance is the most common bottleneck and the most precious resource in today’s systems. Or at least in the systems I work with 🙂
At the beginning of my Linux career, ten years ago, there was only one metric: blocks read/written. And that’s it. You could only guess how busy the disks were by looking at the load average and checking how many processes were stuck in D state. I wish there were separate load average readings for CPU and IO…
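For the record, that guessing game looks roughly like this; the only assumptions are the standard procps ps and /proc/loadavg:
# processes in uninterruptible sleep (D), usually blocked on IO
ps -eo state,pid,comm | awk '$1 == "D"'
# the same blocked tasks are what inflate the load average
cat /proc/loadavg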
At some point (Linux 2.5 times?) extended statistics were added, and things like queue size, utilization percentage etc. became available. Much better. Still, it was hard to tell who exactly was causing the load. On a multi-user system, all you can see is multiple processes in D state, and it’s unclear whether these are the ones causing the IO havoc or just victims waiting on an already overloaded IO subsystem.
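These extended statistics are what tools like iostat from the sysstat package expose today; a quick way to look at them (field names and layout vary a bit between versions):
# per-device utilization, queue size and wait times, refreshed every second
iostat -x 1
# the raw counters those numbers are derived from
cat /proc/diskstats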
In Linux 2.6.20 another step was made by adding per-process IO accounting. I was very excited when I heard about this feature and eager to try it. It turned out that this per-process IO accounting counts only the bytes read/written by a process. Not much better. A modern 7200 rpm SATA drive is only capable of about 90 IOPS, so a process issuing tiny random requests can choke it while the accounting shows a pathetic 90 bytes per second…
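The counters live in /proc/<pid>/io, assuming the kernel was built with task IO accounting enabled; a quick look shows what they do and don’t tell you:
# byte counters only: read_bytes/write_bytes say how much hit the block
# layer, but nothing about how many seeks those bytes cost
cat /proc/self/io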
Then there are the atop patches. These add a per-process IO occupation percentage. That sounds great, but… when you have a lot of small random writes, they go to the page cache first and are only periodically flushed to the physical device. This is a performance feature and generally a (very) good thing, since it allows the elevators to merge writes and so on. Unfortunately, atop ends up accounting all these writes and their IO utilization to pdflush and kjournald.
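You can watch this buffering in action with nothing more exotic than /proc/meminfo, something along these lines:
# dirty pages pile up in the page cache, then get flushed in bursts
watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'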
OK, let’s see what the state of affairs is in some other operating systems. Everybody talks about dtrace, so it’s time to check it out. Linux doesn’t have dtrace, at least not yet; there is work in progress by Paul Fox. Linux does have SystemTap, but it doesn’t look very mature to me. Anyway, there are a number of operating systems that support dtrace: since it was created by Sun engineers, first come Solaris and OpenSolaris, then the FreeBSD port and Apple OS X. I’m familiar with FreeBSD, but I wanted to check the current state of the OpenSolaris kernel. On the other hand, I wanted to keep the learning curve less steep, so I opted for Nexenta Core 2 RC1. Nexenta is a GNU userspace (Debian/Ubuntu) on top of the OpenSolaris kernel.
Download, install: everything went smoothly. The installer defaulted to a root fs on ZFS. Good! I was thinking about playing with ZFS one of these days anyway.
And the moment of truth:
I started dbench -S 1, ran dtrace -s iotop.d, and here’s the output:
UID  PID  PPID  CMD    DEVICE  MAJ  MIN  D  %I/O
0    0    0     sched  cmdk0   102  0    W  17
Hm, that looks somewhat familiar. I see a pattern there. Isn’t sched the ZFS cousin of pdflush/kjournald? Oh, well it is: http://opensolaris.org/jive/thread.jspa?threadID=39545&tstart=285
No luck… dtrace’s iotop works with UFS but has problems with ZFS.
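For what it’s worth, even a bare io provider one-liner (instead of the full iotop.d from the DTraceToolkit) would presumably show the same picture on ZFS, with the asynchronous writes charged to sched:
# count block IO requests per process name
dtrace -n 'io:::start { @[execname] = count(); }'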
It turns out that proper IO monitoring is a very tricky business.