Producing PDFs

I don't want to throw this in with the announcement of the availability of the paper on memory and cache handling but I also don't want to forget it. So, here we go.

I write all the text I can using TeX (PDFLaTeX to be exact). This leads directly to a PDF document without intermediate steps. The graphics are done using Metapost because I'm better at programming than at drawing. Metapost produces Postscript-like files which some LaTeX macros then read and directly integrate into the PDF output.

The result in this case is a PDF with 114 pages which is only 934051 bytes in size. Just about 8kB for each page. Given that the text is multi-column and contains numerous graphics, this is amazingly small.

I mentioned before how badly the usual office suites suck at exporting graphics. I bet all the other word processors, spreadsheets, etc. suck just as badly. Also, the PDFs they generate for text are much, much bigger.

My guess is that if I'd written the document with such a tool the size would be north of 4MB, probably significantly more. I cannot understand why people do this to themselves and, more importantly, to others.

Memory and Cache Paper

Well, it's finally done. I've uploaded the PDF of the memory and cache paper to my home page. You can download it but do not re-publish it or make it available in any form to others. I do not want multiple copies flying around, at least not while I'm still intending to maintain the document.

With Jonathan Corbet's help the text should actually be readable. I had to change some of the text in the end to accommodate line breaks in the PDF, so I might have introduced problems; don't think badly of Jonathan's abilities. Besides, this is a large document. You simply go blind after a while, I know I do.

Which brings me to the next point. Even though I intend to maintain the document, don't expect me to do much in the near future. I've been working on it for far too long now and need a break. Integrating all the edits Jonathan produced plus today's line-breaking work finished me off. I haven't even integrated all the comments I've received. I know the structure of the document is a bit weak in a few places, especially section 5 which contains a lot of non-NUMA information. But it was simply too much work so far. Maybe some day.

The Evils of pkgconfig and libtool

If you need more proof that this is insane, just look at some of the packages using these tools. I recently was looking at krb5-auth-dialog. The output of ldd -u -r on the original binary shows 26 unused DSOs.

This can be changed quite easily: add -Wl,--as-needed to the link line. Doing this in the case of this package makes all but one of the unused dependencies go away. This has several benefits:

The binary size is actually measurably reduced.

   text    data     bss     dec     hex filename
  35944    6512      64   42520    a618 src/krb5-auth-dialog-old
  35517    6112      64   41693    a2dd src/krb5-auth-dialog

That’s a 2% improvement. Note that the dropped entries are all still pulled in as recursive dependencies of the remaining ones, so the runtime is not much affected (only a little). The saved data is pure overhead, though. Multiply the number by the thousands of binaries and DSOs which are shipped and the savings are significant.

The second problem to mention here is that not all unused dependencies are gone, because somebody thought s/he was clever and used -pthread in one of the pkgconfig files instead of linking with -lpthread. That’s just stupid when combined with the insanity called libtool. The result is that -Wl,--as-needed is not applied to the thread library.

Just avoid libtool and pkgconfig. At the very least fix up the pkgconfig files to use -Wl,--as-needed.
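
For reference, the effect of the flag is easy to reproduce with a toy binary (the file names here are made up for illustration; -Wl,--no-as-needed simulates a linker which doesn't have the option on by default):

```shell
# Create a program that links against libm but never uses it.
cat > demo.c <<'EOF'
int main(void) { return 0; }
EOF

gcc demo.c -o demo-old -Wl,--no-as-needed -lm  # DT_NEEDED entry for libm recorded
gcc demo.c -o demo-new -Wl,--as-needed -lm     # unused dependency dropped

# Only the first binary lists libm among its NEEDED entries:
readelf -d demo-old | grep NEEDED
readelf -d demo-new | grep NEEDED
```

Running ldd -u -r on the two binaries shows the same difference: demo-old reports libm as an unused direct dependency, demo-new does not.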

Energy saving is everybody's business

With the wide acceptance of laptop and even smaller devices more and more people have been exposed to devices limited by energy consumption. Still, programmers don't pay much attention to this aspect.

This statement is not entirely accurate: there has been a big push towards energy conservation in the kernel world (at least in the Linux kernel). With the tickless kernels we have the infrastructure to sleep for long times (long is a relative term here). Other internal changes avoid unnecessary wakeups. It is now really up to the userlevel world to do its part.

The situation is pretty dire here. There are some projects (e.g., PowerTOP) which highlight the problems. Still, not much happens.

I've been somewhat guilty myself. nscd (part of glibc) was waking up every 5 seconds to clean up its cache, even though often there was nothing to be done. There were reasons for this program structure; good ones, but no compelling ones. So I finally bit the bullet and changed the program structure significantly to avoid unnecessary wakeups. The result is that nscd now at all times determines when the next cache cleanup is due and sleeps until then. Cache cleanups might be many hours out, so the code improved from one wakeup every 5 seconds to one wakeup every couple of hours.

nscd is a very small drop in the bucket, though. Just look at your machine and examine the running processes and those which are regularly started. PowerTOP cannot really help here (Arjan said something will be coming soon though).

There is a tool which can help, though: systemtap. Simply create a small script which traps the syscalls the violators will use and displays process information. The syscalls to watch include open, stat, access, poll, epoll, select, nanosleep, and futex. For the latter five it is small timeout values which are the problem.

I'll post a script to do this soon (just not now). But the guilty parties probably already know who they are. Just don't do this quasi busy waiting!

  • If a program has to react to a file change, removal, or creation, use inotify.
  • For internal cleanups, choose reasonable intervals and then compute the timeout so that you don't wake up when nothing has to be done.

If you want to see how not to do it, look at something like the Flash player (the proprietary one). If you have inadvertently started it, it'll remain active (even if no Flash page is displayed) and it is basically busy waiting on something.

Let's show the proprietary software world we can do better.

Directory Reading

In the last few weeks I have seen far too much code which reads directory contents in horribly inefficient ways to let this slide. Programmers really have to learn to do this efficiently. Some of the instances I've seen are in code which runs frequently. Frequently as in once per second. Doing it right can make a huge difference.

The following is an illustrative piece of code. It is not taken from an actual project, but it shows some of the problems quite well, all in one example. I drop the error handling to make the point clearer.

  DIR *dir = opendir(some_path);
  struct dirent *d;
  struct dirent d_mem;
  while (readdir_r(dir, &d_mem, &d) == 0 && d != NULL) {
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s/somefile", some_path, d->d_name);
    int fd = open(path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close(fd);
    }
  }
  closedir(dir);
How many things are inefficient at best and outright problematic in some cases?

Let's enumerate:

  1. Why use readdir_r?
  2. Even the use of readdir is dangerous.
  3. Creating a path string might exceed the PATH_MAX limit.
  4. Using a path like this is racy.
  5. What if the directory contains entries which are not directories?

readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data into the buffer provided by the user while readdir has the possibility to avoid that.

Instead of readdir the code should in fact use readdir64. The definition of the dirent structure comes from an innocent time when hard drives with a couple of dozen MB of capacity were huge. Things change and we need larger values for inode numbers etc. Modern (i.e., 64-bit) ABIs do this by default, but if the code is supposed to be used on 32-bit machines as well, the *64 variants should always be used.

Path length limits are becoming an ever-increasing problem. Linux, like most Unix implementations, imposes a length limit on each filename string which is passed to a system call. But this does not mean that in general path names have any length limit. It just means that longer names have to be implicitly constructed through the use of multiple relative path names. In the example above, what happens if some_path is already close to PATH_MAX bytes in size? It means the snprintf call will truncate the output. This can and should of course be caught but this doesn't help the program. It is crippled.

Any use of filenames with path components (i.e., with one or more slashes in the name) is racy; an attacker can change any of the contained path components. This can lead to exploits. In the example, the some_path string itself might be long and traverse multiple directories. A change in any of these will lead to the open call not reaching the desired file or directory.

Finally, while the code above works (the open call will fail if d->d_name does not name a directory) it is anything but efficient. In fact, the open system calls are quite expensive. Before any work is done, the kernel has to reserve a file descriptor. Since file descriptors are a shared resource this requires coordination and synchronization which is expensive. Synchronization also reduces parallelism, which might be a big issue in some code. The open call then has to follow the path which also is not free.

To make a long story short, here is how the code should look (again, sans error handling):

  DIR *dir = opendir(some_path);
  int dfd = dirfd(dir);
  struct dirent64 *d;
  while ((d = readdir64(dir)) != NULL) {
    if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
      continue;
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/somefile", d->d_name);
    int fd = openat(dfd, path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close(fd);
    }
  }
  closedir(dir);
This rewrite addresses all the issues. It uses readdir64, which will do just fine in this case and is safe when it comes to huge disk drives. It uses the d_type field of the dirent64 structure to check whether we already know that the entry is not a directory. Most Linux file systems today fill in the d_type field correctly (including all the pseudo filesystems like sysfs and proc). Those file systems which do not have the information handy fill in DT_UNKNOWN, which is why the code above allows this case, too. In some programs one might also want to allow DT_LNK since a symbolic link might point to a directory. But more often than not this is not the case, and not following symlinks is a security measure.

Finally, the new code uses openat to open the file. This avoids the lengthy path lookup and it closes most of the races of the original open call, since the pathname lookup starts at the directory read by readdir64. Any change to the filesystem along the path leading to this directory has no effect on the openat call. Also, since the generated path is now very short (at most 256 bytes for d_name plus 10 for "/somefile" including the NUL terminator), we know that the buffer path is sufficient.

It is easy enough to apply these changes to all the places which read directories. The result will be smaller, faster, and safer code.

The Series is Underway

Jon Corbet has edited the first two sections of the document I mentioned earlier here and here.

The document will be published in multiple installments, beginning with Sections 1 and 2 which are available now. Since LWN is a business, the reasonable limitation is in place that for the first week only subscribers have access.

So, get a subscription to LWN.

If you find mistakes in the text let me know directly, either as a comment here or as a personal mail. Don't bother Jon with that.

SHA for crypt

Just a short note: I added SHA support to the Unix crypt implementation in glibc. The reason for all this (including replies to the extended "NIH" complaints) can be found here.

Publishing Update

A few weeks back I asked how I should publish the document on memory and cache handling. I got quite a bit of feedback.

  • There was the usual "it doesn't matter, but I want it for free" crowd.
  • Then there was the "even $8 for a book is too much for me" group. These are people from outside the US, and $8 translated to local currency and income is certainly far too much for many people. I do not throw this group in with the first.
  • Several people (all or mostly US-based) thought the idea of printed paper was nice. The price was no issue.
  • Most people said a freely available PDF is more important than a printed copy. Some derogatory comments about lecturers who require books were heard. Others said editing isn't important.

Because of the first obnoxious group of people I would probably have gone the print-only route. This attitude that just because somebody works on free software he always has to make everything available for free makes me sick. These are most probably the same people who never in their life produced anything that others found of value, or they are the criminals working on (mostly embedded) projects exploiting free software.

But since I really want the document to be widely distributed and available to places where $8 is too much money I will release the PDF for free. But this won't happen right away. Unlike some of the people making comments I do think that editing is important. Fortunately having professional editing and a free PDF don't exclude each other.

I'll not go with a publisher (especially not those $%# at O'Reilly, as several people suggested). This would in most cases have precluded retaining the copyright and making the text available for free.

Instead the nice people at LWN, Jonathan Corbet and crew, will edit the document. They will then serialize it, I guess, along with the weekly edition. It's up to Jon to make this decision. The document has 8 large sections including the introduction, which means my guess is that after 7 installments the whole document will be published. Once this has happened I'll make the whole updated and edited PDF available.

This means if you think it's worth it, get a subscription to the LWN instead of waiting a week to read it for free.

So in summary, I get professional editing, keep the copyright, and might be able to help getting some more subscribers for the LWN. Win, win, win. If the L in LWN bothers you I've news for you: the document itself is very Linux-centric.

I haven't forgotten the printed version. I've read a bit more of the Lulu documentation. Apparently there is a model where I don't have to pay anything. People ordering the book pay a per-copy price and that's it (apparently with discounts for larger orders). If I submit it in letter/A4 format I don't have to do any reformatting and the price is less (for the color print) since there are fewer pages.

I'll probably try to do this after the PDF is freely available. People who like to have something in their hands will have their wishes. The only problem I see right now is that Lulu has a stupid requirement that the PDF documents must be generated with proprietary tools from Adobe. Of course I don't do this, I use pdfTeX. If this proves to be the case I guess I'll have to have a word with Bob Young...

Increasing Virtualization Insanity

People are starting to realize how broken the Xen model is with its privileged Dom0 domain. But the actions they want to take are simply ridiculous: they want to add the drivers back into the hypervisor. There are many technical reasons why this is a terrible idea. You'd have to add back (mind you, Xen before version 2 did this) all the PCI handling and lots of other low-level code which is now maintained as part of the Linux kernel. This would of course play nicely into Xensource's (the company's) pocket. Their technical people have so far turned this down, but I have no faith in this group: sooner or later they will want to be independent of OS vendors and have their own mini-OS in the hypervisor. Adios, remaining few advantages of the hypervisor model. But this is of course also the direction of VMWare, who loudly proclaim that in the future we won't have OSes as they exist today. Instead there will be only domains with mini-OSes, ideally mere hooks into the hypervisor OS, in which single applications run.

I hope everybody realizes the insanity of this:

  • If they really mean single application this must also mean single process. If not, you'll have to implement an OS which can provide multi-process services. But this means that you either have no support to create processes or you rely on a mini-OS which is a front for the hypervisor. In VMWare's case this is some proprietary mini-OS and I imagine Xensource would like to do the very same.
  • Imagine that you have such application domains. All nicely separated, and therefore replicated. The result is a maintenance nightmare. What if a component which is needed in all application domains has to be updated? In a traditional system you update the one instance per machine/domain. With application domains you have to update every single one and must not forget any.
And worst of all:
  • Don't people realize that this is the KVM model, just implemented much more poorly and in a more proprietary fashion? If you invite drivers and all the infrastructure into the hypervisor it is no longer small enough to have a complete code review. I.e., you end up with a full OS which is too large for that. Why not use one which already works: Linux.

I fear I have to repeat myself over and over again until the last person recognizes that the hypervisor model does not work for the type of virtualization of commodity hardware we try to achieve. Using a hypervisor was simply the first idea which popped into people's heads, since it had already been done before in quite different environments. The change from Xen v1 to v2 should have shown how rotten the model is. Only when you take a step back can you see the whole picture and realize the KVM model is not only better, it's the only logical choice.

I know people have invested in Xen and that KVM is not quite there yet, but a) there has been a lot of progress in KVM-land and b) the performance is constantly improving, and especially with next year's processor updates hardware virtualization costs will go down even further.

For sysadmin types this means: do what you have to do with Xen for now, but keep the investments small. For developers this means: don't let yourself be tied to a platform; use an abstraction layer such as libvirt to bridge over the differences. For architects this means: don't look to Xen for answers, base your new designs on KVM.