Ulrich Drepper

Producing PDFs [Nov. 21st, 2007|06:35 pm]
I don't want to throw this in with the announcement of the availability of the paper on memory and cache handling but I also don't want to forget it. So, here we go.

I write all the text I can using TeX (PDFLaTeX to be exact). This leads directly to a PDF document without intermediate steps. The graphics are done using Metapost because I'm better at programming than at drawing. Metapost produces Postscript-like files which some LaTeX macros then read and directly integrate into the PDF output.

The result in this case is a PDF with 114 pages which is only 934051 bytes in size. Just about 8kB for each page. Given that the text is multi-column and contains numerous graphics, this is amazingly small.

I mentioned before how badly OO.org sucks at exporting graphics. I bet all the other word processors, spreadsheets, etc. suck just as badly. Also, the generated PDFs for text are much, much bigger.

My guess is that if I'd written the document with OO.org the size would be north of 4MB, probably significantly more. I cannot understand why people do this to themselves and, more importantly, to others.

Memory and Cache Paper [Nov. 21st, 2007|06:09 pm]

Well, it's finally done. I've uploaded the PDF of the memory and cache paper to my home page. You can download it but do not re-publish it or make it available in any form to others. I do not want multiple copies flying around, at least not while I'm still intending to maintain the document.

With Jonathan Corbet's help the text should actually be readable. I had to change some of the text in the end to accommodate line breaks in the PDF. So I might have introduced problems; don't think badly of Jonathan's abilities. Besides, this is a large document. You simply go blind after a while, I know I do.

Which brings me to the next point. Even though I intend to maintain the document, don't expect me to do much in the near future. I've been working on it for far too long now and need a break. Integrating all the editing Jonathan produced plus today's line breaking has finished me off. I haven't even integrated all the comments I've received. I know the structure of the document is a bit weak in a few places, esp. section 5 which contains a lot of non-NUMA information. But it was simply too much work so far. Maybe some day.

The Evils of pkgconfig and libtool [Nov. 12th, 2007|06:02 pm]

If you need more proof that this is insane just look at some of the packages using it. I recently was looking at krb5-auth-dialog. The output of ldd -u -r on the original binary shows 26 unused DSOs.

This can be changed quite easily: add -Wl,--as-needed to the link line. Doing this for this package makes all but one of the unused dependencies go away. This has several benefits:

The binary size is actually measurably reduced.

   text    data     bss     dec     hex filename
  35944    6512      64   42520    a618 src/krb5-auth-dialog-old
  35517    6112      64   41693    a2dd src/krb5-auth-dialog

That’s a 2% improvement. Note that the saved dependencies are all recursive dependencies. The runtime is therefore not much affected (only a little). The saved data is pure overhead. Multiply the number by the thousands of binaries and DSOs which are shipped and the savings are significant.

The second problem to mention here is that not all unused dependencies are gone because somebody thought s/he was clever and used -pthread in one of the pkgconfig files instead of linking with -lpthread. That’s just stupid when combined with the insanity called libtool. The result is that the -Wl,--as-needed is not applied to the thread library.

Just avoid libtool and pkgconfig. At the very least fix up the pkgconfig files to use -Wl,--as-needed.

Energy saving is everybody's business [Nov. 8th, 2007|08:41 pm]

With the wide acceptance of laptops and even smaller devices more and more people have been exposed to devices limited by energy consumption. Still, programmers don't pay much attention to this aspect.

This statement is not entirely accurate: there has been a big push towards energy conservation in the kernel world (at least in the Linux kernel). With the tickless kernels we have the infrastructure to sleep for long times (long is a relative term here). Other internal changes avoid unnecessary wakeups. It is now really up to the userlevel world to do its part.

The situation is pretty dire here. There are some projects (e.g., PowerTOP) which highlight the problems. Still, not much happens.

I've been somewhat guilty myself. nscd (part of glibc) was waking up every 5 seconds to clean up its cache, even if often nothing was to be done. This program structure has several reasons. Good ones, but no compelling ones. So I finally bit the bullet and changed the program structure significantly to avoid unnecessary wakeups. The result is that now nscd at all times determines when the next cache cleanup is due and sleeps until then. Cache cleanups might be many hours out, so the code improved from one wakeup every 5 seconds to one wakeup every couple of hours.

nscd is a very small drop in the bucket, though. Just look at your machine and examine the running processes and those which are regularly started. PowerTOP cannot really help here (Arjan said something will be coming soon though).

There is a tool which can help, though: systemtap. Simply create a small script which traps the syscalls the violators will use and displays process information. The syscalls to use include: open, stat, access, poll, epoll, select, nanosleep, futex. For the latter five the problem is small timeout values.

I'll post a script to do this soon (just not now). But the guilty parties probably already know who they are. Just don't do this quasi busy waiting!

  • If a program has to react to a file change, removal, or creation, use inotify.
  • For internal cleanups, choose reasonable values and then compute the timeout so that you don't wake up when nothing has to be done.
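The second point can be sketched roughly as follows. This is not the nscd code, just a minimal illustration of the idea; the helper name next_timeout_ms and the array of expiry times are made up for the example. It computes a poll-style timeout from the earliest pending cleanup instead of using a fixed short interval.

```c
#include <limits.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper (not taken from nscd): given the absolute expiry
   times of all pending cleanups, return a poll()-style timeout in
   milliseconds until the earliest one is due.
   Returns -1 (sleep indefinitely) if nothing is scheduled and 0 if
   something is already due.  */
int next_timeout_ms(const time_t *expiry, size_t n, time_t now)
{
  if (n == 0)
    return -1;

  /* Find the earliest expiry time.  */
  time_t earliest = expiry[0];
  for (size_t i = 1; i < n; ++i)
    if (expiry[i] < earliest)
      earliest = expiry[i];

  if (earliest <= now)
    return 0;

  /* Clamp very distant deadlines; the caller simply recomputes the
     timeout after waking up.  */
  time_t delta = earliest - now;
  if (delta > INT_MAX / 1000)
    return INT_MAX;
  return (int) delta * 1000;
}
```

The result would be passed as the timeout argument to poll (or converted for select/epoll), so the process only wakes up for real events or when a cleanup is actually due.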

If you want to see how not to do it, look at something like the flash player (the proprietary one). If you have inadvertently started it, it'll remain active (even if no flash page is displayed) and it is basically busy waiting on something.

Let's show the proprietary software world we can do better.

Part 2 released [Oct. 1st, 2007|08:09 am]

Jonathan and crew published part 2 of the paper. If you have an LWN subscription you can read it here.

Directory Reading [Sep. 27th, 2007|10:45 am]

In the last weeks I have seen far too much code which reads directory content in horribly inefficient ways to let this slide. Programmers really have to learn to do this efficiently. Some of the instances I've seen are in code which runs frequently. Frequently as in once per second. Doing it right can make a huge difference.

The following is an exemplary piece of code. Not taken from an actual project but it shows some of the problems quite well, all in one example. I drop the error handling to make the point clearer.

  DIR *dir = opendir(some_path);
  struct dirent *d;
  struct dirent d_mem;
  while (readdir_r(dir, &d_mem, &d) == 0 && d != NULL) {
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s/somefile", some_path, d->d_name);
    int fd = open(path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

How many things are inefficient at best and outright problematic in some cases?

Let's enumerate:

  1. Why use readdir_r?
  2. Even the use of readdir is dangerous.
  3. Creating a path string might exceed the PATH_MAX limit.
  4. Using a path like this is racy.
  5. What if the directory contains entries which are not directories?

readdir_r is only needed if multiple threads are using the same directory stream. I have yet to see a program where this really is the case. In this toy example the stream (variable dir) is definitely not shared between different threads. Therefore the use of readdir is just fine. Should this matter? Yes, it should, since readdir_r has to copy the data into the buffer provided by the user while readdir has the possibility to avoid that.

Instead of readdir code should in fact use readdir64. The definition of the dirent structure comes from an innocent time when hard drives with a couple of dozen MB of capacity were huge. Things change and we need larger values for inode numbers etc. Modern (i.e., 64-bit) ABIs do this by default but if the code is supposed to be used on 32-bit machines as well the *64 variants should always be used.

Path length limits are becoming an ever-increasing problem. Linux, like most Unix implementations, imposes a length limit on each filename string which is passed to a system call. But this does not mean that in general path names have any length limit. It just means that longer names have to be implicitly constructed through the use of multiple relative path names. In the example above, what happens if some_path is already close to PATH_MAX bytes in size? It means the snprintf call will truncate the output. This can and should of course be caught but this doesn't help the program. It is crippled.

Any use of filenames with path components (i.e., with one or more slashes in the name) is racy: an attacker can change any of the contained path components. This can lead to exploits. In the example, the some_path string itself might be long and traverse multiple directories. A change in any of these will lead to the open call not reaching the desired file or directory.

Finally, while the code above works (the open call will fail if d->d_name does not name a directory) it is anything but efficient. In fact, the open system calls are quite expensive. Before any work is done, the kernel has to reserve a file descriptor. Since file descriptors are a shared resource this requires coordination and synchronization which is expensive. Synchronization also reduces parallelism, which might be a big issue in some code. The open call then has to follow the path which also is not free.

To make a long story short, here is how the code should look (again, sans error handling):

  DIR *dir = opendir(some_path);
  int dfd = dirfd(dir);
  struct dirent64 *d;
  while ((d = readdir64(dir)) != NULL) {
    if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
      continue;
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/somefile", d->d_name);
    int fd = openat(dfd, path, O_RDONLY);
    if (fd != -1) {
      ... do something ...
      close (fd);
    }
  }
  closedir(dir);

This rewrite addresses all the issues. It uses readdir64 which will do just fine in this case and it is safe when it comes to huge disk drives. It uses the d_type field of the dirent64 to check whether we already know the file is no directory. Most of Linux's file systems today fill in the d_type field correctly (including all the pseudo filesystems like sysfs and proc). Those file systems which do not have the information handy fill in DT_UNKNOWN which is why the code above allows this case, too. In some programs one also might want to allow DT_LNK since a symbolic link might point to a directory. But more often than not this is not the case and not following symlinks is a security measure.

Finally, the new code uses openat to open the file. This avoids the lengthy path lookup and it closes most of the races of the original open call since the pathname lookup starts at the directory read by readdir64. Any change to the filesystem below this directory has no effect on the openat call. Also, since the generated path is now very short (just the maximum of 256 bytes for d_name plus 10), we know that the buffer path is sufficient.

It is easy enough to apply these changes to all the places which read directories. The result will be smaller, faster, and safer code.
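For completeness, here is roughly what the same loop could look like with the error handling put back in. Treat it as a sketch under a few assumptions: the function name count_somefiles and its return convention are invented for the example, the "do something" step is reduced to counting, and the "." and ".." entries are skipped, which the snippet above does not do. It also uses plain readdir, which assumes a build where struct dirent is already 64-bit (a 64-bit ABI, or compiling with -D_FILE_OFFSET_BITS=64); on other builds use readdir64 as described above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* d_type/DT_*, dirfd, openat under strict modes */
#endif
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: for every subdirectory of some_path try to open
   "<subdir>/somefile" relative to the directory stream.  Returns the
   number of files which could be opened, or -1 if the directory itself
   cannot be read.  */
int count_somefiles(const char *some_path)
{
  DIR *dir = opendir(some_path);
  if (dir == NULL)
    return -1;

  int dfd = dirfd(dir);
  int found = 0;
  struct dirent *d;
  while ((d = readdir(dir)) != NULL) {
    /* Skip everything known not to be a directory.  */
    if (d->d_type != DT_DIR && d->d_type != DT_UNKNOWN)
      continue;
    /* Assumption of this sketch: "." and ".." are not interesting.  */
    if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
      continue;

    /* The name is relative to dfd, so this small buffer always suffices.  */
    char path[sizeof(d->d_name) + sizeof("/somefile")];
    int n = snprintf(path, sizeof(path), "%s/somefile", d->d_name);
    if (n < 0 || (size_t) n >= sizeof(path))
      continue;

    int fd = openat(dfd, path, O_RDONLY);
    if (fd != -1) {
      ++found;   /* stands in for the real work */
      close(fd);
    }
  }
  closedir(dir);
  return found;
}
```

A caller would invoke it as count_somefiles(some_path) and treat a negative result as "directory not readable".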

The Series is Underway [Sep. 21st, 2007|01:41 pm]

Jon Corbet has edited the first two sections of the document I mentioned earlier here and here.

The document will be published in multiple installments, beginning with Sections 1 and 2 which are available now. Since LWN is a business the reasonable limitation is put in place that for the first week only subscribers have access to it.

So, get a subscription to LWN.

If you find mistakes in the text let me know directly, either as a comment here or as a personal mail. Don't bother Jon with that.

SHA for crypt [Sep. 19th, 2007|02:55 pm]

Just a short note: I added SHA support to the Unix crypt implementation in glibc. The reason for all this (including replies to the extended "NIH" complaints) can be found here.

Publishing Update [Aug. 13th, 2007|08:43 pm]

A few weeks back I asked how I should publish the document on memory and cache handling. I got quite some feedback.

  • There was the usual "it doesn't matter but I want it for free" crowd.
  • Then there was the "even $8 for a book is too much for me" group. These are people from outside the US, and $8 translated to local currency and income is certainly far too much for many people. I do not throw this group in with the first.
  • Several people (all or mostly US-based) thought the idea of printed paper to be nice. The price was no issue.
  • Most people said a freely available PDF is more important than a printed copy. Some derogatory comments about lecturers who require books were heard. Others said editing isn't important.

Because of this first obnoxious group of people I would probably have gone with a print-only route. This attitude that just because somebody works on free software he always has to make everything available for free makes me sick. These are most probably the same people who never in their life produced anything that others found of value or they are the criminals working on (mostly embedded) projects exploiting free software.

But since I really want the document to be widely distributed and available to places where $8 is too much money I will release the PDF for free. But this won't happen right away. Unlike some of the people making comments I do think that editing is important. Fortunately having professional editing and a free PDF don't exclude each other.

I'll not go with a publisher (esp not these $%# at O'Reilly, as several people suggested). This would in most cases have precluded retaining the copyright and making the text available for free.

Instead the nice people at LWN, Jonathan Corbet and crew, will edit the document. They will then serialize it, I guess, along with the weekly edition. It's up to Jon to make this decision. The document has 8 large sections including the introduction, which means my guess is that after 7 installments the whole document will be published. Once this has happened I'll make the whole updated and edited PDF available.

This means if you think it's worth it, get a subscription to the LWN instead of waiting a week to read it for free.

So in summary, I get professional editing, keep the copyright, and might be able to help getting some more subscribers for the LWN. Win, win, win. If the L in LWN bothers you I've news for you: the document itself is very Linux-centric.

I haven't forgotten the printed version. I've read a bit more of the Lulu documentation. Apparently there is a model where I don't have to pay anything. People ordering the book pay a per-copy price and that's it (apparently with discounts for larger orders). If I submit it in letter/A4 format I don't have to do any reformatting and the price is less (for the color print) since there are fewer pages.

I'll probably try to do this after the PDF is freely available. People who like to have something in their hands will have their wishes. The only problem I see right now is that Lulu has a stupid requirement that the PDF documents must be generated with proprietary tools from Adobe. Of course I don't do this, I use pdfTeX. If this proves to be the case I guess I'll have to have a word with Bob Young...

Increasing Virtualization Insanity [Aug. 13th, 2007|04:52 pm]

People are starting to realize how broken the Xen model is with its privileged Dom0 domain. But the actions they want to take are simply ridiculous: they want to add the drivers back into the hypervisor. There are many technical reasons why this is a terrible idea. You'd have to add (back, mind you, Xen before version 2 did this) all the PCI handling and lots of other lowlevel code which is now maintained as part of the Linux kernel. This would of course play nicely into Xensource's (the company) pocket. Their technical people so far turn this down but I have no faith in this group: sooner or later they want to be independent of OS vendors and have their own mini-OS in the hypervisor. Adios remaining few advantages of the hypervisor model. But this is of course also the direction of VMWare who loudly proclaim that in the future we won't have OS as they exist today. Instead only domains with mini-OS which are ideally only hooks into the hypervisor OS where single applications run.

I hope everybody realizes the insanity of this:

  • If they really mean single application this must also mean single-process. If not, you'll have to implement an OS which can provide multi-process services. But this means that you either have no support to create processes or you rely on a mini-OS which is a front for the hypervisor. In VMWare's case this is some proprietary mini-OS and I imagine Xensource would like to do the very same.
  • Imagine that you have such application domains. All nicely separated because replicated. The result is a maintenance nightmare. What if a component which is needed in all application domains has to be updated? In a traditional system you update the one instance per machine/domain. With application domains you have to update every single one and not forget one.
And worst of all:
  • Don't people realize that this is the KVM model, just implemented much more poorly and more proprietary? If you invite drivers and all the infrastructure into the hypervisor it is not small enough anymore to have a complete code review. I.e., you end up with a full OS which is too large for that. Why not use one which already works: Linux.

I fear I have to repeat myself over and over again until the last person recognizes that the hypervisor model does not work for the type of virtualization for commodity hardware we try to achieve. Using a hypervisor was simply the first idea which popped into people's head since it was already done before in quite different environments. The change from Xen v1 to v2 should have shown how rotten the model is. Only when you take a step back you can see the whole picture and realize the KVM model is not only better, it's the only logical choice.

I know people have invested in Xen and that KVM is not there yet but a) there has been a lot of progress in KVM-land and b) the performance is constantly improving and especially with next year's processor updates hardware virtualization costs will go down even further.

For sysadmin types this means: do what you have to do with Xen for now. But keep the investments small. For developers this means: don't let yourself be tied to a platform. Use an abstraction layer such as libvirt to bridge over the differences. For architects this means: don't look to Xen for answers, base your new designs on KVM.

How to publish? [Jun. 25th, 2007|10:08 am]

That is meant as a question to the readers. The problem I have right now is that I have more or less finished the paper accompanying one of the talks I gave at the Red Hat Summit in Nashville last year. The slides for the talk about CPU Caches are available. But quite honestly, as most slide sets, they don't do the topic any justice. I had to compress things to < 45 mins which is of course not enough. The paper covers everything I can currently think of and which makes sense with relation to CPU caches and CPU memory, as far as programmers are concerned (nothing for hardware people). The title I currently use is

What Every Programmer Should Know About Memory

and I think this is adequate.

For this reason I usually write a paper on the important topics I talk about. And this topic qualifies. I consider the topic especially important since it's almost never treated in the software world at all. College grads today in most cases have not the slightest clue about this topic. Ideally I'd like the paper to be picked up by some lecturers (like they do for many of my other publications) and used in a course. Heck, I'm even willing to teach it myself if that is what it takes to get credibility.

The problem I'm facing is that the document is (using my usual paper style, two column etc) around 100 densely packed pages long. Some of the people I've shown it to suggested that it should rather be published as a book. I'm a bit unsure about this. I have a few publishers who for a long time have kept pestering me about writing something for them (some even prematurely submitted titles to distributors!). One I talked to would be willing to print it even though it's thin for a book. But there are a lot of pluses and minuses all around:

My PDF only
Going this route means the document is easy to change and extend. The format is exactly as I want it. The visibility is restricted, not in the print market. No professional review. Due to the size (and use of color) it is hard to print.
Go with a publisher
Professional editing, maybe a college edition, visibility through listing in catalogs etc. Additionally available as e-book. But it likely means the color has to go (printing in color is expensive) and there will be no free-of-charge copy. Getting a revision out will be almost impossible.
Go with Lulu
The alternative publishing route: I could submit an appropriately formatted PDF to Lulu and have them publish it. Demand printing, ISBN available. B&W and color printing possible. Even e-books if anybody cares. No professional editing.

Going with Lulu has the advantages I want but it's quite an effort. And there are costs associated with it. I do not plan to make money out of all this but I'd have to recover the costs. Excess gains would probably go to charity (in my case this is the Monterey Bay Aquarium in case anybody is interested).

So, the questions I have and would like to get some feedback on are:

  • Are printed copies wanted at all? Especially for those teaching, is it a prerequisite?
  • If yes, do you prefer a professional, more expensive book?
  • Or perhaps an amateur-ish publication which is either B&W and cheap (I guess not much more than $10)...
  • ... or a colored print for around $30. The paper has currently around 60 diagrams and color helps.

If you have an opinion, send a mail or add a comment to the blog (which won't be published). I know it is not easy to answer given that you haven't seen the material. But this is the same for most books, isn't it? Look at the slides and assume 100 times more details. I doubt I'll find many people who know all these details now (I had to do research myself).

grep and color [Jun. 1st, 2007|06:53 am]

I cannot believe there are still people who are surprised when they see me working with the command line on my machine, or when I tell them that the output of grep can use highlighting. Just add --color to the command line (with the optional argument just like ls). I implemented that more than six years ago. In my .bashrc I have the following:

alias egrep='egrep --color=tty -d skip'
alias egrpe='egrep --color=tty -d skip'
alias fgrep='fgrep --color=tty -d skip'
alias fgrpe='fgrep --color=tty -d skip'
alias grep='grep --color=tty -d skip'
alias grpe='grep --color=tty -d skip'

Yes, I mistype grep often enough to warrant the extra aliases. Using tty as the color mode means that if I pipe the output into another program there won't be any color escape sequences added which could irritate those programs.

Just make your life easier and add such aliases, too.

pthread_t and similar types [May. 22nd, 2007|11:46 am]

Constantly people complain that the runtime does not catch their mistakes. They are hiding behind this requirement in the POSIX specification (for pthread_join in this case, also applies to pthread_kill and similar functions):

       The pthread_join() function shall fail if:
       [...]

       ESRCH  No thread could be found corresponding to that specified by the given thread ID.

The glibc implementation follows this requirement to the letter. *IFF* we can detect that the thread descriptor is invalid we do return ESRCH.

But: the above does not mean that all uses of invalid thread descriptors must result in ESRCH errors. The reason is simple: the standard does not restrict the implementation in any way in the definition of the type pthread_t. It does not even have to be an arithmetic type. This means it is valid to use a pointer type and this is just what NPTL does.

Nobody argues that functions like strcpy should not dump a core in case the buffer is invalid. The same for pthread_attr_t references passed to pthread_attr_init etc. The use of pthread_t when defined as a pointer is no different. The only complication is in the understanding that pthread_t can be a pointer type. This is obvious for void* etc.

In the POSIX committee we discussed several times changing the pthread_join and pthread_kill man pages. The ESRCH errors could be marked as "may fail". But

  1. this really is not necessary, see above.
  2. it would mean we have to go through the entire specification and treat every other place where this is an issue the same way.

If somebody wants to do the work associated with the second step above and we have confidence in the results, we (= Austin Group) might make the change at some later date. But it is a rather high risk for no real gain. Programmers have to educate themselves anyway.

What remains is the question: how can programs avoid these mistakes? It is actually pretty simple: the program should make sure that no calls to pthread_kill, for instance, can happen when the thread is exiting. One way to solve this problem is:

  1. Associate a variable running of some sort and a mutex with each thread.
  2. In the function started by pthread_create (the thread function) set running to true.
  3. Before returning from the thread function or calling pthread_exit or in a cancellation handler acquire the mutex, set running to false, unlock the mutex, and proceed.
  4. Any thread trying to use pthread_kill etc first must get the mutex for the target thread, if running is true call pthread_kill, and finally unlock the mutex.
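The four steps above can be sketched as follows. This is only one possible arrangement; the names tracked_thread, tracked_start, and tracked_kill are invented for this example and are not part of any library. The creator holds the mutex across pthread_create so that the new thread cannot mark itself running before the thread ID has been stored.

```c
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

/* Step 1: a running flag and a mutex per thread (hypothetical type).  */
struct tracked_thread {
  pthread_t tid;
  pthread_mutex_t lock;
  bool running;
};

/* Step 3: executed on normal return, pthread_exit, and cancellation.  */
static void mark_stopped(void *arg)
{
  struct tracked_thread *t = arg;
  pthread_mutex_lock(&t->lock);
  t->running = false;
  pthread_mutex_unlock(&t->lock);
}

static void *thread_func(void *arg)
{
  struct tracked_thread *t = arg;

  /* Step 2.  Taking the lock also waits until tracked_start has
     stored the thread ID.  */
  pthread_mutex_lock(&t->lock);
  t->running = true;
  pthread_mutex_unlock(&t->lock);

  pthread_cleanup_push(mark_stopped, t);
  /* ... the thread's real work would go here ... */
  pthread_cleanup_pop(1);   /* run mark_stopped on the way out */
  return NULL;
}

int tracked_start(struct tracked_thread *t)
{
  pthread_mutex_init(&t->lock, NULL);
  t->running = false;
  pthread_mutex_lock(&t->lock);
  int err = pthread_create(&t->tid, NULL, thread_func, t);
  pthread_mutex_unlock(&t->lock);
  return err;
}

/* Step 4: only signal the thread while it is known to be running.
   Returns ESRCH if the thread has not started yet or is already gone.  */
int tracked_kill(struct tracked_thread *t, int sig)
{
  int err = ESRCH;
  pthread_mutex_lock(&t->lock);
  if (t->running)
    err = pthread_kill(t->tid, sig);
  pthread_mutex_unlock(&t->lock);
  return err;
}
```

Callers use tracked_kill(&t, sig) wherever they would otherwise call pthread_kill directly; the ESRCH result then comes from the program's own bookkeeping and not from the implementation guessing whether a descriptor is valid.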

This ensures that no invalid descriptor is used. But I can already hear people complain:

This is too expensive!

That is ridiculous. The implementation would have to do something similar if it tried to catch bad thread descriptors. In fact, it would have to do more. What is important is to recognize that this price would have to be paid by every program, not just the buggy ones. This is wrong. Only those people who need this extra protection should pay the price.

But I don't have control over the code calling pthread_create!

Boo hoo, cry me a river. Don't expect sympathy for using proprietary software. I will never allow good free software to be shackled because of proprietary code. If you cannot get this changed in the code you pay good money for this just means it is time to find a new supplier or, even better, use free software.

In summary, this is entirely a problem of the programs which experience them. Existing Linux systems are proof that it is possible to write complex programs without requiring the implementation to help incompetent programmers. We will have a few more words in the next revision of the POSIX specification which talk about this issue. But I expect they will be ignored anyway and all focus remains on the "shall fail" errors of pthread_kill etc.

The Growing Importance of Parallel Programming [May. 12th, 2007|10:49 am]
At the 2007 Red Hat Summit in San Diego which just wrapped up yesterday I gave a talk about parallel programming which the marketing folks retitled "Programming for tomorrow's high speed processors, today".

The crux of the talk is that programmers in the future cannot always rely on improving hardware to make their programs run faster. This is summarized nicely in the following graph which I generated from performance data for x86 processors.



The crucial part is the divergence of the two lines going forward and the flattening of the blue line. This means programs which are not able to take advantage of ever increasing numbers of processing cores simply won't run (much) faster.

Parallel programming is hard. There are algorithms to change to allow more than one thread to run in parallel. Well, not necessarily threads: especially on Linux one should use processes if the sharing requirement between the processes makes this feasible.

There are data structures to lay out correctly to allow a) vectorization and b) data parallelization. Vectorization is important if one wants to come even close to the peak performance listed for the processor. But when you do this you also have to know a lot about CPU design (pipelines etc), caches, and memory.

And then there is something people might have heard about but didn't really register: co-processors are back. Intel's Geneseo and AMD's Torrenza are technologies to couple 3rd party processors tightly to the existing processor-memory mesh.

In general I think the industry is entirely ill-prepared for these upcoming changes. Many/most programmers are not able to write code with these requirements. Companies and other organizations will have to invest in education. The system providers (like Red Hat) have to find ways to make parallel programming easier.

One big step in the right direction is OpenMP. It is officially supported in gcc 4.2, and Red Hat has backported the changes to our gcc 4.1 used in RHEL5 and Fedora Core 6 and later. Not only does OpenMP allow relatively easy conversion of existing code, it also frees the programmer from dealing with all the details of thread lifetime handling, thread stacks, etc. Even mutual exclusion happens at a higher level. All this is good. It will make programmers more productive if only it is used more often.

But there is one more thing: the OpenMP runtime is basically in complete control. It can decide on using just one thread or many threads. It can decide where to run threads and many more things. All these details are hidden from the programmer. This is a good thing since it allows the runtime to perform optimizations. I'll have more about this at a later date.
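To give an idea of how little has to change in existing code, here is a minimal sketch (the function sum_squares is made up for the example, it is not from the talk). A plain serial loop becomes parallel through a single pragma; thread creation, work distribution, and the combination of the per-thread partial sums are all left to the OpenMP runtime.

```c
/* Hypothetical example: sum of the squares 1..n.  Without the pragma
   this is ordinary serial code; with it the runtime decides how many
   threads execute the loop and merges their partial sums.  */
long sum_squares(long n)
{
  long sum = 0;
#pragma omp parallel for reduction(+ : sum)
  for (long i = 1; i <= n; ++i)
    sum += i * i;
  return sum;
}
```

With gcc the pragma takes effect when compiling with -fopenmp; without that option it is ignored and the function computes the same result serially.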

In summary, programmers have to learn about parallelism, whether that means re-learning it or learning it for the first time. I think the topic of this talk is very important. If you are a Red Hat customer you could potentially ask for somebody from Red Hat to come in and talk about these issues. I'll give the slides and the details to our consulting organization and possibly also sales engineers. I cannot make any promises but I'll encourage those gals and guys to be willing to talk about this. If you're a big enough customer and you demand it, I might (have to) come out myself, if this is wanted. Or somebody can organize gatherings in places I have to go to anyway and have me speak there.

nscd and DNS TTL [May. 12th, 2007|10:04 am]
Recently some people spread their non-existing knowledge about nscd (Name Service Cache Daemon) by claiming it ignores the TTL (time-to-live) value a DNS server returns. As far as I know this rampant ignorance is especially wide-spread in the ubuntu world. They claim that for this reason one has to run a local, caching DNS server. This is complete nonsense. nscd does handle TTL for a long time now (committed to the public CVS on 2004-9-15). All reasonable requests are handled, i.e., all getaddrinfo requests.

As I have pointed out many times before (here and here and in other places), it is completely unacceptable today to use gethostbyname etc. These functions simply don't work. That is why I found it unnecessary to make the implementation of nscd more complicated, and to add more compatibility and maintenance problems, just to fix one of the many problems these interfaces have. Just don't use them and convert all your programs (I think we've done just that for all of RHEL and Fedora nowadays). Also don't use

  getent hosts some.host


You have to use

  getent ahosts some.host


For all getaddrinfo lookups the TTL value from DNS replies takes precedence over the TTL value from /etc/nscd.conf. The latter is used for services which do not provide a TTL themselves (today all other services).
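
For programs, the counterpart of getent ahosts is a plain getaddrinfo call. The following is only a sketch: the helper name first_address and its buffer interface are made up for illustration, and it reports just the first address of the list.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: resolve HOST with getaddrinfo and store the
   first returned address in numeric form in BUF.  Returns 0 on
   success, otherwise the error code from getaddrinfo/getnameinfo
   (printable with gai_strerror).  */
int first_address(const char *host, char *buf, size_t len)
{
  struct addrinfo hints;
  struct addrinfo *res;
  int err;

  memset(&hints, 0, sizeof hints);
  hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6, whatever exists */
  hints.ai_socktype = SOCK_STREAM;  /* one entry per address */

  err = getaddrinfo(host, NULL, &hints, &res);
  if (err != 0)
    return err;

  /* Convert the first entry of the result list to a numeric string. */
  err = getnameinfo(res->ai_addr, res->ai_addrlen, buf, len,
                    NULL, 0, NI_NUMERICHOST);
  freeaddrinfo(res);
  return err;
}
```

A real program would normally walk the whole list through ai_next instead of stopping at the first entry.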

getaddrinfo is not just for IPv6 [Mar. 7th, 2007|01:11 am]

I've heard far too often that getaddrinfo is only interesting for IPv6 and therefore can be ignored since one does not have IPv6.

Aside from the fact that all programs should be protocol independent, this statement is bogus. gethostbyname etc. do not perform correctly in some situations even where only IPv4 is involved.

Assume you have an internal IPv4 network with, say, 192.168.x.y addresses. In addition you have a server (web server, for instance) which is also visible on the Internet. This server has two addresses: one 192.168.x.y address and one global address. The client is a NATed machine on the intranet.

Now what happens if the nameserver returns both addresses to a query for the addresses of said server? With gethostbyname the addresses are returned to the caller in the order they are received from the DNS server. Maybe some randomization is applied. In short, it is possible that the internal machine sees the public IPv4 address first and then connects to it. This is not only wasteful (the request has to be routed through a switch), it might even be dangerous (the traffic might actually have to go through the Internet).

With getaddrinfo this is not the case. The sorting according to RFC 3484 makes sure that the internal address of the server is returned first. The sorting function will notice that the source address used on the client is also an internal address and therefore the internal address of the server is a better match than the global address.
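
To make the matching idea concrete, here is a strongly simplified sketch of just one of the RFC 3484 destination rules (rule 2, "prefer matching scope"). It is not the glibc implementation. The names ipv4_scope, rule2_compare and the IPV4 macro are made up for illustration, and the sketch assumes the client has a single source address that is used for both candidates. RFC 3484 gives the private ranges site-local scope, loopback and 169.254/16 link-local scope, and every other IPv4 address global scope.

```c
#include <stdint.h>

/* Scope values as RFC 3484 uses them (taken from IPv6 multicast). */
enum { SCOPE_LINK = 0x2, SCOPE_SITE = 0x5, SCOPE_GLOBAL = 0xe };

/* Build an IPv4 address in host byte order from its four octets. */
#define IPV4(a, b, c, d) \
  (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) \
   | ((uint32_t) (c) << 8) | (uint32_t) (d))

/* Scope of an IPv4 address given in host byte order. */
int ipv4_scope(uint32_t addr)
{
  if ((addr >> 24) == 127 || (addr >> 16) == 0xa9fe)
    return SCOPE_LINK;                  /* 127/8, 169.254/16 */
  if ((addr >> 24) == 10                /* 10/8 */
      || (addr >> 20) == 0xac1          /* 172.16/12 */
      || (addr >> 16) == 0xc0a8)        /* 192.168/16 */
    return SCOPE_SITE;
  return SCOPE_GLOBAL;
}

/* Rule 2 only: negative if destination DA should be sorted first,
   positive if DB should be, 0 if this rule does not decide.  */
int rule2_compare(uint32_t src, uint32_t da, uint32_t db)
{
  int a_match = ipv4_scope(da) == ipv4_scope(src);
  int b_match = ipv4_scope(db) == ipv4_scope(src);
  return b_match - a_match;
}
```

With a client at 192.168.1.10 and a server reachable at both a global address and 192.168.7.20, the comparison prefers the 192.168.7.20 address, which is the behavior described above. The real algorithm applies several more rules before and after this one.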

In summary, getaddrinfo is not only about IPv6. The old interfaces were simply completely inadequate and should never be used. If you still haven't converted your programs to use getaddrinfo instead of gethostbyname and gethostbyname2, do it now. Some time ago I wrote a brief intro.


Xensource/VMWare start sandbagging [Feb. 26th, 2007|10:07 pm]

With KVM proving more and more that it is viable, Xensource and VMWare start sandbagging. They call KVM immature and the wrong approach (see their quotes in the CNET article).

Calling KVM immature is, well, premature and misleading. Xen has a head start of several years. KVM is today not supposed to be in the state Xen is. Nevertheless, KVM already has hardware virt support, SMP support, support for 64-bit host and guests (despite what the article says), live migration, and more. Xen simply started from the other direction, namely para-virt. Hardware virt took them a long time and a lot of help from the hardware vendors. I think para-virt for KVM will be done RSN.

But immature is not the worst complaint. Claiming the hypervisor approach is the only viable option is what should get people worked up. Look at the arguments:

[...] but hypervisors offer better performance, have security advantages, and juggle the competing needs of multiple virtual machines better [...]
In order to [deliver Virtual Infrastructure], you need the separate hypervisor layer.

These are bogus claims. And you have to realize where they come from. VMWare’s ESX is a kernel of its own, one which only few people work on (compared to something like Linux). Device drivers will always be a nightmare unless/until devices get their own PCI devices (once DMA can be virtualized). Nevertheless, ESX is a full OS by itself. Plus, ESX has the service console, a Linux OS. The service console of course has to have some control over the hypervisor.

For Xen the situation is similar. Here the hypervisor, after the mistakes of the 1.x series, doesn’t have device drivers included and uses a privileged domain, a complete OS.

This means neither Xen nor VMWare has less code. I’d say they even have more code that is part of the privileged code base. Certainly a Linux installation hosting KVM domains can be scaled down to only have the kernel, kqemu, and the service console.

As for specific security support, Xen has in theory shype or whatever will come out of it. Like SELinux, it’s based on Flask. But it still is a separate code base. And if shype is actually moved out of the hypervisor itself and into a separate domain you have even more interfaces to worry about. I haven’t seen any security features of this caliber even mentioned for VMWare. With KVM, the SELinux policy governing the kernel can also handle the KVM module. It’s after all part of the same kernel. One implementation, one policy.

As for performance, let’s wait until KVM actually has been optimized. Ingo did some work on a para-virt network driver and the results are simply great. It’s just that performance tuning hasn’t been a focus. In theory there is absolutely no reason why the KVM approach should be any slower.

As for better scheduling with a hypervisor: that can only be a joke. Especially for Xen, the privileged domain (Dom0) has to be scheduled without the hypervisor having any insight into the Dom0 kernel. How can this be better? For VMWare we have a simple-minded OS serving as the hypervisor. The Linux kernel has support for all kinds of situations, including NUMA machines, many processor machines, HT and multi-core processors etc. And it’s an O(1) scheduler which sooner or later will make a difference even for hypervisors.

And then there are the advantages the KVM solution has. For instance, ever tried to run Xen on a laptop while on battery? It’s almost not worth it since power management does not exist. The machine will always run at full power. VMWare has the same problem. This is not only an issue for laptops. Cooling is a major issue in data centers. Maybe even a bigger issue with increasing density.

NUMA has already been mentioned, but there is also the memory allocation issue as part of the problem. Xen has none of it, and I bet VMWare has either nothing or something simple. The Linux kernel can provide KVM with all kinds of support, as the performance on big NUMA machines like SGI’s Altix shows.

In short: neither Xen nor VMWare has any real advantages which cannot be surmounted by giving KVM more time to catch up, i.e., granting it the same time to develop the features. On the other hand there are device driver issues which VMWare will never be able to master. Xen is not included in the mainstream kernel and even with the paravirt_ops interface it will be lagging because it needs synchronization.

So why do these companies (and Xensource makes this statement as a company) make such statements? The answer should not be surprising: they have a lot or all to lose. KVM can be the all-in-one solution, unlike any of the others. Xensource and VMWare want to get on your system by providing a hypervisor which then can be used with all kinds of OSes. But: they are in ultimate control. The idea that there suddenly is a virtualization solution which does not need any hypervisor must be absolutely frightening to them. So, they try to suppress this new technology from the start.

Don’t believe the propaganda. Try KVM once Fedora 7 is out. I expect it to be updated over the lifetime of Fedora 7. Or for the more adventurous people, start using rawhide now and keep using it.


DST Panic [Feb. 22nd, 2007|01:27 pm]

With the DST rule changes for the US going into effect real soon (2007-3-11) people are panicking.

  1. there are those who run completely obsolete OSes. People contact us for support of RHL9 or even RHL7.2 (that's the predecessor of Fedora, for those who don't know). Guess what, even FC4 is not supported anymore, let alone anything earlier. The DST change is just one of many good reasons to update. Security is the other big one.
  2. many applications are broken and they use private timezone data. In their quest to achieve perfect portability the people writing the Java runtime added the data into their sources. And unfortunately the same has been done for libgcj. Only somebody without the slightest clue about the nature of DST rules would do something like this. Would Sun/BEA/IBM/... be willing to update their JVMs 20 times a year for all DST changes (if they had that broad support in the first place)? Of course not. Only people in countries with stable rules would not think about it. There probably hasn't been a day in the life of the JVMs when the data was really accurate and complete.

Time zone changes are nothing special. I guess on average we see about 20 each year, maybe more. Do you see people panicking 20 times a year? No, only if the US is involved. Yes, you can argue that more computers are affected, but aren't the computers in the other countries affected by those changes just as important to the people living there? There are also banks, utilities, etc. which need to keep the correct time.

Having lived here in the US now for quite a few years I think the root of the problem is the same that keeps the US from making progress on other fronts: fear of change and trying to prevent change through denial. Another example? Take the measurement system. When knowing the metric and the imperial system equally well, who would argue the latter is better? And it's not that people don't know the metric system at all. There are large numbers of people who serve/d in the military and all these people had to use it in their job. Every food container also shows grams. But I'm getting off-topic.

Fact is, people delay things. Delaying to update their OSes and even delaying to think about the problem. It might go away on its own and then no time has been wasted. But guess what: the DST change is coming.

So, people, get your act together. Update your OSes. If you really for some obscure reason cannot do this, update the timezone data (in /usr/share/zoneinfo). The data we have today is fully compatible, even the extended file format. Since there is no glibc update coming you could just overwrite the files without fear of reverting the changes inadvertently later. Update applications with their own timezone data. There are lots of (especially big) programs which come with their own data. The timezone data is free to copy by everybody so companies take advantage of this. Now you know why I always advised against this. More likely than not, updates for old versions of these programs are not available anymore. Make sure you let the companies who produce this kind of shitty product know what you think of them. Duplicating timezone data is always bad. As for JVM: have fun! Old versions will get no updates and new versions often don't run on old OS versions. At least libgcj should now be fixed for good due to the work of one of my colleagues.

Once the timezone data is updated some more steps might be needed. Many programs will just work. glibc detects updates to the /etc/localtime file and reloads the data. Lots of people complained about this in the past and present since it means time operations cause many filesystem operations, but it is critical in some situations. Processes which only use the time functions that do not implicitly call tzset() must be restarted. The same is true for processes which have the TZ environment variable set. In general you cannot know whether a process falls into either of the latter categories. The safest thing to do is to reboot the machine.
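
The role of tzset() and the TZ variable can be sketched with a small helper. Everything here is an assumption made for illustration: the helper name local_hour, the use of POSIX-style TZ rule strings instead of the zoneinfo files, and the sample timestamp 1173614400, which is 2007-03-11 12:00:00 UTC. localtime_r is one of the functions that is not required to call tzset(), so the helper calls it explicitly after changing TZ.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: return the local hour (0-23) of WHEN under the
   timezone description TZ, or -1 on error.  If ISDST is not NULL it
   receives the tm_isdst flag of the converted time.  */
int local_hour(const char *tz, time_t when, int *isdst)
{
  struct tm tm;

  if (setenv("TZ", tz, 1) != 0)
    return -1;
  /* Without this call the process may keep using the old rules. */
  tzset();

  if (localtime_r(&when, &tm) == NULL)
    return -1;
  if (isdst != NULL)
    *isdst = tm.tm_isdst;
  return tm.tm_hour;
}
```

Called with the new US rule written as "EST5EDT,M3.2.0,M11.1.0" the sample timestamp falls into daylight saving time (08:00 local). With the old rule "EST5EDT,M4.1.0,M10.5.0" it does not (07:00 local). A long-running process that never calls tzset() again keeps whatever rules it loaded first, which is why restarting is the safe choice.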


More array fun [Feb. 20th, 2007|03:27 pm]

As a continuation of a previous post, here's another thing I frequently stumble across:

#include <stdio.h>
#include <string.h>
int main(void)
{
  const char s[] = "hello";
  strcpy (s, "bye");
  puts (s);
  return 0;
}

Yes, this code will produce a warning. But it will run. Slowly, since it does not do what the programmer actually meant. s is a dynamic variable. The compiler has to allocate space on the stack (or in TLS) and then copy the string from some static, read-only area into it. This of course is not only slow and a waste of memory, it also means that the newly created string is NOT read-only. The compiler-generated function prologue has to write to the memory.

Whenever you write code where you define an array in the scope of a function, always stop and think what the semantics should be. For all constant arrays it is almost always correct to have exactly one copy (it cannot be changed). If one copy is needed or OK then don't forget the static:

  static const char s[] = "hello";

If you do this, all of a sudden the code will not only produce a warning, it will also crash at runtime since the string is stored in read-only memory and s is now not really a variable anymore, it's a label for the region in read-only memory.
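
One way to see the difference without crashing anything is to compare the address of the array in two invocations of the same function that are live at the same time. This is only a sketch and the two function names are made up for illustration. Call each function with NULL. It then calls itself once and passes the address of the outer array to the inner invocation.

```c
#include <stddef.h>

/* Automatic array: every invocation gets its own object, so the inner
   invocation sees a different address.  Returns 1 when called with
   NULL.  */
int auto_copies_differ(const char *outer)
{
  const char s[] = "hello";

  if (outer == NULL)
    return auto_copies_differ(s);  /* hand the outer copy down */
  return outer != s;
}

/* Static array: there is exactly one object for the whole program, so
   both invocations see the same address.  Returns 0 when called with
   NULL.  */
int static_copies_differ(const char *outer)
{
  static const char s[] = "hello";

  if (outer == NULL)
    return static_copies_differ(s);
  return outer != s;
}
```

The first variant needs a separate array object in each invocation. The second needs no per-call setup at all, which is the point of the static.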


But I Have Nothing Of Interest On My Machine [Feb. 12th, 2007|09:09 pm]

I'm sick and tired of hearing people saying

I don't have to secure my machine since I have nothing of interest on it. Nobody would want to steal anything I have.

That's absolutely not the point. Yes, some attackers are after personal data like account numbers. But this is not all:

  • passwords are high on the list since people use the same password for all their accounts, be it banks, Amazon, eBay, whatever. Do you still agree you don't have anything interesting protected by those passwords?
  • if a machine can be taken over it can be used to a) sniff the local network, b) attack other machines, c) send spam. Some ISPs already stopped being lenient towards idiots who allow this to happen unchecked and they simply suspend the accounts. Do you care about having an Internet connection?

Security always matters even if the data stored on the machine is benign. Nobody should be allowed to even run machines which have no distinction between user and administrator. This includes more and more Linux people because new idiot distributions like Linspire, NimbleX, etc. pop up. No machine should be without firewalls, in both directions. For RHEL/Fedora users it of course doesn't stop there; we have many more security features, and if it were up to me I would take out the switch to disable them.

Next time you see somebody writing nonsense like the above (or hear them talking like this) do me a favor: smack them a bit so that they come to their senses. These are the people who create the opportunity for spam, phishing, and other illicit activities. Heck, they deserve more than a bit of smacking...

