putils
The original plan was to have some programs I wrote added to the procps or util-linux packages, but the maintainers haven't been responsive. Therefore here they are in a package of their own.
I call the package putils (available from my private server), and the following programs are available so far:
- plimit: Show or set the limits of a process
- pfiles: Show information about the files open inside a process
These programs will be familiar to Solaris users. There are likely a few more programs to follow.
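For the curious: on Linux most of this information can be read straight out of /proc. The following is only a hypothetical sketch (not the actual putils source) of how a plimit-like tool could dump a process's limits from /proc/<pid>/limits:

/* Hypothetical sketch of a plimit-like tool: print the limits of a
   process by reading /proc/<pid>/limits.  Not the actual putils code.  */
#include <stdio.h>
#include <stdlib.h>

int
main (int argc, char *argv[])
{
  if (argc != 2)
    {
      fprintf (stderr, "usage: %s <pid>\n", argv[0]);
      return EXIT_FAILURE;
    }

  char path[64];
  snprintf (path, sizeof (path), "/proc/%s/limits", argv[1]);

  FILE *fp = fopen (path, "re");        /* 'e' = close-on-exec */
  if (fp == NULL)
    {
      perror (path);
      return EXIT_FAILURE;
    }

  char line[256];
  while (fgets (line, sizeof (line), fp) != NULL)
    fputs (line, stdout);

  fclose (fp);
  return EXIT_SUCCESS;
}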
pagein
I've updated the pagein tool to compile with a recent valgrind version. The tarball also contains a .spec file. I had to work around a bug in valgrind in Fedora 16 and 17.
The tarball is available for download.
Cancellation and C++ Exceptions
In NPTL, thread cancellation is implemented using exceptions. In general this does not conflict with the mixed use of cancellation and exceptions in C++ programs; it works just fine. Some people, though, write code which doesn't behave as they expect. Here is a short example:
#include <cstdlib>
#include <iostream>
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t c = PTHREAD_COND_INITIALIZER;

static void *
tf (void *)
{
  try {
    ::pthread_mutex_lock(&m);
    ::pthread_cond_wait (&c, &m);
  } catch (...) {
    // do something
  }
}

int
main ()
{
  pthread_t th;
  ::pthread_create (&th, NULL, tf, NULL);

  // do some work; simulate using sleep
  std::cout << "Wait a bit" << std::endl;
  sleep (1);

  // cancel the child thread
  ::pthread_cancel (th);

  // wait for it
  ::pthread_join (th, NULL);
}
The problem is in the function tf. It contains a catch-all clause which does not rethrow the exception. This is tempting to write but should really never appear in any code: the rule C++ experts developed states that catch-all clauses must rethrow. If they don't, strange things can happen, since one doesn't always know exactly which exceptions are thrown. The code above is just one example. Running it aborts the process:
$ ./test
Wait a bit
FATAL: exception not rethrown
Aborted (core dumped)
The exception used for cancellation is special: it cannot simply be ignored. This is why the program aborts.
Simply adding the rethrow will cure the problem:
@@ -13,6 +13,7 @@
     ::pthread_cond_wait (&c, &m);
   } catch (...) {
     // do something
+    throw;
   }
 }
But adding an unconditional rethrow might not give the code the semantics it wants for ordinary exceptions. The more general solution is therefore to handle the cancellation exception explicitly:
@@ -1,6 +1,7 @@
 #include <cstdlib>
 #include <iostream>
 #include <pthread.h>
+#include <cxxabi.h>

 static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
 static pthread_cond_t c = PTHREAD_COND_INITIALIZER;
@@ -11,6 +12,8 @@
   try {
     ::pthread_mutex_lock(&m);
     ::pthread_cond_wait (&c, &m);
+  } catch (abi::__forced_unwind&) {
+    throw;
   } catch (...) {
     // do something
   }
The header cxxabi.h comes with gcc since, I think, gcc 4.3. It defines a special tag which corresponds to the exception used for cancellation. This exception must not be swallowed, as already said, which is why it is called abi::__forced_unwind.
That's all that is needed. This code can easily be added to existing code, maybe even hidden behind a single macro:
#define CATCHALL catch (abi::__forced_unwind&) { throw; } catch (...)
This macro can be defined predicated on the gcc version and the platform.
I still think it is better to always rethrow the exception, though.
IDN Support
The changes are minimal and have been pushed into the git archive. IDN is enabled by default; non-IDN names continue to work. In case there is some sort of problem, the --no-idn option can be used to disable IDN support.
It is quite easy for other programs to enable IDN as well. Whether it all should just work automatically is another question. There is the problem of look-alike characters in the Unicode range which might undermine the certificate system.
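For illustration, one way a program can opt into IDN handling with glibc is the AI_IDN flag of getaddrinfo. This is just a sketch, assuming a glibc built with IDN support; whether a program should enable this by default is exactly the question raised above.

/* Sketch: let getaddrinfo convert an IDN host name to its ASCII (ACE)
   form before the lookup, via glibc's AI_IDN extension flag.  */
#define _GNU_SOURCE
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int
main (int argc, char *argv[])
{
  const char *name = argc > 1 ? argv[1] : "example.org";

  struct addrinfo hints;
  memset (&hints, 0, sizeof (hints));
  hints.ai_socktype = SOCK_STREAM;
  hints.ai_flags = AI_IDN;              /* enable IDN conversion */

  struct addrinfo *res;
  int err = getaddrinfo (name, NULL, &hints, &res);
  if (err != 0)
    {
      fprintf (stderr, "getaddrinfo: %s\n", gai_strerror (err));
      return 1;
    }

  printf ("%s resolved successfully\n", name);
  freeaddrinfo (res);
  return 0;
}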
Fedora and USB Mobile Broadband
Inserting such a stick into a standard Fedora 12 system causes only the simulated CD-ROM to be mounted. This dual-mode behavior is the root of the problem.
Looking for a solution one comes across many different approaches. The provider I use, Fonic, has something to download for Linux (a plus, even though they don't provide any support for it); this seems to be an NDIS driver. There are several other documented ways, including wvdial and some KDE programs.
It’s all much simpler with recent Fedora distributions. Just make sure you have the usb_modeswitch and the accompanying usb_modeswitch-data package installed. Version 1.1.2-3 is what I use. Make sure you reboot before trying to use it.
The usb_modeswitch package contains a program which switches the USB stick from mass storage mode into modem mode. It also contains appropriate udev rules to make this automatic if the device is known in the config files. If it is not known it’s quite simple to add.
Anyway, when the stick is inserted its mode should be switched automatically, and then NetworkManager takes over. It recognizes the modem, and the built-in rules for wireless broadband providers guide you through the rest of the setup process. You only need to know the provider and possibly the plan. That's it.
Now when I insert the stick all I get asked is the PIN which has to be provided every time. Why, I don’t know, it should IMO be stored in the keyring just like all the other access information.
Anyway, for everybody with wireless broadband devices on Fedora, make sure usb_modeswitch is installed. There is an open bug in the Red Hat bugzilla to make NetworkManager depend on this package so that everything just works for more people.
glibc 2.10 news
I might need a bit more space to explain the new features in glibc 2.10 than can reasonably be written down in the release notes. Therefore I’ll take some time to describe them here.
POSIX 2008
We (the Austin Group) finished work on the 2008 revision of POSIX some time ago. In glibc 2.10 I've added the necessary feature selection macros and more to glibc to support POSIX 2008. Most of it, at least. This was quite easy: a large part of the work which went into POSIX 2008 consisted of adding functions which have long been in glibc. The Unix world catches up with Linux.
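For reference, a program selects the POSIX 2008 interfaces through the usual feature selection macro mechanism; a minimal sketch:

/* Request the POSIX 2008 interfaces from glibc by defining the feature
   test macro before including any header.  */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* On a POSIX-2008-capable system this prints 200809.  */
  printf ("_POSIX_VERSION = %ld\n", (long) _POSIX_VERSION);
  return 0;
}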
I had to implement one new function: psiginfo. This function is similar to psignal, but instead of printing information for a plain signal number it prints the information carried in a siginfo_t object, as delivered with real-time signals.
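A minimal sketch of what psiginfo provides (calling it from a handler here is purely for illustration; it is not guaranteed to be async-signal-safe):

/* Sketch: print details about a queued real-time signal with psiginfo.  */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

static void
handler (int sig, siginfo_t *info, void *ctx)
{
  (void) sig; (void) ctx;
  psiginfo (info, "got signal");   /* illustration only */
}

int
main (void)
{
  struct sigaction sa;
  memset (&sa, 0, sizeof (sa));
  sa.sa_sigaction = handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset (&sa.sa_mask);
  sigaction (SIGRTMIN, &sa, NULL);

  union sigval val;
  val.sival_int = 42;
  sigqueue (getpid (), SIGRTMIN, val);   /* queue a real-time signal */
  return 0;
}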
A few things are left to be done. What I know of right now is the implementation of the O_SEARCH and O_EXEC flags; this needs kernel support.
C++ compliance
The C standard defines functions like strchr in a pretty weak way because C has no function overloading:
char *strchr(const char *, int)
The types of the string parameter and the return value are as weak as possible: the parameter accepts constant and non-constant strings alike, and the returned pointer can be assigned to a non-constant pointer variable even when a constant string was passed in.
The problem with this is that the const-ness of the parameter is not preserved and reflected in the return value. This would be the right thing to do since the return value, if not NULL, points somewhere into the parameter string.
C++ with its function overloading can do better. This is why C++ 1998 actually defines two functions:
char *strchr(char *, int)
const char *strchr(const char *, int)
These functions do preserve the const-ness. This is possible because these functions actually have different names after mangling. Actually, in glibc we use a neat trick of gcc to avoid defining any function with C++ binding but this is irrelevant here.
Anyway, the result of this change is that some incorrect C++ programs, which worked before, will now fail to compile.
const char *in = "some string";
char *i = strchr(in, 'i');
This code will fail because the strchr version selected by the compiler is the second one which returns a constant string pointer. It is an error (not only a source of a warning) in C++ when a const pointer is assigned to a non-const pointer.
As I wrote, this is incorrect C++ code. But it might trip up some people.
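The fix is simply to preserve the qualifier on the result as well:

const char *in = "some string";
const char *i = strchr (in, 'i');   /* result stays const, matches the C++ overload */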
C++ 201x support
There is one interface in the upcoming C++ revision which needs support in the C library, at least to be efficient. C++ 201x defines yet another set of interfaces to terminate a process and to register handlers which are run when this happens:
int at_quick_exit(void (*)(void))
void quick_exit(int)
The handlers installed with at_quick_exit will only be run when quick_exit is used and not when exit is used. No global destructors are run either. That’s the whole purpose of this new interface. If the process is in a state where the global destructors cannot be run anymore and the process would crash, quick_exit should be used.
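A small sketch of the behavior (glibc declares both functions in <stdlib.h>):

/* Handlers registered with at_quick_exit run only for quick_exit;
   atexit handlers and static destructors are skipped.  */
#include <stdio.h>
#include <stdlib.h>

static void
quick_handler (void)
{
  puts ("quick_exit handler runs");
}

static void
normal_handler (void)
{
  puts ("atexit handler (never runs here)");
}

int
main (void)
{
  atexit (normal_handler);
  at_quick_exit (quick_handler);
  quick_exit (0);               /* terminates after running quick_handler */
}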
DNS NSS improvement
In glibc 2.9 I already implemented an improvement to the DNS NSS module which optimizes the lookup of IPv4 and IPv6 addresses for the same host. This can improve the response time of the lookup due to parallelism. It also fixes a bug in name lookup where the IPv4 and IPv6 addresses could be returned for different hosts.
The problem with this change was that there are broken DNS servers and broken firewall configurations which prevented the two results from being received successfully. Some broken DNS servers (especially those in cable modems etc) only send one reply. For this reason Fedora had this change disabled in F10.
For F11 I’ve added a work-around for broken servers. The default behavior is the same as described above. I.e., we get the improved performance for working DNS servers. In case the program detects a broken DNS server or firewall because it received only one reply the resolver switches into a mode where the second request is sent only after the first reply has been received. We still get the benefit of the bug fix described above, though.
The drawback is that a timeout is needed to detect the broken servers or firewalls. This delay is experienced once per process start and could be noticeable. But the broken setups of the few people affected must not prevent the far larger group of people with working setups from experiencing the advantage of the parallel lookup.
There are also ways to avoid the delays, some old, some new:
- Install a caching name server on this machine or somewhere on the local network. bind is known to work correctly.
- Run nscd on the local machine. In this case the delay is incurred once per system start (i.e., at the first lookup nscd performs).
- Add “single-request” to the options in /etc/resolv.conf. This selects the compatibility mode from the start.
All of these work-arounds are easy to implement. Therefore there is no reason not to make the fast mode the default, which in any case will work for 99% of the people.
Use NSS in libcrypt
The NSS I refer to here is the Network Security Services package. It provides libraries with implementations of crypto and hash functions, among other things. In RHEL the NSS package is certified and part of the EAL feature set.
To get compliance for the whole system every implementation of the crypto and hash functions would have to be certified. This is an expensive and time-consuming process. The alternative is to use the same implementation everywhere. This is what a change to libcrypt now allows.
Since NSS is already certified we can just use the implementation of the hash functions from the NSS libraries in the implementation of crypt(3) in libcrypt. Bob Relyea implemented a set of new interfaces in the libfreebl3 library to allow the necessary low-level access and freed libfreebl3 from dependencies on NSPR.
By default libcrypt is built as before. Only with the appropriate configure option is libfreebl3 used. There are no visible changes (except the dependency on libfreebl3) so users should not have to worry at all.
Combine this with the new password hashing I developed almost two years ago and we now have fully certified password handling.
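For the caller nothing changes: crypt(3) is used exactly as before, and which low-level hash implementation sits underneath (glibc's own or NSS's libfreebl3) is invisible. A small sketch using the SHA-512 based scheme:

/* Sketch: hashing a password with crypt(3) and a "$6$" (SHA-512 based)
   salt string.  Link with -lcrypt.  */
#include <crypt.h>
#include <stdio.h>

int
main (void)
{
  const char *hash = crypt ("top secret", "$6$saltstring$");
  if (hash != NULL)
    puts (hash);
  return 0;
}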
printf hooks
Certain special interest groups subverted the standardization process (again) and pushed through changes which introduce extensions to the C programming language to support decimal floating-point (DFP) computations. 99.99% of all people will never use this stuff and still we have to live with it.
I refuse to add support for this to glibc because these extensions are not (yet) in the official language standard. And maybe even after that we’ll have it separately.
But the DFP extensions call for support in printf. The normal floating-point format specifiers cannot be used; new modifiers are needed.
The printf implementation in glibc has had, for the longest time, a way to be extended: one can install handlers for additional format specifiers. Unfortunately, this extension mechanism isn't generic enough for the purpose of supporting DFP.
After a couple of versions of a patch from Ryan Arnold I finally finished the work and added a generic framework which allows installing additional modifiers and format specifiers.
int register_printf_specifier (int, printf_function, printf_arginfo_size_function)
int register_printf_modifier (wchar_t *)
int register_printf_type (printf_va_arg_function)
With these interfaces DFP printing functions can live outside glibc and still work as if the support were built in. For an example see my code to print XMM values.
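Since that code is not reproduced here, the following is a hedged illustration of the API only (not the DFP code); the 'Y' specifier and all names are made up for this example:

/* Sketch: register a made-up 'Y' specifier which prints an unsigned int
   in binary, using the <printf.h> extension interfaces.  */
#include <limits.h>
#include <printf.h>
#include <stdio.h>

static int
print_binary (FILE *stream, const struct printf_info *info,
              const void *const *args)
{
  unsigned int value = *(const unsigned int *) args[0];
  char buf[sizeof (unsigned int) * CHAR_BIT + 1];
  char *p = buf + sizeof (buf) - 1;
  *p = '\0';
  do
    *--p = '0' + (value & 1);
  while ((value >>= 1) != 0);
  return fprintf (stream, "%*s", info->width, p);
}

static int
print_binary_arginfo (const struct printf_info *info, size_t n,
                      int *argtypes, int *size)
{
  /* The specifier consumes one argument, passed as an int.  */
  if (n > 0)
    {
      argtypes[0] = PA_INT;
      size[0] = sizeof (int);
    }
  return 1;
}

int
main (void)
{
  register_printf_specifier ('Y', print_binary, print_binary_arginfo);
  printf ("42 is %Y in binary\n", 42u);   /* gcc may warn about %Y */
  return 0;
}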
malloc scalability
A change which is rather small in the number of lines it touches went in to make malloc more scalable. Before, malloc tried to emulate a per-core memory pool. Every time contention for all existing memory pools was detected a new pool was created. Threads stay with the last used pool if possible.
This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple core/sockets happily use malloc without contention memory from the same pool is used by different cores/on different sockets. This can lead to false sharing and definitely additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.
The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn't get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much, the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.
Using environment variables the implementation can be tuned. If MALLOC_ARENA_TEST is set, the test for the number of cores is only performed once the number of memory pools in use reaches the value specified by this envvar. If MALLOC_ARENA_MAX is used, it sets the maximum number of memory pools used, regardless of the number of cores.
While these changes might increase the number of memory pools which are created (and thus the address space they use), the number can now be controlled. With the old mechanism a new pool could be created whenever there was a collision, so the total number could in theory be even higher. Unlikely, but possible; in any case the new mechanism is more predictable.
The important thing to realize, though, is when the old mechanism was developed. My machine at the time when I added Wolfram's dlmalloc to glibc back in 1995 (I think) had 64MB of memory. We've come a long way since then. Memory is not at that much of a premium anymore, and most of a memory pool doesn't actually require memory until it is used, only address space. We have plenty of that on 64-bit machines. 32-bit machines are a different story. But this is why I limit the number of memory pools on 32-bit machines so drastically, to two per core.
The changes include a second improvement which allows the free function to avoid locking the memory pool in certain situations.
We have done internally some measurements of the effects of the new implementation and they can be quite dramatic.
Information about malloc
There is an obscure SysV interface in glibc called mallinfo. It allows the caller to get some information about the state of the malloc implementation. Data like total memory allocated, total address space, etc. There are multiple problems with this interface, though.
The first problem is that it is completely unsuitable for 64-bit machines. The data types required by the SysV spec don't allow for values larger than 2^31 bytes (all fields in the structure are ints). The second problem is that the data structure is really specific to the malloc implementation SysV used at that time.
The implementation details of malloc implementations will change over time. It is therefore a bad idea to codify a specific implementation in the structures which export statistical information.
The new malloc_info function therefore does not export a structure. Instead it exports the information in a self-describing format. Nowadays the preferred way to do this is via XML. The format can change over time (it's versioned); some fields will stay the same, others will change. No breakage. The reader just cannot assume that all the information will forever be available in the same form. There is no reader in glibc. This isn't necessary; it's easy enough to write one outside glibc using one of the many XML libraries.
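A minimal sketch of how a program obtains the XML dump:

/* Sketch: dump malloc's self-describing XML statistics to stdout.  */
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  /* Do a few allocations so there is something to report.  */
  for (int i = 0; i < 1000; ++i)
    if (malloc (64) == NULL)
      return 1;

  return malloc_info (0, stdout);   /* the options argument must be 0 */
}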
Automatic use of optimized function
Processor vendors these days spend time fine tuning the instruction sets of their products. Specialized instructions are introduced which can be used to accelerate the implementation of specific functions. One problem holding back the adoption of such instructions is that people want their binaries to work everywhere.
One example of such application-specific instructions is the SSE4.2 extension Intel introduced in their Nehalem core. This core features special instructions for string handling. They allow optimized implementations of functions like strlen or strchr.
It would of course be possible to start the implementation of these functions with a test for this feature and then use the old or the new implementation. For functions where the total time a call takes is just a couple of dozen cycles this overhead is noticeable, though.
Therefore I've designed an ELF extension which allows making the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it interprets the value as a pointer to a function that takes no arguments and returns the real function pointer to use. The called code is under the control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.
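To illustrate the idea (this is not how glibc itself uses the feature), here is roughly how the mechanism looks from C with gcc's ifunc attribute, which appeared in later gcc/binutils releases than those discussed here; all names are made up:

/* Sketch of STT_GNU_IFUNC from C: my_strlen resolves once, at symbol
   resolution time, to one of two implementations.  Requires a
   reasonably recent gcc/binutils on x86.  */
#include <stddef.h>
#include <stdio.h>

typedef size_t strlen_fn (const char *);

static size_t
my_strlen_generic (const char *s)
{
  const char *p = s;
  while (*p != '\0')
    ++p;
  return (size_t) (p - s);
}

static size_t
my_strlen_sse42 (const char *s)
{
  /* Stand-in for an SSE4.2-optimized implementation.  */
  return my_strlen_generic (s);
}

/* The resolver is called by the dynamic linker, once per process, and
   returns the function pointer the symbol should resolve to.  */
static strlen_fn *
my_strlen_resolver (void)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("sse4.2")
         ? my_strlen_sse42 : my_strlen_generic;
}

size_t my_strlen (const char *s)
     __attribute__ ((ifunc ("my_strlen_resolver")));

int
main (void)
{
  printf ("%zu\n", my_strlen ("Nehalem"));
  return 0;
}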
This feature is not yet enabled in Fedora 11. There is some more binutils work needed and then prelink has to be changed. My guess is that F11 will go out without glibc taking advantage of this feature itself. But we will perhaps enable it after the release, once binutils and prelink have caught up.
Fedora 10 a little bit more secure
Anyway, you can do the following by hand. Unfortunately you have to do it every time the program is updated again.
sudo chmod u-s /bin/ping
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping
sudo chmod u-s /bin/ping6
sudo /usr/sbin/setcap cap_net_raw=ep /bin/ping6
Voilà, ping and ping6 are no longer SUID binaries. Note that ls still signals (at least when you're using --color) that there is something special about the files, namely, that filesystem capabilities are set.
These are two easy cases. Other SUID programs need some research to see whether they can use filesystem capabilities as well and which capabilities they need.
Secure File Descriptor Handling
During the 2.6.27 merge window a number of my patches were merged, and now we are at the point where we can securely create file descriptors without the danger of possibly leaking information. Before I go into the details let's get some background information.
A file descriptor in the Unix/POSIX world has lots of state associated with it. One bit of information determines whether the file descriptor is automatically closed when the process executes an exec call to start executing another program. This is useful, for instance, to establish pipelines. Traditionally, when a file descriptor is created (e.g., with the default open() mode) this close-on-exec flag is not set and a programmer has to explicitly set it using
fcntl(fd, F_SETFD, FD_CLOEXEC);
Closing the descriptor is a good idea for two main reasons:
- the new program's file descriptor table might fill up. For every open file descriptor resources are consumed.
- more importantly, information might be leaked to the second program. That program might get access to information it normally wouldn't have access to.
It is easy to see why the latter point is such a problem. Assume this common scenario:
A web browser has two windows or tabs open, both loading a new page (maybe triggered through Javascript). One connection is to your bank, the other some random Internet site. The latter contains some random object which must be handled by a plug-in. The plug-in could be an external program processing some scripting language. The external program will be started through a fork() and exec sequence, inheriting all the file descriptors open and not marked with close-on-exec from the web browser process.
The result is that the plug-in can have access to the file descriptor used for the bank connection. This is especially bad if the plug-in is used for a scripting language such as Flash because this could make the descriptor easily available to the script. In case the author of the script has malicious intentions you might end up losing money.
Until not too long ago the best programs could do was to set the close-on-exec flag for a file descriptor as quickly as possible after it had been created. Programs would break if the default for new file descriptors were changed to set the bit automatically.
This does not solve the problem, though. There is a (possibly brief) period of time between the return of the open() call or other function creating a file descriptor and the fcntl() call to set the flag. This is problematic because the fork() function is signal-safe (i.e., it can be called from a signal handler), and in multi-threaded code a second thread might call fork() concurrently. It is theoretically possible to avoid these races by blocking all signals and by ensuring through locks that fork() cannot be called concurrently. This very quickly gets far too complicated to even contemplate:
- To block all signals, each thread in the process has to be interrupted (through another signal) and in the signal handler block all the other signals. This is complicated, slow, possibly unreliable, and might introduce deadlocks.
- Using a lock also means there has to be a lock around fork() itself. But fork() is signal safe. This means this step also needs to block all signals. This by itself requires additional work since child processes inherit signal masks.
- Making all this work in projects which come from different sources (and which non-trivial program doesn't use system or third-party libraries?) is virtually impossible.
It is therefore necessary to find a different solution. The first set of patches to achieve the goal went into the Linux kernel in 2.6.23, the last, as already mentioned, will be in the 2.6.27 release. The patches are all rather simple. They just extend the interface of various system calls so that already existing functionality can be taken advantage of.
The simplest case is the open() system call. To create a file descriptor with the close-on-exec flag atomically set all one has to do is to add the O_CLOEXEC flag to the call. There is already a parameter which takes such flags.
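A small sketch of the difference:

/* Sketch: the racy two-step sequence vs. the atomic O_CLOEXEC flag.  */
#include <fcntl.h>

int
open_logfile (const char *path)
{
  /* Old, racy way:
       int fd = open (path, O_RDONLY);
       fcntl (fd, F_SETFD, FD_CLOEXEC);
     A fork()+exec happening between the two calls leaks the descriptor.  */

  /* New, atomic way:  */
  return open (path, O_RDONLY | O_CLOEXEC);
}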
The next, more complicated, case is the solution chosen to extend the socket() and socketpair() system calls. No flag parameter is available, but the second parameter to these interfaces (the type) has a very limited range requirement. It was felt that overloading this parameter is an acceptable solution. It definitely makes using the new interfaces simpler.
The last group are interfaces where the original interface simply doesn't provide a way to pass additional parameters. In all these cases a generic flags parameter was added. This is preferable to using specialized new interfaces (like, for instance, dup2_cloexec) because we do and will need other flags. O_NONBLOCK is one case. Hopefully we'll have non-sequential file descriptors at some point and we can then request them using the flags, too.
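For illustration, the overloaded type parameter and the new generic flags parameter look like this in use (a sketch):

/* Sketch: close-on-exec (and non-blocking) set atomically at creation.  */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

int
make_descriptors (int pipefds[2])
{
  /* The flags are or'ed into the existing type parameter.  */
  int sock = socket (AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC | SOCK_NONBLOCK, 0);
  if (sock < 0)
    return -1;

  /* pipe2 adds the generic flags parameter that plain pipe() lacks.  */
  if (pipe2 (pipefds, O_CLOEXEC | O_NONBLOCK) < 0)
    {
      close (sock);
      return -1;
    }
  return sock;
}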
The (hopefully complete) list of interface changes which were introduced is listed below. Note: these are the userlevel changes. Inside the kernel things look different.
Userlevel Interface | What changed? |
---|---|
open | O_CLOEXEC flag added |
fcntl | F_DUPFD_CLOEXEC command added |
recvmsg | MSG_CMSG_CLOEXEC flag for transmission of file descriptor over Unix domain socket which has close-on-exec set atomically |
dup3 | New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK) |
pipe2 | New interface taking an additional flag parameter (O_CLOEXEC, O_NONBLOCK) |
socket | SOCK_CLOEXEC and SOCK_NONBLOCK flag added to type parameter |
socketpair | SOCK_CLOEXEC and SOCK_NONBLOCK flag added to type parameter |
paccept | New interface taking an additional flag parameter (SOCK_CLOEXEC, SOCK_NONBLOCK) and a temporary signal mask |
fopen | New mode 'e' to open file with close-on-exec set |
popen | New mode 'e' to open pipes with close-on-exec set |
eventfd | Takes new flags EFD_CLOEXEC and EFD_NONBLOCK |
signalfd | Takes new flags SFD_CLOEXEC and SFD_NONBLOCK |
timerfd | Takes new flags TFD_CLOEXEC and TFD_NONBLOCK |
epoll_create1 | New interface taking a flag parameter. Support EPOLL_CLOEXEC and EPOLL_NONBLOCK |
inotify_init1 | New interface taking a flag parameter (IN_CLOEXEC, IN_NONBLOCK) |
When should these interfaces be used? The answer is simple: whenever the author cannot be sure that no asynchronous fork()+exec can happen, e.g., from a signal handler or from a concurrently running thread (or via posix_spawn(), BTW).
Application writers might have control over this. But I'd say that in all library code one has to play it safe. In glibc we now open file descriptors with the close-on-exec flag set in almost all interfaces. This means a lot of work but it has to be done. Applications also have to change (see this autofs bug, for instance).
dual head xrandr configuration
ajax told me that extra wide screens now work with the latest Fedora 9 binaries for X11. So I had to try it out and after some experimenting I got it to work. To save others the work here is what I did.
Hardware:
- ATI FireGL V3600
- 2x Dell 3007FPW
I use the free driver, of course. No need for 3D here.
The old way to get a spanning desktop was to use Xinerama. This has been replaced by xrandr nowadays. xrandr is not just for laptops' external screens and for changing the resolution. One can assign the origin of the various screens and therefore display different parts of a bigger virtual desktop. This is the whole trick here. The /etc/X11/xorg.conf file I use is this:
Section "ServerLayout" Identifier "dual head configuration" Screen 0 "Screen0" 0 0 InputDevice "Keyboard0" "CoreKeyboard" EndSection Section "InputDevice" Identifier "Keyboard0" Driver "kbd" Option "XkbModel" "pc105" Option "XkbLayout" "us+inet" EndSection Section "Device" Identifier "Videocard0" Driver "radeon" Option "monitor-DVI-0" "dvi0" Option "monitor-DVI-1" "dvi1" EndSection Section "Monitor" Identifier "dvi0" Option "Position" "2560 0" EndSection Section "Monitor" Identifier "dvi1" Option "LeftOf" "dvi0" EndSection Section "Screen" Identifier "Screen0" Device "Videocard0" DefaultDepth 16 SubSection "Display" Viewport 0 0 Depth 16 Modes "2560x1600" Virtual 5120 1600 EndSubSection EndSection
Fortunately X11 configuration has gotten much easier since the last time I had to edit the file by hand. I started from the most basic setup for a single screen, which the installer or system-config-display will be happy to create for you. The important changes on top of this initial version are these:
Option "monitor-DVI-0" "dvi0" Option "monitor-DVI-1" "dvi1"
These lines in the Device section announce the two screens. It is unfortunately not well (at all?) documented that the first parameter strings are magic. If you run xrandr -q on a system with two screens attached you'll see the identifiers assigned to the screens by the system. In my case:
$ xrandr -q
Screen 0: minimum 320 x 200, current 5120 x 1600, maximum 5120 x 1600
DVI-1 connected 2560x1600+0+0 (normal left inverted right x axis y axis) 646mm x 406mm
...
DVI-0 connected 2560x1600+2560+0 (normal left inverted right x axis y axis) 646mm x 406mm
...
Add to the names DVI-0 and DVI-1 the magic prefix monitor- and add as the second parameter string an arbitrary identifier. Do not drop or change the monitor- prefix; that's the main magic which seems to make all this work.
Then create two monitor sections in the xorg.conf file, one for each screen:
Section "Monitor" Identifier "dvi0" Option "Position" "2560 0" EndSection Section "Monitor" Identifier "dvi1" Option "LeftOf" "dvi0" EndSection
The Identifier lines must of course match the identifiers used in the Device section. The rest are options which determine what the screens show. Since the LCDs have a resolution of 2560x1600, since I want a spanning desktop, and since the DVI-0 connector is used for the display on the right side, I'm using an x-offset of 2560 and a y-offset of 0 for that screen. Then just tell the server to place the second screen to the left of it and the server will figure out the rest.
What remains to be done is to tell the server how large the screen in total is. That's done using
Virtual 5120 1600
The numbers should explain themselves. Now the two screens show non-overlapping regions of the total desktop with no area not displayed, all due to the correct arithmetic in the calculation of the total screen size and the offset.
Note: there is only one Screen section. That's something which is IIRC different from the last Xinerama setup I did years ago.