I might need a bit more space to explain the new features in glibc 2.10 than can reasonably be written down in the release notes. Therefore I’ll take some time to describe them here.
We (= Austin Group) finished work on the 2008 revision of POSIX some time ago. In glibc 2.10 I’ve added the necessary feature select macros and more to support POSIX 2008. Most of it, at least. This was quite easy. A large part of the work which went into POSIX 2008 was to add functions to POSIX which have been in glibc for a long time. The Unix world catches up with Linux.
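As a small illustration of the feature select macros, the following sketch requests the POSIX 2008 interfaces and uses strndup, one of the functions that had long been available in glibc and became part of POSIX with the 2008 revision. The helper name first_word is made up for this example.

```c
/* Request the POSIX 2008 interfaces; must come before any #include. */
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a freshly allocated copy of the text up
   to the first whitespace character, or NULL if allocation fails.
   The caller frees the result. */
char *first_word(const char *text)
{
    size_t len = strcspn(text, " \t\n");
    return strndup(text, len);   /* strndup is POSIX since the 2008 revision */
}
```

Calling first_word("hello world") yields a heap copy of "hello".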
I had to implement one new function: psiginfo. This function is similar to psignal but instead of printing information for a simple signal it prints information for a real-time signal context.
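A minimal sketch of how psiginfo can be used: the process queues a real-time style signal to itself, collects it synchronously with sigwaitinfo, and hands the resulting siginfo_t to psiginfo, which writes a description to stderr. The function name report_queued_signal and the choice of SIGUSR1 are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <unistd.h>

/* Hypothetical helper: queue SIGUSR1 with an integer payload to this
   process, receive it synchronously and describe it with psiginfo().
   Returns the payload carried by the signal, or -1 on failure (so a
   payload of -1 cannot be told apart from an error). */
int report_queued_signal(int payload)
{
    sigset_t set, old;
    siginfo_t info;
    union sigval value;
    int result = -1;

    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);

    /* Block the signal so it stays pending instead of terminating us. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    value.sival_int = payload;
    if (sigqueue(getpid(), SIGUSR1, value) == 0
        && sigwaitinfo(&set, &info) == SIGUSR1) {
        /* Prints the prefix followed by details of the signal to stderr. */
        psiginfo(&info, "received");
        result = info.si_value.sival_int;
    }

    /* Restore the previous signal mask. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```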
A few things are left to be done. What I know right now is the implementation of the O_SEARCH and O_EXEC flags. This needs kernel support.
The C standard defines functions like strchr in a pretty weak way because C has no function overloading:
char *strchr(const char *, int)
The string parameter and the return value type are as weak as possible. Non-constant strings can be passed as parameters and the result can be assigned to a constant string variable.
The problem with this is that the const-ness of the parameter is not preserved and reflected in the return value. Preserving it would be the right thing to do since the return value, if not NULL, points somewhere into the parameter string.
C++ with its function overloading can do better. This is why C++ 1998 actually defines two functions:
char *strchr(char *, int)
const char *strchr(const char *, int)
These functions do preserve the const-ness. This is possible because these functions actually have different names after mangling. Actually, in glibc we use a neat trick of gcc to avoid defining any function with C++ binding but this is irrelevant here.
Anyway, the result of this change is that some incorrect C++ programs, which worked before, will now fail to compile.
const char *in = "some string";
char *i = strchr(in, 'i');
This code will fail because the strchr version selected by the compiler is the second one which returns a constant string pointer. It is an error (not only a source of a warning) in C++ when a const pointer is assigned to a non-const pointer.
As I wrote, this is incorrect C++ code. But it might trip up some people.
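The fix on the application side is to keep the result const whenever the input is const. The sketch below is written in C, where both forms compile, but it uses the declaration the C++ overloads insist on. The helper name index_of is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: index of the first occurrence of c in s,
   or -1 if c does not occur. */
ptrdiff_t index_of(const char *s, int c)
{
    /* const string in, const pointer out: the const-ness is preserved. */
    const char *hit = strchr(s, c);
    return hit != NULL ? hit - s : -1;
}
```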
C++ 201x support
There is one interface in the upcoming C++ revision which needs support in the C library, at least to be efficient. C++ 201x defines yet another set of interfaces to terminate a process and to register handlers which are run when this happens:
int at_quick_exit(void (*)(void))
void quick_exit(int)
The handlers installed with at_quick_exit will only be run when quick_exit is used and not when exit is used. No global destructors are run either. That’s the whole purpose of this new interface. If the process is in a state where the global destructors cannot be run anymore and the process would crash, quick_exit should be used.
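The separation between the two handler lists can be observed with a small experiment. The sketch below forks a child which registers one handler with atexit and one with at_quick_exit and then calls quick_exit. Each handler reports through a pipe, so the parent can see which of them ran. All function names are assumptions made for this example.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;   /* write end of the pipe, used by the handlers */

static void on_quick_exit(void)
{
    if (write(report_fd, "q", 1) != 1)
        _exit(2);
}

static void on_normal_exit(void)
{
    if (write(report_fd, "e", 1) != 1)
        _exit(2);
}

/* Hypothetical helper: run a child that calls quick_exit(7).  Stores the
   letters written by the handlers in buf (NUL-terminated) and returns
   the child's exit status, or -1 on error. */
int run_quick_exit_child(char *buf, size_t buflen)
{
    int fds[2];
    if (buflen == 0 || pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        report_fd = fds[1];
        if (atexit(on_normal_exit) != 0 || at_quick_exit(on_quick_exit) != 0)
            _exit(2);
        quick_exit(7);   /* runs only the at_quick_exit handlers */
    }

    close(fds[1]);
    size_t used = 0;
    ssize_t n;
    while (used + 1 < buflen
           && (n = read(fds[0], buf + used, buflen - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With a C library that implements these interfaces the buffer should contain only the letter written by the at_quick_exit handler.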
DNS NSS improvement
In glibc 2.9 I already implemented an improvement to the DNS NSS module which optimizes the lookup of IPv4 and IPv6 addresses for the same host. This can improve the response time of the lookup due to parallelism. It also fixes a bug in name lookup where the IPv4 and IPv6 addresses could be returned for different hosts.
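From the application's point of view nothing changes. A lookup for both address families is simply a getaddrinfo call with AF_UNSPEC, and for a real host name this is the kind of request where the DNS module has to find both IPv4 and IPv6 addresses. A minimal sketch follows. The struct and function names are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result type: how many addresses of each family came back. */
struct family_counts {
    int ipv4;
    int ipv6;
};

/* Hypothetical helper: look up host for both address families in one
   call.  Returns 0 on success or the getaddrinfo() error code. */
int count_addresses(const char *host, struct family_counts *out)
{
    struct addrinfo hints, *res, *ai;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;      /* ask for IPv4 and IPv6 together */
    hints.ai_socktype = SOCK_STREAM;  /* one entry per address */

    out->ipv4 = 0;
    out->ipv6 = 0;

    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0)
        return rc;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        if (ai->ai_family == AF_INET)
            out->ipv4++;
        else if (ai->ai_family == AF_INET6)
            out->ipv6++;
    }
    freeaddrinfo(res);
    return 0;
}
```

Passing a numeric address such as "127.0.0.1" exercises the same code path without touching the network. A host name goes through the NSS modules configured on the system.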
The problem with this change was that there are broken DNS servers and broken firewall configurations which prevented the two results from being received successfully. Some broken DNS servers (especially those in cable modems etc) only send one reply. For this reason Fedora had this change disabled in F10.
For F11 I’ve added a work-around for broken servers. The default behavior is the same as described above. I.e., we get the improved performance for working DNS servers. In case the program detects a broken DNS server or firewall because it received only one reply the resolver switches into a mode where the second request is sent only after the first reply has been received. We still get the benefit of the bug fix described above, though.
The drawback is that a timeout is needed to detect the broken servers or firewalls. This delay is experienced once per process start and could be noticeable. But the broken setups of the few people affected must not prevent the far larger group of people with working setups from experiencing the advantage of the parallel lookup.
There are also ways to avoid the delays, some old, some new:
- Install a caching name server on this machine or somewhere on the local network. bind is known to work correctly.
- Run nscd on the local machine. In this case the delay is incurred once per system start (i.e., at the first lookup nscd performs).
- Add “single-request” to the options in /etc/resolv.conf. This selects the compatibility mode from the start.
All of these work-arounds are easy to implement. Therefore there is no reason to not have the fast mode the default which in any case will work for 99% of the people.
Use NSS in libcrypt
The NSS I refer to here is the Network Security Services package. It provides libraries with implementations of crypto and hash functions, among other things. In RHEL the NSS package is certified and part of the EAL feature set.
To get compliance for the whole system every implementation of the crypto and hash functions would have to be certified. This is an expensive and time-consuming process. The alternative is to use the same implementation everywhere. This is what a change to libcrypt now allows.
Since NSS is already certified we can just use the implementation of the hash functions from the NSS libraries in the implementation of crypt(3) in libcrypt. Bob Relyea implemented a set of new interfaces in the libfreebl3 library to allow the necessary low-level access and freed libfreebl3 from dependencies on NSPR.
By default libcrypt is built as before. Only with the appropriate configure option is libfreebl3 used. There are no visible changes (except the dependency on libfreebl3) so users should not have to worry at all.
Combine this with the new password hashing I developed almost two years ago and we now have fully certified password handling.
Certain special interest groups subverted the standardization process (again) and pressed through changes to introduce extensions into the C programming language which support decimal floating point computations. 99.99% of all people will never use this stuff and still we have to live with it.
I refuse to add support for this to glibc because these extensions are not (yet) in the official language standard. And maybe even after that we’ll have it separately.
But the DFP extensions call for support in printf. The normal floating-point formats cannot be used. New modifiers are needed.
The printf in glibc has for the longest time had a way to extend it. One can install handlers for additional format specifiers. Unfortunately, this extension mechanism isn’t generic enough for the purpose of supporting DFP.
After a couple of versions of a patch from Ryan Arnold I finally finished the work and added a generic framework which allows installing additional modifiers and format specifiers.
int register_printf_specifier (int, printf_function, printf_arginfo_size_function)
int register_printf_modifier (wchar_t *)
int register_printf_type (printf_va_arg_function)
With these interfaces DFP printing functions can live outside glibc and still work as if the support were built in. For an example see my code to print XMM values.
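To give an idea of how the specifier part of this framework is used, here is a minimal sketch which registers a new conversion, %T, that prints an int as "true" or "false". The choice of the letter T and all function names are assumptions made for this example. It is much simpler than what DFP needs, since it uses neither a new modifier nor a new argument type.

```c
#include <printf.h>
#include <stdio.h>

/* Output handler: called by printf for every %T conversion. */
static int print_bool(FILE *stream, const struct printf_info *info,
                      const void *const *args)
{
    /* args[0] points to the argument value fetched by printf. */
    int value = *(const int *)args[0];
    const char *text = value ? "true" : "false";

    /* Honour the field width and the '-' flag of the conversion. */
    return fprintf(stream, "%*s",
                   info->left ? -info->width : info->width, text);
}

/* Arginfo handler: tells printf that %T consumes one int argument. */
static int bool_arginfo(const struct printf_info *info, size_t n,
                        int *argtypes, int *size)
{
    (void)info;
    (void)size;   /* only needed for user-defined argument types */
    if (n > 0)
        argtypes[0] = PA_INT;
    return 1;
}

/* Hypothetical helper: install the handlers; returns 0 on success. */
int install_bool_specifier(void)
{
    return register_printf_specifier('T', print_bool, bool_arginfo);
}

/* Hypothetical helper: format "enabled=true" or "enabled=false". */
int format_enabled(char *buf, size_t len, int enabled)
{
    /* Kept in a variable because the compiler's format checking does
       not know about %T. */
    const char *fmt = "enabled=%T";
    return snprintf(buf, len, fmt, enabled);
}
```

Once install_bool_specifier has been called, every function of the printf family in the process understands the new conversion.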
A change which is rather small in the number of lines it touches went in to make malloc more scalable. Before, malloc tried to emulate a per-core memory pool. Every time contention for all existing memory pools was detected, a new pool was created. Threads stay with the last used pool if possible.
This never worked 100% because a thread can be descheduled while executing a malloc call. When some other thread tries to use the memory pool used in the call it would detect contention. A second problem is that if multiple threads on multiple cores/sockets happily use malloc without contention, memory from the same pool is used by different cores or on different sockets. This can lead to false sharing and definitely to additional cross traffic because of the meta information updates. There are more potential problems not worth going into here in detail.
The changes which are in glibc now create per-thread memory pools. This can eliminate false sharing in most cases. The meta data is usually accessed only in one thread (which hopefully doesn’t get migrated off its assigned core). To prevent the memory handling from blowing up the address space use too much, the number of memory pools is capped. By default we create up to two memory pools per core on 32-bit machines and up to eight memory pools per core on 64-bit machines. The code delays testing for the number of cores (which is not cheap, we have to read /proc/stat) until there are already two or eight memory pools allocated, respectively.
The behavior can be changed using environment variables. If MALLOC_ARENA_TEST is set, the test for the number of cores is only performed once the number of memory pools in use reaches the value specified by this envvar. If MALLOC_ARENA_MAX is set, it specifies the maximum number of memory pools used, regardless of the number of cores.
While these changes might increase the number of memory pools which are created (and thus increase the address space they use), the number can be controlled. With the old mechanism a new pool could be created whenever there were collisions, so the total number could in theory be even higher. Unlikely but true, so the new mechanism is more predictable.
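The same two limits can also be set from inside a program. The sketch below assumes a glibc whose malloc.h defines the mallopt parameters M_ARENA_TEST and M_ARENA_MAX as counterparts of the environment variables. The helper name cap_malloc_arenas is made up for this example.

```c
#include <malloc.h>

/* Hypothetical helper: set the pool-count threshold at which the number
   of cores is tested, and a hard maximum for the number of memory
   pools.  Returns 1 if both settings were accepted, 0 otherwise. */
int cap_malloc_arenas(int test_threshold, int max_arenas)
{
    /* mallopt() returns 1 on success and 0 on error. */
    return mallopt(M_ARENA_TEST, test_threshold) == 1
        && mallopt(M_ARENA_MAX, max_arenas) == 1;
}
```

Such a call is best made early, before threads have been started and additional memory pools have been created.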
The important thing to realize, though, is when the old mechanism was developed. My machine at the time when I added Wolfram’s dlmalloc to glibc back in 1995 (I think) had 64MB of memory. We’ve come a long way since then. Memory use is not that much of a premium anymore and most of the memory pool doesn’t actually require memory until it is used, only address space. We have plenty of that on 64-bit machines. 32-bit machines are a different story. But this is why I limit the number of memory pools on 32-bit machines so drastically to two per core.
The changes include a second improvement which allows the free function to avoid locking the memory pool in certain situations.
We have done some internal measurements of the effects of the new implementation and they can be quite dramatic.
Information about malloc
There is an obscure SysV interface in glibc called mallinfo. It allows the caller to get some information about the state of the malloc implementation. Data like total memory allocated, total address space, etc. There are multiple problems with this interface, though.
The first problem is that it is completely unsuitable for 64-bit machines. The data types required by the SysV spec don’t allow for values larger than 2^31 bytes (all fields in the structure are ints). The second problem is that the data structure is really specific to the malloc implementation SysV used at that time.
The implementation details of malloc implementations will change over time. It is therefore a bad idea to codify a specific implementation in the structures which export statistical information.
The new malloc_info function therefore does not export a structure. Instead it exports the information in a self-describing data structure. Nowadays the preferred way to do this is via XML. The format can change over time (it’s versioned); some fields will stay the same, others will change. No breakage. The reader just cannot assume that all the information will forever be available in the same form. There is no reader in glibc. This isn’t necessary, it’s easy enough to write one outside glibc using one of the many XML libraries.
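A minimal sketch of a caller: malloc_info writes its XML to a stdio stream, so the output can be captured in memory with open_memstream (itself one of the POSIX 2008 functions) and then handed to whatever XML reader the application prefers. The helper name malloc_info_xml is an assumption made for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return the malloc_info() XML as a heap-allocated
   string, or NULL on failure.  The caller frees the result. */
char *malloc_info_xml(void)
{
    char *text = NULL;
    size_t size = 0;

    FILE *stream = open_memstream(&text, &size);
    if (stream == NULL)
        return NULL;

    /* The first argument is an options word; 0 is the only value
       assumed here. */
    int rc = malloc_info(0, stream);

    /* fclose() finalizes the buffer and NUL-terminates it. */
    if (fclose(stream) != 0 || rc != 0) {
        free(text);
        return NULL;
    }
    return text;
}
```

Since the format is versioned, a reader should check the version attribute of the root element before relying on any particular field.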
Automatic use of optimized functions
Processor vendors these days spend time fine tuning the instruction sets of their products. Specialized instructions are introduced which can be used to accelerate the implementation of specific functions. One problem holding back the adoption of such instructions is that people want their binaries to work everywhere.
One example of such application-specific instructions is the set of SSE4.2 extensions Intel introduced in their Nehalem core. This core features special instructions for string handling. They allow optimized implementations of functions like strlen or strchr.
It would of course be possible to start the implementation of these functions with a test for this feature and then use the old or the new implementation. For functions where the total time a call takes is just a couple of dozen cycles this overhead is noticeable, though.
Therefore I’ve designed an ELF extension which allows making the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it interprets the value as a pointer to a function that takes no argument and returns the real function pointer to use. The code called can be under control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.
This feature is not yet enabled in Fedora 11. There is some more binutils work needed and then prelink has to be changed. My guess is that F11 will go out without glibc taking advantage of this feature itself. But we will perhaps enable it after the release, once binutils and prelink have caught up.
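The mechanism can be sketched at the source level, assuming a GCC-compatible compiler, binutils and a glibc-based ELF target which support the ifunc function attribute. The resolver below takes no argument and returns the function to use. The two implementations are plain C stand-ins rather than real SSE4.2 code, and all names are assumptions made for this example.

```c
#include <stddef.h>

typedef size_t count_fn(const char *, size_t, char);

/* Baseline implementation: one byte per iteration. */
static size_t count_byte_generic(const char *buf, size_t len, char c)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (buf[i] == c);
    return n;
}

/* Stand-in for a CPU-specific version: four bytes per iteration. */
static size_t count_byte_unrolled(const char *buf, size_t len, char c)
{
    size_t n = 0, i = 0;
    for (; i + 4 <= len; i += 4)
        n += (buf[i] == c) + (buf[i + 1] == c)
           + (buf[i + 2] == c) + (buf[i + 3] == c);
    for (; i < len; i++)
        n += (buf[i] == c);
    return n;
}

/* Resolver: run by the dynamic linker when the symbol is resolved,
   not on every call.  It must not rely on much of the C library. */
static count_fn *resolve_count_byte(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return count_byte_unrolled;
#endif
    return count_byte_generic;
}

/* The public symbol gets type STT_GNU_IFUNC; callers use it like any
   other function. */
size_t count_byte(const char *buf, size_t len, char c)
    __attribute__((ifunc("resolve_count_byte")));
```

The per-call feature test is gone: after the symbol has been resolved, a call to count_byte goes directly to the implementation the resolver picked.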