DSO memory waste [Jun. 10th, 2006|07:29 am]
Ulrich Drepper
It is commendable that Frederico cares about the memory footprint of DSOs. Somebody already pointed out the problem in his argumentation: much of the memory used for private dirty data cannot be avoided. There are almost always some initialized variables involved, and once you have even one, an entire memory page has to be used for it.

But there are a number of other disturbing things which can be learned from his list:

  1. 56 programs or DSOs are linked with libresolv. Why? Hardly anybody needs the resolver directly, and in most of these cases the resolver isn't even used. There are some additional functions in that DSO which a few specialty programs need, but they are rare.

  2. 136 programs or DSOs use libnsl, but only 102 of those uses come via libnss_nis. That means 34 are linked directly with this DSO. Why? What reason can there be? No program really uses the functions from this DSO directly (they all deal with NIS and NIS+ access). Well, actually I know why this is the case: autoconf automatically adds -lnsl to the link line (for Solaris' sake) if the author is too lazy. It means the configure scripts just have to be fixed.

  3. The system apparently didn't use nscd. This means many, many processes have to load the NSS code themselves. nscd not only speeds up lookups, it also helps to avoid loading DSOs. For the system analyzed this would amount to about 1.2M of savings for the private data alone.

Furthermore, the list doesn't provide information about the shared dirty mapping which is at least as problematic.

If you really want to analyze the waste you have to dive a bit deeper. First, you need to understand that waste does not only come from the dirty pages. Every DSO mapping has a cost, not only in runtime, but also in memory. Even if the physical pages for a DSO can be shared, the administrative information is unique to each process. The kernel has to generate VMAs for each mapping, and each DSO consists of up to 5 VMAs. Somebody with more knowledge of the kernel's VM handling should find out how much memory each VMA needs, but it's not negligible.

Then, as the next step, run ldd -u -r on the binaries of the processes running. E.g., gnomine:

Unused direct dependencies:

Loading these DSOs is, at least from the startup phase's point of view, unnecessary. None of the symbols in any of these DSOs is directly used. It might be that some of the DSOs are used later with dlopen. But why then use dlopen instead of direct linking and jumps? That would be faster. Anyway, add up all the memory used for the unused DSOs (dirty pages, administrative data in the kernel and ld.so) and I bet you can save many MBs of RAM from this alone. Just use the linker's --as-needed option.

But the problems go deeper. Look at the relocation info of, say, libgtk:

$ relinfo.pl /usr/lib/libgtk-x11-2.0.so.0
/usr/lib/libgtk-x11-2.0.so.0: 2029 relocations, 1982 relative (97%), 1337 PLT entries, 144 for local syms (10%)

1982 relative relocations? That's appalling. 144 PLT entries for local symbols? Does anyone really intend to preempt any of them? A year or so ago I cleaned up libselinux and at that time the numbers were similar. Today they are:

$ relinfo.pl /usr/lib/libselinux.so
/usr/lib/libselinux.so: 15 relocations, 8 relative (53%), 94 PLT entries, 0 for local syms (0%)

Yes, libgtk needs more external functions, so the total number of PLT entries might legitimately be as high as it is. But look at the libselinux numbers: no PLT entries for local symbols. You really don't have to be able to preempt any of them. And look at the number of relative relocations. I think I shrank the number from close to 1,000 to below 10. The reason the number was so high is that people simply don't understand how to write position independent code, especially how to lay out the data structures used for global variables. This leads to unsharable data. E.g.,

static const char *const strs[] = { "one", "two", "three" };
const char *nname(int n) { return strs[n-1]; }

Even though the array strs is marked as const (please make sure you understand which of the two consts specifies this), the array cannot be in the .rodata section of a DSO because the string pointers have to be adjusted for the load address. This means:

  1. We have 24 bytes of unsharable data on 64-bit machines.

  2. We need 3 relocation records, each 24 bytes on most 64-bit archs.

  3. We also have an array of pointers to the strings in addition to the strings themselves (another 24 bytes).

Now look at this rewrite:

static const char str[] = "one\0two\0three";
static const uint8_t idx[] = { 0, 4, 8 };
const char *nname(int n) { return str + idx[n-1]; }

All of the listed disadvantages go away. str is only a name for a memory region in the .rodata section. There are no additional pointers. The total data size is 17 bytes, as opposed to 110 bytes for the old code (including relocations). The code for the function is hardly larger, maybe one additional instruction. The details about all this can be found in my DSO HowTo; Appendix B shows how the above technique can be applied without sacrificing maintainability of the code.
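One way to keep such tables maintainable without hard-coding the offsets (a sketch in the spirit of what the appendix covers; the names here are invented for illustration) is to collect the strings in one struct and let the compiler compute the index table with offsetof:

```c
#include <stddef.h>
#include <stdint.h>

/* All strings live in a single struct, so they occupy one
   contiguous block of .rodata with no pointers and no
   relocations; offsetof keeps the index table in sync when
   entries are added or removed.  Names are hypothetical.  */
static const struct numstrs
{
  char one[sizeof ("one")];
  char two[sizeof ("two")];
  char three[sizeof ("three")];
} numstrs = { "one", "two", "three" };

static const uint8_t idx[] =
{
  offsetof (struct numstrs, one),
  offsetof (struct numstrs, two),
  offsetof (struct numstrs, three)
};

const char *nname (int n)
{
  return (const char *) &numstrs + idx[n - 1];
}
```

Adding a fourth string means adding one struct member and one offsetof line; the offsets can never drift out of sync with the data the way the literal { 0, 4, 8 } table could.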

There is only one reason for not applying this technique, and that is if the variables in question are exported. I have always warned against making variables part of the ABI of a DSO. It is never a good idea. For historical reasons it might be needed (which is why libresolv has such high numbers on Frederico's list; I wish I could change it). But that's all.
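If code outside the DSO does need the value, one alternative (a sketch; the names are invented for illustration, not taken from any real interface) is to keep the variable itself private and export accessor functions instead:

```c
/* The variable stays private to the DSO: it is not part of the
   ABI, cannot be copy-relocated into executables, and its type
   or layout can change in a later version without breaking
   anything.  Only the two functions below are exported.  */
static int debug_level;

int get_debug_level (void)
{
  return debug_level;
}

void set_debug_level (int level)
{
  debug_level = level;
}
```

The function calls cost a few cycles, but the DSO's data layout stops being a contract with its users.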

I am willing to bet that the number of relocations in libgtk can be reduced significantly, maybe as drastically as in libselinux's case. Removing, say, 2,000 relocations means shrinking the DSO's relocation sections by 48,000 bytes (each record is 24 bytes on most 64-bit archs). This means 12 pages which don't have to be paged in through major and minor page faults. In non-prelinked binaries the startup is faster, and usually the changes result in smaller data structures as well (see the example above).

So, if you're really concerned about wasting memory (and I think you should be; this is one of the few things I agree with Negroponte about), then start looking at the data structures used in the DSOs and at the unnecessary dependencies. That's where you get the biggest bang for the buck.