Categories

Heaps of Fun with glibc malloc

Update 06/2018: Added thread-local caching (tcache)

Introduction to glibc malloc

What is the heap? If you’ve taken an operating systems class before, you might recall that it is a free-floating region of memory that is managed by a memory allocator. When using a memory allocator to allocate memory, we call it dynamic memory allocation. In a C program, you must use the built-in functions malloc(), calloc(), realloc() and free() to invoke the memory allocator.

The function malloc() is not the memory allocator – it invokes the memory allocator linked by the application at compile time. By default, a gcc-compiled application uses the ptmalloc2 allocator in glibc.

There are many other memory allocators out there: dlmalloc, jemalloc, tcmalloc, etc., all of them claiming to be memory-efficient, fast, scalable and stable over long periods of time (which is important for server applications, which do run for very long periods of time). However, different allocators perform differently under different workloads, and for memory-intensive applications you might see a performance improvement when you switch allocators.

This article will try to give an overview of the structures in glibc‘s malloc while covering the essential algorithmic details of the allocation process. We will only be discussing the ptmalloc2 allocator in glibc, which was based off the Doug Lea allocator (better known as dlmalloc) and had fork() support added to it. ptmalloc2 has been integrated into glibc and any changes were made directly to the glibc source tree.

High-level behavior

malloc makes use of the brk/sbrk and mmapsystem calls to allocate the program heap. From the manpages:

brk() and sbrk() change the location of the program break, which defines the end of the process’s data segment (i.e., the program
break is the first location after the end of the uninitialized data
segment). Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.

brk() sets the end of the data segment to the value specified by addr, when that value is reasonable, the system has enough memory, and the process does not exceed its maximum data size (see setrlimit(2)).

sbrk() increments the program’s data space by increment bytes. Calling sbrk() with an increment of 0 can be used to find the current location of the program break.

http://man7.org/linux/man-pages/man2/brk.2.html

So when does malloc decide to use the heap space by incrementing the end-of-heap pointer with brk/sbrk, vs. reserving a region of memory using mmap? The behavior is controlled via a parameter M_MMAP_THRESHOLD which is set to 128 KiB by default (which can be changed via mallopt() in malloc.h) If the requested size is less than 128 KiB, malloc uses brk/sbrk – otherwise, it uses mmap. To illustrate, consider the following code:

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

void print_map()
{
char sz_exec_cmd[64];
sprintf(&sz_exec_cmd, "cat /proc/%d/maps\0", getpid());
printf("Output of /proc/%d/maps:\n", getpid());
system(sz_exec_cmd);
printf("\n");
return;
}

int main()
{
printf("Process started with pid %d\n", getpid());

// Get current brk ptr with sbrk(0)
printf("Before calling brk, end of heap is at %p\n", sbrk(0));

// Increment brk ptr by 1 page size (4K)
brk(sbrk(0) + 4096); // Equiv to sbrk(4096);
printf("After +4096, end of heap is at %p\n", sbrk(0));

sbrk(-4096); // Equiv to brk(sbrk(0)-4096);
printf("After -4096, end of heap is at %p\n\n", sbrk(0));
print_map();

void *p1 = malloc(120*1024);
printf("Invoked malloc(120*1024).\n");
print_map();

free(p1);
void *p2 = malloc(140*1024);
printf("Invoked malloc(140*1024).\n");
print_map();
free(p2);

return 0;
}


Compile and run the program with the command

gcc -w -o malloc_syscall_demo malloc_syscall_demo.c && ./malloc_syscall_demo

The following printouts demonstrate the behavior of brk/sbrk after each invocation:

Process started with pid 2780
Before calling brk, end of heap is at 0x55555557a000
After +4096, end of heap is at 0x55555557b000
After -4096, end of heap is at 0x55555557a000

In the program virtual memory layout we can see the allocated heap that corresponds to the return value from sbrk(0):

Output of /proc/2780/maps:
...
555555559000-55555557a000 rw-p 00000000 00:00 0 [heap]
7ffff7df0000-7ffff7e15000 r--p 00000000 08:01 131130 /usr/lib/x86_64-linux-gnu/libc-2.30.so
...

This tells us that the virtual address (VA) range for the heap is 0x555555559000-0x55555557a000, with the read, write and private flags set. Since no file is mapped here, we have a file offset of 00000000 and an inode number of 0.

After the printout line “Invoked malloc(120*1024) for the first time.“, we see no change in the process virtual memory layout since it fits into the allocated heap VA range. However, after the printout line “Invoked malloc(120*1024) for the second time.” we see that the heap VA range has expanded to fit the two malloc-ed blocks of memory:

Output of /proc/2780/maps:
...
555555559000-5555555b6000 rw-p 00000000 00:00 0 [heap]
7ffff7df0000-7ffff7e15000 r--p 00000000 08:01 131130 /usr/lib/x86_64-linux-gnu/libc-2.30.so
...

And after printout line “Invoked malloc(140*1024).“, we see that a memory region has been allocated at a vastly different VA range:

Output of /proc/2780/maps:
...
555555559000-55555557a000 rw-p 00000000 00:00 0 [heap]
7ffff7dcc000-7ffff7df0000 rw-p 00000000 00:00 0
7ffff7df0000-7ffff7e15000 r--p 00000000 08:01 131130 /usr/lib/x86_64-linux-gnu/libc-2.30.so
...

This is because the requested memory is greater than the default M_MMAP_THRESHOLD value.

Seems simple enough, right? Not quite. There are many more things going on under the hood…

One of the main reasons why ptmalloc2 was integrated into glibc as the default memory allocator was due to its support for threading, which helped improve memory allocator performance (and by extension, application performance).

Sometimes two threads want to malloc memory at the same time. If you’re lucky, they don’t step on each others’ toes, but… no one’s that lucky. So, malloc uses a lock to make the threads “take turns” accessing malloc’s internal structures. Still, taking turns means one thread is doing nothing for a while, which can’t be good for performance, and these locks themselves can take a significant amount of time as they often need to synchronize access across CPU caches and sometimes across physical CPUs. ptmalloc2 solves this by creating a separate heap segment and freelist data structures for each process thread.

Recall that

A free list is a data structure used in a scheme for dynamic memory allocation. It operates by connecting unallocated regions of memory together in a linked list, using the first word of each unallocated region as a pointer to the next.

https://en.wikipedia.org/wiki/Free_list

The term “arena” refers to separate set of freelists and heap segments. Let’s take a look at the following toy example:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

void print_map()
{
char sz_exec_cmd[64];
sprintf(&sz_exec_cmd, "cat /proc/%d/maps\0", getpid());
int dummy;
register void *rbp asm("rbp");
register void *rsp asm("rsp");
printf("Stack frame is %p-%p\n", rsp, rbp);
printf("Output of /proc/%d/maps:\n", getpid());
system(sz_exec_cmd);
printf("\n");
return;
}

void runnable_func(void* arg) {
printf("In child thread: before malloc\n");
print_map();
void *addr = malloc(1024);
printf("In child thread: after malloc returned %p\n", addr);
print_map();
return;
}

int main() {
void* s;
printf("Process started with pid %d\n",getpid());
void *addr = malloc(1024);
printf("In parent thread: after malloc returned %p\n", addr);
print_map();
pthread_create(&th, NULL, &runnable_func, NULL);
return 0;
}

Compile and run it with the following command:

gcc -w -o malloc_pthread_demo -pthread malloc_pthread_demo.c && ./malloc_pthread_demo

Observe the location of the stack and the heap in the parent thread:

Process started with pid 5331
In parent thread: after malloc returned 0x5555555596b0
Stack frame is 0x7fffffffe150-0x7fffffffe190
Output of /proc/5331/maps:
555555554000-555555555000 r--p 00000000 08:01 1967380 /home/kali/heaps/malloc_pthread_demo
...
555555559000-55555557a000 rw-p 00000000 00:00 0 [heap]
7ffff7dcd000-7ffff7dd0000 rw-p 00000000 00:00 0
...
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 08:01 131126 /usr/lib/x86_64-linux-gnu/ld-2.30.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0 [stack]

After we create and enter the child thread, a new thread-local stack space and arena is created. Observe the output after malloc is invoked:

In child thread: malloc(1024) returned 0x7ffff0000b60
In child thread: malloc(140*1024) returned 0x7ffff75a8010
Stack frame is 0x7ffff7dcbe80-0x7ffff7dcbec0
Output of /proc/5368/maps:
...
555555559000-55555557a000 rw-p 00000000 00:00 0 [heap] (main arena)
7ffff0000000-7ffff0021000 rw-p 00000000 00:00 0 (thread arena)
7ffff0021000-7ffff4000000 ---p 00000000 00:00 0
7ffff75a8000-7ffff75cc000 rw-p 00000000 00:00 0 (thread mmap region)
7ffff75cc000-7ffff75cd000 ---p 00000000 00:00 0 (main mmap region)
7ffff75cd000-7ffff7dd0000 rw-p 00000000 00:00 0 (thread stack)
7ffff7dd0000-7ffff7df5000 r--p 00000000 08:01 131130 /usr/lib/x86_64-linux-gnu/libc-2.30.so

Note here that the main arena is mapped to the heap segment of the process, and the per-thread arenas are mapped to the memory-mapping segment of the process.

Multiple Arenas

When more thread collisions on arena critical sections happen (in multi-thread applications), additional arenas are created via mmap.

As it turns out, there is an upper limit to the number of arenas per process. Having per-thread arenas for all threads seems to be rather expensive (some applications use hundreds if not thousands of threads!) This behavior is defined in malloc.c in the macro NARENAS_FROM_CORES:

#define NARENAS_FROM_NCORES(n) ((n) * (sizeof (long) == 4 ? 2 : 8))

where n is the number of processor cores. So for 32-bit applications we are allowed 2n arenas, and 8n arenas for 64-bit applications – if no cores are detected, the code assumes n=2. You can change this limit by setting the MALLOC_ARENA_MAX environment variable to the number of arenas you want.

When contention happens (number of threads exceeds the number of arenas) on the first thread invocation of malloc, at first the arenas are reused by looping over the available arenas and attempting to lock them. If an arena is locked successfully, it is returned to the application thread; otherwise, it is placed in a blocking wait for the next arena. Once an arena has been reused by a thread, malloc will attempt to reuse the same arena on its second thread invocation, placing the thread on blocking wait until the same arena is free.

The implication of this design for a heavily threaded application like the Apache Web Server is that there will be some thread contention, with a trade-off of lesser memory fragmentation.

Arena and Heap Organization

Each arena has a mutex which is used to control access to that arena. All operations (except for fastbin access, more on fastbins) require that the thread acquire a lock on the arena. Multiple thread contention on this mutex is the reason why multiple arenas are created, since threads assigned to different arenas need not wait for one another. Threads will automatically switch to unused (unlocked) arenas if contention requires it.

Each arena obtains memory from one or more heaps. The main arena uses the program’s initial heap, while additional arenas allocate memory for their heaps using mmap, adding more heaps to their list of heaps as older heaps are used up. Each arena keeps track of a special “top” chunk, which is typically the biggest available chunk, and also maintains a pointer to the most recently allocated heap.

The heap metadata is organized with the help of 3 data structures as shown in the two figures above:

• malloc_chunk structure (Chunk Header): The heap is sliced into many chunks based on the malloc calls by the user. The chunk is the smallest unit allocated by malloc, and each chunk has its own malloc_chunk header structure.
• malloc_state structure (Arena Header): A single-threaded arena may have multiple heaps (due to thread contention) but those heaps are governed by a single malloc_state header structure. This structure tells the allocator where the top chunk (chunk at the highest address), last remainder chunk, bins, etc. are.
• heap_info structure: Every heap has its own header, which has a pointer to the previous heap_info structure. When multiple heaps are mapped to the same arena (e.g. when the heap segment runs out of space), this structure maintains information about the multiple heaps.

The main arena does not have a heap_info structure, since only 1 heap exists. If it runs out of space, brk/sbrk is invoked to extend the end of the section until the memory-mapping segment is reached. The malloc_state structure is a global variable stored in the mapped area of the shared library libc.so.

When new chunks are allocated, top (the contiguous free memory chunk available for allocation), well, always stays on top, and is referred to by the malloc_state struct in the arena. Anything past top (in the higher addresses) is called “the wilderness”. top can grow or shrink, depending on how much space is needed in the arena. Initially the size of top is 0, forcing malloc to allocate space to top when an application requests space.

malloc_state

The malloc_state is a per-arena struct. (Note that the main arena malloc_state is a global variable and not part of the heap segment.) Arena headers or malloc_state structs are stored in the heap segment. As seen earlier, arenas can have memory from multiple heaps.

struct malloc_state
{
__libc_lock_define (, mutex);
int flags;

// Fast bins
mfastbinptr fastbinsY[NFASTBINS];
// Topmost free chunk
mchunkptr top;
// Pointer to a malloc_chunk structure from
// from a recent chunk split
mchunkptr last_remainder;
// Non-fastbins
mchunkptr bins[NBINS * 2 - 2];
// Bitmap of bins
unsigned int binmap[BINMAPSIZE];

// Linked list for arenas
struct malloc_state *next;
struct malloc_state *next_free;
// Number of threads attached to this arena
// 0 if the arena is on the free list.

/* Memory allocated from the system in this arena.  */
INTERNAL_SIZE_T system_mem;
INTERNAL_SIZE_T max_system_mem;
};

typedef struct malloc_state *mstate;

Bins

Within each arena, chunks are either in-use by the application or free (available for allocation).

• In-use chunks are not tracked by the arena.
• Free chunks are stored in various freelists based on size and history, so that the allocator can find suitable chunks to satisfy allocation requests quickly.

The freelists are actually called bins. Bins are in-memory linked structures that keep track of all the freed chunks:

• Fast bins (fastbinsY[] in malloc_state): There are 10 fastbins as defined by NFASTBINS but only the first 7 are used by default, servicing sizes up to 128 B inclusive (160 B including chunk metadata, as defined by MAX_FAST_SIZE). Each fast bin holds chunks with a specific size, where n-th fastbin is a linked list of chunks with sizes in the range (32n, 32n+32] in bytes (so fastbinsY[0] holds chunk sizes 1 B32 B, fastbinsY[1] holds chunk sizes 33 B64 B, …, fastbinsY[6] holds chunk sizes 97 B128 B). Chunks added to a fast bin are not combined with adjacent chunks to keep access fast (hence fast bins). Chunks in the fastbins may be moved to other bins as needed. Fastbin chunks are stored in a single linked list, since they have identical sizes and chunks in the middle of the list don’t have to be accessed.
• Normal bins (bins[] in malloc_state): The normal/regular bins are divided into small bins (where all chunks in a bin have identical sizes), and large bins (where chunks have varied sizes). When a chunk is added to these bins, they are combined with adjacent chunks to coalesce them into larger contiguous chunks. Hence, these chunks are usually never adjacent to other such normal chunks (although they may be adjacent to fast, unsorted or in-use chunks). These bins are doubly-linked lists so that chunks may be removed from any position (for example, when coalescing chunks).
• Unsorted bin (bins[1]): When chunks are freed they are initially stored in the unsorted bin. The allocator sorts them later in order to give them a chance at being quickly re-used. The unsorted bin is the first bin of the normal bins. Note that bins[0] is unused.
• Small bins (bins[2:64]): 62 small bins (index 2-63) hold small chunks of identical size, where the bin of index n holds chunks in the range [16n,16n+16) bytes (so bins[2] has chunk sizes 32 B47 B, bins[3] has chunk sizes 48 B63 B, …, bins[63] has chunk sizes 1008 B1023 B). When allocating a small chunk, the head pointer for the corresponding bin is used. The maximum is defined by the calculated constant MIN_LARGE_SIZE which is 1024 B on a 64-bit system.
• Large bins (bins[64:127]): 63 large bins (index 64-126) hold chunks between 1024 B and 128 KiB inclusive (or whatever value M_MMAP_THRESHOLD is set to). Each bin (64-126) holds a different range of sizes, therefore they have to be sorted (in decreasing order) and coalesced if necessary. When allocating a large chunk, the best chunk size has to be searched for and possibly split into two chunks (one with the requested size, and one for the remainder).

The large bins are divided into 6 groups, where each group contains bins of different chunk sizes (each bin has some chunk capacity increment over the previous bin):

When large chunks are freed, they are stored by size, in their respective bins. Smaller chunks are stored in small bins that hold freed chunks of the same size, sorted by when they were freed.

malloc_chunk

As defined in malloc-internal.h, the malloc_chunk struct contains metadata about a chunk.

struct malloc_chunk {
// If free, holds size of previous chunk
INTERNAL_SIZE_T      mchunk_prev_size;
// Holds size of chunk, including data and overhead
INTERNAL_SIZE_T      mchunk_size;
// Forward pointer used in normal bins
struct malloc_chunk* fd;
// Backward pointer used in normal bins
struct malloc_chunk* bk;
// Forward pointer to next larger free chunk in large bins
struct malloc_chunk* fd_nextsize;
// Backward pointer to previous smaller free chunk in large bins
struct malloc_chunk* bk_nextsize;
};

typedef struct malloc_chunk* mchunkptr;

The fd, fd_nextsize, bk, bk_nextsize pointers are only used when the chunk is free. Note that the names fd_nextsize and bk_nextsize are malloc_chunk pointers and not size values (misleading, I know) – they are pointers to the next larger free chunk and the previous smaller free chunk, respectively. These pointers exist because of the need to find the best fit for large chunks, where chunks are sorted by size in descending order. This enables the allocator to quickly search for the first chunk that is big enough to satisfy some request (this is known as glibc malloc‘s first-fit behavior). However there is a small quirk: if the first two chunks are of the same size, first-fit usually returns the second chunk (to avoid adjusting the head pointer of the linked list). For the same reason, chunks are inserted after a chunk of the same size.

Allocated Chunk

The last 3 bits of the mchunk_size field is reserved for flag information:

• PREV_INUSE (0x1): Set when the previous chunk is allocated.
• IS_MMAPPED (0x2): Set when the chunk is mapped using mmap.
• NON_MAIN_ARENA (0x4): Set when the chunk is part of a non-main arena.

Notice that the pointers fd and bk are unused here. Also, the mchunk_size field is also never returned directly to the application; Instead, the macros chunk2mem and mem2chunk are used to convert malloc internal headers to user pointers and vice versa.

#define chunk2mem(p)   ((void*)((char*)(p) + 2*SIZE_SZ))
#define mem2chunk(mem) ((mchunkptr)((char*)(mem) - 2*SIZE_SZ))

Free Chunk

Consecutive free chunks are combined into one contiguous free chunk. Hence, no consecutive free chunks exist in a bin list, and the mchunk_prev_size field is always filled with data from the previous allocated chunk. (Except for fast bin chunks which are treated as allocated chunks, which do not consolidate neighboring free chunks).

The pointer to the location of fd will be returned to the caller in the application.

Every thread has a thread-local variable that saves the last-used arena. Recall that when a thread needs to use the same arena last used, and that arena happens to be in use, the thread goes into blocking wait until the arena is free. If a thread has never used an arena before, then it does one of reusing an unused arena, creating a new arena, or picking the next available arena.

To reduce arena contention, every thread has a per-thread cache called the tcache, which was introduced in glibc>=2.26.

typedef struct tcache_entry
{
struct tcache_entry *next;
} tcache_entry;

{
char counts[TCACHE_MAX_BINS];
tcache_entry *entries[TCACHE_MAX_BINS];

static __thread char tcache_shutting_down = 0;
static __thread tcache_perthread_struct *tcache = NULL;

The tcache contains a small linked list of chunks which can be accessed without needing to lock an arena. There are 64 tcache bins per thread by default, and chunk sizes range from 24 to 1032 bytes in 16 byte increments on a 64-bit system. tcache maintains pointers to the application data area (payload) via tcache->entries. Every tcache bin contains chunks of identical size. Similar to fast bins, tcache->entries is a LIFO structure.

The number of chunks allowed in each tcache bin is constrained by TCACHE_COUNT. If there is a suitable chunk of exact requested size in tcache->entries, it is returned to the caller. This takes precedence over fast bins.

If the tcache bin is empty for a given requested size, the tcache is not used and it falls back on normal malloc routines (blocking wait, locking an arena). This is to reduce internal fragmentation.

Summary

Now we know the main structures used in malloc() and free(). Putting it all together, we have the allocation and free algorithms that are detailed in the subsections that follow.

Allocation Algorithm

When malloc(p) is invoked for some pointer p pointing to a space of size x, the overall process looks like this:

1. If there is a free chunk in the tcache with size exactly x, it is returned to the caller.
2. If x > M_MMAP_THRESHOLD, mmap() is invoked to allocate memory.
3. If there is a chunk in the fast bin of size >= x, it is returned to the caller. If there is more than one suitable chunk available, the tcache is also pre-filled.
4. If there is a chunk in the small bin of size >= x, it is returned to the caller. If there is more than one suitable chunk available, the tcache is also pre-filled.
5. If the request is large, the fast bin chunks are coalesced and moved to the unsorted bin.
6. The unsorted bin is sorted and coalesced, then moved to the appropriate small/large bins. If there is a suitable chunk of size >= x, it is returned to the caller.
7. If the request is large, the large bins are searched (starting from the smallest suitable large bin for the request) and the first chunk of size >= x is returned to the caller.
8. If there are remaining chunks in the fast bins, they are consolidated and steps 6-7 are repeated.
9. top is aligned as necessary via splitting and/or enlarging.

Free Algorithm

When free() is invoked, a chunk of memory is marked as “free to be reused” by the application, but is still reserved by the OS for the application. However, if the top chunk becomes large enough, some of that memory may be un-mapped and returned to the OS. The process for free(p) where p points to memory of size x looks like this:

1. If there is space in the tcache, the chunk is stored there and the function returns.
2. If x is small enough to be in a fast bin, it is placed in the appropriate fast bin.
3. If the chunk was allocated via mmap(), munmap() is invoked to de-allocate it.
4. If p is next to another free chunk, both chunks are consolidated into a contiguous chunk (which may be the top chunk)
5. The new consolidated chunk is placed in the unsorted bin (skipped if the chunk is now top).
6. During sometime in the lifecycle of the allocator:
• If x is large enough, the fast bin list is consolidated.
• If the top chunk is large enough, some memory is split off and returned to the OS.

Main Concepts

To conclude, let’s cover the main concepts in glibc‘s memory allocator:

• Every allocation is a memory chunk which is aligned to 16 B as defined by MALLOC_ALIGNMENT (which is the largest of the size of a long double (128 bits) and the size of size_t (64 bits)).
• Every allocation contains metadata as well as the requested region of memory.
• There is a main arena in the program heap segment whose malloc_state struct is a global variable in libc.so.
• Arenas keep track of free chunks. Multiple threads can use the same arena through blocking-wait and locking via the mutex in malloc_state, and arenas are created as they are needed.
• The maximum number of arenas is 8n by default on a 64-bit system, where n is the number of cores.
• Each arena can have multiple heaps. Here, heaps refer to spaces allocated via mmap() calls.
• Each arena maintains a single malloc_state structure, which keeps track of the top chunk. The malloc_state structure also defines 10 fast bins and 126 regular bins by default.
• Out of the 10 fast bins defined, 7 fast bins by default (which are singly-linked lists of free chunks) service chunks up to 128 B.
• The regular bins are split up into unsorted, small and large bins.
• 62 small bins indexed 2-63 hold chunks equal to or above 32 B and below 1024 B.
• 63 large bins indexed 64-126 hold chunks above 1024 B inclusive, up to 128 KiB or M_MMAP_THRESHOLD.
• The unsorted bin (index 1) holds freed chunks.
• Index 0 is unused.
• Each heap in an arena has its own heap_info structure. Multiple heap_info structures form a linked list via the prev field, and the ar_ptr of the top heap_info structure points to the malloc_state structure of the arena.
• The number of arenas is constrained by MALLOC_ARENA_MAX.
• The area past the top chunk is called the wilderness.

Memory allocation in modern applications is complex, and hopefully this article distills the main components of glibc‘s memory allocator down into a more approachable form.

References

https://sourceware.org/glibc/wiki/MallocInternals

https://code.woboq.org/userspace/glibc/malloc/malloc.c.html