Message ID | 20240804080107.21094-4-laoar.shao@gmail.com (mailing list archive)
---|---
State | New
Series | mm: Introduce a new sysctl knob vm.pcp_batch_scale_max
Yafang Shao <laoar.shao@gmail.com> writes:

[snip]

> Why introduce a sysctl knob?
> ============================
>
> From the above data, it's clear that different CPU types have varying
> allocation latencies concerning zone->lock contention. Typically, people
> don't release individual kernel packages for each type of x86_64 CPU.
>
> Furthermore, for latency-insensitive applications, we can keep the default
> setting for better throughput.

Do you have any data to prove that the default setting is better for
throughput? If so, that will be strong support for your patch.

> In our production environment, we set this
> value to 0 for applications running on Kubernetes servers while keeping it
> at the default value of 5 for other applications like big data. It's not
> common to release individual kernel packages for each application.

[snip]

--
Best Regards,
Huang, Ying
On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > [snip] > > > > > Why introduce a systl knob? > > =========================== > > > > From the above data, it's clear that different CPU types have varying > > allocation latencies concerning zone->lock contention. Typically, people > > don't release individual kernel packages for each type of x86_64 CPU. > > > > Furthermore, for latency-insensitive applications, we can keep the default > > setting for better throughput. > > Do you have any data to prove that the default setting is better for > throughput? If so, that will be a strong support for your patch. No, I don't. The primary reason we can't change the default value from 5 to 0 across our fleet of servers is that you initially set it to 5. The sysadmins believe you had a strong reason for setting it to 5 by default; otherwise, it would be considered careless for the upstream kernel. I also believe you must have had a solid justification for setting the default value to 5; otherwise, why would you have submitted your patches? > > > In our production environment, we set this > > value to 0 for applications running on Kubernetes servers while keeping it > > at the default value of 5 for other applications like big data. It's not > > common to release individual kernel packages for each application. > > > > [snip] > > -- > Best Regards, > Huang, Ying -- Regards Yafang
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> [snip] >> >> > >> > Why introduce a systl knob? >> > =========================== >> > >> > From the above data, it's clear that different CPU types have varying >> > allocation latencies concerning zone->lock contention. Typically, people >> > don't release individual kernel packages for each type of x86_64 CPU. >> > >> > Furthermore, for latency-insensitive applications, we can keep the default >> > setting for better throughput. >> >> Do you have any data to prove that the default setting is better for >> throughput? If so, that will be a strong support for your patch. > > No, I don't. The primary reason we can't change the default value from > 5 to 0 across our fleet of servers is that you initially set it to 5. > The sysadmins believe you had a strong reason for setting it to 5 by > default; otherwise, it would be considered careless for the upstream > kernel. I also believe you must have had a solid justification for > setting the default value to 5; otherwise, why would you have > submitted your patches? In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to avoid too long latency"), I tried my best to run test on the machines available with a micro-benchmark (will-it-scale/page_fault1) which exercises kernel page allocator heavily. From the data in commit, larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not much. The 99% alloc/free latency can be kept within about 100us with CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. But, we can always improve the default value with more data, on more types of machines and with more types of benchmarks, etc. Your data suggest smaller default value because you have data to show that larger default value has the latency spike issue (as large as tens ms) for some practical workloads. Which weren't tested previously. In contrast, we don't have strong data to show the throughput advantages of larger CONFIG_PCP_BATCH_SCALE_MAX value. So, I suggest to use a smaller default value for CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best choice. >> >> > In our production environment, we set this >> > value to 0 for applications running on Kubernetes servers while keeping it >> > at the default value of 5 for other applications like big data. It's not >> > common to release individual kernel packages for each application. >> > >> >> [snip] >> -- Best Regards, Huang, Ying
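The trade-off described above comes straight from the batch scaling
arithmetic: a single PCP refill or drain moves at most
batch << CONFIG_PCP_BATCH_SCALE_MAX pages while zone->lock is held. A
minimal sketch of that arithmetic, assuming a hypothetical per-zone batch of
63 order-0 pages and 4 KiB pages (the kernel computes the real batch per
zone, so 63 is only an illustration):

    # Sketch only: worst-case pages moved under zone->lock per PCP
    # refill/drain at each scale factor. batch=63 is an assumption for
    # illustration, not a value taken from any particular machine.
    PAGE_SIZE = 4096
    batch = 63

    for scale_max in range(0, 7):      # the knob's valid range is 0..6
        pages = batch << scale_max     # upper bound per refill/drain
        print(f"scale_max={scale_max}: up to {pages} pages "
              f"({pages * PAGE_SIZE // 1024} KiB) per zone->lock hold")

With scale_max == 5 that is up to 2016 pages (roughly 8 MiB) allocated or
freed in one go, which is why the lock hold time, and therefore the tail
latency, grows with the scale factor.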
On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> [snip] > >> > >> > > >> > Why introduce a systl knob? > >> > =========================== > >> > > >> > From the above data, it's clear that different CPU types have varying > >> > allocation latencies concerning zone->lock contention. Typically, people > >> > don't release individual kernel packages for each type of x86_64 CPU. > >> > > >> > Furthermore, for latency-insensitive applications, we can keep the default > >> > setting for better throughput. > >> > >> Do you have any data to prove that the default setting is better for > >> throughput? If so, that will be a strong support for your patch. > > > > No, I don't. The primary reason we can't change the default value from > > 5 to 0 across our fleet of servers is that you initially set it to 5. > > The sysadmins believe you had a strong reason for setting it to 5 by > > default; otherwise, it would be considered careless for the upstream > > kernel. I also believe you must have had a solid justification for > > setting the default value to 5; otherwise, why would you have > > submitted your patches? > > In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > avoid too long latency"), I tried my best to run test on the machines > available with a micro-benchmark (will-it-scale/page_fault1) which > exercises kernel page allocator heavily. From the data in commit, > larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > much. The 99% alloc/free latency can be kept within about 100us with > CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > > But, we can always improve the default value with more data, on more > types of machines and with more types of benchmarks, etc. > > Your data suggest smaller default value because you have data to show > that larger default value has the latency spike issue (as large as tens > ms) for some practical workloads. Which weren't tested previously. In > contrast, we don't have strong data to show the throughput advantages of > larger CONFIG_PCP_BATCH_SCALE_MAX value. > > So, I suggest to use a smaller default value for > CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > choice. Which smaller default value would be better? How can we ensure that other workloads, which we haven't tested, will work well with this new default value? If you have a better default value in mind, would you consider sending a patch for it? I would be happy to test it with my test case. -- Regards Yafang
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> [snip] >> >> >> >> > >> >> > Why introduce a systl knob? >> >> > =========================== >> >> > >> >> > From the above data, it's clear that different CPU types have varying >> >> > allocation latencies concerning zone->lock contention. Typically, people >> >> > don't release individual kernel packages for each type of x86_64 CPU. >> >> > >> >> > Furthermore, for latency-insensitive applications, we can keep the default >> >> > setting for better throughput. >> >> >> >> Do you have any data to prove that the default setting is better for >> >> throughput? If so, that will be a strong support for your patch. >> > >> > No, I don't. The primary reason we can't change the default value from >> > 5 to 0 across our fleet of servers is that you initially set it to 5. >> > The sysadmins believe you had a strong reason for setting it to 5 by >> > default; otherwise, it would be considered careless for the upstream >> > kernel. I also believe you must have had a solid justification for >> > setting the default value to 5; otherwise, why would you have >> > submitted your patches? >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to >> avoid too long latency"), I tried my best to run test on the machines >> available with a micro-benchmark (will-it-scale/page_fault1) which >> exercises kernel page allocator heavily. From the data in commit, >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not >> much. The 99% alloc/free latency can be kept within about 100us with >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. >> >> But, we can always improve the default value with more data, on more >> types of machines and with more types of benchmarks, etc. >> >> Your data suggest smaller default value because you have data to show >> that larger default value has the latency spike issue (as large as tens >> ms) for some practical workloads. Which weren't tested previously. In >> contrast, we don't have strong data to show the throughput advantages of >> larger CONFIG_PCP_BATCH_SCALE_MAX value. >> >> So, I suggest to use a smaller default value for >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best >> choice. > > Which smaller default value would be better? This depends on further test results. > How can we ensure that other workloads, which we haven't tested, will > work well with this new default value? We cannot. We can only depends on the data available. If there are new data available in the future, we can make the change accordingly. > If you have a better default value in mind, would you consider sending > a patch for it? I would be happy to test it with my test case. If you can test the value 1, 2, 3, and 4 with your workload, that will be very helpful! Both allocation latency and total free time (if possible) are valuable. -- Best Regards, Huang, Ying
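With the proposed sysctl applied, that sweep does not require rebuilding the
kernel for each CONFIG value; a rough sketch of how such a run could be
driven (run_workload.sh is a placeholder for the actual test, and latency is
still collected with "funclatency -T -i 600 rmqueue_bulk" as described in
the patch):

    # Sketch: iterate over the candidate scale factors and run the workload
    # at each setting. Assumes the sysctl patch is applied and root access.
    import subprocess
    import time

    KNOB = "/proc/sys/vm/pcp_batch_scale_max"

    for value in (0, 1, 2, 3, 4, 5):
        with open(KNOB, "w") as f:   # takes effect immediately, no reboot
            f.write(str(value))
        print(f"testing pcp_batch_scale_max={value}")
        subprocess.run(["./run_workload.sh"], check=True)  # placeholder
        time.sleep(5)                # let the system settle between runs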
On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> [snip] > >> >> > >> >> > > >> >> > Why introduce a systl knob? > >> >> > =========================== > >> >> > > >> >> > From the above data, it's clear that different CPU types have varying > >> >> > allocation latencies concerning zone->lock contention. Typically, people > >> >> > don't release individual kernel packages for each type of x86_64 CPU. > >> >> > > >> >> > Furthermore, for latency-insensitive applications, we can keep the default > >> >> > setting for better throughput. > >> >> > >> >> Do you have any data to prove that the default setting is better for > >> >> throughput? If so, that will be a strong support for your patch. > >> > > >> > No, I don't. The primary reason we can't change the default value from > >> > 5 to 0 across our fleet of servers is that you initially set it to 5. > >> > The sysadmins believe you had a strong reason for setting it to 5 by > >> > default; otherwise, it would be considered careless for the upstream > >> > kernel. I also believe you must have had a solid justification for > >> > setting the default value to 5; otherwise, why would you have > >> > submitted your patches? > >> > >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > >> avoid too long latency"), I tried my best to run test on the machines > >> available with a micro-benchmark (will-it-scale/page_fault1) which > >> exercises kernel page allocator heavily. From the data in commit, > >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > >> much. The 99% alloc/free latency can be kept within about 100us with > >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > >> > >> But, we can always improve the default value with more data, on more > >> types of machines and with more types of benchmarks, etc. > >> > >> Your data suggest smaller default value because you have data to show > >> that larger default value has the latency spike issue (as large as tens > >> ms) for some practical workloads. Which weren't tested previously. In > >> contrast, we don't have strong data to show the throughput advantages of > >> larger CONFIG_PCP_BATCH_SCALE_MAX value. > >> > >> So, I suggest to use a smaller default value for > >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > >> choice. > > > > Which smaller default value would be better? > > This depends on further test results. I believe you agree with me that you can't test all workloads. > > > How can we ensure that other workloads, which we haven't tested, will > > work well with this new default value? > > We cannot. We can only depends on the data available. If there are > new data available in the future, we can make the change accordingly. So, your solution is to change the hardcoded value for untested workloads and then release the kernel package again? > > > If you have a better default value in mind, would you consider sending > > a patch for it? I would be happy to test it with my test case. 
> > If you can test the value 1, 2, 3, and 4 with your workload, that will > be very helpful! Both allocation latency and total free time (if > possible) are valuable. You know I can't verify it with all workloads, right? You have so much data to verify, which indicates uncertainty about any default value. Why not make it tunable and let the user choose the value they prefer?
Yafang Shao <laoar.shao@gmail.com> writes: > On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: >> >> >> >> >> >> [snip] >> >> >> >> >> >> > >> >> >> > Why introduce a systl knob? >> >> >> > =========================== >> >> >> > >> >> >> > From the above data, it's clear that different CPU types have varying >> >> >> > allocation latencies concerning zone->lock contention. Typically, people >> >> >> > don't release individual kernel packages for each type of x86_64 CPU. >> >> >> > >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default >> >> >> > setting for better throughput. >> >> >> >> >> >> Do you have any data to prove that the default setting is better for >> >> >> throughput? If so, that will be a strong support for your patch. >> >> > >> >> > No, I don't. The primary reason we can't change the default value from >> >> > 5 to 0 across our fleet of servers is that you initially set it to 5. >> >> > The sysadmins believe you had a strong reason for setting it to 5 by >> >> > default; otherwise, it would be considered careless for the upstream >> >> > kernel. I also believe you must have had a solid justification for >> >> > setting the default value to 5; otherwise, why would you have >> >> > submitted your patches? >> >> >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to >> >> avoid too long latency"), I tried my best to run test on the machines >> >> available with a micro-benchmark (will-it-scale/page_fault1) which >> >> exercises kernel page allocator heavily. From the data in commit, >> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not >> >> much. The 99% alloc/free latency can be kept within about 100us with >> >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. >> >> >> >> But, we can always improve the default value with more data, on more >> >> types of machines and with more types of benchmarks, etc. >> >> >> >> Your data suggest smaller default value because you have data to show >> >> that larger default value has the latency spike issue (as large as tens >> >> ms) for some practical workloads. Which weren't tested previously. In >> >> contrast, we don't have strong data to show the throughput advantages of >> >> larger CONFIG_PCP_BATCH_SCALE_MAX value. >> >> >> >> So, I suggest to use a smaller default value for >> >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the >> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best >> >> choice. >> > >> > Which smaller default value would be better? >> >> This depends on further test results. > > I believe you agree with me that you can't test all workloads. > >> >> > How can we ensure that other workloads, which we haven't tested, will >> > work well with this new default value? >> >> We cannot. We can only depends on the data available. If there are >> new data available in the future, we can make the change accordingly. > > So, your solution is to change the hardcoded value for untested > workloads and then release the kernel package again? > >> >> > If you have a better default value in mind, would you consider sending >> > a patch for it? 
I would be happy to test it with my test case. >> >> If you can test the value 1, 2, 3, and 4 with your workload, that will >> be very helpful! Both allocation latency and total free time (if >> possible) are valuable. > > You know I can't verify it with all workloads, right? > You have so much data to verify, which indicates uncertainty about any > default value. Why not make it tunable and let the user choose the > value they prefer? We only make decision based on data available. In theory, we cannot test all workloads, because there will be new workloads in the future. If we have data to show that smaller value will cause performance regressions for some reasonable workloads, we can make it user tunable. -- Best Regards, Huang, Ying
On Mon, Aug 5, 2024 at 1:04 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yafang Shao <laoar.shao@gmail.com> writes: > > > On Mon, Aug 5, 2024 at 12:36 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yafang Shao <laoar.shao@gmail.com> writes: > >> > >> > On Mon, Aug 5, 2024 at 11:05 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> > >> >> > On Mon, Aug 5, 2024 at 9:41 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Yafang Shao <laoar.shao@gmail.com> writes: > >> >> >> > >> >> >> [snip] > >> >> >> > >> >> >> > > >> >> >> > Why introduce a systl knob? > >> >> >> > =========================== > >> >> >> > > >> >> >> > From the above data, it's clear that different CPU types have varying > >> >> >> > allocation latencies concerning zone->lock contention. Typically, people > >> >> >> > don't release individual kernel packages for each type of x86_64 CPU. > >> >> >> > > >> >> >> > Furthermore, for latency-insensitive applications, we can keep the default > >> >> >> > setting for better throughput. > >> >> >> > >> >> >> Do you have any data to prove that the default setting is better for > >> >> >> throughput? If so, that will be a strong support for your patch. > >> >> > > >> >> > No, I don't. The primary reason we can't change the default value from > >> >> > 5 to 0 across our fleet of servers is that you initially set it to 5. > >> >> > The sysadmins believe you had a strong reason for setting it to 5 by > >> >> > default; otherwise, it would be considered careless for the upstream > >> >> > kernel. I also believe you must have had a solid justification for > >> >> > setting the default value to 5; otherwise, why would you have > >> >> > submitted your patches? > >> >> > >> >> In commit 52166607ecc9 ("mm: restrict the pcp batch scale factor to > >> >> avoid too long latency"), I tried my best to run test on the machines > >> >> available with a micro-benchmark (will-it-scale/page_fault1) which > >> >> exercises kernel page allocator heavily. From the data in commit, > >> >> larger CONFIG_PCP_BATCH_SCALE_MAX helps throughput a little, but not > >> >> much. The 99% alloc/free latency can be kept within about 100us with > >> >> CONFIG_PCP_BATCH_SCALE_MAX == 5. So, we chose 5 as default value. > >> >> > >> >> But, we can always improve the default value with more data, on more > >> >> types of machines and with more types of benchmarks, etc. > >> >> > >> >> Your data suggest smaller default value because you have data to show > >> >> that larger default value has the latency spike issue (as large as tens > >> >> ms) for some practical workloads. Which weren't tested previously. In > >> >> contrast, we don't have strong data to show the throughput advantages of > >> >> larger CONFIG_PCP_BATCH_SCALE_MAX value. > >> >> > >> >> So, I suggest to use a smaller default value for > >> >> CONFIG_PCP_BATCH_SCALE_MAX. But, we may need more test to check the > >> >> data for 1, 2, 3, and 4, in addtion to 0 and 5 to determine the best > >> >> choice. > >> > > >> > Which smaller default value would be better? > >> > >> This depends on further test results. > > > > I believe you agree with me that you can't test all workloads. > > > >> > >> > How can we ensure that other workloads, which we haven't tested, will > >> > work well with this new default value? > >> > >> We cannot. We can only depends on the data available. If there are > >> new data available in the future, we can make the change accordingly. 
> > > > So, your solution is to change the hardcoded value for untested > > workloads and then release the kernel package again? > > > >> > >> > If you have a better default value in mind, would you consider sending > >> > a patch for it? I would be happy to test it with my test case. > >> > >> If you can test the value 1, 2, 3, and 4 with your workload, that will > >> be very helpful! Both allocation latency and total free time (if > >> possible) are valuable. > > > > You know I can't verify it with all workloads, right? > > You have so much data to verify, which indicates uncertainty about any > > default value. Why not make it tunable and let the user choose the > > value they prefer? > > We only make decision based on data available. In theory, we cannot > test all workloads, because there will be new workloads in the future. > If we have data to show that smaller value will cause performance > regressions for some reasonable workloads, we can make it user tunable. The issue arises when a new workload is discovered; you have to release a new kernel package for it. If that's your expectation, why not make it tunable from the start? Had you made it tunable in your original commit, we wouldn't be having this non-intuitive discussion repeatedly. Which came first, the chicken or the egg?
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index f48eaa98d22d..4971289dfb79 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -66,6 +66,7 @@ Currently, these files are in /proc/sys/vm: - page-cluster - page_lock_unfairness - panic_on_oom +- pcp_batch_scale_max - percpu_pagelist_high_fraction - stat_interval - stat_refresh @@ -883,6 +884,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens. You can get snapshot. +pcp_batch_scale_max +=================== + +In page allocator, PCP (Per-CPU pageset) is refilled and drained in +batches. The batch number is scaled automatically to improve page +allocation/free throughput. But too large scale factor may hurt +latency. This option sets the upper limit of scale factor to limit +the maximum latency. + +The range for this parameter spans from 0 to 6, with a default value of 5. +The value assigned to 'N' signifies that during each refilling or draining +process, a maximum of (batch << N) pages will be involved, where "batch" +represents the default batch size automatically computed by the kernel for +each zone. + + percpu_pagelist_high_fraction ============================= diff --git a/mm/Kconfig b/mm/Kconfig index 7b716ac80272..14f64b4f744a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -690,17 +690,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE config CONTIG_ALLOC def_bool (MEMORY_ISOLATION && COMPACTION) || CMA -config PCP_BATCH_SCALE_MAX - int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free" - default 5 - range 0 6 - help - In page allocator, PCP (Per-CPU pageset) is refilled and drained in - batches. The batch number is scaled automatically to improve page - allocation/free throughput. But too large scale factor may hurt - latency. This option sets the upper limit of scale factor to limit - the maximum latency. - config PHYS_ADDR_T_64BIT def_bool 64BIT diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 5a842cc13314..bf0c94a0b659 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -273,6 +273,8 @@ int min_free_kbytes = 1024; int user_min_free_kbytes = -1; static int watermark_boost_factor __read_mostly = 15000; static int watermark_scale_factor = 10; +static int pcp_batch_scale_max = 5; +static int sysctl_6 = 6; /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */ int movable_zone; @@ -2391,7 +2393,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone) count = pcp->count; if (count) { int to_drain = min(count, - pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); + pcp->batch << pcp_batch_scale_max); free_pcppages_bulk(zone, to_drain, pcp, 0); count -= to_drain; @@ -2519,7 +2521,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free /* Free as much as possible if batch freeing high-order pages. 
*/ if (unlikely(free_high)) - return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX); + return min(pcp->count, batch << pcp_batch_scale_max); /* Check for PCP disabled or boot pageset */ if (unlikely(high < batch)) @@ -2551,7 +2553,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return 0; if (unlikely(free_high)) { - pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX), + pcp->high = max(high - (batch << pcp_batch_scale_max), high_min); return 0; } @@ -2621,9 +2623,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } - if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) + if (pcp->free_count < (batch << pcp_batch_scale_max)) pcp->free_count = min(pcp->free_count + (1 << order), - batch << CONFIG_PCP_BATCH_SCALE_MAX); + batch << pcp_batch_scale_max); high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), @@ -2964,7 +2966,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) * subsequent allocation of order-0 pages without any freeing. */ if (batch <= max_nr_alloc && - pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX) + pcp->alloc_factor < pcp_batch_scale_max) pcp->alloc_factor++; batch = min(batch, max_nr_alloc); } @@ -6341,6 +6343,15 @@ static struct ctl_table page_alloc_sysctl_table[] = { .proc_handler = percpu_pagelist_high_fraction_sysctl_handler, .extra1 = SYSCTL_ZERO, }, + { + .procname = "pcp_batch_scale_max", + .data = &pcp_batch_scale_max, + .maxlen = sizeof(pcp_batch_scale_max), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &sysctl_6, + }, { .procname = "lowmem_reserve_ratio", .data = &sysctl_lowmem_reserve_ratio,
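Assuming the patch above is applied, the knob can be read and changed at run
time without a reboot. A minimal sketch (writes require root; values outside
0..6 are rejected by proc_dointvec_minmax, so such a write fails with
EINVAL):

    # Sketch: inspect and adjust the proposed knob through procfs.
    KNOB = "/proc/sys/vm/pcp_batch_scale_max"

    with open(KNOB) as f:
        print("current value:", f.read().strip())   # 5 by default

    with open(KNOB, "w") as f:
        f.write("0")                                 # favour low latency

The same can be done with "sysctl -w vm.pcp_batch_scale_max=0", or made
persistent with a drop-in file under /etc/sysctl.d/.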
A larger page allocation/freeing batch number may cause longer run time of
the code holding zone->lock. If zone->lock is heavily contended at the same
time, latency spikes may occur even for casual page allocation/freeing.
Although reducing the batch number cannot make zone->lock contention any
lighter, it does reduce the latency spikes effectively.

To demonstrate this, I wrote a Python script:

    import mmap

    size = 6 * 1024**3
    while True:
        mm = mmap.mmap(-1, size)
        mm[:] = b'\xff' * size
        mm.close()

Run this script 10 times in parallel and measure the allocation latency by
measuring the duration of rmqueue_bulk() with the BCC tool funclatency[0]:

    funclatency -T -i 600 rmqueue_bulk

Here are the results for both AMD and Intel CPUs.

AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 12       |                                        |
      1024 -> 2047       : 9116     |                                        |
      2048 -> 4095       : 2004     |                                        |
      4096 -> 8191       : 2497     |                                        |
      8192 -> 16383      : 2127     |                                        |
     16384 -> 32767      : 2483     |                                        |
     32768 -> 65535      : 10102    |                                        |
     65536 -> 131071     : 212730   |*******************                     |
    131072 -> 262143     : 314692   |*****************************           |
    262144 -> 524287     : 430058   |****************************************|
    524288 -> 1048575    : 224032   |********************                    |
   1048576 -> 2097151    : 73567    |******                                  |
   2097152 -> 4194303    : 17079    |*                                       |
   4194304 -> 8388607    : 3900     |                                        |
   8388608 -> 16777215   : 750      |                                        |
  16777216 -> 33554431   : 88       |                                        |
  33554432 -> 67108863   : 2        |                                        |

  avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242

The average allocation latency can be as high as 449us, and the maximum
latency can exceed 30ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 92       |                                        |
      1024 -> 2047       : 8594     |                                        |
      2048 -> 4095       : 2042818  |******                                  |
      4096 -> 8191       : 8737624  |**************************              |
      8192 -> 16383      : 13147872 |****************************************|
     16384 -> 32767      : 8799951  |**************************              |
     32768 -> 65535      : 2879715  |********                                |
     65536 -> 131071     : 659600   |**                                      |
    131072 -> 262143     : 204004   |                                        |
    262144 -> 524287     : 78246    |                                        |
    524288 -> 1048575    : 30800    |                                        |
   1048576 -> 2097151    : 12251    |                                        |
   2097152 -> 4194303    : 2950     |                                        |
   4194304 -> 8388607    : 78       |                                        |

  avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636

The average latency is reduced significantly to 19us, and the maximum
latency is reduced to less than 8ms.

- Conclusion

On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.
However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.
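For reference, the "run this script 10 times in parallel" step can also be
driven from a small wrapper; a sketch (the worker count matches the test
above, while the 600-second window is an arbitrary choice to line up with
the funclatency interval):

    # Sketch: launch 10 copies of the mmap stress loop for a fixed window,
    # then stop them. Measure with funclatency while the workers run.
    import mmap
    import multiprocessing
    import time

    def stress(size=6 * 1024**3):
        while True:
            mm = mmap.mmap(-1, size)
            mm[:] = b'\xff' * size
            mm.close()

    if __name__ == "__main__":
        workers = [multiprocessing.Process(target=stress, daemon=True)
                   for _ in range(10)]
        for w in workers:
            w.start()
        time.sleep(600)
        for w in workers:
            w.terminate()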
Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 2419     |                                        |
      1024 -> 2047       : 34499    |*                                       |
      2048 -> 4095       : 4272     |                                        |
      4096 -> 8191       : 9035     |                                        |
      8192 -> 16383      : 4374     |                                        |
     16384 -> 32767      : 2963     |                                        |
     32768 -> 65535      : 6407     |                                        |
     65536 -> 131071     : 884806   |****************************************|
    131072 -> 262143     : 145931   |******                                  |
    262144 -> 524287     : 13406    |                                        |
    524288 -> 1048575    : 1874     |                                        |
   1048576 -> 2097151    : 249      |                                        |
   2097152 -> 4194303    : 28       |                                        |

  avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263

- Conclusion

This Intel CPU works fine with the default setting.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================

Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.

- Default value of 5

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 46       |                                        |
       512 -> 1023       : 695      |                                        |
      1024 -> 2047       : 19950    |*                                       |
      2048 -> 4095       : 1788     |                                        |
      4096 -> 8191       : 3392     |                                        |
      8192 -> 16383      : 2569     |                                        |
     16384 -> 32767      : 2619     |                                        |
     32768 -> 65535      : 3809     |                                        |
     65536 -> 131071     : 616182   |****************************************|
    131072 -> 262143     : 295587   |*******************                     |
    262144 -> 524287     : 75357    |****                                    |
    524288 -> 1048575    : 15471    |*                                       |
   1048576 -> 2097151    : 2939     |                                        |
   2097152 -> 4194303    : 243      |                                        |
   4194304 -> 8388607    : 3        |                                        |

  avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651

The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum latency
exceeding 4ms.

- Value set to 0

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 24       |                                        |
       512 -> 1023       : 2686     |                                        |
      1024 -> 2047       : 10246    |                                        |
      2048 -> 4095       : 4061529  |*********                               |
      4096 -> 8191       : 16894971 |****************************************|
      8192 -> 16383      : 6279310  |**************                          |
     16384 -> 32767      : 1658240  |***                                     |
     32768 -> 65535      : 445760   |*                                       |
     65536 -> 131071     : 110817   |                                        |
    131072 -> 262143     : 20279    |                                        |
    262144 -> 524287     : 4176     |                                        |
    524288 -> 1048575    : 436      |                                        |
   1048576 -> 2097151    : 8        |                                        |
   2097152 -> 4194303    : 2        |                                        |

  avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508

After setting it to 0, the average latency is reduced to around 8us, and the
maximum latency is less than 4ms.

- Conclusion

On this Intel CPU, the tuning doesn't help much: latency-sensitive
applications already work well with the default setting.

It is worth noting that all of the above data were collected with the
upstream kernel.

Why introduce a sysctl knob?
============================

From the above data, it's clear that different CPU types have varying
allocation latencies concerning zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.

Furthermore, for latency-insensitive applications, we can keep the default
setting for better throughput. In our production environment, we set this
value to 0 for applications running on Kubernetes servers while keeping it
at the default value of 5 for other applications like big data. It's not
common to release individual kernel packages for each application.
Future work
===========

To ultimately mitigate the zone->lock contention issue, several approaches
have been suggested. One is to divide large zones into multiple smaller
zones, as proposed by Matthew[1]; another is to split zone->lock using a
mechanism similar to memory arenas and to move away from relying solely on
zone_id to identify the range of free lists a particular page belongs to, as
suggested by Mel[2]. However, implementing either solution is likely to
require a more extended development effort.

Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [0]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [1]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [2]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Matthew Wilcox <willy@infradead.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
 mm/Kconfig                              | 11 -----------
 mm/page_alloc.c                         | 23 +++++++++++++++++------
 3 files changed, 34 insertions(+), 17 deletions(-)