From patchwork Wed Mar 29 05:31:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 13191931 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7FF9AC74A5B for ; Wed, 29 Mar 2023 06:22:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 146A76B0078; Wed, 29 Mar 2023 02:22:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0D0EC6B007B; Wed, 29 Mar 2023 02:22:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E1638900002; Wed, 29 Mar 2023 02:22:18 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id D0BB86B0078 for ; Wed, 29 Mar 2023 02:22:18 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 9FCA2C0A8B for ; Wed, 29 Mar 2023 06:22:18 +0000 (UTC) X-FDA: 80620941156.25.C63646E Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) by imf22.hostedemail.com (Postfix) with ESMTP id ECEEEC0003 for ; Wed, 29 Mar 2023 06:22:16 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=GyXZwQdz; spf=none (imf22.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none) ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1680070937; a=rsa-sha256; cv=none; b=RkuR5cE1RB05JXY3o6KzfENsCT7SY9VF4fVwfD8nggXVV5tSmrU4IszoCV5YPH9atuDvuJ hZz9lNdW8aiqO01aPa3E0x0M1+2PIs6WJYj/AbQ1Ds7cf+Mryts90al2dUqRMRPCDk1gL3 fyVMbpIrKVSxulZKr9dcz8usHK9fjUQ= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=GyXZwQdz; spf=none (imf22.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none) ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1680070937; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RbshZVKDekPDOboeHnEAcgErH9krvuwQN7TysZlnIZw=; b=zAwO9GROvFy+Z8jMEml+lXfdxto5bM5kqtvL5kmtxxzSkyTzkW9vOjn2qBIQiP5Ztgu9m6 kXBzX5WvP4K/jO9EKDeP9sMcNb3iwxonctd99AgIQrVLGqZFAcoVaRtq4csYej25PGmdrf pCzvjNjSqIQc3nRwqI9j6rRypwuCbF8= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=RbshZVKDekPDOboeHnEAcgErH9krvuwQN7TysZlnIZw=; b=GyXZwQdzvgM7yhc8r+C5Vhai/9 IOpEs0a2tDWNEdfprqiEIcZ2E8v1vexNsaoG9ZN2e1ZBToe1wdV9OPP21g4oLpqbMgEuo3fyX/8r2 z8ZG617RPED8eMv9RVJgKv6yxZsjZikUTsRsoepiIWpblLpufupifKeeB2RZS9hCVfFLS5Qi/zaPs wbXqYiFQNVQ9NpGD3zNYkWbEk7J5CwMo3VeJ2PMxGIbDq6flmPN4cv/e7EzAP13A/ztHYle/wUwUP pdzO68cb4rR6+2D/EtIoNOIRkfU1dIvlJhCsfV4cOL31V/agcdygbpFFjQy9YclM9/4eqvm8S4zQH /GKQHogQ==; Received: from mcgrof by bombadil.infradead.org with local (Exim 4.96 #2 (Red Hat Linux)) id 1phOPW-00GgRY-2D; Wed, 29 Mar 2023 05:31:50 +0000 From: Luis Chamberlain To: david@redhat.com, patches@lists.linux.dev, linux-modules@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com, torvalds@linux-foundation.org, gregkh@linuxfoundation.org, rafael@kernel.org Cc: christophe.leroy@csgroup.eu, tglx@linutronix.de, peterz@infradead.org, song@kernel.org, rppt@kernel.org, willy@infradead.org, vbabka@suse.cz, mhocko@suse.com, dave.hansen@linux.intel.com, mcgrof@kernel.org Subject: [PATCH 3/7] module: avoid allocation if module is already present and ready Date: Tue, 28 Mar 2023 22:31:45 -0700 Message-Id: <20230329053149.3976378-4-mcgrof@kernel.org> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20230329053149.3976378-1-mcgrof@kernel.org> References: <20230329053149.3976378-1-mcgrof@kernel.org> MIME-Version: 1.0 X-Rspam-User: X-Rspamd-Queue-Id: ECEEEC0003 X-Rspamd-Server: rspam01 X-Stat-Signature: ajmjo6nyjbhay57o7jospgq5dxaffgj1 X-HE-Tag: 1680070936-227067 X-HE-Meta: U2FsdGVkX1+IvidKrmgF5EDZuZVFKyMuBKz/OBLaa6p4Ty2NXk7q/t/AdqNVG9spJAQDqhHWKNDxlYk32t1CVEgiTGBTVAvF6HYC36qCvAVCGsHVpl7b6UmNjN0i4tdpPTm+L3lM3Jp8/NkmyWHe8S3ZcaLRzqIzWwyrss5L+WNlaD3eGNiZuot4ovsLOGn1CdnMXFq3mQWHugmqSIcy9Voa814S+Amm2gRUNPb7iuQy76cTdL9b4VDiDDW2WaNXx44TKap4nnaFN8jgx4kgap0/gqgoeXCmi498WqBCX3S34B7RahWxVSeqVps0Pi7PDVGM5cbjLk1dqINwRW5ORsUQxdqT7D2c6NF5UNipb+aWpaoTV/OuEfLIJJ9YgcLEVyccrxJnLOHn3YapP8oFLzEG5dkSB5ZUxNNrQ+1aNjjUDvztibPVFYCSVstiCy2pS6/Z2/edJmNbpKI5E3LIwXu7jBj4bdlSo0If/Wf/JFxnF4gY3EyOHSyn7/TOidChpz22yCBJhqgIFWMkbNrm0pQlO3jxHmYA3+Px7SlMZQ/dg1Y9uHHUbCbkGhGmYzd7Zk3IX0kDZv4l35LBZRGVn+e+AICQr42wrfMxGYN9Zz3flAbRrV1Chj9XNGQQWTZEgVLlZ0T2rxyCM2+74buGdSUeAKVrlz4ae4XpjXZ5gOhCWCwmyKwYa2Fe4yI2EhJidZEUlpAl6Nndt8AbMyhpEWd3avze9spshmyW68jOJtxjGvG2Oi7bm9jqV9W1maQYqrj5MJPsVRll53iaSRUNbw4+mQ+Nd5eJAyKUS4IIDI6ALFfTloKyfLpk7FU1+w4A7SUAtt0DTS0b+8QPWgKAycfCeWkVfMW05ntgWLHZeeeXG90lA/vv6WDvMdWK2HI3+u8M7l6TeN65ZFhl51zus2DfbgNgo4LZCBlSb0K4G4ugGnyWrVS+Zx/3EtmkgusWQtZ4r+sVo4CFxpZMOlt r9AeNBo6 aDpmqeV/K7HFbWvYjs0jWvW7hDYHjerSzcHU4kJFUnD54Udvi3geNP6kfSucNdJrUTQ03x9LyH1CstuP5Zk092u6nN/mRKlHvmiM63g4eaH3qOXVcu9JTbFbWRsHY0WOC5lAgOu4EvD5D0Gylcl4lC1NOI+CHEZVkEq1W2Y58HMFEnh08SsBGtNfZ+F6L3xuM01idGhN12salwUv2Q1uIRZFXrF91G2HM8ndzi6HDTWjGGBk3XiGbnNzeZQYTgM2g3DlEygbSOrcmi7DmlJTwBFK4k4ZDhoY9QTkkgn0tA4BvFYa1hFnlIE3zL7TkW0PX/pGWbOtgY0iXs7NCVxZ0zX8LEmaDabD/nhUtKdxLRjBRBvCc1AUX6IFai4OycQn+tFXz+/jzA57GWvd4YLd/SxzTN5psGhNUpKM80sgucX6AlTZW5H8FpUxxag== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: load_module() will allocate a struct module before even checking if the module is already loaded. This can create unecessary memory pressure since we can easily just check if the module is already present early with the copy of the module information from userspace after we've validated it a bit. This can only be an issue if a system is getting hammered with userspace loading modules. Note that there are two ways to load modules, one is kernel moduile auto-loading (request_module() calls in-kernel) and the other is modprobe calls from userspace. The auto-loading is in-kernel, that pings back to userspace to just call modprobe. We already have a way to restrict the amount of concurrent kernel auto-loads in a given time, however that does not stop a system from issuing tons of system calls to load a module and for the races to exist. Userspace itself *is* supposed to check if a module is present before loading it. But we're observing situations where tons of the same module are in effect being loaded. Although some of these are acknolwedged as in-kernel bugs such as the ACPI frequency modules, issues for which we already have fixes merged or are working towards, but we can also help a bit more in the modules side to avoid those dramatic situations. All that is just memory being allocated to then be thrown away. To avoid memory pressure for such stupid cases put a stop gap for them. We now check for the module being present *before* allocation, and then right after we are going to add it to the system. On a 8vcpu 8 GiB RAM system using kdevops and testing against selftests kmod.sh -t 0008 I see a saving in the *highest* side of memory consumption of up to ~ 84 MiB with the Linux kernel selftests kmod test 0008. With the new stress-ng module test I see a 145 MiB difference in max memory consumption with 100 ops. The stress-ng module ops tests can be pretty pathalogical -- it is not realistic, however it was used to finally successfully reproduce issues which are only reported to happen on system with over 400 CPUs [0] by just usign 100 ops on a 8vcpu 8 GiB RAM system. This can be observed and visualized below. The time it takes to run the test is also not affected. The kmod tests 0008: The gnuplot is set to a range from 400000 KiB (390 Mib) - 580000 (566 Mib) given the tests peak around that range. cat kmod.plot set term dumb set output fileout set yrange [400000:580000] plot filein with linespoints title "Memory usage (KiB)" Before: root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008 root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C root@kmod ~ # sort -n -r log-0008-before.txt | head -1 528732 So ~516.33 MiB After: root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008 root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C root@kmod ~ # sort -n -r log-0008-after.txt | head -1 442516 So ~432.14 MiB That's about 84 ~MiB in savings in the worst case. The graphs: root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot root@kmod ~ # gnuplot -e "filein='log-0008-after.txt'; fileout='graph-0008-after.txt'" kmod.plot root@kmod ~ # cat graph-0008-before.txt 580000 +-----------------------------------------------------------------+ | + + + + + + + | 560000 |-+ Memory usage (KiB) ***A***-| | | 540000 |-+ +-| | | | *A *AA*AA*A*AA *A*AA A*A*A *AA*A*AA*A A | 520000 |-+A*A*AA *AA*A *A*AA*A*AA *A*A A *A+-| |*A | 500000 |-+ +-| | | 480000 |-+ +-| | | 460000 |-+ +-| | | | | 440000 |-+ +-| | | 420000 |-+ +-| | + + + + + + + | 400000 +-----------------------------------------------------------------+ 0 5 10 15 20 25 30 35 40 root@kmod ~ # cat graph-0008-after.txt 580000 +-----------------------------------------------------------------+ | + + + + + + + | 560000 |-+ Memory usage (KiB) ***A***-| | | 540000 |-+ +-| | | | | 520000 |-+ +-| | | 500000 |-+ +-| | | 480000 |-+ +-| | | 460000 |-+ +-| | | | *A *A*A | 440000 |-+A*A*AA*A A A*A*AA A*A*AA*A*AA*A*AA*A*AA*AA*A*AA*A*AA-| |*A *A*AA*A | 420000 |-+ +-| | + + + + + + + | 400000 +-----------------------------------------------------------------+ 0 5 10 15 20 25 30 35 40 The stress-ng module tests: This is used to run the test to try to reproduce the vmap issues reported by David: echo 0 > /proc/sys/vm/oom_dump_tasks ./stress-ng --module 100 --module-name xfs Prior to this commit: root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > baseline-stress-ng.txt root@kmod ~ # sort -n -r baseline-stress-ng.txt | head -1 5046456 After this commit: root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > after-stress-ng.txt root@kmod ~ # sort -n -r after-stress-ng.txt | head -1 4896972 5046456 - 4896972 149484 149484/1024 145.98046875000000000000 So this commit using stress-ng reveals saving about 145 MiB in memory using 100 ops from stress-ng which reproduced the vmap issue reported. cat kmod.plot set term dumb set output fileout set yrange [4700000:5070000] plot filein with linespoints title "Memory usage (KiB)" root@kmod ~ # gnuplot -e "filein='baseline-stress-ng.txt'; fileout='graph-stress-ng-before.txt'" kmod-simple-stress-ng.plot root@kmod ~ # gnuplot -e "filein='after-stress-ng.txt'; fileout='graph-stress-ng-after.txt'" kmod-simple-stress-ng.plot root@kmod ~ # cat graph-stress-ng-before.txt +---------------------------------------------------------------+ 5.05e+06 |-+ + A + + + + + + +-| | * Memory usage (KiB) ***A*** | | * A | 5e+06 |-+ ** ** +-| | ** * * A | 4.95e+06 |-+ * * A * A* +-| | * * A A * * * * A | | * * * * * * *A * * * A * | 4.9e+06 |-+ * * * A*A * A*AA*A A *A **A **A*A *+-| | A A*A A * A * * A A * A * ** | | * ** ** * * * * * * * | 4.85e+06 |-+ A A A ** * * ** *-| | * * * * ** * | | * A * * * * | 4.8e+06 |-+ * * * A A-| | * * * | 4.75e+06 |-+ * * * +-| | * ** | | * + + + + + + ** + | 4.7e+06 +---------------------------------------------------------------+ 0 5 10 15 20 25 30 35 40 root@kmod ~ # cat graph-stress-ng-after.txt +---------------------------------------------------------------+ 5.05e+06 |-+ + + + + + + + +-| | Memory usage (KiB) ***A*** | | | 5e+06 |-+ +-| | | 4.95e+06 |-+ +-| | | | | 4.9e+06 |-+ *AA +-| | A*AA*A*A A A*AA*AA*A*AA*A A A A*A *AA*A*A A A*AA*AA | | * * ** * * * ** * *** * | 4.85e+06 |-+* *** * * * * *** A * * +-| | * A * * ** * * A * * | | * * * * ** * * | 4.8e+06 |-+* * * A * * * +-| | * * * A * * | 4.75e+06 |-* * * * * +-| | * * * * * | | * + * *+ + + + + * *+ | 4.7e+06 +---------------------------------------------------------------+ 0 5 10 15 20 25 30 35 40 [0] https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com Reported-by: David Hildenbrand Signed-off-by: Luis Chamberlain --- kernel/module/main.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/module/main.c b/kernel/module/main.c index 77c2e7a60f2e..145e15f19576 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2785,7 +2785,11 @@ static int early_mod_check(struct load_info *info, int flags) if (err) return err; - return 0; + mutex_lock(&module_mutex); + err = module_patient_check_exists(info->mod->name); + mutex_unlock(&module_mutex); + + return err; } /* From patchwork Wed Mar 29 05:31:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 13191930 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D71AC6FD18 for ; Wed, 29 Mar 2023 06:22:17 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B0A7A6B0075; Wed, 29 Mar 2023 02:22:16 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id ABA746B0078; Wed, 29 Mar 2023 02:22:16 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 982436B007B; Wed, 29 Mar 2023 02:22:16 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 7A77F6B0075 for ; Wed, 29 Mar 2023 02:22:16 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 330D61C5ECB for ; Wed, 29 Mar 2023 06:22:16 +0000 (UTC) X-FDA: 80620941072.05.7A54D91 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) by imf21.hostedemail.com (Postfix) with ESMTP id 191F41C0004 for ; Wed, 29 Mar 2023 06:22:12 +0000 (UTC) Authentication-Results: imf21.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=0s+ZsQtq; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=none (imf21.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1680070933; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=guyHag9qCqT02ycAP5gFYMiS5Z8MMs8Ps7B6B4e3yQk=; b=1gWUbyr4mqFWeWq/H1PtC0bRlGtgMekR9oxhPKvldrmJY/XB8fsGB0HwyJcyGvm73aGZei zJ+AWJb4QXKi09G7UF4wq/WEZCNm4/yG0DNmcARA0omOLrVv1RTJHZcEAhFYLxHS6j3SQ8 iKRJt+3pMohI0/elWXZfKXC29HPXd0s= ARC-Authentication-Results: i=1; imf21.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=0s+ZsQtq; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=none (imf21.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1680070933; a=rsa-sha256; cv=none; b=YyA0tvGthA6HVAn5ZWvCYwJOR5uM7QOyX/qgEivoBQHkBGT0ZwGF0a7Q0PS40FVt5BiYg6 xQiFqWh5FHjWZwq/wyILstwVLsvd9s5yTz+4H8Uag7aImshuvURfTR0nbX1RCn4xFGgDmj mxs5pZsdE8GIwUC7JbnH+Jk/ZyQ+NaM= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=guyHag9qCqT02ycAP5gFYMiS5Z8MMs8Ps7B6B4e3yQk=; b=0s+ZsQtqvZrSA+dzaZX4im6LhE WyYiRJxXstb/bqMR3zp8lDLo3bPTmJaJEWSTCh06Aez4lV2UP5frUnAfsl7KhQGMydKgbO1lw7zhb 5jDrZkRjvFJfeHZVno0lB+v2bfAxpKG/fGbbO0mlzuPuljHB9FQjYJ/jVkpBdrh4YEiBJCnK3TUmn hiaTcVuownYisCxprFvcXdMGJps9Idj/5unt/K9XrpcpkHwN+cyMrQDW7fOwdPR91ENj/EqR2BYlX ZDIkrXjOcSqgk2+GqwuSjX7aetm38G9LIM3wHa+55pCtzqIALL7J0jkg4T34ZcZ7/7O5zQ66cDSM2 QzeATbHg==; Received: from mcgrof by bombadil.infradead.org with local (Exim 4.96 #2 (Red Hat Linux)) id 1phOPW-00GgRb-2M; Wed, 29 Mar 2023 05:31:50 +0000 From: Luis Chamberlain To: david@redhat.com, patches@lists.linux.dev, linux-modules@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com, torvalds@linux-foundation.org, gregkh@linuxfoundation.org, rafael@kernel.org Cc: christophe.leroy@csgroup.eu, tglx@linutronix.de, peterz@infradead.org, song@kernel.org, rppt@kernel.org, willy@infradead.org, vbabka@suse.cz, mhocko@suse.com, dave.hansen@linux.intel.com, mcgrof@kernel.org Subject: [PATCH 4/7] sempahore: add a helper for a concurrency limiter Date: Tue, 28 Mar 2023 22:31:46 -0700 Message-Id: <20230329053149.3976378-5-mcgrof@kernel.org> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20230329053149.3976378-1-mcgrof@kernel.org> References: <20230329053149.3976378-1-mcgrof@kernel.org> MIME-Version: 1.0 X-Rspamd-Queue-Id: 191F41C0004 X-Rspamd-Server: rspam09 X-Rspam-User: X-Stat-Signature: cmfkyk4monuh3gsjp19ushbwf3opsogj X-HE-Tag: 1680070932-494479 X-HE-Meta: U2FsdGVkX18rCV4PlTE26liU3clAb/hS/rgWzVUgNFClll4Vp7kVHVrFZ9y5aFpeaexEkZkq/4/u+zTqj1ic8xgBCMFZc6BKRpw/zrlavIyDOvL41qyRdyHmq1ewdbxO4Sb63M99zXilTx/ifoQfktGB7QL85r/qseA9x2Yqi0cwFjXISXsccrMbxdKIQV0pL3RQ7RAKORJvWK85trBtMbQQQqTzT1sjrF+xJzJsA9rp0XAV5ALI47lMUfh58NqxBVkjB7UGbcwVyY6izNzQpFv6mlukdlS95We4BJfAXCEtrAJ3YxucPZa5t7scASPqR+jb8srcCY9Of2A9GgsbKQGZYlLR3qj8N4GmpRUoZELkhBLoWBCVcH/BTyNaLa1Hxva+xvct55UEVh1h1r9B93x839e+6vey+lvXsKvXc8Th3wTaC7WVSkLu2Vh231lIRLFw6LeQhvyZTwoD8JzGVQI0nfHXXs/K2LwrA2DTP8iB1YLT5VfC3W38W8BOVM2eQxVxlcxLmGCrlKOIk6Q2aYaHtwyVZkJuGcQq3P02maENepMe15WJmUukey7hnjSB02pEzeWJSSYg8Osv5NfWjLkfyhWe51ToO4XrQgthqy8cq+oB07AtXocaRdJDxiPczkM6ASGzwc5HSq5v/BNf5zyyg+lVBTuT1uvKACA53jFlrgE3BKi7rHgTd53q2y3bLX/GOgvDP+ewapQz/EMKP8bTURfM4dzaZuilZniqNK9HHkhTe9wrKVVSH407gY+hIJPGTaL7sPZMqtj2wk6pWcXvVPOy0fDEClzXWL1+V+cr38G/4ue6+eqd85xJl0E9bpPked7KO5JdjOU88fD/GHyUAgliVl0c2STofIWF9cCIBFi4dlVueprsQJDdIAFk/DIYFKF4HbjyZpcBbSDWDVSxNKJdt8PhLDN7tQBTWC+NUh//8j942YiXwOmvEXXfUH+Lu3FNOW2CyG3XNTc sGIqxU/o nDjO5fG/6te8H6OcDZITszFO+YPR8AkNW9GsCktto5vhtBlVo2fhpRZcfNpis52VycfSqf3xNYtZHxnCr4kzuw8FrKaa0301nOlT+Fm+IKF4NRxYV/TORLngjbZseArFGM9RTTksz3+KokBeySwxypVSzPkRMsDPEZVXRknzYjtzkq1UCeBBV1qVzBWx+4sI5m6ayaq07FoyjAssIlKu19YwBsmBAGrhlrPg/k/1XgvlOOyWTOAIP6SAWlrtyLs8/CQqxMo5pN4kBPlwfkFPmF34iw1Fcqg3/m8E6ii6yx20KQ5ZaV3nOjTvnYVPq3gGoLYlVgiKEod8C6pVQwWDGayi88tn0TvmxCptvGLpRtT43cj9Lc++30zao1fqo7wFQsmx3AkcxYKiIifvjC5KbJhb/ZMRQ+9fNgwyeXrhXNxSQcnAqOZS3iyvReu4GFvn897vp+dfELxhQsVHWqk7Og1ew1Uu6JkFE3XVpX/mOZKL1m35AEP/CTkWqDA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000049, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: While I looked at re-using the old kernel/kmod.c (now kernel/module/kmod.c) concurrency delimiter methodology for another place in the kernel Linus noted that this could be simply replaced with a sempahore [0]. So add that so we we don't re-invent the wheel and make it obvious to use. [0] https://lore.kernel.org/all/CAHk-=whkj6=wyi201JXkw9iT_eTUTsSx+Yb9d4OgmZFjDJA18g@mail.gmail.com/ Suggested-by: Linus Torvalds Signed-off-by: Luis Chamberlain Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Peter Zijlstra (Intel) --- include/linux/semaphore.h | 3 +++ 1 file changed, 3 insertions(+) diff --git a/include/linux/semaphore.h b/include/linux/semaphore.h index 6694d0019a68..2ecdffdb9814 100644 --- a/include/linux/semaphore.h +++ b/include/linux/semaphore.h @@ -28,6 +28,9 @@ struct semaphore { #define DEFINE_SEMAPHORE(name) \ struct semaphore name = __SEMAPHORE_INITIALIZER(name, 1) +#define CONCURRENCY_LIMITER(name, n) \ + struct semaphore name = __SEMAPHORE_INITIALIZER(name, n) + static inline void sema_init(struct semaphore *sem, int val) { static struct lock_class_key __key; From patchwork Wed Mar 29 05:31:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 13191899 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C09FAC74A5B for ; Wed, 29 Mar 2023 05:32:04 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 602706B007E; Wed, 29 Mar 2023 01:31:59 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 5B41D900003; Wed, 29 Mar 2023 01:31:59 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 36B1F900004; Wed, 29 Mar 2023 01:31:59 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 17896900003 for ; Wed, 29 Mar 2023 01:31:59 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id E560C1C5D25 for ; Wed, 29 Mar 2023 05:31:58 +0000 (UTC) X-FDA: 80620814316.09.4F98548 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) by imf06.hostedemail.com (Postfix) with ESMTP id 3D5E418001A for ; Wed, 29 Mar 2023 05:31:57 +0000 (UTC) Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=XhxXDw48; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=none (imf06.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1680067917; h=from:from:sender:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=2bVkSUsIADWswZ5bVimfZdo+Dy3JEjup3DneXTttdjU=; b=goqGVp8cNXSztFQu+DOKnAi68GHN1lwrPXceB3Y4z8YkZoJhobbxOaIyEKGKLnMXcvwKDy WGxpbHRetQDM6jUfvuLPIN004R9JqN5OdCelttgMKXCuwG/JKk4M2DAGyd0fCm/pIUF+kZ 7WYEReOAKJRUi3zpjkImAGYi6HiVZKY= ARC-Authentication-Results: i=1; imf06.hostedemail.com; dkim=pass header.d=infradead.org header.s=bombadil.20210309 header.b=XhxXDw48; dmarc=fail reason="No valid SPF, DKIM not aligned (relaxed)" header.from=kernel.org (policy=none); spf=none (imf06.hostedemail.com: domain of mcgrof@infradead.org has no SPF policy when checking 198.137.202.133) smtp.mailfrom=mcgrof@infradead.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1680067917; a=rsa-sha256; cv=none; b=riq/Z65EHW7PkJnl8qzI7DfBE7wxw77SleYuOWZcc5Yk1txH9eaAZ+4OloxD+i4RpgRICK n9D1+Q1xtAbB+UqwvFfT/5gwlLaYnIgFwa6oKyiCyAQMNaGKvfSe9YCkPtJbnVPBCdHxAG pEZvAr5lZ2O+OgsPHaOEta8pIvT6n9Q= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=2bVkSUsIADWswZ5bVimfZdo+Dy3JEjup3DneXTttdjU=; b=XhxXDw48uuZL/uQqb/68ZIqJmD rmpQ5+Q9WG115GOyc2PP14uow1Fu9Fyl+xWkZjxKZPYGV93xilgKKaBYoUt+OfXEwj8QKkWJYdBAP IlOSqIb2pSUIbY05lI7yLe9mQz5sYUPfe2ZaCkc0mtkmJ5EgcE6Wl/9UhBxe1X1SjI0HtQj4Nsklu G0XJOAjov0Mty6kJx7mKk8qTyGDWjwDWIjOJYGmew3nmK1fFOcBMhqcc3k6rUwweq/cTtXlDD6n0v 29oZ94XSTgSoPHrpY41wLms5xSkIGpjfT/z1uB11c2CAUA2biuHqIOL5jHz7+HStSb9kGiUkN6CoT qrdIB1TQ==; Received: from mcgrof by bombadil.infradead.org with local (Exim 4.96 #2 (Red Hat Linux)) id 1phOPW-00GgRu-2q; Wed, 29 Mar 2023 05:31:50 +0000 From: Luis Chamberlain To: david@redhat.com, patches@lists.linux.dev, linux-modules@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, pmladek@suse.com, petr.pavlu@suse.com, prarit@redhat.com, torvalds@linux-foundation.org, gregkh@linuxfoundation.org, rafael@kernel.org Cc: christophe.leroy@csgroup.eu, tglx@linutronix.de, peterz@infradead.org, song@kernel.org, rppt@kernel.org, willy@infradead.org, vbabka@suse.cz, mhocko@suse.com, dave.hansen@linux.intel.com, mcgrof@kernel.org Subject: [PATCH 7/7] module: add debug stats to help identify memory pressure Date: Tue, 28 Mar 2023 22:31:49 -0700 Message-Id: <20230329053149.3976378-8-mcgrof@kernel.org> X-Mailer: git-send-email 2.38.1 In-Reply-To: <20230329053149.3976378-1-mcgrof@kernel.org> References: <20230329053149.3976378-1-mcgrof@kernel.org> MIME-Version: 1.0 X-Rspamd-Queue-Id: 3D5E418001A X-Rspamd-Server: rspam09 X-Rspam-User: X-Stat-Signature: 1rr9ht8ea5j655yn4yi5dptetyjj1dda X-HE-Tag: 1680067917-981100 X-HE-Meta: U2FsdGVkX1/uzhkwyMJLVc4Mfz42ujLmM9ULlpk8yy9ABpsh2kdmPypDm/sqlPrqGNGOGjcnSJF5fNSON4+uoEC2Au8SFvdGvv/UlRlO8otUN2cn00Qo5g8vRyfnafsJPvNapKUKxv+aMGX0ui3eau0B4/aFWjJ24xQnbSn9VYqnVNujhzJqHr07OzNIcG9zddv6PNkpMW7ogM4mjfajA7xoKyQZQsdelfC0NO7KTFGPQIKrl9EHTsaGXpu7GYT9dfzOFjL+UDbAEZLiSJhAjr9b5m5dL9x37r/IfwpIFS2O6ioJhrxTGVOs3ocNhw8RPAeUoZ4jYhJqHdeUqW0KrnzNBkAMbhCbtEBQPNM5byCAgpPvZzNjuNEbsYZ2Mvjt7W3vNP4SqhYMuqnGO41TsChWdLT+ACDBjqHsKb3I8zdsAZq3dhb71lvrQNI+GDJyR6LafdVqgY+fbwXsGUPp+hoS8sNzasZh19oZ5D4QBjEDYkdC3XoXVyOZ0fyVQjiDIJcFE7pQ7+nr5sUEnF88qqPN/f1R8s3GpDXdrNdDQNwuDmxj+nomgk0z2LHscG/5UMeVSRktR1pqWyI0tLJRVbxahTrJh15qXeuy1RkLnXaLo61Qjesj7RuCdgOVBsjoHf8RbD2KJ2+ZXxeafXL4nxOsQmfTToY0QsNKe6NJRtYxUGaTH1VDRMWMNs2Q+7kWgDwu/0OfI9e+DTain/lMwnvm9d+IPi2wHeyKUzsiz+tD69UDdIpWdBXcqU0kS14NFTQT5jpS/A6gR9ZEaC/MfuO4rptEaO4mBqwDiCSVKSENGAQvk4YuNvIBHGLWbic0QVNY+2ZCCvqm2mCWFgiY8x6+I1Hmqk8/lUCoVj+o+WpqE8cWuCAH0keaJtSwXhc1pRM48gWabBFw/29rSXkR9hTWRh8o/QUHfX+MMtm6SfjAajB88bjpkQX8uj3WyhwdjeP2mkghSY7/tqzfLGf QzE3lelC pEItcowXYPTT3f2mWYp8wULd4/K6yYaBdgTnZ1tmK416xPLz2ROAXci4K3EI9Y9GHapR/AMXYX+yrt2hy2Tby/UgJ+HPWeV7U8dj2Bzp3HU2komsdhUy9N/RKO9R5/F6f+e8nTTslGpwwue9r2Ctp7GUdBnG5hL8LfDbByD4rhz8F4mDc4wwCJVQMlsoMGSffOBoZx2w3xKi7r1ldKC4F4JOCJD1E9Gy2blvuR8VAErUcJYcB7Tr930an2JxLZRt9SO4YndokAx6/vTZUCILIaf8vsgrMdcSzLgcWtPoJQ8NhCuuCmtTAFWWntX/iRKHeifb0kiQODvUuRbwGocweLd90y6IXqgKDb0hmJCRy/Vj5BbV9756LXGA1xyct8LxlDhEltPBWB91PnTICvLqyvlESrf16P+4yGMLwmBVBAt96WGzuh52LOSiUyQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Loading modules with finit_module() can end up using vmalloc(), vmap() and vmalloc() again, for a total of up to 3 separate allocations in the worse case for a single module. We always kernel_read*() the module, that's a vmalloc(). Then vmap() is used for the module decompression, and if so the last read buffer is freed as we use the now decompressed module buffer to stuff data into our copy module. The last one is specific to architectures but pretty much that's generally a series of vmalloc() for different ELF sections... Evaluation with new stress-ng module support [1] with just 100 ops us proving that you can end up using GiBs of data easily even if we are trying to be very careful not to load modules which are already loaded. 100 ops seems to resemble the sort of pressure a system with about 400 CPUs can create on modules. Although those issues for so many concurrent loads per CPU is silly and are being fixed, we lack proper tooling to help diagnose easily what happened, when it happened and what likely are the culprits -- userspace or kernel module autoloading. Provide an initial set of stats for debugfs which let us easily scrape post-boot information about failed loads. This sort of information can be used on production worklaods to try to optimize *avoiding* redundant memory pressure using finit_module(). Screen shot: root@kmod ~ # cat /sys/kernel/debug/modules/stats Modules loaded 67 Total module size 11464704 Total mod text size 4194304 Failed kread bytes 890064 Failed kmod bytes 890064 Invalid kread bytes 890064 Invalid decompress bytes 0 Invalid mod bytes 890064 Average mod size 171115 Average mod text size 62602 Failed modules: kvm_intel kvm irqbypass crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic aesni_intel crypto_simd cryptd evdev serio_raw virtio_pci nvme nvme_core virtio_pci_legacy_dev t10_pi crc64_rocksoft virtio_pci_modern_dev crc32_pclmul virtio crc32c_intel virtio_ring crc64 [0] https://github.com/ColinIanKing/stress-ng.git [1] echo 0 > /proc/sys/vm/oom_dump_tasks ./stress-ng --module 100 --module-name xfs Signed-off-by: Luis Chamberlain --- kernel/module/Kconfig | 32 +++++++ kernel/module/Makefile | 4 + kernel/module/debug.c | 16 ++++ kernel/module/internal.h | 35 +++++++ kernel/module/main.c | 45 ++++++++- kernel/module/stats.c | 200 +++++++++++++++++++++++++++++++++++++++ kernel/module/tracking.c | 7 +- 7 files changed, 331 insertions(+), 8 deletions(-) create mode 100644 kernel/module/debug.c create mode 100644 kernel/module/stats.c diff --git a/kernel/module/Kconfig b/kernel/module/Kconfig index 424b3bc58f3f..fbf7b92cb3d0 100644 --- a/kernel/module/Kconfig +++ b/kernel/module/Kconfig @@ -22,6 +22,12 @@ menuconfig MODULES if MODULES +config MODULE_DEBUG + bool "Module debugging" + default n + help + Allows to help debug module functionality. + config MODULE_FORCE_LOAD bool "Forced module loading" default n @@ -48,6 +54,8 @@ config MODULE_FORCE_UNLOAD rmmod). This is mainly for kernel developers and desperate users. If unsure, say N. +if MODULE_DEBUG + config MODULE_UNLOAD_TAINT_TRACKING bool "Tainted module unload tracking" depends on MODULE_UNLOAD @@ -59,6 +67,30 @@ config MODULE_UNLOAD_TAINT_TRACKING page (see bad_page()), the aforementioned details are also shown. If unsure, say N. +config MODULE_STATS + bool "Module statistics" + depends on DEBUG_FS + default n + help + This option allows you to maintain a record of module statistics. + For example each all modules size, average size, text size, and + failed modules and the size for each of those. For failed + modules we keep track of module which failed due to either the + existing module taking too long to load or that module already + was loaded. + + You should enable this if you are debugging production loads + and want to see if userspace or the kernel is doing stupid things + with loading modules when it shouldn't or if you want to help + optimize userspace / kernel space module autoloading schemes. + You might want to do this because failed modules tend to use + use up significan amount of memory, and so you'd be doing everyone + a favor in avoiding these failure proactively. + + If unsure, say N. + +endif # MODULE_DEBUG + config MODVERSIONS bool "Module versioning support" help diff --git a/kernel/module/Makefile b/kernel/module/Makefile index 5b1d26b53b8d..fe97047f3807 100644 --- a/kernel/module/Makefile +++ b/kernel/module/Makefile @@ -20,4 +20,8 @@ obj-$(CONFIG_PROC_FS) += procfs.o obj-$(CONFIG_SYSFS) += sysfs.o obj-$(CONFIG_KGDB_KDB) += kdb.o obj-$(CONFIG_MODVERSIONS) += version.o + +# Link order matters here, keep debug.o first. +obj-$(CONFIG_MODULE_DEBUG) += debug.o +obj-$(CONFIG_MODULE_STATS) += stats.o obj-$(CONFIG_MODULE_UNLOAD_TAINT_TRACKING) += tracking.o diff --git a/kernel/module/debug.c b/kernel/module/debug.c new file mode 100644 index 000000000000..ef580d70b751 --- /dev/null +++ b/kernel/module/debug.c @@ -0,0 +1,16 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (C) 2023 Luis Chamberlain + */ + +#include +#include "internal.h" + +struct dentry *mod_debugfs_root; + +static int module_debugfs_init(void) +{ + mod_debugfs_root = debugfs_create_dir("modules", NULL); + return 0; +} +module_init(module_debugfs_init); diff --git a/kernel/module/internal.h b/kernel/module/internal.h index 6ae29bb8836f..a645cb3fafc7 100644 --- a/kernel/module/internal.h +++ b/kernel/module/internal.h @@ -143,6 +143,41 @@ static inline bool set_livepatch_module(struct module *mod) #endif } +#ifdef CONFIG_MODULE_STATS + +#define mod_stat_add64(count, var) atomic64_add(count, var) +#define mod_stat_inc(name) atomic_inc(name) + +extern atomic64_t total_mod_size; +extern atomic64_t total_text_size; +extern atomic64_t invalid_kread_bytes; +extern atomic64_t invalid_decompress_bytes; +extern atomic64_t invalid_mod_becoming_bytes; +extern atomic64_t invalid_mod_bytes; + +extern atomic_t modcount; +extern atomic_t failed_kreads; +extern atomic_t failed_decompress; +extern atomic_t failed_load_modules; +struct mod_fail_load { + struct list_head list; + char name[MODULE_NAME_LEN]; + atomic_t count; +}; + +int try_add_failed_module(const char *name); + +#else + +#define mod_stat_inc64(name) +#define mod_stat_inc(name) atomic_inc(name) + +static inline int try_add_failed_module(const char *name) +{ + return 0; +} +#endif /* CONFIG_MODULE_STATS */ + #ifdef CONFIG_MODULE_UNLOAD_TAINT_TRACKING struct mod_unload_taint { struct list_head list; diff --git a/kernel/module/main.c b/kernel/module/main.c index 145e15f19576..8b851042b7f9 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -2495,6 +2495,16 @@ static noinline int do_init_module(struct module *mod) { int ret = 0; struct mod_initfree *freeinit; + unsigned int text_size = 0, total_size = 0; + + for_each_mod_mem_type(type) { + const struct module_memory *mod_mem = &mod->mem[type]; + if (mod_mem->size) { + total_size += mod_mem->size; + if (type == MOD_TEXT || type == MOD_INIT_TEXT) + text_size += mod->mem[type].size; + } + } freeinit = kmalloc(sizeof(*freeinit), GFP_KERNEL); if (!freeinit) { @@ -2556,6 +2566,7 @@ static noinline int do_init_module(struct module *mod) mod->mem[type].base = NULL; mod->mem[type].size = 0; } + #ifdef CONFIG_DEBUG_INFO_BTF_MODULES /* .BTF is not SHF_ALLOC and will get removed, so sanitize pointer */ mod->btf_data = NULL; @@ -2579,6 +2590,11 @@ static noinline int do_init_module(struct module *mod) mutex_unlock(&module_mutex); wake_up_all(&module_wq); + mod_stat_add64(text_size, &total_text_size); + mod_stat_add64(total_size, &total_mod_size); + + mod_stat_inc(&modcount); + return 0; fail_free_freeinit: @@ -2594,6 +2610,7 @@ static noinline int do_init_module(struct module *mod) ftrace_release_mod(mod); free_module(mod); wake_up_all(&module_wq); + return ret; } @@ -2650,6 +2667,9 @@ static int module_patient_check_exists(const char *name) old = find_module_all(name, strlen(name), true); } + if (try_add_failed_module(name)) + pr_warn("Could not add fail-tracking for module: %s\n", name); + /* * We are here only when the same module was being loaded. Do * not try to load it again right now. It prevents long delays @@ -2800,6 +2820,7 @@ static int load_module(struct load_info *info, const char __user *uargs, int flags) { struct module *mod; + bool module_allocated = false; long err = 0; char *after_dashes; @@ -2839,6 +2860,8 @@ static int load_module(struct load_info *info, const char __user *uargs, goto free_copy; } + module_allocated = true; + audit_log_kern_module(mod->name); /* Reserve our place in the list. */ @@ -2983,6 +3006,7 @@ static int load_module(struct load_info *info, const char __user *uargs, synchronize_rcu(); mutex_unlock(&module_mutex); free_module: + mod_stat_add64(info->len, &invalid_mod_bytes); /* Free lock-classes; relies on the preceding sync_rcu() */ for_class_mod_mem_type(type, core_data) { lockdep_free_key_range(mod->mem[type].base, @@ -2991,6 +3015,13 @@ static int load_module(struct load_info *info, const char __user *uargs, module_deallocate(mod, info); free_copy: + /* + * The info->len is always set. We distinguish between + * failures once the proper module was allocated and + * before that. + */ + if (!module_allocated) + mod_stat_add64(info->len, &invalid_mod_becoming_bytes); free_copy(info, flags); return err; } @@ -3009,8 +3040,10 @@ SYSCALL_DEFINE3(init_module, void __user *, umod, umod, len, uargs); err = copy_module_from_user(umod, len, &info); - if (err) + if (err) { + mod_stat_inc(&failed_kreads); return err; + } return load_module(&info, uargs, 0); } @@ -3035,14 +3068,20 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags) len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL, READING_MODULE); - if (len < 0) + if (len < 0) { + mod_stat_inc(&failed_kreads); + mod_stat_add64(len, &invalid_kread_bytes); return len; + } if (flags & MODULE_INIT_COMPRESSED_FILE) { err = module_decompress(&info, buf, len); vfree(buf); /* compressed data is no longer needed */ - if (err) + if (err) { + mod_stat_inc(&failed_decompress); + mod_stat_add64(len, &invalid_decompress_bytes); return err; + } } else { info.hdr = buf; info.len = len; diff --git a/kernel/module/stats.c b/kernel/module/stats.c new file mode 100644 index 000000000000..bbf0ddb8b589 --- /dev/null +++ b/kernel/module/stats.c @@ -0,0 +1,200 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Let's keep keep tabs on a few important module stats, useful + * for debugging production loads and interactions between userspace + * and kernelspace for loading modules. + * + * Copyright (C) 2023 Luis Chamberlain + */ + +#include +#include +#include +#include +#include +#include +#include + +#include "internal.h" + +extern struct dentry *mod_debugfs_root; + +/* + * Tracks modules which failed to be loaded as they were being processed. + * These require modulesed consumed vmalloc space for all finit_module() + * calls as kernel_read*() is used. Then if compression is used vmap() + * is used to allocate space for the decompressed version of what userspace + * has on the filesystem, we vfree() the compressed data which kerne_read*() + * fetched for us. Finally, a final module is allocated as well which we + * use to keep around, and that *can* use vmalloc() too. + * + * In the worst case, when module compression is used then we use the vmap + * space three times. + * + * We really should strive to get this list to be empty. This not being empty + * is a reflection of us needing to do more work to ensure either the kernel + * or usersapce does not do unnecessary calls to load modules which it should + * know are already loaded or on its way to be loaded. + */ +static LIST_HEAD(failed_modules); + +/* Total bytes used by all modules we've dealt with on this system */ +atomic64_t total_mod_size; + +/* Total .text section sizes we've dealt with on this system */ +atomic64_t total_text_size; + +/* Failures happen on the initial kernel_read_*() call. They use vmalloc() */ +atomic64_t invalid_kread_bytes; + +/* Failures happen on the module decompression path, these use use vmap(). */ +atomic64_t invalid_decompress_bytes; + +/* + * The invalid_mod_becoming_bytes only keeps tabs of failures in between kread + * success and right before we allocate the module to process it. These + * can be failures due to: + * + * o module_sig_check() - module signature checks + * o elf_validity_cache_copy() - ELF does not add up + * o early_mod_check(): + * - blacklist + * - failed to rewrite section headers + * - verion magic + * - live patch requirements didn't check out + * - the module was detected as being already present, this + * first check avoids a new vmalloc for the full size of + * the module. + */ +atomic64_t invalid_mod_becoming_bytes; + +/* + * These are failures after we did all the sanity checks of the + * module userspace passed to us. This can still fail if we detect + * the module is loaded, we do this check after we allocate space + * for the module. + */ +atomic64_t invalid_mod_bytes; + +/* How many modules we've loaded in our kernel life time */ +atomic_t modcount; + +/* How many modules failed due to failed kernel_read*() */ +atomic_t failed_kreads; + +/* How many failed decompression attempts we've had */ +atomic_t failed_decompress; + +/* How many modules failed once we've allocated space for our module */ +atomic_t failed_load_modules; + +int try_add_failed_module(const char *name) +{ + struct mod_fail_load *mod_fail; + + list_for_each_entry_rcu(mod_fail, &failed_modules, list, + lockdep_is_held(&module_mutex)) { + if (!strcmp(mod_fail->name, name)) { + atomic_inc(&mod_fail->count); + goto out; + } + } + + mod_fail = kmalloc(sizeof(*mod_fail), GFP_KERNEL); + if (!mod_fail) + return -ENOMEM; + strscpy(mod_fail->name, name, MODULE_NAME_LEN); + atomic_inc(&mod_fail->count); + list_add_rcu(&mod_fail->list, &failed_modules); +out: + return 0; +} + +static ssize_t read_file_mod_stats(struct file *file, char __user *user_buf, + size_t count, loff_t *ppos) +{ + struct mod_fail_load *mod_fail; + unsigned int len; + const unsigned int size = 1024; + char *buf; + u32 live_mod_count, fkreads, floads; + u64 total_size, text_size, ikread_bytes, idecomp_bytes, imod_bytes; + + live_mod_count = atomic_read(&modcount); + fkreads = atomic_read(&failed_kreads); + floads = atomic_read(&failed_load_modules); + + total_size = atomic64_read(&total_mod_size); + text_size = atomic64_read(&total_text_size); + ikread_bytes = atomic64_read(&invalid_mod_bytes); + idecomp_bytes = atomic64_read(&invalid_decompress_bytes); + imod_bytes = atomic64_read(&invalid_mod_bytes); + + buf = kzalloc(size, GFP_KERNEL); + if (buf == NULL) + return -ENOMEM; + + len += scnprintf(buf + len, size - len, "%25s\t%u\n", "Modules loaded", live_mod_count); + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total module size", total_size); + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Total mod text size", text_size); + + /* + * Failed kmod bytes do not contain any failed kreads bytes as those + * failures would happen earlier on kread. Failed kread bytes are wasted + * vmalloc() space allocations and are indicative of invalid modules. + */ + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kread bytes", ikread_bytes); + + /* + * Failed kmod bytes are modules which for whatever reason fail to load + * on the load_module() effort. They are good signs the kernel or userspace + * is doing something stupid or that could be improved. + */ + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Failed kmod bytes", imod_bytes); + + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid kread bytes", ikread_bytes); + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid decompress bytes", idecomp_bytes); + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Invalid mod bytes", imod_bytes); + + if (live_mod_count && total_size) { + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod size", + DIV_ROUND_UP(total_size, live_mod_count)); + } + + if (live_mod_count && text_size) { + len += scnprintf(buf + len, size - len, "%25s\t%llu\n", "Average mod text size", + DIV_ROUND_UP(text_size, live_mod_count)); + } + + if (list_empty(&failed_modules)) + goto out; + + len += scnprintf(buf + len, size - len, "Failed modules:\n"); + list_for_each_entry_rcu(mod_fail, &failed_modules, list) + len += scnprintf(buf + len, size - len, "%25s\n", mod_fail->name); +out: + return simple_read_from_buffer(user_buf, count, ppos, buf, len); +} + +static const struct file_operations fops_mod_stats = { + .read = read_file_mod_stats, + .open = simple_open, + .owner = THIS_MODULE, + .llseek = default_llseek, +}; + +static int __init module_stats_init(void) +{ + debugfs_create_atomic64_t("total_mod_size", 0400, mod_debugfs_root, &total_mod_size); + debugfs_create_atomic64_t("total_text_size", 0400, mod_debugfs_root, &total_text_size); + debugfs_create_atomic64_t("invalid_kread_bytes", 0400, mod_debugfs_root, &invalid_kread_bytes); + debugfs_create_atomic64_t("invalid_decompress_bytes", 0400, mod_debugfs_root, &invalid_decompress_bytes); + debugfs_create_atomic64_t("invalid_mod_bytes", 0400, mod_debugfs_root, &invalid_mod_bytes); + debugfs_create_atomic_t("modcount", 0400, mod_debugfs_root, &modcount); + debugfs_create_atomic_t("failed_kreads", 0400, mod_debugfs_root, &failed_kreads); + debugfs_create_atomic_t("failed_load_modules", 0400, mod_debugfs_root, &failed_load_modules); + debugfs_create_file("stats", 0400, mod_debugfs_root, mod_debugfs_root, &fops_mod_stats); + + return 0; +} +module_init(module_stats_init); diff --git a/kernel/module/tracking.c b/kernel/module/tracking.c index 26d812e07615..cbeb9330db9b 100644 --- a/kernel/module/tracking.c +++ b/kernel/module/tracking.c @@ -15,6 +15,7 @@ #include "internal.h" static LIST_HEAD(unloaded_tainted_modules); +extern struct dentry *mod_debugfs_root int try_add_tainted_module(struct module *mod) { @@ -120,12 +121,8 @@ static const struct file_operations unloaded_tainted_modules_fops = { static int __init unloaded_tainted_modules_init(void) { - struct dentry *dir; - - dir = debugfs_create_dir("modules", NULL); - debugfs_create_file("unloaded_tainted", 0444, dir, NULL, + debugfs_create_file("unloaded_tainted", 0444, mod_debugfs_root, NULL, &unloaded_tainted_modules_fops); - return 0; } module_init(unloaded_tainted_modules_init);