diff mbox series

[v6,3/8] firmware_loader: Move module refcounts to allow unloading

Message ID 20241112232253.3379178-4-dionnaglaze@google.com
State New
Headers show
Series Add SEV firmware hotloading | expand

Commit Message

Dionna Amalie Glaze Nov. 12, 2024, 11:22 p.m. UTC
If a kernel module registers a firmware upload API ops set, then it's
unable to be moved due to effectively a cyclic reference that the module
depends on the upload which depends on the module.

Instead, only require the try_module_get when an upload is requested to
disallow unloading a module only while the upload is in progress.

Fixes: 97730bbb242c ("firmware_loader: Add firmware-upload support")

CC: Sean Christopherson <seanjc@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: Dave Hansen <dave.hansen@linux.intel.com>
CC: Ashish Kalra <ashish.kalra@amd.com>
CC: Tom Lendacky <thomas.lendacky@amd.com>
CC: John Allen <john.allen@amd.com>
CC: Herbert Xu <herbert@gondor.apana.org.au>
CC: "David S. Miller" <davem@davemloft.net>
CC: Michael Roth <michael.roth@amd.com>
CC: Luis Chamberlain <mcgrof@kernel.org>
CC: Russ Weight <russ.weight@linux.dev>
CC: Danilo Krummrich <dakr@redhat.com>
CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
CC: "Rafael J. Wysocki" <rafael@kernel.org>
CC: Tianfei zhang <tianfei.zhang@intel.com>
CC: Alexey Kardashevskiy <aik@amd.com>

Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Russ Weight <russ.weight@linux.dev>
Signed-off-by: Dionna Glaze <dionnaglaze@google.com>
---
 drivers/base/firmware_loader/sysfs_upload.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

Comments

Dan Williams Nov. 13, 2024, 2:40 a.m. UTC | #1
Dionna Glaze wrote:
> If a kernel module registers a firmware upload API ops set, then it's
> unable to be moved due to effectively a cyclic reference that the module
> depends on the upload which depends on the module.
> 
> Instead, only require the try_module_get when an upload is requested to
> disallow unloading a module only while the upload is in progress.

Oh, interesting, I wondered why CXL did not uncover this loop in its
usage only to realize that CXL calls firmware registration from the
cxl_pci module, but the @module paramter passed to
firmware_upload_register() is the cxl_core module. I.e. we are
accidentally avoiding the problem. I assume other CONFIG_FW_UPLOAD users
simply do not test module removal.

However, I think the fix is simply to remove all module reference taking
by the firmware_loader core. It is the consumer's responsibility to call
firmware_upload_unregister() in its module removal path and that should
flush any and all future usage of the passed in ops structure.
Dionna Amalie Glaze Nov. 13, 2024, 6:40 p.m. UTC | #2
On Tue, Nov 12, 2024 at 6:40 PM Dan Williams <dan.j.williams@intel.com> wrote:
>...
> However, I think the fix is simply to remove all module reference taking
> by the firmware_loader core. It is the consumer's responsibility to call
> firmware_upload_unregister() in its module removal path and that should
> flush any and all future usage of the passed in ops structure.

That would suggest the addition of the refcounting in v1 to fix a test
means the test_firmware is wrong?
https://lore.kernel.org/all/20220421212204.36052-5-russell.h.weight@intel.com/

Adding Kees in case he knows better.
Russ Weight Nov. 14, 2024, 4:35 p.m. UTC | #3
On Tue, Nov 12, 2024 at 06:40:28PM -0800, Dan Williams wrote:
> Dionna Glaze wrote:
> > If a kernel module registers a firmware upload API ops set, then it's
> > unable to be moved due to effectively a cyclic reference that the module
> > depends on the upload which depends on the module.
> > 
> > Instead, only require the try_module_get when an upload is requested to
> > disallow unloading a module only while the upload is in progress.
> 
> Oh, interesting, I wondered why CXL did not uncover this loop in its
> usage only to realize that CXL calls firmware registration from the
> cxl_pci module, but the @module paramter passed to
> firmware_upload_register() is the cxl_core module. I.e. we are
> accidentally avoiding the problem. I assume other CONFIG_FW_UPLOAD users
> simply do not test module removal.
> 
> However, I think the fix is simply to remove all module reference taking
> by the firmware_loader core. It is the consumer's responsibility to call
> firmware_upload_unregister() in its module removal path and that should
> flush any and all future usage of the passed in ops structure.

As I understand it, if a module directly references symbols in another 
module, then the reference count is automatically incremented to ensure
that the dependent symbols are available to the consumer.

In this case, the firmware_loader does not directly reference symbol
names in the device driver that registered it. The call-back function
pointers are provided during registration. Without explicitly
incrementing the module reference count, it is possible to remove the
device driver while leaving the firmware loader instance (and sysfs
entries) intact. Accessing those sysfs nodes would result in
references to pointers that are no longer valid.

Clearly this would be an unexpected/unusual case. Someone with root
access would have to remove the device driver. I'm not sure how much
effort should be expended in preventing it - but this is the reasoning
behind the incrementing/decrementing of the module reference counts.

- Russ
Dan Williams Nov. 14, 2024, 6:17 p.m. UTC | #4
Russ Weight wrote:
[..]
> Clearly this would be an unexpected/unusual case. Someone with root
> access would have to remove the device driver. I'm not sure how much
> effort should be expended in preventing it - but this is the reasoning
> behind the incrementing/decrementing of the module reference counts.

The module reference needs to be held only if the producer of those
symbols can be removed without triggering some coordinated removal with
action consumer. A driver that fails to call
firmware_upload_unregister() in its module removal path is simply a driver
with a memory-leak and use-after-free bug, not something the firmware
upload core needs to worry about.

So, the prevention mechanism is "thou shalt use
firmware_upload_unregister() correctly", and when that is in place
explicit module references are not only redundant, but trying to
implement them causes circular dependency loops.
Tom Lendacky Nov. 14, 2024, 7:30 p.m. UTC | #5
On 11/14/24 12:17, Dan Williams wrote:
> Russ Weight wrote:
> [..]
>> Clearly this would be an unexpected/unusual case. Someone with root
>> access would have to remove the device driver. I'm not sure how much
>> effort should be expended in preventing it - but this is the reasoning
>> behind the incrementing/decrementing of the module reference counts.
> 
> The module reference needs to be held only if the producer of those
> symbols can be removed without triggering some coordinated removal with
> action consumer. A driver that fails to call
> firmware_upload_unregister() in its module removal path is simply a driver
> with a memory-leak and use-after-free bug, not something the firmware
> upload core needs to worry about.
> 
> So, the prevention mechanism is "thou shalt use
> firmware_upload_unregister() correctly", and when that is in place
> explicit module references are not only redundant, but trying to
> implement them causes circular dependency loops.

I believe that is how other similar services, like debugfs, work, the
module is responsible for cleaning up.

Thanks,
Tom
diff mbox series

Patch

diff --git a/drivers/base/firmware_loader/sysfs_upload.c b/drivers/base/firmware_loader/sysfs_upload.c
index 829270067d163..7d9c6aef7720a 100644
--- a/drivers/base/firmware_loader/sysfs_upload.c
+++ b/drivers/base/firmware_loader/sysfs_upload.c
@@ -204,6 +204,7 @@  static void fw_upload_main(struct work_struct *work)
 		fwlp->ops->cleanup(fwl);
 
 putdev_exit:
+	module_put(fwlp->module);
 	put_device(fw_dev->parent);
 
 	/*
@@ -239,6 +240,9 @@  int fw_upload_start(struct fw_sysfs *fw_sysfs)
 	}
 
 	fwlp = fw_sysfs->fw_upload_priv;
+	if (!try_module_get(fwlp->module)) /* released in fw_upload_main */
+		return -EFAULT;
+
 	mutex_lock(&fwlp->lock);
 
 	/* Do not interfere with an on-going fw_upload */
@@ -310,13 +314,10 @@  firmware_upload_register(struct module *module, struct device *parent,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (!try_module_get(module))
-		return ERR_PTR(-EFAULT);
-
 	fw_upload = kzalloc(sizeof(*fw_upload), GFP_KERNEL);
 	if (!fw_upload) {
 		ret = -ENOMEM;
-		goto exit_module_put;
+		goto exit_err;
 	}
 
 	fw_upload_priv = kzalloc(sizeof(*fw_upload_priv), GFP_KERNEL);
@@ -358,7 +359,7 @@  firmware_upload_register(struct module *module, struct device *parent,
 	if (ret) {
 		dev_err(fw_dev, "%s: device_register failed\n", __func__);
 		put_device(fw_dev);
-		goto exit_module_put;
+		goto exit_err;
 	}
 
 	return fw_upload;
@@ -372,8 +373,7 @@  firmware_upload_register(struct module *module, struct device *parent,
 free_fw_upload:
 	kfree(fw_upload);
 
-exit_module_put:
-	module_put(module);
+exit_err:
 
 	return ERR_PTR(ret);
 }
@@ -387,7 +387,6 @@  void firmware_upload_unregister(struct fw_upload *fw_upload)
 {
 	struct fw_sysfs *fw_sysfs = fw_upload->priv;
 	struct fw_upload_priv *fw_upload_priv = fw_sysfs->fw_upload_priv;
-	struct module *module = fw_upload_priv->module;
 
 	mutex_lock(&fw_upload_priv->lock);
 	if (fw_upload_priv->progress == FW_UPLOAD_PROG_IDLE) {
@@ -403,6 +402,5 @@  void firmware_upload_unregister(struct fw_upload *fw_upload)
 
 unregister:
 	device_unregister(&fw_sysfs->dev);
-	module_put(module);
 }
 EXPORT_SYMBOL_GPL(firmware_upload_unregister);