diff mbox series

kernel/module: add documentation for try_module_get()

Message ID 20210722221905.1718213-1-mcgrof@kernel.org (mailing list archive)
State New
Headers show
Series kernel/module: add documentation for try_module_get() | expand

Commit Message

Luis Chamberlain July 22, 2021, 10:19 p.m. UTC
There is quite a bit of tribal knowledge around proper use of
try_module_get() and that it must be used only in a context which
can ensure the module won't be gone during the operation. Document
this little bit of tribal knowledge.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Stephen Hemminger July 22, 2021, 10:39 p.m. UTC | #1
On Thu, 22 Jul 2021 15:19:05 -0700
Luis Chamberlain <mcgrof@kernel.org> wrote:

> +/**
> + * try_module_get - yields to module removal and bumps reference count otherwise
> + * @module: the module we should check for

You have the right intentions, but this patch needs some work.

This looks like a docbook comment format but isn't that format.

The first couple of lines need to be reworded to be a complete sentence like
the rest.

In general best to make new comments the same as existing comments
in the same area.

It would be better to put this description in the header file where other
docbook comments for module API are. See example around module_init() and module_exit().

Ideally, all the api's around module reference counts would be documented there.
Bart Van Assche July 23, 2021, 2:33 a.m. UTC | #2
On 7/22/21 3:19 PM, Luis Chamberlain wrote:
> + * The real value to try_module_get() is the module_is_live() check which
> + * ensures this the caller of try_module_get() can yields to userspace module
> + * removal requests and fail whatever it was about to process.

can yields -> can yield?

Otherwise this looks really well written to me.

Bart.
David Laight July 24, 2021, 12:15 p.m. UTC | #3
From: Luis Chamberlain
> Sent: 22 July 2021 23:19
> 
> There is quite a bit of tribal knowledge around proper use of
> try_module_get() and that it must be used only in a context which
> can ensure the module won't be gone during the operation. Document
> this little bit of tribal knowledge.
> 
...

Some typos.

> +/**
> + * try_module_get - yields to module removal and bumps reference count otherwise
> + * @module: the module we should check for
> + *
> + * This can be used to check if userspace has requested to remove a module,
                                                           a module be removed
> + * and if so let the caller give up. Otherwise it takes a reference count to
> + * ensure a request from userspace to remove the module cannot happen.
> + *
> + * Care must be taken to ensure the module cannot be removed during
> + * try_module_get(). This can be done by having another entity other than the
> + * module itself increment the module reference count, or through some other
> + * means which gaurantees the module could not be removed during an operation.
                  guarantees
> + * An example of this later case is using this call in a sysfs file which the
> + * module created. The sysfs store / read file operation is ensured to exist
                                                            ^^^^^^^^^^^^^^^^^^^
Not sure what that is supposed to mean.
> + * and still be present by kernfs's active reference. If a sysfs file operation
> + * is being run, the module which created it must still exist as the module is
> + * in charge of removal of the sysfs file.
> + *
> + * The real value to try_module_get() is the module_is_live() check which
> + * ensures this the caller of try_module_get() can yields to userspace module
> + * removal requests and fail whatever it was about to process.
> + */

But is the comment even right?
I think you need to consider when try_module_get() can actually fail.
I believe the following is right.
The caller has to have valid module reference and module unload
must actually be in progress - ie the ref count is zero and
there are no active IO operations.

The module's unload function must (eventually) invalidate the
caller's module reference to stop try_module_get() being called
with a (very) stale pointer.

So there is a potentially horrid race:
The module unload is going to do:
	driver_data->module_ref = 0;
and elsewhere there'll be:
	ref = driver_data->module_ref;
	if (!ref || !try_module_get(ref))
		return -error;

You have to have try_module_get() to allow the module unload
function to sleep.
But the above code still needs a driver lock to ensure the
unload code doesn't race with the try_module_get() and the
'ref' be invalidated before try_module_get() looks at it.
(eg if an interrupt defers processing.)

So there can be no 'yielding'.

I'm pretty much certain try_module_get(THIS_MODULE) is pretty
much never going to fail.
(It is mostly needed to give a worker thread a reference.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Luis Chamberlain July 27, 2021, 5:30 p.m. UTC | #4
On Sat, Jul 24, 2021 at 12:15:10PM +0000, David Laight wrote:
> From: Luis Chamberlain
> > Sent: 22 July 2021 23:19
> > 
> > There is quite a bit of tribal knowledge around proper use of
> > try_module_get() and that it must be used only in a context which
> > can ensure the module won't be gone during the operation. Document
> > this little bit of tribal knowledge.
> > 
> ...
> 
> Some typos.
> 
> > +/**
> > + * try_module_get - yields to module removal and bumps reference count otherwise
> > + * @module: the module we should check for
> > + *
> > + * This can be used to check if userspace has requested to remove a module,
>                                                            a module be removed
> > + * and if so let the caller give up. Otherwise it takes a reference count to
> > + * ensure a request from userspace to remove the module cannot happen.
> > + *
> > + * Care must be taken to ensure the module cannot be removed during
> > + * try_module_get(). This can be done by having another entity other than the
> > + * module itself increment the module reference count, or through some other
> > + * means which gaurantees the module could not be removed during an operation.
>                   guarantees
> > + * An example of this later case is using this call in a sysfs file which the
> > + * module created. The sysfs store / read file operation is ensured to exist
>                                                             ^^^^^^^^^^^^^^^^^^^
> Not sure what that is supposed to mean.

I'll clarify further. How about:

The sysfs store / read file operations are gauranteed to exist using
kernfs's active reference (see kernfs_active()).

> > + * and still be present by kernfs's active reference. If a sysfs file operation
> > + * is being run, the module which created it must still exist as the module is
> > + * in charge of removal of the sysfs file.
> > + *
> > + * The real value to try_module_get() is the module_is_live() check which
> > + * ensures this the caller of try_module_get() can yields to userspace module
> > + * removal requests and fail whatever it was about to process.
> > + */
> 
> But is the comment even right?
> I think you need to consider when try_module_get() can actually fail.

Let's do that!

> I believe the following is right.
> The caller has to have valid module reference and module unload
> must actually be in progress - ie the ref count is zero and
> there are no active IO operations.

If the refcount bump succeeded then module unload will simply not
happen. So what exactly do you mean with the first part of
"The caller has to have a valid module reference" ?

> The module's unload function must (eventually) invalidate the
> caller's module reference to stop try_module_get() being called
> with a (very) stale pointer.

Once a module's exit call is triggered the state is MODULE_STATE_GOING,
which is what module_is_live() checks for.

> So there is a potentially horrid race:
> The module unload is going to do:
> 	driver_data->module_ref = 0;
> and elsewhere there'll be:
> 	ref = driver_data->module_ref;
> 	if (!ref || !try_module_get(ref))
> 		return -error;
> 
> You have to have try_module_get() to allow the module unload
> function to sleep.
> But the above code still needs a driver lock to ensure the
> unload code doesn't race with the try_module_get() and the
> 'ref' be invalidated before try_module_get() looks at it.
> (eg if an interrupt defers processing.)
> 
> So there can be no 'yielding'.

Oh but there is. Consider access to a random sysfs file 'add_new_device'
which takes as input a name, for driver foo, and so foo's
add_new_foobar_device(name="bar") is called. Unless sysfs file
"yields" by using try_module_get() before trying to add a new
foo device called "bar", it will essentially be racing with the
exit routine of module foo, and depending on how locking is implemented
(most drivers get it wrong), this easily leads to crashes.

In fact, this documentation patch was motivated by my own solution to a
possible deadlock when sysfs is used. Using the same example above, if
the same sysfs file uses *any* lock, which is *also* used on the exit
routine, you can easily trigger a deadlock. This can happen for example
by the lock being obtained by the removal routine, then the sysfs file
gets called, waits for the lock to complete, then the module's exit
routine starts cleaning up and removing sysfs files, but we won't be
able to remove the sysfs file (due to kernefs active reference) until
the sysfs file complets, but it cannot complete because the lock is
already held.

Yes, this is a generic problem. Yes I have proof [0]. Yes, a generic
solution has been proposed [1], and because Greg is not convinced and I
need to move on with life, I am suggesting a temporary driver specific
solution (to which Greg is still NACK'ing, without even proposing any
alternatives) [2].

[0] https://lkml.kernel.org/r/20210703004632.621662-5-mcgrof@kernel.org
[1] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com 
[2] https://lkml.kernel.org/r/20210723174919.ka3tzyre432uilf7@garbanzo

> I'm pretty much certain try_module_get(THIS_MODULE) is pretty
> much never going to fail.

It would have to take something very asynchronous and detached from
the module to run. But the only thing I can think now, is something
takes a module pointer right before after try_stop_module() and then
a piece of code in between try_stop_module() and free_module()
asynchronously tries to run something with that pointer.

In the end I can only think of buggy code. Perhaps the more type of
common issue could be code which purposely leave module pointers around
with the intent of cleaning up using a module removal notifier event and
that for some stupid reason runs something asynchronously with that
pointer.

> (It is mostly needed to give a worker thread a reference.)

Greg, do you have a real world example which demonstrates the race
better? Or perhaps a selftest? Or a kunit test?

  Luis
Greg KH July 27, 2021, 5:46 p.m. UTC | #5
On Tue, Jul 27, 2021 at 10:30:36AM -0700, Luis Chamberlain wrote:
> On Sat, Jul 24, 2021 at 12:15:10PM +0000, David Laight wrote:
> > From: Luis Chamberlain
> > > Sent: 22 July 2021 23:19
> > > 
> > > There is quite a bit of tribal knowledge around proper use of
> > > try_module_get() and that it must be used only in a context which
> > > can ensure the module won't be gone during the operation. Document
> > > this little bit of tribal knowledge.
> > > 
> > ...
> > 
> > Some typos.
> > 
> > > +/**
> > > + * try_module_get - yields to module removal and bumps reference count otherwise
> > > + * @module: the module we should check for
> > > + *
> > > + * This can be used to check if userspace has requested to remove a module,
> >                                                            a module be removed
> > > + * and if so let the caller give up. Otherwise it takes a reference count to
> > > + * ensure a request from userspace to remove the module cannot happen.
> > > + *
> > > + * Care must be taken to ensure the module cannot be removed during
> > > + * try_module_get(). This can be done by having another entity other than the
> > > + * module itself increment the module reference count, or through some other
> > > + * means which gaurantees the module could not be removed during an operation.
> >                   guarantees
> > > + * An example of this later case is using this call in a sysfs file which the
> > > + * module created. The sysfs store / read file operation is ensured to exist
> >                                                             ^^^^^^^^^^^^^^^^^^^
> > Not sure what that is supposed to mean.
> 
> I'll clarify further. How about:
> 
> The sysfs store / read file operations are gauranteed to exist using
> kernfs's active reference (see kernfs_active()).

But that has nothing to do with module reference counts.  kernfs knows
nothing about modules.

> > So there is a potentially horrid race:
> > The module unload is going to do:
> > 	driver_data->module_ref = 0;
> > and elsewhere there'll be:
> > 	ref = driver_data->module_ref;
> > 	if (!ref || !try_module_get(ref))
> > 		return -error;
> > 
> > You have to have try_module_get() to allow the module unload
> > function to sleep.
> > But the above code still needs a driver lock to ensure the
> > unload code doesn't race with the try_module_get() and the
> > 'ref' be invalidated before try_module_get() looks at it.
> > (eg if an interrupt defers processing.)
> > 
> > So there can be no 'yielding'.
> 
> Oh but there is. Consider access to a random sysfs file 'add_new_device'
> which takes as input a name, for driver foo, and so foo's
> add_new_foobar_device(name="bar") is called. Unless sysfs file
> "yields" by using try_module_get() before trying to add a new
> foo device called "bar", it will essentially be racing with the
> exit routine of module foo, and depending on how locking is implemented
> (most drivers get it wrong), this easily leads to crashes.
> 
> In fact, this documentation patch was motivated by my own solution to a
> possible deadlock when sysfs is used. Using the same example above, if
> the same sysfs file uses *any* lock, which is *also* used on the exit
> routine, you can easily trigger a deadlock. This can happen for example
> by the lock being obtained by the removal routine, then the sysfs file
> gets called, waits for the lock to complete, then the module's exit
> routine starts cleaning up and removing sysfs files, but we won't be
> able to remove the sysfs file (due to kernefs active reference) until
> the sysfs file complets, but it cannot complete because the lock is
> already held.
> 
> Yes, this is a generic problem. Yes I have proof [0]. Yes, a generic
> solution has been proposed [1], and because Greg is not convinced and I
> need to move on with life, I am suggesting a temporary driver specific
> solution (to which Greg is still NACK'ing, without even proposing any
> alternatives) [2].
> 
> [0] https://lkml.kernel.org/r/20210703004632.621662-5-mcgrof@kernel.org
> [1] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com 
> [2] https://lkml.kernel.org/r/20210723174919.ka3tzyre432uilf7@garbanzo

My problem with your proposed solution is that it is still racy, you can
not increment your own module reference count from 0 -> 1 and expect it
to work properly.  You need external code to do that somewhere.

Now trying to tie sysfs files to the modules that own them would be
nice, but as we have seen, that way lies way too many kernel changes,
right?

Hm, maybe.  Did we think about this from the kobj_attribute level?  If
we use the "wrapper" logic there and the use of the macros we already
have for attributes, we might be able to get the module pointer directly
"for free".

Did we try that?  this thread has been going on for so long I can't
remember anymore...

> > I'm pretty much certain try_module_get(THIS_MODULE) is pretty
> > much never going to fail.
> 
> It would have to take something very asynchronous and detached from
> the module to run. But the only thing I can think now, is something
> takes a module pointer right before after try_stop_module() and then
> a piece of code in between try_stop_module() and free_module()
> asynchronously tries to run something with that pointer.
> 
> In the end I can only think of buggy code. Perhaps the more type of
> common issue could be code which purposely leave module pointers around
> with the intent of cleaning up using a module removal notifier event and
> that for some stupid reason runs something asynchronously with that
> pointer.
> 
> > (It is mostly needed to give a worker thread a reference.)
> 
> Greg, do you have a real world example which demonstrates the race
> better? Or perhaps a selftest? Or a kunit test?

Hah, nice try :)

greg k-h
Luis Chamberlain July 27, 2021, 6:18 p.m. UTC | #6
On Tue, Jul 27, 2021 at 07:46:34PM +0200, gregkh@linuxfoundation.org wrote:
> On Tue, Jul 27, 2021 at 10:30:36AM -0700, Luis Chamberlain wrote:
> > On Sat, Jul 24, 2021 at 12:15:10PM +0000, David Laight wrote:
> > > From: Luis Chamberlain
> > > > Sent: 22 July 2021 23:19
> > > > 
> > > > There is quite a bit of tribal knowledge around proper use of
> > > > try_module_get() and that it must be used only in a context which
> > > > can ensure the module won't be gone during the operation. Document
> > > > this little bit of tribal knowledge.
> > > > 
> > > ...
> > > 
> > > Some typos.
> > > 
> > > > +/**
> > > > + * try_module_get - yields to module removal and bumps reference count otherwise
> > > > + * @module: the module we should check for
> > > > + *
> > > > + * This can be used to check if userspace has requested to remove a module,
> > >                                                            a module be removed
> > > > + * and if so let the caller give up. Otherwise it takes a reference count to
> > > > + * ensure a request from userspace to remove the module cannot happen.
> > > > + *
> > > > + * Care must be taken to ensure the module cannot be removed during
> > > > + * try_module_get(). This can be done by having another entity other than the
> > > > + * module itself increment the module reference count, or through some other
> > > > + * means which gaurantees the module could not be removed during an operation.
> > >                   guarantees
> > > > + * An example of this later case is using this call in a sysfs file which the
> > > > + * module created. The sysfs store / read file operation is ensured to exist
> > >                                                             ^^^^^^^^^^^^^^^^^^^
> > > Not sure what that is supposed to mean.
> > 
> > I'll clarify further. How about:
> > 
> > The sysfs store / read file operations are gauranteed to exist using
> > kernfs's active reference (see kernfs_active()).
> 
> But that has nothing to do with module reference counts.  kernfs knows
> nothing about modules.

Yes but we are talking about sysfs files which the module creates. So
but inference again, an active reference protects a module.

> > > So there is a potentially horrid race:
> > > The module unload is going to do:
> > > 	driver_data->module_ref = 0;
> > > and elsewhere there'll be:
> > > 	ref = driver_data->module_ref;
> > > 	if (!ref || !try_module_get(ref))
> > > 		return -error;
> > > 
> > > You have to have try_module_get() to allow the module unload
> > > function to sleep.
> > > But the above code still needs a driver lock to ensure the
> > > unload code doesn't race with the try_module_get() and the
> > > 'ref' be invalidated before try_module_get() looks at it.
> > > (eg if an interrupt defers processing.)
> > > 
> > > So there can be no 'yielding'.
> > 
> > Oh but there is. Consider access to a random sysfs file 'add_new_device'
> > which takes as input a name, for driver foo, and so foo's
> > add_new_foobar_device(name="bar") is called. Unless sysfs file
> > "yields" by using try_module_get() before trying to add a new
> > foo device called "bar", it will essentially be racing with the
> > exit routine of module foo, and depending on how locking is implemented
> > (most drivers get it wrong), this easily leads to crashes.
> > 
> > In fact, this documentation patch was motivated by my own solution to a
> > possible deadlock when sysfs is used. Using the same example above, if
> > the same sysfs file uses *any* lock, which is *also* used on the exit
> > routine, you can easily trigger a deadlock. This can happen for example
> > by the lock being obtained by the removal routine, then the sysfs file
> > gets called, waits for the lock to complete, then the module's exit
> > routine starts cleaning up and removing sysfs files, but we won't be
> > able to remove the sysfs file (due to kernefs active reference) until
> > the sysfs file complets, but it cannot complete because the lock is
> > already held.
> > 
> > Yes, this is a generic problem. Yes I have proof [0]. Yes, a generic
> > solution has been proposed [1], and because Greg is not convinced and I
> > need to move on with life, I am suggesting a temporary driver specific
> > solution (to which Greg is still NACK'ing, without even proposing any
> > alternatives) [2].
> > 
> > [0] https://lkml.kernel.org/r/20210703004632.621662-5-mcgrof@kernel.org
> > [1] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com 
> > [2] https://lkml.kernel.org/r/20210723174919.ka3tzyre432uilf7@garbanzo
> 
> My problem with your proposed solution is that it is still racy, you can
> not increment your own module reference count from 0 -> 1 and expect it
> to work properly.  You need external code to do that somewhere.

You are not providing *any* proof for this. And even so, I believe I
have clarified as best as possible how a kernfs active reference
implicitly protects the module when we are talking about sysfs files.

> Now trying to tie sysfs files to the modules that own them would be
> nice, but as we have seen, that way lies way too many kernel changes,
> right?

It's not a one-liner fix. Yes.

> Hm, maybe.  Did we think about this from the kobj_attribute level?  If
> we use the "wrapper" logic there and the use of the macros we already
> have for attributes, we might be able to get the module pointer directly
> "for free".
>
> Did we try that?

That was my hope. I tried that first. Last year in November I determined
kernfs is kobject stupid. But more importantly *neither* are struct device
specific, so neither of them have semantics for modules or even devices.

> this thread has been going on for so long I can't
> remember anymore...

Please...

  Luis
Greg KH July 27, 2021, 6:38 p.m. UTC | #7
On Tue, Jul 27, 2021 at 11:18:03AM -0700, Luis Chamberlain wrote:
> On Tue, Jul 27, 2021 at 07:46:34PM +0200, gregkh@linuxfoundation.org wrote:
> > On Tue, Jul 27, 2021 at 10:30:36AM -0700, Luis Chamberlain wrote:
> > > On Sat, Jul 24, 2021 at 12:15:10PM +0000, David Laight wrote:
> > > > From: Luis Chamberlain
> > > > > Sent: 22 July 2021 23:19
> > > > > 
> > > > > There is quite a bit of tribal knowledge around proper use of
> > > > > try_module_get() and that it must be used only in a context which
> > > > > can ensure the module won't be gone during the operation. Document
> > > > > this little bit of tribal knowledge.
> > > > > 
> > > > ...
> > > > 
> > > > Some typos.
> > > > 
> > > > > +/**
> > > > > + * try_module_get - yields to module removal and bumps reference count otherwise
> > > > > + * @module: the module we should check for
> > > > > + *
> > > > > + * This can be used to check if userspace has requested to remove a module,
> > > >                                                            a module be removed
> > > > > + * and if so let the caller give up. Otherwise it takes a reference count to
> > > > > + * ensure a request from userspace to remove the module cannot happen.
> > > > > + *
> > > > > + * Care must be taken to ensure the module cannot be removed during
> > > > > + * try_module_get(). This can be done by having another entity other than the
> > > > > + * module itself increment the module reference count, or through some other
> > > > > + * means which gaurantees the module could not be removed during an operation.
> > > >                   guarantees
> > > > > + * An example of this later case is using this call in a sysfs file which the
> > > > > + * module created. The sysfs store / read file operation is ensured to exist
> > > >                                                             ^^^^^^^^^^^^^^^^^^^
> > > > Not sure what that is supposed to mean.
> > > 
> > > I'll clarify further. How about:
> > > 
> > > The sysfs store / read file operations are gauranteed to exist using
> > > kernfs's active reference (see kernfs_active()).
> > 
> > But that has nothing to do with module reference counts.  kernfs knows
> > nothing about modules.
> 
> Yes but we are talking about sysfs files which the module creates. So
> but inference again, an active reference protects a module.

What active reference?  sysfs creation/removal/rename/whatever right now
has nothing to do with module reference counts as they are totally
disconnected.  kernfs has nothing to do with module reference counts
either.  So I do not know what you are inferring here.

> > > > So there is a potentially horrid race:
> > > > The module unload is going to do:
> > > > 	driver_data->module_ref = 0;
> > > > and elsewhere there'll be:
> > > > 	ref = driver_data->module_ref;
> > > > 	if (!ref || !try_module_get(ref))
> > > > 		return -error;
> > > > 
> > > > You have to have try_module_get() to allow the module unload
> > > > function to sleep.
> > > > But the above code still needs a driver lock to ensure the
> > > > unload code doesn't race with the try_module_get() and the
> > > > 'ref' be invalidated before try_module_get() looks at it.
> > > > (eg if an interrupt defers processing.)
> > > > 
> > > > So there can be no 'yielding'.
> > > 
> > > Oh but there is. Consider access to a random sysfs file 'add_new_device'
> > > which takes as input a name, for driver foo, and so foo's
> > > add_new_foobar_device(name="bar") is called. Unless sysfs file
> > > "yields" by using try_module_get() before trying to add a new
> > > foo device called "bar", it will essentially be racing with the
> > > exit routine of module foo, and depending on how locking is implemented
> > > (most drivers get it wrong), this easily leads to crashes.
> > > 
> > > In fact, this documentation patch was motivated by my own solution to a
> > > possible deadlock when sysfs is used. Using the same example above, if
> > > the same sysfs file uses *any* lock, which is *also* used on the exit
> > > routine, you can easily trigger a deadlock. This can happen for example
> > > by the lock being obtained by the removal routine, then the sysfs file
> > > gets called, waits for the lock to complete, then the module's exit
> > > routine starts cleaning up and removing sysfs files, but we won't be
> > > able to remove the sysfs file (due to kernefs active reference) until
> > > the sysfs file complets, but it cannot complete because the lock is
> > > already held.
> > > 
> > > Yes, this is a generic problem. Yes I have proof [0]. Yes, a generic
> > > solution has been proposed [1], and because Greg is not convinced and I
> > > need to move on with life, I am suggesting a temporary driver specific
> > > solution (to which Greg is still NACK'ing, without even proposing any
> > > alternatives) [2].
> > > 
> > > [0] https://lkml.kernel.org/r/20210703004632.621662-5-mcgrof@kernel.org
> > > [1] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com 
> > > [2] https://lkml.kernel.org/r/20210723174919.ka3tzyre432uilf7@garbanzo
> > 
> > My problem with your proposed solution is that it is still racy, you can
> > not increment your own module reference count from 0 -> 1 and expect it
> > to work properly.  You need external code to do that somewhere.
> 
> You are not providing *any* proof for this.

I did provide proof of that.  Here it is again.

Consider these lines of code:

 1	int foo(int baz)
 2	{
 3		int retval
 4
 5		if (!try_module_get(THIS_MODULE))
 6			return -ERROR;
 7		retval = do_something(baz)
 8		put_module(THIS_MODULE);
 9		return retval;
10	}

Going into the call to foo(), there is no reference held on THIS_MODULE.

Right before line 5 is called (or really, right before the jump to
try_module_get(), yet still within foo() (i.e. lines 2-4 where you have
fun stack frames set up, and ftrace hooks, and other nifty things),
userspace asks for the module to be unloaded, and the module is removed
from the system and the memory for this code is overwritten with all
0x00.

Then, we try to call into try_module_get(), but yet, that call
instruction is gone and boom.

Or better yet, after put_module() is called, the module is unloaded
_before_ the return happens.  Then we try to make the return jump back,
but that instruction was overwritten with all 0x00.  Or different code
because a new module was loaded then.

Yes, your window is smaller, but it is still there, and still can be
triggered.  That is why in the 2.5 days we removed almost all instances
of this pattern.  There are still some floating around in the kernel,
but odds are they are broken because NO ONE TESTS UNLOADING MODULES
UNDER STRESS.

Except your crazy customer :)

> And even so, I believe I have clarified as best as possible how a
> kernfs active reference implicitly protects the module when we are
> talking about sysfs files.

I do not see any link anywhere between kernfs and modules, what am I
missing?  Pointers to lines of code would be appreciated.

> > Now trying to tie sysfs files to the modules that own them would be
> > nice, but as we have seen, that way lies way too many kernel changes,
> > right?
> 
> It's not a one-liner fix. Yes.
> 
> > Hm, maybe.  Did we think about this from the kobj_attribute level?  If
> > we use the "wrapper" logic there and the use of the macros we already
> > have for attributes, we might be able to get the module pointer directly
> > "for free".
> >
> > Did we try that?
> 
> That was my hope. I tried that first. Last year in November I determined
> kernfs is kobject stupid. But more importantly *neither* are struct device
> specific, so neither of them have semantics for modules or even devices.

But what about at the kobject level?

I will try to look at that this week, can't promise anything...

greg k-h
Luis Chamberlain July 27, 2021, 8:54 p.m. UTC | #8
On Tue, Jul 27, 2021 at 08:38:50PM +0200, gregkh@linuxfoundation.org wrote:
> On Tue, Jul 27, 2021 at 11:18:03AM -0700, Luis Chamberlain wrote:
> > On Tue, Jul 27, 2021 at 07:46:34PM +0200, gregkh@linuxfoundation.org wrote:
> > > On Tue, Jul 27, 2021 at 10:30:36AM -0700, Luis Chamberlain wrote:
> > > > On Sat, Jul 24, 2021 at 12:15:10PM +0000, David Laight wrote:
> > > > > From: Luis Chamberlain
> > > > > > Sent: 22 July 2021 23:19
> > > > The sysfs store / read file operations are gauranteed to exist using
> > > > kernfs's active reference (see kernfs_active()).
> > > 
> > > But that has nothing to do with module reference counts.  kernfs knows
> > > nothing about modules.
> > 
> > Yes but we are talking about sysfs files which the module creates. So
> > but inference again, an active reference protects a module.
> 
> What active reference? 

kernfs_active()

> > > > In fact, this documentation patch was motivated by my own solution to a
> > > > possible deadlock when sysfs is used. Using the same example above, if
> > > > the same sysfs file uses *any* lock, which is *also* used on the exit
> > > > routine, you can easily trigger a deadlock. This can happen for example
> > > > by the lock being obtained by the removal routine, then the sysfs file
> > > > gets called, waits for the lock to complete, then the module's exit
> > > > routine starts cleaning up and removing sysfs files, but we won't be
> > > > able to remove the sysfs file (due to kernefs active reference) until
> > > > the sysfs file complets, but it cannot complete because the lock is
> > > > already held.
> > > > 
> > > > Yes, this is a generic problem. Yes I have proof [0]. Yes, a generic
> > > > solution has been proposed [1], and because Greg is not convinced and I
> > > > need to move on with life, I am suggesting a temporary driver specific
> > > > solution (to which Greg is still NACK'ing, without even proposing any
> > > > alternatives) [2].
> > > > 
> > > > [0] https://lkml.kernel.org/r/20210703004632.621662-5-mcgrof@kernel.org
> > > > [1] https://lkml.kernel.org/r/20210401235925.GR4332@42.do-not-panic.com 
> > > > [2] https://lkml.kernel.org/r/20210723174919.ka3tzyre432uilf7@garbanzo
> > > 
> > > My problem with your proposed solution is that it is still racy, you can
> > > not increment your own module reference count from 0 -> 1 and expect it
> > > to work properly.  You need external code to do that somewhere.
> > 
> > You are not providing *any* proof for this.
> 
> I did provide proof of that.  Here it is again.

<irrelevant example> 

sysfs files are safe to use try_module_get() because once they are
active a removal of the file cannot happen, and so removal will wait.

> > And even so, I believe I have clarified as best as possible how a
> > kernfs active reference implicitly protects the module when we are
> > talking about sysfs files.
> 
> I do not see any link anywhere between kernfs and modules, what am I
> missing?  Pointers to lines of code would be appreciated.

I provided a selftests with error injections inserted all over
kernfs_fop_write_iter(). Please study that and my error injection
code.

> > > Now trying to tie sysfs files to the modules that own them would be
> > > nice, but as we have seen, that way lies way too many kernel changes,
> > > right?
> > 
> > It's not a one-liner fix. Yes.
> > 
> > > Hm, maybe.  Did we think about this from the kobj_attribute level?  If
> > > we use the "wrapper" logic there and the use of the macros we already
> > > have for attributes, we might be able to get the module pointer directly
> > > "for free".
> > >
> > > Did we try that?
> > 
> > That was my hope. I tried that first. Last year in November I determined
> > kernfs is kobject stupid. But more importantly *neither* are struct device
> > specific, so neither of them have semantics for modules or even devices.
> 
> But what about at the kobject level?

kernfs is kobject stupid.

> I will try to look at that this week, can't promise anything...

  Luis
David Laight July 28, 2021, 8:28 a.m. UTC | #9
...
> sysfs files are safe to use try_module_get() because once they are
> active a removal of the file cannot happen, and so removal will wait.

I doubt it.

If the module_remove() function removes sysfs nodes then (something
like) this has to happen.

1) rmmod (or similar) tries to remove the module.
2) The reference count is zero so the remove is allowed.
3) Something tries to access a sysfs node in the module.
3a) If sysfs knew the nodes were in a module it could use
    try_module_get() to ensure the module wasn't being unloaded.
    Failure would cause the sysfs access to fail.
    But I'm not sure it does, and in any case it doesn't help.
3b) The sysfs thread calls into the module code and waits on a mutex.
3c) The rmmod thread gets around to calling into sysfs to remove the nodes.

At this point we hit the standard 'deregistering a callback' issue.
Exactly the same issue affects removal of per-device sysfs node
from a driver's .remove function.

Typically this is solved by making the deregister routing sleep
until all the callbacks have completed.

So this would require functions like SYSFS_REMOVE_GROUP() and
hwmon_device_unregister() to be allowed to sleep and not be
called with any locks (of any kind) held that the callback
functions acquire.

The module reference count is irrelevant.

	David

    

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Luis Chamberlain July 28, 2021, 1:49 p.m. UTC | #10
On Wed, Jul 28, 2021 at 08:28:11AM +0000, David Laight wrote:
> ...
> > sysfs files are safe to use try_module_get() because once they are
> > active a removal of the file cannot happen, and so removal will wait.
> 
> I doubt it.

But that is what happens.

> If the module_remove() function removes sysfs nodes then (something
> like) this has to happen.
> 
> 1) rmmod (or similar) tries to remove the module.
> 2) The reference count is zero so the remove is allowed.
> 3) Something tries to access a sysfs node in the module.
> 3a) If sysfs knew the nodes were in a module it could use
>     try_module_get() to ensure the module wasn't being unloaded.
>     Failure would cause the sysfs access to fail.
>     But I'm not sure it does,


It does, if a sysfs file had a try_module_get() it would fail as the
module is going.

>     and in any case it doesn't help.

Not clear how from your example.

> 3b) The sysfs thread calls into the module code and waits on a mutex.

If try_module_get() is used on the syfs files, the deadlock is escaped if
used on remove.

> 3c) The rmmod thread gets around to calling into sysfs to remove the nodes.
> 
> At this point we hit the standard 'deregistering a callback' issue.
> Exactly the same issue affects removal of per-device sysfs node
> from a driver's .remove function.
> 
> Typically this is solved by making the deregister routing sleep
> until all the callbacks have completed.
> 
> So this would require functions like sysfs_remove_group() and
> hwmon_device_unregister() to be allowed to sleep

Both can.

Both kernfs_find_and_get_ns() and kernfs_remove_by_name_ns() call
mutex_lock(), they certainly can sleep.

hwmon_device_unregister() calls device_del() which also holds a mutex.

> and not be
> called with any locks (of any kind) held that the callback
> functions acquire.

Not sure why you think this is a requirement.

> The module reference count is irrelevant.

To be clear, there were concerns that there were races here which would
make things murky on sysfs operations and module removal (null
deferences when accessing back the gendisk->private_data) however a
a new selftest driver for sysfs [0], and error injections to allow us to
test and verify all these things I just said are true. If you'd like
to extend the tests to include something you might be concerned about
and want to try, please send me a patch against my tree [1].

[0] https://lkml.kernel.org/r/20210703004632.621662-1-mcgrof@kernel.org
[1] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210701-sysfs-fix-races-v2

  Luis
diff mbox series

Patch

diff --git a/kernel/module.c b/kernel/module.c
index ed13917ea5f3..0d609647a54d 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1066,6 +1066,28 @@  void __module_get(struct module *module)
 }
 EXPORT_SYMBOL(__module_get);
 
+/**
+ * try_module_get - yields to module removal and bumps reference count otherwise
+ * @module: the module we should check for
+ *
+ * This can be used to check if userspace has requested to remove a module,
+ * and if so let the caller give up. Otherwise it takes a reference count to
+ * ensure a request from userspace to remove the module cannot happen.
+ *
+ * Care must be taken to ensure the module cannot be removed during
+ * try_module_get(). This can be done by having another entity other than the
+ * module itself increment the module reference count, or through some other
+ * means which gaurantees the module could not be removed during an operation.
+ * An example of this later case is using this call in a sysfs file which the
+ * module created. The sysfs store / read file operation is ensured to exist
+ * and still be present by kernfs's active reference. If a sysfs file operation
+ * is being run, the module which created it must still exist as the module is
+ * in charge of removal of the sysfs file.
+ *
+ * The real value to try_module_get() is the module_is_live() check which
+ * ensures this the caller of try_module_get() can yields to userspace module
+ * removal requests and fail whatever it was about to process.
+ */
 bool try_module_get(struct module *module)
 {
 	bool ret = true;