diff mbox series

mm: remove unintentional voluntary preemption in get_mmap_lock_carefully

Message ID 20230820104303.2083444-1-mjguzik@gmail.com (mailing list archive)
State New
Headers show
Series mm: remove unintentional voluntary preemption in get_mmap_lock_carefully | expand

Commit Message

Mateusz Guzik Aug. 20, 2023, 10:43 a.m. UTC
Should the trylock succeed (and thus blocking was avoided), the routine
wants to ensure blocking was still legal to do. However, the method
used ends up calling __cond_resched, injecting a voluntary preemption
point with the freshly acquired lock held.

One can hack around it using __might_sleep instead of mere might_sleep,
but since threads keep going off CPU here, I figured it is better to
accommodate it.

Drop the trylock and just do the read lock, which does the might_sleep()
check prior to acquiring the lock.

Found by checking off-CPU time during kernel build (like so:
"offcputime-bpfcc -Ku"), sample backtrace:
    finish_task_switch.isra.0
    __schedule
    __cond_resched
    lock_mm_and_find_vma
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    -                sh (4502)
        10

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 mm/memory.c | 6 ------
 1 file changed, 6 deletions(-)
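
For context, the routine being modified currently looks roughly like this
(reconstructed from the hunks quoted in this thread; details outside those
hunks are from memory and may differ slightly):

static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs)
{
	/* Even if this succeeds, make it clear we *might* have slept */
	if (likely(mmap_read_trylock(mm))) {
		might_sleep();	/* may call __cond_resched() with the lock now held */
		return true;
	}

	if (regs && !user_mode(regs)) {
		unsigned long ip = instruction_pointer(regs);
		if (!search_exception_tables(ip))
			return false;
	}

	/* Blocking fallback; does its own might_sleep() before taking the lock */
	return !mmap_read_lock_killable(mm);
}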

Comments

Matthew Wilcox Aug. 20, 2023, 11:36 a.m. UTC | #1
On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
> Should the trylock succeed (and thus blocking was avoided), the routine
> wants to ensure blocking was still legal to do. However, the method
> used ends up calling __cond_resched, injecting a voluntary preemption
> point with the freshly acquired lock held.
> 
> One can hack around it using __might_sleep instead of mere might_sleep,
> but since threads keep going off CPU here, I figured it is better to
> accommodate it.

Except now we search the exception tables every time we call it.
The now-deleted comment (c2508ec5a58d) suggests this is slow:

-       /*
-        * Kernel-mode access to the user address space should only occur
-        * on well-defined single instructions listed in the exception
-        * tables.  But, an erroneous kernel fault occurring outside one of
-        * those areas which also holds mmap_lock might deadlock attempting
-        * to validate the fault against the address space.
-        *
-        * Only do the expensive exception table search when we might be at
-        * risk of a deadlock.  This happens if we
-        * 1. Failed to acquire mmap_lock, and
-        * 2. The access did not originate in userspace.
-        */

Now, this doesn't mean we're doing it on every page fault.  We skip
all of this if we're able to handle the fault under the VMA lock,
so the effect is probably much smaller than it was.  But I'm surprised
not to see you send any data quantifying the effect of this change!
Mateusz Guzik Aug. 20, 2023, 12:41 p.m. UTC | #2
On 8/20/23, Matthew Wilcox <willy@infradead.org> wrote:
> On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
>> Should the trylock succeed (and thus blocking was avoided), the routine
>> wants to ensure blocking was still legal to do. However, the method
>> used ends up calling __cond_resched, injecting a voluntary preemption
>> point with the freshly acquired lock held.
>>
>> One can hack around it using __might_sleep instead of mere might_sleep,
>> but since threads keep going off CPU here, I figured it is better to
>> accommodate it.
>
> Except now we search the exception tables every time we call it.
> The now-deleted comment (c2508ec5a58d) suggests this is slow:
>

I completely agree it's a rather unfortunate side-effect.

> -       /*
> -        * Kernel-mode access to the user address space should only occur
> -        * on well-defined single instructions listed in the exception
> -        * tables.  But, an erroneous kernel fault occurring outside one of
> -        * those areas which also holds mmap_lock might deadlock attempting
> -        * to validate the fault against the address space.
> -        *
> -        * Only do the expensive exception table search when we might be at
> -        * risk of a deadlock.  This happens if we
> -        * 1. Failed to acquire mmap_lock, and
> -        * 2. The access did not originate in userspace.
> -        */
>
> Now, this doesn't mean we're doing it on every page fault.  We skip
> all of this if we're able to handle the fault under the VMA lock,
> so the effect is probably much smaller than it was.  But I'm surprised
> not to see you send any data quantifying the effect of this change!
>

Going off CPU *after* taking the lock sounds like an obviously bad
thing to happen and as such I did not think it warrants any
measurements.

My first patch looked like this:
diff --git a/mm/memory.c b/mm/memory.c
index 1ec1ef3418bf..8662fd69eae8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5259,7 +5259,9 @@ static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs
 {
        /* Even if this succeeds, make it clear we *might* have slept */
        if (likely(mmap_read_trylock(mm))) {
-               might_sleep();
+#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
+               __might_sleep(__FILE__, __LINE__);
+#endif
                return true;
        }

This keeps assertions while dodging __cond_resched.
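
(For reference, ignoring PREEMPT_DYNAMIC, the definition is roughly the
following -- paraphrasing include/linux/kernel.h from memory:

	# define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)

so a bare __might_sleep() keeps the debug assert but skips the
might_resched()/__cond_resched() part.)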

But then I figured someone may complain about scheduling latency which
was not there prior to the patch.

Between the 2 not so great choices I rolled with what I considered the
safer one.

However, now that I said it, I wonder if perhaps the search could be
circumvented on x86-64? The idea would be to check if SMAP got
disabled (and assuming the CPU supports it) -- if so and the faulting
address belongs to userspace, assume it's all good. This is less
precise in that SMAP can be disabled over the intended users access
but would be fine as far as I'm concerned if the search is such a big
deal.
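
Roughly what I have in mind, somewhere in the x86 fault path (completely
untested; the helper name and exact placement are made up for illustration):

static bool fault_in_uaccess_window(struct pt_regs *regs, unsigned long address)
{
	/*
	 * With SMAP, an intended user access from kernel mode runs inside a
	 * stac()/clac() window, which sets EFLAGS.AC.  If AC is set and the
	 * faulting address is a user address, assume it was a legitimate
	 * uaccess and skip the exception table walk.
	 */
	return cpu_feature_enabled(X86_FEATURE_SMAP) && !user_mode(regs) &&
	       (regs->flags & X86_EFLAGS_AC) && address < TASK_SIZE_MAX;
}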
Mateusz Guzik Aug. 20, 2023, 12:46 p.m. UTC | #3
On 8/20/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 8/20/23, Matthew Wilcox <willy@infradead.org> wrote:
>> On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
>>> Should the trylock succeed (and thus blocking was avoided), the routine
>>> wants to ensure blocking was still legal to do. However, the method
>>> used ends up calling __cond_resched, injecting a voluntary preemption
>>> point with the freshly acquired lock held.
>>>
>>> One can hack around it using __might_sleep instead of mere might_sleep,
>>> but since threads keep going off CPU here, I figured it is better to
>>> accommodate it.
>>
>> Except now we search the exception tables every time we call it.
>> The now-deleted comment (c2508ec5a58d) suggests this is slow:
>>
>
> I completely agree it's a rather unfortunate side-effect.
>
>> -       /*
>> -        * Kernel-mode access to the user address space should only occur
>> -        * on well-defined single instructions listed in the exception
>> -        * tables.  But, an erroneous kernel fault occurring outside one of
>> -        * those areas which also holds mmap_lock might deadlock attempting
>> -        * to validate the fault against the address space.
>> -        *
>> -        * Only do the expensive exception table search when we might be at
>> -        * risk of a deadlock.  This happens if we
>> -        * 1. Failed to acquire mmap_lock, and
>> -        * 2. The access did not originate in userspace.
>> -        */
>>
>> Now, this doesn't mean we're doing it on every page fault.  We skip
>> all of this if we're able to handle the fault under the VMA lock,
>> so the effect is probably much smaller than it was.  But I'm surprised
>> not to see you send any data quantifying the effect of this change!
>>
>
> Going off CPU *after* taking the lock sounds like an obviously bad
> thing to happen and as such I did not think it warrants any
> measurements.
>
> My first patch looked like this:
> diff --git a/mm/memory.c b/mm/memory.c
> index 1ec1ef3418bf..8662fd69eae8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5259,7 +5259,9 @@ static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs
>  {
>         /* Even if this succeeds, make it clear we *might* have slept */
>         if (likely(mmap_read_trylock(mm))) {
> -               might_sleep();
> +#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +               __might_sleep(__FILE__, __LINE__);
> +#endif
>                 return true;
>         }
>
> This keeps assertions while dodging __cond_resched.
>
> But then I figured someone may complain about scheduling latency which
> was not there prior to the patch.
>
> Between the 2 not so great choices I rolled with what I considered the
> safer one.
>
> However, now that I said it, I wonder if perhaps the search could be
> circumvented on x86-64? The idea would be to check if SMAP got
> disabled (and assuming the CPU supports it) -- if so and the faulting
> address belongs to userspace, assume it's all good. This is less
> precise in that SMAP can be disabled over the intended users access
> but would be fine as far as I'm concerned if the search is such a big
> deal.
>

Oof, hit send too fast.

This is less precise in that SMAP can be disabled over A LARGER AREA
THAN the intended users access, but would be fine as far as I'm
concerned if the search is such a big deal.

there :)

Anyhow, I don't feel strongly about any of this; I was mostly
interested in what happens with VFS on the off-CPU front, and this one
is just a random thing I needed to check.

Now that I have elaborated on my $0.03, I'm happy to respin with the
__might_sleep variant. If someone wants a different fix altogether
they are welcome to ignore these patches.

I do claim the current state *is* a problem though -- it can block
down_write for no good reason.
Linus Torvalds Aug. 20, 2023, 12:47 p.m. UTC | #4
On Sun, 20 Aug 2023 at 14:41, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On 8/20/23, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Except now we search the exception tables every time we call it.
> > The now-deleted comment (c2508ec5a58d) suggests this is slow:

Yeah, that was the intent.

But I agree that we should basically avoid trying to sleep just as we
got the lock.

> My first patch looked like this:

Well, that's disgusting and strange.

> -               might_sleep();
> +#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> +               __might_sleep(__FILE__, __LINE__);
> +#endif

Why would you have that strange #ifdef? __might_sleep() just goes away
without that debug option anyway.

But without that odd ifdef, I think it's fine.

            Linus
Linus Torvalds Aug. 20, 2023, 12:59 p.m. UTC | #5
On Sun, 20 Aug 2023 at 14:47, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But without that odd ifdef, I think it's fine.

Another option might be to just move the might_sleep() to the top, and
do it unconditionally. If the trylock fails, the overhead of possibly
doing a cond_resched() is kind of moot.
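
I.e. something like this at the top of get_mmap_lock_carefully() (just a
sketch):

	/* Take the voluntary preemption point before the lock, not after */
	might_sleep();
	if (likely(mmap_read_trylock(mm)))
		return true;
	/* ... exception table check and killable read lock stay as-is ... */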

IOW, the main problem here is not that it causes a scheduling point
(if the kernel isn't preemptible); it seems to be just that we
unnecessarily schedule in a place where the mm lock is held, so it
unnecessarily causes possible lock contention for writers.

With the per-vma locking catching most cases, does any of this even matter?

Mateusz - on that note: I'm wondering what made you see this as a
problem. The case you quote doesn't actually seem to be threaded, so
the vm lock contention shouldn't actually matter there.

Does it schedule away? Sure. But only if "need_resched" is set, so it
doesn't seem to be a *bad* thing per se.

It might just be that this particular scheduling point ends up being a
common one on that load, and with those kernel config options (ie
PREEMPT_VOLUNTARY)?

              Linus
Mateusz Guzik Aug. 20, 2023, 1 p.m. UTC | #6
On Sun, Aug 20, 2023 at 02:47:41PM +0200, Linus Torvalds wrote:
> On Sun, 20 Aug 2023 at 14:41, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > My first patch looked like this:
> 
> Well, that's disgusting and strange.
> 
> > -               might_sleep();
> > +#if defined(CONFIG_DEBUG_ATOMIC_SLEEP)
> > +               __might_sleep(__FILE__, __LINE__);
> > +#endif
> 
> Why would you have that strange #ifdef? __might_sleep() just goes away
> without that debug option anyway.
> 
> But without that odd ifdef, I think it's fine.
> 

Heh, I wrote the patch last night and I could swear it failed to compile
without the ifdef.

That said, I agree the ifdef looks more than disgusting, and I'm happy to
confirm it does build both ways.

That said:

mm: remove unintentional voluntary preemption in get_mmap_lock_carefully

Should the trylock succeed (and thus blocking was avoided), the routine
wants to ensure blocking was still legal to do. However, might_sleep()
ends up calling __cond_resched(), injecting a voluntary preemption point
with the freshly acquired lock held.

Use __might_sleep() instead to only get the asserts.

Found while checking off-CPU time during kernel build (like so:
"offcputime-bpfcc -Ku"), sample backtrace:
    finish_task_switch.isra.0
    __schedule
    __cond_resched
    lock_mm_and_find_vma
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    -                sh (4502)
        10

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 mm/memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1ec1ef3418bf..d82316a8a48b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5259,7 +5259,7 @@ static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs
 {
 	/* Even if this succeeds, make it clear we *might* have slept */
 	if (likely(mmap_read_trylock(mm))) {
-		might_sleep();
+		__might_sleep(__FILE__, __LINE__);
 		return true;
 	}
Mateusz Guzik Aug. 20, 2023, 1:08 p.m. UTC | #7
On Sun, Aug 20, 2023 at 02:59:07PM +0200, Linus Torvalds wrote:
> On Sun, 20 Aug 2023 at 14:47, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But without that odd ifdef, I think it's fine.
> 
> Another option might be to just move the might_sleep() to the top, and
> do it unconditionally. If the trylock fails, the overhead of possibly
> doing a cond_resched() is kind of moot.
> 

I wanted to do it, but then I found this comment:

 * For example, if we have a kernel bug that causes a page
 * fault, we don't want to just use mmap_read_lock() to get
 * the mm lock, because that would deadlock if the bug were
 * to happen while we're holding the mm lock for writing.

I figured scheduling away while on the way to OOPS/similar is not the
best thing to happen.

> IOW, the main problem here is not that it causes a scheduling point
> (if the kernel isn't preemptible); it seems to be just that we
> unnecessarily schedule in a place where the mm lock is held, so it
> unnecessarily causes possible lock contention for writers.
> 
> With the per-vma locking catching most cases, does any of this even matter?
> 
> Mateusz - on that note: I'm wondering what made you see this as a
> problem. The case you quote doesn't actually seem to be threaded, so
> the vm lock contention shouldn't actually matter there.
> 
> Does it schedule away? Sure. But only if "need_resched" is set, so it
> doesn't seem to be a *bad* thing per se.
> 
> It might just be that this particular scheduling point ends up being a
> common one on that load, and with those kernel config options (ie
> PREEMPT_VOLUNTARY)?
> 

It did not cause a slowdown for me and I did not say it did.

I am saying down_read + going off CPU, should it happen in a
multithreaded program, is going to delay any down_write issued by other
threads. And that going off CPU here was clearly not intended.

As I noted in another e-mail this is just a side thing I spotted while
looking at other stuff. I don't find it important enough to discuss it
any further, so as far as I am concerned you are most welcome to take
either of the two patches, write your own, or leave the code be.

[I am going to post other stuff later which *I am* going to argue to
push for ;>]
Matthew Wilcox Aug. 20, 2023, 6:12 p.m. UTC | #8
On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
> Found by checking off-CPU time during kernel build (like so:
> "offcputime-bpfcc -Ku"), sample backtrace:
>     finish_task_switch.isra.0
>     __schedule
>     __cond_resched
>     lock_mm_and_find_vma
>     do_user_addr_fault
>     exc_page_fault
>     asm_exc_page_fault
>     -                sh (4502)

Now I'm awake, this backtrace really surprises me.  Do we not check
need_resched on entry?  It seems terribly unlikely that need_resched
gets set between entry and getting to this point, so I guess we must
not.

I suggest the version of the patch which puts might_sleep() before the
mmap_read_trylock() is the right one to apply.  It's basically what
we've done forever, except that now we'll be rescheduling without the
mmap lock held, which just seems like an overall win.
Mateusz Guzik Aug. 21, 2023, 1:13 a.m. UTC | #9
On Sun, Aug 20, 2023 at 07:12:16PM +0100, Matthew Wilcox wrote:
> On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
> > Found by checking off-CPU time during kernel build (like so:
> > "offcputime-bpfcc -Ku"), sample backtrace:
> >     finish_task_switch.isra.0
> >     __schedule
> >     __cond_resched
> >     lock_mm_and_find_vma
> >     do_user_addr_fault
> >     exc_page_fault
> >     asm_exc_page_fault
> >     -                sh (4502)
> 
> Now I'm awake, this backtrace really surprises me.  Do we not check
> need_resched on entry?  It seems terribly unlikely that need_resched
> gets set between entry and getting to this point, so I guess we must
> not.
> 
> I suggest the version of the patch which puts might_sleep() before the
> mmap_read_trylock() is the right one to apply.  It's basically what
> we've done forever, except that now we'll be rescheduling without the
> mmap lock held, which just seems like an overall win.
> 

I can't sleep, and your response made me curious: is that really safe
here?

As I wrote in another email, the routine is concerned with a case of the
kernel faulting on something it should not have. For a case like that I
find rescheduling to another thread to be most concerning.

That said I think I found a winner -- add need_resched() prior to
trylock.

This adds less work than you would have added with might_sleep (a func
call), still respects the preemption point, dodges exception table
checks in the common case, and does not switch away if there is
anything fishy going on.

Or just do that might_sleep.

I'm really buggering off the subject now.

====

mm: remove unintentional voluntary preemption in get_mmap_lock_carefully

Should the trylock succeed (and thus blocking was avoided), the routine
wants to ensure blocking was still legal to do. However, the might_sleep()
used ends up calling __cond_resched(), injecting a voluntary preemption
point with the freshly acquired lock held.

Use __might_sleep() instead with the lock held, but check need_resched()
prior to taking the lock.

Found by checking off-CPU time during kernel build (like so:
"offcputime-bpfcc -Ku"), sample backtrace:
    finish_task_switch.isra.0
    __schedule
    __cond_resched
    lock_mm_and_find_vma
    do_user_addr_fault
    exc_page_fault
    asm_exc_page_fault
    -                sh (4502)
        10

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
 mm/memory.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1ec1ef3418bf..6dac9dbb7b59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5258,8 +5258,8 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
 static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs)
 {
 	/* Even if this succeeds, make it clear we *might* have slept */
-	if (likely(mmap_read_trylock(mm))) {
-		might_sleep();
+	if (likely(!need_resched() && mmap_read_trylock(mm))) {
+		__might_sleep(__FILE__, __LINE__);
 		return true;
 	}
Matthew Wilcox Aug. 21, 2023, 3:58 a.m. UTC | #10
On Mon, Aug 21, 2023 at 03:13:03AM +0200, Mateusz Guzik wrote:
> On Sun, Aug 20, 2023 at 07:12:16PM +0100, Matthew Wilcox wrote:
> > On Sun, Aug 20, 2023 at 12:43:03PM +0200, Mateusz Guzik wrote:
> > > Found by checking off-CPU time during kernel build (like so:
> > > "offcputime-bpfcc -Ku"), sample backtrace:
> > >     finish_task_switch.isra.0
> > >     __schedule
> > >     __cond_resched
> > >     lock_mm_and_find_vma
> > >     do_user_addr_fault
> > >     exc_page_fault
> > >     asm_exc_page_fault
> > >     -                sh (4502)
> > 
> > Now I'm awake, this backtrace really surprises me.  Do we not check
> > need_resched on entry?  It seems terribly unlikely that need_resched
> > gets set between entry and getting to this point, so I guess we must
> > not.
> > 
> > I suggest the version of the patch which puts might_sleep() before the
> > mmap_read_trylock() is the right one to apply.  It's basically what
> > we've done forever, except that now we'll be rescheduling without the
> > mmap lock held, which just seems like an overall win.
> > 
> 
> I can't sleep, and your response made me curious: is that really safe
> here?
> 
> As I wrote in another email, the routine is concerned with a case of the
> kernel faulting on something it should not have. For a case like that I
> find rescheduling to another thread to be most concerning.

Hmm, initially I didn't see it, but you're concerned with something like:

        foo->bar = NULL;
        spin_lock(&foo->lock);
        printk("%d\n", foo->bar->baz);

And yeah, scheduling away in that case would be bad.

> That said I think I found a winner -- add need_resched() prior to
> trylock.
> 
> This adds less work than you would have added with might_sleep (a func
> call), still respects the preemption point, dodges exception table
> checks in the common case, and does not switch away if there is
> anything fishy going on.
> 
> Or just do that might_sleep.

The might_sleep() is clearly safe, but I thought of a different take on
the problem you've found, which is that we used to check need_resched
on _every_ page fault, because we used to take the mmap_lock on every
page fault.  Now we only check it on the minority of page faults which
can't be handled under the VMA lock.  But we can't just slam a
might_resched() into the start of the fault handler, because of the
problem you outlined above.

So how about something like this:

+++ b/arch/x86/mm/fault.c
@@ -1365,6 +1365,7 @@ void do_user_addr_fault(struct pt_regs *regs,
        if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
                vma_end_read(vma);

+       might_resched();
        if (!(fault & VM_FAULT_RETRY)) {
                count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
                goto done;

We found a VMA, so we know it isn't a NULL pointer dereference.  And we've
released the VMA lock at this point, so we won't be blocking anything from
making progress.  I'm not thrilled about having to replicate this in each
architecture, but I also don't love putting it in lock_vma_under_rcu()
(since someone might call that who actually can't schedule -- it certainly
wouldn't be obvious from the function name).

Then we can leave the might_sleep() exactly where it is in
get_mmap_lock_carefully(); it's really unlikely to trigger a reschedule.
Linus Torvalds Aug. 21, 2023, 4:55 a.m. UTC | #11
On Mon, 21 Aug 2023 at 05:58, Matthew Wilcox <willy@infradead.org> wrote:
>
> The might_sleep() is clearly safe, but I thought of a different take on
> the problem you've found, which is that we used to check need_resched
> on _every_ page fault, because we used to take the mmap_lock on every
> page fault.  Now we only check it on the minority of page faults which
> can't be handled under the VMA lock.  But we can't just slam a
> might_resched() into the start of the fault handler, because of the
> problem you outlined above.

Bah.

I decided that there is no way the might_sleep() can be the right
thing to do inside get_mmap_lock_carefully(), because the whole point
of that function existing is that we might have a kernel bug causing a
wild pointer access.

And that kernel bug would be about the subsequent oops, not the fact
that we might be sleeping in a bad context.

So I have just removed the existing might_sleep() entirely, because
both the warning it can generate _and_ the voluntary scheduling point
are bad things in that context.

I do think that maybe we should then re-introduce the might_sleep() in
some actually appropriate place in the page fault path, which might be
'handle_mm_fault()'.

But I think that's a separate - if related - issue to the whole "this
was always the wrong point for might_sleep()" issue that Mateusz
noticed.

We are generally much too timid about removing old debug checks that
don't really make sense.

               Linus
Linus Torvalds Aug. 21, 2023, 5:38 a.m. UTC | #12
On Mon, 21 Aug 2023 at 06:55, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I do think that maybe we should then re-introduce the might_sleep() in
> some actually appropriate place in the page fault path, which might be
> 'handle_mm_fault()'.

Just to clarify: if we do it in handle_mm_fault(), we should obviously
do it only for kernel faults - just to avoid the unnecessary
preemption with the lock held for the normal user space case.
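
IOW, roughly this in handle_mm_fault() (just a sketch, exact placement tbd):

	/* User faults reschedule on return to user space anyway */
	if (regs && !user_mode(regs))
		might_sleep();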

That said, I don't love the handle_mm_fault() option; I just think it
would be lovely if we had that test in a generic place. Sadly, we
don't seem to have any such thing other than handle_mm_fault().

The reason we shouldn't do this for user space faults is that not only
are they obviously always sleepable, but they also
reschedule on return to user space. So neither the debugging nor the
preemption makes sense there.

That's also why the per-vma locking paths don't need this - we only do
per-vma locking for user space faults.

So it really is only "kernel faults" it makes sense for, and they are
rare enough that I think the preemption point issue might be moot. But
it might still be better to just do this all as a special kernel fault
case in the exception handler - even if it's then
architecture-specific (like it always was before commit c2508ec5a58d).

              Linus
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 1ec1ef3418bf..f31d5243272b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5257,12 +5257,6 @@  EXPORT_SYMBOL_GPL(handle_mm_fault);
 
 static inline bool get_mmap_lock_carefully(struct mm_struct *mm, struct pt_regs *regs)
 {
-	/* Even if this succeeds, make it clear we *might* have slept */
-	if (likely(mmap_read_trylock(mm))) {
-		might_sleep();
-		return true;
-	}
-
 	if (regs && !user_mode(regs)) {
 		unsigned long ip = instruction_pointer(regs);
 		if (!search_exception_tables(ip))