diff mbox series

[v5,4/8] uprobes: travers uprobe's consumer list locklessly under SRCU protection

Message ID 20240903174603.3554182-5-andrii@kernel.org (mailing list archive)
State New
Headers show
Series uprobes: RCU-protected hot path optimizations | expand

Commit Message

Andrii Nakryiko Sept. 3, 2024, 5:45 p.m. UTC
uprobe->register_rwsem is one of a few big bottlenecks to scalability of
uprobes, so we need to get rid of it to improve uprobe performance and
multi-CPU scalability.

First, we turn uprobe's consumer list to a typical doubly-linked list
and utilize existing RCU-aware helpers for traversing such lists, as
well as adding and removing elements from it.

For entry uprobes we already have SRCU protection active since before
uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
won't go away from under us, but we add SRCU protection around consumer
list traversal.

Lastly, to keep handler_chain()'s UPROBE_HANDLER_REMOVE handling simple,
we remember whether any removal was requested during handler calls, but
then we double-check the decision under a proper register_rwsem using
consumers' filter callbacks. Handler removal is very rare, so this extra
lock won't hurt performance, overall, but we also avoid the need for any
extra protection (e.g., seqcount locks).

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/linux/uprobes.h |  10 +++-
 kernel/events/uprobes.c | 104 +++++++++++++++++++++++-----------------
 2 files changed, 70 insertions(+), 44 deletions(-)

Comments

Breno Leitao Nov. 6, 2024, 12:03 p.m. UTC | #1
Hello Andrii,

On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> uprobes, so we need to get rid of it to improve uprobe performance and
> multi-CPU scalability.
> 
> First, we turn uprobe's consumer list to a typical doubly-linked list
> and utilize existing RCU-aware helpers for traversing such lists, as
> well as adding and removing elements from it.
> 
> For entry uprobes we already have SRCU protection active since before
> uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> won't go away from under us, but we add SRCU protection around consumer
> list traversal.

I am seeing the following message in a kernel with RCU_PROVE_LOCKING:

	kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!

It seems the SRCU is not held, when coming from mmap_region ->
uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
not decoding it, but, the stack trace is clear enough).

         WARNING: suspicious RCU usage
           6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
           -----------------------------
           kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
           3 locks held by env/441330:
            #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
            #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
            #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8

stack backtrace:
           CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
           Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
           Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
           Call trace:
            dump_backtrace+0x10c/0x198
            show_stack+0x24/0x38
            __dump_stack+0x28/0x38
            dump_stack_lvl+0x74/0xa8
            dump_stack+0x18/0x28
            lockdep_rcu_suspicious+0x178/0x2c8
            filter_chain+0xdc/0xe8
            uprobe_mmap+0x2e0/0x548
            mmap_region+0x510/0x988
            do_mmap+0x444/0x528
            vm_mmap_pgoff+0xf8/0x1d0
            ksys_mmap_pgoff+0x184/0x2d8


That said, it seems we want to hold the SRCU, before reaching the
filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
solves the problem, but, I might be missing something, since I am not familiar
with this code.

How does the following patch look like?

commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Nov 4 06:53:31 2024 -0800

    uprobes: Get SRCU lock before traverseing the list

    list_for_each_entry_srcu() is being called without holding the lock,
    which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
    as:

            kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!

    Get the SRCU uprobes_srcu lock before calling filter_chain(), which
    needs to have the SRCU lock hold, since it is going to call
    list_for_each_entry_srcu().

    Signed-off-by: Breno Leitao <leitao@debian.org>
    Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	struct list_head tmp_list;
 	struct uprobe *uprobe, *u;
 	struct inode *inode;
+	int srcu_idx;

 	if (no_uprobe_events())
 		return 0;
@@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)

 	mutex_lock(uprobes_mmap_hash(inode));
 	build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
 	/*
 	 * We can race with uprobe_unregister(), this uprobe can be already
 	 * removed. But in this case filter_chain() must return false, all
@@ -1422,6 +1424,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
 		}
 		put_uprobe(uprobe);
 	}
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
 	mutex_unlock(uprobes_mmap_hash(inode));

 	return 0;
Andrii Nakryiko Nov. 6, 2024, 4:25 p.m. UTC | #2
On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Andrii,
>
> On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > uprobes, so we need to get rid of it to improve uprobe performance and
> > multi-CPU scalability.
> >
> > First, we turn uprobe's consumer list to a typical doubly-linked list
> > and utilize existing RCU-aware helpers for traversing such lists, as
> > well as adding and removing elements from it.
> >
> > For entry uprobes we already have SRCU protection active since before
> > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > won't go away from under us, but we add SRCU protection around consumer
> > list traversal.
>
> I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
>
>         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
>
> It seems the SRCU is not held, when coming from mmap_region ->
> uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> not decoding it, but, the stack trace is clear enough).
>
>          WARNING: suspicious RCU usage
>            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
>            -----------------------------
>            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
>            3 locks held by env/441330:
>             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
>             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
>             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
>
> stack backtrace:
>            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
>            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
>            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
>            Call trace:
>             dump_backtrace+0x10c/0x198
>             show_stack+0x24/0x38
>             __dump_stack+0x28/0x38
>             dump_stack_lvl+0x74/0xa8
>             dump_stack+0x18/0x28
>             lockdep_rcu_suspicious+0x178/0x2c8
>             filter_chain+0xdc/0xe8
>             uprobe_mmap+0x2e0/0x548
>             mmap_region+0x510/0x988
>             do_mmap+0x444/0x528
>             vm_mmap_pgoff+0xf8/0x1d0
>             ksys_mmap_pgoff+0x184/0x2d8
>
>
> That said, it seems we want to hold the SRCU, before reaching the
> filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> solves the problem, but, I might be missing something, since I am not familiar
> with this code.
>
> How does the following patch look like?
>
> commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> Author: Breno Leitao <leitao@debian.org>
> Date:   Mon Nov 4 06:53:31 2024 -0800
>
>     uprobes: Get SRCU lock before traverseing the list
>
>     list_for_each_entry_srcu() is being called without holding the lock,
>     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
>     as:
>
>             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
>
>     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
>     needs to have the SRCU lock hold, since it is going to call
>     list_for_each_entry_srcu().
>
>     Signed-off-by: Breno Leitao <leitao@debian.org>
>     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>         struct list_head tmp_list;
>         struct uprobe *uprobe, *u;
>         struct inode *inode;
> +       int srcu_idx;
>
>         if (no_uprobe_events())
>                 return 0;
> @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>
>         mutex_lock(uprobes_mmap_hash(inode));
>         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> +       srcu_idx = srcu_read_lock(&uprobes_srcu);

Hey Breno,

Thanks for catching that (production testing FTW, right?!).

But I think you a) adding wrong RCU protection flavor (it has to be
rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
an example) and b) I think this is the wrong place to add it. We
should add it inside filter_chain(). filter_chain() is called from
three places, only one of which is already RCU protected (that's the
handler_chain() case). But there is also register_for_each_vma(),
which needs RCU protection as well.

So can you resend the patch as a stand-alone patch, switch to RCU
Tasks Trace flavor, and add the protection inside filter_chain()?
Thank you!

P.S. pending_list traversal that you (accidentally) protect as well in
your patch doesn't need RCU protection, so there is no problem with
moving into filter_chain() for RCU stuff.

>         /*
>          * We can race with uprobe_unregister(), this uprobe can be already
>          * removed. But in this case filter_chain() must return false, all
> @@ -1422,6 +1424,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
>                 }
>                 put_uprobe(uprobe);
>         }
> +       srcu_read_unlock(&uprobes_srcu, srcu_idx);
>         mutex_unlock(uprobes_mmap_hash(inode));
>
>         return 0;
>
Breno Leitao Nov. 7, 2024, 11:35 a.m. UTC | #3
Hello Andrii,

On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > multi-CPU scalability.
> > >
> > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > well as adding and removing elements from it.
> > >
> > > For entry uprobes we already have SRCU protection active since before
> > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > won't go away from under us, but we add SRCU protection around consumer
> > > list traversal.
> >
> > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> >
> >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> >
> > It seems the SRCU is not held, when coming from mmap_region ->
> > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > not decoding it, but, the stack trace is clear enough).
> >
> >          WARNING: suspicious RCU usage
> >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> >            -----------------------------
> >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> >
> > other info that might help us debug this:
> >
> > rcu_scheduler_active = 2, debug_locks = 1
> >            3 locks held by env/441330:
> >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> >
> > stack backtrace:
> >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> >            Call trace:
> >             dump_backtrace+0x10c/0x198
> >             show_stack+0x24/0x38
> >             __dump_stack+0x28/0x38
> >             dump_stack_lvl+0x74/0xa8
> >             dump_stack+0x18/0x28
> >             lockdep_rcu_suspicious+0x178/0x2c8
> >             filter_chain+0xdc/0xe8
> >             uprobe_mmap+0x2e0/0x548
> >             mmap_region+0x510/0x988
> >             do_mmap+0x444/0x528
> >             vm_mmap_pgoff+0xf8/0x1d0
> >             ksys_mmap_pgoff+0x184/0x2d8
> >
> >
> > That said, it seems we want to hold the SRCU, before reaching the
> > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > solves the problem, but, I might be missing something, since I am not familiar
> > with this code.
> >
> > How does the following patch look like?
> >
> > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > Author: Breno Leitao <leitao@debian.org>
> > Date:   Mon Nov 4 06:53:31 2024 -0800
> >
> >     uprobes: Get SRCU lock before traverseing the list
> >
> >     list_for_each_entry_srcu() is being called without holding the lock,
> >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> >     as:
> >
> >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> >
> >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> >     needs to have the SRCU lock hold, since it is going to call
> >     list_for_each_entry_srcu().
> >
> >     Signed-off-by: Breno Leitao <leitao@debian.org>
> >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> >         struct list_head tmp_list;
> >         struct uprobe *uprobe, *u;
> >         struct inode *inode;
> > +       int srcu_idx;
> >
> >         if (no_uprobe_events())
> >                 return 0;
> > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> >
> >         mutex_lock(uprobes_mmap_hash(inode));
> >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> 
> Thanks for catching that (production testing FTW, right?!).

Correct. I am running some hosts with RCU_PROVING and I am finding some
cases where RCU protected areas are touched without holding the RCU read
lock.

> But I think you a) adding wrong RCU protection flavor (it has to be
> rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> an example) and b) I think this is the wrong place to add it. We
> should add it inside filter_chain(). filter_chain() is called from
> three places, only one of which is already RCU protected (that's the
> handler_chain() case). But there is also register_for_each_vma(),
> which needs RCU protection as well.

Thanks for the guidance!

My initial plan was to protect filter_chain(), but, handler_chain()
already has the lock. Is it OK to get into a critical section in a
nested form?

The code will be something like:

handle_swbp() {
	rcu_read_lock_trace();
	handler_chain() {
		filter_chain() {
			rcu_read_lock_trace();
			list_for_each_entry_rcu()
			rcu_read_lock_trace();
		}
	}
	rcu_read_lock_trace();
}

Is this nested locking fine?

Thanks
--breno
Andrii Nakryiko Nov. 7, 2024, 4:01 p.m. UTC | #4
On Thu, Nov 7, 2024 at 3:35 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Andrii,
>
> On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> > On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > > multi-CPU scalability.
> > > >
> > > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > > well as adding and removing elements from it.
> > > >
> > > > For entry uprobes we already have SRCU protection active since before
> > > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > > won't go away from under us, but we add SRCU protection around consumer
> > > > list traversal.
> > >
> > > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> > >
> > >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > >
> > > It seems the SRCU is not held, when coming from mmap_region ->
> > > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > > not decoding it, but, the stack trace is clear enough).
> > >
> > >          WARNING: suspicious RCU usage
> > >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> > >            -----------------------------
> > >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> > >
> > > other info that might help us debug this:
> > >
> > > rcu_scheduler_active = 2, debug_locks = 1
> > >            3 locks held by env/441330:
> > >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> > >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> > >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> > >
> > > stack backtrace:
> > >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> > >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> > >            Call trace:
> > >             dump_backtrace+0x10c/0x198
> > >             show_stack+0x24/0x38
> > >             __dump_stack+0x28/0x38
> > >             dump_stack_lvl+0x74/0xa8
> > >             dump_stack+0x18/0x28
> > >             lockdep_rcu_suspicious+0x178/0x2c8
> > >             filter_chain+0xdc/0xe8
> > >             uprobe_mmap+0x2e0/0x548
> > >             mmap_region+0x510/0x988
> > >             do_mmap+0x444/0x528
> > >             vm_mmap_pgoff+0xf8/0x1d0
> > >             ksys_mmap_pgoff+0x184/0x2d8
> > >
> > >
> > > That said, it seems we want to hold the SRCU, before reaching the
> > > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > > solves the problem, but, I might be missing something, since I am not familiar
> > > with this code.
> > >
> > > How does the following patch look like?
> > >
> > > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > > Author: Breno Leitao <leitao@debian.org>
> > > Date:   Mon Nov 4 06:53:31 2024 -0800
> > >
> > >     uprobes: Get SRCU lock before traverseing the list
> > >
> > >     list_for_each_entry_srcu() is being called without holding the lock,
> > >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> > >     as:
> > >
> > >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > >
> > >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> > >     needs to have the SRCU lock hold, since it is going to call
> > >     list_for_each_entry_srcu().
> > >
> > >     Signed-off-by: Breno Leitao <leitao@debian.org>
> > >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > >         struct list_head tmp_list;
> > >         struct uprobe *uprobe, *u;
> > >         struct inode *inode;
> > > +       int srcu_idx;
> > >
> > >         if (no_uprobe_events())
> > >                 return 0;
> > > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > >
> > >         mutex_lock(uprobes_mmap_hash(inode));
> > >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> >
> > Thanks for catching that (production testing FTW, right?!).
>
> Correct. I am running some hosts with RCU_PROVING and I am finding some
> cases where RCU protected areas are touched without holding the RCU read
> lock.
>
> > But I think you a) adding wrong RCU protection flavor (it has to be
> > rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> > an example) and b) I think this is the wrong place to add it. We
> > should add it inside filter_chain(). filter_chain() is called from
> > three places, only one of which is already RCU protected (that's the
> > handler_chain() case). But there is also register_for_each_vma(),
> > which needs RCU protection as well.
>
> Thanks for the guidance!
>
> My initial plan was to protect filter_chain(), but, handler_chain()
> already has the lock. Is it OK to get into a critical section in a
> nested form?
>
> The code will be something like:
>
> handle_swbp() {
>         rcu_read_lock_trace();
>         handler_chain() {
>                 filter_chain() {
>                         rcu_read_lock_trace();
>                         list_for_each_entry_rcu()
>                         rcu_read_lock_trace();
>                 }
>         }
>         rcu_read_lock_trace();
> }
>
> Is this nested locking fine?
>

Yes, it's totally fine to nest RCU lock regions.

> Thanks
> --breno
Paul E. McKenney Nov. 7, 2024, 4:13 p.m. UTC | #5
oN tHU, nov 07, 2024 at 08:01:05AM -0800, Andrii Nakryiko wrote:
> On Thu, Nov 7, 2024 at 3:35 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Andrii,
> >
> > On Wed, Nov 06, 2024 at 08:25:25AM -0800, Andrii Nakryiko wrote:
> > > On Wed, Nov 6, 2024 at 4:03 AM Breno Leitao <leitao@debian.org> wrote:
> > > > On Tue, Sep 03, 2024 at 10:45:59AM -0700, Andrii Nakryiko wrote:
> > > > > uprobe->register_rwsem is one of a few big bottlenecks to scalability of
> > > > > uprobes, so we need to get rid of it to improve uprobe performance and
> > > > > multi-CPU scalability.
> > > > >
> > > > > First, we turn uprobe's consumer list to a typical doubly-linked list
> > > > > and utilize existing RCU-aware helpers for traversing such lists, as
> > > > > well as adding and removing elements from it.
> > > > >
> > > > > For entry uprobes we already have SRCU protection active since before
> > > > > uprobe lookup. For uretprobe we keep refcount, guaranteeing that uprobe
> > > > > won't go away from under us, but we add SRCU protection around consumer
> > > > > list traversal.
> > > >
> > > > I am seeing the following message in a kernel with RCU_PROVE_LOCKING:
> > > >
> > > >         kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > > >
> > > > It seems the SRCU is not held, when coming from mmap_region ->
> > > > uprobe_mmap. Here is the message I got in my debug kernel. (sorry for
> > > > not decoding it, but, the stack trace is clear enough).
> > > >
> > > >          WARNING: suspicious RCU usage
> > > >            6.12.0-rc5-kbuilder-01152-gc688a96c432e #26 Tainted: G        W   E    N
> > > >            -----------------------------
> > > >            kernel/events/uprobes.c:938 RCU-list traversed without holding the required lock!!
> > > >
> > > > other info that might help us debug this:
> > > >
> > > > rcu_scheduler_active = 2, debug_locks = 1
> > > >            3 locks held by env/441330:
> > > >             #0: ffff00021c1bc508 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x1d0
> > > >             #1: ffff800089f3ab48 (&uprobes_mmap_mutex[i]){+.+.}-{3:3}, at: uprobe_mmap+0x20c/0x548
> > > >             #2: ffff0004e564c528 (&uprobe->consumer_rwsem){++++}-{3:3}, at: filter_chain+0x30/0xe8
> > > >
> > > > stack backtrace:
> > > >            CPU: 4 UID: 34133 PID: 441330 Comm: env Kdump: loaded Tainted: G        W   E    N 6.12.0-rc5-kbuilder-01152-gc688a96c432e #26
> > > >            Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
> > > >            Hardware name: Quanta S7GM 20S7GCU0010/S7G MB (CG1), BIOS 3D22 07/03/2024
> > > >            Call trace:
> > > >             dump_backtrace+0x10c/0x198
> > > >             show_stack+0x24/0x38
> > > >             __dump_stack+0x28/0x38
> > > >             dump_stack_lvl+0x74/0xa8
> > > >             dump_stack+0x18/0x28
> > > >             lockdep_rcu_suspicious+0x178/0x2c8
> > > >             filter_chain+0xdc/0xe8
> > > >             uprobe_mmap+0x2e0/0x548
> > > >             mmap_region+0x510/0x988
> > > >             do_mmap+0x444/0x528
> > > >             vm_mmap_pgoff+0xf8/0x1d0
> > > >             ksys_mmap_pgoff+0x184/0x2d8
> > > >
> > > >
> > > > That said, it seems we want to hold the SRCU, before reaching the
> > > > filter_chain(). I hacked a bit, and adding the lock in uprobe_mmap()
> > > > solves the problem, but, I might be missing something, since I am not familiar
> > > > with this code.
> > > >
> > > > How does the following patch look like?
> > > >
> > > > commit 1bd7bcf03031ceca86fdddd8be2e5500497db29f
> > > > Author: Breno Leitao <leitao@debian.org>
> > > > Date:   Mon Nov 4 06:53:31 2024 -0800
> > > >
> > > >     uprobes: Get SRCU lock before traverseing the list
> > > >
> > > >     list_for_each_entry_srcu() is being called without holding the lock,
> > > >     which causes LOCKDEP (when enabled with RCU_PROVING) to complain such
> > > >     as:
> > > >
> > > >             kernel/events/uprobes.c:937 RCU-list traversed without holding the required lock!!
> > > >
> > > >     Get the SRCU uprobes_srcu lock before calling filter_chain(), which
> > > >     needs to have the SRCU lock hold, since it is going to call
> > > >     list_for_each_entry_srcu().
> > > >
> > > >     Signed-off-by: Breno Leitao <leitao@debian.org>
> > > >     Fixes: cc01bd044e6a ("uprobes: travers uprobe's consumer list locklessly under SRCU protection")
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 4b52cb2ae6d62..cc9d4ddeea9a6 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -1391,6 +1391,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > > >         struct list_head tmp_list;
> > > >         struct uprobe *uprobe, *u;
> > > >         struct inode *inode;
> > > > +       int srcu_idx;
> > > >
> > > >         if (no_uprobe_events())
> > > >                 return 0;
> > > > @@ -1409,6 +1410,7 @@ int uprobe_mmap(struct vm_area_struct *vma)
> > > >
> > > >         mutex_lock(uprobes_mmap_hash(inode));
> > > >         build_probe_list(inode, vma, vma->vm_start, vma->vm_end, &tmp_list);
> > > > +       srcu_idx = srcu_read_lock(&uprobes_srcu);
> > >
> > > Thanks for catching that (production testing FTW, right?!).
> >
> > Correct. I am running some hosts with RCU_PROVING and I am finding some
> > cases where RCU protected areas are touched without holding the RCU read
> > lock.
> >
> > > But I think you a) adding wrong RCU protection flavor (it has to be
> > > rcu_read_lock_trace()/rcu_read_unlock_trace(), see uprobe_apply() for
> > > an example) and b) I think this is the wrong place to add it. We
> > > should add it inside filter_chain(). filter_chain() is called from
> > > three places, only one of which is already RCU protected (that's the
> > > handler_chain() case). But there is also register_for_each_vma(),
> > > which needs RCU protection as well.
> >
> > Thanks for the guidance!
> >
> > My initial plan was to protect filter_chain(), but, handler_chain()
> > already has the lock. Is it OK to get into a critical section in a
> > nested form?
> >
> > The code will be something like:
> >
> > handle_swbp() {
> >         rcu_read_lock_trace();
> >         handler_chain() {
> >                 filter_chain() {
> >                         rcu_read_lock_trace();
> >                         list_for_each_entry_rcu()
> >                         rcu_read_lock_trace();
> >                 }
> >         }
> >         rcu_read_lock_trace();
> > }
> >
> > Is this nested locking fine?
> 
> Yes, it's totally fine to nest RCU lock regions.

As long as you don't nest them more than 255 deep in CONFIG_PREEMPT=n
kernels that also have CONFIG_PREEMPT_COUNT=y, or more than 2G deep in
CONFIG_PREEMPT=y kernels.  For a limited time only, in CONFIG_PREEMPT=n
kernels that also have CONFIG_PREEMPT_COUNT=n, you can nest as deeply
as you want.  ;-)

Sorry, couldn't resist...

							Thanx, Paul
diff mbox series

Patch

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 9cf0dce62e4c..2785d1bedb74 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,13 +29,21 @@  struct page;
 #define MAX_URETPROBE_DEPTH		64
 
 struct uprobe_consumer {
+	/*
+	 * handler() can return UPROBE_HANDLER_REMOVE to signal the need to
+	 * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is
+	 * returned, filter() callback has to be implemented as well and it
+	 * should return false to "confirm" the decision to uninstall uprobe
+	 * for the current process. If filter() is omitted or returns true,
+	 * UPROBE_HANDLER_REMOVE is effectively ignored.
+	 */
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	int (*ret_handler)(struct uprobe_consumer *self,
 				unsigned long func,
 				struct pt_regs *regs);
 	bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
 
-	struct uprobe_consumer *next;
+	struct list_head cons_node;
 };
 
 #ifdef CONFIG_UPROBES
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 8bdcdc6901b2..97e58d160647 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -59,7 +59,7 @@  struct uprobe {
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
-	struct uprobe_consumer	*consumers;
+	struct list_head	consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	struct rcu_head		rcu;
 	loff_t			offset;
@@ -783,6 +783,7 @@  static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->inode = inode;
 	uprobe->offset = offset;
 	uprobe->ref_ctr_offset = ref_ctr_offset;
+	INIT_LIST_HEAD(&uprobe->consumers);
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
 	RB_CLEAR_NODE(&uprobe->rb_node);
@@ -808,32 +809,19 @@  static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	down_write(&uprobe->consumer_rwsem);
-	uc->next = uprobe->consumers;
-	uprobe->consumers = uc;
+	list_add_rcu(&uc->cons_node, &uprobe->consumers);
 	up_write(&uprobe->consumer_rwsem);
 }
 
 /*
  * For uprobe @uprobe, delete the consumer @uc.
- * Return true if the @uc is deleted successfully
- * or return false.
+ * Should never be called with consumer that's not part of @uprobe->consumers.
  */
-static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
+static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
-	struct uprobe_consumer **con;
-	bool ret = false;
-
 	down_write(&uprobe->consumer_rwsem);
-	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
-		if (*con == uc) {
-			*con = uc->next;
-			ret = true;
-			break;
-		}
-	}
+	list_del_rcu(&uc->cons_node);
 	up_write(&uprobe->consumer_rwsem);
-
-	return ret;
 }
 
 static int __copy_insn(struct address_space *mapping, struct file *filp,
@@ -929,7 +917,8 @@  static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 	bool ret = false;
 
 	down_read(&uprobe->consumer_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		ret = consumer_filter(uc, mm);
 		if (ret)
 			break;
@@ -1125,18 +1114,29 @@  void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc))) {
-		err = -ENOENT;
-	} else {
-		err = register_for_each_vma(uprobe, NULL);
-		/* TODO : cant unregister? schedule a worker thread */
-		if (unlikely(err))
-			uprobe_warn(current, "unregister, leaking uprobe");
-	}
+	consumer_del(uprobe, uc);
+	err = register_for_each_vma(uprobe, NULL);
 	up_write(&uprobe->register_rwsem);
 
-	if (!err)
-		put_uprobe(uprobe);
+	/* TODO : cant unregister? schedule a worker thread */
+	if (unlikely(err)) {
+		uprobe_warn(current, "unregister, leaking uprobe");
+		goto out_sync;
+	}
+
+	put_uprobe(uprobe);
+
+out_sync:
+	/*
+	 * Now that handler_chain() and handle_uretprobe_chain() iterate over
+	 * uprobe->consumers list under RCU protection without holding
+	 * uprobe->register_rwsem, we need to wait for RCU grace period to
+	 * make sure that we can't call into just unregistered
+	 * uprobe_consumer's callbacks anymore. If we don't do that, fast and
+	 * unlucky enough caller can free consumer's memory and cause
+	 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
+	 */
+	synchronize_srcu(&uprobes_srcu);
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister);
 
@@ -1214,13 +1214,20 @@  EXPORT_SYMBOL_GPL(uprobe_register);
 int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
 {
 	struct uprobe_consumer *con;
-	int ret = -ENOENT;
+	int ret = -ENOENT, srcu_idx;
 
 	down_write(&uprobe->register_rwsem);
-	for (con = uprobe->consumers; con && con != uc ; con = con->next)
-		;
-	if (con)
-		ret = register_for_each_vma(uprobe, add ? uc : NULL);
+
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
+		if (con == uc) {
+			ret = register_for_each_vma(uprobe, add ? uc : NULL);
+			break;
+		}
+	}
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+
 	up_write(&uprobe->register_rwsem);
 
 	return ret;
@@ -2085,10 +2092,12 @@  static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 	struct uprobe_consumer *uc;
 	int remove = UPROBE_HANDLER_REMOVE;
 	bool need_prep = false; /* prepare return uprobe, when needed */
+	bool has_consumers = false;
 
-	down_read(&uprobe->register_rwsem);
 	current->utask->auprobe = &uprobe->arch;
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2101,17 +2110,24 @@  static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 			need_prep = true;
 
 		remove &= rc;
+		has_consumers = true;
 	}
 	current->utask->auprobe = NULL;
 
 	if (need_prep && !remove)
 		prepare_uretprobe(uprobe, regs); /* put bp at return */
 
-	if (remove && uprobe->consumers) {
-		WARN_ON(!uprobe_is_active(uprobe));
-		unapply_uprobe(uprobe, current->mm);
+	if (remove && has_consumers) {
+		down_read(&uprobe->register_rwsem);
+
+		/* re-check that removal is still required, this time under lock */
+		if (!filter_chain(uprobe, current->mm)) {
+			WARN_ON(!uprobe_is_active(uprobe));
+			unapply_uprobe(uprobe, current->mm);
+		}
+
+		up_read(&uprobe->register_rwsem);
 	}
-	up_read(&uprobe->register_rwsem);
 }
 
 static void
@@ -2119,13 +2135,15 @@  handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 {
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
+	int srcu_idx;
 
-	down_read(&uprobe->register_rwsem);
-	for (uc = uprobe->consumers; uc; uc = uc->next) {
+	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
+				 srcu_read_lock_held(&uprobes_srcu)) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	up_read(&uprobe->register_rwsem);
+	srcu_read_unlock(&uprobes_srcu, srcu_idx);
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)