
[mm/bpf,v2] mm: bpf: add find_vma_no_check() without lockdep_assert on mm->mmap_lock

Message ID 20210908044427.3632119-1-yhs@fb.com (mailing list archive)
State New

Commit Message

Yonghong Song Sept. 8, 2021, 4:44 a.m. UTC
Current bpf-next bpf selftest "get_stack_raw_tp" triggered the warning:

  [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0
  [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod]
  [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G        W  O      5.14.0+ #53
  [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0
  [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8
  [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246
  [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000
  [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
  [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe
  [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0
  [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0
  [ 1411.304501] FS:  00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000
  [ 1411.304504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0
  [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1411.304512] Call Trace:
  [ 1411.304517]  stack_map_get_build_id_offset+0x17c/0x260
  [ 1411.304528]  __bpf_get_stack+0x18f/0x230
  [ 1411.304541]  bpf_get_stack_raw_tp+0x5a/0x70
  [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000
  [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
  [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04
  [ 1411.306978]  bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c
  [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0
  [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108
  [ 1411.309074]  bpf_trace_run2+0x8f/0x1a0
  [ 1411.309891] FS:  00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000
  [ 1411.309896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1411.311221]  syscall_trace_enter.isra.20+0x161/0x1f0
  [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0
  [ 1411.312291]  do_syscall_64+0x15/0x80
  [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1411.313803]  entry_SYSCALL_64_after_hwframe+0x44/0xae
  [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1411.315082] RIP: 0033:0x7f00ad80a0e0
  [ 1411.315626] Call Trace:
  [ 1411.315632]  stack_map_get_build_id_offset+0x17c/0x260

To reproduce, first build the `test_progs` binary:
  make -C tools/testing/selftests/bpf -j60
and then run the binary from the tools/testing/selftests/bpf directory:
  ./test_progs -t get_stack_raw_tp

The warning is due to commit 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()"),
which added an mmap_assert_locked() call to find_vma(). The mmap_assert_locked() function
asserts that mm->mmap_lock is held. But this is not the case for the
bpf_get_stack() and bpf_get_stackid() helpers (kernel/bpf/stackmap.c), which
use mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
in the bpf_get_stack[id]() use case, the above warning is emitted during the test run.

This patch adds a function, find_vma_no_check(), which omits the mmap_assert_locked() call,
and switches the bpf_get_stack[id]() helpers to call it instead. This resolves the above warning.

I did not use the __find_vma() name because it is already used in drivers/gpu/drm/i915/i915_gpu_error.c:
    static struct i915_vma_coredump *
    __find_vma(struct i915_vma_coredump *vma, const char *name) { ... }

Cc: Luigi Rizzo <lrizzo@google.com>
Fixes: 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 include/linux/mm.h    | 2 ++
 kernel/bpf/stackmap.c | 2 +-
 mm/mmap.c             | 9 +++++++--
 3 files changed, 10 insertions(+), 3 deletions(-)

Comments

Daniel Borkmann Sept. 8, 2021, 12:20 p.m. UTC | #1
On 9/8/21 6:44 AM, Yonghong Song wrote:
> Current bpf-next bpf selftest "get_stack_raw_tp" triggered the warning:
> 
>    [ 1411.304463] WARNING: CPU: 3 PID: 140 at include/linux/mmap_lock.h:164 find_vma+0x47/0xa0
>    [ 1411.304469] Modules linked in: bpf_testmod(O) [last unloaded: bpf_testmod]
>    [ 1411.304476] CPU: 3 PID: 140 Comm: systemd-journal Tainted: G        W  O      5.14.0+ #53
>    [ 1411.304479] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>    [ 1411.304481] RIP: 0010:find_vma+0x47/0xa0
>    [ 1411.304484] Code: de 48 89 ef e8 ba f5 fe ff 48 85 c0 74 2e 48 83 c4 08 5b 5d c3 48 8d bf 28 01 00 00 be ff ff ff ff e8 2d 9f d8 00 85 c0 75 d4 <0f> 0b 48 89 de 48 8
>    [ 1411.304487] RSP: 0018:ffffabd440403db8 EFLAGS: 00010246
>    [ 1411.304490] RAX: 0000000000000000 RBX: 00007f00ad80a0e0 RCX: 0000000000000000
>    [ 1411.304492] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
>    [ 1411.304494] RBP: ffff9cf5c2f50000 R08: ffff9cf5c3eb25d8 R09: 00000000fffffffe
>    [ 1411.304496] R10: 0000000000000001 R11: 00000000ef974e19 R12: ffff9cf5c39ae0e0
>    [ 1411.304498] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9cf5c39ae0e0
>    [ 1411.304501] FS:  00007f00ae754780(0000) GS:ffff9cf5fba00000(0000) knlGS:0000000000000000
>    [ 1411.304504] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [ 1411.304506] CR2: 000000003e34343c CR3: 0000000103a98005 CR4: 0000000000370ee0
>    [ 1411.304508] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [ 1411.304510] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [ 1411.304512] Call Trace:
>    [ 1411.304517]  stack_map_get_build_id_offset+0x17c/0x260
>    [ 1411.304528]  __bpf_get_stack+0x18f/0x230
>    [ 1411.304541]  bpf_get_stack_raw_tp+0x5a/0x70
>    [ 1411.305752] RAX: 0000000000000000 RBX: 5541f689495641d7 RCX: 0000000000000000
>    [ 1411.305756] RDX: 0000000000000001 RSI: ffffffff9776b144 RDI: ffffffff977e1b0e
>    [ 1411.305758] RBP: ffff9cf5c02b2f40 R08: ffff9cf5ca7606c0 R09: ffffcbd43ee02c04
>    [ 1411.306978]  bpf_prog_32007c34f7726d29_bpf_prog1+0xaf/0xd9c
>    [ 1411.307861] R10: 0000000000000001 R11: 0000000000000044 R12: ffff9cf5c2ef60e0
>    [ 1411.307865] R13: 0000000000000005 R14: 0000000000000000 R15: ffff9cf5c2ef6108
>    [ 1411.309074]  bpf_trace_run2+0x8f/0x1a0
>    [ 1411.309891] FS:  00007ff485141700(0000) GS:ffff9cf5fae00000(0000) knlGS:0000000000000000
>    [ 1411.309896] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>    [ 1411.311221]  syscall_trace_enter.isra.20+0x161/0x1f0
>    [ 1411.311600] CR2: 00007ff48514d90e CR3: 0000000107114001 CR4: 0000000000370ef0
>    [ 1411.312291]  do_syscall_64+0x15/0x80
>    [ 1411.312941] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>    [ 1411.313803]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>    [ 1411.314223] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>    [ 1411.315082] RIP: 0033:0x7f00ad80a0e0
>    [ 1411.315626] Call Trace:
>    [ 1411.315632]  stack_map_get_build_id_offset+0x17c/0x260
> 
> To reproduce, first build `test_progs` binary:
>    make -C tools/testing/selftests/bpf -j60
> and then run the binary at tools/testing/selftests/bpf directory:
>    ./test_progs -t get_stack_raw_tp
> 
> The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> asserts that mm->mmap_lock needs to be held. But this is not the case for
> bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> 
> This patch added function find_vma_no_check() which does not have mmap_assert_locked() call and
> bpf_get_stack[id]() helpers call find_vma_no_check() instead. This resolved the above warning.
> 
> I didn't use __find_vma() name because it has been used in drivers/gpu/drm/i915/i915_gpu_error.c:
>      static struct i915_vma_coredump *
>      __find_vma(struct i915_vma_coredump *vma, const char *name) { ... }
> 
> Cc: Luigi Rizzo <lrizzo@google.com>
> Fixes: 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> Signed-off-by: Yonghong Song <yhs@fb.com>

Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to routing the
fix with your ACKs via the bpf tree to Linus (or is there a strong preference for -mm fixes)?

Thanks,
Daniel

> ---
>   include/linux/mm.h    | 2 ++
>   kernel/bpf/stackmap.c | 2 +-
>   mm/mmap.c             | 9 +++++++--
>   3 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 50e2c2914ac2..ba7a11d900f5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2669,6 +2669,8 @@ extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
>   #endif
>   
>   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> +extern struct vm_area_struct * find_vma_no_check(struct mm_struct * mm,
> +						 unsigned long addr);
>   extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
>   extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
>   					     struct vm_area_struct **pprev);
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index e8eefdf8cf3e..838a2c141497 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -190,7 +190,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>   	}
>   
>   	for (i = 0; i < trace_nr; i++) {
> -		vma = find_vma(current->mm, ips[i]);
> +		vma = find_vma_no_check(current->mm, ips[i]);
>   		if (!vma || build_id_parse(vma, id_offs[i].build_id, NULL)) {
>   			/* per entry fall back to ips */
>   			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 88dcc5c25225..8d5e340114f8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2270,12 +2270,11 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>   EXPORT_SYMBOL(get_unmapped_area);
>   
>   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +struct vm_area_struct *find_vma_no_check(struct mm_struct *mm, unsigned long addr)
>   {
>   	struct rb_node *rb_node;
>   	struct vm_area_struct *vma;
>   
> -	mmap_assert_locked(mm);
>   	/* Check the cache first. */
>   	vma = vmacache_find(mm, addr);
>   	if (likely(vma))
> @@ -2302,6 +2301,12 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>   	return vma;
>   }
>   
> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +{
> +	mmap_assert_locked(mm);
> +	return find_vma_no_check(mm, addr);
> +}
> +
>   EXPORT_SYMBOL(find_vma);
>   
>   /*
>
Jason Gunthorpe Sept. 8, 2021, 1:53 p.m. UTC | #2
On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:

> > The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> > asserts that mm->mmap_lock needs to be held. But this is not the case for
> > bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> > uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> > in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> > 
> > This patch added function find_vma_no_check() which does not have mmap_assert_locked() call and
> > bpf_get_stack[id]() helpers call find_vma_no_check() instead. This resolved the above warning.
> > 
> > I didn't use __find_vma() name because it has been used in drivers/gpu/drm/i915/i915_gpu_error.c:
> >      static struct i915_vma_coredump *
> >      __find_vma(struct i915_vma_coredump *vma, const char *name) { ... }
> > 
> > Cc: Luigi Rizzo <lrizzo@google.com>
> > Fixes: 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > Signed-off-by: Yonghong Song <yhs@fb.com>
> 
> Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?

Michel added this remark along with the mmap_read_trylock_non_owner:

    It's still not ideal that bpf/stackmap subverts the lock ownership in this
    way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
    way of addressing this in the short term.

Subverting lockdep and then adding more and more core MM APIs to
support this seems quite a bit more ugly than originally expected.

Michel's original idea to split out the lockdep abuse and put it only
in BPF is probably better. Obtain the mmap_read_trylock normally as
owner and then release ownership only before triggering the work. At
least lockdep will continue to work properly for the find_vma.
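
As an illustration, that approach might look roughly like the sketch below.
This is a hedged sketch, not code posted in this thread; in particular the
rwsem_release() call and its placement are assumptions:

    if (!user || !current || !current->mm || irq_work_busy ||
        !mmap_read_trylock(current->mm))        /* acquire as normal owner */
            goto fallback;

    for (i = 0; i < trace_nr; i++) {
            vma = find_vma(current->mm, ips[i]); /* lockdep sees the lock held */
            ...
    }

    if (!work) {
            mmap_read_unlock(current->mm);
    } else {
            /* The deferred irq_work will unlock from another context, so
             * give up lockdep ownership only here, just before queueing. */
            rwsem_release(&current->mm->mmap_lock.dep_map, _RET_IP_);
            work->mm = current->mm;
            irq_work_queue(&work->irq_work);
    }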

Jason
Peter Zijlstra Sept. 8, 2021, 2:15 p.m. UTC | #3
On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
> 
> > > The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > > which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> > > asserts that mm->mmap_lock needs to be held. But this is not the case for
> > > bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> > > uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> > > in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> > > 
> > > This patch added function find_vma_no_check() which does not have mmap_assert_locked() call and
> > > bpf_get_stack[id]() helpers call find_vma_no_check() instead. This resolved the above warning.
> > > 
> > > I didn't use __find_vma() name because it has been used in drivers/gpu/drm/i915/i915_gpu_error.c:
> > >      static struct i915_vma_coredump *
> > >      __find_vma(struct i915_vma_coredump *vma, const char *name) { ... }
> > > 
> > > Cc: Luigi Rizzo <lrizzo@google.com>
> > > Fixes: 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > > Signed-off-by: Yonghong Song <yhs@fb.com>
> > 
> > Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> > fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
> 
> Michel added this remark along with the mmap_read_trylock_non_owner:
> 
>     It's still not ideal that bpf/stackmap subverts the lock ownership in this
>     way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
>     way of addressing this in the short term.
> 
> Subverting lockdep and then adding more and more core MM APIs to
> support this seems quite a bit more ugly than originally expected.
> 
> Michel's original idea to split out the lockdep abuse and put it only
> in BPF is probably better. Obtain the mmap_read_trylock normally as
> owner and then release ownership only before triggering the work. At
> least lockdep will continue to work properly for the find_vma.

The only right solution to all of this is the below. That function
downright subverts all the locking rules we have. Spreading the hacks
any further than that one function is absolutely unacceptable.

The only sane approach is making the vma tree lockless, but so far the
bpf people have resisted doing the right thing because they've been
allowed to get away with these atrocities.

---
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 6fbc2abe9c91..e2c7ab0a41f4 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -48,8 +48,6 @@ static void do_up_read(struct irq_work *entry)
 	mmap_read_unlock_non_owner(work->mm);
 }
 
-static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
-
 static inline bool stack_map_use_build_id(struct bpf_map *map)
 {
 	return (map->map_flags & BPF_F_STACK_BUILD_ID);
@@ -148,67 +146,14 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 					  u64 *ips, u32 trace_nr, bool user)
 {
 	int i;
-	struct vm_area_struct *vma;
-	bool irq_work_busy = false;
-	struct stack_map_irq_work *work = NULL;
-
-	if (irqs_disabled()) {
-		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			work = this_cpu_ptr(&up_read_work);
-			if (irq_work_is_busy(&work->irq_work)) {
-				/* cannot queue more up_read, fallback */
-				irq_work_busy = true;
-			}
-		} else {
-			/*
-			 * PREEMPT_RT does not allow to trylock mmap sem in
-			 * interrupt disabled context. Force the fallback code.
-			 */
-			irq_work_busy = true;
-		}
-	}
-
-	/*
-	 * We cannot do up_read() when the irq is disabled, because of
-	 * risk to deadlock with rq_lock. To do build_id lookup when the
-	 * irqs are disabled, we need to run up_read() in irq_work. We use
-	 * a percpu variable to do the irq_work. If the irq_work is
-	 * already used by another lookup, we fall back to report ips.
-	 *
-	 * Same fallback is used for kernel stack (!user) on a stackmap
-	 * with build_id.
-	 */
-	if (!user || !current || !current->mm || irq_work_busy ||
-	    !mmap_read_trylock_non_owner(current->mm)) {
-		/* cannot access current->mm, fall back to ips */
-		for (i = 0; i < trace_nr; i++) {
-			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
-			id_offs[i].ip = ips[i];
-			memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
-		}
-		return;
-	}
 
+	/* cannot access current->mm, fall back to ips */
 	for (i = 0; i < trace_nr; i++) {
-		vma = find_vma(current->mm, ips[i]);
-		if (!vma || build_id_parse(vma, id_offs[i].build_id, NULL)) {
-			/* per entry fall back to ips */
-			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
-			id_offs[i].ip = ips[i];
-			memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
-			continue;
-		}
-		id_offs[i].offset = (vma->vm_pgoff << PAGE_SHIFT) + ips[i]
-			- vma->vm_start;
-		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
-	}
-
-	if (!work) {
-		mmap_read_unlock_non_owner(current->mm);
-	} else {
-		work->mm = current->mm;
-		irq_work_queue(&work->irq_work);
+		id_offs[i].status = BPF_STACK_BUILD_ID_IP;
+		id_offs[i].ip = ips[i];
+		memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
 	}
+	return;
 }
 
 static struct perf_callchain_entry *
@@ -714,16 +659,3 @@ const struct bpf_map_ops stack_trace_map_ops = {
 	.map_btf_name = "bpf_stack_map",
 	.map_btf_id = &stack_trace_map_btf_id,
 };
-
-static int __init stack_map_init(void)
-{
-	int cpu;
-	struct stack_map_irq_work *work;
-
-	for_each_possible_cpu(cpu) {
-		work = per_cpu_ptr(&up_read_work, cpu);
-		init_irq_work(&work->irq_work, do_up_read);
-	}
-	return 0;
-}
-subsys_initcall(stack_map_init);
Luigi Rizzo Sept. 8, 2021, 2:43 p.m. UTC | #4
On Wed, Sep 8, 2021 at 4:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
> >
> > > > The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > > > which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> > > > asserts that mm->mmap_lock needs to be held. But this is not the case for
> > > > bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> > > > uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> > > > in bpf_get_stack[id]() use case, the above warning is emitted during test run.
...
> > > Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> > > fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
> >
> > Michel added this remark along with the mmap_read_trylock_non_owner:
> >
> >     It's still not ideal that bpf/stackmap subverts the lock ownership in this
> >     way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
> >     way of addressing this in the short term.
> >
> > Subverting lockdep and then adding more and more core MM APIs to
> > support this seems quite a bit more ugly than originally expected.
> >
> > Michel's original idea to split out the lockdep abuse and put it only
> > in BPF is probably better. Obtain the mmap_read_trylock normally as
> > owner and then release ownership only before triggering the work. At
> > least lockdep will continue to work properly for the find_vma.
>
> The only right solution to all of this is the below. That function
> downright subverts all the locking rules we have. Spreading the hacks
> any further than that one function is absolutely unacceptable.

I'd be inclined to agree that we should not introduce hacks around
locking rules. I don't know enough about the constraints of
bpf/stackmap: how much of a performance penalty do we pay with Peter's
patch, and how often is this function expected to be called?

cheers
luigi
Liam R. Howlett Sept. 8, 2021, 3:12 p.m. UTC | #5
* Luigi Rizzo <lrizzo@google.com> [210908 10:44]:
> On Wed, Sep 8, 2021 at 4:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
> > >
> > > > > The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > > > > which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> > > > > asserts that mm->mmap_lock needs to be held. But this is not the case for
> > > > > bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> > > > > uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> > > > > in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> ...
> > > > Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> > > > fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
> > >
> > > Michel added this remark along with the mmap_read_trylock_non_owner:
> > >
> > >     It's still not ideal that bpf/stackmap subverts the lock ownership in this
> > >     way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
> > >     way of addressing this in the short term.
> > >
> > > Subverting lockdep and then adding more and more core MM APIs to
> > > support this seems quite a bit more ugly than originally expected.
> > >
> > > Michel's original idea to split out the lockdep abuse and put it only
> > > in BPF is probably better. Obtain the mmap_read_trylock normally as
> > > owner and then release ownership only before triggering the work. At
> > > least lockdep will continue to work properly for the find_vma.
> >
> > The only right solution to all of this is the below. That function
> > downright subverts all the locking rules we have. Spreading the hacks
> > any further than that one function is absolutely unacceptable.
> 
> I'd be inclined to agree that we should not introduce hacks around
> locking rules. I don't know enough about the constraints of
> bpf/stackmap, how much of a performance penalty do we pay with Peter's
> patch,
> and how often one is expected to call this function?
> 
> cheers
> luigi

The hack already exists.  The symptom of the larger issue is that
lockdep now catches the hack when using find_vma().

If Peter's solution is acceptable to the bpf folks, then we can go ahead
and drop the option of using the non_owner variant - which would be
best.  Otherwise the hack around the locking rule still exists as long
as the find_vma() interface isn't used.

Thanks,
Liam
Yonghong Song Sept. 8, 2021, 4:09 p.m. UTC | #6
On 9/8/21 8:12 AM, Liam Howlett wrote:
> * Luigi Rizzo <lrizzo@google.com> [210908 10:44]:
>> On Wed, Sep 8, 2021 at 4:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
>>>> On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
>>>>
>>>>>> The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
>>>>>> which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
>>>>>> asserts that mm->mmap_lock needs to be held. But this is not the case for
>>>>>> bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
>>>>>> uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
>>>>>> in bpf_get_stack[id]() use case, the above warning is emitted during test run.
>> ...
>>>>> Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
>>>>> fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
>>>>
>>>> Michel added this remark along with the mmap_read_trylock_non_owner:
>>>>
>>>>      It's still not ideal that bpf/stackmap subverts the lock ownership in this
>>>>      way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
>>>>      way of addressing this in the short term.
>>>>
>>>> Subverting lockdep and then adding more and more core MM APIs to
>>>> support this seems quite a bit more ugly than originally expected.
>>>>
>>>> Michel's original idea to split out the lockdep abuse and put it only
>>>> in BPF is probably better. Obtain the mmap_read_trylock normally as
>>>> owner and then release ownership only before triggering the work. At
>>>> least lockdep will continue to work properly for the find_vma.
>>>
>>> The only right solution to all of this is the below. That function
>>> downright subverts all the locking rules we have. Spreading the hacks
>>> any further than that one function is absolutely unacceptable.
>>
>> I'd be inclined to agree that we should not introduce hacks around
>> locking rules. I don't know enough about the constraints of
>> bpf/stackmap, how much of a performance penalty do we pay with Peter's
>> patch,
>> and how often one is expected to call this function?
>>
>> cheers
>> luigi
> 
> The hack already exists.  The symptom of the larger issue is that
> lockdep now catches the hack when using find_vma().
> 
> If Peter's solution is acceptable to the bpf folks, then we can go ahead
> and drop the option of using the non_owner variant - which would be
> best.  Otherwise the hack around the locking rule still exists as long
> as the find_vma() interface isn't used.

Hi, Peter, Luigi, Liam, Jason,

Peter's solution will definitely break user applications using the
BPF_F_USER_BUILD_ID feature 
(https://github.com/torvalds/linux/blob/master/include/uapi/linux/bpf.h#L1193). 
So in my opinion,
the "hack" is still needed to avoid breaking user-space applications.

To answer Luigi's question below:
 > I don't know enough about the constraints of
 > bpf/stackmap, how much of a performance penalty do we pay with Peter's
 > patch,
 > and how often one is expected to call this function?

This function is called only if the user bpf program specifies 
BPF_F_USER_BUILD_ID in the bpf_get_stack() or bpf_get_stack_pe() helper.
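
For context, a minimal user of this feature could look like the following
sketch (hypothetical, not code from this thread; the section name and
buffer size are illustrative only):

    /* hypothetical bpf program sketch */
    struct bpf_stack_build_id id_offs[32];

    SEC("raw_tracepoint/sys_enter")
    int prog(void *ctx)
    {
            /* With BPF_F_USER_BUILD_ID, each returned entry is a
             * struct bpf_stack_build_id instead of a raw ip; this is
             * the path that reaches stack_map_get_build_id_offset(). */
            bpf_get_stack(ctx, id_offs, sizeof(id_offs),
                          BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
            return 0;
    }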

> 
> Thanks,
> Liam
>
Luigi Rizzo Sept. 8, 2021, 5:09 p.m. UTC | #7
On Wed, Sep 8, 2021 at 6:10 PM Yonghong Song <yhs@fb.com> wrote:
>
>
>
> On 9/8/21 8:12 AM, Liam Howlett wrote:
> > * Luigi Rizzo <lrizzo@google.com> [210908 10:44]:
> >> On Wed, Sep 8, 2021 at 4:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >>>
> >>> On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
> >>>> On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
> >>>>
> >>>>>> The warning is due to commit 5b78ed24e8ec("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> >>>>>> which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> >>>>>> asserts that mm->mmap_lock needs to be held. But this is not the case for
> >>>>>> bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> >>>>>> uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> >>>>>> in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> >> ...
> >>>>> Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> >>>>> fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
> >>>>
> >>>> Michel added this remark along with the mmap_read_trylock_non_owner:
> >>>>
> >>>>      It's still not ideal that bpf/stackmap subverts the lock ownership in this
> >>>>      way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
> >>>>      way of addressing this in the short term.
> >>>>
> >>>> Subverting lockdep and then adding more and more core MM APIs to
> >>>> support this seems quite a bit more ugly than originally expected.
> >>>>
> >>>> Michel's original idea to split out the lockdep abuse and put it only
> >>>> in BPF is probably better. Obtain the mmap_read_trylock normally as
> >>>> owner and then release ownership only before triggering the work. At
> >>>> least lockdep will continue to work properly for the find_vma.
> >>>
> >>> The only right solution to all of this is the below. That function
> >>> downright subverts all the locking rules we have. Spreading the hacks
> >>> any further than that one function is absolutely unacceptable.
> >>
> >> I'd be inclined to agree that we should not introduce hacks around
> >> locking rules. I don't know enough about the constraints of
> >> bpf/stackmap, how much of a performance penalty do we pay with Peter's
> >> patch,
> >> and how often one is expected to call this function?
> >>
> >> cheers
> >> luigi
> >
> > The hack already exists.  The symptom of the larger issue is that
> > lockdep now catches the hack when using find_vma().
> >
> > If Peter's solution is acceptable to the bpf folks, then we can go ahead
> > and drop the option of using the non_owner variant - which would be
> > best.  Otherwise the hack around the locking rule still exists as long
> > as the find_vma() interface isn't used.
>
> Hi, Peter, Luigi, Liam, Jason,
>
> Peter's solution will definitely break user applications using
> BPF_F_USER_BUILD_ID feature

Again, I am ignorant of the details, so if you can clarify the following
it may help me and others better understand the problem:

1. Peter's patch appears to just take the same "fallback" path
   that would be taken if the trylock failed.
   Is this really a breakage, or just a loss of performance?
   I would expect the latter, since it is called "fallback".

2. Assuming it is just loss of performance, it becomes important to
   clarify how much we are losing, and whether this is something to
   worry about, or it is only some rarely used function. Things like:
   - total CPU cost + latency difference (including deferred work, if any)
     in the "trylock ok" and "trylock failed" cases.
     I have no idea but since there is a trylock involved, I suppose
     we are in O(500ns) territory (when cold or contended)

   - how often we'd expect calls to the bpf helpers involved
     (BPF_F_USER_BUILD_ID in bpf_get_stack() or bpf_get_stack_pe()) ?



thanks
luigi
Alexei Starovoitov Sept. 8, 2021, 5:21 p.m. UTC | #8
On Wed, Sep 08, 2021 at 07:09:23PM +0200, Luigi Rizzo wrote:
> On Wed, Sep 8, 2021 at 6:10 PM Yonghong Song <yhs@fb.com> wrote:
> >
> >
> >
> > On 9/8/21 8:12 AM, Liam Howlett wrote:
> > > * Luigi Rizzo <lrizzo@google.com> [210908 10:44]:
> > >> On Wed, Sep 8, 2021 at 4:16 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > >>>
> > >>> On Wed, Sep 08, 2021 at 10:53:26AM -0300, Jason Gunthorpe wrote:
> > >>>> On Wed, Sep 08, 2021 at 02:20:17PM +0200, Daniel Borkmann wrote:
> > >>>>
> > >>>>>> The warning is due to commit 5b78ed24e8ec ("mm/pagemap: add mmap_assert_locked() annotations to find_vma*()")
> > >>>>>> which added mmap_assert_locked() in find_vma() function. The mmap_assert_locked() function
> > >>>>>> asserts that mm->mmap_lock needs to be held. But this is not the case for
> > >>>>>> bpf_get_stack() or bpf_get_stackid() helper (kernel/bpf/stackmap.c), which
> > >>>>>> uses mmap_read_trylock_non_owner() instead. Since mm->mmap_lock is not held
> > >>>>>> in bpf_get_stack[id]() use case, the above warning is emitted during test run.
> > >> ...
> > >>>>> Luigi / Liam / Andrew, if the below looks reasonable to you, any objections to route the
> > >>>>> fix with your ACKs via bpf tree to Linus (or strong preference via -mm fixes)?
> > >>>>
> > >>>> Michel added this remark along with the mmap_read_trylock_non_owner:
> > >>>>
> > >>>>      It's still not ideal that bpf/stackmap subverts the lock ownership in this
> > >>>>      way.  Thanks to Peter Zijlstra for suggesting this API as the least-ugly
> > >>>>      way of addressing this in the short term.
> > >>>>
> > >>>> Subverting lockdep and then adding more and more core MM APIs to
> > >>>> support this seems quite a bit more ugly than originally expected.
> > >>>>
> > >>>> Michel's original idea to split out the lockdep abuse and put it only
> > >>>> in BPF is probably better. Obtain the mmap_read_trylock normally as
> > >>>> owner and then release ownership only before triggering the work. At
> > >>>> least lockdep will continue to work properly for the find_vma.
> > >>>
> > >>> The only right solution to all of this is the below. That function
> > >>> downright subverts all the locking rules we have. Spreading the hacks
> > >>> any further than that one function is absolutely unacceptable.
> > >>
> > >> I'd be inclined to agree that we should not introduce hacks around
> > >> locking rules. I don't know enough about the constraints of
> > >> bpf/stackmap, how much of a performance penalty do we pay with Peter's
> > >> patch,
> > >> and how often one is expected to call this function?
> > >>
> > >> cheers
> > >> luigi
> > >
> > > The hack already exists.  The symptom of the larger issue is that
> > > lockdep now catches the hack when using find_vma().
> > >
> > > If Peter's solution is acceptable to the bpf folks, then we can go ahead
> > > and drop the option of using the non_owner variant - which would be
> > > best.  Otherwise the hack around the locking rule still exists as long
> > > as the find_vma() interface isn't used.
> >
> > Hi, Peter, Luigi, Liam, Jason,
> >
> > Peter's solution will definitely break user applications using
> > BPF_F_USER_BUILD_ID feature
> 
> Again I am ignorant on the details so if you can clarify the following
> it may help me and others to better understand the problem:
> 
> 1. Peter's patch appears to just take the same "fallback" path
>    that would be taken if the trylock failed.
>    Is this really a breakage or just loss of performance ?
>    I would expect the latter, since it is called "fallback".

As Yonghong explained it's a user space breakage.
User space tooling expects build_id to be available 99.999% of the time
and that's what users observed in practice.
They've built a bunch of tools on top of this feature.
The data from these tools goes into various datacenter tables
and humans analyze it later.
So Peter's proposal is not acceptable. We don't want to get yelled at.

> 2. Assuming it is just loss of performance, it becomes important to

It's nothing to do with performance.
Andrew Morton Sept. 8, 2021, 5:52 p.m. UTC | #9
On Wed, 8 Sep 2021 10:21:18 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > Again I am ignorant on the details so if you can clarify the following
> > it may help me and others to better understand the problem:
> > 
> > 1. Peter's patch appears to just take the same "fallback" path
> >    that would be taken if the trylock failed.
> >    Is this really a breakage or just loss of performance ?
> >    I would expect the latter, since it is called "fallback".
> 
> As Yonghong explained it's a user space breakage.
> User space tooling expects build_id to be available 99.999% of the time
> and that's what users observed in practice.
> They've built a bunch of tools on top of this feature.
> The data from these tools goes into various datacenter tables
> and humans analyze it later.
> So Peter's proposal is not acceptable. We don't want to get yelled at.
> 

I'm not understanding.  Peter said "this patch merely removes a
performance tweak" and you and Yonghong said "it breaks userspace". 
These assertions are contradictory!

Please describe the expected userspace-visible change from Peter's
patch in full detail?

And yes, it is far preferable that we resolve this by changing BPF to
be a better interface citizen, please.  Let's put those thinking caps on?
Alexei Starovoitov Sept. 8, 2021, 6:02 p.m. UTC | #10
On Wed, Sep 08, 2021 at 10:52:59AM -0700, Andrew Morton wrote:
> On Wed, 8 Sep 2021 10:21:18 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > > Again I am ignorant on the details so if you can clarify the following
> > > it may help me and others to better understand the problem:
> > > 
> > > 1. Peter's patch appears to just take the same "fallback" path
> > >    that would be taken if the trylock failed.
> > >    Is this really a breakage or just loss of performance ?
> > >    I would expect the latter, since it is called "fallback".
> > 
> > As Yonghong explained it's a user space breakage.
> > User space tooling expects build_id to be available 99.999% of the time
> > and that's what users observed in practice.
> > They've built a bunch of tools on top of this feature.
> > The data from these tools goes into various datacenter tables
> > and humans analyze it later.
> > So Peter's proposal is not acceptable. We don't want to get yelled at.
> > 
> 
> I'm not understanding.  Peter said "this patch merely removes a
> performance tweak" and you and Yonghong said "it breaks userspace". 
> These assertions are contradictory!

Peter said:
"The only sane approach is making the vma tree lockless, but so far the
 bpf people have resisted doing the right thing because they've been
 allowed to get away with these atrocities.
"
which is partially true.
bpf folks didn't resist it. There is work ongoing to make it lockless.
It just takes a long time. I don't see how bpf folks can speed it up
any further.

> Please describe the expected userspace-visible change from Peter's
> patch in full detail?

User space expects build_id to be available. Peter patch simply removes
that feature.

> And yes, it is far preferable that we resolve this by changing BPF to
> be a better interface citizen, please.  Let's put those thinking caps on?

Just silence a lockdep as Yonghong proposed or some other way,
since it's only a lockdep issue. There is no actual breakage.
The feature was working and still works as intended.
Andrew Morton Sept. 8, 2021, 6:15 p.m. UTC | #11
On Wed, 8 Sep 2021 11:02:58 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > Please describe the expected userspace-visible change from Peter's
> > patch in full detail?
> 
> User space expects build_id to be available. Peter patch simply removes
> that feature.

Are you sure?  He ends up with

static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
					  u64 *ips, u32 trace_nr, bool user)
{
	int i;

	/* cannot access current->mm, fall back to ips */
	for (i = 0; i < trace_nr; i++) {
		id_offs[i].status = BPF_STACK_BUILD_ID_IP;
		id_offs[i].ip = ips[i];
		memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
	}
	return;
}

and you're saying that userspace won't like this because we didn't set
BPF_STACK_BUILD_ID_VALID?
Alexei Starovoitov Sept. 8, 2021, 6:20 p.m. UTC | #12
On Wed, Sep 8, 2021 at 11:15 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 8 Sep 2021 11:02:58 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > > Please describe the expected userspace-visible change from Peter's
> > > patch in full detail?
> >
> > User space expects build_id to be available. Peter patch simply removes
> > that feature.
>
> Are you sure?  He ends up with

More than sure :)
Just look at below.

> static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>                                           u64 *ips, u32 trace_nr, bool user)
> {
>         int i;
>
>         /* cannot access current->mm, fall back to ips */
>         for (i = 0; i < trace_nr; i++) {
>                 id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>                 id_offs[i].ip = ips[i];
>                 memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
>         }
>         return;
> }
>
> and you're saying that userspace won't like this because we didn't set
> BPF_STACK_BUILD_ID_VALID?

The patch forces the "fallback path" that in production is seen 0.001% of the time.
Meaning that user space doesn't see build_id any more. It sees IPs only.
The user space cannot correlate IPs to binaries. That's what build_id enabled.
Liam R. Howlett Sept. 8, 2021, 6:30 p.m. UTC | #13
* Alexei Starovoitov <alexei.starovoitov@gmail.com> [210908 14:21]:
> On Wed, Sep 8, 2021 at 11:15 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 8 Sep 2021 11:02:58 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > > Please describe the expected userspace-visible change from Peter's
> > > > patch in full detail?
> > >
> > > User space expects build_id to be available. Peter patch simply removes
> > > that feature.
> >
> > Are you sure?  He ends up with
> 
> More than sure :)
> Just look at below.
> 
> > static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> >                                           u64 *ips, u32 trace_nr, bool user)
> > {
> >         int i;
> >
> >         /* cannot access current->mm, fall back to ips */
> >         for (i = 0; i < trace_nr; i++) {
> >                 id_offs[i].status = BPF_STACK_BUILD_ID_IP;
> >                 id_offs[i].ip = ips[i];
> >                 memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
> >         }
> >         return;
> > }
> >
> > and you're saying that userspace won't like this because we didn't set
> > BPF_STACK_BUILD_ID_VALID?
> 
> The patch forces the "fallback path" that in production is seen 0.001% of the time.
> Meaning that user space doesn't see build_id any more. It sees IPs only.
> The user space cannot correlate IPs to binaries. That's what build_id enabled.

I was thinking of decomposing the checks in my first response to two
functions.

Something like this:
--------------
diff --git a/mm/mmap.c b/mm/mmap.c
index dce46105e3df..8afc1d22aa61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2293,12 +2293,13 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
+					 unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	mmap_assert_locked(mm);
+	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
 	/* Check the cache first. */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))
@@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	lockdep_assert_held(&mm->mmap_lock);
+	return find_vma_non_owner(mm, addr);
+}
 EXPORT_SYMBOL(find_vma);
 
 /*

--------------

Although this leaks more into the mm API and was referred to as ugly
previously, it does provide a working solution and still maintains the
same level of checking.

Would it push the bad actors to just switch to non_owner though?


Thanks,
Liam
Liam R. Howlett Sept. 8, 2021, 6:43 p.m. UTC | #14
* Alexei Starovoitov <alexei.starovoitov@gmail.com> [210908 14:03]:
> On Wed, Sep 08, 2021 at 10:52:59AM -0700, Andrew Morton wrote:
> > On Wed, 8 Sep 2021 10:21:18 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > 
> > > > Again I am ignorant on the details so if you can clarify the following
> > > > it may help me and others to better understand the problem:
> > > > 
> > > > 1. Peter's patch appears to just take the same "fallback" path
> > > >    that would be taken if the trylock failed.
> > > >    Is this really a breakage or just loss of performance ?
> > > >    I would expect the latter, since it is called "fallback".
> > > 
> > > As Yonghong explained it's a user space breakage.
> > > User space tooling expects build_id to be available 99.999% of the time
> > > and that's what users observed in practice.
> > > They've built a bunch of tools on top of this feature.
> > > The data from these tools goes into various datacenter tables
> > > and humans analyze it later.
> > > So Peter's proposal is not acceptable. We don't want to get yelled at.
> > > 
> > 
> > I'm not understanding.  Peter said "this patch merely removes a
> > performance tweak" and you and Yonghong said "it breaks userspace". 
> > These assertions are contradictory!
> 
> Peter said:
> "The only sane approach is making the vma tree lockless, but so far the
>  bpf people have resisted doing the right thing because they've been
>  allowed to get away with these atrocities.
> "
> which is partially true.
> bpf folks didn't resist it. There is work ongoing to make it lockless.
> It just takes a long time. I don't see how bpf folks can speed it up
> any further.

What work are you doing on a lockless vma tree?  I've been working on
the maple tree and would like to hear what you've come up with.

> 
> > Please describe the expected userspace-visible change from Peter's
> > patch in full detail?
> 
> User space expects build_id to be available. Peter patch simply removes
> that feature.
> 
> > And yes, it is far preferable that we resolve this by changing BPF to
> > be a better interface citizen, please.  Let's put those thinking caps on?
> 
> Just silence a lockdep as Yonghong proposed or some other way,
> since it's only a lockdep issue. There is no actual breakage.
> The feature was working and still works as intended.
Yonghong Song Sept. 8, 2021, 6:45 p.m. UTC | #15
On 9/8/21 11:30 AM, Liam Howlett wrote:
> * Alexei Starovoitov <alexei.starovoitov@gmail.com> [210908 14:21]:
>> On Wed, Sep 8, 2021 at 11:15 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>
>>> On Wed, 8 Sep 2021 11:02:58 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>
>>>>> Please describe the expected userspace-visible change from Peter's
>>>>> patch in full detail?
>>>>
>>>> User space expects build_id to be available. Peter patch simply removes
>>>> that feature.
>>>
>>> Are you sure?  He ends up with
>>
>> More than sure :)
>> Just look at below.
>>
>>> static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>>>                                            u64 *ips, u32 trace_nr, bool user)
>>> {
>>>          int i;
>>>
>>>          /* cannot access current->mm, fall back to ips */
>>>          for (i = 0; i < trace_nr; i++) {
>>>                  id_offs[i].status = BPF_STACK_BUILD_ID_IP;
>>>                  id_offs[i].ip = ips[i];
>>>                  memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
>>>          }
>>>          return;
>>> }
>>>
>>> and you're saying that userspace won't like this because we didn't set
>>> BPF_STACK_BUILD_ID_VALID?
>>
>> The patch forces the "fallback path" that in production is seen 0.001% of the time.
>> Meaning that user space doesn't see build_id any more. It sees IPs only.
>> The user space cannot correlate IPs to binaries. That's what build_id enabled.
> 
> I was thinking of decomposing the checks in my first response to two
> functions.
> 
> Something like this:
> --------------
> diff --git a/mm/mmap.c b/mm/mmap.c
> index dce46105e3df..8afc1d22aa61 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2293,12 +2293,13 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>   EXPORT_SYMBOL(get_unmapped_area);
>   
>   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
> +					 unsigned long addr)
>   {
>   	struct rb_node *rb_node;
>   	struct vm_area_struct *vma;
>   
> -	mmap_assert_locked(mm);
> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>   	/* Check the cache first. */
>   	vma = vmacache_find(mm, addr);
>   	if (likely(vma))
> @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>   	return vma;
>   }
>   
> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +{
> +	lockdep_assert_held(&mm->mmap_lock);
> +	return find_vma_non_owner(mm, addr);
> +}
>   EXPORT_SYMBOL(find_vma);
>   
>   /*
> 
> --------------
> 
> Although this leaks more into the mm API and was referred to as ugly
> previously, it does provide a working solution and still maintains the
> same level of checking.
> 
> Would it push the back actors to just switch to non_owner though?

Thanks, Liam. This should work for the bpf side as well, since we can just
call find_vma_non_owner(). I will submit v3 soon.

> 
> 
> Thanks,
> Liam
>
Jason Gunthorpe Sept. 8, 2021, 6:49 p.m. UTC | #16
On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:

>  /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
> +					 unsigned long addr)
>  {
>  	struct rb_node *rb_node;
>  	struct vm_area_struct *vma;
>  
> -	mmap_assert_locked(mm);
> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>  	/* Check the cache first. */
>  	vma = vmacache_find(mm, addr);
>  	if (likely(vma))
> @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>  	return vma;
>  }
>  
> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> +{
> +	lockdep_assert_held(&mm->mmap_lock);
> +	return find_vma_non_owner(mm, addr);
> +}
>  EXPORT_SYMBOL(find_vma);
>  
>  /*
> 
> 
> Although this leaks more into the mm API and was referred to as ugly
> previously, it does provide a working solution and still maintains the
> same level of checking.

I think it is no better than before.

The solution must be to not break lockdep in the BPF side. If Peter's
reworked algorithm is not OK then BPF should drop/acquire the lockdep
when it punts the unlock to the WQ.

'non-owner' is just a nice way of saying 'the caller is messing with
lockdep', it is not a sane way to design APIs

Jason
Yonghong Song Sept. 8, 2021, 7:11 p.m. UTC | #17
On 9/8/21 11:49 AM, Jason Gunthorpe wrote:
> On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:
> 
>>   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
>> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>> +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
>> +					 unsigned long addr)
>>   {
>>   	struct rb_node *rb_node;
>>   	struct vm_area_struct *vma;
>>   
>> -	mmap_assert_locked(mm);
>> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>   	/* Check the cache first. */
>>   	vma = vmacache_find(mm, addr);
>>   	if (likely(vma))
>> @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>   	return vma;
>>   }
>>   
>> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>> +{
>> +	lockdep_assert_held(&mm->mmap_lock);
>> +	return find_vma_non_owner(mm, addr);
>> +}
>>   EXPORT_SYMBOL(find_vma);
>>   
>>   /*
>>
>>
>> Although this leaks more into the mm API and was referred to as ugly
>> previously, it does provide a working solution and still maintains the
>> same level of checking.
> 
> I think it is no better than before.
> 
> The solution must be to not break lockdep in the BPF side. If Peter's
> reworked algorithm is not OK then BPF should drop/acquire the lockdep
> when it punts the unlock to the WQ.

The current warning is triggered by bpf calling find_vma().
Is it too late even if bpf does drop/acquire the lockdep when it punts
the unlock to the WQ with irq_work_queue()? Maybe I missed something,
could you be more specific about your proposed solution?

> 
> 'non-owner' is just a nice way of saying 'the caller is messing with
> lockdep', it is not a sane way to design APIs
> 
> Jason
>
Alexei Starovoitov Sept. 8, 2021, 7:42 p.m. UTC | #18
On Wed, Sep 08, 2021 at 06:43:49PM +0000, Liam Howlett wrote:
> * Alexei Starovoitov <alexei.starovoitov@gmail.com> [210908 14:03]:
> > On Wed, Sep 08, 2021 at 10:52:59AM -0700, Andrew Morton wrote:
> > > On Wed, 8 Sep 2021 10:21:18 -0700 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > > 
> > > > > Again I am ignorant on the details so if you can clarify the following
> > > > > it may help me and others to better understand the problem:
> > > > > 
> > > > > 1. Peter's patch appears to just take the same "fallback" path
> > > > >    that would be taken if the trylock failed.
> > > > >    Is this really a breakage or just loss of performance ?
> > > > >    I would expect the latter, since it is called "fallback".
> > > > 
> > > > As Yonghong explained it's a user space breakage.
> > > > User space tooling expects build_id to be available 99.999% of the time
> > > > and that's what users observed in practice.
> > > > They've built a bunch of tools on top of this feature.
> > > > The data from these tools goes into various datacenter tables
> > > > and humans analyze it later.
> > > > So Peter's proposal is not acceptable. We don't want to get yelled at.
> > > > 
> > > 
> > > I'm not understanding.  Peter said "this patch merely removes a
> > > performance tweak" and you and Yonghong said "it breaks userspace". 
> > > These assertions are contradictory!
> > 
> > Peter said:
> > "The only sane approach is making the vma tree lockless, but so far the
> >  bpf people have resisted doing the right thing because they've been
> >  allowed to get away with these atrocities.
> > "
> > which is partially true.
> > bpf folks didn't resist it. There is work ongoing to make it lockless.
> > It just takes a long time. I don't see how bpf folks can speed it up
> > any further.
> 
> What work are you doing on a lockless vma tree?  I've been working on
> the maple tree and would like to hear what you've come up with.

Mainly cheering Michel and Paul from the sidelines.
imo any approach would be better than the current state.
Jason Gunthorpe Sept. 8, 2021, 11:33 p.m. UTC | #19
On Wed, Sep 08, 2021 at 12:11:54PM -0700, Yonghong Song wrote:
> 
> 
> On 9/8/21 11:49 AM, Jason Gunthorpe wrote:
> > On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:
> > 
> > >   /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
> > > -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > > +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
> > > +					 unsigned long addr)
> > >   {
> > >   	struct rb_node *rb_node;
> > >   	struct vm_area_struct *vma;
> > > -	mmap_assert_locked(mm);
> > > +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
> > >   	/* Check the cache first. */
> > >   	vma = vmacache_find(mm, addr);
> > >   	if (likely(vma))
> > > @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > >   	return vma;
> > >   }
> > > +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	lockdep_assert_held(&mm->mmap_lock);
> > > +	return find_vma_non_owner(mm, addr);
> > > +}
> > >   EXPORT_SYMBOL(find_vma);
> > >   /*
> > > 
> > > 
> > > Although this leaks more into the mm API and was referred to as ugly
> > > previously, it does provide a working solution and still maintains the
> > > same level of checking.
> > 
> > I think it is no better than before.
> > 
> > The solution must be to not break lockdep in the BPF side. If Peter's
> > reworked algorithm is not OK then BPF should drop/acquire the lockdep
> > when it punts the unlock to the WQ.
> 
> The current warning is triggered by bpf calling find_vma().

Yes, but that is because the lockdep has already been dropped.

It looks to me like it basically does this:

        mmap_read_trylock_non_owner(current->mm)

        vma = find_vma(current->mm, ips[i]);

        if (!work) {
                mmap_read_unlock_non_owner(current->mm);
        } else {
                work->mm = current->mm;
                irq_work_queue(&work->irq_work);


And the only reason for this lockdep madness is because the
irq_work_queue() does:

static void do_up_read(struct irq_work *entry)
{
        struct stack_map_irq_work *work;

        if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
                return;

        work = container_of(entry, struct stack_map_irq_work, irq_work);
        mmap_read_unlock_non_owner(work->mm);
}


This is all about deferring the unlock to outside an IRQ context. The
lockdep ownership is transferred from the IRQ to the work, which is
something that we don't usually do or model in lockdep.

Lockdep complains with the straightforward code because exiting an IRQ
with locks held is illegal.

The saner version is more like:

        mmap_read_trylock(current->mm)

        vma = find_vma(current->mm, ips[i]);

        if (!work) {
                mmap_read_unlock(current->mm);
        } else {
                work->mm = current->mm;
                <tell lockdep we really do mean to return with
		 the lock held>
                rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
                irq_work_queue(&work->irq_work);


do_up_read():
       <tell lockdep the lock was already released from the map>
       mmap_read_unlock_non_owner(work->mm);

ie properly model in lockdep that ownership moves from the IRQ to the
work. At least we don't corrupt the core mm code with this insanity.

Jason
Yonghong Song Sept. 9, 2021, 5:50 a.m. UTC | #20
On 9/8/21 4:33 PM, Jason Gunthorpe wrote:
> On Wed, Sep 08, 2021 at 12:11:54PM -0700, Yonghong Song wrote:
>>
>>
>> On 9/8/21 11:49 AM, Jason Gunthorpe wrote:
>>> On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:
>>>
>>>>    /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
>>>> -struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>>> +struct vm_area_struct *find_vma_non_owner(struct mm_struct *mm,
>>>> +					 unsigned long addr)
>>>>    {
>>>>    	struct rb_node *rb_node;
>>>>    	struct vm_area_struct *vma;
>>>> -	mmap_assert_locked(mm);
>>>> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
>>>>    	/* Check the cache first. */
>>>>    	vma = vmacache_find(mm, addr);
>>>>    	if (likely(vma))
>>>> @@ -2325,6 +2326,11 @@ struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>>>    	return vma;
>>>>    }
>>>> +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>>>> +{
>>>> +	lockdep_assert_held(&mm->mmap_lock);
>>>> +	return find_vma_non_owner(mm, addr);
>>>> +}
>>>>    EXPORT_SYMBOL(find_vma);
>>>>    /*
>>>>
>>>>
>>>> Although this leaks more into the mm API and was referred to as ugly
>>>> previously, it does provide a working solution and still maintains the
>>>> same level of checking.
>>>
>>> I think it is no better than before.
>>>
>>> The solution must be to not break lockdep in the BPF side. If Peter's
>>> reworked algorithm is not OK then BPF should drop/acquire the lockdep
>>> when it punts the unlock to the WQ.
>>
>> The current warning is triggered by bpf calling find_vma().
> 
> Yes, but that is because the lockdep has already been dropped.
> 
> It looks to me like it basically does this:
> 
>          mmap_read_trylock_non_owner(current->mm)
> 
>          vma = find_vma(current->mm, ips[i]);
> 
>          if (!work) {
>                  mmap_read_unlock_non_owner(current->mm);
>          } else {
>                  work->mm = current->mm;
>                  irq_work_queue(&work->irq_work);
> 
> 
> And the only reason for this lockdep madness is because the
> irq_work_queue() does:
> 
> static void do_up_read(struct irq_work *entry)
> {
>          struct stack_map_irq_work *work;
> 
>          if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
>                  return;
> 
>          work = container_of(entry, struct stack_map_irq_work, irq_work);
>          mmap_read_unlock_non_owner(work->mm);
> }
> 
> 
> This is all about deferring the unlock to outside an IRQ context. The
> lockdep ownership is transferred from the IRQ to the work, which is
> something that we don't usually do or model in lockdep.
> 
> Lockdep complains with the straightforward code because exiting an IRQ
> with locks held is illegal.
> 
> The saner version is more like:
> 
>          mmap_read_trylock(current->mm)
> 
>          vma = find_vma(current->mm, ips[i]);
> 
>          if (!work) {
>                  mmap_read_unlock(current->mm);
>          } else {
>                  work->mm = current->mm;
>                  <tell lockdep we really do mean to return with
> 		 the lock held>
>                  rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
>                  irq_work_queue(&work->irq_work);
> 
> 
> do_up_read():
>         <tell lockdep the lock was already released from the map>
>         mmap_read_unlock_non_owner(work->mm);
> 
> ie properly model in lockdep that ownership moves from the IRQ to the
> work. At least we don't corrupt the core mm code with this insanity.

Thanks for the suggestion! I verified the above change indeed work.
Will send v4 soon.

> 
> Jason
>
Peter Zijlstra Sept. 9, 2021, 8:05 a.m. UTC | #21
On Wed, Sep 08, 2021 at 06:30:52PM +0000, Liam Howlett wrote:
> +	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);

NAK.. do not ever use the _is_locked() trainwrecks.

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 50e2c2914ac2..ba7a11d900f5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2669,6 +2669,8 @@  extern int expand_upwards(struct vm_area_struct *vma, unsigned long address);
 #endif
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
+extern struct vm_area_struct * find_vma_no_check(struct mm_struct * mm,
+						 unsigned long addr);
 extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr);
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index e8eefdf8cf3e..838a2c141497 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -190,7 +190,7 @@  static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	}
 
 	for (i = 0; i < trace_nr; i++) {
-		vma = find_vma(current->mm, ips[i]);
+		vma = find_vma_no_check(current->mm, ips[i]);
 		if (!vma || build_id_parse(vma, id_offs[i].build_id, NULL)) {
 			/* per entry fall back to ips */
 			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
diff --git a/mm/mmap.c b/mm/mmap.c
index 88dcc5c25225..8d5e340114f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2270,12 +2270,11 @@  get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 EXPORT_SYMBOL(get_unmapped_area);
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */
-struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+struct vm_area_struct *find_vma_no_check(struct mm_struct *mm, unsigned long addr)
 {
 	struct rb_node *rb_node;
 	struct vm_area_struct *vma;
 
-	mmap_assert_locked(mm);
 	/* Check the cache first. */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))
@@ -2302,6 +2301,12 @@  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 	return vma;
 }
 
+struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
+{
+	mmap_assert_locked(mm);
+	return find_vma_no_check(mm, addr);
+}
+
 EXPORT_SYMBOL(find_vma);
 
 /*