Message ID | 20240801153722.191797-2-dtatulea@nvidia.com (mailing list archive)
State      | New, archived
Series     | [RFC,vhost] vhost-vdpa: Fix invalid irq bypass unregister
On Thu, Aug 1, 2024 at 11:38 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> The following workflow triggers the crash referenced below:
>
> 1) vhost_vdpa_unsetup_vq_irq() unregisters the irq bypass producer
>    but the producer->token is still valid.
> 2) vq context gets released and reassigned to another vq.

Just to make sure I understand here, which structure is referred to as
"vq context" here? I guess it's not call_ctx as it is a part of the vq
itself.

> 3) That other vq registers its producer with the same vq context
>    pointer as token in vhost_vdpa_setup_vq_irq().

Or did you mean when a single eventfd is shared among different vqs?

> 4) The original vq tries to unregister its producer which it has
>    already unlinked in step 1. irq_bypass_unregister_producer() will go
>    ahead and unlink the producer once again. That happens because:
>    a) The producer has a token.
>    b) An element with that token is found. But that element comes
>       from step 3.
>
> I see 3 ways to fix this:
> 1) Fix the vhost-vdpa part. What this patch does. vfio has a different
>    workflow.
> 2) Set the token to NULL directly in irq_bypass_unregister_producer()
>    after unlinking the producer. But that makes the API asymmetrical.
> 3) Make irq_bypass_unregister_producer() also compare the pointer
>    elements, not just the tokens, and do the unlink only on match.
>
> Any thoughts?
>
> Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP
> CPU: 8 PID: 5190 Comm: qemu-system-x86 Not tainted 6.10.0-rc7+ #6
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> RIP: 0010:irq_bypass_unregister_producer+0xa5/0xd0
> RSP: 0018:ffffc900034d7e50 EFLAGS: 00010246
> RAX: dead000000000122 RBX: ffff888353d12718 RCX: ffff88810336a000
> RDX: dead000000000100 RSI: ffffffff829243a0 RDI: 0000000000000000
> RBP: ffff888353c42000 R08: ffff888104882738 R09: ffff88810336a000
> R10: ffff888448ab2050 R11: 0000000000000000 R12: ffff888353d126a0
> R13: 0000000000000004 R14: 0000000000000055 R15: 0000000000000004
> FS: 00007f9df9403c80(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000562dffc6b568 CR3: 000000012efbb006 CR4: 0000000000772ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? die_addr+0x36/0x90
>  ? exc_general_protection+0x1a8/0x390
>  ? asm_exc_general_protection+0x26/0x30
>  ? irq_bypass_unregister_producer+0xa5/0xd0
>  vhost_vdpa_setup_vq_irq+0x5a/0xc0 [vhost_vdpa]
>  vhost_vdpa_unlocked_ioctl+0xdcd/0xe00 [vhost_vdpa]
>  ? vhost_vdpa_config_cb+0x30/0x30 [vhost_vdpa]
>  __x64_sys_ioctl+0x90/0xc0
>  do_syscall_64+0x4f/0x110
>  entry_SYSCALL_64_after_hwframe+0x4b/0x53
> RIP: 0033:0x7f9df930774f
> RSP: 002b:00007ffc55013080 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 0000562dfe134d20 RCX: 00007f9df930774f
> RDX: 00007ffc55013200 RSI: 000000004008af21 RDI: 0000000000000011
> RBP: 00007ffc55013200 R08: 0000000000000002 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000562dfe134360
> R13: 0000562dfe134d20 R14: 0000000000000000 R15: 00007f9df801e190
>
> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
> ---
>  drivers/vhost/vdpa.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 478cd46a49ed..d4a7a3918d86 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -226,6 +226,7 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
>         struct vhost_virtqueue *vq = &v->vqs[qid];
>
>         irq_bypass_unregister_producer(&vq->call_ctx.producer);
> +       vq->call_ctx.producer.token = NULL;
>  }
>
>  static int _compat_vdpa_reset(struct vhost_vdpa *v)
> --
> 2.45.2
>

Thanks
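For context on step 4: the bypass manager looks elements up by token only, but unlinks the node of the producer it was handed. A simplified sketch of the v6.10-era unregister path in virt/lib/irqbypass.c (module refcounting and might_sleep() trimmed; not verbatim source):

void irq_bypass_unregister_producer(struct irq_bypass_producer *producer)
{
	struct irq_bypass_producer *tmp;
	struct irq_bypass_consumer *consumer;

	if (!producer->token)	/* a lingering token gets past this guard */
		return;

	mutex_lock(&lock);

	list_for_each_entry(tmp, &producers, node) {
		if (tmp->token != producer->token)
			continue;	/* elements are matched by token alone */

		list_for_each_entry(consumer, &consumers, node) {
			if (consumer->token == producer->token) {
				__disconnect(tmp, consumer);
				break;
			}
		}

		/*
		 * The matched element (tmp) may belong to another vq, yet the
		 * node unlinked is the caller's. A node that was already
		 * unlinked holds LIST_POISON pointers (0xdead...), which is
		 * exactly what the oops above dereferences.
		 */
		list_del(&producer->node);
		break;
	}

	mutex_unlock(&lock);
}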
On Fri, 2024-08-02 at 11:29 +0800, Jason Wang wrote:
> On Thu, Aug 1, 2024 at 11:38 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> >
> > The following workflow triggers the crash referenced below:
> >
> > 1) vhost_vdpa_unsetup_vq_irq() unregisters the irq bypass producer
> >    but the producer->token is still valid.
> > 2) vq context gets released and reassigned to another vq.
>
> Just to make sure I understand here, which structure is referred to as
> "vq context" here? I guess it's not call_ctx as it is a part of the vq
> itself.
>
> > 3) That other vq registers its producer with the same vq context
> >    pointer as token in vhost_vdpa_setup_vq_irq().
>
> Or did you mean when a single eventfd is shared among different vqs?
>
Yes, that's what I mean: vq->call_ctx.ctx, which is an eventfd_ctx.

But I don't think it's shared in this case, only that the old eventfd_ctx value
is lingering in producer->token. And this old eventfd_ctx is assigned now to
another vq.

> [...]
On Fri, Aug 2, 2024 at 2:51 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On Fri, 2024-08-02 at 11:29 +0800, Jason Wang wrote:
> > On Thu, Aug 1, 2024 at 11:38 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
> > > [...]
> > > 3) That other vq registers its producer with the same vq context
> > >    pointer as token in vhost_vdpa_setup_vq_irq().
> >
> > Or did you mean when a single eventfd is shared among different vqs?
> >
> Yes, that's what I mean: vq->call_ctx.ctx, which is an eventfd_ctx.
>
> But I don't think it's shared in this case, only that the old eventfd_ctx value
> is lingering in producer->token. And this old eventfd_ctx is assigned now to
> another vq.

Just to make sure I understand the issue. The eventfd_ctx should be
still valid until a new VHOST_SET_VRING_CALL().

I may miss something but the only way to assign exactly the same
eventfd_ctx value to another vq is where the guest tries to share the
MSI-X vector among virtqueues, then qemu will use a single eventfd as
the callback for multiple virtqueues. If this is true:

For bypass registering, only the first registering can succeed as the
following registering will fail because the irq bypass manager already
had exactly the same producer token.

For unregistering, all attempts can succeed:

1) the first unregistering will do the real job that unregisters the token
2) the following unregisterings will do nothing, iterating the
   producer token list without finding a matching one

Maybe you can show me the userspace behaviour (ioctls) when you see this?

Thanks

> [...]
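The registration-side behaviour described here (a second registration with an identical token fails with -EBUSY) follows the same token-matching pattern; a simplified, non-verbatim sketch of the corresponding virt/lib/irqbypass.c path:

int irq_bypass_register_producer(struct irq_bypass_producer *producer)
{
	struct irq_bypass_producer *tmp;
	struct irq_bypass_consumer *consumer;
	int ret = 0;

	if (!producer->token)
		return -EINVAL;

	mutex_lock(&lock);

	list_for_each_entry(tmp, &producers, node) {
		if (tmp->token == producer->token) {
			ret = -EBUSY;	/* duplicate token: only the first register succeeds */
			goto out;
		}
	}

	list_for_each_entry(consumer, &consumers, node) {
		if (consumer->token == producer->token) {
			ret = __connect(producer, consumer);	/* pair with a waiting consumer */
			if (ret)
				goto out;
			break;
		}
	}

	list_add(&producer->node, &producers);
out:
	mutex_unlock(&lock);
	return ret;
}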
On 05.08.24 05:17, Jason Wang wrote:
> On Fri, Aug 2, 2024 at 2:51 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>> [...]
>> But I don't think it's shared in this case, only that the old eventfd_ctx value
>> is lingering in producer->token. And this old eventfd_ctx is assigned now to
>> another vq.
>
> Just to make sure I understand the issue. The eventfd_ctx should be
> still valid until a new VHOST_SET_VRING_CALL().
>
I think it's not about the validity of the eventfd_ctx. More about
the lingering ctx value of the producer after vhost_vdpa_unsetup_vq_irq().
That value is the eventfd ctx, but it could be anything else really...

> I may miss something but the only way to assign exactly the same
> eventfd_ctx value to another vq is where the guest tries to share the
> MSI-X vector among virtqueues, then qemu will use a single eventfd as
> the callback for multiple virtqueues. If this is true:
>
I don't think this is the case. I see the issue happening when running qemu vdpa
live migration tests on the same host. From a vdpa device it's basically a device
starting on a VM over and over.

> For bypass registering, only the first registering can succeed as the
> following registering will fail because the irq bypass manager already
> had exactly the same producer token.
> For unregistering, all attempts can succeed:
>
> 1) the first unregistering will do the real job that unregisters the token
> 2) the following unregisterings will do nothing, iterating the
>    producer token list without finding a matching one
>
> Maybe you can show me the userspace behaviour (ioctls) when you see this?
>
Sure, what would you need? qemu traces?

Thanks,
Dragos

> [...]
On Mon, Aug 5, 2024 at 11:59 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> On 05.08.24 05:17, Jason Wang wrote:
> > [...]
> > Just to make sure I understand the issue. The eventfd_ctx should be
> > still valid until a new VHOST_SET_VRING_CALL().
> >
> I think it's not about the validity of the eventfd_ctx. More about
> the lingering ctx value of the producer after vhost_vdpa_unsetup_vq_irq().

Probably, but

> That value is the eventfd ctx, but it could be anything else really...

I mean we hold a refcnt of the eventfd so it should be valid until the
next set_vring_call() or vhost_dev_cleanup().

But I do spot a possible issue:

1) We swap and assign the new ctx in vhost_vring_ioctl():

        swap(ctx, vq->call_ctx.ctx);

2) and the old ctx will be put there as well:

        if (!IS_ERR_OR_NULL(ctx))
                eventfd_ctx_put(ctx);

3) but in vdpa, we try to unregister the producer with the new token:

static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
                                   void __user *argp)
{
...
        r = vhost_vring_ioctl(&v->vdev, cmd, argp);
...
        switch (cmd) {
...
        case VHOST_SET_VRING_CALL:
                if (vq->call_ctx.ctx) {
                        cb.callback = vhost_vdpa_virtqueue_cb;
                        cb.private = vq;
                        cb.trigger = vq->call_ctx.ctx;
                } else {
                        cb.callback = NULL;
                        cb.private = NULL;
                        cb.trigger = NULL;
                }
                ops->set_vq_cb(vdpa, idx, &cb);
                vhost_vdpa_setup_vq_irq(v, idx);

in vhost_vdpa_setup_vq_irq() we had:

        irq_bypass_unregister_producer(&vq->call_ctx.producer);

here the producer->token still points to the old one...

Is this what you have seen?

> [...]
> > Maybe you can show me the userspace behaviour (ioctls) when you see this?
> >
> Sure, what would you need? qemu traces?

Yes, that would be helpful.

Thanks

> [...]
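For reference, the function quoted above in slightly fuller form -- a simplified sketch of the v6.10-era drivers/vhost/vdpa.c, not verbatim source. The problematic ordering is that the unregister on entry runs with whatever token was set the previous time around:

static void vhost_vdpa_setup_vq_irq(struct vhost_vdpa *v, u16 qid)
{
	struct vhost_virtqueue *vq = &v->vqs[qid];
	const struct vdpa_config_ops *ops = v->vdpa->config;
	int ret, irq;

	if (!ops->get_vq_irq)
		return;

	irq = ops->get_vq_irq(v->vdpa, qid);
	if (irq < 0)
		return;

	/*
	 * producer.token may still hold the eventfd_ctx from before the
	 * swap in vhost_vring_ioctl(). If another vq has meanwhile
	 * registered that same pointer as its token, this unregister
	 * unlinks the caller's already-unlinked node (the crash above).
	 */
	irq_bypass_unregister_producer(&vq->call_ctx.producer);
	if (!vq->call_ctx.ctx)
		return;

	vq->call_ctx.producer.token = vq->call_ctx.ctx;	/* token only refreshed here */
	vq->call_ctx.producer.irq = irq;
	ret = irq_bypass_register_producer(&vq->call_ctx.producer);
	if (unlikely(ret))
		dev_info(&v->dev, "vq %u, irq bypass producer (token %p) registration fails, ret = %d\n",
			 qid, vq->call_ctx.producer.token, ret);
}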
(Re-sending. I messed up the previous message, sorry about that.)

On 06.08.24 04:57, Jason Wang wrote:
> On Mon, Aug 5, 2024 at 11:59 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>> [...]
>> I think it's not about the validity of the eventfd_ctx. More about
>> the lingering ctx value of the producer after vhost_vdpa_unsetup_vq_irq().
>
> Probably, but
>
>> That value is the eventfd ctx, but it could be anything else really...
>
> I mean we hold a refcnt of the eventfd so it should be valid until the
> next set_vring_call() or vhost_dev_cleanup().
>
> But I do spot a possible issue:
>
> 1) We swap and assign the new ctx in vhost_vring_ioctl():
>
>         swap(ctx, vq->call_ctx.ctx);
>
> 2) and the old ctx will be put there as well:
>
>         if (!IS_ERR_OR_NULL(ctx))
>                 eventfd_ctx_put(ctx);
>
> 3) but in vdpa, we try to unregister the producer with the new token:
>
> [...]
>
> in vhost_vdpa_setup_vq_irq() we had:
>
>         irq_bypass_unregister_producer(&vq->call_ctx.producer);
>
> here the producer->token still points to the old one...
>
> Is this what you have seen?

Yup. That is the issue. The unregister already happened at
vhost_vdpa_unsetup_vq_irq(). So this second unregister will
work on an already unregistered element due to the token still
being set.

>>> I may miss something but the only way to assign exactly the same
>>> eventfd_ctx value to another vq is where the guest tries to share the
>>> MSI-X vector among virtqueues, then qemu will use a single eventfd as
>>> the callback for multiple virtqueues. If this is true:
>>>
>> I don't think this is the case. I see the issue happening when running qemu vdpa
>> live migration tests on the same host. From a vdpa device it's basically a device
>> starting on a VM over and over.
>
> [...]
>>> Maybe you can show me the userspace behaviour (ioctls) when you see this?
>>>
>> Sure, what would you need? qemu traces?
>
> Yes, that would be helpful.
>
Will try to get them.

Thanks,
Dragos

> [...]
On 06.08.24 10:18, Dragos Tatulea wrote:
> (Re-sending. I messed up the previous message, sorry about that.)
>
> [...]
>>>> Maybe you can show me the userspace behaviour (ioctls) when you see this?
>>>>
>>> Sure, what would you need? qemu traces?
>>
>> Yes, that would be helpful.
>>
> Will try to get them.
As the traces are quite large (~5MB), I uploaded them in this location [0].
I used the following qemu traces:
--trace vhost_vdpa* --trace virtio_net_handle*

[0] https://drive.google.com/file/d/1XyXYyockJ_O7zMgI7vot6AxYjze9Ljju/view?usp=sharing

Thanks,
Dragos
On Tue, Aug 6, 2024 at 10:45 PM Dragos Tatulea <dtatulea@nvidia.com> wrote:
>
> [...]
> As the traces are quite large (~5MB), I uploaded them in this location [0].
> I used the following qemu traces:
> --trace vhost_vdpa* --trace virtio_net_handle*
>
> [0] https://drive.google.com/file/d/1XyXYyockJ_O7zMgI7vot6AxYjze9Ljju/view?usp=sharing

Thanks for doing this.

So it does not look like a case of eventfd sharing:

"""
153@1722953531.918958:vhost_vdpa_iotlb_begin_batch vdpa:0x7f6f9cfb5190 fd: 17 msg_type: 2 type: 5
153@1722953531.918959:vhost_vdpa_set_vring_base dev: 0x55573cc9ca70 index: 6 num: 0 svq 1
153@1722953531.918961:vhost_vdpa_set_vring_kick dev: 0x55573cc9ca70 index: 6 fd: 237
153@1722953531.918964:vhost_vdpa_set_vring_call dev: 0x55573cc9ca70 index: 6 fd: 238
153@1722953531.918978:vhost_vdpa_dma_map vdpa:0x7f6f9cfb5190 fd: 17 msg_type: 2 asid: 1 iova: 0x13000 size: 0x2000 uaddr: 0x7f6f9da1a000 perm: 0x1 type: 2
153@1722953531.918984:vhost_vdpa_dma_map vdpa:0x7f6f9cfb5190 fd: 17 msg_type: 2 asid: 1 iova: 0x15000 size: 0x1000 uaddr: 0x7f6f9da19000 perm: 0x3 type: 2
153@1722953531.918987:vhost_vdpa_set_vring_addr dev: 0x55573cc9ca70 index: 6 flags: 0x0 desc_user_addr: 0x13000 used_user_addr: 0x15000 avail_user_addr: 0x14000 log_guest_addr: 0x0
153@1722953531.918989:vhost_vdpa_set_vring_base dev: 0x55573cc9ca70 index: 7 num: 0 svq 1
153@1722953531.918991:vhost_vdpa_set_vring_kick dev: 0x55573cc9ca70 index: 7 fd: 239
153@1722953531.918993:vhost_vdpa_set_vring_call dev: 0x55573cc9ca70 index: 7 fd: 240
"""

I think a more proper way is to unregister and clean the token before
calling vhost_vring_ioctl() in the case of SET_VRING_KICK. Let me try
to draft a patch and see.

Thanks
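One way the proposed reordering could look in vhost_vdpa_vring_ioctl() -- a hypothetical sketch only, not the RFC that was posted, and placed on the VHOST_SET_VRING_CALL path since that is where the stale producer token comes from in this thread:

	/*
	 * Hypothetical: drop the producer while producer.token still
	 * matches the old eventfd_ctx, i.e. before vhost_vring_ioctl()
	 * swaps the new ctx in and puts the old one.
	 */
	if (cmd == VHOST_SET_VRING_CALL) {
		vhost_vdpa_unsetup_vq_irq(v, idx);
		vq->call_ctx.producer.token = NULL;
	}

	r = vhost_vring_ioctl(&v->vdev, cmd, argp);	/* swap happens in here */
	...
	case VHOST_SET_VRING_CALL:
		...
		ops->set_vq_cb(vdpa, idx, &cb);
		vhost_vdpa_setup_vq_irq(v, idx);	/* re-registers with the new token */

With this ordering the producer is never left holding a token that a later vq can legitimately claim.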
On Thu, Aug 8, 2024 at 10:56 AM Jason Wang <jasowang@redhat.com> wrote:
>
> [...]
>
> I think a more proper way is to unregister and clean the token before
> calling vhost_vring_ioctl() in the case of SET_VRING_KICK. Let me try
> to draft a patch and see.

I've posted an RFC patch, please try to see if it works.

Thanks
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 478cd46a49ed..d4a7a3918d86 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -226,6 +226,7 @@ static void vhost_vdpa_unsetup_vq_irq(struct vhost_vdpa *v, u16 qid)
 	struct vhost_virtqueue *vq = &v->vqs[qid];
 
 	irq_bypass_unregister_producer(&vq->call_ctx.producer);
+	vq->call_ctx.producer.token = NULL;
 }
 
 static int _compat_vdpa_reset(struct vhost_vdpa *v)
The following workflow triggers the crash referenced below:

1) vhost_vdpa_unsetup_vq_irq() unregisters the irq bypass producer
   but the producer->token is still valid.
2) vq context gets released and reassigned to another vq.
3) That other vq registers its producer with the same vq context
   pointer as token in vhost_vdpa_setup_vq_irq().
4) The original vq tries to unregister its producer which it has
   already unlinked in step 1. irq_bypass_unregister_producer() will go
   ahead and unlink the producer once again. That happens because:
   a) The producer has a token.
   b) An element with that token is found. But that element comes
      from step 3.

I see 3 ways to fix this:
1) Fix the vhost-vdpa part. What this patch does. vfio has a different
   workflow.
2) Set the token to NULL directly in irq_bypass_unregister_producer()
   after unlinking the producer. But that makes the API asymmetrical.
3) Make irq_bypass_unregister_producer() also compare the pointer
   elements, not just the tokens, and do the unlink only on match.

Any thoughts?

Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] SMP
CPU: 8 PID: 5190 Comm: qemu-system-x86 Not tainted 6.10.0-rc7+ #6
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:irq_bypass_unregister_producer+0xa5/0xd0
RSP: 0018:ffffc900034d7e50 EFLAGS: 00010246
RAX: dead000000000122 RBX: ffff888353d12718 RCX: ffff88810336a000
RDX: dead000000000100 RSI: ffffffff829243a0 RDI: 0000000000000000
RBP: ffff888353c42000 R08: ffff888104882738 R09: ffff88810336a000
R10: ffff888448ab2050 R11: 0000000000000000 R12: ffff888353d126a0
R13: 0000000000000004 R14: 0000000000000055 R15: 0000000000000004
FS: 00007f9df9403c80(0000) GS:ffff88852cc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562dffc6b568 CR3: 000000012efbb006 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? die_addr+0x36/0x90
 ? exc_general_protection+0x1a8/0x390
 ? asm_exc_general_protection+0x26/0x30
 ? irq_bypass_unregister_producer+0xa5/0xd0
 vhost_vdpa_setup_vq_irq+0x5a/0xc0 [vhost_vdpa]
 vhost_vdpa_unlocked_ioctl+0xdcd/0xe00 [vhost_vdpa]
 ? vhost_vdpa_config_cb+0x30/0x30 [vhost_vdpa]
 __x64_sys_ioctl+0x90/0xc0
 do_syscall_64+0x4f/0x110
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f9df930774f
RSP: 002b:00007ffc55013080 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000562dfe134d20 RCX: 00007f9df930774f
RDX: 00007ffc55013200 RSI: 000000004008af21 RDI: 0000000000000011
RBP: 00007ffc55013200 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000562dfe134360
R13: 0000562dfe134d20 R14: 0000000000000000 R15: 00007f9df801e190

Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
---
 drivers/vhost/vdpa.c | 1 +
 1 file changed, 1 insertion(+)
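For completeness, fix option 3 from the list above would make the unregister path match the element itself rather than only its token. A hypothetical sketch against the same v6.10-era irqbypass code shown earlier in the thread, not a posted patch:

void irq_bypass_unregister_producer(struct irq_bypass_producer *producer)
{
	struct irq_bypass_producer *tmp;
	struct irq_bypass_consumer *consumer;

	if (!producer->token)
		return;

	mutex_lock(&lock);

	list_for_each_entry(tmp, &producers, node) {
		if (tmp != producer)	/* compare elements, not just tokens */
			continue;

		list_for_each_entry(consumer, &consumers, node) {
			if (consumer->token == producer->token) {
				__disconnect(tmp, consumer);
				break;
			}
		}

		list_del(&producer->node);
		break;
	}

	mutex_unlock(&lock);
}

An already-unregistered producer is no longer on the producers list, so the loop simply finds nothing and the double unlink cannot happen.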