[v1,0/2] Add ability of VFIO driver to ignore reset when device doesn't need it

Message ID 20211014095748.84604-1-yaozhenguo1@gmail.com (mailing list archive)

Message

yaozhenguo Oct. 14, 2021, 9:57 a.m. UTC
In some scenarios, a vfio device must not be reset during initialization.
One example is NVSwitch and A100 GPUs working in the Shared NVSwitch
Virtualization Model. In this mode there are two types of VM: the service
VM and the guest VM. The GPU devices are initialized in the following steps:

1. The service VM boots up. The GPUs and NVSwitches are passed through to
the service VM, where the NVIDIA driver and manager software configure them.

2. The selected GPUs are unplugged from the service VM.

3. The guest VM boots up with the selected GPUs passed through.

The selected GPUs must not be reset in step 3, or they will fail to
initialize in the guest VM.

This patchset adds a PCI sysfs interface, ignore_reset, which drivers can
use to control whether to perform a PCI reset. For example, in the Shared
NVSwitch Virtualization Model, the hypervisor can disable PCI reset by
setting ignore_reset to 1 before the guest VM boots up.
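
As a rough illustration of how a hypervisor-side tool might use the proposed
attribute, here is a minimal user-space sketch in C. It assumes the attribute
would appear as ignore_reset in each device's directory under
/sys/bus/pci/devices, as this series proposes; the attribute is not part of
any mainline kernel. The helper names and the sysfs_root parameter are
hypothetical and exist only for illustration.

```c
/*
 * Hypothetical sketch: "ignore_reset" is the attribute proposed by this
 * series, not an existing mainline sysfs file. Helper names are made up.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Build "<sysfs_root>/bus/pci/devices/<bdf>/ignore_reset" into buf.
 * Returns 0 on success, -1 if the path does not fit in len bytes.
 */
static int ignore_reset_path(char *buf, size_t len,
                             const char *sysfs_root, const char *bdf)
{
    int n = snprintf(buf, len, "%s/bus/pci/devices/%s/ignore_reset",
                     sysfs_root, bdf);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write "1" (enable != 0) or "0" to the attribute.
 * Returns 0 on success or a negative errno value on failure.
 */
static int set_ignore_reset(const char *sysfs_root, const char *bdf,
                            int enable)
{
    char path[4096];
    const char *val = enable ? "1" : "0";
    ssize_t written;
    int fd, err;

    if (ignore_reset_path(path, sizeof(path), sysfs_root, bdf) < 0)
        return -ENAMETOOLONG;

    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -errno;

    written = write(fd, val, 1);
    if (written == 1)
        err = 0;
    else if (written < 0)
        err = -errno;
    else
        err = -EIO;

    close(fd);
    return err;
}
```

A caller would invoke something like set_ignore_reset("/sys", "0000:3b:00.0", 1)
before starting the guest VM, where the device address shown is only a
placeholder. Passing the sysfs root as a parameter keeps the helper easy to
exercise against a scratch directory.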

Zhenguo Yao (2):
  PCI: Add ignore_reset sysfs interface to control whether to do device
    reset in PCI drivers
  vfio-pci: Don't do device reset when ignore_reset is set

 drivers/pci/pci-sysfs.c          | 25 +++++++++++++++++
 drivers/vfio/pci/vfio_pci_core.c | 48 ++++++++++++++++++++------------
 include/linux/pci.h              |  1 +
 3 files changed, 56 insertions(+), 18 deletions(-)

Comments

Alex Williamson Oct. 14, 2021, 12:48 p.m. UTC | #1
This all seems like code to mask that these NVSwitch configurations are
probably insecure because we can't factor and manage NVSwitch isolation
into IOMMU grouping.  I'm guessing this "service VM" pokes proprietary
registers to manage that isolation and perhaps later resetting devices
negates that programming.  A more proper solution is probably to do our
best to guess the span of an NVSwitch configuration and make the IOMMU
group include all the devices, until NVIDIA provides proper code for
the kernel to understand this interconnect and how it affects DMA
isolation.  Nak on disabling resets for the purpose of preventing a
user from undoing proprietary device programming.  Thanks,

Alex
yaozhenguo Oct. 14, 2021, 1:37 p.m. UTC | #2
OK. Thank you. Let's wait for NVIDIA's solution.
