[8/9] vfio/pci: use x86 naming instead of igd

Message ID	20210201162828.5938-9-mgurtovoy@nvidia.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> TLS: TLSv1.2, AES256-SHA) id <B60182c5a0001>; Mon, 01 Feb 2021 08:29:14 -0800 From: Max Gurtovoy <mgurtovoy@nvidia.com> To: <jgg@nvidia.com>, <cohuck@redhat.com>, <kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <alex.williamson@redhat.com> CC: <liranl@nvidia.com>, <oren@nvidia.com>, <tzahio@nvidia.com>, <leonro@nvidia.com>, <yarong@nvidia.com>, <aviadye@nvidia.com>, <shahafs@nvidia.com>, <artemp@nvidia.com>, <kwankhede@nvidia.com>, <ACurrid@nvidia.com>, <gmataev@nvidia.com>, <cjia@nvidia.com>, <mjrosato@linux.ibm.com>, <yishaih@nvidia.com>, <aik@ozlabs.ru>, Max Gurtovoy <mgurtovoy@nvidia.com> Subject: [PATCH 8/9] vfio/pci: use x86 naming instead of igd Date: Mon, 1 Feb 2021 16:28:27 +0000 Message-ID: <20210201162828.5938-9-mgurtovoy@nvidia.com> In-Reply-To: <20210201162828.5938-1-mgurtovoy@nvidia.com> References: <20210201162828.5938-1-mgurtovoy@nvidia.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain Precedence: bulk
Series	Introduce vfio-pci-core subsystem \| expand [v2,0/9] Introduce vfio-pci-core subsystem [1/9] vfio-pci: rename vfio_pci.c to vfio_pci_core.c [2/9] vfio-pci: introduce vfio_pci_core subsystem driver [3/9] vfio-pci-core: export vfio_pci_register_dev_region function [4/9] mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices [5/9] vfio-pci/zdev: remove unused vdev argument [6/9] vfio-pci/zdev: fix possible segmentation fault issue [7/9] vfio/pci: use s390 naming instead of zdev [8/9] vfio/pci: use x86 naming instead of igd [9/9] vfio/pci: use powernv naming instead of nvlink2

Max Gurtovoy Feb. 1, 2021, 4:28 p.m. UTC

This patch doesn't change any logic but only align to the concept of
vfio_pci_core extensions. Extensions that are related to a platform
and not to a specific vendor of PCI devices should be part of the core
driver. Extensions that are specific for PCI device vendor should go
to a dedicated vendor vfio-pci driver.

For now, x86 extensions will include only igd.

Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
 drivers/vfio/pci/Kconfig                            | 13 ++++++-------
 drivers/vfio/pci/Makefile                           |  2 +-
 drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
 drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
 drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
 5 files changed, 9 insertions(+), 10 deletions(-)
 rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

Cornelia Huck Feb. 1, 2021, 5:14 p.m. UTC | #1

On Mon, 1 Feb 2021 16:28:27 +0000
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> This patch doesn't change any logic but only align to the concept of
> vfio_pci_core extensions. Extensions that are related to a platform
> and not to a specific vendor of PCI devices should be part of the core
> driver. Extensions that are specific for PCI device vendor should go
> to a dedicated vendor vfio-pci driver.

My understanding is that igd means support for Intel graphics, i.e. a
strict subset of x86. If there are other future extensions that e.g.
only make sense for some devices found only on AMD systems, I don't
think they should all be included under the same x86 umbrella.

Similar reasoning for nvlink, that only seems to cover support for some
GPUs under Power, and is not a general platform-specific extension IIUC.

We can arguably do the zdev -> s390 rename (as zpci appears only on
s390, and all PCI devices will be zpci on that platform), although I'm
not sure about the benefit.

> 
> For now, x86 extensions will include only igd.
> 
> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> ---
>  drivers/vfio/pci/Kconfig                            | 13 ++++++-------
>  drivers/vfio/pci/Makefile                           |  2 +-
>  drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
>  drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
>  drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
>  5 files changed, 9 insertions(+), 10 deletions(-)
>  rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)

(...)

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c559027def2d..e0e258c37fb5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  
>  	if (vfio_pci_is_vga(pdev) &&
>  	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
>  		ret = vfio_pci_igd_init(vdev);

This one explicitly checks for Intel devices, so I'm not sure why you
want to generalize this to x86?

>  		if (ret && ret != -ENODEV) {
>  			pci_warn(pdev, "Failed to setup Intel IGD regions\n");

Matthew Rosato Feb. 1, 2021, 5:49 p.m. UTC | #2

On 2/1/21 12:14 PM, Cornelia Huck wrote:
> On Mon, 1 Feb 2021 16:28:27 +0000
> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> 
>> This patch doesn't change any logic but only align to the concept of
>> vfio_pci_core extensions. Extensions that are related to a platform
>> and not to a specific vendor of PCI devices should be part of the core
>> driver. Extensions that are specific for PCI device vendor should go
>> to a dedicated vendor vfio-pci driver.
> 
> My understanding is that igd means support for Intel graphics, i.e. a
> strict subset of x86. If there are other future extensions that e.g.
> only make sense for some devices found only on AMD systems, I don't
> think they should all be included under the same x86 umbrella.
> 
> Similar reasoning for nvlink, that only seems to cover support for some
> GPUs under Power, and is not a general platform-specific extension IIUC.
> 
> We can arguably do the zdev -> s390 rename (as zpci appears only on
> s390, and all PCI devices will be zpci on that platform), although I'm
> not sure about the benefit.

As far as I can tell, there isn't any benefit for s390 it's just 
"re-branding" to match the platform name rather than the zdev moniker, 
which admittedly perhaps makes it more clear to someone outside of s390 
that any PCI device on s390 is a zdev/zpci type, and thus will use this 
extension to vfio_pci(_core).  This would still be true even if we added 
something later that builds atop it (e.g. a platform-specific device 
like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
these zdev extensions today and would need to continue using them in a 
world where mlx5-vfio-pci.ko exists.

I guess all that to say: if such a rename matches the 'grand scheme' of 
this design where we treat arch-level extensions to vfio_pci(_core) as 
"vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
by itself it's not very exciting :)

> 
>>
>> For now, x86 extensions will include only igd.
>>
>> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
>> ---
>>   drivers/vfio/pci/Kconfig                            | 13 ++++++-------
>>   drivers/vfio/pci/Makefile                           |  2 +-
>>   drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
>>   drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
>>   drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
>>   5 files changed, 9 insertions(+), 10 deletions(-)
>>   rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)
> 
> (...)
> 
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index c559027def2d..e0e258c37fb5 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>>   
>>   	if (vfio_pci_is_vga(pdev) &&
>>   	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
>> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
>> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
>>   		ret = vfio_pci_igd_init(vdev);
> 
> This one explicitly checks for Intel devices, so I'm not sure why you
> want to generalize this to x86?
> 
>>   		if (ret && ret != -ENODEV) {
>>   			pci_warn(pdev, "Failed to setup Intel IGD regions\n");
>

Alex Williamson Feb. 1, 2021, 6:42 p.m. UTC | #3

On Mon, 1 Feb 2021 12:49:12 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 2/1/21 12:14 PM, Cornelia Huck wrote:
> > On Mon, 1 Feb 2021 16:28:27 +0000
> > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >   
> >> This patch doesn't change any logic but only align to the concept of
> >> vfio_pci_core extensions. Extensions that are related to a platform
> >> and not to a specific vendor of PCI devices should be part of the core
> >> driver. Extensions that are specific for PCI device vendor should go
> >> to a dedicated vendor vfio-pci driver.  
> > 
> > My understanding is that igd means support for Intel graphics, i.e. a
> > strict subset of x86. If there are other future extensions that e.g.
> > only make sense for some devices found only on AMD systems, I don't
> > think they should all be included under the same x86 umbrella.
> > 
> > Similar reasoning for nvlink, that only seems to cover support for some
> > GPUs under Power, and is not a general platform-specific extension IIUC.
> > 
> > We can arguably do the zdev -> s390 rename (as zpci appears only on
> > s390, and all PCI devices will be zpci on that platform), although I'm
> > not sure about the benefit.  
> 
> As far as I can tell, there isn't any benefit for s390 it's just 
> "re-branding" to match the platform name rather than the zdev moniker, 
> which admittedly perhaps makes it more clear to someone outside of s390 
> that any PCI device on s390 is a zdev/zpci type, and thus will use this 
> extension to vfio_pci(_core).  This would still be true even if we added 
> something later that builds atop it (e.g. a platform-specific device 
> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
> these zdev extensions today and would need to continue using them in a 
> world where mlx5-vfio-pci.ko exists.
> 
> I guess all that to say: if such a rename matches the 'grand scheme' of 
> this design where we treat arch-level extensions to vfio_pci(_core) as 
> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
> by itself it's not very exciting :)

This all seems like the wrong direction to me.  The goal here is to
modularize vfio-pci into a core library and derived vendor modules that
make use of that core library.  If existing device specific extensions
within vfio-pci cannot be turned into vendor modules through this
support and are instead redefined as platform specific features of the
new core library, that feels like we're already admitting failure of
this core library to support known devices, let alone future devices.

IGD is a specific set of devices.  They happen to rely on some platform
specific support, whose availability should be determined via the
vendor module probe callback.  Packing that support into an "x86"
component as part of the core feels not only short sighted, but also
avoids addressing the issues around how userspace determines an optimal
module to use for a device.  Thanks,

Alex

> >> For now, x86 extensions will include only igd.
> >>
> >> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
> >> ---
> >>   drivers/vfio/pci/Kconfig                            | 13 ++++++-------
> >>   drivers/vfio/pci/Makefile                           |  2 +-
> >>   drivers/vfio/pci/vfio_pci_core.c                    |  2 +-
> >>   drivers/vfio/pci/vfio_pci_private.h                 |  2 +-
> >>   drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} |  0
> >>   5 files changed, 9 insertions(+), 10 deletions(-)
> >>   rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (100%)  
> > 
> > (...)
> >   
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index c559027def2d..e0e258c37fb5 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -328,7 +328,7 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
> >>   
> >>   	if (vfio_pci_is_vga(pdev) &&
> >>   	    pdev->vendor == PCI_VENDOR_ID_INTEL &&
> >> -	    IS_ENABLED(CONFIG_VFIO_PCI_IGD)) {
> >> +	    IS_ENABLED(CONFIG_VFIO_PCI_X86)) {
> >>   		ret = vfio_pci_igd_init(vdev);  
> > 
> > This one explicitly checks for Intel devices, so I'm not sure why you
> > want to generalize this to x86?
> >   
> >>   		if (ret && ret != -ENODEV) {
> >>   			pci_warn(pdev, "Failed to setup Intel IGD regions\n");  
> >   
>

Cornelia Huck Feb. 2, 2021, 4:06 p.m. UTC | #4

On Mon, 1 Feb 2021 11:42:30 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 1 Feb 2021 12:49:12 -0500
> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> 
> > On 2/1/21 12:14 PM, Cornelia Huck wrote:  
> > > On Mon, 1 Feb 2021 16:28:27 +0000
> > > Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> > >     
> > >> This patch doesn't change any logic but only align to the concept of
> > >> vfio_pci_core extensions. Extensions that are related to a platform
> > >> and not to a specific vendor of PCI devices should be part of the core
> > >> driver. Extensions that are specific for PCI device vendor should go
> > >> to a dedicated vendor vfio-pci driver.    
> > > 
> > > My understanding is that igd means support for Intel graphics, i.e. a
> > > strict subset of x86. If there are other future extensions that e.g.
> > > only make sense for some devices found only on AMD systems, I don't
> > > think they should all be included under the same x86 umbrella.
> > > 
> > > Similar reasoning for nvlink, that only seems to cover support for some
> > > GPUs under Power, and is not a general platform-specific extension IIUC.
> > > 
> > > We can arguably do the zdev -> s390 rename (as zpci appears only on
> > > s390, and all PCI devices will be zpci on that platform), although I'm
> > > not sure about the benefit.    
> > 
> > As far as I can tell, there isn't any benefit for s390 it's just 
> > "re-branding" to match the platform name rather than the zdev moniker, 
> > which admittedly perhaps makes it more clear to someone outside of s390 
> > that any PCI device on s390 is a zdev/zpci type, and thus will use this 
> > extension to vfio_pci(_core).  This would still be true even if we added 
> > something later that builds atop it (e.g. a platform-specific device 
> > like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses 
> > these zdev extensions today and would need to continue using them in a 
> > world where mlx5-vfio-pci.ko exists.
> > 
> > I guess all that to say: if such a rename matches the 'grand scheme' of 
> > this design where we treat arch-level extensions to vfio_pci(_core) as 
> > "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But 
> > by itself it's not very exciting :)  
> 
> This all seems like the wrong direction to me.  The goal here is to
> modularize vfio-pci into a core library and derived vendor modules that
> make use of that core library.  If existing device specific extensions
> within vfio-pci cannot be turned into vendor modules through this
> support and are instead redefined as platform specific features of the
> new core library, that feels like we're already admitting failure of
> this core library to support known devices, let alone future devices.
> 
> IGD is a specific set of devices.  They happen to rely on some platform
> specific support, whose availability should be determined via the
> vendor module probe callback.  Packing that support into an "x86"
> component as part of the core feels not only short sighted, but also
> avoids addressing the issues around how userspace determines an optimal
> module to use for a device.

Hm, it seems that not all current extensions to the vfio-pci code are
created equal.

IIUC, we have igd and nvlink, which are sets of devices that only show
up on x86 or ppc, respectively, and may rely on some special features
of those architectures/platforms. The important point is that you have
a device identifier that you can match a driver against.

On the other side, we have the zdev support, which both requires s390
and applies to any pci device on s390. If we added special handling for
ISM on s390, ISM would be in a category similar to igd/nvlink.

Now, if somebody plugs a mlx5 device into an s390, we would want both
the zdev support and the specialized mlx5 driver. Maybe zdev should be
some kind of library that can be linked into normal vfio-pci, into
vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
s390 (if configured into the kernel).

Jason Gunthorpe Feb. 2, 2021, 5:10 p.m. UTC | #5

On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:

> On the other side, we have the zdev support, which both requires s390
> and applies to any pci device on s390.

Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
s390?

It would be like returning data from ACPI on other platforms.

It really seems like part of vfio-pci-core

Jason

Max Gurtovoy Feb. 2, 2021, 5:41 p.m. UTC | #6

On 2/2/2021 6:06 PM, Cornelia Huck wrote:
> On Mon, 1 Feb 2021 11:42:30 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
>
>> On Mon, 1 Feb 2021 12:49:12 -0500
>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>
>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>      
>>>>> This patch doesn't change any logic but only align to the concept of
>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>> and not to a specific vendor of PCI devices should be part of the core
>>>>> driver. Extensions that are specific for PCI device vendor should go
>>>>> to a dedicated vendor vfio-pci driver.
>>>> My understanding is that igd means support for Intel graphics, i.e. a
>>>> strict subset of x86. If there are other future extensions that e.g.
>>>> only make sense for some devices found only on AMD systems, I don't
>>>> think they should all be included under the same x86 umbrella.
>>>>
>>>> Similar reasoning for nvlink, that only seems to cover support for some
>>>> GPUs under Power, and is not a general platform-specific extension IIUC.
>>>>
>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>> s390, and all PCI devices will be zpci on that platform), although I'm
>>>> not sure about the benefit.
>>> As far as I can tell, there isn't any benefit for s390 it's just
>>> "re-branding" to match the platform name rather than the zdev moniker,
>>> which admittedly perhaps makes it more clear to someone outside of s390
>>> that any PCI device on s390 is a zdev/zpci type, and thus will use this
>>> extension to vfio_pci(_core).  This would still be true even if we added
>>> something later that builds atop it (e.g. a platform-specific device
>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses
>>> these zdev extensions today and would need to continue using them in a
>>> world where mlx5-vfio-pci.ko exists.
>>>
>>> I guess all that to say: if such a rename matches the 'grand scheme' of
>>> this design where we treat arch-level extensions to vfio_pci(_core) as
>>> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But
>>> by itself it's not very exciting :)
>> This all seems like the wrong direction to me.  The goal here is to
>> modularize vfio-pci into a core library and derived vendor modules that
>> make use of that core library.  If existing device specific extensions
>> within vfio-pci cannot be turned into vendor modules through this
>> support and are instead redefined as platform specific features of the
>> new core library, that feels like we're already admitting failure of
>> this core library to support known devices, let alone future devices.
>>
>> IGD is a specific set of devices.  They happen to rely on some platform
>> specific support, whose availability should be determined via the
>> vendor module probe callback.  Packing that support into an "x86"
>> component as part of the core feels not only short sighted, but also
>> avoids addressing the issues around how userspace determines an optimal
>> module to use for a device.
> Hm, it seems that not all current extensions to the vfio-pci code are
> created equal.
>
> IIUC, we have igd and nvlink, which are sets of devices that only show
> up on x86 or ppc, respectively, and may rely on some special features
> of those architectures/platforms. The important point is that you have
> a device identifier that you can match a driver against.

maybe you can supply the ids ?

Alexey K, I saw you've been working on the NVLINK2 for P9. can you 
supply the exact ids for that should be bounded to this driver ?

I'll add it to V3.

>
> On the other side, we have the zdev support, which both requires s390
> and applies to any pci device on s390. If we added special handling for
> ISM on s390, ISM would be in a category similar to igd/nvlink.
>
> Now, if somebody plugs a mlx5 device into an s390, we would want both
> the zdev support and the specialized mlx5 driver. Maybe zdev should be
> some kind of library that can be linked into normal vfio-pci, into
> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
> s390 (if configured into the kernel).
>

Alex Williamson Feb. 2, 2021, 5:54 p.m. UTC | #7

On Tue, 2 Feb 2021 19:41:16 +0200
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
> > On Mon, 1 Feb 2021 11:42:30 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >  
> >> On Mon, 1 Feb 2021 12:49:12 -0500
> >> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
> >>  
> >>> On 2/1/21 12:14 PM, Cornelia Huck wrote:  
> >>>> On Mon, 1 Feb 2021 16:28:27 +0000
> >>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
> >>>>        
> >>>>> This patch doesn't change any logic but only align to the concept of
> >>>>> vfio_pci_core extensions. Extensions that are related to a platform
> >>>>> and not to a specific vendor of PCI devices should be part of the core
> >>>>> driver. Extensions that are specific for PCI device vendor should go
> >>>>> to a dedicated vendor vfio-pci driver.  
> >>>> My understanding is that igd means support for Intel graphics, i.e. a
> >>>> strict subset of x86. If there are other future extensions that e.g.
> >>>> only make sense for some devices found only on AMD systems, I don't
> >>>> think they should all be included under the same x86 umbrella.
> >>>>
> >>>> Similar reasoning for nvlink, that only seems to cover support for some
> >>>> GPUs under Power, and is not a general platform-specific extension IIUC.
> >>>>
> >>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
> >>>> s390, and all PCI devices will be zpci on that platform), although I'm
> >>>> not sure about the benefit.  
> >>> As far as I can tell, there isn't any benefit for s390 it's just
> >>> "re-branding" to match the platform name rather than the zdev moniker,
> >>> which admittedly perhaps makes it more clear to someone outside of s390
> >>> that any PCI device on s390 is a zdev/zpci type, and thus will use this
> >>> extension to vfio_pci(_core).  This would still be true even if we added
> >>> something later that builds atop it (e.g. a platform-specific device
> >>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on s390x uses
> >>> these zdev extensions today and would need to continue using them in a
> >>> world where mlx5-vfio-pci.ko exists.
> >>>
> >>> I guess all that to say: if such a rename matches the 'grand scheme' of
> >>> this design where we treat arch-level extensions to vfio_pci(_core) as
> >>> "vfio_pci_(arch)" then I'm not particularly opposed to the rename.  But
> >>> by itself it's not very exciting :)  
> >> This all seems like the wrong direction to me.  The goal here is to
> >> modularize vfio-pci into a core library and derived vendor modules that
> >> make use of that core library.  If existing device specific extensions
> >> within vfio-pci cannot be turned into vendor modules through this
> >> support and are instead redefined as platform specific features of the
> >> new core library, that feels like we're already admitting failure of
> >> this core library to support known devices, let alone future devices.
> >>
> >> IGD is a specific set of devices.  They happen to rely on some platform
> >> specific support, whose availability should be determined via the
> >> vendor module probe callback.  Packing that support into an "x86"
> >> component as part of the core feels not only short sighted, but also
> >> avoids addressing the issues around how userspace determines an optimal
> >> module to use for a device.  
> > Hm, it seems that not all current extensions to the vfio-pci code are
> > created equal.
> >
> > IIUC, we have igd and nvlink, which are sets of devices that only show
> > up on x86 or ppc, respectively, and may rely on some special features
> > of those architectures/platforms. The important point is that you have
> > a device identifier that you can match a driver against.  
> 
> maybe you can supply the ids ?
> 
> Alexey K, I saw you've been working on the NVLINK2 for P9. can you 
> supply the exact ids for that should be bounded to this driver ?
> 
> I'll add it to V3.

As noted previously, if we start adding ids for vfio drivers then we
create conflicts with the native host driver.  We cannot register a
vfio PCI driver that automatically claims devices.  At best, this
NVLink driver and an IGD driver could reject devices that they don't
support, ie. NVIDIA GPUs where there's not the correct platform
provided support or Intel GPUs without an OpRegion.  Thanks,

Alex

Jason Gunthorpe Feb. 2, 2021, 6:50 p.m. UTC | #8

On Tue, Feb 02, 2021 at 10:54:55AM -0700, Alex Williamson wrote:

> As noted previously, if we start adding ids for vfio drivers then we
> create conflicts with the native host driver.  We cannot register a
> vfio PCI driver that automatically claims devices.

We can't do that in vfio_pci.ko, but a nvlink_vfio_pci.ko can, just
like the RFC showed with the mlx5 example. The key thing is the module
is not autoloadable and there is no modules.alias data for the PCI
IDs.

The admin must explicitly load the module, just like the admin must
explicitly do "cat > new_id". "modprobe nvlink_vfio_pci" replaces
"newid", and preloading the correct IDs into the module's driver makes
the entire admin experience much more natural and safe.

This could be improved with some simple work in the driver core:

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 2f32f38a11ed0b..dc3b088ad44d69 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -828,6 +828,9 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	bool async_allowed;
 	int ret;

+	if (drv->flags & DRIVER_EXPLICIT_BIND_ONLY)
+		continue;
+
 	ret = driver_match_device(drv, dev);
 	if (ret == 0) {
 		/* no match */

Thus the match table could be properly specified, but only explicit
manual bind would attach the driver. This would cleanly resolve the
duplicate ID problem, and we could even set a wildcard PCI match table
for vfio_pci and eliminate the new_id part of the sequence.

However, I'd prefer to split any driver core work from VFIO parts - so
I'd propose starting by splitting to vfio_pci_core.ko, vfio_pci.ko,
nvlink_vfio_pci.ko, and igd_vfio_pci.ko working as above.

For uAPI compatability vfio_pci.ko would need some
request_module()/symbol_get() trick to pass control over to the device
specific module.

Jason

Christoph Hellwig Feb. 2, 2021, 6:55 p.m. UTC | #9

On Tue, Feb 02, 2021 at 02:50:17PM -0400, Jason Gunthorpe wrote:
> For uAPI compatability vfio_pci.ko would need some
> request_module()/symbol_get() trick to pass control over to the device
> specific module.

Err, don't go there.

Please do the DRIVER_EXPLICIT_BIND_ONLY thing first, which avoids the
need for such hacks.

Jason Gunthorpe Feb. 2, 2021, 7:05 p.m. UTC | #10

On Tue, Feb 02, 2021 at 06:55:18PM +0000, Christoph Hellwig wrote:
> On Tue, Feb 02, 2021 at 02:50:17PM -0400, Jason Gunthorpe wrote:
> > For uAPI compatability vfio_pci.ko would need some
> > request_module()/symbol_get() trick to pass control over to the device
> > specific module.
> 
> Err, don't go there.
> 
> Please do the DRIVER_EXPLICIT_BIND_ONLY thing first, which avoids the
> need for such hacks.

Ah, right good point, we don't need to avoid loading the nvlink.ko if
loading it is harmless.

Jason

Alex Williamson Feb. 2, 2021, 7:37 p.m. UTC | #11

On Tue, 2 Feb 2021 14:50:17 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 02, 2021 at 10:54:55AM -0700, Alex Williamson wrote:
> 
> > As noted previously, if we start adding ids for vfio drivers then we
> > create conflicts with the native host driver.  We cannot register a
> > vfio PCI driver that automatically claims devices.  
> 
> We can't do that in vfio_pci.ko, but a nvlink_vfio_pci.ko can, just
> like the RFC showed with the mlx5 example. The key thing is the module
> is not autoloadable and there is no modules.alias data for the PCI
> IDs.
> 
> The admin must explicitly load the module, just like the admin must
> explicitly do "cat > new_id". "modprobe nvlink_vfio_pci" replaces
> "newid", and preloading the correct IDs into the module's driver makes
> the entire admin experience much more natural and safe.
> 
> This could be improved with some simple work in the driver core:
> 
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 2f32f38a11ed0b..dc3b088ad44d69 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -828,6 +828,9 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
>  	bool async_allowed;
>  	int ret;
>  
> +	if (drv->flags & DRIVER_EXPLICIT_BIND_ONLY)
> +		continue;
> +
>  	ret = driver_match_device(drv, dev);
>  	if (ret == 0) {
>  		/* no match */
> 
> Thus the match table could be properly specified, but only explicit
> manual bind would attach the driver. This would cleanly resolve the
> duplicate ID problem, and we could even set a wildcard PCI match table
> for vfio_pci and eliminate the new_id part of the sequence.
> 
> However, I'd prefer to split any driver core work from VFIO parts - so
> I'd propose starting by splitting to vfio_pci_core.ko, vfio_pci.ko,
> nvlink_vfio_pci.ko, and igd_vfio_pci.ko working as above.

For the most part, this explicit bind interface is redundant to
driver_override, which already avoids the duplicate ID issue.  A user
specifies a driver to use for a given device, which automatically makes
the driver match accept the device and there are no conflicts with
native drivers.  The problem is still how the user knows to choose
vfio-pci-igd for a device versus vfio-pci-nvlink, other vendor drivers,
or vfio-pci.

A driver id table doesn't really help for binding the device,
ultimately even if a device is in the id table it might fail to probe
due to the missing platform support that each of these igd and nvlink
drivers expose, at which point the user would need to pick a next best
options.  Are you somehow proposing the driver id table for the user to
understand possible drivers, even if that doesn't prioritize them?  I
don't see that there's anything new here otherwise.  Thanks,

Alex

Jason Gunthorpe Feb. 2, 2021, 8:44 p.m. UTC | #12

On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:

> For the most part, this explicit bind interface is redundant to
> driver_override, which already avoids the duplicate ID issue.  

No, the point here is to have the ID tables in the PCI drivers because
they fundamentally only work with their supported IDs. The normal
driver core ID tables are a replacement for all the hardwired if's in
vfio_pci.

driver_override completely disables all the ID checking, it seems only
useful for vfio_pci which works with everything. It should not be used
with something like nvlink_vfio_pci.ko that needs ID checking.

Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
driver_override because we could set the PCI any match on vfio_pci and
manage the driver binding explicitly instead.

> A driver id table doesn't really help for binding the device,
> ultimately even if a device is in the id table it might fail to
> probe due to the missing platform support that each of these igd and
> nvlink drivers expose,

What happens depends on what makes sense for the driver, some missing
optional support could continue without it, or it could fail.

IGD and nvlink can trivially go onwards and work if they don't find
the platform support.

Or they might want to fail, I think the mlx5 and probably nvlink
drivers should fail as they are intended to be coupled with userspace
that expects to use their extended features.

In those cases failing is a feature because it prevents the whole
system from going into an unexpected state.

Jason

Max Gurtovoy Feb. 2, 2021, 8:59 p.m. UTC | #13

On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:
> On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:
>
>> For the most part, this explicit bind interface is redundant to
>> driver_override, which already avoids the duplicate ID issue.
> No, the point here is to have the ID tables in the PCI drivers because
> they fundamentally only work with their supported IDs. The normal
> driver core ID tables are a replacement for all the hardwired if's in
> vfio_pci.
>
> driver_override completely disables all the ID checking, it seems only
> useful for vfio_pci which works with everything. It should not be used
> with something like nvlink_vfio_pci.ko that needs ID checking.

This mechanism of driver_override seems weird to me. In case of hotplug 
and both capable drivers (native device driver and vfio-pci) are loaded, 
both will compete on the device.

I think the proposed flags is very powerful and it does fix the original 
concern Alex had ("if we start adding ids for vfio drivers then we 
create conflicts with the native host driver") and it's very deterministic.

In this way we'll bind explicitly to a driver.

And the way we'll choose a vfio-pci driver is by device_id + vendor_id + 
subsystem_device + subsystem_vendor.

There shouldn't be 2 vfio-pci drivers that support a device with same 
above 4 ids.

if you don't find a suitable vendor-vfio-pci.ko, you'll try binding 
vfio-pci.ko.

Each driver will publish its supported ids in sysfs to help the user to 
decide.

>
> Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
> driver_override because we could set the PCI any match on vfio_pci and
> manage the driver binding explicitly instead.
>
>> A driver id table doesn't really help for binding the device,
>> ultimately even if a device is in the id table it might fail to
>> probe due to the missing platform support that each of these igd and
>> nvlink drivers expose,
> What happens depends on what makes sense for the driver, some missing
> optional support could continue without it, or it could fail.
>
> IGD and nvlink can trivially go onwards and work if they don't find
> the platform support.
>
> Or they might want to fail, I think the mlx5 and probably nvlink
> drivers should fail as they are intended to be coupled with userspace
> that expects to use their extended features.
>
> In those cases failing is a feature because it prevents the whole
> system from going into an unexpected state.
>
> Jason

Alex Williamson Feb. 2, 2021, 9:30 p.m. UTC | #14

On Tue, 2 Feb 2021 22:59:27 +0200
Max Gurtovoy <mgurtovoy@nvidia.com> wrote:

> On 2/2/2021 10:44 PM, Jason Gunthorpe wrote:
> > On Tue, Feb 02, 2021 at 12:37:23PM -0700, Alex Williamson wrote:
> >  
> >> For the most part, this explicit bind interface is redundant to
> >> driver_override, which already avoids the duplicate ID issue.  
> > No, the point here is to have the ID tables in the PCI drivers because
> > they fundamentally only work with their supported IDs. The normal
> > driver core ID tables are a replacement for all the hardwired if's in
> > vfio_pci.
> >
> > driver_override completely disables all the ID checking, it seems only
> > useful for vfio_pci which works with everything. It should not be used
> > with something like nvlink_vfio_pci.ko that needs ID checking.  
> 
> This mechanism of driver_override seems weird to me. In case of hotplug 
> and both capable drivers (native device driver and vfio-pci) are loaded, 
> both will compete on the device.

How would the hot-added device have driver_override set?  There's no
competition, the native device driver would claim the device and the
user could set driver_override, unbind and re-probe to get their
specified driver.  Once a driver_override is set, there cannot be any
competition, driver_override is used for match exclusively if set.

> I think the proposed flags is very powerful and it does fix the original 
> concern Alex had ("if we start adding ids for vfio drivers then we 
> create conflicts with the native host driver") and it's very deterministic.
> 
> In this way we'll bind explicitly to a driver.
> 
> And the way we'll choose a vfio-pci driver is by device_id + vendor_id + 
> subsystem_device + subsystem_vendor.
> 
> There shouldn't be 2 vfio-pci drivers that support a device with same 
> above 4 ids.

It's entirely possible there could be, but without neural implant
devices to interpret the user's intentions, I think we'll have to
accept there could be non-determinism here.

The first set of users already fail this specification though, we can't
base it strictly on device and vendor IDs, we need wildcards, class
codes, revision IDs, etc., just like any other PCI drvier.  We're not
going to maintain a set of specific device IDs for the IGD extension,
nor I suspect the NVLINK support as that would require a kernel update
every time a new GPU is released that makes use of the same interface.

As I understand Jason's reply, these vendor drivers would have an ids
table and a user could look at modalias for the device to compare to
the driver supported aliases for a match.  Does kmod already have this
as a utility outside of modprobe?

> if you don't find a suitable vendor-vfio-pci.ko, you'll try binding 
> vfio-pci.ko.
> 
> Each driver will publish its supported ids in sysfs to help the user to 
> decide.

Seems like it would be embedded in the aliases for the module, with
this explicit binding flag being the significant difference that
prevents auto loading the device.  We still have one of the races that
driver_override resolves though, the proposed explicit bind flag is on
the driver not the device, so a native host driver being loaded due to
a hotplug operation or independent actions of different admins could
usurp the device between unbind of old driver and bind to new driver.

> > Yes, this DRIVER_EXPLICIT_BIND_ONLY idea somewhat replaces
> > driver_override because we could set the PCI any match on vfio_pci and
> > manage the driver binding explicitly instead.
> >  
> >> A driver id table doesn't really help for binding the device,
> >> ultimately even if a device is in the id table it might fail to
> >> probe due to the missing platform support that each of these igd and
> >> nvlink drivers expose,  
> > What happens depends on what makes sense for the driver, some missing
> > optional support could continue without it, or it could fail.
> >
> > IGD and nvlink can trivially go onwards and work if they don't find
> > the platform support.

This seems unpredictable from a user perspective.  In either the igd or
nvlink cases, if the platform features aren't available, the feature
set of the device is reduced.  That's not apparent until the user tries
to start interacting with the device if the device specific driver
doesn't fail the probe.  Userspace policy would need to decide if a
fallback driver is acceptable or the vendor specific driver failure is
fatal. Thanks,

Alex

Jason Gunthorpe Feb. 2, 2021, 11:06 p.m. UTC | #15

On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:

> The first set of users already fail this specification though, we can't
> base it strictly on device and vendor IDs, we need wildcards, class
> codes, revision IDs, etc., just like any other PCI drvier.  We're not
> going to maintain a set of specific device IDs for the IGD
> extension,

The Intel GPU driver already has a include/drm/i915_pciids.h that
organizes all the PCI match table entries, no reason why VFIO IGD
couldn't include that too and use the same match table as the real GPU
driver. Same HW right?

Also how sure are you that this loose detection is going to work with
future Intel discrete GPUs that likely won't need vfio_igd?

> nor I suspect the NVLINK support as that would require a kernel update
> every time a new GPU is released that makes use of the same interface.

The nvlink device that required this special vfio code was a one
off. Current devices do not use it. Not having an exact PCI ID match
in this case is a bug.

> As I understand Jason's reply, these vendor drivers would have an ids
> table and a user could look at modalias for the device to compare to
> the driver supported aliases for a match.  Does kmod already have this
> as a utility outside of modprobe?

I think this is worth exploring.

One idea that fits nicely with the existing infrastructure is to add
to driver core a 'device mode' string. It would be "default" or "vfio"

devices in vfio mode only match vfio mode device_drivers.

devices in vfio mode generate a unique modalias string that includes
some additional 'mode=vfio' identifier

drivers that run in vfio mode generate a module table string that
includes the same mode=vfio

The driver core can trigger driver auto loading soley based on the
mode string, happens naturally.

All the existing udev, depmod/etc tooling will transparently work.

Like driver_override, but doesn't bypass all the ID and module loading
parts of the driver core.

(But lets not get too far down this path until we can agree that
embracing the driver core like the RFC contemplates is the agreed
direction)

> Seems like it would be embedded in the aliases for the module, with
> this explicit binding flag being the significant difference that
> prevents auto loading the device.  We still have one of the races that
> driver_override resolves though, the proposed explicit bind flag is on
> the driver not the device, so a native host driver being loaded due to
> a hotplug operation or independent actions of different admins could
> usurp the device between unbind of old driver and bind to new driver.

This is because the sysfs doesn't have an atomic way to bind and
rebind a device, teaching 'bind' to how to do that would also solve
this problem.

> This seems unpredictable from a user perspective.  In either the igd or
> nvlink cases, if the platform features aren't available, the feature
> set of the device is reduced.  That's not apparent until the user tries
> to start interacting with the device if the device specific driver
> doesn't fail the probe.  Userspace policy would need to decide if a
> fallback driver is acceptable or the vendor specific driver failure is
> fatal. Thanks,

It matches today's behavior, if it is a good idea to preserve it, then
it can be so without much effort.

I do prefer the explicitness because I think most use cases have a
user that requires the special driver to run. Explicitly binding a
the required driver seems preferable.

Certainly nvlink and mlx5 should fail probe and not fall back to plain
vfio-pci. If user wants plain vfio-pci user should ask explicitly. At
least for the mlx5 cases this is a completely reasonable thing to
do. I like that we can support this choice.

I'm not so clear on IGD, especially how it would interact with future
descrete cards that probably don't need it. IMHO, it would be fine if
it was different for some good reason.

Jason

Alex Williamson Feb. 2, 2021, 11:59 p.m. UTC | #16

On Tue, 2 Feb 2021 19:06:04 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:
> 
> > The first set of users already fail this specification though, we can't
> > base it strictly on device and vendor IDs, we need wildcards, class
> > codes, revision IDs, etc., just like any other PCI drvier.  We're not
> > going to maintain a set of specific device IDs for the IGD
> > extension,  
> 
> The Intel GPU driver already has a include/drm/i915_pciids.h that
> organizes all the PCI match table entries, no reason why VFIO IGD
> couldn't include that too and use the same match table as the real GPU
> driver. Same HW right?

vfio-pci-igd support knows very little about the device, we're
effectively just exposing a firmware table and some of the host bridge
config space (read-only).  So the idea that the host kernel needs to
have updated i915 support in order to expose the device to userspace
with these extra regions is a bit silly.

> Also how sure are you that this loose detection is going to work with
> future Intel discrete GPUs that likely won't need vfio_igd?

Not at all, which is one more reason we don't want to rely on i915's
device table, which would likely support those as well.  We might only
want to bind to GPUs on the root complex, or at address 0000:00:02.0.
Our "what to reject" algorithm might need to evolve as those arrive,
but I don't think that means we need to explicitly list every device ID
either.

> > nor I suspect the NVLINK support as that would require a kernel update
> > every time a new GPU is released that makes use of the same interface.  
> 
> The nvlink device that required this special vfio code was a one
> off. Current devices do not use it. Not having an exact PCI ID match
> in this case is a bug.

AIUI, the quirk is only activated when there's a firmware table to
support it.  No firmware table, no driver bind, no need to use
explicit IDs.  Vendor and class code should be enough.

> > As I understand Jason's reply, these vendor drivers would have an ids
> > table and a user could look at modalias for the device to compare to
> > the driver supported aliases for a match.  Does kmod already have this
> > as a utility outside of modprobe?  
> 
> I think this is worth exploring.
> 
> One idea that fits nicely with the existing infrastructure is to add
> to driver core a 'device mode' string. It would be "default" or "vfio"
> 
> devices in vfio mode only match vfio mode device_drivers.
> 
> devices in vfio mode generate a unique modalias string that includes
> some additional 'mode=vfio' identifier
> 
> drivers that run in vfio mode generate a module table string that
> includes the same mode=vfio
> 
> The driver core can trigger driver auto loading soley based on the
> mode string, happens naturally.
> 
> All the existing udev, depmod/etc tooling will transparently work.
> 
> Like driver_override, but doesn't bypass all the ID and module loading
> parts of the driver core.
> 
> (But lets not get too far down this path until we can agree that
> embracing the driver core like the RFC contemplates is the agreed
> direction)

I'm not sure I fully follow the mechanics of this.  I'm interpreting
this as something like a sub-class of drivers where for example
vfio-pci class drivers would have a vfio-pci: alias prefix rather than
pci:.  There might be some sysfs attribute for the device that would
allow the user to write an alias prefix and would that trigger the
(ex.) pci-core to send remove uevents for the pci: modalias device and
add uevents for the vfio-pci: modalias device?  Some ordering rules
would then allow vendor/device modules to precede vfio-pci, which would
have only a wildcard id table?

I need to churn on that for a while, but if driver-core folks are
interested, maybe it could be a good idea...  Thanks,

Alex

Jason Gunthorpe Feb. 3, 2021, 1:54 p.m. UTC | #17

On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> On Tue, 2 Feb 2021 19:06:04 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Feb 02, 2021 at 02:30:13PM -0700, Alex Williamson wrote:
> > 
> > > The first set of users already fail this specification though, we can't
> > > base it strictly on device and vendor IDs, we need wildcards, class
> > > codes, revision IDs, etc., just like any other PCI drvier.  We're not
> > > going to maintain a set of specific device IDs for the IGD
> > > extension,  
> > 
> > The Intel GPU driver already has a include/drm/i915_pciids.h that
> > organizes all the PCI match table entries, no reason why VFIO IGD
> > couldn't include that too and use the same match table as the real GPU
> > driver. Same HW right?
> 
> vfio-pci-igd support knows very little about the device, we're
> effectively just exposing a firmware table and some of the host bridge
> config space (read-only).  So the idea that the host kernel needs to
> have updated i915 support in order to expose the device to userspace
> with these extra regions is a bit silly.

It is our standard driver process in Linux, we can use it here in vfio

My point is we don't have to preserve the currently loose matching
semantics and we can have explicit ID lists for these drivers. It is
not a blocker to this direction.

We can argue if it is better/worse, but the rest of the kernel works
just fine on an 'update the ID match table when the HW is vetted'
principle. It is not an important enough reason to reject this
direction.

> I'm not sure I fully follow the mechanics of this.  I'm interpreting
> this as something like a sub-class of drivers where for example
> vfio-pci class drivers would have a vfio-pci: alias prefix rather than
> pci:.  There might be some sysfs attribute for the device that would
> allow the user to write an alias prefix and would that trigger the
> (ex.) pci-core to send remove uevents for the pci: modalias device and
> add uevents for the vfio-pci: modalias device?  Some ordering rules
> would then allow vendor/device modules to precede vfio-pci, which would
> have only a wildcard id table?

Yes, I think you have the general idea

> I need to churn on that for a while, but if driver-core folks are
> interested, maybe it could be a good idea...  Thanks,

My intention is to show there are options here, we are not limited to
just what we have today.

If people are accepting that these device-specific drivers are
required then we need to come to a community consensus to decide what
direction to pursue:

* Do we embrace the driver core and use it to load VFIO modules like a
  normal subsytem (this RFC)

OR 

* Do we make a driver-core like thing inside the VFIO bus drivers and
  have them run their own special driver matching, binding, and loading
  scheme. (May RFC)

Haven't heard a 3rd option yet..

Jason

Max Gurtovoy Feb. 3, 2021, 4:07 p.m. UTC | #18

On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:
>
>
> On 03/02/2021 04:41, Max Gurtovoy wrote:
>>
>> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
>>> On Mon, 1 Feb 2021 11:42:30 -0700
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>>> On Mon, 1 Feb 2021 12:49:12 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>
>>>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>>> This patch doesn't change any logic but only align to the 
>>>>>>> concept of
>>>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>>>> and not to a specific vendor of PCI devices should be part of 
>>>>>>> the core
>>>>>>> driver. Extensions that are specific for PCI device vendor 
>>>>>>> should go
>>>>>>> to a dedicated vendor vfio-pci driver.
>>>>>> My understanding is that igd means support for Intel graphics, 
>>>>>> i.e. a
>>>>>> strict subset of x86. If there are other future extensions that e.g.
>>>>>> only make sense for some devices found only on AMD systems, I don't
>>>>>> think they should all be included under the same x86 umbrella.
>>>>>>
>>>>>> Similar reasoning for nvlink, that only seems to cover support 
>>>>>> for some
>>>>>> GPUs under Power, and is not a general platform-specific 
>>>>>> extension IIUC.
>>>>>>
>>>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>>>> s390, and all PCI devices will be zpci on that platform), 
>>>>>> although I'm
>>>>>> not sure about the benefit.
>>>>> As far as I can tell, there isn't any benefit for s390 it's just
>>>>> "re-branding" to match the platform name rather than the zdev 
>>>>> moniker,
>>>>> which admittedly perhaps makes it more clear to someone outside of 
>>>>> s390
>>>>> that any PCI device on s390 is a zdev/zpci type, and thus will use 
>>>>> this
>>>>> extension to vfio_pci(_core).  This would still be true even if we 
>>>>> added
>>>>> something later that builds atop it (e.g. a platform-specific device
>>>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
>>>>> s390x uses
>>>>> these zdev extensions today and would need to continue using them 
>>>>> in a
>>>>> world where mlx5-vfio-pci.ko exists.
>>>>>
>>>>> I guess all that to say: if such a rename matches the 'grand 
>>>>> scheme' of
>>>>> this design where we treat arch-level extensions to 
>>>>> vfio_pci(_core) as
>>>>> "vfio_pci_(arch)" then I'm not particularly opposed to the 
>>>>> rename.  But
>>>>> by itself it's not very exciting :)
>>>> This all seems like the wrong direction to me.  The goal here is to
>>>> modularize vfio-pci into a core library and derived vendor modules 
>>>> that
>>>> make use of that core library.  If existing device specific extensions
>>>> within vfio-pci cannot be turned into vendor modules through this
>>>> support and are instead redefined as platform specific features of the
>>>> new core library, that feels like we're already admitting failure of
>>>> this core library to support known devices, let alone future devices.
>>>>
>>>> IGD is a specific set of devices.  They happen to rely on some 
>>>> platform
>>>> specific support, whose availability should be determined via the
>>>> vendor module probe callback.  Packing that support into an "x86"
>>>> component as part of the core feels not only short sighted, but also
>>>> avoids addressing the issues around how userspace determines an 
>>>> optimal
>>>> module to use for a device.
>>> Hm, it seems that not all current extensions to the vfio-pci code are
>>> created equal.
>>>
>>> IIUC, we have igd and nvlink, which are sets of devices that only show
>>> up on x86 or ppc, respectively, and may rely on some special features
>>> of those architectures/platforms. The important point is that you have
>>> a device identifier that you can match a driver against.
>>
>> maybe you can supply the ids ?
>>
>> Alexey K, I saw you've been working on the NVLINK2 for P9. can you 
>> supply the exact ids for that should be bounded to this driver ?
>>
>> I'll add it to V3.
>
> Sorry no, I never really had the exact ids, they keep popping up after 
> years.
>
> The nvlink2 support can only work if the platform supports it as there 
> is nothing to do to the GPUs themselves, it is about setting up those 
> nvlink2 links. If the platform advertises certain features in the 
> device tree - then any GPU in the SXM2 form factor (not sure about the 
> exact abbrev, not an usual PCIe device) should just work.
>
> If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
> not advertise them or some parameters are new to this subdriver, such 
> as link-speed), we still fall back to generic vfio-pci and try passing 
> through this GPU as a plain PCI device with no extra links. 
> Semi-optimal but if the user is mining cryptocoins, then highspeed 
> links are not really that important :) And the nvidia driver is 
> capable of running such GPUs without nvlinks. Thanks,

 From the above I can conclude that nvlink2_vfio_pci will need to find 
nvlinks during the probe and fail in case it doesn't.

which platform driver is responsible to advertise it ? can we use 
aux_device to represent these nvlink/nvlinks ?

In case of a failure, the user can fallback to vfio-pci.ko and run 
without the nvlinks as you said.

This is deterministic behavior and the user will know exactly what he's 
getting from vfio subsystem.

>
>
>>
>>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390. If we added special handling for
>>> ISM on s390, ISM would be in a category similar to igd/nvlink.
>>>
>>> Now, if somebody plugs a mlx5 device into an s390, we would want both
>>> the zdev support and the specialized mlx5 driver. Maybe zdev should be
>>> some kind of library that can be linked into normal vfio-pci, into
>>> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
>>> s390 (if configured into the kernel).
>>>
>

Max Gurtovoy Feb. 4, 2021, 9:12 a.m. UTC | #19

On 2/3/2021 5:24 AM, Alexey Kardashevskiy wrote:
>
>
> On 03/02/2021 04:41, Max Gurtovoy wrote:
>>
>> On 2/2/2021 6:06 PM, Cornelia Huck wrote:
>>> On Mon, 1 Feb 2021 11:42:30 -0700
>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>
>>>> On Mon, 1 Feb 2021 12:49:12 -0500
>>>> Matthew Rosato <mjrosato@linux.ibm.com> wrote:
>>>>
>>>>> On 2/1/21 12:14 PM, Cornelia Huck wrote:
>>>>>> On Mon, 1 Feb 2021 16:28:27 +0000
>>>>>> Max Gurtovoy <mgurtovoy@nvidia.com> wrote:
>>>>>>> This patch doesn't change any logic but only align to the 
>>>>>>> concept of
>>>>>>> vfio_pci_core extensions. Extensions that are related to a platform
>>>>>>> and not to a specific vendor of PCI devices should be part of 
>>>>>>> the core
>>>>>>> driver. Extensions that are specific for PCI device vendor 
>>>>>>> should go
>>>>>>> to a dedicated vendor vfio-pci driver.
>>>>>> My understanding is that igd means support for Intel graphics, 
>>>>>> i.e. a
>>>>>> strict subset of x86. If there are other future extensions that e.g.
>>>>>> only make sense for some devices found only on AMD systems, I don't
>>>>>> think they should all be included under the same x86 umbrella.
>>>>>>
>>>>>> Similar reasoning for nvlink, that only seems to cover support 
>>>>>> for some
>>>>>> GPUs under Power, and is not a general platform-specific 
>>>>>> extension IIUC.
>>>>>>
>>>>>> We can arguably do the zdev -> s390 rename (as zpci appears only on
>>>>>> s390, and all PCI devices will be zpci on that platform), 
>>>>>> although I'm
>>>>>> not sure about the benefit.
>>>>> As far as I can tell, there isn't any benefit for s390 it's just
>>>>> "re-branding" to match the platform name rather than the zdev 
>>>>> moniker,
>>>>> which admittedly perhaps makes it more clear to someone outside of 
>>>>> s390
>>>>> that any PCI device on s390 is a zdev/zpci type, and thus will use 
>>>>> this
>>>>> extension to vfio_pci(_core).  This would still be true even if we 
>>>>> added
>>>>> something later that builds atop it (e.g. a platform-specific device
>>>>> like ism-vfio-pci).  Or for that matter, mlx5 via vfio-pci on 
>>>>> s390x uses
>>>>> these zdev extensions today and would need to continue using them 
>>>>> in a
>>>>> world where mlx5-vfio-pci.ko exists.
>>>>>
>>>>> I guess all that to say: if such a rename matches the 'grand 
>>>>> scheme' of
>>>>> this design where we treat arch-level extensions to 
>>>>> vfio_pci(_core) as
>>>>> "vfio_pci_(arch)" then I'm not particularly opposed to the 
>>>>> rename.  But
>>>>> by itself it's not very exciting :)
>>>> This all seems like the wrong direction to me.  The goal here is to
>>>> modularize vfio-pci into a core library and derived vendor modules 
>>>> that
>>>> make use of that core library.  If existing device specific extensions
>>>> within vfio-pci cannot be turned into vendor modules through this
>>>> support and are instead redefined as platform specific features of the
>>>> new core library, that feels like we're already admitting failure of
>>>> this core library to support known devices, let alone future devices.
>>>>
>>>> IGD is a specific set of devices.  They happen to rely on some 
>>>> platform
>>>> specific support, whose availability should be determined via the
>>>> vendor module probe callback.  Packing that support into an "x86"
>>>> component as part of the core feels not only short sighted, but also
>>>> avoids addressing the issues around how userspace determines an 
>>>> optimal
>>>> module to use for a device.
>>> Hm, it seems that not all current extensions to the vfio-pci code are
>>> created equal.
>>>
>>> IIUC, we have igd and nvlink, which are sets of devices that only show
>>> up on x86 or ppc, respectively, and may rely on some special features
>>> of those architectures/platforms. The important point is that you have
>>> a device identifier that you can match a driver against.
>>
>> maybe you can supply the ids ?
>>
>> Alexey K, I saw you've been working on the NVLINK2 for P9. can you 
>> supply the exact ids for that should be bounded to this driver ?
>>
>> I'll add it to V3.
>
> Sorry no, I never really had the exact ids, they keep popping up after 
> years.
>
> The nvlink2 support can only work if the platform supports it as there 
> is nothing to do to the GPUs themselves, it is about setting up those 
> nvlink2 links. If the platform advertises certain features in the 
> device tree - then any GPU in the SXM2 form factor (not sure about the 
> exact abbrev, not an usual PCIe device) should just work.
>
> If the nvlink2 "subdriver" fails to find nvlinks (the platform does 
> not advertise them or some parameters are new to this subdriver, such 
> as link-speed), we still fall back to generic vfio-pci and try passing 
> through this GPU as a plain PCI device with no extra links. 
> Semi-optimal but if the user is mining cryptocoins, then highspeed 
> links are not really that important :) And the nvidia driver is 
> capable of running such GPUs without nvlinks. Thanks,
>
I see.

But the PCI function (the bounded BDF) is GPU function or NVLINK function ?

If it's NVLINK function then we should fail probing in the host vfio-pci 
driver.

if its a GPU function so it shouldn't been called nvlink2 vfio-pci 
driver. Its just an extension in the GPU vfio-pci driver.

>
>>
>>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390. If we added special handling for
>>> ISM on s390, ISM would be in a category similar to igd/nvlink.
>>>
>>> Now, if somebody plugs a mlx5 device into an s390, we would want both
>>> the zdev support and the specialized mlx5 driver. Maybe zdev should be
>>> some kind of library that can be linked into normal vfio-pci, into
>>> vfio-pci-mlx5, and a hypothetical vfio-pci-ism? You always want zdev on
>>> s390 (if configured into the kernel).
>>>
>

Jason Gunthorpe Feb. 4, 2021, 12:51 p.m. UTC | #20

On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:

> It is system firmware (==bios) which puts stuff in the device tree. The
> stuff is:
> 1. emulated pci devices (custom pci bridges), one per nvlink, emulated by
> the firmware, the driver is "ibmnpu" and it is a part on the nvidia driver;
> these are basically config space proxies to the cpu's side of nvlink.
> 2. interconnect information - which of 6 gpus nvlinks connected to which
> nvlink on the cpu side, and memory ranges.

So what is this vfio_nvlink driver supposed to be bound to? 

The "emulated pci devices"?

A real GPU function?

A real nvswitch function?

Something else?

Jason

Alexey Kardashevskiy Feb. 5, 2021, 12:42 a.m. UTC | #21

On 04/02/2021 23:51, Jason Gunthorpe wrote:
> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
> 
>> It is system firmware (==bios) which puts stuff in the device tree. The
>> stuff is:
>> 1. emulated pci devices (custom pci bridges), one per nvlink, emulated by
>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia driver;
>> these are basically config space proxies to the cpu's side of nvlink.
>> 2. interconnect information - which of 6 gpus nvlinks connected to which
>> nvlink on the cpu side, and memory ranges.
> 
> So what is this vfio_nvlink driver supposed to be bound to?
> 
> The "emulated pci devices"?

Yes.

> A real GPU function?

Yes.

> A real nvswitch function?

What do you mean by this exactly? The cpu side of nvlink is "emulated 
pci devices", the gpu side is not in pci space at all, the nvidia driver 
manages it via the gpu's mmio or/and cfg space.

> Something else?

Nope :)
In this new scheme which you are proposing it should be 2 drivers, I guess.

> 
> Jason
>

Max Gurtovoy Feb. 8, 2021, 12:44 p.m. UTC | #22

On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:
>
>
> On 04/02/2021 23:51, Jason Gunthorpe wrote:
>> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
>>
>>> It is system firmware (==bios) which puts stuff in the device tree. The
>>> stuff is:
>>> 1. emulated pci devices (custom pci bridges), one per nvlink, 
>>> emulated by
>>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia 
>>> driver;
>>> these are basically config space proxies to the cpu's side of nvlink.
>>> 2. interconnect information - which of 6 gpus nvlinks connected to 
>>> which
>>> nvlink on the cpu side, and memory ranges.
>>
>> So what is this vfio_nvlink driver supposed to be bound to?
>>
>> The "emulated pci devices"?
>
> Yes.
>
>> A real GPU function?
>
> Yes.
>
>> A real nvswitch function?
>
> What do you mean by this exactly? The cpu side of nvlink is "emulated 
> pci devices", the gpu side is not in pci space at all, the nvidia 
> driver manages it via the gpu's mmio or/and cfg space.
>
>> Something else?
>
> Nope :)
> In this new scheme which you are proposing it should be 2 drivers, I 
> guess.

I see.

So should it be nvidia_vfio_pci.ko ? and it will do the NVLINK stuff in 
case the class code matches and otherwise just work as simple vfio_pci GPU ?

What about the second driver ? should it be called ibmnpu_vfio_pci.ko ?

>
>>
>> Jason
>>
>

Jason Gunthorpe Feb. 8, 2021, 6:13 p.m. UTC | #23

On Fri, Feb 05, 2021 at 11:42:11AM +1100, Alexey Kardashevskiy wrote:
> > A real nvswitch function?
> 
> What do you mean by this exactly? The cpu side of nvlink is "emulated pci
> devices", the gpu side is not in pci space at all, the nvidia driver manages
> it via the gpu's mmio or/and cfg space.

Some versions of the nvswitch chip have a PCI-E link too, that is what
I though this was all about when I first saw it.

So, it is really a special set of functions for NVIDIA GPU device
assignment only applicable to P9 systems, much like IGD is for Intel
on x86.

Jason

Alexey Kardashevskiy Feb. 9, 2021, 1:51 a.m. UTC | #24

On 09/02/2021 05:13, Jason Gunthorpe wrote:
> On Fri, Feb 05, 2021 at 11:42:11AM +1100, Alexey Kardashevskiy wrote:
>>> A real nvswitch function?
>>
>> What do you mean by this exactly? The cpu side of nvlink is "emulated pci
>> devices", the gpu side is not in pci space at all, the nvidia driver manages
>> it via the gpu's mmio or/and cfg space.
> 
> Some versions of the nvswitch chip have a PCI-E link too, that is what
> I though this was all about when I first saw it.

> So, it is really a special set of functions for NVIDIA GPU device
> assignment only applicable to P9 systems, much like IGD is for Intel
> on x86.

These GPUs are not P9 specific and they all have both PCIe and NVLink2 
links. The special part is that some nvlinks are between P9 and GPU and 
the rest are between GPUs, everywhere else (x86, may be ARM) the nvlinks 
are used between GPUs but even there I do not think the nvlink logic is 
presented to the host in the PCI space.

Alexey Kardashevskiy Feb. 9, 2021, 1:55 a.m. UTC | #25

On 08/02/2021 23:44, Max Gurtovoy wrote:
> 
> On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:
>>
>>
>> On 04/02/2021 23:51, Jason Gunthorpe wrote:
>>> On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>> It is system firmware (==bios) which puts stuff in the device tree. The
>>>> stuff is:
>>>> 1. emulated pci devices (custom pci bridges), one per nvlink, 
>>>> emulated by
>>>> the firmware, the driver is "ibmnpu" and it is a part on the nvidia 
>>>> driver;
>>>> these are basically config space proxies to the cpu's side of nvlink.
>>>> 2. interconnect information - which of 6 gpus nvlinks connected to 
>>>> which
>>>> nvlink on the cpu side, and memory ranges.
>>>
>>> So what is this vfio_nvlink driver supposed to be bound to?
>>>
>>> The "emulated pci devices"?
>>
>> Yes.
>>
>>> A real GPU function?
>>
>> Yes.
>>
>>> A real nvswitch function?
>>
>> What do you mean by this exactly? The cpu side of nvlink is "emulated 
>> pci devices", the gpu side is not in pci space at all, the nvidia 
>> driver manages it via the gpu's mmio or/and cfg space.
>>
>>> Something else?
>>
>> Nope :)
>> In this new scheme which you are proposing it should be 2 drivers, I 
>> guess.
> 
> I see.
> 
> So should it be nvidia_vfio_pci.ko ? and it will do the NVLINK stuff in 
> case the class code matches and otherwise just work as simple vfio_pci 
> GPU ?

"nvidia_vfio_pci" would be too generic, sounds like it is for every 
nvidia on every platform. powernv_nvidia_vfio_pci.ko may be.

> What about the second driver ? should it be called ibmnpu_vfio_pci.ko ?

This will do.


> 
>>
>>>
>>> Jason
>>>
>>

Christoph Hellwig Feb. 11, 2021, 8:44 a.m. UTC | #26

On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> vfio-pci-igd support knows very little about the device, we're
> effectively just exposing a firmware table and some of the host bridge
> config space (read-only).  So the idea that the host kernel needs to
> have updated i915 support in order to expose the device to userspace
> with these extra regions is a bit silly.

On the other hand assuming the IGD scheme works for every device
with an Intel Vendor ID and a VGA classcode that hangs off an Intel
host bridge seems highly dangerous.  Is this actually going to work
for the new discreete Intel graphics?  For the old i740?  And if not
what is the failure scenario?

Christoph Hellwig Feb. 11, 2021, 8:47 a.m. UTC | #27

On Wed, Feb 03, 2021 at 09:54:48AM -0400, Jason Gunthorpe wrote:
> If people are accepting that these device-specific drivers are
> required then we need to come to a community consensus to decide what
> direction to pursue:
> 
> * Do we embrace the driver core and use it to load VFIO modules like a
>   normal subsytem (this RFC)
> 
> OR 
> 
> * Do we make a driver-core like thing inside the VFIO bus drivers and
>   have them run their own special driver matching, binding, and loading
>   scheme. (May RFC)
> 
> Haven't heard a 3rd option yet..

The third option would be to use the driver core to bind the VFIO
submodules.  Define a new bus for it, which also uses the normal PCI IDs
for binding, and walk through those VFIO specific drivers when vfio_pci
is bound to a device.  That would provide a pretty clean abstraction
and could even keep the existing behavior of say bind to all VGA devices
with an Intel vendor ID (even if I think that is a bad idea).

Christoph Hellwig Feb. 11, 2021, 8:50 a.m. UTC | #28

On Thu, Feb 04, 2021 at 11:12:49AM +0200, Max Gurtovoy wrote:
> But the PCI function (the bounded BDF) is GPU function or NVLINK function ?
> 
> If it's NVLINK function then we should fail probing in the host vfio-pci
> driver.
> 
> if its a GPU function so it shouldn't been called nvlink2 vfio-pci driver.
> Its just an extension in the GPU vfio-pci driver.

I suspect the trivial and correct answer is that we should just drop the
driver entirely.  It is for obsolete hardware that never had upstream
support for even using it in the guest.  It also is the reason for
keeping cruft in the always built-in powernv platform code alive that
is otherwise dead wood.

Jason Gunthorpe Feb. 11, 2021, 2:30 p.m. UTC | #29

On Thu, Feb 11, 2021 at 08:47:03AM +0000, Christoph Hellwig wrote:
> On Wed, Feb 03, 2021 at 09:54:48AM -0400, Jason Gunthorpe wrote:
> > If people are accepting that these device-specific drivers are
> > required then we need to come to a community consensus to decide what
> > direction to pursue:
> > 
> > * Do we embrace the driver core and use it to load VFIO modules like a
> >   normal subsytem (this RFC)
> > 
> > OR 
> > 
> > * Do we make a driver-core like thing inside the VFIO bus drivers and
> >   have them run their own special driver matching, binding, and loading
> >   scheme. (May RFC)
> > 
> > Haven't heard a 3rd option yet..
> 
> The third option would be to use the driver core to bind the VFIO
> submodules.  Define a new bus for it, which also uses the normal PCI IDs
> for binding, and walk through those VFIO specific drivers when vfio_pci
> is bound to a device.  That would provide a pretty clean abstraction
> and could even keep the existing behavior of say bind to all VGA devices
> with an Intel vendor ID (even if I think that is a bad idea).

I think of this as some variant to the second option above.

Maximally using the driver core to make subdrivers still means the
VFIO side needs to reimplement all the existing matcher logic for PCI
(and it means we don't generalize, so future CXL/etc, will need more
VFIO special code)  It has to put this stuff on a new special bus and
somehow make names and match tables for it.

It also means we'd have to somehow fix vfio-pci to allow hot
plug/unplug of the subdriver. The driver core doesn't really run
synchronously for binding, so late binding would have to be
accommodated somehow. It feels like adding a lot of complexity for
very little gain to me.

Personally I dislike the subdriver direction of the May RFC quite a
bit, it has a lot of unfixable negatives for the admin side. The first
option does present some challenges for userspace but I belive we can
work through them.

Jason

Jason Gunthorpe Feb. 11, 2021, 2:49 p.m. UTC | #30

On Thu, Feb 11, 2021 at 08:50:21AM +0000, Christoph Hellwig wrote:
> On Thu, Feb 04, 2021 at 11:12:49AM +0200, Max Gurtovoy wrote:
> > But the PCI function (the bounded BDF) is GPU function or NVLINK function ?
> > 
> > If it's NVLINK function then we should fail probing in the host vfio-pci
> > driver.
> > 
> > if its a GPU function so it shouldn't been called nvlink2 vfio-pci driver.
> > Its just an extension in the GPU vfio-pci driver.
> 
> I suspect the trivial and correct answer is that we should just drop
> the driver entirely.  It is for obsolete hardware that never had
> upstream

The HW is still in active deployment and use. The big system, Summit,
only went to production sometime in 2019, so it is barely started on
its lifecycle. Something around a 5-10 year operational lifetime would
be pretty typical in this area.

> support for even using it in the guest.  It also is the reason for
> keeping cruft in the always built-in powernv platform code alive that
> is otherwise dead wood.

Or stated another way, once vfio-pci supports loadable extensions the
non-upstream hardware could provide the extension it needs out of
tree.

Jason

Max Gurtovoy Feb. 11, 2021, 3:47 p.m. UTC | #31

On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:
> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
>
>> On the other side, we have the zdev support, which both requires s390
>> and applies to any pci device on s390.
> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
> s390?
>
> It would be like returning data from ACPI on other platforms.

Agree.

all agree that I remove it ?

we already have a check in the code:

if (ret && ret != -ENODEV) {
                                 pci_warn(vdev->vpdev.pdev, "Failed to 
setup zPCI info capabilities\n");
                                 return ret;
}

so in case its not zdev we should get -ENODEV and continue in the good flow.

>
> It really seems like part of vfio-pci-core
>
> Jason

Matthew Rosato Feb. 11, 2021, 4:29 p.m. UTC | #32

On 2/11/21 10:47 AM, Max Gurtovoy wrote:
> 
> On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:
>> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
>>
>>> On the other side, we have the zdev support, which both requires s390
>>> and applies to any pci device on s390.
>> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
>> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
>> s390?
>>
>> It would be like returning data from ACPI on other platforms.
> 
> Agree.
> 
> all agree that I remove it ?

I did some archives digging on the discussions around 
CONFIG_VFIO_PCI_ZDEV and whether we should/should not have a Kconfig 
switch around this; it was something that was carried over various 
attempts to get the zdev support upstream, but I can't really find (or 
think of) a compelling reason that a Kconfig switch must be kept for it. 
  The bottom line is if you're on s390, you really want zdev support.

So: I don't have an objection so long as the net result is that 
vfio_pci_zdev.o is always built in to vfio-pci(-core) for s390.

> 
> we already have a check in the code:
> 
> if (ret && ret != -ENODEV) {
>                                  pci_warn(vdev->vpdev.pdev, "Failed to 
> setup zPCI info capabilities\n");
>                                  return ret;
> }
> 
> so in case its not zdev we should get -ENODEV and continue in the good 
> flow.
> 
>>
>> It really seems like part of vfio-pci-core
>>
>> Jason

Cornelia Huck Feb. 11, 2021, 5:39 p.m. UTC | #33

On Thu, 11 Feb 2021 11:29:37 -0500
Matthew Rosato <mjrosato@linux.ibm.com> wrote:

> On 2/11/21 10:47 AM, Max Gurtovoy wrote:
> > 
> > On 2/2/2021 7:10 PM, Jason Gunthorpe wrote:  
> >> On Tue, Feb 02, 2021 at 05:06:59PM +0100, Cornelia Huck wrote:
> >>  
> >>> On the other side, we have the zdev support, which both requires s390
> >>> and applies to any pci device on s390.  
> >> Is there a reason why CONFIG_VFIO_PCI_ZDEV exists? Why not just always
> >> return the s390 specific data in VFIO_DEVICE_GET_INFO if running on
> >> s390?
> >>
> >> It would be like returning data from ACPI on other platforms.  
> > 
> > Agree.
> > 
> > all agree that I remove it ?  
> 
> I did some archives digging on the discussions around 
> CONFIG_VFIO_PCI_ZDEV and whether we should/should not have a Kconfig 
> switch around this; it was something that was carried over various 
> attempts to get the zdev support upstream, but I can't really find (or 
> think of) a compelling reason that a Kconfig switch must be kept for it. 
>   The bottom line is if you're on s390, you really want zdev support.
> 
> So: I don't have an objection so long as the net result is that 
> vfio_pci_zdev.o is always built in to vfio-pci(-core) for s390.

Yes, I also don't expect presence of the zdev stuff to confuse any
older userspace.

So, let's just drop CONFIG_VFIO_PCI_ZDEV and use CONFIG_S390 in lieu of
it (not changing the file name).

Alex Williamson Feb. 11, 2021, 7:43 p.m. UTC | #34

On Thu, 11 Feb 2021 08:44:26 +0000
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Feb 02, 2021 at 04:59:23PM -0700, Alex Williamson wrote:
> > vfio-pci-igd support knows very little about the device, we're
> > effectively just exposing a firmware table and some of the host bridge
> > config space (read-only).  So the idea that the host kernel needs to
> > have updated i915 support in order to expose the device to userspace
> > with these extra regions is a bit silly.  
> 
> On the other hand assuming the IGD scheme works for every device
> with an Intel Vendor ID and a VGA classcode that hangs off an Intel
> host bridge seems highly dangerous.  Is this actually going to work
> for the new discreete Intel graphics?  For the old i740?  And if not
> what is the failure scenario?

The failure scenario is that we expose read-only copies of the OpRegion
firmware table and host and lpc bridge config space to userspace.  Not
exactly dangerous.  For discrete graphics we'd simply fail the device
probe if the target device isn't on the root bus.  This would cover the
old i740 as well, assuming you're seriously concerned about someone
plugging in a predominantly AGP graphics card from 20+ years ago into a
modern system and trying to assign it to a guest.  Thanks,

Alex

[8/9] vfio/pci: use x86 naming instead of igd

Commit Message

Comments

Patch