[RFC,v2] vPCI: account for hidden devices

Message ID	7294a70c-0089-e375-bb5a-bf9544d4f251@suse.com (mailing list archive)
State	Superseded
Headers	show Return-Path: <xen-devel-bounces@lists.xenproject.org> Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Message-ID: <7294a70c-0089-e375-bb5a-bf9544d4f251@suse.com> Date: Wed, 24 May 2023 15:45:58 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 From: Jan Beulich <jbeulich@suse.com> Subject: [PATCH RFC v2] vPCI: account for hidden devices To: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org> Cc: =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>, Stefano Stabellini <sstabellini@kernel.org> Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit MIME-Version: 1.0
Series	[RFC,v2] vPCI: account for hidden devices \| expand [RFC,v2] vPCI: account for hidden devices

Jan Beulich May 24, 2023, 1:45 p.m. UTC

Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
console) are associated with DomXEN, not Dom0. This means that while
looking for overlapping BARs such devices cannot be found on Dom0's list
of devices; DomXEN's list also needs to be scanned.

Suppress vPCI init altogether for r/o devices (which constitute a subset
of hidden ones).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
RFC: The modify_bars() change is intentionally mis-formatted, as the
     necessary re-indentation would make the diff difficult to read. At
     this point I'd merely like to gather input towards possible better
     approaches to solve the issue (not the least because quite possibly
     there are further places needing changing).

RFC: Whether to actually suppress vPCI init is up for debate; adding the
     extra logic is following Roger's suggestion (I'm not convinced it is
     useful to have). If we want to keep that, a 2nd question would be
     whether to keep it in vpci_add_handlers(): Both of its callers (can)
     have a struct pci_seg readily available, so checking ->ro_map at the
     call sites would be easier.

RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
     modify_bars() to consistently respect BARs of hidden devices while
     setting up "normal" ones (i.e. to avoid as much as possible the
     "continue" path introduced here), setting up of the former may want
     doing first.

RFC: vpci_write()'s check of the r/o map may want moving out of mainline
     code, into the case dealing with !pdev->vpci.
---
v2: Extend existing comment. Relax assertion. Don't initialize vPCI for
    r/o devices.

Roger Pau Monné May 24, 2023, 2:22 p.m. UTC | #1

On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
> console) are associated with DomXEN, not Dom0. This means that while
> looking for overlapping BARs such devices cannot be found on Dom0's list
> of devices; DomXEN's list also needs to be scanned.
> 
> Suppress vPCI init altogether for r/o devices (which constitute a subset
> of hidden ones).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC: The modify_bars() change is intentionally mis-formatted, as the
>      necessary re-indentation would make the diff difficult to read. At
>      this point I'd merely like to gather input towards possible better
>      approaches to solve the issue (not the least because quite possibly
>      there are further places needing changing).

I think we should also handle the case of !pdev->vpci for the hardware
doamin, as it's allowed for the vpci_add_handlers() call in
setup_one_hwdom_device() to fail and the device wold still be assigned
to the hardware domain.

I can submit that as a separate bugfix, as it's already an issue
without taking r/o or hidden devices into account.

> 
> RFC: Whether to actually suppress vPCI init is up for debate; adding the
>      extra logic is following Roger's suggestion (I'm not convinced it is
>      useful to have). If we want to keep that, a 2nd question would be
>      whether to keep it in vpci_add_handlers(): Both of its callers (can)
>      have a struct pci_seg readily available, so checking ->ro_map at the
>      call sites would be easier.

But that would duplicate the logic into the callers, which doesn't
seem very nice to me, and makes it easier to make mistakes if further
callers are added and r/o is not checked there.

> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>      modify_bars() to consistently respect BARs of hidden devices while
>      setting up "normal" ones (i.e. to avoid as much as possible the
>      "continue" path introduced here), setting up of the former may want
>      doing first.

But BARs of hidden devices should be mapped into dom0 physmap?  And
hence doing those before or after normal devices will lead to the same
result.  The loop in modify_bars() is there to avoid attempting to map
the same range twice, or to unmap a range while there are devices
still using it, but the unmap is never done during initial device
setup.

> 
> RFC: vpci_write()'s check of the r/o map may want moving out of mainline
>      code, into the case dealing with !pdev->vpci.

Indeed.

> ---
> v2: Extend existing comment. Relax assertion. Don't initialize vPCI for
>     r/o devices.
> 
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>      struct vpci_header *header = &pdev->vpci->header;
>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
> +    const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
>      unsigned int i;
>      int rc;
> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>  
>      /*
>       * Check for overlaps with other BARs. Note that only BARs that are
> -     * currently mapped (enabled) are checked for overlaps.
> +     * currently mapped (enabled) are checked for overlaps. Note also that
> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>       */
> -    for_each_pdev ( pdev->domain, tmp )
> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> +    for_each_pdev ( d, tmp )
>      {
>          if ( tmp == pdev )
>          {
> @@ -304,6 +307,7 @@ static int modify_bars(const struct pci_
>                   */
>                  continue;
>          }
> +if ( !tmp->vpci ) { ASSERT(d == dom_xen); continue; }//todo
>  
>          for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>          {
> @@ -330,6 +334,7 @@ static int modify_bars(const struct pci_
>              }
>          }
>      }
> +if ( !is_hardware_domain(d) ) break; }//todo

AFAICT in order to support hidden devices there's one bit missing in
vpci_{write,read}(): the call to pci_get_pdev() should be also done
against dom_xen when handling accesses originated from the hardware
domain.

One further question is whether we want to map BARs of r/o devices
into the hardware domain physmap.  Not sure that's very helpful, as
dom0 won't be allowed to modify any of the config space bits of those
devices, so even attempts to size the BARs will fail.  I wonder what
kind of issues this can cause to dom0 OSes.

Thanks, Roger.

Jan Beulich May 24, 2023, 2:44 p.m. UTC | #2

On 24.05.2023 16:22, Roger Pau Monné wrote:
> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>> Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
>> console) are associated with DomXEN, not Dom0. This means that while
>> looking for overlapping BARs such devices cannot be found on Dom0's list
>> of devices; DomXEN's list also needs to be scanned.
>>
>> Suppress vPCI init altogether for r/o devices (which constitute a subset
>> of hidden ones).
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> RFC: The modify_bars() change is intentionally mis-formatted, as the
>>      necessary re-indentation would make the diff difficult to read. At
>>      this point I'd merely like to gather input towards possible better
>>      approaches to solve the issue (not the least because quite possibly
>>      there are further places needing changing).
> 
> I think we should also handle the case of !pdev->vpci for the hardware
> doamin, as it's allowed for the vpci_add_handlers() call in
> setup_one_hwdom_device() to fail and the device wold still be assigned
> to the hardware domain.
> 
> I can submit that as a separate bugfix, as it's already an issue
> without taking r/o or hidden devices into account.

Yeah, I think that wants dealing with separately. I'm not actually sure
though that "is allowed to fail" is proper behavior ...

But anyway - I take this as you agreeing to go that route, which is the
prereq for me to actually make a well-formed patch. Please shout soon
if that's a misunderstanding of mine.

>> RFC: Whether to actually suppress vPCI init is up for debate; adding the
>>      extra logic is following Roger's suggestion (I'm not convinced it is
>>      useful to have). If we want to keep that, a 2nd question would be
>>      whether to keep it in vpci_add_handlers(): Both of its callers (can)
>>      have a struct pci_seg readily available, so checking ->ro_map at the
>>      call sites would be easier.
> 
> But that would duplicate the logic into the callers, which doesn't
> seem very nice to me, and makes it easier to make mistakes if further
> callers are added and r/o is not checked there.

Right, hence why I didn't do it the alternative way from the beginning.
Both approaches have a pro and a con.

But prior to answering the 2nd question, what about the 1st one? Is it
really worth to have the extra logic?

>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>>      modify_bars() to consistently respect BARs of hidden devices while
>>      setting up "normal" ones (i.e. to avoid as much as possible the
>>      "continue" path introduced here), setting up of the former may want
>>      doing first.
> 
> But BARs of hidden devices should be mapped into dom0 physmap?

Yes.

>  And
> hence doing those before or after normal devices will lead to the same
> result.  The loop in modify_bars() is there to avoid attempting to map
> the same range twice, or to unmap a range while there are devices
> still using it, but the unmap is never done during initial device
> setup.

Okay, so maybe indeed there's no effect on the final result. Yet still
the anomaly bothered me when seeing it in the logs (it actually mislead
me initially in my conclusions as to what was actually going on), when
I added a printk() to that new "continue" path. We would skip hidden
devices up until them getting initialized themselves. There would be
less skipping if all (there aren't going to be many) DomXEN devices
were initialized first.

>> RFC: vpci_write()'s check of the r/o map may want moving out of mainline
>>      code, into the case dealing with !pdev->vpci.
> 
> Indeed.

Will extend the patch accordingly then

>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>      struct vpci_header *header = &pdev->vpci->header;
>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>      struct pci_dev *tmp, *dev = NULL;
>> +    const struct domain *d;
>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>      unsigned int i;
>>      int rc;
>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>  
>>      /*
>>       * Check for overlaps with other BARs. Note that only BARs that are
>> -     * currently mapped (enabled) are checked for overlaps.
>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>       */
>> -    for_each_pdev ( pdev->domain, tmp )
>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
>> +    for_each_pdev ( d, tmp )
>>      {
>>          if ( tmp == pdev )
>>          {
>> @@ -304,6 +307,7 @@ static int modify_bars(const struct pci_
>>                   */
>>                  continue;
>>          }
>> +if ( !tmp->vpci ) { ASSERT(d == dom_xen); continue; }//todo
>>  
>>          for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>>          {
>> @@ -330,6 +334,7 @@ static int modify_bars(const struct pci_
>>              }
>>          }
>>      }
>> +if ( !is_hardware_domain(d) ) break; }//todo
> 
> AFAICT in order to support hidden devices there's one bit missing in
> vpci_{write,read}(): the call to pci_get_pdev() should be also done
> against dom_xen when handling accesses originated from the hardware
> domain.

Hmm, right. Without that we'd always take the vpci_{read,write}_hw()
path.

> One further question is whether we want to map BARs of r/o devices
> into the hardware domain physmap.  Not sure that's very helpful, as
> dom0 won't be allowed to modify any of the config space bits of those
> devices, so even attempts to size the BARs will fail.  I wonder what
> kind of issues this can cause to dom0 OSes.

This is what Linux (6.3) says:

pci 0000:02:00.1: [Firmware Bug]: reg 0x10: invalid BAR (can't size)
pci 0000:02:00.1: [Firmware Bug]: reg 0x14: invalid BAR (can't size)
pci 0000:02:00.1: [Firmware Bug]: reg 0x24: invalid BAR (can't size)

Jan

Roger Pau Monné May 24, 2023, 3:33 p.m. UTC | #3

On Wed, May 24, 2023 at 04:44:49PM +0200, Jan Beulich wrote:
> On 24.05.2023 16:22, Roger Pau Monné wrote:
> > On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> >> Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
> >> console) are associated with DomXEN, not Dom0. This means that while
> >> looking for overlapping BARs such devices cannot be found on Dom0's list
> >> of devices; DomXEN's list also needs to be scanned.
> >>
> >> Suppress vPCI init altogether for r/o devices (which constitute a subset
> >> of hidden ones).
> >>
> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >> ---
> >> RFC: The modify_bars() change is intentionally mis-formatted, as the
> >>      necessary re-indentation would make the diff difficult to read. At
> >>      this point I'd merely like to gather input towards possible better
> >>      approaches to solve the issue (not the least because quite possibly
> >>      there are further places needing changing).
> > 
> > I think we should also handle the case of !pdev->vpci for the hardware
> > doamin, as it's allowed for the vpci_add_handlers() call in
> > setup_one_hwdom_device() to fail and the device wold still be assigned
> > to the hardware domain.
> > 
> > I can submit that as a separate bugfix, as it's already an issue
> > without taking r/o or hidden devices into account.
> 
> Yeah, I think that wants dealing with separately. I'm not actually sure
> though that "is allowed to fail" is proper behavior ...

One better option would be to mark the device r/o if the
vpci_add_handlers() call fails in setup_one_hwdom_device(), as that
would prevent dom0 from accessing native MSI(-X) capabilities.

> But anyway - I take this as you agreeing to go that route, which is the
> prereq for me to actually make a well-formed patch. Please shout soon
> if that's a misunderstanding of mine.

Sure, will send the fix later today or tomorrow so that you can
rebase.

> >> RFC: Whether to actually suppress vPCI init is up for debate; adding the
> >>      extra logic is following Roger's suggestion (I'm not convinced it is
> >>      useful to have). If we want to keep that, a 2nd question would be
> >>      whether to keep it in vpci_add_handlers(): Both of its callers (can)
> >>      have a struct pci_seg readily available, so checking ->ro_map at the
> >>      call sites would be easier.
> > 
> > But that would duplicate the logic into the callers, which doesn't
> > seem very nice to me, and makes it easier to make mistakes if further
> > callers are added and r/o is not checked there.
> 
> Right, hence why I didn't do it the alternative way from the beginning.
> Both approaches have a pro and a con.
> 
> But prior to answering the 2nd question, what about the 1st one? Is it
> really worth to have the extra logic?

Why would we want to do all the vPCI initialization for r/o devices?
None of the handlers setup will get called, so I see it the other way
around: not short-circuiting vpci_add_handlers() for r/o devices is a
waste of time and resources because none of the setup state would be
used anyway.

> >  And
> > hence doing those before or after normal devices will lead to the same
> > result.  The loop in modify_bars() is there to avoid attempting to map
> > the same range twice, or to unmap a range while there are devices
> > still using it, but the unmap is never done during initial device
> > setup.
> 
> Okay, so maybe indeed there's no effect on the final result. Yet still
> the anomaly bothered me when seeing it in the logs (it actually mislead
> me initially in my conclusions as to what was actually going on), when
> I added a printk() to that new "continue" path. We would skip hidden
> devices up until them getting initialized themselves. There would be
> less skipping if all (there aren't going to be many) DomXEN devices
> were initialized first.

I think that just makes the logic more complicated for no reason, the
only reason you don't see this with devices assigned to dom0 is
because device addition is interleaved with calls to
vpci_add_handlers().  However it would also be valid to add all
devices to dom0 and then call vpci_add_handlers() for each one of them.

> > One further question is whether we want to map BARs of r/o devices
> > into the hardware domain physmap.  Not sure that's very helpful, as
> > dom0 won't be allowed to modify any of the config space bits of those
> > devices, so even attempts to size the BARs will fail.  I wonder what
> > kind of issues this can cause to dom0 OSes.
> 
> This is what Linux (6.3) says:
> 
> pci 0000:02:00.1: [Firmware Bug]: reg 0x10: invalid BAR (can't size)
> pci 0000:02:00.1: [Firmware Bug]: reg 0x14: invalid BAR (can't size)
> pci 0000:02:00.1: [Firmware Bug]: reg 0x24: invalid BAR (can't size)

OK, seems fine then.  There's no point in mapping the BARs of r/o
devices to the dom0 physmap, as the domain is unable to size them in
the first place.

Thanks, Roger.

Roger Pau Monné May 24, 2023, 3:56 p.m. UTC | #4

On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>      struct vpci_header *header = &pdev->vpci->header;
>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
> +    const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
>      unsigned int i;
>      int rc;
> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>  
>      /*
>       * Check for overlaps with other BARs. Note that only BARs that are
> -     * currently mapped (enabled) are checked for overlaps.
> +     * currently mapped (enabled) are checked for overlaps. Note also that
> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>       */
> -    for_each_pdev ( pdev->domain, tmp )
> +for ( d = pdev->domain; ; d = dom_xen ) {//todo

Looking at this again, I think this is slightly more complex, as during
runtime dom0 will get here with pdev->domain == hardware_domain OR
dom_xen, and hence you also need to account that devices that have
pdev->domain == dom_xen need to iterate over devices that belong to
the hardware_domain, ie:

for ( d = pdev->domain; ;
      d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )

And we likely want to limit this to devices that belong to the
hardware_domain or to dom_xen (in preparation for vPCI being used for
domUs).

Thanks, Roger.

Stefano Stabellini May 24, 2023, 11:37 p.m. UTC | #5

On Wed, 24 May 2023, Jan Beulich wrote:
> >> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
> >>      modify_bars() to consistently respect BARs of hidden devices while
> >>      setting up "normal" ones (i.e. to avoid as much as possible the
> >>      "continue" path introduced here), setting up of the former may want
> >>      doing first.
> > 
> > But BARs of hidden devices should be mapped into dom0 physmap?
> 
> Yes.

The BARs would be mapped read-only (not read-write), right? Otherwise we
let dom0 access devices that belong to Xen, which doesn't seem like a
good idea.

But even if we map the BARs read-only, what is the benefit of mapping
them to Dom0? If Dom0 loads a driver for it and the driver wants to
initialize the device, the driver will crash because the MMIO region is
read-only instead of read-write, right?

How does this device hiding work for dom0? How does dom0 know not to
access a device that is present on the PCI bus but is used by Xen?

The reason why I was suggesting to hide the device completely in the
past was that I assumed we had to hide the device (make it "disappear"
on the PCI bus) otherwise Dom0 would access it. Is there another way to
mark a PCI devices as "inaccessible" or "disabled"?

Stefano Stabellini May 24, 2023, 11:40 p.m. UTC | #6

On Wed, 24 May 2023, Jan Beulich wrote:
> Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
> console) are associated with DomXEN, not Dom0. This means that while
> looking for overlapping BARs such devices cannot be found on Dom0's list
> of devices; DomXEN's list also needs to be scanned.
> 
> Suppress vPCI init altogether for r/o devices (which constitute a subset
> of hidden ones).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This works! Ship it! :-)
https://gitlab.com/xen-project/people/sstabellini/xen/-/jobs/4346896950

I understand this is an RFC and there are still open questions, but
thank you for addressing the issue quickly.



> ---
> RFC: The modify_bars() change is intentionally mis-formatted, as the
>      necessary re-indentation would make the diff difficult to read. At
>      this point I'd merely like to gather input towards possible better
>      approaches to solve the issue (not the least because quite possibly
>      there are further places needing changing).
> 
> RFC: Whether to actually suppress vPCI init is up for debate; adding the
>      extra logic is following Roger's suggestion (I'm not convinced it is
>      useful to have). If we want to keep that, a 2nd question would be
>      whether to keep it in vpci_add_handlers(): Both of its callers (can)
>      have a struct pci_seg readily available, so checking ->ro_map at the
>      call sites would be easier.
> 
> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>      modify_bars() to consistently respect BARs of hidden devices while
>      setting up "normal" ones (i.e. to avoid as much as possible the
>      "continue" path introduced here), setting up of the former may want
>      doing first.
> 
> RFC: vpci_write()'s check of the r/o map may want moving out of mainline
>      code, into the case dealing with !pdev->vpci.
> ---
> v2: Extend existing comment. Relax assertion. Don't initialize vPCI for
>     r/o devices.
> 
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>      struct vpci_header *header = &pdev->vpci->header;
>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>      struct pci_dev *tmp, *dev = NULL;
> +    const struct domain *d;
>      const struct vpci_msix *msix = pdev->vpci->msix;
>      unsigned int i;
>      int rc;
> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>  
>      /*
>       * Check for overlaps with other BARs. Note that only BARs that are
> -     * currently mapped (enabled) are checked for overlaps.
> +     * currently mapped (enabled) are checked for overlaps. Note also that
> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>       */
> -    for_each_pdev ( pdev->domain, tmp )
> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> +    for_each_pdev ( d, tmp )
>      {
>          if ( tmp == pdev )
>          {
> @@ -304,6 +307,7 @@ static int modify_bars(const struct pci_
>                   */
>                  continue;
>          }
> +if ( !tmp->vpci ) { ASSERT(d == dom_xen); continue; }//todo
>  
>          for ( i = 0; i < ARRAY_SIZE(tmp->vpci->header.bars); i++ )
>          {
> @@ -330,6 +334,7 @@ static int modify_bars(const struct pci_
>              }
>          }
>      }
> +if ( !is_hardware_domain(d) ) break; }//todo
>  
>      ASSERT(dev);
>  
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -70,6 +70,7 @@ void vpci_remove_device(struct pci_dev *
>  int vpci_add_handlers(struct pci_dev *pdev)
>  {
>      unsigned int i;
> +    const unsigned long *ro_map;
>      int rc = 0;
>  
>      if ( !has_vpci(pdev->domain) )
> @@ -78,6 +79,11 @@ int vpci_add_handlers(struct pci_dev *pd
>      /* We should not get here twice for the same device. */
>      ASSERT(!pdev->vpci);
>  
> +    /* No vPCI for r/o devices. */
> +    ro_map = pci_get_ro_map(pdev->sbdf.seg);
> +    if ( ro_map && test_bit(pdev->sbdf.bdf, ro_map) )
> +        return 0;
> +
>      pdev->vpci = xzalloc(struct vpci);
>      if ( !pdev->vpci )
>          return -ENOMEM;
>

Jan Beulich May 25, 2023, 7:23 a.m. UTC | #7

On 25.05.2023 01:37, Stefano Stabellini wrote:
> On Wed, 24 May 2023, Jan Beulich wrote:
>>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>>>>      modify_bars() to consistently respect BARs of hidden devices while
>>>>      setting up "normal" ones (i.e. to avoid as much as possible the
>>>>      "continue" path introduced here), setting up of the former may want
>>>>      doing first.
>>>
>>> But BARs of hidden devices should be mapped into dom0 physmap?
>>
>> Yes.
> 
> The BARs would be mapped read-only (not read-write), right? Otherwise we
> let dom0 access devices that belong to Xen, which doesn't seem like a
> good idea.
> 
> But even if we map the BARs read-only, what is the benefit of mapping
> them to Dom0? If Dom0 loads a driver for it and the driver wants to
> initialize the device, the driver will crash because the MMIO region is
> read-only instead of read-write, right?
> 
> How does this device hiding work for dom0? How does dom0 know not to
> access a device that is present on the PCI bus but is used by Xen?

None of these are new questions - this has all been this way for PV Dom0,
and so far we've limped along quite okay. That's not to say that we
shouldn't improve things if we can, but that first requires ideas as to
how.

> The reason why I was suggesting to hide the device completely in the
> past was that I assumed we had to hide the device (make it "disappear"
> on the PCI bus) otherwise Dom0 would access it. Is there another way to
> mark a PCI devices as "inaccessible" or "disabled"?

I'm no aware of any.

Jan

Roger Pau Monné May 25, 2023, 7:35 a.m. UTC | #8

On Wed, May 24, 2023 at 04:37:42PM -0700, Stefano Stabellini wrote:
> On Wed, 24 May 2023, Jan Beulich wrote:
> > >> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
> > >>      modify_bars() to consistently respect BARs of hidden devices while
> > >>      setting up "normal" ones (i.e. to avoid as much as possible the
> > >>      "continue" path introduced here), setting up of the former may want
> > >>      doing first.
> > > 
> > > But BARs of hidden devices should be mapped into dom0 physmap?
> > 
> > Yes.
> 
> The BARs would be mapped read-only (not read-write), right? Otherwise we
> let dom0 access devices that belong to Xen, which doesn't seem like a
> good idea.

It's my understanding that Xen will mark any regions of the BARs
in-use by the hypervisor as read-only, but the rest will be writable.

> But even if we map the BARs read-only, what is the benefit of mapping
> them to Dom0? If Dom0 loads a driver for it and the driver wants to
> initialize the device, the driver will crash because the MMIO region is
> read-only instead of read-write, right?

I think USB is a good example, Xen uses the debug functionality of
EHCI/XHCI, but by marking the device as hidden it allows dom0 to also
make use of any remaining USB ports.

For r/o devices I don't see much point in mapping the BARs to dom0, as
dom0 is not allowed to size them in the first place, so will be able
to detect the BAR position, but not the BAR size.  I guess this could
cause issues in the future if dom0 starts repositioning BARs, but
let's leave that aside.

> How does this device hiding work for dom0? How does dom0 know not to
> access a device that is present on the PCI bus but is used by Xen?

I think the objective for hidden is to allow dom0 to use the device,
but Xen should protect any MMIO region it doesn't want dom0 to
modify.

> The reason why I was suggesting to hide the device completely in the
> past was that I assumed we had to hide the device (make it "disappear"
> on the PCI bus) otherwise Dom0 would access it. Is there another way to
> mark a PCI devices as "inaccessible" or "disabled"?

I'm not aware of anything else, short of using stuff like the ACPI
STAO and reporting disabled devices there in addition of marking their
config space as r/o.

Thanks, Roger.

Jan Beulich May 25, 2023, 1:24 p.m. UTC | #9

On 24.05.2023 17:33, Roger Pau Monné wrote:
> On Wed, May 24, 2023 at 04:44:49PM +0200, Jan Beulich wrote:
>> On 24.05.2023 16:22, Roger Pau Monné wrote:
>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>>>> Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
>>>> console) are associated with DomXEN, not Dom0. This means that while
>>>> looking for overlapping BARs such devices cannot be found on Dom0's list
>>>> of devices; DomXEN's list also needs to be scanned.
>>>>
>>>> Suppress vPCI init altogether for r/o devices (which constitute a subset
>>>> of hidden ones).
>>>>
>>>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>>> ---
>>>> RFC: The modify_bars() change is intentionally mis-formatted, as the
>>>>      necessary re-indentation would make the diff difficult to read. At
>>>>      this point I'd merely like to gather input towards possible better
>>>>      approaches to solve the issue (not the least because quite possibly
>>>>      there are further places needing changing).
>>>
>>> I think we should also handle the case of !pdev->vpci for the hardware
>>> doamin, as it's allowed for the vpci_add_handlers() call in
>>> setup_one_hwdom_device() to fail and the device wold still be assigned
>>> to the hardware domain.
>>>
>>> I can submit that as a separate bugfix, as it's already an issue
>>> without taking r/o or hidden devices into account.
>>
>> Yeah, I think that wants dealing with separately. I'm not actually sure
>> though that "is allowed to fail" is proper behavior ...
> 
> One better option would be to mark the device r/o if the
> vpci_add_handlers() call fails in setup_one_hwdom_device(), as that
> would prevent dom0 from accessing native MSI(-X) capabilities.

Perhaps, but again in a separate patch.

>>>> RFC: Whether to actually suppress vPCI init is up for debate; adding the
>>>>      extra logic is following Roger's suggestion (I'm not convinced it is
>>>>      useful to have). If we want to keep that, a 2nd question would be
>>>>      whether to keep it in vpci_add_handlers(): Both of its callers (can)
>>>>      have a struct pci_seg readily available, so checking ->ro_map at the
>>>>      call sites would be easier.
>>>
>>> But that would duplicate the logic into the callers, which doesn't
>>> seem very nice to me, and makes it easier to make mistakes if further
>>> callers are added and r/o is not checked there.
>>
>> Right, hence why I didn't do it the alternative way from the beginning.
>> Both approaches have a pro and a con.
>>
>> But prior to answering the 2nd question, what about the 1st one? Is it
>> really worth to have the extra logic?
> 
> Why would we want to do all the vPCI initialization for r/o devices?
> None of the handlers setup will get called, so I see it the other way
> around: not short-circuiting vpci_add_handlers() for r/o devices is a
> waste of time and resources because none of the setup state would be
> used anyway.

Hmm, yes, that's a valid way of looking at (and justifying) it.

>>>  And
>>> hence doing those before or after normal devices will lead to the same
>>> result.  The loop in modify_bars() is there to avoid attempting to map
>>> the same range twice, or to unmap a range while there are devices
>>> still using it, but the unmap is never done during initial device
>>> setup.
>>
>> Okay, so maybe indeed there's no effect on the final result. Yet still
>> the anomaly bothered me when seeing it in the logs (it actually mislead
>> me initially in my conclusions as to what was actually going on), when
>> I added a printk() to that new "continue" path. We would skip hidden
>> devices up until them getting initialized themselves. There would be
>> less skipping if all (there aren't going to be many) DomXEN devices
>> were initialized first.
> 
> I think that just makes the logic more complicated for no reason, the
> only reason you don't see this with devices assigned to dom0 is
> because device addition is interleaved with calls to
> vpci_add_handlers().  However it would also be valid to add all
> devices to dom0 and then call vpci_add_handlers() for each one of them.

Okay, I'll leave that alone than, at least as long as we don't see any
reason because of actual behavioral differences.

Jan

Jan Beulich May 25, 2023, 2:39 p.m. UTC | #10

On 24.05.2023 17:56, Roger Pau Monné wrote:
> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>      struct vpci_header *header = &pdev->vpci->header;
>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>      struct pci_dev *tmp, *dev = NULL;
>> +    const struct domain *d;
>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>      unsigned int i;
>>      int rc;
>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>  
>>      /*
>>       * Check for overlaps with other BARs. Note that only BARs that are
>> -     * currently mapped (enabled) are checked for overlaps.
>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>       */
>> -    for_each_pdev ( pdev->domain, tmp )
>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> 
> Looking at this again, I think this is slightly more complex, as during
> runtime dom0 will get here with pdev->domain == hardware_domain OR
> dom_xen, and hence you also need to account that devices that have
> pdev->domain == dom_xen need to iterate over devices that belong to
> the hardware_domain, ie:
> 
> for ( d = pdev->domain; ;
>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )

Right, something along these lines. To keep loop continuation expression
and exit condition simple, I'll probably prefer

for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
      ; d = dom_xen )

> And we likely want to limit this to devices that belong to the
> hardware_domain or to dom_xen (in preparation for vPCI being used for
> domUs).

I'm afraid I don't understand this remark, though.

Jan

Roger Pau Monné May 25, 2023, 3:02 p.m. UTC | #11

On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
> On 24.05.2023 17:56, Roger Pau Monné wrote:
> > On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> >> --- a/xen/drivers/vpci/header.c
> >> +++ b/xen/drivers/vpci/header.c
> >> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
> >>      struct vpci_header *header = &pdev->vpci->header;
> >>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>      struct pci_dev *tmp, *dev = NULL;
> >> +    const struct domain *d;
> >>      const struct vpci_msix *msix = pdev->vpci->msix;
> >>      unsigned int i;
> >>      int rc;
> >> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
> >>  
> >>      /*
> >>       * Check for overlaps with other BARs. Note that only BARs that are
> >> -     * currently mapped (enabled) are checked for overlaps.
> >> +     * currently mapped (enabled) are checked for overlaps. Note also that
> >> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
> >>       */
> >> -    for_each_pdev ( pdev->domain, tmp )
> >> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> > 
> > Looking at this again, I think this is slightly more complex, as during
> > runtime dom0 will get here with pdev->domain == hardware_domain OR
> > dom_xen, and hence you also need to account that devices that have
> > pdev->domain == dom_xen need to iterate over devices that belong to
> > the hardware_domain, ie:
> > 
> > for ( d = pdev->domain; ;
> >       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
> 
> Right, something along these lines. To keep loop continuation expression
> and exit condition simple, I'll probably prefer
> 
> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>       ; d = dom_xen )

LGTM.  I would add parentheses around the pdev->domain != dom_xen
condition, but that's just my personal taste.

We might want to add an

ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);

here, just to remind that this chunk must be revisited when adding
domU support (but you can also argue we haven't done this elsewhere),
I just feel here it's not so obvious we don't want do to this for
domUs.

> > And we likely want to limit this to devices that belong to the
> > hardware_domain or to dom_xen (in preparation for vPCI being used for
> > domUs).
> 
> I'm afraid I don't understand this remark, though.

This was looking forward to domU support, so that you already cater
for pdev->domain not being hardware_domain or dom_xen, but we might
want to leave that for later, when domU support is actually
introduced.

Thanks, Roger.

Jan Beulich May 25, 2023, 3:30 p.m. UTC | #12

On 25.05.2023 17:02, Roger Pau Monné wrote:
> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
>> On 24.05.2023 17:56, Roger Pau Monné wrote:
>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>>>> --- a/xen/drivers/vpci/header.c
>>>> +++ b/xen/drivers/vpci/header.c
>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>>>      struct vpci_header *header = &pdev->vpci->header;
>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>      struct pci_dev *tmp, *dev = NULL;
>>>> +    const struct domain *d;
>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>>>      unsigned int i;
>>>>      int rc;
>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>>>  
>>>>      /*
>>>>       * Check for overlaps with other BARs. Note that only BARs that are
>>>> -     * currently mapped (enabled) are checked for overlaps.
>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>>>       */
>>>> -    for_each_pdev ( pdev->domain, tmp )
>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
>>>
>>> Looking at this again, I think this is slightly more complex, as during
>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
>>> dom_xen, and hence you also need to account that devices that have
>>> pdev->domain == dom_xen need to iterate over devices that belong to
>>> the hardware_domain, ie:
>>>
>>> for ( d = pdev->domain; ;
>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
>>
>> Right, something along these lines. To keep loop continuation expression
>> and exit condition simple, I'll probably prefer
>>
>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>>       ; d = dom_xen )
> 
> LGTM.  I would add parentheses around the pdev->domain != dom_xen
> condition, but that's just my personal taste.
> 
> We might want to add an
> 
> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
> 
> here, just to remind that this chunk must be revisited when adding
> domU support (but you can also argue we haven't done this elsewhere),
> I just feel here it's not so obvious we don't want do to this for
> domUs.

I could add such an assertion, if only ...

>>> And we likely want to limit this to devices that belong to the
>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
>>> domUs).
>>
>> I'm afraid I don't understand this remark, though.
> 
> This was looking forward to domU support, so that you already cater
> for pdev->domain not being hardware_domain or dom_xen, but we might
> want to leave that for later, when domU support is actually
> introduced.

... I understood why this checking doesn't apply to DomU-s as well,
in your opinion.

Jan

Jan Beulich May 25, 2023, 3:49 p.m. UTC | #13

On 25.05.2023 17:30, Jan Beulich wrote:
> On 25.05.2023 17:02, Roger Pau Monné wrote:
>> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
>>> On 24.05.2023 17:56, Roger Pau Monné wrote:
>>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>>>>> --- a/xen/drivers/vpci/header.c
>>>>> +++ b/xen/drivers/vpci/header.c
>>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>>>>      struct vpci_header *header = &pdev->vpci->header;
>>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>>      struct pci_dev *tmp, *dev = NULL;
>>>>> +    const struct domain *d;
>>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>>>>      unsigned int i;
>>>>>      int rc;
>>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>>>>  
>>>>>      /*
>>>>>       * Check for overlaps with other BARs. Note that only BARs that are
>>>>> -     * currently mapped (enabled) are checked for overlaps.
>>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>>>>       */
>>>>> -    for_each_pdev ( pdev->domain, tmp )
>>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
>>>>
>>>> Looking at this again, I think this is slightly more complex, as during
>>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
>>>> dom_xen, and hence you also need to account that devices that have
>>>> pdev->domain == dom_xen need to iterate over devices that belong to
>>>> the hardware_domain, ie:
>>>>
>>>> for ( d = pdev->domain; ;
>>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
>>>
>>> Right, something along these lines. To keep loop continuation expression
>>> and exit condition simple, I'll probably prefer
>>>
>>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>>>       ; d = dom_xen )
>>
>> LGTM.  I would add parentheses around the pdev->domain != dom_xen
>> condition, but that's just my personal taste.
>>
>> We might want to add an
>>
>> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
>>
>> here, just to remind that this chunk must be revisited when adding
>> domU support (but you can also argue we haven't done this elsewhere),
>> I just feel here it's not so obvious we don't want do to this for
>> domUs.
> 
> I could add such an assertion, if only ...
> 
>>>> And we likely want to limit this to devices that belong to the
>>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
>>>> domUs).
>>>
>>> I'm afraid I don't understand this remark, though.
>>
>> This was looking forward to domU support, so that you already cater
>> for pdev->domain not being hardware_domain or dom_xen, but we might
>> want to leave that for later, when domU support is actually
>> introduced.
> 
> ... I understood why this checking doesn't apply to DomU-s as well,
> in your opinion.

Or did you mean that to go inside the if() your patch adds (and hence
my patch won't need to add anymore)? I didn't think you did, because
then it would rather be

ASSERT(d == hardware_domain || d == dom_xen)

imo.

Jan

Stefano Stabellini May 25, 2023, 7:24 p.m. UTC | #14

On Thu, 25 May 2023, Jan Beulich wrote:
> On 25.05.2023 01:37, Stefano Stabellini wrote:
> > On Wed, 24 May 2023, Jan Beulich wrote:
> >>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
> >>>>      modify_bars() to consistently respect BARs of hidden devices while
> >>>>      setting up "normal" ones (i.e. to avoid as much as possible the
> >>>>      "continue" path introduced here), setting up of the former may want
> >>>>      doing first.
> >>>
> >>> But BARs of hidden devices should be mapped into dom0 physmap?
> >>
> >> Yes.
> > 
> > The BARs would be mapped read-only (not read-write), right? Otherwise we
> > let dom0 access devices that belong to Xen, which doesn't seem like a
> > good idea.
> > 
> > But even if we map the BARs read-only, what is the benefit of mapping
> > them to Dom0? If Dom0 loads a driver for it and the driver wants to
> > initialize the device, the driver will crash because the MMIO region is
> > read-only instead of read-write, right?
> > 
> > How does this device hiding work for dom0? How does dom0 know not to
> > access a device that is present on the PCI bus but is used by Xen?
> 
> None of these are new questions - this has all been this way for PV Dom0,
> and so far we've limped along quite okay. That's not to say that we
> shouldn't improve things if we can, but that first requires ideas as to
> how.

For PV, that was OK because PV requires extensive guest modifications
anyway. We only run Linux and few BSDs as Dom0. So, making the interface
cleaner and reducing guest changes is nice-to-have but not critical.

For PVH, this is different. One of the top reasons for AMD to work on
PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
dom0less/hyperlaunch). It could be anything from Zephyr to a
proprietary RTOS like VxWorks. Minimal guest changes for advanced
features (e.g. Dom0 S3) might be OK but in general I think we should aim
at (almost) zero guest changes. On ARM, it is already the case (with some
non-upstream patches for dom0less PCI.)

For this specific patch, which is necessary to enable PVH on AMD x86 in
gitlab-ci, we can do anything we want to make it move faster. But
medium/long term I think we should try to make non-Xen-aware PVH Dom0
possible.

Cheers,

Stefano

Stefano Stabellini May 25, 2023, 7:32 p.m. UTC | #15

On Thu, 25 May 2023, Stefano Stabellini wrote:
> On Thu, 25 May 2023, Jan Beulich wrote:
> > On 25.05.2023 01:37, Stefano Stabellini wrote:
> > > On Wed, 24 May 2023, Jan Beulich wrote:
> > >>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
> > >>>>      modify_bars() to consistently respect BARs of hidden devices while
> > >>>>      setting up "normal" ones (i.e. to avoid as much as possible the
> > >>>>      "continue" path introduced here), setting up of the former may want
> > >>>>      doing first.
> > >>>
> > >>> But BARs of hidden devices should be mapped into dom0 physmap?
> > >>
> > >> Yes.
> > > 
> > > The BARs would be mapped read-only (not read-write), right? Otherwise we
> > > let dom0 access devices that belong to Xen, which doesn't seem like a
> > > good idea.
> > > 
> > > But even if we map the BARs read-only, what is the benefit of mapping
> > > them to Dom0? If Dom0 loads a driver for it and the driver wants to
> > > initialize the device, the driver will crash because the MMIO region is
> > > read-only instead of read-write, right?
> > > 
> > > How does this device hiding work for dom0? How does dom0 know not to
> > > access a device that is present on the PCI bus but is used by Xen?
> > 
> > None of these are new questions - this has all been this way for PV Dom0,
> > and so far we've limped along quite okay. That's not to say that we
> > shouldn't improve things if we can, but that first requires ideas as to
> > how.
> 
> For PV, that was OK because PV requires extensive guest modifications
> anyway. We only run Linux and few BSDs as Dom0. So, making the interface
> cleaner and reducing guest changes is nice-to-have but not critical.
> 
> For PVH, this is different. One of the top reasons for AMD to work on
> PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
> dom0less/hyperlaunch). It could be anything from Zephyr to a
> proprietary RTOS like VxWorks. Minimal guest changes for advanced
> features (e.g. Dom0 S3) might be OK but in general I think we should aim
> at (almost) zero guest changes. On ARM, it is already the case (with some
> non-upstream patches for dom0less PCI.)
> 
> For this specific patch, which is necessary to enable PVH on AMD x86 in
> gitlab-ci, we can do anything we want to make it move faster. But
> medium/long term I think we should try to make non-Xen-aware PVH Dom0
> possible.

Like I wrote, personally I am happy with whatever gets us to have the PVH
test in gitlab-ci faster.

However, on the specific problem of PCI devices used by Xen and how to
deal with them for Dom0 PVH, I think they should be completely hidden.
Hidden in the sense that they don't appear on the Dom0 PCI bus. If the
hidden device is a function of a multi-function PCI device, then the
entire multi-function PCI device should be hidden.

I don't think this case is very important because devices used by Xen
are timers, IOMMUs, UARTs, all devices that typically are not
multi-function, so it is OK to be extra careful and remove the entire
device from Dom0 in the odd case that the device is both multi-function
and only partially used by Xen. This is what I would do for Xen on ARM
too.

Jan Beulich May 26, 2023, 6:39 a.m. UTC | #16

On 25.05.2023 21:24, Stefano Stabellini wrote:
> On Thu, 25 May 2023, Jan Beulich wrote:
>> On 25.05.2023 01:37, Stefano Stabellini wrote:
>>> On Wed, 24 May 2023, Jan Beulich wrote:
>>>>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>>>>>>      modify_bars() to consistently respect BARs of hidden devices while
>>>>>>      setting up "normal" ones (i.e. to avoid as much as possible the
>>>>>>      "continue" path introduced here), setting up of the former may want
>>>>>>      doing first.
>>>>>
>>>>> But BARs of hidden devices should be mapped into dom0 physmap?
>>>>
>>>> Yes.
>>>
>>> The BARs would be mapped read-only (not read-write), right? Otherwise we
>>> let dom0 access devices that belong to Xen, which doesn't seem like a
>>> good idea.
>>>
>>> But even if we map the BARs read-only, what is the benefit of mapping
>>> them to Dom0? If Dom0 loads a driver for it and the driver wants to
>>> initialize the device, the driver will crash because the MMIO region is
>>> read-only instead of read-write, right?
>>>
>>> How does this device hiding work for dom0? How does dom0 know not to
>>> access a device that is present on the PCI bus but is used by Xen?
>>
>> None of these are new questions - this has all been this way for PV Dom0,
>> and so far we've limped along quite okay. That's not to say that we
>> shouldn't improve things if we can, but that first requires ideas as to
>> how.
> 
> For PV, that was OK because PV requires extensive guest modifications
> anyway. We only run Linux and few BSDs as Dom0. So, making the interface
> cleaner and reducing guest changes is nice-to-have but not critical.
> 
> For PVH, this is different. One of the top reasons for AMD to work on
> PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
> dom0less/hyperlaunch). It could be anything from Zephyr to a
> proprietary RTOS like VxWorks. Minimal guest changes for advanced
> features (e.g. Dom0 S3) might be OK but in general I think we should aim
> at (almost) zero guest changes. On ARM, it is already the case (with some
> non-upstream patches for dom0less PCI.)
> 
> For this specific patch, which is necessary to enable PVH on AMD x86 in
> gitlab-ci, we can do anything we want to make it move faster. But
> medium/long term I think we should try to make non-Xen-aware PVH Dom0
> possible.

I don't think Linux could boot as PVH Dom0 without any awareness. Hence
I guess it's not easy to see how other OSes might. What you're after
looks rather like a HVM Dom0 to me, with it being unclear where the
external emulator then would run (in a stubdom maybe, which might be
possible to arrange for via the dom0less way of creating boot time
DomU-s) and how it would get any necessary xenstore based information.

Jan

Jan Beulich May 26, 2023, 6:45 a.m. UTC | #17

On 25.05.2023 21:32, Stefano Stabellini wrote:
> Like I wrote, personally I am happy with whatever gets us to have the PVH
> test in gitlab-ci faster.
> 
> However, on the specific problem of PCI devices used by Xen and how to
> deal with them for Dom0 PVH, I think they should be completely hidden.
> Hidden in the sense that they don't appear on the Dom0 PCI bus. If the
> hidden device is a function of a multi-function PCI device, then the
> entire multi-function PCI device should be hidden.
> 
> I don't think this case is very important because devices used by Xen
> are timers, IOMMUs, UARTs,

... USB debug ports (EHCI, XHCI), ...

> all devices that typically are not multi-function,

except for the ones added. Furthermore see video_endboot() for a case
of also hiding the VGA device, which isn't unlikely to have secondary
functions (sound controllers are not uncommon). Hence ...

> so it is OK to be extra careful and remove the entire
> device from Dom0 in the odd case that the device is both multi-function
> and only partially used by Xen. This is what I would do for Xen on ARM
> too.

... at best I would see this as a non-default mode of operation. Of
course we could also play more funny games with vPCI, like surfacing
a "stub" device in place of a hidden one, so the other functions can
still be found.

Jan

Roger Pau Monné May 29, 2023, 8:08 a.m. UTC | #18

On Thu, May 25, 2023 at 05:30:54PM +0200, Jan Beulich wrote:
> On 25.05.2023 17:02, Roger Pau Monné wrote:
> > On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
> >> On 24.05.2023 17:56, Roger Pau Monné wrote:
> >>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> >>>> --- a/xen/drivers/vpci/header.c
> >>>> +++ b/xen/drivers/vpci/header.c
> >>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
> >>>>      struct vpci_header *header = &pdev->vpci->header;
> >>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>>>      struct pci_dev *tmp, *dev = NULL;
> >>>> +    const struct domain *d;
> >>>>      const struct vpci_msix *msix = pdev->vpci->msix;
> >>>>      unsigned int i;
> >>>>      int rc;
> >>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
> >>>>  
> >>>>      /*
> >>>>       * Check for overlaps with other BARs. Note that only BARs that are
> >>>> -     * currently mapped (enabled) are checked for overlaps.
> >>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
> >>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
> >>>>       */
> >>>> -    for_each_pdev ( pdev->domain, tmp )
> >>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> >>>
> >>> Looking at this again, I think this is slightly more complex, as during
> >>> runtime dom0 will get here with pdev->domain == hardware_domain OR
> >>> dom_xen, and hence you also need to account that devices that have
> >>> pdev->domain == dom_xen need to iterate over devices that belong to
> >>> the hardware_domain, ie:
> >>>
> >>> for ( d = pdev->domain; ;
> >>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
> >>
> >> Right, something along these lines. To keep loop continuation expression
> >> and exit condition simple, I'll probably prefer
> >>
> >> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
> >>       ; d = dom_xen )
> > 
> > LGTM.  I would add parentheses around the pdev->domain != dom_xen
> > condition, but that's just my personal taste.
> > 
> > We might want to add an
> > 
> > ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
> > 
> > here, just to remind that this chunk must be revisited when adding
> > domU support (but you can also argue we haven't done this elsewhere),
> > I just feel here it's not so obvious we don't want do to this for
> > domUs.
> 
> I could add such an assertion, if only ...
> 
> >>> And we likely want to limit this to devices that belong to the
> >>> hardware_domain or to dom_xen (in preparation for vPCI being used for
> >>> domUs).
> >>
> >> I'm afraid I don't understand this remark, though.
> > 
> > This was looking forward to domU support, so that you already cater
> > for pdev->domain not being hardware_domain or dom_xen, but we might
> > want to leave that for later, when domU support is actually
> > introduced.
> 
> ... I understood why this checking doesn't apply to DomU-s as well,
> in your opinion.

It's my understanding that domUs can never get hidden or read-only
devices assigned, and hence there no need to check for overlap with
devices assigned to dom_xen, as those cannot have any BARs mapped in
a domU physmap.

So for domUs the overlap check only needs to be performed against
devices assigned to pdev->domain.

Thanks, Roger.

Jan Beulich May 30, 2023, 8:45 a.m. UTC | #19

On 29.05.2023 10:08, Roger Pau Monné wrote:
> On Thu, May 25, 2023 at 05:30:54PM +0200, Jan Beulich wrote:
>> On 25.05.2023 17:02, Roger Pau Monné wrote:
>>> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
>>>> On 24.05.2023 17:56, Roger Pau Monné wrote:
>>>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>>>>>      struct vpci_header *header = &pdev->vpci->header;
>>>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>>>      struct pci_dev *tmp, *dev = NULL;
>>>>>> +    const struct domain *d;
>>>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>>>>>      unsigned int i;
>>>>>>      int rc;
>>>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>>>>>  
>>>>>>      /*
>>>>>>       * Check for overlaps with other BARs. Note that only BARs that are
>>>>>> -     * currently mapped (enabled) are checked for overlaps.
>>>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>>>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>>>>>       */
>>>>>> -    for_each_pdev ( pdev->domain, tmp )
>>>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
>>>>>
>>>>> Looking at this again, I think this is slightly more complex, as during
>>>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
>>>>> dom_xen, and hence you also need to account that devices that have
>>>>> pdev->domain == dom_xen need to iterate over devices that belong to
>>>>> the hardware_domain, ie:
>>>>>
>>>>> for ( d = pdev->domain; ;
>>>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
>>>>
>>>> Right, something along these lines. To keep loop continuation expression
>>>> and exit condition simple, I'll probably prefer
>>>>
>>>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>>>>       ; d = dom_xen )
>>>
>>> LGTM.  I would add parentheses around the pdev->domain != dom_xen
>>> condition, but that's just my personal taste.
>>>
>>> We might want to add an
>>>
>>> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
>>>
>>> here, just to remind that this chunk must be revisited when adding
>>> domU support (but you can also argue we haven't done this elsewhere),
>>> I just feel here it's not so obvious we don't want do to this for
>>> domUs.
>>
>> I could add such an assertion, if only ...
>>
>>>>> And we likely want to limit this to devices that belong to the
>>>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
>>>>> domUs).
>>>>
>>>> I'm afraid I don't understand this remark, though.
>>>
>>> This was looking forward to domU support, so that you already cater
>>> for pdev->domain not being hardware_domain or dom_xen, but we might
>>> want to leave that for later, when domU support is actually
>>> introduced.
>>
>> ... I understood why this checking doesn't apply to DomU-s as well,
>> in your opinion.
> 
> It's my understanding that domUs can never get hidden or read-only
> devices assigned, and hence there no need to check for overlap with
> devices assigned to dom_xen, as those cannot have any BARs mapped in
> a domU physmap.
> 
> So for domUs the overlap check only needs to be performed against
> devices assigned to pdev->domain.

I fully agree, but the assertion you suggested doesn't express that. Or
maybe I'm misunderstanding what you did suggest, and there was an
implication of some further if() around it.

Jan

Roger Pau Monné May 30, 2023, 9:12 a.m. UTC | #20

On Tue, May 30, 2023 at 10:45:09AM +0200, Jan Beulich wrote:
> On 29.05.2023 10:08, Roger Pau Monné wrote:
> > On Thu, May 25, 2023 at 05:30:54PM +0200, Jan Beulich wrote:
> >> On 25.05.2023 17:02, Roger Pau Monné wrote:
> >>> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
> >>>> On 24.05.2023 17:56, Roger Pau Monné wrote:
> >>>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> >>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
> >>>>>>      struct vpci_header *header = &pdev->vpci->header;
> >>>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>>>>>      struct pci_dev *tmp, *dev = NULL;
> >>>>>> +    const struct domain *d;
> >>>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
> >>>>>>      unsigned int i;
> >>>>>>      int rc;
> >>>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
> >>>>>>  
> >>>>>>      /*
> >>>>>>       * Check for overlaps with other BARs. Note that only BARs that are
> >>>>>> -     * currently mapped (enabled) are checked for overlaps.
> >>>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
> >>>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
> >>>>>>       */
> >>>>>> -    for_each_pdev ( pdev->domain, tmp )
> >>>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> >>>>>
> >>>>> Looking at this again, I think this is slightly more complex, as during
> >>>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
> >>>>> dom_xen, and hence you also need to account that devices that have
> >>>>> pdev->domain == dom_xen need to iterate over devices that belong to
> >>>>> the hardware_domain, ie:
> >>>>>
> >>>>> for ( d = pdev->domain; ;
> >>>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
> >>>>
> >>>> Right, something along these lines. To keep loop continuation expression
> >>>> and exit condition simple, I'll probably prefer
> >>>>
> >>>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
> >>>>       ; d = dom_xen )
> >>>
> >>> LGTM.  I would add parentheses around the pdev->domain != dom_xen
> >>> condition, but that's just my personal taste.
> >>>
> >>> We might want to add an
> >>>
> >>> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
> >>>
> >>> here, just to remind that this chunk must be revisited when adding
> >>> domU support (but you can also argue we haven't done this elsewhere),
> >>> I just feel here it's not so obvious we don't want do to this for
> >>> domUs.
> >>
> >> I could add such an assertion, if only ...
> >>
> >>>>> And we likely want to limit this to devices that belong to the
> >>>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
> >>>>> domUs).
> >>>>
> >>>> I'm afraid I don't understand this remark, though.
> >>>
> >>> This was looking forward to domU support, so that you already cater
> >>> for pdev->domain not being hardware_domain or dom_xen, but we might
> >>> want to leave that for later, when domU support is actually
> >>> introduced.
> >>
> >> ... I understood why this checking doesn't apply to DomU-s as well,
> >> in your opinion.
> > 
> > It's my understanding that domUs can never get hidden or read-only
> > devices assigned, and hence there no need to check for overlap with
> > devices assigned to dom_xen, as those cannot have any BARs mapped in
> > a domU physmap.
> > 
> > So for domUs the overlap check only needs to be performed against
> > devices assigned to pdev->domain.
> 
> I fully agree, but the assertion you suggested doesn't express that. Or
> maybe I'm misunderstanding what you did suggest, and there was an
> implication of some further if() around it.

Maybe I'm getting myself confused, but if you add something like:

for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
      ; d = dom_xen )

Such loop would need to be avoided for domUs, so my suggestion was to
add the assert in order to remind us that the loop would need
adjusting if we ever add domU support.  But maybe you had already
plans to restrict the loop to dom0 only.

Thanks, Roger.

Jan Beulich May 30, 2023, 9:44 a.m. UTC | #21

On 30.05.2023 11:12, Roger Pau Monné wrote:
> On Tue, May 30, 2023 at 10:45:09AM +0200, Jan Beulich wrote:
>> On 29.05.2023 10:08, Roger Pau Monné wrote:
>>> On Thu, May 25, 2023 at 05:30:54PM +0200, Jan Beulich wrote:
>>>> On 25.05.2023 17:02, Roger Pau Monné wrote:
>>>>> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
>>>>>> On 24.05.2023 17:56, Roger Pau Monné wrote:
>>>>>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
>>>>>>>> --- a/xen/drivers/vpci/header.c
>>>>>>>> +++ b/xen/drivers/vpci/header.c
>>>>>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
>>>>>>>>      struct vpci_header *header = &pdev->vpci->header;
>>>>>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
>>>>>>>>      struct pci_dev *tmp, *dev = NULL;
>>>>>>>> +    const struct domain *d;
>>>>>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
>>>>>>>>      unsigned int i;
>>>>>>>>      int rc;
>>>>>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
>>>>>>>>  
>>>>>>>>      /*
>>>>>>>>       * Check for overlaps with other BARs. Note that only BARs that are
>>>>>>>> -     * currently mapped (enabled) are checked for overlaps.
>>>>>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
>>>>>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
>>>>>>>>       */
>>>>>>>> -    for_each_pdev ( pdev->domain, tmp )
>>>>>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
>>>>>>>
>>>>>>> Looking at this again, I think this is slightly more complex, as during
>>>>>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
>>>>>>> dom_xen, and hence you also need to account that devices that have
>>>>>>> pdev->domain == dom_xen need to iterate over devices that belong to
>>>>>>> the hardware_domain, ie:
>>>>>>>
>>>>>>> for ( d = pdev->domain; ;
>>>>>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
>>>>>>
>>>>>> Right, something along these lines. To keep loop continuation expression
>>>>>> and exit condition simple, I'll probably prefer
>>>>>>
>>>>>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>>>>>>       ; d = dom_xen )
>>>>>
>>>>> LGTM.  I would add parentheses around the pdev->domain != dom_xen
>>>>> condition, but that's just my personal taste.
>>>>>
>>>>> We might want to add an
>>>>>
>>>>> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
>>>>>
>>>>> here, just to remind that this chunk must be revisited when adding
>>>>> domU support (but you can also argue we haven't done this elsewhere),
>>>>> I just feel here it's not so obvious we don't want do to this for
>>>>> domUs.
>>>>
>>>> I could add such an assertion, if only ...
>>>>
>>>>>>> And we likely want to limit this to devices that belong to the
>>>>>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
>>>>>>> domUs).
>>>>>>
>>>>>> I'm afraid I don't understand this remark, though.
>>>>>
>>>>> This was looking forward to domU support, so that you already cater
>>>>> for pdev->domain not being hardware_domain or dom_xen, but we might
>>>>> want to leave that for later, when domU support is actually
>>>>> introduced.
>>>>
>>>> ... I understood why this checking doesn't apply to DomU-s as well,
>>>> in your opinion.
>>>
>>> It's my understanding that domUs can never get hidden or read-only
>>> devices assigned, and hence there no need to check for overlap with
>>> devices assigned to dom_xen, as those cannot have any BARs mapped in
>>> a domU physmap.
>>>
>>> So for domUs the overlap check only needs to be performed against
>>> devices assigned to pdev->domain.
>>
>> I fully agree, but the assertion you suggested doesn't express that. Or
>> maybe I'm misunderstanding what you did suggest, and there was an
>> implication of some further if() around it.
> 
> Maybe I'm getting myself confused, but if you add something like:
> 
> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
>       ; d = dom_xen )
> 
> Such loop would need to be avoided for domUs, so my suggestion was to
> add the assert in order to remind us that the loop would need
> adjusting if we ever add domU support.  But maybe you had already
> plans to restrict the loop to dom0 only.

Not really, no, but at the bottom of the loop I also have

        if ( !is_hardware_domain(d) )
            break;
    }

(still mis-formatted in the v2 patch). I.e. restricting to Dom0 goes
only as far as the 2nd loop iteration.

Jan

Roger Pau Monné May 30, 2023, 11:04 a.m. UTC | #22

On Tue, May 30, 2023 at 11:44:52AM +0200, Jan Beulich wrote:
> On 30.05.2023 11:12, Roger Pau Monné wrote:
> > On Tue, May 30, 2023 at 10:45:09AM +0200, Jan Beulich wrote:
> >> On 29.05.2023 10:08, Roger Pau Monné wrote:
> >>> On Thu, May 25, 2023 at 05:30:54PM +0200, Jan Beulich wrote:
> >>>> On 25.05.2023 17:02, Roger Pau Monné wrote:
> >>>>> On Thu, May 25, 2023 at 04:39:51PM +0200, Jan Beulich wrote:
> >>>>>> On 24.05.2023 17:56, Roger Pau Monné wrote:
> >>>>>>> On Wed, May 24, 2023 at 03:45:58PM +0200, Jan Beulich wrote:
> >>>>>>>> --- a/xen/drivers/vpci/header.c
> >>>>>>>> +++ b/xen/drivers/vpci/header.c
> >>>>>>>> @@ -218,6 +218,7 @@ static int modify_bars(const struct pci_
> >>>>>>>>      struct vpci_header *header = &pdev->vpci->header;
> >>>>>>>>      struct rangeset *mem = rangeset_new(NULL, NULL, 0);
> >>>>>>>>      struct pci_dev *tmp, *dev = NULL;
> >>>>>>>> +    const struct domain *d;
> >>>>>>>>      const struct vpci_msix *msix = pdev->vpci->msix;
> >>>>>>>>      unsigned int i;
> >>>>>>>>      int rc;
> >>>>>>>> @@ -285,9 +286,11 @@ static int modify_bars(const struct pci_
> >>>>>>>>  
> >>>>>>>>      /*
> >>>>>>>>       * Check for overlaps with other BARs. Note that only BARs that are
> >>>>>>>> -     * currently mapped (enabled) are checked for overlaps.
> >>>>>>>> +     * currently mapped (enabled) are checked for overlaps. Note also that
> >>>>>>>> +     * for Dom0 we also need to include hidden, i.e. DomXEN's, devices.
> >>>>>>>>       */
> >>>>>>>> -    for_each_pdev ( pdev->domain, tmp )
> >>>>>>>> +for ( d = pdev->domain; ; d = dom_xen ) {//todo
> >>>>>>>
> >>>>>>> Looking at this again, I think this is slightly more complex, as during
> >>>>>>> runtime dom0 will get here with pdev->domain == hardware_domain OR
> >>>>>>> dom_xen, and hence you also need to account that devices that have
> >>>>>>> pdev->domain == dom_xen need to iterate over devices that belong to
> >>>>>>> the hardware_domain, ie:
> >>>>>>>
> >>>>>>> for ( d = pdev->domain; ;
> >>>>>>>       d = (pdev->domain == dom_xen) ? hardware_domain : dom_xen )
> >>>>>>
> >>>>>> Right, something along these lines. To keep loop continuation expression
> >>>>>> and exit condition simple, I'll probably prefer
> >>>>>>
> >>>>>> for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
> >>>>>>       ; d = dom_xen )
> >>>>>
> >>>>> LGTM.  I would add parentheses around the pdev->domain != dom_xen
> >>>>> condition, but that's just my personal taste.
> >>>>>
> >>>>> We might want to add an
> >>>>>
> >>>>> ASSERT(pdev->domain == hardware_domain || pdev->domain == dom_xen);
> >>>>>
> >>>>> here, just to remind that this chunk must be revisited when adding
> >>>>> domU support (but you can also argue we haven't done this elsewhere),
> >>>>> I just feel here it's not so obvious we don't want do to this for
> >>>>> domUs.
> >>>>
> >>>> I could add such an assertion, if only ...
> >>>>
> >>>>>>> And we likely want to limit this to devices that belong to the
> >>>>>>> hardware_domain or to dom_xen (in preparation for vPCI being used for
> >>>>>>> domUs).
> >>>>>>
> >>>>>> I'm afraid I don't understand this remark, though.
> >>>>>
> >>>>> This was looking forward to domU support, so that you already cater
> >>>>> for pdev->domain not being hardware_domain or dom_xen, but we might
> >>>>> want to leave that for later, when domU support is actually
> >>>>> introduced.
> >>>>
> >>>> ... I understood why this checking doesn't apply to DomU-s as well,
> >>>> in your opinion.
> >>>
> >>> It's my understanding that domUs can never get hidden or read-only
> >>> devices assigned, and hence there no need to check for overlap with
> >>> devices assigned to dom_xen, as those cannot have any BARs mapped in
> >>> a domU physmap.
> >>>
> >>> So for domUs the overlap check only needs to be performed against
> >>> devices assigned to pdev->domain.
> >>
> >> I fully agree, but the assertion you suggested doesn't express that. Or
> >> maybe I'm misunderstanding what you did suggest, and there was an
> >> implication of some further if() around it.
> > 
> > Maybe I'm getting myself confused, but if you add something like:
> > 
> > for ( d = pdev->domain != dom_xen ? pdev->domain : hardware_domain;
> >       ; d = dom_xen )
> > 
> > Such loop would need to be avoided for domUs, so my suggestion was to
> > add the assert in order to remind us that the loop would need
> > adjusting if we ever add domU support.  But maybe you had already
> > plans to restrict the loop to dom0 only.
> 
> Not really, no, but at the bottom of the loop I also have
> 
>         if ( !is_hardware_domain(d) )
>             break;
>     }
> 
> (still mis-formatted in the v2 patch). I.e. restricting to Dom0 goes
> only as far as the 2nd loop iteration.

Oh, right, and that would also exit the loop on the first iteration if
the device is assigned to a domU, so it's all fine.  Sorry for the
noise then.

Thanks, Roger.

Stefano Stabellini May 30, 2023, 10:38 p.m. UTC | #23

On Fri, 26 May 2023, Jan Beulich wrote:
> On 25.05.2023 21:24, Stefano Stabellini wrote:
> > On Thu, 25 May 2023, Jan Beulich wrote:
> >> On 25.05.2023 01:37, Stefano Stabellini wrote:
> >>> On Wed, 24 May 2023, Jan Beulich wrote:
> >>>>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
> >>>>>>      modify_bars() to consistently respect BARs of hidden devices while
> >>>>>>      setting up "normal" ones (i.e. to avoid as much as possible the
> >>>>>>      "continue" path introduced here), setting up of the former may want
> >>>>>>      doing first.
> >>>>>
> >>>>> But BARs of hidden devices should be mapped into dom0 physmap?
> >>>>
> >>>> Yes.
> >>>
> >>> The BARs would be mapped read-only (not read-write), right? Otherwise we
> >>> let dom0 access devices that belong to Xen, which doesn't seem like a
> >>> good idea.
> >>>
> >>> But even if we map the BARs read-only, what is the benefit of mapping
> >>> them to Dom0? If Dom0 loads a driver for it and the driver wants to
> >>> initialize the device, the driver will crash because the MMIO region is
> >>> read-only instead of read-write, right?
> >>>
> >>> How does this device hiding work for dom0? How does dom0 know not to
> >>> access a device that is present on the PCI bus but is used by Xen?
> >>
> >> None of these are new questions - this has all been this way for PV Dom0,
> >> and so far we've limped along quite okay. That's not to say that we
> >> shouldn't improve things if we can, but that first requires ideas as to
> >> how.
> > 
> > For PV, that was OK because PV requires extensive guest modifications
> > anyway. We only run Linux and few BSDs as Dom0. So, making the interface
> > cleaner and reducing guest changes is nice-to-have but not critical.
> > 
> > For PVH, this is different. One of the top reasons for AMD to work on
> > PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
> > dom0less/hyperlaunch). It could be anything from Zephyr to a
> > proprietary RTOS like VxWorks. Minimal guest changes for advanced
> > features (e.g. Dom0 S3) might be OK but in general I think we should aim
> > at (almost) zero guest changes. On ARM, it is already the case (with some
> > non-upstream patches for dom0less PCI.)
> > 
> > For this specific patch, which is necessary to enable PVH on AMD x86 in
> > gitlab-ci, we can do anything we want to make it move faster. But
> > medium/long term I think we should try to make non-Xen-aware PVH Dom0
> > possible.
> 
> I don't think Linux could boot as PVH Dom0 without any awareness. Hence
> I guess it's not easy to see how other OSes might. What you're after
> looks rather like a HVM Dom0 to me, with it being unclear where the
> external emulator then would run (in a stubdom maybe, which might be
> possible to arrange for via the dom0less way of creating boot time
> DomU-s) and how it would get any necessary xenstore based information.

I know that Linux has lots of Xen awareness scattered everywhere so it
is difficult to tell what's what. Leaving the PVH entry point aside for
this discussion, what else is really needed for a Linux without
CONFIG_XEN to boot as PVH Dom0?

Same question from a different angle: let's say that we boot Zephyr or
another RTOS as HVM Dom0, what is really required for the emulator to
emulate? I am hoping that the answer is "nothing" except for maybe a
UART.

It comes down to how much legacy stuff the guest OS expects to find.
Legacy stuff that would normally be emulated by QEMU. I am counting on
the fact that a modern OS doesn't expect any of the legacy stuff (e.g.
PIIX3/Q35/E1000) if it is not advertised in the firmware tables. If
there is no need for QEMU, I don't know if I would call it PVH or HVM
but either way we are good. 

Same for xenstore: there should be no need for xenstore without
CONFIG_XEN.

Stefano Stabellini May 30, 2023, 10:42 p.m. UTC | #24

On Fri, 26 May 2023, Jan Beulich wrote:
> On 25.05.2023 21:32, Stefano Stabellini wrote:
> > Like I wrote, personally I am happy with whatever gets us to have the PVH
> > test in gitlab-ci faster.
> > 
> > However, on the specific problem of PCI devices used by Xen and how to
> > deal with them for Dom0 PVH, I think they should be completely hidden.
> > Hidden in the sense that they don't appear on the Dom0 PCI bus. If the
> > hidden device is a function of a multi-function PCI device, then the
> > entire multi-function PCI device should be hidden.
> > 
> > I don't think this case is very important because devices used by Xen
> > are timers, IOMMUs, UARTs,
> 
> ... USB debug ports (EHCI, XHCI), ...
> 
> > all devices that typically are not multi-function,
> 
> except for the ones added. Furthermore see video_endboot() for a case
> of also hiding the VGA device, which isn't unlikely to have secondary
> functions (sound controllers are not uncommon). Hence ...
> 
> > so it is OK to be extra careful and remove the entire
> > device from Dom0 in the odd case that the device is both multi-function
> > and only partially used by Xen. This is what I would do for Xen on ARM
> > too.
> 
> ... at best I would see this as a non-default mode of operation. Of
> course we could also play more funny games with vPCI, like surfacing
> a "stub" device in place of a hidden one, so the other functions can
> still be found.

From our use-case point of view, non-default is OK. The PCI stub idea is
a cool trick that might work but hopefully we won't need it at least
initially. But it is something to consider if it turns out there is an
important multi-function device with one of the devices hidden.

Jan Beulich May 31, 2023, 9:23 a.m. UTC | #25

On 31.05.2023 00:38, Stefano Stabellini wrote:
> On Fri, 26 May 2023, Jan Beulich wrote:
>> On 25.05.2023 21:24, Stefano Stabellini wrote:
>>> On Thu, 25 May 2023, Jan Beulich wrote:
>>>> On 25.05.2023 01:37, Stefano Stabellini wrote:
>>>>> On Wed, 24 May 2023, Jan Beulich wrote:
>>>>>>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>>>>>>>>      modify_bars() to consistently respect BARs of hidden devices while
>>>>>>>>      setting up "normal" ones (i.e. to avoid as much as possible the
>>>>>>>>      "continue" path introduced here), setting up of the former may want
>>>>>>>>      doing first.
>>>>>>>
>>>>>>> But BARs of hidden devices should be mapped into dom0 physmap?
>>>>>>
>>>>>> Yes.
>>>>>
>>>>> The BARs would be mapped read-only (not read-write), right? Otherwise we
>>>>> let dom0 access devices that belong to Xen, which doesn't seem like a
>>>>> good idea.
>>>>>
>>>>> But even if we map the BARs read-only, what is the benefit of mapping
>>>>> them to Dom0? If Dom0 loads a driver for it and the driver wants to
>>>>> initialize the device, the driver will crash because the MMIO region is
>>>>> read-only instead of read-write, right?
>>>>>
>>>>> How does this device hiding work for dom0? How does dom0 know not to
>>>>> access a device that is present on the PCI bus but is used by Xen?
>>>>
>>>> None of these are new questions - this has all been this way for PV Dom0,
>>>> and so far we've limped along quite okay. That's not to say that we
>>>> shouldn't improve things if we can, but that first requires ideas as to
>>>> how.
>>>
>>> For PV, that was OK because PV requires extensive guest modifications
>>> anyway. We only run Linux and few BSDs as Dom0. So, making the interface
>>> cleaner and reducing guest changes is nice-to-have but not critical.
>>>
>>> For PVH, this is different. One of the top reasons for AMD to work on
>>> PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
>>> dom0less/hyperlaunch). It could be anything from Zephyr to a
>>> proprietary RTOS like VxWorks. Minimal guest changes for advanced
>>> features (e.g. Dom0 S3) might be OK but in general I think we should aim
>>> at (almost) zero guest changes. On ARM, it is already the case (with some
>>> non-upstream patches for dom0less PCI.)
>>>
>>> For this specific patch, which is necessary to enable PVH on AMD x86 in
>>> gitlab-ci, we can do anything we want to make it move faster. But
>>> medium/long term I think we should try to make non-Xen-aware PVH Dom0
>>> possible.
>>
>> I don't think Linux could boot as PVH Dom0 without any awareness. Hence
>> I guess it's not easy to see how other OSes might. What you're after
>> looks rather like a HVM Dom0 to me, with it being unclear where the
>> external emulator then would run (in a stubdom maybe, which might be
>> possible to arrange for via the dom0less way of creating boot time
>> DomU-s) and how it would get any necessary xenstore based information.
> 
> I know that Linux has lots of Xen awareness scattered everywhere so it
> is difficult to tell what's what. Leaving the PVH entry point aside for
> this discussion, what else is really needed for a Linux without
> CONFIG_XEN to boot as PVH Dom0?
> 
> Same question from a different angle: let's say that we boot Zephyr or
> another RTOS as HVM Dom0, what is really required for the emulator to
> emulate? I am hoping that the answer is "nothing" except for maybe a
> UART.
> 
> It comes down to how much legacy stuff the guest OS expects to find.
> Legacy stuff that would normally be emulated by QEMU. I am counting on
> the fact that a modern OS doesn't expect any of the legacy stuff (e.g.
> PIIX3/Q35/E1000) if it is not advertised in the firmware tables.

And that's where I expect the problems start: We don't really alter
things like the DSDT and SSDTs, and we also don't parse them. So we
won't know what firmware describes there. Hence we have to expect that
any legacy device might be present in the underlying platform, and
hence would also need offering either by passing through or by
emulation. Yet we can't sensibly emulate everything in Xen itself.

> If
> there is no need for QEMU, I don't know if I would call it PVH or HVM
> but either way we are good. 
> 
> Same for xenstore: there should be no need for xenstore without
> CONFIG_XEN.

Right, it may be possible to get away without xenstore.

Jan

[RFC,v2] vPCI: account for hidden devices

Commit Message

Comments

Patch