diff mbox series

[v5,06/14] vpci/header: implement guest BAR register handlers

Message ID 20211125110251.2877218-7-andr2000@gmail.com (mailing list archive)
State New, archived
Headers show
Series PCI devices passthrough on Arm, part 3 | expand

Commit Message

Oleksandr Andrushchenko Nov. 25, 2021, 11:02 a.m. UTC
From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>

Add relevant vpci register handlers when assigning PCI device to a domain
and remove those when de-assigning. This allows having different
handlers for different domains, e.g. hwdom and other guests.

Emulate guest BAR register values: this allows creating a guest view
of the registers and emulates size and properties probe as it is done
during PCI device enumeration by the guest.

ROM BAR is only handled for the hardware domain and for guest domains
there is a stub: at the moment PCI expansion ROM handling is supported
for x86 only and it might not be used by other architectures without
emulating x86. Other use-cases may include using that expansion ROM before
Xen boots, hence no emulation is needed in Xen itself. Or when a guest
wants to use the ROM code which seems to be rare.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
---
Since v4:
- updated commit message
- s/guest_addr/guest_reg
Since v3:
- squashed two patches: dynamic add/remove handlers and guest BAR
  handler implementation
- fix guest BAR read of the high part of a 64bit BAR (Roger)
- add error handling to vpci_assign_device
- s/dom%pd/%pd
- blank line before return
Since v2:
- remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
  has been eliminated from being built on x86
Since v1:
 - constify struct pci_dev where possible
 - do not open code is_system_domain()
 - simplify some code3. simplify
 - use gdprintk + error code instead of gprintk
 - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
   so these do not get compiled for x86
 - removed unneeded is_system_domain check
 - re-work guest read/write to be much simpler and do more work on write
   than read which is expected to be called more frequently
 - removed one too obvious comment
---
 xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
 xen/include/xen/vpci.h    |  3 ++
 2 files changed, 69 insertions(+), 6 deletions(-)

Comments

Bertrand Marquis Nov. 25, 2021, 4:28 p.m. UTC | #1
Hi Oleksandr,

> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
> 
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> ROM BAR is only handled for the hardware domain and for guest domains
> there is a stub: at the moment PCI expansion ROM handling is supported
> for x86 only and it might not be used by other architectures without
> emulating x86. Other use-cases may include using that expansion ROM before
> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
> wants to use the ROM code which seems to be rare.

In the generic code, bars for ioports are actually skipped (check code before
in header.c, in case of ioports there is a continue) and no handler is registered for them.
The consequence will be that a guest will access hardware when reading those BARs.

I think we should instead make sure that we intercept all accesses to BARs and return
something empty for IOPORTS BARs.

Regards
Bertrand

> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>  handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>  has been eliminated from being built on x86
> Since v1:
> - constify struct pci_dev where possible
> - do not open code is_system_domain()
> - simplify some code3. simplify
> - use gdprintk + error code instead of gprintk
> - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>   so these do not get compiled for x86
> - removed unneeded is_system_domain check
> - re-work guest read/write to be much simpler and do more work on write
>   than read which is expected to be called more frequently
> - removed one too obvious comment
> ---
> xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
> xen/include/xen/vpci.h    |  3 ++
> 2 files changed, 69 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ba333fb2f9b0..8880d34ebf8e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -433,6 +433,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>     pci_conf_write32(pdev->sbdf, reg, val);
> }
> 
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
> static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                       uint32_t val, void *data)
> {
> @@ -481,6 +523,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>         rom->addr = val & PCI_ROM_ADDRESS_MASK;
> }
> 
> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +}
> +
> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return 0xffffffff;
> +}
> +
> static int init_bars(struct pci_dev *pdev)
> {
>     uint16_t cmd;
> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
>     struct vpci_header *header = &pdev->vpci->header;
>     struct vpci_bar *bars = header->bars;
>     int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
> 
>     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>     {
> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
>         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>         {
>             bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>             if ( rc )
>             {
>                 pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
>         bars[i].size = size;
>         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> 
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);
>         if ( rc )
>         {
>             pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
>         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>                               PCI_ROM_ADDRESS_ENABLE;
> 
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
> +                               is_hwdom ? rom_write : guest_rom_write,
> +                               rom_reg, 4, rom);
>         if ( rc )
>             rom->type = VPCI_BAR_EMPTY;
>     }
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index ed127a08a953..0a73b14a92dc 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -68,7 +68,10 @@ struct vpci {
>     struct vpci_header {
>         /* Information about the PCI BARs of this device. */
>         struct vpci_bar {
> +            /* Physical view of the BAR. */
>             uint64_t addr;
> +            /* Guest view of the BAR: address and lower bits. */
> +            uint64_t guest_reg;
>             uint64_t size;
>             enum {
>                 VPCI_BAR_EMPTY,
> -- 
> 2.25.1
>
Oleksandr Andrushchenko Nov. 26, 2021, 12:19 p.m. UTC | #2
Hi, Bertrand!

On 25.11.21 18:28, Bertrand Marquis wrote:
> Hi Oleksandr,
>
>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add relevant vpci register handlers when assigning PCI device to a domain
>> and remove those when de-assigning. This allows having different
>> handlers for different domains, e.g. hwdom and other guests.
>>
>> Emulate guest BAR register values: this allows creating a guest view
>> of the registers and emulates size and properties probe as it is done
>> during PCI device enumeration by the guest.
>>
>> ROM BAR is only handled for the hardware domain and for guest domains
>> there is a stub: at the moment PCI expansion ROM handling is supported
>> for x86 only and it might not be used by other architectures without
>> emulating x86. Other use-cases may include using that expansion ROM before
>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>> wants to use the ROM code which seems to be rare.
> In the generic code, bars for ioports are actually skipped (check code before
> in header.c, in case of ioports there is a continue) and no handler is registered for them.
> The consequence will be that a guest will access hardware when reading those BARs.
Yes, this seems to be a valid point
>
> I think we should instead make sure that we intercept all accesses to BARs and return
> something empty for IOPORTS BARs.
I would like to hear from Roger on what was the initial plan for that, so
we are aligned between the different architectures, Arm and x86 here for now

Thank you,
Oleksandr
Roger Pau Monne Jan. 12, 2022, 12:35 p.m. UTC | #3
On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> ROM BAR is only handled for the hardware domain and for guest domains
> there is a stub: at the moment PCI expansion ROM handling is supported
> for x86 only and it might not be used by other architectures without
> emulating x86. Other use-cases may include using that expansion ROM before
> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
> wants to use the ROM code which seems to be rare.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code3. simplify
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
>  xen/include/xen/vpci.h    |  3 ++
>  2 files changed, 69 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ba333fb2f9b0..8880d34ebf8e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -433,6 +433,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
>  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
> @@ -481,6 +523,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +}
> +
> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return 0xffffffff;
> +}

There should be no need for those handlers. As said elsewhere: for
guests registers not explicitly handled should return ~0 for reads and
drop writes, which is what you are proposing here.

> +
>  static int init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
>              {
>                  pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);
>          if ( rc )
>          {
>              pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
>          header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>                                PCI_ROM_ADDRESS_ENABLE;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
> +                               is_hwdom ? rom_write : guest_rom_write,
> +                               rom_reg, 4, rom);

This whole call should be made conditional to is_hwdom, as said above
there's no need for the guest_rom handlers.

Likewise I assume you expect IO BARs to simply return ~0 and drop
writes, as there's no explicit handler added for those?

>          if ( rc )
>              rom->type = VPCI_BAR_EMPTY;
>      }
> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> index ed127a08a953..0a73b14a92dc 100644
> --- a/xen/include/xen/vpci.h
> +++ b/xen/include/xen/vpci.h
> @@ -68,7 +68,10 @@ struct vpci {
>      struct vpci_header {
>          /* Information about the PCI BARs of this device. */
>          struct vpci_bar {
> +            /* Physical view of the BAR. */

No, that's not the physical view, it's the physical (host) address.

>              uint64_t addr;
> +            /* Guest view of the BAR: address and lower bits. */
> +            uint64_t guest_reg;

I continue to think it would be clearer if you store the guest address
here (gaddr, without the low bits) and add those in guest_bar_read
based on bar->{type,prefetchable}. Then it would be equivalent to the
existing 'addr' field.

I wonder whether we need to protect the added code with
CONFIG_HAS_VPCI_GUEST_SUPPORT, this would effectively be dead code
otherwise. Long term I don't think we wish to differentiate between
dom0 and domU vPCI support at build time, so I'm unsure whether it's
helpful to pollute the code with CONFIG_HAS_VPCI_GUEST_SUPPORT when
the plan is to remove those long term.

Thanks, Roger.
Roger Pau Monne Jan. 12, 2022, 5:34 p.m. UTC | #4
A couple more comments I realized while walking the dog.

On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> 
> Add relevant vpci register handlers when assigning PCI device to a domain
> and remove those when de-assigning. This allows having different
> handlers for different domains, e.g. hwdom and other guests.
> 
> Emulate guest BAR register values: this allows creating a guest view
> of the registers and emulates size and properties probe as it is done
> during PCI device enumeration by the guest.
> 
> ROM BAR is only handled for the hardware domain and for guest domains
> there is a stub: at the moment PCI expansion ROM handling is supported
> for x86 only and it might not be used by other architectures without
> emulating x86. Other use-cases may include using that expansion ROM before
> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
> wants to use the ROM code which seems to be rare.
> 
> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> ---
> Since v4:
> - updated commit message
> - s/guest_addr/guest_reg
> Since v3:
> - squashed two patches: dynamic add/remove handlers and guest BAR
>   handler implementation
> - fix guest BAR read of the high part of a 64bit BAR (Roger)
> - add error handling to vpci_assign_device
> - s/dom%pd/%pd
> - blank line before return
> Since v2:
> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>   has been eliminated from being built on x86
> Since v1:
>  - constify struct pci_dev where possible
>  - do not open code is_system_domain()
>  - simplify some code3. simplify
>  - use gdprintk + error code instead of gprintk
>  - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>    so these do not get compiled for x86
>  - removed unneeded is_system_domain check
>  - re-work guest read/write to be much simpler and do more work on write
>    than read which is expected to be called more frequently
>  - removed one too obvious comment
> ---
>  xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
>  xen/include/xen/vpci.h    |  3 ++
>  2 files changed, 69 insertions(+), 6 deletions(-)
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index ba333fb2f9b0..8880d34ebf8e 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -433,6 +433,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>      pci_conf_write32(pdev->sbdf, reg, val);
>  }
>  
> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +    struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +    else
> +    {
> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
> +    }
> +
> +    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
> +    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
> +
> +    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;

You need to assert that the guest set address has the same page offset
as the physical address on the host, or otherwise things won't work as
expected. Ie: guest_addr & ~PAGE_MASK == addr & ~PAGE_MASK.

> +}
> +
> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    const struct vpci_bar *bar = data;
> +    bool hi = false;
> +
> +    if ( bar->type == VPCI_BAR_MEM64_HI )
> +    {
> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
> +        bar--;
> +        hi = true;
> +    }
> +
> +    return bar->guest_reg >> (hi ? 32 : 0);
> +}
> +
>  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>                        uint32_t val, void *data)
>  {
> @@ -481,6 +523,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>          rom->addr = val & PCI_ROM_ADDRESS_MASK;
>  }
>  
> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> +                            uint32_t val, void *data)
> +{
> +}
> +
> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> +                               void *data)
> +{
> +    return 0xffffffff;
> +}
> +
>  static int init_bars(struct pci_dev *pdev)
>  {
>      uint16_t cmd;
> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
>      struct vpci_header *header = &pdev->vpci->header;
>      struct vpci_bar *bars = header->bars;
>      int rc;
> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>  
>      switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>      {
> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
>          if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>          {
>              bars[i].type = VPCI_BAR_MEM64_HI;
> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> -                                   4, &bars[i]);
> +            rc = vpci_add_register(pdev->vpci,
> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                                   is_hwdom ? bar_write : guest_bar_write,
> +                                   reg, 4, &bars[i]);
>              if ( rc )
>              {
>                  pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
>          bars[i].size = size;
>          bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> -                               &bars[i]);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> +                               is_hwdom ? bar_write : guest_bar_write,
> +                               reg, 4, &bars[i]);

You need to initialize guest_reg to the physical host value also.

>          if ( rc )
>          {
>              pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
>          header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>                                PCI_ROM_ADDRESS_ENABLE;
>  
> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> -                               4, rom);
> +        rc = vpci_add_register(pdev->vpci,
> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
> +                               is_hwdom ? rom_write : guest_rom_write,
> +                               rom_reg, 4, rom);
>          if ( rc )
>              rom->type = VPCI_BAR_EMPTY;

Also memory decoding needs to be initially disabled when used by
guests, in order to prevent the BAR being placed on top of a RAM
region. The guest physmap will be different from the host one, so it's
possible for BARs to end up placed on top of RAM regions initially
until the firmware or OS places them at a suitable address.

Thanks, Roger.
Oleksandr Andrushchenko Jan. 31, 2022, 9:47 a.m. UTC | #5
Hi, Roger!

On 12.01.22 14:35, Roger Pau Monné wrote:
> On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add relevant vpci register handlers when assigning PCI device to a domain
>> and remove those when de-assigning. This allows having different
>> handlers for different domains, e.g. hwdom and other guests.
>>
>> Emulate guest BAR register values: this allows creating a guest view
>> of the registers and emulates size and properties probe as it is done
>> during PCI device enumeration by the guest.
>>
>> ROM BAR is only handled for the hardware domain and for guest domains
>> there is a stub: at the moment PCI expansion ROM handling is supported
>> for x86 only and it might not be used by other architectures without
>> emulating x86. Other use-cases may include using that expansion ROM before
>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>> wants to use the ROM code which seems to be rare.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v4:
>> - updated commit message
>> - s/guest_addr/guest_reg
>> Since v3:
>> - squashed two patches: dynamic add/remove handlers and guest BAR
>>    handler implementation
>> - fix guest BAR read of the high part of a 64bit BAR (Roger)
>> - add error handling to vpci_assign_device
>> - s/dom%pd/%pd
>> - blank line before return
>> Since v2:
>> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>>    has been eliminated from being built on x86
>> Since v1:
>>   - constify struct pci_dev where possible
>>   - do not open code is_system_domain()
>>   - simplify some code3. simplify
>>   - use gdprintk + error code instead of gprintk
>>   - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>>     so these do not get compiled for x86
>>   - removed unneeded is_system_domain check
>>   - re-work guest read/write to be much simpler and do more work on write
>>     than read which is expected to be called more frequently
>>   - removed one too obvious comment
>> ---
>>   xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
>>   xen/include/xen/vpci.h    |  3 ++
>>   2 files changed, 69 insertions(+), 6 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ba333fb2f9b0..8880d34ebf8e 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -433,6 +433,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +    else
>> +    {
>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +    }
>> +
>> +    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
>> +
>> +    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
>> +}
>> +
>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    const struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +
>> +    return bar->guest_reg >> (hi ? 32 : 0);
>> +}
>> +
>>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>> @@ -481,6 +523,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>   }
>>   
>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +}
>> +
>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    return 0xffffffff;
>> +}
> There should be no need for those handlers. As said elsewhere: for
> guests registers not explicitly handled should return ~0 for reads and
> drop writes, which is what you are proposing here.
Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
handler exists (which is what I do here with guest_rom_read). But I am not that
sure about the dropped writes:

void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
                 uint32_t data)
{
     unsigned int data_offset = 0;

[snip]

     if ( data_offset < size )
         /* Tailing gap, write the remaining. */
         vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
                       data >> (data_offset * 8));

so it looks like for the un-handled writes we still reach the HW register.
Could you please tell if the code above needs improvement (like checking
if the write was handled) or I still need to provide a write handler, e.g.
guest_rom_write here?
>> +
>>   static int init_bars(struct pci_dev *pdev)
>>   {
>>       uint16_t cmd;
>> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
>>       struct vpci_header *header = &pdev->vpci->header;
>>       struct vpci_bar *bars = header->bars;
>>       int rc;
>> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>>   
>>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>       {
>> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>>           {
>>               bars[i].type = VPCI_BAR_MEM64_HI;
>> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> -                                   4, &bars[i]);
>> +            rc = vpci_add_register(pdev->vpci,
>> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
>> +                                   is_hwdom ? bar_write : guest_bar_write,
>> +                                   reg, 4, &bars[i]);
>>               if ( rc )
>>               {
>>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
>>           bars[i].size = size;
>>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
>> -                               &bars[i]);
>> +        rc = vpci_add_register(pdev->vpci,
>> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
>> +                               is_hwdom ? bar_write : guest_bar_write,
>> +                               reg, 4, &bars[i]);
>>           if ( rc )
>>           {
>>               pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
>>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>>                                 PCI_ROM_ADDRESS_ENABLE;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
>> -                               4, rom);
>> +        rc = vpci_add_register(pdev->vpci,
>> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
>> +                               is_hwdom ? rom_write : guest_rom_write,
>> +                               rom_reg, 4, rom);
> This whole call should be made conditional to is_hwdom, as said above
> there's no need for the guest_rom handlers.
Yes, if writes are indeed dropped, please see question above
>
> Likewise I assume you expect IO BARs to simply return ~0 and drop
> writes, as there's no explicit handler added for those?
Yes, but that was not my intention: I simply didn't handle IO BARs
and now we do need that handling: either with the default behavior
for the unhandled read/write (drop writes, read ~0) or by introducing
the handlers. I hope we can rely on the "unhandled read/write" and
get what we want
>
>>           if ( rc )
>>               rom->type = VPCI_BAR_EMPTY;
>>       }
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index ed127a08a953..0a73b14a92dc 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -68,7 +68,10 @@ struct vpci {
>>       struct vpci_header {
>>           /* Information about the PCI BARs of this device. */
>>           struct vpci_bar {
>> +            /* Physical view of the BAR. */
> No, that's not the physical view, it's the physical (host) address.
Ok
>
>>               uint64_t addr;
>> +            /* Guest view of the BAR: address and lower bits. */
>> +            uint64_t guest_reg;
> I continue to think it would be clearer if you store the guest address
> here (gaddr, without the low bits) and add those in guest_bar_read
> based on bar->{type,prefetchable}. Then it would be equivalent to the
> existing 'addr' field.
Ok, I'll re-work the code with this approach in mind: s/guest_reg/gaddr +
required code to handle that
>
> I wonder whether we need to protect the added code with
> CONFIG_HAS_VPCI_GUEST_SUPPORT, this would effectively be dead code
> otherwise. Long term I don't think we wish to differentiate between
> dom0 and domU vPCI support at build time, so I'm unsure whether it's
> helpful to pollute the code with CONFIG_HAS_VPCI_GUEST_SUPPORT when
> the plan is to remove those long term.
I would have it without CONFIG_HAS_VPCI_GUEST_SUPPORT if you
don't mind
>
> Thanks, Roger.
Thank you,
Oleksandr
Oleksandr Andrushchenko Jan. 31, 2022, 9:53 a.m. UTC | #6
Hi, Roger!

On 12.01.22 19:34, Roger Pau Monné wrote:
> A couple more comments I realized while walking the dog.
>
> On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>
>> Add relevant vpci register handlers when assigning PCI device to a domain
>> and remove those when de-assigning. This allows having different
>> handlers for different domains, e.g. hwdom and other guests.
>>
>> Emulate guest BAR register values: this allows creating a guest view
>> of the registers and emulates size and properties probe as it is done
>> during PCI device enumeration by the guest.
>>
>> ROM BAR is only handled for the hardware domain and for guest domains
>> there is a stub: at the moment PCI expansion ROM handling is supported
>> for x86 only and it might not be used by other architectures without
>> emulating x86. Other use-cases may include using that expansion ROM before
>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>> wants to use the ROM code which seems to be rare.
>>
>> Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> ---
>> Since v4:
>> - updated commit message
>> - s/guest_addr/guest_reg
>> Since v3:
>> - squashed two patches: dynamic add/remove handlers and guest BAR
>>    handler implementation
>> - fix guest BAR read of the high part of a 64bit BAR (Roger)
>> - add error handling to vpci_assign_device
>> - s/dom%pd/%pd
>> - blank line before return
>> Since v2:
>> - remove unneeded ifdefs for CONFIG_HAS_VPCI_GUEST_SUPPORT as more code
>>    has been eliminated from being built on x86
>> Since v1:
>>   - constify struct pci_dev where possible
>>   - do not open code is_system_domain()
>>   - simplify some code3. simplify
>>   - use gdprintk + error code instead of gprintk
>>   - gate vpci_bar_{add|remove}_handlers with CONFIG_HAS_VPCI_GUEST_SUPPORT,
>>     so these do not get compiled for x86
>>   - removed unneeded is_system_domain check
>>   - re-work guest read/write to be much simpler and do more work on write
>>     than read which is expected to be called more frequently
>>   - removed one too obvious comment
>> ---
>>   xen/drivers/vpci/header.c | 72 +++++++++++++++++++++++++++++++++++----
>>   xen/include/xen/vpci.h    |  3 ++
>>   2 files changed, 69 insertions(+), 6 deletions(-)
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index ba333fb2f9b0..8880d34ebf8e 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -433,6 +433,48 @@ static void bar_write(const struct pci_dev *pdev, unsigned int reg,
>>       pci_conf_write32(pdev->sbdf, reg, val);
>>   }
>>   
>> +static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +    struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +    else
>> +    {
>> +        val &= PCI_BASE_ADDRESS_MEM_MASK;
>> +        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
>> +                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
>> +        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
>> +    }
>> +
>> +    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
>> +    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
>> +
>> +    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
> You need to assert that the guest set address has the same page offset
> as the physical address on the host, or otherwise things won't work as
> expected. Ie: guest_addr & ~PAGE_MASK == addr & ~PAGE_MASK.
Good catch, thank you
>
>> +}
>> +
>> +static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    const struct vpci_bar *bar = data;
>> +    bool hi = false;
>> +
>> +    if ( bar->type == VPCI_BAR_MEM64_HI )
>> +    {
>> +        ASSERT(reg > PCI_BASE_ADDRESS_0);
>> +        bar--;
>> +        hi = true;
>> +    }
>> +
>> +    return bar->guest_reg >> (hi ? 32 : 0);
>> +}
>> +
>>   static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>                         uint32_t val, void *data)
>>   {
>> @@ -481,6 +523,17 @@ static void rom_write(const struct pci_dev *pdev, unsigned int reg,
>>           rom->addr = val & PCI_ROM_ADDRESS_MASK;
>>   }
>>   
>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>> +                            uint32_t val, void *data)
>> +{
>> +}
>> +
>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>> +                               void *data)
>> +{
>> +    return 0xffffffff;
>> +}
>> +
>>   static int init_bars(struct pci_dev *pdev)
>>   {
>>       uint16_t cmd;
>> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
>>       struct vpci_header *header = &pdev->vpci->header;
>>       struct vpci_bar *bars = header->bars;
>>       int rc;
>> +    bool is_hwdom = is_hardware_domain(pdev->domain);
>>   
>>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
>>       {
>> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
>>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
>>           {
>>               bars[i].type = VPCI_BAR_MEM64_HI;
>> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
>> -                                   4, &bars[i]);
>> +            rc = vpci_add_register(pdev->vpci,
>> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
>> +                                   is_hwdom ? bar_write : guest_bar_write,
>> +                                   reg, 4, &bars[i]);
>>               if ( rc )
>>               {
>>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
>>           bars[i].size = size;
>>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
>> -                               &bars[i]);
>> +        rc = vpci_add_register(pdev->vpci,
>> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
>> +                               is_hwdom ? bar_write : guest_bar_write,
>> +                               reg, 4, &bars[i]);
> You need to initialize guest_reg to the physical host value also.
But why? There was a concern that exposing host's value to a guest
may be a security issue. And wouldn't it be possible for a guest to decide
that the firmware has setup the BAR and it doesn't need to care of it and
hence use a wrong value instead of setting it up by itself? I had an issue
with that if I'm not mistaken that guest's Linux didn't set the BAR properly
and used what was programmed
>
>>           if ( rc )
>>           {
>>               pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
>> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
>>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
>>                                 PCI_ROM_ADDRESS_ENABLE;
>>   
>> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
>> -                               4, rom);
>> +        rc = vpci_add_register(pdev->vpci,
>> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
>> +                               is_hwdom ? rom_write : guest_rom_write,
>> +                               rom_reg, 4, rom);
>>           if ( rc )
>>               rom->type = VPCI_BAR_EMPTY;
> Also memory decoding needs to be initially disabled when used by
> guests, in order to prevent the BAR being placed on top of a RAM
> region. The guest physmap will be different from the host one, so it's
> possible for BARs to end up placed on top of RAM regions initially
> until the firmware or OS places them at a suitable address.
Agree, memory decoding must be disabled
>
> Thanks, Roger.
Thank you,
Oleksandr
Oleksandr Andrushchenko Jan. 31, 2022, 10:40 a.m. UTC | #7
On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
> Hi, Roger!
>
> On 12.01.22 14:35, Roger Pau Monné wrote:
>>
>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>>> +                            uint32_t val, void *data)
>>> +{
>>> +}
>>> +
>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>>> +                               void *data)
>>> +{
>>> +    return 0xffffffff;
>>> +}
>> There should be no need for those handlers. As said elsewhere: for
>> guests registers not explicitly handled should return ~0 for reads and
>> drop writes, which is what you are proposing here.
> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
> handler exists (which is what I do here with guest_rom_read). But I am not that
> sure about the dropped writes:
>
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                   uint32_t data)
> {
>       unsigned int data_offset = 0;
>
> [snip]
>
>       if ( data_offset < size )
>           /* Tailing gap, write the remaining. */
>           vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>                         data >> (data_offset * 8));
>
> so it looks like for the un-handled writes we still reach the HW register.
> Could you please tell if the code above needs improvement (like checking
> if the write was handled) or I still need to provide a write handler, e.g.
> guest_rom_write here?
Hm, but the same applies to the reads as well... And this is no surprise,
as for the guests I can see that it accesses all the configuration space
registers that I don't handle. Without that I would have guests unable
to properly setup a PCI device being passed through... And this is why
I have a big TODO in this series describing unhandled registers.
So, it seems that I do need to provide those handlers which I need to
drop writes and return ~0 on reads.
Jan Beulich Jan. 31, 2022, 10:54 a.m. UTC | #8
On 31.01.2022 11:40, Oleksandr Andrushchenko wrote:
> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
>> Hi, Roger!
>>
>> On 12.01.22 14:35, Roger Pau Monné wrote:
>>>
>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>> +                            uint32_t val, void *data)
>>>> +{
>>>> +}
>>>> +
>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>>>> +                               void *data)
>>>> +{
>>>> +    return 0xffffffff;
>>>> +}
>>> There should be no need for those handlers. As said elsewhere: for
>>> guests registers not explicitly handled should return ~0 for reads and
>>> drop writes, which is what you are proposing here.
>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
>> handler exists (which is what I do here with guest_rom_read). But I am not that
>> sure about the dropped writes:
>>
>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>                   uint32_t data)
>> {
>>       unsigned int data_offset = 0;
>>
>> [snip]
>>
>>       if ( data_offset < size )
>>           /* Tailing gap, write the remaining. */
>>           vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>>                         data >> (data_offset * 8));
>>
>> so it looks like for the un-handled writes we still reach the HW register.
>> Could you please tell if the code above needs improvement (like checking
>> if the write was handled) or I still need to provide a write handler, e.g.
>> guest_rom_write here?
> Hm, but the same applies to the reads as well... And this is no surprise,
> as for the guests I can see that it accesses all the configuration space
> registers that I don't handle. Without that I would have guests unable
> to properly setup a PCI device being passed through... And this is why
> I have a big TODO in this series describing unhandled registers.
> So, it seems that I do need to provide those handlers which I need to
> drop writes and return ~0 on reads.

It feels like we had been there before: For your initial purposes it may
be fine to do as you suggest, but any such patches should carry RFC tags
or alike to indicate they're not considered ready. Once you're aiming
for things to go in, I think there's no good way around white-listing
what guests may access. You may know that we've been bitten by starting
out with black-listing in the past, first and foremost with x86'es MSRs.

Jan
Roger Pau Monne Jan. 31, 2022, 10:56 a.m. UTC | #9
On Mon, Jan 31, 2022 at 09:53:29AM +0000, Oleksandr Andrushchenko wrote:
> Hi, Roger!
> 
> On 12.01.22 19:34, Roger Pau Monné wrote:
> > A couple more comments I realized while walking the dog.
> >
> > On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
> >>           bars[i].size = size;
> >>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> >>   
> >> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> >> -                               &bars[i]);
> >> +        rc = vpci_add_register(pdev->vpci,
> >> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> >> +                               is_hwdom ? bar_write : guest_bar_write,
> >> +                               reg, 4, &bars[i]);
> > You need to initialize guest_reg to the physical host value also.
> But why? There was a concern that exposing host's value to a guest
> may be a security issue. And wouldn't it be possible for a guest to decide
> that the firmware has setup the BAR and it doesn't need to care of it and
> hence use a wrong value instead of setting it up by itself? I had an issue
> with that if I'm not mistaken that guest's Linux didn't set the BAR properly
> and used what was programmed

I think I've made that comment before realizing that all BARs must
start with memory decoding disabled for guests, so that the guest
firmware can position them. Using the host value as a starting point
doesn't make sense because there's no relation between the guest and
the host memory maps. You can drop this comment.

Thanks, Roger.
Oleksandr Andrushchenko Jan. 31, 2022, 11:04 a.m. UTC | #10
Hi, Jan!

On 31.01.22 12:54, Jan Beulich wrote:
> On 31.01.2022 11:40, Oleksandr Andrushchenko wrote:
>> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
>>> Hi, Roger!
>>>
>>> On 12.01.22 14:35, Roger Pau Monné wrote:
>>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                            uint32_t val, void *data)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               void *data)
>>>>> +{
>>>>> +    return 0xffffffff;
>>>>> +}
>>>> There should be no need for those handlers. As said elsewhere: for
>>>> guests registers not explicitly handled should return ~0 for reads and
>>>> drop writes, which is what you are proposing here.
>>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
>>> handler exists (which is what I do here with guest_rom_read). But I am not that
>>> sure about the dropped writes:
>>>
>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>                    uint32_t data)
>>> {
>>>        unsigned int data_offset = 0;
>>>
>>> [snip]
>>>
>>>        if ( data_offset < size )
>>>            /* Tailing gap, write the remaining. */
>>>            vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>>>                          data >> (data_offset * 8));
>>>
>>> so it looks like for the un-handled writes we still reach the HW register.
>>> Could you please tell if the code above needs improvement (like checking
>>> if the write was handled) or I still need to provide a write handler, e.g.
>>> guest_rom_write here?
>> Hm, but the same applies to the reads as well... And this is no surprise,
>> as for the guests I can see that it accesses all the configuration space
>> registers that I don't handle. Without that I would have guests unable
>> to properly setup a PCI device being passed through... And this is why
>> I have a big TODO in this series describing unhandled registers.
>> So, it seems that I do need to provide those handlers which I need to
>> drop writes and return ~0 on reads.
Replying to myself: it is still possible to have vpci_ignored_{read|write}
to handle defaults if, when vpci_add_register is called, the handler
provided is NULL
> It feels like we had been there before: For your initial purposes it may
> be fine to do as you suggest, but any such patches should carry RFC tags
> or alike to indicate they're not considered ready. Once you're aiming
> for things to go in, I think there's no good way around white-listing
> what guests may access. You may know that we've been bitten by starting
> out with black-listing in the past, first and foremost with x86'es MSRs.
I already have a big TODO patch describing the issue. Do you want
it to have a list of handlers that we support as of now? What sort of
while/black list would you expect?
I do understand that we do need proper handling for all the PCI registers
and capabilities long term, but this can't be done at the moment when
we have nothing working at all. Requesting proper handling now will
turn this series into a huge amount of code and undefined time frame.
>
> Jan
>
Thank you,
Oleksandr
Roger Pau Monne Jan. 31, 2022, 11:04 a.m. UTC | #11
On Mon, Jan 31, 2022 at 09:47:07AM +0000, Oleksandr Andrushchenko wrote:
> Hi, Roger!
> 
> On 12.01.22 14:35, Roger Pau Monné wrote:
> > On Thu, Nov 25, 2021 at 01:02:43PM +0200, Oleksandr Andrushchenko wrote:
> >> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> >> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> >> +                            uint32_t val, void *data)
> >> +{
> >> +}
> >> +
> >> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> >> +                               void *data)
> >> +{
> >> +    return 0xffffffff;
> >> +}
> > There should be no need for those handlers. As said elsewhere: for
> > guests registers not explicitly handled should return ~0 for reads and
> > drop writes, which is what you are proposing here.
> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
> handler exists (which is what I do here with guest_rom_read). But I am not that
> sure about the dropped writes:
> 
> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>                  uint32_t data)
> {
>      unsigned int data_offset = 0;
> 
> [snip]
> 
>      if ( data_offset < size )
>          /* Tailing gap, write the remaining. */
>          vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>                        data >> (data_offset * 8));
> 
> so it looks like for the un-handled writes we still reach the HW register.
> Could you please tell if the code above needs improvement (like checking
> if the write was handled) or I still need to provide a write handler, e.g.
> guest_rom_write here?

Right now (given the current code) unhandled reads and writes will all
end up being forwarded to the hardware. This is intended for dom0, but
this is not how it's going to work for domUs, where accesses will be
discarded based on an accept list. IOW the handlers that you are
adding here should be the default behavior for registers not
explicitly handled in the domU case, and shouldn't require explicit
handling.

> >> +
> >>   static int init_bars(struct pci_dev *pdev)
> >>   {
> >>       uint16_t cmd;
> >> @@ -489,6 +542,7 @@ static int init_bars(struct pci_dev *pdev)
> >>       struct vpci_header *header = &pdev->vpci->header;
> >>       struct vpci_bar *bars = header->bars;
> >>       int rc;
> >> +    bool is_hwdom = is_hardware_domain(pdev->domain);
> >>   
> >>       switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
> >>       {
> >> @@ -528,8 +582,10 @@ static int init_bars(struct pci_dev *pdev)
> >>           if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
> >>           {
> >>               bars[i].type = VPCI_BAR_MEM64_HI;
> >> -            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
> >> -                                   4, &bars[i]);
> >> +            rc = vpci_add_register(pdev->vpci,
> >> +                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
> >> +                                   is_hwdom ? bar_write : guest_bar_write,
> >> +                                   reg, 4, &bars[i]);
> >>               if ( rc )
> >>               {
> >>                   pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> >> @@ -569,8 +625,10 @@ static int init_bars(struct pci_dev *pdev)
> >>           bars[i].size = size;
> >>           bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
> >>   
> >> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
> >> -                               &bars[i]);
> >> +        rc = vpci_add_register(pdev->vpci,
> >> +                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
> >> +                               is_hwdom ? bar_write : guest_bar_write,
> >> +                               reg, 4, &bars[i]);
> >>           if ( rc )
> >>           {
> >>               pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
> >> @@ -590,8 +648,10 @@ static int init_bars(struct pci_dev *pdev)
> >>           header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
> >>                                 PCI_ROM_ADDRESS_ENABLE;
> >>   
> >> -        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
> >> -                               4, rom);
> >> +        rc = vpci_add_register(pdev->vpci,
> >> +                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
> >> +                               is_hwdom ? rom_write : guest_rom_write,
> >> +                               rom_reg, 4, rom);
> > This whole call should be made conditional to is_hwdom, as said above
> > there's no need for the guest_rom handlers.
> Yes, if writes are indeed dropped, please see question above
> >
> > Likewise I assume you expect IO BARs to simply return ~0 and drop
> > writes, as there's no explicit handler added for those?
> Yes, but that was not my intention: I simply didn't handle IO BARs
> and now we do need that handling: either with the default behavior
> for the unhandled read/write (drop writes, read ~0) or by introducing
> the handlers. I hope we can rely on the "unhandled read/write" and
> get what we want

Indeed, the default behavior should be changed for domUs to drop
writes, return ~0 for unhandled reads, then you won't need to add
dummy handlers for the registers you don't want to expose.

> >
> >>           if ( rc )
> >>               rom->type = VPCI_BAR_EMPTY;
> >>       }
> >> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> >> index ed127a08a953..0a73b14a92dc 100644
> >> --- a/xen/include/xen/vpci.h
> >> +++ b/xen/include/xen/vpci.h
> >> @@ -68,7 +68,10 @@ struct vpci {
> >>       struct vpci_header {
> >>           /* Information about the PCI BARs of this device. */
> >>           struct vpci_bar {
> >> +            /* Physical view of the BAR. */
> > No, that's not the physical view, it's the physical (host) address.
> Ok
> >
> >>               uint64_t addr;
> >> +            /* Guest view of the BAR: address and lower bits. */
> >> +            uint64_t guest_reg;
> > I continue to think it would be clearer if you store the guest address
> > here (gaddr, without the low bits) and add those in guest_bar_read
> > based on bar->{type,prefetchable}. Then it would be equivalent to the
> > existing 'addr' field.
> Ok, I'll re-work the code with this approach in mind: s/guest_reg/gaddr +
> required code to handle that
> >
> > I wonder whether we need to protect the added code with
> > CONFIG_HAS_VPCI_GUEST_SUPPORT, this would effectively be dead code
> > otherwise. Long term I don't think we wish to differentiate between
> > dom0 and domU vPCI support at build time, so I'm unsure whether it's
> > helpful to pollute the code with CONFIG_HAS_VPCI_GUEST_SUPPORT when
> > the plan is to remove those long term.
> I would have it without CONFIG_HAS_VPCI_GUEST_SUPPORT if you
> don't mind

Well, I guess if it's not too intrusive it's fine to add the defines,
removing them afterwards should be easy.

Thanks, Roger.
Roger Pau Monne Jan. 31, 2022, 11:10 a.m. UTC | #12
On Mon, Jan 31, 2022 at 10:40:47AM +0000, Oleksandr Andrushchenko wrote:
> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
> > Hi, Roger!
> >
> > On 12.01.22 14:35, Roger Pau Monné wrote:
> >>
> >>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>> +                            uint32_t val, void *data)
> >>> +{
> >>> +}
> >>> +
> >>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> >>> +                               void *data)
> >>> +{
> >>> +    return 0xffffffff;
> >>> +}
> >> There should be no need for those handlers. As said elsewhere: for
> >> guests registers not explicitly handled should return ~0 for reads and
> >> drop writes, which is what you are proposing here.
> > Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
> > handler exists (which is what I do here with guest_rom_read). But I am not that
> > sure about the dropped writes:
> >
> > void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >                   uint32_t data)
> > {
> >       unsigned int data_offset = 0;
> >
> > [snip]
> >
> >       if ( data_offset < size )
> >           /* Tailing gap, write the remaining. */
> >           vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
> >                         data >> (data_offset * 8));
> >
> > so it looks like for the un-handled writes we still reach the HW register.
> > Could you please tell if the code above needs improvement (like checking
> > if the write was handled) or I still need to provide a write handler, e.g.
> > guest_rom_write here?
> Hm, but the same applies to the reads as well... And this is no surprise,
> as for the guests I can see that it accesses all the configuration space
> registers that I don't handle. Without that I would have guests unable
> to properly setup a PCI device being passed through... And this is why
> I have a big TODO in this series describing unhandled registers.
> So, it seems that I do need to provide those handlers which I need to
> drop writes and return ~0 on reads.

Right (see my previous reply to this comment). I think it would be
easier (and cleaner) if you switched the default behavior regarding
unhandled register access for domUs at the start of the series (drop
writes, reads returns ~0), and then you won't need to add all those
dummy handler to drop writes and return ~0 for reads.

It's going to be more work initially as you would need to support
passthrough of more registers, but it's the right approach that we
need implementation wise.

Thanks, Roger.
Oleksandr Andrushchenko Jan. 31, 2022, 11:23 a.m. UTC | #13
On 31.01.22 13:10, Roger Pau Monné wrote:
> On Mon, Jan 31, 2022 at 10:40:47AM +0000, Oleksandr Andrushchenko wrote:
>> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
>>> Hi, Roger!
>>>
>>> On 12.01.22 14:35, Roger Pau Monné wrote:
>>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                            uint32_t val, void *data)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>>>>> +                               void *data)
>>>>> +{
>>>>> +    return 0xffffffff;
>>>>> +}
>>>> There should be no need for those handlers. As said elsewhere: for
>>>> guests registers not explicitly handled should return ~0 for reads and
>>>> drop writes, which is what you are proposing here.
>>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
>>> handler exists (which is what I do here with guest_rom_read). But I am not that
>>> sure about the dropped writes:
>>>
>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>                    uint32_t data)
>>> {
>>>        unsigned int data_offset = 0;
>>>
>>> [snip]
>>>
>>>        if ( data_offset < size )
>>>            /* Tailing gap, write the remaining. */
>>>            vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>>>                          data >> (data_offset * 8));
>>>
>>> so it looks like for the un-handled writes we still reach the HW register.
>>> Could you please tell if the code above needs improvement (like checking
>>> if the write was handled) or I still need to provide a write handler, e.g.
>>> guest_rom_write here?
>> Hm, but the same applies to the reads as well... And this is no surprise,
>> as for the guests I can see that it accesses all the configuration space
>> registers that I don't handle. Without that I would have guests unable
>> to properly setup a PCI device being passed through... And this is why
>> I have a big TODO in this series describing unhandled registers.
>> So, it seems that I do need to provide those handlers which I need to
>> drop writes and return ~0 on reads.
> Right (see my previous reply to this comment). I think it would be
> easier (and cleaner) if you switched the default behavior regarding
> unhandled register access for domUs at the start of the series (drop
> writes, reads returns ~0), and then you won't need to add all those
> dummy handler to drop writes and return ~0 for reads.
>
> It's going to be more work initially as you would need to support
> passthrough of more registers, but it's the right approach that we
> need implementation wise.
While I agree in general, this effectively means that I'll need to provide
handling for all PCIe registers and capabilities from the very start.
Otherwise no guest be able to properly initialize a PCI device without that.
Of course, we may want starting from stubs instead of proper emulation,
which will direct the access to real HW and later on we add proper emulation.
But, again, this is going to be a rather big piece of code where we need
to explicitly handle every possible capability.

At the moment we are not going to claim that vPCI provides all means to
pass through a PCI device safely with this respect and this is why the feature
itself won't even be a tech preview yet. For that reason I think we can still
have implemented only crucial set of handlers and still allow the rest to
be read/write directly without emulation.

Another question is what needs to be done for vendor specific capabilities?
How these are going to be emulated?
>
> Thanks, Roger.
Thank you,
Oleksandr
Roger Pau Monne Jan. 31, 2022, 11:27 a.m. UTC | #14
On Mon, Jan 31, 2022 at 11:04:29AM +0000, Oleksandr Andrushchenko wrote:
> Hi, Jan!
> 
> On 31.01.22 12:54, Jan Beulich wrote:
> > On 31.01.2022 11:40, Oleksandr Andrushchenko wrote:
> >> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
> >>> Hi, Roger!
> >>>
> >>> On 12.01.22 14:35, Roger Pau Monné wrote:
> >>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>> +                            uint32_t val, void *data)
> >>>>> +{
> >>>>> +}
> >>>>> +
> >>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> >>>>> +                               void *data)
> >>>>> +{
> >>>>> +    return 0xffffffff;
> >>>>> +}
> >>>> There should be no need for those handlers. As said elsewhere: for
> >>>> guests registers not explicitly handled should return ~0 for reads and
> >>>> drop writes, which is what you are proposing here.
> >>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
> >>> handler exists (which is what I do here with guest_rom_read). But I am not that
> >>> sure about the dropped writes:
> >>>
> >>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>                    uint32_t data)
> >>> {
> >>>        unsigned int data_offset = 0;
> >>>
> >>> [snip]
> >>>
> >>>        if ( data_offset < size )
> >>>            /* Tailing gap, write the remaining. */
> >>>            vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
> >>>                          data >> (data_offset * 8));
> >>>
> >>> so it looks like for the un-handled writes we still reach the HW register.
> >>> Could you please tell if the code above needs improvement (like checking
> >>> if the write was handled) or I still need to provide a write handler, e.g.
> >>> guest_rom_write here?
> >> Hm, but the same applies to the reads as well... And this is no surprise,
> >> as for the guests I can see that it accesses all the configuration space
> >> registers that I don't handle. Without that I would have guests unable
> >> to properly setup a PCI device being passed through... And this is why
> >> I have a big TODO in this series describing unhandled registers.
> >> So, it seems that I do need to provide those handlers which I need to
> >> drop writes and return ~0 on reads.
> Replying to myself: it is still possible to have vpci_ignored_{read|write}
> to handle defaults if, when vpci_add_register is called, the handler
> provided is NULL
> > It feels like we had been there before: For your initial purposes it may
> > be fine to do as you suggest, but any such patches should carry RFC tags
> > or alike to indicate they're not considered ready. Once you're aiming
> > for things to go in, I think there's no good way around white-listing
> > what guests may access. You may know that we've been bitten by starting
> > out with black-listing in the past, first and foremost with x86'es MSRs.
> I already have a big TODO patch describing the issue. Do you want
> it to have a list of handlers that we support as of now? What sort of
> while/black list would you expect?
> I do understand that we do need proper handling for all the PCI registers
> and capabilities long term, but this can't be done at the moment when
> we have nothing working at all. Requesting proper handling now will
> turn this series into a huge amount of code and undefined time frame.

We should at least make sure the code added now doesn't need to be
changed in the future when the default is switched. If you don't
want to switch the default handling for domUs to ignore writes and
return ~0 from reads to unhandled registers right now you should keep
the patches that add the ignore handlers to the end of the series and
mark them as 'HACK' or some such in order to notice they are just
used for testing purposes.

Thanks, Roger.
Oleksandr Andrushchenko Jan. 31, 2022, 11:30 a.m. UTC | #15
On 31.01.22 13:27, Roger Pau Monné wrote:
> On Mon, Jan 31, 2022 at 11:04:29AM +0000, Oleksandr Andrushchenko wrote:
>> Hi, Jan!
>>
>> On 31.01.22 12:54, Jan Beulich wrote:
>>> On 31.01.2022 11:40, Oleksandr Andrushchenko wrote:
>>>> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
>>>>> Hi, Roger!
>>>>>
>>>>> On 12.01.22 14:35, Roger Pau Monné wrote:
>>>>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                            uint32_t val, void *data)
>>>>>>> +{
>>>>>>> +}
>>>>>>> +
>>>>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
>>>>>>> +                               void *data)
>>>>>>> +{
>>>>>>> +    return 0xffffffff;
>>>>>>> +}
>>>>>> There should be no need for those handlers. As said elsewhere: for
>>>>>> guests registers not explicitly handled should return ~0 for reads and
>>>>>> drop writes, which is what you are proposing here.
>>>>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
>>>>> handler exists (which is what I do here with guest_rom_read). But I am not that
>>>>> sure about the dropped writes:
>>>>>
>>>>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
>>>>>                     uint32_t data)
>>>>> {
>>>>>         unsigned int data_offset = 0;
>>>>>
>>>>> [snip]
>>>>>
>>>>>         if ( data_offset < size )
>>>>>             /* Tailing gap, write the remaining. */
>>>>>             vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
>>>>>                           data >> (data_offset * 8));
>>>>>
>>>>> so it looks like for the un-handled writes we still reach the HW register.
>>>>> Could you please tell if the code above needs improvement (like checking
>>>>> if the write was handled) or I still need to provide a write handler, e.g.
>>>>> guest_rom_write here?
>>>> Hm, but the same applies to the reads as well... And this is no surprise,
>>>> as for the guests I can see that it accesses all the configuration space
>>>> registers that I don't handle. Without that I would have guests unable
>>>> to properly setup a PCI device being passed through... And this is why
>>>> I have a big TODO in this series describing unhandled registers.
>>>> So, it seems that I do need to provide those handlers which I need to
>>>> drop writes and return ~0 on reads.
>> Replying to myself: it is still possible to have vpci_ignored_{read|write}
>> to handle defaults if, when vpci_add_register is called, the handler
>> provided is NULL
>>> It feels like we had been there before: For your initial purposes it may
>>> be fine to do as you suggest, but any such patches should carry RFC tags
>>> or alike to indicate they're not considered ready. Once you're aiming
>>> for things to go in, I think there's no good way around white-listing
>>> what guests may access. You may know that we've been bitten by starting
>>> out with black-listing in the past, first and foremost with x86'es MSRs.
>> I already have a big TODO patch describing the issue. Do you want
>> it to have a list of handlers that we support as of now? What sort of
>> while/black list would you expect?
>> I do understand that we do need proper handling for all the PCI registers
>> and capabilities long term, but this can't be done at the moment when
>> we have nothing working at all. Requesting proper handling now will
>> turn this series into a huge amount of code and undefined time frame.
> We should at least make sure the code added now doesn't need to be
> changed in the future when the default is switched. If you don't
> want to switch the default handling for domUs to ignore writes and
> return ~0 from reads to unhandled registers right now you should keep
> the patches that add the ignore handlers to the end of the series and
> mark them as 'HACK' or some such in order to notice they are just
> used for testing purposes.
Or for all the registers that I do want the writes to be rejected and
reads return ~0 I can pass NULL while calling vpci_add_register,
so the following works:

int vpci_add_register(struct vpci *vpci, vpci_read_t *read_handler,
                       vpci_write_t *write_handler, unsigned int offset,
                       unsigned int size, void *data)
{
[snip]
     r->read = read_handler ?: vpci_ignored_read;
     r->write = write_handler ?: vpci_ignored_write;
which does what we want.
> Thanks, Roger.
Thank you,
Oleksandr
Roger Pau Monne Jan. 31, 2022, 11:31 a.m. UTC | #16
On Mon, Jan 31, 2022 at 11:23:48AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 31.01.22 13:10, Roger Pau Monné wrote:
> > On Mon, Jan 31, 2022 at 10:40:47AM +0000, Oleksandr Andrushchenko wrote:
> >> On 31.01.22 11:47, Oleksandr Andrushchenko wrote:
> >>> Hi, Roger!
> >>>
> >>> On 12.01.22 14:35, Roger Pau Monné wrote:
> >>>>> +static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
> >>>>> +                            uint32_t val, void *data)
> >>>>> +{
> >>>>> +}
> >>>>> +
> >>>>> +static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
> >>>>> +                               void *data)
> >>>>> +{
> >>>>> +    return 0xffffffff;
> >>>>> +}
> >>>> There should be no need for those handlers. As said elsewhere: for
> >>>> guests registers not explicitly handled should return ~0 for reads and
> >>>> drop writes, which is what you are proposing here.
> >>> Yes, you are right: I can see in vpci_read that we end up reading ~0 if no
> >>> handler exists (which is what I do here with guest_rom_read). But I am not that
> >>> sure about the dropped writes:
> >>>
> >>> void vpci_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int size,
> >>>                    uint32_t data)
> >>> {
> >>>        unsigned int data_offset = 0;
> >>>
> >>> [snip]
> >>>
> >>>        if ( data_offset < size )
> >>>            /* Tailing gap, write the remaining. */
> >>>            vpci_write_hw(sbdf, reg + data_offset, size - data_offset,
> >>>                          data >> (data_offset * 8));
> >>>
> >>> so it looks like for the un-handled writes we still reach the HW register.
> >>> Could you please tell if the code above needs improvement (like checking
> >>> if the write was handled) or I still need to provide a write handler, e.g.
> >>> guest_rom_write here?
> >> Hm, but the same applies to the reads as well... And this is no surprise,
> >> as for the guests I can see that it accesses all the configuration space
> >> registers that I don't handle. Without that I would have guests unable
> >> to properly setup a PCI device being passed through... And this is why
> >> I have a big TODO in this series describing unhandled registers.
> >> So, it seems that I do need to provide those handlers which I need to
> >> drop writes and return ~0 on reads.
> > Right (see my previous reply to this comment). I think it would be
> > easier (and cleaner) if you switched the default behavior regarding
> > unhandled register access for domUs at the start of the series (drop
> > writes, reads returns ~0), and then you won't need to add all those
> > dummy handler to drop writes and return ~0 for reads.
> >
> > It's going to be more work initially as you would need to support
> > passthrough of more registers, but it's the right approach that we
> > need implementation wise.
> While I agree in general, this effectively means that I'll need to provide
> handling for all PCIe registers and capabilities from the very start.

Well, we can only offer handling of the header and the MSI and MSI-X
capabilities right now, because that's all vPCI currently knows about.

> Otherwise no guest be able to properly initialize a PCI device without that.
> Of course, we may want starting from stubs instead of proper emulation,
> which will direct the access to real HW and later on we add proper emulation.
> But, again, this is going to be a rather big piece of code where we need
> to explicitly handle every possible capability.
> 
> At the moment we are not going to claim that vPCI provides all means to
> pass through a PCI device safely with this respect and this is why the feature
> itself won't even be a tech preview yet. For that reason I think we can still
> have implemented only crucial set of handlers and still allow the rest to
> be read/write directly without emulation.

See my other reply, you can probably move the special handlers into a
separate patch at the end of the series in order to test the
functionality without adding code that will need to be removed when
the defaults for domUs are changed.

> Another question is what needs to be done for vendor specific capabilities?
> How these are going to be emulated?

I think you will need some kind of permissive mode in order to allow a
guest to access those, as they shouldn't be exposed by default.

Thanks, Roger.
Jan Beulich Jan. 31, 2022, 11:39 a.m. UTC | #17
On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
> On 31.01.22 13:10, Roger Pau Monné wrote:
>> Right (see my previous reply to this comment). I think it would be
>> easier (and cleaner) if you switched the default behavior regarding
>> unhandled register access for domUs at the start of the series (drop
>> writes, reads returns ~0), and then you won't need to add all those
>> dummy handler to drop writes and return ~0 for reads.
>>
>> It's going to be more work initially as you would need to support
>> passthrough of more registers, but it's the right approach that we
>> need implementation wise.
> While I agree in general, this effectively means that I'll need to provide
> handling for all PCIe registers and capabilities from the very start.
> Otherwise no guest be able to properly initialize a PCI device without that.
> Of course, we may want starting from stubs instead of proper emulation,
> which will direct the access to real HW and later on we add proper emulation.
> But, again, this is going to be a rather big piece of code where we need
> to explicitly handle every possible capability.

Since the two sub-threads are now about exactly the same topic, I'm
answering here instead of there.

No, you are not going to need to emulate all possible capabilities.
We (or really qemu) don't do this on x86 either. Certain capabilities
may be a must, but not everything. There are also device specific
registers not covered by any capability structures - what to do with
those is even more of a question.

Furthermore for some of the fields justification why access to the
raw hardware value is fine is going to be easy: r/o fields like
vendor and device ID, for example. But every bit you allow direct
access to needs to come with justification.

> At the moment we are not going to claim that vPCI provides all means to
> pass through a PCI device safely with this respect and this is why the feature
> itself won't even be a tech preview yet. For that reason I think we can still
> have implemented only crucial set of handlers and still allow the rest to
> be read/write directly without emulation.

I think you need to separate what you need for development from what
goes upstream: For dev purposes you can very well invert the policy
from white- to black-listing. But if we accepted the latter into the
main tree, the risk would be there that something gets missed at the
time where the permission model gets changed around.

You could even have a non-default mode operating the way you want it
(along the lines of pciback's permissive mode), allowing you to get
away without needing to carry private patches. Things may also
initially only work in that mode. But the default should be a mode
which is secure (and which perhaps initially offers only very limited
functionality).

> Another question is what needs to be done for vendor specific capabilities?
> How these are going to be emulated?

By vendor specific code, I'm afraid. Assuming these capabilities
really need exposing in the first place.

Jan
Oleksandr Andrushchenko Jan. 31, 2022, 1:30 p.m. UTC | #18
On 31.01.22 13:39, Jan Beulich wrote:
> On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
>> On 31.01.22 13:10, Roger Pau Monné wrote:
>>> Right (see my previous reply to this comment). I think it would be
>>> easier (and cleaner) if you switched the default behavior regarding
>>> unhandled register access for domUs at the start of the series (drop
>>> writes, reads returns ~0), and then you won't need to add all those
>>> dummy handler to drop writes and return ~0 for reads.
>>>
>>> It's going to be more work initially as you would need to support
>>> passthrough of more registers, but it's the right approach that we
>>> need implementation wise.
>> While I agree in general, this effectively means that I'll need to provide
>> handling for all PCIe registers and capabilities from the very start.
>> Otherwise no guest be able to properly initialize a PCI device without that.
>> Of course, we may want starting from stubs instead of proper emulation,
>> which will direct the access to real HW and later on we add proper emulation.
>> But, again, this is going to be a rather big piece of code where we need
>> to explicitly handle every possible capability.
> Since the two sub-threads are now about exactly the same topic, I'm
> answering here instead of there.
>
> No, you are not going to need to emulate all possible capabilities.
> We (or really qemu) don't do this on x86 either. Certain capabilities
> may be a must, but not everything. There are also device specific
> registers not covered by any capability structures - what to do with
> those is even more of a question.
>
> Furthermore for some of the fields justification why access to the
> raw hardware value is fine is going to be easy: r/o fields like
> vendor and device ID, for example. But every bit you allow direct
> access to needs to come with justification.
>
>> At the moment we are not going to claim that vPCI provides all means to
>> pass through a PCI device safely with this respect and this is why the feature
>> itself won't even be a tech preview yet. For that reason I think we can still
>> have implemented only crucial set of handlers and still allow the rest to
>> be read/write directly without emulation.
> I think you need to separate what you need for development from what
> goes upstream: For dev purposes you can very well invert the policy
> from white- to black-listing. But if we accepted the latter into the
> main tree, the risk would be there that something gets missed at the
> time where the permission model gets changed around.
>
> You could even have a non-default mode operating the way you want it
> (along the lines of pciback's permissive mode), allowing you to get
> away without needing to carry private patches. Things may also
> initially only work in that mode. But the default should be a mode
> which is secure (and which perhaps initially offers only very limited
> functionality).
Ok, so to make it clear:
1. We do not allow unhandled access for guests: for that I will create a
dedicated patch which will implement such restrictions. Something like
the below (for both vPCI read and write):

diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
index c5e67491c24f..9ef2a1b5af58 100644
--- a/xen/drivers/vpci/vpci.c
+++ b/xen/drivers/vpci/vpci.c
@@ -347,6 +347,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      const struct vpci_register *r;
      unsigned int data_offset = 0;
      uint32_t data = ~(uint32_t)0;
+    bool handled = false;

      if ( !size )
      {
@@ -405,6 +406,8 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
          if ( cmp > 0 )
              continue;

+        handled = true; /* Found the handler for this access. */
+
          if ( emu.offset < r->offset )
          {
              /* Heading gap, read partial content from hardware. */
@@ -432,6 +435,10 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
      }
      spin_unlock(&pdev->vpci_lock);

+    /* All unhandled guest requests return all 1's. */
+    if ( !is_hardware_domain(d) && !handled )
+        return ~(uint32_t)0;
+
      if ( data_offset < size )
      {
          /* Tailing gap, read the remaining. */

@Roger: does the above work for you?

2. For the time being, while only a reduced set of registers is emulated,
everyone who wants to test vPCI for guests should either create their own
patch with those handlers implemented or overcome that in any other
suitable way, e.g. with a hack that removes HW register access protection
or by any other means.

>
>> Another question is what needs to be done for vendor specific capabilities?
>> How these are going to be emulated?
> By vendor specific code, I'm afraid. Assuming these capabilities
> really need exposing in the first place.
I have a feeling this is not going to happen...
>
> Jan
>

Thank you,
Oleksandr
Jan Beulich Jan. 31, 2022, 1:36 p.m. UTC | #19
On 31.01.2022 14:30, Oleksandr Andrushchenko wrote:
> 
> 
> On 31.01.22 13:39, Jan Beulich wrote:
>> On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
>>> On 31.01.22 13:10, Roger Pau Monné wrote:
>>>> Right (see my previous reply to this comment). I think it would be
>>>> easier (and cleaner) if you switched the default behavior regarding
>>>> unhandled register access for domUs at the start of the series (drop
>>>> writes, reads returns ~0), and then you won't need to add all those
>>>> dummy handler to drop writes and return ~0 for reads.
>>>>
>>>> It's going to be more work initially as you would need to support
>>>> passthrough of more registers, but it's the right approach that we
>>>> need implementation wise.
>>> While I agree in general, this effectively means that I'll need to provide
>>> handling for all PCIe registers and capabilities from the very start.
>>> Otherwise no guest be able to properly initialize a PCI device without that.
>>> Of course, we may want starting from stubs instead of proper emulation,
>>> which will direct the access to real HW and later on we add proper emulation.
>>> But, again, this is going to be a rather big piece of code where we need
>>> to explicitly handle every possible capability.
>> Since the two sub-threads are now about exactly the same topic, I'm
>> answering here instead of there.
>>
>> No, you are not going to need to emulate all possible capabilities.
>> We (or really qemu) don't do this on x86 either. Certain capabilities
>> may be a must, but not everything. There are also device specific
>> registers not covered by any capability structures - what to do with
>> those is even more of a question.
>>
>> Furthermore for some of the fields justification why access to the
>> raw hardware value is fine is going to be easy: r/o fields like
>> vendor and device ID, for example. But every bit you allow direct
>> access to needs to come with justification.
>>
>>> At the moment we are not going to claim that vPCI provides all means to
>>> pass through a PCI device safely with this respect and this is why the feature
>>> itself won't even be a tech preview yet. For that reason I think we can still
>>> have implemented only crucial set of handlers and still allow the rest to
>>> be read/write directly without emulation.
>> I think you need to separate what you need for development from what
>> goes upstream: For dev purposes you can very well invert the policy
>> from white- to black-listing. But if we accepted the latter into the
>> main tree, the risk would be there that something gets missed at the
>> time where the permission model gets changed around.
>>
>> You could even have a non-default mode operating the way you want it
>> (along the lines of pciback's permissive mode), allowing you to get
>> away without needing to carry private patches. Things may also
>> initially only work in that mode. But the default should be a mode
>> which is secure (and which perhaps initially offers only very limited
>> functionality).
> Ok, so to make it clear:
> 1. We do not allow unhandled access for guests: for that I will create a
> dedicated patch which will implement such restrictions. Something like
> the below (for both vPCI read and write):
> 
> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
> index c5e67491c24f..9ef2a1b5af58 100644
> --- a/xen/drivers/vpci/vpci.c
> +++ b/xen/drivers/vpci/vpci.c
> @@ -347,6 +347,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       const struct vpci_register *r;
>       unsigned int data_offset = 0;
>       uint32_t data = ~(uint32_t)0;
> +    bool handled = false;
> 
>       if ( !size )
>       {
> @@ -405,6 +406,8 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>           if ( cmp > 0 )
>               continue;
> 
> +        handled = true; /* Found the handler for this access. */
> +
>           if ( emu.offset < r->offset )
>           {
>               /* Heading gap, read partial content from hardware. */
> @@ -432,6 +435,10 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>       }
>       spin_unlock(&pdev->vpci_lock);
> 
> +    /* All unhandled guest requests return all 1's. */
> +    if ( !is_hardware_domain(d) && !handled )
> +        return ~(uint32_t)0;
> +
>       if ( data_offset < size )
>       {
>           /* Tailing gap, read the remaining. */

Except that like for the "tailing gap" you also need to avoid the
"heading gap" ending up in a read of the underlying hardware
register. Effectively you want to deal properly with all
vpci_read_hw() invocations (including the one when no pdev was
found, which for a DomU may simply mean domain_crash()).

Jan
Oleksandr Andrushchenko Jan. 31, 2022, 1:41 p.m. UTC | #20
On 31.01.22 15:36, Jan Beulich wrote:
> On 31.01.2022 14:30, Oleksandr Andrushchenko wrote:
>>
>> On 31.01.22 13:39, Jan Beulich wrote:
>>> On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
>>>> On 31.01.22 13:10, Roger Pau Monné wrote:
>>>>> Right (see my previous reply to this comment). I think it would be
>>>>> easier (and cleaner) if you switched the default behavior regarding
>>>>> unhandled register access for domUs at the start of the series (drop
>>>>> writes, reads returns ~0), and then you won't need to add all those
>>>>> dummy handler to drop writes and return ~0 for reads.
>>>>>
>>>>> It's going to be more work initially as you would need to support
>>>>> passthrough of more registers, but it's the right approach that we
>>>>> need implementation wise.
>>>> While I agree in general, this effectively means that I'll need to provide
>>>> handling for all PCIe registers and capabilities from the very start.
>>>> Otherwise no guest be able to properly initialize a PCI device without that.
>>>> Of course, we may want starting from stubs instead of proper emulation,
>>>> which will direct the access to real HW and later on we add proper emulation.
>>>> But, again, this is going to be a rather big piece of code where we need
>>>> to explicitly handle every possible capability.
>>> Since the two sub-threads are now about exactly the same topic, I'm
>>> answering here instead of there.
>>>
>>> No, you are not going to need to emulate all possible capabilities.
>>> We (or really qemu) don't do this on x86 either. Certain capabilities
>>> may be a must, but not everything. There are also device specific
>>> registers not covered by any capability structures - what to do with
>>> those is even more of a question.
>>>
>>> Furthermore for some of the fields justification why access to the
>>> raw hardware value is fine is going to be easy: r/o fields like
>>> vendor and device ID, for example. But every bit you allow direct
>>> access to needs to come with justification.
>>>
>>>> At the moment we are not going to claim that vPCI provides all means to
>>>> pass through a PCI device safely with this respect and this is why the feature
>>>> itself won't even be a tech preview yet. For that reason I think we can still
>>>> have implemented only crucial set of handlers and still allow the rest to
>>>> be read/write directly without emulation.
>>> I think you need to separate what you need for development from what
>>> goes upstream: For dev purposes you can very well invert the policy
>>> from white- to black-listing. But if we accepted the latter into the
>>> main tree, the risk would be there that something gets missed at the
>>> time where the permission model gets changed around.
>>>
>>> You could even have a non-default mode operating the way you want it
>>> (along the lines of pciback's permissive mode), allowing you to get
>>> away without needing to carry private patches. Things may also
>>> initially only work in that mode. But the default should be a mode
>>> which is secure (and which perhaps initially offers only very limited
>>> functionality).
>> Ok, so to make it clear:
>> 1. We do not allow unhandled access for guests: for that I will create a
>> dedicated patch which will implement such restrictions. Something like
>> the below (for both vPCI read and write):
>>
>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>> index c5e67491c24f..9ef2a1b5af58 100644
>> --- a/xen/drivers/vpci/vpci.c
>> +++ b/xen/drivers/vpci/vpci.c
>> @@ -347,6 +347,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>        const struct vpci_register *r;
>>        unsigned int data_offset = 0;
>>        uint32_t data = ~(uint32_t)0;
>> +    bool handled = false;
>>
>>        if ( !size )
>>        {
>> @@ -405,6 +406,8 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>            if ( cmp > 0 )
>>                continue;
>>
>> +        handled = true; /* Found the handler for this access. */
>> +
>>            if ( emu.offset < r->offset )
>>            {
>>                /* Heading gap, read partial content from hardware. */
>> @@ -432,6 +435,10 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>        }
>>        spin_unlock(&pdev->vpci_lock);
>>
>> +    /* All unhandled guest requests return all 1's. */
>> +    if ( !is_hardware_domain(d) && !handled )
>> +        return ~(uint32_t)0;
>> +
>>        if ( data_offset < size )
>>        {
>>            /* Tailing gap, read the remaining. */
> Except that like for the "tailing gap" you also need to avoid the
> "heading gap" ending up in a read of the underlying hardware
> register. Effectively you want to deal properly with all
> vpci_read_hw() invocations (including the one when no pdev was
> found, which for a DomU may simply mean domain_crash()).
Yes. And with the above patch I can now remove the "TODO patch" then?
Because it is saying that we allow access to the registers, but it is not safe.
And now, if we disable that access, then TODO should be about the need to
implement emulation for all the registers which are not yet handled which is
obvious.
>
> Jan
>
Thank you,
Oleksandr
Jan Beulich Jan. 31, 2022, 1:51 p.m. UTC | #21
On 31.01.2022 14:41, Oleksandr Andrushchenko wrote:
> On 31.01.22 15:36, Jan Beulich wrote:
>> On 31.01.2022 14:30, Oleksandr Andrushchenko wrote:
>>> On 31.01.22 13:39, Jan Beulich wrote:
>>>> On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
>>>>> On 31.01.22 13:10, Roger Pau Monné wrote:
>>>>>> Right (see my previous reply to this comment). I think it would be
>>>>>> easier (and cleaner) if you switched the default behavior regarding
>>>>>> unhandled register access for domUs at the start of the series (drop
>>>>>> writes, reads returns ~0), and then you won't need to add all those
>>>>>> dummy handler to drop writes and return ~0 for reads.
>>>>>>
>>>>>> It's going to be more work initially as you would need to support
>>>>>> passthrough of more registers, but it's the right approach that we
>>>>>> need implementation wise.
>>>>> While I agree in general, this effectively means that I'll need to provide
>>>>> handling for all PCIe registers and capabilities from the very start.
>>>>> Otherwise no guest be able to properly initialize a PCI device without that.
>>>>> Of course, we may want starting from stubs instead of proper emulation,
>>>>> which will direct the access to real HW and later on we add proper emulation.
>>>>> But, again, this is going to be a rather big piece of code where we need
>>>>> to explicitly handle every possible capability.
>>>> Since the two sub-threads are now about exactly the same topic, I'm
>>>> answering here instead of there.
>>>>
>>>> No, you are not going to need to emulate all possible capabilities.
>>>> We (or really qemu) don't do this on x86 either. Certain capabilities
>>>> may be a must, but not everything. There are also device specific
>>>> registers not covered by any capability structures - what to do with
>>>> those is even more of a question.
>>>>
>>>> Furthermore for some of the fields justification why access to the
>>>> raw hardware value is fine is going to be easy: r/o fields like
>>>> vendor and device ID, for example. But every bit you allow direct
>>>> access to needs to come with justification.
>>>>
>>>>> At the moment we are not going to claim that vPCI provides all means to
>>>>> pass through a PCI device safely with this respect and this is why the feature
>>>>> itself won't even be a tech preview yet. For that reason I think we can still
>>>>> have implemented only crucial set of handlers and still allow the rest to
>>>>> be read/write directly without emulation.
>>>> I think you need to separate what you need for development from what
>>>> goes upstream: For dev purposes you can very well invert the policy
>>>> from white- to black-listing. But if we accepted the latter into the
>>>> main tree, the risk would be there that something gets missed at the
>>>> time where the permission model gets changed around.
>>>>
>>>> You could even have a non-default mode operating the way you want it
>>>> (along the lines of pciback's permissive mode), allowing you to get
>>>> away without needing to carry private patches. Things may also
>>>> initially only work in that mode. But the default should be a mode
>>>> which is secure (and which perhaps initially offers only very limited
>>>> functionality).
>>> Ok, so to make it clear:
>>> 1. We do not allow unhandled access for guests: for that I will create a
>>> dedicated patch which will implement such restrictions. Something like
>>> the below (for both vPCI read and write):
>>>
>>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>>> index c5e67491c24f..9ef2a1b5af58 100644
>>> --- a/xen/drivers/vpci/vpci.c
>>> +++ b/xen/drivers/vpci/vpci.c
>>> @@ -347,6 +347,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>        const struct vpci_register *r;
>>>        unsigned int data_offset = 0;
>>>        uint32_t data = ~(uint32_t)0;
>>> +    bool handled = false;
>>>
>>>        if ( !size )
>>>        {
>>> @@ -405,6 +406,8 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>            if ( cmp > 0 )
>>>                continue;
>>>
>>> +        handled = true; /* Found the handler for this access. */
>>> +
>>>            if ( emu.offset < r->offset )
>>>            {
>>>                /* Heading gap, read partial content from hardware. */
>>> @@ -432,6 +435,10 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>        }
>>>        spin_unlock(&pdev->vpci_lock);
>>>
>>> +    /* All unhandled guest requests return all 1's. */
>>> +    if ( !is_hardware_domain(d) && !handled )
>>> +        return ~(uint32_t)0;
>>> +
>>>        if ( data_offset < size )
>>>        {
>>>            /* Tailing gap, read the remaining. */
>> Except that like for the "tailing gap" you also need to avoid the
>> "heading gap" ending up in a read of the underlying hardware
>> register. Effectively you want to deal properly with all
>> vpci_read_hw() invocations (including the one when no pdev was
>> found, which for a DomU may simply mean domain_crash()).
> Yes. And with the above patch I can now remove the "TODO patch" then?
> Because it is saying that we allow access to the registers, but it is not safe.
> And now, if we disable that access, then TODO should be about the need to
> implement emulation for all the registers which are not yet handled which is
> obvious.

Yes, I think that other patch then should have no use anymore. (To be
honest I don't recall such a patch anyway.)

Jan
Oleksandr Andrushchenko Jan. 31, 2022, 1:58 p.m. UTC | #22
On 31.01.22 15:51, Jan Beulich wrote:
> On 31.01.2022 14:41, Oleksandr Andrushchenko wrote:
>> On 31.01.22 15:36, Jan Beulich wrote:
>>> On 31.01.2022 14:30, Oleksandr Andrushchenko wrote:
>>>> On 31.01.22 13:39, Jan Beulich wrote:
>>>>> On 31.01.2022 12:23, Oleksandr Andrushchenko wrote:
>>>>>> On 31.01.22 13:10, Roger Pau Monné wrote:
>>>>>>> Right (see my previous reply to this comment). I think it would be
>>>>>>> easier (and cleaner) if you switched the default behavior regarding
>>>>>>> unhandled register access for domUs at the start of the series (drop
>>>>>>> writes, reads returns ~0), and then you won't need to add all those
>>>>>>> dummy handler to drop writes and return ~0 for reads.
>>>>>>>
>>>>>>> It's going to be more work initially as you would need to support
>>>>>>> passthrough of more registers, but it's the right approach that we
>>>>>>> need implementation wise.
>>>>>> While I agree in general, this effectively means that I'll need to provide
>>>>>> handling for all PCIe registers and capabilities from the very start.
>>>>>> Otherwise no guest be able to properly initialize a PCI device without that.
>>>>>> Of course, we may want starting from stubs instead of proper emulation,
>>>>>> which will direct the access to real HW and later on we add proper emulation.
>>>>>> But, again, this is going to be a rather big piece of code where we need
>>>>>> to explicitly handle every possible capability.
>>>>> Since the two sub-threads are now about exactly the same topic, I'm
>>>>> answering here instead of there.
>>>>>
>>>>> No, you are not going to need to emulate all possible capabilities.
>>>>> We (or really qemu) don't do this on x86 either. Certain capabilities
>>>>> may be a must, but not everything. There are also device specific
>>>>> registers not covered by any capability structures - what to do with
>>>>> those is even more of a question.
>>>>>
>>>>> Furthermore for some of the fields justification why access to the
>>>>> raw hardware value is fine is going to be easy: r/o fields like
>>>>> vendor and device ID, for example. But every bit you allow direct
>>>>> access to needs to come with justification.
>>>>>
>>>>>> At the moment we are not going to claim that vPCI provides all means to
>>>>>> pass through a PCI device safely with this respect and this is why the feature
>>>>>> itself won't even be a tech preview yet. For that reason I think we can still
>>>>>> have implemented only crucial set of handlers and still allow the rest to
>>>>>> be read/write directly without emulation.
>>>>> I think you need to separate what you need for development from what
>>>>> goes upstream: For dev purposes you can very well invert the policy
>>>>> from white- to black-listing. But if we accepted the latter into the
>>>>> main tree, the risk would be there that something gets missed at the
>>>>> time where the permission model gets changed around.
>>>>>
>>>>> You could even have a non-default mode operating the way you want it
>>>>> (along the lines of pciback's permissive mode), allowing you to get
>>>>> away without needing to carry private patches. Things may also
>>>>> initially only work in that mode. But the default should be a mode
>>>>> which is secure (and which perhaps initially offers only very limited
>>>>> functionality).
>>>> Ok, so to make it clear:
>>>> 1. We do not allow unhandled access for guests: for that I will create a
>>>> dedicated patch which will implement such restrictions. Something like
>>>> the below (for both vPCI read and write):
>>>>
>>>> diff --git a/xen/drivers/vpci/vpci.c b/xen/drivers/vpci/vpci.c
>>>> index c5e67491c24f..9ef2a1b5af58 100644
>>>> --- a/xen/drivers/vpci/vpci.c
>>>> +++ b/xen/drivers/vpci/vpci.c
>>>> @@ -347,6 +347,7 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>>         const struct vpci_register *r;
>>>>         unsigned int data_offset = 0;
>>>>         uint32_t data = ~(uint32_t)0;
>>>> +    bool handled = false;
>>>>
>>>>         if ( !size )
>>>>         {
>>>> @@ -405,6 +406,8 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>>             if ( cmp > 0 )
>>>>                 continue;
>>>>
>>>> +        handled = true; /* Found the handler for this access. */
>>>> +
>>>>             if ( emu.offset < r->offset )
>>>>             {
>>>>                 /* Heading gap, read partial content from hardware. */
>>>> @@ -432,6 +435,10 @@ uint32_t vpci_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int size)
>>>>         }
>>>>         spin_unlock(&pdev->vpci_lock);
>>>>
>>>> +    /* All unhandled guest requests return all 1's. */
>>>> +    if ( !is_hardware_domain(d) && !handled )
>>>> +        return ~(uint32_t)0;
>>>> +
>>>>         if ( data_offset < size )
>>>>         {
>>>>             /* Tailing gap, read the remaining. */
>>> Except that like for the "tailing gap" you also need to avoid the
>>> "heading gap" ending up in a read of the underlying hardware
>>> register. Effectively you want to deal properly with all
>>> vpci_read_hw() invocations (including the one when no pdev was
>>> found, which for a DomU may simply mean domain_crash()).
>> Yes. And with the above patch I can now remove the "TODO patch" then?
>> Because it is saying that we allow access to the registers, but it is not safe.
>> And now, if we disable that access, then TODO should be about the need to
>> implement emulation for all the registers which are not yet handled which is
>> obvious.
> Yes, I think that other patch then should have no use anymore. (To be
> honest I don't recall such a patch anyway.)
This is "[PATCH v5 14/14] vpci: add TODO for the registers not explicitly handled"
in this series
>
> Jan
>
Thank you,
Oleksandr
Oleksandr Andrushchenko Jan. 31, 2022, 2:51 p.m. UTC | #23
>>> I wonder whether we need to protect the added code with
>>> CONFIG_HAS_VPCI_GUEST_SUPPORT, this would effectively be dead code
>>> otherwise. Long term I don't think we wish to differentiate between
>>> dom0 and domU vPCI support at build time, so I'm unsure whether it's
>>> helpful to pollute the code with CONFIG_HAS_VPCI_GUEST_SUPPORT when
>>> the plan is to remove those long term.
>> I would have it without CONFIG_HAS_VPCI_GUEST_SUPPORT if you
>> don't mind
> Well, I guess if it's not too intrusive it's fine to add the defines,
> removing them afterwards should be easy.
It is intrusive: it is easy to add such a define in struct vpci, but then you need
ifdefery in xen/drivers/vpci/header.c to sort out the case when it is defined or
not. I can still do that if you insist
>
> Thanks, Roger.
Thank you,
Oleksandr
Oleksandr Andrushchenko Jan. 31, 2022, 3:06 p.m. UTC | #24
Hi, Roger!
>>               rom->type = VPCI_BAR_EMPTY;
>>       }
>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>> index ed127a08a953..0a73b14a92dc 100644
>> --- a/xen/include/xen/vpci.h
>> +++ b/xen/include/xen/vpci.h
>> @@ -68,7 +68,10 @@ struct vpci {
>>       struct vpci_header {
>>           /* Information about the PCI BARs of this device. */
>>           struct vpci_bar {
>> +            /* Physical view of the BAR. */
> No, that's not the physical view, it's the physical (host) address.
>
>>               uint64_t addr;
>> +            /* Guest view of the BAR: address and lower bits. */
>> +            uint64_t guest_reg;
> I continue to think it would be clearer if you store the guest address
> here (gaddr, without the low bits) and add those in guest_bar_read
> based on bar->{type,prefetchable}. Then it would be equivalent to the
> existing 'addr' field.
>
I agreed first to do such a change, but then recalled our discussion with Jan [1].
And then we decided that in order for it to be efficient it is better if we setup all the
things during the write phase (rare), rather then during the write phase (more often).
If you still see it clearer I can re-work the code

Thank you,
Oleksandr

[1] https://www.mail-archive.com/xen-devel@lists.xenproject.org/msg103431.html
Jan Beulich Jan. 31, 2022, 3:50 p.m. UTC | #25
On 31.01.2022 16:06, Oleksandr Andrushchenko wrote:
> Hi, Roger!
>>>               rom->type = VPCI_BAR_EMPTY;
>>>       }
>>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>>> index ed127a08a953..0a73b14a92dc 100644
>>> --- a/xen/include/xen/vpci.h
>>> +++ b/xen/include/xen/vpci.h
>>> @@ -68,7 +68,10 @@ struct vpci {
>>>       struct vpci_header {
>>>           /* Information about the PCI BARs of this device. */
>>>           struct vpci_bar {
>>> +            /* Physical view of the BAR. */
>> No, that's not the physical view, it's the physical (host) address.
>>
>>>               uint64_t addr;
>>> +            /* Guest view of the BAR: address and lower bits. */
>>> +            uint64_t guest_reg;
>> I continue to think it would be clearer if you store the guest address
>> here (gaddr, without the low bits) and add those in guest_bar_read
>> based on bar->{type,prefetchable}. Then it would be equivalent to the
>> existing 'addr' field.
>>
> I agreed first to do such a change, but then recalled our discussion with Jan [1].
> And then we decided that in order for it to be efficient it is better if we setup all the
> things during the write phase (rare), rather then during the write phase (more often).

Small correction: The 2nd "write" was likely meant to be "read". But
please recall that Roger is the maintainer of the code, so he gets
the final say.

Jan
Oleksandr Andrushchenko Feb. 1, 2022, 7:31 a.m. UTC | #26
On 31.01.22 17:50, Jan Beulich wrote:
> On 31.01.2022 16:06, Oleksandr Andrushchenko wrote:
>> Hi, Roger!
>>>>                rom->type = VPCI_BAR_EMPTY;
>>>>        }
>>>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>>>> index ed127a08a953..0a73b14a92dc 100644
>>>> --- a/xen/include/xen/vpci.h
>>>> +++ b/xen/include/xen/vpci.h
>>>> @@ -68,7 +68,10 @@ struct vpci {
>>>>        struct vpci_header {
>>>>            /* Information about the PCI BARs of this device. */
>>>>            struct vpci_bar {
>>>> +            /* Physical view of the BAR. */
>>> No, that's not the physical view, it's the physical (host) address.
>>>
>>>>                uint64_t addr;
>>>> +            /* Guest view of the BAR: address and lower bits. */
>>>> +            uint64_t guest_reg;
>>> I continue to think it would be clearer if you store the guest address
>>> here (gaddr, without the low bits) and add those in guest_bar_read
>>> based on bar->{type,prefetchable}. Then it would be equivalent to the
>>> existing 'addr' field.
>>>
>> I agreed first to do such a change, but then recalled our discussion with Jan [1].
>> And then we decided that in order for it to be efficient it is better if we setup all the
>> things during the write phase (rare), rather then during the write phase (more often).
> Small correction: The 2nd "write" was likely meant to be "read".
Yes, this is correct.
>   But
> please recall that Roger is the maintainer of the code, so he gets
> the final say.
Agree, but would vote for the current approach as it still saves some
CPU cycles making the read operation really tiny
>
> Jan
>
Thank you,
Oleksandr
Roger Pau Monne Feb. 1, 2022, 10:10 a.m. UTC | #27
On Tue, Feb 01, 2022 at 07:31:31AM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 31.01.22 17:50, Jan Beulich wrote:
> > On 31.01.2022 16:06, Oleksandr Andrushchenko wrote:
> >> Hi, Roger!
> >>>>                rom->type = VPCI_BAR_EMPTY;
> >>>>        }
> >>>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
> >>>> index ed127a08a953..0a73b14a92dc 100644
> >>>> --- a/xen/include/xen/vpci.h
> >>>> +++ b/xen/include/xen/vpci.h
> >>>> @@ -68,7 +68,10 @@ struct vpci {
> >>>>        struct vpci_header {
> >>>>            /* Information about the PCI BARs of this device. */
> >>>>            struct vpci_bar {
> >>>> +            /* Physical view of the BAR. */
> >>> No, that's not the physical view, it's the physical (host) address.
> >>>
> >>>>                uint64_t addr;
> >>>> +            /* Guest view of the BAR: address and lower bits. */
> >>>> +            uint64_t guest_reg;
> >>> I continue to think it would be clearer if you store the guest address
> >>> here (gaddr, without the low bits) and add those in guest_bar_read
> >>> based on bar->{type,prefetchable}. Then it would be equivalent to the
> >>> existing 'addr' field.
> >>>
> >> I agreed first to do such a change, but then recalled our discussion with Jan [1].
> >> And then we decided that in order for it to be efficient it is better if we setup all the
> >> things during the write phase (rare), rather then during the write phase (more often).
> > Small correction: The 2nd "write" was likely meant to be "read".
> Yes, this is correct.
> >   But
> > please recall that Roger is the maintainer of the code, so he gets
> > the final say.
> Agree, but would vote for the current approach as it still saves some
> CPU cycles making the read operation really tiny

I think you need to build the mapping rangeset(s) based on guest
addresses, not host ones, so it's likely going to be easier if you
store the address here in order to use it when building the rangeset.

Overall the cost of the vmexit will shadow the cost of doing a couple
of ORs here in order to return the guest view of the BAR.

If you think storing the guest view of the BAR register will make the
code easier to understand, then please go ahead. Otherwise I would
recommend to store the address like we do for the host position of the
BAR (ie: addr field).

Thanks, Roger.
Oleksandr Andrushchenko Feb. 1, 2022, 10:41 a.m. UTC | #28
Hi, Roger!

On 01.02.22 12:10, Roger Pau Monné wrote:
> On Tue, Feb 01, 2022 at 07:31:31AM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 31.01.22 17:50, Jan Beulich wrote:
>>> On 31.01.2022 16:06, Oleksandr Andrushchenko wrote:
>>>> Hi, Roger!
>>>>>>                 rom->type = VPCI_BAR_EMPTY;
>>>>>>         }
>>>>>> diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
>>>>>> index ed127a08a953..0a73b14a92dc 100644
>>>>>> --- a/xen/include/xen/vpci.h
>>>>>> +++ b/xen/include/xen/vpci.h
>>>>>> @@ -68,7 +68,10 @@ struct vpci {
>>>>>>         struct vpci_header {
>>>>>>             /* Information about the PCI BARs of this device. */
>>>>>>             struct vpci_bar {
>>>>>> +            /* Physical view of the BAR. */
>>>>> No, that's not the physical view, it's the physical (host) address.
>>>>>
>>>>>>                 uint64_t addr;
>>>>>> +            /* Guest view of the BAR: address and lower bits. */
>>>>>> +            uint64_t guest_reg;
>>>>> I continue to think it would be clearer if you store the guest address
>>>>> here (gaddr, without the low bits) and add those in guest_bar_read
>>>>> based on bar->{type,prefetchable}. Then it would be equivalent to the
>>>>> existing 'addr' field.
>>>>>
>>>> I agreed first to do such a change, but then recalled our discussion with Jan [1].
>>>> And then we decided that in order for it to be efficient it is better if we setup all the
>>>> things during the write phase (rare), rather then during the write phase (more often).
>>> Small correction: The 2nd "write" was likely meant to be "read".
>> Yes, this is correct.
>>>    But
>>> please recall that Roger is the maintainer of the code, so he gets
>>> the final say.
>> Agree, but would vote for the current approach as it still saves some
>> CPU cycles making the read operation really tiny
> I think you need to build the mapping rangeset(s) based on guest
> addresses, not host ones, so it's likely going to be easier if you
> store the address here in order to use it when building the rangeset.
>
> Overall the cost of the vmexit will shadow the cost of doing a couple
> of ORs here in order to return the guest view of the BAR.
>
> If you think storing the guest view of the BAR register will make the
> code easier to understand, then please go ahead. Otherwise I would
> recommend to store the address like we do for the host position of the
> BAR (ie: addr field).
I still think it is easier to understand: if you take a look at what we do
for BAR write for both host and guest you'll see that we do almost the
same operations, but in host case we end up writing bar->addr + low
bits to the HW register and in case of a guest we store the complete
thing into bar->guest_reg. Read operation doesn't require any processing
for host, so it is equivalent to direct hw read and in case of a guest it
is as simple as possible and implements the equivalent by returning
part of bar->guest_reg (hi or lo).  So, from this POV it is IMO easier to
understand the logic.
That being said, I do agree that the contents of the bar->addr is not
equivalent to bar->guest_reg, but we have already taken care of it
by naming the guest's one with guest_reg, not guest_addr.

I will keep the code as is then.
>
> Thanks, Roger.
Thank you,
Oleksandr
Oleksandr Andrushchenko Feb. 3, 2022, 12:36 p.m. UTC | #29
Hi, Bertrand!

On 26.11.21 14:19, Oleksandr Andrushchenko wrote:
> Hi, Bertrand!
>
> On 25.11.21 18:28, Bertrand Marquis wrote:
>> Hi Oleksandr,
>>
>>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>>
>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>
>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>> and remove those when de-assigning. This allows having different
>>> handlers for different domains, e.g. hwdom and other guests.
>>>
>>> Emulate guest BAR register values: this allows creating a guest view
>>> of the registers and emulates size and properties probe as it is done
>>> during PCI device enumeration by the guest.
>>>
>>> ROM BAR is only handled for the hardware domain and for guest domains
>>> there is a stub: at the moment PCI expansion ROM handling is supported
>>> for x86 only and it might not be used by other architectures without
>>> emulating x86. Other use-cases may include using that expansion ROM before
>>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>>> wants to use the ROM code which seems to be rare.
>> In the generic code, bars for ioports are actually skipped (check code before
>> in header.c, in case of ioports there is a continue) and no handler is registered for them.
>> The consequence will be that a guest will access hardware when reading those BARs.
> Yes, this seems to be a valid point
So, with the approach we have developed these days we will ignore all writes
and return ~0 for reads for all unhandled ops, e.g. those which do not have explicit
register handlers employed. Thus, this case will fall into unhandled clause.

Thank you,
Oleksandr
Jan Beulich Feb. 3, 2022, 12:44 p.m. UTC | #30
On 03.02.2022 13:36, Oleksandr Andrushchenko wrote:
> Hi, Bertrand!
> 
> On 26.11.21 14:19, Oleksandr Andrushchenko wrote:
>> Hi, Bertrand!
>>
>> On 25.11.21 18:28, Bertrand Marquis wrote:
>>> Hi Oleksandr,
>>>
>>>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>>>
>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>
>>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>>> and remove those when de-assigning. This allows having different
>>>> handlers for different domains, e.g. hwdom and other guests.
>>>>
>>>> Emulate guest BAR register values: this allows creating a guest view
>>>> of the registers and emulates size and properties probe as it is done
>>>> during PCI device enumeration by the guest.
>>>>
>>>> ROM BAR is only handled for the hardware domain and for guest domains
>>>> there is a stub: at the moment PCI expansion ROM handling is supported
>>>> for x86 only and it might not be used by other architectures without
>>>> emulating x86. Other use-cases may include using that expansion ROM before
>>>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>>>> wants to use the ROM code which seems to be rare.
>>> In the generic code, bars for ioports are actually skipped (check code before
>>> in header.c, in case of ioports there is a continue) and no handler is registered for them.
>>> The consequence will be that a guest will access hardware when reading those BARs.
>> Yes, this seems to be a valid point
> So, with the approach we have developed these days we will ignore all writes
> and return ~0 for reads for all unhandled ops, e.g. those which do not have explicit
> register handlers employed. Thus, this case will fall into unhandled clause.

Except that I guess BARs are special in that reads may not return ~0,
or else the low bits carry a meaning we don't want to convey. Unused
BARs need to be hard-wired to 0, I think.

Jan
Oleksandr Andrushchenko Feb. 3, 2022, 12:45 p.m. UTC | #31
Hi, Roger!
>> Also memory decoding needs to be initially disabled when used by
>> guests, in order to prevent the BAR being placed on top of a RAM
>> region. The guest physmap will be different from the host one, so it's
>> possible for BARs to end up placed on top of RAM regions initially
>> until the firmware or OS places them at a suitable address.
> Agree, memory decoding must be disabled
Isn't it already achieved by the toolstack resetting the PCI device
while assigning  it to a guest?

Thank you,
Oleksandr
Oleksandr Andrushchenko Feb. 3, 2022, 12:48 p.m. UTC | #32
Hi, Jan!

On 03.02.22 14:44, Jan Beulich wrote:
> On 03.02.2022 13:36, Oleksandr Andrushchenko wrote:
>> Hi, Bertrand!
>>
>> On 26.11.21 14:19, Oleksandr Andrushchenko wrote:
>>> Hi, Bertrand!
>>>
>>> On 25.11.21 18:28, Bertrand Marquis wrote:
>>>> Hi Oleksandr,
>>>>
>>>>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>>>>
>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>
>>>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>>>> and remove those when de-assigning. This allows having different
>>>>> handlers for different domains, e.g. hwdom and other guests.
>>>>>
>>>>> Emulate guest BAR register values: this allows creating a guest view
>>>>> of the registers and emulates size and properties probe as it is done
>>>>> during PCI device enumeration by the guest.
>>>>>
>>>>> ROM BAR is only handled for the hardware domain and for guest domains
>>>>> there is a stub: at the moment PCI expansion ROM handling is supported
>>>>> for x86 only and it might not be used by other architectures without
>>>>> emulating x86. Other use-cases may include using that expansion ROM before
>>>>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>>>>> wants to use the ROM code which seems to be rare.
>>>> In the generic code, bars for ioports are actually skipped (check code before
>>>> in header.c, in case of ioports there is a continue) and no handler is registered for them.
>>>> The consequence will be that a guest will access hardware when reading those BARs.
>>> Yes, this seems to be a valid point
>> So, with the approach we have developed these days we will ignore all writes
>> and return ~0 for reads for all unhandled ops, e.g. those which do not have explicit
>> register handlers employed. Thus, this case will fall into unhandled clause.
> Except that I guess BARs are special in that reads may not return ~0,
> or else the low bits carry a meaning we don't want to convey. Unused
> BARs need to be hard-wired to 0, I think.
So, you mean we should have 2 sets of BAR handlers for guests:
1. normal emulation (these are implemented in this patch)
2. all other BARs: read 0/ignore write for all other BARs, including ROM, IO etc.

Is this what you mean?
> Jan
>
Thank you,
Oleksandr
Jan Beulich Feb. 3, 2022, 12:50 p.m. UTC | #33
On 03.02.2022 13:48, Oleksandr Andrushchenko wrote:
> Hi, Jan!
> 
> On 03.02.22 14:44, Jan Beulich wrote:
>> On 03.02.2022 13:36, Oleksandr Andrushchenko wrote:
>>> Hi, Bertrand!
>>>
>>> On 26.11.21 14:19, Oleksandr Andrushchenko wrote:
>>>> Hi, Bertrand!
>>>>
>>>> On 25.11.21 18:28, Bertrand Marquis wrote:
>>>>> Hi Oleksandr,
>>>>>
>>>>>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>>>>>
>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>
>>>>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>>>>> and remove those when de-assigning. This allows having different
>>>>>> handlers for different domains, e.g. hwdom and other guests.
>>>>>>
>>>>>> Emulate guest BAR register values: this allows creating a guest view
>>>>>> of the registers and emulates size and properties probe as it is done
>>>>>> during PCI device enumeration by the guest.
>>>>>>
>>>>>> ROM BAR is only handled for the hardware domain and for guest domains
>>>>>> there is a stub: at the moment PCI expansion ROM handling is supported
>>>>>> for x86 only and it might not be used by other architectures without
>>>>>> emulating x86. Other use-cases may include using that expansion ROM before
>>>>>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>>>>>> wants to use the ROM code which seems to be rare.
>>>>> In the generic code, bars for ioports are actually skipped (check code before
>>>>> in header.c, in case of ioports there is a continue) and no handler is registered for them.
>>>>> The consequence will be that a guest will access hardware when reading those BARs.
>>>> Yes, this seems to be a valid point
>>> So, with the approach we have developed these days we will ignore all writes
>>> and return ~0 for reads for all unhandled ops, e.g. those which do not have explicit
>>> register handlers employed. Thus, this case will fall into unhandled clause.
>> Except that I guess BARs are special in that reads may not return ~0,
>> or else the low bits carry a meaning we don't want to convey. Unused
>> BARs need to be hard-wired to 0, I think.
> So, you mean we should have 2 sets of BAR handlers for guests:
> 1. normal emulation (these are implemented in this patch)
> 2. all other BARs: read 0/ignore write for all other BARs, including ROM, IO etc.
> 
> Is this what you mean?

I think that's what we're going to need, yes.

Jan
Oleksandr Andrushchenko Feb. 3, 2022, 12:53 p.m. UTC | #34
On 03.02.22 14:50, Jan Beulich wrote:
> On 03.02.2022 13:48, Oleksandr Andrushchenko wrote:
>> Hi, Jan!
>>
>> On 03.02.22 14:44, Jan Beulich wrote:
>>> On 03.02.2022 13:36, Oleksandr Andrushchenko wrote:
>>>> Hi, Bertrand!
>>>>
>>>> On 26.11.21 14:19, Oleksandr Andrushchenko wrote:
>>>>> Hi, Bertrand!
>>>>>
>>>>> On 25.11.21 18:28, Bertrand Marquis wrote:
>>>>>> Hi Oleksandr,
>>>>>>
>>>>>>> On 25 Nov 2021, at 11:02, Oleksandr Andrushchenko <andr2000@gmail.com> wrote:
>>>>>>>
>>>>>>> From: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>>>>>>>
>>>>>>> Add relevant vpci register handlers when assigning PCI device to a domain
>>>>>>> and remove those when de-assigning. This allows having different
>>>>>>> handlers for different domains, e.g. hwdom and other guests.
>>>>>>>
>>>>>>> Emulate guest BAR register values: this allows creating a guest view
>>>>>>> of the registers and emulates size and properties probe as it is done
>>>>>>> during PCI device enumeration by the guest.
>>>>>>>
>>>>>>> ROM BAR is only handled for the hardware domain and for guest domains
>>>>>>> there is a stub: at the moment PCI expansion ROM handling is supported
>>>>>>> for x86 only and it might not be used by other architectures without
>>>>>>> emulating x86. Other use-cases may include using that expansion ROM before
>>>>>>> Xen boots, hence no emulation is needed in Xen itself. Or when a guest
>>>>>>> wants to use the ROM code which seems to be rare.
>>>>>> In the generic code, bars for ioports are actually skipped (check code before
>>>>>> in header.c, in case of ioports there is a continue) and no handler is registered for them.
>>>>>> The consequence will be that a guest will access hardware when reading those BARs.
>>>>> Yes, this seems to be a valid point
>>>> So, with the approach we have developed these days we will ignore all writes
>>>> and return ~0 for reads for all unhandled ops, e.g. those which do not have explicit
>>>> register handlers employed. Thus, this case will fall into unhandled clause.
>>> Except that I guess BARs are special in that reads may not return ~0,
>>> or else the low bits carry a meaning we don't want to convey. Unused
>>> BARs need to be hard-wired to 0, I think.
>> So, you mean we should have 2 sets of BAR handlers for guests:
>> 1. normal emulation (these are implemented in this patch)
>> 2. all other BARs: read 0/ignore write for all other BARs, including ROM, IO etc.
>>
>> Is this what you mean?
> I think that's what we're going to need, yes.
Ok, then I'll stuff that into this patch v6
> Jan
>
Thank you,
Oleksandr
Jan Beulich Feb. 3, 2022, 12:54 p.m. UTC | #35
On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
>>> Also memory decoding needs to be initially disabled when used by
>>> guests, in order to prevent the BAR being placed on top of a RAM
>>> region. The guest physmap will be different from the host one, so it's
>>> possible for BARs to end up placed on top of RAM regions initially
>>> until the firmware or OS places them at a suitable address.
>> Agree, memory decoding must be disabled
> Isn't it already achieved by the toolstack resetting the PCI device
> while assigning  it to a guest?

Iirc the tool stack would reset a device only after getting it back from
a DomU. When coming straight from Dom0 or DomIO, no reset would be
performed. Furthermore, (again iirc) there are cases where there's no
known (standard) way to reset a device. Assigning such to a guest when
it previously was owned by another one is risky (and hence needs an
admin knowing what they're doing), but may be acceptable in particular
when e.g. simply rebooting a guest.

IOW - I don't think you can rely on the bit being in a particular state.

Jan
Oleksandr Andrushchenko Feb. 3, 2022, 1:30 p.m. UTC | #36
On 03.02.22 14:54, Jan Beulich wrote:
> On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
>>>> Also memory decoding needs to be initially disabled when used by
>>>> guests, in order to prevent the BAR being placed on top of a RAM
>>>> region. The guest physmap will be different from the host one, so it's
>>>> possible for BARs to end up placed on top of RAM regions initially
>>>> until the firmware or OS places them at a suitable address.
>>> Agree, memory decoding must be disabled
>> Isn't it already achieved by the toolstack resetting the PCI device
>> while assigning  it to a guest?
> Iirc the tool stack would reset a device only after getting it back from
> a DomU. When coming straight from Dom0 or DomIO, no reset would be
> performed. Furthermore, (again iirc) there are cases where there's no
> known (standard) way to reset a device. Assigning such to a guest when
> it previously was owned by another one is risky (and hence needs an
> admin knowing what they're doing), but may be acceptable in particular
> when e.g. simply rebooting a guest.
>
> IOW - I don't think you can rely on the bit being in a particular state.
So, you mean something like:

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index 7695158e6445..9ebd57472da8 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -808,6 +808,14 @@ static int init_bars(struct pci_dev *pdev)
              return rc;
      }

+    /*
+     * Memory decoding needs to be initially disabled when used by
+     * guests, in order to prevent the BAR being placed on top of a RAM
+     * region.
+     */
+    if ( !is_hwdom )
+        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
+
      return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
  }
  REGISTER_VPCI_INIT(init_bars, VPCI_PRIORITY_MIDDLE);

> Jan
>
Thank you,
Oleksandr
Jan Beulich Feb. 3, 2022, 2:04 p.m. UTC | #37
On 03.02.2022 14:30, Oleksandr Andrushchenko wrote:
> 
> 
> On 03.02.22 14:54, Jan Beulich wrote:
>> On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
>>>>> Also memory decoding needs to be initially disabled when used by
>>>>> guests, in order to prevent the BAR being placed on top of a RAM
>>>>> region. The guest physmap will be different from the host one, so it's
>>>>> possible for BARs to end up placed on top of RAM regions initially
>>>>> until the firmware or OS places them at a suitable address.
>>>> Agree, memory decoding must be disabled
>>> Isn't it already achieved by the toolstack resetting the PCI device
>>> while assigning  it to a guest?
>> Iirc the tool stack would reset a device only after getting it back from
>> a DomU. When coming straight from Dom0 or DomIO, no reset would be
>> performed. Furthermore, (again iirc) there are cases where there's no
>> known (standard) way to reset a device. Assigning such to a guest when
>> it previously was owned by another one is risky (and hence needs an
>> admin knowing what they're doing), but may be acceptable in particular
>> when e.g. simply rebooting a guest.
>>
>> IOW - I don't think you can rely on the bit being in a particular state.
> So, you mean something like:

Perhaps, but then I think ...

> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -808,6 +808,14 @@ static int init_bars(struct pci_dev *pdev)
>               return rc;
>       }
> 
> +    /*
> +     * Memory decoding needs to be initially disabled when used by
> +     * guests, in order to prevent the BAR being placed on top of a RAM
> +     * region.
> +     */
> +    if ( !is_hwdom )
> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
> +
>       return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;

... you also want to update cmd, thus avoiding the call to modify_bars().

And btw, from an abstract pov the same is true for I/O decoding: I
realize that you mean to leave I/O port BARs aside for the moment, but I
think the command register handling could very well take care of both.

Which quickly gets us to the bus master enable bit: I think that one
should initially be off too. Making me wonder: Doesn't the PCI spec
define what the reset state of this register is? If so, that's what I
think we want to put in place for DomU-s.

Jan
Roger Pau Monne Feb. 3, 2022, 2:05 p.m. UTC | #38
On Thu, Feb 03, 2022 at 01:30:26PM +0000, Oleksandr Andrushchenko wrote:
> 
> 
> On 03.02.22 14:54, Jan Beulich wrote:
> > On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
> >>>> Also memory decoding needs to be initially disabled when used by
> >>>> guests, in order to prevent the BAR being placed on top of a RAM
> >>>> region. The guest physmap will be different from the host one, so it's
> >>>> possible for BARs to end up placed on top of RAM regions initially
> >>>> until the firmware or OS places them at a suitable address.
> >>> Agree, memory decoding must be disabled
> >> Isn't it already achieved by the toolstack resetting the PCI device
> >> while assigning  it to a guest?
> > Iirc the tool stack would reset a device only after getting it back from
> > a DomU. When coming straight from Dom0 or DomIO, no reset would be
> > performed. Furthermore, (again iirc) there are cases where there's no
> > known (standard) way to reset a device. Assigning such to a guest when
> > it previously was owned by another one is risky (and hence needs an
> > admin knowing what they're doing), but may be acceptable in particular
> > when e.g. simply rebooting a guest.
> >
> > IOW - I don't think you can rely on the bit being in a particular state.
> So, you mean something like:
> 
> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
> index 7695158e6445..9ebd57472da8 100644
> --- a/xen/drivers/vpci/header.c
> +++ b/xen/drivers/vpci/header.c
> @@ -808,6 +808,14 @@ static int init_bars(struct pci_dev *pdev)
>               return rc;
>       }
> 
> +    /*
> +     * Memory decoding needs to be initially disabled when used by
> +     * guests, in order to prevent the BAR being placed on top of a RAM
> +     * region.
> +     */
> +    if ( !is_hwdom )
> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

Memory decoding is already disabled here, so you just need to avoid
enabling it, for example:

    /*
     * Memory decoding needs to be initially disabled when used by
     * guests, in order to prevent the BARs being mapped at gfn 0 by
     * default.
     */
    if ( !is_hwdom )
        cmd &= ~PCI_COMMAND_MEMORY;

>       return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;

This is important here because guest_reg won't be set (ie: will be set
to 0) so if for some reason memory decoding was enabled you would end
up with BARs mappings overlapping at gfn 0.

Thanks, Roger.
Oleksandr Andrushchenko Feb. 3, 2022, 2:19 p.m. UTC | #39
On 03.02.22 16:04, Jan Beulich wrote:
> On 03.02.2022 14:30, Oleksandr Andrushchenko wrote:
>>
>> On 03.02.22 14:54, Jan Beulich wrote:
>>> On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
>>>>>> Also memory decoding needs to be initially disabled when used by
>>>>>> guests, in order to prevent the BAR being placed on top of a RAM
>>>>>> region. The guest physmap will be different from the host one, so it's
>>>>>> possible for BARs to end up placed on top of RAM regions initially
>>>>>> until the firmware or OS places them at a suitable address.
>>>>> Agree, memory decoding must be disabled
>>>> Isn't it already achieved by the toolstack resetting the PCI device
>>>> while assigning  it to a guest?
>>> Iirc the tool stack would reset a device only after getting it back from
>>> a DomU. When coming straight from Dom0 or DomIO, no reset would be
>>> performed. Furthermore, (again iirc) there are cases where there's no
>>> known (standard) way to reset a device. Assigning such to a guest when
>>> it previously was owned by another one is risky (and hence needs an
>>> admin knowing what they're doing), but may be acceptable in particular
>>> when e.g. simply rebooting a guest.
>>>
>>> IOW - I don't think you can rely on the bit being in a particular state.
>> So, you mean something like:
> Perhaps, but then I think ...
>
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -808,6 +808,14 @@ static int init_bars(struct pci_dev *pdev)
>>                return rc;
>>        }
>>
>> +    /*
>> +     * Memory decoding needs to be initially disabled when used by
>> +     * guests, in order to prevent the BAR being placed on top of a RAM
>> +     * region.
>> +     */
>> +    if ( !is_hwdom )
>> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
>> +
>>        return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
> ... you also want to update cmd, thus avoiding the call to modify_bars().
>
> And btw, from an abstract pov the same is true for I/O decoding: I
> realize that you mean to leave I/O port BARs aside for the moment, but I
> think the command register handling could very well take care of both.
>
> Which quickly gets us to the bus master enable bit: I think that one
> should initially be off too. Making me wonder: Doesn't the PCI spec
> define what the reset state of this register is? If so, that's what I
> think we want to put in place for DomU-s.
The spec I have says that all bits are typically 0 after reset.
So, it seems to be reasonable to just write 0 to PCI_COMMAND
> Jan
>
Thank you,
Oleksandr
Oleksandr Andrushchenko Feb. 3, 2022, 2:26 p.m. UTC | #40
On 03.02.22 16:05, Roger Pau Monné wrote:
> On Thu, Feb 03, 2022 at 01:30:26PM +0000, Oleksandr Andrushchenko wrote:
>>
>> On 03.02.22 14:54, Jan Beulich wrote:
>>> On 03.02.2022 13:45, Oleksandr Andrushchenko wrote:
>>>>>> Also memory decoding needs to be initially disabled when used by
>>>>>> guests, in order to prevent the BAR being placed on top of a RAM
>>>>>> region. The guest physmap will be different from the host one, so it's
>>>>>> possible for BARs to end up placed on top of RAM regions initially
>>>>>> until the firmware or OS places them at a suitable address.
>>>>> Agree, memory decoding must be disabled
>>>> Isn't it already achieved by the toolstack resetting the PCI device
>>>> while assigning  it to a guest?
>>> Iirc the tool stack would reset a device only after getting it back from
>>> a DomU. When coming straight from Dom0 or DomIO, no reset would be
>>> performed. Furthermore, (again iirc) there are cases where there's no
>>> known (standard) way to reset a device. Assigning such to a guest when
>>> it previously was owned by another one is risky (and hence needs an
>>> admin knowing what they're doing), but may be acceptable in particular
>>> when e.g. simply rebooting a guest.
>>>
>>> IOW - I don't think you can rely on the bit being in a particular state.
>> So, you mean something like:
>>
>> diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
>> index 7695158e6445..9ebd57472da8 100644
>> --- a/xen/drivers/vpci/header.c
>> +++ b/xen/drivers/vpci/header.c
>> @@ -808,6 +808,14 @@ static int init_bars(struct pci_dev *pdev)
>>                return rc;
>>        }
>>
>> +    /*
>> +     * Memory decoding needs to be initially disabled when used by
>> +     * guests, in order to prevent the BAR being placed on top of a RAM
>> +     * region.
>> +     */
>> +    if ( !is_hwdom )
>> +        pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);
> Memory decoding is already disabled here, so you just need to avoid
> enabling it, for example:
>
>      /*
>       * Memory decoding needs to be initially disabled when used by
>       * guests, in order to prevent the BARs being mapped at gfn 0 by
>       * default.
>       */
>      if ( !is_hwdom )
>          cmd &= ~PCI_COMMAND_MEMORY;
>
>>        return (cmd & PCI_COMMAND_MEMORY) ? modify_bars(pdev, cmd, false) : 0;
> This is important here because guest_reg won't be set (ie: will be set
> to 0) so if for some reason memory decoding was enabled you would end
> up with BARs mappings overlapping at gfn 0.
Then the patch [1] will do what we need to be done for the guest I guess
I am thinking to still have it in the series which will solve exactly the problem
we are trying to solve
>
> Thanks, Roger.
[1] https://patchwork.kernel.org/project/xen-devel/patch/20211125110251.2877218-11-andr2000@gmail.com/
diff mbox series

Patch

diff --git a/xen/drivers/vpci/header.c b/xen/drivers/vpci/header.c
index ba333fb2f9b0..8880d34ebf8e 100644
--- a/xen/drivers/vpci/header.c
+++ b/xen/drivers/vpci/header.c
@@ -433,6 +433,48 @@  static void bar_write(const struct pci_dev *pdev, unsigned int reg,
     pci_conf_write32(pdev->sbdf, reg, val);
 }
 
+static void guest_bar_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+    struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+    else
+    {
+        val &= PCI_BASE_ADDRESS_MEM_MASK;
+        val |= bar->type == VPCI_BAR_MEM32 ? PCI_BASE_ADDRESS_MEM_TYPE_32
+                                           : PCI_BASE_ADDRESS_MEM_TYPE_64;
+        val |= bar->prefetchable ? PCI_BASE_ADDRESS_MEM_PREFETCH : 0;
+    }
+
+    bar->guest_reg &= ~(0xffffffffull << (hi ? 32 : 0));
+    bar->guest_reg |= (uint64_t)val << (hi ? 32 : 0);
+
+    bar->guest_reg &= ~(bar->size - 1) | ~PCI_BASE_ADDRESS_MEM_MASK;
+}
+
+static uint32_t guest_bar_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    const struct vpci_bar *bar = data;
+    bool hi = false;
+
+    if ( bar->type == VPCI_BAR_MEM64_HI )
+    {
+        ASSERT(reg > PCI_BASE_ADDRESS_0);
+        bar--;
+        hi = true;
+    }
+
+    return bar->guest_reg >> (hi ? 32 : 0);
+}
+
 static void rom_write(const struct pci_dev *pdev, unsigned int reg,
                       uint32_t val, void *data)
 {
@@ -481,6 +523,17 @@  static void rom_write(const struct pci_dev *pdev, unsigned int reg,
         rom->addr = val & PCI_ROM_ADDRESS_MASK;
 }
 
+static void guest_rom_write(const struct pci_dev *pdev, unsigned int reg,
+                            uint32_t val, void *data)
+{
+}
+
+static uint32_t guest_rom_read(const struct pci_dev *pdev, unsigned int reg,
+                               void *data)
+{
+    return 0xffffffff;
+}
+
 static int init_bars(struct pci_dev *pdev)
 {
     uint16_t cmd;
@@ -489,6 +542,7 @@  static int init_bars(struct pci_dev *pdev)
     struct vpci_header *header = &pdev->vpci->header;
     struct vpci_bar *bars = header->bars;
     int rc;
+    bool is_hwdom = is_hardware_domain(pdev->domain);
 
     switch ( pci_conf_read8(pdev->sbdf, PCI_HEADER_TYPE) & 0x7f )
     {
@@ -528,8 +582,10 @@  static int init_bars(struct pci_dev *pdev)
         if ( i && bars[i - 1].type == VPCI_BAR_MEM64_LO )
         {
             bars[i].type = VPCI_BAR_MEM64_HI;
-            rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg,
-                                   4, &bars[i]);
+            rc = vpci_add_register(pdev->vpci,
+                                   is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                                   is_hwdom ? bar_write : guest_bar_write,
+                                   reg, 4, &bars[i]);
             if ( rc )
             {
                 pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -569,8 +625,10 @@  static int init_bars(struct pci_dev *pdev)
         bars[i].size = size;
         bars[i].prefetchable = val & PCI_BASE_ADDRESS_MEM_PREFETCH;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, bar_write, reg, 4,
-                               &bars[i]);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_bar_read,
+                               is_hwdom ? bar_write : guest_bar_write,
+                               reg, 4, &bars[i]);
         if ( rc )
         {
             pci_conf_write16(pdev->sbdf, PCI_COMMAND, cmd);
@@ -590,8 +648,10 @@  static int init_bars(struct pci_dev *pdev)
         header->rom_enabled = pci_conf_read32(pdev->sbdf, rom_reg) &
                               PCI_ROM_ADDRESS_ENABLE;
 
-        rc = vpci_add_register(pdev->vpci, vpci_hw_read32, rom_write, rom_reg,
-                               4, rom);
+        rc = vpci_add_register(pdev->vpci,
+                               is_hwdom ? vpci_hw_read32 : guest_rom_read,
+                               is_hwdom ? rom_write : guest_rom_write,
+                               rom_reg, 4, rom);
         if ( rc )
             rom->type = VPCI_BAR_EMPTY;
     }
diff --git a/xen/include/xen/vpci.h b/xen/include/xen/vpci.h
index ed127a08a953..0a73b14a92dc 100644
--- a/xen/include/xen/vpci.h
+++ b/xen/include/xen/vpci.h
@@ -68,7 +68,10 @@  struct vpci {
     struct vpci_header {
         /* Information about the PCI BARs of this device. */
         struct vpci_bar {
+            /* Physical view of the BAR. */
             uint64_t addr;
+            /* Guest view of the BAR: address and lower bits. */
+            uint64_t guest_reg;
             uint64_t size;
             enum {
                 VPCI_BAR_EMPTY,