[kernel,RFC,0/4] powerpc/powenv/ioda: Allow huge DMA window at 4GB

Message ID: 20191202015953.127902-1-aik@ozlabs.ru (mailing list archive)
Return-Path: <SRS0=t74V=ZY=vger.kernel.org=kvm-owner@kernel.org>
From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: linuxppc-dev@lists.ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>, kvm@vger.kernel.org,
 Alistair Popple <alistair@popple.id.au>,
 Alex Williamson <alex.williamson@redhat.com>,
 Oliver O'Halloran <oohall@gmail.com>,
 Alexey Kardashevskiy <aik@ozlabs.ru>
Subject: [PATCH kernel RFC 0/4] powerpc/powenv/ioda: Allow huge DMA window at 4GB
Date: Mon, 2 Dec 2019 12:59:49 +1100
Message-Id: <20191202015953.127902-1-aik@ozlabs.ru>
Sender: kvm-owner@vger.kernel.org
Precedence: bulk

Series: powerpc/powenv/ioda: Allow huge DMA window at 4GB
  [kernel,RFC,0/4] powerpc/powenv/ioda: Allow huge DMA window at 4GB
  [kernel,RFC,1/4] powerpc/powernv/ioda: Rework for huge DMA window at 4GB
  [kernel,RFC,2/4] powerpc/powernv/ioda: Allow smaller TCE table levels
  [kernel,RFC,3/4] powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
  [kernel,RFC,4/4] vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

Alexey Kardashevskiy Dec. 2, 2019, 1:59 a.m. UTC

Here is an attempt to support a bigger DMA space for devices
supporting DMA masks of less than 59 bits (GPUs come to mind
first). POWER9 PHBs have an option to map 2 windows at 0
and select a window based on the DMA address being below or above
4GB.
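
For illustration only (this is not code from the series): a minimal C
sketch of the selection rule described above, assuming window 0 serves
DMA addresses below 4GB and window 1 serves addresses at or above 4GB.
The helper name dma_window_for_addr and the DMA_4GB constant are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The 4GB boundary used to choose between the two windows. */
#define DMA_4GB (1ULL << 32)

/*
 * Hypothetical helper, not taken from the patches: return the index
 * (0 or 1) of the window that would handle a given DMA address when
 * the PHB selects a window by comparing the address with 4GB.
 */
static int dma_window_for_addr(uint64_t dma_addr)
{
	return dma_addr < DMA_4GB ? 0 : 1;
}
```

Under this assumption, 0xffffffff falls into window 0 and 0x100000000
into window 1.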

This adds the "iommu=iommu_bypass" kernel parameter and
supports the VFIO+pseries machine - currently this requires telling
upstream+unmodified QEMU about this via
-global spapr-pci-host-bridge.dma64_win_addr=0x100000000
or a per-phb property. 4/4 advertises the new option but
there is no automation around it in QEMU (should there be?).

For now it is either 1<<59 or 4GB mode; dynamic switching is
not supported (could be via sysfs).
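
To make the difference between the two modes concrete, here is a small
C sketch (an illustration, not part of the patches) that checks whether
a window start address fits under a device's DMA mask. The names
dma_mask, window_reachable, WIN_START_HIGH and WIN_START_4GB are
hypothetical; the only inputs taken from this letter are the two window
locations, 1<<59 and 4GB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WIN_START_HIGH (1ULL << 59)	/* the 1<<59 mode */
#define WIN_START_4GB  (1ULL << 32)	/* the 4GB mode */

/* Build an all-ones mask covering the low "bits" address bits. */
static uint64_t dma_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/*
 * Hypothetical check: a device can only use a window if the first
 * address of that window is expressible within its DMA mask.
 */
static bool window_reachable(unsigned int mask_bits, uint64_t win_start)
{
	return win_start <= dma_mask(mask_bits);
}
```

With these definitions a device limited to a 40-bit mask cannot reach a
window starting at 1<<59 but can reach one starting at 4GB.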

This is based on sha1
a6ed68d6468b Linus Torvalds "Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm".

Please comment. Thanks.



Alexey Kardashevskiy (4):
  powerpc/powernv/ioda: Rework for huge DMA window at 4GB
  powerpc/powernv/ioda: Allow smaller TCE table levels
  powerpc/powernv/phb4: Add 4GB IOMMU bypass mode
  vfio/spapr_tce: Advertise and allow a huge DMA windows at 4GB

 arch/powerpc/include/asm/iommu.h              |   1 +
 arch/powerpc/include/asm/opal-api.h           |  11 +-
 arch/powerpc/include/asm/opal.h               |   2 +
 arch/powerpc/platforms/powernv/pci.h          |   1 +
 include/uapi/linux/vfio.h                     |   2 +
 arch/powerpc/platforms/powernv/opal-call.c    |   2 +
 arch/powerpc/platforms/powernv/pci-ioda-tce.c |   4 +-
 arch/powerpc/platforms/powernv/pci-ioda.c     | 219 ++++++++++++++----
 drivers/vfio/vfio_iommu_spapr_tce.c           |  10 +-
 9 files changed, 202 insertions(+), 50 deletions(-)

Alistair Popple Dec. 2, 2019, 5:36 a.m. UTC | #1

On Monday, 2 December 2019 12:59:49 PM AEDT Alexey Kardashevskiy wrote:
> Here is an attempt to support a bigger DMA space for devices
> supporting DMA masks of less than 59 bits (GPUs come to mind
> first). POWER9 PHBs have an option to map 2 windows at 0
> and select a window based on the DMA address being below or above
> 4GB.
> 
> This adds the "iommu=iommu_bypass" kernel parameter and

Would it be possible to just enable this by default if the platform supports
it? Are there any downsides? Adding it as an option seems like it would make
things harder to support and reduce the amount of testing/use it would get.

> supports the VFIO+pseries machine - currently this requires telling
> upstream+unmodified QEMU about this via
> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
> or a per-phb property. 4/4 advertises the new option but
> there is no automation around it in QEMU (should there be?).
> 
> For now it is either 1<<59 or 4GB mode; dynamic switching is
> not supported (could be via sysfs).
> 
> This is based on sha1
> a6ed68d6468b Linus Torvalds "Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm".

Are you sure? I am getting the following rejected hunk trying to apply the 
first patch in the series:

--- arch/powerpc/platforms/powernv/pci-ioda.c
+++ arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2349,15 +2349,10 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
                pe->tce_bypass_enabled = enable;
 }
 
-static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
-               int num, __u32 page_shift, __u64 window_size, __u32 levels,
+static long pnv_pci_ioda2_create_table(int nid, int num, __u64 bus_offset,
+               __u32 page_shift, __u64 window_size, __u32 levels,
                bool alloc_userspace_copy, struct iommu_table **ptbl)
 {
-       struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
-                       table_group);
-       int nid = pe->phb->hose->node;
-       __u64 bus_offset = num ?
-               pe->table_group.tce64_start : table_group->tce32_start;
        long ret;
        struct iommu_table *tbl;

- Alistair
 

Alexey Kardashevskiy Dec. 2, 2019, 5:58 a.m. UTC | #2

On 02/12/2019 16:36, Alistair Popple wrote:
> On Monday, 2 December 2019 12:59:49 PM AEDT Alexey Kardashevskiy wrote:
>> Here is an attempt to support a bigger DMA space for devices
>> supporting DMA masks of less than 59 bits (GPUs come to mind
>> first). POWER9 PHBs have an option to map 2 windows at 0
>> and select a window based on the DMA address being below or above
>> 4GB.
>>
>> This adds the "iommu=iommu_bypass" kernel parameter and
> 
> Would it be possible to just enable this by default if the platform supports 
> it? Are there any downsides?

It changes the second DMA window location which is now assumed by QEMU
to be at 0x800.0000.0000.0000 and I do not see an easy way to work
around this.

For example, we start QEMU without VFIO but with an emulated XHCI which
will ask for DDW; we (QEMU) have to pick a window location but then we
have to stick to it, and if a user later hotplugs a VFIO-PCI device, the
physical IOMMU has to support the previously selected DMA window
address; otherwise the hotplug is going to fail.
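
As a rough illustration of that constraint (a hypothetical sketch, not
QEMU or kernel code), the check at hotplug time amounts to asking
whether the host IOMMU offers the window start the guest was already
given. The struct host_iommu_caps and the function hotplug_allowed are
invented names for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the window starts a host IOMMU supports. */
struct host_iommu_caps {
	const uint64_t *win_starts;
	size_t n;
};

/*
 * Hotplug can only succeed if the window location already chosen for
 * the guest is one the physical IOMMU can provide.
 */
static bool hotplug_allowed(uint64_t guest_win_start,
			    const struct host_iommu_caps *caps)
{
	for (size_t i = 0; i < caps->n; i++) {
		if (caps->win_starts[i] == guest_win_start)
			return true;
	}
	return false;
}
```

For example, if the guest window was placed at 0x100000000 but the host
only supports 1<<59, the check fails and so would the hotplug.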

The question is how to tell QEMU about this new offset and what we do
about migration from P8 (which, let's say, did have a VFIO device which we
unplugged before the migration) to P9 with the prospect of hotplugging a
VFIO device, but this time with this GTE4GB bit set.


> Adding it as an option seems like it would make 
> things harder to support and reduces the amount of testing/use it would get.

Yeah, this is why this is an RFC...


>> supports the VFIO+pseries machine - currently this requires telling
>> upstream+unmodified QEMU about this via
>> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
>> or a per-phb property. 4/4 advertises the new option but
>> there is no automation around it in QEMU (should there be?).
>>
>> For now it is either 1<<59 or 4GB mode; dynamic switching is
>> not supported (could be via sysfs).
>>
>> This is based on sha1
>> a6ed68d6468b Linus Torvalds "Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm".
> 
> Are you sure?

Almost. It should have been HEAD^^^^^..HEAD instead of HEAD^^^^..HEAD :)

I've posted 00/4 to the thread now, sorry about that. Thanks,


> I am getting the following rejected hunk trying to apply the 
> first patch in the series:
> 
> --- arch/powerpc/platforms/powernv/pci-ioda.c
> +++ arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2349,15 +2349,10 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>                 pe->tce_bypass_enabled = enable;
>  }
>  
> -static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group,
> -               int num, __u32 page_shift, __u64 window_size, __u32 levels,
> +static long pnv_pci_ioda2_create_table(int nid, int num, __u64 bus_offset,
> +               __u32 page_shift, __u64 window_size, __u32 levels,
>                 bool alloc_userspace_copy, struct iommu_table **ptbl)
>  {
> -       struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> -                       table_group);
> -       int nid = pe->phb->hose->node;
> -       __u64 bus_offset = num ?
> -               pe->table_group.tce64_start : table_group->tce32_start;
>         long ret;
>         struct iommu_table *tbl;
> 
> - Alistair
>  

Alexey Kardashevskiy Jan. 10, 2020, 4:18 a.m. UTC | #3

On 02/12/2019 12:59, Alexey Kardashevskiy wrote:
> Here is an attempt to support a bigger DMA space for devices
> supporting DMA masks of less than 59 bits (GPUs come to mind
> first). POWER9 PHBs have an option to map 2 windows at 0
> and select a window based on the DMA address being below or above
> 4GB.
> 
> This adds the "iommu=iommu_bypass" kernel parameter and
> supports the VFIO+pseries machine - currently this requires telling
> upstream+unmodified QEMU about this via
> -global spapr-pci-host-bridge.dma64_win_addr=0x100000000
> or a per-phb property. 4/4 advertises the new option but
> there is no automation around it in QEMU (should there be?).
> 
> For now it is either 1<<59 or 4GB mode; dynamic switching is
> not supported (could be via sysfs).
> 
> This is based on sha1
> a6ed68d6468b Linus Torvalds "Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm".
> 
> Please comment. Thanks.


David, Alistair, ping? Thanks,



Alexey Kardashevskiy Jan. 23, 2020, 12:53 a.m. UTC | #4

Anyone, ping?



David Gibson Jan. 23, 2020, 1:17 a.m. UTC | #5

On Thu, Jan 23, 2020 at 11:53:32AM +1100, Alexey Kardashevskiy wrote:
> Anyone, ping?

Sorry, I've totally lost track of this one.  I think you'll need to
repost.



Alexey Kardashevskiy Jan. 23, 2020, 8:42 a.m. UTC | #6

On 23/01/2020 12:17, David Gibson wrote:
> On Thu, Jan 23, 2020 at 11:53:32AM +1100, Alexey Kardashevskiy wrote:
>> Anyone, ping?
> 
> Sorry, I've totally lost track of this one.  I think you'll need to
> repost.



It has not changed and still applies, and the question is more about how
we proceed with this feature than the patches themselves. Or is it just
not in your mailbox anymore so you cannot reply? :)



