mbox series

[00/11] of: Fix DMA configuration for non-DT masters

Message ID 20190924181244.7159-1-nsaenzjulienne@suse.de (mailing list archive)
Headers show
Series of: Fix DMA configuration for non-DT masters | expand

Message

Nicolas Saenz Julienne Sept. 24, 2019, 6:12 p.m. UTC
Hi All,
this series tries to address one of the issues blocking us from
upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
devices not represented in DT which sit behind a PCI bus fail to get the
bus' DMA addressing constraints.

This is due to the fact that of_dma_configure() assumes it's receiving a
DT node representing the device being configured, as opposed to the PCIe
bridge node we currently pass. This causes the code to directly jump
into PCI's parent node when checking for 'dma-ranges' and misses
whatever was set there.

To address this I create a new API in OF - inspired from Robin Murphys
original proposal[2] - which accepts a bus DT node as it's input in
order to configure a device's DMA constraints. The changes go deep into
of/address.c's implementation, as a device being having a DT node
assumption was pretty strong.

On top of this work, I also cleaned up of_dma_configure() removing its
redundant arguments and creating an alternative function for the special cases
not applicable to either the above case or the default usage.

IMO the resulting functions are more explicit. They will probably
surface some hacky usages that can be properly fixed as I show with the
DT fixes on the Layerscape platform.

This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
on a Seattle AMD board.

Regards,
Nicolas

[1] https://patchwork.kernel.org/patch/9650345/#20294961
[2] https://patchwork.kernel.org/patch/9650345/

---

Nicolas Saenz Julienne (11):
  of: address: clean-up unused variable in of_dma_get_range()
  of: base: introduce __of_n_*_cells_parent()
  of: address: use parent DT node in bus->count_cells()
  of: address: introduce of_translate_dma_address_parent()
  of: expose __of_get_dma_parent() to OF subsystem
  of: address: use parent OF node in of_dma_get_range()
  dts: arm64: layerscape: add dma-ranges property to qoric-mc node
  dts: arm64: layerscape: add dma-ranges property to pcie nodes
  of: device: remove comment in of_dma_configure()
  of: device: introduce of_dma_configure_parent()
  of: simplify of_dma_config()'s arguments

 .../arm64/boot/dts/freescale/fsl-ls1088a.dtsi |   1 +
 .../arm64/boot/dts/freescale/fsl-ls208xa.dtsi |   5 +
 .../arm64/boot/dts/freescale/fsl-lx2160a.dtsi |   1 +
 drivers/base/platform.c                       |   2 +-
 drivers/bcma/main.c                           |   2 +-
 drivers/bus/fsl-mc/fsl-mc-bus.c               |   2 +-
 drivers/dma/qcom/hidma_mgmt.c                 |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_drv.c         |   2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c         |   2 +-
 drivers/gpu/drm/sun4i/sun4i_backend.c         |   2 +-
 drivers/gpu/drm/xen/xen_drm_front.c           |   2 +-
 drivers/gpu/host1x/bus.c                      |   4 +-
 drivers/media/platform/qcom/venus/firmware.c  |   2 +-
 drivers/media/platform/s5p-mfc/s5p_mfc.c      |   2 +-
 drivers/of/address.c                          | 136 +++++++++---------
 drivers/of/base.c                             |  69 +++++++--
 drivers/of/device.c                           |  59 +++++++-
 drivers/of/of_private.h                       |   5 +
 drivers/pci/pci-driver.c                      |   3 +-
 drivers/xen/gntdev.c                          |   2 +-
 include/linux/of_address.h                    |   8 +-
 include/linux/of_device.h                     |  23 ++-
 22 files changed, 223 insertions(+), 113 deletions(-)

Comments

Nicolas Saenz Julienne Sept. 25, 2019, 2:52 p.m. UTC | #1
On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
> On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de> wrote:
> > Hi All,
> > this series tries to address one of the issues blocking us from
> > upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
> > devices not represented in DT which sit behind a PCI bus fail to get the
> > bus' DMA addressing constraints.
> > 
> > This is due to the fact that of_dma_configure() assumes it's receiving a
> > DT node representing the device being configured, as opposed to the PCIe
> > bridge node we currently pass. This causes the code to directly jump
> > into PCI's parent node when checking for 'dma-ranges' and misses
> > whatever was set there.
> > 
> > To address this I create a new API in OF - inspired from Robin Murphys
> > original proposal[2] - which accepts a bus DT node as it's input in
> > order to configure a device's DMA constraints. The changes go deep into
> > of/address.c's implementation, as a device being having a DT node
> > assumption was pretty strong.
> > 
> > On top of this work, I also cleaned up of_dma_configure() removing its
> > redundant arguments and creating an alternative function for the special
> > cases
> > not applicable to either the above case or the default usage.
> > 
> > IMO the resulting functions are more explicit. They will probably
> > surface some hacky usages that can be properly fixed as I show with the
> > DT fixes on the Layerscape platform.
> > 
> > This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
> > on a Seattle AMD board.
> 
> Humm, I've been working on this issue too. Looks similar though yours
> has a lot more churn and there's some other bugs I've found.

That's good news, and yes now that I see it, some stuff on my series is overly
complicated. Specially around of_translate_*().

On top of that, you removed in of_dma_get_range():

-	/*
-	 * At least empty ranges has to be defined for parent node if
-	 * DMA is supported
-	 */
-	if (!ranges)
-		break;

Which I assumed was bound to the standard and makes things easier.

> Can you test out this branch[1]. I don't have any h/w needing this,
> but wrote a unittest and tested with modified QEMU.

I reviewed everything, I did find a minor issue, see the patch attached.

Also I tested your branch both on an RPi4, with a PCI device that depends on
these changes and by comparing the OF debugs logs on a Layerscape board which
uses dma-ranges, dma-coherent and IOMMU. All works as expected.

Will you send this series for v5.5? Please keep me in the loop, I'll review and
test the final version.

Regards,
Nicolas
Robin Murphy Sept. 25, 2019, 3:09 p.m. UTC | #2
On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:
> On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
>> On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
>> <nsaenzjulienne@suse.de> wrote:
>>> Hi All,
>>> this series tries to address one of the issues blocking us from
>>> upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
>>> devices not represented in DT which sit behind a PCI bus fail to get the
>>> bus' DMA addressing constraints.
>>>
>>> This is due to the fact that of_dma_configure() assumes it's receiving a
>>> DT node representing the device being configured, as opposed to the PCIe
>>> bridge node we currently pass. This causes the code to directly jump
>>> into PCI's parent node when checking for 'dma-ranges' and misses
>>> whatever was set there.
>>>
>>> To address this I create a new API in OF - inspired from Robin Murphys
>>> original proposal[2] - which accepts a bus DT node as it's input in
>>> order to configure a device's DMA constraints. The changes go deep into
>>> of/address.c's implementation, as a device being having a DT node
>>> assumption was pretty strong.
>>>
>>> On top of this work, I also cleaned up of_dma_configure() removing its
>>> redundant arguments and creating an alternative function for the special
>>> cases
>>> not applicable to either the above case or the default usage.
>>>
>>> IMO the resulting functions are more explicit. They will probably
>>> surface some hacky usages that can be properly fixed as I show with the
>>> DT fixes on the Layerscape platform.
>>>
>>> This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
>>> on a Seattle AMD board.
>>
>> Humm, I've been working on this issue too. Looks similar though yours
>> has a lot more churn and there's some other bugs I've found.
> 
> That's good news, and yes now that I see it, some stuff on my series is overly
> complicated. Specially around of_translate_*().
> 
> On top of that, you removed in of_dma_get_range():
> 
> -	/*
> -	 * At least empty ranges has to be defined for parent node if
> -	 * DMA is supported
> -	 */
> -	if (!ranges)
> -		break;
> 
> Which I assumed was bound to the standard and makes things easier.
> 
>> Can you test out this branch[1]. I don't have any h/w needing this,
>> but wrote a unittest and tested with modified QEMU.
> 
> I reviewed everything, I did find a minor issue, see the patch attached.

WRT that patch, the original intent of "force_dma" was purely to 
consider a device DMA-capable regardless of the presence of 
"dma-ranges". Expecting of_dma_configure() to do anything for a non-OF 
device has always been bogus - magic paravirt devices which appear out 
of nowhere and expect to be treated as genuine DMA masters are a 
separate problem that we haven't really approached yet.

Robin.
Nicolas Saenz Julienne Sept. 25, 2019, 3:30 p.m. UTC | #3
On Wed, 2019-09-25 at 16:09 +0100, Robin Murphy wrote:
> On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:
> > On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
> > > On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
> > > <nsaenzjulienne@suse.de> wrote:
> > > > Hi All,
> > > > this series tries to address one of the issues blocking us from
> > > > upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
> > > > devices not represented in DT which sit behind a PCI bus fail to get the
> > > > bus' DMA addressing constraints.
> > > > 
> > > > This is due to the fact that of_dma_configure() assumes it's receiving a
> > > > DT node representing the device being configured, as opposed to the PCIe
> > > > bridge node we currently pass. This causes the code to directly jump
> > > > into PCI's parent node when checking for 'dma-ranges' and misses
> > > > whatever was set there.
> > > > 
> > > > To address this I create a new API in OF - inspired from Robin Murphys
> > > > original proposal[2] - which accepts a bus DT node as it's input in
> > > > order to configure a device's DMA constraints. The changes go deep into
> > > > of/address.c's implementation, as a device being having a DT node
> > > > assumption was pretty strong.
> > > > 
> > > > On top of this work, I also cleaned up of_dma_configure() removing its
> > > > redundant arguments and creating an alternative function for the special
> > > > cases
> > > > not applicable to either the above case or the default usage.
> > > > 
> > > > IMO the resulting functions are more explicit. They will probably
> > > > surface some hacky usages that can be properly fixed as I show with the
> > > > DT fixes on the Layerscape platform.
> > > > 
> > > > This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
> > > > on a Seattle AMD board.
> > > 
> > > Humm, I've been working on this issue too. Looks similar though yours
> > > has a lot more churn and there's some other bugs I've found.
> > 
> > That's good news, and yes now that I see it, some stuff on my series is
> > overly
> > complicated. Specially around of_translate_*().
> > 
> > On top of that, you removed in of_dma_get_range():
> > 
> > -	/*
> > -	 * At least empty ranges has to be defined for parent node if
> > -	 * DMA is supported
> > -	 */
> > -	if (!ranges)
> > -		break;
> > 
> > Which I assumed was bound to the standard and makes things easier.
> > 
> > > Can you test out this branch[1]. I don't have any h/w needing this,
> > > but wrote a unittest and tested with modified QEMU.
> > 
> > I reviewed everything, I did find a minor issue, see the patch attached.
> 
> WRT that patch, the original intent of "force_dma" was purely to 
> consider a device DMA-capable regardless of the presence of 
> "dma-ranges". Expecting of_dma_configure() to do anything for a non-OF 
> device has always been bogus - magic paravirt devices which appear out 
> of nowhere and expect to be treated as genuine DMA masters are a 
> separate problem that we haven't really approached yet.

I agree it's clearly abusing the function. I have no problem with the behaviour
change if it's OK with you.

Robin, have you looked into supporting multiple dma-ranges? It's the next thing
we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in the
works already.

Regards,
Nicolas
Rob Herring Sept. 25, 2019, 4:07 p.m. UTC | #4
On Wed, Sep 25, 2019 at 9:53 AM Nicolas Saenz Julienne
<nsaenzjulienne@suse.de> wrote:
>
> On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
> > On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
> > <nsaenzjulienne@suse.de> wrote:
> > > Hi All,
> > > this series tries to address one of the issues blocking us from
> > > upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
> > > devices not represented in DT which sit behind a PCI bus fail to get the
> > > bus' DMA addressing constraints.
> > >
> > > This is due to the fact that of_dma_configure() assumes it's receiving a
> > > DT node representing the device being configured, as opposed to the PCIe
> > > bridge node we currently pass. This causes the code to directly jump
> > > into PCI's parent node when checking for 'dma-ranges' and misses
> > > whatever was set there.
> > >
> > > To address this I create a new API in OF - inspired from Robin Murphys
> > > original proposal[2] - which accepts a bus DT node as it's input in
> > > order to configure a device's DMA constraints. The changes go deep into
> > > of/address.c's implementation, as a device being having a DT node
> > > assumption was pretty strong.
> > >
> > > On top of this work, I also cleaned up of_dma_configure() removing its
> > > redundant arguments and creating an alternative function for the special
> > > cases
> > > not applicable to either the above case or the default usage.
> > >
> > > IMO the resulting functions are more explicit. They will probably
> > > surface some hacky usages that can be properly fixed as I show with the
> > > DT fixes on the Layerscape platform.
> > >
> > > This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
> > > on a Seattle AMD board.
> >
> > Humm, I've been working on this issue too. Looks similar though yours
> > has a lot more churn and there's some other bugs I've found.
>
> That's good news, and yes now that I see it, some stuff on my series is overly
> complicated. Specially around of_translate_*().
>
> On top of that, you removed in of_dma_get_range():
>
> -       /*
> -        * At least empty ranges has to be defined for parent node if
> -        * DMA is supported
> -        */
> -       if (!ranges)
> -               break;
>
> Which I assumed was bound to the standard and makes things easier.

The standard is whatever we say it is and what exists in the wild...

Probably better for me to get the series posted for context, but the
above is removed because we could be passing in the bus device/child
node and checking for 'dma-ranges' rather than only the bus node.
While this does mean 'dma-ranges' could be in a child node which is
wrong, it simplifies the only caller of_dma_configure(). And really,
there's no way to detect that error. Someone could call
of_dma_configure(NULL, child, ...). Perhaps we could assert that
'ranges' is present whenever 'dma-ranges' is.

Back to the standard, I think it can be summarized as a device's
immediate parent (a bus node) must contain 'dma-ranges'. All the
parent nodes of the bus node should also have 'dma-ranges', but
missing is treated as empty (1:1 translation). 'dma-ranges' missing in
all the parent nodes is also treated as 1:1 translation and no
addressing restrictions.

> > Can you test out this branch[1]. I don't have any h/w needing this,
> > but wrote a unittest and tested with modified QEMU.
>
> I reviewed everything, I did find a minor issue, see the patch attached.
>
> Also I tested your branch both on an RPi4, with a PCI device that depends on
> these changes and by comparing the OF debugs logs on a Layerscape board which
> uses dma-ranges, dma-coherent and IOMMU. All works as expected.
>
> Will you send this series for v5.5? Please keep me in the loop, I'll review and
> test the final version.

Yes, sending it out soon.

Rob
Rob Herring Sept. 25, 2019, 4:16 p.m. UTC | #5
On Wed, Sep 25, 2019 at 10:30 AM Nicolas Saenz Julienne
<nsaenzjulienne@suse.de> wrote:
>
> On Wed, 2019-09-25 at 16:09 +0100, Robin Murphy wrote:
> > On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:
> > > On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
> > > > On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
> > > > <nsaenzjulienne@suse.de> wrote:
> > > > > Hi All,
> > > > > this series tries to address one of the issues blocking us from
> > > > > upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
> > > > > devices not represented in DT which sit behind a PCI bus fail to get the
> > > > > bus' DMA addressing constraints.
> > > > >
> > > > > This is due to the fact that of_dma_configure() assumes it's receiving a
> > > > > DT node representing the device being configured, as opposed to the PCIe
> > > > > bridge node we currently pass. This causes the code to directly jump
> > > > > into PCI's parent node when checking for 'dma-ranges' and misses
> > > > > whatever was set there.
> > > > >
> > > > > To address this I create a new API in OF - inspired from Robin Murphys
> > > > > original proposal[2] - which accepts a bus DT node as it's input in
> > > > > order to configure a device's DMA constraints. The changes go deep into
> > > > > of/address.c's implementation, as a device being having a DT node
> > > > > assumption was pretty strong.
> > > > >
> > > > > On top of this work, I also cleaned up of_dma_configure() removing its
> > > > > redundant arguments and creating an alternative function for the special
> > > > > cases
> > > > > not applicable to either the above case or the default usage.
> > > > >
> > > > > IMO the resulting functions are more explicit. They will probably
> > > > > surface some hacky usages that can be properly fixed as I show with the
> > > > > DT fixes on the Layerscape platform.
> > > > >
> > > > > This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
> > > > > on a Seattle AMD board.
> > > >
> > > > Humm, I've been working on this issue too. Looks similar though yours
> > > > has a lot more churn and there's some other bugs I've found.
> > >
> > > That's good news, and yes now that I see it, some stuff on my series is
> > > overly
> > > complicated. Specially around of_translate_*().
> > >
> > > On top of that, you removed in of_dma_get_range():
> > >
> > > -   /*
> > > -    * At least empty ranges has to be defined for parent node if
> > > -    * DMA is supported
> > > -    */
> > > -   if (!ranges)
> > > -           break;
> > >
> > > Which I assumed was bound to the standard and makes things easier.
> > >
> > > > Can you test out this branch[1]. I don't have any h/w needing this,
> > > > but wrote a unittest and tested with modified QEMU.
> > >
> > > I reviewed everything, I did find a minor issue, see the patch attached.
> >
> > WRT that patch, the original intent of "force_dma" was purely to
> > consider a device DMA-capable regardless of the presence of
> > "dma-ranges". Expecting of_dma_configure() to do anything for a non-OF
> > device has always been bogus - magic paravirt devices which appear out
> > of nowhere and expect to be treated as genuine DMA masters are a
> > separate problem that we haven't really approached yet.
>
> I agree it's clearly abusing the function. I have no problem with the behaviour
> change if it's OK with you.
>
> Robin, have you looked into supporting multiple dma-ranges? It's the next thing
> we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in the
> works already.

Multiple dma-ranges as far as configuring inbound windows should work
already other than the bug when there's any parent translation. But if
you mean supporting multiple DMA offsets and masks per device in the
DMA API, there's nothing in the works yet.

Rob
Robin Murphy Sept. 25, 2019, 4:52 p.m. UTC | #6
On 25/09/2019 17:16, Rob Herring wrote:
> On Wed, Sep 25, 2019 at 10:30 AM Nicolas Saenz Julienne
> <nsaenzjulienne@suse.de> wrote:
>>
>> On Wed, 2019-09-25 at 16:09 +0100, Robin Murphy wrote:
>>> On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:
>>>> On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
>>>>> On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
>>>>> <nsaenzjulienne@suse.de> wrote:
>>>>>> Hi All,
>>>>>> this series tries to address one of the issues blocking us from
>>>>>> upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
>>>>>> devices not represented in DT which sit behind a PCI bus fail to get the
>>>>>> bus' DMA addressing constraints.
>>>>>>
>>>>>> This is due to the fact that of_dma_configure() assumes it's receiving a
>>>>>> DT node representing the device being configured, as opposed to the PCIe
>>>>>> bridge node we currently pass. This causes the code to directly jump
>>>>>> into PCI's parent node when checking for 'dma-ranges' and misses
>>>>>> whatever was set there.
>>>>>>
>>>>>> To address this I create a new API in OF - inspired from Robin Murphys
>>>>>> original proposal[2] - which accepts a bus DT node as it's input in
>>>>>> order to configure a device's DMA constraints. The changes go deep into
>>>>>> of/address.c's implementation, as a device being having a DT node
>>>>>> assumption was pretty strong.
>>>>>>
>>>>>> On top of this work, I also cleaned up of_dma_configure() removing its
>>>>>> redundant arguments and creating an alternative function for the special
>>>>>> cases
>>>>>> not applicable to either the above case or the default usage.
>>>>>>
>>>>>> IMO the resulting functions are more explicit. They will probably
>>>>>> surface some hacky usages that can be properly fixed as I show with the
>>>>>> DT fixes on the Layerscape platform.
>>>>>>
>>>>>> This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
>>>>>> on a Seattle AMD board.
>>>>>
>>>>> Humm, I've been working on this issue too. Looks similar though yours
>>>>> has a lot more churn and there's some other bugs I've found.
>>>>
>>>> That's good news, and yes now that I see it, some stuff on my series is
>>>> overly
>>>> complicated. Specially around of_translate_*().
>>>>
>>>> On top of that, you removed in of_dma_get_range():
>>>>
>>>> -   /*
>>>> -    * At least empty ranges has to be defined for parent node if
>>>> -    * DMA is supported
>>>> -    */
>>>> -   if (!ranges)
>>>> -           break;
>>>>
>>>> Which I assumed was bound to the standard and makes things easier.
>>>>
>>>>> Can you test out this branch[1]. I don't have any h/w needing this,
>>>>> but wrote a unittest and tested with modified QEMU.
>>>>
>>>> I reviewed everything, I did find a minor issue, see the patch attached.
>>>
>>> WRT that patch, the original intent of "force_dma" was purely to
>>> consider a device DMA-capable regardless of the presence of
>>> "dma-ranges". Expecting of_dma_configure() to do anything for a non-OF
>>> device has always been bogus - magic paravirt devices which appear out
>>> of nowhere and expect to be treated as genuine DMA masters are a
>>> separate problem that we haven't really approached yet.
>>
>> I agree it's clearly abusing the function. I have no problem with the behaviour
>> change if it's OK with you.

Thinking about it, you could probably just remove that call from the Xen 
DRM driver now anyway - since the dma-direct rework, we lost the ability 
to set dma_dummy_ops by default, and NULL ops now represent what it 
(presumably) wants.

>> Robin, have you looked into supporting multiple dma-ranges? It's the next thing
>> we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in the
>> works already.
> 
> Multiple dma-ranges as far as configuring inbound windows should work
> already other than the bug when there's any parent translation. But if
> you mean supporting multiple DMA offsets and masks per device in the
> DMA API, there's nothing in the works yet.

There's also the in-between step of making of_dma_get_range() return a 
size based on all the dma-ranges entries rather than only the first one 
- otherwise, something like [1] can lead to pretty unworkable default 
masks. We implemented that when doing acpi_dma_get_range(), it's just 
that the OF counterpart never caught up.

Robin.

[1] 
http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a2814af56b3486c2985a95540a88d8f9fa3a699f
Rob Herring Sept. 25, 2019, 9:33 p.m. UTC | #7
On Wed, Sep 25, 2019 at 11:52 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 25/09/2019 17:16, Rob Herring wrote:
> > On Wed, Sep 25, 2019 at 10:30 AM Nicolas Saenz Julienne
> > <nsaenzjulienne@suse.de> wrote:
> >>
> >> On Wed, 2019-09-25 at 16:09 +0100, Robin Murphy wrote:
> >>> On 25/09/2019 15:52, Nicolas Saenz Julienne wrote:
> >>>> On Tue, 2019-09-24 at 16:59 -0500, Rob Herring wrote:
> >>>>> On Tue, Sep 24, 2019 at 1:12 PM Nicolas Saenz Julienne
> >>>>> <nsaenzjulienne@suse.de> wrote:
> >>>>>> Hi All,
> >>>>>> this series tries to address one of the issues blocking us from
> >>>>>> upstreaming Broadcom's STB PCIe controller[1]. Namely, the fact that
> >>>>>> devices not represented in DT which sit behind a PCI bus fail to get the
> >>>>>> bus' DMA addressing constraints.
> >>>>>>
> >>>>>> This is due to the fact that of_dma_configure() assumes it's receiving a
> >>>>>> DT node representing the device being configured, as opposed to the PCIe
> >>>>>> bridge node we currently pass. This causes the code to directly jump
> >>>>>> into PCI's parent node when checking for 'dma-ranges' and misses
> >>>>>> whatever was set there.
> >>>>>>
> >>>>>> To address this I create a new API in OF - inspired from Robin Murphys
> >>>>>> original proposal[2] - which accepts a bus DT node as it's input in
> >>>>>> order to configure a device's DMA constraints. The changes go deep into
> >>>>>> of/address.c's implementation, as a device being having a DT node
> >>>>>> assumption was pretty strong.
> >>>>>>
> >>>>>> On top of this work, I also cleaned up of_dma_configure() removing its
> >>>>>> redundant arguments and creating an alternative function for the special
> >>>>>> cases
> >>>>>> not applicable to either the above case or the default usage.
> >>>>>>
> >>>>>> IMO the resulting functions are more explicit. They will probably
> >>>>>> surface some hacky usages that can be properly fixed as I show with the
> >>>>>> DT fixes on the Layerscape platform.
> >>>>>>
> >>>>>> This was also tested on a Raspberry Pi 4 with a custom PCIe driver and
> >>>>>> on a Seattle AMD board.
> >>>>>
> >>>>> Humm, I've been working on this issue too. Looks similar though yours
> >>>>> has a lot more churn and there's some other bugs I've found.
> >>>>
> >>>> That's good news, and yes now that I see it, some stuff on my series is
> >>>> overly
> >>>> complicated. Specially around of_translate_*().
> >>>>
> >>>> On top of that, you removed in of_dma_get_range():
> >>>>
> >>>> -   /*
> >>>> -    * At least empty ranges has to be defined for parent node if
> >>>> -    * DMA is supported
> >>>> -    */
> >>>> -   if (!ranges)
> >>>> -           break;
> >>>>
> >>>> Which I assumed was bound to the standard and makes things easier.
> >>>>
> >>>>> Can you test out this branch[1]. I don't have any h/w needing this,
> >>>>> but wrote a unittest and tested with modified QEMU.
> >>>>
> >>>> I reviewed everything, I did find a minor issue, see the patch attached.
> >>>
> >>> WRT that patch, the original intent of "force_dma" was purely to
> >>> consider a device DMA-capable regardless of the presence of
> >>> "dma-ranges". Expecting of_dma_configure() to do anything for a non-OF
> >>> device has always been bogus - magic paravirt devices which appear out
> >>> of nowhere and expect to be treated as genuine DMA masters are a
> >>> separate problem that we haven't really approached yet.
> >>
> >> I agree it's clearly abusing the function. I have no problem with the behaviour
> >> change if it's OK with you.
>
> Thinking about it, you could probably just remove that call from the Xen
> DRM driver now anyway - since the dma-direct rework, we lost the ability
> to set dma_dummy_ops by default, and NULL ops now represent what it
> (presumably) wants.

Not xen_dma_ops? In any case, I'll send out a patch for the the Xen
folks to comment on.

> >> Robin, have you looked into supporting multiple dma-ranges? It's the next thing
> >> we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in the
> >> works already.
> >
> > Multiple dma-ranges as far as configuring inbound windows should work
> > already other than the bug when there's any parent translation. But if
> > you mean supporting multiple DMA offsets and masks per device in the
> > DMA API, there's nothing in the works yet.
>
> There's also the in-between step of making of_dma_get_range() return a
> size based on all the dma-ranges entries rather than only the first one
> - otherwise, something like [1] can lead to pretty unworkable default
> masks. We implemented that when doing acpi_dma_get_range(), it's just
> that the OF counterpart never caught up.

Right. I suppose we assume any holes in the ranges are addressable by
the device but won't get used for other reasons (such as no memory
there). However, to be correct, the range of the dma offset plus mask
would need to be within the min start and max end addresses. IOW,
while we need to round up (0xa_8000_0000 - 0x2c1c_0000) to the next
power of 2, the 'correct' thing to do is round down.

Rob

> [1]
> http://linux-arm.org/git?p=linux-rm.git;a=commitdiff;h=a2814af56b3486c2985a95540a88d8f9fa3a699f
Nicolas Saenz Julienne Sept. 26, 2019, 10:44 a.m. UTC | #8
> > > > Robin, have you looked into supporting multiple dma-ranges? It's the
> > > > next thing
> > > > we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in
> > > > the
> > > > works already.
> > > 
> > > Multiple dma-ranges as far as configuring inbound windows should work
> > > already other than the bug when there's any parent translation. But if
> > > you mean supporting multiple DMA offsets and masks per device in the
> > > DMA API, there's nothing in the works yet.

Sorry, I meant supporting multiple DMA offsets[1]. I think I could still make
it with a single DMA mask though.

> > There's also the in-between step of making of_dma_get_range() return a
> > size based on all the dma-ranges entries rather than only the first one
> > - otherwise, something like [1] can lead to pretty unworkable default
> > masks. We implemented that when doing acpi_dma_get_range(), it's just
> > that the OF counterpart never caught up.
> 
> Right. I suppose we assume any holes in the ranges are addressable by
> the device but won't get used for other reasons (such as no memory
> there). However, to be correct, the range of the dma offset plus mask
> would need to be within the min start and max end addresses. IOW,
> while we need to round up (0xa_8000_0000 - 0x2c1c_0000) to the next
> power of 2, the 'correct' thing to do is round down.

IIUC I also have this issue on my list. The RPi4 PCIe block has an integration
bug that only allows DMA to the lower 3GB. With dma-ranges of size 0xc000_0000
you get a 32bit DMA mask wich is not what you need. So far I faked it in the
device-tree but I guess it be better to add an extra check in
of_dma_configure(), decrease the mask and print some kind of warning stating
that DMA addressing is suboptimal.

Regards,
Nicolas

[1] https://lkml.org/lkml/2018/9/19/641
Robin Murphy Sept. 26, 2019, 11:20 a.m. UTC | #9
On 2019-09-26 11:44 am, Nicolas Saenz Julienne wrote:
>>>>> Robin, have you looked into supporting multiple dma-ranges? It's the
>>>>> next thing
>>>>> we need for BCM STB's PCIe. I'll have a go at it myself if nothing is in
>>>>> the
>>>>> works already.
>>>>
>>>> Multiple dma-ranges as far as configuring inbound windows should work
>>>> already other than the bug when there's any parent translation. But if
>>>> you mean supporting multiple DMA offsets and masks per device in the
>>>> DMA API, there's nothing in the works yet.
> 
> Sorry, I meant supporting multiple DMA offsets[1]. I think I could still make
> it with a single DMA mask though.

The main problem for supporting that case in general is the disgusting 
carving up of the physical memory map you may have to do to guarantee 
that a single buffer allocation cannot ever span two windows with 
different offsets. I don't think we ever reached a conclusion on whether 
that was even achievable in practice.

>>> There's also the in-between step of making of_dma_get_range() return a
>>> size based on all the dma-ranges entries rather than only the first one
>>> - otherwise, something like [1] can lead to pretty unworkable default
>>> masks. We implemented that when doing acpi_dma_get_range(), it's just
>>> that the OF counterpart never caught up.
>>
>> Right. I suppose we assume any holes in the ranges are addressable by
>> the device but won't get used for other reasons (such as no memory
>> there). However, to be correct, the range of the dma offset plus mask
>> would need to be within the min start and max end addresses. IOW,
>> while we need to round up (0xa_8000_0000 - 0x2c1c_0000) to the next
>> power of 2, the 'correct' thing to do is round down.
> 
> IIUC I also have this issue on my list. The RPi4 PCIe block has an integration
> bug that only allows DMA to the lower 3GB. With dma-ranges of size 0xc000_0000
> you get a 32bit DMA mask wich is not what you need. So far I faked it in the
> device-tree but I guess it be better to add an extra check in
> of_dma_configure(), decrease the mask and print some kind of warning stating
> that DMA addressing is suboptimal.

Yeah, there's just no way for masks to describe that the device can 
drive all the individual bits, just not in certain combinations :(

The plan I have sketched out there is to merge dma_pfn_offset and 
bus_dma_mask into a "DMA range" descriptor, so we can then hang one or 
more of those off a device to properly cope with all these weird 
interconnects. Conceptually it feels pretty straightforward; I think 
most of the challenge is in implementing it efficiently. Plus there's 
the question of whether it could also subsume the dma_mask as well.

Robin.
Florian Fainelli Oct. 2, 2019, 6:28 p.m. UTC | #10
On 9/26/2019 4:20 AM, Robin Murphy wrote:
> On 2019-09-26 11:44 am, Nicolas Saenz Julienne wrote:
>>>>>> Robin, have you looked into supporting multiple dma-ranges? It's the
>>>>>> next thing
>>>>>> we need for BCM STB's PCIe. I'll have a go at it myself if nothing
>>>>>> is in
>>>>>> the
>>>>>> works already.
>>>>>
>>>>> Multiple dma-ranges as far as configuring inbound windows should work
>>>>> already other than the bug when there's any parent translation. But if
>>>>> you mean supporting multiple DMA offsets and masks per device in the
>>>>> DMA API, there's nothing in the works yet.
>>
>> Sorry, I meant supporting multiple DMA offsets[1]. I think I could
>> still make
>> it with a single DMA mask though.
> 
> The main problem for supporting that case in general is the disgusting
> carving up of the physical memory map you may have to do to guarantee
> that a single buffer allocation cannot ever span two windows with
> different offsets. I don't think we ever reached a conclusion on whether
> that was even achievable in practice.

It is with the Broadcom STB SoCs which have between 1 and 3 memory
controllers depending on the SoC, and multiple dma-ranges cells for PCIe
as a consequence.

Each memory controller has a different physical address aperture in the
CPU's physical address map (e.g.: MEMC0 is 0x0 - 0x3fff_ffff, MEMC1
0x4000_0000 - 0x7ffff_ffff and MEMC2 0x8000_0000 - 0xbfff_ffff, not
counting the extension regions above 4GB), and while the CPU is
scheduled and arbitrated the same way across all memory controllers
(thus making it virtually UMA, almost) having a buffer span two memory
controllers would be problematic because the memory controllers do not
know how to guarantee the transaction ordering and buffer data
consistency in both DRAM itself and for other memory controller clients,
like PCIe.

We historically had to reserve the last 4KB of each memory controller to
avoid problematic controllers like EHCI to prefetch beyond the end of a
memory controller's populated memory and that also incidentally takes
care of never having a buffer cross a controller boundary. Either you
can allocate the entire buffer on a given memory controller, or you
cannot allocate memory at all on that zone/region and another one must
be found (or there is no more memory and there is a genuine OOM).

The way we reserve memory right now is based on the first patch
submitted by Jim:

https://lore.kernel.org/patchwork/patch/988469/

whereby we read the memory node's "reg" property and we map the physical
addresses to the memory controller configuration read from the specific
registers in the CPU's Bus Interface Unit (where the memory controller
apertures are architecturally defined) and then we use that to call
memblock_reserve() (not part of that patch, it should be though).