[RFC,v3,0/6] Restricted DMA

Message ID	20210106034124.30560-1-tientzu@chromium.org (mailing list archive)
Headers	show Return-Path: <SRS0=pLaW=GJ=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org EBA2922CE3 Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> From: Claire Chang <tientzu@chromium.org> To: robh+dt@kernel.org, mpe@ellerman.id.au, benh@kernel.crashing.org, paulus@samba.org, joro@8bytes.org, will@kernel.org, frowand.list@gmail.com, konrad.wilk@oracle.com, boris.ostrovsky@oracle.com, jgross@suse.com, sstabellini@kernel.org, hch@lst.de, m.szyprowski@samsung.com, robin.murphy@arm.com Cc: grant.likely@arm.com, xypron.glpk@gmx.de, treding@nvidia.com, mingo@kernel.org, bauerman@linux.ibm.com, peterz@infradead.org, gregkh@linuxfoundation.org, saravanak@google.com, rafael.j.wysocki@intel.com, heikki.krogerus@linux.intel.com, andriy.shevchenko@linux.intel.com, rdunlap@infradead.org, dan.j.williams@intel.com, bgolaszewski@baylibre.com, devicetree@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, iommu@lists.linux-foundation.org, xen-devel@lists.xenproject.org, tfiga@chromium.org, drinkcat@chromium.org, Claire Chang <tientzu@chromium.org> Subject: [RFC PATCH v3 0/6] Restricted DMA Date: Wed, 6 Jan 2021 11:41:18 +0800 Message-Id: <20210106034124.30560-1-tientzu@chromium.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Restricted DMA \| expand [RFC,v3,0/6] Restricted DMA [RFC,v3,1/6] swiotlb: Add io_tlb_mem struct [RFC,v3,2/6] swiotlb: Add restricted DMA pool [RFC,v3,3/6] swiotlb: Use restricted DMA pool if available [RFC,v3,4/6] swiotlb: Add restricted DMA alloc/free support. [RFC,v3,5/6] dt-bindings: of: Add restricted DMA pool [RFC,v3,6/6] of: Add plumbing for restricted DMA pool

Claire Chang Jan. 6, 2021, 3:41 a.m. UTC

This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.

For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
full chain of exploits; [2], [3]).

To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. in ATF on some ARM platforms).

[1a] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/

Claire Chang (6):
  swiotlb: Add io_tlb_mem struct
  swiotlb: Add restricted DMA pool
  swiotlb: Use restricted DMA pool if available
  swiotlb: Add restricted DMA alloc/free support.
  dt-bindings: of: Add restricted DMA pool
  of: Add plumbing for restricted DMA pool

 .../reserved-memory/reserved-memory.txt       |  24 +
 arch/powerpc/platforms/pseries/svm.c          |   4 +-
 drivers/iommu/dma-iommu.c                     |  12 +-
 drivers/of/address.c                          |  21 +
 drivers/of/device.c                           |   4 +
 drivers/of/of_private.h                       |   5 +
 drivers/xen/swiotlb-xen.c                     |   4 +-
 include/linux/device.h                        |   4 +
 include/linux/swiotlb.h                       |  61 +-
 kernel/dma/Kconfig                            |   1 +
 kernel/dma/direct.c                           |  20 +-
 kernel/dma/direct.h                           |  10 +-
 kernel/dma/swiotlb.c                          | 576 +++++++++++-------
 13 files changed, 514 insertions(+), 232 deletions(-)

Florian Fainelli Jan. 6, 2021, 6:48 p.m. UTC | #1

Hi,

First of all let me say that I am glad that someone is working on a
upstream solution for this issue, would appreciate if you could CC and
Jim Quinlan on subsequent submissions.

On 1/5/21 7:41 PM, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
> 
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
> 
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. in ATF on some ARM platforms).

Can you explain how ATF gets involved and to what extent it does help,
besides enforcing a secure region from the ARM CPU's perpsective? Does
the PCIe root complex not have an IOMMU but can somehow be denied access
to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
still some sort of basic protection that the HW enforces, right?

On Broadcom STB SoCs we have had something similar for a while however
and while we don't have an IOMMU for the PCIe bridge, we do have a a
basic protection mechanism whereby we can configure a region in DRAM to
be PCIe read/write and CPU read/write which then gets used as the PCIe
inbound region for the PCIe EP. By default the PCIe bridge is not
allowed access to DRAM so we must call into a security agent to allow
the PCIe bridge to access the designated DRAM region.

We have done this using a private CMA area region assigned via Device
Tree, assigned with a and requiring the PCIe EP driver to use
dma_alloc_from_contiguous() in order to allocate from this device
private CMA area. The only drawback with that approach is that it
requires knowing how much memory you need up front for buffers and DMA
descriptors that the PCIe EP will need to process. The problem is that
it requires driver modifications and that does not scale over the number
of PCIe EP drivers, some we absolutely do not control, but there is no
need to bounce buffer. Your approach scales better across PCIe EP
drivers however it does require bounce buffering which could be a
performance hit.

Thanks!

Claire Chang Jan. 7, 2021, 5:38 p.m. UTC | #2

On Thu, Jan 7, 2021 at 2:48 AM Florian Fainelli <f.fainelli@gmail.com>
wrote:
>
> Hi,
>
> First of all let me say that I am glad that someone is working on a
> upstream solution for this issue, would appreciate if you could CC and
> Jim Quinlan on subsequent submissions.

Sure!

>
>
> On 1/5/21 7:41 PM, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
> >
> > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus
is
> > not behind an IOMMU. As PCI-e, by design, gives the device full access
to
> > system memory, a vulnerability in the Wi-Fi firmware could easily
escalate
> > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> > full chain of exploits; [2], [3]).
> >
> > To mitigate the security concerns, we introduce restricted DMA.
Restricted
> > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of
a
> > specially allocated region and does memory allocation from the same
region.
> > The feature on its own provides a basic level of protection against the
DMA
> > overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
needs
> > to provide a way to restrict the DMA to a predefined memory region
(this is
> > usually done at firmware level, e.g. in ATF on some ARM platforms).
>
> Can you explain how ATF gets involved and to what extent it does help,
> besides enforcing a secure region from the ARM CPU's perpsective? Does
> the PCIe root complex not have an IOMMU but can somehow be denied access
> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> still some sort of basic protection that the HW enforces, right?

We need the ATF support for memory MPU (memory protection unit).
Restricted DMA (with reserved-memory in dts) makes sure the predefined
memory
region is for PCIe DMA only, but we still need MPU to locks down PCIe
access to
that specific regions.

>
> On Broadcom STB SoCs we have had something similar for a while however
> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> basic protection mechanism whereby we can configure a region in DRAM to
> be PCIe read/write and CPU read/write which then gets used as the PCIe
> inbound region for the PCIe EP. By default the PCIe bridge is not
> allowed access to DRAM so we must call into a security agent to allow
> the PCIe bridge to access the designated DRAM region.
>
> We have done this using a private CMA area region assigned via Device
> Tree, assigned with a and requiring the PCIe EP driver to use
> dma_alloc_from_contiguous() in order to allocate from this device
> private CMA area. The only drawback with that approach is that it
> requires knowing how much memory you need up front for buffers and DMA
> descriptors that the PCIe EP will need to process. The problem is that
> it requires driver modifications and that does not scale over the number
> of PCIe EP drivers, some we absolutely do not control, but there is no
> need to bounce buffer. Your approach scales better across PCIe EP
> drivers however it does require bounce buffering which could be a
> performance hit.

Only the streaming DMA (map/unmap) needs bounce buffering.
I also added alloc/free support in this series
(https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc()
will
try to allocate memory from the predefined memory region.

As for the performance hit, it should be similar to the default swiotlb.
Here are my experiment results. Both SoCs lack IOMMU for PCIe.

PCIe wifi vht80 throughput -

  MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
  w/o Restricted DMA  244.1     134.66   312.56   350.79
  w/ Restricted DMA    246.95   136.59   363.21   351.99

  Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
  w/o Restricted DMA  237.87   133.86   288.28   361.88
  w/ Restricted DMA    256.01   130.95   292.28   353.19

The CPU usage doesn't increase too much either.
Although I didn't measure the CPU usage very precisely, it's ~3% with a
single
big core (Cortex-A72) and ~5% with a single small core (Cortex-A53).

Thanks!
>
> Thanks!


On Thu, Jan 7, 2021 at 2:48 AM Florian Fainelli <f.fainelli@gmail.com>
wrote:

> Hi,
>
> First of all let me say that I am glad that someone is working on a
> upstream solution for this issue, would appreciate if you could CC and
> Jim Quinlan on subsequent submissions.
>
> On 1/5/21 7:41 PM, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
> >
> > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> > not behind an IOMMU. As PCI-e, by design, gives the device full access to
> > system memory, a vulnerability in the Wi-Fi firmware could easily
> escalate
> > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> > full chain of exploits; [2], [3]).
> >
> > To mitigate the security concerns, we introduce restricted DMA.
> Restricted
> > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> > specially allocated region and does memory allocation from the same
> region.
> > The feature on its own provides a basic level of protection against the
> DMA
> > overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
> needs
> > to provide a way to restrict the DMA to a predefined memory region (this
> is
> > usually done at firmware level, e.g. in ATF on some ARM platforms).
>
> Can you explain how ATF gets involved and to what extent it does help,
> besides enforcing a secure region from the ARM CPU's perpsective? Does
> the PCIe root complex not have an IOMMU but can somehow be denied access
> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> still some sort of basic protection that the HW enforces, right?
>
> On Broadcom STB SoCs we have had something similar for a while however
> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> basic protection mechanism whereby we can configure a region in DRAM to
> be PCIe read/write and CPU read/write which then gets used as the PCIe
> inbound region for the PCIe EP. By default the PCIe bridge is not
> allowed access to DRAM so we must call into a security agent to allow
> the PCIe bridge to access the designated DRAM region.
>
> We have done this using a private CMA area region assigned via Device
> Tree, assigned with a and requiring the PCIe EP driver to use
> dma_alloc_from_contiguous() in order to allocate from this device
> private CMA area. The only drawback with that approach is that it
> requires knowing how much memory you need up front for buffers and DMA
> descriptors that the PCIe EP will need to process. The problem is that
> it requires driver modifications and that does not scale over the number
> of PCIe EP drivers, some we absolutely do not control, but there is no
> need to bounce buffer. Your approach scales better across PCIe EP
> drivers however it does require bounce buffering which could be a
> performance hit.
>
> Thanks!
> --
> Florian
>

Claire Chang Jan. 7, 2021, 5:42 p.m. UTC | #3

On Thu, Jan 7, 2021 at 2:48 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> Hi,
>
> First of all let me say that I am glad that someone is working on a
> upstream solution for this issue, would appreciate if you could CC and
> Jim Quinlan on subsequent submissions.

Sure!

>
> On 1/5/21 7:41 PM, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
> >
> > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> > not behind an IOMMU. As PCI-e, by design, gives the device full access to
> > system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> > full chain of exploits; [2], [3]).
> >
> > To mitigate the security concerns, we introduce restricted DMA. Restricted
> > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> > specially allocated region and does memory allocation from the same region.
> > The feature on its own provides a basic level of protection against the DMA
> > overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system needs
> > to provide a way to restrict the DMA to a predefined memory region (this is
> > usually done at firmware level, e.g. in ATF on some ARM platforms).
>
> Can you explain how ATF gets involved and to what extent it does help,
> besides enforcing a secure region from the ARM CPU's perpsective? Does
> the PCIe root complex not have an IOMMU but can somehow be denied access
> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> still some sort of basic protection that the HW enforces, right?

We need the ATF support for memory MPU (memory protection unit).
Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
that specific regions.

>
> On Broadcom STB SoCs we have had something similar for a while however
> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> basic protection mechanism whereby we can configure a region in DRAM to
> be PCIe read/write and CPU read/write which then gets used as the PCIe
> inbound region for the PCIe EP. By default the PCIe bridge is not
> allowed access to DRAM so we must call into a security agent to allow
> the PCIe bridge to access the designated DRAM region.
>
> We have done this using a private CMA area region assigned via Device
> Tree, assigned with a and requiring the PCIe EP driver to use
> dma_alloc_from_contiguous() in order to allocate from this device
> private CMA area. The only drawback with that approach is that it
> requires knowing how much memory you need up front for buffers and DMA
> descriptors that the PCIe EP will need to process. The problem is that
> it requires driver modifications and that does not scale over the number
> of PCIe EP drivers, some we absolutely do not control, but there is no
> need to bounce buffer. Your approach scales better across PCIe EP
> drivers however it does require bounce buffering which could be a
> performance hit.

Only the streaming DMA (map/unmap) needs bounce buffering.
I also added alloc/free support in this series
(https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc() will
try to allocate memory from the predefined memory region.

As for the performance hit, it should be similar to the default swiotlb.
Here are my experiment results. Both SoCs lack IOMMU for PCIe.

PCIe wifi vht80 throughput -

  MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
  w/o Restricted DMA  244.1     134.66   312.56   350.79
  w/ Restricted DMA    246.95   136.59   363.21   351.99

  Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
  w/o Restricted DMA  237.87   133.86   288.28   361.88
  w/ Restricted DMA    256.01   130.95   292.28   353.19

The CPU usage doesn't increase too much either.
Although I didn't measure the CPU usage very precisely, it's ~3% with a single
big core (Cortex-A72) and ~5% with a single small core (Cortex-A53).

Thanks!

>
> Thanks!
> --
> Florian

Florian Fainelli Jan. 7, 2021, 5:59 p.m. UTC | #4

On 1/7/21 9:42 AM, Claire Chang wrote:

>> Can you explain how ATF gets involved and to what extent it does help,
>> besides enforcing a secure region from the ARM CPU's perpsective? Does
>> the PCIe root complex not have an IOMMU but can somehow be denied access
>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
>> still some sort of basic protection that the HW enforces, right?
> 
> We need the ATF support for memory MPU (memory protection unit).
> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
> that specific regions.

OK so you do have a protection unit of some sort to enforce which region
in DRAM the PCIE bridge is allowed to access, that makes sense,
otherwise the restricted DMA region would only be a hint but nothing you
can really enforce. This is almost entirely analogous to our systems then.

There may be some value in standardizing on an ARM SMCCC call then since
you already support two different SoC vendors.

> 
>>
>> On Broadcom STB SoCs we have had something similar for a while however
>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
>> basic protection mechanism whereby we can configure a region in DRAM to
>> be PCIe read/write and CPU read/write which then gets used as the PCIe
>> inbound region for the PCIe EP. By default the PCIe bridge is not
>> allowed access to DRAM so we must call into a security agent to allow
>> the PCIe bridge to access the designated DRAM region.
>>
>> We have done this using a private CMA area region assigned via Device
>> Tree, assigned with a and requiring the PCIe EP driver to use
>> dma_alloc_from_contiguous() in order to allocate from this device
>> private CMA area. The only drawback with that approach is that it
>> requires knowing how much memory you need up front for buffers and DMA
>> descriptors that the PCIe EP will need to process. The problem is that
>> it requires driver modifications and that does not scale over the number
>> of PCIe EP drivers, some we absolutely do not control, but there is no
>> need to bounce buffer. Your approach scales better across PCIe EP
>> drivers however it does require bounce buffering which could be a
>> performance hit.
> 
> Only the streaming DMA (map/unmap) needs bounce buffering.

True, and typically only on transmit since you don't really control
where the sk_buff are allocated from, right? On RX since you need to
hand buffer addresses to the WLAN chip prior to DMA, you can allocate
them from a pool that already falls within the restricted DMA region, right?

> I also added alloc/free support in this series
> (https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc() will
> try to allocate memory from the predefined memory region.
> 
> As for the performance hit, it should be similar to the default swiotlb.
> Here are my experiment results. Both SoCs lack IOMMU for PCIe.
> 
> PCIe wifi vht80 throughput -
> 
>   MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
>   w/o Restricted DMA  244.1     134.66   312.56   350.79
>   w/ Restricted DMA    246.95   136.59   363.21   351.99
> 
>   Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
>   w/o Restricted DMA  237.87   133.86   288.28   361.88
>   w/ Restricted DMA    256.01   130.95   292.28   353.19

How come you get better throughput with restricted DMA? Is it because
doing DMA to/from a contiguous region allows for better grouping of
transactions from the DRAM controller's perspective somehow?

> 
> The CPU usage doesn't increase too much either.
> Although I didn't measure the CPU usage very precisely, it's ~3% with a single
> big core (Cortex-A72) and ~5% with a single small core (Cortex-A53).
> 
> Thanks!
> 
>>
>> Thanks!
>> --
>> Florian

Claire Chang Jan. 12, 2021, 7:48 a.m. UTC | #5

On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 1/7/21 9:42 AM, Claire Chang wrote:
>
> >> Can you explain how ATF gets involved and to what extent it does help,
> >> besides enforcing a secure region from the ARM CPU's perpsective? Does
> >> the PCIe root complex not have an IOMMU but can somehow be denied access
> >> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> >> still some sort of basic protection that the HW enforces, right?
> >
> > We need the ATF support for memory MPU (memory protection unit).
> > Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
> > region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
> > that specific regions.
>
> OK so you do have a protection unit of some sort to enforce which region
> in DRAM the PCIE bridge is allowed to access, that makes sense,
> otherwise the restricted DMA region would only be a hint but nothing you
> can really enforce. This is almost entirely analogous to our systems then.

Here is the example of setting the MPU:
https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132

>
> There may be some value in standardizing on an ARM SMCCC call then since
> you already support two different SoC vendors.
>
> >
> >>
> >> On Broadcom STB SoCs we have had something similar for a while however
> >> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> >> basic protection mechanism whereby we can configure a region in DRAM to
> >> be PCIe read/write and CPU read/write which then gets used as the PCIe
> >> inbound region for the PCIe EP. By default the PCIe bridge is not
> >> allowed access to DRAM so we must call into a security agent to allow
> >> the PCIe bridge to access the designated DRAM region.
> >>
> >> We have done this using a private CMA area region assigned via Device
> >> Tree, assigned with a and requiring the PCIe EP driver to use
> >> dma_alloc_from_contiguous() in order to allocate from this device
> >> private CMA area. The only drawback with that approach is that it
> >> requires knowing how much memory you need up front for buffers and DMA
> >> descriptors that the PCIe EP will need to process. The problem is that
> >> it requires driver modifications and that does not scale over the number
> >> of PCIe EP drivers, some we absolutely do not control, but there is no
> >> need to bounce buffer. Your approach scales better across PCIe EP
> >> drivers however it does require bounce buffering which could be a
> >> performance hit.
> >
> > Only the streaming DMA (map/unmap) needs bounce buffering.
>
> True, and typically only on transmit since you don't really control
> where the sk_buff are allocated from, right? On RX since you need to
> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
> them from a pool that already falls within the restricted DMA region, right?
>

Right, but applying bounce buffering to RX will make it more secure.
The device won't be able to modify the content after unmap. Just like what
iommu_unmap does.

> > I also added alloc/free support in this series
> > (https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc() will
> > try to allocate memory from the predefined memory region.
> >
> > As for the performance hit, it should be similar to the default swiotlb.
> > Here are my experiment results. Both SoCs lack IOMMU for PCIe.
> >
> > PCIe wifi vht80 throughput -
> >
> >   MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
> >   w/o Restricted DMA  244.1     134.66   312.56   350.79
> >   w/ Restricted DMA    246.95   136.59   363.21   351.99
> >
> >   Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
> >   w/o Restricted DMA  237.87   133.86   288.28   361.88
> >   w/ Restricted DMA    256.01   130.95   292.28   353.19
>
> How come you get better throughput with restricted DMA? Is it because
> doing DMA to/from a contiguous region allows for better grouping of
> transactions from the DRAM controller's perspective somehow?

I'm not sure, but actually, enabling the default swiotlb for wifi also helps the
throughput a little bit for me.

>
> >
> > The CPU usage doesn't increase too much either.
> > Although I didn't measure the CPU usage very precisely, it's ~3% with a single
> > big core (Cortex-A72) and ~5% with a single small core (Cortex-A53).
> >
> > Thanks!
> >
> >>
> >> Thanks!
> >> --
> >> Florian
>
>
> --
> Florian

Florian Fainelli Jan. 12, 2021, 6:01 p.m. UTC | #6

On 1/11/21 11:48 PM, Claire Chang wrote:
> On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>> On 1/7/21 9:42 AM, Claire Chang wrote:
>>
>>>> Can you explain how ATF gets involved and to what extent it does help,
>>>> besides enforcing a secure region from the ARM CPU's perpsective? Does
>>>> the PCIe root complex not have an IOMMU but can somehow be denied access
>>>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
>>>> still some sort of basic protection that the HW enforces, right?
>>>
>>> We need the ATF support for memory MPU (memory protection unit).
>>> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
>>> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
>>> that specific regions.
>>
>> OK so you do have a protection unit of some sort to enforce which region
>> in DRAM the PCIE bridge is allowed to access, that makes sense,
>> otherwise the restricted DMA region would only be a hint but nothing you
>> can really enforce. This is almost entirely analogous to our systems then.
> 
> Here is the example of setting the MPU:
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> 
>>
>> There may be some value in standardizing on an ARM SMCCC call then since
>> you already support two different SoC vendors.
>>
>>>
>>>>
>>>> On Broadcom STB SoCs we have had something similar for a while however
>>>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
>>>> basic protection mechanism whereby we can configure a region in DRAM to
>>>> be PCIe read/write and CPU read/write which then gets used as the PCIe
>>>> inbound region for the PCIe EP. By default the PCIe bridge is not
>>>> allowed access to DRAM so we must call into a security agent to allow
>>>> the PCIe bridge to access the designated DRAM region.
>>>>
>>>> We have done this using a private CMA area region assigned via Device
>>>> Tree, assigned with a and requiring the PCIe EP driver to use
>>>> dma_alloc_from_contiguous() in order to allocate from this device
>>>> private CMA area. The only drawback with that approach is that it
>>>> requires knowing how much memory you need up front for buffers and DMA
>>>> descriptors that the PCIe EP will need to process. The problem is that
>>>> it requires driver modifications and that does not scale over the number
>>>> of PCIe EP drivers, some we absolutely do not control, but there is no
>>>> need to bounce buffer. Your approach scales better across PCIe EP
>>>> drivers however it does require bounce buffering which could be a
>>>> performance hit.
>>>
>>> Only the streaming DMA (map/unmap) needs bounce buffering.
>>
>> True, and typically only on transmit since you don't really control
>> where the sk_buff are allocated from, right? On RX since you need to
>> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
>> them from a pool that already falls within the restricted DMA region, right?
>>
> 
> Right, but applying bounce buffering to RX will make it more secure.
> The device won't be able to modify the content after unmap. Just like what
> iommu_unmap does.

Sure, however the goals of using bounce buffering equally applies to RX
and TX in that this is the only layer sitting between a stack (block,
networking, USB, etc.) and the underlying device driver that scales well
in order to massage a dma_addr_t to be within a particular physical range.

There is however room for improvement if the drivers are willing to
change their buffer allocation strategy. When you receive Wi-Fi frames
you need to allocate buffers for the Wi-Fi device to DMA into, and that
happens ahead of the DMA transfers by the Wi-Fi device. At buffer
allocation time you could very well allocate these frames from the
restricted DMA region without having to bounce buffer them since the
host CPU is in control over where and when to DMA into.

The issue is that each network driver may implement its own buffer
allocation strategy, some may simply call netdev_alloc_skb() which gives
zero control over where the buffer comes from unless you play tricks
with NUMA node allocations and somehow declare that your restricted DMA
region is a different NUMA node. If the driver allocates pages and then
attaches a SKB to that page using build_skb(), then you have much more
control over where that page comes from, and this is where using a
device private CMA are helps, because you can just do
dma_alloc_from_contiguous() and that will ensure that the pages are
coming from your specific CMA area.

Few questions on the implementation:

- is there any warning or error being printed if the restricted DMA
region is outside of a device's DMA addressable range?

- are there are any helpful statistics that could be shown to indicate
that the restricted DMA region was sized too small, e.g.: that
allocation of a DMA buffer failed because we ran out of space in the
swiotlb pool?

> 
>>> I also added alloc/free support in this series
>>> (https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc() will
>>> try to allocate memory from the predefined memory region.
>>>
>>> As for the performance hit, it should be similar to the default swiotlb.
>>> Here are my experiment results. Both SoCs lack IOMMU for PCIe.
>>>
>>> PCIe wifi vht80 throughput -
>>>
>>>   MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
>>>   w/o Restricted DMA  244.1     134.66   312.56   350.79
>>>   w/ Restricted DMA    246.95   136.59   363.21   351.99
>>>
>>>   Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
>>>   w/o Restricted DMA  237.87   133.86   288.28   361.88
>>>   w/ Restricted DMA    256.01   130.95   292.28   353.19
>>
>> How come you get better throughput with restricted DMA? Is it because
>> doing DMA to/from a contiguous region allows for better grouping of
>> transactions from the DRAM controller's perspective somehow?
> 
> I'm not sure, but actually, enabling the default swiotlb for wifi also helps the
> throughput a little bit for me.

OK, it would be interesting if you could get to the bottom of why
performance does increase with swiotlb.

Tomasz Figa Jan. 13, 2021, 2:29 a.m. UTC | #7

Hi Florian,

On Wed, Jan 13, 2021 at 3:01 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 1/11/21 11:48 PM, Claire Chang wrote:
> > On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 1/7/21 9:42 AM, Claire Chang wrote:
> >>
> >>>> Can you explain how ATF gets involved and to what extent it does help,
> >>>> besides enforcing a secure region from the ARM CPU's perpsective? Does
> >>>> the PCIe root complex not have an IOMMU but can somehow be denied access
> >>>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> >>>> still some sort of basic protection that the HW enforces, right?
> >>>
> >>> We need the ATF support for memory MPU (memory protection unit).
> >>> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
> >>> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
> >>> that specific regions.
> >>
> >> OK so you do have a protection unit of some sort to enforce which region
> >> in DRAM the PCIE bridge is allowed to access, that makes sense,
> >> otherwise the restricted DMA region would only be a hint but nothing you
> >> can really enforce. This is almost entirely analogous to our systems then.
> >
> > Here is the example of setting the MPU:
> > https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> >
> >>
> >> There may be some value in standardizing on an ARM SMCCC call then since
> >> you already support two different SoC vendors.
> >>
> >>>
> >>>>
> >>>> On Broadcom STB SoCs we have had something similar for a while however
> >>>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> >>>> basic protection mechanism whereby we can configure a region in DRAM to
> >>>> be PCIe read/write and CPU read/write which then gets used as the PCIe
> >>>> inbound region for the PCIe EP. By default the PCIe bridge is not
> >>>> allowed access to DRAM so we must call into a security agent to allow
> >>>> the PCIe bridge to access the designated DRAM region.
> >>>>
> >>>> We have done this using a private CMA area region assigned via Device
> >>>> Tree, assigned with a and requiring the PCIe EP driver to use
> >>>> dma_alloc_from_contiguous() in order to allocate from this device
> >>>> private CMA area. The only drawback with that approach is that it
> >>>> requires knowing how much memory you need up front for buffers and DMA
> >>>> descriptors that the PCIe EP will need to process. The problem is that
> >>>> it requires driver modifications and that does not scale over the number
> >>>> of PCIe EP drivers, some we absolutely do not control, but there is no
> >>>> need to bounce buffer. Your approach scales better across PCIe EP
> >>>> drivers however it does require bounce buffering which could be a
> >>>> performance hit.
> >>>
> >>> Only the streaming DMA (map/unmap) needs bounce buffering.
> >>
> >> True, and typically only on transmit since you don't really control
> >> where the sk_buff are allocated from, right? On RX since you need to
> >> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
> >> them from a pool that already falls within the restricted DMA region, right?
> >>
> >
> > Right, but applying bounce buffering to RX will make it more secure.
> > The device won't be able to modify the content after unmap. Just like what
> > iommu_unmap does.
>
> Sure, however the goals of using bounce buffering equally applies to RX
> and TX in that this is the only layer sitting between a stack (block,
> networking, USB, etc.) and the underlying device driver that scales well
> in order to massage a dma_addr_t to be within a particular physical range.
>
> There is however room for improvement if the drivers are willing to
> change their buffer allocation strategy. When you receive Wi-Fi frames
> you need to allocate buffers for the Wi-Fi device to DMA into, and that
> happens ahead of the DMA transfers by the Wi-Fi device. At buffer
> allocation time you could very well allocate these frames from the
> restricted DMA region without having to bounce buffer them since the
> host CPU is in control over where and when to DMA into.
>

That is, however, still a trade-off between saving that one copy and
protection from the DMA tampering with the packet contents when the
kernel is reading them. Notice how the copy effectively makes a
snapshot of the contents, guaranteeing that the kernel has a
consistent view of the packet, which is not true if the DMA could
modify the buffer contents in the middle of CPU accesses.

Best regards,
Tomasz

> The issue is that each network driver may implement its own buffer
> allocation strategy, some may simply call netdev_alloc_skb() which gives
> zero control over where the buffer comes from unless you play tricks
> with NUMA node allocations and somehow declare that your restricted DMA
> region is a different NUMA node. If the driver allocates pages and then
> attaches a SKB to that page using build_skb(), then you have much more
> control over where that page comes from, and this is where using a
> device private CMA are helps, because you can just do
> dma_alloc_from_contiguous() and that will ensure that the pages are
> coming from your specific CMA area.
>
> Few questions on the implementation:
>
> - is there any warning or error being printed if the restricted DMA
> region is outside of a device's DMA addressable range?
>
> - are there are any helpful statistics that could be shown to indicate
> that the restricted DMA region was sized too small, e.g.: that
> allocation of a DMA buffer failed because we ran out of space in the
> swiotlb pool?
>
> >
> >>> I also added alloc/free support in this series
> >>> (https://lore.kernel.org/patchwork/patch/1360995/), so dma_direct_alloc() will
> >>> try to allocate memory from the predefined memory region.
> >>>
> >>> As for the performance hit, it should be similar to the default swiotlb.
> >>> Here are my experiment results. Both SoCs lack IOMMU for PCIe.
> >>>
> >>> PCIe wifi vht80 throughput -
> >>>
> >>>   MTK SoC                  tcp_tx     tcp_rx    udp_tx   udp_rx
> >>>   w/o Restricted DMA  244.1     134.66   312.56   350.79
> >>>   w/ Restricted DMA    246.95   136.59   363.21   351.99
> >>>
> >>>   Rockchip SoC           tcp_tx     tcp_rx    udp_tx   udp_rx
> >>>   w/o Restricted DMA  237.87   133.86   288.28   361.88
> >>>   w/ Restricted DMA    256.01   130.95   292.28   353.19
> >>
> >> How come you get better throughput with restricted DMA? Is it because
> >> doing DMA to/from a contiguous region allows for better grouping of
> >> transactions from the DRAM controller's perspective somehow?
> >
> > I'm not sure, but actually, enabling the default swiotlb for wifi also helps the
> > throughput a little bit for me.
>
> OK, it would be interesting if you could get to the bottom of why
> performance does increase with swiotlb.
> --
> Florian

Florian Fainelli Jan. 13, 2021, 3:56 a.m. UTC | #8

On 1/12/2021 6:29 PM, Tomasz Figa wrote:
> Hi Florian,
> 
> On Wed, Jan 13, 2021 at 3:01 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>> On 1/11/21 11:48 PM, Claire Chang wrote:
>>> On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>
>>>> On 1/7/21 9:42 AM, Claire Chang wrote:
>>>>
>>>>>> Can you explain how ATF gets involved and to what extent it does help,
>>>>>> besides enforcing a secure region from the ARM CPU's perpsective? Does
>>>>>> the PCIe root complex not have an IOMMU but can somehow be denied access
>>>>>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
>>>>>> still some sort of basic protection that the HW enforces, right?
>>>>>
>>>>> We need the ATF support for memory MPU (memory protection unit).
>>>>> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
>>>>> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
>>>>> that specific regions.
>>>>
>>>> OK so you do have a protection unit of some sort to enforce which region
>>>> in DRAM the PCIE bridge is allowed to access, that makes sense,
>>>> otherwise the restricted DMA region would only be a hint but nothing you
>>>> can really enforce. This is almost entirely analogous to our systems then.
>>>
>>> Here is the example of setting the MPU:
>>> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
>>>
>>>>
>>>> There may be some value in standardizing on an ARM SMCCC call then since
>>>> you already support two different SoC vendors.
>>>>
>>>>>
>>>>>>
>>>>>> On Broadcom STB SoCs we have had something similar for a while however
>>>>>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
>>>>>> basic protection mechanism whereby we can configure a region in DRAM to
>>>>>> be PCIe read/write and CPU read/write which then gets used as the PCIe
>>>>>> inbound region for the PCIe EP. By default the PCIe bridge is not
>>>>>> allowed access to DRAM so we must call into a security agent to allow
>>>>>> the PCIe bridge to access the designated DRAM region.
>>>>>>
>>>>>> We have done this using a private CMA area region assigned via Device
>>>>>> Tree, assigned with a and requiring the PCIe EP driver to use
>>>>>> dma_alloc_from_contiguous() in order to allocate from this device
>>>>>> private CMA area. The only drawback with that approach is that it
>>>>>> requires knowing how much memory you need up front for buffers and DMA
>>>>>> descriptors that the PCIe EP will need to process. The problem is that
>>>>>> it requires driver modifications and that does not scale over the number
>>>>>> of PCIe EP drivers, some we absolutely do not control, but there is no
>>>>>> need to bounce buffer. Your approach scales better across PCIe EP
>>>>>> drivers however it does require bounce buffering which could be a
>>>>>> performance hit.
>>>>>
>>>>> Only the streaming DMA (map/unmap) needs bounce buffering.
>>>>
>>>> True, and typically only on transmit since you don't really control
>>>> where the sk_buff are allocated from, right? On RX since you need to
>>>> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
>>>> them from a pool that already falls within the restricted DMA region, right?
>>>>
>>>
>>> Right, but applying bounce buffering to RX will make it more secure.
>>> The device won't be able to modify the content after unmap. Just like what
>>> iommu_unmap does.
>>
>> Sure, however the goals of using bounce buffering equally applies to RX
>> and TX in that this is the only layer sitting between a stack (block,
>> networking, USB, etc.) and the underlying device driver that scales well
>> in order to massage a dma_addr_t to be within a particular physical range.
>>
>> There is however room for improvement if the drivers are willing to
>> change their buffer allocation strategy. When you receive Wi-Fi frames
>> you need to allocate buffers for the Wi-Fi device to DMA into, and that
>> happens ahead of the DMA transfers by the Wi-Fi device. At buffer
>> allocation time you could very well allocate these frames from the
>> restricted DMA region without having to bounce buffer them since the
>> host CPU is in control over where and when to DMA into.
>>
> 
> That is, however, still a trade-off between saving that one copy and
> protection from the DMA tampering with the packet contents when the
> kernel is reading them. Notice how the copy effectively makes a
> snapshot of the contents, guaranteeing that the kernel has a
> consistent view of the packet, which is not true if the DMA could
> modify the buffer contents in the middle of CPU accesses.

I would say that the window just became so much narrower for the PCIe
end-point to overwrite contents with the copy because it would have to
happen within the dma_unmap_{page,single} time and before the copy is
finished to the bounce buffer.

Tomasz Figa Jan. 13, 2021, 4:25 a.m. UTC | #9

On Wed, Jan 13, 2021 at 12:56 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 1/12/2021 6:29 PM, Tomasz Figa wrote:
> > Hi Florian,
> >
> > On Wed, Jan 13, 2021 at 3:01 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 1/11/21 11:48 PM, Claire Chang wrote:
> >>> On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>>>
> >>>> On 1/7/21 9:42 AM, Claire Chang wrote:
> >>>>
> >>>>>> Can you explain how ATF gets involved and to what extent it does help,
> >>>>>> besides enforcing a secure region from the ARM CPU's perpsective? Does
> >>>>>> the PCIe root complex not have an IOMMU but can somehow be denied access
> >>>>>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
> >>>>>> still some sort of basic protection that the HW enforces, right?
> >>>>>
> >>>>> We need the ATF support for memory MPU (memory protection unit).
> >>>>> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
> >>>>> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
> >>>>> that specific regions.
> >>>>
> >>>> OK so you do have a protection unit of some sort to enforce which region
> >>>> in DRAM the PCIE bridge is allowed to access, that makes sense,
> >>>> otherwise the restricted DMA region would only be a hint but nothing you
> >>>> can really enforce. This is almost entirely analogous to our systems then.
> >>>
> >>> Here is the example of setting the MPU:
> >>> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> >>>
> >>>>
> >>>> There may be some value in standardizing on an ARM SMCCC call then since
> >>>> you already support two different SoC vendors.
> >>>>
> >>>>>
> >>>>>>
> >>>>>> On Broadcom STB SoCs we have had something similar for a while however
> >>>>>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
> >>>>>> basic protection mechanism whereby we can configure a region in DRAM to
> >>>>>> be PCIe read/write and CPU read/write which then gets used as the PCIe
> >>>>>> inbound region for the PCIe EP. By default the PCIe bridge is not
> >>>>>> allowed access to DRAM so we must call into a security agent to allow
> >>>>>> the PCIe bridge to access the designated DRAM region.
> >>>>>>
> >>>>>> We have done this using a private CMA area region assigned via Device
> >>>>>> Tree, assigned with a and requiring the PCIe EP driver to use
> >>>>>> dma_alloc_from_contiguous() in order to allocate from this device
> >>>>>> private CMA area. The only drawback with that approach is that it
> >>>>>> requires knowing how much memory you need up front for buffers and DMA
> >>>>>> descriptors that the PCIe EP will need to process. The problem is that
> >>>>>> it requires driver modifications and that does not scale over the number
> >>>>>> of PCIe EP drivers, some we absolutely do not control, but there is no
> >>>>>> need to bounce buffer. Your approach scales better across PCIe EP
> >>>>>> drivers however it does require bounce buffering which could be a
> >>>>>> performance hit.
> >>>>>
> >>>>> Only the streaming DMA (map/unmap) needs bounce buffering.
> >>>>
> >>>> True, and typically only on transmit since you don't really control
> >>>> where the sk_buff are allocated from, right? On RX since you need to
> >>>> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
> >>>> them from a pool that already falls within the restricted DMA region, right?
> >>>>
> >>>
> >>> Right, but applying bounce buffering to RX will make it more secure.
> >>> The device won't be able to modify the content after unmap. Just like what
> >>> iommu_unmap does.
> >>
> >> Sure, however the goals of using bounce buffering equally applies to RX
> >> and TX in that this is the only layer sitting between a stack (block,
> >> networking, USB, etc.) and the underlying device driver that scales well
> >> in order to massage a dma_addr_t to be within a particular physical range.
> >>
> >> There is however room for improvement if the drivers are willing to
> >> change their buffer allocation strategy. When you receive Wi-Fi frames
> >> you need to allocate buffers for the Wi-Fi device to DMA into, and that
> >> happens ahead of the DMA transfers by the Wi-Fi device. At buffer
> >> allocation time you could very well allocate these frames from the
> >> restricted DMA region without having to bounce buffer them since the
> >> host CPU is in control over where and when to DMA into.
> >>
> >
> > That is, however, still a trade-off between saving that one copy and
> > protection from the DMA tampering with the packet contents when the
> > kernel is reading them. Notice how the copy effectively makes a
> > snapshot of the contents, guaranteeing that the kernel has a
> > consistent view of the packet, which is not true if the DMA could
> > modify the buffer contents in the middle of CPU accesses.
>
> I would say that the window just became so much narrower for the PCIe
> end-point to overwrite contents with the copy because it would have to
> happen within the dma_unmap_{page,single} time and before the copy is
> finished to the bounce buffer.

Not only. Imagine this:

a) Without bouncing:

- RX interrupt
- Pass the packet to the network stack
- Network stack validates the packet
- DMA overwrites the packet
- Network stack goes boom, because the packet changed after validation

b) With bouncing:

- RX interrupt
- Copy the packet to a DMA-inaccessible buffer
- Network stack validates the packet
- Network stack is happy, because the packet is guaranteed to stay the
same after validation

Best regards,
Tomasz

Florian Fainelli Jan. 13, 2021, 4:41 a.m. UTC | #10

On 1/12/2021 8:25 PM, Tomasz Figa wrote:
> On Wed, Jan 13, 2021 at 12:56 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>
>>
>>
>> On 1/12/2021 6:29 PM, Tomasz Figa wrote:
>>> Hi Florian,
>>>
>>> On Wed, Jan 13, 2021 at 3:01 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>
>>>> On 1/11/21 11:48 PM, Claire Chang wrote:
>>>>> On Fri, Jan 8, 2021 at 1:59 AM Florian Fainelli <f.fainelli@gmail.com> wrote:
>>>>>>
>>>>>> On 1/7/21 9:42 AM, Claire Chang wrote:
>>>>>>
>>>>>>>> Can you explain how ATF gets involved and to what extent it does help,
>>>>>>>> besides enforcing a secure region from the ARM CPU's perpsective? Does
>>>>>>>> the PCIe root complex not have an IOMMU but can somehow be denied access
>>>>>>>> to a region that is marked NS=0 in the ARM CPU's MMU? If so, that is
>>>>>>>> still some sort of basic protection that the HW enforces, right?
>>>>>>>
>>>>>>> We need the ATF support for memory MPU (memory protection unit).
>>>>>>> Restricted DMA (with reserved-memory in dts) makes sure the predefined memory
>>>>>>> region is for PCIe DMA only, but we still need MPU to locks down PCIe access to
>>>>>>> that specific regions.
>>>>>>
>>>>>> OK so you do have a protection unit of some sort to enforce which region
>>>>>> in DRAM the PCIE bridge is allowed to access, that makes sense,
>>>>>> otherwise the restricted DMA region would only be a hint but nothing you
>>>>>> can really enforce. This is almost entirely analogous to our systems then.
>>>>>
>>>>> Here is the example of setting the MPU:
>>>>> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
>>>>>
>>>>>>
>>>>>> There may be some value in standardizing on an ARM SMCCC call then since
>>>>>> you already support two different SoC vendors.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Broadcom STB SoCs we have had something similar for a while however
>>>>>>>> and while we don't have an IOMMU for the PCIe bridge, we do have a a
>>>>>>>> basic protection mechanism whereby we can configure a region in DRAM to
>>>>>>>> be PCIe read/write and CPU read/write which then gets used as the PCIe
>>>>>>>> inbound region for the PCIe EP. By default the PCIe bridge is not
>>>>>>>> allowed access to DRAM so we must call into a security agent to allow
>>>>>>>> the PCIe bridge to access the designated DRAM region.
>>>>>>>>
>>>>>>>> We have done this using a private CMA area region assigned via Device
>>>>>>>> Tree, assigned with a and requiring the PCIe EP driver to use
>>>>>>>> dma_alloc_from_contiguous() in order to allocate from this device
>>>>>>>> private CMA area. The only drawback with that approach is that it
>>>>>>>> requires knowing how much memory you need up front for buffers and DMA
>>>>>>>> descriptors that the PCIe EP will need to process. The problem is that
>>>>>>>> it requires driver modifications and that does not scale over the number
>>>>>>>> of PCIe EP drivers, some we absolutely do not control, but there is no
>>>>>>>> need to bounce buffer. Your approach scales better across PCIe EP
>>>>>>>> drivers however it does require bounce buffering which could be a
>>>>>>>> performance hit.
>>>>>>>
>>>>>>> Only the streaming DMA (map/unmap) needs bounce buffering.
>>>>>>
>>>>>> True, and typically only on transmit since you don't really control
>>>>>> where the sk_buff are allocated from, right? On RX since you need to
>>>>>> hand buffer addresses to the WLAN chip prior to DMA, you can allocate
>>>>>> them from a pool that already falls within the restricted DMA region, right?
>>>>>>
>>>>>
>>>>> Right, but applying bounce buffering to RX will make it more secure.
>>>>> The device won't be able to modify the content after unmap. Just like what
>>>>> iommu_unmap does.
>>>>
>>>> Sure, however the goals of using bounce buffering equally applies to RX
>>>> and TX in that this is the only layer sitting between a stack (block,
>>>> networking, USB, etc.) and the underlying device driver that scales well
>>>> in order to massage a dma_addr_t to be within a particular physical range.
>>>>
>>>> There is however room for improvement if the drivers are willing to
>>>> change their buffer allocation strategy. When you receive Wi-Fi frames
>>>> you need to allocate buffers for the Wi-Fi device to DMA into, and that
>>>> happens ahead of the DMA transfers by the Wi-Fi device. At buffer
>>>> allocation time you could very well allocate these frames from the
>>>> restricted DMA region without having to bounce buffer them since the
>>>> host CPU is in control over where and when to DMA into.
>>>>
>>>
>>> That is, however, still a trade-off between saving that one copy and
>>> protection from the DMA tampering with the packet contents when the
>>> kernel is reading them. Notice how the copy effectively makes a
>>> snapshot of the contents, guaranteeing that the kernel has a
>>> consistent view of the packet, which is not true if the DMA could
>>> modify the buffer contents in the middle of CPU accesses.
>>
>> I would say that the window just became so much narrower for the PCIe
>> end-point to overwrite contents with the copy because it would have to
>> happen within the dma_unmap_{page,single} time and before the copy is
>> finished to the bounce buffer.
> 
> Not only. Imagine this:
> 
> a) Without bouncing:
> 
> - RX interrupt
> - Pass the packet to the network stack
> - Network stack validates the packet
> - DMA overwrites the packet
> - Network stack goes boom, because the packet changed after validation
> 
> b) With bouncing:
> 
> - RX interrupt
> - Copy the packet to a DMA-inaccessible buffer
> - Network stack validates the packet
> - Network stack is happy, because the packet is guaranteed to stay the
> same after validation

Yes that's a much safer set of operations, thanks for walking through a
practical example.

Claire Chang Feb. 9, 2021, 6:27 a.m. UTC | #11

v4 here: https://lore.kernel.org/patchwork/cover/1378113/

[RFC,v3,0/6] Restricted DMA

Message

Comments