
[v2,5/8] iommu/io-pgtable-arm: Support lockless operation

Message ID af61929e3984a31588cab08ed84a237602b5263e.1498145008.git.robin.murphy@arm.com (mailing list archive)
State New, archived

Commit Message

Robin Murphy June 22, 2017, 3:53 p.m. UTC
For parallel I/O with multiple concurrent threads servicing the same
device (or devices, if several share a domain), serialising page table
updates becomes a massive bottleneck. On reflection, though, we don't
strictly need to do that - for valid IOMMU API usage, there are in fact
only two races that we need to guard against: multiple map requests for
different blocks within the same region, when the intermediate-level
table for that region does not yet exist; and multiple unmaps of
different parts of the same block entry. Both of those are fairly easily
solved by using a cmpxchg to install the new table, such that if we then
find that someone else's table got there first, we can simply free ours
and continue.

Make the requisite changes such that we can withstand being called
without the caller maintaining a lock. In theory, this opens up a few
corners in which wildly misbehaving callers making nonsensical
overlapping requests might lead to crashes instead of just unpredictable
results, but correct code really does not deserve to pay a significant
performance cost for the sake of masking bugs in theoretical broken code.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
---

v2:
 - Fix barriers in install_table
 - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
 - Minor cosmetics

 drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
 1 file changed, 57 insertions(+), 15 deletions(-)
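For readers skimming the thread, a simplified sketch of the pattern described
above, using the helper names from the patch at the bottom of this page (the
wrapper function name here is invented for illustration, and the leaf-entry
check and the ARM_LPAE_PTE_SW_SYNC re-sync handling are omitted):

/*
 * Illustration only: two mappers racing to create the same intermediate-level
 * table both allocate one, but only the cmpxchg winner installs it; the loser
 * frees its copy and carries on with the winner's table.
 */
static arm_lpae_iopte *get_or_install_table(struct arm_lpae_io_pgtable *data,
					    arm_lpae_iopte *ptep,
					    struct io_pgtable_cfg *cfg)
{
	size_t tblsz = ARM_LPAE_GRANULE(data);
	arm_lpae_iopte pte = READ_ONCE(*ptep);
	arm_lpae_iopte *cptep;

	if (pte)	/* somebody already installed a next-level table */
		return iopte_deref(pte, data);

	cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
	if (!cptep)
		return NULL;

	pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
	if (pte) {	/* lost the race: free ours, use the winner's table */
		__arm_lpae_free_pages(cptep, tblsz, cfg);
		cptep = iopte_deref(pte, data);
	}
	return cptep;
}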

Comments

Linu Cherian June 23, 2017, 5:53 a.m. UTC | #1
Robin,
Was trying to understand the new changes. Had a few questions on
arm_lpae_install_table.

On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> For parallel I/O with multiple concurrent threads servicing the same
> device (or devices, if several share a domain), serialising page table
> updates becomes a massive bottleneck. On reflection, though, we don't
> strictly need to do that - for valid IOMMU API usage, there are in fact
> only two races that we need to guard against: multiple map requests for
> different blocks within the same region, when the intermediate-level
> table for that region does not yet exist; and multiple unmaps of
> different parts of the same block entry. Both of those are fairly easily
> solved by using a cmpxchg to install the new table, such that if we then
> find that someone else's table got there first, we can simply free ours
> and continue.
> 
> Make the requisite changes such that we can withstand being called
> without the caller maintaining a lock. In theory, this opens up a few
> corners in which wildly misbehaving callers making nonsensical
> overlapping requests might lead to crashes instead of just unpredictable
> results, but correct code really does not deserve to pay a significant
> performance cost for the sake of masking bugs in theoretical broken code.
> 
> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> ---
> 
> v2:
>  - Fix barriers in install_table
>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
>  - Minor cosmetics
> 
>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 57 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 6334f51912ea..52700fa958c2 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -20,6 +20,7 @@
>  
>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
>  
> +#include <linux/atomic.h>
>  #include <linux/iommu.h>
>  #include <linux/kernel.h>
>  #include <linux/sizes.h>
> @@ -99,6 +100,8 @@
>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> +/* Software bit for solving coherency races */
> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
>  
>  /* Stage-1 PTE */
>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
>  	free_pages_exact(pages, size);
>  }
>  
> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> +				struct io_pgtable_cfg *cfg)
> +{
> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> +				   sizeof(*ptep), DMA_TO_DEVICE);
> +}
> +
>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
>  			       struct io_pgtable_cfg *cfg)
>  {
>  	*ptep = pte;
>  
>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> -		dma_sync_single_for_device(cfg->iommu_dev,
> -					   __arm_lpae_dma_addr(ptep),
> -					   sizeof(pte), DMA_TO_DEVICE);
> +		__arm_lpae_sync_pte(ptep, cfg);
>  }
>  
>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  
>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>  					     arm_lpae_iopte *ptep,
> +					     arm_lpae_iopte curr,
>  					     struct io_pgtable_cfg *cfg)
>  {
> -	arm_lpae_iopte new;
> +	arm_lpae_iopte old, new;
>  
>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
>  		new |= ARM_LPAE_PTE_NSTABLE;
>  
> -	__arm_lpae_set_pte(ptep, new, cfg);
> -	return new;
> +	/* Ensure the table itself is visible before its PTE can be */
> +	wmb();

Could you please give more hints on why this is required?


> +
> +	old = cmpxchg64_relaxed(ptep, curr, new);
> +
> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> +		return old;
> +
> +	/* Even if it's not ours, there's no point waiting; just kick it */
> +	__arm_lpae_sync_pte(ptep, cfg);
> +	if (old == curr)
> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);

How is it ensured that the cache operations are completed before we flag them
with ARM_LPAE_PTE_SW_SYNC? The previous version had a wmb() after the sync
operation.


> +
> +	return old;
>  }
>


Thanks.
Linu Cherian June 23, 2017, 8:56 a.m. UTC | #2
On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> 
> Robin,
> Was trying to understand the new changes. Had few questions on 
> arm_lpae_install_table. 
> 
> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> > For parallel I/O with multiple concurrent threads servicing the same
> > device (or devices, if several share a domain), serialising page table
> > updates becomes a massive bottleneck. On reflection, though, we don't
> > strictly need to do that - for valid IOMMU API usage, there are in fact
> > only two races that we need to guard against: multiple map requests for
> > different blocks within the same region, when the intermediate-level
> > table for that region does not yet exist; and multiple unmaps of
> > different parts of the same block entry. Both of those are fairly easily
> > solved by using a cmpxchg to install the new table, such that if we then
> > find that someone else's table got there first, we can simply free ours
> > and continue.
> > 
> > Make the requisite changes such that we can withstand being called
> > without the caller maintaining a lock. In theory, this opens up a few
> > corners in which wildly misbehaving callers making nonsensical
> > overlapping requests might lead to crashes instead of just unpredictable
> > results, but correct code really does not deserve to pay a significant
> > performance cost for the sake of masking bugs in theoretical broken code.
> > 
> > Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> > ---
> > 
> > v2:
> >  - Fix barriers in install_table
> >  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> >  - Minor cosmetics
> > 
> >  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> >  1 file changed, 57 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index 6334f51912ea..52700fa958c2 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -20,6 +20,7 @@
> >  
> >  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> >  
> > +#include <linux/atomic.h>
> >  #include <linux/iommu.h>
> >  #include <linux/kernel.h>
> >  #include <linux/sizes.h>
> > @@ -99,6 +100,8 @@
> >  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> >  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> >  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> > +/* Software bit for solving coherency races */
> > +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> >  
> >  /* Stage-1 PTE */
> >  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> > @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> >  	free_pages_exact(pages, size);
> >  }
> >  
> > +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> > +				struct io_pgtable_cfg *cfg)
> > +{
> > +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> > +				   sizeof(*ptep), DMA_TO_DEVICE);
> > +}
> > +
> >  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> >  			       struct io_pgtable_cfg *cfg)
> >  {
> >  	*ptep = pte;
> >  
> >  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> > -		dma_sync_single_for_device(cfg->iommu_dev,
> > -					   __arm_lpae_dma_addr(ptep),
> > -					   sizeof(pte), DMA_TO_DEVICE);
> > +		__arm_lpae_sync_pte(ptep, cfg);
> >  }
> >  
> >  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >  
> >  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >  					     arm_lpae_iopte *ptep,
> > +					     arm_lpae_iopte curr,
> >  					     struct io_pgtable_cfg *cfg)
> >  {
> > -	arm_lpae_iopte new;
> > +	arm_lpae_iopte old, new;
> >  
> >  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >  		new |= ARM_LPAE_PTE_NSTABLE;
> >  
> > -	__arm_lpae_set_pte(ptep, new, cfg);
> > -	return new;
> > +	/* Ensure the table itself is visible before its PTE can be */
> > +	wmb();
> 
> Could you please give more hints on why this is required.
> 
> 
> > +
> > +	old = cmpxchg64_relaxed(ptep, curr, new);
> > +
> > +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> > +	    (old & ARM_LPAE_PTE_SW_SYNC))
> > +		return old;
> > +
> > +	/* Even if it's not ours, there's no point waiting; just kick it */
> > +	__arm_lpae_sync_pte(ptep, cfg);
> > +	if (old == curr)
> > +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> 
> How is it ensured that the cache operations are completed before we flag them with
> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> 
> 
> > +
> > +	return old;
> >  }
> >
> 


Observed a performance drop of close to 1G while testing on the 10G network
interface with this patch series, compared to v1. Moving the wmb() as in v1
restores it.

@@ -332,7 +332,6 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
                new |= ARM_LPAE_PTE_NSTABLE;
 
        /* Ensure the table itself is visible before its PTE can be */
-       wmb();
 
        old = cmpxchg64_relaxed(ptep, curr, new);
 
@@ -342,6 +341,7 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 
        /* Even if it's not ours, there's no point waiting; just kick it */
        __arm_lpae_sync_pte(ptep, cfg);
+       wmb();
        if (old == curr)
                WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
Robin Murphy June 23, 2017, 10:35 a.m. UTC | #3
On 23/06/17 09:56, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
>>
>> Robin,
>> Was trying to understand the new changes. Had few questions on 
>> arm_lpae_install_table. 
>>
>> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
>>> For parallel I/O with multiple concurrent threads servicing the same
>>> device (or devices, if several share a domain), serialising page table
>>> updates becomes a massive bottleneck. On reflection, though, we don't
>>> strictly need to do that - for valid IOMMU API usage, there are in fact
>>> only two races that we need to guard against: multiple map requests for
>>> different blocks within the same region, when the intermediate-level
>>> table for that region does not yet exist; and multiple unmaps of
>>> different parts of the same block entry. Both of those are fairly easily
>>> solved by using a cmpxchg to install the new table, such that if we then
>>> find that someone else's table got there first, we can simply free ours
>>> and continue.
>>>
>>> Make the requisite changes such that we can withstand being called
>>> without the caller maintaining a lock. In theory, this opens up a few
>>> corners in which wildly misbehaving callers making nonsensical
>>> overlapping requests might lead to crashes instead of just unpredictable
>>> results, but correct code really does not deserve to pay a significant
>>> performance cost for the sake of masking bugs in theoretical broken code.
>>>
>>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
>>> ---
>>>
>>> v2:
>>>  - Fix barriers in install_table
>>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
>>>  - Minor cosmetics
>>>
>>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
>>>  1 file changed, 57 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>>> index 6334f51912ea..52700fa958c2 100644
>>> --- a/drivers/iommu/io-pgtable-arm.c
>>> +++ b/drivers/iommu/io-pgtable-arm.c
>>> @@ -20,6 +20,7 @@
>>>  
>>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
>>>  
>>> +#include <linux/atomic.h>
>>>  #include <linux/iommu.h>
>>>  #include <linux/kernel.h>
>>>  #include <linux/sizes.h>
>>> @@ -99,6 +100,8 @@
>>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
>>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
>>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
>>> +/* Software bit for solving coherency races */
>>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
>>>  
>>>  /* Stage-1 PTE */
>>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
>>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
>>>  	free_pages_exact(pages, size);
>>>  }
>>>  
>>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
>>> +				struct io_pgtable_cfg *cfg)
>>> +{
>>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
>>> +				   sizeof(*ptep), DMA_TO_DEVICE);
>>> +}
>>> +
>>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
>>>  			       struct io_pgtable_cfg *cfg)
>>>  {
>>>  	*ptep = pte;
>>>  
>>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
>>> -		dma_sync_single_for_device(cfg->iommu_dev,
>>> -					   __arm_lpae_dma_addr(ptep),
>>> -					   sizeof(pte), DMA_TO_DEVICE);
>>> +		__arm_lpae_sync_pte(ptep, cfg);
>>>  }
>>>  
>>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>>>  
>>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>>>  					     arm_lpae_iopte *ptep,
>>> +					     arm_lpae_iopte curr,
>>>  					     struct io_pgtable_cfg *cfg)
>>>  {
>>> -	arm_lpae_iopte new;
>>> +	arm_lpae_iopte old, new;
>>>  
>>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
>>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
>>>  		new |= ARM_LPAE_PTE_NSTABLE;
>>>  
>>> -	__arm_lpae_set_pte(ptep, new, cfg);
>>> -	return new;
>>> +	/* Ensure the table itself is visible before its PTE can be */
>>> +	wmb();
>>
>> Could you please give more hints on why this is required.

In theory it's possible for the write to ptep to become visible before
the previous writes filling out the PTEs in table - if the IOMMU
happened to speculatively walk ptep while parts of table were still sat
in a write buffer somewhere, it could see the old contents of that page
and potentially allocate bogus TLB entries if the stale data happened to
look like valid PTEs. In the non-coherent case the DMA cache maintenance
is sufficient, but otherwise we need a barrier to order writing the
next-level PTEs strictly before writing the table PTE pointing to them,
such that the IOMMU cannot at any point see a partially-initialised table.
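As an illustrative sketch of that ordering requirement (a fragment, not the
patch code; the function name is made up purely for illustration):

/*
 * Which writes the barrier orders - the real logic lives in
 * arm_lpae_install_table() in the patch.
 */
static void install_sketch(arm_lpae_iopte *ptep, arm_lpae_iopte curr,
			   arm_lpae_iopte new)
{
	/* (1) The caller has just filled the new next-level table with PTEs
	 *     using plain stores to Normal memory. */

	/* (2) Order those stores before publishing the table pointer, so that
	 *     a speculative walk through ptep can never fetch stale contents
	 *     from the new page and cache bogus TLB entries. */
	wmb();

	/* (3) Publish the pointer; the cmpxchg itself can be relaxed because
	 *     the barrier above already provides the ordering we need. */
	cmpxchg64_relaxed(ptep, curr, new);
}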

>>> +
>>> +	old = cmpxchg64_relaxed(ptep, curr, new);
>>> +
>>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
>>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
>>> +		return old;
>>> +
>>> +	/* Even if it's not ours, there's no point waiting; just kick it */
>>> +	__arm_lpae_sync_pte(ptep, cfg);
>>> +	if (old == curr)
>>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
>>
>> How is it ensured that the cache operations are completed before we flag them with
>> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.

The wmb() here was a silly oversight on my part - as Will reminded me,
dma_sync_*() already has its own barrier to ensure completion, which is
pretty obvious in retrospect because the entire streaming DMA would be
totally broken otherwise.
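Restated against the hunk above as an annotated fragment (the only assumption
being that the architecture's dma_sync_single_for_device() ends with a
completion barrier, as the streaming DMA API requires):

	/* __arm_lpae_sync_pte() -> dma_sync_single_for_device(): the cache
	 * maintenance is complete and visible by the time it returns... */
	__arm_lpae_sync_pte(ptep, cfg);
	/* ...so it is already safe to advertise that with SW_SYNC, with no
	 * extra wmb() in between. */
	if (old == curr)
		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);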

>>> +
>>> +	return old;
>>>  }
>>>
>>
> 
> 
> Observed a performance drop of close to 1G, 
> while testing on the 10G network interface with this patch series compared to v1.

Is that consistent over multiple runs? I wouldn't expect many workloads
to be thrashing table and hugepage mappings in the same IOVA region, so
after a point we should tend to reach a fairly steady state where we're
only changing leaf PTEs.

> Moving the wmb() as in v1 restores it back.

Note that on a coherent platform like ThunderX that's as good as just
deleting it, because you'll never execute the case below. However, on
reflection I think it can at least safely be downgraded to dma_wmb()
(i.e. DMB) rather than a full DSB - would you be able to test what
difference that makes?
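Concretely, the relaxation being suggested is along these lines (a sketch
only; see the commit linked later in the thread for what actually went in):

 	/* Ensure the table itself is visible before its PTE can be */
-	wmb();
+	dma_wmb();
 
 	old = cmpxchg64_relaxed(ptep, curr, new);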

Robin.

> @@ -332,7 +332,6 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>                 new |= ARM_LPAE_PTE_NSTABLE;
>  
>         /* Ensure the table itself is visible before its PTE can be */
> -       wmb();
> 
>         old = cmpxchg64_relaxed(ptep, curr, new);
>  
> @@ -342,6 +341,7 @@ static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
>  
>         /* Even if it's not ours, there's no point waiting; just kick it */
>         __arm_lpae_sync_pte(ptep, cfg);
> +       wmb();
>         if (old == curr)
>                 WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
>  
> 
> 
>
Linu Cherian June 23, 2017, 11:34 a.m. UTC | #4
On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> On 23/06/17 09:56, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> >>
> >> Robin,
> >> Was trying to understand the new changes. Had few questions on 
> >> arm_lpae_install_table. 
> >>
> >> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> >>> For parallel I/O with multiple concurrent threads servicing the same
> >>> device (or devices, if several share a domain), serialising page table
> >>> updates becomes a massive bottleneck. On reflection, though, we don't
> >>> strictly need to do that - for valid IOMMU API usage, there are in fact
> >>> only two races that we need to guard against: multiple map requests for
> >>> different blocks within the same region, when the intermediate-level
> >>> table for that region does not yet exist; and multiple unmaps of
> >>> different parts of the same block entry. Both of those are fairly easily
> >>> solved by using a cmpxchg to install the new table, such that if we then
> >>> find that someone else's table got there first, we can simply free ours
> >>> and continue.
> >>>
> >>> Make the requisite changes such that we can withstand being called
> >>> without the caller maintaining a lock. In theory, this opens up a few
> >>> corners in which wildly misbehaving callers making nonsensical
> >>> overlapping requests might lead to crashes instead of just unpredictable
> >>> results, but correct code really does not deserve to pay a significant
> >>> performance cost for the sake of masking bugs in theoretical broken code.
> >>>
> >>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> >>> ---
> >>>
> >>> v2:
> >>>  - Fix barriers in install_table
> >>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> >>>  - Minor cosmetics
> >>>
> >>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> >>>  1 file changed, 57 insertions(+), 15 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> >>> index 6334f51912ea..52700fa958c2 100644
> >>> --- a/drivers/iommu/io-pgtable-arm.c
> >>> +++ b/drivers/iommu/io-pgtable-arm.c
> >>> @@ -20,6 +20,7 @@
> >>>  
> >>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> >>>  
> >>> +#include <linux/atomic.h>
> >>>  #include <linux/iommu.h>
> >>>  #include <linux/kernel.h>
> >>>  #include <linux/sizes.h>
> >>> @@ -99,6 +100,8 @@
> >>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> >>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> >>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> >>> +/* Software bit for solving coherency races */
> >>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> >>>  
> >>>  /* Stage-1 PTE */
> >>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> >>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> >>>  	free_pages_exact(pages, size);
> >>>  }
> >>>  
> >>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> >>> +				struct io_pgtable_cfg *cfg)
> >>> +{
> >>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> >>> +				   sizeof(*ptep), DMA_TO_DEVICE);
> >>> +}
> >>> +
> >>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> >>>  			       struct io_pgtable_cfg *cfg)
> >>>  {
> >>>  	*ptep = pte;
> >>>  
> >>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> >>> -		dma_sync_single_for_device(cfg->iommu_dev,
> >>> -					   __arm_lpae_dma_addr(ptep),
> >>> -					   sizeof(pte), DMA_TO_DEVICE);
> >>> +		__arm_lpae_sync_pte(ptep, cfg);
> >>>  }
> >>>  
> >>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> >>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >>>  
> >>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> >>>  					     arm_lpae_iopte *ptep,
> >>> +					     arm_lpae_iopte curr,
> >>>  					     struct io_pgtable_cfg *cfg)
> >>>  {
> >>> -	arm_lpae_iopte new;
> >>> +	arm_lpae_iopte old, new;
> >>>  
> >>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> >>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> >>>  		new |= ARM_LPAE_PTE_NSTABLE;
> >>>  
> >>> -	__arm_lpae_set_pte(ptep, new, cfg);
> >>> -	return new;
> >>> +	/* Ensure the table itself is visible before its PTE can be */
> >>> +	wmb();
> >>
> >> Could you please give more hints on why this is required.
> 
> In theory it's possible for the write to ptep to become visible before
> the previous writes filling out the PTEs in table - if the IOMMU
> happened to speculatively walk ptep while parts of table were still sat
> in a write buffer somewhere, it could see the old contents of that page
> and potentially allocate bogus TLB entries if the stale data happened to
> look like valid PTEs. In the non-coherent case the DMA cache maintenance
> is sufficient, but otherwise we need a barrier to order writing the
> next-level PTEs strictly before writing the table PTE pointing to them,
> such that the IOMMU cannot at any point see a partially-initialised table.

Got it. Thanks a lot for explaining that.

> 
> >>> +
> >>> +	old = cmpxchg64_relaxed(ptep, curr, new);
> >>> +
> >>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> >>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> >>> +		return old;
> >>> +
> >>> +	/* Even if it's not ours, there's no point waiting; just kick it */
> >>> +	__arm_lpae_sync_pte(ptep, cfg);
> >>> +	if (old == curr)
> >>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> >>
> >> How is it ensured that the cache operations are completed before we flag them with
> >> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> 
> The wmb() here was a silly oversight on my part - as Will reminded me,
> dma_sync_*() already has its own barrier to ensure completion, which is
> pretty obvious in retrospect because the entire streaming DMA would be
> totally broken otherwise.
> 
> >>> +
> >>> +	return old;
> >>>  }
> >>>
> >>
> > 
> > 
> > Observed a performance drop of close to 1G, 
> > while testing on the 10G network interface with this patch series compared to v1.
> 
> Is that consistent over multiple runs? I wouldn't expect many workloads
> to be thrashing table and hugepage mappings in the same IOVA region, so
> after a point we should tend to reach a fairly steady state where we're
> only changing leaf PTEs.
> 
> > Moving the wmb() as in v1 restores it back.
> 
> Note that on a coherent platform like ThunderX that's as good as just
> deleting it, because you'll never execute the case below. However, on
> reflection I think it can at least safely be downgraded to dma_wmb()
> (i.e. DMB) rather than a full DSB - would you be able to test what
> difference that makes?

The testing was done on ThunderX 1, which has a non-coherent page table walk.
Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.



Thanks.
Linu Cherian June 27, 2017, 5:11 a.m. UTC | #5
On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > On 23/06/17 09:56, Linu Cherian wrote:
> > > On Fri Jun 23, 2017 at 11:23:26AM +0530, Linu Cherian wrote:
> > >>
> > >> Robin,
> > >> Was trying to understand the new changes. Had few questions on 
> > >> arm_lpae_install_table. 
> > >>
> > >> On Thu Jun 22, 2017 at 04:53:54PM +0100, Robin Murphy wrote:
> > >>> For parallel I/O with multiple concurrent threads servicing the same
> > >>> device (or devices, if several share a domain), serialising page table
> > >>> updates becomes a massive bottleneck. On reflection, though, we don't
> > >>> strictly need to do that - for valid IOMMU API usage, there are in fact
> > >>> only two races that we need to guard against: multiple map requests for
> > >>> different blocks within the same region, when the intermediate-level
> > >>> table for that region does not yet exist; and multiple unmaps of
> > >>> different parts of the same block entry. Both of those are fairly easily
> > >>> solved by using a cmpxchg to install the new table, such that if we then
> > >>> find that someone else's table got there first, we can simply free ours
> > >>> and continue.
> > >>>
> > >>> Make the requisite changes such that we can withstand being called
> > >>> without the caller maintaining a lock. In theory, this opens up a few
> > >>> corners in which wildly misbehaving callers making nonsensical
> > >>> overlapping requests might lead to crashes instead of just unpredictable
> > >>> results, but correct code really does not deserve to pay a significant
> > >>> performance cost for the sake of masking bugs in theoretical broken code.
> > >>>
> > >>> Signed-off-by: Robin Murphy <robin.murphy@arm.com>
> > >>> ---
> > >>>
> > >>> v2:
> > >>>  - Fix barriers in install_table
> > >>>  - Make a few more PTE accesses with {READ,WRITE}_ONCE() just in case
> > >>>  - Minor cosmetics
> > >>>
> > >>>  drivers/iommu/io-pgtable-arm.c | 72 +++++++++++++++++++++++++++++++++---------
> > >>>  1 file changed, 57 insertions(+), 15 deletions(-)
> > >>>
> > >>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > >>> index 6334f51912ea..52700fa958c2 100644
> > >>> --- a/drivers/iommu/io-pgtable-arm.c
> > >>> +++ b/drivers/iommu/io-pgtable-arm.c
> > >>> @@ -20,6 +20,7 @@
> > >>>  
> > >>>  #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
> > >>>  
> > >>> +#include <linux/atomic.h>
> > >>>  #include <linux/iommu.h>
> > >>>  #include <linux/kernel.h>
> > >>>  #include <linux/sizes.h>
> > >>> @@ -99,6 +100,8 @@
> > >>>  #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> > >>>  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> > >>>  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> > >>> +/* Software bit for solving coherency races */
> > >>> +#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
> > >>>  
> > >>>  /* Stage-1 PTE */
> > >>>  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> > >>> @@ -249,15 +252,20 @@ static void __arm_lpae_free_pages(void *pages, size_t size,
> > >>>  	free_pages_exact(pages, size);
> > >>>  }
> > >>>  
> > >>> +static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
> > >>> +				struct io_pgtable_cfg *cfg)
> > >>> +{
> > >>> +	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
> > >>> +				   sizeof(*ptep), DMA_TO_DEVICE);
> > >>> +}
> > >>> +
> > >>>  static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
> > >>>  			       struct io_pgtable_cfg *cfg)
> > >>>  {
> > >>>  	*ptep = pte;
> > >>>  
> > >>>  	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
> > >>> -		dma_sync_single_for_device(cfg->iommu_dev,
> > >>> -					   __arm_lpae_dma_addr(ptep),
> > >>> -					   sizeof(pte), DMA_TO_DEVICE);
> > >>> +		__arm_lpae_sync_pte(ptep, cfg);
> > >>>  }
> > >>>  
> > >>>  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
> > >>> @@ -314,16 +322,30 @@ static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > >>>  
> > >>>  static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
> > >>>  					     arm_lpae_iopte *ptep,
> > >>> +					     arm_lpae_iopte curr,
> > >>>  					     struct io_pgtable_cfg *cfg)
> > >>>  {
> > >>> -	arm_lpae_iopte new;
> > >>> +	arm_lpae_iopte old, new;
> > >>>  
> > >>>  	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
> > >>>  	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
> > >>>  		new |= ARM_LPAE_PTE_NSTABLE;
> > >>>  
> > >>> -	__arm_lpae_set_pte(ptep, new, cfg);
> > >>> -	return new;
> > >>> +	/* Ensure the table itself is visible before its PTE can be */
> > >>> +	wmb();
> > >>
> > >> Could you please give more hints on why this is required.
> > 
> > In theory it's possible for the write to ptep to become visible before
> > the previous writes filling out the PTEs in table - if the IOMMU
> > happened to speculatively walk ptep while parts of table were still sat
> > in a write buffer somewhere, it could see the old contents of that page
> > and potentially allocate bogus TLB entries if the stale data happened to
> > look like valid PTEs. In the non-coherent case the DMA cache maintenance
> > is sufficient, but otherwise we need a barrier to order writing the
> > next-level PTEs strictly before writing the table PTE pointing to them,
> > such that the IOMMU cannot at any point see a partially-initialised table.
> 
> Got it. Thanks a lot for explainig that.
> 
> > 
> > >>> +
> > >>> +	old = cmpxchg64_relaxed(ptep, curr, new);
> > >>> +
> > >>> +	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
> > >>> +	    (old & ARM_LPAE_PTE_SW_SYNC))
> > >>> +		return old;
> > >>> +
> > >>> +	/* Even if it's not ours, there's no point waiting; just kick it */
> > >>> +	__arm_lpae_sync_pte(ptep, cfg);
> > >>> +	if (old == curr)
> > >>> +		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
> > >>
> > >> How is it ensured that the cache operations are completed before we flag them with
> > >> 	ARM_LPAE_PTE_SW_SYNC. The previous version had a wmb() after the sync operation.
> > 
> > The wmb() here was a silly oversight on my part - as Will reminded me,
> > dma_sync_*() already has its own barrier to ensure completion, which is
> > pretty obvious in retrospect because the entire streaming DMA would be
> > totally broken otherwise.
> > 
> > >>> +
> > >>> +	return old;
> > >>>  }
> > >>>
> > >>
> > > 
> > > 
> > > Observed a performance drop of close to 1G, 
> > > while testing on the 10G network interface with this patch series compared to v1.
> > 
> > Is that consistent over multiple runs? I wouldn't expect many workloads
> > to be thrashing table and hugepage mappings in the same IOVA region, so
> > after a point we should tend to reach a fairly steady state where we're
> > only changing leaf PTEs.
> > 
> > > Moving the wmb() as in v1 restores it back.
> > 
> > Note that on a coherent platform like ThunderX that's as good as just
> > deleting it, because you'll never execute the case below. However, on
> > reflection I think it can at least safely be downgraded to dma_wmb()
> > (i.e. DMB) rather than a full DSB - would you be able to test what
> > difference that makes?
> 
> The testing was done on Thunderx 1, which has a non coherent page table walk.
> Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> 
> 
> 

Should I send a patch for this?

Thanks.
Will Deacon June 27, 2017, 8:39 a.m. UTC | #6
On Tue, Jun 27, 2017 at 10:41:55AM +0530, Linu Cherian wrote:
> On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > > Note that on a coherent platform like ThunderX that's as good as just
> > > deleting it, because you'll never execute the case below. However, on
> > > reflection I think it can at least safely be downgraded to dma_wmb()
> > > (i.e. DMB) rather than a full DSB - would you be able to test what
> > > difference that makes?
> > 
> > The testing was done on Thunderx 1, which has a non coherent page table walk.
> > Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> > 
> 
> Should i send a patch for this ?

I already did it:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=for-joerg/arm-smmu/updates&id=77f3445866c39d8866b31d8d9fa47c7c20938e05

Will
Linu Cherian June 27, 2017, 9:08 a.m. UTC | #7
On Tue Jun 27, 2017 at 09:39:13AM +0100, Will Deacon wrote:
> On Tue, Jun 27, 2017 at 10:41:55AM +0530, Linu Cherian wrote:
> > On Fri Jun 23, 2017 at 05:04:05PM +0530, Linu Cherian wrote:
> > > On Fri Jun 23, 2017 at 11:35:25AM +0100, Robin Murphy wrote:
> > > > Note that on a coherent platform like ThunderX that's as good as just
> > > > deleting it, because you'll never execute the case below. However, on
> > > > reflection I think it can at least safely be downgraded to dma_wmb()
> > > > (i.e. DMB) rather than a full DSB - would you be able to test what
> > > > difference that makes?
> > > 
> > > The testing was done on Thunderx 1, which has a non coherent page table walk.
> > > Yes, downgrading to dma_wmb() helps. With this change, performance is back to v1.
> > > 
> > 
> > Should i send a patch for this ?
> 
> I already did it:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/commit/?h=for-joerg/arm-smmu/updates&id=77f3445866c39d8866b31d8d9fa47c7c20938e05
> 
> Will
> 

Thanks, Will.

Patch

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 6334f51912ea..52700fa958c2 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -20,6 +20,7 @@ 
 
 #define pr_fmt(fmt)	"arm-lpae io-pgtable: " fmt
 
+#include <linux/atomic.h>
 #include <linux/iommu.h>
 #include <linux/kernel.h>
 #include <linux/sizes.h>
@@ -99,6 +100,8 @@ 
 #define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
 #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
 					 ARM_LPAE_PTE_ATTR_HI_MASK)
+/* Software bit for solving coherency races */
+#define ARM_LPAE_PTE_SW_SYNC		(((arm_lpae_iopte)1) << 55)
 
 /* Stage-1 PTE */
 #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
@@ -249,15 +252,20 @@  static void __arm_lpae_free_pages(void *pages, size_t size,
 	free_pages_exact(pages, size);
 }
 
+static void __arm_lpae_sync_pte(arm_lpae_iopte *ptep,
+				struct io_pgtable_cfg *cfg)
+{
+	dma_sync_single_for_device(cfg->iommu_dev, __arm_lpae_dma_addr(ptep),
+				   sizeof(*ptep), DMA_TO_DEVICE);
+}
+
 static void __arm_lpae_set_pte(arm_lpae_iopte *ptep, arm_lpae_iopte pte,
 			       struct io_pgtable_cfg *cfg)
 {
 	*ptep = pte;
 
 	if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA))
-		dma_sync_single_for_device(cfg->iommu_dev,
-					   __arm_lpae_dma_addr(ptep),
-					   sizeof(pte), DMA_TO_DEVICE);
+		__arm_lpae_sync_pte(ptep, cfg);
 }
 
 static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
@@ -314,16 +322,30 @@  static int arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 
 static arm_lpae_iopte arm_lpae_install_table(arm_lpae_iopte *table,
 					     arm_lpae_iopte *ptep,
+					     arm_lpae_iopte curr,
 					     struct io_pgtable_cfg *cfg)
 {
-	arm_lpae_iopte new;
+	arm_lpae_iopte old, new;
 
 	new = __pa(table) | ARM_LPAE_PTE_TYPE_TABLE;
 	if (cfg->quirks & IO_PGTABLE_QUIRK_ARM_NS)
 		new |= ARM_LPAE_PTE_NSTABLE;
 
-	__arm_lpae_set_pte(ptep, new, cfg);
-	return new;
+	/* Ensure the table itself is visible before its PTE can be */
+	wmb();
+
+	old = cmpxchg64_relaxed(ptep, curr, new);
+
+	if ((cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) ||
+	    (old & ARM_LPAE_PTE_SW_SYNC))
+		return old;
+
+	/* Even if it's not ours, there's no point waiting; just kick it */
+	__arm_lpae_sync_pte(ptep, cfg);
+	if (old == curr)
+		WRITE_ONCE(*ptep, new | ARM_LPAE_PTE_SW_SYNC);
+
+	return old;
 }
 
 static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
@@ -332,6 +354,7 @@  static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 {
 	arm_lpae_iopte *cptep, pte;
 	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	size_t tblsz = ARM_LPAE_GRANULE(data);
 	struct io_pgtable_cfg *cfg = &data->iop.cfg;
 
 	/* Find our entry at the current level */
@@ -346,17 +369,23 @@  static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
 		return -EINVAL;
 
 	/* Grab a pointer to the next level */
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (!pte) {
-		cptep = __arm_lpae_alloc_pages(ARM_LPAE_GRANULE(data),
-					       GFP_ATOMIC, cfg);
+		cptep = __arm_lpae_alloc_pages(tblsz, GFP_ATOMIC, cfg);
 		if (!cptep)
 			return -ENOMEM;
 
-		arm_lpae_install_table(cptep, ptep, cfg);
-	} else if (!iopte_leaf(pte, lvl)) {
+		pte = arm_lpae_install_table(cptep, ptep, 0, cfg);
+		if (pte)
+			__arm_lpae_free_pages(cptep, tblsz, cfg);
+	} else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
+		   !(pte & ARM_LPAE_PTE_SW_SYNC)) {
+		__arm_lpae_sync_pte(ptep, cfg);
+	}
+
+	if (pte && !iopte_leaf(pte, lvl)) {
 		cptep = iopte_deref(pte, data);
-	} else {
+	} else if (pte) {
 		/* We require an unmap first */
 		WARN_ON(!selftest_running);
 		return -EEXIST;
@@ -502,7 +531,19 @@  static int arm_lpae_split_blk_unmap(struct arm_lpae_io_pgtable *data,
 		__arm_lpae_init_pte(data, blk_paddr, pte, lvl, &tablep[i]);
 	}
 
-	arm_lpae_install_table(tablep, ptep, cfg);
+	pte = arm_lpae_install_table(tablep, ptep, blk_pte, cfg);
+	if (pte != blk_pte) {
+		__arm_lpae_free_pages(tablep, tablesz, cfg);
+		/*
+		 * We may race against someone unmapping another part of this
+		 * block, but anything else is invalid. We can't misinterpret
+		 * a page entry here since we're never at the last level.
+		 */
+		if (iopte_type(pte, lvl - 1) != ARM_LPAE_PTE_TYPE_TABLE)
+			return 0;
+
+		tablep = iopte_deref(pte, data);
+	}
 
 	if (unmap_idx < 0)
 		return __arm_lpae_unmap(data, iova, size, lvl, tablep);
@@ -523,7 +564,7 @@  static int __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
 		return 0;
 
 	ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
-	pte = *ptep;
+	pte = READ_ONCE(*ptep);
 	if (WARN_ON(!pte))
 		return 0;
 
@@ -585,7 +626,8 @@  static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
 			return 0;
 
 		/* Grab the IOPTE we're interested in */
-		pte = *(ptep + ARM_LPAE_LVL_IDX(iova, lvl, data));
+		ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+		pte = READ_ONCE(*ptep);
 
 		/* Valid entry? */
 		if (!pte)