diff mbox series

[2/3] iommu/io-pgtable-arm: Add IOMMU_LLC page protection flag

Message ID 3f589e7de3f9fa93e84c83420c5270c546a0c368.1610372717.git.saiprakash.ranjan@codeaurora.org (mailing list archive)
State New, archived
Headers show
Series iommu/drm/msm: Allow non-coherent masters to use system cache | expand

Commit Message

Sai Prakash Ranjan Jan. 11, 2021, 2:15 p.m. UTC
Add a new page protection flag IOMMU_LLC which can be used
by non-coherent masters to set cacheable memory attributes
for an outer level of cache called as last-level cache or
system cache. Initial user of this page protection flag is
the adreno gpu and then can later be used by other clients
such as video where this can be used for per-buffer based
mapping.

Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
---
 drivers/iommu/io-pgtable-arm.c | 3 +++
 include/linux/iommu.h          | 6 ++++++
 2 files changed, 9 insertions(+)

Comments

Will Deacon Jan. 29, 2021, 9:05 a.m. UTC | #1
On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> Add a new page protection flag IOMMU_LLC which can be used
> by non-coherent masters to set cacheable memory attributes
> for an outer level of cache called as last-level cache or
> system cache. Initial user of this page protection flag is
> the adreno gpu and then can later be used by other clients
> such as video where this can be used for per-buffer based
> mapping.
> 
> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> ---
>  drivers/iommu/io-pgtable-arm.c | 3 +++
>  include/linux/iommu.h          | 6 ++++++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 7439ee7fdcdb..ebe653ef601b 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -415,6 +415,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>  		else if (prot & IOMMU_CACHE)
>  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>  				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
> +		else if (prot & IOMMU_LLC)
> +			pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
> +				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
>  	}
>  
>  	if (prot & IOMMU_CACHE)
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index ffaa389ea128..1f82057df531 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -31,6 +31,12 @@
>   * if the IOMMU page table format is equivalent.
>   */
>  #define IOMMU_PRIV	(1 << 5)
> +/*
> + * Non-coherent masters can use this page protection flag to set cacheable
> + * memory attributes for only a transparent outer level of cache, also known as
> + * the last-level or system cache.
> + */
> +#define IOMMU_LLC	(1 << 6)

On reflection, I'm a bit worried about exposing this because I think it will
introduce a mismatched virtual alias with the CPU (we don't even have a MAIR
set up for this memory type). Now, we also have that issue for the PTW, but
since we always use cache maintenance (i.e. the streaming API) for
publishing the page-tables to a non-coheren walker, it works out. However,
if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
allocation, then they're potentially in for a nasty surprise due to the
mismatched outer-cacheability attributes.

So I can take patch (1) as a trivial rename, but unfortunately I think this
needs more thought before exposing it beyond the PTW.

Will
Sai Prakash Ranjan Jan. 29, 2021, 9:42 a.m. UTC | #2
On 2021-01-29 14:35, Will Deacon wrote:
> On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>> Add a new page protection flag IOMMU_LLC which can be used
>> by non-coherent masters to set cacheable memory attributes
>> for an outer level of cache called as last-level cache or
>> system cache. Initial user of this page protection flag is
>> the adreno gpu and then can later be used by other clients
>> such as video where this can be used for per-buffer based
>> mapping.
>> 
>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> ---
>>  drivers/iommu/io-pgtable-arm.c | 3 +++
>>  include/linux/iommu.h          | 6 ++++++
>>  2 files changed, 9 insertions(+)
>> 
>> diff --git a/drivers/iommu/io-pgtable-arm.c 
>> b/drivers/iommu/io-pgtable-arm.c
>> index 7439ee7fdcdb..ebe653ef601b 100644
>> --- a/drivers/iommu/io-pgtable-arm.c
>> +++ b/drivers/iommu/io-pgtable-arm.c
>> @@ -415,6 +415,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
>> arm_lpae_io_pgtable *data,
>>  		else if (prot & IOMMU_CACHE)
>>  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>>  				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
>> +		else if (prot & IOMMU_LLC)
>> +			pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
>> +				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
>>  	}
>> 
>>  	if (prot & IOMMU_CACHE)
>> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> index ffaa389ea128..1f82057df531 100644
>> --- a/include/linux/iommu.h
>> +++ b/include/linux/iommu.h
>> @@ -31,6 +31,12 @@
>>   * if the IOMMU page table format is equivalent.
>>   */
>>  #define IOMMU_PRIV	(1 << 5)
>> +/*
>> + * Non-coherent masters can use this page protection flag to set 
>> cacheable
>> + * memory attributes for only a transparent outer level of cache, 
>> also known as
>> + * the last-level or system cache.
>> + */
>> +#define IOMMU_LLC	(1 << 6)
> 
> On reflection, I'm a bit worried about exposing this because I think it 
> will
> introduce a mismatched virtual alias with the CPU (we don't even have a 
> MAIR
> set up for this memory type). Now, we also have that issue for the PTW, 
> but
> since we always use cache maintenance (i.e. the streaming API) for
> publishing the page-tables to a non-coheren walker, it works out. 
> However,
> if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> allocation, then they're potentially in for a nasty surprise due to the
> mismatched outer-cacheability attributes.
> 

Can't we add the syscached memory type similar to what is done on 
android?

> So I can take patch (1) as a trivial rename, but unfortunately I think 
> this
> needs more thought before exposing it beyond the PTW.
> 

That wouldn't be of much use, would it :) , we would be losing on
perf gain for GPU usecases without the rest of the patches.

Thanks,
Sai
Will Deacon Feb. 1, 2021, 11:15 a.m. UTC | #3
On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> On 2021-01-29 14:35, Will Deacon wrote:
> > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > Add a new page protection flag IOMMU_LLC which can be used
> > > by non-coherent masters to set cacheable memory attributes
> > > for an outer level of cache called as last-level cache or
> > > system cache. Initial user of this page protection flag is
> > > the adreno gpu and then can later be used by other clients
> > > such as video where this can be used for per-buffer based
> > > mapping.
> > > 
> > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> > > ---
> > >  drivers/iommu/io-pgtable-arm.c | 3 +++
> > >  include/linux/iommu.h          | 6 ++++++
> > >  2 files changed, 9 insertions(+)
> > > 
> > > diff --git a/drivers/iommu/io-pgtable-arm.c
> > > b/drivers/iommu/io-pgtable-arm.c
> > > index 7439ee7fdcdb..ebe653ef601b 100644
> > > --- a/drivers/iommu/io-pgtable-arm.c
> > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > @@ -415,6 +415,9 @@ static arm_lpae_iopte
> > > arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> > >  		else if (prot & IOMMU_CACHE)
> > >  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > >  				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > > +		else if (prot & IOMMU_LLC)
> > > +			pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
> > > +				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > >  	}
> > > 
> > >  	if (prot & IOMMU_CACHE)
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > index ffaa389ea128..1f82057df531 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -31,6 +31,12 @@
> > >   * if the IOMMU page table format is equivalent.
> > >   */
> > >  #define IOMMU_PRIV	(1 << 5)
> > > +/*
> > > + * Non-coherent masters can use this page protection flag to set
> > > cacheable
> > > + * memory attributes for only a transparent outer level of cache,
> > > also known as
> > > + * the last-level or system cache.
> > > + */
> > > +#define IOMMU_LLC	(1 << 6)
> > 
> > On reflection, I'm a bit worried about exposing this because I think it
> > will
> > introduce a mismatched virtual alias with the CPU (we don't even have a
> > MAIR
> > set up for this memory type). Now, we also have that issue for the PTW,
> > but
> > since we always use cache maintenance (i.e. the streaming API) for
> > publishing the page-tables to a non-coheren walker, it works out.
> > However,
> > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > allocation, then they're potentially in for a nasty surprise due to the
> > mismatched outer-cacheability attributes.
> > 
> 
> Can't we add the syscached memory type similar to what is done on android?

Maybe. How does the GPU driver map these things on the CPU side?

Will
Rob Clark Feb. 1, 2021, 4:20 p.m. UTC | #4
On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>
> On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > On 2021-01-29 14:35, Will Deacon wrote:
> > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > Add a new page protection flag IOMMU_LLC which can be used
> > > > by non-coherent masters to set cacheable memory attributes
> > > > for an outer level of cache called as last-level cache or
> > > > system cache. Initial user of this page protection flag is
> > > > the adreno gpu and then can later be used by other clients
> > > > such as video where this can be used for per-buffer based
> > > > mapping.
> > > >
> > > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> > > > ---
> > > >  drivers/iommu/io-pgtable-arm.c | 3 +++
> > > >  include/linux/iommu.h          | 6 ++++++
> > > >  2 files changed, 9 insertions(+)
> > > >
> > > > diff --git a/drivers/iommu/io-pgtable-arm.c
> > > > b/drivers/iommu/io-pgtable-arm.c
> > > > index 7439ee7fdcdb..ebe653ef601b 100644
> > > > --- a/drivers/iommu/io-pgtable-arm.c
> > > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > > @@ -415,6 +415,9 @@ static arm_lpae_iopte
> > > > arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> > > >           else if (prot & IOMMU_CACHE)
> > > >                   pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > > >                           << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > > > +         else if (prot & IOMMU_LLC)
> > > > +                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
> > > > +                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > > >   }
> > > >
> > > >   if (prot & IOMMU_CACHE)
> > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > index ffaa389ea128..1f82057df531 100644
> > > > --- a/include/linux/iommu.h
> > > > +++ b/include/linux/iommu.h
> > > > @@ -31,6 +31,12 @@
> > > >   * if the IOMMU page table format is equivalent.
> > > >   */
> > > >  #define IOMMU_PRIV       (1 << 5)
> > > > +/*
> > > > + * Non-coherent masters can use this page protection flag to set
> > > > cacheable
> > > > + * memory attributes for only a transparent outer level of cache,
> > > > also known as
> > > > + * the last-level or system cache.
> > > > + */
> > > > +#define IOMMU_LLC        (1 << 6)
> > >
> > > On reflection, I'm a bit worried about exposing this because I think it
> > > will
> > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > MAIR
> > > set up for this memory type). Now, we also have that issue for the PTW,
> > > but
> > > since we always use cache maintenance (i.e. the streaming API) for
> > > publishing the page-tables to a non-coheren walker, it works out.
> > > However,
> > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > allocation, then they're potentially in for a nasty surprise due to the
> > > mismatched outer-cacheability attributes.
> > >
> >
> > Can't we add the syscached memory type similar to what is done on android?
>
> Maybe. How does the GPU driver map these things on the CPU side?

Currently we use writecombine mappings for everything, although there
are some cases that we'd like to use cached (but have not merged
patches that would give userspace a way to flush/invalidate)

BR,
-R
Jordan Crouse Feb. 1, 2021, 6:20 p.m. UTC | #5
On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> >
> > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > Add a new page protection flag IOMMU_LLC which can be used
> > > > > by non-coherent masters to set cacheable memory attributes
> > > > > for an outer level of cache called as last-level cache or
> > > > > system cache. Initial user of this page protection flag is
> > > > > the adreno gpu and then can later be used by other clients
> > > > > such as video where this can be used for per-buffer based
> > > > > mapping.
> > > > >
> > > > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
> > > > > ---
> > > > >  drivers/iommu/io-pgtable-arm.c | 3 +++
> > > > >  include/linux/iommu.h          | 6 ++++++
> > > > >  2 files changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/drivers/iommu/io-pgtable-arm.c
> > > > > b/drivers/iommu/io-pgtable-arm.c
> > > > > index 7439ee7fdcdb..ebe653ef601b 100644
> > > > > --- a/drivers/iommu/io-pgtable-arm.c
> > > > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > > > @@ -415,6 +415,9 @@ static arm_lpae_iopte
> > > > > arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
> > > > >           else if (prot & IOMMU_CACHE)
> > > > >                   pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
> > > > >                           << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > > > > +         else if (prot & IOMMU_LLC)
> > > > > +                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
> > > > > +                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
> > > > >   }
> > > > >
> > > > >   if (prot & IOMMU_CACHE)
> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > > index ffaa389ea128..1f82057df531 100644
> > > > > --- a/include/linux/iommu.h
> > > > > +++ b/include/linux/iommu.h
> > > > > @@ -31,6 +31,12 @@
> > > > >   * if the IOMMU page table format is equivalent.
> > > > >   */
> > > > >  #define IOMMU_PRIV       (1 << 5)
> > > > > +/*
> > > > > + * Non-coherent masters can use this page protection flag to set
> > > > > cacheable
> > > > > + * memory attributes for only a transparent outer level of cache,
> > > > > also known as
> > > > > + * the last-level or system cache.
> > > > > + */
> > > > > +#define IOMMU_LLC        (1 << 6)
> > > >
> > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > will
> > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > MAIR
> > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > but
> > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > However,
> > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > mismatched outer-cacheability attributes.
> > > >
> > >
> > > Can't we add the syscached memory type similar to what is done on android?
> >
> > Maybe. How does the GPU driver map these things on the CPU side?
> 
> Currently we use writecombine mappings for everything, although there
> are some cases that we'd like to use cached (but have not merged
> patches that would give userspace a way to flush/invalidate)
> 
> BR,
> -R

LLC/system cache doesn't have a relationship with the CPU cache.  Its just a
little accelerator that sits on the connection from the GPU to DDR and caches
accesses. The hint that Sai is suggesting is used to mark the buffers as
'no-write-allocate' to prevent GPU write operations from being cached in the LLC
which a) isn't interesting and b) takes up cache space for read operations.

Its easiest to think of the LLC as a bonus accelerator that has no cost for
us to use outside of the unfortunate per buffer hint.

We do have to worry about the CPU cache w.r.t I/O coherency (which is a
different hint) and in that case we have all of concerns that Will identified.

Jordan
Sai Prakash Ranjan Feb. 2, 2021, 6:26 a.m. UTC | #6
On 2021-02-01 23:50, Jordan Crouse wrote:
> On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
>> On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>> >
>> > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
>> > > On 2021-01-29 14:35, Will Deacon wrote:
>> > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>> > > > > Add a new page protection flag IOMMU_LLC which can be used
>> > > > > by non-coherent masters to set cacheable memory attributes
>> > > > > for an outer level of cache called as last-level cache or
>> > > > > system cache. Initial user of this page protection flag is
>> > > > > the adreno gpu and then can later be used by other clients
>> > > > > such as video where this can be used for per-buffer based
>> > > > > mapping.
>> > > > >
>> > > > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> > > > > ---
>> > > > >  drivers/iommu/io-pgtable-arm.c | 3 +++
>> > > > >  include/linux/iommu.h          | 6 ++++++
>> > > > >  2 files changed, 9 insertions(+)
>> > > > >
>> > > > > diff --git a/drivers/iommu/io-pgtable-arm.c
>> > > > > b/drivers/iommu/io-pgtable-arm.c
>> > > > > index 7439ee7fdcdb..ebe653ef601b 100644
>> > > > > --- a/drivers/iommu/io-pgtable-arm.c
>> > > > > +++ b/drivers/iommu/io-pgtable-arm.c
>> > > > > @@ -415,6 +415,9 @@ static arm_lpae_iopte
>> > > > > arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>> > > > >           else if (prot & IOMMU_CACHE)
>> > > > >                   pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>> > > > >                           << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>> > > > > +         else if (prot & IOMMU_LLC)
>> > > > > +                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
>> > > > > +                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>> > > > >   }
>> > > > >
>> > > > >   if (prot & IOMMU_CACHE)
>> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> > > > > index ffaa389ea128..1f82057df531 100644
>> > > > > --- a/include/linux/iommu.h
>> > > > > +++ b/include/linux/iommu.h
>> > > > > @@ -31,6 +31,12 @@
>> > > > >   * if the IOMMU page table format is equivalent.
>> > > > >   */
>> > > > >  #define IOMMU_PRIV       (1 << 5)
>> > > > > +/*
>> > > > > + * Non-coherent masters can use this page protection flag to set
>> > > > > cacheable
>> > > > > + * memory attributes for only a transparent outer level of cache,
>> > > > > also known as
>> > > > > + * the last-level or system cache.
>> > > > > + */
>> > > > > +#define IOMMU_LLC        (1 << 6)
>> > > >
>> > > > On reflection, I'm a bit worried about exposing this because I think it
>> > > > will
>> > > > introduce a mismatched virtual alias with the CPU (we don't even have a
>> > > > MAIR
>> > > > set up for this memory type). Now, we also have that issue for the PTW,
>> > > > but
>> > > > since we always use cache maintenance (i.e. the streaming API) for
>> > > > publishing the page-tables to a non-coheren walker, it works out.
>> > > > However,
>> > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
>> > > > allocation, then they're potentially in for a nasty surprise due to the
>> > > > mismatched outer-cacheability attributes.
>> > > >
>> > >
>> > > Can't we add the syscached memory type similar to what is done on android?
>> >
>> > Maybe. How does the GPU driver map these things on the CPU side?
>> 
>> Currently we use writecombine mappings for everything, although there
>> are some cases that we'd like to use cached (but have not merged
>> patches that would give userspace a way to flush/invalidate)
>> 
>> BR,
>> -R
> 
> LLC/system cache doesn't have a relationship with the CPU cache.  Its 
> just a
> little accelerator that sits on the connection from the GPU to DDR and 
> caches
> accesses. The hint that Sai is suggesting is used to mark the buffers 
> as
> 'no-write-allocate' to prevent GPU write operations from being cached 
> in the LLC
> which a) isn't interesting and b) takes up cache space for read 
> operations.
> 
> Its easiest to think of the LLC as a bonus accelerator that has no cost 
> for
> us to use outside of the unfortunate per buffer hint.
> 
> We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> different hint) and in that case we have all of concerns that Will 
> identified.
> 

For mismatched outer cacheability attributes which Will mentioned, I was
referring to [1] in android kernel.

[1] https://android-review.googlesource.com/c/kernel/common/+/1549097/3

Thanks,
Sai
Sai Prakash Ranjan Feb. 2, 2021, 6:28 a.m. UTC | #7
On 2021-02-01 23:50, Jordan Crouse wrote:
> On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
>> On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>> >
>> > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
>> > > On 2021-01-29 14:35, Will Deacon wrote:
>> > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>> > > > > Add a new page protection flag IOMMU_LLC which can be used
>> > > > > by non-coherent masters to set cacheable memory attributes
>> > > > > for an outer level of cache called as last-level cache or
>> > > > > system cache. Initial user of this page protection flag is
>> > > > > the adreno gpu and then can later be used by other clients
>> > > > > such as video where this can be used for per-buffer based
>> > > > > mapping.
>> > > > >
>> > > > > Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> > > > > ---
>> > > > >  drivers/iommu/io-pgtable-arm.c | 3 +++
>> > > > >  include/linux/iommu.h          | 6 ++++++
>> > > > >  2 files changed, 9 insertions(+)
>> > > > >
>> > > > > diff --git a/drivers/iommu/io-pgtable-arm.c
>> > > > > b/drivers/iommu/io-pgtable-arm.c
>> > > > > index 7439ee7fdcdb..ebe653ef601b 100644
>> > > > > --- a/drivers/iommu/io-pgtable-arm.c
>> > > > > +++ b/drivers/iommu/io-pgtable-arm.c
>> > > > > @@ -415,6 +415,9 @@ static arm_lpae_iopte
>> > > > > arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>> > > > >           else if (prot & IOMMU_CACHE)
>> > > > >                   pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
>> > > > >                           << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>> > > > > +         else if (prot & IOMMU_LLC)
>> > > > > +                 pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
>> > > > > +                         << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>> > > > >   }
>> > > > >
>> > > > >   if (prot & IOMMU_CACHE)
>> > > > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
>> > > > > index ffaa389ea128..1f82057df531 100644
>> > > > > --- a/include/linux/iommu.h
>> > > > > +++ b/include/linux/iommu.h
>> > > > > @@ -31,6 +31,12 @@
>> > > > >   * if the IOMMU page table format is equivalent.
>> > > > >   */
>> > > > >  #define IOMMU_PRIV       (1 << 5)
>> > > > > +/*
>> > > > > + * Non-coherent masters can use this page protection flag to set
>> > > > > cacheable
>> > > > > + * memory attributes for only a transparent outer level of cache,
>> > > > > also known as
>> > > > > + * the last-level or system cache.
>> > > > > + */
>> > > > > +#define IOMMU_LLC        (1 << 6)
>> > > >
>> > > > On reflection, I'm a bit worried about exposing this because I think it
>> > > > will
>> > > > introduce a mismatched virtual alias with the CPU (we don't even have a
>> > > > MAIR
>> > > > set up for this memory type). Now, we also have that issue for the PTW,
>> > > > but
>> > > > since we always use cache maintenance (i.e. the streaming API) for
>> > > > publishing the page-tables to a non-coheren walker, it works out.
>> > > > However,
>> > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
>> > > > allocation, then they're potentially in for a nasty surprise due to the
>> > > > mismatched outer-cacheability attributes.
>> > > >
>> > >
>> > > Can't we add the syscached memory type similar to what is done on android?
>> >
>> > Maybe. How does the GPU driver map these things on the CPU side?
>> 
>> Currently we use writecombine mappings for everything, although there
>> are some cases that we'd like to use cached (but have not merged
>> patches that would give userspace a way to flush/invalidate)
>> 
>> BR,
>> -R
> 
> LLC/system cache doesn't have a relationship with the CPU cache.  Its 
> just a
> little accelerator that sits on the connection from the GPU to DDR and 
> caches
> accesses. The hint that Sai is suggesting is used to mark the buffers 
> as
> 'no-write-allocate' to prevent GPU write operations from being cached 
> in the LLC
> which a) isn't interesting and b) takes up cache space for read 
> operations.
> 
> Its easiest to think of the LLC as a bonus accelerator that has no cost 
> for
> us to use outside of the unfortunate per buffer hint.
> 
> We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> different hint) and in that case we have all of concerns that Will 
> identified.
> 

For mismatched outer cacheability attributes which Will mentioned, I was
referring to [1] in android kernel.

[1] https://android-review.googlesource.com/c/kernel/common/+/1549097/3

Thanks,
Sai
Will Deacon Feb. 3, 2021, 9:46 p.m. UTC | #8
On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> On 2021-02-01 23:50, Jordan Crouse wrote:
> > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > +#define IOMMU_LLC        (1 << 6)
> > > > > >
> > > > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > > > will
> > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > > > MAIR
> > > > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > > > but
> > > > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > > > However,
> > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > > > mismatched outer-cacheability attributes.
> > > > > >
> > > > >
> > > > > Can't we add the syscached memory type similar to what is done on android?
> > > >
> > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > 
> > > Currently we use writecombine mappings for everything, although there
> > > are some cases that we'd like to use cached (but have not merged
> > > patches that would give userspace a way to flush/invalidate)
> > > 
> > 
> > LLC/system cache doesn't have a relationship with the CPU cache.  Its
> > just a
> > little accelerator that sits on the connection from the GPU to DDR and
> > caches
> > accesses. The hint that Sai is suggesting is used to mark the buffers as
> > 'no-write-allocate' to prevent GPU write operations from being cached in
> > the LLC
> > which a) isn't interesting and b) takes up cache space for read
> > operations.
> > 
> > Its easiest to think of the LLC as a bonus accelerator that has no cost
> > for
> > us to use outside of the unfortunate per buffer hint.
> > 
> > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> > different hint) and in that case we have all of concerns that Will
> > identified.
> > 
> 
> For mismatched outer cacheability attributes which Will mentioned, I was
> referring to [1] in android kernel.

I've lost track of the conversation here :/

When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
into the CPU and with what attributes? Rob said "writecombine for
everything" -- does that mean ioremap_wc() / MEMREMAP_WC?

Finally, we need to be careful when we use the word "hint" as "allocation
hint" has a specific meaning in the architecture, and if we only mismatch on
those then we're actually ok. But I think IOMMU_LLC is more than just a
hint, since it actually drives eviction policy (i.e. it enables writeback).

Sorry for the pedantry, but I just want to make sure we're all talking
about the same things!

Cheers,

Will
Rob Clark Feb. 3, 2021, 10:14 p.m. UTC | #9
On Wed, Feb 3, 2021 at 1:46 PM Will Deacon <will@kernel.org> wrote:
>
> On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> > On 2021-02-01 23:50, Jordan Crouse wrote:
> > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > +#define IOMMU_LLC        (1 << 6)
> > > > > > >
> > > > > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > > > > will
> > > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > > > > MAIR
> > > > > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > > > > but
> > > > > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > > > > However,
> > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > > > > mismatched outer-cacheability attributes.
> > > > > > >
> > > > > >
> > > > > > Can't we add the syscached memory type similar to what is done on android?
> > > > >
> > > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > >
> > > > Currently we use writecombine mappings for everything, although there
> > > > are some cases that we'd like to use cached (but have not merged
> > > > patches that would give userspace a way to flush/invalidate)
> > > >
> > >
> > > LLC/system cache doesn't have a relationship with the CPU cache.  Its
> > > just a
> > > little accelerator that sits on the connection from the GPU to DDR and
> > > caches
> > > accesses. The hint that Sai is suggesting is used to mark the buffers as
> > > 'no-write-allocate' to prevent GPU write operations from being cached in
> > > the LLC
> > > which a) isn't interesting and b) takes up cache space for read
> > > operations.
> > >
> > > Its easiest to think of the LLC as a bonus accelerator that has no cost
> > > for
> > > us to use outside of the unfortunate per buffer hint.
> > >
> > > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> > > different hint) and in that case we have all of concerns that Will
> > > identified.
> > >
> >
> > For mismatched outer cacheability attributes which Will mentioned, I was
> > referring to [1] in android kernel.
>
> I've lost track of the conversation here :/
>
> When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
> into the CPU and with what attributes? Rob said "writecombine for
> everything" -- does that mean ioremap_wc() / MEMREMAP_WC?

Currently userspace asks for everything WC, so pgprot_writecombine()

The kernel doesn't enforce this, but so far provides no UAPI to do
anything useful with non-coherent cached mappings (although there is
interest to support this)

BR,
-R

> Finally, we need to be careful when we use the word "hint" as "allocation
> hint" has a specific meaning in the architecture, and if we only mismatch on
> those then we're actually ok. But I think IOMMU_LLC is more than just a
> hint, since it actually drives eviction policy (i.e. it enables writeback).
>
> Sorry for the pedantry, but I just want to make sure we're all talking
> about the same things!
>
> Cheers,
>
> Will
Sai Prakash Ranjan Feb. 5, 2021, 12:08 p.m. UTC | #10
On 2021-02-04 03:16, Will Deacon wrote:
> On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
>> On 2021-02-01 23:50, Jordan Crouse wrote:
>> > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
>> > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>> > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
>> > > > > On 2021-01-29 14:35, Will Deacon wrote:
>> > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>> > > > > > > +#define IOMMU_LLC        (1 << 6)
>> > > > > >
>> > > > > > On reflection, I'm a bit worried about exposing this because I think it
>> > > > > > will
>> > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
>> > > > > > MAIR
>> > > > > > set up for this memory type). Now, we also have that issue for the PTW,
>> > > > > > but
>> > > > > > since we always use cache maintenance (i.e. the streaming API) for
>> > > > > > publishing the page-tables to a non-coheren walker, it works out.
>> > > > > > However,
>> > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
>> > > > > > allocation, then they're potentially in for a nasty surprise due to the
>> > > > > > mismatched outer-cacheability attributes.
>> > > > > >
>> > > > >
>> > > > > Can't we add the syscached memory type similar to what is done on android?
>> > > >
>> > > > Maybe. How does the GPU driver map these things on the CPU side?
>> > >
>> > > Currently we use writecombine mappings for everything, although there
>> > > are some cases that we'd like to use cached (but have not merged
>> > > patches that would give userspace a way to flush/invalidate)
>> > >
>> >
>> > LLC/system cache doesn't have a relationship with the CPU cache.  Its
>> > just a
>> > little accelerator that sits on the connection from the GPU to DDR and
>> > caches
>> > accesses. The hint that Sai is suggesting is used to mark the buffers as
>> > 'no-write-allocate' to prevent GPU write operations from being cached in
>> > the LLC
>> > which a) isn't interesting and b) takes up cache space for read
>> > operations.
>> >
>> > Its easiest to think of the LLC as a bonus accelerator that has no cost
>> > for
>> > us to use outside of the unfortunate per buffer hint.
>> >
>> > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
>> > different hint) and in that case we have all of concerns that Will
>> > identified.
>> >
>> 
>> For mismatched outer cacheability attributes which Will mentioned, I 
>> was
>> referring to [1] in android kernel.
> 
> I've lost track of the conversation here :/
> 
> When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also 
> mapped
> into the CPU and with what attributes? Rob said "writecombine for
> everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
> 

Rob answered this.

> Finally, we need to be careful when we use the word "hint" as 
> "allocation
> hint" has a specific meaning in the architecture, and if we only 
> mismatch on
> those then we're actually ok. But I think IOMMU_LLC is more than just a
> hint, since it actually drives eviction policy (i.e. it enables 
> writeback).
> 
> Sorry for the pedantry, but I just want to make sure we're all talking
> about the same things!
> 

Sorry for the confusion which probably was caused by my mentioning of
android, NWA(no write allocate) is an allocation hint which we can 
ignore
for now as it is not introduced yet in upstream.

Thanks,
Sai
Sai Prakash Ranjan March 9, 2021, 6:40 a.m. UTC | #11
Hi,

On 2021-02-05 17:38, Sai Prakash Ranjan wrote:
> On 2021-02-04 03:16, Will Deacon wrote:
>> On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
>>> On 2021-02-01 23:50, Jordan Crouse wrote:
>>> > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
>>> > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>>> > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
>>> > > > > On 2021-01-29 14:35, Will Deacon wrote:
>>> > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>>> > > > > > > +#define IOMMU_LLC        (1 << 6)
>>> > > > > >
>>> > > > > > On reflection, I'm a bit worried about exposing this because I think it
>>> > > > > > will
>>> > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
>>> > > > > > MAIR
>>> > > > > > set up for this memory type). Now, we also have that issue for the PTW,
>>> > > > > > but
>>> > > > > > since we always use cache maintenance (i.e. the streaming API) for
>>> > > > > > publishing the page-tables to a non-coheren walker, it works out.
>>> > > > > > However,
>>> > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
>>> > > > > > allocation, then they're potentially in for a nasty surprise due to the
>>> > > > > > mismatched outer-cacheability attributes.
>>> > > > > >
>>> > > > >
>>> > > > > Can't we add the syscached memory type similar to what is done on android?
>>> > > >
>>> > > > Maybe. How does the GPU driver map these things on the CPU side?
>>> > >
>>> > > Currently we use writecombine mappings for everything, although there
>>> > > are some cases that we'd like to use cached (but have not merged
>>> > > patches that would give userspace a way to flush/invalidate)
>>> > >
>>> >
>>> > LLC/system cache doesn't have a relationship with the CPU cache.  Its
>>> > just a
>>> > little accelerator that sits on the connection from the GPU to DDR and
>>> > caches
>>> > accesses. The hint that Sai is suggesting is used to mark the buffers as
>>> > 'no-write-allocate' to prevent GPU write operations from being cached in
>>> > the LLC
>>> > which a) isn't interesting and b) takes up cache space for read
>>> > operations.
>>> >
>>> > Its easiest to think of the LLC as a bonus accelerator that has no cost
>>> > for
>>> > us to use outside of the unfortunate per buffer hint.
>>> >
>>> > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
>>> > different hint) and in that case we have all of concerns that Will
>>> > identified.
>>> >
>>> 
>>> For mismatched outer cacheability attributes which Will mentioned, I 
>>> was
>>> referring to [1] in android kernel.
>> 
>> I've lost track of the conversation here :/
>> 
>> When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also 
>> mapped
>> into the CPU and with what attributes? Rob said "writecombine for
>> everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
>> 
> 
> Rob answered this.
> 
>> Finally, we need to be careful when we use the word "hint" as 
>> "allocation
>> hint" has a specific meaning in the architecture, and if we only 
>> mismatch on
>> those then we're actually ok. But I think IOMMU_LLC is more than just 
>> a
>> hint, since it actually drives eviction policy (i.e. it enables 
>> writeback).
>> 
>> Sorry for the pedantry, but I just want to make sure we're all talking
>> about the same things!
>> 
> 
> Sorry for the confusion which probably was caused by my mentioning of
> android, NWA(no write allocate) is an allocation hint which we can 
> ignore
> for now as it is not introduced yet in upstream.
> 

Any chance of taking this forward? We do not want to miss out on small 
fps
gain when the product gets released.

Thanks,
Sai
Rob Clark March 16, 2021, 5:04 p.m. UTC | #12
On Wed, Feb 3, 2021 at 2:14 PM Rob Clark <robdclark@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 1:46 PM Will Deacon <will@kernel.org> wrote:
> >
> > On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> > > On 2021-02-01 23:50, Jordan Crouse wrote:
> > > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> > > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > > +#define IOMMU_LLC        (1 << 6)
> > > > > > > >
> > > > > > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > > > > > will
> > > > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > > > > > MAIR
> > > > > > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > > > > > but
> > > > > > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > > > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > > > > > However,
> > > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > > > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > > > > > mismatched outer-cacheability attributes.
> > > > > > > >
> > > > > > >
> > > > > > > Can't we add the syscached memory type similar to what is done on android?
> > > > > >
> > > > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > > >
> > > > > Currently we use writecombine mappings for everything, although there
> > > > > are some cases that we'd like to use cached (but have not merged
> > > > > patches that would give userspace a way to flush/invalidate)
> > > > >
> > > >
> > > > LLC/system cache doesn't have a relationship with the CPU cache.  Its
> > > > just a
> > > > little accelerator that sits on the connection from the GPU to DDR and
> > > > caches
> > > > accesses. The hint that Sai is suggesting is used to mark the buffers as
> > > > 'no-write-allocate' to prevent GPU write operations from being cached in
> > > > the LLC
> > > > which a) isn't interesting and b) takes up cache space for read
> > > > operations.
> > > >
> > > > Its easiest to think of the LLC as a bonus accelerator that has no cost
> > > > for
> > > > us to use outside of the unfortunate per buffer hint.
> > > >
> > > > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> > > > different hint) and in that case we have all of concerns that Will
> > > > identified.
> > > >
> > >
> > > For mismatched outer cacheability attributes which Will mentioned, I was
> > > referring to [1] in android kernel.
> >
> > I've lost track of the conversation here :/
> >
> > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
> > into the CPU and with what attributes? Rob said "writecombine for
> > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
>
> Currently userspace asks for everything WC, so pgprot_writecombine()
>
> The kernel doesn't enforce this, but so far provides no UAPI to do
> anything useful with non-coherent cached mappings (although there is
> interest to support this)
>

btw, I'm looking at a benchmark (gl_driver2_off) where (after some
other in-flight optimizations land) we end up bottlenecked on writing
to WC cmdstream buffers.  I assume in the current state, WC goes all
the way to main memory rather than just to system cache?

BR,
-R

> BR,
> -R
>
> > Finally, we need to be careful when we use the word "hint" as "allocation
> > hint" has a specific meaning in the architecture, and if we only mismatch on
> > those then we're actually ok. But I think IOMMU_LLC is more than just a
> > hint, since it actually drives eviction policy (i.e. it enables writeback).
> >
> > Sorry for the pedantry, but I just want to make sure we're all talking
> > about the same things!
> >
> > Cheers,
> >
> > Will
Rob Clark March 16, 2021, 5:16 p.m. UTC | #13
On Tue, Mar 16, 2021 at 10:04 AM Rob Clark <robdclark@gmail.com> wrote:
>
> On Wed, Feb 3, 2021 at 2:14 PM Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Wed, Feb 3, 2021 at 1:46 PM Will Deacon <will@kernel.org> wrote:
> > >
> > > On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> > > > On 2021-02-01 23:50, Jordan Crouse wrote:
> > > > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> > > > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > > > +#define IOMMU_LLC        (1 << 6)
> > > > > > > > >
> > > > > > > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > > > > > > will
> > > > > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > > > > > > MAIR
> > > > > > > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > > > > > > but
> > > > > > > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > > > > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > > > > > > However,
> > > > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > > > > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > > > > > > mismatched outer-cacheability attributes.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can't we add the syscached memory type similar to what is done on android?
> > > > > > >
> > > > > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > > > >
> > > > > > Currently we use writecombine mappings for everything, although there
> > > > > > are some cases that we'd like to use cached (but have not merged
> > > > > > patches that would give userspace a way to flush/invalidate)
> > > > > >
> > > > >
> > > > > LLC/system cache doesn't have a relationship with the CPU cache.  Its
> > > > > just a
> > > > > little accelerator that sits on the connection from the GPU to DDR and
> > > > > caches
> > > > > accesses. The hint that Sai is suggesting is used to mark the buffers as
> > > > > 'no-write-allocate' to prevent GPU write operations from being cached in
> > > > > the LLC
> > > > > which a) isn't interesting and b) takes up cache space for read
> > > > > operations.
> > > > >
> > > > > Its easiest to think of the LLC as a bonus accelerator that has no cost
> > > > > for
> > > > > us to use outside of the unfortunate per buffer hint.
> > > > >
> > > > > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> > > > > different hint) and in that case we have all of concerns that Will
> > > > > identified.
> > > > >
> > > >
> > > > For mismatched outer cacheability attributes which Will mentioned, I was
> > > > referring to [1] in android kernel.
> > >
> > > I've lost track of the conversation here :/
> > >
> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
> > > into the CPU and with what attributes? Rob said "writecombine for
> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
> >
> > Currently userspace asks for everything WC, so pgprot_writecombine()
> >
> > The kernel doesn't enforce this, but so far provides no UAPI to do
> > anything useful with non-coherent cached mappings (although there is
> > interest to support this)
> >
>
> btw, I'm looking at a benchmark (gl_driver2_off) where (after some
> other in-flight optimizations land) we end up bottlenecked on writing
> to WC cmdstream buffers.  I assume in the current state, WC goes all
> the way to main memory rather than just to system cache?
>

oh, I guess this (mentioned earlier in thread) is what I really want
for this benchmark:

https://android-review.googlesource.com/c/kernel/common/+/1549097/3

> BR,
> -R
>
> > BR,
> > -R
> >
> > > Finally, we need to be careful when we use the word "hint" as "allocation
> > > hint" has a specific meaning in the architecture, and if we only mismatch on
> > > those then we're actually ok. But I think IOMMU_LLC is more than just a
> > > hint, since it actually drives eviction policy (i.e. it enables writeback).
> > >
> > > Sorry for the pedantry, but I just want to make sure we're all talking
> > > about the same things!
> > >
> > > Cheers,
> > >
> > > Will
Sai Prakash Ranjan March 17, 2021, 9:33 a.m. UTC | #14
Hi Rob,

On 2021-03-16 22:46, Rob Clark wrote:

<snip>...

>> > >
>> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
>> > > into the CPU and with what attributes? Rob said "writecombine for
>> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
>> >
>> > Currently userspace asks for everything WC, so pgprot_writecombine()
>> >
>> > The kernel doesn't enforce this, but so far provides no UAPI to do
>> > anything useful with non-coherent cached mappings (although there is
>> > interest to support this)
>> >
>> 
>> btw, I'm looking at a benchmark (gl_driver2_off) where (after some
>> other in-flight optimizations land) we end up bottlenecked on writing
>> to WC cmdstream buffers.  I assume in the current state, WC goes all
>> the way to main memory rather than just to system cache?
>> 
> 
> oh, I guess this (mentioned earlier in thread) is what I really want
> for this benchmark:
> 
> https://android-review.googlesource.com/c/kernel/common/+/1549097/3
> 

You can also check if the system cache lines are allocated for GPU
or not with patch in https://crrev.com/c/2766723

With the above patch applied,
cat /sys/kernel/debug/llcc_stats/llcc_scid_status

The SCIDs for GPU are listed in include/linux/soc/qcom/llcc-qcom.h

Thanks,
Sai
Will Deacon March 25, 2021, 5:33 p.m. UTC | #15
On Tue, Mar 09, 2021 at 12:10:44PM +0530, Sai Prakash Ranjan wrote:
> On 2021-02-05 17:38, Sai Prakash Ranjan wrote:
> > On 2021-02-04 03:16, Will Deacon wrote:
> > > On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> > > > On 2021-02-01 23:50, Jordan Crouse wrote:
> > > > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
> > > > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
> > > > > > > > > > +#define IOMMU_LLC        (1 << 6)
> > > > > > > > >
> > > > > > > > > On reflection, I'm a bit worried about exposing this because I think it
> > > > > > > > > will
> > > > > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
> > > > > > > > > MAIR
> > > > > > > > > set up for this memory type). Now, we also have that issue for the PTW,
> > > > > > > > > but
> > > > > > > > > since we always use cache maintenance (i.e. the streaming API) for
> > > > > > > > > publishing the page-tables to a non-coheren walker, it works out.
> > > > > > > > > However,
> > > > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
> > > > > > > > > allocation, then they're potentially in for a nasty surprise due to the
> > > > > > > > > mismatched outer-cacheability attributes.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can't we add the syscached memory type similar to what is done on android?
> > > > > > >
> > > > > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > > > >
> > > > > > Currently we use writecombine mappings for everything, although there
> > > > > > are some cases that we'd like to use cached (but have not merged
> > > > > > patches that would give userspace a way to flush/invalidate)
> > > > > >
> > > > >
> > > > > LLC/system cache doesn't have a relationship with the CPU cache.  Its
> > > > > just a
> > > > > little accelerator that sits on the connection from the GPU to DDR and
> > > > > caches
> > > > > accesses. The hint that Sai is suggesting is used to mark the buffers as
> > > > > 'no-write-allocate' to prevent GPU write operations from being cached in
> > > > > the LLC
> > > > > which a) isn't interesting and b) takes up cache space for read
> > > > > operations.
> > > > >
> > > > > Its easiest to think of the LLC as a bonus accelerator that has no cost
> > > > > for
> > > > > us to use outside of the unfortunate per buffer hint.
> > > > >
> > > > > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
> > > > > different hint) and in that case we have all of concerns that Will
> > > > > identified.
> > > > >
> > > > 
> > > > For mismatched outer cacheability attributes which Will
> > > > mentioned, I was
> > > > referring to [1] in android kernel.
> > > 
> > > I've lost track of the conversation here :/
> > > 
> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also
> > > mapped
> > > into the CPU and with what attributes? Rob said "writecombine for
> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
> > > 
> > 
> > Rob answered this.
> > 
> > > Finally, we need to be careful when we use the word "hint" as
> > > "allocation
> > > hint" has a specific meaning in the architecture, and if we only
> > > mismatch on
> > > those then we're actually ok. But I think IOMMU_LLC is more than
> > > just a
> > > hint, since it actually drives eviction policy (i.e. it enables
> > > writeback).
> > > 
> > > Sorry for the pedantry, but I just want to make sure we're all talking
> > > about the same things!
> > > 
> > 
> > Sorry for the confusion which probably was caused by my mentioning of
> > android, NWA(no write allocate) is an allocation hint which we can
> > ignore
> > for now as it is not introduced yet in upstream.
> > 
> 
> Any chance of taking this forward? We do not want to miss out on small fps
> gain when the product gets released.

Do we have a solution to the mismatched virtual alias?

Will
Rob Clark March 25, 2021, 6:36 p.m. UTC | #16
On Wed, Mar 17, 2021 at 2:33 AM Sai Prakash Ranjan
<saiprakash.ranjan@codeaurora.org> wrote:
>
> Hi Rob,
>
> On 2021-03-16 22:46, Rob Clark wrote:
>
> <snip>...
>
> >> > >
> >> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also mapped
> >> > > into the CPU and with what attributes? Rob said "writecombine for
> >> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
> >> >
> >> > Currently userspace asks for everything WC, so pgprot_writecombine()
> >> >
> >> > The kernel doesn't enforce this, but so far provides no UAPI to do
> >> > anything useful with non-coherent cached mappings (although there is
> >> > interest to support this)
> >> >
> >>
> >> btw, I'm looking at a benchmark (gl_driver2_off) where (after some
> >> other in-flight optimizations land) we end up bottlenecked on writing
> >> to WC cmdstream buffers.  I assume in the current state, WC goes all
> >> the way to main memory rather than just to system cache?
> >>
> >
> > oh, I guess this (mentioned earlier in thread) is what I really want
> > for this benchmark:
> >
> > https://android-review.googlesource.com/c/kernel/common/+/1549097/3
> >
>
> You can also check if the system cache lines are allocated for GPU
> or not with patch in https://crrev.com/c/2766723
>
> With the above patch applied,
> cat /sys/kernel/debug/llcc_stats/llcc_scid_status
>
> The SCIDs for GPU are listed in include/linux/soc/qcom/llcc-qcom.h
>

Actually for the benchmark I was referring to, it is the *CPU*
bottlenecked on writes to writecombine mappings.. so I think what I
want is for CPU mappings to be able to use systemcache..

BR,
-R
Sai Prakash Ranjan June 30, 2021, 10:07 a.m. UTC | #17
Hi Will,

On 2021-03-25 23:03, Will Deacon wrote:
> On Tue, Mar 09, 2021 at 12:10:44PM +0530, Sai Prakash Ranjan wrote:
>> On 2021-02-05 17:38, Sai Prakash Ranjan wrote:
>> > On 2021-02-04 03:16, Will Deacon wrote:
>> > > On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
>> > > > On 2021-02-01 23:50, Jordan Crouse wrote:
>> > > > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
>> > > > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon <will@kernel.org> wrote:
>> > > > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan wrote:
>> > > > > > > > On 2021-01-29 14:35, Will Deacon wrote:
>> > > > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan wrote:
>> > > > > > > > > > +#define IOMMU_LLC        (1 << 6)
>> > > > > > > > >
>> > > > > > > > > On reflection, I'm a bit worried about exposing this because I think it
>> > > > > > > > > will
>> > > > > > > > > introduce a mismatched virtual alias with the CPU (we don't even have a
>> > > > > > > > > MAIR
>> > > > > > > > > set up for this memory type). Now, we also have that issue for the PTW,
>> > > > > > > > > but
>> > > > > > > > > since we always use cache maintenance (i.e. the streaming API) for
>> > > > > > > > > publishing the page-tables to a non-coheren walker, it works out.
>> > > > > > > > > However,
>> > > > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API coherent
>> > > > > > > > > allocation, then they're potentially in for a nasty surprise due to the
>> > > > > > > > > mismatched outer-cacheability attributes.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Can't we add the syscached memory type similar to what is done on android?
>> > > > > > >
>> > > > > > > Maybe. How does the GPU driver map these things on the CPU side?
>> > > > > >
>> > > > > > Currently we use writecombine mappings for everything, although there
>> > > > > > are some cases that we'd like to use cached (but have not merged
>> > > > > > patches that would give userspace a way to flush/invalidate)
>> > > > > >
>> > > > >
>> > > > > LLC/system cache doesn't have a relationship with the CPU cache.  Its
>> > > > > just a
>> > > > > little accelerator that sits on the connection from the GPU to DDR and
>> > > > > caches
>> > > > > accesses. The hint that Sai is suggesting is used to mark the buffers as
>> > > > > 'no-write-allocate' to prevent GPU write operations from being cached in
>> > > > > the LLC
>> > > > > which a) isn't interesting and b) takes up cache space for read
>> > > > > operations.
>> > > > >
>> > > > > Its easiest to think of the LLC as a bonus accelerator that has no cost
>> > > > > for
>> > > > > us to use outside of the unfortunate per buffer hint.
>> > > > >
>> > > > > We do have to worry about the CPU cache w.r.t I/O coherency (which is a
>> > > > > different hint) and in that case we have all of concerns that Will
>> > > > > identified.
>> > > > >
>> > > >
>> > > > For mismatched outer cacheability attributes which Will
>> > > > mentioned, I was
>> > > > referring to [1] in android kernel.
>> > >
>> > > I've lost track of the conversation here :/
>> > >
>> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also
>> > > mapped
>> > > into the CPU and with what attributes? Rob said "writecombine for
>> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
>> > >
>> >
>> > Rob answered this.
>> >
>> > > Finally, we need to be careful when we use the word "hint" as
>> > > "allocation
>> > > hint" has a specific meaning in the architecture, and if we only
>> > > mismatch on
>> > > those then we're actually ok. But I think IOMMU_LLC is more than
>> > > just a
>> > > hint, since it actually drives eviction policy (i.e. it enables
>> > > writeback).
>> > >
>> > > Sorry for the pedantry, but I just want to make sure we're all talking
>> > > about the same things!
>> > >
>> >
>> > Sorry for the confusion which probably was caused by my mentioning of
>> > android, NWA(no write allocate) is an allocation hint which we can
>> > ignore
>> > for now as it is not introduced yet in upstream.
>> >
>> 
>> Any chance of taking this forward? We do not want to miss out on small 
>> fps
>> gain when the product gets released.
> 
> Do we have a solution to the mismatched virtual alias?
> 

Sorry for the long delay on this thread.

For mismatched virtual alias question, wasn't this already discussed in 
stretch
when initial support for system cache [1] (which was reverted by you) 
was added?

Excerpt from there,

"As seen in downstream kernels there are few non-coherent devices which
would not want to allocate in system cache, and therefore would want
Inner/Outer non-cached memory. So, we may want to either override the
attributes per-device, or as you suggested we may want to introduce
another memory type 'sys-cached' that can be added with its separate
infra."

As for DMA API usage, we do not have any upstream users (video will be
one if they decide to upstream that).

[1] 
https://patchwork.kernel.org/project/linux-arm-msm/patch/20180615105329.26800-1-vivek.gautam@codeaurora.org/

Thanks,
Sai
diff mbox series

Patch

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 7439ee7fdcdb..ebe653ef601b 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -415,6 +415,9 @@  static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 		else if (prot & IOMMU_CACHE)
 			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
 				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
+		else if (prot & IOMMU_LLC)
+			pte |= (ARM_LPAE_MAIR_ATTR_IDX_INC_OCACHE
+				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
 	}
 
 	if (prot & IOMMU_CACHE)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index ffaa389ea128..1f82057df531 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -31,6 +31,12 @@ 
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV	(1 << 5)
+/*
+ * Non-coherent masters can use this page protection flag to set cacheable
+ * memory attributes for only a transparent outer level of cache, also known as
+ * the last-level or system cache.
+ */
+#define IOMMU_LLC	(1 << 6)
 
 struct iommu_ops;
 struct iommu_group;