[1/2] drm: add cache support for arm64

Message ID	20190805211451.20176-1-robdclark@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Rob Clark <robdclark@gmail.com> To: dri-devel@lists.freedesktop.org Subject: [PATCH 1/2] drm: add cache support for arm64 Date: Mon, 5 Aug 2019 14:14:33 -0700 Message-Id: <20190805211451.20176-1-robdclark@gmail.com> MIME-Version: 1.0 X-Mailman-Original-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:mime-version :content-transfer-encoding; bh=btriHWYQu73nHxyRj0uULDI0WFupXKYE8nFpqRhUi5E=; b=a62gzBHN8zDS50OBhnDV1V4x7nXoRywHSroTQoVByDczAGXV3BipxjPTq01SrkbTub agEwfMmn44cmWjaxwPjM6J6JZpjx8WmyJxEXRnDXjNclOKeWLXUkdJVEX/MQCLvJBqtC K1ZxpmOSHoJ9dVfT1c362g+35YgZNJ24X+8fEyjxPJRcu8l+xY8PVNESw9BGEXx1nU94 wqiQun9OUEuGzEM7cth+AU5ej7W/2Q3sElHfUdFCQU068818rs5agvFXmicnwPaMJR6t 5IAlapgFz9ED0//3XPA1KmOrrKMINrpJTfgOu4sGbvQe4C1K91C5/vchqTWaSGlR7ZtT i/gw== Precedence: list Cc: Rob Clark <robdclark@chromium.org>, Maxime Ripard <maxime.ripard@bootlin.com>, Catalin Marinas <catalin.marinas@arm.com>, David Airlie <airlied@linux.ie>, linux-kernel@vger.kernel.org, Sean Paul <sean@poorly.run>, linux-arm-kernel@lists.infradead.org, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner <tglx@linutronix.de>, Will Deacon <will@kernel.org>, Christoph Hellwig <hch@lst.de>, Allison Randal <allison@lohutok.net> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[1/2] drm: add cache support for arm64 \| expand [1/2] drm: add cache support for arm64 [2/2] drm/msm: use drm_cache when available

Rob Clark Aug. 5, 2019, 9:14 p.m. UTC

From: Rob Clark <robdclark@chromium.org>

This will let drm/msm stop abusing DMA API for cache ops, at least for
arm64.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 arch/arm64/mm/flush.c       |  2 ++
 drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
 include/drm/drm_cache.h     |  4 ++++
 3 files changed, 23 insertions(+), 3 deletions(-)

Daniel Vetter Aug. 6, 2019, 9:38 a.m. UTC | #1

On Tue, Aug 06, 2019 at 10:48:21AM +0200, Christoph Hellwig wrote:
> This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> get rid of, not add use of it.  The reason for that is two-fold:
> 
>  a) it doesn't address how cache maintaince actually works in most
>     platforms.  When talking about a cache we three fundamental operations:
> 
> 	1) write back - this writes the content of the cache back to the
> 	   backing memory
> 	2) invalidate - this remove the content of the cache
> 	3) write back + invalidate - do both of the above
> 
>  b) which of the above operation you use when depends on a couple of
>     factors of what you want to do with the range you do the cache
>     maintainance operations
> 
> Take a look at the comment in arch/arc/mm/dma.c around line 30 that
> explains how this applies to buffer ownership management.  Note that
> "for device" applies to "for userspace" in the same way, just that
> userspace then also needs to follow this protocol.  So the whole idea
> that random driver code calls random low-level cache maintainance
> operations (and use the non-specific term flush to make it all more
> confusing) is a bad idea.  Fortunately enough we have really good
> arch helpers for all non-coherent architectures (this excludes the
> magic i915 won't be covered by that, but that is a separate issue
> to be addressed later, and the fact that while arm32 did grew them
> very recently and doesn't expose them for all configs, which is easily
> fixable if needed) with arch_sync_dma_for_device and
> arch_sync_dma_for_cpu.  So what we need is to figure out where we
> have valid cases for buffer ownership transfer outside the DMA
> API, and build proper wrappers around the above function for that.
> My guess is it should probably be build to go with the iommu API
> as that is the only other way to map memory for DMA access, but
> if you have a better idea I'd be open to discussion.

I just read through all the arch_sync_dma_for_device/cpu functions and
none seem to use the struct *dev argument. Iirc you've said that's on the
way out?

That dev parameter is another holdup for the places where we do not yet
know what the new device will be (e.g. generic dma-buf exporters like
vgem). And sprinkling a fake dev or passing NULL is a bit silly.

Add a HAVE_ARCH_SYNC_DMA and the above refactor (assuming it's ok to roll
out everywhere) and we should indeed be able to use this. We still need to
have all the others for x86 and all that. Plus I guess we should roll out
the split into invalidate and flush.

The other bit is phys vs. virt addr confusion, but looks like standard if
phys_addr and you kmap underneath (except from drm_clflush_virt_range,
only used by i915).
-Daniel

Rob Clark Aug. 6, 2019, 2:11 p.m. UTC | #2

On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
>
> This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> get rid of, not add use of it.  The reason for that is two-fold:
>
>  a) it doesn't address how cache maintaince actually works in most
>     platforms.  When talking about a cache we three fundamental operations:
>
>         1) write back - this writes the content of the cache back to the
>            backing memory
>         2) invalidate - this remove the content of the cache
>         3) write back + invalidate - do both of the above

Agreed that drm_cflush_* isn't a great API.  In this particular case
(IIUC), I need wb+inv so that there aren't dirty cache lines that drop
out to memory later, and so that I don't get a cache hit on
uncached/wc mmap'ing.

>  b) which of the above operation you use when depends on a couple of
>     factors of what you want to do with the range you do the cache
>     maintainance operations
>
> Take a look at the comment in arch/arc/mm/dma.c around line 30 that
> explains how this applies to buffer ownership management.  Note that
> "for device" applies to "for userspace" in the same way, just that
> userspace then also needs to follow this protocol.  So the whole idea
> that random driver code calls random low-level cache maintainance
> operations (and use the non-specific term flush to make it all more
> confusing) is a bad idea.  Fortunately enough we have really good
> arch helpers for all non-coherent architectures (this excludes the
> magic i915 won't be covered by that, but that is a separate issue
> to be addressed later, and the fact that while arm32 did grew them
> very recently and doesn't expose them for all configs, which is easily
> fixable if needed) with arch_sync_dma_for_device and
> arch_sync_dma_for_cpu.  So what we need is to figure out where we
> have valid cases for buffer ownership transfer outside the DMA
> API, and build proper wrappers around the above function for that.
> My guess is it should probably be build to go with the iommu API
> as that is the only other way to map memory for DMA access, but
> if you have a better idea I'd be open to discussion.

Tying it in w/ iommu seems a bit weird to me.. but maybe that is just
me, I'm certainly willing to consider proposals or to try things and
see how they work out.

Exposing the arch_sync_* API and using that directly (bypassing
drm_cflush_*) actually seems pretty reasonable and pragmatic.  I did
have one doubt, as phys_to_virt() is only valid for kernel direct
mapped memory (AFAIU), what happens for pages that are not in kernel
linear map?  Maybe it is ok to ignore those pages, since they won't
have an aliased mapping?

BR,
-R

Mark Rutland Aug. 6, 2019, 2:34 p.m. UTC | #3

On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > get rid of, not add use of it.  The reason for that is two-fold:
> >
> >  a) it doesn't address how cache maintaince actually works in most
> >     platforms.  When talking about a cache we three fundamental operations:
> >
> >         1) write back - this writes the content of the cache back to the
> >            backing memory
> >         2) invalidate - this remove the content of the cache
> >         3) write back + invalidate - do both of the above
> 
> Agreed that drm_cflush_* isn't a great API.  In this particular case
> (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> out to memory later, and so that I don't get a cache hit on
> uncached/wc mmap'ing.

Is there a cacheable alias lying around (e.g. the linear map), or are
these addresses only mapped uncached/wc?

If there's a cacheable alias, performing an invalidate isn't sufficient,
since a CPU can allocate a new (clean) entry at any point in time (e.g.
as a result of prefetching or arbitrary speculation).

Thanks,
Mark.

Rob Clark Aug. 6, 2019, 4:23 p.m. UTC | #4

On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > out to memory later, and so that I don't get a cache hit on
> > uncached/wc mmap'ing.
>
> So what is the use case here?  Allocate pages using the page allocator
> (or CMA for that matter), and then mmaping them to userspace and never
> touching them again from the kernel?

Currently, it is pages coming from tmpfs.  Ideally we want pages that
are swappable when unpinned.

CPU mappings are *mostly* just mapping to userspace.  There are a few
exceptions that are vmap'd (fbcon, and ringbuffer).

(Eventually I'd like to support pages passed in from userspace.. but
that is down the road.)

> > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just
> > me, I'm certainly willing to consider proposals or to try things and
> > see how they work out.
>
> This was just my through as the fit seems easy.  But maybe you'll
> need to explain your use case(s) a bit more so that we can figure out
> what a good high level API is.

Tying it to iommu_map/unmap would be awkward, as we could need to
setup cpu mmap before it ends up mapped to iommu.  And the plan to
support per-process pagetables involved creating an iommu_domain per
userspace gl context.. some buffers would end up mapped into multiple
contexts/iommu_domains.

If the cache operation was detached from iommu_map/unmap, then it
would seem weird to be part of the iommu API.

I guess I'm not entirely sure what you had in mind, but this is why
iommu seemed to me like a bad fit.

> > Exposing the arch_sync_* API and using that directly (bypassing
> > drm_cflush_*) actually seems pretty reasonable and pragmatic.  I did
> > have one doubt, as phys_to_virt() is only valid for kernel direct
> > mapped memory (AFAIU), what happens for pages that are not in kernel
> > linear map?  Maybe it is ok to ignore those pages, since they won't
> > have an aliased mapping?
>
> They could have an aliased mapping in vmalloc/vmap space for example,
> if you created one.  We have the flush_kernel_vmap_range /
> invalidate_kernel_vmap_range APIs for those, that are implement
> on architectures with VIVT caches.

If I only have to worry about aliased mappings that I create myself,
that doesn't seem too bad..  I could track that myself easily enough.

BR,
-R

Rob Clark Aug. 6, 2019, 4:26 p.m. UTC | #5

On Tue, Aug 6, 2019 at 9:23 AM Rob Clark <robdclark@chromium.org> wrote:
>
> On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > out to memory later, and so that I don't get a cache hit on
> > > uncached/wc mmap'ing.
> >
> > So what is the use case here?  Allocate pages using the page allocator
> > (or CMA for that matter), and then mmaping them to userspace and never
> > touching them again from the kernel?
>
> Currently, it is pages coming from tmpfs.  Ideally we want pages that
> are swappable when unpinned.

to be more specific, pages come from
shmem_file_setup()/shmem_read_mapping_page()

BR,
-R

Rob Clark Aug. 6, 2019, 4:31 p.m. UTC | #6

On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > > get rid of, not add use of it.  The reason for that is two-fold:
> > >
> > >  a) it doesn't address how cache maintaince actually works in most
> > >     platforms.  When talking about a cache we three fundamental operations:
> > >
> > >         1) write back - this writes the content of the cache back to the
> > >            backing memory
> > >         2) invalidate - this remove the content of the cache
> > >         3) write back + invalidate - do both of the above
> >
> > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > out to memory later, and so that I don't get a cache hit on
> > uncached/wc mmap'ing.
>
> Is there a cacheable alias lying around (e.g. the linear map), or are
> these addresses only mapped uncached/wc?
>
> If there's a cacheable alias, performing an invalidate isn't sufficient,
> since a CPU can allocate a new (clean) entry at any point in time (e.g.
> as a result of prefetching or arbitrary speculation).

I *believe* that there are not alias mappings (that I don't control
myself) for pages coming from
shmem_file_setup()/shmem_read_mapping_page()..  digging around at what
dma_sync_sg_* does under the hood, it looks like it is just
arch_sync_dma_for_cpu/device(), so I guess that should be sufficient
for what I need.

There are a few buffers that I vmap for use on kernel side, but like
the userspace mmap's, the vmaps are also writecombine.

BR,
-R

Daniel Vetter Aug. 7, 2019, 8:48 a.m. UTC | #7

On Wed, Aug 7, 2019 at 8:25 AM Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote:
> > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > out to memory later, and so that I don't get a cache hit on
> > > > uncached/wc mmap'ing.
> > >
> > > So what is the use case here?  Allocate pages using the page allocator
> > > (or CMA for that matter), and then mmaping them to userspace and never
> > > touching them again from the kernel?
> >
> > Currently, it is pages coming from tmpfs.  Ideally we want pages that
> > are swappable when unpinned.
>
> tmpfs is basically a (complicated) frontend for alloc pages as far
> as page allocation is concerned.
>
> > CPU mappings are *mostly* just mapping to userspace.  There are a few
> > exceptions that are vmap'd (fbcon, and ringbuffer).
>
> And those use the same backend?
>
> > (Eventually I'd like to support pages passed in from userspace.. but
> > that is down the road.)
>
> Eww.  Please talk to the iommu list before starting on that.
>
> > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just
> > > > me, I'm certainly willing to consider proposals or to try things and
> > > > see how they work out.
> > >
> > > This was just my through as the fit seems easy.  But maybe you'll
> > > need to explain your use case(s) a bit more so that we can figure out
> > > what a good high level API is.
> >
> > Tying it to iommu_map/unmap would be awkward, as we could need to
> > setup cpu mmap before it ends up mapped to iommu.  And the plan to
> > support per-process pagetables involved creating an iommu_domain per
> > userspace gl context.. some buffers would end up mapped into multiple
> > contexts/iommu_domains.
> >
> > If the cache operation was detached from iommu_map/unmap, then it
> > would seem weird to be part of the iommu API.
> >
> > I guess I'm not entirely sure what you had in mind, but this is why
> > iommu seemed to me like a bad fit.
>
> So back to the question, I'd like to understand your use case (and
> maybe hear from the other drm folks if that is common):

Filling in a bunch more of the use-cases we have in drm. Don't need to
solve them all right away ofc, but whatever direction we're aiming for
should keep these in mind I think.

>  - you allocate pages from shmem (why shmem, btw?  if this is done by
>    other drm drivers how do they guarantee addressability without an
>    iommu?)

We use shmem to get at swappable pages. We generally just assume that
the gpu can get at those pages, but things fall apart in fun ways:
- some setups somehow inject bounce buffers. Some drivers just give
up, others try to allocate a pool of pages with dma_alloc_coherent.
- some devices are misdesigned and can't access as much as the cpu. We
allocate using GFP_DMA32 to fix that.

Also modern gpu apis pretty much assume you can malloc() and then use
that directly with the gpu.

>  - then the memory is either mapped to userspace or vmapped (or even
>    both, althrough the lack of aliasing you mentioned would speak
>    against it) as writecombine (aka arm v6+ normal uncached).  Does
>    the mapping live on until the memory is freed?

Generally we cache mappings forever. Some exceptions for 32bit
userspace excluded, where people expect to be able to use more than
4GB of textures somehow, so we have to recycle old mappings. Obviously
applies less to gpus on socs.

Also, at least i915 also has writeback userspace mmaps, and userspace
doing the cflushing. Also not sure I ever mentioned this, but at least
i915 userspace controls the coherency mode of the device dma directly,
on a per-op granularity. For buffers shared with other processes it
defers to the gpu pagetables, which the kernel controls.

>  - as you mention swapping - how do you guarantee there are no
>    aliases in the kernel direct mapping after the page has been swapped
>    in?

No idea personally, since we can ignore this on x86. I think atm
there's not a huge overlap of gpu drivers doing swapping and running
on something like arm where incompatible aliased mappings are bad.

>  - then the memory is potentially mapped to the iommu.  Is it using
>    a long-living mapping, or does get unmapped/remapped repeatedly?

Again, generally cached for as long as possible, until we run out of
space/memory somewhere.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

Mark Rutland Aug. 7, 2019, 12:38 p.m. UTC | #8

On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote:
> On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > >
> > > > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > > > get rid of, not add use of it.  The reason for that is two-fold:
> > > >
> > > >  a) it doesn't address how cache maintaince actually works in most
> > > >     platforms.  When talking about a cache we three fundamental operations:
> > > >
> > > >         1) write back - this writes the content of the cache back to the
> > > >            backing memory
> > > >         2) invalidate - this remove the content of the cache
> > > >         3) write back + invalidate - do both of the above
> > >
> > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > out to memory later, and so that I don't get a cache hit on
> > > uncached/wc mmap'ing.
> >
> > Is there a cacheable alias lying around (e.g. the linear map), or are
> > these addresses only mapped uncached/wc?
> >
> > If there's a cacheable alias, performing an invalidate isn't sufficient,
> > since a CPU can allocate a new (clean) entry at any point in time (e.g.
> > as a result of prefetching or arbitrary speculation).
> 
> I *believe* that there are not alias mappings (that I don't control
> myself) for pages coming from
> shmem_file_setup()/shmem_read_mapping_page()..  

AFAICT, that's regular anonymous memory, so there will be a cacheable
alias in the linear/direct map.

> digging around at what dma_sync_sg_* does under the hood, it looks
> like it is just arch_sync_dma_for_cpu/device(), so I guess that should
> be sufficient for what I need.

I don't think that's the case, per the example I gave above.

Thanks,
Mark.

Rob Clark Aug. 7, 2019, 4:09 p.m. UTC | #9

On Tue, Aug 6, 2019 at 11:25 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Aug 06, 2019 at 09:23:51AM -0700, Rob Clark wrote:
> > On Tue, Aug 6, 2019 at 8:50 AM Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > out to memory later, and so that I don't get a cache hit on
> > > > uncached/wc mmap'ing.
> > >
> > > So what is the use case here?  Allocate pages using the page allocator
> > > (or CMA for that matter), and then mmaping them to userspace and never
> > > touching them again from the kernel?
> >
> > Currently, it is pages coming from tmpfs.  Ideally we want pages that
> > are swappable when unpinned.
>
> tmpfs is basically a (complicated) frontend for alloc pages as far
> as page allocation is concerned.
>
> > CPU mappings are *mostly* just mapping to userspace.  There are a few
> > exceptions that are vmap'd (fbcon, and ringbuffer).
>
> And those use the same backend?

yes

> > (Eventually I'd like to support pages passed in from userspace.. but
> > that is down the road.)
>
> Eww.  Please talk to the iommu list before starting on that.

This is more of a long term goal, we can't do it until we have
per-context/process pagetables, ofc.

Getting a bit off topic, but I'm curious about what problems you are
concerned about.  Userspace can shoot it's own foot, but if it is not
sharing GPU pagetables with other processes, it can't shoot other's
feet.  (I'm guessing you are concerned about non-page-aligned
mappings?)

> > > > Tying it in w/ iommu seems a bit weird to me.. but maybe that is just
> > > > me, I'm certainly willing to consider proposals or to try things and
> > > > see how they work out.
> > >
> > > This was just my through as the fit seems easy.  But maybe you'll
> > > need to explain your use case(s) a bit more so that we can figure out
> > > what a good high level API is.
> >
> > Tying it to iommu_map/unmap would be awkward, as we could need to
> > setup cpu mmap before it ends up mapped to iommu.  And the plan to
> > support per-process pagetables involved creating an iommu_domain per
> > userspace gl context.. some buffers would end up mapped into multiple
> > contexts/iommu_domains.
> >
> > If the cache operation was detached from iommu_map/unmap, then it
> > would seem weird to be part of the iommu API.
> >
> > I guess I'm not entirely sure what you had in mind, but this is why
> > iommu seemed to me like a bad fit.
>
> So back to the question, I'd like to understand your use case (and
> maybe hear from the other drm folks if that is common):
>
>  - you allocate pages from shmem (why shmem, btw?  if this is done by
>    other drm drivers how do they guarantee addressability without an
>    iommu?)

shmem for swappable pages.  I don't unpin and let things get swapped
out yet, but I'm told it starts to become important when you have 50
browser tabs open ;-)

There are some display-only drm drivers with no IOMMU, which use CMA
rather than shmem.  Basically every real GPU has some form of MMU or
IOMMU for memory protection.  (The couple exceptions do expensive
kernel side cmdstream validation.)

>  - then the memory is either mapped to userspace or vmapped (or even
>    both, althrough the lack of aliasing you mentioned would speak
>    against it) as writecombine (aka arm v6+ normal uncached).  Does
>    the mapping live on until the memory is freed?

(side note, *most* of the drm/msm supported devices are armv8, the
exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm
will ever have to deal w/ armv6)

Userspace keeps the userspace mmap around opportunistically (once it
is mmap'd, not all buffers will be accessed from CPU).  In fact there
is a userspace buffer cache, where we try to re-use buffers that are
already allocated and possibly mmap'd, since allocation and setting up
mmap is expensive.

(There is an MADVISE ioctl so userspace can tell kernel about buffers
in the cache, which are available to be purged by shrinker..  if a
buffer is purged, it's userspace mmap is torn down... along with it's
IOMMU map, ofc)

>  - as you mention swapping - how do you guarantee there are no
>    aliases in the kernel direct mapping after the page has been swapped
>    in?

Before unpinning and allowing pages to be swapped out, CPU and IOMMU
maps would be torn down.  (I don't think we could unpin buffers w/ a
vmap, but those are just a drop in the bucket.)

Currently, the kernel always knows buffers associated w/ a submit
(job) queued to the GPU, so it could bring pages back in and re-store
iommu map.. the fault handler can be used to bring things back in for
CPU access.  (For more free-form HMM style access to userspace memory,
we'd need to be able to sleep in IOMMU fault handler before the IOMMU
resumes translation.)

As far as cache is concerned, it would be basically the same as newly
allocated pages, ie. need to wb+inv the new pages.

>  - then the memory is potentially mapped to the iommu.  Is it using
>    a long-living mapping, or does get unmapped/remapped repeatedly?

Similar to CPU maps, we keep the IOMMU map around as long as possible.

(Side note, I was thinking batching unmaps could be useful to reduce
IOMMU TLB inv overhead.. usually when we are freeing a buffer, we are
freeing multiple, so we could unmap them all, then TLB inv, then free
pages.)

BR,
-R

Rob Clark Aug. 7, 2019, 4:15 p.m. UTC | #10

On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote:
> > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > >
> > > > > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > > > > get rid of, not add use of it.  The reason for that is two-fold:
> > > > >
> > > > >  a) it doesn't address how cache maintaince actually works in most
> > > > >     platforms.  When talking about a cache we three fundamental operations:
> > > > >
> > > > >         1) write back - this writes the content of the cache back to the
> > > > >            backing memory
> > > > >         2) invalidate - this remove the content of the cache
> > > > >         3) write back + invalidate - do both of the above
> > > >
> > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > out to memory later, and so that I don't get a cache hit on
> > > > uncached/wc mmap'ing.
> > >
> > > Is there a cacheable alias lying around (e.g. the linear map), or are
> > > these addresses only mapped uncached/wc?
> > >
> > > If there's a cacheable alias, performing an invalidate isn't sufficient,
> > > since a CPU can allocate a new (clean) entry at any point in time (e.g.
> > > as a result of prefetching or arbitrary speculation).
> >
> > I *believe* that there are not alias mappings (that I don't control
> > myself) for pages coming from
> > shmem_file_setup()/shmem_read_mapping_page()..
>
> AFAICT, that's regular anonymous memory, so there will be a cacheable
> alias in the linear/direct map.

tbh, I'm not 100% sure whether there is a cacheable alias, or whether
any potential linear map is torn down.  My understanding is that a
cacheable alias is "ok", with some caveats.. ie. that the cacheable
alias is not accessed.  I'm not entirely sure about pre-fetch from
access to adjacent pages.  We have been using shmem as a source for
pages since the beginning, and I haven't seen it cause any problems in
the last 6 years.  (This is limited to armv7 and armv8, I'm not really
sure what would happen on armv6, but that is a combo I don't have to
care about.)

BR,
-R

> > digging around at what dma_sync_sg_* does under the hood, it looks
> > like it is just arch_sync_dma_for_cpu/device(), so I guess that should
> > be sufficient for what I need.
>
> I don't think that's the case, per the example I gave above.
>
> Thanks,
> Mark.

Mark Rutland Aug. 7, 2019, 4:49 p.m. UTC | #11

On Wed, Aug 07, 2019 at 09:15:54AM -0700, Rob Clark wrote:
> On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote:
> > > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > > >
> > > > > > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > > > > > get rid of, not add use of it.  The reason for that is two-fold:
> > > > > >
> > > > > >  a) it doesn't address how cache maintaince actually works in most
> > > > > >     platforms.  When talking about a cache we three fundamental operations:
> > > > > >
> > > > > >         1) write back - this writes the content of the cache back to the
> > > > > >            backing memory
> > > > > >         2) invalidate - this remove the content of the cache
> > > > > >         3) write back + invalidate - do both of the above
> > > > >
> > > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > > out to memory later, and so that I don't get a cache hit on
> > > > > uncached/wc mmap'ing.
> > > >
> > > > Is there a cacheable alias lying around (e.g. the linear map), or are
> > > > these addresses only mapped uncached/wc?
> > > >
> > > > If there's a cacheable alias, performing an invalidate isn't sufficient,
> > > > since a CPU can allocate a new (clean) entry at any point in time (e.g.
> > > > as a result of prefetching or arbitrary speculation).
> > >
> > > I *believe* that there are not alias mappings (that I don't control
> > > myself) for pages coming from
> > > shmem_file_setup()/shmem_read_mapping_page()..
> >
> > AFAICT, that's regular anonymous memory, so there will be a cacheable
> > alias in the linear/direct map.
> 
> tbh, I'm not 100% sure whether there is a cacheable alias, or whether
> any potential linear map is torn down.

I'm fairly confident that the linear/direct map cacheable alias is not
torn down when pages are allocated. The gneeric page allocation code
doesn't do so, and I see nothing the shmem code to do so.

For arm64, we can tear down portions of the linear map, but that has to
be done explicitly, and this is only possible when using rodata_full. If
not using rodata_full, it is not possible to dynamically tear down the
cacheable alias.

> My understanding is that a cacheable alias is "ok", with some
> caveats.. ie. that the cacheable alias is not accessed.  

Unfortunately, that is not true. You'll often get away with it in
practice, but that's a matter of probability rather than a guarantee.

You  cannot prevent a CPU from accessing a VA arbitrarily (e.g. as the
result of wild speculation). The ARM ARM (ARM DDI 0487E.a) points this
out explicitly:

| Entries for addresses that are Normal Cacheable can be allocated to
| the cache at any time, and so the cache invalidate instruction cannot
| ensure that the address is not present in a cache.

... along with:

| Caches introduce a number of potential problems, mainly because:
|
| * Memory accesses can occur at times other than when the programmer
|   would expect them.

;)

> I'm not entirely sure about pre-fetch from access to adjacent pages.
> We have been using shmem as a source for pages since the beginning,
> and I haven't seen it cause any problems in the last 6 years.  (This
> is limited to armv7 and armv8, I'm not really sure what would happen
> on armv6, but that is a combo I don't have to care about.)

Over time, CPUs get more aggressive w.r.t. prefetching and speculation,
so having not seen such issues in the past does not imply we won't in
future.

Anecdotally, we had an issue a few years ago that we were only able to
reproduce under heavy stress testing. A CPU would make speculative
instruction fetches from a read-destructive MMIO register, despite the
PC never going near the corresponding VA, and despite that code having
(apparently) caused no issue for years before that.

See commit:

  b6ccb9803e90c16b ("ARM: 7954/1: mm: remove remaining domain support from ARMv6")

... which unfortunately lacks the full war story.

Thanks,
Mark.

Rob Clark Aug. 7, 2019, 5:30 p.m. UTC | #12

On Wed, Aug 7, 2019 at 9:50 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Aug 07, 2019 at 09:15:54AM -0700, Rob Clark wrote:
> > On Wed, Aug 7, 2019 at 5:38 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Tue, Aug 06, 2019 at 09:31:55AM -0700, Rob Clark wrote:
> > > > On Tue, Aug 6, 2019 at 7:35 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > > > >
> > > > > On Tue, Aug 06, 2019 at 07:11:41AM -0700, Rob Clark wrote:
> > > > > > On Tue, Aug 6, 2019 at 1:48 AM Christoph Hellwig <hch@lst.de> wrote:
> > > > > > >
> > > > > > > This goes in the wrong direction.  drm_cflush_* are a bad API we need to
> > > > > > > get rid of, not add use of it.  The reason for that is two-fold:
> > > > > > >
> > > > > > >  a) it doesn't address how cache maintaince actually works in most
> > > > > > >     platforms.  When talking about a cache we three fundamental operations:
> > > > > > >
> > > > > > >         1) write back - this writes the content of the cache back to the
> > > > > > >            backing memory
> > > > > > >         2) invalidate - this remove the content of the cache
> > > > > > >         3) write back + invalidate - do both of the above
> > > > > >
> > > > > > Agreed that drm_cflush_* isn't a great API.  In this particular case
> > > > > > (IIUC), I need wb+inv so that there aren't dirty cache lines that drop
> > > > > > out to memory later, and so that I don't get a cache hit on
> > > > > > uncached/wc mmap'ing.
> > > > >
> > > > > Is there a cacheable alias lying around (e.g. the linear map), or are
> > > > > these addresses only mapped uncached/wc?
> > > > >
> > > > > If there's a cacheable alias, performing an invalidate isn't sufficient,
> > > > > since a CPU can allocate a new (clean) entry at any point in time (e.g.
> > > > > as a result of prefetching or arbitrary speculation).
> > > >
> > > > I *believe* that there are not alias mappings (that I don't control
> > > > myself) for pages coming from
> > > > shmem_file_setup()/shmem_read_mapping_page()..
> > >
> > > AFAICT, that's regular anonymous memory, so there will be a cacheable
> > > alias in the linear/direct map.
> >
> > tbh, I'm not 100% sure whether there is a cacheable alias, or whether
> > any potential linear map is torn down.
>
> I'm fairly confident that the linear/direct map cacheable alias is not
> torn down when pages are allocated. The gneeric page allocation code
> doesn't do so, and I see nothing the shmem code to do so.
>
> For arm64, we can tear down portions of the linear map, but that has to
> be done explicitly, and this is only possible when using rodata_full. If
> not using rodata_full, it is not possible to dynamically tear down the
> cacheable alias.

So, we do end up using GFP_HIGHUSER, which appears to get passed thru
when shmem gets to the point of actually allocating pages.. not sure
if that just ends up being a hint, or if it guarantees that we don't
get something in the linear map.

(Bear with me while I "page" this all back in.. last time I dug thru
the shmem code was probably pre-armv8, or at least before I had any
armv8 hw)

BR,
-R

Mark Rutland Aug. 8, 2019, 10:20 a.m. UTC | #13

On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote:
> > For arm64, we can tear down portions of the linear map, but that has to
> > be done explicitly, and this is only possible when using rodata_full. If
> > not using rodata_full, it is not possible to dynamically tear down the
> > cacheable alias.
> 
> Interesting.  For this or next merge window I plan to add support to the
> generic DMA code to remap pages as uncachable in place based on the
> openrisc code.  Aѕ far as I can tell the requirement for that is
> basically just that the kernel direct mapping doesn't use PMD or bigger
> mapping so that it supports changing protection bits on a per-PTE basis.
> Is that the case with arm64 + rodata_full?

Yes, with the added case that on arm64 we can also have contiguous
entries at the PTE level, which we also have to disable.

Our kernel page table creation code does that for rodata_full or
DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass
NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code.

Thanks,
Mark.

Mark Rutland Aug. 8, 2019, 10:24 a.m. UTC | #14

On Thu, Aug 08, 2019 at 11:20:53AM +0100, Mark Rutland wrote:
> On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote:
> > On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote:
> > > For arm64, we can tear down portions of the linear map, but that has to
> > > be done explicitly, and this is only possible when using rodata_full. If
> > > not using rodata_full, it is not possible to dynamically tear down the
> > > cacheable alias.
> > 
> > Interesting.  For this or next merge window I plan to add support to the
> > generic DMA code to remap pages as uncachable in place based on the
> > openrisc code.  Aѕ far as I can tell the requirement for that is
> > basically just that the kernel direct mapping doesn't use PMD or bigger
> > mapping so that it supports changing protection bits on a per-PTE basis.
> > Is that the case with arm64 + rodata_full?
> 
> Yes, with the added case that on arm64 we can also have contiguous
> entries at the PTE level, which we also have to disable.
> 
> Our kernel page table creation code does that for rodata_full or
> DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass
> NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code.

Whoops, that should be: arch/arm64/mm/mmu.c.

Mark.

Will Deacon Aug. 8, 2019, 10:32 a.m. UTC | #15

On Thu, Aug 08, 2019 at 11:20:53AM +0100, Mark Rutland wrote:
> On Thu, Aug 08, 2019 at 09:58:27AM +0200, Christoph Hellwig wrote:
> > On Wed, Aug 07, 2019 at 05:49:59PM +0100, Mark Rutland wrote:
> > > For arm64, we can tear down portions of the linear map, but that has to
> > > be done explicitly, and this is only possible when using rodata_full. If
> > > not using rodata_full, it is not possible to dynamically tear down the
> > > cacheable alias.
> > 
> > Interesting.  For this or next merge window I plan to add support to the
> > generic DMA code to remap pages as uncachable in place based on the
> > openrisc code.  Aѕ far as I can tell the requirement for that is
> > basically just that the kernel direct mapping doesn't use PMD or bigger
> > mapping so that it supports changing protection bits on a per-PTE basis.
> > Is that the case with arm64 + rodata_full?
> 
> Yes, with the added case that on arm64 we can also have contiguous
> entries at the PTE level, which we also have to disable.
> 
> Our kernel page table creation code does that for rodata_full or
> DEBUG_PAGEALLOC. See arch/arm64/mmu.c, in map_mem(), where we pass
> NO_{BLOCK,CONT}_MAPPINGS down to our pagetable creation code.

FWIW, we made rodata_full the default a couple of releases ago, so if
solving the cacheable alias for non-cacheable DMA buffers requires this
to be present, then we could probably just refuse to probe non-coherent
DMA-capable devices on systems where rodata_full has been disabled.

Will

Daniel Vetter Aug. 8, 2019, 11:58 a.m. UTC | #16

On Thu, Aug 08, 2019 at 11:55:06AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 07, 2019 at 10:48:56AM +0200, Daniel Vetter wrote:
> > >    other drm drivers how do they guarantee addressability without an
> > >    iommu?)
> > 
> > We use shmem to get at swappable pages. We generally just assume that
> > the gpu can get at those pages, but things fall apart in fun ways:
> > - some setups somehow inject bounce buffers. Some drivers just give
> > up, others try to allocate a pool of pages with dma_alloc_coherent.
> > - some devices are misdesigned and can't access as much as the cpu. We
> > allocate using GFP_DMA32 to fix that.
> 
> Well, for shmem you can't really call allocators directly, right?

We can pass gfp flags to shmem_read_mapping_page_gfp, which is just about
enough for the 2 cases on intel platforms where the gpu can only access
4G, but the cpu has way more.

> One thing I have in my pipeline is a dma_alloc_pages API that allocates
> pages that are guaranteed to be addressably by the device or otherwise
> fail.  But that doesn't really help with the shmem fs.

Yeah, the other drivers where the shmem gfp trick doesn't work copy
back&forth between the dma-able pages and the shmem swappable pages as
needed in their shrinker/allocation code. I guess ideal would be if we
could fuse the custom allocator somehow directly into shmem.

Otoh once you start thrashing beyond system memory for gfx workloads it's
pretty hopeless anyway, and speed doesn't really matter anymore.

> > Also modern gpu apis pretty much assume you can malloc() and then use
> > that directly with the gpu.
> 
> Which is fine as long as the GPU itself supports full 64-bit addressing
> (or always sits behind an iommu), and the platform doesn't impose
> addressing limit, which unfortunately some that are shipped right now
> still do :(

Yes, the userspace api people in khronos are occasionally a bit optimistic
:-)

> But userspace malloc really means dma_map_* anyway, so not really
> relevant for memory allocations.

It does tie in, since we'll want a dma_map which fails if a direct mapping
isn't possible. It also helps the driver code a lot if we could use the
same low-level flushing functions between our own memory (whatever that
is) and anon pages from malloc. And in all the cases if it's not possible,
we want a failure, not elaborate attempts at hiding the differences
between all possible architectures out there.
-Daniel

Rob Clark Aug. 8, 2019, 4:32 p.m. UTC | #17

On Thu, Aug 8, 2019 at 3:00 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 07, 2019 at 09:09:53AM -0700, Rob Clark wrote:
> > > > (Eventually I'd like to support pages passed in from userspace.. but
> > > > that is down the road.)
> > >
> > > Eww.  Please talk to the iommu list before starting on that.
> >
> > This is more of a long term goal, we can't do it until we have
> > per-context/process pagetables, ofc.
> >
> > Getting a bit off topic, but I'm curious about what problems you are
> > concerned about.  Userspace can shoot it's own foot, but if it is not
> > sharing GPU pagetables with other processes, it can't shoot other's
> > feet.  (I'm guessing you are concerned about non-page-aligned
> > mappings?)
>
> Maybe I misunderstood what you mean above, I though you mean messing
> with page cachability attributes for userspace pages.  If what you are
> looking into is just "standard" SVM I only hope that our APIs for that
> which currently are a mess are in shape by then, as all users currently
> have their own crufty and at least slightly buggy versions of that.  But
> at least it is an issue that is being worked on.

ahh, ok.. and no, we wouldn't be remapping 'malloc' memory as
writecombine.  We'd have to wire up better support for cached buffers.

Currently we use WC for basically all GEM buffers allocated from
kernel, since that is a good choice for most GPU workloads.. ie. CPU
isn't reading back from GPU buffers in most cases.  I'm told cached
buffers are useful for compute workloads where there is more back and
forth between GPU and CPU, but we haven't really crossed that bridge
yet.  Compute workloads is also were the SVM interest is.

> > > So back to the question, I'd like to understand your use case (and
> > > maybe hear from the other drm folks if that is common):
> > >
> > >  - you allocate pages from shmem (why shmem, btw?  if this is done by
> > >    other drm drivers how do they guarantee addressability without an
> > >    iommu?)
> >
> > shmem for swappable pages.  I don't unpin and let things get swapped
> > out yet, but I'm told it starts to become important when you have 50
> > browser tabs open ;-)
>
> Yes,  but at that point the swapping can use the kernel linear mapping
> and we are going into aliasing problems that can disturb the cache.  So
> as-is this is going to problematic without new hooks into shmemfs.
>

My expectation is that we'd treat the pages as cached when handing
them back to shmemfs, and wb+inv when we take them back again from
shmemfs and re-pin.  I think this works out to be basically the same
scenario as having to wb+inv when we first get the pages from shmemfs.

> > >  - then the memory is either mapped to userspace or vmapped (or even
> > >    both, althrough the lack of aliasing you mentioned would speak
> > >    against it) as writecombine (aka arm v6+ normal uncached).  Does
> > >    the mapping live on until the memory is freed?
> >
> > (side note, *most* of the drm/msm supported devices are armv8, the
> > exceptions are 8060 and 8064 which are armv7.. I don't think drm/msm
> > will ever have to deal w/ armv6)
>
> Well, the point was that starting from v6 the kernels dma uncached
> really is write combine.  So that applied to v7 and v8 as well.

ahh, gotcha

BR,
-R

Rob Clark Aug. 8, 2019, 4:44 p.m. UTC | #18

On Thu, Aug 8, 2019 at 12:59 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 07, 2019 at 10:30:04AM -0700, Rob Clark wrote:
> > So, we do end up using GFP_HIGHUSER, which appears to get passed thru
> > when shmem gets to the point of actually allocating pages.. not sure
> > if that just ends up being a hint, or if it guarantees that we don't
> > get something in the linear map.
> >
> > (Bear with me while I "page" this all back in.. last time I dug thru
> > the shmem code was probably pre-armv8, or at least before I had any
> > armv8 hw)
>
> GFP_HIGHUSER basically just means that this is an allocation that could
> dip into highmem, in which case it would not have a kernel mapping.
> This can happen on arm + LPAE, but not on arm64.

Just a dumb question, but why is *all* memory in the linear map on
arm64?  It would seem useful to have a source of pages that is not in
the linear map.
I guess it is mapped as huge pages (or something larger than 4k pages)?

Any recommended reading to understand how/why the kernel address space
is setup the way it is (so I can ask fewer dumb questions)?

BR,
-R

[1/2] drm: add cache support for arm64

Commit Message

Comments

Patch