[v3] drm/fourcc: document modifier uniqueness requirements

Message ID	WOsdNGp0dhyp8Modsrt7DYpd0fVk7Yk264FISQ1Yls30bhlSXbxzgKTpmOCJ9H2WV1XHyUjCXu7nwBOWQ6n1NCbIcVl1-1IZ4rMMGN1dN-U=@emersion.fr (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=c9vV=7K=lists.freedesktop.org=dri-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 280AD20814 Date: Thu, 28 May 2020 14:38:36 +0000 To: dri-devel@lists.freedesktop.org From: Simon Ser <contact@emersion.fr> Subject: [PATCH v3] drm/fourcc: document modifier uniqueness requirements Message-ID: <WOsdNGp0dhyp8Modsrt7DYpd0fVk7Yk264FISQ1Yls30bhlSXbxzgKTpmOCJ9H2WV1XHyUjCXu7nwBOWQ6n1NCbIcVl1-1IZ4rMMGN1dN-U=@emersion.fr> MIME-Version: 1.0 Precedence: list Reply-To: Simon Ser <contact@emersion.fr> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>, =?utf-8?b?TWFyZWsgT2zFocOhaw==?= <maraeo@gmail.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[v3] drm/fourcc: document modifier uniqueness requirements \| expand [v3] drm/fourcc: document modifier uniqueness requirements

Simon Ser May 28, 2020, 2:38 p.m. UTC

There have suggestions to bake pitch alignment, address alignement,
contiguous memory or other placement (hidden VRAM, GTT/BAR, etc)
constraints into modifiers. Last time this was brought up it seemed
like the consensus was to not allow this. Document this in drm_fourcc.h.

There are several reasons for this.

- Encoding all of these constraints in the modifiers would explode the
  search space pretty quickly (we only have 64 bits to work with).
- Modifiers need to be unambiguous: a buffer can only have a single
  modifier.
- Modifier users aren't expected to parse modifiers.

v2: add paragraph about aliases (Daniel)

v3: fix unrelated changes sent with the patch

Signed-off-by: Simon Ser <contact@emersion.fr>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Daniel Stone <daniel@fooishbar.org>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Marek Olšák <maraeo@gmail.com>
---
 include/uapi/drm/drm_fourcc.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Marek Olšák May 28, 2020, 3:49 p.m. UTC | #1

On most hardware, there is a minimum pitch alignment for linear and any
greater multiple of the alignment is fine.

On Navi, the pitch in bytes for linear must be align(width * bpp / 8, 256).
That's because the hw computes the pitch from the width and doesn't allow
setting a custom pitch. For that reason, multi-GPU sharing might not be
possible if the other GPU doesn't align the pitch in exactly the same way.

Marek

On Thu, May 28, 2020 at 10:38 AM Simon Ser <contact@emersion.fr> wrote:

> There have suggestions to bake pitch alignment, address alignement,
> contiguous memory or other placement (hidden VRAM, GTT/BAR, etc)
> constraints into modifiers. Last time this was brought up it seemed
> like the consensus was to not allow this. Document this in drm_fourcc.h.
>
> There are several reasons for this.
>
> - Encoding all of these constraints in the modifiers would explode the
>   search space pretty quickly (we only have 64 bits to work with).
> - Modifiers need to be unambiguous: a buffer can only have a single
>   modifier.
> - Modifier users aren't expected to parse modifiers.
>
> v2: add paragraph about aliases (Daniel)
>
> v3: fix unrelated changes sent with the patch
>
> Signed-off-by: Simon Ser <contact@emersion.fr>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Daniel Stone <daniel@fooishbar.org>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Marek Olšák <maraeo@gmail.com>
> ---
>  include/uapi/drm/drm_fourcc.h | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
> index 490143500a50..f41fcb1ed63d 100644
> --- a/include/uapi/drm/drm_fourcc.h
> +++ b/include/uapi/drm/drm_fourcc.h
> @@ -58,6 +58,21 @@ extern "C" {
>   * may preserve meaning - such as number of planes - from the fourcc code,
>   * whereas others may not.
>   *
> + * Modifiers must uniquely encode buffer layout. In other words, a buffer
> must
> + * match only a single modifier. A modifier must not be a subset of
> layouts of
> + * another modifier. For instance, it's incorrect to encode pitch
> alignment in
> + * a modifier: a buffer may match a 64-pixel aligned modifier and a
> 32-pixel
> + * aligned modifier. That said, modifiers can have implicit minimal
> + * requirements.
> + *
> + * For modifiers where the combination of fourcc code and modifier can
> alias,
> + * a canonical pair needs to be defined and used by all drivers. An
> example
> + * is AFBC, where both ARGB and ABGR have the exact same compressed
> layout.
> + *
> + * Users see modifiers as opaque tokens they can check for equality and
> + * intersect. Users musn't need to know to reason about the modifier value
> + * (i.e. users are not expected to extract information out of the
> modifier).
> + *
>   * Vendors should document their modifier usage in as much detail as
>   * possible, to ensure maximum compatibility across devices, drivers and
>   * applications.
> --
> 2.26.2
>
>
>

Simon Ser May 29, 2020, 8:59 a.m. UTC | #2

On Thursday, May 28, 2020 5:49 PM, Marek Olšák <maraeo@gmail.com> wrote:

> On most hardware, there is a minimum pitch alignment for linear and
> any greater multiple of the alignment is fine.
>
> On Navi, the pitch in bytes for linear must be
> align(width * bpp / 8, 256). That's because the hw computes the pitch
> from the width and doesn't allow setting a custom pitch. For that
> reason, multi-GPU sharing might not be possible if the other GPU
> doesn't align the pitch in exactly the same way.

OK. In this case I think it's fine to make the DMA-BUF import fail, as
we've suggested on IRC. The more-or-less planned fix for these buffer
sharing issues is to revive the buffer constraints proposal from the
allocator project. It's a lot of work though.

Alex Deucher May 29, 2020, 1:28 p.m. UTC | #3

On Fri, May 29, 2020 at 4:59 AM Simon Ser <contact@emersion.fr> wrote:
>
> On Thursday, May 28, 2020 5:49 PM, Marek Olšák <maraeo@gmail.com> wrote:
>
> > On most hardware, there is a minimum pitch alignment for linear and
> > any greater multiple of the alignment is fine.
> >
> > On Navi, the pitch in bytes for linear must be
> > align(width * bpp / 8, 256). That's because the hw computes the pitch
> > from the width and doesn't allow setting a custom pitch. For that
> > reason, multi-GPU sharing might not be possible if the other GPU
> > doesn't align the pitch in exactly the same way.
>
> OK. In this case I think it's fine to make the DMA-BUF import fail, as
> we've suggested on IRC. The more-or-less planned fix for these buffer
> sharing issues is to revive the buffer constraints proposal from the
> allocator project. It's a lot of work though.

I get that, but why explicitly limit modifiers then?  Shouldn't we try
and do the best we can with what we have now?  If not the situation is
not much better than what we have now.  Why go through the effort or
adding modifer support in the first place if they are mostly useless?
I don't quite get what we are trying to do with them.  What does this
mean "Modifiers must uniquely encode buffer layout"?  We have a number
of buffer layouts that are the same from a functional standpoint, but
they have different alignment requirements depending on the chip and
the number of memory channels, etc.  Would those be considered the
same modifer?  If not, then we are sort of implicitly encoding
alignment requirements into the modifier.

Alex

Daniel Vetter May 29, 2020, 1:43 p.m. UTC | #4

On Fri, May 29, 2020 at 3:29 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, May 29, 2020 at 4:59 AM Simon Ser <contact@emersion.fr> wrote:
> >
> > On Thursday, May 28, 2020 5:49 PM, Marek Olšák <maraeo@gmail.com> wrote:
> >
> > > On most hardware, there is a minimum pitch alignment for linear and
> > > any greater multiple of the alignment is fine.
> > >
> > > On Navi, the pitch in bytes for linear must be
> > > align(width * bpp / 8, 256). That's because the hw computes the pitch
> > > from the width and doesn't allow setting a custom pitch. For that
> > > reason, multi-GPU sharing might not be possible if the other GPU
> > > doesn't align the pitch in exactly the same way.
> >
> > OK. In this case I think it's fine to make the DMA-BUF import fail, as
> > we've suggested on IRC. The more-or-less planned fix for these buffer
> > sharing issues is to revive the buffer constraints proposal from the
> > allocator project. It's a lot of work though.
>
> I get that, but why explicitly limit modifiers then?  Shouldn't we try
> and do the best we can with what we have now?  If not the situation is
> not much better than what we have now.  Why go through the effort or
> adding modifer support in the first place if they are mostly useless?
> I don't quite get what we are trying to do with them.  What does this
> mean "Modifiers must uniquely encode buffer layout"?  We have a number
> of buffer layouts that are the same from a functional standpoint, but
> they have different alignment requirements depending on the chip and
> the number of memory channels, etc.  Would those be considered the
> same modifer?  If not, then we are sort of implicitly encoding
> alignment requirements into the modifier.

The risk is essentially that if you have these, and they still match,
then either you need to make sure every claims support for the full
set of modifiers, and it gets complicated. Or occasionally sharing
doesn't work when it should.

I'd say the more specific the format (extreme case is compression
format used by one vendor only), the more you can just bake in as
implicit assumptions and deal with the fallout. The stuff Simon patch
adds to the docs is also just general rules, there's always some
exceptions. Pretty much for every more complex set of modifiers we had
lengthy discussions to figure out what they should look like, and as
you can see from scrolling through drm_fourcc.h, there's a bunch of
fairly different approaches in there. One extreme is hw where you
enumerate 1-2 modifiers and done, not a single further constraint
beyond "must be aligned naturally to tiles", others are a _lot_ more
complicated. If the memory channels and stuff meaningfully impacts how
you can share buffers (and if that meaningfully changes across
different gpus, we don't want to encode everything, but only the stuff
that's actually in used and needed for sharing), then that's a good
reason to add them. But if that then results in no sharing possible
per modifiers, while the hw could do it, then that's not great either
(e.g. if you have integrated + discrete gpu in an optimus laptop, but
memory channels mismatch except for the buffer size you care about it
all lines up).
-Daniel

Daniel Stone May 29, 2020, 1:56 p.m. UTC | #5

Hi Alex,

On Fri, 29 May 2020 at 14:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Fri, May 29, 2020 at 4:59 AM Simon Ser <contact@emersion.fr> wrote:
> > OK. In this case I think it's fine to make the DMA-BUF import fail, as
> > we've suggested on IRC. The more-or-less planned fix for these buffer
> > sharing issues is to revive the buffer constraints proposal from the
> > allocator project. It's a lot of work though.
>
> I get that, but why explicitly limit modifiers then?  Shouldn't we try
> and do the best we can with what we have now?  If not the situation is
> not much better than what we have now.  Why go through the effort or
> adding modifer support in the first place if they are mostly useless?

Well sure, we could add pitch alignment in there. And height
alignment. And starting byte offset. And acceptable byte distance
between planes. And physical contiguity / number of backing pages. And
placement (system vs. GTT vs. local), which also implies adding a
device-unique identifier whilst we're at it. And acceptable
width/height bounds. All of those are perfectly valid constraints
which could cause imports to fail, and not even an exhaustive list.

How does Navi ensure that every single linear dmabuf source it can
ever receive is aligned to 256 bytes today? How does adding support
for modifiers - something which does solve other problems, like 'every
three months I wearily review a patch forcing all winsys buffers to be
allocated for scanout usage for a new random reason, regressing
performance for a lot of other vendors' - make Navi's situation worse?

> I don't quite get what we are trying to do with them.  What does this
> mean "Modifiers must uniquely encode buffer layout"?  We have a number
> of buffer layouts that are the same from a functional standpoint, but
> they have different alignment requirements depending on the chip and
> the number of memory channels, etc.  Would those be considered the
> same modifer?  If not, then we are sort of implicitly encoding
> alignment requirements into the modifier.

Yes, of course some requirements are implicit. Given that tiles are
indivisible, it makes no sense to have a 64x64 tiled buffer with a
32-pixel stride. But that isn't the same thing as encoding an
arbitrary constraint, it's just a requirement of the encoding.

The reason why modifiers have been successful and adopted by every
other vendor apart from IMG, is exactly because they _don't_ attempt
to boil the ocean, but are the most practical realisation of improving
performance within a complex ecosystem. The allocator is the complete
and exhaustive solution to all our problems, but it's not exactly
going to be done tomorrow.

Playing a single video today could easily involve a V4L2 codec source
into a V4L2 postprocessor into Chromium's Wayland host compositor
through Chromium itself into the host Wayland compositor and finally
into EGL and/or Vulkan and/or KMS. If you want to figure out what the
V4L2/DRM/KMS, GStreamer/VA-API/Kodi, EGL/Vulkan, and Wayland/X11 APIs
look for negotiating a totally optimal layout across at least three
different hardware classes from at least three different vendors, then
I'm all for it. I'll be cheering you on from the sidelines and doing
what I can to help. But the only reason we've got to this point today
is because Allwinner, AmLogic, Arm, Broadcom, Intel, NVIDIA, NXP,
Qualcomm, Rockchip, Samsung, and VeriSilicon, have all spent the last
few years trying to avoid perfection being the enemy of good. (And
those are just the hardware vendors - obviously Collabora and others
like us have also put not-inconsiderable effort into getting at least
this far.)

Cheers,
Daniel

Alex Deucher May 29, 2020, 2:29 p.m. UTC | #6

On Fri, May 29, 2020 at 9:58 AM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi Alex,
>
> On Fri, 29 May 2020 at 14:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> > On Fri, May 29, 2020 at 4:59 AM Simon Ser <contact@emersion.fr> wrote:
> > > OK. In this case I think it's fine to make the DMA-BUF import fail, as
> > > we've suggested on IRC. The more-or-less planned fix for these buffer
> > > sharing issues is to revive the buffer constraints proposal from the
> > > allocator project. It's a lot of work though.
> >
> > I get that, but why explicitly limit modifiers then?  Shouldn't we try
> > and do the best we can with what we have now?  If not the situation is
> > not much better than what we have now.  Why go through the effort or
> > adding modifer support in the first place if they are mostly useless?
>
> Well sure, we could add pitch alignment in there. And height
> alignment. And starting byte offset. And acceptable byte distance
> between planes. And physical contiguity / number of backing pages. And
> placement (system vs. GTT vs. local), which also implies adding a
> device-unique identifier whilst we're at it. And acceptable
> width/height bounds. All of those are perfectly valid constraints
> which could cause imports to fail, and not even an exhaustive list.
>
> How does Navi ensure that every single linear dmabuf source it can
> ever receive is aligned to 256 bytes today? How does adding support
> for modifiers - something which does solve other problems, like 'every
> three months I wearily review a patch forcing all winsys buffers to be
> allocated for scanout usage for a new random reason, regressing
> performance for a lot of other vendors' - make Navi's situation worse?
>
> > I don't quite get what we are trying to do with them.  What does this
> > mean "Modifiers must uniquely encode buffer layout"?  We have a number
> > of buffer layouts that are the same from a functional standpoint, but
> > they have different alignment requirements depending on the chip and
> > the number of memory channels, etc.  Would those be considered the
> > same modifer?  If not, then we are sort of implicitly encoding
> > alignment requirements into the modifier.
>
> Yes, of course some requirements are implicit. Given that tiles are
> indivisible, it makes no sense to have a 64x64 tiled buffer with a
> 32-pixel stride. But that isn't the same thing as encoding an
> arbitrary constraint, it's just a requirement of the encoding.
>
> The reason why modifiers have been successful and adopted by every
> other vendor apart from IMG, is exactly because they _don't_ attempt
> to boil the ocean, but are the most practical realisation of improving
> performance within a complex ecosystem. The allocator is the complete
> and exhaustive solution to all our problems, but it's not exactly
> going to be done tomorrow.

Maybe I'm over thinking this.  I just don't want to get into a
situation where we go through a lot of effort to add modifier support
and then performance ends up being worse than it is today in a lot of
cases.

Alex

>
> Playing a single video today could easily involve a V4L2 codec source
> into a V4L2 postprocessor into Chromium's Wayland host compositor
> through Chromium itself into the host Wayland compositor and finally
> into EGL and/or Vulkan and/or KMS. If you want to figure out what the
> V4L2/DRM/KMS, GStreamer/VA-API/Kodi, EGL/Vulkan, and Wayland/X11 APIs
> look for negotiating a totally optimal layout across at least three
> different hardware classes from at least three different vendors, then
> I'm all for it. I'll be cheering you on from the sidelines and doing
> what I can to help. But the only reason we've got to this point today
> is because Allwinner, AmLogic, Arm, Broadcom, Intel, NVIDIA, NXP,
> Qualcomm, Rockchip, Samsung, and VeriSilicon, have all spent the last
> few years trying to avoid perfection being the enemy of good. (And
> those are just the hardware vendors - obviously Collabora and others
> like us have also put not-inconsiderable effort into getting at least
> this far.)
>
> Cheers,
> Daniel

Daniel Stone May 29, 2020, 2:30 p.m. UTC | #7

On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> Maybe I'm over thinking this.  I just don't want to get into a
> situation where we go through a lot of effort to add modifier support
> and then performance ends up being worse than it is today in a lot of
> cases.

I'm genuinely curious: what do you imagine could cause a worse result?

Cheers,
Daniel

Alex Deucher May 29, 2020, 2:36 p.m. UTC | #8

On Fri, May 29, 2020 at 10:32 AM Daniel Stone <daniel@fooishbar.org> wrote:
>
> On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> > Maybe I'm over thinking this.  I just don't want to get into a
> > situation where we go through a lot of effort to add modifier support
> > and then performance ends up being worse than it is today in a lot of
> > cases.
>
> I'm genuinely curious: what do you imagine could cause a worse result?

As an example, in some cases, it's actually better to use linear for
system memory because it better aligns with pcie access patterns than
some tiling formats (which are better aligned for the memory
controller topology on the dGPU).  That said, I haven't been in the
loop as much with the tiling formats on newer GPUs, so that may not be
as much of an issue anymore.

Alex

Daniel Stone May 29, 2020, 3:01 p.m. UTC | #9

On Fri, 29 May 2020 at 15:36, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Fri, May 29, 2020 at 10:32 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> > > Maybe I'm over thinking this.  I just don't want to get into a
> > > situation where we go through a lot of effort to add modifier support
> > > and then performance ends up being worse than it is today in a lot of
> > > cases.
> >
> > I'm genuinely curious: what do you imagine could cause a worse result?
>
> As an example, in some cases, it's actually better to use linear for
> system memory because it better aligns with pcie access patterns than
> some tiling formats (which are better aligned for the memory
> controller topology on the dGPU).  That said, I haven't been in the
> loop as much with the tiling formats on newer GPUs, so that may not be
> as much of an issue anymore.

Yeah, that makes a lot of sense. On the other hand, placement isn't
explicitly encoded for either modifiers or non-modifiers, so I'm not
sure how it would really regress.

In case it was missed somewhere, there is no generic code doing
modifier selection for modifier optimality anywhere. The flow is:
  - every producer/consumer advertises a list of modifier + format
pairs, declaring what they _can_ support
  - for every use where a buffer needs to be allocated, the generic
code intersects these lists of modifiers to determine the set of
modifiers mutually acceptable to all consumers
  - the buffer allocator is always handed a _list_ of modifiers, and
makes its own decision based on ??

For a concrete end-to-end example:
  - KMS declares which modifiers are supported for scanout
  - EGL declares which modifiers are supported for EGLImage import
  - Weston determines that one of its clients could be directly
scanned out rather than composited
  - Weston intersects the KMS + EGL set of modifiers to come up with
the optimal modifier set (i.e. bypassing composition)
  - Weston sends this intersected list to the client via the Wayland
protocol (mentioned in previous MR)
  - the client is using EGL, so Mesa receives this list of modifiers,
and passes this on to amdgpu
  - amdgpu uses magic inscrutable heuristics to determine the most
optimal modifier to use, and allocates a buffer based on that

Weston (or GNOME Shell, or Chromium, or whatever) will never be in a
position as a generic client to know that on Raven2 it should use a
particular supertiled layout with no DCC if width > 2048. So we
designed the entire framework to explicitly avoid generic code trying
to reason about the performance properties of specific modifiers.

What Weston _does_ know, however, is that display controller can work
with modifier set A, and the GPU can work with modifier set B, and if
the client can pick something from modifier set A, then there is a
much greater probability that Weston can leave the GPU alone so it can
be entirely used by the client. It also knows that if the surface
can't be directly scanned out for whatever reason, then there's no
point in the client optimising for direct scanout, and it can tell the
client to select based on optimality purely for the GPU.

So that's the thinking behind the interface: that the driver still has
exactly as much control and ability to use magic heuristics as it
always has, but that system components can supplement the driver's
heuristics with their own knowledge, to increase the chance that the
driver's heuristics arrive at a configuration that a) will definitely
work, and b) have a much greater chance of working optimally.

Does that help at all?

Cheers,
Daniel

Alex Deucher May 29, 2020, 3:31 p.m. UTC | #10

On Fri, May 29, 2020 at 11:03 AM Daniel Stone <daniel@fooishbar.org> wrote:
>
> On Fri, 29 May 2020 at 15:36, Alex Deucher <alexdeucher@gmail.com> wrote:
> > On Fri, May 29, 2020 at 10:32 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > > On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> > > > Maybe I'm over thinking this.  I just don't want to get into a
> > > > situation where we go through a lot of effort to add modifier support
> > > > and then performance ends up being worse than it is today in a lot of
> > > > cases.
> > >
> > > I'm genuinely curious: what do you imagine could cause a worse result?
> >
> > As an example, in some cases, it's actually better to use linear for
> > system memory because it better aligns with pcie access patterns than
> > some tiling formats (which are better aligned for the memory
> > controller topology on the dGPU).  That said, I haven't been in the
> > loop as much with the tiling formats on newer GPUs, so that may not be
> > as much of an issue anymore.
>
> Yeah, that makes a lot of sense. On the other hand, placement isn't
> explicitly encoded for either modifiers or non-modifiers, so I'm not
> sure how it would really regress.
>
> In case it was missed somewhere, there is no generic code doing
> modifier selection for modifier optimality anywhere. The flow is:
>   - every producer/consumer advertises a list of modifier + format
> pairs, declaring what they _can_ support
>   - for every use where a buffer needs to be allocated, the generic
> code intersects these lists of modifiers to determine the set of
> modifiers mutually acceptable to all consumers
>   - the buffer allocator is always handed a _list_ of modifiers, and
> makes its own decision based on ??
>
> For a concrete end-to-end example:
>   - KMS declares which modifiers are supported for scanout
>   - EGL declares which modifiers are supported for EGLImage import
>   - Weston determines that one of its clients could be directly
> scanned out rather than composited
>   - Weston intersects the KMS + EGL set of modifiers to come up with
> the optimal modifier set (i.e. bypassing composition)
>   - Weston sends this intersected list to the client via the Wayland
> protocol (mentioned in previous MR)
>   - the client is using EGL, so Mesa receives this list of modifiers,
> and passes this on to amdgpu
>   - amdgpu uses magic inscrutable heuristics to determine the most
> optimal modifier to use, and allocates a buffer based on that
>
> Weston (or GNOME Shell, or Chromium, or whatever) will never be in a
> position as a generic client to know that on Raven2 it should use a
> particular supertiled layout with no DCC if width > 2048. So we
> designed the entire framework to explicitly avoid generic code trying
> to reason about the performance properties of specific modifiers.
>
> What Weston _does_ know, however, is that display controller can work
> with modifier set A, and the GPU can work with modifier set B, and if
> the client can pick something from modifier set A, then there is a
> much greater probability that Weston can leave the GPU alone so it can
> be entirely used by the client. It also knows that if the surface
> can't be directly scanned out for whatever reason, then there's no
> point in the client optimising for direct scanout, and it can tell the
> client to select based on optimality purely for the GPU.
>
> So that's the thinking behind the interface: that the driver still has
> exactly as much control and ability to use magic heuristics as it
> always has, but that system components can supplement the driver's
> heuristics with their own knowledge, to increase the chance that the
> driver's heuristics arrive at a configuration that a) will definitely
> work, and b) have a much greater chance of working optimally.
>
> Does that help at all?

Yes, that helps a lot.  I has some misconceptions about the higher
parts of the stack.

Thanks!

Alex

Michel Dänzer May 30, 2020, 1:08 p.m. UTC | #11

On 2020-05-29 5:01 p.m., Daniel Stone wrote:
> On Fri, 29 May 2020 at 15:36, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Fri, May 29, 2020 at 10:32 AM Daniel Stone <daniel@fooishbar.org> wrote:
>>> On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
>>>> Maybe I'm over thinking this.  I just don't want to get into a
>>>> situation where we go through a lot of effort to add modifier support
>>>> and then performance ends up being worse than it is today in a lot of
>>>> cases.
>>>
>>> I'm genuinely curious: what do you imagine could cause a worse result?
>>
>> As an example, in some cases, it's actually better to use linear for
>> system memory because it better aligns with pcie access patterns than
>> some tiling formats (which are better aligned for the memory
>> controller topology on the dGPU).  That said, I haven't been in the
>> loop as much with the tiling formats on newer GPUs, so that may not be
>> as much of an issue anymore.
> 
> Yeah, that makes a lot of sense. On the other hand, placement isn't
> explicitly encoded for either modifiers or non-modifiers, so I'm not
> sure how it would really regress.

Without modifiers, only linear buffers could be shared between devices
with amdgpu. With modifiers, such shared buffers might end up tiled but
located in system memory.


Anyway, as you explained well, the benefits of modifiers aren't limited
to inter-device buffer sharing.

Brian Starkey June 1, 2020, 11:01 a.m. UTC | #12

Sorry for commenting on the obsolete v1 - that'll teach me for reading
my backlog chronologically.

On Thu, May 28, 2020 at 02:38:36PM +0000, Simon Ser wrote:
> There have suggestions to bake pitch alignment, address alignement,
> contiguous memory or other placement (hidden VRAM, GTT/BAR, etc)
> constraints into modifiers. Last time this was brought up it seemed
> like the consensus was to not allow this. Document this in drm_fourcc.h.
> 
> There are several reasons for this.
> 
> - Encoding all of these constraints in the modifiers would explode the
>   search space pretty quickly (we only have 64 bits to work with).
> - Modifiers need to be unambiguous: a buffer can only have a single
>   modifier.
> - Modifier users aren't expected to parse modifiers.
> 
> v2: add paragraph about aliases (Daniel)
> 
> v3: fix unrelated changes sent with the patch
> 
> Signed-off-by: Simon Ser <contact@emersion.fr>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Daniel Stone <daniel@fooishbar.org>
> Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
> Cc: Dave Airlie <airlied@gmail.com>
> Cc: Marek Olšák <maraeo@gmail.com>
> ---
>  include/uapi/drm/drm_fourcc.h | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/include/uapi/drm/drm_fourcc.h b/include/uapi/drm/drm_fourcc.h
> index 490143500a50..f41fcb1ed63d 100644
> --- a/include/uapi/drm/drm_fourcc.h
> +++ b/include/uapi/drm/drm_fourcc.h
> @@ -58,6 +58,21 @@ extern "C" {
>   * may preserve meaning - such as number of planes - from the fourcc code,
>   * whereas others may not.
>   *
> + * Modifiers must uniquely encode buffer layout. In other words, a buffer must
> + * match only a single modifier. A modifier must not be a subset of layouts of
> + * another modifier. For instance, it's incorrect to encode pitch alignment in
> + * a modifier: a buffer may match a 64-pixel aligned modifier and a 32-pixel
> + * aligned modifier. That said, modifiers can have implicit minimal
> + * requirements.
> + *
> + * For modifiers where the combination of fourcc code and modifier can alias,
> + * a canonical pair needs to be defined and used by all drivers. An example
> + * is AFBC, where both ARGB and ABGR have the exact same compressed layout.

As I mentioned, this isn't true for AFBC (at least, it shouldn't be),
so I don't think we can use that as an example.

We define a "canonical" ordering in AFBC for exactly this reason - the
ordering can differ and not all decoders support all orderings, so we
need to pick one for best compatibility.

Thanks,
-Brian

> + *
> + * Users see modifiers as opaque tokens they can check for equality and
> + * intersect. Users musn't need to know to reason about the modifier value
> + * (i.e. users are not expected to extract information out of the modifier).
> + *
>   * Vendors should document their modifier usage in as much detail as
>   * possible, to ensure maximum compatibility across devices, drivers and
>   * applications.
> -- 
> 2.26.2
> 
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Neil Armstrong June 1, 2020, 1:43 p.m. UTC | #13

On 29/05/2020 15:56, Daniel Stone wrote:
> Hi Alex,
> 
> On Fri, 29 May 2020 at 14:29, Alex Deucher <alexdeucher@gmail.com> wrote:
>> On Fri, May 29, 2020 at 4:59 AM Simon Ser <contact@emersion.fr> wrote:
>>> OK. In this case I think it's fine to make the DMA-BUF import fail, as
>>> we've suggested on IRC. The more-or-less planned fix for these buffer
>>> sharing issues is to revive the buffer constraints proposal from the
>>> allocator project. It's a lot of work though.
>>
>> I get that, but why explicitly limit modifiers then?  Shouldn't we try
>> and do the best we can with what we have now?  If not the situation is
>> not much better than what we have now.  Why go through the effort or
>> adding modifer support in the first place if they are mostly useless?
> 
> Well sure, we could add pitch alignment in there. And height
> alignment. And starting byte offset. And acceptable byte distance
> between planes. And physical contiguity / number of backing pages. And
> placement (system vs. GTT vs. local), which also implies adding a
> device-unique identifier whilst we're at it. And acceptable
> width/height bounds. All of those are perfectly valid constraints
> which could cause imports to fail, and not even an exhaustive list.
> 
> How does Navi ensure that every single linear dmabuf source it can
> ever receive is aligned to 256 bytes today? How does adding support
> for modifiers - something which does solve other problems, like 'every
> three months I wearily review a patch forcing all winsys buffers to be
> allocated for scanout usage for a new random reason, regressing
> performance for a lot of other vendors' - make Navi's situation worse?
> 
>> I don't quite get what we are trying to do with them.  What does this
>> mean "Modifiers must uniquely encode buffer layout"?  We have a number
>> of buffer layouts that are the same from a functional standpoint, but
>> they have different alignment requirements depending on the chip and
>> the number of memory channels, etc.  Would those be considered the
>> same modifer?  If not, then we are sort of implicitly encoding
>> alignment requirements into the modifier.
> 
> Yes, of course some requirements are implicit. Given that tiles are
> indivisible, it makes no sense to have a 64x64 tiled buffer with a
> 32-pixel stride. But that isn't the same thing as encoding an
> arbitrary constraint, it's just a requirement of the encoding.
> 
> The reason why modifiers have been successful and adopted by every
> other vendor apart from IMG, is exactly because they _don't_ attempt
> to boil the ocean, but are the most practical realisation of improving
> performance within a complex ecosystem. The allocator is the complete
> and exhaustive solution to all our problems, but it's not exactly
> going to be done tomorrow.
> 
> Playing a single video today could easily involve a V4L2 codec source
> into a V4L2 postprocessor into Chromium's Wayland host compositor
> through Chromium itself into the host Wayland compositor and finally
> into EGL and/or Vulkan and/or KMS. If you want to figure out what the
> V4L2/DRM/KMS, GStreamer/VA-API/Kodi, EGL/Vulkan, and Wayland/X11 APIs
> look for negotiating a totally optimal layout across at least three
> different hardware classes from at least three different vendors, then
> I'm all for it. I'll be cheering you on from the sidelines and doing
> what I can to help. But the only reason we've got to this point today
> is because Allwinner, AmLogic, Arm, Broadcom, Intel, NVIDIA, NXP,
> Qualcomm, Rockchip, Samsung, and VeriSilicon, have all spent the last
> few years trying to avoid perfection being the enemy of good. (And
> those are just the hardware vendors - obviously Collabora and others
> like us have also put not-inconsiderable effort into getting at least
> this far.)

I have an hard time understanding how this affects the Amlogic Framebuffer
Compression patchset I sent at [1].

[1] http://lore.kernel.org/r/20200529151935.13418-2-narmstrong@baylibre.com

Neil

> 
> Cheers,
> Daniel
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Alex Deucher June 1, 2020, 2:25 p.m. UTC | #14

On Fri, May 29, 2020 at 11:03 AM Daniel Stone <daniel@fooishbar.org> wrote:
>
> On Fri, 29 May 2020 at 15:36, Alex Deucher <alexdeucher@gmail.com> wrote:
> > On Fri, May 29, 2020 at 10:32 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > > On Fri, 29 May 2020 at 15:29, Alex Deucher <alexdeucher@gmail.com> wrote:
> > > > Maybe I'm over thinking this.  I just don't want to get into a
> > > > situation where we go through a lot of effort to add modifier support
> > > > and then performance ends up being worse than it is today in a lot of
> > > > cases.
> > >
> > > I'm genuinely curious: what do you imagine could cause a worse result?
> >
> > As an example, in some cases, it's actually better to use linear for
> > system memory because it better aligns with pcie access patterns than
> > some tiling formats (which are better aligned for the memory
> > controller topology on the dGPU).  That said, I haven't been in the
> > loop as much with the tiling formats on newer GPUs, so that may not be
> > as much of an issue anymore.
>
> Yeah, that makes a lot of sense. On the other hand, placement isn't
> explicitly encoded for either modifiers or non-modifiers, so I'm not
> sure how it would really regress.
>
> In case it was missed somewhere, there is no generic code doing
> modifier selection for modifier optimality anywhere. The flow is:
>   - every producer/consumer advertises a list of modifier + format
> pairs, declaring what they _can_ support
>   - for every use where a buffer needs to be allocated, the generic
> code intersects these lists of modifiers to determine the set of
> modifiers mutually acceptable to all consumers
>   - the buffer allocator is always handed a _list_ of modifiers, and
> makes its own decision based on ??
>
> For a concrete end-to-end example:
>   - KMS declares which modifiers are supported for scanout
>   - EGL declares which modifiers are supported for EGLImage import
>   - Weston determines that one of its clients could be directly
> scanned out rather than composited
>   - Weston intersects the KMS + EGL set of modifiers to come up with
> the optimal modifier set (i.e. bypassing composition)
>   - Weston sends this intersected list to the client via the Wayland
> protocol (mentioned in previous MR)
>   - the client is using EGL, so Mesa receives this list of modifiers,
> and passes this on to amdgpu
>   - amdgpu uses magic inscrutable heuristics to determine the most
> optimal modifier to use, and allocates a buffer based on that
>
> Weston (or GNOME Shell, or Chromium, or whatever) will never be in a
> position as a generic client to know that on Raven2 it should use a
> particular supertiled layout with no DCC if width > 2048. So we
> designed the entire framework to explicitly avoid generic code trying
> to reason about the performance properties of specific modifiers.
>
> What Weston _does_ know, however, is that display controller can work
> with modifier set A, and the GPU can work with modifier set B, and if
> the client can pick something from modifier set A, then there is a
> much greater probability that Weston can leave the GPU alone so it can
> be entirely used by the client. It also knows that if the surface
> can't be directly scanned out for whatever reason, then there's no
> point in the client optimising for direct scanout, and it can tell the
> client to select based on optimality purely for the GPU.

Just so I understand this correctly, the main reason for this is to
deal with display hardware and render hardware from different vendors
which may or may not support any common formats other than linear.  It
provides a way to tunnel device capabilities between the different
drivers.  In the case of a device with display and rendering on the
same device or multiple devices from the same vendor, it not really
that useful.  It doesn't seem to provide much over the current EGL
hints (SCANOUT, SECURE, etc.).  I still don't understand how it solves
the DCC problem though.  Compression and encryption seem kind like
meta modifiers.  There is an under laying high level layout, linear,
tiled, etc. but it could also be compressed and/or encrypted.  Is the
idea that those are separate modifiers?  E.g.,
0: linear
1: linear | encrypted
2. linear | compressed
3: linear | encrypted | compressed
4: tiled1
5: tiled1 | encrypted
6: tiled1 | compressed
7: tiled1 | encrypted | compressed
etc.
Or that the modifiers only expose the high level layout, and it's then
up the the driver(s) to enable compression, etc. if both sides have a
compatible layout?

Thanks,

Alex

>
> So that's the thinking behind the interface: that the driver still has
> exactly as much control and ability to use magic heuristics as it
> always has, but that system components can supplement the driver's
> heuristics with their own knowledge, to increase the chance that the
> driver's heuristics arrive at a configuration that a) will definitely
> work, and b) have a much greater chance of working optimally.
>
> Does that help at all?
>
> Cheers,
> Daniel

Daniel Stone June 3, 2020, 9:48 a.m. UTC | #15

Hi Alex,

On Mon, 1 Jun 2020 at 15:25, Alex Deucher <alexdeucher@gmail.com> wrote:
> On Fri, May 29, 2020 at 11:03 AM Daniel Stone <daniel@fooishbar.org> wrote:
> > What Weston _does_ know, however, is that display controller can work
> > with modifier set A, and the GPU can work with modifier set B, and if
> > the client can pick something from modifier set A, then there is a
> > much greater probability that Weston can leave the GPU alone so it can
> > be entirely used by the client. It also knows that if the surface
> > can't be directly scanned out for whatever reason, then there's no
> > point in the client optimising for direct scanout, and it can tell the
> > client to select based on optimality purely for the GPU.
>
> Just so I understand this correctly, the main reason for this is to
> deal with display hardware and render hardware from different vendors
> which may or may not support any common formats other than linear.

It handles pretty much everything other than a single-context,
single-GPU, single-device, tunnel.

When sharing between subsystems and device categories, it lets us talk
about capabilities in a more global way. For example, GBM lets you
talk about 'scanout' and 'texture' and 'render', but what about media
codecs? We could add the concept of decode/encode to something like
GBM, and all the protocols like Wayland/X11 as well, then hope it
actually works, but ...

When sharing between heterogeneous vendors, it lets us talk about
capabilities in a neutral way. For example, if you look at most modern
Arm SoCs, your GPU, display controller, and media codec, will very
likely all be from three totally different vendors. A GPU like
Mali-T8xx can be shipped in tens of different vendor SoCs in several
different revisions each. Just saying 'scanout' is totally meaningless
for the Panfrost driver. Putting awareness for every different KMS
platform and every different codec down into the Mesa driver is a
synchronisation nightmare, and all those drivers would also need
specific awareness about the Mesa driver. So modifiers allow us to
explicitly describe that we want a particular revision of Arm
Framebuffer Compression, and all the components can understand that
without having to be specifically aware of 15 different KMS drivers.
But even if you have the same vendor ...

When sharing between multiple devices of the same class from the same
vendor, it lets us surface and transit that information in a generic
way, without AMD having to figure out ways to tunnel back-channel
information between different instances of drivers potentially
targeting different revisions. The alternatives seem to be deeply
pessimal hacks, and we think we can do better. And when we get
pessimal ...

In every case, modifiers are about surfacing and sharing information.
One of the reasons Collabora have been putting so much time and energy
into this work is exactly _because_ solving those problems on a
case-by-case basis was a pretty lucrative source of revenue for us.
Debugging these kinds of issues before has usually involved specific
driver knowledge, hacking into the driver to insert your own tracing,
etc.

If you (as someone who's trying to use a device optimally) are
fortunate enough that you can get the attention of a vendor and have
them solve the problem for you, then that's lucky for everyone apart
from the AMD engineers who have to go solve it. If you're not, and you
can't figure it out yourself, then you have to go pay a consultancy.
On the face of it, that's good for us, except that we don't want to be
doing that kind of repetitive boring work. But it's bad for the
ecosystem that this knowledge is hidden away and that you have to pay
specialists to extract it. So we're really keen to surface as much
mechanism and information as possible, to give people the tools to
either solve their own problems or at least make well-informed
reports, burn down a toxic source of revenue, waste less engineering
time extracting hidden information, and empower users as much as
possible.

> It
> provides a way to tunnel device capabilities between the different
> drivers.  In the case of a device with display and rendering on the
> same device or multiple devices from the same vendor, it not really
> that useful.

Oh no, it's still super useful. There are a ton of corner cases where
'if you're on same same-vendor same-gen same-silicon hardware' falls
apart - in addition to the world just not being very much
same-vendor/same-gen/same-silicon anymore. For some concrete examples:

On NVIDIA Tegra hardware, planes within the display controller have
heterogeneous capability. Some can decompress and detile, others
can't.

On Rockchip hardware, AFBC (DCC equivalent) is available for scanout
on any plane, and can be produced by the GPU. Great! Except that 'any
plane' isn't 'every plane' - there's a global decompression unit.

On Intel hardware, they appear to have forked the media codec IP,
shipping two different versions of the codec, one as 'low-power' and
one as 'normal', obviously with varying capability.

Even handwaving those away as vendor errors - that performance on
those gens will always be pessimal and they should do better next time
- I don't think same-vendor/same-gen/same-silicon is a good design
point anymore. Between heterogeneous cut-and-paste SoCs, multi-GPU and
eGPU usecases, virtualisation and tunneling, etc, the usecases are
starting to demand that we do better. Vulkan's memory-allocation
design also really pushes against the model that memory allocations
themselves are blessed with side-channel descriptor tags.

'Those aren't my usecases and we've made Vulkan work so we don't need
it' is an entirely reasonable position, but then you're just
exchanging the problem of describing your tiling & compression layouts
in a 56-bit enum to make modifiers work, for the problem of
maintaining a surprisingly wide chunk of the display stack. For all
the reasons above, over the past few years, the entire rest of the
ecosystem has settled on using modifiers to describe and negotiate
buffer exchange across context/process/protocol/subsystem/device
boundaries. All the effort of making this work in KMS, GBM, EGL,
Vulkan, Wayland, X11, V4L2, VA-API, GStreamer, etc, is going there.

Realistically, the non-modifier path is probably going to bitrot, and
people are certainly resistant to putting more smarts into it, because
it just adds complexity to a now-single-vendor path - even NVIDIA are
pushing this forward, and their display path is much more of an
encapsulated magic tunnel than AMD's. In that sense, it's pretty much
accumulating technical debt; the longer you avoid dealing with the
display stack by implementing modifiers, the more work you have to put
into maintaining the display stack by fixing the non-modifier path.

> It doesn't seem to provide much over the current EGL
> hints (SCANOUT, SECURE, etc.).

Well yeah, if those single bits of information are enough to perfectly
encapsulate everything you need to know, then sure. But it hasn't been
for others, which is why we've all migrated away from them.

> I still don't understand how it solves
> the DCC problem though.  Compression and encryption seem kind like
> meta modifiers.  There is an under laying high level layout, linear,
> tiled, etc. but it could also be compressed and/or encrypted.  Is the
> idea that those are separate modifiers?  E.g.,
> 0: linear
> 1: linear | encrypted
> 2. linear | compressed
> 3: linear | encrypted | compressed
> 4: tiled1
> 5: tiled1 | encrypted
> 6: tiled1 | compressed
> 7: tiled1 | encrypted | compressed
> etc.
> Or that the modifiers only expose the high level layout, and it's then
> up the the driver(s) to enable compression, etc. if both sides have a
> compatible layout?

Do you remember the old wfb from xserver? Think of modifiers as pretty
much that. The format (e.g. A8R8G8B8) describes what you will read
when you load a particular pixel/texel, and what will get stored when
you write. The modifier describes how to get there: that includes both
tiling (since you need to know the particular tiling layout in order
to know the byte location to access), and compression (since you need
to know the particular compression mechanism in order to access the
pixel, e.g. for RLE-type compression that you need to access the first
pixel of the tile if the 'all pixels are the identical' bit is set).

The idea is that these tokens fully describe the mechanisms in use,
without the drivers needing to do magic heuristics. For instance, if
your modifier is just 'tiled', then that's not a full description. A
full description would tell you about supertiling structures, tile
sizes and ordering, etc. The definitions already in
include/uapi/drm/drm_fourcc.h are a bit of a mixed bag - we've
definitely learnt more as we've gone on - but the NVIDIA definitions
are  pretty exemplary for something deeply parameterised along a lot
of variable axes.

Basically, if you have to have sets of heuristics which you keep in
sync in order to translate from modifier -> hardware layout params,
then your modifiers aren't expressive enough. From a very quick look
at DC, that would be your tile-split, tile-mode, array-mode, and
swizzle-mode parameters, plus whatever from dc_tiling_mode isn't
completely static and deterministic. 'DCCRate' always appears to be
hardcoded to 1 (and 'DCCRateChroma' never set), but that might be one
to parameterise as well.

With that expression, you don't have to determine the tiling layout
from dimensions/usage/etc, because the modifier _is_ the tiling
layout, ditto compression.

Encryption I'm minded to consider as something different. Modifiers
don't cover buffer placement at all. That includes whether or not the
memory is physically contiguous, whether it's in
hidden-VRAM/BAR/sysmem, which device it lives on, etc. As far as I can
tell from TMZ, encryption is essentially a side effect of placement?
The memory is encrypted, the encryption is an immutable property of
the allocation, and if the device is configured to access encrypted
memory (by being 'secure'), then the encryption is transparent, no?

That being said, there is a reasonable argument to consume a single
bit in modifiers for TMZ on/off (assuming TMZ is not parameterised),
which would make its availability and use much more transparent.

Cheers,
Daniel

Marek Olšák June 3, 2020, 6:53 p.m. UTC | #16

TMZ is more complicated. If there is a TMZ buffer used by a command buffer,
then all other used buffers must also be TMZ or read only. If no TMZ
buffers are used by a command buffer, then TMZ is disabled. If a context is
not secure, TMZ is also disabled. A context can switch between secure and
non-secure based on the buffers being used.

So mixing secure and non-secure memory writes in one command buffer won't
work. This is not fixable in the driver - apps must be aware of this.

Marek

On Wed, Jun 3, 2020 at 5:50 AM Daniel Stone <daniel@fooishbar.org> wrote:

> Hi Alex,
>
> On Mon, 1 Jun 2020 at 15:25, Alex Deucher <alexdeucher@gmail.com> wrote:
> > On Fri, May 29, 2020 at 11:03 AM Daniel Stone <daniel@fooishbar.org>
> wrote:
> > > What Weston _does_ know, however, is that display controller can work
> > > with modifier set A, and the GPU can work with modifier set B, and if
> > > the client can pick something from modifier set A, then there is a
> > > much greater probability that Weston can leave the GPU alone so it can
> > > be entirely used by the client. It also knows that if the surface
> > > can't be directly scanned out for whatever reason, then there's no
> > > point in the client optimising for direct scanout, and it can tell the
> > > client to select based on optimality purely for the GPU.
> >
> > Just so I understand this correctly, the main reason for this is to
> > deal with display hardware and render hardware from different vendors
> > which may or may not support any common formats other than linear.
>
> It handles pretty much everything other than a single-context,
> single-GPU, single-device, tunnel.
>
> When sharing between subsystems and device categories, it lets us talk
> about capabilities in a more global way. For example, GBM lets you
> talk about 'scanout' and 'texture' and 'render', but what about media
> codecs? We could add the concept of decode/encode to something like
> GBM, and all the protocols like Wayland/X11 as well, then hope it
> actually works, but ...
>
> When sharing between heterogeneous vendors, it lets us talk about
> capabilities in a neutral way. For example, if you look at most modern
> Arm SoCs, your GPU, display controller, and media codec, will very
> likely all be from three totally different vendors. A GPU like
> Mali-T8xx can be shipped in tens of different vendor SoCs in several
> different revisions each. Just saying 'scanout' is totally meaningless
> for the Panfrost driver. Putting awareness for every different KMS
> platform and every different codec down into the Mesa driver is a
> synchronisation nightmare, and all those drivers would also need
> specific awareness about the Mesa driver. So modifiers allow us to
> explicitly describe that we want a particular revision of Arm
> Framebuffer Compression, and all the components can understand that
> without having to be specifically aware of 15 different KMS drivers.
> But even if you have the same vendor ...
>
> When sharing between multiple devices of the same class from the same
> vendor, it lets us surface and transit that information in a generic
> way, without AMD having to figure out ways to tunnel back-channel
> information between different instances of drivers potentially
> targeting different revisions. The alternatives seem to be deeply
> pessimal hacks, and we think we can do better. And when we get
> pessimal ...
>
> In every case, modifiers are about surfacing and sharing information.
> One of the reasons Collabora have been putting so much time and energy
> into this work is exactly _because_ solving those problems on a
> case-by-case basis was a pretty lucrative source of revenue for us.
> Debugging these kinds of issues before has usually involved specific
> driver knowledge, hacking into the driver to insert your own tracing,
> etc.
>
> If you (as someone who's trying to use a device optimally) are
> fortunate enough that you can get the attention of a vendor and have
> them solve the problem for you, then that's lucky for everyone apart
> from the AMD engineers who have to go solve it. If you're not, and you
> can't figure it out yourself, then you have to go pay a consultancy.
> On the face of it, that's good for us, except that we don't want to be
> doing that kind of repetitive boring work. But it's bad for the
> ecosystem that this knowledge is hidden away and that you have to pay
> specialists to extract it. So we're really keen to surface as much
> mechanism and information as possible, to give people the tools to
> either solve their own problems or at least make well-informed
> reports, burn down a toxic source of revenue, waste less engineering
> time extracting hidden information, and empower users as much as
> possible.
>
> > It
> > provides a way to tunnel device capabilities between the different
> > drivers.  In the case of a device with display and rendering on the
> > same device or multiple devices from the same vendor, it not really
> > that useful.
>
> Oh no, it's still super useful. There are a ton of corner cases where
> 'if you're on same same-vendor same-gen same-silicon hardware' falls
> apart - in addition to the world just not being very much
> same-vendor/same-gen/same-silicon anymore. For some concrete examples:
>
> On NVIDIA Tegra hardware, planes within the display controller have
> heterogeneous capability. Some can decompress and detile, others
> can't.
>
> On Rockchip hardware, AFBC (DCC equivalent) is available for scanout
> on any plane, and can be produced by the GPU. Great! Except that 'any
> plane' isn't 'every plane' - there's a global decompression unit.
>
> On Intel hardware, they appear to have forked the media codec IP,
> shipping two different versions of the codec, one as 'low-power' and
> one as 'normal', obviously with varying capability.
>
> Even handwaving those away as vendor errors - that performance on
> those gens will always be pessimal and they should do better next time
> - I don't think same-vendor/same-gen/same-silicon is a good design
> point anymore. Between heterogeneous cut-and-paste SoCs, multi-GPU and
> eGPU usecases, virtualisation and tunneling, etc, the usecases are
> starting to demand that we do better. Vulkan's memory-allocation
> design also really pushes against the model that memory allocations
> themselves are blessed with side-channel descriptor tags.
>
> 'Those aren't my usecases and we've made Vulkan work so we don't need
> it' is an entirely reasonable position, but then you're just
> exchanging the problem of describing your tiling & compression layouts
> in a 56-bit enum to make modifiers work, for the problem of
> maintaining a surprisingly wide chunk of the display stack. For all
> the reasons above, over the past few years, the entire rest of the
> ecosystem has settled on using modifiers to describe and negotiate
> buffer exchange across context/process/protocol/subsystem/device
> boundaries. All the effort of making this work in KMS, GBM, EGL,
> Vulkan, Wayland, X11, V4L2, VA-API, GStreamer, etc, is going there.
>
> Realistically, the non-modifier path is probably going to bitrot, and
> people are certainly resistant to putting more smarts into it, because
> it just adds complexity to a now-single-vendor path - even NVIDIA are
> pushing this forward, and their display path is much more of an
> encapsulated magic tunnel than AMD's. In that sense, it's pretty much
> accumulating technical debt; the longer you avoid dealing with the
> display stack by implementing modifiers, the more work you have to put
> into maintaining the display stack by fixing the non-modifier path.
>
> > It doesn't seem to provide much over the current EGL
> > hints (SCANOUT, SECURE, etc.).
>
> Well yeah, if those single bits of information are enough to perfectly
> encapsulate everything you need to know, then sure. But it hasn't been
> for others, which is why we've all migrated away from them.
>
> > I still don't understand how it solves
> > the DCC problem though.  Compression and encryption seem kind like
> > meta modifiers.  There is an under laying high level layout, linear,
> > tiled, etc. but it could also be compressed and/or encrypted.  Is the
> > idea that those are separate modifiers?  E.g.,
> > 0: linear
> > 1: linear | encrypted
> > 2. linear | compressed
> > 3: linear | encrypted | compressed
> > 4: tiled1
> > 5: tiled1 | encrypted
> > 6: tiled1 | compressed
> > 7: tiled1 | encrypted | compressed
> > etc.
> > Or that the modifiers only expose the high level layout, and it's then
> > up the the driver(s) to enable compression, etc. if both sides have a
> > compatible layout?
>
> Do you remember the old wfb from xserver? Think of modifiers as pretty
> much that. The format (e.g. A8R8G8B8) describes what you will read
> when you load a particular pixel/texel, and what will get stored when
> you write. The modifier describes how to get there: that includes both
> tiling (since you need to know the particular tiling layout in order
> to know the byte location to access), and compression (since you need
> to know the particular compression mechanism in order to access the
> pixel, e.g. for RLE-type compression that you need to access the first
> pixel of the tile if the 'all pixels are the identical' bit is set).
>
> The idea is that these tokens fully describe the mechanisms in use,
> without the drivers needing to do magic heuristics. For instance, if
> your modifier is just 'tiled', then that's not a full description. A
> full description would tell you about supertiling structures, tile
> sizes and ordering, etc. The definitions already in
> include/uapi/drm/drm_fourcc.h are a bit of a mixed bag - we've
> definitely learnt more as we've gone on - but the NVIDIA definitions
> are  pretty exemplary for something deeply parameterised along a lot
> of variable axes.
>
> Basically, if you have to have sets of heuristics which you keep in
> sync in order to translate from modifier -> hardware layout params,
> then your modifiers aren't expressive enough. From a very quick look
> at DC, that would be your tile-split, tile-mode, array-mode, and
> swizzle-mode parameters, plus whatever from dc_tiling_mode isn't
> completely static and deterministic. 'DCCRate' always appears to be
> hardcoded to 1 (and 'DCCRateChroma' never set), but that might be one
> to parameterise as well.
>
> With that expression, you don't have to determine the tiling layout
> from dimensions/usage/etc, because the modifier _is_ the tiling
> layout, ditto compression.
>
> Encryption I'm minded to consider as something different. Modifiers
> don't cover buffer placement at all. That includes whether or not the
> memory is physically contiguous, whether it's in
> hidden-VRAM/BAR/sysmem, which device it lives on, etc. As far as I can
> tell from TMZ, encryption is essentially a side effect of placement?
> The memory is encrypted, the encryption is an immutable property of
> the allocation, and if the device is configured to access encrypted
> memory (by being 'secure'), then the encryption is transparent, no?
>
> That being said, there is a reasonable argument to consume a single
> bit in modifiers for TMZ on/off (assuming TMZ is not parameterised),
> which would make its availability and use much more transparent.
>
> Cheers,
> Daniel
>

Daniel Stone June 4, 2020, 9:14 a.m. UTC | #17

On Wed, 3 Jun 2020 at 19:53, Marek Olšák <maraeo@gmail.com> wrote:
> TMZ is more complicated. If there is a TMZ buffer used by a command buffer, then all other used buffers must also be TMZ or read only. If no TMZ buffers are used by a command buffer, then TMZ is disabled. If a context is not secure, TMZ is also disabled. A context can switch between secure and non-secure based on the buffers being used.
>
> So mixing secure and non-secure memory writes in one command buffer won't work. This is not fixable in the driver - apps must be aware of this.

Sure, that makes sense. It probably points to TMZ being its own
special thing, independent of modifiers, since it touches so much
global state, and doesn't mesh cleanly any of the models we have for
advertising and negotiating buffer allocation and import.

Cheers,
Daniel

[v3] drm/fourcc: document modifier uniqueness requirements

Commit Message

Comments

Patch