diff mbox series

[v3] drm/plane: Add documentation about software color conversion.

Message ID 20230825140434.182664-1-jfalempe@redhat.com (mailing list archive)
State New, archived
Headers show
Series [v3] drm/plane: Add documentation about software color conversion. | expand

Commit Message

Jocelyn Falempe Aug. 25, 2023, 2:04 p.m. UTC
After discussions on IRC, the consensus is that the DRM drivers should
avoid software color conversion, and only advertise the formats supported
by hardware.
Update the doc accordingly so that the rule and exceptions are clear for
everyone.

Acked-by: Simon Ser <contact@emersion.fr>
Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
---
 drivers/gpu/drm/drm_plane.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)


base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781

Comments

Maxime Ripard Aug. 25, 2023, 2:14 p.m. UTC | #1
On Fri, 25 Aug 2023 16:04:18 +0200, Jocelyn Falempe wrote:
> After discussions on IRC, the consensus is that the DRM drivers should
> avoid software color conversion, and only advertise the formats supported
> by hardware.
> Update the doc accordingly so that the rule and exceptions are clear for
> everyone.
> 
> [ ... ]

Acked-by: Maxime Ripard <mripard@kernel.org>

Thanks!
Maxime
Javier Martinez Canillas Aug. 25, 2023, 2:25 p.m. UTC | #2
Jocelyn Falempe <jfalempe@redhat.com> writes:

Hello Jocelyn,

> After discussions on IRC, the consensus is that the DRM drivers should
> avoid software color conversion, and only advertise the formats supported
> by hardware.
> Update the doc accordingly so that the rule and exceptions are clear for
> everyone.
>
> Acked-by: Simon Ser <contact@emersion.fr>
> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
> ---

Thanks a lot for writing this!

Acked-by: Javier Martinez Canillas <javierm@redhat.com>
Pekka Paalanen Aug. 28, 2023, 7:35 a.m. UTC | #3
On Fri, 25 Aug 2023 16:04:18 +0200
Jocelyn Falempe <jfalempe@redhat.com> wrote:

> After discussions on IRC, the consensus is that the DRM drivers should
> avoid software color conversion, and only advertise the formats supported
> by hardware.
> Update the doc accordingly so that the rule and exceptions are clear for
> everyone.
> 
> Acked-by: Simon Ser <contact@emersion.fr>
> Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com>
> ---
>  drivers/gpu/drm/drm_plane.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_plane.c b/drivers/gpu/drm/drm_plane.c
> index 24e7998d1731..d05642033202 100644
> --- a/drivers/gpu/drm/drm_plane.c
> +++ b/drivers/gpu/drm/drm_plane.c
> @@ -140,6 +140,30 @@
>   *     DRM_FORMAT_MOD_LINEAR. Before linux kernel release v5.1 there have been
>   *     various bugs in this area with inconsistencies between the capability
>   *     flag and per-plane properties.
> + *
> + *     All drivers should support XRGB8888, even if the hardware cannot support
> + *     it. This has become the de-facto standard and a lot of user-space assume
> + *     it will be present. If XRGB8888 is not natively supported, then it
> + *     shouldn't be the default for preferred depth or fbdev emulation.
> + *
> + *     DRM drivers should not do software color conversion, and
> + *     only advertise the formats they support in hardware. This is for
> + *     performance reason, and to avoid multiple conversions in userspace and
> + *     kernel space. KMS page flips are generally expected to be very cheap
> + *     operations.
> + *
> + *     But there are two exceptions only for dumb buffers:
> + *     * To support XRGB8888 if it's not supported by the hardware.
> + *     * Any driver is free to modify its internal representation of the format,
> + *       as long as it doesn't alter the visible content in any way, and doesn't
> + *       modify the user-provided buffer. An example would be to drop the
> + *       padding component from a format to save some memory bandwidth.
> + *     On most hardware, VRAM read access are slow, so when doing the software
> + *     conversion, the dumb buffer should be allocated in system RAM in order to
> + *     have decent performance.
> + *     Extra care should be taken when doing software conversion with
> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
>   */

Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>


Thanks,
pq
Thomas Zimmermann Sept. 8, 2023, 9:21 a.m. UTC | #4
Hi

Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
[...]
> + *
> + *     But there are two exceptions only for dumb buffers:
> + *     * To support XRGB8888 if it's not supported by the hardware.


> + *     * Any driver is free to modify its internal representation of the format,
> + *       as long as it doesn't alter the visible content in any way, and doesn't
> + *       modify the user-provided buffer. An example would be to drop the
> + *       padding component from a format to save some memory bandwidth.

I have strong objections to this point, _especially_ as you're 
apparently trying to sneak this in after our discussion. NAK on this 
part from my side.

If you want userspace to be able to use a certain format, then export 
the corresponding 4cc code. Then let userspace decide what to do about 
it. If userspace pick a certain format, go with it.

Hence, no implicit conversion from XRGB888 to RGB888, just because it's 
possible.

Best regards
Thomas


> + *     On most hardware, VRAM read access are slow, so when doing the software
> + *     conversion, the dumb buffer should be allocated in system RAM in order to
> + *     have decent performance.
> + *     Extra care should be taken when doing software conversion with
> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
>    */
>   
>   static unsigned int drm_num_planes(struct drm_device *dev)
> 
> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781
Maxime Ripard Sept. 8, 2023, 10:58 a.m. UTC | #5
Hi,

On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> [...]
> > + *
> > + *     But there are two exceptions only for dumb buffers:
> > + *     * To support XRGB8888 if it's not supported by the hardware.
> 
> 
> > + *     * Any driver is free to modify its internal representation of the format,
> > + *       as long as it doesn't alter the visible content in any way, and doesn't
> > + *       modify the user-provided buffer. An example would be to drop the
> > + *       padding component from a format to save some memory bandwidth.
> 
> I have strong objections to this point, _especially_ as you're apparently
> trying to sneak this in after our discussion.

I think it's an unfair characterization. This was discussed on
#dri-devel, and went through several rounds over the mailing lists, with
you in Cc for each. How is that sneaking something in?

> NAK on this part from my side.
> 
> If you want userspace to be able to use a certain format, then export the
> corresponding 4cc code. Then let userspace decide what to do about it. If
> userspace pick a certain format, go with it.
> 
> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
> possible.

I'm not sure what's your argument is, really. Last time we discussed
this, you were saying that you were enforcing that rule because that was
the outcome of that (painful) discussion with Pekka and Javier. It turns
out that both Pekka and Javier are ok with the patch, so it's not clear
to me what your objections are at this point.

Are you really arguing that we shouldn't allow a driver to consume less
resources while not affecting any other component in the system in any
way?

Maxime
Pekka Paalanen Sept. 8, 2023, 11:16 a.m. UTC | #6
On Fri, 8 Sep 2023 11:21:51 +0200
Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Hi
> 
> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> [...]
> > + *
> > + *     But there are two exceptions only for dumb buffers:
> > + *     * To support XRGB8888 if it's not supported by the hardware.  
> 
> 
> > + *     * Any driver is free to modify its internal representation of the format,
> > + *       as long as it doesn't alter the visible content in any way, and doesn't
> > + *       modify the user-provided buffer. An example would be to drop the
> > + *       padding component from a format to save some memory bandwidth.  
> 
> I have strong objections to this point, _especially_ as you're 
> apparently trying to sneak this in after our discussion. NAK on this 
> part from my side.
> 
> If you want userspace to be able to use a certain format, then export 
> the corresponding 4cc code. Then let userspace decide what to do about 
> it. If userspace pick a certain format, go with it.

What is the reason for your objection, exactly?

> Hence, no implicit conversion from XRGB888 to RGB888, just because it's 
> possible.

For the particular driver in question though, the conversion allows
using a display resolution that is otherwise not possible. I also hear
it improves performance since 25% less data needs to travel across a
slow bus. There is also so little VRAM, than all dumb buffers need to
be allocated from sysram instead anyway, so a copy is always necessary.

Since XRGB8888 is the one format that is recommended to be supported by
all drivers, I don't see a problem here. Did you test on your
incredibly slow g200 test rig if the conversion ends up hurting instead
of helping performance there?

If it hurts, then I see that you have a good reason to NAK this.

It's hard to imagine how it would hurt, since you always need a copy
from sysram dumb buffers to VRAM - or do you?


Thanks,
pq

> > + *     On most hardware, VRAM read access are slow, so when doing the software
> > + *     conversion, the dumb buffer should be allocated in system RAM in order to
> > + *     have decent performance.
> > + *     Extra care should be taken when doing software conversion with
> > + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> > + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
> >    */
> >   
> >   static unsigned int drm_num_planes(struct drm_device *dev)
> > 
> > base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781  
>
Thomas Zimmermann Sept. 8, 2023, 1:22 p.m. UTC | #7
Hi Maxime

Am 08.09.23 um 12:58 schrieb Maxime Ripard:
> Hi,
> 
> On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>> [...]
>>> + *
>>> + *     But there are two exceptions only for dumb buffers:
>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>
>>
>>> + *     * Any driver is free to modify its internal representation of the format,
>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>> + *       modify the user-provided buffer. An example would be to drop the
>>> + *       padding component from a format to save some memory bandwidth.
>>
>> I have strong objections to this point, _especially_ as you're apparently
>> trying to sneak this in after our discussion.
> 
> I think it's an unfair characterization. This was discussed on
> #dri-devel, and went through several rounds over the mailing lists, with
> you in Cc for each. How is that sneaking something in?

A few months ago, we had a flamewar'ish IRC discussion on format 
conversion within the kernel. The general sentiment was that the kernel 
drivers should use what ever is provided by userspace without further 
processing. The short argument was 'userspace knows better'. The only 
exception is for supporting XRGB8888 on hardware that would otherwise 
not support it. After some consideration, I agree with all that. (Back 
then I didn't.)

A few weeks ago I received a patch to do an implicit conversion from 
XRGB8888 to RGB888 within mgag200. [1] I don't have a link to the 
discussion, but I NAK'ed that patch pretty hard on IRC by following that 
other discussion.

And know I find that this patch (even in its v1) contains language that 
retroactively legitimizes the mgag200 patch. I wrote 'apparently' I my 
reply, as I assume that there's more to it, but how does it not look 
like an attempt to sneak in something that is known to be controversial?

It might have been better to discuss the question separately on the 
dri-devel ML. Maybe we can do this here.

Best regards
Thomas

[1] https://patchwork.freedesktop.org/patch/531879/?series=116381&rev=1

> 
>> NAK on this part from my side.
>>
>> If you want userspace to be able to use a certain format, then export the
>> corresponding 4cc code. Then let userspace decide what to do about it. If
>> userspace pick a certain format, go with it.
>>
>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
>> possible.
> 
> I'm not sure what's your argument is, really. Last time we discussed
> this, you were saying that you were enforcing that rule because that was
> the outcome of that (painful) discussion with Pekka and Javier. It turns
> out that both Pekka and Javier are ok with the patch, so it's not clear
> to me what your objections are at this point.
> 
> Are you really arguing that we shouldn't allow a driver to consume less
> resources while not affecting any other component in the system in any
> way?
> 
> Maxime
Simon Ser Sept. 8, 2023, 1:27 p.m. UTC | #8
On Friday, September 8th, 2023 at 22:22, Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Am 08.09.23 um 12:58 schrieb Maxime Ripard:
> 
> > Hi,
> > 
> > On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
> > 
> > > Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> > > [...]
> > > 
> > > > + *
> > > > + * But there are two exceptions only for dumb buffers:
> > > > + * * To support XRGB8888 if it's not supported by the hardware.
> > > 
> > > > + * * Any driver is free to modify its internal representation of the format,
> > > > + * as long as it doesn't alter the visible content in any way, and doesn't
> > > > + * modify the user-provided buffer. An example would be to drop the
> > > > + * padding component from a format to save some memory bandwidth.
> > > 
> > > I have strong objections to this point, especially as you're apparently
> > > trying to sneak this in after our discussion.
> > 
> > I think it's an unfair characterization. This was discussed on
> > #dri-devel, and went through several rounds over the mailing lists, with
> > you in Cc for each. How is that sneaking something in?
> 
> 
> A few months ago, we had a flamewar'ish IRC discussion on format
> conversion within the kernel. The general sentiment was that the kernel
> drivers should use what ever is provided by userspace without further
> processing. The short argument was 'userspace knows better'. The only
> exception is for supporting XRGB8888 on hardware that would otherwise
> not support it. After some consideration, I agree with all that. (Back
> then I didn't.)

(FWIW, as a userspace dev, the above makes perfect sense to me.)
Javier Martinez Canillas Sept. 8, 2023, 1:46 p.m. UTC | #9
Thomas Zimmermann <tzimmermann@suse.de> writes:

Hello Thomas,

> Hi Maxime
>
> Am 08.09.23 um 12:58 schrieb Maxime Ripard:
>> Hi,
>> 
>> On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>> [...]
>>>> + *
>>>> + *     But there are two exceptions only for dumb buffers:
>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>
>>>
>>>> + *     * Any driver is free to modify its internal representation of the format,
>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>>> + *       modify the user-provided buffer. An example would be to drop the
>>>> + *       padding component from a format to save some memory bandwidth.
>>>
>>> I have strong objections to this point, _especially_ as you're apparently
>>> trying to sneak this in after our discussion.
>> 
>> I think it's an unfair characterization. This was discussed on
>> #dri-devel, and went through several rounds over the mailing lists, with
>> you in Cc for each. How is that sneaking something in?
>
> A few months ago, we had a flamewar'ish IRC discussion on format 
> conversion within the kernel. The general sentiment was that the kernel 
> drivers should use what ever is provided by userspace without further 
> processing. The short argument was 'userspace knows better'. The only 
> exception is for supporting XRGB8888 on hardware that would otherwise 
> not support it. After some consideration, I agree with all that. (Back 
> then I didn't.)
>
> A few weeks ago I received a patch to do an implicit conversion from 
> XRGB8888 to RGB888 within mgag200. [1] I don't have a link to the 
> discussion, but I NAK'ed that patch pretty hard on IRC by following that 
> other discussion.
>
> And know I find that this patch (even in its v1) contains language that 
> retroactively legitimizes the mgag200 patch. I wrote 'apparently' I my 
> reply, as I assume that there's more to it, but how does it not look 
> like an attempt to sneak in something that is known to be controversial?
>

While is true that the motivation for Jocelyn's patch was to make explicit
what are the rules with regard to drivers emulating formats (other than
"we had a flamewar on IRC a while back" which is quite ambiguous), it was
not attempt to sneak something that is known to be controversial.

In fact, it is an attempt to dispel the controversy and document what is
acceptable and what is not for a driver.

> It might have been better to discuss the question separately on the 
> dri-devel ML. Maybe we can do this here.
>

This was discussed in the #dri-devel IRC channel, I believe you were on
PTO at the time and probably that's why you missed. I found the logs here:

https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2023-08-04

As you can see there, most people agreed that what Jocelyn wrote in his
doc patch is the most pragmatic compromise.
Thomas Zimmermann Sept. 8, 2023, 1:56 p.m. UTC | #10
Hi

Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
> On Fri, 8 Sep 2023 11:21:51 +0200
> Thomas Zimmermann <tzimmermann@suse.de> wrote:
> 
>> Hi
>>
>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>> [...]
>>> + *
>>> + *     But there are two exceptions only for dumb buffers:
>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>
>>
>>> + *     * Any driver is free to modify its internal representation of the format,
>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>> + *       modify the user-provided buffer. An example would be to drop the
>>> + *       padding component from a format to save some memory bandwidth.
>>
>> I have strong objections to this point, _especially_ as you're
>> apparently trying to sneak this in after our discussion. NAK on this
>> part from my side.
>>
>> If you want userspace to be able to use a certain format, then export
>> the corresponding 4cc code. Then let userspace decide what to do about
>> it. If userspace pick a certain format, go with it.
> 
> What is the reason for your objection, exactly?
> 
>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
>> possible.
> 
> For the particular driver in question though, the conversion allows
> using a display resolution that is otherwise not possible. I also hear
> it improves performance since 25% less data needs to travel across a
> slow bus. There is also so little VRAM, than all dumb buffers need to
> be allocated from sysram instead anyway, so a copy is always necessary.
> 
> Since XRGB8888 is the one format that is recommended to be supported by
> all drivers, I don't see a problem here. Did you test on your
> incredibly slow g200 test rig if the conversion ends up hurting instead
> of helping performance there?
> 
> If it hurts, then I see that you have a good reason to NAK this.
> 
> It's hard to imagine how it would hurt, since you always need a copy
> from sysram dumb buffers to VRAM - or do you?

I have a number of concerns. My point it not that we shouldn't optimize. 
I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888 
for userspace to use.

AFAICT the main argument against userspace is that Mesa doesn't like 
3-byte pixels. But I don't see how this conversion cannot be a 
post-processing step within Mesa: do the rendering in RGB32 and then 
convert to a framebuffer in RGB24. Userspace can do that more 
efficiently than the kernel. This has all of the upsides of reduced 
bandwidth, but none of the downsides of kernel code. Applications and/or 
Mesa would be in control of the buffer format and apply the optimization 
where it makes sense. And it would be available for all drivers that are 
similar to mgag200.

My main point is simplicity of the driver: I prefer the driver to be 
simple without unnecessary indirection or overhead. Optimizations like 
these my or may not work on a given system with a certain workload. I'd 
better leave this heuristic to userspace.

Another point of concern is CPU consumption: Slow I/O buses may stall 
the display thread, but the CPU could do something else in the meantime. 
Doing format conversion on the CPU prevents that, hence affecting other 
parts of the system negatively. Of course, that's more of a gut feeling 
than hard data.

Please note that the kernel's conversion code uses memory allocation of 
intermediate buffers. We even recently had a discussion about allocation 
overhead during display updates. Userspace can surely do a better job at 
keeping such buffers around.

And finally a note the hardware itself: on low-end hardware like those 
Matrox chips, just switch to RGB16. That will be pretty and fast enough 
for these chips' server systems. Anyone who cares about fast and 
beautiful should buy a real graphics card.

Best regards
Thomas

> 
> 
> Thanks,
> pq
> 
>>> + *     On most hardware, VRAM read access are slow, so when doing the software
>>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
>>> + *     have decent performance.
>>> + *     Extra care should be taken when doing software conversion with
>>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
>>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
>>>     */
>>>    
>>>    static unsigned int drm_num_planes(struct drm_device *dev)
>>>
>>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781
>>
>
Jocelyn Falempe Sept. 8, 2023, 2:06 p.m. UTC | #11
On 08/09/2023 15:46, Javier Martinez Canillas wrote:
> Thomas Zimmermann <tzimmermann@suse.de> writes:
> 
> Hello Thomas,
> 
>> Hi Maxime
>>
>> Am 08.09.23 um 12:58 schrieb Maxime Ripard:
>>> Hi,
>>>
>>> On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
>>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>>> [...]
>>>>> + *
>>>>> + *     But there are two exceptions only for dumb buffers:
>>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>>
>>>>
>>>>> + *     * Any driver is free to modify its internal representation of the format,
>>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>>>> + *       modify the user-provided buffer. An example would be to drop the
>>>>> + *       padding component from a format to save some memory bandwidth.
>>>>
>>>> I have strong objections to this point, _especially_ as you're apparently
>>>> trying to sneak this in after our discussion.
>>>
>>> I think it's an unfair characterization. This was discussed on
>>> #dri-devel, and went through several rounds over the mailing lists, with
>>> you in Cc for each. How is that sneaking something in?
>>
>> A few months ago, we had a flamewar'ish IRC discussion on format
>> conversion within the kernel. The general sentiment was that the kernel
>> drivers should use what ever is provided by userspace without further
>> processing. The short argument was 'userspace knows better'. The only
>> exception is for supporting XRGB8888 on hardware that would otherwise
>> not support it. After some consideration, I agree with all that. (Back
>> then I didn't.)

I wasn't part of this "flamewar", and though my patch was a bit 
unrelated to this. That's why I started this work to document clearly 
what is acceptable in the kernel or not. I discuss it on IRC, and then 
proposed the patch on dri-devel to find a compromise, and see if my case 
can be acceptable or not.

>>
>> A few weeks ago I received a patch to do an implicit conversion from
>> XRGB8888 to RGB888 within mgag200. [1] I don't have a link to the
>> discussion, but I NAK'ed that patch pretty hard on IRC by following that
>> other discussion.
>>
>> And know I find that this patch (even in its v1) contains language that
>> retroactively legitimizes the mgag200 patch. I wrote 'apparently' I my
>> reply, as I assume that there's more to it, but how does it not look
>> like an attempt to sneak in something that is known to be controversial?
>>

That was not my intention, and I apologize if you feel it this way. My 
goal was just to clarify if this optimization is acceptable for other 
kernel developers, since I though you were willing to accept it, but 
some other developers from the "flamewar" were against.

> 
> While is true that the motivation for Jocelyn's patch was to make explicit
> what are the rules with regard to drivers emulating formats (other than
> "we had a flamewar on IRC a while back" which is quite ambiguous), it was
> not attempt to sneak something that is known to be controversial.
> 
> In fact, it is an attempt to dispel the controversy and document what is
> acceptable and what is not for a driver.
> 
>> It might have been better to discuss the question separately on the
>> dri-devel ML. Maybe we can do this here.
>>
> 
> This was discussed in the #dri-devel IRC channel, I believe you were on
> PTO at the time and probably that's why you missed. I found the logs here:
> 
> https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2023-08-04
> 
> As you can see there, most people agreed that what Jocelyn wrote in his
> doc patch is the most pragmatic compromise.
> 

Best regards,
Thomas Zimmermann Sept. 8, 2023, 2:09 p.m. UTC | #12
Hi Javier

Am 08.09.23 um 15:46 schrieb Javier Martinez Canillas:
> Thomas Zimmermann <tzimmermann@suse.de> writes:
> 
> Hello Thomas,
> 
>> Hi Maxime
>>
>> Am 08.09.23 um 12:58 schrieb Maxime Ripard:
>>> Hi,
>>>
>>> On Fri, Sep 08, 2023 at 11:21:51AM +0200, Thomas Zimmermann wrote:
>>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>>> [...]
>>>>> + *
>>>>> + *     But there are two exceptions only for dumb buffers:
>>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>>
>>>>
>>>>> + *     * Any driver is free to modify its internal representation of the format,
>>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>>>> + *       modify the user-provided buffer. An example would be to drop the
>>>>> + *       padding component from a format to save some memory bandwidth.
>>>>
>>>> I have strong objections to this point, _especially_ as you're apparently
>>>> trying to sneak this in after our discussion.
>>>
>>> I think it's an unfair characterization. This was discussed on
>>> #dri-devel, and went through several rounds over the mailing lists, with
>>> you in Cc for each. How is that sneaking something in?
>>
>> A few months ago, we had a flamewar'ish IRC discussion on format
>> conversion within the kernel. The general sentiment was that the kernel
>> drivers should use what ever is provided by userspace without further
>> processing. The short argument was 'userspace knows better'. The only
>> exception is for supporting XRGB8888 on hardware that would otherwise
>> not support it. After some consideration, I agree with all that. (Back
>> then I didn't.)
>>
>> A few weeks ago I received a patch to do an implicit conversion from
>> XRGB8888 to RGB888 within mgag200. [1] I don't have a link to the
>> discussion, but I NAK'ed that patch pretty hard on IRC by following that
>> other discussion.
>>
>> And know I find that this patch (even in its v1) contains language that
>> retroactively legitimizes the mgag200 patch. I wrote 'apparently' I my
>> reply, as I assume that there's more to it, but how does it not look
>> like an attempt to sneak in something that is known to be controversial?
>>
> 
> While is true that the motivation for Jocelyn's patch was to make explicit
> what are the rules with regard to drivers emulating formats (other than
> "we had a flamewar on IRC a while back" which is quite ambiguous), it was
> not attempt to sneak something that is known to be controversial.
> 
> In fact, it is an attempt to dispel the controversy and document what is
> acceptable and what is not for a driver.
> 
>> It might have been better to discuss the question separately on the
>> dri-devel ML. Maybe we can do this here.
>>
> 
> This was discussed in the #dri-devel IRC channel, I believe you were on
> PTO at the time and probably that's why you missed. I found the logs here:
> 
> https://people.freedesktop.org/~cbrill/dri-log/?channel=dri-devel&date=2023-08-04
> 
> As you can see there, most people agreed that what Jocelyn wrote in his
> doc patch is the most pragmatic compromise.

Thanks for the pointer to the URL. Quoting Daniel (sima) from that 
discussion.

"imo just document that for hw/drivers where XRGB8888 is not support or 
has a very bad cost in terms of upload/scanout bw it's ok to emulate it 
in the kernel driver, but _only_ for that format "

This seems the overall sentiment. I disagree with the "has very bad 
cost" part, though. The cost/benefit ratio should be determined by 
userspace IMHO. Please see my reply to Pekka for some details.

I don't have a pointer to that other IRC discussion, but IIRC there 
where quite a lot more people involved, including from userspace. Many 
of those are absent here.

Best regards
Thomas

>
Thomas Zimmermann Sept. 8, 2023, 2:13 p.m. UTC | #13
Hi Jocelyn

Am 08.09.23 um 16:06 schrieb Jocelyn Falempe:
[...]
>>> And know I find that this patch (even in its v1) contains language that
>>> retroactively legitimizes the mgag200 patch. I wrote 'apparently' I my
>>> reply, as I assume that there's more to it, but how does it not look
>>> like an attempt to sneak in something that is known to be controversial?
>>>
> 
> That was not my intention, and I apologize if you feel it this way. My 
> goal was just to clarify if this optimization is acceptable for other 

Sorry for throwing accusations around. You certainly act in good faith.

Best regards
Thomas
Pekka Paalanen Sept. 8, 2023, 2:41 p.m. UTC | #14
On Fri, 8 Sep 2023 15:56:51 +0200
Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Hi
> 
> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
> > On Fri, 8 Sep 2023 11:21:51 +0200
> > Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >   
> >> Hi
> >>
> >> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> >> [...]  
> >>> + *
> >>> + *     But there are two exceptions only for dumb buffers:
> >>> + *     * To support XRGB8888 if it's not supported by the hardware.  
> >>
> >>  
> >>> + *     * Any driver is free to modify its internal representation of the format,
> >>> + *       as long as it doesn't alter the visible content in any way, and doesn't
> >>> + *       modify the user-provided buffer. An example would be to drop the
> >>> + *       padding component from a format to save some memory bandwidth.  
> >>
> >> I have strong objections to this point, _especially_ as you're
> >> apparently trying to sneak this in after our discussion. NAK on this
> >> part from my side.
> >>
> >> If you want userspace to be able to use a certain format, then export
> >> the corresponding 4cc code. Then let userspace decide what to do about
> >> it. If userspace pick a certain format, go with it.  
> > 
> > What is the reason for your objection, exactly?
> >   
> >> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
> >> possible.  
> > 
> > For the particular driver in question though, the conversion allows
> > using a display resolution that is otherwise not possible. I also hear
> > it improves performance since 25% less data needs to travel across a
> > slow bus. There is also so little VRAM, than all dumb buffers need to
> > be allocated from sysram instead anyway, so a copy is always necessary.
> > 
> > Since XRGB8888 is the one format that is recommended to be supported by
> > all drivers, I don't see a problem here. Did you test on your
> > incredibly slow g200 test rig if the conversion ends up hurting instead
> > of helping performance there?
> > 
> > If it hurts, then I see that you have a good reason to NAK this.
> > 
> > It's hard to imagine how it would hurt, since you always need a copy
> > from sysram dumb buffers to VRAM - or do you?  
> 
> I have a number of concerns. My point it not that we shouldn't optimize. 
> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888 
> for userspace to use.
> 
> AFAICT the main argument against userspace is that Mesa doesn't like 
> 3-byte pixels. But I don't see how this conversion cannot be a 
> post-processing step within Mesa: do the rendering in RGB32 and then 
> convert to a framebuffer in RGB24. Userspace can do that more 
> efficiently than the kernel. This has all of the upsides of reduced 
> bandwidth, but none of the downsides of kernel code. Applications and/or 
> Mesa would be in control of the buffer format and apply the optimization 
> where it makes sense. And it would be available for all drivers that are 
> similar to mgag200.
> 
> My main point is simplicity of the driver: I prefer the driver to be 
> simple without unnecessary indirection or overhead. Optimizations like 
> these my or may not work on a given system with a certain workload. I'd 
> better leave this heuristic to userspace.
> 
> Another point of concern is CPU consumption: Slow I/O buses may stall 
> the display thread, but the CPU could do something else in the meantime. 
> Doing format conversion on the CPU prevents that, hence affecting other 
> parts of the system negatively. Of course, that's more of a gut feeling 
> than hard data.
> 
> Please note that the kernel's conversion code uses memory allocation of 
> intermediate buffers. We even recently had a discussion about allocation 
> overhead during display updates. Userspace can surely do a better job at 
> keeping such buffers around.
> 
> And finally a note the hardware itself: on low-end hardware like those 
> Matrox chips, just switch to RGB16. That will be pretty and fast enough 
> for these chips' server systems. Anyone who cares about fast and 
> beautiful should buy a real graphics card.

Fair enough.

Did you consider that every frame will be copied twice: once in
userspace to get RGB888, then again by the driver into VRAM, since dumb
buffers reside in sysram?

RGB565 would probably go the same route I guess.

I suspect the intermediate buffers (dumb buffers in this case) need to
be full frame size rather than a scanline, too. I'm not sure why the
driver would need any scratch buffers beyond a couple dozen bytes while
doing a software conversion, just enough to have the lowest common
multiple of 3 bytes and optimal bus write width.


Thanks,
pq

> 
> Best regards
> Thomas
> 
> > 
> > 
> > Thanks,
> > pq
> >   
> >>> + *     On most hardware, VRAM read access are slow, so when doing the software
> >>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
> >>> + *     have decent performance.
> >>> + *     Extra care should be taken when doing software conversion with
> >>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> >>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
> >>>     */
> >>>    
> >>>    static unsigned int drm_num_planes(struct drm_device *dev)
> >>>
> >>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781  
> >>  
> >   
>
Jocelyn Falempe Sept. 8, 2023, 2:48 p.m. UTC | #15
On 08/09/2023 15:56, Thomas Zimmermann wrote:
> Hi
> 
> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
>> On Fri, 8 Sep 2023 11:21:51 +0200
>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>
>>> Hi
>>>
>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>> [...]
>>>> + *
>>>> + *     But there are two exceptions only for dumb buffers:
>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>
>>>
>>>> + *     * Any driver is free to modify its internal representation 
>>>> of the format,
>>>> + *       as long as it doesn't alter the visible content in any 
>>>> way, and doesn't
>>>> + *       modify the user-provided buffer. An example would be to 
>>>> drop the
>>>> + *       padding component from a format to save some memory 
>>>> bandwidth.
>>>
>>> I have strong objections to this point, _especially_ as you're
>>> apparently trying to sneak this in after our discussion. NAK on this
>>> part from my side.
>>>
>>> If you want userspace to be able to use a certain format, then export
>>> the corresponding 4cc code. Then let userspace decide what to do about
>>> it. If userspace pick a certain format, go with it.
>>
>> What is the reason for your objection, exactly?
>>
>>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
>>> possible.
>>
>> For the particular driver in question though, the conversion allows
>> using a display resolution that is otherwise not possible. I also hear
>> it improves performance since 25% less data needs to travel across a
>> slow bus. There is also so little VRAM, than all dumb buffers need to
>> be allocated from sysram instead anyway, so a copy is always necessary.
>>
>> Since XRGB8888 is the one format that is recommended to be supported by
>> all drivers, I don't see a problem here. Did you test on your
>> incredibly slow g200 test rig if the conversion ends up hurting instead
>> of helping performance there?
>>
>> If it hurts, then I see that you have a good reason to NAK this.
>>
>> It's hard to imagine how it would hurt, since you always need a copy
>> from sysram dumb buffers to VRAM - or do you?
> 
> I have a number of concerns. My point it not that we shouldn't optimize. 
> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888 
> for userspace to use.

It already does, it's just userspace doesn't want to support it.
> 
> AFAICT the main argument against userspace is that Mesa doesn't like 
> 3-byte pixels. But I don't see how this conversion cannot be a 
> post-processing step within Mesa: do the rendering in RGB32 and then 
> convert to a framebuffer in RGB24. Userspace can do that more 
> efficiently than the kernel. This has all of the upsides of reduced 
> bandwidth, but none of the downsides of kernel code. Applications and/or 
> Mesa would be in control of the buffer format and apply the optimization 
> where it makes sense. And it would be available for all drivers that are 
> similar to mgag200.

For this particular case, user-space would use more memory and CPU, 
because the copy to VRAM is done on kernel side, and it is where the 
conversion must be done for maximum performances. So there is no way for 
userspace to be as efficient as the kernel in this use-case.

For the conversion, the kernel allocate only 1 line, and convert/copy 
one line at a time. And because the CPU is out-of-order, it can start 
converting the next line using CPU registers while waiting for the bus.

Userspace would need to allocate a whole framebuffer, and can't use the 
"wasted" CPU cycle to do the conversion.

> 
> My main point is simplicity of the driver: I prefer the driver to be 
> simple without unnecessary indirection or overhead. Optimizations like 
> these my or may not work on a given system with a certain workload. I'd 
> better leave this heuristic to userspace.

I agree that the driver is simpler without optimizing thing. But I think 
it's the role of the driver to get the maximum from the hardware, even 
if it's old and broken like g200. And userspace don't want to optimize 
for such hardware.

> 
> Another point of concern is CPU consumption: Slow I/O buses may stall 
> the display thread, but the CPU could do something else in the meantime. 
> Doing format conversion on the CPU prevents that, hence affecting other 
> parts of the system negatively. Of course, that's more of a gut feeling 
> than hard data.

I think it's the reverse. Without dropping the X data, the CPU is 
actually stalling much longer sending useless bytes to the mgag200's 
VRAM. And the CPU can't do anything else while doing memcpy_toio().
Using DMA is the only way to free the CPU during the copy, but it 
appears to be either broken or significantly slower on most mgag200 
hardware.

I actually have made the work to support DMA, and I'm a bit sad it 
didn't work on all g200, so this optimization is really a last resort, 
on a really broken hardware.

> 
> Please note that the kernel's conversion code uses memory allocation of 
> intermediate buffers. We even recently had a discussion about allocation 
> overhead during display updates. Userspace can surely do a better job at 
> keeping such buffers around.
> 
> And finally a note the hardware itself: on low-end hardware like those 
> Matrox chips, just switch to RGB16. That will be pretty and fast enough 
> for these chips' server systems. Anyone who cares about fast and 
> beautiful should buy a real graphics card.

There are still sysadmin users that occasionally have to browse the web 
to find answer, or read their mail in a GUI client. It turns out that 
rgb16 is pretty ugly for today standard, and buying an external card 
would be a bit too much, and won't work for remote control.


Best regards,
Thomas Zimmermann Sept. 8, 2023, 3:10 p.m. UTC | #16
Hi

Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
> On Fri, 8 Sep 2023 15:56:51 +0200
> Thomas Zimmermann <tzimmermann@suse.de> wrote:
> 
>> Hi
>>
>> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
>>> On Fri, 8 Sep 2023 11:21:51 +0200
>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>    
>>>> Hi
>>>>
>>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>>> [...]
>>>>> + *
>>>>> + *     But there are two exceptions only for dumb buffers:
>>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>>
>>>>   
>>>>> + *     * Any driver is free to modify its internal representation of the format,
>>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
>>>>> + *       modify the user-provided buffer. An example would be to drop the
>>>>> + *       padding component from a format to save some memory bandwidth.
>>>>
>>>> I have strong objections to this point, _especially_ as you're
>>>> apparently trying to sneak this in after our discussion. NAK on this
>>>> part from my side.
>>>>
>>>> If you want userspace to be able to use a certain format, then export
>>>> the corresponding 4cc code. Then let userspace decide what to do about
>>>> it. If userspace pick a certain format, go with it.
>>>
>>> What is the reason for your objection, exactly?
>>>    
>>>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
>>>> possible.
>>>
>>> For the particular driver in question though, the conversion allows
>>> using a display resolution that is otherwise not possible. I also hear
>>> it improves performance since 25% less data needs to travel across a
>>> slow bus. There is also so little VRAM, than all dumb buffers need to
>>> be allocated from sysram instead anyway, so a copy is always necessary.
>>>
>>> Since XRGB8888 is the one format that is recommended to be supported by
>>> all drivers, I don't see a problem here. Did you test on your
>>> incredibly slow g200 test rig if the conversion ends up hurting instead
>>> of helping performance there?
>>>
>>> If it hurts, then I see that you have a good reason to NAK this.
>>>
>>> It's hard to imagine how it would hurt, since you always need a copy
>>> from sysram dumb buffers to VRAM - or do you?
>>
>> I have a number of concerns. My point it not that we shouldn't optimize.
>> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888
>> for userspace to use.
>>
>> AFAICT the main argument against userspace is that Mesa doesn't like
>> 3-byte pixels. But I don't see how this conversion cannot be a
>> post-processing step within Mesa: do the rendering in RGB32 and then
>> convert to a framebuffer in RGB24. Userspace can do that more
>> efficiently than the kernel. This has all of the upsides of reduced
>> bandwidth, but none of the downsides of kernel code. Applications and/or
>> Mesa would be in control of the buffer format and apply the optimization
>> where it makes sense. And it would be available for all drivers that are
>> similar to mgag200.
>>
>> My main point is simplicity of the driver: I prefer the driver to be
>> simple without unnecessary indirection or overhead. Optimizations like
>> these my or may not work on a given system with a certain workload. I'd
>> better leave this heuristic to userspace.
>>
>> Another point of concern is CPU consumption: Slow I/O buses may stall
>> the display thread, but the CPU could do something else in the meantime.
>> Doing format conversion on the CPU prevents that, hence affecting other
>> parts of the system negatively. Of course, that's more of a gut feeling
>> than hard data.
>>
>> Please note that the kernel's conversion code uses memory allocation of
>> intermediate buffers. We even recently had a discussion about allocation
>> overhead during display updates. Userspace can surely do a better job at
>> keeping such buffers around.
>>
>> And finally a note the hardware itself: on low-end hardware like those
>> Matrox chips, just switch to RGB16. That will be pretty and fast enough
>> for these chips' server systems. Anyone who cares about fast and
>> beautiful should buy a real graphics card.
> 
> Fair enough.
> 
> Did you consider that every frame will be copied twice: once in
> userspace to get RGB888, then again by the driver into VRAM, since dumb
> buffers reside in sysram?

In the kernel, we reduce the copying to the changed parts, if we have 
damage information from userspace. IDK Mesa's software renderer, but it 
could certainly apply a similar optimization.

> 
> RGB565 would probably go the same route I guess.

 From what I know userspace still supports RGB565 rendering directly. 
Last time I tested, it worked for me.

> 
> I suspect the intermediate buffers (dumb buffers in this case) need to
> be full frame size rather than a scanline, too. I'm not sure why the
> driver would need any scratch buffers beyond a couple dozen bytes while
> doing a software conversion, just enough to have the lowest common
> multiple of 3 bytes and optimal bus write width.

For the conversion in the kernel we allocate enough temporary memory to 
hold the part of each scanline that changed. Then we go over dirty 
scanlines, convert each into the output format and store it in that 
temporary buffer. Then copy the temporary buffer to VRAM. The buffer can 
be reused for the scanlines of a single conversion. But the nature of 
the framebuffer (buffers are possibly shared with concurrent access from 
multiple planes) makes it hard to keep that temporary memory around. 
Hence it's freed after each conversion. The code is at [1].

I assume that a userspace software renderer could do a much better job 
at keeping the temporary memory allocated.

Best regards
Thomas

[1] 
https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88

> 
> 
> Thanks,
> pq
> 
>>
>> Best regards
>> Thomas
>>
>>>
>>>
>>> Thanks,
>>> pq
>>>    
>>>>> + *     On most hardware, VRAM read access are slow, so when doing the software
>>>>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
>>>>> + *     have decent performance.
>>>>> + *     Extra care should be taken when doing software conversion with
>>>>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
>>>>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
>>>>>      */
>>>>>     
>>>>>     static unsigned int drm_num_planes(struct drm_device *dev)
>>>>>
>>>>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781
>>>>   
>>>    
>>
>
Thomas Zimmermann Sept. 8, 2023, 3:37 p.m. UTC | #17
Hi

Am 08.09.23 um 16:48 schrieb Jocelyn Falempe:
> On 08/09/2023 15:56, Thomas Zimmermann wrote:
>> Hi
>>
>> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
>>> On Fri, 8 Sep 2023 11:21:51 +0200
>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>
>>>> Hi
>>>>
>>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>>> [...]
>>>>> + *
>>>>> + *     But there are two exceptions only for dumb buffers:
>>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>>
>>>>
>>>>> + *     * Any driver is free to modify its internal representation 
>>>>> of the format,
>>>>> + *       as long as it doesn't alter the visible content in any 
>>>>> way, and doesn't
>>>>> + *       modify the user-provided buffer. An example would be to 
>>>>> drop the
>>>>> + *       padding component from a format to save some memory 
>>>>> bandwidth.
>>>>
>>>> I have strong objections to this point, _especially_ as you're
>>>> apparently trying to sneak this in after our discussion. NAK on this
>>>> part from my side.
>>>>
>>>> If you want userspace to be able to use a certain format, then export
>>>> the corresponding 4cc code. Then let userspace decide what to do about
>>>> it. If userspace pick a certain format, go with it.
>>>
>>> What is the reason for your objection, exactly?
>>>
>>>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
>>>> possible.
>>>
>>> For the particular driver in question though, the conversion allows
>>> using a display resolution that is otherwise not possible. I also hear
>>> it improves performance since 25% less data needs to travel across a
>>> slow bus. There is also so little VRAM, than all dumb buffers need to
>>> be allocated from sysram instead anyway, so a copy is always necessary.
>>>
>>> Since XRGB8888 is the one format that is recommended to be supported by
>>> all drivers, I don't see a problem here. Did you test on your
>>> incredibly slow g200 test rig if the conversion ends up hurting instead
>>> of helping performance there?
>>>
>>> If it hurts, then I see that you have a good reason to NAK this.
>>>
>>> It's hard to imagine how it would hurt, since you always need a copy
>>> from sysram dumb buffers to VRAM - or do you?
>>
>> I have a number of concerns. My point it not that we shouldn't 
>> optimize. I just don't want it in the kernel. Mgag200 can export 
>> DRM_FORMAT_RGB888 for userspace to use.
> 
> It already does, it's just userspace doesn't want to support it.
>>
>> AFAICT the main argument against userspace is that Mesa doesn't like 
>> 3-byte pixels. But I don't see how this conversion cannot be a 
>> post-processing step within Mesa: do the rendering in RGB32 and then 
>> convert to a framebuffer in RGB24. Userspace can do that more 
>> efficiently than the kernel. This has all of the upsides of reduced 
>> bandwidth, but none of the downsides of kernel code. Applications 
>> and/or Mesa would be in control of the buffer format and apply the 
>> optimization where it makes sense. And it would be available for all 
>> drivers that are similar to mgag200.
> 
> For this particular case, user-space would use more memory and CPU, 
> because the copy to VRAM is done on kernel side, and it is where the 
> conversion must be done for maximum performances. So there is no way for 
> userspace to be as efficient as the kernel in this use-case.
> 
> For the conversion, the kernel allocate only 1 line, and convert/copy 
> one line at a time. And because the CPU is out-of-order, it can start 
> converting the next line using CPU registers while waiting for the bus.

Access is writecombined, so you cannot throw large amounts of data 
towards the bus and do something else meanwhile.

> 
> Userspace would need to allocate a whole framebuffer, and can't use the 
> "wasted" CPU cycle to do the conversion.

Yes, userspace would probably need a full extra framebuffer to store the 
RGB32 data. But just as in the kernel, userspace can do format 
conversion of only the damaged areas. No extra CPU overhead here.

> 
>>
>> My main point is simplicity of the driver: I prefer the driver to be 
>> simple without unnecessary indirection or overhead. Optimizations like 
>> these my or may not work on a given system with a certain workload. 
>> I'd better leave this heuristic to userspace.
> 
> I agree that the driver is simpler without optimizing thing. But I think 
> it's the role of the driver to get the maximum from the hardware, even 
> if it's old and broken like g200. And userspace don't want to optimize 
> for such hardware.

Optimization always depends on the workload; something that the driver 
doesn't know. For example, as we mostly move the mouse cursor around the 
screen, the damages areas are usually small. Optimizing this might be 
pointless in any case.

> 
>>
>> Another point of concern is CPU consumption: Slow I/O buses may stall 
>> the display thread, but the CPU could do something else in the 
>> meantime. Doing format conversion on the CPU prevents that, hence 
>> affecting other parts of the system negatively. Of course, that's more 
>> of a gut feeling than hard data.
> 
> I think it's the reverse. Without dropping the X data, the CPU is 
> actually stalling much longer sending useless bytes to the mgag200's 
> VRAM. And the CPU can't do anything else while doing memcpy_toio().

Hyperthreading.

You are also arguing against XRGB in general, which is a different topic.

> Using DMA is the only way to free the CPU during the copy, but it 
> appears to be either broken or significantly slower on most mgag200 
> hardware.
> 
> I actually have made the work to support DMA, and I'm a bit sad it 
> didn't work on all g200, so this optimization is really a last resort, 
> on a really broken hardware.
> 
>>
>> Please note that the kernel's conversion code uses memory allocation 
>> of intermediate buffers. We even recently had a discussion about 
>> allocation overhead during display updates. Userspace can surely do a 
>> better job at keeping such buffers around.
>>
>> And finally a note the hardware itself: on low-end hardware like those 
>> Matrox chips, just switch to RGB16. That will be pretty and fast 
>> enough for these chips' server systems. Anyone who cares about fast 
>> and beautiful should buy a real graphics card.
> 
> There are still sysadmin users that occasionally have to browse the web 
> to find answer, or read their mail in a GUI client. It turns out that 
> rgb16 is pretty ugly for today standard, and buying an external card 
> would be a bit too much, and won't work for remote control.

I'm sure sysadmins have a computer for work with a decent GPU and don't 
need to browse the web on their server systems.

Best regards
Thomas

> 
> 
> Best regards,
>
Pekka Paalanen Sept. 11, 2023, 8:38 a.m. UTC | #18
On Fri, 8 Sep 2023 17:10:46 +0200
Thomas Zimmermann <tzimmermann@suse.de> wrote:

> Hi
> 
> Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
> > On Fri, 8 Sep 2023 15:56:51 +0200
> > Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >   
> >> Hi
> >>
> >> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:  
> >>> On Fri, 8 Sep 2023 11:21:51 +0200
> >>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >>>      
> >>>> Hi
> >>>>
> >>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
> >>>> [...]  
> >>>>> + *
> >>>>> + *     But there are two exceptions only for dumb buffers:
> >>>>> + *     * To support XRGB8888 if it's not supported by the hardware.  
> >>>>
> >>>>     
> >>>>> + *     * Any driver is free to modify its internal representation of the format,
> >>>>> + *       as long as it doesn't alter the visible content in any way, and doesn't
> >>>>> + *       modify the user-provided buffer. An example would be to drop the
> >>>>> + *       padding component from a format to save some memory bandwidth.  
> >>>>
> >>>> I have strong objections to this point, _especially_ as you're
> >>>> apparently trying to sneak this in after our discussion. NAK on this
> >>>> part from my side.
> >>>>
> >>>> If you want userspace to be able to use a certain format, then export
> >>>> the corresponding 4cc code. Then let userspace decide what to do about
> >>>> it. If userspace pick a certain format, go with it.  
> >>>
> >>> What is the reason for your objection, exactly?
> >>>      
> >>>> Hence, no implicit conversion from XRGB888 to RGB888, just because it's
> >>>> possible.  
> >>>
> >>> For the particular driver in question though, the conversion allows
> >>> using a display resolution that is otherwise not possible. I also hear
> >>> it improves performance since 25% less data needs to travel across a
> >>> slow bus. There is also so little VRAM, than all dumb buffers need to
> >>> be allocated from sysram instead anyway, so a copy is always necessary.
> >>>
> >>> Since XRGB8888 is the one format that is recommended to be supported by
> >>> all drivers, I don't see a problem here. Did you test on your
> >>> incredibly slow g200 test rig if the conversion ends up hurting instead
> >>> of helping performance there?
> >>>
> >>> If it hurts, then I see that you have a good reason to NAK this.
> >>>
> >>> It's hard to imagine how it would hurt, since you always need a copy
> >>> from sysram dumb buffers to VRAM - or do you?  
> >>
> >> I have a number of concerns. My point it not that we shouldn't optimize.
> >> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888
> >> for userspace to use.
> >>
> >> AFAICT the main argument against userspace is that Mesa doesn't like
> >> 3-byte pixels. But I don't see how this conversion cannot be a
> >> post-processing step within Mesa: do the rendering in RGB32 and then
> >> convert to a framebuffer in RGB24. Userspace can do that more
> >> efficiently than the kernel. This has all of the upsides of reduced
> >> bandwidth, but none of the downsides of kernel code. Applications and/or
> >> Mesa would be in control of the buffer format and apply the optimization
> >> where it makes sense. And it would be available for all drivers that are
> >> similar to mgag200.
> >>
> >> My main point is simplicity of the driver: I prefer the driver to be
> >> simple without unnecessary indirection or overhead. Optimizations like
> >> these my or may not work on a given system with a certain workload. I'd
> >> better leave this heuristic to userspace.
> >>
> >> Another point of concern is CPU consumption: Slow I/O buses may stall
> >> the display thread, but the CPU could do something else in the meantime.
> >> Doing format conversion on the CPU prevents that, hence affecting other
> >> parts of the system negatively. Of course, that's more of a gut feeling
> >> than hard data.
> >>
> >> Please note that the kernel's conversion code uses memory allocation of
> >> intermediate buffers. We even recently had a discussion about allocation
> >> overhead during display updates. Userspace can surely do a better job at
> >> keeping such buffers around.
> >>
> >> And finally a note the hardware itself: on low-end hardware like those
> >> Matrox chips, just switch to RGB16. That will be pretty and fast enough
> >> for these chips' server systems. Anyone who cares about fast and
> >> beautiful should buy a real graphics card.  
> > 
> > Fair enough.
> > 
> > Did you consider that every frame will be copied twice: once in
> > userspace to get RGB888, then again by the driver into VRAM, since dumb
> > buffers reside in sysram?  
> 
> In the kernel, we reduce the copying to the changed parts, if we have 
> damage information from userspace. IDK Mesa's software renderer, but it 
> could certainly apply a similar optimization.

I have already assumed that everything uses damage information to
optimize the regions to copy. It's still two copies instead of one.
Actually, it's slightly more than two copies.

Due to damage tracking and the driver needing to flip between two VRAM
buffers, it is always copying current + previous damage, not only
current damage.

> > I suspect the intermediate buffers (dumb buffers in this case) need to
> > be full frame size rather than a scanline, too. I'm not sure why the
> > driver would need any scratch buffers beyond a couple dozen bytes while
> > doing a software conversion, just enough to have the lowest common
> > multiple of 3 bytes and optimal bus write width.  
> 
> For the conversion in the kernel we allocate enough temporary memory to 
> hold the part of each scanline that changed. Then we go over dirty 
> scanlines, convert each into the output format and store it in that 
> temporary buffer. Then copy the temporary buffer to VRAM. The buffer can 
> be reused for the scanlines of a single conversion. But the nature of 
> the framebuffer (buffers are possibly shared with concurrent access from 
> multiple planes) makes it hard to keep that temporary memory around. 
> Hence it's freed after each conversion. The code is at [1].

Yes, that's the conversion in the kernel. However, before the kernel
can run that conversion, userspace must have already allocated full
sized dumb buffers to convert its full sized internal framebuffer into.
Since KMS is modernly used with double-buffering, there must always be
two dumb buffers, which means updating a dumb buffer needs to copy not
just current damage but also previous damage. Userspace has no way to
know that single-buffering would be equally good in this case for this
particular driver.

If userspace gave its internal framebuffer to the kernel without doing
the conversion to RGB888, then that internal buffer would be the dumb
buffer, and there would be no need to allocate a second full size
buffer (third, because double-buffering towards KMS).

The driver allocating even full scanlines would be a lot less memory
touched if userspace didn't convert to RGB888, and if the driver used a
tailored conversion function (it literally needs to handle only a single
combination of input and output conditions), it wouldn't need even that
nor to allocate and free on every conversion. I understand you do not
want to pay the cost having such special-case code, and that's ok. It
would just be even further optimization after getting rid of a static
full sized buffer allocation.


> I assume that a userspace software renderer could do a much better job 
> at keeping the temporary memory allocated.

I'm not sure why the kernel can't keep the temporary scanline buffer
allocated with the CRTC. It only needs to be reallocated if it's too
small. Sure, people probably don't want to spend time on such code.

All the above discussion is based on the assumption that dumb buffers
are allocated in sysram.

It's fine to say you don't want to optimize, but I want to be clear on
the exact trade-off.

The trade-offs vary greatly depending on each particular use case,
which is why I wouldn't make a hard rule of no in-kernel conversions,
just a rule of thumb since it's *usually* not useful or is actively
hurting. Here we are talking about XRGB8888 which is already exempt
from the rule of thumb, with two more special conditions: dumb buffers
in sysram, and the performance traits of RGB888 on the specific
hardware.


Thanks,
pq

> 
> Best regards
> Thomas
> 
> [1] 
> https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88
> 
> > 
> > 
> > Thanks,
> > pq
> >   
> >>
> >> Best regards
> >> Thomas
> >>  
> >>>
> >>>
> >>> Thanks,
> >>> pq
> >>>      
> >>>>> + *     On most hardware, VRAM read access are slow, so when doing the software
> >>>>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
> >>>>> + *     have decent performance.
> >>>>> + *     Extra care should be taken when doing software conversion with
> >>>>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
> >>>>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
> >>>>>      */
> >>>>>     
> >>>>>     static unsigned int drm_num_planes(struct drm_device *dev)
> >>>>>
> >>>>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781  
> >>>>     
> >>>      
> >>  
> >   
>
Jocelyn Falempe Sept. 11, 2023, 10:05 a.m. UTC | #19
On 08/09/2023 17:37, Thomas Zimmermann wrote:
> Hi
> 
> Am 08.09.23 um 16:48 schrieb Jocelyn Falempe:
>> On 08/09/2023 15:56, Thomas Zimmermann wrote:
>>> Hi
>>>
>>> Am 08.09.23 um 13:16 schrieb Pekka Paalanen:
>>>> On Fri, 8 Sep 2023 11:21:51 +0200
>>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> Am 25.08.23 um 16:04 schrieb Jocelyn Falempe:
>>>>> [...]
>>>>>> + *
>>>>>> + *     But there are two exceptions only for dumb buffers:
>>>>>> + *     * To support XRGB8888 if it's not supported by the hardware.
>>>>>
>>>>>
>>>>>> + *     * Any driver is free to modify its internal representation 
>>>>>> of the format,
>>>>>> + *       as long as it doesn't alter the visible content in any 
>>>>>> way, and doesn't
>>>>>> + *       modify the user-provided buffer. An example would be to 
>>>>>> drop the
>>>>>> + *       padding component from a format to save some memory 
>>>>>> bandwidth.
>>>>>
>>>>> I have strong objections to this point, _especially_ as you're
>>>>> apparently trying to sneak this in after our discussion. NAK on this
>>>>> part from my side.
>>>>>
>>>>> If you want userspace to be able to use a certain format, then export
>>>>> the corresponding 4cc code. Then let userspace decide what to do about
>>>>> it. If userspace pick a certain format, go with it.
>>>>
>>>> What is the reason for your objection, exactly?
>>>>
>>>>> Hence, no implicit conversion from XRGB888 to RGB888, just because 
>>>>> it's
>>>>> possible.
>>>>
>>>> For the particular driver in question though, the conversion allows
>>>> using a display resolution that is otherwise not possible. I also hear
>>>> it improves performance since 25% less data needs to travel across a
>>>> slow bus. There is also so little VRAM, than all dumb buffers need to
>>>> be allocated from sysram instead anyway, so a copy is always necessary.
>>>>
>>>> Since XRGB8888 is the one format that is recommended to be supported by
>>>> all drivers, I don't see a problem here. Did you test on your
>>>> incredibly slow g200 test rig if the conversion ends up hurting instead
>>>> of helping performance there?
>>>>
>>>> If it hurts, then I see that you have a good reason to NAK this.
>>>>
>>>> It's hard to imagine how it would hurt, since you always need a copy
>>>> from sysram dumb buffers to VRAM - or do you?
>>>
>>> I have a number of concerns. My point it not that we shouldn't 
>>> optimize. I just don't want it in the kernel. Mgag200 can export 
>>> DRM_FORMAT_RGB888 for userspace to use.
>>
>> It already does, it's just userspace doesn't want to support it.
>>>
>>> AFAICT the main argument against userspace is that Mesa doesn't like 
>>> 3-byte pixels. But I don't see how this conversion cannot be a 
>>> post-processing step within Mesa: do the rendering in RGB32 and then 
>>> convert to a framebuffer in RGB24. Userspace can do that more 
>>> efficiently than the kernel. This has all of the upsides of reduced 
>>> bandwidth, but none of the downsides of kernel code. Applications 
>>> and/or Mesa would be in control of the buffer format and apply the 
>>> optimization where it makes sense. And it would be available for all 
>>> drivers that are similar to mgag200.
>>
>> For this particular case, user-space would use more memory and CPU, 
>> because the copy to VRAM is done on kernel side, and it is where the 
>> conversion must be done for maximum performances. So there is no way 
>> for userspace to be as efficient as the kernel in this use-case.
>>
>> For the conversion, the kernel allocate only 1 line, and convert/copy 
>> one line at a time. And because the CPU is out-of-order, it can start 
>> converting the next line using CPU registers while waiting for the bus.
> 
> Access is writecombined, so you cannot throw large amounts of data 
> towards the bus and do something else meanwhile.
> 
>>
>> Userspace would need to allocate a whole framebuffer, and can't use 
>> the "wasted" CPU cycle to do the conversion.
> 
> Yes, userspace would probably need a full extra framebuffer to store the 
> RGB32 data. But just as in the kernel, userspace can do format 
> conversion of only the damaged areas. No extra CPU overhead here.
> 
>>
>>>
>>> My main point is simplicity of the driver: I prefer the driver to be 
>>> simple without unnecessary indirection or overhead. Optimizations 
>>> like these my or may not work on a given system with a certain 
>>> workload. I'd better leave this heuristic to userspace.
>>
>> I agree that the driver is simpler without optimizing thing. But I 
>> think it's the role of the driver to get the maximum from the 
>> hardware, even if it's old and broken like g200. And userspace don't 
>> want to optimize for such hardware.
> 
> Optimization always depends on the workload; something that the driver 
> doesn't know. For example, as we mostly move the mouse cursor around the 
> screen, the damages areas are usually small. Optimizing this might be 
> pointless in any case.

So your point is we should not optimize because sometime it might not be 
necessary ? And even for cursor update, the conversion is still 25% faster.

> 
>>
>>>
>>> Another point of concern is CPU consumption: Slow I/O buses may stall 
>>> the display thread, but the CPU could do something else in the 
>>> meantime. Doing format conversion on the CPU prevents that, hence 
>>> affecting other parts of the system negatively. Of course, that's 
>>> more of a gut feeling than hard data.
>>
>> I think it's the reverse. Without dropping the X data, the CPU is 
>> actually stalling much longer sending useless bytes to the mgag200's 
>> VRAM. And the CPU can't do anything else while doing memcpy_toio().
> 
> Hyperthreading.

I still doubt a user-space conversion would do a better job than the kernel.
> 
> You are also arguing against XRGB in general, which is a different topic.

yes, the issue is human eyes only sees 3 colors, and it's not a power of 
two. So compromise have been made, and that Matrox card, is from the era 
of the transition from 16bits to 32bits, and works significantly better 
in 24bits. And it's probably the only remaining GPU with this problem.

> 
>> Using DMA is the only way to free the CPU during the copy, but it 
>> appears to be either broken or significantly slower on most mgag200 
>> hardware.
>>
>> I actually have made the work to support DMA, and I'm a bit sad it 
>> didn't work on all g200, so this optimization is really a last resort, 
>> on a really broken hardware.
>>
>>>
>>> Please note that the kernel's conversion code uses memory allocation 
>>> of intermediate buffers. We even recently had a discussion about 
>>> allocation overhead during display updates. Userspace can surely do a 
>>> better job at keeping such buffers around.
>>>
>>> And finally a note the hardware itself: on low-end hardware like 
>>> those Matrox chips, just switch to RGB16. That will be pretty and 
>>> fast enough for these chips' server systems. Anyone who cares about 
>>> fast and beautiful should buy a real graphics card.
>>
>> There are still sysadmin users that occasionally have to browse the 
>> web to find answer, or read their mail in a GUI client. It turns out 
>> that rgb16 is pretty ugly for today standard, and buying an external 
>> card would be a bit too much, and won't work for remote control.
> 
> I'm sure sysadmins have a computer for work with a decent GPU and don't 
> need to browse the web on their server systems.

The GUI applications also include graphical installer, that obviously 
you can't run on other system.
I do have bug reports, and I already fixed a few regressions in the 
mgag200 driver from this reports.
But if you think they shouldn't use this GPU, then why maintaining a 
driver in the first place ? Simpledrm is enough if you don't use graphics.

> 
> Best regards
> Thomas
> 
>>
>>
>> Best regards,
>>
> 

Best regards,
Thomas Zimmermann Sept. 11, 2023, 10:18 a.m. UTC | #20
Hi

Am 11.09.23 um 10:38 schrieb Pekka Paalanen:
[...]
>> In the kernel, we reduce the copying to the changed parts, if we have
>> damage information from userspace. IDK Mesa's software renderer, but it
>> could certainly apply a similar optimization.
> 
> I have already assumed that everything uses damage information to
> optimize the regions to copy. It's still two copies instead of one.
> Actually, it's slightly more than two copies.
> 
> Due to damage tracking and the driver needing to flip between two VRAM
> buffers, it is always copying current + previous damage, not only
> current damage.

All dumb buffers are in system memory. From there, we always copy into 
the same region of VRAM. So from the kernel's perspective, it's really 
just the latest damage.

> 
>>> I suspect the intermediate buffers (dumb buffers in this case) need to
>>> be full frame size rather than a scanline, too. I'm not sure why the
>>> driver would need any scratch buffers beyond a couple dozen bytes while
>>> doing a software conversion, just enough to have the lowest common
>>> multiple of 3 bytes and optimal bus write width.
>>
>> For the conversion in the kernel we allocate enough temporary memory to
>> hold the part of each scanline that changed. Then we go over dirty
>> scanlines, convert each into the output format and store it in that
>> temporary buffer. Then copy the temporary buffer to VRAM. The buffer can
>> be reused for the scanlines of a single conversion. But the nature of
>> the framebuffer (buffers are possibly shared with concurrent access from
>> multiple planes) makes it hard to keep that temporary memory around.
>> Hence it's freed after each conversion. The code is at [1].
> 
> Yes, that's the conversion in the kernel. However, before the kernel
> can run that conversion, userspace must have already allocated full
> sized dumb buffers to convert its full sized internal framebuffer into.
> Since KMS is modernly used with double-buffering, there must always be
> two dumb buffers, which means updating a dumb buffer needs to copy not
> just current damage but also previous damage. Userspace has no way to
> know that single-buffering would be equally good in this case for this
> particular driver.

I think that this would be a good optimization. If we let userspace know 
that a certain GEM BO is a shadow buffer, the compositor could avoid 
that second dumb buffer entirely. We have plenty of DRM drivers that 
would benefit from this (grep the DRM code for 'shadow_plane').

It would also allow a format conversion in userspace with little memory 
overhead compared to the two dumb buffers of the current code.

> 
> If userspace gave its internal framebuffer to the kernel without doing
> the conversion to RGB888, then that internal buffer would be the dumb
> buffer, and there would be no need to allocate a second full size
> buffer (third, because double-buffering towards KMS).
> 
> The driver allocating even full scanlines would be a lot less memory
> touched if userspace didn't convert to RGB888, and if the driver used a
> tailored conversion function (it literally needs to handle only a single
> combination of input and output conditions), it wouldn't need even that
> nor to allocate and free on every conversion. I understand you do not
> want to pay the cost having such special-case code, and that's ok. It
> would just be even further optimization after getting rid of a static
> full sized buffer allocation.
> 
> 
>> I assume that a userspace software renderer could do a much better job
>> at keeping the temporary memory allocated.
> 
> I'm not sure why the kernel can't keep the temporary scanline buffer
> allocated with the CRTC. It only needs to be reallocated if it's too
> small. Sure, people probably don't want to spend time on such code.

I wanted to make that happen at some point. But it's a bit of a hassle. 
Either we pass a temporary buffer to the kernel's format-conversion code 
or we store it in one of the involved data structures, preferable 
drm_framebuffer.  The former solution complicates the caller and is 
prone to memory leaks; the latter is complicated as drm_framebuffer 
could be used concurrently.

> 
> All the above discussion is based on the assumption that dumb buffers
> are allocated in sysram.
> 
> It's fine to say you don't want to optimize, but I want to be clear on
> the exact trade-off.
> 
> The trade-offs vary greatly depending on each particular use case,
> which is why I wouldn't make a hard rule of no in-kernel conversions,
> just a rule of thumb since it's *usually* not useful or is actively
> hurting. Here we are talking about XRGB8888 which is already exempt
> from the rule of thumb, with two more special conditions: dumb buffers
> in sysram, and the performance traits of RGB888 on the specific
> hardware.

I don't like this optimization from an architectural POV. But I also 
haven't seen that it makes a difference in practice. As I said in 
another email, the current XRGB-based code is not too slow to be 
unusable in the case of a full-screen update. And as we're mostly moving 
mouse cursors around the screen, screen updates are small most of the time.

Best regards
Thomas

> 
> 
> Thanks,
> pq
> 
>>
>> Best regards
>> Thomas
>>
>> [1]
>> https://elixir.bootlin.com/linux/v6.5/source/drivers/gpu/drm/drm_format_helper.c#L88
>>
>>>
>>>
>>> Thanks,
>>> pq
>>>    
>>>>
>>>> Best regards
>>>> Thomas
>>>>   
>>>>>
>>>>>
>>>>> Thanks,
>>>>> pq
>>>>>       
>>>>>>> + *     On most hardware, VRAM read access are slow, so when doing the software
>>>>>>> + *     conversion, the dumb buffer should be allocated in system RAM in order to
>>>>>>> + *     have decent performance.
>>>>>>> + *     Extra care should be taken when doing software conversion with
>>>>>>> + *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
>>>>>>> + *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
>>>>>>>       */
>>>>>>>      
>>>>>>>      static unsigned int drm_num_planes(struct drm_device *dev)
>>>>>>>
>>>>>>> base-commit: 82d750e9d2f5d0594c8f7057ce59127e701af781
>>>>>>      
>>>>>       
>>>>   
>>>    
>>
>
Thomas Zimmermann Sept. 11, 2023, 10:44 a.m. UTC | #21
Hi

Am 11.09.23 um 12:05 schrieb Jocelyn Falempe:
[...]
>> Optimization always depends on the workload; something that the driver 
>> doesn't know. For example, as we mostly move the mouse cursor around 
>> the screen, the damages areas are usually small. Optimizing this might 
>> be pointless in any case.
> 
> So your point is we should not optimize because sometime it might not be 
> necessary ? And even for cursor update, the conversion is still 25% faster.

Yes, kind of. Those 25% are meaningless in the big picture. And the 
current code is fast enough for what these chips do (server consoles). 
So there's no pointing micro-optimizing anything here. See my reply to 
Pekka for an optimization that I'd find much more useful.

If I have to choose between fast-enough-but-simple and 
faster-but-complicated, the former is much preferable. Anyone who wants 
fast and beautiful graphics has a better graphics card anyway.

> 
>>
>>>
>>>>
>>>> Another point of concern is CPU consumption: Slow I/O buses may 
>>>> stall the display thread, but the CPU could do something else in the 
>>>> meantime. Doing format conversion on the CPU prevents that, hence 
>>>> affecting other parts of the system negatively. Of course, that's 
>>>> more of a gut feeling than hard data.
>>>
>>> I think it's the reverse. Without dropping the X data, the CPU is 
>>> actually stalling much longer sending useless bytes to the mgag200's 
>>> VRAM. And the CPU can't do anything else while doing memcpy_toio().
>>
>> Hyperthreading.
> 
> I still doubt a user-space conversion would do a better job than the 
> kernel.
>>
>> You are also arguing against XRGB in general, which is a different topic.
> 
> yes, the issue is human eyes only sees 3 colors, and it's not a power of 
> two. So compromise have been made, and that Matrox card, is from the era 
> of the transition from 16bits to 32bits, and works significantly better 
> in 24bits. And it's probably the only remaining GPU with this problem.

No. There's at least udl hardware, which does only support RGB16 and 
RGB24. And for simpledrm and ofdrm, we have to take whatever the system 
provided us with. That could be RGB24.

> 
>>
>>> Using DMA is the only way to free the CPU during the copy, but it 
>>> appears to be either broken or significantly slower on most mgag200 
>>> hardware.
>>>
>>> I actually have made the work to support DMA, and I'm a bit sad it 
>>> didn't work on all g200, so this optimization is really a last 
>>> resort, on a really broken hardware.
>>>
>>>>
>>>> Please note that the kernel's conversion code uses memory allocation 
>>>> of intermediate buffers. We even recently had a discussion about 
>>>> allocation overhead during display updates. Userspace can surely do 
>>>> a better job at keeping such buffers around.
>>>>
>>>> And finally a note the hardware itself: on low-end hardware like 
>>>> those Matrox chips, just switch to RGB16. That will be pretty and 
>>>> fast enough for these chips' server systems. Anyone who cares about 
>>>> fast and beautiful should buy a real graphics card.
>>>
>>> There are still sysadmin users that occasionally have to browse the 
>>> web to find answer, or read their mail in a GUI client. It turns out 
>>> that rgb16 is pretty ugly for today standard, and buying an external 
>>> card would be a bit too much, and won't work for remote control.
>>
>> I'm sure sysadmins have a computer for work with a decent GPU and 
>> don't need to browse the web on their server systems.
> 
> The GUI applications also include graphical installer, that obviously 
> you can't run on other system.

> I do have bug reports, and I already fixed a few regressions in the 
> mgag200 driver from this reports.

This isn't a driver bug.

> But if you think they shouldn't use this GPU, then why maintaining a 
> driver in the first place ? Simpledrm is enough if you don't use graphics.

I've done my share of graphic installations on these chips. Neither 
low-color modes nor performance has ever been a problem. And if, it is 
still better to spend the time on the renderer, so that all drivers can 
benefit.

Best regards
Thomas

> 
>>
>> Best regards
>> Thomas
>>
>>>
>>>
>>> Best regards,
>>>
>>
> 
> Best regards,
>
Maxime Ripard Sept. 11, 2023, 11:14 a.m. UTC | #22
On Fri, Sep 08, 2023 at 05:37:27PM +0200, Thomas Zimmermann wrote:
> > > Please note that the kernel's conversion code uses memory allocation
> > > of intermediate buffers. We even recently had a discussion about
> > > allocation overhead during display updates. Userspace can surely do
> > > a better job at keeping such buffers around.
> > > 
> > > And finally a note the hardware itself: on low-end hardware like
> > > those Matrox chips, just switch to RGB16. That will be pretty and
> > > fast enough for these chips' server systems. Anyone who cares about
> > > fast and beautiful should buy a real graphics card.
> > 
> > There are still sysadmin users that occasionally have to browse the web
> > to find answer, or read their mail in a GUI client. It turns out that
> > rgb16 is pretty ugly for today standard, and buying an external card
> > would be a bit too much, and won't work for remote control.
> 
> I'm sure sysadmins have a computer for work with a decent GPU and don't need
> to browse the web on their server systems.

If your expectation is that there's no capable display stack running on
those systems, what user-space component would you expect to optimize
the format/transfers like you previously hinted at exactly?

Maxime
Michel Dänzer Sept. 12, 2023, 3:57 p.m. UTC | #23
On 9/11/23 10:38, Pekka Paalanen wrote:
> On Fri, 8 Sep 2023 17:10:46 +0200
> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>> Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
>>> On Fri, 8 Sep 2023 15:56:51 +0200
>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> 
>>>> I have a number of concerns. My point it not that we shouldn't optimize.
>>>> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888
>>>> for userspace to use.
>>>>
>>>> AFAICT the main argument against userspace is that Mesa doesn't like
>>>> 3-byte pixels. But I don't see how this conversion cannot be a
>>>> post-processing step within Mesa: do the rendering in RGB32 and then
>>>> convert to a framebuffer in RGB24.

Even assuming the conversion could be handled transparently in Mesa, it would still require the KMS client to pick RGB888 instead of XRGB8888. Most (all?) KMS clients support XRGB8888, many of them will realistically never support RGB888. (Or even if they did, they might prefer XRGB8888 anyway, if RGB888 requires a final conversion step)


>>>> Another point of concern is CPU consumption: Slow I/O buses may stall
>>>> the display thread, but the CPU could do something else in the meantime.
>>>> Doing format conversion on the CPU prevents that, hence affecting other
>>>> parts of the system negatively. Of course, that's more of a gut feeling
>>>> than hard data.

Jocelyn, have you measured if the XRGB8888 -> RGB888 conversion copy takes longer than a straight RGB888 -> RGB888 copy in the kernel?
Jocelyn Falempe Sept. 13, 2023, 8:14 a.m. UTC | #24
On 12/09/2023 17:57, Michel Dänzer wrote:
> On 9/11/23 10:38, Pekka Paalanen wrote:
>> On Fri, 8 Sep 2023 17:10:46 +0200
>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>> Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
>>>> On Fri, 8 Sep 2023 15:56:51 +0200
>>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>
>>>>> I have a number of concerns. My point it not that we shouldn't optimize.
>>>>> I just don't want it in the kernel. Mgag200 can export DRM_FORMAT_RGB888
>>>>> for userspace to use.
>>>>>
>>>>> AFAICT the main argument against userspace is that Mesa doesn't like
>>>>> 3-byte pixels. But I don't see how this conversion cannot be a
>>>>> post-processing step within Mesa: do the rendering in RGB32 and then
>>>>> convert to a framebuffer in RGB24.
> 
> Even assuming the conversion could be handled transparently in Mesa, it would still require the KMS client to pick RGB888 instead of XRGB8888. Most (all?) KMS clients support XRGB8888, many of them will realistically never support RGB888. (Or even if they did, they might prefer XRGB8888 anyway, if RGB888 requires a final conversion step)
> 
> 
>>>>> Another point of concern is CPU consumption: Slow I/O buses may stall
>>>>> the display thread, but the CPU could do something else in the meantime.
>>>>> Doing format conversion on the CPU prevents that, hence affecting other
>>>>> parts of the system negatively. Of course, that's more of a gut feeling
>>>>> than hard data.
> 
> Jocelyn, have you measured if the XRGB8888 -> RGB888 conversion copy takes longer than a straight RGB888 -> RGB888 copy in the kernel?
> 
> 

At least on my Matrox system, the conversion is really negligible 
compared to the copy, due to slow memory bandwidth. I wasn't able to see 
a difference, using ktime_get_ns().

The CPU is an old Intel(R) Core(TM) i3 CPU 540  @ 3.07GHz
in 1280x1024, the RGB888 copy takes 95ms.
and the XRGB8888->RGB888 takes also 95ms.
(and the full XRGB8888 copy takes 125ms)

Also we say "conversion", but when talking about XRGB8888->RGB888, there 
is no math involved, only dropping one bytes every 4.

But really performance is not the main concern here as it is obvious 
that it's much faster in RGB888. I tried to summarize the other 
arguments below, which I still find not convincing:
  * The driver is already fast enough, if you want faster, buy another GPU.
  * This adds too much complexity in the driver (about ~20 lines of code)
  * It breaks an unwritten rule of not changing the color format inside 
the kernel (which I tried to write with this patch, while adding an 
exception for legitimate use cases, like this one).

But let's admit that this discussion is going nowhere, and that I failed 
to reach a consensus, so I'm now focusing on other things.

Best regards,
Michel Dänzer Sept. 13, 2023, 5:02 p.m. UTC | #25
On 9/13/23 10:14, Jocelyn Falempe wrote:
> On 12/09/2023 17:57, Michel Dänzer wrote:
>> On 9/11/23 10:38, Pekka Paalanen wrote:
>>> On Fri, 8 Sep 2023 17:10:46 +0200
>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>> Am 08.09.23 um 16:41 schrieb Pekka Paalanen:
>>>>> On Fri, 8 Sep 2023 15:56:51 +0200
>>>>> Thomas Zimmermann <tzimmermann@suse.de> wrote:
>>>>>>
>>>>>> Another point of concern is CPU consumption: Slow I/O buses may stall
>>>>>> the display thread, but the CPU could do something else in the meantime.
>>>>>> Doing format conversion on the CPU prevents that, hence affecting other
>>>>>> parts of the system negatively. Of course, that's more of a gut feeling
>>>>>> than hard data.
>>
>> Jocelyn, have you measured if the XRGB8888 -> RGB888 conversion copy takes longer than a straight RGB888 -> RGB888 copy in the kernel?
> 
> At least on my Matrox system, the conversion is really negligible compared to the copy, due to slow memory bandwidth. I wasn't able to see a difference, using ktime_get_ns().
> 
> The CPU is an old Intel(R) Core(TM) i3 CPU 540  @ 3.07GHz
> in 1280x1024, the RGB888 copy takes 95ms.
> and the XRGB8888->RGB888 takes also 95ms.
> (and the full XRGB8888 copy takes 125ms)

Then any XRGB8888 → RGB888 conversion in user space has to result in worse performance.


> But let's admit that this discussion is going nowhere, and that I failed to reach a consensus, so I'm now focusing on other things. 

I see consensus, with one person disagreeing.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/drm_plane.c b/drivers/gpu/drm/drm_plane.c
index 24e7998d1731..d05642033202 100644
--- a/drivers/gpu/drm/drm_plane.c
+++ b/drivers/gpu/drm/drm_plane.c
@@ -140,6 +140,30 @@ 
  *     DRM_FORMAT_MOD_LINEAR. Before linux kernel release v5.1 there have been
  *     various bugs in this area with inconsistencies between the capability
  *     flag and per-plane properties.
+ *
+ *     All drivers should support XRGB8888, even if the hardware cannot support
+ *     it. This has become the de-facto standard and a lot of user-space assume
+ *     it will be present. If XRGB8888 is not natively supported, then it
+ *     shouldn't be the default for preferred depth or fbdev emulation.
+ *
+ *     DRM drivers should not do software color conversion, and
+ *     only advertise the formats they support in hardware. This is for
+ *     performance reason, and to avoid multiple conversions in userspace and
+ *     kernel space. KMS page flips are generally expected to be very cheap
+ *     operations.
+ *
+ *     But there are two exceptions only for dumb buffers:
+ *     * To support XRGB8888 if it's not supported by the hardware.
+ *     * Any driver is free to modify its internal representation of the format,
+ *       as long as it doesn't alter the visible content in any way, and doesn't
+ *       modify the user-provided buffer. An example would be to drop the
+ *       padding component from a format to save some memory bandwidth.
+ *     On most hardware, VRAM read access are slow, so when doing the software
+ *     conversion, the dumb buffer should be allocated in system RAM in order to
+ *     have decent performance.
+ *     Extra care should be taken when doing software conversion with
+ *     DRM_CAP_DUMB_PREFER_SHADOW, there are more detailed explanations here:
+ *     https://lore.kernel.org/dri-devel/20230818162415.2185f8e3@eldfell/
  */
 
 static unsigned int drm_num_planes(struct drm_device *dev)