Message ID: 20230926170549.2589045-1-halfline@gmail.com (mailing list archive)
State: New, archived
Series: drm/atomic: Perform blocking commits on workqueue
Am 26.09.23 um 19:05 schrieb Ray Strode:
> From: Ray Strode <rstrode@redhat.com>
>
> A drm atomic commit can be quite slow on some hardware. It can lead
> to a lengthy queue of commands that need to get processed and waited
> on before control can go back to user space.
>
> If user space is a real-time thread, that delay can have severe
> consequences, leading to the process getting killed for exceeding
> rlimits.
>
> This commit addresses the problem by always running the slow part of
> a commit on a workqueue, separated from the task initiating the
> commit.
>
> This change makes the nonblocking and blocking paths work in the same
> way, and as a result allows the task to sleep and not use up its
> RLIMIT_RTTIME allocation.

Well this patch made me laugh :)

I'm not an expert for that stuff, but as far as I know the whole purpose
of the blocking functionality is to make sure that the CPU overhead
caused by the commit is accounted to the right process.

So what you are suggesting here is to actually break that functionality,
and that doesn't seem to make sense.

When it's really not desirable to account the CPU overhead to the
process initiating it, then you probably rather want to use a
non-blocking commit plus a dma_fence to wait for the work to end from
userspace.

Regards,
Christian.

>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861
> Signed-off-by: Ray Strode <rstrode@redhat.com>
> ---
>  drivers/gpu/drm/drm_atomic_helper.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 292e38eb6218..1a1e68d98d38 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev,
>  	 * This is the point of no return - everything below never fails except
>  	 * when the hw goes bonghits. Which means we can commit the new state on
>  	 * the software side now.
>  	 */
>
>  	ret = drm_atomic_helper_swap_state(state, true);
>  	if (ret)
>  		goto err;
>
>  	/*
>  	 * Everything below can be run asynchronously without the need to grab
>  	 * any modeset locks at all under one condition: It must be guaranteed
>  	 * that the asynchronous work has either been cancelled (if the driver
>  	 * supports it, which at least requires that the framebuffers get
>  	 * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
>  	 * before the new state gets committed on the software side with
>  	 * drm_atomic_helper_swap_state().
>  	 *
>  	 * This scheme allows new atomic state updates to be prepared and
>  	 * checked in parallel to the asynchronous completion of the previous
>  	 * update. Which is important since compositors need to figure out the
>  	 * composition of the next frame right after having submitted the
>  	 * current layout.
>  	 *
>  	 * NOTE: Commit work has multiple phases, first hardware commit, then
>  	 * cleanup. We want them to overlap, hence need system_unbound_wq to
>  	 * make sure work items don't artificially stall on each another.
>  	 */
>
>  	drm_atomic_state_get(state);
> -	if (nonblock)
> -		queue_work(system_unbound_wq, &state->commit_work);
> -	else
> -		commit_tail(state);
> +	queue_work(system_unbound_wq, &state->commit_work);
> +	if (!nonblock)
> +		flush_work(&state->commit_work);
>
>  	return 0;
>
>  err:
>  	drm_atomic_helper_cleanup_planes(dev, state);
>  	return ret;
>  }
>  EXPORT_SYMBOL(drm_atomic_helper_commit);
>
>  /**
>   * DOC: implementing nonblocking commit
>   *
>   * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence
>   * different operations against each another. Locks, especially struct
>   * &drm_modeset_lock, should not be held in worker threads or any other
>   * asynchronous context used to commit the hardware state.
>   *
>   * drm_atomic_helper_commit() implements the recommended sequence for
>   * nonblocking commits, using drm_atomic_helper_setup_commit() internally:
>   *
>   * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we
>   * need to propagate out of memory/VRAM errors to userspace, it must be called
>   * synchronously.
>   *
>   * 2. Synchronize with any outstanding nonblocking commit worker threads which
>   * might be affected by the new state update. This is handled by
>   * drm_atomic_helper_setup_commit().
>   *
>   * Asynchronous workers need to have sufficient parallelism to be able to run
>   * different atomic commits on different CRTCs in parallel. The simplest way to
Hi,

On Wed, Sep 27, 2023 at 4:05 AM Christian König
<christian.koenig@amd.com> wrote:
> I'm not an expert for that stuff, but as far as I know the whole purpose
> of the blocking functionality is to make sure that the CPU overhead
> caused by the commit is accounted to the right process.

I'm not an expert either, but that's clearly wrong.

The point of blocking functionality in drmAtomicCommit is for the
ioctl to block until the operation is completed, so userspace knows
that they can commit again after it returns without getting EBUSY, and
they can be sure, e.g., the mode is set or the displays are disabled
or whatever.

To say the "whole point" is about CPU overhead accounting sounds
rather absurd to me. Is that really what you meant?

You could try to make the argument that one of the points of the
blocking call is about CPU accounting (instead of the whole point),
maybe that's what you meant? That seems likely wrong to me too. I mean
just check the docs:

    RLIMIT_RTTIME (since Linux 2.6.25)
        This is a limit (in microseconds) on the amount of CPU time
        that a process scheduled under a real-time scheduling policy
        may consume without making a blocking system call. For the
        purpose of this limit, each time a process makes a blocking
        system call, the count of its consumed CPU time is reset to
        zero.

drmAtomicCommit() is making a blocking system call so the limit should
be reset, in my opinion.

Furthermore, a lot happens under the covers as part of drmAtomicCommit.

Are you telling me that in all the drivers handling drmAtomicCommit,
none of the work from those drivers gets deferred outside of process
context if DRM_MODE_ATOMIC_NONBLOCK isn't set? I find that very hard
to believe. But it would have to be true, if one of the main points of
the blocking call is about CPU accounting, right? You can't say the
point is about CPU accounting if some of the work is handled in a
helper thread or work queue or whatever.
╎❯ git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l
182

There seems to be a lot of non-process context execution going on in
the tree...

> So what you are suggesting here is to actually break that functionality
> and that doesn't seem to make sense.

What's the use case here that could break? Walk me through a
real-world, sensible example where a program depends on
drmAtomicCommit() blocking and continuing to eat into process cpu time
while it blocks. I just want to see where you are coming from. Maybe
I'm being unimaginative but I just don't see it. I do see actual
main-stream display servers breaking today because the current
behavior defies expectations.

> When it's really not desirable to account the CPU overhead to the
> process initiating it then you probably rather want to use an non
> blocking commit plus a dma_fence to wait for the work to end from userspace.

Well, first I don't think that's very convenient. You're talking about
a per-plane property, so there would need to be a separate file
descriptor allocated for every plane, right? and user-space would have
to block on all of them before proceeding? Also, are you sure that
works in all cases? The problematic case we're facing right now is
when all planes and all crtcs are getting disabled. I'm just skimming
over the code here, but I see this:

	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
	...
		if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {
	...
			e = create_vblank_event(crtc, arg->user_data);
	...
			crtc_state->event = e;
		}
	...
		if (fence_ptr) {
	...
			fence = drm_crtc_create_fence(crtc);
	...
			ret = setup_out_fence(&f[(*num_fences)++], fence);
	...
			crtc_state->event->base.fence = fence;
		}

Is the code really going to allocate a vblank_event when all the
crtc's are disabled? I guess it's possible, but that's
counterintuitive to me. I really don't know. You're confident a set of
fences will actually work?
Regardless, this seems like a roundabout way to address a problem that we could just ... fix. --Ray
Hi Ray, Am 27.09.23 um 22:25 schrieb Ray Strode: > Hi, > > On Wed, Sep 27, 2023 at 4:05 AM Christian König > <christian.koenig@amd.com> wrote: >> I'm not an expert for that stuff, but as far as I know the whole purpose >> of the blocking functionality is to make sure that the CPU overhead >> caused by the commit is accounted to the right process. > I'm not an expert either, but that's clearly wrong. > > The point of blocking functionality in drmAtomicCommit is for the > ioctl to block until the operation is completed, so userspace knows > that they can commit again after it returns without getting EBUSY, and > they can be sure, e.g., the mode is set or the displays are disabled > or whatever. > > To say the "whole point" is about CPU overhead accounting sounds > rather absurd to me. Is that really what you meant? Yes, absolutely. See the functionality you try to implement already exists. After making a non blocking commit userspace can still wait on the commit to finish by looking at the out fence. This just happens asynchronously so you can use (for example) select()/poll() etc on multiple events. > You could try to make the argument that one of the points of the > blocking call is about CPU accounting (instead of the whole point), > maybe that's what you meant? That seems likely wrong to me too. I mean > just check the docs: > > RLIMIT_RTTIME (since Linux 2.6.25) > This is a limit (in microseconds) on the amount of CPU > time that a process scheduled under a real‐time scheduling policy may > consume without making a blocking system call. For the purpose of > this limit, each time a process makes a blocking system > call, the count of its consumed CPU time is reset to zero. > drmAtomicCommit() is making a blocking system call so the limit should > be reset, in my opinion. Exactly that's the point why I'm rejecting this. As far as I can see this opinion does not match what RLIMIT_RTTIME is intended to do. 
A blocking system call in the sense of RLIMIT_RTTIME means something
which results in the process listening for external events, e.g. calling
select(), epoll() or read() on (for example) a network socket etc...

As far as I can see drmAtomicCommit() is *not* meant to be one of those;
similarly, yield() for example also doesn't reset the RLIMIT_RTTIME
counter.

> Furthermore, a lot happens under the covers as part of drmAtomicCommit.
>
> Are you telling me that in all the drivers handling drmAtomicCommit,
> none of the work from those drivers gets deferred outside of process
> context if DRM_MODE_ATOMIC_NONBLOCK isn't set? I find that very hard
> to believe. But it would have to be true, if one of the main points of
> the blocking call is about CPU accounting, right? You can't say the
> point is about CPU accounting if some of the work is handled in a
> helper thread or work queue or whatever.
>
> ╎❯ git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l
> 182
>
> There seems to be a lot of non-process context execution going on in the tree...
>
>> So what you are suggesting here is to actually break that functionality
>> and that doesn't seem to make sense.
> What's the use case here that could break?

Well you are breaking the RLIMIT_RTTIME functionality.

The purpose of that functionality is to allow debugging and monitoring
applications to make sure that they stay alive and can react to external
signals.

From the RLIMIT_RTTIME documentation: "The intended use of this limit
is to stop a runaway real-time process from locking up the system."

And when drmAtomicCommit() is triggering this then we either have a
problem with the application doing something it is not supposed to do
(like blocking for vblank while it should listen to network traffic) or
the driver is somehow buggy.
> Walk me through a
> real-world, sensible example where a program depends on
> drmAtomicCommit() blocking and continuing to eat into process cpu time
> while it blocks. I just want to see where you are coming from. Maybe
> I'm being unimaginative but I just don't see it. I do see actual
> main-stream display servers breaking today because the current
> behavior defies expectations.
>
>> When it's really not desirable to account the CPU overhead to the
>> process initiating it then you probably rather want to use an non
>> blocking commit plus a dma_fence to wait for the work to end from userspace.
> Well, first I don't think that's very convenient. You're talking about
> a per-plane property, so there would need to be a separate file
> descriptor allocated for every plane, right? and user-space would have
> to block on all of them before proceeding? Also, are you sure that
> works in all cases? The problematic case we're facing right now is
> when all planes and all crtcs are getting disabled. I'm just skimming
> over the code here, but I see this:
>
> 	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
> 	...
> 		if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {
> 	...
> 			e = create_vblank_event(crtc, arg->user_data);
> 	...
> 			crtc_state->event = e;
> 		}
> 	...
> 		if (fence_ptr) {
> 	...
> 			fence = drm_crtc_create_fence(crtc);
> 	...
> 			ret = setup_out_fence(&f[(*num_fences)++], fence);
> 	...
> 			crtc_state->event->base.fence = fence;
> 		}
>
> Is the code really going to allocate a vblank_event when all the
> crtc's are disabled? I guess it's possible, but that's
> counterintuitive to me. I really don't know. You're confident a set of
> fences will actually work?

No, when you disable everything there is of course no fence allocated.

But then you also should never see anything waiting for too long to
actually be able to trigger the RLIMIT_RTTIME.
> Regardless, this seems like a roundabout way to address a problem that
> we could just ... fix.

Well, to make it clear: This is not a problem but intended behavior!

From what I know RLIMIT_RTTIME usually works in units of multiple
seconds. Milliseconds are possible as well, but when an application sets
a low millisecond timeout and then blocks on a vblank, which can easily
take 30ms to complete, that means you have a bug inside your
application. With this commit you are completely papering over that.

Regards,
Christian.

>
> --Ray
On 9/28/23 08:56, Christian König wrote: > Am 27.09.23 um 22:25 schrieb Ray Strode: >> On Wed, Sep 27, 2023 at 4:05 AM Christian König >> <christian.koenig@amd.com> wrote: > >>> When it's really not desirable to account the CPU overhead to the >>> process initiating it then you probably rather want to use an non >>> blocking commit plus a dma_fence to wait for the work to end from userspace. >> Well, first I don't think that's very convenient. You're talking about >> a per-plane property, so there would need to be a separate file >> descriptor allocated for every plane, right? and user-space would have >> to block on all of them before proceeding? OUT_FENCE_PTR is a per-CRTC property, not per-plane. Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >> Also, are you sure that works in all cases? The problematic case we're facing right >> now is when all planes and all crtcs are getting disabled. I'm just skimming >> over the code here, but I see this: >> >> → for_each_new_crtc_in_state(state, crtc, crtc_state, i) {• >> ... >> → → if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {• >> ... >> → → → e = create_vblank_event(crtc, arg->user_data);• >> ... >> → → → crtc_state->event = e;• >> → → }• >> ... >> → → if (fence_ptr) {• >> ... >> → → → fence = drm_crtc_create_fence(crtc);• >> ... >> → → → ret = setup_out_fence(&f[(*num_fences)++], fence);• >> ... >> → → → crtc_state->event->base.fence = fence;• >> → → }• >> >> Is the code really going to allocate a vblank_event when all the >> crtc's are disabled ? I guess it's possible, but that's >> counterintuitive to me. I really don't know. You're confident a set of >> fences will actually work? > > No when you disable everything there is of course no fence allocated. I don't see why not. 
"new_crtc_in_state" in this code means the atomic commit contains some state for the CRTC (such as setting the OUT_FENCE_PTR property), it could disable or leave it disabled though. I can't see any other code which would prevent this from working with a disabled CRTC, I could be missing something though. > But then you also should never see anything waiting for to long to actually be able to trigger the RLIMIT_RTTIME. amdgpu DC exceeds 200ms on some setups, see the linked issue. >> Regardless, this seems like a roundabout way to address a problem that >> we could just ... fix. Handling modesets asynchronously does seem desirable in mutter to me. E.g. it would allow doing modesets in parallel with modesets or other atomic commits on other GPUs. > From what I know RLIMIT_RTTIME usually works in units of multiple seconds. RealtimeKit seems to allow 200ms maximum.
Hi, On Thu, Sep 28, 2023 at 2:56 AM Christian König <christian.koenig@amd.com> wrote: > > To say the "whole point" is about CPU overhead accounting sounds > > rather absurd to me. Is that really what you meant? > > Yes, absolutely. See the functionality you try to implement already exists. You say lower in this same message that you don't believe the functionality actually works for the dpms off case I mentioned. > After making a non blocking commit userspace can still wait on the > commit to finish by looking at the out fence. fences, not fence, fences. drmModeAtomicCommit works on multiple objects at the same time. To follow the spirit of such an api, we would need a separate fd allocated for each crtc and would have to wait on all of them. Surely you can see how that is much less straightforward than using a blocking api. But mutter already uses non-blocking apis for the lion's share of cases. It doesn't need fences for those cases, though, because it can just use page flip events. The main reason it uses blocking apis are for modesets and when doing dpms off. The latter case you said you don't think can use fences, and it certainly can't use page flip events. So if you're right that fences can't be used for the dpms off case, it's not workable answer. If you're wrong, and fences can be used for the dpms off case, then it's a messy answer. > A blocking system call in the sense of RLIMIT_RTTIME means something > which results in the process listening for external events, e.g. calling > select(), epoll() or read() on (for example) a network socket etc... > > As far as I can see drmAtomicCommit() is *not* meant with that what > similar to for example yield() also doesn't reset the RLIMIT_RTTIME counter. No no no, drmModeAtomicCommit() is not like sched_yield(). That's a really strange thing to say (you do mean sched_yield() right?). 
sched_yield() is an oddball because it's specifically for giving other
threads a turn if they need it without causing the current thread to
sleep if they don't. It's a niche api that's meant for high performance
use cases. It's a way to reduce scheduling latency and increase running
time predictability. drmModeAtomicCommit() using up rt time, busy
looping while waiting on the hardware to respond, eating into userspace
RLIMIT_RTTIME is nothing like that.

I'm getting the idea that you think there is some big bucket of kernel
syscalls that block for a large fraction of a second by design and are
not meant to reset RLIMIT_RTTIME. I could be wrong, but I don't think
that's true. Off the top of my head, the only ones I can think of that
might reasonably do that are futex() (which obviously can't sleep),
sched_yield() (whose whole point is to not sleep), and maybe some
obscure ioctls (some probably legitimately, some probably illegitimately
and no one has noticed). I'm willing to be proven wrong here, and I
might be, but right now from thinking about it, my guess is the above
list is pretty close to complete.

> Well you are breaking the RLIMIT_RTTIME functionality.
>
> The purpose of that functionality is to allow debugging and monitoring
> applications to make sure that they keep alive and can react to external
> signals.

I don't think you really thought through what you're saying here. It
just flatly doesn't apply for drmModeAtomicCommit. What is an
application supposed to do? It can't block the SIGKILL that's coming.
Respond to the preceding SIGXCPUs? What response could the application
possibly make? I'm guessing drmModeAtomicCommit isn't going to EINTR
because it's busy looping waiting on hardware in the process context.

And the kernel doesn't even guarantee SIGXCPU is going to go to the
thread with the stuck syscall, so even if there was a legitimate
response (or even "pthread_cancel" or some reckless nonsense like
that), getting the notification to the right part of the program is an
exercise in gymnastics.

> From the RLIMIT_RTTIME documentation: "The intended use of this limit
> is to stop a runaway real-time process from locking up the system."
>
> And when drmAtomicCommit() is triggering this then we either have a
> problem with the application doing something it is not supposed to do
> (like blocking for vblank while it should listen to network traffic) or
> the driver is somehow buggy.

drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit
runs away in e.g. a set of 100ms busy loops responding to a confused or
slow responding GPU, the system will seemingly lock up to the user. That
is an intractable problem that we can not get away from. It doesn't
matter if the kernel is stuck in process context or on a workqueue. And,
regardless, it's not reasonable to expect userspace to craft elaborate
workarounds for driver bugs. We just have to fix the bugs.

> No when you disable everything there is of course no fence allocated.

Okay, so it's not actually an answer.

> But then you also should never see anything waiting for to long to
> actually be able to trigger the RLIMIT_RTTIME.

But we do. That's the problem. That's like the whole problem. The amdgpu
driver thinks it's okay to do something like:

	for_each_command_in_queue(&queue, command) {
		execute_command(command);
		while (1) {
			wait_for_response();

			if (delay++ > 100000)
				break;
			udelay(1);
		}
	}

or something like that, all while keeping the process in the RUNNABLE
state. It just seems wrong to me. At least let the userspace process
get scheduled out.

>> Regardless, this seems like a roundabout way to address a problem that
>> we could just ... fix.
> > Well to make it clear: This is not a problem but intended behavior! I'm going to be frank, I don't think this was intended behavior. We can wait for sima to get off PTO and chime in, to know one way or the other (or maybe airlied can chime in with his take?), but I doubt he was thinking about realtime scheduling minutiae when he put together the drmModeAtomicCommit implementation. There's no practical reason for doing things the way they're being done as far as I can tell. --Ray
hi, On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > >>> When it's really not desirable to account the CPU overhead to the > >>> process initiating it then you probably rather want to use an non > >>> blocking commit plus a dma_fence to wait for the work to end from userspace. > >> Well, first I don't think that's very convenient. You're talking about > >> a per-plane property, so there would need to be a separate file > >> descriptor allocated for every plane, right? and user-space would have > >> to block on all of them before proceeding? > > OUT_FENCE_PTR is a per-CRTC property, not per-plane. Okay, sure. > Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. I don't see how we could rely on that given the provided api and multitude of drivers. It might work and then break randomly. > >> Is the code really going to allocate a vblank_event when all the > >> crtc's are disabled ? I guess it's possible, but that's > >> counterintuitive to me. I really don't know. You're confident a set of > >> fences will actually work? > > > > No when you disable everything there is of course no fence allocated. > > I don't see why not. "new_crtc_in_state" in this code means the atomic > commit contains some state for the CRTC (such as setting the > OUT_FENCE_PTR property), it could disable or leave it disabled though. > I can't see any other code which would prevent this from working with a > disabled CRTC, I could be missing something though. So I'll concede it's possible it works, and the fact it's using a vblank type event is just technical debt (though Christian says he doesn't think it works, so I think it's possible it doesn't work as well.) 
Having said that, this whole argument of "the dysfunctional blocking
api is actually functioning as designed, but don't use it, since it
doesn't work, use the non-blocking api instead, because maybe it might
work or it might not, not sure" is pretty unconvincing to me.

> Handling modesets asynchronously does seem desirable in mutter to me.

Sure, fine. That doesn't mean we shouldn't fix the blocking call to
behave like almost all other blocking system calls.

--Ray
Hi,

Am 28.09.23 um 14:46 schrieb Ray Strode:
> Hi,
>
> On Thu, Sep 28, 2023 at 2:56 AM Christian König
> <christian.koenig@amd.com> wrote:
>>> To say the "whole point" is about CPU overhead accounting sounds
>>> rather absurd to me. Is that really what you meant?
>> Yes, absolutely. See the functionality you try to implement already exists.
> You say lower in this same message that you don't believe the
> functionality actually works for the dpms off case I mentioned.

Well, in the dpms off case there shouldn't be any blocking, as far as I
understand it. If you see a large delay in the dpms off case then we
probably have a driver bug somewhere.

> [SNIP]
>> A blocking system call in the sense of RLIMIT_RTTIME means something
>> which results in the process listening for external events, e.g. calling
>> select(), epoll() or read() on (for example) a network socket etc...
>>
>> As far as I can see drmAtomicCommit() is *not* meant with that what
>> similar to for example yield() also doesn't reset the RLIMIT_RTTIME counter.
> No no no, drmModeAtomicCommit() is not like sched_yield(). That's a
> really strange thing to say (you do mean sched_yield() right?).
> sched_yield() is an oddball because it's specifically for giving other
> threads a turn if they need it without causing the current thread to
> sleep if they don't. It's a niche api that's meant for high performance
> use cases. It's a way to reduce scheduling latency and increase running
> time predictability. drmModeAtomicCommit() using up rt time, busy
> looping while waiting on the hardware to respond, eating into userspace
> RLIMIT_RTTIME is nothing like that.
>
> I'm getting the idea that you think there is some big bucket of kernel
> syscalls that block for a large fraction of a second by design and are
> not meant to reset RLIMIT_RTTIME.

Yes, exactly that's the case as far as I know. The purpose of
RLIMIT_RTTIME is to have a watchdog that checks whether an application
is still responsive.
Only things which make the application look for outside events are
considered a reset for this watchdog. So for example calling select()
and poll() will reset the watchdog, but not calling any DRM modeset
functions or a power management IOCTL, for example.

There are only two possibilities I can see how a DRM modeset is
triggering this:

1. We create enough CPU overhead to actually exceed the limit. In this
   case triggering it is certainly intentional.

2. We busy wait for the hardware to reach a certain state. If that is
   true then this is a bug in the driver.

> I could be wrong, but I don't think
> that's true. Off the top of my head, the only ones I can think of that
> might reasonably do that are futex() (which obviously can't sleep),
> sched_yield() (who's whole point is to not sleep), and maybe a
> some obscure ioctls (some probably legitimately, some probably
> illegitimately and noone has noticed.). I'm willing to be proven wrong
> here, and I might be, but right now from thinking about it, my guess is
> the above list is pretty close to complete.
>
>> Well you are breaking the RLIMIT_RTTIME functionality.
>>
>> The purpose of that functionality is to allow debugging and monitoring
>> applications to make sure that they keep alive and can react to external
>> signals.
> I don't think you really thought through what you're saying here. It
> just flatly doesn't apply for drmModeAtomicCommit. What is an
> application supposed to do?

Get terminated and restart. That's what the whole functionality is good
for. If you don't desire that then don't use the RLIMIT_RTTIME
functionality.

> It can't block the SIGKILL that's coming.
> Respond to the preceding SIGXCPUs? What response could the application
> possibly make? I'm guessing drmModeAtomicCommit isn't going to EINTR
> because it's busy looping waiting on hardware in the process context.
> And the kernel doesn't even guarantee SIGXCPU is going to go to the > thread with the stuck syscall, so even if there was a legitimate > response (or even "pthread_cancel" or some wreckless nonsense like > that), getting the notification to the right part of the program is an > exercise in gymnastics. > >> From the RLIMIT_RTTIME documentation: "The intended use of this limit >> is to stop a runaway real-time process from locking up the system." >> >> And when drmAtomicCommit() is triggering this then we either have a >> problem with the application doing something it is not supposed to do >> (like blocking for vblank while it should listen to network traffic) or >> the driver is somehow buggy. > drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit > runs away in e.g. a set of 100ms busy loops responding to a confused or > slow responding GPU, the system will seemingly lock up to the user. That > is an intractable problem that we can not get away from. It doesn't > matter if the kernel is stuck in process context or on a workqueue. And, > regardless, it's not reasonable to expect userspace to craft elaborate > workarounds for driver bugs. We just have to fix the bugs. Exactly that's the reason why I'm pointing out that this isn't a good idea. > >> No when you disable everything there is of course no fence allocated. > Okay, so it's not actually an answer > >> But then you also should never see anything waiting for to long to >> actually be able to trigger the RLIMIT_RTTIME. > But we do. That's the problem. That's like the whole problem. The amdgpu > driver thinks it's okay to do something like: > > for_each_command_in_queue(&queue, command) { > execute_command(command); > while (1) { > wait_for_response(); > > if (delay++ > 100000) > break; > udelay(1); > } > } > > or something like that. all while keeping the process in the RUNNABLE > state. It just seems wrong to me. At least let the userspace process > get scheduled out. 
That just papers over the problem. The process doesn't become more responsive by hiding the problem. What you need to do here is to report those problems to the driver teams and not try to hide them this way. Regards, Christian. > >>> Regardless, this seems like a roundabout way to address a problem that >>> we could just ... fix. >> Well to make it clear: This is not a problem but intended behavior! > I'm going to be frank, I don't think this was intended behavior. We can > wait for sima to get off PTO and chime in, to know one way or the other > (or maybe airlied can chime in with his take?), but I doubt he was > thinking about realtime scheduling minutiae when he put together the > drmModeAtomicCommit implementation. > > There's no practical reason for doing things the way they're being done > as far as I can tell. > > --Ray
On 9/28/23 14:59, Ray Strode wrote: > On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer > <michel.daenzer@mailbox.org> wrote: >>>>> When it's really not desirable to account the CPU overhead to the >>>>> process initiating it then you probably rather want to use an non >>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>> Well, first I don't think that's very convenient. You're talking about >>>> a per-plane property, so there would need to be a separate file >>>> descriptor allocated for every plane, right? and user-space would have >>>> to block on all of them before proceeding? >> >> OUT_FENCE_PTR is a per-CRTC property, not per-plane. > > Okay, sure. > >> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. > > I don't see how we could rely on that given the provided api and > multitude of drivers. It might work and then break randomly. If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. I'm not really seeing the big difference between using a single fence or multiple, anyway. I do wonder if there might be a time window where the out fences have signalled, but the atomic commit ioctl will still fail with EBUSY. If there is though, I'd expect it to affect the flip completion events as well.
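To make the OUT_FENCE_PTR suggestion concrete: a sync file returned through that per-CRTC property signals by becoming readable, so userspace can wait for it with plain poll() — which, notably, is exactly the kind of blocking call that resets the RLIMIT_RTTIME watchdog. A minimal sketch; the helper name is invented, and the test below stands a pipe in for the fence fd since no real commit is involved:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait for a sync-file fd -- e.g. one returned via the per-CRTC
 * OUT_FENCE_PTR property of a nonblocking atomic commit -- to signal.
 * A sync file reports POLLIN once its underlying fence has signalled.
 * Returns 0 on signal, -1 on error or timeout. */
static int wait_fence_fd(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	return (ret == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}
```

With multiple CRTCs this generalizes to one pollfd per out-fence fd in a single poll() call, so a compositor blocks once regardless of how many fences it collected.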
On 9/28/23 15:23, Christian König wrote: > > What you need to do here is to report those problems to the driver teams and not try to hide them this way. See the linked issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 (BTW, the original reporter of that issue isn't hitting it with DPMS off but just starting a game, so there seem to be at least two separate causes of exceeding 200ms for an atomic commit in DC) It's not like GitLab issues get much if any attention by DC developers though... So Ray tried to come up with some kind of solution.
Hi, On Thu, Sep 28, 2023 at 9:24 AM Christian König <christian.koenig@amd.com> wrote: > If you see a large delay in the dpms off case then we probably have a driver bug somewhere. This is something we both agree on, I think. >> I'm getting the idea that you think there is some big bucket of kernel >> syscalls that block for a large fraction of a second by design and are >> not meant to reset RLIMIT_RTTIME. > > Yes, exactly that's the case as far as I know. Okay, well I'm willing to be proven wrong. My previously stated list is: futex, sched_yield, and maybe a few obscure ioctls. If you have a big bucket of syscalls in your mind, I'm all ears. I think I'm probably right and the lion's share of all syscalls that can block for a long time don't leave the process RUNNABLE, though. > The purpose of RLIMIT_RTTIME is to have a watchdog if an application is still responsive. Only things which make the application look for outside events are considered a reset for this watchdog. > > So for example calling select() and poll() will reset the watchdog, but not calling any DRM modeset functions or a power management IOCTL for example. What power management ioctls are you talking about? > > There are only two possibilities I can see how a DRM modeset is triggering this: > > 1. We create enough CPU overhead to actually exceed the limit. In this case triggering it is certainly intentional. As I said before I suspect that not all modeset work for all drivers is happening in the process context already anyway. So if this is intentional, I suspect it's not actually working as designed (but I haven't confirmed that, aside from my `git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l` litmus test). I can't come up with a good reason it would be designed like that, however, and you haven't provided one, so I suspect it's not actually intentional. >> I don't think you really thought through what you're saying here. 
It >> just flatly doesn't apply for drmModeAtomicCommit. What is an >> application supposed to do? > > Get terminated and restart. That's what the whole functionality is good for. > > If you don't desire that then don't use the RLIMIT_RTTIME functionality. No, getting terminated and restarting the session because drmModeAtomicCommit is taking too long is not really acceptable behavior for a display server. In general, crashing the session is maybe better than locking up the entire system hard, so RLIMIT_RTTIME is still a good idea, but in either case we're in bug land here: each such bug needs to be fixed. In this case, the bug is drmModeAtomicCommit isn't sleeping like nearly every other syscall does. > From the RLIMIT_RTTIME documentation: "The intended use of this limit > is to stop a runaway real-time process from locking up the system." > > And when drmAtomicCommit() is triggering this then we either have a > problem with the application doing something it is not supposed to do > (like blocking for vblank while it should listen to network traffic) or > the driver is somehow buggy. Yes! This is about bugs. > drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit > runs away in e.g. a set of 100ms busy loops responding to a confused or > slow responding GPU, the system will seemingly lock up to the user. That > is an intractable problem that we can not get away from. It doesn't > matter if the kernel is stuck in process context or on a workqueue. And, > regardless, it's not reasonable to expect userspace to craft elaborate > workarounds for driver bugs. We just have to fix the bugs. > > Exactly that's the reason why I'm pointing out that this isn't a good idea. Not following your logic here. > That just papers over the problem. The process doesn't become more responsive by hiding the problem. It's not about the process being responsive. In meat-space, the sub-second delay is imperceptible because the screen is turned off. 
It's about not killing an innocent display server because of a driver bug. > What you need to do here is to report those problems to the driver teams and not try to hide them this way. There is already a bug, it's mentioned in the merge request, the reason I cc'd you (vs say just Dave and Daniel) is because the driver bug is in amdgpu code. --Ray
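For reference, the limit being argued about is straightforward to set up from userspace. A sketch of what a session manager applies (the 200 ms figure matches the value mentioned in the thread; the helper name is illustrative): RLIMIT_RTTIME is accounted in microseconds of CPU time a SCHED_FIFO/SCHED_RR task consumes *without blocking* — a sleeping syscall resets the count, while a syscall that keeps the task RUNNABLE keeps eating the budget.

```c
#include <assert.h>
#include <sys/resource.h>

/* Cap the CPU time (in microseconds) a realtime task may burn without
 * ever blocking: past the soft limit the kernel sends SIGXCPU, past
 * the hard limit SIGKILL. The helper name is illustrative. */
static int limit_rt_cpu_us(rlim_t soft_us, rlim_t hard_us)
{
	struct rlimit rl = { .rlim_cur = soft_us, .rlim_max = hard_us };

	/* Only enforced while the task runs SCHED_FIFO or SCHED_RR. */
	return setrlimit(RLIMIT_RTTIME, &rl);
}
```

Lowering a limit needs no privileges, so a display server can apply this to itself before switching to a realtime scheduling class.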
Am 28.09.23 um 15:37 schrieb Michel Dänzer: > On 9/28/23 14:59, Ray Strode wrote: >> On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer >> <michel.daenzer@mailbox.org> wrote: >>>>>> When it's really not desirable to account the CPU overhead to the >>>>>> process initiating it then you probably rather want to use an non >>>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>>> Well, first I don't think that's very convenient. You're talking about >>>>> a per-plane property, so there would need to be a separate file >>>>> descriptor allocated for every plane, right? and user-space would have >>>>> to block on all of them before proceeding? >>> OUT_FENCE_PTR is a per-CRTC property, not per-plane. >> Okay, sure. >> >>> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >> I don't see how we could rely on that given the provided api and >> multitude of drivers. It might work and then break randomly. > If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. > > I'm not really seeing the big difference between using a single fence or multiple, anyway. The big difference is that a standard modeset can take some time, e.g. setting up power levels, waiting for PLLs to settle, waiting for a vblank etc. That this happens async in the background so that the frontend application can still respond to other signals seems reasonable. But in the case of turning things off, what should we wait for? I think we still support the out fence, but it doesn't really make sense to use it. > I do wonder if there might be a time window where the out fences have signalled, but the atomic commit ioctl will still fail with EBUSY. If there is though, I'd expect it to affect the flip completion events as well. > I'm not deep enough into that code to confirm that. Daniel or other display folks need to help here. Regards, Christian.
Am 28.09.23 um 15:58 schrieb Michel Dänzer: > On 9/28/23 15:23, Christian König wrote: >> What you need to do here is to report those problems to the driver teams and not try to hide them this way. > See the linked issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 (BTW, the original reporter of that issue isn't hitting it with DPMS off but just starting a game, so there seem to be at least two separate causes of exceeding 200ms for an atomic commit in DC) > > It's not like GitLab issues get much if any attention by DC developers though... So Ray tried to come up with some kind of solution. I haven't seen this bug either for the simple reason that it wasn't assigned to anybody. Going to grab it and forward it to the DC guys. That makes the reasoning understandable, but we should still absolutely not paper over problems as suggested with this patch. Thanks, Christian.
On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > From: Ray Strode <rstrode@redhat.com> > > A drm atomic commit can be quite slow on some hardware. It can lead > to a lengthy queue of commands that need to get processed and waited > on before control can go back to user space. > > If user space is a real-time thread, that delay can have severe > consequences, leading to the process getting killed for exceeding > rlimits. > > This commit addresses the problem by always running the slow part of > a commit on a workqueue, separated from the task initiating the > commit. > > This change makes the nonblocking and blocking paths work in the same way, > and as a result allows the task to sleep and not use up its > RLIMIT_RTTIME allocation. > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > Signed-off-by: Ray Strode <rstrode@redhat.com> > --- > drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 292e38eb6218..1a1e68d98d38 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, > * This is the point of no return - everything below never fails except > * when the hw goes bonghits. Which means we can commit the new state on > * the software side now. > */ > > ret = drm_atomic_helper_swap_state(state, true); > if (ret) > goto err; > > /* > * Everything below can be run asynchronously without the need to grab > * any modeset locks at all under one condition: It must be guaranteed > * that the asynchronous work has either been cancelled (if the driver > * supports it, which at least requires that the framebuffers get > * cleaned up with drm_atomic_helper_cleanup_planes()) or completed > * before the new state gets committed on the software side with > * drm_atomic_helper_swap_state(). 
> * > * This scheme allows new atomic state updates to be prepared and > * checked in parallel to the asynchronous completion of the previous > * update. Which is important since compositors need to figure out the > * composition of the next frame right after having submitted the > * current layout. > * > * NOTE: Commit work has multiple phases, first hardware commit, then > * cleanup. We want them to overlap, hence need system_unbound_wq to > * make sure work items don't artificially stall on each another. > */ > > drm_atomic_state_get(state); > - if (nonblock) > - queue_work(system_unbound_wq, &state->commit_work); > - else > - commit_tail(state); > + queue_work(system_unbound_wq, &state->commit_work); > + if (!nonblock) > + flush_work(&state->commit_work); Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ except I went further and moved the flush past the unlock in the end.
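The shape of the proposed change — always queue the commit tail, and have the blocking path merely wait for it — has a simple userspace analog. A sketch with pthreads (all names invented), where the join plays the role of flush_work(): the blocking caller sleeps until the worker finishes instead of executing the slow tail itself:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace analog of the queue_work()+flush_work() pattern: both
 * blocking and nonblocking commits hand the slow tail to a worker;
 * a blocking commit then just sleeps in the join until it finishes. */
struct commit_work {
	pthread_t thread;
	void (*tail)(void *state);
	void *state;
};

static void *commit_worker(void *arg)
{
	struct commit_work *w = arg;

	w->tail(w->state);	/* the slow hardware programming */
	return NULL;
}

static int commit(struct commit_work *w, bool nonblock)
{
	if (pthread_create(&w->thread, NULL, commit_worker, w))
		return -1;
	if (!nonblock)
		pthread_join(w->thread, NULL);	/* like flush_work() */
	/* a real nonblocking path would detach or join the thread later */
	return 0;
}

/* Toy tail for demonstration: just records that it ran. */
static int programmed;
static void fake_tail(void *state)
{
	(void)state;
	programmed++;
}
```

Either way the tail runs off the caller's context; the only difference between the two modes is whether the caller waits (sleeping) before returning.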
On 9/28/23 16:51, Christian König wrote: > Am 28.09.23 um 15:37 schrieb Michel Dänzer: >> On 9/28/23 14:59, Ray Strode wrote: >>> On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer >>> <michel.daenzer@mailbox.org> wrote: >>>>>>> When it's really not desirable to account the CPU overhead to the >>>>>>> process initiating it then you probably rather want to use an non >>>>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>>>> Well, first I don't think that's very convenient. You're talking about >>>>>> a per-plane property, so there would need to be a separate file >>>>>> descriptor allocated for every plane, right? and user-space would have >>>>>> to block on all of them before proceeding? >>>> OUT_FENCE_PTR is a per-CRTC property, not per-plane. >>> Okay, sure. >>> >>>> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >>> I don't see how we could rely on that given the provided api and >>> multitude of drivers. It might work and then break randomly. >> If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. >> >> I'm not really seeing the big difference between using a single fence or multiple, anyway. > > The big difference is that a standard modeset can take some time, e.g. setting up power levels, waiting for PLLs to settle, waiting for a vblank etc.. Right (starting a game shouldn't involve such a modeset though in a Wayland session). What I mean is there's no significant difference for user space between using a single out fence or multiple. Just slightly different bookkeeping. > But in the case of turning thing off, what should we wait for? The atomic commit to finish, so it's safe to submit another one without hitting EBUSY. (What we use the completion events for with page flips)
Hi, On Thu, Sep 28, 2023 at 11:05 AM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ Nice. Was there push back? Why didn't it go in? > except I went further and moved the flush past the unlock in the end. Is that necessary? I was wondering about that but there's this comment in the code: * ... Locks, especially struct * &drm_modeset_lock, should not be held in worker threads or any other * asynchronous context used to commit the hardware state. So it made me think there would be no locks held, and then I threw a scratch build at someone who reported it didn't deadlock and solved their issue. --Ray
On Thu, Sep 28, 2023 at 03:33:46PM -0400, Ray Strode wrote: > Hi, > > On Thu, Sep 28, 2023 at 11:05 AM Ville Syrjälä > <ville.syrjala@linux.intel.com> wrote: > > Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ > > Nice. Was there push back? Why didn't it go in? No one really seemed all that interested in it. I'd still like to get it in, if for no other reason than to make things operate more uniformly. Though there are lots of legacy codepaths left that still hold the locks over the whole commit, but those could be fixed up as a followup. > > > except I went further and moved the flush past the unlock in the end. > > Is that necessary? I was wondering about that but there's this comment in the code: I'm not really sure there is any point in doing this otherwise. It would just change which thread executes the code but everything else would still get blocked on the locks the same as before. > > * ... Locks, especially struct > * &drm_modeset_lock, should not be held in worker threads or any other > * asynchronous context used to commit the hardware state. > > So it made me think there would be no locks held, and then I threw a > scratch build > at someone who reported it didn't deadlock and solved their issue. > > --Ray
Hi, On Wed, Oct 4, 2023 at 1:28 PM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > No one really seemed all that interested in it. I'd still like to get > it in, if for no other reason than to make things operate more uniformly. > Though there are lots of legacy codepaths left that still hold the locks > over the whole commit, but those could be fixed up as a followup. Ah I see what you're saying. > > > except I went further and moved the flush past the unlock in the end. > > > > Is that necessary? I was wondering about that but there's this comment > > in the code: > > I'm not really sure there is any point in doing this otherwise. > It would just change which thread executes the code but everything > else would still get blocked on the locks the same as before. Well in my case, I just want driver work to not be charged to the process's CPU time. But I get what you're saying: checking if new configurations are valid shouldn't block on existing configurations getting applied. Doing it outside the locks isn't necessary to prevent deadlocks, it's necessary because you're trying to avoid contention when there doesn't need to be contention. Your patchset sort of has a different goal, but it solves the issue I care about at the same time. I'd be happy if it went in... Daniel, Dave, what are your takes on this? --Ray
On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > From: Ray Strode <rstrode@redhat.com> > > A drm atomic commit can be quite slow on some hardware. It can lead > to a lengthy queue of commands that need to get processed and waited > on before control can go back to user space. > > If user space is a real-time thread, that delay can have severe > consequences, leading to the process getting killed for exceeding > rlimits. > > This commit addresses the problem by always running the slow part of > a commit on a workqueue, separated from the task initiating the > commit. > > This change makes the nonblocking and blocking paths work in the same way, > and as a result allows the task to sleep and not use up its > RLIMIT_RTTIME allocation. > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > Signed-off-by: Ray Strode <rstrode@redhat.com> So imo the trouble with this is that we suddenly start to make realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi change, because even limited to the case of !ALLOW_MODESET we do best effort guarantees at best. And some drivers (again amd's dc) spend a ton of cpu time recomputing state even for pure plane changes without any crtc changes like dpms on/off (at least I remember some bug reports about that). And that state recomputation has to happen synchronously, because it always influences the ioctl errno return value. My take is that you're papering over a performance problem here of the "the driver is too slow/wastes too much cpu time". We should fix the driver, if that's possible. Another option would be if userspace drops realtime priorities for these known-slow operations. And right now _all_ kms operations are potentially cpu and real-time wasters, the entire uapi is best effort. 
We can also try to change the atomic uapi to give some hard real-time guarantees so that running compositors as SCHED_RT is possible, but that
- means a very serious stream of bugs to fix all over
- therefore needs some very wide buy-in from drivers that they're willing to make this guarantee
- probably needs some really carefully carved out limitations, because there's imo flat-out no way we'll make all atomic ioctls hard time limit bound

Also, as König has pointed out, you can roll this duct-tape out in userspace by making the commit non-blocking and immediately waiting for the fences.

One thing I didn't see mentioned is that there's a very subtle uapi difference between non-blocking and blocking:
- non-blocking is not allowed to get ahead of the previous commit, and will return EBUSY in that case. See the comment in drm_atomic_helper_commit()
- blocking otoh will just block until any previous pending commit has finished

Not taking that into account in your patch here breaks uapi because userspace will suddenly get EBUSY when they don't expect that. Cheers, Sima > --- > drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 292e38eb6218..1a1e68d98d38 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, > * This is the point of no return - everything below never fails except > * when the hw goes bonghits. Which means we can commit the new state on > * the software side now. 
> */ > > ret = drm_atomic_helper_swap_state(state, true); > if (ret) > goto err; > > /* > * Everything below can be run asynchronously without the need to grab > * any modeset locks at all under one condition: It must be guaranteed > * that the asynchronous work has either been cancelled (if the driver > * supports it, which at least requires that the framebuffers get > * cleaned up with drm_atomic_helper_cleanup_planes()) or completed > * before the new state gets committed on the software side with > * drm_atomic_helper_swap_state(). > * > * This scheme allows new atomic state updates to be prepared and > * checked in parallel to the asynchronous completion of the previous > * update. Which is important since compositors need to figure out the > * composition of the next frame right after having submitted the > * current layout. > * > * NOTE: Commit work has multiple phases, first hardware commit, then > * cleanup. We want them to overlap, hence need system_unbound_wq to > * make sure work items don't artificially stall on each another. > */ > > drm_atomic_state_get(state); > - if (nonblock) > - queue_work(system_unbound_wq, &state->commit_work); > - else > - commit_tail(state); > + queue_work(system_unbound_wq, &state->commit_work); > + if (!nonblock) > + flush_work(&state->commit_work); > > return 0; > > err: > drm_atomic_helper_cleanup_planes(dev, state); > return ret; > } > EXPORT_SYMBOL(drm_atomic_helper_commit); > > /** > * DOC: implementing nonblocking commit > * > * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence > * different operations against each another. Locks, especially struct > * &drm_modeset_lock, should not be held in worker threads or any other > * asynchronous context used to commit the hardware state. > * > * drm_atomic_helper_commit() implements the recommended sequence for > * nonblocking commits, using drm_atomic_helper_setup_commit() internally: > * > * 1. Run drm_atomic_helper_prepare_planes(). 
Since this can fail and we > * need to propagate out of memory/VRAM errors to userspace, it must be called > * synchronously. > * > * 2. Synchronize with any outstanding nonblocking commit worker threads which > * might be affected by the new state update. This is handled by > * drm_atomic_helper_setup_commit(). > * > * Asynchronous workers need to have sufficient parallelism to be able to run > * different atomic commits on different CRTCs in parallel. The simplest way to > -- > 2.41.0 >
On Thu, Oct 05, 2023 at 11:57:41AM +0200, Daniel Vetter wrote: > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > > From: Ray Strode <rstrode@redhat.com> > > > > A drm atomic commit can be quite slow on some hardware. It can lead > > to a lengthy queue of commands that need to get processed and waited > > on before control can go back to user space. > > > > If user space is a real-time thread, that delay can have severe > > consequences, leading to the process getting killed for exceeding > > rlimits. > > > > This commit addresses the problem by always running the slow part of > > a commit on a workqueue, separated from the task initiating the > > commit. > > > > This change makes the nonblocking and blocking paths work in the same way, > > and as a result allows the task to sleep and not use up its > > RLIMIT_RTTIME allocation. > > > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > > Signed-off-by: Ray Strode <rstrode@redhat.com> > > So imo the trouble with this is that we suddenly start to make > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > change, because even limited to the case of !ALLOW_MODESET we do best > effort guarantees at best. And some drivers (again amd's dc) spend a ton > of cpu time recomputing state even for pure plane changes without any crtc > changes like dpms on/off (at least I remember some bug reports about > that). And that state recomputation has to happen synchronously, because > it always influences the ioctl errno return value. > > My take is that you're papering over a performance problem here of the > "the driver is too slow/wastes too much cpu time". We should fix the > driver, if that's possible. > > Another option would be if userspace drops realtime priorities for these > known-slow operations. And right now _all_ kms operations are potentially > cpu and real-time wasters, the entire uapi is best effort. 
> > We can also try to change the atomic uapi to give some hard real-time > guarantees so that running compositors as SCHED_RT is possible, but that > - means a very serious stream of bugs to fix all over > - therefore needs some very wide buy-in from drivers that they're willing > to make this guarantee > - probably needs some really carefully carved out limitations, because > there's imo flat-out no way we'll make all atomic ioctl hard time limit > bound > > Also, as König has pointed out, you can roll this duct-tape out in > userspace by making the commit non-blocking and immediately waiting for > the fences. > > One thing I didn't see mention is that there's a very subtle uapi > difference between non-blocking and blocking: > - non-blocking is not allowed to get ahead of the previous commit, and > will return EBUSY in that case. See the comment in > drm_atomic_helper_commit() > - blocking otoh will just block until any previous pending commit has > finished > > Not taking that into account in your patch here breaks uapi because > userspace will suddenly get EBUSY when they don't expect that. The -EBUSY logic already checks whether the current commit is non-blocking vs. blocking, so I don't see how there would be any change in behaviour from simply stuffing the commit_tail onto a workqueue, especially as the locks will still be held across the flush. In my earlier series [1] where I move the flush to happen after dropping the locks, there is a far more subtle issue because currently even non-blocking commits can actually block due to the mutex. Changing that might break something, so I preserved that behaviour explicitly. Full explanation in the first patch there. [1] https://patchwork.freedesktop.org/series/108668/
Am 05.10.23 um 11:57 schrieb Daniel Vetter: > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: >> From: Ray Strode <rstrode@redhat.com> >> >> A drm atomic commit can be quite slow on some hardware. It can lead >> to a lengthy queue of commands that need to get processed and waited >> on before control can go back to user space. >> >> If user space is a real-time thread, that delay can have severe >> consequences, leading to the process getting killed for exceeding >> rlimits. >> >> This commit addresses the problem by always running the slow part of >> a commit on a workqueue, separated from the task initiating the >> commit. >> >> This change makes the nonblocking and blocking paths work in the same way, >> and as a result allows the task to sleep and not use up its >> RLIMIT_RTTIME allocation. >> >> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 >> Signed-off-by: Ray Strode <rstrode@redhat.com> > So imo the trouble with this is that we suddenly start to make > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > change, because even limited to the case of !ALLOW_MODESET we do best > effort guarantees at best. And some drivers (again amd's dc) spend a ton > of cpu time recomputing state even for pure plane changes without any crtc > changes like dpms on/off (at least I remember some bug reports about > that). And that state recomputation has to happen synchronously, because > it always influences the ioctl errno return value. > > My take is that you're papering over a performance problem here of the > "the driver is too slow/wastes too much cpu time". We should fix the > driver, if that's possible. > > Another option would be if userspace drops realtime priorities for these > known-slow operations. And right now _all_ kms operations are potentially > cpu and real-time wasters, the entire uapi is best effort. 
> > We can also try to change the atomic uapi to give some hard real-time > guarantees so that running compositors as SCHED_RT is possible, but that > - means a very serious stream of bugs to fix all over > - therefore needs some very wide buy-in from drivers that they're willing > to make this guarantee > - probably needs some really carefully carved out limitations, because > there's imo flat-out no way we'll make all atomic ioctl hard time limit > bound Well, we actually have a pending request to support some real time use cases with the amdgpu driver. And as I already said to Alex internally this is not a pile, but a mountain of work even when we exclude DC. Fixing DC to not busy wait for events which raise an interrupt is still something we should absolutely do anyway, but that is an ongoing process. > Also, as König has pointed out, you can roll this duct-tape out in > userspace by making the commit non-blocking and immediately waiting for > the fences. Oh, please don't use my last name like this. It reminds me of my military time :) Regards, Christian. > > One thing I didn't see mention is that there's a very subtle uapi > difference between non-blocking and blocking: > - non-blocking is not allowed to get ahead of the previous commit, and > will return EBUSY in that case. See the comment in > drm_atomic_helper_commit() > - blocking otoh will just block until any previous pending commit has > finished > > Not taking that into account in your patch here breaks uapi because > userspace will suddenly get EBUSY when they don't expect that. 
> > Cheers, Sima > > >> --- >> drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c >> index 292e38eb6218..1a1e68d98d38 100644 >> --- a/drivers/gpu/drm/drm_atomic_helper.c >> +++ b/drivers/gpu/drm/drm_atomic_helper.c >> @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, >> * This is the point of no return - everything below never fails except >> * when the hw goes bonghits. Which means we can commit the new state on >> * the software side now. >> */ >> >> ret = drm_atomic_helper_swap_state(state, true); >> if (ret) >> goto err; >> >> /* >> * Everything below can be run asynchronously without the need to grab >> * any modeset locks at all under one condition: It must be guaranteed >> * that the asynchronous work has either been cancelled (if the driver >> * supports it, which at least requires that the framebuffers get >> * cleaned up with drm_atomic_helper_cleanup_planes()) or completed >> * before the new state gets committed on the software side with >> * drm_atomic_helper_swap_state(). >> * >> * This scheme allows new atomic state updates to be prepared and >> * checked in parallel to the asynchronous completion of the previous >> * update. Which is important since compositors need to figure out the >> * composition of the next frame right after having submitted the >> * current layout. >> * >> * NOTE: Commit work has multiple phases, first hardware commit, then >> * cleanup. We want them to overlap, hence need system_unbound_wq to >> * make sure work items don't artificially stall on each another. 
>> */ >> >> drm_atomic_state_get(state); >> - if (nonblock) >> - queue_work(system_unbound_wq, &state->commit_work); >> - else >> - commit_tail(state); >> + queue_work(system_unbound_wq, &state->commit_work); >> + if (!nonblock) >> + flush_work(&state->commit_work); >> >> return 0; >> >> err: >> drm_atomic_helper_cleanup_planes(dev, state); >> return ret; >> } >> EXPORT_SYMBOL(drm_atomic_helper_commit); >> >> /** >> * DOC: implementing nonblocking commit >> * >> * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence >> * different operations against each another. Locks, especially struct >> * &drm_modeset_lock, should not be held in worker threads or any other >> * asynchronous context used to commit the hardware state. >> * >> * drm_atomic_helper_commit() implements the recommended sequence for >> * nonblocking commits, using drm_atomic_helper_setup_commit() internally: >> * >> * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we >> * need to propagate out of memory/VRAM errors to userspace, it must be called >> * synchronously. >> * >> * 2. Synchronize with any outstanding nonblocking commit worker threads which >> * might be affected by the new state update. This is handled by >> * drm_atomic_helper_setup_commit(). >> * >> * Asynchronous workers need to have sufficient parallelism to be able to run >> * different atomic commits on different CRTCs in parallel. The simplest way to >> -- >> 2.41.0 >>
Hi,

On Thu, Oct 5, 2023 at 5:57 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> So imo the trouble with this is that we suddenly start to make
> realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi
> change, because even limited to the case of !ALLOW_MODESET we do best
> effort guarantees at best.

So you're saying there's a best effort to not use up process CPU time, but Christian was trying to argue nearly the opposite, that any needed CPU time for the operation should get accounted toward the process.

What you're saying makes more sense to me and it tracks with what I'm getting skimming over the code. I see it potentially sleeps during blocking drmModeAtomicCommit() calls in several places:

- copy_from_user when copying properties
- fence and vblank event allocation
- waiting on modeset locks
- And even plane changes:

	for_each_new_plane_in_state(state, plane, new_plane_state, i) {
	...
		/*
		 * If waiting for fences pre-swap (ie: nonblock), userspace can
		 * still interrupt the operation. Instead of blocking until the
		 * timer expires, make the wait interruptible.
		 */
		ret = dma_fence_wait(new_plane_state->fence, pre_swap);
	...
	}

(Ignore the typo where it says "nonblock" instead of "!nonblock"; the point is that while waiting on plane state changes it sleeps.)

So, in general, the ioctl sleeps when it needs to. It's not trying to be some non-preemptible RT task with a dedicated cpu budget that it strictly adheres to. That makes sense to me.

> And that state recomputation has to happen synchronously, because
> it always influences the ioctl errno return value.

Not sure I'm following. The task doesn't have to stay in RUNNING the entire time the ioctl is getting carried out. It can get preempted, and rescheduled later, all before returning from the ioctl and delivering the errno back to userspace. What am I missing?

The problem isn't that the ioctl blocks; the problem is that when it blocks it shouldn't be leaving the task state in RUNNING.
> My take is that you're papering over a performance problem here of the
> "the driver is too slow/wastes too much cpu time". We should fix the
> driver, if that's possible.

I think everyone agrees the amdgpu DC code doing several 100ms busy loops in a row with no break in between is buggy behavior, and getting that fixed is important.

Also, there's no dispute that the impetus for my proposed change was that misbehavior in the driver code.

Neither of those facts means we shouldn't improve drm_atomic_helper_commit too. Making the change is a good idea. It was even independently proposed a year ago, for reasons unrelated to the current situation. It makes the nonblock and the !nonblock code paths behave closer to the same. It makes the userspace experience better in the face of driver bugs. You said best effort earlier; this change helps provide a best effort.

Realistically, if it had been done this way from the start, no one would be batting an eye, right? It just improves things all around. I think maybe you're worried about a slippery slope?

> We can also try to change the atomic uapi to give some hard real-time
> guarantees so that running compositors as SCHED_RT is possible, but that
> - means a very serious stream of bugs to fix all over
> - therefore needs some very wide buy-in from drivers that they're willing
>   to make this guarantee
> - probably needs some really carefully carved out limitations, because
>   there's imo flat-out no way we'll make all atomic ioctl hard time limit
>   bound

We don't need a guarantee that reprogramming ast BMC framebuffer offsets or displayport link training (or whatever) takes less than 200ms. If a driver has to sit and wait 32ms for vblank before twiddling things in hardware to keep crud from showing up on screen or something, that's fine. But in none of these cases should the userspace process be kept in RUNNING while it does it!
I take your point that there are a lot of drivers that may be doing slow things, but surely they should let the process nap while they do their work?

> Also, as König has pointed out, you can roll this duct-tape out in
> userspace by making the commit non-blocking and immediately waiting for
> the fences.

Sure, userspace can do that (even, it turns out, in the all-crtcs-disabled case, which was an open question earlier in the thread).

That's not really an argument against fixing the !nonblock case though.

> One thing I didn't see mention is that there's a very subtle uapi
> difference between non-blocking and blocking:
> - non-blocking is not allowed to get ahead of the previous commit, and
>   will return EBUSY in that case. See the comment in
>   drm_atomic_helper_commit()
> - blocking otoh will just block until any previous pending commit has
>   finished
>
> Not taking that into account in your patch here breaks uapi because
> userspace will suddenly get EBUSY when they don't expect that.

I don't think that's right; drm_atomic_helper_setup_commit has several chunks of code like this:

	if (nonblock && old_conn_state->commit &&
	    !try_wait_for_completion(&old_conn_state->commit->flip_done)) {
	...
		return -EBUSY;
	}

So it only returns EBUSY for DRM_MODE_ATOMIC_NONBLOCK cases.

There's also this code earlier in the process:

	list_for_each_entry(commit, &crtc->commit_list, commit_entry) {
	...
		completed = try_wait_for_completion(&commit->flip_done);
	...
		if (!completed && nonblock) {
	...
			return -EBUSY;
		}
	...
	}
	...
	ret = wait_for_completion_interruptible_timeout(&stall_commit->cleanup_done,
							10*HZ);
	...

So it looks like it sleeps (not leaving the task in RUNNING state!) if there's an outstanding commit.

--Ray
Am 05.10.23 um 23:04 schrieb Ray Strode: > Hi, > > On Thu, Oct 5, 2023 at 5:57 AM Daniel Vetter <daniel@ffwll.ch> wrote: >> So imo the trouble with this is that we suddenly start to make >> realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi >> change, because even limited to the case of !ALLOW_MODESET we do best >> effort guarantees at best. > So you're saying there's a best effort to not use up process CPU time, > but Christian was trying to argue nearly the opposite, that any needed > CPU time for the operation should get accounted toward the process. Well as far as I can see what Daniel and I say here perfectly matches. When the operation busy waits then that *should* get accounted to the CPU time of the current process. When the operation sleeps and waits for some interrupt for example it should not get accounted. What you suggest is to put the parts of the operation which busy wait into a background task and then sleep for that task to complete. This is not a solution to the problem, but just hides it. Stuff like that is not a valid justification for the change. Ville changes on the other hand tried to prevent lock contention which is a valid goal here. Regards, Christian. > > What you're saying makes more sense to me and it tracks with what I'm > getting skimming over the code. I see it potentially sleeps during > blocking drmModeAtomicCommit() calls in several places: > > - copy_from_user when copying properties > - fence and vblank event allocation > - waiting on modeset locks > - And even plane changes: > > for_each_new_plane_in_state(state, plane, new_plane_state, i) {• > ... > → /*• > → * If waiting for fences pre-swap (ie: nonblock), userspace can• > → * still interrupt the operation. Instead of blocking until the• > → * timer expires, make the wait interruptible.• > → */• > → ret = dma_fence_wait(new_plane_state->fence, pre_swap);• > ... 
> }• > > (Ignore the typo where it says "nonblock" instead of "!nonblock", > the point is while waiting on plane state changes it sleeps.) > > So, in general, the ioctl sleeps when it needs to. It's not trying > to be some non-premptible RT task with a dedicated cpu budget that it > strictly adheres to. That makes sense to me. > >> And that state recomputation has to happen synchronously, because >> it always influences the ioctl errno return value. > Not sure I'm following. The task doesn't have to stay in RUNNING the > entire time the ioctl is getting carried out. It can get preempted, > and rescheduled later, all before returning from the ioctl and > delivering the errno back to userspace. What am I missing? > > The problem isn't that the ioctl blocks, the problem is that when it > blocks it shouldn't be leaving the task state in RUNNING. > >> My take is that you're papering over a performance problem here of the >> "the driver is too slow/wastes too much cpu time". We should fix the >> driver, if that's possible. > I think everyone agrees the amdgpu DC code doing several 100ms busy > loops in a row with no break in between is buggy behavior, and getting > that fixed is important. > > Also, there's no dispute that the impetus for my proposed change was > that misbehavior in the driver code. > > Neither of those facts mean we shouldn't improve > drm_atomic_helper_commit too. Making the change is a good idea. It was > even independently proposed a year ago, for reasons unrelated to the > current situation. It makes the nonblock and the > !nonblock code paths behave closer to the same. it makes the userspace > experience better in the face of driver bugs. You said best effort > earlier, this change helps provide a best effort. > > Realistically, if it was done this way from the start, no one would be > batting an eye, right? It just improves things all around. I think > maybe you're worried about a slippery slope? 
> >> We can also try to change the atomic uapi to give some hard real-time >> guarantees so that running compositors as SCHED_RT is possible, but that >> - means a very serious stream of bugs to fix all over >> - therefore needs some very wide buy-in from drivers that they're willing >> to make this guarantee >> - probably needs some really carefully carved out limitations, because >> there's imo flat-out no way we'll make all atomic ioctl hard time limit >> bound > We don't need a guarantee that reprogramming ast BMC framebuffer > offsets or displayport link training (or whatever) takes less than > 200ms. If a driver has to sit and wait 32ms for vblank before > twiddling things in hardware to keep crud from showing up on screen or > something, that's fine. But in none of these cases should the > userspace process be kept in RUNNING while it does it! > > I take your point that there are a lot of drivers that may be doing > slow things, but surely they should let the process nap while they do > their work? > >> Also, as König has pointed out, you can roll this duct-tape out in >> userspace by making the commit non-blocking and immediately waiting for >> the fences. > Sure, userspace can do that (even, it turns out, in the all crtc > disabled case which was an open question earlier in the thread). > > That's not really an argument against fixing the !nonblock case > though. > >> One thing I didn't see mention is that there's a very subtle uapi >> difference between non-blocking and blocking: >> - non-blocking is not allowed to get ahead of the previous commit, and >> will return EBUSY in that case. See the comment in >> drm_atomic_helper_commit() >> - blocking otoh will just block until any previous pending commit has >> finished >> >> Not taking that into account in your patch here breaks uapi because >> userspace will suddenly get EBUSY when they don't expect that. 
> I don't think that's right, drm_atomic_helper_setup_commit has several
> chunks of code like this:
>
>	if (nonblock && old_conn_state->commit &&
>	    !try_wait_for_completion(&old_conn_state->commit->flip_done)) {
>	...
>		return -EBUSY;
>	}
>
> So it only returns EBUSY for DRM_MODE_ATOMIC_NONBLOCK cases.
>
> There's also this code earlier in the process:
>
>	list_for_each_entry(commit, &crtc->commit_list, commit_entry) {
>	...
>		completed = try_wait_for_completion(&commit->flip_done);
>	...
>		if (!completed && nonblock) {
>	...
>			return -EBUSY;
>		}
>	...
>	}
>	...
>	ret = wait_for_completion_interruptible_timeout(&stall_commit->cleanup_done,
>							10*HZ);
>	...
>
> So it looks like it sleeps (not leaving the system in RUNNING state!) if
> there's an outstanding commit.
>
> --Ray
Hi, On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > When the operation busy waits then that *should* get accounted to the > CPU time of the current process. When the operation sleeps and waits for > some interrupt for example it should not get accounted. > What you suggest is to put the parts of the operation which busy wait > into a background task and then sleep for that task to complete. This is > not a solution to the problem, but just hides it. Actually, I think we both probably agree that there shouldn't be long busy waits in the context of the current process. After all, we both agree what the AMD DC driver code is doing is wrong. To be clear, my take is, if driver code is running in process context and needs to wait for periods of time on the order of or in excess of a typical process time slice it should be sleeping during the waiting. If the operation is at a point where it can be cancelled without side effects, the sleeping should be INTERRUPTIBLE. If it's past the point of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my opinion, should kernel code busy block a typical process for dozens of milliseconds while keeping the process RUNNING. I don't think this is a controversial take. Actually, I think (maybe?) you might even agree with that, but you're also saying: user space processes aren't special here. While it's not okay to busy block them, it's also not okay to busy block on the system unbound workqueue either. If that's your sentiment, I don't disagree with it. So I think we both agree the busy waiting is a problem, but maybe we disagree on the best place for the problem to manifest when it happens. One thought re the DC code is regardless of where the code is running, the scheduler is going to forcefully preempt it at some point right? Any writereg/udelay(1)/readreg loop is going to get disrupted by a much bigger than 1us delay by the kernel if the loop goes on long enough. 
I'm not wrong about that? If that's true, the code might as well switch out the udelay(1) for a usleep_range() and call it a day (well, modulo the fact I think it can be called from an interrupt handler; at least "git grep" says there's a link_set_dpms_off in link_dp_irq_handler.c).

> Stuff like that is not a valid justification for the change. Ville
> changes on the other hand tried to prevent lock contention which is a
> valid goal here.

Okay, so let's land his patchset! (assuming it's ready to go in). Ville, is that something you'd want to resend for review?

Note, a point that I don't think has been brought up yet is that the system unbound workqueue doesn't run with real-time priority. Given that the lion's share of mutter's drmModeAtomicCommit calls are nonblock, and so are using the system unbound workqueue for handling the commits, even without this patch, that somewhat diminishes the utility of using a realtime thread anyway. I believe the original point of the realtime thread was to make sure mouse cursor motion updates are as responsive as possible, because any latency there is something the user really feels. Maybe there needs to be a different mechanism in place to make sure user-perceived interactivity is given priority.

--Ray
On 10/6/23 20:48, Ray Strode wrote: > > Note, a point that I don't think has been brought up yet, too, is > the system unbound workqueue doesn't run with real time priority. > Given the lion's share of mutter's drmModeAtomicCommit calls are > nonblock, and so are using the system unbound workqueue for handling > the commits, even without this patch, that somewhat diminishes the > utility of using a realtime thread anyway. I believe the original > point of the realtime thread was to make sure mouse cursor motion > updates are as responsive as possible, because any latency there is > something the user really feels. Mutter's KMS thread needs to be real-time so that it can reliably schedule its work building up to calling the atomic commit ioctl for minimal input → output latency. That some of the ioctl's own work doesn't run at the same priority doesn't necessarily matter for this, as long as it can hit the next vertical blank period. BTW, I understand kwin has or is planning to get a real-time KMS thread as well, for the same reasons. > Maybe there needs to be a different mechanism in place to make sure > user perceived interactivity is given priority. The only alternative I'm aware of having been discussed so far is allowing atomic commits to be amended. I don't think that would be a great solution for this issue though, as it would result in Wayland compositors wasting CPU cycles (in other words, energy) for constant amendments of atomic commits, in the hope that one of them results in good latency.
Am 06.10.23 um 20:48 schrieb Ray Strode:
> Hi,
>
> On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote:
>> When the operation busy waits then that *should* get accounted to the
>> CPU time of the current process. When the operation sleeps and waits for
>> some interrupt for example it should not get accounted.
>> What you suggest is to put the parts of the operation which busy wait
>> into a background task and then sleep for that task to complete. This is
>> not a solution to the problem, but just hides it.
> Actually, I think we both probably agree that there shouldn't be long
> busy waits in the context of the current process. After all, we both
> agree what the AMD DC driver code is doing is wrong.
>
> To be clear, my take is, if driver code is running in process context
> and needs to wait for periods of time on the order of or in excess of
> a typical process time slice it should be sleeping during the waiting.
> If the operation is at a point where it can be cancelled without side
> effects, the sleeping should be INTERRUPTIBLE. If it's past the point
> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my
> opinion, should kernel code busy block a typical process for dozens of
> milliseconds while keeping the process RUNNING. I don't think this is
> a controversial take.

Exactly that's what I completely disagree on.

When the driver is burning CPU cycles on behalf of a process then those CPU cycles should be accounted to the process causing this.

That the driver should probably improve its behavior is a different issue.

> Actually, I think (maybe?) you might even agree with that, but you're
> also saying: user space processes aren't special here. While it's not
> okay to busy block them, it's also not okay to busy block on the
> system unbound workqueue either. If that's your sentiment, I don't
> disagree with it.

No, it's absolutely ok to busy block them; it's just not nice to do so.
As Daniel pointed out this behavior is not incorrect at all. The DRM subsystem doesn't make any guarantee that drmModeAtomicCommit() will not burn CPU cycles. > > So I think we both agree the busy waiting is a problem, but maybe we > disagree on the best place for the problem to manifest when it > happens. > > One thought re the DC code is regardless of where the code is running, > the scheduler is going to forcefully preempt it at some point right? > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > much bigger than 1us delay by the kernel if the loop goes on long > enough. I'm not wrong about that? if that's true, the code might as > well switch out the udelay(1) for a usleep(1) and call it a day (well > modulo the fact I think it can be called from an interrupt handler; at > least "git grep" says there's a link_set_dpms_off in > link_dp_irq_handler.c) > >> Stuff like that is not a valid justification for the change. Ville >> changes on the other hand tried to prevent lock contention which is a >> valid goal here. > Okay so let's land his patchset! (assuming it's ready to go in). > Ville, is that something you'd want to resend for review? Well, while Ville patch has at least some justification I would still strongly object to move the work into a background thread to prevent userspace from being accounted for the work it causes. Regards, Christian. > > Note, a point that I don't think has been brought up yet, too, is the > system unbound workqueue doesn't run with real time priority. Given > the lion's share of mutter's drmModeAtomicCommit calls are nonblock, > and so are using the system unbound workqueue for handling the > commits, even without this patch, that somewhat diminishes the utility > of using a realtime thread anyway. I believe the original point of the > realtime thread was to make sure mouse cursor motion updates are as > responsive as possible, because any latency there is something the > user really feels. 
> Maybe there needs to be a different mechanism in
> place to make sure user perceived interactivity is given priority.
>
> --Ray
On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: > Am 06.10.23 um 20:48 schrieb Ray Strode: > > Hi, > > > > On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > >> When the operation busy waits then that *should* get accounted to the > >> CPU time of the current process. When the operation sleeps and waits for > >> some interrupt for example it should not get accounted. > >> What you suggest is to put the parts of the operation which busy wait > >> into a background task and then sleep for that task to complete. This is > >> not a solution to the problem, but just hides it. > > Actually, I think we both probably agree that there shouldn't be long > > busy waits in the context of the current process. After all, we both > > agree what the AMD DC driver code is doing is wrong. > > > > To be clear, my take is, if driver code is running in process context > > and needs to wait for periods of time on the order of or in excess of > > a typical process time slice it should be sleeping during the waiting. > > If the operation is at a point where it can be cancelled without side > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > opinion, should kernel code busy block a typical process for dozens of > > milliseconds while keeping the process RUNNING. I don't think this is > > a controversial take. > > Exactly that's what I completely disagree on. > > When the driver is burning CPU cycles on behalves of a process then > those CPU cycles should be accounted to the process causing this. > > That the driver should probably improve it's behavior is a different issue. > > > Actually, I think (maybe?) you might even agree with that, but you're > > also saying: user space processes aren't special here. While it's not > > okay to busy block them, it's also not okay to busy block on the > > system unbound workqueue either. 
If that's your sentiment, I don't > > disagree with it. > > No, it's absolutely ok to busy block them it's just not nice to do so. > > As Daniel pointed out this behavior is not incorrect at all. The DRM > subsystem doesn't make any guarantee that drmModeAtomicCommit() will not > burn CPU cycles. > > > > > So I think we both agree the busy waiting is a problem, but maybe we > > disagree on the best place for the problem to manifest when it > > happens. > > > > One thought re the DC code is regardless of where the code is running, > > the scheduler is going to forcefully preempt it at some point right? > > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > > much bigger than 1us delay by the kernel if the loop goes on long > > enough. I'm not wrong about that? if that's true, the code might as > > well switch out the udelay(1) for a usleep(1) and call it a day (well > > modulo the fact I think it can be called from an interrupt handler; at > > least "git grep" says there's a link_set_dpms_off in > > link_dp_irq_handler.c) > > > >> Stuff like that is not a valid justification for the change. Ville > >> changes on the other hand tried to prevent lock contention which is a > >> valid goal here. > > Okay so let's land his patchset! (assuming it's ready to go in). > > Ville, is that something you'd want to resend for review? > > Well, while Ville patch has at least some justification I would still > strongly object to move the work into a background thread to prevent > userspace from being accounted for the work it causes. Aren't most wayland compositors using nonblocking commits anyway? If so they would already be bypassing proper CPU time accounting. Not saying we shouldn't try to fix that, but just pointing out that it already is an issue with nonblocking commits.
Am 09.10.23 um 14:19 schrieb Ville Syrjälä: > On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: >> Am 06.10.23 um 20:48 schrieb Ray Strode: >>> Hi, >>> >>> On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: >>>> When the operation busy waits then that *should* get accounted to the >>>> CPU time of the current process. When the operation sleeps and waits for >>>> some interrupt for example it should not get accounted. >>>> What you suggest is to put the parts of the operation which busy wait >>>> into a background task and then sleep for that task to complete. This is >>>> not a solution to the problem, but just hides it. >>> Actually, I think we both probably agree that there shouldn't be long >>> busy waits in the context of the current process. After all, we both >>> agree what the AMD DC driver code is doing is wrong. >>> >>> To be clear, my take is, if driver code is running in process context >>> and needs to wait for periods of time on the order of or in excess of >>> a typical process time slice it should be sleeping during the waiting. >>> If the operation is at a point where it can be cancelled without side >>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point >>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my >>> opinion, should kernel code busy block a typical process for dozens of >>> milliseconds while keeping the process RUNNING. I don't think this is >>> a controversial take. >> Exactly that's what I completely disagree on. >> >> When the driver is burning CPU cycles on behalves of a process then >> those CPU cycles should be accounted to the process causing this. >> >> That the driver should probably improve it's behavior is a different issue. >> >>> Actually, I think (maybe?) you might even agree with that, but you're >>> also saying: user space processes aren't special here. 
While it's not >>> okay to busy block them, it's also not okay to busy block on the >>> system unbound workqueue either. If that's your sentiment, I don't >>> disagree with it. >> No, it's absolutely ok to busy block them it's just not nice to do so. >> >> As Daniel pointed out this behavior is not incorrect at all. The DRM >> subsystem doesn't make any guarantee that drmModeAtomicCommit() will not >> burn CPU cycles. >> >>> So I think we both agree the busy waiting is a problem, but maybe we >>> disagree on the best place for the problem to manifest when it >>> happens. >>> >>> One thought re the DC code is regardless of where the code is running, >>> the scheduler is going to forcefully preempt it at some point right? >>> Any writereg/udelay(1)/readreg loop is going to get disrupted by a >>> much bigger than 1us delay by the kernel if the loop goes on long >>> enough. I'm not wrong about that? if that's true, the code might as >>> well switch out the udelay(1) for a usleep(1) and call it a day (well >>> modulo the fact I think it can be called from an interrupt handler; at >>> least "git grep" says there's a link_set_dpms_off in >>> link_dp_irq_handler.c) >>> >>>> Stuff like that is not a valid justification for the change. Ville >>>> changes on the other hand tried to prevent lock contention which is a >>>> valid goal here. >>> Okay so let's land his patchset! (assuming it's ready to go in). >>> Ville, is that something you'd want to resend for review? >> Well, while Ville patch has at least some justification I would still >> strongly object to move the work into a background thread to prevent >> userspace from being accounted for the work it causes. > Aren't most wayland compositors using nonblocking commits anyway? > If so they would already be bypassing proper CPU time accounting. > Not saying we shouldn't try to fix that, but just pointing out that > it already is an issue with nonblocking commits. 
That's a rather good argument, but for async operations background work is simply a necessity, because you otherwise can't implement them.

The key point here is that the patch puts the work into the background just to avoid having it accounted to the thread issuing it, and that in turn is not valid as far as I can see.

Regards,
Christian.
On Thu, Oct 05, 2023 at 01:16:27PM +0300, Ville Syrjälä wrote: > On Thu, Oct 05, 2023 at 11:57:41AM +0200, Daniel Vetter wrote: > > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > > > From: Ray Strode <rstrode@redhat.com> > > > > > > A drm atomic commit can be quite slow on some hardware. It can lead > > > to a lengthy queue of commands that need to get processed and waited > > > on before control can go back to user space. > > > > > > If user space is a real-time thread, that delay can have severe > > > consequences, leading to the process getting killed for exceeding > > > rlimits. > > > > > > This commit addresses the problem by always running the slow part of > > > a commit on a workqueue, separated from the task initiating the > > > commit. > > > > > > This change makes the nonblocking and blocking paths work in the same way, > > > and as a result allows the task to sleep and not use up its > > > RLIMIT_RTTIME allocation. > > > > > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > > > Signed-off-by: Ray Strode <rstrode@redhat.com> > > > > So imo the trouble with this is that we suddenly start to make > > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > > change, because even limited to the case of !ALLOW_MODESET we do best > > effort guarantees at best. And some drivers (again amd's dc) spend a ton > > of cpu time recomputing state even for pure plane changes without any crtc > > changes like dpms on/off (at least I remember some bug reports about > > that). And that state recomputation has to happen synchronously, because > > it always influences the ioctl errno return value. > > > > My take is that you're papering over a performance problem here of the > > "the driver is too slow/wastes too much cpu time". We should fix the > > driver, if that's possible. > > > > Another option would be if userspace drops realtime priorities for these > > known-slow operations. 
And right now _all_ kms operations are potentially > > cpu and real-time wasters, the entire uapi is best effort. > > > > We can also try to change the atomic uapi to give some hard real-time > > guarantees so that running compositors as SCHED_RT is possible, but that > > - means a very serious stream of bugs to fix all over > > - therefore needs some very wide buy-in from drivers that they're willing > > to make this guarantee > > - probably needs some really carefully carved out limitations, because > > there's imo flat-out no way we'll make all atomic ioctl hard time limit > > bound > > > > Also, as König has pointed out, you can roll this duct-tape out in > > userspace by making the commit non-blocking and immediately waiting for > > the fences. > > > > One thing I didn't see mention is that there's a very subtle uapi > > difference between non-blocking and blocking: > > - non-blocking is not allowed to get ahead of the previous commit, and > > will return EBUSY in that case. See the comment in > > drm_atomic_helper_commit() > > - blocking otoh will just block until any previous pending commit has > > finished > > > > Not taking that into account in your patch here breaks uapi because > > userspace will suddenly get EBUSY when they don't expect that. > > The -EBUSY logic already checks whether the current commit is > non-blocking vs. blocking commit, so I don't see how there would > be any change in behaviour from simply stuffing the commit_tail > onto a workqueue, especially as the locks will be still held across > the flush. Hm right, I forgot the patch context when I was chasing the EBUSY logic, I thought it just pushed a nonblocking commit in somehow. > In my earlier series [1] where I move the flush to happen after dropping > the locks there is a far more subtle issue because currently even > non-blocking commits can actually block due to the mutex. Changing > that might break something, so I preserved that behaviour explicitly. 
> Full explanation in the first patch there. > > [1] https://patchwork.freedesktop.org/series/108668/ Yeah there's a can of tricky details here for sure ... -Sima
On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > Am 09.10.23 um 14:19 schrieb Ville Syrjälä: > > On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: > > > Am 06.10.23 um 20:48 schrieb Ray Strode: > > > > Hi, > > > > > > > > On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > > > > > When the operation busy waits then that *should* get accounted to the > > > > > CPU time of the current process. When the operation sleeps and waits for > > > > > some interrupt for example it should not get accounted. > > > > > What you suggest is to put the parts of the operation which busy wait > > > > > into a background task and then sleep for that task to complete. This is > > > > > not a solution to the problem, but just hides it. > > > > Actually, I think we both probably agree that there shouldn't be long > > > > busy waits in the context of the current process. After all, we both > > > > agree what the AMD DC driver code is doing is wrong. > > > > > > > > To be clear, my take is, if driver code is running in process context > > > > and needs to wait for periods of time on the order of or in excess of > > > > a typical process time slice it should be sleeping during the waiting. > > > > If the operation is at a point where it can be cancelled without side > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > opinion, should kernel code busy block a typical process for dozens of > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > a controversial take. > > > Exactly that's what I completely disagree on. > > > > > > When the driver is burning CPU cycles on behalf of a process then > > > those CPU cycles should be accounted to the process causing this. > > > > > > That the driver should probably improve its behavior is a different issue. > > > > > > > Actually, I think (maybe?) 
you might even agree with that, but you're > > > > also saying: user space processes aren't special here. While it's not > > > > okay to busy block them, it's also not okay to busy block on the > > > > system unbound workqueue either. If that's your sentiment, I don't > > > > disagree with it. > > > No, it's absolutely ok to busy block them it's just not nice to do so. > > > > > > As Daniel pointed out this behavior is not incorrect at all. The DRM > > > subsystem doesn't make any guarantee that drmModeAtomicCommit() will not > > > burn CPU cycles. > > > > > > > So I think we both agree the busy waiting is a problem, but maybe we > > > > disagree on the best place for the problem to manifest when it > > > > happens. > > > > > > > > One thought re the DC code is regardless of where the code is running, > > > > the scheduler is going to forcefully preempt it at some point right? > > > > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > > > > much bigger than 1us delay by the kernel if the loop goes on long > > > > enough. I'm not wrong about that? if that's true, the code might as > > > > well switch out the udelay(1) for a usleep(1) and call it a day (well > > > > modulo the fact I think it can be called from an interrupt handler; at > > > > least "git grep" says there's a link_set_dpms_off in > > > > link_dp_irq_handler.c) > > > > > > > > > Stuff like that is not a valid justification for the change. Ville > > > > > changes on the other hand tried to prevent lock contention which is a > > > > > valid goal here. > > > > Okay so let's land his patchset! (assuming it's ready to go in). > > > > Ville, is that something you'd want to resend for review? > > > Well, while Ville patch has at least some justification I would still > > > strongly object to move the work into a background thread to prevent > > > userspace from being accounted for the work it causes. > > Aren't most wayland compositors using nonblocking commits anyway? 
> > If so they would already be bypassing proper CPU time accounting. > > Not saying we shouldn't try to fix that, but just pointing out that > > it already is an issue with nonblocking commits. > > That's a rather good argument, but for async operations background work is > simply a necessity because you otherwise can't implement them. Yeah I don't think we can use "we need to properly account" stuff to reject this, because we don't. Also we already do fail with inheriting the priority properly, so another nail in the "kms is best effort". > The key point here is that the patch puts the work into the background just > to avoid that it is accounted to the thread issuing it, and that in turn is > not valid as far as I can see. Yeah it's that aspect I'm really worried about, because we essentially start to support some guarantees that a) most drivers can't uphold without a huge amount of work, some of the DC state recomputations are _really_ expensive b) without actually making the semantics clear, it's just duct-tape. Yes compositors want to run kms in real-time, and yes that results in fun if you try to strictly account for cpu time spent. Especially if your policy is to just nuke the real time thread instead of demoting it to SCHED_NORMAL for a time. I think if we want more than hacks here we need to answer two questions: - which parts of the kms api are real time - what exactly do we guarantee with that And that answer probably needs to include things like the real time thread workers for cursor and other nonblocking commits. -Sima
Hi, On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > > > > To be clear, my take is, if driver code is running in process context > > > > and needs to wait for periods of time on the order of or in excess of > > > > a typical process time slice it should be sleeping during the waiting. > > > > If the operation is at a point where it can be cancelled without side > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > opinion, should kernel code busy block a typical process for dozens of > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > a controversial take. > > > Exactly that's what I completely disagree on. Okay if we can't agree that it's not okay for user space (or the kernel running in the context of user space) to busy loop a cpu core at 100% utilization throughout and beyond the process's entire scheduled time slice then we really are at an impasse. I gotta say i'm astonished that this seemingly indefensible behavior is somehow a point of contention, but I'm not going to keep arguing about it beyond this email. I mean we're not talking about scientific computing, or code compilation, or seti@home. We're talking about nearly the equivalent of `while (1) __asm__ ("nop");` > > The key point here is that the patch puts the work into the background just > > to avoid that it is accounted to the thread issuing it, and that in turn is > > not valid as far as I can see. > > Yeah it's that aspect I'm really worried about, because we essentially > start to support some gurantees that a) most drivers can't uphold without > a huge amount of work, some of the DC state recomputations are _really_ > expensive b) without actually making the semantics clear, it's just > duct-tape. 
If DC plane state computation (or whatever) is really taking 50ms or 200ms, then it probably should be done in chunks so it doesn't get preempted at an inopportune point? Look, this is not my wheelhouse, this is your wheelhouse, and I don't want to keep debating forever. It seems there is a discrepancy between our understandings of implied acceptable behavior. > Yes compositors want to run kms in real-time, and yes that results in fun > if you try to strictly account for cpu time spent. Especially if your > policy is to just nuke the real time thread instead of demoting it to > SCHED_NORMAL for a time. So I ended up going with this suggestion for blocking modesets: https://gitlab.gnome.org/GNOME/mutter/-/commit/5d3e31a49968fc0da04e98c0f9d624ea5095c9e0 But *this* feels like duct tape: You've already said there's no guarantee the problem won't also happen during preliminary computation during non-blocking commits or via other drm entry points. So it really does seem like a fix that won't age well. I won't be surprised if in ~3 years (or whatever) in some RHEL release there's a customer bug leading to the real-time thread getting blocklisted for obscure server display hardware because it's causing the session to tank on a production machine. > I think if we want more than hacks here we need to answer two questions: > - which parts of the kms api are real time > - what exactly do we guarantee with that imo, this isn't just about real-time versus non-real-time. It's no more acceptable for non-real-time mutter to be using 100% CPU doing busywaits than it is for real-time mutter to be using 100% cpu doing busywaits. Also, both you and Christian have suggested using the non-blocking modeset api with a fence fd to poll on is equivalent to the blocking api flushing the commit_tail work before returning from the ioctl, but that's not actually true. 
I think we all now agree the EBUSY problem you mentioned as an issue with my proposed patch wasn't actually a problem for blocking commits, but that very same issue is a problem with the non-blocking commits that then block on a fence fd, right? I guess we'd need to block on a fence fd from the prior non-blocking commit first before starting the blocking commit (or something) --Ray
On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: > Hi, > > On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > > > > > To be clear, my take is, if driver code is running in process context > > > > > and needs to wait for periods of time on the order of or in excess of > > > > > a typical process time slice it should be sleeping during the waiting. > > > > > If the operation is at a point where it can be cancelled without side > > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > > opinion, should kernel code busy block a typical process for dozens of > > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > > a controversial take. > > > > Exactly that's what I completely disagree on. > > Okay if we can't agree that it's not okay for user space (or the > kernel running in the context of user space) to busy loop a cpu core > at 100% utilization throughout and beyond the process's entire > scheduled time slice then we really are at an impasse. I gotta say i'm > astonished that this seemingly indefensible behavior is somehow a > point of contention, but I'm not going to keep arguing about it beyond > this email. > > I mean we're not talking about scientific computing, or code > compilation, or seti@home. We're talking about nearly the equivalent > of `while (1) __asm__ ("nop");` I don't think anyone said this shouldn't be fixed or improved. What I'm saying is that the atomic ioctl is not going to make guarantees that it will not take up too much cpu time (for some extremely vague value of "too much") to the point that userspace can configure its compositor in a way that it _will_ get killed if we _ever_ violate this rule. We should of course try to do as good a job as possible, but that's not what you're asking for. 
You're asking for a hard real-time guarantee with the implication if we ever break it, it's a regression, and the kernel has to bend over backwards with tricks like in your patch to make it work. It's that hard real time guarantee of "the kernel will _never_ violate this cpu time bound" that people are objecting against. Fixing the worst offenders I don't think will see any pushback at all. > > > The key point here is that the patch puts the work into the background just > > > to avoid that it is accounted to the thread issuing it, and that in turn is > > > not valid as far as I can see. > > > > Yeah it's that aspect I'm really worried about, because we essentially > > start to support some gurantees that a) most drivers can't uphold without > > a huge amount of work, some of the DC state recomputations are _really_ > > expensive b) without actually making the semantics clear, it's just > > duct-tape. > > If DC plane state computation (or whatever) is really taking 50ms or > 200ms, then it probably should be done in chunks so it doesn't get > preempted at an inopportune point? Look, this is not my wheelhouse, > this is your wheelhouse, and I don't want to keep debating forever. It > seems there is a discrepancy between our understandings of implied > acceptable behavior. > > > Yes compositors want to run kms in real-time, and yes that results in fun > > if you try to strictly account for cpu time spent. Especially if your > > policy is to just nuke the real time thread instead of demoting it to > > SCHED_NORMAL for a time. > > So I ended up going with this suggestion for blocking modesets: > > https://gitlab.gnome.org/GNOME/mutter/-/commit/5d3e31a49968fc0da04e98c0f9d624ea5095c9e0 > > But *this* feels like duct tape: You've already said there's no > guarantee the problem won't also happen during preliminary computation > during non-blocking commits or via other drm entry points. So it > really does seem like a fix that won't age well. 
I won't be surprised > if in ~3 years (or whatever) in some RHEL release there's a customer > bug leading to the real-time thread getting blocklisted for obscure > server display hardware because it's causing the session to tank on a > production machine. Yes the atomic ioctl makes no guarantees across drivers and hw platforms of guaranteed "we will never violate, you can kill your compositor if you do" cpu bound limits. We'll try to not suck too badly, and we'll try to focus on fixing the suck where it upsets the people the most. But there is fundamentally no hard real time guarantee in atomic. At least not with the current uapi. > > I think if we want more than hacks here we need to answer two questions: > > - which parts of the kms api are real time > > - what exactly do we guarantee with that > > imo, this isn't just about real-time versus non-real-time. It's no > more acceptable for non-real-time mutter to be using 100% CPU doing > busywaits than it is for real-time mutter to be using 100% cpu doing > busywaits. The problem isn't about wasting cpu time. It's about you having set up the system in a way so that mutter gets killed if we ever waste too much cpu time, ever. Which moves this from a soft real time "we'll try really hard to not suck" to a hard real time "there will be obvious functional regression reports if we even violate this once" guarantee. The former is absolutely fine, and within the practical limits (writing kms drivers is hard and takes a lot of time), we'll do the best we can. The latter is flat out not a guarantee the kernel currently makes, and I see no practically feasible way to make that happen. And pretending we do make this guarantee will just result in frustrated users filing bugs that their desktop session died and developers closing them as "too hard to fix". 
> Also, both you and Christian have suggested using the non-blocking > modeset api with a fence fd to poll on is equivalent to the blocking > api flushing the commit_tail work before returning from the ioctl, but > that's not actually true. I think we all now agree the EBUSY problem > you mentioned as an issue with my proposed patch wasn't actually a > problem for blocking commits, but that very same issue is a problem > with the non-blocking commits that then block on a fence fd, right? I > guess we'd need to block on a fence fd from the prior non-blocking > commit first before starting the blocking commit (or something) Yeah there's a bunch of issues around the current nonblocking modeset uapi semantics that make it not so useful. There was a long irc discussion last few days about all the various aspects, it's quite tricky. Maybe as a bit more useful outcome of this entire discussion: Could you sign up to write a documentation patch for the atomic ioctl that adds a paragraph stating extremely clearly that this ioctl does not make hard real time guarantees, but only best effort soft realtime, and even then priority of the effort is focused entirely on the !ALLOW_MODESET case, because that is the one users care about the most. Cheers, Sima
On 10/13/23 11:41, Daniel Vetter wrote: > On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: >> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: >>>>>> To be clear, my take is, if driver code is running in process context >>>>>> and needs to wait for periods of time on the order of or in excess of >>>>>> a typical process time slice it should be sleeping during the waiting. >>>>>> If the operation is at a point where it can be cancelled without side >>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point >>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my >>>>>> opinion, should kernel code busy block a typical process for dozens of >>>>>> milliseconds while keeping the process RUNNING. I don't think this is >>>>>> a controversial take. >>>>> Exactly that's what I completely disagree on. >> >> Okay if we can't agree that it's not okay for user space (or the >> kernel running in the context of user space) to busy loop a cpu core >> at 100% utilization throughout and beyond the process's entire >> scheduled time slice then we really are at an impasse. I gotta say i'm >> astonished that this seemingly indefensible behavior is somehow a >> point of contention, but I'm not going to keep arguing about it beyond >> this email. >> >> I mean we're not talking about scientific computing, or code >> compilation, or seti@home. We're talking about nearly the equivalent >> of `while (1) __asm__ ("nop");` > > I don't think anyone said this shouldn't be fixed or improved. > > What I'm saying is that the atomic ioctl is not going to make guarantees > that it will not take up to much cpu time (for some extremely vague value > of "too much") to the point that userspace can configure it's compositor > in a way that it _will_ get killed if we _ever_ violate this rule. > > We should of course try to do as good as job as possible, but that's not > what you're asking for. 
You're asking for a hard real-time guarantee with > the implication if we ever break it, it's a regression, and the kernel has > to bend over backwards with tricks like in your patch to make it work. I don't think mutter really needs or wants such a hard real-time guarantee. What it needs is a fighting chance to react before the kernel kills its process. The intended mechanism for this is SIGXCPU, but that can't work if the kernel is stuck in a busy-loop. Ray's patch seems like one way to avoid that. That said, as long as SIGXCPU can work as intended with the non-blocking commits mutter uses for everything except modesets, mutter's workaround of dropping RT priority for the blocking commits seems acceptable for the time being.
Hi On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > I mean we're not talking about scientific computing, or code > > compilation, or seti@home. We're talking about nearly the equivalent > > of `while (1) __asm__ ("nop");` > > I don't think anyone said this shouldn't be fixed or improved. Yea but we apparently disagree that it would be an improvement to stop discrediting user space for driver problems. You see, to me, there are two problems 1) The behavior itself is not nice and should be fixed 2) The blame/accounting/attribution for a driver problem is getting directed to user space. I think both issues should be fixed. One brought the other issue to light, but both problems, in my mind, are legitimate and should be addressed. You think fixing the second problem is some tactic to paper over the first problem. Christian thinks the second problem isn't a problem at all but the correct design. So none of us are completely aligned and I don't see anyone changing their mind anytime soon. > What I'm saying is that the atomic ioctl is not going to make guarantees > that it will not take up to much cpu time (for some extremely vague value > of "too much") to the point that userspace can configure it's compositor > in a way that it _will_ get killed if we _ever_ violate this rule. yea I find that strange, all kernel tasks have a certain implied baseline responsibility to be good citizens to the system. And how user space is configured seems nearly immaterial. But we're talking in circles. > Fixing the worst offenders I don't think will see any pushback at all. Yea we all agree fixing this one busy loop is a problem but we don't agree on where the cpu time blame should go. > > But *this* feels like duct tape: You've already said there's no > > guarantee the problem won't also happen during preliminary computation > > during non-blocking commits or via other drm entry points. So it > > really does seem like a fix that won't age well. 
I won't be surprised > > if in ~3 years (or whatever) in some RHEL release there's a customer > > bug leading to the real-time thread getting blocklisted for obscure > > server display hardware because it's causing the session to tank on a > > production machine. > > Yes the atomic ioctl makes no guarantees across drivers and hw platforms > of guaranteed "we will never violate, you can kill your compositor if you > do" cpu bound limits. We'll try to not suck too badly, and we'll try to > focus on fixing the suck where it upsets the people the most. > > But there is fundamentally no hard real time guarantee in atomic. At least > not with the current uapi. So in your mind mutter should get out of the real-time thread business entirely? > The problem isn't about wasting cpu time. It's about you having set up the > system in a way so that mutter gets killed if we ever waste too much cpu > time, ever. mutter is not set up in a way to kill itself if the driver wastes too much cpu time, ever. mutter is set up in a way to kill itself if it, itself, wastes too much cpu time, ever. The fact that the kernel is killing it when a kernel driver is wasting cpu time is the bug that led to the patch in the first place! > The latter is flat out not a guarantee the kernel currently makes, and I > see no practical feasible way to make that happen. And pretending we do > make this guarantee will just result in frustrated users filling bugs that > they desktop session died and developers closing them as "too hard to > fix". I think all three of us agree busy loops are not nice (though maybe we aren't completely aligned on severity). And I'll certainly concede that fixing all the busy loops can be hard. Some of the cpu-bound code paths may even be doing legitimate computation. I still assert that if the uapi calls into driver code that might potentially be doing something slow where it can't sleep, it should be doing it on a workqueue or thread. 
That seems orthogonal to fixing all the places where the drivers aren't acting nicely. > Maybe as a bit more useful outcome of this entire discussion: Could you > sign up to write a documentation patch for the atomic ioctl that adds a > paragraph stating extremely clearly that this ioctl does not make hard > real time guarantees, but only best effort soft realtime, and even then > priority of the effort is focused entirely on the !ALLOW_MODESET case, > because that is the one users care about the most. I don't think I'm the best positioned to write such documentation. I still think what the kernel is doing is wrong here and I don't even fully grok what you mean by best effort. --Ray
On Fri, Oct 13, 2023 at 12:22:52PM +0200, Michel Dänzer wrote: > On 10/13/23 11:41, Daniel Vetter wrote: > > On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: > >> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > >>>>>> To be clear, my take is, if driver code is running in process context > >>>>>> and needs to wait for periods of time on the order of or in excess of > >>>>>> a typical process time slice it should be sleeping during the waiting. > >>>>>> If the operation is at a point where it can be cancelled without side > >>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point > >>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > >>>>>> opinion, should kernel code busy block a typical process for dozens of > >>>>>> milliseconds while keeping the process RUNNING. I don't think this is > >>>>>> a controversial take. > >>>>> Exactly that's what I completely disagree on. > >> > >> Okay if we can't agree that it's not okay for user space (or the > >> kernel running in the context of user space) to busy loop a cpu core > >> at 100% utilization throughout and beyond the process's entire > >> scheduled time slice then we really are at an impasse. I gotta say i'm > >> astonished that this seemingly indefensible behavior is somehow a > >> point of contention, but I'm not going to keep arguing about it beyond > >> this email. > >> > >> I mean we're not talking about scientific computing, or code > >> compilation, or seti@home. We're talking about nearly the equivalent > >> of `while (1) __asm__ ("nop");` > > > > I don't think anyone said this shouldn't be fixed or improved. > > > > What I'm saying is that the atomic ioctl is not going to make guarantees > > that it will not take up to much cpu time (for some extremely vague value > > of "too much") to the point that userspace can configure it's compositor > > in a way that it _will_ get killed if we _ever_ violate this rule. 
> > > > We should of course try to do as good as job as possible, but that's not > > what you're asking for. You're asking for a hard real-time guarantee with > > the implication if we ever break it, it's a regression, and the kernel has > > to bend over backwards with tricks like in your patch to make it work. > > I don't think mutter really needs or wants such a hard real-time > guarantee. What it needs is a fighting chance to react before the kernel > kills its process. > > The intended mechanism for this is SIGXCPU, but that can't work if the > kernel is stuck in a busy-loop. Ray's patch seems like one way to avoid > that. I don't think signals will get us out of this either, or at least there is still risk. We are trying to make everything interruptible and bail out asap, but those checks are when we're blocking, not when the cpu is busy. So this also won't guarantee that you don't blow through your timeslice when a driver is doing a silly expensive atomic_check computation. Much less likely, but definitely not a zero chance. > That said, as long as SIGXCPU can work as intended with the non-blocking > commits mutter uses for everything except modesets, mutter's workaround > of dropping RT priority for the blocking commits seems acceptable for > the time being. Is there no rt setup where the kernel just auto-demotes when you've used up your slice? That's the only setup I see that guarantees you're not getting killed here. I think dropping rt priority around full modesets is still good since modesets really shouldn't ever run as rt, that makes no sense to me. -Sima
On Fri, Oct 13, 2023 at 10:04:02AM -0400, Ray Strode wrote: > Hi > > On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > > I mean we're not talking about scientific computing, or code > > > compilation, or seti@home. We're talking about nearly the equivalent > > > of `while (1) __asm__ ("nop");` > > > > I don't think anyone said this shouldn't be fixed or improved. > > Yea but we apparently disagree that it would be an improvement to stop > discrediting user space for driver problems. > You see, to me, there are two problems 1) The behavior itself is not > nice and should be fixed 2) The blame/accounting/attribution for a > driver problem is getting directed to user space. I think both issues > should be fixed. One brought the other issue to light, but both > problems, in my mind, are legitimate and should be addressed. You > think fixing the second problem is some tactic to paper over the first > problem. Christian thinks the second problem isn't a problem at all > but the correct design. So none of us are completely aligned and I > don't see anyone changing their mind anytime soon. > > > What I'm saying is that the atomic ioctl is not going to make guarantees > > that it will not take up to much cpu time (for some extremely vague value > > of "too much") to the point that userspace can configure it's compositor > > in a way that it _will_ get killed if we _ever_ violate this rule. > yea I find that strange, all kernel tasks have a certain implied > baseline responsibility to be good citizens to the system. > And how user space is configured seems nearly immaterial. > > But we're talking in circles. > > > Fixing the worst offenders I don't think will see any pushback at all. > Yea we all agree fixing this one busy loop is a problem but we don't > agree on where the cpu time blame should go. 
> > > > But *this* feels like duct tape: You've already said there's no > > > guarantee the problem won't also happen during preliminary computation > > > during non-blocking commits or via other drm entry points. So it > > > really does seem like a fix that won't age well. I won't be surprised > > > if in ~3 years (or whatever) in some RHEL release there's a customer > > > bug leading to the real-time thread getting blocklisted for obscure > > > server display hardware because it's causing the session to tank on a > > > production machine. > > > > Yes the atomic ioctl makes no guarantees across drivers and hw platforms > > of guaranteed "we will never violate, you can kill your compositor if you > > do" cpu bound limits. We'll try to not suck too badly, and we'll try to > > focus on fixing the suck where it upsets the people the most. > > > > But there is fundamentally no hard real time guarantee in atomic. At least > > not with the current uapi. > > So in your mind mutter should get out of the real-time thread business entirely? Yes. At least the hard-real time "the kernel kills your process if you fail" business. Because desktop drawing just isn't hard real-time, nothing disastrous happens if we miss a deadline. Of course special setups where everything is very controlled might be different. > > The problem isn't about wasting cpu time. It's about you having set up the > > system in a way so that mutter gets killed if we ever waste too much cpu > > time, ever. > > mutter is not set up in a way to kill itself if the driver wastes too > much cpu time, > ever. mutter is set up in a way to kill itself if it, itself, wastes > too much cpu time, ever. > The fact that the kernel is killing it when a kernel driver is wasting > cpu time is the > bug that led to the patch in the first place! > > > The latter is flat out not a guarantee the kernel currently makes, and I > > see no practical feasible way to make that happen. 
And pretending we do > > make this guarantee will just result in frustrated users filling bugs that > > they desktop session died and developers closing them as "too hard to > > fix". > > I think all three of us agree busy loops are not nice (though maybe we > aren't completely aligned on severity). And I'll certainly concede > that fixing all the busy loops can be hard. Some of the cpu-bound code > paths may even be doing legitimate computation. I still assert that > if the uapi calls into driver code that might potentially be doing > something slow where it can't sleep, it should be doing it on a > workqueue or thread. That seems orthogonal to fixing all the places > where the drivers aren't acting nicely. Again no, because underlying this your requirement boils down to hard real time. And we just cannot guarantee that with atomic kms. Best effort, absolutely. Guaranteed to never fail, no way. > > Maybe as a bit more useful outcome of this entire discussion: Could you > > sign up to write a documentation patch for the atomic ioctl that adds a > > paragraph stating extremely clearly that this ioctl does not make hard > > real time guarantees, but only best effort soft realtime, and even then > > priority of the effort is focused entirely on the !ALLOW_MODESET case, > > because that is the one users care about the most. > > I don't think I'm the best positioned to write such documentation. I > still think what the kernel is doing is wrong here and I don't even > fully grok what you mean by best effort. It's the difference between hard real time and soft real time. atomic kms is a soft real time system, not hard real time. You've set up mutter in a hard real time way, and that just doesn't work. I guess I can type up some patch when I'm back from XDC, but if the difference between soft and hard real time isn't clear, then yeah you don't understand the point I've been trying to make in this thread. Cheers, Sima
On 17.10.23 at 09:32, Daniel Vetter wrote:
> On Fri, Oct 13, 2023 at 12:22:52PM +0200, Michel Dänzer wrote:
>> On 10/13/23 11:41, Daniel Vetter wrote:
>>> On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote:
>>>> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote:
>>>>>>>> To be clear, my take is, if driver code is running in process context
>>>>>>>> and needs to wait for periods of time on the order of or in excess of
>>>>>>>> a typical process time slice it should be sleeping during the waiting.
>>>>>>>> If the operation is at a point where it can be cancelled without side
>>>>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point
>>>>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my
>>>>>>>> opinion, should kernel code busy block a typical process for dozens of
>>>>>>>> milliseconds while keeping the process RUNNING. I don't think this is
>>>>>>>> a controversial take.
>>>>>>> Exactly that's what I completely disagree on.
>>>> Okay, if we can't agree that it's not okay for user space (or the
>>>> kernel running in the context of user space) to busy loop a cpu core
>>>> at 100% utilization throughout and beyond the process's entire
>>>> scheduled time slice, then we really are at an impasse. I gotta say I'm
>>>> astonished that this seemingly indefensible behavior is somehow a
>>>> point of contention, but I'm not going to keep arguing about it beyond
>>>> this email.
>>>>
>>>> I mean we're not talking about scientific computing, or code
>>>> compilation, or seti@home. We're talking about nearly the equivalent
>>>> of `while (1) __asm__ ("nop");`
>>> I don't think anyone said this shouldn't be fixed or improved.
>>>
>>> What I'm saying is that the atomic ioctl is not going to make guarantees
>>> that it will not take up too much cpu time (for some extremely vague value
>>> of "too much"), to the point that userspace can configure its compositor
>>> in a way that it _will_ get killed if we _ever_ violate this rule.
>>>
>>> We should of course try to do as good a job as possible, but that's not
>>> what you're asking for. You're asking for a hard real-time guarantee with
>>> the implication that if we ever break it, it's a regression, and the kernel
>>> has to bend over backwards with tricks like in your patch to make it work.
>> I don't think mutter really needs or wants such a hard real-time
>> guarantee. What it needs is a fighting chance to react before the kernel
>> kills its process.
>>
>> The intended mechanism for this is SIGXCPU, but that can't work if the
>> kernel is stuck in a busy loop. Ray's patch seems like one way to avoid
>> that.
> I don't think signals will get us out of this either, or at least it still
> has risk. We are trying to make everything interruptible and bail out
> asap, but those checks happen when we're blocking, not when the cpu is busy.
>
> So this also won't guarantee that you expire your timeslice when a driver
> is doing a silly expensive atomic_check computation. Much less likely, but
> definitely not a zero chance.
>
>> That said, as long as SIGXCPU can work as intended with the non-blocking
>> commits mutter uses for everything except modesets, mutter's workaround
>> of dropping RT priority for the blocking commits seems acceptable for
>> the time being.
> Is there no rt setup where the kernel just auto-demotes when you've used
> up your slice? That's the only setup I see that guarantees you're not
> getting killed here.
>
> I think dropping rt priority around full modesets is still good, since
> modesets really shouldn't ever run as rt; that makes no sense to me.

Completely agree.
One more data point not mentioned before:

While amdgpu could be improved, we do have devices which (for example)
have to do I2C by bit banging because they lack the necessary
functionality in the hardware. And IIRC transmitting the 256-byte EDID
takes something like ~5ms (fast mode) or ~20ms (standard mode), during
which the CPU usually just busy loops most of the time.

Not saying that we should do a full EDID transmit with every modeset,
but just to give an example of what might be necessary here.

Christian.

> -Sima
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 292e38eb6218..1a1e68d98d38 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 * This is the point of no return - everything below never fails except
 	 * when the hw goes bonghits. Which means we can commit the new state on
 	 * the software side now.
 	 */
 
 	ret = drm_atomic_helper_swap_state(state, true);
 	if (ret)
 		goto err;
 
 	/*
 	 * Everything below can be run asynchronously without the need to grab
 	 * any modeset locks at all under one condition: It must be guaranteed
 	 * that the asynchronous work has either been cancelled (if the driver
 	 * supports it, which at least requires that the framebuffers get
 	 * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
 	 * before the new state gets committed on the software side with
 	 * drm_atomic_helper_swap_state().
 	 *
 	 * This scheme allows new atomic state updates to be prepared and
 	 * checked in parallel to the asynchronous completion of the previous
 	 * update. Which is important since compositors need to figure out the
 	 * composition of the next frame right after having submitted the
 	 * current layout.
 	 *
 	 * NOTE: Commit work has multiple phases, first hardware commit, then
 	 * cleanup. We want them to overlap, hence need system_unbound_wq to
 	 * make sure work items don't artificially stall on each another.
 	 */
 
 	drm_atomic_state_get(state);
-	if (nonblock)
-		queue_work(system_unbound_wq, &state->commit_work);
-	else
-		commit_tail(state);
+	queue_work(system_unbound_wq, &state->commit_work);
+	if (!nonblock)
+		flush_work(&state->commit_work);
 
 	return 0;
 
 err:
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
 EXPORT_SYMBOL(drm_atomic_helper_commit);
 
 /**
  * DOC: implementing nonblocking commit
  *
  * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence
  * different operations against each another. Locks, especially struct
  * &drm_modeset_lock, should not be held in worker threads or any other
  * asynchronous context used to commit the hardware state.
  *
  * drm_atomic_helper_commit() implements the recommended sequence for
  * nonblocking commits, using drm_atomic_helper_setup_commit() internally:
  *
  * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we
  * need to propagate out of memory/VRAM errors to userspace, it must be called
  * synchronously.
  *
  * 2. Synchronize with any outstanding nonblocking commit worker threads which
  * might be affected by the new state update. This is handled by
  * drm_atomic_helper_setup_commit().
  *
  * Asynchronous workers need to have sufficient parallelism to be able to run
  * different atomic commits on different CRTCs in parallel. The simplest way to