Message ID: 20230926170549.2589045-1-halfline@gmail.com (mailing list archive)
State: New, archived
Series: drm/atomic: Perform blocking commits on workqueue
Am 26.09.23 um 19:05 schrieb Ray Strode:
> From: Ray Strode <rstrode@redhat.com>
>
> A drm atomic commit can be quite slow on some hardware. It can lead
> to a lengthy queue of commands that need to get processed and waited
> on before control can go back to user space.
>
> If user space is a real-time thread, that delay can have severe
> consequences, leading to the process getting killed for exceeding
> rlimits.
>
> This commit addresses the problem by always running the slow part of
> a commit on a workqueue, separated from the task initiating the
> commit.
>
> This change makes the nonblocking and blocking paths work in the same
> way, and as a result allows the task to sleep and not use up its
> RLIMIT_RTTIME allocation.

Well this patch made me laugh :)

I'm not an expert for that stuff, but as far as I know the whole purpose
of the blocking functionality is to make sure that the CPU overhead
caused by the commit is accounted to the right process.

So what you are suggesting here is to actually break that functionality,
and that doesn't seem to make sense.

When it's really not desirable to account the CPU overhead to the
process initiating it, then you probably rather want to use a
non-blocking commit plus a dma_fence to wait for the work to end from
userspace.

Regards,
Christian.

>
> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861
> Signed-off-by: Ray Strode <rstrode@redhat.com>
> ---
>  drivers/gpu/drm/drm_atomic_helper.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
> index 292e38eb6218..1a1e68d98d38 100644
> --- a/drivers/gpu/drm/drm_atomic_helper.c
> +++ b/drivers/gpu/drm/drm_atomic_helper.c
> @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev,
>  	 * This is the point of no return - everything below never fails except
>  	 * when the hw goes bonghits. Which means we can commit the new state on
>  	 * the software side now.
>  	 */
>
>  	ret = drm_atomic_helper_swap_state(state, true);
>  	if (ret)
>  		goto err;
>
>  	/*
>  	 * Everything below can be run asynchronously without the need to grab
>  	 * any modeset locks at all under one condition: It must be guaranteed
>  	 * that the asynchronous work has either been cancelled (if the driver
>  	 * supports it, which at least requires that the framebuffers get
>  	 * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
>  	 * before the new state gets committed on the software side with
>  	 * drm_atomic_helper_swap_state().
>  	 *
>  	 * This scheme allows new atomic state updates to be prepared and
>  	 * checked in parallel to the asynchronous completion of the previous
>  	 * update. Which is important since compositors need to figure out the
>  	 * composition of the next frame right after having submitted the
>  	 * current layout.
>  	 *
>  	 * NOTE: Commit work has multiple phases, first hardware commit, then
>  	 * cleanup. We want them to overlap, hence need system_unbound_wq to
>  	 * make sure work items don't artificially stall on each another.
>  	 */
>
>  	drm_atomic_state_get(state);
> -	if (nonblock)
> -		queue_work(system_unbound_wq, &state->commit_work);
> -	else
> -		commit_tail(state);
> +	queue_work(system_unbound_wq, &state->commit_work);
> +	if (!nonblock)
> +		flush_work(&state->commit_work);
>
>  	return 0;
>
>  err:
>  	drm_atomic_helper_cleanup_planes(dev, state);
>  	return ret;
>  }
>  EXPORT_SYMBOL(drm_atomic_helper_commit);
>
>  /**
>   * DOC: implementing nonblocking commit
>   *
>   * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence
>   * different operations against each another. Locks, especially struct
>   * &drm_modeset_lock, should not be held in worker threads or any other
>   * asynchronous context used to commit the hardware state.
>   *
>   * drm_atomic_helper_commit() implements the recommended sequence for
>   * nonblocking commits, using drm_atomic_helper_setup_commit() internally:
>   *
>   * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we
>   * need to propagate out of memory/VRAM errors to userspace, it must be called
>   * synchronously.
>   *
>   * 2. Synchronize with any outstanding nonblocking commit worker threads which
>   * might be affected by the new state update. This is handled by
>   * drm_atomic_helper_setup_commit().
>   *
>   * Asynchronous workers need to have sufficient parallelism to be able to run
>   * different atomic commits on different CRTCs in parallel. The simplest way to
Hi,

On Wed, Sep 27, 2023 at 4:05 AM Christian König
<christian.koenig@amd.com> wrote:
> I'm not an expert for that stuff, but as far as I know the whole purpose
> of the blocking functionality is to make sure that the CPU overhead
> caused by the commit is accounted to the right process.

I'm not an expert either, but that's clearly wrong.

The point of blocking functionality in drmAtomicCommit is for the
ioctl to block until the operation is completed, so userspace knows
that they can commit again after it returns without getting EBUSY, and
they can be sure, e.g., the mode is set or the displays are disabled
or whatever.

To say the "whole point" is about CPU overhead accounting sounds
rather absurd to me. Is that really what you meant?

You could try to make the argument that one of the points of the
blocking call is about CPU accounting (instead of the whole point),
maybe that's what you meant? That seems likely wrong to me too. I mean
just check the docs:

    RLIMIT_RTTIME (since Linux 2.6.25)
        This is a limit (in microseconds) on the amount of CPU time
        that a process scheduled under a real-time scheduling policy
        may consume without making a blocking system call. For the
        purpose of this limit, each time a process makes a blocking
        system call, the count of its consumed CPU time is reset to
        zero.

drmAtomicCommit() is making a blocking system call so the limit should
be reset, in my opinion.

Furthermore, a lot happens under the covers as part of drmAtomicCommit.

Are you telling me that in all the drivers handling drmAtomicCommit,
none of the work from those drivers gets deferred outside of process
context if DRM_MODE_ATOMIC_NONBLOCK isn't set? I find that very hard
to believe. But it would have to be true, if one of the main points of
the blocking call is about CPU accounting, right? You can't say the
point is about CPU accounting if some of the work is handled in a
helper thread or work queue or whatever.
╎❯ git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l
182

There seems to be a lot of non-process context execution going on in
the tree...

> So what you are suggesting here is to actually break that functionality
> and that doesn't seem to make sense.

What's the use case here that could break? Walk me through a
real-world, sensible example where a program depends on
drmAtomicCommit() blocking and continuing to eat into process cpu time
while it blocks. I just want to see where you are coming from. Maybe
I'm being unimaginative but I just don't see it. I do see actual
main-stream display servers breaking today because the current
behavior defies expectations.

> When it's really not desirable to account the CPU overhead to the
> process initiating it then you probably rather want to use an non
> blocking commit plus a dma_fence to wait for the work to end from userspace.

Well, first I don't think that's very convenient. You're talking about
a per-plane property, so there would need to be a separate file
descriptor allocated for every plane, right? and user-space would have
to block on all of them before proceeding? Also, are you sure that
works in all cases? The problematic case we're facing right now is
when all planes and all crtcs are getting disabled. I'm just skimming
over the code here, but I see this:

	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
	...
		if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {
	...
			e = create_vblank_event(crtc, arg->user_data);
	...
			crtc_state->event = e;
		}
	...
		if (fence_ptr) {
	...
			fence = drm_crtc_create_fence(crtc);
	...
			ret = setup_out_fence(&f[(*num_fences)++], fence);
	...
			crtc_state->event->base.fence = fence;
		}

Is the code really going to allocate a vblank_event when all the
crtc's are disabled? I guess it's possible, but that's
counterintuitive to me. I really don't know. You're confident a set of
fences will actually work?
Regardless, this seems like a roundabout way to address a problem that we could just ... fix. --Ray
Hi Ray, Am 27.09.23 um 22:25 schrieb Ray Strode: > Hi, > > On Wed, Sep 27, 2023 at 4:05 AM Christian König > <christian.koenig@amd.com> wrote: >> I'm not an expert for that stuff, but as far as I know the whole purpose >> of the blocking functionality is to make sure that the CPU overhead >> caused by the commit is accounted to the right process. > I'm not an expert either, but that's clearly wrong. > > The point of blocking functionality in drmAtomicCommit is for the > ioctl to block until the operation is completed, so userspace knows > that they can commit again after it returns without getting EBUSY, and > they can be sure, e.g., the mode is set or the displays are disabled > or whatever. > > To say the "whole point" is about CPU overhead accounting sounds > rather absurd to me. Is that really what you meant? Yes, absolutely. See the functionality you try to implement already exists. After making a non blocking commit userspace can still wait on the commit to finish by looking at the out fence. This just happens asynchronously so you can use (for example) select()/poll() etc on multiple events. > You could try to make the argument that one of the points of the > blocking call is about CPU accounting (instead of the whole point), > maybe that's what you meant? That seems likely wrong to me too. I mean > just check the docs: > > RLIMIT_RTTIME (since Linux 2.6.25) > This is a limit (in microseconds) on the amount of CPU > time that a process scheduled under a real‐time scheduling policy may > consume without making a blocking system call. For the purpose of > this limit, each time a process makes a blocking system > call, the count of its consumed CPU time is reset to zero. > drmAtomicCommit() is making a blocking system call so the limit should > be reset, in my opinion. Exactly that's the point why I'm rejecting this. As far as I can see this opinion does not match what RLIMIT_RTTIME is intended to do. 
A blocking system call in the sense of RLIMIT_RTTIME means something
which results in the process listening for external events, e.g. calling
select(), epoll() or read() on (for example) a network socket etc...

As far as I can see drmAtomicCommit() is *not* meant to be one of those;
similarly, yield() for example also doesn't reset the RLIMIT_RTTIME
counter.

> Furthermore, a lot happens under the covers as part of drmAtomicCommit.
>
> Are you telling me that in all the drivers handling drmAtomicCommit,
> none of the work from those drivers gets deferred outside of process
> context if DRM_MODE_ATOMIC_NONBLOCK isn't set? I find that very hard
> to believe. But it would have to be true, if one of the main points of
> the blocking call is about CPU accounting, right? You can't say the
> point is about CPU accounting if some of the work is handled in a
> helper thread or work queue or whatever.
>
> ╎❯ git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l
> 182
>
> There seems to be a lot of non-process context execution going on in the tree...
>
>> So what you are suggesting here is to actually break that functionality
>> and that doesn't seem to make sense.
> What's the use case here that could break?

Well you are breaking the RLIMIT_RTTIME functionality.

The purpose of that functionality is to allow debugging and monitoring
applications to make sure that they stay alive and can react to external
signals.

From the RLIMIT_RTTIME documentation: "The intended use of this limit
is to stop a runaway real-time process from locking up the system."

And when drmAtomicCommit() is triggering this then we either have a
problem with the application doing something it is not supposed to do
(like blocking for vblank while it should listen to network traffic) or
the driver is somehow buggy.
> Walk me through a
> real-world, sensible example where a program depends on
> drmAtomicCommit() blocking and continuing to eat into process cpu time
> while it blocks. I just want to see where you are coming from. Maybe
> I'm being unimaginative but I just don't see it. I do see actual
> main-stream display servers breaking today because the current
> behavior defies expectations.
>
>> When it's really not desirable to account the CPU overhead to the
>> process initiating it then you probably rather want to use an non
>> blocking commit plus a dma_fence to wait for the work to end from userspace.
> Well, first I don't think that's very convenient. You're talking about
> a per-plane property, so there would need to be a separate file
> descriptor allocated for every plane, right? and user-space would have
> to block on all of them before proceeding? Also, are you sure that
> works in all cases? The problematic case we're facing right now is
> when all planes and all crtcs are getting disabled. I'm just skimming
> over the code here, but I see this:
>
> 	for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
> 	...
> 		if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {
> 	...
> 			e = create_vblank_event(crtc, arg->user_data);
> 	...
> 			crtc_state->event = e;
> 		}
> 	...
> 		if (fence_ptr) {
> 	...
> 			fence = drm_crtc_create_fence(crtc);
> 	...
> 			ret = setup_out_fence(&f[(*num_fences)++], fence);
> 	...
> 			crtc_state->event->base.fence = fence;
> 		}
>
> Is the code really going to allocate a vblank_event when all the
> crtc's are disabled? I guess it's possible, but that's
> counterintuitive to me. I really don't know. You're confident a set of
> fences will actually work?

No, when you disable everything there is of course no fence allocated.

But then you also should never see anything waiting for too long to
actually be able to trigger the RLIMIT_RTTIME.
> Regardless, this seems like a roundabout way to address a problem that
> we could just ... fix.

Well, to make it clear: This is not a problem but intended behavior!

From what I know RLIMIT_RTTIME usually works in units of multiple
seconds. Milliseconds are possible as well, but when an application sets
a low millisecond timeout and then blocks on a vblank, which can easily
take 30ms to complete, that means you have a bug inside your
application. With this commit you are completely papering over that.

Regards,
Christian.

>
> --Ray
On 9/28/23 08:56, Christian König wrote: > Am 27.09.23 um 22:25 schrieb Ray Strode: >> On Wed, Sep 27, 2023 at 4:05 AM Christian König >> <christian.koenig@amd.com> wrote: > >>> When it's really not desirable to account the CPU overhead to the >>> process initiating it then you probably rather want to use an non >>> blocking commit plus a dma_fence to wait for the work to end from userspace. >> Well, first I don't think that's very convenient. You're talking about >> a per-plane property, so there would need to be a separate file >> descriptor allocated for every plane, right? and user-space would have >> to block on all of them before proceeding? OUT_FENCE_PTR is a per-CRTC property, not per-plane. Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >> Also, are you sure that works in all cases? The problematic case we're facing right >> now is when all planes and all crtcs are getting disabled. I'm just skimming >> over the code here, but I see this: >> >> → for_each_new_crtc_in_state(state, crtc, crtc_state, i) {• >> ... >> → → if (arg->flags & DRM_MODE_PAGE_FLIP_EVENT || fence_ptr) {• >> ... >> → → → e = create_vblank_event(crtc, arg->user_data);• >> ... >> → → → crtc_state->event = e;• >> → → }• >> ... >> → → if (fence_ptr) {• >> ... >> → → → fence = drm_crtc_create_fence(crtc);• >> ... >> → → → ret = setup_out_fence(&f[(*num_fences)++], fence);• >> ... >> → → → crtc_state->event->base.fence = fence;• >> → → }• >> >> Is the code really going to allocate a vblank_event when all the >> crtc's are disabled ? I guess it's possible, but that's >> counterintuitive to me. I really don't know. You're confident a set of >> fences will actually work? > > No when you disable everything there is of course no fence allocated. I don't see why not. 
"new_crtc_in_state" in this code means the atomic commit contains some state for the CRTC (such as setting the OUT_FENCE_PTR property), it could disable or leave it disabled though. I can't see any other code which would prevent this from working with a disabled CRTC, I could be missing something though. > But then you also should never see anything waiting for to long to actually be able to trigger the RLIMIT_RTTIME. amdgpu DC exceeds 200ms on some setups, see the linked issue. >> Regardless, this seems like a roundabout way to address a problem that >> we could just ... fix. Handling modesets asynchronously does seem desirable in mutter to me. E.g. it would allow doing modesets in parallel with modesets or other atomic commits on other GPUs. > From what I know RLIMIT_RTTIME usually works in units of multiple seconds. RealtimeKit seems to allow 200ms maximum.
Hi, On Thu, Sep 28, 2023 at 2:56 AM Christian König <christian.koenig@amd.com> wrote: > > To say the "whole point" is about CPU overhead accounting sounds > > rather absurd to me. Is that really what you meant? > > Yes, absolutely. See the functionality you try to implement already exists. You say lower in this same message that you don't believe the functionality actually works for the dpms off case I mentioned. > After making a non blocking commit userspace can still wait on the > commit to finish by looking at the out fence. fences, not fence, fences. drmModeAtomicCommit works on multiple objects at the same time. To follow the spirit of such an api, we would need a separate fd allocated for each crtc and would have to wait on all of them. Surely you can see how that is much less straightforward than using a blocking api. But mutter already uses non-blocking apis for the lion's share of cases. It doesn't need fences for those cases, though, because it can just use page flip events. The main reason it uses blocking apis are for modesets and when doing dpms off. The latter case you said you don't think can use fences, and it certainly can't use page flip events. So if you're right that fences can't be used for the dpms off case, it's not workable answer. If you're wrong, and fences can be used for the dpms off case, then it's a messy answer. > A blocking system call in the sense of RLIMIT_RTTIME means something > which results in the process listening for external events, e.g. calling > select(), epoll() or read() on (for example) a network socket etc... > > As far as I can see drmAtomicCommit() is *not* meant with that what > similar to for example yield() also doesn't reset the RLIMIT_RTTIME counter. No no no, drmModeAtomicCommit() is not like sched_yield(). That's a really strange thing to say (you do mean sched_yield() right?). 
sched_yield() is an oddball because it's specifically for giving other
threads a turn if they need it without causing the current thread to
sleep if they don't. It's a niche api that's meant for high performance
use cases. It's a way to reduce scheduling latency and increase running
time predictability. drmModeAtomicCommit() using up rt time, busy
looping while waiting on the hardware to respond, eating into userspace
RLIMIT_RTTIME is nothing like that.

I'm getting the idea that you think there is some big bucket of kernel
syscalls that block for a large fraction of a second by design and are
not meant to reset RLIMIT_RTTIME. I could be wrong, but I don't think
that's true. Off the top of my head, the only ones I can think of that
might reasonably do that are futex() (which obviously can't sleep),
sched_yield() (whose whole point is to not sleep), and maybe some
obscure ioctls (some probably legitimately, some probably illegitimately
and no one has noticed). I'm willing to be proven wrong here, and I
might be, but right now from thinking about it, my guess is the above
list is pretty close to complete.

> Well you are breaking the RLIMIT_RTTIME functionality.
>
> The purpose of that functionality is to allow debugging and monitoring
> applications to make sure that they keep alive and can react to external
> signals.

I don't think you really thought through what you're saying here. It
just flatly doesn't apply for drmModeAtomicCommit. What is an
application supposed to do? It can't block the SIGKILL that's coming.
Respond to the preceding SIGXCPUs? What response could the application
possibly make? I'm guessing drmModeAtomicCommit isn't going to EINTR
because it's busy looping waiting on hardware in the process context.

And the kernel doesn't even guarantee SIGXCPU is going to go to the
thread with the stuck syscall, so even if there was a legitimate
response (or even "pthread_cancel" or some reckless nonsense like
that), getting the notification to the right part of the program is an
exercise in gymnastics.

> From the RLIMIT_RTTIME documentation: "The intended use of this limit
> is to stop a runaway real-time process from locking up the system."
>
> And when drmAtomicCommit() is triggering this then we either have a
> problem with the application doing something it is not supposed to do
> (like blocking for vblank while it should listen to network traffic) or
> the driver is somehow buggy.

drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit
runs away in e.g. a set of 100ms busy loops responding to a confused or
slow responding GPU, the system will seemingly lock up to the user. That
is an intractable problem that we can not get away from. It doesn't
matter if the kernel is stuck in process context or on a workqueue. And,
regardless, it's not reasonable to expect userspace to craft elaborate
workarounds for driver bugs. We just have to fix the bugs.

> No when you disable everything there is of course no fence allocated.

Okay, so it's not actually an answer.

> But then you also should never see anything waiting for to long to
> actually be able to trigger the RLIMIT_RTTIME.

But we do. That's the problem. That's like the whole problem. The amdgpu
driver thinks it's okay to do something like:

	for_each_command_in_queue(&queue, command) {
		execute_command(command);
		while (1) {
			wait_for_response();

			if (delay++ > 100000)
				break;
			udelay(1);
		}
	}

or something like that, all while keeping the process in the RUNNABLE
state. It just seems wrong to me. At least let the userspace process
get scheduled out.

>> Regardless, this seems like a roundabout way to address a problem that
>> we could just ... fix.
> > Well to make it clear: This is not a problem but intended behavior! I'm going to be frank, I don't think this was intended behavior. We can wait for sima to get off PTO and chime in, to know one way or the other (or maybe airlied can chime in with his take?), but I doubt he was thinking about realtime scheduling minutiae when he put together the drmModeAtomicCommit implementation. There's no practical reason for doing things the way they're being done as far as I can tell. --Ray
hi, On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > >>> When it's really not desirable to account the CPU overhead to the > >>> process initiating it then you probably rather want to use an non > >>> blocking commit plus a dma_fence to wait for the work to end from userspace. > >> Well, first I don't think that's very convenient. You're talking about > >> a per-plane property, so there would need to be a separate file > >> descriptor allocated for every plane, right? and user-space would have > >> to block on all of them before proceeding? > > OUT_FENCE_PTR is a per-CRTC property, not per-plane. Okay, sure. > Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. I don't see how we could rely on that given the provided api and multitude of drivers. It might work and then break randomly. > >> Is the code really going to allocate a vblank_event when all the > >> crtc's are disabled ? I guess it's possible, but that's > >> counterintuitive to me. I really don't know. You're confident a set of > >> fences will actually work? > > > > No when you disable everything there is of course no fence allocated. > > I don't see why not. "new_crtc_in_state" in this code means the atomic > commit contains some state for the CRTC (such as setting the > OUT_FENCE_PTR property), it could disable or leave it disabled though. > I can't see any other code which would prevent this from working with a > disabled CRTC, I could be missing something though. So I'll concede it's possible it works, and the fact it's using a vblank type event is just technical debt (though Christian says he doesn't think it works, so I think it's possible it doesn't work as well.) 
Having said that, this whole argument of "the dysfunctional blocking
api is actually functioning as designed, but don't use it, since it
doesn't work, use the non-blocking api instead, because maybe it might
work or it might not, not sure" is pretty unconvincing to me.

> Handling modesets asynchronously does seem desirable in mutter to me.

Sure, fine. That doesn't mean we shouldn't fix the blocking call to
behave like almost all other blocking system calls.

--Ray
Hi,

Am 28.09.23 um 14:46 schrieb Ray Strode:
> Hi,
>
> On Thu, Sep 28, 2023 at 2:56 AM Christian König
> <christian.koenig@amd.com> wrote:
>>> To say the "whole point" is about CPU overhead accounting sounds
>>> rather absurd to me. Is that really what you meant?
>> Yes, absolutely. See the functionality you try to implement already exists.
> You say lower in this same message that you don't believe the
> functionality actually works for the dpms off case I mentioned.

Well, in the dpms off case there shouldn't be any blocking, as far as I
understand it. If you see a large delay in the dpms off case then we
probably have a driver bug somewhere.

> [SNIP]
>> A blocking system call in the sense of RLIMIT_RTTIME means something
>> which results in the process listening for external events, e.g. calling
>> select(), epoll() or read() on (for example) a network socket etc...
>>
>> As far as I can see drmAtomicCommit() is *not* meant with that what
>> similar to for example yield() also doesn't reset the RLIMIT_RTTIME counter.
> No no no, drmModeAtomicCommit() is not like sched_yield(). That's a
> really strange thing to say (you do mean sched_yield() right?).
> sched_yield() is an oddball because it's specifically for giving other
> threads a turn if they need it without causing the current thread to
> sleep if they don't. It's a niche api that's meant for high performance
> use cases. It's a way to reduce scheduling latency and increase running
> time predictability. drmModeAtomicCommit() using up rt time, busy
> looping while waiting on the hardware to respond, eating into userspace
> RLIMIT_RTTIME is nothing like that.
>
> I'm getting the idea that you think there is some big bucket of kernel
> syscalls that block for a large fraction of a second by design and are
> not meant to reset RLIMIT_RTTIME.

Yes, exactly that's the case as far as I know. The purpose of
RLIMIT_RTTIME is to have a watchdog that checks whether an application
is still responsive.
Only things which make the application look for outside events are
considered a reset for this watchdog. So for example calling select()
and poll() will reset the watchdog, but not calling any DRM modeset
functions or a power management IOCTL, for example.

There are only two possibilities I can see how a DRM modeset is
triggering this:

1. We create enough CPU overhead to actually exceed the limit. In this
   case triggering it is certainly intentional.

2. We busy wait for the hardware to reach a certain state. If that is
   true then this is a bug in the driver.

> I could be wrong, but I don't think
> that's true. Off the top of my head, the only ones I can think of that
> might reasonably do that are futex() (which obviously can't sleep),
> sched_yield() (who's whole point is to not sleep), and maybe a
> some obscure ioctls (some probably legitimately, some probably
> illegitimately and noone has noticed.). I'm willing to be proven wrong
> here, and I might be, but right now from thinking about it, my guess is
> the above list is pretty close to complete.
>
>> Well you are breaking the RLIMIT_RTTIME functionality.
>>
>> The purpose of that functionality is to allow debugging and monitoring
>> applications to make sure that they keep alive and can react to external
>> signals.
> I don't think you really thought through what you're saying here. It
> just flatly doesn't apply for drmModeAtomicCommit. What is an
> application supposed to do?

Get terminated and restart. That's what the whole functionality is good
for. If you don't desire that then don't use the RLIMIT_RTTIME
functionality.

> It can't block the SIGKILL that's coming.
> Respond to the preceding SIGXCPUs? What response could the application
> possibly make? I'm guessing drmModeAtomicCommit isn't going to EINTR
> because it's busy looping waiting on hardware in the process context.
> And the kernel doesn't even guarantee SIGXCPU is going to go to the > thread with the stuck syscall, so even if there was a legitimate > response (or even "pthread_cancel" or some wreckless nonsense like > that), getting the notification to the right part of the program is an > exercise in gymnastics. > >> From the RLIMIT_RTTIME documentation: "The intended use of this limit >> is to stop a runaway real-time process from locking up the system." >> >> And when drmAtomicCommit() is triggering this then we either have a >> problem with the application doing something it is not supposed to do >> (like blocking for vblank while it should listen to network traffic) or >> the driver is somehow buggy. > drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit > runs away in e.g. a set of 100ms busy loops responding to a confused or > slow responding GPU, the system will seemingly lock up to the user. That > is an intractable problem that we can not get away from. It doesn't > matter if the kernel is stuck in process context or on a workqueue. And, > regardless, it's not reasonable to expect userspace to craft elaborate > workarounds for driver bugs. We just have to fix the bugs. Exactly that's the reason why I'm pointing out that this isn't a good idea. > >> No when you disable everything there is of course no fence allocated. > Okay, so it's not actually an answer > >> But then you also should never see anything waiting for to long to >> actually be able to trigger the RLIMIT_RTTIME. > But we do. That's the problem. That's like the whole problem. The amdgpu > driver thinks it's okay to do something like: > > for_each_command_in_queue(&queue, command) { > execute_command(command); > while (1) { > wait_for_response(); > > if (delay++ > 100000) > break; > udelay(1); > } > } > > or something like that. all while keeping the process in the RUNNABLE > state. It just seems wrong to me. At least let the userspace process > get scheduled out. 
That just papers over the problem. The process doesn't become more responsive by hiding the problem. What you need to do here is to report those problems to the driver teams and not try to hide them this way. Regards, Christian. > >>> Regardless, this seems like a roundabout way to address a problem that >>> we could just ... fix. >> Well to make it clear: This is not a problem but intended behavior! > I'm going to be frank, I don't think this was intended behavior. We can > wait for sima to get off PTO and chime in, to know one way or the other > (or maybe airlied can chime in with his take?), but I doubt he was > thinking about realtime scheduling minutiae when he put together the > drmModeAtomicCommit implementation. > > There's no practical reason for doing things the way they're being done > as far as I can tell. > > --Ray
On 9/28/23 14:59, Ray Strode wrote: > On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer > <michel.daenzer@mailbox.org> wrote: >>>>> When it's really not desirable to account the CPU overhead to the >>>>> process initiating it then you probably rather want to use an non >>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>> Well, first I don't think that's very convenient. You're talking about >>>> a per-plane property, so there would need to be a separate file >>>> descriptor allocated for every plane, right? and user-space would have >>>> to block on all of them before proceeding? >> >> OUT_FENCE_PTR is a per-CRTC property, not per-plane. > > Okay, sure. > >> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. > > I don't see how we could rely on that given the provided api and > multitude of drivers. It might work and then break randomly. If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. I'm not really seeing the big difference between using a single fence or multiple, anyway. I do wonder if there might be a time window where the out fences have signalled, but the atomic commit ioctl will still fail with EBUSY. If there is though, I'd expect it to affect the flip completion events as well.
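To make the OUT_FENCE_PTR suggestion concrete: a sync file returned through that per-CRTC property signals by becoming readable, so userspace can wait for it with plain poll() — which, notably, is exactly the kind of blocking call that resets the RLIMIT_RTTIME watchdog. A minimal sketch; the helper name is invented, and the test below stands a pipe in for the fence fd since no real commit is involved:

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait for a sync-file fd -- e.g. one returned via the per-CRTC
 * OUT_FENCE_PTR property of a nonblocking atomic commit -- to signal.
 * A sync file reports POLLIN once its underlying fence has signalled.
 * Returns 0 on signal, -1 on error or timeout. */
static int wait_fence_fd(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	return (ret == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}
```

With multiple CRTCs this generalizes to one pollfd per out-fence fd in a single poll() call, so a compositor blocks once regardless of how many fences it collected.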
On 9/28/23 15:23, Christian König wrote: > > What you need to do here is to report those problems to the driver teams and not try to hide them this way. See the linked issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 (BTW, the original reporter of that issue isn't hitting it with DPMS off but just starting a game, so there seem to be at least two separate causes of exceeding 200ms for an atomic commit in DC) It's not like GitLab issues get much if any attention by DC developers though... So Ray tried to come up with some kind of solution.
Hi, On Thu, Sep 28, 2023 at 9:24 AM Christian König <christian.koenig@amd.com> wrote: > If you see a large delay in the dpms off case then we probably have a driver bug somewhere. This is something we both agree on, I think. >> I'm getting the idea that you think there is some big bucket of kernel >> syscalls that block for a large fraction of a second by design and are >> not meant to reset RLIMIT_RTTIME. > > Yes, exactly that's the case as far as I know. Okay, well I'm willing to be proven wrong. My previously stated list is: futex, sched_yield, and maybe a few obscure ioctls. If you have a big bucket of syscalls in your mind, I'm all ears. I think I'm probably right and the lion's share of all syscalls that can block for a long time don't leave the process RUNNABLE, though. > The purpose of RLIMIT_RTTIME is to have a watchdog if an application is still responsive. Only things which make the application look for outside events are considered a reset for this watchdog. > > So for example calling select() and poll() will reset the watchdog, but not calling any DRM modeset functions or a power management IOCTL for example. What power management ioctls are you talking about? > > There are only two possibilities I can see how a DRM modeset is triggering this: > > 1. We create enough CPU overhead to actually exceed the limit. In this case triggering it is certainly intentional. As I said before I suspect that not all modeset work for all drivers is happening in the process context already anyway. So if this is intentional, I suspect it's not actually working as designed (but I haven't confirmed that, aside from my `git grep -E 'INIT_WORK|DECLARE_TASKLET|open_softirq|timer_setup|kthread_run' | wc -l` litmus test). I can't come up with a good reason it would be designed like that, however, and you haven't provided one, so I suspect it's not actually intentional. >> I don't think you really thought through what you're saying here. 
It >> just flatly doesn't apply for drmModeAtomicCommit. What is an >> application supposed to do? > > Get terminated and restart. That's what the whole functionality is good for. > > If you don't desire that then don't use the RLIMIT_RTTIME functionality. No, getting terminated and restarting the session because drmModeAtomicCommit is taking too long is not really acceptable behavior for a display server. In general, crashing the session is maybe better than locking up the entire system hard, so RLIMIT_RTTIME is still a good idea, but in either case we're in bug land here: each such bug needs to be fixed. In this case, the bug is drmModeAtomicCommit isn't sleeping like nearly every other syscall does. > From the RLIMIT_RTTIME documentation: "The intended use of this limit > is to stop a runaway real-time process from locking up the system." > > And when drmAtomicCommit() is triggering this then we either have a > problem with the application doing something it is not supposed to do > (like blocking for vblank while it should listen to network traffic) or > the driver is somehow buggy. Yes! This is about bugs. > drmModeAtomicCommit() is used by display servers. If drmModeAtomicCommit > runs away in e.g. a set of 100ms busy loops responding to a confused or > slow responding GPU, the system will seemingly lock up to the user. That > is an intractable problem that we can not get away from. It doesn't > matter if the kernel is stuck in process context or on a workqueue. And, > regardless, it's not reasonable to expect userspace to craft elaborate > workarounds for driver bugs. We just have to fix the bugs. > > Exactly that's the reason why I'm pointing out that this isn't a good idea. Not following your logic here. > That just papers over the problem. The process doesn't become more responsive by hiding the problem. It's not about the process being responsive. In meat-space, the sub-second delay is imperceptible because the screen is turned off. 
It's about not killing an innocent display server because of a driver bug. > What you need to do here is to report those problems to the driver teams and not try to hide them this way. There is already a bug, it's mentioned in the merge request, the reason I cc'd you (vs say just Dave and Daniel) is because the driver bug is in amdgpu code. --Ray
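For reference, the limit being argued about is straightforward to set up from userspace. A sketch of what a session manager applies (the 200 ms figure matches the value mentioned in the thread; the helper name is illustrative): RLIMIT_RTTIME is accounted in microseconds of CPU time a SCHED_FIFO/SCHED_RR task consumes *without blocking* — a sleeping syscall resets the count, while a syscall that keeps the task RUNNABLE keeps eating the budget.

```c
#include <assert.h>
#include <sys/resource.h>

/* Cap the CPU time (in microseconds) a realtime task may burn without
 * ever blocking: past the soft limit the kernel sends SIGXCPU, past
 * the hard limit SIGKILL. The helper name is illustrative. */
static int limit_rt_cpu_us(rlim_t soft_us, rlim_t hard_us)
{
	struct rlimit rl = { .rlim_cur = soft_us, .rlim_max = hard_us };

	/* Only enforced while the task runs SCHED_FIFO or SCHED_RR. */
	return setrlimit(RLIMIT_RTTIME, &rl);
}
```

Lowering a limit needs no privileges, so a display server can apply this to itself before switching to a realtime scheduling class.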
Am 28.09.23 um 15:37 schrieb Michel Dänzer: > On 9/28/23 14:59, Ray Strode wrote: >> On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer >> <michel.daenzer@mailbox.org> wrote: >>>>>> When it's really not desirable to account the CPU overhead to the >>>>>> process initiating it then you probably rather want to use an non >>>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>>> Well, first I don't think that's very convenient. You're talking about >>>>> a per-plane property, so there would need to be a separate file >>>>> descriptor allocated for every plane, right? and user-space would have >>>>> to block on all of them before proceeding? >>> OUT_FENCE_PTR is a per-CRTC property, not per-plane. >> Okay, sure. >> >>> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >> I don't see how we could rely on that given the provided api and >> multitude of drivers. It might work and then break randomly. > If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. > > I'm not really seeing the big difference between using a single fence or multiple, anyway. The big difference is that a standard modeset can take some time, e.g. setting up power levels, waiting for PLLs to settle, waiting for a vblank etc. That this happens async in the background so that the frontend application can still respond to other signals seems reasonable. But in the case of turning things off, what should we wait for? I think we still support the out fence, but it doesn't really make sense to use it. > I do wonder if there might be a time window where the out fences have signalled, but the atomic commit ioctl will still fail with EBUSY. If there is though, I'd expect it to affect the flip completion events as well. > I'm not deep enough into that code to confirm that. Daniel or other display folks need to help here. Regards, Christian.
Am 28.09.23 um 15:58 schrieb Michel Dänzer: > On 9/28/23 15:23, Christian König wrote: >> What you need to do here is to report those problems to the driver teams and not try to hide them this way. > See the linked issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 (BTW, the original reporter of that issue isn't hitting it with DPMS off but just starting a game, so there seem to be at least two separate causes of exceeding 200ms for an atomic commit in DC) > > It's not like GitLab issues get much if any attention by DC developers though... So Ray tried to come up with some kind of solution. I haven't seen this bug either for the simple reason that it wasn't assigned to anybody. Going to grab it and forward it to the DC guys. That makes the reasoning understandable, but we should still absolutely not paper over problems as suggested with this patch. Thanks, Christian.
On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > From: Ray Strode <rstrode@redhat.com> > > A drm atomic commit can be quite slow on some hardware. It can lead > to a lengthy queue of commands that need to get processed and waited > on before control can go back to user space. > > If user space is a real-time thread, that delay can have severe > consequences, leading to the process getting killed for exceeding > rlimits. > > This commit addresses the problem by always running the slow part of > a commit on a workqueue, separated from the task initiating the > commit. > > This change makes the nonblocking and blocking paths work in the same way, > and as a result allows the task to sleep and not use up its > RLIMIT_RTTIME allocation. > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > Signed-off-by: Ray Strode <rstrode@redhat.com> > --- > drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 292e38eb6218..1a1e68d98d38 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, > * This is the point of no return - everything below never fails except > * when the hw goes bonghits. Which means we can commit the new state on > * the software side now. > */ > > ret = drm_atomic_helper_swap_state(state, true); > if (ret) > goto err; > > /* > * Everything below can be run asynchronously without the need to grab > * any modeset locks at all under one condition: It must be guaranteed > * that the asynchronous work has either been cancelled (if the driver > * supports it, which at least requires that the framebuffers get > * cleaned up with drm_atomic_helper_cleanup_planes()) or completed > * before the new state gets committed on the software side with > * drm_atomic_helper_swap_state(). 
> * > * This scheme allows new atomic state updates to be prepared and > * checked in parallel to the asynchronous completion of the previous > * update. Which is important since compositors need to figure out the > * composition of the next frame right after having submitted the > * current layout. > * > * NOTE: Commit work has multiple phases, first hardware commit, then > * cleanup. We want them to overlap, hence need system_unbound_wq to > * make sure work items don't artificially stall on each another. > */ > > drm_atomic_state_get(state); > - if (nonblock) > - queue_work(system_unbound_wq, &state->commit_work); > - else > - commit_tail(state); > + queue_work(system_unbound_wq, &state->commit_work); > + if (!nonblock) > + flush_work(&state->commit_work); Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ except I went further and moved the flush past the unlock in the end.
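The shape of the proposed change — always queue the commit tail, and have the blocking path merely wait for it — has a simple userspace analog. A sketch with pthreads (all names invented), where the join plays the role of flush_work(): the blocking caller sleeps until the worker finishes instead of executing the slow tail itself:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace analog of the queue_work()+flush_work() pattern: both
 * blocking and nonblocking commits hand the slow tail to a worker;
 * a blocking commit then just sleeps in the join until it finishes. */
struct commit_work {
	pthread_t thread;
	void (*tail)(void *state);
	void *state;
};

static void *commit_worker(void *arg)
{
	struct commit_work *w = arg;

	w->tail(w->state);	/* the slow hardware programming */
	return NULL;
}

static int commit(struct commit_work *w, bool nonblock)
{
	if (pthread_create(&w->thread, NULL, commit_worker, w))
		return -1;
	if (!nonblock)
		pthread_join(w->thread, NULL);	/* like flush_work() */
	/* a real nonblocking path would detach or join the thread later */
	return 0;
}

/* Toy tail for demonstration: just records that it ran. */
static int programmed;
static void fake_tail(void *state)
{
	(void)state;
	programmed++;
}
```

Either way the tail runs off the caller's context; the only difference between the two modes is whether the caller waits (sleeping) before returning.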
On 9/28/23 16:51, Christian König wrote: > Am 28.09.23 um 15:37 schrieb Michel Dänzer: >> On 9/28/23 14:59, Ray Strode wrote: >>> On Thu, Sep 28, 2023 at 5:43 AM Michel Dänzer >>> <michel.daenzer@mailbox.org> wrote: >>>>>>> When it's really not desirable to account the CPU overhead to the >>>>>>> process initiating it then you probably rather want to use an non >>>>>>> blocking commit plus a dma_fence to wait for the work to end from userspace. >>>>>> Well, first I don't think that's very convenient. You're talking about >>>>>> a per-plane property, so there would need to be a separate file >>>>>> descriptor allocated for every plane, right? and user-space would have >>>>>> to block on all of them before proceeding? >>>> OUT_FENCE_PTR is a per-CRTC property, not per-plane. >>> Okay, sure. >>> >>>> Also, at least in this particular case, a single sync file (not dma_fence) for any CRTC might suffice. >>> I don't see how we could rely on that given the provided api and >>> multitude of drivers. It might work and then break randomly. >> If it's supposed to work from the KMS API PoV, any bugs to the contrary should be fixed. >> >> I'm not really seeing the big difference between using a single fence or multiple, anyway. > > The big difference is that a standard modeset can take some time, e.g. setting up power levels, waiting for PLLs to settle, waiting for a vblank etc.. Right (starting a game shouldn't involve such a modeset though in a Wayland session). What I mean is there's no significant difference for user space between using a single out fence or multiple. Just slightly different bookkeeping. > But in the case of turning thing off, what should we wait for? The atomic commit to finish, so it's safe to submit another one without hitting EBUSY. (What we use the completion events for with page flips)
Hi, On Thu, Sep 28, 2023 at 11:05 AM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ Nice. Was there push back? Why didn't it go in? > except I went further and moved the flush past the unlock in the end. Is that necessary? I was wondering about that but there's this comment in the code: * ... Locks, especially struct * &drm_modeset_lock, should not be held in worker threads or any other * asynchronous context used to commit the hardware state. So it made me think there would be no locks held, and then I threw a scratch build at someone who reported it didn't deadlock and solved their issue. --Ray
On Thu, Sep 28, 2023 at 03:33:46PM -0400, Ray Strode wrote: > Hi, > > On Thu, Sep 28, 2023 at 11:05 AM Ville Syrjälä > <ville.syrjala@linux.intel.com> wrote: > > Here's my earlier take on this: https://patchwork.freedesktop.org/series/108668/ > > Nice. Was there push back? Why didn't it go in? No one really seemed all that interested in it. I'd still like to get it in, if for no other reason than to make things operate more uniformly. Though there are lots of legacy codepaths left that still hold the locks over the whole commit, but those could be fixed up as a followup. > > > except I went further and moved the flush past the unlock in the end. > > Is that necessary? I was wondering about that but there's this comment in the code: I'm not really sure there is any point in doing this otherwise. It would just change which thread executes the code but everything else would still get blocked on the locks the same as before. > > * ... Locks, especially struct > * &drm_modeset_lock, should not be held in worker threads or any other > * asynchronous context used to commit the hardware state. > > So it made me think there would be no locks held, and then I threw a > scratch build > at someone who reported it didn't deadlock and solved their issue. > > --Ray
Hi, On Wed, Oct 4, 2023 at 1:28 PM Ville Syrjälä <ville.syrjala@linux.intel.com> wrote: > No one really seemed all that interested in it. I'd still like to get > it in, if for no other reason than to make things operate more uniformly. > Though there are lots of legacy codepaths left that still hold the locks > over the whole commit, but those could be fixed up as a followup. Ah I see what you're saying. > > > except I went further and moved the flush past the unlock in the end. > > > > Is that necessary? I was wondering about that but there's this comment > > in the code: > > I'm not really sure there is any point in doing this otherwise. > It would just change which thread executes the code but everything > else would still get blocked on the locks the same as before. Well in my case, I just want driver work to not be charged to the process's CPU time. But I get what you're saying: checking if new configurations are valid shouldn't block on existing configurations getting applied. Doing it outside the locks isn't necessary to prevent deadlocks, it's necessary because you're trying to avoid contention when there doesn't need to be contention. Your patchset sort of has a different goal, but it solves the issue I care about at the same time. I'd be happy if it went in... Daniel, Dave, what are your takes on this? --Ray
On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > From: Ray Strode <rstrode@redhat.com> > > A drm atomic commit can be quite slow on some hardware. It can lead > to a lengthy queue of commands that need to get processed and waited > on before control can go back to user space. > > If user space is a real-time thread, that delay can have severe > consequences, leading to the process getting killed for exceeding > rlimits. > > This commit addresses the problem by always running the slow part of > a commit on a workqueue, separated from the task initiating the > commit. > > This change makes the nonblocking and blocking paths work in the same way, > and as a result allows the task to sleep and not use up its > RLIMIT_RTTIME allocation. > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > Signed-off-by: Ray Strode <rstrode@redhat.com> So imo the trouble with this is that we suddenly start to make realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi change, because even limited to the case of !ALLOW_MODESET we do best effort guarantees at best. And some drivers (again amd's dc) spend a ton of cpu time recomputing state even for pure plane changes without any crtc changes like dpms on/off (at least I remember some bug reports about that). And that state recomputation has to happen synchronously, because it always influences the ioctl errno return value. My take is that you're papering over a performance problem here of the "the driver is too slow/wastes too much cpu time". We should fix the driver, if that's possible. Another option would be if userspace drops realtime priorities for these known-slow operations. And right now _all_ kms operations are potentially cpu and real-time wasters, the entire uapi is best effort. 
We can also try to change the atomic uapi to give some hard real-time guarantees so that running compositors as SCHED_RT is possible, but that
- means a very serious stream of bugs to fix all over
- therefore needs some very wide buy-in from drivers that they're willing to make this guarantee
- probably needs some really carefully carved out limitations, because there's imo flat-out no way we'll make all atomic ioctls hard time limit bound

Also, as König has pointed out, you can roll this duct-tape out in userspace by making the commit non-blocking and immediately waiting for the fences.

One thing I didn't see mentioned is that there's a very subtle uapi difference between non-blocking and blocking:
- non-blocking is not allowed to get ahead of the previous commit, and will return EBUSY in that case. See the comment in drm_atomic_helper_commit()
- blocking otoh will just block until any previous pending commit has finished

Not taking that into account in your patch here breaks uapi because userspace will suddenly get EBUSY when they don't expect that. Cheers, Sima > --- > drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c > index 292e38eb6218..1a1e68d98d38 100644 > --- a/drivers/gpu/drm/drm_atomic_helper.c > +++ b/drivers/gpu/drm/drm_atomic_helper.c > @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, > * This is the point of no return - everything below never fails except > * when the hw goes bonghits. Which means we can commit the new state on > * the software side now. 
> */ > > ret = drm_atomic_helper_swap_state(state, true); > if (ret) > goto err; > > /* > * Everything below can be run asynchronously without the need to grab > * any modeset locks at all under one condition: It must be guaranteed > * that the asynchronous work has either been cancelled (if the driver > * supports it, which at least requires that the framebuffers get > * cleaned up with drm_atomic_helper_cleanup_planes()) or completed > * before the new state gets committed on the software side with > * drm_atomic_helper_swap_state(). > * > * This scheme allows new atomic state updates to be prepared and > * checked in parallel to the asynchronous completion of the previous > * update. Which is important since compositors need to figure out the > * composition of the next frame right after having submitted the > * current layout. > * > * NOTE: Commit work has multiple phases, first hardware commit, then > * cleanup. We want them to overlap, hence need system_unbound_wq to > * make sure work items don't artificially stall on each another. > */ > > drm_atomic_state_get(state); > - if (nonblock) > - queue_work(system_unbound_wq, &state->commit_work); > - else > - commit_tail(state); > + queue_work(system_unbound_wq, &state->commit_work); > + if (!nonblock) > + flush_work(&state->commit_work); > > return 0; > > err: > drm_atomic_helper_cleanup_planes(dev, state); > return ret; > } > EXPORT_SYMBOL(drm_atomic_helper_commit); > > /** > * DOC: implementing nonblocking commit > * > * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence > * different operations against each another. Locks, especially struct > * &drm_modeset_lock, should not be held in worker threads or any other > * asynchronous context used to commit the hardware state. > * > * drm_atomic_helper_commit() implements the recommended sequence for > * nonblocking commits, using drm_atomic_helper_setup_commit() internally: > * > * 1. Run drm_atomic_helper_prepare_planes(). 
Since this can fail and we > * need to propagate out of memory/VRAM errors to userspace, it must be called > * synchronously. > * > * 2. Synchronize with any outstanding nonblocking commit worker threads which > * might be affected by the new state update. This is handled by > * drm_atomic_helper_setup_commit(). > * > * Asynchronous workers need to have sufficient parallelism to be able to run > * different atomic commits on different CRTCs in parallel. The simplest way to > -- > 2.41.0 >
On Thu, Oct 05, 2023 at 11:57:41AM +0200, Daniel Vetter wrote: > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > > From: Ray Strode <rstrode@redhat.com> > > > > A drm atomic commit can be quite slow on some hardware. It can lead > > to a lengthy queue of commands that need to get processed and waited > > on before control can go back to user space. > > > > If user space is a real-time thread, that delay can have severe > > consequences, leading to the process getting killed for exceeding > > rlimits. > > > > This commit addresses the problem by always running the slow part of > > a commit on a workqueue, separated from the task initiating the > > commit. > > > > This change makes the nonblocking and blocking paths work in the same way, > > and as a result allows the task to sleep and not use up its > > RLIMIT_RTTIME allocation. > > > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > > Signed-off-by: Ray Strode <rstrode@redhat.com> > > So imo the trouble with this is that we suddenly start to make > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > change, because even limited to the case of !ALLOW_MODESET we do best > effort guarantees at best. And some drivers (again amd's dc) spend a ton > of cpu time recomputing state even for pure plane changes without any crtc > changes like dpms on/off (at least I remember some bug reports about > that). And that state recomputation has to happen synchronously, because > it always influences the ioctl errno return value. > > My take is that you're papering over a performance problem here of the > "the driver is too slow/wastes too much cpu time". We should fix the > driver, if that's possible. > > Another option would be if userspace drops realtime priorities for these > known-slow operations. And right now _all_ kms operations are potentially > cpu and real-time wasters, the entire uapi is best effort. 
> > We can also try to change the atomic uapi to give some hard real-time > guarantees so that running compositors as SCHED_RT is possible, but that > - means a very serious stream of bugs to fix all over > - therefore needs some very wide buy-in from drivers that they're willing > to make this guarantee > - probably needs some really carefully carved out limitations, because > there's imo flat-out no way we'll make all atomic ioctl hard time limit > bound > > Also, as König has pointed out, you can roll this duct-tape out in > userspace by making the commit non-blocking and immediately waiting for > the fences. > > One thing I didn't see mention is that there's a very subtle uapi > difference between non-blocking and blocking: > - non-blocking is not allowed to get ahead of the previous commit, and > will return EBUSY in that case. See the comment in > drm_atomic_helper_commit() > - blocking otoh will just block until any previous pending commit has > finished > > Not taking that into account in your patch here breaks uapi because > userspace will suddenly get EBUSY when they don't expect that. The -EBUSY logic already checks whether the current commit is non-blocking vs. blocking, so I don't see how there would be any change in behaviour from simply stuffing the commit_tail onto a workqueue, especially as the locks will still be held across the flush. In my earlier series [1] where I move the flush to happen after dropping the locks, there is a far more subtle issue because currently even non-blocking commits can actually block due to the mutex. Changing that might break something, so I preserved that behaviour explicitly. Full explanation in the first patch there. [1] https://patchwork.freedesktop.org/series/108668/
Am 05.10.23 um 11:57 schrieb Daniel Vetter: > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: >> From: Ray Strode <rstrode@redhat.com> >> >> A drm atomic commit can be quite slow on some hardware. It can lead >> to a lengthy queue of commands that need to get processed and waited >> on before control can go back to user space. >> >> If user space is a real-time thread, that delay can have severe >> consequences, leading to the process getting killed for exceeding >> rlimits. >> >> This commit addresses the problem by always running the slow part of >> a commit on a workqueue, separated from the task initiating the >> commit. >> >> This change makes the nonblocking and blocking paths work in the same way, >> and as a result allows the task to sleep and not use up its >> RLIMIT_RTTIME allocation. >> >> Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 >> Signed-off-by: Ray Strode <rstrode@redhat.com> > So imo the trouble with this is that we suddenly start to make > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > change, because even limited to the case of !ALLOW_MODESET we do best > effort guarantees at best. And some drivers (again amd's dc) spend a ton > of cpu time recomputing state even for pure plane changes without any crtc > changes like dpms on/off (at least I remember some bug reports about > that). And that state recomputation has to happen synchronously, because > it always influences the ioctl errno return value. > > My take is that you're papering over a performance problem here of the > "the driver is too slow/wastes too much cpu time". We should fix the > driver, if that's possible. > > Another option would be if userspace drops realtime priorities for these > known-slow operations. And right now _all_ kms operations are potentially > cpu and real-time wasters, the entire uapi is best effort. 
> > We can also try to change the atomic uapi to give some hard real-time > guarantees so that running compositors as SCHED_RT is possible, but that > - means a very serious stream of bugs to fix all over > - therefore needs some very wide buy-in from drivers that they're willing > to make this guarantee > - probably needs some really carefully carved out limitations, because > there's imo flat-out no way we'll make all atomic ioctl hard time limit > bound Well, we actually have a pending request to support some real time use cases with the amdgpu driver. And as I already said to Alex internally this is not a pile, but a mountain of work even when we exclude DC. Fixing DC to not busy wait for events which raise an interrupt is still something we should absolutely do anyway, but that is an ongoing process. > Also, as König has pointed out, you can roll this duct-tape out in > userspace by making the commit non-blocking and immediately waiting for > the fences. Oh, please don't use my last name like this. It reminds me of my military time :) Regards, Christian. > > One thing I didn't see mention is that there's a very subtle uapi > difference between non-blocking and blocking: > - non-blocking is not allowed to get ahead of the previous commit, and > will return EBUSY in that case. See the comment in > drm_atomic_helper_commit() > - blocking otoh will just block until any previous pending commit has > finished > > Not taking that into account in your patch here breaks uapi because > userspace will suddenly get EBUSY when they don't expect that. 
> > Cheers, Sima > > >> --- >> drivers/gpu/drm/drm_atomic_helper.c | 7 +++---- >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c >> index 292e38eb6218..1a1e68d98d38 100644 >> --- a/drivers/gpu/drm/drm_atomic_helper.c >> +++ b/drivers/gpu/drm/drm_atomic_helper.c >> @@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev, >> * This is the point of no return - everything below never fails except >> * when the hw goes bonghits. Which means we can commit the new state on >> * the software side now. >> */ >> >> ret = drm_atomic_helper_swap_state(state, true); >> if (ret) >> goto err; >> >> /* >> * Everything below can be run asynchronously without the need to grab >> * any modeset locks at all under one condition: It must be guaranteed >> * that the asynchronous work has either been cancelled (if the driver >> * supports it, which at least requires that the framebuffers get >> * cleaned up with drm_atomic_helper_cleanup_planes()) or completed >> * before the new state gets committed on the software side with >> * drm_atomic_helper_swap_state(). >> * >> * This scheme allows new atomic state updates to be prepared and >> * checked in parallel to the asynchronous completion of the previous >> * update. Which is important since compositors need to figure out the >> * composition of the next frame right after having submitted the >> * current layout. >> * >> * NOTE: Commit work has multiple phases, first hardware commit, then >> * cleanup. We want them to overlap, hence need system_unbound_wq to >> * make sure work items don't artificially stall on each another. 
>> */ >> >> drm_atomic_state_get(state); >> - if (nonblock) >> - queue_work(system_unbound_wq, &state->commit_work); >> - else >> - commit_tail(state); >> + queue_work(system_unbound_wq, &state->commit_work); >> + if (!nonblock) >> + flush_work(&state->commit_work); >> >> return 0; >> >> err: >> drm_atomic_helper_cleanup_planes(dev, state); >> return ret; >> } >> EXPORT_SYMBOL(drm_atomic_helper_commit); >> >> /** >> * DOC: implementing nonblocking commit >> * >> * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence >> * different operations against each another. Locks, especially struct >> * &drm_modeset_lock, should not be held in worker threads or any other >> * asynchronous context used to commit the hardware state. >> * >> * drm_atomic_helper_commit() implements the recommended sequence for >> * nonblocking commits, using drm_atomic_helper_setup_commit() internally: >> * >> * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we >> * need to propagate out of memory/VRAM errors to userspace, it must be called >> * synchronously. >> * >> * 2. Synchronize with any outstanding nonblocking commit worker threads which >> * might be affected by the new state update. This is handled by >> * drm_atomic_helper_setup_commit(). >> * >> * Asynchronous workers need to have sufficient parallelism to be able to run >> * different atomic commits on different CRTCs in parallel. The simplest way to >> -- >> 2.41.0 >>
Hi,

On Thu, Oct 5, 2023 at 5:57 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> So imo the trouble with this is that we suddenly start to make
> realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi
> change, because even limited to the case of !ALLOW_MODESET we do best
> effort guarantees at best.

So you're saying there's a best effort to not use up process CPU time, but Christian was trying to argue nearly the opposite, that any needed CPU time for the operation should get accounted toward the process.

What you're saying makes more sense to me and it tracks with what I'm getting skimming over the code. I see it potentially sleeps during blocking drmModeAtomicCommit() calls in several places:

- copy_from_user when copying properties
- fence and vblank event allocation
- waiting on modeset locks
- And even plane changes:

	for_each_new_plane_in_state(state, plane, new_plane_state, i) {
	...
		/*
		 * If waiting for fences pre-swap (ie: nonblock), userspace can
		 * still interrupt the operation. Instead of blocking until the
		 * timer expires, make the wait interruptible.
		 */
		ret = dma_fence_wait(new_plane_state->fence, pre_swap);
	...
	}

(Ignore the typo where it says "nonblock" instead of "!nonblock"; the point is that while waiting on plane state changes it sleeps.)

So, in general, the ioctl sleeps when it needs to. It's not trying to be some non-preemptible RT task with a dedicated cpu budget that it strictly adheres to. That makes sense to me.

> And that state recomputation has to happen synchronously, because
> it always influences the ioctl errno return value.

Not sure I'm following. The task doesn't have to stay in RUNNING the entire time the ioctl is getting carried out. It can get preempted, and rescheduled later, all before returning from the ioctl and delivering the errno back to userspace. What am I missing?

The problem isn't that the ioctl blocks; the problem is that when it blocks it shouldn't be leaving the task state in RUNNING.
> My take is that you're papering over a performance problem here of the
> "the driver is too slow/wastes too much cpu time". We should fix the
> driver, if that's possible.

I think everyone agrees the amdgpu DC code doing several 100ms busy loops in a row with no break in between is buggy behavior, and getting that fixed is important.

Also, there's no dispute that the impetus for my proposed change was that misbehavior in the driver code.

Neither of those facts means we shouldn't improve drm_atomic_helper_commit too. Making the change is a good idea. It was even independently proposed a year ago, for reasons unrelated to the current situation. It makes the nonblock and the !nonblock code paths behave closer to the same. It makes the userspace experience better in the face of driver bugs. You said best effort earlier; this change helps provide a best effort.

Realistically, if it had been done this way from the start, no one would be batting an eye, right? It just improves things all around. I think maybe you're worried about a slippery slope?

> We can also try to change the atomic uapi to give some hard real-time
> guarantees so that running compositors as SCHED_RT is possible, but that
> - means a very serious stream of bugs to fix all over
> - therefore needs some very wide buy-in from drivers that they're willing
>   to make this guarantee
> - probably needs some really carefully carved out limitations, because
>   there's imo flat-out no way we'll make all atomic ioctl hard time limit
>   bound

We don't need a guarantee that reprogramming ast BMC framebuffer offsets or displayport link training (or whatever) takes less than 200ms. If a driver has to sit and wait 32ms for vblank before twiddling things in hardware to keep crud from showing up on screen or something, that's fine. But in none of these cases should the userspace process be kept in RUNNING while it does it!
I take your point that there are a lot of drivers that may be doing slow things, but surely they should let the process nap while they do their work?

> Also, as König has pointed out, you can roll this duct-tape out in
> userspace by making the commit non-blocking and immediately waiting for
> the fences.

Sure, userspace can do that (even, it turns out, in the all-crtcs-disabled case, which was an open question earlier in the thread).

That's not really an argument against fixing the !nonblock case though.

> One thing I didn't see mention is that there's a very subtle uapi
> difference between non-blocking and blocking:
> - non-blocking is not allowed to get ahead of the previous commit, and
>   will return EBUSY in that case. See the comment in
>   drm_atomic_helper_commit()
> - blocking otoh will just block until any previous pending commit has
>   finished
>
> Not taking that into account in your patch here breaks uapi because
> userspace will suddenly get EBUSY when they don't expect that.

I don't think that's right; drm_atomic_helper_setup_commit has several chunks of code like this:

	if (nonblock && old_conn_state->commit &&
	    !try_wait_for_completion(&old_conn_state->commit->flip_done)) {
	...
		return -EBUSY;
	}

So it only returns EBUSY for DRM_MODE_ATOMIC_NONBLOCK cases.

There's also this code earlier in the process:

	list_for_each_entry(commit, &crtc->commit_list, commit_entry) {
	...
		completed = try_wait_for_completion(&commit->flip_done);
	...
		if (!completed && nonblock) {
	...
			return -EBUSY;
		}
	...
	}
	...
	ret = wait_for_completion_interruptible_timeout(&stall_commit->cleanup_done,
							10*HZ);
	...

So it looks like it sleeps (not leaving the task in RUNNING state!) if there's an outstanding commit.

--Ray
Am 05.10.23 um 23:04 schrieb Ray Strode: > Hi, > > On Thu, Oct 5, 2023 at 5:57 AM Daniel Vetter <daniel@ffwll.ch> wrote: >> So imo the trouble with this is that we suddenly start to make >> realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi >> change, because even limited to the case of !ALLOW_MODESET we do best >> effort guarantees at best. > So you're saying there's a best effort to not use up process CPU time, > but Christian was trying to argue nearly the opposite, that any needed > CPU time for the operation should get accounted toward the process. Well as far as I can see what Daniel and I say here perfectly matches. When the operation busy waits then that *should* get accounted to the CPU time of the current process. When the operation sleeps and waits for some interrupt for example it should not get accounted. What you suggest is to put the parts of the operation which busy wait into a background task and then sleep for that task to complete. This is not a solution to the problem, but just hides it. Stuff like that is not a valid justification for the change. Ville changes on the other hand tried to prevent lock contention which is a valid goal here. Regards, Christian. > > What you're saying makes more sense to me and it tracks with what I'm > getting skimming over the code. I see it potentially sleeps during > blocking drmModeAtomicCommit() calls in several places: > > - copy_from_user when copying properties > - fence and vblank event allocation > - waiting on modeset locks > - And even plane changes: > > for_each_new_plane_in_state(state, plane, new_plane_state, i) {• > ... > → /*• > → * If waiting for fences pre-swap (ie: nonblock), userspace can• > → * still interrupt the operation. Instead of blocking until the• > → * timer expires, make the wait interruptible.• > → */• > → ret = dma_fence_wait(new_plane_state->fence, pre_swap);• > ... 
> }• > > (Ignore the typo where it says "nonblock" instead of "!nonblock", > the point is while waiting on plane state changes it sleeps.) > > So, in general, the ioctl sleeps when it needs to. It's not trying > to be some non-premptible RT task with a dedicated cpu budget that it > strictly adheres to. That makes sense to me. > >> And that state recomputation has to happen synchronously, because >> it always influences the ioctl errno return value. > Not sure I'm following. The task doesn't have to stay in RUNNING the > entire time the ioctl is getting carried out. It can get preempted, > and rescheduled later, all before returning from the ioctl and > delivering the errno back to userspace. What am I missing? > > The problem isn't that the ioctl blocks, the problem is that when it > blocks it shouldn't be leaving the task state in RUNNING. > >> My take is that you're papering over a performance problem here of the >> "the driver is too slow/wastes too much cpu time". We should fix the >> driver, if that's possible. > I think everyone agrees the amdgpu DC code doing several 100ms busy > loops in a row with no break in between is buggy behavior, and getting > that fixed is important. > > Also, there's no dispute that the impetus for my proposed change was > that misbehavior in the driver code. > > Neither of those facts mean we shouldn't improve > drm_atomic_helper_commit too. Making the change is a good idea. It was > even independently proposed a year ago, for reasons unrelated to the > current situation. It makes the nonblock and the > !nonblock code paths behave closer to the same. it makes the userspace > experience better in the face of driver bugs. You said best effort > earlier, this change helps provide a best effort. > > Realistically, if it was done this way from the start, no one would be > batting an eye, right? It just improves things all around. I think > maybe you're worried about a slippery slope? 
> >> We can also try to change the atomic uapi to give some hard real-time >> guarantees so that running compositors as SCHED_RT is possible, but that >> - means a very serious stream of bugs to fix all over >> - therefore needs some very wide buy-in from drivers that they're willing >> to make this guarantee >> - probably needs some really carefully carved out limitations, because >> there's imo flat-out no way we'll make all atomic ioctl hard time limit >> bound > We don't need a guarantee that reprogramming ast BMC framebuffer > offsets or displayport link training (or whatever) takes less than > 200ms. If a driver has to sit and wait 32ms for vblank before > twiddling things in hardware to keep crud from showing up on screen or > something, that's fine. But in none of these cases should the > userspace process be kept in RUNNING while it does it! > > I take your point that there are a lot of drivers that may be doing > slow things, but surely they should let the process nap while they do > their work? > >> Also, as König has pointed out, you can roll this duct-tape out in >> userspace by making the commit non-blocking and immediately waiting for >> the fences. > Sure, userspace can do that (even, it turns out, in the all crtc > disabled case which was an open question earlier in the thread). > > That's not really an argument against fixing the !nonblock case > though. > >> One thing I didn't see mention is that there's a very subtle uapi >> difference between non-blocking and blocking: >> - non-blocking is not allowed to get ahead of the previous commit, and >> will return EBUSY in that case. See the comment in >> drm_atomic_helper_commit() >> - blocking otoh will just block until any previous pending commit has >> finished >> >> Not taking that into account in your patch here breaks uapi because >> userspace will suddenly get EBUSY when they don't expect that. 
> I don't think that's right, drm_atomic_helper_setup_commit has several
> chunks of code like this:
>
>	if (nonblock && old_conn_state->commit &&
>	    !try_wait_for_completion(&old_conn_state->commit->flip_done)) {
>	...
>		return -EBUSY;
>	}
>
> So it only returns EBUSY for DRM_MODE_ATOMIC_NONBLOCK cases.
>
> There's also this code earlier in the process:
>
>	list_for_each_entry(commit, &crtc->commit_list, commit_entry) {
>	...
>		completed = try_wait_for_completion(&commit->flip_done);
>	...
>		if (!completed && nonblock) {
>	...
>			return -EBUSY;
>		}
>	...
>	}
>	...
>	ret = wait_for_completion_interruptible_timeout(&stall_commit->cleanup_done,
>							10*HZ);
>	...
>
> So it looks like it sleeps (not leaving the system in RUNNING state!) if
> there's an outstanding commit.
>
> --Ray
Hi, On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > When the operation busy waits then that *should* get accounted to the > CPU time of the current process. When the operation sleeps and waits for > some interrupt for example it should not get accounted. > What you suggest is to put the parts of the operation which busy wait > into a background task and then sleep for that task to complete. This is > not a solution to the problem, but just hides it. Actually, I think we both probably agree that there shouldn't be long busy waits in the context of the current process. After all, we both agree what the AMD DC driver code is doing is wrong. To be clear, my take is, if driver code is running in process context and needs to wait for periods of time on the order of or in excess of a typical process time slice it should be sleeping during the waiting. If the operation is at a point where it can be cancelled without side effects, the sleeping should be INTERRUPTIBLE. If it's past the point of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my opinion, should kernel code busy block a typical process for dozens of milliseconds while keeping the process RUNNING. I don't think this is a controversial take. Actually, I think (maybe?) you might even agree with that, but you're also saying: user space processes aren't special here. While it's not okay to busy block them, it's also not okay to busy block on the system unbound workqueue either. If that's your sentiment, I don't disagree with it. So I think we both agree the busy waiting is a problem, but maybe we disagree on the best place for the problem to manifest when it happens. One thought re the DC code is regardless of where the code is running, the scheduler is going to forcefully preempt it at some point right? Any writereg/udelay(1)/readreg loop is going to get disrupted by a much bigger than 1us delay by the kernel if the loop goes on long enough. 
I'm not wrong about that? If that's true, the code might as well switch out the udelay(1) for a usleep_range() and call it a day (well, modulo the fact I think it can be called from an interrupt handler; at least "git grep" says there's a link_set_dpms_off in link_dp_irq_handler.c).

> Stuff like that is not a valid justification for the change. Ville
> changes on the other hand tried to prevent lock contention which is a
> valid goal here.

Okay, so let's land his patchset! (assuming it's ready to go in). Ville, is that something you'd want to resend for review?

Note, a point that I don't think has been brought up yet is that the system unbound workqueue doesn't run with real-time priority. Given that the lion's share of mutter's drmModeAtomicCommit calls are nonblock, and so are using the system unbound workqueue for handling the commits, even without this patch, that somewhat diminishes the utility of using a realtime thread anyway. I believe the original point of the realtime thread was to make sure mouse cursor motion updates are as responsive as possible, because any latency there is something the user really feels. Maybe there needs to be a different mechanism in place to make sure user-perceived interactivity is given priority.

--Ray
On 10/6/23 20:48, Ray Strode wrote: > > Note, a point that I don't think has been brought up yet, too, is > the system unbound workqueue doesn't run with real time priority. > Given the lion's share of mutter's drmModeAtomicCommit calls are > nonblock, and so are using the system unbound workqueue for handling > the commits, even without this patch, that somewhat diminishes the > utility of using a realtime thread anyway. I believe the original > point of the realtime thread was to make sure mouse cursor motion > updates are as responsive as possible, because any latency there is > something the user really feels. Mutter's KMS thread needs to be real-time so that it can reliably schedule its work building up to calling the atomic commit ioctl for minimal input → output latency. That some of the ioctl's own work doesn't run at the same priority doesn't necessarily matter for this, as long as it can hit the next vertical blank period. BTW, I understand kwin has or is planning to get a real-time KMS thread as well, for the same reasons. > Maybe there needs to be a different mechanism in place to make sure > user perceived interactivity is given priority. The only alternative I'm aware of having been discussed so far is allowing atomic commits to be amended. I don't think that would be a great solution for this issue though, as it would result in Wayland compositors wasting CPU cycles (in other words, energy) for constant amendments of atomic commits, in the hope that one of them results in good latency.
Am 06.10.23 um 20:48 schrieb Ray Strode:
> Hi,
>
> On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote:
>> When the operation busy waits then that *should* get accounted to the
>> CPU time of the current process. When the operation sleeps and waits for
>> some interrupt for example it should not get accounted.
>> What you suggest is to put the parts of the operation which busy wait
>> into a background task and then sleep for that task to complete. This is
>> not a solution to the problem, but just hides it.
> Actually, I think we both probably agree that there shouldn't be long
> busy waits in the context of the current process. After all, we both
> agree what the AMD DC driver code is doing is wrong.
>
> To be clear, my take is, if driver code is running in process context
> and needs to wait for periods of time on the order of or in excess of
> a typical process time slice it should be sleeping during the waiting.
> If the operation is at a point where it can be cancelled without side
> effects, the sleeping should be INTERRUPTIBLE. If it's past the point
> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my
> opinion, should kernel code busy block a typical process for dozens of
> milliseconds while keeping the process RUNNING. I don't think this is
> a controversial take.

Exactly that's what I completely disagree on.

When the driver is burning CPU cycles on behalf of a process then those CPU cycles should be accounted to the process causing this.

That the driver should probably improve its behavior is a different issue.

> Actually, I think (maybe?) you might even agree with that, but you're
> also saying: user space processes aren't special here. While it's not
> okay to busy block them, it's also not okay to busy block on the
> system unbound workqueue either. If that's your sentiment, I don't
> disagree with it.

No, it's absolutely ok to busy block them; it's just not nice to do so.
As Daniel pointed out this behavior is not incorrect at all. The DRM subsystem doesn't make any guarantee that drmModeAtomicCommit() will not burn CPU cycles. > > So I think we both agree the busy waiting is a problem, but maybe we > disagree on the best place for the problem to manifest when it > happens. > > One thought re the DC code is regardless of where the code is running, > the scheduler is going to forcefully preempt it at some point right? > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > much bigger than 1us delay by the kernel if the loop goes on long > enough. I'm not wrong about that? if that's true, the code might as > well switch out the udelay(1) for a usleep(1) and call it a day (well > modulo the fact I think it can be called from an interrupt handler; at > least "git grep" says there's a link_set_dpms_off in > link_dp_irq_handler.c) > >> Stuff like that is not a valid justification for the change. Ville >> changes on the other hand tried to prevent lock contention which is a >> valid goal here. > Okay so let's land his patchset! (assuming it's ready to go in). > Ville, is that something you'd want to resend for review? Well, while Ville patch has at least some justification I would still strongly object to move the work into a background thread to prevent userspace from being accounted for the work it causes. Regards, Christian. > > Note, a point that I don't think has been brought up yet, too, is the > system unbound workqueue doesn't run with real time priority. Given > the lion's share of mutter's drmModeAtomicCommit calls are nonblock, > and so are using the system unbound workqueue for handling the > commits, even without this patch, that somewhat diminishes the utility > of using a realtime thread anyway. I believe the original point of the > realtime thread was to make sure mouse cursor motion updates are as > responsive as possible, because any latency there is something the > user really feels. 
> Maybe there needs to be a different mechanism in
> place to make sure user perceived interactivity is given priority.
>
> --Ray
On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: > Am 06.10.23 um 20:48 schrieb Ray Strode: > > Hi, > > > > On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > >> When the operation busy waits then that *should* get accounted to the > >> CPU time of the current process. When the operation sleeps and waits for > >> some interrupt for example it should not get accounted. > >> What you suggest is to put the parts of the operation which busy wait > >> into a background task and then sleep for that task to complete. This is > >> not a solution to the problem, but just hides it. > > Actually, I think we both probably agree that there shouldn't be long > > busy waits in the context of the current process. After all, we both > > agree what the AMD DC driver code is doing is wrong. > > > > To be clear, my take is, if driver code is running in process context > > and needs to wait for periods of time on the order of or in excess of > > a typical process time slice it should be sleeping during the waiting. > > If the operation is at a point where it can be cancelled without side > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > opinion, should kernel code busy block a typical process for dozens of > > milliseconds while keeping the process RUNNING. I don't think this is > > a controversial take. > > Exactly that's what I completely disagree on. > > When the driver is burning CPU cycles on behalves of a process then > those CPU cycles should be accounted to the process causing this. > > That the driver should probably improve it's behavior is a different issue. > > > Actually, I think (maybe?) you might even agree with that, but you're > > also saying: user space processes aren't special here. While it's not > > okay to busy block them, it's also not okay to busy block on the > > system unbound workqueue either. 
If that's your sentiment, I don't > > disagree with it. > > No, it's absolutely ok to busy block them it's just not nice to do so. > > As Daniel pointed out this behavior is not incorrect at all. The DRM > subsystem doesn't make any guarantee that drmModeAtomicCommit() will not > burn CPU cycles. > > > > > So I think we both agree the busy waiting is a problem, but maybe we > > disagree on the best place for the problem to manifest when it > > happens. > > > > One thought re the DC code is regardless of where the code is running, > > the scheduler is going to forcefully preempt it at some point right? > > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > > much bigger than 1us delay by the kernel if the loop goes on long > > enough. I'm not wrong about that? if that's true, the code might as > > well switch out the udelay(1) for a usleep(1) and call it a day (well > > modulo the fact I think it can be called from an interrupt handler; at > > least "git grep" says there's a link_set_dpms_off in > > link_dp_irq_handler.c) > > > >> Stuff like that is not a valid justification for the change. Ville > >> changes on the other hand tried to prevent lock contention which is a > >> valid goal here. > > Okay so let's land his patchset! (assuming it's ready to go in). > > Ville, is that something you'd want to resend for review? > > Well, while Ville patch has at least some justification I would still > strongly object to move the work into a background thread to prevent > userspace from being accounted for the work it causes. Aren't most wayland compositors using nonblocking commits anyway? If so they would already be bypassing proper CPU time accounting. Not saying we shouldn't try to fix that, but just pointing out that it already is an issue with nonblocking commits.
Am 09.10.23 um 14:19 schrieb Ville Syrjälä: > On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: >> Am 06.10.23 um 20:48 schrieb Ray Strode: >>> Hi, >>> >>> On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: >>>> When the operation busy waits then that *should* get accounted to the >>>> CPU time of the current process. When the operation sleeps and waits for >>>> some interrupt for example it should not get accounted. >>>> What you suggest is to put the parts of the operation which busy wait >>>> into a background task and then sleep for that task to complete. This is >>>> not a solution to the problem, but just hides it. >>> Actually, I think we both probably agree that there shouldn't be long >>> busy waits in the context of the current process. After all, we both >>> agree what the AMD DC driver code is doing is wrong. >>> >>> To be clear, my take is, if driver code is running in process context >>> and needs to wait for periods of time on the order of or in excess of >>> a typical process time slice it should be sleeping during the waiting. >>> If the operation is at a point where it can be cancelled without side >>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point >>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my >>> opinion, should kernel code busy block a typical process for dozens of >>> milliseconds while keeping the process RUNNING. I don't think this is >>> a controversial take. >> Exactly that's what I completely disagree on. >> >> When the driver is burning CPU cycles on behalves of a process then >> those CPU cycles should be accounted to the process causing this. >> >> That the driver should probably improve it's behavior is a different issue. >> >>> Actually, I think (maybe?) you might even agree with that, but you're >>> also saying: user space processes aren't special here. 
While it's not >>> okay to busy block them, it's also not okay to busy block on the >>> system unbound workqueue either. If that's your sentiment, I don't >>> disagree with it. >> No, it's absolutely ok to busy block them it's just not nice to do so. >> >> As Daniel pointed out this behavior is not incorrect at all. The DRM >> subsystem doesn't make any guarantee that drmModeAtomicCommit() will not >> burn CPU cycles. >> >>> So I think we both agree the busy waiting is a problem, but maybe we >>> disagree on the best place for the problem to manifest when it >>> happens. >>> >>> One thought re the DC code is regardless of where the code is running, >>> the scheduler is going to forcefully preempt it at some point right? >>> Any writereg/udelay(1)/readreg loop is going to get disrupted by a >>> much bigger than 1us delay by the kernel if the loop goes on long >>> enough. I'm not wrong about that? if that's true, the code might as >>> well switch out the udelay(1) for a usleep(1) and call it a day (well >>> modulo the fact I think it can be called from an interrupt handler; at >>> least "git grep" says there's a link_set_dpms_off in >>> link_dp_irq_handler.c) >>> >>>> Stuff like that is not a valid justification for the change. Ville >>>> changes on the other hand tried to prevent lock contention which is a >>>> valid goal here. >>> Okay so let's land his patchset! (assuming it's ready to go in). >>> Ville, is that something you'd want to resend for review? >> Well, while Ville patch has at least some justification I would still >> strongly object to move the work into a background thread to prevent >> userspace from being accounted for the work it causes. > Aren't most wayland compositors using nonblocking commits anyway? > If so they would already be bypassing proper CPU time accounting. > Not saying we shouldn't try to fix that, but just pointing out that > it already is an issue with nonblocking commits. 
That's a rather good argument, but for async operations background work is simply a necessity, because you otherwise can't implement them.

The key point here is that the patch puts the work into the background just to avoid having it accounted to the thread issuing it, and that in turn is not valid as far as I can see.

Regards,
Christian.
On Thu, Oct 05, 2023 at 01:16:27PM +0300, Ville Syrjälä wrote: > On Thu, Oct 05, 2023 at 11:57:41AM +0200, Daniel Vetter wrote: > > On Tue, Sep 26, 2023 at 01:05:49PM -0400, Ray Strode wrote: > > > From: Ray Strode <rstrode@redhat.com> > > > > > > A drm atomic commit can be quite slow on some hardware. It can lead > > > to a lengthy queue of commands that need to get processed and waited > > > on before control can go back to user space. > > > > > > If user space is a real-time thread, that delay can have severe > > > consequences, leading to the process getting killed for exceeding > > > rlimits. > > > > > > This commit addresses the problem by always running the slow part of > > > a commit on a workqueue, separated from the task initiating the > > > commit. > > > > > > This change makes the nonblocking and blocking paths work in the same way, > > > and as a result allows the task to sleep and not use up its > > > RLIMIT_RTTIME allocation. > > > > > > Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2861 > > > Signed-off-by: Ray Strode <rstrode@redhat.com> > > > > So imo the trouble with this is that we suddenly start to make > > realtime/cpu usage guarantees in the atomic ioctl. That's a _huge_ uapi > > change, because even limited to the case of !ALLOW_MODESET we do best > > effort guarantees at best. And some drivers (again amd's dc) spend a ton > > of cpu time recomputing state even for pure plane changes without any crtc > > changes like dpms on/off (at least I remember some bug reports about > > that). And that state recomputation has to happen synchronously, because > > it always influences the ioctl errno return value. > > > > My take is that you're papering over a performance problem here of the > > "the driver is too slow/wastes too much cpu time". We should fix the > > driver, if that's possible. > > > > Another option would be if userspace drops realtime priorities for these > > known-slow operations. 
And right now _all_ kms operations are potentially > > cpu and real-time wasters, the entire uapi is best effort. > > > > We can also try to change the atomic uapi to give some hard real-time > > guarantees so that running compositors as SCHED_RT is possible, but that > > - means a very serious stream of bugs to fix all over > > - therefore needs some very wide buy-in from drivers that they're willing > > to make this guarantee > > - probably needs some really carefully carved out limitations, because > > there's imo flat-out no way we'll make all atomic ioctl hard time limit > > bound > > > > Also, as König has pointed out, you can roll this duct-tape out in > > userspace by making the commit non-blocking and immediately waiting for > > the fences. > > > > One thing I didn't see mention is that there's a very subtle uapi > > difference between non-blocking and blocking: > > - non-blocking is not allowed to get ahead of the previous commit, and > > will return EBUSY in that case. See the comment in > > drm_atomic_helper_commit() > > - blocking otoh will just block until any previous pending commit has > > finished > > > > Not taking that into account in your patch here breaks uapi because > > userspace will suddenly get EBUSY when they don't expect that. > > The -EBUSY logic already checks whether the current commit is > non-blocking vs. blocking commit, so I don't see how there would > be any change in behaviour from simply stuffing the commit_tail > onto a workqueue, especially as the locks will be still held across > the flush. Hm right, I forgot the patch context when I was chasing the EBUSY logic, I thought it just pushed a nonblocking commit in somehow. > In my earlier series [1] where I move the flush to happen after dropping > the locks there is a far more subtle issue because currently even > non-blocking commits can actually block due to the mutex. Changing > that might break something, so I preserved that behaviour explicitly. 
> Full explanation in the first patch there. > > [1] https://patchwork.freedesktop.org/series/108668/ Yeah there's a can of tricky details here for sure ... -Sima
On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > Am 09.10.23 um 14:19 schrieb Ville Syrjälä: > > On Mon, Oct 09, 2023 at 08:42:24AM +0200, Christian König wrote: > > > Am 06.10.23 um 20:48 schrieb Ray Strode: > > > > Hi, > > > > > > > > On Fri, Oct 6, 2023 at 3:12 AM Christian König <christian.koenig@amd.com> wrote: > > > > > When the operation busy waits then that *should* get accounted to the > > > > > CPU time of the current process. When the operation sleeps and waits for > > > > > some interrupt for example it should not get accounted. > > > > > What you suggest is to put the parts of the operation which busy wait > > > > > into a background task and then sleep for that task to complete. This is > > > > > not a solution to the problem, but just hides it. > > > > Actually, I think we both probably agree that there shouldn't be long > > > > busy waits in the context of the current process. After all, we both > > > > agree what the AMD DC driver code is doing is wrong. > > > > > > > > To be clear, my take is, if driver code is running in process context > > > > and needs to wait for periods of time on the order of or in excess of > > > > a typical process time slice it should be sleeping during the waiting. > > > > If the operation is at a point where it can be cancelled without side > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > opinion, should kernel code busy block a typical process for dozens of > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > a controversial take. > > > Exactly that's what I completely disagree on. > > > > > > When the driver is burning CPU cycles on behalf of a process then > > > those CPU cycles should be accounted to the process causing this. > > > > > > That the driver should probably improve its behavior is a different issue. > > > > > > > Actually, I think (maybe?) 
you might even agree with that, but you're > > > > also saying: user space processes aren't special here. While it's not > > > > okay to busy block them, it's also not okay to busy block on the > > > > system unbound workqueue either. If that's your sentiment, I don't > > > > disagree with it. > > > No, it's absolutely ok to busy block them it's just not nice to do so. > > > > > > As Daniel pointed out this behavior is not incorrect at all. The DRM > > > subsystem doesn't make any guarantee that drmModeAtomicCommit() will not > > > burn CPU cycles. > > > > > > > So I think we both agree the busy waiting is a problem, but maybe we > > > > disagree on the best place for the problem to manifest when it > > > > happens. > > > > > > > > One thought re the DC code is regardless of where the code is running, > > > > the scheduler is going to forcefully preempt it at some point right? > > > > Any writereg/udelay(1)/readreg loop is going to get disrupted by a > > > > much bigger than 1us delay by the kernel if the loop goes on long > > > > enough. I'm not wrong about that? if that's true, the code might as > > > > well switch out the udelay(1) for a usleep(1) and call it a day (well > > > > modulo the fact I think it can be called from an interrupt handler; at > > > > least "git grep" says there's a link_set_dpms_off in > > > > link_dp_irq_handler.c) > > > > > > > > > Stuff like that is not a valid justification for the change. Ville > > > > > changes on the other hand tried to prevent lock contention which is a > > > > > valid goal here. > > > > Okay so let's land his patchset! (assuming it's ready to go in). > > > > Ville, is that something you'd want to resend for review? > > > Well, while Ville patch has at least some justification I would still > > > strongly object to move the work into a background thread to prevent > > > userspace from being accounted for the work it causes. > > Aren't most wayland compositors using nonblocking commits anyway? 
> > If so they would already be bypassing proper CPU time accounting. > > Not saying we shouldn't try to fix that, but just pointing out that > > it already is an issue with nonblocking commits. > > That's a rather good argument, but for async operations background work is > simply a necessity because you otherwise can't implement them. Yeah I don't think we can use "we need to properly account" stuff to reject this, because we don't. Also we already do fail with inheriting the priority properly, so another nail in the "kms is best effort". > The key point here is that the patch puts the work into the background just > to avoid that it is accounted to the thread issuing it, and that in turn is > not valid as far as I can see. Yeah it's that aspect I'm really worried about, because we essentially start to support some guarantees that a) most drivers can't uphold without a huge amount of work, some of the DC state recomputations are _really_ expensive b) without actually making the semantics clear, it's just duct-tape. Yes compositors want to run kms in real-time, and yes that results in fun if you try to strictly account for cpu time spent. Especially if your policy is to just nuke the real time thread instead of demoting it to SCHED_NORMAL for a time. I think if we want more than hacks here we need to answer two questions: - which parts of the kms api are real time - what exactly do we guarantee with that And that answer probably needs to include things like the real time thread workers for cursor and other nonblocking commits. -Sima
Hi, On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > > > > To be clear, my take is, if driver code is running in process context > > > > and needs to wait for periods of time on the order of or in excess of > > > > a typical process time slice it should be sleeping during the waiting. > > > > If the operation is at a point where it can be cancelled without side > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > opinion, should kernel code busy block a typical process for dozens of > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > a controversial take. > > > Exactly that's what I completely disagree on. Okay if we can't agree that it's not okay for user space (or the kernel running in the context of user space) to busy loop a cpu core at 100% utilization throughout and beyond the process's entire scheduled time slice then we really are at an impasse. I gotta say i'm astonished that this seemingly indefensible behavior is somehow a point of contention, but I'm not going to keep arguing about it beyond this email. I mean we're not talking about scientific computing, or code compilation, or seti@home. We're talking about nearly the equivalent of `while (1) __asm__ ("nop");` > > The key point here is that the patch puts the work into the background just > > to avoid that it is accounted to the thread issuing it, and that in turn is > > not valid as far as I can see. > > Yeah it's that aspect I'm really worried about, because we essentially > start to support some gurantees that a) most drivers can't uphold without > a huge amount of work, some of the DC state recomputations are _really_ > expensive b) without actually making the semantics clear, it's just > duct-tape. 
If DC plane state computation (or whatever) is really taking 50ms or 200ms, then it probably should be done in chunks so it doesn't get preempted at an inopportune point? Look, this is not my wheelhouse, this is your wheelhouse, and I don't want to keep debating forever. It seems there is a discrepancy between our understandings of implied acceptable behavior. > Yes compositors want to run kms in real-time, and yes that results in fun > if you try to strictly account for cpu time spent. Especially if your > policy is to just nuke the real time thread instead of demoting it to > SCHED_NORMAL for a time. So I ended up going with this suggestion for blocking modesets: https://gitlab.gnome.org/GNOME/mutter/-/commit/5d3e31a49968fc0da04e98c0f9d624ea5095c9e0 But *this* feels like duct tape: You've already said there's no guarantee the problem won't also happen during preliminary computation during non-blocking commits or via other drm entry points. So it really does seem like a fix that won't age well. I won't be surprised if in ~3 years (or whatever) in some RHEL release there's a customer bug leading to the real-time thread getting blocklisted for obscure server display hardware because it's causing the session to tank on a production machine. > I think if we want more than hacks here we need to answer two questions: > - which parts of the kms api are real time > - what exactly do we guarantee with that imo, this isn't just about real-time versus non-real-time. It's no more acceptable for non-real-time mutter to be using 100% CPU doing busywaits than it is for real-time mutter to be using 100% cpu doing busywaits. Also, both you and Christian have suggested using the non-blocking modeset api with a fence fd to poll on is equivalent to the blocking api flushing the commit_tail work before returning from the ioctl, but that's not actually true. 
I think we all now agree the EBUSY problem you mentioned as an issue with my proposed patch wasn't actually a problem for blocking commits, but that very same issue is a problem with the non-blocking commits that then block on a fence fd, right? I guess we'd need to block on a fence fd from the prior non-blocking commit first before starting the blocking commit (or something) --Ray
On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: > Hi, > > On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > > > > > To be clear, my take is, if driver code is running in process context > > > > > and needs to wait for periods of time on the order of or in excess of > > > > > a typical process time slice it should be sleeping during the waiting. > > > > > If the operation is at a point where it can be cancelled without side > > > > > effects, the sleeping should be INTERRUPTIBLE. If it's past the point > > > > > of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > > > > > opinion, should kernel code busy block a typical process for dozens of > > > > > milliseconds while keeping the process RUNNING. I don't think this is > > > > > a controversial take. > > > > Exactly that's what I completely disagree on. > > Okay if we can't agree that it's not okay for user space (or the > kernel running in the context of user space) to busy loop a cpu core > at 100% utilization throughout and beyond the process's entire > scheduled time slice then we really are at an impasse. I gotta say i'm > astonished that this seemingly indefensible behavior is somehow a > point of contention, but I'm not going to keep arguing about it beyond > this email. > > I mean we're not talking about scientific computing, or code > compilation, or seti@home. We're talking about nearly the equivalent > of `while (1) __asm__ ("nop");` I don't think anyone said this shouldn't be fixed or improved. What I'm saying is that the atomic ioctl is not going to make guarantees that it will not take up too much cpu time (for some extremely vague value of "too much") to the point that userspace can configure its compositor in a way that it _will_ get killed if we _ever_ violate this rule. We should of course try to do as good a job as possible, but that's not what you're asking for. 
You're asking for a hard real-time guarantee with the implication if we ever break it, it's a regression, and the kernel has to bend over backwards with tricks like in your patch to make it work. It's that hard real time guarantee of "the kernel will _never_ violate this cpu time bound" that people are objecting against. Fixing the worst offenders I don't think will see any pushback at all. > > > The key point here is that the patch puts the work into the background just > > > to avoid that it is accounted to the thread issuing it, and that in turn is > > > not valid as far as I can see. > > > > Yeah it's that aspect I'm really worried about, because we essentially > > start to support some gurantees that a) most drivers can't uphold without > > a huge amount of work, some of the DC state recomputations are _really_ > > expensive b) without actually making the semantics clear, it's just > > duct-tape. > > If DC plane state computation (or whatever) is really taking 50ms or > 200ms, then it probably should be done in chunks so it doesn't get > preempted at an inopportune point? Look, this is not my wheelhouse, > this is your wheelhouse, and I don't want to keep debating forever. It > seems there is a discrepancy between our understandings of implied > acceptable behavior. > > > Yes compositors want to run kms in real-time, and yes that results in fun > > if you try to strictly account for cpu time spent. Especially if your > > policy is to just nuke the real time thread instead of demoting it to > > SCHED_NORMAL for a time. > > So I ended up going with this suggestion for blocking modesets: > > https://gitlab.gnome.org/GNOME/mutter/-/commit/5d3e31a49968fc0da04e98c0f9d624ea5095c9e0 > > But *this* feels like duct tape: You've already said there's no > guarantee the problem won't also happen during preliminary computation > during non-blocking commits or via other drm entry points. So it > really does seem like a fix that won't age well. 
I won't be surprised > if in ~3 years (or whatever) in some RHEL release there's a customer > bug leading to the real-time thread getting blocklisted for obscure > server display hardware because it's causing the session to tank on a > production machine. Yes the atomic ioctl makes no guarantees across drivers and hw platforms of guaranteed "we will never violate, you can kill your compositor if you do" cpu bound limits. We'll try to not suck too badly, and we'll try to focus on fixing the suck where it upsets the people the most. But there is fundamentally no hard real time guarantee in atomic. At least not with the current uapi. > > I think if we want more than hacks here we need to answer two questions: > > - which parts of the kms api are real time > > - what exactly do we guarantee with that > > imo, this isn't just about real-time versus non-real-time. It's no > more acceptable for non-real-time mutter to be using 100% CPU doing > busywaits than it is for real-time mutter to be using 100% cpu doing > busywaits. The problem isn't about wasting cpu time. It's about you having set up the system in a way so that mutter gets killed if we ever waste too much cpu time, ever. Which moves this from a soft real time "we'll try really hard to not suck" to a hard real time "there will be obvious functional regression reports if we even violate this once" guarantee. The former is absolutely fine, and within the practical limits (writing kms drivers is hard and takes a lot of time), we'll do the best we can. The latter is flat out not a guarantee the kernel currently makes, and I see no practically feasible way to make that happen. And pretending we do make this guarantee will just result in frustrated users filing bugs that their desktop session died and developers closing them as "too hard to fix". 
> Also, both you and Christian have suggested using the non-blocking > modeset api with a fence fd to poll on is equivalent to the blocking > api flushing the commit_tail work before returning from the ioctl, but > that's not actually true. I think we all now agree the EBUSY problem > you mentioned as an issue with my proposed patch wasn't actually a > problem for blocking commits, but that very same issue is a problem > with the non-blocking commits that then block on a fence fd, right? I > guess we'd need to block on a fence fd from the prior non-blocking > commit first before starting the blocking commit (or something) Yeah there's a bunch of issues around the current nonblocking modeset uapi semantics that make it not so useful. There was a long irc discussion last few days about all the various aspects, it's quite tricky. Maybe as a bit more useful outcome of this entire discussion: Could you sign up to write a documentation patch for the atomic ioctl that adds a paragraph stating extremely clearly that this ioctl does not make hard real time guarantees, but only best effort soft realtime, and even then priority of the effort is focused entirely on the !ALLOW_MODESET case, because that is the one users care about the most. Cheers, Sima
On 10/13/23 11:41, Daniel Vetter wrote: > On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: >> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: >>>>>> To be clear, my take is, if driver code is running in process context >>>>>> and needs to wait for periods of time on the order of or in excess of >>>>>> a typical process time slice it should be sleeping during the waiting. >>>>>> If the operation is at a point where it can be cancelled without side >>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point >>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my >>>>>> opinion, should kernel code busy block a typical process for dozens of >>>>>> milliseconds while keeping the process RUNNING. I don't think this is >>>>>> a controversial take. >>>>> Exactly that's what I completely disagree on. >> >> Okay if we can't agree that it's not okay for user space (or the >> kernel running in the context of user space) to busy loop a cpu core >> at 100% utilization throughout and beyond the process's entire >> scheduled time slice then we really are at an impasse. I gotta say i'm >> astonished that this seemingly indefensible behavior is somehow a >> point of contention, but I'm not going to keep arguing about it beyond >> this email. >> >> I mean we're not talking about scientific computing, or code >> compilation, or seti@home. We're talking about nearly the equivalent >> of `while (1) __asm__ ("nop");` > > I don't think anyone said this shouldn't be fixed or improved. > > What I'm saying is that the atomic ioctl is not going to make guarantees > that it will not take up to much cpu time (for some extremely vague value > of "too much") to the point that userspace can configure it's compositor > in a way that it _will_ get killed if we _ever_ violate this rule. > > We should of course try to do as good as job as possible, but that's not > what you're asking for. 
You're asking for a hard real-time guarantee with > the implication if we ever break it, it's a regression, and the kernel has > to bend over backwards with tricks like in your patch to make it work. I don't think mutter really needs or wants such a hard real-time guarantee. What it needs is a fighting chance to react before the kernel kills its process. The intended mechanism for this is SIGXCPU, but that can't work if the kernel is stuck in a busy-loop. Ray's patch seems like one way to avoid that. That said, as long as SIGXCPU can work as intended with the non-blocking commits mutter uses for everything except modesets, mutter's workaround of dropping RT priority for the blocking commits seems acceptable for the time being.
Hi On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > I mean we're not talking about scientific computing, or code > > compilation, or seti@home. We're talking about nearly the equivalent > > of `while (1) __asm__ ("nop");` > > I don't think anyone said this shouldn't be fixed or improved. Yea but we apparently disagree that it would be an improvement to stop discrediting user space for driver problems. You see, to me, there are two problems 1) The behavior itself is not nice and should be fixed 2) The blame/accounting/attribution for a driver problem is getting directed to user space. I think both issues should be fixed. One brought the other issue to light, but both problems, in my mind, are legitimate and should be addressed. You think fixing the second problem is some tactic to paper over the first problem. Christian thinks the second problem isn't a problem at all but the correct design. So none of us are completely aligned and I don't see anyone changing their mind anytime soon. > What I'm saying is that the atomic ioctl is not going to make guarantees > that it will not take up to much cpu time (for some extremely vague value > of "too much") to the point that userspace can configure it's compositor > in a way that it _will_ get killed if we _ever_ violate this rule. yea I find that strange, all kernel tasks have a certain implied baseline responsibility to be good citizens to the system. And how user space is configured seems nearly immaterial. But we're talking in circles. > Fixing the worst offenders I don't think will see any pushback at all. Yea we all agree fixing this one busy loop is a problem but we don't agree on where the cpu time blame should go. > > But *this* feels like duct tape: You've already said there's no > > guarantee the problem won't also happen during preliminary computation > > during non-blocking commits or via other drm entry points. So it > > really does seem like a fix that won't age well. 
I won't be surprised > > if in ~3 years (or whatever) in some RHEL release there's a customer > > bug leading to the real-time thread getting blocklisted for obscure > > server display hardware because it's causing the session to tank on a > > production machine. > > Yes the atomic ioctl makes no guarantees across drivers and hw platforms > of guaranteed "we will never violate, you can kill your compositor if you > do" cpu bound limits. We'll try to not suck too badly, and we'll try to > focus on fixing the suck where it upsets the people the most. > > But there is fundamentally no hard real time guarantee in atomic. At least > not with the current uapi. So in your mind mutter should get out of the real-time thread business entirely? > The problem isn't about wasting cpu time. It's about you having set up the > system in a way so that mutter gets killed if we ever waste too much cpu > time, ever. mutter is not set up in a way to kill itself if the driver wastes too much cpu time, ever. mutter is set up in a way to kill itself if it, itself, wastes too much cpu time, ever. The fact that the kernel is killing it when a kernel driver is wasting cpu time is the bug that led to the patch in the first place! > The latter is flat out not a guarantee the kernel currently makes, and I > see no practical feasible way to make that happen. And pretending we do > make this guarantee will just result in frustrated users filling bugs that > they desktop session died and developers closing them as "too hard to > fix". I think all three of us agree busy loops are not nice (though maybe we aren't completely aligned on severity). And I'll certainly concede that fixing all the busy loops can be hard. Some of the cpu-bound code paths may even be doing legitimate computation. I still assert that if the uapi calls into driver code that might potentially be doing something slow where it can't sleep, it should be doing it on a workqueue or thread. 
That seems orthogonal to fixing all the places where the drivers aren't acting nicely. > Maybe as a bit more useful outcome of this entire discussion: Could you > sign up to write a documentation patch for the atomic ioctl that adds a > paragraph stating extremely clearly that this ioctl does not make hard > real time guarantees, but only best effort soft realtime, and even then > priority of the effort is focused entirely on the !ALLOW_MODESET case, > because that is the one users care about the most. I don't think I'm the best positioned to write such documentation. I still think what the kernel is doing is wrong here and I don't even fully grok what you mean by best effort. --Ray
On Fri, Oct 13, 2023 at 12:22:52PM +0200, Michel Dänzer wrote: > On 10/13/23 11:41, Daniel Vetter wrote: > > On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote: > >> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote: > >>>>>> To be clear, my take is, if driver code is running in process context > >>>>>> and needs to wait for periods of time on the order of or in excess of > >>>>>> a typical process time slice it should be sleeping during the waiting. > >>>>>> If the operation is at a point where it can be cancelled without side > >>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point > >>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my > >>>>>> opinion, should kernel code busy block a typical process for dozens of > >>>>>> milliseconds while keeping the process RUNNING. I don't think this is > >>>>>> a controversial take. > >>>>> Exactly that's what I completely disagree on. > >> > >> Okay if we can't agree that it's not okay for user space (or the > >> kernel running in the context of user space) to busy loop a cpu core > >> at 100% utilization throughout and beyond the process's entire > >> scheduled time slice then we really are at an impasse. I gotta say i'm > >> astonished that this seemingly indefensible behavior is somehow a > >> point of contention, but I'm not going to keep arguing about it beyond > >> this email. > >> > >> I mean we're not talking about scientific computing, or code > >> compilation, or seti@home. We're talking about nearly the equivalent > >> of `while (1) __asm__ ("nop");` > > > > I don't think anyone said this shouldn't be fixed or improved. > > > > What I'm saying is that the atomic ioctl is not going to make guarantees > > that it will not take up to much cpu time (for some extremely vague value > > of "too much") to the point that userspace can configure it's compositor > > in a way that it _will_ get killed if we _ever_ violate this rule. 
> > > > We should of course try to do as good as job as possible, but that's not > > what you're asking for. You're asking for a hard real-time guarantee with > > the implication if we ever break it, it's a regression, and the kernel has > > to bend over backwards with tricks like in your patch to make it work. > > I don't think mutter really needs or wants such a hard real-time > guarantee. What it needs is a fighting chance to react before the kernel > kills its process. > > The intended mechanism for this is SIGXCPU, but that can't work if the > kernel is stuck in a busy-loop. Ray's patch seems like one way to avoid > that. I don't think signals will get us out of this either, or at least there is still risk. We are trying to make everything interruptible and bail out asap, but those checks are when we're blocking, not when the cpu is busy. So this also won't guarantee that you don't blow through your timeslice when a driver is doing a silly expensive atomic_check computation. Much less likely, but definitely not a zero chance. > That said, as long as SIGXCPU can work as intended with the non-blocking > commits mutter uses for everything except modesets, mutter's workaround > of dropping RT priority for the blocking commits seems acceptable for > the time being. Is there no rt setup where the kernel just auto-demotes when you've used up your slice? That's the only setup I see that guarantees you're not getting killed here. I think dropping rt priority around full modesets is still good since modesets really shouldn't ever run as rt, that makes no sense to me. -Sima
On Fri, Oct 13, 2023 at 10:04:02AM -0400, Ray Strode wrote: > Hi > > On Fri, Oct 13, 2023 at 5:41 AM Daniel Vetter <daniel@ffwll.ch> wrote: > > > I mean we're not talking about scientific computing, or code > > > compilation, or seti@home. We're talking about nearly the equivalent > > > of `while (1) __asm__ ("nop");` > > > > I don't think anyone said this shouldn't be fixed or improved. > > Yea but we apparently disagree that it would be an improvement to stop > discrediting user space for driver problems. > You see, to me, there are two problems 1) The behavior itself is not > nice and should be fixed 2) The blame/accounting/attribution for a > driver problem is getting directed to user space. I think both issues > should be fixed. One brought the other issue to light, but both > problems, in my mind, are legitimate and should be addressed. You > think fixing the second problem is some tactic to paper over the first > problem. Christian thinks the second problem isn't a problem at all > but the correct design. So none of us are completely aligned and I > don't see anyone changing their mind anytime soon. > > > What I'm saying is that the atomic ioctl is not going to make guarantees > > that it will not take up to much cpu time (for some extremely vague value > > of "too much") to the point that userspace can configure it's compositor > > in a way that it _will_ get killed if we _ever_ violate this rule. > yea I find that strange, all kernel tasks have a certain implied > baseline responsibility to be good citizens to the system. > And how user space is configured seems nearly immaterial. > > But we're talking in circles. > > > Fixing the worst offenders I don't think will see any pushback at all. > Yea we all agree fixing this one busy loop is a problem but we don't > agree on where the cpu time blame should go. 
> > > > But *this* feels like duct tape: You've already said there's no > > > guarantee the problem won't also happen during preliminary computation > > > during non-blocking commits or via other drm entry points. So it > > > really does seem like a fix that won't age well. I won't be surprised > > > if in ~3 years (or whatever) in some RHEL release there's a customer > > > bug leading to the real-time thread getting blocklisted for obscure > > > server display hardware because it's causing the session to tank on a > > > production machine. > > > > Yes the atomic ioctl makes no guarantees across drivers and hw platforms > > of guaranteed "we will never violate, you can kill your compositor if you > > do" cpu bound limits. We'll try to not suck too badly, and we'll try to > > focus on fixing the suck where it upsets the people the most. > > > > But there is fundamentally no hard real time guarantee in atomic. At least > > not with the current uapi. > > So in your mind mutter should get out of the real-time thread business entirely? Yes. At least the hard-real time "the kernel kills your process if you fail" business. Because desktop drawing just isn't hard real-time, nothing disastrous happens if we miss a deadline. Of course special setups where everything is very controlled might be different. > > The problem isn't about wasting cpu time. It's about you having set up the > > system in a way so that mutter gets killed if we ever waste too much cpu > > time, ever. > > mutter is not set up in a way to kill itself if the driver wastes too > much cpu time, > ever. mutter is set up in a way to kill itself if it, itself, wastes > too much cpu time, ever. > The fact that the kernel is killing it when a kernel driver is wasting > cpu time is the > bug that led to the patch in the first place! > > > The latter is flat out not a guarantee the kernel currently makes, and I > > see no practical feasible way to make that happen. 
And pretending we do > > make this guarantee will just result in frustrated users filling bugs that > > they desktop session died and developers closing them as "too hard to > > fix". > > I think all three of us agree busy loops are not nice (though maybe we > aren't completely aligned on severity). And I'll certainly concede > that fixing all the busy loops can be hard. Some of the cpu-bound code > paths may even be doing legitimate computation. I still assert that > if the uapi calls into driver code that might potentially be doing > something slow where it can't sleep, it should be doing it on a > workqueue or thread. That seems orthogonal to fixing all the places > where the drivers aren't acting nicely. Again no, because underlying this your requirement boils down to hard real time. And we just cannot guarantee that with atomic kms. Best effort, absolutely. Guaranteed to never fail, no way. > > Maybe as a bit more useful outcome of this entire discussion: Could you > > sign up to write a documentation patch for the atomic ioctl that adds a > > paragraph stating extremely clearly that this ioctl does not make hard > > real time guarantees, but only best effort soft realtime, and even then > > priority of the effort is focused entirely on the !ALLOW_MODESET case, > > because that is the one users care about the most. > > I don't think I'm the best positioned to write such documentation. I > still think what the kernel is doing is wrong here and I don't even > fully grok what you mean by best effort. It's the difference between hard real time and soft real time. atomic kms is a soft real time system, not hard real time. You've set up mutter in a hard real time way, and that just doesn't work. I guess I can type up some patch when I'm back from XDC, but if the difference between soft and hard real time isn't clear, then yeah you don't understand the point I've been trying to make in this thread. Cheers, Sima
On 17.10.23 at 09:32, Daniel Vetter wrote:
> On Fri, Oct 13, 2023 at 12:22:52PM +0200, Michel Dänzer wrote:
>> On 10/13/23 11:41, Daniel Vetter wrote:
>>> On Thu, Oct 12, 2023 at 02:19:41PM -0400, Ray Strode wrote:
>>>> On Mon, Oct 09, 2023 at 02:36:17PM +0200, Christian König wrote:
>>>>>>>> To be clear, my take is, if driver code is running in process context
>>>>>>>> and needs to wait for periods of time on the order of or in excess of
>>>>>>>> a typical process time slice it should be sleeping during the waiting.
>>>>>>>> If the operation is at a point where it can be cancelled without side
>>>>>>>> effects, the sleeping should be INTERRUPTIBLE. If it's past the point
>>>>>>>> of no return, sleeping should be UNINTERRUPTIBLE. At no point, in my
>>>>>>>> opinion, should kernel code busy block a typical process for dozens of
>>>>>>>> milliseconds while keeping the process RUNNING. I don't think this is
>>>>>>>> a controversial take.
>>>>>>> Exactly that's what I completely disagree on.
>>>> Okay, if we can't agree that it's not okay for user space (or the
>>>> kernel running in the context of user space) to busy loop a cpu core
>>>> at 100% utilization throughout and beyond the process's entire
>>>> scheduled time slice, then we really are at an impasse. I gotta say I'm
>>>> astonished that this seemingly indefensible behavior is somehow a
>>>> point of contention, but I'm not going to keep arguing about it beyond
>>>> this email.
>>>>
>>>> I mean we're not talking about scientific computing, or code
>>>> compilation, or seti@home. We're talking about nearly the equivalent
>>>> of `while (1) __asm__ ("nop");`
>>> I don't think anyone said this shouldn't be fixed or improved.
>>>
>>> What I'm saying is that the atomic ioctl is not going to make guarantees
>>> that it will not take up too much cpu time (for some extremely vague value
>>> of "too much"), to the point that userspace can configure its compositor
>>> in a way that it _will_ get killed if we _ever_ violate this rule.
>>>
>>> We should of course try to do as good a job as possible, but that's not
>>> what you're asking for. You're asking for a hard real-time guarantee with
>>> the implication that if we ever break it, it's a regression, and the kernel
>>> has to bend over backwards with tricks like in your patch to make it work.
>> I don't think mutter really needs or wants such a hard real-time
>> guarantee. What it needs is a fighting chance to react before the kernel
>> kills its process.
>>
>> The intended mechanism for this is SIGXCPU, but that can't work if the
>> kernel is stuck in a busy loop. Ray's patch seems like one way to avoid
>> that.
> I don't think signals will get us out of this either, or at least it still
> has risk. We are trying to make everything interruptible and bail out
> asap, but those checks happen when we're blocking, not when the cpu is busy.
>
> So this also won't guarantee that you expire your timeslice when a driver
> is doing a silly expensive atomic_check computation. Much less likely, but
> definitely not a zero chance.
>
>> That said, as long as SIGXCPU can work as intended with the non-blocking
>> commits mutter uses for everything except modesets, mutter's workaround
>> of dropping RT priority for the blocking commits seems acceptable for
>> the time being.
> Is there no rt setup where the kernel just auto-demotes when you've used
> up your slice? That's the only setup I see that guarantees you're not
> getting killed here.
>
> I think dropping rt priority around full modesets is still good, since
> modesets really shouldn't ever run as rt; that makes no sense to me.

Completely agree.
One more data point not mentioned before:

While amdgpu could be improved, we do have devices which (for example)
have to do I2C by bit banging because they lack the necessary
functionality in the hardware. And IIRC transmitting the 256-byte EDID
takes something like ~5ms (fast mode) or ~20ms (standard mode), during
which the CPU usually just busy loops most of the time.

Not saying that we should do a full EDID transmit with every modeset,
but just to give an example of what might be necessary here.

Christian.

> -Sima
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c
index 292e38eb6218..1a1e68d98d38 100644
--- a/drivers/gpu/drm/drm_atomic_helper.c
+++ b/drivers/gpu/drm/drm_atomic_helper.c
@@ -2028,64 +2028,63 @@ int drm_atomic_helper_commit(struct drm_device *dev,
 	 * This is the point of no return - everything below never fails except
 	 * when the hw goes bonghits. Which means we can commit the new state on
 	 * the software side now.
 	 */
 
 	ret = drm_atomic_helper_swap_state(state, true);
 	if (ret)
 		goto err;
 
 	/*
 	 * Everything below can be run asynchronously without the need to grab
 	 * any modeset locks at all under one condition: It must be guaranteed
 	 * that the asynchronous work has either been cancelled (if the driver
 	 * supports it, which at least requires that the framebuffers get
 	 * cleaned up with drm_atomic_helper_cleanup_planes()) or completed
 	 * before the new state gets committed on the software side with
 	 * drm_atomic_helper_swap_state().
 	 *
 	 * This scheme allows new atomic state updates to be prepared and
 	 * checked in parallel to the asynchronous completion of the previous
 	 * update. Which is important since compositors need to figure out the
 	 * composition of the next frame right after having submitted the
 	 * current layout.
 	 *
 	 * NOTE: Commit work has multiple phases, first hardware commit, then
 	 * cleanup. We want them to overlap, hence need system_unbound_wq to
 	 * make sure work items don't artificially stall on each another.
 	 */
 
 	drm_atomic_state_get(state);
-	if (nonblock)
-		queue_work(system_unbound_wq, &state->commit_work);
-	else
-		commit_tail(state);
+	queue_work(system_unbound_wq, &state->commit_work);
+	if (!nonblock)
+		flush_work(&state->commit_work);
 
 	return 0;
 
 err:
 	drm_atomic_helper_cleanup_planes(dev, state);
 	return ret;
 }
 EXPORT_SYMBOL(drm_atomic_helper_commit);
 
 /**
  * DOC: implementing nonblocking commit
  *
  * Nonblocking atomic commits should use struct &drm_crtc_commit to sequence
  * different operations against each another. Locks, especially struct
  * &drm_modeset_lock, should not be held in worker threads or any other
  * asynchronous context used to commit the hardware state.
  *
  * drm_atomic_helper_commit() implements the recommended sequence for
  * nonblocking commits, using drm_atomic_helper_setup_commit() internally:
  *
  * 1. Run drm_atomic_helper_prepare_planes(). Since this can fail and we
  * need to propagate out of memory/VRAM errors to userspace, it must be called
  * synchronously.
  *
  * 2. Synchronize with any outstanding nonblocking commit worker threads which
  * might be affected by the new state update. This is handled by
  * drm_atomic_helper_setup_commit().
  *
  * Asynchronous workers need to have sufficient parallelism to be able to run
  * different atomic commits on different CRTCs in parallel. The simplest way to