Message ID: 20250131225942.365475324@goodmis.org (mailing list archive)
State:      New
Series:     sched: Extended Scheduler Time Slice revisited
On Fri, Jan 31, 2025 at 05:58:38PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> This is to improve user space implemented spin locks or any critical
> section. It may also be extended for VMs and their guest spin locks as
> well, but that will come later.
>
> This adds a new field in the struct rseq called cr_counter. This is a 32 bit
> field where bit zero is a flag reserved for the kernel, and the other 31
> bits can be used as a counter (although the kernel doesn't care how they
> are used, as any bit set means the same).
>
> This works in tandem with PREEMPT_LAZY, where a task can tell the kernel
> via the rseq structure that it is in a critical section (like holding a
> spin lock) that it will be leaving very shortly, and to ask the kernel to
> not preempt it at the moment.

I still have full hate for this approach.
On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
>I still have full hate for this approach.
So what approach would you prefer?
-- Steve
On 2025-01-31 23:58, Steven Rostedt wrote:

[...]

> @@ -148,6 +160,18 @@ struct rseq {
>  	 */
>  	__u32 mm_cid;
>
> +	/*
> +	 * The cr_counter is a way for user space to inform the kernel that
> +	 * it is in a critical section. If bits 1-31 are set, then the
> +	 * kernel may grant the thread a bit more time (but there is no
> +	 * guarantee of how much time or if it is granted at all). If the
> +	 * kernel does grant the thread extra time, it will set bit 0 to
> +	 * inform user space that it has granted the thread more time and that
> +	 * user space should call yield() as soon as it leaves its critical

Does it specifically need to be yield(), or can it be just "entering
the kernel" through any system call or trap ?

[...]

> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 9de6e35fe679..b792e36a3550 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>  		force_sigsegv(sig);
>  }
>
> +bool rseq_delay_resched(void)
> +{
> +	struct task_struct *t = current;
> +	u32 flags;
> +
> +	if (!t->rseq)
> +		return false;
> +
> +	/* Make sure the cr_counter exists */
> +	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
> +		return false;

It would be clearer/more precise with < offsetofend(struct rseq, cr_counter)

Thanks,

Mathieu
On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
>
> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
>
> >I still have full hate for this approach.
>
> So what approach would you prefer?

The one that does not rely on the preemption method -- I think I posted
something along those lines, and someone else recently reposted something
based on it.

Tying things to the preemption method is absurdly bad design -- and I've
told you that before.
On Sat, 1 Feb 2025 19:11:29 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
> >
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > >I still have full hate for this approach.
> >
> > So what approach would you prefer?
>
> The one that does not rely on the preemption method -- I think I posted
> something along those lines, and someone else recently reposted something
> based on it.
>
> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.

How exactly is it "bad design"? Changing the preemption method itself
changes the way applications schedule and can be very noticeable to the
applications themselves.

With no preemption, applications will see high latency every time any
application does a system call. Voluntary preemption is a little more
reactive, but more randomly so. The preempt lazy kconfig has:

    This option provides a scheduler driven preemption model that is
    fundamentally similar to full preemption, but is less eager to
    preempt SCHED_NORMAL tasks in an attempt to reduce lock holder
    preemption and recover some of the performance gains seen from
    using Voluntary preemption.

This could be a config option called PREEMPT_USER_LAZY that extends the
"reduce lock holder preemption" idea to user space spin locks.

But if your issue is with relying on the preemption method, does that
mean you prefer to have this feature for any preemption method? That may
require still using the LAZY flag that can cause a schedule in the
kernel but not in user space.

Note, my group is actually more interested in implementing this for VMs,
but that requires another level of indirection of the pointers. That is,
qemu could create a device that shares memory between the guest kernel
and the qemu VCPU thread. The guest kernel could update the counter in
this shared memory before grabbing a raw_spin_lock, which would act like
this patch set does.

The difference would be that the counter would need to live in a memory
page that only has this information in it and not in the rseq structure
itself. Mathieu was concerned about leaks and corruption in the rseq
structure by a malicious guest. Thus, the counter would have to be in a
clean memory page that is shared between the guest and the qemu thread.
The rseq would then have a pointer to this memory, and the host kernel
would have to traverse that pointer to the location of the counter.

In other words, my real goal is to have this working for guests and
their raw_spin_locks. We first tried to do this in KVM directly, but the
KVM maintainers said this is more a generic scheduling issue and doesn't
belong in KVM. I agreed with them.

-- Steve
On Sat, 1 Feb 2025 15:35:13 +0100
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-01-31 23:58, Steven Rostedt wrote:
>
> [...]
>
> > + * kernel does grant the thread extra time, it will set bit 0 to
> > + * inform user space that it has granted the thread more time and that
> > + * user space should call yield() as soon as it leaves its critical
>
> Does it specifically need to be yield(), or can it be just "entering
> the kernel" through any system call or trap ?

No it doesn't need to be yield. That just seemed like the obvious
system call to use. But any system call would force a schedule. We
could just optimize yield() though.

> [...]
>
> > +bool rseq_delay_resched(void)
> > +{
> > +	struct task_struct *t = current;
> > +	u32 flags;
> > +
> > +	if (!t->rseq)
> > +		return false;
> > +
> > +	/* Make sure the cr_counter exists */
> > +	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
> > +		return false;
>
> It would be clearer/more precise with < offsetofend(struct rseq, cr_counter)

Ah yeah, thanks!

-- Steve
On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> No it doesn't need to be yield. That just seemed like the obvious
> system call to use. But any system call would force a schedule. We
> could just optimize yield() though.

No, "optimizing" yield is not on the table.

Why? Because it has *semantics* that people depend on because it has
historical meaning. Things like "move the process to the end of the
scheduling queue of that priority".

That may sound like a good thing, but it's ABSOLUTELY NOT what you
should actually do unless you know *exactly* what the system behavior
is.

For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that
a higher-priority process is ready to run. You haven't used up your
time slice yet, but something more important needs the CPU.

If you call "sched_yield()", sure, you'll run that higher priority
thing. So far so good.

But you *also* are literally telling the scheduler to put you at the
back of the queue, despite the fact that maybe you are still in line
to be run for *your* priority level. So now your performance will
absolutely suck, because you just told the scheduler that you are not
important, and other processes in your priority level should get
priority.

So no. "yield()" does not mean "just reschedule". It really means
"yield my position in the scheduling queue".

You are literally better off using absolutely *ANY* other system call.

The fact that you are confused about this kind of very basic issue
does imply that this patch should absolutely be handled by somebody
who knows the scheduler better.

           Linus
On Sat, 1 Feb 2025 at 15:18, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> But you *also* are literally telling the scheduler to put you at the
> back of the queue, despite the fact that maybe you are still in line
> to be run for *your* priority level. So now your performance will
> absolutely suck, because you just told the scheduler that you are not
> important, and other processes in your priority level should get
> priority.

Side note: we've messed with this before, exactly because people have
been confused about this before.

I have this dim memory of us having at one point decided to actively
ignore that "put me last" part of sched_yield because we had huge
performance problems with people misusing "sched_yield()".

I didn't actually check what current active schedulers do. I would not
be in the least surprised if different schedulers end up having very
different behavior (particularly any RT scheduler).

But the moral of the story ends up being the same: don't use yield()
unless you are on some embedded platform and know exactly what the
scheduling pattern is - or if you really want to say "I don't want to
run now, do *anything* else, my performance doesn't matter".

A traditional (reasonable) situation might be "I started another task
or thread, I need for it to run first and initialize things, I'm
polling for that to be done but I don't want to busy-loop if there is
real work to be done".

Where it really is a complete hack: "my performance doesn't matter
because it's a one-time startup thing, and I couldn't be arsed to have
real locking".

In fact, the whole "I couldn't be arsed" is basically the tag-line for
"yield()". Maybe we should rename the system call.

           Linus
On Sat, 1 Feb 2025 15:18:15 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, 1 Feb 2025 at 15:08, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > No it doesn't need to be yield. That just seemed like the obvious
> > system call to use. But any system call would force a schedule. We
> > could just optimize yield() though.
>
> No, "optimizing" yield is not on the table.

Note, I only mentioned this because Peter had it in the patch he
created when I first brought this up:

  https://lore.kernel.org/all/20231030132949.GA38123@noisy.programming.kicks-ass.net/

 SYSCALL_DEFINE0(sched_yield)
 {
+	if (current->rseq_sched_delay) {
+		trace_printk("yield -- made it\n");
+		schedule();
+		return 0;
+	}
+
 	do_sched_yield();
 	return 0;
 }

> Why? Because it has *semantics* that people depend on because it has
> historical meaning. Things like "move the process to the end of the
> scheduling queue of that priority".

Yes, I know the historical meaning. It's also a system call I've been
telling people to avoid for the last 20 years. I haven't seen or heard
about any application that actually depends on that behavior today.

> That may sound like a good thing, but it's ABSOLUTELY NOT what you
> should actually do unless you know *exactly* what the system behavior
> is.
>
> For example, maybe the reason the kernel set NEED_RESCHED_LAZY is that
> a higher-priority process is ready to run. You haven't used up your
> time slice yet, but something more important needs the CPU.

If it's RT, it would set NEED_RESCHED and not LAZY. This is only for
SCHED_OTHER. Sure, it could be a task with a negative nice value.

> So no. "yield()" does not mean "just reschedule". It really means
> "yield my position in the scheduling queue".

The optimization would only treat sched_yield() differently if the task
had asked for an extended time slice and it was granted. Like in
Peter's patch, if that was the case, it would just call schedule() and
return. This would not affect yield() for any other task not
implementing the rseq extend counter.

> You are literally better off using absolutely *ANY* other system call.
>
> The fact that you are confused about this kind of very basic issue
> does imply that this patch should absolutely be handled by somebody
> who knows the scheduler better.

I'm not confused. And before seeing Peter's use of yield(), I was
reluctant to use it for the very same reasons you mentioned above. In
my test programs, I was simply using getuid(), as that was one of the
quickest syscalls.

-- Steve
On Sat, 1 Feb 2025 15:35:53 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I didn't actually check what current active schedulers do. I would
> not be in the least surprised if different schedulers end up having
> very different behavior (particularly any RT scheduler).

The only real use of sched_yield() I have ever seen in practice was in
an RT application where all tasks were given the same RT FIFO priority,
and the application would use the yield() system call to trigger the
next task.

That's also the only use case that really does have a strictly defined
behavior for yield(). In SCHED_FIFO, as the name suggests, tasks run in
a first-in first-out order, and yield() will put the task at the end of
that queue. You can use this method to implement a strict application
defined scheduler.

-- Steve
On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote:
> And before seeing Peter's use of yield(), I was reluctant to use it for
> the very same reasons you mentioned above. In my test programs, I was
> simply using getuid(), as that was one of the quickest syscalls.

Is getuid() guaranteed to issue a syscall? It feels like the kind of
information that a tricksy libc could cache. Traditionally, I think
we've used getppid() as the canonical "very cheap syscall" as no layer
can cache that information.
On Sun, 2 Feb 2025 07:22:08 +0000
Matthew Wilcox <willy@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 10:22:08PM -0500, Steven Rostedt wrote:
> > And before seeing Peter's use of yield(), I was reluctant to use it
> > for the very same reasons you mentioned above. In my test programs,
> > I was simply using getuid(), as that was one of the quickest
> > syscalls.
>
> Is getuid() guaranteed to issue a syscall? It feels like the kind of
> information that a tricksy libc could cache. Traditionally, I think
> we've used getppid() as the canonical "very cheap syscall" as no layer
> can cache that information.

Maybe that was what I used. Can't remember. And I think I even open
coded it using syscall() to not even rely on glibc.

-- Steve
On Sat, Feb 01, 2025 at 06:06:17PM -0500, Steven Rostedt wrote:

[...]

> How exactly is it "bad design"? Changing the preemption method itself
> changes the way applications schedule and can be very noticeable to
> the applications themselves.

Lazy is not the default, nor even the recommended preemption method at
this time.

Lazy will not ever be the only preemption method; full isn't going
anywhere.

Lazy only applies to fair (and whatever bpf things end up using
resched_curr_lazy()).

Lazy works on tick granularity, which is variable per the HZ config,
and way too long for any of this nonsense.

So by tying this to lazy, you get something that doesn't actually work
most of the time, and when it works, it has variable and bad behaviour.

So yeah, crap.

This really isn't difficult to understand, and I've told you this
before.
On Mon, Feb 03, 2025 at 09:43:06AM +0100, Peter Zijlstra wrote:
> Lazy is not the default, nor even the recommended preemption method at
> this time.

Just to clarify this for people reading along: lazy is there for people
to identify performance 'issues' vs voluntary such that those can be
addressed. I'm not at all sure who is spending time on it, but it will
probably take a while.

Once it has been determined lazy is good enough -- as indicated by the
distros having picked it as a default -- we can look at deprecating and
removing voluntary (and eventually none).

Also, RT likes it :-)
On Mon, 3 Feb 2025 09:43:06 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> Lazy is not the default, nor even the recommended preemption method at
> this time.

That's OK. If it is considered to be the default in the future, this
can wait.

> Lazy will not ever be the only preemption method, full isn't going
> anywhere.

That's fine too, as full preemption has the same issue of preempting
kernel mutexes. Full preemption is for something that likely doesn't
want this feature anyway.

> Lazy only applies to fair (and whatever bpf things end up using
> resched_curr_lazy()).

Is that a problem? User spin locks for RT tasks are very dangerous. If
an RT task preempts the owner that is of lower priority, it can cause a
deadlock (if the two tasks are pinned to the same CPU). Which BTW,
Sebastian mentioned in the Stable RT meeting that glibc supplies a
pthread_spin_lock() and doesn't have anything in the man page about
this possible scenario.

> Lazy works on tick granularity, which is variable per the HZ config,
> and way too long for any of this nonsense.

Patch 2 changes that to do what you wrote the last time. It has a max
wait time of 50us.

> So by tying this to lazy, you get something that doesn't actually work
> most of the time, and when it works, it has variable and bad
> behaviour.

Um, no. If we wait for lazy to become the default behavior, it will
work most of the time. And when it does work, it has a strict bound of
50us.

> So yeah, crap.

As your rationale was not correct, I will disagree with this being
crap.

> This really isn't difficult to understand, and I've told you this
> before.

And I listened to what you told me before. Patch 2 implements the 50us
max that you suggested. I separated it out because it made the code
simpler to understand and debug. The change log even mentioned:

    For the moment, it lets it run for one more tick (which will be
    changed later).

That "changed later" is the second patch in this series.

With the "this can wait until lazy is default": we have an "upstream
first" policy. As long as there is some buy-in to the changes, we can
go ahead and implement it on our devices. We do not have to wait for it
to be accepted. But if there's a strong NAK to the idea, it is much
harder to get it implemented internally.

I would also implement a way for user space to know if it is supported
or not. Perhaps have the cr_counter of the rseq initialized to some
value that tells user space this is supported in the current
configuration of the kernel? This would make there be "no surprises".

Our current use case is actually for VMs, which requires a slightly
different method. Instead of having the cr_counter that is used for
telling the kernel the task is in a critical section, the rseq would
contain a pointer to some user space memory that has that counter. The
reason is that this memory would need to be mapped between the VM guest
kernel and the VM VCPU emulation thread. Mathieu did not want to allow
exposure of the VM VCPU thread's rseq structure to the VM guest kernel.
Having a separate memory map for that is more secure. Then the raw spin
locks of the guest VM kernel could be implemented using this method as
well. We do find performance issues when a VCPU of a guest kernel is
preempted while holding spin locks.

We can focus more on this VM use case, and we could then give better
benchmarks. But again, this depends on whether or not you intend on
NAKing this approach altogether.

-- Steve
On Tue, Feb 4, 2025 at 1:45 AM Steven Rostedt <rostedt@goodmis.org> wrote:

[...]

> With the "this can wait until lazy is default": we have an "upstream
> first" policy. As long as there is some buy-in to the changes, we can
> go ahead and implement it on our devices. We do not have to wait for
> it to be accepted. But if there's a strong NAK to the idea, it is much
> harder to get it implemented internally.

Can you explain why this approach requires PREEMPT_LAZY?

Could exit_to_user_mode_loop() be changed to something like the
following (with maybe some provision to only do it once)?

	if ((ti_work & _TIF_NEED_RESCHED) && !rseq_delay_resched())
		schedule();

I suppose there would also need to be some additional changes to make
sure full preemption also doesn't preempt, maybe in
preempt_schedule*().

-- Suleiman
On Tue, 4 Feb 2025 12:28:41 +0900
Suleiman Souhlal <suleiman@google.com> wrote:

> Can you explain why this approach requires PREEMPT_LAZY?
>
> Could exit_to_user_mode_loop() be changed to something like the
> following (with maybe some provision to only do it once)?
>
> 	if ((ti_work & _TIF_NEED_RESCHED) && !rseq_delay_resched())
> 		schedule();

The main reason is that we need to differentiate a preemption based on
a SCHED_OTHER scheduling tick from an RT task waking up. We should not
delay any RT task, ever.

If PREEMPT_LAZY becomes the default, IIUC, then even the old "server"
version will have RT tasks preempt tasks within the kernel without
waiting for another tick. Currently, the only way to differentiate
between a SCHED_OTHER scheduler-tick preemption and an RT task waking
up is with NEED_RESCHED_LAZY vs NEED_RESCHED.

Now, if we wanted to (and I'm not sure we do), we could add another way
to differentiate the two and still allow this to work.

> I suppose there would also need to be some additional changes to make
> sure full preemption also doesn't preempt, maybe in
> preempt_schedule*().

Which may be quite difficult, as the cr_counter is in user space and
can only be read from a context that can handle user space faults.

-- Steve
On Mon, Feb 03, 2025 at 11:45:37AM -0500, Steven Rostedt wrote:
> > Lazy only applies to fair (and whatever bpf things end up using
> > resched_curr_lazy()).
>
> Is that a problem? User spin locks for RT tasks are very dangerous. If
> an RT task preempts the owner that is of lower priority, it can cause
> a deadlock (if the two tasks are pinned to the same CPU). Which BTW,
> Sebastian mentioned in the Stable RT meeting that glibc supplies a
> pthread_spin_lock() and doesn't have anything in the man page about
> this possible scenario.

Yeah, we've known that for at least a decade if not longer. That's not
new. Traditionally glibc people haven't been very RT minded -- the
whole condvar thing comes to mind as well.

And yes, you can still use the whole 'delay preemption' hint for RT
tasks just fine. Spinlocks aren't the only thing. It can be used to
make any RSEQ section more likely to succeed.

> Patch 2 changes that to do what you wrote the last time. It has a max
> wait time of 50us.

I'm so confused, WTF do you then need the lazy crap?

You're making things needlessly complicated again.
On Tue, 4 Feb 2025 10:16:13 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> And yes, you can still use the whole 'delay preemption' hint for RT
> tasks just fine. Spinlocks aren't the only thing. It can be used to
> make any RSEQ section more likely to succeed.
>
> > Patch 2 changes that to do what you wrote the last time. It has a
> > max wait time of 50us.
>
> I'm so confused, WTF do you then need the lazy crap?
>
> You're making things needlessly complicated again.

Do we really want to delay an RT task by 50us? That will cause a lot
more regressions.

This is a performance feature, not a latency one. RT tasks are about
limiting latency and will sacrifice performance to do so. PREEMPT_RT
has great minimal latency at the expense of performance. We try to
improve performance, but never at the sacrifice of latency.

RT wakeups are also more predictable. That is, they happen when an
event comes in or a timer expires, and when an RT task wakes up it is
to run ASAP to handle some event.

This is about SCHED_OTHER tasks, where the scheduler decides when a
task gets to run and will preempt it when its quota is over. The task
has no idea when that will happen. This is about giving the kernel a
hint that it's a bad time to interrupt the task, and that if it can
just wait another 50us or less, then it would be fine to schedule.
SCHED_OTHER tasks never have latency requirements of less than a
millisecond, and SCHED_OTHER tasks are affected by other SCHED_OTHER
tasks, even those that are lower priority.

My patches here were based on where NEED_RESCHED_LAZY came from, and
that was the PREEMPT_RT patch. The problem was that the kernel would
allow preemption almost everywhere. This was great for RT tasks, but
non-RT tasks suffered performance issues. That's because of the timer
tick going off while a SCHED_OTHER task was holding a spin_lock
converted into a mutex, causing it to be scheduled out while holding
that sleeping spin_lock. This increased the amount of contention on
that spin_lock, and that affected performance. The fix was to introduce
NEED_RESCHED_LAZY, which would not have the SCHED_OTHER task preempted
while holding a sleeping spin_lock. This brought performance back close
to non PREEMPT_RT.

This work I'm doing is based on that. It doesn't make sense to delay RT
tasks for a performance improvement.

-- Steve
On Tue, 4 Feb 2025 07:51:00 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:
> > I'm so confused, WTF do you then need the lazy crap?
IOW, the "lazy crap" was created to solve this very issue. The holding of
sleeping spin locks interrupted by a scheduler tick. I'm just giving user
space the same feature that we gave the kernel in PREEMPT_RT.
-- Steve
On Tue, 4 Feb 2025 08:16:53 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 4 Feb 2025 07:51:00 -0500
> Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > > I'm so confused, WTF do you then need the lazy crap?
>
> IOW, the "lazy crap" was created to solve this very issue. The holding
> of sleeping spin locks interrupted by a scheduler tick. I'm just
> giving user space the same feature that we gave the kernel in
> PREEMPT_RT.

Also, I believe it is best to follow the current preemption method, and
that's what NEED_RESCHED_LAZY gives us.

Let's say you have a low priority program, maybe even malicious, that
goes into a loop of calling a system call that can run for almost a
millisecond without sleeping.

In PREEMPT_NONE, this low priority program can cause RT tasks a latency
of a millisecond, because if an RT task wakes up just as the program
enters the system call, it will have to wait for it to exit that system
call before it can run, which might be close to that millisecond.

For PREEMPT_VOLUNTARY, it will only preempt tasks when it hits a
might_sleep() or cond_resched() (so would PREEMPT_NONE on the
cond_resched(), but we want to get rid of those).

For PREEMPT_FULL, the program shouldn't affect any other task because
its system call will simply be preempted.

Now let's look at this new feature. It allows a task to ask for some
extended time to get out of a critical section if possible. If we
decide in the future to replace PREEMPT_NONE and PREEMPT_VOLUNTARY with
a dynamic type like:

 TYPE       | Sched Tick     | RT Wakeup   | Enter user space   |
 ===========+================+=============+====================+
 None       | Set LAZY       | Set LAZY    | schedule           |
 -----------+----------------+-------------+--------------------+
 Voluntary? | Set LAZY       | schedule    | schedule           |
 -----------+----------------+-------------+--------------------+
 Full       | schedule       | schedule    | schedule           |
 -----------+----------------+-------------+--------------------+

(The "Enter user space" is when a NEED_RESCHED is set.)

Where in None, the LAZY flag is set for both the sched tick and the RT
wakeup, and it doesn't schedule until it enters user space. In
"Voluntary", the LAZY flag is set only for the sched tick on
SCHED_OTHER tasks, but RT tasks will get scheduled immediately
(depending on preempt_disable of course). With "Full" it will schedule
whenever it can.

With that task that calls the long system call, whichever method type
above is in place determines the latency of other tasks.

Now, if we add this feature, I want it to behave the same as a long
system call: it would only extend the time where a long system call
would extend the time. That way it wouldn't modify the typical behavior
of the system for other tasks, but it would help the performance of the
task that is requesting this feature.

With this feature:

 TYPE       | Sched Tick     | RT Wakeup   | Enter user space      |
 ===========+================+=============+=======================+
 None       | Set LAZY       | Set LAZY    | schedule if !LAZY     |
 -----------+----------------+-------------+-----------------------+
 Voluntary? | Set LAZY       | schedule    | schedule if !LAZY     |
 -----------+----------------+-------------+-----------------------+
 Full       | schedule       | schedule    | schedule              |
 -----------+----------------+-------------+-----------------------+

Thus, in None, it would likely get to extend its time, just like if it
had called a long system call. This can include even making RT tasks
wait a little longer, just as they would wait on a system call. In
"Voluntary", it would only get its timeslice extended if it was another
SCHED_OTHER task that was to be scheduled; if an RT task were to wake
up, it would schedule immediately, regardless of whether an extended
time slice was requested. In "Full", it probably makes sense to simply
disable this feature (the program would see that it is disabled when it
registers the rseq), as it would never get its time slice extended: a
system call would be preempted immediately if it was interrupted.

So back to your question about why I'm tying this to the "lazy crap":
it is because I want the behavior of other tasks to not change due to
one task asking for an extended time slice.

-- Steve
On Tue, Feb 04, 2025 at 07:51:00AM -0500, Steven Rostedt wrote:
> On Tue, 4 Feb 2025 10:16:13 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > And yes, you can still use the whole 'delay preemption' hint for RT
> > tasks just fine. Spinlocks isn't the only thing. It can be used to make
> > any RSEQ section more likely to succeed.
>
> > > Patch 2 changes that to do what you wrote the last time. It has a max wait
> > > time of 50us.
> >
> > I'm so confused, WTF do you then need the lazy crap?
> >
> > You're making things needlessly complicated again.
>
> Do we really want to delay an RT task by 50us?

If you go back and reread that initial thread, you'll find the 50us is
below the scheduling latency that random test box already had.

I'm sure more modern systems will have a lower number, and slower systems
will have a larger number, but we got to pick a number :/

I'm fine with making it 20us. Or whatever. It's just a stupid number.

But yes. If we're going to be doing this, there is absolutely no reason not
to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already a
problem, and we can make DL-CBS enforcement punch through it if we have to.

And less retries on the RSEQ for FIFO can equally improve performance.

There is no difference between a 'malicious/broken' userspace consuming the
entire window in userspace (50us, 20us, whatever it will be) and doing a
system call which we know will cause similar delays because it does
in-kernel locking.
On Tue, 4 Feb 2025 16:30:53 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> If you go back and reread that initial thread, you'll find the 50us is
> below the scheduling latency that random test box already had.
>
> I'm sure more modern systems will have a lower number, and slower
> systems will have a larger number, but we got to pick a number :/
>
> I'm fine with making it 20us. Or whatever. It's just a stupid number.
>
> But yes. If we're going to be doing this, there is absolutely no reason
> not to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already
> a problem, and we can make DL-CBS enforcement punch through it if we
> have to.
>
> And less retries on the RSEQ for FIFO can equally improve performance.
>
> There is no difference between a 'malicious/broken' userspace consuming
> the entire window in userspace (50us, 20us, whatever it will be) and
> doing a system call which we know will cause similar delays because it
> does in-kernel locking.

This is where we will disagree for the reasons I explained in my second
email. This feature affects other tasks. And no, making it 20us doesn't
make it better. Because from what I get from you, if we implement this, it
will be available for all preemption methods (including PREEMPT_RT), where
we do have less than 50us latency, and even 20us will break those
applications.

This was supposed to be only a hint to the kernel, not a complete feature
that is hard coded and will override how other tasks behave. As system
calls themselves can affect how things are scheduled depending on the
preemption method, I didn't want to add something that will change how
things are scheduled while ignoring the preemption method that was chosen.

-- Steve
On Feb 1, 2025, at 10:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
>
> > On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > I still have full hate for this approach.
> >
> > So what approach would you prefer?
>
> The one that does not rely on the preemption method -- I think I posted
> something along those lines, and someone else recently reposted something
> based on it.

Here is the RFC I had sent that Peter is referring to:

  https://lore.kernel.org/all/20241113000126.967713-1-prakash.sangappa@oracle.com/

-Prakash.

> Tying things to the preemption method is absurdly bad design -- and I've
> told you that before.
> On Feb 4, 2025, at 5:44 PM, Prakash Sangappa <prakash.sangappa@oracle.com> wrote:
>
>> On Feb 1, 2025, at 10:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Sat, Feb 01, 2025 at 07:47:32AM -0500, Steven Rostedt wrote:
>>>
>>> On February 1, 2025 6:59:06 AM EST, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>>> I still have full hate for this approach.
>>>
>>> So what approach would you prefer?
>>
>> The one that does not rely on the preemption method -- I think I posted
>> something along those lines, and someone else recently reposted something
>> based on it.
>
> Here is the RFC I had sent that Peter is referring

FWIW, I second the idea of a new syscall for this rather than (ab)using
rseq and also independence from preemption method. I agree that something
generic is better than relying on preemption method.

thanks,

- Joel

>> Tying things to the preemption method is absurdly bad design -- and I've
>> told you that before.
On Tue, 4 Feb 2025 19:56:09 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> > Here is the RFC I had sent that Peter is referring
>
> FWIW, I second the idea of a new syscall for this rather than (ab)using
> rseq and also independence from preemption method. I agree that something
> generic is better than relying on preemption method.

So you are for adding another user/kernel memory mapped section?

And you are also OK with allowing any task to make an RT task wait longer?
Putting my RT hat back on, I would definitely disable that on any system
that requires RT.

But for now, I'm looking at implementing this for VMs only.

-- Steve
On Tue, Feb 4, 2025 at 10:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 4 Feb 2025 19:56:09 -0500
> Joel Fernandes <joel@joelfernandes.org> wrote:
>
> > > Here is the RFC I had sent that Peter is referring
> >
> > FWIW, I second the idea of a new syscall for this rather than (ab)using
> > rseq and also independence from preemption method. I agree that something
> > generic is better than relying on preemption method.
>
> So you are for adding another user/kernel memory mapped section?

I don't personally mind that.

> And you are also OK with allowing any task to make an RT task wait longer?
>
> Putting my RT hat back on, I would definitely disable that on any system
> that requires RT.

Just so I understand, you are basically saying that you want this feature
only for FAIR tasks, and allowing RT tasks to extend time slice might
actually hurt the latency of (other) RT tasks on the system, right? This
assumes PREEMPT_RT because the latency is 50us, right?

But in a poorly designed system, if you have RT tasks at higher priority
that preempt things lower in RT, that would already cause latency anyway.
Similarly, I would also consider any PREEMPT_RT system that (mis)uses this
API in an RT task as also a poorly designed system. I think PREEMPT_RT
systems generally require careful design anyway. So the fact that a system
is poorly designed and thus causes latency is not the kernel's problem IMO.

In any case, if you want this to only work on FAIR tasks and not RT tasks,
why is that only possible to do with rseq() + LAZY preemption and not
Prakash's new API + all preemption modes?

Also, you can just ignore RT tasks (not that I'm saying that's a good idea
but..) in taskshrd_delay_resched() in that patch if you ever wanted to do
that.

I just feel the RT latency thing is a non-issue AFAICS.

thanks,

- Joel
On Tue, Feb 04, 2025 at 11:11:19AM -0500, Steven Rostedt wrote:
> On Tue, 4 Feb 2025 16:30:53 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > If you go back and reread that initial thread, you'll find the 50us is
> > below the scheduling latency that random test box already had.
> >
> > I'm sure more modern systems will have a lower number, and slower
> > systems will have a larger number, but we got to pick a number :/
> >
> > I'm fine with making it 20us. Or whatever. It's just a stupid number.
> >
> > But yes. If we're going to be doing this, there is absolutely no reason
> > not to allow DEADLINE/FIFO threads the same. Misbehaving FIFO is already
> > a problem, and we can make DL-CBS enforcement punch through it if we
> > have to.
> >
> > And less retries on the RSEQ for FIFO can equally improve performance.
> >
> > There is no difference between a 'malicious/broken' userspace consuming
> > the entire window in userspace (50us, 20us, whatever it will be) and
> > doing a system call which we know will cause similar delays because it
> > does in-kernel locking.
>
> This is where we will disagree for the reasons I explained in my second
> email. This feature affects other tasks. And no, making it 20us doesn't
> make it better. Because from what I get from you, if we implement this, it
> will be available for all preemption methods (including PREEMPT_RT), where
> we do have less than 50us latency, and even 20us will break those
> applications.

Then pick another number; RT too has a max scheduling latency number (on
some random hardware). If you stay below that, all is fine.

> This was supposed to be only a hint to the kernel, not a complete feature

That's a contradiction in terms -- even a hint is a feature.

> that is hard coded and will override how other tasks behave.

Everything has some effect. My point is that if you limit this effect to
be less than what it can already affect, you're not making things worse.

> As system calls themselves can affect how things are scheduled depending
> on the preemption method,

What?

> I didn't want to add something that will change how things are scheduled
> while ignoring the preemption method that was chosen.

Userspace is totally oblivious to the preemption method chosen, and it
damn well should be.
On Wed, 5 Feb 2025 10:07:36 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> > This is where we will disagree for the reasons I explained in my second
> > email. This feature affects other tasks. And no, making it 20us doesn't
> > make it better. Because from what I get from you, if we implement this, it
> > will be available for all preemption methods (including PREEMPT_RT), where
> > we do have less than 50us latency, and even 20us will break those
> > applications.
>
> Then pick another number; RT too has a max scheduling latency number (on
> some random hardware). If you stay below that, all is fine.

So we set it to 1us? Or does this have to calculate what that latency
number is for each piece of random hardware?

> > This was supposed to be only a hint to the kernel, not a complete feature
>
> That's a contradiction in terms -- even a hint is a feature.

Yes, a hint is a feature. By "complete feature" I meant it being not just a
hint, but guaranteed to do something.

> > that is hard coded and will override how other tasks behave.
>
> Everything has some effect. My point is that if you limit this effect to
> be less than what it can already affect, you're not making things worse.
>
> > As system calls themselves can affect how things are scheduled depending
> > on the preemption method,
>
> What?

Read my last email:

  https://lore.kernel.org/all/20250204100555.1a641b9b@gandalf.local.home/

I went into this in detail.

> > I didn't want to add something that will change how things are scheduled
> > while ignoring the preemption method that was chosen.
>
> Userspace is totally oblivious to the preemption method chosen, and it
> damn well should be.

Agreed, and user space doesn't have to know what preemption method was
chosen for this. Where does this say that it needs to know? All this does
is give the kernel a hint that the task is in a critical section, and the
kernel decides whether to grant some more time or not. The preemption
method will influence that decision, but user space doesn't need to know.

-- Steve
On Wed, 5 Feb 2025 00:09:51 -0500
Joel Fernandes <joel@joelfernandes.org> wrote:

> On Tue, Feb 4, 2025 at 10:03 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > On Tue, 4 Feb 2025 19:56:09 -0500
> > Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > > > Here is the RFC I had sent that Peter is referring
> > >
> > > FWIW, I second the idea of a new syscall for this rather than (ab)using
> > > rseq and also independence from preemption method. I agree that something
> > > generic is better than relying on preemption method.
> >
> > So you are for adding another user/kernel memory mapped section?
>
> I don't personally mind that.

I'm glad you don't personally mind it. Are you going to help maintain
another memory mapped section?

> > And you are also OK with allowing any task to make an RT task wait longer?
> >
> > Putting my RT hat back on, I would definitely disable that on any system
> > that requires RT.
>
> Just so I understand, you are basically saying that you want this
> feature only for FAIR tasks, and allowing RT tasks to extend time
> slice might actually hurt the latency of (other) RT tasks on the
> system, right? This assumes PREEMPT_RT because the latency is 50us,
> right?

RT tasks don't have a time slice. They are driven by events: an external
interrupt coming in, or a timer going off that states something is
happening.

Perhaps we could use this for SCHED_RR, or maybe even SCHED_DEADLINE, as
those do have time slices. But if it does get used, it should only be used
when the task to be scheduled is at the same SCHED_RR priority, or when
SCHED_DEADLINE will not fail its guarantees.

> But in a poorly designed system, if you have RT tasks at higher
> priority that preempt things lower in RT, that would already cause
> latency anyway. Similarly, I would also consider any PREEMPT_RT system

And that would be a poorly designed system, and not the problem of the
kernel.

> that (mis)uses this API in an RT task as also a poorly designed
> system. I think PREEMPT_RT systems generally require careful design
> anyway. So the fact that a system is poorly designed and thus causes
> latency is not the kernel's problem IMO.

Correct. And that is why I don't think this should be used for RT. It's
SCHED_OTHER, which doesn't have any control of the sched tick, where this
hint can help.

> In any case, if you want this to only work on FAIR tasks and not RT
> tasks, why is that only possible to do with rseq() + LAZY preemption
> and not Prakash's new API + all preemption modes?
>
> Also, you can just ignore RT tasks (not that I'm saying that's a good
> idea but..) in taskshrd_delay_resched() in that patch if you ever
> wanted to do that.
>
> I just feel the RT latency thing is a non-issue AFAICS.

Have you worked on any RT projects before?

-- Steve
On Wed, 5 Feb 2025 08:16:35 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> Correct. And that is why I don't think this should be used for RT. It's
> SCHED_OTHER, which doesn't have any control of the sched tick, where this
> hint can help.

Honestly, I don't care that much if it is used on all preemption models,
but I do care if it affects RT tasks.

The LAZY flag just happens to let us know whether the next schedule is
mandatory or not. NEED_RESCHED_LAZY means this schedule is not that
important, whereas NEED_RESCHED means it is. That's why I picked it to
decide whether the task can get extended or not. It has nothing to do with
the preemption method; the preemption method just currently sets that flag.
Now we could change when NEED_RESCHED_LAZY is set, and this could work with
all preemption methods.

To explain this better, currently we have:

  TYPE       | Sched Tick            | RT Wakeup            |
  ===========+=======================+======================+
  None       | NEED_RESCHED_LAZY     | NEED_RESCHED_LAZY    |
  -----------+-----------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY     | NEED_RESCHED         |
  -----------+-----------------------+----------------------+
  Full       | NEED_RESCHED          | NEED_RESCHED         |
  -----------+-----------------------+----------------------+

Perhaps if we do:

  TYPE       | Sched Tick                        | RT Wakeup            |
  ===========+===================================+======================+
  None       | NEED_RESCHED_LAZY                 | NEED_RESCHED_LAZY    |
  -----------+-----------------------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY                 | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+
  Full       | NEED_RESCHED or NEED_RESCHED_LAZY | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+

Where, on the scheduler tick for PREEMPT_FULL (and even PREEMPT_RT), we set
NEED_RESCHED if the task is in the kernel and NEED_RESCHED_LAZY if it is in
user space, then this patch will work in all preemption methods, as the
LAZY bit will decide whether the task gets extended or not.

That is, any SCHED_OTHER task that is being preempted due to its scheduler
tick can be granted 50us more, regardless of the preemption method. Now, on
PREEMPT_NONE, it may even get to preempt an RT task a bit more, but RT
tasks have more to worry about if they are running on a PREEMPT_NONE system
anyway!

-- Steve
On Wed, 5 Feb 2025 08:10:19 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> The preemption method will influence that decision, but user space doesn't
> need to know.

If your issue is that this depends on the preemption method, I have a
slight change to make it work for all preemption methods. I just replied to
Joel about this:

  https://lore.kernel.org/all/20250205083857.3cc06aa7@gandalf.local.home/

I'll repeat some of it here. Currently we have:

  TYPE       | Sched Tick            | RT Wakeup            |
  ===========+=======================+======================+
  None       | NEED_RESCHED_LAZY     | NEED_RESCHED_LAZY    |
  -----------+-----------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY     | NEED_RESCHED         |
  -----------+-----------------------+----------------------+
  Full       | NEED_RESCHED          | NEED_RESCHED         |
  -----------+-----------------------+----------------------+

Now, if we set NEED_RESCHED_LAZY in PREEMPT_FULL (and PREEMPT_RT) on a
scheduler tick when it interrupted user space (not the kernel), then we
have this:

  TYPE       | Sched Tick                        | RT Wakeup            |
  ===========+===================================+======================+
  None       | NEED_RESCHED_LAZY                 | NEED_RESCHED_LAZY    |
  -----------+-----------------------------------+----------------------+
  Voluntary  | NEED_RESCHED_LAZY                 | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+
  Full       | NEED_RESCHED or NEED_RESCHED_LAZY | NEED_RESCHED         |
  -----------+-----------------------------------+----------------------+

Then, going back to user space from the interrupt, we can use rseq and the
LAZY bit to know whether we should extend the tick or not! Even without
rseq, this would behave the same, as NEED_RESCHED_LAZY will schedule just
like NEED_RESCHED when going back to user space.

This will allow this schedule tick extension to work in all the preemption
methods. Would this be something you are more OK with? I can go and test
this out.

-- Steve
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 64934e0830af..8e983d8cf72d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2206,6 +2206,16 @@ static inline bool owner_on_cpu(struct task_struct *owner)
 unsigned long sched_cpu_util(int cpu);
 #endif /* CONFIG_SMP */
 
+#ifdef CONFIG_RSEQ
+
+extern bool rseq_delay_resched(void);
+
+#else
+
+static inline bool rseq_delay_resched(void) { return false; }
+
+#endif
+
 #ifdef CONFIG_SCHED_CORE
 extern void sched_core_free(struct task_struct *tsk);
 extern void sched_core_fork(struct task_struct *p);
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..185fe9826ff9 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,18 @@ enum rseq_cs_flags {
 	(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_cr_flags_bit {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT	= 0,
+};
+
+enum rseq_cr_flags {
+	RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED =
+		(1U << RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED_BIT),
+};
+
+#define RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK	\
+	(~RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED)
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -148,6 +160,18 @@ struct rseq {
 	 */
 	__u32 mm_cid;
 
+	/*
+	 * The cr_counter is a way for user space to inform the kernel that
+	 * it is in a critical section. If bits 1-31 are set, then the
+	 * kernel may grant the thread a bit more time (but there is no
+	 * guarantee of how much time or if it is granted at all). If the
+	 * kernel does grant the thread extra time, it will set bit 0 to
+	 * inform user space that it has granted the thread more time and that
+	 * user space should call yield() as soon as it leaves its critical
+	 * section.
+	 */
+	__u32 cr_counter;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e33691d5adf7..50e35f153bf8 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -90,6 +90,8 @@ void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
 __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 						     unsigned long ti_work)
 {
+	unsigned long ignore_mask = 0;
+
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
@@ -98,9 +100,18 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 
 		local_irq_enable_exit_to_user(ti_work);
 
-		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
+		if (ti_work & _TIF_NEED_RESCHED) {
 			schedule();
 
+		} else if (ti_work & _TIF_NEED_RESCHED_LAZY) {
+			/* Allow to leave with NEED_RESCHED_LAZY still set */
+			if (rseq_delay_resched()) {
+				trace_printk("Avoid scheduling\n");
+				ignore_mask |= _TIF_NEED_RESCHED_LAZY;
+			} else
+				schedule();
+		}
+
 		if (ti_work & _TIF_UPROBE)
 			uprobe_notify_resume(regs);
 
@@ -127,6 +138,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
+		ti_work &= ~ignore_mask;
 	}
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..b792e36a3550 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -339,6 +339,36 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 		force_sigsegv(sig);
 }
 
+bool rseq_delay_resched(void)
+{
+	struct task_struct *t = current;
+	u32 flags;
+
+	if (!t->rseq)
+		return false;
+
+	/* Make sure the cr_counter exists */
+	if (current->rseq_len <= offsetof(struct rseq, cr_counter))
+		return false;
+
+	/* If this were to fault, it would likely cause a schedule anyway */
+	if (copy_from_user_nofault(&flags, &t->rseq->cr_counter, sizeof(flags)))
+		return false;
+
+	if (!(flags & RSEQ_CR_FLAG_IN_CRITICAL_SECTION_MASK))
+		return false;
+
+	trace_printk("Extend time slice\n");
+	flags |= RSEQ_CR_FLAG_KERNEL_REQUEST_SCHED;
+
+	if (copy_to_user_nofault(&t->rseq->cr_counter, &flags, sizeof(flags))) {
+		trace_printk("Faulted writing rseq\n");
+		return false;
+	}
+
+	return true;
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*