Message ID | 20250124163741.101568-1-pbonzini@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [GIT,PULL] KVM changes for Linux 6.14 | expand |
On Fri, 24 Jan 2025 16:37:41 +0000, Paolo Bonzini <pbonzini@redhat.com> wrote: > > Linus, > > The following changes since commit 5bc55a333a2f7316b58edc7573e8e893f7acb532: > > Linux 6.13-rc7 (2025-01-12 14:37:56 -0800) > > are available in the Git repository at: > > https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus > > for you to fetch changes up to 931656b9e2ff7029aee0b36e17780621948a6ac1: > > kvm: defer huge page recovery vhost task to later (2025-01-24 10:53:56 -0500) Sorry to ask the obvious, but why didn't you include the KVM/arm64 updates which you said you had pulled[1]? Was there anything wrong with it? If so, I would very much like to know. M. [1] https://lore.kernel.org/r/CABgObfYckN2J_Q3d--ZfAP=QbtGWp-teOpXGPfubU=BN4DSrgw@mail.gmail.com
Let's bring some thread setup people in on this.. The kvm people obviously already solved their particular issue, but I get the feeling that the kvm solution is kind of a hack that works around a user space oddity. For newly added people: see commit 931656b9e2ff ("kvm: defer huge page recovery vhost task to later") and the explanation below, and this thread on the mailing lists: https://lore.kernel.org/all/Z2RYyagu3phDFIac@kbusch-mbp.dhcp.thefacebook.com/ Arguably the user space oddity is just strange and Paolo even calls it a bug, but at the same time, I do think user space can and should reasonably expect that it only has children that it created explicitly, and the automatic reclamation thread most definitely is a bit too implicit. On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote: > > * The recently introduced conversion of the NX-page reclamation kthread to > vhost_task moved the task under the main process. The task is created as > soon as KVM_CREATE_VM was invoked and this, of course, broke userspace that > didn't expect to see any child task of the VM process until it started > creating its own userspace threads. In particular crosvm refuses to fork() > if procfs shows any child task, so unbreak it by creating the task lazily. > This is arguably a userspace bug, as there can be other kinds of legitimate > worker tasks and they wouldn't impede fork(); but it's not like userspace > has a way to distinguish kernel worker tasks right now. Should they show > as "Kthread: 1" in proc/.../status? So first off, let me just say that I still absolutely think that the current "vhost workers are children of the starter" is the right model, even if it has caused some issues because of various legacy expectations. But in this case I do wonder if we should hide the implicit kernel threads from user space somehow. Keith pinpointed the user space logic to fork_remap(): https://github.com/google/minijail/blob/main/rust/minijail/src/lib.rs#L987 and honestly, I do think it makes sense for user space to ask "am I single-threaded" (which is presumably the thing that breaks), and the code for that is pretty simple: fn is_single_threaded() -> io::Result<bool> { match count_dir_entries("/proc/self/task") { Ok(1) => Ok(true), Ok(_) => Ok(false), Err(e) => Err(e), } } and I really don't think user space is "wrong". So the fact that a kernel helper thread that runs async in the background and does random background infrastructure things that do not really affect user space should probably simply not break this kind of simple (and admittedly simplistic) user space logic. Should we just add some flag to say "don't show this thread in this context"? We obviously still want to see it for management purposes, so it's not like the thing should be entirely invisible, but maybe Christian / Eric / Oleg have some opinions on how to do this cleanly in "copy_process()" or similar? Linus
On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote: > > but you can throw away the <<<< ... ==== part completely, and apply the > same change on top of the new implementation: > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index edef30359c19..9f9a29be3beb 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void) > EMULATED_F(NO_SMM_CTL_MSR), > /* PrefetchCtlMsr */ > F(WRMSR_XX_BASE_NS), > + F(SRSO_USER_KERNEL_NO), > SYNTHESIZED_F(SBPB), > SYNTHESIZED_F(IBPB_BRTYPE), > SYNTHESIZED_F(SRSO_NO), Ehh. My resolution ended up being different. I did this instead: F(WRMSR_XX_BASE_NS), SYNTHESIZED_F(SBPB), SYNTHESIZED_F(IBPB_BRTYPE), SYNTHESIZED_F(SRSO_NO), + SYNTHESIZED_F(SRSO_USER_KERNEL_NO), which (apart from the line ordering) differs from your suggestion in F() vs SYNTHESIZED_F(). That really seemed to be the RightThing(tm) to do from the context of the two conflicting commits, but maybe there was some reason that I didn't catch that you kept it as a plain "F()". So please take a look, and if I screwed up send me a fix (with a scathing explanation for why I'm maternally related to some less-than-gifted rodentia with syphilis). Linus
The pull request you sent on Fri, 24 Jan 2025 11:37:41 -0500:
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0f8e26b38d7ac72b3ad764944a25dd5808f37a6e
Thank you!
On Sat, 25 Jan 2025 at 10:12, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Arguably the user space oddity is just strange and Paolo even calls it > a bug, but at the same time, I do think user space can and should > reasonably expect that it only has children that it created > explicitly [..] Note that I think that doing things like "io_uring" and getting IO helper threads that way would very much count as "explicit children", so I don't argue that all kernel helper threads would fall under this category. And I suspect that the normal vhost workers fall under that same kind of "it's like io_uring". If you use VHOST_NEW_WORKER to create a worker thread, then that's a pretty explicit "I have a child process". So it's really just that hugepage recovery thread that seems to be a bit "too" much of an implicit kernel helper thread that user space kind of gets accidentally and implicitly just because of a kernel implementation detail. I'm sure the kvm hack to just start it later (at KVM_RUN time?) is sufficient in practice, but it still feels conceptually iffy to me. Linus
On 01/25, Linus Torvalds wrote: > > Keith pinpointed the user space logic to fork_remap(): > > https://github.com/google/minijail/blob/main/rust/minijail/src/lib.rs#L987 > > and honestly, I do think it makes sense for user space to ask "am I > single-threaded" (which is presumably the thing that breaks), and the > code for that is pretty simple: > > fn is_single_threaded() -> io::Result<bool> { > match count_dir_entries("/proc/self/task") { > Ok(1) => Ok(true), > Ok(_) => Ok(false), > Err(e) => Err(e), > } > } > > and I really don't think user space is "wrong". > > So the fact that a kernel helper thread that runs async in the > background and does random background infrastructure things that do > not really affect user space should probably simply not break this > kind of simple (and admittedly simplistic) user space logic. > > Should we just add some flag to say "don't show this thread in this > context"? Not sure I understand... Looking at is_single_threaded() above I guess something like below should work (incomplete, in particular we need to chang first_tid() as well). But a PF_HIDDEN sub-thread will still be visible via /proc/$pid_of_PF_HIDDEN > We obviously still want to see it for management purposes, > so it's not like the thing should be entirely invisible, Can you explain? Oleg. --- x/include/linux/sched.h +++ x/include/linux/sched.h @@ -1685,7 +1685,7 @@ extern struct pid *cad_pid; #define PF_USED_MATH 0x00002000 /* If unset the fpu must be initialized before use */ #define PF_USER_WORKER 0x00004000 /* Kernel thread cloned from userspace thread */ #define PF_NOFREEZE 0x00008000 /* This thread should not be frozen */ -#define PF__HOLE__00010000 0x00010000 +#define PF_HIDDEN 0x00010000 #define PF_KSWAPD 0x00020000 /* I am kswapd */ #define PF_MEMALLOC_NOFS 0x00040000 /* All allocations inherit GFP_NOFS. See memalloc_nfs_save() */ #define PF_MEMALLOC_NOIO 0x00080000 /* All allocations inherit GFP_NOIO. See memalloc_noio_save() */ --- x/include/linux/sched/task.h +++ x/include/linux/sched/task.h @@ -31,6 +31,7 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; + u32 hidden:1; unsigned long stack; unsigned long stack_size; unsigned long tls; --- x/kernel/fork.c +++ x/kernel/fork.c @@ -2237,6 +2237,8 @@ __latent_entropy struct task_struct *cop } if (args->io_thread) p->flags |= PF_IO_WORKER; + if (args->hidden) + p->flags |= PF_HIDDEN; if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); --- x/kernel/vhost_task.c +++ x/kernel/vhost_task.c @@ -117,7 +117,7 @@ EXPORT_SYMBOL_GPL(vhost_task_stop); */ struct vhost_task *vhost_task_create(bool (*fn)(void *), void (*handle_sigkill)(void *), void *arg, - const char *name) + bool hidden, const char *name) { struct kernel_clone_args args = { .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | @@ -125,6 +125,7 @@ struct vhost_task *vhost_task_create(boo .exit_signal = 0, .fn = vhost_task_fn, .name = name, + .hidden = hidden, .user_worker = 1, .no_files = 1, }; --- x/fs/proc/base.c +++ x/fs/proc/base.c @@ -3906,9 +3906,12 @@ static struct task_struct *next_tid(stru struct task_struct *pos = NULL; rcu_read_lock(); if (pid_alive(start)) { - pos = __next_thread(start); - if (pos) - get_task_struct(pos); + for (pos = start; (pos = __next_thread(pos)); ) { + if (!(pos->flags & PF_HIDDEN)) { + get_task_struct(pos); + break; + } + } } rcu_read_unlock(); put_task_struct(start);
On Sun, 26 Jan 2025 at 06:21, Oleg Nesterov <oleg@redhat.com> wrote: > > > Should we just add some flag to say "don't show this thread in this > > context"? > > Not sure I understand... Looking at is_single_threaded() above I guess > something like below should work (incomplete, in particular we need to > chang first_tid() as well). So yes, I was thinking something similar, but: > But a PF_HIDDEN sub-thread will still be visible via /proc/$pid_of_PF_HIDDEN > > > We obviously still want to see it for management purposes, > > so it's not like the thing should be entirely invisible, > > Can you explain? I was literally thinking that instead of a "hidden" flag, it would be a "self-hidden" flag. So if somebody _else_ (notably the sysadmin) does "ps" they see the kernel thread as a subthread. But when you look at your own /proc/self/task/ listing, you only see your own explicit threads. So that "is_singlethreaded()" logic works. Maybe that's just too ugly for words, and the kvm workaround is better. Linus
On 01/26, Linus Torvalds wrote: > > On Sun, 26 Jan 2025 at 06:21, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > Should we just add some flag to say "don't show this thread in this > > > context"? > > > > Not sure I understand... Looking at is_single_threaded() above I guess > > something like below should work (incomplete, in particular we need to > > chang first_tid() as well). > > So yes, I was thinking something similar, but: > > > But a PF_HIDDEN sub-thread will still be visible via /proc/$pid_of_PF_HIDDEN > > > > > We obviously still want to see it for management purposes, > > > so it's not like the thing should be entirely invisible, > > > > Can you explain? > > I was literally thinking that instead of a "hidden" flag, it would be > a "self-hidden" flag. > > So if somebody _else_ (notably the sysadmin) does "ps" they see the > kernel thread as a subthread. > > But when you look at your own /proc/self/task/ listing, you only see > your own explicit threads. So that "is_singlethreaded()" logic works. Got it... I don't think we even need to detect the /proc/self/ or /proc/self-thread/ case, next_tid() can just check same_thread_group, - if (!(pos->flags & PF_HIDDEN)) { + if (!(pos->flags & PF_HIDDEN) || !same_thread_group(current, pos))) { right ? Oleg.
On 01/26, Oleg Nesterov wrote: > > On 01/26, Linus Torvalds wrote: > > > > I was literally thinking that instead of a "hidden" flag, it would be > > a "self-hidden" flag. > > > > So if somebody _else_ (notably the sysadmin) does "ps" they see the > > kernel thread as a subthread. > > > > But when you look at your own /proc/self/task/ listing, you only see > > your own explicit threads. So that "is_singlethreaded()" logic works. > > Got it... > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > case, next_tid() can just check same_thread_group, > > - if (!(pos->flags & PF_HIDDEN)) { > + if (!(pos->flags & PF_HIDDEN) || !same_thread_group(current, pos))) { > > right ? Or we can exclude them from /proc/whatever/task/ listing unconditionally, and change next_tgid() to report them as if there are not sub-threads, iow "ps ax" will show all the PF_HIDDEN tasks... I dunno. Oleg.
On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote: > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > case, next_tid() can just check same_thread_group, That was my thinking yes. If we exclude them from /proc/*/task entirely, I'd worry that it would hide it from some management tool and be used for nefarious purposes (even if they then show up elsewhere that the tool wouldn't look at). But as mentioned, maybe this is all more of a hack than what kvm now does. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Sat, 25 Jan 2025 at 10:12, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Arguably the user space oddity is just strange and Paolo even calls it >> a bug, but at the same time, I do think user space can and should >> reasonably expect that it only has children that it created >> explicitly [..] > > Note that I think that doing things like "io_uring" and getting IO > helper threads that way would very much count as "explicit children", > so I don't argue that all kernel helper threads would fall under this > category. > > And I suspect that the normal vhost workers fall under that same kind > of "it's like io_uring". If you use VHOST_NEW_WORKER to create a > worker thread, then that's a pretty explicit "I have a child process". > > So it's really just that hugepage recovery thread that seems to be a > bit "too" much of an implicit kernel helper thread that user space > kind of gets accidentally and implicitly just because of a kernel > implementation detail. > > I'm sure the kvm hack to just start it later (at KVM_RUN time?) is > sufficient in practice, but it still feels conceptually iffy to me. I don't think implicit vs explicit is right question. Rather we should be asking can userspace care? If I read the context from the commit correctly what userspace is asking is: Am I single threaded so that I know nothing funny will happen in the forked process. The most common funny I am aware of for forked multi-threaded processes is that if they fork with another thread holding a lock the forked process might hang forever on the lock because the lock will never be released. The most interesting part of the hugepage reaper appears to be kvm_mmu_commit_zap_page, where a page is freed after being flushed from the tlb. I would argue that if kvm_mmu_commit_zap_page and friends change the page tables in a way that userspace can see after a fork, and in turn could affect how the forked process will execute userspace is doing something sensible in testing for it. On the flip side if this isn't something userspace can observe in it's own process I would argue that the proper solution is to user a regular kthread. In summary the conceptually clean approach is to only have threads that when running can effect the process they are a part of in a userspace visible way. Assuming the hugepage reaper can effect the process it is a part of, the only problem I see is the hugepage reaper existing when it had nothing it could possibly do. I don't think hiding threads is a useful solution because the threads will effect they process they are a part of. If the threads aren't effecting the process they are a part of we have other solutions besides threads. Eric
On 01/26, Linus Torvalds wrote: > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote: > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > > case, next_tid() can just check same_thread_group, > > That was my thinking yes. > > If we exclude them from /proc/*/task entirely, I'd worry that it would > hide it from some management tool and be used for nefarious purposes Agreed, > (even if they then show up elsewhere that the tool wouldn't look at). Even if we move them from /proc/*/task to /proc ? Perhaps, I honestly do not know what will/can confuse userspace more. > But as mentioned, maybe this is all more of a hack than what kvm now does. I don't know. But I will be happy to make a patch if we have a consensus. Oleg.
On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote: > On 01/26, Linus Torvalds wrote: > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > > > case, next_tid() can just check same_thread_group, > > > > That was my thinking yes. > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would > > hide it from some management tool and be used for nefarious purposes > > Agreed, > > > (even if they then show up elsewhere that the tool wouldn't look at). > > Even if we move them from /proc/*/task to /proc ? Indeed---as long as they show up somewhere, it's not worse than it used to be. The reason why I'd prefer them to stay in /proc/*/task is that moving them away at least partly negates the benefits of the "workers are children of the starter" model. For example it complicates measuring their cost within the process that runs the VM. Maybe it's more of a romantic thing than a real practical issue, because in the real world resource accounting for VMs is done via cgroups. But unlike the lazy creation in KVM, which is overall pretty self-contained, I am afraid the ugliness in procfs would be much worse compared to the benefit, if there's a benefit at all. > Perhaps, I honestly do not know what will/can confuse userspace more. At the very least, marking workers as "Kthread: 1" makes sense and should not cause too much confusion. I wouldn't go beyond that unless we get more reports of similar issues, and I'm not even sure how common it is for userspace libraries to check for single-threadedness. Paolo > > But as mentioned, maybe this is all more of a hack than what kvm now does. > > I don't know. But I will be happy to make a patch if we have a consensus. > > Oleg. >
On Sat, Jan 25, 2025, Linus Torvalds wrote: > On Fri, 24 Jan 2025 at 08:38, Paolo Bonzini <pbonzini@redhat.com> wrote: > > > > but you can throw away the <<<< ... ==== part completely, and apply the > > same change on top of the new implementation: > > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > > index edef30359c19..9f9a29be3beb 100644 > > --- a/arch/x86/kvm/cpuid.c > > +++ b/arch/x86/kvm/cpuid.c > > @@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void) > > EMULATED_F(NO_SMM_CTL_MSR), > > /* PrefetchCtlMsr */ > > F(WRMSR_XX_BASE_NS), > > + F(SRSO_USER_KERNEL_NO), > > SYNTHESIZED_F(SBPB), > > SYNTHESIZED_F(IBPB_BRTYPE), > > SYNTHESIZED_F(SRSO_NO), > > Ehh. My resolution ended up being different. > > I did this instead: > > F(WRMSR_XX_BASE_NS), > SYNTHESIZED_F(SBPB), > SYNTHESIZED_F(IBPB_BRTYPE), > SYNTHESIZED_F(SRSO_NO), > + SYNTHESIZED_F(SRSO_USER_KERNEL_NO), > > which (apart from the line ordering) differs from your suggestion in > F() vs SYNTHESIZED_F(). > > That really seemed to be the RightThing(tm) to do from the context of > the two conflicting commits, but maybe there was some reason that I > didn't catch that you kept it as a plain "F()". Heh, I waffled on whether SRSO_USER_KERNEL_NO should be F() or SYNTHESIZED_F() when the initial commit went in. I would prefer to keep it F(), though it doesn't matter terribly at the moment. The "synthesized" features are for cases where the kernel stuffs X86_FEATURE_xxx via set_cpu_cap() even when the feature isn't present in CPUID, and it's correct for KVM to relay the synthesized feature to the guest. E.g. SRSO_NO is synthesized into cpu_caps for Zen1/2, and in that case the absense of the SRSO flaw extends to the guest as well. if (boot_cpu_data.x86 < 0x19 && !cpu_smt_possible()) { setup_force_cpu_cap(X86_FEATURE_SRSO_NO); return; } For SRSO_USER_KERNEL_NO, it's currently not force set, i.e. it's a pure reflection of hardware capabilities. Treating it as synthesized is effectively a nop with the current code, but that would change if the kernel were to force set the flag. If a future commit force set SRSO_USER_KERNEL_NO because of a ucode update that didn't also modify CPUID behavior, then treating the flag as synthesized would be desirabled, e.g. so that the guest could also avoid the overhead of mitigating SRSO. But if a future commit set the flag for some other reason, e.g. if the kernel somehow isn't vulnerable even when running on buggy hardware, then enumerating SRSO_USER_KERNEL_NO to the guest could cause the guest kernel to incorrectly skips its mitigation. My vote is to err on the side of caution and go with F().
On Sat, Jan 25, 2025 at 7:16 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > Ehh. My resolution ended up being different. > > I did this instead: > > F(WRMSR_XX_BASE_NS), > SYNTHESIZED_F(SBPB), > SYNTHESIZED_F(IBPB_BRTYPE), > SYNTHESIZED_F(SRSO_NO), > + SYNTHESIZED_F(SRSO_USER_KERNEL_NO), > > which (apart from the line ordering) differs from your suggestion in > F() vs SYNTHESIZED_F(). > > That really seemed to be the RightThing(tm) to do from the context of > the two conflicting commits, but maybe there was some reason that I > didn't catch that you kept it as a plain "F()". SYNTHESIZED_F() generally is used together with setup_force_cpu_cap(), i.e. when it makes sense to present the feature even if cpuid does not have it *and* the VM is not able to see the difference. You use it if when mitigations on the host automatically protect the guest as well. For example, F vs. SYNTHESIZED_F() makes a difference for X86_FEATURE_SRSO_NO because F() hides the feature from the guests and SYNTHESIZED_F() lets them use it. It doesn't hurt at all in this case, or make a difference for that matter, because there's no setup_force_cpu_cap(X86_FEATURE_SRSO_USER_KERNEL_NO). But here using SYNTHESIZED_F it's just a little less self-documenting and a little less future proof, nothing that a quick follow-up PR can't fix, and also I managed to pull the KVM/ARM changes from the wrong machine so I have to send a second KVM pull request anyway. > So please take a look, and if I screwed up send me a fix (with a > scathing explanation for why I'm maternally related to some > less-than-gifted rodentia with syphilis). I think I don't want to know if it's a Finnish metaphor, or you came up with it all on your own... Paolo
On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote: > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote: > > On 01/26, Linus Torvalds wrote: > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > > > > case, next_tid() can just check same_thread_group, > > > > > > That was my thinking yes. > > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would > > > hide it from some management tool and be used for nefarious purposes > > > > Agreed, > > > > > (even if they then show up elsewhere that the tool wouldn't look at). > > > > Even if we move them from /proc/*/task to /proc ? > > Indeed---as long as they show up somewhere, it's not worse than it > used to be. The reason why I'd prefer them to stay in /proc/*/task is > that moving them away at least partly negates the benefits of the > "workers are children of the starter" model. For example it > complicates measuring their cost within the process that runs the VM. > Maybe it's more of a romantic thing than a real practical issue, > because in the real world resource accounting for VMs is done via > cgroups. But unlike the lazy creation in KVM, which is overall pretty > self-contained, I am afraid the ugliness in procfs would be much worse > compared to the benefit, if there's a benefit at all. > > > Perhaps, I honestly do not know what will/can confuse userspace more. > > At the very least, marking workers as "Kthread: 1" makes sense and > should not cause too much confusion. I wouldn't go beyond that unless > we get more reports of similar issues, and I'm not even sure how > common it is for userspace libraries to check for single-threadedness. Sorry, just saw this thread now. What if we did what Linus suggests and hide (odd) user workers from /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter would only list the workers that got spawned by the kernel for that particular task? This would acknowledge their somewhat special status and allow userspace to still detect them as "Hey, I didn't actually spawn those, they got shoved into my workload by the kernel for me.". (Another (ugly) alternative would be to abuse prctl() and have workloads opt-in to hiding user workers from /proc/<pid>/task/.) > > Paolo > > > > But as mentioned, maybe this is all more of a hack than what kvm now does. > > > > I don't know. But I will be happy to make a patch if we have a consensus. > > > > Oleg. > > >
On Tue, Feb 4, 2025 at 3:19 PM Christian Brauner <brauner@kernel.org> wrote: > > On Mon, Jan 27, 2025 at 04:15:01PM +0100, Paolo Bonzini wrote: > > On Mon, Jan 27, 2025 at 3:10 PM Oleg Nesterov <oleg@redhat.com> wrote: > > > On 01/26, Linus Torvalds wrote: > > > > On Sun, 26 Jan 2025 at 10:54, Oleg Nesterov <oleg@redhat.com> wrote: > > > > > > > > > > I don't think we even need to detect the /proc/self/ or /proc/self-thread/ > > > > > case, next_tid() can just check same_thread_group, > > > > > > > > That was my thinking yes. > > > > > > > > If we exclude them from /proc/*/task entirely, I'd worry that it would > > > > hide it from some management tool and be used for nefarious purposes > > > > > > Agreed, > > > > > > > (even if they then show up elsewhere that the tool wouldn't look at). > > > > > > Even if we move them from /proc/*/task to /proc ? > > > > Indeed---as long as they show up somewhere, it's not worse than it > > used to be. The reason why I'd prefer them to stay in /proc/*/task is > > that moving them away at least partly negates the benefits of the > > "workers are children of the starter" model. For example it > > complicates measuring their cost within the process that runs the VM. > > Maybe it's more of a romantic thing than a real practical issue, > > because in the real world resource accounting for VMs is done via > > cgroups. But unlike the lazy creation in KVM, which is overall pretty > > self-contained, I am afraid the ugliness in procfs would be much worse > > compared to the benefit, if there's a benefit at all. > > > > > Perhaps, I honestly do not know what will/can confuse userspace more. > > > > At the very least, marking workers as "Kthread: 1" makes sense and > > should not cause too much confusion. I wouldn't go beyond that unless > > we get more reports of similar issues, and I'm not even sure how > > common it is for userspace libraries to check for single-threadedness. > > Sorry, just saw this thread now. > > What if we did what Linus suggests and hide (odd) user workers from > /proc/<pid>/task/* but also added /proc/<pid>/workers/*. The latter > would only list the workers that got spawned by the kernel for that > particular task? This would acknowledge their somewhat special status > and allow userspace to still detect them as "Hey, I didn't actually > spawn those, they got shoved into my workload by the kernel for me.". Wouldn't the workers then disappear completely from ps, top or other tools that look at /proc/$PID/task? That seems a bit too underhanded towards userspace... Paolo > (Another (ugly) alternative would be to abuse prctl() and have workloads > opt-in to hiding user workers from /proc/<pid>/task/.)
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index ae0b438a2c99..f7e222953cab 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -821,7 +821,7 @@ void kvm_set_cpu_caps(void) kvm_cpu_cap_mask(CPUID_8000_0021_EAX, F(NO_NESTED_DATA_BP) | F(LFENCE_RDTSC) | 0 /* SmmPgCfgLock */ | F(NULL_SEL_CLR_BASE) | F(AUTOIBRS) | 0 /* PrefetchCtlMsr */ | - F(WRMSR_XX_BASE_NS) + F(WRMSR_XX_BASE_NS) | F(SRSO_USER_KERNEL_NO) ); kvm_cpu_cap_check_and_set(X86_FEATURE_SBPB); but you can throw away the <<<< ... ==== part completely, and apply the same change on top of the new implementation: diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index edef30359c19..9f9a29be3beb 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -1177,6 +1177,7 @@ void kvm_set_cpu_caps(void) EMULATED_F(NO_SMM_CTL_MSR), /* PrefetchCtlMsr */ F(WRMSR_XX_BASE_NS), + F(SRSO_USER_KERNEL_NO), SYNTHESIZED_F(SBPB), SYNTHESIZED_F(IBPB_BRTYPE), SYNTHESIZED_F(SRSO_NO),