| Message ID | 20240809074725.320801-1-npiggin@gmail.com |
| --- | --- |
| State | New, archived |
| Series | [RFC] accel/tcg: clear all TBs from a page when it is written to |
(Widening Cc list)

On 9/8/24 09:47, Nicholas Piggin wrote:
> This is not a clean patch, but does fix a problem I hit with TB
> invalidation due to the target software writing to memory with TBs.
>
> Lockup messages are triggering in Linux due to page clearing taking a
> long time when a code page has been freed, because it takes a lot of
> notdirty notifiers, which massively slows things down. Linux might
> possibly have a bug here too because it seems to hang indefinitely in
> some cases, but even if it didn't, the latency of clearing these pages
> is very high.
>
> This showed when running KVM on the emulated machine, starting and
> stopping guests. That causes lots of instruction pages to be freed.
> Usually if you're just running Linux, executable pages remain in
> pagecache so you get fewer of these bombs in the kernel memory
> allocator. But page reclaim, JITs, deleting executable files, etc.,
> could trigger it too.
>
> Invalidating all TBs from the page on any hit seems to avoid the
> problem and generally speeds things up.
>
> How important is the precise invalidation? These days I assume the
> tricky kind of SMC that frequently writes code close to where it's
> executing is pretty rare and might not be something we really care
> about for performance. Could we remove sub-page TB invalidation
> entirely?
>
> Thanks,
> Nick
> ---
>  accel/tcg/tb-maint.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
> index cc0f5afd47..d9a76b1665 100644
> --- a/accel/tcg/tb-maint.c
> +++ b/accel/tcg/tb-maint.c
> @@ -1107,6 +1107,9 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>       TranslationBlock *current_tb = retaddr ? tcg_tb_lookup(retaddr) : NULL;
>   #endif /* TARGET_HAS_PRECISE_SMC */
>
> +    start &= TARGET_PAGE_MASK;
> +    last |= ~TARGET_PAGE_MASK;
> +
>       /* Range may not cross a page. */
>       tcg_debug_assert(((start ^ last) & TARGET_PAGE_MASK) == 0);
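For reference, the effect of the two added lines is to widen whatever sub-page range was written so that it covers the entire target page. A minimal, self-contained sketch of that arithmetic, assuming 4 KiB target pages and made-up addresses (in QEMU the real TARGET_PAGE_MASK comes from the target configuration):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption for illustration only: 4 KiB target pages. */
#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~(uint64_t)((1u << TARGET_PAGE_BITS) - 1))

int main(void)
{
    /* A 4-byte guest store somewhere in the middle of a code page... */
    uint64_t start = 0x10000100, last = 0x10000103;

    /* ...is rounded out to the whole page, so every TB translated from
     * that page is treated as overlapping the write. */
    start &= TARGET_PAGE_MASK;
    last |= ~TARGET_PAGE_MASK;

    printf("start=0x%" PRIx64 " last=0x%" PRIx64 "\n", start, last);
    /* Prints: start=0x10000000 last=0x10000fff */
    return 0;
}
```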
On 8/9/24 17:47, Nicholas Piggin wrote:
> This is not a clean patch, but does fix a problem I hit with TB
> invalidation due to the target software writing to memory with TBs.
>
> Lockup messages are triggering in Linux due to page clearing taking a
> long time when a code page has been freed, because it takes a lot of
> notdirty notifiers, which massively slows things down. Linux might
> possibly have a bug here too because it seems to hang indefinitely in
> some cases, but even if it didn't, the latency of clearing these pages
> is very high.
>
> This showed when running KVM on the emulated machine, starting and
> stopping guests. That causes lots of instruction pages to be freed.
> Usually if you're just running Linux, executable pages remain in
> pagecache so you get fewer of these bombs in the kernel memory
> allocator. But page reclaim, JITs, deleting executable files, etc.,
> could trigger it too.
>
> Invalidating all TBs from the page on any hit seems to avoid the
> problem and generally speeds things up.
>
> How important is the precise invalidation? These days I assume the
> tricky kind of SMC that frequently writes code close to where it's
> executing is pretty rare and might not be something we really care
> about for performance. Could we remove sub-page TB invalidation
> entirely?

Happens on x86 and s390 regularly enough, so we can't remove it.

> @@ -1107,6 +1107,9 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>       TranslationBlock *current_tb = retaddr ? tcg_tb_lookup(retaddr) : NULL;
>   #endif /* TARGET_HAS_PRECISE_SMC */
>
> +    start &= TARGET_PAGE_MASK;
> +    last |= ~TARGET_PAGE_MASK;
> +
>       /* Range may not cross a page. */
>       tcg_debug_assert(((start ^ last) & TARGET_PAGE_MASK) == 0);

This would definitely break SMC.

However, there's a better solution. We're already iterating over all of
the TBs on the current page only. Move *everything* except the
tb_phys_invalidate__locked call into the SMC ifdef, and unconditionally
invalidate every TB selected in the loop.

We experimented with something like this for aarch64, which used to
spend a lot of the kernel startup time invalidating code pages from the
(somewhat bloated) EDK2 bios. But it turned out the bigger problem was
address space randomization, and with CF_PCREL the problem appeared to
go away.

I don't think we've done any kvm-under-tcg performance testing, but
lockup messages would certainly be something to look for...

r~
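A rough sketch of the restructuring described above. This is a toy model with invented stub types, not the actual accel/tcg/tb-maint.c code (which iterates the TBs on the written page and calls tb_phys_invalidate__locked()): only the bookkeeping needed for precise self-modifying-code detection stays under the ifdef, and every TB found on the page is invalidated unconditionally.

```c
#include <stdint.h>

/*
 * Toy model only: TBStub and the helpers below are invented for
 * illustration; the real loop lives in accel/tcg/tb-maint.c.
 */
typedef struct TBStub {
    uint64_t pc_first, pc_last;   /* guest code range covered by the TB */
    struct TBStub *next;          /* per-page list of TBs */
} TBStub;

static void stub_invalidate(TBStub *tb)
{
    (void)tb;   /* stand-in for tb_phys_invalidate__locked() */
}

static void invalidate_tbs_on_written_page(TBStub *page_tbs, TBStub *current_tb,
                                           uint64_t start, uint64_t last)
{
    for (TBStub *tb = page_tbs; tb; tb = tb->next) {
#ifdef TARGET_HAS_PRECISE_SMC
        /*
         * Only precise-SMC targets need to know whether this particular
         * write overlaps the TB that is executing right now; if it does,
         * the current TB must be exited and retranslated.
         */
        if (tb == current_tb && start <= tb->pc_last && tb->pc_first <= last) {
            /* ...arrange to restart execution after retranslation... */
        }
#else
        (void)current_tb;
        (void)start;
        (void)last;
#endif
        /* Every TB on the written page is dropped, overlap or not. */
        stub_invalidate(tb);
    }
}
```

The point of the shape above is that non-SMC targets skip the per-write overlap bookkeeping entirely, while still clearing the whole page's translations in one pass.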
On Mon Aug 12, 2024 at 11:25 AM AEST, Richard Henderson wrote:
> On 8/9/24 17:47, Nicholas Piggin wrote:
> > This is not a clean patch, but does fix a problem I hit with TB
> > invalidation due to the target software writing to memory with TBs.
> >
> > Lockup messages are triggering in Linux due to page clearing taking a
> > long time when a code page has been freed, because it takes a lot of
> > notdirty notifiers, which massively slows things down. Linux might
> > possibly have a bug here too because it seems to hang indefinitely in
> > some cases, but even if it didn't, the latency of clearing these pages
> > is very high.
> >
> > This showed when running KVM on the emulated machine, starting and
> > stopping guests. That causes lots of instruction pages to be freed.
> > Usually if you're just running Linux, executable pages remain in
> > pagecache so you get fewer of these bombs in the kernel memory
> > allocator. But page reclaim, JITs, deleting executable files, etc.,
> > could trigger it too.
> >
> > Invalidating all TBs from the page on any hit seems to avoid the
> > problem and generally speeds things up.
> >
> > How important is the precise invalidation? These days I assume the
> > tricky kind of SMC that frequently writes code close to where it's
> > executing is pretty rare and might not be something we really care
> > about for performance. Could we remove sub-page TB invalidation
> > entirely?
>
> Happens on x86 and s390 regularly enough, so we can't remove it.
>
> > @@ -1107,6 +1107,9 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
> >       TranslationBlock *current_tb = retaddr ? tcg_tb_lookup(retaddr) : NULL;
> >   #endif /* TARGET_HAS_PRECISE_SMC */
> >
> > +    start &= TARGET_PAGE_MASK;
> > +    last |= ~TARGET_PAGE_MASK;
> > +
> >       /* Range may not cross a page. */
> >       tcg_debug_assert(((start ^ last) & TARGET_PAGE_MASK) == 0);
>
> This would definitely break SMC.

They can't invalidate the instruction currently being executed? I'll
experiment a bit more.

> However, there's a better solution. We're already iterating over all of
> the TBs on the current page only. Move *everything* except the
> tb_phys_invalidate__locked call into the SMC ifdef, and unconditionally
> invalidate every TB selected in the loop.

Okay. I suspect *most* of the time even the strict SMC archs would not
be writing to the same page they're executing either. But I can start
with the !SMC case.

> We experimented with something like this for aarch64, which used to
> spend a lot of the kernel startup time invalidating code pages from the
> (somewhat bloated) EDK2 bios. But it turned out the bigger problem was
> address space randomization, and with CF_PCREL the problem appeared to
> go away.

Interesting.

> I don't think we've done any kvm-under-tcg performance testing, but
> lockup messages would certainly be something to look for...

Yeah, actually Linux is throwing the messages a bit more recently, at
least on distros that enable page clearing at alloc for security,
because that clearing is a big chunk of work that can happen in
critical sections.

Thanks for the suggestion, I'll give it a try.

Thanks,
Nick
On 8/14/24 16:09, Nicholas Piggin wrote:
>>> @@ -1107,6 +1107,9 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
>>>       TranslationBlock *current_tb = retaddr ? tcg_tb_lookup(retaddr) : NULL;
>>>   #endif /* TARGET_HAS_PRECISE_SMC */
>>>
>>> +    start &= TARGET_PAGE_MASK;
>>> +    last |= ~TARGET_PAGE_MASK;
>>> +
>>>       /* Range may not cross a page. */
>>>       tcg_debug_assert(((start ^ last) & TARGET_PAGE_MASK) == 0);
>>
>> This would definitely break SMC.
>
> They can't invalidate the instruction currently being executed?

They can. But that's where adjusting start/last would break things,
because we've lost track of exactly what's being modified, which
changes the result of "did we modify the current instruction".

r~
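To make that concrete, here is a small self-contained illustration with hypothetical addresses and an assumed 4 KiB page size: once start/last have been rounded out to the page, the "did this write touch the current instruction?" test can no longer answer "no" for anything on that page, which is exactly the information precise SMC depends on.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Does the written range [start, last] touch [insn_start, insn_last]? */
static bool touches(uint64_t start, uint64_t last,
                    uint64_t insn_start, uint64_t insn_last)
{
    return start <= insn_last && insn_start <= last;
}

int main(void)
{
    const uint64_t page_mask = ~(uint64_t)0xfff;   /* assume 4 KiB pages */

    /* Write hits page offset 0x100; the executing insn sits at 0x800. */
    uint64_t start = 0x10000100, last = 0x10000103;
    uint64_t insn_start = 0x10000800, insn_last = 0x10000803;

    printf("precise ranges: modified current insn? %d\n",
           touches(start, last, insn_start, insn_last));   /* prints 0 */

    /* After the RFC's rounding, the whole page looks modified... */
    start &= page_mask;
    last |= ~page_mask;
    printf("rounded ranges: modified current insn? %d\n",
           touches(start, last, insn_start, insn_last));   /* prints 1 */

    return 0;
}
```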
```diff
diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
index cc0f5afd47..d9a76b1665 100644
--- a/accel/tcg/tb-maint.c
+++ b/accel/tcg/tb-maint.c
@@ -1107,6 +1107,9 @@ tb_invalidate_phys_page_range__locked(struct page_collection *pages,
     TranslationBlock *current_tb = retaddr ? tcg_tb_lookup(retaddr) : NULL;
 #endif /* TARGET_HAS_PRECISE_SMC */
 
+    start &= TARGET_PAGE_MASK;
+    last |= ~TARGET_PAGE_MASK;
+
     /* Range may not cross a page. */
     tcg_debug_assert(((start ^ last) & TARGET_PAGE_MASK) == 0);
```