
[0/5] KVM: Put struct kvm_vcpu on a diet

Message ID 20230213163351.30704-1-minipli@grsecurity.net (mailing list archive)

Commit Message

Mathias Krause Feb. 13, 2023, 4:33 p.m. UTC
Relayout members of struct kvm_vcpu and embedded structs to reduce its
memory footprint. Not that it makes sense from a memory usage point of
view (given how few such objects get allocated), but this series
manages to make it consume two cachelines less, which should provide a
micro-architectural net win. However, I wasn't able to see a noticeable
difference running benchmarks within a guest VM -- the VMEXIT costs are
likely still high enough to mask any gains. Pointers to other benchmark
tools / tests are highly appreciated!

I made sure that the mutex and spinlocks still stick to their own
disjoint cachelines, so this series shouldn't cause any regression in
that regard.

Below is the high level pahole(1) diff. Most significant is the overall
size change from 6688 to 6560 bytes, i.e. -128 bytes.



There are still holes left, i.e. still room for squeezing out a few more
bytes. But I didn't want to destroy the grouping of members nor sprinkle
random structs with __packed just for that. That said, we might still
want to do it to save yet another cacheline (only 32 more bytes to go).
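
To illustrate the kind of reordering this series does -- not one of the
actual KVM structs, just a minimal example (kernel types assumed) of
moving a member so an alignment hole disappears without __packed:

	struct example_before {
		u64	a;	/* offset  0, size 8 */
		int	b;	/* offset  8, size 4 */
		/* 4-byte hole */
		u64	c;	/* offset 16, size 8 */
		int	d;	/* offset 24, size 4 */
		/* 4 bytes of tail padding */
	};			/* size: 32 */

	struct example_after {
		u64	a;	/* offset  0, size 8 */
		u64	c;	/* offset  8, size 8 */
		int	b;	/* offset 16, size 4 */
		int	d;	/* offset 20, size 4 */
	};			/* size: 24 */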

Code-wise it seems to save a few instructions as well, likely because of
patch 2 simplifying some of the struct kvm_queued_exception users:

$ size kvm-6.2-rc5{,+patch}/arch/x86/kvm/kvm.ko
   text    data     bss     dec     hex filename
 806277  129095    1168  936540   e4a5c kvm-6.2-rc5/arch/x86/kvm/kvm.ko
 806026  129063    1168  936257   e4941 kvm-6.2-rc5+patch/arch/x86/kvm/kvm.ko

$ bloat-o-meter kvm-6.2-rc5{,+patch}/arch/x86/kvm/kvm.ko
add/remove: 0/2 grow/shrink: 7/11 up/down: 102/-340 (-238)
Function                                     old     new   delta
kvm_vcpu_ioctl_x86_set_vcpu_events           664     707     +43
kvm_multiple_exception                       581     613     +32
kvm_inject_exception                         153     161      +8
kvm_vcpu_ioctl_x86_get_vcpu_events           510     517      +7
__UNIQUE_ID_vermagic166                       67      73      +6
kvm_inject_page_fault                        108     113      +5
kvm_mmu_x86_module_init                      174     175      +1
kvm_vcpu_ready_for_interrupt_injection       134     131      -3
kvm_vcpu_ioctl                              1717    1714      -3
kvm_vcpu_compat_ioctl                        274     271      -3
kvm_sigset_activate                           53      50      -3
kvm_can_do_async_pf                          150     147      -3
kvm_mtrr_get_msr                             290     286      -4
kvm_deliver_exception_payload                139     135      -4
kvm_arch_vcpu_ioctl_set_guest_debug          662     657      -5
x86_emulate_instruction                     1952    1945      -7
kvm_vcpu_reset                              1382    1375      -7
__pfx_kvm_deliver_exception_payload.part      16       -     -16
kvm_deliver_exception_payload.part           122       -    -122
kvm_arch_vcpu_ioctl_run                     7782    7622    -160
Total: Before=635427, After=635189, chg -0.04%


Please apply!

Thanks,

Mathias Krause (5):
  KVM: x86: Shrink struct kvm_pmu
  KVM: x86: Shrink struct kvm_queued_exception
  KVM: Shrink struct kvm_mmu_memory_cache
  KVM: x86: Shrink struct kvm_vcpu_arch
  KVM: Shrink struct kvm_vcpu

 arch/x86/include/asm/kvm_host.h | 30 +++++++++++++++---------------
 include/linux/kvm_host.h        |  6 +++---
 include/linux/kvm_types.h       |  2 +-
 3 files changed, 19 insertions(+), 19 deletions(-)

Comments

Sean Christopherson Feb. 13, 2023, 5:05 p.m. UTC | #1
On Mon, Feb 13, 2023, Mathias Krause wrote:
> Relayout members of struct kvm_vcpu and embedded structs to reduce its
> memory footprint. Not that it makes sense from a memory usage point of
> view (given how few such objects get allocated), but this series
> manages to make it consume two cachelines less, which should provide a
> micro-architectural net win. However, I wasn't able to see a noticeable
> difference running benchmarks within a guest VM -- the VMEXIT costs are
> likely still high enough to mask any gains.

...

> Below is the high level pahole(1) diff. Most significant is the overall
> size change from 6688 to 6560 bytes, i.e. -128 bytes.

While part of me wishes KVM were more careful about struct layouts, IMO fiddling
with per vCPU or per VM structures isn't worth the ongoing maintenance cost.

Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses
a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation,
the memory savings are negligible in the grand scheme.  Assuming the kernel is
even capable of perfectly packing vCPU allocations, saving even a few hundred bytes
per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that
point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again
uninteresting.

And as you observed, imperfect struct layouts are highly unlikely to have a
measurable impact on performance.  The types of operations that are involved in
a world switch are just too costly for the layout to matter much.  I do like to
shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
more performant, doesn't require ongoing maintenance, and/or also improves the code
quality.

I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
a sub-optimal layout and the change is arguably warranted even without the change
in size.  Ditto for kvm_pmu, logically I think it makes sense to have the version
at the very top.

But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the
cost in the long term unless there are truly egregious fields.
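
To spell out the tradeoff with a rough sketch (simplified, not the actual
KVM definitions; kernel types assumed):

	struct qex_bools {			/* plain bools */
		bool		pending;	/*  0  1 */
		bool		injected;	/*  1  1 */
		bool		has_error_code;	/*  2  1 */
		u8		vector;		/*  3  1 */
		u32		error_code;	/*  4  4 */
		unsigned long	payload;	/*  8  8 */
		bool		has_payload;	/* 16  1, 7 bytes tail padding */
	};					/* size: 24 */

	struct qex_bitfields {			/* flags packed into bitfields */
		u8		pending : 1,
				injected : 1,
				has_error_code : 1,
				has_payload : 1; /*  0  1 */
		u8		vector;		/*  1  1, 2-byte hole */
		u32		error_code;	/*  4  4 */
		unsigned long	payload;	/*  8  8 */
	};					/* size: 16 */

The bytes saved are real, but every flag update becomes a read-modify-write
of the shared byte and the flags can no longer be addressed or assigned as
plain bools, which is the readability/maintainability cost I'm objecting to.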
Mathias Krause Feb. 14, 2023, 12:19 p.m. UTC | #2
On 13.02.23 18:05, Sean Christopherson wrote:
> On Mon, Feb 13, 2023, Mathias Krause wrote:
>> Relayout members of struct kvm_vcpu and embedded structs to reduce its
>> memory footprint. Not that it makes sense from a memory usage point of
>> view (given how few such objects get allocated), but this series
>> manages to make it consume two cachelines less, which should provide a
>> micro-architectural net win. However, I wasn't able to see a noticeable
>> difference running benchmarks within a guest VM -- the VMEXIT costs are
>> likely still high enough to mask any gains.
> 
> ...
> 
>> Below is the high level pahole(1) diff. Most significant is the overall
>> size change from 6688 to 6560 bytes, i.e. -128 bytes.
> 
> While part of me wishes KVM were more careful about struct layouts, IMO fiddling
> with per vCPU or per VM structures isn't worth the ongoing maintenance cost.
> 
> Unless the size of the vCPU allocation (vcpu_vmx or vcpu_svm in x86 land) crosses
> a meaningful boundary, e.g. drops the size from an order-3 to order-2 allocation,
> the memory savings are negligible in the grand scheme.  Assuming the kernel is
> even capable of perfectly packing vCPU allocations, saving even a few hundred bytes
> per vCPU is uninteresting unless the vCPU count gets reaaally high, and at that
> point the host likely has hundreds of GiB of memory, i.e. saving a few KiB is again
> uninteresting.

Fully agree! That's why I said this change makes no sense from a memory
usage point of view. The overall memory savings are not visible at all,
given that the slab allocator isn't able to put more vCPU objects
in a given slab page. However, I still remain confident that this makes
sense from a uarch point of view. Touching less cache lines should be a
win -- even if I'm unable to measure it. By preserving more cachelines
during a VMEXIT, guests should be able to resume their work faster
(assuming they still need these cachelines).

> And as you observed, imperfect struct layouts are highly unlikely to have a
> measurable impact on performance.  The types of operations that are involved in
> a world switch are just too costly for the layout to matter much.  I do like to
> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> more performant, doesn't require ongoing maintenance, and/or also improves the code
> quality.

Any pointers to measure the "more performant" aspect? I tried to make
use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's already
counting cycles, but the numbers are too unstable, even if I pin the
test to a given CPU, disable turbo mode, SMT, use the performance cpu
governor, etc.

> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
> a sub-optimal layout and the change is arguably warranted even without the change
> in size.  Ditto for kvm_pmu, logically I think it makes sense to have the version
> at the very top.

Yeah, was exactly thinking the same when modifying kvm_pmu.

> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the
> cost in the long term unless there are truly egregious fields.

Heh, just found this gem in vcpu_vmx:

struct vcpu_vmx {
  [...]
  union vmx_exit_reason      exit_reason;

  /* XXX 44 bytes hole, try to pack */

  /* --- cacheline 123 boundary (7872 bytes) --- */
  struct pi_desc             pi_desc __attribute__((__aligned__(64)));
  [...]

So there are, in fact, some bigger holes left.

Would be nice if pahole had a --density flag that would output some
ASCII art, visualizing which bytes of a struct are allocated by real
members and which ones are pure padding.
Sean Christopherson Feb. 16, 2023, 5:32 p.m. UTC | #3
On Tue, Feb 14, 2023, Mathias Krause wrote:
> On 13.02.23 18:05, Sean Christopherson wrote:
> However, I still remain confident that this makes sense from a uarch point of
> view. Touching less cache lines should be a win -- even if I'm unable to
> measure it. By preserving more cachelines during a VMEXIT, guests should be
> able to resume their work faster (assuming they still need these cachelines).

Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in
fewer cache lines being touched, at least not in a persistent, maintainable way.
While every VM-Exit accesses a lot of state, it's most definitely still a small
percentage of kvm_vcpu.  And for the VM-Exits that really benefit from optimized
handling, a.k.a. the fastpath exits, that percentage is even lower.

On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
kvm_vcpu_arch, and vcpu_{vmx,svm}.  The majority of fields accessed on every VM-Exit
are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in
the common structures, e.g. registers in kvm_vcpu_arch and things like requests
in kvm_vcpu.

Naively compacting any of the three (well, four) structures might result in fewer
cache lines being touched, but mostly by coincidence.  E.g. because compacting
part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the
compaction itself.

If we really wanted to optimize cache line usage, our best bet would be to identify
the fields that are accessed in the fastpath run loop and then pack those fields
into as few cache lines as possible.  And to do that in a maintainable way, we'd
probably need something like ASI[*] to allow us to detect changes that result in
the fastpath handling accessing fields that aren't in the cache-optimized region.

I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
very, very low.  For optimized workloads, there simply aren't very many VM-Exits,
e.g. the majority of exits on a modern CPU are due to timer ticks.  And even those
will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
allows for a tickless host.
 
https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com

> > And as you observed, imperfect struct layouts are highly unlikely to have a
> > measurable impact on performance.  The types of operations that are involved in
> > a world switch are just too costly for the layout to matter much.  I do like to
> > shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
> > more performant, doesn't require ongoing maintenance, and/or also improves the code
> > quality.
> 
> Any pointers to measure the "more performant" aspect?

TL;DR: not really.

What I've done in the past is run a very tight loop in the guest, and then measure
latency from the host by hacking KVM.  Measuring from the guest works, e.g. we have
a variety of selftests that do exactly that, but when trying to shave cycles in
the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to
skew the measurement.  My quick-and-dirty solution has always been to just hack
KVM to measure latency with IRQs disabled, but a more maintainable approach would
be to add smarts somewhere to sanitize the results, e.g. to throw away outliers
where the guest likely got interrupted.

I believe we've talked about adding a selftest to measure fastpath latency, e.g.
by writing MSR_IA32_TSC_DEADLINE in a tight loop.
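
Roughly along these lines (untested sketch, guest-side only; it uses CPUID
as a stand-in for a guaranteed exit and takes the minimum over many
iterations so that host IRQs and other noise mostly fall out):

	#include <stdint.h>

	static inline uint64_t rdtsc_serialized(void)
	{
		uint32_t lo, hi;

		/* lfence keeps rdtsc from being reordered around the work */
		asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	static uint64_t min_exit_roundtrip(unsigned int iters)
	{
		uint64_t best = UINT64_MAX;

		while (iters--) {
			uint32_t eax = 0, ebx, ecx = 0, edx;
			uint64_t t0 = rdtsc_serialized();

			/* CPUID unconditionally triggers a VM-Exit */
			asm volatile("cpuid"
				     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));

			uint64_t t1 = rdtsc_serialized();

			if (t1 - t0 < best)
				best = t1 - t0;
		}
		return best;
	}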

However, that's not going to be useful in this case since you are interested in
measuring the impact of reducing the host's L1D footprint.  If the guest isn't
cache-constrained, reducing the host's cache footprint isn't going to impact
performance since there's no contention.

Running a micro benchmark in the guest that aims to measure cache performance might
work, but presumably those are all userspace tests, i.e. you'd end up measuring
the impact of the guest kernel too.  And they wouldn't consistently trigger VM-Exits,
so it would be difficult to prove the validity of the results.

I suppose you could write a dedicated selftest or port a micro benchmark to run
as a selftest (or KVM-unit-test)?

>  I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
>  already counting cycles, but the numbers are too unstable, even if I pin the
>  test to a given CPU, disable turbo mode, SMT, use the performance cpu
>  governor, etc.

Heh, you might have picked quite possibly the worst way to measure VM-Exit
performance :-)

The guest code in that test that's measuring latency runs at L2.  For VMREADs
and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur
(the access will be handled purely in ucode).  And for accesses that do cause a
VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_
heavy path (~10k cycles).  Even if the exit is handled by KVM (in L0), it's still
a relatively slow, heavy path.

> > I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
> > a sub-optimal layout and the change is arguably warranted even without the change
> > in size.  Ditto for kvm_pmu, logically I think it makes sense to have the version
> > at the very top.
> 
> Yeah, was exactly thinking the same when modifying kvm_pmu.
> 
> > But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
> > fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the
> > cost in the long term unless there are truly egregious fields.
> 
> Heh, just found this gem in vcpu_vmx:
> 
> struct vcpu_vmx {
>   [...]
>   union vmx_exit_reason      exit_reason;
> 
>   /* XXX 44 bytes hole, try to pack */
> 
>   /* --- cacheline 123 boundary (7872 bytes) --- */
>   struct pi_desc             pi_desc __attribute__((__aligned__(64)));
>   [...]
> 
> So there are, in fact, some bigger holes left.

Ya.  Again, I'm definitely ok cleaning up the truly heinous warts and/or doing
a targeted, deliberate refactor of structures.  What I don't want to do is
shuffle fields around purely to save a few bytes here and there.
Mathias Krause Feb. 17, 2023, 4:07 p.m. UTC | #4
On 16.02.23 18:32, Sean Christopherson wrote:
> On Tue, Feb 14, 2023, Mathias Krause wrote:
>> On 13.02.23 18:05, Sean Christopherson wrote:
>> However, I still remain confident that this makes sense from a uarch point of
>> view. Touching less cache lines should be a win -- even if I'm unable to
>> measure it. By preserving more cachelines during a VMEXIT, guests should be
>> able to resume their work faster (assuming they still need these cachelines).
> 
> Yes, but I'm skeptical that compacting struct kvm_vcpu will actually result in
> fewer cache lines being touched, at least not in a persistent, maintainable way.
> While every VM-Exit accesses a lot of state, it's most definitely still a small
> percentage of kvm_vcpu.  And for the VM-Exits that really benefit from optimized
> handling, a.k.a. the fastpath exits, that percentage is even lower.

Yeah, that's probably true.

> On x86, kvm_vcpu is actually comprised of three "major" structs: kvm_vcpu,
> kvm_vcpu_arch, and vcpu_{vmx,svm}.  The majority of fields accessed on every VM-Exit
> are in the vendor part, vcpu_{vmx,svm}, but there are still high-touch fields in
> the common structures, e.g. registers in kvm_vcpu_arch and things like requests
> in kvm_vcpu.
> 
> Naively compacting any of the three (well, four) structures might result in fewer
> cache lines being touched, but mostly by coincidence.  E.g. because compacting
> part of kvm_vcpu_arch happened to better align vcpu_vmx, not because of the
> compaction itself.

Fortunately, kvm_vcpu is embedded as the first member of vcpu_{vmx,svm},
so all three share a common "header." Optimizations done for kvm_vcpu
will therefore benefit the vendor-specific structures too. However,
you're right that this implicitly changes the layout of the remainder of
vcpu_{vmx,svm} and might even have a negative impact on cacheline usage.
But as my changes chop off exactly 128 bytes from kvm_vcpu, that's not
the case here. Still, I can see that this is "coincidence" and fragile
in the long run.
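
(For reference, the embedding looks roughly like this -- kvm_vcpu as the
leading member, with container_of() to get back to the vendor struct:)

	struct vcpu_vmx {
		struct kvm_vcpu	vcpu;	/* common part, embedded first */
		/* ... VMX-specific state ... */
	};

	static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
	{
		return container_of(vcpu, struct vcpu_vmx, vcpu);
	}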

> If we really wanted to optimize cache line usage, our best bet would be to identify
> the fields that are accessed in the fastpath run loop and then pack those fields
> into as few cache lines as possible.  And to do that in a maintainable way, we'd
> probably need something like ASI[*] to allow us to detect changes that result in
> the fastpath handling accessing fields that aren't in the cache-optimized region.
> 

> I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
> very, very low.  For optimized workloads, there simply aren't very many VM-Exits,
> e.g. the majority of exits on a modern CPU are due to timer ticks.  And even those
> will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
> virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
> allows for a tickless host.

Well, for guests running grsecurity kernels, there's also the CR0.WP
toggling triggering VMEXITs, which happens a lot! -- at least until
something along the lines of [1] gets merged *hint ;)*

[1]
https://lore.kernel.org/all/20230201194604.11135-1-minipli@grsecurity.net/

>  
> https://lore.kernel.org/all/20220223052223.1202152-1-junaids@google.com

Heh, that RFC is from February last year and it looks like it stalled at
that point. But I guess you only meant patch 44 anyway, which splits up
kvm_vcpu_arch:
https://lore.kernel.org/all/20220223052223.1202152-45-junaids@google.com/.
It does that for other purposes, though, which might conflict with the
performance aspect I'm mostly after here. Anyway, I got your point. If
we care about cacheline footprint, we should do a more radical change
and group hot members together instead of simply shrinking the structs
involved.
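
Something like this, I guess (purely hypothetical sketch; the field names
are only examples of what might count as "hot"):

	/* state the fastpath run loop touches on every exit */
	struct vcpu_hot {
		u64		requests;
		int		mode;
		unsigned long	regs_avail;
		unsigned long	regs_dirty;
	} ____cacheline_aligned;

	/* setup and slow-path state, laid out for readability */
	struct vcpu_cold {
		unsigned int	halt_poll_ns;
		char		stats_id[48];
	};

	struct vcpu_split {
		struct vcpu_hot		hot;
		struct vcpu_cold	cold;
	};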

>>> And as you observed, imperfect struct layouts are highly unlikely to have a
>>> measurable impact on performance.  The types of operations that are involved in
>>> a world switch are just too costly for the layout to matter much.  I do like to
>>> shave cycles in the VM-Enter/VM-Exit paths, but only when a change is inarguably
>>> more performant, doesn't require ongoing maintenance, and/or also improves the code
>>> quality.
>>
>> Any pointers to measure the "more performant" aspect?
> 
> TL;DR: not really.
> 
> What I've done in the past is run a very tight loop in the guest, and then measure
> latency from the host by hacking KVM.  Measuring from the guest works, e.g. we have
> a variety of selftests that do exactly that, but when trying to shave cycles in
> the VM-Exit path, it doesn't take many host IRQs arriving at the "wrong" time to
> skew the measurement.  My quick-and-dirty solution has always been to just hack
> KVM to measure latency with IRQs disabled, but a more maintainable approach would
> be to add smarts somewhere to sanitize the results, e.g. to throw away outliers
> where the guest likely got interrupted.
> 
> I believe we've talked about adding a selftest to measure fastpath latency, e.g.
> by writing MSR_IA32_TSC_DEADLINE in a tight loop.
> 
> However, that's not going to be useful in this case since you are interested in
> measuring the impact of reducing the host's L1D footprint.  If the guest isn't
> cache-constrained, reducing the host's cache footprint isn't going to impact
> performance since there's no contention.

Yeah, it's hard to find a test case measuring the gains. I looked into
running Linux userland workloads initially, but saw no real impact, as
the stddev was already too high. But, as you pointed out, a
micro-benchmark is of no use either, so it's all hand-waving only. :D

> Running a micro benchmark in the guest that aims to measure cache performance might
> work, but presumably those are all userspace tests, i.e. you'd end up measuring
> the impact of the guest kernel too.  And they wouldn't consistently trigger VM-Exits,
> so it would be difficult to prove the validity of the results.

Jepp. It's all just gut feeling, unfortunately.

> I suppose you could write a dedicated selftest or port a micro benchmark to run
> as a selftest (or KVM-unit-test)?
> 
>>  I tried to make use of the vmx_vmcs_shadow_test in kvm-unit-tests, as it's
>>  already counting cycles, but the numbers are too unstable, even if I pin the
>>  test to a given CPU, disable turbo mode, SMT, use the performance cpu
>>  governor, etc.
> 
> Heh, you might have picked quite possibly the worst way to measure VM-Exit
> performance :-)
> 
> The guest code in that test that's measuring latency runs at L2.  For VMREADs
> and VMWRITEs that are "passed-through" all the way to L2, no VM-Exit will occur
> (the access will be handled purely in ucode).  And for accesses that do cause a
> VM-Exit, I'm pretty sure they all result in a nested VM-Exit, which is a _very_
> heavy path (~10k cycles).  Even if the exit is handled by KVM (in L0), it's still
> a relatively slow, heavy path.

I see. I'll have a look at the selftests and see if I can repurpose one
of them. But, as you noted, a microbenchmark might not be what I'm
after. It's more about identifying the usage patterns of hot VMEXIT
paths and optimizing those.

>>> I am in favor of cleaning up kvm_mmu_memory_cache as there's no reason to carry
>>> a sub-optimal layout and the change is arguably warranted even without the change
>>> in size.  Ditto for kvm_pmu, logically I think it makes sense to have the version
>>> at the very top.
>>
>> Yeah, was exactly thinking the same when modifying kvm_pmu.
>>
>>> But I dislike using bitfields instead of bools in kvm_queued_exception, and shuffling
>>> fields in kvm_vcpu, kvm_vcpu_arch, vcpu_vmx, vcpu_svm, etc. just isn't worth the
>>> cost in the long term unless there are truly egregious fields.
>>
>> Heh, just found this gem in vcpu_vmx:
>>
>> struct vcpu_vmx {
>>   [...]
>>   union vmx_exit_reason      exit_reason;
>>
>>   /* XXX 44 bytes hole, try to pack */
>>
>>   /* --- cacheline 123 boundary (7872 bytes) --- */
>>   struct pi_desc             pi_desc __attribute__((__aligned__(64)));
>>   [...]
>>
>> So there are, in fact, some bigger holes left.
> 
> Ya.  Again, I'm definitely ok cleaning up the truly heinous warts and/or doing
> a targeted, deliberate refactor of structures.  What I don't want to do is
> shuffle fields around purely to save a few bytes here and there.

Got it. I'll back out the reshuffling ones and only keep the ones for
kvm_pmu and kvm_mmu_memory_cache, as these are more like straight cleanups.

Thanks,
Mathias
Sean Christopherson Feb. 17, 2023, 4:35 p.m. UTC | #5
On Fri, Feb 17, 2023, Mathias Krause wrote:
> On 16.02.23 18:32, Sean Christopherson wrote:
> > I'm not necessarily opposed to such aggressive optimization, but the ROI is likely
> > very, very low.  For optimized workloads, there simply aren't very many VM-Exits,
> > e.g. the majority of exits on a modern CPU are due to timer ticks.  And even those
> > will hopefully be eliminated in the not-too-distant future, e.g. by having hardware
> > virtualize the TSC deadline timer, and by moving to a vCPU scheduling scheme that
> > allows for a tickless host.
> 
> Well, for guests running grsecurity kernels, there's also the CR0.WP
> toggling triggering VMEXITs, which happens a lot! -- at least until
> something along the lines of [1] gets merged *hint ;)*

Ha!  It's high on my todo list for 6.4, catching up on other stuff at the moment.

That series is also _exactly_ why the ROI for aggressive cache line optimization
is low.  The better long term answer is almost always to avoid the VM-Exit in the
first place, or failing that, to handle the exit in a fastpath.  Sometimes it takes
a few years, e.g. to get necessary hardware support, but x86 virtualization is fast
approaching the point where anything remotely performance critical is handled entirely
within the guest.

Patch

--- kvm_vcpu.before	2023-02-13 14:13:49.919952154 +0100
+++ kvm_vcpu.after	2023-02-13 14:13:53.559952140 +0100
@@ -6,78 +6,60 @@ 
 	int                        vcpu_idx;             /*    40     4 */
 	int                        ____srcu_idx;         /*    44     4 */
 	int                        mode;                 /*    48     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
+	unsigned int               guest_debug;          /*    52     4 */
 	u64                        requests;             /*    56     8 */
 	/* --- cacheline 1 boundary (64 bytes) --- */
-	long unsigned int          guest_debug;          /*    64     8 */
-	struct mutex               mutex;                /*    72    32 */
-	struct kvm_run *           run;                  /*   104     8 */
-	struct rcuwait             wait;                 /*   112     8 */
-	struct pid *               pid;                  /*   120     8 */
+	struct mutex               mutex;                /*    64    32 */
+	struct kvm_run *           run;                  /*    96     8 */
+	struct rcuwait             wait;                 /*   104     8 */
+	struct pid *               pid;                  /*   112     8 */
+	sigset_t                   sigset;               /*   120     8 */
 	/* --- cacheline 2 boundary (128 bytes) --- */
 	int                        sigset_active;        /*   128     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
-	sigset_t                   sigset;               /*   136     8 */
-	unsigned int               halt_poll_ns;         /*   144     4 */
-	bool                       valid_wakeup;         /*   148     1 */
+	unsigned int               halt_poll_ns;         /*   132     4 */
+	bool                       valid_wakeup;         /*   136     1 */
 
 	/* XXX 3 bytes hole, try to pack */
 
-	int                        mmio_needed;          /*   152     4 */
-	int                        mmio_read_completed;  /*   156     4 */
-	int                        mmio_is_write;        /*   160     4 */
-	int                        mmio_cur_fragment;    /*   164     4 */
-	int                        mmio_nr_fragments;    /*   168     4 */
-
-	/* XXX 4 bytes hole, try to pack */
-
-	struct kvm_mmio_fragment   mmio_fragments[2];    /*   176    48 */
-	/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
+	int                        mmio_needed;          /*   140     4 */
+	int                        mmio_read_completed;  /*   144     4 */
+	int                        mmio_is_write;        /*   148     4 */
+	int                        mmio_cur_fragment;    /*   152     4 */
+	int                        mmio_nr_fragments;    /*   156     4 */
+	struct kvm_mmio_fragment   mmio_fragments[2];    /*   160    48 */
+	/* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
 	struct {
-		u32                queued;               /*   224     4 */
-
-		/* XXX 4 bytes hole, try to pack */
-
-		struct list_head   queue;                /*   232    16 */
-		struct list_head   done;                 /*   248    16 */
-		/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
-		spinlock_t         lock;                 /*   264     4 */
-	} async_pf;                                      /*   224    48 */
-
-	/* XXX last struct has 4 bytes of padding */
-
+		struct list_head   queue;                /*   208    16 */
+		struct list_head   done;                 /*   224    16 */
+		spinlock_t         lock;                 /*   240     4 */
+		u32                queued;               /*   244     4 */
+	} async_pf;                                      /*   208    40 */
 	struct {
-		bool               in_spin_loop;         /*   272     1 */
-		bool               dy_eligible;          /*   273     1 */
-	} spin_loop;                                     /*   272     2 */
-	bool                       preempted;            /*   274     1 */
-	bool                       ready;                /*   275     1 */
+		bool               in_spin_loop;         /*   248     1 */
+		bool               dy_eligible;          /*   249     1 */
+	} spin_loop;                                     /*   248     2 */
+	bool                       preempted;            /*   250     1 */
+	bool                       ready;                /*   251     1 */
 
 	/* XXX 4 bytes hole, try to pack */
 
-	struct kvm_vcpu_arch       arch __attribute__((__aligned__(8))); /*   280  5208 */
-
-	/* XXX last struct has 6 bytes of padding */
-
-	/* --- cacheline 85 boundary (5440 bytes) was 48 bytes ago --- */
-	struct kvm_vcpu_stat       stat;                 /*  5488  1104 */
-	/* --- cacheline 103 boundary (6592 bytes) --- */
-	char                       stats_id[48];         /*  6592    48 */
-	struct kvm_dirty_ring      dirty_ring;           /*  6640    32 */
+	/* --- cacheline 4 boundary (256 bytes) --- */
+	struct kvm_vcpu_arch       arch __attribute__((__aligned__(8))); /*   256  5104 */
+	/* --- cacheline 83 boundary (5312 bytes) was 48 bytes ago --- */
+	struct kvm_vcpu_stat       stat;                 /*  5360  1104 */
+	/* --- cacheline 101 boundary (6464 bytes) --- */
+	char                       stats_id[48];         /*  6464    48 */
+	struct kvm_dirty_ring      dirty_ring;           /*  6512    32 */
 
 	/* XXX last struct has 4 bytes of padding */
 
-	/* --- cacheline 104 boundary (6656 bytes) was 16 bytes ago --- */
-	struct kvm_memory_slot *   last_used_slot;       /*  6672     8 */
-	u64                        last_used_slot_gen;   /*  6680     8 */
-
-	/* size: 6688, cachelines: 105, members: 33 */
-	/* sum members: 6669, holes: 5, sum holes: 19 */
-	/* paddings: 3, sum paddings: 14 */
+	/* --- cacheline 102 boundary (6528 bytes) was 16 bytes ago --- */
+	struct kvm_memory_slot *   last_used_slot;       /*  6544     8 */
+	u64                        last_used_slot_gen;   /*  6552     8 */
+
+	/* size: 6560, cachelines: 103, members: 33 */
+	/* sum members: 6553, holes: 2, sum holes: 7 */
+	/* paddings: 1, sum paddings: 4 */
 	/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
 	/* last cacheline: 32 bytes */
 } __attribute__((__aligned__(8)));