
KVM: x86: Reduce retpoline performance impact in slot_handle_level_range()

Message ID CA+55aFwqCc1UOkdBjEgiuL9fMkRj7RPrGFYQ376YhrqTUpLwrg@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Linus Torvalds Feb. 2, 2018, 7:10 p.m. UTC
On Fri, Feb 2, 2018 at 10:50 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Will it make for bigger code? Yes. But probably not really all *that*
> much bigger, because of how it also will allow the compiler to
> simplify some things.

Actually, testing this with my fairly minimal config, it actually
makes for *smaller* code to inline those things.

That may be a quirk of my configuration, or maybe I screwed something
else up, but:

  [torvalds@i7 linux]$ size ~/mmu.o arch/x86/kvm/mmu.o
     text    data     bss     dec     hex filename
    85587    9310     120   95017   17329 /home/torvalds/mmu.o
    85531    9310     120   94961   172f1 arch/x86/kvm/mmu.o

so the attached patch actually shrank things down by about 50 bytes
because of the code simplification.

Of course, I have been known to screw up retpoline testing in the
past, so my numbers are suspect ;). Somebody should double-check me.

                    Linus
arch/x86/kvm/mmu.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

David Woodhouse Feb. 2, 2018, 7:17 p.m. UTC | #1
On Fri, 2018-02-02 at 11:10 -0800, Linus Torvalds wrote:
> On Fri, Feb 2, 2018 at 10:50 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > 
> > Will it make for bigger code? Yes. But probably not really all *that*
> > much bigger, because of how it also will allow the compiler to
> > simplify some things.
>
> Actually, testing this with my fairly minimal config, it actually
> makes for *smaller* code to inline those things.
> 
> That may be a quirk of my configuration, or maybe I screwed something
> else up, but:
> 
>   [torvalds@i7 linux]$ size ~/mmu.o arch/x86/kvm/mmu.o
>      text    data     bss     dec     hex filename
>     85587    9310     120   95017   17329 /home/torvalds/mmu.o
>     85531    9310     120   94961   172f1 arch/x86/kvm/mmu.o
> 
> so the attached patch actually shrank things down by about 50 bytes
> because of the code simplification.
> 
> Of course, I have been known to screw up retpoline testing in the
> past, so my numbers are suspect ;). Somebody should double-check me.

I got this:

   text	   data	    bss	    dec	    hex	filename
  87167	   9310	    120	  96597	  17955	arch/x86/kvm/mmu.o
  88299	   9310	    120	  97729	  17dc1	arch/x86/kvm/mmu-inline.o

But then, I'd also done kvm_handle_hva() and kvm_handle_hva_range().

Either way, that does look like a reasonable answer. I had looked at
the various one-line wrappers around slot_handle_level_range() and
thought "hm, those should be inline", but I hadn't made the next step
and pondered putting the whole thing inline. We'll give it a spin and
work out where the next performance bottleneck is. Thanks.
Alan Cox Feb. 2, 2018, 9:23 p.m. UTC | #2
> Either way, that does look like a reasonable answer. I had looked at
> the various one-line wrappers around slot_handle_level_range() and
> thought "hm, those should be inline", but I hadn't made the next step
> and pondered putting the whole thing inline. We'll give it a spin and
> work out where the next performance bottleneck is. Thanks.

In addition the problem with switch() is that gcc might decide in some
cases that the best way to implement your switch is an indirect call
from a lookup table.

For the simple case how about wrapping the if into

                call_likely(foo->bar, usualfunction, args)

as a companion to 

                 foo->bar(args)

that can resolve to nothing special on architectures that don't need it,
an if/else case on platforms with spectre, and potentially clever
stuff on any platform where you can beat the compiler by knowing
probabilities it can't infer?

Alan
David Woodhouse Feb. 3, 2018, 2:46 p.m. UTC | #3
On Fri, 2018-02-02 at 21:23 +0000, Alan Cox wrote:
> In addition the problem with switch() is that gcc might decide in some
> cases that the best way to implement your switch is an indirect call
> from a lookup table.

That's also true of the

  if (ptr == usualfunction)
     usualfunction();
  else
     (*ptr)();

construct. Especially if GCC doesn't take into account the increased
cost of indirect branches with retpoline.

> For the simple case how about wrapping the if into
> 
>                 call_likely(foo->bar, usualfunction, args)
> 
> as a companion to 
> 
>                  foo->bar(args)
> 
> that can resolve to nothing special on architectures that don't need it,
> an if/else case on platforms with spectre, and potentially clever
> stuff on any platform where you can beat the compiler by knowing
> probabilities it can't infer?

Yeah. I'm keen on being able to use something like alternatives to
*change* 'usualfunction' at runtime though. I suspect it'll be a win
for stuff like dma_ops.

But I'm also keen to actually base such things on real data, not just
go randomly "optimising" stuff just because we can. Let's try to make
sure we fix up the real bottlenecks, and not just go crazy.
Peter Zijlstra Feb. 5, 2018, 8:51 a.m. UTC | #4
On Sat, Feb 03, 2018 at 02:46:47PM +0000, David Woodhouse wrote:
> > For the simple case how about wrapping the if into
> > 
> >                 call_likely(foo->bar, usualfunction, args)
> > 
> > as a companion to 
> > 
> >                  foo->bar(args)
> > 
> > that can resolve to nothing special on architectures that don't need it,
> > an if/else case on platforms with spectre, and potentially clever
> > stuff on any platform where you can beat the compiler by knowing
> > probabilities it can't infer?
> 
> Yeah. I'm keen on being able to use something like alternatives to
> *change* 'usualfunction' at runtime though. I suspect it'll be a win
> for stuff like dma_ops.
> 
> But I'm also keen to actually base such things on real data, not just
> go randomly "optimising" stuff just because we can. Let's try to make
> sure we fix up the real bottlenecks, and not just go crazy.

Google has a fairly long history of using feedback driven optimization
compiles for the kernel. They were also the ones that developed perf
autofdo tooling IIRC.

https://gcc.gnu.org/wiki/AutoFDO/Tutorial

One of the things pjt promised was a series of patches doing the
proposed optimization for the scheduler code based on their results.
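For context, the AutoFDO flow from that tutorial is roughly the following (a sketch only; the binary name, perf event, and exact tool flags vary by CPU and tool version):

```shell
# 1. Record last-branch-record samples while running the real workload.
perf record -b -e br_inst_retired.near_taken:pp -- ./workload

# 2. Convert the perf profile into a gcov-format profile with the
#    autofdo tools (create_gcov ships with the autofdo package).
create_gcov --binary=./workload --profile=perf.data --gcov=workload.afdo

# 3. Rebuild with the profile so gcc can lay out hot/cold paths and
#    bias indirect-call promotion toward the observed common targets.
gcc -O2 -fauto-profile=workload.afdo -o workload workload.c
```

Indirect-call promotion driven by such profiles is essentially an automated version of the if/else devirtualization being discussed above.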
Peter Zijlstra Feb. 5, 2018, 8:55 a.m. UTC | #5
On Sat, Feb 03, 2018 at 02:46:47PM +0000, David Woodhouse wrote:
> Yeah. I'm keen on being able to use something like alternatives to
> *change* 'usualfunction' at runtime though. I suspect it'll be a win
> for stuff like dma_ops.

That shouldn't be too hard to implement.

Patch

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2b8eb4da4d08..b9f0de6e309b 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5058,7 +5058,7 @@  void kvm_mmu_uninit_vm(struct kvm *kvm)
 typedef bool (*slot_level_handler) (struct kvm *kvm, struct kvm_rmap_head *rmap_head);
 
 /* The caller should hold mmu-lock before calling this function. */
-static bool
+static bool __always_inline
 slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			slot_level_handler fn, int start_level, int end_level,
 			gfn_t start_gfn, gfn_t end_gfn, bool lock_flush_tlb)
@@ -5088,7 +5088,7 @@  slot_handle_level_range(struct kvm *kvm, struct kvm_memory_slot *memslot,
 	return flush;
 }
 
-static bool
+static bool __always_inline
 slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		  slot_level_handler fn, int start_level, int end_level,
 		  bool lock_flush_tlb)
@@ -5099,7 +5099,7 @@  slot_handle_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			lock_flush_tlb);
 }
 
-static bool
+static bool __always_inline
 slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		      slot_level_handler fn, bool lock_flush_tlb)
 {
@@ -5107,7 +5107,7 @@  slot_handle_all_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
 }
 
-static bool
+static bool __always_inline
 slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 			slot_level_handler fn, bool lock_flush_tlb)
 {
@@ -5115,7 +5115,7 @@  slot_handle_large_level(struct kvm *kvm, struct kvm_memory_slot *memslot,
 				 PT_MAX_HUGEPAGE_LEVEL, lock_flush_tlb);
 }
 
-static bool
+static bool __always_inline
 slot_handle_leaf(struct kvm *kvm, struct kvm_memory_slot *memslot,
 		 slot_level_handler fn, bool lock_flush_tlb)
 {