Performance regression using KVM/ARM

Message ID	20160422101546.GB30824@cbox (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> Date: Fri, 22 Apr 2016 12:15:46 +0200 From: Christoffer Dall <christoffer.dall@linaro.org> To: Alexander Graf <agraf@suse.de> Message-ID: <20160422101546.GB30824@cbox> References: <20160421162348.GA24178@cbox> <57192EED.2040501@suse.de> <20160422100118.GD25288@cbox> <5719F7BC.4090703@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5719F7BC.4090703@suse.de> User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Re: [Qemu-devel] Performance regression using KVM/ARM Precedence: list Cc: Marc Zyngier <marc.zyngier@arm.com>, Peter Maydell <peter.maydell@linaro.org>, Paolo Bonzini <pbonzini@redhat.com>, qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com> Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Christoffer Dall April 22, 2016, 10:15 a.m. UTC

On Fri, Apr 22, 2016 at 12:06:52PM +0200, Alexander Graf wrote:
> On 04/22/2016 12:01 PM, Christoffer Dall wrote:
> >On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
> >>
> >>On 21.04.16 18:23, Christoffer Dall wrote:
> >>>Hi,
> >>>
> >>>Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> >>>2015-09-10) had the unfortunate side effect that memory slots registered
> >>>with KVM no longer contain a userspace address that is aligned to a 2M
> >>>boundary, causing the use of THP to fail in the kernel.
> >>>
> >>>I fail to see where in the QEMU code we should be asking for a 2M
> >>>alignment of our memory region.  Can someone help pointing me to the
> >>>right place to fix this or suggest a patch?
> >>>
> >>>This causes a performance regssion of hackbench on KVM/ARM of about 62%
> >>>compared to the workload running with THP.
> >>>
> >>>We have verified that this is indeed the cause of the failure by adding
> >>>various prints to QEMU and the kernel, but unfortunatley my QEMU
> >>>knowledge is not sufficient for me to fix it myself.
> >>>
> >>>Any help would be much appreciated!
> >>The code changed quite heavily since I last looked at it, but could you
> >>please try whether the (untested) patch below makes a difference?
> >>
> >>
> >Unfortunately this doesn't make any difference.  It feels to me like
> >we're missing specifying a 2M alignemnt in QEMU somewhere, but I can't
> >properly understand the links between the actual allocation, registering
> >mem slots with the KVM part of QEMU, and actually setting up KVM user
> >memory regions.
> >
> >What has to happen is that the resulting struct
> >kvm_userspace_memory_region() has the same alignment offset from 2M (the
> >huge page size) of the ->guest_phys_addr and ->userspace-addr fields.
> 
> Well, I would expect that the guest address space is always very big
> aligned - and definitely at least 2MB - so we're safe there.
> 
> That means we only need to align the qemu virtual address. There
> used to be a memalign() call for that, but it got replaced with
> direct mmap() and then a lot of code changed on top. Looking at the
> logs, I'm sure Paolo knows the answer though :)
> 
Peter just pointed me to a change I remember doing for ARM, so perhaps
this fix is the right one?




Thanks,
-Christoffer

Peter Maydell April 22, 2016, 10:17 a.m. UTC | #1

On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> Peter just pointed me to a change I remember doing for ARM, so perhaps
> this fix is the right one?
>
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index d25f671..a36e734 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -35,7 +35,7 @@
>  extern int daemon(int, int);
>  #endif
>
> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>        Valgrind does not support alignments larger than 1 MiB,
>        therefore we need special code which handles running on Valgrind. */

I hadn't realised AArch64 didn't define __arm__. Your extra clause
wants to be inside the parens for the ||, not outside.

So was the problem just that we weren't passing 2MB as the align
parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
right thing ?

thanks
-- PMM

Christoffer Dall April 22, 2016, 10:26 a.m. UTC | #2

On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
> On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > Peter just pointed me to a change I remember doing for ARM, so perhaps
> > this fix is the right one?
> >
> >
> > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > index d25f671..a36e734 100644
> > --- a/util/oslib-posix.c
> > +++ b/util/oslib-posix.c
> > @@ -35,7 +35,7 @@
> >  extern int daemon(int, int);
> >  #endif
> >
> > -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> > +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
> >     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
> >        Valgrind does not support alignments larger than 1 MiB,
> >        therefore we need special code which handles running on Valgrind. */
> 
> I hadn't realised AArch64 didn't define __arm__. Your extra clause
> wants to be inside the parens for the ||, not outside.
> 
> So was the problem just that we weren't passing 2MB as the align
> parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
> right thing ?
> 
Yes, that was essentially the problem.  I'll send a proper patch.

However, Marc pointed out, that we should probably think about improving
on this in the future, because if you're running on a 64K page system
and want huge pages, you have to align at a higher boundary, but I'm on
the other hand not sure such a boundary is always practical on a 64K
system.  Which would suggest that we either need to:

1) Probe the kernel for the page size and always align to something that
allows huge pages, or

2) Let the user specify an option that says it wants to be able to use
THP and only then align to the huge page boundary.

Not sure...

-Christoffer

Andrew Jones April 22, 2016, 11:16 a.m. UTC | #3

On Fri, Apr 22, 2016 at 12:26:52PM +0200, Christoffer Dall wrote:
> On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
> > On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > > Peter just pointed me to a change I remember doing for ARM, so perhaps
> > > this fix is the right one?
> > >
> > >
> > > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > > index d25f671..a36e734 100644
> > > --- a/util/oslib-posix.c
> > > +++ b/util/oslib-posix.c
> > > @@ -35,7 +35,7 @@
> > >  extern int daemon(int, int);
> > >  #endif
> > >
> > > -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> > > +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
> > >     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
> > >        Valgrind does not support alignments larger than 1 MiB,
> > >        therefore we need special code which handles running on Valgrind. */
> > 
> > I hadn't realised AArch64 didn't define __arm__. Your extra clause
> > wants to be inside the parens for the ||, not outside.
> > 
> > So was the problem just that we weren't passing 2MB as the align
> > parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
> > right thing ?
> > 
> Yes, that was essentially the problem.  I'll send a proper patch.
> 
> However, Marc pointed out, that we should probably think about improving
> on this in the future, because if you're running on a 64K page system
> and want huge pages, you have to align at a higher boundary, but I'm on
> the other hand not sure such a boundary is always practical on a 64K
> system.  Which would suggest that we either need to:
> 
> 1) Probe the kernel for the page size and always align to something that
> allows huge pages, or
> 
> 2) Let the user specify an option that says it wants to be able to use
> THP and only then align to the huge page boundary.
> 
> Not sure...

Probably (1), otherwise the T in THP becomes t, i.e. the use of THP
gets slightly less transparent. Though I agree that using THP on 64k
page systems isn't always a good choice, I think, in those cases,
that users would likely just disable THP for the whole system.

drew

Alexander Graf April 22, 2016, 11:24 a.m. UTC | #4

On 04/22/2016 01:16 PM, Andrew Jones wrote:
> On Fri, Apr 22, 2016 at 12:26:52PM +0200, Christoffer Dall wrote:
>> On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
>>> On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> Peter just pointed me to a change I remember doing for ARM, so perhaps
>>>> this fix is the right one?
>>>>
>>>>
>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>> index d25f671..a36e734 100644
>>>> --- a/util/oslib-posix.c
>>>> +++ b/util/oslib-posix.c
>>>> @@ -35,7 +35,7 @@
>>>>   extern int daemon(int, int);
>>>>   #endif
>>>>
>>>> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
>>>> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>>>>      /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>>>>         Valgrind does not support alignments larger than 1 MiB,
>>>>         therefore we need special code which handles running on Valgrind. */
>>> I hadn't realised AArch64 didn't define __arm__. Your extra clause
>>> wants to be inside the parens for the ||, not outside.
>>>
>>> So was the problem just that we weren't passing 2MB as the align
>>> parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
>>> right thing ?
>>>
>> Yes, that was essentially the problem.  I'll send a proper patch.
>>
>> However, Marc pointed out, that we should probably think about improving
>> on this in the future, because if you're running on a 64K page system
>> and want huge pages, you have to align at a higher boundary, but I'm on
>> the other hand not sure such a boundary is always practical on a 64K
>> system.  Which would suggest that we either need to:
>>
>> 1) Probe the kernel for the page size and always align to something that
>> allows huge pages, or
>>
>> 2) Let the user specify an option that says it wants to be able to use
>> THP and only then align to the huge page boundary.
>>
>> Not sure...
> Probably (1), otherwise the T in THP becomes t, i.e. the use of THP
> gets slightly less transparent. Though I agree that using THP on 64k
> page systems isn't always a good choice, I think, in those cases,
> that users would likely just disable THP for the whole system.

I'd say that THP on 64k systems really only makes sense with support for 
combined pages that allows you to use 2MB huge pages on a 64k PAGE_SIZE 
system. So in that case aligning to 2MB is fine again.


Alex

Eric Auger June 13, 2016, 2:53 p.m. UTC | #5

Hi,

Le 22/04/2016 à 12:15, Christoffer Dall a écrit :
> On Fri, Apr 22, 2016 at 12:06:52PM +0200, Alexander Graf wrote:
>> On 04/22/2016 12:01 PM, Christoffer Dall wrote:
>>> On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
>>>>
>>>> On 21.04.16 18:23, Christoffer Dall wrote:
>>>>> Hi,
>>>>>
>>>>> Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
>>>>> 2015-09-10) had the unfortunate side effect that memory slots registered
>>>>> with KVM no longer contain a userspace address that is aligned to a 2M
>>>>> boundary, causing the use of THP to fail in the kernel.
>>>>>
>>>>> I fail to see where in the QEMU code we should be asking for a 2M
>>>>> alignment of our memory region.  Can someone help pointing me to the
>>>>> right place to fix this or suggest a patch?
>>>>>
>>>>> This causes a performance regssion of hackbench on KVM/ARM of about 62%
>>>>> compared to the workload running with THP.
>>>>>
>>>>> We have verified that this is indeed the cause of the failure by adding
>>>>> various prints to QEMU and the kernel, but unfortunatley my QEMU
>>>>> knowledge is not sufficient for me to fix it myself.
>>>>>
>>>>> Any help would be much appreciated!
>>>> The code changed quite heavily since I last looked at it, but could you
>>>> please try whether the (untested) patch below makes a difference?
>>>>
>>>>
>>> Unfortunately this doesn't make any difference.  It feels to me like
>>> we're missing specifying a 2M alignemnt in QEMU somewhere, but I can't
>>> properly understand the links between the actual allocation, registering
>>> mem slots with the KVM part of QEMU, and actually setting up KVM user
>>> memory regions.
>>>
>>> What has to happen is that the resulting struct
>>> kvm_userspace_memory_region() has the same alignment offset from 2M (the
>>> huge page size) of the ->guest_phys_addr and ->userspace-addr fields.
>>
>> Well, I would expect that the guest address space is always very big
>> aligned - and definitely at least 2MB - so we're safe there.
>>
>> That means we only need to align the qemu virtual address. There
>> used to be a memalign() call for that, but it got replaced with
>> direct mmap() and then a lot of code changed on top. Looking at the
>> logs, I'm sure Paolo knows the answer though :)
>>
> Peter just pointed me to a change I remember doing for ARM, so perhaps
> this fix is the right one?
As shared with Christoffer, the patch below also alters multiple vfio
platform device assignment (outcome of a bisect). I guess this relates
to the GPA allocation on the platform bus but I need to further
investigate. I plan to work on a QEMU fix this week.

Thanks

Eric
> 
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index d25f671..a36e734 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -35,7 +35,7 @@
>  extern int daemon(int, int);
>  #endif
>  
> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>        Valgrind does not support alignments larger than 1 MiB,
>        therefore we need special code which handles running on Valgrind. */
> 
> 
> Thanks,
> -Christoffer
>

Performance regression using KVM/ARM

Commit Message

Comments

Patch