[v4,0/4] overcommit: introduce mem-lock-onfault

Message ID 20250123131944.391886-1-d-tatianin@yandex-team.ru (mailing list archive)
Message

Daniil Tatianin Jan. 23, 2025, 1:19 p.m. UTC
Currently, passing mem-lock=on to QEMU causes memory usage to grow by
huge amounts:

no memlock:
    $ ./qemu-system-x86_64 -overcommit mem-lock=off
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    45652

    $ ./qemu-system-x86_64 -overcommit mem-lock=off -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    39756

memlock:
    $ ./qemu-system-x86_64 -overcommit mem-lock=on
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    1309876

    $ ./qemu-system-x86_64 -overcommit mem-lock=on -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    259956

This happens because mlockall(2) write-faults every existing and
future anonymous mapping in the process right away.
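
One way to observe this effect on a single mapping, rather than on the
whole-process RSS, is to count resident pages with mincore(2). The sketch
below is not part of the series; the helper name count_resident_pages is
hypothetical, and it assumes a Linux host where mincore(2) is available.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the series): count how many pages of the
 * page-aligned range [addr, addr + len) are currently resident in RAM.
 * Returns the page count, or -1 on error.
 */
long count_resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    long resident = 0;

    if (!vec) {
        return -1;
    }
    if (mincore(addr, len, vec) < 0) {
        free(vec);
        return -1;
    }
    /* Only the least significant bit of each entry carries residency. */
    for (size_t i = 0; i < npages; i++) {
        resident += vec[i] & 1;
    }
    free(vec);
    return resident;
}
```

Calling it on a fresh anonymous mapping after mlockall(MCL_CURRENT |
MCL_FUTURE) would be expected to report every page as resident, while an
untouched mapping in an unlocked process would normally report few or none.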

One of the reasons to enable mem-lock is to protect a QEMU process's
pages from being compacted and migrated by kcompactd, which does so by
modifying a live process's page tables, causing thousands of TLB flush
IPIs per second and basically stealing all guest time while it's
active.

mem-lock=on helps against this (given compact_unevictable_allowed is 0),
but the memory overhead it introduces is an undesirable side effect.
We can avoid it completely by passing MCL_ONFAULT to mlockall, which
this series makes possible with a new mem-lock option called
on-fault.
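
As a rough illustration of the idea (not the code from the patches), the
flag selection could look like the sketch below. The helper names
mlockall_flags and lock_process_memory are hypothetical, and the #ifdef
guard assumes that some hosts may lack MCL_ONFAULT in their headers.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: build the mlockall(2) flags for a locking request.
 * MCL_ONFAULT defers locking of each page until it is first faulted in.
 */
int mlockall_flags(bool on_fault)
{
    int flags = MCL_CURRENT | MCL_FUTURE;

#ifdef MCL_ONFAULT
    if (on_fault) {
        flags |= MCL_ONFAULT;
    }
#endif
    return flags;
}

/*
 * Hypothetical helper: lock the whole process, optionally on fault.
 * Returns 0 on success or a negative errno value on failure.
 */
int lock_process_memory(bool on_fault)
{
#ifndef MCL_ONFAULT
    if (on_fault) {
        return -ENOTSUP; /* assumption: report missing support to the caller */
    }
#endif
    if (mlockall(mlockall_flags(on_fault)) < 0) {
        return -errno;
    }
    return 0;
}
```

Note that mlockall may still fail with ENOMEM or EPERM depending on
RLIMIT_MEMLOCK and privileges, so a caller would need to handle the
negative return value.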

memlock-onfault:
    $ ./qemu-system-x86_64 -overcommit mem-lock=on-fault
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    54004

    $ ./qemu-system-x86_64 -overcommit mem-lock=on-fault -enable-kvm
    $ ps -p $(pidof ./qemu-system-x86_64) -o rss=
    47772

You may notice that memory usage is still slightly higher, in this case
by a few megabytes over the mem-lock=off case. I was able to trace this
down to a bug in the Linux kernel: MCL_ONFAULT is not honored for the
early process heap (set up with brk(2) etc.), so that region is still
write-faulted. Even so, usage is far lower than with plain mem-lock=on.

Changes since v1:
    - Don't make a separate mem-lock-onfault, add an on-fault option to mem-lock instead

Changes since v2:
    - Move overcommit option parsing out of line
    - Make enable_mlock an enum instead

Changes since v3:
    - Rebase to latest master due to the recent sysemu -> system renames

Daniil Tatianin (4):
  os: add an ability to lock memory on_fault
  system/vl: extract overcommit option parsing into a helper
  system: introduce a new MlockState enum
  overcommit: introduce mem-lock=on-fault

 hw/virtio/virtio-mem.c    |  2 +-
 include/system/os-posix.h |  2 +-
 include/system/os-win32.h |  3 ++-
 include/system/system.h   | 12 ++++++++-
 migration/postcopy-ram.c  |  4 +--
 os-posix.c                | 10 ++++++--
 qemu-options.hx           | 14 +++++++----
 system/globals.c          | 12 ++++++++-
 system/vl.c               | 52 +++++++++++++++++++++++++++++++--------
 9 files changed, 87 insertions(+), 24 deletions(-)
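
For readers who want a feel for the shape of the change, here is a minimal
sketch of a three-state setting and a parsing helper. Only the MlockState
type name comes from the patch titles above; the enumerator names and the
functions parse_mem_lock and should_mlock are assumptions made for
illustration and may not match the actual patches.

```c
#include <stdbool.h>
#include <string.h>

/* Enumerator names are assumed; the series only names the MlockState type. */
typedef enum {
    MLOCK_OFF,
    MLOCK_ON,
    MLOCK_ON_FAULT,
} MlockState;

/*
 * Hypothetical parser for the value of "-overcommit mem-lock=...".
 * Returns true and stores the state on success, false on an unknown value.
 */
bool parse_mem_lock(const char *value, MlockState *out)
{
    if (strcmp(value, "off") == 0) {
        *out = MLOCK_OFF;
    } else if (strcmp(value, "on") == 0) {
        *out = MLOCK_ON;
    } else if (strcmp(value, "on-fault") == 0) {
        *out = MLOCK_ON_FAULT;
    } else {
        return false;
    }
    return true;
}

/* Callers that only care whether locking is requested at all. */
bool should_mlock(MlockState state)
{
    return state != MLOCK_OFF;
}
```

Using an enum instead of a boolean keeps existing "is locking enabled?"
checks simple while letting the memory-locking code distinguish the
on-fault case.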

Comments

Peter Xu Jan. 23, 2025, 4:31 p.m. UTC | #1
On Thu, Jan 23, 2025 at 04:19:40PM +0300, Daniil Tatianin wrote:
> [...]

Considering it's a very mem-relevant change and looks pretty benign, I can
pick this if nobody disagrees (or beats me to it, which I'd appreciate).

I'll also provide at least one week for people to stop me.

Thanks,
Daniil Tatianin Feb. 4, 2025, 8:23 a.m. UTC | #2
On 1/23/25 7:31 PM, Peter Xu wrote:
> On Thu, Jan 23, 2025 at 04:19:40PM +0300, Daniil Tatianin wrote:
>> [...]
> Considering it's a very mem-relevant change and looks pretty benign, I can
> pick this if nobody disagrees (or beats me to it, which I'd appreciate).
>
> I'll also provide at least one week for people to stop me.

I think it's been almost two weeks, so should be good now :)

Thanks!

> Thanks,
>
Peter Xu Feb. 4, 2025, 2:47 p.m. UTC | #3
On Tue, Feb 04, 2025 at 11:23:41AM +0300, Daniil Tatianin wrote:
> 
> On 1/23/25 7:31 PM, Peter Xu wrote:
> > On Thu, Jan 23, 2025 at 04:19:40PM +0300, Daniil Tatianin wrote:
> > > [...]
> > Considering it's a very mem-relevant change and looks pretty benign, I can
> > pick this if nobody disagrees (or beats me to it, which I'd appreciate).
> > 
> > I'll also provide at least one week for people to stop me.
> 
> I think it's been almost two weeks, so should be good now :)

Don't worry, this is on track.  I'll probably send it in a few days.

Thanks,
Daniil Tatianin Feb. 4, 2025, 6:21 p.m. UTC | #4
On 2/4/25 5:47 PM, Peter Xu wrote:

> On Tue, Feb 04, 2025 at 11:23:41AM +0300, Daniil Tatianin wrote:
>> On 1/23/25 7:31 PM, Peter Xu wrote:
>>> On Thu, Jan 23, 2025 at 04:19:40PM +0300, Daniil Tatianin wrote:
>>>> [...]
>>> Considering it's a very mem-relevant change and looks pretty benign, I can
>>> pick this if nobody disagrees (or beats me to it, which I'd appreciate).
>>>
>>> I'll also provide at least one week for people to stop me.
>> I think it's been almost two weeks, so should be good now :)
> Don't worry, this is on track.  I'll probably send it in a few days.
>
> Thanks,

Amazing, thank you!