[RFC,v10,0/2] mm: Support for page hinting

Message ID 20190603170306.49099-1-nitesh@redhat.com (mailing list archive)

Message

Nitesh Narayan Lal June 3, 2019, 5:03 p.m. UTC
This patch series proposes an efficient mechanism for communicating free memory
from a guest to its hypervisor. It especially enables guests with no page cache
(e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
rapidly hand back free memory to the hypervisor.
This approach has a minimal impact on the existing core-mm infrastructure.

Measurement results (measurement details appended to this email):
* With active page hinting, 3 more guests, each of 5 GB, could be launched
(total 5 vs. 2) on a 15 GB (single NUMA) system without swapping.
* With active page hinting, on a system with 15 GB of (single NUMA) memory and
4 GB of swap, running "memhog 6G" sequentially in 3 guests resulted in the last
invocation needing only 37s, compared to 3m35s without page hinting.

This approach tracks all freed pages of order MAX_ORDER - 2 in bitmaps.
A new hook after buddy merging is used to set the corresponding bits in the
bitmap. Currently, the bits are only cleared when pages are hinted, not when
pages are re-allocated.
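
For illustration only, a rough sketch of such a hook (the names
page_hinting_enqueue, free_page_bitmap, free_page_hint_count and
HINTING_THRESHOLD are illustrative; the actual code in the patches differs
in detail):

/* Called from __free_one_page() after buddy merging, zone->lock held. */
static void page_hinting_enqueue(struct zone *zone, struct page *page,
				 unsigned int order)
{
	unsigned long bitnr;

	if (order < MAX_ORDER - 2)
		return;

	/* One bit per MAX_ORDER - 2 sized chunk, relative to the zone start. */
	bitnr = (page_to_pfn(page) - zone->zone_start_pfn) >> (MAX_ORDER - 2);
	set_bit(bitnr, zone->free_page_bitmap);

	/* Wake the reporting workqueue once enough free memory is tracked. */
	if (atomic_inc_return(&zone->free_page_hint_count) >= HINTING_THRESHOLD)
		schedule_work(&hinting_work);
}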

Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
threshold is met, trying to isolate and report pages that are still free.

The isolated pages are reported via virtio-balloon, which is responsible for
sending batched pages to the host synchronously. Once the hypervisor has
processed the hinting request, the isolated pages are returned to the buddy.
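
A rough sketch of that reporting path (the helpers isolate_free_chunk(),
report_free_pages(), putback_isolated_chunk() and the bitmap fields are
illustrative; the actual code in the patches differs in detail):

/* Hypothetical sketch of the reporting worker. */
static void hinting_work_fn(struct work_struct *work)
{
	struct page *batch[16];		/* at most 16 isolated chunks at a time */
	struct zone *zone;
	unsigned long bitnr;
	int i, n;

	for_each_populated_zone(zone) {
		n = 0;
		spin_lock_irq(&zone->lock);
		for_each_set_bit(bitnr, zone->free_page_bitmap, zone->bitmap_bits) {
			/* Clear the bit and re-check that the chunk is still
			 * free (bits can be stale, since they are not cleared
			 * on allocation), isolating it if so. */
			clear_bit(bitnr, zone->free_page_bitmap);
			if (isolate_free_chunk(zone, bitnr, &batch[n]) && ++n == 16)
				break;
		}
		spin_unlock_irq(&zone->lock);

		/* virtio-balloon reports the batch to the host synchronously... */
		report_free_pages(batch, n);

		/* ...and the isolated chunks are then returned to the buddy. */
		for (i = 0; i < n; i++)
			putback_isolated_chunk(batch[i]);
	}
}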

The key changes made in this series compared to v9[1] are:
* Only pages in chunks of "MAX_ORDER - 2" are reported to the hypervisor, so as
not to break up THPs.
* Only a set of 16 pages can be isolated and reported to the host at a time, to
avoid any false OOMs.
* page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
on virtio and not on KVM itself. This would enable any other hypervisor to use
this feature by implementing virtio devices.
* The sysctl variable is replaced with a virtio-balloon parameter to
enable/disable page-hinting.
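
A minimal sketch of what such a parameter could look like (the actual name and
permissions in the series may differ):

/* In virtio_balloon.c: module parameter replacing the earlier sysctl knob. */
static bool page_hinting_flag = true;
module_param(page_hinting_flag, bool, 0444);
MODULE_PARM_DESC(page_hinting_flag, "Enable/disable free page hinting");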

Pending items:
* Test device-assigned guests to ensure that hinting doesn't break them.
* Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
* Compare reporting free pages via vring with vhost.
* Decide between MADV_DONTNEED and MADV_FREE (a host-side sketch follows this
list).
* Look into memory hotplug, more efficient locking, possible races when
disabling.
* Come up with proper/traceable error-message/logs.
* Minor reworks and simplifications (e.g., virtio protocol).
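
On the MADV_DONTNEED vs. MADV_FREE item, the host side (e.g., QEMU) would apply
one of the two advices to the host virtual range backing each reported chunk; a
minimal userspace sketch (mapping lookup omitted, helper name illustrative):

#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical host-side helper: MADV_DONTNEED discards the backing pages
 * immediately, MADV_FREE lets the host kernel reclaim them lazily. */
static int discard_reported_range(void *hva, size_t len, int lazy)
{
	return madvise(hva, len, lazy ? MADV_FREE : MADV_DONTNEED);
}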

Benefit analysis:
1. Use-case - Number of guests that can be launched without swap usage
NUMA Nodes = 1 with 15 GB memory
Guest Memory = 5 GB
Number of cores in guest = 1
Workload = test allocation program allocates 4 GB of memory, touches it via
memset, and exits (a minimal equivalent is sketched after the results below).
Procedure =
The first guest is launched and, once its console is up, the test allocation
program is executed with a 4 GB memory request (due to this, the guest occupies
almost 4-5 GB of memory on the host in a system without page hinting). Once
this program exits, another guest is launched on the host and the same process
is followed. This is repeated until swap starts being used.

Results:
Without hinting = 3, swap usage at the end 1.1GB.
With hinting = 5, swap usage at the end 0.
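
A minimal stand-in for the test allocation program described above (not part of
the series; shown only to make the workload reproducible):

#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t size = 4UL << 30;	/* 4 GB request */
	char *buf = malloc(size);

	if (!buf)
		return 1;
	memset(buf, 1, size);		/* touch every page */
	return 0;			/* exit, freeing the memory */
}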

2. Use-case - memhog execution time
Guest Memory = 6GB
Number of cores = 4
NUMA Nodes = 1 with 15 GB memory
Process: 3 guests are launched and the ‘memhog 6G’ execution time is measured
in each of them, one after the other.
Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0

Performance analysis:
1. will-it-scale's page_fault1:
Guest Memory = 6GB
Number of cores = 24

Without Hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,315890,95.82,317633,95.83,317633
2,570810,91.67,531147,91.94,635266
3,826491,87.54,713545,88.53,952899
4,1087434,83.40,901215,85.30,1270532
5,1277137,79.26,916442,83.74,1588165
6,1503611,75.12,1113832,79.89,1905798
7,1683750,70.99,1140629,78.33,2223431
8,1893105,66.85,1157028,77.40,2541064
9,2046516,62.50,1179445,76.48,2858697
10,2291171,58.57,1209247,74.99,3176330
11,2486198,54.47,1217265,75.13,3493963
12,2656533,50.36,1193392,74.42,3811596
13,2747951,46.21,1185540,73.45,4129229
14,2965757,42.09,1161862,72.20,4446862
15,3049128,37.97,1185923,72.12,4764495
16,3150692,33.83,1163789,70.70,5082128
17,3206023,29.70,1174217,70.11,5399761
18,3211380,25.62,1179660,69.40,5717394
19,3202031,21.44,1181259,67.28,6035027
20,3218245,17.35,1196367,66.75,6352660
21,3228576,13.26,1129561,66.74,6670293
22,3207452,9.15,1166517,66.47,6987926
23,3153800,5.09,1172877,61.57,7305559
24,3184542,0.99,1186244,58.36,7623192

With Hinting:
0,0,100,0,100,0
1,306737,95.82,305130,95.78,306737
2,573207,91.68,530453,91.92,613474
3,810319,87.53,695281,88.58,920211
4,1074116,83.40,880602,85.48,1226948
5,1308283,79.26,1109257,81.23,1533685
6,1501987,75.12,1093661,80.19,1840422
7,1695300,70.99,1104207,79.03,2147159
8,1901523,66.85,1193613,76.90,2453896
9,2051288,62.73,1200913,76.22,2760633
10,2275771,58.60,1192992,75.66,3067370
11,2435016,54.48,1191472,74.66,3374107
12,2623114,50.35,1196911,74.02,3680844
13,2766071,46.22,1178589,73.02,3987581
14,2932163,42.10,1166414,72.96,4294318
15,3000853,37.96,1177177,72.62,4601055
16,3113738,33.85,1165444,70.54,4907792
17,3132135,29.77,1165055,68.51,5214529
18,3175121,25.69,1166969,69.27,5521266
19,3205490,21.61,1159310,65.65,5828003
20,3220855,17.52,1171827,62.04,6134740
21,3182568,13.48,1138918,65.05,6441477
22,3130543,9.30,1128185,60.60,6748214
23,3087426,5.15,1127912,55.36,7054951
24,3099457,1.04,1176100,54.96,7361688

[1] https://lkml.org/lkml/2019/3/6/413

Comments

Michael S. Tsirkin June 3, 2019, 6:04 p.m. UTC | #1
On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> This patch series proposes an efficient mechanism for communicating free memory
> from a guest to its hypervisor. It especially enables guests with no page cache
> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> rapidly hand back free memory to the hypervisor.
> This approach has a minimal impact on the existing core-mm infrastructure.

Could you help us compare with Alex's series?
What are the main differences?

> Measurement results (measurement details appended to this email):
> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
> in the last invocation to only need 37s compared to 3m35s without page hinting.
> 
> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
> A new hook after buddy merging is used to set the bits in the bitmap.
> Currently, the bits are only cleared when pages are hinted, not when pages are
> re-allocated.
> 
> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
> threshold is met, trying to isolate and report pages that are still free.
> 
> The isolated pages are reported via virtio-balloon, which is responsible for
> sending batched pages to the host synchronously. Once the hypervisor processed
> the hinting request, the isolated pages are returned back to the buddy.
> 
> The key changes made in this series compared to v9[1] are:
> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
> not break up the THP.
> * At a time only a set of 16 pages can be isolated and reported to the host to
> avoids any false OOMs.
> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
> on virtio and not on KVM itself. This would enable any other hypervisor to use
> this feature by implementing virtio devices.
> * The sysctl variable is replaced with a virtio-balloon parameter to
> enable/disable page-hinting.
> 
> Pending items:
> * Test device assigned guests to ensure that hinting doesn't break it.
> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
> * Compare reporting free pages via vring with vhost.
> * Decide between MADV_DONTNEED and MADV_FREE.
> * Look into memory hotplug, more efficient locking, possible races when
> disabling.
> * Come up with proper/traceable error-message/logs.
> * Minor reworks and simplifications (e.g., virtio protocol).
> 
> Benefit analysis:
> 1. Use-case - Number of guests that can be launched without swap usage
> NUMA Nodes = 1 with 15 GB memory
> Guest Memory = 5 GB
> Number of cores in guest = 1
> Workload = test allocation program allocates 4GB memory, touches it via memset
> and exits.
> Procedure =
> The first guest is launched and once its console is up, the test allocation
> program is executed with 4 GB memory request (Due to this the guest occupies
> almost 4-5 GB of memory in the host in a system without page hinting). Once
> this program exits at that time another guest is launched in the host and the
> same process is followed. It is continued until the swap is not used.
> 
> Results:
> Without hinting = 3, swap usage at the end 1.1GB.
> With hinting = 5, swap usage at the end 0.
> 
> 2. Use-case - memhog execution time
> Guest Memory = 6GB
> Number of cores = 4
> NUMA Nodes = 1 with 15 GB memory
> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
> one after the other in each of them.
> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
> 
> Performance analysis:
> 1. will-it-scale's page_faul1:
> Guest Memory = 6GB
> Number of cores = 24
> 
> Without Hinting:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,315890,95.82,317633,95.83,317633
> 2,570810,91.67,531147,91.94,635266
> 3,826491,87.54,713545,88.53,952899
> 4,1087434,83.40,901215,85.30,1270532
> 5,1277137,79.26,916442,83.74,1588165
> 6,1503611,75.12,1113832,79.89,1905798
> 7,1683750,70.99,1140629,78.33,2223431
> 8,1893105,66.85,1157028,77.40,2541064
> 9,2046516,62.50,1179445,76.48,2858697
> 10,2291171,58.57,1209247,74.99,3176330
> 11,2486198,54.47,1217265,75.13,3493963
> 12,2656533,50.36,1193392,74.42,3811596
> 13,2747951,46.21,1185540,73.45,4129229
> 14,2965757,42.09,1161862,72.20,4446862
> 15,3049128,37.97,1185923,72.12,4764495
> 16,3150692,33.83,1163789,70.70,5082128
> 17,3206023,29.70,1174217,70.11,5399761
> 18,3211380,25.62,1179660,69.40,5717394
> 19,3202031,21.44,1181259,67.28,6035027
> 20,3218245,17.35,1196367,66.75,6352660
> 21,3228576,13.26,1129561,66.74,6670293
> 22,3207452,9.15,1166517,66.47,6987926
> 23,3153800,5.09,1172877,61.57,7305559
> 24,3184542,0.99,1186244,58.36,7623192
> 
> With Hinting:
> 0,0,100,0,100,0
> 1,306737,95.82,305130,95.78,306737
> 2,573207,91.68,530453,91.92,613474
> 3,810319,87.53,695281,88.58,920211
> 4,1074116,83.40,880602,85.48,1226948
> 5,1308283,79.26,1109257,81.23,1533685
> 6,1501987,75.12,1093661,80.19,1840422
> 7,1695300,70.99,1104207,79.03,2147159
> 8,1901523,66.85,1193613,76.90,2453896
> 9,2051288,62.73,1200913,76.22,2760633
> 10,2275771,58.60,1192992,75.66,3067370
> 11,2435016,54.48,1191472,74.66,3374107
> 12,2623114,50.35,1196911,74.02,3680844
> 13,2766071,46.22,1178589,73.02,3987581
> 14,2932163,42.10,1166414,72.96,4294318
> 15,3000853,37.96,1177177,72.62,4601055
> 16,3113738,33.85,1165444,70.54,4907792
> 17,3132135,29.77,1165055,68.51,5214529
> 18,3175121,25.69,1166969,69.27,5521266
> 19,3205490,21.61,1159310,65.65,5828003
> 20,3220855,17.52,1171827,62.04,6134740
> 21,3182568,13.48,1138918,65.05,6441477
> 22,3130543,9.30,1128185,60.60,6748214
> 23,3087426,5.15,1127912,55.36,7054951
> 24,3099457,1.04,1176100,54.96,7361688
> 
> [1] https://lkml.org/lkml/2019/3/6/413
>
Nitesh Narayan Lal June 3, 2019, 6:38 p.m. UTC | #2
On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
I have just started reviewing Alex's series. Once I am done with it, I can.
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoids any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits at that time another guest is launched in the host and the
>> same process is followed. It is continued until the swap is not used.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
Nitesh Narayan Lal June 11, 2019, 12:19 p.m. UTC | #3
On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
Sorry for the late reply, but I haven't been feeling too well during the
last week.

The main differences are that this series uses a bitmap to track pages
that should be hinted to the hypervisor, while Alexander's series tracks
them directly in core-mm. Also, in order to prevent duplicate hints,
Alexander's series uses a newly defined page flag, whereas I have added
another argument to __free_one_page.
For these reasons, Alexander's series is relatively more core-mm
invasive, while this series is lightweight (e.g., in LOC). We'll have to
see if there are real performance differences.
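
To make the __free_one_page difference concrete, the change is roughly of the
following shape (sketch only; the extra argument name and the enqueue helper
are illustrative, not the exact code from the series):

static inline void __free_one_page(struct page *page, unsigned long pfn,
				   struct zone *zone, unsigned int order,
				   int migratetype, bool hint)
{
	/* ... existing buddy merging logic, unchanged ... */

	/* Internal callers (e.g., when returning isolated pages) pass
	 * hint = false, which avoids duplicate hints without a page flag. */
	if (hint)
		page_hinting_enqueue(zone, page, order);
}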

I'm planning on doing some further investigations/review/testing/...
once I'm back on track.
>
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoids any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits at that time another guest is launched in the host and the
>> same process is followed. It is continued until the swap is not used.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
Alexander Duyck June 11, 2019, 3 p.m. UTC | #4
On Tue, Jun 11, 2019 at 5:19 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >> This patch series proposes an efficient mechanism for communicating free memory
> >> from a guest to its hypervisor. It especially enables guests with no page cache
> >> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >> rapidly hand back free memory to the hypervisor.
> >> This approach has a minimal impact on the existing core-mm infrastructure.
> > Could you help us compare with Alex's series?
> > What are the main differences?
> Sorry for the late reply, but I haven't been feeling too well during the
> last week.
>
> The main differences are that this series uses a bitmap to track pages
> that should be hinted to the hypervisor, while Alexander's series tracks
> it directly in core-mm. Also in order to prevent duplicate hints
> Alexander's series uses a newly defined page flag whereas I have added
> another argument to __free_one_page.
> For these reasons, Alexander's series is relatively more core-mm
> invasive, while this series is lightweight (e.g., LOC). We'll have to
> see if there are real performance differences.
>
> I'm planning on doing some further investigations/review/testing/...
> once I'm back on track.

BTW one thing I found is that I will likely need to add a new
parameter like you did to __free_one_page, as I need to defer setting
the flag until after all of the merges have happened. Otherwise we may
set the flag on a given page, and then after the merge that page may not
be the one we ultimately add to the free list.

I'll try to have an update with all of my changes ready before the end
of this week.

Thanks.

- Alex
Nitesh Narayan Lal June 25, 2019, 2:48 p.m. UTC | #5
On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>> This patch series proposes an efficient mechanism for communicating free memory
>> from a guest to its hypervisor. It especially enables guests with no page cache
>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>> rapidly hand back free memory to the hypervisor.
>> This approach has a minimal impact on the existing core-mm infrastructure.
> Could you help us compare with Alex's series?
> What are the main differences?
Results comparing the benefits/performance of Alexander's v1
(bubble-hinting) [1] and page hinting (including some of the upstream
suggested changes on v10) against an unmodified kernel.

Test1 - Number of guests that can be launched without swap usage.
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: Guests are launched sequentially; the next guest is launched after
an allocation program with a 4 GB request has run in the previous one.

Results:
unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
usage of 2.3GB.
bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
usage of 1MB.
Page-hinting: 5 guests without swap usage and 6th guest with a swap
usage of 8MB.


Test2 - Memhog execution time
Guest size: 6GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Process: 3 guests are launched and "time memhog 6G" is launched in each
of them sequentially.

Results:
unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
the end-3.6G)
bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
end-0)
Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)


Test3 - Will-it-scale's page_fault1
Guest size: 6GB
Cores: 24
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)

unmodified kernel:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,459168,95.83,459315,95.83,459315
2,956272,91.68,884643,91.72,918630
3,1407811,87.53,1267948,87.69,1377945
4,1755744,83.39,1562471,83.73,1837260
5,2056741,79.24,1812309,80.00,2296575
6,2393759,75.09,2025719,77.02,2755890
7,2754403,70.95,2238180,73.72,3215205
8,2947493,66.81,2369686,70.37,3674520
9,3063579,62.68,2321148,68.84,4133835
10,3229023,58.54,2377596,65.84,4593150
11,3337665,54.40,2429818,64.01,5052465
12,3255140,50.28,2395070,61.63,5511780
13,3260721,46.11,2402644,59.77,5971095
14,3210590,42.02,2390806,57.46,6430410
15,3164811,37.88,2265352,51.39,6889725
16,3144764,33.77,2335028,54.07,7349040
17,3128839,29.63,2328662,49.52,7808355
18,3133344,25.50,2301181,48.01,8267670
19,3135979,21.38,2343003,43.66,8726985
20,3136448,17.27,2306109,40.81,9186300
21,3130324,13.16,2403688,35.84,9645615
22,3109883,9.04,2290808,36.24,10104930
23,3136805,4.94,2263818,35.43,10564245
24,3118949,0.78,2252891,31.03,11023560

bubble-hinting v1:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,292183,95.83,292428,95.83,292428
2,540606,91.67,501887,91.91,584856
3,821748,87.53,735244,88.31,877284
4,1033782,83.38,839925,85.59,1169712
5,1261352,79.25,896464,83.86,1462140
6,1459544,75.12,1050094,80.93,1754568
7,1686537,70.97,1112202,79.23,2046996
8,1866892,66.83,1083571,78.48,2339424
9,2056887,62.72,1101660,77.94,2631852
10,2252955,58.57,1097439,77.36,2924280
11,2413907,54.40,1088583,76.72,3216708
12,2596504,50.35,1117474,76.01,3509136
13,2715338,46.21,1087666,75.32,3801564
14,2861697,42.08,1084692,74.35,4093992
15,2964620,38.02,1087910,73.40,4386420
16,3065575,33.84,1099406,71.07,4678848
17,3107674,29.76,1056948,71.36,4971276
18,3144963,25.71,1094883,70.14,5263704
19,3173468,21.61,1073049,66.21,5556132
20,3173233,17.55,1072417,67.16,5848560
21,3209710,13.37,1079147,65.64,6140988
22,3182958,9.37,1085872,65.95,6433416
23,3200747,5.23,1076414,59.40,6725844
24,3181699,1.04,1051233,65.62,7018272

Page-hinting:
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,467693,95.83,467970,95.83,467970
2,967860,91.68,895883,91.70,935940
3,1408191,87.53,1279602,87.68,1403910
4,1766250,83.39,1557224,83.93,1871880
5,2124689,79.24,1834625,80.35,2339850
6,2413514,75.10,1989557,77.00,2807820
7,2644648,70.95,2158055,73.73,3275790
8,2896483,66.81,2305785,70.85,3743760
9,3157796,62.67,2304083,69.49,4211730
10,3251633,58.53,2379589,66.43,4679700
11,3313704,54.41,2349310,64.76,5147670
12,3285612,50.30,2362013,62.63,5615640
13,3207275,46.17,2377760,59.94,6083610
14,3221727,42.02,2416278,56.70,6551580
15,3194781,37.91,2334552,54.96,7019550
16,3211818,33.78,2399077,52.75,7487520
17,3172664,29.65,2337660,50.27,7955490
18,3177152,25.49,2349721,47.02,8423460
19,3149924,21.36,2319286,40.16,8891430
20,3166910,17.30,2279719,43.23,9359400
21,3159464,13.19,2342849,34.84,9827370
22,3167091,9.06,2285156,37.97,10295340
23,3174137,4.96,2365448,33.74,10763310
24,3161629,0.86,2253813,32.38,11231280


Test4: Netperf
Guest size: 5GB
Cores: 4
Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
Netserver: Running on core 0
Netperf: Running on core 1
Recv Socket Size bytes: 131072
Send Socket Size bytes: 16384
Send Message Size bytes: 1000000000
Time: 900s
Process: netperf is run 3 times sequentially in the same guest with the
same inputs mentioned above and the throughput (10^6 bits/sec) is observed.
unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07

Drawback with bubble-hinting:
More invasive.

Drawback with page-hinting:
Additional bitmap required, including growing/shrinking the bitmap on
memory hotplug.


[1] https://lkml.org/lkml/2019/6/19/926
>> Measurement results (measurement details appended to this email):
>> * With active page hinting, 3 more guests could be launched each of 5 GB(total 
>> 5 vs. 2) on a 15GB (single NUMA) system without swapping.
>> * With active page hinting, on a system with 15 GB of (single NUMA) memory and
>> 4GB of swap, the runtime of "memhog 6G" in 3 guests (run sequentially) resulted
>> in the last invocation to only need 37s compared to 3m35s without page hinting.
>>
>> This approach tracks all freed pages of the order MAX_ORDER - 2 in bitmaps.
>> A new hook after buddy merging is used to set the bits in the bitmap.
>> Currently, the bits are only cleared when pages are hinted, not when pages are
>> re-allocated.
>>
>> Bitmaps are stored on a per-zone basis and are protected by the zone lock. A
>> workqueue asynchronously processes the bitmaps as soon as a pre-defined memory
>> threshold is met, trying to isolate and report pages that are still free.
>>
>> The isolated pages are reported via virtio-balloon, which is responsible for
>> sending batched pages to the host synchronously. Once the hypervisor processed
>> the hinting request, the isolated pages are returned back to the buddy.
>>
>> The key changes made in this series compared to v9[1] are:
>> * Pages only in the chunks of "MAX_ORDER - 2" are reported to the hypervisor to
>> not break up the THP.
>> * At a time only a set of 16 pages can be isolated and reported to the host to
>> avoids any false OOMs.
>> * page_hinting.c is moved under mm/ from virt/kvm/ as the feature is dependent
>> on virtio and not on KVM itself. This would enable any other hypervisor to use
>> this feature by implementing virtio devices.
>> * The sysctl variable is replaced with a virtio-balloon parameter to
>> enable/disable page-hinting.
>>
>> Pending items:
>> * Test device assigned guests to ensure that hinting doesn't break it.
>> * Follow up on VIRTIO_BALLOON_F_PAGE_POISON's device side support.
>> * Compare reporting free pages via vring with vhost.
>> * Decide between MADV_DONTNEED and MADV_FREE.
>> * Look into memory hotplug, more efficient locking, possible races when
>> disabling.
>> * Come up with proper/traceable error-message/logs.
>> * Minor reworks and simplifications (e.g., virtio protocol).
>>
>> Benefit analysis:
>> 1. Use-case - Number of guests that can be launched without swap usage
>> NUMA Nodes = 1 with 15 GB memory
>> Guest Memory = 5 GB
>> Number of cores in guest = 1
>> Workload = test allocation program allocates 4GB memory, touches it via memset
>> and exits.
>> Procedure =
>> The first guest is launched and once its console is up, the test allocation
>> program is executed with 4 GB memory request (Due to this the guest occupies
>> almost 4-5 GB of memory in the host in a system without page hinting). Once
>> this program exits at that time another guest is launched in the host and the
>> same process is followed. It is continued until the swap is not used.
>>
>> Results:
>> Without hinting = 3, swap usage at the end 1.1GB.
>> With hinting = 5, swap usage at the end 0.
>>
>> 2. Use-case - memhog execution time
>> Guest Memory = 6GB
>> Number of cores = 4
>> NUMA Nodes = 1 with 15 GB memory
>> Process: 3 Guests are launched and the ‘memhog 6G’ execution time is monitored
>> one after the other in each of them.
>> Without Hinting - Guest1:47s, Guest2:53s, Guest3:3m35s, End swap usage: 3.5G
>> With Hinting - Guest1:40s, Guest2:44s, Guest3:37s, End swap usage: 0
>>
>> Performance analysis:
>> 1. will-it-scale's page_faul1:
>> Guest Memory = 6GB
>> Number of cores = 24
>>
>> Without Hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,315890,95.82,317633,95.83,317633
>> 2,570810,91.67,531147,91.94,635266
>> 3,826491,87.54,713545,88.53,952899
>> 4,1087434,83.40,901215,85.30,1270532
>> 5,1277137,79.26,916442,83.74,1588165
>> 6,1503611,75.12,1113832,79.89,1905798
>> 7,1683750,70.99,1140629,78.33,2223431
>> 8,1893105,66.85,1157028,77.40,2541064
>> 9,2046516,62.50,1179445,76.48,2858697
>> 10,2291171,58.57,1209247,74.99,3176330
>> 11,2486198,54.47,1217265,75.13,3493963
>> 12,2656533,50.36,1193392,74.42,3811596
>> 13,2747951,46.21,1185540,73.45,4129229
>> 14,2965757,42.09,1161862,72.20,4446862
>> 15,3049128,37.97,1185923,72.12,4764495
>> 16,3150692,33.83,1163789,70.70,5082128
>> 17,3206023,29.70,1174217,70.11,5399761
>> 18,3211380,25.62,1179660,69.40,5717394
>> 19,3202031,21.44,1181259,67.28,6035027
>> 20,3218245,17.35,1196367,66.75,6352660
>> 21,3228576,13.26,1129561,66.74,6670293
>> 22,3207452,9.15,1166517,66.47,6987926
>> 23,3153800,5.09,1172877,61.57,7305559
>> 24,3184542,0.99,1186244,58.36,7623192
>>
>> With Hinting:
>> 0,0,100,0,100,0
>> 1,306737,95.82,305130,95.78,306737
>> 2,573207,91.68,530453,91.92,613474
>> 3,810319,87.53,695281,88.58,920211
>> 4,1074116,83.40,880602,85.48,1226948
>> 5,1308283,79.26,1109257,81.23,1533685
>> 6,1501987,75.12,1093661,80.19,1840422
>> 7,1695300,70.99,1104207,79.03,2147159
>> 8,1901523,66.85,1193613,76.90,2453896
>> 9,2051288,62.73,1200913,76.22,2760633
>> 10,2275771,58.60,1192992,75.66,3067370
>> 11,2435016,54.48,1191472,74.66,3374107
>> 12,2623114,50.35,1196911,74.02,3680844
>> 13,2766071,46.22,1178589,73.02,3987581
>> 14,2932163,42.10,1166414,72.96,4294318
>> 15,3000853,37.96,1177177,72.62,4601055
>> 16,3113738,33.85,1165444,70.54,4907792
>> 17,3132135,29.77,1165055,68.51,5214529
>> 18,3175121,25.69,1166969,69.27,5521266
>> 19,3205490,21.61,1159310,65.65,5828003
>> 20,3220855,17.52,1171827,62.04,6134740
>> 21,3182568,13.48,1138918,65.05,6441477
>> 22,3130543,9.30,1128185,60.60,6748214
>> 23,3087426,5.15,1127912,55.36,7054951
>> 24,3099457,1.04,1176100,54.96,7361688
>>
>> [1] https://lkml.org/lkml/2019/3/6/413
>>
Alexander Duyck June 25, 2019, 5:10 p.m. UTC | #6
On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
>
> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >> This patch series proposes an efficient mechanism for communicating free memory
> >> from a guest to its hypervisor. It especially enables guests with no page cache
> >> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >> rapidly hand back free memory to the hypervisor.
> >> This approach has a minimal impact on the existing core-mm infrastructure.
> > Could you help us compare with Alex's series?
> > What are the main differences?
> Results on comparing the benefits/performance of Alexander's v1
> (bubble-hinting)[1], Page-Hinting (includes some of the upstream
> suggested changes on v10) over an unmodified Kernel.
>
> Test1 - Number of guests that can be launched without swap usage.
> Guest size: 5GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Process: Guest is launched sequentially after running an allocation
> program with 4GB request.
>
> Results:
> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
> usage of 2.3GB.
> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
> usage of 1MB.
> Page-hinting: 5 guests without swap usage and 6th guest with a swap
> usage of 8MB.
>
>
> Test2 - Memhog execution time
> Guest size: 6GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Process: 3 guests are launched and "time memhog 6G" is launched in each
> of them sequentially.
>
> Results:
> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
> the end-3.6G)
> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
> end-0)
> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
>
>
> Test3 - Will-it-scale's page_fault1
> Guest size: 6GB
> Cores: 24
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>
> unmodified kernel:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,459168,95.83,459315,95.83,459315
> 2,956272,91.68,884643,91.72,918630
> 3,1407811,87.53,1267948,87.69,1377945
> 4,1755744,83.39,1562471,83.73,1837260
> 5,2056741,79.24,1812309,80.00,2296575
> 6,2393759,75.09,2025719,77.02,2755890
> 7,2754403,70.95,2238180,73.72,3215205
> 8,2947493,66.81,2369686,70.37,3674520
> 9,3063579,62.68,2321148,68.84,4133835
> 10,3229023,58.54,2377596,65.84,4593150
> 11,3337665,54.40,2429818,64.01,5052465
> 12,3255140,50.28,2395070,61.63,5511780
> 13,3260721,46.11,2402644,59.77,5971095
> 14,3210590,42.02,2390806,57.46,6430410
> 15,3164811,37.88,2265352,51.39,6889725
> 16,3144764,33.77,2335028,54.07,7349040
> 17,3128839,29.63,2328662,49.52,7808355
> 18,3133344,25.50,2301181,48.01,8267670
> 19,3135979,21.38,2343003,43.66,8726985
> 20,3136448,17.27,2306109,40.81,9186300
> 21,3130324,13.16,2403688,35.84,9645615
> 22,3109883,9.04,2290808,36.24,10104930
> 23,3136805,4.94,2263818,35.43,10564245
> 24,3118949,0.78,2252891,31.03,11023560
>
> bubble-hinting v1:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,292183,95.83,292428,95.83,292428
> 2,540606,91.67,501887,91.91,584856
> 3,821748,87.53,735244,88.31,877284
> 4,1033782,83.38,839925,85.59,1169712
> 5,1261352,79.25,896464,83.86,1462140
> 6,1459544,75.12,1050094,80.93,1754568
> 7,1686537,70.97,1112202,79.23,2046996
> 8,1866892,66.83,1083571,78.48,2339424
> 9,2056887,62.72,1101660,77.94,2631852
> 10,2252955,58.57,1097439,77.36,2924280
> 11,2413907,54.40,1088583,76.72,3216708
> 12,2596504,50.35,1117474,76.01,3509136
> 13,2715338,46.21,1087666,75.32,3801564
> 14,2861697,42.08,1084692,74.35,4093992
> 15,2964620,38.02,1087910,73.40,4386420
> 16,3065575,33.84,1099406,71.07,4678848
> 17,3107674,29.76,1056948,71.36,4971276
> 18,3144963,25.71,1094883,70.14,5263704
> 19,3173468,21.61,1073049,66.21,5556132
> 20,3173233,17.55,1072417,67.16,5848560
> 21,3209710,13.37,1079147,65.64,6140988
> 22,3182958,9.37,1085872,65.95,6433416
> 23,3200747,5.23,1076414,59.40,6725844
> 24,3181699,1.04,1051233,65.62,7018272
>
> Page-hinting:
> tasks,processes,processes_idle,threads,threads_idle,linear
> 0,0,100,0,100,0
> 1,467693,95.83,467970,95.83,467970
> 2,967860,91.68,895883,91.70,935940
> 3,1408191,87.53,1279602,87.68,1403910
> 4,1766250,83.39,1557224,83.93,1871880
> 5,2124689,79.24,1834625,80.35,2339850
> 6,2413514,75.10,1989557,77.00,2807820
> 7,2644648,70.95,2158055,73.73,3275790
> 8,2896483,66.81,2305785,70.85,3743760
> 9,3157796,62.67,2304083,69.49,4211730
> 10,3251633,58.53,2379589,66.43,4679700
> 11,3313704,54.41,2349310,64.76,5147670
> 12,3285612,50.30,2362013,62.63,5615640
> 13,3207275,46.17,2377760,59.94,6083610
> 14,3221727,42.02,2416278,56.70,6551580
> 15,3194781,37.91,2334552,54.96,7019550
> 16,3211818,33.78,2399077,52.75,7487520
> 17,3172664,29.65,2337660,50.27,7955490
> 18,3177152,25.49,2349721,47.02,8423460
> 19,3149924,21.36,2319286,40.16,8891430
> 20,3166910,17.30,2279719,43.23,9359400
> 21,3159464,13.19,2342849,34.84,9827370
> 22,3167091,9.06,2285156,37.97,10295340
> 23,3174137,4.96,2365448,33.74,10763310
> 24,3161629,0.86,2253813,32.38,11231280
>
>
> Test4: Netperf
> Guest size: 5GB
> Cores: 4
> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> Netserver: Running on core 0
> Netperf: Running on core 1
> Recv Socket Size bytes: 131072
> Send Socket Size bytes:16384
> Send Message Size bytes:1000000000
> Time: 900s
> Process: netperf is run 3 times sequentially in the same guest with the
> same inputs mentioned above and throughput (10^6bits/sec) is observed.
> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
>
> Drawback with bubble-hinting:
> More invasive.
>
> Drawback with page-hinting:
> Additional bitmap required, including growing/shrinking the bitmap on
> memory hotplug.
>
>
> [1] https://lkml.org/lkml/2019/6/19/926

Any chance you could provide a .config for your kernel? I'm wondering
what is different between the two, as it seems like you are showing a
significant regression in terms of performance for the bubble
hinting/aeration approach versus a stock kernel without the patches,
and that doesn't match up with what I have been seeing.

Also, any ETA for when we can look at the patches for the approach you have?

Thanks.

- Alex
Nitesh Narayan Lal June 25, 2019, 5:31 p.m. UTC | #7
On 6/25/19 1:10 PM, Alexander Duyck wrote:
> On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>>
>> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
>>>> This patch series proposes an efficient mechanism for communicating free memory
>>>> from a guest to its hypervisor. It especially enables guests with no page cache
>>>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
>>>> rapidly hand back free memory to the hypervisor.
>>>> This approach has a minimal impact on the existing core-mm infrastructure.
>>> Could you help us compare with Alex's series?
>>> What are the main differences?
>> Results on comparing the benefits/performance of Alexander's v1
>> (bubble-hinting)[1], Page-Hinting (includes some of the upstream
>> suggested changes on v10) over an unmodified Kernel.
>>
>> Test1 - Number of guests that can be launched without swap usage.
>> Guest size: 5GB
>> Cores: 4
>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>> Process: Guest is launched sequentially after running an allocation
>> program with 4GB request.
>>
>> Results:
>> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
>> usage of 2.3GB.
>> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
>> usage of 1MB.
>> Page-hinting: 5 guests without swap usage and 6th guest with a swap
>> usage of 8MB.
>>
>>
>> Test2 - Memhog execution time
>> Guest size: 6GB
>> Cores: 4
>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>> Process: 3 guests are launched and "time memhog 6G" is launched in each
>> of them sequentially.
>>
>> Results:
>> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
>> the end-3.6G)
>> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
>> end-0)
>> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
>>
>>
>> Test3 - Will-it-scale's page_fault1
>> Guest size: 6GB
>> Cores: 24
>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>>
>> unmodified kernel:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,459168,95.83,459315,95.83,459315
>> 2,956272,91.68,884643,91.72,918630
>> 3,1407811,87.53,1267948,87.69,1377945
>> 4,1755744,83.39,1562471,83.73,1837260
>> 5,2056741,79.24,1812309,80.00,2296575
>> 6,2393759,75.09,2025719,77.02,2755890
>> 7,2754403,70.95,2238180,73.72,3215205
>> 8,2947493,66.81,2369686,70.37,3674520
>> 9,3063579,62.68,2321148,68.84,4133835
>> 10,3229023,58.54,2377596,65.84,4593150
>> 11,3337665,54.40,2429818,64.01,5052465
>> 12,3255140,50.28,2395070,61.63,5511780
>> 13,3260721,46.11,2402644,59.77,5971095
>> 14,3210590,42.02,2390806,57.46,6430410
>> 15,3164811,37.88,2265352,51.39,6889725
>> 16,3144764,33.77,2335028,54.07,7349040
>> 17,3128839,29.63,2328662,49.52,7808355
>> 18,3133344,25.50,2301181,48.01,8267670
>> 19,3135979,21.38,2343003,43.66,8726985
>> 20,3136448,17.27,2306109,40.81,9186300
>> 21,3130324,13.16,2403688,35.84,9645615
>> 22,3109883,9.04,2290808,36.24,10104930
>> 23,3136805,4.94,2263818,35.43,10564245
>> 24,3118949,0.78,2252891,31.03,11023560
>>
>> bubble-hinting v1:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,292183,95.83,292428,95.83,292428
>> 2,540606,91.67,501887,91.91,584856
>> 3,821748,87.53,735244,88.31,877284
>> 4,1033782,83.38,839925,85.59,1169712
>> 5,1261352,79.25,896464,83.86,1462140
>> 6,1459544,75.12,1050094,80.93,1754568
>> 7,1686537,70.97,1112202,79.23,2046996
>> 8,1866892,66.83,1083571,78.48,2339424
>> 9,2056887,62.72,1101660,77.94,2631852
>> 10,2252955,58.57,1097439,77.36,2924280
>> 11,2413907,54.40,1088583,76.72,3216708
>> 12,2596504,50.35,1117474,76.01,3509136
>> 13,2715338,46.21,1087666,75.32,3801564
>> 14,2861697,42.08,1084692,74.35,4093992
>> 15,2964620,38.02,1087910,73.40,4386420
>> 16,3065575,33.84,1099406,71.07,4678848
>> 17,3107674,29.76,1056948,71.36,4971276
>> 18,3144963,25.71,1094883,70.14,5263704
>> 19,3173468,21.61,1073049,66.21,5556132
>> 20,3173233,17.55,1072417,67.16,5848560
>> 21,3209710,13.37,1079147,65.64,6140988
>> 22,3182958,9.37,1085872,65.95,6433416
>> 23,3200747,5.23,1076414,59.40,6725844
>> 24,3181699,1.04,1051233,65.62,7018272
>>
>> Page-hinting:
>> tasks,processes,processes_idle,threads,threads_idle,linear
>> 0,0,100,0,100,0
>> 1,467693,95.83,467970,95.83,467970
>> 2,967860,91.68,895883,91.70,935940
>> 3,1408191,87.53,1279602,87.68,1403910
>> 4,1766250,83.39,1557224,83.93,1871880
>> 5,2124689,79.24,1834625,80.35,2339850
>> 6,2413514,75.10,1989557,77.00,2807820
>> 7,2644648,70.95,2158055,73.73,3275790
>> 8,2896483,66.81,2305785,70.85,3743760
>> 9,3157796,62.67,2304083,69.49,4211730
>> 10,3251633,58.53,2379589,66.43,4679700
>> 11,3313704,54.41,2349310,64.76,5147670
>> 12,3285612,50.30,2362013,62.63,5615640
>> 13,3207275,46.17,2377760,59.94,6083610
>> 14,3221727,42.02,2416278,56.70,6551580
>> 15,3194781,37.91,2334552,54.96,7019550
>> 16,3211818,33.78,2399077,52.75,7487520
>> 17,3172664,29.65,2337660,50.27,7955490
>> 18,3177152,25.49,2349721,47.02,8423460
>> 19,3149924,21.36,2319286,40.16,8891430
>> 20,3166910,17.30,2279719,43.23,9359400
>> 21,3159464,13.19,2342849,34.84,9827370
>> 22,3167091,9.06,2285156,37.97,10295340
>> 23,3174137,4.96,2365448,33.74,10763310
>> 24,3161629,0.86,2253813,32.38,11231280
>>
>>
>> Test4: Netperf
>> Guest size: 5GB
>> Cores: 4
>> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
>> Netserver: Running on core 0
>> Netperf: Running on core 1
>> Recv Socket Size bytes: 131072
>> Send Socket Size bytes:16384
>> Send Message Size bytes:1000000000
>> Time: 900s
>> Process: netperf is run 3 times sequentially in the same guest with the
>> same inputs mentioned above and throughput (10^6bits/sec) is observed.
>> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
>> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
>> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
>>
>> Drawback with bubble-hinting:
>> More invasive.
>>
>> Drawback with page-hinting:
>> Additional bitmap required, including growing/shrinking the bitmap on
>> memory hotplug.
>>
>>
>> [1] https://lkml.org/lkml/2019/6/19/926
> Any chance you could provide a .config for your kernel? I'm wondering
> what is different between the two as it seems like you are showing a
> significant regression in terms of performance for the bubble
> hinting/aeration approach versus a stock kernel without the patches
> and that doesn't match up with what I have been seeing.
I have attached the config which I was using.
>
> Also, any ETA for when we can look at the patches for the approach you have?
I am hoping to get more comments about the overall approach before
posting my next series.
If I don't get any, I will probably post my series with the changes made
so far.
(As of now, all of the changes are around the suggestions made by you and
David.)
>
> Thanks.
>
> - Alex
Alexander Duyck June 28, 2019, 6:25 p.m. UTC | #8
On Tue, Jun 25, 2019 at 10:32 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>
> On 6/25/19 1:10 PM, Alexander Duyck wrote:
> > On Tue, Jun 25, 2019 at 7:49 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
> >>
> >> On 6/3/19 2:04 PM, Michael S. Tsirkin wrote:
> >>> On Mon, Jun 03, 2019 at 01:03:04PM -0400, Nitesh Narayan Lal wrote:
> >>>> This patch series proposes an efficient mechanism for communicating free memory
> >>>> from a guest to its hypervisor. It especially enables guests with no page cache
> >>>> (e.g., nvdimm, virtio-pmem) or with small page caches (e.g., ram > disk) to
> >>>> rapidly hand back free memory to the hypervisor.
> >>>> This approach has a minimal impact on the existing core-mm infrastructure.
> >>> Could you help us compare with Alex's series?
> >>> What are the main differences?
> >> Results on comparing the benefits/performance of Alexander's v1
> >> (bubble-hinting)[1], Page-Hinting (includes some of the upstream
> >> suggested changes on v10) over an unmodified Kernel.
> >>
> >> Test1 - Number of guests that can be launched without swap usage.
> >> Guest size: 5GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Process: Guest is launched sequentially after running an allocation
> >> program with 4GB request.
> >>
> >> Results:
> >> unmodified kernel: 2 guests without swap usage and 3rd guest with a swap
> >> usage of 2.3GB.
> >> bubble-hinting v1: 4 guests without swap usage and 5th guest with a swap
> >> usage of 1MB.
> >> Page-hinting: 5 guests without swap usage and 6th guest with a swap
> >> usage of 8MB.
> >>
> >>
> >> Test2 - Memhog execution time
> >> Guest size: 6GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Process: 3 guests are launched and "time memhog 6G" is launched in each
> >> of them sequentially.
> >>
> >> Results:
> >> unmodified kernel: Guest1-40s, Guest2-1m5s, Guest3-6m38s (swap usage at
> >> the end-3.6G)
> >> bubble-hinting v1: Guest1-32s, Guest2-58s, Guest3-35s (swap usage at the
> >> end-0)
> >> Page-hinting: Guest1-42s, Guest2-47s, Guest3-32s (swap usage at the end-0)
> >>
> >>
> >> Test3 - Will-it-scale's page_fault1
> >> Guest size: 6GB
> >> Cores: 24
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >>
> >> unmodified kernel:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,459168,95.83,459315,95.83,459315
> >> 2,956272,91.68,884643,91.72,918630
> >> 3,1407811,87.53,1267948,87.69,1377945
> >> 4,1755744,83.39,1562471,83.73,1837260
> >> 5,2056741,79.24,1812309,80.00,2296575
> >> 6,2393759,75.09,2025719,77.02,2755890
> >> 7,2754403,70.95,2238180,73.72,3215205
> >> 8,2947493,66.81,2369686,70.37,3674520
> >> 9,3063579,62.68,2321148,68.84,4133835
> >> 10,3229023,58.54,2377596,65.84,4593150
> >> 11,3337665,54.40,2429818,64.01,5052465
> >> 12,3255140,50.28,2395070,61.63,5511780
> >> 13,3260721,46.11,2402644,59.77,5971095
> >> 14,3210590,42.02,2390806,57.46,6430410
> >> 15,3164811,37.88,2265352,51.39,6889725
> >> 16,3144764,33.77,2335028,54.07,7349040
> >> 17,3128839,29.63,2328662,49.52,7808355
> >> 18,3133344,25.50,2301181,48.01,8267670
> >> 19,3135979,21.38,2343003,43.66,8726985
> >> 20,3136448,17.27,2306109,40.81,9186300
> >> 21,3130324,13.16,2403688,35.84,9645615
> >> 22,3109883,9.04,2290808,36.24,10104930
> >> 23,3136805,4.94,2263818,35.43,10564245
> >> 24,3118949,0.78,2252891,31.03,11023560
> >>
> >> bubble-hinting v1:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,292183,95.83,292428,95.83,292428
> >> 2,540606,91.67,501887,91.91,584856
> >> 3,821748,87.53,735244,88.31,877284
> >> 4,1033782,83.38,839925,85.59,1169712
> >> 5,1261352,79.25,896464,83.86,1462140
> >> 6,1459544,75.12,1050094,80.93,1754568
> >> 7,1686537,70.97,1112202,79.23,2046996
> >> 8,1866892,66.83,1083571,78.48,2339424
> >> 9,2056887,62.72,1101660,77.94,2631852
> >> 10,2252955,58.57,1097439,77.36,2924280
> >> 11,2413907,54.40,1088583,76.72,3216708
> >> 12,2596504,50.35,1117474,76.01,3509136
> >> 13,2715338,46.21,1087666,75.32,3801564
> >> 14,2861697,42.08,1084692,74.35,4093992
> >> 15,2964620,38.02,1087910,73.40,4386420
> >> 16,3065575,33.84,1099406,71.07,4678848
> >> 17,3107674,29.76,1056948,71.36,4971276
> >> 18,3144963,25.71,1094883,70.14,5263704
> >> 19,3173468,21.61,1073049,66.21,5556132
> >> 20,3173233,17.55,1072417,67.16,5848560
> >> 21,3209710,13.37,1079147,65.64,6140988
> >> 22,3182958,9.37,1085872,65.95,6433416
> >> 23,3200747,5.23,1076414,59.40,6725844
> >> 24,3181699,1.04,1051233,65.62,7018272
> >>
> >> Page-hinting:
> >> tasks,processes,processes_idle,threads,threads_idle,linear
> >> 0,0,100,0,100,0
> >> 1,467693,95.83,467970,95.83,467970
> >> 2,967860,91.68,895883,91.70,935940
> >> 3,1408191,87.53,1279602,87.68,1403910
> >> 4,1766250,83.39,1557224,83.93,1871880
> >> 5,2124689,79.24,1834625,80.35,2339850
> >> 6,2413514,75.10,1989557,77.00,2807820
> >> 7,2644648,70.95,2158055,73.73,3275790
> >> 8,2896483,66.81,2305785,70.85,3743760
> >> 9,3157796,62.67,2304083,69.49,4211730
> >> 10,3251633,58.53,2379589,66.43,4679700
> >> 11,3313704,54.41,2349310,64.76,5147670
> >> 12,3285612,50.30,2362013,62.63,5615640
> >> 13,3207275,46.17,2377760,59.94,6083610
> >> 14,3221727,42.02,2416278,56.70,6551580
> >> 15,3194781,37.91,2334552,54.96,7019550
> >> 16,3211818,33.78,2399077,52.75,7487520
> >> 17,3172664,29.65,2337660,50.27,7955490
> >> 18,3177152,25.49,2349721,47.02,8423460
> >> 19,3149924,21.36,2319286,40.16,8891430
> >> 20,3166910,17.30,2279719,43.23,9359400
> >> 21,3159464,13.19,2342849,34.84,9827370
> >> 22,3167091,9.06,2285156,37.97,10295340
> >> 23,3174137,4.96,2365448,33.74,10763310
> >> 24,3161629,0.86,2253813,32.38,11231280
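
For readers unfamiliar with will-it-scale, page_fault1 repeatedly maps an
anonymous region, writes to every page to trigger a fault, and unmaps it again,
scaled across tasks. The loop below is a rough sketch of that pattern (the
region size and structure are assumptions, not the actual will-it-scale
source). Because the region is unmapped on every iteration, freed pages keep
cycling through the guest's page allocator.

/* page_fault1-style loop - rough sketch only, not the will-it-scale source. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (128UL << 20)	/* per-iteration region size is an assumption */

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	for (;;) {
		char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* One write per page => one minor fault per page. */
		for (size_t off = 0; off < MAP_SIZE; off += (size_t)page)
			p[off] = 1;
		munmap(p, MAP_SIZE);
	}
}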
> >>
> >>
> >> Test4: Netperf
> >> Guest size: 5GB
> >> Cores: 4
> >> Total NUMA Node Memory ~ 15 GB (All guests are running on a single node)
> >> Netserver: Running on core 0
> >> Netperf: Running on core 1
> >> Recv Socket Size bytes: 131072
> >> Send Socket Size bytes: 16384
> >> Send Message Size bytes: 1000000000
> >> Time: 900s
> >> Process: netperf is run 3 times sequentially in the same guest with the
> >> same inputs mentioned above, and throughput (10^6 bits/sec) is observed.
> >> unmodified kernel: 1st Run-14769.60, 2nd Run-14849.18, 3rd Run-14842.02
> >> bubble-hinting v1: 1st Run-13441.77, 2nd Run-13487.81, 3rd Run-13503.87
> >> Page-hinting: 1st Run-14308.20, 2nd Run-14344.36, 3rd Run-14450.07
> >>
> >> Drawback with bubble-hinting:
> >> More invasive.
> >>
> >> Drawback with page-hinting:
> >> Additional bitmap required, including growing/shrinking the bitmap on
> >> memory hotplug.
> >>
> >>
> >> [1] https://lkml.org/lkml/2019/6/19/926
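
To make the bitmap drawback above concrete: the page-hinting approach keeps
one bit per MAX_ORDER - 2 sized chunk of each zone, and that bitmap has to be
reallocated whenever the zone grows or shrinks on memory hotplug. The sketch
below only illustrates the bookkeeping involved; the helper names and the
resize logic are assumptions, not code from the series.

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mmzone.h>

/* One bit per MAX_ORDER - 2 chunk of a zone (illustrative only). */
#define HINT_CHUNK_ORDER	(MAX_ORDER - 2)

static unsigned long hint_bitmap_bits(struct zone *zone)
{
	return zone->spanned_pages >> HINT_CHUNK_ORDER;
}

/* Would be called from the memory hotplug path when the zone is resized. */
static int hint_bitmap_resize(struct zone *zone, unsigned long *old_bm,
			      unsigned long old_bits, unsigned long **new_bm)
{
	unsigned long bits = hint_bitmap_bits(zone);
	unsigned long *bm = bitmap_zalloc(bits, GFP_KERNEL);

	if (!bm)
		return -ENOMEM;
	/* Preserve existing hints for the part of the zone that survived. */
	if (old_bm)
		bitmap_copy(bm, old_bm, min(bits, old_bits));
	*new_bm = bm;
	return 0;
}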
> > Any chance you could provide a .config for your kernel? I'm wondering
> > what is different between the two as it seems like you are showing a
> > significant regression in terms of performance for the bubble
> > hinting/aeration approach versus a stock kernel without the patches
> > and that doesn't match up with what I have been seeing.
> I have attached the config which I was using.

Were all of these runs with the same config? I ask because I noticed
the config you provided had a number of quite expensive memory debug
options enabled:

#
# Memory Debugging
#
CONFIG_PAGE_EXTENSION=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_PAGE_OWNER=y
# CONFIG_PAGE_POISONING is not set
CONFIG_DEBUG_PAGE_REF=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
# CONFIG_DEBUG_OBJECTS_FREE is not set
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
# CONFIG_DEBUG_OBJECTS_WORK is not set
# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
# CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
CONFIG_SLUB_DEBUG_ON=y
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK=y
CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=400
# CONFIG_DEBUG_KMEMLEAK_TEST is not set
# CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_DEBUG_PER_CPU_MAPS=y
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_CC_HAS_KASAN_GENERIC=y
# CONFIG_KASAN is not set
CONFIG_KASAN_STACK=1
# end of Memory Debugging

When I went through and enabled these, my results for the bubble
hinting matched pretty closely to what you reported. However, when I
compiled without the patches but with this config enabled, the results
were still about what was reported with the bubble hinting, maybe 5%
improved. I'm just wondering if you were doing some additional
debugging and left those options enabled for the bubble hinting test
run.
Nitesh Narayan Lal June 28, 2019, 7:13 p.m. UTC | #9
On 6/28/19 2:25 PM, Alexander Duyck wrote:
> On Tue, Jun 25, 2019 at 10:32 AM Nitesh Narayan Lal <nitesh@redhat.com> wrote:
>> On 6/25/19 1:10 PM, Alexander Duyck wrote:
>>> Any chance you could provide a .config for your kernel? I'm wondering
>>> what is different between the two as it seems like you are showing a
>>> significant regression in terms of performance for the bubble
>>> hinting/aeration approach versus a stock kernel without the patches
>>> and that doesn't match up with what I have been seeing.
>> I have attached the config which I was using.
> Were all of these runs with the same config? I ask because I noticed
> the config you provided had a number of quite expensive memory debug
> options enabled:
Yes, memory debugging configs were enabled for all the cases.
>
> #
> # Memory Debugging
> #
> CONFIG_PAGE_EXTENSION=y
> CONFIG_DEBUG_PAGEALLOC=y
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
> CONFIG_PAGE_OWNER=y
> # CONFIG_PAGE_POISONING is not set
> CONFIG_DEBUG_PAGE_REF=y
> # CONFIG_DEBUG_RODATA_TEST is not set
> CONFIG_DEBUG_OBJECTS=y
> # CONFIG_DEBUG_OBJECTS_SELFTEST is not set
> # CONFIG_DEBUG_OBJECTS_FREE is not set
> # CONFIG_DEBUG_OBJECTS_TIMERS is not set
> # CONFIG_DEBUG_OBJECTS_WORK is not set
> # CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
> # CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
> CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
> CONFIG_SLUB_DEBUG_ON=y
> # CONFIG_SLUB_STATS is not set
> CONFIG_HAVE_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=400
> # CONFIG_DEBUG_KMEMLEAK_TEST is not set
> # CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF is not set
> CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN=y
> CONFIG_DEBUG_STACK_USAGE=y
> CONFIG_DEBUG_VM=y
> # CONFIG_DEBUG_VM_VMACACHE is not set
> # CONFIG_DEBUG_VM_RB is not set
> # CONFIG_DEBUG_VM_PGFLAGS is not set
> CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_VIRTUAL=y
> CONFIG_DEBUG_MEMORY_INIT=y
> CONFIG_DEBUG_PER_CPU_MAPS=y
> CONFIG_HAVE_ARCH_KASAN=y
> CONFIG_CC_HAS_KASAN_GENERIC=y
> # CONFIG_KASAN is not set
> CONFIG_KASAN_STACK=1
> # end of Memory Debugging
>
> When I went through and enabled these, my results for the bubble
> hinting matched pretty closely to what you reported. However, when I
> compiled without the patches but with this config enabled, the results
> were still about what was reported with the bubble hinting, maybe 5%
> improved. I'm just wondering if you were doing some additional
> debugging and left those options enabled for the bubble hinting test
> run.
I have the same set of debugging options enabled for all three cases
reported.