Message ID: 20201221162519.GA22504@open-light-1.localdomain (mailing list archive)
Series: speed up page allocation for __GFP_ZERO
On 21.12.20 17:25, Liang Li wrote:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zeroing out page content usually happens when allocating pages with the
> __GFP_ZERO flag. This is a time-consuming operation and it makes
> populating a large VMA very slow. This patch set introduces a new
> feature that zeroes out free pages before page allocation, which helps
> to speed up page allocation with __GFP_ZERO.
>
> My original intention for adding this feature was to shorten VM
> creation time when an SR-IOV device is attached; it works well and the
> VM creation time is reduced by about 90%.
>
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                 round1  round2  round3
> w/o this patch: 23.5s   24.7s   24.6s
> w/  this patch: 10.2s   10.3s   11.2s
>
> QEMU uses 4K pages, THP is on
>                 round1  round2  round3
> w/o this patch: 17.9s   14.8s   14.9s
> w/  this patch:  1.9s    1.8s    1.9s
> =====================================================

I am still not convinced that we want/need this for this (main) use
case. Why can't we use huge pages for such use cases (that really care
about VM creation time) and rather deal with pre-zeroing of huge pages
instead?

If possible, I'd like to avoid GFP_ZERO (for reasons already discussed).

> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:
>
> Interactive scenarios
> =====================
> Shortening application launch time on desktops or mobile phones helps
> to improve the user experience. Tests on a server [Intel(R) Xeon(R) CPU
> E5-2620 v3 @ 2.40GHz] show that zeroing out 1GB of RAM in the kernel
> takes about 200ms, while commonly used applications such as the Firefox
> browser or Office consume 100 ~ 300 MB of RAM right after launch. By
> pre-zeroing free pages, application launch time could be reduced by
> about 20~60ms (can that be visually sensed?). Maybe we can use this
> feature to speed up the launch of Android apps (I didn't run any tests
> for Android).

I am not really sure if you can actually visually sense a difference in
your examples. Startup time of an application is not just memory
allocation (page zeroing) time. It would be interesting to see how much
of a difference this actually makes in practice (e.g., Firefox startup
time etc.).

> Virtualization
> ==============
> Speed up VM creation and shorten guest boot time, especially for the
> PCI SR-IOV device passthrough scenario. Compared with some of the
> paravirtualization solutions, it is easy to deploy because it is
> transparent to the guest and can handle DMA properly in the BIOS stage,
> which the paravirtualization solutions can't handle well.

What is the "para virtualization" approach you are talking about?

> Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
> memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
> free pages to the VMM, and the VMM unmaps the corresponding host pages
> for reclaim. When the guest allocates a page that was just reclaimed,
> the host has to allocate a new page and zero it out for the guest; in
> this case, pre-zeroed free pages help to speed up the fault-in process
> and reduce the performance impact.

Such faults in the VMM are no different to other faults, when first
accessing a page to be populated. Again, I wonder how much of a
difference it actually makes.

> Speed up kernel routines
> ========================
> This can't be guaranteed because we don't pre-zero all the free pages,
> but it is true for most cases. It can help to speed up some important
> system calls such as fork, which allocates zeroed pages for building
> page tables, and it speeds up page fault handling, especially huge page
> faults. A POC of hugetlb free page pre-zeroing has been done.

Would be interesting to have an actual example with some numbers.

> Security
> ========
> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> boot options", which zeroes out pages asynchronously. For users who
> can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> brings, this feature provides another choice.

"we don't pre zero out all the free pages" so this is of little actual use.
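[Editorial note: for context on the numbers being debated, the zeroing cost can be estimated from userspace with a small first-touch benchmark like the sketch below; it simply times how long the kernel takes to hand out (and therefore zero) 1 GB of anonymous pages. The size, the 4K stride and the output format are illustrative only, and results will vary by machine.]

/* fault_time.c - time first-touch faults of anonymous memory.
 * Illustrative sketch only; numbers are machine dependent.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t sz = 1UL << 30;			/* 1 GB */
	struct timespec t0, t1;

	char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* Touch one byte per 4K page: each fault makes the kernel
	 * allocate and zero a page before returning it to us. */
	for (size_t off = 0; off < sz; off += 4096)
		p[off] = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e6;
	printf("faulted %zu MB in %.1f ms\n", sz >> 20, ms);

	munmap(p, sz);
	return 0;
}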
On Tue, Dec 22, 2020 at 4:47 PM David Hildenbrand <david@redhat.com> wrote: > > On 21.12.20 17:25, Liang Li wrote: > > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > > > Zero out the page content usually happens when allocating pages with > > the flag of __GFP_ZERO, this is a time consuming operation, it makes > > the population of a large vma area very slowly. This patch introduce > > a new feature for zero out pages before page allocation, it can help > > to speed up page allocation with __GFP_ZERO. > > > > My original intention for adding this feature is to shorten VM > > creation time when SR-IOV devicde is attached, it works good and the > > VM creation time is reduced by about 90%. > > > > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > > ===================================================== > > QEMU use 4K pages, THP is off > > round1 round2 round3 > > w/o this patch: 23.5s 24.7s 24.6s > > w/ this patch: 10.2s 10.3s 11.2s > > > > QEMU use 4K pages, THP is on > > round1 round2 round3 > > w/o this patch: 17.9s 14.8s 14.9s > > w/ this patch: 1.9s 1.8s 1.9s > > ===================================================== > > > > I am still not convinces that we want/need this for this (main) use > case. Why can't we use huge pages for such use cases (that really care > about VM creation time) and rather deal with pre-zeroing of huge pages > instead? > > If possible, I'd like to avoid GFP_ZERO (for reasons already discussed). > Yes, for VM creation, we can simply use hugetlb for that, just like what I have done in the other series 'mm: support free hugepage pre zero out' I send the v2 because I think VM creation is just one example we can benefit from. > > Obviously, it can do more than this. We can benefit from this feature > > in the flowing case: > > > > Interactive sence > > ================= > > Shorten application lunch time on desktop or mobile phone, it can help > > to improve the user experience. Test shows on a > > server [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zero out 1GB RAM by > > the kernel will take about 200ms, while some mainly used application > > like Firefox browser, Office will consume 100 ~ 300 MB RAM just after > > launch, by pre zero out free pages, it means the application launch > > time will be reduced about 20~60ms (can be visual sensed?). May be > > we can make use of this feature to speed up the launch of Andorid APP > > (I didn't do any test for Android). > > I am not really sure if you can actually visually sense a difference in > your examples. Startup time of an application is not just memory > allocation (page zeroing) time. It would be interesting of much of a > difference this actually makes in practice. (e.g., firefox startup time > etc.) Yes, using Firefox and Office as an example seems not convincing, maybe a large Game APP which consumes several GB of RAM is better. > > > > Virtulization > > ============= > > Speed up VM creation and shorten guest boot time, especially for PCI > > SR-IOV device passthrough scenario. Compared with some of the para > > vitalization solutions, it is easy to deploy because it’s transparent > > to guest and can handle DMA properly in BIOS stage, while the para > > virtualization solution can’t handle it well. > > What is the "para virtualization" approach you are talking about? 
I referred to two topics from KVM Forum 2020; the docs give more details:
https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf

and the following link is mine:
https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf

> >
> > Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
> > memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
> > free pages to the VMM, and the VMM unmaps the corresponding host pages
> > for reclaim. When the guest allocates a page that was just reclaimed,
> > the host has to allocate a new page and zero it out for the guest; in
> > this case, pre-zeroed free pages help to speed up the fault-in process
> > and reduce the performance impact.
>
> Such faults in the VMM are no different to other faults, when first
> accessing a page to be populated. Again, I wonder how much of a
> difference it actually makes.

I am not just referring to faults in the VMM, I mean the whole process
that handles guest page faults. Without VIRTIO_BALLOON_F_REPORTING,
pages used by the guest are zeroed out only once by the host. With
VIRTIO_BALLOON_F_REPORTING, free pages are reclaimed by the host and may
return to the host buddy free list; when such a page is given back to
the guest, the host kernel needs to zero it out again. This means that
with VIRTIO_BALLOON_F_REPORTING, guest memory performance is degraded by
the frequent zeroing on the host side, and the degradation is especially
obvious for huge pages. Pre-zeroing free pages helps to make guest
memory performance almost the same as without
VIRTIO_BALLOON_F_REPORTING.

> > Speed up kernel routines
> > ========================
> > This can't be guaranteed because we don't pre-zero all the free pages,
> > but it is true for most cases. It can help to speed up some important
> > system calls such as fork, which allocates zeroed pages for building
> > page tables, and it speeds up page fault handling, especially huge page
> > faults. A POC of hugetlb free page pre-zeroing has been done.
>
> Would be interesting to have an actual example with some numbers.

I will try to do some tests to get some numbers.

> > Security
> > ========
> > This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> > boot options", which zeroes out pages asynchronously. For users who
> > can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> > brings, this feature provides another choice.
>
> "we don't pre zero out all the free pages" so this is of little actual use.

OK. It seems none of the reasons listed above is strong enough for this
feature on its own; among all of them, which one is likely to become the
strongest? From the implementation, you will find it is configurable;
users who don't want it can turn it off. Is that not an option?

Thanks for your comments, David.

Liang
> >>> >>> Virtulization >>> ============= >>> Speed up VM creation and shorten guest boot time, especially for PCI >>> SR-IOV device passthrough scenario. Compared with some of the para >>> vitalization solutions, it is easy to deploy because it’s transparent >>> to guest and can handle DMA properly in BIOS stage, while the para >>> virtualization solution can’t handle it well. >> >> What is the "para virtualization" approach you are talking about? > > I refer two topic in the KVM forum 2020, the doc can give more details : > https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf > https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf > > and the flowing link is mine: > https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf Thanks for the pointers! I actually did watch your presentation. >> >>> >>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory >>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page >>> to the VMM, VMM will unmap the corresponding host page for reclaim, >>> when guest allocate a page just reclaimed, host will allocate a new page >>> and zero it out for guest, in this case pre zero out free page will help >>> to speed up the proccess of fault in and reduce the performance impaction. >> >> Such faults in the VMM are no different to other faults, when first >> accessing a page to be populated. Again, I wonder how much of a >> difference it actually makes. >> > > I am not just referring to faults in the VMM, I mean the whole process > that handles guest page faults. > without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero > out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are > reclaimed by the host and may return to the host buddy > free list. When the pages are given back to the guest, the host kernel > needs to zero out it again. It means > with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be > degraded for frequently > zero out operation on host side. The performance degradation will be > obvious for huge page case. Free > page pre zero out can help to make guest memory performance almost the > same as without > VIRTIO_BALLOON_F_REPORTING. Yes, what I am saying is that this fault handling is no different to ordinary faults when accessing a virtual memory location the first time and populating a page. The only difference is that it happens continuously, not only the first time we touch a page. And we might be able to improve handling in the hypervisor in the future. We have been discussing using MADV_FREE instead of MADV_DONTNEED in QEMU for handling free page reporting. Then, guest reported pages will only get reclaimed by the hypervisor when there is actual memory pressure in the hypervisor (e.g., when about to swap). And zeroing a page is an obvious improvement over going to swap. The price for zeroing pages has to be paid at one point. Also note that we've been discussing cache-related things already. If you zero out before giving the page to the guest, the page will already be in the cache - where the guest directly wants to access it. [...] >>> >>> Security >>> ======== >>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 >>> boot options", which zero out page in a asynchronous way. 
For users can't >>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, >>> this feauture provide another choice. >> "we don’t pre zero out all the free pages" so this is of little actual use. > > OK. It seems none of the reasons listed above is strong enough for I was rather saying that for security it's of little use IMHO. Application/VM start up time might be improved by using huge pages (and pre-zeroing these). Free page reporting might be improved by using MADV_FREE instead of MADV_DONTNEED in the hypervisor. > this feature, above all of them, which one is likely to become the > most strong one? From the implementation, you will find it is > configurable, users don't want to use it can turn it off. This is not > an option? Well, we have to maintain the feature and sacrifice a page flag. For example, do we expect someone explicitly enabling the feature just to speed up startup time of an app that consumes a lot of memory? I highly doubt it. I'd love to hear opinions of other people. (a lot of people are offline until beginning of January, including, well, actually me :) )
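[Editorial note: for readers unfamiliar with the distinction discussed above, MADV_DONTNEED discards the reported pages immediately, so the next guest access faults in a freshly zeroed page, while MADV_FREE only marks them lazily reclaimable, so the contents survive unless the host actually comes under memory pressure. Below is a minimal sketch of how a VMM might discard a reported range; hva and len are placeholders for the host mapping of the guest range, and the fallback logic is illustrative. MADV_FREE needs Linux >= 4.5 and only applies to private anonymous mappings.]

#include <sys/mman.h>
#include <errno.h>

/*
 * Discard a guest-reported free range in the VMM.
 * hva/len are placeholders for the host mapping of the reported range.
 */
int discard_reported_range(void *hva, size_t len, int lazy)
{
	/*
	 * MADV_DONTNEED: pages are gone at once; the next guest access
	 * faults in a newly zeroed page (the cost discussed above).
	 *
	 * MADV_FREE: pages are only reclaimed under memory pressure;
	 * if the guest touches them before that, the old contents are
	 * still there and no new zeroing is needed.
	 */
	int advice = lazy ? MADV_FREE : MADV_DONTNEED;

	if (madvise(hva, len, advice) < 0) {
		/* Older kernels may not support MADV_FREE; fall back. */
		if (lazy && errno == EINVAL)
			return madvise(hva, len, MADV_DONTNEED);
		return -1;
	}
	return 0;
}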
On Mon, Dec 21, 2020 at 11:25:22AM -0500, Liang Li wrote: > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > ===================================================== > QEMU use 4K pages, THP is off > round1 round2 round3 > w/o this patch: 23.5s 24.7s 24.6s > w/ this patch: 10.2s 10.3s 11.2s > > QEMU use 4K pages, THP is on > round1 round2 round3 > w/o this patch: 17.9s 14.8s 14.9s > w/ this patch: 1.9s 1.8s 1.9s > ===================================================== The cost of zeroing pages has to be paid somewhere. You've successfully moved it out of this path that you can measure. So now you've put it somewhere that you're not measuring. Why is this a win? > Speed up kernel routine > ======================= > This can’t be guaranteed because we don’t pre zero out all the free pages, > but is true for most case. It can help to speed up some important system > call just like fork, which will allocate zero pages for building page > table. And speed up the process of page fault, especially for huge page > fault. The POC of Hugetlb free page pre zero out has been done. Try kernbench with and without your patch.
https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf > > > > and the flowing link is mine: > > https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf > > Thanks for the pointers! I actually did watch your presentation. You're welcome! And thanks for your time! :) > >> > >>> > >>> Improve guest performance when use VIRTIO_BALLOON_F_REPORTING for memory > >>> overcommit. The VIRTIO_BALLOON_F_REPORTING feature will report guest page > >>> to the VMM, VMM will unmap the corresponding host page for reclaim, > >>> when guest allocate a page just reclaimed, host will allocate a new page > >>> and zero it out for guest, in this case pre zero out free page will help > >>> to speed up the proccess of fault in and reduce the performance impaction. > >> > >> Such faults in the VMM are no different to other faults, when first > >> accessing a page to be populated. Again, I wonder how much of a > >> difference it actually makes. > >> > > > > I am not just referring to faults in the VMM, I mean the whole process > > that handles guest page faults. > > without VIRTIO_BALLOON_F_REPORTING, pages used by guests will be zero > > out only once by host. With VIRTIO_BALLOON_F_REPORTING, free pages are > > reclaimed by the host and may return to the host buddy > > free list. When the pages are given back to the guest, the host kernel > > needs to zero out it again. It means > > with VIRTIO_BALLOON_F_REPORTING, guest memory performance will be > > degraded for frequently > > zero out operation on host side. The performance degradation will be > > obvious for huge page case. Free > > page pre zero out can help to make guest memory performance almost the > > same as without > > VIRTIO_BALLOON_F_REPORTING. > > Yes, what I am saying is that this fault handling is no different to > ordinary faults when accessing a virtual memory location the first time > and populating a page. The only difference is that it happens > continuously, not only the first time we touch a page. > > And we might be able to improve handling in the hypervisor in the > future. We have been discussing using MADV_FREE instead of MADV_DONTNEED > in QEMU for handling free page reporting. Then, guest reported pages > will only get reclaimed by the hypervisor when there is actual memory > pressure in the hypervisor (e.g., when about to swap). And zeroing a > page is an obvious improvement over going to swap. The price for zeroing > pages has to be paid at one point. > > Also note that we've been discussing cache-related things already. If > you zero out before giving the page to the guest, the page will already > be in the cache - where the guest directly wants to access it. > OK, that's very reasonable and much better. Looking forward for your work. > >>> > >>> Security > >>> ======== > >>> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 > >>> boot options", which zero out page in a asynchronous way. For users can't > >>> tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, > >>> this feauture provide another choice. > >> "we don’t pre zero out all the free pages" so this is of little actual use. > > > > OK. It seems none of the reasons listed above is strong enough for > > I was rather saying that for security it's of little use IMHO. > Application/VM start up time might be improved by using huge pages (and > pre-zeroing these). 
> Free page reporting might be improved by using MADV_FREE instead of
> MADV_DONTNEED in the hypervisor.
>
> > this feature, above all of them, which one is likely to become the
> > strongest one? From the implementation, you will find it is
> > configurable; users who don't want it can turn it off. Is that not
> > an option?
>
> Well, we have to maintain the feature and sacrifice a page flag. For
> example, do we expect someone explicitly enabling the feature just to
> speed up startup time of an app that consumes a lot of memory? I highly
> doubt it.

In our production environment, there are three main applications with
such a requirement. One is QEMU [creating a VM with an SR-IOV
passthrough device]; the other two are DPDK-related applications, DPDK
OVS and SPDK vhost, which populate their memory at startup for best
performance. For SPDK vhost, we make use of the
VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade, which
is done by killing the old process and starting a new one with the new
binary. In this case, we want the new process to start as quickly as
possible to shorten the service downtime. We really do enable this
feature to speed up their startup time :)

> I'd love to hear opinions of other people. (a lot of people are offline
> until beginning of January, including, well, actually me :) )

OK. I will wait some time for others' feedback. Happy holidays!

thanks!
Liang
> > ===================================================== > > QEMU use 4K pages, THP is off > > round1 round2 round3 > > w/o this patch: 23.5s 24.7s 24.6s > > w/ this patch: 10.2s 10.3s 11.2s > > > > QEMU use 4K pages, THP is on > > round1 round2 round3 > > w/o this patch: 17.9s 14.8s 14.9s > > w/ this patch: 1.9s 1.8s 1.9s > > ===================================================== > > The cost of zeroing pages has to be paid somewhere. You've successfully > moved it out of this path that you can measure. So now you've put it > somewhere that you're not measuring. Why is this a win? Win or not depends on its effect. For our case, it solves the issue that we faced, so it can be thought as a win for us. If others don't have the issue we faced, the result will be different, maybe they will be affected by the side effect of this feature. I think this is your concern behind the question. right? I will try to do more tests and provide more benchmark performance data. > > Speed up kernel routine > > ======================= > > This can’t be guaranteed because we don’t pre zero out all the free pages, > > but is true for most case. It can help to speed up some important system > > call just like fork, which will allocate zero pages for building page > > table. And speed up the process of page fault, especially for huge page > > fault. The POC of Hugetlb free page pre zero out has been done. > > Try kernbench with and without your patch. OK. Thanks for your suggestion! Liang
Liang Li <liliang.opensource@gmail.com> writes: > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > Zero out the page content usually happens when allocating pages with > the flag of __GFP_ZERO, this is a time consuming operation, it makes > the population of a large vma area very slowly. This patch introduce > a new feature for zero out pages before page allocation, it can help > to speed up page allocation with __GFP_ZERO. kzeropaged appears to escape some of the kernel's resource controls, at least if I'm understanding this right. The heavy part of a page fault is moved out of the faulting task's context so the CPU controller can't throttle it. A task that uses these pages can benefit from clearing done by CPUs that it's not allowed to run on. How can it handle these cases?
On Mon, Dec 21, 2020 at 8:25 AM Liang Li <liliang.opensource@gmail.com> wrote: > > The first version can be found at: https://lkml.org/lkml/2020/4/12/42 > > Zero out the page content usually happens when allocating pages with > the flag of __GFP_ZERO, this is a time consuming operation, it makes > the population of a large vma area very slowly. This patch introduce > a new feature for zero out pages before page allocation, it can help > to speed up page allocation with __GFP_ZERO. > > My original intention for adding this feature is to shorten VM > creation time when SR-IOV devicde is attached, it works good and the > VM creation time is reduced by about 90%. > > Creating a VM [64G RAM, 32 CPUs] with GPU passthrough > ===================================================== > QEMU use 4K pages, THP is off > round1 round2 round3 > w/o this patch: 23.5s 24.7s 24.6s > w/ this patch: 10.2s 10.3s 11.2s > > QEMU use 4K pages, THP is on > round1 round2 round3 > w/o this patch: 17.9s 14.8s 14.9s > w/ this patch: 1.9s 1.8s 1.9s > ===================================================== > > Obviously, it can do more than this. We can benefit from this feature > in the flowing case: So I am not sure page reporting is the best thing to base this page zeroing setup on. The idea with page reporting is to essentially act as a leaky bucket and allow the guest to drop memory it isn't using slowly so if it needs to reinflate it won't clash with the applications that need memory. What you are doing here seems far more aggressive in that you are going down to low order pages and sleeping instead of rescheduling for the next time interval. Also I am not sure your SR-IOV creation time test is a good justification for this extra overhead. With your patches applied all you are doing is making use of the free time before the test to do the page zeroing instead of doing it during your test. As such your CPU overhead prior to running the test would be higher and you haven't captured that information. One thing I would be interested in seeing is what is the load this is adding when you are running simple memory allocation/free type tests on the system. For example it might be useful to see what the will-it-scale page_fault1 tests look like with this patch applied versus not applied. I suspect it would be adding some amount of overhead as you have to spend a ton of time scanning all the pages and that will be considerable overhead.
[...] >> I was rather saying that for security it's of little use IMHO. >> Application/VM start up time might be improved by using huge pages (and >> pre-zeroing these). Free page reporting might be improved by using >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. >> >>> this feature, above all of them, which one is likely to become the >>> most strong one? From the implementation, you will find it is >>> configurable, users don't want to use it can turn it off. This is not >>> an option? >> >> Well, we have to maintain the feature and sacrifice a page flag. For >> example, do we expect someone explicitly enabling the feature just to >> speed up startup time of an app that consumes a lot of memory? I highly >> doubt it. > > In our production environment, there are three main applications have such > requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > for best performance, they populate memory when starting up. For SPDK vhost, > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > vhost 'live' upgrade, which is done by killing the old process and > starting a new > one with the new binary. In this case, we want the new process started as quick > as possible to shorten the service downtime. We really enable this feature > to speed up startup time for them :) Thanks for info on the use case! All of these use cases either already use, or could use, huge pages IMHO. It's not your ordinary proprietary gaming app :) This is where pre-zeroing of huge pages could already help. Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... creating a file and pre-zeroing it from another process, or am I missing something important? At least for QEMU this should work AFAIK, where you can just pass the file to be use using memory-backend-file. > >> I'd love to hear opinions of other people. (a lot of people are offline >> until beginning of January, including, well, actually me :) ) > > OK. I will wait some time for others' feedback. Happy holidays! To you too, cheers!
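[Editorial note: a sketch of the userspace-only approach suggested above -- a helper process creates and pre-faults a hugetlbfs (or tmpfs) file ahead of time, and QEMU (or another consumer) later maps the same file as backing memory, e.g. via memory-backend-file. The mount point, file name and size below are invented for the example; it assumes enough huge pages are reserved in the pool, and error handling is minimal.]

/* prepopulate.c - pre-fault a hugetlbfs file so a later consumer
 * (e.g. a VM using that file as RAM backing) does not pay the
 * allocation/zeroing cost at startup. Paths and sizes are illustrative.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/dev/hugepages/vm-ram-0";	/* assumed mount */
	size_t sz = 8UL << 30;				/* 8 GB guest RAM */

	int fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, sz) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	/* MAP_POPULATE pre-faults the whole range now, in this helper,
	 * so the pages are already allocated (and zeroed by the kernel)
	 * when the real consumer maps the same file later. */
	void *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(p, sz);	/* pages stay attached to the file */
	close(fd);
	return 0;
}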
On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: > > [...] > > >> I was rather saying that for security it's of little use IMHO. > >> Application/VM start up time might be improved by using huge pages (and > >> pre-zeroing these). Free page reporting might be improved by using > >> MADV_FREE instead of MADV_DONTNEED in the hypervisor. > >> > >>> this feature, above all of them, which one is likely to become the > >>> most strong one? From the implementation, you will find it is > >>> configurable, users don't want to use it can turn it off. This is not > >>> an option? > >> > >> Well, we have to maintain the feature and sacrifice a page flag. For > >> example, do we expect someone explicitly enabling the feature just to > >> speed up startup time of an app that consumes a lot of memory? I highly > >> doubt it. > > > > In our production environment, there are three main applications have such > > requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > > anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > > for best performance, they populate memory when starting up. For SPDK vhost, > > we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > > vhost 'live' upgrade, which is done by killing the old process and > > starting a new > > one with the new binary. In this case, we want the new process started as quick > > as possible to shorten the service downtime. We really enable this feature > > to speed up startup time for them :) > > Thanks for info on the use case! > > All of these use cases either already use, or could use, huge pages > IMHO. It's not your ordinary proprietary gaming app :) This is where > pre-zeroing of huge pages could already help. You are welcome. For some historical reason, some of our services are not using hugetlbfs, that is why I didn't start with hugetlbfs. > Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... > creating a file and pre-zeroing it from another process, or am I missing > something important? At least for QEMU this should work AFAIK, where you > can just pass the file to be use using memory-backend-file. > If using another process to create a file, we can offload the overhead to another process, and there is no need to pre-zeroing it's content, just populating the memory is enough. If we do it that way, then how to determine the size of the file? it depends on the RAM size of the VM the customer buys. Maybe we can create a file large enough in advance and truncate it to the right size just before the VM is created. Then, how many large files should be created on a host? You will find there are a lot of things that have to be handled properly. I think it's possible to make it work well, but we will transfer the management complexity to up layer components. It's a bad practice to let upper layer components process such low level details which should be handled in the OS layer. > > > >> I'd love to hear opinions of other people. (a lot of people are offline > >> until beginning of January, including, well, actually me :) ) > > > > OK. I will wait some time for others' feedback. Happy holidays! > > To you too, cheers! > I have to work at least two months before the vacation. :( Liang
On Tue 22-12-20 22:42:13, Liang Li wrote:
> > > =====================================================
> > > QEMU uses 4K pages, THP is off
> > >                 round1  round2  round3
> > > w/o this patch: 23.5s   24.7s   24.6s
> > > w/  this patch: 10.2s   10.3s   11.2s
> > >
> > > QEMU uses 4K pages, THP is on
> > >                 round1  round2  round3
> > > w/o this patch: 17.9s   14.8s   14.9s
> > > w/  this patch:  1.9s    1.8s    1.9s
> > > =====================================================
> >
> > The cost of zeroing pages has to be paid somewhere. You've successfully
> > moved it out of this path that you can measure. So now you've put it
> > somewhere that you're not measuring. Why is this a win?
>
> Whether it is a win depends on the effect. For our case, it solves the
> issue we faced, so it is a win for us. For others who don't have that
> issue, the result will be different; maybe they will only see the side
> effects of this feature. I think that is the concern behind your
> question, right? I will try to run more tests and provide more
> benchmark data.

Yes, zeroing memory does have a noticeable overhead, but we cannot
simply allow tasks to spill this overhead over to all other users by
default. So if anything, this would need to be an opt-in feature
configurable by the administrator.
On Mon 21-12-20 11:25:22, Liang Li wrote:
[...]
> Security
> ========
> This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
> boot options", which zeroes out pages asynchronously. For users who
> can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
> brings, this feature provides another choice.

Most of the use cases are about start-up time improvements, IIUC. Have
you tried init_on_free, or would it be prohibitive for your workloads?
> > Whether it is a win depends on the effect. For our case, it solves the
> > issue we faced, so it is a win for us. For others who don't have that
> > issue, the result will be different; maybe they will only see the side
> > effects of this feature. I think that is the concern behind your
> > question, right? I will try to run more tests and provide more
> > benchmark data.
>
> Yes, zeroing memory does have a noticeable overhead, but we cannot
> simply allow tasks to spill this overhead over to all other users by
> default. So if anything, this would need to be an opt-in feature
> configurable by the administrator.
> --
> Michal Hocko
> SUSE Labs

I am aware of the overhead, which is why I added a switch in /sys to
enable or disable the feature dynamically.

Thanks
Liang
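[Editorial note: for illustration only, a runtime switch like the one mentioned above could be exposed as a sysfs attribute along the lines of the sketch below; the attribute name and location are hypothetical and do not necessarily match the actual patch set.]

/* Hypothetical sketch of a /sys knob gating the pre-zeroing worker;
 * the name "prezero_enabled" and its location are made up here.
 */
#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/module.h>
#include <linux/sysfs.h>

static bool prezero_enabled;

static ssize_t enabled_show(struct kobject *kobj,
			    struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%d\n", prezero_enabled);
}

static ssize_t enabled_store(struct kobject *kobj,
			     struct kobj_attribute *attr,
			     const char *buf, size_t count)
{
	bool val;

	if (kstrtobool(buf, &val))
		return -EINVAL;
	/* The worker would check prezero_enabled before scanning the
	 * free lists; disabling it only stops future pre-zeroing. */
	prezero_enabled = val;
	return count;
}

static struct kobj_attribute prezero_attr =
	__ATTR(prezero_enabled, 0644, enabled_show, enabled_store);

static int __init prezero_sysfs_init(void)
{
	/* Exposes /sys/kernel/prezero_enabled in this sketch. */
	return sysfs_create_file(kernel_kobj, &prezero_attr.attr);
}
module_init(prezero_sysfs_init);
MODULE_LICENSE("GPL");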
On Mon, Jan 4, 2021 at 8:56 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 21-12-20 11:25:22, Liang Li wrote: > [...] > > Security > > ======== > > This is a weak version of "introduce init_on_alloc=1 and init_on_free=1 > > boot options", which zero out page in a asynchronous way. For users can't > > tolerate the impaction of 'init_on_alloc=1' or 'init_on_free=1' brings, > > this feauture provide another choice. > > Most of the usecases are about the start up time imporvemtns IIUC. Have > you tried to use init_on_free or this would be prohibitive for your > workloads? > I have not tried yet. 'init_on_free' may help to shorten the start up time. In our use case, we care about both the VM creation time and the VM reboot time[terminate QEMU process first and launch a new one], 'init_on_free' will slow down the termination process and is not helpful for VM reboot. Our aim is to speed up 'VM start up' and not slow down 'VM shut down'. Thanks Liang
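[Editorial note: to make the trade-off above concrete, init_on_alloc pays the zeroing cost on the allocation path, init_on_free pays it on the free path (hence the slower teardown Liang is worried about), and the proposed feature moves it into a background worker. The stand-in below shows where the checks conceptually sit; the types, flag value and bodies are simplified illustrations, not the actual kernel code.]

/* Illustrative stand-in, not kernel code: shows on which path each
 * option pays the zeroing cost that is being discussed.
 */
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE	4096u
#define GFP_ZERO	0x1u		/* placeholder for __GFP_ZERO */

static bool init_on_alloc;		/* "init_on_alloc=1" boot option */
static bool init_on_free;		/* "init_on_free=1" boot option */

struct page { unsigned char data[PAGE_SIZE]; };

/* Allocation side: init_on_alloc (or a __GFP_ZERO request) zeroes
 * here, i.e. on every allocation / first fault -- VM start-up. */
void post_alloc_hook(struct page *page, unsigned int gfp)
{
	if (init_on_alloc || (gfp & GFP_ZERO))
		memset(page->data, 0, PAGE_SIZE);
}

/* Free side: init_on_free zeroes here, i.e. on munmap/process exit --
 * the VM shutdown path that should stay fast. */
void free_pages_prepare(struct page *page)
{
	if (init_on_free)
		memset(page->data, 0, PAGE_SIZE);
}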
> Am 23.12.2020 um 13:12 schrieb Liang Li <liliang324@gmail.com>: > > On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand <david@redhat.com> wrote: >> >> [...] >> >>>> I was rather saying that for security it's of little use IMHO. >>>> Application/VM start up time might be improved by using huge pages (and >>>> pre-zeroing these). Free page reporting might be improved by using >>>> MADV_FREE instead of MADV_DONTNEED in the hypervisor. >>>> >>>>> this feature, above all of them, which one is likely to become the >>>>> most strong one? From the implementation, you will find it is >>>>> configurable, users don't want to use it can turn it off. This is not >>>>> an option? >>>> >>>> Well, we have to maintain the feature and sacrifice a page flag. For >>>> example, do we expect someone explicitly enabling the feature just to >>>> speed up startup time of an app that consumes a lot of memory? I highly >>>> doubt it. >>> >>> In our production environment, there are three main applications have such >>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], >>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, >>> for best performance, they populate memory when starting up. For SPDK vhost, >>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for >>> vhost 'live' upgrade, which is done by killing the old process and >>> starting a new >>> one with the new binary. In this case, we want the new process started as quick >>> as possible to shorten the service downtime. We really enable this feature >>> to speed up startup time for them :) Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? >> >> Thanks for info on the use case! >> >> All of these use cases either already use, or could use, huge pages >> IMHO. It's not your ordinary proprietary gaming app :) This is where >> pre-zeroing of huge pages could already help. > > You are welcome. For some historical reason, some of our services are > not using hugetlbfs, that is why I didn't start with hugetlbfs. > >> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... >> creating a file and pre-zeroing it from another process, or am I missing >> something important? At least for QEMU this should work AFAIK, where you >> can just pass the file to be use using memory-backend-file. >> > If using another process to create a file, we can offload the overhead to > another process, and there is no need to pre-zeroing it's content, just > populating the memory is enough. Right, if non-zero memory can be tolerated (e.g., for vms usually has to). > If we do it that way, then how to determine the size of the file? it depends > on the RAM size of the VM the customer buys. > Maybe we can create a file > large enough in advance and truncate it to the right size just before the > VM is created. Then, how many large files should be created on a host? That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) > You will find there are a lot of things that have to be handled properly. > I think it's possible to make it work well, but we will transfer the > management complexity to up layer components. It's a bad practice to let > upper layer components process such low level details which should be > handled in the OS layer. It‘s bad practice to squeeze things into the kernel that can just be handled on upper layers ;)
> >>> In our production environment, there are three main applications have such > >>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], > >>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, > >>> for best performance, they populate memory when starting up. For SPDK vhost, > >>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for > >>> vhost 'live' upgrade, which is done by killing the old process and > >>> starting a new > >>> one with the new binary. In this case, we want the new process started as quick > >>> as possible to shorten the service downtime. We really enable this feature > >>> to speed up startup time for them :) > > Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? You are right, it works for the SPDK vhost upgrade case. > > >> > >> Thanks for info on the use case! > >> > >> All of these use cases either already use, or could use, huge pages > >> IMHO. It's not your ordinary proprietary gaming app :) This is where > >> pre-zeroing of huge pages could already help. > > > > You are welcome. For some historical reason, some of our services are > > not using hugetlbfs, that is why I didn't start with hugetlbfs. > > > >> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... > >> creating a file and pre-zeroing it from another process, or am I missing > >> something important? At least for QEMU this should work AFAIK, where you > >> can just pass the file to be use using memory-backend-file. > >> > > If using another process to create a file, we can offload the overhead to > > another process, and there is no need to pre-zeroing it's content, just > > populating the memory is enough. > > Right, if non-zero memory can be tolerated (e.g., for vms usually has to). I mean there is no need to pre-zeroing the file content obviously in user space, the kernel will do it when populating the memory. > > If we do it that way, then how to determine the size of the file? it depends > > on the RAM size of the VM the customer buys. > > Maybe we can create a file > > large enough in advance and truncate it to the right size just before the > > VM is created. Then, how many large files should be created on a host? > > That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) It depends on how the scheduling component is designed. Yes, you can put 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on another one. But if one type of them, e.g. 4C8G are sold out, customers can't by more 4C8G VM while there are some free 2C4G VMs, the resource reserved for them can be provided as 4C8G VMs > > You will find there are a lot of things that have to be handled properly. > > I think it's possible to make it work well, but we will transfer the > > management complexity to up layer components. It's a bad practice to let > > upper layer components process such low level details which should be > > handled in the OS layer. > > It‘s bad practice to squeeze things into the kernel that can just be handled on upper layers ;) > You must know there are a lot of functions in the kernel which can be done in userspace. e.g. Some of the device emulations like APIC, vhost-net backend which has userspace implementation. :) Bad or not depends on the benefits the solution brings. 
From the viewpoint of a user space application, the kernel should
provide a high-performance memory management service. That's why I
think it should be done in the kernel.

Thanks
Liang
On 05.01.21 03:14, Liang Li wrote: >>>>> In our production environment, there are three main applications have such >>>>> requirement, one is QEMU [creating a VM with SR-IOV passthrough device], >>>>> anther other two are DPDK related applications, DPDK OVS and SPDK vhost, >>>>> for best performance, they populate memory when starting up. For SPDK vhost, >>>>> we make use of the VHOST_USER_GET/SET_INFLIGHT_FD feature for >>>>> vhost 'live' upgrade, which is done by killing the old process and >>>>> starting a new >>>>> one with the new binary. In this case, we want the new process started as quick >>>>> as possible to shorten the service downtime. We really enable this feature >>>>> to speed up startup time for them :) >> >> Am I wrong or does using hugeltbfs/tmpfs ... i.e., a file not-deleted between shutting down the old instances and firing up the new instance just solve this issue? > > You are right, it works for the SPDK vhost upgrade case. > >> >>>> >>>> Thanks for info on the use case! >>>> >>>> All of these use cases either already use, or could use, huge pages >>>> IMHO. It's not your ordinary proprietary gaming app :) This is where >>>> pre-zeroing of huge pages could already help. >>> >>> You are welcome. For some historical reason, some of our services are >>> not using hugetlbfs, that is why I didn't start with hugetlbfs. >>> >>>> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ... >>>> creating a file and pre-zeroing it from another process, or am I missing >>>> something important? At least for QEMU this should work AFAIK, where you >>>> can just pass the file to be use using memory-backend-file. >>>> >>> If using another process to create a file, we can offload the overhead to >>> another process, and there is no need to pre-zeroing it's content, just >>> populating the memory is enough. >> >> Right, if non-zero memory can be tolerated (e.g., for vms usually has to). > > I mean there is no need to pre-zeroing the file content obviously in user space, > the kernel will do it when populating the memory. > >>> If we do it that way, then how to determine the size of the file? it depends >>> on the RAM size of the VM the customer buys. >>> Maybe we can create a file >>> large enough in advance and truncate it to the right size just before the >>> VM is created. Then, how many large files should be created on a host? >> >> That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) > > It depends on how the scheduling component is designed. Yes, you can put > 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on > another one. But if one type of them, e.g. 4C8G are sold out, customers > can't by more 4C8G VM while there are some free 2C4G VMs, the resource > reserved for them can be provided as 4C8G VMs > 1. You can, just the startup time will be a little slower? E.g., grow pre-allocated 4G file to 8G. 2. Or let's be creative: teach QEMU to construct a single RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you don't go crazy on different VM sizes / size differences. 3. In your example above, you can dynamically rebalance as VMs are getting sold, to make sure you always have "big ones" lying around you can shrink on demand. > > You must know there are a lot of functions in the kernel which can > be done in userspace. e.g. Some of the device emulations like APIC, > vhost-net backend which has userspace implementation. :) > Bad or not depends on the benefits the solution brings. 
> From the viewpoint of a user space application, the kernel should > provide high performance memory management service. That's why > I think it should be done in the kernel. As I expressed a couple of times already, I don't see why using hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient. We really don't *want* complicated things deep down in the mm core if there are reasonable alternatives.
> >> That's mostly already existing scheduling logic, no? (How many VMs
> >> can I put onto a specific machine eventually)
> >
> > It depends on how the scheduling component is designed. Yes, you can put
> > 10 VMs with 4C8G (4 CPUs, 8G RAM) on a host and 20 VMs with 2C4G on
> > another one. But if one type of them, e.g. 4C8G, is sold out, customers
> > can't buy more 4C8G VMs while there are some free 2C4G VMs, although the
> > resources reserved for them could be provided as 4C8G VMs.
>
> 1. You can, just the startup time will be a little slower? E.g., grow
> pre-allocated 4G file to 8G.
>
> 2. Or let's be creative: teach QEMU to construct a single
> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
> don't go crazy on different VM sizes / size differences.
>
> 3. In your example above, you can dynamically rebalance as VMs are
> getting sold, to make sure you always have "big ones" lying around you
> can shrink on demand.

Yes, we can always come up with some ways to make things work. It will
just drive the developers of the upper-layer components crazy :)

> > You must know there are a lot of functions in the kernel which can
> > be done in userspace, e.g. some of the device emulations like the APIC,
> > or the vhost-net backend which has a userspace implementation. :)
> > Bad or not depends on the benefits the solution brings.
> > From the viewpoint of a user space application, the kernel should
> > provide a high-performance memory management service. That's why
> > I think it should be done in the kernel.
>
> As I expressed a couple of times already, I don't see why using
> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

Did I miss something before? I thought you doubted the need for
hugetlbfs free page pre-zeroing. Hugetlbfs is a good choice and is
sufficient.

> We really don't *want* complicated things deep down in the mm core if
> there are reasonable alternatives.

I understand your concern: we should have a sufficient reason to add a
new feature to the kernel, and for this one, its main value is to make
applications' lives easier. Implementing it in hugetlbfs can avoid
adding more complexity to the core MM. I will send out a new revision
and drop the 'buddy free pages pre zero out' part.

Thanks for your suggestion!
Liang
On 05.01.21 11:22, Liang Li wrote: >>>> That‘s mostly already existing scheduling logic, no? (How many vms can I put onto a specific machine eventually) >>> >>> It depends on how the scheduling component is designed. Yes, you can put >>> 10 VMs with 4C8G(4CPU, 8G RAM) on a host and 20 VMs with 2C4G on >>> another one. But if one type of them, e.g. 4C8G are sold out, customers >>> can't by more 4C8G VM while there are some free 2C4G VMs, the resource >>> reserved for them can be provided as 4C8G VMs >>> >> >> 1. You can, just the startup time will be a little slower? E.g., grow >> pre-allocated 4G file to 8G. >> >> 2. Or let's be creative: teach QEMU to construct a single >> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you >> don't go crazy on different VM sizes / size differences. >> >> 3. In your example above, you can dynamically rebalance as VMs are >> getting sold, to make sure you always have "big ones" lying around you >> can shrink on demand. >> > Yes, we can always come up with some ways to make things work. > it will make the developer of the upper layer component crazy :) I'd say that's life in upper layers to optimize special (!) use cases. :) >>> >>> You must know there are a lot of functions in the kernel which can >>> be done in userspace. e.g. Some of the device emulations like APIC, >>> vhost-net backend which has userspace implementation. :) >>> Bad or not depends on the benefits the solution brings. >>> From the viewpoint of a user space application, the kernel should >>> provide high performance memory management service. That's why >>> I think it should be done in the kernel. >> >> As I expressed a couple of times already, I don't see why using >> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient. > > Did I miss something before? I thought you doubt the need for > hugetlbfs free page pre zero out. Hugetlbfs is a good choice and is > sufficient. I remember even suggesting to focus on hugetlbfs during your KVM talk when chatting. Maybe I was not clear before. > >> We really don't *want* complicated things deep down in the mm core if >> there are reasonable alternatives. >> > I understand your concern, we should have sufficient reason to add a new > feature to the kernel. And for this one, it's most value is to make the > application's life is easier. And implementing it in hugetlbfs can avoid > adding more complexity to core MM. Exactly, that's my point. Some people might still disagree with the hugetlbfs approach, but there it's easier to add tunables without affecting the overall system.
The first version can be found at: https://lkml.org/lkml/2020/4/12/42

Zeroing out page content usually happens when allocating pages with the
__GFP_ZERO flag. This is a time-consuming operation and it makes
populating a large VMA very slow. This patch set introduces a new
feature that zeroes out free pages before page allocation, which helps
to speed up page allocation with __GFP_ZERO.

My original intention for adding this feature was to shorten VM creation
time when an SR-IOV device is attached; it works well and the VM
creation time is reduced by about 90%.

Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
=====================================================
QEMU uses 4K pages, THP is off
                round1  round2  round3
w/o this patch: 23.5s   24.7s   24.6s
w/  this patch: 10.2s   10.3s   11.2s

QEMU uses 4K pages, THP is on
                round1  round2  round3
w/o this patch: 17.9s   14.8s   14.9s
w/  this patch:  1.9s    1.8s    1.9s
=====================================================

Obviously, it can do more than this. We can benefit from this feature
in the following cases:

Interactive scenarios
=====================
Shortening application launch time on desktops or mobile phones helps
to improve the user experience. Tests on a server [Intel(R) Xeon(R) CPU
E5-2620 v3 @ 2.40GHz] show that zeroing out 1GB of RAM in the kernel
takes about 200ms, while commonly used applications such as the Firefox
browser or Office consume 100 ~ 300 MB of RAM right after launch. By
pre-zeroing free pages, application launch time could be reduced by
about 20~60ms (can that be visually sensed?). Maybe we can use this
feature to speed up the launch of Android apps (I didn't run any tests
for Android).

Virtualization
==============
Speed up VM creation and shorten guest boot time, especially for the
PCI SR-IOV device passthrough scenario. Compared with some of the
paravirtualization solutions, it is easy to deploy because it is
transparent to the guest and can handle DMA properly in the BIOS stage,
which the paravirtualization solutions can't handle well.

Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
free pages to the VMM, and the VMM unmaps the corresponding host pages
for reclaim. When the guest allocates a page that was just reclaimed,
the host has to allocate a new page and zero it out for the guest; in
this case, pre-zeroed free pages help to speed up the fault-in process
and reduce the performance impact.

Speed up kernel routines
========================
This can't be guaranteed because we don't pre-zero all the free pages,
but it is true for most cases. It can help to speed up some important
system calls such as fork, which allocates zeroed pages for building
page tables, and it speeds up page fault handling, especially huge page
faults. A POC of hugetlb free page pre-zeroing has been done.

Security
========
This is a weak version of "introduce init_on_alloc=1 and init_on_free=1
boot options", which zeroes out pages asynchronously. For users who
can't tolerate the impact that 'init_on_alloc=1' or 'init_on_free=1'
brings, this feature provides another choice.

In the feedback on the first version, cache pollution was the main
concern of the mm folks. On the other hand, this feature is really
helpful for some use cases, so maybe we should let the user decide
whether to use it. A switch has been added under /sys: users who don't
like the feature can turn the switch off, or configure a large batch
size to reduce cache pollution.
To make the whole feature work, support for pre-zeroing free huge pages
should be added to hugetlbfs; I will send another patch for that.

Liang Li (4):
  mm: let user decide page reporting option
  mm: pre zero out free pages to speed up page allocation for __GFP_ZERO
  mm: make page reporing worker works better for low order page
  mm: Add batch size for free page reporting

 drivers/virtio/virtio_balloon.c |   3 +
 include/linux/highmem.h         |  31 +++-
 include/linux/page-flags.h      |  16 +-
 include/linux/page_reporting.h  |   3 +
 include/trace/events/mmflags.h  |   7 +
 mm/Kconfig                      |  10 ++
 mm/Makefile                     |   1 +
 mm/huge_memory.c                |   3 +-
 mm/page_alloc.c                 |   4 +
 mm/page_prezero.c               | 266 ++++++++++++++++++++++++++++++++
 mm/page_prezero.h               |  13 ++
 mm/page_reporting.c             |  49 +++++-
 mm/page_reporting.h             |  16 +-
 13 files changed, 405 insertions(+), 17 deletions(-)
 create mode 100644 mm/page_prezero.c
 create mode 100644 mm/page_prezero.h

Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Liang Li <liliang324@gmail.com>
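[Editorial note: as orientation, the core idea of the series can be summarized by the simplified, self-contained sketch below -- a background worker zeroes pages sitting on the free lists and tags them, and the allocation path skips the clear when a tagged page is handed out for a __GFP_ZERO request. The flag and helper names are invented for the illustration and do not match the patches exactly, which build on the page reporting infrastructure and a new page flag.]

/* Conceptual toy of the idea -- not the real patches.
 * Names (PG_PREZEROED, prezero_worker, ...) are invented here.
 */
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE	4096u
#define GFP_ZERO	0x1u		/* placeholder for __GFP_ZERO */
#define PG_PREZEROED	0x1u		/* invented page flag */

struct page {
	unsigned int flags;
	unsigned char data[PAGE_SIZE];
};

/* Background worker: walk a batch of free pages while the CPU is
 * otherwise idle, zero them and mark them as pre-zeroed. */
void prezero_worker(struct page *free_pages, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		if (free_pages[i].flags & PG_PREZEROED)
			continue;
		memset(free_pages[i].data, 0, PAGE_SIZE);
		free_pages[i].flags |= PG_PREZEROED;
	}
}

/* Allocation path: only pay for clearing if the page was not already
 * zeroed in the background (the flag is consumed either way). */
void prep_new_page(struct page *page, unsigned int gfp)
{
	bool was_zeroed = page->flags & PG_PREZEROED;

	page->flags &= ~PG_PREZEROED;
	if ((gfp & GFP_ZERO) && !was_zeroed)
		memset(page->data, 0, PAGE_SIZE);
}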