Message ID | 20191007075548.12456-1-mhocko@kernel.org (mailing list archive)
---|---
State | New, archived
Series | mm, hugetlb: allow hugepage allocations to excessively reclaim
On 10/7/19 12:55 AM, Michal Hocko wrote:
> From: David Rientjes <rientjes@google.com>
>
> b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction
> may not succeed") has changed the allocator to bail out from the
> allocator early to prevent a potentially excessive memory reclaim.
> __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and
> compaction loop as long as there is a reasonable chance to make forward
> progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the
> INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
>
> The most obviously affected subsystem is hugetlbfs, which allocates
> huge pages based on an admin request (or via admin-configured
> overcommit). I have done a simple test which tries to allocate half of
> the memory for hugetlb pages while the memory is full of a clean page
> cache. This is not an unusual situation because we try to cache as much
> of the memory as possible, and the sysctl/sysfs interface to allocate
> huge pages is there for the flexibility to allocate hugetlb pages at
> any time.
>
> The system has 1GB of RAM and we are requesting 512MB worth of hugetlb
> pages after the memory is prefilled by a clean page cache:
>
> root@test1:~# cat hugetlb_test.sh
>
> set -x
> echo 0 > /proc/sys/vm/nr_hugepages
> echo 3 > /proc/sys/vm/drop_caches
> echo 1 > /proc/sys/vm/compact_memory
> dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
> TS=$(date +%s)
> echo 256 > /proc/sys/vm/nr_hugepages
> cat /proc/sys/vm/nr_hugepages
>
> The results for 2 consecutive runs on clean 5.3:
>
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
> + date +%s
> + TS=1569905284
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
> + date +%s
> + TS=1569905311
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 256
>
> Now with b39d0ee2632d applied:
>
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
> + date +%s
> + TS=1569905516
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 11
> root@test1:~# sh hugetlb_test.sh
> + echo 0
> + echo 3
> + echo 1
> + dd if=/mnt/data/file-1G of=/dev/null bs=4096
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
> + date +%s
> + TS=1569905541
> + echo 256
> + cat /proc/sys/vm/nr_hugepages
> 12
>
> The success rate went down by a factor of 20!
>
> Although hugetlb allocation requests might fail, and it is reasonable
> to expect them to under extremely fragmented memory or when the memory
> is under heavy pressure, the above situation is not that case.
>
> Fix the regression by reverting to the previous behavior for
> __GFP_RETRY_MAYFAIL requests and disable the bail-out heuristic for
> those requests.

Thank you Michal for doing this.

hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
boot, where this may not be as much of an issue. However, I am aware of
at least three use cases where allocations are made after the system has
been up and running for quite some time:
- DB reconfiguration. If sysctl/sysfs fails to get the required number
  of huge pages, the system is rebooted to perform the allocation after
  boot.
- VM provisioning. If unable to get the required number of huge pages,
  fall back to base pages.
- An application that does not preallocate the pool, but rather
  allocates pages at fault time for optimal NUMA locality (see the
  sketch after this message).

In all cases, I would expect b39d0ee2632d to cause regressions and
noticeable behavior changes. My quick/limited testing in [1] was
insufficient. It was also mentioned that if something like b39d0ee2632d
went forward, I would like exemptions for __GFP_RETRY_MAYFAIL requests,
as in this patch.

> [mhocko@suse.com: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

FWIW,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

[1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
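[As an aside, the third use case above can be illustrated with a minimal
userspace sketch. The sizes are hypothetical; it assumes 2MB huge pages
and that hugetlb overcommit is enabled, e.g. via a nonzero
vm.nr_overcommit_hugepages, so that pages can be allocated at fault time
rather than from a preallocated pool.]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(256UL << 21)	/* 512MB, i.e. 256 huge pages of 2MB */

int main(void)
{
	/* Try huge pages first; fall back to base pages on failure,
	 * mirroring the VM provisioning use case above. */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
	}
	/* First touch faults in the backing pages; with an empty
	 * preallocated pool this is where the hugetlb overcommit
	 * allocation, and thus reclaim/compaction, actually happens.
	 * A fault-time allocation failure surfaces as SIGBUS. */
	memset(p, 0, LEN);
	munmap(p, LEN);
	return 0;
}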
On Mon 07-10-19 12:03:30, Mike Kravetz wrote:
> On 10/7/19 12:55 AM, Michal Hocko wrote:
> > From: David Rientjes <rientjes@google.com>
[...]
> Thank you Michal for doing this.
>
> hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
> boot, where this may not be as much of an issue. However, I am aware of
> at least three use cases where allocations are made after the system
> has been up and running for quite some time:
> - DB reconfiguration. If sysctl/sysfs fails to get the required number
>   of huge pages, the system is rebooted to perform the allocation after
>   boot.
> - VM provisioning. If unable to get the required number of huge pages,
>   fall back to base pages.
> - An application that does not preallocate the pool, but rather
>   allocates pages at fault time for optimal NUMA locality.
>
> In all cases, I would expect b39d0ee2632d to cause regressions and
> noticeable behavior changes.

Thanks a lot Mike. This is a very useful addition and I can see Andrew
has already added it to the changelog (thx). The use case I keep hearing
most from the field is the first and the second one.

> My quick/limited testing in [1] was insufficient. It was also mentioned
> that if something like b39d0ee2632d went forward, I would like
> exemptions for __GFP_RETRY_MAYFAIL requests, as in this patch.
>
> > [mhocko@suse.com: reworded changelog]
> > Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Signed-off-by: David Rientjes <rientjes@google.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
>
> FWIW,
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks!

> [1] https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
> --
> Mike Kravetz
On 10/7/19 9:55 AM, Michal Hocko wrote:
> From: David Rientjes <rientjes@google.com>

Nit: the subject is still somewhat misleading IMHO, especially in light
of Mike's responses. I would say "reclaim as needed" instead of
"excessively reclaim". The excessive reclaim behavior in the hugetlb
nr_hugepages setting path was a bug that was addressed by a different
series.

[...]

> [mhocko@suse.com: reworded changelog]
> Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I still believe that using __GFP_NORETRY as needed is a cleaner solution
than a check for pageblock order and __GFP_IO, but that can always be
changed later. This patch does fix the hugetlbfs regression, so

Acked-by: Vlastimil Babka <vbabka@suse.cz>
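[For illustration only, the alternative Vlastimil alludes to might look
roughly like the following in the slowpath. This is a hypothetical
sketch, not the committed code; the exact condition and placement are
assumptions.]

	/* Hypothetical alternative: let callers that want only a
	 * lightweight attempt (e.g. THP faults) pass __GFP_NORETRY,
	 * and key the early bail-out off that explicit flag rather
	 * than inferring intent from
	 * order >= pageblock_order && (gfp_mask & __GFP_IO). */
	if (order > PAGE_ALLOC_COSTLY_ORDER && (gfp_mask & __GFP_NORETRY)) {
		/* Compaction said it is unlikely to succeed and the
		 * caller asked not to retry: fail before any heavy
		 * reclaim. */
		goto nopage;
	}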
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 15c2050c629b..01aa46acee76 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4467,12 +4467,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
+		if (order >= pageblock_order && (gfp_mask & __GFP_IO) &&
+		    !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
 			/*
 			 * If allocating entire pageblock(s) and compaction
 			 * failed because all zones are below low watermarks
 			 * or is prohibited because it recently failed at this
-			 * order, fail immediately.
+			 * order, fail immediately unless the allocator has
+			 * requested compaction and reclaim retry.
 			 *
 			 * Reclaim is
 			 *  - potentially very expensive because zones are far
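[For context on why the __GFP_RETRY_MAYFAIL test exempts hugetlbfs: the
hugetlb pool allocator sets that flag on its buddy allocations, so with
this patch those requests again reach the full reclaim/compaction retry
loop. A simplified paraphrase of the 5.3-era mm/hugetlb.c allocation
path follows; it is illustrative, not verbatim kernel source.]

static struct page *alloc_buddy_huge_page(struct hstate *h,
		gfp_t gfp_mask, int nid, nodemask_t *nmask)
{
	int order = huge_page_order(h);	/* 9 for 2MB pages on x86-64 */
	struct page *page;

	/* __GFP_RETRY_MAYFAIL requests the full reclaim/compaction
	 * retry loop; the hunk above now honors it again instead of
	 * bailing out on the first COMPACT_SKIPPED/COMPACT_DEFERRED. */
	gfp_mask |= __GFP_COMP | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();
	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
	return page;
}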