Message ID | 20220711075225.15687-1-mlombard@redhat.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | mm: prevent page_frag_alloc() from corrupting the memory | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
Tested with this kernel module: http://bsdbackstore.eu/misc/oomk/ It requires 2 parameters: the first one is the amount of memory you want to allocate with page_frag_alloc(), the second one is the size of the fragment I tested it on a machine with ~7Gb of free memory. Without the patch: ------------------------------------------------- 3Gb of memory will be used with frag size = 1024 byte. No issue: #insmod oomk.ko memory_size_gb=3 fragsize=1024 [ 177.875107] Test begins, memory size = 3 fragsize = 1024 [ 177.974538] Test completed! 10 Gb of memory, 1024 byte frag. page allocation failure but the kernel handles it and doesn't crash: #insmod oomk.ko memory_size_gb=10 fragsize=1024 [ 215.104801] Test begins, memory size = 10 fragsize = 1024 [ 215.227854] insmod: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0 [ 215.230231] CPU: 1 PID: 1738 Comm: insmod Kdump: loaded Tainted: G OE --------- --- 5.14.0-124.kpq0.el9.x86_64 #1 [ 215.232344] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 215.233523] Call Trace: [ 215.234001] dump_stack_lvl+0x34/0x44 [ 215.234894] warn_alloc+0x134/0x160 [ 215.235592] __alloc_pages_slowpath.constprop.0+0x809/0x840 [ 215.236687] ? get_page_from_freelist+0xc6/0x500 [ 215.237569] __alloc_pages+0x1fa/0x230 [ 215.238381] page_frag_alloc_align+0x16c/0x1a0 [...] [ 215.315722] allocation number 7379888 failed! [ 215.426227] Test completed! 10Gb, 4097 byte frag. Kernel crashes: #insmod oomk.ko memory_size_gb=10 fragsize=4097 [ 623.461505] BUG: Bad page state in process insmod pfn:10a80c [ 623.462634] page:000000000654dc14 refcount:0 mapcount:0 mapping:000000007a56d6cd index:0x0 pfn:0x10a80c [ 623.464401] memcg:ffff900343a5b501 [ 623.465058] aops:0xffff9003409e5d38 with invalid host inode 00003524480055f0 [ 623.466394] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff) [ 623.467632] raw: 0017ffffc0000000 dead000000000100 dead000000000122 ffff900346cf2900 [ 623.469069] raw: 0000000000000000 0000000000100010 00000000ffffffff ffff900343a5b501 [ 623.470521] page dumped because: page still charged to cgroup [...] [ 626.632838] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI [ 626.633913] ------------[ cut here ]------------ [ 626.639981] CPU: 0 PID: 722 Comm: agetty Kdump: loaded Tainted: G B OE --------- --- 5.14.0-124.kpq0.el9.x86_64 #1 [ 626.640923] WARNING: CPU: 1 PID: 22 at mm/slub.c:4566 __ksize+0xc4/0xe0 [ 626.645018] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 626.645021] RIP: 0010:___slab_alloc+0x1b7/0x5c0 ------------------------------------------ With the patch the kernel doesn't crash: #insmod oomk.ko memory_size_gb=10 fragsize=4097 [ 4859.358496] Test begins, memory size = 10 fragsize = 4097 [ 4859.459674] allocation number 607754 failed! [ 4859.495489] Test completed! #insmod oomk.ko memory_size_gb=10 fragsize=40000 [ 8428.021491] Test begins, memory size = 10 fragsize = 40000 [ 8428.024308] allocation number 0 failed! [ 8428.025709] Test completed! Maurizio
On Mon, Jul 11, 2022 at 12:52 AM Maurizio Lombardi <mlombard@redhat.com> wrote: > > A number of drivers call page_frag_alloc() with a > fragment's size > PAGE_SIZE. > In low memory conditions, __page_frag_cache_refill() may fail the order 3 > cache allocation and fall back to order 0; > If this happens, the cache will be smaller than the fragment, causing > memory corruptions. > > Prevent this from happening by checking if the newly allocated cache > is large enough for the fragment; if not, the allocation will fail > and page_frag_alloc() will return NULL. > > Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> > --- > mm/page_alloc.c | 11 ++++++++--- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e008a3df0485..7fb000d7e90c 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5611,12 +5611,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc, > /* if size can vary use size else just use PAGE_SIZE */ > size = nc->size; > #endif > - /* OK, page count is 0, we can safely set it */ > - set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); > - > /* reset page count bias and offset to start of new frag */ > nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; > offset = size - fragsz; > + if (unlikely(offset < 0)) { > + free_the_page(page, compound_order(page)); > + nc->va = NULL; > + return NULL; > + } > + > + /* OK, page count is 0, we can safely set it */ > + set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); > } > > nc->pagecnt_bias--; Rather than forcing us to free the page it might be better to move the lines getting the size and computing the offset to the top of the "if (unlikely(offset < 0)) {" block. Then instead of freeing the page we could just return NULL and don't have to change the value of any fields in the page_frag_cache. That way a driver performing bad requests can't force us to start allocating and freeing pages like mad by repeatedly flushing the cache.
po 11. 7. 2022 v 17:34 odesílatel Alexander Duyck <alexander.duyck@gmail.com> napsal: > > Rather than forcing us to free the page it might be better to move the > lines getting the size and computing the offset to the top of the "if > (unlikely(offset < 0)) {" block. Then instead of freeing the page we > could just return NULL and don't have to change the value of any > fields in the page_frag_cache. > > That way a driver performing bad requests can't force us to start > allocating and freeing pages like mad by repeatedly flushing the > cache. > I understand. On the other hand, if we free the cache page then the next time __page_frag_cache_refill() runs it may be successful at allocating the order=3 cache, the normal page_frag_alloc() behaviour will therefore be restored. Maurizio
On Mon, Jul 11, 2022 at 9:17 AM Maurizio Lombardi <mlombard@redhat.com> wrote: > > po 11. 7. 2022 v 17:34 odesílatel Alexander Duyck > <alexander.duyck@gmail.com> napsal: > > > > Rather than forcing us to free the page it might be better to move the > > lines getting the size and computing the offset to the top of the "if > > (unlikely(offset < 0)) {" block. Then instead of freeing the page we > > could just return NULL and don't have to change the value of any > > fields in the page_frag_cache. > > > > That way a driver performing bad requests can't force us to start > > allocating and freeing pages like mad by repeatedly flushing the > > cache. > > > > I understand. On the other hand, if we free the cache page then the > next time __page_frag_cache_refill() runs it may be successful > at allocating the order=3 cache, the normal page_frag_alloc() behaviour will > therefore be restored. That is a big "maybe". My concern is that it will actually make memory pressure worse by forcing us to reduce the number of uses for a lower order page. One bad actor will have us flushing memory like mad so a guy expecting a small fragment may end up allocating 32K pages because someone else is trying to allocate them. I recommend we do not optimize for a case which this code was not designed for. Try to optimize for the standard case that most of the drivers are using. These drivers that are allocating higher order pages worth of memory should really be using alloc_pages. Using this to allocate pages over 4K in size is just a waste since they are not likely to see page reuse which is what this code expects to see.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e008a3df0485..7fb000d7e90c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5611,12 +5611,17 @@ void *page_frag_alloc_align(struct page_frag_cache *nc, /* if size can vary use size else just use PAGE_SIZE */ size = nc->size; #endif - /* OK, page count is 0, we can safely set it */ - set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); - /* reset page count bias and offset to start of new frag */ nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; offset = size - fragsz; + if (unlikely(offset < 0)) { + free_the_page(page, compound_order(page)); + nc->va = NULL; + return NULL; + } + + /* OK, page count is 0, we can safely set it */ + set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1); } nc->pagecnt_bias--;
A number of drivers call page_frag_alloc() with a fragment's size > PAGE_SIZE. In low memory conditions, __page_frag_cache_refill() may fail the order 3 cache allocation and fall back to order 0; If this happens, the cache will be smaller than the fragment, causing memory corruptions. Prevent this from happening by checking if the newly allocated cache is large enough for the fragment; if not, the allocation will fail and page_frag_alloc() will return NULL. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> --- mm/page_alloc.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)