Message ID | 20201103193239.1807-1-dongli.zhang@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/1] mm: avoid re-using pfmemalloc page in page_frag_alloc() | expand |
On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote: > The ethernet driver may allocates skb (and skb->data) via napi_alloc_skb(). > This ends up to page_frag_alloc() to allocate skb->data from > page_frag_cache->va. > > During the memory pressure, page_frag_cache->va may be allocated as > pfmemalloc page. As a result, the skb->pfmemalloc is always true as > skb->data is from page_frag_cache->va. The skb will be dropped if the > sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour > under memory pressure. > > However, once kernel is not under memory pressure any longer (suppose large > amount of memory pages are just reclaimed), the page_frag_alloc() may still > re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a > result, the skb->pfmemalloc is always true unless page_frag_cache->va is > re-allocated, even the kernel is not under memory pressure any longer. > > Here is how kernel runs into issue. > > 1. The kernel is under memory pressure and allocation of > PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, > the pfmemalloc page is allocated for page_frag_cache->va. > > 2: All skb->data from page_frag_cache->va (pfmemalloc) will have > skb->pfmemalloc=true. The skb will always be dropped by sock without > SOCK_MEMALLOC. This is an expected behaviour. > > 3. Suppose a large amount of pages are reclaimed and kernel is not under > memory pressure any longer. We expect skb->pfmemalloc drop will not happen. > > 4. Unfortunately, page_frag_alloc() does not proactively re-allocate > page_frag_alloc->va and will always re-use the prior pfmemalloc page. The > skb->pfmemalloc is always true even kernel is not under memory pressure any > longer. > > Therefore, this patch always checks and tries to avoid re-using the > pfmemalloc page for page_frag_alloc->va. > > Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> > Cc: Bert Barbe <bert.barbe@oracle.com> > Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> > Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> > Cc: Manjunath Patil <manjunath.b.patil@oracle.com> > Cc: Joe Jin <joe.jin@oracle.com> > Cc: SRINIVAS <srinivas.eeda@oracle.com> > Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> > --- > mm/page_alloc.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 23f5066bd4a5..291df2f9f8f3 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc, > struct page *page; > int offset; > > + /* > + * Try to avoid re-using pfmemalloc page because kernel may already > + * run out of the memory pressure situation at any time. > + */ > + if (unlikely(nc->va && nc->pfmemalloc)) { > + page = virt_to_page(nc->va); > + __page_frag_cache_drain(page, nc->pagecnt_bias); > + nc->va = NULL; > + } I think this is the wrong way to solve this problem. Instead, we should use up this page, but refuse to recycle it. How about something like this (not even compile tested): +++ b/mm/page_alloc.c @@ -5139,6 +5139,10 @@ void *page_frag_alloc(struct page_frag_cache *nc, if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) goto refill; + if (nc->pfmemalloc) { + free_the_page(page); + goto refill; + } #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) /* if size can vary use size else just use PAGE_SIZE */
Hi Matthew, On 11/3/20 12:35 PM, Matthew Wilcox wrote: > On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote: >> The ethernet driver may allocates skb (and skb->data) via napi_alloc_skb(). >> This ends up to page_frag_alloc() to allocate skb->data from >> page_frag_cache->va. >> >> During the memory pressure, page_frag_cache->va may be allocated as >> pfmemalloc page. As a result, the skb->pfmemalloc is always true as >> skb->data is from page_frag_cache->va. The skb will be dropped if the >> sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour >> under memory pressure. >> >> However, once kernel is not under memory pressure any longer (suppose large >> amount of memory pages are just reclaimed), the page_frag_alloc() may still >> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a >> result, the skb->pfmemalloc is always true unless page_frag_cache->va is >> re-allocated, even the kernel is not under memory pressure any longer. >> >> Here is how kernel runs into issue. >> >> 1. The kernel is under memory pressure and allocation of >> PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, >> the pfmemalloc page is allocated for page_frag_cache->va. >> >> 2: All skb->data from page_frag_cache->va (pfmemalloc) will have >> skb->pfmemalloc=true. The skb will always be dropped by sock without >> SOCK_MEMALLOC. This is an expected behaviour. >> >> 3. Suppose a large amount of pages are reclaimed and kernel is not under >> memory pressure any longer. We expect skb->pfmemalloc drop will not happen. >> >> 4. Unfortunately, page_frag_alloc() does not proactively re-allocate >> page_frag_alloc->va and will always re-use the prior pfmemalloc page. The >> skb->pfmemalloc is always true even kernel is not under memory pressure any >> longer. >> >> Therefore, this patch always checks and tries to avoid re-using the >> pfmemalloc page for page_frag_alloc->va. >> >> Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> >> Cc: Bert Barbe <bert.barbe@oracle.com> >> Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> >> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> >> Cc: Manjunath Patil <manjunath.b.patil@oracle.com> >> Cc: Joe Jin <joe.jin@oracle.com> >> Cc: SRINIVAS <srinivas.eeda@oracle.com> >> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> >> --- >> mm/page_alloc.c | 10 ++++++++++ >> 1 file changed, 10 insertions(+) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 23f5066bd4a5..291df2f9f8f3 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc, >> struct page *page; >> int offset; >> >> + /* >> + * Try to avoid re-using pfmemalloc page because kernel may already >> + * run out of the memory pressure situation at any time. >> + */ >> + if (unlikely(nc->va && nc->pfmemalloc)) { >> + page = virt_to_page(nc->va); >> + __page_frag_cache_drain(page, nc->pagecnt_bias); >> + nc->va = NULL; >> + } > > I think this is the wrong way to solve this problem. Instead, we should > use up this page, but refuse to recycle it. How about something like this (not even compile tested): Thank you very much for the feedback. Yes, the option is to use the same page until it is used up (offset < 0). Instead of recycling it, the kernel free it and allocate new one. This depends on whether we will tolerate the packet drop until this page is used up. For virtio-net, the payload (skb->data) is of size 128-byte. The padding and alignment will finally make it as 512-byte. Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped before the page is used up. Dongli Zhang > > +++ b/mm/page_alloc.c > @@ -5139,6 +5139,10 @@ void *page_frag_alloc(struct page_frag_cache *nc, > > if (!page_ref_sub_and_test(page, nc->pagecnt_bias)) > goto refill; > + if (nc->pfmemalloc) { > + free_the_page(page); > + goto refill; > + } > > #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) > /* if size can vary use size else just use PAGE_SIZE */ >
On Tue, Nov 03, 2020 at 12:57:33PM -0800, Dongli Zhang wrote: > On 11/3/20 12:35 PM, Matthew Wilcox wrote: > > On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote: > >> However, once kernel is not under memory pressure any longer (suppose large > >> amount of memory pages are just reclaimed), the page_frag_alloc() may still > >> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a > >> result, the skb->pfmemalloc is always true unless page_frag_cache->va is > >> re-allocated, even the kernel is not under memory pressure any longer. > >> + /* > >> + * Try to avoid re-using pfmemalloc page because kernel may already > >> + * run out of the memory pressure situation at any time. > >> + */ > >> + if (unlikely(nc->va && nc->pfmemalloc)) { > >> + page = virt_to_page(nc->va); > >> + __page_frag_cache_drain(page, nc->pagecnt_bias); > >> + nc->va = NULL; > >> + } > > > > I think this is the wrong way to solve this problem. Instead, we should > > use up this page, but refuse to recycle it. How about something like this (not even compile tested): > > Thank you very much for the feedback. Yes, the option is to use the same page > until it is used up (offset < 0). Instead of recycling it, the kernel free it > and allocate new one. > > This depends on whether we will tolerate the packet drop until this page is used up. > > For virtio-net, the payload (skb->data) is of size 128-byte. The padding and > alignment will finally make it as 512-byte. > > Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped > before the page is used up. My thinking is that if the kernel is under memory pressure then freeing the page and allocating a new one is likely to put even more strain on the memory allocator, so we want to do this "soon", rather than at each allocation. Thanks for providing the numbers. Do you think that dropping (up to) 7 packets is acceptable? We could also do something like ... if (unlikely(nc->pfmemalloc)) { page = alloc_page(GFP_NOWAIT | __GFP_NOWARN); if (page) nc->pfmemalloc = 0; put_page(page); } to test if the memory allocator has free pages at the moment. Not sure whether that's a good idea or not -- hopefully you have a test environment set up where you can reproduce this condition on demand and determine which of these three approaches is best!
On 11/3/20 1:15 PM, Matthew Wilcox wrote: > On Tue, Nov 03, 2020 at 12:57:33PM -0800, Dongli Zhang wrote: >> On 11/3/20 12:35 PM, Matthew Wilcox wrote: >>> On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote: >>>> However, once kernel is not under memory pressure any longer (suppose large >>>> amount of memory pages are just reclaimed), the page_frag_alloc() may still >>>> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a >>>> result, the skb->pfmemalloc is always true unless page_frag_cache->va is >>>> re-allocated, even the kernel is not under memory pressure any longer. >>>> + /* >>>> + * Try to avoid re-using pfmemalloc page because kernel may already >>>> + * run out of the memory pressure situation at any time. >>>> + */ >>>> + if (unlikely(nc->va && nc->pfmemalloc)) { >>>> + page = virt_to_page(nc->va); >>>> + __page_frag_cache_drain(page, nc->pagecnt_bias); >>>> + nc->va = NULL; >>>> + } >>> >>> I think this is the wrong way to solve this problem. Instead, we should >>> use up this page, but refuse to recycle it. How about something like this (not even compile tested): >> >> Thank you very much for the feedback. Yes, the option is to use the same page >> until it is used up (offset < 0). Instead of recycling it, the kernel free it >> and allocate new one. >> >> This depends on whether we will tolerate the packet drop until this page is used up. >> >> For virtio-net, the payload (skb->data) is of size 128-byte. The padding and >> alignment will finally make it as 512-byte. >> >> Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped >> before the page is used up. > > My thinking is that if the kernel is under memory pressure then freeing > the page and allocating a new one is likely to put even more strain > on the memory allocator, so we want to do this "soon", rather than at > each allocation. > > Thanks for providing the numbers. Do you think that dropping (up to) > 7 packets is acceptable?> > We could also do something like ... > > if (unlikely(nc->pfmemalloc)) { > page = alloc_page(GFP_NOWAIT | __GFP_NOWARN); > if (page) > nc->pfmemalloc = 0; > put_page(page); > } > > to test if the memory allocator has free pages at the moment. Not sure > whether that's a good idea or not -- hopefully you have a test environment > set up where you can reproduce this condition on demand and determine > which of these three approaches is best! > From mm's perspective, we expect to reduce the number of page allocation (especially under memory pressure). From networking's perspective, we expect to reduce the number of skb drop. That's why I CCed netdev folks (including David and Jakub), although the patch is for mm/page_alloc.c. The page_frag_alloc() is primarily used by networking and nvme-tcp. Unfortunately, so far I do not have the env to reproduce. I reproduced with a patch to fail page allocation and set nc->pfmemalloc on purpose. From mm's perspective, I think to use up the page is a good option. Indeed tt is system administrator's duty to avoid memory pressure, in order to avoid the extra packet drops. Dongli Zhang
>Thanks for providing the numbers. Do you think that dropping (up to) >7 packets is acceptable? net.ipv4.tcp_syn_retries = 6 tcp clients wouldn't even get that far leading to connect establish issues. -rama On Tue, Nov 03, 2020 at 09:15:41PM +0000, Matthew Wilcox wrote: >On Tue, Nov 03, 2020 at 12:57:33PM -0800, Dongli Zhang wrote: >> On 11/3/20 12:35 PM, Matthew Wilcox wrote: >> > On Tue, Nov 03, 2020 at 11:32:39AM -0800, Dongli Zhang wrote: >> >> However, once kernel is not under memory pressure any longer (suppose large >> >> amount of memory pages are just reclaimed), the page_frag_alloc() may still >> >> re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a >> >> result, the skb->pfmemalloc is always true unless page_frag_cache->va is >> >> re-allocated, even the kernel is not under memory pressure any longer. >> >> + /* >> >> + * Try to avoid re-using pfmemalloc page because kernel may already >> >> + * run out of the memory pressure situation at any time. >> >> + */ >> >> + if (unlikely(nc->va && nc->pfmemalloc)) { >> >> + page = virt_to_page(nc->va); >> >> + __page_frag_cache_drain(page, nc->pagecnt_bias); >> >> + nc->va = NULL; >> >> + } >> > >> > I think this is the wrong way to solve this problem. Instead, we should >> > use up this page, but refuse to recycle it. How about something like this (not even compile tested): >> >> Thank you very much for the feedback. Yes, the option is to use the same page >> until it is used up (offset < 0). Instead of recycling it, the kernel free it >> and allocate new one. >> >> This depends on whether we will tolerate the packet drop until this page is used up. >> >> For virtio-net, the payload (skb->data) is of size 128-byte. The padding and >> alignment will finally make it as 512-byte. >> >> Therefore, for virtio-net, we will have at most 4096/512-1=7 packets dropped >> before the page is used up. > >My thinking is that if the kernel is under memory pressure then freeing >the page and allocating a new one is likely to put even more strain >on the memory allocator, so we want to do this "soon", rather than at >each allocation. > >Thanks for providing the numbers. Do you think that dropping (up to) >7 packets is acceptable? > >We could also do something like ... > > if (unlikely(nc->pfmemalloc)) { > page = alloc_page(GFP_NOWAIT | __GFP_NOWARN); > if (page) > nc->pfmemalloc = 0; > put_page(page); > } > >to test if the memory allocator has free pages at the moment. Not sure >whether that's a good idea or not -- hopefully you have a test environment >set up where you can reproduce this condition on demand and determine >which of these three approaches is best!
On 11/4/20 2:16 AM, Rama Nichanamatlu wrote: >> Thanks for providing the numbers. Do you think that dropping (up to) >> 7 packets is acceptable? > > net.ipv4.tcp_syn_retries = 6 > > tcp clients wouldn't even get that far leading to connect establish issues. This does not really matter. If host was under memory pressure, dropping a few packets is really not an issue. Please do not add expensive checks in fast path, just to "not drop a packet" even if the world is collapsing. Also consider that NIC typically have thousands of pre-allocated page/frags for their RX ring buffers, they might all have pfmemalloc set, so we are speaking of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags. If we want to solve this issue more generically, we would have to try to copy data into a non pfmemalloc frag instead of dropping skb that had frags allocated minutes ago under memory pressure. This copy could happen in core networking stack, but this seems adding more pressure to mm layer under pressure.
On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote: > On 11/4/20 2:16 AM, Rama Nichanamatlu wrote: > >> Thanks for providing the numbers. Do you think that dropping (up to) > >> 7 packets is acceptable? > > > > net.ipv4.tcp_syn_retries = 6 > > > > tcp clients wouldn't even get that far leading to connect establish issues. > > This does not really matter. If host was under memory pressure, > dropping a few packets is really not an issue. > > Please do not add expensive checks in fast path, just to "not drop a packet" > even if the world is collapsing. Right, that was my first patch -- to only recheck if we're about to reuse the page. Do you think that's acceptable, or is that still too close to the fast path? > Also consider that NIC typically have thousands of pre-allocated page/frags > for their RX ring buffers, they might all have pfmemalloc set, so we are speaking > of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags. > > If we want to solve this issue more generically, we would have to try > to copy data into a non pfmemalloc frag instead of dropping skb that > had frags allocated minutes ago under memory pressure. I don't think we need to copy anything. We need to figure out if the system is still under memory pressure, and if not, we can clear the pfmemalloc bit on the frag, as in my second patch. The 'least change' way of doing that is to try to allocate a page, but the VM could export a symbol that says "we're not under memory pressure any more". Did you want to move checking that into the networking layer, or do you want to keep it in the pagefrag allocator?
On 11/4/20 1:36 PM, Matthew Wilcox wrote: > On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote: >> On 11/4/20 2:16 AM, Rama Nichanamatlu wrote: >>>> Thanks for providing the numbers. Do you think that dropping (up to) >>>> 7 packets is acceptable? >>> >>> net.ipv4.tcp_syn_retries = 6 >>> >>> tcp clients wouldn't even get that far leading to connect establish issues. >> >> This does not really matter. If host was under memory pressure, >> dropping a few packets is really not an issue. >> >> Please do not add expensive checks in fast path, just to "not drop a packet" >> even if the world is collapsing. > > Right, that was my first patch -- to only recheck if we're about to > reuse the page. Do you think that's acceptable, or is that still too > close to the fast path? I think it is totally acceptable. The same strategy is used in NIC drivers, before recycling a page. If page_is_pfmemalloc() returns true, they simply release the 'problematic'page and attempt a new allocation. ( git grep -n page_is_pfmemalloc -- drivers/net/ethernet/ ) > >> Also consider that NIC typically have thousands of pre-allocated page/frags >> for their RX ring buffers, they might all have pfmemalloc set, so we are speaking >> of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags. >> >> If we want to solve this issue more generically, we would have to try >> to copy data into a non pfmemalloc frag instead of dropping skb that >> had frags allocated minutes ago under memory pressure. > > I don't think we need to copy anything. We need to figure out if the > system is still under memory pressure, and if not, we can clear the > pfmemalloc bit on the frag, as in my second patch. The 'least change' > way of doing that is to try to allocate a page, but the VM could export > a symbol that says "we're not under memory pressure any more". > > Did you want to move checking that into the networking layer, or do you > want to keep it in the pagefrag allocator? I think your proposal is fine, thanks !
On 11/4/20 4:51 AM, Eric Dumazet wrote: > > > On 11/4/20 1:36 PM, Matthew Wilcox wrote: >> On Wed, Nov 04, 2020 at 09:50:30AM +0100, Eric Dumazet wrote: >>> On 11/4/20 2:16 AM, Rama Nichanamatlu wrote: >>>>> Thanks for providing the numbers. Do you think that dropping (up to) >>>>> 7 packets is acceptable? >>>> >>>> net.ipv4.tcp_syn_retries = 6 >>>> >>>> tcp clients wouldn't even get that far leading to connect establish issues. >>> >>> This does not really matter. If host was under memory pressure, >>> dropping a few packets is really not an issue. >>> >>> Please do not add expensive checks in fast path, just to "not drop a packet" >>> even if the world is collapsing. >> >> Right, that was my first patch -- to only recheck if we're about to >> reuse the page. Do you think that's acceptable, or is that still too >> close to the fast path? > > I think it is totally acceptable. > > The same strategy is used in NIC drivers, before recycling a page. > > If page_is_pfmemalloc() returns true, they simply release the 'problematic'page > and attempt a new allocation. > > ( git grep -n page_is_pfmemalloc -- drivers/net/ethernet/ ) While the drivers may implement their own page_frag_cache to manage skb->frags ... ... the skb->data is usually allocated via __netdev_alloc_skb() or napi_alloc_skb(), which end up to the global this_cpu_ptr(&netdev_alloc_cache) or this_cpu_ptr(&napi_alloc_cache.page). > > >> >>> Also consider that NIC typically have thousands of pre-allocated page/frags >>> for their RX ring buffers, they might all have pfmemalloc set, so we are speaking >>> of thousands of packet drops before the RX-ring can be refilled with normal (non pfmemalloc) page/frags. >>> >>> If we want to solve this issue more generically, we would have to try >>> to copy data into a non pfmemalloc frag instead of dropping skb that >>> had frags allocated minutes ago under memory pressure. >> >> I don't think we need to copy anything. We need to figure out if the >> system is still under memory pressure, and if not, we can clear the >> pfmemalloc bit on the frag, as in my second patch. The 'least change' >> way of doing that is to try to allocate a page, but the VM could export >> a symbol that says "we're not under memory pressure any more". >> >> Did you want to move checking that into the networking layer, or do you >> want to keep it in the pagefrag allocator? > > I think your proposal is fine, thanks ! Hi Matthew, are you going to send out the patch to avoid pfmemalloc recycle? Thank you very much! Dongli Zhang
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 23f5066bd4a5..291df2f9f8f3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5075,6 +5075,16 @@ void *page_frag_alloc(struct page_frag_cache *nc, struct page *page; int offset; + /* + * Try to avoid re-using pfmemalloc page because kernel may already + * run out of the memory pressure situation at any time. + */ + if (unlikely(nc->va && nc->pfmemalloc)) { + page = virt_to_page(nc->va); + __page_frag_cache_drain(page, nc->pagecnt_bias); + nc->va = NULL; + } + if (unlikely(!nc->va)) { refill: page = __page_frag_cache_refill(nc, gfp_mask);
The ethernet driver may allocates skb (and skb->data) via napi_alloc_skb(). This ends up to page_frag_alloc() to allocate skb->data from page_frag_cache->va. During the memory pressure, page_frag_cache->va may be allocated as pfmemalloc page. As a result, the skb->pfmemalloc is always true as skb->data is from page_frag_cache->va. The skb will be dropped if the sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour under memory pressure. However, once kernel is not under memory pressure any longer (suppose large amount of memory pages are just reclaimed), the page_frag_alloc() may still re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a result, the skb->pfmemalloc is always true unless page_frag_cache->va is re-allocated, even the kernel is not under memory pressure any longer. Here is how kernel runs into issue. 1. The kernel is under memory pressure and allocation of PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead, the pfmemalloc page is allocated for page_frag_cache->va. 2: All skb->data from page_frag_cache->va (pfmemalloc) will have skb->pfmemalloc=true. The skb will always be dropped by sock without SOCK_MEMALLOC. This is an expected behaviour. 3. Suppose a large amount of pages are reclaimed and kernel is not under memory pressure any longer. We expect skb->pfmemalloc drop will not happen. 4. Unfortunately, page_frag_alloc() does not proactively re-allocate page_frag_alloc->va and will always re-use the prior pfmemalloc page. The skb->pfmemalloc is always true even kernel is not under memory pressure any longer. Therefore, this patch always checks and tries to avoid re-using the pfmemalloc page for page_frag_alloc->va. Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Cc: Bert Barbe <bert.barbe@oracle.com> Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com> Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> Cc: Manjunath Patil <manjunath.b.patil@oracle.com> Cc: Joe Jin <joe.jin@oracle.com> Cc: SRINIVAS <srinivas.eeda@oracle.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> --- mm/page_alloc.c | 10 ++++++++++ 1 file changed, 10 insertions(+)