
[RFC] mm, memcg: use consistent gfp flags during readahead

Message ID 1465301556-26431-1-git-send-email-mhocko@kernel.org (mailing list archive)
State New, archived

Commit Message

Michal Hocko June 7, 2016, 12:12 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Vladimir has noticed that we might declare a memcg OOM even during
readahead because read_pages only uses GFP_KERNEL (with the mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead, which adds __GFP_NORETRY to prevent
OOMs. This gfp mask discrepancy is really unfortunate and easily
fixable. Drop page_cache_alloc_readahead(), which only has one user,
move the gfp_mask logic into readahead_gfp_mask() and propagate this
mask from __do_page_cache_readahead down to read_pages.
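
To illustrate, this is what the two paths currently use (a simplified
sketch of the existing code, not part of the diff below):

  /* __do_page_cache_readahead() allocates readahead pages with: */
  gfp_t gfp = mapping_gfp_mask(mapping) |
              __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;

  /* ...while read_pages() adds them to the page cache (and so charges
   * them to the memcg) with: */
  gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);

so the memcg charge can trigger the OOM path even though the page
allocation itself was careful to avoid it.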

This alone would have only very limited impact as most filesystems
implement ->readpages and the common implementation, mpage_readpages,
uses GFP_KERNEL (with the mapping_gfp restriction) again. We can tell
it to use readahead_gfp_mask instead, as this function is only called
during readahead as well. The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages, but the path with pages !=
NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs follow
a pattern very similar to mpage_readpages, so the same change can be
applied to them as well.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
an alternative solution for the ->readpages part would be to add the
gfp mask as a new argument. This would be a larger change and I am not
even sure it would be much better. Explicit usage of the readahead gfp
mask seems easier to track. If there is general agreement that this is
the proper way to go, I can rework the patch to do so, of course.

Does this make sense?

 fs/btrfs/extent_io.c    |  3 ++-
 fs/cifs/file.c          |  2 +-
 fs/ext4/readpage.c      |  2 +-
 fs/f2fs/data.c          |  3 ++-
 fs/mpage.c              |  2 +-
 fs/orangefs/inode.c     |  2 +-
 include/linux/pagemap.h |  6 +++---
 mm/readahead.c          | 12 ++++++------
 8 files changed, 17 insertions(+), 15 deletions(-)

Comments

Vladimir Davydov June 8, 2016, 1:59 p.m. UTC | #1
On Tue, Jun 07, 2016 at 02:12:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Vladimir has noticed that we might declare a memcg OOM even during
> readahead because read_pages only uses GFP_KERNEL (with the mapping_gfp
> restriction) while __do_page_cache_readahead uses
> page_cache_alloc_readahead, which adds __GFP_NORETRY to prevent
> OOMs. This gfp mask discrepancy is really unfortunate and easily
> fixable. Drop page_cache_alloc_readahead(), which only has one user,
> move the gfp_mask logic into readahead_gfp_mask() and propagate this
> mask from __do_page_cache_readahead down to read_pages.
> 
> This alone would have only very limited impact as most filesystems
> implement ->readpages and the common implementation, mpage_readpages,
> uses GFP_KERNEL (with the mapping_gfp restriction) again. We can tell
> it to use readahead_gfp_mask instead, as this function is only called
> during readahead as well. The same applies to read_cache_pages.
> 
> ext4 has its own ext4_mpage_readpages, but the path with pages !=
> NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs follow
> a pattern very similar to mpage_readpages, so the same change can be
> applied to them as well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
> 
> Hi,
> an alternative solution for the ->readpages part would be to add the
> gfp mask as a new argument. This would be a larger change and I am not
> even sure it would be much better. Explicit usage of the readahead gfp
> mask seems easier to track. If there is general agreement that this is
> the proper way to go, I can rework the patch to do so, of course.
> 
> Does this make sense?
...
> diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> index dc54a4b60eba..c75b66a64982 100644
> --- a/fs/ext4/readpage.c
> +++ b/fs/ext4/readpage.c
> @@ -166,7 +166,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
>  			page = list_entry(pages->prev, struct page, lru);
>  			list_del(&page->lru);
>  			if (add_to_page_cache_lru(page, mapping, page->index,
> -				  mapping_gfp_constraint(mapping, GFP_KERNEL)))
> +				  readahead_gfp_mask(mapping)))
>  				goto next_page;
>  		}
>  

ext4 (at least) might issue other allocations in ->readpages, e.g.
bio_alloc with GFP_KERNEL.

I wonder if it would be better to set a GFP_NOFS context on the
task_struct in read_pages() and handle it in alloc_pages. You've been
planning to do something like this anyway, haven't you?
Michal Hocko June 8, 2016, 2:10 p.m. UTC | #2
On Wed 08-06-16 16:59:00, Vladimir Davydov wrote:
> On Tue, Jun 07, 2016 at 02:12:36PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Vladimir has noticed that we might declare a memcg OOM even during
> > readahead because read_pages only uses GFP_KERNEL (with the mapping_gfp
> > restriction) while __do_page_cache_readahead uses
> > page_cache_alloc_readahead, which adds __GFP_NORETRY to prevent
> > OOMs. This gfp mask discrepancy is really unfortunate and easily
> > fixable. Drop page_cache_alloc_readahead(), which only has one user,
> > move the gfp_mask logic into readahead_gfp_mask() and propagate this
> > mask from __do_page_cache_readahead down to read_pages.
> > 
> > This alone would have only very limited impact as most filesystems
> > implement ->readpages and the common implementation, mpage_readpages,
> > uses GFP_KERNEL (with the mapping_gfp restriction) again. We can tell
> > it to use readahead_gfp_mask instead, as this function is only called
> > during readahead as well. The same applies to read_cache_pages.
> > 
> > ext4 has its own ext4_mpage_readpages, but the path with pages !=
> > NULL can use the same gfp mask. Btrfs, cifs, f2fs and orangefs follow
> > a pattern very similar to mpage_readpages, so the same change can be
> > applied to them as well.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> > 
> > Hi,
> > an alternative solution for the ->readpages part would be to add the
> > gfp mask as a new argument. This would be a larger change and I am not
> > even sure it would be much better. Explicit usage of the readahead gfp
> > mask seems easier to track. If there is general agreement that this is
> > the proper way to go, I can rework the patch to do so, of course.
> > 
> > Does this make sense?
> ...
> > diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
> > index dc54a4b60eba..c75b66a64982 100644
> > --- a/fs/ext4/readpage.c
> > +++ b/fs/ext4/readpage.c
> > @@ -166,7 +166,7 @@ int ext4_mpage_readpages(struct address_space *mapping,
> >  			page = list_entry(pages->prev, struct page, lru);
> >  			list_del(&page->lru);
> >  			if (add_to_page_cache_lru(page, mapping, page->index,
> > -				  mapping_gfp_constraint(mapping, GFP_KERNEL)))
> > +				  readahead_gfp_mask(mapping)))
> >  				goto next_page;
> >  		}
> >  
> 
> ext4 (at least) might issue other allocations in ->readpages, e.g.
> bio_alloc with GFP_KERNEL.

That is true, but the primary point of this patch is to bring the page
cache allocation and the memcg charge into sync. bio_alloc shouldn't get
to the memcg code path AFAICS. That being said, it is not nice that ext4
can go OOM during readahead, but fixing that would be out of scope of
this patch IMHO.

> I wonder if it would be better to set GFP_NOFS context on task_struct in
> read_pages() and handle it in alloc_pages. You've been planning doing
> something like this anyway, haven't you?

I have no idea whether ext4 can safely do a GFP_FS allocation from this
request. Ext4 doesn't clear __GFP_FS in the mapping gfp mask, so it
charges with __GFP_FS by default.
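
For context, mapping_gfp_constraint() is just the intersection of the
mapping's gfp mask with the given one (include/linux/pagemap.h):

  static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                             gfp_t gfp_mask)
  {
          return mapping_gfp_mask(mapping) & gfp_mask;
  }

so a filesystem which never clears __GFP_FS from its mapping gfp mask
keeps charging with __GFP_FS set.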

You are right that I would love to see scoped GFP_NOFS contexts used by
filesystems. I plan to repost my patches to add the API hopefully soon,
but there are other things to settle before I can do that.
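
Just to sketch the idea, such a scoped context would let read_pages()
look roughly like this (the memalloc_nofs_save()/memalloc_nofs_restore()
names are hypothetical, modeled on the existing
memalloc_noio_save()/memalloc_noio_restore() pair):

  static int read_pages(struct address_space *mapping, struct file *filp,
                        struct list_head *pages, unsigned nr_pages)
  {
          /* all allocations in this scope, e.g. bio_alloc() from the
           * ->readpages callback, would implicitly behave as GFP_NOFS */
          unsigned int nofs_flags = memalloc_nofs_save();
          int ret;

          ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);

          memalloc_nofs_restore(nofs_flags);
          return ret;
  }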

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d247fc0eea19..883031d9f35b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4160,7 +4160,8 @@  int extent_readpages(struct extent_io_tree *tree,
 		prefetchw(&page->flags);
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping,
-					page->index, GFP_NOFS)) {
+					page->index,
+					readahead_gfp_mask(mapping))) {
 			put_page(page);
 			continue;
 		}
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index c03d0744648b..60e0167b0c1d 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3380,7 +3380,7 @@  readpages_get_pages(struct address_space *mapping, struct list_head *page_list,
 	struct page *page, *tpage;
 	unsigned int expected_index;
 	int rc;
-	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
+	gfp_t gfp = readahead_gfp_mask(mapping);
 
 	INIT_LIST_HEAD(tmplist);
 
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index dc54a4b60eba..c75b66a64982 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -166,7 +166,7 @@  int ext4_mpage_readpages(struct address_space *mapping,
 			page = list_entry(pages->prev, struct page, lru);
 			list_del(&page->lru);
 			if (add_to_page_cache_lru(page, mapping, page->index,
-				  mapping_gfp_constraint(mapping, GFP_KERNEL)))
+				  readahead_gfp_mask(mapping)))
 				goto next_page;
 		}
 
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5dafb9cef12e..3f914339e4f4 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -928,7 +928,8 @@  static int f2fs_mpage_readpages(struct address_space *mapping,
 			page = list_entry(pages->prev, struct page, lru);
 			list_del(&page->lru);
 			if (add_to_page_cache_lru(page, mapping,
-						  page->index, GFP_KERNEL))
+						  page->index,
+						  readahead_gfp_mask(mapping)))
 				goto next_page;
 		}
 
diff --git a/fs/mpage.c b/fs/mpage.c
index eedc644b78d7..9c11255b0797 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -362,7 +362,7 @@  mpage_readpages(struct address_space *mapping, struct list_head *pages,
 	sector_t last_block_in_bio = 0;
 	struct buffer_head map_bh;
 	unsigned long first_logical_block = 0;
-	gfp_t gfp = mapping_gfp_constraint(mapping, GFP_KERNEL);
+	gfp_t gfp = readahead_gfp_mask(mapping);
 
 	map_bh.b_state = 0;
 	map_bh.b_size = 0;
diff --git a/fs/orangefs/inode.c b/fs/orangefs/inode.c
index 85640e955cde..06a8da75651d 100644
--- a/fs/orangefs/inode.c
+++ b/fs/orangefs/inode.c
@@ -80,7 +80,7 @@  static int orangefs_readpages(struct file *file,
 		if (!add_to_page_cache(page,
 				       mapping,
 				       page->index,
-				       GFP_KERNEL)) {
+				       readahead_gfp_mask(mapping))) {
 			ret = read_one_page(page);
 			gossip_debug(GOSSIP_INODE_DEBUG,
 				"failure adding page to cache, read_one_page returned: %d\n",
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 97354102794d..81363b834900 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -209,10 +209,10 @@  static inline struct page *page_cache_alloc_cold(struct address_space *x)
 	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
 }
 
-static inline struct page *page_cache_alloc_readahead(struct address_space *x)
+static inline gfp_t readahead_gfp_mask(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x) |
-				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN);
+	return mapping_gfp_mask(x) |
+				  __GFP_COLD | __GFP_NORETRY | __GFP_NOWARN;
 }
 
 typedef int filler_t(void *, struct page *);
diff --git a/mm/readahead.c b/mm/readahead.c
index 40be3ae0afe3..7cd032ef17bf 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -89,7 +89,7 @@  int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 		page = lru_to_page(pages);
 		list_del(&page->lru);
 		if (add_to_page_cache_lru(page, mapping, page->index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL))) {
+				readahead_gfp_mask(mapping))) {
 			read_cache_pages_invalidate_page(mapping, page);
 			continue;
 		}
@@ -108,7 +108,7 @@  int read_cache_pages(struct address_space *mapping, struct list_head *pages,
 EXPORT_SYMBOL(read_cache_pages);
 
 static int read_pages(struct address_space *mapping, struct file *filp,
-		struct list_head *pages, unsigned nr_pages)
+		struct list_head *pages, unsigned nr_pages, gfp_t gfp_mask)
 {
 	struct blk_plug plug;
 	unsigned page_idx;
@@ -126,8 +126,7 @@  static int read_pages(struct address_space *mapping, struct file *filp,
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
 		struct page *page = lru_to_page(pages);
 		list_del(&page->lru);
-		if (!add_to_page_cache_lru(page, mapping, page->index,
-				mapping_gfp_constraint(mapping, GFP_KERNEL))) {
+		if (!add_to_page_cache_lru(page, mapping, page->index, gfp_mask)) {
 			mapping->a_ops->readpage(filp, page);
 		}
 		put_page(page);
@@ -159,6 +158,7 @@  int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	int page_idx;
 	int ret = 0;
 	loff_t isize = i_size_read(inode);
+	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 
 	if (isize == 0)
 		goto out;
@@ -180,7 +180,7 @@  int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (page && !radix_tree_exceptional_entry(page))
 			continue;
 
-		page = page_cache_alloc_readahead(mapping);
+		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			break;
 		page->index = page_offset;
@@ -196,7 +196,7 @@  int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	 * will then handle the error.
 	 */
 	if (ret)
-		read_pages(mapping, filp, &page_pool, ret);
+		read_pages(mapping, filp, &page_pool, ret, gfp_mask);
 	BUG_ON(!list_empty(&page_pool));
 out:
 	return ret;