[4/4] memcg, inode: protect page cache from freeing inode

Message ID 1576582159-5198-5-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series: memcg, inode: protect page cache from freeing inode

Commit Message

Yafang Shao Dec. 17, 2019, 11:29 a.m. UTC
On my server there are some running MEMCGs protected by memory.{min, low},
but I found that the usage of these MEMCGs abruptly became very small,
far below the protection limit. That confused me, and I finally found
the cause was inode stealing.

Once an inode is freed, all the page cache belonging to it is dropped as
well, no matter how many pages it has. So if we intend to protect the
page cache in a memcg, we must protect its host (the inode) first.
Otherwise the memcg protection can easily be bypassed by freeing the
inode, especially if there are big files in the memcg.

The inherent mismatch between memcg and inode is troublesome. One inode
can be shared by different MEMCGs, but that is a very rare case. If an
inode is shared, its page cache pages may be charged to different
MEMCGs. Currently there is no perfect solution for this kind of issue,
but the inode majority-writer ownership switching can help more or less.

Cc: Roman Gushchin <guro@fb.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 fs/inode.c                 |  9 +++++++++
 include/linux/memcontrol.h | 15 +++++++++++++++
 mm/memcontrol.c            | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                |  4 ++++
 4 files changed, 74 insertions(+)
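
For context, the reason freeing an inode takes all of its page cache down
with it is that inode eviction truncates the entire mapping. A simplified
sketch of that step (the real logic lives in evict() in fs/inode.c; this
sketch is not part of the patch):

	static void evict_sketch(struct inode *inode)
	{
		/*
		 * Every page in inode->i_data is removed here, regardless
		 * of any memcg protection covering those pages -- this is
		 * the bypass the series tries to close.
		 */
		truncate_inode_pages_final(&inode->i_data);
	}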

Comments

Dave Chinner Dec. 18, 2019, 2:21 a.m. UTC | #1
On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> [...]
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a..b022447 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	if (!spin_trylock(&inode->i_lock))
>  		return LRU_SKIP;
>  
> +
> +	/* Page protection only works in reclaimer */
> +	if (inode->i_data.nrpages && current->reclaim_state) {
> +		if (mem_cgroup_inode_protected(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			return LRU_ROTATE;

Urk, so after having plumbed the memcg all the way down to the
list_lru walk code so that we only walk inodes in that memcg, we now
have to do a lookup from the inode back to the owner memcg to
determine if we should reclaim it? IOWs, I think the layering here
is all wrong - if memcg info is needed in the shrinker, it should
come from the shrink_control->memcg pointer, not be looked up from
the object being isolated...

i.e. this code should read something like this:

	if (memcg && inode->i_data.nrpages &&
	    !memcg_can_reclaim_inode(memcg, inode)) {
		spin_unlock(&inode->i_lock);
		return LRU_ROTATE;
	}

This code does not need comments because it is obvious what it does,
and it provides a generic hook into inode reclaim for the memcg code
to decide whether the shrinker should reclaim the inode or not.

This is how the memcg code should interact with other shrinkers, too
(e.g. the dentry cache isolation function), so you need to look at
how to make the memcg visible to the lru walker isolation
functions....
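
For illustration only, a minimal sketch of what that hook might look like.
memcg_can_reclaim_inode() is the hypothetical name from the example above,
not an existing kernel function; it simply inverts the protection logic of
this patch's mem_cgroup_inode_protected(), with the memcg handed in by the
caller (ultimately from shrink_control->memcg) instead of being looked up
from the inode:

	/*
	 * Hypothetical sketch: may the shrinker reclaim this inode?
	 * The memcg comes from the caller, not from the inode being
	 * isolated.
	 */
	static bool memcg_can_reclaim_inode(struct mem_cgroup *memcg,
					    struct inode *inode)
	{
		unsigned long protect;

		protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
		if (!protect)
			return true;

		/*
		 * Reclaiming the inode drops inode->i_data.nrpages pages of
		 * page cache; allow that only if usage would still be above
		 * the protection afterwards.
		 */
		return inode->i_data.nrpages + protect < mem_cgroup_size(memcg);
	}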

Cheers,

Dave.
Yafang Shao Dec. 18, 2019, 2:33 a.m. UTC | #2
On Wed, Dec 18, 2019 at 10:21 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> > [...]
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..b022447 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +
> > +     /* Page protection only works in reclaimer */
> > +     if (inode->i_data.nrpages && current->reclaim_state) {
> > +             if (mem_cgroup_inode_protected(inode)) {
> > +                     spin_unlock(&inode->i_lock);
> > +                     return LRU_ROTATE;
>
> Urk, so after having plumbed the memcg all the way down to the
> list_lru walk code so that we only walk inodes in that memcg, we now
> have to do a lookup from the inode back to the owner memcg to
> determine if we should reclaim it? IOWs, I think the layering here
> is all wrong - if memcg info is needed in the shrinker, it should
> come from the shrink_control->memcg pointer, not be looked up from
> the object being isolated...
>

Agree with you that the layering here is not good.
I had tried to use the shrink_control->memcg pointer as an argument or
something else, but I found that it would change lots of code.
I didn't want to change too much code, so I implemented it this way,
although it looks a little strange.

> i.e. this code should read something like this:
>
>         if (memcg && inode->i_data.nrpages &&
>             !memcg_can_reclaim_inode(memcg, inode)) {
>                 spin_unlock(&inode->i_lock);
>                 return LRU_ROTATE;
>         }
>
> This code does not need comments because it is obvious what it does,
> and it provides a generic hook into inode reclaim for the memcg code
> to decide whether the shrinker should reclaim the inode or not.
>
> This is how the memcg code should interact with other shrinkers, too
> (e.g. the dentry cache isolation function), so you need to look at
> how to make the memcg visible to the lru walker isolation
> functions....
>

Thanks for your suggestion.
I will rework it along these lines.

Thanks
Yafang
Roman Gushchin Dec. 18, 2019, 5:53 p.m. UTC | #3
On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> [...]
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index fef457a..b022447 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
>  	if (!spin_trylock(&inode->i_lock))
>  		return LRU_SKIP;
>  
> +
> +	/* Page protection only works in reclaimer */
> +	if (inode->i_data.nrpages && current->reclaim_state) {
> +		if (mem_cgroup_inode_protected(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			return LRU_ROTATE;
> +		}
> +	}

Not directly related to this approach, but I wonder if we should scale down
the size of shrinker lists depending on the memory protection (like we do with
LRU lists)? It won't fix the problem of huge inodes being reclaimed at once
unnecessarily, but it will help scale the memory pressure for protected cgroups.

Thanks!


> [...]
Yafang Shao Dec. 19, 2019, 1:45 a.m. UTC | #4
On Thu, Dec 19, 2019 at 1:53 AM Roman Gushchin <guro@fb.com> wrote:
>
> On Tue, Dec 17, 2019 at 06:29:19AM -0500, Yafang Shao wrote:
> > [...]
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index fef457a..b022447 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -734,6 +734,15 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
> >       if (!spin_trylock(&inode->i_lock))
> >               return LRU_SKIP;
> >
> > +
> > +     /* Page protection only works in reclaimer */
> > +     if (inode->i_data.nrpages && current->reclaim_state) {
> > +             if (mem_cgroup_inode_protected(inode)) {
> > +                     spin_unlock(&inode->i_lock);
> > +                     return LRU_ROTATE;
> > +             }
> > +     }
>
> Not directly related to this approach, but I wonder if we should scale down
> the size of shrinker lists depending on the memory protection (like we do with
> LRU lists)? It won't fix the problem of huge inodes being reclaimed at once
> unnecessarily, but it will help scale the memory pressure for protected cgroups.
>

Similar to what we do in get_scan_count() to calculate how many
pages we should scan?
I guess we should.
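
An illustrative sketch of that direction (shrinker_scan_protected() is an
invented name, not existing kernel code), scaling a shrinker's scan target
by the memcg protection in the same spirit as the LRU scaling in
get_scan_count():

	/*
	 * Hypothetical: scan only the unprotected fraction of a
	 * memcg-aware shrinker list. nr_to_scan counts objects; usage
	 * and protect count pages, so only their ratio is used here.
	 */
	static unsigned long shrinker_scan_protected(struct mem_cgroup *memcg,
						     unsigned long nr_to_scan)
	{
		unsigned long protect = mem_cgroup_protection(memcg,
							memcg->in_low_reclaim);
		unsigned long usage = mem_cgroup_size(memcg);

		if (!protect || !usage)
			return nr_to_scan;	/* no protection in effect */
		if (usage <= protect)
			return 0;		/* fully protected */

		/* objects above the protected amount are fair game */
		return nr_to_scan * (usage - protect) / usage;
	}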

> > [...]

Patch

diff --git a/fs/inode.c b/fs/inode.c
index fef457a..b022447 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -734,6 +734,15 @@  static enum lru_status inode_lru_isolate(struct list_head *item,
 	if (!spin_trylock(&inode->i_lock))
 		return LRU_SKIP;
 
+
+	/* Page protection only works in reclaimer */
+	if (inode->i_data.nrpages && current->reclaim_state) {
+		if (mem_cgroup_inode_protected(inode)) {
+			spin_unlock(&inode->i_lock);
+			return LRU_ROTATE;
+		}
+	}
+
 	/*
 	 * Referenced or dirty inodes are still in use. Give them another pass
 	 * through the LRU as we canot reclaim them now.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1a315c7..21338f0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,6 +247,9 @@  struct mem_cgroup {
 	unsigned int tcpmem_active : 1;
 	unsigned int tcpmem_pressure : 1;
 
+	/* Soft protection will be ignored if it's true */
+	unsigned int in_low_reclaim : 1;
+
 	int under_oom;
 
 	int	swappiness;
@@ -363,6 +366,7 @@  static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
 
 enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 						struct mem_cgroup *memcg);
+unsigned long mem_cgroup_inode_protected(struct inode *inode);
 
 int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 			  gfp_t gfp_mask, struct mem_cgroup **memcgp,
@@ -850,6 +854,11 @@  static inline enum mem_cgroup_protection mem_cgroup_protected(
 	return MEMCG_PROT_NONE;
 }
 
+static inline unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask,
 					struct mem_cgroup **memcgp,
@@ -926,6 +935,12 @@  static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 	return NULL;
 }
 
+static inline struct mem_cgroup *
+mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return NULL;
+}
+
 static inline void mem_cgroup_put(struct mem_cgroup *memcg)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 234370c..efb53f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6355,6 +6355,52 @@  enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
 }
 
 /**
+ * Once an inode is freed, all the page cache belonging to it is dropped
+ * as well, even if that is a lot of page cache. So if we intend to
+ * protect page cache in a memcg, we must protect its host (the inode)
+ * first. Otherwise the memory usage can drop abruptly if there are big
+ * files in this memcg. IOW the memcg protection can easily be bypassed
+ * by freeing the inode. We should prevent that.
+ * The inherent mismatch between memcg and inode is troublesome. One
+ * inode can be shared by different MEMCGs, but that is a very rare
+ * case. If an inode is shared, its page cache pages may be charged to
+ * different MEMCGs. Currently there is no perfect solution for this
+ * kind of issue, but the inode majority-writer ownership switching can
+ * help more or less.
+ */
+unsigned long mem_cgroup_inode_protected(struct inode *inode)
+{
+	unsigned long cgroup_size;
+	unsigned long protect = 0;
+	struct bdi_writeback *wb;
+	struct mem_cgroup *memcg;
+
+	wb = inode_to_wb(inode);
+	if (!wb)
+		goto out;
+
+	memcg = mem_cgroup_from_css(wb->memcg_css);
+	if (!memcg || memcg == root_mem_cgroup)
+		goto out;
+
+	protect = mem_cgroup_protection(memcg, memcg->in_low_reclaim);
+	if (!protect)
+		goto out;
+
+	cgroup_size = mem_cgroup_size(memcg);
+	/*
+	 * Don't need to protect this inode, if the usage is still above
+	 * the limit after reclaiming this inode and its belonging page
+	 * caches.
+	 */
+	if (inode->i_data.nrpages + protect < cgroup_size)
+		protect = 0;
+
+out:
+	return protect;
+}
+
+/**
  * mem_cgroup_try_charge - try charging a page
  * @page: page to charge
  * @mm: mm context of the victim
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c4c2da..1cc7fc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2666,6 +2666,7 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 				sc->memcg_low_skipped = 1;
 				continue;
 			}
+			memcg->in_low_reclaim = 1;
 			memcg_memory_event(memcg, MEMCG_LOW);
 			break;
 		case MEMCG_PROT_NONE:
@@ -2693,6 +2694,9 @@  static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
 		shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
 			    sc->priority);
 
+		if (memcg->in_low_reclaim)
+			memcg->in_low_reclaim = 0;
+
 		/* Record the group's reclaim efficiency */
 		vmpressure(sc->gfp_mask, memcg, false,
 			   sc->nr_scanned - scanned,
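
To make the final check in mem_cgroup_inode_protected() concrete, here is a
worked example with illustrative numbers (4 KiB pages; nrpages, protect and
cgroup_size are all counted in pages):

	/*
	 * protect (memory.min)        = 262144 pages (1 GiB)
	 * cgroup_size (current usage) = 393216 pages (1.5 GiB)
	 *
	 * inode A: nrpages = 51200 pages (200 MiB)
	 *   51200 + 262144 = 313344 < 393216
	 *   -> protect is reset to 0: usage stays above the protection
	 *      even after this inode's page cache is dropped, so
	 *      inode_lru_isolate() may reclaim the inode.
	 *
	 * inode B: nrpages = 153600 pages (600 MiB)
	 *   153600 + 262144 = 415744 >= 393216
	 *   -> protect stays nonzero: dropping this inode would pull
	 *      usage below the protection, so the inode is rotated
	 *      (LRU_ROTATE) instead of being freed.
	 */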