
[02/18] mm: memcontrol: fix theoretical race in charge moving

Message ID 20200420221126.341272-3-hannes@cmpxchg.org (mailing list archive)
State New, archived
Series mm: memcontrol: charge swapin pages on instantiation

Commit Message

Johannes Weiner April 20, 2020, 10:11 p.m. UTC
The move_lock is a per-memcg lock, but the VM accounting code that
needs to acquire it comes from the page and follows page->mem_cgroup
under RCU protection. That means that the page becomes unlocked not
when we drop the move_lock, but when we update page->mem_cgroup. And
that assignment doesn't imply any memory ordering. If that pointer
write gets reordered against the reads of the page state -
page_mapped, PageDirty etc. - the state may change while we rely on
it being stable, and we can end up corrupting the counters.

Place an SMP memory barrier to make sure we're done with all page
state by the time the new page->mem_cgroup becomes visible.

Also replace the open-coded move_lock with a lock_page_memcg() to make
it more obvious what we're serializing against.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)
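
To make the window concrete, here is a small standalone pthread/C11 model of
the interleaving described above. It is an illustration only, not kernel code:
the names (mover, unmapper, nr_file_mapped, move_lock, page_memcg) merely
mirror the kernel objects, and the reader side is simplified - an acquire load
plus an explicit "moving" flag stand in for what the kernel gets from RCU and
the lock_page_memcg() handshake. The atomic_thread_fence() plays the role of
the new smp_mb(); removing it reopens the window in which the mover's
page_mapped read can be satisfied after its pointer store is already visible,
leaving the counters at from=1, to=-1.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct memcg {
	atomic_long nr_file_mapped;
	bool moving;			/* models memcg->moving_account */
	pthread_mutex_t move_lock;	/* models memcg->move_lock */
};

static struct memcg from = {
	.nr_file_mapped = 1,		/* the page is charged here */
	.moving = true,
	.move_lock = PTHREAD_MUTEX_INITIALIZER,
};
static struct memcg to = { .move_lock = PTHREAD_MUTEX_INITIALIZER };

static _Atomic(struct memcg *) page_memcg = &from;	/* page->mem_cgroup */
static atomic_bool page_mapped = true;			/* page_mapped(page) */

/* Plays mem_cgroup_move_account(): transfer the stat, switch memcgs. */
static void *mover(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&from.move_lock);

	if (atomic_load_explicit(&page_mapped, memory_order_relaxed)) {
		atomic_fetch_sub(&from.nr_file_mapped, 1);
		atomic_fetch_add(&to.nr_file_mapped, 1);
	}

	/*
	 * The smp_mb() from the patch. Without it, the page_mapped load
	 * above may complete after the store below becomes visible, and
	 * the counters can end up as from=1, to=-1.
	 */
	atomic_thread_fence(memory_order_seq_cst);

	atomic_store_explicit(&page_memcg, &to, memory_order_relaxed);

	pthread_mutex_unlock(&from.move_lock);
	return NULL;
}

/*
 * Plays a stat updater running under lock_page_memcg(): charge
 * whichever memcg the page points at right now.
 */
static void *unmapper(void *arg)
{
	struct memcg *memcg;

	(void)arg;
again:
	memcg = atomic_load_explicit(&page_memcg, memory_order_acquire);
	if (memcg->moving) {		/* slow path: serialize with mover */
		pthread_mutex_lock(&memcg->move_lock);
		if (memcg != atomic_load_explicit(&page_memcg,
						  memory_order_acquire)) {
			pthread_mutex_unlock(&memcg->move_lock);
			goto again;
		}
	}

	atomic_store_explicit(&page_mapped, false, memory_order_relaxed);
	atomic_fetch_sub(&memcg->nr_file_mapped, 1);

	if (memcg->moving)
		pthread_mutex_unlock(&memcg->move_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, mover, NULL);
	pthread_create(&b, NULL, unmapper, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Once the page is unmapped, the two counts must sum to zero. */
	printf("from=%ld to=%ld\n",
	       atomic_load(&from.nr_file_mapped),
	       atomic_load(&to.nr_file_mapped));
	return 0;
}

Build with something like "cc -O2 -pthread race.c". Note that x86 does not
reorder an earlier load with a later store, so the bad outcome would normally
need compiler reordering or a weakly ordered architecture (ARM, POWER) to show
up in practice, which fits the "theoretical" framing discussed below.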

Comments

Joonsoo Kim April 22, 2020, 6:36 a.m. UTC | #1
On Mon, Apr 20, 2020 at 06:11:10PM -0400, Johannes Weiner wrote:
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
> 
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
> 
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Thanks.
Shakeel Butt April 22, 2020, 4:51 p.m. UTC | #2
On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 26 ++++++++++++++------------
>  1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5beea03dd58a..41f5ed79272e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5372,7 +5372,6 @@ static int mem_cgroup_move_account(struct page *page,
>  {
>         struct lruvec *from_vec, *to_vec;
>         struct pglist_data *pgdat;
> -       unsigned long flags;
>         unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
>         int ret;
>         bool anon;
> @@ -5399,18 +5398,13 @@ static int mem_cgroup_move_account(struct page *page,
>         from_vec = mem_cgroup_lruvec(from, pgdat);
>         to_vec = mem_cgroup_lruvec(to, pgdat);
>
> -       spin_lock_irqsave(&from->move_lock, flags);
> +       lock_page_memcg(page);
>
>         if (!anon && page_mapped(page)) {
>                 __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
>                 __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
>         }
>
> -       /*
> -        * move_lock grabbed above and caller set from->moving_account, so
> -        * mod_memcg_page_state will serialize updates to PageDirty.
> -        * So mapping should be stable for dirty pages.
> -        */
>         if (!anon && PageDirty(page)) {
>                 struct address_space *mapping = page_mapping(page);
>
> @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
>         }
>
>         /*
> +        * All state has been migrated, let's switch to the new memcg.
> +        *
>          * It is safe to change page->mem_cgroup here because the page
> -        * is referenced, charged, and isolated - we can't race with
> -        * uncharging, charging, migration, or LRU putback.
> +        * is referenced, charged, isolated, and locked: we can't race
> +        * with (un)charging, migration, LRU putback, or anything else
> +        * that would rely on a stable page->mem_cgroup.
> +        *
> +        * Note that lock_page_memcg is a memcg lock, not a page lock,
> +        * to save space. As soon as we switch page->mem_cgroup to a
> +        * new memcg that isn't locked, the above state can change
> +        * concurrently again. Make sure we're truly done with it.
>          */
> +       smp_mb();

You said theoretical race in the subject but the above comment
convinced me that smp_mb() is required. So, why is the race still
theoretical?

>
> -       /* caller should have done css_get */
> -       page->mem_cgroup = to;
> +       page->mem_cgroup = to;  /* caller should have done css_get */
>
> -       spin_unlock_irqrestore(&from->move_lock, flags);
> +       __unlock_page_memcg(from);
>
>         ret = 0;
>
> --
> 2.26.0
>
Johannes Weiner April 22, 2020, 5:42 p.m. UTC | #3
On Wed, Apr 22, 2020 at 09:51:20AM -0700, Shakeel Butt wrote:
> On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
> >         }
> >
> >         /*
> > +        * All state has been migrated, let's switch to the new memcg.
> > +        *
> >          * It is safe to change page->mem_cgroup here because the page
> > -        * is referenced, charged, and isolated - we can't race with
> > -        * uncharging, charging, migration, or LRU putback.
> > +        * is referenced, charged, isolated, and locked: we can't race
> > +        * with (un)charging, migration, LRU putback, or anything else
> > +        * that would rely on a stable page->mem_cgroup.
> > +        *
> > +        * Note that lock_page_memcg is a memcg lock, not a page lock,
> > +        * to save space. As soon as we switch page->mem_cgroup to a
> > +        * new memcg that isn't locked, the above state can change
> > +        * concurrently again. Make sure we're truly done with it.
> >          */
> > +       smp_mb();
> 
> You said theoretical race in the subject but the above comment
> convinced me that smp_mb() is required. So, why is the race still
> theoretical?

Sorry about the confusion.

I said theoretical because I spotted it while thinking about the
code. I'm not aware of any real users that suffered the consequences
of this race condition. But they could exist in theory :-)

I think it's a real bug that needs fixing.
Shakeel Butt April 22, 2020, 6:01 p.m. UTC | #4
On Wed, Apr 22, 2020 at 10:42 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Apr 22, 2020 at 09:51:20AM -0700, Shakeel Butt wrote:
> > On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
> > >         }
> > >
> > >         /*
> > > +        * All state has been migrated, let's switch to the new memcg.
> > > +        *
> > >          * It is safe to change page->mem_cgroup here because the page
> > > -        * is referenced, charged, and isolated - we can't race with
> > > -        * uncharging, charging, migration, or LRU putback.
> > > +        * is referenced, charged, isolated, and locked: we can't race
> > > +        * with (un)charging, migration, LRU putback, or anything else
> > > +        * that would rely on a stable page->mem_cgroup.
> > > +        *
> > > +        * Note that lock_page_memcg is a memcg lock, not a page lock,
> > > +        * to save space. As soon as we switch page->mem_cgroup to a
> > > +        * new memcg that isn't locked, the above state can change
> > > +        * concurrently again. Make sure we're truly done with it.
> > >          */
> > > +       smp_mb();
> >
> > You said theoretical race in the subject but the above comment
> > convinced me that smp_mb() is required. So, why is the race still
> > theoretical?
>
> Sorry about the confusion.
>
> I said theoretical because I spotted it while thinking about the
> code. I'm not aware of any real users that suffered the consequences
> of this race condition. But they could exist in theory :-)
>
> I think it's a real bug that needs fixing.

Thanks for the clarification. I would suggest removing "theoretical"
from the subject, as it undersells the fact that the patch fixes a
real bug.
Shakeel Butt April 22, 2020, 6:02 p.m. UTC | #5
On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..41f5ed79272e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5372,7 +5372,6 @@  static int mem_cgroup_move_account(struct page *page,
 {
 	struct lruvec *from_vec, *to_vec;
 	struct pglist_data *pgdat;
-	unsigned long flags;
 	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
 	int ret;
 	bool anon;
@@ -5399,18 +5398,13 @@  static int mem_cgroup_move_account(struct page *page,
 	from_vec = mem_cgroup_lruvec(from, pgdat);
 	to_vec = mem_cgroup_lruvec(to, pgdat);
 
-	spin_lock_irqsave(&from->move_lock, flags);
+	lock_page_memcg(page);
 
 	if (!anon && page_mapped(page)) {
 		__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
 		__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
 	}
 
-	/*
-	 * move_lock grabbed above and caller set from->moving_account, so
-	 * mod_memcg_page_state will serialize updates to PageDirty.
-	 * So mapping should be stable for dirty pages.
-	 */
 	if (!anon && PageDirty(page)) {
 		struct address_space *mapping = page_mapping(page);
 
@@ -5426,15 +5420,23 @@  static int mem_cgroup_move_account(struct page *page,
 	}
 
 	/*
+	 * All state has been migrated, let's switch to the new memcg.
+	 *
 	 * It is safe to change page->mem_cgroup here because the page
-	 * is referenced, charged, and isolated - we can't race with
-	 * uncharging, charging, migration, or LRU putback.
+	 * is referenced, charged, isolated, and locked: we can't race
+	 * with (un)charging, migration, LRU putback, or anything else
+	 * that would rely on a stable page->mem_cgroup.
+	 *
+	 * Note that lock_page_memcg is a memcg lock, not a page lock,
+	 * to save space. As soon as we switch page->mem_cgroup to a
+	 * new memcg that isn't locked, the above state can change
+	 * concurrently again. Make sure we're truly done with it.
 	 */
+	smp_mb();
 
-	/* caller should have done css_get */
-	page->mem_cgroup = to;
+	page->mem_cgroup = to; 	/* caller should have done css_get */
 
-	spin_unlock_irqrestore(&from->move_lock, flags);
+	__unlock_page_memcg(from);
 
 	ret = 0;
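
For context on the "memcg lock, not a page lock" note above, the reader side
this pairs with is lock_page_memcg(). The following is a condensed sketch of
roughly what it does in this era's mm/memcontrol.c, not the verbatim code
(the mem_cgroup_disabled() check and other details are omitted; the tree has
the authoritative version):

/* Condensed sketch of the reader side; see mm/memcontrol.c for the real thing. */
struct mem_cgroup *lock_page_memcg(struct page *page)
{
	struct mem_cgroup *memcg;
	unsigned long flags;

	rcu_read_lock();	/* dropped later by unlock_page_memcg() */
again:
	memcg = page->mem_cgroup;
	if (!memcg)
		return NULL;

	/* Fast path: nobody is moving pages away from this memcg. */
	if (atomic_read(&memcg->moving_account) <= 0)
		return memcg;

	/* Slow path: serialize against mem_cgroup_move_account(). */
	spin_lock_irqsave(&memcg->move_lock, flags);
	if (memcg != page->mem_cgroup) {
		spin_unlock_irqrestore(&memcg->move_lock, flags);
		goto again;
	}
	memcg->move_lock_task = current;
	memcg->move_lock_flags = flags;
	return memcg;
}

Which lock an updater ends up holding is decided entirely by the
page->mem_cgroup it observes: while the pointer still says "from", updaters
funnel through from->move_lock and are excluded by the mover; the moment "to"
becomes visible they stop serializing against the mover altogether, since "to"
is not marked as moving. That is why the pointer store, not the later
__unlock_page_memcg(from), is the point where the page state can change again,
and why the new smp_mb() has to order the page-state reads before it.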