
[PATCH-mm,v3] mm/list_lru: Optimize memcg_reparent_list_lru_node()

Message ID 20220309144000.1470138-1-longman@redhat.com (mailing list archive)
State New
Series [PATCH-mm,v3] mm/list_lru: Optimize memcg_reparent_list_lru_node()

Commit Message

Waiman Long March 9, 2022, 2:40 p.m. UTC
Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
to be race free"), we are tracking the total number of lru
entries in a list_lru_node in its nr_items field.  In the case of
memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
is 0.  We don't even need to take the nlru->lock as no new lru entry
could be added by a racing list_lru_add() to the draining src_idx memcg
at this point.

On systems that serve a lot of containers, there can be thousands of
list_lru's present because each container may mount its own
container-specific filesystems. As a typical container uses only a few
cpus, it is likely that only the list_lru_node of the node containing
those cpus will be utilized while the rest may stay empty. In other
words, there can be a lot of list_lru_node's with 0 nr_items. By
skipping the lock/unlock operation and the load of a cacheline from
memcg_lrus for each of them, a sizeable number of cpu cycles can be
saved. That can be substantial when we are talking about thousands of
list_lru_node's with 0 nr_items.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/list_lru.c | 6 ++++++
 1 file changed, 6 insertions(+)
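
The body of the patch is not shown above; the essence of the change,
reconstructed from the hunks quoted later in the thread, is a lockless early
return at the top of memcg_reparent_list_lru_node(), before nlru->lock is
taken. A simplified sketch (not the verbatim patch):

	struct list_lru_node *nlru = &lru->node[nid];

	/*
	 * If there is no lru entry in this nlru, skip it immediately.
	 * Per the commit message, no racing list_lru_add() can add to the
	 * draining src_idx memcg at this point, so the check is done
	 * without taking nlru->lock.
	 */
	if (!READ_ONCE(nlru->nr_items))
		return;

	/* unchanged: take nlru->lock and splice the src lists onto dst */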

Comments

Muchun Song March 23, 2022, 1:06 a.m. UTC | #1
On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>
> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> to be race free"), we are tracking the total number of lru
> entries in a list_lru_node in its nr_items field.  In the case of
> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> is 0.  We don't even need to take the nlru->lock as no new lru entry
> could be added by a racing list_lru_add() to the draining src_idx memcg
> at this point.

Hi Waiman,

Sorry for the late reply.  Quick question: what if there is an inflight
list_lru_add()?  How about the following race?

CPU0:                               CPU1:
list_lru_add()
    spin_lock(&nlru->lock)
    l = list_lru_from_kmem(memcg)
                                    memcg_reparent_objcgs(memcg)
                                    memcg_reparent_list_lrus(memcg)
                                        memcg_reparent_list_lru()
                                            memcg_reparent_list_lru_node()
                                                if (!READ_ONCE(nlru->nr_items))
                                                    // Miss reparenting
                                                    return
    // Assume 0->1
    l->nr_items++
    // Assume 0->1
    nlru->nr_items++

IIUC, we use nlru->lock to serialise this scenario.

Thanks.
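
For reference, a simplified sketch of list_lru_add() as it looked in
mm/list_lru.c around this time (details trimmed, so not the exact code)
shows that the memcg lookup and both nr_items increments in the diagram
above all happen under nlru->lock:

bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;
	struct mem_cgroup *memcg;

	spin_lock(&nlru->lock);
	if (list_empty(item)) {
		/* resolve the per-memcg list; the memcg may be reparented concurrently */
		l = list_lru_from_kmem(lru, nid, item, &memcg);
		list_add_tail(item, &l->list);
		/* first entry on this per-memcg list: tell the shrinker about it */
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
		nlru->nr_items++;
		spin_unlock(&nlru->lock);
		return true;
	}
	spin_unlock(&nlru->lock);
	return false;
}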
Waiman Long March 23, 2022, 1:55 a.m. UTC | #2
On 3/22/22 21:06, Muchun Song wrote:
> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
>> to be race free"), we are tracking the total number of lru
>> entries in a list_lru_node in its nr_items field.  In the case of
>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
>> is 0.  We don't even need to take the nlru->lock as no new lru entry
>> could be added by a racing list_lru_add() to the draining src_idx memcg
>> at this point.
> Hi Waiman,
>
> Sorry for the late reply.  Quick question: what if there is an inflight
> list_lru_add()?  How about the following race?
>
> CPU0:                               CPU1:
> list_lru_add()
>      spin_lock(&nlru->lock)
>      l = list_lru_from_kmem(memcg)
>                                      memcg_reparent_objcgs(memcg)
>                                      memcg_reparent_list_lrus(memcg)
>                                          memcg_reparent_list_lru()
>                                              memcg_reparent_list_lru_node()
>                                                  if (!READ_ONCE(nlru->nr_items))
>                                                      // Miss reparenting
>                                                      return
>      // Assume 0->1
>      l->nr_items++
>      // Assume 0->1
>      nlru->nr_items++
>
> IIUC, we use nlru->lock to serialise this scenario.

I guess this race is theoretically possible but very unlikely since it 
means a very long pause between list_lru_from_kmem() and the increment 
of nr_items.

How about the following changes to make sure that this race can't happen?

diff --git a/mm/list_lru.c b/mm/list_lru.c
index c669d87001a6..c31a0a8ad4e7 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
 	struct list_lru_one *src, *dst;
 
 	/*
-	 * If there is no lru entry in this nlru, we can skip it immediately.
+	 * If there is no lru entry in this nlru and the nlru->lock is free,
+	 * we can skip it immediately.
 	 */
-	if (!READ_ONCE(nlru->nr_items))
+	if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
 		return;

Cheers,
Longman
Muchun Song March 23, 2022, 2:12 a.m. UTC | #3
On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/22/22 21:06, Muchun Song wrote:
> > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> >> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> >> to be race free"), we are tracking the total number of lru
> >> entries in a list_lru_node in its nr_items field.  In the case of
> >> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> >> is 0.  We don't even need to take the nlru->lock as no new lru entry
> >> could be added by a racing list_lru_add() to the draining src_idx memcg
> >> at this point.
> > Hi Waiman,
> >
> > Sorry for the late reply.  Quick question: what if there is an inflight
> > list_lru_add()?  How about the following race?
> >
> > CPU0:                               CPU1:
> > list_lru_add()
> >      spin_lock(&nlru->lock)
> >      l = list_lru_from_kmem(memcg)
> >                                      memcg_reparent_objcgs(memcg)
> >                                      memcg_reparent_list_lrus(memcg)
> >                                          memcg_reparent_list_lru()
> >                                              memcg_reparent_list_lru_node()
> >                                                  if (!READ_ONCE(nlru->nr_items))
> >                                                      // Miss reparenting
> >                                                      return
> >      // Assume 0->1
> >      l->nr_items++
> >      // Assume 0->1
> >      nlru->nr_items++
> >
> > IIUC, we use nlru->lock to serialise this scenario.
>
> I guess this race is theoretically possible but very unlikely since it
> means a very long pause between list_lru_from_kmem() and the increment
> of nr_items.

It is more possible in a VM.

>
> How about the following changes to make sure that this race can't happen?
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index c669d87001a6..c31a0a8ad4e7 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> list_lru *lru, int nid,
>          struct list_lru_one *src, *dst;
>
>          /*
> -        * If there is no lru entry in this nlru, we can skip it
> immediately.
> +        * If there is no lru entry in this nlru and the nlru->lock is free,
> +        * we can skip it immediately.
>           */
> -       if (!READ_ONCE(nlru->nr_items))
> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))

I think we also should insert a smp_rmb() between those two loads.

Thanks.
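
Spelled out, the barrier being suggested here would sit between the two
loads of the proposed lockless check, something like this (a sketch, not
a tested patch):

	if (!READ_ONCE(nlru->nr_items)) {
		/* order the nr_items load against the lock-state load below */
		smp_rmb();
		if (!spin_is_locked(&nlru->lock))
			return;
	}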
Waiman Long March 28, 2022, 12:57 a.m. UTC | #4
On 3/22/22 22:12, Muchun Song wrote:
> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>> On 3/22/22 21:06, Muchun Song wrote:
>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
>>>> to be race free"), we are tracking the total number of lru
>>>> entries in a list_lru_node in its nr_items field.  In the case of
>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
>>>> at this point.
>>> Hi Waiman,
>>>
>>> Sorry for the late reply.  Quick question: what if there is an inflight
>>> list_lru_add()?  How about the following race?
>>>
>>> CPU0:                               CPU1:
>>> list_lru_add()
>>>       spin_lock(&nlru->lock)
>>>       l = list_lru_from_kmem(memcg)
>>>                                       memcg_reparent_objcgs(memcg)
>>>                                       memcg_reparent_list_lrus(memcg)
>>>                                           memcg_reparent_list_lru()
>>>                                               memcg_reparent_list_lru_node()
>>>                                                   if (!READ_ONCE(nlru->nr_items))
>>>                                                       // Miss reparenting
>>>                                                       return
>>>       // Assume 0->1
>>>       l->nr_items++
>>>       // Assume 0->1
>>>       nlru->nr_items++
>>>
>>> IIUC, we use nlru->lock to serialise this scenario.
>> I guess this race is theoretically possible but very unlikely since it
>> means a very long pause between list_lru_from_kmem() and the increment
>> of nr_items.
> It is more possible in a VM.
>
>> How about the following changes to make sure that this race can't happen?
>>
>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>> index c669d87001a6..c31a0a8ad4e7 100644
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
>> list_lru *lru, int nid,
>>           struct list_lru_one *src, *dst;
>>
>>           /*
>> -        * If there is no lru entry in this nlru, we can skip it
>> immediately.
>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
>> +        * we can skip it immediately.
>>            */
>> -       if (!READ_ONCE(nlru->nr_items))
>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> I think we also should insert a smp_rmb() between those two loads.

Thinking about this some more, I believe that adding the spin_is_locked() 
check will be enough for x86. However, that will likely not be enough for 
arches with more relaxed memory semantics. So the safest way to avoid this 
possible race is to move the check to within the lock critical section, 
though that comes with a slightly higher overhead for the 0 nr_items case. 
I will send out a patch to correct that. Thanks for bringing this possible 
race to my attention.

Cheers,
Longman
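
A sketch of what moving the check into the critical section could look
like (an assumption on my part; the actual follow-up patch may be
structured differently):

	spin_lock_irq(&nlru->lock);

	/*
	 * With nlru->lock held, a racing list_lru_add() has either already
	 * incremented nlru->nr_items (so the check below sees it) or has not
	 * yet looked up its list (so it will see the already-reparented
	 * memcg), which closes the window shown in the race diagram above.
	 */
	if (!nlru->nr_items)
		goto unlock;

	/* ... splice the src_idx lists onto dst and transfer nr_items ... */
unlock:
	spin_unlock_irq(&nlru->lock);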
Roman Gushchin March 28, 2022, 7:12 p.m. UTC | #5
On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> On 3/22/22 22:12, Muchun Song wrote:
> > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > > On 3/22/22 21:06, Muchun Song wrote:
> > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > to be race free"), we are tracking the total number of lru
> > > > > entries in a list_lru_node in its nr_items field.  In the case of
> > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > at this point.
> > > > Hi Waiman,
> > > > 
> > > > Sorry for the late reply.  Quick question: what if there is an inflight
> > > > list_lru_add()?  How about the following race?
> > > > 
> > > > CPU0:                               CPU1:
> > > > list_lru_add()
> > > >       spin_lock(&nlru->lock)
> > > >       l = list_lru_from_kmem(memcg)
> > > >                                       memcg_reparent_objcgs(memcg)
> > > >                                       memcg_reparent_list_lrus(memcg)
> > > >                                           memcg_reparent_list_lru()
> > > >                                               memcg_reparent_list_lru_node()
> > > >                                                   if (!READ_ONCE(nlru->nr_items))
> > > >                                                       // Miss reparenting
> > > >                                                       return
> > > >       // Assume 0->1
> > > >       l->nr_items++
> > > >       // Assume 0->1
> > > >       nlru->nr_items++
> > > > 
> > > > IIUC, we use nlru->lock to serialise this scenario.
> > > I guess this race is theoretically possible but very unlikely since it
> > > means a very long pause between list_lru_from_kmem() and the increment
> > > of nr_items.
> > It is more possible in a VM.
> > 
> > > How about the following changes to make sure that this race can't happen?
> > > 
> > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > index c669d87001a6..c31a0a8ad4e7 100644
> > > --- a/mm/list_lru.c
> > > +++ b/mm/list_lru.c
> > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > > list_lru *lru, int nid,
> > >           struct list_lru_one *src, *dst;
> > > 
> > >           /*
> > > -        * If there is no lru entry in this nlru, we can skip it
> > > immediately.
> > > +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > > +        * we can skip it immediately.
> > >            */
> > > -       if (!READ_ONCE(nlru->nr_items))
> > > +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > I think we also should insert a smp_rmb() between those two loads.
> 
> Thinking about this some more, I believe that adding spin_is_locked() check
> will be enough for x86. However, that will likely not be enough for arches
> with a more relaxed memory semantics. So the safest way to avoid this
> possible race is to move the check to within the lock critical section,
> though that comes with a slightly higher overhead for the 0 nr_items case. I
> will send out a patch to correct that. Thanks for bring this possible race
> to my attention.

Yes, I think it's not enough:
CPU0                                       CPU1
READ_ONCE(&nlru->nr_items) -> 0
                                           spin_lock(&nlru->lock);
                                           nlru->nr_items++;
                                           spin_unlock(&nlru->lock);
&& !spin_is_locked(&nlru->lock) -> 0


Getting back to the original patch, I wonder if instead we can batch reparenting
of lrus so we don't have to grab and release nlru->lock for each reparenting lru.


Thanks!
Roman Gushchin March 28, 2022, 7:12 p.m. UTC | #6
On Wed, Mar 23, 2022 at 09:06:03AM +0800, Muchun Song wrote:
> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> >
> > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > to be race free"), we are tracking the total number of lru
> > entries in a list_lru_node in its nr_items field.  In the case of
> > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > could be added by a racing list_lru_add() to the draining src_idx memcg
> > at this point.
> 
> Hi Waiman,
> 
> Sorry for the late reply.  Quick question: what if there is an inflight
> list_lru_add()?  How about the following race?
> 
> CPU0:                               CPU1:
> list_lru_add()
>     spin_lock(&nlru->lock)
>     l = list_lru_from_kmem(memcg)
>                                     memcg_reparent_objcgs(memcg)
>                                     memcg_reparent_list_lrus(memcg)
>                                         memcg_reparent_list_lru()
>                                             memcg_reparent_list_lru_node()
>                                                 if (!READ_ONCE(nlru->nr_items))
>                                                     // Miss reparenting
>                                                     return
>     // Assume 0->1
>     l->nr_items++
>     // Assume 0->1
>     nlru->nr_items++
> 
> IIUC, we use nlru->lock to serialise this scenario.


Thank you for bringing this up, really cool race!
Waiman Long March 28, 2022, 8:46 p.m. UTC | #7
On 3/28/22 15:12, Roman Gushchin wrote:
> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
>> On 3/22/22 22:12, Muchun Song wrote:
>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>>>> On 3/22/22 21:06, Muchun Song wrote:
>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
>>>>>> to be race free"), we are tracking the total number of lru
>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
>>>>>> at this point.
>>>>> Hi Waiman,
>>>>>
>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
>>>>> list_lru_add()?  How about the following race?
>>>>>
>>>>> CPU0:                               CPU1:
>>>>> list_lru_add()
>>>>>        spin_lock(&nlru->lock)
>>>>>        l = list_lru_from_kmem(memcg)
>>>>>                                        memcg_reparent_objcgs(memcg)
>>>>>                                        memcg_reparent_list_lrus(memcg)
>>>>>                                            memcg_reparent_list_lru()
>>>>>                                                memcg_reparent_list_lru_node()
>>>>>                                                    if (!READ_ONCE(nlru->nr_items))
>>>>>                                                        // Miss reparenting
>>>>>                                                        return
>>>>>        // Assume 0->1
>>>>>        l->nr_items++
>>>>>        // Assume 0->1
>>>>>        nlru->nr_items++
>>>>>
>>>>> IIUC, we use nlru->lock to serialise this scenario.
>>>> I guess this race is theoretically possible but very unlikely since it
>>>> means a very long pause between list_lru_from_kmem() and the increment
>>>> of nr_items.
>>> It is more possible in a VM.
>>>
>>>> How about the following changes to make sure that this race can't happen?
>>>>
>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>> index c669d87001a6..c31a0a8ad4e7 100644
>>>> --- a/mm/list_lru.c
>>>> +++ b/mm/list_lru.c
>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
>>>> list_lru *lru, int nid,
>>>>            struct list_lru_one *src, *dst;
>>>>
>>>>            /*
>>>> -        * If there is no lru entry in this nlru, we can skip it
>>>> immediately.
>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
>>>> +        * we can skip it immediately.
>>>>             */
>>>> -       if (!READ_ONCE(nlru->nr_items))
>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
>>> I think we also should insert a smp_rmb() between those two loads.
>> Thinking about this some more, I believe that adding spin_is_locked() check
>> will be enough for x86. However, that will likely not be enough for arches
>> with a more relaxed memory semantics. So the safest way to avoid this
>> possible race is to move the check to within the lock critical section,
>> though that comes with a slightly higher overhead for the 0 nr_items case. I
>> will send out a patch to correct that. Thanks for bring this possible race
>> to my attention.
> Yes, I think it's not enough:
> CPU0                                       CPU1
> READ_ONCE(&nlru->nr_items) -> 0
>                                             spin_lock(&nlru->lock);
>                                             nlru->nr_items++;
>                                             spin_unlock(&nlru->lock);
> && !spin_is_locked(&nlru->lock) -> 0
I have actually thought of that. I am even thinking about reading 
nr_items again after spin_is_locked(). Still, for arches with relaxed 
memory semantics, how long it takes for a memory write by one cpu to 
become visible to another cpu can be highly variable. It is very hard 
to prove that it is completely safe.

x86 has stricter memory ordering and it is the only architecture for 
which I have enough confidence that doing the check without taking the 
lock is safe. Perhaps we could use this optimization just for x86 and 
do it inside the lock for the rest.

> Getting back to the original patch, I wonder if instead we can batch reparenting
> of lrus so we don't have to grab and release nlru->lock for each reparenting lru.

nlru is actually a sub-structure within a list_lru. So if there are m lrus 
and n nodes, we will have m*n nlrus. I don't believe there is any more 
batching that can be done.

Cheers,
Longman
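
The structures in question look roughly like this (simplified from
include/linux/list_lru.h of that era, fields trimmed), which is where the
m*n count above comes from:

struct list_lru_one {
	struct list_head	list;
	long			nr_items;	/* entries on this one per-memcg list */
};

struct list_lru_node {
	spinlock_t		lock;		/* serializes add/del/reparent on this node */
	struct list_lru_one	lru;		/* global (non-memcg) list */
	long			nr_items;	/* total entries on this node, all memcgs */
};

struct list_lru {
	struct list_lru_node	*node;		/* array with one entry per NUMA node */
	/* plus per-memcg lookup state when the lru is memcg aware */
};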
Roman Gushchin March 28, 2022, 9:12 p.m. UTC | #8
On Mon, Mar 28, 2022 at 04:46:39PM -0400, Waiman Long wrote:
> On 3/28/22 15:12, Roman Gushchin wrote:
> > On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > > On 3/22/22 22:12, Muchun Song wrote:
> > > > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > > > > On 3/22/22 21:06, Muchun Song wrote:
> > > > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > > > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > > > to be race free"), we are tracking the total number of lru
> > > > > > > entries in a list_lru_node in its nr_items field.  In the case of
> > > > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > > > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > > > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > > > at this point.
> > > > > > Hi Waiman,
> > > > > > 
> > > > > > Sorry for the late reply.  Quick question: what if there is an inflight
> > > > > > list_lru_add()?  How about the following race?
> > > > > > 
> > > > > > CPU0:                               CPU1:
> > > > > > list_lru_add()
> > > > > >        spin_lock(&nlru->lock)
> > > > > >        l = list_lru_from_kmem(memcg)
> > > > > >                                        memcg_reparent_objcgs(memcg)
> > > > > >                                        memcg_reparent_list_lrus(memcg)
> > > > > >                                            memcg_reparent_list_lru()
> > > > > >                                                memcg_reparent_list_lru_node()
> > > > > >                                                    if (!READ_ONCE(nlru->nr_items))
> > > > > >                                                        // Miss reparenting
> > > > > >                                                        return
> > > > > >        // Assume 0->1
> > > > > >        l->nr_items++
> > > > > >        // Assume 0->1
> > > > > >        nlru->nr_items++
> > > > > > 
> > > > > > IIUC, we use nlru->lock to serialise this scenario.
> > > > > I guess this race is theoretically possible but very unlikely since it
> > > > > means a very long pause between list_lru_from_kmem() and the increment
> > > > > of nr_items.
> > > > It is more possible in a VM.
> > > > 
> > > > > How about the following changes to make sure that this race can't happen?
> > > > > 
> > > > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > > > index c669d87001a6..c31a0a8ad4e7 100644
> > > > > --- a/mm/list_lru.c
> > > > > +++ b/mm/list_lru.c
> > > > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > > > > list_lru *lru, int nid,
> > > > >            struct list_lru_one *src, *dst;
> > > > > 
> > > > >            /*
> > > > > -        * If there is no lru entry in this nlru, we can skip it
> > > > > immediately.
> > > > > +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > > > > +        * we can skip it immediately.
> > > > >             */
> > > > > -       if (!READ_ONCE(nlru->nr_items))
> > > > > +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > > > I think we also should insert a smp_rmb() between those two loads.
> > > Thinking about this some more, I believe that adding spin_is_locked() check
> > > will be enough for x86. However, that will likely not be enough for arches
> > > with a more relaxed memory semantics. So the safest way to avoid this
> > > possible race is to move the check to within the lock critical section,
> > > though that comes with a slightly higher overhead for the 0 nr_items case. I
> > > will send out a patch to correct that. Thanks for bring this possible race
> > > to my attention.
> > Yes, I think it's not enough:
> > CPU0                                       CPU1
> > READ_ONCE(&nlru->nr_items) -> 0
> >                                             spin_lock(&nlru->lock);
> >                                             nlru->nr_items++;
> >                                             spin_unlock(&nlru->lock);
> > && !spin_is_locked(&nlru->lock) -> 0
> I have actually thought of that. I am even thinking about reading nr_items
> again after spin_is_locked(). Still for arches with relaxed memory
> semantics, when will a memory write by one cpu be propagated to another cpu
> can be highly variable. It is very hard to prove that it is completely safe.
> 
> x86 has a more strict memory semantics and it is the only architecture that
> I have enough confidence that doing the check without taking a lock can be
> safe. Perhaps we could use this optimization just for x86 and do it inside
> locks for the rests.

Hm, is this such a big problem in real life? Can you describe the setup?
I'm somewhat resistant to the idea of having arch-specific optimizations here
without a HUGE reason.

> 
> > Getting back to the original patch, I wonder if instead we can batch reparenting
> > of lrus so we don't have to grab and release nlru->lock for each reparenting lru.
> 
> nlru is actually a sub-structure within a lru. So if there are m lrus and n
> nodes, we will have m*n nlrus. I don't believe there is anymore batching
> that can be done.

I need to think, I'll share any ideas/patches if there will be any.

Thanks!
Waiman Long March 28, 2022, 9:20 p.m. UTC | #9
On 3/28/22 17:12, Roman Gushchin wrote:
> On Mon, Mar 28, 2022 at 04:46:39PM -0400, Waiman Long wrote:
>> On 3/28/22 15:12, Roman Gushchin wrote:
>>> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
>>>> On 3/22/22 22:12, Muchun Song wrote:
>>>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>>>>>> On 3/22/22 21:06, Muchun Song wrote:
>>>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>>>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
>>>>>>>> to be race free"), we are tracking the total number of lru
>>>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
>>>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
>>>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
>>>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
>>>>>>>> at this point.
>>>>>>> Hi Waiman,
>>>>>>>
>>>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
>>>>>>> list_lru_add()?  How about the following race?
>>>>>>>
>>>>>>> CPU0:                               CPU1:
>>>>>>> list_lru_add()
>>>>>>>         spin_lock(&nlru->lock)
>>>>>>>         l = list_lru_from_kmem(memcg)
>>>>>>>                                         memcg_reparent_objcgs(memcg)
>>>>>>>                                         memcg_reparent_list_lrus(memcg)
>>>>>>>                                             memcg_reparent_list_lru()
>>>>>>>                                                 memcg_reparent_list_lru_node()
>>>>>>>                                                     if (!READ_ONCE(nlru->nr_items))
>>>>>>>                                                         // Miss reparenting
>>>>>>>                                                         return
>>>>>>>         // Assume 0->1
>>>>>>>         l->nr_items++
>>>>>>>         // Assume 0->1
>>>>>>>         nlru->nr_items++
>>>>>>>
>>>>>>> IIUC, we use nlru->lock to serialise this scenario.
>>>>>> I guess this race is theoretically possible but very unlikely since it
>>>>>> means a very long pause between list_lru_from_kmem() and the increment
>>>>>> of nr_items.
>>>>> It is more possible in a VM.
>>>>>
>>>>>> How about the following changes to make sure that this race can't happen?
>>>>>>
>>>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>>>> index c669d87001a6..c31a0a8ad4e7 100644
>>>>>> --- a/mm/list_lru.c
>>>>>> +++ b/mm/list_lru.c
>>>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
>>>>>> list_lru *lru, int nid,
>>>>>>             struct list_lru_one *src, *dst;
>>>>>>
>>>>>>             /*
>>>>>> -        * If there is no lru entry in this nlru, we can skip it
>>>>>> immediately.
>>>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
>>>>>> +        * we can skip it immediately.
>>>>>>              */
>>>>>> -       if (!READ_ONCE(nlru->nr_items))
>>>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
>>>>> I think we also should insert a smp_rmb() between those two loads.
>>>> Thinking about this some more, I believe that adding spin_is_locked() check
>>>> will be enough for x86. However, that will likely not be enough for arches
>>>> with a more relaxed memory semantics. So the safest way to avoid this
>>>> possible race is to move the check to within the lock critical section,
>>>> though that comes with a slightly higher overhead for the 0 nr_items case. I
>>>> will send out a patch to correct that. Thanks for bring this possible race
>>>> to my attention.
>>> Yes, I think it's not enough:
>>> CPU0                                       CPU1
>>> READ_ONCE(&nlru->nr_items) -> 0
>>>                                              spin_lock(&nlru->lock);
>>>                                              nlru->nr_items++;
>>>                                              spin_unlock(&nlru->lock);
>>> && !spin_is_locked(&nlru->lock) -> 0
>> I have actually thought of that. I am even thinking about reading nr_items
>> again after spin_is_locked(). Still for arches with relaxed memory
>> semantics, when will a memory write by one cpu be propagated to another cpu
>> can be highly variable. It is very hard to prove that it is completely safe.
>>
>> x86 has a more strict memory semantics and it is the only architecture that
>> I have enough confidence that doing the check without taking a lock can be
>> safe. Perhaps we could use this optimization just for x86 and do it inside
>> locks for the rests.
> Hm, is this such a big problem in the real life? Can you describe the setup?
> I'm somewhat resistant to an idea of having arch-specific optimizations here
> without a HUGE reason.

I am just throwing this idea out for discussion. It does not mean that I 
want to do an arch-specific patch unless there is performance data showing 
a substantial gain in some use cases.

Cheers,
Longman
Roman Gushchin March 28, 2022, 11:44 p.m. UTC | #10
On Mon, Mar 28, 2022 at 05:20:16PM -0400, Waiman Long wrote:
> On 3/28/22 17:12, Roman Gushchin wrote:
> > On Mon, Mar 28, 2022 at 04:46:39PM -0400, Waiman Long wrote:
> > > On 3/28/22 15:12, Roman Gushchin wrote:
> > > > On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > > > > On 3/22/22 22:12, Muchun Song wrote:
> > > > > > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > > > > > > On 3/22/22 21:06, Muchun Song wrote:
> > > > > > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > > > > > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > > > > > to be race free"), we are tracking the total number of lru
> > > > > > > > > entries in a list_lru_node in its nr_items field.  In the case of
> > > > > > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > > > > > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > > > > > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > > > > > at this point.
> > > > > > > > Hi Waiman,
> > > > > > > > 
> > > > > > > > Sorry for the late reply.  Quick question: what if there is an inflight
> > > > > > > > list_lru_add()?  How about the following race?
> > > > > > > > 
> > > > > > > > CPU0:                               CPU1:
> > > > > > > > list_lru_add()
> > > > > > > >         spin_lock(&nlru->lock)
> > > > > > > >         l = list_lru_from_kmem(memcg)
> > > > > > > >                                         memcg_reparent_objcgs(memcg)
> > > > > > > >                                         memcg_reparent_list_lrus(memcg)
> > > > > > > >                                             memcg_reparent_list_lru()
> > > > > > > >                                                 memcg_reparent_list_lru_node()
> > > > > > > >                                                     if (!READ_ONCE(nlru->nr_items))
> > > > > > > >                                                         // Miss reparenting
> > > > > > > >                                                         return
> > > > > > > >         // Assume 0->1
> > > > > > > >         l->nr_items++
> > > > > > > >         // Assume 0->1
> > > > > > > >         nlru->nr_items++
> > > > > > > > 
> > > > > > > > IIUC, we use nlru->lock to serialise this scenario.
> > > > > > > I guess this race is theoretically possible but very unlikely since it
> > > > > > > means a very long pause between list_lru_from_kmem() and the increment
> > > > > > > of nr_items.
> > > > > > It is more possible in a VM.
> > > > > > 
> > > > > > > How about the following changes to make sure that this race can't happen?
> > > > > > > 
> > > > > > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > > > > > index c669d87001a6..c31a0a8ad4e7 100644
> > > > > > > --- a/mm/list_lru.c
> > > > > > > +++ b/mm/list_lru.c
> > > > > > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > > > > > > list_lru *lru, int nid,
> > > > > > >             struct list_lru_one *src, *dst;
> > > > > > > 
> > > > > > >             /*
> > > > > > > -        * If there is no lru entry in this nlru, we can skip it
> > > > > > > immediately.
> > > > > > > +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > > > > > > +        * we can skip it immediately.
> > > > > > >              */
> > > > > > > -       if (!READ_ONCE(nlru->nr_items))
> > > > > > > +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > > > > > I think we also should insert a smp_rmb() between those two loads.
> > > > > Thinking about this some more, I believe that adding spin_is_locked() check
> > > > > will be enough for x86. However, that will likely not be enough for arches
> > > > > with a more relaxed memory semantics. So the safest way to avoid this
> > > > > possible race is to move the check to within the lock critical section,
> > > > > though that comes with a slightly higher overhead for the 0 nr_items case. I
> > > > > will send out a patch to correct that. Thanks for bring this possible race
> > > > > to my attention.
> > > > Yes, I think it's not enough:
> > > > CPU0                                       CPU1
> > > > READ_ONCE(&nlru->nr_items) -> 0
> > > >                                              spin_lock(&nlru->lock);
> > > >                                              nlru->nr_items++;
> > > >                                              spin_unlock(&nlru->lock);
> > > > && !spin_is_locked(&nlru->lock) -> 0
> > > I have actually thought of that. I am even thinking about reading nr_items
> > > again after spin_is_locked(). Still for arches with relaxed memory
> > > semantics, when will a memory write by one cpu be propagated to another cpu
> > > can be highly variable. It is very hard to prove that it is completely safe.
> > > 
> > > x86 has a more strict memory semantics and it is the only architecture that
> > > I have enough confidence that doing the check without taking a lock can be
> > > safe. Perhaps we could use this optimization just for x86 and do it inside
> > > locks for the rests.
> > Hm, is this such a big problem in the real life? Can you describe the setup?
> > I'm somewhat resistant to an idea of having arch-specific optimizations here
> > without a HUGE reason.
> 
> I am just throwing this idea out for discussion. It does not mean that I
> want to do an arch specific patch unless there is performance data to
> indicate a substantial gain in performance in some use cases.

Got it!
I mean it's not obvious from the original commit message whether it's just a
nice optimization or there was a real-life problem. In the former case the best
thing is probably to leave everything as it is now; in the latter we need to do
something. I'm trying to understand which case it is.

Thanks!
Muchun Song March 29, 2022, 1:15 a.m. UTC | #11
On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > On 3/22/22 22:12, Muchun Song wrote:
> > > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > > > On 3/22/22 21:06, Muchun Song wrote:
> > > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > > to be race free"), we are tracking the total number of lru
> > > > > > entries in a list_lru_node in its nr_items field.  In the case of
> > > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > > at this point.
> > > > > Hi Waiman,
> > > > >
> > > > > Sorry for the late reply.  Quick question: what if there is an inflight
> > > > > list_lru_add()?  How about the following race?
> > > > >
> > > > > CPU0:                               CPU1:
> > > > > list_lru_add()
> > > > >       spin_lock(&nlru->lock)
> > > > >       l = list_lru_from_kmem(memcg)
> > > > >                                       memcg_reparent_objcgs(memcg)
> > > > >                                       memcg_reparent_list_lrus(memcg)
> > > > >                                           memcg_reparent_list_lru()
> > > > >                                               memcg_reparent_list_lru_node()
> > > > >                                                   if (!READ_ONCE(nlru->nr_items))
> > > > >                                                       // Miss reparenting
> > > > >                                                       return
> > > > >       // Assume 0->1
> > > > >       l->nr_items++
> > > > >       // Assume 0->1
> > > > >       nlru->nr_items++
> > > > >
> > > > > IIUC, we use nlru->lock to serialise this scenario.
> > > > I guess this race is theoretically possible but very unlikely since it
> > > > means a very long pause between list_lru_from_kmem() and the increment
> > > > of nr_items.
> > > It is more possible in a VM.
> > >
> > > > How about the following changes to make sure that this race can't happen?
> > > >
> > > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > > index c669d87001a6..c31a0a8ad4e7 100644
> > > > --- a/mm/list_lru.c
> > > > +++ b/mm/list_lru.c
> > > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > > > list_lru *lru, int nid,
> > > >           struct list_lru_one *src, *dst;
> > > >
> > > >           /*
> > > > -        * If there is no lru entry in this nlru, we can skip it
> > > > immediately.
> > > > +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > > > +        * we can skip it immediately.
> > > >            */
> > > > -       if (!READ_ONCE(nlru->nr_items))
> > > > +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > > I think we also should insert a smp_rmb() between those two loads.
> >
> > Thinking about this some more, I believe that adding spin_is_locked() check
> > will be enough for x86. However, that will likely not be enough for arches
> > with a more relaxed memory semantics. So the safest way to avoid this
> > possible race is to move the check to within the lock critical section,
> > though that comes with a slightly higher overhead for the 0 nr_items case. I
> > will send out a patch to correct that. Thanks for bring this possible race
> > to my attention.
>
> Yes, I think it's not enough:

I think it may be enough if we insert a smp_rmb() between those two loads.

> CPU0                                       CPU1
> READ_ONCE(&nlru->nr_items) -> 0
>                                            spin_lock(&nlru->lock);
>                                            nlru->nr_items++;
                                             ^^^
                                             |||
                                             The nlru here is not the
                                             same as the one in CPU0,
                                             since CPU0 has done the
                                             memcg reparenting. Then
                                             CPU0 will not miss the nlru
                                             reparenting.  If I am wrong,
                                             please correct me.  Thanks.
>                                            spin_unlock(&nlru->lock);
> && !spin_is_locked(&nlru->lock) -> 0
Roman Gushchin March 29, 2022, 2:30 a.m. UTC | #12
On Tue, Mar 29, 2022 at 09:15:46AM +0800, Muchun Song wrote:
> On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > > On 3/22/22 22:12, Muchun Song wrote:
> > > > On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > > > > On 3/22/22 21:06, Muchun Song wrote:
> > > > > > On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > > > > > > Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > > > > > > to be race free"), we are tracking the total number of lru
> > > > > > > entries in a list_lru_node in its nr_items field.  In the case of
> > > > > > > memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > > > > > > is 0.  We don't even need to take the nlru->lock as no new lru entry
> > > > > > > could be added by a racing list_lru_add() to the draining src_idx memcg
> > > > > > > at this point.
> > > > > > Hi Waiman,
> > > > > >
> > > > > > Sorry for the late reply.  Quick question: what if there is an inflight
> > > > > > list_lru_add()?  How about the following race?
> > > > > >
> > > > > > CPU0:                               CPU1:
> > > > > > list_lru_add()
> > > > > >       spin_lock(&nlru->lock)
> > > > > >       l = list_lru_from_kmem(memcg)
> > > > > >                                       memcg_reparent_objcgs(memcg)
> > > > > >                                       memcg_reparent_list_lrus(memcg)
> > > > > >                                           memcg_reparent_list_lru()
> > > > > >                                               memcg_reparent_list_lru_node()
> > > > > >                                                   if (!READ_ONCE(nlru->nr_items))
> > > > > >                                                       // Miss reparenting
> > > > > >                                                       return
> > > > > >       // Assume 0->1
> > > > > >       l->nr_items++
> > > > > >       // Assume 0->1
> > > > > >       nlru->nr_items++
> > > > > >
> > > > > > IIUC, we use nlru->lock to serialise this scenario.
> > > > > I guess this race is theoretically possible but very unlikely since it
> > > > > means a very long pause between list_lru_from_kmem() and the increment
> > > > > of nr_items.
> > > > It is more possible in a VM.
> > > >
> > > > > How about the following changes to make sure that this race can't happen?
> > > > >
> > > > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > > > index c669d87001a6..c31a0a8ad4e7 100644
> > > > > --- a/mm/list_lru.c
> > > > > +++ b/mm/list_lru.c
> > > > > @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > > > > list_lru *lru, int nid,
> > > > >           struct list_lru_one *src, *dst;
> > > > >
> > > > >           /*
> > > > > -        * If there is no lru entry in this nlru, we can skip it
> > > > > immediately.
> > > > > +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > > > > +        * we can skip it immediately.
> > > > >            */
> > > > > -       if (!READ_ONCE(nlru->nr_items))
> > > > > +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > > > I think we also should insert a smp_rmb() between those two loads.
> > >
> > > Thinking about this some more, I believe that adding spin_is_locked() check
> > > will be enough for x86. However, that will likely not be enough for arches
> > > with a more relaxed memory semantics. So the safest way to avoid this
> > > possible race is to move the check to within the lock critical section,
> > > though that comes with a slightly higher overhead for the 0 nr_items case. I
> > > will send out a patch to correct that. Thanks for bring this possible race
> > > to my attention.
> >
> > Yes, I think it's not enough:
> 
> I think it may be enough if we insert a smp_rmb() between those two loads.
> 
> > CPU0                                       CPU1
> > READ_ONCE(&nlru->nr_items) -> 0
> >                                            spin_lock(&nlru->lock);
> >                                            nlru->nr_items++;
>                                              ^^^
>                                              |||
>                                              The nlr here is not the
> same as the one in CPU0,
>                                              since CPU0 have done the
> memcg reparting. Then
>                                              CPU0 will not miss nlru
> reparting.  If I am wrong, please
>                                              correct me.  Thanks.
> >                                            spin_unlock(&nlru->lock);
> > && !spin_is_locked(&nlru->lock) -> 0

Indeed, you're right.

Thanks!
Waiman Long March 29, 2022, 9:53 p.m. UTC | #13
On 3/28/22 21:15, Muchun Song wrote:
> On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
>>> On 3/22/22 22:12, Muchun Song wrote:
>>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
>>>>> On 3/22/22 21:06, Muchun Song wrote:
>>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
>>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
>>>>>>> to be race free"), we are tracking the total number of lru
>>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
>>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
>>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
>>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
>>>>>>> at this point.
>>>>>> Hi Waiman,
>>>>>>
>>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
>>>>>> list_lru_add()?  How about the following race?
>>>>>>
>>>>>> CPU0:                               CPU1:
>>>>>> list_lru_add()
>>>>>>        spin_lock(&nlru->lock)
>>>>>>        l = list_lru_from_kmem(memcg)
>>>>>>                                        memcg_reparent_objcgs(memcg)
>>>>>>                                        memcg_reparent_list_lrus(memcg)
>>>>>>                                            memcg_reparent_list_lru()
>>>>>>                                                memcg_reparent_list_lru_node()
>>>>>>                                                    if (!READ_ONCE(nlru->nr_items))
>>>>>>                                                        // Miss reparenting
>>>>>>                                                        return
>>>>>>        // Assume 0->1
>>>>>>        l->nr_items++
>>>>>>        // Assume 0->1
>>>>>>        nlru->nr_items++
>>>>>>
>>>>>> IIUC, we use nlru->lock to serialise this scenario.
>>>>> I guess this race is theoretically possible but very unlikely since it
>>>>> means a very long pause between list_lru_from_kmem() and the increment
>>>>> of nr_items.
>>>> It is more possible in a VM.
>>>>
>>>>> How about the following changes to make sure that this race can't happen?
>>>>>
>>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>>> index c669d87001a6..c31a0a8ad4e7 100644
>>>>> --- a/mm/list_lru.c
>>>>> +++ b/mm/list_lru.c
>>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
>>>>> list_lru *lru, int nid,
>>>>>            struct list_lru_one *src, *dst;
>>>>>
>>>>>            /*
>>>>> -        * If there is no lru entry in this nlru, we can skip it
>>>>> immediately.
>>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
>>>>> +        * we can skip it immediately.
>>>>>             */
>>>>> -       if (!READ_ONCE(nlru->nr_items))
>>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
>>>> I think we also should insert a smp_rmb() between those two loads.
>>> Thinking about this some more, I believe that adding spin_is_locked() check
>>> will be enough for x86. However, that will likely not be enough for arches
>>> with a more relaxed memory semantics. So the safest way to avoid this
>>> possible race is to move the check to within the lock critical section,
>>> though that comes with a slightly higher overhead for the 0 nr_items case. I
>>> will send out a patch to correct that. Thanks for bring this possible race
>>> to my attention.
>> Yes, I think it's not enough:
> I think it may be enough if we insert a smp_rmb() between those two loads.
>
>> CPU0                                       CPU1
>> READ_ONCE(&nlru->nr_items) -> 0
>>                                             spin_lock(&nlru->lock);
>>                                             nlru->nr_items++;
>                                               ^^^
>                                               |||
>                                               The nlr here is not the
> same as the one in CPU0,
>                                               since CPU0 have done the
> memcg reparting. Then
>                                               CPU0 will not miss nlru
> reparting.  If I am wrong, please
>                                               correct me.  Thanks.
>>                                             spin_unlock(&nlru->lock);
>> && !spin_is_locked(&nlru->lock) -> 0

I just realized that there is another lock/unlock pair in 
memcg_reparent_objcgs():

memcg_reparent_objcgs()
     spin_lock_irq()
     memcg reparenting
     spin_unlock_irq()
     percpu_ref_kill()
         spin_lock_irqsave()
         ...
         spin_unlock_irqrestore()

This lock/unlock pair in percpu_ref_kill() will stop the reordering of 
reads/writes before the memcg reparenting. Now I think just adding a 
spin_is_locked() check with smp_rmb() should be enough. However, I would 
like to change the ordering like this:

if (!spin_is_locked(&nlru->lock)) {
         smp_rmb();
         if (!READ_ONCE(nlru->nr_items))
                 return;
}

Otherwise, we will have the following problem:

list_lru_add()
       spin_lock(&nlru->lock)
       l = list_lru_from_kmem(memcg)
                                        READ_ONCE(nlru->nr_items);
       // Assume 0->1
       l->nr_items++
       // Assume 0->1
       nlru->nr_items++
       spin_unlock(&nlru->lock)
                                       spin_is_locked()

If the spin_is_locked() check happens before the spin_lock() in 
list_lru_add(), list_lru_from_kmem() is guaranteed to get the reparented 
memcg and so the entry won't be added to the lru being reparented.

Thanks,
Longman
Muchun Song March 30, 2022, 6:38 a.m. UTC | #14
On Wed, Mar 30, 2022 at 5:53 AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/28/22 21:15, Muchun Song wrote:
> > On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> >>> On 3/22/22 22:12, Muchun Song wrote:
> >>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> >>>>> On 3/22/22 21:06, Muchun Song wrote:
> >>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> >>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> >>>>>>> to be race free"), we are tracking the total number of lru
> >>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
> >>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> >>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
> >>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
> >>>>>>> at this point.
> >>>>>> Hi Waiman,
> >>>>>>
> >>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
> >>>>>> list_lru_add()?  How about the following race?
> >>>>>>
> >>>>>> CPU0:                               CPU1:
> >>>>>> list_lru_add()
> >>>>>>        spin_lock(&nlru->lock)
> >>>>>>        l = list_lru_from_kmem(memcg)
> >>>>>>                                        memcg_reparent_objcgs(memcg)
> >>>>>>                                        memcg_reparent_list_lrus(memcg)
> >>>>>>                                            memcg_reparent_list_lru()
> >>>>>>                                                memcg_reparent_list_lru_node()
> >>>>>>                                                    if (!READ_ONCE(nlru->nr_items))
> >>>>>>                                                        // Miss reparenting
> >>>>>>                                                        return
> >>>>>>        // Assume 0->1
> >>>>>>        l->nr_items++
> >>>>>>        // Assume 0->1
> >>>>>>        nlru->nr_items++
> >>>>>>
> >>>>>> IIUC, we use nlru->lock to serialise this scenario.
> >>>>> I guess this race is theoretically possible but very unlikely since it
> >>>>> means a very long pause between list_lru_from_kmem() and the increment
> >>>>> of nr_items.
> >>>> It is more possible in a VM.
> >>>>
> >>>>> How about the following changes to make sure that this race can't happen?
> >>>>>
> >>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
> >>>>> index c669d87001a6..c31a0a8ad4e7 100644
> >>>>> --- a/mm/list_lru.c
> >>>>> +++ b/mm/list_lru.c
> >>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> >>>>> list_lru *lru, int nid,
> >>>>>            struct list_lru_one *src, *dst;
> >>>>>
> >>>>>            /*
> >>>>> -        * If there is no lru entry in this nlru, we can skip it
> >>>>> immediately.
> >>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
> >>>>> +        * we can skip it immediately.
> >>>>>             */
> >>>>> -       if (!READ_ONCE(nlru->nr_items))
> >>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> >>>> I think we also should insert a smp_rmb() between those two loads.
> >>> Thinking about this some more, I believe that adding spin_is_locked() check
> >>> will be enough for x86. However, that will likely not be enough for arches
> >>> with a more relaxed memory semantics. So the safest way to avoid this
> >>> possible race is to move the check to within the lock critical section,
> >>> though that comes with a slightly higher overhead for the 0 nr_items case. I
> >>> will send out a patch to correct that. Thanks for bring this possible race
> >>> to my attention.
> >> Yes, I think it's not enough:
> > I think it may be enough if we insert a smp_rmb() between those two loads.
> >
> >> CPU0                                       CPU1
> >> READ_ONCE(&nlru->nr_items) -> 0
> >>                                             spin_lock(&nlru->lock);
> >>                                             nlru->nr_items++;
> >                                               ^^^
> >                                               |||
> >                                               The nlru here is not the
> >                                               same as the one in CPU0,
> >                                               since CPU0 has done the
> >                                               memcg reparenting.  Then
> >                                               CPU0 will not miss the nlru
> >                                               reparenting.  If I am wrong,
> >                                               please correct me.  Thanks.
> >>                                             spin_unlock(&nlru->lock);
> >> && !spin_is_locked(&nlru->lock) -> 0
>
> I just realized that there is another lock/unlock pair in
> memcg_reparent_objcgs():
>
> memcg_reparent_objcgs()
>      spin_lock_irq()
>      memcg reparenting
>      spin_unlock_irq()
>      percpu_ref_kill()
>          spin_lock_irqsave()
>          ...
>          spin_unlock_irqrestore()
>
> This lock/unlock pair in percpu_ref_kill() will stop the reads/writes that
> follow from being reordered before the memcg reparenting. Now I think just
> adding a spin_is_locked() check with smp_rmb() should be enough. However, I
> would like to change the ordering like this:
>
> if (!spin_is_locked(&nlru->lock)) {
>          smp_rmb();
>          if (!READ_ONCE(nlru->nr_items))
>                  return;
> }

Does the following race still exist?

 CPU0:                               CPU1:
                                        spin_is_locked(&nlru->lock)
 list_lru_add()
        spin_lock(&nlru->lock)
        l = list_lru_from_kmem(memcg)
                                        memcg_reparent_objcgs(memcg)
                                        memcg_reparent_list_lrus(memcg)
                                            memcg_reparent_list_lru()
                                                memcg_reparent_list_lru_node()
                                                    if (!READ_ONCE(nlru->nr_items))
                                                        // Miss reparenting
                                                        return
        // Assume 0->1
        l->nr_items++
        // Assume 0->1
        nlru->nr_items++

>
> Otherwise, we will have the problem
>
> list_lru_add()
>        spin_lock(&nlru->lock)
>        l = list_lru_from_kmem(memcg)
> READ_ONCE(nlru->nr_items);
>        // Assume 0->1
>        l->nr_items++
>        // Assume 0->1
>        nlru->nr_items++
>        spin_unlock(&nlru->lock)
>                                        spin_is_locked()

You are right.

>
> > If the spin_is_locked() check happens before the spin_lock() in
> > list_lru_add(), list_lru_from_kmem() is guaranteed to get the reparented
> > memcg and so the entry won't be added to the reparented lru.
>
Muchun Song March 30, 2022, 7:20 a.m. UTC | #15
On Wed, Mar 30, 2022 at 2:38 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Wed, Mar 30, 2022 at 5:53 AM Waiman Long <longman@redhat.com> wrote:
> >
> > On 3/28/22 21:15, Muchun Song wrote:
> > > On Tue, Mar 29, 2022 at 3:12 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >> On Sun, Mar 27, 2022 at 08:57:15PM -0400, Waiman Long wrote:
> > >>> On 3/22/22 22:12, Muchun Song wrote:
> > >>>> On Wed, Mar 23, 2022 at 9:55 AM Waiman Long <longman@redhat.com> wrote:
> > >>>>> On 3/22/22 21:06, Muchun Song wrote:
> > >>>>>> On Wed, Mar 9, 2022 at 10:40 PM Waiman Long <longman@redhat.com> wrote:
> > >>>>>>> Since commit 2c80cd57c743 ("mm/list_lru.c: fix list_lru_count_node()
> > >>>>>>> to be race free"), we are tracking the total number of lru
> > >>>>>>> entries in a list_lru_node in its nr_items field.  In the case of
> > >>>>>>> memcg_reparent_list_lru_node(), there is nothing to be done if nr_items
> > >>>>>>> is 0.  We don't even need to take the nlru->lock as no new lru entry
> > >>>>>>> could be added by a racing list_lru_add() to the draining src_idx memcg
> > >>>>>>> at this point.
> > >>>>>> Hi Waiman,
> > >>>>>>
> > >>>>>> Sorry for the late reply.  Quick question: what if there is an inflight
> > >>>>>> list_lru_add()?  How about the following race?
> > >>>>>>
> > >>>>>> CPU0:                               CPU1:
> > >>>>>> list_lru_add()
> > >>>>>>        spin_lock(&nlru->lock)
> > >>>>>>        l = list_lru_from_kmem(memcg)
> > >>>>>>                                        memcg_reparent_objcgs(memcg)
> > >>>>>>                                        memcg_reparent_list_lrus(memcg)
> > >>>>>>                                            memcg_reparent_list_lru()
> > >>>>>>                                                memcg_reparent_list_lru_node()
> > >>>>>>                                                    if (!READ_ONCE(nlru->nr_items))
> > >>>>>>                                                        // Miss reparenting
> > >>>>>>                                                        return
> > >>>>>>        // Assume 0->1
> > >>>>>>        l->nr_items++
> > >>>>>>        // Assume 0->1
> > >>>>>>        nlru->nr_items++
> > >>>>>>
> > >>>>>> IIUC, we use nlru->lock to serialise this scenario.
> > >>>>> I guess this race is theoretically possible but very unlikely since it
> > >>>>> means a very long pause between list_lru_from_kmem() and the increment
> > >>>>> of nr_items.
> > >>>> It is more likely to happen in a VM.
> > >>>>
> > >>>>> How about the following changes to make sure that this race can't happen?
> > >>>>>
> > >>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
> > >>>>> index c669d87001a6..c31a0a8ad4e7 100644
> > >>>>> --- a/mm/list_lru.c
> > >>>>> +++ b/mm/list_lru.c
> > >>>>> @@ -395,9 +395,10 @@ static void memcg_reparent_list_lru_node(struct
> > >>>>> list_lru *lru, int nid,
> > >>>>>            struct list_lru_one *src, *dst;
> > >>>>>
> > >>>>>            /*
> > >>>>> -        * If there is no lru entry in this nlru, we can skip it
> > >>>>> immediately.
> > >>>>> +        * If there is no lru entry in this nlru and the nlru->lock is free,
> > >>>>> +        * we can skip it immediately.
> > >>>>>             */
> > >>>>> -       if (!READ_ONCE(nlru->nr_items))
> > >>>>> +       if (!READ_ONCE(nlru->nr_items) && !spin_is_locked(&nlru->lock))
> > >>>> I think we also should insert a smp_rmb() between those two loads.
> > >>> Thinking about this some more, I believe that adding a spin_is_locked()
> > >>> check will be enough for x86. However, that will likely not be enough for
> > >>> arches with more relaxed memory semantics. So the safest way to avoid this
> > >>> possible race is to move the check to within the lock critical section,
> > >>> though that comes with a slightly higher overhead for the 0 nr_items case.
> > >>> I will send out a patch to correct that. Thanks for bringing this possible
> > >>> race to my attention.
> > >> Yes, I think it's not enough:
> > > I think it may be enough if we insert a smp_rmb() between those two loads.
> > >
> > >> CPU0                                       CPU1
> > >> READ_ONCE(&nlru->nr_items) -> 0
> > >>                                             spin_lock(&nlru->lock);
> > >>                                             nlru->nr_items++;
> > >                                               ^^^
> > >                                               |||
> > >                                               The nlru here is not the
> > >                                               same as the one in CPU0,
> > >                                               since CPU0 has done the
> > >                                               memcg reparenting.  Then
> > >                                               CPU0 will not miss the nlru
> > >                                               reparenting.  If I am wrong,
> > >                                               please correct me.  Thanks.
> > >>                                             spin_unlock(&nlru->lock);
> > >> && !spin_is_locked(&nlru->lock) -> 0
> >
> > I just realized that there is another lock/unlock pair in
> > memcg_reparent_objcgs():
> >
> > memcg_reparent_objcgs()
> >      spin_lock_irq()
> >      memcg reparenting
> >      spin_unlock_irq()
> >      percpu_ref_kill()
> >          spin_lock_irqsave()
> >          ...
> >          spin_unlock_irqrestore()
> >
> > This lock/unlock pair in percpu_ref_kill() will stop the reads/writes that
> > follow from being reordered before the memcg reparenting. Now I think just
> > adding a spin_is_locked() check with smp_rmb() should be enough. However, I
> > would like to change the ordering like this:
> >
> > if (!spin_is_locked(&nlru->lock)) {
> >          smp_rmb();
> >          if (!READ_ONCE(nlru->nr_items))
> >                  return;
> > }
>
> Does the following race still exist?

Ignore this. My bad. I think your approach could work.

>
>  CPU0:                               CPU1:
>                                         spin_is_locked(&nlru->lock)
>  list_lru_add()
>         spin_lock(&nlru->lock)
>         l = list_lru_from_kmem(memcg)
>                                         memcg_reparent_objcgs(memcg)
>                                         memcg_reparent_list_lrus(memcg)
>                                             memcg_reparent_list_lru()
>                                                 memcg_reparent_list_lru_node()
>                                                     if (!READ_ONCE(nlru->nr_items))
>                                                         // Miss reparenting
>                                                         return
>         // Assume 0->1
>         l->nr_items++
>         // Assume 0->1
>         nlru->nr_items++
>
> >
> > Otherwise, we will have the problem
> >
> > list_lru_add()
> >        spin_lock(&nlru->lock)
> >        l = list_lru_from_kmem(memcg)
> > READ_ONCE(nlru->nr_items);
> >        // Assume 0->1
> >        l->nr_items++
> >        // Assume 0->1
> >        nlru->nr_items++
> >        spin_unlock(&nlru->lock)
> >                                        spin_is_locked()
>
> You are right.
>
> >
> > If the spin_is_locked() check happens before the spin_lock() in
> > list_lru_add(), list_lru_from_kmem() is guaranteed to get the reparented
> > memcg and so the entry won't be added to the reparented lru.
> >
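
The other option raised in the thread, moving the nr_items check inside the
lock critical section, could look roughly like the sketch below. This is a
hypothetical illustration rather than a posted patch: the parameter list is
abbreviated from the hunk that follows, and spin_lock_irq() simply mirrors the
IRQ-safe comment shown in that hunk. It gives up the lock-free skip of the
patch below in exchange for not having to reason about load ordering, at the
cost of always taking the lock even when nr_items is 0.

static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
					 int src_idx, struct mem_cgroup *dst_memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];

	/*
	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
	 * we have to use IRQ-safe primitives here to avoid deadlock.
	 */
	spin_lock_irq(&nlru->lock);

	/*
	 * With nlru->lock held there is no ordering to reason about: a
	 * racing list_lru_add() has either already incremented nr_items
	 * (and its entry gets reparented below), or it will run after us
	 * and resolve to the already-reparented memcg, so skipping on
	 * nr_items == 0 is safe.
	 */
	if (nlru->nr_items) {
		/* ... splice the src_idx lists onto dst_idx as before ... */
	}

	spin_unlock_irq(&nlru->lock);
}
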
diff mbox series

Patch

diff --git a/mm/list_lru.c b/mm/list_lru.c
index ba76428ceece..c669d87001a6 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -394,6 +394,12 @@  static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
 	int dst_idx = dst_memcg->kmemcg_id;
 	struct list_lru_one *src, *dst;
 
+	/*
+	 * If there is no lru entry in this nlru, we can skip it immediately.
+	 */
+	if (!READ_ONCE(nlru->nr_items))
+		return;
+
 	/*
 	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
 	 * we have to use IRQ-safe primitives here to avoid deadlock.