diff mbox series

[PATCHv2,1/1] mm: fix unproperly folio_put by changing API in read_pages

Message ID 20240401081734.1433755-1-zhaoyang.huang@unisoc.com (mailing list archive)
State New
Headers show
Series [PATCHv2,1/1] mm: fix unproperly folio_put by changing API in read_pages | expand

Commit Message

zhaoyang.huang April 1, 2024, 8:17 a.m. UTC
From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
unproperly during the procedure of read_pages()->readahead_folio->folio_put.
This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
->readpages doesn't process all pages")'.

key steps of[1] in brief:
2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
7'. Last refcnt remained which is not as expect as from alloc_pages
    but from thread_truncate's local fbatch in step 7
8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
    the value but meaning) in step 8
9'. Thread_truncate hit the VM_BUG_ON in step 9

[1]
Thread_readahead:
0. folio = filemap_alloc_folio(gfp_mask, 0);
       (refcount 1: alloc_pages)
1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
       (refcount 2: alloc_pages, page_cache)

Thread_truncate:
2. folio = find_get_entries(&fbatch_truncate);
       (refcount 3: alloc_pages, page_cache, fbatch_truncate))

Thread_readahead:
3. Then we call read_pages()
       First we call ->readahead() which for some reason stops early.
4. Then we call readahead_folio() which calls folio_put()
       (refcount 2: page_cache, fbatch_truncate)
5. Then we call folio_get()
       (refcount 3: page_cache, fbatch_truncate, read_pages_temp)
6. Then we call filemap_remove_folio()
       (refcount 2: fbatch_truncate, read_pages_temp)
7. Then we call folio_unlock() and folio_put()
       (refcount 1: fbatch_truncate)

Thread_reclaim:
8. collect the page from LRU and call shrink_inactive_list->isolate_lru_folios
        shrink_inactive_list
        {
            isolate_lru_folios
            {
               if (!folio_test_lru(folio)) //false
                   bail out;
               if (!folio_try_get(folio)) //false
                   bail out;
            }
         }
         (refcount 2: fbatch_truncate, reclaim_isolate)

9. call shrink_folio_list->__remove_mapping
        shrink_folio_list()
        {
            folio_try_lock(folio);
            __remove_mapping()
            {
                if (!folio_ref_freeze(2)) //false
                    bail out;
            }
            folio_unlock(folio)
            list_add(folio, free_folios);
        }
        (folio has refcount 0)

Thread_truncate:
10. Thread_truncate will hit the refcnt VM_BUG_ON(refcnt == 0) in
release_pages->folio_put_testzero
        truncate_inode_pages_range
        {
            folio = find_get_entries(&fbatch_truncate);
            truncate_inode_pages(&fbatch_truncate);
            folio_fbatch_release(&fbatch_truncate);
            {
                folio_put_testzero(folio); //VM_BUG_ON here
            }
        }

fix: commit 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't process all pages")'

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
patch v2: update commit message
---
---
 mm/readahead.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Matthew Wilcox (Oracle) April 2, 2024, 12:34 a.m. UTC | #1
On Mon, Apr 01, 2024 at 04:17:34PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> ->readpages doesn't process all pages")'.

This patch is no less bullshit than the last time you posted it.
I explained why here:
https://lore.kernel.org/linux-mm/ZgQRtQ60mrvOUKXo@casper.infradead.org/
Zhaoyang Huang April 2, 2024, 6:33 a.m. UTC | #2
On Tue, Apr 2, 2024 at 8:34 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 01, 2024 at 04:17:34PM +0800, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> > unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> > This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> > ->readpages doesn't process all pages")'.
>
> This patch is no less bullshit than the last time you posted it.
> I explained why here:
> https://lore.kernel.org/linux-mm/ZgQRtQ60mrvOUKXo@casper.infradead.org/
Yes. I get your point in your previous feedback. Could you please
check the timing sequence in v2's commit message, where the folio_lock
failed to prevent the race from happening. I also would like to insist
that the refcnt got from alloc_pages just paired with the one which is
checked in __remove_mapping but any other folio_put. The sequence of
read_pages->readahead_folio->folio_put happens to be right when no
race and then frees the folio immediatly.
David Hildenbrand April 2, 2024, 12:58 p.m. UTC | #3
On 01.04.24 10:17, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> ->readpages doesn't process all pages")'.
> 
> key steps of[1] in brief:
> 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
> 7'. Last refcnt remained which is not as expect as from alloc_pages
>      but from thread_truncate's local fbatch in step 7
> 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
>      the value but meaning) in step 8
> 9'. Thread_truncate hit the VM_BUG_ON in step 9
> 
> [1]
> Thread_readahead:
> 0. folio = filemap_alloc_folio(gfp_mask, 0);
>         (refcount 1: alloc_pages)
> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
>         (refcount 2: alloc_pages, page_cache)
> 
> Thread_truncate:
> 2. folio = find_get_entries(&fbatch_truncate);
>         (refcount 3: alloc_pages, page_cache, fbatch_truncate))
> 
> Thread_readahead:
> 3. Then we call read_pages()
>         First we call ->readahead() which for some reason stops early.
> 4. Then we call readahead_folio() which calls folio_put()
>         (refcount 2: page_cache, fbatch_truncate)
> 5. Then we call folio_get()
>         (refcount 3: page_cache, fbatch_truncate, read_pages_temp)
> 6. Then we call filemap_remove_folio()
>         (refcount 2: fbatch_truncate, read_pages_temp)
> 7. Then we call folio_unlock() and folio_put()
>         (refcount 1: fbatch_truncate)
> 
> Thread_reclaim:
> 8. collect the page from LRU and call shrink_inactive_list->isolate_lru_folios
>          shrink_inactive_list
>          {
>              isolate_lru_folios
>              {
>                 if (!folio_test_lru(folio)) //false
>                     bail out;
>                 if (!folio_try_get(folio)) //false
>                     bail out;
>              }
>           }
>           (refcount 2: fbatch_truncate, reclaim_isolate)
> 
> 9. call shrink_folio_list->__remove_mapping
>          shrink_folio_list()
>          {
>              folio_try_lock(folio);
>              __remove_mapping()
>              {
>                  if (!folio_ref_freeze(2)) //false
>                      bail out;
>              }
>              folio_unlock(folio)
>              list_add(folio, free_folios);
>          }
>          (folio has refcount 0)
> 
> Thread_truncate:
> 10. Thread_truncate will hit the refcnt VM_BUG_ON(refcnt == 0) in
> release_pages->folio_put_testzero
>          truncate_inode_pages_range
>          {
>              folio = find_get_entries(&fbatch_truncate);
>              truncate_inode_pages(&fbatch_truncate);
>              folio_fbatch_release(&fbatch_truncate);
>              {
>                  folio_put_testzero(folio); //VM_BUG_ON here
>              }
>          }
> 
> fix: commit 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't process all pages")'

Something that would help here is an actual reproducer that triggersthis 
issue.

To me, it's unclear at this point if we are talking about an actual 
issue or a theoretical issue?
Zhaoyang Huang April 3, 2024, 5:50 a.m. UTC | #4
On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 01.04.24 10:17, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> > unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> > This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> > ->readpages doesn't process all pages")'.
> >
> > key steps of[1] in brief:
> > 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
> > 7'. Last refcnt remained which is not as expect as from alloc_pages
> >      but from thread_truncate's local fbatch in step 7
> > 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
> >      the value but meaning) in step 8
> > 9'. Thread_truncate hit the VM_BUG_ON in step 9
> >
> > [1]
> > Thread_readahead:
> > 0. folio = filemap_alloc_folio(gfp_mask, 0);
> >         (refcount 1: alloc_pages)
> > 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> >         (refcount 2: alloc_pages, page_cache)
> >
> > Thread_truncate:
> > 2. folio = find_get_entries(&fbatch_truncate);
> >         (refcount 3: alloc_pages, page_cache, fbatch_truncate))
> >
> > Thread_readahead:
> > 3. Then we call read_pages()
> >         First we call ->readahead() which for some reason stops early.
> > 4. Then we call readahead_folio() which calls folio_put()
> >         (refcount 2: page_cache, fbatch_truncate)
> > 5. Then we call folio_get()
> >         (refcount 3: page_cache, fbatch_truncate, read_pages_temp)
> > 6. Then we call filemap_remove_folio()
> >         (refcount 2: fbatch_truncate, read_pages_temp)
> > 7. Then we call folio_unlock() and folio_put()
> >         (refcount 1: fbatch_truncate)
> >
> > Thread_reclaim:
> > 8. collect the page from LRU and call shrink_inactive_list->isolate_lru_folios
> >          shrink_inactive_list
> >          {
> >              isolate_lru_folios
> >              {
> >                 if (!folio_test_lru(folio)) //false
> >                     bail out;
> >                 if (!folio_try_get(folio)) //false
> >                     bail out;
> >              }
> >           }
> >           (refcount 2: fbatch_truncate, reclaim_isolate)
> >
> > 9. call shrink_folio_list->__remove_mapping
> >          shrink_folio_list()
> >          {
> >              folio_try_lock(folio);
> >              __remove_mapping()
> >              {
> >                  if (!folio_ref_freeze(2)) //false
> >                      bail out;
> >              }
> >              folio_unlock(folio)
> >              list_add(folio, free_folios);
> >          }
> >          (folio has refcount 0)
> >
> > Thread_truncate:
> > 10. Thread_truncate will hit the refcnt VM_BUG_ON(refcnt == 0) in
> > release_pages->folio_put_testzero
> >          truncate_inode_pages_range
> >          {
> >              folio = find_get_entries(&fbatch_truncate);
> >              truncate_inode_pages(&fbatch_truncate);
> >              folio_fbatch_release(&fbatch_truncate);
> >              {
> >                  folio_put_testzero(folio); //VM_BUG_ON here
> >              }
> >          }
> >
> > fix: commit 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't process all pages")'
>
> Something that would help here is an actual reproducer that triggersthis
> issue.
>
> To me, it's unclear at this point if we are talking about an actual
> issue or a theoretical issue?
Thanks for feedback. Above callstack is a theoretical issue so far
which is arised from an ongoing analysis of a practical livelock issue
generated by folio_try_get_rcu which is related to abnormal folio
refcnt state. So do you think this callstack makes sense?

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand April 3, 2024, 8:01 a.m. UTC | #5
On 03.04.24 07:50, Zhaoyang Huang wrote:
> On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 01.04.24 10:17, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>
>>> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
>>> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
>>> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
>>> ->readpages doesn't process all pages")'.
>>>
>>> key steps of[1] in brief:
>>> 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
>>> 7'. Last refcnt remained which is not as expect as from alloc_pages
>>>       but from thread_truncate's local fbatch in step 7
>>> 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
>>>       the value but meaning) in step 8
>>> 9'. Thread_truncate hit the VM_BUG_ON in step 9
>>>
>>> [1]
>>> Thread_readahead:
>>> 0. folio = filemap_alloc_folio(gfp_mask, 0);
>>>          (refcount 1: alloc_pages)
>>> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
>>>          (refcount 2: alloc_pages, page_cache)

[not going into all details, just a high-level remark]

page_cache_ra_unbounded() does a filemap_invalidate_lock_shared(), which 
is a down_read_trylock(&mapping->invalidate_lock).

That is, all read_pages() calls in mm/readahead.c happen under 
mapping->invalidate_lock in read mode.

... and ...

>>>
>>> Thread_truncate:
>>> 2. folio = find_get_entries(&fbatch_truncate);
>>>          (refcount 3: alloc_pages, page_cache, fbatch_truncate))

truncation, such as truncate_inode_pages() must be called under 
mapping->invalidate_lock held in write mode. So naive me would have 
thought that readahead and truncate cannot race in that way.

[...]


>>
>> Something that would help here is an actual reproducer that triggersthis
>> issue.
>>
>> To me, it's unclear at this point if we are talking about an actual
>> issue or a theoretical issue?
> Thanks for feedback. Above callstack is a theoretical issue so far
> which is arised from an ongoing analysis of a practical livelock issue
> generated by folio_try_get_rcu which is related to abnormal folio
> refcnt state. So do you think this callstack makes sense?

I'm not an expert on that code, and only spent 5 min looking into the 
code. So my reasoning about invalidate_lock above might be completely wrong.

It would be a very rare race that was not reported so far in practice. 
And it certainly wouldn't be the easiest one to explain, because the 
call chain above is a bit elaborate and does not explain which locks are 
involved and how they fail to protect us from any such race.

For this case in particular, I think we really need a real reproducer to 
convince people that the actual issue does exist and the fix actually 
resolves the issue.
Zhaoyang Huang April 3, 2024, 11:08 a.m. UTC | #6
On Wed, Apr 3, 2024 at 4:01 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.24 07:50, Zhaoyang Huang wrote:
> > On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 01.04.24 10:17, zhaoyang.huang wrote:
> >>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>
> >>> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> >>> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> >>> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> >>> ->readpages doesn't process all pages")'.
> >>>
> >>> key steps of[1] in brief:
> >>> 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
> >>> 7'. Last refcnt remained which is not as expect as from alloc_pages
> >>>       but from thread_truncate's local fbatch in step 7
> >>> 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
> >>>       the value but meaning) in step 8
> >>> 9'. Thread_truncate hit the VM_BUG_ON in step 9
> >>>
> >>> [1]
> >>> Thread_readahead:
> >>> 0. folio = filemap_alloc_folio(gfp_mask, 0);
> >>>          (refcount 1: alloc_pages)
> >>> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> >>>          (refcount 2: alloc_pages, page_cache)
>
> [not going into all details, just a high-level remark]
>
> page_cache_ra_unbounded() does a filemap_invalidate_lock_shared(), which
> is a down_read_trylock(&mapping->invalidate_lock).
>
> That is, all read_pages() calls in mm/readahead.c happen under
> mapping->invalidate_lock in read mode.
>
> ... and ...
>
> >>>
> >>> Thread_truncate:
> >>> 2. folio = find_get_entries(&fbatch_truncate);
> >>>          (refcount 3: alloc_pages, page_cache, fbatch_truncate))
>
> truncation, such as truncate_inode_pages() must be called under
> mapping->invalidate_lock held in write mode. So naive me would have
> thought that readahead and truncate cannot race in that way.
>
> [...]
>
Thanks for the reminder. But I don't find the spot where holding
"mapping->invalidate_lock" by check the callstack of
'kill_bdev()->truncate_inode_pages()->truncate_inode_pages_range()',
or the lock is holded beyond this?
>
> >>
> >> Something that would help here is an actual reproducer that triggersthis
> >> issue.
> >>
> >> To me, it's unclear at this point if we are talking about an actual
> >> issue or a theoretical issue?
> > Thanks for feedback. Above callstack is a theoretical issue so far
> > which is arised from an ongoing analysis of a practical livelock issue
> > generated by folio_try_get_rcu which is related to abnormal folio
> > refcnt state. So do you think this callstack makes sense?
>
> I'm not an expert on that code, and only spent 5 min looking into the
> code. So my reasoning about invalidate_lock above might be completely wrong.
>
> It would be a very rare race that was not reported so far in practice.
> And it certainly wouldn't be the easiest one to explain, because the
> call chain above is a bit elaborate and does not explain which locks are
> involved and how they fail to protect us from any such race.
>
> For this case in particular, I think we really need a real reproducer to
> convince people that the actual issue does exist and the fix actually
> resolves the issue.
Sorry, it is theoretically yet according to my understanding.
>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand April 3, 2024, 11:47 a.m. UTC | #7
On 03.04.24 13:08, Zhaoyang Huang wrote:
> On Wed, Apr 3, 2024 at 4:01 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.04.24 07:50, Zhaoyang Huang wrote:
>>> On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 01.04.24 10:17, zhaoyang.huang wrote:
>>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>>
>>>>> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
>>>>> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
>>>>> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
>>>>> ->readpages doesn't process all pages")'.
>>>>>
>>>>> key steps of[1] in brief:
>>>>> 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
>>>>> 7'. Last refcnt remained which is not as expect as from alloc_pages
>>>>>        but from thread_truncate's local fbatch in step 7
>>>>> 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
>>>>>        the value but meaning) in step 8
>>>>> 9'. Thread_truncate hit the VM_BUG_ON in step 9
>>>>>
>>>>> [1]
>>>>> Thread_readahead:
>>>>> 0. folio = filemap_alloc_folio(gfp_mask, 0);
>>>>>           (refcount 1: alloc_pages)
>>>>> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
>>>>>           (refcount 2: alloc_pages, page_cache)
>>
>> [not going into all details, just a high-level remark]
>>
>> page_cache_ra_unbounded() does a filemap_invalidate_lock_shared(), which
>> is a down_read_trylock(&mapping->invalidate_lock).
>>
>> That is, all read_pages() calls in mm/readahead.c happen under
>> mapping->invalidate_lock in read mode.
>>
>> ... and ...
>>
>>>>>
>>>>> Thread_truncate:
>>>>> 2. folio = find_get_entries(&fbatch_truncate);
>>>>>           (refcount 3: alloc_pages, page_cache, fbatch_truncate))
>>
>> truncation, such as truncate_inode_pages() must be called under
>> mapping->invalidate_lock held in write mode. So naive me would have
>> thought that readahead and truncate cannot race in that way.
>>
>> [...]
>>
> Thanks for the reminder. But I don't find the spot where holding
> "mapping->invalidate_lock" by check the callstack of
> 'kill_bdev()->truncate_inode_pages()->truncate_inode_pages_range()',
> or the lock is holded beyond this?

Well, truncate_inode_pages() documents:

"Called under (and serialised by) inode->i_rwsem and 
mapping->invalidate_lock."

If that's not the case than that's either (a) a BUG or (b) an 
undocumented exception to the rule, whereby other mechanisms are in 
place to prevent any further pagecache magic.

I mean, kill_bdev() documents " Kill _all_ buffers and pagecache , dirty 
or not..", so *something* has to be in place to guarantee that there 
cannot be something concurrently filling the pagecache again, otherwise 
kill_bdev() could not possibly do something reasonable.

For example, blkdev_flush_mapping() is called when bd_openers goes to 0, 
and my best guess is that nobody should be able to make use of that 
device at that point.

Similarly, changing the blocksize sounds like something that wouldn't be 
done at arbitrary points in time ...

So kill_bdev() already has a "special" smell to it and I suspect (b) 
applies, where concurrent pagecache action is not really any concern.

But I'm not an expert and I looked at most of that code right now for 
the first time ...

>>
>>>>
>>>> Something that would help here is an actual reproducer that triggersthis
>>>> issue.
>>>>
>>>> To me, it's unclear at this point if we are talking about an actual
>>>> issue or a theoretical issue?
>>> Thanks for feedback. Above callstack is a theoretical issue so far
>>> which is arised from an ongoing analysis of a practical livelock issue
>>> generated by folio_try_get_rcu which is related to abnormal folio
>>> refcnt state. So do you think this callstack makes sense?
>>
>> I'm not an expert on that code, and only spent 5 min looking into the
>> code. So my reasoning about invalidate_lock above might be completely wrong.
>>
>> It would be a very rare race that was not reported so far in practice.
>> And it certainly wouldn't be the easiest one to explain, because the
>> call chain above is a bit elaborate and does not explain which locks are
>> involved and how they fail to protect us from any such race.
>>
>> For this case in particular, I think we really need a real reproducer to
>> convince people that the actual issue does exist and the fix actually
>> resolves the issue.
> Sorry, it is theoretically yet according to my understanding.

Okay, if you find a reproducer, please share it and we can investigate 
if it's a locking problem or something else. As of now, I'm not 
convinced that there is an actual issue that needs fixing.
Zhaoyang Huang April 4, 2024, 9:15 a.m. UTC | #8
On Wed, Apr 3, 2024 at 7:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.04.24 13:08, Zhaoyang Huang wrote:
> > On Wed, Apr 3, 2024 at 4:01 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 03.04.24 07:50, Zhaoyang Huang wrote:
> >>> On Tue, Apr 2, 2024 at 8:58 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 01.04.24 10:17, zhaoyang.huang wrote:
> >>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>>>
> >>>>> An VM_BUG_ON in step 9 of [1] could happen as the refcnt is dropped
> >>>>> unproperly during the procedure of read_pages()->readahead_folio->folio_put.
> >>>>> This is introduced by commit 9fd472af84ab ("mm: improve cleanup when
> >>>>> ->readpages doesn't process all pages")'.
> >>>>>
> >>>>> key steps of[1] in brief:
> >>>>> 2'. Thread_truncate get folio to its local fbatch by find_get_entry in step 2
> >>>>> 7'. Last refcnt remained which is not as expect as from alloc_pages
> >>>>>        but from thread_truncate's local fbatch in step 7
> >>>>> 8'. Thread_reclaim succeed to isolate the folio by the wrong refcnt(not
> >>>>>        the value but meaning) in step 8
> >>>>> 9'. Thread_truncate hit the VM_BUG_ON in step 9
> >>>>>
> >>>>> [1]
> >>>>> Thread_readahead:
> >>>>> 0. folio = filemap_alloc_folio(gfp_mask, 0);
> >>>>>           (refcount 1: alloc_pages)
> >>>>> 1. ret = filemap_add_folio(mapping, folio, index + i, gfp_mask);
> >>>>>           (refcount 2: alloc_pages, page_cache)
> >>
> >> [not going into all details, just a high-level remark]
> >>
> >> page_cache_ra_unbounded() does a filemap_invalidate_lock_shared(), which
> >> is a down_read_trylock(&mapping->invalidate_lock).
> >>
> >> That is, all read_pages() calls in mm/readahead.c happen under
> >> mapping->invalidate_lock in read mode.
> >>
> >> ... and ...
> >>
> >>>>>
> >>>>> Thread_truncate:
> >>>>> 2. folio = find_get_entries(&fbatch_truncate);
> >>>>>           (refcount 3: alloc_pages, page_cache, fbatch_truncate))
> >>
> >> truncation, such as truncate_inode_pages() must be called under
> >> mapping->invalidate_lock held in write mode. So naive me would have
> >> thought that readahead and truncate cannot race in that way.
> >>
> >> [...]
> >>
> > Thanks for the reminder. But I don't find the spot where holding
> > "mapping->invalidate_lock" by check the callstack of
> > 'kill_bdev()->truncate_inode_pages()->truncate_inode_pages_range()',
> > or the lock is holded beyond this?
>
> Well, truncate_inode_pages() documents:
>
> "Called under (and serialised by) inode->i_rwsem and
> mapping->invalidate_lock."
>
> If that's not the case than that's either (a) a BUG or (b) an
> undocumented exception to the rule, whereby other mechanisms are in
> place to prevent any further pagecache magic.
>
> I mean, kill_bdev() documents " Kill _all_ buffers and pagecache , dirty
> or not..", so *something* has to be in place to guarantee that there
> cannot be something concurrently filling the pagecache again, otherwise
> kill_bdev() could not possibly do something reasonable.
>
> For example, blkdev_flush_mapping() is called when bd_openers goes to 0,
> and my best guess is that nobody should be able to make use of that
> device at that point.
>
> Similarly, changing the blocksize sounds like something that wouldn't be
> done at arbitrary points in time ...
>
> So kill_bdev() already has a "special" smell to it and I suspect (b)
> applies, where concurrent pagecache action is not really any concern.
>
> But I'm not an expert and I looked at most of that code right now for
> the first time ...
Thanks for your help. I don't know if it is related to an out of date
documentation issue. Regardless of truncation path, there could be an
isolation path for entering this race where the refcnt of local folio
batch replaced the one of alloc_pages and then makes the page out of
the protection by folio_lock and could race between reclaiming and
following actions of isolation. It could be any kind of phenomenon
other than VM_BUG_ON.
>
> >>
> >>>>
> >>>> Something that would help here is an actual reproducer that triggersthis
> >>>> issue.
> >>>>
> >>>> To me, it's unclear at this point if we are talking about an actual
> >>>> issue or a theoretical issue?
> >>> Thanks for feedback. Above callstack is a theoretical issue so far
> >>> which is arised from an ongoing analysis of a practical livelock issue
> >>> generated by folio_try_get_rcu which is related to abnormal folio
> >>> refcnt state. So do you think this callstack makes sense?
> >>
> >> I'm not an expert on that code, and only spent 5 min looking into the
> >> code. So my reasoning about invalidate_lock above might be completely wrong.
> >>
> >> It would be a very rare race that was not reported so far in practice.
> >> And it certainly wouldn't be the easiest one to explain, because the
> >> call chain above is a bit elaborate and does not explain which locks are
> >> involved and how they fail to protect us from any such race.
> >>
> >> For this case in particular, I think we really need a real reproducer to
> >> convince people that the actual issue does exist and the fix actually
> >> resolves the issue.
> > Sorry, it is theoretically yet according to my understanding.
>
> Okay, if you find a reproducer, please share it and we can investigate
> if it's a locking problem or something else. As of now, I'm not
> convinced that there is an actual issue that needs fixing.
I am hoping anybody could help to confirm if this is indeed a BUG
according to the procedures in the commit message.
>
> --
> Cheers,
>
> David / dhildenb
>
diff mbox series

Patch

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..657736200a92 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -163,7 +163,7 @@  static void read_pages(struct readahead_control *rac)
 		 * may be used to size the next readahead, so make sure
 		 * they accurately reflect what happened.
 		 */
-		while ((folio = readahead_folio(rac)) != NULL) {
+		while ((folio = __readahead_folio(rac)) != NULL) {
 			unsigned long nr = folio_nr_pages(folio);
 
 			folio_get(folio);