
vfs: fix race between evict_inodes() and find_inode()&iput()

Message ID 20240823130730.658881-1-sunjunchao2870@gmail.com (mailing list archive)
State New, archived
Series vfs: fix race between evict_inodes() and find_inode()&iput()

Commit Message

Julian Sun Aug. 23, 2024, 1:07 p.m. UTC
Hi, all

Recently I noticed a bug[1] in btrfs; after digging into it,
I believe it's a race in the VFS.

Let's assume there's an inode (e.g. ino 261) with i_count 1 on
which iput() is called, and there's a concurrent thread calling
generic_shutdown_super().

cpu0:                              cpu1:
iput() // i_count is 1
  ->spin_lock(inode)
  ->dec i_count to 0
  ->iput_final()                    generic_shutdown_super()
    ->__inode_add_lru()               ->evict_inodes()
      // for some reason[2]             ->if (atomic_read(inode->i_count)) continue;
      // return before                  // inode 261 passed the above check
      // list_lru_add_obj()             // and then schedule out
   ->spin_unlock()
// note here: inode 261 is still on the
// sb list and the hash list, and
// I_FREEING|I_WILL_FREE has not been set

btrfs_iget()
  // after some function calls
  ->find_inode()
    // found the above inode 261
    ->spin_lock(inode)
   // check I_FREEING|I_WILL_FREE
   // and passed
      ->__iget()
    ->spin_unlock(inode)                // schedule back
                                        ->spin_lock(inode)
                                        // check (I_NEW|I_FREEING|I_WILL_FREE) flags,
                                        // passed and set I_FREEING
iput()                                  ->spin_unlock(inode)
  ->spin_lock(inode)			  ->evict()
  // dec i_count to 0
  ->iput_final()
    ->spin_unlock()
    ->evict()
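
For reference, the lookup side only checks the state flags under
i_lock before grabbing a reference; a simplified sketch of
find_inode() (fs/inode.c; exact details vary by kernel version):

static struct inode *find_inode(struct super_block *sb,
				struct hlist_head *head,
				int (*test)(struct inode *, void *),
				void *data)
{
	struct inode *inode;

repeat:
	hlist_for_each_entry(inode, head, i_hash) {
		if (inode->i_sb != sb)
			continue;
		if (!test(inode, data))
			continue;
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			__wait_on_freeing_inode(inode);
			goto repeat;
		}
		/* In the race above neither flag is set yet, so the
		 * lookup takes a new reference on a dying inode. */
		__iget(inode);
		spin_unlock(&inode->i_lock);
		return inode;
	}
	return NULL;
}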

Now two threads are simultaneously evicting the same inode,
which may trigger the BUG(inode->i_state & I_CLEAR) assertion
both within clear_inode() and iput().
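
The assertions that fire look roughly like this (paraphrased
excerpts from fs/inode.c, not verbatim):

void clear_inode(struct inode *inode)
{
	/* ... */
	/* Eviction must happen exactly once; a second evict()
	 * of the same inode trips these. */
	BUG_ON(!(inode->i_state & I_FREEING));
	BUG_ON(inode->i_state & I_CLEAR);
	inode->i_state = I_FREEING | I_CLEAR;
}

void iput(struct inode *inode)
{
	if (!inode)
		return;
	/* An inode that has already been cleared must never
	 * be put again. */
	BUG_ON(inode->i_state & I_CLEAR);
	/* ... */
}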

To fix the bug, recheck inode->i_count after i_lock is held. The
earlier lockless check is kept because it is valid in most
scenarios and avoids taking spin_lock() for inodes that are
clearly still in use.

If there is any misunderstanding, please let me know, thanks.

[1]: https://lore.kernel.org/linux-btrfs/000000000000eabe1d0619c48986@google.com/
[2]: When I reproduced the bug, the reason was either that 1. SB_ACTIVE
had been removed or 2. mapping_shrinkable() returned false.
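
For context, the early returns in __inode_add_lru() that can leave
the inode off the LRU look roughly like this (simplified sketch;
the exact conditions vary by kernel version):

static void __inode_add_lru(struct inode *inode, bool rotate)
{
	if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE))
		return;
	if (atomic_read(&inode->i_count))
		return;
	if (!(inode->i_sb->s_flags & SB_ACTIVE))	/* reason 1 above */
		return;
	if (!mapping_shrinkable(&inode->i_data))	/* reason 2 above */
		return;

	if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
		this_cpu_inc(nr_unused);
	else if (rotate)
		inode->i_state |= I_REFERENCED;
}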

Reported-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
CC: stable@vger.kernel.org
Fixes: 63997e98a3be ("split invalidate_inodes()")
Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
---
 fs/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Mateusz Guzik Aug. 24, 2024, 4:54 a.m. UTC | #1
On Fri, Aug 23, 2024 at 09:07:30PM +0800, Julian Sun wrote:
> Hi, all
> 
> Recently I noticed a bug[1] in btrfs; after digging into it,
> I believe it's a race in the VFS.
> 
> Let's assume there's an inode (e.g. ino 261) with i_count 1 on
> which iput() is called, and there's a concurrent thread calling
> generic_shutdown_super().
> 
> cpu0:                              cpu1:
> iput() // i_count is 1
>   ->spin_lock(inode)
>   ->dec i_count to 0
>   ->iput_final()                    generic_shutdown_super()
>     ->__inode_add_lru()               ->evict_inodes()
>       // for some reason[2]             ->if (atomic_read(inode->i_count)) continue;
>       // return before                  // inode 261 passed the above check
>       // list_lru_add_obj()             // and then schedule out
>    ->spin_unlock()
> // note here: inode 261 is still on the
> // sb list and the hash list, and
> // I_FREEING|I_WILL_FREE has not been set
> 
> btrfs_iget()
>   // after some function calls
>   ->find_inode()
>     // found the above inode 261
>     ->spin_lock(inode)
>    // check I_FREEING|I_WILL_FREE
>    // and passed
>       ->__iget()
>     ->spin_unlock(inode)                // schedule back
>                                         ->spin_lock(inode)
>                                         // check (I_NEW|I_FREEING|I_WILL_FREE) flags,
>                                         // passed and set I_FREEING
> iput()                                  ->spin_unlock(inode)
>   ->spin_lock(inode)			  ->evict()
>   // dec i_count to 0
>   ->iput_final()
>     ->spin_unlock()
>     ->evict()
> 
> Now two threads are simultaneously evicting the same inode,
> which may trigger the BUG(inode->i_state & I_CLEAR) assertion
> both within clear_inode() and iput().
> 
> To fix the bug, recheck inode->i_count after i_lock is held. The
> earlier lockless check is kept because it is valid in most
> scenarios and avoids taking spin_lock() for inodes that are
> clearly still in use.
> 
> If there is any misunderstanding, please let me know, thanks.
> 
> [1]: https://lore.kernel.org/linux-btrfs/000000000000eabe1d0619c48986@google.com/
> [2]: When I reproduced the bug, the reason was either that 1. SB_ACTIVE
> had been removed or 2. mapping_shrinkable() returned false.
> 
> Reported-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
> CC: stable@vger.kernel.org
> Fixes: 63997e98a3be ("split invalidate_inodes()")
> Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
> ---
>  fs/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..011f630777d0 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -723,6 +723,10 @@ void evict_inodes(struct super_block *sb)
>  			continue;
>  
>  		spin_lock(&inode->i_lock);
> +		if (atomic_read(&inode->i_count)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  			spin_unlock(&inode->i_lock);
>  			continue;

This looks correct to me, albeit I would argue the commit message is
overly verbose, making it harder to understand the gist of the problem:
evict_inodes() fails to re-check i_count after acquiring the spin lock,
while the flags blocking 0->1 i_count transitions are not set yet,
making it possible to race against such a transition.

The real remark I have here is that evict_inodes(), modulo the bug, is
identical to invalidate_inodes(). Perhaps a separate patch (*not* for
stable) to whack it would be prudent?
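
To illustrate, such a cleanup could end up as little more than the
following (a hypothetical sketch, not an actual patch):

/*
 * Hypothetical: with evict_inodes() re-checking i_count under
 * i_lock, invalidate_inodes() has nothing of its own left and
 * could simply forward to it.
 */
void invalidate_inodes(struct super_block *sb)
{
	evict_inodes(sb);
}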
Julian Sun Aug. 26, 2024, 4:11 a.m. UTC | #2
On Sat, 2024-08-24 at 06:54 +0200, Mateusz Guzik wrote:
> On Fri, Aug 23, 2024 at 09:07:30PM +0800, Julian Sun wrote:
> > Hi, all
> > 
> > Recently I noticed a bug[1] in btrfs; after digging into it,
> > I believe it's a race in the VFS.
> > 
> > Let's assume there's an inode (e.g. ino 261) with i_count 1 on
> > which iput() is called, and there's a concurrent thread calling
> > generic_shutdown_super().
> > 
> > cpu0:                              cpu1:
> > iput() // i_count is 1
> >   ->spin_lock(inode)
> >   ->dec i_count to 0
> >   ->iput_final()                    generic_shutdown_super()
> >     ->__inode_add_lru()               ->evict_inodes()
> >       // for some reason[2]             ->if (atomic_read(inode->i_count)) continue;
> >       // return before                  // inode 261 passed the above check
> >       // list_lru_add_obj()             // and then schedule out
> >    ->spin_unlock()
> > // note here: inode 261 is still on the
> > // sb list and the hash list, and
> > // I_FREEING|I_WILL_FREE has not been set
> > 
> > btrfs_iget()
> >   // after some function calls
> >   ->find_inode()
> >     // found the above inode 261
> >     ->spin_lock(inode)
> >    // check I_FREEING|I_WILL_FREE
> >    // and passed
> >       ->__iget()
> >     ->spin_unlock(inode)                // schedule back
> >                                         ->spin_lock(inode)
> >                                         // check (I_NEW|I_FREEING|I_WILL_FREE) flags,
> >                                         // passed and set I_FREEING
> > iput()                                  ->spin_unlock(inode)
> >   ->spin_lock(inode)                      ->evict()
> >   // dec i_count to 0
> >   ->iput_final()
> >     ->spin_unlock()
> >     ->evict()
> > 
> > Now two threads are simultaneously evicting the same inode,
> > which may trigger the BUG(inode->i_state & I_CLEAR) assertion
> > both within clear_inode() and iput().
> > 
> > To fix the bug, recheck inode->i_count after i_lock is held. The
> > earlier lockless check is kept because it is valid in most
> > scenarios and avoids taking spin_lock() for inodes that are
> > clearly still in use.
> > 
> > If there is any misunderstanding, please let me know, thanks.
> > 
> > [1]: https://lore.kernel.org/linux-btrfs/000000000000eabe1d0619c48986@google.com/
> > [2]: When I reproduced the bug, the reason was either that 1. SB_ACTIVE
> > had been removed or 2. mapping_shrinkable() returned false.
> > 
> > Reported-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
> > CC: stable@vger.kernel.org
> > Fixes: 63997e98a3be ("split invalidate_inodes()")
> > Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>
> > ---
> >  fs/inode.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 3a41f83a4ba5..011f630777d0 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -723,6 +723,10 @@ void evict_inodes(struct super_block *sb)
> >                         continue;
> >  
> >                 spin_lock(&inode->i_lock);
> > +               if (atomic_read(&inode->i_count)) {
> > +                       spin_unlock(&inode->i_lock);
> > +                       continue;
> > +               }
> >                 if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> >                         spin_unlock(&inode->i_lock);
> >                         continue;
> 
> This looks correct to me, albeit I would argue the commit message is
> overly verbose, making it harder to understand the gist of the problem:
> evict_inodes() fails to re-check i_count after acquiring the spin lock,
> while the flags blocking 0->1 i_count transitions are not set yet,
> making it possible to race against such a transition.
Alright, I think the issue is clearly explained by the above
commit message. If you insist, I can send a patch v2 with a
reworked commit message.
> 
> The real remark I have here is that evict_inodes(), modulo the bug, is
> identical to invalidate_inodes(). Perhaps a separate patch (*not* for
> stable) to whack it would be prudent?
Agreed. We can replace invalidate_inodes() with evict_inodes() after
this patch.

Thanks,
Mateusz Guzik Aug. 26, 2024, 7:05 a.m. UTC | #3
On Mon, Aug 26, 2024 at 6:11 AM Julian Sun <sunjunchao2870@gmail.com> wrote:
>
> On Sat, 2024-08-24 at 06:54 +0200, Mateusz Guzik wrote:
> > evict_inodes() fails to re-check i_count after acquiring the spin lock,
> > while the flags blocking 0->1 i_count transitions are not set yet,
> > making it possible to race against such a transition.
> Alright, I think the issue is clearly explained by the above
> commit message. If you insist, I can send a patch v2 with a
> reworked commit message.

I'm in no position to insist, merely noting. :)
Christian Brauner Aug. 26, 2024, 10:50 a.m. UTC | #4
On Fri, 23 Aug 2024 21:07:30 +0800, Julian Sun wrote:
> Recently I noticed a bug[1] in btrfs; after digging into it,
> I believe it's a race in the VFS.
> 
> Let's assume there's an inode (e.g. ino 261) with i_count 1 on
> which iput() is called, and there's a concurrent thread calling
> generic_shutdown_super().
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] vfs: fix race between evict_inodes() and find_inode()&iput()
      https://git.kernel.org/vfs/vfs/c/f37af83281e6
Jan Kara Aug. 28, 2024, 4:36 p.m. UTC | #5
On Fri 23-08-24 21:07:30, Julian Sun wrote:
> Hi, all
> 
> Recently I noticed a bug[1] in btrfs; after digging into it,
> I believe it's a race in the VFS.
> 
> Let's assume there's an inode (e.g. ino 261) with i_count 1 on
> which iput() is called, and there's a concurrent thread calling
> generic_shutdown_super().
> 
> cpu0:                              cpu1:
> iput() // i_count is 1
>   ->spin_lock(inode)
>   ->dec i_count to 0
>   ->iput_final()                    generic_shutdown_super()
>     ->__inode_add_lru()               ->evict_inodes()
>       // for some reason[2]             ->if (atomic_read(inode->i_count)) continue;
>       // return before                  // inode 261 passed the above check
>       // list_lru_add_obj()             // and then schedule out
>    ->spin_unlock()
> // note here: inode 261 is still on the
> // sb list and the hash list, and
> // I_FREEING|I_WILL_FREE has not been set
> 
> btrfs_iget()
>   // after some function calls
>   ->find_inode()
>     // found the above inode 261
>     ->spin_lock(inode)
>    // check I_FREEING|I_WILL_FREE
>    // and passed
>       ->__iget()
>     ->spin_unlock(inode)                // schedule back
>                                         ->spin_lock(inode)
>                                         // check (I_NEW|I_FREEING|I_WILL_FREE) flags,
>                                         // passed and set I_FREEING
> iput()                                  ->spin_unlock(inode)
>   ->spin_lock(inode)			  ->evict()
>   // dec i_count to 0
>   ->iput_final()
>     ->spin_unlock()
>     ->evict()
> 
> Now two threads are simultaneously evicting the same inode,
> which may trigger the BUG(inode->i_state & I_CLEAR) assertion
> both within clear_inode() and iput().
> 
> To fix the bug, recheck inode->i_count after i_lock is held. The
> earlier lockless check is kept because it is valid in most
> scenarios and avoids taking spin_lock() for inodes that are
> clearly still in use.
> 
> If there is any misunderstanding, please let me know, thanks.
> 
> [1]: https://lore.kernel.org/linux-btrfs/000000000000eabe1d0619c48986@google.com/
> [2]: When I reproduced the bug, the reason was either that 1. SB_ACTIVE
> had been removed or 2. mapping_shrinkable() returned false.
> 
> Reported-by: syzbot+67ba3c42bcbb4665d3ad@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=67ba3c42bcbb4665d3ad
> CC: stable@vger.kernel.org
> Fixes: 63997e98a3be ("split invalidate_inodes()")
> Signed-off-by: Julian Sun <sunjunchao2870@gmail.com>

Thanks for the fix. It looks good to me and I'm curious how come we didn't
hit this earlier ;). Anyway, feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

I'd just note that the Fixes tag isn't correct AFAICT. At the time of
commit 63997e98a3be we had a global inode_lock and the
atomic_read(&inode->i_count) check was done under it. I'd say it was

55fa6091d831 ("fs: move i_sb_list out from under inode_lock")

or possibly

250df6ed274d ("fs: protect inode->i_state with inode->i_lock")

(depending on how you look at it) that introduced this problem. Not that it
would matter that much these days :).

								Honza

> ---
>  fs/inode.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 3a41f83a4ba5..011f630777d0 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -723,6 +723,10 @@ void evict_inodes(struct super_block *sb)
>  			continue;
>  
>  		spin_lock(&inode->i_lock);
> +		if (atomic_read(&inode->i_count)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
>  		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
>  			spin_unlock(&inode->i_lock);
>  			continue;
> -- 
> 2.39.2
>

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..011f630777d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -723,6 +723,10 @@ void evict_inodes(struct super_block *sb)
 			continue;
 
 		spin_lock(&inode->i_lock);
+		if (atomic_read(&inode->i_count)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
 			continue;