| Message ID | 20180326043503.17828-1-david@fromorbit.com (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, Mar 26, 2018 at 03:35:03PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> We recently had an oops reported on a 4.14 kernel in
> xfs_reclaim_inodes_count() where sb->s_fs_info pointed to garbage
> and so the m_perag_tree lookup walked into lala land.
>
> We found a mount in a failed state, blocked on the shrinker rwsem
> here:
>
> mount_bdev()
>   deactivate_locked_super()
>     unregister_shrinker()
>
> Essentially, the machine was under memory pressure when the mount
> was being run, xfs_fs_fill_super() failed after allocating the
> xfs_mount and attaching it to sb->s_fs_info. It then cleaned up and
> freed the xfs_mount, but the sb->s_fs_info field still pointed to
> the freed memory. Hence when the superblock shrinker then ran
> it fell off the bad pointer.
>
> This is reproduced by using the mount_delay sysfs control as added
> in the previous patch. It produces an oops down this path during the
> stalled mount:
>
> The problem is that the superblock shrinker is running before the
> filesystem structures it depends on have been fully set up. i.e.
> the shrinker is registered in sget(), before ->fill_super() has been
> called, and the shrinker can call into the filesystem before
> fill_super() does its setup work.

Wait a sec... How the hell does it get through trylock_super() before
->s_root is set and ->s_umount is unlocked?

--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Mar 26, 2018 at 06:31:51AM +0100, Al Viro wrote:
> On Mon, Mar 26, 2018 at 03:35:03PM +1100, Dave Chinner wrote:
> > [...]
> > The problem is that the superblock shrinker is running before the
> > filesystem structures it depends on have been fully set up. i.e.
> > the shrinker is registered in sget(), before ->fill_super() has been
> > called, and the shrinker can call into the filesystem before
> > fill_super() does its setup work.
>
> Wait a sec... How the hell does it get through trylock_super() before
> ->s_root is set and ->s_umount is unlocked?

I see... So basically the story is

  * super_cache_count() lacks trylock_super(), making it possible that
    it'll be called too early on a half-set-up superblock.
  * it can't be called too late (during fs shutdown), since the shrinker
    is unregistered before the call of ->kill_sb()
  * making sure it won't get called too early can be done by checking
    SB_ACTIVE.

It's potentially racy, though - don't we need a barrier between setting
the things up and setting SB_ACTIVE?
And that, BTW, means that we want SB_BORN instead of SB_ACTIVE - unlike
the latter, the former is set only in one place. So I'd suggest switching
to checking that, with a barrier pair added - one in mount_fs() before
setting the sucker, another in super_cache_count() before doing the scan.
On Mon, Mar 26, 2018 at 06:51:37AM +0100, Al Viro wrote:
> On Mon, Mar 26, 2018 at 06:31:51AM +0100, Al Viro wrote:
> > [...]
> I see... So basically the story is
>
>   * super_cache_count() lacks trylock_super(), making it possible that
>     it'll be called too early on a half-set-up superblock.
>   * it can't be called too late (during fs shutdown), since the shrinker
>     is unregistered before the call of ->kill_sb()
>   * making sure it won't get called too early can be done by checking
>     SB_ACTIVE.

Yeah, it's the counting that is the issue, not the actual inode scanning.

> It's potentially racy, though - don't we need a barrier between setting
> the things up and setting SB_ACTIVE?

Well, we start with it clear, so it won't be a problem if the shrinker
races with it being set. I think it's more a problem when we clear it,
but I'm not sure how much of a problem that is because the filesystem
structures are still all set up whenever it gets cleared.

That said, it's no trouble to add smp_wmb/smp_rmb barriers where
necessary...

> And that, BTW, means that we want SB_BORN instead of SB_ACTIVE - unlike
> the latter, the former is set only in one place.

Not sure that's the case - lots of filesystems set SB_ACTIVE in their
mount process to enable iput_final() to cache inodes. That's why I chose
SB_ACTIVE - it matches when the filesystem starts making use of the inode
cache and giving the shrinker real work to do....

<shrug> not fussed - let me know if you still prefer SB_BORN and I'll
switch it.

Cheers,

Dave.
On Mon, Mar 26, 2018 at 05:33:32PM +1100, Dave Chinner wrote:
> > It's potentially racy, though - don't we need a barrier between setting
> > the things up and setting SB_ACTIVE?
>
> Well, we start with it clear, so it won't be a problem if the shrinker
> races with it being set. I think it's more a problem when we clear it,
> but I'm not sure how much of a problem that is because the filesystem
> structures are still all set up whenever it gets cleared.

... except that stores might be reordered, with ->s_flags being observed
before some of the stores that went before it.

> That said, it's no trouble to add smp_wmb/smp_rmb barriers where
> necessary...
>
> > And that, BTW, means that we want SB_BORN instead of SB_ACTIVE - unlike
> > the latter, the former is set only in one place.
>
> Not sure that's the case - lots of filesystems set SB_ACTIVE in their
> mount process to enable iput_final() to cache inodes. That's why I chose
> SB_ACTIVE - it matches when the filesystem starts making use of the inode
> cache and giving the shrinker real work to do....
>
> <shrug> not fussed - let me know if you still prefer SB_BORN and I'll
> switch it.

I do. Let it match the places like trylock_super() et al.
On Mon, Mar 26, 2018 at 07:55:47AM +0100, Al Viro wrote:
> On Mon, Mar 26, 2018 at 05:33:32PM +1100, Dave Chinner wrote:
> > [...]
> ... except that stores might be reordered, with ->s_flags being observed
> before some of the stores that went before it.
> [...]
> I do. Let it match the places like trylock_super() et al.

No worries, will switch.

-Dave.
diff --git a/fs/super.c b/fs/super.c
index 672538ca9831..2e398e429f55 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -120,13 +120,26 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	/*
-	 * Don't call trylock_super as it is a potential
-	 * scalability bottleneck. The counts could get updated
-	 * between super_cache_count and super_cache_scan anyway.
-	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_shrink_count() and
-	 * s_op->nr_cached_objects().
+	 * If we are currently mounting or unmounting the superblock, the
+	 * underlying filesystem might be in a state of partial construction or
+	 * deconstruction, and hence it is dangerous to access it.
+	 *
+	 * We don't call trylock_super() as it has proven to be a scalability
+	 * bottleneck, and the shrinker rwsem does not protect filesystem
+	 * operations backing list_lru_shrink_count() or
+	 * s_op->nr_cached_objects() during filesystem setup or teardown. The
+	 * counts could get updated between super_cache_count and
+	 * super_cache_scan anyway, so we really don't need locks here.
+	 *
+	 * Hence we need some other mechanism to avoid filesystems during setup
+	 * and teardown. We can use the SB_ACTIVE flag for that - it is only set
+	 * after ->fill_super has set up the filesystem fully, and it is cleared
+	 * before ->put_super starts tearing it down.
 	 */
+	if (!(sb->s_flags & SB_ACTIVE))
+		return 0;
+
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ee26437dc567..41ce7236f6ea 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1755,6 +1755,8 @@ xfs_fs_fill_super(
  out_close_devices:
 	xfs_close_devices(mp);
  out_free_fsname:
+	sb->s_fs_info = NULL;
+	sb->s_op = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
  out:
@@ -1781,6 +1783,9 @@ xfs_fs_put_super(
 	xfs_destroy_percpu_counters(mp);
 	xfs_destroy_mount_workqueues(mp);
 	xfs_close_devices(mp);
+
+	sb->s_fs_info = NULL;
+	sb->s_op = NULL;
 	xfs_free_fsname(mp);
 	kfree(mp);
 }
@@ -1800,6 +1805,12 @@ xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
 	struct shrink_control	*sc)
 {
+	/*
+	 * Don't do anything until the filesystem is fully set up, or in the
+	 * process of being torn down due to a mount failure.
+	 */
+	if (!sb->s_fs_info)
+		return 0;
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }