diff mbox series

[v2] xfs: clear PF_MEMALLOC before exiting xfsaild thread

Message ID 20200309043430.143206-1-ebiggers@kernel.org (mailing list archive)
State Superseded
Headers show
Series [v2] xfs: clear PF_MEMALLOC before exiting xfsaild thread | expand

Commit Message

Eric Biggers March 9, 2020, 4:34 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
during do_exit().  That can confuse things.  For example, if BSD process
accounting is enabled and the accounting file has FS_SYNC_FL set and is
located on an ext4 filesystem without a journal, then do_exit() ends up
calling ext4_write_inode().  That triggers the
WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
(appropriately) that inodes aren't written when allocating memory.

Fix this in xfsaild() by using the helper functions to save and restore
PF_MEMALLOC.

This can be reproduced as follows in the kvm-xfstests test appliance
modified to add the 'acct' Debian package, and with kvm-xfstests's
recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:

	mkfs.ext2 -F /dev/vdb
	mount /vdb -t ext4
	touch /vdb/file
	chattr +S /vdb/file
	accton /vdb/file
	mkfs.xfs -f /dev/vdc
	mount /vdc
	umount /vdc

It causes:
	WARNING: CPU: 0 PID: 332 at fs/ext4/inode.c:5097 ext4_write_inode+0x140/0x1a0
	CPU: 0 PID: 332 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #5
	[...]
	RIP: 0010:ext4_write_inode+0x140/0x1a0 fs/ext4/inode.c:5097
	[...]
	Call Trace:
	 write_inode fs/fs-writeback.c:1312 [inline]
	 __writeback_single_inode+0x465/0x5f0 fs/fs-writeback.c:1511
	 writeback_single_inode+0xad/0x120 fs/fs-writeback.c:1565
	 sync_inode fs/fs-writeback.c:2602 [inline]
	 sync_inode_metadata+0x3d/0x57 fs/fs-writeback.c:2622
	 ext4_fsync_nojournal fs/ext4/fsync.c:94 [inline]
	 ext4_sync_file+0x243/0x4b0 fs/ext4/fsync.c:172
	 generic_write_sync include/linux/fs.h:2867 [inline]
	 ext4_buffered_write_iter+0xe1/0x130 fs/ext4/file.c:277
	 call_write_iter include/linux/fs.h:1901 [inline]
	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
	 __kernel_write+0x54/0xe0 fs/read_write.c:515
	 do_acct_process+0x122/0x170 kernel/acct.c:522
	 slow_acct_process kernel/acct.c:581 [inline]
	 acct_process+0x1d4/0x27c kernel/acct.c:607
	 do_exit+0x83d/0xbc0 kernel/exit.c:791
	 kthread+0xf1/0x140 kernel/kthread.c:257
	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352

This case was originally reported by syzbot at
https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.

Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
---

v2: include more details in the commit message.

 fs/xfs/xfs_trans_ail.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Brian Foster March 9, 2020, 10:57 a.m. UTC | #1
On Sun, Mar 08, 2020 at 09:34:30PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
> during do_exit().  That can confuse things.  For example, if BSD process
> accounting is enabled and the accounting file has FS_SYNC_FL set and is
> located on an ext4 filesystem without a journal, then do_exit() ends up
> calling ext4_write_inode().  That triggers the
> WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
> (appropriately) that inodes aren't written when allocating memory.
> 
> Fix this in xfsaild() by using the helper functions to save and restore
> PF_MEMALLOC.
> 
> This can be reproduced as follows in the kvm-xfstests test appliance
> modified to add the 'acct' Debian package, and with kvm-xfstests's
> recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
> 
> 	mkfs.ext2 -F /dev/vdb
> 	mount /vdb -t ext4
> 	touch /vdb/file
> 	chattr +S /vdb/file
> 	accton /vdb/file
> 	mkfs.xfs -f /dev/vdc
> 	mount /vdc
> 	umount /vdc
> 
> It causes:
> 	WARNING: CPU: 0 PID: 332 at fs/ext4/inode.c:5097 ext4_write_inode+0x140/0x1a0
> 	CPU: 0 PID: 332 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #5
> 	[...]
> 	RIP: 0010:ext4_write_inode+0x140/0x1a0 fs/ext4/inode.c:5097
> 	[...]
> 	Call Trace:
> 	 write_inode fs/fs-writeback.c:1312 [inline]
> 	 __writeback_single_inode+0x465/0x5f0 fs/fs-writeback.c:1511
> 	 writeback_single_inode+0xad/0x120 fs/fs-writeback.c:1565
> 	 sync_inode fs/fs-writeback.c:2602 [inline]
> 	 sync_inode_metadata+0x3d/0x57 fs/fs-writeback.c:2622
> 	 ext4_fsync_nojournal fs/ext4/fsync.c:94 [inline]
> 	 ext4_sync_file+0x243/0x4b0 fs/ext4/fsync.c:172
> 	 generic_write_sync include/linux/fs.h:2867 [inline]
> 	 ext4_buffered_write_iter+0xe1/0x130 fs/ext4/file.c:277
> 	 call_write_iter include/linux/fs.h:1901 [inline]
> 	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
> 	 __kernel_write+0x54/0xe0 fs/read_write.c:515
> 	 do_acct_process+0x122/0x170 kernel/acct.c:522
> 	 slow_acct_process kernel/acct.c:581 [inline]
> 	 acct_process+0x1d4/0x27c kernel/acct.c:607
> 	 do_exit+0x83d/0xbc0 kernel/exit.c:791
> 	 kthread+0xf1/0x140 kernel/kthread.c:257
> 	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
> 
> This case was originally reported by syzbot at
> https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
> 
> Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

> 
> v2: include more details in the commit message.
> 
>  fs/xfs/xfs_trans_ail.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be8..3bc570c90ad97 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -529,8 +529,9 @@ xfsaild(
>  {
>  	struct xfs_ail	*ailp = data;
>  	long		tout = 0;	/* milliseconds */
> +	unsigned int	noreclaim_flag;
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	set_freezable();
>  
>  	while (1) {
> @@ -601,6 +602,7 @@ xfsaild(
>  		tout = xfsaild_push(ailp);
>  	}
>  
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	return 0;
>  }
>  
> -- 
> 2.25.1
>
Darrick J. Wong March 9, 2020, 4:24 p.m. UTC | #2
On Sun, Mar 08, 2020 at 09:34:30PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> 
> Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
> during do_exit().  That can confuse things.  For example, if BSD process
> accounting is enabled and the accounting file has FS_SYNC_FL set and is
> located on an ext4 filesystem without a journal, then do_exit() ends up
> calling ext4_write_inode().  That triggers the
> WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
> (appropriately) that inodes aren't written when allocating memory.
> 
> Fix this in xfsaild() by using the helper functions to save and restore
> PF_MEMALLOC.
> 
> This can be reproduced as follows in the kvm-xfstests test appliance
> modified to add the 'acct' Debian package, and with kvm-xfstests's
> recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
> 
> 	mkfs.ext2 -F /dev/vdb
> 	mount /vdb -t ext4
> 	touch /vdb/file
> 	chattr +S /vdb/file

Does this trip if the process accounting file is also on an xfs
filesystem?

> 	accton /vdb/file
> 	mkfs.xfs -f /dev/vdc
> 	mount /vdc
> 	umount /vdc

...and if so, can this be turned into an fstests case, please?


> 
> It causes:
> 	WARNING: CPU: 0 PID: 332 at fs/ext4/inode.c:5097 ext4_write_inode+0x140/0x1a0
> 	CPU: 0 PID: 332 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #5
> 	[...]
> 	RIP: 0010:ext4_write_inode+0x140/0x1a0 fs/ext4/inode.c:5097
> 	[...]
> 	Call Trace:
> 	 write_inode fs/fs-writeback.c:1312 [inline]
> 	 __writeback_single_inode+0x465/0x5f0 fs/fs-writeback.c:1511
> 	 writeback_single_inode+0xad/0x120 fs/fs-writeback.c:1565
> 	 sync_inode fs/fs-writeback.c:2602 [inline]
> 	 sync_inode_metadata+0x3d/0x57 fs/fs-writeback.c:2622
> 	 ext4_fsync_nojournal fs/ext4/fsync.c:94 [inline]
> 	 ext4_sync_file+0x243/0x4b0 fs/ext4/fsync.c:172
> 	 generic_write_sync include/linux/fs.h:2867 [inline]
> 	 ext4_buffered_write_iter+0xe1/0x130 fs/ext4/file.c:277
> 	 call_write_iter include/linux/fs.h:1901 [inline]
> 	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
> 	 __kernel_write+0x54/0xe0 fs/read_write.c:515
> 	 do_acct_process+0x122/0x170 kernel/acct.c:522
> 	 slow_acct_process kernel/acct.c:581 [inline]
> 	 acct_process+0x1d4/0x27c kernel/acct.c:607
> 	 do_exit+0x83d/0xbc0 kernel/exit.c:791
> 	 kthread+0xf1/0x140 kernel/kthread.c:257
> 	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
> 
> This case was originally reported by syzbot at
> https://lore.kernel.org/r/0000000000000e7156059f751d7b@google.com.
> 
> Reported-by: syzbot+1f9dc49e8de2582d90c2@syzkaller.appspotmail.com
> Signed-off-by: Eric Biggers <ebiggers@google.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
> 
> v2: include more details in the commit message.
> 
>  fs/xfs/xfs_trans_ail.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 00cc5b8734be8..3bc570c90ad97 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -529,8 +529,9 @@ xfsaild(
>  {
>  	struct xfs_ail	*ailp = data;
>  	long		tout = 0;	/* milliseconds */
> +	unsigned int	noreclaim_flag;
>  
> -	current->flags |= PF_MEMALLOC;
> +	noreclaim_flag = memalloc_noreclaim_save();
>  	set_freezable();
>  
>  	while (1) {
> @@ -601,6 +602,7 @@ xfsaild(
>  		tout = xfsaild_push(ailp);
>  	}
>  
> +	memalloc_noreclaim_restore(noreclaim_flag);
>  	return 0;
>  }
>  
> -- 
> 2.25.1
>
Eric Biggers March 9, 2020, 6:04 p.m. UTC | #3
On Mon, Mar 09, 2020 at 09:24:39AM -0700, Darrick J. Wong wrote:
> On Sun, Mar 08, 2020 at 09:34:30PM -0700, Eric Biggers wrote:
> > From: Eric Biggers <ebiggers@google.com>
> > 
> > Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
> > during do_exit().  That can confuse things.  For example, if BSD process
> > accounting is enabled and the accounting file has FS_SYNC_FL set and is
> > located on an ext4 filesystem without a journal, then do_exit() ends up
> > calling ext4_write_inode().  That triggers the
> > WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
> > (appropriately) that inodes aren't written when allocating memory.
> > 
> > Fix this in xfsaild() by using the helper functions to save and restore
> > PF_MEMALLOC.
> > 
> > This can be reproduced as follows in the kvm-xfstests test appliance
> > modified to add the 'acct' Debian package, and with kvm-xfstests's
> > recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
> > 
> > 	mkfs.ext2 -F /dev/vdb
> > 	mount /vdb -t ext4
> > 	touch /vdb/file
> > 	chattr +S /vdb/file
> 
> Does this trip if the process accounting file is also on an xfs
> filesystem?
> 
> > 	accton /vdb/file
> > 	mkfs.xfs -f /dev/vdc
> > 	mount /vdc
> > 	umount /vdc
> 
> ...and if so, can this be turned into an fstests case, please?

I wasn't expecting it, but it turns out it does actually trip a similar warning
in iomap_do_writepage():

        mkfs.xfs -f /dev/vdb
        mount /vdb
        touch /vdb/file
        chattr +S /vdb/file
        accton /vdb/file
        mkfs.xfs -f /dev/vdc
        mount /vdc
        umount /vdc

causes...

	WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
	CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
	RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
	[...]
	Call Trace:
	 write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
	 iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
	 xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
	 do_writepages+0x41/0xe0 mm/page-writeback.c:2344
	 __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
	 file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
	 xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
	 generic_write_sync include/linux/fs.h:2867 [inline]
	 xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
	 call_write_iter include/linux/fs.h:1901 [inline]
	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
	 __kernel_write+0x54/0xe0 fs/read_write.c:515
	 do_acct_process+0x122/0x170 kernel/acct.c:522
	 slow_acct_process kernel/acct.c:581 [inline]
	 acct_process+0x1d4/0x27c kernel/acct.c:607
	 do_exit+0x83d/0xbc0 kernel/exit.c:791
	 kthread+0xf1/0x140 kernel/kthread.c:257
	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352

So sure, since it's not necessarily a multi-filesystem thing, I can try to turn
it into an xfstest.  There's currently no way to enable BSD process accounting
in xfstests though, so we'll either need to make the test depend on the 'acct'
program or add a helper test program.

Also, do you want me to update the commit message again, to mention the above
case?

- Eric
Darrick J. Wong March 9, 2020, 6:13 p.m. UTC | #4
On Mon, Mar 09, 2020 at 11:04:52AM -0700, Eric Biggers wrote:
> On Mon, Mar 09, 2020 at 09:24:39AM -0700, Darrick J. Wong wrote:
> > On Sun, Mar 08, 2020 at 09:34:30PM -0700, Eric Biggers wrote:
> > > From: Eric Biggers <ebiggers@google.com>
> > > 
> > > Leaving PF_MEMALLOC set when exiting a kthread causes it to remain set
> > > during do_exit().  That can confuse things.  For example, if BSD process
> > > accounting is enabled and the accounting file has FS_SYNC_FL set and is
> > > located on an ext4 filesystem without a journal, then do_exit() ends up
> > > calling ext4_write_inode().  That triggers the
> > > WARN_ON_ONCE(current->flags & PF_MEMALLOC) there, as it assumes
> > > (appropriately) that inodes aren't written when allocating memory.
> > > 
> > > Fix this in xfsaild() by using the helper functions to save and restore
> > > PF_MEMALLOC.
> > > 
> > > This can be reproduced as follows in the kvm-xfstests test appliance
> > > modified to add the 'acct' Debian package, and with kvm-xfstests's
> > > recommended kconfig modified to add CONFIG_BSD_PROCESS_ACCT=y:
> > > 
> > > 	mkfs.ext2 -F /dev/vdb
> > > 	mount /vdb -t ext4
> > > 	touch /vdb/file
> > > 	chattr +S /vdb/file
> > 
> > Does this trip if the process accounting file is also on an xfs
> > filesystem?
> > 
> > > 	accton /vdb/file
> > > 	mkfs.xfs -f /dev/vdc
> > > 	mount /vdc
> > > 	umount /vdc
> > 
> > ...and if so, can this be turned into an fstests case, please?
> 
> I wasn't expecting it, but it turns out it does actually trip a similar warning
> in iomap_do_writepage():
> 
>         mkfs.xfs -f /dev/vdb
>         mount /vdb
>         touch /vdb/file
>         chattr +S /vdb/file
>         accton /vdb/file
>         mkfs.xfs -f /dev/vdc
>         mount /vdc
>         umount /vdc
> 
> causes...
> 
> 	WARNING: CPU: 1 PID: 336 at fs/iomap/buffered-io.c:1534
> 	CPU: 1 PID: 336 Comm: xfsaild/vdc Not tainted 5.6.0-rc5 #3
> 	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191223_100556-anatol 04/01/2014
> 	RIP: 0010:iomap_do_writepage+0x16b/0x1f0 fs/iomap/buffered-io.c:1534
> 	[...]
> 	Call Trace:
> 	 write_cache_pages+0x189/0x4d0 mm/page-writeback.c:2238
> 	 iomap_writepages+0x1c/0x33 fs/iomap/buffered-io.c:1642
> 	 xfs_vm_writepages+0x65/0x90 fs/xfs/xfs_aops.c:578
> 	 do_writepages+0x41/0xe0 mm/page-writeback.c:2344
> 	 __filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:421
> 	 file_write_and_wait_range+0x71/0xc0 mm/filemap.c:760
> 	 xfs_file_fsync+0x7a/0x2b0 fs/xfs/xfs_file.c:114
> 	 generic_write_sync include/linux/fs.h:2867 [inline]
> 	 xfs_file_buffered_aio_write+0x379/0x3b0 fs/xfs/xfs_file.c:691
> 	 call_write_iter include/linux/fs.h:1901 [inline]
> 	 new_sync_write+0x130/0x1d0 fs/read_write.c:483
> 	 __kernel_write+0x54/0xe0 fs/read_write.c:515
> 	 do_acct_process+0x122/0x170 kernel/acct.c:522
> 	 slow_acct_process kernel/acct.c:581 [inline]
> 	 acct_process+0x1d4/0x27c kernel/acct.c:607
> 	 do_exit+0x83d/0xbc0 kernel/exit.c:791
> 	 kthread+0xf1/0x140 kernel/kthread.c:257
> 	 ret_from_fork+0x27/0x50 arch/x86/entry/entry_64.S:352
> 
> So sure, since it's not necessarily a multi-filesystem thing, I can try to turn
> it into an xfstest.  There's currently no way to enable BSD process accounting
> in xfstests though, so we'll either need to make the test depend on the 'acct'
> program or add a helper test program.
> 
> Also, do you want me to update the commit message again, to mention the above
> case?

I think it's worth mentioning that this is a general problem that
applies any time the process accounting file has that sync flag set,
since this problem isn't specific to ext4 + xfs.

(and now I wonder how many other places in the kernel suffer from these
kinds of file write surprises...)

--D

> - Eric
Eric Biggers March 12, 2020, 10:20 p.m. UTC | #5
On Mon, Mar 09, 2020 at 09:24:39AM -0700, Darrick J. Wong wrote:
> > 
> > 	mkfs.ext2 -F /dev/vdb
> > 	mount /vdb -t ext4
> > 	touch /vdb/file
> > 	chattr +S /vdb/file
> 
> Does this trip if the process accounting file is also on an xfs
> filesystem?
> 
> > 	accton /vdb/file
> > 	mkfs.xfs -f /dev/vdc
> > 	mount /vdc
> > 	umount /vdc
> 
> ...and if so, can this be turned into an fstests case, please?
> 

Test sent out at
https://lkml.kernel.org/fstests/20200312221437.141484-1-ebiggers@kernel.org/

- Eric
diff mbox series

Patch

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 00cc5b8734be8..3bc570c90ad97 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -529,8 +529,9 @@  xfsaild(
 {
 	struct xfs_ail	*ailp = data;
 	long		tout = 0;	/* milliseconds */
+	unsigned int	noreclaim_flag;
 
-	current->flags |= PF_MEMALLOC;
+	noreclaim_flag = memalloc_noreclaim_save();
 	set_freezable();
 
 	while (1) {
@@ -601,6 +602,7 @@  xfsaild(
 		tout = xfsaild_push(ailp);
 	}
 
+	memalloc_noreclaim_restore(noreclaim_flag);
 	return 0;
 }