
kernel BUG at fs/xfs/xfs_aops.c:853! in kernel 4.13 rc6

Message ID 20170904123039.GA5664@quack2.suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara Sept. 4, 2017, 12:30 p.m. UTC
On Sun 03-09-17 19:08:54, Михаил Гаврилов wrote:
> On 3 September 2017 at 12:43, Christoph Hellwig <hch@infradead.org> wrote:
> >
> > This is:
> >
> >         bh = head = page_buffers(page);
> >
> > Which looks odd and like some sort of VM/writeback change might
> > have triggered that we get a page without buffers, despite always
> > creating buffers in iomap_begin/end and page_mkwrite.
> >
> > Ccing linux-mm in case anything odd happened in that area recently.
> >
> > Can you tell anything about the workload you are running?
> >
> 
> The XFS partition stores running KVM VM images, plus a home partition
> with Google Chrome profiles.
> The bug seems to be triggered by high memory consumption and heavy use
> of swap, which is twice as large as system memory.
> I saw that it happens when swap usage has reached the size of system
> memory.

Can you reproduce this? I've seen one occurrence of this on our distro
4.4-based kernel but we were never able to reproduce it and find the culprit.
If you can reproduce it, could you run with the attached debug patch to see
whether the WARN_ON triggers? Because my suspicion is that there is some
subtle race in page table teardown vs writeback vs page reclaim which can
result in a page being dirtied without the filesystem being notified about it
(I have seen a very similar oops for ext4 as well, which leads me to suspect
this is a generic issue). Thanks!
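
For reference, the debug idea is roughly the following (a sketch only, not
the actual attached patch; the helper name and placement are illustrative):

  #include <linux/mm.h>
  #include <linux/pagemap.h>
  #include <linux/buffer_head.h>
  #include <linux/writeback.h>

  /*
   * Instead of letting page_buffers() hit the BUG on a buffer-less page
   * during writeback, warn and skip the page, so we learn whether dirty
   * pages without buffers really reach ->writepage.
   */
  static bool page_missing_buffers(struct page *page,
                                   struct writeback_control *wbc)
  {
          if (!WARN_ON_ONCE(!page_has_buffers(page)))
                  return false;   /* buffers present, nothing to see */

          /*
           * Dirty page with no buffers: it was dirtied without the
           * filesystem being told (the suspected teardown / reclaim
           * race). Redirty it and move on instead of crashing.
           */
          redirty_page_for_writepage(wbc, page);
          unlock_page(page);
          return true;
  }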

								Honza

Comments

Mikhail Gavrilov Oct. 7, 2017, 8:10 a.m. UTC | #1
> Can you reproduce this? I've seen one occurrence of this on our distro
> 4.4-based kernel but we were never able to reproduce it and find the culprit.
> If you can reproduce it, could you run with the attached debug patch to see
> whether the WARN_ON triggers? Because my suspicion is that there is some
> subtle race in page table teardown vs writeback vs page reclaim which can
> result in a page being dirtied without the filesystem being notified about it
> (I have seen a very similar oops for ext4 as well, which leads me to suspect
> this is a generic issue). Thanks!

I am trying to reproduce the issue with your patch.
But it seems I have now hit another issue:

[ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
[ 1966.953797]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
[ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1966.953804] tracker-store   D12840  8578   1655 0x00000000
[ 1966.953811] Call Trace:
[ 1966.953823]  __schedule+0x2dc/0xbb0
[ 1966.953830]  ? wait_on_page_bit_common+0xfb/0x1a0
[ 1966.953838]  schedule+0x3d/0x90
[ 1966.953843]  io_schedule+0x16/0x40
[ 1966.953847]  wait_on_page_bit_common+0x10a/0x1a0
[ 1966.953857]  ? page_cache_tree_insert+0x170/0x170
[ 1966.953865]  __filemap_fdatawait_range+0x101/0x1a0
[ 1966.953883]  file_write_and_wait_range+0x63/0xc0
[ 1966.953928]  xfs_file_fsync+0x7c/0x2b0 [xfs]
[ 1966.953938]  vfs_fsync_range+0x4b/0xb0
[ 1966.953945]  do_fsync+0x3d/0x70
[ 1966.953950]  SyS_fsync+0x10/0x20
[ 1966.953954]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[ 1966.953957] RIP: 0033:0x7f9364393d5c
[ 1966.953959] RSP: 002b:00007ffe130b7d50 EFLAGS: 00000293 ORIG_RAX:
000000000000004a
[ 1966.953964] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f9364393d5c
[ 1966.953966] RDX: 0000000000000000 RSI: 00007ffe130b7d80 RDI: 000000000000000f
[ 1966.953968] RBP: 0000000000058a8e R08: 0000000000000000 R09: 000000000000002c
[ 1966.953970] R10: 0000000000058a8e R11: 0000000000000293 R12: 00007f9368ef57f0
[ 1966.953972] R13: 00005585d1ab47b0 R14: 00005585d19701d0 R15: 0000000000058a8e
[ 1966.954131]
               Showing all locks held in the system:
[ 1966.954141] 1 lock held by khungtaskd/65:
[ 1966.954147]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9a114c6d>]
debug_show_all_locks+0x3d/0x1a0
[ 1966.954163] 3 locks held by kworker/u16:4/145:
[ 1966.954165]  #0:  ("writeback"){.+.+.+}, at: [<ffffffff9a0d2ac0>]
process_one_work+0x1d0/0x6a0
[ 1966.954176]  #1:  ((&(&wb->dwork)->work)){+.+.+.}, at:
[<ffffffff9a0d2ac0>] process_one_work+0x1d0/0x6a0
[ 1966.954187]  #2:  (&type->s_umount_key#63){++++.+}, at:
[<ffffffff9a2d3dbb>] trylock_super+0x1b/0x50
[ 1966.954289] 3 locks held by Cache2 I/O/2602:
[ 1966.954291]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[ 1966.954305]  #1:  (&inode->i_rwsem){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[ 1966.954317]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[ 1966.954501] 2 locks held by kworker/0:0/6788:
[ 1966.954503]  #0:  ("xfs-cil/%s"mp->m_fsname){++++..}, at:
[<ffffffff9a0d2ac0>] process_one_work+0x1d0/0x6a0
[ 1966.954513]  #1:  ((&cil->xc_push_work)){+.+...}, at:
[<ffffffff9a0d2ac0>] process_one_work+0x1d0/0x6a0
[ 1966.954530] 3 locks held by TaskSchedulerFo/8616:
[ 1966.954531]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[ 1966.954543]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[ 1966.954556]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[ 1966.954592] 3 locks held by TaskSchedulerFo/8686:
[ 1966.954594]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2fa7d4>]
mnt_want_write+0x24/0x50
[ 1966.954608]  #1:  (sb_internal#2){.+.+.+}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[ 1966.954646]  #2:  (&xfs_nondir_ilock_class){++++..}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[ 1966.954679] 5 locks held by TaskSchedulerFo/8687:
[ 1966.954681]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[ 1966.954693]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[ 1966.954705]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[ 1966.954740]  #3:  (sb_internal#2){.+.+.+}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[ 1966.954777]  #4:  (&xfs_nondir_ilock_class){++++..}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[ 1966.954811] 5 locks held by TaskSchedulerFo/8689:
[ 1966.954813]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[ 1966.954825]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[ 1966.954837]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[ 1966.954871]  #3:  (sb_internal#2){.+.+.+}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[ 1966.954906]  #4:  (&xfs_nondir_ilock_class){++++..}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[ 1966.954941] 3 locks held by TaskSchedulerFo/8690:
[ 1966.954943]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[ 1966.954955]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[ 1966.954967]  #2:  (&(&ip->i_mmaplock)->mr_lock){+++++.}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[ 1966.955001] 2 locks held by TaskSchedulerFo/9512:
[ 1966.955003]  #0:  (&f->f_pos_lock){+.+.+.}, at:
[<ffffffff9a2f71ac>] __fdget_pos+0x4c/0x60
[ 1966.955013]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffffc0a9af4c>] xfs_ilock+0x1ac/0x220 [xfs]
[ 1966.955048] 1 lock held by TaskSchedulerFo/9513:
[ 1966.955049]  #0:  (&xfs_nondir_ilock_class){++++..}, at:
[<ffffffffc0a9ae89>] xfs_ilock+0xe9/0x220 [xfs]

[ 1966.955214] =============================================
Mikhail Gavrilov Oct. 7, 2017, 9:22 a.m. UTC | #2
And yet another

[41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
[41288.797034]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
[41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[41288.797041] tracker-store   D10616  4535   1655 0x00000000
[41288.797049] Call Trace:
[41288.797061]  __schedule+0x2dc/0xbb0
[41288.797072]  ? bit_wait+0x60/0x60
[41288.797076]  schedule+0x3d/0x90
[41288.797082]  io_schedule+0x16/0x40
[41288.797086]  bit_wait_io+0x11/0x60
[41288.797091]  __wait_on_bit+0x31/0x90
[41288.797099]  out_of_line_wait_on_bit+0x94/0xb0
[41288.797106]  ? bit_waitqueue+0x40/0x40
[41288.797113]  __block_write_begin_int+0x265/0x550
[41288.797132]  iomap_write_begin.constprop.14+0x7d/0x130
[41288.797140]  iomap_write_actor+0x92/0x180
[41288.797152]  ? iomap_write_begin.constprop.14+0x130/0x130
[41288.797155]  iomap_apply+0x9f/0x110
[41288.797165]  ? iomap_write_begin.constprop.14+0x130/0x130
[41288.797169]  iomap_file_buffered_write+0x6e/0xa0
[41288.797171]  ? iomap_write_begin.constprop.14+0x130/0x130
[41288.797212]  xfs_file_buffered_aio_write+0xdd/0x380 [xfs]
[41288.797250]  xfs_file_write_iter+0x9e/0x140 [xfs]
[41288.797258]  __vfs_write+0xf8/0x170
[41288.797270]  vfs_write+0xc6/0x1c0
[41288.797275]  SyS_pwrite64+0x98/0xc0
[41288.797284]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[41288.797287] RIP: 0033:0x7f357aa30163
[41288.797289] RSP: 002b:00007ffe6bef2070 EFLAGS: 00000293 ORIG_RAX:
0000000000000012
[41288.797292] RAX: ffffffffffffffda RBX: 00000000000027f7 RCX: 00007f357aa30163
[41288.797294] RDX: 0000000000001000 RSI: 0000559141745d18 RDI: 0000000000000009
[41288.797296] RBP: 0000000000010351 R08: 0000559140da78a8 R09: 000055914100c548
[41288.797298] R10: 0000000000743d90 R11: 0000000000000293 R12: 0000000000002c5e
[41288.797299] R13: 000000000000162f R14: 0000559140da77f8 R15: 0000000000000001
[41288.797329]
               Showing all locks held in the system:
[41288.797338] 1 lock held by khungtaskd/65:
[41288.797341]  #0:  (tasklist_lock){.+.+..}, at: [<ffffffff9a114c6d>]
debug_show_all_locks+0x3d/0x1a0
[41288.797358] 1 lock held by kworker/0:1H/377:
[41288.797359]  #0:  (&rq->lock){-.-.-.}, at: [<ffffffff9a9a2af1>]
__schedule+0xe1/0xbb0
[41288.797452] 5 locks held by TaskSchedulerFo/6821:
[41288.797454]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[41288.797467]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[41288.797480]  #2:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[41288.797518]  #3:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.797548]  #4:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.797610] 2 locks held by TaskSchedulerBa/7167:
[41288.797611]  #0:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.797646]  #1:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.797677] 2 locks held by TaskSchedulerFo/7174:
[41288.797678]  #0:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.797708]  #1:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.797899] 2 locks held by TaskSchedulerFo/5547:
[41288.797901]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2fa7d4>]
mnt_want_write+0x24/0x50
[41288.797913]  #1:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2e190a>] path_openat+0x30a/0xc80
[41288.797922] 1 lock held by TaskSchedulerFo/5605:
[41288.797923]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.797931] 1 lock held by TaskSchedulerFo/6638:
[41288.797932]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.797940] 1 lock held by TaskSchedulerFo/6830:
[41288.797941]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.797948] 4 locks held by TaskSchedulerFo/6832:
[41288.797949]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2fa7d4>]
mnt_want_write+0x24/0x50
[41288.797957]  #1:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2e190a>] path_openat+0x30a/0xc80
[41288.797964]  #2:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.797998]  #3:  (&xfs_dir_ilock_class/5){+.+...}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.798027] 5 locks held by TaskSchedulerFo/6959:
[41288.798028]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2ccaef>]
do_sys_ftruncate.constprop.17+0xdf/0x110
[41288.798035]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2cc795>] do_truncate+0x65/0xc0
[41288.798042]  #2:  (&(&ip->i_mmaplock)->mr_lock){++++++}, at:
[<ffffffffc0a9aef9>] xfs_ilock+0x159/0x220 [xfs]
[41288.798064]  #3:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.798087]  #4:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.798110] 8 locks held by TaskSchedulerBa/7193:
[41288.798111]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2fa7d4>]
mnt_want_write+0x24/0x50
[41288.798118]  #1:  (&type->i_mutex_dir_key#7/1){+.+.+.}, at:
[<ffffffff9a2dc68a>] lock_rename+0xda/0x100
[41288.798132]  #2:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffff9a2f293d>] lock_two_nondirectories+0x6d/0x80
[41288.798145]  #3:  (&sb->s_type->i_mutex_key#19/4){+.+.+.}, at:
[<ffffffff9a2f2926>] lock_two_nondirectories+0x56/0x80
[41288.798168]  #4:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.798203]  #5:  (&xfs_dir_ilock_class){++++-.}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.798230]  #6:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9b1d7>] xfs_ilock_nowait+0x197/0x270 [xfs]
[41288.798262]  #7:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9b1d7>] xfs_ilock_nowait+0x197/0x270 [xfs]
[41288.798298] 2 locks held by TaskSchedulerFo/7212:
[41288.798300]  #0:  (sb_internal#2){.+.+.?}, at: [<ffffffffc0aad7ac>]
xfs_trans_alloc+0xec/0x130 [xfs]
[41288.798332]  #1:  (&xfs_nondir_ilock_class){++++--}, at:
[<ffffffffc0a9af14>] xfs_ilock+0x174/0x220 [xfs]
[41288.798360] 1 lock held by TaskSchedulerFo/7297:
[41288.798362]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.798375] 1 lock held by chrome/8645:
[41288.798376]  #0:  (&dev->struct_mutex){+.+.+.}, at:
[<ffffffffc029b8a1>] i915_gem_close_object+0x31/0x130 [i915]
[41288.798619] 3 locks held by kworker/2:3/540:
[41288.798621]  #0:  ("events"){.+.+.+}, at: [<ffffffff9a0d2ac0>]
process_one_work+0x1d0/0x6a0
[41288.798632]  #1:  ((&dev_priv->mm.free_work)){+.+...}, at:
[<ffffffff9a0d2ac0>] process_one_work+0x1d0/0x6a0
[41288.798642]  #2:  (&dev->struct_mutex){+.+.+.}, at:
[<ffffffffc0298515>] __i915_gem_free_objects+0x35/0x420 [i915]
[41288.798680] 2 locks held by tracker-store/4535:
[41288.798682]  #0:  (sb_writers#17){.+.+.+}, at: [<ffffffff9a2cf953>]
vfs_write+0x193/0x1c0
[41288.798695]  #1:  (&sb->s_type->i_mutex_key#19){++++++}, at:
[<ffffffffc0a9aedb>] xfs_ilock+0x13b/0x220 [xfs]
[41288.798740] 1 lock held by gitkraken/5915:
[41288.798742]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.798755] 1 lock held by gitkraken/6017:
[41288.798756]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220
[41288.798768] 1 lock held by gitkraken/6125:
[41288.798770]  #0:  (&type->i_mutex_dir_key#7){++++++}, at:
[<ffffffff9a2dcff5>] lookup_slow+0xe5/0x220

[41288.798788] =============================================
Luis Chamberlain Oct. 9, 2017, 6:31 p.m. UTC | #3
On Mon, Oct 09, 2017 at 11:05:29AM +1100, Dave Chinner wrote:
> On Sat, Oct 07, 2017 at 01:10:58PM +0500, Михаил Гаврилов wrote:
> > But seems now got another issue:
> > 
> > [ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
> > [ 1966.953797]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > [ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [ 1966.953804] tracker-store   D12840  8578   1655 0x00000000
> > [ 1966.953811] Call Trace:
> > [ 1966.953823]  __schedule+0x2dc/0xbb0
> > [ 1966.953830]  ? wait_on_page_bit_common+0xfb/0x1a0
> > [ 1966.953838]  schedule+0x3d/0x90
> > [ 1966.953843]  io_schedule+0x16/0x40
> > [ 1966.953847]  wait_on_page_bit_common+0x10a/0x1a0
> > [ 1966.953857]  ? page_cache_tree_insert+0x170/0x170
> > [ 1966.953865]  __filemap_fdatawait_range+0x101/0x1a0
> > [ 1966.953883]  file_write_and_wait_range+0x63/0xc0
> 
> Ok, that's in wait_on_page_writeback(page)
> ......
> 
> > And yet another
> > 
> > [41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
> > [41288.797034]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > [41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [41288.797041] tracker-store   D10616  4535   1655 0x00000000
> > [41288.797049] Call Trace:
> > [41288.797061]  __schedule+0x2dc/0xbb0
> > [41288.797072]  ? bit_wait+0x60/0x60
> > [41288.797076]  schedule+0x3d/0x90
> > [41288.797082]  io_schedule+0x16/0x40
> > [41288.797086]  bit_wait_io+0x11/0x60
> > [41288.797091]  __wait_on_bit+0x31/0x90
> > [41288.797099]  out_of_line_wait_on_bit+0x94/0xb0
> > [41288.797106]  ? bit_waitqueue+0x40/0x40
> > [41288.797113]  __block_write_begin_int+0x265/0x550
> > [41288.797132]  iomap_write_begin.constprop.14+0x7d/0x130
> 
> And that's in wait_on_buffer().
> 
> In both cases we are waiting on a bit lock for IO completion. In the
> first case it is on page, the second it's on sub-page read IO
> completion during a write.
> 
> Triggering hung task timeouts like this doesn't usually indicate a
> filesystem problem.

<-- snip -->

> None of these things are usually filesystem problems, and the trainsmash
> of blocked tasks on filesystem locks is typical for these types of
> "blocked indefinitely with locks held" type of situations. It does
> tend to indicate that there is quite a bit of load on the
> filesystem, though...

As Jan Kara noted, we've also seen this on a customer's SLE12-SP2 kernel
(4.4-based). Although we were likewise never able to root-cause it, since
that bug is now closed on our end I figured it would be worth mentioning
two theories we discussed, one more recent than the other.

One theory came from observing logs that hinted at an older issue our
Docker team had been looking into for a while: that of libdm mounts being
leaked into another container's namespace. Apparently the clue that this
has happened is a message like the following in the Docker logs:

2017-06-22T18:59:47.925917+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:47.925857351+08:00" level=info msg="Container
6b8f678a27d61939f358614c673224675d64f1527bb2046943e2d493f095c865 failed to exit within 30 seconds of signal 15 - using the force"

This is a SIGTERM. When one sees the above message it is a good hint that
the container did not actually finish at all, but instead was forcibly
terminated via 'docker kill' or 'docker rm -f'. To be clear, Docker was
not able to stop the container with SIGTERM and so resorted to SIGKILL.

After this comes the next clueful message, which interests us as a way to
try to reproduce the originally reported issue:

2017-06-22T18:59:48.071374+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:48.071314219+08:00" level=error msg="devmapper: Error
unmounting device

Aleksa indicates he's been studying this code for a while, and since he
has fixes for this in Docker he does understand what is going on here.
The above error message indicates that Docker in turn *failed* to clean
up the damn container even after SIGKILL. This, he indicates, is due to
libdm mounts leaking from one container's namespace to another
container's namespace. The error is not Docker being unable to umount;
it's actually that Docker cannot remove the backing device.

It's after this that the bug triggers. It doesn't explain *why* we hit
this bug on XFS, but it should be a clue. Aleksa is, however, absolutely
sure that the bugs found internally *are* caused by mount leakage
somehow, and all we know is that the XFS oops happened after this.

He suggests this could in theory be reproduced by doing the following
while you have a Docker daemon running:

  o unshare -m
  o mount --make-rprivate /

This effectively forces a mount leak. We had someone trying to reproduce
this internally but I don't think they were able to in the end.

If this Docker issue leads to a kernel issue, the severity might be a bit
more serious: it would mean an unprivileged user namespace can
effectively perform a DoS against a host that is trying to operate on
devicemapper mounts.

Then *another* theory came recently from Jan Blunck during ALPSS: he
noted he had traced a similar issue back to a systemd misuse, i.e.,
systemd setting the root mount to 'shared' propagation by default.
Originally the root mount is 'private', but systemd decided it'd be a
good idea to set it to 'shared', which would quite easily explain the
mounts leaking between namespaces.

Aleksa, however, contends that the mount leaking happens because of
Docker's architecture (all of the mounts are created in one mount
namespace, and all of the runc invocations happen in that same mount
namespace). Even if you explicitly set different sharing options (which
runc already supports), the problem still exists.

But for those *not* using Docker, it gives us an idea of how you could
perhaps run into a similar issue.

Regardless, Aleksa points out there's still a more fundamental issue of
"I can leak a mountpoint as an unprivileged user, no matter what the
propagation is":

  % unshare -rm
  % mount --make-rprivate /
  % # I now have the mount alive.
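
A rough C equivalent of those two commands, for anyone scripting this (a
sketch; `unshare -rm` additionally writes uid/gid maps, omitted here):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <sys/mount.h>

  int main(void)
  {
          /* New user + mount namespace; unprivileged, like `unshare -rm`. */
          if (unshare(CLONE_NEWUSER | CLONE_NEWNS) == -1) {
                  perror("unshare");
                  return 1;
          }
          /* Stop propagation back out, like `mount --make-rprivate /`. */
          if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
                  perror("mount");
                  return 1;
          }
          /* While this process lives, its copies of the mounts stay pinned. */
          pause();
          return 0;
  }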

  Luis
Eric W. Biederman Oct. 9, 2017, 7:02 p.m. UTC | #4
"Luis R. Rodriguez" <mcgrof@kernel.org> writes:

> On Mon, Oct 09, 2017 at 11:05:29AM +1100, Dave Chinner wrote:
>> On Sat, Oct 07, 2017 at 01:10:58PM +0500, Михаил Гаврилов wrote:
>> > But seems now got another issue:
>> > 
>> > [ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
>> > [ 1966.953797]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
>> > [ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [ 1966.953804] tracker-store   D12840  8578   1655 0x00000000
>> > [ 1966.953811] Call Trace:
>> > [ 1966.953823]  __schedule+0x2dc/0xbb0
>> > [ 1966.953830]  ? wait_on_page_bit_common+0xfb/0x1a0
>> > [ 1966.953838]  schedule+0x3d/0x90
>> > [ 1966.953843]  io_schedule+0x16/0x40
>> > [ 1966.953847]  wait_on_page_bit_common+0x10a/0x1a0
>> > [ 1966.953857]  ? page_cache_tree_insert+0x170/0x170
>> > [ 1966.953865]  __filemap_fdatawait_range+0x101/0x1a0
>> > [ 1966.953883]  file_write_and_wait_range+0x63/0xc0
>> 
>> Ok, that's in wait_on_page_writeback(page)
>> ......
>> 
>> > And yet another
>> > 
>> > [41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
>> > [41288.797034]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
>> > [41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [41288.797041] tracker-store   D10616  4535   1655 0x00000000
>> > [41288.797049] Call Trace:
>> > [41288.797061]  __schedule+0x2dc/0xbb0
>> > [41288.797072]  ? bit_wait+0x60/0x60
>> > [41288.797076]  schedule+0x3d/0x90
>> > [41288.797082]  io_schedule+0x16/0x40
>> > [41288.797086]  bit_wait_io+0x11/0x60
>> > [41288.797091]  __wait_on_bit+0x31/0x90
>> > [41288.797099]  out_of_line_wait_on_bit+0x94/0xb0
>> > [41288.797106]  ? bit_waitqueue+0x40/0x40
>> > [41288.797113]  __block_write_begin_int+0x265/0x550
>> > [41288.797132]  iomap_write_begin.constprop.14+0x7d/0x130
>> 
>> And that's in wait_on_buffer().
>> 
>> In both cases we are waiting on a bit lock for IO completion. In the
>> first case it is on page, the second it's on sub-page read IO
>> completion during a write.
>> 
>> Triggering hung task timeouts like this doesn't usually indicate a
>> filesystem problem.
>
> <-- snip -->
>
>> None of these things are usually filesystem problems, and the trainsmash
>> of blocked tasks on filesystem locks is typical for these types of
>> "blocked indefinitely with locks held" type of situations. It does
>> tend to indicate that there is quite a bit of load on the
>> filesystem, though...
>
> As Jan Kara noted we've seen this also on customers SLE12-SP2 kernel (4.4
> based). Although we also were never able to root cause, since that bug
> is now closed on our end I figured it would be worth mentioning two
> theories we discussed, one more recent than the other.
>
> One theory came from observation of logs and these logs hinting at an
> older issue our Docker team had been looking into for a while, that of
> libdm mounts being leaked into another container's namespace. Apparently
> the clue to when this happens is when something as follows is seen on
> the docker logs:
>
> 2017-06-22T18:59:47.925917+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:47.925857351+08:00" level=info msg="Container
> 6b8f678a27d61939f358614c673224675d64f1527bb2046943e2d493f095c865 failed to exit within 30 seconds of signal 15 - using the force"
>
> This is a SIGTERM. When one sees the above message it is a good hint that the
> container actually did not finish at all, but instead was forced to be
> terminated via 'docker kill' or 'docker rm -f'. To be clear docker was not able
> to use SIGTERM so then resorts to SIGKILL.
>
> After this is when we get the next clueful message which interests us to try
> to figure out a way to reproduce the originally reported issue:
>
> 2017-06-22T18:59:48.071374+08:00 host-docker-01 dockerd[1957]: time="2017-06-22T18:59:48.071314219+08:00" level=error msg="devmapper: Error
> unmounting device
>
> Aleksa indicates he's been studying this code for a while, and although he
> has fixes for this on Docker it also means he does understands what is
> going on here. The above error message indicates Docker in turn *failed* to
> still kill the damn container with SIGKILL. This, he indicates, is due to
> libdm mounts leaking from one container namespace to another container's
> namespace. The error is not docker being unable to umount, its actually
> that docker cannot remove the backing device.
>
> Its after this when the bug triggers. It doesn't explain *why* we hit this
> bug on XFS, but it should be a clue. Aleksa is however is absolutely sure
> that the bugs found internally *are* caused by mount leakage somehow, and
> all we know is that XFS oops happened after this.
>
> He suggests this could in theory be reproduced by doing the following
> while you have a Docker daemon running:
>
>   o unshare -m
>   o mount --make-rprivate /
>
> Effectively forcing a mount leak. We had someone trying to reproduce this
> internally somehow but I don't think they were in the end able to.
>
> If this docker issue leads to a kernel issue the severity might be a bit more
> serious, it indicates an unprivileged user namespaces can effectively perform a
> DoS against the host if it's trying to operate on devicemapper mounts.
>
> Then *another* theory came recently from Jan Blunck during the ALPSS, he noted
> he traced a similar issue back to a systemd misuse, ie, systemd setting the
> root namespace to 'shared' by default. Originally the root namespace is set to
> 'private', but systemd decided it'd be a good idea to set it to 'shared'. Which
> would quite easily explain the mount namespace leaking.
>
> Aleksa however contends that the mountspace leaking happens because of Docker's
> architecture (all of the mounts are created in one mount namespace, and all of
> the runc invocations happen in that same mount namespace). Even if you
> explicitly set different sharing options (which runc already supports), the
> problem still exists.
>
> But if *not* using docker, it gives us an idea of how you perhaps you could
> run into a similar issue.
>
> Regardless Aleksa points out there's still a more fundamental issue of "I can
> leak a mountpoint as an unprivileged user, no matter what the propagation is":
>
>   % unshare -rm
>   % mount --make-rprivate /
>   % # I now have the mount alive.
>
>   Luis

*Scratches my head*

That most definitely is not a leak.  It is the creation of a duplicate set
of mounts.

It is true that unmount in the parent will not remove those duplicate
mounts in the other mount namespaces.  Unlink of the mountpoint will
ensure that nothing is mounted on a file or directory in any mount
namespace.

Given that they are duplicates of the mount, they will not make any umount
fail in the original namespace.

The superblock and the block device will remain busy.  But that is not
new as any open file or directory can already do that.

Further, mount namespace reference counts are held by processes directly,
by file descriptors, or by other mount namespaces which are themselves
held open by processes.  So killing the appropriate set of processes
should make the mount namespace go away.

I have heard tell of cases where mounts make it into mount namespaces
where no one was expecting them to exist, and no one wants to kill the
processes using those mount namespaces.  But that is mostly sloppiness.
That is not a big bad leak that cannot be prevented; it is an "oops, did
I put that there?" kind of situation.

If there is something special about device mapper that needs to be taken
into account I would love to hear about it.

The only scenario I can currently imagine that goes from umount failing
to excessive I/O happening on the system is if userspace decides not
to send a SIGKILL to some set of processes that then do a bunch of I/O.

Eric
Dave Chinner Oct. 9, 2017, 10:28 p.m. UTC | #5
On Mon, Oct 09, 2017 at 08:31:29PM +0200, Luis R. Rodriguez wrote:
> On Mon, Oct 09, 2017 at 11:05:29AM +1100, Dave Chinner wrote:
> > On Sat, Oct 07, 2017 at 01:10:58PM +0500, Михаил Гаврилов wrote:
> > > But seems now got another issue:
> > > 
> > > [ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
> > > [ 1966.953797]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > > [ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [ 1966.953804] tracker-store   D12840  8578   1655 0x00000000
> > > [ 1966.953811] Call Trace:
> > > [ 1966.953823]  __schedule+0x2dc/0xbb0
> > > [ 1966.953830]  ? wait_on_page_bit_common+0xfb/0x1a0
> > > [ 1966.953838]  schedule+0x3d/0x90
> > > [ 1966.953843]  io_schedule+0x16/0x40
> > > [ 1966.953847]  wait_on_page_bit_common+0x10a/0x1a0
> > > [ 1966.953857]  ? page_cache_tree_insert+0x170/0x170
> > > [ 1966.953865]  __filemap_fdatawait_range+0x101/0x1a0
> > > [ 1966.953883]  file_write_and_wait_range+0x63/0xc0
> > 
> > Ok, that's in wait_on_page_writeback(page)
> > ......
> > 
> > > And yet another
> > > 
> > > [41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
> > > [41288.797034]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > > [41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > disables this message.
> > > [41288.797041] tracker-store   D10616  4535   1655 0x00000000
> > > [41288.797049] Call Trace:
> > > [41288.797061]  __schedule+0x2dc/0xbb0
> > > [41288.797072]  ? bit_wait+0x60/0x60
> > > [41288.797076]  schedule+0x3d/0x90
> > > [41288.797082]  io_schedule+0x16/0x40
> > > [41288.797086]  bit_wait_io+0x11/0x60
> > > [41288.797091]  __wait_on_bit+0x31/0x90
> > > [41288.797099]  out_of_line_wait_on_bit+0x94/0xb0
> > > [41288.797106]  ? bit_waitqueue+0x40/0x40
> > > [41288.797113]  __block_write_begin_int+0x265/0x550
> > > [41288.797132]  iomap_write_begin.constprop.14+0x7d/0x130
> > 
> > And that's in wait_on_buffer().
> > 
> > In both cases we are waiting on a bit lock for IO completion. In the
> > first case it is on page, the second it's on sub-page read IO
> > completion during a write.
> > 
> > Triggering hung task timeouts like this doesn't usually indicate a
> > filesystem problem.
> 
> <-- snip -->
> 
> > None of these things are usually filesystem problems, and the trainsmash
> > of blocked tasks on filesystem locks is typical for these types of
> > "blocked indefinitely with locks held" type of situations. It does
> > tend to indicate that there is quite a bit of load on the
> > filesystem, though...
> 
> As Jan Kara noted we've seen this also on customers SLE12-SP2 kernel (4.4
> based). Although we also were never able to root cause, since that bug
> is now closed on our end I figured it would be worth mentioning two
> theories we discussed, one more recent than the other.

Sure, but stuff going on with docker mounts and namespaces has
nothing to do with IO path locking and completions. There's
something in the filesystem or the storage stack below going
wrong here, not above it in the vfsmount layer....

Cheers,

Dave.
Jan Kara Oct. 10, 2017, 7:57 a.m. UTC | #6
On Tue 10-10-17 09:28:51, Dave Chinner wrote:
> On Mon, Oct 09, 2017 at 08:31:29PM +0200, Luis R. Rodriguez wrote:
> > On Mon, Oct 09, 2017 at 11:05:29AM +1100, Dave Chinner wrote:
> > > On Sat, Oct 07, 2017 at 01:10:58PM +0500, Михаил Гаврилов wrote:
> > > > But seems now got another issue:
> > > > 
> > > > [ 1966.953781] INFO: task tracker-store:8578 blocked for more than 120 seconds.
> > > > [ 1966.953797]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > > > [ 1966.953800] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [ 1966.953804] tracker-store   D12840  8578   1655 0x00000000
> > > > [ 1966.953811] Call Trace:
> > > > [ 1966.953823]  __schedule+0x2dc/0xbb0
> > > > [ 1966.953830]  ? wait_on_page_bit_common+0xfb/0x1a0
> > > > [ 1966.953838]  schedule+0x3d/0x90
> > > > [ 1966.953843]  io_schedule+0x16/0x40
> > > > [ 1966.953847]  wait_on_page_bit_common+0x10a/0x1a0
> > > > [ 1966.953857]  ? page_cache_tree_insert+0x170/0x170
> > > > [ 1966.953865]  __filemap_fdatawait_range+0x101/0x1a0
> > > > [ 1966.953883]  file_write_and_wait_range+0x63/0xc0
> > > 
> > > Ok, that's in wait_on_page_writeback(page)
> > > ......
> > > 
> > > > And yet another
> > > > 
> > > > [41288.797026] INFO: task tracker-store:4535 blocked for more than 120 seconds.
> > > > [41288.797034]       Not tainted 4.13.4-301.fc27.x86_64+debug #1
> > > > [41288.797037] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [41288.797041] tracker-store   D10616  4535   1655 0x00000000
> > > > [41288.797049] Call Trace:
> > > > [41288.797061]  __schedule+0x2dc/0xbb0
> > > > [41288.797072]  ? bit_wait+0x60/0x60
> > > > [41288.797076]  schedule+0x3d/0x90
> > > > [41288.797082]  io_schedule+0x16/0x40
> > > > [41288.797086]  bit_wait_io+0x11/0x60
> > > > [41288.797091]  __wait_on_bit+0x31/0x90
> > > > [41288.797099]  out_of_line_wait_on_bit+0x94/0xb0
> > > > [41288.797106]  ? bit_waitqueue+0x40/0x40
> > > > [41288.797113]  __block_write_begin_int+0x265/0x550
> > > > [41288.797132]  iomap_write_begin.constprop.14+0x7d/0x130
> > > 
> > > And that's in wait_on_buffer().
> > > 
> > > In both cases we are waiting on a bit lock for IO completion. In the
> > > first case it is on page, the second it's on sub-page read IO
> > > completion during a write.
> > > 
> > > Triggering hung task timeouts like this doesn't usually indicate a
> > > filesystem problem.
> > 
> > <-- snip -->
> > 
> > > None of these things are usually filesystem problems, and the trainsmash
> > > of blocked tasks on filesystem locks is typical for these types of
> > > "blocked indefinitely with locks held" type of situations. It does
> > > tend to indicate that there is quite a bit of load on the
> > > filesystem, though...
> > 
> > As Jan Kara noted we've seen this also on customers SLE12-SP2 kernel (4.4
> > based). Although we also were never able to root cause, since that bug
> > is now closed on our end I figured it would be worth mentioning two
> > theories we discussed, one more recent than the other.
> 
> Sure, but stuff going on with docker mounts and namespaces has
> nothing to do with IO path locking and completions. There's
> something in the filesystem or the storage stack below going
> wrong here, not above it in the vfsmount layer....

Agreed. If anything, I suspect there's a race somewhere when we tear
down a process's mappings while handling SIGKILL. But I was never able to
reproduce it, nor find any problem in the code...

								Honza
Aleksa Sarai Oct. 15, 2017, 8:53 a.m. UTC | #7
[Dammit, I thought I sent this earlier -- it looks like Thunderbird 
swallowed my original draft for this email.]

Hi Eric,

This is the bug that I talked to you about at LPC, related to 
devicemapper and it not being possible to issue DELETE and REMOVE 
operations on a devicemapper device that is still mounted in 
$some_namespace. [Before we go on, deferred removal and deletion can 
help here, but the deferral will never kick in until the reference goes 
away. On SUSE systems, deferred removal doesn't appear to work at all, 
but that's an issue for us to solve.]

> *Scratches my head*
> 
> That most definitely is not a leak.  It is a creation of a duplicate set
> of mounts.

Sorry for my sloppy wording: "leak" in this context means (for me at
least) that the mount has been duplicated into a mount namespace due to
sloppiness or malice by userspace.

> It is true that unmount in the parent will not remove those duplicate
> mounts in the other mount namespaces.  Unlink of the mountpoint will
> ensure that nothing is mounted on a file or directory in any mount
> namespace.
> 
> Given that they are duplicate of the mount they will not make any umount
> fail in the original namespace.

The error is a bit confusing. If you read the code, it turns out that
"unmount failed" doesn't actually mean that umount(2) returned EBUSY. It
means that the umount(2) *succeeded* and that the subsequent cleanup by
Docker (where it tries to remove and delete unused devicemapper devices)
failed. This is caused by the mount still existing in a container's mount
namespace, due either to a race condition (that I may have a patch to
fix[1]) or to the container inadvertently bind-mounting the rootfs mounts
from the host (I've seen people do ridiculous things like a recursive
bind-mount of '/' into a container, for example).

> Further mount namespaces reference counts are held open by processes
> directly or file descriptors or other mount namespaces which are held
> open by processes.  So killing the appropriate set of processes should
> make the mount namespace go away.
> 
> I have heard tell of cases where mounts make it into mount namespaces
> where no one was expecting them to exist, and no one wanting to kill the
> processes using those mount namespaces.  But that is mostly sloppiness.
> That is not a big bad leak that can not be prevented, it is an "oops did
> I put that there?" kind of situation.
> 
> If there is something special about device mapper that needs to be take
> into account I would love to hear about it.

You're right that most of this issue (in the case of Docker) is just 
sloppiness, though the architecture makes it quite hard to resolve this 
issue at the moment. But the problem in Docker /can/ be solved at least 
partially by just removing the mountpoint. In fact I have a patch to do 
that already[2].

However, there is a more fundamental issue here, one which is quite 
concerning. It basically boils down to the fact that any unprivileged 
user can create a reference to a devicemapper mount on the host in such 
a way that the host won't know about it. A toy example is the following:

   % unshare -rm
   # mount --make-rprivate /
   # mount -t tmpfs tmpfs /tmp && mkdir /tmp/saved_mount
   # mount --rbind /some/devicemapper/mount /tmp/saved_mount

At this point, even if the host does an `rm /some/devicemapper/mount`, 
the devicemapper reference will stick around in the "container". This 
isn't an issue for most cases, but with devicemapper, the host might 
want to do more management operations than just
`rm /some/devicemapper/mount`. They probably (like Docker does) want to
remove the device and/or delete the device. In the above situation, they 
would be blocked from doing so. As I mention above, deferred deletion 
and removal can help here (the operation succeeds on non-SUSE systems) 
but there is still an issue that the space will not be reclaimed because 
the deferred operation will never kick off (because there is still a 
reference lying around).
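
For concreteness, the deferred removal mentioned above is roughly the
following libdevmapper sequence (a sketch of what `dmsetup remove
--deferred <name>` does; error reporting trimmed):

  #include <libdevmapper.h>

  /*
   * Queue a deferred REMOVE: the device is only torn down once the last
   * holder -- e.g. a mount duplicated into some namespace -- goes away.
   * Until then no space is reclaimed, which is the problem above.
   */
  static int dm_remove_deferred(const char *name)
  {
          struct dm_task *dmt;
          int ok = 0;

          dmt = dm_task_create(DM_DEVICE_REMOVE);
          if (!dmt)
                  return 0;
          if (dm_task_set_name(dmt, name) &&
              dm_task_deferred_remove(dmt) &&
              dm_task_run(dmt))
                  ok = 1;
          dm_task_destroy(dmt);
          return ok;
  }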

As you've said, 8ed936b5671b ("vfs: Lazily remove mounts on unlinked 
files and directories.") added the ability to reduce possible DoS 
attacks against rmdir(2) by forcing unmounts in all mount namespaces 
that would block the removal from working. I'm wondering whether there 
would be interest in doing something similar for devicemapper's DELETE 
and REMOVE operations? From my perspective, it is quite hard for 
userspace to be able to resolve this issue. You mentioned that

 > killing the appropriate set of processes should
 > make the mount namespace go away.

But I'm not sure I understand how userspace can tell what "the
appropriate set of processes" is. Not to mention that the appropriate
set of processes could include an "innocent" process inside a container
that accidentally inherited the mount, and then a malicious process that
duplicated the mount further -- making it hard for the host to remove.
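
(About the best userspace can do today is enumerate mount namespaces by
comparing /proc/<pid>/ns/mnt links -- a sketch below -- and even that only
finds the namespaces, not which process "deserves" to be killed:)

  #include <stdio.h>
  #include <string.h>
  #include <ctype.h>
  #include <dirent.h>
  #include <unistd.h>
  #include <limits.h>

  /* List processes living in a mount namespace other than ours. */
  int main(void)
  {
          char self[PATH_MAX], other[PATH_MAX], path[PATH_MAX];
          struct dirent *de;
          DIR *proc = opendir("/proc");
          ssize_t n = readlink("/proc/self/ns/mnt", self, sizeof(self) - 1);

          if (!proc || n < 0)
                  return 1;
          self[n] = '\0';

          while ((de = readdir(proc)) != NULL) {
                  if (!isdigit((unsigned char)de->d_name[0]))
                          continue;
                  snprintf(path, sizeof(path), "/proc/%s/ns/mnt", de->d_name);
                  n = readlink(path, other, sizeof(other) - 1);
                  if (n < 0)
                          continue; /* raced with exit, or no permission */
                  other[n] = '\0';
                  if (strcmp(self, other) != 0)
                          printf("pid %s: %s\n", de->d_name, other);
          }
          closedir(proc);
          return 0;
  }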

[1]: https://github.com/opencontainers/runc/pull/1500
[2]: https://github.com/docker/docker/pull/34573
Theodore Ts'o Oct. 15, 2017, 1:06 p.m. UTC | #8
On Sun, Oct 15, 2017 at 07:53:11PM +1100, Aleksa Sarai wrote:
> This is the bug that I talked to you about at LPC, related to devicemapper
> and it not being possible to issue DELETE and REMOVE operations on a
> devicemapper device that is still mounted in $some_namespace. [Before we go
> on, deferred removal and deletion can help here, but the deferral will never
> kick in until the reference goes away. On SUSE systems, deferred removal
> doesn't appear to work at all, but that's an issue for us to solve.]

Yeah, it's not really a bug, as much as (IMHO) a profound design
misfeature in the way mount namespaces work.  And the fundamental
problem is that there are two distinct things that you might want to
do:

    * In a container, "unmount" a file system so it no longer shows up in
      your mount namespace.

    * As a system administrator, make a mounted file system **go** **away**
      because the device is about to disappear, for example because:
         * The iscsi/nbd device is gone or about to disappear (maybe the
           server is about to shut down)
         * The user wants to yank out the USB thumb drive
         * The system is shutting down and for whatever reason the shutdown
           sequence wants to remove the block device driver or otherwise
           shut down the UFS device on an Android system.

The last three are all real-world examples that people have complained
to me about, and there have been some hacky things we've done as a
result.

We sort of have a hacky solution which works today, at least for the
f2fs, ext4 and xfs file systems: you can use the {EXT4,F2FS}_IOC_SHUTDOWN
/ XFS_IOC_GOINGDOWN ioctls to forcibly shut down the file system, then
recurse over all of /proc looking at /proc/*/mounts files to see whether
the file system is mounted in that namespace, and then either kill the
pid or somehow enter the namespace of that process and unmount the file
system.  But it's ugly and messy, and it's not at all intuitive, so
people keep tripping against the problem.
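
For illustration, the shutdown half of that hack is only a few lines of C.
The three filesystems share the same ioctl number, so one definition
covers them (a sketch; the flag shown is the default "flush first, then
shut down" mode):

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdint.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <sys/ioctl.h>

  /* Same number as XFS_IOC_GOINGDOWN / EXT4_IOC_SHUTDOWN /
   * F2FS_IOC_SHUTDOWN; defined locally so the sketch is self-contained. */
  #define FS_IOC_GOINGDOWN    _IOR('X', 125, uint32_t)
  #define GOING_DOWN_DEFAULT  0x0

  int main(int argc, char **argv)
  {
          uint32_t flags = GOING_DOWN_DEFAULT;
          int fd;

          if (argc != 2) {
                  fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
                  return 1;
          }
          fd = open(argv[1], O_RDONLY | O_DIRECTORY);
          if (fd < 0 || ioctl(fd, FS_IOC_GOINGDOWN, &flags) < 0) {
                  perror(argv[1]);
                  return 1;
          }
          close(fd);
          return 0;
  }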

What we really need is some kind of "global shutdown and unmount"
system call which safely and cleanly does this.  Instead of relying on
the file system's brute-force shutdown sequence, it would be better if
we implemented some kind of VFS-level revoke(2) functionality (a la the
*BSD revoke system call, which they've had for ages) on all file
descriptors opened on the file systems, and if any processes have a
CWD in the file system, it is replaced by an anonymous
invalid/"deleted" directory, followed by a syncfs() on the file
system, followed by a recursive search and removal of the file system
from all mount namespaces.

Obviously, this could only be used by someone with root privileges on
the "root" container.

						- Ted
Eric W. Biederman Oct. 15, 2017, 10:14 p.m. UTC | #9
Theodore Ts'o <tytso@mit.edu> writes:

> On Sun, Oct 15, 2017 at 07:53:11PM +1100, Aleksa Sarai wrote:
>> This is the bug that I talked to you about at LPC, related to devicemapper
>> and it not being possible to issue DELETE and REMOVE operations on a
>> devicemapper device that is still mounted in $some_namespace. [Before we go
>> on, deferred removal and deletion can help here, but the deferral will never
>> kick in until the reference goes away. On SUSE systems, deferred removal
>> doesn't appear to work at all, but that's an issue for us to solve.]
>
> Yeah, it's not really a bug, as much as (IMHO) a profound design
> misfeature in the way mount namespaces work.  And the fundamental
> problem is that there are two distinct things that you might want to
> do:
>
>     * In a container, "unmount" a file system is it no longer shows up in
>       your mount namespace.
>
>     * As a system administrator, make a mounted file system **go** **away**
>       because the device is about to disapear, either because:
>          * The iscsi/nbd device is gone or about to disappear (maybe the server is
> 	   about to shut down)
> 	 * The user wants to yank out the USB thumb drive
> 	 * The system is shutting down and for whatever reason the shutdown
> 	   sequence wants to remove the block device driver or otherwise shutdown
> 	   the UFS device on the Android system.
>
> The last three examples are all real world examples where people have
> complained to me about, and there have been some hacky things we've
> done as a result.
>
> We sort of have a hacky solution which works today, at least for f2fs,
> ext4 and xfs file systems, which is you can use
> {EXT4,F2FS}_FS_IOC_SHUTDOWN / XFS_IOC_GOINGDOWN ioctls to forcibly
> shutdown the file system, and then recurse over all of /proc looking
> for /proc/*/mounts files and seeing if the file system is mounted in
> that namespace, and then either killing the pid or somehow entering
> the namespace of that process and unmounting the file system.  But
> it's ugly and messy, and it's not at all intuitive, and so people keep
> tripping against the problem.
>
> What we really need is some kind of "global shutdown and unmount"
> system call which safely and cleanly does this.  Instead of relying on
> the file system's brute force shutdown sequence, it would be better if
> we implemented some kind of VFS-level revoke(2) functionality (ala the
> *BSD revoke system call, which they've had for ages) on all file
> descriptors opened on the file systems, and if any processes have a
> CWD in the file system, it is replaced by an anonymous
> invalid/"deleted" directory, followed by a syncfs() on the file
> system, followed by a recursive search and removal of the file system
> from all mount namespaces.
>
> Obviously, this could only be used by someone with root privileges on
> the "root" container.

There are two practical cases here.

1) What to do if someone is actively using the filesystem?
   AKA is using the file system as a working directory or has a file on
   that filesystem open, possibly mmapped.

2) What to do if the filesystem is simply mounted in some set of mount
   namespaces.

For the second case we should be able to solve it easily with a
variation on the internal detach_mounts call that we use when the mount
point goes away.  All of the infrastructure exists.


Looking at the code, it appears the ext4, f2fs, and xfs shutdown paths
implement revoking a bdev from a filesystem.  Further, if the ext4
implementation is anything to go by, it looks like something we could
generalize into the VFS.

Hmm.  Unless I am completely mistaken we already have a super block
operation that pretty much does this in umount_begin (the guts behind
umount -f).  Which extends the set of filesystems that can support this
kind of operation to: 9p, ceph, cifs, fuse, nfs, ext4, f2fs, and xfs.

Hmm...

If the filesystem can cut all communication with the backing store, as
ext4, f2fs, and xfs can, I don't know that we need to remove the
filesystem from the mount tree.  The superblock will just be useless, not
gone.

Which means we will just have a few extra resources consumed by vfsmount
structs and a superblock, but nothing important going on.  (Especially if
we can move that operation up to the VFS from the individual
filesystems.)

Just letting the vfsmount structures sit nicely avoids the problem of
what happens if the revoked filesystem is mounted on top of something
sensitive in the mount tree that should not be revealed.



So my suggestions for this case are two fold.

- Tweak Docker and friends to not be sloppy and hold onto extra
  resources that they don't need.  That is just bad form.

- Generalize what ext4, f2fs, xfs and possibly the network filesystems
  with umount_begin are doing into a general "disconnect this filesystem
  from its backing store" operation (see the sketch below).

  That operation should be enough to drop the reference to the backing
  device so that device mapper doesn't care.

A nice to have would be to also remove, from the mount trees, those
mounts that are not problematic to remove.
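
The hypothetical shape of that operation might be something like the
following (sketch only; the names are invented and nothing like this
exists in the VFS today):

  /* A new superblock operation, modelled on what the XFS/ext4/f2fs
   * shutdown ioctls already do internally: sever the filesystem from its
   * backing store and error out all future I/O. */
  struct super_operations {
          /* ... existing operations ... */
          void (*shutdown)(struct super_block *sb);
  };

  /* VFS entry point: after this returns the superblock may linger as a
   * useless-but-harmless object, and the bdev reference can be dropped
   * so that device mapper no longer sees a holder. */
  void sb_disconnect_backing_store(struct super_block *sb)
  {
          if (sb->s_op->shutdown)
                  sb->s_op->shutdown(sb);
  }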


I won't call this a security regression with user namespaces, or a mount
namespace specific issue, as it appears any process that can open a file
has always been able to trigger this.  That said, it does appear this has
gone from a theoretical issue to an actual problem, so it is most
definitely time to address and fix this.

Further, I agree with Dave Chinner's comment about the hung task timeout.
That appears to be a separate issue, as no number of idle references to a
super block should trigger that.

Ted, Aleksa, would either of you be interested in generalizing what ext4,
f2fs, and xfs do now and working to put a good interface on it?  I can
help, especially with review, but for the short term I am rather booked.

Eric


Dave Chinner Oct. 15, 2017, 11:22 p.m. UTC | #10
On Sun, Oct 15, 2017 at 05:14:41PM -0500, Eric W. Biederman wrote:
> Theodore Ts'o <tytso@mit.edu> writes:
> 
> > On Sun, Oct 15, 2017 at 07:53:11PM +1100, Aleksa Sarai wrote:
> >> This is the bug that I talked to you about at LPC, related to devicemapper
> >> and it not being possible to issue DELETE and REMOVE operations on a
> >> devicemapper device that is still mounted in $some_namespace. [Before we go
> >> on, deferred removal and deletion can help here, but the deferral will never
> >> kick in until the reference goes away. On SUSE systems, deferred removal
> >> doesn't appear to work at all, but that's an issue for us to solve.]
> >
> > Yeah, it's not really a bug, as much as (IMHO) a profound design
> > misfeature in the way mount namespaces work.  And the fundamental
> > problem is that there are two distinct things that you might want to
> > do:
> >
> >     * In a container, "unmount" a file system is it no longer shows up in
> >       your mount namespace.
> >
> >     * As a system administrator, make a mounted file system **go** **away**
> >       because the device is about to disapear, either because:
> >          * The iscsi/nbd device is gone or about to disappear (maybe the server is
> > 	   about to shut down)
> > 	 * The user wants to yank out the USB thumb drive
> > 	 * The system is shutting down and for whatever reason the shutdown
> > 	   sequence wants to remove the block device driver or otherwise shutdown
> > 	   the UFS device on the Android system.
> >
> > The last three examples are all real world examples where people have
> > complained to me about, and there have been some hacky things we've
> > done as a result.
> >
> > We sort of have a hacky solution which works today, at least for f2fs,
> > ext4 and xfs file systems, which is you can use
> > {EXT4,F2FS}_FS_IOC_SHUTDOWN / XFS_IOC_GOINGDOWN ioctls to forcibly
> > shutdown the file system, and then recurse over all of /proc looking
> > for /proc/*/mounts files and seeing if the file system is mounted in
> > that namespace, and then either killing the pid or somehow entering
> > the namespace of that process and unmounting the file system.  But
> > it's ugly and messy, and it's not at all intuitive, and so people keep
> > tripping against the problem.
> >
> > What we really need is some kind of "global shutdown and unmount"
> > system call which safely and cleanly does this.  Instead of relying on
> > the file system's brute force shutdown sequence, it would be better if
> > we implemented some kind of VFS-level revoke(2) functionality (ala the
> > *BSD revoke system call, which they've had for ages) on all file
> > descriptors opened on the file systems, and if any processes have a
> > CWD in the file system, it is replaced by an anonymous
> > invalid/"deleted" directory, followed by a syncfs() on the file
> > system, followed by a recursive search and removal of the file system
> > from all mount namespaces.
> >
> > Obviously, this could only be used by someone with root privileges on
> > the "root" container.
> 
> There are two practical cases here.
> 
> 1) What to do if someone is actively using the filesystem?
>    AKA Is using the file system as a working directory or has a file on
>    that filesystem open possibly mmaped.
> 
> 2) What to do if the filesystem is simply mounted in some set of mount
>    namespaces.
> 
> For the second case we should be able to solve it easily with a
> variation on the internal detach_mounts call that we use when the mount
> point goes away.  All of the infrastructure exists.
> 
> 
> Looking at the code it appears ext4, f2fs, and xfs shutdown path
> implements revoking a bdev from a filesystem. 

Deja vu.

I raised the concept of a ->shutdown() superblock op more than
ten years ago to deal with the problem of a block device being
yanked out from underneath a filesystem without the fs being aware
of it. I wanted it called from the block device invalidation code
that is run when a device is yanked out so the filesystem has
immediate notification that the bdev is gone and is never coming
back....

Unfortunately, like so many of the things we wanted the VFS to
support to integrate XFS properly into Linux 10-15 years ago (e.g.
preallocation, unwritten extents, fiemap, freeze/thaw, shutdown,
etc) it was considered "that's a silly XFS problem" and so didn't get
any further until, years later, someone else suddenly realises
"you know, if we had ....".

That's when we've finally been able to either just lift the XFS
ioctl interface to the VFS or implement a new API that ends up
being a thin wrapper around existing XFS functionality.... :P

> Further if the ext4 implementation is anything to go by it looks
> like something we could generalize into the vfs.

In case you hadn't guessed by now, shutdown didn't originate in ext4.

The ext4 shutdown code (EXT4_IOC_SHUTDOWN) and the f2fs code
(F2FS_IOC_SHUTDOWN) are all copies of the XFS shutdown ioctl
interface (that's the 'X' in the ioctl definition). They got added
so various XFS specific filesystem and data integrity tests in
fstests that relied on the shutdown interface could be run on f2fs,
and then more recently ext4.

Historically, stuff like this only gets pulled up to the VFS when
someone on the XFS side says "stop copy-n-pasting ioctl definitions
and define it generically!".  Hence if we pull it up to the VFS, it
needs to take its cues from the XFS semantics, not the ext4
reimplementation. They should be the same, but copy-n-paste has a
habit of changing subtle stuff....

> Hmm.  Unless I am completely mistaken we already have a super block
> operation that pretty much does this in umount_begin (the guts behind
> umount -f).  Which brings the set of filesystems that can support this
> kind of operation to: 9p, ceph, cifs, fuse, nfs, ext4, f2fs, and xfs.

I'm not sure that a shutdown operation is correct here.  A shutdown
operation on a filesystem like XFS chops off all IO in flight, stops
all modifications in flight, triggers EIO on all current and future
IO and refuses to allow anyone to read anything from the filesystem.
Shutdown is a nasty, hard stop for a filesystem, not a "prepare to
unmount" operation.

Indeed, shutdowns can occur during unmount (e.g. corruption occurs
due to metadata writeback errors when flushing the journal) which
leave us with a tricky situation if unmount is supposed to use
shutdowns to force unmounts...

It must also be said here that a shutdown of an active, valid
filesystem *will* cause unrecoverable data loss. And if it's your
root filesystem, shutdown will be the last thing the system is able
to do correctly until it is forcibly rebooted by hand (because the
reboot command won't be able to be loaded from the fs)....

> If the filesystem can cut all communication with the backing store, as
> ext4, f2fs, and xfs can, I don't know that we need to remove the
> filesystem from the mount tree.  The superblock will just be useless, not
> gone.

So the problem that often occurs with this is that the sb can still
hold a reference to the block device and so the device cannot be
remounted (will give block device EBUSY errors) even though the user
cannot find a reference to an active mount or user of the block
device.

> So my suggestions for this case are twofold.
> 
> - Tweak Docker and friends to not be sloppy and hold onto extra
>   resources that they don't need.  That is just bad form.
> 
> - Generalize what ext4, f2fs, xfs and possibly the network filesystems
>   with umount_begin are doing into a general "disconnect this filesystem
>   from its backing store" operation.
> 
>   That operation should be enough to drop the reference to the backing
>   device so that device mapper doesn't care.

Define what the semantics of a forced filesystem unmount are supposed
to be first, then decide whether an existing shutdown operation can be
used. It may be we just need a new flag to the existing API to
implement slightly different semantics (e.g. to silence unnecessary
warnings), but at minimum I think we need the ->unmount_begin op
name to change to indicate its function, not document the calling
context...

Cheers,

Dave.
Theodore Ts'o Oct. 16, 2017, 1:13 a.m. UTC | #11
On Sun, Oct 15, 2017 at 05:14:41PM -0500, Eric W. Biederman wrote:
> 
> Looking at the code it appears the ext4, f2fs, and xfs shutdown paths
> implement revoking a bdev from a filesystem.  Further if the ext4
> implementation is anything to go by it looks like something we could
> generalize into the vfs.

There are two things which the current file system shutdown paths do.
The first is that they prevent the file system from attempting to
write to the bdev.  That's all very file system specific, and can't be
generalized into the VFS.

The second thing they do is they cause system calls which might modify
the file system to return an error.  Currently operations that might
result in _reads_ are not shut down, so it's not a true revoke(2)
functionality ala *BSD.  I assume that's what you are talking about
generalizing into the VFS.  Personally, I would prefer to see us
generalize something like vhangup() but which works on a file
descriptor, not just a TTY.  That is, it disconnects the file
descriptor entirely from the hardware / file system so that, in the
case of the tty, it can be used by another login session, and in the
case of a file descriptor belonging to a file system, it no longer
keeps the file system from being unmounted.

The reason why I want to see something which actually does disconnect
the file descriptor and replaces the CWD for processes is because the
shutdown path is inherently a fairly violent procedure.  It forcibly
disconnects the file system from the block device, which means we
can't do a clean unmount.  If the block device has disappeared, we
don't have a choice, and that's the only thing we can do.

But if the USB thumb drive hasn't been yanked out yet, it would be
nice to simply disconnect everything at the file descriptor level,
then do a normal umount in all of the namespaces.  This will allow the
file system to be cleanly unmounted, which means it will work for
those file systems which don't have journals.  (The ext4/btrfs/xfs
shutdown paths rely on the file system's journalling capability
because it causes the block device to look like the system had been
powered off, and so you have to replay the journal when file system is
remounted.  That's fine, but it's not going to work on file systems
like VFAT, or worse, on NTFS, where we don't correctly support the
native file system's logging capability, so power-fail or a forced
shutdown may end up corrupting the file system enough that you have
to boot into Windows to run CHKDSK.EXE in order to clean it up.)

> I won't call this a security regression with user namespaces, or a mount
> namespace specific issue as it appears any process that can open a file
> has always been able to trigger this.  That said it does appear this has
> gone from a theoretical issue to an actual problem so it is most
> definitely time to address and fix this.

The reason why this has started to become a really painful issue is that
mount namespaces are extremely painful to debug.  You can't iterate
over all of the possible namespaces without having to iterate over
every process and every thread's /proc/*/mounts.  So if there are 10
mount namespaces, but 10,000 threads, that's 10,000 /proc/*/mounts
files that you have to examine individually, just to *find* that needle
in the haystack which is keeping the mount pinned.
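
To make that concrete, a rough sketch of the userspace search (error
handling omitted; it just matches the device name at the start of each
mounts line):

    #include <ctype.h>
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    /* Print each pid whose mount namespace still has 'dev' mounted. */
    static void find_pinners(const char *dev)
    {
            char path[64], line[4096];
            struct dirent *de;
            DIR *proc = opendir("/proc");

            if (!proc)
                    return;
            while ((de = readdir(proc)) != NULL) {
                    FILE *f;

                    if (!isdigit((unsigned char)de->d_name[0]))
                            continue;       /* not a pid directory */
                    snprintf(path, sizeof(path), "/proc/%s/mounts",
                             de->d_name);
                    f = fopen(path, "r");
                    if (!f)
                            continue;
                    while (fgets(line, sizeof(line), f)) {
                            if (!strncmp(line, dev, strlen(dev))) {
                                    printf("%s\n", de->d_name);
                                    break;
                            }
                    }
                    fclose(f);
            }
            closedir(proc);
    }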

And there is at least one system daemon which, when it starts, opens
a new mount namespace.  Which means if you happen to have a USB
thumbdrive mounted when you restart that system daemon, it is not at
all obvious what you need to do to unmount the USB thumb drive.  And
yes, any process can open a file, but people at least knew how to find
it by using lsof.  So perhaps this could be solved partially by better
tooling, but I would argue that having userspace do a brute force
search through all of /proc/*/fd/* actually wasn't particularly clean,
either.  We've lived with it, true.  But perhaps it's time to do
something better?

And in the case where the system daemon only accidentally captured the
USB thumb drive by forking a mount namespace, it would be *definitely*
nice if we could simply make the mount disappear, instead of what we
have to do today, which is kill the system daemon, unmount the USB thumb
drive, and then restart the system daemon.  (Once, that is, you have
figured out what is going on first.  Often, you end up reporting an ext4
bug to the ext4 maintainer first, because you think it's an ext4 bug
that ext4 is refusing to unmount the file system.  After all, you are
an experienced system administrator, so you *know* that *nothing*
could *possibly* be keeping the file system busy.  lsof said so.  :-)

> Ted, Aleksa would either of you be interested in generalizing what ext4,
> f2fs, and xfs does now and working to put a good interface on it?  I can
> help especially with review but for the short term I am rather booked.

Unfortunately, I have way too much travel coming up in the short term,
so I probably won't be able to take on a new project until at least
mid-to-late-November at the earliest.  Aleksa, do you have time?  I
can consult on a design, but I have zero coding time for the next
couple of weeks.

	     	    	       	       	  - Ted
Eric W. Biederman Oct. 16, 2017, 5:44 p.m. UTC | #12
Dave Chinner <david@fromorbit.com> writes:

> [...]
>
> Define what the semantics of a forced filesystem unmount are supposed
> to be first, then decide whether an existing shutdown operation can be
> used. It may be we just need a new flag to the existing API to
> implement slightly different semantics (e.g. to silence unnecessary
> warnings), but at minimum I think we need the ->unmount_begin op
> name to change to indicate its function, not document the calling
> context...

Defining the expected semantics of a forced filesystem umount is the
same process as defining the semantics of a forced filesystem shutdown.
Look at the code and see what it does and document it.

That said, from a quick skim through the umount_begin methods on network
filesystems (the only filesystems that implement umount_begin), it
appears they drop the network connection.  Which is pretty much the same
as dropping the connection to the bdev.  So I think there is good reason
to believe these two cases can be unified.  They may not be exactly the
same but they are close enough that they should be able to share common
infrastructure.

Eric

Eric W. Biederman Oct. 16, 2017, 5:53 p.m. UTC | #13
Theodore Ts'o <tytso@mit.edu> writes:

> [...]

I would prefer that we start with what we can do easily.  There is a
danger in working on revoke-like actions that a high cost will be paid
to get nice semantics for a rare case.

We can easily set a flag on the superblock and disconnect from the
backing store.  A generic shutdown derived from the existing ioctls.

Before that, we can easily request that the filesystem be remounted
read-only.  We may not succeed (as someone may have something open for
write) but that code path exists and it is easy to use.

Tracking down all instances of struct file and all instances of struct
path that reference a filesystem is expensive today, and adding a list
to make that cheap would also be expensive.  So I don't know that we
want to do that.


So the best option that I see right now is to remount the superblock
read-only, followed by disconnecting the superblock from the backing
store, with a filesystem sync in between to ensure that as many of the
changes as possible are flushed to disk.
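
A minimal sketch of that sequence using only existing interfaces (the
final "disconnect from the backing store" step is the part that does
not exist generically yet):

    #define _GNU_SOURCE             /* for syncfs() */
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mount.h>

    static int quiesce_fs(const char *mntpt)
    {
            int fd;

            /* May fail with EBUSY if anything is open for write. */
            if (mount(NULL, mntpt, NULL, MS_REMOUNT | MS_RDONLY, NULL))
                    return -1;
            fd = open(mntpt, O_RDONLY | O_DIRECTORY);
            if (fd < 0)
                    return -1;
            syncfs(fd);     /* flush what we can before disconnecting */
            close(fd);
            /* ...and here the generic "disconnect" op would go. */
            return 0;
    }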

Eric
Theodore Ts'o Oct. 16, 2017, 6:50 p.m. UTC | #14
On Mon, Oct 16, 2017 at 12:53:53PM -0500, Eric W. Biederman wrote:
> I would prefer that we start with what we can do easily.  There is a
> danger in working on revoke-like actions that a high cost will be paid
> to get nice semantics for a rare case.

Users wanting to remove USB thumb drives do *not* seem like a rare
case to me....  and disconnecting from the backing store without
corrupting the file system is not necessarily trivial.

> Before that, we can easily request that the filesystem be remounted
> read-only.  We may not succeed (as someone may have something open for
> write) but that code path exists and it is easy to use.
> 
> Tracking down all instances of struct file and all instances of struct
> path that reference a filesystem is expensive today, and adding a list
> to make that cheap would also be expensive.  So I don't know that we
> want to do that.

It's not that expensive.  After all, system administrators are
running lsof all the time to work around this sort of problem.  And
doing this kind of forced unmount is not a high-performance critical
path.  And just like what a human system administrator will be doing,
if there are no struct files holding the file system open, we won't
need to scan all of the struct files.  If there is something holding
the file system busy, either the kernel is going to have to scan all
of struct files, or we are going to be forcing the system
administrator to paw through all of /proc/*/fd/* and /proc/*/mounts
looking for the needle in the haystack.

I claim it's ***always*** going to be faster to do it in the kernel
than forcing the system administrator to suck the information through
the thin straw which is /proc/*/mounts and /proc/*/fd/*.

    	       	     		       	   - Ted
Dave Chinner Oct. 16, 2017, 9:38 p.m. UTC | #15
On Mon, Oct 16, 2017 at 12:44:33PM -0500, Eric W. Biederman wrote:
> Dave Chinner <david@fromorbit.com> writes:
> > [...]
> 
> Defining the expected semantics of a forced filesystem umount is the
> same process as defining the semantics of a forced filesystem shutdown.
> Look at the code and see what it does and document it.
> 
> That said, from a quick skim through the umount_begin methods on network
> filesystems (the only filesystems that implement umount_begin), it
> appears they drop the network connection.  Which is pretty much the same
> as dropping the connection to the bdev.

Sure. But what do they do to incoming attempts to read from the
filesystem, or modify/write it in some way? That behaviour really
isn't defined anywhere, and we need to define it so that *all
filesystems behave the same way*.

As I've already pointed out, ext4 implements different shutdown
semantics to XFS even though it copied the user API to trigger
shutdowns. That's not an acceptable situation for a generic shutdown
operation....

> So I think there is good reason
> to believe these two cases can be unified.  They may not be exactly the
> same but they are close enough that they should be able to share common
> infrastructure.

You're making it sound so much simpler than it really is...

Cheers,

Dave.
Dave Chinner Oct. 16, 2017, 10 p.m. UTC | #16
On Sun, Oct 15, 2017 at 09:13:01PM -0400, Theodore Ts'o wrote:
> On Sun, Oct 15, 2017 at 05:14:41PM -0500, Eric W. Biederman wrote:
> > 
> > Looking at the code it appears the ext4, f2fs, and xfs shutdown paths
> > implement revoking a bdev from a filesystem.  Further if the ext4
> > implementation is anything to go by it looks like something we could
> > generalize into the vfs.
> 
> There are two things which the current file system shutdown paths do.
> The first is that they prevent the file system from attempting to
> write to the bdev.  That's all very file system specific, and can't be
> generalized into the VFS.
> 
> The second thing they do is they cause system calls which might modify
> the file system to return an error.  Currently operations that might
> result in _reads_ are not shut down, so it's not a true revoke(2)

Which is different to XFS - XFS shuts down all access to the bdev,
read or write, and returns EIO to the callers.(*) IOWs, a UAPI has
been copy-and-pasted, but the behaviour and semantics have not been
copied/duplicated. The requirement was "close enough for fstests to
use it" not "must behave the exact same way on all filesystems".

From that perspective, I agree with Ted:  we need an interface
that provides userspace with sane, consistent semantics and allows
filesystems to be disconnected from userspace references so they can be
unmounted. What the filesystem then does to disconnect itself from
the block device and allow unmount to complete (i.e. the shutdown
implementation) is irrelevant to the VFS and users....

Cheers,

Dave.

(*) Shutdown in XFS is primarily intended to prevent propagating
damage in the filesystem when corruption is first detected or an
unrecoverable error occurs. Historically speaking, it is always
initiated by the XFS filesystem itself, so whatever we do to provide
unmount sanity isn't going to replace internal filesystem shutdown
functionality...
Aleksa Sarai Oct. 17, 2017, 12:59 a.m. UTC | #17
> [...]
>
> The second thing they do is they cause system calls which might modify
> the file system to return an error.  Currently operations that might
> result in _reads_ are not shut down, so it's not a true revoke(2)
> functionality ala *BSD.  I assume that's what you are talking about
> generalizing into the VFS.  Personally, I would prefer to see us
> generalize something like vhangup() but which works on a file
> descriptor, not just a TTY.  That is, it disconnects the file
> descriptor entirely from the hardware / file system so that, in the
> case of the tty, it can be used by another login session, and in the
> case of a file descriptor belonging to a file system, it no longer
> keeps the file system from being unmounted.
Presumably the fd would just be used to specify the backing store? I was 
imagining doing it through an additional umount(2) flag but I guess that 
having an fd open is probably better form.

I'm a little confused about whether this actually will solve the 
original problem though, because it still requires the iteration over 
/proc/**/mounts in order for userspace to finish the unmounts. I feel 
like this is trying to generalise the idea behind luksSuspend -- am I 
misunderstanding how this would solve the original issue? Is it the case 
that if we "disconnect" at the file descriptor level, then the bdev is 
no longer considered "used" and it can be operated on safely?

>> Ted, Aleksa would either of you be interested in generalizing what ext4,
>> f2fs, and xfs does now and working to put a good interface on it?  I can
>> help especially with review but for the short term I am rather booked.
> 
> Unfortunately, I have way too much travel coming up in the short term,
> so I probably won't be able to take on a new project until at least
> mid-to-late-November at the earliest.  Aleksa, do you have time?  I
> can consult on a design, but I have zero coding time for the next
> couple of weeks.

I can give it a shot, but a quick disclaimer that I'm not very familiar 
with the VFS codebase so the review cycle will probably take a while. Oh 
well, it's a good opportunity for me to learn more about it. :D
Theodore Ts'o Oct. 17, 2017, 1:34 a.m. UTC | #18
On Tue, Oct 17, 2017 at 09:00:11AM +1100, Dave Chinner wrote:
> > The second thing they do is they cause system calls which might modify
> > the file system to return an error.  Currently operations that might
> > result in _reads_ are not shut down, so it's not a true revoke(2)
> 
> Which is different to XFS - XFS shuts down all access to the bdev,
> read or write, and returns EIO to the callers.(*) IOWs, a UAPI has
> been copy-and-pasted, but the behaviour and semantics have not been
> copied/duplicated. The requirement was "close enough for fstests to
> use it" not "must behave the exact same way on all filesystems".

Funny story, we were actually trying to copy XFS's semantics.
Originally I had a shutdown test in ext4's readpage() and readpages()
--- but during the code review, someone pointed out that xfs didn't
have those tests in xfs_vm_readpage() and xfs_rm_readpages().  Since
it didn't really matter for my intended use case, I ended up removing
the checks from ext4's readpage() and readpages() functions.

What we didn't notice was that the shutdown test was in
xfs_get_blocks().
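
(The test in question is just the usual one-liner shutdown check;
sketched from memory here, not copied from either tree:)

    /* XFS, e.g. at the top of xfs_get_blocks(): */
    if (XFS_FORCED_SHUTDOWN(XFS_I(inode)->i_mount))
            return -EIO;

    /* and the ext4 equivalent after EXT4_IOC_SHUTDOWN: */
    if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
            return -EIO;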

> From that perspective, I agree with Ted:  we need an interface
> that provides userspace with sane, consistent semantics and allows
> filesystems to be disconnected from userspace references so they can be
> unmounted. What the filesystem then does to disconnect itself from
> the block device and allow unmount to complete (i.e. the shutdown
> implementation) is irrelevant to the VFS and users....


I agree that we should formally define what the semantics should be.
I also believe it should work even if the file system doesn't support
journals, and the file system should be left in a consistent state if
possible, since there are three different, distinct use cases:

* The file system is damaged, so we want to avoid making any changes
  to the file system to minimize further damage.
* The block device has already disappeared, and we are trying to minimize the
  block I/O errors.  We would also prefer that attempts to write dirty pages
  via the writeback path not cause a kernel panic.
* The user wants to remove the USB thumb drive, but hasn't removed it yet, and the
  system would like to cleanly unmount the file system and leave the thumb drive
  in a consistent state --- without having to force userspace to brute force
  search through /proc and terminate processes just to allow the device to be
  unmounted.

					- Ted
					
Jan Kara Oct. 17, 2017, 9:20 a.m. UTC | #19
On Tue 17-10-17 11:59:50, Aleksa Sarai wrote:
> > [...]
> Presumably the fd would just be used to specify the backing store? I was
> imagining doing it through an additional umount(2) flag but I guess that
> having an fd open is probably better form.
> 
> I'm a little confused about whether this actually will solve the original
> problem though, because it still requires the iteration over /proc/**/mounts
> in order for userspace to finish the unmounts. I feel like this is trying to
> generalise the idea behind luksSuspend -- am I misunderstanding how this
> would solve the original issue? Is it the case that if we "disconnect" at
> the file descriptor level, then the bdev is no longer considered "used" and
> it can be operated on safely?

So umount(2) is essentially a directory tree operation - detach filesystem
mounted on 'dir' from the directory tree. If this was the last point where
the superblock was mounted, we also clean up the superblock and release the
underlying device.
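
(For comparison, the existing per-mountpoint knobs, which only operate
on the caller's own namespace:)

    #include <sys/mount.h>

    umount2("/mnt/usb", MNT_FORCE);   /* umount -f: invokes ->umount_begin() */
    umount2("/mnt/usb", MNT_DETACH);  /* umount -l: lazily detach from the tree */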

The operation we are speaking about here is different. It is more along the
lines of "release this device".  And in the current world of containers,
mount namespaces, etc. it is not trivial for userspace to implement this
using umount(2) as Ted points out. I believe we could do that by walking
through all mount points of a superblock and unmounting them (and I don't
want to get into a discussion about how to efficiently implement that now, but in
principle the kernel has all the necessary information).

And then there's another dimension to this problem (and I believe it is
good to explicitly distinguish this) - what to do if someone is actually
using some of the mountpoints, either by having a CWD there, having a file
open there, or having something else mounted underneath. umount(2) returns
EBUSY in these cases, which is impractical for some use cases. And I
believe the proposal here is to "invalidate" open file descriptors through
revoke, then put the superblock into a quiescent state and make the
filesystem stop accessing the device (and probably release the device
reference so that the device is really free).

I think we could implement the "put the superblock into a quiescent state
and make the filesystem stop accessing the device" part by a mechanism
similar to filesystem freezing. That already implements putting a
filesystem into a quiescent state while it is still in use. We would have
to modify sb_start_write() etc. to be able to return errors in case write
access is revoked and never coming back (instead of blocking forever) but
that should be doable, and in my opinion it is easier than trying to tweak
fs shutdown to result in a consistent filesystem. The read part could be
handled as well by putting checks in strategic places.
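
Roughly what I have in mind - a sketch only, where SB_I_REVOKED is a
made-up superblock flag and the ordering details would need real care:

    /* sketch: an error-returning variant of sb_start_write() */
    static inline int sb_start_write_revokable(struct super_block *sb)
    {
            if (sb->s_iflags & SB_I_REVOKED)        /* hypothetical flag */
                    return -EIO;
            __sb_start_write(sb, SB_FREEZE_WRITE, true);
            /* recheck after taking the freeze reference to close the race */
            if (unlikely(sb->s_iflags & SB_I_REVOKED)) {
                    __sb_end_write(sb, SB_FREEZE_WRITE);
                    return -EIO;
            }
            return 0;
    }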

What I'm a bit concerned about is the "release device reference" part - for
a block device to stop looking busy we have to do that, however then the
block device can go away and the filesystem isn't prepared for that - we
reference sb->s_bdev in lots of places, we have buffer heads which are part
of bdev page cache, and probably other indirect assumptions I forgot about
now. One solution to this is to not just stop accessing the device but
truly clean up the filesystem up to a point where it is practically
unmounted. I like this solution more but we have to be careful to block
any access attempts high enough in the VFS, ideally before ever entering fs code.

Another option would be to do something similar to what we do when the
device just gets unplugged under our hands - we detach bdev from gendisk,
leave it dangling and invisible. But we would still somehow have to
convince DM that the bdev practically went away by calling
disk->fops->release() and it all just seems fragile to me. But I wanted to
mention this option in case the above solution proves to be too difficult.

								Honza
Theodore Ts'o Oct. 17, 2017, 2:12 p.m. UTC | #20
On Tue, Oct 17, 2017 at 11:20:17AM +0200, Jan Kara wrote:
> The operation we are speaking about here is different. It is more along the
> lines of "release this device".  And in the current world of containers,
> mount namespaces, etc. it is not trivial for userspace to implement this
> using umount(2) as Ted points out. I believe we could do that by walking
> through all mount points of a superblock and unmounting them (and I don't
> want to get into a discussion about how to efficiently implement that now, but in
> principle the kernel has all the necessary information).

Yes, this is what I want.  And regardless of how efficiently or not
the kernel can implement such an operation, by definition it will be
more efficient than if we have to do it in userspace.  (And I don't
think it has to be super-efficient, since this is not a hot-path.  So
for the record, I wouldn't want to add any extra linked list
references, etc.)

> What I'm a bit concerned about is the "release device reference" part - for
> a block device to stop looking busy we have to do that, however then the
> block device can go away and the filesystem isn't prepared for that - we
> reference sb->s_bdev in lots of places, we have buffer heads which are part
> of bdev page cache, and probably other indirect assumptions I forgot about
> now. One solution to this is to not just stop accessing the device but
> truly clean up the filesystem up to a point where it is practically
> unmounted. I like this solution more but we have to be careful to block
> any access attempts high enough in the VFS, ideally before ever entering fs code.

Right, so the first step would be to block access attempts high up in the
VFS.  The second would be to point any file descriptors at a revoked
NULL struct file, also redirect any task struct's CWD so it is as if
the directory had gotten rmdir'ed, and also munmap any mapped regions.
At that point, all of the file descriptors will be closed.  The third
step would be to do a syncfs(), which will force out any dirty pages.
And then finally, call umount() in all of the namespaces, which
will naturally take care of any buffer or page cache references once
the ref count of the struct super goes to zero.

This all doesn't have to be a single system call.  Perhaps it would
make sense for the first and second steps to be one system call --- call it
revokefs(2), perhaps.  And then the last step could be another system
call --- maybe umountall(2).
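
In other words, something like this entirely hypothetical sketch
(neither syscall exists, and whether the caller's own fd would survive
revokefs() is an open question):

    int fd = open("/mnt/usb", O_RDONLY | O_DIRECTORY);

    revokefs(fd);   /* steps 1+2: block VFS access, revoke fds/CWDs/mmaps */
    syncfs(fd);     /* step 3, assuming the caller's fd is spared */
    umountall(fd);  /* final step: unmount from every mount namespace */
    close(fd);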

> Another option would be to do something similar to what we do when the
> device just gets unplugged under our hands - we detach bdev from gendisk,
> leave it dangling and invisible. But we would still somehow have to
> convince DM that the bdev practically went away by calling
> disk->fops->release() and it all just seems fragile to me. But I wanted to
> mention this option in case the above solution proves to be too difficult.

Yeah, that's just as fragile as using the ext4/xfs/f2fs
shutdown/goingdown ioctl.  In order to do this right I really think we
need to get the VFS involved, so it can be a real, clean unmount, as
opposed to something where we just rip the file system away from the
bdev.

						- Ted
						
Luis Chamberlain Nov. 6, 2017, 7:25 p.m. UTC | #21
On Tue, Oct 17, 2017 at 7:12 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Oct 17, 2017 at 11:20:17AM +0200, Jan Kara wrote:
>> The operation we are speaking about here is different. It is more along the
>> lines of "release this device".  And in the current world of containers,
>> mount namespaces, etc. it is not trivial for userspace to implement this
>> using umount(2) as Ted points out. I believe we could do that by walking
>> through all mount points of a superblock and unmounting them (and I don't
>> want to get into a discussion about how to efficiently implement that now, but in
>> principle the kernel has all the necessary information).
>
> Yes, this is what I want.  And regardless of how efficiently or not
> the kernel can implement such an operation, by definition it will be
> more efficient than if we have to do it in userspace.

It seems most folks agree we could all benefit from this, to help
userspace with a sane implementation.

>> What I'm a bit concerned about is the "release device reference" part - for
>> a block device to stop looking busy we have to do that, however then the
>> block device can go away and the filesystem isn't prepared for that - we
>> reference sb->s_bdev in lots of places, we have buffer heads which are part
>> of bdev page cache, and probably other indirect assumptions I forgot about
>> now.

Is this new operation really the only place where this type of work
could be useful, or are there existing use cases this sort of
functionality could also be used for?

For instance I don't think we do something similar to revokefs(2) (as
described below) when a device has been removed from a system; you
seem to suggest we remove the dev from gendisk leaving it dangling and
invisible. But other than this, it would seem it's up to the filesystem
to get anything else implemented correctly?

> This all doesn't have to be a single system call.  Perhaps it would
> make sense for the first and second steps to be one system call --- call it
> revokefs(2), perhaps.  And then the last step could be another system
> call --- maybe umountall(2).

Wouldn't *some* part of this also help *enhance* filesystem suspend /
thaw to be used on system suspend / resume as well?

If I may, if we split these up into two, say revokefs(2) and
umountall(2), how about:

a) revokefs(2): ensures all file descriptors for the fs are closed
   - blocks access attempts high up in VFS
   - point any file descriptor to a revoked null struct file
   - redirect any task struct's CWD as if the directory had been rmdir'd
   - munmap any mapped regions

Of these only the first one seems useful for fs suspend?

b) umountall(2): properly unmounts filesystem from all namespaces
   - May need to verify if revokefs(2) was called; if so, now that all
     file descriptors should be closed, do syncfs() to force out any
     dirty pages
   - unmount() in all namespaces, this takes care of any buffer or page
     cache reference once the ref count of the struct super block goes to
     to zero

  Luis
Jan Kara Nov. 7, 2017, 3:26 p.m. UTC | #22
On Mon 06-11-17 11:25:34, Luis R. Rodriguez wrote:
> On Tue, Oct 17, 2017 at 7:12 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> [...]
>
> >> What I'm a bit concerned about is the "release device reference" part - for
> >> a block device to stop looking busy we have to do that, however then the
> >> block device can go away and the filesystem isn't prepared for that - we
> >> reference sb->s_bdev in lots of places, we have buffer heads which are part
> >> of bdev page cache, and probably other indirect assumptions I forgot about
> >> now.
> 
> Is this new operation really the only place where this type of work
> could be useful, or are there existing use cases this sort of
> functionality could also be used for?

The functionality of being able to "invalidate" an open file descriptor so
that it no longer points to the object it used to is useful for other
cases as well, I guess...

> For instance I don't think we do something similar to revokefs(2) (as
> described below) when a device has been removed from a system; you
> seem to suggest we remove the dev from gendisk leaving it dangling and
> invisible. But other than this, it would seem it's up to the filesystem
> to get anything else implemented correctly?

Yes, that's the current situation. When the device is yanked from under a
filesystem, the current implementation makes it relatively straightforward
from the fs POV - as far as the fs cares, the underlying device still
exists. It just returns errors for any IO done to it. It is up to the fs
implementation to deal with that and be able to shut itself down correctly
in such a case.

> > This all doesn't have to be a single system call.  Perhaps it would
> > make sense for the first and second steps to be one system call --- call it
> > revokefs(2), perhaps.  And then the last step could be another system
> > call --- maybe umountall(2).
> 
> Wouldn't *some* part of this also help *enhance* filesystem suspend /
> thaw to be used on system suspend / resume as well?
>
> If I may, if we split these up into two, say revokefs(2) and
> umountall(2), how about:
> 
> a) revokefs(2): ensures all file descriptors for the fs are closed
>    - blocks access attempts high up in VFS
>    - point any file descriptor to a revoked null struct file
>    - redirect any task struct's CWD as if the directory had been rmdir'd
>    - munmap any mapped regions
> 
> Of these only the first one seems useful for fs suspend?

If you mean "blocks access attempts high up in VFS", that already
happens for writes when you freeze the filesystem. Also suspend is
different in that userspace is already frozen when you get to freezing
filesystems, so you care only about in-kernel users, and there you do not
have a standard set of entry points anyway... So I don't see much overlap
with system suspend here.

> 
> b) umountall(2): properly unmounts filesystem from all namespaces
>    - May need to verify if revokefs(2) was called; if so, now that all
>      file descriptors should be closed, do syncfs() to force out any
>      dirty pages

IMHO it doesn't need to verify this. The unmount will just fail if someone
is still using the fs.

>    - unmount() in all namespaces, this takes care of any buffer or page
>      cache reference once the ref count of the struct super block goes to
>      to zero

								Honza
diff mbox

Patch

From 6161b656c3c1df43bc552a60056bc12d7913a90b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Mon, 4 Sep 2017 14:26:38 +0200
Subject: [PATCH] xfs: Debug when page can get dirty without buffers

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_aops.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6bf120bb1a17..117535086bc7 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1380,6 +1380,7 @@  xfs_vm_set_page_dirty(
 	offset = page_offset(page);
 
 	spin_lock(&mapping->private_lock);
+	WARN_ON_ONCE(!page_has_buffers(page));
 	if (page_has_buffers(page)) {
 		struct buffer_head *head = page_buffers(page);
 		struct buffer_head *bh = head;
-- 
2.12.3