Message ID | 20240823110439.1585041-6-leo.lilong@huawei.com (mailing list archive)
---|---
State | Deferred, archived
Series | xfs: fix and cleanups for log item push
On Fri, Aug 23, 2024 at 07:04:39PM +0800, Long Li wrote:
> KASAN reported a UAF bug during fault injection testing:
>
> ==================================================================
> BUG: KASAN: use-after-free in xfs_inode_item_push+0x2db/0x2f0
> Read of size 8 at addr ffff888022f74788 by task xfsaild/sda/479
>
> CPU: 0 PID: 479 Comm: xfsaild/sda Not tainted 6.2.0-rc7-00003-ga8a43e2eb5f6 #89
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0x51/0x6a
>  print_report+0x171/0x4a6
>  kasan_report+0xb7/0x130
>  xfs_inode_item_push+0x2db/0x2f0
>  xfsaild+0x729/0x1f70
>  kthread+0x290/0x340
>  ret_from_fork+0x1f/0x30
>  </TASK>
>
> Allocated by task 494:
>  kasan_save_stack+0x22/0x40
>  kasan_set_track+0x25/0x30
>  __kasan_slab_alloc+0x58/0x70
>  kmem_cache_alloc+0x197/0x5d0
>  xfs_inode_item_init+0x62/0x170
>  xfs_trans_ijoin+0x15e/0x240
>  xfs_init_new_inode+0x573/0x1820
>  xfs_create+0x6a1/0x1020
>  xfs_generic_create+0x544/0x5d0
>  vfs_mkdir+0x5d0/0x980
>  do_mkdirat+0x14e/0x220
>  __x64_sys_mkdir+0x6a/0x80
>  do_syscall_64+0x39/0x80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> Freed by task 14:
>  kasan_save_stack+0x22/0x40
>  kasan_set_track+0x25/0x30
>  kasan_save_free_info+0x2e/0x40
>  __kasan_slab_free+0x114/0x1b0
>  kmem_cache_free+0xee/0x4e0
>  xfs_inode_free_callback+0x187/0x2a0
>  rcu_do_batch+0x317/0xce0
>  rcu_core+0x686/0xa90
>  __do_softirq+0x1b6/0x626
>
> The buggy address belongs to the object at ffff888022f74758
>  which belongs to the cache xfs_ili of size 200
> The buggy address is located 48 bytes inside of
>  200-byte region [ffff888022f74758, ffff888022f74820)
>
> The buggy address belongs to the physical page:
> page:ffffea00008bdd00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22f74
> head:ffffea00008bdd00 order:1 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0
> flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
> raw: 001fffff80010200 ffff888010ed4040 ffffea00008b2510 ffffea00008bde10
> raw: 0000000000000000 00000000001a001a 00000001ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  ffff888022f74680: 00 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc
>  ffff888022f74700: fc fc fc fc fc fc fc fc fc fc fc fa fb fb fb fb
> >ffff888022f74780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                            ^
>  ffff888022f74800: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff888022f74880: fc fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==================================================================
>
> Pushing an inode item in xfsaild can race with the inode reclaim task;
> consider the following call graph, where both tasks deal with the same
> inode.  While flushing the cluster under shutdown conditions, the push
> enters xfs_iflush_abort(), which clears the inode's XFS_IFLUSHING flag
> and sets lip->li_buf to NULL.  Concurrently, the inode can be reclaimed
> under shutdown conditions: because lip->li_buf is NULL at this point,
> reclaim does not need to wait on the xfs buf lock, so if the xfsaild
> task is scheduled out while flushing the cluster, the inode can be
> freed via an RCU callback.  It is therefore unsafe to reference lip
> after flushing the cluster in xfs_inode_item_push().
>
> <log item is in AIL>
> <filesystem shutdown>
> spin_lock(&ailp->ail_lock)
> xfs_inode_item_push(lip)
>   xfs_buf_trylock(bp)
>   spin_unlock(&lip->li_ailp->ail_lock)
>   xfs_iflush_cluster(bp)
>     if (xfs_is_shutdown())
>       xfs_iflush_abort(ip)
>         xfs_trans_ail_delete(ip)
>           spin_lock(&ailp->ail_lock)
>           spin_unlock(&ailp->ail_lock)
>         xfs_iflush_abort_clean(ip)
>       error = -EIO
>       <log item removed from AIL>
>       <log item li_buf set to null>
>     if (error)
>       xfs_force_shutdown()
>         xlog_shutdown_wait(mp->m_log)
>           might_sleep()
>                                     xfs_reclaim_inode(ip)
>                                       if (shutdown)
>                                         xfs_iflush_shutdown_abort(ip)
>                                           if (!bp)
>                                             xfs_iflush_abort(ip)
>                                             return
>                                       __xfs_inode_free(ip)
>                                         call_rcu(ip, xfs_inode_free_callback)
>                                     ......
>                                     <rcu grace period expires>
>                                     <rcu free callbacks run somewhere>
>                                     xfs_inode_free_callback(ip)
>                                       kmem_cache_free(ip->i_itemp)
>   ......
>   <starts running again>
>   xfs_buf_ioend_fail(bp);
>     xfs_buf_ioend(bp)
>       xfs_buf_relse(bp);
>   return error
> spin_lock(&lip->li_ailp->ail_lock)
> <UAF on log item>
>
> Additionally, after xfsaild_push_item(), the tracepoints can still
> access the log item, potentially causing a UAF.  I've previously
> submitted two versions [1][2] attempting to solve this issue, but the
> solutions had flaws.
>
> Fix it by returning XFS_ITEM_UNSAFE in xfs_inode_item_push() when the
> log item might be freed, ensuring xfsaild does not access the log item
> after it is pushed.
>
> [1] https://patchwork.kernel.org/project/xfs/patch/20230211022941.GA1515023@ceph-admin/
> [2] https://patchwork.kernel.org/project/xfs/patch/20230722025721.312909-1-leo.lilong@huawei.com/
> Fixes: 90c60e164012 ("xfs: xfs_iflush() is no longer necessary")
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> ---
>  fs/xfs/xfs_inode_item.c | 21 ++++++++++++++-------
>  1 file changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index b509cbd191f4..c855cd2c81a5 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -720,10 +720,11 @@ STATIC uint
>  xfs_inode_item_push(
>  	struct xfs_log_item	*lip,
>  	struct list_head	*buffer_list)
> -	__releases(&lip->li_ailp->ail_lock)
> -	__acquires(&lip->li_ailp->ail_lock)
> +	__releases(&ailp->ail_lock)
> +	__acquires(&ailp->ail_lock)

I wonder, is smatch or whatever actually uses these annotations smart
enough to read through the local variable declarations below?

>  {
>  	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
> +	struct xfs_ail		*ailp = lip->li_ailp;
>  	struct xfs_inode	*ip = iip->ili_inode;
>  	struct xfs_buf		*bp = lip->li_buf;
>  	uint			rval = XFS_ITEM_SUCCESS;
> @@ -748,7 +749,7 @@ xfs_inode_item_push(
>  	if (!xfs_buf_trylock(bp))
>  		return XFS_ITEM_LOCKED;
>
> -	spin_unlock(&lip->li_ailp->ail_lock);
> +	spin_unlock(&ailp->ail_lock);
>
>  	/*
>  	 * We need to hold a reference for flushing the cluster buffer as it may
> @@ -762,17 +763,23 @@ xfs_inode_item_push(
>  		if (!xfs_buf_delwri_queue(bp, buffer_list))
>  			rval = XFS_ITEM_FLUSHING;
>  		xfs_buf_relse(bp);
> -	} else {
> +	} else if (error == -EAGAIN) {
>  		/*
>  		 * Release the buffer if we were unable to flush anything. On
>  		 * any other error, the buffer has already been released.
>  		 */
> -		if (error == -EAGAIN)
> -			xfs_buf_relse(bp);
> +		xfs_buf_relse(bp);
>  		rval = XFS_ITEM_LOCKED;
> +	} else {
> +		/*
> +		 * The filesystem has already been shut down. If there's a race
> +		 * between inode flush and inode reclaim, the inode might be
> +		 * freed. Accessing the item after this point would be unsafe.
> +		 */
> +		rval = XFS_ITEM_UNSAFE;

I wonder if it's time to convert this to a switch statement, but the
fix looks correct, so

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	}
>
> -	spin_lock(&lip->li_ailp->ail_lock);
> +	spin_lock(&ailp->ail_lock);
>  	return rval;
>  }
>
> --
> 2.39.2
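For context on the annotation question above: sparse's __releases()/
__acquires() markers hang off the function prototype and are checked
largely syntactically, so an annotation that names a variable existing
only inside the body relies on the checker resolving the alias.  A
minimal sketch of the usual pattern, assuming the lock owner is passed
in as a parameter (illustrative names, not code from this patch):

	/*
	 * Sketch: sparse expects this function to be entered with
	 * ailp->ail_lock held, to drop it, and to retake it before
	 * returning.  The annotation references the parameter, so no
	 * aliasing through a body-local variable is needed.
	 */
	static int ail_do_unlocked_work(struct xfs_ail *ailp)
		__releases(&ailp->ail_lock)
		__acquires(&ailp->ail_lock)
	{
		int	error = 0;

		spin_unlock(&ailp->ail_lock);
		/* ... work that must not hold the AIL lock ... */
		spin_lock(&ailp->ail_lock);
		return error;
	}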
On Fri, Aug 23, 2024 at 10:22:42AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 23, 2024 at 07:04:39PM +0800, Long Li wrote:

[... full commit message, KASAN report, and call graph quoted above snipped ...]
> > @@ -720,10 +720,11 @@ STATIC uint
> >  xfs_inode_item_push(
> >  	struct xfs_log_item	*lip,
> >  	struct list_head	*buffer_list)
> > -	__releases(&lip->li_ailp->ail_lock)
> > -	__acquires(&lip->li_ailp->ail_lock)
> > +	__releases(&ailp->ail_lock)
> > +	__acquires(&ailp->ail_lock)
>
> I wonder, is smatch or whatever actually uses these annotations smart
> enough to read through the local variable declarations below?

Static analysis tools, including smatch, sparse, and others, have
limitations in their ability to fully understand complex pointer
relationships and aliasing.  This can lead to false positives or missed
issues when the lock reference in the annotation differs from its usage
in the function body, so it's usually best practice to keep the
annotations consistent with the actual code.  I think the point here is
just to keep the declaration and the function body consistent.
[...]

> > +	} else {
> > +		/*
> > +		 * The filesystem has already been shut down. If there's a race
> > +		 * between inode flush and inode reclaim, the inode might be
> > +		 * freed. Accessing the item after this point would be unsafe.
> > +		 */
> > +		rval = XFS_ITEM_UNSAFE;
>
> I wonder if it's time to convert this to a switch statement, but the
> fix looks correct, so
>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D

Using a switch statement would be more concise.  If we're still using
this approach in the next version, I can refactor it to use a switch
statement.

Thanks for your review,
Long Li
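For illustration, here is a rough sketch of the switch-based refactor
being discussed, built only from the error cases visible in the patch
above (this is not code from the thread, just one way the next version
might look):

	error = xfs_iflush_cluster(bp);
	switch (error) {
	case 0:
		/* Flushed something: queue the buffer for writeback. */
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		xfs_buf_relse(bp);
		break;
	case -EAGAIN:
		/* Nothing flushed; we still hold the buffer, so release it. */
		xfs_buf_relse(bp);
		rval = XFS_ITEM_LOCKED;
		break;
	default:
		/*
		 * Shutdown path: the buffer has already been released and
		 * the inode may already be reclaimed, so the log item must
		 * not be touched again.
		 */
		rval = XFS_ITEM_UNSAFE;
		break;
	}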
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index b509cbd191f4..c855cd2c81a5 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -720,10 +720,11 @@ STATIC uint
 xfs_inode_item_push(
 	struct xfs_log_item	*lip,
 	struct list_head	*buffer_list)
-	__releases(&lip->li_ailp->ail_lock)
-	__acquires(&lip->li_ailp->ail_lock)
+	__releases(&ailp->ail_lock)
+	__acquires(&ailp->ail_lock)
 {
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
+	struct xfs_ail		*ailp = lip->li_ailp;
 	struct xfs_inode	*ip = iip->ili_inode;
 	struct xfs_buf		*bp = lip->li_buf;
 	uint			rval = XFS_ITEM_SUCCESS;
@@ -748,7 +749,7 @@ xfs_inode_item_push(
 	if (!xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
 
-	spin_unlock(&lip->li_ailp->ail_lock);
+	spin_unlock(&ailp->ail_lock);
 
 	/*
 	 * We need to hold a reference for flushing the cluster buffer as it may
@@ -762,17 +763,23 @@ xfs_inode_item_push(
 		if (!xfs_buf_delwri_queue(bp, buffer_list))
 			rval = XFS_ITEM_FLUSHING;
 		xfs_buf_relse(bp);
-	} else {
+	} else if (error == -EAGAIN) {
 		/*
 		 * Release the buffer if we were unable to flush anything. On
 		 * any other error, the buffer has already been released.
 		 */
-		if (error == -EAGAIN)
-			xfs_buf_relse(bp);
+		xfs_buf_relse(bp);
 		rval = XFS_ITEM_LOCKED;
+	} else {
+		/*
+		 * The filesystem has already been shut down. If there's a race
+		 * between inode flush and inode reclaim, the inode might be
+		 * freed. Accessing the item after this point would be unsafe.
+		 */
+		rval = XFS_ITEM_UNSAFE;
 	}
 
-	spin_lock(&lip->li_ailp->ail_lock);
+	spin_lock(&ailp->ail_lock);
 	return rval;
 }
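The fix relies on the xfsaild caller honoring XFS_ITEM_UNSAFE before
any further use of the item, including tracepoints.  That caller-side
handling is not part of this patch; the sketch below is a hypothetical
illustration of the contract, loosely modeled on the push-result
handling in xfsaild_push() (the XFS_ITEM_UNSAFE case is an assumption
based on the commit message, not code from this series):

	lock_result = xfsaild_push_item(ailp, lip);

	/*
	 * Hypothetical: if the push reported the item as unsafe, it
	 * may already be freed, so bail out before the tracepoints
	 * below dereference lip again.
	 */
	if (lock_result == XFS_ITEM_UNSAFE)
		break;

	switch (lock_result) {
	case XFS_ITEM_SUCCESS:
		trace_xfs_ail_push(lip);
		break;
	case XFS_ITEM_FLUSHING:
		trace_xfs_ail_flushing(lip);
		break;
	case XFS_ITEM_PINNED:
		trace_xfs_ail_pinned(lip);
		break;
	case XFS_ITEM_LOCKED:
		trace_xfs_ail_locked(lip);
		break;
	}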