xfs: fix race condition in inodegc list and cpumask handling

Message ID 20241125015258.2652325-1-leo.lilong@huawei.com (mailing list archive)
State Not Applicable, archived
Series xfs: fix race condition in inodegc list and cpumask handling

Commit Message

Long Li Nov. 25, 2024, 1:52 a.m. UTC
There is a race condition between inodegc queue and inodegc worker where
the cpumask bit may not be set when concurrent operations occur.

Current problematic sequence:

  CPU0                             CPU1
  --------------------             ---------------------
  xfs_inodegc_queue()              xfs_inodegc_worker()
                                     llist_del_all(&gc->list)
    llist_add(&ip->i_gclist, &gc->list)
    cpumask_test_and_set_cpu()
                                     cpumask_clear_cpu()
                  < cpumask not set >

Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
proper ordering. This change ensures that when the worker thread clears
the cpumask, any concurrent queue operations will either properly set
the cpumask bit or have already emptied the list.

Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
the llist_* operations already provide the required ordering semantics.
This makes the code cleaner.

Fixes: 62334fab4762 ("xfs: use per-mount cpumask to track nonempty percpu inodegc lists")
Reported-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Long Li <leo.lilong@huawei.com>
---
 fs/xfs/xfs_icache.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

Comments

Dave Chinner Nov. 25, 2024, 9:03 p.m. UTC | #1
On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> There is a race condition between inodegc queue and inodegc worker where
> the cpumask bit may not be set when concurrent operations occur.

What problems does this cause? i.e. how do we identify systems
hitting this issue?

> 
> Current problematic sequence:
> 
>   CPU0                             CPU1
>   --------------------             ---------------------
>   xfs_inodegc_queue()              xfs_inodegc_worker()
>                                      llist_del_all(&gc->list)
>     llist_add(&ip->i_gclist, &gc->list)
>     cpumask_test_and_set_cpu()
>                                      cpumask_clear_cpu()
>                   < cpumask not set >
> 
> Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> proper ordering. This change ensures that when the worker thread clears
> the cpumask, any concurrent queue operations will either properly set
> the cpumask bit or have already emptied the list.
> 
> Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
> the llist_* operations already provide the required ordering semantics.
> This makes the code cleaner.

IIRC, the barriers were for ordering the cpumask bitmap ops against
llist operations. There are calls elsewhere to for_each_cpu() that
then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
so on relaxed architectures (like alpha) I think we have to ensure
the bitmask ops carry full ordering against the independent llist
ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
orders against other llist ops and won't guarantee any specific
ordering against cpumask modifications.
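
From memory, the pattern in question is roughly this (a simplified
sketch, not the exact code):

  for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
          struct xfs_inodegc *gc = per_cpu_ptr(mp->m_inodegc, cpu);

          /* llist_empty() is just a READ_ONCE() of the list head */
          if (!llist_empty(&gc->list))
                  mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
  }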

I could be remembering incorrectly, but I think that was the
original reason for the barriers. Can you please confirm that the
cpumask iteration/llist_empty checks do not need these bitmask
barriers anymore? If that's ok, then the change looks fine.

-Dave.
Long Li Dec. 5, 2024, 6:59 a.m. UTC | #2
On Tue, Nov 26, 2024 at 08:03:43AM +1100, Dave Chinner wrote:

Sorry for replying so late; I wanted to make the problem as clear
as possible, but there are still some doubts.

> On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> > There is a race condition between inodegc queue and inodegc worker where
> > the cpumask bit may not be set when concurrent operations occur.
> 
> What problems does this cause? i.e. how do we identify systems
> hitting this issue?
> 

I haven't encountered any actual issues, but while reviewing 62334fab4762
("xfs: use per-mount cpumask to track nonempty percpu inodegc lists"), I
noticed there is a potential problem.

When the gc worker runs on a CPU other than the specified one due to
load balancing, it could race with xfs_inodegc_queue() processing the
same struct xfs_inodegc. If xfs_inodegc_queue() adds the last inode
to the gc list during this race, that inode might never be processed
and reclaimed because the cpumask bit is not set. This may lead to
memory leaks after filesystem unmount; I'm unsure if there are other
more serious implications.

> > 
> > Current problematic sequence:
> > 
> >   CPU0                             CPU1
> >   --------------------             ---------------------
> >   xfs_inodegc_queue()              xfs_inodegc_worker()
> >                                      llist_del_all(&gc->list)
> >     llist_add(&ip->i_gclist, &gc->list)
> >     cpumask_test_and_set_cpu()
> >                                      cpumask_clear_cpu()
> >                   < cpumask not set >
> > 
> > Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> > proper ordering. This change ensures that when the worker thread clears
> > the cpumask, any concurrent queue operations will either properly set
> > the cpumask bit or have already emptied the list.
> > 
> > Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
> > the llist_* operations already provide the required ordering semantics.
> > This makes the code cleaner.
> 
> IIRC, the barriers were for ordering the cpumask bitmap ops against
> llist operations. There are calls elsewhere to for_each_cpu() that
> then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
> so on relaxed architectures (like alpha) I think we have to ensure
> the bitmask ops carry full ordering against the independent llist
> ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
> orders against other llist ops and won't guarantee any specific
> ordering against cpumask modifications.
> 
> I could be remembering incorrectly, but I think that was the
> original reason for the barriers. Can you please confirm that the
> cpumask iteration/llist_empty checks do not need these bitmask
> barriers anymore? If that's ok, then the change looks fine.
> 

Even on architectures with relaxed memory ordering (like alpha), I noticed
that llist_add() already has full barrier semantics, so I think the 
smp_mb__before_atomic barrier in xfs_inodegc_queue() can be removed.

  llist_add()
    try_cmpxchg
      raw_try_cmpxchg
        arch_cmpxchg
  
  arch_cmpxchg() for alpha, in arch/alpha/include/asm/cmpxchg.h:
  
  #define arch_cmpxchg(ptr, o, n)                                         \
  ({                                                                      \
          __typeof__(*(ptr)) __ret;                                       \
          __typeof__(*(ptr)) _o_ = (o);                                   \
          __typeof__(*(ptr)) _n_ = (n);                                   \
          smp_mb();                                                       \
          __ret = (__typeof__(*(ptr))) ____cmpxchg((ptr),                 \
                  (unsigned long)_o_, (unsigned long)_n_, sizeof(*(ptr)));\
          smp_mb();                                                       \
	  ^^^^^^^
          __ret;                                                          \
  })

I'm wondering if we really need to "Ensure the list add is always seen
by xfs_inodegc_queue_all() who finds the cpumask bit set". It seems
harmless even if we don't see the list add completion - it can be
processed in the next round. xfs_inodegc_queue_all() doesn't guarantee
processing all inodes that are being or will be added to the gc llist
anyway.

If we do need that guarantee, should we add a barrier between reading
m_inodegc_cpumask and gc->list in xfs_inodegc_queue_all() to prevent 
load-load reordering?
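
Something like this, purely to illustrate where such a barrier would
sit (the smp_rmb() is hypothetical, not a tested proposal):

  struct xfs_inodegc *gc;
  int cpu;

  for_each_cpu(cpu, &mp->m_inodegc_cpumask) {        /* load #1: cpumask */
          gc = per_cpu_ptr(mp->m_inodegc, cpu);

          /* hypothetical: keep load #1 ordered before load #2 */
          smp_rmb();

          if (!llist_empty(&gc->list))               /* load #2: gc->list */
                  mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
  }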

Maybe I'm misunderstanding something.

Thanks,
Long Li
Dave Chinner Dec. 10, 2024, 6:20 a.m. UTC | #3
On Thu, Dec 05, 2024 at 02:59:16PM +0800, Long Li wrote:
> On Tue, Nov 26, 2024 at 08:03:43AM +1100, Dave Chinner wrote:
> 
> Sorry for replying so late; I wanted to make the problem as clear
> as possible, but there are still some doubts.
> 
> > On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> > > There is a race condition between inodegc queue and inodegc worker where
> > > the cpumask bit may not be set when concurrent operations occur.
> > 
> > What problems does this cause? i.e. how do we identify systems
> > hitting this issue?
> > 
> 
> I haven't encountered any actual issues, but while reviewing 62334fab4762
> ("xfs: use per-mount cpumask to track nonempty percpu inodegc lists"), I
> noticed there is a potential problem.
> 
> When the gc worker runs on a CPU other than the specified one due to
> load balancing,

How? inodegc is using get_cpu() to pin the task to the cpu while it
processes the queue and then queues the work to be run on that CPU.
The per-CPU inodegc queue is then processed using a single CPU
affine worker thread. The whole point of this setup is that
scheduler load balancing, etc, cannot disturb the cpu affinity of
the queues and the worker threads that service them.
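
In outline, the queueing side does something like this (a simplified
sketch of xfs_inodegc_queue(); barriers, the enabled checks and the
delay heuristics are elided):

  cpu_nr = get_cpu();                     /* pin this task to the CPU */
  gc = this_cpu_ptr(mp->m_inodegc);
  llist_add(&ip->i_gclist, &gc->list);    /* add to this CPU's list */
  cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
  /* run the worker on the same CPU the inode was queued on */
  mod_delayed_work_on(cpu_nr, mp->m_inodegc_wq, &gc->work, 0);
  put_cpu();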

How does load balancing break explicit CPU affine kernel task
scheduling?

> it could race with xfs_inodegc_queue() processing the
> same struct xfs_inodegc. If xfs_inodegc_queue() adds the last inode
> to the gc list during this race, that inode might never be processed
> and reclaimed because the cpumask bit is not set. This may lead to
> memory leaks after filesystem unmount; I'm unsure if there are other
> more serious implications.

xfs_inodegc_stop() should handle this all just fine. It removes
the enabled flag, then moves into a loop that should catch list adds
that were in progress when the enabled flag was cleared.
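
i.e. roughly this (a simplified sketch of xfs_inodegc_stop() from
memory; error handling and tracing elided):

  bool rerun;

  if (!xfs_clear_inodegc_enabled(mp))
          return;

  /* catch queue racers that saw the enabled flag before we cleared it */
  xfs_inodegc_queue_all(mp);
  do {
          flush_workqueue(mp->m_inodegc_wq);
          rerun = xfs_inodegc_queue_all(mp);
  } while (rerun);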

> > > Current problematic sequence:
> > > 
> > >   CPU0                             CPU1
> > >   --------------------             ---------------------
> > >   xfs_inodegc_queue()              xfs_inodegc_worker()
> > >                                      llist_del_all(&gc->list)
> > >     llist_add(&ip->i_gclist, &gc->list)
> > >     cpumask_test_and_set_cpu()
> > >                                      cpumask_clear_cpu()
> > >                   < cpumask not set >
> > > 
> > > Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> > > proper ordering. This change ensures that when the worker thread clears
> > > the cpumask, any concurrent queue operations will either properly set
> > > the cpumask bit or have already emptied the list.
> > > 
> > > Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
> > > the llist_* operations already provide the required ordering semantics.
> > > This makes the code cleaner.
> > 
> > IIRC, the barriers were for ordering the cpumask bitmap ops against
> > llist operations. There are calls elsewhere to for_each_cpu() that
> > then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
> > so on relaxed architectures (like alpha) I think we have to ensure
> > the bitmask ops carry full ordering against the independent llist
> > ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
> > orders against other llist ops and won't guarantee any specific
> > ordering against cpumask modifications.
> > 
> > I could be remembering incorrectly, but I think that was the
> > original reason for the barriers. Can you please confirm that the
> > cpumask iteration/llist_empty checks do not need these bitmask
> > barriers anymore? If that's ok, then the change looks fine.
> > 
> 
> Even on architectures with relaxed memory ordering (like alpha), I noticed
> that llist_add() already has full barrier semantics, so I think the 
> smp_mb__before_atomic barrier in xfs_inodegc_queue() can be removed.

Ok. Seems reasonable to remove it if everything uses full memory
barriers for the llist_add() operation.

-Dave.
Long Li Dec. 25, 2024, 12:41 p.m. UTC | #4
On Tue, Dec 10, 2024 at 05:20:51PM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2024 at 02:59:16PM +0800, Long Li wrote:
> > On Tue, Nov 26, 2024 at 08:03:43AM +1100, Dave Chinner wrote:
> > 
> > Sorry for replying so late; I wanted to make the problem as clear
> > as possible, but there are still some doubts.
> > 
> > > On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> > > > There is a race condition between inodegc queue and inodegc worker where
> > > > the cpumask bit may not be set when concurrent operations occur.
> > > 
> > > What problems does this cause? i.e. how do we identify systems
> > > hitting this issue?
> > > 
> > 
> > I haven't encountered any actual issues, but while reviewing 62334fab4762
> > ("xfs: use per-mount cpumask to track nonempty percpu inodegc lists"), I
> > noticed there is a potential problem.
> > 
> > When the gc worker runs on a CPU other than the specified one due to
> > load balancing,
> 
> How? inodegc is using get_cpu() to pin the task to the cpu while it
> processes the queue and then queues the work to be run on that CPU.
> The per-CPU inodegc queue is then processed using a single CPU
> affine worker thread. The whole point of this setup is that
> scheduler load balancing, etc, cannot disturb the cpu affinity of
> the queues and the worker threads that service them.
> 
> How does load balancing break explicit CPU affine kernel task
> scheduling?

Sorry, I misunderstood earlier. The load balancing mechanisms cannot
interfere with the CPU affinity of the queues. The inodegc workqueue
is not of WQ_UNBOUND type, so work items typically execute on their
designated CPUs.

However, in CPU hotplug scenarios, the CPU offline process will unbind
workers, causing work items to execute on other CPUs. If we queue an
inode on a CPU that's about to go offline after workqueue_offline_cpu()
but before the CPU is actually marked offline, the following concurrent
sequence might occur (though I haven't been able to reproduce this scenario
in practice).

From another perspective, since kernel preemption is enabled in the
xfs_inodegc_worker() context, I think it's possible for xfs_inodegc_queue()
and xfs_inodegc_worker() to race on the same CPU and process the same gc
structure, potentially leading to the sequence quoted below.

> 
> > it could race with xfs_inodegc_queue() processing the
> > same struct xfs_inodegc. If xfs_inodegc_queue() adds the last inode
> > to the gc list during this race, that inode might never be processed
> > and reclaimed because the cpumask bit is not set. This may lead to
> > memory leaks after filesystem unmount; I'm unsure if there are other
> > more serious implications.
> 
> xfs_inodegc_stop() should handle this all just fine. It removes
> the enabled flag, then moves into a loop that should catch list adds
> that were in progress when the enabled flag was cleared.
> 

I don't think xfs_inodegc_stop() adequately handles this scenario. 
xfs_inodegc_queue_all() only processes the CPUs that are set in
mp->m_inodegc_cpumask. If a CPU's corresponding bit is not set in
the cpumask, then any inodes waiting for gc on that CPU won't have
a chance to be processed.

Thanks,
Long Li

> > > > Current problematic sequence:
> > > > 
> > > >   CPU0                             CPU1
> > > >   --------------------             ---------------------
> > > >   xfs_inodegc_queue()              xfs_inodegc_worker()
> > > >                                      llist_del_all(&gc->list)
> > > >     llist_add(&ip->i_gclist, &gc->list)
> > > >     cpumask_test_and_set_cpu()
> > > >                                      cpumask_clear_cpu()
> > > >                   < cpumask not set >
> > > > 
> > > > Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> > > > proper ordering. This change ensures that when the worker thread clears
> > > > the cpumask, any concurrent queue operations will either properly set
> > > > the cpumask bit or have already emptied the list.
> > > > 
> > > > Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
> > > > the llist_* operations already provide the required ordering semantics.
> > > > This makes the code cleaner.
> > > 
> > > IIRC, the barriers were for ordering the cpumask bitmap ops against
> > > llist operations. There are calls elsewhere to for_each_cpu() that
> > > then use llist_empty() checks (e.g. xfs_inodegc_queue_all/wait_all),
> > > so on relaxed architectures (like alpha) I think we have to ensure
> > > the bitmask ops carry full ordering against the independent llist
> > > ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
> > > orders against other llist ops and won't guarantee any specific
> > > ordering against cpumask modifications.
> > > 
> > > I could be remembering incorrectly, but I think that was the
> > > original reason for the barriers. Can you please confirm that the
> > > cpumask iteration/llist_empty checks do not need these bitmask
> > > barriers anymore? If that's ok, then the change looks fine.
> > > 
> > 
> > Even on architectures with relaxed memory ordering (like alpha), I noticed
> > that llist_add() already has full barrier semantics, so I think the 
> > smp_mb__before_atomic barrier in xfs_inodegc_queue() can be removed.
> 
> Ok. Seems reasonable to remove it if everything uses full memory
> barriers for the llist_add() operation.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Patch

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 7b6c026d01a1..fba784d7a146 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1948,19 +1948,20 @@  xfs_inodegc_worker(
 {
 	struct xfs_inodegc	*gc = container_of(to_delayed_work(work),
 						struct xfs_inodegc, work);
-	struct llist_node	*node = llist_del_all(&gc->list);
+	struct llist_node	*node;
 	struct xfs_inode	*ip, *n;
 	struct xfs_mount	*mp = gc->mp;
 	unsigned int		nofs_flag;
 
+	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
+
 	/*
-	 * Clear the cpu mask bit and ensure that we have seen the latest
-	 * update of the gc structure associated with this CPU. This matches
-	 * with the release semantics used when setting the cpumask bit in
-	 * xfs_inodegc_queue.
+	 * llist_del_all provides ordering semantics that ensure the CPU mask
+	 * clearing is always visible before removing all entries from the gc
+	 * list. This prevents the list from being added to while the CPU mask
+	 * bit is unset in xfs_inodegc_queue().
 	 */
-	cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
-	smp_mb__after_atomic();
+	node = llist_del_all(&gc->list);
 
 	WRITE_ONCE(gc->items, 0);
 
@@ -2188,13 +2189,11 @@  xfs_inodegc_queue(
 	shrinker_hits = READ_ONCE(gc->shrinker_hits);
 
 	/*
-	 * Ensure the list add is always seen by anyone who finds the cpumask
-	 * bit set. This effectively gives the cpumask bit set operation
-	 * release ordering semantics.
+	 * llist_add provides necessary ordering semantics to ensure list
+	 * additions are visible when the cpumask bit is found set, so no
+	 * additional memory barrier is needed.
 	 */
-	smp_mb__before_atomic();
-	if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
-		cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
+	cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
 
 	/*
 	 * We queue the work while holding the current CPU so that the work