[8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)

Message ID bb817848-2d19-bcc8-39ca-ea179af0f0b4@google.com (mailing list archive)
State New
Series shmem,tmpfs: general maintenance

Commit Message

Hugh Dickins Sept. 30, 2023, 3:42 a.m. UTC
Percpu counter's compare and add are separate functions: without locking
around them (which would defeat their purpose), it has been possible to
overflow the intended limit.  Imagine all the other CPUs fallocating
tmpfs huge pages to the limit, in between this CPU's compare and its add.

I have not seen reports of that happening; but tmpfs's recent addition
of dquot_alloc_block_nodirty() in between the compare and the add makes
it even more likely, and I'd be uncomfortable to leave it unfixed.

Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
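
For illustration, the racy sequence in tmpfs's shmem_inode_acct_blocks()
(simplified from the mm/shmem.c hunk below, dquot call elided) was:

	/* other CPUs can add in between the compare and the add */
	if (percpu_counter_compare(&sbinfo->used_blocks,
				   sbinfo->max_blocks - pages) > 0)
		goto unacct;
	...
	percpu_counter_add(&sbinfo->used_blocks, pages);

and becomes the single combined operation:

	if (!percpu_counter_limited_add(&sbinfo->used_blocks,
					sbinfo->max_blocks, pages))
		goto unacct;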

I believe this implementation is correct, and slightly more efficient
than the combination of compare and add (taking the lock once rather
than twice when nearing full - the last 128MiB of a tmpfs volume on a
machine with 128 CPUs and 4KiB pages); but it does beg for a better
design - when nearing full, there is no new batching, but the costly
percpu counter sum across CPUs still has to be done, while locked.
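
(Where that 128MiB comes from, assuming the usual batch value of
max(32, 2 * num_online_cpus()) set by compute_batch_value():

	batch   = 2 * 128 CPUs              = 256
	unknown = batch * num_online_cpus() = 256 * 128 = 32768 pages
	32768 pages * 4KiB                  = 128MiB

so within the last 128MiB, fbc->count + unknown can exceed the limit
and the locked path is taken.)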

Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
__percpu_counter_limited_add() then be adding a num_dying_cpus() to
num_online_cpus(), when they calculate the maximum which could be held
across CPUs?  But the times when it matters would be vanishingly rare.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
---
Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
which are just internal to shmem, and do not affect this patch (which
applies to v6.6-rc and linux-next as is): but want to run this by you.

 include/linux/percpu_counter.h | 23 +++++++++++++++
 lib/percpu_counter.c           | 53 ++++++++++++++++++++++++++++++++++
 mm/shmem.c                     | 10 +++----
 3 files changed, 81 insertions(+), 5 deletions(-)

Comments

Jan Kara Oct. 4, 2023, 3:32 p.m. UTC | #1
On Fri 29-09-23 20:42:45, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 
> I believe this implementation is correct, and slightly more efficient
> than the combination of compare and add (taking the lock once rather
> than twice when nearing full - the last 128MiB of a tmpfs volume on a
> machine with 128 CPUs and 4KiB pages); but it does beg for a better
> design - when nearing full, there is no new batching, but the costly
> percpu counter sum across CPUs still has to be done, while locked.
> 
> Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> num_online_cpus(), when they calculate the maximum which could be held
> across CPUs?  But the times when it matters would be vanishingly rare.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Darrick J. Wong <djwong@kernel.org>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
Dave Chinner Oct. 4, 2023, 11:10 p.m. UTC | #2
On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit.  Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
> 
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
> 
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> 
> I believe this implementation is correct, and slightly more efficient
> than the combination of compare and add (taking the lock once rather
> than twice when nearing full - the last 128MiB of a tmpfs volume on a
> machine with 128 CPUs and 4KiB pages); but it does beg for a better
> design - when nearing full, there is no new batching, but the costly
> percpu counter sum across CPUs still has to be done, while locked.
> 
> Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> num_online_cpus(), when they calculate the maximum which could be held
> across CPUs?  But the times when it matters would be vanishingly rare.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Darrick J. Wong <djwong@kernel.org>
> ---
> Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> which are just internal to shmem, and do not affect this patch (which
> applies to v6.6-rc and linux-next as is): but want to run this by you.

Hmmmm. IIUC, this only works for addition that approaches the limit
from below?

So if we are approaching the limit from above (i.e. add of a
negative amount, limit is zero) then this code doesn't work the same
as the open-coded compare+add operation would?

Hence I think this looks like an "add if result is less than"
operation, which is distinct from the "add if result is greater
than" operation that we use this same pattern for in XFS and ext4.
Perhaps a better name is in order?
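
Illustratively, the lower-bound flavour is the same shape with the
comparison flipped - e.g. a hypothetical sketch modelled on the !SMP
fallback in this patch (not an existing XFS or ext4 helper):

	/* add (usually a negative amount) only if the result stays >= floor */
	static inline bool
	percpu_counter_add_above(struct percpu_counter *fbc, s64 floor,
				 s64 amount)
	{
		unsigned long flags;
		s64 count;
		bool good;

		local_irq_save(flags);
		count = fbc->count + amount;
		good = count >= floor;
		if (good)
			fbc->count = count;
		local_irq_restore(flags);
		return good;
	}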

I'm also not a great fan of having two
similar-but-not-quite-the-same implementations for the two
comparisons, but unless we decide to convert the XFS slow path to
this it doesn't matter that much at the moment....

Implementation seems OK at a quick glance, though.

Cheers,

Dave.
Chen, Tim C Oct. 5, 2023, 4:50 p.m. UTC | #3
>Signed-off-by: Hugh Dickins <hughd@google.com>
>Cc: Tim Chen <tim.c.chen@intel.com>
>Cc: Dave Chinner <dchinner@redhat.com>
>Cc: Darrick J. Wong <djwong@kernel.org>
>---
>Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7, which are
>just internal to shmem, and do not affect this patch (which applies to v6.6-rc
>and linux-next as is): but want to run this by you.
>
> include/linux/percpu_counter.h | 23 +++++++++++++++
> lib/percpu_counter.c           | 53 ++++++++++++++++++++++++++++++++++
> mm/shmem.c                     | 10 +++----
> 3 files changed, 81 insertions(+), 5 deletions(-)
>
>diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
>index d01351b1526f..8cb7c071bd5c 100644
>--- a/include/linux/percpu_counter.h
>+++ b/include/linux/percpu_counter.h
>@@ -57,6 +57,8 @@ void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
> 			      s32 batch);
> s64 __percpu_counter_sum(struct percpu_counter *fbc);
> int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
>+bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
>+				  s64 amount, s32 batch);
> void percpu_counter_sync(struct percpu_counter *fbc);
>
> static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
>@@ -69,6 +71,13 @@ static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
> 	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
> }
>
>+static inline bool
>+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
>+{
>+	return __percpu_counter_limited_add(fbc, limit, amount,
>+					    percpu_counter_batch);
>+}
>+
> /*
>  * With percpu_counter_add_local() and percpu_counter_sub_local(), counts
>  * are accumulated in local per cpu counter and not in fbc->count until
>@@ -185,6 +194,20 @@ percpu_counter_add(struct percpu_counter *fbc, s64 amount)
> 	local_irq_restore(flags);
> }
>
>+static inline bool
>+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
>+{
>+	unsigned long flags;
>+	s64 count;
>+
>+	local_irq_save(flags);
>+	count = fbc->count + amount;
>+	if (count <= limit)
>+		fbc->count = count;
>+	local_irq_restore(flags);
>+	return count <= limit;
>+}
>+
> /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */
> static inline void
> percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
>diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
>index 9073430dc865..58a3392f471b 100644
>--- a/lib/percpu_counter.c
>+++ b/lib/percpu_counter.c
>@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
> }
> EXPORT_SYMBOL(__percpu_counter_compare);
>
>+/*
>+ * Compare counter, and add amount if the total is within limit.
>+ * Return true if amount was added, false if it would exceed limit.
>+ */
>+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
>+				  s64 limit, s64 amount, s32 batch)
>+{
>+	s64 count;
>+	s64 unknown;
>+	unsigned long flags;
>+	bool good;
>+
>+	if (amount > limit)
>+		return false;
>+
>+	local_irq_save(flags);
>+	unknown = batch * num_online_cpus();
>+	count = __this_cpu_read(*fbc->counters);
>+
>+	/* Skip taking the lock when safe */
>+	if (abs(count + amount) <= batch &&
>+	    fbc->count + unknown <= limit) {
>+		this_cpu_add(*fbc->counters, amount);
>+		local_irq_restore(flags);
>+		return true;
>+	}
>+
>+	raw_spin_lock(&fbc->lock);
>+	count = fbc->count + amount;
>+

Perhaps we can fast path the case where for sure
we will exceed limit? 

if (fbc->count + amount - unknown > limit)
	return false;
 
Tim

>+	/* Skip percpu_counter_sum() when safe */
>+	if (count + unknown > limit) {
>+		s32 *pcount;
>+		int cpu;
>+
>+		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
>+			pcount = per_cpu_ptr(fbc->counters, cpu);
>+			count += *pcount;
>+		}
>+	}
>+
>+	good = count <= limit;
>+	if (good) {
>+		count = __this_cpu_read(*fbc->counters);
>+		fbc->count += count + amount;
>+		__this_cpu_sub(*fbc->counters, count);
>+	}
>+
>+	raw_spin_unlock(&fbc->lock);
>+	local_irq_restore(flags);
>+	return good;
>+}
>+
> static int __init percpu_counter_startup(void)
> {
> 	int ret;
>diff --git a/mm/shmem.c b/mm/shmem.c
>index 4f4ab26bc58a..7cb72c747954 100644
>--- a/mm/shmem.c
>+++ b/mm/shmem.c
>@@ -217,15 +217,15 @@ static int shmem_inode_acct_blocks(struct inode *inode, long pages)
>
> 	might_sleep();	/* when quotas */
> 	if (sbinfo->max_blocks) {
>-		if (percpu_counter_compare(&sbinfo->used_blocks,
>-					   sbinfo->max_blocks - pages) > 0)
>+		if (!percpu_counter_limited_add(&sbinfo->used_blocks,
>+						sbinfo->max_blocks, pages))
> 			goto unacct;
>
> 		err = dquot_alloc_block_nodirty(inode, pages);
>-		if (err)
>+		if (err) {
>+			percpu_counter_sub(&sbinfo->used_blocks, pages);
> 			goto unacct;
>-
>-		percpu_counter_add(&sbinfo->used_blocks, pages);
>+		}
> 	} else {
> 		err = dquot_alloc_block_nodirty(inode, pages);
> 		if (err)
>--
>2.35.3
Hugh Dickins Oct. 6, 2023, 5:35 a.m. UTC | #4
On Thu, 5 Oct 2023, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > Percpu counter's compare and add are separate functions: without locking
> > around them (which would defeat their purpose), it has been possible to
> > overflow the intended limit.  Imagine all the other CPUs fallocating
> > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > 
> > I have not seen reports of that happening; but tmpfs's recent addition
> > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > 
> > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > 
> > I believe this implementation is correct, and slightly more efficient
> > than the combination of compare and add (taking the lock once rather
> > than twice when nearing full - the last 128MiB of a tmpfs volume on a
> > machine with 128 CPUs and 4KiB pages); but it does beg for a better
> > design - when nearing full, there is no new batching, but the costly
> > percpu counter sum across CPUs still has to be done, while locked.
> > 
> > Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> > well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> > __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> > num_online_cpus(), when they calculate the maximum which could be held
> > across CPUs?  But the times when it matters would be vanishingly rare.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > Cc: Tim Chen <tim.c.chen@intel.com>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: Darrick J. Wong <djwong@kernel.org>
> > ---
> > Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> > which are just internal to shmem, and do not affect this patch (which
> > applies to v6.6-rc and linux-next as is): but want to run this by you.
> 
> Hmmmm. IIUC, this only works for addition that approaches the limit
> from below?

That's certainly how I was thinking about it, and what I need for tmpfs.
Precisely what its limitations (haha) are, I'll have to take care to
spell out.

(IIRC - it's a while since I wrote it - it can be used for subtraction,
but goes the very slow way when it could go the fast way - uncompared
percpu_counter_sub() much better for that.  You might be proposing that
a tweak could adjust it to going the fast way when coming down from the
"limit", but going the slow way as it approaches 0 - that would be neat,
but I've not yet looked into whether it's feasibly done.)

> 
> So if we are approaching the limit from above (i.e. add of a
> negative amount, limit is zero) then this code doesn't work the same
> as the open-coded compare+add operation would?

To it and to me, a limit of 0 means nothing positive can be added
(and it immediately returns false for that case); and adding anything
negative would be an error since the positive would not have been allowed.

Would a negative limit have any use?

It's definitely not allowing all the possibilities that you could arrange
with a separate compare and add; whether it's ruling out some useful
possibilities to which it can easily be generalized, I'm not sure.

Well worth a look - but it'll be easier for me to break it than get
it right, so I might just stick to adding some comments.

I might find that actually I prefer your way round: getting slower
as approaching 0, without any need for specifying a limit??  That the
tmpfs case pushed it in this direction, when it's better reversed?  Or
that might be an embarrassing delusion which I'll regret having mentioned.

> 
> Hence I think this looks like an "add if result is less than"
> operation, which is distinct from the "add if result is greater
> than" operation that we use this same pattern for in XFS and ext4.
> Perhaps a better name is in order?

The name still seems good to me, but a comment above it on its
assumptions/limitations well worth adding.

I didn't find a percpu_counter_compare() in ext4, and haven't got
far yet with understanding the XFS ones: tomorrow...

> 
> I'm also not a great fan of having two
> similar-but-not-quite-the-same implementations for the two
> comparisons, but unless we decide to convert the XFS slow path to
> this it doesn't matter that much at the moment....
> 
> Implementation seems OK at a quick glance, though.

Thanks!

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Hugh Dickins Oct. 6, 2023, 5:42 a.m. UTC | #5
On Thu, 5 Oct 2023, Chen, Tim C wrote:

> >--- a/lib/percpu_counter.c
> >+++ b/lib/percpu_counter.c
> >@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
> > }
> >EXPORT_SYMBOL(__percpu_counter_compare);
> >
> >+/*
> >+ * Compare counter, and add amount if the total is within limit.
> >+ * Return true if amount was added, false if it would exceed limit.
> >+ */
> >+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> >+				  s64 limit, s64 amount, s32 batch)
> >+{
> >+	s64 count;
> >+	s64 unknown;
> >+	unsigned long flags;
> >+	bool good;
> >+
> >+	if (amount > limit)
> >+		return false;
> >+
> >+	local_irq_save(flags);
> >+	unknown = batch * num_online_cpus();
> >+	count = __this_cpu_read(*fbc->counters);
> >+
> >+	/* Skip taking the lock when safe */
> >+	if (abs(count + amount) <= batch &&
> >+	    fbc->count + unknown <= limit) {
> >+		this_cpu_add(*fbc->counters, amount);
> >+		local_irq_restore(flags);
> >+		return true;
> >+	}
> >+
> >+	raw_spin_lock(&fbc->lock);
> >+	count = fbc->count + amount;
> >+
> 
> Perhaps we can fast path the case where for sure
> we will exceed limit? 
> 
> if (fbc->count + amount - unknown > limit)
> 	return false;

Thanks, that sounds reasonable: I'll try to add something like that -
but haven't thought about it carefully enough yet (too easy for me
to overlook some negative case which messes everything up).
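
Presumably something of this shape under the lock, with the unlock and
irq restore on the early return - a sketch only, signs not yet checked:

	raw_spin_lock(&fbc->lock);
	count = fbc->count + amount;

	/* no distribution of per-cpu counts could bring the sum within limit */
	if (count - unknown > limit) {
		raw_spin_unlock(&fbc->lock);
		local_irq_restore(flags);
		return false;
	}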

Hugh
Dennis Zhou Oct. 6, 2023, 10:59 p.m. UTC | #6
Hello,

On Thu, Oct 05, 2023 at 10:42:17PM -0700, Hugh Dickins wrote:
> On Thu, 5 Oct 2023, Chen, Tim C wrote:
> 
> > >--- a/lib/percpu_counter.c
> > >+++ b/lib/percpu_counter.c
> > >@@ -278,6 +278,59 @@ int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
> > > }
> > >EXPORT_SYMBOL(__percpu_counter_compare);
> > >
> > >+/*
> > >+ * Compare counter, and add amount if the total is within limit.
> > >+ * Return true if amount was added, false if it would exceed limit.
> > >+ */
> > >+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> > >+				  s64 limit, s64 amount, s32 batch)
> > >+{
> > >+	s64 count;
> > >+	s64 unknown;
> > >+	unsigned long flags;
> > >+	bool good;
> > >+
> > >+	if (amount > limit)
> > >+		return false;
> > >+
> > >+	local_irq_save(flags);
> > >+	unknown = batch * num_online_cpus();
> > >+	count = __this_cpu_read(*fbc->counters);
> > >+
> > >+	/* Skip taking the lock when safe */
> > >+	if (abs(count + amount) <= batch &&
> > >+	    fbc->count + unknown <= limit) {
> > >+		this_cpu_add(*fbc->counters, amount);
> > >+		local_irq_restore(flags);
> > >+		return true;
> > >+	}
> > >+
> > >+	raw_spin_lock(&fbc->lock);
> > >+	count = fbc->count + amount;
> > >+
> > 
> > Perhaps we can fast path the case where for sure
> > we will exceed limit? 
> > 
> > if (fbc->count + amount - unknown > limit)
> > 	return false;
> 
> Thanks, that sounds reasonable: I'll try to add something like that -
> but haven't thought about it carefully enough yet (too easy for me
> to overlook some negative case which messes everything up).
> 
> Hugh
>

Sorry for the late chime in. I'm traveling right now.

I haven't been super happy lately with percpu_counter as it has had a
few corner cases such as the cpu_dying_mask fiasco which I thought we
fixed with a series from tglx [1]. If not I can resurrect it and pull
it.

I feel like percpu_counter is deviating from its original intended
usecase which, from my perspective, was a thin wrapper around a percpu
variable. At this point we seem to be bolting onto percpu_counter
instead of giving it a clear focus for what it's supposed to do well.
I think I understand the use case, and ultimately it's kind of the
duality where I think it was xfs using percpu_counters where it must
be > 0 for the value to make sense and there was a race condition with
cpu dying [2].

At this point, I think it's probably better to wholly think about the
lower bound and upper bound problem of percpu_counter wrt the # of
online cpus.

Thanks,
Dennis

[1] https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
[2] https://lore.kernel.org/lkml/20230406015629.1804722-1-yebin@huaweicloud.com/
Dave Chinner Oct. 9, 2023, 12:15 a.m. UTC | #7
On Thu, Oct 05, 2023 at 10:35:33PM -0700, Hugh Dickins wrote:
> On Thu, 5 Oct 2023, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > > Percpu counter's compare and add are separate functions: without locking
> > > around them (which would defeat their purpose), it has been possible to
> > > overflow the intended limit.  Imagine all the other CPUs fallocating
> > > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > > 
> > > I have not seen reports of that happening; but tmpfs's recent addition
> > > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > > 
> > > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > > 
> > > I believe this implementation is correct, and slightly more efficient
> > > than the combination of compare and add (taking the lock once rather
> > > than twice when nearing full - the last 128MiB of a tmpfs volume on a
> > > machine with 128 CPUs and 4KiB pages); but it does beg for a better
> > > design - when nearing full, there is no new batching, but the costly
> > > percpu counter sum across CPUs still has to be done, while locked.
> > > 
> > > Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> > > well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> > > __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> > > num_online_cpus(), when they calculate the maximum which could be held
> > > across CPUs?  But the times when it matters would be vanishingly rare.
> > > 
> > > Signed-off-by: Hugh Dickins <hughd@google.com>
> > > Cc: Tim Chen <tim.c.chen@intel.com>
> > > Cc: Dave Chinner <dchinner@redhat.com>
> > > Cc: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > > Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> > > which are just internal to shmem, and do not affect this patch (which
> > > applies to v6.6-rc and linux-next as is): but want to run this by you.
> > 
> > Hmmmm. IIUC, this only works for addition that approaches the limit
> > from below?
> 
> That's certainly how I was thinking about it, and what I need for tmpfs.
> Precisely what its limitations (haha) are, I'll have to take care to
> spell out.
> 
> (IIRC - it's a while since I wrote it - it can be used for subtraction,
> but goes the very slow way when it could go the fast way - uncompared
> percpu_counter_sub() much better for that.  You might be proposing that
> a tweak could adjust it to going the fast way when coming down from the
> "limit", but going the slow way as it approaches 0 - that would be neat,
> but I've not yet looked into whether it's feasibly done.)
> 
> > 
> > So if we are approaching the limit from above (i.e. add of a
> > negative amount, limit is zero) then this code doesn't work the same
> > as the open-coded compare+add operation would?
> 
> To it and to me, a limit of 0 means nothing positive can be added
> (and it immediately returns false for that case); and adding anything
> negative would be an error since the positive would not have been allowed.
> 
> Would a negative limit have any use?

I don't have any use for it, but the XFS case is decrementing free
space to determine if ENOSPC has been hit. It's the opposite
implementation to shmem, which increments used space to determine if
ENOSPC is hit.

> It's definitely not allowing all the possibilities that you could arrange
> with a separate compare and add; whether it's ruling out some useful
> possibilities to which it can easily be generalized, I'm not sure.
> 
> Well worth a look - but it'll be easier for me to break it than get
> it right, so I might just stick to adding some comments.
> 
> I might find that actually I prefer your way round: getting slower
> as approaching 0, without any need for specifying a limit??  That the
> tmpfs case pushed it in this direction, when it's better reversed?  Or
> that might be an embarrassing delusion which I'll regret having mentioned.

I think there's cases for both approaching an upper limit from
below and a lower limit from above. Both are the same "compare and
add" algorithm, just with minor logic differences...

> > Hence I think this looks like an "add if result is less than"
> > operation, which is distinct from the "add if result is greater
> > than" operation that we use this same pattern for in XFS and ext4.
> > Perhaps a better name is in order?
> 
> The name still seems good to me, but a comment above it on its
> assumptions/limitations well worth adding.
> 
> I didn't find a percpu_counter_compare() in ext4, and haven't got

Go search for EXT4_FREECLUSTERS_WATERMARK....

> far yet with understanding the XFS ones: tomorrow...

XFS detects being near ENOSPC to change the batch update size so
that when near ENOSPC the percpu counter always aggregates to the
global sum on every modification. i.e. it becomes more accurate (but
slower) near the ENOSPC threshold. Then if the result of the
subtraction ends up being less than zero, it takes a lock (i.e. goes
even slower!), undoes the subtraction that took it below zero, and
determines if it can dip into the reserve pool or ENOSPC should be
reported.

Some of that could be optimised, but we need that external "lock and
undo" mechanism to manage the reserve pool space atomically at
ENOSPC...
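
A rough sketch of that pattern in percpu_counter terms (names are
placeholders, not the real XFS structures or locks):

	static int dec_free_space(struct percpu_counter *free, s64 pages,
				  s64 low_space, spinlock_t *undo_lock)
	{
		s32 batch = percpu_counter_batch;

		/* near ENOSPC: aggregate to the global sum on every change */
		if (__percpu_counter_compare(free, low_space, batch) < 0)
			batch = 1;

		percpu_counter_add_batch(free, -pages, batch);
		if (percpu_counter_compare(free, 0) >= 0)
			return 0;

		/* external "lock and undo", then try the reserve pool */
		spin_lock(undo_lock);
		percpu_counter_add(free, pages);
		/* ... dip into reserves here, else ENOSPC ... */
		spin_unlock(undo_lock);
		return -ENOSPC;
	}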

Cheers,

Dave.
Hugh Dickins Oct. 12, 2023, 4:09 a.m. UTC | #8
On Fri, 6 Oct 2023, Dennis Zhou wrote:
> 
> Sorry for the late chime in. I'm traveling right now.

No problem at all, thanks for looking.

> 
> I haven't been super happy lately with percpu_counter as it has had a
> few corner cases such as the cpu_dying_mask fiasco which I thought we
> fixed with a series from tglx [1]. If not I can resurrect it and pull
> it.
> 
> I feel like percpu_counter is deviating from its original intended
> usecase which, from my perspective, was a thin wrapper around a percpu
> variable. At this point we seem to be bolting onto percpu_counter
> instead of giving it a clear focus for what it's supposed to do well.
> I think I understand the use case, and ultimately it's kind of the
> duality where I think it was xfs using percpu_counters where it must
> be > 0 for the value to make sense and there was a race condition with
> cpu dying [2].
> 
> At this point, I think it's probably better to wholly think about the
> lower bound and upper bound problem of percpu_counter wrt the # of
> online cpus.
> 
> Thanks,
> Dennis
> 
> [1] https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/
> [2] https://lore.kernel.org/lkml/20230406015629.1804722-1-yebin@huaweicloud.com/

Thanks for the links.  I can see that the current cpu_dying situation
is not ideal, but don't see any need to get any deeper into that for
percpu_counter_limited_add(): I did consider an update to remove its
use of cpu_dying_mask, but that just seemed wrong - it should do the
same as is currently done in __percpu_counter_sum(), then be updated
along with that when cpu_dying is sorted, by tglx's series or otherwise.

I don't think I agree with you about percpu_counter deviating from its
original intended usecase; but I haven't studied the history to see what
that initial usecase was.  Whatever, we've had percpu_counter_add() and
percpu_counter_compare() for many years, and percpu_counter_limited_add()
is just an atomic combination of those: I don't see it as deviating at all.

Hugh
Hugh Dickins Oct. 12, 2023, 4:36 a.m. UTC | #9
On Mon, 9 Oct 2023, Dave Chinner wrote:
> On Thu, Oct 05, 2023 at 10:35:33PM -0700, Hugh Dickins wrote:
> > On Thu, 5 Oct 2023, Dave Chinner wrote:
> > > 
> > > Hmmmm. IIUC, this only works for addition that approaches the limit
> > > from below?
> > 
> > That's certainly how I was thinking about it, and what I need for tmpfs.
> > Precisely what its limitations (haha) are, I'll have to take care to
> > spell out.
> > 
> > (IIRC - it's a while since I wrote it - it can be used for subtraction,
> > but goes the very slow way when it could go the fast way - uncompared
> > percpu_counter_sub() much better for that.  You might be proposing that
> > a tweak could adjust it to going the fast way when coming down from the
> > "limit", but going the slow way as it approaches 0 - that would be neat,
> > but I've not yet looked into whether it's feasibly done.)

Easily done once I'd looked at it from the right angle.

> > 
> > > 
> > > So if we are approaching the limit from above (i.e. add of a
> > > negative amount, limit is zero) then this code doesn't work the same
> > > as the open-coded compare+add operation would?
> > 
> > To it and to me, a limit of 0 means nothing positive can be added
> > (and it immediately returns false for that case); and adding anything
> > negative would be an error since the positive would not have been allowed.
> > 
> > Would a negative limit have any use?

There was no reason to exclude it, once I was thinking clearly
about the comparisons.

> 
> I don't have any use for it, but the XFS case is decrementing free
> space to determine if ENOSPC has been hit. It's the opposite
> implemention to shmem, which increments used space to determine if
> ENOSPC is hit.

Right.

> 
> > It's definitely not allowing all the possibilities that you could arrange
> > with a separate compare and add; whether it's ruling out some useful
> > possibilities to which it can easily be generalized, I'm not sure.
> > 
> > Well worth a look - but it'll be easier for me to break it than get
> > it right, so I might just stick to adding some comments.
> > 
> > I might find that actually I prefer your way round: getting slower
> > as approaching 0, without any need for specifying a limit??  That the
> > tmpfs case pushed it in this direction, when it's better reversed?  Or
> > that might be an embarrassing delusion which I'll regret having mentioned.
> 
> I think there's cases for both approaching an upper limit from
> below and a lower limit from above. Both are the same "compare and
> add" algorithm, just with minor logic differences...

Good, thanks, you've saved me: I was getting a bit fundamentalist there,
thinking to offer one simplest primitive from which anything could be
built.  But when it came down to it, I had no enthusiasm for rewriting
tmpfs's used_blocks as free_blocks, just to avoid that limit argument.

> 
> > > Hence I think this looks like an "add if result is less than"
> > > operation, which is distinct from the "add if result is greater
> > > than" operation that we use this same pattern for in XFS and ext4.
> > > Perhaps a better name is in order?
> > 
> > The name still seems good to me, but a comment above it on its
> > assumptions/limitations well worth adding.
> > 
> > I didn't find a percpu_counter_compare() in ext4, and haven't got
> 
> Go search for EXT4_FREECLUSTERS_WATERMARK....

Ah, not a percpu_counter_compare() user, but doing its own thing.

> 
> > far yet with understanding the XFS ones: tomorrow...
> 
> XFS detects being near ENOSPC to change the batch update size so
> that when near ENOSPC the percpu counter always aggregates to the
> global sum on every modification. i.e. it becomes more accurate (but
> slower) near the ENOSPC threshold. Then if the result of the
> subtraction ends up being less than zero, it takes a lock (i.e. goes
> even slower!), undoes the subtraction that took it below zero, and
> determines if it can dip into the reserve pool or ENOSPC should be
> reported.
> 
> Some of that could be optimised, but we need that external "lock and
> undo" mechanism to manage the reserve pool space atomically at
> ENOSPC...

Thanks for going above and beyond with the description; but I'll be
honest and admit that I only looked quickly, and did not reach any
conclusion as to whether such usage could or should be converted
to percpu_counter_limited_add() - which would never take any XFS
locks, of course, so might just end up doubling the slow work.

But absolutely I agree with you, and thank you for pointing out,
how stupidly useless percpu_counter_limited_add() was for decrementing -
it was nothing more than a slow way of doing percpu_counter_sub().

I'm about to send in a 9/8, extending it to be more useful: thanks.

Hugh

Patch

diff --git a/include/linux/percpu_counter.h b/include/linux/percpu_counter.h
index d01351b1526f..8cb7c071bd5c 100644
--- a/include/linux/percpu_counter.h
+++ b/include/linux/percpu_counter.h
@@ -57,6 +57,8 @@  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
 			      s32 batch);
 s64 __percpu_counter_sum(struct percpu_counter *fbc);
 int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch);
+bool __percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit,
+				  s64 amount, s32 batch);
 void percpu_counter_sync(struct percpu_counter *fbc);
 
 static inline int percpu_counter_compare(struct percpu_counter *fbc, s64 rhs)
@@ -69,6 +71,13 @@  static inline void percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 	percpu_counter_add_batch(fbc, amount, percpu_counter_batch);
 }
 
+static inline bool
+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
+{
+	return __percpu_counter_limited_add(fbc, limit, amount,
+					    percpu_counter_batch);
+}
+
 /*
  * With percpu_counter_add_local() and percpu_counter_sub_local(), counts
  * are accumulated in local per cpu counter and not in fbc->count until
@@ -185,6 +194,20 @@  percpu_counter_add(struct percpu_counter *fbc, s64 amount)
 	local_irq_restore(flags);
 }
 
+static inline bool
+percpu_counter_limited_add(struct percpu_counter *fbc, s64 limit, s64 amount)
+{
+	unsigned long flags;
+	s64 count;
+
+	local_irq_save(flags);
+	count = fbc->count + amount;
+	if (count <= limit)
+		fbc->count = count;
+	local_irq_restore(flags);
+	return count <= limit;
+}
+
 /* non-SMP percpu_counter_add_local is the same with percpu_counter_add */
 static inline void
 percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
diff --git a/lib/percpu_counter.c b/lib/percpu_counter.c
index 9073430dc865..58a3392f471b 100644
--- a/lib/percpu_counter.c
+++ b/lib/percpu_counter.c
@@ -278,6 +278,59 @@  int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
 }
 EXPORT_SYMBOL(__percpu_counter_compare);
 
+/*
+ * Compare counter, and add amount if the total is within limit.
+ * Return true if amount was added, false if it would exceed limit.
+ */
+bool __percpu_counter_limited_add(struct percpu_counter *fbc,
+				  s64 limit, s64 amount, s32 batch)
+{
+	s64 count;
+	s64 unknown;
+	unsigned long flags;
+	bool good;
+
+	if (amount > limit)
+		return false;
+
+	local_irq_save(flags);
+	unknown = batch * num_online_cpus();
+	count = __this_cpu_read(*fbc->counters);
+
+	/* Skip taking the lock when safe */
+	if (abs(count + amount) <= batch &&
+	    fbc->count + unknown <= limit) {
+		this_cpu_add(*fbc->counters, amount);
+		local_irq_restore(flags);
+		return true;
+	}
+
+	raw_spin_lock(&fbc->lock);
+	count = fbc->count + amount;
+
+	/* Skip percpu_counter_sum() when safe */
+	if (count + unknown > limit) {
+		s32 *pcount;
+		int cpu;
+
+		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
+			pcount = per_cpu_ptr(fbc->counters, cpu);
+			count += *pcount;
+		}
+	}
+
+	good = count <= limit;
+	if (good) {
+		count = __this_cpu_read(*fbc->counters);
+		fbc->count += count + amount;
+		__this_cpu_sub(*fbc->counters, count);
+	}
+
+	raw_spin_unlock(&fbc->lock);
+	local_irq_restore(flags);
+	return good;
+}
+
 static int __init percpu_counter_startup(void)
 {
 	int ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index 4f4ab26bc58a..7cb72c747954 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -217,15 +217,15 @@  static int shmem_inode_acct_blocks(struct inode *inode, long pages)
 
 	might_sleep();	/* when quotas */
 	if (sbinfo->max_blocks) {
-		if (percpu_counter_compare(&sbinfo->used_blocks,
-					   sbinfo->max_blocks - pages) > 0)
+		if (!percpu_counter_limited_add(&sbinfo->used_blocks,
+						sbinfo->max_blocks, pages))
 			goto unacct;
 
 		err = dquot_alloc_block_nodirty(inode, pages);
-		if (err)
+		if (err) {
+			percpu_counter_sub(&sbinfo->used_blocks, pages);
 			goto unacct;
-
-		percpu_counter_add(&sbinfo->used_blocks, pages);
+		}
 	} else {
 		err = dquot_alloc_block_nodirty(inode, pages);
 		if (err)