[v4,13/27] lib: add errseq_t type and infrastructure for handling it
diff mbox

Message ID 20170509154930.29524-14-jlayton@redhat.com
State New
Headers show

Commit Message

Jeff Layton May 9, 2017, 3:49 p.m. UTC
An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/errseq.h |  19 +++++
 lib/Makefile           |   2 +-
 lib/errseq.c           | 199 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 219 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

Comments

NeilBrown May 9, 2017, 10:03 p.m. UTC | #1
On Tue, May 09 2017, Jeff Layton wrote:

> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
>
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
>
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
>
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
>
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

I like that this is a separate lib/*.c - nicely structured too.

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown


> ---
>  include/linux/errseq.h |  19 +++++
>  lib/Makefile           |   2 +-
>  lib/errseq.c           | 199 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 219 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/errseq.h
>  create mode 100644 lib/errseq.c
>
> diff --git a/include/linux/errseq.h b/include/linux/errseq.h
> new file mode 100644
> index 000000000000..0d2555f310cd
> --- /dev/null
> +++ b/include/linux/errseq.h
> @@ -0,0 +1,19 @@
> +#ifndef _LINUX_ERRSEQ_H
> +#define _LINUX_ERRSEQ_H
> +
> +/* See lib/errseq.c for more info */
> +
> +typedef u32	errseq_t;
> +
> +void __errseq_set(errseq_t *eseq, int err);
> +static inline void errseq_set(errseq_t *eseq, int err)
> +{
> +	/* Optimize for the common case of no error */
> +	if (unlikely(err))
> +		__errseq_set(eseq, err);
> +}
> +
> +errseq_t errseq_sample(errseq_t *eseq);
> +int errseq_check(errseq_t *eseq, errseq_t since);
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 320ac46a8725..2423afef40f7 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
>  	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
>  	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
>  	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
> -	 once.o refcount.o
> +	 once.o refcount.o errseq.o
>  obj-y += string_helpers.o
>  obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
>  obj-y += hexdump.o
> diff --git a/lib/errseq.c b/lib/errseq.c
> new file mode 100644
> index 000000000000..0f8b4ed460f0
> --- /dev/null
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include <linux/err.h>
> +#include <linux/bug.h>
> +#include <linux/atomic.h>
> +#include <linux/errseq.h>
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.
> + *
> + * It's implemented as an unsigned 32-bit value. The low order bits are
> + * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
> + * are used as a counter. This is done with atomics instead of locking so that
> + * these functions can be called from any context.
> + *
> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether any
> + * new errors have occurred since that time.
> + *
> + * Note that there is a risk of collisions, if new errors are being recorded
> + * frequently, since we have so few bits to use as a counter.
> + *
> + * To mitigate this, one bit is used as a flag to tell whether the value has
> + * been sampled since a new value was recorded. That allows us to avoid bumping
> + * the counter if no one has sampled it since the last time an error was
> + * recorded.
> + *
> + * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
> + * is the special (but common) case where there has never been an error. An all
> + * zero value thus serves as the "epoch" if one wishes to know whether there
> + * has ever been an error set since it was first initialized.
> + */
> +
> +/* The low bits are designated for error code (max of MAX_ERRNO) */
> +#define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
> +
> +/* This bit is used as a flag to indicate whether the value has been seen */
> +#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
> +
> +/* The "ones" bit for the counter */
> +#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
> +
> +/**
> + * __errseq_set - set a errseq_t for later reporting
> + * @eseq: errseq_t field that should be set
> + * @err: error to set
> + *
> + * This function sets the error in *eseq, and increments the sequence counter
> + * if the last sequence was sampled at some point in the past.
> + *
> + * Any error set will always overwrite an existing error.
> + *
> + * Most callers will want to use the errseq_set inline wrapper to efficiently
> + * handle the common case where err is 0.
> + */
> +void __errseq_set(errseq_t *eseq, int err)
> +{
> +	errseq_t old;
> +
> +	/* MAX_ERRNO must be able to serve as a mask */
> +	BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
> +
> +	/*
> +	 * Ensure the error code actually fits where we want it to go. If it
> +	 * doesn't then just throw a warning and don't record anything. We
> +	 * also don't accept zero here as that would effectively clear a
> +	 * previous error.
> +	 */
> +	if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
> +				"err = %d\n", err))
> +		return;
> +
> +	old = READ_ONCE(*eseq);
> +	for (;;) {
> +		errseq_t new, cur;
> +
> +		/* Clear out error bits and set new error */
> +		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
> +
> +		/* Only increment if someone has looked at it */
> +		if (old & ERRSEQ_SEEN)
> +			new += ERRSEQ_CTR_INC;
> +
> +		/* If there would be no change, then call it done */
> +		if (new == old)
> +			break;
> +
> +		/* Try to swap the new value into place */
> +		cur = cmpxchg(eseq, old, new);
> +
> +		/*
> +		 * Call it success if we did the swap or someone else beat us
> +		 * to it for the same value.
> +		 */
> +		if (likely(cur == old || cur == new))
> +			break;
> +
> +		/* Raced with an update, try again */
> +		old = cur;
> +	}
> +}
> +EXPORT_SYMBOL(__errseq_set);
> +
> +/**
> + * errseq_sample - grab current errseq_t value
> + * @eseq: pointer to errseq_t to be sampled
> + *
> + * This function allows callers to sample an errseq_t value, marking it as
> + * "seen" if required.
> + */
> +errseq_t errseq_sample(errseq_t *eseq)
> +{
> +	errseq_t old = READ_ONCE(*eseq);
> +	errseq_t new = old;
> +
> +	/*
> +	 * For the common case of no errors ever having been set, we can skip
> +	 * marking the SEEN bit. Once an error has been set, the value will
> +	 * never go back to zero.
> +	 */
> +	if (old != 0) {
> +		new |= ERRSEQ_SEEN;
> +		if (old != new)
> +			cmpxchg(eseq, old, new);
> +	}
> +	return new;
> +}
> +EXPORT_SYMBOL(errseq_sample);
> +
> +/**
> + * errseq_check - has an error occurred since a particular point in time?
> + * @eseq: pointer to errseq_t value to be checked
> + * @since: previously-sampled errseq_t from which to check
> + *
> + * Grab the value that eseq points to, and see if it has changed "since"
> + * the given value was sampled. The "since" value is not advanced, so there
> + * is no need to mark the value as seen.
> + *
> + * Returns the latest error set in the errseq_t or 0 if it hasn't changed.
> + */
> +int errseq_check(errseq_t *eseq, errseq_t since)
> +{
> +	errseq_t cur = READ_ONCE(*eseq);
> +
> +	if (likely(cur == since))
> +		return 0;
> +	return -(cur & MAX_ERRNO);
> +}
> +EXPORT_SYMBOL(errseq_check);
> +
> +/**
> + * errseq_check_and_advance - check an errseq_t and advance it to the current value
> + * @eseq: pointer to value being checked reported
> + * @since: pointer to previously-sampled errseq_t to check against and advance
> + *
> + * Grab the eseq value, and see whether it matches the value that "since"
> + * points to. If it does, then just return 0.
> + *
> + * If it doesn't, then the value has changed. Set the "seen" flag, and try to
> + * swap it into place as the new eseq value. Then, set that value as the new
> + * "since" value, and return whatever the error portion is set to.
> + *
> + * Note that no locking is provided here for concurrent updates to the "since"
> + * value. The caller must provide that if necessary. Because of this, callers
> + * may want to do a lockless errseq_check before taking the lock and calling
> + * this.
> + */
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> +	int err = 0;
> +	errseq_t old, new;
> +
> +	/*
> +	 * Most callers will want to use the inline wrapper to check this,
> +	 * so that the common case of no error is handled without needing
> +	 * to lock.
> +	 */
> +	old = READ_ONCE(*eseq);
> +	if (old != *since) {
> +		/*
> +		 * Set the flag and try to swap it into place if it has
> +		 * changed.
> +		 *
> +		 * We don't care about the outcome of the swap here. If the
> +		 * swap doesn't occur, then it has either been updated by a
> +		 * writer who is bumping the seq count anyway, or another
> +		 * reader who is just setting the "seen" flag. Either outcome
> +		 * is OK, and we can advance "since" and return an error based
> +		 * on what we have.
> +		 */
> +		new = old | ERRSEQ_SEEN;
> +		if (new != old)
> +			cmpxchg(eseq, old, new);
> +		*since = new;
> +		err = -(new & MAX_ERRNO);
> +	}
> +	return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);
> -- 
> 2.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton May 10, 2017, 11:29 a.m. UTC | #2
On Wed, 2017-05-10 at 08:03 +1000, NeilBrown wrote:
> On Tue, May 09 2017, Jeff Layton wrote:
> 
> > An errseq_t is a way of recording errors in one place, and allowing any
> > number of "subscribers" to tell whether an error has been set again
> > since a previous time.
> > 
> > It's implemented as an unsigned 32-bit value that is managed with atomic
> > operations. The low order bits are designated to hold an error code
> > (max size of MAX_ERRNO). The upper bits are used as a counter.
> > 
> > The API works with consumers sampling an errseq_t value at a particular
> > point in time. Later, that value can be used to tell whether new errors
> > have been set since that time.
> > 
> > Note that there is a 1 in 512k risk of collisions here if new errors
> > are being recorded frequently, since we have so few bits to use as a
> > counter. To mitigate this, one bit is used as a flag to tell whether the
> > value has been sampled since a new value was recorded. That allows
> > us to avoid bumping the counter if no one has sampled it since it
> > was last bumped.
> > 
> > Later patches will build on this infrastructure to change how writeback
> > errors are tracked in the kernel.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> 
> I like that this is a separate lib/*.c - nicely structured too.
> 
> Reviewed-by: NeilBrown <neilb@suse.com>
> 
> 

Thanks, yeah...it occurred to me that this scheme is not really specific
to writeback errors. While I can't think of another use-case for
errseq_t's right offhand, I think this makes for cleaner layering and
should make it easy to use in other ways should they arise.
Jan Kara May 10, 2017, 11:34 a.m. UTC | #3
On Tue 09-05-17 11:49:16, Jeff Layton wrote:
> An errseq_t is a way of recording errors in one place, and allowing any
> number of "subscribers" to tell whether an error has been set again
> since a previous time.
> 
> It's implemented as an unsigned 32-bit value that is managed with atomic
> operations. The low order bits are designated to hold an error code
> (max size of MAX_ERRNO). The upper bits are used as a counter.
> 
> The API works with consumers sampling an errseq_t value at a particular
> point in time. Later, that value can be used to tell whether new errors
> have been set since that time.
> 
> Note that there is a 1 in 512k risk of collisions here if new errors
> are being recorded frequently, since we have so few bits to use as a
> counter. To mitigate this, one bit is used as a flag to tell whether the
> value has been sampled since a new value was recorded. That allows
> us to avoid bumping the counter if no one has sampled it since it
> was last bumped.
> 
> Later patches will build on this infrastructure to change how writeback
> errors are tracked in the kernel.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

Just two nits below:
...
> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> +	int err = 0;
> +	errseq_t old, new;
> +
> +	/*
> +	 * Most callers will want to use the inline wrapper to check this,
> +	 * so that the common case of no error is handled without needing
> +	 * to lock.
> +	 */

I'm not sure which locking you are speaking about here. Is the comment
stale?

> +	old = READ_ONCE(*eseq);
> +	if (old != *since) {
> +		/*
> +		 * Set the flag and try to swap it into place if it has
> +		 * changed.
> +		 *
> +		 * We don't care about the outcome of the swap here. If the
> +		 * swap doesn't occur, then it has either been updated by a
> +		 * writer who is bumping the seq count anyway, or another

"bumping the seq count anyway" part is not quite true. Writer may see
ERRSEQ_SEEN not set and so just update the error code and leave seq count
as is. But since you compare full errseq_t for equality, this works out as
designed...

> +		 * reader who is just setting the "seen" flag. Either outcome
> +		 * is OK, and we can advance "since" and return an error based
> +		 * on what we have.
> +		 */
> +		new = old | ERRSEQ_SEEN;
> +		if (new != old)
> +			cmpxchg(eseq, old, new);
> +		*since = new;
> +		err = -(new & MAX_ERRNO);
> +	}
> +	return err;
> +}
> +EXPORT_SYMBOL(errseq_check_and_advance);

								Honza
Jeff Layton May 10, 2017, 11:58 a.m. UTC | #4
On Wed, 2017-05-10 at 13:34 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:16, Jeff Layton wrote:
> > An errseq_t is a way of recording errors in one place, and allowing any
> > number of "subscribers" to tell whether an error has been set again
> > since a previous time.
> > 
> > It's implemented as an unsigned 32-bit value that is managed with atomic
> > operations. The low order bits are designated to hold an error code
> > (max size of MAX_ERRNO). The upper bits are used as a counter.
> > 
> > The API works with consumers sampling an errseq_t value at a particular
> > point in time. Later, that value can be used to tell whether new errors
> > have been set since that time.
> > 
> > Note that there is a 1 in 512k risk of collisions here if new errors
> > are being recorded frequently, since we have so few bits to use as a
> > counter. To mitigate this, one bit is used as a flag to tell whether the
> > value has been sampled since a new value was recorded. That allows
> > us to avoid bumping the counter if no one has sampled it since it
> > was last bumped.
> > 
> > Later patches will build on this infrastructure to change how writeback
> > errors are tracked in the kernel.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> 
> The patch looks good to me. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Just two nits below:
> ...
> > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > +{
> > +	int err = 0;
> > +	errseq_t old, new;
> > +
> > +	/*
> > +	 * Most callers will want to use the inline wrapper to check this,
> > +	 * so that the common case of no error is handled without needing
> > +	 * to lock.
> > +	 */
> 
> I'm not sure which locking you are speaking about here. Is the comment
> stale?
> 

No it's not stale, just unclear. In the kerneldoc comment, I have this
blurb:

 * Note that no locking is provided here for concurrent updates to the "since"
 * value. The caller must provide that if necessary. Because of this, callers
 * may want to do a lockless errseq_check before taking the lock and calling
 * this.

...so the lock here refers to any locking that the caller needs to do to
serialize changes to "since" in this function. For the usual case of
file->f_wb_err, we use f_lock to protect it so by doing an errseq_check
beforehand we can avoid taking the f_lock and calling this function if
the value hasn't changed.

I'll see if I can rephrase that to be more clear.

> > +	old = READ_ONCE(*eseq);
> > +	if (old != *since) {
> > +		/*
> > +		 * Set the flag and try to swap it into place if it has
> > +		 * changed.
> > +		 *
> > +		 * We don't care about the outcome of the swap here. If the
> > +		 * swap doesn't occur, then it has either been updated by a
> > +		 * writer who is bumping the seq count anyway, or another
> 
> "bumping the seq count anyway" part is not quite true. Writer may see
> ERRSEQ_SEEN not set and so just update the error code and leave seq count
> as is. But since you compare full errseq_t for equality, this works out as
> designed...
> 

Yes, I think the code is correct, but the comment is not clear. Maybe
"bumping the seq count or resetting the error code, or..."

In any case, I'll clean up the comments for the next respin here.

> > +		 * reader who is just setting the "seen" flag. Either outcome
> > +		 * is OK, and we can advance "since" and return an error based
> > +		 * on what we have.
> > +		 */
> > +		new = old | ERRSEQ_SEEN;
> > +		if (new != old)
> > +			cmpxchg(eseq, old, new);
> > +		*since = new;
> > +		err = -(new & MAX_ERRNO);
> > +	}
> > +	return err;
> > +}
> > +EXPORT_SYMBOL(errseq_check_and_advance);
> 
> 								Honza

Thanks!
Matthew Wilcox May 10, 2017, 2:18 p.m. UTC | #5
On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote:
> +++ b/lib/errseq.c
> @@ -0,0 +1,199 @@
> +#include <linux/err.h>
> +#include <linux/bug.h>
> +#include <linux/atomic.h>
> +#include <linux/errseq.h>
> +
> +/*
> + * An errseq_t is a way of recording errors in one place, and allowing any
> + * number of "subscribers" to tell whether it has changed since an arbitrary
> + * time of their choosing.

You use the word "time" in several places in the documentation, but I think
it's clearer to say "sampling point" or "sample", since you're not using jiffies
or nanoseconds.  For example, I'd phrase this paragraph this way:

 * An errseq_t is a way of recording errors in one place, and allowing any
 * number of "subscribers" to tell whether it has changed since they last
 * sampled it.

> + * The general idea is for consumers to sample an errseq_t value at a
> + * particular point in time. Later, that value can be used to tell whether any
> + * new errors have occurred since that time.

 * The general idea is for consumers to sample an errseq_t value.  That
 * value can be used to tell whether any new errors have occurred since
 * the last time it was sampled.

> +/* The "ones" bit for the counter */

Maybe "The lowest bit of the counter"?

> +/**
> + * errseq_check - has an error occurred since a particular point in time?

"has an error occurred since the last time it was sampled"

> +/**
> + * errseq_check_and_advance - check an errseq_t and advance it to the current value
> + * @eseq: pointer to value being checked reported

"value being checked reported"?

> +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> +{
> +	int err = 0;
> +	errseq_t old, new;
> +
> +	/*
> +	 * Most callers will want to use the inline wrapper to check this,
> +	 * so that the common case of no error is handled without needing
> +	 * to lock.
> +	 */
> +	old = READ_ONCE(*eseq);
> +	if (old != *since) {
> +		/*
> +		 * Set the flag and try to swap it into place if it has
> +		 * changed.
> +		 *
> +		 * We don't care about the outcome of the swap here. If the
> +		 * swap doesn't occur, then it has either been updated by a
> +		 * writer who is bumping the seq count anyway, or another
> +		 * reader who is just setting the "seen" flag. Either outcome
> +		 * is OK, and we can advance "since" and return an error based
> +		 * on what we have.
> +		 */
> +		new = old | ERRSEQ_SEEN;
> +		if (new != old)
> +			cmpxchg(eseq, old, new);
> +		*since = new;
> +		err = -(new & MAX_ERRNO);
> +	}

I probably need to read through the patchset some more to understand this.
Naively, surely "since" should be updated to the current value of 'eseq'
if we failed the cmpxchg()?
Jeff Layton May 10, 2017, 2:56 p.m. UTC | #6
On Wed, 2017-05-10 at 07:18 -0700, Matthew Wilcox wrote:
> On Tue, May 09, 2017 at 11:49:16AM -0400, Jeff Layton wrote:
> > +++ b/lib/errseq.c
> > @@ -0,0 +1,199 @@
> > +#include <linux/err.h>
> > +#include <linux/bug.h>
> > +#include <linux/atomic.h>
> > +#include <linux/errseq.h>
> > +
> > +/*
> > + * An errseq_t is a way of recording errors in one place, and allowing any
> > + * number of "subscribers" to tell whether it has changed since an arbitrary
> > + * time of their choosing.
> 
> You use the word "time" in several places in the documentation, but I think
> it's clearer to say "sampling point" or "sample", since you're not using jiffies
> or nanoseconds.  For example, I'd phrase this paragraph this way:
> 
>  * An errseq_t is a way of recording errors in one place, and allowing any
>  * number of "subscribers" to tell whether it has changed since they last
>  * sampled it.
> 
> > + * The general idea is for consumers to sample an errseq_t value at a
> > + * particular point in time. Later, that value can be used to tell whether any
> > + * new errors have occurred since that time.
> 
>  * The general idea is for consumers to sample an errseq_t value.  That
>  * value can be used to tell whether any new errors have occurred since
>  * the last time it was sampled.
> 
> > +/* The "ones" bit for the counter */
> 
> Maybe "The lowest bit of the counter"?
> 
> > +/**
> > + * errseq_check - has an error occurred since a particular point in time?
> 
> "has an error occurred since the last time it was sampled"
> 
> > +/**
> > + * errseq_check_and_advance - check an errseq_t and advance it to the current value
> > + * @eseq: pointer to value being checked reported
> 
> "value being checked reported"?
> 

Thanks. I'm cleaning up the comments like you suggest.

> > +int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
> > +{
> > +	int err = 0;
> > +	errseq_t old, new;
> > +
> > +	/*
> > +	 * Most callers will want to use the inline wrapper to check this,
> > +	 * so that the common case of no error is handled without needing
> > +	 * to lock.
> > +	 */
> > +	old = READ_ONCE(*eseq);
> > +	if (old != *since) {
> > +		/*
> > +		 * Set the flag and try to swap it into place if it has
> > +		 * changed.
> > +		 *
> > +		 * We don't care about the outcome of the swap here. If the
> > +		 * swap doesn't occur, then it has either been updated by a
> > +		 * writer who is bumping the seq count anyway, or another
> > +		 * reader who is just setting the "seen" flag. Either outcome
> > +		 * is OK, and we can advance "since" and return an error based
> > +		 * on what we have.
> > +		 */
> > +		new = old | ERRSEQ_SEEN;
> > +		if (new != old)
> > +			cmpxchg(eseq, old, new);
> > +		*since = new;
> > +		err = -(new & MAX_ERRNO);
> > +	}
> 
> I probably need to read through the patchset some more to understand this.
> Naively, surely "since" should be updated to the current value of 'eseq'
> if we failed the cmpxchg()?

I don't think so. If we want to do that, then we'll need to redrive the
cmpxchg to set the SEEN flag if it's now clear. Storing the value in
"since" is effectively sampling it, so you do have to mark it seen.

The good news is that I think that "new" is just as valid a value to
store here as *eseq would be. It ends up representing an errseq_t value
that never actually got stored in eseq, but that's OK with the way this
works.

Patch
diff mbox

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index 000000000000..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@ 
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32	errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+	/* Optimize for the common case of no error */
+	if (unlikely(err))
+		__errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 320ac46a8725..2423afef40f7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@  obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o refcount.o
+	 once.o refcount.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index 000000000000..0f8b4ed460f0
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,199 @@ 
+#include <linux/err.h>
+#include <linux/bug.h>
+#include <linux/atomic.h>
+#include <linux/errseq.h>
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since an arbitrary
+ * time of their choosing.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value at a
+ * particular point in time. Later, that value can be used to tell whether any
+ * new errors have occurred since that time.
+ *
+ * Note that there is a risk of collisions, if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT		ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN		(1 << ERRSEQ_SHIFT)
+
+/* The "ones" bit for the counter */
+#define ERRSEQ_CTR_INC		(1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most callers will want to use the errseq_set inline wrapper to efficiently
+ * handle the common case where err is 0.
+ */
+void __errseq_set(errseq_t *eseq, int err)
+{
+	errseq_t old;
+
+	/* MAX_ERRNO must be able to serve as a mask */
+	BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO + 1);
+
+	/*
+	 * Ensure the error code actually fits where we want it to go. If it
+	 * doesn't then just throw a warning and don't record anything. We
+	 * also don't accept zero here as that would effectively clear a
+	 * previous error.
+	 */
+	if (WARN(unlikely(err == 0 || (unsigned int)-err > MAX_ERRNO),
+				"err = %d\n", err))
+		return;
+
+	old = READ_ONCE(*eseq);
+	for (;;) {
+		errseq_t new, cur;
+
+		/* Clear out error bits and set new error */
+		new = (old & ~(MAX_ERRNO|ERRSEQ_SEEN)) | -err;
+
+		/* Only increment if someone has looked at it */
+		if (old & ERRSEQ_SEEN)
+			new += ERRSEQ_CTR_INC;
+
+		/* If there would be no change, then call it done */
+		if (new == old)
+			break;
+
+		/* Try to swap the new value into place */
+		cur = cmpxchg(eseq, old, new);
+
+		/*
+		 * Call it success if we did the swap or someone else beat us
+		 * to it for the same value.
+		 */
+		if (likely(cur == old || cur == new))
+			break;
+
+		/* Raced with an update, try again */
+		old = cur;
+	}
+}
+EXPORT_SYMBOL(__errseq_set);
+
+/**
+ * errseq_sample - grab current errseq_t value
+ * @eseq: pointer to errseq_t to be sampled
+ *
+ * This function allows callers to sample an errseq_t value, marking it as
+ * "seen" if required.
+ */
+errseq_t errseq_sample(errseq_t *eseq)
+{
+	errseq_t old = READ_ONCE(*eseq);
+	errseq_t new = old;
+
+	/*
+	 * For the common case of no errors ever having been set, we can skip
+	 * marking the SEEN bit. Once an error has been set, the value will
+	 * never go back to zero.
+	 */
+	if (old != 0) {
+		new |= ERRSEQ_SEEN;
+		if (old != new)
+			cmpxchg(eseq, old, new);
+	}
+	return new;
+}
+EXPORT_SYMBOL(errseq_sample);
+
+/**
+ * errseq_check - has an error occurred since a particular point in time?
+ * @eseq: pointer to errseq_t value to be checked
+ * @since: previously-sampled errseq_t from which to check
+ *
+ * Grab the value that eseq points to, and see if it has changed "since"
+ * the given value was sampled. The "since" value is not advanced, so there
+ * is no need to mark the value as seen.
+ *
+ * Returns the latest error set in the errseq_t or 0 if it hasn't changed.
+ */
+int errseq_check(errseq_t *eseq, errseq_t since)
+{
+	errseq_t cur = READ_ONCE(*eseq);
+
+	if (likely(cur == since))
+		return 0;
+	return -(cur & MAX_ERRNO);
+}
+EXPORT_SYMBOL(errseq_check);
+
+/**
+ * errseq_check_and_advance - check an errseq_t and advance it to the current value
+ * @eseq: pointer to value being checked reported
+ * @since: pointer to previously-sampled errseq_t to check against and advance
+ *
+ * Grab the eseq value, and see whether it matches the value that "since"
+ * points to. If it does, then just return 0.
+ *
+ * If it doesn't, then the value has changed. Set the "seen" flag, and try to
+ * swap it into place as the new eseq value. Then, set that value as the new
+ * "since" value, and return whatever the error portion is set to.
+ *
+ * Note that no locking is provided here for concurrent updates to the "since"
+ * value. The caller must provide that if necessary. Because of this, callers
+ * may want to do a lockless errseq_check before taking the lock and calling
+ * this.
+ */
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
+{
+	int err = 0;
+	errseq_t old, new;
+
+	/*
+	 * Most callers will want to use the inline wrapper to check this,
+	 * so that the common case of no error is handled without needing
+	 * to lock.
+	 */
+	old = READ_ONCE(*eseq);
+	if (old != *since) {
+		/*
+		 * Set the flag and try to swap it into place if it has
+		 * changed.
+		 *
+		 * We don't care about the outcome of the swap here. If the
+		 * swap doesn't occur, then it has either been updated by a
+		 * writer who is bumping the seq count anyway, or another
+		 * reader who is just setting the "seen" flag. Either outcome
+		 * is OK, and we can advance "since" and return an error based
+		 * on what we have.
+		 */
+		new = old | ERRSEQ_SEEN;
+		if (new != old)
+			cmpxchg(eseq, old, new);
+		*since = new;
+		err = -(new & MAX_ERRNO);
+	}
+	return err;
+}
+EXPORT_SYMBOL(errseq_check_and_advance);