[v3,0/3] vfs: have syncfs() return error when there are writeback errors

Message ID 20200207170423.377931-1-jlayton@kernel.org (mailing list archive)

Message

Jeffrey Layton Feb. 7, 2020, 5:04 p.m. UTC
You're probably wondering -- Where are v1 and v2 sets?

I did the first couple of versions of this set back in 2018, and then
got dragged off to work on other things. I'd like to resurrect this set
though, as I think it's valuable overall, and I have need of it for some
other work I'm doing.

Currently, syncfs does not return errors when one of the inodes fails to
be written back. It will return errors based on the legacy AS_EIO and
AS_ENOSPC flags when syncing out the block device fails, but that's not
particularly helpful for filesystems that aren't backed by a blockdev.
It's also possible for a stray sync to lose those errors.

The basic idea is to track writeback errors at the superblock level,
so that we can quickly and easily check whether something bad happened
without having to fsync each file individually. syncfs is then changed
to reliably report writeback errors, and a new ioctl is added to allow
userland to get at the current errseq_t value w/o having to sync out
anything.

I do have an xfstest for this. I do not yet have manpage patches, but
I'm happy to roll some once there is consensus on the interface.
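
To make the intended usage concrete, here is a rough sketch from the
application's side (the path and the error handler are made up, and exactly
what sort of fd is needed is the O_PATH caveat below):

    /*
     * Keep a long-lived fd on the filesystem that holds the data, opened
     * before any writes we care about, and call syncfs() on it at
     * checkpoint/transaction boundaries.  A writeback error recorded since
     * the fd was opened then makes syncfs() return -1 with errno set.
     */
    int fs_fd = open("/path/to/data", O_RDONLY);
    ...
    if (syncfs(fs_fd) < 0)
            handle_writeback_error(errno);  /* e.g. roll back and replay */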

Caveats:

- Having different behavior for an O_PATH descriptor in syncfs is
  a bit odd, but it means that we don't have to grow struct file. Is
  that acceptable from an API standpoint?

- This adds a new generic fs ioctl to allow userland to scrape the
  current superblock's errseq_t value. It may be best to present this
  to userland via fsinfo() instead (once that's merged). I'm fine with
  dropping the last patch for now and reworking it for fsinfo if so.

Jeff Layton (3):
  vfs: track per-sb writeback errors and report them to syncfs
  buffer: record blockdev write errors in super_block that it backs
  vfs: add a new ioctl for fetching the superblock's errseq_t

 fs/buffer.c             |  2 ++
 fs/ioctl.c              |  4 ++++
 fs/open.c               |  6 +++---
 fs/sync.c               |  9 ++++++++-
 include/linux/errseq.h  |  1 +
 include/linux/fs.h      |  3 +++
 include/linux/pagemap.h |  5 ++++-
 include/uapi/linux/fs.h |  1 +
 lib/errseq.c            | 33 +++++++++++++++++++++++++++++++--
 9 files changed, 57 insertions(+), 7 deletions(-)

Comments

Dave Chinner Feb. 7, 2020, 8:52 p.m. UTC | #1
On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> You're probably wondering -- Where are v1 and v2 sets?
> 
> I did the first couple of versions of this set back in 2018, and then
> got dragged off to work on other things. I'd like to resurrect this set
> though, as I think it's valuable overall, and I have need of it for some
> other work I'm doing.
> 
> Currently, syncfs does not return errors when one of the inodes fails to
> be written back. It will return errors based on the legacy AS_EIO and
> AS_ENOSPC flags when syncing out the block device fails, but that's not
> particularly helpful for filesystems that aren't backed by a blockdev.
> It's also possible for a stray sync to lose those errors.
> 
> The basic idea is to track writeback errors at the superblock level,
> so that we can quickly and easily check whether something bad happened
> without having to fsync each file individually. syncfs is then changed
> to reliably report writeback errors, and a new ioctl is added to allow
> userland to get at the current errseq_t value w/o having to sync out
> anything.

So what, exactly, can userspace do with this error? It has no idea
at all what file the writeback failure occurred on or even
what files syncfs() even acted on so there's no obvious error
recovery that it could perform on reception of such an error.

> I do have a xfstest for this. I do not yet have manpage patches, but
> I'm happy to roll some once there is consensus on the interface.
> 
> Caveats:
> 
> - Having different behavior for an O_PATH descriptor in syncfs is
>   a bit odd, but it means that we don't have to grow struct file. Is
>   that acceptable from an API standpoint?

It's an ugly wart, IMO. But because we suck at APIs, I'm betting
that we'll decide this is OK or do something even worse...

> - This adds a new generic fs ioctl to allow userland to scrape the
>   current superblock's errseq_t value. It may be best to present this
>   to userland via fsinfo() instead (once that's merged). I'm fine with
>   dropping the last patch for now and reworking it for fsinfo if so.

What, exactly, is this useful for? Why would we consider exposing
an internal implementation detail to userspace like this?

Cheers,

Dave.
Andres Freund Feb. 7, 2020, 9:20 p.m. UTC | #2
Hi,

On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > You're probably wondering -- Where are v1 and v2 sets?

> > The basic idea is to track writeback errors at the superblock level,
> > so that we can quickly and easily check whether something bad happened
> > without having to fsync each file individually. syncfs is then changed
> > to reliably report writeback errors, and a new ioctl is added to allow
> > userland to get at the current errseq_t value w/o having to sync out
> > anything.
> 
> So what, exactly, can userspace do with this error? It has no idea
> at all what file the writeback failure occurred on or even
> what files syncfs() even acted on so there's no obvious error
> recovery that it could perform on reception of such an error.

Depends on the application.  For e.g. postgres it'd be to reset
in-memory contents and perform WAL replay from the last checkpoint. Due
to various reasons* it's very hard for us (without major performance
and/or reliability impact) to fully guarantee that by the time we fsync
specific files we do so on an old enough fd to guarantee that we'd see
an error triggered by background writeback.  But keeping track of
all potential filesystems data resides on (with one fd open permanently
for each) and then syncfs()ing them at checkpoint time is quite doable.

*I can go into details, but it's probably not interesting enough
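
Schematically, the checkpoint-time side would be something like this
(function names made up):

    /* one long-lived fd per filesystem that holds database data,
     * opened before any durable writes are issued */
    for (i = 0; i < n_data_filesystems; i++)
    {
        if (syncfs(fs_fd[i]) < 0)
        {
            /*
             * Background writeback lost data somewhere on this fs: don't
             * advance the redo pointer; instead discard dirty in-memory
             * state and replay WAL from the last checkpoint.
             */
            trigger_crash_recovery();
        }
    }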


> > - This adds a new generic fs ioctl to allow userland to scrape the
> >   current superblock's errseq_t value. It may be best to present this
> >   to userland via fsinfo() instead (once that's merged). I'm fine with
> >   dropping the last patch for now and reworking it for fsinfo if so.
> 
> What, exactly, is this useful for? Why would we consider exposing
> an internal implementation detail to userspace like this?

There is, as far as I can tell, so far no way but scraping the kernel
log to figure out if there have been data loss errors on a
machine/fs. Even besides app specific reactions like outlined above,
just generally being able to alert whenever the error count increases
seems extremely useful.  I'm not sure it makes sense to expose the
errseq_t bits straight though - seems like it'd enshrine them in
userspace ABI too much?

Greetings,

Andres Freund
Jeffrey Layton Feb. 7, 2020, 10:05 p.m. UTC | #3
On Fri, 2020-02-07 at 13:20 -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > You're probably wondering -- Where are v1 and v2 sets?
> > > The basic idea is to track writeback errors at the superblock level,
> > > so that we can quickly and easily check whether something bad happened
> > > without having to fsync each file individually. syncfs is then changed
> > > to reliably report writeback errors, and a new ioctl is added to allow
> > > userland to get at the current errseq_t value w/o having to sync out
> > > anything.
> > 
> > So what, exactly, can userspace do with this error? It has no idea
> > at all what file the writeback failure occurred on or even
> > what files syncfs() even acted on so there's no obvious error
> > recovery that it could perform on reception of such an error.
> 
> Depends on the application.  For e.g. postgres it'd to be to reset
> in-memory contents and perform WAL replay from the last checkpoint. Due
> to various reasons* it's very hard for us (without major performance
> and/or reliability impact) to fully guarantee that by the time we fsync
> specific files we do so on an old enough fd to guarantee that we'd see
> the an error triggered by background writeback.  But keeping track of
> all potential filesystems data resides on (with one fd open permanently
> for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> *I can go into details, but it's probably not interesting enough
> 

Do applications (specifically postgresql) need the ability to check
whether there have been writeback errors on a filesystem w/o blocking on
a syncfs() call?  I thought that you had mentioned a specific usecase
for that, but if you're actually ok with syncfs() then we can drop that
part altogether.

> 
> > > - This adds a new generic fs ioctl to allow userland to scrape the
> > >   current superblock's errseq_t value. It may be best to present this
> > >   to userland via fsinfo() instead (once that's merged). I'm fine with
> > >   dropping the last patch for now and reworking it for fsinfo if so.
> > 
> > What, exactly, is this useful for? Why would we consider exposing
> > an internal implementation detail to userspace like this?
> 
> There is, as far as I can tell, so far no way but scraping the kernel
> log to figure out if there have been data loss errors on a
> machine/fs. Even besides app specific reactions like outlined above,
> just generally being able to alert whenever there error count increases
> seems extremely useful.  I'm not sure it makes sense to expose the
> errseq_t bits straight though - seems like it'd enshrine them in
> userspace ABI too much?
> 

Yeah, if we do end up keeping it, I'm leaning toward making this
fetchable via fsinfo() (once that's merged). If we do that, then we'll
split this into a struct with two fields -- the most recent errno and an
opaque token that you can keep to tell whether new errors have been
recorded since.

I think that should be a little cleaner from an API standpoint. Probably
we can just drop the ioctl, under the assumption that fsinfo() will be
available in 5.7.
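
Roughly something like this (field names are just a strawman, not a proposed
ABI):

    struct fsinfo_wb_error {
            __s32   wbe_last_errno;   /* most recent writeback error */
            __u32   wbe_cookie;       /* opaque; changes whenever a new
                                       * error is recorded */
    };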

Cheers,
Andres Freund Feb. 7, 2020, 10:21 p.m. UTC | #4
Hi,

On 2020-02-07 17:05:28 -0500, Jeff Layton wrote:
> On Fri, 2020-02-07 at 13:20 -0800, Andres Freund wrote:
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on or even
> > > what files syncfs() even acted on so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd to be to reset
> > in-memory contents and perform WAL replay from the last checkpoint. Due
> > to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > the an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> > 
> > *I can go into details, but it's probably not interesting enough
> > 
> 
> Do applications (specifically postgresql) need the ability to check
> whether there have been writeback errors on a filesystem w/o blocking on
> a syncfs() call?  I thought that you had mentioned a specific usecase
> for that, but if you're actually ok with syncfs() then we can drop that
> part altogether.

It'd be considerably better if we could check for errors without a
blocking syncfs(). A syncfs will trigger many more dirty pages to be
written back than what we need for durability. Our checkpoint writes are
throttled to reduce the impact on current IO, we try to ensure there's
not much outstanding IO before calling fsync() on FDs, etc - all to
avoid stalls. Especially as on plenty installations there's also
temporary files, e.g. for bigger-than-memory sorts, on the same FS.  So
if we had to syncfs() to reliably detect errors it'd cause some pain
- but would still be an improvement.

But with a nonblocking check we could compare the error count from the
last checkpoint with the current count before finalizing the checkpoint
- without causing unnecessary writeback.
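
I.e. roughly the following, with fetch_wb_error_token() being a made-up
wrapper around whatever the kernel ends up exposing (the proposed ioctl or
fsinfo()):

    /* sampled when the checkpoint starts */
    token_start = fetch_wb_error_token(fs_fd);

    /* ... throttled checkpoint writes, fsync()s of the dirtied files ... */

    /* before declaring the checkpoint complete */
    if (fetch_wb_error_token(fs_fd) != token_start)
        abort_checkpoint_and_recover();   /* don't advance the redo point */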

Currently, if we crash (any unclean shutdown, be it a PG bug, OS dying,
kill -9), we'll iterate over all files afterwards to make sure they're
fsynced, before starting to perform WAL replay. That can take quite a
while on some systems - it'd be much nicer if we could just syncfs() the
involved filesystems (which we can detect more quickly than iterating
over the whole directory tree, there's only a few places where we
support separate mounts), and still get errors.


> Yeah, if we do end up keeping it, I'm leaning toward making this
> fetchable via fsinfo() (once that's merged). If we do that, then we'll
> split this into a struct with two fields -- the most recent errno and an
> opaque token that you can keep to tell whether new errors have been
> recorded since.
> 
> I think that should be a little cleaner from an API standpoint. Probably
> we can just drop the ioctl, under the assumption that fsinfo() will be
> available in 5.7.

Sounds like a plan.

I guess an alternative could be to expose the error count in /sys, but
that looks like it'd be a bigger project, as there doesn't seem to be a
good pre-existing hierarchy to hook into.

Greetings,

Andres Freund
Dave Chinner Feb. 10, 2020, 9:46 p.m. UTC | #5
On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > You're probably wondering -- Where are v1 and v2 sets?
> 
> > > The basic idea is to track writeback errors at the superblock level,
> > > so that we can quickly and easily check whether something bad happened
> > > without having to fsync each file individually. syncfs is then changed
> > > to reliably report writeback errors, and a new ioctl is added to allow
> > > userland to get at the current errseq_t value w/o having to sync out
> > > anything.
> > 
> > So what, exactly, can userspace do with this error? It has no idea
> > at all what file the writeback failure occurred on or even
> > what files syncfs() even acted on so there's no obvious error
> > recovery that it could perform on reception of such an error.
> 
> Depends on the application.  For e.g. postgres it'd to be to reset
> in-memory contents and perform WAL replay from the last checkpoint.

What happens if a user runs 'sync -f /path/to/postgres/data' instead
of postgres? All the writeback errors are consumed at that point by
reporting them to the process that ran syncfs()...

> Due
> to various reasons* it's very hard for us (without major performance
> and/or reliability impact) to fully guarantee that by the time we fsync
> specific files we do so on an old enough fd to guarantee that we'd see
> the an error triggered by background writeback.  But keeping track of
> all potential filesystems data resides on (with one fd open permanently
> for each) and then syncfs()ing them at checkpoint time is quite doable.

Oh, you have to keep an fd permanently open to every superblock that
application holds data on so that errors detected by other users of
that filesystem are also reported to the application?

This seems like a fairly important requirement for applications to
ensure this error reporting is "reliable" and that certainly wasn't
apparent from the patches or their description.  i.e. the API has an
explicit userspace application behaviour requirement for reliable
functioning, and that was not documented.  "we suck at APIs" and all
that..

It also seems to me as useful only to applications that have a
"rollback and replay" error recovery mechanism. If the application
doesn't have the ability to go back in time to before the
"unfindable" writeback error occurred, then this error is largely
useless to those applications because they can't do anything with
it, and so....

> > > - This adds a new generic fs ioctl to allow userland to scrape
> > > the current superblock's errseq_t value. It may be best to
> > > present this to userland via fsinfo() instead (once that's
> > > merged). I'm fine with dropping the last patch for now and
> > > reworking it for fsinfo if so.
> > 
> > What, exactly, is this useful for? Why would we consider
> > exposing an internal implementation detail to userspace like
> > this?
> 
> There is, as far as I can tell, so far no way but scraping the
> kernel log to figure out if there have been data loss errors on a
> machine/fs.

.... most applications will still require users to scrape their
logs to find out what error actually occurred. IOWs, we haven't
really changed the status quo with this new mechanism.

FWIW, explicit userspace error notifications for data loss events is
one of the features that David Howell's generic filesystem
notification mechanism is intended to provide.  Hence I'm not sure
that there's a huge amount of value in providing a partial solution
that only certain applications can use when there's a fully generic
mechanism for error notification just around the corner.

> Even besides app specific reactions like outlined above,
> just generally being able to alert whenever there error count
> increases seems extremely useful.

Yup, a generic filesystem notification mechanism is perfect for
that, plus it can provide more explicit details of where the error
actually occurred rather than just a handwavy "some error occurred
somewhere" report....

> I'm not sure it makes sense to
> expose the errseq_t bits straight though - seems like it'd
> enshrine them in userspace ABI too much?

Even a little is way too much. Userspace ABI needs to be completely
independent of the kernel internal structures and implementation.
This is basic "we suck at APIs 101" stuff...

Cheers,

Dave.
Andres Freund Feb. 10, 2020, 11:59 p.m. UTC | #6
Hi,

David added you, because there's discussion about your notify work
below.

On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > 
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on or even
> > > what files syncfs() even acted on so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd to be to reset
> > in-memory contents and perform WAL replay from the last checkpoint.
> 
> What happens if a user runs 'sync -f /path/to/postgres/data' instead
> of postgres? All the writeback errors are consumed at that point by
> reporting them to the process that ran syncfs()...

We'd have to keep an fd open from *before* we start durable operations,
which has a sampled errseq_t from before we rely on seeing errors.


> > Due to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > the an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> Oh, you have to keep an fd permanently open to every superblock that
> application holds data on so that errors detected by other users of
> that filesystem are also reported to the application?

Right

Currently it's much worse (you probably know?):

Without error reporting capabilities in syncfs or such you have to keep
an fd open to *every* single inode you want to reliably get errors
for. Fds have an errseq_t to keep track of which errors have been seen
by that fd, so if you have one open from *before* an error is triggered,
you can be sure to detect that. But if the fd is not guaranteed to be
old enough you can hit two cases:

1) Some other application actually sees an error, address_space->wb_err
   is marked ERRSEQ_SEEN. Any new fd will not see or report the problem
   anymore.
2) The inode with the error gets evicted (memory pressure on a database
   server isn't rare) while there is no fd open. Nobody might see the
   error.

If there were a reliable (i.e. it may not wrap around or such) error
counter available *somewhere*, we could just keep track of that, instead
of actually needing an "old" open fd in the right process.
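
To spell case 1) out as a sequence of events:

    /* background writeback of some inode fails; its wb_err is bumped */

    fsync(old_fd);                    /* some other process, via an fd opened
                                       * before the failure: gets -1/EIO and
                                       * the error is now marked ERRSEQ_SEEN */

    fd = open("datafile", O_WRONLY);  /* our fd, opened only afterwards */
    fsync(fd);                        /* returns 0 -- the data loss is
                                       * invisible to us */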


> This seems like a fairly important requirement for applications to
> ensure this error reporting is "reliable" and that certainly wasn't
> apparent from the patches or their description.  i.e. the API has an
> explicit userspace application behaviour requirement for reliable
> functioning, and that was not documented.  "we suck at APIs" and all
> that..

Yup.


> It also seems to me as useful only to applications that have a
> "rollback and replay" error recovery mechanism. If the application
> doesn't have the ability to go back in time to before the
> "unfindable" writeback error occurred, then this error is largely
> useless to those applications because they can't do anything with
> it, and so....

That's a pretty common thing these days for applications that actually
care about data to some degree. Either they immediately f[data]sync
after writing, or they use some form of storage that has journalling
capabilities. Which e.g. sqlite provides for the myriad of cases that
don't want a separate server.

And even if they can't recover from the error, there's a huge difference
between not noticing that shit has hit the fan and happily continuing to
accept further data, and telling the user that something has gone wrong
with data integrity without details.


> .... most applications will still require users to scrape their
> logs to find out what error actually occurred. IOWs, we haven't
> really changed the status quo with this new mechanism.

I think there's a huge practical difference between having to do so in
case there was an actual error (on the relevant FSs), and having to do
so continually.


> FWIW, explicit userspace error notifications for data loss events is
> one of the features that David Howell's generic filesystem
> notification mechanism is intended to provide.  Hence I'm not sure
> that there's a huge amount of value in providing a partial solution
> that only certain applications can use when there's a fully generic
> mechanism for error notification just around the corner.

Interesting. I largely missed that work, unfortunately. It's hard to
keep up with all kernel things, while also maintaining / developing an
RDBMS :/

I assume the last state that includes the superblock layer stuff is at
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
whereas there's a newer
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core
not including that. There do seem to be some significant changes between
the two.

As far as I can tell the superblock based stuff does *not* actually
report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
to include writeback errors as well? Or just filesystem metadata/journal
IO?

I don't think that block layer notifications would be sufficient for an
individual userspace application's data integrity purposes? For one,
it'd need to map devices to relevant filesystems afaict. And there's
also errors above the block layer.


Based on skimming the commits in those two trees, I'm not quite sure I
understand what the permission model for accessing the
notifications will be? Seems anyone, even within a container or
something, will see blockdev errors from everywhere?  The earlier
superblock support (I'm not sure I like that name btw, hard to
understand for us userspace folks), seems to have required exec
permission, but nothing else.

For it to be usable for integrity purposes delivery has to be reliable
and there needs to be a clear ordering, in the sense that reading
notifications needs to return all errors that occurred timewise before the
last pending notification has been read. Looks like that's largely the
case, although I'm a bit scared after seeing:

+void __post_watch_notification(struct watch_list *wlist,
+			       struct watch_notification *n,
+			       const struct cred *cred,
+			       u64 id)
+
+		if (security_post_notification(watch->cred, cred, n) < 0)
+			continue;
+

if an LSM module just decides to hide notifications that are relied upon for
integrity, we'll be in trouble.


> > I'm not sure it makes sense to
> > expose the errseq_t bits straight though - seems like it'd
> > enshrine them in userspace ABI too much?
> 
> Even a little is way too much. Userspace ABI needs to be completely
> independent of the kernel internal structures and implementation.
> This is basic "we suck at APIs 101" stuff...

Well, if it were just a counter of errors that gets stuck at some well
defined max, it seems like it'd be ok. But I agree that it'd not be the
best possible interface, and that the notify API seems like it could
turn into that.

Greetings,

Andres Freund
Andres Freund Feb. 11, 2020, 12:04 a.m. UTC | #7
Hi,

(sorry if somebody got this twice)

David added you, because there's discussion about your notify work
below.

On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > 
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on or even
> > > what files syncfs() even acted on so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd to be to reset
> > in-memory contents and perform WAL replay from the last checkpoint.
> 
> What happens if a user runs 'sync -f /path/to/postgres/data' instead
> of postgres? All the writeback errors are consumed at that point by
> reporting them to the process that ran syncfs()...

We'd have to keep an fd open from *before* we start durable operations,
which has a sampled errseq_t from before we rely on seeing errors.


> > Due to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > the an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> Oh, you have to keep an fd permanently open to every superblock that
> application holds data on so that errors detected by other users of
> that filesystem are also reported to the application?

Right

Currently it's much worse (you probably know?):

Without error reporting capabilities in syncfs or such you have to keep
an fd open to *every* single inode you want to reliably get errors
for. Fds have an errseq_t to keep track of which errors have been seen
by that fd, so if you have one open from *before* an error is triggered,
you can be sure to detect that. But if the fd is not guaranteed to be
old enough you can hit two cases:

1) Some other application actually sees an error, address_space->wb_err
   is marked ERRSEQ_SEEN. Any new fd will not see or report the problem
   anymore.
2) The inode with the error gets evicted (memory pressure on a database
   server isn't rare) while there is no fd open. Nobody might see the
   error.

If there were a reliable (i.e. it may not wrap around or such) error
counter available *somewhere*, we could just keep track of that, instead
of actually needing an "old" open fd in the right process.


> This seems like a fairly important requirement for applications to
> ensure this error reporting is "reliable" and that certainly wasn't
> apparent from the patches or their description.  i.e. the API has an
> explicit userspace application behaviour requirement for reliable
> functioning, and that was not documented.  "we suck at APIs" and all
> that..

Yup.


> It also seems to me as useful only to applications that have a
> "rollback and replay" error recovery mechanism. If the application
> doesn't have the ability to go back in time to before the
> "unfindable" writeback error occurred, then this error is largely
> useless to those applications because they can't do anything with
> it, and so....

That's a pretty common thing these days for applications that actually
care about data to some degree. Either they immediately f[data]sync
after writing, or they use some form of storage that has journalling
capabilities. Which e.g. sqlite provides for the myriad of cases that
don't want a separate server.

And even if they can't recover from the error, there's a huge difference
between not noticing that shit has hit the fan and happily continuing to
accept further data, and telling the user that something has gone wrong
with data integrity without details.


> .... most applications will still require users to scrape their
> logs to find out what error actually occurred. IOWs, we haven't
> really changed the status quo with this new mechanism.

I think there's a huge practical difference between having to do so in
case there was an actual error (on the relevant FSs), and having to do
so continually.


> FWIW, explicit userspace error notifications for data loss events is
> one of the features that David Howell's generic filesystem
> notification mechanism is intended to provide.  Hence I'm not sure
> that there's a huge amount of value in providing a partial solution
> that only certain applications can use when there's a fully generic
> mechanism for error notification just around the corner.

Interesting. I largely missed that work, unfortunately. It's hard to
keep up with all kernel things, while also maintaining / developing an
RDBMS :/

I assume the last state that includes the superblock layer stuff is at
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
whereas there's a newer
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core
not including that. There do seem to be some significant changes between
the two.

As far as I can tell the superblock based stuff does *not* actually
report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
to include writeback errors as well? Or just filesystem metadata/journal
IO?

I don't think that block layer notifications would be sufficient for an
individual userspace application's data integrity purposes? For one,
it'd need to map devices to relevant filesystems afaict. And there's
also errors above the block layer.


Based on skimming the commits in those two trees, I'm not quite sure I
understand what the permission model for accessing the
notifications will be? Seems anyone, even within a container or
something, will see blockdev errors from everywhere?  The earlier
superblock support (I'm not sure I like that name btw, hard to
understand for us userspace folks), seems to have required exec
permission, but nothing else.

For it to be usable for integrity purposes delivery has to be reliable
and there needs to be a clear ordering, in the sense that reading
notifications needs to return all errors that occurred timewise before the
last pending notification has been read. Looks like that's largely the
case, although I'm a bit scared after seeing:

+void __post_watch_notification(struct watch_list *wlist,
+			       struct watch_notification *n,
+			       const struct cred *cred,
+			       u64 id)
+
+		if (security_post_notification(watch->cred, cred, n) < 0)
+			continue;
+

if an LSM module just decides to hide notifications that are relied upon for
integrity, we'll be in trouble.


> > I'm not sure it makes sense to
> > expose the errseq_t bits straight though - seems like it'd
> > enshrine them in userspace ABI too much?
> 
> Even a little is way too much. Userspace ABI needs to be completely
> independent of the kernel internal structures and implementation.
> This is basic "we suck at APIs 101" stuff...

Well, if it were just a counter of errors that gets stuck at some well
defined max, it seems like it'd be ok. But I agree that it'd not be the
best possible interface, and that the notify API seems like it could
turn into that.

Greetings,

Andres Freund
Dave Chinner Feb. 11, 2020, 12:48 a.m. UTC | #8
On Mon, Feb 10, 2020 at 04:04:05PM -0800, Andres Freund wrote:
> Hi,
> 
> (sorry if somebody got this twice)
> 
> David added you, because there's discussion about your notify work
> below.
> 
> On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> > On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > > You're probably wondering -- Where are v1 and v2 sets?
> > > 
> > > > > The basic idea is to track writeback errors at the superblock level,
> > > > > so that we can quickly and easily check whether something bad happened
> > > > > without having to fsync each file individually. syncfs is then changed
> > > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > > userland to get at the current errseq_t value w/o having to sync out
> > > > > anything.
> > > > 
> > > > So what, exactly, can userspace do with this error? It has no idea
> > > > at all what file the writeback failure occurred on or even
> > > > what files syncfs() even acted on so there's no obvious error
> > > > recovery that it could perform on reception of such an error.
> > > 
> > > Depends on the application.  For e.g. postgres it'd to be to reset
> > > in-memory contents and perform WAL replay from the last checkpoint.
> > 
> > What happens if a user runs 'sync -f /path/to/postgres/data' instead
> > of postgres? All the writeback errors are consumed at that point by
> > reporting them to the process that ran syncfs()...
> 
> We'd have to keep an fd open from *before* we start durable operations,
> which has a sampled errseq_t from before we rely on seeing errors.
> 
> 
> > > Due to various reasons* it's very hard for us (without major performance
> > > and/or reliability impact) to fully guarantee that by the time we fsync
> > > specific files we do so on an old enough fd to guarantee that we'd see
> > > the an error triggered by background writeback.  But keeping track of
> > > all potential filesystems data resides on (with one fd open permanently
> > > for each) and then syncfs()ing them at checkpoint time is quite doable.
> > 
> > Oh, you have to keep an fd permanently open to every superblock that
> > application holds data on so that errors detected by other users of
> > that filesystem are also reported to the application?
> 
> Right
> 
> Currently it's much worse (you probably now?):

*nod*

> > FWIW, explicit userspace error notifications for data loss events is
> > one of the features that David Howell's generic filesystem
> > notification mechanism is intended to provide.  Hence I'm not sure
> > that there's a huge amount of value in providing a partial solution
> > that only certain applications can use when there's a fully generic
> > mechanism for error notification just around the corner.
> 
> Interesting. I largely missed that work, unfortunately. It's hard to
> keep up with all kernel things, while also maintaining / developing an
> RDBMS :/

It's hard enough trying to keep up with everything as a filesystem
developer... :/

> I assume the last state that includes the superblock layer stuff is at
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> whereas there's a newer
> https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-core
> not including that. There do seem to be some significant changes between
> the two.

And they are out of date, anyway, because they are still based on
a mmap'd ring buffer rather than the pipe infrastructure. See this
branch for the latest:

https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications-pipe-core

It doesn't have all the block/superblock/usb notification
infrastructure in it, just the keyring implementation. No point
reimplementing everything while the core notification mechanism is
still changing...

> 
> As far as I can tell the superblock based stuff does *not* actually
> report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
> to include writeback errors as well? Or just filesystem metadata/journal
> IO?

Right, that part hasn't been implemented yet, though it's repeatedly
mentioned as intended to be supported functionality. It will depend
on the filesystem as to what it is going to report, but I would expect
that it will initially be focussed on reporting user data errors
(e.g. writeback errors, block device gone bad data loss reports,
etc). It may not be possible to do anything sane with
metadata/journal IO errors as they typically cause the filesystem to
shutdown.

Of course, a filesystem shutdown is likely to result in a thundering
herd of userspace IO error notifications (think hundreds of GB of
dirty page cache getting EIO errors). Hence individual filesystems
will have to put some thought into how critical filesystem error
notifications are handled.

That said, we likely want userspace notification of metadata IO
errors for our own purposes. e.g. so we can trigger the online
filesystem repair code to start trying to fix whatever went wrong. I
doubt there's much userspace can do with things like "bad freespace
btree block" notifications, whilst the filesystem's online repair
tool can trigger a free space scan and rebuild/repair it without
userspace applications even being aware that we just detected and
corrected a critical metadata corruption....

> I don't think that block layer notifications would be sufficient for an
> individual userspace application's data integrity purposes? For one,
> it'd need to map devices to relevant filesystems afaictl. And there's
> also errors above the block layer.

Block device errors are separate notifications to the superblock
notifications. If you want the notification of raw block device
errors, then that's what you listen for. If you want the filesystem
to actually tell you what file and offset that EIO was generated
for, then you'd get that through the superblock notifier, not the
block device notifier...

> Based on skimming the commits in those two trees, I'm not quite sure I
> understand what the permission model will be for accessing the
> notifications will be? Seems anyone, even within a container or
> something, will see blockdev errors from everywhere?  The earlier
> superblock support (I'm not sure I like that name btw, hard to
> understand for us userspace folks), seems to have required exec
> permission, but nothing else.

I'm not really familiar with those details - I've only got all the
"how it fits into the big picture" stuff in my head. Little
implementation details like that :) aren't that important to me -
all I need to know is how the infrastructure interacts with the
kernel filesystem code and whether it provides the functionality we
need to report filesystem errors directly to userspace...

Cheers,

Dave.
Andres Freund Feb. 11, 2020, 1:31 a.m. UTC | #9
Hi,

I shortly after this found a thread where Linus was explicitly asking
for potential userspace users of the feature, so I also responded there:
https://lore.kernel.org/linux-fsdevel/20200211005626.7yqjf5rbs3vbwagd@alap3.anarazel.de/

On 2020-02-11 11:48:30 +1100, Dave Chinner wrote:
> On Mon, Feb 10, 2020 at 04:04:05PM -0800, Andres Freund wrote:
> > On 2020-02-11 08:46:57 +1100, Dave Chinner wrote:
> > As far as I can tell the superblock based stuff does *not* actually
> > report any errors yet (contrast to READONLY, EDQUOT). Is the plan here
> > to include writeback errors as well? Or just filesystem metadata/journal
> > IO?
> 
> Right, that part hasn't been implemented yet, though it's repeatedly
> mentioned as intended to be supported functionality. It will depend
> on the filesystem to what it is going to report

There really ought to be some clear guidelines on what is expected to be
reported though. Otherwise we'll just end up with a hodgepodge of
different semantics, which'd be, ummm, not good.


> but I would expect that it will initially be focussed on reporting
> user data errors (e.g. writeback errors, block device gone bad data
> loss reports, etc). It may not be possible to do anything sane with
> metadata/journal IO errors as they typically cause the filesystem to
> shutdown.

I was mostly referencing the metadata/journal errors because it's what a
number of filesystems seem to treat as errors (cf errors=remount-ro
etc), and I just wanted to be sure that more than just those get
reported up...

I think the patch already had support for getting a separate type of
notification for SBs remounted ro, shouldn't be too hard to change that
so it'd report error shutdowns / remount-ro as a different
category. Without


> Of course, a filesystem shutdown is likely to result in a thundering
> herd of userspace IO error notifications (think hundreds of GB of
> dirty page cache getting EIO errors). Hence individual filesystems
> will have to put some thought into how critical filesystem error
> notifications are handled.

Probably would make sense to stop reporting them individually once the
whole FS is shutdown/remounted due to errors, and a notification about
that fact has been sent.


> That said, we likely want userspace notification of metadata IO
> errors for our own purposes. e.g. so we can trigger the online
> filesystem repair code to start trying to fix whatever went wrong. I
> doubt there's much userspace can do with things like "bad freespace
> btree block" notifications, whilst the filesystem's online repair
> tool can trigger a free space scan and rebuild/repair it without
> userspace applications even being aware that we just detected and
> corrected a critical metadata corruption....

Neat.


> > I don't think that block layer notifications would be sufficient for an
> > individual userspace application's data integrity purposes? For one,
> > it'd need to map devices to relevant filesystems afaictl. And there's
> > also errors above the block layer.
> 
> Block device errors separate notifications to the superblock
> notifications. If you want the notification of raw block device
> errors, then that's what you listen for. If you want the filesystem
> to actually tell you what file and offset that EIO was generated
> for, then you'd get that through the superblock notifier, not the
> block device notifier...

Not something we urgently need, but it might come in handy at a later
point.

Thanks,

Andres
Jeffrey Layton Feb. 11, 2020, 12:57 p.m. UTC | #10
On Tue, 2020-02-11 at 08:46 +1100, Dave Chinner wrote:
> On Fri, Feb 07, 2020 at 01:20:12PM -0800, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-02-08 07:52:43 +1100, Dave Chinner wrote:
> > > On Fri, Feb 07, 2020 at 12:04:20PM -0500, Jeff Layton wrote:
> > > > You're probably wondering -- Where are v1 and v2 sets?
> > > > The basic idea is to track writeback errors at the superblock level,
> > > > so that we can quickly and easily check whether something bad happened
> > > > without having to fsync each file individually. syncfs is then changed
> > > > to reliably report writeback errors, and a new ioctl is added to allow
> > > > userland to get at the current errseq_t value w/o having to sync out
> > > > anything.
> > > 
> > > So what, exactly, can userspace do with this error? It has no idea
> > > at all what file the writeback failure occurred on or even
> > > what files syncfs() even acted on so there's no obvious error
> > > recovery that it could perform on reception of such an error.
> > 
> > Depends on the application.  For e.g. postgres it'd to be to reset
> > in-memory contents and perform WAL replay from the last checkpoint.
> 
> What happens if a user runs 'sync -f /path/to/postgres/data' instead
> of postgres? All the writeback errors are consumed at that point by
> reporting them to the process that ran syncfs()...
> 

Well, no. If you keep an fd open, then you can be sure that you'll see
any errors that occurred since that open at syncfs time, regardless of
who else is issuing syncfs() calls. That's basically how errseq_t works.

I guess I figured that part was obvious and didn't point it out here --
mea culpa.
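
In kernel terms it's just the usual errseq_t pattern -- a sketch only, with
approximate field names; where the per-file cursor actually lives is the
struct file / O_PATH question from the cover letter:

    /* at open time: sample the superblock's error stream */
    f->f_sb_err = errseq_sample(&sb->s_wb_err);

    /* at syncfs() time: report anything recorded since, and advance */
    ret = errseq_check_and_advance(&sb->s_wb_err, &f->f_sb_err);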

> > Due
> > to various reasons* it's very hard for us (without major performance
> > and/or reliability impact) to fully guarantee that by the time we fsync
> > specific files we do so on an old enough fd to guarantee that we'd see
> > the an error triggered by background writeback.  But keeping track of
> > all potential filesystems data resides on (with one fd open permanently
> > for each) and then syncfs()ing them at checkpoint time is quite doable.
> 
> Oh, you have to keep an fd permanently open to every superblock that
> application holds data on so that errors detected by other users of
> that filesystem are also reported to the application?
> 
> This seems like a fairly important requirement for applications to
> ensure this error reporting is "reliable" and that certainly wasn't
> apparent from the patches or their description.  i.e. the API has an
> explicit userspace application behaviour requirement for reliable
> functioning, and that was not documented.  "we suck at APIs" and all
> that..
> 
> It also seems to me as useful only to applications that have a
> "rollback and replay" error recovery mechanism. If the application
> doesn't have the ability to go back in time to before the
> "unfindable" writeback error occurred, then this error is largely
> useless to those applications because they can't do anything with
> it, and so....
> 

Just knowing that an error occurred is still a better situation than
letting the application obliviously chug along.

In a sense I see the above argument as circular. Our error reporting
mechanisms have historically sucked, and applications have been written
accordingly. Unless we improve how errors are reported, applications
will never improve.

It is true that we can't reasonably report which inodes failed writeback
with this interface, but syncfs only returns int. That's really the best
we can do with it.

> > > > - This adds a new generic fs ioctl to allow userland to scrape
> > > > the current superblock's errseq_t value. It may be best to
> > > > present this to userland via fsinfo() instead (once that's
> > > > merged). I'm fine with dropping the last patch for now and
> > > > reworking it for fsinfo if so.
> > > 
> > > What, exactly, is this useful for? Why would we consider
> > > exposing an internal implementation detail to userspace like
> > > this?
> > 
> > There is, as far as I can tell, so far no way but scraping the
> > kernel log to figure out if there have been data loss errors on a
> > machine/fs.
> 
> .... most applications will still require users to scrape their
> logs to find out what error actually occurred. IOWs, we haven't
> really changed the status quo with this new mechanism.
> 
> FWIW, explicit userspace error notifications for data loss events is
> one of the features that David Howell's generic filesystem
> notification mechanism is intended to provide.  Hence I'm not sure
> that there's a huge amount of value in providing a partial solution
> that only certain applications can use when there's a fully generic
> mechanism for error notification just around the corner.
> 

David's notification work is great, but it's quite a bit more
heavyweight than just allowing syncfs to return better errors. The
application would need to register to be notified and watch a pipe. Not
all applications are going to want or need to do that.

> > Even besides app specific reactions like outlined above,
> > just generally being able to alert whenever there error count
> > increases seems extremely useful.
> 
> Yup, a generic filesystem notification mechanism is perfect for
> that, plus it can provide more explicit details of where the error
> actually occurred rather than jsut a handwavy "some error occurred
> some where" report....
> 
> > I'm not sure it makes sense to
> > expose the errseq_t bits straight though - seems like it'd
> > enshrine them in userspace ABI too much?
> 
> Even a little is way too much. Userspace ABI needs to be completely
> independent of the kernel internal structures and implementation.
> This is basic "we suck at APIs 101" stuff...
> 

Yeah, I've already self-NAK'ed that patch.

If we are going to expose that info, we'll probably do it via fsinfo(),
and put it in a struct with two fields: last reported error code, and an
opaque token that you can use to see whether new errors have been
recorded since you last checked. I think that should be enough to ensure
that we don't tie this too closely to the internal kernel implementation.
Jeffrey Layton Feb. 12, 2020, 12:21 p.m. UTC | #11
On Fri, 2020-02-07 at 12:04 -0500, Jeff Layton wrote:
> You're probably wondering -- Where are v1 and v2 sets?
> 
> I did the first couple of versions of this set back in 2018, and then
> got dragged off to work on other things. I'd like to resurrect this set
> though, as I think it's valuable overall, and I have need of it for some
> other work I'm doing.
> 
> Currently, syncfs does not return errors when one of the inodes fails to
> be written back. It will return errors based on the legacy AS_EIO and
> AS_ENOSPC flags when syncing out the block device fails, but that's not
> particularly helpful for filesystems that aren't backed by a blockdev.
> It's also possible for a stray sync to lose those errors.
> 
> The basic idea is to track writeback errors at the superblock level,
> so that we can quickly and easily check whether something bad happened
> without having to fsync each file individually. syncfs is then changed
> to reliably report writeback errors, and a new ioctl is added to allow
> userland to get at the current errseq_t value w/o having to sync out
> anything.
> 
> I do have a xfstest for this. I do not yet have manpage patches, but
> I'm happy to roll some once there is consensus on the interface.
> 
> Caveats:
> 
> - Having different behavior for an O_PATH descriptor in syncfs is
>   a bit odd, but it means that we don't have to grow struct file. Is
>   that acceptable from an API standpoint?
> 

There are a couple of other options besides requiring an O_PATH fd here:

1) we could just add a new errseq_t field to struct file for this. On my
machine (x86_64) there are 4 bytes of padding at the end of struct file.
An errseq_t would slot in there without changing the slab object size.
YMMV on other arches of course.

2) we could add a new fcntl command value (F_SYNCFS or something?), that
would flip the fd to being suitable for syncfs. If you tried to use the
fd to do a fsync at that point, we could return an error.
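
From userspace, option 2) would look something like this (F_SYNCFS is only a
placeholder name):

    int fd = open("/path/to/mnt", O_RDONLY);

    /* flip the fd into syncfs-error-reporting mode */
    fcntl(fd, F_SYNCFS);

    /* from here on, fsync(fd) would return an error ... */

    syncfs(fd);    /* ... and syncfs() reports per-sb writeback errors */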

Anyone else have other thoughts on how best to do this?

> - This adds a new generic fs ioctl to allow userland to scrape the
>   current superblock's errseq_t value. It may be best to present this
>   to userland via fsinfo() instead (once that's merged). I'm fine with
>   dropping the last patch for now and reworking it for fsinfo if so.
> 

To be clear, as I stated in earlier replies, I think we can drop the
ioctl. If we did want something like this, I think we'd want to expose
it via fsinfo() instead, and that could be done after the syncfs changes
went in.

> Jeff Layton (3):
>   vfs: track per-sb writeback errors and report them to syncfs
>   buffer: record blockdev write errors in super_block that it backs
>   vfs: add a new ioctl for fetching the superblock's errseq_t
> 
>  fs/buffer.c             |  2 ++
>  fs/ioctl.c              |  4 ++++
>  fs/open.c               |  6 +++---
>  fs/sync.c               |  9 ++++++++-
>  include/linux/errseq.h  |  1 +
>  include/linux/fs.h      |  3 +++
>  include/linux/pagemap.h |  5 ++++-
>  include/uapi/linux/fs.h |  1 +
>  lib/errseq.c            | 33 +++++++++++++++++++++++++++++++--
>  9 files changed, 57 insertions(+), 7 deletions(-)
>