Message ID | cover.1695350405.git.wqu@suse.com (mailing list archive)
Series     | btrfs: introduce "abort=" groups for more strict error handling
On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
> During a very interesting (and weird) debugging session, it turns out
> that btrfs will ignore a lot of write errors until we hit some critical
> location, then btrfs started reacting, normally by aborting the
> transaction.
>
> This can be problematic for two types of people:
>
> - Developers
>   Sometimes we want to catch the earliest sign; continuing without any
>   obvious errors (other than kernel error messages) can make debugging
>   much harder.
>
> - Sysadmins who want to catch problems early
>   Dmesg is not really something users would check frequently, and even
>   if they check it, it may already be too late.
>   Meanwhile if the fs flips read-only early it's not only noisy but also
>   preserves the context much better (more relevant dmesg etc).

For sysadmins I think that the preferred way is to get events (like via
the uevents interface) that can be monitored and then reacted to by some
tools.

> On the other hand, I totally understand that if just a single sector
> fails to be written and we mark the whole fs read-only, it can be super
> frustrating for regular end users, thus we can not make it the default
> behavior.

I can't imagine a realistic scenario where a user would like this
behaviour; one EIO taking down the whole filesystem could make sense only
for some testing environments.

> So here we introduce a mount option group "abort=", and make the
> following errors more noisy and abort early if specified by the user.

The default answer for a new mount option is 'no'. Here we also have one
that is probably doing the same, 'fatal_errors', so if you really want
to do that via a mount option then please use this one.

> - Any super block write back failure
>   Currently we're very loose on the super block writeback failure.
>   The failure has to meet both conditions below:
>   * The primary super block writeback failed

Doesn't this fail with a flip to read-only?

>   * Total failed devices go beyond tolerance
>     The tolerance is super high, num_devices - 1. To me this is
>     too high, but I don't have a better idea yet.

Does this depend on the profile constraints?

>   This new "rescue=super" may be more frequently used considering how
>   loose our existing tolerance is.
>
> - Any data writeback failure
>   This is only for the data writeback at the btrfs bio layer.
>   This means, if a data sector is going to be written to a RAID1 chunk,
>   and one mirror failed, we still consider the writeback succeeded.
>
> There would be another one for btrfs bio layer, but I have found
> something weird in the code, thus it would only be introduced after I
> solve the problem there; meanwhile we can discuss the usefulness of
> this patchset.

We can possibly enhance the error checking with additional knobs and
checkpoints that will have to survive or detect specific testing, but as
mount options it's not very flexible. We can possibly do it via sysfs or
BPF but this may not be the proper interface anyway.
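As an illustration of the uevents suggestion above, here is a minimal sketch of how a write error could be surfaced to monitoring tools instead of only to dmesg. The helper name and the "FS_ERROR=" key are invented for illustration and are not part of the series; the sketch assumes the per-filesystem sysfs kobject (fs_devices->fsid_kobj) as the event source.

#include <linux/kobject.h>

/*
 * Hypothetical helper (not in the series): report a write error to
 * userspace as a uevent so that udev rules or a monitoring daemon can
 * react to it.  The "FS_ERROR=" environment key is made up for
 * illustration; fsid_kobj is the per-filesystem sysfs kobject.
 */
static void btrfs_notify_write_error(struct btrfs_fs_info *fs_info,
				     const char *what, int error)
{
	char event[64];
	char *envp[] = { event, NULL };

	snprintf(event, sizeof(event), "FS_ERROR=%s:%d", what, error);
	kobject_uevent_env(&fs_info->fs_devices->fsid_kobj, KOBJ_CHANGE, envp);
}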
On 2023/9/23 00:25, David Sterba wrote:
> On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
>> During a very interesting (and weird) debugging session, it turns out
>> that btrfs will ignore a lot of write errors until we hit some critical
>> location, then btrfs started reacting, normally by aborting the
>> transaction.
>>
>> This can be problematic for two types of people:
>>
>> - Developers
>>   Sometimes we want to catch the earliest sign; continuing without any
>>   obvious errors (other than kernel error messages) can make debugging
>>   much harder.
>>
>> - Sysadmins who want to catch problems early
>>   Dmesg is not really something users would check frequently, and even
>>   if they check it, it may already be too late.
>>   Meanwhile if the fs flips read-only early it's not only noisy but also
>>   preserves the context much better (more relevant dmesg etc).
>
> For sysadmins I think that the preferred way is to get events (like via
> the uevents interface) that can be monitored and then reacted to by some
> tools.

I think this is a future development idea, to have a generic way to do
error reporting.

>> On the other hand, I totally understand that if just a single sector
>> fails to be written and we mark the whole fs read-only, it can be super
>> frustrating for regular end users, thus we can not make it the default
>> behavior.
>
> I can't imagine a realistic scenario where a user would like this
> behaviour; one EIO taking down the whole filesystem could make sense only
> for some testing environments.

I doubt it; for some environments with expensive hardware, one may not
even expect any -EIO for valid operations.
If that happens, it may mean bad firmware or bad hardware, neither of
which is a good thing, especially if they paid extra money for the fancy
hardware or the support fee.

>> So here we introduce a mount option group "abort=", and make the
>> following errors more noisy and abort early if specified by the user.
>
> The default answer for a new mount option is 'no'. Here we also have one
> that is probably doing the same, 'fatal_errors', so if you really want
> to do that via a mount option then please use this one.

Or I can go with a sysfs interface and put it under some debug directory.

>> - Any super block write back failure
>>   Currently we're very loose on the super block writeback failure.
>>   The failure has to meet both conditions below:
>>   * The primary super block writeback failed
>
> Doesn't this fail with a flip to read-only?

Nope, that by itself is not enough to go read-only, as long as we have
other devices.

If the primary super block writeback failed, it would only make
write_dev_supers() return an error.
But inside write_all_supers(), an error from write_dev_supers() would only
increase @total_errors, not directly error out.

>>   * Total failed devices go beyond tolerance
>>     The tolerance is super high, num_devices - 1. To me this is
>>     too high, but I don't have a better idea yet.
>
> Does this depend on the profile constraints?

Nope, it's profile independent.
The @max_errors inside write_all_supers() is purely determined by
btrfs_super_num_devices() - 1.

>>   This new "rescue=super" may be more frequently used considering how
>>   loose our existing tolerance is.
>>
>> - Any data writeback failure
>>   This is only for the data writeback at the btrfs bio layer.
>>   This means, if a data sector is going to be written to a RAID1 chunk,
>>   and one mirror failed, we still consider the writeback succeeded.
>>
>> There would be another one for btrfs bio layer, but I have found
>> something weird in the code, thus it would only be introduced after I
>> solve the problem there; meanwhile we can discuss the usefulness of
>> this patchset.
>
> We can possibly enhance the error checking with additional knobs and
> checkpoints that will have to survive or detect specific testing, but as
> mount options it's not very flexible. We can possibly do it via sysfs or
> BPF but this may not be the proper interface anyway.

I think sysfs would be better, but I'm not familiar enough with BPF to
determine if it's any better or worse.

Thanks,
Qu
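For reference, here is a heavily simplified sketch of the tolerance logic Qu describes above, modelled on write_all_supers() in fs/btrfs/disk-io.c; locking, barriers, device-state checks and the wait pass are omitted, so treat it as an illustration rather than the actual code.

static int write_all_supers_sketch(struct btrfs_fs_info *fs_info,
				   int max_mirrors)
{
	struct btrfs_device *dev;
	int total_errors = 0;
	/* Tolerance is num_devices - 1, independent of the chunk profile. */
	int max_errors = btrfs_super_num_devices(fs_info->super_copy) - 1;

	list_for_each_entry(dev, &fs_info->fs_devices->devices, dev_list) {
		/* A failed primary super block write only bumps the counter. */
		if (write_dev_supers(dev, fs_info->super_copy, max_mirrors))
			total_errors++;
	}

	/* Only past the tolerance does the filesystem actually react. */
	if (total_errors > max_errors) {
		btrfs_handle_fs_error(fs_info, -EIO,
				      "%d errors while writing supers",
				      total_errors);
		return -EIO;
	}
	return 0;
}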
On Sat, Sep 23, 2023 at 06:46:26AM +0930, Qu Wenruo wrote:
> On 2023/9/23 00:25, David Sterba wrote:
> > On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
> >> On the other hand, I totally understand that if just a single sector
> >> fails to be written and we mark the whole fs read-only, it can be super
> >> frustrating for regular end users, thus we can not make it the default
> >> behavior.
> >
> > I can't imagine a realistic scenario where a user would like this
> > behaviour; one EIO taking down the whole filesystem could make sense only
> > for some testing environments.
>
> I doubt it; for some environments with expensive hardware, one may not
> even expect any -EIO for valid operations.
> If that happens, it may mean bad firmware or bad hardware, neither of
> which is a good thing, especially if they paid extra money for the fancy
> hardware or the support fee.

So the semantics we'd need is like "fail on first error of <type>" where
we can define a set of errors, like EIO, superblock write error or
something related to devices.

> >> So here we introduce a mount option group "abort=", and make the
> >> following errors more noisy and abort early if specified by the user.
> >
> > The default answer for a new mount option is 'no'. Here we also have one
> > that is probably doing the same, 'fatal_errors', so if you really want
> > to do that via a mount option then please use this one.
>
> Or I can go with a sysfs interface and put it under some debug directory.

For a prototype it's much more convenient until we understand what the
actual use case is.

> >> This new "rescue=super" may be more frequently used considering how
> >> loose our existing tolerance is.
> >>
> >> - Any data writeback failure
> >>   This is only for the data writeback at the btrfs bio layer.
> >>   This means, if a data sector is going to be written to a RAID1 chunk,
> >>   and one mirror failed, we still consider the writeback succeeded.
> >>
> >> There would be another one for btrfs bio layer, but I have found
> >> something weird in the code, thus it would only be introduced after I
> >> solve the problem there; meanwhile we can discuss the usefulness of
> >> this patchset.
> >
> > We can possibly enhance the error checking with additional knobs and
> > checkpoints that will have to survive or detect specific testing, but as
> > mount options it's not very flexible. We can possibly do it via sysfs or
> > BPF but this may not be the proper interface anyway.
>
> I think sysfs would be better, but I'm not familiar enough with BPF to
> determine if it's any better or worse.

BPF is probably a bad idea, I mentioned it only as a potential way, it's
another extensible interface.

What you suggest looks like the forced shutdown of the filesystem. This
can be done internally or externally (ioctl). I haven't looked at the
interface to what extent it's configurable, but let's say there's a
bitmask set by the admin and the filesystem checks that in case a given
type of error happens. Then a forced shutdown would be like a transaction
abort, followed by read-only. We have decent support for that, but with
the shutdown some kind of audit would have to happen anyway, namely for
the EIO type of errors. Specific context like a super block write error
would be relatively easy.
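A minimal sketch of the bitmask idea floated above, purely hypothetical: none of the flag names nor the shutdown_mask field exist in btrfs today, and the mask would presumably be set through sysfs or an ioctl.

/* Hypothetical error classes an admin could opt into treating as fatal. */
#define BTRFS_SHUTDOWN_ON_SUPER_WRITE_ERR	(1ULL << 0)
#define BTRFS_SHUTDOWN_ON_DATA_WRITE_ERR	(1ULL << 1)
#define BTRFS_SHUTDOWN_ON_METADATA_EIO		(1ULL << 2)

/*
 * Hypothetical check called from the error sites: if the admin-set mask
 * (a made-up fs_info->shutdown_mask field) covers this error class, force
 * a shutdown, i.e. abort the transaction and flip read-only.
 */
static void btrfs_maybe_force_shutdown(struct btrfs_fs_info *fs_info,
				       u64 error_class, int error)
{
	if (!(READ_ONCE(fs_info->shutdown_mask) & error_class))
		return;

	btrfs_handle_fs_error(fs_info, error,
			      "forced shutdown on error class 0x%llx",
			      error_class);
}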
On 2023/9/26 02:07, David Sterba wrote:
> On Sat, Sep 23, 2023 at 06:46:26AM +0930, Qu Wenruo wrote:
>> On 2023/9/23 00:25, David Sterba wrote:
>>> On Fri, Sep 22, 2023 at 12:25:18PM +0930, Qu Wenruo wrote:
>>>> On the other hand, I totally understand that if just a single sector
>>>> fails to be written and we mark the whole fs read-only, it can be super
>>>> frustrating for regular end users, thus we can not make it the default
>>>> behavior.
>>>
>>> I can't imagine a realistic scenario where a user would like this
>>> behaviour; one EIO taking down the whole filesystem could make sense only
>>> for some testing environments.
>>
>> I doubt it; for some environments with expensive hardware, one may not
>> even expect any -EIO for valid operations.
>> If that happens, it may mean bad firmware or bad hardware, neither of
>> which is a good thing, especially if they paid extra money for the fancy
>> hardware or the support fee.
>
> So the semantics we'd need is like "fail on first error of <type>" where
> we can define a set of errors, like EIO, superblock write error or
> something related to devices.

The "set of errors" would be very specific, thus -EIO is not a good idea
AFAIK.

>>>> So here we introduce a mount option group "abort=", and make the
>>>> following errors more noisy and abort early if specified by the user.
>>>
>>> The default answer for a new mount option is 'no'. Here we also have one
>>> that is probably doing the same, 'fatal_errors', so if you really want
>>> to do that via a mount option then please use this one.
>>
>> Or I can go with a sysfs interface and put it under some debug directory.
>
> For a prototype it's much more convenient until we understand what the
> actual use case is.

Well, not that convenient, as we need to expand the mount option bits to
u64, or on 32bit systems we're going to cause problems because we go
beyond 32 mount options (the first patch).

>>>> This new "rescue=super" may be more frequently used considering how
>>>> loose our existing tolerance is.
>>>>
>>>> - Any data writeback failure
>>>>   This is only for the data writeback at the btrfs bio layer.
>>>>   This means, if a data sector is going to be written to a RAID1 chunk,
>>>>   and one mirror failed, we still consider the writeback succeeded.
>>>>
>>>> There would be another one for btrfs bio layer, but I have found
>>>> something weird in the code, thus it would only be introduced after I
>>>> solve the problem there; meanwhile we can discuss the usefulness of
>>>> this patchset.
>>>
>>> We can possibly enhance the error checking with additional knobs and
>>> checkpoints that will have to survive or detect specific testing, but as
>>> mount options it's not very flexible. We can possibly do it via sysfs or
>>> BPF but this may not be the proper interface anyway.
>>
>> I think sysfs would be better, but I'm not familiar enough with BPF to
>> determine if it's any better or worse.
>
> BPF is probably a bad idea, I mentioned it only as a potential way, it's
> another extensible interface.
>
> What you suggest looks like the forced shutdown of the filesystem. This
> can be done internally or externally (ioctl). I haven't looked at the
> interface to what extent it's configurable, but let's say there's a
> bitmask set by the admin and the filesystem checks that in case a given
> type of error happens. Then a forced shutdown would be like a transaction
> abort, followed by read-only. We have decent support for that, but with
> the shutdown some kind of audit would have to happen anyway, namely for
> the EIO type of errors. Specific context like a super block write error
> would be relatively easy.

I'm not sure if I understand the bitmask thing correctly.

The idea of the series is to catch certain types of errors which are by
default ignored/handled gracefully.

The bitmask looks a little too generic; what we want is to catch a
specific error at certain call sites (thus I can understand why you
mention eBPF).

Thus we can not just simply use a bitmask to handle all the generic
errors, but add checks into the routes we're interested in.

Thanks,
Qu
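To contrast with the bitmask, here is a sketch of the per-call-site approach described above, using the RAID1 data writeback example; the ABORT_DATA_WRITEBACK option bit and the helper are invented for illustration and do not necessarily match the actual patches.

/*
 * Hypothetical hook in the data write completion path.  Today a failed
 * mirror of a RAID1 data write is still counted as success as long as
 * another mirror made it to disk; with the opt-in flag set, the very
 * first failure would be escalated instead.
 */
static void btrfs_check_data_write_error(struct btrfs_fs_info *fs_info,
					 int mirror, blk_status_t status)
{
	if (!status)
		return;

	/* ABORT_DATA_WRITEBACK is a made-up mount option bit. */
	if (btrfs_test_opt(fs_info, ABORT_DATA_WRITEBACK))
		btrfs_handle_fs_error(fs_info, blk_status_to_errno(status),
				      "data writeback failed on mirror %d",
				      mirror);
}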