Message ID | d9445c47de98ad9142a0e40598f7c8152ed833d1.1459913103.git.jcody@redhat.com (mailing list archive)
---|---
State | New, archived
[ Adding some CCs ] Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > Upon receiving an I/O error after an fsync, by default gluster will > dump its cache. However, QEMU will retry the fsync, which is especially > useful when encountering errors such as ENOSPC when using the werror=stop > option. When using caching with gluster, however, the last written data > will be lost upon encountering ENOSPC. Using the cache xlator option of > 'resync-failed-syncs-after-fsync' should cause gluster to retain the > cached data after a failed fsync, so that ENOSPC and other transient > errors are recoverable. > > Signed-off-by: Jeff Cody <jcody@redhat.com> > --- > block/gluster.c | 27 +++++++++++++++++++++++++++ > configure | 8 ++++++++ > 2 files changed, 35 insertions(+) > > diff --git a/block/gluster.c b/block/gluster.c > index 30a827e..b1cf71b 100644 > --- a/block/gluster.c > +++ b/block/gluster.c > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > goto out; > } > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > + /* Without this, if fsync fails for a recoverable reason (for instance, > + * ENOSPC), gluster will dump its cache, preventing retries. This means > + * almost certain data loss. Not all gluster versions support the > + * 'resync-failed-syncs-after-fsync' key value, but there is no way to > + * discover during runtime if it is supported (this api returns success for > + * unknown key/value pairs) */ Honestly, this sucks. There is apparently no way to operate gluster so we can safely recover after a failed fsync. "We hope everything is fine, but depending on your gluster version, we may now corrupt your image" isn't very good. We need to consider very carefully if this is good enough to go on after an error. I'm currently leaning towards "no". That is, we should only enable this after Gluster provides us a way to make sure that the option is really set. 
> + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > + "resync-failed-syncs-after-fsync", > + "on"); > + if (ret < 0) { > + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > + ret = -errno; > + goto out; > + } > +#endif We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. In this case (as well as theoretically in the case that the option didn't take effect - if only we could know about it), a failed glfs_fsync_async() is fatal and we need to stop operating on the image, i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. The guest will see a broken disk that fails all I/O requests, but that's better than corrupting data. Kevin
We had a thread discussing this not on the upstream list. My summary of the thread is that I don't understand why gluster should drop cached data after a failed fsync() for any open file. For closed files, I think it might still happen but this is the same as any file system (and unlikely to be the case for qemu?). I will note that Linux in general had (still has I think?) the behavior that once the process closes a file (or exits), we lose context to return an error to. From that point on, any failed IO from the page cache to the target disk will be dropped from cache. To hold things in the cache would lead it to fill with old data that is not really recoverable and we have no good way to know that the situation is repairable and how long that might take. Upstream kernel people have debated this, the behavior might be tweaked for certain types of errors. Regards, Ric On 04/06/2016 07:02 AM, Kevin Wolf wrote: > [ Adding some CCs ] > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: >> Upon receiving an I/O error after an fsync, by default gluster will >> dump its cache. However, QEMU will retry the fsync, which is especially >> useful when encountering errors such as ENOSPC when using the werror=stop >> option. When using caching with gluster, however, the last written data >> will be lost upon encountering ENOSPC. Using the cache xlator option of >> 'resync-failed-syncs-after-fsync' should cause gluster to retain the >> cached data after a failed fsync, so that ENOSPC and other transient >> errors are recoverable. 
>> >> Signed-off-by: Jeff Cody <jcody@redhat.com> >> --- >> block/gluster.c | 27 +++++++++++++++++++++++++++ >> configure | 8 ++++++++ >> 2 files changed, 35 insertions(+) >> >> diff --git a/block/gluster.c b/block/gluster.c >> index 30a827e..b1cf71b 100644 >> --- a/block/gluster.c >> +++ b/block/gluster.c >> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, >> goto out; >> } >> >> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT >> + /* Without this, if fsync fails for a recoverable reason (for instance, >> + * ENOSPC), gluster will dump its cache, preventing retries. This means >> + * almost certain data loss. Not all gluster versions support the >> + * 'resync-failed-syncs-after-fsync' key value, but there is no way to >> + * discover during runtime if it is supported (this api returns success for >> + * unknown key/value pairs) */ > Honestly, this sucks. There is apparently no way to operate gluster so > we can safely recover after a failed fsync. "We hope everything is fine, > but depending on your gluster version, we may now corrupt your image" > isn't very good. > > We need to consider very carefully if this is good enough to go on after > an error. I'm currently leaning towards "no". That is, we should only > enable this after Gluster provides us a way to make sure that the option > is really set. > >> + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", >> + "resync-failed-syncs-after-fsync", >> + "on"); >> + if (ret < 0) { >> + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); >> + ret = -errno; >> + goto out; >> + } >> +#endif > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > In this case (as well as theoretically in the case that the option > didn't take effect - if only we could know about it), a failed > glfs_fsync_async() is fatal and we need to stop operating on the image, > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. 
> The guest will see a broken disk that fails all I/O requests, but that's > better than corrupting data. > > Kevin
Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben: > > We had a thread discussing this not on the upstream list. > > My summary of the thread is that I don't understand why gluster > should drop cached data after a failed fsync() for any open file. It certainly shouldn't, but it does by default. :-) Have a look at commit 3fcead2d in glusterfs.git, which at least introduces an option to get usable behaviour: { .key = {"resync-failed-syncs-after-fsync"}, .type = GF_OPTION_TYPE_BOOL, .default_value = "off", .description = "If sync of \"cached-writes issued before fsync\" " "(to backend) fails, this option configures whether " "to retry syncing them after fsync or forget them. " "If set to on, cached-writes are retried " "till a \"flush\" fop (or a successful sync) on sync " "failures. " "fsync itself is failed irrespective of the value of " "this option. ", }, As you can see, the default is still to drop cached data, and this is with the file still opened. qemu needs to make sure that this option is set, and if Jeff's comment in the code below is right, there is no way currently to make sure that the option isn't silently ignored. Can we get some function that sets an option and fails if the option is unknown? Or one that queries the state after setting an option, so we can check whether we succeeded in switching to the mode we need? > For closed files, I think it might still happen but this is the same > as any file system (and unlikely to be the case for qemu?). Our problem is only with open images. Dropping caches for files that qemu doesn't use any more is fine as far as I'm concerned. Note that our usage can involve cases where we reopen a file with different flags, i.e. first open a second file descriptor, then close the first one. The image was never completely closed here and we would still want the cache to preserve our data in such cases. > I will note that Linux in general had (still has I think?) 
the > behavior that once the process closes a file (or exits), we lose > context to return an error to. From that point on, any failed IO > from the page cache to the target disk will be dropped from cache. > To hold things in the cache would lead it to fill with old data that > is not really recoverable and we have no good way to know that the > situation is repairable and how long that might take. Upstream > kernel people have debated this, the behavior might be tweaked for > certain types of errors. That's fine, we just don't want the next fsync() to signal success when in reality the cache has thrown away our data. As soon as we close the image, there is no next fsync(), so you can do whatever you like. Kevin > On 04/06/2016 07:02 AM, Kevin Wolf wrote: > >[ Adding some CCs ] > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > >>Upon receiving an I/O error after an fsync, by default gluster will > >>dump its cache. However, QEMU will retry the fsync, which is especially > >>useful when encountering errors such as ENOSPC when using the werror=stop > >>option. When using caching with gluster, however, the last written data > >>will be lost upon encountering ENOSPC. Using the cache xlator option of > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the > >>cached data after a failed fsync, so that ENOSPC and other transient > >>errors are recoverable. 
> >> > >>Signed-off-by: Jeff Cody <jcody@redhat.com> > >>--- > >> block/gluster.c | 27 +++++++++++++++++++++++++++ > >> configure | 8 ++++++++ > >> 2 files changed, 35 insertions(+) > >> > >>diff --git a/block/gluster.c b/block/gluster.c > >>index 30a827e..b1cf71b 100644 > >>--- a/block/gluster.c > >>+++ b/block/gluster.c > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > >> goto out; > >> } > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > >>+ /* Without this, if fsync fails for a recoverable reason (for instance, > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means > >>+ * almost certain data loss. Not all gluster versions support the > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to > >>+ * discover during runtime if it is supported (this api returns success for > >>+ * unknown key/value pairs) */ > >Honestly, this sucks. There is apparently no way to operate gluster so > >we can safely recover after a failed fsync. "We hope everything is fine, > >but depending on your gluster version, we may now corrupt your image" > >isn't very good. > > > >We need to consider very carefully if this is good enough to go on after > >an error. I'm currently leaning towards "no". That is, we should only > >enable this after Gluster provides us a way to make sure that the option > >is really set. > > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > >>+ "resync-failed-syncs-after-fsync", > >>+ "on"); > >>+ if (ret < 0) { > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > >>+ ret = -errno; > >>+ goto out; > >>+ } > >>+#endif > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > >In this case (as well as theoretically in the case that the option > >didn't take effect - if only we could know about it), a failed > >glfs_fsync_async() is fatal and we need to stop operating on the image, > >i.e. 
set bs->drv = NULL like when we detect corruption in qcow2 images. > >The guest will see a broken disk that fails all I/O requests, but that's > >better than corrupting data. > > > >Kevin >
Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben: > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben: > > > > We had a thread discussing this not on the upstream list. > > > > My summary of the thread is that I don't understand why gluster > > should drop cached data after a failed fsync() for any open file. > > It certainly shouldn't, but it does by default. :-) > > Have a look at commit 3fcead2d in glusterfs.git, which at least > introduces an option to get usable behaviour: > > { .key = {"resync-failed-syncs-after-fsync"}, > .type = GF_OPTION_TYPE_BOOL, > .default_value = "off", > .description = "If sync of \"cached-writes issued before fsync\" " > "(to backend) fails, this option configures whether " > "to retry syncing them after fsync or forget them. " > "If set to on, cached-writes are retried " > "till a \"flush\" fop (or a successful sync) on sync " > "failures. " > "fsync itself is failed irrespective of the value of " > "this option. ", > }, > > As you can see, the default is still to drop cached data, and this is > with the file still opened. qemu needs to make sure that this option is > set, and if Jeff's comment in the code below is right, there is no way > currently to make sure that the option isn't silently ignored. > > Can we get some function that sets an option and fails if the option is > unknown? Or one that queries the state after setting an option, so we > can check whether we succeeded in switching to the mode we need? > > > For closed files, I think it might still happen but this is the same > > as any file system (and unlikely to be the case for qemu?). > > Our problem is only with open images. Dropping caches for files that > qemu doesn't use any more is fine as far as I'm concerned. > > Note that our usage can involve cases where we reopen a file with > different flags, i.e. first open a second file descriptor, then close > the first one. 
The image was never completely closed here and we would > still want the cache to preserve our data in such cases. Hm, actually, maybe we should just call bdrv_flush() before reopening an image, and if an error is returned, we abort the reopen. It's far from being a hot path, so the overhead of a flush shouldn't matter, and it seems we're taking an unnecessary risk without doing this. Kevin > > I will note that Linux in general had (still has I think?) the > > behavior that once the process closes a file (or exits), we lose > > context to return an error to. From that point on, any failed IO > > from the page cache to the target disk will be dropped from cache. > > To hold things in the cache would lead it to fill with old data that > > is not really recoverable and we have no good way to know that the > > situation is repairable and how long that might take. Upstream > > kernel people have debated this, the behavior might be tweaked for > > certain types of errors. > > That's fine, we just don't want the next fsync() to signal success when > in reality the cache has thrown away our data. As soon as we close the > image, there is no next fsync(), so you can do whatever you like. > > Kevin > > > On 04/06/2016 07:02 AM, Kevin Wolf wrote: > > >[ Adding some CCs ] > > > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > >>Upon receiving an I/O error after an fsync, by default gluster will > > >>dump its cache. However, QEMU will retry the fsync, which is especially > > >>useful when encountering errors such as ENOSPC when using the werror=stop > > >>option. When using caching with gluster, however, the last written data > > >>will be lost upon encountering ENOSPC. Using the cache xlator option of > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the > > >>cached data after a failed fsync, so that ENOSPC and other transient > > >>errors are recoverable. 
> > >> > > >>Signed-off-by: Jeff Cody <jcody@redhat.com> > > >>--- > > >> block/gluster.c | 27 +++++++++++++++++++++++++++ > > >> configure | 8 ++++++++ > > >> 2 files changed, 35 insertions(+) > > >> > > >>diff --git a/block/gluster.c b/block/gluster.c > > >>index 30a827e..b1cf71b 100644 > > >>--- a/block/gluster.c > > >>+++ b/block/gluster.c > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > >> goto out; > > >> } > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > >>+ /* Without this, if fsync fails for a recoverable reason (for instance, > > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means > > >>+ * almost certain data loss. Not all gluster versions support the > > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > >>+ * discover during runtime if it is supported (this api returns success for > > >>+ * unknown key/value pairs) */ > > >Honestly, this sucks. There is apparently no way to operate gluster so > > >we can safely recover after a failed fsync. "We hope everything is fine, > > >but depending on your gluster version, we may now corrupt your image" > > >isn't very good. > > > > > >We need to consider very carefully if this is good enough to go on after > > >an error. I'm currently leaning towards "no". That is, we should only > > >enable this after Gluster provides us a way to make sure that the option > > >is really set. > > > > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > >>+ "resync-failed-syncs-after-fsync", > > >>+ "on"); > > >>+ if (ret < 0) { > > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > >>+ ret = -errno; > > >>+ goto out; > > >>+ } > > >>+#endif > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. 
> > >In this case (as well as theoretically in the case that the option > > >didn't take effect - if only we could know about it), a failed > > >glfs_fsync_async() is fatal and we need to stop operating on the image, > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > > >The guest will see a broken disk that fails all I/O requests, but that's > > >better than corrupting data. > > > > > >Kevin > > >
On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote: > [ Adding some CCs ] > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > Upon receiving an I/O error after an fsync, by default gluster will > > dump its cache. However, QEMU will retry the fsync, which is especially > > useful when encountering errors such as ENOSPC when using the werror=stop > > option. When using caching with gluster, however, the last written data > > will be lost upon encountering ENOSPC. Using the cache xlator option of > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the > > cached data after a failed fsync, so that ENOSPC and other transient > > errors are recoverable. > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > --- > > block/gluster.c | 27 +++++++++++++++++++++++++++ > > configure | 8 ++++++++ > > 2 files changed, 35 insertions(+) > > > > diff --git a/block/gluster.c b/block/gluster.c > > index 30a827e..b1cf71b 100644 > > --- a/block/gluster.c > > +++ b/block/gluster.c > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > goto out; > > } > > > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > + /* Without this, if fsync fails for a recoverable reason (for instance, > > + * ENOSPC), gluster will dump its cache, preventing retries. This means > > + * almost certain data loss. Not all gluster versions support the > > + * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > + * discover during runtime if it is supported (this api returns success for > > + * unknown key/value pairs) */ > > Honestly, this sucks. There is apparently no way to operate gluster so > we can safely recover after a failed fsync. "We hope everything is fine, > but depending on your gluster version, we may now corrupt your image" > isn't very good. > > We need to consider very carefully if this is good enough to go on after > an error. I'm currently leaning towards "no". 
> That is, we should only
> enable this after Gluster provides us a way to make sure that the option
> is really set.
>
> > +    ret = glfs_set_xlator_option (s->glfs, "*-write-behind",
> > +                                  "resync-failed-syncs-after-fsync",
> > +                                  "on");
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
> > +        ret = -errno;
> > +        goto out;
> > +    }
> > +#endif
>
> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT.
> In this case (as well as theoretically in the case that the option
> didn't take effect - if only we could know about it), a failed
> glfs_fsync_async() is fatal and we need to stop operating on the image,
> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images.
> The guest will see a broken disk that fails all I/O requests, but that's
> better than corrupting data.
>

Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will also lack the gluster patch that removes the file descriptor invalidation upon error (unless that invalidation is a relatively new bug/feature). If that is the case, every operation on the file descriptor in those versions will return an error, so the failure is at least not silent. In any case, I believe it is only rather old versions that lack glfs_set_xlator_option().
On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote: > Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben: > > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben: > > > > > > We had a thread discussing this not on the upstream list. > > > > > > My summary of the thread is that I don't understand why gluster > > > should drop cached data after a failed fsync() for any open file. > > > > It certainly shouldn't, but it does by default. :-) > > > > Have a look at commit 3fcead2d in glusterfs.git, which at least > > introduces an option to get usable behaviour: > > > > { .key = {"resync-failed-syncs-after-fsync"}, > > .type = GF_OPTION_TYPE_BOOL, > > .default_value = "off", > > .description = "If sync of \"cached-writes issued before fsync\" " > > "(to backend) fails, this option configures whether " > > "to retry syncing them after fsync or forget them. " > > "If set to on, cached-writes are retried " > > "till a \"flush\" fop (or a successful sync) on sync " > > "failures. " > > "fsync itself is failed irrespective of the value of " > > "this option. ", > > }, > > > > As you can see, the default is still to drop cached data, and this is > > with the file still opened. qemu needs to make sure that this option is > > set, and if Jeff's comment in the code below is right, there is no way > > currently to make sure that the option isn't silently ignored. > > > > Can we get some function that sets an option and fails if the option is > > unknown? Or one that queries the state after setting an option, so we > > can check whether we succeeded in switching to the mode we need? > > > > > For closed files, I think it might still happen but this is the same > > > as any file system (and unlikely to be the case for qemu?). > > > > Our problem is only with open images. Dropping caches for files that > > qemu doesn't use any more is fine as far as I'm concerned. > > > > Note that our usage can involve cases where we reopen a file with > > different flags, i.e. 
> > first open a second file descriptor, then close
> > the first one. The image was never completely closed here and we would
> > still want the cache to preserve our data in such cases.
>
> Hm, actually, maybe we should just call bdrv_flush() before reopening an
> image, and if an error is returned, we abort the reopen. It's far from
> being a hot path, so the overhead of a flush shouldn't matter, and it
> seems we're taking an unnecessary risk without doing this.
>

[I seem to have been dropped from the CC]

Are you talking about doing a bdrv_flush() on the new descriptor (i.e.
reop_s->glfs)? Because otherwise, we already do this in
bdrv_reopen_prepare() on the original fd. It happens right before the call
to drv->bdrv_reopen_prepare():

2020     ret = bdrv_flush(reopen_state->bs);
2021     if (ret) {
2022         error_setg_errno(errp, -ret, "Error flushing drive");
2023         goto error;
2024     }
2025
2026     if (drv->bdrv_reopen_prepare) {
2027         ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err);

> >
> > > I will note that Linux in general had (still has I think?) the
> > > behavior that once the process closes a file (or exits), we lose
> > > context to return an error to. From that point on, any failed IO
> > > from the page cache to the target disk will be dropped from cache.
> > > To hold things in the cache would lead it to fill with old data that
> > > is not really recoverable and we have no good way to know that the
> > > situation is repairable and how long that might take. Upstream
> > > kernel people have debated this, the behavior might be tweaked for
> > > certain types of errors.
> >
> > That's fine, we just don't want the next fsync() to signal success when
> > in reality the cache has thrown away our data. As soon as we close the
> > image, there is no next fsync(), so you can do whatever you like.
> > > > Kevin > > > > > On 04/06/2016 07:02 AM, Kevin Wolf wrote: > > > >[ Adding some CCs ] > > > > > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > > >>Upon receiving an I/O error after an fsync, by default gluster will > > > >>dump its cache. However, QEMU will retry the fsync, which is especially > > > >>useful when encountering errors such as ENOSPC when using the werror=stop > > > >>option. When using caching with gluster, however, the last written data > > > >>will be lost upon encountering ENOSPC. Using the cache xlator option of > > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the > > > >>cached data after a failed fsync, so that ENOSPC and other transient > > > >>errors are recoverable. > > > >> > > > >>Signed-off-by: Jeff Cody <jcody@redhat.com> > > > >>--- > > > >> block/gluster.c | 27 +++++++++++++++++++++++++++ > > > >> configure | 8 ++++++++ > > > >> 2 files changed, 35 insertions(+) > > > >> > > > >>diff --git a/block/gluster.c b/block/gluster.c > > > >>index 30a827e..b1cf71b 100644 > > > >>--- a/block/gluster.c > > > >>+++ b/block/gluster.c > > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > > >> goto out; > > > >> } > > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > > >>+ /* Without this, if fsync fails for a recoverable reason (for instance, > > > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means > > > >>+ * almost certain data loss. Not all gluster versions support the > > > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > > >>+ * discover during runtime if it is supported (this api returns success for > > > >>+ * unknown key/value pairs) */ > > > >Honestly, this sucks. There is apparently no way to operate gluster so > > > >we can safely recover after a failed fsync. "We hope everything is fine, > > > >but depending on your gluster version, we may now corrupt your image" > > > >isn't very good. 
> > > > > > > >We need to consider very carefully if this is good enough to go on after > > > >an error. I'm currently leaning towards "no". That is, we should only > > > >enable this after Gluster provides us a way to make sure that the option > > > >is really set. > > > > > > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > > >>+ "resync-failed-syncs-after-fsync", > > > >>+ "on"); > > > >>+ if (ret < 0) { > > > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > > >>+ ret = -errno; > > > >>+ goto out; > > > >>+ } > > > >>+#endif > > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > > > >In this case (as well as theoretically in the case that the option > > > >didn't take effect - if only we could know about it), a failed > > > >glfs_fsync_async() is fatal and we need to stop operating on the image, > > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > > > >The guest will see a broken disk that fails all I/O requests, but that's > > > >better than corrupting data. > > > > > > > >Kevin > > > > > >
Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben: > On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote: > > Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben: > > > Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben: > > > > > > > > We had a thread discussing this not on the upstream list. > > > > > > > > My summary of the thread is that I don't understand why gluster > > > > should drop cached data after a failed fsync() for any open file. > > > > > > It certainly shouldn't, but it does by default. :-) > > > > > > Have a look at commit 3fcead2d in glusterfs.git, which at least > > > introduces an option to get usable behaviour: > > > > > > { .key = {"resync-failed-syncs-after-fsync"}, > > > .type = GF_OPTION_TYPE_BOOL, > > > .default_value = "off", > > > .description = "If sync of \"cached-writes issued before fsync\" " > > > "(to backend) fails, this option configures whether " > > > "to retry syncing them after fsync or forget them. " > > > "If set to on, cached-writes are retried " > > > "till a \"flush\" fop (or a successful sync) on sync " > > > "failures. " > > > "fsync itself is failed irrespective of the value of " > > > "this option. ", > > > }, > > > > > > As you can see, the default is still to drop cached data, and this is > > > with the file still opened. qemu needs to make sure that this option is > > > set, and if Jeff's comment in the code below is right, there is no way > > > currently to make sure that the option isn't silently ignored. > > > > > > Can we get some function that sets an option and fails if the option is > > > unknown? Or one that queries the state after setting an option, so we > > > can check whether we succeeded in switching to the mode we need? > > > > > > > For closed files, I think it might still happen but this is the same > > > > as any file system (and unlikely to be the case for qemu?). > > > > > > Our problem is only with open images. 
Dropping caches for files that > > > qemu doesn't use any more is fine as far as I'm concerned. > > > > > > Note that our usage can involve cases where we reopen a file with > > > different flags, i.e. first open a second file descriptor, then close > > > the first one. The image was never completely closed here and we would > > > still want the cache to preserve our data in such cases. > > > > Hm, actually, maybe we should just call bdrv_flush() before reopening an > > image, and if an error is returned, we abort the reopen. It's far from > > being a hot path, so the overhead of a flush shouldn't matter, and it > > seems we're taking an unnecessary risk without doing this. > > > > [I seemed to have been dropped from the cc] > > Are you talking about doing a bdrv_flush() on the new descriptor (i.e. > reop_s->glfs)? Because otherwise, we already do this in > bdrv_reopen_prepare() on the original fd. It happens right before the call > to drv->bdrv_reopen_prepare(): > > > 2020 ret = bdrv_flush(reopen_state->bs); > 2021 if (ret) { > 2022 error_setg_errno(errp, -ret, "Error flushing drive"); > 2023 goto error; > 2024 } > 2025 > 2026 if (drv->bdrv_reopen_prepare) { > 2027 ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err); Ah, thanks. Yes, this is what I meant. I expected it somewhere close to the bdrv_drain_all() call, so I missed the call you quoted. So that's good news, at least this part of the problem doesn't exist then. :-) Kevin > > > > > > I will note that Linux in general had (still has I think?) the > > > > behavior that once the process closes a file (or exits), we lose > > > > context to return an error to. From that point on, any failed IO > > > > from the page cache to the target disk will be dropped from cache. > > > > To hold things in the cache would lead it to fill with old data that > > > > is not really recoverable and we have no good way to know that the > > > > situation is repairable and how long that might take. 
Upstream > > > > kernel people have debated this, the behavior might be tweaked for > > > > certain types of errors. > > > > > > That's fine, we just don't want the next fsync() to signal success when > > > in reality the cache has thrown away our data. As soon as we close the > > > image, there is no next fsync(), so you can do whatever you like. > > > > > > Kevin > > > > > > > On 04/06/2016 07:02 AM, Kevin Wolf wrote: > > > > >[ Adding some CCs ] > > > > > > > > > >Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > > > >>Upon receiving an I/O error after an fsync, by default gluster will > > > > >>dump its cache. However, QEMU will retry the fsync, which is especially > > > > >>useful when encountering errors such as ENOSPC when using the werror=stop > > > > >>option. When using caching with gluster, however, the last written data > > > > >>will be lost upon encountering ENOSPC. Using the cache xlator option of > > > > >>'resync-failed-syncs-after-fsync' should cause gluster to retain the > > > > >>cached data after a failed fsync, so that ENOSPC and other transient > > > > >>errors are recoverable. > > > > >> > > > > >>Signed-off-by: Jeff Cody <jcody@redhat.com> > > > > >>--- > > > > >> block/gluster.c | 27 +++++++++++++++++++++++++++ > > > > >> configure | 8 ++++++++ > > > > >> 2 files changed, 35 insertions(+) > > > > >> > > > > >>diff --git a/block/gluster.c b/block/gluster.c > > > > >>index 30a827e..b1cf71b 100644 > > > > >>--- a/block/gluster.c > > > > >>+++ b/block/gluster.c > > > > >>@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > > > >> goto out; > > > > >> } > > > > >>+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > > > >>+ /* Without this, if fsync fails for a recoverable reason (for instance, > > > > >>+ * ENOSPC), gluster will dump its cache, preventing retries. This means > > > > >>+ * almost certain data loss. 
Not all gluster versions support the > > > > >>+ * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > > > >>+ * discover during runtime if it is supported (this api returns success for > > > > >>+ * unknown key/value pairs) */ > > > > >Honestly, this sucks. There is apparently no way to operate gluster so > > > > >we can safely recover after a failed fsync. "We hope everything is fine, > > > > >but depending on your gluster version, we may now corrupt your image" > > > > >isn't very good. > > > > > > > > > >We need to consider very carefully if this is good enough to go on after > > > > >an error. I'm currently leaning towards "no". That is, we should only > > > > >enable this after Gluster provides us a way to make sure that the option > > > > >is really set. > > > > > > > > > >>+ ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > > > >>+ "resync-failed-syncs-after-fsync", > > > > >>+ "on"); > > > > >>+ if (ret < 0) { > > > > >>+ error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > > > >>+ ret = -errno; > > > > >>+ goto out; > > > > >>+ } > > > > >>+#endif > > > > >We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > > > > >In this case (as well as theoretically in the case that the option > > > > >didn't take effect - if only we could know about it), a failed > > > > >glfs_fsync_async() is fatal and we need to stop operating on the image, > > > > >i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > > > > >The guest will see a broken disk that fails all I/O requests, but that's > > > > >better than corrupting data. > > > > > > > > > >Kevin > > > > > > > > >
On Wed, Apr 06, 2016 at 08:44:18AM -0400, Jeff Cody wrote: > On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote: > > [ Adding some CCs ] > > > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > > Upon receiving an I/O error after an fsync, by default gluster will > > > dump its cache. However, QEMU will retry the fsync, which is especially > > > useful when encountering errors such as ENOSPC when using the werror=stop > > > option. When using caching with gluster, however, the last written data > > > will be lost upon encountering ENOSPC. Using the cache xlator option of Use "write-behind xlator" instead of "cache xlator". There are different caches in Gluster. > > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the > > > cached data after a failed fsync, so that ENOSPC and other transient > > > errors are recoverable. > > > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > > --- > > > block/gluster.c | 27 +++++++++++++++++++++++++++ > > > configure | 8 ++++++++ > > > 2 files changed, 35 insertions(+) > > > > > > diff --git a/block/gluster.c b/block/gluster.c > > > index 30a827e..b1cf71b 100644 > > > --- a/block/gluster.c > > > +++ b/block/gluster.c > > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > > goto out; > > > } > > > > > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > > + /* Without this, if fsync fails for a recoverable reason (for instance, > > > + * ENOSPC), gluster will dump its cache, preventing retries. This means > > > + * almost certain data loss. Not all gluster versions support the > > > + * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > > + * discover during runtime if it is supported (this api returns success for > > > + * unknown key/value pairs) */ > > > > Honestly, this sucks. There is apparently no way to operate gluster so > > we can safely recover after a failed fsync. 
"We hope everything is fine, > > but depending on your gluster version, we may now corrupt your image" > > isn't very good. > > > > We need to consider very carefully if this is good enough to go on after > > an error. I'm currently leaning towards "no". That is, we should only > > enable this after Gluster provides us a way to make sure that the option > > is really set. Unfortunately it is also possible to disable the write-behind xlator as well. This would cause setting the option to fail too :-/ At the moment there is no real useful way to detect if write-behind is disabled (it is enabled by default). > > > + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > > + "resync-failed-syncs-after-fsync", > > > + "on"); > > > + if (ret < 0) { > > > + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > > + ret = -errno; > > > + goto out; > > > + } > > > +#endif > > > > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > > In this case (as well as theoretically in the case that the option > > didn't take effect - if only we could know about it), a failed > > glfs_fsync_async() is fatal and we need to stop operating on the image, > > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > > The guest will see a broken disk that fails all I/O requests, but that's > > better than corrupting data. > > > > Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will > also not have the gluster patch that removes the file descriptor > invalidation upon error (unless that was a relatively new > bug/feature). But if that is the case, every operation on the file > descriptor in those versions will return error. But it is also rather > old versions that don't support glfs_set_xlator_option() I believe. Indeed, glfs_set_xlator_option() was introduced with glusterfs-3.4.0. We are currently on glusterfs-3.7, with the oldest supported version of 3.5. 
In ~2 months we will hopefully have a 3.8 release, which will cause the end-of-life of 3.5. 3.4 has been EOL for about a year now; hopefully all our users have upgraded, but we know that some users will stay on unsupported versions for a long time... However, the "resync-failed-syncs-after-fsync" option was only introduced recently, with glusterfs-3.7.9. You could detect this with pkg-config glusterfs-api >= 7.3.7.9 if need be. More details about the problem the option addresses are in the commit message on http://review.gluster.org/13057 . HTH, Niels
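Niels's pkg-config suggestion amounts to a dotted-version comparison against 7.3.7.9. The sketch below shows that comparison in C; QEMU's real check would live in configure (e.g. via pkg-config --atleast-version), so the helper names here are illustrative only:

```c
#include <assert.h>
#include <stdlib.h>

/* Compare two dotted numeric version strings (e.g. "7.3.7.9") the way a
 * "glusterfs-api >= 7.3.7.9" check would. Returns <0, 0 or >0 like strcmp;
 * missing components compare as 0, so "7.3.7" < "7.3.7.9". */
static int version_cmp(const char *a, const char *b)
{
    while (*a || *b) {
        long na = strtol(a, (char **)&a, 10);
        long nb = strtol(b, (char **)&b, 10);

        if (na != nb) {
            return na < nb ? -1 : 1;
        }
        if (*a == '.') {
            a++;
        }
        if (*b == '.') {
            b++;
        }
    }
    return 0;
}

/* Hypothetical helper: true if the installed glusterfs-api version is new
 * enough (>= 7.3.7.9) to have 'resync-failed-syncs-after-fsync'. */
static int has_resync_option(const char *api_version)
{
    return version_cmp(api_version, "7.3.7.9") >= 0;
}
```

As Jeff notes below, this only tells you what the headers were built against; a distro backport can carry the option under an older version string, which is why a runtime probe would be preferable.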
On Wed, Apr 06, 2016 at 04:47:32PM +0200, Niels de Vos wrote: > On Wed, Apr 06, 2016 at 08:44:18AM -0400, Jeff Cody wrote: > > On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote: > > > [ Adding some CCs ] > > > > > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > > > Upon receiving an I/O error after an fsync, by default gluster will > > > > dump its cache. However, QEMU will retry the fsync, which is especially > > > > useful when encountering errors such as ENOSPC when using the werror=stop > > > > option. When using caching with gluster, however, the last written data > > > > will be lost upon encountering ENOSPC. Using the cache xlator option of > > Use "write-behind xlator" instead of "cache xlator". There are different > caches in Gluster. > Thanks, I'll update the commit message. > > > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the > > > > cached data after a failed fsync, so that ENOSPC and other transient > > > > errors are recoverable. > > > > > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > > > --- > > > > block/gluster.c | 27 +++++++++++++++++++++++++++ > > > > configure | 8 ++++++++ > > > > 2 files changed, 35 insertions(+) > > > > > > > > diff --git a/block/gluster.c b/block/gluster.c > > > > index 30a827e..b1cf71b 100644 > > > > --- a/block/gluster.c > > > > +++ b/block/gluster.c > > > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > > > goto out; > > > > } > > > > > > > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > > > + /* Without this, if fsync fails for a recoverable reason (for instance, > > > > + * ENOSPC), gluster will dump its cache, preventing retries. This means > > > > + * almost certain data loss. 
Not all gluster versions support the > > > > + * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > > > + * discover during runtime if it is supported (this api returns success for > > > > + * unknown key/value pairs) */ > > > > > > Honestly, this sucks. There is apparently no way to operate gluster so > > > we can safely recover after a failed fsync. "We hope everything is fine, > > > but depending on your gluster version, we may now corrupt your image" > > > isn't very good. > > > > > > We need to consider very carefully if this is good enough to go on after > > > an error. I'm currently leaning towards "no". That is, we should only > > > enable this after Gluster provides us a way to make sure that the option > > > is really set. > > Unfortunately it is also possible to disable the write-behind xlator as > well. This would cause setting the option to fail too :-/ At the moment > there is no real useful way to detect if write-behind is disabled (it is > enabled by default). > > > > > + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > > > + "resync-failed-syncs-after-fsync", > > > > + "on"); > > > > + if (ret < 0) { > > > > + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > > > + ret = -errno; > > > > + goto out; > > > > + } > > > > +#endif > > > > > > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > > > In this case (as well as theoretically in the case that the option > > > didn't take effect - if only we could know about it), a failed > > > glfs_fsync_async() is fatal and we need to stop operating on the image, > > > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > > > The guest will see a broken disk that fails all I/O requests, but that's > > > better than corrupting data. 
> > > > > > > Gluster versions that don't support CONFIG_GLUSTERFS_XLATOR_OPT will > > also not have the gluster patch that removes the file descriptor > > invalidation upon error (unless that was a relatively new > > bug/feature). But if that is the case, every operation on the file > > descriptor in those versions will return error. But it is also rather > > old versions that don't support glfs_set_xlator_option() I believe. > > Indeed, glfs_set_xlator_option() was introduced with glusterfs-3.4.0. We > are currently on glusterfs-3.7, with the oldest supported version of > 3.5. In ~2 months we hopefully have a 3.8 release and that will cause Is it possible for 3.8 to be able to validate key/value pairs in glfs_set_xlator_option()? Or is it something that by design is decoupled enough that it isn't feasible? This is the second instance I know of where a lack of error information from the gluster API has made it difficult to determine if a feature is supported (the other instance being SEEK_DATA/SEEK_HOLE). It'd be nice if error returns were more consistent in the 3.8 API, if possible. > the end-of-life of 3.5. 3.4 has been EOL for about a year now, hopefully > all our users have upgraded, but we know that some users will stay on > unsupported versions for a long time... > > However, the "resync-failed-syncs-after-fsync" option was only > introduced recently, with glusterfs-3.7.9. You could detect this with > pkg-config glusterfs-api >= 7.3.7.9 if need to be. > Unfortunately, compile-time detection of the specific key isn't a great approach, since we are using gluster as a shared library. Backports for various distributions may or may not have the "resync-failed-syncs-after-fsync" key regardless of versioning, and it isn't a new symbol that is detectable by the linker or at runtime. Additionally, QEMU on a system with gluster version 3.7.8, for instance, would have no way of knowing that "resync-failed-syncs-after-fsync" does nothing.
The reason for the compile-time check for glfs_set_xlator_option() is that it is a symbol that will be looked up when QEMU is linked (and executed). In contrast to the silently ignored key/value pair, if the user downgrades gluster on their system to a version that doesn't have glfs_set_xlator_option(), QEMU won't run at all, so the breakage will be visible. > More details about the problem the option addresses are in the commit > message on http://review.gluster.org/13057 . > Thanks, that is useful! Jeff
+Raghavendra G who implemented this option in write-behind, to this upstream patch review discussion Pranith On 04/06/2016 06:50 PM, Kevin Wolf wrote: > Am 06.04.2016 um 15:10 hat Jeff Cody geschrieben: >> On Wed, Apr 06, 2016 at 01:51:59PM +0200, Kevin Wolf wrote: >>> Am 06.04.2016 um 13:41 hat Kevin Wolf geschrieben: >>>> Am 06.04.2016 um 13:19 hat Ric Wheeler geschrieben: >>>>> We had a thread discussing this not on the upstream list. >>>>> >>>>> My summary of the thread is that I don't understand why gluster >>>>> should drop cached data after a failed fsync() for any open file. >>>> It certainly shouldn't, but it does by default. :-) >>>> >>>> Have a look at commit 3fcead2d in glusterfs.git, which at least >>>> introduces an option to get usable behaviour: >>>> >>>> { .key = {"resync-failed-syncs-after-fsync"}, >>>> .type = GF_OPTION_TYPE_BOOL, >>>> .default_value = "off", >>>> .description = "If sync of \"cached-writes issued before fsync\" " >>>> "(to backend) fails, this option configures whether " >>>> "to retry syncing them after fsync or forget them. " >>>> "If set to on, cached-writes are retried " >>>> "till a \"flush\" fop (or a successful sync) on sync " >>>> "failures. " >>>> "fsync itself is failed irrespective of the value of " >>>> "this option. ", >>>> }, >>>> >>>> As you can see, the default is still to drop cached data, and this is >>>> with the file still opened. qemu needs to make sure that this option is >>>> set, and if Jeff's comment in the code below is right, there is no way >>>> currently to make sure that the option isn't silently ignored. >>>> >>>> Can we get some function that sets an option and fails if the option is >>>> unknown? Or one that queries the state after setting an option, so we >>>> can check whether we succeeded in switching to the mode we need? >>>> >>>>> For closed files, I think it might still happen but this is the same >>>>> as any file system (and unlikely to be the case for qemu?). 
>>>> Our problem is only with open images. Dropping caches for files that >>>> qemu doesn't use any more is fine as far as I'm concerned. >>>> >>>> Note that our usage can involve cases where we reopen a file with >>>> different flags, i.e. first open a second file descriptor, then close >>>> the first one. The image was never completely closed here and we would >>>> still want the cache to preserve our data in such cases. >>> Hm, actually, maybe we should just call bdrv_flush() before reopening an >>> image, and if an error is returned, we abort the reopen. It's far from >>> being a hot path, so the overhead of a flush shouldn't matter, and it >>> seems we're taking an unnecessary risk without doing this. >>> >> [I seemed to have been dropped from the cc] >> >> Are you talking about doing a bdrv_flush() on the new descriptor (i.e. >> reop_s->glfs)? Because otherwise, we already do this in >> bdrv_reopen_prepare() on the original fd. It happens right before the call >> to drv->bdrv_reopen_prepare(): >> >> >> 2020 ret = bdrv_flush(reopen_state->bs); >> 2021 if (ret) { >> 2022 error_setg_errno(errp, -ret, "Error flushing drive"); >> 2023 goto error; >> 2024 } >> 2025 >> 2026 if (drv->bdrv_reopen_prepare) { >> 2027 ret = drv->bdrv_reopen_prepare(reopen_state, queue, &local_err); > Ah, thanks. Yes, this is what I meant. I expected it somewhere close to > the bdrv_drain_all() call, so I missed the call you quoted. So that's > good news, at least this part of the problem doesn't exist then. :-) > > Kevin > >>>>> I will note that Linux in general had (still has I think?) the >>>>> behavior that once the process closes a file (or exits), we lose >>>>> context to return an error to. From that point on, any failed IO >>>>> from the page cache to the target disk will be dropped from cache. 
>>>>> To hold things in the cache would lead it to fill with old data that >>>>> is not really recoverable and we have no good way to know that the >>>>> situation is repairable and how long that might take. Upstream >>>>> kernel people have debated this, the behavior might be tweaked for >>>>> certain types of errors. >>>> That's fine, we just don't want the next fsync() to signal success when >>>> in reality the cache has thrown away our data. As soon as we close the >>>> image, there is no next fsync(), so you can do whatever you like. >>>> >>>> Kevin >>>> >>>>> On 04/06/2016 07:02 AM, Kevin Wolf wrote: >>>>>> [ Adding some CCs ] >>>>>> >>>>>> Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: >>>>>>> Upon receiving an I/O error after an fsync, by default gluster will >>>>>>> dump its cache. However, QEMU will retry the fsync, which is especially >>>>>>> useful when encountering errors such as ENOSPC when using the werror=stop >>>>>>> option. When using caching with gluster, however, the last written data >>>>>>> will be lost upon encountering ENOSPC. Using the cache xlator option of >>>>>>> 'resync-failed-syncs-after-fsync' should cause gluster to retain the >>>>>>> cached data after a failed fsync, so that ENOSPC and other transient >>>>>>> errors are recoverable. >>>>>>> >>>>>>> Signed-off-by: Jeff Cody <jcody@redhat.com> >>>>>>> --- >>>>>>> block/gluster.c | 27 +++++++++++++++++++++++++++ >>>>>>> configure | 8 ++++++++ >>>>>>> 2 files changed, 35 insertions(+) >>>>>>> >>>>>>> diff --git a/block/gluster.c b/block/gluster.c >>>>>>> index 30a827e..b1cf71b 100644 >>>>>>> --- a/block/gluster.c >>>>>>> +++ b/block/gluster.c >>>>>>> @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, >>>>>>> goto out; >>>>>>> } >>>>>>> +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT >>>>>>> + /* Without this, if fsync fails for a recoverable reason (for instance, >>>>>>> + * ENOSPC), gluster will dump its cache, preventing retries. 
This means >>>>>>> + * almost certain data loss. Not all gluster versions support the >>>>>>> + * 'resync-failed-syncs-after-fsync' key value, but there is no way to >>>>>>> + * discover during runtime if it is supported (this api returns success for >>>>>>> + * unknown key/value pairs) */ >>>>>> Honestly, this sucks. There is apparently no way to operate gluster so >>>>>> we can safely recover after a failed fsync. "We hope everything is fine, >>>>>> but depending on your gluster version, we may now corrupt your image" >>>>>> isn't very good. >>>>>> >>>>>> We need to consider very carefully if this is good enough to go on after >>>>>> an error. I'm currently leaning towards "no". That is, we should only >>>>>> enable this after Gluster provides us a way to make sure that the option >>>>>> is really set. >>>>>> >>>>>>> + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", >>>>>>> + "resync-failed-syncs-after-fsync", >>>>>>> + "on"); >>>>>>> + if (ret < 0) { >>>>>>> + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); >>>>>>> + ret = -errno; >>>>>>> + goto out; >>>>>>> + } >>>>>>> +#endif >>>>>> We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. >>>>>> In this case (as well as theoretically in the case that the option >>>>>> didn't take effect - if only we could know about it), a failed >>>>>> glfs_fsync_async() is fatal and we need to stop operating on the image, >>>>>> i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. >>>>>> The guest will see a broken disk that fails all I/O requests, but that's >>>>>> better than corrupting data. >>>>>> >>>>>> Kevin
On Wed, Apr 06, 2016 at 01:02:16PM +0200, Kevin Wolf wrote: > [ Adding some CCs ] > > Am 06.04.2016 um 05:29 hat Jeff Cody geschrieben: > > Upon receiving an I/O error after an fsync, by default gluster will > > dump its cache. However, QEMU will retry the fsync, which is especially > > useful when encountering errors such as ENOSPC when using the werror=stop > > option. When using caching with gluster, however, the last written data > > will be lost upon encountering ENOSPC. Using the cache xlator option of > > 'resync-failed-syncs-after-fsync' should cause gluster to retain the > > cached data after a failed fsync, so that ENOSPC and other transient > > errors are recoverable. > > > > Signed-off-by: Jeff Cody <jcody@redhat.com> > > --- > > block/gluster.c | 27 +++++++++++++++++++++++++++ > > configure | 8 ++++++++ > > 2 files changed, 35 insertions(+) > > > > diff --git a/block/gluster.c b/block/gluster.c > > index 30a827e..b1cf71b 100644 > > --- a/block/gluster.c > > +++ b/block/gluster.c > > @@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options, > > goto out; > > } > > > > +#ifdef CONFIG_GLUSTERFS_XLATOR_OPT > > + /* Without this, if fsync fails for a recoverable reason (for instance, > > + * ENOSPC), gluster will dump its cache, preventing retries. This means > > + * almost certain data loss. Not all gluster versions support the > > + * 'resync-failed-syncs-after-fsync' key value, but there is no way to > > + * discover during runtime if it is supported (this api returns success for > > + * unknown key/value pairs) */ > > Honestly, this sucks. There is apparently no way to operate gluster so > we can safely recover after a failed fsync. "We hope everything is fine, > but depending on your gluster version, we may now corrupt your image" > isn't very good. > > We need to consider very carefully if this is good enough to go on after > an error. I'm currently leaning towards "no". 
That is, we should only > enable this after Gluster provides us a way to make sure that the option > is really set. > Thinking from the viewpoint of what is achievable for 2.6: * This patch is an improvement over the current state. It will at least provide protection on more recent versions of Gluster. The question is, are there then Gluster versions that will actually result in inconsistent data if this resync fsync option is not supported? There was another change in Gluster around v3.7.5 (can someone from Gluster confirm?) that removed Gluster's file descriptor invalidation upon any write I/O error. If that version is correct, that means Gluster versions 3.7.5 - 3.7.8 are problematic on fsync failure with write-behind caching. Given that, here are some alternatives that are feasible for 2.6: A) We could decide that the lack of discoverability of this option means we treat all Gluster versions as not having the option. On a fsync failure, we always switch over to read-only. There will be data loss (and perhaps on some Gluster version where it could have otherwise been avoided), but we will have consistency. B) Another option--sub-optimal from a performance perspective--is to just always open Gluster in O_DIRECT with the 'strict-o-direct' performance key value set to true. Beyond just 2.6: There is a larger question for QEMU, per Ric's point on other filesystem behavior on fsync failure (different thread), and given that the POSIX spec leaves cache validity after fsync i/o failure as not guaranteed [1]. Is QEMU overall currently doing the correct thing on a bdrv_flush() failure? That is, how certain are we that the other underlying protocol drivers (including raw-posix) can safely retry a fsync? Given the wording of the POSIX spec, should we move all images to read-only upon a bdrv_flush() failure (unless we know explicitly otherwise by the protocol driver)? And, if _POSIX_SYNCHRONIZED_IO is not set, are we even safe simply setting an image to r/o? 
[1] From another thread, courtesy of Eric Blake: http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html Jeff > > + ret = glfs_set_xlator_option (s->glfs, "*-write-behind", > > + "resync-failed-syncs-after-fsync", > > + "on"); > > + if (ret < 0) { > > + error_setg_errno(errp, errno, "Unable to set xlator key/value pair"); > > + ret = -errno; > > + goto out; > > + } > > +#endif > > We also need to consider the case without CONFIG_GLUSTERFS_XLATOR_OPT. > In this case (as well as theoretically in the case that the option > didn't take effect - if only we could know about it), a failed > glfs_fsync_async() is fatal and we need to stop operating on the image, > i.e. set bs->drv = NULL like when we detect corruption in qcow2 images. > The guest will see a broken disk that fails all I/O requests, but that's > better than corrupting data. > > Kevin
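Alternative (A), like Kevin's bs->drv = NULL suggestion, is essentially a one-way latch: after the first flush failure the image stops accepting I/O. Below is a minimal model of that latch; the struct and function names are hypothetical stand-ins, not the actual block/gluster.c code (which would mark the image in the glfs_fsync_async callback):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of alternative (A): once a flush has failed and we cannot
 * prove the write-behind cache kept our data, latch the image into a
 * broken state so every later request fails with EIO instead of risking
 * silent corruption. */
typedef struct GlusterImage {
    bool broken; /* latched to true by the first failed flush */
} GlusterImage;

static int image_flush(GlusterImage *img, int gluster_fsync_ret)
{
    if (img->broken) {
        return -EIO;
    }
    if (gluster_fsync_ret < 0) {
        /* The cache may already have dropped unsynced writes, so a
         * retry could silently lose data: fail the image permanently. */
        img->broken = true;
        return -EIO;
    }
    return 0;
}

static int image_write(GlusterImage *img)
{
    /* Real code would submit the write; here we only model the gate. */
    return img->broken ? -EIO : 0;
}
```

The guest then sees a disk that fails all requests, which, as argued above, is preferable to undetected data loss.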
> +Raghavendra G who implemented this option in write-behind, to this > upstream patch review discussion

Thanks Pranith. Kritika did inform us about the discussion. We were working on ways to find solutions to the problems raised (and it was a long festive weekend in Bangalore). Sorry for top-posting. I am trying to address two issues raised in this mail thread: 1. No way to identify whether setting an option succeeded in gfapi. 2. Why retaining the cache after an fsync failure is not the default behavior.

1. No way to identify whether setting an option succeeded in gfapi: I've added Poornima and Raghavendra Talur, who work on gfapi, to assist on this.

2. Why retaining the cache after an fsync failure is not the default behavior: It was mostly not to break two synchronized applications which rely on fsync failures to retry. Details of the discussion can be found below. The other reason was that we could not figure out POSIX's take on the state of earlier writes after an fsync failure (other filesystems, xfs and ext4, had non-uniform behavior). The question more specifically was "is it correct for a write issued before a failed fsync to succeed on the backend storage (persisting happened after fsync failure)?". I've also added Vijay Bellur, who was involved in the earlier discussion, to the cc list. Following is the discussion we had earlier with Kevin, Jeff Cody and others on the same topic. I am quoting it verbatim below.

<quote> > > > > > I would actually argue that this setting would be right for everyone, > > > > > not just qemu. Can you think of a case where keeping the data cached > > > > > after a failed fsync would actively hurt any application? I can only > > > > > think of cases where data is unnecessarily lost if data is dropped. > > > > > > > > > > > > I worry about use cases with concurrent writers to the same file and > > > > how different applications would handle fsync() failures with our new > > > > behavior. > > > > > > Any specific scenario you're worried about?
> > > > > > > Keeping the known old behavior as the default will ensure that we do > > > > not break anything once this is out. qemu/virt users with gluster are > > > > accustomed to setting the virt group and hence no additional knobs > > > > would need to be altered by them. > > > > > > Not changing anything is a safe way to avoid regressions. But it's also > > > a safe way to leave bugs unfixed. I would say that the current behavíour > > > is at least borderline buggy and very hard for applications to handle > > > correctly. > > > > I tend to agree with Kevin on this. If we've an error handling strategy > > that > > is posix-complaint, I don't think there is need to add one more option. > > Most > > of the times options tend to be used in default values, which is equivalent > > to not providing an option at all. However, before doing that its better we > > think through whether it can affect any existing deployments adversely > > (even > > when they are not posix-complaint). > > > > One pattern that I can think of - > > Applications that operate on the same file from different clients through > some locking or other co-ordination would have this pattern: > > lock (file), write (file), fsync (file), unlock (file); > > Now if the first fsync() fails, based on application logic the offset used > for the failed write + fsync could be re-utilized by a co-ordinating > application on another node to write out legitimate data. When control > returns back to the application that received a failure, the subsequent > write + fsync can cause data to be overwritten at the old offset along with > new data being written at the new offset. Yeah. I agree. Co-ordinated applications on different mounts will have issues, if they are working on the assumption that after fsync no older writes will hit the backend. Given that there seems to be a fair bit of confusion on status of retry of older writes after an fsync failure, we can expect some regressions. So, to summarize, 1. 
Keep the behaviour in patchset 11 of [1] as the default. Both fsync and flush act as barriers and will make sure that, after a failed fsync/flush, either
   a. older writes are synced to the backend, or
   b. older writes are failed and never retried.

2. Modify the behaviour of patchset 11 of [1] to keep failed syncs after a failed fsync and retry them till a flush. After a flush, no retries of failed syncs will be attempted. This behaviour will be introduced as an option.

3. Transient and non-transient errors will be treated similarly and failed syncs will be retried alike.

Does everyone agree on the above points? If yes, I'll modify [1] accordingly.

[1] http://review.gluster.org/#/c/12594/

</quote>
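Points 1 and 2 of the summary above can be modelled as a small simulation of the cache-retention rules (toy code, not gluster's write-behind implementation): with the option off, a failed sync forgets the cached writes; with it on, they are retained for retry until a flush.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the summarized write-behind semantics, so the two
 * behaviours can be compared side by side. */
typedef struct WbCache {
    int dirty;       /* number of cached, not-yet-synced writes */
    bool resync_opt; /* 'resync-failed-syncs-after-fsync' enabled? */
} WbCache;

/* fsync: on success all cached writes reach the backend; on failure the
 * writes are forgotten unless the resync option is on, in which case
 * they stay cached for retry. fsync itself reports failure either way. */
static int wb_fsync(WbCache *c, bool backend_ok)
{
    if (backend_ok) {
        c->dirty = 0;
        return 0;
    }
    if (!c->resync_opt) {
        c->dirty = 0; /* default: failed writes are never retried */
    }
    return -1;
}

/* flush acts as a barrier: after it, no retries of failed syncs. */
static void wb_flush(WbCache *c)
{
    c->dirty = 0;
}
```

This is exactly why QEMU's retry-on-ENOSPC strategy only works in the resync_opt case: in the default case the data to retry with is already gone.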
On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa <rgowdapp@redhat.com> wrote: > > > +Raghavendra G who implemented this option in write-behind, to this > > upstream patch review discussion > > Thanks Pranith. Kritika did inform us about the discussion. We were > working on ways to find out solutions to the problems raised (and it was a > long festive weekend in Bangalore). > > Sorry for top-posting. I am trying to address two issues raised in this > mail thread: > 1. No ways to identify whether setting of option succeeded in gfapi. > 2. Why retaining cache after fsync failure is not default behavior? > > 1. No ways to identify whether setting of option succeeded in gfapi: > > I've added Poornima and Raghavendra Talur who work on gfapi to assist on > this. >

There is currently no such feature in gfapi. We could think of two possible solutions:

a. Have the qemu-glusterfs driver require a version of glusterfs-api.rpm which is sure to have this write-behind option. In that case, if you use glfs_set_xlator_option before glfs_init with the right key and value, there is no way the volume set gets rejected. Note that this is an installation-time dependency; we don't have libgfapi library versioning. This solution is good enough if the logic in qemu is:

    ret = glfs_set_xlator_option(...);
    if (ret) {
        /* exit because we don't have the required environment */
    }
    /* proceed with work */

b. The second solution is to implement a case in the write-behind getxattr FOP to handle such a request and reply back with the current setting. Look at dht_get_real_filename for a similar feature.
You will have to implement logic similar to the following:

    ret = glfs_getxattr(fs, "/",
                        "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
                        &value, size);
    if ((ret == -1 && errno == ENODATA) || value == 0) {
        /* write-behind in the stack does not understand this option,
         * OR we have write-behind with the required option set to off:
         * work with the assumption that there are no retries */
    } else {
        /* we do have the required option */
    }

One negative aspect of both solutions above is the loosely coupled nature of the logic. If the behavior ever changes in the lower layer (which is gfapi or write-behind in this case), the upper layer (qemu) will have to a. have a dependency which defines both the minimum and maximum version of the required package, or b. interpret the return values with more if-else conditions to maintain backward compatibility. We are thinking of other ways too, but the above are the current solutions that come to mind. Thanks, Raghavendra Talur > 2. Why retaining cache after fsync failure is not default behavior? > > It was mostly to not to break two synchronized applications, which rely on > fsync failures to retry. Details of discussion can be found below. The > other reason was we could not figure out what POSIX's take on the state of > earlier write after fsync failure (Other filesystems xfs, ext4 had > non-uniform behavior). The question more specifically was "is it correct > for a write issued before a failed fsync to succeed on the backend storage > (persisting happened after fsync failure)?". I've also added Vijay Bellur > who was involved in the earlier discussion to cc list. > > Following is the discussion we had earlier with Kevin, Jeff Cody and > others on the same topic. I am quoting it verbatim below. > > <quote> > > > > > > > I would actually argue that this setting would be right for > everyone, > > > > > > not just qemu.
> > > > > > Can you think of a case where keeping the data cached after a
> > > > > > failed fsync would actively hurt any application? I can only
> > > > > > think of cases where data is unnecessarily lost if it is
> > > > > > dropped.
> > > > >
> > > > > I worry about use cases with concurrent writers to the same file
> > > > > and how different applications would handle fsync() failures with
> > > > > our new behavior.
> > > >
> > > > Any specific scenario you're worried about?
> > > >
> > > > > Keeping the known old behavior as the default will ensure that we
> > > > > do not break anything once this is out. qemu/virt users with
> > > > > gluster are accustomed to setting the virt group, and hence no
> > > > > additional knobs would need to be altered by them.
> > > >
> > > > Not changing anything is a safe way to avoid regressions. But it's
> > > > also a safe way to leave bugs unfixed. I would say that the current
> > > > behaviour is at least borderline buggy and very hard for
> > > > applications to handle correctly.
> > >
> > > I tend to agree with Kevin on this. If we have an error handling
> > > strategy that is POSIX-compliant, I don't think there is a need to
> > > add one more option. Most of the time, options tend to be used at
> > > their default values, which is equivalent to not providing an option
> > > at all. However, before doing that it's better we think through
> > > whether it can affect any existing deployments adversely (even when
> > > they are not POSIX-compliant).
> >
> > One pattern that I can think of -
> >
> > Applications that operate on the same file from different clients
> > through some locking or other coordination would have this pattern:
> >
> > lock (file), write (file), fsync (file), unlock (file);
> >
> > Now if the first fsync() fails, based on application logic the offset
> > used for the failed write + fsync could be re-utilized by a
> > coordinating application on another node to write out legitimate data.
> > When control returns to the application that received the failure, the
> > subsequent write + fsync can cause data to be overwritten at the old
> > offset along with new data being written at the new offset.
>
> Yeah, I agree. Coordinated applications on different mounts will have
> issues if they are working on the assumption that after an fsync no
> older writes will hit the backend. Given that there seems to be a fair
> bit of confusion about the status of retrying older writes after an
> fsync failure, we can expect some regressions. So, to summarize:
>
> 1. Keep the behaviour in patchset 11 of [1] as the default. Both fsync
> and flush act as barriers and will make sure that either
>    a. older writes are synced to the backend, or
>    b. old writes are failed and never retried,
> after a failed fsync/flush.
>
> 2. Modify the behaviour of patchset 11 of [1] to keep failed syncs
> after a failed fsync and retry them until a flush. After a flush, no
> retries of failed syncs will be attempted. This behaviour will be
> introduced as an option.
>
> 3. Transient and non-transient errors will be treated alike, and failed
> syncs will be retried in either case.
>
> Does everyone agree on the above points? If yes, I'll modify [1]
> accordingly.
>
> [1] http://review.gluster.org/#/c/12594/
>
> </quote>
On 09.05.2016 at 08:38, Raghavendra Talur wrote:
> On Mon, Apr 11, 2016 at 9:56 AM, Raghavendra Gowdappa
> <rgowdapp@redhat.com> wrote:
> >
> > +Raghavendra G, who implemented this option in write-behind, to this
> > upstream patch review discussion.
> >
> > Thanks Pranith. Kritika did inform us about the discussion. We were
> > working on ways to find solutions to the problems raised (and it was
> > a long festive weekend in Bangalore).
> >
> > Sorry for top-posting. I am trying to address two issues raised in
> > this mail thread:
> > 1. No way to identify whether setting an option succeeded in gfapi.
> > 2. Why is retaining the cache after an fsync failure not the default
> >    behavior?
> >
> > 1. No way to identify whether setting an option succeeded in gfapi:
> >
> > I've added Poornima and Raghavendra Talur, who work on gfapi, to
> > assist on this.
>
> There is currently no such feature in gfapi. We can think of two
> possible solutions:
>
> a. Have the qemu-glusterfs driver require a version of
> glusterfs-api.rpm that is certain to have this write-behind option. In
> that case, if you use glfs_set_xlator_option before glfs_init with the
> right key and value, there is no way the volume set gets rejected. Note
> that this is an installation-time dependency; we don't have libgfapi
> library versioning. This solution is good enough if the logic in qemu
> is:
>
>     ret = glfs_set_xlator_option(...);
>     if (ret) {
>         exit, because we don't have the required environment;
>     }
>     proceed with work;

This is not good enough. Basically, the requirement is: given a qemu
binary that is put on a machine with a random gluster version installed,
this qemu binary must never corrupt a gluster image. Either it errors
out or it works correctly.

Just assuming that the right gluster version is installed is much too
weak, and even version number checks aren't very good when you consider
things like backports.

> b.
> Second solution is to implement a case in the write-behind getxattr FOP
> to handle such a request and reply with the current setting. Look at
> dht_get_real_filename for a similar feature. You will have to implement
> logic similar to something like the below:
>
>     ret = glfs_getxattr(fs, "/",
>         "glusterfs.write-behind-resync-failed-syncs-after-fsync-status",
>         &value, size);
>
>     if ((ret == -1 && errno == ENODATA) || value == 0) {
>         // write-behind in the stack does not understand this option
>         // OR
>         // we have write-behind with the required option set to off
>         <work with the assumption that there are no retries>
>     } else {
>         // We do have the required option
>     }

Yes, please, this one. Or a new glfs_setxattr_fail() that returns an
error if the option didn't take effect. But I guess a glfs_getxattr()
function makes sense anyway; in qemu, we try to allow querying
everything that you can set.

> One negative aspect of both the solutions above is the loosely coupled
> nature of the logic. If the behavior ever changes in a lower layer
> (gfapi or write-behind in this case), the upper layer (qemu) will have
> to:
> a. have a dependency that defines both the minimum and maximum version
>    of the package required, and
> b. interpret the return values with more if-else conditions to maintain
>    backward compatibility.

It's gluster's job to keep compatibility with existing users of its
APIs, just like qemu has to make sure that old QMP clients keep working
when we make changes or extensions to the protocol.

In other words: if the behaviour of a lower layer changes while the
upper layer only uses old APIs, that's clearly a bug in the lower layer.

Kevin
diff --git a/block/gluster.c b/block/gluster.c
index 30a827e..b1cf71b 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -330,6 +330,23 @@ static int qemu_gluster_open(BlockDriverState *bs, QDict *options,
         goto out;
     }
 
+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
+    /* Without this, if fsync fails for a recoverable reason (for instance,
+     * ENOSPC), gluster will dump its cache, preventing retries. This means
+     * almost certain data loss. Not all gluster versions support the
+     * 'resync-failed-syncs-after-fsync' key value, but there is no way to
+     * discover during runtime if it is supported (this api returns success for
+     * unknown key/value pairs) */
+    ret = glfs_set_xlator_option(s->glfs, "*-write-behind",
+                                 "resync-failed-syncs-after-fsync",
+                                 "on");
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
+        ret = -errno;
+        goto out;
+    }
+#endif
+
     qemu_gluster_parse_flags(bdrv_flags, &open_flags);
 
     s->fd = glfs_open(s->glfs, gconf->image, open_flags);
@@ -386,6 +403,16 @@ static int qemu_gluster_reopen_prepare(BDRVReopenState *state,
         goto exit;
     }
 
+#ifdef CONFIG_GLUSTERFS_XLATOR_OPT
+    ret = glfs_set_xlator_option(reop_s->glfs, "*-write-behind",
+                                 "resync-failed-syncs-after-fsync", "on");
+    if (ret < 0) {
+        error_setg_errno(errp, errno, "Unable to set xlator key/value pair");
+        ret = -errno;
+        goto exit;
+    }
+#endif
+
     reop_s->fd = glfs_open(reop_s->glfs, gconf->image, open_flags);
     if (reop_s->fd == NULL) {
         /* reops->glfs will be cleaned up in _abort */
diff --git a/configure b/configure
index 5db29f0..3e921fe 100755
--- a/configure
+++ b/configure
@@ -298,6 +298,7 @@ coroutine=""
 coroutine_pool=""
 seccomp=""
 glusterfs=""
+glusterfs_xlator_opt="no"
 glusterfs_discard="no"
 glusterfs_zerofill="no"
 archipelago="no"
@@ -3397,6 +3398,9 @@ if test "$glusterfs" != "no" ; then
     glusterfs="yes"
     glusterfs_cflags=`$pkg_config --cflags glusterfs-api`
     glusterfs_libs=`$pkg_config --libs glusterfs-api`
+    if $pkg_config --atleast-version=4 glusterfs-api; then
+        glusterfs_xlator_opt="yes"
+    fi
     if $pkg_config --atleast-version=5 glusterfs-api; then
         glusterfs_discard="yes"
     fi
@@ -5339,6 +5343,10 @@ if test "$glusterfs" = "yes" ; then
   echo "GLUSTERFS_LIBS=$glusterfs_libs" >> $config_host_mak
 fi
 
+if test "$glusterfs_xlator_opt" = "yes" ; then
+  echo "CONFIG_GLUSTERFS_XLATOR_OPT=y" >> $config_host_mak
+fi
+
 if test "$glusterfs_discard" = "yes" ; then
   echo "CONFIG_GLUSTERFS_DISCARD=y" >> $config_host_mak
 fi
Upon receiving an I/O error after an fsync, by default gluster will dump
its cache. However, QEMU will retry the fsync, which is especially
useful when encountering errors such as ENOSPC when using the
werror=stop option. When using caching with gluster, however, the last
written data will be lost upon encountering ENOSPC. Using the cache
xlator option 'resync-failed-syncs-after-fsync' should cause gluster to
retain the cached data after a failed fsync, so that ENOSPC and other
transient errors are recoverable.

Signed-off-by: Jeff Cody <jcody@redhat.com>
---
 block/gluster.c | 27 +++++++++++++++++++++++++++
 configure       |  8 ++++++++
 2 files changed, 35 insertions(+)