Message ID | 20170425085831.14999-1-anand.jain@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi David, This patch adds comments explaining the original design, and also adds a cleanup so that a volume level error-handles for the flush failures can be implemented. I wonder if you have any comments.? (Also just a note, a bio prealloc (for flush) will make below ENOMEM to go away, however the prealloc itself has to optimized based on the barrier option. So in the long run I don't think below ENOMEM would remain.) Thanks, Anand On 04/25/2017 04:58 PM, Anand Jain wrote: > This adds comments to the flush error handling part of > the code, and hopes to maintain the same logic with a > framework which can be used to handle the errors at the > volume level. > > Signed-off-by: Anand Jain <anand.jain@oracle.com> > --- > fs/btrfs/disk-io.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++---- > fs/btrfs/volumes.h | 1 + > 2 files changed, 55 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index eb1ee7b6f532..dafcb6bb2d5d 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3527,6 +3527,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait) > if (wait) { > bio = device->flush_bio; > if (!bio) > + /* > + * This means the alloc has failed with ENOMEM, however > + * here we return 0, as its not a device error. > + */ > return 0; > > wait_for_completion(&device->flush_wait); > @@ -3566,6 +3570,33 @@ static int write_dev_flush(struct btrfs_device *device, int wait) > return 0; > } > > +static int check_barrier_error(struct btrfs_fs_devices *fsdevs) > +{ > + int submit_flush_error = 0; > + int dev_flush_error = 0; > + struct btrfs_device *dev; > + > + list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) { > + if (!dev->bdev) { > + submit_flush_error++; > + dev_flush_error++; > + continue; > + } > + if (dev->last_flush_error == ENOMEM) > + submit_flush_error++; > + if (dev->last_flush_error && dev->last_flush_error != ENOMEM) > + dev_flush_error++; > + } > + > + if (submit_flush_error > > + fsdevs->fs_info->num_tolerated_disk_barrier_failures || > + dev_flush_error > > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) > + return -EIO; > + > + return 0; > +} > + > /* > * send an empty flush down to each device in parallel, > * then wait for them > @@ -3593,6 +3624,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) > ret = write_dev_flush(dev, 0); > if (ret) > errors_send++; > + dev->last_flush_error = ret; > } > > /* wait for all the barriers */ > @@ -3607,12 +3639,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info) > continue; > > ret = write_dev_flush(dev, 1); > - if (ret) > + if (ret) { > + dev->last_flush_error = ret; > errors_wait++; > + } > + } > + > + /* > + * Try hard in case of flush. Lets say, in RAID1 we have > + * the following situation > + * dev1: EIO dev2: ENOMEM > + * this is not a fatal error as we hope to recover from > + * ENOMEM in the next attempt to flush. > + * But the following is considered as fatal > + * dev1: ENOMEM dev2: ENOMEM > + * dev1: bdev == NULL dev2: ENOMEM > + */ > + if (errors_send || errors_wait) { > + /* > + * At some point we need the status of all disks > + * to arrive at the volume status. So error checking > + * is being pushed to a separate loop. > + */ > + return check_barrier_error(info->fs_devices); > } > - if (errors_send > info->num_tolerated_disk_barrier_failures || > - errors_wait > info->num_tolerated_disk_barrier_failures) > - return -EIO; > return 0; > } > > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 59be81206dd7..9c09dcd96e5d 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -74,6 +74,7 @@ struct btrfs_device { > int missing; > int can_discard; > int is_tgtdev_for_dev_replace; > + int last_flush_error; > > #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED > seqcount_t data_seqcount; > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Apr 25, 2017 at 04:58:31PM +0800, Anand Jain wrote: > This adds comments to the flush error handling part of > the code, and hopes to maintain the same logic with a > framework which can be used to handle the errors at the > volume level. > > Signed-off-by: Anand Jain <anand.jain@oracle.com> > --- > fs/btrfs/disk-io.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++---- > fs/btrfs/volumes.h | 1 + > 2 files changed, 55 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index eb1ee7b6f532..dafcb6bb2d5d 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -3527,6 +3527,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait) > if (wait) { > bio = device->flush_bio; > if (!bio) > + /* > + * This means the alloc has failed with ENOMEM, however > + * here we return 0, as its not a device error. > + */ > return 0; > > wait_for_completion(&device->flush_wait); > @@ -3566,6 +3570,33 @@ static int write_dev_flush(struct btrfs_device *device, int wait) > return 0; > } > > +static int check_barrier_error(struct btrfs_fs_devices *fsdevs) > +{ > + int submit_flush_error = 0; > + int dev_flush_error = 0; > + struct btrfs_device *dev; > + > + list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) { > + if (!dev->bdev) { > + submit_flush_error++; > + dev_flush_error++; > + continue; > + } > + if (dev->last_flush_error == ENOMEM) That's -ENOMEM > + submit_flush_error++; > + if (dev->last_flush_error && dev->last_flush_error != ENOMEM) also here. > + dev_flush_error++; > + } > + > + if (submit_flush_error > > + fsdevs->fs_info->num_tolerated_disk_barrier_failures || > + dev_flush_error > > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) > + return -EIO; Can you please reformat this so it's clear what's the condition and what's the statement? > + > + return 0; > +} > + > /* > * send an empty flush down to each device in parallel, > * then wait for them > @@ -3593,6 +3624,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) > ret = write_dev_flush(dev, 0); > if (ret) > errors_send++; > + dev->last_flush_error = ret; Here the error is set unconditionally, so it always tracks the return code, not only the error ... > } > > /* wait for all the barriers */ > @@ -3607,12 +3639,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info) > continue; > > ret = write_dev_flush(dev, 1); > - if (ret) > + if (ret) { > + dev->last_flush_error = ret; ... while this tracks only the errors. Unless I'm missing something, both should be set in a consistent way. > errors_wait++; > + } > + } > + > + /* > + * Try hard in case of flush. Lets say, in RAID1 we have > + * the following situation > + * dev1: EIO dev2: ENOMEM > + * this is not a fatal error as we hope to recover from > + * ENOMEM in the next attempt to flush. This could still be problematic under some very rare conditions, but I don't deem it important at the moment as the memory allocation will be gone. Then the comment reflects the current state, which is fine. > + * But the following is considered as fatal > + * dev1: ENOMEM dev2: ENOMEM > + * dev1: bdev == NULL dev2: ENOMEM > + */ > + if (errors_send || errors_wait) { > + /* > + * At some point we need the status of all disks > + * to arrive at the volume status. So error checking > + * is being pushed to a separate loop. > + */ > + return check_barrier_error(info->fs_devices); > } > - if (errors_send > info->num_tolerated_disk_barrier_failures || > - errors_wait > info->num_tolerated_disk_barrier_failures) > - return -EIO; > return 0; > } > > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 59be81206dd7..9c09dcd96e5d 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -74,6 +74,7 @@ struct btrfs_device { > int missing; > int can_discard; > int is_tgtdev_for_dev_replace; > + int last_flush_error; > > #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED > seqcount_t data_seqcount; > -- > 2.10.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>> @@ -3566,6 +3570,33 @@ static int write_dev_flush(struct btrfs_device *device, int wait) >> return 0; >> } >> >> +static int check_barrier_error(struct btrfs_fs_devices *fsdevs) >> +{ >> + int submit_flush_error = 0; >> + int dev_flush_error = 0; >> + struct btrfs_device *dev; >> + >> + list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) { >> + if (!dev->bdev) { >> + submit_flush_error++; >> + dev_flush_error++; >> + continue; >> + } >> + if (dev->last_flush_error == ENOMEM) > > That's -ENOMEM My bad, will fix it. >> + submit_flush_error++; >> + if (dev->last_flush_error && dev->last_flush_error != ENOMEM) > > also here. yes. >> + dev_flush_error++; >> + } >> + >> + if (submit_flush_error > >> + fsdevs->fs_info->num_tolerated_disk_barrier_failures || >> + dev_flush_error > >> + fsdevs->fs_info->num_tolerated_disk_barrier_failures) >> + return -EIO; > > Can you please reformat this so it's clear what's the condition and > what's the statement? included in v2. >> + >> + return 0; >> +} >> + >> /* >> * send an empty flush down to each device in parallel, >> * then wait for them >> @@ -3593,6 +3624,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) >> ret = write_dev_flush(dev, 0); >> if (ret) >> errors_send++; >> + dev->last_flush_error = ret; > > Here the error is set unconditionally, so it always tracks the return > code, not only the error ... Right. So it captures send part of the error, that is when we call write_dev_flush(... 0); which returns only -ENOMEM or 0; >> } >> >> /* wait for all the barriers */ >> @@ -3607,12 +3639,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info) >> continue; >> >> ret = write_dev_flush(dev, 1); >> - if (ret) >> + if (ret) { >> + dev->last_flush_error = ret; > > ... while this tracks only the errors. Unless I'm missing something, > both should be set in a consistent way. Actually this is correct, that way I preserve the -ENOMEM; which occurred in the send part of the code. And during wait.. we always return 0; for -ENOMEM that occurred during send. ------------------------------ @@ -3625,6 +3625,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait) if (wait) { bio = device->flush_bio; if (!bio) + /* + * This means the alloc has failed with ENOMEM, however + * here we return 0, as its not a device error. + */ return 0; ------------------------------ >> errors_wait++; >> + } >> + } >> + >> + /* >> + * Try hard in case of flush. Lets say, in RAID1 we have >> + * the following situation >> + * dev1: EIO dev2: ENOMEM >> + * this is not a fatal error as we hope to recover from >> + * ENOMEM in the next attempt to flush. > > This could still be problematic under some very rare conditions, but I > don't deem it important at the moment as the memory allocation will be > gone. Then the comment reflects the current state, which is fine. Thanks for mentioning that; though I suspected that way as well; I had to put extra effort not to change original logic unless there is an identified issue with it. So I instead, I just documented the logic in the comments. I am sending V2 with the above changes. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index eb1ee7b6f532..dafcb6bb2d5d 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3527,6 +3527,10 @@ static int write_dev_flush(struct btrfs_device *device, int wait) if (wait) { bio = device->flush_bio; if (!bio) + /* + * This means the alloc has failed with ENOMEM, however + * here we return 0, as its not a device error. + */ return 0; wait_for_completion(&device->flush_wait); @@ -3566,6 +3570,33 @@ static int write_dev_flush(struct btrfs_device *device, int wait) return 0; } +static int check_barrier_error(struct btrfs_fs_devices *fsdevs) +{ + int submit_flush_error = 0; + int dev_flush_error = 0; + struct btrfs_device *dev; + + list_for_each_entry_rcu(dev, &fsdevs->devices, dev_list) { + if (!dev->bdev) { + submit_flush_error++; + dev_flush_error++; + continue; + } + if (dev->last_flush_error == ENOMEM) + submit_flush_error++; + if (dev->last_flush_error && dev->last_flush_error != ENOMEM) + dev_flush_error++; + } + + if (submit_flush_error > + fsdevs->fs_info->num_tolerated_disk_barrier_failures || + dev_flush_error > + fsdevs->fs_info->num_tolerated_disk_barrier_failures) + return -EIO; + + return 0; +} + /* * send an empty flush down to each device in parallel, * then wait for them @@ -3593,6 +3624,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) ret = write_dev_flush(dev, 0); if (ret) errors_send++; + dev->last_flush_error = ret; } /* wait for all the barriers */ @@ -3607,12 +3639,30 @@ static int barrier_all_devices(struct btrfs_fs_info *info) continue; ret = write_dev_flush(dev, 1); - if (ret) + if (ret) { + dev->last_flush_error = ret; errors_wait++; + } + } + + /* + * Try hard in case of flush. Lets say, in RAID1 we have + * the following situation + * dev1: EIO dev2: ENOMEM + * this is not a fatal error as we hope to recover from + * ENOMEM in the next attempt to flush. + * But the following is considered as fatal + * dev1: ENOMEM dev2: ENOMEM + * dev1: bdev == NULL dev2: ENOMEM + */ + if (errors_send || errors_wait) { + /* + * At some point we need the status of all disks + * to arrive at the volume status. So error checking + * is being pushed to a separate loop. + */ + return check_barrier_error(info->fs_devices); } - if (errors_send > info->num_tolerated_disk_barrier_failures || - errors_wait > info->num_tolerated_disk_barrier_failures) - return -EIO; return 0; } diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 59be81206dd7..9c09dcd96e5d 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -74,6 +74,7 @@ struct btrfs_device { int missing; int can_discard; int is_tgtdev_for_dev_replace; + int last_flush_error; #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED seqcount_t data_seqcount;
This adds comments to the flush error handling part of the code, and hopes to maintain the same logic with a framework which can be used to handle the errors at the volume level. Signed-off-by: Anand Jain <anand.jain@oracle.com> --- fs/btrfs/disk-io.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++---- fs/btrfs/volumes.h | 1 + 2 files changed, 55 insertions(+), 4 deletions(-)