diff mbox

bcache: recover data from backing device when read request hit clean

Message ID CAPGwLLNwG+kLT9_QaCL5PBLS0a6K-x-iog+ON2pArUQWr1rWTg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Rui Hua Nov. 17, 2017, 3:51 a.m. UTC
When we send a read request and hit the clean data in cache device, there
is a situation called cache read race in bcache(see the commit in the tail
of cache_look_up(), the following explaination just copy from there):
The bucket we're reading from might be reused while our bio is in flight,
and we could then end up reading the wrong data. We guard against this
by checking (in bch_cache_read_endio()) if the pointer is stale again;
if so, we treat it as an error (s->iop.error = -EINTR) and reread from
the backing device (but we don't pass that error up anywhere)

It should be noted that cache read race happened under normal
circumstances, not the circumstance when SSD failed, it was counted
and shown in  /sys/fs/bcache/XXX/internal/cache_read_races.

Without this patch, when we use writeback mode, we will never reread from
the backing device when cache read race happend, until the whole cache
device is clean, because the condition
(s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
will be passed up, at last, user will receive -EINTR when it's bio end,
this is not suitable, and wield to up-application.

In this patch, we use s->read_dirty_data to judge whether the read
request hit dirty data in cache device, it is safe to reread data from
the backing device when the read request hit clean data. This can not
only handle cache read race, but also recover data when failed read
request from cache device.

Signed-off-by: Hua Rui <huarui.dev@gmail.com>
---
 drivers/md/bcache/request.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Michael Lyle Nov. 17, 2017, 4:02 a.m. UTC | #1
Hi, Rui Hua--

On 11/16/2017 07:51 PM, Rui Hua wrote:
> In this patch, we use s->read_dirty_data to judge whether the read
> request hit dirty data in cache device, it is safe to reread data from
> the backing device when the read request hit clean data. This can not
> only handle cache read race, but also recover data when failed read
> request from cache device.

I haven't read over all of this yet, to understand the read race.  But
this change can't be safe-- read_dirty_data only is set to true if the
metadata on the cache device is correctly read.  If that fails, the flag
may not be set and we could read stale data on the backing device.

Mike
Rui Hua Nov. 17, 2017, 4:25 a.m. UTC | #2
Hi, Michael--

Thanks for the quickly reply.

If the metadata on the cache device was NOT correctly read,
cached_dev_read_error() will not be called, so the recovery check will
not be executed. so this patch is safe.
s->iop.error was not set when faild to read metadata on the cache
device, so the  cached_dev_bio_complete() will be called instead of
cached_dev_read_error()

2017-11-17 12:02 GMT+08:00 Michael Lyle <mlyle@lyle.org>:
> Hi, Rui Hua--
>
> On 11/16/2017 07:51 PM, Rui Hua wrote:
>> In this patch, we use s->read_dirty_data to judge whether the read
>> request hit dirty data in cache device, it is safe to reread data from
>> the backing device when the read request hit clean data. This can not
>> only handle cache read race, but also recover data when failed read
>> request from cache device.
>
> I haven't read over all of this yet, to understand the read race.  But
> this change can't be safe-- read_dirty_data only is set to true if the
> metadata on the cache device is correctly read.  If that fails, the flag
> may not be set and we could read stale data on the backing device.
>
> Mike
Coly Li Nov. 17, 2017, 6:01 a.m. UTC | #3
On 17/11/2017 11:51 AM, Rui Hua wrote:
> When we send a read request and hit the clean data in cache device, there
> is a situation called cache read race in bcache(see the commit in the tail
> of cache_look_up(), the following explaination just copy from there):
> The bucket we're reading from might be reused while our bio is in flight,
> and we could then end up reading the wrong data. We guard against this
> by checking (in bch_cache_read_endio()) if the pointer is stale again;
> if so, we treat it as an error (s->iop.error = -EINTR) and reread from
> the backing device (but we don't pass that error up anywhere)
> 

Hi Rui Hua,

I see the point, and agree with you this should be fixed.
Indeed return -EINTR to upper layer makes sense. A read request may
cover several keys, and when a key encounters read race, bcache may
return result of previous keys before read race encountered, and tells
upper layer code this is not full completed by returnning -EINTR. This
is a defined behavior by POSIX.

The reason why I still agree to fix it is, most of upper layer code just
check if bio->bi_status is 0 or not, they don't check whether
bio->bi_status is -EAGAIN or -EINTR.

> It should be noted that cache read race happened under normal
> circumstances, not the circumstance when SSD failed, it was counted
> and shown in  /sys/fs/bcache/XXX/internal/cache_read_races.
> 

Yes, this is not for real I/O error on cache device, the I/O error code
is set by bcache code. I didn't notice this case, and my test didn't
cover such situation. Thank for point out this !

> Without this patch, when we use writeback mode, we will never reread from
> the backing device when cache read race happend, until the whole cache
> device is clean, because the condition

I assume this is a race condition, that means after the race window, the
KEY associated to the reused bucket will be invalided too. So a second
read will trigger a cache miss, and re-read the data back into cache
device. So it won't be "never". The problem is, if upper layer code
treats it as a failed I/O, and it happens to be metadata of upper layer
code, then some negative result may happens. For example file system
turns into read-only mode.

> (s->recoverable && (dc && !atomic_read(&dc->has_dirty))) is false in
> cached_dev_read_error(). In this situation, the s->iop.error(= -EINTR)
> will be passed up, at last, user will receive -EINTR when it's bio end,
> this is not suitable, and wield to up-application.

As I said, most of upper layer code only checks whether bio->bi_status
is 0, otherwise that's error. We need to give such error a second try.

> 
> In this patch, we use s->read_dirty_data to judge whether the read
> request hit dirty data in cache device, it is safe to reread data from
> the backing device when the read request hit clean data. This can not
> only handle cache read race, but also recover data when failed read
> request from cache device.

In your another reply to Michael, you mentioned if bcache internal node
I/O failed, cached_dev_read_error() won't be called. I check the code
with your hint, it seems make sense. Also I am not sure whether the
internal node I/O failure is handled properly or not, I feel for leaf
node, your modification works for both writeback mode, and a
writethrough mode changed from a writeback mode with dirty data.

It seems my previous patch neglects the internal btree node I/O failure,
because I mistakenly thought both internal and leaf node I/O failure
will go into cached_dev_read_error(). Thanks for correct me and provide
a comprehensive fix.

A positive side effect is, Eric, Kent and Michael they all want to
permit data recovery from backing device is the failed data is clean on
cache device. Your patch permits more read request to be served, and
fixs a read race at same time, cool :-)

For users want aggressive I/O error notifying of cache device (SSD), I
think it can be solved some other error accounting methods. We don't
discuss in this thread.

> 
> Signed-off-by: Hua Rui <huarui.dev@gmail.com>

Reviewed-by: Coly Li <colyli@suse.de>

P.S could you please also take a look on btree internal node I/O failure
? Thanks in advance.

Coly Li

> ---
>  drivers/md/bcache/request.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
> index 3a7aed7..f02fefb 100644
> --- a/drivers/md/bcache/request.c
> +++ b/drivers/md/bcache/request.c
> @@ -708,16 +708,15 @@ static void cached_dev_read_error(struct closure *cl)
>  {
>         struct search *s = container_of(cl, struct search, cl);
>         struct bio *bio = &s->bio.bio;
> -       struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);
> 
>         /*
> -        * If cache device is dirty (dc->has_dirty is non-zero), then
> -        * recovery a failed read request from cached device may get a
> -        * stale data back. So read failure recovery is only permitted
> -        * when cache device is clean.
> +        * If read request hit dirty data (s->read_dirty_data is true),
> +        * then recovery a failed read request from cached device may
> +        * get a stale data back. So read failure recovery is only
> +        * permitted when read request hit clean data in cache device,
> +        * or when cache read race happend.
>          */
> -       if (s->recoverable &&
> -           (dc && !atomic_read(&dc->has_dirty))) {
> +       if (s->recoverable && !s->read_dirty_data) {
>                 /* Retry from the backing device: */
>                 trace_bcache_read_retry(s->orig_bio);
>
Rui Hua Nov. 17, 2017, 7:26 a.m. UTC | #4
Hi, Coly--
Thanks for your detailed review notes.

2017-11-17 14:01 GMT+08:00 Coly Li <colyli@suse.de>:

>> Without this patch, when we use writeback mode, we will never reread from
>> the backing device when cache read race happend, until the whole cache
>> device is clean, because the condition
>
> I assume this is a race condition, that means after the race window, the
> KEY associated to the reused bucket will be invalided too. So a second
> read will trigger a cache miss, and re-read the data back into cache
> device. So it won't be "never". The problem is, if upper layer code
> treats it as a failed I/O, and it happens to be metadata of upper layer
> code, then some negative result may happens. For example file system
> turns into read-only mode.

Yes, your example about metadata of upper layer code provides a good reference ,
Indeed, I met the xfs metadata error a few days ago when I use xfs on bcache,
under heavy load, I got the following xfs error:
localhost kernel: XFS (bcache0): metadata I/O error: block 0x3e3e2af0
("xfs_trans_read_buf_map") error 4 numblks 16
And I can confirm it was caused by the read race.

The word "never" I used in above is not suitable:-/, thanks for your point,
 what I want to say is we will not recover clean data UNTIL the whole cache
device become clean, or in writethrough mode.

>
> P.S could you please also take a look on btree internal node I/O failure
> ? Thanks in advance.
>
At least, for now, I find that when cache_lookup() traverse btree,
if bch_btree_node_get() return ERR_PTR(-EIO), it will not pass to upper layer,
cached_dev_bio_complete() will be called, just like there is no any IO error,
I think this is another bug. I'll deep into code and find more detail
when I have time.

Thanks,
Rui Hua Nov. 17, 2017, 10:20 a.m. UTC | #5
Hi, Stefan

2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
> I‘m getting the same xfs error message under high load. Does this patch fix
> it?
>
Did you applied the patch "bcache: only permit to recovery read error
when cache device is clean" ?
If you did, maybe this patch can fix it. And you'd better check
/sys/fs/bcache/XXX/internal/cache_read_races in your environment,
meanwhile, it should not be zero when you get that err message.
Stefan Priebe - Profihost AG Nov. 17, 2017, 12:38 p.m. UTC | #6
Am 17.11.2017 um 11:20 schrieb Rui Hua:
> Hi, Stefan
> 
> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
>> I‘m getting the same xfs error message under high load. Does this patch fix
>> it?
>>
> Did you applied the patch "bcache: only permit to recovery read error
> when cache device is clean" ?
> If you did, maybe this patch can fix it. And you'd better check
> /sys/fs/bcache/XXX/internal/cache_read_races in your environment,
> meanwhile, it should not be zero when you get that err message.
> 


Yes this patch is applied.

Stefan
Eddie Chapman Nov. 17, 2017, 12:57 p.m. UTC | #7
On 17/11/17 10:20, Rui Hua wrote:
> Hi, Stefan
> 
> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG <s.priebe@profihost.ag>:
>> I‘m getting the same xfs error message under high load. Does this patch fix
>> it?
>>
> Did you applied the patch "bcache: only permit to recovery read error
> when cache device is clean" ?
> If you did, maybe this patch can fix it. And you'd better check
> /sys/fs/bcache/XXX/internal/cache_read_races in your environment,
> meanwhile, it should not be zero when you get that err message.

Hi all,

I have 3 servers running a very recent 4.9 stable release, with several 
recent bcache patches cherry picked, including V4 of "bcache: only 
permit to recovery read error when cache device is clean".

In the 3 weeks since using these cherry picks I've experienced a very 
small number of isolated read errors in the layer above bcache, on all 3 
servers.

On one of the servers, 2 out of the 6 bcache resources have a value of 1 
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same 
2 bcache resources where one read error has occurred on the upper layer. 
The other 4 bcache resources have 0 in cache_read_races and I haven't 
had any read errors on the layers above them.

On another server, I have 1 bcache resource out of 10 with a value of 5 
in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that 
bcache resource where a read error occurred on one occasion. The other 9 
bcache resources have 0 in cache_read_races, and no read errors have 
occurred on the layers above any of them.

On the 3rd server where some read errors occurred, I cannot verify if 
there were positive values in cache_read_races as I moved the data from 
there onto other storage, and shut down the bcache resources where the 
errors occurred.

If I can provide any other info which might help with this issue, please 
let me know.

regards,
Eddie
Coly Li Nov. 17, 2017, 1:22 p.m. UTC | #8
On 17/11/2017 8:57 PM, Eddie Chapman wrote:
> On 17/11/17 10:20, Rui Hua wrote:
>> Hi, Stefan
>>
>> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG
>> <s.priebe@profihost.ag>:
>>> I‘m getting the same xfs error message under high load. Does this
>>> patch fix
>>> it?
>>>
>> Did you applied the patch "bcache: only permit to recovery read error
>> when cache device is clean" ?
>> If you did, maybe this patch can fix it. And you'd better check
>> /sys/fs/bcache/XXX/internal/cache_read_races in your environment,
>> meanwhile, it should not be zero when you get that err message.
> 
> Hi all,
> 
> I have 3 servers running a very recent 4.9 stable release, with several
> recent bcache patches cherry picked, including V4 of "bcache: only
> permit to recovery read error when cache device is clean".
> 
> In the 3 weeks since using these cherry picks I've experienced a very
> small number of isolated read errors in the layer above bcache, on all 3
> servers.
> 
> On one of the servers, 2 out of the 6 bcache resources have a value of 1
> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same
> 2 bcache resources where one read error has occurred on the upper layer.
> The other 4 bcache resources have 0 in cache_read_races and I haven't
> had any read errors on the layers above them.
> 
> On another server, I have 1 bcache resource out of 10 with a value of 5
> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
> bcache resource where a read error occurred on one occasion. The other 9
> bcache resources have 0 in cache_read_races, and no read errors have
> occurred on the layers above any of them.
> 
> On the 3rd server where some read errors occurred, I cannot verify if
> there were positive values in cache_read_races as I moved the data from
> there onto other storage, and shut down the bcache resources where the
> errors occurred.
> 
> If I can provide any other info which might help with this issue, please
> let me know.

Hi Eddie,

This is very informative, thank you so much :-)

Coly Li
Eddie Chapman Nov. 17, 2017, 1:43 p.m. UTC | #9
On 17/11/17 13:22, Coly Li wrote:
> On 17/11/2017 8:57 PM, Eddie Chapman wrote:
>> On 17/11/17 10:20, Rui Hua wrote:
>>> Hi, Stefan
>>>
>>> 2017-11-17 16:28 GMT+08:00 Stefan Priebe - Profihost AG
>>> <s.priebe@profihost.ag>:
>>>> I‘m getting the same xfs error message under high load. Does this
>>>> patch fix
>>>> it?
>>>>
>>> Did you applied the patch "bcache: only permit to recovery read error
>>> when cache device is clean" ?
>>> If you did, maybe this patch can fix it. And you'd better check
>>> /sys/fs/bcache/XXX/internal/cache_read_races in your environment,
>>> meanwhile, it should not be zero when you get that err message.
>>
>> Hi all,
>>
>> I have 3 servers running a very recent 4.9 stable release, with several
>> recent bcache patches cherry picked, including V4 of "bcache: only
>> permit to recovery read error when cache device is clean".
>>
>> In the 3 weeks since using these cherry picks I've experienced a very
>> small number of isolated read errors in the layer above bcache, on all 3
>> servers.
>>
>> On one of the servers, 2 out of the 6 bcache resources have a value of 1
>> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on these same
>> 2 bcache resources where one read error has occurred on the upper layer.
>> The other 4 bcache resources have 0 in cache_read_races and I haven't
>> had any read errors on the layers above them.
>>
>> On another server, I have 1 bcache resource out of 10 with a value of 5
>> in /sys/fs/bcache/XXX/internal/cache_read_races, and it is on that
>> bcache resource where a read error occurred on one occasion. The other 9
>> bcache resources have 0 in cache_read_races, and no read errors have
>> occurred on the layers above any of them.
>>
>> On the 3rd server where some read errors occurred, I cannot verify if
>> there were positive values in cache_read_races as I moved the data from
>> there onto other storage, and shut down the bcache resources where the
>> errors occurred.
>>
>> If I can provide any other info which might help with this issue, please
>> let me know.
> 
> Hi Eddie,
> 
> This is very informative, thank you so much :-)
> 
> Coly Li

Hi Coly,

You are most welcome. Another interesting info, but maybe it is 
unrelated/coincidence: the bcache resources where the errors occurred, 
the underlying backing device was a raid adapter that is quite a lot 
slower than the (different) underlying physical storage on the other 
bcache resources that do not have read races. Up to now I had suspected 
a driver issue with this raid adapter as causing the read errors, so I 
started the process of gradually retiring the adapter on these servers 
in the last 3 weeks. Anyway, in light of this issue coming up here I'm 
wondering if this is significant in suggesting possibly that the read 
races are more likely to occur if the backing storage is quite slow. Or 
maybe not.

Eddie
Michael Lyle Nov. 18, 2017, 12:57 a.m. UTC | #10
Rui Hua--

Thanks for the clarification.  You're correct, it doesn't get called,
and it goes to read_complete.  However, read_error should probably get
called.  How would you suggest handling this-- should we initially set
read_dirty_data true and clear it if it is all clean?  (and change the
condition to properly go into the error path).  As it is, the patch
makes things no worse but the error path is very unsafe.

Mike

On Thu, Nov 16, 2017 at 8:25 PM, Rui Hua <huarui.dev@gmail.com> wrote:
> Hi, Michael--
>
> Thanks for the quickly reply.
>
> If the metadata on the cache device was NOT correctly read,
> cached_dev_read_error() will not be called, so the recovery check will
> not be executed. so this patch is safe.
> s->iop.error was not set when faild to read metadata on the cache
> device, so the  cached_dev_bio_complete() will be called instead of
> cached_dev_read_error()
>
> 2017-11-17 12:02 GMT+08:00 Michael Lyle <mlyle@lyle.org>:
>> Hi, Rui Hua--
>>
>> On 11/16/2017 07:51 PM, Rui Hua wrote:
>>> In this patch, we use s->read_dirty_data to judge whether the read
>>> request hit dirty data in cache device, it is safe to reread data from
>>> the backing device when the read request hit clean data. This can not
>>> only handle cache read race, but also recover data when failed read
>>> request from cache device.
>>
>> I haven't read over all of this yet, to understand the read race.  But
>> this change can't be safe-- read_dirty_data only is set to true if the
>> metadata on the cache device is correctly read.  If that fails, the flag
>> may not be set and we could read stale data on the backing device.
>>
>> Mike
Rui Hua Nov. 18, 2017, 5 p.m. UTC | #11
Hi, Michael

> Thanks for the clarification.  You're correct, it doesn't get called,
> and it goes to read_complete.  However, read_error should probably get
> called.  How would you suggest handling this-- should we initially set
> read_dirty_data true and clear it if it is all clean?  (and change the
> condition to properly go into the error path).  As it is, the patch
> makes things no worse but the error path is very unsafe.

Yes, init read_dirty_data to true is an option, but I think it is not
so easy to judge whether it is all clean, because a read request may
cover several keys(just as Coly mentioned in previous reply), probably
some of them are dirty, others are clean, and cache_lookup_fn() was
called on each of these keys when read request search them on btree,
which means we should count and mark them in some place?

Maybe we can set s->iop.status when metadata I/O error, and clear
s->recoverable at the same time (add the code in cache_lookup(), after
bch_btree_map_keys() returns), then it can goto the read_error path,
and will not goto the recovery code.  This option is just a idea,
certainly, we should consider other exception cases.

Thanks,
Michael Lyle Nov. 22, 2017, 5:13 a.m. UTC | #12
Reviewed-by: Michael Lyle <mlyle@lyle.org>

Please note though, that your patches have all tabs expanded to spaces,
which makes them difficult to apply.  I have fixed things these times
but please try to submit them in correct form in the future.

Thanks,

Mike

On 11/17/2017 04:57 PM, Michael Lyle wrote:
> Rui Hua--
> 
> Thanks for the clarification.  You're correct, it doesn't get called,
> and it goes to read_complete.  However, read_error should probably get
> called.  How would you suggest handling this-- should we initially set
> read_dirty_data true and clear it if it is all clean?  (and change the
> condition to properly go into the error path).  As it is, the patch
> makes things no worse but the error path is very unsafe.
> 
> Mike
> 
> On Thu, Nov 16, 2017 at 8:25 PM, Rui Hua <huarui.dev@gmail.com> wrote:
>> Hi, Michael--
>>
>> Thanks for the quickly reply.
>>
>> If the metadata on the cache device was NOT correctly read,
>> cached_dev_read_error() will not be called, so the recovery check will
>> not be executed. so this patch is safe.
>> s->iop.error was not set when faild to read metadata on the cache
>> device, so the  cached_dev_bio_complete() will be called instead of
>> cached_dev_read_error()
>>
>> 2017-11-17 12:02 GMT+08:00 Michael Lyle <mlyle@lyle.org>:
>>> Hi, Rui Hua--
>>>
>>> On 11/16/2017 07:51 PM, Rui Hua wrote:
>>>> In this patch, we use s->read_dirty_data to judge whether the read
>>>> request hit dirty data in cache device, it is safe to reread data from
>>>> the backing device when the read request hit clean data. This can not
>>>> only handle cache read race, but also recover data when failed read
>>>> request from cache device.
>>>
>>> I haven't read over all of this yet, to understand the read race.  But
>>> this change can't be safe-- read_dirty_data only is set to true if the
>>> metadata on the cache device is correctly read.  If that fails, the flag
>>> may not be set and we could read stale data on the backing device.
>>>
>>> Mike
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Rui Hua Nov. 22, 2017, 6:04 a.m. UTC | #13
Hi, Michael

Thanks for your reminder. I'll checkpatch carefully next time.

Thanks

2017-11-22 13:13 GMT+08:00 Michael Lyle <mlyle@lyle.org>:
> Reviewed-by: Michael Lyle <mlyle@lyle.org>
>
> Please note though, that your patches have all tabs expanded to spaces,
> which makes them difficult to apply.  I have fixed things these times
> but please try to submit them in correct form in the future.
>
> Thanks,
>
> Mike
>
> On 11/17/2017 04:57 PM, Michael Lyle wrote:
>> Rui Hua--
>>
>> Thanks for the clarification.  You're correct, it doesn't get called,
>> and it goes to read_complete.  However, read_error should probably get
>> called.  How would you suggest handling this-- should we initially set
>> read_dirty_data true and clear it if it is all clean?  (and change the
>> condition to properly go into the error path).  As it is, the patch
>> makes things no worse but the error path is very unsafe.
>>
>> Mike
>>
>> On Thu, Nov 16, 2017 at 8:25 PM, Rui Hua <huarui.dev@gmail.com> wrote:
>>> Hi, Michael--
>>>
>>> Thanks for the quickly reply.
>>>
>>> If the metadata on the cache device was NOT correctly read,
>>> cached_dev_read_error() will not be called, so the recovery check will
>>> not be executed. so this patch is safe.
>>> s->iop.error was not set when faild to read metadata on the cache
>>> device, so the  cached_dev_bio_complete() will be called instead of
>>> cached_dev_read_error()
>>>
>>> 2017-11-17 12:02 GMT+08:00 Michael Lyle <mlyle@lyle.org>:
>>>> Hi, Rui Hua--
>>>>
>>>> On 11/16/2017 07:51 PM, Rui Hua wrote:
>>>>> In this patch, we use s->read_dirty_data to judge whether the read
>>>>> request hit dirty data in cache device, it is safe to reread data from
>>>>> the backing device when the read request hit clean data. This can not
>>>>> only handle cache read race, but also recover data when failed read
>>>>> request from cache device.
>>>>
>>>> I haven't read over all of this yet, to understand the read race.  But
>>>> this change can't be safe-- read_dirty_data only is set to true if the
>>>> metadata on the cache device is correctly read.  If that fails, the flag
>>>> may not be set and we could read stale data on the backing device.
>>>>
>>>> Mike
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
diff mbox

Patch

diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 3a7aed7..f02fefb 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -708,16 +708,15 @@  static void cached_dev_read_error(struct closure *cl)
 {
        struct search *s = container_of(cl, struct search, cl);
        struct bio *bio = &s->bio.bio;
-       struct cached_dev *dc = container_of(s->d, struct cached_dev, disk);

        /*
-        * If cache device is dirty (dc->has_dirty is non-zero), then
-        * recovery a failed read request from cached device may get a
-        * stale data back. So read failure recovery is only permitted
-        * when cache device is clean.
+        * If read request hit dirty data (s->read_dirty_data is true),
+        * then recovery a failed read request from cached device may
+        * get a stale data back. So read failure recovery is only
+        * permitted when read request hit clean data in cache device,
+        * or when cache read race happend.
         */
-       if (s->recoverable &&
-           (dc && !atomic_read(&dc->has_dirty))) {
+       if (s->recoverable && !s->read_dirty_data) {
                /* Retry from the backing device: */
                trace_bcache_read_retry(s->orig_bio);