diff mbox series

[question] solution for raid10 configuration concurrent with io

Message ID cb390b39-1b1e-04e1-55ad-2ff8afc47e1b@huawei.com (mailing list archive)
State New, archived
Delegated to: Song Liu
Headers show
Series [question] solution for raid10 configuration concurrent with io | expand

Commit Message

Yu Kuai April 14, 2023, 2:28 a.m. UTC
Hello,

rai10 support that add or remove conf->mirrors[].rdev/replacement can
concurrent with io, and this is handled by something like following
code:

raid10_write_request
  if (rrdev)
   // decide to write replacement
   r10_bio->devs[i].repl_bio = bio

  raid10_write_one_disk
         if (replacement) {
                 rdev = conf->mirrors[devnum].replacement;
                 if (rdev == NULL) {
                         /* Replacement just got moved to main 'rdev' */
                         smp_mb();
                         rdev = conf->mirrors[devnum].rdev;
                 }
         } else
                 rdev = conf->mirrors[devnum].rdev;

And this scheme is not complete and can cause kernel panic or data loss
in multiple places.

for example, null-ptr-dereference:

t1:				t2:
raid10_write_request:

  // read rdev
  rdev = conf->mirros[].rdev;
				raid10_remove_disk
				 p = conf->mirros + number;
				 rdevp = &p->rdev;
				 // reset rdev
				 *rdevp = NULL
  raid10_write_one_disk
   // reread rdev got NULL
   rdev = conf->mirrors[devnum].rdev
     // null-ptr-dereference
    mbio = bio_alloc_clone(rdev->bdev...)
				 synchronize_rcu()

for example, data loss:

t1:
// assum that rdev is NULL, and replacement is not NULL
raid10_write_request
  // read replacement
  rrdev = conf->mirros[].replacement

				t2: replacement moved to rdev
				raid10_remove_disk
				 p->rdev = p->replacement
				 p->replacement = NULL

				t3: add a new replacement
				raid10_add_disk
				 p->replacement = rdev
  raid10_write_one_disk
   // read again, and got wrong replacement
   // should write to rdev in this case
   rrdev = conf->mirros[].replacement

In order to fix these problems, I come up with two different solutions:

1) based on current solution, make sure rdev is only accessd once while
handling io, this can be done by changing r10bio:

And only read conf once, then record rdev/replacement. This requires to
change all the context to issue io and complete io.

2) add a new synchronization that configuration can't concurrent with
any io, however, I think this can work but I'm not 100% percent sure.
Code changes should be less, and this will allow some cleanups and
simplify logic.

Any suggestions?

Thanks,
Kuai

Comments

Xiao Ni April 27, 2023, 6:53 a.m. UTC | #1
On Fri, Apr 14, 2023 at 10:30 AM Yu Kuai <yukuai3@huawei.com> wrote:
>
> Hello,
>
> rai10 support that add or remove conf->mirrors[].rdev/replacement can
> concurrent with io, and this is handled by something like following
> code:
>
> raid10_write_request
>   if (rrdev)
>    // decide to write replacement
>    r10_bio->devs[i].repl_bio = bio
>
>   raid10_write_one_disk
>          if (replacement) {
>                  rdev = conf->mirrors[devnum].replacement;
>                  if (rdev == NULL) {
>                          /* Replacement just got moved to main 'rdev' */
>                          smp_mb();
>                          rdev = conf->mirrors[devnum].rdev;
>                  }
>          } else
>                  rdev = conf->mirrors[devnum].rdev;
>
> And this scheme is not complete and can cause kernel panic or data loss
> in multiple places.
>
> for example, null-ptr-dereference:
>
> t1:                             t2:
> raid10_write_request:
>
>   // read rdev
>   rdev = conf->mirros[].rdev;
>                                 raid10_remove_disk
>                                  p = conf->mirros + number;
>                                  rdevp = &p->rdev;
>                                  // reset rdev
>                                  *rdevp = NULL
>   raid10_write_one_disk
>    // reread rdev got NULL
>    rdev = conf->mirrors[devnum].rdev
>      // null-ptr-dereference
>     mbio = bio_alloc_clone(rdev->bdev...)
>                                  synchronize_rcu()

Hi Yu kuai

raid10_write_request adds the rdev->nr_pending with rcu lock
protection. Can this case happen? After adding ->nr_pending, the rdev
can't be removed.

>
> for example, data loss:
>
> t1:
> // assum that rdev is NULL, and replacement is not NULL

How can trigger this? Could you give the detailed commands?

Best Regards
Xiao Ni

> raid10_write_request
>   // read replacement
>   rrdev = conf->mirros[].replacement
>
>                                 t2: replacement moved to rdev
>                                 raid10_remove_disk
>                                  p->rdev = p->replacement
>                                  p->replacement = NULL
>
>                                 t3: add a new replacement
>                                 raid10_add_disk
>                                  p->replacement = rdev
>   raid10_write_one_disk
>    // read again, and got wrong replacement
>    // should write to rdev in this case
>    rrdev = conf->mirros[].replacement
>
> In order to fix these problems, I come up with two different solutions:
>
> 1) based on current solution, make sure rdev is only accessd once while
> handling io, this can be done by changing r10bio:
> diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
> index 63e48b11b552..c132cba1953c 100644
> --- a/drivers/md/raid10.h
> +++ b/drivers/md/raid10.h
> @@ -145,12 +145,12 @@ struct r10bio {
>           */
>          struct r10dev {
>                  struct bio      *bio;
> -               union {
> -                       struct bio      *repl_bio; /* used for resync and
> -                                                   * writes */
> -                       struct md_rdev  *rdev;     /* used for reads
> -                                                   * (read_slot >= 0) */
> -               };
> +               struct md_rdev  *rdev;
> +
> +               /* used for resync and writes */
> +               struct bio      *repl_bio;
> +               struct md_rdev  *replacement;
> +
>                  sector_t        addr;
>                  int             devnum;
>          } devs[];
>
> And only read conf once, then record rdev/replacement. This requires to
> change all the context to issue io and complete io.
>
> 2) add a new synchronization that configuration can't concurrent with
> any io, however, I think this can work but I'm not 100% percent sure.
> Code changes should be less, and this will allow some cleanups and
> simplify logic.
>
> Any suggestions?
>
> Thanks,
> Kuai
>
Yu Kuai April 27, 2023, 7:13 a.m. UTC | #2
Hi,

在 2023/04/27 14:53, Xiao Ni 写道:
>> for example, null-ptr-dereference:
>>
>> t1:                             t2:
>> raid10_write_request:
>>
>>    // read rdev
>>    rdev = conf->mirros[].rdev;
>>                                  raid10_remove_disk
>>                                   p = conf->mirros + number;
>>                                   rdevp = &p->rdev;
>>                                   // reset rdev
>>                                   *rdevp = NULL
>>    raid10_write_one_disk
>>     // reread rdev got NULL
>>     rdev = conf->mirrors[devnum].rdev
>>       // null-ptr-dereference
>>      mbio = bio_alloc_clone(rdev->bdev...)
>>                                   synchronize_rcu()
> 
> Hi Yu kuai
> 
> raid10_write_request adds the rdev->nr_pending with rcu lock
> protection. Can this case happen? After adding ->nr_pending, the rdev
> can't be removed.

The current rcu protection really is a mess, many places access rdev
after rcu_read_unlock()...

For the above case, noted that raid10_remove_disk is called before
nr_pending is increased, and raid10_write_one_disk() is called after
rcu_read_unlock().

t1:				t2:

raid10_write_request
  rcu_read_lock
  rdev = conf->mirros[].rdev
				raid10_remove_disk
				 ......
				 // nr_pending is 0, remove disk
  // read inside rcu
  rcu_read_unlock

  raid10_write_one_disk
  // trigger null-ptr-dereference
				synchronize_rcu()

Thanks,
Kuai
> 
>>
>> for example, data loss:
>>
>> t1:
>> // assum that rdev is NULL, and replacement is not NULL
> 
> How can trigger this? Could you give the detailed commands?
> 
> Best Regards
> Xiao Ni
Yu Kuai April 27, 2023, 7:22 a.m. UTC | #3
Hi,

Sorry that synchronize_rcu() is misplaced, it should be right after
rcu_read_unlock():

在 2023/04/27 15:13, Yu Kuai 写道:
> t1:                t2:
> 
> raid10_write_request
>   rcu_read_lock
>   rdev = conf->mirros[].rdev
>                  raid10_remove_disk
>                   ......
>                   // nr_pending is 0, remove disk
>   // read inside rcu
>   rcu_read_unlock
					//set rdev NULL
					synchronize_rcu
> 
>   raid10_write_one_disk
>   // trigger null-ptr-dereference
> 

Thanks,
Kuai
Yu Kuai April 27, 2023, 8:09 a.m. UTC | #4
Hi,

在 2023/04/27 14:53, Xiao Ni 写道:

>> // assum that rdev is NULL, and replacement is not NULL
> 
> How can trigger this? Could you give the detailed commands?
> 

This is just because raid10_remove_disk() set rdev to NULL first, and
then replace rdev with replacement. This is temporary but io can
concurrent with this temporary status.

raid10_remove_disk

  p = conf->mirrors + number;
  rdevp = &p->rdev;
  *rdevp = NULL;
  // here!
  p->rdev = p->replacement;
  p->replacement = NULL;

Thanks,
Kuai
Xiao Ni April 28, 2023, 9:11 a.m. UTC | #5
On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2023/04/27 14:53, Xiao Ni 写道:
> >> for example, null-ptr-dereference:
> >>
> >> t1:                             t2:
> >> raid10_write_request:
> >>
> >>    // read rdev
> >>    rdev = conf->mirros[].rdev;
> >>                                  raid10_remove_disk
> >>                                   p = conf->mirros + number;
> >>                                   rdevp = &p->rdev;
> >>                                   // reset rdev
> >>                                   *rdevp = NULL
> >>    raid10_write_one_disk
> >>     // reread rdev got NULL
> >>     rdev = conf->mirrors[devnum].rdev
> >>       // null-ptr-dereference
> >>      mbio = bio_alloc_clone(rdev->bdev...)
> >>                                   synchronize_rcu()
> >
> > Hi Yu kuai
> >
> > raid10_write_request adds the rdev->nr_pending with rcu lock
> > protection. Can this case happen? After adding ->nr_pending, the rdev
> > can't be removed.
>
> The current rcu protection really is a mess, many places access rdev
> after rcu_read_unlock()...

Hi Yu Kuai

It looks like a problem that is it safe to access rdev after adding
rdev->nr_pending and rcu_read_unlock. Because besides raid10, other
personalities still do in the same way. After reading the related
codes, I have the same question.

>
> For the above case, noted that raid10_remove_disk is called before
> nr_pending is increased, and raid10_write_one_disk() is called after
> rcu_read_unlock().
>
> t1:                             t2:
>
> raid10_write_request
>   rcu_read_lock
>   rdev = conf->mirros[].rdev
>                                 raid10_remove_disk
>                                  ......
>                                  // nr_pending is 0, remove disk
>   // read inside rcu
>   rcu_read_unlock
>
>   raid10_write_one_disk
>   // trigger null-ptr-dereference
>                                 synchronize_rcu()

Function remove_and_add_spares sets RemoveSynchronized, calls
synchronize_rcu and calls raid10_remove_disk. So the right position of
synchronize_rcu in your case should be before raid10_remove_disk?

But it still can't resolve the problem. raid10_remove_disk can set
rdev to NULL between rcu_dereference and adding ->nr_pending

raid10_write_request                     remove_and_add_spares

                                                       set RemoveSynchronized
                                                       synchronize_rcu
rcu_read_lock
rdev = rcu_dereference
                                                     <-------     *rdevp = NULL
atomic_inc rdev->nr_pending
rcu_read_unlock

raid10_write_one_disk
// trigger null-ptr-dereference

Can we check RemoveSynchronized in pers->make_request? It can't submit
bio to this rdev after synchronize_rcu. For the
pers->hot_remove_disk(raid10_remove_disk) side, as you said in the
second solution, maybe it needs a new method to know all
pers->make_request(raid10_make_request) exit.


Best Regards
Xiao Ni

>
> Thanks,
> Kuai
> >
> >>
> >> for example, data loss:
> >>
> >> t1:
> >> // assum that rdev is NULL, and replacement is not NULL
> >
> > How can trigger this? Could you give the detailed commands?
> >
> > Best Regards
> > Xiao Ni
>
Yu Kuai April 28, 2023, 9:31 a.m. UTC | #6
Hi,

在 2023/04/28 17:11, Xiao Ni 写道:
> On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> 在 2023/04/27 14:53, Xiao Ni 写道:
>>>> for example, null-ptr-dereference:
>>>>
>>>> t1:                             t2:
>>>> raid10_write_request:
>>>>
>>>>     // read rdev
>>>>     rdev = conf->mirros[].rdev;
>>>>                                   raid10_remove_disk
>>>>                                    p = conf->mirros + number;
>>>>                                    rdevp = &p->rdev;
>>>>                                    // reset rdev
>>>>                                    *rdevp = NULL
>>>>     raid10_write_one_disk
>>>>      // reread rdev got NULL
>>>>      rdev = conf->mirrors[devnum].rdev
>>>>        // null-ptr-dereference
>>>>       mbio = bio_alloc_clone(rdev->bdev...)
>>>>                                    synchronize_rcu()
>>>
>>> Hi Yu kuai
>>>
>>> raid10_write_request adds the rdev->nr_pending with rcu lock
>>> protection. Can this case happen? After adding ->nr_pending, the rdev
>>> can't be removed.
>>
>> The current rcu protection really is a mess, many places access rdev
>> after rcu_read_unlock()...
> 
> Hi Yu Kuai
> 
> It looks like a problem that is it safe to access rdev after adding
> rdev->nr_pending and rcu_read_unlock. Because besides raid10, other
> personalities still do in the same way. After reading the related
> codes, I have the same question.

I think it's not safe, that's basically what the first solution try to
avoid, but it's just in raid10, and I'm not quite familiar with all the
personality but I think they can be fixed likewise. I'll keep looking
into the details. 
diff mbox series

Patch

diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 63e48b11b552..c132cba1953c 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -145,12 +145,12 @@  struct r10bio {
          */
         struct r10dev {
                 struct bio      *bio;
-               union {
-                       struct bio      *repl_bio; /* used for resync and
-                                                   * writes */
-                       struct md_rdev  *rdev;     /* used for reads
-                                                   * (read_slot >= 0) */
-               };
+               struct md_rdev  *rdev;
+
+               /* used for resync and writes */
+               struct bio      *repl_bio;
+               struct md_rdev  *replacement;
+
                 sector_t        addr;
                 int             devnum;
         } devs[];