Message ID | cb390b39-1b1e-04e1-55ad-2ff8afc47e1b@huawei.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Song Liu |
Headers | show |
Series | [question] solution for raid10 configuration concurrent with io | expand |
On Fri, Apr 14, 2023 at 10:30 AM Yu Kuai <yukuai3@huawei.com> wrote: > > Hello, > > rai10 support that add or remove conf->mirrors[].rdev/replacement can > concurrent with io, and this is handled by something like following > code: > > raid10_write_request > if (rrdev) > // decide to write replacement > r10_bio->devs[i].repl_bio = bio > > raid10_write_one_disk > if (replacement) { > rdev = conf->mirrors[devnum].replacement; > if (rdev == NULL) { > /* Replacement just got moved to main 'rdev' */ > smp_mb(); > rdev = conf->mirrors[devnum].rdev; > } > } else > rdev = conf->mirrors[devnum].rdev; > > And this scheme is not complete and can cause kernel panic or data loss > in multiple places. > > for example, null-ptr-dereference: > > t1: t2: > raid10_write_request: > > // read rdev > rdev = conf->mirros[].rdev; > raid10_remove_disk > p = conf->mirros + number; > rdevp = &p->rdev; > // reset rdev > *rdevp = NULL > raid10_write_one_disk > // reread rdev got NULL > rdev = conf->mirrors[devnum].rdev > // null-ptr-dereference > mbio = bio_alloc_clone(rdev->bdev...) > synchronize_rcu() Hi Yu kuai raid10_write_request adds the rdev->nr_pending with rcu lock protection. Can this case happen? After adding ->nr_pending, the rdev can't be removed. > > for example, data loss: > > t1: > // assum that rdev is NULL, and replacement is not NULL How can trigger this? Could you give the detailed commands? Best Regards Xiao Ni > raid10_write_request > // read replacement > rrdev = conf->mirros[].replacement > > t2: replacement moved to rdev > raid10_remove_disk > p->rdev = p->replacement > p->replacement = NULL > > t3: add a new replacement > raid10_add_disk > p->replacement = rdev > raid10_write_one_disk > // read again, and got wrong replacement > // should write to rdev in this case > rrdev = conf->mirros[].replacement > > In order to fix these problems, I come up with two different solutions: > > 1) based on current solution, make sure rdev is only accessd once while > handling io, this can be done by changing r10bio: > diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h > index 63e48b11b552..c132cba1953c 100644 > --- a/drivers/md/raid10.h > +++ b/drivers/md/raid10.h > @@ -145,12 +145,12 @@ struct r10bio { > */ > struct r10dev { > struct bio *bio; > - union { > - struct bio *repl_bio; /* used for resync and > - * writes */ > - struct md_rdev *rdev; /* used for reads > - * (read_slot >= 0) */ > - }; > + struct md_rdev *rdev; > + > + /* used for resync and writes */ > + struct bio *repl_bio; > + struct md_rdev *replacement; > + > sector_t addr; > int devnum; > } devs[]; > > And only read conf once, then record rdev/replacement. This requires to > change all the context to issue io and complete io. > > 2) add a new synchronization that configuration can't concurrent with > any io, however, I think this can work but I'm not 100% percent sure. > Code changes should be less, and this will allow some cleanups and > simplify logic. > > Any suggestions? > > Thanks, > Kuai >
Hi, 在 2023/04/27 14:53, Xiao Ni 写道: >> for example, null-ptr-dereference: >> >> t1: t2: >> raid10_write_request: >> >> // read rdev >> rdev = conf->mirros[].rdev; >> raid10_remove_disk >> p = conf->mirros + number; >> rdevp = &p->rdev; >> // reset rdev >> *rdevp = NULL >> raid10_write_one_disk >> // reread rdev got NULL >> rdev = conf->mirrors[devnum].rdev >> // null-ptr-dereference >> mbio = bio_alloc_clone(rdev->bdev...) >> synchronize_rcu() > > Hi Yu kuai > > raid10_write_request adds the rdev->nr_pending with rcu lock > protection. Can this case happen? After adding ->nr_pending, the rdev > can't be removed. The current rcu protection really is a mess, many places access rdev after rcu_read_unlock()... For the above case, noted that raid10_remove_disk is called before nr_pending is increased, and raid10_write_one_disk() is called after rcu_read_unlock(). t1: t2: raid10_write_request rcu_read_lock rdev = conf->mirros[].rdev raid10_remove_disk ...... // nr_pending is 0, remove disk // read inside rcu rcu_read_unlock raid10_write_one_disk // trigger null-ptr-dereference synchronize_rcu() Thanks, Kuai > >> >> for example, data loss: >> >> t1: >> // assum that rdev is NULL, and replacement is not NULL > > How can trigger this? Could you give the detailed commands? > > Best Regards > Xiao Ni
Hi, Sorry that synchronize_rcu() is misplaced, it should be right after rcu_read_unlock(): 在 2023/04/27 15:13, Yu Kuai 写道: > t1: t2: > > raid10_write_request > rcu_read_lock > rdev = conf->mirros[].rdev > raid10_remove_disk > ...... > // nr_pending is 0, remove disk > // read inside rcu > rcu_read_unlock //set rdev NULL synchronize_rcu > > raid10_write_one_disk > // trigger null-ptr-dereference > Thanks, Kuai
Hi, 在 2023/04/27 14:53, Xiao Ni 写道: >> // assum that rdev is NULL, and replacement is not NULL > > How can trigger this? Could you give the detailed commands? > This is just because raid10_remove_disk() set rdev to NULL first, and then replace rdev with replacement. This is temporary but io can concurrent with this temporary status. raid10_remove_disk p = conf->mirrors + number; rdevp = &p->rdev; *rdevp = NULL; // here! p->rdev = p->replacement; p->replacement = NULL; Thanks, Kuai
On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > Hi, > > 在 2023/04/27 14:53, Xiao Ni 写道: > >> for example, null-ptr-dereference: > >> > >> t1: t2: > >> raid10_write_request: > >> > >> // read rdev > >> rdev = conf->mirros[].rdev; > >> raid10_remove_disk > >> p = conf->mirros + number; > >> rdevp = &p->rdev; > >> // reset rdev > >> *rdevp = NULL > >> raid10_write_one_disk > >> // reread rdev got NULL > >> rdev = conf->mirrors[devnum].rdev > >> // null-ptr-dereference > >> mbio = bio_alloc_clone(rdev->bdev...) > >> synchronize_rcu() > > > > Hi Yu kuai > > > > raid10_write_request adds the rdev->nr_pending with rcu lock > > protection. Can this case happen? After adding ->nr_pending, the rdev > > can't be removed. > > The current rcu protection really is a mess, many places access rdev > after rcu_read_unlock()... Hi Yu Kuai It looks like a problem that is it safe to access rdev after adding rdev->nr_pending and rcu_read_unlock. Because besides raid10, other personalities still do in the same way. After reading the related codes, I have the same question. > > For the above case, noted that raid10_remove_disk is called before > nr_pending is increased, and raid10_write_one_disk() is called after > rcu_read_unlock(). > > t1: t2: > > raid10_write_request > rcu_read_lock > rdev = conf->mirros[].rdev > raid10_remove_disk > ...... > // nr_pending is 0, remove disk > // read inside rcu > rcu_read_unlock > > raid10_write_one_disk > // trigger null-ptr-dereference > synchronize_rcu() Function remove_and_add_spares sets RemoveSynchronized, calls synchronize_rcu and calls raid10_remove_disk. So the right position of synchronize_rcu in your case should be before raid10_remove_disk? But it still can't resolve the problem. raid10_remove_disk can set rdev to NULL between rcu_dereference and adding ->nr_pending raid10_write_request remove_and_add_spares set RemoveSynchronized synchronize_rcu rcu_read_lock rdev = rcu_dereference <------- *rdevp = NULL atomic_inc rdev->nr_pending rcu_read_unlock raid10_write_one_disk // trigger null-ptr-dereference Can we check RemoveSynchronized in pers->make_request? It can't submit bio to this rdev after synchronize_rcu. For the pers->hot_remove_disk(raid10_remove_disk) side, as you said in the second solution, maybe it needs a new method to know all pers->make_request(raid10_make_request) exit. Best Regards Xiao Ni > > Thanks, > Kuai > > > >> > >> for example, data loss: > >> > >> t1: > >> // assum that rdev is NULL, and replacement is not NULL > > > > How can trigger this? Could you give the detailed commands? > > > > Best Regards > > Xiao Ni >
Hi, 在 2023/04/28 17:11, Xiao Ni 写道: > On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@huaweicloud.com> wrote: >> >> Hi, >> >> 在 2023/04/27 14:53, Xiao Ni 写道: >>>> for example, null-ptr-dereference: >>>> >>>> t1: t2: >>>> raid10_write_request: >>>> >>>> // read rdev >>>> rdev = conf->mirros[].rdev; >>>> raid10_remove_disk >>>> p = conf->mirros + number; >>>> rdevp = &p->rdev; >>>> // reset rdev >>>> *rdevp = NULL >>>> raid10_write_one_disk >>>> // reread rdev got NULL >>>> rdev = conf->mirrors[devnum].rdev >>>> // null-ptr-dereference >>>> mbio = bio_alloc_clone(rdev->bdev...) >>>> synchronize_rcu() >>> >>> Hi Yu kuai >>> >>> raid10_write_request adds the rdev->nr_pending with rcu lock >>> protection. Can this case happen? After adding ->nr_pending, the rdev >>> can't be removed. >> >> The current rcu protection really is a mess, many places access rdev >> after rcu_read_unlock()... > > Hi Yu Kuai > > It looks like a problem that is it safe to access rdev after adding > rdev->nr_pending and rcu_read_unlock. Because besides raid10, other > personalities still do in the same way. After reading the related > codes, I have the same question. I think it's not safe, that's basically what the first solution try to avoid, but it's just in raid10, and I'm not quite familiar with all the personality but I think they can be fixed likewise. I'll keep looking into the details.
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h index 63e48b11b552..c132cba1953c 100644 --- a/drivers/md/raid10.h +++ b/drivers/md/raid10.h @@ -145,12 +145,12 @@ struct r10bio { */ struct r10dev { struct bio *bio; - union { - struct bio *repl_bio; /* used for resync and - * writes */ - struct md_rdev *rdev; /* used for reads - * (read_slot >= 0) */ - }; + struct md_rdev *rdev; + + /* used for resync and writes */ + struct bio *repl_bio; + struct md_rdev *replacement; + sector_t addr; int devnum; } devs[];