Message ID: 20190625144111.11270-1-idryomov@gmail.com (mailing list archive)
Series:     rbd: support for object-map and fast-diff
Hi Ilya,

Nice work. Some comments/questions:

1) Generally, having a state machine makes things easier to track, as the
code is less dispersed than before.

2) The running_list is used to keep track of in-flight requests when the
exclusive lock is held, to support rbd_quiesce_lock() when releasing the
lock. It would be great to generalize this list to keep track of all
in-flight requests, even when the lock is not required; it could then be
used to support a generic flush (similar to the librbd work queue
flush/drain). A generic in-flight request list would also make it easier
to support other functions such as task aborts, request timeouts, and
flushing before snapshot creation (see the sketch below).

3) For the acquiring_list, my concern is that while the lock is pending
to be acquired, requests are accepted without limit. If there is a delay
acquiring the lock, for example if the primary OSD of the header object
is down (which could block for ~25 sec), or worse if the pool is
inactive, the count could well exceed the max queue depth, and for write
requests this can consume a lot of memory.

4) In rbd_img_exclusive_lock(), at the end, we queue an acquire-lock task
for every request. I understand this is a single-threaded queue and that
if the lock is acquired all acquire tasks are cancelled; however, I feel
the queue could fill up a lot. Any chance we can schedule only one
acquire task?

5) The state RBD_IMG_EXCLUSIVE_LOCK is used both when the image does not
require the exclusive lock and when the lock is acquired. Cosmetically,
it may be better to separate them.

6) Probably not an issue, but we are now creating a large number of
mutexes, at least two for every request. Maybe we need to test at high
IOPS/queue depths that there is no overhead from this.

7) Is there any consideration of splitting the rbd module into multiple
files? Looking at how big librbd is, fitting this into a single kernel
file is challenging at best.
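To make 2) concrete, below is a minimal sketch of the kind of generic
in-flight tracking I have in mind. It is hypothetical: the struct and
function names are mine, not from the patchset (which only tracks
lock-protected requests on running_list), and a single drainer is
assumed.

#include <linux/completion.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * Hypothetical sketch for 2): track every in-flight image request,
 * not just those issued under exclusive lock, so that a generic
 * flush can wait for the list to drain (similar in spirit to the
 * librbd work queue flush/drain).
 */
struct rbd_inflight {
	spinlock_t lock;
	struct list_head reqs;		/* all in-flight requests */
	struct completion drained;	/* fired when reqs empties */
	bool draining;
};

static void rbd_inflight_init(struct rbd_inflight *inf)
{
	spin_lock_init(&inf->lock);
	INIT_LIST_HEAD(&inf->reqs);
	init_completion(&inf->drained);
	inf->draining = false;
}

static void rbd_inflight_add(struct rbd_inflight *inf, struct list_head *node)
{
	spin_lock(&inf->lock);
	list_add_tail(node, &inf->reqs);
	spin_unlock(&inf->lock);
}

static void rbd_inflight_del(struct rbd_inflight *inf, struct list_head *node)
{
	spin_lock(&inf->lock);
	list_del(node);
	if (inf->draining && list_empty(&inf->reqs))
		complete(&inf->drained);
	spin_unlock(&inf->lock);
}

/* Generic flush: wait until every tracked request has completed. */
static void rbd_inflight_drain(struct rbd_inflight *inf)
{
	spin_lock(&inf->lock);
	inf->draining = true;
	if (list_empty(&inf->reqs)) {
		inf->draining = false;
		spin_unlock(&inf->lock);
		return;
	}
	reinit_completion(&inf->drained);
	spin_unlock(&inf->lock);
	wait_for_completion(&inf->drained);
	inf->draining = false;
}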
/Maged

On 25/06/2019 16:40, Ilya Dryomov wrote:
> Hello,
>
> This series adds support for object-map and fast-diff image features.
> Patches 1 - 11 prepare object and image request state machines; patches
> 12 - 14 fix most of the shortcomings in our exclusive lock code, making
> it suitable for guarding the object map; patches 15 - 18 take care of
> the prerequisites and finally patches 19 - 20 implement object-map and
> fast-diff.
>
> Thanks,
>
> Ilya
>
>
> Ilya Dryomov (20):
>   rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred
>   rbd: replace obj_req->tried_parent with obj_req->read_state
>   rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}
>   rbd: move OSD request submission into object request state machines
>   rbd: introduce image request state machine
>   rbd: introduce obj_req->osd_reqs list
>   rbd: factor out rbd_osd_setup_copyup()
>   rbd: factor out __rbd_osd_setup_discard_ops()
>   rbd: move OSD request allocation into object request state machines
>   rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
>   rbd: introduce copyup state machine
>   rbd: lock should be quiesced on reacquire
>   rbd: quiescing lock should wait for image requests
>   rbd: new exclusive lock wait/wake code
>   libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
>   libceph: change ceph_osdc_call() to take page vector for response
>   libceph: export osd_req_op_data() macro
>   rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
>   rbd: support for object-map and fast-diff
>   rbd: setallochint only if object doesn't exist
>
>  drivers/block/rbd.c                  | 2433 ++++++++++++++++++--------
>  drivers/block/rbd_types.h            |   10 +
>  include/linux/ceph/cls_lock_client.h |    3 +
>  include/linux/ceph/libceph.h         |    6 +-
>  include/linux/ceph/osd_client.h      |   10 +-
>  include/linux/ceph/striper.h         |    2 +
>  net/ceph/cls_lock_client.c           |   47 +-
>  net/ceph/osd_client.c                |   18 +-
>  net/ceph/striper.c                   |   17 +
>  9 files changed, 1817 insertions(+), 729 deletions(-)
Re-reading 3), I realize the queue depth will not be exceeded in the
request queue.

On 30/06/2019 15:55, Maged Mokhtar wrote:
> Hi Ilya,
>
> Nice work. Some comments/questions:
>
> [...]
Hi Ilya,

On 06/25/2019 10:40 PM, Ilya Dryomov wrote:
> Hello,
>
> This series adds support for object-map and fast-diff image features.
> Patches 1 - 11 prepare object and image request state machines; patches
> 12 - 14 fix most of the shortcomings in our exclusive lock code, making
> it suitable for guarding the object map; patches 15 - 18 take care of
> the prerequisites and finally patches 19 - 20 implement object-map and
> fast-diff.

Nice patchset. I did a review and some testing of this patchset.

TEST:
With object-map enabled, I found one failing case: tasks/rbd_kernel.yaml.
It failed to rollback test_img while test_img is mapped.

Thanx

> [...]
Hi Maged,

On 06/30/2019 09:55 PM, Maged Mokhtar wrote:
> [...]
>
> 7) Is there any consideration of splitting the rbd module into multiple
> files? Looking at how big librbd is, fitting this into a single kernel
> file is challenging at best.

Ilya could provide more information, but I can share something about
this. Ilya and I discussed splitting rbd.c, as it has grown really big.
But we want the WIP (this patchset) to be merged first. A rough idea is
to pull exclusive-lock and object-map out of rbd.c.

Thanx

> [...]
On Sun, Jun 30, 2019 at 3:55 PM Maged Mokhtar <mmokhtar@petasan.org> wrote:
>
> Hi Ilya,
>
> Nice work. Some comments/questions:
>
> 1) Generally, having a state machine makes things easier to track, as
> the code is less dispersed than before.
>
> 2) The running_list is used to keep track of in-flight requests when
> the exclusive lock is held, to support rbd_quiesce_lock() when
> releasing the lock. It would be great to generalize this list to keep
> track of all in-flight requests, even when the lock is not required;
> it could then be used to support a generic flush (similar to the
> librbd work queue flush/drain). A generic in-flight request list would
> also make it easier to support other functions such as task aborts,
> request timeouts, and flushing before snapshot creation.

Hi Maged,

We could extend it in the future, but right now it is there just for
exclusive lock. Likely, we would need a different list for tracking all
in-flight requests, because if we start adding random requests (e.g.
those that don't require exclusive lock) to running_list, deadlocks
will occur.

> 3) For the acquiring_list, my concern is that while the lock is
> pending to be acquired, requests are accepted without limit. If there
> is a delay acquiring the lock, for example if the primary OSD of the
> header object is down (which could block for ~25 sec), or worse if the
> pool is inactive, the count could well exceed the max queue depth, and
> for write requests this can consume a lot of memory.

As you noted in a different email, it is limited by queue_depth. No
different from regular I/O taking too long.

> 4) In rbd_img_exclusive_lock(), at the end, we queue an acquire-lock
> task for every request. I understand this is a single-threaded queue
> and that if the lock is acquired all acquire tasks are cancelled;
> however, I feel the queue could fill up a lot. Any chance we can
> schedule only one acquire task?

This is not new, it's done in existing code too. queue_delayed_work()
bails if the work is already queued, so ->lock_dwork is never queued
more than once.
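For illustration, the pattern amounts to the sketch below. It is not
the literal rbd.c call site, and rbd_schedule_acquire_lock is a
hypothetical name; ->task_wq and ->lock_dwork are meant to match the
driver's field names.

#include <linux/workqueue.h>

/*
 * Sketch only.  queue_delayed_work() atomically tests and sets the
 * work's PENDING bit and returns false if the work was already
 * queued, so even if every incoming request calls this, at most one
 * ->lock_dwork instance is ever on the queue.
 */
static void rbd_schedule_acquire_lock(struct rbd_device *rbd_dev)
{
	if (!queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0))
		dout("%s: lock_dwork already queued\n", __func__);
}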
> 5) The state RBD_IMG_EXCLUSIVE_LOCK is used both when the image does
> not require the exclusive lock and when the lock is acquired.
> Cosmetically, it may be better to separate them.

Not all requests require exclusive lock even if the exclusive-lock
feature is enabled (e.g. reads when the object-map feature is
disabled). We switch from RBD_IMG_EXCLUSIVE_LOCK early (i.e. without
waiting) in those cases, and the case of the exclusive-lock feature
being disabled just falls into that basket.

> 6) Probably not an issue, but we are now creating a large number of
> mutexes, at least two for every request. Maybe we need to test at high
> IOPS/queue depths that there is no overhead from this.

Yes, that would be useful. If you do that, please share your results!

We can get that down to one by reusing image requests with the help of
blk-mq. We would still need to allocate them for accessing the parent
image, but for regular image requests it should be pretty easy to do.
Reusing object requests is going to be harder, but we could probably
preallocate one per image request, as that is by far the most common
case.

> 7) Is there any consideration of splitting the rbd module into
> multiple files? Looking at how big librbd is, fitting this into a
> single kernel file is challenging at best.

Dongsheng expressed this concern as well. Personally, I don't mind
a large file as long as it is nicely structured. Adding object-map and
fast-diff added just two forward declarations, one of which could be
avoided.

We are never going to fit the entire librbd into the kernel, no matter
how many files ;) Breaking it up into multiple files complicates
blaming, so I would rather keep it as is at this point.

Thanks,

                Ilya
On Mon, Jul 1, 2019 at 7:34 AM Dongsheng Yang <dongsheng.yang@easystack.cn> wrote:
>
> Hi Ilya,
>
> On 06/25/2019 10:40 PM, Ilya Dryomov wrote:
> > [...]
>
> Nice patchset. I did a review and some testing of this patchset.

Hi Dongsheng,

Thank you!

> TEST:
> With object-map enabled, I found one failing case: tasks/rbd_kernel.yaml.
> It failed to rollback test_img while test_img is mapped.

This is because with object map, reads grab the lock:

67 rbd resize testimg1 --size=40 --allow-shrink
68 cat /sys/bus/rbd/devices/$DEV_ID1/size | grep 41943040
69 cat /sys/bus/rbd/devices/$DEV_ID2/size | grep 76800000
70
71 sudo dd if=/dev/rbd/rbd/testimg1 of=/tmp/img1.small
72 cp /tmp/img1 /tmp/img1.trunc
73 truncate -s 41943040 /tmp/img1.trunc
74 cmp /tmp/img1.trunc /tmp/img1.small
75
76 # rollback and check data again
77 rbd snap rollback --snap=snap1 testimg1

Without object map, the lock is released by the resize on line 67 and
the rollback executes on an unlocked image. With object map, we re-grab
the lock for the dd on line 71 and the rollback executes on a locked
image. rollback isn't supposed to work on a locked image, so it's just
an accident that this test continued to work after exclusive lock was
introduced in 4.9.

Another test that will need to be fixed is krbd_exclusive_option.sh.
This one fails both with and without the object-map feature, because we
no longer block if the peer refuses to release the lock. I'm on
vacation and didn't get a chance to update the test suite yet.

Ilya