[00/20] rbd: support for object-map and fast-diff

Message ID: 20190625144111.11270-1-idryomov@gmail.com

Message

Ilya Dryomov June 25, 2019, 2:40 p.m. UTC
Hello,

This series adds support for object-map and fast-diff image features.
Patches 1 - 11 prepare object and image request state machines; patches
12 - 14 fix most of the shortcomings in our exclusive lock code, making
it suitable for guarding the object map; patches 15 - 18 take care of
the prerequisites and finally patches 19 - 20 implement object-map and
fast-diff.
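
For illustration, the pattern is that each request carries an explicit
state and a single advance function is re-entered on every completion.
A simplified sketch (not the actual code from the patches; the helper
names are made up):

  enum obj_read_state {
          OBJ_START,
          OBJ_READ_OBJECT,
          OBJ_READ_PARENT,
  };

  /* returns true when the object request is done */
  static bool obj_advance_read(struct obj_req *req, int *result)
  {
          switch (req->state) {
          case OBJ_START:
                  submit_object_read(req);          /* made-up helper */
                  req->state = OBJ_READ_OBJECT;
                  return false;                     /* wait for OSD reply */
          case OBJ_READ_OBJECT:
                  if (*result == -ENOENT && has_parent(req)) {
                          /* object doesn't exist, read from the parent */
                          submit_parent_read(req);  /* made-up helper */
                          req->state = OBJ_READ_PARENT;
                          return false;
                  }
                  return true;
          case OBJ_READ_PARENT:
                  return true;
          default:
                  BUG();
          }
  }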

Thanks,

                Ilya


Ilya Dryomov (20):
  rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred
  rbd: replace obj_req->tried_parent with obj_req->read_state
  rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}
  rbd: move OSD request submission into object request state machines
  rbd: introduce image request state machine
  rbd: introduce obj_req->osd_reqs list
  rbd: factor out rbd_osd_setup_copyup()
  rbd: factor out __rbd_osd_setup_discard_ops()
  rbd: move OSD request allocation into object request state machines
  rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
  rbd: introduce copyup state machine
  rbd: lock should be quiesced on reacquire
  rbd: quiescing lock should wait for image requests
  rbd: new exclusive lock wait/wake code
  libceph: bump CEPH_MSG_MAX_DATA_LEN (again)
  libceph: change ceph_osdc_call() to take page vector for response
  libceph: export osd_req_op_data() macro
  rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
  rbd: support for object-map and fast-diff
  rbd: setallochint only if object doesn't exist

 drivers/block/rbd.c                  | 2433 ++++++++++++++++++--------
 drivers/block/rbd_types.h            |   10 +
 include/linux/ceph/cls_lock_client.h |    3 +
 include/linux/ceph/libceph.h         |    6 +-
 include/linux/ceph/osd_client.h      |   10 +-
 include/linux/ceph/striper.h         |    2 +
 net/ceph/cls_lock_client.c           |   47 +-
 net/ceph/osd_client.c                |   18 +-
 net/ceph/striper.c                   |   17 +
 9 files changed, 1817 insertions(+), 729 deletions(-)

Comments

Maged Mokhtar June 30, 2019, 1:55 p.m. UTC | #1
Hi Ilya,

Nice work. Some comments/questions:

1) Generally having a state machine makes things easier to track as the 
code is less dispersed than before.

2) The running_list is used to keep track of in-flight requests in the
exclusive lock case, to support rbd_quiesce_lock() when releasing the
lock. It would be great to generalize this list to track all in-flight
requests even when the lock is not required; it could then be used to
support a generic flush (similar to the librbd work queue flush/drain).
Having a generic in-flight request list will also make it easier to
support other functions such as task aborts, request timeouts, and
flushing before snapshots (a sketch of the idea follows after these
points).

3) For the acquiring_list, my concern is that while the lock is pending
to be acquired, requests are accepted without limit. If there is a delay
acquiring the lock, for example if the primary OSD of the header object
is down (which could block for ~25 sec) or, worse, if the pool is
inactive, the count could well exceed the max queue depth, and for write
requests this can consume a lot of memory.

4) In rbd_img_exclusive_lock(), at the end, we queue an acquire-lock
task for every request. I understand this is a single-threaded queue and
that all acquire tasks are cancelled once the lock is acquired; however,
I feel the queue could fill up a lot. Any chance we can schedule only
one acquire task?

5) The state RBD_IMG_EXCLUSIVE_LOCK is used both when the image does not
require the exclusive lock and when the lock is acquired. Cosmetically,
it may be better to separate the two.

6) Probably not an issue, but we are now creating a large number of
mutexes, at least two for every request. Maybe we need to test at high
IOPS/queue depths that there is no overhead from this.

7) Is there any consideration to split the rbd module into multiple
files? Looking at how big librbd is, fitting this into a single kernel
file is challenging at best.
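
To sketch the idea in 2) (made-up names, not actual rbd code):

  /* every image request is added on start and removed on completion */
  static void img_request_start(struct rbd_device *rbd_dev,
                                struct rbd_img_request *img_req)
  {
          spin_lock(&rbd_dev->inflight_lock);
          list_add_tail(&img_req->inflight_item, &rbd_dev->inflight_list);
          spin_unlock(&rbd_dev->inflight_lock);
  }

  static void img_request_done(struct rbd_device *rbd_dev,
                               struct rbd_img_request *img_req)
  {
          spin_lock(&rbd_dev->inflight_lock);
          list_del(&img_req->inflight_item);
          spin_unlock(&rbd_dev->inflight_lock);
          wake_up(&rbd_dev->inflight_wq);
  }

  /* generic flush: wait for all in-flight requests to drain */
  static void rbd_flush_inflight(struct rbd_device *rbd_dev)
  {
          wait_event(rbd_dev->inflight_wq,
                     list_empty_careful(&rbd_dev->inflight_list));
  }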

/Maged

On 25/06/2019 16:40, Ilya Dryomov wrote:
> [...]
Maged Mokhtar June 30, 2019, 2:20 p.m. UTC | #2
Re-reading 3), I realize the queue depth will not be exceeded in the
request queue.


On 30/06/2019 15:55, Maged Mokhtar wrote:
> [...]
Dongsheng Yang July 1, 2019, 5:34 a.m. UTC | #3
Hi Ilya,

On 06/25/2019 10:40 PM, Ilya Dryomov wrote:
> Hello,
>
> This series adds support for object-map and fast-diff image features.
> Patches 1 - 11 prepare object and image request state machines; patches
> 12 - 14 fix most of the shortcomings in our exclusive lock code, making
> it suitable for guarding the object map; patches 15 - 18 take care of
> the prerequisites and finally patches 19 - 20 implement object-map and
> fast-diff.

Nice patchset. I reviewed and tested it.

TEST:
with object-map enabled, I found one failing case: tasks/rbd_kernel.yaml.
It fails to roll back test_img while test_img is mapped.

Thanx
> [...]
Dongsheng Yang July 1, 2019, 5:45 a.m. UTC | #4
Hi Maged,

On 06/30/2019 09:55 PM, Maged Mokhtar wrote:
> [...]
>
> 7) Is there any consideration to split the rbd module into multiple
> files? Looking at how big librbd is, fitting this into a single kernel
> file is challenging at best.

Ilya could provide more information, but I can share something about
this. Ilya and I had a discussion about splitting rbd.c, as it really is
getting big, but we wanted the WIP (this patchset) to be merged first.
A rough idea is to pull exclusive-lock and object-map out of rbd.c.

Thanx
> [...]
Ilya Dryomov July 3, 2019, 8:12 a.m. UTC | #5
On Sun, Jun 30, 2019 at 3:55 PM Maged Mokhtar <mmokhtar@petasan.org> wrote:
>
>
>
> Hi Ilya,
>
> Nice work. Some comments/questions:
>
> 1) Generally having a state machine makes things easier to track as the
> code is less dispersed than before.
>
> 2) The running_list is used to keep track of in-flight requests in the
> exclusive lock case, to support rbd_quiesce_lock() when releasing the
> lock. It would be great to generalize this list to track all in-flight
> requests even when the lock is not required; it could then be used to
> support a generic flush (similar to the librbd work queue flush/drain).
> Having a generic in-flight request list will also make it easier to
> support other functions such as task aborts, request timeouts, and
> flushing before snapshots.

Hi Maged,

We could extend it in the future, but right now it is there just for
exclusive lock.  Likely, we would need a different list for tracking
all in-flight requests because if we start adding random requests
(e.g. those that don't require exclusive lock), deadlocks will occur.

>
> 3) For the acquiring_list, my concern is that while the lock is pending
> to be acquired, requests are accepted without limit. If there is a delay
> acquiring the lock, for example if the primary OSD of the header object
> is down (which could block for ~25 sec) or, worse, if the pool is
> inactive, the count could well exceed the max queue depth, and for write
> requests this can consume a lot of memory.

As you noted in a different email, it is limited by queue_depth.  No
different from regular I/O taking too long.

>
> 4) In rbd_img_exclusive_lock(), at the end, we queue an acquire-lock
> task for every request. I understand this is a single-threaded queue and
> that all acquire tasks are cancelled once the lock is acquired; however,
> I feel the queue could fill up a lot. Any chance we can schedule only
> one acquire task?

This is not new, it's done in existing code too.  queue_delayed_work()
bails if the work is already queued, so ->lock_dwork is never queued
more than once.
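
To illustrate (maybe_kick_acquire() is a made-up wrapper; task_wq and
lock_dwork are the actual fields):

  static void maybe_kick_acquire(struct rbd_device *rbd_dev)
  {
          /*
           * queue_delayed_work() tests the pending bit first: it
           * returns false and does nothing if lock_dwork is already
           * queued, so calling this once per request still leaves at
           * most one instance of lock_dwork on the queue.
           */
          queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
  }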

>
> 5) The state RBD_IMG_EXCLUSIVE_LOCK is used both when the image does not
> require the exclusive lock and when the lock is acquired. Cosmetically,
> it may be better to separate the two.

Not all requests require the exclusive lock even if the exclusive-lock
feature is enabled (e.g. reads when the object-map feature is disabled).
We switch from RBD_IMG_EXCLUSIVE_LOCK early (i.e. no waiting) in those
cases, and the case of the exclusive-lock feature being disabled just
falls into that basket.
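
Roughly (a condensed sketch, not the exact code from the patch):

  static int rbd_img_exclusive_lock(struct rbd_img_request *img_req)
  {
          struct rbd_device *rbd_dev = img_req->rbd_dev;

          if (!need_exclusive_lock(img_req))
                  return 1;       /* proceed immediately, no waiting */

          /* lock needed: join acquiring_list and kick the acquire task */
          rbd_lock_add_request(img_req);
          queue_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
          return 0;               /* wait until the lock is acquired */
  }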

>
> 6) Probably not an issue, but we are now creating a large number of
> mutexes, at least two for every request. Maybe we need to test at high
> IOPS/queue depths that there is no overhead from this.

Yes, that would be useful.  If you do that, please share your results!

We can get that down to one by reusing image requests with the help of
blk-mq.  We would still need to allocate them for accessing the parent
image, but for regular image requests it should be pretty easy to do.

Reusing object requests is going to be harder, but we could probably
preallocate one per image request, as that is by far the most common
case.
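
A sketch of the blk-mq side (not part of this series; simplified, and
rbd_img_request_init() is a made-up helper): with tag_set.cmd_size
sized for the image request, blk-mq preallocates it together with each
request, and ->queue_rq() just initializes it in place:

  rbd_dev->tag_set.cmd_size = sizeof(struct rbd_img_request);

  static blk_status_t rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
  {
          struct rbd_device *rbd_dev = hctx->queue->queuedata;
          /* the image request lives in the pdu; no per-I/O allocation */
          struct rbd_img_request *img_req = blk_mq_rq_to_pdu(bd->rq);

          rbd_img_request_init(img_req, rbd_dev);   /* made-up helper */
          blk_mq_start_request(bd->rq);
          rbd_img_handle_request(img_req, 0);       /* kick state machine */
          return BLK_STS_OK;
  }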

>
> 7) Is there any consideration to split the rbd module into multiple
> files? Looking at how big librbd is, fitting this into a single kernel
> file is challenging at best.

Dongsheng expressed this concern as well.  Personally, I don't mind
a large file as long as it is nicely structured.  Adding object-map and
fast-diff added just two forward declarations, one of which could be
avoided.

We are never going to fit the entire librbd into the kernel, no matter
how many files ;)  Breaking it up into multiple files complicates
blaming, so I would rather keep it as is at this point.

Thanks,

                Ilya
Ilya Dryomov July 3, 2019, 11:48 a.m. UTC | #6
On Mon, Jul 1, 2019 at 7:34 AM Dongsheng Yang
<dongsheng.yang@easystack.cn> wrote:
>
> Hi Ilya,
>
> On 06/25/2019 10:40 PM, Ilya Dryomov wrote:
> > [...]
>
> Nice patchset. I reviewed and tested it.

Hi Dongsheng,

Thank you!

>
> TEST:
> with object-map enabled, I found one failing case: tasks/rbd_kernel.yaml.
> It fails to roll back test_img while test_img is mapped.

This is because, with object map, reads grab the lock:

  67 rbd resize testimg1 --size=40 --allow-shrink
  68 cat /sys/bus/rbd/devices/$DEV_ID1/size | grep 41943040
  69 cat /sys/bus/rbd/devices/$DEV_ID2/size | grep 76800000
  70
  71 sudo dd if=/dev/rbd/rbd/testimg1 of=/tmp/img1.small
  72 cp /tmp/img1 /tmp/img1.trunc
  73 truncate -s 41943040 /tmp/img1.trunc
  74 cmp /tmp/img1.trunc /tmp/img1.small
  75
  76 # rollback and check data again
  77 rbd snap rollback --snap=snap1 testimg1

Without object map, the lock is freed by the resize on line 67 and the
rollback executes on an unlocked image.  With object map, we re-grab the
lock for the dd on line 71 and the rollback executes on a locked image.

rollback isn't supposed to work on a locked image, so it's just an
accident that this test continued to work after exclusive lock was
introduced in 4.9.

Another test that will need to be fixed is krbd_exclusive_option.sh.
That one fails both with and without the object-map feature because we
no longer block if the peer refuses to release the lock.

I'm on vacation and didn't get a chance to update the test suite yet.

                Ilya