diff mbox

rbd: implement REQ_OP_WRITE_ZEROES

Message ID 1495552107-1237-1-git-send-email-idryomov@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ilya Dryomov May 23, 2017, 3:08 p.m. UTC
Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.

rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
release either some or all blocks depending on whether the zeroing
request is rbd_obj_bytes() aligned.  This is how we currently implement
discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
now.  Caveats:

- REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
  current implementations - nvme and loop

- there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
  is hence less helpful than blk_bio_discard_split(), but this can (and
  should) be fixed on the rbd side

In the future we will split these into two code paths to respect
REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
released on discard.

Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 drivers/block/rbd.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Jason Dillaman May 23, 2017, 6:28 p.m. UTC | #1
lgtm

Reviewed-by: Jason Dillaman <dillaman@redhat.com>

On Tue, May 23, 2017 at 11:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
>
> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
> release either some or all blocks depending on whether the zeroing
> request is rbd_obj_bytes() aligned.  This is how we currently implement
> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
> now.  Caveats:
>
> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
>   current implementations - nvme and loop
>
> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
>   is hence less helpful than blk_bio_discard_split(), but this can (and
>   should) be fixed on the rbd side
>
> In the future we will split these into two code paths to respect
> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
> released on discard.
>
> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
>  drivers/block/rbd.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 454bf9c34882..c16f74547804 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>
>         switch (req_op(rq)) {
>         case REQ_OP_DISCARD:
> +       case REQ_OP_WRITE_ZEROES:
>                 op_type = OBJ_OP_DISCARD;
>                 break;
>         case REQ_OP_WRITE:
> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>         q->limits.discard_granularity = segment_size;
>         q->limits.discard_alignment = segment_size;
>         blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
> +       blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE);
>
>         if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
>                 q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexandre DERUMIER May 24, 2017, 11:38 a.m. UTC | #2
Hi,

is it planned to implement write zeroes in qemu rbd block driver soon ?
(bdrv_co_write_zeroes)

It's really missing currently, as qemu drive-mirror need it to have sparse images on copy.

Ref from my discussion with Paolo from redhat in 2014 about this:
  https://lists.gnu.org/archive/html/qemu-devel/2014-10/msg01274.html


REgards,

Alexandre

----- Mail original -----
De: "Jason Dillaman" <jdillama@redhat.com>
À: "Ilya Dryomov" <idryomov@gmail.com>
Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com>
Envoyé: Mardi 23 Mai 2017 20:28:00
Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES

lgtm 

Reviewed-by: Jason Dillaman <dillaman@redhat.com> 

On Tue, May 23, 2017 at 11:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote: 
> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag") 
> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the 
> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data 
> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES. 
> 
> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will 
> release either some or all blocks depending on whether the zeroing 
> request is rbd_obj_bytes() aligned. This is how we currently implement 
> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for 
> now. Caveats: 
> 
> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other 
> current implementations - nvme and loop 
> 
> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split() 
> is hence less helpful than blk_bio_discard_split(), but this can (and 
> should) be fixed on the rbd side 
> 
> In the future we will split these into two code paths to respect 
> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be 
> released on discard. 
> 
> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag") 
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> 
> --- 
> drivers/block/rbd.c | 2 ++ 
> 1 file changed, 2 insertions(+) 
> 
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c 
> index 454bf9c34882..c16f74547804 100644 
> --- a/drivers/block/rbd.c 
> +++ b/drivers/block/rbd.c 
> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work) 
> 
> switch (req_op(rq)) { 
> case REQ_OP_DISCARD: 
> + case REQ_OP_WRITE_ZEROES: 
> op_type = OBJ_OP_DISCARD; 
> break; 
> case REQ_OP_WRITE: 
> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) 
> q->limits.discard_granularity = segment_size; 
> q->limits.discard_alignment = segment_size; 
> blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE); 
> + blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE); 
> 
> if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC)) 
> q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; 
> -- 
> 2.4.3 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Jason Dillaman May 24, 2017, 11:53 a.m. UTC | #3
I just opened a tracker ticket for this [1] -- let me know if you have
any other QEMU improvement ideas.

[1] http://tracker.ceph.com/issues/20070

On Wed, May 24, 2017 at 7:38 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
> Hi,
>
> is it planned to implement write zeroes in qemu rbd block driver soon ?
> (bdrv_co_write_zeroes)
>
> It's really missing currently, as qemu drive-mirror need it to have sparse images on copy.
>
> Ref from my discussion with Paolo from redhat in 2014 about this:
>   https://lists.gnu.org/archive/html/qemu-devel/2014-10/msg01274.html
>
>
> REgards,
>
> Alexandre
>
> ----- Mail original -----
> De: "Jason Dillaman" <jdillama@redhat.com>
> À: "Ilya Dryomov" <idryomov@gmail.com>
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com>
> Envoyé: Mardi 23 Mai 2017 20:28:00
> Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES
>
> lgtm
>
> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
>
> On Tue, May 23, 2017 at 11:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
>> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
>> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
>> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
>>
>> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
>> release either some or all blocks depending on whether the zeroing
>> request is rbd_obj_bytes() aligned. This is how we currently implement
>> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
>> now. Caveats:
>>
>> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
>> current implementations - nvme and loop
>>
>> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
>> is hence less helpful than blk_bio_discard_split(), but this can (and
>> should) be fixed on the rbd side
>>
>> In the future we will split these into two code paths to respect
>> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
>> released on discard.
>>
>> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>> ---
>> drivers/block/rbd.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 454bf9c34882..c16f74547804 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>>
>> switch (req_op(rq)) {
>> case REQ_OP_DISCARD:
>> + case REQ_OP_WRITE_ZEROES:
>> op_type = OBJ_OP_DISCARD;
>> break;
>> case REQ_OP_WRITE:
>> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>> q->limits.discard_granularity = segment_size;
>> q->limits.discard_alignment = segment_size;
>> blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
>> + blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE);
>>
>> if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
>> q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
>> --
>> 2.4.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexandre DERUMIER May 24, 2017, 5:38 p.m. UTC | #4
>>I just opened a tracker ticket for this [1] 
>>
>>[1] http://tracker.ceph.com/issues/20070

Thanks Jason!

>>-- let me know if you have
>>any other QEMU improvement ideas.

For the moment, the bigger limitation is cpu usage of librbd, 
as qemu can only 1 thread, I can't reach more than around 70000 iops by disk.
(3,1ghz cpu, disabling debug, rbd_cache, using jemalloc).

So any improvement to reduce cpu usage could be great :)


Also, in the future, I think qemu will support multiple iothreads by disk, 
I don't known if librbd is already ready for this ?




----- Mail original -----
De: "Jason Dillaman" <jdillama@redhat.com>
À: "aderumier" <aderumier@odiso.com>
Cc: "Ilya Dryomov" <idryomov@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com>
Envoyé: Mercredi 24 Mai 2017 13:53:40
Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES

I just opened a tracker ticket for this [1] -- let me know if you have 
any other QEMU improvement ideas. 

[1] http://tracker.ceph.com/issues/20070 

On Wed, May 24, 2017 at 7:38 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote: 
> Hi, 
> 
> is it planned to implement write zeroes in qemu rbd block driver soon ? 
> (bdrv_co_write_zeroes) 
> 
> It's really missing currently, as qemu drive-mirror need it to have sparse images on copy. 
> 
> Ref from my discussion with Paolo from redhat in 2014 about this: 
> https://lists.gnu.org/archive/html/qemu-devel/2014-10/msg01274.html 
> 
> 
> REgards, 
> 
> Alexandre 
> 
> ----- Mail original ----- 
> De: "Jason Dillaman" <jdillama@redhat.com> 
> À: "Ilya Dryomov" <idryomov@gmail.com> 
> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com> 
> Envoyé: Mardi 23 Mai 2017 20:28:00 
> Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES 
> 
> lgtm 
> 
> Reviewed-by: Jason Dillaman <dillaman@redhat.com> 
> 
> On Tue, May 23, 2017 at 11:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote: 
>> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag") 
>> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the 
>> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data 
>> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES. 
>> 
>> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will 
>> release either some or all blocks depending on whether the zeroing 
>> request is rbd_obj_bytes() aligned. This is how we currently implement 
>> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for 
>> now. Caveats: 
>> 
>> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other 
>> current implementations - nvme and loop 
>> 
>> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split() 
>> is hence less helpful than blk_bio_discard_split(), but this can (and 
>> should) be fixed on the rbd side 
>> 
>> In the future we will split these into two code paths to respect 
>> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be 
>> released on discard. 
>> 
>> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag") 
>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> 
>> --- 
>> drivers/block/rbd.c | 2 ++ 
>> 1 file changed, 2 insertions(+) 
>> 
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c 
>> index 454bf9c34882..c16f74547804 100644 
>> --- a/drivers/block/rbd.c 
>> +++ b/drivers/block/rbd.c 
>> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work) 
>> 
>> switch (req_op(rq)) { 
>> case REQ_OP_DISCARD: 
>> + case REQ_OP_WRITE_ZEROES: 
>> op_type = OBJ_OP_DISCARD; 
>> break; 
>> case REQ_OP_WRITE: 
>> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev) 
>> q->limits.discard_granularity = segment_size; 
>> q->limits.discard_alignment = segment_size; 
>> blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE); 
>> + blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE); 
>> 
>> if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC)) 
>> q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES; 
>> -- 
>> 2.4.3 
>> 
>> -- 
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
>> the body of a message to majordomo@vger.kernel.org 
>> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> 
> 
> -- 
> Jason 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html 
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in 
> the body of a message to majordomo@vger.kernel.org 
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Jason Dillaman May 24, 2017, 7:34 p.m. UTC | #5
On Wed, May 24, 2017 at 1:38 PM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>I just opened a tracker ticket for this [1]
>>>
>>>[1] http://tracker.ceph.com/issues/20070
>
> Thanks Jason!
>
>>>-- let me know if you have
>>>any other QEMU improvement ideas.
>
> For the moment, the bigger limitation is cpu usage of librbd,
> as qemu can only 1 thread, I can't reach more than around 70000 iops by disk.
> (3,1ghz cpu, disabling debug, rbd_cache, using jemalloc).
>
> So any improvement to reduce cpu usage could be great :)

We've had some interest expressed in volunteering time and code
contributions to reduce the CPU usage of librbd / librados. I am
hoping some of these improvements will be included in the Mimic
release.

> Also, in the future, I think qemu will support multiple iothreads by disk,
> I don't known if librbd is already ready for this ?

This is hopefully another improvement we can make in the Mimic release
timeframe. librbd already has an architecture to support multiple
internal IO threads, but we have it disabled for now due to lack of
thorough testing -- and for the fact that librados needs to utilize
multiple IO threads as well.

>
>
> ----- Mail original -----
> De: "Jason Dillaman" <jdillama@redhat.com>
> À: "aderumier" <aderumier@odiso.com>
> Cc: "Ilya Dryomov" <idryomov@gmail.com>, "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com>
> Envoyé: Mercredi 24 Mai 2017 13:53:40
> Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES
>
> I just opened a tracker ticket for this [1] -- let me know if you have
> any other QEMU improvement ideas.
>
> [1] http://tracker.ceph.com/issues/20070
>
> On Wed, May 24, 2017 at 7:38 AM, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>> Hi,
>>
>> is it planned to implement write zeroes in qemu rbd block driver soon ?
>> (bdrv_co_write_zeroes)
>>
>> It's really missing currently, as qemu drive-mirror need it to have sparse images on copy.
>>
>> Ref from my discussion with Paolo from redhat in 2014 about this:
>> https://lists.gnu.org/archive/html/qemu-devel/2014-10/msg01274.html
>>
>>
>> REgards,
>>
>> Alexandre
>>
>> ----- Mail original -----
>> De: "Jason Dillaman" <jdillama@redhat.com>
>> À: "Ilya Dryomov" <idryomov@gmail.com>
>> Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Christoph Hellwig" <hch@lst.de>, "Hannes Reinecke" <hare@suse.com>
>> Envoyé: Mardi 23 Mai 2017 20:28:00
>> Objet: Re: [PATCH] rbd: implement REQ_OP_WRITE_ZEROES
>>
>> lgtm
>>
>> Reviewed-by: Jason Dillaman <dillaman@redhat.com>
>>
>> On Tue, May 23, 2017 at 11:08 AM, Ilya Dryomov <idryomov@gmail.com> wrote:
>>> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
>>> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
>>> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
>>> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
>>>
>>> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
>>> release either some or all blocks depending on whether the zeroing
>>> request is rbd_obj_bytes() aligned. This is how we currently implement
>>> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
>>> now. Caveats:
>>>
>>> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
>>> current implementations - nvme and loop
>>>
>>> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
>>> is hence less helpful than blk_bio_discard_split(), but this can (and
>>> should) be fixed on the rbd side
>>>
>>> In the future we will split these into two code paths to respect
>>> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
>>> released on discard.
>>>
>>> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>> ---
>>> drivers/block/rbd.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>>> index 454bf9c34882..c16f74547804 100644
>>> --- a/drivers/block/rbd.c
>>> +++ b/drivers/block/rbd.c
>>> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>>>
>>> switch (req_op(rq)) {
>>> case REQ_OP_DISCARD:
>>> + case REQ_OP_WRITE_ZEROES:
>>> op_type = OBJ_OP_DISCARD;
>>> break;
>>> case REQ_OP_WRITE:
>>> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>>> q->limits.discard_granularity = segment_size;
>>> q->limits.discard_alignment = segment_size;
>>> blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
>>> + blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE);
>>>
>>> if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
>>> q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;
>>> --
>>> 2.4.3
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> Jason
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Jason
>
Ilya Dryomov May 29, 2017, 8:39 a.m. UTC | #6
On Tue, May 23, 2017 at 5:08 PM, Ilya Dryomov <idryomov@gmail.com> wrote:
> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
>
> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
> release either some or all blocks depending on whether the zeroing
> request is rbd_obj_bytes() aligned.  This is how we currently implement
> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
> now.  Caveats:
>
> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
>   current implementations - nvme and loop
>
> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
>   is hence less helpful than blk_bio_discard_split(), but this can (and
>   should) be fixed on the rbd side
>
> In the future we will split these into two code paths to respect
> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
> released on discard.
>
> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
>  drivers/block/rbd.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 454bf9c34882..c16f74547804 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4023,6 +4023,7 @@ static void rbd_queue_workfn(struct work_struct *work)
>
>         switch (req_op(rq)) {
>         case REQ_OP_DISCARD:
> +       case REQ_OP_WRITE_ZEROES:
>                 op_type = OBJ_OP_DISCARD;
>                 break;
>         case REQ_OP_WRITE:
> @@ -4420,6 +4421,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
>         q->limits.discard_granularity = segment_size;
>         q->limits.discard_alignment = segment_size;
>         blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
> +       blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE);
>
>         if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
>                 q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;

Hi Christoph,

I'm planning to merge this into 4.12-rc because it fixes a 93c1defedcae
regression, but I'm not quite sure what you meant by "rbd only supports
discarding on large alignments, so the zeroing code would always fall
back to explicit writings of zeroes".  Care to take a look and ack?

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig May 29, 2017, 8:43 a.m. UTC | #7
On Tue, May 23, 2017 at 05:08:27PM +0200, Ilya Dryomov wrote:
> Commit 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> explicitly didn't implement REQ_OP_WRITE_ZEROES for rbd, while the
> following commit 48920ff2a5a9 ("block: remove the discard_zeroes_data
> flag") dropped ->discard_zeroes_data in favor of REQ_OP_WRITE_ZEROES.
> 
> rbd does support efficient zeroing via CEPH_OSD_OP_ZERO opcode and will
> release either some or all blocks depending on whether the zeroing
> request is rbd_obj_bytes() aligned.  This is how we currently implement
> discards, so REQ_OP_WRITE_ZEROES can be identical to REQ_OP_DISCARD for
> now.  Caveats:
> 
> - REQ_NOUNMAP is ignored, but AFAICT that's true of at least two other
>   current implementations - nvme and loop

That's no problem, it's just a hint in case the device differenciates
the cases.

> - there is no ->write_zeroes_alignment and blk_bio_write_zeroes_split()
>   is hence less helpful than blk_bio_discard_split(), but this can (and
>   should) be fixed on the rbd side

Yes.

> In the future we will split these into two code paths to respect
> REQ_NOUNMAP on zeroout and save on zeroing blocks that couldn't be
> released on discard.
> 
> Fixes: 93c1defedcae ("rbd: remove the discard_zeroes_data flag")
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 454bf9c34882..c16f74547804 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4023,6 +4023,7 @@  static void rbd_queue_workfn(struct work_struct *work)
 
 	switch (req_op(rq)) {
 	case REQ_OP_DISCARD:
+	case REQ_OP_WRITE_ZEROES:
 		op_type = OBJ_OP_DISCARD;
 		break;
 	case REQ_OP_WRITE:
@@ -4420,6 +4421,7 @@  static int rbd_init_disk(struct rbd_device *rbd_dev)
 	q->limits.discard_granularity = segment_size;
 	q->limits.discard_alignment = segment_size;
 	blk_queue_max_discard_sectors(q, segment_size / SECTOR_SIZE);
+	blk_queue_max_write_zeroes_sectors(q, segment_size / SECTOR_SIZE);
 
 	if (!ceph_test_opt(rbd_dev->rbd_client->client, NOCRC))
 		q->backing_dev_info->capabilities |= BDI_CAP_STABLE_WRITES;