mbox series

[RFC,0/2] rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op

Message ID 20190118145607.30018-1-rpenyaev@suse.de (mailing list archive)
Headers show
Series rbd: respect REQ_NOUNMAP by setting new nounmap flag for ZERO op | expand

Message

Roman Penyaev Jan. 18, 2019, 2:56 p.m. UTC
Hi all,

This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
when REQ_NOUNMAP flag is set for a block layer request.

Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.

With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
osd can zero out blocks, instead of punching holes.

Possible handling of a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP on osd:

 diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
 index dff549535d39..d812d88d7bc0 100644
 --- a/src/osd/PrimaryLogPG.cc
 +++ b/src/osd/PrimaryLogPG.cc
 @@ -5788,6 +5788,7 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
 
      // munge ZERO -> TRUNCATE?  (don't munge to DELETE or we risk hosing attributes)
      if (op.op == CEPH_OSD_OP_ZERO &&
 +       !(op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) &&
          obs.exists &&
          op.extent.offset < static_cast<Option::size_t>(osd->osd_max_object_size) &&
          op.extent.length >= 1 &&
 @@ -6583,3 +6584,28 @@ int PrimaryLogPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
         result = -EOPNOTSUPP;
         break;
        }
 +      if (op.flags & CEPH_OSD_OP_FLAG_ZERO_NOUNMAP) {
 +        // ZERO -> WRITE_SAME
 +        // Why?  Internally storage backend punches holes on zeroing, but we
 +        // need zeroed blocks instead.
 +
 +        if (osd_op.indata.length()) {
 +          // Zero op with data? No way.
 +          result = -EINVAL;
 +          goto fail;
 +        }
 +
 +        // Extent and writesame layouts are almost similar, so reset union
 +        // members which are different
 +        op.extent.truncate_size = 0;
 +        op.extent.truncate_seq  = 0;
 +
 +        // Fill in zero data, will be duplicated inside do_writesame()
 +        const char buf[2] = {0};
 +        osd_op.indata.append(buf, sizeof(buf));
 +        op.writesame.data_length = sizeof(buf);
 +
 +        result = do_writesame(ctx, osd_op);
 +        break;
 +      }
 +

The other possible solution is to send CEPH_OSD_OP_WRITESAME directly from
krbd instead of CEPH_OSD_OP_ZERO + NOUNMAP flag, but IMO that has a
drawback: OP_WRITESAME was implemented on osd couple of years ago and seems
that is not very nice to break the compatibility if someone has updated
kernel, but still using old ceph cluster. Also ZERO + NOUNMAP has a cleaner
semantics.

These are just thoughts, nothing is tested, thus RFC.

Roman Penyaev (2):
  libceph, rbd: pass op flags to osd_req_op_extent_init()
  libceph, rbd: respect REQ_NOUNMAP by setting new nounmap flag for
    CEPH_OSD_OP_ZERO

 drivers/block/rbd.c             | 51 ++++++++++++++++++++++++---------
 include/linux/ceph/osd_client.h |  2 +-
 include/linux/ceph/rados.h      |  1 +
 net/ceph/osd_client.c           |  6 ++--
 4 files changed, 42 insertions(+), 18 deletions(-)

Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Sage Weil <sage@redhat.com>
Cc: Alex Elder <elder@kernel.org>
Cc: "Yan, Zheng" <zyan@redhat.com>
Cc: ceph-devel@vger.kernel.org

Comments

Ilya Dryomov Jan. 18, 2019, 4:29 p.m. UTC | #1
On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> Hi all,
>
> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
> when REQ_NOUNMAP flag is set for a block layer request.

Hi Roman,

I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
The idea is to punt on small and/or unaligned discard requests which
don't actually free up any space but translate into a RADOS zero op.
I'm not changing how WRITE_ZEROES is implemented though, so this is
orthogonal to your work -- just wanted to give a heads up.

>
> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.
>
> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
> osd can zero out blocks, instead of punching holes.

REQ_NOUNMAP is just a hint, the block device is free to ignore it.
IIRC the only way to control it from userspace is through fallocate(2):
FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
to not unmap.  Given that fallocate(2) on block devices is fairly new,
I'm curious if you have an application that actually cares in mind?

Thanks,

                Ilya
Roman Penyaev Jan. 21, 2019, 10:23 a.m. UTC | #2
Hi Ilya,

On 2019-01-18 17:29, Ilya Dryomov wrote:
> On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Hi all,
>> 
>> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd 
>> side
>> when REQ_NOUNMAP flag is set for a block layer request.
> 
> Hi Roman,
> 
> I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
> The idea is to punt on small and/or unaligned discard requests which
> don't actually free up any space but translate into a RADOS zero op.
> I'm not changing how WRITE_ZEROES is implemented though, so this is
> orthogonal to your work -- just wanted to give a heads up.

Good to know, thanks for telling me.

>> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer 
>> requests
>> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd 
>> side.
>> 
>> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO 
>> request
>> osd can zero out blocks, instead of punching holes.
> 
> REQ_NOUNMAP is just a hint, the block device is free to ignore it.
> IIRC the only way to control it from userspace is through fallocate(2):
> FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
> to not unmap.  Given that fallocate(2) on block devices is fairly new,
> I'm curious if you have an application that actually cares in mind?

No, no.  This is an attempt to follow block layer semantics, nothing 
more.
Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the 
only
practical value which comes to mind is performance (preallocate zeroed
blocks and format any fs, etc) and possible secure-erase.  After some
internal discussions about performance of writing zeroes (instead of
true DISCARD) this seems does not bring any value, at least on 
bluestore,
but secure wipe can make sense (for example using blkdiscard --zerouut).

--
Roman
Jason Dillaman Jan. 21, 2019, 1:58 p.m. UTC | #3
On Mon, Jan 21, 2019 at 5:23 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>
> Hi Ilya,
>
> On 2019-01-18 17:29, Ilya Dryomov wrote:
> > On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@suse.de> wrote:
> >>
> >> Hi all,
> >>
> >> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd
> >> side
> >> when REQ_NOUNMAP flag is set for a block layer request.
> >
> > Hi Roman,
> >
> > I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
> > The idea is to punt on small and/or unaligned discard requests which
> > don't actually free up any space but translate into a RADOS zero op.
> > I'm not changing how WRITE_ZEROES is implemented though, so this is
> > orthogonal to your work -- just wanted to give a heads up.
>
> Good to know, thanks for telling me.
>
> >> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer
> >> requests
> >> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd
> >> side.
> >>
> >> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO
> >> request
> >> osd can zero out blocks, instead of punching holes.
> >
> > REQ_NOUNMAP is just a hint, the block device is free to ignore it.
> > IIRC the only way to control it from userspace is through fallocate(2):
> > FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
> > to not unmap.  Given that fallocate(2) on block devices is fairly new,
> > I'm curious if you have an application that actually cares in mind?
>
> No, no.  This is an attempt to follow block layer semantics, nothing
> more.
> Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
> only
> practical value which comes to mind is performance (preallocate zeroed
> blocks and format any fs, etc) and possible secure-erase.  After some
> internal discussions about performance of writing zeroes (instead of
> true DISCARD) this seems does not bring any value, at least on
> bluestore,
> but secure wipe can make sense (for example using blkdiscard --zerouut).

The zeroed writes would need to be smaller than the bluestore min
alloc size for that to work. Otherwise, bluestore will just allocate a
new blob extent, write zeroes to it, and pivot the object metadata to
point to the new allocation.

> --
> Roman
>
Roman Penyaev Jan. 21, 2019, 2:36 p.m. UTC | #4
On 2019-01-21 14:58, Jason Dillaman wrote:
> On Mon, Jan 21, 2019 at 5:23 AM Roman Penyaev <rpenyaev@suse.de> wrote:
>> 
>> Hi Ilya,
>> 
>> On 2019-01-18 17:29, Ilya Dryomov wrote:
>> > On Fri, Jan 18, 2019 at 3:56 PM Roman Penyaev <rpenyaev@suse.de> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd
>> >> side
>> >> when REQ_NOUNMAP flag is set for a block layer request.
>> >
>> > Hi Roman,
>> >
>> > I'm working on splitting DISCARD and WRITE_ZEROES handling right now.
>> > The idea is to punt on small and/or unaligned discard requests which
>> > don't actually free up any space but translate into a RADOS zero op.
>> > I'm not changing how WRITE_ZEROES is implemented though, so this is
>> > orthogonal to your work -- just wanted to give a heads up.
>> 
>> Good to know, thanks for telling me.
>> 
>> >> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer
>> >> requests
>> >> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd
>> >> side.
>> >>
>> >> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO
>> >> request
>> >> osd can zero out blocks, instead of punching holes.
>> >
>> > REQ_NOUNMAP is just a hint, the block device is free to ignore it.
>> > IIRC the only way to control it from userspace is through fallocate(2):
>> > FALLOC_FL_PUNCH_HOLE can unmap, while FALLOC_FL_ZERO_RANGE is supposed
>> > to not unmap.  Given that fallocate(2) on block devices is fairly new,
>> > I'm curious if you have an application that actually cares in mind?
>> 
>> No, no.  This is an attempt to follow block layer semantics, nothing
>> more.
>> Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
>> only
>> practical value which comes to mind is performance (preallocate zeroed
>> blocks and format any fs, etc) and possible secure-erase.  After some
>> internal discussions about performance of writing zeroes (instead of
>> true DISCARD) this seems does not bring any value, at least on
>> bluestore,
>> but secure wipe can make sense (for example using blkdiscard 
>> --zerouut).
> 
> The zeroed writes would need to be smaller than the bluestore min
> alloc size for that to work. Otherwise, bluestore will just allocate a
> new blob extent, write zeroes to it, and pivot the object metadata to
> point to the new allocation.

Exactly, that what I've heard.  Thanks for clarifying.

--
Roman
David Disseldorp Jan. 24, 2019, 4:45 p.m. UTC | #5
On Mon, 21 Jan 2019 08:58:22 -0500, Jason Dillaman wrote:

> > Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
> > only
> > practical value which comes to mind is performance (preallocate zeroed
> > blocks and format any fs, etc) and possible secure-erase.  After some
> > internal discussions about performance of writing zeroes (instead of
> > true DISCARD) this seems does not bring any value, at least on
> > bluestore,
> > but secure wipe can make sense (for example using blkdiscard --zerouut).  

Another possible use case could be space reservations for thin (over)
provisioned storage. E.g. I don't have anything to write now, but want
to make sure that the array won't reject writes to this region in
future.

> The zeroed writes would need to be smaller than the bluestore min
> alloc size for that to work. Otherwise, bluestore will just allocate a
> new blob extent, write zeroes to it, and pivot the object metadata to
> point to the new allocation.

I think we'll need some form of only-ack-when-the-previous-data-is-gone
guarantee from bluestore in future, at least if we want to work towards
supporting things like REQ_OP_SECURE_ERASE on the client side.

Cheers, David
David Disseldorp Jan. 24, 2019, 4:45 p.m. UTC | #6
On Fri, 18 Jan 2019 15:56:05 +0100, Roman Penyaev wrote:

> Hi all,
> 
> This is an attempt to split DISCARD and WRITE_ZEROES paths on krbd side
> when REQ_NOUNMAP flag is set for a block layer request.
> 
> Currently both REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES block layer requests
> fall down to CEPH_OSD_OP_ZERO request, which punches holes on osd side.
> 
> With a new CEPH_OSD_OP_FLAG_ZERO_NOUNMAP flag for CEPH_OSD_OP_ZERO request
> osd can zero out blocks, instead of punching holes.

I think I'd prefer to have the WRITE_SAME conversion done on the client
side (handled alongside REQ_OP_WRITE_SAME), with EOPNOTSUPP propagated
up to callers when REQ_NOUNMAP is requested against an old cluster. If
that's considered too unkind to existing REQ_NOUNMAP consumers, then
this approach looks reasonable to me.

Cheers, David
Ilya Dryomov Jan. 24, 2019, 5:20 p.m. UTC | #7
On Thu, Jan 24, 2019 at 5:45 PM David Disseldorp <ddiss@suse.de> wrote:
>
> On Mon, 21 Jan 2019 08:58:22 -0500, Jason Dillaman wrote:
>
> > > Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
> > > only
> > > practical value which comes to mind is performance (preallocate zeroed
> > > blocks and format any fs, etc) and possible secure-erase.  After some
> > > internal discussions about performance of writing zeroes (instead of
> > > true DISCARD) this seems does not bring any value, at least on
> > > bluestore,
> > > but secure wipe can make sense (for example using blkdiscard --zerouut).
>
> Another possible use case could be space reservations for thin (over)
> provisioned storage. E.g. I don't have anything to write now, but want
> to make sure that the array won't reject writes to this region in
> future.

I think you might be conflating REQ_NONUMAP with UNMAP flag to WRITE
SAME (and the way zeroout is implemented for SCSI).  As I understand it,
REQ_NOUNMAP means "try to avoid deallocating when zeroing", definitely
not "allocate if not yet allocated".

>
> > The zeroed writes would need to be smaller than the bluestore min
> > alloc size for that to work. Otherwise, bluestore will just allocate a
> > new blob extent, write zeroes to it, and pivot the object metadata to
> > point to the new allocation.
>
> I think we'll need some form of only-ack-when-the-previous-data-is-gone
> guarantee from bluestore in future, at least if we want to work towards
> supporting things like REQ_OP_SECURE_ERASE on the client side.

bluestore will also need to be taught to do BLKSECDISCARD instead of
regular BLKDISCARD, propagate it through the kv store, etc.  We will
also need to know whether secure erase is supported by all underlying
devices so that we can set QUEUE_FLAG_SECERASE at "rbd map" time and
then deal with the fact that it might change over time when OSDs get
replaced, etc.

zeroout and secure erase are two very different things...

Thanks,

                Ilya
David Disseldorp Jan. 24, 2019, 5:57 p.m. UTC | #8
On Thu, 24 Jan 2019 18:20:23 +0100, Ilya Dryomov wrote:

> On Thu, Jan 24, 2019 at 5:45 PM David Disseldorp <ddiss@suse.de> wrote:
> >
> > On Mon, 21 Jan 2019 08:58:22 -0500, Jason Dillaman wrote:
> >  
> > > > Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
> > > > only
> > > > practical value which comes to mind is performance (preallocate zeroed
> > > > blocks and format any fs, etc) and possible secure-erase.  After some
> > > > internal discussions about performance of writing zeroes (instead of
> > > > true DISCARD) this seems does not bring any value, at least on
> > > > bluestore,
> > > > but secure wipe can make sense (for example using blkdiscard --zerouut).  
> >
> > Another possible use case could be space reservations for thin (over)
> > provisioned storage. E.g. I don't have anything to write now, but want
> > to make sure that the array won't reject writes to this region in
> > future.  
> 
> I think you might be conflating REQ_NONUMAP with UNMAP flag to WRITE
> SAME (and the way zeroout is implemented for SCSI).  As I understand it,
> REQ_NOUNMAP means "try to avoid deallocating when zeroing", definitely
> not "allocate if not yet allocated".

That doesn't quite match my interpretation of blkdiscard(8) "Zero-fill
rather than discard" and blk_types.h "write the zero filled sector many
times" + "do not free blocks when zeroing".

Either way, I'll work on adding blktests coverage for this so that
semantics can be agreed upon somewhere.

Cheers, David
Ilya Dryomov Jan. 24, 2019, 6:58 p.m. UTC | #9
On Thu, Jan 24, 2019 at 6:57 PM David Disseldorp <ddiss@suse.de> wrote:
>
> On Thu, 24 Jan 2019 18:20:23 +0100, Ilya Dryomov wrote:
>
> > On Thu, Jan 24, 2019 at 5:45 PM David Disseldorp <ddiss@suse.de> wrote:
> > >
> > > On Mon, 21 Jan 2019 08:58:22 -0500, Jason Dillaman wrote:
> > >
> > > > > Indeed, the users of REQ_NONUMAP are ioctl() and fallocate(), so the
> > > > > only
> > > > > practical value which comes to mind is performance (preallocate zeroed
> > > > > blocks and format any fs, etc) and possible secure-erase.  After some
> > > > > internal discussions about performance of writing zeroes (instead of
> > > > > true DISCARD) this seems does not bring any value, at least on
> > > > > bluestore,
> > > > > but secure wipe can make sense (for example using blkdiscard --zerouut).
> > >
> > > Another possible use case could be space reservations for thin (over)
> > > provisioned storage. E.g. I don't have anything to write now, but want
> > > to make sure that the array won't reject writes to this region in
> > > future.
> >
> > I think you might be conflating REQ_NONUMAP with UNMAP flag to WRITE
> > SAME (and the way zeroout is implemented for SCSI).  As I understand it,
> > REQ_NOUNMAP means "try to avoid deallocating when zeroing", definitely
> > not "allocate if not yet allocated".
>
> That doesn't quite match my interpretation of blkdiscard(8) "Zero-fill
> rather than discard" and blk_types.h "write the zero filled sector many
> times" + "do not free blocks when zeroing".

Discard isn't guaranteed to zero (or to deallocate for that matter).
The driver and the device are free to slice the discard request into
pieces, align or round it or even disregard it entirely.

Zeroout is guaranteed to zero, hence the distinction.  How that is
accomplished is up to the driver and the device.  It can write zeroes
explicitly, do an explicit write same, do a pre-allocating write same
(anchoring), deallocate, etc.

REQ_NOUNMAP is just a hint (that can be ignored) for when the device
can naturally do it either way.

Thanks,

                Ilya