diff mbox

[1/4] brd: handle misaligned discard

Message ID alpine.LRH.2.02.1610261625580.12287@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mikulas Patocka Oct. 26, 2016, 8:26 p.m. UTC
The brd driver refuses misaligned discard requests with an error. However,
this is suboptimal, misaligned requests could be handled by discarding a
part of the request that is aligned on a page boundary. This patch changes
the code so that it handles misaligned requests.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/block/brd.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bart Van Assche Oct. 26, 2016, 8:38 p.m. UTC | #1
On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> The brd driver refuses misaligned discard requests with an error. However,
> this is suboptimal, misaligned requests could be handled by discarding a
> part of the request that is aligned on a page boundary. This patch changes
> the code so that it handles misaligned requests.

Hello Mikulas,

We do not want this kind of discard request processing in every block 
driver. This is why I think that this kind of processing should be added 
to the block layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() 
submit aligned discard requests" 
(http://www.spinics.net/lists/linux-block/msg02360.html).

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka Oct. 26, 2016, 9:46 p.m. UTC | #2
On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 01:26 PM, Mikulas Patocka wrote:
> > The brd driver refuses misaligned discard requests with an error. However,
> > this is suboptimal, misaligned requests could be handled by discarding a
> > part of the request that is aligned on a page boundary. This patch changes
> > the code so that it handles misaligned requests.
> 
> Hello Mikulas,
> 
> We do not want this kind of discard request processing in every block driver.
> This is why I think that this kind of processing should be added to the block
> layer. See also "[PATCH v3 0/5] Make blkdev_issue_discard() submit aligned
> discard requests" (http://www.spinics.net/lists/linux-block/msg02360.html).
> 
> Bart.

I don't like the idea of complicating the code by turning discards into 
writes. You can just turn off the flag "discard_zeroes_data" and drop all 
the splitting code.

The flag "discard_zeroes_data" is actually misdesigned, because the 
storage stack can change dynamically while bios are in progress. You can 
send a discard bio to a device mapper device that has 
"discard_zeroes_data" - and while the bio is in progress, the device 
mapper stack can be reconfigured to redirect the bio to another device 
that doesn't have "discard_zeroes_data" - and the bio won't zero data and 
the caller that issued it has no way to find it out.

I think the proper thing would be to move "discard_zeroes_data" flag into 
the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
and the caller is supposed to do zeroing manually.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 26, 2016, 9:50 p.m. UTC | #3
On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote:
> I think the proper thing would be to move "discard_zeroes_data" flag into 
> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
> and the caller is supposed to do zeroing manually.

Yes, Martin and I have come to a similar conclusion recently.  An
additional aspect is that NVMe has a Write Zeroes command which is more
limited than what REQ_OP_WRITE_SAME does.

So I think the right way is to add a REQ_OP_WRITE_ZEROES (or
REQ_OP_ZERO) and have modifies if it may discard or not.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Oct. 26, 2016, 9:57 p.m. UTC | #4
On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> I don't like the idea of complicating the code by turning discards into
> writes.

That's not what my patch series does. The only writes added by my patch 
series are those for the non-aligned head and tail of the range passed 
to blkdev_issue_zeroout().

> The flag "discard_zeroes_data" is actually misdesigned, because the
> storage stack can change dynamically while bios are in progress. You can
> send a discard bio to a device mapper device that has
> "discard_zeroes_data" - and while the bio is in progress, the device
> mapper stack can be reconfigured to redirect the bio to another device
> that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> the caller that issued it has no way to find it out.
>
> I think the proper thing would be to move "discard_zeroes_data" flag into
> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> and the caller is supposed to do zeroing manually.

Sorry but I don't like this proposal. I think that a much better 
solution would be to pause I/O before making any changes to the 
discard_zeroes_data flag.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka Oct. 28, 2016, 11:39 a.m. UTC | #5
On Wed, 26 Oct 2016, Bart Van Assche wrote:

> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > I don't like the idea of complicating the code by turning discards into
> > writes.
> 
> That's not what my patch series does. The only writes added by my patch series
> are those for the non-aligned head and tail of the range passed to
> blkdev_issue_zeroout().

The purpose of discard is to improve SSD performance and reduce wear.

Generating more write requests for discard does quite the opposite - it 
reduces performance (discard + two small writes is slower than just 
discard) and it also causes more wear.

> > The flag "discard_zeroes_data" is actually misdesigned, because the
> > storage stack can change dynamically while bios are in progress. You can
> > send a discard bio to a device mapper device that has
> > "discard_zeroes_data" - and while the bio is in progress, the device
> > mapper stack can be reconfigured to redirect the bio to another device
> > that doesn't have "discard_zeroes_data" - and the bio won't zero data and
> > the caller that issued it has no way to find it out.
> > 
> > I think the proper thing would be to move "discard_zeroes_data" flag into
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
> > and the caller is supposed to do zeroing manually.
> 
> Sorry but I don't like this proposal. I think that a much better solution
> would be to pause I/O before making any changes to the discard_zeroes_data
> flag.

The device mapper pauses all bios when it switches the table - but the bio 
was submitted with assumption that it goes to a device withe 
"discard_zeroes_data" set, then the bio is paused, device mapper table is 
switched, the bio is unpaused, and now it can go to a device without 
"discard_zeroes_data".

> Bart.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka Oct. 28, 2016, 11:43 a.m. UTC | #6
On Wed, 26 Oct 2016, Christoph Hellwig wrote:

> On Wed, Oct 26, 2016 at 05:46:11PM -0400, Mikulas Patocka wrote:
> > I think the proper thing would be to move "discard_zeroes_data" flag into 
> > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO - 
> > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio 
> > and the caller is supposed to do zeroing manually.
> 
> Yes, Martin and I have come to a similar conclusion recently.  An
> additional aspect is that NVMe has a Write Zeroes command which is more
> limited than what REQ_OP_WRITE_SAME does.
> 
> So I think the right way is to add a REQ_OP_WRITE_ZEROES (or
> REQ_OP_ZERO) and have modifies if it may discard or not.

We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
if it does, turn it into "Write Zeroes" or TRIM command (if the device 
guarantees zeroing on trim). If it doesn't contain all zeroes and the 
device doesn't support non-zero WRITE SAME, then reject it.

Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
which of these two possibilities is better.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Oct. 28, 2016, 1:14 p.m. UTC | #7
[adding Chaitanya to Cc]

On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote:
> We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
> if it does, turn it into "Write Zeroes" or TRIM command (if the device 
> guarantees zeroing on trim). If it doesn't contain all zeroes and the 
> device doesn't support non-zero WRITE SAME, then reject it.

I don't like this because it's very inefficient - we have to allocate
a payload first and then compare the whole payload for very operation.

> Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
> which of these two possibilities is better.

Chaitanya actually did an initial prototype implementation of this for
NVMe that he shared with me.  I liked it a lot and I think he'll be
ready to post it in a few days.  Now that we have the REQ_OP* values
instead of mapping different command types to flags it's actually
surprisingly easy to add new block layer operations.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bart Van Assche Oct. 28, 2016, 3:55 p.m. UTC | #8
On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> On Wed, 26 Oct 2016, Bart Van Assche wrote:
>> On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
>>> I think the proper thing would be to move "discard_zeroes_data" flag into
>>> the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
>>> and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the bio
>>> and the caller is supposed to do zeroing manually.
>>
>> Sorry but I don't like this proposal. I think that a much better solution
>> would be to pause I/O before making any changes to the discard_zeroes_data
>> flag.
>
> The device mapper pauses all bios when it switches the table - but the bio
> was submitted with assumption that it goes to a device withe
> "discard_zeroes_data" set, then the bio is paused, device mapper table is
> switched, the bio is unpaused, and now it can go to a device without
> "discard_zeroes_data".

Hello Mikulas,

Sorry if I wasn't clear enough. What I meant is to wait until all 
outstanding requests have finished before modifying the 
discard_zeroes_data flag - the kind of operation that is performed by 
e.g. blk_mq_freeze_queue(). Modifying the discard_zeroes_data flag after 
a bio has been submitted and before it has completed could lead to 
several classes of subtle bugs. Functions like __blkdev_issue_discard() 
assume that the value of the discard_zeroes_data flag does not change 
after this function has been called and before the submitted requests 
completed.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka Oct. 31, 2016, 4:31 p.m. UTC | #9
On Fri, 28 Oct 2016, Bart Van Assche wrote:

> On 10/28/2016 04:39 AM, Mikulas Patocka wrote:
> > On Wed, 26 Oct 2016, Bart Van Assche wrote:
> > > On 10/26/2016 02:46 PM, Mikulas Patocka wrote:
> > > > I think the proper thing would be to move "discard_zeroes_data" flag
> > > > into
> > > > the bio itself - there would be REQ_OP_DISCARD and REQ_OP_DISCARD_ZERO -
> > > > and if the device doesn't support REQ_OP_DISCARD_ZERO, it rejects the
> > > > bio
> > > > and the caller is supposed to do zeroing manually.
> > > 
> > > Sorry but I don't like this proposal. I think that a much better solution
> > > would be to pause I/O before making any changes to the discard_zeroes_data
> > > flag.
> > 
> > The device mapper pauses all bios when it switches the table - but the bio
> > was submitted with assumption that it goes to a device withe
> > "discard_zeroes_data" set, then the bio is paused, device mapper table is
> > switched, the bio is unpaused, and now it can go to a device without
> > "discard_zeroes_data".
> 
> Hello Mikulas,
> 
> Sorry if I wasn't clear enough. What I meant is to wait until all outstanding
> requests have finished

It is possible that the process sends never ending stream of bios (for 
example when reading linear data and using readahead), so waiting until 
there are no oustanding bios never finishes.

> before modifying the discard_zeroes_data flag - the
> kind of operation that is performed by e.g. blk_mq_freeze_queue().

blk_mq_freeze_queue() works on request-based drivers, device mapper works 
with bios, so that function has no effect on device mapper device. Anyway 
- blk_mq_freeze_queue() won't stop the process that issues the I/O 
requests - it will just hold the requests in the queue and not forward 
them to the device.

There is no way to stop the process that issues the bios. We can't stop 
the process that is looping in __blkdev_issue_discard, issuing discard 
requests. All that we can do is to hold the bios that the process issued.

Device mapper can freeze the filesystem with "freeze_bdev", but...
- some filesystems don't support freeze
- if the filesystem is not directly mounted on the frozen device, but 
there is a stack of intermediate block devices between the filesystem and 
the frozen device, then the filesystem will not be frozen
- the user can open the block device directly and he won't be affected by 
the freeze

> Modifying the discard_zeroes_data flag after a bio has been submitted 
> and before it has completed could lead to several classes of subtle 
> bugs. Functions like __blkdev_issue_discard() assume that the value of 
> the discard_zeroes_data flag does not change after this function has 
> been called and before the submitted requests completed.
>
> Bart.

I agree. That's the topic of the discussion - that the discard_zeroes_data 
flag is unreliable and the flag should be moved to the bio.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka Oct. 31, 2016, 4:36 p.m. UTC | #10
On Fri, 28 Oct 2016, Christoph Hellwig wrote:

> [adding Chaitanya to Cc]
> 
> On Fri, Oct 28, 2016 at 07:43:41AM -0400, Mikulas Patocka wrote:
> > We could detect if the REQ_OP_WRITE_SAME command contains all zeroes and 
> > if it does, turn it into "Write Zeroes" or TRIM command (if the device 
> > guarantees zeroing on trim). If it doesn't contain all zeroes and the 
> > device doesn't support non-zero WRITE SAME, then reject it.
> 
> I don't like this because it's very inefficient - we have to allocate
> a payload first and then compare the whole payload for very operation.
> 
> > Or maybe we could add a new command REQ_OP_WRITE_ZEROES - I'm not sure 
> > which of these two possibilities is better.
> 
> Chaitanya actually did an initial prototype implementation of this for
> NVMe that he shared with me.  I liked it a lot and I think he'll be
> ready to post it in a few days.  Now that we have the REQ_OP* values
> instead of mapping different command types to flags it's actually
> surprisingly easy to add new block layer operations.

OK - when it is in the kernel, let me know, so that I can write device 
mapper support for that.

We should remove the flag "discard_zeroes_data" afterwards, because it is 
unreliable and impossible to support correctly in the device mapper.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/drivers/block/brd.c
===================================================================
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -213,9 +213,14 @@  static int copy_to_brd_setup(struct brd_
 }
 
 static void discard_from_brd(struct brd_device *brd,
-			sector_t sector, size_t n)
+			sector_t sector, unsigned n_sectors)
 {
-	while (n >= PAGE_SIZE) {
+	unsigned boundary = -sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1);
+	if (unlikely(boundary >= n_sectors))
+		return;
+	sector += boundary;
+	n_sectors -= boundary;
+	while (n_sectors >= PAGE_SIZE >> SECTOR_SHIFT) {
 		/*
 		 * Don't want to actually discard pages here because
 		 * re-allocating the pages can result in writeback
@@ -226,7 +231,7 @@  static void discard_from_brd(struct brd_
 		else
 			brd_zero_page(brd, sector);
 		sector += PAGE_SIZE >> SECTOR_SHIFT;
-		n -= PAGE_SIZE;
+		n_sectors -= PAGE_SIZE >> SECTOR_SHIFT;
 	}
 }
 
@@ -339,10 +344,7 @@  static blk_qc_t brd_make_request(struct
 		goto io_error;
 
 	if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) {
-		if (sector & ((PAGE_SIZE >> SECTOR_SHIFT) - 1) ||
-		    bio->bi_iter.bi_size & ~PAGE_MASK)
-			goto io_error;
-		discard_from_brd(brd, sector, bio->bi_iter.bi_size);
+		discard_from_brd(brd, sector, bio->bi_iter.bi_size >> SECTOR_SHIFT);
 		goto out;
 	}