From patchwork Mon Jul 27 22:11:30 2015
X-Patchwork-Submitter: Ming Lin
X-Patchwork-Id: 6876931
X-Patchwork-Delegate: snitzer@redhat.com
Message-ID: <1438035090.28978.19.camel@ssi>
From: Ming Lin
To: Mike Snitzer
Date: Mon, 27 Jul 2015 15:11:30 -0700
In-Reply-To: <20150727175048.GA18183@redhat.com>
References: <1436166674-31362-1-git-send-email-mlin@kernel.org>
 <1436764355.30675.10.camel@hasee>
 <20150713153537.GA30898@redhat.com>
 <1437675702.11359.25.camel@ssi>
 <20150727175048.GA18183@redhat.com>
Cc: Jens Axboe, Jeff Moyer, linux-kernel@vger.kernel.org, dm-devel@redhat.com,
 Dongsu Park, Christoph Hellwig, "Alasdair G. Kergon", Kent Overstreet
Subject: Re: [dm-devel] [PATCH v5 00/11] simplify block layer based on immutable biovecs
List-Id: device-mapper development

On Mon, 2015-07-27 at 13:50 -0400, Mike Snitzer wrote:
> On Thu, Jul 23 2015 at 2:21pm -0400,
> Ming Lin wrote:
>
> > On Mon, 2015-07-13 at 11:35 -0400, Mike Snitzer wrote:
> > > On Mon, Jul 13 2015 at 1:12am -0400,
> > > Ming Lin wrote:
> > >
> > > > On Mon, 2015-07-06 at 00:11 -0700, mlin@kernel.org wrote:
> > > > > Hi Mike,
> > > > >
> > > > > On Wed, 2015-06-10 at 17:46 -0400, Mike Snitzer wrote:
> > > > > > I've been busy getting DM changes for the 4.2 merge window finalized.
> > > > > > As such I haven't connected with others on the team to discuss this
> > > > > > issue.
> > > > > >
> > > > > > I'll see if we can make time in the next 2 days. But I also have
> > > > > > RHEL-specific kernel deadlines I'm coming up against.
> > > > > >
> > > > > > Seems late to be staging this extensive a change for 4.2... are you
> > > > > > pushing for this code to land in the 4.2 merge window? Or do we have
> > > > > > time to work this further and target the 4.3 merge?
> > > > >
> > > > > 4.2-rc1 was out.
> > > > > Would you have time to work together for 4.3 merge?
> > > >
> > > > Ping ...
> > > >
> > > > What can I do to move forward?
> > >
> > > You can show further testing. Particularly that you've covered all the
> > > edge cases.
> > >
> > > Until someone can produce some perf test results where they are actually
> > > properly controlling for the splitting, we have no useful information.
> > > The primary concerns associated with this patchset are:
> > > 1) In the context of RAID, XFS's use of bio_add_page() used to build up
> > >    optimal IOs when the underlying block device provides striping info
> > >    via IO limits. With this patchset how large will bios become in
> > >    practice _without_ bio_add_page() being bounded by the underlying IO
> > >    limits?
> >
> > Totally new to XFS code.
> > Did you mean xfs_buf_ioapply_map() -> bio_add_page()?
>
> Yes. But there is also:
> xfs_vm_writepage -> xfs_submit_ioend -> xfs_bio_add_buffer -> bio_add_page
>
> Basically in the old code XFS sized IO accordingly based on the
> bio_add_page feedback loop.
>
> > The largest size could be BIO_MAX_PAGES pages, that is 256 pages (1M
> > bytes).
>
> Independent of this late splitting work (but related): we really should
> look to fixup/extend BIO_MAX_PAGES to cover just barely "too large"
> configurations, e.g. 10+2 RAID6 with 128K chunk, so 1280K for a full
> stripe. Ideally we'd be able to read/write full stripes.
>
> > > 2) The late splitting that occurs for the (presumably) large bios that
> > >    are sent down... how does it cope/perform in the face of very
> > >    low/fragmented system memory?
> >
> > I tested in qemu-kvm with 1G/1100M/1200M memory.
> > 10 HDDs were attached to qemu via virtio-blk.
> > Then I created an MD RAID6 array and ran mkfs.xfs on it.
> >
> > I used bs=2M, so there will be a lot of bio splits.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1200
> > time_based
> > group_reporting
> > numjobs=8
> > gtod_reduce=0
> > norandommap
> >
> > [job1]
> > bs=2M
> > directory=/mnt
> > size=100M
> > rw=write
> >
> > Here are the results:
> >
> > memory   4.2-rc2   4.2-rc2-patched
> > ------   -------   ---------------
> > 1G       OOM       OOM
> > 1100M    fail      OK
> > 1200M    OK        OK
> >
> > "fail" means it hit a page allocation failure.
> > http://minggr.net/pub/block_patches_tests/dmesg.4.2.0-rc2
> >
> > I tested 3 times for each kernel to confirm that with 1100M memory,
> > 4.2-rc2 always hit a page allocation failure and 4.2-rc2-patched is OK.
> >
> > So the patched kernel performs better in this case.
>
> Interesting. Seems to prove Kent's broader point that he used mempools
> and handles allocations better than the old code did.
>
> > > 3) More open-ended comment than question: Linux has evolved to perform
> > >    well on "enterprise" systems. We generally don't fall off a cliff on
> > >    performance like we used to. The concern associated with this
> > >    patchset is that if it goes in without _real_ due-diligence on
> > >    "enterprise" scale systems and workloads it'll be too late once we
> > >    notice the problem(s).
> > >
> > > So we really need answers to 1 and 2 above in order to feel better about
> > > the risks associated with 3.
> > >
> > > Alasdair's feedback to you on testing still applies (and hasn't been
> > > done AFAIK):
> > > https://www.redhat.com/archives/dm-devel/2015-May/msg00203.html
> > >
> > > Particularly:
> > > "you might need to instrument the kernels to tell you the sizes of the
> > > bios being created and the amount of splitting actually happening."
> >
> > I added a debug patch to record the amount of splitting that actually
> > happened: https://goo.gl/Iiyg4Y
> >
> > In the qemu 1200M memory test case,
> >
> > $ cat /sys/block/md0/queue/split
> > discard split: 0, write same split: 0, segment split: 27400
> >
> > > and
> > >
> > > "You may also want to test systems with a restricted amount of available
> > > memory to show how the splitting via worker thread performs. (Again,
> > > instrument to prove the extent to which the new code is being exercised.)"
> >
> > Does the above test with qemu make sense?
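As an aside, the size arithmetic discussed earlier in the thread (the
256-page BIO_MAX_PAGES cap versus Mike's 10+2 RAID6 full-stripe example
with a 128K chunk) works out as below. This is just a back-of-the-envelope
sketch, assuming 4K pages:

```python
# Back-of-the-envelope check of the sizes discussed in this thread.
# Assumes 4K pages (an assumption, though true on x86).
PAGE_KB = 4
BIO_MAX_PAGES = 256

max_bio_kb = BIO_MAX_PAGES * PAGE_KB        # largest bio: 1024K (1M)

# Mike's example: 10+2 RAID6 with a 128K chunk -> 10 data chunks per stripe
data_disks = 10
chunk_kb = 128
full_stripe_kb = data_disks * chunk_kb      # 1280K

# A full stripe does not fit in one bio, hence the suggestion to
# extend BIO_MAX_PAGES for such configurations.
print(max_bio_kb, full_stripe_kb)           # 1024 1280
```

So with the current cap a 1280K full-stripe write necessarily becomes at
least two bios (1024K + 256K).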
> The test is showing that systems with limited memory are performing
> better but, without looking at the patchset in detail, I'm not sure what
> your splitting accounting patch is showing.
>
> Are you saying that:
> 1) the code only splits via worker threads
> 2) with 27400 splits in the 1200M case the splitting certainly isn't
>    making things any worse.

With this patchset, bio_add_page() always creates as large a bio as
possible (1M bytes max). The patch counts how many times a bio was split
due to a device limitation, for example,
bio->bi_phys_segments > queue_max_segments(q).

It's more interesting to look at how many bios are allocated for each
application IO request, e.g. 10+2 RAID6 with 128K chunk. Assume we only
consider the device's max_segments limitation.

# cat /sys/block/md0/queue/max_segments
126

So blk_queue_split() will split a bio if its size > 126 pages (504K bytes).

Let's do a 1280K request:

# dd if=/dev/zero of=/dev/md0 bs=1280k count=1 oflag=direct

With the debug patch below:

For the non-patched kernel, 10 bios were allocated.

[   11.921775] md_make_request: bio ffff8800469c5d00, offset 0, size 128K
[   11.945692] md_make_request: bio ffff8800471df700, offset 131072, size 128K
[   11.946596] md_make_request: bio ffff8800471df200, offset 262144, size 128K
[   11.947694] md_make_request: bio ffff8800471df300, offset 393216, size 128K
[   11.949421] md_make_request: bio ffff8800471df900, offset 524288, size 128K
[   11.956345] md_make_request: bio ffff8800471df000, offset 655360, size 128K
[   11.957586] md_make_request: bio ffff8800471dfb00, offset 786432, size 128K
[   11.959086] md_make_request: bio ffff8800471dfc00, offset 917504, size 128K
[   11.964221] md_make_request: bio ffff8800471df400, offset 1048576, size 128K
[   11.965117] md_make_request: bio ffff8800471df800, offset 1179648, size 128K

For the patched kernel, only 2 bios were allocated in the base case, with 0 splits.
[   20.034036] md_make_request: bio ffff880046a2ee00, offset 0, size 1024K
[   20.046104] md_make_request: bio ffff880046a2e500, offset 1048576, size 256K

In the worst case, 4 bios are allocated with 2 splits. One such worst case
is memory so fragmented that the 1M bio comprises 256 bi_phys_segments, so
it needs 2 splits.

1280K = 1M + 256K
ffff880046a30900 and ffff880046a21500 are the original bios.
ffff880046a30200 and ffff880046a21e00 are the split bios.

[   13.049323] md_make_request: bio ffff880046a30200, offset 0, size 504K
[   13.080057] md_make_request: bio ffff880046a21e00, offset 516096, size 504K
[   13.082857] md_make_request: bio ffff880046a30900, offset 1032192, size 16K
[   13.084983] md_make_request: bio ffff880046a21500, offset 1048576, size 256K

# cat /sys/block/md0/queue/split
discard split: 0, write same split: 0, segment split: 2

> But for me the bigger take away is: the old merge_bvec code (no late
> splitting) is more prone to allocation failure than the new code.

Yes, as I showed above.

> On that point alone I'm OK with this patchset going forward.
>
> I'll review the implementation details as they relate to DM now, but
> that is just a formality. My hope is that I'll be able to provide my
> Acked-by very soon.

Great! Thanks.

>
> Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

---
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a4aa6e5..2fde2ce 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -259,6 +259,10 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_split(q, &bio, q->bio_split);
 
+	if (!strcmp(current->comm, "dd") && bio_data_dir(bio) == WRITE)
+		printk("%s: bio %p, offset %lu, size %uK\n", __func__,
+			bio, bio->bi_iter.bi_sector << 9, bio->bi_iter.bi_size >> 10);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
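For reference, the worst-case numbers above can be reproduced with a small
sketch. The split() helper here only models the size arithmetic, not the
actual blk_queue_split() logic, and it assumes 4K pages with one page per
physical segment (i.e. fully fragmented memory):

```python
# Worst-case split arithmetic for the 1280K dd request above.
# Assumes 4K pages and one page per physical segment; split() models
# only the size math, not real kernel code.
PAGE_KB = 4
MAX_SEGMENTS = 126            # from /sys/block/md0/queue/max_segments
BIO_MAX_KB = 256 * PAGE_KB    # 1024K: largest bio bio_add_page() builds

split_limit_kb = MAX_SEGMENTS * PAGE_KB   # 504K per submitted bio

def split(bio_kb, limit_kb):
    """Sizes a bio of bio_kb is cut into under a limit_kb cap."""
    chunks = []
    while bio_kb > limit_kb:
        chunks.append(limit_kb)
        bio_kb -= limit_kb
    chunks.append(bio_kb)
    return chunks

request_kb = 1280
bios = [BIO_MAX_KB, request_kb - BIO_MAX_KB]      # 1024K + 256K
submitted = [c for b in bios for c in split(b, split_limit_kb)]
splits = len(submitted) - len(bios)

print(submitted, splits)   # [504, 504, 16, 256] 2
```

This matches the four md_make_request lines (504K, 504K, 16K, 256K) and
the segment split count of 2 shown above.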