From patchwork Mon Jul 27 22:11:30 2015
X-Patchwork-Submitter: Ming Lin
X-Patchwork-Id: 6876931
X-Patchwork-Delegate: snitzer@redhat.com
Message-ID: <1438035090.28978.19.camel@ssi>
From: Ming Lin
To: Mike Snitzer
Date: Mon, 27 Jul 2015 15:11:30 -0700
In-Reply-To: <20150727175048.GA18183@redhat.com>
References: <1436166674-31362-1-git-send-email-mlin@kernel.org>
 <1436764355.30675.10.camel@hasee>
 <20150713153537.GA30898@redhat.com>
 <1437675702.11359.25.camel@ssi>
 <20150727175048.GA18183@redhat.com>
Cc: Jens Axboe, Jeff Moyer, linux-kernel@vger.kernel.org, dm-devel@redhat.com,
 Dongsu Park, Christoph Hellwig, "Alasdair G. Kergon", Kent Overstreet
Subject: Re: [dm-devel] [PATCH v5 00/11] simplify block layer based on immutable biovecs
List-Id: device-mapper development

On Mon, 2015-07-27 at 13:50 -0400, Mike Snitzer wrote:
> On Thu, Jul 23 2015 at 2:21pm -0400,
> Ming Lin wrote:
>
> > On Mon, 2015-07-13 at 11:35 -0400, Mike Snitzer wrote:
> > > On Mon, Jul 13 2015 at 1:12am -0400,
> > > Ming Lin wrote:
> > >
> > > > On Mon, 2015-07-06 at 00:11 -0700, mlin@kernel.org wrote:
> > > > > Hi Mike,
> > > > >
> > > > > On Wed, 2015-06-10 at 17:46 -0400, Mike Snitzer wrote:
> > > > > > I've been busy getting DM changes for the 4.2 merge window finalized.
> > > > > > As such I haven't connected with others on the team to discuss this
> > > > > > issue.
> > > > > >
> > > > > > I'll see if we can make time in the next 2 days. But I also have
> > > > > > RHEL-specific kernel deadlines I'm coming up against.
> > > > > >
> > > > > > Seems late to be staging this extensive a change for 4.2... are you
> > > > > > pushing for this code to land in the 4.2 merge window? Or do we have
> > > > > > time to work this further and target the 4.3 merge?
> > > > >
> > > > > 4.2-rc1 was out.
> > > > > Would you have time to work together for 4.3 merge?
> > > >
> > > > Ping ...
> > > >
> > > > What can I do to move forward?
> > >
> > > You can show further testing. Particularly that you've covered all the
> > > edge cases.
> > >
> > > Until someone can produce some perf test results where they are actually
> > > properly controlling for the splitting, we have no useful information.
> > > The primary concerns associated with this patchset are:
> > > 1) In the context of RAID, XFS's use of bio_add_page() used to build up
> > >    optimal IOs when the underlying block device provides striping info
> > >    via IO limits. With this patchset how large will bios become in
> > >    practice _without_ bio_add_page() being bounded by the underlying IO
> > >    limits?
> >
> > Totally new to XFS code.
> > Did you mean xfs_buf_ioapply_map() -> bio_add_page()?
>
> Yes. But there is also:
> xfs_vm_writepage -> xfs_submit_ioend -> xfs_bio_add_buffer -> bio_add_page
>
> Basically in the old code XFS sized IO accordingly based on the
> bio_add_page feedback loop.
>
> > The largest size could be BIO_MAX_PAGES pages, that is 256 pages (1M
> > bytes).
>
> Independent of this late splitting work (but related): we really should
> look to fixup/extend BIO_MAX_PAGES to cover just barely "too large"
> configurations, e.g. 10+2 RAID6 with 128K chunk, so 1280K for a full
> stripe. Ideally we'd be able to read/write full stripes.
>
> > > 2) The late splitting that occurs for the (presumably) large bios that
> > >    are sent down... how does it cope/perform in the face of very
> > >    low/fragmented system memory?
> >
> > I tested in qemu-kvm with 1G/1100M/1200M memory.
> > 10 HDDs were attached to qemu via virtio-blk.
> > Then I created an MD RAID6 array and ran mkfs.xfs on it.
> >
> > I used bs=2M, so there will be a lot of bio splits.
> >
> > [global]
> > ioengine=libaio
> > iodepth=64
> > direct=1
> > runtime=1200
> > time_based
> > group_reporting
> > numjobs=8
> > gtod_reduce=0
> > norandommap
> >
> > [job1]
> > bs=2M
> > directory=/mnt
> > size=100M
> > rw=write
> >
> > Here are the results:
> >
> > memory   4.2-rc2   4.2-rc2-patched
> > ------   -------   ---------------
> > 1G       OOM       OOM
> > 1100M    fail      OK
> > 1200M    OK        OK
> >
> > "fail" means it hit a page allocation failure.
> > http://minggr.net/pub/block_patches_tests/dmesg.4.2.0-rc2
> >
> > I tested 3 times for each kernel to confirm that with 1100M memory,
> > 4.2-rc2 always hit a page allocation failure and 4.2-rc2-patched is OK.
> >
> > So the patched kernel performs better in this case.
>
> Interesting. Seems to prove Kent's broader point that he used mempools
> and handles allocations better than the old code did.
>
> > > 3) More open-ended comment than question: Linux has evolved to perform
> > >    well on "enterprise" systems. We generally don't fall off a cliff on
> > >    performance like we used to. The concern associated with this
> > >    patchset is that if it goes in without _real_ due-diligence on
> > >    "enterprise" scale systems and workloads it'll be too late once we
> > >    notice the problem(s).
> > >
> > > So we really need answers to 1 and 2 above in order to feel better about
> > > the risks associated with 3.
> > >
> > > Alasdair's feedback to you on testing still applies (and hasn't been
> > > done AFAIK):
> > > https://www.redhat.com/archives/dm-devel/2015-May/msg00203.html
> > >
> > > Particularly:
> > > "you might need to instrument the kernels to tell you the sizes of the
> > > bios being created and the amount of splitting actually happening."
> >
> > I added a debug patch to record the amount of splitting that actually
> > happened: https://goo.gl/Iiyg4Y
> >
> > In the qemu 1200M memory test case,
> >
> > $ cat /sys/block/md0/queue/split
> > discard split: 0, write same split: 0, segment split: 27400
> >
> > > and
> > >
> > > "You may also want to test systems with a restricted amount of available
> > > memory to show how the splitting via worker thread performs. (Again,
> > > instrument to prove the extent to which the new code is being exercised.)"
> >
> > Does the above test with qemu make sense?
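As an aside, the size arithmetic discussed earlier in the thread (the
256-page BIO_MAX_PAGES cap versus Mike's 10+2 RAID6 full-stripe example
with a 128K chunk) works out as below. This is just a back-of-the-envelope
sketch, assuming 4K pages:

```python
# Back-of-the-envelope check of the sizes discussed in this thread.
# Assumes 4K pages (an assumption, though true on x86).
PAGE_KB = 4
BIO_MAX_PAGES = 256

max_bio_kb = BIO_MAX_PAGES * PAGE_KB        # largest bio: 1024K (1M)

# Mike's example: 10+2 RAID6 with a 128K chunk -> 10 data chunks per stripe
data_disks = 10
chunk_kb = 128
full_stripe_kb = data_disks * chunk_kb      # 1280K

# A full stripe does not fit in one bio, hence the suggestion to
# extend BIO_MAX_PAGES for such configurations.
print(max_bio_kb, full_stripe_kb)           # 1024 1280
```

So with the current cap a 1280K full-stripe write necessarily becomes at
least two bios (1024K + 256K).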
> The test is showing that systems with limited memory are performing
> better but, without looking at the patchset in detail, I'm not sure what
> your splitting accounting patch is showing.
>
> Are you saying that:
> 1) the code only splits via worker threads
> 2) with 27400 splits in the 1200M case the splitting certainly isn't
>    making things any worse.

With this patchset, bio_add_page() always creates as large a bio as
possible (1M bytes max). The patch counts how many times a bio was split
due to a device limitation, for example,
bio->bi_phys_segments > queue_max_segments(q).

It's more interesting to look at how many bios are allocated for each
application IO request, e.g. 10+2 RAID6 with 128K chunk. Assume we only
consider the device's max_segments limitation.

# cat /sys/block/md0/queue/max_segments
126

So blk_queue_split() will split a bio if its size > 126 pages (504K bytes).

Let's do a 1280K request:

# dd if=/dev/zero of=/dev/md0 bs=1280k count=1 oflag=direct

With the debug patch below:

For the non-patched kernel, 10 bios were allocated.

[   11.921775] md_make_request: bio ffff8800469c5d00, offset 0, size 128K
[   11.945692] md_make_request: bio ffff8800471df700, offset 131072, size 128K
[   11.946596] md_make_request: bio ffff8800471df200, offset 262144, size 128K
[   11.947694] md_make_request: bio ffff8800471df300, offset 393216, size 128K
[   11.949421] md_make_request: bio ffff8800471df900, offset 524288, size 128K
[   11.956345] md_make_request: bio ffff8800471df000, offset 655360, size 128K
[   11.957586] md_make_request: bio ffff8800471dfb00, offset 786432, size 128K
[   11.959086] md_make_request: bio ffff8800471dfc00, offset 917504, size 128K
[   11.964221] md_make_request: bio ffff8800471df400, offset 1048576, size 128K
[   11.965117] md_make_request: bio ffff8800471df800, offset 1179648, size 128K

For the patched kernel, only 2 bios were allocated in the base case, with 0 splits.
[   20.034036] md_make_request: bio ffff880046a2ee00, offset 0, size 1024K
[   20.046104] md_make_request: bio ffff880046a2e500, offset 1048576, size 256K

In the worst case, 4 bios are allocated with 2 splits. One such worst case
is memory so fragmented that the 1M bio comprises 256 bi_phys_segments, so
it needs 2 splits.

1280K = 1M + 256K
ffff880046a30900 and ffff880046a21500 are the original bios.
ffff880046a30200 and ffff880046a21e00 are the split bios.

[   13.049323] md_make_request: bio ffff880046a30200, offset 0, size 504K
[   13.080057] md_make_request: bio ffff880046a21e00, offset 516096, size 504K
[   13.082857] md_make_request: bio ffff880046a30900, offset 1032192, size 16K
[   13.084983] md_make_request: bio ffff880046a21500, offset 1048576, size 256K

# cat /sys/block/md0/queue/split
discard split: 0, write same split: 0, segment split: 2

> But for me the bigger take away is: the old merge_bvec code (no late
> splitting) is more prone to allocation failure than the new code.

Yes, as I showed above.

> On that point alone I'm OK with this patchset going forward.
>
> I'll review the implementation details as they relate to DM now, but
> that is just a formality. My hope is that I'll be able to provide my
> Acked-by very soon.

Great! Thanks.

>
> Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

---
diff --git a/drivers/md/md.c b/drivers/md/md.c
index a4aa6e5..2fde2ce 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -259,6 +259,10 @@ static void md_make_request(struct request_queue *q, struct bio *bio)
 
 	blk_queue_split(q, &bio, q->bio_split);
 
+	if (!strcmp(current->comm, "dd") && bio_data_dir(bio) == WRITE)
+		printk("%s: bio %p, offset %lu, size %uK\n", __func__,
+			bio, bio->bi_iter.bi_sector << 9, bio->bi_iter.bi_size >> 10);
+
 	if (mddev == NULL || mddev->pers == NULL
 	    || !mddev->ready) {
 		bio_io_error(bio);
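For reference, the worst-case numbers above can be reproduced with a small
sketch. The split() helper here only models the size arithmetic, not the
actual blk_queue_split() logic, and it assumes 4K pages with one page per
physical segment (i.e. fully fragmented memory):

```python
# Worst-case split arithmetic for the 1280K dd request above.
# Assumes 4K pages and one page per physical segment; split() models
# only the size math, not real kernel code.
PAGE_KB = 4
MAX_SEGMENTS = 126            # from /sys/block/md0/queue/max_segments
BIO_MAX_KB = 256 * PAGE_KB    # 1024K: largest bio bio_add_page() builds

split_limit_kb = MAX_SEGMENTS * PAGE_KB   # 504K per submitted bio

def split(bio_kb, limit_kb):
    """Sizes a bio of bio_kb is cut into under a limit_kb cap."""
    chunks = []
    while bio_kb > limit_kb:
        chunks.append(limit_kb)
        bio_kb -= limit_kb
    chunks.append(bio_kb)
    return chunks

request_kb = 1280
bios = [BIO_MAX_KB, request_kb - BIO_MAX_KB]      # 1024K + 256K
submitted = [c for b in bios for c in split(b, split_limit_kb)]
splits = len(submitted) - len(bios)

print(submitted, splits)   # [504, 504, 16, 256] 2
```

This matches the four md_make_request lines (504K, 504K, 16K, 256K) and
the segment split count of 2 shown above.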