From patchwork Thu Sep 17 09:14:22 2009
X-Patchwork-Submitter: Junichi Nomura
X-Patchwork-Id: 48261
Message-ID: <4AB1FDEE.5020500@ce.jp.nec.com>
Date: Thu, 17 Sep 2009 18:14:22 +0900
From: "Jun'ichi Nomura"
To: device-mapper development, Mike Snitzer, Alasdair Kergon
Cc: Jens Axboe
Subject: Re: [dm-devel] fragmented i/o with 2.6.31?
References: <448b15030909160834j2b127c83jab163e1860fc9aa1@mail.gmail.com>
 <448b15030909160922o84c2d6gc8ead8226dd8777a@mail.gmail.com>
 <4AB1ED1F.1010203@ct.jp.nec.com>
In-Reply-To: <4AB1ED1F.1010203@ct.jp.nec.com>
List-Id: device-mapper development

Hi Mike, Alasdair,

Kiyoshi Ueda wrote:
> On 09/17/2009 01:22 AM +0900, David Strand wrote:
>> On Wed, Sep 16, 2009 at 8:34 AM, David Strand wrote:
>>> I am issuing 512 Kbyte reads through the device mapper device node to
>>> a fibre channel disk. With 2.6.30 one read command for the entire 512
>>> Kbyte length is placed on the wire. With 2.6.31 this is being broken
>>> up into 5 smaller read commands placed on the wire, decreasing
>>> performance.
>>>
>>> This is especially penalizing on some disks where we have prefetch
>>> turned off via the scsi mode page.
>>> Is there any easy way (through
>>> configuration or sysfs) to restore the single read per i/o behavior
>>> that I used to get?
>>
>> I should note that I am using dm-mpath, and the i/o is fragmented on
>> the wire when using the device mapper device node but it is not
>> fragmented when using one of the regular /dev/sd* device nodes for
>> that device.
>
> David,
> Thank you for reporting this.
> I found on my test machine that max_sectors is set to SAFE_MAX_SECTORS,
> which keeps the I/O size small.
> The attached patch fixes it. I guess the patch (and increasing the
> read-ahead size in /sys/block/dm-<n>/queue/read_ahead_kb) will solve
> your fragmentation issue. Please try it.
>
>
> Mike, Alasdair,
> I found that max_sectors and max_hw_sectors of a dm device are set
> to smaller values than those of the underlying devices. E.g.:
> # cat /sys/block/sdj/queue/max_sectors_kb
> 512
> # cat /sys/block/sdj/queue/max_hw_sectors_kb
> 32767
> # echo "0 10 linear /dev/sdj 0" | dmsetup create test
> # cat /sys/block/dm-0/queue/max_sectors_kb
> 127
> # cat /sys/block/dm-0/queue/max_hw_sectors_kb
> 127
> This prevents the I/O size of struct request from becoming big enough
> and causes undesired request fragmentation in request-based dm.
>
> This should be caused by the queue_limits stacking.
> In dm_calculate_queue_limits(), the block layer's small default size
> is included in the merging process of the target's queue_limits,
> so the underlying queue_limits are not propagated correctly.
>
> I think initializing the default values of all max_* to '0' is an easy fix.
> Do you think my patch is acceptable?
> Any other idea to fix this problem?

Well, sorry, we jumped the gun.
The patch should work fine for dm-multipath, but setting '0' by default
will cause problems on targets like 'zero' and 'error', which take no
underlying device and use the default value as-is.

> 	blk_set_default_limits(limits);
> +	limits->max_sectors = 0;
> +	limits->max_hw_sectors = 0;

So this should either set something very big by default (e.g. UINT_MAX),
or set 0 by default and change it to a certain safe value if the end
result of merging the limits is still 0.
Attached is a revised patch taking the latter approach. Please check it.

If the approach is fine, I think we should ask Jens whether to keep
these helpers in dm-table.c or move them to block/blk-settings.c.
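For reference, below is a minimal userspace sketch of the merging behaviour
described above. It is an illustration only, not the actual blk_stack_limits()
code; it assumes the min_not_zero semantics named in the patch comment and
SAFE_MAX_SECTORS = 255 sectors (which matches the 127 KB seen in the sysfs
output above).

	/*
	 * Sketch of why the small default dominates the stacked limit,
	 * and how the "0 default + fixup" approach avoids it.
	 */
	#include <stdio.h>

	#define SAFE_MAX_SECTORS 255	/* 127 KB in 512-byte sectors */

	static unsigned int min_not_zero(unsigned int a, unsigned int b)
	{
		if (a == 0)
			return b;
		if (b == 0)
			return a;
		return a < b ? a : b;
	}

	int main(void)
	{
		unsigned int sdj = 1024;	/* max_sectors_kb = 512 -> 1024 sectors */
		unsigned int merged;

		/* Current behaviour: dm starts from SAFE_MAX_SECTORS, which wins. */
		merged = min_not_zero(SAFE_MAX_SECTORS, sdj);
		printf("default %u -> merged %u\n", SAFE_MAX_SECTORS, merged);	/* 255 */

		/* Patched: start from 0 so the underlying device's limit wins. */
		merged = min_not_zero(0, sdj);
		printf("default 0 -> merged %u\n", merged);			/* 1024 */

		/*
		 * Targets with no underlying device ('zero', 'error') leave the
		 * limit at 0, so adjust it to the safe default afterwards.
		 */
		merged = min_not_zero(0, 0);
		if (merged == 0)
			merged = SAFE_MAX_SECTORS;
		printf("no device -> merged %u\n", merged);			/* 255 */

		return 0;
	}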
Thanks,

Index: linux-2.6.31/drivers/md/dm-table.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-table.c
+++ linux-2.6.31/drivers/md/dm-table.c
@@ -647,6 +647,28 @@ int dm_split_args(int *argc, char ***arg
 }
 
 /*
+ * blk_stack_limits() chooses min_not_zero max_sectors value of underlying
+ * devices. So set the default to 0.
+ * Otherwise, the default SAFE_MAX_SECTORS dominates even if all underlying
+ * devices have max_sectors values larger than that.
+ */
+static void _set_default_limits_for_stacking(struct queue_limits *limits)
+{
+	blk_set_default_limits(limits);
+	limits->max_sectors = 0;
+	limits->max_hw_sectors = 0;
+}
+
+/* If there's no underlying device, use the default value in blockdev. */
+static void _adjust_limits_for_stacking(struct queue_limits *limits)
+{
+	if (limits->max_sectors == 0)
+		limits->max_sectors = SAFE_MAX_SECTORS;
+	if (limits->max_hw_sectors == 0)
+		limits->max_hw_sectors = SAFE_MAX_SECTORS;
+}
+
+/*
  * Impose necessary and sufficient conditions on a devices's table such
  * that any incoming bio which respects its logical_block_size can be
  * processed successfully.  If it falls across the boundary between
@@ -684,7 +706,7 @@ static int validate_hardware_logical_blo
 	while (i < dm_table_get_num_targets(table)) {
 		ti = dm_table_get_target(table, i++);
 
-		blk_set_default_limits(&ti_limits);
+		_set_default_limits_for_stacking(&ti_limits);
 
 		/* combine all target devices' limits */
 		if (ti->type->iterate_devices)
@@ -707,6 +729,8 @@ static int validate_hardware_logical_blo
 			device_logical_block_size_sects - next_target_start : 0;
 	}
 
+	_adjust_limits_for_stacking(limits);
+
 	if (remaining) {
 		DMWARN("%s: table line %u (start sect %llu len %llu) "
 		       "not aligned to h/w logical block size %u",
@@ -991,10 +1015,10 @@ int dm_calculate_queue_limits(struct dm_
 	struct queue_limits ti_limits;
 	unsigned i = 0;
 
-	blk_set_default_limits(limits);
+	_set_default_limits_for_stacking(limits);
 
 	while (i < dm_table_get_num_targets(table)) {
-		blk_set_default_limits(&ti_limits);
+		_set_default_limits_for_stacking(&ti_limits);
 
 		ti = dm_table_get_target(table, i++);
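(Expected behaviour with the revised patch, for reference: re-running the
dmsetup test above should make /sys/block/dm-0/queue/max_sectors_kb and
max_hw_sectors_kb report 512 and 32767 respectively, inherited from sdj,
while targets with no underlying device, such as 'zero' and 'error', should
keep the 127 KB safe default.)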