From patchwork Thu Sep 17 09:14:22 2009
X-Patchwork-Submitter: Junichi Nomura
X-Patchwork-Id: 48261
Message-ID: <4AB1FDEE.5020500@ce.jp.nec.com>
Date: Thu, 17 Sep 2009 18:14:22 +0900
From: "Jun'ichi Nomura"
To: device-mapper development, Mike Snitzer, Alasdair Kergon
Cc: Jens Axboe
Subject: Re: [dm-devel] fragmented i/o with 2.6.31?
References: <448b15030909160834j2b127c83jab163e1860fc9aa1@mail.gmail.com>
 <448b15030909160922o84c2d6gc8ead8226dd8777a@mail.gmail.com>
 <4AB1ED1F.1010203@ct.jp.nec.com>
In-Reply-To: <4AB1ED1F.1010203@ct.jp.nec.com>
List-Id: device-mapper development

Hi Mike, Alasdair,

Kiyoshi Ueda wrote:
> On 09/17/2009 01:22 AM +0900, David Strand wrote:
>> On Wed, Sep 16, 2009 at 8:34 AM, David Strand wrote:
>>> I am issuing 512 Kbyte reads through the device mapper device node to
>>> a fibre channel disk. With 2.6.30 one read command for the entire 512
>>> Kbyte length is placed on the wire. With 2.6.31 this is being broken
>>> up into 5 smaller read commands placed on the wire, decreasing
>>> performance.
>>>
>>> This is especially penalizing on some disks where we have prefetch
>>> turned off via the scsi mode page.
>>> Is there any easy way (through
>>> configuration or sysfs) to restore the single read per i/o behavior
>>> that I used to get?
>>
>> I should note that I am using dm-mpath, and the i/o is fragmented on
>> the wire when using the device mapper device node but it is not
>> fragmented when using one of the regular /dev/sd* device nodes for
>> that device.
>
> David,
> Thank you for reporting this.
> I found on my test machine that max_sectors is set to SAFE_MAX_SECTORS,
> which keeps the I/O size small.
> The attached patch fixes it. I guess the patch (and increasing the
> read-ahead size in /sys/block/dm-<n>/queue/read_ahead_kb) will solve
> your fragmentation issue. Please try it.
>
>
> Mike, Alasdair,
> I found that max_sectors and max_hw_sectors of a dm device are set
> to smaller values than those of the underlying devices. E.g.:
> # cat /sys/block/sdj/queue/max_sectors_kb
> 512
> # cat /sys/block/sdj/queue/max_hw_sectors_kb
> 32767
> # echo "0 10 linear /dev/sdj 0" | dmsetup create test
> # cat /sys/block/dm-0/queue/max_sectors_kb
> 127
> # cat /sys/block/dm-0/queue/max_hw_sectors_kb
> 127
> This prevents the I/O size of struct request from becoming big enough
> and causes undesired request fragmentation in request-based dm.
>
> This should be caused by the queue_limits stacking.
> In dm_calculate_queue_limits(), the block layer's small default size
> is included in the merging process of the target's queue_limits,
> so the underlying queue_limits are not propagated correctly.
>
> I think initializing the default values of all max_* to '0' is an easy fix.
> Do you think my patch is acceptable?
> Any other idea to fix this problem?

Well, sorry, we jumped the gun.
The patch should work fine for dm-multipath, but setting '0' by default
will cause problems on targets like 'zero' and 'error', which take no
underlying device and use the default value as-is.

> 	blk_set_default_limits(limits);
> +	limits->max_sectors = 0;
> +	limits->max_hw_sectors = 0;

So this should either set something very big by default (e.g. UINT_MAX),
or set 0 by default and change it to a certain safe value if the end
result of merging the limits is still 0.
Attached is a revised patch taking the latter approach. Please check it.

If the approach is fine, I think we should ask Jens whether to keep
these helpers in dm-table.c or move them to block/blk-settings.c.
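For reference, below is a minimal userspace sketch of the merging behaviour
described above. It is an illustration only, not the actual blk_stack_limits()
code; it assumes the min_not_zero semantics named in the patch comment and
SAFE_MAX_SECTORS = 255 sectors (which matches the 127 KB seen in the sysfs
output above).

	/*
	 * Sketch of why the small default dominates the stacked limit,
	 * and how the "0 default + fixup" approach avoids it.
	 */
	#include <stdio.h>

	#define SAFE_MAX_SECTORS 255	/* 127 KB in 512-byte sectors */

	static unsigned int min_not_zero(unsigned int a, unsigned int b)
	{
		if (a == 0)
			return b;
		if (b == 0)
			return a;
		return a < b ? a : b;
	}

	int main(void)
	{
		unsigned int sdj = 1024;	/* max_sectors_kb = 512 -> 1024 sectors */
		unsigned int merged;

		/* Current behaviour: dm starts from SAFE_MAX_SECTORS, which wins. */
		merged = min_not_zero(SAFE_MAX_SECTORS, sdj);
		printf("default %u -> merged %u\n", SAFE_MAX_SECTORS, merged);	/* 255 */

		/* Patched: start from 0 so the underlying device's limit wins. */
		merged = min_not_zero(0, sdj);
		printf("default 0 -> merged %u\n", merged);			/* 1024 */

		/*
		 * Targets with no underlying device ('zero', 'error') leave the
		 * limit at 0, so adjust it to the safe default afterwards.
		 */
		merged = min_not_zero(0, 0);
		if (merged == 0)
			merged = SAFE_MAX_SECTORS;
		printf("no device -> merged %u\n", merged);			/* 255 */

		return 0;
	}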
Thanks,

Index: linux-2.6.31/drivers/md/dm-table.c
===================================================================
--- linux-2.6.31.orig/drivers/md/dm-table.c
+++ linux-2.6.31/drivers/md/dm-table.c
@@ -647,6 +647,28 @@ int dm_split_args(int *argc, char ***arg
 }
 
 /*
+ * blk_stack_limits() chooses min_not_zero max_sectors value of underlying
+ * devices. So set the default to 0.
+ * Otherwise, the default SAFE_MAX_SECTORS dominates even if all underlying
+ * devices have max_sectors values larger than that.
+ */
+static void _set_default_limits_for_stacking(struct queue_limits *limits)
+{
+	blk_set_default_limits(limits);
+	limits->max_sectors = 0;
+	limits->max_hw_sectors = 0;
+}
+
+/* If there's no underlying device, use the default value in blockdev. */
+static void _adjust_limits_for_stacking(struct queue_limits *limits)
+{
+	if (limits->max_sectors == 0)
+		limits->max_sectors = SAFE_MAX_SECTORS;
+	if (limits->max_hw_sectors == 0)
+		limits->max_hw_sectors = SAFE_MAX_SECTORS;
+}
+
+/*
  * Impose necessary and sufficient conditions on a devices's table such
  * that any incoming bio which respects its logical_block_size can be
  * processed successfully.  If it falls across the boundary between
@@ -684,7 +706,7 @@ static int validate_hardware_logical_blo
 	while (i < dm_table_get_num_targets(table)) {
 		ti = dm_table_get_target(table, i++);
 
-		blk_set_default_limits(&ti_limits);
+		_set_default_limits_for_stacking(&ti_limits);
 
 		/* combine all target devices' limits */
 		if (ti->type->iterate_devices)
@@ -707,6 +729,8 @@ static int validate_hardware_logical_blo
 			device_logical_block_size_sects - next_target_start : 0;
 	}
 
+	_adjust_limits_for_stacking(limits);
+
 	if (remaining) {
 		DMWARN("%s: table line %u (start sect %llu len %llu) "
 		       "not aligned to h/w logical block size %u",
@@ -991,10 +1015,10 @@ int dm_calculate_queue_limits(struct dm_
 	struct queue_limits ti_limits;
 	unsigned i = 0;
 
-	blk_set_default_limits(limits);
+	_set_default_limits_for_stacking(limits);
 
 	while (i < dm_table_get_num_targets(table)) {
-		blk_set_default_limits(&ti_limits);
+		_set_default_limits_for_stacking(&ti_limits);
 
 		ti = dm_table_get_target(table, i++);
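(Expected behaviour with the revised patch, for reference: re-running the
dmsetup test above should make /sys/block/dm-0/queue/max_sectors_kb and
max_hw_sectors_kb report 512 and 32767 respectively, inherited from sdj,
while targets with no underlying device, such as 'zero' and 'error', should
keep the 127 KB safe default.)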