diff mbox

Re: [PATCH 3/3] dm-mpath: add service-time oriented dynamic load balancer

Message ID 4A0CDCE1.50409@ct.jp.nec.com (mailing list archive)
State Accepted, archived
Delegated to: Alasdair Kergon
Headers show

Commit Message

Kiyoshi Ueda May 15, 2009, 3:09 a.m. UTC
Hi Alasdair,

Thank you for the review and comments.
I'm totally agree with your comments.

I found that you have already updated the patch in your editing tree, thanks.
But I'm considering to update the patch to be simpler.
Please see below.

On 05/09/2009 09:49 AM +0900, Alasdair G Kergon wrote:
> On Fri, Apr 24, 2009 at 05:08:02PM +0900, Kiyoshi Ueda wrote:
>> +	struct selector *s = (struct selector *) ps->context;
>> +		     status_type_t type, char *result, unsigned int maxlen)
>> +	int sz = 0;
>> +			DMEMIT("%u %lu ", atomic_read(&pi->in_flight_size),
> 
> (similar comments)
 
OK, you have already fixed the pointer cast, the sz type and
the print format for in_flight_size.
I'll fix the type of maxlen.


>> +/*
>> + * Returns:
>> + * < 0 : pi1 is better
>> + * 0   : no difference between pi1 and pi2
>> + * > 0 : pi2 is better
>> + */
> 
> Please document the parameters and algorithm.
> Nothing explains what the performance value is for and examples of anticipated values.

OK, I'll add comments and documentation like the attached patch.
 

>> +			DMEMIT("%u %lu ", pi->repeat_count, pi->perf);
> 
> Need to handle both 32 and 64 bit archs: cast to llu.
>
>> +	if ((argc == 2) && (sscanf(argv[1], "%lu", &perf) != 1)) {
> 
> Please do sscanf for size_t the same way as dm-crypt does.
> (Or scan for lu, if the intent is to impose a size limit ahead of the later
> do_div() then cast explicitly.)
>
>> +	do_div(sz1, pi1->perf);
>> +	do_div(sz2, pi2->perf);
> 
> Or sector_div ?
>
> But what's the performance of those divisions like?
> Is there a better way of getting the same answer?
> Is there validation limiting the size of 'perf'?

I tried to use multiplication before, but it didn't work well,
because overflow happened when 'in_flight_size' and 'perf' had
pretty big values, and then I got wrong comparison results.
So I decided to use division.

However, if I limit the 'perf' value in smaller range (e.g. 0 to 100),
such overflow shouldn't happen.  So I should be able to use
multiplication.  Also, I don't need to use size_t for 'perf',
because 'unsinged' should be enough for such small values.

 
> (Also, did you check there's adequate locking & memory barriers around all the
> atomics?)

Yes, I did and I found no problem in both dm-queue-length and
dm-service-time.

Thanks,
Kiyoshi Ueda
Add documents/comments for dm-service-time.

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
---
 Documentation/device-mapper/dm-service-time.txt |   87 ++++++++++++++++++++++++
 drivers/md/dm-service-time.c                    |   26 +++++--
 2 files changed, 107 insertions(+), 6 deletions(-)
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Alasdair G Kergon May 18, 2009, 11:46 a.m. UTC | #1
On Fri, May 15, 2009 at 12:09:21PM +0900, Kiyoshi Ueda wrote:
> However, if I limit the 'perf' value in smaller range (e.g. 0 to 100),
> such overflow shouldn't happen.  So I should be able to use
> multiplication.  Also, I don't need to use size_t for 'perf',
> because 'unsinged' should be enough for such small values.
 
I think a small range would be adequate - look at the number sizes and
go 0-100 or 0-1000 as appropriate.

Thinking of naming, is it 'relative_throughput' ?

However, is it also going to be possible to extend this target to measure the
the actual throughput?  If we did that, would we use a special value to 
indicate it or add a new table parameter and keep the 'perf' ones as a
manual adjustment factor applied on top of the calculated one?

Alasdair

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Kiyoshi Ueda May 19, 2009, 2:59 a.m. UTC | #2
Hi Alasdair,

On 05/18/2009 08:46 PM +0900, Alasdair G Kergon wrote:
> On Fri, May 15, 2009 at 12:09:21PM +0900, Kiyoshi Ueda wrote:
>> However, if I limit the 'perf' value in smaller range (e.g. 0 to 100),
>> such overflow shouldn't happen.  So I should be able to use
>> multiplication.  Also, I don't need to use size_t for 'perf',
>> because 'unsinged' should be enough for such small values.
>  
> I think a small range would be adequate - look at the number sizes and
> go 0-100 or 0-1000 as appropriate.
> 
> Thinking of naming, is it 'relative_throughput' ?

Yes, 'relative_throughput' may be better naming.  I'll take it.


> However, is it also going to be possible to extend this target to measure the
> the actual throughput?  If we did that, would we use a special value to 
> indicate it or add a new table parameter and keep the 'perf' ones as a
> manual adjustment factor applied on top of the calculated one?

The older version of dm-service-time calculated the actual throughput
dynamically using part_stat:
    http://marc.info/?l=dm-devel&m=123321314426149&w=2

But the calculated values often had some noises and it didn't work well.
(e.g. Once the throughput value of a path was calculated so small,
      the path was never used, and the small throughput value was never
      updated, and the path was never used, and ...)

So I think doing such calculations in user-space and simplifying
the kernel-code are better.
For dynamically changing path's bandwidth, user-space daemon can
measure the throughput periodically and notify the dm device.
I've been thinking of adding a new dm message handler to update
'relative_thoughput' in future.
What do you think?

Thanks,
Kiyoshi Ueda

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
diff mbox

Patch

Index: 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
===================================================================
--- /dev/null
+++ 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
@@ -0,0 +1,87 @@ 
+dm-service-time
+===============
+
+dm-service-time is a path selector module for device-mapper targets,
+which selects a path with the shortest estimated service time for
+the incoming I/O.
+
+The service time for each path is estimated by dividing the total size
+of in-flight I/Os on a path with the performance value of the path.
+The performance value is a relative throughput value among all paths
+in a path-group, and it can be specified as a table argument.
+
+The path selector name is 'service-time'.
+
+Table parameters for each path: [<repeat_count> [<performance>]]
+	<repeat_count>: The number of I/Os to dispatch using the selected
+			path before switching to the next path.
+			If not given, internal default is used.  To check
+			the default value, see the activated table.
+	<performance>: The relative throughput value of the path
+		       among all paths in the path-group.
+		       If not given, minimum value '1' is used.
+		       If '0' is given, the path isn't selected while
+		       other paths having a positive value are available.
+
+Status for each path: <status> <fail-count> <in-flight-size> <performance>
+	<status>: 'A' if the path is active, 'F' if the path is failed.
+	<fail-count>: The number of path failures.
+	<in-flight-size>: The size of in-flight I/Os on the path.
+	<performance>: The relative throughput value of the path
+		       among all paths in the path-group.
+
+
+Algorithm
+=========
+
+dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
+dispatched and substracts when completed.
+Basically, dm-service-time selects a path having minimum service time
+which is calculated by:
+
+	('in-flight-size' + 'size-of-incoming-io') / 'performance'
+
+However, some optimizations below are used to reduce the calculation
+as much as possible.
+
+	1. If the paths have the same 'performance', skip the division
+	   and just compare the 'in-flight-size'.
+
+	2. If the paths have the same 'in-flight-size', skip the division
+	   and just compare the 'performance'.
+
+	3. If some paths have non-zero 'performance' and others have zero
+	   'performance', ignore those paths with zero 'performance'.
+
+If such optimizations can't be applied, calculate service time, and
+compare service time.
+If calculated service time is equal, the path having maximum 'performance'
+may be better.  So compare 'performance' then.
+
+
+Examples
+========
+In case that 2 paths (sda and sdb) are used with repeat_count == 128
+and sda has an average throughput 1GB/s and sdb has 4GB/s, 'performance'
+value may be '1' for sda and '4' for sdb.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" \
+  dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
+
+
+Or '2' for sda and '8' for sdb would be also true.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" \
+  dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
Index: 2.6.30-rc5/drivers/md/dm-service-time.c
===================================================================
--- 2.6.30-rc5.orig/drivers/md/dm-service-time.c
+++ 2.6.30-rc5/drivers/md/dm-service-time.c
@@ -105,22 +105,27 @@  static int st_add_path(struct path_selec
 	unsigned repeat_count = ST_MIN_IO;
 	unsigned long long tmpll = 1;
 
+	/*
+	 * Arguments: [<repeat_count> [<performance>]]
+	 * 	<repeat_count>: The number of I/Os before switching path.
+	 * 			If not given, default (ST_MIN_IO) is used.
+	 * 	<performance>: The relative throughput value of the path
+	 *		       among all paths in the path-group.
+	 *		       If not given, minimum value '1' is used.
+	 *		       If '0' is given, the path isn't selected while
+	 * 		       other paths having a positive value are
+	 * 		       available.
+	 */
 	if (argc > 2) {
 		*error = "service-time ps: incorrect number of arguments";
 		return -EINVAL;
 	}
 
-	/* First path argument is number of I/Os before switching path. */
 	if ((argc > 0) && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
 		*error = "service-time ps: invalid repeat count";
 		return -EINVAL;
 	}
 
-	/*
-	 * Second path argument is a relative performance value.
-	 * If 0 is given, the path isn't used while other paths having
-	 * a positive value are available.
-	 */
 	if ((argc == 2) && (sscanf(argv[1], "%llu", &tmpll) != 1)) {
 		*error = "service-time ps: invalid performance value";
 		return -EINVAL;
@@ -164,10 +169,19 @@  static int st_reinstate_path(struct path
 }
 
 /*
+ * Compare the estimated service time of 2 paths, pi1 and pi2,
+ * for the incoming I/O.
+ *
  * Returns:
  * < 0 : pi1 is better
  * 0   : no difference between pi1 and pi2
  * > 0 : pi2 is better
+ *
+ * Description:
+ * Basically, the service time is estimated by:
+ *     ('pi->in-flight-size' + 'incoming') / 'pi->perf'
+ * To reduce the calculation, some optimizations are made.
+ * (See comments inline)
  */
 static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
 			   size_t incoming)