From patchwork Fri May 15 03:09:21 2009
X-Patchwork-Submitter: Kiyoshi Ueda
X-Patchwork-Id: 23945
X-Patchwork-Delegate: agk@redhat.com
Message-ID: <4A0CDCE1.50409@ct.jp.nec.com>
Date: Fri, 15 May 2009 12:09:21 +0900
From: Kiyoshi Ueda
To: Alasdair Kergon
Cc: device-mapper development
References: <49F1729A.1000906@ct.jp.nec.com> <49F17362.7080205@ct.jp.nec.com>
 <20090509004959.GJ1320@agk-dp.fab.redhat.com>
In-Reply-To: <20090509004959.GJ1320@agk-dp.fab.redhat.com>
Subject: [dm-devel] Re: [PATCH 3/3] dm-mpath: add service-time oriented
 dynamic load balancer
List-Id: device-mapper development

Hi Alasdair,

Thank you for the review and comments.
I agree with all of your comments, and I see that you have already
updated the patch in your editing tree, thanks.
However, I'm considering updating the patch further to make it simpler.
Please see below.

On 05/09/2009 09:49 AM +0900, Alasdair G Kergon wrote:
> On Fri, Apr 24, 2009 at 05:08:02PM +0900, Kiyoshi Ueda wrote:
>> +	struct selector *s = (struct selector *) ps->context;
>> +		     status_type_t type, char *result, unsigned int maxlen)
>> +	int sz = 0;
>> +	DMEMIT("%u %lu ", atomic_read(&pi->in_flight_size),
>
> (similar comments)

OK, you have already fixed the pointer cast, the sz type and the print
format for in_flight_size.  I'll fix the type of maxlen.
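For what it's worth, the status function I have in mind would look
roughly like the sketch below.  This is only an illustration using the
field names from my original patch (in_flight_size, perf, repeat_count);
the actual code in your editing tree may of course differ.

/*
 * Sketch only: use %d for the atomic_read() result and cast 'perf' to
 * unsigned long long for %llu, so the output is correct on both
 * 32-bit and 64-bit archs.  'sz' and 'maxlen' are unsigned.
 */
static int st_status(struct path_selector *ps, struct dm_path *path,
		     status_type_t type, char *result, unsigned maxlen)
{
	unsigned sz = 0;
	struct path_info *pi = path ? path->pscontext : NULL;

	if (!pi) {
		DMEMIT("0 ");
		return sz;
	}

	switch (type) {
	case STATUSTYPE_INFO:
		DMEMIT("%d %llu ", atomic_read(&pi->in_flight_size),
		       (unsigned long long) pi->perf);
		break;
	case STATUSTYPE_TABLE:
		DMEMIT("%u %llu ", pi->repeat_count,
		       (unsigned long long) pi->perf);
		break;
	}

	return sz;
}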
>> +/*
>> + * Returns:
>> + * < 0 : pi1 is better
>> + * 0   : no difference between pi1 and pi2
>> + * > 0 : pi2 is better
>> + */
>
> Please document the parameters and algorithm.
> Nothing explains what the performance value is for and examples of
> anticipated values.

OK, I'll add comments and documentation as in the attached patch.

>> +		DMEMIT("%u %lu ", pi->repeat_count, pi->perf);
>
> Need to handle both 32 and 64 bit archs: cast to llu.
>
>> +	if ((argc == 2) && (sscanf(argv[1], "%lu", &perf) != 1)) {
>
> Please do sscanf for size_t the same way as dm-crypt does.
> (Or scan for lu, if the intent is to impose a size limit ahead of the
> later do_div() then cast explicitly.)
>
>> +	do_div(sz1, pi1->perf);
>> +	do_div(sz2, pi2->perf);
>
> Or sector_div ?
>
> But what's the performance of those divisions like?
> Is there a better way of getting the same answer?
> Is there validation limiting the size of 'perf'?

I tried using multiplication before, but it didn't work well: when
'in_flight_size' and 'perf' had fairly large values, the multiplication
overflowed and I got wrong comparison results.  So I decided to use
division.
However, if I limit the 'perf' value to a smaller range (e.g. 0 to 100),
such overflow can't happen, so I should be able to use multiplication
after all.  Also, I then won't need size_t for 'perf', because
'unsigned' should be enough for such small values.

> (Also, did you check there's adequate locking & memory barriers around
> all the atomics?)

Yes, I did, and I found no problem in either dm-queue-length or
dm-service-time.

Thanks,
Kiyoshi Ueda


Add documentation and code comments for dm-service-time.

Signed-off-by: Kiyoshi Ueda
Signed-off-by: Jun'ichi Nomura
---
 Documentation/device-mapper/dm-service-time.txt |   87 ++++++++++++++++++++++++
 drivers/md/dm-service-time.c                    |   26 +++++--
 2 files changed, 107 insertions(+), 6 deletions(-)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Index: 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
===================================================================
--- /dev/null
+++ 2.6.30-rc5/Documentation/device-mapper/dm-service-time.txt
@@ -0,0 +1,87 @@
+dm-service-time
+===============
+
+dm-service-time is a path selector module for device-mapper targets,
+which selects a path with the shortest estimated service time for
+the incoming I/O.
+
+The service time for each path is estimated by dividing the total size
+of in-flight I/Os on a path by the performance value of the path.
+The performance value is a relative throughput value among all paths
+in a path-group, and it can be specified as a table argument.
+
+The path selector name is 'service-time'.
+
+Table parameters for each path: [<repeat_count> [<performance>]]
+	<repeat_count>: The number of I/Os to dispatch using the selected
+			path before switching to the next path.
+			If not given, the internal default is used.  To check
+			the default value, see the activated table.
+	<performance>:	The relative throughput value of the path
+			among all paths in the path-group.
+			If not given, the minimum value '1' is used.
+			If '0' is given, the path isn't selected while
+			other paths having a positive value are available.
+
+Status for each path: <status> <fail-count> <in-flight-size> <performance>
+	<status>:	'A' if the path is active, 'F' if the path is failed.
+	<fail-count>:	The number of path failures.
+	<in-flight-size>: The size of in-flight I/Os on the path.
+	<performance>:	The relative throughput value of the path
+			among all paths in the path-group.
+
+
+Algorithm
+=========
+
+dm-service-time adds the I/O size to 'in-flight-size' when the I/O is
+dispatched and subtracts it when completed.
+
+Basically, dm-service-time selects the path having the minimum service
+time, which is calculated by:
+
+	('in-flight-size' + 'size-of-incoming-io') / 'performance'
+
+However, the optimizations below are used to reduce the calculation
+as much as possible.
+
+	1. If the paths have the same 'performance', skip the division
+	   and just compare the 'in-flight-size'.
+
+	2. If the paths have the same 'in-flight-size', skip the division
+	   and just compare the 'performance'.
+
+	3. If some paths have a non-zero 'performance' and others have a
+	   zero 'performance', ignore the paths with a zero 'performance'.
+
+If none of these optimizations applies, calculate the service times and
+compare them.
+If the calculated service times are equal, the path having the larger
+'performance' may be better, so compare 'performance' in that case.
+
+
+Examples
+========
+Suppose 2 paths (sda and sdb) are used with repeat_count == 128, where
+sda has an average throughput of 1GB/s and sdb of 4GB/s; the
+'performance' value may then be '1' for sda and '4' for sdb.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4" | \
+  dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 1 8:16 128 4
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 1 8:16 A 0 0 4
+
+
+Alternatively, '2' for sda and '8' for sdb would work just as well,
+since only the ratio between the values matters.
+
+# echo "0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8" | \
+  dmsetup create test
+#
+# dmsetup table
+test: 0 10 multipath 0 0 1 1 service-time 0 2 2 8:0 128 2 8:16 128 8
+#
+# dmsetup status
+test: 0 10 multipath 2 0 0 0 1 1 E 0 2 2 8:0 A 0 0 2 8:16 A 0 0 8
Index: 2.6.30-rc5/drivers/md/dm-service-time.c
===================================================================
--- 2.6.30-rc5.orig/drivers/md/dm-service-time.c
+++ 2.6.30-rc5/drivers/md/dm-service-time.c
@@ -105,22 +105,27 @@ static int st_add_path(struct path_selec
 	unsigned repeat_count = ST_MIN_IO;
 	unsigned long long tmpll = 1;
 
+	/*
+	 * Arguments: [<repeat_count> [<performance>]]
+	 *	<repeat_count>: The number of I/Os before switching path.
+	 *			If not given, default (ST_MIN_IO) is used.
+	 *	<performance>:	The relative throughput value of the path
+	 *			among all paths in the path-group.
+	 *			If not given, minimum value '1' is used.
+	 *			If '0' is given, the path isn't selected while
+	 *			other paths having a positive value are
+	 *			available.
+	 */
 	if (argc > 2) {
 		*error = "service-time ps: incorrect number of arguments";
 		return -EINVAL;
 	}
 
-	/* First path argument is number of I/Os before switching path. */
 	if ((argc > 0) && (sscanf(argv[0], "%u", &repeat_count) != 1)) {
 		*error = "service-time ps: invalid repeat count";
 		return -EINVAL;
 	}
 
-	/*
-	 * Second path argument is a relative performance value.
-	 * If 0 is given, the path isn't used while other paths having
-	 * a positive value are available.
-	 */
 	if ((argc == 2) && (sscanf(argv[1], "%llu", &tmpll) != 1)) {
 		*error = "service-time ps: invalid performance value";
 		return -EINVAL;
@@ -164,10 +169,19 @@ static int st_reinstate_path(struct path
 }
 
 /*
+ * Compare the estimated service time of 2 paths, pi1 and pi2,
+ * for the incoming I/O.
+ *
  * Returns:
  * < 0 : pi1 is better
  * 0   : no difference between pi1 and pi2
  * > 0 : pi2 is better
+ *
+ * Description:
+ * Basically, the service time is estimated by:
+ *	('pi->in-flight-size' + 'incoming') / 'pi->perf'
+ * To reduce the calculation, some optimizations are made.
+ * (See comments inline)
  */
 static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
 			   size_t incoming)
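To show what the multiplication-based comparison mentioned in the mail
above could look like once 'perf' is limited to a small range (e.g. 0 to
100) and stored as an 'unsigned', here is a rough sketch.  It is only an
illustration of the idea, not part of this patch, and it assumes the
range check is done at st_add_path() time.

/*
 * Sketch only: compare two paths by estimated service time without
 * do_div()/sector_div().  Assumes pi->perf has been validated to a
 * small range (e.g. 0..100) when the path was added, so the products
 * below stay well within size_t for realistic in-flight sizes.
 */
static int st_compare_load(struct path_info *pi1, struct path_info *pi2,
			   size_t incoming)
{
	size_t sz1 = atomic_read(&pi1->in_flight_size);
	size_t sz2 = atomic_read(&pi2->in_flight_size);
	size_t st1, st2;	/* relative service time estimates */

	/* Same performance: the less loaded path is better. */
	if (pi1->perf == pi2->perf) {
		if (sz1 != sz2)
			return sz1 < sz2 ? -1 : 1;
		return 0;
	}

	/* A path with zero performance loses to any usable path. */
	if (!pi1->perf)
		return 1;
	if (!pi2->perf)
		return -1;

	/*
	 * Compare (sz1 + incoming) / perf1 with (sz2 + incoming) / perf2
	 * by cross-multiplication instead of division:
	 *	(sz1 + incoming) * perf2  vs.  (sz2 + incoming) * perf1
	 */
	st1 = (sz1 + incoming) * pi2->perf;
	st2 = (sz2 + incoming) * pi1->perf;
	if (st1 != st2)
		return st1 < st2 ? -1 : 1;

	/*
	 * Equal service time estimates; prefer the path with the higher
	 * performance (the performances are known to differ here).
	 */
	return pi1->perf > pi2->perf ? -1 : 1;
}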