From patchwork Tue Jan 8 13:54:05 2013
X-Patchwork-Submitter: Hannes Reinecke
X-Patchwork-Id: 1946271
From: Hannes Reinecke
To: Christophe Varoqui
Cc: Martin Wilck, dm-devel@redhat.com
Date: Tue, 8 Jan 2013 14:54:05 +0100
Message-Id: <1357653259-62650-28-git-send-email-hare@suse.de>
In-Reply-To: <1357653259-62650-1-git-send-email-hare@suse.de>
References: <1357653259-62650-1-git-send-email-hare@suse.de>
Subject: [dm-devel] [PATCH 27/42] Update 'no_path_retry' correctly for failed paths

The bug is triggered if a path failed event is received by multipathd
after all paths have already been marked as failed. Surprisingly, this
seems to happen quite often; a colleague of mine who tested this hit
the bug every time.

Here is the event sequence that explains the bug. I kept only some of
the messages for clarity; the full log is available on request.

We have completed initialization and set the feature queue_if_no_path
for map CX_201 by virtue of using no_path_retry > 0.

Aug 31 10:49:09 | CX_201: devmap event #18
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)
Aug 31 10:49:09 | 65:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 3
Aug 31 10:49:09 | 8:192: mark as failed
Aug 31 10:49:09 | CX_201: remaining active paths: 2
Aug 31 10:49:09 | CX_201: devmap event #19
Aug 31 10:49:09 | CX_201: discover
Aug 31 10:49:09 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:09 | CX_201: pgfailback = -2 (controller setting)
Aug 31 10:49:09 | CX_201: no_path_retry = 2 (controller setting)
Aug 31 10:49:09 | pg_timeout = NONE (internal default)

Two paths were failed by the driver, and multipathd marked them as
failed.

Aug 31 10:49:09 | checker failed path 66:0 in map CX_201
Aug 31 10:49:09 | CX_201: remaining active paths: 1

The checker failed the third path.

Aug 31 10:49:09 | checker failed path 8:96 in map CX_201
Aug 31 10:49:09 | CX_201: Entering recovery mode: max_retries=2
Aug 31 10:49:09 | CX_201: remaining active paths: 0

The checker failed the last path; multipathd entered the retry loop.

Aug 31 10:49:10 | CX_201: devmap event #20

We got a late event about a failed path.

Aug 31 10:49:10 | CX_201: discover

Start discovery. Call update_multipath -> setup_multipath ->
update_multipath_strings -> update_multipath_table -> disassemble_map.
Now disassemble_map tries to set the no_path_retry value from the
kernel. This obviously is not going to work, as the kernel can only
remember a Boolean (queue/fail), while no_path_retry is an arbitrary
integer. So no_path_retry is set to NO_PATH_RETRY_QUEUE from the kernel.

Aug 31 10:49:10 | CX_201: rr_weight = 1 (internal default)
Aug 31 10:49:10 | CX_201: pgfailback = -2 (controller setting)

At this point we call set_no_path_retry:

    set_no_path_retry(struct multipath *mpp)
    {
            mpp->retry_tick = 0;
            mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
            if (mpp->nr_active > 0)
                    select_no_path_retry(mpp);

So
1) retry_tick is reset
2) nr_active = 0 (no active path)
3) we do not set no_path_retry from the config file because
   nr_active == 0 => we are left with NO_PATH_RETRY_QUEUE.

Aug 31 10:49:10 | pg_timeout = NONE (internal default)

From now on there are no state changes, so the map is hung forever.
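
The sequence above can be replayed in a few lines of standalone C. This
is only a simplified sketch, not libmultipath code: struct map_model,
MODEL_RETRY_QUEUE, configured_retry and all model_* functions are
hypothetical stand-ins, and the assumption is that select_no_path_retry
simply restores the configured value (2 in the log above).

```c
/* Simplified model of the hang; hypothetical names, not libmultipath code. */
#define MODEL_RETRY_QUEUE (-2) /* stand-in for NO_PATH_RETRY_QUEUE; value chosen for the model */

struct map_model {
	int no_path_retry;    /* value currently in effect */
	int configured_retry; /* value from config/controller, 2 in the log */
	int nr_active;        /* paths counted as PATH_UP or PATH_GHOST */
	int retry_tick;       /* retry countdown, reset on every update */
};

/* Models disassemble_map: the kernel only reports queue/fail. */
static void model_read_from_kernel(struct map_model *m)
{
	m->no_path_retry = MODEL_RETRY_QUEUE;
}

/* Before the patch: configured value restored only with active paths. */
static void model_set_retry_old(struct map_model *m)
{
	m->retry_tick = 0;
	if (m->nr_active > 0)
		m->no_path_retry = m->configured_retry;
}

/* After the patch: configured value restored unconditionally. */
static void model_set_retry_new(struct map_model *m)
{
	m->retry_tick = 0;
	m->no_path_retry = m->configured_retry;
}

/* Replays the late devmap event with all paths already failed and
 * returns the resulting no_path_retry value. */
static int model_late_event(int use_patched)
{
	struct map_model m = {
		.no_path_retry = 2,
		.configured_retry = 2,
		.nr_active = 0,
		.retry_tick = 0,
	};

	model_read_from_kernel(&m);
	if (use_patched)
		model_set_retry_new(&m);
	else
		model_set_retry_old(&m);
	return m.no_path_retry;
}
```

In this model, model_late_event(0) leaves the map at the queue-forever
value, while model_late_event(1) gets the configured retry count back,
which is what allows the retry loop to expire.
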

Signed-off-by: Martin Wilck
Signed-off-by: Hannes Reinecke
---
 libmultipath/structs_vec.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
index 384afb7..7073915 100644
--- a/libmultipath/structs_vec.c
+++ b/libmultipath/structs_vec.c
@@ -306,8 +306,7 @@ set_no_path_retry(struct multipath *mpp)
 {
 	mpp->retry_tick = 0;
 	mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
-	if (mpp->nr_active > 0)
-		select_no_path_retry(mpp);
+	select_no_path_retry(mpp);
 
 	switch (mpp->no_path_retry) {
 	case NO_PATH_RETRY_UNDEF: