
[2/4] multipath-tools: add checker callout to repair path

Message ID 57ACE12F.20700@redhat.com (mailing list archive)
State Not Applicable, archived
Delegated to: christophe varoqui

Commit Message

Mike Christie Aug. 11, 2016, 8:33 p.m. UTC
On 08/11/2016 10:50 AM, Bart Van Assche wrote:
> On 08/08/2016 05:01 AM, Mike Christie wrote:
>> This patch adds a callback which can be used to repair a path
>> if check() has determined it is in the PATH_DOWN state.
>>
>> The next patch adds rbd checker support, which will use this to
>> handle the case where an rbd device is blacklisted.
> 
> Hello Mike,
> 
> With this patch applied, with the TUR checker enabled in multipath.conf
> I see the following crash if I trigger SRP failover and failback:
> 
> ion-dev-ib-ini:~ # gdb ~bart/software/multipath-tools/multipathd/multipathd
> (gdb) handle SIGPIPE noprint nostop
> Signal        Stop      Print   Pass to program Description
> SIGPIPE       No        No      Yes             Broken pipe
> (gdb) run -d
> Aug 11 08:46:27 | sde: remove path (uevent)
> Aug 11 08:46:27 | mpathbe: adding map
> Aug 11 08:46:27 | 8:64: cannot find block device
> Aug 11 08:46:27 | Invalid device number 1
> Aug 11 08:46:27 | 1: cannot find block device
> Aug 11 08:46:27 | 8:96: cannot find block device
> Aug 11 08:46:27 | mpathbe: failed to setup multipath
> Aug 11 08:46:27 | dm-0: uev_add_map failed
> Aug 11 08:46:27 | uevent trigger error
> 
> Thread 4 "multipathd" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7f8b700 (LWP 8446)]
> 0x0000000000000000 in ?? ()
> (gdb) bt
> #0  0x0000000000000000 in ?? ()
> #1  0x00007ffff6c41905 in checker_repair (c=0x7fffdc001ef0) at checkers.c:225
> #2  0x000000000040a760 in repair_path (vecs=0x66d7e0, pp=0x7fffdc001a40)
>     at main.c:1733
> #3  0x000000000040ab27 in checkerloop (ap=0x66d7e0) at main.c:1807
> #4  0x00007ffff79bb474 in start_thread (arg=0x7ffff7f8b700)
>     at pthread_create.c:333
> #5  0x00007ffff63243ed in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> (gdb) up
> #1  0x00007ffff6c41905 in checker_repair (c=0x7fffdc001ef0) at checkers.c:225
> 225             c->repair(c);
> (gdb) print *c
> $1 = {node = {next = 0x0, prev = 0x0}, handle = 0x0, refcount = 0, fd = 0, 
>   sync = 0, timeout = 0, disable = 0, name = '\000' <repeats 15 times>, 
>   message = '\000' <repeats 255 times>, context = 0x0, mpcontext = 0x0, 
>   check = 0x0, repair = 0x0, init = 0x0, free = 0x0}
> 

Sorry about the stupid bug.

Could you try the attached patch? I found two segfaults. If check_path
returns less than 0, then we free the path, so we cannot call repair
on it. If libcheck_init fails, it memsets the checker, so we cannot call
repair on it either.

I moved the repair call to the specific places where the path is down.
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Comments

Bart Van Assche Aug. 11, 2016, 9:41 p.m. UTC | #1
On 08/11/2016 01:33 PM, Mike Christie wrote:
> Could you try the attached patch? I found two segfaults. If check_path
> returns less than 0, then we free the path, so we cannot call repair
> on it. If libcheck_init fails, it memsets the checker, so we cannot call
> repair on it either.
>
> I moved the repair call to the specific places where the path is down.

Hello Mike,

Thanks for the patch. Unfortunately even with this patch applied I can 
still trigger a segfault sporadically:

# valgrind --read-var-info=yes multipathd -d
Aug 11 14:02:21 | mpathbf: load table [0 2097152 multipath 3 queue_if_no_path pg_init_retries 50 0 2 1 queue-length 0 1 1 8:160 1000 queue-length 0 1 1 8:64 1000]
Aug 11 14:02:21 | mpathbf: event checker started
Aug 11 14:02:21 | sdk [8:160]: path added to devmap mpathbf
Aug 11 14:02:21 | sdd: add path (uevent)
==2452== Thread 4:
==2452== Jump to the invalid address stated on the next line
==2452==    at 0x0: ???
==2452==    by 0x409BBE: repair_path (main.c:1451)
==2452==    by 0x40A905: check_path (main.c:1715)
==2452==    by 0x40AE72: checkerloop (main.c:1808)
==2452==    by 0x5047473: start_thread (pthread_create.c:333)
==2452==    by 0x671B3EC: clone (clone.S:109)
==2452==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2452==
==2452==
==2452== Process terminating with default action of signal 11 (SIGSEGV)
==2452==  Bad permissions for mapped region at address 0x0
==2452==    at 0x0: ???
==2452==    by 0x409BBE: repair_path (main.c:1451)
==2452==    by 0x40A905: check_path (main.c:1715)
==2452==    by 0x40AE72: checkerloop (main.c:1808)
==2452==    by 0x5047473: start_thread (pthread_create.c:333)
==2452==    by 0x671B3EC: clone (clone.S:109)
==2452==

(gdb) list main.c:1451
1446    void repair_path(struct path * pp)
1447    {
1448            if (pp->state != PATH_DOWN)
1449                    return;
1450
1451            checker_repair(&pp->checker);
1452            if (strlen(checker_message(&pp->checker)))
1453                    LOG_MSG(1, checker_message(&pp->checker));
1454    }
1455

Mike Christie Aug. 12, 2016, 4:54 p.m. UTC | #2
On 08/11/2016 04:41 PM, Bart Van Assche wrote:
> On 08/11/2016 01:33 PM, Mike Christie wrote:
>> Could you try the attached patch? I found two segfaults. If check_path
>> returns less than 0, then we free the path, so we cannot call repair
>> on it. If libcheck_init fails, it memsets the checker, so we cannot call
>> repair on it either.
>>
>> I moved the repair call to the specific places where the path is down.
> 
> Hello Mike,
> 
> Thanks for the patch. Unfortunately even with this patch applied I can
> still trigger a segfault sporadically:
> 

I can't seem to replicate the problem with my patch applied, and I do not
see anything obviously wrong. Could you send me your multipath.conf/hwtable
settings?

For the failover/failback test, dev_loss_tmo is firing, causing paths to be
added/deleted, right?

Bart Van Assche Aug. 12, 2016, 5:10 p.m. UTC | #3
On 08/12/2016 09:54 AM, Mike Christie wrote:
> On 08/11/2016 04:41 PM, Bart Van Assche wrote:
>> On 08/11/2016 01:33 PM, Mike Christie wrote:
>>> Could you try the attached patch? I found two segfaults. If check_path
>>> returns less than 0, then we free the path, so we cannot call repair
>>> on it. If libcheck_init fails, it memsets the checker, so we cannot call
>>> repair on it either.
>>>
>>> I moved the repair call to the specific places where the path is down.
>>
>> Thanks for the patch. Unfortunately even with this patch applied I can
>> still trigger a segfault sporadically:
> 
> I can't seem to replicate the problem with my patch applied, and I do not
> see anything obviously wrong. Could you send me your
> multipath.conf/hwtable settings?

Please find that file at the end of this e-mail.
 
> For the failover/failback test, dev_loss_tmo is firing, causing paths to
> be added/deleted, right?
 
The script that I'm using to simulate path loss writes into
/sys/class/srp_remote_ports/*/delete. That causes the ib_srp driver to call
scsi_remove_host(). The script is available at
https://github.com/bvanassche/srp-test. However, an InfiniBand HCA is
needed to run it.

Bart.


/etc/multipath.conf:

defaults {
        user_friendly_names     yes
        queue_without_daemon    no
}

blacklist {
        device {
                vendor                  "ATA"
                product                 ".*"
        }
}

devices {
        device {
                vendor                  "SCST_BIO|LIO-ORG"
                product                 ".*"
                features                "3 queue_if_no_path pg_init_retries 50"
                path_grouping_policy    group_by_prio
                path_selector           "queue-length 0"
                path_checker            tur
        }
}

blacklist_exceptions {
        property        ".*"
}


Patch

diff --git a/multipathd/main.c b/multipathd/main.c
index f34500c..9f213cc 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1442,6 +1442,16 @@  int update_path_groups(struct multipath *mpp, struct vectors *vecs, int refresh)
 	return 0;
 }
 
+void repair_path(struct path * pp)
+{
+	if (pp->state != PATH_DOWN)
+		return;
+
+	checker_repair(&pp->checker);
+	if (strlen(checker_message(&pp->checker)))
+		LOG_MSG(1, checker_message(&pp->checker));
+}
+
 /*
  * Returns '1' if the path has been checked, '-1' if it was blacklisted
  * and '0' otherwise
@@ -1606,6 +1616,7 @@  check_path (struct vectors * vecs, struct path * pp, int ticks)
 			pp->mpp->failback_tick = 0;
 
 			pp->mpp->stat_path_failures++;
+			repair_path(pp);
 			return 1;
 		}
 
@@ -1700,7 +1711,7 @@  check_path (struct vectors * vecs, struct path * pp, int ticks)
 	}
 
 	pp->state = newstate;
-
+	repair_path(pp);
 
 	if (pp->mpp->wait_for_udev)
 		return 1;
@@ -1725,14 +1736,6 @@  check_path (struct vectors * vecs, struct path * pp, int ticks)
 	return 1;
 }
 
-void repair_path(struct vectors * vecs, struct path * pp)
-{
-	if (pp->state != PATH_DOWN)
-		return;
-
-	checker_repair(&pp->checker);
-}
-
 static void *
 checkerloop (void *ap)
 {
@@ -1804,7 +1807,6 @@  checkerloop (void *ap)
 					i--;
 				} else
 					num_paths += rc;
-				repair_path(vecs, pp);
 			}
 			lock_cleanup_pop(vecs->lock);
 		}