Message ID | 57ACE12F.20700@redhat.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Delegated to: | christophe varoqui |
On 08/11/2016 01:33 PM, Mike Christie wrote:
> Could you try the attached patch. I found two segfaults. If check_path
> returns less than 0 then we free the path and so we cannot call repair
> on it. If libcheck_init fails it memsets the checker, so we cannot call
> repair on it too.
>
> I moved the repair call to the specific paths that the path is down.

Hello Mike,

Thanks for the patch. Unfortunately even with this patch applied I can
still trigger a segfault sporadically:

    # valgrind --read-var-info=yes multipathd -d
    Aug 11 14:02:21 | mpathbf: load table [0 2097152 multipath 3 queue_if_no_path pg_init_retries 50 0 2 1 queue-length 0 1 1 8:160 1000 queue-length 0 1 1 8:64 1000]
    Aug 11 14:02:21 | mpathbf: event checker started
    Aug 11 14:02:21 | sdk [8:160]: path added to devmap mpathbf
    Aug 11 14:02:21 | sdd: add path (uevent)
    ==2452== Thread 4:
    ==2452== Jump to the invalid address stated on the next line
    ==2452==    at 0x0: ???
    ==2452==    by 0x409BBE: repair_path (main.c:1451)
    ==2452==    by 0x40A905: check_path (main.c:1715)
    ==2452==    by 0x40AE72: checkerloop (main.c:1808)
    ==2452==    by 0x5047473: start_thread (pthread_create.c:333)
    ==2452==    by 0x671B3EC: clone (clone.S:109)
    ==2452==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
    ==2452==
    ==2452==
    ==2452== Process terminating with default action of signal 11 (SIGSEGV)
    ==2452==  Bad permissions for mapped region at address 0x0
    ==2452==    at 0x0: ???
    ==2452==    by 0x409BBE: repair_path (main.c:1451)
    ==2452==    by 0x40A905: check_path (main.c:1715)
    ==2452==    by 0x40AE72: checkerloop (main.c:1808)
    ==2452==    by 0x5047473: start_thread (pthread_create.c:333)
    ==2452==    by 0x671B3EC: clone (clone.S:109)
    ==2452==

    (gdb) list main.c:1451
    1446    void repair_path(struct path * pp)
    1447    {
    1448            if (pp->state != PATH_DOWN)
    1449                    return;
    1450
    1451            checker_repair(&pp->checker);
    1452            if (strlen(checker_message(&pp->checker)))
    1453                    LOG_MSG(1, checker_message(&pp->checker));
    1454    }
    1455

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
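The "Jump to the invalid address ... at 0x0" report from inside repair_path() is what one would expect if a function pointer reached through the checker is NULL at the time of the call, for example after the checker has been cleared with memset(). The following is a minimal sketch of that failure mode and of a defensive guard. All names here (struct demo_checker, demo_repair, demo_checker_reset, demo_checker_repair) are hypothetical stand-ins for illustration; this is not libmultipath's struct checker or its checker_repair(), and it is not a claim about the root cause of the trace above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a path checker; NOT libmultipath's struct checker. */
struct demo_checker {
	int repair_calls;                       /* bookkeeping for the demo only */
	void (*repair)(struct demo_checker *c); /* callback; may be NULL */
};

/* Example callback: just records that it ran. */
static void demo_repair(struct demo_checker *c)
{
	c->repair_calls++;
}

/*
 * Simulates a failed initialization that wipes the checker.
 * On common platforms an all-zero function pointer compares equal to NULL.
 */
static void demo_checker_reset(struct demo_checker *c)
{
	memset(c, 0, sizeof(*c));
}

/*
 * Guarded call: returns 1 if the callback was invoked, 0 if it was skipped
 * because the checker or its callback is missing. Calling c->repair(c)
 * without this test on a wiped checker would jump to address 0.
 */
static int demo_checker_repair(struct demo_checker *c)
{
	if (c == NULL || c->repair == NULL)
		return 0;
	c->repair(c);
	return 1;
}
```

With a checker initialized as { 0, demo_repair }, demo_checker_repair() invokes the callback and returns 1; after demo_checker_reset() it returns 0 instead of jumping through a NULL pointer.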
On 08/11/2016 04:41 PM, Bart Van Assche wrote:
> On 08/11/2016 01:33 PM, Mike Christie wrote:
>> Could you try the attached patch. I found two segfaults. If check_path
>> returns less than 0 then we free the path and so we cannot call repair
>> on it. If libcheck_init fails it memsets the checker, so we cannot call
>> repair on it too.
>>
>> I moved the repair call to the specific paths that the path is down.
>
> Hello Mike,
>
> Thanks for the patch. Unfortunately even with this patch applied I can
> still trigger a segfault sporadically:
>

I can't seem to replicate the problem with my patch and I do not see
anything. Could you send me your multipath.conf/hwtable settings?

For the fo/fb test, dev_loss_tmo is firing causing paths to be
added/deleted right?
On 08/12/2016 09:54 AM, Mike Christie wrote:
> On 08/11/2016 04:41 PM, Bart Van Assche wrote:
>> On 08/11/2016 01:33 PM, Mike Christie wrote:
>>> Could you try the attached patch. I found two segfaults. If check_path
>>> returns less than 0 then we free the path and so we cannot call repair
>>> on it. If libcheck_init fails it memsets the checker, so we cannot call
>>> repair on it too.
>>>
>>> I moved the repair call to the specific paths that the path is down.
>>
>> Thanks for the patch. Unfortunately even with this patch applied I can
>> still trigger a segfault sporadically:
>
> I can't seem to replicate the problem with my patch and I do not see
> anything. Could you send me your multipath.conf/hwtable settings?

Please find that file at the end of this e-mail.

> For the fo/fb test, dev_loss_tmo is firing causing paths to be
> added/deleted right?

The script that I'm using to simulate path loss writes into
/sys/class/srp_remote_ports/*/delete. That causes the ib_srp driver to
call scsi_remove_host(). That script is available at
https://github.com/bvanassche/srp-test. However, an InfiniBand HCA is
needed to run this script.

Bart.

/etc/multipath.conf:

    defaults {
        user_friendly_names     yes
        queue_without_daemon    no
    }
    blacklist {
        device {
            vendor      "ATA"
            product     ".*"
        }
    }
    devices {
        device {
            vendor                  "SCST_BIO|LIO-ORG"
            product                 ".*"
            features                "3 queue_if_no_path pg_init_retries 50"
            path_grouping_policy    group_by_prio
            path_selector           "queue-length 0"
            path_checker            tur
        }
    }
    blacklist_exceptions {
        property ".*"
    }
diff --git a/multipathd/main.c b/multipathd/main.c
index f34500c..9f213cc 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1442,6 +1442,16 @@ int update_path_groups(struct multipath *mpp, struct vectors *vecs, int refresh)
 	return 0;
 }
 
+void repair_path(struct path * pp)
+{
+	if (pp->state != PATH_DOWN)
+		return;
+
+	checker_repair(&pp->checker);
+	if (strlen(checker_message(&pp->checker)))
+		LOG_MSG(1, checker_message(&pp->checker));
+}
+
 /*
  * Returns '1' if the path has been checked, '-1' if it was blacklisted
  * and '0' otherwise
@@ -1606,6 +1616,7 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 			pp->mpp->failback_tick = 0;
 
 		pp->mpp->stat_path_failures++;
+		repair_path(pp);
 		return 1;
 	}
 
@@ -1700,7 +1711,7 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	}
 
 	pp->state = newstate;
-
+	repair_path(pp);
 	if (pp->mpp->wait_for_udev)
 		return 1;
 
@@ -1725,14 +1736,6 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	return 1;
 }
 
-void repair_path(struct vectors * vecs, struct path * pp)
-{
-	if (pp->state != PATH_DOWN)
-		return;
-
-	checker_repair(&pp->checker);
-}
-
 static void *
 checkerloop (void *ap)
 {
@@ -1804,7 +1807,6 @@ checkerloop (void *ap)
 				i--;
 			} else
 				num_paths += rc;
-			repair_path(vecs, pp);
 		}
 		lock_cleanup_pop(vecs->lock);
 	}
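The reasoning behind moving the repair call out of the checker loop ("if check_path returns less than 0 then we free the path and so we cannot call repair on it") can be illustrated with a small, self-contained sketch. Every name below (struct demo_path, demo_check_path, demo_repair_path, demo_check_all) is a hypothetical simplification, not multipathd code: the check function itself decides whether the path still exists and only then attempts a repair, so the caller never touches a path that has been freed.

```c
#include <stdlib.h>

/* Hypothetical simplified path states; not multipathd's definitions. */
enum demo_state { DEMO_PATH_UP, DEMO_PATH_DOWN };

struct demo_path {
	enum demo_state state;
	int removed;            /* set when the check should drop the path */
};

static int demo_repairs;        /* counts repair attempts, for the demo only */

/* Repair is attempted only for paths that are down. */
static void demo_repair_path(struct demo_path *pp)
{
	if (pp->state != DEMO_PATH_DOWN)
		return;
	demo_repairs++;
}

/*
 * Returns 1 if the path was checked, -1 if it was freed.
 * The repair call lives here, on the branch where pp is known to be valid.
 */
static int demo_check_path(struct demo_path *pp)
{
	if (pp->removed) {
		free(pp);
		return -1;      /* pp must not be dereferenced after this */
	}
	demo_repair_path(pp);
	return 1;
}

/* Caller loop: never dereferences a path after a negative return value. */
static int demo_check_all(struct demo_path **paths, size_t n)
{
	int checked = 0;

	for (size_t i = 0; i < n; i++) {
		int rc = demo_check_path(paths[i]);

		if (rc < 0) {
			paths[i] = NULL;    /* forget the freed path */
			continue;
		}
		checked += rc;
	}
	return checked;
}
```

Had the loop called demo_repair_path(paths[i]) unconditionally after demo_check_path(), it would read freed memory whenever the return value is negative, which is the pattern the patch removes from checkerloop().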