Message ID: CA+D=wkgGZ9A8Qa5C6q3cROrr+Gp=jsgowvcbOs-22UU=aVT7Wg@mail.gmail.com (mailing list archive)
State: New, archived
On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:

> Something that used a slot, such as reader, would call
> service_operation while holding a bufmap. Then the client-core would
> crash, and the kernel would get to run_down, waiting on the slots to
> be given up. But the slots are not given up until someone wakes up all
> the processes waiting in service_operation, which happens after all
> the slots are given up. Then the client-core hangs until someone sends
> a deadly signal to all the processes waiting in service_operation, or
> presumably the timeout expires.
>
> This splits finalize and run_down so that orangefs_devreq_release can
> mark the slot map as killed, then purge waiting ops, then wait for all
> the slots to be released. Meanwhile, processes which were waiting will
> get into orangefs_bufmap_get, which will see that the slot map is
> shutting down and wait for the client-core to come back.

D'oh. Yes, that was exactly the point of separating mark_dead and run_down -
the latter should've been done after purging all requests. Fixes folded,
branch force-pushed.
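To make the ordering argument above concrete, here is a rough userspace sketch of the mark_killed/run_down split. The names only mirror fs/orangefs/orangefs-bufmap.c; pthreads stand in for the kernel wait queue, and the fields are my own simplification, so treat this as an illustration of the idea rather than the kernel code.

/*
 * Rough userspace model of the slot-map shutdown split.
 * Not the fs/orangefs code; pthreads replace the kernel wait queue.
 */
#include <pthread.h>
#include <stdbool.h>

struct slot_map {
	int used;		/* slots currently held by processes */
	bool killed;		/* set once the client-core is gone */
	pthread_mutex_t lock;
	pthread_cond_t q;	/* stand-in for the kernel wait queue */
};

/* Step 1 (finalize): refuse new allocations, wake sleepers, don't wait. */
static void mark_killed(struct slot_map *m)
{
	pthread_mutex_lock(&m->lock);
	m->killed = true;
	pthread_cond_broadcast(&m->q);
	pthread_mutex_unlock(&m->lock);
}

/* Called by a slot holder when it gives its slot back. */
static void slot_put(struct slot_map *m)
{
	pthread_mutex_lock(&m->lock);
	m->used--;
	pthread_cond_broadcast(&m->q);
	pthread_mutex_unlock(&m->lock);
}

/*
 * Step 3 (run_down): wait until every held slot is returned.  Safe only
 * after step 2 -- purging in-progress ops -- has woken the holders out
 * of service_operation; doing this before the purge recreates the
 * deadlock described in the mail, because used never reaches zero.
 */
static void run_down(struct slot_map *m)
{
	pthread_mutex_lock(&m->lock);
	while (m->used > 0)
		pthread_cond_wait(&m->q, &m->lock);
	pthread_mutex_unlock(&m->lock);
}

In the patched flow below, orangefs_devreq_release does exactly this: mark the maps killed, purge_inprogress_ops(), and only then orangefs_bufmap_run_down().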
This thing is invulnerable now! Nothing hangs when I kill the
client-core, and the client-core always restarts.

Sometimes, if you hit it right with a kill while dbench is running, a
file create will fail. I've been trying to track down why all day, in
case there's something that can be done... Here's what I see:

  orangefs_create
    service_operation
      wait_for_matching_downcall purges op and returns -EAGAIN
    orangefs_clean_up_interrupted_operation
    if (EAGAIN) ... goto retry_servicing
      wait_for_matching_downcall returns 0
    service_operation returns 0
  orangefs_create has good return value from service_operation
    op->khandle: 00000000-0000-0000-0000-000000000000
    op->fs_id: 0
  subsequent getattr on the bogus object fails orangefs_create with EINVAL

It seems like the second time around, wait_for_matching_downcall must
have seen op_state_serviced, but I don't see how yet...

I pushed the new patches out to
gitolite.kernel.org:pub/scm/linux/kernel/git/hubcap/linux for-next

I made a couple of additional patches that make it easier to read the
flow of gossip statements, and also removed a few lines of vestigial
ASYNC code.

-Mike

On Mon, Feb 15, 2016 at 6:04 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 15, 2016 at 05:32:54PM -0500, Martin Brandenburg wrote:
>
>> Something that used a slot, such as reader, would call
>> service_operation while holding a bufmap. Then the client-core would
>> crash, and the kernel would get to run_down, waiting on the slots to
>> be given up. But the slots are not given up until someone wakes up all
>> the processes waiting in service_operation, which happens after all
>> the slots are given up. Then the client-core hangs until someone sends
>> a deadly signal to all the processes waiting in service_operation, or
>> presumably the timeout expires.
>>
>> This splits finalize and run_down so that orangefs_devreq_release can
>> mark the slot map as killed, then purge waiting ops, then wait for all
>> the slots to be released. Meanwhile, processes which were waiting will
>> get into orangefs_bufmap_get, which will see that the slot map is
>> shutting down and wait for the client-core to come back.
>
> D'oh. Yes, that was exactly the point of separating mark_dead and run_down -
> the latter should've been done after purging all requests. Fixes folded,
> branch force-pushed.
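The failure chain in that trace can be modeled as a standalone toy program. This is only a guess at the shape of the race: the states and names approximate the orangefs ones, and the point where the op spuriously reads as serviced is forced by hand here, since that is exactly the open question above.

/*
 * Toy, single-threaded model of the suspected race.  Not the kernel
 * code; it only shows what the caller sees when a purged-and-retried
 * op comes back "serviced" without a downcall ever being filled in.
 */
#include <stdio.h>

enum op_state { OP_WAITING, OP_PURGED, OP_SERVICED };

struct toy_op {
	enum op_state state;
	const char *khandle;	/* never written by a downcall */
	int fs_id;
};

/* Approximation of wait_for_matching_downcall() as described above. */
static int wait_for_downcall(struct toy_op *op)
{
	if (op->state == OP_PURGED)
		return -11;	/* -EAGAIN: client-core died mid-request */
	return 0;		/* looks serviced, even if nothing ran */
}

int main(void)
{
	struct toy_op op = {
		.state = OP_PURGED,
		.khandle = "00000000-0000-0000-0000-000000000000",
		.fs_id = 0,
	};

	if (wait_for_downcall(&op) == -11) {
		/*
		 * retry_servicing: clean up and resubmit.  The open
		 * question is what flips the state to serviced here;
		 * in this toy we force it to show the downstream symptom.
		 */
		op.state = OP_SERVICED;
	}
	if (wait_for_downcall(&op) == 0)
		/* "good" return, yet no downcall wrote the handle */
		printf("create got khandle=%s fs_id=%d\n",
		       op.khandle, op.fs_id);
	return 0;
}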
diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index d96bcf10..b27ed1c 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -513,6 +513,9 @@ static int orangefs_devreq_release(struct inode *inode, struct file *file)
 	 * them as purged and wake them up
 	 */
 	purge_inprogress_ops();
+
+	orangefs_bufmap_run_down();
+
 	gossip_debug(GOSSIP_DEV_DEBUG,
 		     "pvfs2-client-core: device close complete\n");
 	open_access_count = 0;
diff --git a/fs/orangefs/orangefs-bufmap.c b/fs/orangefs/orangefs-bufmap.c
index 3c6e07c..c544710 100644
--- a/fs/orangefs/orangefs-bufmap.c
+++ b/fs/orangefs/orangefs-bufmap.c
@@ -20,7 +20,7 @@ static struct slot_map rw_map = {
 };
 static struct slot_map readdir_map = {
 	.c = -1,
-	.q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
+	.q = __WAIT_QUEUE_HEAD_INITIALIZER(readdir_map.q)
 };
@@ -430,6 +430,15 @@ void orangefs_bufmap_finalize(void)
 	gossip_debug(GOSSIP_BUFMAP_DEBUG, "orangefs_bufmap_finalize: called\n");
 	mark_killed(&rw_map);
 	mark_killed(&readdir_map);
+	gossip_debug(GOSSIP_BUFMAP_DEBUG,
+		     "orangefs_bufmap_finalize: exiting normally\n");
+}
+
+void orangefs_bufmap_run_down(void)
+{
+	struct orangefs_bufmap *bufmap = __orangefs_bufmap;
+	if (!bufmap)
+		return;
 	run_down(&rw_map);
 	run_down(&readdir_map);
 	spin_lock(&orangefs_bufmap_lock);
@@ -437,8 +446,6 @@ void orangefs_bufmap_finalize(void)
 	spin_unlock(&orangefs_bufmap_lock);
 	orangefs_bufmap_unmap(bufmap);
 	orangefs_bufmap_free(bufmap);
-	gossip_debug(GOSSIP_BUFMAP_DEBUG,
-		     "orangefs_bufmap_finalize: exiting normally\n");
 }
 
 /*
diff --git a/fs/orangefs/orangefs-bufmap.h b/fs/orangefs/orangefs-bufmap.h
index ad8d82a..0be62be 100644
--- a/fs/orangefs/orangefs-bufmap.h
+++ b/fs/orangefs/orangefs-bufmap.h
@@ -17,6 +17,8 @@ int orangefs_bufmap_initialize(struct ORANGEFS_dev_map_desc *user_desc);
 
 void orangefs_bufmap_finalize(void);
 
+void orangefs_bufmap_run_down(void);
+
 int orangefs_bufmap_get(struct orangefs_bufmap **mapp, int *buffer_index);
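One hunk above also fixes a separate copy-and-paste bug: readdir_map's wait queue was initialized with __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q), linking its queue head into rw_map's queue. A small userspace analogue (plain pointers instead of the kernel wait-queue type, so the struct and macro here are illustrative) shows why the initializer must name the object it initializes:

/*
 * Userspace analogue of the initializer fix above.  The kernel's
 * __WAIT_QUEUE_HEAD_INITIALIZER(name) self-links name's list head;
 * passing the wrong name links one queue's head into the other's list.
 */
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };
#define LIST_HEAD_INIT(name) { &(name), &(name) }

struct slot_map {
	int c;
	struct list_head q;
};

static struct slot_map rw_map = {
	.c = -1,
	.q = LIST_HEAD_INIT(rw_map.q)		/* correct: self-linked */
};
static struct slot_map readdir_map_buggy = {
	.c = -1,
	.q = LIST_HEAD_INIT(rw_map.q)		/* the bug: points into rw_map */
};
static struct slot_map readdir_map_fixed = {
	.c = -1,
	.q = LIST_HEAD_INIT(readdir_map_fixed.q)	/* the fix */
};

int main(void)
{
	printf("rw_map self-linked:            %d\n",
	       rw_map.q.next == &rw_map.q);			/* 1 */
	printf("buggy readdir_map self-linked: %d\n",
	       readdir_map_buggy.q.next == &readdir_map_buggy.q);	/* 0 */
	printf("fixed readdir_map self-linked: %d\n",
	       readdir_map_fixed.q.next == &readdir_map_fixed.q);	/* 1 */
	return 0;
}

With the buggy initializer, waiters queued on the readdir map would be threaded onto the rw map's list, so wakeups and run-down accounting for the two maps could not be trusted independently.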