
Orangefs ABI documentation

Message ID 20160214234312.GX17997@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro Feb. 14, 2016, 11:43 p.m. UTC
On Sun, Feb 14, 2016 at 05:31:10PM -0500, Mike Marshall wrote:
> I added the list_del...
> 
> Everything is very resilient, I killed
> the client-core over and over while dbench
> was running at the same time as  ls -R
> was running, and the client-core always
> restarted... until finally, it didn't. I guess
> related to the state of just what was going on
> at the time... Hit the WARN_ON in service_operation,
> and then oopsed on the orangefs_bufmap_put
> down at the end of wait_for_direct_io...

Bloody hell...  I think I see what's going on, and presumably the newer
slot allocator would fix that.  Look: closing control device (== daemon
death) checks if we have a bufmap installed and drops a reference to
it in that case.  The reason why it's conditional is that we might have
not gotten around to installing one (it's done via ioctl on control
device).  But ->release() does *NOT* wait for all references to go away!
In other words, it's possible to restart the daemon while the old bufmap
is still there.  Then have it killed after it has opened control devices
and before the old bufmap has run down.  For ->release() it looks like
we *have* gotten around to installing bufmap, and need the reference dropped.
In reality, the reference acquired when we were installing that one has
already been dropped, so we get double put.  With expected results...
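
To make that sequence concrete, here is a crude userspace model of the
double put - an illustration only, not the orangefs code; bufmap_put()
just reports the broken refcount instead of freeing anything:

/*
 * Model of the race: ->release() keys off "is a bufmap installed"
 * rather than "do we still hold the installation reference".
 */
#include <stdio.h>

struct bufmap { int refcount; };

static struct bufmap the_map;
static struct bufmap *installed;        /* "have we installed a bufmap?" */

static void bufmap_put(struct bufmap *m)
{
        if (--m->refcount == 0)
                printf("last reference dropped, bufmap freed\n");
        else if (m->refcount < 0)
                printf("double put, refcount now %d\n", m->refcount);
}

/* control device ->release(): drops a reference whenever a bufmap is
 * installed, without waiting for the other holders to go away */
static void control_release(void)
{
        if (installed)
                bufmap_put(installed);
}

int main(void)
{
        the_map.refcount = 2;           /* install ref + one I/O in flight */
        installed = &the_map;

        control_release();      /* daemon #1 dies: install ref dropped */
        /* daemon #2 opens the control device and dies before the old
         * bufmap has run down; ->release() never waited for that */
        control_release();      /* "frees" the map under the I/O's feet */
        bufmap_put(&the_map);   /* I/O finishes: refcount goes negative */
        return 0;
}

In the real driver that second put tears the bufmap down while I/O still
holds slots in it, which would line up with the oops at orangefs_bufmap_put
at the end of wait_for_direct_io.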

If below ends up fixing the symptoms, analysis above has a good chance to
be correct.  This is no way to wait for rundown, of course - I'm not
suggesting it as the solution, just as a way to narrow down what's going
on.

Incidentally, could you fold the list_del() part into offending commit
(orangefs: delay freeing slot until cancel completes) and repush your
for-next?


Comments

Mike Marshall Feb. 15, 2016, 5:46 p.m. UTC | #1
I pushed the list_del up to the kernel.org for-next branch...

And I've been running tests with the CRUDE bandaid... weird
results...

No oopses, no WARN_ONs... I was running dbench and ls -R
or find and kill-minus-nining different ones of them with no
perceived resulting problems, so I moved on to signalling
the client-core to abort... it restarted numerous times,
and then stuff wedged up differently than I've seen before.

Usually I kill the client-core and it comes back (gets restarted)
as seen by the different PID:

# ps -ef | grep pvfs
root      1292  1185  7 11:39 ?        00:00:01 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log
# kill -6 1292
# ps -ef | grep pvfs
root      1299  1185  8 11:40 ?        00:00:00 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log

Until once, it didn't die, and the gorked up unkillable left-over thing's
argv[0] (or wherever this string gets scraped from) was goofy:

# ps -ef | grep pvfs
root      1324  1185  1 11:41 ?        00:00:02 pvfs2-client-core
--child -a 60000 -n 60000 --logtype file -L /var/log/client.log
[root@be1 hubcap]# kill -6 1324
[root@be1 hubcap]# ps -ef | grep pvfs
root      1324  1185  2 11:41 ?        00:00:05 [pvfs2-client-co]

The virtual host was pretty wedged up after that, I couldn't look
at anything interesting, and got a bunch of terminal windows hung
trying:

# strace -f -p 1324
Process 1324 attached
^C

^C^C
                                     .
                     ls -R's output was flowing out here
/pvfsmnt/tdir/z_really_long_disgustingly_long_super_long_file_name52
/pvfsmnt/tdir/z_really_long_disgustingly_long_super_long_file_name53



^C^C^C


[root@logtruck hubcap]# ssh be1
root@be1's password:
Last login: Mon Feb 15 11:33:42 2016 from logtruck.clemson.edu
[root@be1 ~]# df


I still had one functioning window, and looked at dmesg from there,
nothing interesting there... a couple of expected tag WARNINGS while I was
killing finds and dbenches... ioctls that happened during the
successful restarts of the client-core...

[  809.520966] client-core: opening device
[  809.521031] pvfs2-client-core: open device complete (ret = 0)
[  809.521050] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[  809.521068] dispatch_ioctl_command: client debug array string has
been received.
[  809.521070] orangefs_prepare_debugfs_help_string: start
[  809.521071] orangefs_prepare_cdm_array: start
[  809.521104] orangefs_prepare_cdm_array: rc:50:
[  809.521106] orangefs_prepare_debugfs_help_string: cdm_element_count:50:
[  809.521239] debug_mask_to_string: start
[  809.521242] debug_mask_to_string: string:none:
[  809.521243] orangefs_client_debug_init: start
[  809.521249] orangefs_client_debug_init: rc:0:
[  809.566652] dispatch_ioctl_command: got ORANGEFS_DEV_REMOUNT_ALL
[  809.566667] dispatch_ioctl_command: priority remount in progress
[  809.566668] dispatch_ioctl_command: priority remount complete
[  812.454255] orangefs_debug_open: orangefs_debug_disabled: 0
[  812.454294] orangefs_debug_open: rc: 0
[  812.454320] orangefs_debug_write: kernel-debug
[  812.454323] debug_string_to_mask: start
[  896.410522] WARNING: No one's waiting for tag 15612
[ 1085.339948] WARNING: No one's waiting for tag 127943
[ 1146.820485] orangefs: please confirm that pvfs2-client daemon is running.
[ 1146.820488] fs/orangefs/dir.c line 264: orangefs_readdir:
orangefs_readdir_index_get() failure (-5)
[ 1146.866812] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1146.866834] dispatch_ioctl_command: client debug array string has
been received.
[ 1175.906800] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1175.906817] dispatch_ioctl_command: client debug array string has
been received.
[ 1223.915862] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1223.915880] dispatch_ioctl_command: client debug array string has
been received.
[ 1274.458852] dispatch_ioctl_command: client debug mask has been been
received :0: :0:
[ 1274.458870] dispatch_ioctl_command: client debug array string has
been received.
[root@be1 hubcap]#


ps aux shows every process' state as S except for 1324 which is
racking up time:

[hubcap@be1 ~]$ ps aux | grep pvfs2-client
root      1324 92.4  0.0      0     0 ?        R    11:41  46:29
[pvfs2-client-co]
[hubcap@be1 ~]$ ps aux | grep pvfs2-client
root      1324 92.4  0.0      0     0 ?        R    11:41  46:30
[pvfs2-client-co]

I'll virsh destroy this thing now <g>...

-Mike



On Sun, Feb 14, 2016 at 6:43 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sun, Feb 14, 2016 at 05:31:10PM -0500, Mike Marshall wrote:
>> I added the list_del...
>>
>> Everything is very resilient, I killed
>> the client-core over and over while dbench
>> was running at the same time as  ls -R
>> was running, and the client-core always
>> restarted... until finally, it didn't. I guess
>> related to the state of just what was going on
>> at the time... Hit the WARN_ON in service_operation,
>> and then oopsed on the orangefs_bufmap_put
>> down at the end of wait_for_direct_io...
>
> Bloody hell...  I think I see what's going on, and presumably the newer
> slot allocator would fix that.  Look: closing control device (== daemon
> death) checks if we have a bufmap installed and drops a reference to
> it in that case.  The reason why it's conditional is that we might have
> not gotten around to installing one (it's done via ioctl on control
> device).  But ->release() does *NOT* wait for all references to go away!
> In other words, it's possible to restart the daemon while the old bufmap
> is still there.  Then have it killed after it has opened control devices
> and before the old bufmap has run down.  For ->release() it looks like
> we *have* gotten around to installing bufmap, and need the reference dropped.
> In reality, the reference acquired when we were installing that one has
> already been dropped, so we get double put.  With expected results...
>
> If below ends up fixing the symptoms, analysis above has a good chance to
> be correct.  This is no way to wait for rundown, of course - I'm not
> suggesting it as the solution, just as a way to narrow down what's going
> on.
>
> Incidentally, could you fold the list_del() part into offending commit
> (orangefs: delay freeing slot until cancel completes) and repush your
> for-next?
>
> diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
> index 6a7df12..630246d 100644
> --- a/fs/orangefs/devorangefs-req.c
> +++ b/fs/orangefs/devorangefs-req.c
> @@ -529,6 +529,9 @@ static int orangefs_devreq_release(struct inode *inode, struct file *file)
>         purge_inprogress_ops();
>         gossip_debug(GOSSIP_DEV_DEBUG,
>                      "pvfs2-client-core: device close complete\n");
> +       /* VERY CRUDE, NOT FOR MERGE */
> +       while (orangefs_get_bufmap_init())
> +               schedule_timeout(HZ);
>         open_access_count = 0;
>         mutex_unlock(&devreq_mutex);
>         return 0;
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>         spin_lock(&op->lock);
>         if (unlikely(op_is_cancel(op))) {
> +               list_del(&op->list);
>                 spin_unlock(&op->lock);
>                 put_cancel(op);
>         } else {
Al Viro Feb. 15, 2016, 6:45 p.m. UTC | #2
On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
> I pushed the list_del up to the kernel.org for-next branch...
> 
> And I've been running tests with the CRUDE bandaid... weird
> results...
> 
> No oopses, no WARN_ONs... I was running dbench and ls -R
> or find and kill-minus-nining different ones of them with no
> perceived resulting problems, so I moved on to signalling
> the client-core to abort... it restarted numerous times,
> and then stuff wedged up differently than I've seen before.

There are other problems with that thing (starting with the fact that
retrying readdir/wait_for_direct_io can try to grab a slot despite the
bufmap winding down).  OK, at that point I think we should try to see
if bufmap rewrite works - I've rebased on top of your branch and pushed
(head at 8c3bc9a).  Bufmap rewrite is really completely untested -
it's done pretty much blindly and I'd be surprised as hell if it has no
brainos at the first try.
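
The retry point in sketch form (made-up names, not the pushed rewrite):
once the bufmap starts winding down, a slot request has to fail instead
of blocking or succeeding, otherwise a retried readdir/wait_for_direct_io
can grab a slot in a map that is about to go away.

#include <stdio.h>

struct slot_map {
        int free;               /* slots still available */
        int draining;           /* set once the bufmap starts winding down */
};

/* retry paths must see an error here, never a fresh slot */
static int slot_get(struct slot_map *m)
{
        if (m->draining || m->free == 0)
                return -1;      /* real code would sleep when merely full */
        m->free--;
        return 0;
}

int main(void)
{
        struct slot_map m = { .free = 4, .draining = 0 };

        printf("before rundown: %d\n", slot_get(&m));   /* 0: got a slot */
        m.draining = 1;                                 /* daemon died */
        printf("during rundown: %d\n", slot_get(&m));   /* -1: refused */
        return 0;
}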
Mike Marshall Feb. 15, 2016, 10:47 p.m. UTC | #3
> Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.

You did pretty good, it takes me two tries to get hello world right...

Right off the bat, the kernel crashed, because:

static struct slot_map rw_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};
static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(rw_map.q)
};                                          ^
                                            |
                                          D'OH!
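
The presumed fix is just to point readdir_map's initializer at its own
queue, something like:

static struct slot_map readdir_map = {
        .c = -1,
        .q = __WAIT_QUEUE_HEAD_INITIALIZER(readdir_map.q)
};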

But after that stuff almost worked...

It can still "sort of" wedge up.

We think that when dbench is running and the client-core is killed, you
can hit orangefs_bufmap_finalize -> mark_killed -> run_down/schedule(),
while those wait_for_completion_* schedules of extant ops in
wait_for_matching_downcall have also given up the processor...

Then... when you interrupt dbench, stuff starts flowing again...
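
A rough userspace model of the wedge (made-up names, not the kernel code):
run_down() can't finish until every slot comes back, but the slot holders
are asleep waiting for replies from a daemon that is gone, so nothing moves
until a signal breaks their wait.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int slots_in_use = 1;            /* dbench holds one slot */
static bool reply_arrived;              /* daemon is dead: never set */
static bool interrupted;                /* Ctrl-C on dbench sets this */

/* models wait_for_matching_downcall(): holds a slot, waits for a reply */
static void *io_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!reply_arrived && !interrupted)
                pthread_cond_wait(&cond, &lock);
        slots_in_use--;                 /* slot finally given back */
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* models run_down(): bufmap finalize waits for every slot to drain */
static void *finalize_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (slots_in_use > 0)
                pthread_cond_wait(&cond, &lock);        /* the wedge */
        pthread_mutex_unlock(&lock);
        printf("bufmap rundown complete\n");
        return NULL;
}

int main(void)
{
        pthread_t io, fin;

        pthread_create(&io, NULL, io_thread, NULL);
        pthread_create(&fin, NULL, finalize_thread, NULL);

        sleep(1);                       /* both sides asleep: wedged */

        pthread_mutex_lock(&lock);      /* the "Ctrl-C on dbench" moment */
        interrupted = true;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);

        pthread_join(io, NULL);
        pthread_join(fin, NULL);
        return 0;
}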

I added a couple of gossip statements inside of mark_killed and
run_down...

Feb 15 16:40:15 be1 kernel: [  349.981597] orangefs_bufmap_finalize: called
Feb 15 16:40:15 be1 kernel: [  349.981600] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981602] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981603] mark_killed enter
Feb 15 16:40:15 be1 kernel: [  349.981605] mark_killed: leave
Feb 15 16:40:15 be1 kernel: [  349.981606] run_down: enter:-1:
Feb 15 16:40:15 be1 kernel: [  349.981608] run_down: leave
Feb 15 16:40:15 be1 kernel: [  349.981609] run_down: enter:-2:
Feb 15 16:40:15 be1 kernel: [  349.981610] run_down: before schedule:-2:

            stuff just sits here while dbench is still running.
            Then Ctrl-C on dbench and off to the races again.

Feb 15 16:42:28 be1 kernel: [  483.049927] ***
wait_for_matching_downcall: operation interrupted by a signal (tag
16523, op ffff880013418000)
Feb 15 16:42:28 be1 kernel: [  483.049930] Interrupted: Removed op
ffff880013418000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.049932] orangefs: service_operation
orangefs_inode_getattr returning: -4 for ffff880013418000.
Feb 15 16:42:28 be1 kernel: [  483.050116] ***
wait_for_matching_downcall: operation interrupted by a signal (tag
16518, op ffff8800001a8000)
Feb 15 16:42:28 be1 kernel: [  483.050118] Interrupted: Removed op
ffff8800001a8000 from htable_ops_in_progress
Feb 15 16:42:28 be1 kernel: [  483.050120] orangefs: service_operation
orangefs_inode_getattr returning: -4 for ffff8800001a8000.

Martin already has a patch... What do you think?
I'm headed home for supper...

-Mike

On Mon, Feb 15, 2016 at 1:45 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Feb 15, 2016 at 12:46:51PM -0500, Mike Marshall wrote:
>> I pushed the list_del up to the kernel.org for-next branch...
>>
>> And I've been running tests with the CRUDE bandaid... weird
>> results...
>>
>> No oopses, no WARN_ONs... I was running dbench and ls -R
>> or find and kill-minus-nining different ones of them with no
>> perceived resulting problems, so I moved on to signalling
>> the client-core to abort... it restarted numerous times,
>> and then stuff wedged up differently than I've seen before.
>
> There are other problems with that thing (starting with the fact that
> retrying readdir/wait_for_direct_io can try to grab a slot despite the
> bufmap winding down).  OK, at that point I think we should try to see
> if bufmap rewrite works - I've rebased on top of your branch and pushed
> (head at 8c3bc9a).  Bufmap rewrite is really completely untested -
> it's done pretty much blindly and I'd be surprised as hell if it has no
> brainos at the first try.

Patch

diff --git a/fs/orangefs/devorangefs-req.c b/fs/orangefs/devorangefs-req.c
index 6a7df12..630246d 100644
--- a/fs/orangefs/devorangefs-req.c
+++ b/fs/orangefs/devorangefs-req.c
@@ -529,6 +529,9 @@  static int orangefs_devreq_release(struct inode *inode, struct file *file)
 	purge_inprogress_ops();
 	gossip_debug(GOSSIP_DEV_DEBUG,
 		     "pvfs2-client-core: device close complete\n");
+	/* VERY CRUDE, NOT FOR MERGE */
+	while (orangefs_get_bufmap_init())
+		schedule_timeout(HZ);
 	open_access_count = 0;
 	mutex_unlock(&devreq_mutex);
 	return 0;
diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 41f8bb1f..1e28555 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -261,6 +261,7 @@  static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {
+		list_del(&op->list);
 		spin_unlock(&op->lock);
 		put_cancel(op);
 	} else {