Orangefs ABI documentation

Message ID 20160214025615.GU17997@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro Feb. 14, 2016, 2:56 a.m. UTC
On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
> > I added the patches, and ran a bunch of tests.
> > 
> > Stuff works fine when left unbothered, and also
> > when wrenches are thrown into the works.
> > 
> > I had multiple userspace things going on at the
> > same time, dbench, ls -R, find... kill -9 or control-C on
> > any of them is handled well. When I killed both
> > the client-core and its restarter, the kernel
> > dealt with a swarm of ops that had nowhere
> > to go... the WARN_ON in service_operation
> > was hit.
> > 
> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
> > that pvfs2-client daemon is running.
> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
> 
> I.e. bufmap is gone.
> 
> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
> 
> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
> ourselves with bufmap already gone, despite not having freed that slot
> yet.  Bloody wonderful - we still have bufmap refcounting buggered somewhere.
> 
> Which tree had that been?  Could you push that tree (having checked that
> you don't have any uncommitted changes) in some branch?

OK, at the very least there's this; should be folded into "orangefs: delay
freeing slot until cancel completes"

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Al Viro Feb. 14, 2016, 3:46 a.m. UTC | #1
FWIW, I think I have a kinda-sorta solution for bufmap slot allocation/waiting;
if somebody has a better idea, I would love to drop the variant below.  And
I would certainly appreciate review - I hate messing with waitqueue primitives
and I know how easy it is to fuck those up ;-/

Below is a mockup of that thing:

/*  Three possible states: absent, installed and shutting down.
 *  install(map, count, bitmap) sets it up
 *  get(map)/put(map, slot) allocate and free resp.
 *  mark_killed(map) moves to shutdown state - no new allocations succeed until
 *  we reinstall it.
 *  run_down(map) waits for all allocations to be released; in the end, we
 *  are in the "absent" state again.
 *
 *  get() is not allowed to take longer than slot_timeout_secs seconds total;
 *  if the thing gets shut down and reinstalled during the wait, we are OK
 *  as long as reinstall comes within restart_timeout_secs.  For orangefs
 *  those default to 15 minutes and 30 seconds resp...
 */
struct slot_map {
	int c;			// absent -> -1
				// installed and full -> 0
				// installed with n slots free -> n
				// shutting down, with n slots in use -> -1-n
	wait_queue_head_t q;	// q.lock protects everything here.
	int count;
	unsigned long *map;
};

void install(struct slot_map *m, int count, unsigned long *map)
{
	spin_lock(&m->q.lock);
	m->c = m->count = count;
	m->map = map;
	wake_up_all_locked(&m->q);
	spin_unlock(&m->q.lock);
}

void mark_killed(struct slot_map *m)
{
	spin_lock(&m->q.lock);
	m->c -= m->count + 1;
	spin_unlock(&m->q.lock);
}

void run_down(struct slot_map *m)
{
	DEFINE_WAIT(wait);
	spin_lock(&m->q.lock);
#if 0
	// we don't have wait_event_locked(); might be worth adding.
	wait_event_locked(&m->q, m->c == -1);
#else
	// or we can open-code it
	if (m->c != -1) {
		for (;;) {
			if (likely(list_empty(&wait.task_list)))
				__add_wait_queue_tail(&m->q, &wait);
			set_current_state(TASK_UNINTERRUPTIBLE);

			if (m->c == -1)
				break;

			spin_unlock(&m->q.lock);
			schedule();
			spin_lock(&m->q.lock);
		}
		__remove_wait_queue(&m->q, &wait);
		__set_current_state(TASK_RUNNING);
	}
#endif
	m->map = NULL;
	spin_unlock(&m->q.lock);
}

void put(struct slot_map *m, int slot)
{
	int v;
	spin_lock(&m->q.lock);
	__clear_bit(slot, m->map);
	v = ++m->c;
	if (unlikely(v == 1))	/* no free slots -> one free slot */
		wake_up_locked(&m->q);
	else if (unlikely(v == -1))	/* finished dying */
		wake_up_all_locked(&m->q);
	spin_unlock(&m->q.lock);
}

static int wait_for_free(struct slot_map *m)
{
	long left = slot_timeout_secs * HZ;
	DEFINE_WAIT(wait);

#if 0
	// the trouble is, there's no wait_event_interruptible_timeout_locked()
	// might be worth adding...
	do {
		if (m->c > 0)
			break;
		if (m->c < 0) {
			/* we are waiting for map to be installed */
			/* it had better be there soon, or we go away */
			long n = left, t;
			if (n > restart_timeout_secs * HZ)
				n = restart_timeout_secs * HZ;
			t = wait_event_interruptible_timeout_locked(&m->q,
						m->c > 0, n);
			if (unlikely(t < 0) || (!t && m->c < 0))
				left = t;
			else
				left = t + (left - n);
		} else {
			/* just waiting for a slot to come free */
			left = wait_event_interruptible_timeout_locked(&m->q,
						m->c > 0, left);
		}
	} while (left > 0);
#else
	// or we can open-code it
	do {
		if (likely(list_empty(&wait.task_list)))
			__add_wait_queue_tail_exclusive(&m->q, &wait);
		set_current_state(TASK_INTERRUPTIBLE);

		if (m->c > 0)
			break;

		if (m->c < 0) {
			/* we are waiting for map to be installed */
			/* it had better be there soon, or we go away */
			long n = left, t;
			if (n > ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ)
				n = ORANGEFS_BUFMAP_WAIT_TIMEOUT_SECS * HZ;
			spin_unlock(&m->q.lock);
			t = schedule_timeout(n);
			spin_lock(&m->q.lock);
			if (unlikely(t < 0) || (!t && m->c < 0))
				left = t;
			else
				left = t + (left - n);
		} else {
			/* just waiting for a slot to come free */
			spin_unlock(&m->q.lock);
			left = schedule_timeout(left);
			spin_lock(&m->q.lock);
		}
	} while (left > 0);

	__remove_wait_queue(&m->q, &wait);
	__set_current_state(TASK_RUNNING);
#endif
	if (likely(left > 0))
		return 0;

	return left < 0 ? -EINTR : -ETIMEDOUT;
}

int get(struct slot_map *m)
{
	int res = 0;
	spin_lock(&m->q.lock);
	if (unlikely(m->c <= 0))
		res = wait_for_free(m);
	if (likely(!res)) {
		m->c--;
		res = find_first_zero_bit(m->map, m->count);
		__set_bit(res, m->map);
	}
	spin_unlock(&m->q.lock);
	return res;
}
Al Viro Feb. 14, 2016, 4:06 a.m. UTC | #2
On Sun, Feb 14, 2016 at 03:46:08AM +0000, Al Viro wrote:
> #if 0
> 	// the trouble is, there's no wait_event_interruptible_timeout_locked()

Sorry - it's wait_event_interruptible_timeout_locked_exclusive().  IOW, the
open-coded variant is doing the right thing, ifdefed-out one needs
s/wait_event_interruptible_timeout_locked/&_exclusive/...
Mike Marshall Feb. 14, 2016, 10:31 p.m. UTC | #3
I added the list_del...

Everything is very resilient: I killed
the client-core over and over while dbench
was running at the same time as ls -R
was running, and the client-core always
restarted... until finally, it didn't. I guess
related to the state of just what was going on
at the time... Hit the WARN_ON in service_operation,
and then oopsed on the orangefs_bufmap_put
down at the end of wait_for_direct_io...

http://myweb.clemson.edu/~hubcap/after.list_del

-Mike

On Sat, Feb 13, 2016 at 9:56 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Feb 13, 2016 at 05:47:38PM +0000, Al Viro wrote:
>> On Sat, Feb 13, 2016 at 12:18:12PM -0500, Mike Marshall wrote:
>> > I added the patches, and ran a bunch of tests.
>> >
>> > Stuff works fine when left unbothered, and also
>> > when wrenches are thrown into the works.
>> >
>> > I had multiple userspace things going on at the
>> > same time, dbench, ls -R, find... kill -9 or control-C on
>> > any of them is handled well. When I killed both
>> > the client-core and its restarter, the kernel
>> > dealt with a swarm of ops that had nowhere
>> > to go... the WARN_ON in service_operation
>> > was hit.
>> >
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167544] orangefs: please confirm
>> > that pvfs2-client daemon is running.
>> > Feb 12 16:19:12 be1 kernel: [ 3658.167547] fs/orangefs/dir.c line 264:
>> > orangefs_readdir: orangefs_readdir_index_get() failure (-5)
>>
>> I.e. bufmap is gone.
>>
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170741] ------------[ cut here ]------------
>> > Feb 12 16:19:12 be1 kernel: [ 3658.170746] WARNING: CPU: 0 PID: 1667
>> > at fs/orangefs/waitqueue.c:203 service_operation+0x4f6/0x7f0()
>>
>> ... and we are in wait_for_direct_io(), holding an r/w slot and finding
>> ourselves with bufmap already gone, despite not having freed that slot
>> yet.  Bloody wonderful - we still have bufmap refcounting buggered somewhere.
>>
>> Which tree had that been?  Could you push that tree (having checked that
>> you don't have any uncommitted changes) in some branch?
>
> OK, at the very least there's this; should be folded into "orangefs: delay
> freeing slot until cancel completes"
>
> diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
> index 41f8bb1f..1e28555 100644
> --- a/fs/orangefs/orangefs-kernel.h
> +++ b/fs/orangefs/orangefs-kernel.h
> @@ -261,6 +261,7 @@ static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
>  {
>         spin_lock(&op->lock);
>         if (unlikely(op_is_cancel(op))) {
> +               list_del(&op->list);
>                 spin_unlock(&op->lock);
>                 put_cancel(op);
>         } else {
Al Viro Feb. 16, 2016, 2:12 a.m. UTC | #4
On Sun, Feb 14, 2016 at 03:46:08AM +0000, Al Viro wrote:
> FWIW, I think I have a kinda-sorta solution for bufmap slot allocation/waiting;
> if somebody has a better idea, I would love to drop the variant below.  And
> I would certainly appreciate review - I hate messing with waitqueue primitives
> and I know how easy it is to fuck those up ;-/

... and in the "easy to fuck up" department, this thing doesn't stop waiting
when it gets a signal *and* is vulnerable to an analogue of the problem
dealt with in commit 777c6c5f1f6e757ae49ecca2ed72d6b1f523c007
Author: Johannes Weiner <hannes@cmpxchg.org>
Date:   Wed Feb 4 15:12:14 2009 -0800

    wait: prevent exclusive waiter starvation

Suppose we are waiting for a slot when everything's full.  Somebody releases
theirs, and we get woken up, just as we are hit with SIGKILL or time out.
We remove ourselves from the waitqueue and bugger off.  Too bad - everybody
else is going to get stuck until they time out.  If we got the slot, we would've
done wakeup when releasing it.  Since we hadn't, no such wakeup happens...

I wonder if wait_event_interruptible_exclusive_locked{,_irq}() is
vulnerable to the same problem; plain one is used in fuse, irq - in
gadgetfs and the latter looks somewhat fishy in that respect...

orangefs one fixed, folded and pushed, but I hadn't really looked into
fuse and gadgetfs enough to tell if they have similar problems...

Patch

diff --git a/fs/orangefs/orangefs-kernel.h b/fs/orangefs/orangefs-kernel.h
index 41f8bb1f..1e28555 100644
--- a/fs/orangefs/orangefs-kernel.h
+++ b/fs/orangefs/orangefs-kernel.h
@@ -261,6 +261,7 @@  static inline void set_op_state_purged(struct orangefs_kernel_op_s *op)
 {
 	spin_lock(&op->lock);
 	if (unlikely(op_is_cancel(op))) {
+		list_del(&op->list);
 		spin_unlock(&op->lock);
 		put_cancel(op);
 	} else {