Orangefs ABI documentation

Message ID	20160216233609.GE17997@ZenIV.linux.org.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Tue, 16 Feb 2016 23:36:09 +0000 From: Al Viro <viro@ZenIV.linux.org.uk> To: Mike Marshall <hubcap@omnibond.com> Cc: Martin Brandenburg <martin@omnibond.com>, Linus Torvalds <torvalds@linux-foundation.org>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Stephen Rothwell <sfr@canb.auug.org.au> Subject: Re: Orangefs ABI documentation Message-ID: <20160216233609.GE17997@ZenIV.linux.org.uk> References: <CAOg9mSQ=pF=Pg60VoSUeywybDATi1M3TTRmjGNEmvyLCh0AAoQ@mail.gmail.com> <20160213174738.GR17997@ZenIV.linux.org.uk> <20160214025615.GU17997@ZenIV.linux.org.uk> <CAOg9mSRgGghk=RAmWZ_MLgp=3DkR+2iV5dU8ffaAFK8aYmM_xQ@mail.gmail.com> <20160214234312.GX17997@ZenIV.linux.org.uk> <CAOg9mSSsztHyRsQ=BAA9FoEMo7XEdgTwYHosqvh4ZjykDapj2A@mail.gmail.com> <20160215184554.GY17997@ZenIV.linux.org.uk> <CA+D=wkgGZ9A8Qa5C6q3cROrr+Gp=jsgowvcbOs-22UU=aVT7Wg@mail.gmail.com> <20160215230434.GZ17997@ZenIV.linux.org.uk> <CAOg9mSSdsVL8tRiAojg=UCNPQ6iPcthtdowH9kWyiWnXUvTEHg@mail.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <CAOg9mSSdsVL8tRiAojg=UCNPQ6iPcthtdowH9kWyiWnXUvTEHg@mail.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Al Viro Feb. 16, 2016, 11:36 p.m. UTC

On Tue, Feb 16, 2016 at 06:15:56PM -0500, Mike Marshall wrote:
> Here's what I see:
> 
>   orangefs_create
>     service_operation
>       wait_for_matching_downcall purges op and returns -EAGAIN
>       orangefs_clean_up_interrupted_operation
>       if (EAGAIN)
>         ...
>         goto retry_servicing
>       wait_for_matching_downcall returns 0
>     service_operation returns 0
>   orangefs_create has good return value from service_operation
> 
>    op->khandle: 00000000-0000-0000-0000-000000000000
>    op->fs_id: 0
> 
>    subsequent getattr on bogus object fails orangefs_create on EINVAL.
> 
>    seems like the second time around, wait_for_matching_downcall
>    must have seen op_state_serviced, but I don't see how yet...

I strongly suspect that this is what's missing.  Could you check if it helps?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 16, 2016, 11:54 p.m. UTC | #1

On Tue, Feb 16, 2016 at 11:36:09PM +0000, Al Viro wrote:

> I strongly suspect that this is what's missing.  Could you check if it helps?

BTW, I've pushed #orangefs-untested; the differences are
	* several fixes folded back into relevant commits (no point creating
bisect hazards and making the history harder to read)
	* this thing also folded (and in fact is what you get from diffing
that tree with your for-next)
	* commit message for bufmap rewrite supplied.

Could you (assuming it works, etc.) switch your branch to that?
git checkout for-next, git fetch from vfs.git, git diff FETCH_HEAD to
verify that the delta is as it should be, then git reset --hard FETCH_HEAD
and git push --force <whatever remote you are using> for-next would do it...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall Feb. 17, 2016, 7:24 p.m. UTC | #2

It is still busted, I've been trying to find clues as to why...

Maybe this is relevant:

Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
service_operation: orangefs_create op ffff880015698000
ffff880015698000 got past is_daemon_in_service

... lots of stuff ...

w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN

first client core is NOT in service
second op to get EAGAIN
          ...
last client core is NOT in service

... lots of stuff ...

service_operation returns to orangef_create with handle 0 fsid 0 ret 0
for MAILBOX2.CPT

I'm guessing you want me to wait to do the switching of my branch
until we fix this (last?) thing, let me know...

-Mike


On Tue, Feb 16, 2016 at 6:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 16, 2016 at 11:36:09PM +0000, Al Viro wrote:
>
>> I strongly suspect that this is what's missing.  Could you check if it helps?
>
> BTW, I've pushed #orangefs-untested; the differences are
>         * several fixes folded back into relevant commits (no point creating
> bisect hazards and making the history harder to read)
>         * this thing also folded (and in fact is what you get from diffing
> that tree with your for-next)
>         * commit message for bufmap rewrite supplied.
>
> Could you (assuming it works, etc.) switch your branch to that?
> git checkout for-next, git fetch from vfs.git, git diff FETCH_HEAD to
> verify that the delta is as it should be, then git reset --hard FETCH_HEAD
> and git push --force <whatever remote you are using> for-next would do it...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 17, 2016, 8:11 p.m. UTC | #3

On Wed, Feb 17, 2016 at 02:24:34PM -0500, Mike Marshall wrote:
> It is still busted, I've been trying to find clues as to why...

With reinit_completion() added?

> Maybe this is relevant:
> 
> Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> service_operation: orangefs_create op ffff880015698000
> ffff880015698000 got past is_daemon_in_service
> 
> ... lots of stuff ...
> 
> w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> 
> first client core is NOT in service
> second op to get EAGAIN
>           ...
> last client core is NOT in service
> 
> ... lots of stuff ...
> 
> service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> for MAILBOX2.CPT
> 
> I'm guessing you want me to wait to do the switching of my branch
> until we fix this (last?) thing, let me know...

What I'd like to check is the value of op->waitq.done at retry_servicing.
If we get there with a non-zero value, we've a problem.  BTW, do you
hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro Feb. 17, 2016, 9:17 p.m. UTC | #4

On Wed, Feb 17, 2016 at 08:11:20PM +0000, Al Viro wrote:
> With reinit_completion() added?
> 
> > Maybe this is relevant:
> > 
> > Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> > service_operation: orangefs_create op ffff880015698000
> > ffff880015698000 got past is_daemon_in_service
> > 
> > ... lots of stuff ...
> > 
> > w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> > 
> > first client core is NOT in service
> > second op to get EAGAIN
> >           ...
> > last client core is NOT in service
> > 
> > ... lots of stuff ...
> > 
> > service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> > for MAILBOX2.CPT
> > 
> > I'm guessing you want me to wait to do the switching of my branch
> > until we fix this (last?) thing, let me know...
> 
> What I'd like to check is the value of op->waitq.done at retry_servicing.
> If we get there with a non-zero value, we've a problem.  BTW, do you
> hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?

Am I right assuming that you are seeing zero from service_operation()
for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
AFAICS, that can only happen with wait_for_matching_downcall() returning
0, which mean op_state_serviced(op).  And prior to that we had
set_op_state_waiting(op), which means set_op_state_serviced(op) done between
those two...

So unless you have something really nasty going on (buggered op refcounting,
memory corruption, etc.), we had orangefs_devreq_write_iter() pick that
op and copy the entire op->downcall from userland.  With op->downcall.status
not set to -EFAULT, or we would've...  Oh, shit - it's fed through
orangefs_normalize_to_errno().  OTOH, it would've hit
gossip_err("orangefs: orangefs_normalize_to_errno: got error code which is not
from ORANGEFS.\n") and presumably you would've noticed that.  And it still
would've returned that -EFAULT intact, actually.  In any case, it's a bug -
we need
	ret = -(ORANGEFS_ERROR_BIT|9);
	goto Broken;
instead of those
	ret = -EFAULT;
	goto Broken;
and
	ret = -(ORANGEFS_ERROR_BIT|8);
	goto Broken;
instead of
	ret = -ENOMEM;
	goto Broken;
in orangefs_devreq_write_iter().

Anyway, it looks like we didn't go through Broken: in there, so copy_from_user()
must've been successful.

Hmm...  Could the daemon have really given that reply?  Relevant test would
be something like
	WARN_ON(op->upcall.type == ORANGEFS_VFS_OP_CREATE &&
	    !op->downcall.resp.create.refn.fsid);
right after
wakeup:
        /*
         * tell the vfs op waiting on a waitqueue
         * that this op is done
         */
        spin_lock(&op->lock);
        if (unlikely(op_state_given_up(op))) {
                spin_unlock(&op->lock);
                goto out;
        }
        set_op_state_serviced(op);

and plain WARN_ON(1) right after Broken:

If that op goes through either of those set_op_state_serviced() with that
zero refn.fsid, we'll see a WARN_ON triggered...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall Feb. 17, 2016, 10:24 p.m. UTC | #5

> with reinit_completion() added?

In the right place, even...

> Am I right assuming that you are seeing zero from service_operation()
> for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
> AFAICS, that can only happen with wait_for_matching_downcall() returning
> 0, which mean op_state_serviced(op).

Yes sir:

Feb 17 12:25:15 be1 kernel: [  857.901225] service_operation:
wait_for_matching_downcall returned 0 for ffff880015698000
Feb 17 12:25:15 be1 kernel: [  857.901228] orangefs: service_operation
orangefs_create returning: 0 for ffff880015698000.
Feb 17 12:25:15 be1 kernel: [  857.901230] orangefs_create:
MAILBOX2.CPT: handle:00000000-0000-0000-0000-000000000000: fsid:0:
new_op:ffff880015698000: ret:0:

I'll get those WARN_ONs in there and do some more
tests.

Martin thinks he might be on to a solution... he'll
probably send a response soon... I can hear him typing
at 500 characters a minute a few cubes away...

-Mike

On Wed, Feb 17, 2016 at 4:17 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Wed, Feb 17, 2016 at 08:11:20PM +0000, Al Viro wrote:
>> With reinit_completion() added?
>>
>> > Maybe this is relevant:
>> >
>> > Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
>> > service_operation: orangefs_create op ffff880015698000
>> > ffff880015698000 got past is_daemon_in_service
>> >
>> > ... lots of stuff ...
>> >
>> > w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
>> >
>> > first client core is NOT in service
>> > second op to get EAGAIN
>> >           ...
>> > last client core is NOT in service
>> >
>> > ... lots of stuff ...
>> >
>> > service_operation returns to orangef_create with handle 0 fsid 0 ret 0
>> > for MAILBOX2.CPT
>> >
>> > I'm guessing you want me to wait to do the switching of my branch
>> > until we fix this (last?) thing, let me know...
>>
>> What I'd like to check is the value of op->waitq.done at retry_servicing.
>> If we get there with a non-zero value, we've a problem.  BTW, do you
>> hit any of gossip_err() in orangefs_clean_up_interrupted_operation()?
>
> Am I right assuming that you are seeing zero from service_operation()
> for ORANGEFS_VFS_OP_CREATE with zeroed ->downcall.resp.create.refn?
> AFAICS, that can only happen with wait_for_matching_downcall() returning
> 0, which mean op_state_serviced(op).  And prior to that we had
> set_op_state_waiting(op), which means set_op_state_serviced(op) done between
> those two...
>
> So unless you have something really nasty going on (buggered op refcounting,
> memory corruption, etc.), we had orangefs_devreq_write_iter() pick that
> op and copy the entire op->downcall from userland.  With op->downcall.status
> not set to -EFAULT, or we would've...  Oh, shit - it's fed through
> orangefs_normalize_to_errno().  OTOH, it would've hit
> gossip_err("orangefs: orangefs_normalize_to_errno: got error code which is not
> from ORANGEFS.\n") and presumably you would've noticed that.  And it still
> would've returned that -EFAULT intact, actually.  In any case, it's a bug -
> we need
>         ret = -(ORANGEFS_ERROR_BIT|9);
>         goto Broken;
> instead of those
>         ret = -EFAULT;
>         goto Broken;
> and
>         ret = -(ORANGEFS_ERROR_BIT|8);
>         goto Broken;
> instead of
>         ret = -ENOMEM;
>         goto Broken;
> in orangefs_devreq_write_iter().
>
> Anyway, it looks like we didn't go through Broken: in there, so copy_from_user()
> must've been successful.
>
> Hmm...  Could the daemon have really given that reply?  Relevant test would
> be something like
>         WARN_ON(op->upcall.type == ORANGEFS_VFS_OP_CREATE &&
>             !op->downcall.resp.create.refn.fsid);
> right after
> wakeup:
>         /*
>          * tell the vfs op waiting on a waitqueue
>          * that this op is done
>          */
>         spin_lock(&op->lock);
>         if (unlikely(op_state_given_up(op))) {
>                 spin_unlock(&op->lock);
>                 goto out;
>         }
>         set_op_state_serviced(op);
>
> and plain WARN_ON(1) right after Broken:
>
> If that op goes through either of those set_op_state_serviced() with that
> zero refn.fsid, we'll see a WARN_ON triggered...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Martin Brandenburg Feb. 17, 2016, 10:40 p.m. UTC | #6

On Wed, 17 Feb 2016, Mike Marshall wrote:

> It is still busted, I've been trying to find clues as to why...
> 
> Maybe this is relevant:
> 
> Alloced OP ffff880015698000 <- doomed op for orangefs_create MAILBOX2.CPT
> service_operation: orangefs_create op ffff880015698000
> ffff880015698000 got past is_daemon_in_service
> 
> ... lots of stuff ...
> 
> w_f_m_d returned -11 for ffff880015698000 <- first op to get EAGAIN
> 
> first client core is NOT in service
> second op to get EAGAIN
>           ...
> last client core is NOT in service
> 
> ... lots of stuff ...
> 
> service_operation returns to orangef_create with handle 0 fsid 0 ret 0
> for MAILBOX2.CPT
> 
> I'm guessing you want me to wait to do the switching of my branch
> until we fix this (last?) thing, let me know...
> 
> -Mike

I think I've identified something screwy.

Some process creates a file. Eventually we get into
wait_for_matching_downcall with no client-core. W_f_m_d
returns EAGAIN and op->lock is held. The op is still
waiting and in orangefs_request_list. Service_operation
calls orangefs_clean_up_interrupted_operation, which
attempts to remove the op from orangefs_request_list.

Meanwhile the client-core comes back and does a read.
W_f_m_d has returned EAGAIN, but the op is still in
orangefs_request_list, so it gets passed to the
client-core. Now the op is in service and in
htable_ops_in_progress.

But service_operation is about to retry it under the
impression that it was purged. So it puts the op back
in orangefs_request_list.

Then the client-core returns the op, so it is marked
serviced and returned to orangefs_inode_create.
Meanwhile something or other (great theory right? now
I'm less sure) happens with the second request (they
have the same tag) causing it to become corrupted.

I admit it starts to fall apart at the end, and I don't
have a clear theory on how this produces what we see.

In orangefs_clean_up_interrupted_operation

	if (op_state_waiting(op)) {
		/*
		 * upcall hasn't been read; remove op from upcall request
		 * list.
		 */
		spin_unlock(&op->lock);

		/* HERE */

		spin_lock(&orangefs_request_list_lock);
		list_del(&op->list);
		spin_unlock(&orangefs_request_list_lock);
		gossip_debug(GOSSIP_WAIT_DEBUG,
			     "Interrupted: Removed op %p from request_list\n",
			     op);
	} else if (op_state_in_progress(op)) {

and orangefs_devreq_read

restart:
	/* Get next op (if any) from top of list. */
	spin_lock(&orangefs_request_list_lock);
	list_for_each_entry_safe(op, temp, &orangefs_request_list, list) {
		__s32 fsid;
		/* This lock is held past the end of the loop when we break. */

		/* HERE */

		spin_lock(&op->lock);
		if (unlikely(op_state_purged(op))) {
			spin_unlock(&op->lock);
			continue;
		}

I think both processes can end up working on the same
op.

-- Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Orangefs ABI documentation

Commit Message

Comments

Patch