
Orangefs ABI documentation

Message ID 20160123214006.GO17997@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro Jan. 23, 2016, 9:40 p.m. UTC
On Sat, Jan 23, 2016 at 02:24:51PM -0500, Mike Marshall wrote:
> OK, I'll get them momentarily...
> 
> I merged your other patches, and there was a merge
> conflict I had to work around... you're working from
> an orangefs tree that lacks one commit I had made
> last week... my linux-next tree has all your patches
> through yesterday in it now...
> 
> I am setting up "the gnarly test" (at home from a VM,
> though) that should cause a bunch of cancellations,
> I want to see if I can get
> wait_for_cancellation_downcall to ever
> flow past that "if (signal_pending(current)) {"
> block... if it does, that demonstrates where
> the comments conflict with the code, right?

Yes...  BTW, speaking of that codepath - how can the second caller of
handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
after all, had been in postcopy_buffers(), so the daemon is sitting
in its write_iter() waiting until we finish copying the data out of
bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
the following do the right thing?  That would've left us with only
one caller of handle_io_error()...


Comments

Mike Marshall Jan. 23, 2016, 10:36 p.m. UTC | #1
AV> IOW, would the following do the right thing?
AV> That would've left us with only one caller of
AV> handle_io_error()...

It works.  With your simplified code all
the needed things still happen: complete and
bufmap_put...

I've never had an error there unless I forgot
to turn on the client-core...

You must be looking for a way to get rid of
another macro <g>...

-Mike

On Sat, Jan 23, 2016 at 4:40 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 23, 2016 at 02:24:51PM -0500, Mike Marshall wrote:
>> OK, I'll get them momentarily...
>>
>> I merged your other patches, and there was a merge
>> conflict I had to work around... you're working from
>> an orangefs tree that lacks one commit I had made
>> last week... my linux-next tree has all your patches
>> through yesterday in it now...
>>
>> I am setting up "the gnarly test" (at home from a VM,
>> though) that should cause a bunch of cancellations,
>> I want to see if I can get
>> wait_for_cancellation_downcall to ever
>> flow past that "if (signal_pending(current)) {"
>> block... if it does, that demonstrates where
>> the comments conflict with the code, right?
>
> Yes...  BTW, speaking of that codepath - how can the second caller of
> handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
> after all, had been in postcopy_buffers(), so the daemon is sitting
> in its write_iter() waiting until we finish copying the data out of
> bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
> the following do the right thing?  That would've left us with only
> one caller of handle_io_error()...
>
> diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
> index c585063d..86ba1df 100644
> --- a/fs/orangefs/file.c
> +++ b/fs/orangefs/file.c
> @@ -245,15 +245,8 @@ populate_shared_memory:
>                                        buffer_index,
>                                        iter,
>                                        new_op->downcall.resp.io.amt_complete);
> -               if (ret < 0) {
> -                       /*
> -                        * put error codes in downcall so that handle_io_error()
> -                        * preserves it properly
> -                        */
> -                       new_op->downcall.status = ret;
> -                       handle_io_error();
> -                       goto out;
> -               }
> +               if (ret < 0)
> +                       goto done_copying;
>         }
>         gossip_debug(GOSSIP_FILE_DEBUG,
>             "%s(%pU): Amount written as returned by the sys-io call:%d\n",
> @@ -263,6 +256,7 @@ populate_shared_memory:
>
>         ret = new_op->downcall.resp.io.amt_complete;
>
> +done_copying:
>         /*
>          * tell the device file owner waiting on I/O that this read has
>          * completed and it can return now.
Al Viro Jan. 23, 2016, 10:46 p.m. UTC | #2
On Sat, Jan 23, 2016 at 09:40:06PM +0000, Al Viro wrote:

> Yes...  BTW, speaking of that codepath - how can the second caller of
> handle_io_error() ever get !op_state_serviced(new_op)?  That failure,
> after all, had been in postcopy_buffers(), so the daemon is sitting
> in its write_iter() waiting until we finish copying the data out of
> bufmap; it's too late for sending cancel anyway, is it not?  IOW, would
> the following do the right thing?  That would've left us with only
> one caller of handle_io_error()...

FWIW, I'm not sure I like the correctness implications of the cancel
thing.  Look: we do large write(), it sends a couple of chunks successfully,
gets to submitting the third one, copies its data to bufmap, tells the
daemon to start writing, then gets a signal, sends cancel and buggers off.

What should we get?  -EINTR, despite having written some data?  That's
what the code does now, but I'm not sure it's what the userland expects.
Two chunks worth of data we'd written?  That's what one would expect
if the third one had hit an unmapped page, but in scenario with a signal
hitting us the daemon might very well have overwritten more of the file
by the time it had seen the cancel.

AFAICS, POSIX flat-out prohibits the current behaviour - what it says for
write(2) is
[EINTR]
    The write operation was terminated due to the receipt of a signal,
and no data was transferred.
^^^^^^^^^^^^^^^^^^^^^^^^^^^

but I'm not sure if "return a short write and to hell with having some
data beyond the returned amount actually written" would be better from
the userland POV.  It would be closer to what e.g. NFS is doing, though...

Linus?
Linus Torvalds Jan. 23, 2016, 11:35 p.m. UTC | #3
On Sat, Jan 23, 2016 at 2:46 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What should we get?  -EINTR, despite having written some data?

No, that's not acceptable.

Either all or nothing (which is POSIX) or the NFS 'intr' mount
behavior (partial write return, -EINTR only when nothing was written
at all). And, like NFS, a mount option might be a good thing.

And of course, for the usual reasons, fatal signals are special in
that for them we generally say "screw posix, nobody sees the return
value anyway", but even there the filesystem might as well still
return the partial return value (just to not introduce yet another
special case).

In fact, I think that with our "fatal signals interrupt" behavior,
nobody should likely use the "intr" mount option on NFS. Even if the
semantics may be "better", there are likely simply just too many
programs that don't check the return value of "write()" at all, much
less handle partial writes correctly.
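
[Editor's illustration, not part of the original mail: a minimal userland
sketch of the retry loop a careful program would use, coping with both
short writes and an -EINTR that means "nothing was written". The helper
name write_all() is hypothetical.]

	#include <errno.h>
	#include <unistd.h>

	/* Keep writing until the whole buffer is out or a real error hits. */
	static ssize_t write_all(int fd, const char *buf, size_t len)
	{
		size_t done = 0;

		while (done < len) {
			ssize_t n = write(fd, buf + done, len - done);

			if (n < 0) {
				if (errno == EINTR)
					continue;	/* nothing transferred, retry */
				return -1;		/* genuine error */
			}
			done += n;			/* short write: keep going */
		}
		return done;
	}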

(And yes, our "screw posix" behavior wrt fatal signals is strictly
wrong even _despite_ the fact that nobody sees the return value -
other processes can still obviously see that the whole write wasn't
done. But blocking on a fatal signal is _so_ annoying that it's one of
those things where we just say "posix was wrong on this one, and if we
squint a bit we look _almost_ like we're compliant").

              Linus
Al Viro Jan. 24, 2016, 12:16 a.m. UTC | #4
On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
> AV> IOW, would the following do the right thing?
> AV> That would've left us with only one caller of
> AV> handle_io_error()...
> 
> It works.  With your simplified code all
> the needed things still happen: complete and
> bufmap_put...
> 
> I've never had an error there unless I forgot
> to turn on the client-core...
> 
> You must be looking for a way to get rid of
> another macro <g>...

That as well, but mostly I want to sort the situation with cancels out and
get a better grasp on when that code can be reached.  BTW, an error at that
spot is trivial to arrange - just pass read() a destination with munmapped
page in the middle and it'll trigger just fine.  IOW,
	p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	munmap(p + 16384, 16384);
	read(fd, p, 65536);
with fd being a file on orangefs should step into that.
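
[Editor's illustration: the same reproducer fleshed out into a complete
program, assuming the path given on the command line is a file on an
orangefs mount; error handling is minimal.]

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char *p;
		ssize_t n;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);	/* file on an orangefs mount */
		if (fd < 0) {
			perror("open");
			return 1;
		}
		p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* punch an unmapped hole in the middle of the destination */
		munmap(p + 16384, 16384);
		/* copying out into the hole should make the read fail */
		n = read(fd, p, 65536);
		printf("read returned %zd\n", n);
		return 0;
	}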
Al Viro Jan. 24, 2016, 4:05 a.m. UTC | #5
On Sun, Jan 24, 2016 at 12:16:15AM +0000, Al Viro wrote:
> On Sat, Jan 23, 2016 at 05:36:41PM -0500, Mike Marshall wrote:
> > AV> IOW, would the following do the right thing?
> > AV> That would've left us with only one caller of
> > AV> handle_io_error()...
> > 
> > It works.  With your simplified code all
> > the needed things still happen: complete and
> > bufmap_put...
> > 
> > I've never had an error there unless I forgot
> > to turn on the client-core...
> > 
> > You must be looking for a way to get rid of
> > another macro <g>...
> 
> That as well, but mostly I want to sort the situation with cancels out and
> get a better grasp on when that code can be reached.  BTW, an error at that
> spot is trivial to arrange - just pass read() a destination with munmapped
> page in the middle and it'll trigger just fine.  IOW,
> 	p = mmap(NULL, 65536, PROT_READ|PROT_WRITE,
> 		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 	munmap(p + 16384, 16384);
> 	read(fd, p, 65536);
> with fd being a file on orangefs should step into that.

Hmm...  I've just realized that I don't understand why you are waiting in
the orangefs_devreq_write_iter() at all.  After all, you've reserved the
slot in wait_for_direct_io() and do not release it until right after you've
done complete(&op->done).  Why should server block while your read(2) copies
the data from bufmap to its real destination?  After all, any further
request hitting the same slot won't come until the slot is released anyway,
right?  What (and why) would the server be doing to that slot in the
meanwhile?  It's already copied whatever data it was going to copy as part of
that file_read...

Speaking of your slot allocator - surely it would be better to maintain the
count of unused slots there?  For example, replace that waitq with semaphore,
initialize it to number of slots, and have wait_for_a_slot() do
	if (down_interruptible(slargs->slots_sem)) {
		whine "interrupted"
		return -EINTR;
	}
	/* now we know that there's an empty slot for us */
	spin_lock(slargs->slots_lock);
	n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count);
	set_bit(slargs->slots_bitmap, n);
	spin_unlock(slargs->slots_lock);
	// or a lockless variant thereof, for that matter - just need to be
	// carefull with barriers
	return n;
with put_back_slot() being
	spin_lock
	clear_bit
	spin_unlock
	up
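
[Editor's illustration of the allocator sketched above, with hypothetical
type and field names (slot_args, slots_sem, etc.) rather than the real
orangefs structures; note that the kernel bitops take the bit number first.]

	#include <linux/bitops.h>
	#include <linux/errno.h>
	#include <linux/printk.h>
	#include <linux/semaphore.h>
	#include <linux/spinlock.h>

	struct slot_args {
		struct semaphore slots_sem;	/* initialized to slot_count */
		spinlock_t slots_lock;
		unsigned long *slots_bitmap;
		int slot_count;
	};

	static int wait_for_a_slot(struct slot_args *slargs)
	{
		int n;

		if (down_interruptible(&slargs->slots_sem)) {
			pr_debug("%s: interrupted\n", __func__);
			return -EINTR;
		}
		/* the semaphore guarantees at least one bit is clear here */
		spin_lock(&slargs->slots_lock);
		n = find_first_zero_bit(slargs->slots_bitmap, slargs->slot_count);
		__set_bit(n, slargs->slots_bitmap);
		spin_unlock(&slargs->slots_lock);
		return n;
	}

	static void put_back_slot(struct slot_args *slargs, int n)
	{
		spin_lock(&slargs->slots_lock);
		__clear_bit(n, slargs->slots_bitmap);
		spin_unlock(&slargs->slots_lock);
		up(&slargs->slots_sem);
	}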

Wait a minute...  What happens if you have a daemon disconnect while somebody
is holding a bufmap slot?  Each of those (as well as whoever's waiting for
a slot to come free) is holding a reference to bufmap.  devreq_mutex won't
do a damn thing, contrary to the comment in ->release() - it's _not_ held
across the duration of wait_for_direct_io().

We'll just decrement the refcount of bufmap, do nothing since it hasn't
reached zero, proceed to mark all ops as purged, wake each service_operation()
up and sod off.  Now, the holders of those slots will call
orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
wait_for_direct_io() noticing that, freeing the slot and going into restart.
And if there was the only one, we are fine, but what if there were several?

Suppose there had been two read() going on at the time of disconnect.
The first one drops the slot.  Good, refcount of bufmap is down to 1 now.
And we go back to orangefs_bufmap_get().  Which calls orangefs_bufmap_ref().
Which sees that __orangefs_bufmap is still non-NULL.  And promptly regains
the reference and allocates the slot in that sucker.  Then it's back to
service_operations()...  Which will check is_daemon_in_service(), possibly
getting "yes, it is" (if the new one has already started).  Ouch.

The fun part in all of that is that the new daemon won't be able to get a new
bufmap in place until all users of the old one run down.  What a mess...

Ho-hum...  So we need to
	a) have ->release() wait for all slots to get freed
	b) have orangefs_bufmap_get() recognize that situation and
_not_ get new reference to bufmap that is already going down.
	c) move the "wait for new daemon to come and install a new bufmap"
to some point past the release of the slot - service_operation() is too
early for that to work.

Or am I missing something simple that makes the scenario above not go that
way?  Confused...
Martin Brandenburg Jan. 26, 2016, 7:52 p.m. UTC | #6
On 1/23/16, Al Viro <viro@zeniv.linux.org.uk> wrote:
> We'll just decrement the refcount of bufmap, do nothing since it hasn't
> reached zero, proceed to mark all ops as purged, wake each service_operation()
> up and sod off.  Now, the holders of those slots will call
> orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
> yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
> wait_for_direct_io() noticing that, freeing the slot and going into restart.
> And if there was the only one, we are fine, but what if there were several?

The answer here is yes. Otherwise a malicious client could decline to set up
the bufmap and then crash the kernel by attempting to use it.

-- Martin
Al Viro Jan. 30, 2016, 5:34 p.m. UTC | #7
On Tue, Jan 26, 2016 at 02:52:23PM -0500, Martin Brandenburg wrote:
> On 1/23/16, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > We'll just decrement the refcount of bufmap, do nothing since it hasn't
> > reached zero, proceed to mark all ops as purged, wake each service_operation()
> > up and sod off.  Now, the holders of those slots will call
> > orangefs_get_bufmap_init(), get 1 (since we hadn't dropped the last reference
> > yet - can it *ever* see 0 there, actually?) and return -EAGAIN.  With
> > wait_for_direct_io() noticing that, freeing the slot and going into restart.
> > And if there was the only one, we are fine, but what if there were several?
> 
> The answer here is yes. Otherwise a malicious client could decline to set up
> the bufmap and then crash the kernel by attempting to use it.

Umm...  The question was "what happens if there was more than one slot
in use when we hit
        if (ret == -EAGAIN && op_state_purged(new_op)) {
                orangefs_bufmap_put(bufmap, buffer_index);
                gossip_debug(GOSSIP_FILE_DEBUG,
                             "%s:going to repopulate_shared_memory.\n",
                             __func__);
                goto populate_shared_memory;
        }
in wait_for_direct_io()?"  If the answer is "yes", I'd like to see a more
detailed version, if possible...

Note that __orangefs_bufmap won't become NULL until all slots are freed,
so getting to that place with more than one slot in use will have us
go to populate_shared_memory, where we'll grab a new reference to the
same old bufmap and allocate a slot _there_...

And again, how could
                /* op uses shared memory */
                if (orangefs_get_bufmap_init() == 0) {
in service_operation() possibly be true, when we have
	* op->uses_shared_memory just checked to be set
	* all callers that set it (orangefs_readdir() and wait_for_direct_io())
having allocated a slot before calling service_operation() and not releasing
it until service_operation() returns
	* __orangefs_bufmap not becoming NULL until all slots are freed and
	* orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
NULL?

AFAICS, that code (waiting for daemon to be restarted) is provably never
executed.  What am I missing?
Al Viro Jan. 30, 2016, 6:27 p.m. UTC | #8
On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:

> And again, how could
>                 /* op uses shared memory */
>                 if (orangefs_get_bufmap_init() == 0) {
> in service_operation() possibly be true, when we have
> 	* op->uses_shared_memory just checked to be set
> 	* all callers that set it (orangefs_readdir() and wait_for_direct_io())
> having allocated a slot before calling service_operation() and not releasing
> it until service_operation() returns
> 	* __orangefs_bufmap not becoming NULL until all slots are freed and
> 	* orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
> NULL?
> 
> AFAICS, that code (waiting for daemon to be restarted) is provably never
> executed.  What am I missing?

While we are at it, what happens to original code (without refcount changes,
etc. from my pile - the problem remains with those, but let's look at the
code at merge from v4.4) if something does read() and gets a signal
just as the daemon gets to
        n = copy_from_iter(&op->downcall, downcall_size, iter);
in ->write_iter(), reporting that it has finished that read?  We were in
wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
signal delivery.  Checked op->op_state; it's still not marked as serviced.
We check signal_pending() and find it true.  We hit
orangefs_clean_up_interrupted_operation(), which doesn't do anything to
op->op_state, and we return -EINTR to service_operation().  Which returns
without waiting for anything to wait_for_direct_io() and we go into
sending a cancel.  Now the daemon regains the timeslice.  Marks op as
serviced.  And proceeds to wait on op->io_completion_waitq.

Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
won't - it has no way to find op, nevermind doing wakeups.  The rest of
wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
to work?  Moreover, we issue a cancel; when is it supposed to be processed
and how do we tell if it's already been processed?

Who should be waiting for what in case of cancel being issued just as the
daemon gets around to reporting success of original operation?  On the
protocol level, that is.
Mike Marshall Feb. 4, 2016, 11:30 p.m. UTC | #9
Hi Al...

Sorry I didn't get to your WARN_ONs right away, I've been trying to get
a better handle on the crashing-the-kernel-when-the-client-core-restarts
problem. I've gotten the kdump stuff working and my feeble crash-foo
has improved a little...

I did some testing with the out-of-tree version of the kernel module,
and it is actually able to recover pretty well when the client-core is
stopped. I looked closer at the client-core code and talked with some
of the others who've been around longer than me. The idea behind
the restart is rooted in "keeping things going" in production if
the client-core were to crash for some reason. On closer inspection
I see that the signal handling code in the client-core that simulates
a crash on SIGSEGV and SIGABRT is indeed written with sigaction
like the signal(2) man page says it should be.
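
[Editor's note: the signal(2) man page's point is that handlers which must
stay installed across deliveries should be set up with sigaction() rather
than signal(); a minimal sketch, with a hypothetical handler name:]

	#include <signal.h>
	#include <string.h>

	static void simulate_crash(int sig)
	{
		/* hypothetical: whatever the client-core does on a fake crash */
	}

	static void install_crash_handlers(void)
	{
		struct sigaction sa;

		memset(&sa, 0, sizeof(sa));
		sa.sa_handler = simulate_crash;
		sigemptyset(&sa.sa_mask);
		sa.sa_flags = 0;	/* no SA_RESETHAND: handler stays installed */

		sigaction(SIGSEGV, &sa, NULL);
		sigaction(SIGABRT, &sa, NULL);
	}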

As for the WARN_ONs, the waitqueue one is easy to hit when the
client-core stops and restarts. Below you can see where precopy_buffers
started whining about the client-core, then the client-core restarted
(the debug mask got sent back over), and then the WARN_ON in waitqueue
gets hit:

[ 1239.198976] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.198979] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.198983] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199175] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.199177] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.199180] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.199601] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.199602] precopy_buffers: Failed to copy-in buffers. Please make
sure that  the pvfs2-client is running. -14
[ 1239.199604] orangefs_file_write_iter: do_readv_writev failed, rc:-14:.
[ 1239.248239] dispatch_ioctl_command: client debug mask has been been received  :0: :0:
[ 1239.248257] dispatch_ioctl_command: client debug array string has been received.
[ 1239.307842] ------------[ cut here ]------------
[ 1239.307847] WARNING: CPU: 0 PID: 1347 at fs/orangefs/waitqueue.c:208 service_operation+0x59f/0x9b0()
[ 1239.307848] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1239.307870] CPU: 0 PID: 1347 Comm: dbench Not tainted 4.4.0-161988-g237f828-dirty #49
[ 1239.307871] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1239.307872]  0000000000000000 0000000011eac412 ffff88003bf27cd0 ffffffff8139c84d
[ 1239.307874]  ffff88003bf27d08 ffffffff8108e510 ffff880010968000 ffff88001096c1d0
[ 1239.307876]  ffff88001096c188 00000000fffffff5 0000000000000000 ffff88003bf27d18
[ 1239.307877] Call Trace:
[ 1239.307881]  [<ffffffff8139c84d>] dump_stack+0x19/0x1c
[ 1239.307884]  [<ffffffff8108e510>] warn_slowpath_common+0x80/0xc0
[ 1239.307886]  [<ffffffff8108e65a>] warn_slowpath_null+0x1a/0x20
[ 1239.307887]  [<ffffffff812fe73f>] service_operation+0x59f/0x9b0
[ 1239.307889]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
[ 1239.307891]  [<ffffffff810c28b0>] ? prepare_to_wait_event+0x100/0x100
[ 1239.307893]  [<ffffffff812fbd12>] orangefs_readdir+0x172/0xd50
[ 1239.307895]  [<ffffffff810c9a2d>] ? trace_hardirqs_on+0xd/0x10
[ 1239.307898]  [<ffffffff8120e4bf>] iterate_dir+0x9f/0x130
[ 1239.307899]  [<ffffffff8120e9d0>] SyS_getdents+0xa0/0x140
[ 1239.307900]  [<ffffffff8120e550>] ? iterate_dir+0x130/0x130
[ 1239.307903]  [<ffffffff8178302f>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1239.307904] ---[ end trace 66a9a15ad78b3dea ]---


On the next restart, the kernel crashed. The client-core is restarted
(restarting?) and orangefs_readdir is dealing with the wreckage and
trying to give up the slot it was using:

[ 1255.683226] dispatch_ioctl_command: client debug mask has been been received  :0: :0:
[ 1255.683245] dispatch_ioctl_command: client debug array string has been received.
[ 1255.711036] BUG: unable to handle kernel paging request at ffffffff810b0d68
[ 1255.711911] IP: [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.712605] PGD 1c13067 PUD 1c14063 PMD 10001e1
[ 1255.713147] Oops: 0003 [#1]
[ 1255.713522] Modules linked in: bnep bluetooth ip6t_rpfilter rfkill ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw ppdev parport_pc virtio_balloon virtio_console parport 8139too serio_raw pvpanic i2c_piix4 uinput qxl drm_kms_helper ttm drm 8139cp i2c_core virtio_pci ata_generic virtio virtio_ring mii pata_acpi
[ 1255.720073] CPU: 0 PID: 1347 Comm: dbench Tainted: G        W       4.4.0-161988-g237f828-dirty #49
[ 1255.721160] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 1255.721845] task: ffff880036894d00 ti: ffff88003bf24000 task.ti: ffff88003bf24000
[ 1255.722674] RIP: 0010:[<ffffffff810cad14>]  [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.723669] RSP: 0018:ffff88003bf27c28  EFLAGS: 00010082
[ 1255.724289] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff810b0bd0
[ 1255.725143] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880003067d58
[ 1255.726002] RBP: ffff88003bf27ce8 R08: 0000000000000001 R09: 0000000000000000
[ 1255.726852] R10: ffff880036894d00 R11: 0000000000000000 R12: ffff880003067d58
[ 1255.727731] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1255.728585] FS:  00007fbe5f7d7740(0000) GS:ffffffff81c28000(0000) knlGS:0000000000000000
[ 1255.729593] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1255.730288] CR2: ffffffff810b0d68 CR3: 00000000160da000 CR4: 00000000000006f0
[ 1255.731145] Stack:
[ 1255.731393]  0000000000000296 0000000011eac412 ffff880010968000 ffff88003bf27c70
[ 1255.732414]  00000001000ecc8c ffff88003bf27d18 ffffffff81781b3a ffff88003bf27c90
[ 1255.733340]  0000000000000282 dead000000000200 0000000000000000 00000001000ecc8c
[ 1255.734267] Call Trace:
[ 1255.734567]  [<ffffffff81781b3a>] ? schedule_timeout+0x18a/0x2a0
[ 1255.735296]  [<ffffffff810e82b0>] ? trace_event_raw_event_tick_stop+0x100/0x100
[ 1255.736176]  [<ffffffff810cca82>] lock_acquire+0xc2/0x150
[ 1255.736814]  [<ffffffff812fcbfb>] ? put_back_slot+0x1b/0x70
[ 1255.737467]  [<ffffffff817825b1>] _raw_spin_lock+0x31/0x40
[ 1255.738159]  [<ffffffff812fcbfb>] ? put_back_slot+0x1b/0x70
[ 1255.738838]  [<ffffffff812fcbfb>] put_back_slot+0x1b/0x70
[ 1255.739494]  [<ffffffff812fd36b>] orangefs_readdir_index_put+0x4b/0x70
[ 1255.740287]  [<ffffffff812fbccd>] orangefs_readdir+0x12d/0xd50
[ 1255.740998]  [<ffffffff810c9a2d>] ? trace_hardirqs_on+0xd/0x10
[ 1255.741688]  [<ffffffff8120e4bf>] iterate_dir+0x9f/0x130
[ 1255.742339]  [<ffffffff8120e9d0>] SyS_getdents+0xa0/0x140
[ 1255.743017]  [<ffffffff8120e550>] ? iterate_dir+0x130/0x130
[ 1255.743678]  [<ffffffff8178302f>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1255.744417] Code: c3 49 81 3c 24 c0 6b ef 81 b8 00 00 00 00 44 0f 44 c0 83 fb 01 0f 87 f0 fe ff ff 89 d8 49 8b 4c c4 08 48 85 c9 0f 84 e0 fe ff ff <ff> 81 98 01 00 00 44 8b 1d 97 42 ac 01 41 8b 9a 40 06 00 00 45
[ 1255.747699] RIP  [<ffffffff810cad14>] __lock_acquire+0x1a4/0x1e50
[ 1255.748450]  RSP <ffff88003bf27c28>
[ 1255.748898] CR2: ffffffff810b0d68
[ 1255.749293] ---[ end trace 66a9a15ad78b3deb ]---


There's a lot of timing that has to be gotten right across one of these
restarts, and all the changes we have made to the upstream kernel module
vs. the out-of-tree version have, I guess, gotten some of that timing out
of whack. I'm glad you're helping <g>...

One of the tests I made in the last few days was to basically give up on
every op that showed up in service_operation while is_daemon_in_service()
was failing. That kept the kernel happy, but was obviously more
brutal to the processes (dbench in my test case, people doing work in
real life) than managing to wait on as many of the ops as possible.

Anyhow... there's several more patches from both me and Martin in get-next
as of today. There's still a problem with his permissions patch, but I
guess it is good that we have that call-out now, and we'll get it working
properly. And still no patch to fix the as yet official follow_link
change - I hope I'm not giving Stephen Rothwell heartburn over that. He
said I could just merge in your fceef393a538 commit and then
change/get-rid-of our follow_link code and that would be OK...

-Mike

On Sat, Jan 30, 2016 at 1:27 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Sat, Jan 30, 2016 at 05:34:13PM +0000, Al Viro wrote:
>
>> And again, how could
>>                 /* op uses shared memory */
>>                 if (orangefs_get_bufmap_init() == 0) {
>> in service_operation() possibly be true, when we have
>>       * op->uses_shared_memory just checked to be set
>>       * all callers that set it (orangefs_readdir() and wait_for_direct_io())
>> having allocated a slot before calling service_operation() and not releasing
>> it until service_operation() returns
>>       * __orangefs_bufmap not becoming NULL until all slots are freed and
>>       * orangefs_get_bufmap_init() returning 1 unless __orangefs_bufmap is
>> NULL?
>>
>> AFAICS, that code (waiting for daemon to be restarted) is provably never
>> executed.  What am I missing?
>
> While we are at it, what happens to original code (without refcount changes,
> etc. from my pile - the problem remains with those, but let's look at the
> code at merge from v4.4) if something does read() and gets a signal
> just as the daemon gets to
>         n = copy_from_iter(&op->downcall, downcall_size, iter);
> in ->write_iter(), reporting that it has finished that read?  We were in
> wait_for_matching_downcall(), interruptibly sleeping.  We got woken up by
> signal delivery.  Checked op->op_state; it's still not marked as serviced.
> We check signal_pending() and find it true.  We hit
> orangefs_clean_up_interrupted_operation(), which doesn't do anything to
> op->op_state, and we return -EINTR to service_operation().  Which returns
> without waiting for anything to wait_for_direct_io() and we go into
> sending a cancel.  Now the daemon regains the timeslice.  Marks op as
> serviced.  And proceeds to wait on op->io_completion_waitq.
>
> Who's going to wake it up?  orangefs_cancel_op_in_progress() sure as hell
> won't - it has no way to find op, nevermind doing wakeups.  The rest of
> wait_for_direct_io() also doesn't wake the daemon up.  How is that supposed
> to work?  Moreover, we issue a cancel; when is it supposed to be processed
> and how do we tell if it's already been processed?
>
> Who should be waiting for what in case of cancel being issued just as the
> daemon gets around to reporting success of original operation?  On the
> protocol level, that is.

Patch

diff --git a/fs/orangefs/file.c b/fs/orangefs/file.c
index c585063d..86ba1df 100644
--- a/fs/orangefs/file.c
+++ b/fs/orangefs/file.c
@@ -245,15 +245,8 @@  populate_shared_memory:
 				       buffer_index,
 				       iter,
 				       new_op->downcall.resp.io.amt_complete);
-		if (ret < 0) {
-			/*
-			 * put error codes in downcall so that handle_io_error()
-			 * preserves it properly
-			 */
-			new_op->downcall.status = ret;
-			handle_io_error();
-			goto out;
-		}
+		if (ret < 0)
+			goto done_copying;
 	}
 	gossip_debug(GOSSIP_FILE_DEBUG,
 	    "%s(%pU): Amount written as returned by the sys-io call:%d\n",
@@ -263,6 +256,7 @@  populate_shared_memory:
 
 	ret = new_op->downcall.resp.io.amt_complete;
 
+done_copying:
 	/*
 	 * tell the device file owner waiting on I/O that this read has
 	 * completed and it can return now.