diff mbox

return layout on error, BUG/deadlock

Message ID CABpMAyL-B9Kqg8O4xxxtj0X8PF7fP4TM+NnLA_Ynhm-_3C6MUg@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Idan Kedar Aug. 9, 2012, 1:03 p.m. UTC
Hi,

As a result of some experiments, I wanted to see what happens when I
inject an error (hard coded) to the object layout driver. the patch is
at the bottom of this mail. the reason I did this is because when I
inject errors in my modified version of the object layout driver, I
get the same BUG Tigran reported about yesterday:
nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));

In my modified version (based on kernel 3.3), the bug seems to be that
pnfs_ld_write_done calls pnfs_return_layout in the error path, even if
there is in-flight I/O.
I wanted to see if this is a result of my modifications, so I injected
errors to the ORE (on kernel 3.5) and ran Connectathon's basic tests,
and got a deadlock AND that BUG:

[  112.659066] =================================
[  112.659066] [ INFO: inconsistent lock state ]
[  112.659066] 3.5.0-nfsobj+ #35 Not tainted
[  112.659066] ---------------------------------
[  112.659066] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  112.659066] kworker/0:2/456 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  112.659066]  (&(&objlay->lock)->rlock){+.?...}, at:
[<ffffffffa02558fc>] objlayout_encode_layoutreturn+0x6c/0x420
[objlayoutdriver]
[  112.659066] {IN-SOFTIRQ-W} state was registered at:
[  112.659066]   [<ffffffff810b2bd8>] __lock_acquire+0x5e8/0x1bb0
[  112.659066]   [<ffffffff810b47c2>] lock_acquire+0x92/0x120
[  112.659066]   [<ffffffff81713e16>] _raw_spin_lock+0x36/0x70
[  112.659066]   [<ffffffffa025622d>]
objlayout_iodone.part.1+0x25/0x56 [objlayoutdriver]
[  112.659066]   [<ffffffffa025573b>] objlayout_write_done+0xcb/0xd0
[objlayoutdriver]
[  112.659066]   [<ffffffffa02541c3>] _write_done+0x43/0x60 [objlayoutdriver]
[  112.659066]   [<ffffffffa023a013>] _last_io+0x13/0x20 [libore]
[  112.659066]   [<ffffffffa023a868>] _done_io+0x28/0x30 [libore]
[  112.659066]   [<ffffffffa0229f94>] osd_request_async_done+0xd4/0xf0 [libosd]
[  112.659066]   [<ffffffff812a4a38>] blk_finish_request+0xa8/0x2c0
[  112.659066]   [<ffffffff812a4c9f>] blk_end_bidi_request+0x4f/0x80
[  112.659066]   [<ffffffff812a4d10>] blk_end_request+0x10/0x20
[  112.659066]   [<ffffffff8138c36f>] scsi_io_completion+0xaf/0x670
[  112.659066]   [<ffffffff81382861>] scsi_finish_command+0xd1/0x130
[  112.659066]   [<ffffffff8138c1bf>] scsi_softirq_done+0x13f/0x160
[  112.659066]   [<ffffffff812abb3c>] blk_done_softirq+0x8c/0xa0
[  112.659066]   [<ffffffff81068be8>] __do_softirq+0xc8/0x250
[  112.659066]   [<ffffffff8171696c>] call_softirq+0x1c/0x30
[  112.659066]   [<ffffffff81015785>] do_softirq+0xa5/0xe0
[  112.659066]   [<ffffffff8106893b>] local_bh_enable_ip+0xeb/0x100
[  112.659066]   [<ffffffff81714244>] _raw_spin_unlock_bh+0x34/0x40
[  112.659066]   [<ffffffff815c8abc>] release_sock+0x14c/0x190
[  112.659066]   [<ffffffff81613005>] tcp_sendpage+0xc5/0x6e0
[  112.659066]   [<ffffffff81638703>] inet_sendpage+0xd3/0x120
[  112.659066]   [<ffffffffa0221586>]
iscsi_sw_tcp_pdu_xmit+0x116/0x290 [iscsi_tcp]
[  112.659066]   [<ffffffff814ba930>] iscsi_tcp_task_xmit+0xb0/0x2d0
[  112.659066]   [<ffffffff814354ae>] iscsi_xmit_task+0x5e/0xb0
[  112.659066]   [<ffffffff81437457>] iscsi_xmitworker+0x1f7/0x330
[  112.659066]   [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[  112.659066]   [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[  112.659066]   [<ffffffff810847a7>] kthread+0xb7/0xc0
[  112.659066]   [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[  112.659066] irq event stamp: 56183
[  112.659066] hardirqs last  enabled at (56183): [<ffffffff810688e7>]
local_bh_enable_ip+0x97/0x100
[  112.659066] hardirqs last disabled at (56181): [<ffffffff81068894>]
local_bh_enable_ip+0x44/0x100
[  112.659066] softirqs last  enabled at (56182): [<ffffffffa00344d0>]
xprt_prepare_transmit+0x70/0x90 [sunrpc]
[  112.659066] softirqs last disabled at (56180): [<ffffffff81713ee8>]
_raw_spin_lock_bh+0x18/0x70
[  112.659066]
[  112.659066] other info that might help us debug this:
[  112.659066]  Possible unsafe locking scenario:
[  112.659066]
[  112.659066]        CPU0
[  112.659066]        ----
[  112.659066]   lock(&(&objlay->lock)->rlock);
[  112.659066]   <Interrupt>
[  112.659066]     lock(&(&objlay->lock)->rlock);
[  112.659066]
[  112.659066]  *** DEADLOCK ***
[  112.659066]
[  112.659066] 2 locks held by kworker/0:2/456:
[  112.659066]  #0:  (events){.+.+.+}, at: [<ffffffff8107d84c>]
process_one_work+0x13c/0x530
[  112.659066]  #1:  ((&wdata->task.u.tk_work)){+.+.+.}, at:
[<ffffffff8107d84c>] process_one_work+0x13c/0x530
[  112.659066]
[  112.659066] stack backtrace:
[  112.659066] Pid: 456, comm: kworker/0:2 Not tainted 3.5.0-nfsobj+ #35
[  112.659066] Call Trace:
[  112.659066]  [<ffffffff81703f08>] print_usage_bug+0x1f5/0x206
[  112.659066]  [<ffffffff8102242f>] ? save_stack_trace+0x2f/0x50
[  112.659066]  [<ffffffff810b2595>] mark_lock+0x295/0x2f0
[  112.659066]  [<ffffffff810b1af0>] ? check_usage_forwards+0x140/0x140
[  112.659066]  [<ffffffff810b2c3a>] __lock_acquire+0x64a/0x1bb0
[  112.659066]  [<ffffffff810aef9d>] ? trace_hardirqs_off+0xd/0x10
[  112.659066]  [<ffffffff8109852f>] ? local_clock+0x6f/0x80
[  112.659066]  [<ffffffff8101baf3>] ? native_sched_clock+0x13/0x80
[  112.659066]  [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[  112.659066]  [<ffffffff810982c5>] ? sched_clock_local+0x25/0x90
[  112.659066]  [<ffffffff81098458>] ? sched_clock_cpu+0xa8/0x110
[  112.659066]  [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[  112.659066]  [<ffffffff810b47c2>] lock_acquire+0x92/0x120
[  112.659066]  [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[  112.659066]  [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[  112.659066]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.659066]  [<ffffffff81713e16>] _raw_spin_lock+0x36/0x70
[  112.659066]  [<ffffffffa02558fc>] ?
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[  112.659066]  [<ffffffff8101bb69>] ? sched_clock+0x9/0x10
[  112.659066]  [<ffffffffa02558fc>]
objlayout_encode_layoutreturn+0x6c/0x420 [objlayoutdriver]
[  112.659066]  [<ffffffff81098458>] ? sched_clock_cpu+0xa8/0x110
[  112.659066]  [<ffffffff810aef9d>] ? trace_hardirqs_off+0xd/0x10
[  112.659066]  [<ffffffff8109852f>] ? local_clock+0x6f/0x80
[  112.659066]  [<ffffffffa016541a>] ?
nfs4_xdr_enc_layoutreturn+0x11a/0x170 [nfs]
[  112.659066]  [<ffffffffa0045797>] ?
xdr_encode_opaque_fixed+0x47/0x80 [sunrpc]
[  112.659066]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.659066]  [<ffffffffa0165449>] nfs4_xdr_enc_layoutreturn+0x149/0x170 [nfs]
[  112.659066]  [<ffffffff810688e7>] ? local_bh_enable_ip+0x97/0x100
[  112.659066]  [<ffffffffa00344d0>] ? xprt_prepare_transmit+0x70/0x90 [sunrpc]
[  112.659066]  [<ffffffffa0165300>] ? nfs4_xdr_enc_commit+0xe0/0xe0 [nfs]
[  112.659066]  [<ffffffffa003bc05>] rpcauth_wrap_req+0x65/0x70 [sunrpc]
[  112.659066]  [<ffffffffa0030dae>] call_transmit+0x18e/0x260 [sunrpc]
[  112.659066]  [<ffffffffa003a360>] __rpc_execute+0x70/0x280 [sunrpc]
[  112.659066]  [<ffffffffa0030c20>] ? call_connect+0x40/0x40 [sunrpc]
[  112.659066]  [<ffffffffa0030c20>] ? call_connect+0x40/0x40 [sunrpc]
[  112.659066]  [<ffffffff81084e9e>] ? wake_up_bit+0x2e/0x40
[  112.659066]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.659066]  [<ffffffffa003ab4f>] rpc_execute+0x4f/0xb0 [sunrpc]
[  112.659066]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.659066]  [<ffffffffa00327b5>] rpc_run_task+0x75/0x90 [sunrpc]
[  112.659066]  [<ffffffffa0161c96>] nfs4_proc_layoutreturn+0x86/0xb0 [nfs]
[  112.659066]  [<ffffffffa0174594>] _pnfs_return_layout+0x114/0x180 [nfs]
[  112.659066]  [<ffffffffa0174737>] pnfs_ld_write_done+0x67/0xe0 [nfs]
[  112.659066]  [<ffffffffa02551b5>] _rpc_write_complete+0x15/0x20
[objlayoutdriver]
[  112.659066]  [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[  112.659066]  [<ffffffff8107d84c>] ? process_one_work+0x13c/0x530
[  112.659066]  [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[  112.659066]  [<ffffffff8107f480>] ? manage_workers+0x230/0x230
[  112.659066]  [<ffffffff810847a7>] kthread+0xb7/0xc0
[  112.659066]  [<ffffffff810b5295>] ? trace_hardirqs_on_caller+0x105/0x190
[  112.659066]  [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[  112.659066]  [<ffffffff81714bb0>] ? retint_restore_args+0x13/0x13
[  112.659066]  [<ffffffff810846f0>] ? __init_kthread_worker+0x70/0x70
[  112.659066]  [<ffffffff81716870>] ? gs_change+0x13/0x13
[  112.787051] ------------[ cut here ]------------
[  112.787799] kernel BUG at fs/nfs/nfs4proc.c:6252!
[  112.787799] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  112.787799] CPU 0
[  112.787799] Modules linked in:[  112.787799]
nfs_layout_nfsv41_files objlayoutdriver exofs libore osd libosd
iscsi_tcp netconsole nfs nfsd fscache lockd auth_rpcgss nfs_acl sunrpc
e1000 rtc_cmos serio_raw [last unloaded: scsi_wait_scan]

[  112.787799] Pid: 456, comm: kworker/0:2 Not tainted 3.5.0-nfsobj+
#35 innotek GmbH VirtualBox
[  112.787799] RIP: 0010:[<ffffffffa015a245>]  [<ffffffffa015a245>]
nfs4_layoutreturn_done+0xd5/0xe0 [nfs]
[  112.787799] RSP: 0018:ffff88003c7bdb50  EFLAGS: 00010206
[  112.787799] RAX: ffff880034480b90 RBX: ffff88003cbb5860 RCX: ffff88003add4ca8
[  112.787799] RDX: 0000000000000000 RSI: ffff8800354b1288 RDI: 0000000000000246
[  112.787799] RBP: ffff88003c7bdb70 R08: 0000000000000002 R09: 0000000000000000
[  112.787799] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003a782240
[  112.787799] R13: ffff880034480b68 R14: ffff88003a787138 R15: ffffffffa02551a0
[  112.787799] FS:  0000000000000000(0000) GS:ffff88003e200000(0000)
knlGS:0000000000000000
[  112.787799] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  112.787799] CR2: 0000003f33c9b640 CR3: 0000000034070000 CR4: 00000000000006f0
[  112.787799] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  112.787799] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  112.787799] Process kworker/0:2 (pid: 456, threadinfo
ffff88003c7bc000, task ffff88003add4620)
[  112.787799] Stack:
[  112.787799]  ffff88003a787138 ffff88003a782240 ffff8800347c4858
ffff88003c7bdcf8
[  112.787799]  ffff88003c7bdb90 ffffffffa00391fc ffff88003a787138
ffff88003a782240
[  112.787799]  ffff88003c7bdc00 ffffffffa003a360 000000003c7bdbc0
ffff88003a7822b0
[  112.787799] Call Trace:
[  112.787799]  [<ffffffffa00391fc>] rpc_exit_task+0x2c/0x90 [sunrpc]
[  112.787799]  [<ffffffffa003a360>] __rpc_execute+0x70/0x280 [sunrpc]
[  112.787799]  [<ffffffffa00391d0>] ? rpc_async_release+0x20/0x20 [sunrpc]
[  112.787799]  [<ffffffffa00391d0>] ? rpc_async_release+0x20/0x20 [sunrpc]
[  112.787799]  [<ffffffff81084e9e>] ? wake_up_bit+0x2e/0x40
[  112.787799]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.787799]  [<ffffffffa003ab4f>] rpc_execute+0x4f/0xb0 [sunrpc]
[  112.787799]  [<ffffffffa02551a0>] ?
__copy_nfsS_and_zero_terminate+0x90/0x90 [objlayoutdriver]
[  112.787799]  [<ffffffffa00327b5>] rpc_run_task+0x75/0x90 [sunrpc]
[  112.787799]  [<ffffffffa0161c96>] nfs4_proc_layoutreturn+0x86/0xb0 [nfs]
[  112.787799]  [<ffffffffa0174594>] _pnfs_return_layout+0x114/0x180 [nfs]
[  112.787799]  [<ffffffffa0174737>] pnfs_ld_write_done+0x67/0xe0 [nfs]
[  112.787799]  [<ffffffffa02551b5>] _rpc_write_complete+0x15/0x20
[objlayoutdriver]
[  112.787799]  [<ffffffff8107d8af>] process_one_work+0x19f/0x530
[  112.787799]  [<ffffffff8107d84c>] ? process_one_work+0x13c/0x530
[  112.787799]  [<ffffffff8107f5d9>] worker_thread+0x159/0x340
[  112.787799]  [<ffffffff8107f480>] ? manage_workers+0x230/0x230
[  112.787799]  [<ffffffff810847a7>] kthread+0xb7/0xc0
[  112.787799]  [<ffffffff810b5295>] ? trace_hardirqs_on_caller+0x105/0x190
[  112.787799]  [<ffffffff81716874>] kernel_thread_helper+0x4/0x10
[  112.787799]  [<ffffffff81714bb0>] ? retint_restore_args+0x13/0x13
[  112.787799]  [<ffffffff810846f0>] ? __init_kthread_worker+0x70/0x70
[  112.787799]  [<ffffffff81716870>] ? gs_change+0x13/0x13
[  112.787799] Code: c3 0f 1f 44 00 00 48 8d 73 64 ba 01 00 00 00 4c
89 ef e8 ff ab 01 00 eb c8 0f 1f 44 00 00 4c 89 e7 e8 e0 66 ed ff e9
5a ff ff ff <0f> 0b 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 53 48 83 ec
08 66
[  112.787799] RIP  [<ffffffffa015a245>] nfs4_layoutreturn_done+0xd5/0xe0 [nfs]
[  112.787799]  RSP <ffff88003c7bdb50>
[  112.872492] ---[ end trace 114822acc0612b2a ]---


My setup:

Data server is osd-osc emulator
MDS is nfsd-exofs on kernel 3.3 (Benny's kernel)

Is there any fix for this? Or is it not an issue and my injection is
just wrong and stupid?
I've seen some discussions on this in the mailing list, and I saw the
patch that removes the BUG, but I'm not sure if this is the right
thing to do for object layout.

---
 fs/exofs/ore.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

 			/* start read offset passed endof file */

Comments

Trond Myklebust Aug. 9, 2012, 2:05 p.m. UTC | #1
> -----Original Message-----
> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
> owner@vger.kernel.org] On Behalf Of Idan Kedar
> Sent: Thursday, August 09, 2012 9:03 AM
> To: Boaz Harrosh; NFS list
> Cc: Benny Halevy
> Subject: return layout on error, BUG/deadlock
> 
> Hi,
> 
> As a result of some experiments, I wanted to see what happens when I
> inject an error (hard coded) to the object layout driver. the patch is at the
> bottom of this mail. the reason I did this is because when I inject errors in my
> modified version of the object layout driver, I get the same BUG Tigran
> reported about yesterday:
> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
> 
> In my modified version (based on kernel 3.3), the bug seems to be that
> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
> is in-flight I/O.

That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

See the changelog in the patch that I sent to the list yesterday.
Idan Kedar Aug. 9, 2012, 3:49 p.m. UTC | #2
On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
>> -----Original Message-----
>> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> Sent: Thursday, August 09, 2012 9:03 AM
>> To: Boaz Harrosh; NFS list
>> Cc: Benny Halevy
>> Subject: return layout on error, BUG/deadlock
>>
>> Hi,
>>
>> As a result of some experiments, I wanted to see what happens when I
>> inject an error (hard coded) to the object layout driver. the patch is at the
>> bottom of this mail. the reason I did this is because when I inject errors in my
>> modified version of the object layout driver, I get the same BUG Tigran
>> reported about yesterday:
>> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>>
>> In my modified version (based on kernel 3.3), the bug seems to be that
>> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> is in-flight I/O.
>
> That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

to what change are you referring?

>
> See the changelog in the patch that I sent to the list yesterday.
>

I saw that, and if I'm not mistaken these races apply to object layout
as well, and in any case they apply in my case. However, it is not
easy to mess around with LAYOUTRETURN in object layout, and there have
been several discussions on the issue. In one of these discussions
Benny clarified that the object layout client must wait for all
in-flight I/O to end.
So for file layout it probably makes sense, but object layout (and if
I understand correctly, block layout as well) something else needs to
be done. I thought about sync wait when returning the layout on error,
but according to Boaz it will cause deadlocks (Boaz - can you please
elaborate?).
And come to think of it, nfs4_proc_setattr also returns the layout
when there may be I/O in-flight (correct me if i'm wrong). So I guess
pnfs_return_layout should somehow solve this by itself by saying "if
this is fencing (a flag which will be set for file layout only), go
ahead, otherwise make the layout as 'needs to be returned' and when
the lseg lists gets empty return the layout".
Comments?

> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
>
>
Trond Myklebust Aug. 9, 2012, 4:06 p.m. UTC | #3
On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond

> <Trond.Myklebust@netapp.com> wrote:

> >> -----Original Message-----

> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-

> >> owner@vger.kernel.org] On Behalf Of Idan Kedar

> >> Sent: Thursday, August 09, 2012 9:03 AM

> >> To: Boaz Harrosh; NFS list

> >> Cc: Benny Halevy

> >> Subject: return layout on error, BUG/deadlock

> >>

> >> Hi,

> >>

> >> As a result of some experiments, I wanted to see what happens when I

> >> inject an error (hard coded) to the object layout driver. the patch is at the

> >> bottom of this mail. the reason I did this is because when I inject errors in my

> >> modified version of the object layout driver, I get the same BUG Tigran

> >> reported about yesterday:

> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));

> >>

> >> In my modified version (based on kernel 3.3), the bug seems to be that

> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there

> >> is in-flight I/O.

> >

> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

> 

> to what change are you referring?


As I stated in the changelog of the patch that I sent to the list
yesterday, the behaviour is due to commit 0a57cdac3f.

> >

> > See the changelog in the patch that I sent to the list yesterday.

> >

> 

> I saw that, and if I'm not mistaken these races apply to object layout

> as well, and in any case they apply in my case. However, it is not

> easy to mess around with LAYOUTRETURN in object layout, and there have

> been several discussions on the issue. In one of these discussions

> Benny clarified that the object layout client must wait for all

> in-flight I/O to end.


If the problem is that the DS is failing to respond, how does the client
know that the in-flight I/O has ended?

> So for file layout it probably makes sense, but object layout (and if

> I understand correctly, block layout as well) something else needs to

> be done. I thought about sync wait when returning the layout on error,

> but according to Boaz it will cause deadlocks (Boaz - can you please

> elaborate?).


The object layoutreturn has the ability to pass a timeout error value to
the MDS precisely in order to allow the latter to deal with this kind of
issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.

The block layout is adding the same ability to layoutreturn in NFSv4.2
(see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
layoutreturn_device_error4, so presumably they too have a plan for
dealing with this kind of issue.

> And come to think of it, nfs4_proc_setattr also returns the layout

> when there may be I/O in-flight (correct me if i'm wrong). So I guess

> pnfs_return_layout should somehow solve this by itself by saying "if

> this is fencing (a flag which will be set for file layout only), go

> ahead, otherwise make the layout as 'needs to be returned' and when

> the lseg lists gets empty return the layout".


The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
objects, so that question needs to be directed to Boaz.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 9, 2012, 4:34 p.m. UTC | #4
On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> >> -----Original Message-----
>> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> >> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> To: Boaz Harrosh; NFS list
>> >> Cc: Benny Halevy
>> >> Subject: return layout on error, BUG/deadlock
>> >>
>> >> Hi,
>> >>
>> >> As a result of some experiments, I wanted to see what happens when I
>> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> reported about yesterday:
>> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>> >>
>> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> is in-flight I/O.
>> >
>> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>>
>> to what change are you referring?
>
> As I stated in the changelog of the patch that I sent to the list
> yesterday, the behaviour is due to commit 0a57cdac3f.
>
>> >
>> > See the changelog in the patch that I sent to the list yesterday.
>> >
>>
>> I saw that, and if I'm not mistaken these races apply to object layout
>> as well, and in any case they apply in my case. However, it is not
>> easy to mess around with LAYOUTRETURN in object layout, and there have
>> been several discussions on the issue. In one of these discussions
>> Benny clarified that the object layout client must wait for all
>> in-flight I/O to end.
>
> If the problem is that the DS is failing to respond, how does the client
> know that the in-flight I/O has ended?
>
>> So for file layout it probably makes sense, but object layout (and if
>> I understand correctly, block layout as well) something else needs to
>> be done. I thought about sync wait when returning the layout on error,
>> but according to Boaz it will cause deadlocks (Boaz - can you please
>> elaborate?).
>
> The object layoutreturn has the ability to pass a timeout error value to
> the MDS precisely in order to allow the latter to deal with this kind of
> issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>
> The block layout is adding the same ability to layoutreturn in NFSv4.2
> (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
> layoutreturn_device_error4, so presumably they too have a plan for
> dealing with this kind of issue.
It is one thing to tell MDS that there is DS access error by sending
layoutreturn, and it is another thing to return a layout even if there
is overlapping in-flight DS IO...

I certainly agree that client is entitled to return layout to inform
MDS about DS errors and also avoid possible cb_layoutrecall. But it is
just an optimization and should only be done when there is no
in-flight IO (at least for block layout) IMHO.

Thanks,
Tao
>
>> And come to think of it, nfs4_proc_setattr also returns the layout
>> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> pnfs_return_layout should somehow solve this by itself by saying "if
>> this is fencing (a flag which will be set for file layout only), go
>> ahead, otherwise make the layout as 'needs to be returned' and when
>> the lseg lists gets empty return the layout".
>
> The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
> objects, so that question needs to be directed to Boaz.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 9, 2012, 4:37 p.m. UTC | #5
On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:
> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond

> <Trond.Myklebust@netapp.com> wrote:

> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:

> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond

> >> <Trond.Myklebust@netapp.com> wrote:

> >> >> -----Original Message-----

> >> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-

> >> >> owner@vger.kernel.org] On Behalf Of Idan Kedar

> >> >> Sent: Thursday, August 09, 2012 9:03 AM

> >> >> To: Boaz Harrosh; NFS list

> >> >> Cc: Benny Halevy

> >> >> Subject: return layout on error, BUG/deadlock

> >> >>

> >> >> Hi,

> >> >>

> >> >> As a result of some experiments, I wanted to see what happens when I

> >> >> inject an error (hard coded) to the object layout driver. the patch is at the

> >> >> bottom of this mail. the reason I did this is because when I inject errors in my

> >> >> modified version of the object layout driver, I get the same BUG Tigran

> >> >> reported about yesterday:

> >> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));

> >> >>

> >> >> In my modified version (based on kernel 3.3), the bug seems to be that

> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there

> >> >> is in-flight I/O.

> >> >

> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

> >>

> >> to what change are you referring?

> >

> > As I stated in the changelog of the patch that I sent to the list

> > yesterday, the behaviour is due to commit 0a57cdac3f.

> >

> >> >

> >> > See the changelog in the patch that I sent to the list yesterday.

> >> >

> >>

> >> I saw that, and if I'm not mistaken these races apply to object layout

> >> as well, and in any case they apply in my case. However, it is not

> >> easy to mess around with LAYOUTRETURN in object layout, and there have

> >> been several discussions on the issue. In one of these discussions

> >> Benny clarified that the object layout client must wait for all

> >> in-flight I/O to end.

> >

> > If the problem is that the DS is failing to respond, how does the client

> > know that the in-flight I/O has ended?

> >

> >> So for file layout it probably makes sense, but object layout (and if

> >> I understand correctly, block layout as well) something else needs to

> >> be done. I thought about sync wait when returning the layout on error,

> >> but according to Boaz it will cause deadlocks (Boaz - can you please

> >> elaborate?).

> >

> > The object layoutreturn has the ability to pass a timeout error value to

> > the MDS precisely in order to allow the latter to deal with this kind of

> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.

> >

> > The block layout is adding the same ability to layoutreturn in NFSv4.2

> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct

> > layoutreturn_device_error4, so presumably they too have a plan for

> > dealing with this kind of issue.

> It is one thing to tell MDS that there is DS access error by sending

> layoutreturn, and it is another thing to return a layout even if there

> is overlapping in-flight DS IO...

> 

> I certainly agree that client is entitled to return layout to inform

> MDS about DS errors and also avoid possible cb_layoutrecall. But it is

> just an optimization and should only be done when there is no

> in-flight IO (at least for block layout) IMHO.


HOW DO YOU GUARANTEE NO IN-FLIGHT IO?

Repeating the same mantra about 'no in-flight IO' that doesn't apply to
timeout situations isn't helpful.

A TIMEOUT means that you have NO IDEA if the data is still in flight or
not. That's when you need fencing, and the only thing that can supply
fencing in that situation is the MDS.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Peng Tao Aug. 9, 2012, 4:48 p.m. UTC | #6
On Fri, Aug 10, 2012 at 12:37 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:
>> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> >> -----Original Message-----
>> >> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> >> >> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> To: Boaz Harrosh; NFS list
>> >> >> Cc: Benny Halevy
>> >> >> Subject: return layout on error, BUG/deadlock
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> reported about yesterday:
>> >> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>> >> >>
>> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> is in-flight I/O.
>> >> >
>> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >>
>> >> to what change are you referring?
>> >
>> > As I stated in the changelog of the patch that I sent to the list
>> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >
>> >> >
>> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >
>> >>
>> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> as well, and in any case they apply in my case. However, it is not
>> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> been several discussions on the issue. In one of these discussions
>> >> Benny clarified that the object layout client must wait for all
>> >> in-flight I/O to end.
>> >
>> > If the problem is that the DS is failing to respond, how does the client
>> > know that the in-flight I/O has ended?
>> >
>> >> So for file layout it probably makes sense, but object layout (and if
>> >> I understand correctly, block layout as well) something else needs to
>> >> be done. I thought about sync wait when returning the layout on error,
>> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> elaborate?).
>> >
>> > The object layoutreturn has the ability to pass a timeout error value to
>> > the MDS precisely in order to allow the latter to deal with this kind of
>> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >
>> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> > layoutreturn_device_error4, so presumably they too have a plan for
>> > dealing with this kind of issue.
>> It is one thing to tell MDS that there is DS access error by sending
>> layoutreturn, and it is another thing to return a layout even if there
>> is overlapping in-flight DS IO...
>>
>> I certainly agree that client is entitled to return layout to inform
>> MDS about DS errors and also avoid possible cb_layoutrecall. But it is
>> just an optimization and should only be done when there is no
>> in-flight IO (at least for block layout) IMHO.
>
> HOW DO YOU GUARANTEE NO IN-FLIGHT IO?
>
I don't. That's why I don't return layout in pnfs_ld_write_done(). And
for layoutreturn upon cb_layoutreturn, block layout client needs to do
timed-lease IO fencing per rfc5663, but it is not implemented in Linux
client.

> Repeating the same mantra about 'no in-flight IO' that doesn't apply to
> timeout situations isn't helpful.
>
> A TIMEOUT means that you have NO IDEA if the data is still in flight or
> not. That's when you need fencing, and the only thing that can supply
> fencing in that situation is the MDS.

>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
>
Idan Kedar Aug. 9, 2012, 4:54 p.m. UTC | #7
On Thu, Aug 9, 2012 at 7:06 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> >> -----Original Message-----
>> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> >> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> To: Boaz Harrosh; NFS list
>> >> Cc: Benny Halevy
>> >> Subject: return layout on error, BUG/deadlock
>> >>
>> >> Hi,
>> >>
>> >> As a result of some experiments, I wanted to see what happens when I
>> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> reported about yesterday:
>> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>> >>
>> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> is in-flight I/O.
>> >
>> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>>
>> to what change are you referring?
>
> As I stated in the changelog of the patch that I sent to the list
> yesterday, the behaviour is due to commit 0a57cdac3f.
>

I'm not sure I'm following. I was talking about the state of the code
v3.5, not to any specific patch/change. since the client yelled
"BUG!!!" and crashed, there is some bug involved, either because the
BUG() is correct or because the presence of the BUG() is the bug
itself.

>> >
>> > See the changelog in the patch that I sent to the list yesterday.
>> >
>>
>> I saw that, and if I'm not mistaken these races apply to object layout
>> as well, and in any case they apply in my case. However, it is not
>> easy to mess around with LAYOUTRETURN in object layout, and there have
>> been several discussions on the issue. In one of these discussions
>> Benny clarified that the object layout client must wait for all
>> in-flight I/O to end.
>
> If the problem is that the DS is failing to respond, how does the client
> know that the in-flight I/O has ended?
>

after the last rpc_call_done is called there is no I/O from the
client's perspective an the layout can be safely returned, after which
I/O can be done through the MDS.
Is there I/O pending from the DS perspective? maybe, but all you can
do is send I/O via MDS and hope it will not race. fencing doesn't
really save you, it just improves your odds where applicable, i.e.
file layout.

>> So for file layout it probably makes sense, but object layout (and if
>> I understand correctly, block layout as well) something else needs to
>> be done. I thought about sync wait when returning the layout on error,
>> but according to Boaz it will cause deadlocks (Boaz - can you please
>> elaborate?).
>
> The object layoutreturn has the ability to pass a timeout error value to
> the MDS precisely in order to allow the latter to deal with this kind of
> issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>
> The block layout is adding the same ability to layoutreturn in NFSv4.2
> (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
> layoutreturn_device_error4, so presumably they too have a plan for
> dealing with this kind of issue.
>

I'll wait for Boaz (or someone else) to explain what exactly is "this
kind of issue". I still don't know why sync wait would deadlock.

>> And come to think of it, nfs4_proc_setattr also returns the layout
>> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> pnfs_return_layout should somehow solve this by itself by saying "if
>> this is fencing (a flag which will be set for file layout only), go
>> ahead, otherwise make the layout as 'needs to be returned' and when
>> the lseg lists gets empty return the layout".
>
> The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
> objects, so that question needs to be directed to Boaz.
>

Yes, this entire thread is mainly directed to Boaz... And IIRC he did
want to implement something of that sort, and this thread was
basically for asking him "did you?".

> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
>
Trond Myklebust Aug. 9, 2012, 7:12 p.m. UTC | #8
T24gVGh1LCAyMDEyLTA4LTA5IGF0IDE5OjU0ICswMzAwLCBJZGFuIEtlZGFyIHdyb3RlOg0KPiBP
biBUaHUsIEF1ZyA5LCAyMDEyIGF0IDc6MDYgUE0sIE15a2xlYnVzdCwgVHJvbmQNCj4gPFRyb25k
Lk15a2xlYnVzdEBuZXRhcHAuY29tPiB3cm90ZToNCj4gPiBPbiBUaHUsIDIwMTItMDgtMDkgYXQg
MTg6NDkgKzAzMDAsIElkYW4gS2VkYXIgd3JvdGU6DQo+ID4+IE9uIFRodSwgQXVnIDksIDIwMTIg
YXQgNTowNSBQTSwgTXlrbGVidXN0LCBUcm9uZA0KPiA+PiA8VHJvbmQuTXlrbGVidXN0QG5ldGFw
cC5jb20+IHdyb3RlOg0KPiA+PiA+PiAtLS0tLU9yaWdpbmFsIE1lc3NhZ2UtLS0tLQ0KPiA+PiA+
PiBGcm9tOiBsaW51eC1uZnMtb3duZXJAdmdlci5rZXJuZWwub3JnIFttYWlsdG86bGludXgtbmZz
LQ0KPiA+PiA+PiBvd25lckB2Z2VyLmtlcm5lbC5vcmddIE9uIEJlaGFsZiBPZiBJZGFuIEtlZGFy
DQo+ID4+ID4+IFNlbnQ6IFRodXJzZGF5LCBBdWd1c3QgMDksIDIwMTIgOTowMyBBTQ0KPiA+PiA+
PiBUbzogQm9heiBIYXJyb3NoOyBORlMgbGlzdA0KPiA+PiA+PiBDYzogQmVubnkgSGFsZXZ5DQo+
ID4+ID4+IFN1YmplY3Q6IHJldHVybiBsYXlvdXQgb24gZXJyb3IsIEJVRy9kZWFkbG9jaw0KPiA+
PiA+Pg0KPiA+PiA+PiBIaSwNCj4gPj4gPj4NCj4gPj4gPj4gQXMgYSByZXN1bHQgb2Ygc29tZSBl
eHBlcmltZW50cywgSSB3YW50ZWQgdG8gc2VlIHdoYXQgaGFwcGVucyB3aGVuIEkNCj4gPj4gPj4g
aW5qZWN0IGFuIGVycm9yIChoYXJkIGNvZGVkKSB0byB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIu
IHRoZSBwYXRjaCBpcyBhdCB0aGUNCj4gPj4gPj4gYm90dG9tIG9mIHRoaXMgbWFpbC4gdGhlIHJl
YXNvbiBJIGRpZCB0aGlzIGlzIGJlY2F1c2Ugd2hlbiBJIGluamVjdCBlcnJvcnMgaW4gbXkNCj4g
Pj4gPj4gbW9kaWZpZWQgdmVyc2lvbiBvZiB0aGUgb2JqZWN0IGxheW91dCBkcml2ZXIsIEkgZ2V0
IHRoZSBzYW1lIEJVRyBUaWdyYW4NCj4gPj4gPj4gcmVwb3J0ZWQgYWJvdXQgeWVzdGVyZGF5Og0K
PiA+PiA+PiBuZnM0cHJvYy5jOjYyNTIgOiAgIEJVR19PTighbGlzdF9lbXB0eSgmbG8tPnBsaF9z
ZWdzKSk7DQo+ID4+ID4+DQo+ID4+ID4+IEluIG15IG1vZGlmaWVkIHZlcnNpb24gKGJhc2VkIG9u
IGtlcm5lbCAzLjMpLCB0aGUgYnVnIHNlZW1zIHRvIGJlIHRoYXQNCj4gPj4gPj4gcG5mc19sZF93
cml0ZV9kb25lIGNhbGxzIHBuZnNfcmV0dXJuX2xheW91dCBpbiB0aGUgZXJyb3IgcGF0aCwgZXZl
biBpZiB0aGVyZQ0KPiA+PiA+PiBpcyBpbi1mbGlnaHQgSS9PLg0KPiA+PiA+DQo+ID4+ID4gVGhh
dCBpcyBub3QgYSBidWcuIEl0IGlzIGFuIGludGVudGlvbmFsIGNoYW5nZSBpbiBvcmRlciB0byBh
bGxvdyB0aGUgTURTIHRvIGZlbmNlIG9mZiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVzIChpZiBpdCBj
YW4gZG8gc28pIGJlZm9yZSB3ZSByZXRyYW5zbWl0IHRoZW0gYXMgd3JpdGUtdGhyb3VnaC1NRFMu
IE90aGVyd2lzZSwgeW91IHJpc2sgcmFjZXMgYmV0d2VlbiB0aGUgb3V0c3RhbmRpbmcgd3JpdGVz
LXRvLURTIGFuZCB0aGUgbmV3IHdyaXRlcy10aHJvdWdoLU1EUy4NCj4gPj4NCj4gPj4gdG8gd2hh
dCBjaGFuZ2UgYXJlIHlvdSByZWZlcnJpbmc/DQo+ID4NCj4gPiBBcyBJIHN0YXRlZCBpbiB0aGUg
Y2hhbmdlbG9nIG9mIHRoZSBwYXRjaCB0aGF0IEkgc2VudCB0byB0aGUgbGlzdA0KPiA+IHllc3Rl
cmRheSwgdGhlIGJlaGF2aW91ciBpcyBkdWUgdG8gY29tbWl0IDBhNTdjZGFjM2YuDQo+ID4NCj4g
DQo+IEknbSBub3Qgc3VyZSBJJ20gZm9sbG93aW5nLiBJIHdhcyB0YWxraW5nIGFib3V0IHRoZSBz
dGF0ZSBvZiB0aGUgY29kZQ0KPiB2My41LCBub3QgdG8gYW55IHNwZWNpZmljIHBhdGNoL2NoYW5n
ZS4gc2luY2UgdGhlIGNsaWVudCB5ZWxsZWQNCj4gIkJVRyEhISIgYW5kIGNyYXNoZWQsIHRoZXJl
IGlzIHNvbWUgYnVnIGludm9sdmVkLCBlaXRoZXIgYmVjYXVzZSB0aGUNCj4gQlVHKCkgaXMgY29y
cmVjdCBvciBiZWNhdXNlIHRoZSBwcmVzZW5jZSBvZiB0aGUgQlVHKCkgaXMgdGhlIGJ1Zw0KPiBp
dHNlbGYuDQo+IA0KPiA+PiA+DQo+ID4+ID4gU2VlIHRoZSBjaGFuZ2Vsb2cgaW4gdGhlIHBhdGNo
IHRoYXQgSSBzZW50IHRvIHRoZSBsaXN0IHllc3RlcmRheS4NCj4gPj4gPg0KPiA+Pg0KPiA+PiBJ
IHNhdyB0aGF0LCBhbmQgaWYgSSdtIG5vdCBtaXN0YWtlbiB0aGVzZSByYWNlcyBhcHBseSB0byBv
YmplY3QgbGF5b3V0DQo+ID4+IGFzIHdlbGwsIGFuZCBpbiBhbnkgY2FzZSB0aGV5IGFwcGx5IGlu
IG15IGNhc2UuIEhvd2V2ZXIsIGl0IGlzIG5vdA0KPiA+PiBlYXN5IHRvIG1lc3MgYXJvdW5kIHdp
dGggTEFZT1VUUkVUVVJOIGluIG9iamVjdCBsYXlvdXQsIGFuZCB0aGVyZSBoYXZlDQo+ID4+IGJl
ZW4gc2V2ZXJhbCBkaXNjdXNzaW9ucyBvbiB0aGUgaXNzdWUuIEluIG9uZSBvZiB0aGVzZSBkaXNj
dXNzaW9ucw0KPiA+PiBCZW5ueSBjbGFyaWZpZWQgdGhhdCB0aGUgb2JqZWN0IGxheW91dCBjbGll
bnQgbXVzdCB3YWl0IGZvciBhbGwNCj4gPj4gaW4tZmxpZ2h0IEkvTyB0byBlbmQuDQo+ID4NCj4g
PiBJZiB0aGUgcHJvYmxlbSBpcyB0aGF0IHRoZSBEUyBpcyBmYWlsaW5nIHRvIHJlc3BvbmQsIGhv
dyBkb2VzIHRoZSBjbGllbnQNCj4gPiBrbm93IHRoYXQgdGhlIGluLWZsaWdodCBJL08gaGFzIGVu
ZGVkPw0KPiA+DQo+IA0KPiBhZnRlciB0aGUgbGFzdCBycGNfY2FsbF9kb25lIGlzIGNhbGxlZCB0
aGVyZSBpcyBubyBJL08gZnJvbSB0aGUNCj4gY2xpZW50J3MgcGVyc3BlY3RpdmUgYW4gdGhlIGxh
eW91dCBjYW4gYmUgc2FmZWx5IHJldHVybmVkLCBhZnRlciB3aGljaA0KPiBJL08gY2FuIGJlIGRv
bmUgdGhyb3VnaCB0aGUgTURTLg0KDQpOby4gWW91IGFyZSBjb25mdXNlZC4NCg0KQmVpbmcgaW4g
cnBjX2NhbGxfZG9uZSBqdXN0IHRlbGxzIHlvdSB0aGF0IHRoZSBjbGllbnQgdGhpbmtzIGl0IGlz
IGRvbmUuDQpJdCB0ZWxscyB5b3Ugbm90aGluZyBhYm91dCB3aGF0IHRoZSBEUyBpcyBkb2luZy4g
SXQgbWF5IGp1c3Qgc2xvb29vb3dseQ0KYmUgcHJvY2Vzc2luZyB0aGUgV1JJVEUgY2FsbHMgZm9y
IGFsbCB5b3Uga25vdy4NCg0KVW5sZXNzIHlvdSBnZXQgYSByZXNwb25zZSBmcm9tIHRoZSBEUyBz
YXlpbmcgIkknbSBkb25lIiwgb3IgeW91IGhhdmUNCnNvbWUgb3RoZXIgbWVjaGFuaXNtIHRvIGd1
YXJhbnRlZSB0aGF0IHRoZSBEUyBpcyBkb25lIHdoZW4geW91IHRpbWUgb3V0LA0KdGhlbiB5b3Ug
aGF2ZSBubyBndWFyYW50ZWUgdGhhdCBpdCBpcyBzYWZlIHRvIHNlbmQgdG8gTURTLg0KDQo+IElz
IHRoZXJlIEkvTyBwZW5kaW5nIGZyb20gdGhlIERTIHBlcnNwZWN0aXZlPyBtYXliZSwgYnV0IGFs
bCB5b3UgY2FuDQo+IGRvIGlzIHNlbmQgSS9PIHZpYSBNRFMgYW5kIGhvcGUgaXQgd2lsbCBub3Qg
cmFjZS4gZmVuY2luZyBkb2Vzbid0DQo+IHJlYWxseSBzYXZlIHlvdSwgaXQganVzdCBpbXByb3Zl
cyB5b3VyIG9kZHMgd2hlcmUgYXBwbGljYWJsZSwgaS5lLg0KPiBmaWxlIGxheW91dC4NCg0KQWN0
dWFsbHksIHdoZW4gcmVhZGluZyB0aGUgaVNDU0kgc3BlYywgaXQgaW1wbGllcyB0aGF0IHRoZXJl
IGlzIGEga25vd24NCnNlc3Npb24gdGltZW91dCB0aGF0IGlzIG5lZ290aWF0ZWQgYmV0d2VlbiB0
aGUgaW5pdGlhdG9yIGFuZCB0YXJnZXQuDQpBZnRlciB0aGUgc2Vzc2lvbiBleHBpcmVzLCB5b3Ug
YXJlIGd1YXJhbnRlZWQgdGhhdCBubyBmdXJ0aGVyIGNvbW1hbmRzDQphcmUgcnVubmluZyBvbiB0
aGUgRFMuDQoNClRoZSBpRkNQIHByb3RvY29sIGF0dGFjaGVzIGEgdGltZW91dCB2YWx1ZSB0byBl
YWNoIGZyYW1lLCBhbmQgaXMNCnN1cHBvc2VkIHRvIGVuZm9yY2UgdGhhdCB0aW1lb3V0Lg0KDQpT
byBhdCBsZWFzdCB0aG9zZSBwcm90b2NvbHMgc2hvdWxkIGhhdmUgYSBkZXRlcm1pbmlzdGljIGJl
aGF2aW91ciB0aGF0DQpjYW4gYmUgdXNlZCBieSB0aGUgY2xpZW50Lg0KDQo+ID4+IFNvIGZvciBm
aWxlIGxheW91dCBpdCBwcm9iYWJseSBtYWtlcyBzZW5zZSwgYnV0IG9iamVjdCBsYXlvdXQgKGFu
ZCBpZg0KPiA+PiBJIHVuZGVyc3RhbmQgY29ycmVjdGx5LCBibG9jayBsYXlvdXQgYXMgd2VsbCkg
c29tZXRoaW5nIGVsc2UgbmVlZHMgdG8NCj4gPj4gYmUgZG9uZS4gSSB0aG91Z2h0IGFib3V0IHN5
bmMgd2FpdCB3aGVuIHJldHVybmluZyB0aGUgbGF5b3V0IG9uIGVycm9yLA0KPiA+PiBidXQgYWNj
b3JkaW5nIHRvIEJvYXogaXQgd2lsbCBjYXVzZSBkZWFkbG9ja3MgKEJvYXogLSBjYW4geW91IHBs
ZWFzZQ0KPiA+PiBlbGFib3JhdGU/KS4NCj4gPg0KPiA+IFRoZSBvYmplY3QgbGF5b3V0cmV0dXJu
IGhhcyB0aGUgYWJpbGl0eSB0byBwYXNzIGEgdGltZW91dCBlcnJvciB2YWx1ZSB0bw0KPiA+IHRo
ZSBNRFMgcHJlY2lzZWx5IGluIG9yZGVyIHRvIGFsbG93IHRoZSBsYXR0ZXIgdG8gZGVhbCB3aXRo
IHRoaXMga2luZCBvZg0KPiA+IGlzc3VlLiBTZWUgdGhlIGRlc2NyaXB0aW9uIG9mIHN0cnVjdCBw
bmZzX29zZF9pb2VycjQgaW4gcmZjNTY2NC4NCj4gPg0KPiA+IFRoZSBibG9jayBsYXlvdXQgaXMg
YWRkaW5nIHRoZSBzYW1lIGFiaWxpdHkgdG8gbGF5b3V0cmV0dXJuIGluIE5GU3Y0LjINCj4gPiAo
c2VlIGRyYWZ0LWlldGYtbmZzdjQtbWlub3J2ZXJzaW9uMi0xMy50eHQpIHZpYSB0aGUgc3RydWN0
DQo+ID4gbGF5b3V0cmV0dXJuX2RldmljZV9lcnJvcjQsIHNvIHByZXN1bWFibHkgdGhleSB0b28g
aGF2ZSBhIHBsYW4gZm9yDQo+ID4gZGVhbGluZyB3aXRoIHRoaXMga2luZCBvZiBpc3N1ZS4NCj4g
Pg0KPiANCj4gSSdsbCB3YWl0IGZvciBCb2F6IChvciBzb21lb25lIGVsc2UpIHRvIGV4cGxhaW4g
d2hhdCBleGFjdGx5IGlzICJ0aGlzDQo+IGtpbmQgb2YgaXNzdWUiLiBJIHN0aWxsIGRvbid0IGtu
b3cgd2h5IHN5bmMgd2FpdCB3b3VsZCBkZWFkbG9jay4NCj4gDQo+ID4+IEFuZCBjb21lIHRvIHRo
aW5rIG9mIGl0LCBuZnM0X3Byb2Nfc2V0YXR0ciBhbHNvIHJldHVybnMgdGhlIGxheW91dA0KPiA+
PiB3aGVuIHRoZXJlIG1heSBiZSBJL08gaW4tZmxpZ2h0IChjb3JyZWN0IG1lIGlmIGknbSB3cm9u
ZykuIFNvIEkgZ3Vlc3MNCj4gPj4gcG5mc19yZXR1cm5fbGF5b3V0IHNob3VsZCBzb21laG93IHNv
bHZlIHRoaXMgYnkgaXRzZWxmIGJ5IHNheWluZyAiaWYNCj4gPj4gdGhpcyBpcyBmZW5jaW5nIChh
IGZsYWcgd2hpY2ggd2lsbCBiZSBzZXQgZm9yIGZpbGUgbGF5b3V0IG9ubHkpLCBnbw0KPiA+PiBh
aGVhZCwgb3RoZXJ3aXNlIG1ha2UgdGhlIGxheW91dCBhcyAnbmVlZHMgdG8gYmUgcmV0dXJuZWQn
IGFuZCB3aGVuDQo+ID4+IHRoZSBsc2VnIGxpc3RzIGdldHMgZW1wdHkgcmV0dXJuIHRoZSBsYXlv
dXQiLg0KPiA+DQo+ID4gVGhlIG9ubHkgbGF5b3V0IHR5cGUgdGhhdCBzZXRzIHRoZSBQTkZTX0xB
WU9VVFJFVF9PTl9TRVRBVFRSIGZsYWcgaXMNCj4gPiBvYmplY3RzLCBzbyB0aGF0IHF1ZXN0aW9u
IG5lZWRzIHRvIGJlIGRpcmVjdGVkIHRvIEJvYXouDQo+ID4NCj4gDQo+IFllcywgdGhpcyBlbnRp
cmUgdGhyZWFkIGlzIG1haW5seSBkaXJlY3RlZCB0byBCb2F6Li4uIEFuZCBJSVJDIGhlIGRpZA0K
PiB3YW50IHRvIGltcGxlbWVudCBzb21ldGhpbmcgb2YgdGhhdCBzb3J0LCBhbmQgdGhpcyB0aHJl
YWQgd2FzDQo+IGJhc2ljYWxseSBmb3IgYXNraW5nIGhpbSAiZGlkIHlvdT8iLg0KPiANCj4gPiAt
LQ0KPiA+IFRyb25kIE15a2xlYnVzdA0KPiA+IExpbnV4IE5GUyBjbGllbnQgbWFpbnRhaW5lcg0K
PiA+DQo+ID4gTmV0QXBwDQo+ID4gVHJvbmQuTXlrbGVidXN0QG5ldGFwcC5jb20NCj4gPiB3d3cu
bmV0YXBwLmNvbQ0KPiA+DQo+IA0KPiANCj4gDQoNCi0tIA0KVHJvbmQgTXlrbGVidXN0DQpMaW51
eCBORlMgY2xpZW50IG1haW50YWluZXINCg0KTmV0QXBwDQpUcm9uZC5NeWtsZWJ1c3RAbmV0YXBw
LmNvbQ0Kd3d3Lm5ldGFwcC5jb20NCg0K
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Trond Myklebust Aug. 9, 2012, 7:31 p.m. UTC | #9
On Fri, 2012-08-10 at 00:48 +0800, Peng Tao wrote:
> On Fri, Aug 10, 2012 at 12:37 AM, Myklebust, Trond

> <Trond.Myklebust@netapp.com> wrote:

> > On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:

> >> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond

> >> <Trond.Myklebust@netapp.com> wrote:

> >> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:

> >> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond

> >> >> <Trond.Myklebust@netapp.com> wrote:

> >> >> >> -----Original Message-----

> >> >> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-

> >> >> >> owner@vger.kernel.org] On Behalf Of Idan Kedar

> >> >> >> Sent: Thursday, August 09, 2012 9:03 AM

> >> >> >> To: Boaz Harrosh; NFS list

> >> >> >> Cc: Benny Halevy

> >> >> >> Subject: return layout on error, BUG/deadlock

> >> >> >>

> >> >> >> Hi,

> >> >> >>

> >> >> >> As a result of some experiments, I wanted to see what happens when I

> >> >> >> inject an error (hard coded) to the object layout driver. the patch is at the

> >> >> >> bottom of this mail. the reason I did this is because when I inject errors in my

> >> >> >> modified version of the object layout driver, I get the same BUG Tigran

> >> >> >> reported about yesterday:

> >> >> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));

> >> >> >>

> >> >> >> In my modified version (based on kernel 3.3), the bug seems to be that

> >> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there

> >> >> >> is in-flight I/O.

> >> >> >

> >> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.

> >> >>

> >> >> to what change are you referring?

> >> >

> >> > As I stated in the changelog of the patch that I sent to the list

> >> > yesterday, the behaviour is due to commit 0a57cdac3f.

> >> >

> >> >> >

> >> >> > See the changelog in the patch that I sent to the list yesterday.

> >> >> >

> >> >>

> >> >> I saw that, and if I'm not mistaken these races apply to object layout

> >> >> as well, and in any case they apply in my case. However, it is not

> >> >> easy to mess around with LAYOUTRETURN in object layout, and there have

> >> >> been several discussions on the issue. In one of these discussions

> >> >> Benny clarified that the object layout client must wait for all

> >> >> in-flight I/O to end.

> >> >

> >> > If the problem is that the DS is failing to respond, how does the client

> >> > know that the in-flight I/O has ended?

> >> >

> >> >> So for file layout it probably makes sense, but object layout (and if

> >> >> I understand correctly, block layout as well) something else needs to

> >> >> be done. I thought about sync wait when returning the layout on error,

> >> >> but according to Boaz it will cause deadlocks (Boaz - can you please

> >> >> elaborate?).

> >> >

> >> > The object layoutreturn has the ability to pass a timeout error value to

> >> > the MDS precisely in order to allow the latter to deal with this kind of

> >> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.

> >> >

> >> > The block layout is adding the same ability to layoutreturn in NFSv4.2

> >> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct

> >> > layoutreturn_device_error4, so presumably they too have a plan for

> >> > dealing with this kind of issue.

> >> It is one thing to tell MDS that there is DS access error by sending

> >> layoutreturn, and it is another thing to return a layout even if there

> >> is overlapping in-flight DS IO...

> >>

> >> I certainly agree that client is entitled to return layout to inform

> >> MDS about DS errors and also avoid possible cb_layoutrecall. But it is

> >> just an optimization and should only be done when there is no

> >> in-flight IO (at least for block layout) IMHO.

> >

> > HOW DO YOU GUARANTEE NO IN-FLIGHT IO?

> >

> I don't. That's why I don't return layout in pnfs_ld_write_done(). And

> for layoutreturn upon cb_layoutreturn, block layout client needs to do

> timed-lease IO fencing per rfc5663, but it is not implemented in Linux

> client.


The timed-lease IO fencing described in rfc5663 is about informing the
server about how long the client expects a command to succeed or fail.
It doesn't offer any advice for how the client is to deal with an
unresponsive DS.

What you need here is help from the underlying transport protocol. As I
said in the email to Idan, when researching iSCSI and iFCP, I found what
appears to be mechanisms for reliably timing out.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
Idan Kedar Aug. 9, 2012, 11:04 p.m. UTC | #10
On Thu, Aug 9, 2012 at 10:12 PM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Thu, 2012-08-09 at 19:54 +0300, Idan Kedar wrote:
>> On Thu, Aug 9, 2012 at 7:06 PM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> >> -----Original Message-----
>> >> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> >> >> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> To: Boaz Harrosh; NFS list
>> >> >> Cc: Benny Halevy
>> >> >> Subject: return layout on error, BUG/deadlock
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> reported about yesterday:
>> >> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>> >> >>
>> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> is in-flight I/O.
>> >> >
>> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >>
>> >> to what change are you referring?
>> >
>> > As I stated in the changelog of the patch that I sent to the list
>> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >
>>
>> I'm not sure I'm following. I was talking about the state of the code
>> v3.5, not to any specific patch/change. since the client yelled
>> "BUG!!!" and crashed, there is some bug involved, either because the
>> BUG() is correct or because the presence of the BUG() is the bug
>> itself.
>>
>> >> >
>> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >
>> >>
>> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> as well, and in any case they apply in my case. However, it is not
>> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> been several discussions on the issue. In one of these discussions
>> >> Benny clarified that the object layout client must wait for all
>> >> in-flight I/O to end.
>> >
>> > If the problem is that the DS is failing to respond, how does the client
>> > know that the in-flight I/O has ended?
>> >
>>
>> after the last rpc_call_done is called there is no I/O from the
>> client's perspective an the layout can be safely returned, after which
>> I/O can be done through the MDS.
>
> No. You are confused.
>
> Being in rpc_call_done just tells you that the client thinks it is done.
> It tells you nothing about what the DS is doing. It may just slooooowly
> be processing the WRITE calls for all you know.
>

That's what I meant by "from the client's perspective". At this point,
the client can say "I'm not *using* this layout anymore" because even
if some I/O on its way to the DS originates in the client, the client
can't do anything about it. The fate of this I/O is meaningless to the
client at that point. This is the best the client can do to help the
MDS preserve integrity - keep the layout as long as it has information
about its I/O.

> Unless you get a response from the DS saying "I'm done", or you have
> some other mechanism to guarantee that the DS is done when you time out,
> then you have no guarantee that it is safe to send to MDS.
>

Exactly. What if I don't have such a mechanism? My options are:
* Ask the MDS to somehow fence in-flight I/O and resend through the MDS.
* Return an error to the user.
What if the server does not support fencing? In that case we shouldn't
send through MDS while there is in-flight I/O. Do we want to assume
fencing is implemented? And if fencing isn't implemented and we return
an error to the user - how do we know when it is safe to perform any
I/O again?
And even if fencing is implemented, I don't think the client should
rely on it, because the whole idea beind pNFS is a smart client that
offloads what it can from the servers.

>> Is there I/O pending from the DS perspective? maybe, but all you can
>> do is send I/O via MDS and hope it will not race. fencing doesn't
>> really save you, it just improves your odds where applicable, i.e.
>> file layout.

One correction to what I said: Fencing does save you, if it's
implemented. LAYOUTRETURN doesn't save you (when fencing is not
implemented).

>
> Actually, when reading the iSCSI spec, it implies that there is a known
> session timeout that is negotiated between the initiator and target.
> After the session expires, you are guaranteed that no further commands
> are running on the DS.
>
> The iFCP protocol attaches a timeout value to each frame, and is
> supposed to enforce that timeout.
>
> So at least those protocols should have a deterministic behaviour that
> can be used by the client.
>
>> >> So for file layout it probably makes sense, but object layout (and if
>> >> I understand correctly, block layout as well) something else needs to
>> >> be done. I thought about sync wait when returning the layout on error,
>> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> elaborate?).
>> >
>> > The object layoutreturn has the ability to pass a timeout error value to
>> > the MDS precisely in order to allow the latter to deal with this kind of
>> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >
>> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> > layoutreturn_device_error4, so presumably they too have a plan for
>> > dealing with this kind of issue.
>> >

Then on layoutreturn we could call a new layout driver method, end_io
or something returns only when there is no danger of race (either by
waiting for in-flight I/O to complete/timeout/error out or by fencing)
and only then return the layout. This works for me. I don't know if it
works for Boaz though.

>>
>> I'll wait for Boaz (or someone else) to explain what exactly is "this
>> kind of issue". I still don't know why sync wait would deadlock.
>>
>> >> And come to think of it, nfs4_proc_setattr also returns the layout
>> >> when there may be I/O in-flight (correct me if i'm wrong). So I guess
>> >> pnfs_return_layout should somehow solve this by itself by saying "if
>> >> this is fencing (a flag which will be set for file layout only), go
>> >> ahead, otherwise make the layout as 'needs to be returned' and when
>> >> the lseg lists gets empty return the layout".
>> >
>> > The only layout type that sets the PNFS_LAYOUTRET_ON_SETATTR flag is
>> > objects, so that question needs to be directed to Boaz.
>> >
>>
>> Yes, this entire thread is mainly directed to Boaz... And IIRC he did
>> want to implement something of that sort, and this thread was
>> basically for asking him "did you?".
>>
>> > --
>> > Trond Myklebust
>> > Linux NFS client maintainer
>> >
>> > NetApp
>> > Trond.Myklebust@netapp.com
>> > www.netapp.com
>> >
>>
>>
>>
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> www.netapp.com
>
Peng Tao Aug. 10, 2012, 2:46 a.m. UTC | #11
On Fri, Aug 10, 2012 at 3:31 AM, Myklebust, Trond
<Trond.Myklebust@netapp.com> wrote:
> On Fri, 2012-08-10 at 00:48 +0800, Peng Tao wrote:
>> On Fri, Aug 10, 2012 at 12:37 AM, Myklebust, Trond
>> <Trond.Myklebust@netapp.com> wrote:
>> > On Fri, 2012-08-10 at 00:34 +0800, Peng Tao wrote:
>> >> On Fri, Aug 10, 2012 at 12:06 AM, Myklebust, Trond
>> >> <Trond.Myklebust@netapp.com> wrote:
>> >> > On Thu, 2012-08-09 at 18:49 +0300, Idan Kedar wrote:
>> >> >> On Thu, Aug 9, 2012 at 5:05 PM, Myklebust, Trond
>> >> >> <Trond.Myklebust@netapp.com> wrote:
>> >> >> >> -----Original Message-----
>> >> >> >> From: linux-nfs-owner@vger.kernel.org [mailto:linux-nfs-
>> >> >> >> owner@vger.kernel.org] On Behalf Of Idan Kedar
>> >> >> >> Sent: Thursday, August 09, 2012 9:03 AM
>> >> >> >> To: Boaz Harrosh; NFS list
>> >> >> >> Cc: Benny Halevy
>> >> >> >> Subject: return layout on error, BUG/deadlock
>> >> >> >>
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> As a result of some experiments, I wanted to see what happens when I
>> >> >> >> inject an error (hard coded) to the object layout driver. the patch is at the
>> >> >> >> bottom of this mail. the reason I did this is because when I inject errors in my
>> >> >> >> modified version of the object layout driver, I get the same BUG Tigran
>> >> >> >> reported about yesterday:
>> >> >> >> nfs4proc.c:6252 :   BUG_ON(!list_empty(&lo->plh_segs));
>> >> >> >>
>> >> >> >> In my modified version (based on kernel 3.3), the bug seems to be that
>> >> >> >> pnfs_ld_write_done calls pnfs_return_layout in the error path, even if there
>> >> >> >> is in-flight I/O.
>> >> >> >
>> >> >> > That is not a bug. It is an intentional change in order to allow the MDS to fence off the outstanding writes (if it can do so) before we retransmit them as write-through-MDS. Otherwise, you risk races between the outstanding writes-to-DS and the new writes-through-MDS.
>> >> >>
>> >> >> to what change are you referring?
>> >> >
>> >> > As I stated in the changelog of the patch that I sent to the list
>> >> > yesterday, the behaviour is due to commit 0a57cdac3f.
>> >> >
>> >> >> >
>> >> >> > See the changelog in the patch that I sent to the list yesterday.
>> >> >> >
>> >> >>
>> >> >> I saw that, and if I'm not mistaken these races apply to object layout
>> >> >> as well, and in any case they apply in my case. However, it is not
>> >> >> easy to mess around with LAYOUTRETURN in object layout, and there have
>> >> >> been several discussions on the issue. In one of these discussions
>> >> >> Benny clarified that the object layout client must wait for all
>> >> >> in-flight I/O to end.
>> >> >
>> >> > If the problem is that the DS is failing to respond, how does the client
>> >> > know that the in-flight I/O has ended?
>> >> >
>> >> >> So for file layout it probably makes sense, but object layout (and if
>> >> >> I understand correctly, block layout as well) something else needs to
>> >> >> be done. I thought about sync wait when returning the layout on error,
>> >> >> but according to Boaz it will cause deadlocks (Boaz - can you please
>> >> >> elaborate?).
>> >> >
>> >> > The object layoutreturn has the ability to pass a timeout error value to
>> >> > the MDS precisely in order to allow the latter to deal with this kind of
>> >> > issue. See the description of struct pnfs_osd_ioerr4 in rfc5664.
>> >> >
>> >> > The block layout is adding the same ability to layoutreturn in NFSv4.2
>> >> > (see draft-ietf-nfsv4-minorversion2-13.txt) via the struct
>> >> > layoutreturn_device_error4, so presumably they too have a plan for
>> >> > dealing with this kind of issue.
>> >> It is one thing to tell MDS that there is DS access error by sending
>> >> layoutreturn, and it is another thing to return a layout even if there
>> >> is overlapping in-flight DS IO...
>> >>
>> >> I certainly agree that client is entitled to return layout to inform
>> >> MDS about DS errors and also avoid possible cb_layoutrecall. But it is
>> >> just an optimization and should only be done when there is no
>> >> in-flight IO (at least for block layout) IMHO.
>> >
>> > HOW DO YOU GUARANTEE NO IN-FLIGHT IO?
>> >
>> I don't. That's why I don't return layout in pnfs_ld_write_done(). And
>> for layoutreturn upon cb_layoutreturn, block layout client needs to do
>> timed-lease IO fencing per rfc5663, but it is not implemented in Linux
>> client.
>
> The timed-lease IO fencing described in rfc5663 is about informing the
> server about how long the client expects a command to succeed or fail.
> It doesn't offer any advice for how the client is to deal with an
> unresponsive DS.
>
> What you need here is help from the underlying transport protocol. As I
> said in the email to Idan, when researching iSCSI and iFCP, I found what
> appears to be mechanisms for reliably timing out.
Just checked and found that the layoutreturn-on-error behavior only
affects object and file layout. So block layout stays out and safe.
That's all I would ask for. Thanks for your explanation.

Best,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c
index 24a49d4..e9d7b45 100644
--- a/fs/exofs/ore.c
+++ b/fs/exofs/ore.c
@@ -426,9 +426,14 @@  int ore_check_io(struct ore_io_state *ios,
ore_on_dev_error on_dev_error)
 		if (unlikely(!or))
 			continue;

-		ret = osd_req_decode_sense(or, &osi);
-		if (likely(!ret))
-			continue;
+		if (jiffies % 6 == 0) {
+			osi.osd_err_pri = OSD_ERR_PRI_EIO;
+			ret = -EIO;
+		} else {
+			ret = osd_req_decode_sense(or, &osi);
+			if (likely(!ret))
+				continue;
+		}

 		if (OSD_ERR_PRI_CLEAR_PAGES == osi.osd_err_pri) {