ocfs2: Do not lock/unlock() inode DLM lock

Message ID 20160107050316.GA22340@desktop.lab.bej.apac.novell.com
State New

Commit Message

Zhen Ren Jan. 7, 2016, 5:03 a.m. UTC
Hi,

On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote: 
> Hello Goldwyn,
> 
> When the read path can't get the DLM lock immediately (the NONBLOCK way), it next takes the lock the BLOCK way, and this behavior costs some time (several msecs).
> So it looks like it makes sense to delete those two lines of code.
> But why do those two lines exist in the first place? My worry is that if we delete them, then whenever the read path fails to get the DLM lock the NONBLOCK way, it will retry the lock repeatedly, which could burn too much CPU (not waiting in sleep).
> I just worry about this possibility; Eric will test this case and give feedback.
Sorry for the late reply. After applying this patch, the performance improves very much,
but a "soft lockup" occurred several times (the log message is placed at the end). The two lines:

       if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
           ocfs2_inode_unlock(inode, ex);

are similar to the lines added by this commit:

commit c7e25e6e0b0486492c5faaf6312b37413642c48e
Author: Jan Kara <jack@suse.cz>
Date:   Thu Jun 23 22:51:47 2011 +0200

    ocfs2: Avoid livelock in ocfs2_readpage()
    
    When someone writes to an inode, readers accessing the same inode via
    ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
    do_generic_file_read() looks up the page again and retries ->readpage()
    when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
    readers, they can occupy all CPUs and in non-preempt kernel the system is
    deadlocked because writer holding ip_alloc_sem is never run to release the
    semaphore. Fix the problem by making reader block on ip_alloc_sem to break
    the busy loop.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Joel Becker <jlbec@evilplan.org>
So, this patch removing the two lines of code may potentially result in a "busy wait", where
the reading thread repeatedly checks whether the cluster lock is available. Please correct me
if I'm wrong.

This is the "soft lockup" log:
Dec 30 17:30:40 n2 kernel: [  248.084012] BUG: soft lockup - CPU#3 stuck for 23s! [iomaker:4061]
Dec 30 17:30:40 n2 kernel: [  248.084015] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
Dec 30 17:30:40 n2 kernel: [  248.084049] Supported: No, Unsupported modules are loaded
Dec 30 17:30:40 n2 kernel: [  248.084050] CPU 3 
Dec 30 17:30:40 n2 kernel: [  248.084051] Modules linked in: ocfs2(FN) jbd2 ocfs2_nodemanager(F) quota_tree ocfs2_stack_user(FN) ocfs2_stackglue(FN) dlm(F) sctp libcrc32c configfs edd sg sd_mod crc_t10dif crc32c iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet mperf fuse loop dm_mod ipv6 ipv6_lib 8139too pcspkr acpiphp rtc_cmos pci_hotplug floppy 8139cp button i2c_piix4 mii virtio_balloon ext3 jbd mbcache ttm drm_kms_helper drm i2c_core sysimgblt sysfillrect syscopyarea uhci_hcd ehci_hcd processor thermal_sys hwmon usbcore usb_common intel_agp intel_gtt scsi_dh_alua scsi_dh_emc scsi_dh_hp_sw scsi_dh_rdac scsi_dh virtio_pci ata_generic ata_piix libata scsi_mod virtio_blk virtio virtio_ring
Dec 30 17:30:40 n2 kernel: [  248.084074] Supported: No, Unsupported modules are loaded
Dec 30 17:30:40 n2 kernel: [  248.084075] 
Dec 30 17:30:40 n2 kernel: [  248.084077] Pid: 4061, comm: iomaker Tainted: GF          N  3.0.76-0.11-default #1 Bochs Bochs
Dec 30 17:30:40 n2 kernel: [  248.084080] RIP: 0010:[<ffffffff8145c7a9>]  [<ffffffff8145c7a9>] _raw_spin_lock+0x9/0x20
Dec 30 17:30:40 n2 kernel: [  248.084087] RSP: 0018:ffff8800db3fbb10  EFLAGS: 00000206
Dec 30 17:30:40 n2 kernel: [  248.084088] RAX: 0000000071967196 RBX: ffff880114c1c000 RCX: 0000000000000004
Dec 30 17:30:40 n2 kernel: [  248.084090] RDX: 0000000000007195 RSI: 0000000000000000 RDI: ffff880114c1c0d0
Dec 30 17:30:40 n2 kernel: [  248.084091] RBP: ffff880114c1c0d0 R08: 0000000000000000 R09: 0000000000000003
Dec 30 17:30:40 n2 kernel: [  248.084092] R10: 0000000000000100 R11: 0000000000014819 R12: ffffffff81464f6e
Dec 30 17:30:40 n2 kernel: [  248.084094] R13: ffff880037b34908 R14: ffff880037b342c0 R15: ffff880037b34908
Dec 30 17:30:40 n2 kernel: [  248.084096] FS:  00007fdbe0f7a720(0000) GS:ffff88011fd80000(0000) knlGS:0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084097] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 30 17:30:40 n2 kernel: [  248.084099] CR2: 00007ff1df54b000 CR3: 00000000db2da000 CR4: 00000000000006e0
Dec 30 17:30:40 n2 kernel: [  248.084103] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084106] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 30 17:30:40 n2 kernel: [  248.084108] Process iomaker (pid: 4061, threadinfo ffff8800db3fa000, task ffff880037b342c0)
Dec 30 17:30:40 n2 kernel: [  248.084109] Stack:
Dec 30 17:30:40 n2 kernel: [  248.084112]  ffffffffa04ab9b5 0000000000000000 00000039c2f58103 0000000000000000
Dec 30 17:30:40 n2 kernel: [  248.084115]  0000000000000000 ffff880114c1c0d0 ffff880114c1c0d0 ffff880114c1c0d0
Dec 30 17:30:40 n2 kernel: [  248.084117]  0000000000000000 ffff880114c1c000 ffff8800dba0ec38 0000000000000004
Dec 30 17:30:40 n2 kernel: [  248.084120] Call Trace:
Dec 30 17:30:40 n2 kernel: [  248.084145]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084182]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084204]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084225]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084234]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
Dec 30 17:30:40 n2 kernel: [  248.084238]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
Dec 30 17:30:40 n2 kernel: [  248.084250]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084265]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
Dec 30 17:30:40 n2 kernel: [  248.084268]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
Dec 30 17:30:40 n2 kernel: [  248.084271]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
Dec 30 17:30:40 n2 kernel: [  248.084275]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
Dec 30 17:30:40 n2 kernel: [  248.084280]  [<00007fdbe03532a0>] 0x7fdbe035329f
Dec 30 17:30:40 n2 kernel: [  248.084281] Code: 00 75 04 f0 0f b1 17 0f 94 c2 0f b6 c2 85 c0 ba 01 00 00 00 75 02 31 d2 89 d0 c3 0f 1f 80 00 00 00 00 b8 00 00 01 00 f0 0f c1 07 <0f> b7 d0 c1 e8 10 39 c2 74 07 f3 90 0f b7 17 eb f5 c3 0f 1f 44 
Dec 30 17:30:40 n2 kernel: [  248.084307] Call Trace:
Dec 30 17:30:40 n2 kernel: [  248.084319]  [<ffffffffa04ab9b5>] ocfs2_wait_for_recovery+0x25/0xc0 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084344]  [<ffffffffa049aed8>] ocfs2_inode_lock_full_nested+0x298/0x510 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084366]  [<ffffffffa049b161>] ocfs2_inode_lock_with_page+0x11/0x40 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084386]  [<ffffffffa048170f>] ocfs2_readpage+0x8f/0x210 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084394]  [<ffffffff810f938e>] do_generic_file_read+0x13e/0x490
Dec 30 17:30:40 n2 kernel: [  248.084397]  [<ffffffff810f9d3c>] generic_file_aio_read+0xfc/0x260
Dec 30 17:30:40 n2 kernel: [  248.084409]  [<ffffffffa04a168b>] ocfs2_file_aio_read+0x14b/0x390 [ocfs2]
Dec 30 17:30:40 n2 kernel: [  248.084423]  [<ffffffff811584b8>] do_sync_read+0xc8/0x110
Dec 30 17:30:40 n2 kernel: [  248.084426]  [<ffffffff81158c67>] vfs_read+0xc7/0x130
Dec 30 17:30:40 n2 kernel: [  248.084429]  [<ffffffff81158dd3>] sys_read+0x53/0xa0
Dec 30 17:30:40 n2 kernel: [  248.084431]  [<ffffffff81464592>] system_call_fastpath+0x16/0x1b
Dec 30 17:30:40 n2 kernel: [  248.084435]  [<00007fdbe03532a0>] 0x7fdbe035329f

Thanks,
Eric

> 
> Thanks
> Gang   
> 
> 
> >>> 
> > From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > 
> > DLM does not cache locks. So, blocking lock and unlock
> > will only make the performance worse where contention over
> > the locks is high.
> > 
> > Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
> > ---
> >  fs/ocfs2/dlmglue.c | 8 --------
> >  1 file changed, 8 deletions(-)
> > 
> > diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
> > index 20276e3..f92612e 100644
> > --- a/fs/ocfs2/dlmglue.c
> > +++ b/fs/ocfs2/dlmglue.c
> > @@ -2432,12 +2432,6 @@ bail:
> >   * done this we have to return AOP_TRUNCATED_PAGE so the aop method
> >   * that called us can bubble that back up into the VFS who will then
> >   * immediately retry the aop call.
> > - *
> > - * We do a blocking lock and immediate unlock before returning, though, so that
> > - * the lock has a great chance of being cached on this node by the time the VFS
> > - * calls back to retry the aop.    This has a potential to livelock as nodes
> > - * ping locks back and forth, but that's a risk we're willing to take to avoid
> > - * the lock inversion simply.
> >   */
> >  int ocfs2_inode_lock_with_page(struct inode *inode,
> >  			      struct buffer_head **ret_bh,
> > @@ -2449,8 +2443,6 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
> >  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
> >  	if (ret == -EAGAIN) {
> >  		unlock_page(page);
> > -		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> > -			ocfs2_inode_unlock(inode, ex);
> >  		ret = AOP_TRUNCATED_PAGE;
> >  	}
> >  
> > -- 
> > 2.6.2
> 
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>

Comments

Goldwyn Rodrigues Jan. 8, 2016, 1:24 a.m. UTC | #1
Hi Eric,

On 01/07/2016 10:33 AM, Eric Ren wrote:
> Hi,
>
> On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote:
>> Hello Goldwyn,
>>
>> When the read path can't get the DLM lock immediately (the NONBLOCK way), it next takes the lock the BLOCK way, and this behavior costs some time (several msecs).
>> So it looks like it makes sense to delete those two lines of code.
>> But why do those two lines exist in the first place? My worry is that if we delete them, then whenever the read path fails to get the DLM lock the NONBLOCK way, it will retry the lock repeatedly, which could burn too much CPU (not waiting in sleep).
>> I just worry about this possibility; Eric will test this case and give feedback.
> Sorry for the late reply. After applying this patch, the performance improves very much,
> but a "soft lockup" occurred several times (the log message is placed at the end). The two lines:

Did you perform this test on the upstream kernel? I performed mine on 
upstream and I did not receive any softlockups. AFAICS, you performed it 
on 3.0 based kernels (perhaps SLE11SP3) which does not have the patch 
you have mentioned.


>
>         if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>             ocfs2_inode_unlock(inode, ex);
>
> are similar to the lines added by this commit:
>
> commit c7e25e6e0b0486492c5faaf6312b37413642c48e
> Author: Jan Kara <jack@suse.cz>
> Date:   Thu Jun 23 22:51:47 2011 +0200
>
>      ocfs2: Avoid livelock in ocfs2_readpage()
>
>      When someone writes to an inode, readers accessing the same inode via
>      ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
>      do_generic_file_read() looks up the page again and retries ->readpage()
>      when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
>      readers, they can occupy all CPUs and in non-preempt kernel the system is
>      deadlocked because writer holding ip_alloc_sem is never run to release the
>      semaphore. Fix the problem by making reader block on ip_alloc_sem to break
>      the busy loop.
>
>      Signed-off-by: Jan Kara <jack@suse.cz>
>      Signed-off-by: Joel Becker <jlbec@evilplan.org>
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index 4c1ec8f..ba3ca1e 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -290,7 +290,15 @@ static int ocfs2_readpage(struct file *file, struct page *page)
>          }
>
>          if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
> +               /*
> +                * Unlock the page and cycle ip_alloc_sem so that we don't
> +                * busyloop waiting for ip_alloc_sem to unlock
> +                */
>                  ret = AOP_TRUNCATED_PAGE;
> +               unlock_page(page);
> +               unlock = 0;
> +               down_read(&oi->ip_alloc_sem);
> +               up_read(&oi->ip_alloc_sem);
>                  goto out_inode_unlock;
>          }
>
> So, this patch removing the two lines of code may potentially result in a "busy wait", where
> the reading thread repeatedly checks whether the cluster lock is available. Please correct me
> if I'm wrong.

Correct. In older kernels, not upstream. Right?

> [...]
Zhen Ren Jan. 8, 2016, 2:16 a.m. UTC | #2
Hi!

> Hi Eric,
>
> On 01/07/2016 10:33 AM, Eric Ren wrote:
> > Hi,
> >
> > On Tue, Dec 29, 2015 at 07:31:16PM -0700, He Gang wrote:
> > > Hello Goldwyn,
> > >
> > > [...]
> > Sorry for the late reply. After applying this patch, the performance improves very much,
> > but a "soft lockup" occurred several times (the log message is placed at the end). The two lines:
>
> Did you perform this test on the upstream kernel? I performed mine on
> upstream and I did not receive any softlockups. AFAICS, you performed it
> on 3.0 based kernels (perhaps SLE11SP3) which does not have the patch
> you have mentioned.

I did not test on the upstream kernel yet. But I did test on SLES12 SP1, whose kernel version
is 3.12.49, and so far I have not seen the "soft lockup" on it either. I also checked the
SLE11SP3 kernel, and it really does have this patch: "ocfs2: Avoid livelock in ocfs2_readpage()".

So it is strange to me why the newer kernels do not show the soft lockup issue.

> > [...]
> > So, this patch removing the two lines of code may potentially result in a "busy wait", where
> > the reading thread repeatedly checks whether the cluster lock is available. Please correct me
> > if I'm wrong.
>
> Correct. In older kernels, not upstream. Right?

Right, but the SLE11SP1 kernel has already included the patch above.

Thanks,
Eric

> [...]

Patch

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 4c1ec8f..ba3ca1e 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -290,7 +290,15 @@  static int ocfs2_readpage(struct file *file, struct page *page)
        }
 
        if (down_read_trylock(&oi->ip_alloc_sem) == 0) {
+               /*
+                * Unlock the page and cycle ip_alloc_sem so that we don't
+                * busyloop waiting for ip_alloc_sem to unlock
+                */
                ret = AOP_TRUNCATED_PAGE;
+               unlock_page(page);
+               unlock = 0;
+               down_read(&oi->ip_alloc_sem);
+               up_read(&oi->ip_alloc_sem);
                goto out_inode_unlock;
        }