From patchwork Fri Dec 9 09:30:46 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zhen Ren
X-Patchwork-Id: 9467715
From: Eric Ren
To: ocfs2-devel@oss.oracle.com
Cc: linux-kernel@vger.kernel.org
Date: Fri, 9 Dec 2016 17:30:46 +0800
Message-Id: <1481275846-6604-1-git-send-email-zren@suse.com>
X-Mailer: git-send-email 2.6.6
Subject: [Ocfs2-devel] [PATCH] ocfs2: fix crash caused by stale lvb with fsdlm plugin

The crash happens rather often when we reset some cluster nodes while
the nodes contend fiercely to truncate and append.

The crash backtrace is below:

"
[ 245.197849] dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
[ 245.197859] dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
[ 245.198379] ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
[ 247.272338] ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
[ 247.547084] ocfs2: Beginning quota recovery on device (253,18) for slot 2
[ 247.683263] ocfs2: Finishing quota recovery on device (253,18) for slot 2
[ 247.833022] (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
[ 247.833029] (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
[ 247.833074] ------------[ cut here ]------------
[ 247.833077] kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
[ 247.833079] invalid opcode: 0000 [#1] SMP
[ 247.833081] Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
[ 247.833107] Supported: No, Unsupported modules are loaded
[ 247.833110] CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
[ 247.833111] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
[ 247.833112] task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
[ 247.833113] RIP: 0010:[] [] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
[ 247.833151] RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
[ 247.833152] RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
[ 247.833153] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[ 247.833154] RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
[ 247.833155] R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
[ 247.833156] R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
[ 247.833157] FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
[ 247.833158] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 247.833159] CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
[ 247.833164] Stack:
[ 247.833165] 00000000000003a9 0000000000000001 ffff880060554000 ffff88004fcaf000
[ 247.833167] ffff88003aa7b090 1000000000000000 ffff88003aab3448 ffff880074e6beb0
[ 247.833169] 0000000000000001 0000000000002068 0000000000000020 0000000000000000
[ 247.833171] Call Trace:
[ 247.833208] [] ocfs2_setattr+0x698/0xa90 [ocfs2]
[ 247.833225] [] notify_change+0x1ae/0x380
[ 247.833242] [] do_truncate+0x5e/0x90
[ 247.833246] [] do_sys_ftruncate.constprop.11+0x108/0x160
[ 247.833257] [] entry_SYSCALL_64_fastpath+0x12/0x6d
[ 247.834724] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x12/0x6d
[ 247.834725]
[ 247.834726] Leftover inexact backtrace:
[ 247.834728] Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
[ 247.834748] RIP [] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
[ 247.834774] RSP
"

This happens because ocfs2_inode_lock() hands us a stale LVB in which
i_size does not match the on-disk i_size. We mistakenly trust the LVB
because the underlying fsdlm dlm_lock() does not set DLM_SBF_VALNOTVALID
in lkb_sbflags for us. But why?

For a PR->NULL conversion, the current code down-converts the lock
without the DLM_LKF_VALBLK flag, to tell o2cb not to update the RSB's
LVB, even if the lock resource type needs an LVB. This is not the right
way for fsdlm.

The fsdlm plugin behaves differently on DLM_LKF_VALBLK: it relies on
DLM_LKF_VALBLK to decide whether we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm skips this LKB when it recovers the
RSB's LVB after a node failure, and so misses the chance to set
DLM_SBF_VALNOTVALID.
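To make the flag dependency concrete, here is a small userspace model of
the behaviour described above. It is only a sketch and is not the fsdlm
code: every model_* name, the MODEL_LKF_VALBLK value and the with_fix
switch are made up for illustration, and the recovery rule is reduced to
"ignore LKBs without the value-block flag; mark the LVB not valid when
no remaining LKB holds PR or higher".

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in values for this sketch only; the real DLM_LKF_VALBLK flag and
 * lock modes are defined by the kernel's DLM headers. */
#define MODEL_LKF_VALBLK 0x1u
enum model_mode { MODEL_MODE_NL, MODEL_MODE_PR, MODEL_MODE_EX };

struct model_lkb {
	unsigned int exflags;	/* flags of the last lock/convert request */
	enum model_mode mode;	/* currently granted mode */
};

/*
 * Flag choice for a down-convert. with_fix is a hypothetical switch that
 * selects the behaviour proposed by this patch: when o2cb is not the
 * active stack and the lock type uses an LVB, always ask for the LVB.
 */
unsigned int model_downconvert_flags(bool o2cb_active, bool uses_lvb,
				     bool lvb, bool with_fix)
{
	if (with_fix && !o2cb_active && uses_lvb)
		lvb = true;
	return lvb ? MODEL_LKF_VALBLK : 0;
}

/*
 * Toy recovery pass over the LKBs left on one RSB after a node failure.
 * Returns true when the LVB would be marked not valid.
 */
bool model_recover_marks_lvb_invalid(const struct model_lkb *lkbs, size_t n)
{
	bool lvb_lock_seen = false;
	bool pr_or_higher_seen = false;

	for (size_t i = 0; i < n; i++) {
		/* LKBs without the flag are ignored: the missed chance. */
		if (!(lkbs[i].exflags & MODEL_LKF_VALBLK))
			continue;
		lvb_lock_seen = true;
		if (lkbs[i].mode >= MODEL_MODE_PR)
			pr_or_higher_seen = true;
	}
	return lvb_lock_seen && !pr_or_higher_seen;
}
```

In this model, a surviving NULL-mode LKB whose flags came from
model_downconvert_flags(false, true, false, false) is ignored by the
recovery pass, so the stale LVB stays trusted. With with_fix set to true
the same LKB carries the flag and the LVB is marked not valid.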
The following diagram briefly illustrates how this crash happens:

RSB1 is an inode metadata lock resource with LOCK_TYPE_USES_LVB.

The 1st round:

             Node1                                    Node2
RSB1: PR
                                                  RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
  ocfs2_dlm_lock(no DLM_LKF_VALBLK)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

dlm_lock(no DLM_LKF_VALBLK)
  convert_lock(overwrite lkb->lkb_exflags
               with no DLM_LKF_VALBLK)

RSB1: NULL                                        RSB1: EX
                                                  reset Node2
dlm_recover_rsbs()
  recover_lvb()

/* The LVB is not trustable if the node with EX fails and
 * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
 */

 if (!(lkb->lkb_exflags & DLM_LKF_VALBLK))
         return;    /* This means we miss the chance to
                     * invalidate the LVB here.
                     */

The 2nd round:

             Node1                                    Node2
RSB1(becomes master from recovery)

ocfs2_setattr()
  ocfs2_inode_lock(NULL->EX)
    /* dlm_lock() returns the stale LVB without setting DLM_SBF_VALNOTVALID */
    ocfs2_meta_lvb_is_trustable() returns 1 /* so we don't refresh the inode from disk */
  ocfs2_truncate_file()
    mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */

The fix is quite straightforward: we keep setting the DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs an LVB and the fsdlm
plugin is used.

Signed-off-by: Eric Ren
Reviewed-by: Gang He
Reviewed-by: Joseph Qi
---
 fs/ocfs2/dlmglue.c   | 10 ++++++++++
 fs/ocfs2/stackglue.c |  6 ++++++
 fs/ocfs2/stackglue.h |  3 +++
 3 files changed, 19 insertions(+)

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index 83d576f..77d1632 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -3303,6 +3303,16 @@ static int ocfs2_downconvert_lock(struct ocfs2_super *osb,
 	mlog(ML_BASTS, "lockres %s, level %d => %d\n", lockres->l_name,
 	     lockres->l_level, new_level);
 
+	/*
+	 * On DLM_LKF_VALBLK, fsdlm behaves differently with o2cb. It always
+	 * expects DLM_LKF_VALBLK being set if the LKB has LVB, so that
+	 * we can recover correctly from node failure. Otherwise, we may get
+	 * invalid LVB in LKB, but without DLM_SBF_VALNOTVALID being set.
+	 */
+	if (!ocfs2_is_o2cb_active() &&
+	    lockres->l_ops->flags & LOCK_TYPE_USES_LVB)
+		lvb = 1;
+
 	if (lvb)
 		dlm_flags |= DLM_LKF_VALBLK;
 
diff --git a/fs/ocfs2/stackglue.c b/fs/ocfs2/stackglue.c
index 52c07346b..8203590 100644
--- a/fs/ocfs2/stackglue.c
+++ b/fs/ocfs2/stackglue.c
@@ -48,6 +48,12 @@ static char ocfs2_hb_ctl_path[OCFS2_MAX_HB_CTL_PATH] = "/sbin/ocfs2_hb_ctl";
  */
 static struct ocfs2_stack_plugin *active_stack;
 
+inline int ocfs2_is_o2cb_active(void)
+{
+	return !strcmp(active_stack->sp_name, OCFS2_STACK_PLUGIN_O2CB);
+}
+EXPORT_SYMBOL_GPL(ocfs2_is_o2cb_active);
+
 static struct ocfs2_stack_plugin *ocfs2_stack_lookup(const char *name)
 {
 	struct ocfs2_stack_plugin *p;
diff --git a/fs/ocfs2/stackglue.h b/fs/ocfs2/stackglue.h
index f2dce10..e3036e1 100644
--- a/fs/ocfs2/stackglue.h
+++ b/fs/ocfs2/stackglue.h
@@ -298,6 +298,9 @@ void ocfs2_stack_glue_set_max_proto_version(struct ocfs2_protocol_version *max_p
 int ocfs2_stack_glue_register(struct ocfs2_stack_plugin *plugin);
 void ocfs2_stack_glue_unregister(struct ocfs2_stack_plugin *plugin);
 
+/* In ocfs2_downconvert_lock(), we need to know which stack we are using */
+int ocfs2_is_o2cb_active(void);
+
 extern struct kset *ocfs2_kset;
 
 #endif /* STACKGLUE_H */