From: Goldwyn Rodrigues
To: ocfs2-devel@oss.oracle.com
Date: Sun, 5 Jan 2014 20:11:19 -0600
Message-ID: <20140106021115.GA5667@shrek.lan>
X-Patchwork-Id: 3435271
Subject: [Ocfs2-devel] [RFC] Why unlink performance is low?

After a delete, the system thread calls evict_inode, which runs the
following sequence:

ocfs2_evict_inode() -> ocfs2_delete_inode() -> ocfs2_query_inode_wipe()
-> ocfs2_try_open_lock()

On d1, ocfs2_try_open_lock() fails with -EAGAIN. The open lock fails
because on the remote node a PR->EX convert takes longer than a simple
EX grant. This starts a checkpoint because the OCFS2_INODE_DELETED flag
is not set. The checkpoint then interferes with the journaling of the
inodes deleted in the following unlinks.

I had earlier concluded that this happens for directories only;
however, I was wrong. This happens for files as well.

The patch attached is *not* correct. I am sending it to show where the
problem lies. I worked this out for a "hypothetical" situation in which
the files created by other nodes are not open on any other node at the
time of deletion. I agree that the open lock should not block during
inode eviction.

The root problem is that the open lock fails with -EAGAIN even if the
file is not open on any other node of the cluster. We get -EAGAIN
because the lock is on the remote end and the whole locking sequence
does not complete with LKF_NOQUEUE set.
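
To make the suspected failure mode concrete, here is a minimal toy model
in C. It is only a sketch of the hypothesis described above, not the
real DLM or dlmglue code: every name in it (toy_level, toy_lockres,
toy_cluster_lock) is hypothetical, and the "round trip" is reduced to a
single flag. It shows a no-queue request giving up with -EAGAIN just
because the grant cannot happen on the spot, while the same request made
without no-queue succeeds because nothing actually conflicts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock levels, weakest to strongest. Hypothetical names, not the DLM's. */
enum toy_level { TOY_NL, TOY_PR, TOY_EX };

/* Toy lock resource as seen from the node doing the delete. */
struct toy_lockres {
	enum toy_level held;	/* level this node currently holds */
	bool remote_conflict;	/* another node holds an incompatible lock */
	bool needs_round_trip;	/* grant requires a message to a remote node */
};

/*
 * Toy stand-in for a cluster lock request. With noqueue set, the request
 * gives up with -EAGAIN whenever it cannot be granted on the spot, even
 * if nothing actually conflicts.
 */
static int toy_cluster_lock(struct toy_lockres *res, enum toy_level want,
			    bool noqueue)
{
	assert(res != NULL);

	if (res->held >= want)
		return 0;	/* already strong enough */

	if (noqueue && (res->remote_conflict || res->needs_round_trip))
		return -EAGAIN;	/* cannot grant immediately */

	if (res->remote_conflict)
		return -EBUSY;	/* toy only: a real blocking request would wait */

	res->held = want;	/* pretend the round trip completed */
	res->needs_round_trip = false;
	return 0;
}
```

With a resource held at TOY_PR, no remote conflict, and
needs_round_trip set, a no-queue request for TOY_EX returns -EAGAIN and
leaves the level unchanged, whereas the same request without no-queue
returns 0 and raises the level to TOY_EX.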

Here are some numbers:

Without patch (native) times (elapsed, m:ss.ss):

-------------------------------------------------
| # files | create    | copy      | remove      |
-------------------------------------------------
|       1 |   0:00.03 |   0:00.25 |     0:00.94 |
|       2 |   0:00.12 |   0:00.20 |     0:01.12 |
|       4 |   0:00.16 |   0:00.31 |     0:03.50 |
|       8 |   0:00.11 |   0:00.38 |     0:08.15 |
|      16 |   0:00.11 |   0:00.60 |     0:14.64 |
|      32 |   0:00.15 |   0:00.89 |     0:28.04 |
|      64 |   0:00.24 |   0:03.49 |     0:59.96 |
|     128 |   0:00.42 |   0:08.73 |     1:52.14 |
|     256 |   0:01.05 |   0:18.03 |     3:54.81 |
|    1024 |   0:02.74 |   0:44.13 |    14:46.36 |

With patch times (elapsed, m:ss.ss):

-------------------------------------------------
| # files | create    | copy      | remove      |
-------------------------------------------------
|       1 |   0:00.02 |   0:00.83 |     0:00.33 |
|       2 |   0:00.04 |   0:00.18 |     0:00.27 |
|       4 |   0:00.07 |   0:00.26 |     0:00.27 |
|       8 |   0:00.08 |   0:00.29 |     0:00.44 |
|      16 |   0:00.10 |   0:00.39 |     0:00.69 |
|      32 |   0:00.14 |   0:00.60 |     0:01.26 |
|      64 |   0:00.23 |   0:01.19 |     0:02.33 |
|     128 |   0:00.51 |   0:02.15 |     0:04.60 |
|     256 |   0:00.87 |   0:04.59 |     0:09.74 |
|    1024 |   0:02.78 |   0:17.64 |     0:37.93 |

The numbers show that the improvement is not just with unlinks but with
other operations as well, because the journal is no longer overworked.

I am looking for suggestions on how we can overcome this design issue so
that try open locks succeed if the file is not open on any node. I think
the semantics of DLM_LKF_NOQUEUE may be interpreted incorrectly, or we
are probably not waiting for the lksb status to be updated, but I am not
sure, and some insight into this would be helpful.

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index f2d48c8..eb3baac 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1681,9 +1681,9 @@ void ocfs2_rw_unlock(struct inode *inode, int write)
 /*
  * ocfs2_open_lock always get PR mode lock.
  */
-int ocfs2_open_lock(struct inode *inode)
+int ocfs2_open_lock(struct inode *inode, int ex)
 {
-	int status = 0;
+	int status = 0, level;
 	struct ocfs2_lock_res *lockres;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 
@@ -1696,9 +1696,10 @@ int ocfs2_open_lock(struct inode *inode)
 		goto out;
 
 	lockres = &OCFS2_I(inode)->ip_open_lockres;
 
+	level = ex ? DLM_LOCK_EX : DLM_LOCK_PR;
 	status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
-				    DLM_LOCK_PR, 0, 0);
+				    level, 0, 0);
 	if (status < 0)
 		mlog_errno(status);
 
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index 1d596d8..12766a1 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -110,7 +110,7 @@ int ocfs2_create_new_inode_locks(struct inode *inode);
 int ocfs2_drop_inode_locks(struct inode *inode);
 int ocfs2_rw_lock(struct inode *inode, int write);
 void ocfs2_rw_unlock(struct inode *inode, int write);
-int ocfs2_open_lock(struct inode *inode);
+int ocfs2_open_lock(struct inode *inode, int ex);
 int ocfs2_try_open_lock(struct inode *inode, int write);
 void ocfs2_open_unlock(struct inode *inode);
 int ocfs2_inode_lock_atime(struct inode *inode,
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index f87f9bd..792dba7 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -454,7 +454,7 @@ static int ocfs2_read_locked_inode(struct inode *inode,
 						  0, inode);
 
 	if (can_lock) {
-		status = ocfs2_open_lock(inode);
+		status = ocfs2_open_lock(inode, 0);
 		if (status) {
 			make_bad_inode(inode);
 			mlog_errno(status);
@@ -922,7 +922,7 @@ static int ocfs2_query_inode_wipe(struct inode *inode,
 	 * Though we call this with the meta data lock held, the
 	 * trylock keeps us from ABBA deadlock.
 	 */
-	status = ocfs2_try_open_lock(inode, 1);
+	status = ocfs2_open_lock(inode, 1);
 	if (status == -EAGAIN) {
 		status = 0;
 		reason = 3;
@@ -997,6 +997,7 @@ static void ocfs2_delete_inode(struct inode *inode)
 		ocfs2_cleanup_delete_inode(inode, 0);
 		goto bail_unblock;
 	}
+
 	/* Lock down the inode. This gives us an up to date view of
 	 * it's metadata (for verification), and allows us to
 	 * serialize delete_inode on multiple nodes.
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index be3f867..ac67f2d 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -2307,7 +2307,7 @@ int ocfs2_create_inode_in_orphan(struct inode *dir,
 	}
 
 	/* get open lock so that only nodes can't remove it from orphan dir. */
-	status = ocfs2_open_lock(inode);
+	status = ocfs2_open_lock(inode, 0);
 	if (status < 0)
 		mlog_errno(status);
 
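
For anyone who wants to turn the remove timings from the tables earlier
in this message into speedup factors, the following small C helper is
one way to do it. It is a sketch only: the function names
(parse_elapsed, speedup) are hypothetical, and it assumes the timings
are in the m:ss.ss form shown in the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Parse an elapsed time in "m:ss.ss" form into seconds.
 * Returns a negative value if the string does not match that form.
 */
static double parse_elapsed(const char *s)
{
	int minutes;
	double seconds;

	assert(s != NULL);

	if (sscanf(s, "%d:%lf", &minutes, &seconds) != 2)
		return -1.0;
	if (minutes < 0 || seconds < 0.0)
		return -1.0;

	return minutes * 60.0 + seconds;
}

/*
 * Ratio of the "before" time to the "after" time.
 * Returns a negative value if either string is malformed or the
 * "after" time is zero.
 */
static double speedup(const char *before, const char *after)
{
	double b = parse_elapsed(before);
	double a = parse_elapsed(after);

	if (b < 0.0 || a <= 0.0)
		return -1.0;

	return b / a;
}
```

For the 1024-file row, speedup("14:46.36", "0:37.93") works out to
roughly 23, that is, 886.36 seconds against 37.93 seconds.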