From patchwork Fri Dec 18 09:09:39 2015
X-Patchwork-Submitter: Zhen Ren
X-Patchwork-Id: 7881231
Date: Fri, 18 Dec 2015 17:09:39 +0800
From: Eric Ren
To: Gang He
Message-ID: <20151218090939.GB10744@desktop.lab.bej.apac.novell.com>
References: <56736AAA020000F9001A9EEA@relay2.provo.novell.com>
In-Reply-To: <56736AAA020000F9001A9EEA@relay2.provo.novell.com>
Cc: mfasheh@suse.de,
 Goldwyn Rodrigues, ocfs2-devel@oss.oracle.com
Subject: Re: [Ocfs2-devel] The root cause analysis about buffer read getting starvation

Hi all,

On Thu, Dec 17, 2015 at 08:08:42AM -0700, He Gang wrote:
> Hello Mark and all,
> In the past days, I and Eric were looking at a customer issue, the customer is complaining that buffer reading sometimes lasts too much time ( 1 - 10 seconds) in case reading/writing the same file from different nodes concurrently, some day ago I sent a mail to the list for some discussions, you can read some details via the link https://oss.oracle.com/pipermail/ocfs2-devel/2015-December/011389.html.
> But, this problem does not happen under SLES10 (sp1 - sp4), the customer upgraded his Linux OS to SLES11(sp3 or sp4), the problem happened, this is why the customer complains, he hope we can give a investigation, to see how to make OCFS2 buffer reading/writing behavior be consistent with SLES10.
> According to our code reviewing and some testings, we found that the root cause to let buffer read get starvation.
> The suspicious code in aops.c
> static int ocfs2_readpage(struct file *file, struct page *page)
> {
> 	struct inode *inode = page->mapping->host;
> 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
> 	loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
> 	int ret, unlock = 1;
> 	long delta;
> 	struct timespec t_enter, t_mid1, t_mid2, t_exit;
>
> 	trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> 			     (page ? page->index : 0));
>
> 	ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); <<= here, using nonblock way to get lock will bring many times retry, spend too much time
> 	if (ret != 0) {
> 		if (ret == AOP_TRUNCATED_PAGE)
> 			unlock = 0;
> 		mlog_errno(ret);
> 		goto out;
> 	}
>
> 	if (down_read_trylock(&oi->ip_alloc_sem) == 0) { <<= here, the same problem with above
> 		/*
> 		 * Unlock the page and cycle ip_alloc_sem so that we don't
> 		 * busyloop waiting for ip_alloc_sem to unlock
> 		 */
> 		ret = AOP_TRUNCATED_PAGE;
> 		unlock_page(page);
> 		unlock = 0;
> 		down_read(&oi->ip_alloc_sem);
> 		up_read(&oi->ip_alloc_sem);
> 		goto out_inode_unlock;
> 	}
>
>
> As you can see, using nonblock way to get lock will bring many time retry, spend too much time.
> We can't modify the code to using block way to get the lock, as this will bring a dead lock.
> Actually, we did some testing when trying to use block way to get the lock here, the deadlock problems were encountered.
> But, in SLES10 source code, there is not any using nonblock way to get lock in buffer reading/writing, this is why buffer reading/writing are very fair to get IO when reading/writing the same file from multiple nodes.

SLES10, with a kernel version of about 2.6.16.x, used the blocking way, i.e.
down_read(), which has a potential deadlock between the page lock and
ip_alloc_sem when one node gets the cluster lock and then both writes and
reads the same file.
This deadlock was fixed by this commit:
---
commit e9dfc0b2bc42761410e8db6c252c6c5889e178b8
Author: Mark Fasheh
Date:   Mon May 14 11:38:51 2007 -0700

    ocfs2: trylock in ocfs2_readpage()

    Similarly to the page lock / cluster lock inversion in ocfs2_readpage, we
    can deadlock on ip_alloc_sem. We can down_read_trylock() instead and just
    return AOP_TRUNCATED_PAGE if the operation fails.

    Signed-off-by: Mark Fasheh
---

But somehow, with this patch, performance in this scenario becomes very bad.
I don't know how this could happen: the reading node has only one thread
reading the shared file, so down_read_trylock() should always get
ip_alloc_sem successfully, right? If not, who else may race for ip_alloc_sem?

Thanks,
Eric

> Why the dead locks happen on SLES11? you can see the source code, there are some code change, especially inode alloc_sem lock.
> On SLES11, to get inode alloc_sem lock is moved into ocfs2_readpage and ocfs2_write_begin, why we need to do that? this will let us to bring a dead lock factor, to avoid the dead locks, we will have to use nonblocking way to get some locks in ocfs2_readpage, the result will let buffer reading be unfair to get IO. and that, to avoid CPU busy loop, add some code to get the lock with block way in case can't get a lock in nonblock way, waste too much time also.
> Finally, I want to discuss with your guys, how to fix this issue? could we move the inode alloc_sem lock back to ocfs2_file_aio_read/ocfs2_file_aio_write?
> we can get the inode alloc_sem lock before calling into ocfs2_readpage/ocfs2_write_begin, just like SLES10.
> Since I have not enough background behind these code changes, hope you can give some comments.
>
> Thanks a lot.
> Gang
>
>
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 8e7cafb..3030670 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -222,7 +222,10 @@ static int ocfs2_readpage(struct file *file, struct page *page)
 		goto out;
 	}
 
-	down_read(&OCFS2_I(inode)->ip_alloc_sem);
+	if (down_read_trylock(&OCFS2_I(inode)->ip_alloc_sem) == 0) {
+		ret = AOP_TRUNCATED_PAGE;
+		goto out_meta_unlock;
+	}
 
 	/*
 	 * i_size might have just been updated as we grabed the meta lock.  We
@@ -258,6 +261,7 @@ static int ocfs2_readpage(struct file *file, struct page *page)
 	ocfs2_data_unlock(inode, 0);
 out_alloc:
 	up_read(&OCFS2_I(inode)->ip_alloc_sem);
+out_meta_unlock:
 	ocfs2_meta_unlock(inode, 0);
 out:
 	if (unlock)