From patchwork Sun Sep 8 03:07:39 2013
X-Patchwork-Submitter: Milosz Tanski
X-Patchwork-Id: 2856831
References: <18764.1378483142@warthog.procyon.org.uk>
Date: Sat, 7 Sep 2013 23:07:39 -0400
Subject: Re: [PATCH 0/8] ceph: fscache support & upstream changes
From: Milosz Tanski
To: David Howells
Cc: sprabhu@redhat.com, ceph-devel, "Yan, Zheng", Hongyi Jia,
    "linux-cachefs@redhat.com", "linux-fsdevel@vger.kernel.org",
    linux-kernel@vger.kernel.org, Sage Weil

David,

I ran into another issue that caused one of my machines to hang on a bunch
of tasks and then hard lock. Here's the backtrace of the hang:

INFO: task kworker/1:2:4214 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/1:2     D ffff880443513fc0     0  4214      2 0x00000000
Workqueue: ceph-msgr con_work [libceph]
 ffff88042b093868 0000000000000246 ffff88042b8e5bc0 ffffffff81569fc6
 ffff88042c51dbc0 ffff88042b093fd8 ffff88042b093fd8 ffff88042b093fd8
 ffff88042c518000 ffff88042c51dbc0 ffff8804266b8d10 ffff8804439d7188
Call Trace:
 [] ? _raw_spin_unlock_irqrestore+0x16/0x20
 [] ? fscache_wait_bit_interruptible+0x30/0x30 [fscache]
 [] schedule+0x29/0x70
 [] fscache_wait_atomic_t+0xe/0x20 [fscache]
 [] out_of_line_wait_on_atomic_t+0x9f/0xe0
 [] ? autoremove_wake_function+0x40/0x40
 [] __fscache_relinquish_cookie+0x15c/0x310 [fscache]
 [] ceph_fscache_unregister_inode_cookie+0x3e/0x50 [ceph]
 [] ceph_destroy_inode+0x33/0x200 [ceph]
 [] ? __fsnotify_inode_delete+0xe/0x10
 [] destroy_inode+0x3c/0x70
 [] evict+0x119/0x1b0
 [] iput+0x103/0x190
 [] iterate_session_caps+0x7d/0x240 [ceph]
 [] ? remove_session_caps_cb+0x270/0x270 [ceph]
 [] dispatch+0x725/0x1b40 [ceph]
 [] ? kernel_recvmsg+0x46/0x60
 [] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
 [] try_read+0xc1e/0x1e70 [libceph]
 [] con_work+0x105/0x1920 [libceph]
 [] ? xen_end_context_switch+0x1e/0x30
 [] ? finish_task_switch+0x5a/0xc0
 [] process_one_work+0x179/0x490
 [] worker_thread+0x11b/0x370
 [] ? manage_workers.isra.21+0x2e0/0x2e0
 [] kthread+0xc0/0xd0
 [] ? perf_trace_xen_mmu_set_pud+0xd0/0xd0
 [] ? flush_kthread_worker+0xb0/0xb0
 [] ret_from_fork+0x7c/0xb0
 [] ? flush_kthread_worker+0xb0/0xb0

It looks like it's waiting for the cookie's n_active count to drop to
0 ... but it never does. After spending a bunch of hours reading the code,
then having some beers (it is Saturday night, after all), then looking at
the code again... I think the __fscache_check_consistency() function
increments the n_active counter but never lowers it.

I think the solution is the diff below, but I'm not 100% sure. Can you let
me know if I'm on the right track... or if it's beer goggles?

Thanks,
- Milosz

On Fri, Sep 6, 2013 at 4:03 PM, Sage Weil wrote:
> On Fri, 6 Sep 2013, Milosz Tanski wrote:
>> Sage,
>>
>> I've taken David's latest changes and, per his request, merged his
>> 'fscache-fixes-for-ceph' tag, then applied my changes on top of that.
>> In addition to the previous changes I also added a fix for the
>> warnings the linux-next build bot found.
>>
>> I've given the result a quick test to make sure it builds, boots and
>> runs okay. The code is located in my repository:
>>
>> https://adfin@bitbucket.org/adfin/linux-fs.git in the wip-fscache-v2 branch
>>
>> I hope that this is the final go for now, and thanks for everyone's patience.
>
> Looks good; I'll send this to Linus along with the other ceph patches
> shortly.
>
> Thanks, everyone!
> sage
>
>>
>> - Milosz
>>
>> On Fri, Sep 6, 2013 at 11:59 AM, David Howells wrote:
>> > Milosz Tanski wrote:
>> >
>> >> After running this for a day on some loaded machines I ran into what
>> >> looks like an old issue with the new code. I remember you saw an issue
>> >> that manifested itself in a similar way a while back.
>> >>
>> >> [13837253.462779] FS-Cache: Assertion failed
>> >> [13837253.462782] 3 == 5 is false
>> >> [13837253.462807] ------------[ cut here ]------------
>> >> [13837253.462811] kernel BUG at fs/fscache/operation.c:414!
>> >
>> > Bah.
>> >
>> > I forgot to call fscache_op_complete(). Patch updated and repushed.
>> >
>> > Btw, I've reordered the patches to put the CIFS patch last. Can you merge
>> > the patches prior to the CIFS commit from my branch rather than
>> > cherry-picking them, so that if they go via two different routes, git will
>> > handle the merge correctly? I've stuck a tag on it (fscache-fixes-for-ceph)
>> > to make that easier for you.
>> >
>> > I've also asked another RH engineer to try doing some basic testing on the
>> > CIFS stuff - which may validate the fscache_readpages_cancel patch.
>> >
>> > David

diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index 318e843..b2a86e3 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -586,7 +586,8 @@ int __fscache_check_consistency(struct fscache_cookie *cookie)
 	fscache_operation_init(op, NULL, NULL);
 	op->flags = FSCACHE_OP_MYTHREAD |
-		(1 << FSCACHE_OP_WAITING);
+		(1 << FSCACHE_OP_WAITING) |
+		(1 << FSCACHE_OP_UNUSE_COOKIE);

 	spin_lock(&cookie->lock);