From patchwork Thu Oct 18 20:23:11 2018
X-Patchwork-Submitter: Josef Bacik
X-Patchwork-Id: 10648083
From: Josef Bacik
To: kernel-team@fb.com, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
 tj@kernel.org, david@fromorbit.com, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org, riel@fb.com,
 linux-mm@kvack.org
Subject: [PATCH 0/7][V3] drop the mmap_sem when doing IO in the fault path
Date: Thu, 18 Oct 2018 16:23:11 -0400
Message-Id: <20181018202318.9131-1-josef@toxicpanda.com>

I'm getting some production testing running on these patches shortly to
verify they are ready for primetime, but in the meantime they've had a
bunch of xfstests runs on xfs, btrfs, and ext4 using kvm-xfstests.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those
  parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build
  and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY
  down to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

This time I've verified that the ->page_mkwrite retry path is actually
getting used (apparently I only verified the read side last time).
xfstests is still running but it passed the couple of mmap tests I ran
directly. Again this is an RFC, I'm still doing a bunch of testing, but
I'd appreciate comments on the overall strategy.

-- Original message --

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are gone
now, but this one is sort of weird since it's not necessarily a priority
inversion that happens within the kernel, but rather one caused by
something userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running 'ps'
or other such utilities. These utilities will often walk /proc//whatever,
and these files can sometimes need to down_read(&task->mmap_sem). Not
usually a big deal, but we noticed during stress testing that sometimes
our protected application has latency spikes trying to get the mmap_sem
for tasks that are in lower priority cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex: even if it is currently held for reading, new readers are not
let in, to keep them from starving the waiting writer. This is fine,
except that a lower priority task could be stuck doing IO because it has
been throttled to the point that its IO is taking much longer than normal.
Because a higher priority group depends on this completing, it is now
stuck behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to avoid holding the mmap_sem at all if we are
going to do IO. This already exists in the read case, sort of, but needed
to be extended to cover more than just grabbing the page lock. With
io.latency we throttle at submit_bio() time, so the readahead stuff can
block and even page_cache_read can block, so all these paths need to have
the mmap_sem dropped.

The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.
We use the same retry method as the read path, and simply cache the page
and verify the page is still set up properly on the next pass through
->page_mkwrite().

I've tested these patches with xfstests and there are no regressions. Let
me know what you think. Thanks,

Josef