From patchwork Tue Dec 11 17:37:58 2018
X-Patchwork-Submitter: Josef Bacik
X-Patchwork-Id: 10724287
From: Josef Bacik
To: kernel-team@fb.com, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
 tj@kernel.org, david@fromorbit.com, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@redhat.com,
 jack@suse.cz
Subject: [PATCH 0/3][V5] drop the mmap_sem when doing IO in the fault path
Date: Tue, 11 Dec 2018 12:37:58 -0500
Message-Id: <20181211173801.29535-1-josef@toxicpanda.com>

Here's the latest version, slimmed down a bit from my last submission with
more details in the changelogs as requested.

v4->v5:
- dropped the cached_page infrastructure and the addition of the
  handle_mm_fault_cacheable helper as it had no discernible bearing on
  performance in my performance testing.
- reworked the page lock dropping logic to be its own helper, with
  comments describing how to use it.
- added more details to the changelog for the fpin patch.
- added a patch to clean up the arguments for the readahead functions for
  mmap as per Jan's suggestion.

v3->v4:
- dropped the ->page_mkwrite portion of these patches, we don't actually
  see issues with mkwrite in production, and I kept running into corner
  cases where I missed something important.  I want to wait on that part
  until I have a real reason to do the work so I can have a solid test in
  place.
- completely reworked how we drop the mmap_sem in filemap_fault and
  cleaned it up a bit.  Once I started actually testing this with our
  horrifying reproducer I saw a bunch of places where we still ended up
  doing IO under the mmap_sem because I had missed a few corner cases.
  Fixed this by reworking filemap_fault to only return RETRY once it has
  a completely uptodate page ready to be used.
- lots more testing, including production testing.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those
  parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build
  and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY
  down to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

-- Original message --

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions.  Most are gone
now, but this one is sort of weird since it's not necessarily a priority
inversion that happens within the kernel, but rather because of something
userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such.  This involves running
'ps' or other such utilities.  These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
we were stress testing that sometimes our protected application has
latency spikes trying to get the mmap_sem for tasks that are in lower
priority cgroups.

This is because any down_write() on a semaphore essentially turns it into
a mutex, so even if we currently have it held for reading, any new readers
will not be allowed on, to keep from starving the writer.  This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal.
But because a higher priority group depends on this completing, it is now
stuck behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to avoid holding the mmap_sem at all if we are
going to do IO.  This already exists in the read case, sort of, but needed
to be extended for more than just grabbing the page lock.  With io.latency
we throttle at submit_bio() time, so the readahead stuff can block and
even page_cache_read can block, so all these paths need to have the
mmap_sem dropped.

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and
simply cache the page and verify the page is still set up properly the
next pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.  Let
me know what you think.  Thanks,

Josef

Reviewed-by: Jan Kara
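
For illustration only, the inversion and the retry-based fix described in
the original message can be sketched as a small userspace model.  This is
not kernel code and not taken from the patches: struct toy_rwsem, the
toy_* functions and the TOY_FAULT_* values are all hypothetical stand-ins
(for mmap_sem, the fault handler, and VM_FAULT_RETRY respectively), and
the "IO" is just a flag flip.  It assumes a lock that refuses new readers
once a writer is queued, which is the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a writer-fair reader/writer lock; NOT the kernel rwsem. */
struct toy_rwsem {
	int readers;          /* readers currently holding the lock */
	bool writer_active;   /* a writer currently holds the lock */
	bool writer_waiting;  /* a writer is queued behind the readers */
};

/* New readers are refused while a writer holds the lock or is queued. */
bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer_active || sem->writer_waiting)
		return false;
	sem->readers++;
	return true;
}

void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer that cannot get the lock queues up, which blocks new readers. */
bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->readers > 0 || sem->writer_active) {
		sem->writer_waiting = true;
		return false;
	}
	sem->writer_waiting = false;
	sem->writer_active = true;
	return true;
}

void toy_up_write(struct toy_rwsem *sem)
{
	sem->writer_active = false;
}

enum toy_fault_result { TOY_FAULT_DONE, TOY_FAULT_RETRY };

/*
 * Hypothetical fault handler, entered with sem held for read.
 * If the page needs IO and the caller allows a retry, the read lock is
 * dropped before the (possibly throttled) IO and TOY_FAULT_RETRY is
 * returned with the lock released; the caller must retake it and retry.
 * Otherwise the IO happens with the lock still held.
 */
enum toy_fault_result toy_fault(struct toy_rwsem *sem, bool *page_uptodate,
				bool allow_retry)
{
	if (*page_uptodate)
		return TOY_FAULT_DONE;
	if (allow_retry) {
		toy_up_read(sem);
		*page_uptodate = true;	/* stands in for slow IO */
		return TOY_FAULT_RETRY;
	}
	*page_uptodate = true;		/* slow IO under the lock */
	return TOY_FAULT_DONE;
}
```

In this model, while the faulting task holds the lock for read, a queued
writer makes every later toy_down_read_trylock() fail, which is the 'ps'
stall.  Once toy_fault() returns TOY_FAULT_RETRY the lock is free, the
writer gets in, and the faulting task simply retakes the lock and finds
the page uptodate on its second pass.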