From patchwork Fri Nov 30 19:58:08 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Josef Bacik
X-Patchwork-Id: 10707077
From: Josef Bacik
To: kernel-team@fb.com, hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
 tj@kernel.org, david@fromorbit.com, akpm@linux-foundation.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, riel@redhat.com,
 jack@suse.cz
Subject: [PATCH 0/4][V4] drop the mmap_sem when doing IO in the fault path
Date: Fri, 30 Nov 2018 14:58:08 -0500
Message-Id: <20181130195812.19536-1-josef@toxicpanda.com>
X-Mailer: git-send-email 2.14.3

v3->v4:
- dropped the ->page_mkwrite portion of these patches. We don't actually
  see issues with mkwrite in production, and I kept running into corner
  cases where I missed something important. I want to wait on that part
  until I have a real reason to do the work so I can have a solid test in
  place.
- completely reworked how we drop the mmap_sem in filemap_fault and
  cleaned it up a bit. Once I started actually testing this with our
  horrifying reproducer I saw a bunch of places where we still ended up
  doing IO under the mmap_sem because I had missed a few corner cases.
  Fixed this by reworking filemap_fault to only return RETRY once it has
  a completely uptodate page ready to be used.
- lots more testing, including production testing.

v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those
  parts.

v1->v2:
- reworked so it only affects x86, since it's the only arch I can build
  and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY
  down to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
  VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.

-- Original message --

Now that we have proper isolation in place with cgroups2 we have started
going through and fixing the various priority inversions. Most are gone
now, but this one is sort of weird since it's not necessarily a priority
inversion that happens within the kernel, but rather because of something
userspace does.

We have giant applications that we want to protect, and parts of these
giant applications do things like watch the system state to determine how
healthy the box is for load balancing and such. This involves running
'ps' or other such utilities. These utilities will often walk
/proc/<pid>/whatever, and these files can sometimes need to
down_read(&task->mmap_sem). Not usually a big deal, but we noticed when
we are stress testing that sometimes our protected application has
latency spikes trying to get the mmap_sem for tasks that are in lower
priority cgroups.

This is because any pending down_write() on an rw_semaphore essentially
turns it into a mutex: even if it is currently held for reading, no new
readers are allowed in, to keep from starving the writer. This is fine,
except a lower priority task could be stuck doing IO because it has been
throttled to the point that its IO is taking much longer than normal. But
because a higher priority group depends on this completing, it is now
stuck behind lower priority work.

In order to avoid this particular priority inversion we want to use the
existing retry mechanism to avoid holding the mmap_sem at all if we are
going to do IO. This already sort of exists in the read case, but needed
to be extended for more than just grabbing the page lock. With io.latency
we throttle at submit_bio() time, so the readahead stuff can block and
even page_cache_read can block, so all these paths need to have the
mmap_sem dropped.

The other big thing is ->page_mkwrite.
btrfs is particularly shitty here because we have to reserve space for
the dirty page, which can be a very expensive operation. We use the same
retry method as the read path, and simply cache the page and verify the
page is still set up properly on the next pass through ->page_mkwrite().

I've tested these patches with xfstests and there are no regressions. Let
me know what you think. Thanks,

Josef
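
The retry mechanism described above can be illustrated with a minimal
userspace sketch. This is not the code from the patches: every name here
(fake_mm, fake_fault, handle_fault and so on) is hypothetical, the "lock"
is just a reader counter standing in for the mmap_sem, and the "IO" only
flips a flag. It shows the shape of the idea only: the fault handler
drops the lock before doing IO, returns RETRY once the page is uptodate,
and the caller retakes the lock and goes around again with retry
disabled.

```c
/* Userspace illustration only; none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>

enum fault_result { FAULT_DONE, FAULT_RETRY };

struct fake_mm {
    int  readers;        /* stand-in for mmap_sem read holders */
    bool page_uptodate;  /* stand-in for the page cache state */
    int  io_under_lock;  /* counts IO issued while the lock was held */
};

static void lock_read(struct fake_mm *mm)
{
    mm->readers++;
}

static void unlock_read(struct fake_mm *mm)
{
    assert(mm->readers > 0);
    mm->readers--;
}

/* Pretend to read the page in; record whether the lock was held. */
static void do_io(struct fake_mm *mm)
{
    if (mm->readers > 0)
        mm->io_under_lock++;
    mm->page_uptodate = true;
}

/*
 * Called with the lock held for read. If IO is needed and the caller
 * allows a retry, drop the lock, do the IO, and return FAULT_RETRY with
 * the lock released. RETRY is returned only once the page is uptodate.
 */
static enum fault_result fake_fault(struct fake_mm *mm, bool allow_retry)
{
    if (mm->page_uptodate)
        return FAULT_DONE;
    if (allow_retry) {
        unlock_read(mm);
        do_io(mm);
        return FAULT_RETRY;
    }
    do_io(mm);  /* retry not allowed: the IO happens under the lock */
    return FAULT_DONE;
}

/* Caller loop: after FAULT_RETRY, retake the lock and try once more. */
static int handle_fault(struct fake_mm *mm)
{
    bool allow_retry = true;
    int attempts = 0;

    for (;;) {
        lock_read(mm);
        attempts++;
        if (fake_fault(mm, allow_retry) == FAULT_DONE) {
            unlock_read(mm);
            return attempts;
        }
        /* FAULT_RETRY: fake_fault() already dropped the lock. */
        allow_retry = false;
    }
}
```

Calling handle_fault() on a fake_mm whose page is not uptodate takes two
passes, and io_under_lock stays at 0 because the only IO was issued after
the lock had been dropped.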