[RFC,2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

From: Michal Hocko <mhocko@suse.com>

[CC Marcelo who might remember other details for the loads which made
 him to add this code - see the patch changelog for more context]

On Mon 25-07-16 10:32:47, Michal Hocko wrote:
> On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> > So I wonder what throttle_vm_writeout() really achieves these days.  Is
> > it just a bandaid that no-one is brave enough to remove?
> 
> Maybe yes. It is sitting there quietly and you do not know about it
> until it bites. Like in this particular case.

So I was playing with this today and tried to provoke throttle_vm_writeout
and couldn't hit that path with my pretty much default IO stack. I
probably need a more complex IO setup like dm-crypt or something that
basically have to double buffer every page in the writeout for some
time.

Anyway I believe that the throttle_vm_writeout is just a relict from the
past which just survived after many other changes in the reclaim path. I
fully realize my testing is quite poor and I would really appreciate if
Mikulas could try to retest with his more complex IO setups but let me
post a patch with the changelog so that we can at least reason about the
justification. In principle the reclaim path should have sufficient
throttling already and if that is not the case then we should
consolidate the remaining rather than have yet another one.

Thoughts?
---
>From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 25 Jul 2016 14:18:54 +0200
Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout

throttle_vm_writeout has been introduced back in 2005 to fix OOMs caused
by excessive pageout activity during the reclaim. Too many pages could
be put under writeback therefore LRUs would be full of unreclaimable pages
until the IO completes and in turn the OOM killer could be invoked.

There have been some important changes introduced since then in the
reclaim path though. Writers are throttled by balance_dirty_pages
when initiating the buffered IO and later during the memory pressure,
the direct reclaim is throttled by wait_iff_congested if the node is
considered congested by dirty pages on LRUs and the underlying bdi
is congested by the queued IO. The kswapd is throttled as well if it
encounters pages marked for immediate reclaim or under writeback which
signals that that there are too many pages under writeback already.
Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. In a heavy parallel load this could queue a lot
of IO which would be very scattered and thus unefficient which would
just make the problem worse.

This three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure so yet another
throttling point doesn't really seem helpful. Quite contrary, Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on dm_crypt worker which has to allocate memory to perform
the encryption. In order to guarantee a forward progress it relies
on the mempool allocator. mempool_alloc(), however, prefers to use
the underlying (usually page) allocator before it grabs objects from
the pool. Such an allocation can dive into the memory reclaim and
consequently to throttle_vm_writeout. If there are too many dirty or
pages under writeback it will get throttled even though it is in fact a
flusher to clear pending pages.

[  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
[  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
[  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
[  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
[  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
[  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
[  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
[  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
[  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
[  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
[  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
[  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
[  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
[  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
[  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Let's just drop throttle_vm_writeout altogether. It is not very much
helpful anymore.

I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which has introduced that [1]. Small
virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
rather slow NFS in a sync mode on the host) with 8 parallel writers each
writing 1G worth of data. As soon as the pagecache fills up and the
direct reclaim hits then I start anon memory consumer in a loop
(allocating 300M and exiting after populating it) in the background
to make the memory pressure even stronger as well as to disrupt the
steady state for the IO. The direct reclaim is throttled because of the
congestion as well as kswapd hitting congestion_wait due to nr_immediate
but throttle_vm_writeout doesn't ever trigger the sleep throughout
the test. Dirty+writeback are close to nr_dirty_threshold with some
fluctuations caused by the anon consumer.

[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/writeback.h |  1 -
 mm/page-writeback.c       | 30 ------------------------------
 mm/vmscan.c               |  2 --
 3 files changed, 33 deletions(-)

Message ID	20160725192344.GD2166@dhcp22.suse.cz (mailing list archive)
State	Not Applicable, archived
Delegated to:	Mike Snitzer
Headers	show Return-Path: <dm-devel-bounces@redhat.com> Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 064916077C for <patchwork-dm-devel@patchwork.kernel.org>; Mon, 25 Jul 2016 19:27:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E48FE24B5E for <patchwork-dm-devel@patchwork.kernel.org>; Mon, 25 Jul 2016 19:27:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D551225819; Mon, 25 Jul 2016 19:27:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.2 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_MED autolearn=unavailable version=3.3.1 Received: from mx6-phx2.redhat.com (mx6-phx2.redhat.com [209.132.183.39]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 299E624B5E for <patchwork-dm-devel@patchwork.kernel.org>; Mon, 25 Jul 2016 19:27:35 +0000 (UTC) Received: from lists01.pubmisc.prod.ext.phx2.redhat.com (lists01.pubmisc.prod.ext.phx2.redhat.com [10.5.19.33]) by mx6-phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u6PJNp95039572; Mon, 25 Jul 2016 15:23:52 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by lists01.pubmisc.prod.ext.phx2.redhat.com (8.13.8/8.13.8) with ESMTP id u6PJNoZX012329 for <dm-devel@listman.util.phx.redhat.com>; Mon, 25 Jul 2016 15:23:50 -0400 Received: from mx1.redhat.com (ext-mx05.extmail.prod.ext.phx2.redhat.com [10.5.110.29]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u6PJNonu028659 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Mon, 25 Jul 2016 15:23:50 -0400 Received: from mail-wm0-f67.google.com (mail-wm0-f67.google.com [74.125.82.67]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 1B6773DFCC; Mon, 25 Jul 2016 19:23:48 +0000 (UTC) Received: by mail-wm0-f67.google.com with SMTP id x83so18059958wma.3; Mon, 25 Jul 2016 12:23:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=FrY437UHob7NeCoUx3xTHFqcsAR4eUCkK9ZXK4aWcco=; b=DIvFjrPvfR84kYA1JKL3m//LIEkGsIzbS6owntnBSh8yGDpMl3xb5YTb55o9QMO06h vNBN7//9gZgD9YnVXY3DkwVCirCxCmxQ+7zGIs6devDraquX1NIzwDvF7yLJ2UngO+36 1eQudWNCZ9FuufJAB9Zv+p5cfOpxe6VGey2y8KiTpD04XWVWmVcnC7jfqCVvgXzw/0MU g49pFOezbCJiU51AJLOzU+sGgnnxCiY0IHM7+pjwpLe2d5AoVGY6niqHZDOvfUuGoZHt rGahhyAMw9KkkqN9ZiFWshaMfQ47n6jImrq301VX2HuWjFnxOLLjlTrJpN3FzapXolLm HcrA== X-Gm-Message-State: AEkoouvoUj+En/VgZ1pt3thEnTVyL0Jh5yaD94cB8WpHAOH0Bh4aBX65b2o4oJhI5dACMg== X-Received: by 10.194.136.196 with SMTP id qc4mr18760126wjb.136.1469474626674; Mon, 25 Jul 2016 12:23:46 -0700 (PDT) Received: from localhost (ip-86-49-65-8.net.upcbroadband.cz. [86.49.65.8]) by smtp.gmail.com with ESMTPSA id d80sm28894899wmd.14.2016.07.25.12.23.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 25 Jul 2016 12:23:46 -0700 (PDT) Date: Mon, 25 Jul 2016 21:23:45 +0200 From: Michal Hocko <mhocko@kernel.org> To: NeilBrown <neilb@suse.com> Message-ID: <20160725192344.GD2166@dhcp22.suse.cz> References: <1468831164-26621-1-git-send-email-mhocko@kernel.org> <1468831285-27242-1-git-send-email-mhocko@kernel.org> <1468831285-27242-2-git-send-email-mhocko@kernel.org> <87oa5q5abi.fsf@notabene.neil.brown.name> <20160722091558.GF794@dhcp22.suse.cz> <878twt5i1j.fsf@notabene.neil.brown.name> <20160725083247.GD9401@dhcp22.suse.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: <20160725083247.GD9401@dhcp22.suse.cz> User-Agent: Mutt/1.6.0 (2016-04-01) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.29]); Mon, 25 Jul 2016 19:23:48 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.29]); Mon, 25 Jul 2016 19:23:48 +0000 (UTC) for IP:'74.125.82.67' DOMAIN:'mail-wm0-f67.google.com' HELO:'mail-wm0-f67.google.com' FROM:'mstsxfx@gmail.com' RCPT:'' X-RedHat-Spam-Score: -0.018 (BAYES_50, DCC_REPUT_13_19, FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_DNSWL_LOW, RCVD_IN_MSPIKE_H3, RCVD_IN_MSPIKE_WL, SPF_PASS) 74.125.82.67 mail-wm0-f67.google.com 74.125.82.67 mail-wm0-f67.google.com <mstsxfx@gmail.com> X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Scanned-By: MIMEDefang 2.78 on 10.5.110.29 X-loop: dm-devel@redhat.com Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, Marcelo Tosatti <mtosatti@redhat.com>, LKML <linux-kernel@vger.kernel.org>, linux-mm@kvack.org, dm-devel@redhat.com, Mikulas Patocka <mpatocka@redhat.com>, Mel Gorman <mgorman@suse.de>, David Rientjes <rientjes@google.com>, Ondrej Kozina <okozina@redhat.com>, Andrew Morton <akpm@linux-foundation.org> Subject: Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks X-BeenThere: dm-devel@redhat.com X-Mailman-Version: 2.1.12 Precedence: junk List-Id: device-mapper development <dm-devel.redhat.com> List-Unsubscribe: <https://www.redhat.com/mailman/options/dm-devel>, <mailto:dm-devel-request@redhat.com?subject=unsubscribe> List-Archive: <https://www.redhat.com/archives/dm-devel> List-Post: <mailto:dm-devel@redhat.com> List-Help: <mailto:dm-devel-request@redhat.com?subject=help> List-Subscribe: <https://www.redhat.com/mailman/listinfo/dm-devel>, <mailto:dm-devel-request@redhat.com?subject=subscribe> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: dm-devel-bounces@redhat.com Errors-To: dm-devel-bounces@redhat.com X-Virus-Scanned: ClamAV using ClamSMTP

[RFC,2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

Commit Message

Comments

Patch