From patchwork Wed Jul 13 11:10:06 2016
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 9227673
X-Patchwork-Delegate: snitzer@redhat.com
Date: Wed, 13 Jul 2016 13:10:06 +0200
From: Michal Hocko
To: Mikulas Patocka
Cc: Ondrej Kozina, Stanislav Kozina, Jerome Marchand, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [dm-devel] System freezes after OOM
Message-ID: <20160713111006.GF28723@dhcp22.suse.cz>
References: <57837CEE.1010609@redhat.com> <9be09452-de7f-d8be-fd5d-4a80d1cd1ba3@redhat.com> <20160712064905.GA14586@dhcp22.suse.cz>
List-Id: device-mapper development
User-Agent: Mutt/1.6.0 (2016-04-01)

On Tue 12-07-16 19:44:11, Mikulas Patocka
wrote:
> The problem of swapping to dm-crypt is this.
>
> The free memory goes low, kswapd decides that some page should be swapped
> out. However, when you swap to an encrypted device, writeback of each page
> requires another page to hold the encrypted data. dm-crypt uses mempools
> for all its structures and pages, so that it can make forward progress
> even if there is no memory free. However, the mempool code first allocates
> from the general memory allocator and resorts to the mempool only if the
> memory is below the limit.

OK, thanks for the clarification. I guess the core part happens in
crypt_alloc_buffer, right?

> So every attempt to swap out some page allocates another page.
>
> As long as swapping is in progress, the free memory is below the limit
> (because the swapping activity itself consumes any memory over the limit).
> And that triggered the OOM killer prematurely.

I am not sure I understand the last part. Are you saying that we trigger
OOM because the initiated swapout will not be able to finish the IO and
thus release the page in time? The OOM detection waits for an ongoing
writeout if there is no reclaim progress and at least half of the
reclaimable memory is either dirty or under writeback. Pages under
swapout are marked as under writeback AFAIR. The writeout path (the
dm-crypt worker in this case) should be able to allocate memory from the
mempool, hand it over to the crypto layer and finish the IO. Is it
possible this might take a lot of time?

> On Tue, 12 Jul 2016, Michal Hocko wrote:
>
> > On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
> > [...]
> > > The general problem is that the memory allocator does 16 retries to
> > > allocate a page and then triggers the OOM killer (and it doesn't take into
> > > account how much swap space is free or how many dirty pages were really
> > > swapped out while it waited).
> >
> > Well, that is not how it works exactly.
We retry as long as there is
> > reclaim progress (at least one page freed) and back off only if the
> > reclaimable memory, scaled down over 16 retries, can no longer exceed
> > the watermarks. The overall size of free swap is not really that
> > important if we cannot swap out, like here, due to complete depletion
> > of the memory reserves:
> > https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
> > [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
> > [   90.491283] lowmem_reserve[]: 0 977 977 977
> > [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
> >
> > Look at the amount of free memory. It is completely depleted. So it
> > smells like a process which has access to the memory reserves has
> > consumed all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC user
> > from softirq context which went off the leash.

> It is caused by commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never
> exhausted reserved memory.
> With this commit, mempool allocations drop
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC,
> they can bypass all limits).

Hmm, but the patch allows access to the memory reserves only when the
pool is empty. And even then the caller would have to request access to
the reserves explicitly, either by __GFP_MEMALLOC or PF_MEMALLOC. That
doesn't seem to be the case for dm-crypt, though. Or do you suspect that
some other mempool user might be doing so?

> But swapping should proceed even if there is no memory free. There is a
> comment "TODO: this could cause a theoretical memory reclaim deadlock in
> the swap out path." in the function add_to_swap - but apart from that,
> swap should proceed even with no available memory, as long as all the
> drivers in the block layer use mempools.
>
> > > So, it could prematurely trigger OOM killer on any slow swapping device
> > > (including dm-crypt). Michal Hocko reworked the OOM killer in the patch
> > > 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the flaw that
> > > it triggers OOM even if there is plenty of free swap space.
> > >
> > > Michal, would you accept a change to the OOM killer, to prevent it from
> > > triggering when there is free swap space?
> >
> > No, this doesn't sound like a proper solution. The current decision
> > logic, as explained above, relies on the feedback from the reclaim.
> > Free swap space doesn't really mean we can make forward progress.
>
> I'm interested - why would you need to trigger the OOM killer if there is
> free swap space?

Let me clarify. If there is swappable memory then we shouldn't trigger
the OOM killer normally, of course. And that should be the case with the
current implementation. We just rely on the swapout making some progress
and back off only if that is not the case after several attempts, with
throttling based on the writeback counters. Checking the available swap
space doesn't guarantee forward progress, though.
If
the swapout is stuck for some reason then it should be safer to trigger
the OOM killer rather than wait or thrash forever (or for an excessive
amount of time). Now, I can see that the retry logic might need some
tuning for complex setups like dm-crypt swap partitions because progress
might be much slower there. But I would like to understand the
worst-case estimate for the swapout path, with all the roadblocks on the
way for this setup, before we can think of a proper retry logic tuning.

> The only possibility is that all the memory is filled with unswappable
> kernel pages - but that condition could be detected if there is an
> unusually low number of anonymous and cache pages. Besides that - in
> what situation is triggering the OOM killer with free swap desired?

I hope the above has explained that.

> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck
> and the following stacktrace was obtained when swapping to dm-crypt.
>
> We can see that dm-crypt does a mempool allocation. But the mempool
> allocation somehow falls into throttle_vm_writeout. There, it waits for
> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up
> processing requests at an unusually slow rate of 10 requests per second
> and it results in the machine being stuck (it would probably recover if
> we waited for an extreme amount of time).

Hmm, that throttling has been there basically forever. I do not see what
would have changed recently, but I haven't looked too closely, to be
honest. I agree that throttling a flusher (which this worker definitely
is) doesn't look like a correct thing to do. We have PF_LESS_THROTTLE
for this kind of thing.
So maybe the right thing to do is to use this flag for the dm_crypt
worker:

> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [] schedule+0x3c/0x90
> [  345.352536]  [] schedule_timeout+0x1d8/0x360
> [  345.352536]  [] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [] ? ktime_get+0xb3/0x150
> [  345.352536]  [] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [] congestion_wait+0x86/0x1f0
> [  345.352536]  [] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [] shrink_zone+0xe0/0x300
> [  345.352536]  [] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [] try_to_free_pages+0xef/0x300
> [  345.352536]  [] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [] new_slab+0x2d7/0x6a0
> [  345.352536]  [] ? sched_clock_local+0x17/0x80
> [  345.352536]  [] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] ? sched_clock_local+0x17/0x80
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] __slab_alloc+0x51/0x90
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] mempool_alloc+0x91/0x230
> [  345.352536]  [] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> [  345.352536]  [] process_one_work+0x242/0x700
> [  345.352536]  [] ?
process_one_work+0x1ba/0x700
> [  345.352536]  [] worker_thread+0x4e/0x490
> [  345.352536]  [] ? process_one_work+0x700/0x700
> [  345.352536]  [] kthread+0x101/0x120
> [  345.352536]  [] ? trace_hardirqs_on_caller+0xf5/0x1b0
> [  345.352536]  [] ret_from_fork+0x1f/0x40
> [  345.352536]  [] ? kthread_create_on_node+0x250/0x250

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 4f3cb3554944..0b806810efab 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void kcryptd_crypt(struct work_struct *work)
 {
 	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	unsigned int pflags = current->flags;
 
+	current->flags |= PF_LESS_THROTTLE;
 	if (bio_data_dir(io->base_bio) == READ)
 		kcryptd_crypt_read_convert(io);
 	else
 		kcryptd_crypt_write_convert(io);
+	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
 }
 
 static void kcryptd_queue_crypt(struct dm_crypt_io *io)