Instability in current -git tree

Message ID	CAGM2reb2Zk6t=QJtJZPRGwovKKR9bdm+fzgmA_7CDVfDTjSgKA@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of pasha.tatashin@oracle.com designates 141.146.126.78 as permitted sender) client-ip=141.146.126.78; MIME-Version: 1.0 References: <CA+55aFyARQV302+mXNYznrOOjzW+yxbcv+=OkD43dG6G1ktoMQ@mail.gmail.com> <alpine.DEB.2.21.1807140031440.2644@nanos.tec.linutronix.de> <CA+55aFzBx1haeM2QSFvhaW2t_HVK78Y=bKvsiJmOZztwkZ-y7Q@mail.gmail.com> <CA+55aFzVGa57apuzDMBLgWQQRcm3BNBs1UEg-G_2o7YW1i=o2Q@mail.gmail.com> <CA+55aFy9NJZeqT7h_rAgbKUZLjzfxvDPwneFQracBjVhY53aQQ@mail.gmail.com> <20180713164804.fc2c27ccbac4c02ca2c8b984@linux-foundation.org> <CA+55aFxAZr8PHo-raTihr8TKK_D-fVL+k6_tw_UyDLychowFNw@mail.gmail.com> <20180713165812.ec391548ffeead96725d044c@linux-foundation.org> <9b93d48c-b997-01f7-2fd6-6e35301ef263@oracle.com> <CA+55aFxFw2-1BD2UBf_QJ2=faQES_8q==yUjwj4mGJ6Ub4uX7w@mail.gmail.com> <5edf2d71-f548-98f9-16dd-b7fed29f4869@oracle.com> <CA+55aFwPAwczHS3XKkEnjY02PaDf2mWrcqx_hket4Ce3nScsSg@mail.gmail.com> <CAGM2rebeo3UUo2bL6kXCMGhuM36wjF5CfvqGG_3rpCfBs5S2wA@mail.gmail.com> <CA+55aFxetyCqX2EzFBDdHtriwt6UDYcm0chHGQUdPX20qNHb4Q@mail.gmail.com> In-Reply-To: <CA+55aFxetyCqX2EzFBDdHtriwt6UDYcm0chHGQUdPX20qNHb4Q@mail.gmail.com> From: Pavel Tatashin <pasha.tatashin@oracle.com> Date: Sat, 14 Jul 2018 09:39:29 -0400 Message-ID: <CAGM2reb2Zk6t=QJtJZPRGwovKKR9bdm+fzgmA_7CDVfDTjSgKA@mail.gmail.com> Subject: Re: Instability in current -git tree To: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org>, tglx@linutronix.de, willy@infradead.org, mingo@redhat.com, axboe@kernel.dk, gregkh@linuxfoundation.org, davem@davemloft.net, viro@zeniv.linux.org.uk, Dave Airlie <airlied@gmail.com>, Tejun Heo <tj@kernel.org>, Theodore Tso <tytso@google.com>, snitzer@redhat.com, Linux Memory Management List <linux-mm@kvack.org>, neelx@redhat.com, mgorman@techsingularity.net Content-Type: multipart/mixed; boundary="000000000000edd2760570f5bbf9" Sender: owner-linux-mm@kvack.org Precedence: bulk

Pavel Tatashin July 14, 2018, 1:39 p.m. UTC

Hi Linus,

I attached a temporary fix, which I could not test, as I was unable to
reproduce the problem, but it should fix the issue.

Reverting "f7f99100d8d9 mm: stop zeroing memory during allocation in
vmemmap" would introduce a significant boot performance regression, as
we would zero the whole memmap twice during boot.

Later, I will introduce a more detailed fix that will get rid of
zero_resv_unavail() entirely, and instead will zero skipped struct
pages in memmap_init_zone(), where it should be done.

Thank you,
Pavel

On Fri, Jul 13, 2018 at 11:25 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Jul 13, 2018 at 8:04 PM Pavel Tatashin
> <pasha.tatashin@oracle.com> wrote:
> >
> > > You can't just memset() the 'struct page' to zero after it's been set up.
> >
> > That should not be happening, unless there is a bug.
>
> Well, it does seem to happen. My memory stress-tester has been running
> for about half an hour now with the revert I posted - it used to
> trigger the problem in maybe ~5 minutes before.
>
> So I do think that revert fixes it for me. No guarantees, but since I
> figured out how to trigger it, it's been fairly reliable.
>
> > We want to zero those struct pages so we do not have uninitialized
> > data accessed by various parts of the code that rounds down large
> > pages and access the first page in section without verifying that the
> > page is valid. The example of this is described in commit that
> > introduced zero_resv_unavail()
>
> I'm attaching the relevant (?) parts of dmesg, which has the node
> ranges, maybe you can see what the problem with the code is.
>
> (NOTE! This dmesg is with that "mem=6G" command line option, which causes that
>
>   e820: remove [mem 0x180000000-0xfffffffffffffffe] usable
>
> line - that's just because it's my stress-test boot. It happens with
> or without it, but without the "mem=6G" it took days to trigger).
>
> I'm more than willing to test patches (either for added information or
> for testing fixes), although I think I'm getting off the computer for
> today.
>
>                 Linus

Linus Torvalds July 14, 2018, 5:11 p.m. UTC | #1

On Sat, Jul 14, 2018 at 6:40 AM Pavel Tatashin
<pasha.tatashin@oracle.com> wrote:
>
> I attached a temporary fix, which I could not test, as I was unable to
> reproduce the problem, but it should fix the issue.

Am building and will test. If this fixes it for me, I won't do the revert.

Thanks,

                Linus

Linus Torvalds July 14, 2018, 5:29 p.m. UTC | #2

On Sat, Jul 14, 2018 at 10:11 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Am building and will test. If this fixes it for me, I won't do the revert.

Looks good so far. It's past the 5-minute mark, at least. I'll leave
it running for a while, but at least preliminarily this looks like it
works.

I guess it should be marked for stable, because it appears that this
problem got back-ported to stable (I find that Laura reports it for
4.17.4, but not 4.17.3).

              Linus

Michal Hocko July 16, 2018, 12:06 p.m. UTC | #3

On Sat 14-07-18 09:39:29, Pavel Tatashin wrote:
[...]
> From 95259841ef79cc17c734a994affa3714479753e3 Mon Sep 17 00:00:00 2001
> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> Date: Sat, 14 Jul 2018 09:15:07 -0400
> Subject: [PATCH] mm: zero unavailable pages before memmap init
> 
> We must zero struct pages for memory that is not backed by physical memory,
> or kernel does not have access to.
> 
> Recently, there was a change which zeroed all memmap for all holes in e820.
> Unfortunately, it introduced a bug that is discussed here:
> 
> https://www.spinics.net/lists/linux-mm/msg156764.html
> 
> Linus, also saw this bug on his machine, and confirmed that pulling
> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
> fixes the issue.
> 
> The problem is that we incorrectly zero some struct pages after they were
> setup.

I am sorry but I simply do not see it. zero_resv_unavail should be
touching only reserved memory ranges and those are not initialized
anywhere. So who has reused them and put them to normal available
memory to be initialized by free_area_init_node[s]?

The patch itself should be safe because reserved and available memory
ranges should be disjoint so the ordering shouldn't matter. The fact
that it matters is the crux thing to understand and document. So the
change looks good to me but I do not understand _why_ it makes any
difference. There must be somebody to put (memblock) reserved memory
available to the page allocator behind our backs.

Pavel Tatashin July 16, 2018, 12:09 p.m. UTC | #4

On 07/16/2018 08:06 AM, Michal Hocko wrote:
> On Sat 14-07-18 09:39:29, Pavel Tatashin wrote:
> [...]
>> From 95259841ef79cc17c734a994affa3714479753e3 Mon Sep 17 00:00:00 2001
>> From: Pavel Tatashin <pasha.tatashin@oracle.com>
>> Date: Sat, 14 Jul 2018 09:15:07 -0400
>> Subject: [PATCH] mm: zero unavailable pages before memmap init
>>
>> We must zero struct pages for memory that is not backed by physical memory,
>> or kernel does not have access to.
>>
>> Recently, there was a change which zeroed all memmap for all holes in e820.
>> Unfortunately, it introduced a bug that is discussed here:
>>
>> https://www.spinics.net/lists/linux-mm/msg156764.html
>>
>> Linus, also saw this bug on his machine, and confirmed that pulling
>> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
>> fixes the issue.
>>
>> The problem is that we incorrectly zero some struct pages after they were
>> setup.
> 
> I am sorry but I simply do not see it. zero_resv_unavail should be
> touching only reserved memory ranges and those are not initialized
> anywhere. So who has reused them and put them to normal available
> memory to be initialized by free_area_init_node[s]?
> 
> The patch itself should be safe because reserved and available memory
> ranges should be disjoint so the ordering shouldn't matter. The fact
> that it matters is the crux thing to understand and document. So the
> change looks good to me but I do not understand _why_ it makes any
> difference. There must be somebody to put (memblock) reserved memory
> available to the page allocator behind our backs.

Thats exactly right, and I am also not sure why this is happening, there must be some overlapping happening that just should not. I will study it later.

Now, I need to figure out what is happening with x86-32 failure, that is caused by my fix.

Pavel

Michal Hocko July 16, 2018, 12:29 p.m. UTC | #5

On Mon 16-07-18 08:09:19, Pavel Tatashin wrote:
> 
> 
> On 07/16/2018 08:06 AM, Michal Hocko wrote:
> > On Sat 14-07-18 09:39:29, Pavel Tatashin wrote:
> > [...]
> >> From 95259841ef79cc17c734a994affa3714479753e3 Mon Sep 17 00:00:00 2001
> >> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> >> Date: Sat, 14 Jul 2018 09:15:07 -0400
> >> Subject: [PATCH] mm: zero unavailable pages before memmap init
> >>
> >> We must zero struct pages for memory that is not backed by physical memory,
> >> or kernel does not have access to.
> >>
> >> Recently, there was a change which zeroed all memmap for all holes in e820.
> >> Unfortunately, it introduced a bug that is discussed here:
> >>
> >> https://www.spinics.net/lists/linux-mm/msg156764.html
> >>
> >> Linus, also saw this bug on his machine, and confirmed that pulling
> >> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
> >> fixes the issue.
> >>
> >> The problem is that we incorrectly zero some struct pages after they were
> >> setup.
> > 
> > I am sorry but I simply do not see it. zero_resv_unavail should be
> > touching only reserved memory ranges and those are not initialized
> > anywhere. So who has reused them and put them to normal available
> > memory to be initialized by free_area_init_node[s]?
> > 
> > The patch itself should be safe because reserved and available memory
> > ranges should be disjoint so the ordering shouldn't matter. The fact
> > that it matters is the crux thing to understand and document. So the
> > change looks good to me but I do not understand _why_ it makes any
> > difference. There must be somebody to put (memblock) reserved memory
> > available to the page allocator behind our backs.
> 
> Thats exactly right, and I am also not sure why this is happening,
> there must be some overlapping happening that just should not. I will
> study it later.

Maybe a stupid question, but I do not see it from the code (this init
code is just to complex to keep it cached in head so I always have to
study the code again and again, sigh). So what exactly prevents
memmap_init_zone to stumble over reserved regions? We do play some ugly
games to find a first !reserved pfn in the node but I do not really see
anything in the init path to properly skip over reserved holes inside
the node.

Pavel Tatashin July 16, 2018, 1:26 p.m. UTC | #6

> Maybe a stupid question, but I do not see it from the code (this init
> code is just to complex to keep it cached in head so I always have to
> study the code again and again, sigh). So what exactly prevents
> memmap_init_zone to stumble over reserved regions? We do play some ugly
> games to find a first !reserved pfn in the node but I do not really see
> anything in the init path to properly skip over reserved holes inside
> the node.

Hi Michal,

This is not a stupid question. I figured out how this whole thing
became broken:  Revert "mm: page_alloc: skip over regions of invalid
pfns where possible" caused that.

Because, before that was reverted, memmap_init_zone() would use
memblock.memory to check that only pages that have physical backing
are initialized. But, now after that was reverted zer_resv_unavail()
scheme became totally broken.

The concept is quite easy: zero all the allocated memmap memory that
has not been initialized by memmap_init_zone(). So, I think I will
modify memmap_init_zone() to zero the skipped pfns that have memmap
backing. But, that requires more thinking.

Thank you,
Pavel

Oscar Salvador July 16, 2018, 1:39 p.m. UTC | #7

On Mon, Jul 16, 2018 at 02:29:18PM +0200, Michal Hocko wrote:
> On Mon 16-07-18 08:09:19, Pavel Tatashin wrote:
> > 
> > 
> > On 07/16/2018 08:06 AM, Michal Hocko wrote:
> > > On Sat 14-07-18 09:39:29, Pavel Tatashin wrote:
> > > [...]
> > >> From 95259841ef79cc17c734a994affa3714479753e3 Mon Sep 17 00:00:00 2001
> > >> From: Pavel Tatashin <pasha.tatashin@oracle.com>
> > >> Date: Sat, 14 Jul 2018 09:15:07 -0400
> > >> Subject: [PATCH] mm: zero unavailable pages before memmap init
> > >>
> > >> We must zero struct pages for memory that is not backed by physical memory,
> > >> or kernel does not have access to.
> > >>
> > >> Recently, there was a change which zeroed all memmap for all holes in e820.
> > >> Unfortunately, it introduced a bug that is discussed here:
> > >>
> > >> https://www.spinics.net/lists/linux-mm/msg156764.html
> > >>
> > >> Linus, also saw this bug on his machine, and confirmed that pulling
> > >> commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
> > >> fixes the issue.
> > >>
> > >> The problem is that we incorrectly zero some struct pages after they were
> > >> setup.
> > > 
> > > I am sorry but I simply do not see it. zero_resv_unavail should be
> > > touching only reserved memory ranges and those are not initialized
> > > anywhere. So who has reused them and put them to normal available
> > > memory to be initialized by free_area_init_node[s]?
> > > 
> > > The patch itself should be safe because reserved and available memory
> > > ranges should be disjoint so the ordering shouldn't matter. The fact
> > > that it matters is the crux thing to understand and document. So the
> > > change looks good to me but I do not understand _why_ it makes any
> > > difference. There must be somebody to put (memblock) reserved memory
> > > available to the page allocator behind our backs.
> > 
> > Thats exactly right, and I am also not sure why this is happening,
> > there must be some overlapping happening that just should not. I will
> > study it later.
> 
> Maybe a stupid question, but I do not see it from the code (this init
> code is just to complex to keep it cached in head so I always have to
> study the code again and again, sigh). So what exactly prevents
> memmap_init_zone to stumble over reserved regions? We do play some ugly
> games to find a first !reserved pfn in the node but I do not really see
> anything in the init path to properly skip over reserved holes inside
> the node.

I think we are not really skiping reserved regions in memmap_init_zone().
memmap_init_zone() gets just called being size the subtract of zone_end_pfn - zone_start_pfn, and I don't see that we're checking if those pfn's fall in reserved regions.

To get a better insight, I just put a couple of printk's:

kernel: zero_resv_unavail: start-end: 0x9f000-0x100000
kernel: zero_resv_unavail: pfn: 0x9f
kernel: zero_resv_unavail: pfn: 0xa0
kernel: zero_resv_unavail: pfn: 0xa1
kernel: zero_resv_unavail: pfn: 0xa2
kernel: zero_resv_unavail: pfn: 0xa3
kernel: zero_resv_unavail: pfn: 0xa4
kernel: zero_resv_unavail: pfn: 0xa5
kernel: zero_resv_unavail: pfn: 0xa6
kernel: zero_resv_unavail: pfn: 0xa7
kernel: zero_resv_unavail: pfn: 0xa8
kernel: zero_resv_unavail: pfn: 0xa9
...
...
kernel: memmap_init_zone: pfn: 9f
kernel: memmap_init_zone: pfn: a0
kernel: memmap_init_zone: pfn: a1
kernel: memmap_init_zone: pfn: a2
kernel: memmap_init_zone: pfn: a3
kernel: memmap_init_zone: pfn: a4
kernel: memmap_init_zone: pfn: a5
kernel: memmap_init_zone: pfn: a6
kernel: memmap_init_zone: pfn: a7
kernel: memmap_init_zone: pfn: a8
kernel: memmap_init_zone: pfn: a9
kernel: memmap_init_zone: pfn: aa
kernel: memmap_init_zone: pfn: ab
kernel: memmap_init_zone: pfn: ac
kernel: memmap_init_zone: pfn: ad
kernel: memmap_init_zone: pfn: ae
kernel: memmap_init_zone: pfn: af
kernel: memmap_init_zone: pfn: b0
kernel: memmap_init_zone: pfn: b1
kernel: memmap_init_zone: pfn: b2

The printk from memmap_init_zone has already passed the checks about early_pfn_ etc.

So, reverting Pavel's fix would twist this, and we'd end up zeroing pages that are already set up in memmap_init_zone() 
(as we already had).

Michal Hocko July 16, 2018, 2:12 p.m. UTC | #8

On Mon 16-07-18 09:26:41, Pavel Tatashin wrote:
> > Maybe a stupid question, but I do not see it from the code (this init
> > code is just to complex to keep it cached in head so I always have to
> > study the code again and again, sigh). So what exactly prevents
> > memmap_init_zone to stumble over reserved regions? We do play some ugly
> > games to find a first !reserved pfn in the node but I do not really see
> > anything in the init path to properly skip over reserved holes inside
> > the node.
> 
> Hi Michal,
> 
> This is not a stupid question. I figured out how this whole thing
> became broken:  Revert "mm: page_alloc: skip over regions of invalid
> pfns where possible" caused that.
> 
> Because, before that was reverted, memmap_init_zone() would use
> memblock.memory to check that only pages that have physical backing
> are initialized. But, now after that was reverted zer_resv_unavail()
> scheme became totally broken.
>
> The concept is quite easy: zero all the allocated memmap memory that
> has not been initialized by memmap_init_zone(). So, I think I will
> modify memmap_init_zone() to zero the skipped pfns that have memmap
> backing. But, that requires more thinking.

I would just go with iterating over valid (unreserved) memory ranges in
memmap_init_zone.

Instability in current -git tree

Commit Message

Comments

Patch