Major KVM issues with kernel 4.5 on the host

Message ID	570EADD2.8030300@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> Subject: Re: Major KVM issues with kernel 4.5 on the host To: Marc Haber <mh+linux-kernel@zugschlus.de> References: <20160317165435.GB3022@torres.zugschlus.de> <20160317181128.GA30324@pd.tnic> <56EBD20A.1020608@redhat.com> <20160413183701.GC7600@torres.zugschlus.de> Cc: Borislav Petkov <bp@alien8.de>, linux-kernel@vger.kernel.org, kvm ML <kvm@vger.kernel.org> From: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <570EADD2.8030300@redhat.com> Date: Wed, 13 Apr 2016 22:36:34 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.1 MIME-Version: 1.0 In-Reply-To: <20160413183701.GC7600@torres.zugschlus.de> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: kvm-owner@vger.kernel.org Precedence: bulk

Paolo Bonzini April 13, 2016, 8:36 p.m. UTC

On 13/04/2016 20:37, Marc Haber wrote:
> On Fri, Mar 18, 2016 at 11:01:46AM +0100, Paolo Bonzini wrote:
>> On 17/03/2016 19:11, Borislav Petkov wrote:
>>> I'm going to try reproducing the issue on a less "important" machine
>>> so that bisecting is less painful, but maybe you guys have an idea
>>> what's going wrong here.
>>
>> No idea, sorry. :(  Bisecting would be great.
> 
> Working on that now.
> 
>>   I'll also try reproducing and bisecting next week, in the meanwhile
>>   just having the host dmesg would help a lot.
> 
> Attached. I hope the message will get through to the list.

Didn't help, but a fresh look at the list of 4.5 patches helped.
What the hell was I thinking, I missed write_rdtscp_aux who
obviously uses MSR_TSC_AUX.



Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marc Haber April 13, 2016, 8:52 p.m. UTC | #1

On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> Didn't help, but a fresh look at the list of 4.5 patches helped.
> What the hell was I thinking, I missed write_rdtscp_aux who
> obviously uses MSR_TSC_AUX.

So you want me to apply that to 4.5 od 4.5.1 and try that?

Greetings
Marc

Marc Haber April 13, 2016, 10:29 p.m. UTC | #2

On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> Didn't help, but a fresh look at the list of 4.5 patches helped.
> What the hell was I thinking, I missed write_rdtscp_aux who
> obviously uses MSR_TSC_AUX.

I applied this patch to 4.5, which didn't go cleanly, I had to do it
manually, and there is no change in behavior. Sometimes, the Vm just
crashes, but most times the filesystem is remounted ro.

[   84.658968] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27903
[   84.664877] Aborting journal on device dm-0-8.
[   84.667992] EXT4-fs (dm-0): Remounting filesystem read-only
[   84.670972] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[   84.763331] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27898
[   84.825412] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27895
[   84.907959] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27893
[   84.915187] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27900
[   84.961062] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27889
[   84.983700] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27891
[   98.315538] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27897
[   98.323606] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27904
[   99.889927] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27892
[   99.893823] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27901
[   99.901140] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27890
[   99.904898] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27896
[   99.909758] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27899
[   99.914394] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27894
[  207.132045] serial8250: too much work for irq4
[  207.220043] serial8250: too much work for irq4
[  207.312028] serial8250: too much work for irq4


Greetings
Marc

Paolo Bonzini April 14, 2016, 1:16 a.m. UTC | #3

On 14/04/2016 00:29, Marc Haber wrote:
> On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
>> Didn't help, but a fresh look at the list of 4.5 patches helped.
>> What the hell was I thinking, I missed write_rdtscp_aux who
>> obviously uses MSR_TSC_AUX.
> 
> I applied this patch to 4.5, which didn't go cleanly, I had to do it
> manually, and there is no change in behavior. Sometimes, the Vm just
> crashes, but most times the filesystem is remounted ro.

Ok, then I guess bisection is needed.  Please first try commit
45bdbcfdf241.  If it fails, then the bug come together with KVM's merge
window changes for 4.5-rc1.  Please apply the patch I sent here when
bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
that probably that should be the commit you try second; the bisection
then becomes much easier).

Thanks,

Paolo

> [   84.658968] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27903
> [   84.664877] Aborting journal on device dm-0-8.
> [   84.667992] EXT4-fs (dm-0): Remounting filesystem read-only
> [   84.670972] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
> [   84.763331] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27898
> [   84.825412] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27895
> [   84.907959] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27893
> [   84.915187] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27900
> [   84.961062] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27889
> [   84.983700] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #7669: comm aide: deleted inode referenced: 27891
> [   98.315538] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27897
> [   98.323606] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #23567: comm aide: deleted inode referenced: 27904
> [   99.889927] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27892
> [   99.893823] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27901
> [   99.901140] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27890
> [   99.904898] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27896
> [   99.909758] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27899
> [   99.914394] EXT4-fs error (device dm-0): ext4_lookup:1602: inode #4650: comm aide: deleted inode referenced: 27894
> [  207.132045] serial8250: too much work for irq4
> [  207.220043] serial8250: too much work for irq4
> [  207.312028] serial8250: too much work for irq4
> 
> 
> Greetings
> Marc
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marc Haber April 14, 2016, 5:22 a.m. UTC | #4

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> On 14/04/2016 00:29, Marc Haber wrote:
> > On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> >> Didn't help, but a fresh look at the list of 4.5 patches helped.
> >> What the hell was I thinking, I missed write_rdtscp_aux who
> >> obviously uses MSR_TSC_AUX.
> > 
> > I applied this patch to 4.5, which didn't go cleanly, I had to do it
> > manually, and there is no change in behavior. Sometimes, the Vm just
> > crashes, but most times the filesystem is remounted ro.
> 
> Ok, then I guess bisection is needed.  Please first try commit
> 45bdbcfdf241.  If it fails, then the bug come together with KVM's merge
> window changes for 4.5-rc1.  Please apply the patch I sent here when
> bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
> that probably that should be the commit you try second; the bisection
> then becomes much easier).

I have never bisected this deeply. Can you please give more advice,
with which two commits to start? And how do I find out whether I am
"past" a commit? I am als not a git expert, a few command lines would
be appreciated.

Things have not become any easier this night; 4.5-rc7 ran for more
than three hours before it failed :-(

Greetings
Marc

Marc Haber April 14, 2016, 6:07 a.m. UTC | #5

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> Ok, then I guess bisection is needed.  Please first try commit
> 45bdbcfdf241.

That kernel labels itself as "4.4.0-rc5+", is that correct?

Greetings
Marc

Marc Haber April 14, 2016, 4:47 p.m. UTC | #6

On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> Ok, then I guess bisection is needed.  Please first try commit
> 45bdbcfdf241.

I did git checkout 45bdbcfdf241 and built the resulting kernel
4.4.0-rc5. This one has now been running for ten hours, which is
threefold the longest time that a faulty kernel has held before a VM
experienced corruption. So I guess, that one is fine.

Since 4.5.0-rc1 is bad, I guess I do:

git checkout 45bdbcfdf241
git bisect start
git bisect good
git bisect bad v4.5.0-rc1

right?

Greetings
Marc

Paolo Bonzini April 14, 2016, 5:30 p.m. UTC | #7

On 14/04/2016 18:47, Marc Haber wrote:
>> > Ok, then I guess bisection is needed.  Please first try commit
>> > 45bdbcfdf241.
> I did git checkout 45bdbcfdf241 and built the resulting kernel
> 4.4.0-rc5. This one has now been running for ten hours, which is
> threefold the longest time that a faulty kernel has held before a VM
> experienced corruption. So I guess, that one is fine.

Interesting, this means it's not a KVM bug.  You can ignore my patch
from yesterday (though we'll get it in anyway).

> Since 4.5.0-rc1 is bad, I guess I do:
> 
> git checkout 45bdbcfdf241
> git bisect start
> git bisect good
> git bisect bad v4.5.0-rc1

This is correct but you also want to do

git bisect good 4.4.0
git bisect good 4.4.0-rc5

so that bisection basically works through the commits in the merge window.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Marc Haber April 14, 2016, 5:47 p.m. UTC | #8

On Thu, Apr 14, 2016 at 07:30:43PM +0200, Paolo Bonzini wrote:
> On 14/04/2016 18:47, Marc Haber wrote:
> >> > Ok, then I guess bisection is needed.  Please first try commit
> >> > 45bdbcfdf241.
> > I did git checkout 45bdbcfdf241 and built the resulting kernel
> > 4.4.0-rc5. This one has now been running for ten hours, which is
> > threefold the longest time that a faulty kernel has held before a VM
> > experienced corruption. So I guess, that one is fine.
> 
> Interesting, this means it's not a KVM bug.  You can ignore my patch
> from yesterday (though we'll get it in anyway).
> 
> > Since 4.5.0-rc1 is bad, I guess I do:
> > 
> > git checkout 45bdbcfdf241
> > git bisect start
> > git bisect good
> > git bisect bad v4.5.0-rc1
> 
> This is correct but you also want to do
> 
> git bisect good 4.4.0
> git bisect good 4.4.0-rc5
> 
> so that bisection basically works through the commits in the merge window.

So I start over from this:

[47/544]mh@fan:~/linux/debug/linux$ git checkout 45bdbcfdf241
HEAD is now at 45bdbcf... kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
[48/545]mh@fan:~/linux/debug/linux$ git bisect start
[49/546]mh@fan:~/linux/debug/linux$ git bisect good
[50/547]mh@fan:~/linux/debug/linux$ git bisect bad v4.5-rc1
Bisecting: 5761 revisions left to test after this (roughly 13 steps)
[cbd88cd4c07f9361914ab7fd7e21c9227986fe68] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
[51/548]mh@fan:~/linux/debug/linux$ git bisect good v4.4
Bisecting: 5468 revisions left to test after this (roughly 12 steps)
[f9a03ae123c92c1f45cd2ca88d0f6edd787be78c] Merge tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
[52/549]mh@fan:~/linux/debug/linux$ git bisect good v4.4-rc5
Bisecting: 5468 revisions left to test after this (roughly 12 steps)
[f9a03ae123c92c1f45cd2ca88d0f6edd787be78c] Merge tag 'for-f2fs-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
[53/550]mh@fan:~/linux/debug/linux$

This is going to take a few days as detecting a "bad" version may take
a few hours.

Greetings
Marc

Marc Haber April 21, 2016, 8:39 a.m. UTC | #9

On Thu, Apr 14, 2016 at 07:22:20AM +0200, Marc Haber wrote:
> On Thu, Apr 14, 2016 at 03:16:29AM +0200, Paolo Bonzini wrote:
> > On 14/04/2016 00:29, Marc Haber wrote:
> > > On Wed, Apr 13, 2016 at 10:36:34PM +0200, Paolo Bonzini wrote:
> > >> Didn't help, but a fresh look at the list of 4.5 patches helped.
> > >> What the hell was I thinking, I missed write_rdtscp_aux who
> > >> obviously uses MSR_TSC_AUX.
> > > 
> > > I applied this patch to 4.5, which didn't go cleanly, I had to do it
> > > manually, and there is no change in behavior. Sometimes, the Vm just
> > > crashes, but most times the filesystem is remounted ro.
> > 
> > Ok, then I guess bisection is needed.  Please first try commit
> > 45bdbcfdf241.  If it fails, then the bug come together with KVM's merge
> > window changes for 4.5-rc1.  Please apply the patch I sent here when
> > bisection is past 46896c73c1a4dde527c3a3cc43379deeb41985a1 (which means
> > that probably that should be the commit you try second; the bisection
> > then becomes much easier).
> 
> I have never bisected this deeply. Can you please give more advice,
> with which two commits to start? And how do I find out whether I am
> "past" a commit? I am als not a git expert, a few command lines would
> be appreciated.

I have tried bisecting, and finally bisect says that the bad commit is
0e749e54244eec87b2a3cd0a4314e60bc6781115 dax: increase granularity of dax_clear_blocks() operations

However, a kernel built after
$ git checkout 0e749e54244eec87b2a3cd0a4314e60bc6781115
seems to be fine, at least my VM is running for 15 hours now.

I guess I need to start over again with git bisect good
0e749e54244eec87b2a3cd0a4314e60bc6781115 and git bisect bad v4.5.

Currently, I cannot explain how this has happened, I must have flagged
an actually good kernel as bad from my understanding of git bisect.

Can you give advice how to continue here?

Greetings
Marc

Borislav Petkov April 21, 2016, 12:37 p.m. UTC | #10

On Thu, Apr 21, 2016 at 10:39:48AM +0200, Marc Haber wrote:
> Currently, I cannot explain how this has happened, I must have flagged
> an actually good kernel as bad from my understanding of git bisect.
> 
> Can you give advice how to continue here?

Yap, sounds like you marked a bisection step incorrectly, which lead
into the wrong direction. How reliable is your reproducer?

Also, do the bisection as Paolo suggested:

* try 45bdbcfdf241.

* then do

$ git bisect start v4.5-rc1 v4.4

which marks -rc1 as bad and 4.4 as good.

While you're doing that bisect, do what Paolo said by applying the diff
here

  https://lkml.kernel.org/r/570EADD2.8030300@redhat.com

when the bisection point you're at at each step contains

  46896c73c1a4 ("KVM: svm: add support for RDTSCP")

You should apply the above hunk by doing

$ patch -p1 --dry-run -i /tmp/hunk

If it applies fine, you then apply it

$ patch -p1 -i /tmp/hunk

All clear?

If not, do not hesitate to ask.

Thanks

Marc Haber April 21, 2016, 2:50 p.m. UTC | #11

On Thu, Apr 21, 2016 at 02:37:11PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 10:39:48AM +0200, Marc Haber wrote:
> > Currently, I cannot explain how this has happened, I must have flagged
> > an actually good kernel as bad from my understanding of git bisect.
> > 
> > Can you give advice how to continue here?
> 
> Yap, sounds like you marked a bisection step incorrectly, which lead
> into the wrong direction. How reliable is your reproducer?

Usually, the crash or filesystem corruption happens in the first 15 to
30 minutes. I have had one instance running three hours before
corrupting, I have therefore upped the run time to nine hours before
saying "this kernel is good".

What bothers me is that since I ended up with a "suspect" commit that
actually results in a "good" kernel (running for 22 hours now), I must
have said "bad" to an actually "good" kernel, which means that I had
an unrelated crash or corruption. Is that reasoning correct?

> Also, do the bisection as Paolo suggested:
> 
> * try 45bdbcfdf241.

That one qualified as "good" six days ago. I'll retry, maybe I just
didn't wait long enough.

"Trying" means make oldconfig, make deb-pkg in my case right? Does it
matter what I answer to the numerous config questions that keep coming
up during the oldconfig step?

> * then do
> 
> $ git bisect start v4.5-rc1 v4.4
> 
> which marks -rc1 as bad and 4.4 as good.

Would it help to explicitly mark
0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
gained during the last week is not completely lost?

> While you're doing that bisect, do what Paolo said by applying the diff
> here
> 
>   https://lkml.kernel.org/r/570EADD2.8030300@redhat.com
> 
> when the bisection point you're at at each step contains
> 
>   46896c73c1a4 ("KVM: svm: add support for RDTSCP")
> 
> You should apply the above hunk by doing
> 
> $ patch -p1 --dry-run -i /tmp/hunk
> 
> If it applies fine, you then apply it
> 
> $ patch -p1 -i /tmp/hunk
> 
> All clear?

So I need to git log | grep 46896c73c1a4 and apply the patch again
each time the commit is found?

Greetings
Marc

Borislav Petkov April 21, 2016, 4:51 p.m. UTC | #12

On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> What bothers me is that since I ended up with a "suspect" commit that
> actually results in a "good" kernel (running for 22 hours now), I must
> have said "bad" to an actually "good" kernel, which means that I had
> an unrelated crash or corruption. Is that reasoning correct?

Hmm, did that "unrelated crash or corruption" have the same symptoms as
the original one?

> That one qualified as "good" six days ago. I'll retry, maybe I just
> didn't wait long enough.

So if the trigger time is varying so much, I'd try to double that to
make sure I'm fairly certain about each commit I'm testing.

Also, this is a single box we're talking about, right? And you're sure
it hasn't had any corruption issues so far?

I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
had any reports in the past of ECC errors in dmesg? Or other MCEs,
lockups, etc? Can you grep your logs for stuff like "hardware error",
"mce", "edac" etc? Do a case-insensitive search.

> "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> matter what I answer to the numerous config questions that keep coming
> up during the oldconfig step?

What I do is:

$ git bisect <good|bad>

to mark the current commit after having tested it. Then I do

$ yes "" | make oldconfig

to set the new config options. Then

$ make -j7
$ make modules_install install

and reboot into the new kernel. Kernel name will possibly change each
time so I write down on paper which kernel I'm testing. You can verify
when booting it by doing:

$ dmesg | head
[    0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
...

that date at the end of the line and number "#1" should be current.
Number is also in .version and gets issued when you finish building:

Kernel: arch/x86/boot/bzImage is ready  (#1)

> Would it help to explicitly mark
> 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> gained during the last week is not completely lost?

I'd do the whole thing again, just to be sure.

I know, bisection is very time-consuming :-\ And it is particularly
annoying if it is done on the box I'm normally using daily.

> So I need to git log | grep 46896c73c1a4 and apply the patch again
> each time the commit is found?

I think you can let git do that for ya:

$ git branch --contains 46896c73c1a4
* (HEAD detached at 46896c73c1a4)

that lists that the current checked out HEAD contains that commit. If you do

$ git checkout 46896c73c1a4~1

then that "(HEAD detached..." line is not in the list of branches
containing it.

HTH.

Marc Haber April 21, 2016, 8:04 p.m. UTC | #13

On Thu, Apr 21, 2016 at 06:51:06PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 04:50:05PM +0200, Marc Haber wrote:
> > What bothers me is that since I ended up with a "suspect" commit that
> > actually results in a "good" kernel (running for 22 hours now), I must
> > have said "bad" to an actually "good" kernel, which means that I had
> > an unrelated crash or corruption. Is that reasoning correct?
> 
> Hmm, did that "unrelated crash or corruption" have the same symptoms as
> the original one?

Yes, but there are two symptoms. The VM either suffers file system
issues (garbage read from files, or an aborted ext4 journal and
following ro remount) or it stops dead in its tracks.

> > That one qualified as "good" six days ago. I'll retry, maybe I just
> > didn't wait long enough.
> 
> So if the trigger time is varying so much, I'd try to double that to
> make sure I'm fairly certain about each commit I'm testing.

The longest trigger time I have seen was three hours, I tripled that
to nine hours, that probably was not enough.

> Also, this is a single box we're talking about, right? And you're sure
> it hasn't had any corruption issues so far?

It is a single box, and it runs perfectly with kernel 4.4.

> I see you have amd64_edac loading, so it must have ECC DIMMs. Have you
> had any reports in the past of ECC errors in dmesg? Or other MCEs,
> lockups, etc? Can you grep your logs for stuff like "hardware error",
> "mce", "edac" etc? Do a case-insensitive search.

The box reports about one correctable error per week, so I probably
have a faulty DIMM, but since the issue only surfaces in VMs while the
host system is in perfect working order...

And yes, I am pondering to simply replace the box with an Intel CPU.

I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
30 "Machine check events logged" since January. How do I see which
events were logged?

> > "Trying" means make oldconfig, make deb-pkg in my case right? Does it
> > matter what I answer to the numerous config questions that keep coming
> > up during the oldconfig step?
> 
> What I do is:
> 
> $ git bisect <good|bad>
> 
> to mark the current commit after having tested it. Then I do
> 
> $ yes "" | make oldconfig
> 
> to set the new config options.

So you basically select the default for new options.

>  Then
> 
> $ make -j7
> $ make modules_install install
> 
> and reboot into the new kernel. Kernel name will possibly change each
> time so I write down on paper which kernel I'm testing.

I go the way of Debian packages since it is easier to handle the
crypto file systems when the machine is booting up.

And yes, I think about doing a test reinstall on unencrypted disk to
find out whether encryption plays a role, but I currently need the
machine to urgently to take it out of serice for half a month, and,
again, the host system is in perfect working order, it is just VMs
that barf.

>  You can verify when booting it by doing:
> 
> $ dmesg | head
> [    0.000000] Linux version 4.6.0-rc2+ (boris@pd) (gcc version 5.3.1 20160101 (Debian 5.3.1-5) ) #1 SMP PREEMPT Wed Apr 6 20:22:51 CEST 2016
> ...
> 
> that date at the end of the line and number "#1" should be current.

I check the date of the package I am installing and the date stamp of
the kernels being installed to /boot. I'm reasonably sure I have that
under control.

> > Would it help to explicitly mark
> > 0e749e54244eec87b2a3cd0a4314e60bc6781115 as good so that the knowledge
> > gained during the last week is not completely lost?
> 
> I'd do the whole thing again, just to be sure.
> 
> I know, bisection is very time-consuming :-\ And it is particularly
> annoying if it is done on the box I'm normally using daily.

... and if testing a "good" kernel means a day.

> > So I need to git log | grep 46896c73c1a4 and apply the patch again
> > each time the commit is found?
> 
> I think you can let git do that for ya:
> 
> $ git branch --contains 46896c73c1a4
> * (HEAD detached at 46896c73c1a4)
> 
> that lists that the current checked out HEAD contains that commit. If you do
> 
> $ git checkout 46896c73c1a4~1
> 
> then that "(HEAD detached..." line is not in the list of branches
> containing it.

And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
right?

Greetings
Marc

Borislav Petkov April 23, 2016, 4:04 p.m. UTC | #14

On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> Yes, but there are two symptoms. The VM either suffers file system
> issues (garbage read from files, or an aborted ext4 journal and
> following ro remount) or it stops dead in its tracks.

Stops dead? What does that mean exactly? Box is wedged solid and it
doesn't react to any key presses?

Because if so, this could really be a DRAM going bad and a correctable
error turning into an uncorrectable. How old is the DRAM in that box?
Judging by your CPU, it should be a couple of years...

> The longest trigger time I have seen was three hours, I tripled that
> to nine hours, that probably was not enough.

So enlarge even more I guess.

> The box reports about one correctable error per week, so I probably
> have a faulty DIMM, but since the issue only surfaces in VMs while the
> host system is in perfect working order...

So it could be that correctable error turns into an uncorrectable one at
some point. But then you should be getting an exception...

> And yes, I am pondering to simply replace the box with an Intel CPU.

Your CPU is fine, from what I've seen so far.

> I see "mce: CPU supports 6 MCE banks" once for each reboot, and about
> 30 "Machine check events logged" since January. How do I see which
> events were logged?

Hmm, you have

[   18.149300] MCE: In-kernel MCE decoding enabled.

that's CONFIG_EDAC_DECODE_MCE, so you should have some "Hardware Error"
lines in dmesg, I'd guess, decoding the errors.

> So you basically select the default for new options.

Yap.

> I go the way of Debian packages since it is easier to handle the
> crypto file systems when the machine is booting up.

As long as you're testing the correct bisection kernels...

> And yes, I think about doing a test reinstall on unencrypted disk to
> find out whether encryption plays a role, but I currently need the
> machine to urgently to take it out of serice for half a month, and,
> again, the host system is in perfect working order, it is just VMs
> that barf.

Yeah, I can't reproduce it here and I have a very similar box to yours
which is otherwise idle, more or less.

Another fact which points to potentially DIMM going bad...

> I check the date of the package I am installing and the date stamp of
> the kernels being installed to /boot. I'm reasonably sure I have that
> under control.

Good.

> ... and if testing a "good" kernel means a day.

Yeah, it is annoying. In a perfect world, we all should have two
identical boxes so that we use one as a workstation and the second for
testing when the first one, the workstation barfs. I should bring that
up with my manager next time... :-)

> And whenever 46896c73c1a4 is present, I need to apply Paolo's patch,
> right?

Yap.

Thanks.

Marc Haber April 23, 2016, 6:43 p.m. UTC | #15

On Sat, Apr 23, 2016 at 06:04:29PM +0200, Borislav Petkov wrote:
> On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> > Yes, but there are two symptoms. The VM either suffers file system
> > issues (garbage read from files, or an aborted ext4 journal and
> > following ro remount) or it stops dead in its tracks.
> 
> Stops dead? What does that mean exactly? Box is wedged solid and it
> doesn't react to any key presses?

No ping, no reaction on serial console, no reaction on virtual
console, no syslog entries.

> Because if so, this could really be a DRAM going bad and a correctable
> error turning into an uncorrectable. How old is the DRAM in that box?
> Judging by your CPU, it should be a couple of years...

Uncorrectable errors would still be identified by the ECC hardware,
and the box wouldn't be perfectly fine with an "old" kernel.

> > The box reports about one correctable error per week, so I probably
> > have a faulty DIMM, but since the issue only surfaces in VMs while the
> > host system is in perfect working order...
> 
> So it could be that correctable error turns into an uncorrectable one at
> some point. But then you should be getting an exception...

Yes, that would be in the logs.

> > And yes, I am pondering to simply replace the box with an Intel CPU.
> 
> Your CPU is fine, from what I've seen so far.

But we still postulate that the issue does only show on older AMD
CPUs. Otherwise, I wouldn't be the only one making this experience.

> > I go the way of Debian packages since it is easier to handle the
> > crypto file systems when the machine is booting up.
> 
> As long as you're testing the correct bisection kernels...

I am reasonably sure about that, yes.

> > And yes, I think about doing a test reinstall on unencrypted disk to
> > find out whether encryption plays a role, but I currently need the
> > machine to urgently to take it out of serice for half a month, and,
> > again, the host system is in perfect working order, it is just VMs
> > that barf.
> 
> Yeah, I can't reproduce it here and I have a very similar box to yours
> which is otherwise idle, more or less.
> 
> Another fact which points to potentially DIMM going bad...

Do you want me to memtest for 24 hours?

Greetings
Marc

Dr. David Alan Gilbert April 23, 2016, 6:52 p.m. UTC | #16

* Marc Haber (mh+linux-kernel@zugschlus.de) wrote:
> On Sat, Apr 23, 2016 at 06:04:29PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 21, 2016 at 10:04:33PM +0200, Marc Haber wrote:
> > > Yes, but there are two symptoms. The VM either suffers file system
> > > issues (garbage read from files, or an aborted ext4 journal and
> > > following ro remount) or it stops dead in its tracks.
> > 
> > Stops dead? What does that mean exactly? Box is wedged solid and it
> > doesn't react to any key presses?
> 
> No ping, no reaction on serial console, no reaction on virtual
> console, no syslog entries.
> 
> > Because if so, this could really be a DRAM going bad and a correctable
> > error turning into an uncorrectable. How old is the DRAM in that box?
> > Judging by your CPU, it should be a couple of years...
> 
> Uncorrectable errors would still be identified by the ECC hardware,
> and the box wouldn't be perfectly fine with an "old" kernel.

Hmm, your problem does sound like bad hardware, but....
If you've got a nice reliable crash, can you try turning transparent huge pages
off on the host;
   echo never > /sys/kernel/mm/transparent_hugepage/enabled

Dave

> > > The box reports about one correctable error per week, so I probably
> > > have a faulty DIMM, but since the issue only surfaces in VMs while the
> > > host system is in perfect working order...
> > 
> > So it could be that correctable error turns into an uncorrectable one at
> > some point. But then you should be getting an exception...
> 
> Yes, that would be in the logs.
> 
> > > And yes, I am pondering to simply replace the box with an Intel CPU.
> > 
> > Your CPU is fine, from what I've seen so far.
> 
> But we still postulate that the issue does only show on older AMD
> CPUs. Otherwise, I wouldn't be the only one making this experience.
> 
> > > I go the way of Debian packages since it is easier to handle the
> > > crypto file systems when the machine is booting up.
> > 
> > As long as you're testing the correct bisection kernels...
> 
> I am reasonably sure about that, yes.
> 
> > > And yes, I think about doing a test reinstall on unencrypted disk to
> > > find out whether encryption plays a role, but I currently need the
> > > machine to urgently to take it out of serice for half a month, and,
> > > again, the host system is in perfect working order, it is just VMs
> > > that barf.
> > 
> > Yeah, I can't reproduce it here and I have a very similar box to yours
> > which is otherwise idle, more or less.
> > 
> > Another fact which points to potentially DIMM going bad...
> 
> Do you want me to memtest for 24 hours?
> 
> Greetings
> Marc
> 
> -- 
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

Borislav Petkov April 23, 2016, 11:57 p.m. UTC | #17

On Sat, Apr 23, 2016 at 08:43:41PM +0200, Marc Haber wrote:
> Uncorrectable errors would still be identified by the ECC hardware,

Not if the hardware decides to syncflood so that we don't even get to
run the #MC handler...

> and the box wouldn't be perfectly fine with an "old" kernel.

Maybe the "old" kernel is not causing all the required ingredients to
come together for the uncorrectable error to happen. But yeah, I agree,
the fact that 4.4 is fine kinda doesn't fit with the uncorrectable error
theory.

> Yes, that would be in the logs.

Presumably. And see above.

> But we still postulate that the issue does only show on older AMD
> CPUs. Otherwise, I wouldn't be the only one making this experience.

It actually shows only on this one system. At least I'm not aware of any
other report of the same issue. My system with a F10h, rev E is just
fine.

> Do you want me to memtest for 24 hours?

Yeah, that memtest crap never triggers any ECCs. But if you're bored,
why not...

Marc Haber May 12, 2016, 8:20 p.m. UTC | #18

Hi David,

On Sat, Apr 23, 2016 at 07:52:46PM +0100, Dr. David Alan Gilbert wrote:
> Hmm, your problem does sound like bad hardware, but....
> If you've got a nice reliable crash, can you try turning transparent huge pages
> off on the host;
>    echo never > /sys/kernel/mm/transparent_hugepage/enabled

I must have missed this hint in the middle of the "your hardware is
bad" avalance that came over me.

I spent two weeks bisecting "good" kernels since during the repeated
reconfigurations, transparent huge pages got turned off in kernel
configuration. After running each kernel for 24 hours, I eventually
ended up with a working 4.5 kernel. The configuration diff was short,
showing transparent huge pages, and - finally - upon re-reading the
thread I found your hint.

I have now the result that 4.5, 4.5.1 and 4.5.4 corrupt KVM guest
memory reliably in the first hour of running under disk load, causing
the VM to either drop dead in the water, or to read randomness from
disk. Rebooting fixes the VM. This happens as soon as transparent huge
pages are turned on in the host.

Turning off transparent huge pages by echo never >
/sys/kernel/mm/transparent_hugepage/enabled fixes the issue even
without rebooting the host. Start up the VM again and it works just
fine.

Is this an issue in (a) transparent huge pages, (b) KVM or (c) qemu?
Where should this issue be forwarded? Or do we just accept it and turn
transparent huge pages off?

Greetings
Marc

Kirill A . Shutemov May 12, 2016, 8:24 p.m. UTC | #19

On Thu, May 12, 2016 at 10:20:09PM +0200, Marc Haber wrote:
> Hi David,
> 
> On Sat, Apr 23, 2016 at 07:52:46PM +0100, Dr. David Alan Gilbert wrote:
> > Hmm, your problem does sound like bad hardware, but....
> > If you've got a nice reliable crash, can you try turning transparent huge pages
> > off on the host;
> >    echo never > /sys/kernel/mm/transparent_hugepage/enabled
> 
> I must have missed this hint in the middle of the "your hardware is
> bad" avalance that came over me.
> 
> I spent two weeks bisecting "good" kernels since during the repeated
> reconfigurations, transparent huge pages got turned off in kernel
> configuration. After running each kernel for 24 hours, I eventually
> ended up with a working 4.5 kernel. The configuration diff was short,
> showing transparent huge pages, and - finally - upon re-reading the
> thread I found your hint.
> 
> I have now the result that 4.5, 4.5.1 and 4.5.4 corrupt KVM guest
> memory reliably in the first hour of running under disk load, causing
> the VM to either drop dead in the water, or to read randomness from
> disk. Rebooting fixes the VM. This happens as soon as transparent huge
> pages are turned on in the host.
> 
> Turning off transparent huge pages by echo never >
> /sys/kernel/mm/transparent_hugepage/enabled fixes the issue even
> without rebooting the host. Start up the VM again and it works just
> fine.
> 
> Is this an issue in (a) transparent huge pages, (b) KVM or (c) qemu?
> Where should this issue be forwarded? Or do we just accept it and turn
> transparent huge pages off?

Could you test this:

http://lkml.kernel.org/r/1463070742-18401-1-git-send-email-aarcange@redhat.com

?

Marc Haber May 12, 2016, 8:34 p.m. UTC | #20

On Thu, May 12, 2016 at 11:24:02PM +0300, Kirill A. Shutemov wrote:
> http://lkml.kernel.org/r/1463070742-18401-1-git-send-email-aarcange@redhat.com

Is this in v4.6-rc7?

If so, can I just test v4.6-rc7?

If not so, would it be a valid approach to first check plain v4.6-rc7
and then patched v4.6-rc7?

Greetings
Marc

Kirill A . Shutemov May 12, 2016, 8:42 p.m. UTC | #21

On Thu, May 12, 2016 at 10:34:57PM +0200, Marc Haber wrote:
> On Thu, May 12, 2016 at 11:24:02PM +0300, Kirill A. Shutemov wrote:
> > http://lkml.kernel.org/r/1463070742-18401-1-git-send-email-aarcange@redhat.com
> 
> Is this in v4.6-rc7?

No.

> If so, can I just test v4.6-rc7?
> 
> If not so, would it be a valid approach to first check plain v4.6-rc7
> and then patched v4.6-rc7?

It haven't hit -mm tree yet.

But I guess it should apply cleanly to v4.5. Or at least without major
conflicts.

Andrew, it looks like third bug report for the bug. It would be nice to
propogate it quicker, if possible.

Marc Haber May 13, 2016, 5:23 a.m. UTC | #22

On Thu, May 12, 2016 at 11:42:16PM +0300, Kirill A. Shutemov wrote:
> But I guess it should apply cleanly to v4.5. Or at least without major
> conflicts.

[11/511]mh@fan:~/linux/debug/linux$ curl 'http://marc.info/?l=linux-rdma&m=146307074800836&w=2' | patch -p1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12529    0 12529    0     0   9844      0 --:--:--  0:00:01 --:--:--  9849
patching file include/linux/mm.h
Hunk #1 succeeded at 456 with fuzz 1 (offset -44 lines).
patching file include/linux/swap.h
Hunk #2 FAILED at 513.
1 out of 2 hunks FAILED -- saving rejects to file include/linux/swap.h.rej
patching file mm/huge_memory.c
Hunk #1 FAILED at 1298.
Hunk #2 FAILED at 2079.
Hunk #3 succeeded at 3340 (offset 117 lines).
2 out of 3 hunks FAILED -- saving rejects to file mm/huge_memory.c.rej
patching file mm/memory.c
Hunk #1 FAILED at 2373.
Hunk #2 succeeded at 2331 with fuzz 2 (offset -56 lines).
Hunk #3 FAILED at 2622.
2 out of 3 hunks FAILED -- saving rejects to file mm/memory.c.rej
patching file mm/swapfile.c
Hunk #1 FAILED at 922.
1 out of 1 hunk FAILED -- saving rejects to file mm/swapfile.c.rej
[12/512]mh@fan:~/linux/debug/linux$

It doesn't, and it doesn't apply to 4.6-rc3 as well:

[17/517]mh@fan:~/linux/debug/linux$ git checkout v4.6-rc3
Checking out files: 100% (9945/9945), done.
Previous HEAD position was b562e44... Linux 4.5
HEAD is now at bf16200... Linux 4.6-rc3
[18/518]mh@fan:~/linux/debug/linux$ curl 'http://marc.info/?l=linux-rdma&m=146307074800836&w=2' | patch -p1
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12529    0 12529    0     0   9692      0 --:--:--  0:00:01 --:--:--  9697
patching file include/linux/mm.h
patching file include/linux/swap.h
Hunk #2 FAILED at 513.
1 out of 2 hunks FAILED -- saving rejects to file include/linux/swap.h.rej
patching file mm/huge_memory.c
Hunk #1 FAILED at 1298.
Hunk #2 FAILED at 2079.
Hunk #3 succeeded at 3225 (offset 2 lines).
2 out of 3 hunks FAILED -- saving rejects to file mm/huge_memory.c.rej
patching file mm/memory.c
Hunk #1 FAILED at 2373.
Hunk #2 succeeded at 2354 (offset -33 lines).
Hunk #3 FAILED at 2622.
2 out of 3 hunks FAILED -- saving rejects to file mm/memory.c.rej
patching file mm/swapfile.c
Hunk #1 FAILED at 922.
1 out of 1 hunk FAILED -- saving rejects to file mm/swapfile.c.rej
[19/519]mh@fan:~/linux/debug/linux$

How do I apply this?

Greetings
Marc

Dr. David Alan Gilbert May 13, 2016, 8:35 a.m. UTC | #23

* Marc Haber (mh+linux-kernel@zugschlus.de) wrote:
> Hi David,
> 
> On Sat, Apr 23, 2016 at 07:52:46PM +0100, Dr. David Alan Gilbert wrote:
> > Hmm, your problem does sound like bad hardware, but....
> > If you've got a nice reliable crash, can you try turning transparent huge pages
> > off on the host;
> >    echo never > /sys/kernel/mm/transparent_hugepage/enabled
> 
> I must have missed this hint in the middle of the "your hardware is
> bad" avalance that came over me.
> 
> I spent two weeks bisecting "good" kernels since during the repeated
> reconfigurations, transparent huge pages got turned off in kernel
> configuration. After running each kernel for 24 hours, I eventually
> ended up with a working 4.5 kernel. The configuration diff was short,
> showing transparent huge pages, and - finally - upon re-reading the
> thread I found your hint.

OK, good.  When I sent that mail I'd hit a THP bug but in a
corner of migration and at the time we didn't know why and there was
no reason to think it would cause any other symptoms, but since it was
also between 4.4 and 4.5 it did seem worth mentioning as a long shot,
but it was no more than a long shot.

> I have now the result that 4.5, 4.5.1 and 4.5.4 corrupt KVM guest
> memory reliably in the first hour of running under disk load, causing
> the VM to either drop dead in the water, or to read randomness from
> disk. Rebooting fixes the VM. This happens as soon as transparent huge
> pages are turned on in the host.
> 
> Turning off transparent huge pages by echo never >
> /sys/kernel/mm/transparent_hugepage/enabled fixes the issue even
> without rebooting the host. Start up the VM again and it works just
> fine.
> 
> Is this an issue in (a) transparent huge pages, (b) KVM or (c) qemu?
> Where should this issue be forwarded? Or do we just accept it and turn
> transparent huge pages off?

Try Andrea's fix for (a).

Dave

> 
> Greetings
> Marc
> 
> -- 
> -----------------------------------------------------------------------------
> Marc Haber         | "I don't trust Computers. They | Mailadresse im Header
> Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
> Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421

Marc Haber May 13, 2016, 2:03 p.m. UTC | #24

On Fri, May 13, 2016 at 09:35:45AM +0100, Dr. David Alan Gilbert wrote:
> also between 4.4 and 4.5 it did seem worth mentioning as a long shot,
> but it was no more than a long shot.

It was however helpful. I'd have bisected kernel configuration instead
of using the runtime control first, and seeing your long shot two
weeks earlier, it'd have saved myself those two weeks of tedious
bisecting.

> Try Andrea's fix for (a).

In the works.

Greetings
Marc

Major KVM issues with kernel 4.5 on the host

Commit Message

Comments

Patch