KVM: X86: Fix scan ioapic use-before-initialization

Message ID	1542702858-4318-1-git-send-email-wanpengli@tencent.com (mailing list archive)
State	New, archived
Headers
    Return-Path: <kvm-owner@kernel.org>
    From: Wanpeng Li <kernellwp@gmail.com>
    To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
    Cc: Paolo Bonzini <pbonzini@redhat.com>, Radim Krčmář <rkrcmar@redhat.com>, Wei Wu <ww9210@gmail.com>
    Subject: [PATCH] KVM: X86: Fix scan ioapic use-before-initialization
    Date: Tue, 20 Nov 2018 16:34:18 +0800
    Message-Id: <1542702858-4318-1-git-send-email-wanpengli@tencent.com>
    MIME-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: 8bit
    Sender: kvm-owner@vger.kernel.org
    Precedence: bulk

Wanpeng Li Nov. 20, 2018, 8:34 a.m. UTC

From: Wanpeng Li <wanpengli@tencent.com>

Reported by syzkaller:

 BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
 PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0 
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 7 PID: 5059 Comm: debug Tainted: G           OE     4.19.0-rc5 #16
 RIP: 0010:__lock_acquire+0x1a6/0x1990
 Call Trace:
  lock_acquire+0xdb/0x210
  _raw_spin_lock+0x38/0x70
  kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
  vcpu_enter_guest+0x167e/0x1910 [kvm]
  kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
  kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
  do_vfs_ioctl+0xa5/0x690
  ksys_ioctl+0x6d/0x80
  __x64_sys_ioctl+0x1a/0x20
  do_syscall_64+0x83/0x6e0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The testcase writes the Hyper-V SynIC MSR HV_X64_MSR_SINT6, which triggers
the scan ioapic logic to load the SynIC vectors into the EOI exit bitmap.
However, this simple testcase never initializes the irqchip, so the
ioapic/apic objects must not be accessed.

This can be triggered by the following program:

    #define _GNU_SOURCE
    
    #include <endian.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    
    uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
    
    int main(void)
    {
    	/* Map scratch memory at a fixed address: prot 3 = PROT_READ|PROT_WRITE,
    	 * flags 0x32 = MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS. */
    	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
    	long res = 0;
    	memcpy((void*)0x20000040, "/dev/kvm", 9);
    	/* 0xffffffffffffff9c is AT_FDCWD (-100). */
    	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
    	if (res != -1)
    		r[0] = res;
    	/* 0xae01 = KVM_CREATE_VM */
    	res = syscall(__NR_ioctl, r[0], 0xae01, 0);
    	if (res != -1)
    		r[1] = res;
    	/* 0xae41 = KVM_CREATE_VCPU; KVM_CREATE_IRQCHIP is never issued. */
    	res = syscall(__NR_ioctl, r[1], 0xae41, 0);
    	if (res != -1)
    		r[2] = res;
    	/* struct kvm_msrs with nmsrs = 1; the single entry has index
    	 * 0x40000096 (HV_X64_MSR_SINT6). */
    	memcpy(
    			(void*)0x20000080,
    			"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
    			"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
    			"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
    			"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
    			"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
    			"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
    			106);
    	/* 0x4008ae89 = KVM_SET_MSRS */
    	syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
    	/* 0xae80 = KVM_RUN */
    	syscall(__NR_ioctl, r[2], 0xae80, 0);
    	return 0;
    }
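
The reproducer passes raw ioctl request numbers and a raw byte buffer, which
makes it hard to see what it asks the kernel to do. The following is a minimal
sketch of how those magic numbers can be decoded, assuming the generic Linux
ioctl encoding used on x86 (8-bit nr, 8-bit type, 14-bit size, 2-bit
direction). The names decode_ioctl, ioctl_fields and le32 are made up for
this illustration and are not part of any kernel or libc API.

```c
#include <stdint.h>

/* Fields of a Linux ioctl request number, assuming the generic layout
 * used on x86: bits 0-7 nr, 8-15 type, 16-29 size, 30-31 direction. */
struct ioctl_fields {
	unsigned nr;
	unsigned type;
	unsigned size;
	unsigned dir;	/* 0 = none, 1 = write (user to kernel), 2 = read */
};

/* Hypothetical helper: split a request number into its fields. */
struct ioctl_fields decode_ioctl(uint32_t req)
{
	struct ioctl_fields f;

	f.nr   = req & 0xff;
	f.type = (req >> 8) & 0xff;
	f.size = (req >> 16) & 0x3fff;
	f.dir  = (req >> 30) & 0x3;
	return f;
}

/* Hypothetical helper: read a little-endian 32-bit value from bytes,
 * independent of host endianness. */
uint32_t le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Decoding 0x4008ae89 this way gives type 0xae (the KVM ioctl type), nr 0x89,
size 8 and direction "write", which matches KVM_SET_MSRS. Applying le32 to
bytes 8..11 of the buffer ("\x96\x00\x00\x40") gives 0x40000096, the MSR
index of HV_X64_MSR_SINT6.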

This patch fixes it by bailing out of the ioapic scan if the ioapic is not
initialized in the kernel.

Reported-by: Wei Wu <ww9210@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wei Wu <ww9210@gmail.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
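
The shape of the fix can be illustrated with a small userspace model. This is
only a sketch under stated assumptions: the model_* types and functions are
invented for the example, and a NULL pointer check stands in for the kernel's
ioapic_in_kernel() helper, whose real implementation may differ. It shows the
pattern of skipping the scan when no in-kernel ioapic exists, rather than
dereferencing an object that was never created.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel objects; all names are hypothetical. */
struct model_ioapic {
	int scans;			/* counts how often the scan ran */
};

struct model_kvm {
	struct model_ioapic *ioapic;	/* NULL until an irqchip is created */
};

struct model_vcpu {
	struct model_kvm *kvm;
};

/* Stand-in for ioapic_in_kernel(): here simply "was the object created?". */
bool model_ioapic_in_kernel(const struct model_kvm *kvm)
{
	return kvm->ioapic != NULL;
}

/* The scan itself dereferences the ioapic unconditionally, so callers
 * must make sure it exists. */
void model_ioapic_scan_entry(struct model_vcpu *vcpu)
{
	vcpu->kvm->ioapic->scans++;
}

/* Guarded caller: returns true if a scan was performed, false if it was
 * skipped because no in-kernel ioapic exists. */
bool model_vcpu_scan_ioapic(struct model_vcpu *vcpu)
{
	if (!model_ioapic_in_kernel(vcpu->kvm))
		return false;
	model_ioapic_scan_entry(vcpu);
	return true;
}
```

With kvm.ioapic left NULL, model_vcpu_scan_ioapic() returns false without
touching the pointer; once an ioapic object is attached, the scan runs.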

Paolo Bonzini Nov. 25, 2018, 5:31 p.m. UTC | #1

On 20/11/18 09:34, Wanpeng Li wrote:
> From: Wanpeng Li <wanpengli@tencent.com>
> 
> Reported by syzkaller:
> 
>  BUG: unable to handle kernel NULL pointer dereference at 00000000000001c8
>  PGD 80000003ec4da067 P4D 80000003ec4da067 PUD 3f7bfa067 PMD 0 
>  Oops: 0000 [#1] PREEMPT SMP PTI
>  CPU: 7 PID: 5059 Comm: debug Tainted: G           OE     4.19.0-rc5 #16
>  RIP: 0010:__lock_acquire+0x1a6/0x1990
>  Call Trace:
>   lock_acquire+0xdb/0x210
>   _raw_spin_lock+0x38/0x70
>   kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
>   vcpu_enter_guest+0x167e/0x1910 [kvm]
>   kvm_arch_vcpu_ioctl_run+0x35c/0x610 [kvm]
>   kvm_vcpu_ioctl+0x3e9/0x6d0 [kvm]
>   do_vfs_ioctl+0xa5/0x690
>   ksys_ioctl+0x6d/0x80
>   __x64_sys_ioctl+0x1a/0x20
>   do_syscall_64+0x83/0x6e0
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT6 msr 
> and triggers scan ioapic logic to load synic vectors into EOI exit bitmap. 
> However, irqchip is not initialized by this simple testcase, ioapic/apic 
> objects should not be accessed.
> This can be triggered by the following program:
> 
>     #define _GNU_SOURCE
>     
>     #include <endian.h>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <sys/syscall.h>
>     #include <sys/types.h>
>     #include <unistd.h>
>     
>     uint64_t r[3] = {0xffffffffffffffff, 0xffffffffffffffff, 0xffffffffffffffff};
>     
>     int main(void)
>     {
>     	syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
>     	long res = 0;
>     	memcpy((void*)0x20000040, "/dev/kvm", 9);
>     	res = syscall(__NR_openat, 0xffffffffffffff9c, 0x20000040, 0, 0);
>     	if (res != -1)
>     		r[0] = res;
>     	res = syscall(__NR_ioctl, r[0], 0xae01, 0);
>     	if (res != -1)
>     		r[1] = res;
>     	res = syscall(__NR_ioctl, r[1], 0xae41, 0);
>     	if (res != -1)
>     		r[2] = res;
>     	memcpy(
>     			(void*)0x20000080,
>     			"\x01\x00\x00\x00\x00\x5b\x61\xbb\x96\x00\x00\x40\x00\x00\x00\x00\x01\x00"
>     			"\x08\x00\x00\x00\x00\x00\x0b\x77\xd1\x78\x4d\xd8\x3a\xed\xb1\x5c\x2e\x43"
>     			"\xaa\x43\x39\xd6\xff\xf5\xf0\xa8\x98\xf2\x3e\x37\x29\x89\xde\x88\xc6\x33"
>     			"\xfc\x2a\xdb\xb7\xe1\x4c\xac\x28\x61\x7b\x9c\xa9\xbc\x0d\xa0\x63\xfe\xfe"
>     			"\xe8\x75\xde\xdd\x19\x38\xdc\x34\xf5\xec\x05\xfd\xeb\x5d\xed\x2e\xaf\x22"
>     			"\xfa\xab\xb7\xe4\x42\x67\xd0\xaf\x06\x1c\x6a\x35\x67\x10\x55\xcb",
>     			106);
>     	syscall(__NR_ioctl, r[2], 0x4008ae89, 0x20000080);
>     	syscall(__NR_ioctl, r[2], 0xae80, 0);
>     	return 0;
>     }
> 
> This patch fixes it by bailing out scan ioapic if ioapic is not initialized in 
> kernel.
> 
> Reported-by: Wei Wu <ww9210@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Wei Wu <ww9210@gmail.com>
> Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
> ---
>  arch/x86/kvm/x86.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 66d66d7..14b2bc4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7455,7 +7455,8 @@ static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
>  	else {
>  		if (vcpu->arch.apicv_active)
>  			kvm_x86_ops->sync_pir_to_irr(vcpu);
> -		kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors);
> +		if (ioapic_in_kernel(vcpu->kvm))
> +			kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors);
>  	}
>  
>  	if (is_guest_mode(vcpu))
> 

Queued, thanks.

Paolo

Dmitry Vyukov Dec. 27, 2018, 2:28 p.m. UTC | #2

On Sun, Nov 25, 2018 at 6:31 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 20/11/18 09:34, Wanpeng Li wrote:
> > From: Wanpeng Li <wanpengli@tencent.com>
> > ...
> > This patch fixes it by bailing out scan ioapic if ioapic is not initialized in
> > kernel.
> > Reported-by: Wei Wu <ww9210@gmail.com>

+Linus, Greg

I want to point out that this was reported more than 3 months ago by syzbot:
https://groups.google.com/forum/#!msg/syzkaller-bugs/cPT7tmaz-gQ/SzOyhM0YBAAJ
then the report was lost on kernel mailing lists and then re-reported
by somebody else:
https://www.spinics.net/lists/kvm/msg177705.html
and only then fixed.

Lots of kernel bug reports routinely get lost on mailing lists, which is bad.

Another bug was reported by syzbot in April:
https://groups.google.com/forum/#!msg/syzkaller-bugs/-9XIT9gwq7M/sqvBXSZWBgAJ
then got lost and then re-reported in November:
https://www.spinics.net/lists/kvm/msg177704.html
and only then fixed.

Not specific to KVM: another bug in kernel/trace was reported by syzbot,
lost for months, then re-reported and fixed:
https://groups.google.com/forum/#!msg/syzkaller-bugs/o_-OeMyoTwg/Ugh432hlAgAJ
https://bugzilla.kernel.org/show_bug.cgi?id=200019

And, no, it's not that people ignore just syzbot reports. It's just
that syzbot reports can be tracked, so it's easier to spot such cases;
for manually reported bugs nobody usually knows anything after a few
weeks. Here is an example of a bug report by a human, which even got a
reply but then slipped from somebody's attention for a moment
and fell into complete oblivion. Months later it happened to be
re-reported by syzbot and was then fixed:
https://groups.google.com/forum/#!msg/syzkaller-bugs/wFUedfOK2Rw/waUrQYOxAQAJ

Bugs re-reported a year later can cause security problems and large
amounts of work to backport the fix to thousands of downstream kernel
forks. Bugs that are not re-reported are even worse, as they are just not fixed.

This Plumbers I was approached by Doug Ledford from Red Hat, who said
literally that there was a bunch of syzbot reports in the rdma subsystem,
but since they were reported some time ago, nobody now knows
what or where they are. So while the bugs are still presumably there, they
are now completely unactionable and the kernel development process is
incapable of dealing with this. While syzbot reports have some chance
of being recovered, this applies equally to human-reported bugs, and
those can't be easily recovered.

This does not look like how things should be for the most critical
and fundamental software project in the world. Lost bugs/patches
should not be a thing. There are known working solutions for this in
the form of tooling and procedures, namely bug tracking. Any bug
tracking system can answer the main question: what are the
active bugs, sorted by priority, in subsystem X/assigned to me; and
lots of other useful questions.
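
To make that "main question" concrete, here is a toy sketch of the query in C.
Everything in it is hypothetical: struct bug, its fields and open_bugs() are
invented for illustration and do not describe bugzilla or any other real
tracker. It only shows that once reports live in a structured store, "open
bugs in subsystem X, sorted by priority" is a filter followed by a sort.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical bug record; priority 0 is the most urgent. */
struct bug {
	int id;
	const char *subsystem;
	int priority;
	int open;		/* nonzero while the bug is unresolved */
};

/* qsort comparator: ascending priority value (most urgent first). */
int by_priority(const void *a, const void *b)
{
	const struct bug *x = a;
	const struct bug *y = b;

	return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Copy the open bugs of one subsystem into out (which must have room for
 * n entries), sort them by priority and return how many were found. */
size_t open_bugs(const struct bug *all, size_t n, const char *subsystem,
		 struct bug *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (all[i].open && strcmp(all[i].subsystem, subsystem) == 0)
			out[count++] = all[i];
	}
	qsort(out, count, sizeof(*out), by_priority);
	return count;
}
```

The same filter with a different key (assignee instead of subsystem) would
answer the "assigned to me" variant of the question.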

And, yes, I know we have bugzilla. But it's not being used as a bug
tracking system as of now. And when it is used, it sometimes causes more
trouble because nobody expects bugs to be there:
https://lwn.net/ml/linux-kernel/20181208115629.GA3288@kroah.com/

Linus Torvalds Dec. 27, 2018, 4:59 p.m. UTC | #3

On Thu, Dec 27, 2018 at 6:28 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> Lots of kernel bug reports routinely get lost on mailing lists, which is bad.

Nobody reads the kernel mailing list directly - there's just too much traffic.

And honestly, even fewer people then read the syzbot reports, because
they are so illegible and inhuman. They're better than they used to
be, but they are still basically impossible to parse without a lot of
effort.

And no, syzbot didn't really report the bug with any specificity - it
wasn't clear *which* commit it was that caused it, so reading that
syzbot report, at no point was it then obvious that the original patch
had issues.

See the problem?

So the issue seems to be that syzbot is simply not useful enough. Its
output is too rough for people to take it seriously. You see how the
report by Wei Wu then got traction, because Wei took a syzbot report
and added some human background and distilled it down to not be
"here's a big dump of random information".

So I suspect syzbot should strive to make for a much stronger
signal-to-noise ratio. For example, if syzbot had actually bisected
the bug it reported, that would have been quite a strong signal.

Compare these two emails:

    https://lore.kernel.org/lkml/1542702858-4318-1-git-send-email-wanpengli@tencent.com/
    https://lore.kernel.org/lkml/0000000000001c7a5c0573607583@google.com/

and note the absolutely huge difference in actual *information* (as
opposed to raw data).

Any possibility that syzbot would actually do the bisection once it
finds a problem, and write a report based on the commit that caused
the problem rather than just a problem dump?
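
For reference, the core of such a bisection is a binary search over the commit
range. The following is a minimal sketch, assuming the bug is deterministic
and that every commit from the culprit onwards shows it; first_bad_commit(),
is_bad_fn and bad_from() are hypothetical names for this example, not
syzkaller or git APIs.

```c
#include <stddef.h>

/* Hypothetical predicate: nonzero if the kernel at this commit index
 * reproduces the bug. In practice this means build, boot and run the
 * reproducer. */
typedef int (*is_bad_fn)(size_t commit, void *ctx);

/* Commits are indexed oldest to newest. 'good' is known good, 'bad' is
 * known bad, and good < bad. Returns the index of the first bad commit
 * and, if 'tests' is not NULL, adds the number of commits tested to it. */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad,
			void *ctx, unsigned *tests)
{
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (tests)
			(*tests)++;
		if (is_bad(mid, ctx))
			bad = mid;	/* culprit is mid or older */
		else
			good = mid;	/* culprit is newer than mid */
	}
	return bad;
}

/* Example predicate for testing: ctx points to the culprit's index. */
int bad_from(size_t commit, void *ctx)
{
	return commit >= *(size_t *)ctx;
}
```

Each step halves the range, so a window of 150 commits needs at most 8
build-and-test cycles under these assumptions.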

                 Linus

Dmitry Vyukov Dec. 28, 2018, 9:43 a.m. UTC | #4

On Thu, Dec 27, 2018 at 6:00 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Dec 27, 2018 at 6:28 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > Lots of kernel bug reports routinely get lost on mailing lists, which is bad.
>
> Nobody reads the kernel mailing list directly - there's just too much traffic.
>
> And honestly, even fewer people then read the syzbot reports, because
> they are so illegible and inhuman. They're better than they used to
> be, but they are still basically impossible to parse without a lot of
> effort.
>
> And no, syzbot didn't really report the bug with any specificity - it
> wasn't clear *which* commit it was that caused it, so reading that
> syzbot report, at no point was it then obvious that the original patch
> had issues.
>
> See the problem?
>
> So the issue seems to be that syzbot is simply not useful enough. Its
> output is too rough for people to take it seriously. You see how the
> report by Wei Wu then got traction, because Wei took a syzbot report
> and added some human background and distilled it down to not be
> "here's a big dump of random information".
>
> So I suspect syzbot should strive to make for a much stronger
> signal-to-noise ratio. For example, if syzbot had actually bisected
> the bug it reported, that would have been quite a strong signal.
>
> Compare these two emails:
>
>     https://lore.kernel.org/lkml/1542702858-4318-1-git-send-email-wanpengli@tencent.com/
>     https://lore.kernel.org/lkml/0000000000001c7a5c0573607583@google.com/
>
> and note the absolutely huge difference in actual *information* (as
> opposed to raw data).
>
> Any possibility that syzbot would actually do the bisection once it
> finds a problem, and write a report based on the commit that caused
> the problem rather than just a problem dump?

Hi Linus,

I agree there are things to improve in syzbot. Bisection is useful and
we will implement it. This is a popular user request; we keep track of
all of them, so nothing is lost:
https://github.com/google/syzkaller/issues?q=label%3A"syzbot+user+request"

But let's not reduce the discussion to syzbot improvements and
distract it from the main point, which is:

> Nobody reads the kernel mailing list directly - there's just too much traffic.

As a result bug reports and patches get lost and this is bad and it
would be useful to stop it from happening and there are known ways for
this.

syzbot not doing bisection is not the root cause of this and most of
what you said does not apply.

1. I specifically added a case where it happens the other way around:
human report was ignored, then syzbot report fixed.

2. syzbot reports are not worse than average human reports, frequently better.
What you linked to is not the human report; it's a reply from a developer
that includes a fix with an explanation. If you look at the original
human reports, you can see that they are missing the kernel config and full
console output (sometimes there is some useful information before the crash):
https://www.spinics.net/lists/kvm/msg177705.html
https://www.spinics.net/lists/kvm/msg177704.html
syzbot reports are also better formatted, as one does not need to
parse custom prose to digest information:
https://groups.google.com/forum/#!msg/syzkaller/40Ts5kOqJlo/tEYv9j-3AQAJ

3. Bisection is useful, but not important in most cases.
First of all, both human reports did not contain bisection info. Which
clearly means that bisection is not the reason syzbot reports were not
acted on.
We see a fix rate of 75% for reports without bisection. Lots of bugs
don't even require a reproducer (e.g. a wrong local if condition); the fix
rate for such reports is 66%, with an absolute number in the hundreds. For
simple bugs nothing other than a crash message is required. For more
complex ones there is an infinite tail of custom information. E.g.
bisection may not help when a latent bug is unmasked, or when it
bisects just to the addition of a WARN_ON. Say, for kvm bugs a critical
piece may be cpu stepping.

4. syzbot reports are useful and signal-to-noise ratio is high:
https://syzkaller.appspot.com/?fixed=upstream
You can also ask developers who fixed dozens of syzbot reports.

5. Developers who look at syzbot reports acknowledge that they are
lost because of the kernel development process.
This one that I linked:
https://groups.google.com/d/msg/syzkaller-bugs/o_-OeMyoTwg/UOZv1d2IAgAJ
Steven Rostedt says that it wasn't lost because it did not contain
bisection information, but because "Yeah, that time was quite busy for
me. I guess I failed to get time to look into it when it was first
reported [and then it was simply lost with no chances of recovering]".
Here the bug was acknowledged:
https://groups.google.com/d/msg/syzkaller/WA6MdAfCYS0/1rSe_qDeAgAJ
but then simply lost for half a year:
https://groups.google.com/forum/#!msg/syzkaller-bugs/wFUedfOK2Rw/waUrQYOxAQAJ

So while I see potential for syzbot improvements, I see the problem
that leads to lost reports/patches in the kernel development process.

Linus Torvalds Dec. 28, 2018, 9:08 p.m. UTC | #5

On Fri, Dec 28, 2018 at 1:43 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> > Nobody reads the kernel mailing list directly - there's just too much traffic.
>
> As a result bug reports and patches get lost and this is bad and it
> would be useful to stop it from happening and there are known ways for
> this.

Well, let me be a bit more specific: you will find that people read
the very _targeted_ mailing lists, because they not only tend to be
more specific to some particular interest, but also aren't the flood
of hundreds of emails a day.

And don't get me wrong: I'm not saying that lkml is useless. Not at
all. It's just that it's really more of an archival model than a
"people read it" - so you send your emails to a group of people, and
then you cc lkml so that when that group gets expanded people can be
pointed at the whole thread. Or, obviously, so that commit messages
etc can point to discussion.

But that does mean that any lkml cc shouldn't be expected to cause a
reaction in itself. It's about other things.

> syzbot not doing bisection is not the root cause of this

Root cause? No. But if you do bisection, it means that you can now
target things much better. So then it's not lkml and "random
collection of maintainers", but a much more targeted group.

And that targeted group also ends up being a lot more receptive to it.

Again, look at the raw syzbot email and the email by Wanpeng Li. Yes,
the syzbot email did bring in a reasonable set of people just based on
the oops (I think it did "get_maintainer" on kvm_ioapic_scan_entry()).
But Wanpeng ended up sending it to the *particular* people who were
directly responsible.

> 2. syzbot reports are not worse than average human reports, frequently better.

No, they really aren't.

They are better in a *technical* sense, but they are also very much
obviously automated, which makes the target people take them much less
seriously.

When you see lots of syzbot emails, and there are lots of more or less
random recipients that may or may not be correct, what's the natural
reaction to that?

Look up "bystander effect".

> 3. Bisection is useful, but not important in most cases.

No.

Exactly because of the problem syzbot has. It's too scatter-shot.
People clearly ignore it, because people feel it's not _their_ issue.

The advantage of bisection is that it makes the problem much more
specific. Right now, you'll find that many developers ignore syzbot
simply because it's not worth their time to chase down whether it's
even their problem.

See what I'm saying?

It's the whole "data vs information" issue. Particularly when cc'ing
maintainers, who get hundreds of emails a day, you need to convince
them that this email is _relevant_.

                  Linus

Joey Pabalinas Dec. 28, 2018, 10:13 p.m. UTC | #6

On Fri, Dec 28, 2018 at 10:43:11AM +0100, 'Dmitry Vyukov' via syzkaller wrote:
> On Thu, Dec 27, 2018 at 6:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Nobody reads the kernel mailing list directly - there's just too much traffic.
> 
> As a result bug reports and patches get lost and this is bad and it
> would be useful to stop it from happening and there are known ways for
> this.

What are the "known ways"? The only effective way I can think of is to
set up personal email filters for specific topics, and while this is
useful and something I do myself, it requires a lot of up front work.

I don't think it's realistic to expect others to be doing this
instead of just subscribing to the topic lists.

Dmitry Vyukov Jan. 2, 2019, 1:43 p.m. UTC | #7

On Fri, Dec 28, 2018 at 11:13 PM Joey Pabalinas <joeypabalinas@gmail.com> wrote:
>
> On Fri, Dec 28, 2018 at 10:43:11AM +0100, 'Dmitry Vyukov' via syzkaller wrote:
> > On Thu, Dec 27, 2018 at 6:00 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > Nobody reads the kernel mailing list directly - there's just too much traffic.
> >
> > As a result bug reports and patches get lost and this is bad and it
> > would be useful to stop it from happening and there are known ways for
> > this.
>
> What are the "known ways"? The only effective way I can think of is to
> set up personal email filters for specific topics, and while this is
> useful and something I do myself, it requires a lot of up front work.
>
> I don't think it's realistic to expect others to be doing this
> instead of just subscribing to the topic lists.

Hi Joey,

I mean using a bug tracking system.

E.g. here are all open KASAN bugs regardless of when they were filed:
https://bugzilla.kernel.org/buglist.cgi?bug_status=__open__&component=Sanitizers&list_id=1010485&product=Memory%20Management

Here are all open syzkaller bugs:
https://github.com/google/syzkaller/issues

Dmitry Vyukov Jan. 2, 2019, 2:08 p.m. UTC | #8

On Fri, Dec 28, 2018 at 10:09 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Dec 28, 2018 at 1:43 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > > Nobody reads the kernel mailing list directly - there's just too much traffic.
> >
> > As a result bug reports and patches get lost and this is bad and it
> > would be useful to stop it from happening and there are known ways for
> > this.
>
> Well, let me be a  bit more specific: you will find that people read
> the very _targeted_ mailing lists, because they not only tend to be
> more specific to some particular interest, but also aren't the flood
> of hundreds of emails a day.
>
> And don't get me wrong: I'm not saying that lkml is useless. Not at
> all. It's just that it's really more of an archival model than a
> "people read it" - so you send your emails to a group of people, and
> then you cc lkml so that when that group gets expanded people can be
> pointed at the whole thread. Or, obviously, so that commit messages
> etc can point to discussion.
>
> But that does mean that any lkml cc shouldn't be expected to cause a
> reaction in itself. It's about other things.
>
> > syzbot not doing bisection is not the root cause of this
>
> Root cause? No. But if you do bisection, it means that you can now
> target things much better. So then it's not lkml and "random
> collection of maintainers", but a much more targeted group.
>
> And that targeted group also ends up being a lot more receptive to it.
>
> Again, look at the raw syzbot email and the email by Wanpeng Li. Yes,
> the syzbot email did bring in a reasonable set of people just based on
> the oops (I think it did "get_maintainer" on kvm_ioapic_scan_entry()).
> But Wanpeng ended up sending it to the *particular* people who were
> directly responsible.
>
> > 2. syzbot reports are not worse than average human reports, frequently better.
>
> No, they really aren't.
>
> They are better in a *technical* sense, but they are also very much
> obviously automated, which makes the target people take them much less
> seriously.
>
> When you see lots of syzbot emails, and there are lots of more or less
> random recipients that may or may not be correct, what's the natural
> reaction to that?
>
> Look up "bystander effect".
>
> > 3. Bisection is useful, but not important in most cases.
>
> No.
>
> Exactly because of the problem syzbot has. It's too scatter-shot.
> People clearly ignore it, because people feel it's not _their_ issue.
>
> The advantage of bisection is that it makes the problem much more
> specific. Right now, you'll find that many developers ignore syzbot
> simply because it's not worth their time to chase down whether it's
> even their problem.
>
> See what I'm saying?
>
> It's the whole "data vs information" issue. Particularly when cc'ing
> maintainers, who get hundreds of emails a day, you need to convince
> them that this email is _relevant_.

I see what you are saying and I agree that bisection results will make
reports better in some cases. But I mean a more general problem.

Say you reported a bug, and it so happened that you missed that single
right person in CC because something, whatever, can happen, right?
With the current process it will be a coin flip whether your report is
routed to the right person or lost. And it's not that you personally
care a lot about this particular bug, it just happened that you
noticed it and wanted to be a good samaritan. So you will not keep
track of it on a post-it note on your monitor and won't ping later. But
the bug can be bad and either cause security problems later, or reach
release and break things in the field and then require 1000x more work
to port the fix to all downstream forks.

Or, we heavily rely on end users for testing. End users are not kernel
developers and can't generally be expected to do pre-triage and proper
routing. Losing these valuable reports is bad because only a small
fraction of users report anything to projects, and this can also affect
user trust: if you see that your reports are not acted on, you don't
report next time.

Even if we take syzbot, it won't be able to bisect all the time for
multiple reasons:
 - some bugs don't have reproducers (but still very real and sometimes
manageable to fix)
 - the kernel build/boot is sometimes broken for prolonged periods
 - some old bugs are bisected to introduction of the debugging tool
that detects the bug
 - some crashes can be too flaky for reliable bisection
 - some reproducers won't work on older kernels, yet the bug is there
 - ...
So it will be nice to have bisection results when they are
available, but it does not feel like it should be the only guarantee
of a bug report not being lost.
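
The "too flaky" case can be put into numbers. As a rough sketch, assume each
run of the reproducer on a buggy kernel crashes independently with probability
p (an idealization; real flakiness is rarely this clean). A bisection step
that runs the reproducer k times then wrongly marks a bad commit as good with
probability (1 - p)^k. The hypothetical helper below, runs_needed(), returns
the smallest k that pushes this below a chosen eps.

```c
/* Smallest k with (1 - p)^k <= eps, i.e. how many independent reproducer
 * runs are needed per bisection step so that a bad commit is mistaken for
 * a good one with probability at most eps. Returns 0 when no finite number
 * of runs can satisfy the request (p <= 0 or eps <= 0), and also when
 * eps >= 1, where no runs are needed. */
unsigned runs_needed(double p, double eps)
{
	unsigned k = 0;
	double miss = 1.0;	/* probability that all k runs so far passed */

	if (p <= 0.0 || eps <= 0.0)
		return 0;
	while (miss > eps) {
		miss *= 1.0 - p;
		k++;
	}
	return k;
}
```

For a crash that reproduces in half of the runs, 7 runs per step are needed to
get below a 1% miss rate; at a 10% reproduction rate it is 44 runs per step,
and that cost is paid at every step of the bisection.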

Moreover, you can see in the examples I referenced above that they
were delivered to the right people, but then still lost because there
is nothing in the kernel development process that would prevent losses.

Moreover, relying on a small set of private emails generally creates
problems wrt bus-factor and vacations. It would be useful if anybody
could see what are the open bugs for rdma_cm subsystem at any point in
time.

Paolo Bonzini Jan. 7, 2019, 11:11 p.m. UTC | #9

On 27/12/18 17:59, Linus Torvalds wrote:
> So the issue seems to be that syzbot is simply not useful enough. Its
> output is too rough for people to take it seriously. You see how the
> report by Wei Wu then got traction, because Wei took a syzbot report
> and added some human background and distilled it down to not be
> "here's a big dump of random information".

We do take it seriously.  Usually the reports are relatively easy to
distill and fix, but when a new random multi-threaded use-after-free
comes, doing the bisection in syzkaller might not work because such bugs
are not deterministic in how long it takes to reproduce them.  So the only
way to process them is "look at when it started to happen and stare at
150 commits until you find the culprit", which is of course time
consuming even though the syzkaller script usually gives a clue of which
commit to look at.

I agree with Linus that the report is more or less useless except for
trivial bugs, but I'm not sure what can be done to improve it.  I do use
it for trivial bugs, and at the very least, having many different
reports obviously means "use-after-free" or "dangling pointer" or some
other kind of memory corruption.  I try to prioritize those, but theory
and practice are different.

Paolo

Dmitry Vyukov Jan. 9, 2019, 8:28 a.m. UTC | #10

On Wed, Jan 2, 2019 at 3:08 PM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Fri, Dec 28, 2018 at 10:09 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Fri, Dec 28, 2018 at 1:43 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > >
> > > > Nobody reads the kernel mailing list directly - there's just too much traffic.
> > >
> > > As a result bug reports and patches get lost and this is bad and it
> > > would be useful to stop it from happening and there are known ways for
> > > this.
> >
> > Well, let me be a  bit more specific: you will find that people read
> > the very _targeted_ mailing lists, because they not only tend to be
> > more specific to some particular interest, but also aren't the flood
> > of hundreds of emails a day.
> >
> > And don't get me wrong: I'm not saying that lkml is useless. Not at
> > all. It's just that it's really more of an archival model than a
> > "people read it" - so you send your emails to a group of people, and
> > then you cc lkml so that when that group gets expanded people can be
> > pointed at the whole thread. Or, obviously, so that commit messages
> > etc can point to discussion.
> >
> > But that does mean that any lkml cc shouldn't be expected to cause a
> > reaction in itself. It's about other things.
> >
> > > syzbot not doing bisection is not the root cause of this
> >
> > Root cause? No. But if you do bisection, it means that you can now
> > target things much better. So then it's not lkml and "random
> > collection of maintainers", but a much more targeted group.
> >
> > And that targeted group also ends up being a lot more receptive to it.
> >
> > Again, look at the raw syzbot email and the email by Wanpeng Li. Yes,
> > the syzbot email did bring in a reasonable set of people just based on
> > the oops (I think it did "get_maintainer" on kvm_ioapic_scan_entry()).
> > But Wanpeng ended up sending it to the *particular* people who were
> > directly responsible.
> >
> > > 2. syzbot reports are not worse than average human reports, frequently better.
> >
> > No, they really aren't.
> >
> > They are better in a *technical* sense, but they are also very much
> > obviously automated, which makes the target people take them much less
> > seriously.
> >
> > When you see lots of syzbot emails, and there are lots of more or less
> > random recipients that may or may not be correct, what's the natural
> > reaction to that?
> >
> > Look up "bystander effect".
> >
> > > 3. Bisection is useful, but not important in most cases.
> >
> > No.
> >
> > Exactly because of the problem syzbot has. It's too scatter-shot.
> > People clearly ignore it, because people feel it's not _their_ issue.
> >
> > The advantage of bisection is that it makes the problem much more
> > specific. Right now, you'll find that many developers ignore syzbot
> > simply because it's not worth their time to chase down whether it's
> > even their problem.
> >
> > See what I'm saying?
> >
> > It's the whole "data vs information" issue. Particularly when cc'ing
> > maintainers, who get hundreds of emails a day, you need to convince
> > them that this email is _relevant_.
>
> I see what you are saying and I agree that bisection results will make
> reports better in some cases. But I mean a more general problem.
>
> Say you reported a bug, and it happened so that you missed that single
> right person in CC because something, whatever, can happen, right?
> With the current process it will be a coin flip if your report will be
> routed to the right person or lost. And it's not that you personally
> care a lot about this particular bug, it just happened that you
> noticed it and wanted to be a good samaritan. So you will not keep
> track of it on a post-it note on your monitor and won't ping later. But
> the bug can be bad and either cause security problems later, or reach
> release and break things in the field and then require 1000x more work
> to port the fix to all downstream forks.
>
> Or, we heavily rely on end users for testing. End users are not kernel
> developers and can't be generally expected to do pre-triage and proper
> routing. Losing these valuable reports is bad because only small
> fraction of users report anything to projects and this can also affect
> user trust, if you see that your reports are not acted on, you don't
> report next time.
>
> Even if we take syzbot, it won't be able to bisect all the time for
> multiple reasons:
>  - some bugs don't have reproducers (but still very real and sometimes
> manageable to fix)
>  - kernel is build/boot broken sometimes for prolonged periods
>  - some old bugs are bisected to introduction of the debugging tool
> that detects the bug
>  - some crashes can be too flaky for reliable bisection
>  - some reproducers won't work on older kernels, yet the bug is there
>  - ...
> So it will be nice to have bisection results when they are
> available, but it does not feel like it should be the only guarantee
> of a bug report not being lost.
>
> Moreover, you can see in the examples I referenced above that they
> were delivered to the right people, but then still lost because there
> is nothing in the kernel development process that would prevent losses.
>
> Moreover, relying on a small set of private emails generally creates
> problems wrt bus-factor and vacations. It would be useful if anybody
> could see what are the open bugs for rdma_cm subsystem at any point in
> time.

This is quite indicative:

Serious issues affecting all filesystems:

Kernel quality control, or the lack thereof
https://lwn.net/Articles/774114/

Comment on ycombinator:
https://news.ycombinator.com/item?id=18844612

I've filed bugs for some of the mentioned copy_file_range() issues
more than two years ago:
- https://bugzilla.kernel.org/show_bug.cgi?id=135461
- https://bugzilla.kernel.org/show_bug.cgi?id=135451
No response...
