mm: fix a data race in put_page()

Message ID 1580995070-25139-1-git-send-email-cai@lca.pw (mailing list archive)
State New, archived
Series mm: fix a data race in put_page()

Commit Message

Qian Cai Feb. 6, 2020, 1:17 p.m. UTC
page->flags could be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page

 write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
  page_cpupid_xchg_last+0x51/0x80
  page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
  wp_page_reuse+0x3e/0xc0
  wp_page_reuse at mm/memory.c:2453
  do_wp_page+0x472/0x7b0
  do_wp_page at mm/memory.c:2798
  __handle_mm_fault+0xcb0/0xd00
  handle_pte_fault at mm/memory.c:4049
  (inlined by) __handle_mm_fault at mm/memory.c:4163
  handle_mm_fault+0xfc/0x2f0
  handle_mm_fault at mm/memory.c:4200
  do_page_fault+0x263/0x6f9
  do_user_addr_fault at arch/x86/mm/fault.c:1465
  (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
  page_fault+0x34/0x40

 read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
  put_page+0x15a/0x1f0
  page_zonenum at include/linux/mm.h:923
  (inlined by) is_zone_device_page at include/linux/mm.h:929
  (inlined by) page_is_devmap_managed at include/linux/mm.h:948
  (inlined by) put_page at include/linux/mm.h:1023
  wp_page_copy+0x571/0x930
  wp_page_copy at mm/memory.c:2615
  do_wp_page+0x107/0x7b0
  __handle_mm_fault+0xcb0/0xd00
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 Reported by Kernel Concurrency Sanitizer on:
 CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

Both the read and the write are done with only the non-exclusive
mmap_sem held. Since the read checks for specific bits (up to three
bits for now) in the flags field, load tearing could in theory trigger
a logic bug.

To fix it, we could introduce put_page_lockless() for those places, but
that would be overkill and difficult to use. Thus, just add READ_ONCE()
for the read in page_zonenum() for now; it should not affect performance
or correctness, with the small trade-off that compilers might generate
less efficient code in some places.

Signed-off-by: Qian Cai <cai@lca.pw>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
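
For illustration, a minimal user-space sketch of the load-tearing concern
(hypothetical: the bit layout and the deliberately torn helper below are
stand-ins, and real compilers rarely tear an aligned word-sized load):

	#include <stdint.h>
	#include <stdio.h>

	#define ZONES_PGSHIFT	62	/* hypothetical layout: zone id in the top bits */
	#define ZONES_MASK	0x3UL

	/* The intended access: a single aligned word-sized load. */
	static unsigned long load_flags_whole(const volatile unsigned long *flags)
	{
		return *flags;
	}

	/*
	 * A deliberately torn load: two 32-bit halves read at different
	 * times (little-endian layout assumed). If a concurrent writer
	 * updated *flags between the two loads, the reassembled value
	 * never existed in memory -- the theoretical logic bug above.
	 */
	static unsigned long load_flags_torn(const volatile unsigned long *flags)
	{
		const volatile uint32_t *half = (const volatile uint32_t *)flags;
		uint64_t lo = half[0];	/* a writer may store a new value here... */
		uint64_t hi = half[1];	/* ...before the second half is loaded */

		return (hi << 32) | lo;
	}

	int main(void)
	{
		unsigned long flags = 0x3UL << ZONES_PGSHIFT;

		printf("whole: zone=%lu\n",
		       (load_flags_whole(&flags) >> ZONES_PGSHIFT) & ZONES_MASK);
		printf("torn:  zone=%lu\n",
		       (load_flags_torn(&flags) >> ZONES_PGSHIFT) & ZONES_MASK);
		return 0;
	}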

Comments

David Hildenbrand Feb. 6, 2020, 1:33 p.m. UTC | #1
On 06.02.20 14:17, Qian Cai wrote:
> page->flags could be accessed concurrently as noticed by KCSAN,
> 
>  BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
> 
>  write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
>   page_cpupid_xchg_last+0x51/0x80
>   page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
>   wp_page_reuse+0x3e/0xc0
>   wp_page_reuse at mm/memory.c:2453
>   do_wp_page+0x472/0x7b0
>   do_wp_page at mm/memory.c:2798
>   __handle_mm_fault+0xcb0/0xd00
>   handle_pte_fault at mm/memory.c:4049
>   (inlined by) __handle_mm_fault at mm/memory.c:4163
>   handle_mm_fault+0xfc/0x2f0
>   handle_mm_fault at mm/memory.c:4200
>   do_page_fault+0x263/0x6f9
>   do_user_addr_fault at arch/x86/mm/fault.c:1465
>   (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
>   page_fault+0x34/0x40
> 
>  read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
>   put_page+0x15a/0x1f0
>   page_zonenum at include/linux/mm.h:923
>   (inlined by) is_zone_device_page at include/linux/mm.h:929
>   (inlined by) page_is_devmap_managed at include/linux/mm.h:948
>   (inlined by) put_page at include/linux/mm.h:1023
>   wp_page_copy+0x571/0x930
>   wp_page_copy at mm/memory.c:2615
>   do_wp_page+0x107/0x7b0
>   __handle_mm_fault+0xcb0/0xd00
>   handle_mm_fault+0xfc/0x2f0
>   do_page_fault+0x263/0x6f9
>   page_fault+0x34/0x40
> 
>  Reported by Kernel Concurrency Sanitizer on:
>  CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
>  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> 
> Both the read and the write are done with only the non-exclusive
> mmap_sem held. Since the read checks for specific bits (up to three
> bits for now) in the flags field, load tearing could in theory trigger
> a logic bug.
> 
> To fix it, we could introduce put_page_lockless() for those places, but
> that would be overkill and difficult to use. Thus, just add READ_ONCE()
> for the read in page_zonenum() for now; it should not affect performance
> or correctness, with the small trade-off that compilers might generate
> less efficient code in some places.
> 
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  include/linux/mm.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 52269e56c514..f8529aa971c0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -920,7 +920,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>  
>  static inline enum zone_type page_zonenum(const struct page *page)
>  {
> -	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> +	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;

I can understand why other bits/flags might change, but not the zone
number? Nobody should be changing that without heavy locking (out of
memory hot(un)plug code). Or am I missing something? Can load tearing
actually produce an issue if these 3 bits will never change?
Qian Cai Feb. 6, 2020, 1:51 p.m. UTC | #2
On Thu, 2020-02-06 at 14:33 +0100, David Hildenbrand wrote:
> On 06.02.20 14:17, Qian Cai wrote:
> > page->flags could be accessed concurrently as noticed by KCSAN,
> > 
> >  BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
> > 
> >  write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
> >   page_cpupid_xchg_last+0x51/0x80
> >   page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
> >   wp_page_reuse+0x3e/0xc0
> >   wp_page_reuse at mm/memory.c:2453
> >   do_wp_page+0x472/0x7b0
> >   do_wp_page at mm/memory.c:2798
> >   __handle_mm_fault+0xcb0/0xd00
> >   handle_pte_fault at mm/memory.c:4049
> >   (inlined by) __handle_mm_fault at mm/memory.c:4163
> >   handle_mm_fault+0xfc/0x2f0
> >   handle_mm_fault at mm/memory.c:4200
> >   do_page_fault+0x263/0x6f9
> >   do_user_addr_fault at arch/x86/mm/fault.c:1465
> >   (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
> >   page_fault+0x34/0x40
> > 
> >  read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
> >   put_page+0x15a/0x1f0
> >   page_zonenum at include/linux/mm.h:923
> >   (inlined by) is_zone_device_page at include/linux/mm.h:929
> >   (inlined by) page_is_devmap_managed at include/linux/mm.h:948
> >   (inlined by) put_page at include/linux/mm.h:1023
> >   wp_page_copy+0x571/0x930
> >   wp_page_copy at mm/memory.c:2615
> >   do_wp_page+0x107/0x7b0
> >   __handle_mm_fault+0xcb0/0xd00
> >   handle_mm_fault+0xfc/0x2f0
> >   do_page_fault+0x263/0x6f9
> >   page_fault+0x34/0x40
> > 
> >  Reported by Kernel Concurrency Sanitizer on:
> >  CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
> >  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > 
> > Both the read and the write are done with only the non-exclusive
> > mmap_sem held. Since the read checks for specific bits (up to three
> > bits for now) in the flags field, load tearing could in theory trigger
> > a logic bug.
> > 
> > To fix it, we could introduce put_page_lockless() for those places, but
> > that would be overkill and difficult to use. Thus, just add READ_ONCE()
> > for the read in page_zonenum() for now; it should not affect performance
> > or correctness, with the small trade-off that compilers might generate
> > less efficient code in some places.
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  include/linux/mm.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 52269e56c514..f8529aa971c0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -920,7 +920,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> >  
> >  static inline enum zone_type page_zonenum(const struct page *page)
> >  {
> > -	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > +	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;
> 
> I can understand why other bits/flags might change, but not the zone
> number? Nobody should be changing that without heavy locking (out of
> memory hot(un)plug code). Or am I missing something? Can load tearing
> actually produce an issue if these 3 bits will never change?
> 

I think you are right in this case that the writer page_cpupid_xchg_last() will
not change those 3 bits. I don't have an answer as to whether other writers
might race here. Until someone can chime in, probably from the ZONE_DEVICE
side, to prove otherwise, I'll just update the previous patch to mark it as a
harmless data race.
Jan Kara Feb. 6, 2020, 2:55 p.m. UTC | #3
On Thu 06-02-20 14:33:22, David Hildenbrand wrote:
> On 06.02.20 14:17, Qian Cai wrote:
> > page->flags could be accessed concurrently as noticed by KCSAN,
> > 
> >  BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
> > 
> >  write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
> >   page_cpupid_xchg_last+0x51/0x80
> >   page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
> >   wp_page_reuse+0x3e/0xc0
> >   wp_page_reuse at mm/memory.c:2453
> >   do_wp_page+0x472/0x7b0
> >   do_wp_page at mm/memory.c:2798
> >   __handle_mm_fault+0xcb0/0xd00
> >   handle_pte_fault at mm/memory.c:4049
> >   (inlined by) __handle_mm_fault at mm/memory.c:4163
> >   handle_mm_fault+0xfc/0x2f0
> >   handle_mm_fault at mm/memory.c:4200
> >   do_page_fault+0x263/0x6f9
> >   do_user_addr_fault at arch/x86/mm/fault.c:1465
> >   (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
> >   page_fault+0x34/0x40
> > 
> >  read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
> >   put_page+0x15a/0x1f0
> >   page_zonenum at include/linux/mm.h:923
> >   (inlined by) is_zone_device_page at include/linux/mm.h:929
> >   (inlined by) page_is_devmap_managed at include/linux/mm.h:948
> >   (inlined by) put_page at include/linux/mm.h:1023
> >   wp_page_copy+0x571/0x930
> >   wp_page_copy at mm/memory.c:2615
> >   do_wp_page+0x107/0x7b0
> >   __handle_mm_fault+0xcb0/0xd00
> >   handle_mm_fault+0xfc/0x2f0
> >   do_page_fault+0x263/0x6f9
> >   page_fault+0x34/0x40
> > 
> >  Reported by Kernel Concurrency Sanitizer on:
> >  CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
> >  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > 
> > Both the read and the write are done with only the non-exclusive
> > mmap_sem held. Since the read checks for specific bits (up to three
> > bits for now) in the flags field, load tearing could in theory trigger
> > a logic bug.
> > 
> > To fix it, we could introduce put_page_lockless() for those places, but
> > that would be overkill and difficult to use. Thus, just add READ_ONCE()
> > for the read in page_zonenum() for now; it should not affect performance
> > or correctness, with the small trade-off that compilers might generate
> > less efficient code in some places.
> > 
> > Signed-off-by: Qian Cai <cai@lca.pw>
> > ---
> >  include/linux/mm.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 52269e56c514..f8529aa971c0 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -920,7 +920,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
> >  
> >  static inline enum zone_type page_zonenum(const struct page *page)
> >  {
> > -	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > +	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;
> 
> I can understand why other bits/flags might change, but not the zone
> number? Nobody should be changing that without heavy locking (out of
> memory hot(un)plug code). Or am I missing something? Can load tearing
> actually produce an issue if these 3 bits will never change?

I don't think the problem is real. The question is how to make KCSAN happy
in a way that doesn't silence other possibly useful things it can find and
also which makes it most obvious to the reader what's going on... IMHO
using READ_ONCE() fulfills these targets nicely - it is free
performance-wise in this case, it silences the checker without impacting
other races on page->flags, it's kind of obvious we don't want the load torn
in this case so it makes sense to the reader (although a comment may be
nice).

								Honza
David Hildenbrand Feb. 6, 2020, 2:59 p.m. UTC | #4
On 06.02.20 15:55, Jan Kara wrote:
> On Thu 06-02-20 14:33:22, David Hildenbrand wrote:
>> On 06.02.20 14:17, Qian Cai wrote:
>>> page->flags could be accessed concurrently as noticed by KCSAN,
>>>
>>>  BUG: KCSAN: data-race in page_cpupid_xchg_last / put_page
>>>
>>>  write (marked) to 0xfffffc0d48ec1a00 of 8 bytes by task 91442 on cpu 3:
>>>   page_cpupid_xchg_last+0x51/0x80
>>>   page_cpupid_xchg_last at mm/mmzone.c:109 (discriminator 11)
>>>   wp_page_reuse+0x3e/0xc0
>>>   wp_page_reuse at mm/memory.c:2453
>>>   do_wp_page+0x472/0x7b0
>>>   do_wp_page at mm/memory.c:2798
>>>   __handle_mm_fault+0xcb0/0xd00
>>>   handle_pte_fault at mm/memory.c:4049
>>>   (inlined by) __handle_mm_fault at mm/memory.c:4163
>>>   handle_mm_fault+0xfc/0x2f0
>>>   handle_mm_fault at mm/memory.c:4200
>>>   do_page_fault+0x263/0x6f9
>>>   do_user_addr_fault at arch/x86/mm/fault.c:1465
>>>   (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
>>>   page_fault+0x34/0x40
>>>
>>>  read to 0xfffffc0d48ec1a00 of 8 bytes by task 94817 on cpu 69:
>>>   put_page+0x15a/0x1f0
>>>   page_zonenum at include/linux/mm.h:923
>>>   (inlined by) is_zone_device_page at include/linux/mm.h:929
>>>   (inlined by) page_is_devmap_managed at include/linux/mm.h:948
>>>   (inlined by) put_page at include/linux/mm.h:1023
>>>   wp_page_copy+0x571/0x930
>>>   wp_page_copy at mm/memory.c:2615
>>>   do_wp_page+0x107/0x7b0
>>>   __handle_mm_fault+0xcb0/0xd00
>>>   handle_mm_fault+0xfc/0x2f0
>>>   do_page_fault+0x263/0x6f9
>>>   page_fault+0x34/0x40
>>>
>>>  Reported by Kernel Concurrency Sanitizer on:
>>>  CPU: 69 PID: 94817 Comm: systemd-udevd Tainted: G        W  O L 5.5.0-next-20200204+ #6
>>>  Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
>>>
>>> Both the read and the write are done with only the non-exclusive
>>> mmap_sem held. Since the read checks for specific bits (up to three
>>> bits for now) in the flags field, load tearing could in theory trigger
>>> a logic bug.
>>>
>>> To fix it, we could introduce put_page_lockless() for those places, but
>>> that would be overkill and difficult to use. Thus, just add READ_ONCE()
>>> for the read in page_zonenum() for now; it should not affect performance
>>> or correctness, with the small trade-off that compilers might generate
>>> less efficient code in some places.
>>>
>>> Signed-off-by: Qian Cai <cai@lca.pw>
>>> ---
>>>  include/linux/mm.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 52269e56c514..f8529aa971c0 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -920,7 +920,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>>>  
>>>  static inline enum zone_type page_zonenum(const struct page *page)
>>>  {
>>> -	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
>>> +	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;
>>
>> I can understand why other bits/flags might change, but not the zone
>> number? Nobody should be changing that without heavy locking (out of
>> memory hot(un)plug code). Or am I missing something? Can load tearing
>> actually produce an issue if these 3 bits will never change?
> 
> I don't think the problem is real. The question is how to make KCSAN happy
> in a way that doesn't silence other possibly useful things it can find and
> also which makes it most obvious to the reader what's going on... IMHO
> using READ_ONCE() fulfills these targets nicely - it is free
> performance-wise in this case, it silences the checker without impacting
> other races on page->flags, it's kind of obvious we don't want the load torn
> in this case so it makes sense to the reader (although a comment may be
> nice).

Yes, I can understand that - but reading the other comments, I think I
am not the only one who doesn't like to see the code getting uglyfied
(sorry, but that's what it is) and harder to read just to make some code
checker happy.

Having that said, I don't really care. But I do think that expressing
"this is not possible but only to make $TOOL happy" in the commit
message is worth having. The patch description should be reworked.
Qian Cai Feb. 6, 2020, 3:23 p.m. UTC | #5
> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
> 
> I don't think the problem is real. The question is how to make KCSAN happy
> in a way that doesn't silence other possibly useful things it can find and
> also which makes it most obvious to the reader what's going on... IMHO
> using READ_ONCE() fulfills these targets nicely - it is free
> performance-wise in this case, it silences the checker without impacting
> other races on page->flags, it's kind of obvious we don't want the load torn
> in this case so it makes sense to the reader (although a comment may be
> nice).

Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
John Hubbard Feb. 6, 2020, 11:34 p.m. UTC | #6
On 2/6/20 7:23 AM, Qian Cai wrote:
> 
> 
>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
>>
>> I don't think the problem is real. The question is how to make KCSAN happy
>> in a way that doesn't silence other possibly useful things it can find and
>> also which makes it most obvious to the reader what's going on... IMHO
>> using READ_ONCE() fulfills these targets nicely - it is free
>> performance-wise in this case, it silences the checker without impacting
> other races on page->flags, it's kind of obvious we don't want the load torn
>> in this case so it makes sense to the reader (although a comment may be
>> nice).
> 
> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
> 

Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
and that somewhat offends my sense of perfection... :)

* READ_ONCE(): this field need not be restricted to being read only once, so the
  name is immediately wrong. We're using side effects of READ_ONCE().

* data_race(): there is no race on the N bits worth of page zone number data. There
  is only a perceived race, due to tools that look at word-level granularity.

I'd propose one or both of the following:

a) Hope that Marcus has an idea to enhance KCSAN so as to support this model of
   access, and/or

b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
   candidates:

	READ_RO_BITS()
	READ_IMMUTABLE_BITS()
	...etc...
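
For concreteness, a minimal sketch of what such a macro might look like if
layered on the existing data_race() annotation (hypothetical, not an existing
kernel API; the name is just one of the bikeshed candidates above):

	/*
	 * Hypothetical sketch: read bits of @var that are set once early in
	 * the object's lifetime and never written again. The plain read is
	 * intentional; data_race() documents that and tells KCSAN to ignore
	 * this access while still checking other accesses to @var.
	 */
	#define READ_IMMUTABLE_BITS(var, mask)	(data_race(var) & (mask))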

thanks,
John Hubbard Feb. 6, 2020, 11:36 p.m. UTC | #7
On 2/6/20 3:34 PM, John Hubbard wrote:
> On 2/6/20 7:23 AM, Qian Cai wrote:
>>
>>
>>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
>>>
>>> I don't think the problem is real. The question is how to make KCSAN happy
>>> in a way that doesn't silence other possibly useful things it can find and
>>> also which makes it most obvious to the reader what's going on... IMHO
>>> using READ_ONCE() fulfills these targets nicely - it is free
>>> performance-wise in this case, it silences the checker without impacting
>> other races on page->flags, it's kind of obvious we don't want the load torn
>>> in this case so it makes sense to the reader (although a comment may be
>>> nice).
>>
>> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
>>
> 
> Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
> and that somewhat offends my sense of perfection... :)
> 
> * READ_ONCE(): this field need not be restricted to being read only once, so the
>   name is immediately wrong. We're using side effects of READ_ONCE().
> 
> * data_race(): there is no race on the N bits worth of page zone number data. There
>   is only a perceived race, due to tools that look at word-level granularity.
> 
> I'd propose one or both of the following:
> 
> a) Hope that Marcus has an idea to enhance KCSAN so as to support this model of
>    access, and/or
> 


arggh, make that "Marco Elver"! Really sorry for the typo on your name. 


thanks,
Qian Cai Feb. 6, 2020, 11:55 p.m. UTC | #8
> On Feb 6, 2020, at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> 
> * data_race(): there is no race on the N bits worth of page zone number data. There
>  is only a perceived race, due to tools that look at word-level granularity.

I’d like to think about this from the compiler level. There is a data race from the compiler's load and store point of view. It just happens that in this case it is harmless and does not affect the actual code logic.
Qian Cai Feb. 7, 2020, 12:18 a.m. UTC | #9
> On Feb 6, 2020, at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> 
> On 2/6/20 7:23 AM, Qian Cai wrote:
>> 
>> 
>>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
>>> 
>>> I don't think the problem is real. The question is how to make KCSAN happy
>>> in a way that doesn't silence other possibly useful things it can find and
>>> also which makes it most obvious to the reader what's going on... IMHO
>>> using READ_ONCE() fulfills these targets nicely - it is free
>>> performance-wise in this case, it silences the checker without impacting
>>> other races on page->flags, it's kind of obvious we don't want the load torn
>>> in this case so it makes sense to the reader (although a comment may be
>>> nice).
>> 
>> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
>> 
> 
> Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
> and that somewhat offends my sense of perfection... :)
> 
> * READ_ONCE(): this field need not be restricted to being read only once, so the
>  name is immediately wrong. We're using side effects of READ_ONCE().
> 
> * data_race(): there is no race on the N bits worth of page zone number data. There
>  is only a perceived race, due to tools that look at word-level granularity.
> 
> I'd propose one or both of the following:
> 
> a) Hope that Marcus has an idea to enhance KCSAN so as to support this model of
>   access, and/or

A similar thing was brought up before, i.e., anything compared to zero is immune to load-tearing
issues, but it is rather difficult to implement it in the compiler, so it was settled to use data_race(),

https://lore.kernel.org/lkml/CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com/#r

> 
> b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
>   candidates:
> 
> 	READ_RO_BITS()
> 	READ_IMMUTABLE_BITS()
> 	...etc...
> 

Actually, Linus might hate those kinds of complication rather than a simple data_race() macro,

https://lore.kernel.org/linux-fsdevel/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com/
John Hubbard Feb. 7, 2020, 12:27 a.m. UTC | #10
On 2/6/20 4:18 PM, Qian Cai wrote:
>> On Feb 6, 2020, at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>> On 2/6/20 7:23 AM, Qian Cai wrote:
>>>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
>>>> I don't think the problem is real. The question is how to make KCSAN happy
>>>> in a way that doesn't silence other possibly useful things it can find and
>>>> also which makes it most obvious to the reader what's going on... IMHO
>>>> using READ_ONCE() fulfills these targets nicely - it is free
>>>> performance-wise in this case, it silences the checker without impacting
>>>> other races on page->flags, it's kind of obvious we don't want the load torn
>>>> in this case so it makes sense to the reader (although a comment may be
>>>> nice).
>>>
>>> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
>>>
>>
>> Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
>> and that somewhat offends my sense of perfection... :)
>>
>> * READ_ONCE(): this field need not be restricted to being read only once, so the
>>  name is immediately wrong. We're using side effects of READ_ONCE().
>>
>> * data_race(): there is no race on the N bits worth of page zone number data. There
>>  is only a perceived race, due to tools that look at word-level granularity.
>>
>> I'd propose one or both of the following:
>>
>> a) Hope that Marco (I've fixed the typo in his name. --jh) has an idea to enhance KCSAN so as to support this model of
>>   access, and/or
> 
> A similar thing was brought up before, i.e., anything compared to zero is immune to load-tearing
> issues, but it is rather difficult to implement it in the compiler, so it was settled to use data_race(),
> 
> https://lore.kernel.org/lkml/CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com/#r
> 


Thanks for that link to the previous discussion, good context.


>>
>> b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
>>   candidates:
>>
>> 	READ_RO_BITS()
>> 	READ_IMMUTABLE_BITS()
>> 	...etc...
>>
> 
> Actually, Linus might hate those kinds of complication rather than a simple data_race() macro,
> 
> https://lore.kernel.org/linux-fsdevel/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com/
> 

Another good link. However, my macros above haven't been proposed yet, and I'm perfectly 
comfortable proposing something that Linus *might* (or might not!) hate. No point in 
guessing about it, IMHO.

If you want, I'll be happy to put on my flame suit and post a patchset proposing 
READ_IMMUTABLE_BITS() (or a better-named thing, if someone has another name idea).  :)



thanks,
Qian Cai Feb. 7, 2020, 12:55 a.m. UTC | #11
> On Feb 6, 2020, at 7:27 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> 
> On 2/6/20 4:18 PM, Qian Cai wrote:
>>> On Feb 6, 2020, at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>>> On 2/6/20 7:23 AM, Qian Cai wrote:
>>>>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
>>>>> I don't think the problem is real. The question is how to make KCSAN happy
>>>>> in a way that doesn't silence other possibly useful things it can find and
>>>>> also which makes it most obvious to the reader what's going on... IMHO
>>>>> using READ_ONCE() fulfills these targets nicely - it is free
>>>>> performance-wise in this case, it silences the checker without impacting
>>>>> other races on page->flags, it's kind of obvious we don't want the load torn
>>>>> in this case so it makes sense to the reader (although a comment may be
>>>>> nice).
>>>> 
>>>> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
>>>> 
>>> 
>>> Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
>>> and that somewhat offends my sense of perfection... :)
>>> 
>>> * READ_ONCE(): this field need not be restricted to being read only once, so the
>>> name is immediately wrong. We're using side effects of READ_ONCE().
>>> 
>>> * data_race(): there is no race on the N bits worth of page zone number data. There
>>> is only a perceived race, due to tools that look at word-level granularity.
>>> 
>>> I'd propose one or both of the following:
>>> 
>>> a) Hope that Marco (I've fixed the typo in his name. --jh) has an idea to enhance KCSAN so as to support this model of
>>>  access, and/or
>> 
>> A similar thing was brought up before, i.e., anything compared to zero is immune to load-tearing
>> issues, but it is rather difficult to implement it in the compiler, so it was settled to use data_race(),
>> 
>> https://lore.kernel.org/lkml/CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com/#r
>> 
> 
> 
> Thanks for that link to the previous discussion, good context.
> 
> 
>>> 
>>> b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
>>>  candidates:
>>> 
>>> 	READ_RO_BITS()
>>> 	READ_IMMUTABLE_BITS()
>>> 	...etc...
>>> 
>> 
>> Actually, Linus might hate those kinds of complication rather than a simple data_race() macro,
>> 
>> https://lore.kernel.org/linux-fsdevel/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com/
>> 
> 
> Another good link. However, my macros above haven't been proposed yet, and I'm perfectly 
> comfortable proposing something that Linus *might* (or might not!) hate. No point in 
> guessing about it, IMHO.
> 
> If you want, I'll be happy to put on my flame suit and post a patchset proposing 
> READ_IMMUTABLE_BITS() (or a better-named thing, if someone has another name idea).  :)
> 

BTW, the current comment says (note, it is called an “access”, which in this case reads the whole word
rather than just those 3 bits, even though only those bits are of interest to us),

/*
 * data_race(): macro to document that accesses in an expression may conflict with
 * other concurrent accesses resulting in data races, but the resulting
 * behaviour is deemed safe regardless.
 *
 * This macro *does not* affect normal code generation, but is a hint to tooling
 * that data races here should be ignored.
 */

The macro might have more to say.
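
For reference, this is how the macro would look on the line under discussion
(illustrative only, not the posted patch):

	return (data_race(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;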
Marco Elver Feb. 7, 2020, 1:17 p.m. UTC | #12
On Fri, 7 Feb 2020 at 01:55, Qian Cai <cai@lca.pw> wrote:
>
> > On Feb 6, 2020, at 7:27 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> >
> > On 2/6/20 4:18 PM, Qian Cai wrote:
> >>> On Feb 6, 2020, at 6:34 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> >>> On 2/6/20 7:23 AM, Qian Cai wrote:
> >>>>> On Feb 6, 2020, at 9:55 AM, Jan Kara <jack@suse.cz> wrote:
> >>>>> I don't think the problem is real. The question is how to make KCSAN happy
> >>>>> in a way that doesn't silence other possibly useful things it can find and
> >>>>> also which makes it most obvious to the reader what's going on... IMHO
> >>>>> using READ_ONCE() fulfills these targets nicely - it is free
> >>>>> performance-wise in this case, it silences the checker without impacting
> >>>>> other races on page->flags, it's kind of obvious we don't want the load torn
> >>>>> in this case so it makes sense to the reader (although a comment may be
> >>>>> nice).
> >>>>
> >>>> Actually, using the data_race() macro there fulfills the same purpose too, i.e., it silences the splat here but still keeps searching for other races.
> >>>>
> >>>
> >>> Yes, but both READ_ONCE() and data_race() would be saying untrue things about this code,
> >>> and that somewhat offends my sense of perfection... :)
> >>>
> >>> * READ_ONCE(): this field need not be restricted to being read only once, so the
> >>> name is immediately wrong. We're using side effects of READ_ONCE().
> >>>
> >>> * data_race(): there is no race on the N bits worth of page zone number data. There
> >>> is only a perceived race, due to tools that look at word-level granularity.
> >>>
> >>> I'd propose one or both of the following:
> >>>
> >>> a) Hope that Marco (I've fixed the typo in his name. --jh) has an idea to enhance KCSAN so as to support this model of
> >>>  access, and/or

From the other thread:

On Fri, 7 Feb 2020 at 00:18, John Hubbard <jhubbard@nvidia.com> wrote:
>
> Yes. I'm grasping at straws now, but...what about the idiom that page_zonenum()
> uses: a set of bits that are "always" (after a certain early point) read-only?
> What are your thoughts on that?

Without annotations it's hard to tell. The problem is that the
compiler can still emit a word-sized load, even if you're just
checking 1 bit. The instrumentation emitted for KCSAN only cares about
loads/stores, where access size is in number of bytes and not bits,
since that's what the compiler has to emit.  So, strictly speaking
these are data races: concurrent reads / writes where at least one
access is plain.

With the above caveat out of the way, we already have the following
defaults in KCSAN (after similar requests):
1. KCSAN ignores same-value stores, i.e. races with writes that appear
to write the same value do not result in data race reports.
2. KCSAN does not demand aligned writes (including things like 'var++'
if there are no concurrent writers) up to word size to be marked (with
READ_ONCE etc.), assuming there is no way for the compiler to screw
these up. [I still recommend writes to be marked though, if at all
possible, because I'm still not entirely convinced it's always safe!]

So, because of (2), KCSAN will not complain if you have something like
'flags |= SOME_FLAG' (where the read is marked). Because of (1), it'll
still complain about 'flags & SOME_FLAG' though, since the load is not
marked, and only sees this is a word-sized access (assuming flags is a
long) where a bit changed.
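
A short sketch of those two defaults (illustrative; SOME_FLAG and
do_something() are placeholders, not real kernel identifiers):

	flags |= SOME_FLAG;			/* plain RMW: write side tolerated by default (2) */
	if (flags & SOME_FLAG)			/* plain word-sized load: still reported */
		do_something();
	if (READ_ONCE(flags) & SOME_FLAG)	/* marked load: not reported */
		do_something();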

I don't think it's possible to easily convey to KCSAN which bits of an
access you only care about, so that we could extend (1).   Ideas?

> >> A similar thing was brought up before, i.e., anything compared to zero is immune to load-tearing
> >> issues, but it is rather difficult to implement it in the compiler, so it was settled to use data_race(),
> >>
> >> https://lore.kernel.org/lkml/CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com/#r
> >>
> >
> > Thanks for that link to the previous discussion, good context.
> >
> >>>
> >>> b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
> >>>  candidates:
> >>>
> >>>     READ_RO_BITS()
> >>>     READ_IMMUTABLE_BITS()
> >>>     ...etc...
> >>>

This could work, but 'READ_BITS()' is enough, if KCSAN's same-value
filter is on by default anyway.  Although my preference is also to avoid
more macros if possible.

> >> Actually, Linus might hate those kinds of complication rather than a simple data_race() macro,
> >>
> >> https://lore.kernel.org/linux-fsdevel/CAHk-=wg5CkOEF8DTez1Qu0XTEFw_oHhxN98bDnFqbY7HL5AB2g@mail.gmail.com/
> >>
> >
> > Another good link. However, my macros above haven't been proposed yet, and I'm perfectly
> > comfortable proposing something that Linus *might* (or might not!) hate. No point in
> > guessing about it, IMHO.
> >
> > If you want, I'll be happy to put on my flame suit and post a patchset proposing
> > READ_IMMUTABLE_BITS() (or a better-named thing, if someone has another name idea).  :)
> >
>
> BTW, the current comment says (note, it is called an “access”, which in this case reads the whole word
> rather than just those 3 bits, even though only those bits are of interest to us),
>
> /*
>  * data_race(): macro to document that accesses in an expression may conflict with
>  * other concurrent accesses resulting in data races, but the resulting
>  * behaviour is deemed safe regardless.
>  *
>  * This macro *does not* affect normal code generation, but is a hint to tooling
>  * that data races here should be ignored.
>  */
>
> The macro might have more to say.

I agree that data_race() would be the most suitable here, since
technically it's still a data race. It just so happens that we end up
"throwing away" most of the bits of the access, and just care about a
few bits.

Thanks,
-- Marco
John Hubbard Feb. 9, 2020, 1:44 a.m. UTC | #13
On 2/7/20 5:17 AM, Marco Elver wrote:
...
>> Yes. I'm grasping at straws now, but...what about the idiom that page_zonenum()
>> uses: a set of bits that are "always" (after a certain early point) read-only?
>> What are your thoughts on that?
> 
> Without annotations it's hard to tell. The problem is that the
> compiler can still emit a word-sized load, even if you're just
> checking 1 bit. The instrumentation emitted for KCSAN only cares about
> loads/stores, where access size is in number of bytes and not bits,
> since that's what the compiler has to emit.  So, strictly speaking
> these are data races: concurrent reads / writes where at least one
> access is plain.
> 
> With the above caveat out of the way, we already have the following
> defaults in KCSAN (after similar requests):
> 1. KCSAN ignores same-value stores, i.e. races with writes that appear
> to write the same value do not result in data race reports.
> 2. KCSAN does not demand aligned writes (including things like 'var++'
> if there are no concurrent writers) up to word size to be marked (with
> READ_ONCE etc.), assuming there is no way for the compiler to screw
> these up. [I still recommend writes to be marked though, if at all
> possible, because I'm still not entirely convinced it's always safe!]
> 
> So, because of (2), KCSAN will not complain if you have something like
> 'flags |= SOME_FLAG' (where the read is marked). Because of (1), it'll
> still complain about 'flags & SOME_FLAG' though, since the load is not
> marked, and only sees this is a word-sized access (assuming flags is a
> long) where a bit changed.
> 
> I don't think it's possible to easily convey to KCSAN which bits of an
> access you only care about, so that we could extend (1).   Ideas?


I'm drawing a blank as far as easy ways forward go, so I'm just going to accept
the compiler word-level constraints as a "given". I was hoping maybe you
had some magic available, just checking. :)


> 
>>>> A similar thing was brought up before, i.e., anything compared to zero is immune to load-tearing
>>>> issues, but it is rather difficult to implement it in the compiler, so it was settled to use data_race(),
>>>>
>>>> https://lore.kernel.org/lkml/CANpmjNN8J1oWtLPHTgCwbbtTuU_Js-8HD=cozW5cYkm8h-GTBg@mail.gmail.com/#r
>>>>
>>>
>>> Thanks for that link to the previous discussion, good context.
>>>
>>>>>
>>>>> b) Add a new, better-named macro to indicate what's going on. Initial bikeshed-able
>>>>>  candidates:
>>>>>
>>>>>     READ_RO_BITS()
>>>>>     READ_IMMUTABLE_BITS()
>>>>>     ...etc...
>>>>>
> 
> This could work, but 'READ_BITS()' is enough, if KCSAN's same-value
> filter is on by default anyway.  Although my preference is also to avoid
> more macros if possible.


So it looks like we're probably stuck with having to annotate the code. Given
that, there is a balance between how many macros, and how much commenting. For
example, if there is a single macro (data_race, for example), then we'll need to
add comments for the various cases, explaining which data_race situation is 
happening.

That's still true, but to a lesser extent if more macros are added. In this case,
I suspect that READ_BITS() makes the commenting easier and shorter. So I'd tentatively
lean towards adding it, but what do others on the list think?



thanks,
Qian Cai Feb. 9, 2020, 3:10 a.m. UTC | #14
> On Feb 8, 2020, at 8:44 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> 
> So it looks like we're probably stuck with having to annotate the code. Given
> that, there is a balance between how many macros, and how much commenting. For
> example, if there is a single macro (data_race, for example), then we'll need to
> add comments for the various cases, explaining which data_race situation is 
> happening.

On the other hand, it is perfectly fine not to comment on each data_race(); most of the time, people can run git blame to learn more details. Actually, no maintainers from the various subsystems have asked for comments so far.

> 
> That's still true, but to a lesser extent if more macros are added. In this case,
> I suspect that READ_BITS() makes the commenting easier and shorter. So I'd tentatively
> lean towards adding it, but what do others on the list think?

Even reading bits could be dangerous from a data-race point of view and confusing at best, so I am not really sure what the value of introducing this new macro is. People who want to understand it correctly still need to read the commit logs.

This flags->zonenum is such a special case that I have not really seen it regularly over the last few weeks of digging through KCSAN reports, so even if it is worth adding READ_BITS(), more equally important macros would need to be added together for it to be useful initially. For example, HARMLESS_COUNTERS(), READ_SINGLE_BIT(), READ_IMMUTABLE_BITS() etc., which is exactly what Linus said he wanted to avoid.
John Hubbard Feb. 9, 2020, 7:12 a.m. UTC | #15
On 2/8/20 7:10 PM, Qian Cai wrote:
> 
> 
>> On Feb 8, 2020, at 8:44 PM, John Hubbard <jhubbard@nvidia.com> wrote:
>>
>> So it looks like we're probably stuck with having to annotate the code. Given
>> that, there is a balance between how many macros, and how much commenting. For
>> example, if there is a single macro (data_race, for example), then we'll need to
>> add comments for the various cases, explaining which data_race situation is
>> happening.
> 
> On the other hand, it is perfectly fine not to comment on each data_race(); most of the time, people can run git blame to learn more details. Actually, no maintainers from the various subsystems have asked for comments so far.
> 

Well, maybe I'm looking at this wrong. I was thinking that one should attempt to
understand the code on the screen, and that's generally best--but here, maybe
"data_race" is just something that means "tool cruft", really. So mentally we
would move toward visually filtering out the data_race "key word".

I really don't like it but at least there is a significant benefit from the tool
that probably makes it worth the visual noise.

Blue sky thoughts for The Far Future: It would be nice if the tools got a lot
better--maybe in the direction of C language extensions, even if only used in
this project at first.

thanks,
Marco Elver Feb. 10, 2020, 7:48 a.m. UTC | #16
On Sun, 9 Feb 2020 at 08:15, John Hubbard <jhubbard@nvidia.com> wrote:
>
> On 2/8/20 7:10 PM, Qian Cai wrote:
> >
> >
> >> On Feb 8, 2020, at 8:44 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> >>
> >> So it looks like we're probably stuck with having to annotate the code. Given
> >> that, there is a balance between how many macros, and how much commenting. For
> >> example, if there is a single macro (data_race, for example), then we'll need to
> >> add comments for the various cases, explaining which data_race situation is
> >> happening.
> >
> > On the other hand, it is perfectly fine not to comment on each data_race(); most of the time, people can run git blame to learn more details. Actually, no maintainers from the various subsystems have asked for comments so far.
> >
>
> Well, maybe I'm looking at this wrong. I was thinking that one should attempt to
> understand the code on the screen, and that's generally best--but here, maybe
> "data_race" is just something that means "tool cruft", really. So mentally we
> would move toward visually filtering out the data_race "key word".

One thing to note is that 'data_race()' points out concurrency, and
that somebody has deemed that the code won't break even with data
races. Somebody trying to understand or modify the code should ensure
this will still be the case. So, 'data_race()' isn't just tool cruft.
It's documentation for something that really isn't obvious from the
code alone.

Whenever we see a READ_ONCE or other marked access it is obvious to
the reader that there are concurrent accesses happening.  I'd argue
that for intentional data races, we should convey similar information,
to avoid breaking the code (of course KCSAN would tell you, but only
after the change was done). Even more so, since changes to code
involving 'data_race()' will need re-verification that the data races
are still safe.

> I really don't like it but at least there is a significant benefit from the tool
> that probably makes it worth the visual noise.
>
> Blue sky thoughts for The Far Future: It would be nice if the tools got a lot
> better--maybe in the direction of C language extensions, even if only used in
> this project at first.

Still thinking about this.  What we want to convey is that, while
there are races on the particular variable, nobody should be modifying
the bits here. Adding a READ_ONCE (or data_race()) would miss a
harmful race where somebody modifies these bits, so in principle I
agree. However, I think the tool can't automatically tell (even if we
had compiler extensions to give us the bits accessed) which bits we
care about, because we might have something like:

int foo_bar = READ_ONCE(flags) >> FOO_BAR_SHIFT;  // need the READ_ONCE because of FOO bits
.. (foo_bar & FOO_MASK) ..  // FOO bits can be modified concurrently
.. (foo_bar & BAR_MASK) ..  // nobody should modify BAR bits concurrently though!

What we want is to assert that nobody touches a particular set of
bits. KCSAN has recently gotten ASSERT_EXCLUSIVE_{WRITER,ACCESS}
macros which help assert properties of concurrent code, where bugs
won't manifest as data races. Along those lines, I can see the value
in doing an exclusivity check on a bitmask of a variable.
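
For context, the existing assertions read like this in code (a sketch; the
lock and variables are placeholders):

	spin_lock(&mylock);
	/* assert that no other context writes 'shared' while the lock is held */
	ASSERT_EXCLUSIVE_WRITER(shared);
	shared = recompute(shared);
	spin_unlock(&mylock);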

I don't know how much a READ_BITS macro could help, since it's
probably less ergonomic to have to say something like:
  READ_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT) >> ZONES_PGSHIFT.

Here is an alternative:

Let's say KCSAN gives you this:
   /*
    * ... Assert that the bits set in mask are not written concurrently;
    * they may still be read concurrently.
    * The access that immediately follows is assumed to access those bits
    * and be safe w.r.t. data races.
    *
    * For example, this may be used when certain bits of @flags may only
    * be modified when holding the appropriate lock, but other bits may
    * still be modified locklessly.
    * ...
    */
   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....

Then we can write page_zonenum as follows:

static inline enum zone_type page_zonenum(const struct page *page)
 {
+       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
 }

This will accomplish the following:
1. The current code is not touched, and we do not have to verify that
the change is correct without KCSAN.
2. We're not introducing a bunch of special macros to read bits in various ways.
3. KCSAN will assume that the access is safe, and no data race report
is generated.
4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
about the race.
5. We're documenting the code.
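
To make point 4 concrete, a hypothetical buggy writer that the assertion
would catch (sketch only):

	/* bug: flips a ZONES bit without the heavy locking zone changes require */
	page->flags |= 1UL << ZONES_PGSHIFT;

With the assertion in page_zonenum(), KCSAN would report this write as racing
with the asserted-exclusive bits.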

Anything I missed?

Thanks,
-- Marco





> thanks,
> --
> John Hubbard
> NVIDIA
>
> >>
> >> That's still true, but to a lesser extent if more macros are added. In this case,
> >> I suspect that READ_BITS() makes the commenting easier and shorter. So I'd tentatively
> >> lean towards adding it, but what do others on the list think?
> >
> > Even reading bits could be dangerous from a data-race point of view and confusing at best, so I am not really sure what the value of introducing this new macro is. People who want to understand it correctly still need to read the commit logs.
> >
> > This flags->zonenum is such a special case that I have not really seen it regularly over the last few weeks of digging through KCSAN reports, so even if it is worth adding READ_BITS(), more equally important macros would need to be added together for it to be useful initially. For example, HARMLESS_COUNTERS(), READ_SINGLE_BIT(), READ_IMMUTABLE_BITS() etc., which is exactly what Linus said he wanted to avoid.
> >
Qian Cai Feb. 10, 2020, 12:16 p.m. UTC | #17
> On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> 
> Here is an alternative:
> 
> Let's say KCSAN gives you this:
>   /* ... Assert that the bits set in mask are not written
> concurrently; they may still be read concurrently.
>     The access that immediately follows is assumed to access those
> bits and safe w.r.t. data races.
> 
>     For example, this may be used when certain bits of @flags may
> only be modified when holding the appropriate lock,
>     but other bits may still be modified locklessly.
>   ...
>  */
>   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> 
> Then we can write page_zonenum as follows:
> 
> static inline enum zone_type page_zonenum(const struct page *page)
> {
> +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
>        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> }
> 
> This will accomplish the following:
> 1. The current code is not touched, and we do not have to verify that
> the change is correct without KCSAN.
> 2. We're not introducing a bunch of special macros to read bits in various ways.
> 3. KCSAN will assume that the access is safe, and no data race report
> is generated.
> 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> about the race.
> 5. We're documenting the code.
> 
> Anything I missed?

I don’t know. Having to write the same line twice does not feel any better to me than data_race() with commenting occasionally.
Marco Elver Feb. 10, 2020, 12:58 p.m. UTC | #18
On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
>
>
>
> > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> >
> > Here is an alternative:
> >
> > Let's say KCSAN gives you this:
> >   /* ... Assert that the bits set in mask are not written
> > concurrently; they may still be read concurrently.
> >     The access that immediately follows is assumed to access those
> > bits and safe w.r.t. data races.
> >
> >     For example, this may be used when certain bits of @flags may
> > only be modified when holding the appropriate lock,
> >     but other bits may still be modified locklessly.
> >   ...
> >  */
> >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> >
> > Then we can write page_zonenum as follows:
> >
> > static inline enum zone_type page_zonenum(const struct page *page)
> > {
> > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > }
> >
> > This will accomplish the following:
> > 1. The current code is not touched, and we do not have to verify that
> > the change is correct without KCSAN.
> > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > 3. KCSAN will assume that the access is safe, and no data race report
> > is generated.
> > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > about the race.
> > 5. We're documenting the code.
> >
> > Anything I missed?
>
> I don’t know. Having to write the same line twice does not feel any better to me than data_race() with commenting occasionally.

Point 4 above: While data_race() will cause KCSAN to not report
the data race, now you might be missing a real bug: if somebody
concurrently modifies the bits accessed, you want to know about it!
Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
remember that if you decide to silence it with data_race(), you need
to be sure there are no concurrent writers to those bits.
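
Side by side, the two options for page_zonenum() would look like this (sketch):

	/* data_race(): KCSAN stops checking this access entirely */
	return (data_race(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;

	/* assertion: KCSAN keeps checking the access, and additionally
	   reports any concurrent writer to the ZONES bits */
	ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;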

There is no way to automatically infer all over the kernel which bits
we care about, and the most reliable approach is to be explicit about it. I
don't see a problem with it per se.

Thanks,
-- Marco
Qian Cai Feb. 10, 2020, 1:36 p.m. UTC | #19
On Mon, 2020-02-10 at 13:58 +0100, Marco Elver wrote:
> On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
> > 
> > 
> > 
> > > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> > > 
> > > Here is an alternative:
> > > 
> > > Let's say KCSAN gives you this:
> > >   /* ... Assert that the bits set in mask are not written
> > > concurrently; they may still be read concurrently.
> > >     The access that immediately follows is assumed to access those
> > > bits and safe w.r.t. data races.
> > > 
> > >     For example, this may be used when certain bits of @flags may
> > > only be modified when holding the appropriate lock,
> > >     but other bits may still be modified locklessly.
> > >   ...
> > >  */
> > >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> > > 
> > > Then we can write page_zonenum as follows:
> > > 
> > > static inline enum zone_type page_zonenum(const struct page *page)
> > > {
> > > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> > >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > > }
> > > 
> > > This will accomplish the following:
> > > 1. The current code is not touched, and we do not have to verify that
> > > the change is correct without KCSAN.
> > > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > > 3. KCSAN will assume that the access is safe, and no data race report
> > > is generated.
> > > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > > about the race.
> > > 5. We're documenting the code.
> > > 
> > > Anything I missed?
> > 
> > I don’t know. Having to write the same line twice does not feel any better to me than data_race() with commenting occasionally.
> 
> Point 4 above: While data_race() will cause KCSAN to not report
> the data race, now you might be missing a real bug: if somebody
> concurrently modifies the bits accessed, you want to know about it!
> Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
> remember that if you decide to silence it with data_race(), you need
> to be sure there are no concurrent writers to those bits.

Right, in this case there are no concurrent writers to those bits, so adding a
comment should be sufficient. However, I'll keep ASSERT_EXCLUSIVE_BITS() in mind
for other places.

> 
> There is no way to automatically infer all over the kernel which bits
> we care about, and the most reliable approach is to be explicit about it. I
> don't see a problem with it per se.
> 
> Thanks,
> -- Marco
Marco Elver Feb. 10, 2020, 1:38 p.m. UTC | #20
On Mon, 10 Feb 2020 at 14:36, Qian Cai <cai@lca.pw> wrote:
>
> On Mon, 2020-02-10 at 13:58 +0100, Marco Elver wrote:
> > On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
> > >
> > >
> > >
> > > > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> > > >
> > > > Here is an alternative:
> > > >
> > > > Let's say KCSAN gives you this:
> > > >   /* ... Assert that the bits set in mask are not written
> > > > concurrently; they may still be read concurrently.
> > > >     The access that immediately follows is assumed to access those
> > > > bits and safe w.r.t. data races.
> > > >
> > > >     For example, this may be used when certain bits of @flags may
> > > > only be modified when holding the appropriate lock,
> > > >     but other bits may still be modified locklessly.
> > > >   ...
> > > >  */
> > > >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> > > >
> > > > Then we can write page_zonenum as follows:
> > > >
> > > > static inline enum zone_type page_zonenum(const struct page *page)
> > > > {
> > > > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> > > >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > > > }
> > > >
> > > > This will accomplish the following:
> > > > 1. The current code is not touched, and we do not have to verify that
> > > > the change is correct without KCSAN.
> > > > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > > > 3. KCSAN will assume that the access is safe, and no data race report
> > > > is generated.
> > > > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > > > about the race.
> > > > 5. We're documenting the code.
> > > >
> > > > Anything I missed?
> > >
> > > I don’t know. Having to write the same line twice does not feel any better to me than data_race() with commenting occasionally.
> >
> > Point 4 above: While data_race() will cause KCSAN to not report
> > the data race, now you might be missing a real bug: if somebody
> > concurrently modifies the bits accessed, you want to know about it!
> > Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
> > remember that if you decide to silence it with data_race(), you need
> > to be sure there are no concurrent writers to those bits.
>
> Right, in this case there are no concurrent writers to those bits, so adding a
> comment should be sufficient. However, I'll keep ASSERT_EXCLUSIVE_BITS() in mind
> for other places.

Right now there are no concurrent writers to those bits. But somebody
might introduce a bug that will write them, even though they shouldn't
have. With ASSERT_EXCLUSIVE_BITS() you can catch that. Once I have the
patches for this out, I would consider adding it here for this reason.

> >
> > There is no way to automatically infer all over the kernel which bits
> > we care about, and the most reliable is to be explicit about it. I
> > don't see a problem with it per se.
> >
> > Thanks,
> > -- Marco
Qian Cai Feb. 10, 2020, 1:55 p.m. UTC | #21
On Mon, 2020-02-10 at 14:38 +0100, Marco Elver wrote:
> On Mon, 10 Feb 2020 at 14:36, Qian Cai <cai@lca.pw> wrote:
> > 
> > On Mon, 2020-02-10 at 13:58 +0100, Marco Elver wrote:
> > > On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> > > > > 
> > > > > Here is an alternative:
> > > > > 
> > > > > Let's say KCSAN gives you this:
> > > > >   /* ... Assert that the bits set in mask are not written
> > > > > concurrently; they may still be read concurrently.
> > > > >     The access that immediately follows is assumed to access those
> > > > > bits and to be safe w.r.t. data races.
> > > > > 
> > > > >     For example, this may be used when certain bits of @flags may
> > > > > only be modified when holding the appropriate lock,
> > > > >     but other bits may still be modified locklessly.
> > > > >   ...
> > > > >  */
> > > > >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> > > > > 
> > > > > Then we can write page_zonenum as follows:
> > > > > 
> > > > > static inline enum zone_type page_zonenum(const struct page *page)
> > > > > {
> > > > > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> > > > >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > > > > }
> > > > > 
> > > > > This will accomplish the following:
> > > > > 1. The current code is not touched, and we do not have to verify that
> > > > > the change is correct without KCSAN.
> > > > > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > > > > 3. KCSAN will assume that the access is safe, and no data race report
> > > > > is generated.
> > > > > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > > > > about the race.
> > > > > 5. We're documenting the code.
> > > > > 
> > > > > Anything I missed?
> > > > 
> > > > I don’t know. Having to write the same line twice does not feel any better to me than data_race() with an occasional comment.
> > > 
> > > Point 4 above: While data_race() will cause KCSAN to not report
> > > the data race, now you might be missing a real bug: if somebody
> > > concurrently modifies the bits accessed, you want to know about it!
> > > Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
> > > remember that if you decide to silence it with data_race(), you need
> > > to be sure there are no concurrent writers to those bits.
> > 
> > Right, in this case, there are no concurrent writers to those bits, so adding a
> > comment should be sufficient. However, I'll keep ASSERT_EXCLUSIVE_BITS() in mind
> > for other places.
> 
> Right now there are no concurrent writers to those bits. But somebody
> might introduce a bug that will write them, even though they shouldn't
> have. With ASSERT_EXCLUSIVE_BITS() you can catch that. Once I have the
> patches for this out, I would consider adding it here for this reason.

Surely, we could add many of those to catch theoretical issues. I can think of
more, like ASSERT_HARMLESS_COUNTERS(), out of the worry that one day someone
might change the code to use counters for important MM heuristic decisions
rather than just for printing out information. Then, we might end up in that
too-many-macros situation again. The list goes on: ASSERT_COMPARE_ZERO_NOLOOP(),
ASSERT_SINGLE_BIT(), etc.

On the other hand, maybe take a more pragmatic approach: if there is strong
evidence that developers could easily make mistakes in a certain place, then we
could add a new macro, so the next time Joe Developer wants to add a new macro,
he/she has to provide the same strong justification?

> 
> > > 
> > > There is no way to automatically infer all over the kernel which bits
> > > we care about, and the most reliable is to be explicit about it. I
> > > don't see a problem with it per se.
> > > 
> > > Thanks,
> > > -- Marco
Marco Elver Feb. 10, 2020, 2:12 p.m. UTC | #22
On Mon, 10 Feb 2020 at 14:55, Qian Cai <cai@lca.pw> wrote:
>
> On Mon, 2020-02-10 at 14:38 +0100, Marco Elver wrote:
> > On Mon, 10 Feb 2020 at 14:36, Qian Cai <cai@lca.pw> wrote:
> > >
> > > On Mon, 2020-02-10 at 13:58 +0100, Marco Elver wrote:
> > > > On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> > > > > >
> > > > > > Here is an alternative:
> > > > > >
> > > > > > Let's say KCSAN gives you this:
> > > > > >   /* ... Assert that the bits set in mask are not written
> > > > > > concurrently; they may still be read concurrently.
> > > > > >     The access that immediately follows is assumed to access those
> > > > > > bits and to be safe w.r.t. data races.
> > > > > >
> > > > > >     For example, this may be used when certain bits of @flags may
> > > > > > only be modified when holding the appropriate lock,
> > > > > >     but other bits may still be modified locklessly.
> > > > > >   ...
> > > > > >  */
> > > > > >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> > > > > >
> > > > > > Then we can write page_zonenum as follows:
> > > > > >
> > > > > > static inline enum zone_type page_zonenum(const struct page *page)
> > > > > > {
> > > > > > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> > > > > >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > > > > > }
> > > > > >
> > > > > > This will accomplish the following:
> > > > > > 1. The current code is not touched, and we do not have to verify that
> > > > > > the change is correct without KCSAN.
> > > > > > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > > > > > 3. KCSAN will assume that the access is safe, and no data race report
> > > > > > is generated.
> > > > > > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > > > > > about the race.
> > > > > > 5. We're documenting the code.
> > > > > >
> > > > > > Anything I missed?
> > > > >
> > > > > I don’t know. Having to write the same line twice does not feel any better to me than data_race() with an occasional comment.
> > > >
> > > > Point 4 above: While data_race() will cause KCSAN to not report
> > > > the data race, now you might be missing a real bug: if somebody
> > > > concurrently modifies the bits accessed, you want to know about it!
> > > > Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
> > > > remember that if you decide to silence it with data_race(), you need
> > > > to be sure there are no concurrent writers to those bits.
> > >
> > > Right, in this case, there are no concurrent writers to those bits, so adding a
> > > comment should be sufficient. However, I'll keep ASSERT_EXCLUSIVE_BITS() in mind
> > > for other places.
> >
> > Right now there are no concurrent writers to those bits. But somebody
> > might introduce a bug that will write them, even though they shouldn't
> > have. With ASSERT_EXCLUSIVE_BITS() you can catch that. Once I have the
> > patches for this out, I would consider adding it here for this reason.
>
> Surely, we could add many of those to catch theoretical issues. I can think of
> more, like ASSERT_HARMLESS_COUNTERS(), out of the worry that one day someone
> might change the code to use counters for important MM heuristic decisions
> rather than just for printing out information. Then, we might end up in that
> too-many-macros situation again. The list goes on: ASSERT_COMPARE_ZERO_NOLOOP(),
> ASSERT_SINGLE_BIT(), etc.

I'm sorry, but the above don't assert any quantifiable properties in the code.

What we want is to be able to catch bugs that violate the *current*
properties of the code *today*. A very real property of the code
*today* is that nobody should modify zonenum without taking a lock. If
you mark the access here, there is no tool that can help you. I'm
trying to change that.

The fact that we have bits that can be modified locklessly and some
that can't is an inconvenience, but can be solved.
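
To illustrate with page->flags (layout simplified, and just a sketch using
existing mm helpers):

/*
 * page->flags packs two kinds of bits (simplified):
 *
 *   | SECTION | NODE | ZONE | ... | PG_locked, PG_dirty, ... |
 *
 * The PG_* flag bits legitimately change locklessly via atomic bitops:
 */
SetPageDirty(page);			/* atomic RMW; races are expected */

/*
 * The zone/node/section fields are written non-atomically, but only once,
 * at allocation time (e.g. via set_page_links()); nobody may modify them
 * concurrently after that:
 */
set_page_zone(page, ZONE_NORMAL);	/* init-time only */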

Makes sense?

Thanks,
-- Marco

> On the other hand, maybe take a more pragmatic approach: if there is strong
> evidence that developers could easily make mistakes in a certain place, then we
> could add a new macro, so the next time Joe Developer wants to add a new macro,
> he/she has to provide the same strong justification?
>
> >
> > > >
> > > > There is no way to automatically infer all over the kernel which bits
> > > > we care about, and the most reliable is to be explicit about it. I
> > > > don't see a problem with it per se.
> > > >
> > > > Thanks,
> > > > -- Marco
Qian Cai Feb. 10, 2020, 2:31 p.m. UTC | #23
On Mon, 2020-02-10 at 15:12 +0100, Marco Elver wrote:
> On Mon, 10 Feb 2020 at 14:55, Qian Cai <cai@lca.pw> wrote:
> > 
> > On Mon, 2020-02-10 at 14:38 +0100, Marco Elver wrote:
> > > On Mon, 10 Feb 2020 at 14:36, Qian Cai <cai@lca.pw> wrote:
> > > > 
> > > > On Mon, 2020-02-10 at 13:58 +0100, Marco Elver wrote:
> > > > > On Mon, 10 Feb 2020 at 13:16, Qian Cai <cai@lca.pw> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > On Feb 10, 2020, at 2:48 AM, Marco Elver <elver@google.com> wrote:
> > > > > > > 
> > > > > > > Here is an alternative:
> > > > > > > 
> > > > > > > Let's say KCSAN gives you this:
> > > > > > >   /* ... Assert that the bits set in mask are not written
> > > > > > > concurrently; they may still be read concurrently.
> > > > > > >     The access that immediately follows is assumed to access those
> > > > > > > bits and to be safe w.r.t. data races.
> > > > > > > 
> > > > > > >     For example, this may be used when certain bits of @flags may
> > > > > > > only be modified when holding the appropriate lock,
> > > > > > >     but other bits may still be modified locklessly.
> > > > > > >   ...
> > > > > > >  */
> > > > > > >   #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> > > > > > > 
> > > > > > > Then we can write page_zonenum as follows:
> > > > > > > 
> > > > > > > static inline enum zone_type page_zonenum(const struct page *page)
> > > > > > > {
> > > > > > > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> > > > > > >        return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> > > > > > > }
> > > > > > > 
> > > > > > > This will accomplish the following:
> > > > > > > 1. The current code is not touched, and we do not have to verify that
> > > > > > > the change is correct without KCSAN.
> > > > > > > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > > > > > > 3. KCSAN will assume that the access is safe, and no data race report
> > > > > > > is generated.
> > > > > > > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > > > > > > about the race.
> > > > > > > 5. We're documenting the code.
> > > > > > > 
> > > > > > > Anything I missed?
> > > > > > 
> > > > > > I don’t know. Having to write the same line twice does not feel any better to me than data_race() with an occasional comment.
> > > > > 
> > > > > Point 4 above: While data_race() will cause KCSAN to not report
> > > > > the data race, now you might be missing a real bug: if somebody
> > > > > concurrently modifies the bits accessed, you want to know about it!
> > > > > Either way, it's up to you to add the ASSERT_EXCLUSIVE_BITS, but just
> > > > > remember that if you decide to silence it with data_race(), you need
> > > > > to be sure there are no concurrent writers to those bits.
> > > > 
> > > > Right, in this case, there are no concurrent writers to those bits, so adding a
> > > > comment should be sufficient. However, I'll keep ASSERT_EXCLUSIVE_BITS() in mind
> > > > for other places.
> > > 
> > > Right now there are no concurrent writers to those bits. But somebody
> > > might introduce a bug that will write them, even though they shouldn't
> > > have. With ASSERT_EXCLUSIVE_BITS() you can catch that. Once I have the
> > > patches for this out, I would consider adding it here for this reason.
> > 
> > Surely, we could add many of those to catch theoretical issues. I can think of
> > more, like ASSERT_HARMLESS_COUNTERS(), out of the worry that one day someone
> > might change the code to use counters for important MM heuristic decisions
> > rather than just for printing out information. Then, we might end up in that
> > too-many-macros situation again. The list goes on: ASSERT_COMPARE_ZERO_NOLOOP(),
> > ASSERT_SINGLE_BIT(), etc.
> 
> I'm sorry, but the above don't assert any quantifiable properties in the code.
> 
> What we want is to be able to catch bugs that violate the *current*
> properties of the code *today*. A very real property of the code
> *today* is that nobody should modify zonenum without taking a lock. If
> you mark the access here, there is no tool that can help you. I'm
> trying to change that.
> 
> The fact that we have bits that can be modified locklessly and some
> that can't is an inconvenience, but can be solved.
> 
> Makes sense?

OK, go ahead and add it if you really feel like it. I'd hope this is not the
Pandora's box where people will eventually find more ways to assert quantifiable
properties in the code only to address theoretical issues...


> 
> Thanks,
> -- Marco
> 
> > On the other hand, maybe take a more pragmatic approach: if there is strong
> > evidence that developers could easily make mistakes in a certain place, then we
> > could add a new macro, so the next time Joe Developer wants to add a new macro,
> > he/she has to provide the same strong justification?
> > 
> > > 
> > > > > 
> > > > > There is no way to automatically infer all over the kernel which bits
> > > > > we care about, and the most reliable is to be explicit about it. I
> > > > > don't see a problem with it per se.
> > > > > 
> > > > > Thanks,
> > > > > -- Marco
Qian Cai Feb. 10, 2020, 4:23 p.m. UTC | #24
On Mon, 2020-02-10 at 08:48 +0100, Marco Elver wrote:
> On Sun, 9 Feb 2020 at 08:15, John Hubbard <jhubbard@nvidia.com> wrote:
> > 
> > On 2/8/20 7:10 PM, Qian Cai wrote:
> > > 
> > > 
> > > > On Feb 8, 2020, at 8:44 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> > > > 
> > > > So it looks like we're probably stuck with having to annotate the code. Given
> > > > that, there is a balance between how many macros, and how much commenting. For
> > > > example, if there is a single macro (data_race, for example), then we'll need to
> > > > add comments for the various cases, explaining which data_race situation is
> > > > happening.
> > > 
> > > On the other hand, it is perfectly fine not to comment on each data_race(); most of the time, people can run git blame to learn more details. Actually, no maintainers from the various subsystems have asked for comments so far.
> > > 
> > 
> > Well, maybe I'm looking at this wrong. I was thinking that one should attempt to
> > understand the code on the screen, and that's generally best--but here, maybe
> > "data_race" is just something that means "tool cruft", really. So mentally we
> > would move toward visually filtering out the data_race "key word".
> 
> One thing to note is that 'data_race()' points out concurrency, and
> that somebody has deemed that the code won't break even with data
> races. Somebody trying to understand or modify the code should ensure
> this will still be the case. So, 'data_race()' isn't just tool cruft.
> It's documentation for something that really isn't obvious from the
> code alone.
> 
> Whenever we see a READ_ONCE or other marked access it is obvious to
> the reader that there are concurrent accesses happening.  I'd argue
> that for intentional data races, we should convey similar information,
> to avoid breaking the code (of course KCSAN would tell you, but only
> after the change was done). Even more so, since changes to code
> involving 'data_race()' will need re-verification that the data races
> are still safe.
> 
> > I really don't like it but at least there is a significant benefit from the tool
> > that probably makes it worth the visual noise.
> > 
> > Blue sky thoughts for The Far Future: It would be nice if the tools got a lot
> > better--maybe in the direction of C language extensions, even if only used in
> > this project at first.
> 
> Still thinking about this.  What we want to convey is that, while
> there are races on the particular variable, nobody should be modifying
> the bits here. Adding a READ_ONCE (or data_race()) would miss a
> harmful race where somebody modifies these bits, so in principle I
> agree. However, I think the tool can't automatically tell (even if we
> had compiler extensions to give us the bits accessed) which bits we
> care about, because we might have something like:
> 
> int foo_bar = READ_ONCE(flags) >> FOO_BAR_SHIFT; // need READ_ONCE because of FOO bits
> .. (foo_bar & FOO_MASK) ..  // FOO bits can be modified concurrently
> .. (foo_bar & BAR_MASK) ..  // nobody should modify BAR bits concurrently though!
> 
> What we want is to assert that nobody touches a particular set of
> bits. KCSAN has recently gotten ASSERT_EXCLUSIVE_{WRITER,ACCESS}
> macros which help assert properties of concurrent code, where bugs
> won't manifest as data races. Along those lines, I can see the value
> in doing an exclusivity check on a bitmask of a variable.
> 
> I don't know how much a READ_BITS macro could help, since it's
> probably less ergonomic to have to say something like:
>   READ_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT) >> ZONES_PGSHIFT.
> 
> Here is an alternative:
> 
> Let's say KCSAN gives you this:
>    /* ... Assert that the bits set in mask are not written
> concurrently; they may still be read concurrently.
>      The access that immediately follows is assumed to access those
> bits and to be safe w.r.t. data races.
> 
>      For example, this may be used when certain bits of @flags may
> only be modified when holding the appropriate lock,
>      but other bits may still be modified locklessly.
>    ...
>   */
>    #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> 
> Then we can write page_zonenum as follows:
> 
> static inline enum zone_type page_zonenum(const struct page *page)
>  {
> +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
>         return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
>  }

Actually, it seems we still need to write this, if I understand correctly:

ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
return data_race((page->flags >> ZONES_PGSHIFT) & ZONES_MASK);

On the other hand, if you really worry that this thing could go wrong, it might
be better off using READ_ONCE() in the first place, which would be more
future-proof, with the trade-off that it might generate less efficient code in
some places?
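
I.e., something like this (which is what the patch at the top of this
thread does):

static inline enum zone_type page_zonenum(const struct page *page)
{
	/* READ_ONCE() rules out load tearing on the racy read */
	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;
}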

Alternatively, is there a way to write it like this?

return ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);

Kind of ugly, but probably cleaner.

> 
> This will accomplish the following:
> 1. The current code is not touched, and we do not have to verify that
> the change is correct without KCSAN.
> 2. We're not introducing a bunch of special macros to read bits in various ways.
> 3. KCSAN will assume that the access is safe, and no data race report
> is generated.
> 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> about the race.
> 5. We're documenting the code.
> 
> Anything I missed?
> 
> Thanks,
> -- Marco
> 
> 
> 
> 
> 
> > thanks,
> > --
> > John Hubbard
> > NVIDIA
> > 
> > > > 
> > > > That's still true, but to a lesser extent if more macros are added. In this case,
> > > > I suspect that READ_BITS() makes the commenting easier and shorter. So I'd tentatively
> > > > lean towards adding it, but what do others on the list think?
> > > 
> > > Even reading bits could be dangerous with data races and confusing at best, so I am not really sure what the value of introducing this new macro would be. People who want to understand it correctly still need to read the commit logs.
> > > 
> > > This flags->zonenum is such a special case that I have not really seen it regularly over the last few weeks of digging through KCSAN reports, so even if it is worth adding READ_BITS(), there are more equally important macros that would need to be added together for it to be useful initially. For example, HARMLESS_COUNTERS(), READ_SINGLE_BIT(), READ_IMMUTATABLE_BITS(), etc., which is exactly what Linus said he wanted to avoid.
> > >
Marco Elver Feb. 10, 2020, 4:33 p.m. UTC | #25
On Mon, 10 Feb 2020 at 17:23, Qian Cai <cai@lca.pw> wrote:
>
> On Mon, 2020-02-10 at 08:48 +0100, Marco Elver wrote:
> > On Sun, 9 Feb 2020 at 08:15, John Hubbard <jhubbard@nvidia.com> wrote:
> > >
> > > On 2/8/20 7:10 PM, Qian Cai wrote:
> > > >
> > > >
> > > > > On Feb 8, 2020, at 8:44 PM, John Hubbard <jhubbard@nvidia.com> wrote:
> > > > >
> > > > > So it looks like we're probably stuck with having to annotate the code. Given
> > > > > that, there is a balance between how many macros, and how much commenting. For
> > > > > example, if there is a single macro (data_race, for example), then we'll need to
> > > > > add comments for the various cases, explaining which data_race situation is
> > > > > happening.
> > > >
> > > > On the other hand, it is perfectly fine not to comment on each data_race(); most of the time, people can run git blame to learn more details. Actually, no maintainers from the various subsystems have asked for comments so far.
> > > >
> > >
> > > Well, maybe I'm looking at this wrong. I was thinking that one should attempt to
> > > understand the code on the screen, and that's generally best--but here, maybe
> > > "data_race" is just something that means "tool cruft", really. So mentally we
> > > would move toward visually filtering out the data_race "key word".
> >
> > One thing to note is that 'data_race()' points out concurrency, and
> > that somebody has deemed that the code won't break even with data
> > races. Somebody trying to understand or modify the code should ensure
> > this will still be the case. So, 'data_race()' isn't just tool cruft.
> > It's documentation for something that really isn't obvious from the
> > code alone.
> >
> > Whenever we see a READ_ONCE or other marked access it is obvious to
> > the reader that there are concurrent accesses happening.  I'd argue
> > that for intentional data races, we should convey similar information,
> > to avoid breaking the code (of course KCSAN would tell you, but only
> > after the change was done). Even more so, since changes to code
> > involving 'data_race()' will need re-verification that the data races
> > are still safe.
> >
> > > I really don't like it but at least there is a significant benefit from the tool
> > > that probably makes it worth the visual noise.
> > >
> > > Blue sky thoughts for The Far Future: It would be nice if the tools got a lot
> > > better--maybe in the direction of C language extensions, even if only used in
> > > this project at first.
> >
> > Still thinking about this.  What we want to convey is that, while
> > there are races on the particular variable, nobody should be modifying
> > the bits here. Adding a READ_ONCE (or data_race()) would miss a
> > harmful race where somebody modifies these bits, so in principle I
> > agree. However, I think the tool can't automatically tell (even if we
> > had compiler extensions to give us the bits accessed) which bits we
> > care about, because we might have something like:
> >
> > int foo_bar = READ_ONCE(flags) >> FOO_BAR_SHIFT; // need READ_ONCE because of FOO bits
> > .. (foo_bar & FOO_MASK) ..  // FOO bits can be modified concurrently
> > .. (foo_bar & BAR_MASK) ..  // nobody should modify BAR bits concurrently though!
> >
> > What we want is to assert that nobody touches a particular set of
> > bits. KCSAN has recently gotten ASSERT_EXCLUSIVE_{WRITER,ACCESS}
> > macros which help assert properties of concurrent code, where bugs
> > won't manifest as data races. Along those lines, I can see the value
> > in doing an exclusivity check on a bitmask of a variable.
> >
> > I don't know how much a READ_BITS macro could help, since it's
> > probably less ergonomic to have to say something like:
> >   READ_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT) >> ZONES_PGSHIFT.
> >
> > Here is an alternative:
> >
> > Let's say KCSAN gives you this:
> >    /* ... Assert that the bits set in mask are not written
> > concurrently; they may still be read concurrently.
> >      The access that immediately follows is assumed to access those
> > bits and to be safe w.r.t. data races.
> >
> >      For example, this may be used when certain bits of @flags may
> > only be modified when holding the appropriate lock,
> >      but other bits may still be modified locklessly.
> >    ...
> >   */
> >    #define ASSERT_EXCLUSIVE_BITS(flags, mask)   ....
> >
> > Then we can write page_zonenum as follows:
> >
> > static inline enum zone_type page_zonenum(const struct page *page)
> >  {
> > +       ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> >         return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
> >  }
>
> Actually, it seems we still need to write this, if I understand correctly:
>
> ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);
> return data_race((page->flags >> ZONES_PGSHIFT) & ZONES_MASK);

No, I designed it so you won't need 'data_race()' if you don't want
to. I'll send the patches shortly.
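
As a rough sketch of the idea only (the actual patches may differ; this
assumes KCSAN gains a way to apply a bitmask to the next checked access,
and kcsan_set_access_mask() below is that hypothetical hook):

#define ASSERT_EXCLUSIVE_BITS(var, mask)				\
	do {								\
		kcsan_set_access_mask(mask);				\
		__kcsan_check_access(&(var), sizeof(var),		\
				     KCSAN_ACCESS_ASSERT);		\
		kcsan_set_access_mask(0);				\
		kcsan_atomic_next(1); /* next access counts as marked */\
	} while (0)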

> On the other hand, if you really worry that this thing could go wrong, it might
> be better off using READ_ONCE() in the first place, which would be more
> future-proof, with the trade-off that it might generate less efficient code in
> some places?

The READ_ONCE() I'd still advocate for, but KCSAN won't complain if
the pattern is as written above.

> Alternatively, is there a way to write it like this?
>
> return ASSERT_EXCLUSIVE_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT);

It's an ASSERT; without KCSAN it should do nothing, so this is wrong.
Also, this won't work because you're no longer returning the same
value. I thought about this for READ_BITS, but you'd need (I wrote
earlier in the thread that it likely won't be suitable):

   READ_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT) >> ZONES_PGSHIFT

to get the equivalent result (notice this will result in a redundant
shift). Because we have all kinds of permutations and variants of how
to extract the same bits out of some flags, it's cleaner to have one
'ASSERT_EXCLUSIVE_BITS' and just give it the bits you care about.
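
For comparison, a hypothetical READ_BITS() (never actually proposed as a
patch) might look like this, which shows the redundant shift:

/* hypothetical; shown only to illustrate the ergonomics */
#define READ_BITS(x, mask)	(READ_ONCE(x) & (mask))

/* the caller masks with the shifted mask, then must shift again: */
return READ_BITS(page->flags, ZONES_MASK << ZONES_PGSHIFT) >> ZONES_PGSHIFT;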

Thanks,
-- Marco

> Kind of ugly, but probably cleaner.
>
> >
> > This will accomplish the following:
> > 1. The current code is not touched, and we do not have to verify that
> > the change is correct without KCSAN.
> > 2. We're not introducing a bunch of special macros to read bits in various ways.
> > 3. KCSAN will assume that the access is safe, and no data race report
> > is generated.
> > 4. If somebody modifies ZONES bits concurrently, KCSAN will tell you
> > about the race.
> > 5. We're documenting the code.
> >
> > Anything I missed?
> >
> > Thanks,
> > -- Marco
> >
> >
> >
> >
> >
> > > thanks,
> > > --
> > > John Hubbard
> > > NVIDIA
> > >
> > > > >
> > > > > That's still true, but to a lesser extent if more macros are added. In this case,
> > > > > I suspect that READ_BITS() makes the commenting easier and shorter. So I'd tentatively
> > > > > lean towards adding it, but what do others on the list think?
> > > >
> > > > Even reading bits could be dangerous with data races and confusing at best, so I am not really sure what the value of introducing this new macro would be. People who want to understand it correctly still need to read the commit logs.
> > > >
> > > > This flags->zonenum is such a special case that I have not really seen it regularly over the last few weeks of digging through KCSAN reports, so even if it is worth adding READ_BITS(), there are more equally important macros that would need to be added together for it to be useful initially. For example, HARMLESS_COUNTERS(), READ_SINGLE_BIT(), READ_IMMUTATABLE_BITS(), etc., which is exactly what Linus said he wanted to avoid.
> > > >
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 52269e56c514..f8529aa971c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -920,7 +920,7 @@  vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
 
 static inline enum zone_type page_zonenum(const struct page *page)
 {
-	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
+	return (READ_ONCE(page->flags) >> ZONES_PGSHIFT) & ZONES_MASK;
 }
 
 #ifdef CONFIG_ZONE_DEVICE