[v2] mm/vmscan: fix data races at kswapd_classzone_idx

Message ID 20200226035827.1285-1-cai@lca.pw
State New

Commit Message

Qian Cai Feb. 26, 2020, 3:58 a.m. UTC
pgdat->kswapd_classzone_idx could be accessed concurrently in
wakeup_kswapd(). Plain writes and reads without any lock protection
result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
well as saving a branch (compilers might well optimize the original code
in an unintentional way anyway). While at it, also take care of
pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().
The data races were reported by KCSAN:

 BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd

 write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13:
  wakeup_kswapd+0xf1/0x400
  wakeup_kswapd at mm/vmscan.c:3967
  wake_all_kswapds+0x59/0xc0
  wake_all_kswapds at mm/page_alloc.c:4241
  __alloc_pages_slowpath+0xdcc/0x1290
  __alloc_pages_slowpath at mm/page_alloc.c:4512
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x16e/0x6f0
  __handle_mm_fault+0xcd5/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 1 lock held by mtest01/7454:
  #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9
 do_user_addr_fault at arch/x86/mm/fault.c:1405
 (inlined by) do_page_fault at arch/x86/mm/fault.c:1539
 irq event stamp: 6944085
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x34c/0x57c
 irq_exit+0xa2/0xc0

 read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38:
  wakeup_kswapd+0xc8/0x400
  wake_all_kswapds+0x59/0xc0
  __alloc_pages_slowpath+0xdcc/0x1290
  __alloc_pages_nodemask+0x3bb/0x450
  alloc_pages_vma+0x8a/0x2c0
  do_anonymous_page+0x16e/0x6f0
  __handle_mm_fault+0xcd5/0xd40
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x6f9
  page_fault+0x34/0x40

 1 lock held by mtest01/7472:
  #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9
 irq event stamp: 6793561
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x34c/0x57c
 irq_exit+0xa2/0xc0

Signed-off-by: Qian Cai <cai@lca.pw>
---

v2: use a temp variable and take care of kswapd_order per Matthew.
     take care of allow_direct_reclaim() as well.

 mm/vmscan.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

Comments

Matthew Wilcox Feb. 26, 2020, 4:06 a.m. UTC | #1
On Tue, Feb 25, 2020 at 10:58:27PM -0500, Qian Cai wrote:
> pgdat->kswapd_classzone_idx could be accessed concurrently in
> wakeup_kswapd(). Plain writes and reads without any lock protection
> result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
> well as saving a branch (compilers might well optimize the original code
> in an unintentional way anyway). While at it, also take care of
> pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().

I don't understand why the usages of kswapd_classzone_idx in kswapd() and
kswapd_try_to_sleep() don't need changing too?  kswapd_classzone_idx()
looks safe to me, but I'm prone to missing stupid things that compilers
are allowed to do.
Qian Cai Feb. 26, 2020, 11:50 a.m. UTC | #2
> On Feb 25, 2020, at 11:06 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> I don't understand why the usages of kswapd_classzone_idx in kswapd() and
> kswapd_try_to_sleep() don't need changing too?  kswapd_classzone_idx()
> looks safe to me, but I'm prone to missing stupid things that compilers
> are allowed to do.

I am not sure. Although it looks possible on paper, I wonder why KCSAN has not triggered on it yet, given that the path seems rather common. I have been stress testing those areas with KCSAN for a few months now, but I might simply have missed the report in the first place.

I’ll keep testing to confirm it, but until that happens or someone else confirms that it can happen, I’ll leave it out of this version. We can always submit an incremental patch later if necessary.
Qian Cai Feb. 26, 2020, 1:49 p.m. UTC | #3
On Tue, 2020-02-25 at 20:06 -0800, Matthew Wilcox wrote:
> On Tue, Feb 25, 2020 at 10:58:27PM -0500, Qian Cai wrote:
> > pgdat->kswapd_classzone_idx could be accessed concurrently in
> > wakeup_kswapd(). Plain writes and reads without any lock protection
> > result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as
> > well as saving a branch (compilers might well optimize the original code
> > in an unintentional way anyway). While at it, also take care of
> > pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim().
> 
> I don't understand why the usages of kswapd_classzone_idx in kswapd() and
> kswapd_try_to_sleep() don't need changing too?  kswapd_classzone_idx()
> looks safe to me, but I'm prone to missing stupid things that compilers
> are allowed to do.

Right, I did capture the race this time. I'll post a v3.

[  924.803628][ T6299] BUG: KCSAN: data-race in kswapd / wakeup_kswapd 
[  924.809949][ T6299]  
[  924.812170][ T6299] write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6:
[  924.819630][ T6299]  kswapd+0x27c/0x8d0 
[  924.823509][ T6299]  kthread+0x1e0/0x200 
[  924.827471][ T6299]  ret_from_fork+0x27/0x50 
[  924.831774][ T6299]  
[  924.833987][ T6299] read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0:
[  924.841442][ T6299]  wakeup_kswapd+0xf3/0x450 
[  924.845838][ T6299]  wake_all_kswapds+0x59/0xc0 
[  924.850409][ T6299]  __alloc_pages_slowpath+0xdcc/0x1290 
[  924.855769][ T6299]  __alloc_pages_nodemask+0x3bb/0x450 
[  924.861040][ T6299]  alloc_pages_vma+0x8a/0x2c0 
[  924.865612][ T6299]  do_anonymous_page+0x170/0x700 
[  924.870443][ T6299]  __handle_mm_fault+0xc9f/0xd00 
[  924.875276][ T6299]  handle_mm_fault+0xfc/0x2f0 
[  924.879849][ T6299]  do_page_fault+0x263/0x6f9 
[  924.884334][ T6299]  page_fault+0x34/0x40

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 876370565455..e61cc71b8915 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3136,8 +3136,9 @@  static bool allow_direct_reclaim(pg_data_t *pgdat)
 
 	/* kswapd must be awake if processes are being throttled */
 	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
-		pgdat->kswapd_classzone_idx = min(pgdat->kswapd_classzone_idx,
-						(enum zone_type)ZONE_NORMAL);
+		if (READ_ONCE(pgdat->kswapd_classzone_idx) > ZONE_NORMAL)
+			WRITE_ONCE(pgdat->kswapd_classzone_idx, ZONE_NORMAL);
+
 		wake_up_interruptible(&pgdat->kswapd_wait);
 	}
 
@@ -3953,20 +3954,23 @@  void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 		   enum zone_type classzone_idx)
 {
 	pg_data_t *pgdat;
+	enum zone_type curr_idx;
 
 	if (!managed_zone(zone))
 		return;
 
 	if (!cpuset_zone_allowed(zone, gfp_flags))
 		return;
+
 	pgdat = zone->zone_pgdat;
+	curr_idx = READ_ONCE(pgdat->kswapd_classzone_idx);
+
+	if (curr_idx == MAX_NR_ZONES || curr_idx < classzone_idx)
+		WRITE_ONCE(pgdat->kswapd_classzone_idx, classzone_idx);
+
+	if (READ_ONCE(pgdat->kswapd_order) < order)
+		WRITE_ONCE(pgdat->kswapd_order, order);
 
-	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
-		pgdat->kswapd_classzone_idx = classzone_idx;
-	else
-		pgdat->kswapd_classzone_idx = max(pgdat->kswapd_classzone_idx,
-						  classzone_idx);
-	pgdat->kswapd_order = max(pgdat->kswapd_order, order);
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
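
The two rewrites in the hunks above can be checked in isolation. What follows is a minimal user-space sketch, not kernel code: READ_ONCE() and WRITE_ONCE() are simplified volatile-cast stand-ins for the kernel macros (they rely on the GCC/Clang __typeof__ extension), and the zone enum values, struct pgdat_sketch and all function names are hypothetical, chosen only for illustration. It compares the pre-patch and post-patch update logic over a small range of single-threaded inputs to show that the snapshot-and-compare form stores the same values as the original if/else plus max()/min() form. It does not demonstrate the race itself, and the marked check-then-store is still not an atomic read-modify-write.

```c
/* Simplified stand-ins for the kernel macros (assumption): a volatile
 * access makes the compiler emit exactly one load or one store. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical zone layout; the real one depends on the kernel config. */
enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

/* Hypothetical cut-down version of the two pg_data_t fields involved. */
struct pgdat_sketch {
	enum zone_type kswapd_classzone_idx;
	int kswapd_order;
};

/* Pre-patch wakeup_kswapd() logic: plain accesses, field read repeatedly. */
static void wakeup_update_old(struct pgdat_sketch *p, enum zone_type idx, int order)
{
	if (p->kswapd_classzone_idx == MAX_NR_ZONES)
		p->kswapd_classzone_idx = idx;
	else
		p->kswapd_classzone_idx =
			p->kswapd_classzone_idx > idx ? p->kswapd_classzone_idx : idx;
	p->kswapd_order = p->kswapd_order > order ? p->kswapd_order : order;
}

/* Post-patch logic: one marked load into a local, store only on change. */
static void wakeup_update_new(struct pgdat_sketch *p, enum zone_type idx, int order)
{
	enum zone_type curr_idx = READ_ONCE(p->kswapd_classzone_idx);

	if (curr_idx == MAX_NR_ZONES || curr_idx < idx)
		WRITE_ONCE(p->kswapd_classzone_idx, idx);

	if (READ_ONCE(p->kswapd_order) < order)
		WRITE_ONCE(p->kswapd_order, order);
}

/* Pre-patch allow_direct_reclaim() clamp: min(field, ZONE_NORMAL). */
static void clamp_old(struct pgdat_sketch *p)
{
	p->kswapd_classzone_idx = p->kswapd_classzone_idx < ZONE_NORMAL ?
				  p->kswapd_classzone_idx : ZONE_NORMAL;
}

/* Post-patch clamp: store only when the field is above ZONE_NORMAL. */
static void clamp_new(struct pgdat_sketch *p)
{
	if (READ_ONCE(p->kswapd_classzone_idx) > ZONE_NORMAL)
		WRITE_ONCE(p->kswapd_classzone_idx, ZONE_NORMAL);
}

static int same(const struct pgdat_sketch *a, const struct pgdat_sketch *b)
{
	return a->kswapd_classzone_idx == b->kswapd_classzone_idx &&
	       a->kswapd_order == b->kswapd_order;
}

/* Count single-threaded inputs on which the old and new forms disagree. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (int cur = 0; cur <= MAX_NR_ZONES; cur++) {
		struct pgdat_sketch c1 = { (enum zone_type)cur, 0 };
		struct pgdat_sketch c2 = c1;

		clamp_old(&c1);
		clamp_new(&c2);
		mismatches += !same(&c1, &c2);

		for (int req = 0; req < MAX_NR_ZONES; req++)
			for (int cur_order = 0; cur_order < 4; cur_order++)
				for (int req_order = 0; req_order < 4; req_order++) {
					struct pgdat_sketch a = { (enum zone_type)cur, cur_order };
					struct pgdat_sketch b = a;

					wakeup_update_old(&a, (enum zone_type)req, req_order);
					wakeup_update_new(&b, (enum zone_type)req, req_order);
					mismatches += !same(&a, &b);
				}
	}
	return mismatches;
}
```

Calling count_mismatches() from a small test program should return 0, since for every input the new condition selects exactly the cases in which the old code would have stored a different value.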