
[v5,0/9] mm: swap: mTHP swap allocator base on swap cluster order

Message ID: 20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org

Message

Chris Li July 31, 2024, 6:49 a.m. UTC
This is the short-term solution "swap cluster order" listed
in my "Swap Abstraction" discussion, slide 8, at the recent
LSF/MM conference.

Commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" only allocates the mTHP swap entries from the new
empty cluster list. It has a fragmentation issue reported
by Barry:

https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/

The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries in clusters that are not
100% free.

Remember the swap allocation order in the cluster.
Keep track of a per-order nonfull cluster list for later allocation.

This series gives the SSD swap allocation a new code path, separate
from the HDD allocation. The new allocator uses the cluster lists only
and no longer does a global scan of swap_map[] without the lock.

This streamlines the swap allocation for SSD. The code matches the
execution flow much better.
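
As a rough illustration of the idea, below is a minimal userspace C
sketch (not the kernel code in this series) of per-order nonfull
cluster lists; the names, constants, and the free-before-nonfull
fallback order are assumptions made for the sketch only.

/*
 * Minimal userspace sketch of the idea only, not the kernel code in
 * this series.  Names, constants and the "free list before nonfull
 * list" fallback order are assumptions for illustration.  Each cluster
 * remembers the order it serves; partially used clusters stay on a
 * per-order nonfull list so later allocations of that order can reuse
 * them instead of scanning swap_map[].
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512	/* slots per cluster (assumed) */
#define SWAP_NR_ORDERS	 10	/* orders 0..9 (assumed) */
#define NR_CLUSTERS	 8	/* tiny demo swap device */

struct cluster {
	struct cluster *next;	/* singly linked, enough for the demo */
	int order;		/* allocation order this cluster serves */
	int used;		/* slots already handed out */
};

static struct cluster clusters[NR_CLUSTERS];
static struct cluster *free_list;
static struct cluster *nonfull[SWAP_NR_ORDERS];

static void push(struct cluster **list, struct cluster *ci)
{
	ci->next = *list;
	*list = ci;
}

static struct cluster *pop(struct cluster **list)
{
	struct cluster *ci = *list;

	if (ci)
		*list = ci->next;
	return ci;
}

/* Take an empty cluster first, otherwise reuse a nonfull one of this order. */
static struct cluster *alloc_cluster(int order)
{
	struct cluster *ci = pop(&free_list);

	if (ci)
		ci->order = order;
	else
		ci = pop(&nonfull[order]);
	return ci;
}

/* Carve one order-sized chunk out of a cluster, requeue it if space is left. */
static bool alloc_swap_entries(int order)
{
	struct cluster *ci = alloc_cluster(order);

	if (!ci)
		return false;	/* caller would fall back to order 0 */
	ci->used += 1 << order;
	if (ci->used < SWAPFILE_CLUSTER)
		push(&nonfull[ci->order], ci);
	/* else: a real allocator would move it to a full/frag list */
	return true;
}

int main(void)
{
	int i, ok = 0;

	for (i = 0; i < NR_CLUSTERS; i++)
		push(&free_list, &clusters[i]);

	/* Mixed 4K (order 0) and 64K (order 4) allocations keep succeeding. */
	for (i = 0; i < 64; i++)
		ok += alloc_swap_entries(i % 2 ? 4 : 0);
	printf("successful allocations: %d of 64\n", ok);
	return 0;
}

The real series of course uses struct swap_cluster_info and list_head
lists; the sketch only shows why a per-order list lookup replaces the
swap_map[] scan.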

User impact: for users that allocate and free mixed-order mTHP swap
entries, it greatly improves the success rate of the mTHP swap
allocation after the initial phase.

It also performs faster when the swapfile is close to full, because the
allocator can get a nonfull cluster from a list rather than scanning
a lot of swap_map entries.

With Barry's mthp test program V2:

Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%

With: # with all 0.00% results filtered out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all results are 0.00%

$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%

$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%

$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%

=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 cores / 24 hyperthreads, 32 jobs. 10 runs for each group.

SSD swap 10 runs average, 20G swap partition:
With:
user    2929.064
system  1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
1504.47 1531.4 1532.92 1551.57
real    1441.324

Without:
user    2910.872
system  1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
1470.19 1496.32 1544.1 1559.01
real    1580.822

Two zram swap: zram0 3.0G zram1 20G.

The idea is to force zram0 almost full, then overflow to zram1:

With:
user    4320.301
system  4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
4279.85 4285.98 4289.64 4293.13
real    431.759

Without:
user    4301.393
system  4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
4389.76 4397.16 4398.23 4403.9
real    433.979

------ more test results from Kairui ----------

Test building the Linux kernel using a 4G ZRAM with a 1G memory.max limit, on top of shmem:

System info: 32 Core AMD Zen2, 64G total memory.

Test 3 times using only 4K pages:
=================================

With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k

Summary (~4.6% improvement of system time):
User: 1839.62
System: 2443.89: 2465.77 2454.68 2411.21
Real: 158.88

Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k

Summary:
User: 1839.78
System: 2564.22: 2575.95 2555.15 2561.55
Real: 162.99

Test 5 times with all mTHP sizes enabled:
==========================================

With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k

Summary (~8.4% improvement of system time):
User: 1803.25
System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
Real: 175.90

mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541

Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k

Summary:
User: 1811.61
System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
Real: 185.356

mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757

Reported-by: Barry Song <21cnbao@gmail.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v5:
- Suggestions and fixups from the v4 discussion thread from Ying and Ryan.
- Adding Kairui's swap cache reclaim patches on top of patch 3.
- Link to v4: https://lore.kernel.org/r/20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org

Changes in v4:
- Remove a warning in patch 2.
- Allocating from the free cluster list before the nonfull list. Revert the v3 behavior.
- Add cluster_index and cluster_offset function.
- Patch 3 has a new allocating path for SSD.
- HDD swap allocation does not need to consider clusters any more.

Changes in v3:
- Using V1 as base.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update comment for the locking rules for cluster fields and list,
  suggested by Ying.
- Allocate from the nonfull list before attempting free list, suggested
  by Kairui.
- Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org

Changes in v2:
- Abandoned.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org

---
Chris Li (3):
      mm: swap: swap cluster switch to double link list
      mm: swap: mTHP allocate swap entries from nonfull list
      mm: swap: separate SSD allocation from scan_swap_map_slots()

Kairui Song (6):
      mm: swap: clean up initialization helper
      mm: swap: skip slot cache on freeing for mTHP
      mm: swap: allow cache reclaim to skip slot cache
      mm: swap: add a fragment cluster list
      mm: swap: relaim the cached parts that got scanned
      mm: swap: add a adaptive full cluster cache reclaim

 include/linux/swap.h |  34 ++-
 mm/swapfile.c        | 840 ++++++++++++++++++++++++++++++---------------------
 2 files changed, 514 insertions(+), 360 deletions(-)
---
base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
change-id: 20240523-swap-allocator-1534c480ece4

Best regards,

Comments

David Hildenbrand Aug. 1, 2024, 9:14 a.m. UTC | #1
On 31.07.24 08:49, Chris Li wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
> 

Running the cow.c selftest on mm/mm-unstable, I get:

# [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
[   51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
[   51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
[   51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[   51.871298] Workqueue: events swap_discard_work
[   51.872211] RIP: 0010:__free_cluster+0x27/0x90
[   51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
[   51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
[   51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
[   51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
[   51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
[   51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
[   51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
[   51.884827] FS:  0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
[   51.886412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
[   51.888931] PKRU: 55555554
[   51.889471] Call Trace:
[   51.889964]  <TASK>
[   51.890391]  ? __die_body.cold+0x19/0x27
[   51.891174]  ? die_addr+0x3c/0x60
[   51.891824]  ? exc_general_protection+0x14f/0x430
[   51.892754]  ? asm_exc_general_protection+0x26/0x30
[   51.893717]  ? __free_cluster+0x27/0x90
[   51.894483]  ? __free_cluster+0x7e/0x90
[   51.895245]  swap_do_scheduled_discard+0x142/0x1b0
[   51.896189]  swap_discard_work+0x26/0x30
[   51.896958]  process_one_work+0x211/0x5a0
[   51.897750]  ? srso_alias_return_thunk+0x5/0xfbef5
[   51.898693]  worker_thread+0x1c9/0x3c0
[   51.899438]  ? __pfx_worker_thread+0x10/0x10
[   51.900287]  kthread+0xe3/0x110
[   51.900913]  ? __pfx_kthread+0x10/0x10
[   51.901656]  ret_from_fork+0x34/0x50
[   51.902377]  ? __pfx_kthread+0x10/0x10
[   51.903114]  ret_from_fork_asm+0x1a/0x30
[   51.903896]  </TASK>


Maybe related to this series?
Kairui Song Aug. 1, 2024, 9:59 a.m. UTC | #2
On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.07.24 08:49, Chris Li wrote:
> > This is the short term solutions "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
>
> Running the cow.c selftest on mm/mm-unstable, I get:

Hi David, thanks very much for the test and report!

>
> # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> [   51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> [   51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> [   51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [   51.871298] Workqueue: events swap_discard_work
> [   51.872211] RIP: 0010:__free_cluster+0x27/0x90
> [   51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> [   51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> [   51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> [   51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> [   51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> [   51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> [   51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> [   51.884827] FS:  0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> [   51.886412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> [   51.888931] PKRU: 55555554
> [   51.889471] Call Trace:
> [   51.889964]  <TASK>
> [   51.890391]  ? __die_body.cold+0x19/0x27
> [   51.891174]  ? die_addr+0x3c/0x60
> [   51.891824]  ? exc_general_protection+0x14f/0x430
> [   51.892754]  ? asm_exc_general_protection+0x26/0x30
> [   51.893717]  ? __free_cluster+0x27/0x90
> [   51.894483]  ? __free_cluster+0x7e/0x90
> [   51.895245]  swap_do_scheduled_discard+0x142/0x1b0
> [   51.896189]  swap_discard_work+0x26/0x30
> [   51.896958]  process_one_work+0x211/0x5a0
> [   51.897750]  ? srso_alias_return_thunk+0x5/0xfbef5
> [   51.898693]  worker_thread+0x1c9/0x3c0
> [   51.899438]  ? __pfx_worker_thread+0x10/0x10
> [   51.900287]  kthread+0xe3/0x110
> [   51.900913]  ? __pfx_kthread+0x10/0x10
> [   51.901656]  ret_from_fork+0x34/0x50
> [   51.902377]  ? __pfx_kthread+0x10/0x10
> [   51.903114]  ret_from_fork_asm+0x1a/0x30
> [   51.903896]  </TASK>
>
>
> Maybe related to this series?

Right, I can reproduce your problem, and I believe this patch can fix
it; see the attachment.

Hi Andrew, can you pick this patch too?
Kairui Song Aug. 1, 2024, 10:06 a.m. UTC | #3
On Thu, Aug 1, 2024 at 5:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 31.07.24 08:49, Chris Li wrote:
> > > This is the short term solutions "swap cluster order" listed
> > > in my "Swap Abstraction" discussion slice 8 in the recent
> > > LSF/MM conference.
> > >
> >
> > Running the cow.c selftest on mm/mm-unstable, I get:
>
> Hi David, thanks very much for the test and report!
>
> >
> > # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> > [   51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> > [   51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> > [   51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> > [   51.871298] Workqueue: events swap_discard_work
> > [   51.872211] RIP: 0010:__free_cluster+0x27/0x90
> > [   51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> > [   51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> > [   51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> > [   51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> > [   51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> > [   51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> > [   51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> > [   51.884827] FS:  0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> > [   51.886412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> > [   51.888931] PKRU: 55555554
> > [   51.889471] Call Trace:
> > [   51.889964]  <TASK>
> > [   51.890391]  ? __die_body.cold+0x19/0x27
> > [   51.891174]  ? die_addr+0x3c/0x60
> > [   51.891824]  ? exc_general_protection+0x14f/0x430
> > [   51.892754]  ? asm_exc_general_protection+0x26/0x30
> > [   51.893717]  ? __free_cluster+0x27/0x90
> > [   51.894483]  ? __free_cluster+0x7e/0x90
> > [   51.895245]  swap_do_scheduled_discard+0x142/0x1b0
> > [   51.896189]  swap_discard_work+0x26/0x30
> > [   51.896958]  process_one_work+0x211/0x5a0
> > [   51.897750]  ? srso_alias_return_thunk+0x5/0xfbef5
> > [   51.898693]  worker_thread+0x1c9/0x3c0
> > [   51.899438]  ? __pfx_worker_thread+0x10/0x10
> > [   51.900287]  kthread+0xe3/0x110
> > [   51.900913]  ? __pfx_kthread+0x10/0x10
> > [   51.901656]  ret_from_fork+0x34/0x50
> > [   51.902377]  ? __pfx_kthread+0x10/0x10
> > [   51.903114]  ret_from_fork_asm+0x1a/0x30
> > [   51.903896]  </TASK>
> >
> >
> > Maybe related to this series?
>
> Right, I can reproduce your problem and I believe this patch can fix
> it, see the attachment.
>
> Hi Andrew, can you pick this patch too?

This issue is caused by my PATCH 9/9, and the attached patch can be
squashed into it. Very sorry for the problem.
Chris Li Aug. 16, 2024, 7:36 a.m. UTC | #4
On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> [snip]
>
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> >       lockdep_assert_held(&si->lock);
> >       lockdep_assert_held(&ci->lock);
> >
> > -     list_move_tail(&ci->list, &si->free_clusters);
> > +     if (ci->flags)
> > +             list_move_tail(&ci->list, &si->free_clusters);
> > +     else
> > +             list_add_tail(&ci->list, &si->free_clusters);
>
> If we use list_del_init() to delete the cluster, we can always use
> list_move_tail()?  If so, the logic can be simplified.

Thanks for the suggestion.

I feel that list_del_init() generates more instructions than necessary.
It is my bad that I left the discard list without a list flag bit
for it.

I do want to clean this up. While we are at it: because a cluster
can only belong to one list at a time, we can use an integer list
indicator rather than a bit mask. If we give the discard list the same
treatment, that should remove the special condition for adding a cluster
to another list as well.
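
Something along these lines (a hypothetical sketch only, not code from
this series; all names here are made up for illustration):

/*
 * Hypothetical sketch of the cleanup idea only -- not code from this
 * series.  Since a cluster is on at most one list at a time, ci->flags
 * could become a single "which list" value (including the discard
 * list), so membership tests are one comparison and no bit mask is
 * needed.
 */
#include <stdio.h>

enum swap_cluster_list_id {
	CLUSTER_LIST_NONE = 0,		/* only while a cluster is in transit */
	CLUSTER_LIST_FREE,
	CLUSTER_LIST_NONFULL,
	CLUSTER_LIST_FRAG,
	CLUSTER_LIST_FULL,
	CLUSTER_LIST_DISCARD,
};

struct cluster_sketch {
	unsigned int count;
	unsigned int order;
	enum swap_cluster_list_id list_id;	/* replaces the flag bit mask */
};

static const char *cluster_list_name(const struct cluster_sketch *ci)
{
	static const char * const names[] = {
		"none", "free", "nonfull", "frag", "full", "discard",
	};
	return names[ci->list_id];
}

int main(void)
{
	struct cluster_sketch ci = { .count = 0, .order = 0,
				     .list_id = CLUSTER_LIST_DISCARD };

	printf("cluster is on the %s list\n", cluster_list_name(&ci));
	ci.list_id = CLUSTER_LIST_FREE;	/* one assignment per list transition */
	printf("cluster is on the %s list\n", cluster_list_name(&ci));
	return 0;
}

With one indicator per cluster, "is it linked, and where" is a single
comparison, and there is no need to keep flag bits in sync with list
membership.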

Chris

> >       ci->flags = CLUSTER_FLAG_FREE;
> >       ci->order = 0;
> >  }
> > @@ -474,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> >                               SWAPFILE_CLUSTER);
> >
> >               spin_lock(&si->lock);
> > -
> >               spin_lock(&ci->lock);
> >               __free_cluster(si, ci);
> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > @@ -666,7 +668,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> >               if (ci->flags & CLUSTER_FLAG_FRAG)
> >                       si->frag_cluster_nr[ci->order]--;
> >               list_move_tail(&ci->list, &si->full_clusters);
> > -             ci->flags = 0;
> > +             ci->flags = CLUSTER_FLAG_FULL;
> >       }
> >  }
>
> --
> Best Regards,
> Huang, Ying
Chris Li Aug. 16, 2024, 7:47 a.m. UTC | #5
On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Chris,
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > This is the short term solutions "swap cluster order" listed
> >> > in my "Swap Abstraction" discussion slice 8 in the recent
> >> > LSF/MM conference.
> >> >
> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > orders" is introduced, it only allocates the mTHP swap entries
> >> > from the new empty cluster list.  It has a fragmentation issue
> >> > reported by Barry.
> >> >
> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >> >
> >> > The reason is that all the empty clusters have been exhausted while
> >> > there are plenty of free swap entries in the cluster that are
> >> > not 100% free.
> >> >
> >> > Remember the swap allocation order in the cluster.
> >> > Keep track of the per order non full cluster list for later allocation.
> >> >
> >> > This series gives the swap SSD allocation a new separate code path
> >> > from the HDD allocation. The new allocator use cluster list only
> >> > and do not global scan swap_map[] without lock any more.
> >>
> >> This sounds good.  Can we use SSD allocation method for HDD too?
> >> We may not need a swap entry allocator optimized for HDD.
> >
> > Yes, that is the plan as well. That way we can completely get rid of
> > the old scan_swap_map_slots() code.
>
> Good!
>
> > However, considering the size of the series, let's focus on the
> > cluster allocation path first, get it tested and reviewed.
>
> OK.
>
> > For HDD optimization, mostly just the new block allocations portion
> > need some separate code path from the new cluster allocator to not do
> > the per cpu allocation.  Allocating from the non free list doesn't
> > need to change too
>
> I suggest not consider HDD optimization at all.  Just use SSD algorithm
> to simplify.

Adding a global next allocating CI rather than the per CPU next CI
pointer is pretty trivial as well. It is just a different way to fetch
the next cluster pointer.

>
> >>
> >> Hi, Hugh,
> >>
> >> What do you think about this?
> >>
> >> > This streamline the swap allocation for SSD. The code matches the
> >> > execution flow much better.
> >> >
> >> > User impact: For users that allocate and free mix order mTHP swapping,
> >> > It greatly improves the success rate of the mTHP swap allocation after the
> >> > initial phase.
> >> >
> >> > It also performs faster when the swapfile is close to full, because the
> >> > allocator can get the non full cluster from a list rather than scanning
> >> > a lot of swap_map entries.
> >>
> >> Do you have some test results to prove this?  Or which test below can
> >> prove this?
> >
> > The two zram tests are already proving this. The system time
> > improvement is about 2% on my low CPU count machine.
> > Kairui has a higher core count machine and the difference is higher
> > there. The theory is that higher CPU count has higher contentions.
>
> I will interpret this as the performance is better in theory.  But
> there's almost no measurable results so far.

I am trying to understand why you don't see the performance improvement
in the zram setup in my cover letter as a measurable result.

>
> > The 2% system time number does not sound like much. But consider this
> > two factors:
> > 1) swap allocator only takes a small percentage of the overall workload.
> > 2) The new allocator does more work.
> > The old allocator has a time tick budget. It will abort and fail to
> > find an entry when it runs out of time budget, even though there are
> > still some free entries on the swapfile.
>
> What is the time tick budget you mentioned?

I was under the impression that the previous swap entry allocation
code will not scan 100% of the swapfile, even if there is only one
free entry left.
Please let me know if my understanding is not correct.

        /* time to take a break? */
        if (unlikely(--latency_ration < 0)) {
                if (n_ret)
                        goto done;
                spin_unlock(&si->lock);
                cond_resched();
                spin_lock(&si->lock);
                latency_ration = LATENCY_LIMIT;
        }


>
> > The new allocator can get to the last few free swap entries if it is
> > available. If not then, the new swap allocator will work harder on
> > swap cache reclaim.
> >
> > From the swap cache reclaim aspect, it is very hard to optimize the
> > swap cache reclaim in the old allocation path because the scan
> > position is randomized.
> > The full list and frag list both design to help reduce the repeat
> > reclaim attempt of the swap cache.
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
Kairui Song Aug. 17, 2024, 5:47 p.m. UTC | #6
On Fri, Aug 16, 2024 at 3:37 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > [snip]
> >
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> > >       lockdep_assert_held(&si->lock);
> > >       lockdep_assert_held(&ci->lock);
> > >
> > > -     list_move_tail(&ci->list, &si->free_clusters);
> > > +     if (ci->flags)
> > > +             list_move_tail(&ci->list, &si->free_clusters);
> > > +     else
> > > +             list_add_tail(&ci->list, &si->free_clusters);
> >
> > If we use list_del_init() to delete the cluster, we can always use
> > list_move_tail()?  If so, the logic can be simplified.
>
> Thanks for the suggestion.

Hi All, thanks for the review and discussion.

> I feel that list_del_init() generates more instruction than necessary.
> It is my bad that I leave the discard list without not a list flag bit
> for it.

Right, list_del_init is a little bit more noisy than list_del indeed.

But considering that, after this patch, all non-discard clusters are
always on a list (free/nonfull/full), a cluster will be dangling only
while it is being removed from the discard list (during the discard
work, we need to unlock si->lock, so the cluster has to be hidden
from other racers).

I think it's good to use list_del_init when deleting from the discard
list in this patch; then list_move_tail can always be used when
changing the list of a cluster.
Discard should be a much less common operation, so this should
be OK.
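
For reference, here is a standalone userspace model of the list.h
behavior we are talking about (simplified re-implementations, not the
kernel header); it only shows why an entry taken off a list with
list_del_init() can later be moved with an unconditional
list_move_tail():

/*
 * Standalone userspace model of the <linux/list.h> behavior discussed
 * above; these are simplified re-implementations, not the kernel
 * header.  It shows why taking a cluster off the discard list with
 * list_del_init() lets every later transition use list_move_tail()
 * unconditionally: the entry stays a valid empty list instead of
 * carrying the poisoned pointers plain list_del() leaves behind
 * (the 0xdead... values in the oops above are those poison constants).
 */
#include <stdio.h>

struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

static void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

static void __list_del(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;
	prev->next = next;
}

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* list_del(): the kernel poisons next/prev; modeled here as NULL. */
static void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = NULL;
	entry->prev = NULL;
}

/* list_del_init(): the entry becomes a valid singleton list. */
static void list_del_init(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	INIT_LIST_HEAD(entry);
}

/* list_move_tail() dereferences entry->prev/next, so the entry must be sane. */
static void list_move_tail(struct list_head *entry, struct list_head *head)
{
	__list_del(entry->prev, entry->next);
	list_add_tail(entry, head);
}

int main(void)
{
	struct list_head discard = LIST_HEAD_INIT(discard);
	struct list_head free_clusters = LIST_HEAD_INIT(free_clusters);
	struct list_head cluster;

	list_add_tail(&cluster, &discard);

	/* Taken off the discard list while si->lock is dropped (modeled). */
	list_del_init(&cluster);

	/* Safe: no "is it on a list" special case is needed any more. */
	list_move_tail(&cluster, &free_clusters);
	printf("cluster on free list: %d\n", free_clusters.next == &cluster);

	/* Contrast: plain list_del() leaves dangling/poisoned pointers,
	 * which is why __free_cluster() otherwise needs the ci->flags
	 * check shown in the attached fix. */
	list_del(&cluster);
	printf("after list_del, next is %p -- unsafe for list_move_tail()\n",
	       (void *)cluster.next);
	return 0;
}

This is only a model of the semantics; the real helpers live in
include/linux/list.h.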
Kairui Song Aug. 18, 2024, 4:59 p.m. UTC | #7
On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Chris Li <chrisl@kernel.org> writes:
> >
> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Hi, Chris,
> > >>
> > >> Chris Li <chrisl@kernel.org> writes:
> > >>
> > >> > This is the short term solutions "swap cluster order" listed
> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> > >> > LSF/MM conference.
> > >> >
> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > orders" is introduced, it only allocates the mTHP swap entries
> > >> > from the new empty cluster list.  It has a fragmentation issue
> > >> > reported by Barry.
> > >> >
> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >> >
> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > there are plenty of free swap entries in the cluster that are
> > >> > not 100% free.
> > >> >
> > >> > Remember the swap allocation order in the cluster.
> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> >
> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > from the HDD allocation. The new allocator use cluster list only
> > >> > and do not global scan swap_map[] without lock any more.
> > >>
> > >> This sounds good.  Can we use SSD allocation method for HDD too?
> > >> We may not need a swap entry allocator optimized for HDD.
> > >
> > > Yes, that is the plan as well. That way we can completely get rid of
> > > the old scan_swap_map_slots() code.
> >
> > Good!
> >
> > > However, considering the size of the series, let's focus on the
> > > cluster allocation path first, get it tested and reviewed.
> >
> > OK.
> >
> > > For HDD optimization, mostly just the new block allocations portion
> > > need some separate code path from the new cluster allocator to not do
> > > the per cpu allocation.  Allocating from the non free list doesn't
> > > need to change too
> >
> > I suggest not consider HDD optimization at all.  Just use SSD algorithm
> > to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

Yes, if we enable the new cluster-based allocator for HDD, we can
enable THP and mTHP for HDD too, and use a global cluster_next instead
of a per-CPU one for it.
It's easy to do with minimal changes, and it should actually boost
performance for HDD swap. Currently testing this locally.

> > >>
> > >> Hi, Hugh,
> > >>
> > >> What do you think about this?
> > >>
> > >> > This streamline the swap allocation for SSD. The code matches the
> > >> > execution flow much better.
> > >> >
> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> > >> > initial phase.
> > >> >
> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > a lot of swap_map entries.
> > >>
> > >> Do you have some test results to prove this?  Or which test below can
> > >> prove this?
> > >
> > > The two zram tests are already proving this. The system time
> > > improvement is about 2% on my low CPU count machine.
> > > Kairui has a higher core count machine and the difference is higher
> > > there. The theory is that higher CPU count has higher contentions.
> >
> > I will interpret this as the performance is better in theory.  But
> > there's almost no measurable results so far.
>
> I am trying to understand why don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?

Hi Ying, you can check the test with the 32-core AMD machine in the
cover letter; as Chris pointed out, the performance gain is higher as
the core count grows. The performance gain is still not much (yet:
based on this design, things can go much faster once the HDD code is
dropped, which enables many other optimizations; this series is
mainly focusing on the fragmentation issue), but I think a stable
~4-8% improvement in a Linux kernel build test could be considered
measurable?
Huang, Ying Aug. 19, 2024, 8:27 a.m. UTC | #8
Kairui Song <ryncsn@gmail.com> writes:

> On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Chris Li <chrisl@kernel.org> writes:
>> >
>> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Hi, Chris,
>> > >>
>> > >> Chris Li <chrisl@kernel.org> writes:
>> > >>
>> > >> > This is the short term solutions "swap cluster order" listed
>> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> > >> > LSF/MM conference.
>> > >> >
>> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> > >> > orders" is introduced, it only allocates the mTHP swap entries
>> > >> > from the new empty cluster list.  It has a fragmentation issue
>> > >> > reported by Barry.
>> > >> >
>> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> > >> >
>> > >> > The reason is that all the empty clusters have been exhausted while
>> > >> > there are plenty of free swap entries in the cluster that are
>> > >> > not 100% free.
>> > >> >
>> > >> > Remember the swap allocation order in the cluster.
>> > >> > Keep track of the per order non full cluster list for later allocation.
>> > >> >
>> > >> > This series gives the swap SSD allocation a new separate code path
>> > >> > from the HDD allocation. The new allocator use cluster list only
>> > >> > and do not global scan swap_map[] without lock any more.
>> > >>
>> > >> This sounds good.  Can we use SSD allocation method for HDD too?
>> > >> We may not need a swap entry allocator optimized for HDD.
>> > >
>> > > Yes, that is the plan as well. That way we can completely get rid of
>> > > the old scan_swap_map_slots() code.
>> >
>> > Good!
>> >
>> > > However, considering the size of the series, let's focus on the
>> > > cluster allocation path first, get it tested and reviewed.
>> >
>> > OK.
>> >
>> > > For HDD optimization, mostly just the new block allocations portion
>> > > need some separate code path from the new cluster allocator to not do
>> > > the per cpu allocation.  Allocating from the non free list doesn't
>> > > need to change too
>> >
>> > I suggest not consider HDD optimization at all.  Just use SSD algorithm
>> > to simplify.
>>
>> Adding a global next allocating CI rather than the per CPU next CI
>> pointer is pretty trivial as well. It is just a different way to fetch
>> the next cluster pointer.
>
> Yes, if we enable the new cluster based allocator for HDD, we can
> enable THP and mTHP for HDD too, and use a global cluster_next instead
> of Per-CPU for it.
> It's easy to do with minimal changes, and should actually boost
> performance for HDD SWAP. Currently testing this locally.

I think that it's better to start with the SSD algorithm.  Then, you
can add HDD-specific optimizations on top of it with supporting data.

BTW, I don't know why HDD shouldn't use the per-CPU cluster.  Sequential
writing is more important for HDD.

>> > >>
>> > >> Hi, Hugh,
>> > >>
>> > >> What do you think about this?
>> > >>
>> > >> > This streamline the swap allocation for SSD. The code matches the
>> > >> > execution flow much better.
>> > >> >
>> > >> > User impact: For users that allocate and free mix order mTHP swapping,
>> > >> > It greatly improves the success rate of the mTHP swap allocation after the
>> > >> > initial phase.
>> > >> >
>> > >> > It also performs faster when the swapfile is close to full, because the
>> > >> > allocator can get the non full cluster from a list rather than scanning
>> > >> > a lot of swap_map entries.
>> > >>
>> > >> Do you have some test results to prove this?  Or which test below can
>> > >> prove this?
>> > >
>> > > The two zram tests are already proving this. The system time
>> > > improvement is about 2% on my low CPU count machine.
>> > > Kairui has a higher core count machine and the difference is higher
>> > > there. The theory is that higher CPU count has higher contentions.
>> >
>> > I will interpret this as the performance is better in theory.  But
>> > there's almost no measurable results so far.
>>
>> I am trying to understand why don't see the performance improvement in
>> the zram setup in my cover letter as a measurable result?
>
> Hi Ying, you can check the test with the 32 cores AMD machine in the
> cover letter, as Chris pointed out the performance gain is higher as
> core number grows. The performance gain is still not much (*yet, based
> on this design thing can go much faster after HDD codes are
> dropped which enables many other optimizations, this series
> is mainly focusing on the fragmentation issue), but I think a
> stable ~4 - 8% improvement with a build linux kernel test
> could be considered measurable?

Is this the test result for "when the swapfile is close to full"?

--
Best Regards,
Huang, Ying
Huang, Ying Aug. 19, 2024, 8:39 a.m. UTC | #9
Chris Li <chrisl@kernel.org> writes:

> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > This is the short term solutions "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" is introduced, it only allocates the mTHP swap entries
>> >> > from the new empty cluster list.  It has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in the cluster that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per order non full cluster list for later allocation.
>> >> >
>> >> > This series gives the swap SSD allocation a new separate code path
>> >> > from the HDD allocation. The new allocator use cluster list only
>> >> > and do not global scan swap_map[] without lock any more.
>> >>
>> >> This sounds good.  Can we use SSD allocation method for HDD too?
>> >> We may not need a swap entry allocator optimized for HDD.
>> >
>> > Yes, that is the plan as well. That way we can completely get rid of
>> > the old scan_swap_map_slots() code.
>>
>> Good!
>>
>> > However, considering the size of the series, let's focus on the
>> > cluster allocation path first, get it tested and reviewed.
>>
>> OK.
>>
>> > For HDD optimization, mostly just the new block allocations portion
>> > need some separate code path from the new cluster allocator to not do
>> > the per cpu allocation.  Allocating from the non free list doesn't
>> > need to change too
>>
>> I suggest not consider HDD optimization at all.  Just use SSD algorithm
>> to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.

For HDD optimization, I mean the original code with no struct swap_cluster_info, etc.

>>
>> >>
>> >> Hi, Hugh,
>> >>
>> >> What do you think about this?
>> >>
>> >> > This streamline the swap allocation for SSD. The code matches the
>> >> > execution flow much better.
>> >> >
>> >> > User impact: For users that allocate and free mix order mTHP swapping,
>> >> > It greatly improves the success rate of the mTHP swap allocation after the
>> >> > initial phase.
>> >> >
>> >> > It also performs faster when the swapfile is close to full, because the
>> >> > allocator can get the non full cluster from a list rather than scanning
>> >> > a lot of swap_map entries.
>> >>
>> >> Do you have some test results to prove this?  Or which test below can
>> >> prove this?
>> >
>> > The two zram tests are already proving this. The system time
>> > improvement is about 2% on my low CPU count machine.
>> > Kairui has a higher core count machine and the difference is higher
>> > there. The theory is that higher CPU count has higher contentions.
>>
>> I will interpret this as the performance is better in theory.  But
>> there's almost no measurable results so far.
>
> I am trying to understand why don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?

IIUC, there's no benchmark score difference, just system time.  And the
number is low too.

For Kairui's test, does all performance improvement come from "swapfile
is close to full"?

>>
>> > The 2% system time number does not sound like much. But consider this
>> > two factors:
>> > 1) swap allocator only takes a small percentage of the overall workload.
>> > 2) The new allocator does more work.
>> > The old allocator has a time tick budget. It will abort and fail to
>> > find an entry when it runs out of time budget, even though there are
>> > still some free entries on the swapfile.
>>
>> What is the time tick budget you mentioned?
>
> I was under the impression that the previous swap entry allocation
> code will not scan 100% of the swapfile if there is only one entry
> left.
> Please let me know if my understanding is not correct.
>
>         /* time to take a break? */
>         if (unlikely(--latency_ration < 0)) {
>                 if (n_ret)
>                         goto done;
>                 spin_unlock(&si->lock);
>                 cond_resched();
>                 spin_lock(&si->lock);
>                 latency_ration = LATENCY_LIMIT;
>         }

IIUC, this is to reduce latency via cond_resched().  If n_ret != 0, we
have allocated some swap entries successfully, so it's OK to return
early to reduce allocation latency.

>
>>
>> > The new allocator can get to the last few free swap entries if it is
>> > available. If not then, the new swap allocator will work harder on
>> > swap cache reclaim.
>> >
>> > From the swap cache reclaim aspect, it is very hard to optimize the
>> > swap cache reclaim in the old allocation path because the scan
>> > position is randomized.
>> > The full list and frag list both design to help reduce the repeat
>> > reclaim attempt of the swap cache.
>>
>> [snip]
>>

--
Best Regards,
Huang, Ying
Kairui Song Aug. 19, 2024, 8:47 a.m. UTC | #10
On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Chris Li <chrisl@kernel.org> writes:
> >> >
> >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >>
> >> > >> Hi, Chris,
> >> > >>
> >> > >> Chris Li <chrisl@kernel.org> writes:
> >> > >>
> >> > >> > This is the short term solutions "swap cluster order" listed
> >> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> >> > >> > LSF/MM conference.
> >> > >> >
> >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > >> > orders" is introduced, it only allocates the mTHP swap entries
> >> > >> > from the new empty cluster list.  It has a fragmentation issue
> >> > >> > reported by Barry.
> >> > >> >
> >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >> > >> >
> >> > >> > The reason is that all the empty clusters have been exhausted while
> >> > >> > there are plenty of free swap entries in the cluster that are
> >> > >> > not 100% free.
> >> > >> >
> >> > >> > Remember the swap allocation order in the cluster.
> >> > >> > Keep track of the per order non full cluster list for later allocation.
> >> > >> >
> >> > >> > This series gives the swap SSD allocation a new separate code path
> >> > >> > from the HDD allocation. The new allocator use cluster list only
> >> > >> > and do not global scan swap_map[] without lock any more.
> >> > >>
> >> > >> This sounds good.  Can we use SSD allocation method for HDD too?
> >> > >> We may not need a swap entry allocator optimized for HDD.
> >> > >
> >> > > Yes, that is the plan as well. That way we can completely get rid of
> >> > > the old scan_swap_map_slots() code.
> >> >
> >> > Good!
> >> >
> >> > > However, considering the size of the series, let's focus on the
> >> > > cluster allocation path first, get it tested and reviewed.
> >> >
> >> > OK.
> >> >
> >> > > For HDD optimization, mostly just the new block allocations portion
> >> > > need some separate code path from the new cluster allocator to not do
> >> > > the per cpu allocation.  Allocating from the non free list doesn't
> >> > > need to change too
> >> >
> >> > I suggest not consider HDD optimization at all.  Just use SSD algorithm
> >> > to simplify.
> >>
> >> Adding a global next allocating CI rather than the per CPU next CI
> >> pointer is pretty trivial as well. It is just a different way to fetch
> >> the next cluster pointer.
> >
> > Yes, if we enable the new cluster based allocator for HDD, we can
> > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > of Per-CPU for it.
> > It's easy to do with minimal changes, and should actually boost
> > performance for HDD SWAP. Currently testing this locally.
>
> I think that it's better to start with SSD algorithm.  Then, you can add
> HDD specific optimization on top of it with supporting data.

Yes, we have the same idea.

>
> BTW, I don't know why HDD shouldn't use per-CPU cluster.  Sequential
> writing is more important for HDD.
> >> > >>
> >> > >> Hi, Hugh,
> >> > >>
> >> > >> What do you think about this?
> >> > >>
> >> > >> > This streamline the swap allocation for SSD. The code matches the
> >> > >> > execution flow much better.
> >> > >> >
> >> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> >> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> >> > >> > initial phase.
> >> > >> >
> >> > >> > It also performs faster when the swapfile is close to full, because the
> >> > >> > allocator can get the non full cluster from a list rather than scanning
> >> > >> > a lot of swap_map entries.
> >> > >>
> >> > >> Do you have some test results to prove this?  Or which test below can
> >> > >> prove this?
> >> > >
> >> > > The two zram tests are already proving this. The system time
> >> > > improvement is about 2% on my low CPU count machine.
> >> > > Kairui has a higher core count machine and the difference is higher
> >> > > there. The theory is that higher CPU count has higher contentions.
> >> >
> >> > I will interpret this as the performance is better in theory.  But
> >> > there's almost no measurable results so far.
> >>
> >> I am trying to understand why don't see the performance improvement in
> >> the zram setup in my cover letter as a measurable result?
> >
> > Hi Ying, you can check the test with the 32 cores AMD machine in the
> > cover letter, as Chris pointed out the performance gain is higher as
> > core number grows. The performance gain is still not much (*yet, based
> > on this design thing can go much faster after HDD codes are
> > dropped which enables many other optimizations, this series
> > is mainly focusing on the fragmentation issue), but I think a
> > stable ~4 - 8% improvement with a build linux kernel test
> > could be considered measurable?
>
> Is this the test result for "when the swapfile is close to full"?

Yes, it's about 60% to 90% full during the whole test process. If ZRAM
is completely full, the workload will go OOM, but testing with madvise
showed no performance drop.
Chris Li Aug. 19, 2024, 9:27 p.m. UTC | #11
Hi Kairui,

On Mon, Aug 19, 2024 at 1:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
> > >>
> > >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >
> > >> > Chris Li <chrisl@kernel.org> writes:
> > >> >
> > >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> > >>
> > >> > >> Hi, Chris,
> > >> > >>
> > >> > >> Chris Li <chrisl@kernel.org> writes:
> > >> > >>
> > >> > >> > This is the short term solutions "swap cluster order" listed
> > >> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> > >> > >> > LSF/MM conference.
> > >> > >> >
> > >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > >> > orders" is introduced, it only allocates the mTHP swap entries
> > >> > >> > from the new empty cluster list.  It has a fragmentation issue
> > >> > >> > reported by Barry.
> > >> > >> >
> > >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >> > >> >
> > >> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > >> > there are plenty of free swap entries in the cluster that are
> > >> > >> > not 100% free.
> > >> > >> >
> > >> > >> > Remember the swap allocation order in the cluster.
> > >> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> > >> >
> > >> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > >> > from the HDD allocation. The new allocator use cluster list only
> > >> > >> > and do not global scan swap_map[] without lock any more.
> > >> > >>
> > >> > >> This sounds good.  Can we use SSD allocation method for HDD too?
> > >> > >> We may not need a swap entry allocator optimized for HDD.
> > >> > >
> > >> > > Yes, that is the plan as well. That way we can completely get rid of
> > >> > > the old scan_swap_map_slots() code.
> > >> >
> > >> > Good!
> > >> >
> > >> > > However, considering the size of the series, let's focus on the
> > >> > > cluster allocation path first, get it tested and reviewed.
> > >> >
> > >> > OK.
> > >> >
> > >> > > For HDD optimization, mostly just the new block allocations portion
> > >> > > need some separate code path from the new cluster allocator to not do
> > >> > > the per cpu allocation.  Allocating from the non free list doesn't
> > >> > > need to change too
> > >> >
> > >> > I suggest not consider HDD optimization at all.  Just use SSD algorithm
> > >> > to simplify.
> > >>
> > >> Adding a global next allocating CI rather than the per CPU next CI
> > >> pointer is pretty trivial as well. It is just a different way to fetch
> > >> the next cluster pointer.
> > >
> > > Yes, if we enable the new cluster based allocator for HDD, we can
> > > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > > of Per-CPU for it.
> > > It's easy to do with minimal changes, and should actually boost
> > > performance for HDD SWAP. Currently testing this locally.
> >
> > I think that it's better to start with SSD algorithm.  Then, you can add
> > HDD specific optimization on top of it with supporting data.
>
> Yes, we are having the same idea.
>
> >
> > BTW, I don't know why HDD shouldn't use per-CPU cluster.  Sequential
> > writing is more important for HDD.
> > >> > >>
> > >> > >> Hi, Hugh,
> > >> > >>
> > >> > >> What do you think about this?
> > >> > >>
> > >> > >> > This streamline the swap allocation for SSD. The code matches the
> > >> > >> > execution flow much better.
> > >> > >> >
> > >> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> > >> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> > >> > >> > initial phase.
> > >> > >> >
> > >> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > >> > a lot of swap_map entries.
> > >> > >>
> > >> > >> Do you have some test results to prove this?  Or which test below can
> > >> > >> prove this?
> > >> > >
> > >> > > The two zram tests are already proving this. The system time
> > >> > > improvement is about 2% on my low CPU count machine.
> > >> > > Kairui has a higher core count machine and the difference is higher
> > >> > > there. The theory is that higher CPU count has higher contentions.
> > >> >
> > >> > I will interpret this as the performance is better in theory.  But
> > >> > there's almost no measurable results so far.
> > >>
> > >> I am trying to understand why don't see the performance improvement in
> > >> the zram setup in my cover letter as a measurable result?
> > >
> > > Hi Ying, you can check the test with the 32 cores AMD machine in the
> > > cover letter, as Chris pointed out the performance gain is higher as
> > > core number grows. The performance gain is still not much (*yet, based
> > > on this design thing can go much faster after HDD codes are
> > > dropped which enables many other optimizations, this series
> > > is mainly focusing on the fragmentation issue), but I think a
> > > stable ~4 - 8% improvement with a build linux kernel test
> > > could be considered measurable?
> >
> > Is this the test result for "when the swapfile is close to full"?
>
> Yes, it's about 60% to 90% full during the whole test process. If ZRAM
> is completely full the workload will go OOM, but testing with madvice

BTW, one trick to avoid the OOM caused by ZRAM getting completely full
is to have two zram devices with different priorities. Let the first
zram get 100% full, then the swap overflows to the second zram device,
which has more swap entries, avoiding the OOM.

Chris

> showed no performance drop.
Andrew Morton Sept. 2, 2024, 1:20 a.m. UTC | #12
On Tue, 30 Jul 2024 23:49:12 -0700 Chris Li <chrisl@kernel.org> wrote:

> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.

Well this patchset hasn't needed a -fix patch since Aug 5, which is
hopeful.

How do people feel about moving this series into mm-stable for the next
merge window?

A few notes I've made are

https://lkml.kernel.org/r/CACePvbX1EWQk03YcC47s7+vn40kEFb_3wp3D_GmJV-8Fn2j+=Q@mail.gmail.com
https://lkml.kernel.org/r/87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com
https://lkml.kernel.org/r/CACePvbU9NUgfD-QC-hdYEmeMbLc5whOKeW4hb2ijcPonZ4i6mg@mail.gmail.com

Have these all been resolved?

Thanks.