Message ID: 165590626293.75778.9843437418112335153.stgit@manet.1015granger.net (mailing list archive)
Series: Overhaul NFSD filecache
Hi,

fstests generic/531 triggered a panic on kernel 5.19.0-rc3 with this
patchset.

[  405.478056] BUG: kernel NULL pointer dereference, address: 0000000000000049
[  405.608016] Call Trace:
[  405.608016]  <TASK>
[  405.613020]  nfs4_get_vfs_file+0x325/0x410 [nfsd]
[  405.618018]  nfsd4_process_open2+0x4ba/0x16d0 [nfsd]
[  405.623016]  ? inode_get_bytes+0x38/0x40
[  405.623016]  ? nfsd_permission+0x97/0xf0 [nfsd]
[  405.628022]  ? fh_verify+0x1cc/0x6f0 [nfsd]
[  405.633025]  nfsd4_open+0x640/0xb30 [nfsd]
[  405.638025]  nfsd4_proc_compound+0x3bd/0x710 [nfsd]
[  405.643017]  nfsd_dispatch+0x143/0x270 [nfsd]
[  405.648019]  svc_process_common+0x3bf/0x5b0 [sunrpc]

more detail in attachment file(531.dmesg)

local.config of fstests:
  export NFS_MOUNT_OPTIONS="-o rw,relatime,vers=4.2,nconnect=8"
changes of generic/531:
  max_allowable_files=$(( 1 * 1024 * 1024 / $nr_cpus / 2 ))

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/23

> This series overhauls the NFSD filecache, a cache of server-side
> "struct file" objects recently used by NFS clients. The purposes of
> this overhaul are an immediate improvement in cache scalability in
> the number of open files, and preparation for further improvements.
>
> There are three categories of patches in this series:
>
> 1. Add observability of cache operation so we can see what we're
>    doing as changes are made to the code.
>
> 2. Improve the scalability of filecache garbage collection,
>    addressing several bugs along the way.
>
> 3. Improve the scalability of the filecache hash table by converting
>    it to use rhashtable.
>
> The series as it stands survives typical test workloads. Running
> stress-tests like generic/531 is the next step.
>
> These patches are also available in the linux-nfs-bugzilla-386
> branch of
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>
> ---
>
> Chuck Lever (30):
>       NFSD: Report filecache LRU size
>       NFSD: Report count of calls to nfsd_file_acquire()
>       NFSD: Report count of freed filecache items
>       NFSD: Report average age of filecache items
>       NFSD: Add nfsd_file_lru_dispose_list() helper
>       NFSD: Refactor nfsd_file_gc()
>       NFSD: Refactor nfsd_file_lru_scan()
>       NFSD: Report the number of items evicted by the LRU walk
>       NFSD: Record number of flush calls
>       NFSD: Report filecache item construction failures
>       NFSD: Zero counters when the filecache is re-initialized
>       NFSD: Hook up the filecache stat file
>       NFSD: WARN when freeing an item still linked via nf_lru
>       NFSD: Trace filecache LRU activity
>       NFSD: Leave open files out of the filecache LRU
>       NFSD: Fix the filecache LRU shrinker
>       NFSD: Never call nfsd_file_gc() in foreground paths
>       NFSD: No longer record nf_hashval in the trace log
>       NFSD: Remove lockdep assertion from unhash_and_release_locked()
>       NFSD: nfsd_file_unhash can compute hashval from nf->nf_inode
>       NFSD: Refactor __nfsd_file_close_inode()
>       NFSD: nfsd_file_hash_remove can compute hashval
>       NFSD: Remove nfsd_file::nf_hashval
>       NFSD: Remove stale comment from nfsd_file_acquire()
>       NFSD: Clean up "open file" case in nfsd_file_acquire()
>       NFSD: Document nfsd_file_cache_purge() API contract
>       NFSD: Replace the "init once" mechanism
>       NFSD: Set up an rhashtable for the filecache
>       NFSD: Convert the filecache to use rhashtable
>       NFSD: Clean up unused code after rhashtable conversion
>
>  fs/nfsd/filecache.c | 677 +++++++++++++++++++++++++++-----------------
>  fs/nfsd/filecache.h |   6 +-
>  fs/nfsd/nfsctl.c    |  10 +
>  fs/nfsd/trace.h     | 117 ++++++--
>  4 files changed, 522 insertions(+), 288 deletions(-)
>
> --
> Chuck Lever
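For scale, Wang's modified `max_allowable_files` line from generic/531 can be evaluated on its own. The `nr_cpus` value below is an assumption (the report doesn't say how many CPUs the test client has), so the result is only illustrative:

```shell
# Wang's modified clamp from generic/531, evaluated stand-alone.
# nr_cpus=8 is an assumed value, not taken from the report.
nr_cpus=8
max_allowable_files=$(( 1 * 1024 * 1024 / nr_cpus / 2 ))
echo "max_allowable_files=$max_allowable_files"
```

With 8 CPUs this caps the limit at 65536 open files, a far smaller bound than the stock test derives from /proc/sys/fs/file-max on Chuck's client.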
> On Jun 22, 2022, at 2:36 PM, Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> fstests generic/531 triggered a panic on kernel 5.19.0-rc3 with this
> patchset.

As I mention in the cover letter, I haven't tried running generic/531
yet -- no claim at all that this is finished work and that #386 has
been fixed at this point. I'm merely interested in comments on the
general approach.

> [  405.478056] BUG: kernel NULL pointer dereference, address: 0000000000000049

The "RIP: " tells the location of the crash. Notice that the call
trace here does not include that information. From your attachment:

[  405.518022] RIP: 0010:nfsd_do_file_acquire+0x4e1/0xb80 [nfsd]

To match that to a line of source code:

[cel@manet ~]$ cd src/linux/linux/
[cel@manet linux]$ scripts/faddr2line ../obj/manet/fs/nfsd/filecache.o nfsd_do_file_acquire+0x4e1
nfsd_do_file_acquire+0x4e1/0xfc0:
rht_bucket_insert at /home/cel/src/linux/linux/include/linux/rhashtable.h:303
(inlined by) __rhashtable_insert_fast at /home/cel/src/linux/linux/include/linux/rhashtable.h:718
(inlined by) rhashtable_lookup_get_insert_key at /home/cel/src/linux/linux/include/linux/rhashtable.h:982
(inlined by) nfsd_file_insert at /home/cel/src/linux/linux/fs/nfsd/filecache.c:1031
(inlined by) nfsd_do_file_acquire at /home/cel/src/linux/linux/fs/nfsd/filecache.c:1089
[cel@manet linux]$

This is an example; I'm sure my compiled objects don't match yours.

And, now that I've added observability, you should be able to do:

  # watch cat /proc/fs/nfsd/filecache

to see how many items are in the hash and LRU list while the test
is running.

> [  405.608016] Call Trace:
> [  405.608016]  <TASK>
> [  405.613020]  nfs4_get_vfs_file+0x325/0x410 [nfsd]
> [  405.618018]  nfsd4_process_open2+0x4ba/0x16d0 [nfsd]
> [  405.623016]  ? inode_get_bytes+0x38/0x40
> [  405.623016]  ? nfsd_permission+0x97/0xf0 [nfsd]
> [  405.628022]  ? fh_verify+0x1cc/0x6f0 [nfsd]
> [  405.633025]  nfsd4_open+0x640/0xb30 [nfsd]
> [  405.638025]  nfsd4_proc_compound+0x3bd/0x710 [nfsd]
> [  405.643017]  nfsd_dispatch+0x143/0x270 [nfsd]
> [  405.648019]  svc_process_common+0x3bf/0x5b0 [sunrpc]
>
> more detail in attachment file(531.dmesg)
>
> local.config of fstests:
> export NFS_MOUNT_OPTIONS="-o rw,relatime,vers=4.2,nconnect=8"
> changes of generic/531
> max_allowable_files=$(( 1 * 1024 * 1024 / $nr_cpus / 2 ))

Changed from:

  max_allowable_files=$(( $(cat /proc/sys/fs/file-max) / $nr_cpus / 2 ))

For my own information, what's $nr_cpus in your test?

Aside from the max_allowable_files setting, can you tell how the
test determines when it should stop creating files? Is it looking
for a particular error code from open(2), for instance?

On my client:

[cel@morisot generic]$ cat /proc/sys/fs/file-max
9223372036854775807
[cel@morisot generic]$

I wonder if it's realistic to expect an NFSv4 server to support
that many open files. Is 9 quintillion files really something
I'm going to have to engineer for, or is this just a crazy
test?

> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/23
>
>> [ ... quoted cover letter and patch list trimmed ... ]
>
> <531.dmesg>

--
Chuck Lever
> On Jun 22, 2022, at 3:04 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>> On Jun 22, 2022, at 2:36 PM, Wang Yugui <wangyugui@e16-tech.com> wrote:
>>
>> fstests generic/531 triggered a panic on kernel 5.19.0-rc3 with this
>> patchset.
>
> [ ... ]
>
>> [  405.608016] Call Trace:
>> [  405.608016]  <TASK>
>> [  405.613020]  nfs4_get_vfs_file+0x325/0x410 [nfsd]
>> [  405.618018]  nfsd4_process_open2+0x4ba/0x16d0 [nfsd]
>> [  405.623016]  ? inode_get_bytes+0x38/0x40
>> [  405.623016]  ? nfsd_permission+0x97/0xf0 [nfsd]
>> [  405.628022]  ? fh_verify+0x1cc/0x6f0 [nfsd]
>> [  405.633025]  nfsd4_open+0x640/0xb30 [nfsd]
>> [  405.638025]  nfsd4_proc_compound+0x3bd/0x710 [nfsd]
>> [  405.643017]  nfsd_dispatch+0x143/0x270 [nfsd]
>> [  405.648019]  svc_process_common+0x3bf/0x5b0 [sunrpc]

I was able to trigger something that looks very much like this crash.
If you remove this line from fs/nfsd/filecache.c:

	.max_size		= 131072,	/* buckets */

things get a lot more stable for generic/531.

I'm looking into the issue now.

> [ ... remainder of quoted reply and cover letter trimmed ... ]

--
Chuck Lever
On Wed, Jun 22, 2022 at 07:04:39PM +0000, Chuck Lever III wrote:
> > more detail in attachment file(531.dmesg)
> >
> > local.config of fstests:
> > export NFS_MOUNT_OPTIONS="-o rw,relatime,vers=4.2,nconnect=8"
> > changes of generic/531
> > max_allowable_files=$(( 1 * 1024 * 1024 / $nr_cpus / 2 ))
>
> Changed from:
>
> max_allowable_files=$(( $(cat /proc/sys/fs/file-max) / $nr_cpus / 2 ))
>
> For my own information, what's $nr_cpus in your test?
>
> Aside from the max_allowable_files setting, can you tell how the
> test determines when it should stop creating files? Is it looking
> for a particular error code from open(2), for instance?
>
> On my client:
>
> [cel@morisot generic]$ cat /proc/sys/fs/file-max
> 9223372036854775807
> [cel@morisot generic]$

$ echo $((2**63 - 1))
9223372036854775807

i.e. LLONG_MAX, or "no limit is set".

> I wonder if it's realistic to expect an NFSv4 server to support
> that many open files. Is 9 quintillion files really something
> I'm going to have to engineer for, or is this just a crazy
> test?

The test does not use the value directly - it's a max value for
clamping:

  max_files=$((50000 * LOAD_FACTOR))
  max_allowable_files=$(( $(cat /proc/sys/fs/file-max) / $nr_cpus / 2 ))
  test $max_allowable_files -gt 0 && test $max_files -gt $max_allowable_files && \
      max_files=$max_allowable_files
  ulimit -n $max_files

i.e. the result should be

  max_files = min(50000 * LOAD_FACTOR, max_allowable_files)

So the test should only be allowing 50,000 open unlinked files to be
created before unmounting. Which means there's lots of silly
renaming going on at the client and so the server is probably seeing
100,000 unique file handles across the test....

Cheers,

Dave.
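The clamp quoted above can be re-run stand-alone to confirm that the test settles on the minimum of the two bounds. LOAD_FACTOR and nr_cpus below are assumed values for illustration; file_max is the LLONG_MAX figure from Chuck's client:

```shell
# Stand-alone re-run of the generic/531 ulimit clamp quoted above.
# LOAD_FACTOR=1 and nr_cpus=8 are assumptions for illustration.
LOAD_FACTOR=1
nr_cpus=8
file_max=9223372036854775807    # LLONG_MAX, i.e. "no limit is set"

max_files=$(( 50000 * LOAD_FACTOR ))
max_allowable_files=$(( file_max / nr_cpus / 2 ))
# Clamp down only when the derived ceiling is positive and smaller.
if [ "$max_allowable_files" -gt 0 ] && [ "$max_files" -gt "$max_allowable_files" ]; then
    max_files=$max_allowable_files
fi
echo "max_files=$max_files"
```

With an effectively unlimited file-max the clamp never fires, so max_files stays at 50000 * LOAD_FACTOR open unlinked files per worker.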
> On Jun 22, 2022, at 8:21 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Jun 22, 2022 at 07:04:39PM +0000, Chuck Lever III wrote:
>> [ ... ]
>>
>> On my client:
>>
>> [cel@morisot generic]$ cat /proc/sys/fs/file-max
>> 9223372036854775807
>> [cel@morisot generic]$
>
> $ echo $((2**63 - 1))
> 9223372036854775807
>
> i.e. LLONG_MAX, or "no limit is set".
>
>> I wonder if it's realistic to expect an NFSv4 server to support
>> that many open files. Is 9 quintillion files really something
>> I'm going to have to engineer for, or is this just a crazy
>> test?
>
> The test does not use the value directly - it's a max value for
> clamping:
>
>   max_files=$((50000 * LOAD_FACTOR))
>   max_allowable_files=$(( $(cat /proc/sys/fs/file-max) / $nr_cpus / 2 ))
>   test $max_allowable_files -gt 0 && test $max_files -gt $max_allowable_files && \
>       max_files=$max_allowable_files
>   ulimit -n $max_files
>
> i.e. the result should be
>
>   max_files = min(50000 * LOAD_FACTOR, max_allowable_files)
>
> So the test should only be allowing 50,000 open unlinked files to be
> created before unmounting.

Looking at my testing, it's ~50,000 per worker thread, and there are
2 workers per physical core on the client. But thankfully, this is
much smaller than 9 quintillion.

> Which means there's lots of silly
> renaming going on at the client and so the server is probably seeing
> 100,000 unique file handles across the test....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

--
Chuck Lever
Hi,

> > On Jun 22, 2022, at 3:04 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > [ ... earlier discussion trimmed ... ]
>
> I was able to trigger something that looks very much like this crash.
> If you remove this line from fs/nfsd/filecache.c:
>
>   .max_size = 131072, /* buckets */
>
> things get a lot more stable for generic/531.
>
> I'm looking into the issue now.

Yes. When '.max_size = 131072' is removed, fstests generic/531 passed.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/23
> On Jun 23, 2022, at 5:02 AM, Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> [ ... earlier discussion trimmed ... ]
>
>> I was able to trigger something that looks very much like this crash.
>> If you remove this line from fs/nfsd/filecache.c:
>>
>>   .max_size = 131072, /* buckets */
>>
>> things get a lot more stable for generic/531.
>>
>> I'm looking into the issue now.
>
> Yes. When '.max_size = 131072' is removed, fstests generic/531 passed.

Great! Are you comfortable with this general approach for bug #386?

--
Chuck Lever
Hi,

> [ ... earlier discussion trimmed ... ]
>
> >> I was able to trigger something that looks very much like this crash.
> >> If you remove this line from fs/nfsd/filecache.c:
> >>
> >>   .max_size = 131072, /* buckets */
> >>
> >> things get a lot more stable for generic/531.
> >>
> >> I'm looking into the issue now.
> >
> > Yes. When '.max_size = 131072' is removed, fstests generic/531 passed.
>
> Great! Are you comfortable with this general approach for bug #386?

It seems a good result for #386.

fstests generic/531 (file-max: 1M) performance result:
  base (5.19.0-rc3, 12-bit hash, serialized nfsd_file_gc): 222s
  this patchset (.min_size=4096): 59s
so, a good improvement for #386.

It seems a good (acceptable) result for #387 too.
The period of 'text busy' (exec directly from the back-end of the
nfs-server) is about 4s.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/24
On Wed, Jun 22, 2022 at 7:12 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> This series overhauls the NFSD filecache, a cache of server-side
> "struct file" objects recently used by NFS clients. The purposes of
> this overhaul are an immediate improvement in cache scalability in
> the number of open files, and preparation for further improvements.
>
> There are three categories of patches in this series:
>
> 1. Add observability of cache operation so we can see what we're
>    doing as changes are made to the code.
>
> 2. Improve the scalability of filecache garbage collection,
>    addressing several bugs along the way.
>
> 3. Improve the scalability of the filecache hash table by converting
>    it to use rhashtable.
>
> The series as it stands survives typical test workloads. Running
> stress-tests like generic/531 is the next step.
>
> These patches are also available in the linux-nfs-bugzilla-386
> branch of
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>
> [ ... patch list and diffstat trimmed ... ]
>
> --
> Chuck Lever

Yep, looks good so far, thanks for doing this. Somewhat similar to my
(buggy) attempt at fixing it that I sent at the time (don't put open
files on the LRU, and use rhashtable), but cleaner and, presumably,
less buggy :)

Can't test it right now, but it seems like Wang already confirmed
that it works.

- Frank
> On Jun 23, 2022, at 1:51 PM, Wang Yugui <wangyugui@e16-tech.com> wrote: > > Hi, > >>> On Jun 23, 2022, at 5:02 AM, Wang Yugui <wangyugui@e16-tech.com> wrote: >>> >>> Hi, >>> >>>>> On Jun 22, 2022, at 3:04 PM, Chuck Lever III <chuck.lever@oracle.com> wrote: >>>>>> On Jun 22, 2022, at 2:36 PM, Wang Yugui <wangyugui@e16-tech.com> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> fstests generic/531 triggered a panic on kernel 5.19.0-rc3 with this >>>>>> patchset. >>>>> >>>>> As I mention in the cover letter, I haven't tried running generic/531 >>>>> yet -- no claim at all that this is finished work and that #386 has >>>>> been fixed at this point. I'm merely interested in comments on the >>>>> general approach. >>>>> >>>>> >>>>>> [ 405.478056] BUG: kernel NULL pointer dereference, address: 0000000000000049 >>>>> >>>>> The "RIP: " tells the location of the crash. Notice that the call >>>>> trace here does not include that information. From your attachment: >>>>> >>>>> [ 405.518022] RIP: 0010:nfsd_do_file_acquire+0x4e1/0xb80 [nfsd] >>>>> >>>>> To match that to a line of source code: >>>>> >>>>> [cel@manet ~]$ cd src/linux/linux/ >>>>> [cel@manet linux]$ scripts/faddr2line ../obj/manet/fs/nfsd/filecache.o nfsd_do_file_acquire+0x4e1 >>>>> nfsd_do_file_acquire+0x4e1/0xfc0: >>>>> rht_bucket_insert at /home/cel/src/linux/linux/include/linux/rhashtable.h:303 >>>>> (inlined by) __rhashtable_insert_fast at /home/cel/src/linux/linux/include/linux/rhashtable.h:718 >>>>> (inlined by) rhashtable_lookup_get_insert_key at /home/cel/src/linux/linux/include/linux/rhashtable.h:982 >>>>> (inlined by) nfsd_file_insert at /home/cel/src/linux/linux/fs/nfsd/filecache.c:1031 >>>>> (inlined by) nfsd_do_file_acquire at /home/cel/src/linux/linux/fs/nfsd/filecache.c:1089 >>>>> [cel@manet linux]$ >>>>> >>>>> This is an example, I'm sure my compiled objects don't match yours. 
>>>>>
>>>>> And, now that I've added observability, you should be able to do:
>>>>>
>>>>>   # watch cat /proc/fs/nfsd/filecache
>>>>>
>>>>> to see how many items are in the hash and LRU list while the test
>>>>> is running.
>>>>>
>>>>>
>>>>>> [ 405.608016] Call Trace:
>>>>>> [ 405.608016]  <TASK>
>>>>>> [ 405.613020]  nfs4_get_vfs_file+0x325/0x410 [nfsd]
>>>>>> [ 405.618018]  nfsd4_process_open2+0x4ba/0x16d0 [nfsd]
>>>>>> [ 405.623016]  ? inode_get_bytes+0x38/0x40
>>>>>> [ 405.623016]  ? nfsd_permission+0x97/0xf0 [nfsd]
>>>>>> [ 405.628022]  ? fh_verify+0x1cc/0x6f0 [nfsd]
>>>>>> [ 405.633025]  nfsd4_open+0x640/0xb30 [nfsd]
>>>>>> [ 405.638025]  nfsd4_proc_compound+0x3bd/0x710 [nfsd]
>>>>>> [ 405.643017]  nfsd_dispatch+0x143/0x270 [nfsd]
>>>>>> [ 405.648019]  svc_process_common+0x3bf/0x5b0 [sunrpc]
>>>>
>>>> I was able to trigger something that looks very much like this crash.
>>>> If you remove this line from fs/nfsd/filecache.c:
>>>>
>>>>    .max_size = 131072, /* buckets */
>>>>
>>>> things get a lot more stable for generic/531.
>>>>
>>>> I'm looking into the issue now.
>>>
>>> Yes. When '.max_size = 131072' is removed, fstests generic/531 passed.
>>
>> Great! Are you comfortable with this general approach for bug #386?
>
> It seems a good result for #386.
>
> fstests generic/531 (file-max: 1M) performance result:
>   base (5.19.0-rc3, 12-bit hash, serialized nfsd_file_gc): 222s
>   this patchset (.min_size=4096): 59s
> so, a good improvement for #386.
>
> It seems a good (acceptable) result for #387 too.
> The period of 'text busy' (exec directly from the back end of the
> NFS server) is about 4s.

I was surprised to learn that NFSv4 doesn't close those files outright;
they remain in the filecache for a while. I think we can do better
there, but I haven't looked closely at that yet.

I see that files are left open on the server by crashed NFSv4 clients
too.
That will result in the server becoming unusable after significant
uptime, which I've seen on occasion (because I crash my clients a lot)
but never looked into until now.

--
Chuck Lever
> On Jun 23, 2022, at 4:27 PM, Frank van der Linden <fvdl@google.com> wrote:
>
> On Wed, Jun 22, 2022 at 7:12 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>
>> This series overhauls the NFSD filecache, a cache of server-side
>> "struct file" objects recently used by NFS clients. The purposes of
>> this overhaul are an immediate improvement in cache scalability in
>> the number of open files, and preparation for further improvements.
>>
>> There are three categories of patches in this series:
>>
>> 1. Add observability of cache operation so we can see what we're
>>    doing as changes are made to the code.
>>
>> 2. Improve the scalability of filecache garbage collection,
>>    addressing several bugs along the way.
>>
>> 3. Improve the scalability of the filecache hash table by converting
>>    it to use rhashtable.
>>
>> The series as it stands survives typical test workloads. Running
>> stress-tests like generic/531 is the next step.
>>
>> These patches are also available in the linux-nfs-bugzilla-386
>> branch of
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
>>
>> ---
>>
>> Chuck Lever (30):
>>       NFSD: Report filecache LRU size
>>       NFSD: Report count of calls to nfsd_file_acquire()
>>       NFSD: Report count of freed filecache items
>>       NFSD: Report average age of filecache items
>>       NFSD: Add nfsd_file_lru_dispose_list() helper
>>       NFSD: Refactor nfsd_file_gc()
>>       NFSD: Refactor nfsd_file_lru_scan()
>>       NFSD: Report the number of items evicted by the LRU walk
>>       NFSD: Record number of flush calls
>>       NFSD: Report filecache item construction failures
>>       NFSD: Zero counters when the filecache is re-initialized
>>       NFSD: Hook up the filecache stat file
>>       NFSD: WARN when freeing an item still linked via nf_lru
>>       NFSD: Trace filecache LRU activity
>>       NFSD: Leave open files out of the filecache LRU
>>       NFSD: Fix the filecache LRU shrinker
>>       NFSD: Never call nfsd_file_gc() in foreground paths
>>       NFSD: No longer record nf_hashval in the trace log
>>       NFSD: Remove lockdep assertion from unhash_and_release_locked()
>>       NFSD: nfsd_file_unhash can compute hashval from nf->nf_inode
>>       NFSD: Refactor __nfsd_file_close_inode()
>>       NFSD: nfsd_file_hash_remove can compute hashval
>>       NFSD: Remove nfsd_file::nf_hashval
>>       NFSD: Remove stale comment from nfsd_file_acquire()
>>       NFSD: Clean up "open file" case in nfsd_file_acquire()
>>       NFSD: Document nfsd_file_cache_purge() API contract
>>       NFSD: Replace the "init once" mechanism
>>       NFSD: Set up an rhashtable for the filecache
>>       NFSD: Convert the filecache to use rhashtable
>>       NFSD: Clean up unused code after rhashtable conversion
>>
>>
>>  fs/nfsd/filecache.c | 677 +++++++++++++++++++++++++++-----------------
>>  fs/nfsd/filecache.h |   6 +-
>>  fs/nfsd/nfsctl.c    |  10 +
>>  fs/nfsd/trace.h     | 117 ++++++--
>>  4 files changed, 522 insertions(+), 288 deletions(-)
>>
>> --
>> Chuck Lever
>>
>
> Yep, looks good so far, thanks for doing this. Somewhat similar to my
> (buggy) attempt at fixing it that I sent at the time (don't put open
> files on the LRU, and use rhashtable), but cleaner and, presumably,
> less buggy :)
>
> Can't test it right now, but it seems like Wang already confirmed that
> it works.

Frank, thanks to you and Wang for reporting this issue, and sorry for
taking so long to address it.

--
Chuck Lever