Message ID | 20230414134908.103932-1-hare@suse.de
---|---
State | Mainlined, archived
Series | mm/filemap: allocate folios according to the blocksize
On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> If the blocksize is larger than the pagesize allocate folios
> with the correct order.

I think you also need to get the one in page_cache_ra_unbounded(),
and page_cache_ra_order() also needs some love?  It might help to put
an assertion in __filemap_add_folio() to be sure we're not doing
something unacceptable.
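For illustration, such an assertion might look like the sketch below. The
helper name is invented for this example, and it derives the minimum order
from i_blkbits the same way the patch under discussion does:

	/*
	 * Sketch of the suggested assertion (hypothetical helper, to be
	 * called from __filemap_add_folio() before insertion): catch any
	 * folio added to the page cache that is smaller than the
	 * filesystem block size.
	 */
	static inline void assert_folio_fits_blocksize(struct address_space *mapping,
						       struct folio *folio)
	{
		unsigned int min_order = 0;

		if (mapping->host && mapping->host->i_blkbits > PAGE_SHIFT)
			min_order = mapping->host->i_blkbits - PAGE_SHIFT;

		/* a smaller folio here could never hold one block */
		VM_WARN_ON_ONCE_FOLIO(folio_order(folio) < min_order, folio);
	}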
On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> @@ -3607,14 +3611,16 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  		pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
>  {
>  	struct folio *folio;
> -	int err;
> +	int err, order = 0;
>  
> +	if (mapping->host->i_blkbits > PAGE_SHIFT)
> +		order = mapping->host->i_blkbits - PAGE_SHIFT;

This pattern comes up a few times.  What I did in a patch I wrote back
in December 2020 and never submitted (!) was this:

@@ -198,9 +198,15 @@ enum mapping_flags {
 	AS_EXITING	= 4,	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13,
+	/* 8-17 are used for FOLIO_ORDER */
 };

+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)

...

+static inline unsigned mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}

(do we really need 5 bits for each, or can we get by with eg 3 or 4 bits?)

Anyway, the point is that we could set this quite early in the creation
of the mapping, and eliminate the conditional setting of order.
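A matching setter is not shown in the quoted fragment; a minimal sketch
under the same (hypothetical) AS_FOLIO_ORDER_* bit layout might look like:

	/*
	 * Sketch only: store the allowed folio order range in
	 * mapping->flags, using the AS_FOLIO_ORDER_* layout quoted above.
	 * Intended to be called once, when the mapping is created, so
	 * allocation sites can read the order back instead of re-deriving
	 * it from i_blkbits.  Not an existing kernel API in this thread.
	 */
	static inline void mapping_set_folio_order_range(struct address_space *mapping,
							 unsigned int min,
							 unsigned int max)
	{
		mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
				 (min << AS_FOLIO_ORDER_MIN) |
				 (max << AS_FOLIO_ORDER_MAX);
	}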
Hello,

kernel test robot noticed "WARNING:at_include/linux/pagemap.h:#split_huge_page_to_list" on:

commit: 3f1c33c25c31221a7a27d302ce6aac8e9b71edbb ("[PATCH] mm/filemap: allocate folios according to the blocksize")
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/mm-filemap-allocate-folios-according-to-the-blocksize/20230414-215648
base: v6.3-rc6
patch link: https://lore.kernel.org/all/20230414134908.103932-1-hare@suse.de/
patch subject: [PATCH] mm/filemap: allocate folios according to the blocksize

in testcase: xfstests
version: xfstests-x86_64-a7df89e-1_20230410
with following parameters:

	disk: 4HDD
	fs: ext4
	fs2: smbv3
	test: generic-group-28

compiler: gcc-11
test machine: 8 threads Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (Skylake) with 16G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Link: https://lore.kernel.org/oe-lkp/202304170945.28aa41a0-oliver.sang@intel.com

[   96.792002][ T3511] ------------[ cut here ]------------
[   96.797305][ T3511] WARNING: CPU: 3 PID: 3511 at include/linux/pagemap.h:344 split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)
[   96.807438][ T3511] Modules linked in: loop cmac nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver dm_mod btrfs blake2b_generic xor intel_rapl_msr raid6_pq intel_rapl_common zstd_compress libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp sd_mod t10_pi kvm_intel crc64_rocksoft_generic i915 kvm crc64_rocksoft crc64 irqbypass crct10dif_pclmul sg crc32_pclmul ipmi_devintf crc32c_intel ipmi_msghandler drm_buddy ghash_clmulni_intel intel_gtt sha512_ssse3 drm_display_helper drm_kms_helper mei_wdt syscopyarea ahci rapl sysfillrect mei_me intel_cstate libahci sysimgblt wmi_bmof video intel_uncore serio_raw libata mei intel_pch_thermal ttm intel_pmc_core wmi acpi_pad tpm_infineon drm fuse ip_tables
[   96.868429][ T3511] CPU: 3 PID: 3511 Comm: xfs_io Not tainted 6.3.0-rc6-00001-g3f1c33c25c31 #1
[   96.877009][ T3511] Hardware name: HP HP Z240 SFF Workstation/802E, BIOS N51 Ver. 01.63 10/05/2017
[   96.885923][ T3511] RIP: 0010:split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)
[   96.891989][ T3511] Code: 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 31 02 00 00 49 8b 84 24 98 00 00 00 a8 40 0f 85 f6 fe ff ff <0f> 0b e9 ef fe ff ff 81 e6 60 ec 0b 00 a9 00 00 04 00 44 0f 45 ee

All code
========
   0:	89 f2                	mov    %esi,%edx
   2:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   9:	fc ff df
   c:	48 c1 ea 03          	shr    $0x3,%rdx
  10:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
  14:	0f 85 31 02 00 00    	jne    0x24b
  1a:	49 8b 84 24 98 00 00 	mov    0x98(%r12),%rax
  21:	00
  22:	a8 40                	test   $0x40,%al
  24:	0f 85 f6 fe ff ff    	jne    0xffffffffffffff20
  2a:*	0f 0b                	ud2		<-- trapping instruction
  2c:	e9 ef fe ff ff       	jmpq   0xffffffffffffff20
  31:	81 e6 60 ec 0b 00    	and    $0xbec60,%esi
  37:	a9 00 00 04 00       	test   $0x40000,%eax
  3c:	44 0f 45 ee          	cmovne %esi,%r13d

Code starting with the faulting instruction
===========================================
   0:	0f 0b                	ud2
   2:	e9 ef fe ff ff       	jmpq   0xfffffffffffffef6
   7:	81 e6 60 ec 0b 00    	and    $0xbec60,%esi
   d:	a9 00 00 04 00       	test   $0x40000,%eax
  12:	44 0f 45 ee          	cmovne %esi,%r13d

[   96.911358][ T3511] RSP: 0018:ffffc9000420f9d8 EFLAGS: 00010046
[   96.917253][ T3511] RAX: 0000000000000000 RBX: 1ffff92000841f43 RCX: ffffffff8192f43e
[   96.925040][ T3511] RDX: 1ffff11084ce792a RSI: 0000000000000008 RDI: ffff88842673c950
[   96.932836][ T3511] RBP: ffffea0005796700 R08: 0000000000000000 R09: ffff88842673c957
[   96.940625][ T3511] R10: ffffed1084ce792a R11: 0000000000000001 R12: ffff88842673c8b8
[   96.948419][ T3511] R13: 0000000000000000 R14: ffff88842673c950 R15: ffffea0005796700
[   96.956211][ T3511] FS:  00007f83d8bdc800(0000) GS:ffff8883cf580000(0000) knlGS:0000000000000000
[   96.964956][ T3511] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   96.971364][ T3511] CR2: 00007f83d8bdaef8 CR3: 000000013f57e006 CR4: 00000000003706e0
[   96.979162][ T3511] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   96.986958][ T3511] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   96.994758][ T3511] Call Trace:
[   96.997888][ T3511]  <TASK>
[   97.000679][ T3511]  ? can_split_folio (mm/huge_memory.c:2646)
[   97.005455][ T3511]  ? __filemap_get_folio (arch/x86/include/asm/bitops.h:138 arch/x86/include/asm/bitops.h:144 include/asm-generic/bitops/instrumented-lock.h:58 include/linux/pagemap.h:915 include/linux/pagemap.h:951 mm/filemap.c:1936)
[   97.010578][ T3511]  ? zero_user_segments (include/linux/highmem.h:282)
[   97.016651][ T3511]  truncate_inode_partial_folio (mm/truncate.c:243)
[   97.022376][ T3511]  truncate_inode_pages_range (mm/truncate.c:380)
[   97.027928][ T3511]  ? folio_add_lru (arch/x86/include/asm/preempt.h:85 mm/swap.c:518)
[   97.032358][ T3511]  ? truncate_inode_partial_folio (mm/truncate.c:332)
[   97.038261][ T3511]  ? policy_node (include/linux/nodemask.h:266 mm/mempolicy.c:1869)
[   97.042604][ T3511]  ? __cond_resched (kernel/sched/core.c:8489)
[   97.047115][ T3511]  ? down_read (arch/x86/include/asm/atomic64_64.h:34 include/linux/atomic/atomic-long.h:41 include/linux/atomic/atomic-instrumented.h:1280 kernel/locking/rwsem.c:176 kernel/locking/rwsem.c:181 kernel/locking/rwsem.c:249 kernel/locking/rwsem.c:1249 kernel/locking/rwsem.c:1263 kernel/locking/rwsem.c:1522)
[   97.051372][ T3511]  ? rwsem_down_read_slowpath (kernel/locking/rwsem.c:1518)
[   97.056924][ T3511]  ? SMB2_set_eof (fs/cifs/smb2pdu.c:5197) cifs
[   97.062103][ T3511]  ? up_read (arch/x86/include/asm/atomic64_64.h:160 include/linux/atomic/atomic-long.h:71 include/linux/atomic/atomic-instrumented.h:1318 kernel/locking/rwsem.c:1347 kernel/locking/rwsem.c:1616)
[   97.066015][ T3511]  ? unmap_mapping_range (mm/memory.c:3541)
[   97.071051][ T3511]  ? __do_fault (mm/memory.c:3541)
[   97.075387][ T3511]  ? _raw_spin_lock (arch/x86/include/asm/atomic.h:202 include/linux/atomic/atomic-instrumented.h:543 include/asm-generic/qspinlock.h:111 include/linux/spinlock.h:186 include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
[   97.079896][ T3511]  ? __do_fault (mm/memory.c:3541)
[   97.084234][ T3511]  ? zero_user_segments+0x19e/0x240 cifs
[   97.090958][ T3511]  truncate_pagecache (mm/truncate.c:744)
[   97.095638][ T3511]  smb3_simple_falloc+0xcbf/0x1840 cifs
[   97.101834][ T3511]  ? smb3_fiemap (fs/cifs/smb2ops.c:3587) cifs
[   97.106919][ T3511]  ? __do_sys_clone (kernel/fork.c:2812)
[   97.111433][ T3511]  ? __do_sys_vfork (kernel/fork.c:2812)
[   97.115948][ T3511]  vfs_fallocate (fs/open.c:324)
[   97.120381][ T3511]  __x64_sys_fallocate (include/linux/file.h:44 fs/open.c:348 fs/open.c:355 fs/open.c:353 fs/open.c:353)
[   97.125238][ T3511]  do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[   97.129490][ T3511]  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[   97.135206][ T3511] RIP: 0033:0x7f83d8f5d647
[   97.139454][ T3511] Code: 89 7c 24 08 48 89 4c 24 18 e8 55 07 f9 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 8b 74 24 0c 8b 7c 24 08 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 89 44 24 08 e8 85 07 f9 ff 8b 44

All code
========
   0:	89 7c 24 08          	mov    %edi,0x8(%rsp)
   4:	48 89 4c 24 18       	mov    %rcx,0x18(%rsp)
   9:	e8 55 07 f9 ff       	callq  0xfffffffffff90763
   e:	4c 8b 54 24 18       	mov    0x18(%rsp),%r10
  13:	48 8b 54 24 10       	mov    0x10(%rsp),%rdx
  18:	41 89 c0             	mov    %eax,%r8d
  1b:	8b 74 24 0c          	mov    0xc(%rsp),%esi
  1f:	8b 7c 24 08          	mov    0x8(%rsp),%edi
  23:	b8 1d 01 00 00       	mov    $0x11d,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax		<-- trapping instruction
  30:	77 31                	ja     0x63
  32:	44 89 c7             	mov    %r8d,%edi
  35:	89 44 24 08          	mov    %eax,0x8(%rsp)
  39:	e8 85 07 f9 ff       	callq  0xfffffffffff907c3
  3e:	8b                   	.byte 0x8b
  3f:	44                   	rex.R

Code starting with the faulting instruction
===========================================
   0:	48 3d 00 f0 ff ff    	cmp    $0xfffffffffffff000,%rax
   6:	77 31                	ja     0x39
   8:	44 89 c7             	mov    %r8d,%edi
   b:	89 44 24 08          	mov    %eax,0x8(%rsp)
   f:	e8 85 07 f9 ff       	callq  0xfffffffffff90799
  14:	8b                   	.byte 0x8b
  15:	44                   	rex.R

[   97.158825][ T3511] RSP: 002b:00007fffd0739670 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[   97.167058][ T3511] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f83d8f5d647
[   97.174858][ T3511] RDX: 00000000000003ff RSI: 0000000000000000 RDI: 0000000000000004
[   97.182644][ T3511] RBP: 000055cbb600cf50 R08: 0000000000000000 R09: 0000000000000000
[   97.190444][ T3511] R10: 0000000000000002 R11: 0000000000000293 R12: 000055cbb4968340
[   97.198236][ T3511] R13: 0000000000000000 R14: 000055cbb600cf30 R15: 000055cbb600cf50
[   97.206035][ T3511]  </TASK>
[   97.208906][ T3511] ---[ end trace 0000000000000000 ]---
[   97.331260][  T292] generic/568	_check_dmesg: something found in dmesg (see /lkp/benchmarks/xfstests/results//generic/568.dmesg)
[   97.331271][  T292]

To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        sudo bin/lkp install job.yaml           # job file is attached in this email
        bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
        sudo bin/lkp run generated-yaml-file

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
On Mon, Apr 17, 2023 at 10:18:37AM +0800, kernel test robot wrote:
> commit: 3f1c33c25c31221a7a27d302ce6aac8e9b71edbb ("[PATCH] mm/filemap: allocate folios according to the blocksize")
> [   96.797305][ T3511] WARNING: CPU: 3 PID: 3511 at include/linux/pagemap.h:344 split_huge_page_to_list (include/linux/pagemap.h:344 mm/huge_memory.c:2767)

Oh, funny.  That's:

	WARN_ON_ONCE(mapping_large_folio_support(mapping) == 0);

so it's found a large folio in a mapping which doesn't claim to support
large folios.  This is a good warning; thank you, bot!

> [   96.994758][ T3511] Call Trace:
> [   96.997888][ T3511]  <TASK>
> [   97.016651][ T3511]  truncate_inode_partial_folio (mm/truncate.c:243)
> [   97.022376][ T3511]  truncate_inode_pages_range (mm/truncate.c:380)
> [   97.090958][ T3511]  truncate_pagecache (mm/truncate.c:744)
> [   97.095638][ T3511]  smb3_simple_falloc+0xcbf/0x1840 cifs

Ah, it's coming from smb/cifs.  I bet it has set i_blkbits to something
large.  Looks like we need to do something like I suggested, and set
the minimum folio size in the mapping rather than basing it on
mapping->host->i_blkbits.
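As a sketch of that direction, building on the hypothetical
mapping_set_folio_order_range() from earlier in the thread: a filesystem
that genuinely supports large blocks would declare its minimum order once
at inode setup, while a filesystem like cifs, which merely reports a large
i_blkbits, would leave the minimum at zero and never see large folios it
cannot split:

	/*
	 * Sketch only: opt in to large folios at inode initialisation
	 * instead of deriving the order from i_blkbits at every
	 * allocation site.  Both the helper and this callsite are
	 * hypothetical; MAX_PAGECACHE_ORDER is assumed as the upper bound.
	 */
	static void example_init_mapping_orders(struct inode *inode)
	{
		unsigned int min_order = 0;

		if (inode->i_blkbits > PAGE_SHIFT)
			min_order = inode->i_blkbits - PAGE_SHIFT;

		/* cifs would not opt in and would keep min_order == 0 */
		mapping_set_folio_order_range(inode->i_mapping, min_order,
					      MAX_PAGECACHE_ORDER);
	}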
On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> If the blocksize is larger than the pagesize allocate folios
> with the correct order.

And how is that supposed to actually happen?
On 4/17/23 06:36, Christoph Hellwig wrote:
> On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
>> If the blocksize is larger than the pagesize allocate folios
>> with the correct order.
>
> And how is that supposed to actually happen?

By using my patchset for brd and setting the logical blocksize to e.g. 16k.

Cheers,

Hannes
On Mon, Apr 17, 2023 at 08:08:04AM +0200, Hannes Reinecke wrote:
> On 4/17/23 06:36, Christoph Hellwig wrote:
> > On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
> > > If the blocksize is larger than the pagesize allocate folios
> > > with the correct order.
> >
> > And how is that supposed to actually happen?
>
> By using my patchset for brd and setting the logical blocksize to e.g. 16k.

Then add it to your patchset instead of trying to add dead code to the
kernel while also losing context.
On 4/17/23 08:27, Christoph Hellwig wrote:
> On Mon, Apr 17, 2023 at 08:08:04AM +0200, Hannes Reinecke wrote:
>> On 4/17/23 06:36, Christoph Hellwig wrote:
>>> On Fri, Apr 14, 2023 at 03:49:08PM +0200, Hannes Reinecke wrote:
>>>> If the blocksize is larger than the pagesize allocate folios
>>>> with the correct order.
>>>
>>> And how is that supposed to actually happen?
>>
>> By using my patchset for brd and setting the logical blocksize to e.g. 16k.
>
> Then add it to your patchset instead of trying to add dead code to the
> kernel while also losing context.

Yeah, sorry. Was a bit overexcited here.

Cheers,

Hannes
To keep this thread alive and get some direction on the next steps, I made
some changes with which I am able to do **buffered reads** with fio on brd
with logical block size > 4k.

Along with your patches (this patch and the brd patches), I added the
following diff:

diff --git a/fs/mpage.c b/fs/mpage.c
index 242e213ee064..2e0c066d72d3 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	struct folio *folio = args->folio;
 	struct inode *inode = folio->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+	const unsigned blocks_per_page = folio_size(folio) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	struct buffer_head *map_bh = &args->map_bh;
 	sector_t block_in_file;
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..2e42b5127f4c 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -210,7 +210,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	unsigned long i;
-
+	int order = 0;
 	/*
 	 * Partway through the readahead operation, we will have added
 	 * locked pages to the page cache, but will not yet have submitted
@@ -223,6 +223,9 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	 */
 	unsigned int nofs = memalloc_nofs_save();
 
+	if (mapping->host->i_blkbits > PAGE_SHIFT)
+		order = mapping->host->i_blkbits - PAGE_SHIFT;
+
 	filemap_invalidate_lock_shared(mapping);
 	/*
 	 * Preallocate as many pages as we will need.
@@ -245,7 +248,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, order);
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
@@ -259,7 +262,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
 	}

And with that (drum roll):

root@debian:~# cat /sys/block/ram0/queue/logical_block_size
8192
root@debian:~# fio -bs=8k -iodepth=8 -rw=read -ioengine=io_uring -size=200M -name=io_uring_1 -filename=/dev/ram0
io_uring_1: (g=0): rw=read, bs=(R) 8192B-8192B, (W) 8192B-8192B, (T) 8192B-8192B, ioengine=io_uring, iodepth=8
fio-3.33
Starting 1 process

io_uring_1: (groupid=0, jobs=1): err= 0: pid=450: Thu Apr 20 11:34:10 2023
  read: IOPS=94.8k, BW=741MiB/s (777MB/s)(40.0MiB/54msec)

<snip>

Run status group 0 (all jobs):
   READ: bw=741MiB/s (777MB/s), 741MiB/s-741MiB/s (777MB/s-777MB/s), io=40.0MiB (41.9MB), run=54-54msec

Disk stats (read/write):
  ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

**Questions on the future work**:

As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT`
in many places. Should we pursue something that willy suggested: encapsulating
order in the mapping->flags as a next step? [1]

[1] https://lore.kernel.org/lkml/ZDty+PQfHkrGBojn@casper.infradead.org/
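Until an encoding in mapping->flags exists, one interim option (a sketch
only; the helper name is invented here and is not an existing kernel API)
would be to factor the repeated computation into a single place:

	/*
	 * Hypothetical stopgap helper: derive the minimum folio order from
	 * the block size in one place instead of open-coding the i_blkbits
	 * check at every filemap_alloc_folio() call site.
	 */
	static inline unsigned int mapping_blkbits_order(struct address_space *mapping)
	{
		unsigned int blkbits = mapping->host->i_blkbits;

		return blkbits > PAGE_SHIFT ? blkbits - PAGE_SHIFT : 0;
	}

With that, the readahead hunk above would reduce to a one-line change:
folio = filemap_alloc_folio(gfp_mask, mapping_blkbits_order(mapping));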
On 4/20/23 14:05, Pankaj Raghav wrote:
> To keep this thread alive and get some direction on the next steps, I made
> some changes with which I am able to do **buffered reads** with fio on brd
> with logical block size > 4k.
>
> [diff and fio results snipped]
>
> **Questions on the future work**:
>
> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT`
> in many places. Should we pursue something that willy suggested: encapsulating
> order in the mapping->flags as a next step? [1]
>
> [1] https://lore.kernel.org/lkml/ZDty+PQfHkrGBojn@casper.infradead.org/

Well ... really, not sure.
Yes, continuing to update buffer_heads would be a logical thing as it
could be done incrementally.

But really, the end goal should be to move away from buffer_heads for
fs and mm usage, so I wonder if we shouldn't rather look in that direction.

Cheers,

Hannes
On 2023-04-20 14:19, Hannes Reinecke wrote:
>> **Questions on the future work**:
>>
>> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT`
>> in many places. Should we pursue something that willy suggested: encapsulating
>> order in the mapping->flags as a next step? [1]
>>
>> [1] https://lore.kernel.org/lkml/ZDty+PQfHkrGBojn@casper.infradead.org/
>
> Well ... really, not sure.
> Yes, continuing to update buffer_heads would be a logical thing as it
> could be done incrementally.
>
> But really, the end goal should be to move away from buffer_heads for
> fs and mm usage, so I wonder if we shouldn't rather look in that direction.

Yeah, I understand that part. Hopefully this will be discussed as part of LSFMM.

But the changes that are done in filemap and readahead need to be done
anyway, irrespective of the underlying aops, right? Or am I missing something?
On Thu, Apr 20, 2023 at 02:28:36PM +0200, Pankaj Raghav wrote:
> On 2023-04-20 14:19, Hannes Reinecke wrote:
> >> **Questions on the future work**:
> >>
> >> As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT`
> >> in many places. Should we pursue something that willy suggested: encapsulating
> >> order in the mapping->flags as a next step? [1]
> >>
> >> [1] https://lore.kernel.org/lkml/ZDty+PQfHkrGBojn@casper.infradead.org/

I wouldn't mind XFS gaining a means to control folio sizes, personally.
At least that would make it easier to explore things like copy on write
with large weird file allocation unit sizes (2MB, 28k, etc).

> > Well ... really, not sure.
> > Yes, continuing to update buffer_heads would be a logical thing as it
> > could be done incrementally.
> >
> > But really, the end goal should be to move away from buffer_heads for
> > fs and mm usage, so I wonder if we shouldn't rather look in that direction.
>
> Yeah, I understand that part. Hopefully this will be discussed as part of LSFMM.

Agree.

--D

> But the changes that are done in filemap and readahead need to be done
> anyway, irrespective of the underlying aops, right? Or am I missing something?
On Thu, Apr 20, 2023 at 08:03:55AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 20, 2023 at 02:28:36PM +0200, Pankaj Raghav wrote:
> > As willy pointed out, we have to do this `order = mapping->host->i_blkbits - PAGE_SHIFT`
> > in many places. Should we pursue something that willy suggested: encapsulating
> > order in the mapping->flags as a next step? [1]
> >
> > [1] https://lore.kernel.org/lkml/ZDty+PQfHkrGBojn@casper.infradead.org/
>
> I wouldn't mind XFS gaining a means to control folio sizes, personally.
> At least that would make it easier to explore things like copy on write
> with large weird file allocation unit sizes (2MB, 28k, etc).

[random historical musings follow]

Ah, how things go full circle. This is how XFS originally worked on Irix -
it had read and write IO sizes that it used to control the size of the
buffers allocated for caching data in the per-inode chunk cache.

That stuff was in the original port to Linux - we even had a mount option
to allow users to determine preferred IO sizes (biosize) - but it was
removed in 2019 because it hadn't done anything for more than a decade.
That was done around the same time the mp->m_readio_blocks and
mp->m_writeio_blocks variables that it originally influenced were removed.

Extent allocation sizes are still influenced this way by the allocsize
mount option, and m_writeio_blocks was renamed m_allocsize_blocks to keep
this functionality for extent allocation.

Cheers,

Dave.
diff --git a/mm/filemap.c b/mm/filemap.c
index 05fd86752489..468f25714ced 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1927,7 +1927,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		int err;
+		int err, order = 0;
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1937,7 +1937,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
 
-		folio = filemap_alloc_folio(gfp, 0);
+		if (mapping->host->i_blkbits > PAGE_SHIFT)
+			order = mapping->host->i_blkbits - PAGE_SHIFT;
+		folio = filemap_alloc_folio(gfp, order);
 		if (!folio)
 			return NULL;
 
@@ -2492,9 +2494,11 @@ static int filemap_create_folio(struct file *file,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
-	int error;
+	int error, order = 0;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	if (mapping->host->i_blkbits > PAGE_SHIFT)
+		order = mapping->host->i_blkbits - PAGE_SHIFT;
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -3607,14 +3611,16 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 		pgoff_t index, filler_t filler, struct file *file, gfp_t gfp)
 {
 	struct folio *folio;
-	int err;
+	int err, order = 0;
 
+	if (mapping->host->i_blkbits > PAGE_SHIFT)
+		order = mapping->host->i_blkbits - PAGE_SHIFT;
 	if (!filler)
 		filler = mapping->a_ops->read_folio;
repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (!folio) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp, order);
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
If the blocksize is larger than the pagesize allocate folios
with the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 mm/filemap.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)