Message ID | cover.1708709155.git.john@groves.net (mailing list archive) |
---|---|
Series | Introduce the famfs shared-memory file system |
On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> This patch set introduces famfs[1] - a special-purpose fs-dax file system
> for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> CXL-specific in any way.
>
> * Famfs creates a simple access method for storing and sharing data in
>   sharable memory. The memory is exposed and accessed as memory-mappable
>   dax files.
> * Famfs supports multiple hosts mounting the same file system from the
>   same memory (something existing fs-dax file systems don't do).
> * A famfs file system can be created on either a /dev/pmem device in fs-dax
>   mode, or a /dev/dax device in devdax mode (the latter depending on
>   patches 2-6 of this series).
>
> The famfs kernel file system is part of the famfs framework; additional
> components in user space[2] handle metadata and direct the famfs kernel
> module to instantiate files that map to specific memory. The famfs user
> space has documentation and a reasonably thorough test suite.
>
> The famfs kernel module never accesses the shared memory directly (either
> data or metadata). Because of this, shared memory managed by the famfs
> framework does not create a RAS "blast radius" problem that should be able
> to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> can be expected to kill apps via SIGBUS and cause mounts to be disabled
> due to memory failure notifications.
>
> Famfs does not attempt to solve concurrency or coherency problems for apps,
> although it does solve these problems in regard to its own data structures.
> Apps may encounter hard concurrency problems, but there are use cases that
> are eminently useful and uncomplicated from a concurrency perspective:
> serial sharing is one (only one host at a time has access), and read-only
> concurrent sharing is another (all hosts can read-cache without worry).

Can you do me a favor? I'm curious whether you can run a test like this:

fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring -direct=1 --group_reporting=1 --alloc-size=1048576 --filesize=1GiB --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1 --directory=/mnt

What do you get for throughput?

The larger the system and capacity, the better.

Luis
On 24/02/23 04:07PM, Luis Chamberlain wrote:
> On Fri, Feb 23, 2024 at 11:41:44AM -0600, John Groves wrote:
> > This patch set introduces famfs[1] - a special-purpose fs-dax file system
> > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not
> > CXL-specific in any way.
> >
> > * Famfs creates a simple access method for storing and sharing data in
> >   sharable memory. The memory is exposed and accessed as memory-mappable
> >   dax files.
> > * Famfs supports multiple hosts mounting the same file system from the
> >   same memory (something existing fs-dax file systems don't do).
> > * A famfs file system can be created on either a /dev/pmem device in fs-dax
> >   mode, or a /dev/dax device in devdax mode (the latter depending on
> >   patches 2-6 of this series).
> >
> > The famfs kernel file system is part of the famfs framework; additional
> > components in user space[2] handle metadata and direct the famfs kernel
> > module to instantiate files that map to specific memory. The famfs user
> > space has documentation and a reasonably thorough test suite.
> >
> > The famfs kernel module never accesses the shared memory directly (either
> > data or metadata). Because of this, shared memory managed by the famfs
> > framework does not create a RAS "blast radius" problem that should be able
> > to crash or de-stabilize the kernel. Poison or timeouts in famfs memory
> > can be expected to kill apps via SIGBUS and cause mounts to be disabled
> > due to memory failure notifications.
> >
> > Famfs does not attempt to solve concurrency or coherency problems for apps,
> > although it does solve these problems in regard to its own data structures.
> > Apps may encounter hard concurrency problems, but there are use cases that
> > are eminently useful and uncomplicated from a concurrency perspective:
> > serial sharing is one (only one host at a time has access), and read-only
> > concurrent sharing is another (all hosts can read-cache without worry).
>
> Can you do me a favor? I'm curious whether you can run a test like this:
>
> fio -name=ten-1g-per-thread --nrfiles=10 -bs=2M -ioengine=io_uring
> -direct=1
> --group_reporting=1 --alloc-size=1048576 --filesize=1GiB
> --readwrite=write --fallocate=none --numjobs=$(nproc) --create_on_open=1
> --directory=/mnt
>
> What do you get for throughput?
>
> The larger the system and capacity, the better.
>
> Luis

Luis,

First, thanks for paying attention. I think I need to clarify a few things
about famfs and then check how that modifies your ask; apologies if some
are obvious. You should tell me whether this is still interesting given
these clarifications and limitations, or if there is something else you'd
like to see tested instead. But read on, I have run the closest tests I can.

Famfs files just map to dax memory; they don't have a backing store. So the
io_uring and direct=1 options don't work. The coolness is that the files &
memory can be shared, and that apps can deal with files rather than having
to learn new abstractions.

Famfs files are never allocate-on-write, so --fallocate=none is ok, but
"actual" fallocate doesn't work, and neither does --create_on_open. But it
seems to be happy if I preallocate the files for the test.

I don't currently have custody of a really beefy system (can get one, just
need to plan ahead). My primary dev system is a 48 HT core E5-2690 v3 @
2.60GHz (around 10 years old).

I have a 128GB dax device that is backed by ddr4 via efi_fake_mem. So I
can't do 48 x 10 x 1G, but I can do 48 x 10 x 256M. I ran this on
ddr4-backed famfs, and xfs backed by a sata ssd. Probably not fair, but
it's what I have on a Sunday evening.

I can get access to a beefy system with real cxl memory, though don't
assume 100% that I can report performance on that - will check into that.
But think about what you're looking for in light of the fact that famfs is
just a shared-memory file system, so no O_DIRECT or io_uring.
Basically just (hopefully efficient) vma fault handling and metadata
distribution.

###

Here is famfs. I had to drop the io_uring and script up alloc/creation
of the files (sudo famfs creat -s 256M /mnt/famfs/foo)

$ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=100MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --directory=/mnt/famfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
...
fio-3.33
Starting 48 processes
Jobs: 40 (f=400)
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=201738: Mon Feb 26 06:48:21 2024
  write: IOPS=15.2k, BW=29.6GiB/s (31.8GB/s)(44.7GiB/1511msec); 0 zone resets
    clat (usec): min=156, max=54645, avg=2077.40, stdev=1730.77
     lat (usec): min=171, max=54686, avg=2404.87, stdev=2056.50
    clat percentiles (usec):
     | 1.00th=[ 196], 5.00th=[ 243], 10.00th=[ 367], 20.00th=[ 644],
     | 30.00th=[ 857], 40.00th=[ 1352], 50.00th=[ 1876], 60.00th=[ 2442],
     | 70.00th=[ 2868], 80.00th=[ 3228], 90.00th=[ 3884], 95.00th=[ 4555],
     | 99.00th=[ 6390], 99.50th=[ 7439], 99.90th=[16450], 99.95th=[23987],
     | 99.99th=[46924]
   bw ( MiB/s): min=21544, max=28034, per=81.80%, avg=24789.35, stdev=130.16, samples=81
   iops : min=10756, max=14000, avg=12378.00, stdev=65.06, samples=81
  lat (usec) : 250=5.42%, 500=9.67%, 750=8.07%, 1000=11.77%
  lat (msec) : 2=16.87%, 4=39.59%, 10=8.37%, 20=0.17%, 50=0.07%
  lat (msec) : 100=0.01%
  cpu : usr=13.26%, sys=81.62%, ctx=2075, majf=0, minf=18159
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22896,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

$ sudo famfs fsck -h /mnt/famfs
Famfs Superblock:
  Filesystem UUID: 591f3f62-0a79-4543-9ab5-e02dc807c76c
  System UUID: 00000000-0000-0000-0000-0cc47aaaa734
  sizeof superblock: 168
  num_daxdevs: 1
  primary: /dev/dax1.0 137438953472

Log stats:
  # of log entriesi in use: 480 of 25575
  Log size in use: 157488
No allocation errors found

Capacity:
  Device capacity: 128.00G
  Bitmap capacity: 127.99G
  Sum of file sizes: 120.00G
  Allocated space: 120.00G
  Free space: 7.99G
  Space amplification: 1.00
  Percent used: 93.8%

Famfs log:
  480 of 25575 entries used
  480 files
  0 directories

###

Here is the same fio command, plus --ioengine=io_uring and --direct=1. It's
apples and oranges, since famfs is a memory interface and not a storage
interface. This is run on an xfs file system on a SATA ssd.

Note units are msec here, usec above.

fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/home/jmg/t1
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
...
fio-3.33 Starting 48 processes ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 
2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) Jobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.1%][w=454MiB/s][w=227 IOPS][eta 01m:32sJobs: 37 (f=370): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(1),_(1),W(13),_(1),W(5),_(1),W(5)][72.4%][w=456MiB/s][w=228 IOPS][eta 01m:31sJobs: 36 (f=360): [W(1),_(2),W(2),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(5)][72.9%][w=454MiB/s][w=227 IOPS][eta 01m:29s] Jobs: 33 (f=330): [_(3),W(2),_(1),W(1),_(1),W(1),_(1),W(4),_(1),W(1),_(1),W(1),_(1),W(1),_(3),W(13),_(1),W(5),_(1),W(2),_(1),W(2)][73.0%][w=458MiB/s][w=229 IOPS][eta 01Jobs: 30 (f=300): 
[_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(1),W(1),_(3),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(1),W(2)][73.6%][w=462MiB/s][w=231 IOPS][eta 01mJobs: 28 (f=280): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(3),_(5),W(1),_(3),W(7),_(1),W(5),_(1),W(5),_(1),W(2),_(2),W(1)][74.1%][w=456MiB/s][w=228 IOPS][eta 01m:25s] Jobs: 25 (f=250): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(4),_(1),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.1%][w=458MiB/s][w=229 IOPJobs: 24 (f=240): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),W(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][75.6%][w=456MiB/s][w=228 IOPJobs: 23 (f=230): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(1),W(1),_(5),E(1),_(3),W(2),_(1),W(3),_(2),W(5),_(1),W(5),_(2),W(1),_(2),W(1)][76.2%][w=452MiB/s][w=226 IOPJobs: 20 (f=200): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(1),W(3),_(1),W(1),_(2),W(1),_(3)][76.7%][w=448MiB/s][w=224 IOPS][eta 01m:15sJobs: 19 (f=190): [_(3),W(2),_(1),W(1),_(1),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][77.5%][w=464MiB/s][w=232 IOPS][eta 01m:12sJobs: 18 (f=180): [_(3),W(2),_(3),W(1),_(2),W(1),_(11),W(2),_(1),W(3),_(2),W(5),_(2),W(2),_(1),W(1),_(2),W(1),_(3)][78.8%][w=478MiB/s][w=239 IOPS][eta 01m:07s] Jobs: 4 (f=40): [_(3),W(1),_(22),W(1),_(12),W(1),_(4),W(1),_(3)][92.4%][w=462MiB/s][w=231 IOPS][eta 00m:21s] ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=210709: Mon Feb 26 07:20:51 2024 write: IOPS=228, BW=458MiB/s (480MB/s)(114GiB/255942msec); 0 zone resets slat (usec): min=39, max=776, avg=186.65, stdev=49.13 clat (msec): min=4, max=6718, avg=199.27, stdev=324.82 lat (msec): min=4, max=6718, avg=199.45, stdev=324.82 clat percentiles (msec): | 1.00th=[ 30], 5.00th=[ 47], 10.00th=[ 60], 20.00th=[ 69], | 30.00th=[ 78], 40.00th=[ 85], 50.00th=[ 95], 60.00th=[ 114], | 70.00th=[ 142], 80.00th=[ 194], 90.00th=[ 409], 95.00th=[ 810], | 99.00th=[ 1703], 99.50th=[ 2140], 
99.90th=[ 3037], 99.95th=[ 3440], | 99.99th=[ 4665] bw ( KiB/s): min=195570, max=2422953, per=100.00%, avg=653513.53, stdev=8137.30, samples=17556 iops : min= 60, max= 1180, avg=314.22, stdev= 3.98, samples=17556 lat (msec) : 10=0.11%, 20=0.37%, 50=5.35%, 100=47.30%, 250=32.22% lat (msec) : 500=6.11%, 750=2.98%, 1000=1.98%, 2000=2.97%, >=2000=0.60% cpu : usr=0.10%, sys=0.01%, ctx=58709, majf=0, minf=669 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=458MiB/s (480MB/s), 458MiB/s-458MiB/s (480MB/s-480MB/s), io=114GiB (123GB), run=255942-255942msec Disk stats (read/write): dm-2: ios=11/82263, merge=0/0, ticks=270/13403617, in_queue=13403887, util=97.10%, aggrios=11/152359, aggrmerge=0/5087, aggrticks=271/11493029, aggrin_queue=11494994, aggrutil=100.00% sdb: ios=11/152359, merge=0/5087, ticks=271/11493029, in_queue=11494994, util=100.00% ### Let me know what else you'd like to see tried. Regards, John
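
Editorial aside: the sizing and throughput figures in the message above can be cross-checked from numbers that appear in the thread itself. The sketch below is only a back-of-the-envelope check, not part of the original exchange; the helper names `total_bytes` and `bandwidth_gib_s` are made up for illustration, and the bandwidth is derived from fio's "issued rwts" write count multiplied by the 2 MiB block size.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def total_bytes(jobs: int, files_per_job: int, file_size: int) -> int:
    """Space needed when every job writes its own set of files."""
    return jobs * files_per_job * file_size


def bandwidth_gib_s(issued_writes: int, block_size: int, runtime_ms: int) -> float:
    """Aggregate write bandwidth in GiB/s from an issued-write count and runtime."""
    return issued_writes * block_size / (runtime_ms / 1000) / GIB


device = 128 * GIB                           # the 128G dax device
requested = total_bytes(48, 10, 1 * GIB)     # 48 x 10 x 1G, as first asked for
reduced = total_bytes(48, 10, 256 * MIB)     # 48 x 10 x 256M, as preallocated

print(f"requested: {requested // GIB} GiB, fits: {requested <= device}")
print(f"reduced:   {reduced // GIB} GiB, fits: {reduced <= device}")
print(f"share of device used by the reduced set: {100 * reduced / device:.2f}%")

# Issued writes and runtimes copied from the two fio reports above.
famfs = bandwidth_gib_s(22896, 2 * MIB, 1511)
xfs_sata = bandwidth_gib_s(58560, 2 * MIB, 255942)
print(f"famfs:       {famfs:.1f} GiB/s")
print(f"xfs on SATA: {xfs_sata * 1024:.0f} MiB/s")
```

The reduced set comes to 120 GiB, which matches the "Sum of file sizes" reported by famfs fsck, and the recomputed bandwidths land on the figures fio printed.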
On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> Run status group 0 (all jobs):
> WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec

> This is run on an xfs file system on a SATA ssd.

To compare more closely, apples to apples, wouldn't it make more sense
to try this with XFS on pmem (with fio -direct=1)?

Luis
On 24/02/26 07:53AM, Luis Chamberlain wrote: > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote: > > Run status group 0 (all jobs): > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec > > > This is run on an xfs file system on a SATA ssd. > > To compare more closer apples to apples, wouldn't it make more sense > to try this with XFS on pmem (with fio -direct=1)? > > Luis Makes sense. Here is the same command line I used with xfs before, but now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem because xfs requires that. fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1 ... fio-3.33 Starting 48 processes ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out 
IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 
2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) Jobs: 36 (f=360): [W(3),_(1),W(3),_(1),W(1),_(1),W(6),_(1),W(1),_(1),W(1),_(1),W(7),_(1),W(3),_(1),W(2),_(2),W(4),_(1),W(5),_(1)][77.8%][w=15.1GiB/s][w=7750 IOPS][eta 00m:02s] ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=8798: Mon Feb 26 15:10:30 2024 write: IOPS=7582, BW=14.8GiB/s (15.9GB/s)(114GiB/7723msec); 0 zone resets slat (usec): min=23, max=7352, avg=131.80, stdev=151.63 clat (usec): min=385, max=22638, avg=5789.74, stdev=3124.93 lat (usec): min=432, max=22724, avg=5921.54, stdev=3133.18 clat percentiles (usec): | 1.00th=[ 799], 5.00th=[ 1467], 10.00th=[ 2073], 20.00th=[ 3097], | 30.00th=[ 3949], 40.00th=[ 4752], 50.00th=[ 5473], 60.00th=[ 6194], | 70.00th=[ 7046], 80.00th=[ 8029], 90.00th=[ 9634], 95.00th=[11338], | 99.00th=[16319], 99.50th=[17957], 99.90th=[20055], 99.95th=[20579], | 99.99th=[21365] bw ( MiB/s): min=10852, max=26980, per=100.00%, avg=15940.43, stdev=88.61, samples=665 iops : min= 5419, max=13477, avg=7963.08, stdev=44.28, samples=665 lat (usec) : 500=0.15%, 750=0.47%, 1000=1.34% lat (msec) : 2=7.40%, 4=21.46%, 10=60.57%, 20=8.50%, 50=0.11% cpu : usr=2.33%, sys=0.32%, ctx=58806, majf=0, minf=36301 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=14.8GiB/s (15.9GB/s), 14.8GiB/s-14.8GiB/s (15.9GB/s-15.9GB/s), io=114GiB (123GB), run=7723-7723msec Disk stats (read/write): pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% I 
only have some educated guesses as to why famfs is faster. Since files
are preallocated, they're always contiguous. And famfs is vastly simpler
because it isn't aimed at general-purpose use cases (and indeed can't
handle them).

Regards,
John
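
Editorial aside: the thread describes famfs as a memory interface whose files are preallocated and memory-mappable. The sketch below illustrates that access pattern only in the most generic sense. It is an assumption-laden stand-in: it uses an ordinary temporary file rather than a famfs mount, `os.ftruncate` stands in for the preallocation step (done with `famfs creat -s ...` in the thread), and the helper name `write_through_mapping` is hypothetical. Note also that the famfs fio run above reported ioengine=psync, so this is not a reproduction of that benchmark.

```python
import mmap
import os
import tempfile

FILE_SIZE = 4 * 1024 * 1024   # small stand-in; the thread used 256M files
BLOCK = 2 * 1024 * 1024       # same 2 MiB block size as the fio runs


def write_through_mapping(path: str, size: int, block: int) -> int:
    """Fill an already-sized file through a shared mapping; return bytes written."""
    written = 0
    payload = b"\xab" * block
    with open(path, "r+b") as f:
        # Map the existing file; the mapping does not grow or allocate it.
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_WRITE) as m:
            for off in range(0, size, block):
                m[off:off + block] = payload   # stores into the mapping
                written += block
            m.flush()
    return written


# Size the file up front, as a stand-in for preallocation.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, FILE_SIZE)
os.close(fd)

written = write_through_mapping(path, FILE_SIZE, BLOCK)
with open(path, "rb") as f:
    data = f.read()
os.remove(path)

print(f"wrote {written} bytes through the mapping")
```

On a dax-backed file system the stores would land directly in the backing memory; on the temporary file used here they go through the page cache, so the sketch shows the programming model and says nothing about performance.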
On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote:
>
> On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > Run status group 0 (all jobs):
> > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> >
> > > This is run on an xfs file system on a SATA ssd.
> >
> > To compare more closely, apples to apples, wouldn't it make more sense
> > to try this with XFS on pmem (with fio -direct=1)?
> >
> > Luis
>
> Makes sense. Here is the same command line I used with xfs before, but
> now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> because xfs requires that.
>
> fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs

Could you try with mkfs.xfs -d agcount=1024

Luis
On 24/02/26 04:58PM, Luis Chamberlain wrote: > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote: > > > > On 24/02/26 07:53AM, Luis Chamberlain wrote: > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote: > > > > Run status group 0 (all jobs): > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec > > > > > > > This is run on an xfs file system on a SATA ssd. > > > > > > To compare more closer apples to apples, wouldn't it make more sense > > > to try this with XFS on pmem (with fio -direct=1)? > > > > > > Luis > > > > Makes sense. Here is the same command line I used with xfs before, but > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem > > because xfs requires that. > > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs > > Could you try with mkfs.xfs -d agcount=1024 > > Luis $ luis/fio-xfsdax.sh + sudo mkfs.xfs -d agcount=1024 -m reflink=0 -f /dev/pmem0 meta-data=/dev/pmem0 isize=512 agcount=1024, agsize=32768 blks = sectsz=4096 attr=2, projid32bit=1 = crc=1 finobt=1, sparse=1, rmapbt=0 = reflink=0 bigtime=1 inobtcount=1 nrext64=0 data = bsize=4096 blocks=33554432, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0, ftype=1 log =internal log bsize=4096 blocks=16384, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 + sudo mount -o dax /dev/pmem0 /mnt/xfs + sudo chown jmg:jmg /mnt/xfs + ls -al /mnt/xfs total 0 drwxr-xr-x 2 jmg jmg 6 Feb 26 19:56 . drwxr-xr-x. 4 root root 30 Feb 26 14:58 .. 
++ nproc + fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1 ... fio-3.33 Starting 48 processes ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying 
out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) ten-256m-per-thread: Laying out IO files (10 files / total 2441MiB) Jobs: 17 (f=170): [_(2),W(1),_(8),W(2),_(7),W(3),_(2),W(2),_(3),W(2),_(2),W(1),_(2),W(1),_(1),W(3),_(4),W(2)][Jobs: 1 (f=10): [_(47),W(1)][100.0%][w=8022MiB/s][w=4011 IOPS][eta 00m:00s] ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=141563: Mon Feb 26 19:56:28 2024 write: IOPS=6578, BW=12.8GiB/s 
(13.8GB/s)(114GiB/8902msec); 0 zone resets slat (usec): min=18, max=60593, avg=1230.85, stdev=1799.97 clat (usec): min=2, max=98969, avg=5133.25, stdev=5141.07 lat (usec): min=294, max=99725, avg=6364.09, stdev=5440.30 clat percentiles (usec): | 1.00th=[ 11], 5.00th=[ 46], 10.00th=[ 217], 20.00th=[ 2376], | 30.00th=[ 2999], 40.00th=[ 3556], 50.00th=[ 3785], 60.00th=[ 3982], | 70.00th=[ 4228], 80.00th=[ 7504], 90.00th=[13173], 95.00th=[14091], | 99.00th=[21890], 99.50th=[27919], 99.90th=[45351], 99.95th=[57934], | 99.99th=[82314] bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719 iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719 lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02% lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75% lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25% lat (msec) : 100=0.08% cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=12.8GiB/s (13.8GB/s), 12.8GiB/s-12.8GiB/s (13.8GB/s-13.8GB/s), io=114GiB (123GB), run=8902-8902msec Disk stats (read/write): pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% I ran it several times with similar results. Regards, John
On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote: > On 24/02/26 04:58PM, Luis Chamberlain wrote: > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote: > > > > > > On 24/02/26 07:53AM, Luis Chamberlain wrote: > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote: > > > > > Run status group 0 (all jobs): > > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec > > > > > > > > > This is run on an xfs file system on a SATA ssd. > > > > > > > > To compare more closer apples to apples, wouldn't it make more sense > > > > to try this with XFS on pmem (with fio -direct=1)? > > > > > > > > Luis > > > > > > Makes sense. Here is the same command line I used with xfs before, but > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem > > > because xfs requires that. > > > > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs > > > > Could you try with mkfs.xfs -d agcount=1024 Won't change anything for the better, may make things worse. > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719 > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719 > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02% > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75% > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25% Most of the IO latencies are up round the 4-20ms marks. That seems kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB should only take a couple of hundred microseconds. For Famfs, the latencies appear to be around 1-4ms. So where's all that extra time coming from? 
> lat (msec) : 100=0.08% > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511 And why is system time reporting at almost zero instead of almost all the remaining cpu time (i.e. up at 80-90%)? Can you run call-graph kernel profiles for XFS and famfs whilst running this workload so we have some insight into what is behaving differently here? -Dave.
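The "couple of hundred microseconds" estimate above can be checked with simple arithmetic. The following is a minimal sketch, not something from the thread: it assumes the 2M block size from the fio command line means 2 MiB, and takes the 10GB/s memcpy rate as decimal gigabytes per second. The function name `copy_time_us` is made up for illustration.

```python
def copy_time_us(io_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to move io_bytes at a steady bandwidth, in microseconds."""
    return io_bytes / bandwidth_bytes_per_s * 1e6


IO_SIZE = 2 * 1024 * 1024  # fio -bs=2M, assumed to mean 2 MiB
MEMCPY_BW = 10e9           # 10 GB/s, the figure quoted in the message

expected_us = copy_time_us(IO_SIZE, MEMCPY_BW)
print(f"expected copy time per 2 MiB IO: {expected_us:.0f} us")
```

Under those assumptions the copy itself accounts for roughly 0.2 ms per IO, which is why latencies clustered in the 4-20 ms buckets point at something other than data movement.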
On Fri, Feb 23, 2024 at 7:42 PM John Groves <John@groves.net> wrote: > > This patch set introduces famfs[1] - a special-purpose fs-dax file system > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not > CXL-specific in anyway way. > > * Famfs creates a simple access method for storing and sharing data in > sharable memory. The memory is exposed and accessed as memory-mappable > dax files. > * Famfs supports multiple hosts mounting the same file system from the > same memory (something existing fs-dax file systems don't do). > * A famfs file system can be created on either a /dev/pmem device in fs-dax > mode, or a /dev/dax device in devdax mode (the latter depending on > patches 2-6 of this series). > > The famfs kernel file system is part the famfs framework; additional > components in user space[2] handle metadata and direct the famfs kernel > module to instantiate files that map to specific memory. The famfs user > space has documentation and a reasonably thorough test suite. > So can we say that Famfs is Fuse specialized for DAX? I am asking because you seem to have asked it first: https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/ I guess that you did not get answers to your questions before or at LPC? I did not see your question back in October. Let me try to answer your questions and we can discuss later if a new dedicated kernel driver + userspace API is really needed, or if FUSE could be used as is or extended for your needs. You wrote: "...My naive reading of the existence of some sort of fuse/dax support for virtiofs suggested that there might be a way of doing this - but I may be wrong about that." I'm not a virtiofs expert, but I don't think that you are wrong about this. IIUC, virtiofsd could map an arbitrary memory region to any fuse file mmapped by the virtiofs client. 
So what are the gaps between virtiofs and famfs that justify a new filesystem driver and new userspace API? Thanks, Amir.
Hi Dave! On 24/02/29 01:15PM, Dave Chinner wrote: > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote: > > On 24/02/26 04:58PM, Luis Chamberlain wrote: > > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@groves.net> wrote: > > > > > > > > On 24/02/26 07:53AM, Luis Chamberlain wrote: > > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote: > > > > > > Run status group 0 (all jobs): > > > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec > > > > > > > > > > > This is run on an xfs file system on a SATA ssd. > > > > > > > > > > To compare more closer apples to apples, wouldn't it make more sense > > > > > to try this with XFS on pmem (with fio -direct=1)? > > > > > > > > > > Luis > > > > > > > > Makes sense. Here is the same command line I used with xfs before, but > > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem > > > > because xfs requires that. > > > > > > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs > > > > > > Could you try with mkfs.xfs -d agcount=1024 > > Won't change anything for the better, may make things worse. I dropped that arg, though performance looked about the same either way. > > > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719 > > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719 > > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02% > > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75% > > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25% > > Most of the IO latencies are up round the 4-20ms marks. That seems > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB > should only take a couple of hundred microseconds. 
For Famfs, the > latencies appear to be around 1-4ms. > > So where's all that extra time coming from? Below, you will see two runs with performance and latency distribution about the same as famfs (the answer for that was --fallocate=native). > > > > lat (msec) : 100=0.08% > > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511 > > And why is system time reporting at almost zero instead of almost > all the remaining cpu time (i.e. up at 80-90%)? Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times it's about what you would expect. I suspect some sort of measurement error, like maybe the method doesn't work with my cpu model? (I'm grasping, but with a somewhat rational basis...) I pasted two xfs runs below. The first has the wonky cpu sys value, and the second looks about like what one would expect. > > Can you run call-graph kernel profiles for XFS and famfs whilst > running this workload so we have some insight into what is behaving > differently here? Can you point me to an example of how to do that? > > -Dave. > -- > Dave Chinner > david@fromorbit.com I'd been thinking about the ~2x gap for a few days, and the most obvious difference is famfs files must be preallocated (like fallocate, but works a bit differently since allocation happens in user space). I just checked one of the xfs files, and it had maybe 80 extents (whereas the famfs files always have 1 extent here). FWIW I ran xfs with and without io_uring, and there was no apparent difference (which makes sense to me because it's not block I/O). 
The prior ~2x gap still seems like a lot of overhead for extent list mapping to memory, but adding --fallocate=native to the xfs test brought it into line with famfs: + fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1 ... fio-3.33 Starting 48 processes Jobs: 38 (f=380): [W(5),_(1),W(12),_(1),W(3),_(1),W(2),_(1),W(2),_(1),W(1),_(1),W(1),_(1),W(6),_(1),W(6),_(2)][57.1%][w=28.0GiB/s][w=14.3k IOPS][eta 00m:03s] ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1452590: Thu Feb 29 07:46:06 2024 write: IOPS=15.3k, BW=29.8GiB/s (32.0GB/s)(114GiB/3838msec); 0 zone resets slat (usec): min=17, max=55364, avg=668.20, stdev=1120.41 clat (nsec): min=1368, max=99619k, avg=1982477.32, stdev=2198309.32 lat (usec): min=179, max=99813, avg=2650.68, stdev=2485.15 clat percentiles (usec): | 1.00th=[ 4], 5.00th=[ 14], 10.00th=[ 172], 20.00th=[ 420], | 30.00th=[ 644], 40.00th=[ 1057], 50.00th=[ 1582], 60.00th=[ 2008], | 70.00th=[ 2343], 80.00th=[ 3097], 90.00th=[ 4555], 95.00th=[ 5473], | 99.00th=[ 8717], 99.50th=[11863], 99.90th=[20055], 99.95th=[27657], | 99.99th=[49546] bw ( MiB/s): min=20095, max=59216, per=100.00%, avg=35985.47, stdev=318.61, samples=280 iops : min=10031, max=29587, avg=17970.76, stdev=159.29, samples=280 lat (usec) : 2=0.06%, 4=1.02%, 10=2.33%, 20=4.29%, 50=1.85% lat (usec) : 100=0.20%, 250=3.26%, 500=11.23%, 750=8.87%, 1000=5.82% lat (msec) : 2=20.95%, 4=26.74%, 10=12.60%, 20=0.66%, 50=0.09% lat (msec) : 100=0.01% cpu : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 
8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=29.8GiB/s (32.0GB/s), 29.8GiB/s-29.8GiB/s (32.0GB/s-32.0GB/s), io=114GiB (123GB), run=3838-3838msec Disk stats (read/write): pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% ## Here is a run where the cpu looks "normal" + fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --direct=1 --directory=/mnt/xfs ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1 ... fio-3.33 Starting 48 processes Jobs: 19 (f=190): [W(2),_(1),W(2),_(8),W(1),_(3),W(1),_(1),W(2),_(2),W(1),_(1),W(3),_(2),W(1),_(1),W(1),_(2),W(2),_(7),W(3),_(1)][55.6%][w=26.7GiB/s][w=13.6k IOPS][eta 00m:04s] ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1463615: Thu Feb 29 08:19:53 2024 write: IOPS=12.4k, BW=24.1GiB/s (25.9GB/s)(114GiB/4736msec); 0 zone resets clat (usec): min=138, max=117903, avg=2581.99, stdev=2704.61 lat (usec): min=152, max=120405, avg=3019.04, stdev=2964.47 clat percentiles (usec): | 1.00th=[ 161], 5.00th=[ 249], 10.00th=[ 627], 20.00th=[ 1270], | 30.00th=[ 1631], 40.00th=[ 1942], 50.00th=[ 2089], 60.00th=[ 2212], | 70.00th=[ 2343], 80.00th=[ 2704], 90.00th=[ 5866], 95.00th=[ 6849], | 99.00th=[12387], 99.50th=[14353], 99.90th=[26084], 99.95th=[38536], | 99.99th=[78119] bw ( MiB/s): min=21204, max=47040, per=100.00%, avg=29005.40, stdev=237.31, samples=329 iops : min=10577, max=23497, avg=14479.74, stdev=118.65, samples=329 lat (usec) : 250=5.04%, 500=4.03%, 750=2.37%, 1000=3.13% lat (msec) : 2=29.39%, 4=41.05%, 10=13.37%, 20=1.45%, 50=0.15% lat (msec) : 100=0.03%, 250=0.01% cpu : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708 IO depths : 1=100.0%, 2=0.0%, 
4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: bw=24.1GiB/s (25.9GB/s), 24.1GiB/s-24.1GiB/s (25.9GB/s-25.9GB/s), io=114GiB (123GB), run=4736-4736msec Disk stats (read/write): pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00% Cheers, John
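As a sanity check on the three XFS-on-pmem runs in this thread, the summary lines can be re-derived from the "issued rwts" count, the 2M block size, and the reported run time. This is only a sketch: the run labels and the `summarize` helper are illustrative names, and 2M is assumed to mean 2 MiB.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def summarize(issued_writes: int, block_bytes: int, runtime_ms: int):
    """Return (total GiB written, GiB/s, IOPS) for one fio run."""
    total_bytes = issued_writes * block_bytes
    seconds = runtime_ms / 1000
    return total_bytes / GIB, total_bytes / GIB / seconds, issued_writes / seconds


ISSUED = 58560     # "issued rwts: total=0,58560,0,0" in every run
BLOCK = 2 * MIB    # fio -bs=2M, assumed to mean 2 MiB

# Run times (msec) copied from the "Run status group 0" lines.
runs = {
    "xfs on pmem, --fallocate=none, io_uring": 8902,
    "xfs on pmem, --fallocate=native, io_uring": 3838,
    "xfs on pmem, --fallocate=native, psync": 4736,
}

results = {name: summarize(ISSUED, BLOCK, ms) for name, ms in runs.items()}
for name, (gib, gib_per_s, iops) in results.items():
    print(f"{name}: {gib:.1f} GiB, {gib_per_s:.1f} GiB/s, {iops:.0f} IOPS")
```

The derived bandwidths agree with the 12.8, 29.8 and 24.1 GiB/s that fio printed, so the roughly 2x gap comes from run time alone, with the same amount of data written in each case.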
On 24/02/29 08:52AM, Amir Goldstein wrote: > On Fri, Feb 23, 2024 at 7:42 PM John Groves <John@groves.net> wrote: > > > > This patch set introduces famfs[1] - a special-purpose fs-dax file system > > for sharable disaggregated or fabric-attached memory (FAM). Famfs is not > > CXL-specific in anyway way. > > > > * Famfs creates a simple access method for storing and sharing data in > > sharable memory. The memory is exposed and accessed as memory-mappable > > dax files. > > * Famfs supports multiple hosts mounting the same file system from the > > same memory (something existing fs-dax file systems don't do). > > * A famfs file system can be created on either a /dev/pmem device in fs-dax > > mode, or a /dev/dax device in devdax mode (the latter depending on > > patches 2-6 of this series). > > > > The famfs kernel file system is part the famfs framework; additional > > components in user space[2] handle metadata and direct the famfs kernel > > module to instantiate files that map to specific memory. The famfs user > > space has documentation and a reasonably thorough test suite. > > > > So can we say that Famfs is Fuse specialized for DAX? > > I am asking because you seem to have asked it first: > https://lore.kernel.org/linux-fsdevel/0100018b2439ebf3-a442db6f-f685-4bc4-b4b0-28dc333f6712-000000@email.amazonses.com/ > I guess that you did not get your answers to your questions before or at LPC? Thanks for paying attention Amir. I think there is some validity to thinking of famfs as Fuse for DAX. Administration / metadata originating in user space is similar (but doing it this way also helps reduce RAS exposure to memory that might have a more complex connection path). One way it differs from fuse is that famfs is very much aimed at use cases that require performance. *Accessing* files must run at full memory speeds. > > I did not see your question back in October. 
> Let me try to answer your questions and we can discuss later if a new dedicated > kernel driver + userspace API is really needed, or if FUSE could be used as is > extended for your needs. > > You wrote: > "...My naive reading of the existence of some sort of fuse/dax support > for virtiofs > suggested that there might be a way of doing this - but I may be wrong > about that." > > I'm not virtiofs expert, but I don't think that you are wrong about this. > IIUC, virtiofsd could map arbitrary memory region to any fuse file mmaped > by virtiofs client. > > So what are the gaps between virtiofs and famfs that justify a new filesystem > driver and new userspace API? I have a lot of thoughts here, and an actual conversation might be good sooner rather than later. I hope to be at LSFMM to discuss this - if you agree, put in a vote for my topic ;). But if you want to talk sooner than that, I'm interested. I think one piece of evidence that this isn't possible with Fuse today is that I had to plumb the iomap interface for /dev/dax in this patch set. That is the way that fs-dax file systems communicate with the dax layer for fault resolution. If fuse/virtiofs handles dax somehow without the iomap interface, I suspect it's doing something somehow simpler, /and/ that might need to get reconciled with the fs-dax methodology. Or maybe I don't know what I'm talking about (in which case, please help :D). I think one thing that might make sense would be to bring up this functionality as a standalone file system, and then consider merging it into fuse when & if the time seems right. Famfs doesn't currently have any up-calls. User space plays the log and tells the kmod to instantiate files with extent lists to dax. Access happens with zero user space involvement. The important thing, the thing I'm currently paid for, is making it practical to use disaggregated shared memory - it's ultimately not important which mechanism is used to enable a filesystem access method for memory. 
But caching metadata in the kernel for efficient fault handling is the only way to get it to perform at "memory speeds" so that appears critical. One final observation: famfs has significantly more code in user space than in kernel space, and it's the user side that is likely to grow over time. That logic is at least theoretically independent of the kernel ABI. > > Thanks, > Amir. Thanks! John
On Thu, Feb 29, 2024 at 08:52:48AM -0600, John Groves wrote: > On 24/02/29 01:15PM, Dave Chinner wrote: > > On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote: > > > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719 > > > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719 > > > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02% > > > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75% > > > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25% > > > > Most of the IO latencies are up round the 4-20ms marks. That seems > > kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB > > should only take a couple of hundred microseconds. For Famfs, the > > latencies appear to be around 1-4ms. > > > > So where's all that extra time coming from? > > Below, you will see two runs with performance and latency distribution > about the same as famfs (the answer for that was --fallocate=native). Ah, that is exactly what I suspected, and was wanting profiles because that will show up in them clearly. > > > lat (msec) : 100=0.08% > > > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511 > > > > And why is system time reporting at almost zero instead of almost > > all the remaining cpu time (i.e. up at 80-90%)? > > Something weird is going on with the cpu reporting. Sometimes sys=~0, but other times > it's about what you would expect. I suspect some sort of measurement error, > like maybe the method doesn't work with my cpu model? (I'm grasping, but with > a somewhat rational basis...) > > I pasted two xfs runs below. The first has the wonky cpu sys value, and > the second looks about like what one would expect. > > > > > Can you run call-graph kernel profiles for XFS and famfs whilst > > running this workload so we have some insight into what is behaving > > differently here? > > Can you point me to an example of how to do that? perf record --call-graph ... 
perf report --call-graph ... > I'd been thinking about the ~2x gap for a few days, and the most obvious > difference is famfs files must be preallocated (like fallocate, but works > a bit differently since allocation happens in user space). I just checked > one of the xfs files, and it had maybe 80 extents (whereas the famfs > files always have 1 extent here). Which is about 4MB per extent. Extent size is not the problem for zero-seek-latency storage hardware, though. Essentially what you are seeing is interleaving extent allocation between all the files because they are located in the same directory. The locality algorithm is trying to place the data extents close to the owner inode, but the inodes are also all close together because they are located in the same AG as the parent directory inode. Allocation concurrency is created by placing new directories in different allocation groups, so we end up with workloads in different directories being largely isolated from each other. However, that means when you are trying to write to many files in the same directory at the same time, they are largely all competing for the same AG lock to do block allocation during IO submission. That creates interleaving of write() sized extents between different files. We use speculative preallocation for buffered IO to avoid this, and for direct IO the application needs to use extent size hints or preallocation to avoid this contention-based interleaving. IOWs, by using fallocate() to preallocate all the space there will be no allocation during IO submission and so the serialisation that occurs due to competing allocations just goes away... > FWIW I ran xfs with and without io_uring, and there was no apparent > difference (which makes sense to me because it's not block I/O). > > The prior ~2x gap still seems like a lot of overhead for extent list > mapping to memory, but adding --fallocate=native to the xfs test brought > it into line with famfs: As I suspected. 
:) As for CPU usage accounting, the number of context switches says it all. "Bad": > cpu : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801 "good": > cpu : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708 I'd say that in the "bad" case most of the kernel work is being shuffled off to kernel threads to do the work and so it doesn't get accounted to the submission task. In comparison, in the "good" case the work is being done in the submission thread and hence there's a lot fewer context switches and the system time is correctly accounted to the submission task. Perhaps an io_uring task accounting problem? Cheers, Dave.
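The preallocation advice in this message (reserve all the space up front so that no block allocation happens during IO submission) can be sketched from the application side as follows. This is an illustration under stated assumptions, not fio's implementation: the helper name `write_preallocated` and the small sizes are made up, it uses `os.posix_fallocate` as the preallocation call, and it does plain (non-O_DIRECT) writes to stay short, since direct IO would additionally need aligned buffers.

```python
import os
import tempfile

MIB = 1024 ** 2


def write_preallocated(path: str, file_size: int, block_size: int) -> int:
    """Reserve file_size bytes first, then fill the file block by block."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Allocate the whole range before any data is written, so the
        # writes that follow land in already-allocated space.
        os.posix_fallocate(fd, 0, file_size)
        buf = b"\0" * block_size
        written = 0
        while written < file_size:
            chunk = buf[: min(block_size, file_size - written)]
            written += os.pwrite(fd, chunk, written)
    finally:
        os.close(fd)
    return written


# Hypothetical small demo: a 4 MiB file written in 1 MiB blocks.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "prealloc.dat")
    bytes_written = write_preallocated(demo_path, 4 * MIB, 1 * MIB)
    final_size = os.stat(demo_path).st_size

print(bytes_written, final_size)
```

With many writers in one directory, doing the allocation once per file before the write loop is what removes the per-IO competition for the allocation lock described above; the write loop itself is unchanged.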