[PATCHSET,v02,00/16] zuf: ZUFS Zero-copy User-mode FileSystem

Message ID 20190926020725.19601-1-boazh@netapp.com (mailing list archive)

Boaz Harrosh Sept. 26, 2019, 2:07 a.m. UTC
I would like to submit the Kernel code part of the ZUFS file system
for review. This is v02.

[v02]
   The git tree with the changes over v01 can be found at (note: based on v5.2):
	git https://github.com/NetApp/zufs-zuf upstream-5.2-v02-fixes

   The patches submitted are at:
	git https://github.com/NetApp/zufs-zuf upstream-v02

   List of changes since v01:
   * Based on Linux v5.3. The previous version was based on v5.2 and
     experienced build breakage with v5.3-rcX.
   * Addressed *all* comments by the Intel Robot. The code should be
     completely free of any warnings.
   * Fixed some bugs found since the time of the first submission.
   * More I(s) dotted and T(s) crossed. Please see the
     upstream-5.2-v02-fixes branch above for all the patches on top of
     v01 before they were re-squashed into this set.

[v01]
   find v01 submission here:
   https://lore.kernel.org/linux-fsdevel/20190812164806.15852-1-boazh@netapp.com/
   On github:
	git https://github.com/NetApp/zufs-zuf upstream-v01

---
ZUFS is a full implementation of a VFS filesystem, but mainly it is a
new way to communicate with user-mode servers, with performance and
scalability not seen before (<4us latency).
Why? The core communication with user-mode is completely lockless, with
per-CPU locality, and is NUMA aware.

The Kernel code presented here can be found at:
	https://github.com/NetApp/zufs-zuf upstream

And the User-mode Server + example FSs here:
	https://github.com/NetApp/zufs-zus upstream

ZUFS stands for Zero-copy User-mode FS.
The intention of this project is performance and low latency:
* True zero copy, end to end, of both data and metadata.
* Very *low latency*, very high CPU locality, lock-less parallelism.
* Synchronous operations (for low latency).
* NUMA awareness.

Short description:
  ZUFS is a from-scratch implementation of a filesystem-in-user-space which
  tries to address the above goals. From the get-go it is aimed at pmem-based
  FSs, but it supports any other type of FS.
  The novelty of this project is that the interface is designed with a modern
  multi-core NUMA machine in mind, down to the ABI.
  It also uses the normal mount API of the Kernel.
  Multiple block devices are supported per superblock, and the Kernel owns
  those devices. Filesystem types are registered/exposed in the regular way.

The Kernel code is released under a pure GPLv2 license. The user-mode core is
BSD-3 so as to be friendly to other OSs.

Current status: There are a couple of trivial open-source filesystem
implementations and a full-blown proprietary implementation from NetApp.
Three more ports to more serious open-source filesystems are on the way:
a user-mode Ceph client, a ZFS implementation, and a port of the infamous PMFS
to demonstrate the amazing pmem performance under zufs.
(They will be released as open source when they are ready.)

Together with the Kernel module submitted here, the User-mode Server and the
zusFS User-mode plugins pass NetApp QA, including xfstests + internal QA tests,
and are released to customers as Maxdata.
So it is very stable and performant.

In the git repository above there is also a backport for RHEL 7.6, 7.7 and 8.0,
including rpm packages for the Kernel and Server components.
(Evaluation licenses of Maxdata 1.5 are also available for developers.
 Please contact Amit Golander <Amit.Golander@netapp.com> if you need one.)

Performance:
A simple fio direct 4k random write test with incrementing number
of threads.

[fuse]
threads wr_iops wr_bw   wr_lat
1       33606   134424  26.53226
2       57056   228224  30.38476
4       88667   354668  40.12783
7       116561  466245  53.98572
8       129134  516539  55.6134

[fuse-splice]
threads wr_iops wr_bw   wr_lat
1       39670   158682  21.8399
2       51100   204400  34.63294
4       75220   300882  47.42344
7       97706   390825  63.04435
8       98034   392137  73.24263

[xfs-dax]
threads wr_iops wr_bw           wr_lat

[Maxdata-1.5-zufs]
threads wr_iops wr_bw           wr_lat
1       1041802 260,450         3.623
2       1983997 495,999         3.808
4       3829456 957,364         3.959
7       4501154 1,125,288       5.895330
8       4400698 1,100,174       6.922174
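
A quick cross-check of the tables above, as a sketch only. It assumes
wr_bw is in KB/s, wr_lat is in microseconds, and each thread issues one
synchronous 4k write at a time; the post does not state these units. The
rows are copied from the [fuse] and [Maxdata-1.5-zufs] tables, and the
function names are made up for illustration.

```python
# Rows are (threads, wr_iops, wr_bw, wr_lat), copied from the tables in
# this message (thousands separators removed from the zufs wr_bw column).
FUSE = [
    (1, 33606, 134424, 26.53226),
    (2, 57056, 228224, 30.38476),
    (4, 88667, 354668, 40.12783),
    (7, 116561, 466245, 53.98572),
    (8, 129134, 516539, 55.6134),
]
ZUFS = [
    (1, 1041802, 260450, 3.623),
    (2, 1983997, 495999, 3.808),
    (4, 3829456, 957364, 3.959),
    (7, 4501154, 1125288, 5.895330),
    (8, 4400698, 1100174, 6.922174),
]


def bw_over_iops(rows):
    """Ratio wr_bw / wr_iops per row; about 4 if wr_bw is KB/s of 4k writes."""
    return [bw / iops for _, iops, bw, _ in rows]


def latency_bound_ops(rows):
    """Ops/s implied by latency alone: threads * 1e6 / wr_lat.

    Assumption: wr_lat is in microseconds and every thread has exactly
    one write in flight.
    """
    return [threads * 1e6 / lat for threads, _, _, lat in rows]


if __name__ == "__main__":
    for name, rows in (("fuse", FUSE), ("zufs", ZUFS)):
        print(name, [round(r, 3) for r in bw_over_iops(rows)])
        print(name, [round(x) for x in latency_bound_ops(rows)])
```

With the posted numbers, wr_bw comes out at about 4 x wr_iops in the
[fuse] table but at about 0.25 x wr_iops in the [Maxdata-1.5-zufs] table,
so the two columns may be swapped or in different units there. That is an
inference from the numbers, not something the post states.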

I have used an 8 way KVM-qemu with 2 NUMA nodes.
(on an Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz)

Running fio with 4k random writes, O_DIRECT | O_SYNC, to a DRAM
simulated pmem (memmap=! at grub).
The fuse FS was a null-FS that memcpy's the same 4k block.
fio was run with more and more threads (see the threads column)
to test for scalability.

We see a bit of a slowdown when pushing to 8 threads. This is
mainly a scheduler and KVM issue. Big metal machines do better
(flatter scalability) but also degrade a bit under full load.
I will try to post real metal scores later.
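
To put a number on the scalability discussed above, one option is to
divide the IOPS at N threads by N times the single-thread IOPS (1.0 would
be perfectly flat scaling). A small sketch using the wr_iops columns from
the [fuse] and [Maxdata-1.5-zufs] tables; the function name is
illustrative and not part of fio or zufs.

```python
# wr_iops by thread count, copied from the tables in this message.
FUSE_IOPS = {1: 33606, 2: 57056, 4: 88667, 7: 116561, 8: 129134}
ZUFS_IOPS = {1: 1041802, 2: 1983997, 4: 3829456, 7: 4501154, 8: 4400698}


def scaling_efficiency(iops_by_threads):
    """Return {threads: iops / (threads * single_thread_iops)}."""
    base = iops_by_threads[1]
    return {n: iops / (n * base)
            for n, iops in sorted(iops_by_threads.items())}


if __name__ == "__main__":
    for name, table in (("fuse", FUSE_IOPS), ("zufs", ZUFS_IOPS)):
        for n, eff in scaling_efficiency(table).items():
            print(f"{name:5s} threads={n} efficiency={eff:.3f}")
```

For the zufs column this gives about 0.92 at 4 threads and about 0.53 at
8 threads, which matches the slowdown described above for this 8-way VM.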

The in-Kernel xfs-dax is slower than zufs-pmem because:
1. It was not built specifically for pmem, so there are latency
   issues (async operations) and extra copies in places.
2. In writes, because of the journal, there are actually 3 IOPs
   for every write, whereas with pmem other means can keep things
   crash-proof.
3. In random write + DAX each block is written twice:
   it is first zeroed and then copied to.
4. But mainly because we use a single pmem on one of the NUMA nodes.
   With zufs we put a pmem device on each NUMA node, and each core
   writes locally, so the memory bandwidth is doubled. (Perhaps there
   is a way to use a dm configuration that makes this better, but at
   the base xfs is not NUMA aware.)
This is why I chose writes. With reads xfs-dax is much faster. In
zufs, reads are actually 10% slower because in reads we do a regular
memcpy-from-pmem, which is exactly 10% slower than mov_nt operations.
[Changes since last RFC submission]

Lots and lots of changes since then. More hardening, stability
and more fixtures.

But the main change is the NEW-IO way.
The old way of IO, where we mmap application pages into the Server, is
still there because there are modes where this is still faster,
for example direct IO from network types of FSs. We are all about choice.
(The zusFS is the one that decides which mode to use.)
But the results above are with the NEW-IO way. In the new way
we ask the Server which blocks to read/write (either pmem or bdev),
and the IO or pmem_memcpy is done in the Kernel.
(We do not yet cache these results in the Kernel but might in the future,
 when caching will actually make things faster. Currently xarray does
 not scale for us.)
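
The split described above (the Server only answers "which blocks", the
copy itself happens on the Kernel side) can be pictured with a toy model.
This is only an illustration of the idea: ToyServer, get_blocks and
toy_kernel_write are hypothetical names, the allocation policy is
invented, and none of it reflects the actual zuf/zus ABI.

```python
BLOCK = 4096  # block size chosen for the illustration


class ToyServer:
    """Stands in for the user-mode Server: it only maps file blocks to
    pmem block numbers and never touches the data."""

    def __init__(self):
        self.block_map = {}  # (inode, file_block) -> pmem block number
        self.next_free = 0

    def get_blocks(self, inode, offset, length):
        """Return the pmem block numbers covering [offset, offset+length)."""
        first = offset // BLOCK
        last = (offset + length - 1) // BLOCK
        blocks = []
        for file_block in range(first, last + 1):
            key = (inode, file_block)
            if key not in self.block_map:  # allocate on first write
                self.block_map[key] = self.next_free
                self.next_free += 1
            blocks.append(self.block_map[key])
        return blocks


def toy_kernel_write(server, pmem, inode, offset, data):
    """Stands in for the Kernel side: one round trip to get the block
    list, then the copy into pmem is done here, not in the Server.
    `pmem` must be a bytearray large enough for the allocated blocks."""
    blocks = server.get_blocks(inode, offset, len(data))
    pos = 0
    for blk in blocks:
        in_block = (offset + pos) % BLOCK
        n = min(BLOCK - in_block, len(data) - pos)
        start = blk * BLOCK + in_block
        pmem[start:start + n] = data[pos:pos + n]
        pos += n
    return pos
```

In this picture the data never passes through the Server, which is the
point of the NEW-IO way as described above.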

[TODOs]
1. EZUFS_ASYNC is not submitted here. It is implemented but there are
   no current users so it was never fully tested (waiting for a user)
2. Support Page-cache. This one is very easy to do, but again no users
   yet
3. more stuff ....

Please help with *reviews*, comments, questions. We believe this is a very
important project that opens new ways for implementing Server-applications,
including but not restricted to FS Server applications.

Thank you
Boaz

----------------------------------------------------------------
Boaz Harrosh (16):
      fs: Add the ZUF filesystem to the build + License
      MAINTAINERS: Add the ZUFS maintainership
      zuf: Preliminary Documentation
      zuf: zuf-rootfs
      zuf: zuf-core The ZTs
      zuf: Multy Devices
      zuf: mounting
      zuf: Namei and directory operations
      zuf: readdir operation
      zuf: symlink
      zuf: Write/Read implementation
      zuf: mmap & sync
      zuf: More file operation
      zuf: ioctl implementation
      zuf: xattr && acl implementation
      zuf: Support for dynamic-debug of zusFSs

 Documentation/filesystems/zufs.txt |  386 +++++++++++++++++++++++++++++++
 MAINTAINERS                        |    6 +
 fs/Kconfig                         |    1 +
 fs/Makefile                        |    1 +
 fs/zuf/Kconfig                     |   24 ++
 fs/zuf/Makefile                    |   23 ++
 fs/zuf/_extern.h                   |  179 +++++++++++++++
 fs/zuf/_pr.h                       |   68 ++++++
 fs/zuf/acl.c                       |  270 ++++++++++++++++++++++
 fs/zuf/directory.c                 |  171 ++++++++++++++
 fs/zuf/file.c                      |  825 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/inode.c                     |  630 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/ioctl.c                     |  309 +++++++++++++++++++++++++
 fs/zuf/md.c                        |  742 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/md.h                        |  332 +++++++++++++++++++++++++++
 fs/zuf/md_def.h                    |  141 ++++++++++++
 fs/zuf/mmap.c                      |  300 ++++++++++++++++++++++++
 fs/zuf/module.c                    |   28 +++
 fs/zuf/namei.c                     |  435 +++++++++++++++++++++++++++++++++++
 fs/zuf/relay.h                     |  104 +++++++++
 fs/zuf/rw.c                        | 1051 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/super.c                     |  954 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/symlink.c                   |   74 ++++++
 fs/zuf/t1.c                        |  145 ++++++++++++
 fs/zuf/t2.c                        |  356 ++++++++++++++++++++++++++++
 fs/zuf/t2.h                        |   68 ++++++
 fs/zuf/xattr.c                     |  314 +++++++++++++++++++++++++
 fs/zuf/zuf-core.c                  | 1735 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf-root.c                  |  519 +++++++++++++++++++++++++++++++++++++++++
 fs/zuf/zuf.h                       |  452 ++++++++++++++++++++++++++++++++++++
 fs/zuf/zus_api.h                   | 1075 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 31 files changed, 11718 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt
 create mode 100644 fs/zuf/Kconfig
 create mode 100644 fs/zuf/Makefile
 create mode 100644 fs/zuf/_extern.h
 create mode 100644 fs/zuf/_pr.h
 create mode 100644 fs/zuf/acl.c
 create mode 100644 fs/zuf/directory.c
 create mode 100644 fs/zuf/file.c
 create mode 100644 fs/zuf/inode.c
 create mode 100644 fs/zuf/ioctl.c
 create mode 100644 fs/zuf/md.c
 create mode 100644 fs/zuf/md.h
 create mode 100644 fs/zuf/md_def.h
 create mode 100644 fs/zuf/mmap.c
 create mode 100644 fs/zuf/module.c
 create mode 100644 fs/zuf/namei.c
 create mode 100644 fs/zuf/relay.h
 create mode 100644 fs/zuf/rw.c
 create mode 100644 fs/zuf/super.c
 create mode 100644 fs/zuf/symlink.c
 create mode 100644 fs/zuf/t1.c
 create mode 100644 fs/zuf/t2.c
 create mode 100644 fs/zuf/t2.h
 create mode 100644 fs/zuf/xattr.c
 create mode 100644 fs/zuf/zuf-core.c
 create mode 100644 fs/zuf/zuf-root.c
 create mode 100644 fs/zuf/zuf.h
 create mode 100644 fs/zuf/zus_api.h

Comments

Miklos Szeredi Sept. 26, 2019, 7:11 a.m. UTC | #1
On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:

> Performance:
> A simple fio direct 4k random write test with incrementing number
> of threads.
>
> [fuse]
> threads wr_iops wr_bw   wr_lat
> 1       33606   134424  26.53226
> 2       57056   228224  30.38476
> 4       88667   354668  40.12783
> 7       116561  466245  53.98572
> 8       129134  516539  55.6134
>
> [fuse-splice]
> threads wr_iops wr_bw   wr_lat
> 1       39670   158682  21.8399
> 2       51100   204400  34.63294
> 4       75220   300882  47.42344
> 7       97706   390825  63.04435
> 8       98034   392137  73.24263
>
> [xfs-dax]
> threads wr_iops wr_bw           wr_lat

Data missing.

> [Maxdata-1.5-zufs]
> threads wr_iops wr_bw           wr_lat
> 1       1041802 260,450         3.623
> 2       1983997 495,999         3.808
> 4       3829456 957,364         3.959
> 7       4501154 1,125,288       5.895330
> 8       4400698 1,100,174       6.922174

Just a heads up, that I have achieved similar results with a prototype
using the unmodified fuse protocol.  This prototype was built with
ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
op).  I found a big scheduler scalability bottleneck that is caused by
update of mm->cpu_bitmap at context switch.   This can be worked
around by using shared memory instead of shared page tables, which is
a bit of a pain, but it does prove the point.  Thought about fixing
the cpu_bitmap cacheline pingpong, but didn't really get anywhere.

Are you interested in comparing zufs with the scalable fuse prototype?
 If so, I'll push the code into a public repo with some instructions,

Thanks,
Miklos
Bernd Schubert Sept. 26, 2019, 9:41 a.m. UTC | #2
Hi Miklos,

> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 
> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,

I would be happy to help here (review, lightly test and debug). I have wanted
to give the ioctl threads method a try for some time already, but never
got to it.


Thanks,
Bernd
Boaz Harrosh Sept. 26, 2019, 11:27 a.m. UTC | #3
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> 
<>
>> [xfs-dax]
>> threads wr_iops wr_bw           wr_lat
> 
> Data missing.
> 

Oops, sorry, will send today.

>> [Maxdata-1.5-zufs]
>> threads wr_iops wr_bw           wr_lat
>> 1       1041802 260,450         3.623
>> 2       1983997 495,999         3.808
>> 4       3829456 957,364         3.959
>> 7       4501154 1,125,288       5.895330
>> 8       4400698 1,100,174       6.922174
> 
> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 
> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,
> 

Yes please do send it. I will give it a good run.
What fuseFS do you use in usermode?

> Thanks,
> Miklos
> 

Thank you Miklos for looking
Boaz
Boaz Harrosh Sept. 26, 2019, 11:41 a.m. UTC | #4
On 26/09/2019 05:40, Matt Benjamin wrote:
> per discussion 2 weeks ago--is there a git repo or something that I can clone?
> 
> Matt
> 

Please look in the cover letter; there is a git tree address to clone
there:

[v02]
   The patches submitted are at:
	git https://github.com/NetApp/zufs-zuf upstream-v02

Also the same for zus Server in user-mode + infra:
	git https://github.com/NetApp/zufs-zus upstream


Please look in the 3rd patch:
	[PATCH 03/16] zuf: Preliminary Documentation

There are instructions on what to clone, how to compile and install,
and how to use the scripts in do-zu to run a system.
I would love a good review of this documentation as well.
I'm sure it's wrong and missing things. I have used it for so long that
I'm already blind to it.

Please bug me day and night with any question

Thanks
Boaz
Bernd Schubert Sept. 26, 2019, 12:12 p.m. UTC | #5
>> Are you interested in comparing zufs with the scalable fuse prototype?
>>  If so, I'll push the code into a public repo with some instructions,
>>
> 
> Yes please do send it. I will give it a good run.
> What fuseFS do you use in usermode?

For a start, passthrough should do, modified to skip all data. That is
what I am doing to measure fuse bandwidth. It also shouldn't be too
difficult to add an in-mem tree for dentries and inodes, to be able to
measure without tmpfs overhead.


Bernd
Boaz Harrosh Sept. 26, 2019, 12:24 p.m. UTC | #6
On 26/09/2019 15:12, Bernd Schubert wrote:
>>> Are you interested in comparing zufs with the scalable fuse prototype?
>>>  If so, I'll push the code into a public repo with some instructions,
>>>
>>
>> Yes please do send it. I will give it a good run.
>> What fuseFS do you use in usermode?
> 
> For the start passthrough should do, modified to skip all data. 

Skipping all data is not good for me, because it hides away the page-faults
and the actual memory bandwidth. What I do is either memcpy
a single preallocated block to all blocks in the IO, and/or fill
in a defined pattern where each ulong in the file contains its
offset as data. This gives me true results.
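
The offset pattern mentioned above could look something like the sketch
below. It assumes a 64-bit ulong stored little-endian and that "offset"
means the byte offset of the word within the file; the helper names are
made up and this is not the actual test code.

```python
import struct

WORD = 8  # assumption: ulong is 64 bits


def fill_offset_pattern(file_offset, length):
    """Return `length` bytes in which every 8-byte word holds its own
    byte offset within the file (little-endian, by assumption)."""
    if file_offset % WORD or length % WORD:
        raise ValueError("offset and length must be multiples of 8")
    return b"".join(
        struct.pack("<Q", off)
        for off in range(file_offset, file_offset + length, WORD)
    )


def first_mismatch(buf, file_offset):
    """Return the file offset of the first word that does not hold its
    own offset, or None if every whole word in `buf` checks out."""
    usable = len(buf) - len(buf) % WORD
    for i in range(0, usable, WORD):
        (value,) = struct.unpack_from("<Q", buf, i)
        if value != file_offset + i:
            return file_offset + i
    return None
```

A pattern like this forces the data to be really produced and moved, and
a read-back can tell exactly which word landed in the wrong place.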

> That is
> what I am doing to measure fuse bandwidth. It also shouldn't be too
> difficult to add an in-mem tree for dentries and inodes, to be able to
> measure without tmpfs overhead.
> 
Thanks, that is very helpful. I will use this.
Boaz

> 
> Bernd
>
Boaz Harrosh Sept. 26, 2019, 12:48 p.m. UTC | #7
On 26/09/2019 10:11, Miklos Szeredi wrote:
> On Thu, Sep 26, 2019 at 4:08 AM Boaz Harrosh <boaz@plexistor.com> wrote:
> 
> Just a heads up, that I have achieved similar results with a prototype
> using the unmodified fuse protocol.  This prototype was built with
> ideas taken from zufs (percpu/lockless, mmaped dev, single syscall per
> op).

>  I found a big scheduler scalability bottleneck that is caused by
> update of mm->cpu_bitmap at context switch.   This can be worked
> around by using shared memory instead of shared page tables, which is
> a bit of a pain, but it does prove the point.  Thought about fixing
> the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> 

I'm not sure what the scalability bottleneck you are seeing above is.
With zufs I have very good scalability, almost flat up to the
number of CPUs and/or the limit of the memory bandwidth if I'm accessing
pmem.

I do have a bad scalability bottleneck if I use mmap of pages, caused
by the call to zap_vma_ptes, which is why I invented the NIO way.
(Inspired by you)

Once you send me the git URL I will have a look at the code and see if
I can find any differences.

That said, I do believe that a new Scheduler object that completely
bypasses the scheduler and just relinquishes its time slice to the
switched-to thread will cut another 0.5us off the single-thread
latency. (The 5th patch talks about that.)

> Are you interested in comparing zufs with the scalable fuse prototype?
>  If so, I'll push the code into a public repo with some instructions,
> 
> Thanks,
> Miklos
> 

Miklos, would you please have some bandwidth to review my code? It would
make me very happy and calm. Your input is very valuable to me.

Thanks
Boaz
Miklos Szeredi Sept. 26, 2019, 1:45 p.m. UTC | #8
On Thu, Sep 26, 2019 at 2:24 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 15:12, Bernd Schubert wrote:
> >>> Are you interested in comparing zufs with the scalable fuse prototype?
> >>>  If so, I'll push the code into a public repo with some instructions,
> >>>
> >>
> >> Yes please do send it. I will give it a good run.

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git#fuse2

Enable:

CONFIG_FUSE2_FS=y
CONFIG_SAMPLE_FUSE2=y

> >> What fuseFS do you use in usermode?

It's the example loopback filesystem supplied in the git tree above.
I haven't converted libfuse yet to use the new features, so for now
this is the only way to try it.

Usage:

    linux/samples/fuse2/loraw -2 -p -t ~/mnt/fuse/

    options:

     -d: debug
     -s: single threaded
     -b: FUSE_DEV_IOC_CLONE (v1)
     -p: use ioctl for device I/O (v2)
     -m: use "map read" transferring offset into file instead of actual data
     -1: use regular fuse
     -2: use experimental fuse2
     -t: use shared memory instead of threads

I tested with shmfs, and IIRC got about 4-8us latency, depending on
the hardware, type of operation, etc...

Let me know if something's not working properly (this is experimental code).

Thanks,
Miklos
Miklos Szeredi Sept. 26, 2019, 1:48 p.m. UTC | #9
On Thu, Sep 26, 2019 at 2:48 PM Boaz Harrosh <openosd@gmail.com> wrote:
>
> On 26/09/2019 10:11, Miklos Szeredi wrote:

> >  I found a big scheduler scalability bottleneck that is caused by
> > update of mm->cpu_bitmap at context switch.   This can be worked
> > around by using shared memory instead of shared page tables, which is
> > a bit of a pain, but it does prove the point.  Thought about fixing
> > the cpu_bitmap cacheline pingpong, but didn't really get anywhere.
> >
>
> I'm not sure what is the scalability bottleneck you are seeing above.
> With zufs I have a very good scalability, almost flat up to the
> number of CPUs, and/or the limit of the memory bandwith if I'm accessing
> pmem.

This was *really* noticeable with NUMA and many CPUs (>64).

> Miklos would you please have some bandwith to review my code? it would
> make me very happy and calm. Your input is very valuable to me.

Sure, will look at the patches.

Thanks,
Miklos