
[RFC,02/17] zuf: Preliminary Documentation

Message ID 20190219115136.29952-3-boaz@plexistor.com (mailing list archive)
State New, archived
Series zuf: ZUFS Zero-copy User-mode FileSystem

Commit Message

Boaz Harrosh Feb. 19, 2019, 11:51 a.m. UTC
From: Boaz Harrosh <boazh@netapp.com>

Adding Documentation/filesystems/zufs.txt

[v2]
  Incorporated Randy's few comments.

Randy, please give it a harder review?

CC: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 Documentation/filesystems/zufs.txt | 366 +++++++++++++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 Documentation/filesystems/zufs.txt

Comments

Miklos Szeredi Feb. 20, 2019, 8:27 a.m. UTC | #1
On Tue, Feb 19, 2019 at 12:51 PM Boaz harrosh <boaz@plexistor.com> wrote:
>

> +4. An FS operation like create or WRITE/READ and so on arrives from application
> +   via VFS. Eventually an Operation is dispatched to zus:
> +   ▪ A special per-operation descriptor is filled up with all parameters.
> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
> +     with this OPT.
> +   ▪ The ZT is awaken, app thread put to sleep.
> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
> +     the map is only on a single core. And no other core's TLB is affected.
> +     (This here is the all performance secret)

I still don't get it.  You say mapping the page to server address
space is the performance secret.  I say it's a security issue as well
as being perfectly useless, except for special cases.

So let's see.  There's the pmem case, which seems to be what this is
mostly about.    So we take a pmem filesystem and e.g. a read(2)
syscall.  There's no zero copy to talk about in that case since data
needs to be copied from pmem to application buffer.   If we want to
minimize memory copies, then it will be a single memory copy from pmem
to the app buffer.   Your implementation chooses to do this copy in
the userland server, via the application buffer mapped into its
address space.   But the memory copy could just as well be done by the
kernel, from one virtual address to another; the kernel just has to
juggle with physical page lookups, but does not have to establish a
new mapping, which makes it all the more secure and performant.

Sure, we've heard about arguments why the above doesn't work if the
data comes from e.g. a network where the network driver could be
writing data directly to application buffer.  But I've shown how that
could also be done with a new rsplice() interface, again without
having to insert the application page into the server address space.
And no, this doesn't have to involve syscall overhead, if that turns
out to be the limiting factor: mapped pipes might be an interesting
new concept as well.

The only way I see this implementation actually saving a memory copy is
if a transformation is required (which again, already implies at least
a single memory copy) such as compression or encryption.   But even
that could be delegated to the kernel, since it does have plenty of
such transformations implemented, only an interface is required to
tell the kernel which one to do on the buffer.

Thanks,
Miklos
Boaz Harrosh Feb. 20, 2019, 2:24 p.m. UTC | #2
On 20/02/19 10:27, Miklos Szeredi wrote:
> On Tue, Feb 19, 2019 at 12:51 PM Boaz harrosh <boaz@plexistor.com> wrote:
>>
> 
>> +4. An FS operation like create or WRITE/READ and so on arrives from application
>> +   via VFS. Eventually an Operation is dispatched to zus:
>> +   ▪ A special per-operation descriptor is filled up with all parameters.
>> +   ▪ A current CPU channel is grabbed. the operation descriptor is put on
>> +     that channel (ZT). Including get_user_pages or Kernel-pages associated
>> +     with this OPT.
>> +   ▪ The ZT is awaken, app thread put to sleep.
>> +   ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure
>> +     the map is only on a single core. And no other core's TLB is affected.
>> +     (This here is the all performance secret)
> 
> I still don't get it.  You say mapping the page to server address
> space is the performance secret.  I say it's a security issue as well
> as being perfectly useless, except for special cases.
> 

Already working on this. Will submit new code in a month. There is already
a GET_BLOCK API that returns one or more server blocks (via dpp_t) to the
Kernel. Today it is used in mmap, to directly map pmem to application space.
We will use this API to memcpy(nt) in kernel space.

The mapping facility will be optional for zusFS(s), but the preferred way
will be a Kernel-side copy.

This is because vm_zap_ptes() is way too slow in the current design
and I need things faster ...

<>
> Thanks,
> Miklos
> 

Thanks
Boaz

Patch

diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt
new file mode 100644
index 000000000000..01eb7a7e7257
--- /dev/null
+++ b/Documentation/filesystems/zufs.txt
@@ -0,0 +1,366 @@ 
+ZUFS - Zero-copy User-mode FileSystem
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trees:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+patches, comments, questions, requests to:
+	boazh@netapp.com
+
+Introduction:
+~~~~~~~~~~~~~
+
+ZUFS stands for Zero-copy User-mode FS.
+▪ It is geared towards true zero-copy, end to end, of both data and metadata.
+▪ It is geared towards very *low latency*, very high CPU locality, and
+  lock-less parallelism.
+▪ Synchronous operations
+▪ NUMA awareness
+
+ZUFS is a from-scratch implementation of a filesystem-in-user-space which
+tries to address the above goals. It is aimed at pmem-based FSs, but can easily
+support any other type of FS that can utilize the ~10x latency and parallelism
+improvements.
+
+Glossary and names:
+~~~~~~~~~~~~~~~~~~~
+
+ZUF - Zero-copy User-mode Feeder
+  zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel
+  VFS and dispatch commands to a User-mode application Server.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream
+
+ZUS - Zero-copy User-mode Server
+  zufs utilizes a User-mode server application that takes care of the detailed
+  communication protocol with the Kernel and its correctness.
+  In turn it utilizes many zusFS Filesystem plugins to implement the actual
+  on-disk Filesystem.
+  Up-to-date code is found at:
+	git clone https://github.com/NetApp/zufs-zus -b zus-github
+
+zusFS - FS plugins
+  These are .so loadable modules that implement one or more Filesystem-types
+  (-t xyz).
+  The zus server communicates with a plugin via a set of function vectors
+  for the different operations, and establishes communication via defined
+  structures (see the sketch below).
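+
+  A minimal illustrative sketch of such a vector table follows. The structure
+  and function names here are hypothetical; the real zus headers define the
+  actual registration structures and operation vectors.
+
+	/* Illustrative only: a zusFS plugin hands zus one table of function
+	 * pointers per Filesystem-type it implements.
+	 */
+	struct zus_sb_info;		/* opaque here */
+	struct zufs_ioc_hdr;		/* hypothetical per-operation descriptor */
+
+	struct zusfs_operations {
+		int (*mount)(struct zus_sb_info *sbi);
+		int (*umount)(struct zus_sb_info *sbi);
+		int (*dispatch)(struct zus_sb_info *sbi,
+				struct zufs_ioc_hdr *op);
+	};
+
+	struct zusfs_type {
+		const char *name;	/* mount -t <name> */
+		const struct zusfs_operations *ops;
+	};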
+
+Filesystem-type:
+  At startup zus registers with the Kernel one or more Filesystem-type(s).
+  Associated with each type is a unique type-name (mount -t foofs) plus
+  different info about the fs, like a magic number and so on.
+  One Server can support many FS-types; in turn each FS-type can mount
+  multiple super-blocks, each supporting multiple devices.
+
+Device-Table (DT) - A zufs FS can support multiple devices
+  ZUF in the Kernel may receive, like any mount command, a block-device or
+  none. For the former, if the specified FS-type states so in a special field,
+  the mount will look for a Device Table: a list of devices, in a specific
+  order, sitting at some offset on the block-device. The system will then
+  proceed to open and own all these devices and associate them with the
+  mounting super-block.
+  If the FS-type specifies -1 at DT_offset then there is no device table
+  and a DT of a single device is created. (If no devices are specified at all
+  we operate without any block devices; mount options give some indication of
+  the storage information.)
+  The device table has special consideration for pmem devices and will
+  present the whole linear array of devices to zus as one flat mmap space.
+  Non-pmem devices are also provided an interface, with a facility for data
+  movement from pmem to the slower devices.
+  Detailed NUMA info is exported to the Server for maximum utilization.
+
+pmem:
+  Multiple pmem devices are presented to the server as a single
+  linear file mmap, something like /dev/dax, but strictly
+  available only to the specific super-block that owns it.
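+
+  For illustration only, the server's view could be obtained roughly as below;
+  the fd would come from the zuf-provided pmem interface (IOCTL_ZU_GRAB_PMEM,
+  described later) and the variable names are hypothetical.
+
+	/* Sketch: the owned pmem of a super-block is one flat mapping */
+	void *pmem = mmap(NULL, pmem_total_size, PROT_READ | PROT_WRITE,
+			  MAP_SHARED, pmem_fd, 0);
+	if (pmem == MAP_FAILED)
+		return -errno;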
+
+dpp_t - Dual port pointer type
+  At some points in the protocol there are objects that are returned from zus
+  (the Server) to the Kernel via a dpp_t. This is a special kind of pointer:
+  it is actually an 8-byte-aligned offset with the 3 low bits specifying
+  a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7]
+  pool == 0 means the offset is in pmem, whose management is done by zuf, and
+  full easy access is provided for zus.
+
+  pool != 0 is a pre-established file (up to 6 such files per sb) where
+  zus has an mmap on the file and the Kernel can access that data
+  via an offset into the file.
+  pool == 7 denotes an offset into the application buffers associated
+  with the current IO.
+  The lifetime rules of all dpp_t objects are strictly defined.
+  The primary use of dpp_t is the on-pmem inode structure. Both
+  zus and zuf can access and change this structure; on any modification
+  zus is called so it is notified of the changes and can handle persistence.
+  More such objects are: symlinks, xattrs, mmap-data-blocks, etc.
+  (A decoding sketch follows.)
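+
+  The following decoding sketch is derived only from the encoding described
+  above; the type and helper names are hypothetical, not the real zuf/zus
+  definitions.
+
+	#include <stdint.h>
+
+	typedef uint64_t dpp_t;
+
+	#define DPP_POOL_MASK	0x7ULL
+	#define DPP_POOL_PMEM	0	/* offset into zuf-managed pmem */
+	#define DPP_POOL_APP	7	/* offset into the current IO's app buffers */
+
+	static inline uint64_t dpp_offset(dpp_t dpp)
+	{
+		return dpp & ~DPP_POOL_MASK;	/* 8-byte-aligned offset */
+	}
+
+	static inline unsigned int dpp_pool(dpp_t dpp)
+	{
+		return dpp & DPP_POOL_MASK;	/* 1..6: pre-established files */
+	}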
+
+Relay-wait-object:
+  Communication between the Kernel and the server is done via zus threads that
+  sleep in the Kernel (inside an IOCTL) and wait for commands. Once a command
+  is received the IOCTL returns; after the operation is executed, its return
+  info is delivered via a new IOCTL call, which then waits for the next
+  operation.
+  To wake up the sleeping thread we use a Relay-wait-object. Currently
+  it is two waitqueue_head(s) back to back (sketched below).
+  In the future we should investigate the use of the special binder object
+  that hands its thread's time slice to the other thread without going through
+  the scheduler.
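+
+  A rough sketch of the back-to-back wait-queue idea, assuming hypothetical
+  field and function names (the real zuf relay object differs in detail):
+
+	struct relay {
+		wait_queue_head_t fss_wq;	/* application side sleeps here */
+		bool fss_wakeup;
+		wait_queue_head_t zt_wq;	/* server ZT side sleeps here */
+		bool zt_wakeup;
+	};
+
+	/* Application side: wake the ZT, then wait for its reply */
+	static void relay_fss_wait(struct relay *r)
+	{
+		r->zt_wakeup = true;
+		wake_up(&r->zt_wq);
+		wait_event(r->fss_wq, r->fss_wakeup);
+		r->fss_wakeup = false;
+	}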
+
+ZT-threads-array:
+  The novelty of zufs is the ZT-threads system. One thread or more is
+  pre-created for each active core in the system.
+  ▪ Each thread's affinity is set to that single core only.
+  ▪ There is a special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT).
+    At initialization the ZT thread communicates through a ZT_INIT ioctl
+    and registers as the handler of that core (Channel).
+  ▪ ZT-vma - an mmapped 4M vma, a zero-copy communication area per ZT.
+    A pre-allocated vma into which the application or Kernel buffers for the
+    current operation will be mapped.
+  ▪ IOCTL_ZU_WAIT_OPT – the thread sleeps in the Kernel, waiting for an
+    operation, via the IOCTL_ZU_WAIT_OPT call.
+
+  ▪ On an operation dispatch the current CPU's ZT is selected and the app pages
+    are mapped into the ZT-vma. The Server thread is released with an operation
+    to execute.
+  ▪ After execution, the ZT returns to the kernel (IOCTL_ZU_WAIT_OPT), the app
+    is released, and the Server waits for a new operation on that CPU.
+  ▪ Each ZT has a cyclic logic: each call to IOCTL_ZU_WAIT_OPT from the Server
+    returns the results of the previous operation before going to sleep
+    waiting to receive a new operation (see the sketch after the diagram).
+	zus			zuf-zt				application
+    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+     ---> IOCTL_ZU_WAIT_OPT    if (app-waiting)
+     |					wake-up-application	 -> return to app
+     |				FS-WAIT
+     |				|				<- POSIX call
+     |				V		<- fs-wake-up(dispatch)
+     |			<- return with new command
+     |--<- do_new_operation
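+
+  The server side of this cyclic logic could look roughly like the sketch
+  below. The ioctl name and do_new_operation come from the text above; the
+  descriptor structure, the zt_fd variable and the error handling are
+  illustrative only.
+
+	struct zufs_ioc_hdr op_desc = {};	/* per-operation descriptor */
+
+	for (;;) {
+		/* Returns the previous operation's results to the kernel and
+		 * sleeps until a new operation is dispatched on this core.
+		 */
+		if (ioctl(zt_fd, IOCTL_ZU_WAIT_OPT, &op_desc) < 0)
+			break;
+
+		/* Execute via the zusFS operation vector; the results are
+		 * filled into op_desc for the next WAIT_OPT call.
+		 */
+		do_new_operation(&op_desc);
+	}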
+
+ZUS-mount-thread:
+  The system utilizes a single mount thread. (This thread is not affined to
+  any core.)
+  ▪ It will first register all FS-types supported by this Server (by calling
+    all zusFS plugins to register their supported types). Once done,
+  ▪ as above, the thread sleeps in the Kernel via the IOCTL_ZU_MOUNT call.
+  ▪ When the Kernel receives a mount request (vfs calls the fs_type->mount op)
+    a mount is dispatched back to zus.
+  ▪ NOTE: Only on the very first mount is the above ZT-threads-array created;
+    the same array is then used for all super-blocks in the system.
+  ▪ As part of the mount command, in the context of this same mount thread,
+    a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem
+    associated with this super_block.
+  ▪ On return, like above, a new call to IOCTL_ZU_MOUNT will return the info
+    of the mount before sleeping in the kernel waiting for a new dispatch. All
+    SB info is provided to zuf, including the root inode info. The Kernel then
+    proceeds to complete the mount call.
+  ▪ NOTE that since there is a single mount thread, all FS-registration,
+    super_block and pmem management are lockless (see the sketch below).
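+
+  Schematically the mount thread might look like the sketch below; the ioctl
+  names follow the text above, everything else (structures, helpers, the
+  zus_root_fd variable) is hypothetical.
+
+	/* Register all supported FS-types, then serve mount requests forever */
+	register_all_zusfs_types(zus_root_fd);
+
+	struct zufs_ioc_mount mount_desc = {};
+	for (;;) {
+		/* Returns the previous mount's reply and sleeps for the next */
+		if (ioctl(zus_root_fd, IOCTL_ZU_MOUNT, &mount_desc) < 0)
+			break;
+
+		/* Allocate a zus_sb_info, IOCTL_ZU_GRAB_PMEM the pmem, load
+		 * the root zus_inode and fill the reply into mount_desc.
+		 */
+		handle_mount_request(&mount_desc);
+	}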
+
+Philosophy of operations:
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. [zuf-root]
+
+On module load (zuf.ko) a special pseudo-FS is mounted on /sys/fs/zuf. This is
+called zuf-root.
+The zuf-root has no visible files. All communication is done via special-files.
+Special-files are opened with open(O_TMPFILE) and establish their special role
+via an IOCTL.
+All communication with the server is done via the zuf-root. Each root owns
+many FS-types and each FS-type owns many super-blocks of this type, all sharing
+the same communication channels.
+Since all FS-type Servers live in the same zus application address space, at
+times the administrator may want to separate different servers. He/she can
+mount a new zuf-root and point a new server instance at that new mount,
+registering other FS-types on that other instance. The whole communication
+array will then be duplicated as well.
+(Otherwise pointing a new server instance at a busy root will return an error.)
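+
+For illustration, establishing a special-file on zuf-root could look like the
+sketch below; the ioctl name is taken from the ZT description above, while the
+argument structure is a hypothetical placeholder.
+
+	int fd = open("/sys/fs/zuf", O_TMPFILE | O_RDWR, 0666);
+	if (fd < 0)
+		return -errno;
+
+	/* Give this anonymous file its special role, e.g. a ZT channel */
+	if (ioctl(fd, IOCTL_ZUFS_INIT, &init_args) < 0)
+		return -errno;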
+
+2. [zus server start]
+  ▪ On load, all configured zusFS plugins are loaded.
+  ▪ The Server starts by starting a single mount thread.
+  ▪ It then proceeds to register with the Kernel all FS-types it will support.
+    (This is done on the single mount thread, so all FS-registration and
+     mount/umount operations run in a single thread and therefore need no
+     locks.)
+  ▪ The thread then sleeps in the Kernel on a special-file of that zuf-root,
+    waiting for a mount command.
+
+3. [mount -t xyz]
+  [In Kernel]
+  ▪ If xyz was registered above as part of the Server startup, the regular
+    mount command will come to the zuf module via a zuf_mount() call, with
+    the xyz-FS-info. In turn this points to a zuf-root.
+  ▪ The code then proceeds to load a device table of devices as specified
+    above. It then establishes a multi_devices object with a specific pmem_id.
+  ▪ It proceeds to call mount_bdev, always with the same main device,
+    thus fully supporting automatic bind mounts, even if different
+    devices are given to the mount command.
+  ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread,
+    specifying two parameters: the FS-type to mount and the
+    pmem_id associated with this super_block.
+
+  [In zus]
+  ▪ A zus_super_block_info is allocated.
+  ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its
+    pmem devices. On return we have full access to our PMEM.
+
+  ▪ ZT-threads-array
+    If this is the first mount, the whole ZT-threads-array is created and
+    established. The mount thread will wait until all zt-threads have finished
+    initialization and are ready to rock.
+  ▪ The root zus_inode is loaded and returned to the kernel.
+  ▪ More info about the mount, like block sizes and so on, is returned to the
+    kernel.
+
+  [In Kernel]
+   zuf_fill_super is finalized, the vectors are established, and we have a new
+   super_block ready for operations.
+
+4. An FS operation like create or WRITE/READ and so on arrives from an
+   application via the VFS. Eventually an operation is dispatched to zus:
+   ▪ A special per-operation descriptor is filled with all parameters.
+   ▪ The current CPU's channel is grabbed and the operation descriptor is put
+     on that channel (ZT), including the get_user_pages or Kernel pages
+     associated with this OPT.
+   ▪ The ZT is awakened and the app thread is put to sleep.
+   ▪ In ZT context the pages are mapped into that ZT-vma. This way we are sure
+     the mapping is only on a single core and no other core's TLB is affected.
+     (This is the whole performance secret.)
+   ▪ The ZT thread returns to user space.
+   ▪ In ZT context the zus Server calls the appropriate zusFS->operation
+     vector. Output params are filled.
+   ▪ zus calls IOCTL_ZU_WAIT_OPT again with the same descriptor
+     to return the requested info.
+   ▪ In the Kernel (zuf) the app thread is awakened with the results, and the
+     ZT thread goes back to sleep waiting for a new operation.
+
+   ZT rules:
+       A ZT thread must not return back into the Kernel. The one exception is
+   locks: if needed, it might sleep waiting for a lock, in which case we will
+   see the same CPU channel being reentered via another application and/or
+   thread while that CPU channel is taken. What we do is utilize a few channels
+   (ZTs) per core so the threads may grab another channel, but this only
+   postpones the problem: on a busy, contended system all such channels will be
+   consumed. If all channels are taken, the application thread is put on a busy
+   scheduling wait until a channel can be grabbed.
+   Therefore the Server must not sleep on a ZT. If it needs such a sleeping
+   operation it will return -EAGAIN to zuf. The app is kept sleeping, the
+   operation is put on an asynchronous queue, and the ZT is freed for
+   foreground operations. At some point, when the server completes the delayed
+   operation, it will completion-notify the Kernel with a special async cookie,
+   and the app will be awakened (see the sketch below).
+   (Here too we utilize pre-allocated async channels and vmas. If all channels
+    are busy, the application is kept sleeping, waiting for its turn at a free
+    slot.)
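+
+   A server-side sketch of this rule, with hypothetical zusFS helper names
+   (only the -EAGAIN convention comes from the text above):
+
+	static int foofs_dispatch(struct zus_sb_info *sbi,
+				  struct zufs_ioc_hdr *op)
+	{
+		if (operation_would_block(op)) {
+			/* Queue for a background worker, which will later
+			 * completion-notify the kernel via op's async cookie.
+			 */
+			foofs_queue_async(sbi, op);
+			return -EAGAIN;	/* frees this ZT for foreground work */
+		}
+		/* Fast path: runs to completion on the ZT, never sleeps */
+		return foofs_execute(op);
+	}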
+
+5. On umount the operation is reversed and all resources are torn down.
+6. In case of an application or Server crash, all resources are associated
+   with files; on file_release these resources are caught and freed.
+
+Objects and life-time
+~~~~~~~~~~~~~~~~~~~~~
+
+Each Kernel object type has an associated zus Server object type whose
+lifetime is governed by the lifetime of the Kernel object. Therefore the
+Server's job is easy: it need not establish any object caches / hashes and
+so on.
+
+Inside zus all objects are allocated by the zusFS plugin, so in turn it can
+allocate a bigger space for its own private data and access it via the
+container_of() coding pattern (sketched below). So when I say below a
+zus-object, I mean both the zus public part and the zusFS private part of the
+same object.
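+
+An illustrative sketch of that embedding pattern, with a hypothetical zusFS
+called foofs (the real plugins define their own types):
+
+	struct foofs_inode_info {
+		struct zus_inode_info zii;	/* public zus part */
+		/* zusFS (foofs) private part follows */
+		void *foofs_private;
+	};
+
+	static inline struct foofs_inode_info *FII(struct zus_inode_info *zii)
+	{
+		return container_of(zii, struct foofs_inode_info, zii);
+	}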
+
+All operations return a User-mode pointer that is opaque to the Kernel
+code; it is just a cookie which is returned back to zus when needed.
+At times, when we want the Kernel to have direct access to a zus object like
+zus_inode, along with the cookie we also return a dpp_t with a defined
+structure.
+
+Kernel object 			| zus object 		| Kernel access (via dpp_t)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+zuf_fs_type
+	file_system_type	| zus_fs_info		| no
+
+zuf_sb_info
+	super_block		| zus_sb_info		| no
+
+zuf_inode_info			|			|
+	vfs_inode		| zus_inode_info	| no
+	zus_inode *		| 	zus_inode *	| yes
+	symlink *		|	char-array	| yes
+	xattr**			|	zus_xattr	| yes
+
+When a Kernel object's time comes to die, a final call to zus is
+dispatched so the associated object can also be freed. This means
+that on memory pressure, when object caches are evicted, the zus
+memory resources are freed as well.
+
+
+How to use zufs:
+~~~~~~~~~~~~~~~~
+
+The most up-to-date documentation of how to use the latest code bases
+is the script (set of scripts) at fs/do-zu/zudo in the zus git tree.
+
+We, the developers at NetApp, use this script to mount and test our
+latest code, so any new secret will be found in these scripts. Please
+read them as the ultimate source of how to operate things.
+
+TODO: We are looking for experts in systemd and udev to properly
+integrate these tools into a distro.
+
+We assume you cloned these git trees:
+[]$ mkdir zufs; cd zufs
+[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream zuf
+[]$ git clone https://github.com/NetApp/zufs-zus -b zus-github zus
+
+This will create the following trees
+zufs/zus - Source code for Server
+zufs/zuf - Linux Kernel source tree to compile and install on your machine
+
+Also specifically:
+zufs/zus/fs/do-zu/zudo - a script documenting how to run things
+
+[]$ cd zuf
+
+First time:
+[]$ ../zus/fs/do-zu/zudo
+This will create a file:
+	../zus/fs/do-zu/zu.conf
+
+Edit this file for your environment: devices, mount-point and so on.
+On the first run an example file will be created for you. Fill in the
+blanks; most params can stay as-is in most cases.
+
+Now let's start running:
+
+[1]$ ../zus/fs/do-zu/zudo mkfs
+This will run the proper mkfs command, selected in the zu.conf file,
+with the proper devices.
+
+[2]$ ../zus/fs/do-zu/zudo zuf-insmod
+This loads the zuf.ko module
+
+[3]$ ../zus/fs/do-zu/zudo zuf-root
+This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above)
+
+[4]$ ../zus/fs/do-zu/zudo zus-up
+This runs the zus daemon in the background
+
+[5]$ ../zus/fs/do-zu/zudo mount
+This mounts the FS created by mkfs above on the dir specified in zu.conf
+
+To run all the 5 commands above at once do:
+[]$ ../zus/fs/do-zu/zudo up
+
+To undo all the above in reverse order do:
+[]$ ../zus/fs/do-zu/zudo down
+
+And the most magic command is:
+[]$ ../zus/fs/do-zu/zudo again
+Will do a "down", then update-mods, then "up"
+(update-mods is a special script to copy the latest compiled binaries)
+
+Now you are ready for some:
+[]$ ../zus/fs/do-zu/zudo xfstest
+xfstests is assumed to be installed in the regular /opt/xfstests dir
+
+Again, please see inside the scripts what each command does;
+these scripts are the ultimate documentation. Do not believe
+anything I am saying here, because it is outdated by now.