From patchwork Tue Feb 19 11:51:21 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Boaz Harrosh X-Patchwork-Id: 10819725 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D583C139A for ; Tue, 19 Feb 2019 11:51:50 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BB9BF2BB54 for ; Tue, 19 Feb 2019 11:51:50 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AE01A2B675; Tue, 19 Feb 2019 11:51:50 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,MAILING_LIST_MULTI,RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B67802B675 for ; Tue, 19 Feb 2019 11:51:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726633AbfBSLvs (ORCPT ); Tue, 19 Feb 2019 06:51:48 -0500 Received: from mail-wm1-f67.google.com ([209.85.128.67]:40775 "EHLO mail-wm1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725772AbfBSLvr (ORCPT ); Tue, 19 Feb 2019 06:51:47 -0500 Received: by mail-wm1-f67.google.com with SMTP id t15so2263626wmi.5 for ; Tue, 19 Feb 2019 03:51:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=plexistor-com.20150623.gappssmtp.com; s=20150623; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=n7NM/8eUQsygU2igjXZGvb/jAOzD41ZICP5WHi+Ij0Q=; b=KTd5O0yQExvrpexQ9NBxV2P7w/CWgyQJYn2tIb4RoTmhoHEKX+ng6OMQq5CCibxAgj PBqfQM28cKDz38/FLR+a0QoArde7o77zyOe6GSmbfp0G1GLOwqO+XwiZXQaQ9/f9gii9 kuYGO8Zva29lRqLoLR8vNd4hxyPN+0eJgLY6jplc+LOY596lSAW2NjUycMF/76VVFW09 5KHdgcJl/qPdR3atpehvI5AX5CuVfrDWIVxdojd5wVXbKoy+6OLkrDLIcipGS+C8w3LA pQU5BiOYpEKSOf4kC82bvLpR4EgmRH+ceAOVJeRJDG2hCIxKwz506TTq+vYAVGtvS0UT mkGw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=n7NM/8eUQsygU2igjXZGvb/jAOzD41ZICP5WHi+Ij0Q=; b=D3aHM4vqO7T+e1qbedh4eid2Z5/AuNSKhxh3qhzJgfFHSTpfb0oB5CxWkDylEa6oVq 1QpzARfU8lCZYe8/ZjHIrvuaUZVFW8C0Kk/tJcU9FlZw6xraqR7W7117Bsw/m0zRdMJ5 AHe4NV4wmCbo2ljyVcNp+AZt4YZzXKM+NKPvQe3jLvpMkg6BWMR2fyIGalYzmmh5aj5f laEU1maHFnEfuPYXGmR9ug5Ube2ky2Yq0AcTSpTQp68lje7jcAnsVWYeTyYLVRlKXe2X HBbrgaIiXNPblFSf2EP/r916AF589/uUDfmVieCfmM/oA3YBPnppFp/fwO8pihybRhoL /Agw== X-Gm-Message-State: AHQUAuY+Uylh4EsFklyydnhqlIE25ohw5QVSwYIhPrh5Ng/YaRa7GyYs kPJsJ3y0boVqDsMYEgDvgbNXonA/8lw= X-Google-Smtp-Source: AHgI3IZ0GXCUMXxiiPiuFhR7Hhay+xE6ssGvZwCQnmnnNbHeovcvCba9mLfSh5DBPmzNglaKKz5FXw== X-Received: by 2002:a1c:c489:: with SMTP id u131mr2400895wmf.127.1550577104062; Tue, 19 Feb 2019 03:51:44 -0800 (PST) Received: from Bfire.plexistor.com ([207.232.55.62]) by smtp.gmail.com with ESMTPSA id t18sm3605830wmt.8.2019.02.19.03.51.42 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Tue, 19 Feb 2019 03:51:43 -0800 (PST) From: Boaz harrosh To: linux-fsdevel , Anna Schumaker , Al Viro Cc: Ric Wheeler , Miklos Szeredi , Steven Whitehouse , Jefff moyer , Amir Goldstein , Amit Golander , Sagi Manole Subject: [RFC PATCH 02/17] zuf: Preliminary Documentation Date: Tue, 19 Feb 2019 13:51:21 +0200 Message-Id: <20190219115136.29952-3-boaz@plexistor.com> X-Mailer: git-send-email 2.20.1 In-Reply-To: <20190219115136.29952-1-boaz@plexistor.com> References: <20190219115136.29952-1-boaz@plexistor.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP From: Boaz Harrosh Adding Documentation/filesystems/zufs.txt [v2] Incorporated Randy's few comments. Randy Please give it an harder review? CC: Randy Dunlap Signed-off-by: Boaz Harrosh --- Documentation/filesystems/zufs.txt | 366 +++++++++++++++++++++++++++++ 1 file changed, 366 insertions(+) create mode 100644 Documentation/filesystems/zufs.txt diff --git a/Documentation/filesystems/zufs.txt b/Documentation/filesystems/zufs.txt new file mode 100644 index 000000000000..01eb7a7e7257 --- /dev/null +++ b/Documentation/filesystems/zufs.txt @@ -0,0 +1,366 @@ +ZUFS - Zero-copy User-mode FileSystem +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Trees: + git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream + git clone https://github.com/NetApp/zufs-zus -b zus-github + +patches, comments, questions, requests to: + boazh@netapp.com + +Introduction: +~~~~~~~~~~~~~ + +ZUFS - stands for Zero-copy User-mode FS +▪ It is geared towards true zero copy end to end of both data and meta data. +▪ It is geared towards very *low latency*, very high CPU locality, lock-less + parallelism. +▪ Synchronous operations +▪ Numa awareness + + ZUFS is a, from scratch, implementation of a filesystem-in-user-space, which +tries to address the above goals. It is aimed for pmem based FSs. But can easily +support any other type of FSs that can utilize x10 latency and parallelism +improvements. + +Glossary and names: +~~~~~~~~~~~~~~~~~~~ + +ZUF - Zero-copy User-mode Feeder + zuf.ko is the Kernel VFS component. Its job is to interface with the Kernel + VFS and dispatch commands to a User-mode application Server. + Uptodate code is found at: + git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream + +ZUS - Zero-copy User-mode Server + zufs utilizes a User-mode server application. That takes care of the detailed + communication protocol and correctness with the Kernel. + In turn it utilizes many zusFS Filesystem plugins to implement the actual + on disc Filesystem. + Uptodate code is found at: + git clone https://github.com/NetApp/zufs-zus -b zus-github + +zusFS - FS plugins + These are .so loadable modules that implement one or more Filesystem-types + (-t xyz). + The zus server communicates with the plugin via a set of function vectors + for the different operations. And establishes communication via defined + structures. + +Filesystem-type: + At startup zus registers with the Kernel one or more Filesystem-type(s) + Associated with the type is a unique type-name (mount -t foofs) + + different info about the fs, like a magic number and so on. + One Server can support many FS-types, in turn each FS-type can mount + multiple super-blocks, each supporting multiple devices. + +Device-Table (DT) - A zufs FS can support multiple devices + ZUF in Kernel may receive, like any mount command a block-device or none. + For the former if the specified FS-types states so in a special field. + The mount will look for a Device table. A list of devices in a specific + order sitting at some offset on the block-device. The system will then + proceed to open and own all these devices and associate them to the mounting + super-block. + If FS-type specifies a -1 at DT_offset then there is no device table + and a DT of a single device is created. (If we have no devices, none + is specified than we operate without any block devices. (Mount options give + some indication of the storage information)) + The device table has special consideration for pmem devices and will + present the all linear array of devices to zus, as one flat mmap space. + Alternatively all non-pmem devices are also provided an interface + with facility of data movement from pmem to slower devices. + A detailed NUMA info is exported to the Server for maximum utilization. + +pmem: + Multiple pmem devices are presented to the server as a single + linear file mmap. Something like /dev/dax. But it is strictly + available only to that specific super-block that owns it. + +dpp_t - Dual port pointer type + At some points in the protocol there are objects that return from zus + (The Server) to the Kernel via a dpp_t. This is a special kind of pointer + It is actually an offset 8 bytes aligned with the 3 low bits specifying + a pool code: [offset = dpp_t & ~0x7] [pool = dpp_t & 0x7] + pool == 0 means the offset is in pmem who's management is by zuf and + a full easy access is provided for zus. + + pool != 0 Is a pre-established file (up to 6 such files per sb) where + the zus has an mmap on the file and the Kernel can access that data + via an offset into the file. + pool == 7 denotes an offset into the application buffers associated + with the current IO. + All dpp_t objects life time rules are strictly defined. + Mainly the primary use of dpp_t is the on-pmem inode structure. Both + zus and zuf can access and change this structure. On any modification + the zus is called so to be notified of any changes, persistence. + More such objects are: Symlinks, xattrs, mmap-data-blocks etc... + +Relay-wait-object: + communication between Kernel and server are done via zus-threads that + sleep in Kernel (inside an IOCTL) and wait for commands. Once received + the IOCTL returns operation id executed and the return info is returned via + a new IOCTL call, which then waits for the next operation. + To wake up the sleeping thread we use a Relay-wait-object. Currently + it is two waitqueue_head(s) back to back. + In future we should investigate the use of that special binder object + that releases its thread time slice to the other thread without going through + the scheduler. + +ZT-threads-array: + The novelty of the zufs is the ZT-threads system. One thread or more is + pre-created for each active core in the system. + ▪ The thread is AFFINITY set for that single core only. + ▪ Special communication file per ZT (O_TMPFILE + IOCTL_ZUFS_INIT) + At initialization the ZT thread communicates through a ZT_INIT ioctl + and registers as the handler of that core (Channel) + ▪ ZT-vma - Mmap 4M vma zero copy communication area per ZT + Pre allocated vma is created into which will be mapped the application + or Kernel buffers for the current operation. + ▪ IOCTL_ZU_WAIT_OPT – threads sleeps in Kernel waiting for an operation + via the IOCTL_ZU_WAIT_OPT call. + + ▪ On an operation dispatch current CPU's ZT is selected, app pages mapped + into the ZT-vma. Server thread released with an operation to execute. + ▪ After execution, ZT returns to kernel (IOCTL_ZU_WAIT_OPT), app is released, + Server wait for new operation on that CPU. + ▪ Each ZT has a cyclic logic. Each call to IOCTL_ZU_WAIT_OPT from Server + returns the results of the previous operation, before going to sleep + waiting to receive a new operation. + zus zuf-zt application + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + ---> IOCTL_ZU_WAIT_OPT if (app-waiting) + | wake-up-application -> return to app + | FS-WAIT + | | <- POSIX call + | V <- fs-wake-up(dispatch) + | <- return with new command + |--<- do_new_operation + +ZUS-mount-thread: + The system utilizes a single mount thread. (This thread is not affinity to any + core). + ▪ It will first Register all FS-types supported by this Server (By calling + all zusFS plugins to register their supported types). Once done + ▪ As above the thread sleeps in Kernel via the IOCTL_ZU_MOUNT call. + ▪ When the Kernel receives a mount request (vfs calles the fs_type->mount opt) + a mount is dispatched back to zus. + ▪ NOTE: That only on very first mount the above ZT-threads-array is created + the same array is then used for all super-blocks in the system + ▪ As part of the mount command in the context of this same mount-thread + a call to IOCTL_ZU_GRAB_PMEM will establish an interface to the pmem + Associated with this super_block + ▪ On return like above a new call to IOCTL_ZU_MOUNT will return info of the + mount before sleeping in kernel waiting for a new dispatch. All SB info + is provided to zuf, including the root inode info. Kernel then proceeds + to complete the mount call. + ▪ NOTE that since there is a single mount thread all FS-registration + super_block and pmem management are lockless. + +Philosophy of operations: +~~~~~~~~~~~~~~~~~~~~~~~~~ + +1. [zuf-root] + +On module load (zuf.ko) A special pseudo FS is mounted on /sys/fs/zuf. This is +called zuf-root. +The zuf-root has no visible files. All communication is done via special-files. +special-files are open(O_TMPFILE) and establish a special role via an +IOCTL. +All communications with the server are done via the zuf-root. Each root owns +many FS-types and each FS-type owns many super-blocks of this type. All Sharing +the same communication channels. +Since all FS-type Servers live in the same zus application address space, at +times. If the administrator wants to separate between different servers, he/she +can mount a new zuf-root and point a new server instance on that new mount, +registering other FS-types on that other instance. The all communication array +will then be duplicated as well. +(Otherwise pointing a new server instance on a busy root will return an error) + +2. [zus server start] + ▪ On load all configured zusFS plugins are loaded. + ▪ The Server starts by starting a single mount thread. + ▪ It than proceeds to register with Kernel all FS-types it will support. + (This is done on the single mount thread, so all FS-registration and + mount/umount operate in a single thread and therefor need not any locks) + ▪ Sleeping in the Kernel on a special-file of that zuf-root. waiting for a mount + command. + +3. [mount -t xyz] + [In Kernel] + ▪ If xyz was registered above as part of the Server startup. the regular + mount command will come to the zuf module with a zuf_mount() call. with + the xyz-FS-info. In turn this points to a zuf-root. + ▪ Code than proceed to load a device-table of devices as specified above. + It then establishes an multi_devices object with a specific pmem_id. + ▪ It proceeds to call mount_bdev. Always with the same main-device + thous fully sporting automatic bind mounts. Even if different + devices are given to the mount command. + ▪ In zuf_fill_super it will then dispatch (awaken) the mount thread + specifying two parameters. One the FS-type to mount, and then + the pmem_id Associated with this super_block. + + [In zus] + ▪ A zus_super_block_info is allocated. + ▪ zus calls PMEM_GRAB(pmem_id) to establish a direct mapping to its + pmem devices. On return we have full access to our PMEM + + ▪ ZT-threads-array + If this is the first mount the all ZT-threads-array is created and + established. The mount thread will wait until all zt-threads finished + initialization and ready to rock. + ▪ Root-zus_inode is loaded and is returned to kernel + ▪ More info about the mount like block sizes and so on are returned to kernel. + + [In Kernel] + The zuf_fill_super is finalized vectors established and we have a new + super_block ready for operations. + +4. An FS operation like create or WRITE/READ and so on arrives from application + via VFS. Eventually an Operation is dispatched to zus: + ▪ A special per-operation descriptor is filled up with all parameters. + ▪ A current CPU channel is grabbed. the operation descriptor is put on + that channel (ZT). Including get_user_pages or Kernel-pages associated + with this OPT. + ▪ The ZT is awaken, app thread put to sleep. + ▪ In ZT context pages are mapped to that ZT-vma. This is so we are sure + the map is only on a single core. And no other core's TLB is affected. + (This here is the all performance secret) + ▪ ZT thread is returned to user-space. + ▪ In ZT context the zus Server calls the appropriate zusFS->operation + vector. Output params filled. + ▪ zus calls again with an IOCTL_ZU_WAIT_OPT with the same descriptor + to return the requested info. + ▪ At Kernel (zuf) the app thread is awaken with the results, and the + ZT thread goes back to sleep waiting a new operation. + + ZT rules: + A ZT thread must not return back to Kernel. One exception is locks + if needed it might sleep waiting for a lock. In which case we will see that + the same CPU channel is reentered via another application and/or thread. + But now that CPU channel is taken. What we do is we utilize a few channels + (ZTs) per core and the threads may grab another channel. But this only + postpones the problem on a busy contended system, all such channels will be + consumed. If all channels are taken the application thread is put on a busy + scheduling wait until a channel can be grabbed. + Therefor Server must not sleep on a ZT. If it needs such a sleeping operation + it will return -EAGAIN to zuf. The app is kept sleeping the operation is put + on an asynchronous Q and the ZT freed for foreground operation. At some point + when the server completes the delayed operation it will complete notify + the Kernel with a special async cookie. And the app will be awakened. + (Here too we utilize pre allocated asyc channels and vmas. If all channels + are busy, application is kept sleeping waiting its free slot turn) + +4. On umount the operation is reversed and all resources are torn down. +5. In case of an application or Server crash, all resources are Associated + with files, on file_release these resources are caught and freed. + +Objects and life-time +~~~~~~~~~~~~~~~~~~~~~ + +Each Kernel object type has an assosiated zus Server object type who's life +time is governed by the life-time of the Kernel object. Therefor the Server's +job is easy because it need not establish any object caches / hashes and so on. + +Inside zus all objects are allocated by the zusFS plugin. So in turn it can +allocate a bigger space for its own private data and access it via the +container_off() coding pattern. So when I say below a zus-object I mean both +zus public part + zusFS private part of the same object. + +All operations return a User-mode pointer that are opaque to the the Kernel +code, they are just a cookie which is returned back to zus, when needed. +At times when we want the Kernel to have direct access to a zus object like +zus_inode, along with the cookie we also return a dpp_t, with a defined +structure. + +Kernel object | zus object | Kernel access (via dpp_t) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +zuf_fs_type + file_system_type | zus_fs_info | no + +zuf_sb_info + super_block | zus_sb_info | no + +zuf_inode_info | | + vfs_inode | zus_inode_info | no + zus_inode * | zus_inode * | yes + synlink * | char-array | yes + xattr** | zus_xattr | yes + +When a Kernel object's time is to die, a final call to zus is +dispatched so the associated object can also be freed. Which means +that on memory pressure when object caches are evicted also the zus +memory resources are freed. + + +How to use zufs: +~~~~~~~~~~~~~~~~ + +The most updated documentation of how to use the latest code bases +is the script (set of scripts) at fs/do-zu/zudo on the zus git tree + +We the developers at Netapp use this script to mount and test our +latest code. So any new Secret will be found in these scripts. Please +read them as the ultimate source of how to operate things. + +TODO: We are looking for exports in system-d and udev to properly +integrate these tools into a destro. + +We assume you cloned these git trees: +[]$ mkdir zufs; cd zufs +[]$ git clone https://github.com/NetApp/zufs-zuf -b zuf-upstream +[]$ git clone https://github.com/NetApp/zufs-zuf -b zus-github + +This will create the following trees +zufs/zus - Source code for Server +zufs/zuf - Linux Kernel source tree to compile and install on your machine + +Also specifically: +zufs/zus/fs/do-zu/zudo - script Documenting how to run things + +[]$ cd zuf + +First time +[] ../zus/fs/do-zu/zudo +this will create a file: + ../zus/fs/do-zu/zu.conf + +Edit this file for your environment. Devices, mount-point and so on. +On first run an example file will be created for you. Fill in the +blanks. Most params can stay as is in most cases + +Now lest start running: + +[1]$ ../zus/fs/do-zu/zudo mkfs +This will run the proper mkfs command selected at zu.conf file +with the proper devices. + +[2]$ ../zus/fs/do-zu/zudo zuf-insmod +This loads the zuf.ko module + +[3]$ ../zus/fs/do-zu/zudo zuf-root +This mounts the zuf-root FS above on /sys/fs/zuf (automatically created above) + +[4]$ ../zus/fs/do-zu/zudo zus-up +This runs the zus daemon in the background + +[5]$ ../zus/fs/do-zu/zudo mount +This mount the mkfs FS above on the specified dir in zu.conf + +To run all the 5 commands above at once do: +[]$ ../zus/fs/do-zu/zudo up + +To undo all the above in reverse order do: +[]$ ../zus/fs/do-zu/zudo down + +And the most magic command is: +[]$ ../zus/fs/do-zu/zudo again +Will do a "down", then update-mods, then "up" +(update-mods is a special script to copy the latest compiled binaries) + +Now you are ready for some: +[]$ ../zus/fs/do-zu/zudo xfstest +xfstests is assumed to be installed in the regular /opt/xfstests dir + +Again please see inside the scripts what each command does +these scripts are the ultimate Documentation, do not believe +anything I'm saying here. (Because it is outdated by now)